Benchling provides a data warehouse as a service that customers can use to connect third-party tools to their data in Benchling. Common tools include Tableau, Power BI, Spotfire, and GraphPad Prism.
You can read more about how the warehouse works here: https://docs.benchling.com/docs/getting-started
Some customers choose to pull data from the warehouse into downstream systems. This article walks through the basics of how that works, though the Benchling warehouse is not the only way to do this.
Ingest data from Benchling
There are two ways to programmatically pull data from Benchling.
Pull data via the Benchling REST APIs.
Pull data via the Benchling data warehouse.
The Benchling REST API has a few key benefits:
Data from the REST API is live.
The REST API allows for integrations to "writeback" into Benchling.
However, some customers prefer more flexibility to transform and query their data. In that case, the warehouse has the following properties:
Data in the warehouse is synced from the live Benchling application. The sync usually completes within a few minutes, but it can take up to a few hours.
Data from the warehouse can only be read, not written.
How do I set up a SQL-based ingestion into a downstream system?
Once you've set up your warehouse connection (https://help.benchling.com/en/articles/2453412-access-your-data-warehouse-enterprise), you can use a tool like Fivetran or Debezium to pull data into your downstream system. There are two possible approaches:
Regular snapshots - in cases where the amount of data in Benchling isn't that large, you can simply query all of the data in the warehouse and pull it on a regular schedule (say, once a week). In practice, these snapshots look like this:
SELECT * FROM registry_entity;
SELECT * FROM entry$raw;
SELECT * FROM container$raw;
Configuring these queries in your ingestion tool of choice may look different. The main advantage of snapshotting is that the ingestion logic is straightforward: you pull all of the data at some interval.
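A snapshot pull can be sketched in a few lines. The example below is a minimal illustration, not a production integration: it uses Python's sqlite3 module as a stand-in for a real Postgres connection to the warehouse (in practice you'd use a driver such as psycopg2), and the container table and its columns are simplified for the example.

```python
import csv
import io
import sqlite3

# Stand-in for the warehouse connection; a real integration would open a
# Postgres connection to the Benchling warehouse instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE container (id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO container VALUES (?, ?)",
    [("con_1", "Tube A"), ("con_2", "Tube B")],
)

def snapshot(conn, table):
    """Pull every row from a warehouse table and render it as CSV text."""
    # Table names here come from a trusted, hard-coded list (they can't be
    # bound as query parameters).
    cursor = conn.execute(f"SELECT * FROM {table}")
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor.fetchall())
    return out.getvalue()

print(snapshot(conn, "container"))
```

A scheduler (cron, Airflow, etc.) would run this on whatever cadence you choose and load the CSV into the downstream system.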
Incremental pulls - in cases where the amount of data in Benchling is large (or the ETL is too slow to snapshot, such as queries with many joins), customers can run queries based on what has changed since the last pull. There is no universal column for these incremental calculations, but most of our tables have a modified_at timestamp that can be used for this purpose.
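An incremental pull keeps a high-water mark and only fetches rows whose modified_at is newer. A minimal sketch, again using sqlite3 as a stand-in for the warehouse connection, with an illustrative container table (the modified_at column is real, per the note above; the other columns are simplified):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE container (id TEXT, name TEXT, modified_at TEXT)")
conn.executemany(
    "INSERT INTO container VALUES (?, ?, ?)",
    [
        ("con_1", "Tube A", "2024-01-01T00:00:00"),
        ("con_2", "Tube B", "2024-01-05T00:00:00"),
    ],
)

def incremental_pull(conn, table, last_pull):
    """Fetch only the rows modified since the previous pull.

    The caller should persist the max modified_at seen in this batch and
    pass it as last_pull on the next run.
    """
    return conn.execute(
        f"SELECT * FROM {table} WHERE modified_at > ? ORDER BY modified_at",
        (last_pull,),
    ).fetchall()

# Only con_2 was modified after this watermark.
changed = incremental_pull(conn, "container", "2024-01-02T00:00:00")
```

ISO 8601 timestamps compare correctly as strings, which keeps the example simple; with a real Postgres connection you would compare timestamp values directly.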
Some data ingestion tools prefer to be pointed at the tables of interest and automatically calculate which changes to propagate to the downstream system. Unfortunately, those systems often don't work well with Benchling, primarily because of the way we make updates: rather than updating a row in place, we insert a new row with the same id and delete the previous row.
As a consequence, tools like Fivetran (using something like xmin replication) will often duplicate data in downstream systems: when Benchling replaces a row, the downstream system can end up with both the old and the new version of that row, each tagged with an insert_time produced by the tool.
The key is to deduplicate the rows based on id - there should only ever be one row per value of id. Most ingestion tools add an "insert_time" column, which can be used to filter out the old rows.
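The deduplication logic can be sketched as follows. The rows and their columns are illustrative; the only assumption carried over from the text is that the ingestion tool tags each row with an insert_time.

```python
# Rows as they might land in the downstream system: two versions of seq_1
# (Benchling's insert-new-then-delete-old update pattern), each with an
# insert_time added by the ingestion tool.
rows = [
    ("seq_1", "Old name", "2024-01-01T00:00:00"),
    ("seq_1", "New name", "2024-01-02T00:00:00"),
    ("seq_2", "Other",    "2024-01-01T12:00:00"),
]

def deduplicate(rows):
    """Keep only the most recently inserted row for each id."""
    latest = {}
    for row in rows:
        row_id, _, insert_time = row
        if row_id not in latest or insert_time > latest[row_id][2]:
            latest[row_id] = row
    return sorted(latest.values())

deduped = deduplicate(rows)
```

In SQL-based downstream systems the same idea is typically expressed with a window function, e.g. ROW_NUMBER() OVER (PARTITION BY id ORDER BY insert_time DESC) and keeping row number 1.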
What about write-ahead-log based replication?
Unfortunately, WAL replication is not supported in the Benchling warehouse today - access to the WAL requires the ability to write to specific tables in Postgres, and we don't support write use cases. However, if you need WAL-based ingestion from the warehouse, please get in touch! We'd love to learn more about your use case and add features to this service where relevant.