Introduction
Datasets are tabular, semi-structured data objects that are searchable and live in folders rather than within schemas. This makes Datasets a way to capture structured data ready for AI/ML models while remaining flexible about how the data is used in Benchling.
Datasets can be used as a fast and flexible way to track intermediate data from an instrument, a list of limits or parameters for a particular assay, or temporary storage for well roles. Datasets are often intermediate pieces of a larger data analysis pipeline.
There are three main ways that you can create a Dataset:
- Through Analysis
- Through the API
- Through Connect Runs
Creating Datasets
After you create Datasets, you can filter Global Search by the type Dataset to find a Dataset that has been created. You can then apply additional filters (such as when it was created, who created it, etc.) that help you better understand data traceability.
From Analysis
Datasets can be data inputs and/or outputs to an Analysis. Below are examples of both import and export:
To ingest a previously created Dataset into Analysis:
- Click the + icon to add a new Analysis table
- From the list of table creation options, click From dataset
- Use the text box to search for your source dataset
- Define the table name, then click the Add table button
Once you have completed an analysis, you can export a Dataset as an output as follows:
- Select the table that contains the data you want to use to create an output
- Click Create outputs in the top right corner
- Use the checkboxes to select the items to associate with an output
- Define the folder where the Dataset will be located
- Define an existing Notebook entry that the Dataset will be associated with
- Click Create outputs at the bottom of the modal
These outputs will appear under Outputs for the selected Analysis in the left toolbar.
From API
Datasets can also be created with the Benchling API; for more information, see our API reference.
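As a rough sketch of what an API call might involve, the snippet below assembles a hypothetical "create dataset" request in Python. The endpoint path, payload field names, and tenant URL are illustrative assumptions, not the documented Benchling API contract; consult the API reference for the real endpoints and schema.

```python
import json

# Illustrative sketch only: the endpoint path and payload fields below are
# assumptions, not the documented Benchling API schema. Check the API
# reference for the real contract before using.
BASE_URL = "https://example.benchling.com"  # placeholder for your tenant URL


def build_create_dataset_request(api_key: str, name: str, folder_id: str) -> dict:
    """Assemble the pieces of a hypothetical 'create dataset' API call."""
    return {
        "url": f"{BASE_URL}/api/v2/datasets",     # hypothetical path
        "headers": {
            "Authorization": f"Basic {api_key}",  # Benchling uses API-key auth
            "Content-Type": "application/json",
        },
        "body": json.dumps({"name": name, "folderId": folder_id}),
    }


req = build_create_dataset_request("sk_test", "assay-limits", "lib_abc123")
# To actually send it, you might use:
#   requests.post(req["url"], headers=req["headers"], data=req["body"])
print(req["url"])
```

Separating request construction from sending makes the payload easy to inspect and test before hitting a live tenant.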
From Connect
When configuring an Output file in a run schema, the “Record dataset” option (under Benchling action) allows for the Output file to be processed and recorded as Datasets. Using a Dataset eliminates the need for schema configuration, such as registry or result schemas. Datasets are recommended when the output data will serve as an input for subsequent analysis.
When a run is executed using the "Record dataset" option to process an output file, the processed data is stored and displayed in a tabular format and as a new Dataset object, as illustrated below:
This Dataset can be sent directly to either a new analysis or an existing analysis.
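As a mental model for what the "Record dataset" action does, it effectively turns an instrument output file into rows and columns. The sketch below is not Benchling's implementation, just an illustration of parsing a small instrument-style CSV into tabular records:

```python
import csv
import io

# Rough mental model only: this is not Benchling's "Record dataset"
# implementation, just an illustration of turning an instrument output
# file into tabular records.
raw_output = """well,role,od600
A1,sample,0.42
A2,blank,0.05
"""

reader = csv.DictReader(io.StringIO(raw_output))
rows = list(reader)  # each row becomes one record in the tabular Dataset
print(rows[0]["well"], rows[0]["od600"])
```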
Organizing Datasets
Datasets & Studies
To tag a Study within an Analysis:
1. Click on the info icon in the upper right corner of the Analysis to view the metadata
2. Click on the edit icon under Studies to add or edit the Study that the Analysis is tagged with
3. Use the text box to search for the relevant Study then click on the check mark icon to save the tag
This will allow the Datasets associated with an Analysis to appear within the Study Items, where all Datasets and Analyses linked with the Study will appear. When you are done viewing or updating metadata in this modal, click the Save button.
Note: Datasets are not the same as results tables, registry tables, or unstructured tables within the notebook.
This is the Dataset workspace. Datasets can be optionally tagged with a Study, or they can have Custom Fields to tag with custom metadata.
What is a Dataset?
Datasets are Benchling objects that capture non-schematized tabular data stored in S3 and are used for analysis.
Datasets are a simpler representation of tabular data within Benchling. They do not need to be schematized or associated with any other specific objects. You can think of a Dataset as a CSV file within Benchling: flexible and fast.
There are two types of Benchling objects called datasets: Intermediate Datasets and Published Datasets. An Intermediate Dataset is part of an analysis and has not yet been published, while a Published Dataset is created when an output of an analysis is defined, as shown below.
Details on how each type of Dataset is used are below.
Intermediate Datasets (also known as data frames in the API)
Intermediate Datasets are transient steps taken during an analysis and can be created:
- through the API
- through an Analysis
An Intermediate Dataset can be made into a Published Dataset once finalized, as shown below
An Intermediate Dataset can be finalized by creating “outputs” as shown below. These outputs will be saved as a formalized dataset once certain data (e.g., entities) in the view are saved.
Shown here are Intermediate Datasets. These are available via the API and as part of an analysis, but do not appear in Global Search, to avoid the noise this would create during analysis exploration. Once published, they can be found via Global Search.
Published Datasets
Like Intermediate Datasets, Published Datasets are available through the API. However, Published Datasets are also searchable and have permissions. They are used to represent transformed data.
When an “output” of the analysis has been saved, as shown above, the Analysis publishes the Dataset, making it searchable via Global Search and the API.
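The Intermediate-to-Published transition can be sketched as a tiny state model. This is purely illustrative; these class and attribute names are not Benchling API objects:

```python
# Illustrative state model of the Dataset lifecycle described above;
# these names are not Benchling API objects.
class Dataset:
    def __init__(self, name: str):
        self.name = name
        self.published = False   # Intermediate: API-visible, not in Global Search

    def publish(self) -> None:
        """Saving an analysis 'output' promotes the Dataset to Published."""
        self.published = True    # now searchable and governed by permissions


ds = Dataset("plate_reader_run_1")
assert not ds.published          # Intermediate Dataset
ds.publish()                     # analysis output saved
print(ds.published)              # Published Dataset: searchable via Global Search
```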
Files & Datasets tagged with a Study also appear as Study Items, making Studies a package for data.
To tag a Study within an Analysis, open the metadata panel in the upper right corner of the Analysis, as shown below, and specify the Study.
This will allow the Published Datasets associated with the Analysis to appear within the Study Items, and all Datasets and Analyses linked with the Study will then appear in the Study metadata