How do I query over multi-select JSONB arrays like ["seq_D3fZa648", "seq_bVWwI73e"]?

Aarthi
Aarthi
  • Updated

Data from multi-select fields are stored in JSONB arrays in the Benchling warehouse, which in Insights automatically get displayed as the blue clickable chips. There are specific JSONB operators and functions to work with these arrays in your postgreSQL query. This article will show you some examples.

Combine tables with JOIN

If you have two tables with matching text columns, you can easily combine these tables using:

SELECT * FROM table1
JOIN table2 ON table1.column1 = table2.column2

However, the same JOIN ... ON column1 = column2 syntax does not work when one of the columns is a JSONB array, and you will get an error: operator does not exist: jsonb = character varying.

We will use an example to show how to JOIN tables based on JSONB arrays. We will join data from a Cell Line schema with data from the Plasmid schema. These schemas are linked via a multi-select entity link, and thus is the link from the cell_line to the plasmid table stored in a JSONB array. Here is our example cell_line data:

id name$ plasmids
bfi_6mMswng0 Cell-001 ["seq_D3fZa648", "seq_bVWwI73e"]
bfi_jyMlI3nF Cell-002 NULL

Option 1: use the ? operator

JSONB arrays have specific operators to compare arrays and values. The ? operator checks if a value is part of the array, which you can use to JOIN tables ON jsonb_column1 ? column2.

SELECT 
cell_line.name$ AS cell_line_name,
plasmid.name$ AS plasmid_name
FROM cell_line
LEFT JOIN plasmid ON cell_line.plasmids ? plasmid.id

# Output
cell_line_name | plasmid_name
Cell-001 | pBR322
Cell-001 | pX330
Cell-002 | NULL

Note that since Cell-001 has 2 plasmids associated, it has 2 rows, one for each plasmid, in the output.

To aggregate the multiple rows per cell line back into one row, you can use GROUP BY cell_line.id to indicate that all values per Cell Line should be combined into one row. For columns that are not part of the GROUP BY clause, you have to use aggregation functions to specify how these values should be combined into a single row:

  • STRING_AGG(column, separator) combines text values with a separator
  • JSONB_AGG(column) combines IDs or text values into a new JSONB array
    (in which entity and inventory item IDs will be rendered as clickable chips in Insights).
# SQL query
SELECT
cell_line.name$ AS cell_line_name,
STRING_AGG(plasmid.name$, ', ') AS plasmid_names,
JSONB_AGG(plasmid.gene) AS linked_genes,
FROM cell_line
LEFT JOIN plasmid ON cell_line.plasmids ? plasmid.id
GROUP BY cell_line.id, cell_line_name

# Output
cell_line_name | plasmid_names | linked_genes
Cell-001 | pBR322, pX330 | ["seq_R9aBs84a", "seq_OvWx49a"]
Cell-002 | NULL | []

Option 2: unpack the JSONB array

Alternatively, you can unpack the JSONB array into single values, each on a separate row, using jsonb_array_elements_text(jsonb_column). After you've unpacked the array, you can use these unpacked text values to join other tables with JOIN ... ON unpacked_column1 = column2. You can also use this in the SELECT clause to unpack elements into separate rows without joining any other tables.

SELECT 
cell_line.name$ AS cell_line_name,
plasmid.id AS plasmid_id,
plasmid.name$ AS plasmid_name
FROM cell_line
LEFT JOIN jsonb_array_elements_text(cell_line.plasmids) AS cell_line_plasmid_id
ON TRUE
LEFT JOIN plasmid ON cell_line_plasmid_id = plasmid.id

# Output
cell_line_name | plasmid_id | plasmid_name
Cell-001 | seq_D3fZa648 | pBR322
Cell-001 | seq_bVWwI73e | pX330
Cell-002 | NULL | NULL

In this example, jsonb_array_elements_text thus converts the plasmid column into a new table that has 1 row per element (seq_D3fZa648 and seq_bVWwI73e).

Next, LEFT JOIN ... ON TRUE joins that table to our existing cell_line table. We specifically recommend a LEFT JOIN instead of (INNER) JOIN so cell lines that don’t have any plasmids, like Cell-002, are still included in the output. Otherwise, they would be filtered out.

Note that since Cell-001 has 2 plasmids associated, it has 2 rows, one for each plasmid, in the output. See Option 1 above on how to use GROUP BY to combine these rows into a single row per cell line again.

Select a single JSONB element

You can select a single element from a JSONB array using column->>[index, starting at 0]. For example, you can get the first element of the array with column->>0, selecting seq_D3fZa648 from the array ["seq_D3fZa648", "seq_bVWwI73e"].

Computed fields are often multi-select, even if your data model only permits it to contain a single value. You can use ->>0 in SELECT column->>0 FROM ... statements to select the first text value, or use it to join tables on this single value: JOIN ... ON jsonb_column1->>0 = column2.

Count the number of JSONB elements

You can usejsonb_array_length(jsonb_column) to count how many elements are present in a JSONB array:

# Example cell_line table
id | name$ | plasmids
bfi_6mMswng0 | Cell-001 | ["seq_D3fZa648", "seq_bVWwI73e"]
bfi_jyMlI3nF | Cell-002 | []

# SQL query
SELECT
cell_line.name$ AS cell_line_name,
jsonb_array_length(cell_line.plasmids) AS plasmid_count
FROM cell_line

# Output
cell_line_name | plasmid_count
Cell-001 | 2
Cell-002 | 0

 

Additional resources

https://www.postgresql.org/docs/12/functions-json.html
https://riptutorial.com/postgresql/example/5195/using-jsonb-operators
https://www.postgresqltutorial.com/postgresql-json/

Was this article helpful?

Have more questions? Submit a request