Why are Benchling-generated canonical SMILES different from other SMILES?

Diamond

Issue

The canonical SMILES string that Benchling generates for a molecule differs from the SMILES string that other tools generate for the same molecule.

Environment

Molecular Biology application

Cause

Benchling stores three representations of a molecule: the original, user-supplied molecule format; a molfile for rendering chemical structures; and a canonicalized SMILES string generated with RDKit-provided algorithms. The canonicalized SMILES string is generated from either the original SMILES string or the original molfile provided by the user.
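As a rough illustration of why the stored string can differ from the one that was supplied, the sketch below uses RDKit's default canonicalization. It assumes RDKit is installed; the helper names `canonical_from_smiles` and `canonical_from_molblock` are hypothetical, and the code is not Benchling's actual pipeline.

```python
from rdkit import Chem


def canonical_from_smiles(smiles: str) -> str:
    """Parse a SMILES string and return RDKit's canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


def canonical_from_molblock(molblock: str) -> str:
    """Parse molfile text and return RDKit's canonical SMILES."""
    mol = Chem.MolFromMolBlock(molblock)
    if mol is None:
        raise ValueError("could not parse molfile")
    return Chem.MolToSmiles(mol)


# Two different but equivalent ways of writing ethanol.
first = canonical_from_smiles("OCC")
second = canonical_from_smiles("C(O)C")

# The same molecule written out as a molfile and read back in.
molblock = Chem.MolToMolBlock(Chem.MolFromSmiles("OCC"))
third = canonical_from_molblock(molblock)

# All three inputs collapse to a single canonical string.
print(first, second, third)
```

Canonical SMILES is only canonical within one toolkit and configuration, so a string produced by another program can describe the same molecule and still look different.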

When canonicalizing the SMILES string, the following are considered:

  • Add enhanced stereochemistry - This step adds explicit stereochemistry to the canonicalized SMILES string.
  • Remove explicit hydrogens - This step removes any explicit hydrogens and converts them to implicit hydrogens (hydrogens that do not appear as separate atoms in the structure, or that are written together with the atom they are attached to).
  • Strip salts - This step accounts only for salts that are stripped from the canonicalized SMILES string.
  • Dearomatize/Kekulize - The Kekulization step replaces aromatic bonds with alternating single and double bonds.

Other canonicalization steps that may or may not apply to SMILES input include:

  • Remove atom maps
  • Remove atom label properties
  • Remove conformers
  • Remove atom valence properties
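Several of the steps above correspond to standard RDKit calls. The following is a minimal sketch, assuming RDKit is installed; the function name `clean_for_smiles` and the order of the steps are assumptions for illustration, not Benchling's implementation. It covers explicit hydrogens, atom maps, conformers, and Kekulization only.

```python
from rdkit import Chem
from rdkit.Chem import rdDepictor


def clean_for_smiles(mol: Chem.Mol) -> Chem.Mol:
    """Return a cleaned copy of mol; the input molecule is left unchanged."""
    # Explicit hydrogen atoms become implicit hydrogen counts.
    cleaned = Chem.RemoveHs(mol)
    # An atom map number of 0 means "no atom map".
    for atom in cleaned.GetAtoms():
        atom.SetAtomMapNum(0)
    # Drop any stored 2D or 3D coordinates.
    cleaned.RemoveAllConformers()
    # Replace aromatic bonds with alternating single and double bonds.
    Chem.Kekulize(cleaned, clearAromaticFlags=True)
    return cleaned


# Toluene with an atom map, explicit hydrogens, and 2D coordinates.
mol = Chem.AddHs(Chem.MolFromSmiles("[CH3:1]c1ccccc1"))
rdDepictor.Compute2DCoords(mol)

cleaned = clean_for_smiles(mol)
result = Chem.MolToSmiles(cleaned, kekuleSmiles=True)
print(result)
```

The output contains no atom map, no explicit hydrogens, and no lowercase aromatic atoms, which is one reason a canonicalized string can look unlike the input it came from.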

Salts that may be included in a chemical structure are currently stripped during the chemical structure canonicalization process.

Note: Currently, Benchling keeps the largest fragment (based on atom count) of the original molecule as the parent molecule and treats all other fragments as salts. When salts are detected during bulk import, they can be saved in schema and custom fields.
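The largest-fragment rule in the note can be sketched with RDKit as follows. This is an approximation under stated assumptions: the function name `split_parent_and_salts` is hypothetical, atoms are counted with hydrogens left implicit, and ties between equally sized fragments are not handled in any particular way.

```python
from rdkit import Chem


def split_parent_and_salts(smiles: str) -> tuple[str, list[str]]:
    """Return (parent SMILES, list of salt SMILES) for a multi-fragment input."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # Each disconnected fragment becomes its own molecule.
    fragments = Chem.GetMolFrags(mol, asMols=True)
    # Assumption: "atom count" means atoms in the graph, with implicit hydrogens excluded.
    parent = max(fragments, key=lambda frag: frag.GetNumAtoms())
    salts = [frag for frag in fragments if frag is not parent]
    return Chem.MolToSmiles(parent), [Chem.MolToSmiles(frag) for frag in salts]


# Sodium acetate: the acetate fragment is kept, the sodium ion is treated as a salt.
parent, salts = split_parent_and_salts("CC(=O)[O-].[Na+]")
print(parent, salts)
```

Under this rule, a SMILES string that includes a counterion produces a canonicalized string with the counterion missing.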
