Thursday, 2 June 2011

Chemical similarity

Currently pondering how to quantify chemical similarity, based on the SMILES or InChI strings representing any two given chemicals / metabolites.

I've had a little play with the rather good Chemistry Development Kit (CDK), developed by Egon Willighagen and Christoph Steinbeck, and have hit a bit of a brick wall.

Consider the following reaction:


In this case, I'd like to quantify the similarity between glucose and glucose 6-phosphate.

My first approach was to use the CDK's ExtendedFingerprinter class to generate a BitSet representation of each molecule, and then generate a Tanimoto coefficient which should quantify similarity:


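import java.util.BitSet;
import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.fingerprint.ExtendedFingerprinter;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.similarity.Tanimoto;
import org.openscience.cdk.smiles.SmilesParser;

// SMILES strings for glucose and glucose 6-phosphate
String smiles1 = "OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O";
String smiles2 = "O[C@H]1O[C@H](COP(O)(O)=O)[C@@H](O)[C@H](O)[C@H]1O";

// Parse each SMILES string into a molecule
// (assumption: the parser is built with the default chem object builder)
SmilesParser smilesParser = new SmilesParser( DefaultChemObjectBuilder.getInstance() );
IAtomContainer molecule1 = smilesParser.parseSmiles( smiles1 );
IAtomContainer molecule2 = smilesParser.parseSmiles( smiles2 );

// Fingerprint each molecule, then compare the two bit sets
ExtendedFingerprinter ef = new ExtendedFingerprinter();
BitSet fingerprint1 = ef.getFingerprint( molecule1 );
BitSet fingerprint2 = ef.getFingerprint( molecule2 );
float tanimotoCoefficient = Tanimoto.calculate( fingerprint1, fingerprint2 );
System.out.println( tanimotoCoefficient );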


In this case, a "similarity score" of 0.475 is calculated, which intuitively I find a little low, given how much of the structure is shared.

I'm assuming (using a sequence analysis analogy) that the Tanimoto coefficient is a measure of global similarity. What I'm after here, I suppose, is a measure of local similarity - effectively something that can quantify the fact that we can pretty much map the entire structure of glucose onto that of glucose 6-phosphate.
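To make the global versus local distinction a bit more concrete, here is a minimal sketch in plain Java (not CDK code). The class and method names are made up for illustration, and the bit positions are arbitrary stand-ins rather than real fingerprints. It compares the symmetric Tanimoto coefficient with an asymmetric "containment" score, which asks only what fraction of the smaller molecule's bits also appear in the larger one. Whether real CDK fingerprints for glucose and glucose 6-phosphate behave this cleanly is an assumption I haven't checked.

```java
import java.util.BitSet;

class FingerprintOverlap {

    // Tanimoto: bits set in both fingerprints divided by bits set in either.
    static double tanimoto(BitSet a, BitSet b) {
        BitSet shared = (BitSet) a.clone();
        shared.and(b);
        BitSet either = (BitSet) a.clone();
        either.or(b);
        return either.cardinality() == 0
                ? 0.0
                : (double) shared.cardinality() / either.cardinality();
    }

    // Containment: fraction of the query's bits that are also set in the target.
    // Unlike Tanimoto, extra bits in the target are not penalised.
    static double containment(BitSet query, BitSet target) {
        BitSet shared = (BitSet) query.clone();
        shared.and(target);
        return query.cardinality() == 0
                ? 0.0
                : (double) shared.cardinality() / query.cardinality();
    }

    // Hypothetical helper: build a BitSet from a list of set positions.
    static BitSet bits(int... positions) {
        BitSet set = new BitSet();
        for (int p : positions) {
            set.set(p);
        }
        return set;
    }

    public static void main(String[] args) {
        // Arbitrary toy fingerprints: every bit of "small" is also in "large".
        BitSet small = bits(1, 4, 7, 9);
        BitSet large = bits(1, 4, 7, 9, 12, 15, 20, 23);

        System.out.println(tanimoto(small, large));    // 4 shared / 8 in union = 0.5
        System.out.println(containment(small, large)); // 4 shared / 4 in query = 1.0
    }
}
```

The asymmetric score is closer in spirit to a local alignment: it rewards the fact that one structure sits entirely inside the other, at the cost of no longer being a symmetric similarity.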

I understand that elsewhere in the CDK, the UniversalIsomorphismTester class may be more appropriate. Does anyone have any experience of this?

Any help would be appreciated!

4 comments:

  1. Hi Neil, I agree that the similarity by fingerprint is a bit low. I would also expect something larger. Have you tried stripping away the H's? They sometimes lead to weird effects.
    The UniversalIsomorphismTester provides a method for maximum common substructure detection which in this case would give you some simple measure. Have you looked at http://www.jcheminf.com/content/1/1/12 ?
    Asad in our institute has looked into your problem and his method was highly regarded.
    Cheers, Chris

  2. Mapping of molecules in reactions like this is the playing field of SMSD. Check this paper:

    http://www.jcheminf.com/content/1/1/12

    Code is in the org.openscience.cdk.smsd package:

    http://pele.farmbio.uu.se/nightly/cdk-javadoc-1.5.0.git/

  3. Thanks, guys. Knew that you wouldn't let me down! I'll take a look at the paper.

  4. The problem here is that path-based fingerprints (in general, though not always) do not encode multiple occurrences of the same path. Glucose has multiple occurrences of the same paths, and so has fewer bits set than expected for a molecule of that size. When you add the P, a whole host of new paths become available, which causes an unexpectedly large decrease in the Tanimoto coefficient.

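The point made in the last comment can be illustrated with a toy sketch in plain Java. This is not how the CDK builds its fingerprints: it assumes a linear chain of atoms written as a string, reads paths in one direction only, and stores each distinct path directly in a set instead of hashing it into a bit. The class and method names are hypothetical. A repetitive chain of six carbons contains 21 paths but only 6 distinct ones, whereas appending a single P introduces 7 new distinct paths, so the Tanimoto coefficient over the path sets drops to 6/13.

```java
import java.util.HashSet;
import java.util.Set;

class PathFingerprintToy {

    // Collect the distinct atom-label sequences of every contiguous path
    // in a linear chain, read left to right. Repeated paths collapse into
    // a single entry, mimicking a fingerprint that ignores path counts.
    static Set<String> distinctPaths(String chain) {
        Set<String> paths = new HashSet<>();
        for (int start = 0; start < chain.length(); start++) {
            for (int end = start + 1; end <= chain.length(); end++) {
                paths.add(chain.substring(start, end));
            }
        }
        return paths;
    }

    // Tanimoto over sets: shared paths divided by paths present in either set.
    static double tanimoto(Set<String> a, Set<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        Set<String> either = new HashSet<>(a);
        either.addAll(b);
        return either.isEmpty() ? 0.0 : (double) shared.size() / either.size();
    }

    public static void main(String[] args) {
        Set<String> carbons = distinctPaths("CCCCCC");   // 6 distinct paths
        Set<String> withP = distinctPaths("CCCCCCP");    // 13 distinct paths

        System.out.println(carbons.size());
        System.out.println(withP.size());
        System.out.println(tanimoto(carbons, withP));    // 6 shared / 13 in union
    }
}
```

One extra atom more than doubles the number of distinct paths here, which is the same effect, in miniature, as adding the phosphate to glucose.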