chem-bla-ics

Biology, ACPs, lipids, cheminformatics, and Dagstuhl

2022-08-01T00:00:00+00:00

Three months ago already, I visited Dagstuhl for the second time. The weather was much better than in the January right before the start of the pandemic. The first time, I attended the Computational Metabolomics meeting with the focus From Cheminformatics to Machine Learning, and one of the things we concerned ourselves with was how to do computation with compound classes (see Section 3.6 and this online book). We know how to handle SMILES and we know how to do substructure searching with SMARTS, but what if you have compound classes or lipid classes? Biology is a greasy business.

From a WikiPathways perspective there is additional complexity, with modified proteins involved in lipid metabolism: the acyl-carrier proteins (ACPs). They look like this, and the R group is a protein:

We have quite a few of them in WikiPathways and they also show up in ChEBI (and likely Reactome), LIPID MAPS, and KEGG.

During this year's Dagstuhl we used one session to continue working on it (report pending). One of the results is that Wikidata (see doi:10.7554/eLife.52614 and doi:10.7554/eLife.70780) now has a property for CXSMILES. CDK 2.0 (doi:10.1186/s13321-017-0220-4) already supported CXSMILES, and the above image was actually created with CDK Depict (thx to John!).
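To give an idea of what CXSMILES adds on top of plain SMILES, here is a minimal sketch in Python. It only does string handling: it splits off the extension block between the `|` characters and reads the per-atom labels from a `$...$` field. The example string is a hypothetical, simplified acyl thioester with an R group on the `*` atom, not the actual value stored in Wikidata, and the function names `split_cxsmiles` and `atom_labels` are made up for this illustration. A real toolkit such as the CDK handles many more extension fields than this.

```python
def split_cxsmiles(cxsmiles):
    """Split a CXSMILES string into (SMILES core, extension content).

    The extension is assumed to be a single block wrapped in '|' characters
    that follows the SMILES after one space. If no such block is found,
    the extension is returned as an empty string.
    """
    smiles, _, rest = cxsmiles.strip().partition(" ")
    rest = rest.strip()
    if len(rest) >= 2 and rest.startswith("|") and rest.endswith("|"):
        return smiles, rest[1:-1]
    return smiles, ""


def atom_labels(extension):
    """Return the per-atom labels of a '$...$' field, or [] if there is none.

    Labels are separated by ';', one entry per atom in SMILES order.
    Unlabeled atoms give an empty string.
    """
    start = extension.find("$")
    if start == -1:
        return []
    end = extension.find("$", start + 1)
    if end == -1:
        return []
    return extension[start + 1:end].split(";")


# Hypothetical example: five atoms (C, C, O, S, *), only the last is labeled.
smiles, extension = split_cxsmiles("CC(=O)S* |$;;;;R$|")
labels = atom_labels(extension)
```

Here `labels[4]` is the label attached to the `*` atom, which is where the protein would sit in an ACP.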

So, that means I can now start adding all those ACPs to Wikidata :) Here’s hexadecanoyl-[acp] (or this Scholia page):

CDK Workshop - Day #2

2007-01-30T00:00:00+00:00

Because of other obligations, I was unable to attend the first day of the CDK Workshop, though Christoph had set up Skype so that at least I could hear the talks from Prof. Berthold (Konstanz, Germany) about KNIME and Prof. Zielesny about CDK-Taverna.

Today, Miguel Rojas and Stefan Kuhn discussed their research. Miguel showed the state of mass spectrum prediction using the CDK and the MEDEA plugin for Bioclipse. Stefan demonstrated the NMRShiftDB and a new lab system for NMR experiment scheduling and management that is based on it. Dr. Ott (Nijmegen, Netherlands) showed the BioMeta Database, which contains metabolite and reaction information derived from KEGG but fixes a set of chemical problems in the latter (see also the article, DOI:10.1186/1471-2105-7-517).

The afternoons of CDK workshops traditionally have discussion sessions and hackathons. Two groups were formed: one consisted of the KNIME guys who, together with Miguel and Federico, focused on QSAR descriptor calculations in KNIME, while Stefan, Martin and I looked at the fingerprinter peculiarities that Martin found (see also this CDK News article), and came up with a possible further performance improvement of the AllRingsFinder. Because one class of molecules that is causing trouble consists of two ring systems connected by a long linker, like Choloyl-CoA (below), we anticipate that splitting the molecule up into ring systems prior to using the SSSR algorithm should speed up the complete all-ring finding process.

Currently, the spanning tree is calculated before deciding whether to use the SSSR finder. We think this spanning tree can be used to partition the molecule into separate ring systems, so that the further steps of the ring search can then be applied to each of them separately.
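The partitioning idea can be illustrated with a small, toolkit-independent sketch in Python. Note that this is not the CDK code and it does not use the spanning tree: it swaps in standard bridge detection (a depth-first search) to find the bonds that are not part of any ring, and then takes the connected components of the remaining ring bonds as the ring systems. The function name `ring_systems` and the atom numbering are assumptions made for this example.

```python
from collections import defaultdict


def ring_systems(n_atoms, bonds):
    """Partition a molecular graph into ring systems.

    A bond is a ring bond if it is not a bridge. The connected components
    of the subgraph made of ring bonds are returned as sorted atom lists.
    """
    adj = defaultdict(list)
    for idx, (a, b) in enumerate(bonds):
        adj[a].append((b, idx))
        adj[b].append((a, idx))

    disc, low = {}, {}   # discovery time and low-link value per atom
    bridges = set()      # indices of bonds that are not in any ring
    counter = 0

    def dfs(atom, parent_bond):
        nonlocal counter
        disc[atom] = low[atom] = counter
        counter += 1
        for nbr, bond in adj[atom]:
            if bond == parent_bond:
                continue
            if nbr in disc:
                low[atom] = min(low[atom], disc[nbr])
            else:
                dfs(nbr, bond)
                low[atom] = min(low[atom], low[nbr])
                if low[nbr] > disc[atom]:
                    bridges.add(bond)

    for atom in range(n_atoms):
        if atom not in disc:
            dfs(atom, None)

    # Keep only ring bonds and collect connected components.
    ring_adj = defaultdict(set)
    for idx, (a, b) in enumerate(bonds):
        if idx not in bridges:
            ring_adj[a].add(b)
            ring_adj[b].add(a)

    systems, seen = [], set()
    for start in sorted(ring_adj):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            atom = stack.pop()
            if atom in component:
                continue
            component.add(atom)
            stack.extend(ring_adj[atom] - component)
        seen |= component
        systems.append(sorted(component))
    return systems


# Two six-membered rings (atoms 0-5 and 8-13) joined by a linker (atoms 6, 7).
example_bonds = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
    (5, 6), (6, 7), (7, 8),
    (8, 9), (9, 10), (10, 11), (11, 12), (12, 13), (13, 8),
]
example_systems = ring_systems(14, example_bonds)
```

Each returned atom list could then be handed to the ring search on its own, so that the linker atoms never enter the expensive part of the calculation. Fused and spiro rings end up in the same system with this definition.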

After dinner (pasta/pizza), during the Spanish-German handball game, we continued the hacking and discussions, now focusing as a whole group on QSAR descriptors in KNIME. We looked at each descriptor and decided if it should go into a QSAR calculator node, or even in a node of its own.

Bugs found

I won’t close this blog entry without giving a list of problems we found in the current CDK; some minor and small, some more troublesome. Here goes: typos all over the place; the OrderQueryBond lacks a return statement in an else clause; the Mol2Reader does not mark atom and bond aromaticity properly, and reads a single bond as aromatic and an aromatic bond as single; the Renderer2D does not always highlight both atoms when hovering over a bond; SmilesGenerator.parseBond() should output bond orders correctly; the SSSR finder seems to have a messed-up if-else statement for the ringBondCount limit of 37; the BondCount descriptor should count all bonds by default, not just the single bonds; IDescriptor.getParameters() should return null instead of Object[0]; several programs use the SYBYL atom type S.o2, while the specification and the CDK config define S.O2; the IP descriptor now returns a variable-length descriptor.
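Two of these issues, the swapped Mol2 bond types and the S.o2 versus S.O2 spelling, come down to how tokens are looked up. The following is a minimal Python sketch of a tolerant lookup, not the CDK Mol2Reader code. The bond type codes are those of the Tripos Mol2 format; the atom type table is a deliberately tiny, illustrative subset, with the S.O2 spelling taken from the bug list above. The function names are hypothetical.

```python
# Bond type codes as used in the @<TRIPOS>BOND section of a Mol2 file.
MOL2_BOND_TYPES = {
    "1": "single",
    "2": "double",
    "3": "triple",
    "am": "amide",
    "ar": "aromatic",
    "du": "dummy",
    "un": "unknown",
    "nc": "not connected",
}

# Illustrative subset of SYBYL atom types (assumption: canonical spelling
# as given in the bug list above, not a complete table).
KNOWN_ATOM_TYPES = ("S.2", "S.3", "S.O2")
_CANONICAL = {name.lower(): name for name in KNOWN_ATOM_TYPES}


def parse_bond_type(token):
    """Map a Mol2 bond type token to a readable name (raises KeyError if unknown)."""
    return MOL2_BOND_TYPES[token.strip().lower()]


def normalize_atom_type(token):
    """Return the canonical spelling of a known atom type, ignoring case.

    Unknown atom types are returned unchanged (apart from stripped whitespace).
    """
    cleaned = token.strip()
    return _CANONICAL.get(cleaned.lower(), cleaned)
```

With such a lookup, `parse_bond_type("1")` and `parse_bond_type("ar")` cannot be confused, and `normalize_atom_type("S.o2")` gives the same result as `normalize_atom_type("S.O2")`.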

Mining the KEGG pathway database with self-organizing maps

2006-04-04T00:00:00+00:00

The Self-organizing map (SOM) is a popular (again) and intuitive non-linear mapping method: it transforms a multidimensional space into two dimensions (normally: they are so easy to visualize). Latino and Aires-de-Sousa published a paper that uses this method to analyze the whole KEGG pathway database: Genome-Scale Classification of Metabolic Reactions: A Chemoinformatics Approach (DOI: anie.200503833).
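For readers who have not seen a SOM before, here is a minimal sketch in plain Python of how such a map can be trained. It is a generic textbook-style SOM, not the implementation used in the paper, and all names and parameter choices (grid size, learning rate, neighborhood radius, the toy data) are assumptions made for this illustration.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Return (row, col) of the neuron whose weight vector is closest to `vector`."""
    best, best_dist = None, math.inf
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - vi) ** 2 for wi, vi in zip(w, vector))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(data, rows, cols, epochs=50, seed=0):
    """Train a rows x cols SOM on `data` (a list of equal-length float lists)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs                      # decays from 1 towards 0
        rate = 0.5 * frac                                # learning rate
        radius = max(rows, cols) / 2.0 * frac + 0.5      # neighborhood width
        for vector in rng.sample(data, len(data)):       # shuffled pass over the data
            br, bc = best_matching_unit(weights, vector)
            for r in range(rows):
                for c in range(cols):
                    grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                    influence = math.exp(-grid_dist2 / (2 * radius ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += rate * influence * (vector[i] - w[i])
    return weights


# Toy data: two groups of three-dimensional points.
toy_data = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.9, 1.0, 0.9], [1.0, 0.9, 1.0]]
toy_map = train_som(toy_data, rows=3, cols=3)
toy_positions = [best_matching_unit(toy_map, v) for v in toy_data]
```

Each input vector ends up at a grid position, and it is these two-dimensional positions that make the result so easy to plot.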

The method is based on earlier work by Zhang and Aires-de-Sousa: Structure-Based Classification of Chemical Reactions without Assignment of Reaction Centers (DOI: 10.1021/ci0502707). A non-trivial feature of the suggested method is the use of two SOMs. The first maps the reaction onto a fixed-length vector (coined MOLMAP), which is used as input vector for the second map. This latter map is used to cluster the KEGG reactions on a purely chemical basis. The resemblance to the EC numbering system is striking.
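How a first map can turn a reaction into a fixed-length vector is sketched below in Python. This is a simplified reading of the idea and not the authors' code: it assumes that each bond is described by a small numeric vector, that each bond is assigned to its closest neuron on an already trained first map, and that the reaction vector is the per-neuron count for the products minus that for the reactants. The weights, bond vectors and function names are all made up for the example.

```python
def closest_neuron(weights, vector):
    """Index of the neuron (in a flat list of weight vectors) closest to `vector`."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], vector)),
    )


def occupancy(weights, bond_vectors):
    """Count how many bonds land on each neuron; the length equals the map size."""
    counts = [0] * len(weights)
    for vector in bond_vectors:
        counts[closest_neuron(weights, vector)] += 1
    return counts


def reaction_vector(weights, reactant_bonds, product_bonds):
    """Fixed-length reaction descriptor: product occupancy minus reactant occupancy."""
    before = occupancy(weights, reactant_bonds)
    after = occupancy(weights, product_bonds)
    return [a - b for a, b in zip(after, before)]


# Hypothetical, already trained first map with four neurons (a flattened 2x2 grid).
first_map = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
reactant_bonds = [[0.1, 0.1], [0.9, 0.9]]
product_bonds = [[0.1, 0.1], [0.1, 0.9]]
descriptor = reaction_vector(first_map, reactant_bonds, product_bonds)
```

Bonds that are unchanged by the reaction cancel out, so only the neurons for broken and formed bonds are non-zero, and the vector has the same length for every reaction regardless of molecule size. Such vectors are what a second map could then cluster.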