chem-bla-ics

Artificial intelligence for natural product drug discovery

2023-09-24T00:00:00+00:00

Two weeks ago, the write-up of a week-long scientific discussion on artificial intelligence for natural product drug discovery, held at the Lorentz Center in Leiden, was published (doi:10.1038/s41573-023-00774-7, free PDF).

Sadly, the meeting still took place during the (partial) lockdown, and I think my contribution could have been more extensive. But I am happy I got to pitch the idea of using Wikidata in this area too, taking advantage of the earlier work done by the LOTUS team (doi:10.7554/eLife.70780).

And this is key to me: you cannot do statistics, chemometrics, machine learning, or artificial intelligence without good quality linked data. Happy reading!

New Paper: “The ChEMBL database as linked open data”

2013-05-09T00:00:00+00:00

Update: Mark wrote up a blog post on the RDF from the ChEMBL team itself.

Yesterday, the paper “The ChEMBL database as linked open data” (doi:10.1186/1758-2946-5-23) by Andra Waagmeester (@andrawaag), Ola Spjuth (@ola_spjuth), Peter Ansell (@p_ansell), Antony Williams (@chemconnector), Valery Tkachenko, Janna Hastings, Bin Chen (@binchenindiana), David J Wild (@davidjohnwild), and me appeared in the OA JChemInf journal.

I am also indebted to the ChEMBL team (@chembl) for both providing such valuable data under a liberal Open Access license and their critical reading of the manuscript! Additionally, I would like to stress that the ChEMBL team will create their own RDF version of ChEMBL and that this paper is not describing the version they will release.

BTW, the source of the paper is available from GitHub. And the (original) scripts to create RDF from the MySQL dump of ChEMBL are also on GitHub.

This paper outlines the RDF as it has evolved from various earlier projects. The above diagram visualizes the basic structure (red) and the various Linked Data resources linked to (blue), and illustrates how various ontologies are used, such as the CHEMINF, BIBO, and CiTO ontologies.

Additionally, the paper describes various applications and links developed by the co-authors. For example, Peter worked on the use in Bio2RDF, and Bin and David on Chem2Bio2RDF. Andra developed an extension for his (#altmetric) CitedIn resource, giving credit to a paper when data in it is extracted into ChEMBL. Ola, Valery, and Antony developed a Bioclipse Decision Support extension, which supports a nearest neighbor search in ChEMBL using ChemSpider. Of course, Ola also hosts the SPARQL endpoint, the uptime of which you can monitor with the also cool mondeca.com service.

(Yes, I think I have all the cool buzzwords covered in this paper. Sadly, marketing is needed nowadays as a scientist. Where is the time that you could rant on page after page in all your domain specific jargon, not having to worry if your reader would understand it immediately, or without a university degree…)

What this paper does not describe is all the things I did with ChEMBL-RDF in the Open PHACTS project (@Open_PHACTS), which include the use of QUDT and the jQUDT library for unit normalization, outlined in this document, and the use of VoID for link sets, as described in this document.

ChEMBL 13 as RDF

2012-03-04T00:00:00+00:00

Update: this work is now described in this paper.

Last week, ChEMBL 13 was released, with even more data, data fixes, etc. Since my RDF for ChEMBL 09, my workflow has become more solid and now uses more common ontologies, as well as ontologies I just like, such as CHEMINF and CiTO. The resource types present in the RDF are: activities (almost 7M now), chemical entities, assays, targets, and documents.

The data on Kasabi will be updated soon, and the SPARQL endpoint hosted by Uppsala University was updated yesterday, including the SNORQL frontend.

The new data is not fully backwards compatible. The changes to the RDF include the use of cito:citesAsDataSource and more typing with existing ontologies, e.g. with cheminf:CHEMINF_000000 and pro:PR_000000001 from the PRotein Ontology.

A paper dedicated to the ChEMBL-RDF is in preparation. Existing use cases can be found here.

OSMB2007 Day #1: venture capital, scientific blogger and Kepler

2007-01-24T00:00:00+00:00

The second day of the Open Source Meets Business conference has just started, and I am actually listening to the PHP talk right now, but here is a short update on day 1, which was the investment summit. It was not so crowded, but the talks from the venture capitalists were especially interesting. During lunch we actually talked to one in person, which was insightful. I will be putting up links to interesting sites mentioned during this conference on my delicious account.

Nothing much more I can tell about this, except for a few general quotes:

2% of the downloaders become paying customers
an active community is important, cherish it
support as a business model is not interesting for venture capitalists
don’t think you understand the legal implications

Noteworthy is that we have free wireless at the conference site :) So I downloaded a recent presentation by Jean-Claude about his open science work and blogging efforts, which I enjoyed watching very much. I skyped with my wife and children, and I booked a hotel for the ACS meeting in March in Chicago, as chances are high that I will attend that meeting.

Last night it started snowing, and it is completely white outside right now. The temperature has dropped to normal winter levels, which made the burritos in downtown Nürnberg extra nice. Later today, Christoph’s COSI talk is scheduled, and I was delighted to learn via Chemical blogspace that Carlos blogged about it yesterday! Cheers Carlos! In the same blog he also mentions that he is integrating the CDK with something called Kepler. Carlos, if you read this: what is the URL for Kepler?

Open Source Meets Business 2007

2007-01-22T00:00:00+00:00

Today I leave for a two-day visit to the Open Source Meets Business conference in Nürnberg, where Christoph will speak about the Chemoinformatics OpenSource Initiative (COSI). If you happen to go to that meeting too, let’s try to meet!

Modern chemistry in the CDK: beyond the two-atom bond

2006-12-30T00:00:00+00:00

Rich recently blogged about the limitations of the two-atom bond representation often used in chemoinformatics, triggered by the four ferrocene entries in PubChem. In reply to himself, Rich described FlexMol, an XML language that can describe bond systems that involve more than two atoms.

Obviously, the problem originates from the lack of mathematical knowledge of chemists: current chemoinformatics depends heavily on graph theory, where each atom is a vertex and each bond an edge. This has the advantage that we can borrow all algorithms that work with graph representations, such as Dijkstra’s algorithm to find the shortest path between two vertices. Or, in chemical language, an algorithm to calculate how many bonds two atoms are apart in a molecule.
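To make the graph view concrete, here is a minimal sketch in plain Java, without the CDK. The class BondDistance and its method are hypothetical names of mine, and I assume atoms are simply numbered from zero and bonds are given as pairs of atom indices. Because every bond counts as one step, a breadth-first search gives the same answer here as Dijkstra’s algorithm would.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper, not part of the CDK: atoms are vertices, two-atom bonds are edges. */
final class BondDistance {

    /** Returns the number of bonds on the shortest path, or -1 if the atoms are not connected. */
    static int between(int atomCount, int[][] bonds, int start, int end) {
        // Build an adjacency list from the list of two-atom bonds.
        List<List<Integer>> neighbours = new ArrayList<>();
        for (int i = 0; i < atomCount; i++) {
            neighbours.add(new ArrayList<>());
        }
        for (int[] bond : bonds) {
            neighbours.get(bond[0]).add(bond[1]);
            neighbours.get(bond[1]).add(bond[0]);
        }

        // Breadth-first search: -1 marks atoms that have not been reached yet.
        int[] distance = new int[atomCount];
        Arrays.fill(distance, -1);
        distance[start] = 0;
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            int atom = queue.remove();
            if (atom == end) {
                return distance[atom];
            }
            for (int next : neighbours.get(atom)) {
                if (distance[next] < 0) {
                    distance[next] = distance[atom] + 1;
                    queue.add(next);
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A six-membered carbon ring with atoms numbered 0 to 5.
        int[][] ring = { {0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {0, 5} };
        System.out.println(between(6, ring, 0, 3)); // para positions
        System.out.println(between(6, ring, 0, 4)); // meta positions
    }
}
```

Note that the adjacency list only makes sense when every bond has exactly two atoms, which is exactly the assumption the rest of this post questions.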

When discussing FlexMol, Rich mentions the work by Dietz (DOI:10.1021/ci00027a001), but I would like to add the PhD thesis of S. Bauerschmidt (see DOI:10.1021/ci9704423), done in Gasteiger’s group. Dropping this ‘two-atom bond’ representation in favor of something that better describes compounds like ferrocene, like the Dietz and Bauerschmidt approaches, has the unfortunate disadvantage of losing compatibility with graph theory algorithms. Nevertheless, in order to take chemoinformatics to the next level, we have to address these issues. But hope is not lost, and people are working on rewriting our toolkit of chemoinformatics algorithms to match such new representations.

CDK

I will postpone analyzing the CDK for compatibility with such more modern representations (look out for a CDK News article), and now just describe how the CDK can be used for FlexMol/Dietz/Bauerschmidt representations. Consider the four examples Rich gives in his blog. Here are the CDK ways of doing the same.

For example, 1,3,5-cyclohexatriene:

public IMolecule makeCycloHexaTriene() {
  IMolecule cyclohexatriene = builder.newMolecule();

  IAtom atomC0 = builder.newAtom(Elements.CARBON);
    atomC0.setID("C0"); atomC0.setHydrogenCount(1);
  IAtom atomC1 = builder.newAtom(Elements.CARBON);
    atomC1.setID("C1"); atomC1.setHydrogenCount(1);
  IAtom atomC2 = builder.newAtom(Elements.CARBON);
    atomC2.setID("C2"); atomC2.setHydrogenCount(1);
  IAtom atomC3 = builder.newAtom(Elements.CARBON);
    atomC3.setID("C3"); atomC3.setHydrogenCount(1);
  IAtom atomC4 = builder.newAtom(Elements.CARBON);
    atomC4.setID("C4"); atomC4.setHydrogenCount(1);
  IAtom atomC5 = builder.newAtom(Elements.CARBON);
    atomC5.setID("C5"); atomC5.setHydrogenCount(1);

  IBond bondB0 = builder.newBond(atomC0, atomC1, 1.0);
    bondB0.setElectronCount(2);
  IBond bondB1 = builder.newBond(atomC1, atomC2, 2.0);
    bondB1.setElectronCount(4);
  IBond bondB2 = builder.newBond(atomC2, atomC3, 1.0);
    bondB2.setElectronCount(2);
  IBond bondB3 = builder.newBond(atomC3, atomC4, 2.0);
    bondB3.setElectronCount(4);
  IBond bondB4 = builder.newBond(atomC4, atomC5, 1.0);
    bondB4.setElectronCount(2);
  IBond bondB5 = builder.newBond(atomC0, atomC5, 2.0);
    bondB5.setElectronCount(4);

  cyclohexatriene.addAtom(atomC0); cyclohexatriene.addAtom(atomC1);
  cyclohexatriene.addAtom(atomC2); cyclohexatriene.addAtom(atomC3);
  cyclohexatriene.addAtom(atomC4); cyclohexatriene.addAtom(atomC5);

  cyclohexatriene.addBond(bondB0); cyclohexatriene.addBond(bondB1);
  cyclohexatriene.addBond(bondB2); cyclohexatriene.addBond(bondB3);
  cyclohexatriene.addBond(bondB4); cyclohexatriene.addBond(bondB5);

  return cyclohexatriene;
}

Summarizing, the key thing is to use the IBond.setElectronCount() method. The call is sort of redundant, as the CDK defaults to two electrons if not explicitly given. This compound is, of course, benzene, which we can represent like this too:

public IMolecule makeBenzene() {
  IMolecule benzene = builder.newMolecule();

  IAtom atomC0 = builder.newAtom(Elements.CARBON);
    atomC0.setID("C0"); atomC0.setHydrogenCount(1);
  IAtom atomC1 = builder.newAtom(Elements.CARBON);
    atomC1.setID("C1"); atomC1.setHydrogenCount(1);
  IAtom atomC2 = builder.newAtom(Elements.CARBON);
    atomC2.setID("C2"); atomC2.setHydrogenCount(1);
  IAtom atomC3 = builder.newAtom(Elements.CARBON);
    atomC3.setID("C3"); atomC3.setHydrogenCount(1);
  IAtom atomC4 = builder.newAtom(Elements.CARBON); 
    atomC4.setID("C4"); atomC4.setHydrogenCount(1);
  IAtom atomC5 = builder.newAtom(Elements.CARBON); 
    atomC5.setID("C5"); atomC5.setHydrogenCount(1);

  IBond bondB0 = builder.newBond(atomC0, atomC1);
    bondB0.setElectronCount(2);
  IBond bondB1 = builder.newBond(atomC1, atomC2);
    bondB1.setElectronCount(2);
  IBond bondB2 = builder.newBond(atomC2, atomC3);
    bondB2.setElectronCount(2);
  IBond bondB3 = builder.newBond(atomC3, atomC4);
    bondB3.setElectronCount(2);
  IBond bondB4 = builder.newBond(atomC4, atomC5);
    bondB4.setElectronCount(2);
  IBond bondB5 = builder.newBond(atomC0, atomC5);
    bondB5.setElectronCount(2);

  IBond bondingSystem = builder.newBond();
    bondingSystem.setElectronCount(6);
    bondingSystem.setAtoms(
      new IAtom[] { atomC0, atomC1, atomC2, 
                    atomC3, atomC4, atomC5}
    );

  benzene.addAtom(atomC0); benzene.addAtom(atomC1);
  benzene.addAtom(atomC2); benzene.addAtom(atomC3);
  benzene.addAtom(atomC4); benzene.addAtom(atomC5);

  benzene.addBond(bondB0); benzene.addBond(bondB1);
  benzene.addBond(bondB2); benzene.addBond(bondB3);
  benzene.addBond(bondB4); benzene.addBond(bondB5);
  benzene.addBond(bondingSystem);

  return benzene;
}

This version represents the delocalized aromatic pi-system as one IBond: one with 6 electrons, and 6 associated atoms.
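Stripped of the CDK, the underlying data structure is small. The sketch below is plain Java and not the CDK API: the class MultiCentreBonds, its nested Bond class, and the helper methods are hypothetical names I made up for illustration. It only assumes that a bond is an electron count plus a list of atom indices, with the classical bond as the two-atom special case.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, CDK-independent sketch of bonds that may span more than two atoms. */
final class MultiCentreBonds {

    /** A number of electrons shared by any number of atoms (given as atom indices). */
    static final class Bond {
        final int electronCount;
        final int[] atoms;

        Bond(int electronCount, int... atoms) {
            this.electronCount = electronCount;
            this.atoms = atoms.clone();
        }

        /** Only two-atom bonds map onto an edge of a classical molecular graph. */
        boolean isGraphEdge() {
            return atoms.length == 2;
        }
    }

    /** Benzene as in makeBenzene(): six two-electron ring bonds plus one six-atom pi system. */
    static List<Bond> benzene() {
        List<Bond> bonds = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            bonds.add(new Bond(2, i, (i + 1) % 6));
        }
        bonds.add(new Bond(6, 0, 1, 2, 3, 4, 5));
        return bonds;
    }

    /** Sums the electrons over all bonds; this works regardless of how many atoms a bond has. */
    static int totalElectrons(List<Bond> bonds) {
        int total = 0;
        for (Bond bond : bonds) {
            total += bond.electronCount;
        }
        return total;
    }

    /** Counts the bonds that a plain graph algorithm could not treat as an edge. */
    static int countNonEdges(List<Bond> bonds) {
        int count = 0;
        for (Bond bond : bonds) {
            if (!bond.isGraphEdge()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Bond> bonds = benzene();
        System.out.println(totalElectrons(bonds)); // 6 x 2 + 6 carbon-carbon bonding electrons
        System.out.println(countNonEdges(bonds));  // the delocalized pi system
    }
}
```

Simple bookkeeping like the electron total keeps working, while anything that expects edges has to decide what to do with the one bond that is not an edge.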

The cyclopentadienyl anion is represented similarly:

public IMolecule makeCycloPentadienylAnion() {
  IMolecule cp = builder.newMolecule();

  IAtom atomC0 = builder.newAtom(Elements.CARBON);
    atomC0.setID("C0"); atomC0.setHydrogenCount(1);
  IAtom atomC1 = builder.newAtom(Elements.CARBON);
    atomC1.setID("C1"); atomC1.setHydrogenCount(1);
  IAtom atomC2 = builder.newAtom(Elements.CARBON);
    atomC2.setID("C2"); atomC2.setHydrogenCount(1);
  IAtom atomC3 = builder.newAtom(Elements.CARBON);
    atomC3.setID("C3"); atomC3.setHydrogenCount(1);
  IAtom atomC4 = builder.newAtom(Elements.CARBON);
    atomC4.setID("C4"); atomC4.setHydrogenCount(1);

  IBond bondB0 = builder.newBond(atomC0, atomC1);
    bondB0.setElectronCount(2);
  IBond bondB1 = builder.newBond(atomC1, atomC2);
    bondB1.setElectronCount(2);
  IBond bondB2 = builder.newBond(atomC2, atomC3);
    bondB2.setElectronCount(2);
  IBond bondB3 = builder.newBond(atomC3, atomC4);
    bondB3.setElectronCount(2);
  IBond bondB4 = builder.newBond(atomC4, atomC0);
    bondB4.setElectronCount(2);

  IBond bondingSystem = builder.newBond();
    bondingSystem.setElectronCount(6);
    bondingSystem.setAtoms(
      new IAtom[] { atomC0, atomC1, atomC2, atomC3, atomC4 }
    );

  cp.addAtom(atomC0); cp.addAtom(atomC1);
  cp.addAtom(atomC2); cp.addAtom(atomC3);
  cp.addAtom(atomC4);

  cp.addBond(bondB0); cp.addBond(bondB1);
  cp.addBond(bondB2); cp.addBond(bondB3);
  cp.addBond(bondB4); cp.addBond(bondingSystem);

  return cp;
}

And the final step in this series is ferrocene:

public IMolecule makeFerrocene() {
  IMolecule ferrocene = builder.newMolecule();

  IAtom atomC0 = builder.newAtom(Elements.CARBON);
    atomC0.setID("C0"); atomC0.setHydrogenCount(1);
  IAtom atomC1 = builder.newAtom(Elements.CARBON);
    atomC1.setID("C1"); atomC1.setHydrogenCount(1);
  IAtom atomC2 = builder.newAtom(Elements.CARBON);
    atomC2.setID("C2"); atomC2.setHydrogenCount(1);
  IAtom atomC3 = builder.newAtom(Elements.CARBON);
    atomC3.setID("C3"); atomC3.setHydrogenCount(1);
  IAtom atomC4 = builder.newAtom(Elements.CARBON);
    atomC4.setID("C4"); atomC4.setHydrogenCount(1);
  IAtom atomC5 = builder.newAtom(Elements.CARBON);
    atomC5.setID("C5"); atomC5.setHydrogenCount(1);
  IAtom atomC6 = builder.newAtom(Elements.CARBON);
    atomC6.setID("C6"); atomC6.setHydrogenCount(1);
  IAtom atomC7 = builder.newAtom(Elements.CARBON);
    atomC7.setID("C7"); atomC7.setHydrogenCount(1);
  IAtom atomC8 = builder.newAtom(Elements.CARBON);
    atomC8.setID("C8"); atomC8.setHydrogenCount(1);
  IAtom atomC9 = builder.newAtom(Elements.CARBON);
    atomC9.setID("C9"); atomC9.setHydrogenCount(1);
  IAtom iron = builder.newAtom(Elements.IRON);
    iron.setID("Fe10"); iron.setHydrogenCount(0);

  IBond bondB0 = builder.newBond(atomC0, atomC1);
    bondB0.setElectronCount(2);
  IBond bondB1 = builder.newBond(atomC1, atomC2);
    bondB1.setElectronCount(2);
  IBond bondB2 = builder.newBond(atomC2, atomC3);
    bondB2.setElectronCount(2);
  IBond bondB3 = builder.newBond(atomC3, atomC4);
    bondB3.setElectronCount(2);
  IBond bondB4 = builder.newBond(atomC4, atomC0);
    bondB4.setElectronCount(2);
  IBond bondB5 = builder.newBond(atomC5, atomC6);
    bondB5.setElectronCount(2);
  IBond bondB6 = builder.newBond(atomC6, atomC7);
    bondB6.setElectronCount(2);
  IBond bondB7 = builder.newBond(atomC7, atomC8);
    bondB7.setElectronCount(2);
  IBond bondB8 = builder.newBond(atomC8, atomC9);
    bondB8.setElectronCount(2);
  IBond bondB9 = builder.newBond(atomC9, atomC5);
    bondB9.setElectronCount(2);

  IBond bondingSystem1 = builder.newBond();
    bondingSystem1.setElectronCount(6);
    bondingSystem1.setAtoms(
      new IAtom[] {
       atomC0, atomC1, atomC2, atomC3, atomC4, iron
      }
    );
  IBond bondingSystem2 = builder.newBond(); 
    bondingSystem2.setElectronCount(6);
    bondingSystem2.setAtoms(
      new IAtom[] {
        atomC5, atomC6, atomC7, atomC8, atomC9, iron
      }
    );
  IBond bondingSystem3 = builder.newBond();
    bondingSystem3.setElectronCount(6);
    bondingSystem3.setAtoms(
      new IAtom[]{
        atomC0, atomC1, atomC2, atomC3, atomC4,
        atomC5, atomC6, atomC7, atomC8, atomC9,
        iron
      }
    );

  ferrocene.addAtom(atomC0); ferrocene.addAtom(atomC1);
  ferrocene.addAtom(atomC2); ferrocene.addAtom(atomC3);
  ferrocene.addAtom(atomC4); ferrocene.addAtom(atomC5);
  ferrocene.addAtom(atomC6); ferrocene.addAtom(atomC7);
  ferrocene.addAtom(atomC8); ferrocene.addAtom(atomC9);
  ferrocene.addAtom(iron);

  ferrocene.addBond(bondB0); ferrocene.addBond(bondB1);
  ferrocene.addBond(bondB2); ferrocene.addBond(bondB3);
  ferrocene.addBond(bondB4);
  ferrocene.addBond(bondB5); ferrocene.addBond(bondB6);
  ferrocene.addBond(bondB7); ferrocene.addBond(bondB8);
  ferrocene.addBond(bondB9);
  ferrocene.addBond(bondingSystem1);
  ferrocene.addBond(bondingSystem2);
  ferrocene.addBond(bondingSystem3);

  return ferrocene;
}

Now, you will note that this approach does not exactly follow Rich’s FlexMol examples: I skipped the atom pair concept found in the FlexMol version of ferrocene. His example more closely follows what we are likely to draw, while the CDK code above more closely follows the molecular orbital concept. (I have to check to see how Dietz and Bauerschmidt did this.)

As said, the real trick is to have a chemoinformatics toolkit that can work with this representation, but I will save that for later. At least our algorithms to calculate the molecular mass should work ;)

Counting constitutional isomers from the molecular formula

2006-12-17T00:00:00+00:00

Update: check these two papers.

We all know the combinatorial explosion when calculating the number of possible constitutional isomers (see wp:structural isomerism) of a certain molecular formula. For example, C2H6 has only one constitutional isomer (ethane, InChI=1/C2H6/c1-2/h1-2H3), and C4H10 has only two. Breaking symmetry, by replacing one carbon by another element or by replacing a single by a double bond, increases the number especially sharply. For example, C7H16 has only nine constitutional isomers, while replacing two single bonds by two double bonds, creating C7H10, increases this number to 499! Then, replacing one carbon by an oxygen in the last formula adds another few, totaling 747 isomers.

Now, C8H8NBr has at least 649 thousand constitutional isomers, and I am quite interested in being able to know the number of isomers beforehand, without having to generate the structures themselves (for example, using CDK’s GENMDeterministicGenerator). InChI=1/C8H8BrN/c9-7-1-2-8-6(5-7)3-4-10-8/h1-2,5,10H,3-4H2 is one of the isomers.
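One thing that can be read off the formula alone is the number of rings plus double bond equivalents, which every constitutional isomer of that formula must share. It does not count isomers, but it is a cheap sanity check before starting an enumeration. Below is a minimal sketch in plain Java using the textbook formula (2C + 2 + N - H - X) / 2; the class and method names are hypothetical, and I assume a neutral formula with only carbon, hydrogen, nitrogen, halogens, and divalent atoms such as oxygen (which do not change the value).

```java
/** Hypothetical helper, not CDK code: rings plus double bond equivalents from a formula. */
final class Unsaturation {

    /**
     * Applies (2C + 2 + N - H - X) / 2, where X is the number of halogens.
     * A double bond or a ring counts once, a triple bond counts twice.
     */
    static int ringsPlusDoubleBonds(int carbons, int hydrogens, int nitrogens, int halogens) {
        int twice = 2 * carbons + 2 + nitrogens - hydrogens - halogens;
        if (twice < 0 || twice % 2 != 0) {
            throw new IllegalArgumentException("not a neutral, closed-shell formula");
        }
        return twice / 2;
    }

    public static void main(String[] args) {
        System.out.println(ringsPlusDoubleBonds(7, 16, 0, 0)); // C7H16, fully saturated
        System.out.println(ringsPlusDoubleBonds(8, 8, 1, 1));  // C8H8NBr
    }
}
```

For C8H8NBr this gives five, which matches the isomer given above: a benzene ring (one ring and three double bonds) fused to one more ring.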

So, my question: is anyone aware of free code (in order of preference: 1. LGPL, 2. BSD/MIT, 3. opensource, 4. free) to calculate or estimate the number of constitutional isomers for a certain molecular formula? An estimate would already be nice. Ideally, I would implement this bit of code into the CDK, but otherwise, just knowing the number of isomers for C8H8NBr would be nice :)

Additionally, any relevant, recent literature recommendations are most welcome. I am aware of the use of polynomials, but the literature I have seen so far just focuses on molecules of a certain architecture, and is not able to come up with a guess based on the molecular formula alone.

Molecular Chemometrics

2006-12-12T00:00:00+00:00

I just found out that a review article that I wrote earlier this year got printed: Molecular Chemometrics (DOI:10.1080/10408340600969601), with my personal view on the interplay between chemoinformatics and chemometrics. The review discusses interesting developments in the last five years, and was fun writing (reading too, I think :). It has four major topics:

molecular representation (with ‘molecular descriptors’ and ‘beyond the molecule’)
chemical space, similarity and diversity
activity and property modeling (with ‘dimension reduction’ and ‘model validation’)
library searching, which mostly focuses on semantic web developments

Comments most welcome; just leave them below this blog item, or blog about the article yourself :)

H-index in chemoinformatics

2006-12-09T00:00:00+00:00

Peter blogged about the h-index, which is a measure of one’s scientific impact. He used Google Scholar, but I do not feel that that database is clean enough. I believe a better source would be the ISI Web-of-Science.
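For those who have not met it yet: an author has h-index h if h of their papers have each been cited at least h times. A minimal sketch of that calculation in plain Java follows; the class name is hypothetical and the citation counts in the example are made up, not taken from any database.

```java
import java.util.Arrays;

/** Hypothetical helper: computes an h-index from a list of per-paper citation counts. */
final class HIndex {

    /** Returns the largest h such that h papers have at least h citations each. */
    static int of(int[] citationCounts) {
        int[] sorted = citationCounts.clone();
        Arrays.sort(sorted); // ascending, so the most cited paper is last
        int paperCount = sorted.length;
        int h = 0;
        // Walk from the most cited paper downwards; rank 1 is the most cited one.
        for (int rank = 1; rank <= paperCount; rank++) {
            int citations = sorted[paperCount - rank];
            if (citations < rank) {
                break;
            }
            h = rank;
        }
        return h;
    }

    public static void main(String[] args) {
        // Made-up citation counts for five papers.
        System.out.println(of(new int[] { 10, 8, 5, 4, 3 }));
    }
}
```

With these made-up numbers, four papers have at least four citations each, but there are not five papers with at least five, so the sketch prints 4. It also shows why a polluted query matters: every wrongly attributed, well-cited paper can push the value up.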

Therefore, I composed a list of h-indices of my own, ordered by value. The choice of authors is biased to the Blue Obelisk and the CDK, has some personal touches (Buydens and Wehrens are my PhD supervisors) and some names that put the rest into perspective:

query		h-index	#pubs
BENDER A	41	222
WILLETT P	37	302
GASTEIGER J	33	212
RZEPA HS	25	236
BUYDENS LMC	18	108
GLEN RC		18	78
WEHRENS R	11	47
MURRAY-RUST P*	9	41
STEINBECK C	9	29
FECHNER U	6	12
GUHA R		4	24
WILLIGHAGEN E*	4	9
WEGNER JK	3	9
LUTTMANN E	2	4

Of course, there are many caveats to this. Like any measurement, take into account the error. Sources of error include, but are not limited to, ambiguity in the query. The most notable example of this, I think, is Andreas Bender; I don’t think he has been that successful :) Also, Rajarshi Guha’s h-index was reported as 6, but the list included two articles from the 1970s and 1980s, which I do not think are actually his.

Feel free to suggest other names, query corrections, tips, and I will add or work on those too.

German Conference on Chemoinformatics 2006: Day 3

2006-11-14T00:00:00+00:00

Just some short, quick notes about the third day (see days 1 and 2). Today’s program of the German Conference on Chemoinformatics started with a presentation by Rzepa about his work on a semantic wiki (DOI:10.1021/ci060139e), which might be online here. (He recorded a podcast, but I have not seen it online yet.) I wish I could see the sources of those wiki pages, to see how that system integrates RDF, but at least Jmol is running fine. The presentation by Couch showed the status of the Materials Grid project, and how a guy called AgentX does all the hard work. Ihlenfeldt updated us about the status of PubChem, and mostly on what they had to do to keep the system from dying from its own success, for example using something called minimol. Googling does not seem to help, as that points to a number of things, but not to any PubChem webpage. I am still waiting for a European organization to set up a mirror.

After the coffee break, Kuhn showed a coarse-grained force field, approximating molecules by hacking them up into fragments of 3-10 heavy atoms. I guess this is a bit like what some small-molecule force fields do for methyls. Fragments within a molecule are tied together by springs, and intra- and intermolecular force field parameters are derived by running MD simulations on fragment pairs. Varnek argued that QSPR for melting point prediction has reached a fundamental limit, with an RMSE of around 30 to 40 degrees Celsius, which makes it quite unreasonable to decide whether a compound with a predicted melting point of 40 degrees is solid or liquid at room temperature.

You have to forgive me for not reporting on the afternoon session; I was tied up at our booth, talking with people about the CDK, Taverna, Bioclipse, Jmol, other opensource chemoinformatics tools, and chemoinformatics in general. Very nice, but exhausting. I might advise the organization to set up a blog aggregator next year, though I am not sure whether there are others blogging about this conference.