chem-bla-ics

Chemo::Blogs #2

2006-12-06T00:00:00+00:00

Because no one picked up my Chemo::Blogs suggestion, I will now officially claim the blog series title. However, unlike the original Bio::Blogs series, I will not summarize interesting blogs, but just spam you with websites I recently marked as toblog on del.icio.us.

Semantics and Text Mining

Evan Prodromou wrote about RDFa vs microformats. The latter are commonly used to enhance blog semantics, for example by PostGenomic.com. While RDFa is more explicit, e.g. by using namespaced markup, we have to wait until XHTML2 to see it working. I do not think chemists are using tags a lot yet, but let me propose the following microformats: 1/CH4/h1H4 and methane. Standard JavaScript and CSS will then do the rest. (Think: addressing newlines, auto googling-for-inchi, etc.)
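To make the idea a bit more concrete, here is a minimal Python sketch of what a tool could do with such markup. It assumes, purely for illustration, that an InChI is wrapped in a span with the class "inchi" and a compound name in a span with the class "compound"; neither class name is an agreed convention, and the helper names are made up for this example.

```python
from html.parser import HTMLParser


class InChICollector(HTMLParser):
    """Collect the text inside every <span class="inchi"> element.

    The class name "inchi" is an assumption made for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.inchis = []
        self._depth = 0      # > 0 while we are inside an inchi span
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Track nested spans so we close on the right end tag.
            if tag == "span":
                self._depth += 1
        elif tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            if "inchi" in classes:
                self._depth = 1
                self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.inchis.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def find_inchis(html):
    """Return all InChI strings marked up in the given HTML fragment."""
    parser = InChICollector()
    parser.feed(html)
    parser.close()
    return parser.inchis


sample = (
    '<p>Burning <span class="compound">methane</span> '
    '(<span class="inchi">1/CH4/h1H4</span>) gives water.</p>'
)
print(find_inchis(sample))
```

A blog aggregator could run something like this over each post and link the InChIs it finds back to the article.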

Microformats are interesting because of text mining of various kinds. Whether it is setting up a molecule-article link database or finding hot molecules in blogspace, adding semantics will help tools like OSCAR3 to mine chemistry. Some time ago OTMI was proposed by Nature, and they have now set up a dedicated web site to explain their view on text mining. Zack Rosen has a good idea why RDF Semantic web research isn’t working.

Blogspace

There are a few new chemistry blogs I want to mention (and already added to Chemical blogspace): ChemBark, lirico which has an interesting chemoinformatics section, and The Curious Wavefunction. Worth reading indeed.

Pierre’s YOKOFAKUN deserves a paragraph of his own. He recently blogged about bio2rdf which provides an RDF interface to biochemical knowledge via Life Science Identifiers (LSID), OBOEdit which is a Java-based ontology editor, and Amadea which is a Taverna- and KNIME-like tool for setting up UNIX pipes.

Online EMBL Symposium

A few EMBL PhD students are having the First Online EMBL PhD Symposium (catchy name, or … ;) Anyway, discussions are held on IRC, and it has a rather interesting Web2.0 session. All media is available on the website but requires registration right now. After the conference it will become open access to all. Jean-Claude contributed The UsefulChem Project: Open Source Chemistry Research using Blogs and Wikis to the Participants’ Contributions section, and I had a poster on Distributing molecular information over the Internet, discussing CMLRSS, blog aggregators, CML and other things. The IRC session was logged and is available here.

Literature

Finally, I want to mention three recent articles. The first is a recent write-up by Bourne and Friedberg about Ten Simple Rules for Selecting a Postdoctoral Position (DOI: 10.1371/journal.pcbi.0020121). With the end of my current postdoc position nearing, rather useful reading. Some time ago I blogged about a new open access journal, Source Code for Biology and Medicine, and the journal is now up and running. Details can be read in the first editorial (DOI: 10.1186/1751-0473-1-1). The third article I would like to mention is Scientific Software Development Is Not an Oxymoron by Baxter (DOI: 10.1371/journal.pcbi.0020087), though I do not think it has new insights.

OK, this was a rather lengthy write-up, but I really needed to clean up my toblog section :)

Chemical Archeology: OSCAR3 to NMRShiftDB.org

2006-09-08T00:00:00+00:00

Chemical Archeology (see Christoph’s comment) is the process of extracting chemical information from old journal articles. Some time ago, Peter Corbett from the group of Peter Murray-Rust visited the CUBIC to talk to us about Oscar3, which can do just that. That day, we already hooked OPSIN into Bioclipse.

Oscar3, however, is capable of more than the name2structure of OPSIN (see also 10.1039/b411033a); it can take a plain text file with an experimental section with details on the synthesis of small organic compounds, and analyze the chemistry in that. This functionality has been available as an RSC authoring tool for some time now (see also 10.1039/b411699m). Unfortunately, what publishers put online (PDF and HTML) is much more difficult to process with Oscar3: those formats are often optimized for display, not for machine processing. The HTML can be cleaned up, but there is no general approach.
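As an illustration of the kind of cleanup involved, the following Python sketch reduces an HTML fragment to plain text by dropping script and style content and flattening inline markup such as superscripts and subscripts. It is not the cleanup I actually used with Oscar3, and the tag lists and names are assumptions chosen for the example; real publisher HTML needs per-publisher tweaks.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Reduce HTML to plain text, one line per block-level element."""

    SKIP = {"script", "style"}   # content that is never article text
    BLOCK = {"p", "div", "br", "h1", "h2", "h3", "h4", "li", "tr"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        raw = "".join(self._parts)
        # Collapse runs of spaces and tabs, then drop empty lines.
        lines = [re.sub(r"[ \t\r\f\v]+", " ", line).strip()
                 for line in raw.split("\n")]
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the plain text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<div><style>p{color:red}</style>"
    "<p><sup>1</sup>H NMR (CDCl<sub>3</sub>):   &delta; 7.26 (m, 5H)</p>"
    "<script>track()</script></div>"
)
print(html_to_text(sample))
```

Inline tags are simply dropped, so the superscript in a label like 1H ends up as ordinary text. Whether that is what a text miner wants depends on the tool.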

Christoph Steinbeck is going to present at the upcoming ACS meeting the use of Oscar3 for extraction of NMR spectra from old journal articles, in preparation for submission to the NMRShiftDB.org (see the abstract of CINF 101).

Since the full Oscar3 was not hooked into Bioclipse yet, I had some work to do. It took me some time to figure out how to properly configure Oscar3, and what additional things I had to do to clean up the HTML used by publishers to get Oscar3 to extract NMR spectra (thanx to PeterC for hints!). I also had to tweak the Oscar3 code itself here and there, but that’s what opensource is about :) (Peter, if you are reading this: I have a number of patches for the Oscar3 code in bc_oscar; let me know if you’re interested in them.)

This is the end result:

Note especially the hierarchy in the resource navigator on the left. The misc folder contains all the chemistry found in the article. More importantly, for six molecules it fully detected the experimental section! For 3-(2-Oxocyclooctanyl)-3-phenylpropan-1-al (InChI=1/C17H22O2/c18-13-12-15(14-8-4-3-5-9-14)16-10-6-1-2-7-11-17(16)19/h3-5,8-9,13,15-16H,1-2,6-7,10-12H2) it derived the molecular structure (with OPSIN), and a few spectra: H-NMR, high-resolution MS and IR.

So, if you attend the ACS meeting: make sure to visit Christoph’s CINF 101 presentation!

Text mining for chemistry using OSCAR3

2006-06-22T00:00:00+00:00

Peter Corbett from Peter Murray-Rust’s group at the Unilever Cambridge Centre for Molecular Informatics visited Christoph Steinbeck’s junior Research Group on Molecular Informatics at the CUBIC today, and spoke about the status of Oscar3, a chemistry text mining program released under the Artistic License. Oscar3, the successor of versions 1 and 2, can detect and extract molecular structures and experimental details from plain text articles, using a variety of text mining techniques.

The afternoon was spent on hacking Oscar3 into Bioclipse, with good success. It involved updating Oscar3 for the latest CDK and setting up a plugin infrastructure for Bioclipse. This plugin will allow mining (scientific) articles for chemical compounds and their properties from within Bioclipse. The outcome of today’s hacking session was somewhat less ambitious and focused on the general infrastructure, and getting the OPSIN functionality in Oscar3 available as a wizard. OPSIN is an IUPAC name-to-structure tool and, amongst many other names, is able to recognize caffeine (InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3):

Open Text Mining Interface and Bioclipse

2006-05-07T00:00:00+00:00

Timo Hannay blogged in Nature’s Nascent blog about the Open Text Mining Interface (OTMI), which is “a suggestion from Nature about how we might achieve text-mining and indexing purposes”. The idea is that each article has a link pointing to a machine readable file containing raw data about (and from?) the article. The standing example uses Atom 1.0 as a container, allowing raw data to be included using foreign namespaces, such as Dublin Core (for metadata) and Prism (for bibliographic data), and the OTMI text mining statistics use a namespace too.

In a comment, Henry Rzepa proposed inclusion of CML, and refers to earlier work on CMLRSS where Chemical Markup Language is embedded in RSS news feeds for which I wrote readers for Jmol and JChemPaint (DOI:10.1021/ci034244p).

As readers of my blog know, the Bioclipse project has been working hard on an integrated (bio)chemistry workbench, and the latest release includes a CMLRSS reader plugin too, which supports CML embedded in Atom 0.3/1.0 and RSS 1.0/2.0 feeds. Now, adding support for other embedded namespaces is trivial, and this morning I hacked in support for OTMI:

This screenshot shows the original OTMI example with the Atom 1.0 entry now wrapped in an Atom 1.0 element. There is no nice OTMI icon for the OTMI content in the Atom 1.0 entry, nor did I make a ‘view’ yet showing the actual vectors or the snippets, but that’s a piece of cake too.

Now, the nice thing about this is that the Bioclipse code for the Atom and RSS feeds just greps through the feed entry and shows whatever CML or OTMI content is present. When Nature decides to include CML in these OTMI files too, I will not have to update the current code.
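The Bioclipse plugin itself is written in Java, but the principle fits in a few lines of Python. The sketch below is an illustration rather than the actual Bioclipse code: it walks each Atom 1.0 entry and reports which known foreign namespaces occur in it. The Atom namespace is the standard one and the CML namespace is the one I assume here; the OTMI namespace URI and the element names in the sample are placeholders, not taken from the OTMI specification.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Namespace-to-label table. The OTMI URI is a placeholder assumption;
# replace it with the namespace used in real OTMI files.
KNOWN_NAMESPACES = {
    "http://www.xml-cml.org/schema": "CML",
    "http://example.org/otmi": "OTMI",
}


def namespace_of(element):
    """Return the namespace URI of an element, or '' if it has none."""
    tag = element.tag
    if isinstance(tag, str) and tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return ""


def embedded_content(feed_xml):
    """List (entry title, [labels]) for every Atom entry in the document."""
    root = ET.fromstring(feed_xml)
    results = []
    # iter() also yields the root itself, so a bare <entry> works too.
    for entry in root.iter(f"{{{ATOM_NS}}}entry"):
        title = entry.findtext(f"{{{ATOM_NS}}}title", default="")
        labels = []
        for node in entry.iter():
            label = KNOWN_NAMESPACES.get(namespace_of(node))
            if label and label not in labels:
                labels.append(label)
        results.append((title, labels))
    return results


sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Example article</title>
    <otmi:vectors xmlns:otmi="http://example.org/otmi"/>
    <cml:molecule xmlns:cml="http://www.xml-cml.org/schema" id="m1"/>
  </entry>
</feed>"""
print(embedded_content(sample))
```

Because the scan is driven by namespaces rather than by a fixed document layout, an entry that carries both OTMI and CML content is handled without any change to the code.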