chem-bla-ics

One Million IUPAC names #4: a lot is happening

2025-08-09T00:00:00+00:00

A lot is happening. If you have been following this project more closely, you may have already seen some interesting updates, but I will post them here too. First, a quick recap. In March I started a new Blue Obelisk project to collect CCZero IUPAC names from primary literature (paper still pending). It turned out we can automate that without violating any laws or licenses. In April I reported on some tweaks that make more efficient use of the API. I also reported on some possible further steps, including how to use the extracted names to create a larger set. Indeed, in June I could report having passed 200k IUPAC names, which with the idea from April gave us more than 1M IUPAC names.

In this post I want to give an update.

275k IUPAC names

I have continued running the scripts to detect new IUPAC names in full-text, open access papers in Europe PMC, but something more awesome contributed much more since the June post: in July I received a pull request from mnietfeld with more than 40 thousand unique and new IUPAC names from the Beilstein Journal of Organic Chemistry (see also their LinkedIn post or this archived version that doesn’t require an account). While Europe PMC provides these articles too (and this journal was actually one of the first I analyzed), a lot of these names come from supplementary information, which Europe PMC does not provide. Thanks!

This is focusing on names from primary literature, but there is more happening. Because I want to restrict the above project to names from primary literature (and supplementary information is still that), I have not been sure what to do with other collections yet, and they have been coming in. I have been taking notes in the project issue tracker, for future reference (like now, here). I have not forgotten about these!

Other large collections of IUPAC names

4M, CCZero
Let’s start with the news yesterday. The Chemical Biology Services team released 4 million IUPAC names from patent literature as CCZero! The CCZero license/waiver makes it compatible with our list. Their Zenodo release:

… contains IUPAC names text-mined from patents (US, WIPO, EPO, Chinese, Japanese).

The post also includes a nice example of the complexity of IUPAC names which makes the counting of unique names tricky: O-methylphenol and o-methylphenol. Thanks, Noel and the rest of the EMBL-EBI team!
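To make the counting problem concrete, here is a minimal sketch (not part of the project scripts) of why case-insensitive deduplication is risky. In nomenclature, a capital O is a locant for the oxygen atom while a lowercase o stands for ortho, so the two names above describe different compounds. The helper name `group_by_lowercase` and the toy list are made up for illustration.

```python
from collections import defaultdict


def group_by_lowercase(names):
    """Group names that become identical once lowercased (hypothetical helper)."""
    groups = defaultdict(set)
    for name in names:
        groups[name.lower()].add(name)
    return groups


# Toy input: the first pair differs in meaning, the second pair
# most likely only differs in capitalization at the start of a sentence.
names = ["O-methylphenol", "o-methylphenol", "Ethanol", "ethanol"]

groups = group_by_lowercase(names)
exact_unique = len(set(names))      # case-sensitive count
folded_unique = len(groups)         # case-insensitive count
collisions = {key: sorted(variants)
              for key, variants in groups.items() if len(variants) > 1}

print(exact_unique, folded_unique)
for key, variants in collisions.items():
    print(key, "<-", variants)
```

Neither count is obviously the right one: the case-sensitive count treats Ethanol and ethanol as two names, while the case-insensitive count merges two different compounds. Listing the collisions for manual or rule-based review is one way to deal with that.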

2.3 million, CC-BY
And then Haydn Jones was one of the earliest to chime in, and released 2.3 million IUPAC names under the CC-BY license.

850k, CCZero
Wikidata also turns out to have many IUPAC names. Adriano found more than 850 thousand IUPAC names, see this project.

Next week I will do some comparisons of the datasets with a clear Creative Commons license.
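As a rough idea of what such a comparison could look like, the sketch below counts the overlap between two name lists using exact string matching. This is only an assumption about the approach: the function name `compare_name_sets` and the toy lists are invented, and a real comparison would probably need some normalization of the names first.

```python
def compare_name_sets(names_a, names_b):
    """Return overlap statistics for two collections of names (exact match)."""
    a, b = set(names_a), set(names_b)
    shared = a & b
    union = a | b
    return {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "shared": len(shared),
        # Jaccard index: shared names as a fraction of all distinct names
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy data, not taken from any of the releases mentioned above
literature = ["propane", "ethanol", "2-methylphenol"]
patents = ["ethanol", "propane", "methoxybenzene", "butan-1-ol"]

stats = compare_name_sets(literature, patents)
print(stats)
```

The `only_a` and `only_b` counts show how much each dataset adds on its own, which is the interesting number when deciding whether merging two collections is worth the effort.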

Even more

Beyond these five data releases, there is more. PubChem and other databases have millions of names, but often these are generated by proprietary software. These IUPAC name collections may be under some license agreement, and thus not compatible with Open Science. This is why it is so important that we very clearly know where these names are coming from.

5-6 million, license unclear
I also learned about ChemPile, about which Adrian Mirza explained to me that it has about 5-6 million IUPAC names. But the source of this list of names is not yet clear to me.

Names from PhD theses and preprints
I also want to give a shout out to Peter Murray-Rust’s proposal to start extracting IUPAC names from PhD theses. There have been projects to extract chemistry from PhD theses in the past, and this will yield a lot of unique names. Please ping Peter if you want to get involved in his idea!

What’s next

I am so excited about all these efforts and very grateful for the contribution by Beilstein. I really hope more Open Science publishers will follow, like perhaps the Royal Society of Chemistry, for which it should be easy with their Project Prospect background!

I am also excited by the release by ChEMBL under CCZero. That will allow the WikiProject Chemistry to use this for Wikidata!

So, I have one week left to write the article about the work we started in March. The outlook is bright. I played last week with the Europe PMC full text downloads and can confirm that these should yield thousands of additional names from the full texts. A single download file gave me more than two thousand new unique names. I think 500k IUPAC names is absolutely within reach with purely the full texts from Europe PMC.

This brings us to the end of 2025. By then, we should have many millions of openly-licensed IUPAC names. And by March 2026, I hope we will have reached 1M IUPAC names extracted from primary literature. That will require some creativity and enthusiasm, but sounds feasible!

Beilstein journals contain Bioschemas

2025-02-13T00:00:00+00:00

Two weeks ago, the Beilstein Institute announced Bioschemas support in their journals:

We streamline the discoverability of your research by incorporating machine-readable chemical information into many of our published articles. This includes the conversion of chemical structures from submitted ChemDraw files to InChI strings and validating them using open-source tools.

The idea is far from new and has been around for two decades. But the two Beilstein journals (both diamond Open Access) have actually integrated it into their active publishing model. That has been trialed and put in action before. For example, there was (is?) Project Prospect (2007), chemical structure annotation in Nature Chemistry (2009), SMILES in the ACS Journal of Medicinal Chemistry (2014) (doi:10.1021/jm5002056), and FAIR chemical structures in the Journal of Cheminformatics (2021) (doi:10.1186/s13321-021-00520-4).

But this announcement is a new step. I like how validation of the chemical structures is part of the approach, and I like how they use the Bioschemas extension of schema.org. The latter because they use two Bioschemas types/profiles that I contributed to or initiated, respectively: MolecularEntity and ChemicalSubstance.

First stop for me is to check the schema.org annotation with a validation tool, like Google’s Rich Results Test. That gives an idea how they may have their search engine pick it up. The test article I was given on LinkedIn is Xiao et al.’s Molecular diversity of the reactions of MBH carbonates of isatins and various nucleophiles (doi:10.3762/bjoc.21.21) in the Beilstein Journal of Organic Chemistry, and we indeed see the schema.org annotation show up:

And because of the use of open standards, extracting the information is not so hard with, for example here, Bacting (doi:10.21105/joss.02558), based on a 2022 script from the NanoSafety Cluster projects NanoCommons and SbD4Nano:

@Grab(group='io.github.egonw.bacting', module='managers-rdf', version='1.0.4')
@Grab(group='io.github.egonw.bacting', module='managers-ui', version='1.0.4')
@Grab(group='io.github.egonw.bacting', module='net.bioclipse.managers.jsoup', version='1.0.4')

bioclipse = new net.bioclipse.managers.BioclipseManager(".");
rdf = new net.bioclipse.managers.RDFManager(".");
jsoup = new net.bioclipse.managers.JSoupManager(".");

articles = [
   args[0]
]

kg = rdf.createInMemoryStore()

for (article in articles) {
    htmlContent = bioclipse.download(article)

    htmlDom = jsoup.parseString(htmlContent)

    // application/ld+json

    bioschemasSections = jsoup.select(htmlDom, "script[type='application/ld+json']");

    for (section in bioschemasSections) {
        bioschemasJSON = section.html()
        rdf.importFromString(kg, bioschemasJSON, "JSON-LD")
    }
}

turtle = rdf.asTurtle(kg);

println "#" + rdf.size(kg) + " triples detected in the JSON-LD"
// println turtle


sparql = """
PREFIX schema: 
SELECT ?entity ?inchikey ?smiles WHERE {
  ?entity a schema:MolecularEntity .
  OPTIONAL { ?entity schema:inChIKey ?inchikey }
  OPTIONAL { ?entity schema:smiles ?smiles }
}
"""

results = rdf.sparql(kg, sparql)

for (i=1;i<=results.rowCount;i++) {
  println "${results.get(i, "inchikey")}\t${results.get(i, "smiles")}"
}

The output is a simple table:

MGAPJMNPGGTFHJ-JEIPZWNWSA-N     CN1C(=O)/C(=C/2\C3=CC(=CC=C3N(CC4=CC=CC=C4)C2=O)Cl)/C(=P(C5=CC=CC=C5)(C6=CC=CC=C6)C7=CC=CC=C7)C1=O
XEWMQVUVGAHESA-UHFFFAOYSA-N     CC1=CC=C(C=C1)NC2=C(C3C4=CC(=CC=C4N(CC5=CC=CC=C5)C3=O)C)C(=O)N(C)C2=O
UVTJORFYHPGJDZ-PYCFMQQDSA-N     CCCCN1C2=CC=C(C)C=C2/C(=C(\C#N)/CNC3=CC=C(C)C=C3)/C1=O
ILWGDUYVQRAMMG-PGMHBOJBSA-N     CCCCN1C2=CC=C(C)C=C2/C(=C(\C#N)/CNC3=CC=C(C=C3)Cl)/C1=O
CAFIBKBZWJFZCW-FXBPSFAMSA-N     CCCCN1C2=CC=C(C)C=C2/C(=C(\C#N)/CNC3=CC=CC=C3)/C1=O
UOJSFLANMVIMBV-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NC4=CC=C(C=C4)Cl)C1=O
VNJBTGZXAGHCSO-OAPYJULQSA-N     COC(=O)/C(=C\1/C2=C(C=CC=C2)N(CC3=CC=CC=C3)C1=O)/C=P(C4=CC=CC=C4)(C5=CC=CC=C5)C6=CC=CC=C6
KJXQRAKSOANQTJ-GFMRDNFCSA-N     CC1=CC=C(C=C1)NC/C(=C\2/C3=C(C=CC=C3)N(CC4=CC=CC=C4)C2=O)/C#N
IGEBJMZDOPBFGF-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NC4=CC=CC=C4)C1=O
SSANVPNESOMKOM-AWQADKOQSA-N     C1=CC=C(C=C1)CN2C3=CC=C(C=C3/C(=C(/C#N)\C=P(C4=CC=CC=C4)(C5=CC=CC=C5)C6=CC=CC=C6)/C2=O)Cl
GEHWHSHQSIOZKL-NVQSTNCTSA-N     CCCCN1C2=CC=C(C=C2/C(=C\3/C(=P(C4=CC=CC=C4)(C5=CC=CC=C5)C6=CC=CC=C6)C(=O)N(C)C3=O)/C1=O)Cl
PALRSQOHFLRWDH-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NC4=CC=C(C=C4)OC)C1=O
KBFODZMDSAFLFR-UHFFFAOYSA-N     CN1C(=O)C(=C(C1=O)NC2=CC(=CC=C2)Cl)C3C4=CC(=CC=C4N(CC5=CC=CC=C5)C3=O)Cl
JCGAVVZYXDJPBU-GFMRDNFCSA-N     CC1=C(C=CC=C1)NC/C(=C\2/C3=C(C=CC=C3)N(CC4=CC=CC=C4)C2=O)/C#N
DZFPCPDEQGLPLY-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NC4=CC=C(C)C=C4)C1=O
XMRNJCJUOXYXJU-DAFNUICNSA-N     CC1=CC=C(C=C1)NC/C(=C\2/C3=CC(=CC=C3N(CC4=CC=CC=C4)C2=O)C)/C#N
SSDSNBBHEUUKGI-UHFFFAOYSA-N     CC1=CC=C2C(=C1)C(C3=C(C(=O)N(C)C3=O)N(C)C4=CC=CC=C4)C(=O)N2CC5=CC=CC=C5
USFYPRDMNXMWPO-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NC4=CC=C(C=C4)Br)C1=O
XYHTWFULRHTEAG-MUGXBBEHSA-N     CCCCN1C2=CC=C(C)C=C2/C(=C(/C#N)\C=P(C3=CC=CC=C3)(C4=CC=CC=C4)C5=CC=CC=C5)/C1=O
XALDZIBHNNIVAM-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NC4=C(C=CC=C4)O)C1=O
TUTWQHBRQPMLME-OAPYJULQSA-N     COC(=O)/C(=C\1/C2=CC(=CC=C2N(CC3=CC=CC=C3)C1=O)Cl)/C=P(C4=CC=CC=C4)(C5=CC=CC=C5)C6=CC=CC=C6
IYEHFTMZZMIPRU-UHFFFAOYSA-N     CC1=CC=C(C=C1)NC2=C(C3C4=CC(=CC=C4N(CC5=CC=CC=C5)C3=O)Cl)C(=O)N(C)C2=O
KBSDGNPLIPXCEX-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NCC4=CC=CC=C4)C1=O
BQGIUMITIGHBSD-UHFFFAOYSA-N     CCCCNC1=C(C2C3=CC(=CC=C3N(CC4=CC=CC=C4)C2=O)C)C(=O)N(C)C1=O
PNSOLOPHIVUPOZ-MNDPQUGUSA-N     CCCCNC/C(=C\1/C2=CC(=CC=C2N(CCCC)C1=O)C)/C#N
HLTBKJRJOIZCMJ-PYCFMQQDSA-N     CCCCN1C2=CC=C(C)C=C2/C(=C(\C#N)/CN(C)C3=CC=CC=C3)/C1=O
FFLHFLUBMRBQTB-UHFFFAOYSA-N     CCCCN1C2=CC=C(C=C2C(C3=C(C(=O)N(C)C3=O)NC4=CC=C(C)C=C4)C1=O)F
FOQOVOLYYARWPA-NKFKGCMQSA-N     C1=CC=C(C=C1)CN2C3=C(C=CC=C3)/C(=C(\C#N)/CNC4=CC(=CC=C4)Cl)/C2=O
KLEPCAQFOXJLNV-UHFFFAOYSA-N     CC1=C(C=CC=C1)NC2=C(C3C4=CC(=CC=C4N(CC5=CC=CC=C5)C3=O)Cl)C(=O)N(C)C2=O

That also made me realize that there are no chemical names in the annotation. Those would be really useful to move things forward. Then again, PubChem will likely just generate the IUPAC name, since they have access to such software anyway. They have teamed up with PubChem, which will index it, but I will be interested in seeing how to use this for main subject annotation in Wikidata.

A final note for now: the model they use is to annotate the article with chemical substances (ChemicalSubstance) with (one or more?) molecular entities (MolecularEntity). That is a model that scales well to their other journal, the Beilstein Journal of Nanotechnology. But scraping that is for another post.
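For readers without a Groovy setup, the extraction step of the Bacting script above can be approximated with only the Python standard library. This is a minimal sketch under stated assumptions, not a port of the script: it treats the JSON-LD as plain JSON instead of loading it as RDF, and the sample HTML (including the nesting and the about and hasBioChemEntityPart keys) is made up for illustration and may not match what the journal actually emits. Only the inChIKey and smiles property names are taken from the SPARQL query above.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def find_entities(node, wanted="MolecularEntity"):
    """Recursively find JSON objects whose @type includes `wanted`."""
    found = []
    if isinstance(node, dict):
        types = node.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if wanted in types:
            found.append(node)
        for value in node.values():
            found.extend(find_entities(value, wanted))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_entities(item, wanted))
    return found


# Made-up sample page; the structure is an assumption, the compound is propane.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/",
 "@type": "ScholarlyArticle",
 "about": [{"@type": "ChemicalSubstance",
            "hasBioChemEntityPart": [{"@type": "MolecularEntity",
                                      "inChIKey": "ATUOYWHBWRKTHZ-UHFFFAOYSA-N",
                                      "smiles": "CCC"}]}]}
</script>
</head><body></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(SAMPLE_HTML)
entities = find_entities(extractor.documents)
for entity in entities:
    print(f"{entity.get('inChIKey')}\t{entity.get('smiles')}")
```

Because the walker searches the whole JSON tree for the type, it does not depend on where exactly the molecular entities are nested. The price is that it ignores the JSON-LD context, which the RDF-based approach handles properly.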

Bioclipse-Oscar4 - Text mining in Bioclipse

2011-09-27T00:00:00+00:00

Almost a year ago I started a position with Peter Murray-Rust to work on Oscar for three months (see this overview of results; a paper by the full Oscar team (Sam, David, Dan, Lezan) is pending, and I’m really happy to have been able to contribute bits to the project). Since then, I have had little time :( That’s how it goes, with post-hopping, unfortunately. One thing I did do after that, was write a Bioclipse plugin.

I was asked recently via LinkedIn if I was planning a Bioclipse-Oscar plugin, and I realized that I forgot to blog about it. So, here goes. The oscar manager I implemented follows the Oscar API, and these methods are available: extractText(), findNamedEntities(), findResolvedNamedEntities().

When I wrote the plugin, I also uploaded an example workflow to MyExperiment. The code is:

// Demo showing the Oscar text mining functionality
// in Bioclipse
var html = bioclipse.download(
  "http://dx.doi.org/10.3762/bjoc.6.133",
  "text/html"
)
var text = oscar.extractText(html);
// the next step may take some time, while
// initializing the Oscar software for the
// first time
var mols = oscar.findResolvedNamedEntities(text);
var file = "/Oscar Demo/extractedMols.sdf";
cdk.saveSDFile(file, mols);
ui.open(file);

The code will extract chemical entities, and open a molecules table in Bioclipse:

Status update on BJOC analysis with Oscar and ChemicalTagger #3

2010-12-23T00:00:00+00:00

The two earlier posts in this series showed screenshots of results of Oscar, but the title also promised results by Lezan’s ChemicalTagger. Sam helped with getting the HTML pages online via the Cambridge Hudson installation. Where Oscar finds named entities (chemical compounds, processes, etc), ChemicalTagger finds roles, like solvent, acid, base, catalyst. Roles are properties of chemical compounds in certain situations. Ethanol is not always a solvent, sometimes it is a Xmas present. The current output is not entirely where I want to go yet, but makes it easy to see which solvents are frequently found in the BJOC corpus:

This screenshot of an analysis of 15 BJOC papers shows that AcOEt (is that the same as EtOAc?) is mentioned as solvent three times in PMC1399459. Brine, however, is mentioned as solvent in three papers.

As said, these two pages contain RDF and the tables are sortable. Hudson recompiles them automatically when I update the source code to create the HTML+RDFa. So, go ahead, send me bug reports, feature requests, and patches!

Oscar4 command line utilities

2010-11-18T00:00:00+00:00

One goal of my three-month project is to take Oscar4 to the community. We want to get it used more, and we need a larger development community. Oscar4 and the related technologies do a good, sometimes excellent, job, but have to be maintained, just like any other piece of code. To make using it easier, we are developing new APIs, as well as two user-oriented applications: a Taverna 2 plugin, and command line utilities. The Oscar4 Java API has slightly evolved in the last three weeks, removing some complexity. In this post, I will introduce the command line utilities.

Oscar4

Most people will be mostly interested in the full Oscar4 program, to extract chemical entities. Oscar3 was also capable of extracting data (like NMR spectra), but that is not yet being ported. The OscarCLI program takes input, extracts chemicals, and where possible resolves them into connection tables (viz. InChI).

To extract chemicals from a line of text (e.g. “This is propane.”), you do:

$ java -cp oscar4-cli-4.0-SNAPSHOT.jar \
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI \
  This is propane.
propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3

For larger chunks of text it is easier to route it via stdin, for which we can use the -stdin option:

$ echo "This is propane." | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI \
  -stdin
propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3

That way, we can easily process large plain text files (output omitted):

$ cat largeFile.txt | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI \
  -stdin

If you prefer RDF output, for further integration, use the -output text/turtle option:

$ cat largeFile.txt | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI \
  -stdin -output text/turtle

This returns RDF using the CHEMINF ontology like:

@prefix dc:  .
@prefix rdfs:  .
@prefix ex:  .
@prefix cheminf:  .
@prefix sio: .

ex:entity0
  rdfs:subClassOf cheminf:CHEMINF_000000 ;
  dc:label "propane" ;
  cheminf:CHEMINF_000200 [
    a cheminf:CHEMINF_000113 ;
    sio:SIO_000300 "InChI=1/C3H8/c1-3-2/h3H2,1-2H3" .
  ] .

We can, however, also use Jericho to extract text from HTML pages, made available with the -html option, and pulling in a Beilstein Journal of Organic Chemistry paper with wget:

$ wget -qO- https://doi.org/10.3762/bjoc.6.122 | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI \
  -stdin -html

This will return 271 chemical entities recognized in the text, matching 48 unique chemical structures.
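The two numbers differ because the same structure can be mentioned many times in one paper. A minimal sketch of how such counts could be derived from the default text output is given below. It assumes one "name: InChI" pair per line, as in the propane example above. How unresolved entities are printed is not shown in this post, so lines without an InChI are simply skipped. The function name `parse_oscar_lines` and the sample lines are made up for illustration.

```python
def parse_oscar_lines(lines):
    """Split 'name: InChI=...' lines into (name, inchi) pairs."""
    pairs = []
    for line in lines:
        name, separator, rest = line.strip().partition(": InChI=")
        if separator:  # skip lines that do not carry an InChI
            pairs.append((name, "InChI=" + rest))
    return pairs


# Illustrative sample, not real output for any paper
sample_output = [
    "propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3",
    "ethanol: InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3",
    "propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3",
    "",
]

pairs = parse_oscar_lines(sample_output)
entity_count = len(pairs)
unique_structures = {inchi for _, inchi in pairs}

print(f"{entity_count} entities, {len(unique_structures)} unique structures")
```

Counting unique InChIs instead of unique names also merges synonyms that resolve to the same structure, which names alone would not.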