Open Science is happening. The merits are no longer theoretical or idealistic but tangible. Research is faster than ever, more vetted than ever (think PubPeer), more cited than ever. To be fair, that is not only because of Open Science, but open access leads to readership, which leads to impact, which leads to citations. When new people and organizations start adopting Open Science, this warms my heart.

So, when I was asked to work with Chemical Abstracts Service (CAS) on a new, bigger-than-ever version of Common Chemistry (which started as a project between CAS and Wikipedia), I welcomed the project. I don’t quite remember the first meetings, but roughly my task became to work with the new content and match it against Wikidata and Wikipedia. It aligned well with BridgeDb, Scholia, and our metabolomics research, so I could even find sufficient research time for it. This work is now published in JCIM: CAS Common Chemistry in 2021: Expanding Access to Trusted Chemical Information for the Scientific Community (doi:10.1021/acs.jcim.2c00268).


Figure 2 from the article. Detailed record for caffeine in CAS Common Chemistry (image: CC-BY).

About Wikidata, the paper writes (CC-BY):

The latest release of CAS Common Chemistry has also supported updates and corrections to CAS RNs in Wikidata and Wikipedia. (22) InChIKeys were calculated from CAS SMILES using Bacting 0.0.31 (23) with the Chemistry Development Kit 2.7.1 (24) and were matched with content in Wikidata. The CAS RNs were then compared. References to CAS Common Chemistry were added for CAS RNs that matched. Mismatches have been shared with the Wikidata and Wikipedia communities so that they can manually review and correct the misleading entries using CAS Common Chemistry as a reference. Because Wikidata also curates identifiers from other data sources, validated CAS RNs in Wikidata may also be used to cross-reference with other resources. Scripts are provided in the Supporting Information.

The alignment is a continuous process, as new chemical compounds get added to Wikidata on a weekly basis. The comparison of Common Chemistry with Wikidata and Wikipedia resulted in a wealth of curation data, e.g. inconsistent CAS numbers linked to InChIKeys, where Common Chemistry had a different match than Wikidata or Wikipedia.

CAS registry numbers were not added to Wikidata in this process, only confirmed or reported as different. The latter enabled manual curation by the community, which the community then did. Reports look like this. When an InChIKey-CAS RN combination in Wikidata was confirmed, it was recorded as a reference, like this:


Screenshot of Wikidata with two references, one reflecting a confirmation by the English Wikipedia (potentially the result of the original Common Chemistry project) and the second as the outcome of the now-published project.
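
To make the confirm-or-report step a bit more concrete, here is a minimal Python sketch of the comparison logic. It is not the code from the Supporting Information (the published workflow used Bacting with the Chemistry Development Kit); the function name `compare_cas_rns`, the dictionaries, and the deliberately wrong CAS RN `00-00-0` are made up for illustration. It also assumes exactly one CAS RN per InChIKey, which is a simplification.

```python
from typing import Dict, List, Tuple


def compare_cas_rns(
    common_chemistry: Dict[str, str],
    wikidata: Dict[str, str],
) -> Tuple[List[str], List[Tuple[str, str, str]]]:
    """Compare CAS RNs for the InChIKeys found in both sources.

    Returns the InChIKeys whose CAS RN is identical in both sources
    (candidates for adding a reference) and a list of mismatches as
    (inchikey, common_chemistry_rn, wikidata_rn) for manual review.
    """
    confirmed: List[str] = []
    mismatches: List[Tuple[str, str, str]] = []
    # Only InChIKeys present in both sources can be compared.
    for inchikey in sorted(common_chemistry.keys() & wikidata.keys()):
        cc_rn = common_chemistry[inchikey]
        wd_rn = wikidata[inchikey]
        if cc_rn == wd_rn:
            confirmed.append(inchikey)
        else:
            mismatches.append((inchikey, cc_rn, wd_rn))
    return confirmed, mismatches


# Toy input: InChIKey -> CAS RN (caffeine and water).
common_chemistry = {
    "RYYVLZVUVIJVGH-UHFFFAOYSA-N": "58-08-2",
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N": "7732-18-5",
}
# Hypothetical Wikidata content; "00-00-0" is an invented wrong value.
wikidata = {
    "RYYVLZVUVIJVGH-UHFFFAOYSA-N": "58-08-2",
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N": "00-00-0",
}

confirmed, mismatches = compare_cas_rns(common_chemistry, wikidata)
print("confirmed:", confirmed)
print("to review:", mismatches)
```

Note that nothing is written back automatically in this sketch: confirmed pairs only become candidates for a reference, and mismatches are listed for people to review, which mirrors the approach described above.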

Thanks to everyone on this project and Andrea Jacobs particularly for leading this open science project.