<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://chem-bla-ics.linkedchemistry.info/feed/by_tag/cas.xml" rel="self" type="application/atom+xml" /><link href="https://chem-bla-ics.linkedchemistry.info/" rel="alternate" type="text/html" /><updated>2026-04-19T09:50:36+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/feed/by_tag/cas.xml</id><title type="html">chem-bla-ics</title><subtitle>Chemblaics (pronounced chem-bla-ics) is the science that uses open science and computers to solve problems in chemistry, biochemistry and related fields.</subtitle><author><name>Egon Willighagen</name></author><entry><title type="html">new: “CAS Common Chemistry in 2021: Expanding Access to Trusted Chemical Information for the Scientific Community”</title><link href="https://chem-bla-ics.linkedchemistry.info/2022/05/22/new-cas-common-chemistry-in-2021.html" rel="alternate" type="text/html" title="new: “CAS Common Chemistry in 2021: Expanding Access to Trusted Chemical Information for the Scientific Community”" /><published>2022-05-22T00:00:00+00:00</published><updated>2022-05-22T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2022/05/22/new-cas-common-chemistry-in-2021</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2022/05/22/new-cas-common-chemistry-in-2021.html"><![CDATA[<p>Open Science is happening. The merits are no longer theoretical or idealistic but tangible. Research is faster than ever, more vetted than ever (think PubPeer),
more cited than ever. To be fair, that is not just because of Open Science, but open access leads to readership, which leads to impact, which leads to citations. When new people and
organizations start adopting Open Science, this warms my heart.</p>

<p>So, when I was asked to work with <a href="https://www.cas.org/">Chemical Abstracts Service</a> (CAS) on a new, bigger than ever version of
<a href="https://commonchemistry.cas.org/">Common Chemistry</a> (which started as a project between CAS and Wikipedia), I welcomed the project. I don’t quite
remember the first meetings, but roughly my task became to work with the new content and match this against Wikidata and Wikipedia. It aligned well
with <a href="https://bridgedb.github.io/">BridgeDb</a>, <a href="https://scholia.toolforge.org/">Scholia</a>, and our metabolomics research, so I could even find
sufficient research time for it. This work is now published in <a href="https://pubs.acs.org/journal/jcisd8">JCIM</a>:
<em>CAS Common Chemistry in 2021: Expanding Access to Trusted Chemical Information for the Scientific Community</em>
(doi:<a href="https://doi.org/10.1021/acs.jcim.2c00268">10.1021/acs.jcim.2c00268</a>).</p>

<p><img src="/assets/images/images_medium_ci2c00268_0003.png" alt="" /> <br />
<em>Figure 2 from the article. Detailed record for caffeine in CAS Common Chemistry (image: CC-BY).</em></p>

<p>About Wikidata, the paper writes (CC-BY):</p>

<blockquote>
  <p>The latest release of CAS Common Chemistry has also supported updates and corrections to CAS RNs in Wikidata and Wikipedia. (22)
InChIKeys were calculated from CAS SMILES using Bacting 0.0.31 (23) with the Chemistry Development Kit 2.7.1 (24) and were
matched with content in Wikidata. The CAS RNs were then compared. References to CAS Common Chemistry were added for CAS RNs
that matched. Mismatches have been shared with the Wikidata and Wikipedia communities so that they can manually review and
correct the misleading entries using CAS Common Chemistry as a reference. Because Wikidata also curates identifiers from
other data sources, validated CAS RNs in Wikidata may also be used to cross-reference with other resources. Scripts are
provided in the Supporting Information.</p>
</blockquote>

<p>The alignment is a continuous process, as new chemical compounds get added to Wikidata on a weekly basis. The comparison of
Common Chemistry with Wikidata and Wikipedia resulted in a wealth of curation data, e.g. inconsistent CAS numbers linked to
InChIKeys, where Common Chemistry had a different match than Wikidata or Wikipedia.</p>
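
<p>As a rough illustration of the Wikidata side of such a comparison (a sketch, not one of the scripts from the paper’s Supporting Information), the query below lists each item’s InChIKey next to its CAS registry number. It assumes the current Wikidata Query Service and the “direct” properties P235 (InChIKey) and P231 (CAS registry number):</p>

<div class="language-sparql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PREFIX wdt: &lt;http://www.wikidata.org/prop/direct/&gt;
# Sketch: InChIKey / CAS RN pairs as currently recorded in Wikidata
SELECT ?compound ?inchikey ?cas WHERE {
  ?compound wdt:P235 ?inchikey ;
            wdt:P231 ?cas .
}
</code></pre></div></div>

<p>Each resulting row can then be checked against the InChIKey-CAS RN pair derived from Common Chemistry: identical pairs count as confirmations, differing CAS RNs for the same InChIKey as candidates for manual review.</p>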

<p>CAS registry numbers were not added to Wikidata in this process, only confirmed or reported as different. The latter
allowed manual curation by the community, which the community indeed did. Reports <a href="https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Chemistry/CAS_Validation_Results">look like this</a>.
When an InChIKey-CAS RN combination in Wikidata was confirmed, it was recorded as a reference, like this:</p>

<p><img src="/assets/images/Screenshot_20220522_084233.png" alt="" /> <br />
<em>Screenshot of Wikidata with two references, one reflecting a confirmation
by the English Wikipedia (potentially the result of the original Common Chemistry
project) and the second as outcome of the now published project.</em></p>

<p>Thanks to everyone on this project and <a href="https://orcid.org/0000-0001-9316-9400">Andrea Jacobs</a>
particularly for leading this open science project.</p>]]></content><author><name>Egon Willighagen</name></author><category term="curation" /><category term="chemistry" /><category term="cas" /><category term="doi:10.1021/ACS.JCIM.2C00268" /><category term="bioclipse" /><category term="cdk" /><category term="bridgedb" /><summary type="html"><![CDATA[Open Science is happening. The merits are no longer theoretical or idealistic but tangible. Research is faster than ever, more vetted than ever (think PubPeer), more cited than ever. Fairly, not just because of Open Science, but open access causes readership causes impact causes citations. When new people and organizations start adopting Open Science this warms my hearth.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/images_medium_ci2c00268_0003.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/images_medium_ci2c00268_0003.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Compound (class) identifiers in Wikidata</title><link href="https://chem-bla-ics.linkedchemistry.info/2018/08/18/compound-class-identifiers-in-wikidata.html" rel="alternate" type="text/html" title="Compound (class) identifiers in Wikidata" /><published>2018-08-18T00:00:00+00:00</published><updated>2018-08-18T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2018/08/18/compound-class-identifiers-in-wikidata</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2018/08/18/compound-class-identifiers-in-wikidata.html"><![CDATA[<p><span style="width: 40%; display: block; margin-left: auto; margin-right: auto; float: right">
<img src="/assets/images/extid-wikidata-histogram.png" /> <br />
<a href="https://edu.nl/h6kg3">Bar chart</a> showing the number of compounds with a particular chemical identifier.
</span>
I think <a href="http://wikidata.org/">Wikidata</a> is a groundbreaking project, which will have a major impact on science. One of the
reasons is the open license (CCZero), the very basic approach (<a href="http://wikiba.se/">Wikibase</a>), and the superb community around
it. For example, setting up your own Wikibase including a cool SPARQL endpoint, is
<a href="https://github.com/wmde/wikibase-docker">easily done with Docker</a>.</p>

<p>Wikidata has many sub projects, such as <a href="http://wikicite.org/">WikiCite</a>, which captures the collective of primary literature.
Another one is the <a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry">WikiProject Chemistry</a>. The two nicely match
up, I think, making a public database linking chemicals to literature (though much still needs to be done here), see my recent
ICCS 2018 poster (doi:<a href="https://doi.org/10.6084/m9.figshare.6356027.v1">10.6084/m9.figshare.6356027.v1</a>, paper pending).</p>

<p>But Wikidata is also a great resource for identifier mappings between chemical databases, something we need for
<a href="https://chem-bla-ics.blogspot.com/2017/11/new-paper-wikipathways-multifaceted.html">our metabolism pathway research</a>.
The mappings, as you may know, are <a href="https://chem-bla-ics.blogspot.com/2016/09/metabolite-identifier-mapping-databases.html">used in the latter</a>
via <a href="https://www.bridgedb.org/">BridgeDb</a> and we have been using Wikidata as one of three sources for some time now (the others being
<a href="http://www.hmdb.ca/">HMDB</a> and <a href="https://www.ebi.ac.uk/chebi/">ChEBI</a>). WikiProject Chemistry has a related
<a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry/ChemID">ChemID</a> effort, and while the wiki page does not show
much recent activity, there is actually a lot of ongoing effort (see <a href="https://edu.nl/h6kg3">plot</a>).
And I’ve been <a href="https://chem-bla-ics.blogspot.com/2018/07/lipid-map-identifiers-and.html">adding my bits</a>.</p>

<h2 id="limitations-of-the-links">Limitations of the links</h2>
<p>But not every identifier in Wikidata has the same meaning. While they are all classified as ‘external-id’, the actual link may
have a different meaning. This, of course, is the essence of scientific lenses, see <a href="https://chem-bla-ics.blogspot.com/2013/05/linking-wikipathways-to-binding.html">this post</a>
and the papers cited therein. One reason here is the difference in what entries in the various databases mean.</p>

<p>Wikidata has an extensive model, defined by the aforementioned WikiProject Chemistry. For example, it has different concepts
for chemical compounds (in fact, the hierarchy is pretty rich) and compound classes. And these are differently modeled. Furthermore,
it has a model that formalizes that things with a different InChI are different, but even allows things with the same InChI to be
different, if need arises. It tries to accurately and precisely capture the certainty and uncertainty of the chemistry. As such,
it is a powerful system to handle identifier mappings, because databases are not clear, and the chemistry and biology in the data are
even less so: we experimentally measure a characterization of chemicals, but what we put in databases and give names to are specific
models (often chemical graphs).</p>

<p>That model differs from what other (chemical) databases use, or seem to use, because databases do not always indicate what they
actually have in a record. But I think this is a fair guess.</p>

<h2 id="chebi">ChEBI</h2>
<p>ChEBI (and the matching <a href="https://www.wikidata.org/wiki/Property:P683">ChEBI ID</a>) has entries for chemical classes (e.g.
<a href="https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:35366">fatty acid</a>) and specific compounds (e.g.
<a href="https://www.ebi.ac.uk/chebi/searchId.do?chebiId=30089">acetate</a>).</p>

<h2 id="pubchem-chemspider-unichem">PubChem, ChemSpider, UniChem</h2>
<p>These three resources use the InChI as central asset. While they do not really have the concept of compound classes so much
(though increasingly they have classifications), they do have entries where stereochemistry is undefined or unknown. Each
one has their own way to link to other databases themselves, which normally includes tons of structure normalization (see
e.g. doi:<a href="https://doi.org/10.1186/s13321-018-0293-8">10.1186/s13321-018-0293-8</a> and
doi:<a href="https://doi.org/10.1186/s13321-015-0072-8">10.1186/s13321-015-0072-8</a>).</p>

<h2 id="hmdb">HMDB</h2>
<p>HMDB (and the matching <a href="https://www.wikidata.org/wiki/Property:P2057">P2057</a>) has a biological perspective; the entries
reflect the biology of a chemical. Therefore, for most compounds, they focus on the neutral forms of compounds. This makes
linking to/from other databases where the compound is not neutral chemically less precise.</p>

<h2 id="cas-registry-numbers">CAS registry numbers</h2>
<p>CAS (and the matching <a href="https://www.wikidata.org/wiki/Property:P231">P231</a>) is pretty unique itself, and has identifiers
for substances (see <a href="https://www.wikidata.org/wiki/Q79529">Q79529</a>), which covers much more than chemical compounds, and comes with
its own set of unique features. For example, a solution of a compound has, by design, the same identifier as the compound itself. Previously,
formaldehyde and formalin had different Wikipedia/Wikidata pages, both with the same CAS registry number.</p>
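
<p>Such shared numbers can be looked up with a query along the following lines. This is a minimal sketch that assumes the Wikidata Query Service; it only counts items per CAS registry number and cannot tell whether the sharing is intended:</p>

<div class="language-sparql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PREFIX wdt: &lt;http://www.wikidata.org/prop/direct/&gt;
# Sketch: CAS registry numbers that are used by more than one Wikidata item
SELECT ?cas (COUNT(DISTINCT ?compound) AS ?items) WHERE {
  ?compound wdt:P231 ?cas .
}
GROUP BY ?cas
HAVING (COUNT(DISTINCT ?compound) &gt; 1)
ORDER BY DESC(?items)
</code></pre></div></div>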

<h2 id="limitations-of-the-links-2">Limitations of the links #2</h2>
<p>Now, returning to our starting point: limitations in linking databases. If we want FAIR mappings, we need to be as precise
as possible. Of course, that may mean we need more steps. We can always simplify at will, but we can never have a
computer make the links more complex (well, not without making assumptions, etc).</p>

<p>And that is why Wikidata is so suitable to link all these chemical databases: it can distinguish differences when needed,
and make that explicit. It makes mappings between the databases more <a href="https://www.nature.com/articles/sdata201618">FAIR</a>.</p>]]></content><author><name>Egon Willighagen</name></author><category term="wikidata" /><category term="scholia" /><category term="chemistry" /><category term="bridgedb" /><category term="cas" /><category term="chebi" /><category term="chemspider" /><category term="fair" /><category term="hmdb" /><category term="pubchem" /><category term="rdf" /><category term="wikicite" /><category term="justdoi:10.6084/m9.figshare.6356027.v1" /><category term="justdoi:10.1186/s13321-018-0293-8" /><category term="justdoi:10.1186/s13321-015-0072-8" /><category term="justdoi:10.1038/sdata.2016.18" /><summary type="html"><![CDATA[Bar chart showing the number of compounds with a particular chemical identifier. I think Wikidata is a groundbreaking project, which will have a major impact on science. One of the reasons is the open license (CCZero), the very basic approach (Wikibase), and the superb community around it. For example, setting up your own Wikibase including a cool SPARQL endpoint, is easily done with Docker.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/extid-wikidata-histogram.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/extid-wikidata-histogram.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">New Edition! Getting CAS registry numbers out of WikiData</title><link href="https://chem-bla-ics.linkedchemistry.info/2015/12/22/new-edition-getting-cas-registry.html" rel="alternate" type="text/html" title="New Edition! 
Getting CAS registry numbers out of WikiData" /><published>2015-12-22T00:00:00+00:00</published><updated>2015-12-22T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2015/12/22/new-edition-getting-cas-registry</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2015/12/22/new-edition-getting-cas-registry.html"><![CDATA[<p><span style="width: 30%; display: block; margin-left: auto; margin-right: auto; float: right">
<img src="/assets/images/aceticAcidCAS.png" /> <br />
Source: Wikipedia. <a href="https://en.wikipedia.org/wiki/File:Acetic_acid.jpg">CC-BY-SA</a>
</span>
April this year <a href="https://chem-bla-ics.linkedchemistry.info/2015/04/10/getting-cas-registry-numbers-out-of.html">I blogged about an important SPARQL query <i class="fa-solid fa-recycle fa-xs"></i></a>
for many chemists: getting CAS registry numbers from Wikidata. This is relevant for two reasons:</p>

<ol>
  <li><a href="http://commonchemistry.org/">CAS works together with Wikimedia</a> on a large, free CAS-to-structure database</li>
  <li><a href="http://wikidata.org/">Wikidata</a> is <a href="https://creativecommons.org/choose/zero/">CCZero</a></li>
</ol>

<p>The original effort validated about eight thousand registry numbers, made available via Wikipedia and the
<a href="http://commonchemistry.org/">Common Chemistry</a> website. However, the effort did not stop there, and Wikipedia
now contains many more CAS registry numbers. In fact, Wikidata picked up many of these and now lists almost
twenty thousand CAS numbers. That well exceeds what databases are allowed to aggregate and make available.</p>

<p>Since the post in April, Wikidata put online a <a href="https://query.wikidata.org/">new SPARQL endpoint</a> and
created “direct” property links. This way, you lose the provenance information, but the query becomes simpler:</p>

<div class="language-sparql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">PREFIX</span><span class="w"> </span><span class="nn">wdt</span><span class="o">:</span><span class="w"> </span><span class="nn">&lt;http://www.wikidata.org/prop/direct/&gt;</span><span class="w">
</span><span class="k">SELECT</span><span class="w"> </span><span class="nv">?compound</span><span class="w"> </span><span class="nv">?id</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nv">?compound</span><span class="w"> </span><span class="nn">wdt</span><span class="o">:</span><span class="ss">P231</span><span class="w"> </span><span class="nv">?id</span><span class="w"> </span><span class="p">.</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
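
<p>If the provenance is needed after all, the statement node can be queried instead of the direct property. The sketch below assumes the <code>p:</code>, <code>ps:</code>, and <code>prov:</code> prefixes of the present-day Wikidata RDF model, which are not part of the original post:</p>

<div class="language-sparql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PREFIX p: &lt;http://www.wikidata.org/prop/&gt;
PREFIX ps: &lt;http://www.wikidata.org/prop/statement/&gt;
PREFIX prov: &lt;http://www.w3.org/ns/prov#&gt;
# Sketch: keep the statement node so that its references stay reachable
SELECT ?compound ?id ?reference WHERE {
  ?compound p:P231 ?statement .
  ?statement ps:P231 ?id .
  OPTIONAL { ?statement prov:wasDerivedFrom ?reference . }
}
</code></pre></div></div>

<p>The price is a longer query and one extra row for every reference attached to a statement.</p>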

<p>The other thing that changed since April is that others and I requested the creation of more compound identifiers,
and here’s an overview along with the current number of such identifiers in Wikidata:</p>

<ul>
  <li>CAS registry number (<a href="https://www.wikidata.org/wiki/Property:P231">P231</a>): <a href="https://query.wikidata.org/#PREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT%20(count(%3Fid)%20as%20%3Fcount)%20WHERE%20%7B%0A%20%20%3Fcompound%20wdt%3AP231%20%3Fid%20.%0A%7D%0A">19420</a></li>
  <li>PubChem ID (CID) (<a href="https://www.wikidata.org/wiki/Property:P662">P662</a>): <a href="https://query.wikidata.org/#PREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT%20%28count%28%3Fid%29%20as%20%3Fcount%29%20WHERE%20%7B%0A%20%20%3Fcompound%20wdt%3AP662%20%3Fid%20.%0A%7D%0A">16616</a></li>
  <li>InChI (<a href="https://www.wikidata.org/wiki/Property:P234">P234</a>): <a href="https://query.wikidata.org/#PREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT%20%28count%28%3Fid%29%20as%20%3Fcount%29%20WHERE%20%7B%0A%20%20%3Fcompound%20wdt%3AP234%20%3Fid%20.%0A%7D%0A">14312</a></li>
  <li>ChemSpider ID (<a href="https://www.wikidata.org/wiki/Property:P661">P661</a>): <a href="https://query.wikidata.org/#PREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT%20%28count%28%3Fid%29%20as%20%3Fcount%29%20WHERE%20%7B%0A%20%20%3Fcompound%20wdt%3AP661%20%3Fid%20.%0A%7D%0A">11566</a></li>
  <li>ChEBI ID (<a href="https://www.wikidata.org/wiki/Property:P683">P683</a>): <a href="https://query.wikidata.org/#PREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT%20%28count%28%3Fid%29%20as%20%3Fcount%29%20WHERE%20%7B%0A%20%20%3Fcompound%20wdt%3AP683%20%3Fid%20.%0A%7D%0A">4313</a></li>
  <li>KEGG ID (<a href="https://www.wikidata.org/wiki/Property:P665">P665</a>): <a href="https://query.wikidata.org/#PREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT%20%28count%28%3Fid%29%20as%20%3Fcount%29%20WHERE%20%7B%0A%20%20%3Fcompound%20wdt%3AP665%20%3Fid%20.%0A%7D%0A">3983</a></li>
  <li>Drugbank ID (<a href="https://www.wikidata.org/wiki/Property:P715">P715</a>): <a href="https://query.wikidata.org/#PREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT%20%28count%28%3Fid%29%20as%20%3Fcount%29%20WHERE%20%7B%0A%20%20%3Fcompound%20wdt%3AP715%20%3Fid%20.%0A%7D%0A">2518</a></li>
  <li>KNApSAcK ID (<a href="https://www.wikidata.org/wiki/Property:P2064">P2064</a>): <a href="https://query.wikidata.org/#PREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT%20%28count%28%3Fid%29%20as%20%3Fcount%29%20WHERE%20%7B%0A%20%20%3Fcompound%20wdt%3AP2064%20%3Fid%20.%0A%7D%0A">9</a></li>
  <li>HMDB ID (<a href="https://www.wikidata.org/wiki/Property:P2057">P2057</a>): <a href="https://query.wikidata.org/#PREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT%20%28count%28%3Fid%29%20as%20%3Fcount%29%20WHERE%20%7B%0A%20%20%3Fcompound%20wdt%3AP2057%20%3Fid%20.%0A%7D%0A">6</a></li>
  <li>ZINC ID (<a href="https://www.wikidata.org/wiki/Property:P2084">P2084</a>): <a href="https://query.wikidata.org/#PREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT%20%28count%28%3Fid%29%20as%20%3Fcount%29%20WHERE%20%7B%0A%20%20%3Fcompound%20wdt%3AP2084%20%3Fid%20.%0A%7D%0A">4</a></li>
  <li>LIPID MAPS ID (<a href="https://www.wikidata.org/wiki/Property:P2063">P2063</a>): <a href="https://query.wikidata.org/#PREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT%20(count(%3Fid)%20as%20%3Fcount)%20WHERE%20%7B%0A%20%20%3Fcompound%20wdt%3AP2063%20%3Fid%20.%0A%7D%0A">3</a></li>
  <li>Leadscope ID (<a href="https://www.wikidata.org/wiki/Property:P2083">P2083</a>): <a href="https://query.wikidata.org/#PREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT%20(count(%3Fid)%20as%20%3Fcount)%20WHERE%20%7B%0A%20%20%3Fcompound%20wdt%3AP2083%20%3Fid%20.%0A%7D%0A">3</a></li>
</ul>
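
<p>Each number in this list links to a count query. Decoded from the link URL, the one for the CAS registry number reads as follows; the other links differ only in the property number:</p>

<div class="language-sparql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PREFIX wdt: &lt;http://www.wikidata.org/prop/direct/&gt;
SELECT (count(?id) as ?count) WHERE {
  ?compound wdt:P231 ?id .
}
</code></pre></div></div>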

<p>Clearly, some identifiers are not well populated yet. This is what bots are for, like
<a href="https://bitbucket.org/sulab/wikidatabots/overview">those used by the Andrew Su team</a>.</p>

<p>Because there is also a predicate for SMILES, we can also create a query that puts the CAS registry
number alongside the SMILES (or any other identifier):</p>

<div class="language-sparql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">PREFIX</span><span class="w"> </span><span class="nn">wdt</span><span class="o">:</span><span class="w"> </span><span class="nn">&lt;http://www.wikidata.org/prop/direct/&gt;</span><span class="w">
</span><span class="k">SELECT</span><span class="w"> </span><span class="nv">?compound</span><span class="w"> </span><span class="nv">?id</span><span class="w"> </span><span class="nv">?smiles</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nv">?compound</span><span class="w"> </span><span class="nn">wdt</span><span class="o">:</span><span class="ss">P231</span><span class="w"> </span><span class="nv">?id</span><span class="w"> </span><span class="p">;</span><span class="w">
            </span><span class="nn">wdt</span><span class="o">:</span><span class="ss">P233</span><span class="w"> </span><span class="nv">?smiles</span><span class="w"> </span><span class="p">.</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Of course, then the question is, <a href="https://chem-bla-ics.blogspot.nl/2015/10/how-to-test-smiles-strings-in.html">are these SMILES strings valid</a>…
And, importantly, this is nothing compared to the number of chemical compounds we know about, which currently is on
the order of 100 million, of which a quarter can be readily purchased:</p>

<p><a href="https://twitter.com/chem4biology/status/679314144680513536"><img src="/assets/images/twitter_chem4biology_679314144680513536.png" alt="" /></a></p>

<p><a href="https://twitter.com/chem4biology/status/677593362640142336"><img src="/assets/images/twitter_chem4biology_677593362640142336.png" alt="" /></a></p>]]></content><author><name>Egon Willighagen</name></author><category term="cas" /><category term="wikidata" /><category term="chemistry" /><category term="justdoi:10.15200/winn.142867.72538" /><summary type="html"><![CDATA[Source: Wikipedia. CC-BY-SA April this year I blogged about an important SPARQL query for many chemists: getting CAS registry numbers from Wikidata. This is relevant for two reasons:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/aceticAcidCAS.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/aceticAcidCAS.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Getting CAS registry numbers out of WikiData</title><link href="https://chem-bla-ics.linkedchemistry.info/2015/04/10/getting-cas-registry-numbers-out-of.html" rel="alternate" type="text/html" title="Getting CAS registry numbers out of WikiData" /><published>2015-04-10T00:00:00+00:00</published><updated>2015-04-10T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2015/04/10/getting-cas-registry-numbers-out-of</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2015/04/10/getting-cas-registry-numbers-out-of.html"><![CDATA[<p>I have promised my Twitter followers the <a href="https://www.wikidata.org/wiki/Q54871">SPARQL query</a> you have all been waiting
for. Sadly, you had to wait for it for more than two months. I’m sorry about that. But, here it is:</p>

<div class="language-sparql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">PREFIX</span><span class="w"> </span><span class="nn">wd</span><span class="o">:</span><span class="w"> </span><span class="nn">&lt;http://www.wikidata.org/entity/&gt;</span><span class="w">

</span><span class="k">SELECT</span><span class="w"> </span><span class="nv">?compound</span><span class="w"> </span><span class="nv">?id</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nv">?compound</span><span class="w"> </span><span class="nn">wd</span><span class="o">:</span><span class="ss">P231s</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="nn">wd</span><span class="o">:</span><span class="ss">P231v</span><span class="w"> </span><span class="nv">?id</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">.</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>What this query does is ask for all things (let’s call whatever is behind the identifier a “compound”; of course, it can
be mixtures, ill-defined chemicals, nanomaterials, etc) that have a CAS registry identifier. This query results in a nice
table of <a href="https://www.wikidata.org/">Wikidata</a> identifiers (e.g. <a href="https://www.wikidata.org/wiki/Q47512">Q47512</a> is acetic acid)
and matching CAS numbers, 16298 of them.</p>

<p>Because Wikidata is not specific to the English Wikipedia, CAS numbers from other origins will show up too. For example, the
CAS number for N-benzylacrylamide (<a href="https://www.wikidata.org/wiki/Q10334928">Q10334928</a>) is provided by the Portuguese Wikipedia:</p>

<p><img src="/assets/images/casPT.png" alt="" /></p>

<p>I used Peter Ertl’s <a href="http://www.cheminfo.org/wikipedia">cheminfo.org</a> (doi:<a href="https://doi.org/10.1186/s13321-015-0061-y">10.1186/s13321-015-0061-y</a>)
to confirm this compound indeed does not have an English page, which is somewhat surprising.</p>

<p>The SPARQL query uses a predicate specifically for the CAS registry number (<a href="https://www.wikidata.org/wiki/Property:P231">P231</a>).
Other identifiers have similar predicates, like for PubChem compound (<a href="https://www.wikidata.org/wiki/Property:P662">P662</a>) and
ChemSpider (<a href="https://www.wikidata.org/wiki/Property:P661">P661</a>). That means Wikidata can become a community-crowdsourced resource of
identifier mappings, which is one of the things Daniel Mietchen, I, and a few others proposed in this H2020 grant application
(doi:<a href="https://doi.org/10.5281/zenodo.13906">10.5281/zenodo.13906</a>). The SPARQL query is run by the
<a href="http://linkeddatafragments.org/">Linked Data Fragments</a> platform, which you should really check out too, using the
<a href="http://www.bioclipse.net/">Bioclipse</a> manager I wrote around that.</p>

<p>The full Bioclipse script looks like:</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wikidataldf</span> <span class="o">=</span> <span class="n">ldf</span><span class="o">.</span><span class="na">createStore</span><span class="o">(</span>
  <span class="s2">"http://data.wikidataldf.com/wikidata"</span>
<span class="o">)</span>

<span class="c1">// P231 CAS</span>
<span class="n">identifier</span> <span class="o">=</span> <span class="s2">"P231"</span>
<span class="n">type</span> <span class="o">=</span> <span class="s2">"cas"</span>

<span class="n">sparql</span> <span class="o">=</span> <span class="s2">"""
PREFIX wd: &lt;http://www.wikidata.org/entity/&gt;

SELECT ?compound ?id WHERE {
  ?compound wd:${identifier}s [ wd:${identifier}v ?id ] .
}
"""</span>
<span class="n">mappings</span> <span class="o">=</span> <span class="n">rdf</span><span class="o">.</span><span class="na">sparql</span><span class="o">(</span><span class="n">wikidataldf</span><span class="o">,</span> <span class="n">sparql</span><span class="o">)</span>

<span class="c1">// recreate an empty output file</span>
<span class="n">outFilename</span> <span class="o">=</span> <span class="s2">"/Wikidata/${type}2wikidata.csv"</span>
<span class="k">if</span> <span class="o">(</span><span class="n">ui</span><span class="o">.</span><span class="na">fileExists</span><span class="o">(</span><span class="n">outFilename</span><span class="o">))</span> <span class="o">{</span>
  <span class="n">ui</span><span class="o">.</span><span class="na">remove</span><span class="o">(</span><span class="n">outFilename</span><span class="o">)</span>
<span class="o">}</span>
<span class="n">ui</span><span class="o">.</span><span class="na">newFile</span><span class="o">(</span><span class="n">outFilename</span><span class="o">)</span>

<span class="c1">// save to a file</span>
<span class="k">for</span> <span class="o">(</span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="o">;</span> <span class="n">i</span><span class="o">&lt;=</span><span class="n">mappings</span><span class="o">.</span><span class="na">rowCount</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
  <span class="n">wdID</span> <span class="o">=</span> <span class="n">mappings</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="s2">"compound"</span><span class="o">).</span><span class="na">substring</span><span class="o">(</span><span class="mi">3</span><span class="o">)</span>
  <span class="n">ui</span><span class="o">.</span><span class="na">append</span><span class="o">(</span>
    <span class="n">outFilename</span><span class="o">,</span>
    <span class="n">wdID</span> <span class="o">+</span> <span class="s2">","</span> <span class="o">+</span> <span class="n">mappings</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="s2">"id"</span><span class="o">)</span> <span class="o">+</span> <span class="s2">"\n"</span>
  <span class="o">)</span>
<span class="o">}</span>
</code></pre></div></div>

<p>BTW, of course, all this depends on work by many others including the core <a href="http://tools.wmflabs.org/wikidata-exports/rdf/">RDF generation</a>
with the <a href="https://www.mediawiki.org/wiki/Wikidata_Toolkit">Wikidata Toolkit</a>. See also the paper by Erxleben <em>et al.</em>
(<a href="http://korrekt.org/papers/Wikidata-RDF-export-2014.pdf">PDF</a>).</p>]]></content><author><name>Egon Willighagen</name></author><category term="wikidata" /><category term="chemistry" /><category term="cas" /><category term="justdoi:10.1186/s13321-015-0061-y" /><category term="doi:10.5281/ZENODO.13906" /><category term="justdoi:10.1007/978-3-319-11964-9_4" /><category term="ldf" /><summary type="html"><![CDATA[I have promised my Twitter followers the SPARQL query you have all been waiting for. Sadly, you had to wait for it for more than two months. I’m sorry about that. But, here it is:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/casPT.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/casPT.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">The Chemical Object Identifier; or, the freedom to identify chemicals</title><link href="https://chem-bla-ics.linkedchemistry.info/2008/03/09/chemical-object-identifier-or-freedom.html" rel="alternate" type="text/html" title="The Chemical Object Identifier; or, the freedom to identify chemicals" /><published>2008-03-09T00:00:00+00:00</published><updated>2008-03-09T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2008/03/09/chemical-object-identifier-or-freedom</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2008/03/09/chemical-object-identifier-or-freedom.html"><![CDATA[<p>IUPAC chemical names, <a href="http://opensmiles.org/">SMILES</a> and InChIs are too long. <a href="http://en.wikipedia.org/wiki/International_Chemical_Identifier#InChIKey">InChIKeys</a>
are not unique enough because of safety reasons (<em>you have a 1 in 10 billion chance of blowing up your building</em>; well, odds are actually much, much lower than
getting hit by Osama or friends, let alone a car). Wikipedia URIs do not cover enough chemical space.</p>

<p>However, we need a short identifier. Why, actually? Computers don’t care about long identifiers. Systems can be integrated. A web link is easy to make. But we do.
A bottle on the shelf does not have an HTML interface. And you do not have a scanner to read the chemical structure from a 2D barcode (see
DOI:<a href="https://doi.org/10.1021/ci049758i">10.1021/ci049758i</a>).</p>

<p>The <a href="http://en.wikipedia.org/wiki/CAS_registry_number">CAS registry number</a> has served this purpose for a long time. For example, as used on bottles visible
in this picture (copyright: <a href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA</a>, <a href="http://blog.openwetware.org/scienceintheopen/">Science in the Open</a>):</p>

<p><img src="/assets/images/cas-number.png" alt="" /></p>

<p>Now, when <a href="http://www.chemspider.com/blog/cas-discourages-using-scifinder-to-help-curate-wikipedia-structures-and-cas-numbers.html">Anthony reported</a> that CAS,
the organization that builds the proprietary lookup service and has done an amazing job in the past, does not wish to see CAS numbers in Wikipedia
curated by means of the official database - it violates the <em>end user agreement</em> one has to sign before one can use the database - the blogging community
reacted (<a href="http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=997">here</a>,
<a href="http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=1000">here</a>,
<a href="http://www.chemconnector.com/chemunicating/the-curation-of-almost-5000-structures-on-wikipedia.html">here</a>,
<a href="https://doi.org/10.63485/54wv5-hs388">here <i class="fa-solid fa-recycle fa-xs"></i></a> and
<a href="http://miningdrugs.blogspot.com/2008/03/cas-numbers-are-not-public-domain-are.html">here</a>).</p>

<p>Personally, I agree with the CAS standpoint. It’s been a proprietary database which people have been supporting financially for years, and for which they thoughtfully signed
the license agreement. So, don’t complain afterwards. If you <em>really</em> want to, <strong>end the agreement and object to the license</strong>. I
<a href="http://www.chemspider.com/blog/cas-discourages-using-scifinder-to-help-curate-wikipedia-structures-and-cas-numbers.html#comment-24101">commented in the original blog</a>:</p>

<blockquote>
  <p>In 1995 I started a Dutch website on organic chemistry [1] and the CAS number was as useful as it is now, and already then we knew we were not allowed
to compose a database of CAS numbers. Not sure about the legal state of that, but our university had a license; not sure if students had access, but
do not believe so. Anyway, building a substantial list of CAS number was not allowed. So, we looked for other means of identifying molecular structures,
which led us to CML… this was around ‘96-’97 or so, at least before XML was released, and we started using CML actually when it was still in a more
obscure SGML format :) Yeah, the XML recommendation was much appreciated!</p>

  <p>OK, so back to your blog item. You can imagine that the comment in WP by CAS does not surprise me at all; nothing really new. If they would allow this,
it would set a precedence…</p>

  <p>The solution is, however, fairly easy. Use InChI(Key), PubChem CID, or ChemSpider CID; the latter two are on the same level as CAS numbers. CAS registry
numbers are overrated. Not sure if they still hand out CAS numbers to mixture too… (I guess not).</p>

  <p>Oh, and I agree with Cpt. Renault… people should really abide to legal requirements. Period. If you don’t like them, quit the legal agreement.
As simple as that.</p>

  <p>1.<a href="http://www.woc.science.ru.nl/">http://www.woc.science.ru.nl/</a></p>
</blockquote>

<p>Here, I tend to disagree with <a href="http://www.chemspider.com/blog/cas-discourages-using-scifinder-to-help-curate-wikipedia-structures-and-cas-numbers.html#comment-24233">Will who wrote</a>
that “<em>They are just numbers. i.e. descriptors</em>”. The CAS number only makes sense with a (curated) lookup table, which ties it tightly
to the CAS database. While theoretically you may be allowed to copy numbers from that database, the license agreement does not
allow that. A court would have to decide which right takes precedence, but my vote is on the agreement, which you
thoughtfully signed. So, I tend to agree with Joerg who wrote that
<a href="http://miningdrugs.blogspot.com/2008/03/cas-numbers-are-not-public-domain-are.html">CAS number are not public domain, are they?</a></p>

<p>An interesting bit in that blog item is <a href="http://miningdrugs.blogspot.com/2008/03/cas-numbers-are-not-public-domain-are.html#c3452086141400278558">the comment he left himself</a>:</p>

<blockquote>
  <p>I just realized that <a href="http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=999">Peter</a> has also commented on it. And storing 10000 CAS
numbers and structures is allowed? What happens, if a journal reaches this limit? Just imagine they publish 1000 papers with 100
CAS numbers for each article? I do not get this!</p>
</blockquote>

<p>Interesting indeed. This gets me back to a question I was recently confronted with: <em>How would I use chemical literature in the current
age?</em> Well, what about this hypothetical <a href="http://taverna.sf.net/">Taverna</a> workflow:</p>

<ul>
  <li>Node 1: get me a list of journals expected to contain CAS registry numbers (such as <a href="http://pubs.acs.org/journals/jcisd8/index.html">JCIM</a>)</li>
  <li>Node 2: for each, get me all publications of the last 25 years</li>
  <li>Node 3: process all articles and count cited CAS registry numbers per journal</li>
  <li>Node 4: complain if count_per_journal &gt; 10000</li>
</ul>
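
<p>Nodes 3 and 4 of that workflow can be sketched in a few lines of Python. This is not a Taverna workflow, just a minimal sketch under assumptions: the function names are made up for illustration, the articles are assumed to be already available as (journal, full text) pairs (so nodes 1 and 2 are left out), and every occurrence of a number is counted, not every distinct number. Candidates are only counted when their check digit is valid.</p>

```python
import re
from collections import Counter

# A CAS registry number looks like 7732-18-5:
# 2 to 7 digits, then 2 digits, then a single check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")

# The limit mentioned in Joerg's comment.
LIMIT = 10000


def is_valid_cas(first, second, check):
    """Check digit test: weight the digits 1, 2, 3, ... from right to left
    (excluding the check digit); the sum modulo 10 must equal the check digit."""
    digits = (first + second)[::-1]
    total = sum(int(d) * weight for weight, d in enumerate(digits, start=1))
    return total % 10 == int(check)


def count_cas_per_journal(articles):
    """Node 3: count cited CAS registry numbers per journal.

    `articles` is an iterable of (journal, full_text) pairs (hypothetical
    input; fetching the articles is not part of this sketch).
    """
    counts = Counter()
    for journal, text in articles:
        for match in CAS_PATTERN.finditer(text):
            if is_valid_cas(*match.groups()):
                counts[journal] += 1
    return counts


def journals_over_limit(counts, limit=LIMIT):
    """Node 4: the journals to complain about."""
    return sorted(journal for journal, n in counts.items() if n > limit)
```

<p>Calling <code>journals_over_limit(count_cas_per_journal(articles))</code> then gives the list of journals to complain about.</p>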

<p>Anyway. Common agreement seems to be that we can opt to do without the CAS registry number. The PubChem ID seems a reasonable
candidate, and has been suggested <a href="http://blog.openwetware.org/scienceintheopen/2008/03/08/what-to-use-as-a-the-primary-key-for-chemicals/">here</a>
and <a href="http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=999">here</a>. The ChemSpider ID could be an option too, though ChemSpider content is
periodically added to PubChem.</p>

<p>I’d also like to bring in the suggestion of having a <em>Chemical Object Identifier</em>: like the DOI, the COI is a simple alphanumerical
identifier with a one-to-one connection to the InChI. Unlike the InChIKey, it is as unique as the InChI itself, but it requires a lookup
service. And the latter I can offer: <a href="http://rdf.openmolecules.net/">http://rdf.openmolecules.net/</a>. It’s a free (as in Open)
resource, where we can provide this lookup service. It would be really easy to create a new COI whenever an InChI is passed in that has not
been assigned a COI yet. A PHP page to do the reverse lookup is easy too. Interested? I can have it going by the end of the month. It comes
with full RDF support, so ready for the <a href="http://markclittle.blogspot.com/2006/05/web-ng.html">Web-NG</a>.</p>]]></content><author><name>Egon Willighagen</name></author><category term="cas" /><category term="cheminf" /><category term="justdoi:10.1021/ci049758i" /><category term="justdoi:10.63485/54wv5-hs388" /><summary type="html"><![CDATA[IUPAC chemical names, SMILES and InChIs are too long. InChIKeys are not unique enough because of safety reasons (you have a 1 in 10 billion chance of blowing up your building; well, odds are actually much, much lower than getting hit by Osama or friends, let alone a car). Wikipedia URIs do not cover enough chemical space.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/cas-number.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/cas-number.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>