<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://chem-bla-ics.linkedchemistry.info/feed/by_tag/beilstein.xml" rel="self" type="application/atom+xml" /><link href="https://chem-bla-ics.linkedchemistry.info/" rel="alternate" type="text/html" /><updated>2026-06-15T12:00:19+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/feed/by_tag/beilstein.xml</id><title type="html">chem-bla-ics</title><subtitle>Chemblaics (pronounced chem-bla-ics) is the science that uses open science and computers to solve problems in chemistry, biochemistry and related fields.</subtitle><author><name>Egon Willighagen</name></author><entry><title type="html">One Million IUPAC names #4: a lot is happening</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/08/09/one-million-iupac-names-4.html" rel="alternate" type="text/html" title="One Million IUPAC names #4: a lot is happening" /><published>2025-08-09T00:00:00+00:00</published><updated>2025-08-09T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/08/09/one-million-iupac-names-4</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/08/09/one-million-iupac-names-4.html"><![CDATA[<p>A lot is happening. If you have been following this project more closely, you may have already seen some interesting updates, but
I will post them here too. First, a quick recap. In March I started a new <a href="http://blueobelisk.org/">Blue Obelisk</a> project to
<a href="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html">collect CCZero IUPAC names</a>
from primary literature (paper still pending). It turned out we can automate that without violating any laws or licenses.
In April I reported on <a href="https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html">some tweaks</a>
boosting the efficiency of the use of the API. I also reported on some possible further steps, including how to use the extracted
names to create a larger set. Indeed, in June I could <a href="https://chem-bla-ics.linkedchemistry.info/2025/06/09/one-million-iupac-names.html">report to have passed the 200k IUPAC names</a>,
which with the idea from April gave us more than 1M IUPAC names.</p>

<p>In this post I want to give an update.</p>

<h2 id="275k-iupac-names">275k IUPAC names</h2>

<p>I have continued running the scripts to detect new IUPAC names in full text, open access papers in <a href="https://europepmc.org/">Europe PMC</a>,
but something even more awesome has contributed much more since the <a href="https://chem-bla-ics.linkedchemistry.info/2025/06/09/one-million-iupac-names.html">June post</a>:
in July I received a <a href="https://github.com/BlueObelisk/iupac-names/pull/13">pull request</a> from <a href="https://github.com/mnietfeld">mnietfeld</a>
with more than 40 thousand unique and new IUPAC names from the <a href="https://www.beilstein-journals.org/bjoc/">Beilstein Journal of Organic Chemistry</a>
(see also <a href="https://www.linkedin.com/posts/beilstein-institut_openaccess-bjoc-fair-activity-7351596602660167681-0Z0r/">their LinkedIn post</a> or
<a href="https://archive.is/DZOnP">this archived version</a> that doesn’t require an account).
While Europe PMC provides these articles too (and this journal was actually one of the first I analyzed), a lot of these names come from supplementary
information, not provided by Europe PMC. Thanks!</p>

<p>This is focusing on names from primary literature, but there is more happening. Because I want to restrict the above project to
names from primary literature (and supplementary information is still that), I have not been sure what to do with other collections
yet, and they have been coming in. I have been <a href="https://github.com/BlueObelisk/iupac-names/issues?q=is%3Aissue%20label%3Aother">taking notes</a>
in the project issue tracker, for future reference (like now, here). I have not forgotten about these!</p>

<h2 id="other-large-collections-of-iupac-names">Other large collections of IUPAC names</h2>

<p><strong>4M, CCZero</strong><br />
Let’s start with the news yesterday. The <a href="https://www.ebi.ac.uk/about/teams/chemical-biology-services/">Chemical Biology Services team</a>
<a href="https://chembl.blogspot.com/2025/08/unleashing-4-million-iupac-names-into.html">released 4 million IUPAC names from patent literature as CCZero</a>!
The CCZero license/waiver makes it compatible with our list. Their Zenodo release:</p>

<blockquote>
  <p>… contains IUPAC names text-mined from patents (US, WIPO, EPO, Chinese, Japanese).</p>
</blockquote>

<p>The post also includes a nice example of the complexity of IUPAC names which makes the counting of unique names tricky:
<code class="language-plaintext highlighter-rouge">O-methylphenol</code> and <code class="language-plaintext highlighter-rouge">o-methylphenol</code>. Thanks, Noel and the rest of the EMBL-EBI team!</p>
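
<p>A minimal sketch of why that matters when counting, assuming a hypothetical three-item list of names (this is not the script used in the project): exact string comparison keeps the two names apart, while lower-casing them as a "normalization" step would merge them.</p>

```javascript
// Two names that differ only in the case of the first letter. Upper-case "O"
// means substitution on the oxygen atom, lower-case "o" abbreviates "ortho",
// so these are names of different compounds.
const names = ["O-methylphenol", "o-methylphenol", "o-methylphenol"];

// Exact, case-sensitive deduplication keeps both spellings.
const exactUnique = new Set(names);

// Hypothetical normalization by lower-casing merges them into one entry,
// which would undercount the distinct names.
const lowerCased = new Set(names.map(function (name) {
  return name.toLowerCase();
}));

console.log(exactUnique.size); // 2
console.log(lowerCased.size);  // 1
```

<p>So any counting of unique names has to be careful with case folding, at least for locants like these.</p>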

<p><strong>2.3 million, CC-BY</strong><br />
And then <a href="https://github.com/haydn-jones">Haydn Jones</a> was one of the earliest <a href="https://github.com/BlueObelisk/iupac-names/issues/9">to chime in</a>,
and <a href="https://doi.org/10.5281/zenodo.15077270">released 2.3 million IUPAC names</a> under the CC-BY license.</p>

<p><strong>850k, CCZero</strong><br />
Wikidata also turns out to have many IUPAC names. <a href="https://github.com/Adafede/">Adriano</a> found more than 850 thousand IUPAC
names, see <a href="https://github.com/Adafede/wd-labels-to-iupac">this project</a>.</p>

<p>Next week I will do some comparisons of the datasets with a clear Creative Commons license.</p>

<h2 id="even-more">Even more</h2>

<p>Beyond these five data releases, there is more. PubChem and other databases have millions of names, but often these are
generated by proprietary software. These IUPAC name collections may be under some license agreement, and thus not compatible
with Open Science. This is why it is so important that we very clearly know where these names are coming from.</p>

<p><strong>5-6 million, license unclear</strong><br />
I also learned about <a href="https://chempile.lamalab.org/">ChemPile</a>, about which <a href="https://www.linkedin.com/in/adrian-mirza-chem/">Adrian Mirza</a>
explained to me that it has <a href="https://www.linkedin.com/feed/update/urn:li:activity:7330626142611062784">about 5-6 million IUPAC names</a>.
But the source of this list of names is not yet clear to me.</p>

<p><strong>Names from PhD theses and preprints</strong><br />
I also want to give a shout out to <a href="https://github.com/BlueObelisk/iupac-names/issues/15">Peter Murray-Rust</a>’s proposal
to start extracting IUPAC names from PhD theses. There have been projects to extract chemistry from PhD theses in the
past, and this will yield a lot of unique names. Please ping Peter, if you want to get involved in his idea!</p>

<h2 id="whats-next">What’s next</h2>

<p>I am so excited about all these efforts and very grateful for the contribution by Beilstein. I really hope more Open Science
publishers will follow, like perhaps the Royal Society of Chemistry for which it should be easy, with their
<a href="https://chem-bla-ics.linkedchemistry.info/2007/02/01/rsc-first-publisher-to-go-semantic.html">Project Prospect</a> background!</p>

<p>I am also excited by the release by ChEMBL under CCZero. That will allow the <a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry">WikiProject Chemistry</a>
to use this for Wikidata!</p>

<p>So, I have one week left to write the article about the work we started in March. The outlook is bright. I played last
week with the Europe PMC full text downloads and can confirm that should yield thousands of additional names from the
full texts. A single download file gave me more than two thousand new unique names. I think the 500k IUPAC names
is absolutely in reach with purely the full texts from Europe PMC.</p>

<p>This brings us to the end of 2025. By then, we should have many millions of openly-licensed IUPAC names.
And by March 2026, I hope we will have reached the 1M IUPAC names extracted from primary literature. That will require some
creativity and enthusiasm, but sounds feasible!</p>]]></content><author><name>Egon Willighagen</name></author><category term="iupac" /><category term="beilstein" /><category term="chembl" /><category term="cito:citesAsRecommendedReading:10.5281/zenodo.16755947" /><category term="inchikey:RDOXTESZEPMUJZ-UHFFFAOYSA-N" /><category term="inchikey:QWVGKYWNOKOFNN-UHFFFAOYSA-N" /><category term="cito:citesAsRecommendedReading:10.5281/zenodo.15077270" /><category term="europepmc" /><summary type="html"><![CDATA[A lot is happening. If you have been following this project more closesly, you may have already seen some interesting updates, but I will post it here too. First, a quick recap. In March I started a new Blue Obelisk project to collect CCZero IUPAC names from primary literature (paper still pending). It turned out we can automate that, while legally not violating any laws or licenses. In April I reported on some tweaks boosting the efficiency of the use of the API. I also reported on some possible further steps, including how to use the extracted names to create a larger set. Indeed, in June I could report to have passed the 200k IUPAC names, which with the idea from April gave us more than 1M IUPAC names.]]></summary></entry><entry><title type="html">Beilstein journals contain Bioschemas</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/02/13/beiltein-journal-has-bioschemas.html" rel="alternate" type="text/html" title="Beilstein journals contain Bioschemas" /><published>2025-02-13T00:00:00+00:00</published><updated>2025-02-13T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/02/13/beiltein-journal-has-bioschemas</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/02/13/beiltein-journal-has-bioschemas.html"><![CDATA[<p>Two weeks ago, the <a href="https://www.beilstein-journals.org/bjoc/news/LAFGBV6PT5ASC5R7JOKSEXOQYM">Beilstein Institute announced Bioschemas support in their journals</a>:</p>

<blockquote>
  <p>We streamline the discoverability of your research by incorporating machine-readable chemical information into many of our published articles.
This includes the conversion of chemical structures from submitted ChemDraw files to InChI strings and validating them using open-source tools.</p>
</blockquote>

<p>The idea is far from new and has been around for two decades. But the <a href="https://scholia.toolforge.org/publisher/Q4881267">two Beilstein journals</a>
(both <a href="https://en.wikipedia.org/wiki/Diamond_open_access">diamond Open Access</a>) have actually integrated it into their active publishing model.
Similar approaches have been trialed and put into action before. For example, there was (is?) <a href="https://doi.org/10.59350/ne4rf-wey66">Project Prospect</a>
(2007), <a href="https://chem-bla-ics.linkedchemistry.info/2009/03/19/nature-chemistry-improves-publishing.html">chemical structure annotation in Nature Chemistry</a>
(2009), <a href="https://chem-bla-ics.linkedchemistry.info/2014/02/21/slow-publishing-innovation.html">SMILES in the ACS Journal of Medicinal Chemistry</a>
(2014) (doi:<a href="https://doi.org/10.1021/jm5002056">10.1021/jm5002056</a>),
and <em>FAIR chemical structures in the Journal of Cheminformatics</em> (2021) (doi:<a href="https://doi.org/10.1186/s13321-021-00520-4">10.1186/s13321-021-00520-4</a>).</p>

<p>But this announcement is a new step. I like how validation of the chemical structures is part of the approach, and I like
how they use the <a href="https://bioschemas.org/">Bioschemas</a> extension of <a href="https://schema.org/">schema.org</a>. The latter because
they use two Bioschemas types/profiles that I contributed to or initiated, respectively: <a href="https://bioschemas.org/profiles/MolecularEntity/0.5-RELEASE">MolecularEntity</a>
and <a href="https://bioschemas.org/profiles/ChemicalSubstance/0.4-RELEASE">ChemicalSubstance</a>.</p>

<p>First stop for me is to check the schema.org annotation with a validation tool, like <a href="https://search.google.com/test/rich-results">Google’s Rich Results Test</a>.
That gives an idea of how a search engine may pick it up. The test article I was given on LinkedIn is
Xiao <em>et al.</em>’s <em>Molecular diversity of the reactions of MBH carbonates of isatins and various nucleophiles</em>
(doi:<a href="https://doi.org/10.3762/bjoc.21.21">10.3762/bjoc.21.21</a>) in the <a href="https://scholia.toolforge.org/venue/Q2894008">Beilstein Journal of Organic Chemistry</a>,
and we indeed <a href="https://search.google.com/test/rich-results/result?id=FRW9wBOpXtsMp9TLUV6SfQ">see the schema.org annotation show up</a>:</p>

<p><img src="/assets/images/bjoc_bioschemas.png" alt="" /></p>

<p>And because of the use of open standards, extracting the information is not so hard with, for example here,
Bacting (doi:<a href="https://doi.org/10.21105/joss.02558">10.21105/joss.02558</a>), based on a 2022 script from the NanoSafety Cluster
projects NanoCommons and SbD4Nano:</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Grab</span><span class="o">(</span><span class="n">group</span><span class="o">=</span><span class="s1">'io.github.egonw.bacting'</span><span class="o">,</span> <span class="n">module</span><span class="o">=</span><span class="s1">'managers-rdf'</span><span class="o">,</span> <span class="n">version</span><span class="o">=</span><span class="s1">'1.0.4'</span><span class="o">)</span>
<span class="nd">@Grab</span><span class="o">(</span><span class="n">group</span><span class="o">=</span><span class="s1">'io.github.egonw.bacting'</span><span class="o">,</span> <span class="n">module</span><span class="o">=</span><span class="s1">'managers-ui'</span><span class="o">,</span> <span class="n">version</span><span class="o">=</span><span class="s1">'1.0.4'</span><span class="o">)</span>
<span class="nd">@Grab</span><span class="o">(</span><span class="n">group</span><span class="o">=</span><span class="s1">'io.github.egonw.bacting'</span><span class="o">,</span> <span class="n">module</span><span class="o">=</span><span class="s1">'net.bioclipse.managers.jsoup'</span><span class="o">,</span> <span class="n">version</span><span class="o">=</span><span class="s1">'1.0.4'</span><span class="o">)</span>

<span class="n">bioclipse</span> <span class="o">=</span> <span class="k">new</span> <span class="n">net</span><span class="o">.</span><span class="na">bioclipse</span><span class="o">.</span><span class="na">managers</span><span class="o">.</span><span class="na">BioclipseManager</span><span class="o">(</span><span class="s2">"."</span><span class="o">);</span>
<span class="n">rdf</span> <span class="o">=</span> <span class="k">new</span> <span class="n">net</span><span class="o">.</span><span class="na">bioclipse</span><span class="o">.</span><span class="na">managers</span><span class="o">.</span><span class="na">RDFManager</span><span class="o">(</span><span class="s2">"."</span><span class="o">);</span>
<span class="n">jsoup</span> <span class="o">=</span> <span class="k">new</span> <span class="n">net</span><span class="o">.</span><span class="na">bioclipse</span><span class="o">.</span><span class="na">managers</span><span class="o">.</span><span class="na">JSoupManager</span><span class="o">(</span><span class="s2">"."</span><span class="o">);</span>

<span class="n">articles</span> <span class="o">=</span> <span class="o">[</span>
   <span class="n">args</span><span class="o">[</span><span class="mi">0</span><span class="o">]</span>
<span class="o">]</span>

<span class="n">kg</span> <span class="o">=</span> <span class="n">rdf</span><span class="o">.</span><span class="na">createInMemoryStore</span><span class="o">()</span>

<span class="k">for</span> <span class="o">(</span><span class="n">article</span> <span class="k">in</span> <span class="n">articles</span><span class="o">)</span> <span class="o">{</span>
    <span class="n">htmlContent</span> <span class="o">=</span> <span class="n">bioclipse</span><span class="o">.</span><span class="na">download</span><span class="o">(</span><span class="n">article</span><span class="o">)</span>

    <span class="n">htmlDom</span> <span class="o">=</span> <span class="n">jsoup</span><span class="o">.</span><span class="na">parseString</span><span class="o">(</span><span class="n">htmlContent</span><span class="o">)</span>

    <span class="c1">// application/ld+json</span>

    <span class="n">bioschemasSections</span> <span class="o">=</span> <span class="n">jsoup</span><span class="o">.</span><span class="na">select</span><span class="o">(</span><span class="n">htmlDom</span><span class="o">,</span> <span class="s2">"script[type='application/ld+json']"</span><span class="o">);</span>

    <span class="k">for</span> <span class="o">(</span><span class="n">section</span> <span class="k">in</span> <span class="n">bioschemasSections</span><span class="o">)</span> <span class="o">{</span>
        <span class="n">bioschemasJSON</span> <span class="o">=</span> <span class="n">section</span><span class="o">.</span><span class="na">html</span><span class="o">()</span>
        <span class="n">rdf</span><span class="o">.</span><span class="na">importFromString</span><span class="o">(</span><span class="n">kg</span><span class="o">,</span> <span class="n">bioschemasJSON</span><span class="o">,</span> <span class="s2">"JSON-LD"</span><span class="o">)</span>
    <span class="o">}</span>
<span class="o">}</span>

<span class="n">turtle</span> <span class="o">=</span> <span class="n">rdf</span><span class="o">.</span><span class="na">asTurtle</span><span class="o">(</span><span class="n">kg</span><span class="o">);</span>

<span class="n">println</span> <span class="s2">"#"</span> <span class="o">+</span> <span class="n">rdf</span><span class="o">.</span><span class="na">size</span><span class="o">(</span><span class="n">kg</span><span class="o">)</span> <span class="o">+</span> <span class="s2">" triples detected in the JSON-LD"</span>
<span class="c1">// println turtle</span>


<span class="n">sparql</span> <span class="o">=</span> <span class="s2">"""
PREFIX schema: &lt;http://schema.org/&gt;
SELECT ?entity ?inchikey ?smiles WHERE {
  ?entity a schema:MolecularEntity .
  OPTIONAL { ?entity schema:inChIKey ?inchikey }
  OPTIONAL { ?entity schema:smiles ?smiles }
}
"""</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">rdf</span><span class="o">.</span><span class="na">sparql</span><span class="o">(</span><span class="n">kg</span><span class="o">,</span> <span class="n">sparql</span><span class="o">)</span>

<span class="k">for</span> <span class="o">(</span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="o">;</span><span class="n">i</span><span class="o">&lt;=</span><span class="n">results</span><span class="o">.</span><span class="na">rowCount</span><span class="o">;</span><span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
  <span class="n">println</span> <span class="s2">"${results.get(i, "</span><span class="n">inchikey</span><span class="s2">")}\t${results.get(i, "</span><span class="n">smiles</span><span class="s2">")}"</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The output is a simple table:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MGAPJMNPGGTFHJ-JEIPZWNWSA-N     CN1C(=O)/C(=C/2\C3=CC(=CC=C3N(CC4=CC=CC=C4)C2=O)Cl)/C(=P(C5=CC=CC=C5)(C6=CC=CC=C6)C7=CC=CC=C7)C1=O
XEWMQVUVGAHESA-UHFFFAOYSA-N     CC1=CC=C(C=C1)NC2=C(C3C4=CC(=CC=C4N(CC5=CC=CC=C5)C3=O)C)C(=O)N(C)C2=O
UVTJORFYHPGJDZ-PYCFMQQDSA-N     CCCCN1C2=CC=C(C)C=C2/C(=C(\C#N)/CNC3=CC=C(C)C=C3)/C1=O
ILWGDUYVQRAMMG-PGMHBOJBSA-N     CCCCN1C2=CC=C(C)C=C2/C(=C(\C#N)/CNC3=CC=C(C=C3)Cl)/C1=O
CAFIBKBZWJFZCW-FXBPSFAMSA-N     CCCCN1C2=CC=C(C)C=C2/C(=C(\C#N)/CNC3=CC=CC=C3)/C1=O
UOJSFLANMVIMBV-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NC4=CC=C(C=C4)Cl)C1=O
VNJBTGZXAGHCSO-OAPYJULQSA-N     COC(=O)/C(=C\1/C2=C(C=CC=C2)N(CC3=CC=CC=C3)C1=O)/C=P(C4=CC=CC=C4)(C5=CC=CC=C5)C6=CC=CC=C6
KJXQRAKSOANQTJ-GFMRDNFCSA-N     CC1=CC=C(C=C1)NC/C(=C\2/C3=C(C=CC=C3)N(CC4=CC=CC=C4)C2=O)/C#N
IGEBJMZDOPBFGF-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NC4=CC=CC=C4)C1=O
SSANVPNESOMKOM-AWQADKOQSA-N     C1=CC=C(C=C1)CN2C3=CC=C(C=C3/C(=C(/C#N)\C=P(C4=CC=CC=C4)(C5=CC=CC=C5)C6=CC=CC=C6)/C2=O)Cl
GEHWHSHQSIOZKL-NVQSTNCTSA-N     CCCCN1C2=CC=C(C=C2/C(=C\3/C(=P(C4=CC=CC=C4)(C5=CC=CC=C5)C6=CC=CC=C6)C(=O)N(C)C3=O)/C1=O)Cl
PALRSQOHFLRWDH-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NC4=CC=C(C=C4)OC)C1=O
KBFODZMDSAFLFR-UHFFFAOYSA-N     CN1C(=O)C(=C(C1=O)NC2=CC(=CC=C2)Cl)C3C4=CC(=CC=C4N(CC5=CC=CC=C5)C3=O)Cl
JCGAVVZYXDJPBU-GFMRDNFCSA-N     CC1=C(C=CC=C1)NC/C(=C\2/C3=C(C=CC=C3)N(CC4=CC=CC=C4)C2=O)/C#N
DZFPCPDEQGLPLY-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NC4=CC=C(C)C=C4)C1=O
XMRNJCJUOXYXJU-DAFNUICNSA-N     CC1=CC=C(C=C1)NC/C(=C\2/C3=CC(=CC=C3N(CC4=CC=CC=C4)C2=O)C)/C#N
SSDSNBBHEUUKGI-UHFFFAOYSA-N     CC1=CC=C2C(=C1)C(C3=C(C(=O)N(C)C3=O)N(C)C4=CC=CC=C4)C(=O)N2CC5=CC=CC=C5
USFYPRDMNXMWPO-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NC4=CC=C(C=C4)Br)C1=O
XYHTWFULRHTEAG-MUGXBBEHSA-N     CCCCN1C2=CC=C(C)C=C2/C(=C(/C#N)\C=P(C3=CC=CC=C3)(C4=CC=CC=C4)C5=CC=CC=C5)/C1=O
XALDZIBHNNIVAM-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NC4=C(C=CC=C4)O)C1=O
TUTWQHBRQPMLME-OAPYJULQSA-N     COC(=O)/C(=C\1/C2=CC(=CC=C2N(CC3=CC=CC=C3)C1=O)Cl)/C=P(C4=CC=CC=C4)(C5=CC=CC=C5)C6=CC=CC=C6
IYEHFTMZZMIPRU-UHFFFAOYSA-N     CC1=CC=C(C=C1)NC2=C(C3C4=CC(=CC=C4N(CC5=CC=CC=C5)C3=O)Cl)C(=O)N(C)C2=O
KBSDGNPLIPXCEX-UHFFFAOYSA-N     CCCCN1C2=CC=C(C)C=C2C(C3=C(C(=O)N(C)C3=O)NCC4=CC=CC=C4)C1=O
BQGIUMITIGHBSD-UHFFFAOYSA-N     CCCCNC1=C(C2C3=CC(=CC=C3N(CC4=CC=CC=C4)C2=O)C)C(=O)N(C)C1=O
PNSOLOPHIVUPOZ-MNDPQUGUSA-N     CCCCNC/C(=C\1/C2=CC(=CC=C2N(CCCC)C1=O)C)/C#N
HLTBKJRJOIZCMJ-PYCFMQQDSA-N     CCCCN1C2=CC=C(C)C=C2/C(=C(\C#N)/CN(C)C3=CC=CC=C3)/C1=O
FFLHFLUBMRBQTB-UHFFFAOYSA-N     CCCCN1C2=CC=C(C=C2C(C3=C(C(=O)N(C)C3=O)NC4=CC=C(C)C=C4)C1=O)F
FOQOVOLYYARWPA-NKFKGCMQSA-N     C1=CC=C(C=C1)CN2C3=C(C=CC=C3)/C(=C(\C#N)/CNC4=CC(=CC=C4)Cl)/C2=O
KLEPCAQFOXJLNV-UHFFFAOYSA-N     CC1=C(C=CC=C1)NC2=C(C3C4=CC(=CC=C4N(CC5=CC=CC=C5)C3=O)Cl)C(=O)N(C)C2=O
</code></pre></div></div>
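
<p>The SPARQL query in the script selects every <code class="language-plaintext highlighter-rouge">MolecularEntity</code> and treats the InChIKey and SMILES as optional. As a rough illustration of that selection logic in plain JavaScript (not SPARQL, and not Bacting), here is a sketch over a hand-written JSON-LD fragment. The <code class="language-plaintext highlighter-rouge">@graph</code> layout, the <code class="language-plaintext highlighter-rouge">@id</code> values, and the helper name <code class="language-plaintext highlighter-rouge">molecularEntityRows</code> are assumptions for this example; the actual markup in the Beilstein articles may be structured differently.</p>

```javascript
// Hypothetical JSON-LD fragment; only the type and property names are taken
// from the SPARQL query above, the rest is made up for illustration.
const jsonld = {
  "@context": "http://schema.org/",
  "@graph": [
    {
      "@id": "#mol1",
      "@type": "MolecularEntity",
      "inChIKey": "BQGIUMITIGHBSD-UHFFFAOYSA-N",
      "smiles": "CCCCNC1=C(C2C3=CC(=CC=C3N(CC4=CC=CC=C4)C2=O)C)C(=O)N(C)C1=O"
    },
    // No SMILES here: like OPTIONAL in the query, this still yields a row.
    { "@id": "#mol2", "@type": "MolecularEntity", "inChIKey": "XEWMQVUVGAHESA-UHFFFAOYSA-N" },
    // Not a MolecularEntity, so it is skipped.
    { "@id": "#sub1", "@type": "ChemicalSubstance" }
  ]
};

// Collect one row per MolecularEntity; missing properties become null.
function molecularEntityRows(doc) {
  const rows = [];
  for (const node of doc["@graph"]) {
    if (node["@type"] !== "MolecularEntity") continue;
    rows.push({
      entity: node["@id"],
      inchikey: node.inChIKey === undefined ? null : node.inChIKey,
      smiles: node.smiles === undefined ? null : node.smiles
    });
  }
  return rows;
}

const rows = molecularEntityRows(jsonld);
for (const row of rows) {
  console.log(row.inchikey + "\t" + row.smiles);
}
```

<p>The point of the optional parts is that an entity annotated with only one of the two identifiers is not silently dropped from the table.</p>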

<p>That also made me realize that there are no chemical names in the annotation. That would be really useful to move things
forward. Then again, PubChem will likely just generate the IUPAC name, since they have access to such software anyway.
They have teamed up with PubChem which will index it, but I will be interested in seeing how to use this for
<code class="language-plaintext highlighter-rouge">main subject</code> annotation in <a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry">Wikidata</a>.</p>

<p>A final note for now: the model they use is to annotate the article with chemical substances (<code class="language-plaintext highlighter-rouge">ChemicalSubstance</code>) with
(one or more?) molecular entities (<code class="language-plaintext highlighter-rouge">MolecularEntity</code>). That is a model that scales well to their other journal,
the <a href="https://scholia.toolforge.org/venue/Q814756">Beilstein Journal of Nanotechnology</a>. But scraping that is for another post.</p>]]></content><author><name>Egon Willighagen</name></author><category term="bioschemas" /><category term="rdf" /><category term="chemistry" /><category term="cito:citesForInformation:10.59350/ne4rf-wey66" /><category term="cito:citesForInformation:10.1186/s13321-021-00520-4" /><category term="cito:citesForInformation:10.1021/jm5002056" /><category term="cito:usesDataFrom:10.3762/bjoc.21.21" /><category term="cito:usesMethodIn:10.21105/joss.02558" /><category term="cito:citesForInformation:10.59350/40377-hz881" /><category term="beilstein" /><summary type="html"><![CDATA[Two weeks ago, the Beilstein Institute announced Bioschemas support in their journals:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/bjoc_bioschemas.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/bjoc_bioschemas.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Bioclipse-Oscar4 - Text mining in Bioclipse</title><link href="https://chem-bla-ics.linkedchemistry.info/2011/09/27/almost-year-ago-i-started-position-with.html" rel="alternate" type="text/html" title="Bioclipse-Oscar4 - Text mining in Bioclipse" /><published>2011-09-27T00:00:00+00:00</published><updated>2011-09-27T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2011/09/27/almost-year-ago-i-started-position-with</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2011/09/27/almost-year-ago-i-started-position-with.html"><![CDATA[<p>Almost a year ago I <a href="https://chem-bla-ics.linkedchemistry.info/2010/10/15/working-on-oscar-for-three-months.html">started a position <i class="fa-solid fa-recycle fa-xs"></i></a>
with <a href="http://blogs.ch.cam.ac.uk/pmr/">Peter Murray-Rust</a> to work on Oscar for three months (see this overview of results;
a paper by the full Oscar team (Sam, David, Dan, Lezan) is pending, and I’m really happy to have been able to contribute
bits to the project). Since then, I have had little time :( That’s how it goes, with post-hopping, unfortunately.
One thing I did do after that, was write a <a href="https://github.com/bioclipse/bioclipse.oscar">Bioclipse plugin</a>.</p>

<p>I was asked recently via <a href="http://www.linkedin.com/in/egonw">LinkedIn</a> if I was planning a Bioclipse-Oscar plugin, and
I realized that I forgot to blog about it. So, here goes. The <code class="language-plaintext highlighter-rouge">oscar</code> manager I implemented follows the
<a href="https://chem-bla-ics.linkedchemistry.info/2010/10/28/oscar4-java-api-chemical-name.html">Oscar API <i class="fa-solid fa-recycle fa-xs"></i></a>, and these
methods are available: <code class="language-plaintext highlighter-rouge">extractText()</code>, <code class="language-plaintext highlighter-rouge">findNamedEntities()</code>,  <code class="language-plaintext highlighter-rouge">findResolvedNamedEntities()</code>.</p>

<p>When I wrote the plugin, I also uploaded an <a href="http://www.myexperiment.org/workflows/2117.html">example workflow to MyExperiment</a>.
The code is:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Demo showing the Oscar text mining functionality</span>
<span class="c1">// in Bioclipse</span>
<span class="kd">var</span> <span class="nx">html</span> <span class="o">=</span> <span class="nx">bioclipse</span><span class="p">.</span><span class="nf">download</span><span class="p">(</span>
  <span class="dl">"</span><span class="s2">http://dx.doi.org/10.3762/bjoc.6.133</span><span class="dl">"</span><span class="p">,</span>
  <span class="dl">"</span><span class="s2">text/html</span><span class="dl">"</span>
<span class="p">)</span>
<span class="kd">var</span> <span class="nx">text</span> <span class="o">=</span> <span class="nx">oscar</span><span class="p">.</span><span class="nf">extractText</span><span class="p">(</span><span class="nx">html</span><span class="p">);</span>
<span class="c1">// the next step may take some time, while</span>
<span class="c1">// initializing the Oscar software for the</span>
<span class="c1">// first time</span>
<span class="kd">var</span> <span class="nx">mols</span> <span class="o">=</span> <span class="nx">oscar</span><span class="p">.</span><span class="nf">findResolvedNamedEntities</span><span class="p">(</span><span class="nx">text</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">file</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">/Oscar Demo/extractedMols.sdf</span><span class="dl">"</span><span class="p">;</span>
<span class="nx">cdk</span><span class="p">.</span><span class="nf">saveSDFile</span><span class="p">(</span><span class="nx">file</span><span class="p">,</span> <span class="nx">mols</span><span class="p">);</span>
<span class="nx">ui</span><span class="p">.</span><span class="nf">open</span><span class="p">(</span><span class="nx">file</span><span class="p">);</span>
</code></pre></div></div>

<p>The code will extract chemical entities, and open a molecules table in <a href="http://www.bioclipse.net/">Bioclipse</a>:</p>

<p><img src="/assets/images/oscarDemo2.png" alt="" /></p>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="bioclipse" /><category term="beilstein" /><summary type="html"><![CDATA[Almost a year ago I started a position with Peter Murray-Rust to work on Oscar for three months (see this overview of results; a paper by the full Oscar team (Sam, David, Dan, Lezan) is pending, and I’m really happy to have been able to contribute bits to the project). Since then, I have had little time :( That’s how it goes, with post-hopping, unfortunately. One thing I did do after that, was write a Bioclipse plugin.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/oscarDemo2.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/oscarDemo2.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Status update on BJOC analysis with Oscar and ChemicalTagger #3</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/12/23/status-update-on-bjoc-analysis-with_23.html" rel="alternate" type="text/html" title="Status update on BJOC analysis with Oscar and ChemicalTagger #3" /><published>2010-12-23T00:00:00+00:00</published><updated>2010-12-23T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/12/23/status-update-on-bjoc-analysis-with_23</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/12/23/status-update-on-bjoc-analysis-with_23.html"><![CDATA[<p>The <a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with_11.html">two <i class="fa-solid fa-recycle fa-xs"></i></a>
<a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with.html">earlier <i class="fa-solid fa-recycle fa-xs"></i></a> posts
in this series showed screenshots of results of Oscar, but the title also promised results by Lezan’s
<a href="http://www-ucc.ch.cam.ac.uk/products/software/chemicaltagger">ChemicalTagger</a>. Sam
helped with getting the HTML pages online via the Cambridge Hudson installation. Where
Oscar finds named entities (chemical compounds, processes, etc), ChemicalTagger finds
roles, like solvent, acid, base, catalyst. Roles are properties of chemical compounds
in certain situations. Ethanol is not always a solvent, sometimes it is a Xmas present.
The current output is not entirely where I want to go yet, but makes it easy to see which
solvents are frequently found in the BJOC corpus:</p>

<p><img src="/assets/images/chemtag1.png" alt="" /></p>

<p>This screenshot of an analysis of 15 BJOC papers shows that AcOEt (is that the
<a href="http://lab.chempedia.com/questions/427/are-etoac-and-acoet-the-same">same as EtOAc?</a>)
is mentioned as solvent three times in <a href="http://www.ncbi.nlm.nih.gov/sites/ppmc/articles/PMC1399459">PMC1399459</a>.
Brine, however, is mentioned as solvent in three papers.</p>

<p>As said, these <a href="https://hudson.ch.cam.ac.uk/job/oscar4-chebi/ws/target/output/bjoc.html">two</a>
<a href="https://hudson.ch.cam.ac.uk/job/oscar4-chebi/ws/target/output/roles.html">pages</a> contain
RDF and the tables are sortable. Hudson recompiles them automatically when I update the
source code to create the HTML+RDFa. So, go ahead, send me bug reports, feature requests,
and patches!</p>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="chemicaltagger" /><category term="beilstein" /><category term="justdoi:10.1186/1860-5397-1-11" /><summary type="html"><![CDATA[The two earlier posts in this series showed screenshots of results of Oscar, but the title also promised results by Lezan’s ChemicalTagger. Sam helped with getting the HTML pages online via the Cambridge Hudson installation. Where Oscar find named entities (chemical compounds, processes, etc), ChemicalTagger finds roles, like solvent, acid, base, catalyst. Roles are properties of chemical compounds in certain situations. Ethanol is not always a solvent, sometimes it is a Xmas present. The current output is not entirely where I want to go yet, but makes it easy which solvents are frequently found in the BJOC corpus:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/chemtag1.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/chemtag1.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Supramolecular chemistry</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/supramolecular-chemistry.html" rel="alternate" type="text/html" title="Supramolecular chemistry" /><published>2010-12-11T00:20:00+00:00</published><updated>2010-12-11T00:20:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/12/11/supramolecular-chemistry</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/12/11/supramolecular-chemistry.html"><![CDATA[<p>Some smart software developer once said to not optimize your code too early. However, not caring about it at all does not help either.
Some basic knowledge of memory management can keep you going. That is, I just ran into the limits of <a href="http://oscar3-chem.sf.net/">Oscar</a>
and ChemicalTagger. As I blogged earlier today, I am <a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with.html">analyzing the BJOC literature <i class="fa-solid fa-recycle fa-xs"></i></a>,
but Lezan and I are running into a reproducible out-of-memory exception. At first I thought it was a memory leak, as it was the 95th
paper it fell over on, but after we optimized our code a bit, by reusing classes, the problem remained and turned out to be not in
recreating objects (though the code is significantly faster now), but in a single BJOC paper being too large.</p>
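
<p>One way to keep a batch run going past such a paper is a skip list. The following is only a minimal sketch, not the code we actually use:
the <code>filter_skiplist</code> helper and the file names are made up for illustration, and it assumes one paper per file and one file name per line in the skip list.</p>

```shell
# Hypothetical helper: print those file names from the argument list that
# are NOT named in the skip list (first argument, one file name per line).
filter_skiplist() {
  list=$1
  shift
  for f in "$@"; do
    # -q: quiet, -x: match whole lines only, -F: fixed string, not a regex
    if ! grep -qxF -- "$f" "$list"; then
      printf '%s\n' "$f"
    fi
  done
}

# Example with made-up file names: bjoc-b.html is on the skip list,
# so only bjoc-a.html and bjoc-c.html are printed.
skiplist=$(mktemp)
printf '%s\n' 'bjoc-b.html' > "$skiplist"
filter_skiplist "$skiplist" bjoc-a.html bjoc-b.html bjoc-c.html
rm -f "$skiplist"
```

<p>The remaining file names can then be fed to the analysis one by one, so a single oversized paper no longer stops the whole run.</p>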

<p>The particular paper is not even ridiculously large, though it has an amazing 800 references! The paper, <em>Molecular recognition of
organic ammonium ions in solution using synthetic receptors</em> (doi:<a href="https://doi.org/10.3762/bjoc.6.32">10.3762/bjoc.6.32</a>), is in
fact an interesting review paper on supramolecular chemistry. The molecules I worked on (see one below) in my own supramolecular chemistry
time (doing an M.Sc. minor (6-month practical) with Peter Buijnsters in organic chemistry in the
<a href="http://www.molchem.science.ru.nl/about.php">group of Prof. Nolte</a>), are actually of the type they review, though surfactants are
not really covered in it particularly.</p>

<p><img src="/assets/images/surfactant2002.png" alt="" /></p>

<p>Yeah, supramolecular chemistry has this nice level of complexity; it is so supramolecular, that it is currently outside the scope
of the molecular analysis of Oscar and ChemicalTagger ;)</p>

<ul>
  <li>Buijnsters, P. J. J. A.; García-Rodríguez, C. L.; Willighagen, E. L.; Sommerdijk, N. A. J. M.; Kremer, A.; Camilleri, P.; Feiters, M. C.;
Nolte, R. J. M.; Zwanenburg, B. (2002). Cationic Gemini Surfactants Based on Tartaric Acid: Synthesis, Aggregation, Monolayer Behaviour,
and Interaction with DNA. European Journal of Organic Chemistry, 2002 (8), 1397-1406.
DOI:<a href="https://doi.org/csqhbn">10.1002/1099-0690(200204)2002:8%3C1397::AID-EJOC1397%3E3.0.CO;2-6</a></li>
</ul>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="beilstein" /><category term="chemistry" /><category term="justdoi:10.3762/bjoc.6.32" /><category term="inchikey:ZXBXTKLPJKVXBD-KKLWWLSJSA-N" /><summary type="html"><![CDATA[Some smart software developer once said to not optimize your code too early. However, not caring about it at all does not help either. Some basic knowledge of memory management can keep you going. That is, I just ran into the limits of Oscar and ChemicalTagger. As I blogged earlier today, I am analyzing the BJOC literature , but Lezan and I are running into a reproducible out-of-memory exception. At first I thought it was a memory leak, as it was the 95th paper if fell over on, but after we optimized our code a bit, by reusing classes, the problem remained and turned out to be not in recreating objects (though the code is significantly faster now), but in a single BJOC paper being too large.]]></summary></entry><entry><title type="html">Status update on BJOC analysis with Oscar and ChemicalTagger #2</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with_11.html" rel="alternate" type="text/html" title="Status update on BJOC analysis with Oscar and ChemicalTagger #2" /><published>2010-12-11T00:10:00+00:00</published><updated>2010-12-11T00:10:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with_11</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with_11.html"><![CDATA[<p><img src="/assets/images/bjoc1.png" alt="" /></p>

<p>A quick update on the <a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with.html">post of this morning <i class="fa-solid fa-recycle fa-xs"></i></a>. The above screenshot
shows the progress of the reporting of text mining results using <a href="http://oscar3-chem.sf.net/">Oscar</a> on the <a href="http://www.beilstein-journals.org/bjoc/">BJOC literature</a>.
I think I am almost ready to analyze the full corpus, with a blacklist put in place for <a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/supramolecular-chemistry.html">large papers <i class="fa-solid fa-recycle fa-xs"></i></a>.
What you see is the same kind of JQuery-enabled sortable list in the HTML view, and a <a href="http://en.wikipedia.org/wiki/SPARQL">SPARQL query</a>
in <a href="http://chem-bla-ics.blogspot.com/2010/07/rdfadev-htmlrdfa-development-with.html">RDFaDev</a>, to list all papers that mention
<a href="http://www.dhmo.org/facts.html">DHMO</a> (in the first 10 of all 350 BJOC papers) by its <a href="http://en.wikipedia.org/wiki/International_Chemical_Identifier">InChI</a>.</p>

<p>Importantly, IMHO, it is using the <a href="http://code.google.com/p/semanticchemistry/">CHEMINF ontology</a>.</p>]]></content><author><name>Egon Willighagen</name></author><category term="beilstein" /><category term="oscar" /><category term="inchi" /><summary type="html"><![CDATA[]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/bjoc1.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/bjoc1.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Status update on BJOC analysis with Oscar and ChemicalTagger</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with.html" rel="alternate" type="text/html" title="Status update on BJOC analysis with Oscar and ChemicalTagger" /><published>2010-12-11T00:00:00+00:00</published><updated>2010-12-11T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with.html"><![CDATA[<p><img src="/assets/images/bjoc.png" alt="" /></p>

<p>This screenshot shows the current status of the <a href="http://oscar3-chem.sf.net/">Oscar</a> analysis results of the
<a href="http://www.beilstein-journals.org/bjoc/">BJOC literature</a>. The results are logged as an HTML+RDFa page, as I explained before in
<a href="http://chem-bla-ics.blogspot.com/2010/07/scripts-logs-as-htmlrdfa-mix-free-text.html">Scripts logs as HTML+RDFa: mix free text reporting with CSV</a>.
The page is interactive, using <a href="http://jquery.com/">JQuery</a> goodies to allow table sorting.</p>]]></content><author><name>Egon Willighagen</name></author><category term="beilstein" /><category term="oscar" /><summary type="html"><![CDATA[]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/bjoc.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/bjoc.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Oscar4 command line utilities</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/11/18/oscar4-command-line-utilities.html" rel="alternate" type="text/html" title="Oscar4 command line utilities" /><published>2010-11-18T00:00:00+00:00</published><updated>2010-11-18T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/11/18/oscar4-command-line-utilities</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/11/18/oscar4-command-line-utilities.html"><![CDATA[<p>One goal of my three month project is to take Oscar4 to the community. We want to get it used more, and we need
a larger development community. Oscar4 and the related technologies do a good, sometimes excellent, job, but
have to be maintained, just like any other piece of code. To make using it easier, we are developing new APIs,
as well as two user-oriented applications: <a href="https://chem-bla-ics.linkedchemistry.info/2010/10/21/oscar-text-mining-in-taverna.html">a Taverna 2 plugin <i class="fa-solid fa-recycle fa-xs"></i></a>,
and command line utilities. The <a href="https://chem-bla-ics.linkedchemistry.info/2010/10/28/oscar4-java-api-chemical-name.html">Oscar4 Java API <i class="fa-solid fa-recycle fa-xs"></i></a>
has slightly evolved in the last three weeks, removing some complexity. In this post, I will introduce the command
line utilities.</p>

<h2 id="oscar4">Oscar4</h2>

<p>Most people will mainly be interested in the full Oscar4 program, to extract chemical entities. Oscar3 was
also capable of extracting data (like <a href="https://chem-bla-ics.linkedchemistry.info/2006/09/08/chemical-archeology-oscar3-to.html">NMR spectra <i class="fa-solid fa-recycle fa-xs"></i></a>),
but that is not yet being ported. The OscarCLI program takes input, extracts chemicals, and where possible resolves
them into connection tables (viz. InChI).</p>

<p>To extract chemicals from a line of text (e.g. <em>“This is propane.”</em>), you do:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>java <span class="nt">-cp</span> oscar4-cli-4.0-SNAPSHOT.jar <span class="se">\</span>
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI <span class="se">\</span>
  This is propane.
propane: <span class="nv">InChI</span><span class="o">=</span>1/C3H8/c1-3-2/h3H2,1-2H3
</code></pre></div></div>

<p>For larger chunks of text it is easier to route it via <a href="http://en.wikipedia.org/wiki/Standard_streams">stdin</a>,
for which we can use the <code class="language-plaintext highlighter-rouge">-stdin</code> option:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">echo</span> <span class="s2">"This is propane."</span> | <span class="se">\</span>
  java <span class="nt">-cp</span> oscar4-cli-4.0-SNAPSHOT.jar <span class="se">\</span>
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI <span class="se">\</span>
  <span class="nt">-stdin</span>
propane: <span class="nv">InChI</span><span class="o">=</span>1/C3H8/c1-3-2/h3H2,1-2H3
</code></pre></div></div>

<p>That way, we can easily process large plain text files (output omitted):</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>largeFile.txt | <span class="se">\</span>
  java <span class="nt">-cp</span> oscar4-cli-4.0-SNAPSHOT.jar <span class="se">\</span>
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI <span class="se">\</span>
  <span class="nt">-stdin</span>
</code></pre></div></div>
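
<p>The plain text output is easy to post-process with standard tools. As a minimal sketch, assuming every output line has the
<code>name: InChI=...</code> shape shown above and that the name itself contains no colon, the following counts the unique structures.
The <code>count_unique_inchis</code> helper is hypothetical and not part of Oscar4, and the <code>printf</code> lines are a stand-in for real OscarCLI output:</p>

```shell
# Hypothetical helper: count the distinct InChI strings in
# "name: InChI=..." lines read from stdin.
count_unique_inchis() {
  # Keep only the part after the first ": ", then de-duplicate and count.
  sed -n 's/^[^:]*: \(InChI=.*\)$/\1/p' | sort -u | wc -l | tr -d ' '
}

# Stand-in for OscarCLI output, so the sketch runs without the jar:
# three recognized names, but only two distinct structures.
printf '%s\n' \
  'propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3' \
  'Propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3' \
  'methane: InChI=1/CH4/h1H4' | count_unique_inchis
```

<p>With real data you would replace the <code>printf</code> with the <code>java ... OscarCLI -stdin</code> pipeline from above.</p>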

<p>If you prefer RDF output, for further integration, use the <code class="language-plaintext highlighter-rouge">-output text/turtle</code> option:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>largeFile.txt | <span class="se">\</span>
  java <span class="nt">-cp</span> oscar4-cli-4.0-SNAPSHOT.jar <span class="se">\</span>
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI <span class="se">\</span>
  <span class="nt">-stdin</span> <span class="nt">-output</span> text/turtle
</code></pre></div></div>

<p>This returns RDF using the <a href="http://code.google.com/p/semanticchemistry/">CHEMINF</a> ontology like:</p>

<div class="language-turtle highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">@prefix</span><span class="w"> </span><span class="nn">dc:</span><span class="w">  </span><span class="p">.</span><span class="w">
</span><span class="kd">@prefix</span><span class="w"> </span><span class="nn">rdfs:</span><span class="w">  </span><span class="p">.</span><span class="w">
</span><span class="kd">@prefix</span><span class="w"> </span><span class="nn">ex:</span><span class="w">  </span><span class="p">.</span><span class="w">
</span><span class="kd">@prefix</span><span class="w"> </span><span class="nn">cheminf:</span><span class="w">  </span><span class="p">.</span><span class="w">
</span><span class="kd">@prefix</span><span class="w"> </span><span class="nn">sio:</span><span class="w"> </span><span class="p">.</span><span class="w">

</span><span class="nn">ex:</span><span class="n">entity0
</span><span class="w">  </span><span class="nn">rdfs:</span><span class="n">subClassOf</span><span class="w"> </span><span class="nn">cheminf:</span><span class="n">CHEMINF_000000</span><span class="w"> </span><span class="p">;</span><span class="w">
  </span><span class="nn">dc:</span><span class="n">label</span><span class="w"> </span><span class="s">"propane"</span><span class="w"> </span><span class="p">;</span><span class="w">
  </span><span class="nn">cheminf:</span><span class="n">CHEMINF_000200</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="k">a</span><span class="w"> </span><span class="nn">cheminf:</span><span class="n">CHEMINF_000113</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">sio:</span><span class="n">SIO_000300</span><span class="w"> </span><span class="s">"InChI=1/C3H8/c1-3-2/h3H2,1-2H3"</span><span class="w"> </span><span class="p">.</span><span class="w">
  </span><span class="p">]</span><span class="w"> </span><span class="p">.</span><span class="w">
</span></code></pre></div></div>

<p>We can also use <a href="http://jericho.htmlparser.net/docs/index.html">Jericho</a> to extract text from HTML pages, which is made
available with the <code class="language-plaintext highlighter-rouge">-html</code> option. Here we pull in a <a href="http://www.beilstein-journals.org/bjoc/">Beilstein Journal of Organic Chemistry</a>
paper with <a href="http://en.wikipedia.org/wiki/Wget">wget</a>:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>wget <span class="nt">-qO-</span> https://doi.org/10.3762/bjoc.6.122 | <span class="se">\</span>
  java <span class="nt">-cp</span> oscar4-cli-4.0-SNAPSHOT.jar <span class="se">\</span>
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI <span class="se">\</span>
  <span class="nt">-stdin</span> <span class="nt">-html</span>
</code></pre></div></div>

<p>This will return 271 chemical entities recognized in the text, matching 48 unique chemical structures.</p>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="textmining" /><category term="beilstein" /><category term="inchikey:ATUOYWHBWRKTHZ-UHFFFAOYSA-N" /><summary type="html"><![CDATA[One goal of my three month project is to take Oscar4 to the community. We want to get it used more, and we need a larger development community. Oscar4 and the related technologies do a good, sometimes excellent, job, but have to be maintained, just like any other piece of code. To make using it easier, we are developing new APIs, as well as two user-oriented applications: a Taverna 2 plugin , and command line utilities. The Oscar4 Java API has slightly evolved in the last three weeks, removing some complexity. In this post, I will introduce the command line utilities.]]></summary></entry></feed>