<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://chem-bla-ics.linkedchemistry.info/feed/by_tag/textmining.xml" rel="self" type="application/atom+xml" /><link href="https://chem-bla-ics.linkedchemistry.info/" rel="alternate" type="text/html" /><updated>2026-04-19T09:50:36+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/feed/by_tag/textmining.xml</id><title type="html">chem-bla-ics</title><subtitle>Chemblaics (pronounced chem-bla-ics) is the science that uses open science and computers to solve problems in chemistry, biochemistry and related fields.</subtitle><author><name>Egon Willighagen</name></author><entry><title type="html">One Million IUPAC names #3: the 200 thousand milestone and 1 million IUPAC names</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/06/09/one-million-iupac-names.html" rel="alternate" type="text/html" title="One Million IUPAC names #3: the 200 thousand milestone and 1 million IUPAC names" /><published>2025-06-09T00:00:00+00:00</published><updated>2025-06-09T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/06/09/one-million-iupac-names</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/06/09/one-million-iupac-names.html"><![CDATA[<p>I could not find the time earlier to report (<a href="https://chem-bla-ics.linkedchemistry.info/2025/06/08/iccs2025-1-back-in-noordwijkerhout.html">reason</a>),
but three weeks ago we passed the fourth milestone release of the CCZero IUPAC names found in literature collection. This release contains
200026 IUPAC names, 168702 unique names, reflecting 116207 unique InChIKeys. Time for an update of the
<a href="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html">One Million IUPAC names</a> project.</p>

<p>The current count actually is just above 230 thousand IUPAC names, but further growth may require new approaches,
such as the <a href="https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html">four ideas</a>
I posted earlier. I have gone through all full-text Open Access articles provided by the <a href="https://europepmc.org/RestfulWebService">Europe PMC API</a>.
Now, this list is not static, but I wanted to start using their <a href="https://europepmc.org/downloads">bulk downloads</a> anyway.</p>

<h2 id="the-current-results">The current results</h2>

<p>I have been looking at the names coming in. Some are short, others long. The complexity is fascinating and I will
have to brush up my cheminformatics skills to make chemical space plots and visualize the structural diversity.
I also note that the current workflow handles Unicode characters well, and we have plenty of names
like <code class="language-plaintext highlighter-rouge">ε,ε-carotene-3,3’-dione</code>. There are also names that I do not expect to be really valid, like
<code class="language-plaintext highlighter-rouge">hydroxymethyl methacrylate-</code>, which end with a hyphen (41 in total), but their overall count is low.
And OPSIN is happy with them, so these names fit the rules.</p>

<p>The ten longest names (so far) are these (with the lengths 322, 324, 332, 357, 371, 373, 376, 421, 429, and 626):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(5Z)-3-ethyl-5-[[4-[15-[7-[(Z)-(3-ethyl-4-oxo-2-sulfanylidene-1,3-thiazolidin-5-ylidene)methyl]-2,1,3-benzothiadiazol-4-yl]-9,9,18,18-tetra(nonyl)-5,14-dithiapentacyclo[10.6.0.03,10.04,8.013,17]octadeca-1(12),2,4(8),6,10,13(17),15-heptaen-6-yl]-2,1,3-benzothiadiazol-7-yl]methylidene]-2-sulfanylidene-1,3-thiazolidin-4-one
(Z)-[[4-[[(Z)-N’-carbamoyl-N-[2-[2-[2-[[3-[(4S)-6,8-dichloro-2-methyl-3,4-dihydro-1H-isoquinolin-4-yl]phenyl]sulfonylamino]ethoxy]ethoxy]ethyl]carbamimidoyl]amino]butylamino]-[2-[2-[2-[[3-[(4S)-6,8-dichloro-2-methyl-3,4-dihydro-1H-isoquinolin-4-yl]phenyl]sulfonylamino]ethoxy]ethoxy]ethylamino]methylene]urea dihydrochloride
2-((Z)-2-((6-(4-(6-((Z)-(1-(dicyanomethylene)-5,6-difluoro-3-oxo-1H-inden-2(3H)-ylidene)methyl)-4,4-bis(2-ethylhexyl)-4H-cyclopenta[1,2-b:5,4-b′]dithiophen-2-yl)-2,3-bis(hexyloxy)phenyl)-4-(5,7-diethylundecan-6-yl)-4H-cyclopenta[1,2-b:5,4-b′]dithiophen-2-yl)methylene)-5,6-difluoro-3-oxo-2,3-dihydro-1H-inden-1-ylidene)malononitrile
(2S,4S,5R,6R)‐5‐acetamido‐2‐[(2S,3R,4R,5S,6R)‐5‐[(2S,3R,4R,5R,6R)‐3‐acetamido‐4,5‐dihydroxy‐6‐(hydroxymethyl)oxan‐2‐yl]oxy‐2‐[(2R,3S,4R,5R,6R)‐4,5‐dihydroxy‐2‐(hydroxymethyl)‐6‐[(E,2S,3R)‐3‐hydroxy‐2‐(octadecanoylamino)octadec‐4‐enoxy]oxan‐3‐yl]oxy‐3‐hydroxy‐6‐(hydroxymethyl)oxan‐4‐yl]oxy‐4‐hydroxy‐6‐[(1R,2R)‐1,2,3‐trihydroxypropyl]oxane‐2‐carboxylic acid
(2R,3S,4R,5R,7S,9S,10S,11R,12S,13R)-7-[(benzylcarbamoyl)oxy]-2-(1-{[(2R,3R,4R,5R,6R)-5-hydroxy-3,4-dimethoxy-6-methyltetrahydro-2H-pyran-2-yl]oxy}propan-2-yl)-10-{[(2S,3R,6R)-3-hydroxy-4-(methoxyimino)-6-methyltetrahydro-2H-pyran-2-yl]oxy}-3,5,7,9,11,13-hexamethyl-6,14-dioxo-12-{[(2S,5R,7R)-2,4,5-trimethyl-1,4-oxazepan-7-yl]oxy}oxacyclotetradecan-4-yl 3-methylbutanoate
2-[4-[2-[[(2R)-1-[[(4R,7S,10S,13R,16S,19R)-10-(4-aminobutyl)-4-[[(2R,3R)-1,3 dihydroxybutan-2-yl]carbamoyl]-7-[(1R)-1-hydroxyethyl]-16-[(4-hydroxyphenyl)methyl]-13-(1H-indol3-ylmethyl)-6,9,12,15,18-pentaoxo-1,2-dithia-5,8,11,14,17-pentazacycloicos-19-yl]amino]-1-oxo-3 phenylpropan-2-yl]amino]-2-oxoethyl]-7,10-bis(carboxymethyl)-1,4,7,10-tetrazacyclododec-1-yl]acetic acid
(2R,3S,4R,5R,7S,9S,10S,11R,12S,13R)-12-{[(2R,4R,5S,6S)-4,5-dihydroxy-4,6-dimethyltetrahydro-2H-pyran-2-yl]oxy}-7-hydroxy-2-(1-{[(2R,3R,4R,5R,6R)-5-hydroxy-3,4-dimethoxy-6-methyltetrahydro-2H-pyran-2-yl]oxy}propan-2-yl)-10-{[(2S,3R,6R)-3-hydroxy-4-(methoxyimino)-6-methyltetrahydro-2H-pyran-2-yl]oxy}-3,5,7,9,11,13-hexamethyl-6,14-dioxooxacyclotetradecan-4-yl 3-methylbutanoate
(2S,4S,5R,6R)‐5‐acetamido‐2‐[(2S,3R,4R,5S,6R)‐5‐[(2S,3R,4R,5R,6R)‐3‐acetamido‐5‐hydroxy‐6‐(hydroxymethyl)‐4‐[(2R,3R,4S,5R,6R)‐3,4,5‐trihydroxy‐6‐(hydroxymethyl)oxan‐2‐yl]oxyoxan‐2‐yl]oxy‐2‐[(2R,3S,4R,5R,6R)‐4,5‐dihydroxy‐2‐(hydroxymethyl)‐6‐[(E,2S,3R)‐3‐hydroxy‐2‐(octadecanoylamino)octadec‐4‐enoxy]oxan‐3‐yl]oxy‐3‐hydroxy‐6‐(hydroxymethyl)oxan‐4‐yl]oxy‐4‐hydroxy‐6‐[(1R,2R)‐1,2,3‐trihydroxypropyl]oxane‐2‐carboxylic acid
(2R,3S,4R,5R,7S,9S,10S,11R,12S,13R)-12-{[(2R,4R,5S,6S)-4,5-dihydroxy-4,6-dimethyltetrahydro-2H-pyran-2-yl]oxy}-2-(1-{[(2R,3R,4R,5R,6R)-5-hydroxy-3,4-dimethoxy-6-methyltetrahydro-2H-pyran-2-yl]oxy}propan-2-yl)-10-{[(2S,3R,6R)-3-hydroxy-4-(methoxyimino)-6-methyltetrahydro-2H-pyran-2-yl]oxy}-3,5,7,9,11,13-hexamethyl-7-({[2-(2-methyl-5-nitro-1H-imidazol-1-yl)ethyl]carbamoyl}oxy)-6,14-dioxooxacyclotetradecan-4-yl 3-methylbutanoate
N-[(2S,3R,4R,5S,6R)-5-[(2S,3R,4R,5S,6R)-3-amino-5-[(2S,3R,4R,5S,6R)-3-amino-5-[(2S,3R,4R,5S,6R)-3-amino-5-[(2S,3R,4R,5S,6R)-3-amino-5-[(2S,3R,4R,5S,6R)-3-amino-5-[(2S,3R,4R,5S,6R)-3-amino-4,5-dihydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-2-[(2R,3S,4R,5R,6S)-5-amino-6-[(2R,3S,4R,5R,6R)-5-amino-4,6-dihydroxy-2-(hydroxymethyl)oxan-3-yl]oxy-4-hydroxy-2-(hydroxymethyl)oxan-3-yl]oxy-4-hydroxy-6-(hydroxymethyl)oxan-3-yl]carbamate
</code></pre></div></div>
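<p>A ranking like the one above can be reproduced with a few lines of Python. This is only a minimal sketch, not the script used for this post: the function name <code class="language-plaintext highlighter-rouge">longest_names</code> and the toy input are made up for illustration, and lengths are counted in Unicode code points.</p>

```python
import heapq


def longest_names(names, n=10):
    """Return (length, name) pairs for the n longest distinct names, shortest first."""
    unique = set(names)
    # Sort key: length first, then the name itself, so that ties are deterministic.
    top = heapq.nlargest(n, unique, key=lambda name: (len(name), name))
    return [(len(name), name) for name in reversed(top)]


# Toy input; in practice this would be the list of names from a milestone release.
names = ["methane", "ethanol", "propan-2-ol", "methane", "2-methylpropan-2-ol"]
for length, name in longest_names(names, n=3):
    print(length, name)
```

<p>Using <code class="language-plaintext highlighter-rouge">heapq.nlargest</code> avoids sorting the full list when only the top ten are needed.</p>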

<p>That last compound has the InChIKey <code class="language-plaintext highlighter-rouge">DKPKDPKJVDQUPD-XGBIXEJNSA-M</code> and cannot be found with Google or in PubChem.
It looks like this:</p>

<p><img src="/assets/images/iupac_626.png" alt="" /></p>

<p>There are <a href="https://pubchem.ncbi.nlm.nih.gov/#query=N-%5B(2S%2C3R%2C4R%2C5S%2C6R)-5-%5B(2S%2C3R%2C4R%2C5S%2C6R)-3-amino-5-%5B(2S%2C3R%2C4R%2C5S%2C6R)-3-amino-5-%5B(2S%2C3R%2C4R%2C5S%2C6R)-3-amino-5-%5B(2S%2C3R%2C4R%2C5S%2C6R)-3-amino-5-%5B(2S%2C3R%2C4R%2C5S%2C6R)-3-amino-5-%5B(2S%2C3R%2C4R%2C5S%2C6R)-3-amino-4%2C5-dihydroxy-6-(hydroxymethyl)oxan-2-yl%5Doxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl%5Doxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl%5Doxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl%5Doxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl%5Doxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl%5Doxy-2-%5B(2R%2C3S%2C4R%2C5R%2C6S)-5-amino-6-%5B(2R%2C3S%2C4R%2C5R%2C6R)-5-amino-4%2C6-dihydroxy-2-(hydroxymethyl)oxan-3-yl%5Doxy-4-hydroxy-2-(hydroxymethyl)oxan-3-yl%5Doxy-4-hydroxy-6-(hydroxymethyl)oxan-3-yl%5Dcarbamate">some closely related compounds</a>,
though.</p>

<h2 id="chemicals-only-published-about-once">Chemicals only published about once</h2>

<p>Some <a href="https://doi.org/10.59350/rzepa.28802">related data was blogged</a> by <a href="https://orcid.org/0000-0002-8635-8390">Henry Rzepa</a> last week,
with this quote by Lee from CAS:</p>

<blockquote>
  <p>38.5% of the current substances have only 1 reference</p>
</blockquote>

<p>Apparently, based on <a href="https://www.cas.org/support/documentation/chemical-substances">CAS Registry</a> data,
more than 1 in 3 chemical substances have been published about only once, and the remaining 61.5%
at least twice. I agree with Henry here: with organic chemistry literature in mind, I would have
expected that 38.5% to be higher.</p>

<p>Anyway, since this project is not tracking in which articles IUPAC names are found, I have no data to study this.</p>

<h2 id="1-million-iupac-names">1 million IUPAC names</h2>

<p>So, the primary goal of this project is to reach one million IUPAC names. We are currently at around 23%.
Not bad, considering we started in February. And we have plenty of untouched literature left.</p>

<p>But I also applied <a href="https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html">idea 1</a>,
the varying names. The idea is that this way I can explode the number of compounds. For the compound above,
just enumerating all replacements of <code class="language-plaintext highlighter-rouge">OH</code> with <code class="language-plaintext highlighter-rouge">OMe</code> and <code class="language-plaintext highlighter-rouge">OEt</code> would already help a lot.</p>

<p>Because I wanted to make sure I could answer positively at the ICCS if we made it to one million
CCZero IUPAC names, I implemented a very simple enumeration script. Really dumb approach. But the
results are interesting. I started with the 200026 names from the milestone. If I
<a href="https://github.com/BlueObelisk/iupac-names/blob/main/explode.groovy">explode</a> these names,
I get 1,377,127 IUPAC names, well above the target. Even if I remove variations that only differ in the Unicode
character used for hyphens, I still have 1,162,107 IUPAC names.</p>
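<p>Removing the hyphen variations can be sketched as follows. This is a minimal illustration and not the code from the repository: the list of hyphen-like characters is my assumption, and the helper names <code class="language-plaintext highlighter-rouge">normalize_hyphens</code> and <code class="language-plaintext highlighter-rouge">count_unique</code> are hypothetical.</p>

```python
# Hyphen-like code points that typesetting often substitutes for the ASCII hyphen.
# This selection is an assumption, not the set used by the project:
# U+2010 hyphen, U+2011 non-breaking hyphen, U+2012 figure dash,
# U+2013 en dash, U+2212 minus sign.
HYPHEN_LIKE = "\u2010\u2011\u2012\u2013\u2212"
TO_ASCII_HYPHEN = str.maketrans({ch: "-" for ch in HYPHEN_LIKE})


def normalize_hyphens(name):
    """Replace every hyphen-like character in a name with the ASCII hyphen."""
    return name.translate(TO_ASCII_HYPHEN)


def count_unique(names):
    """Count names that are still distinct after hyphen normalization."""
    return len({normalize_hyphens(name) for name in names})


names = [
    "propan-2-ol",
    "propan\u20102\u2010ol",  # same name, typeset with U+2010
    "propan\u20112\u2011ol",  # same name, typeset with U+2011
    "butan-2-ol",
]
print(len(set(names)), "raw strings,", count_unique(names), "after normalization")
```

<p>The same normalized string can also serve as a key to keep one representative spelling per name.</p>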

<p>There is something interesting that I do not fully understand yet, however.
When I calculate the number of unique InChIKeys for the milestone, I get 117,726 keys, and when I do
this for the list of name variations, I get 203,979 keys. So, while the IUPAC name list is six to seven
times as long, the list of InChIKeys is not even twice as long. Well, I guess that is why this is called
research.</p>]]></content><author><name>Egon Willighagen</name></author><category term="iupac" /><category term="textmining" /><category term="inchikey:DKPKDPKJVDQUPD-XGBIXEJNSA-M" /><category term="cito:containsAssertionFrom:10.59350/rzepa.28802" /><category term="europepmc" /><summary type="html"><![CDATA[I could not find the time earlier to report (reason), but three weeks ago we passed the fourth milestone release of the CCZero IUPAC names found in literature collection. This release contains 200026 IUPAC names, 168702 unique names, reflecting 116207 unique InChIKeys. Time for an update of the One Million IUPAC names project.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/iupac_626.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/iupac_626.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">One Million IUPAC names #2: the 100 thousand milestone</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html" rel="alternate" type="text/html" title="One Million IUPAC names #2: the 100 thousand milestone" /><published>2025-04-27T00:00:00+00:00</published><updated>2025-04-27T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html"><![CDATA[<p>Two and a half month into the <a href="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html">One Million IUPAC Names</a>
project, we passed <a href="https://github.com/BlueObelisk/iupac-names/releases/tag/milestone-100k">the third milestone</a>,
the one for 100 thousand IUPAC names (doi:<a href="https://doi.org/10.5281/zenodo.15266459">10.5281/zenodo.15266459</a>).
Time for an update.</p>

<p>This milestone release took a bit longer. Going from 50 to 100 thousand is a bigger step than from 10 to 50
thousand, and the open access chemistry literature was already done by then. Basically, I ran out of open access
chemistry publications. The scripts are now finding names in all (open access) literature, and the number of
new names per article is a lot lower. Still about 1 in every 20 to 30 articles. But the diversity in names
is not really going down, which is important.</p>

<p>The first few weeks, I used Google Colab to run a Jupyter notebook, initially created by
<a href="https://cpm.lumc.nl/research/bioinformatics-224/magnus-palmblad-5">Magnus</a>, but having to process more articles
to get a reasonable number of new IUPAC names required longer and longer jobs, and Google Colab
is not really fit for that (well, the free version anyway). So, I started using a local script. That turned out
to be able to handle up to 20 thousand articles in one go and runs at least twice as fast. Moreover, I can
run three of them in parallel.</p>

<p>And that had an impact. With each commit adding around 1000 new IUPAC names, the number of commits went up remarkably
last week:</p>

<p><img src="/assets/images/iupac-names-commits.png" alt="" /></p>

<p>At the current speed, I think we’ll make it to 150k soon and I added a new milestone for 200k, which sounds
doable in the next three weeks. That also means that 1M extracted IUPAC names from literature has become
a reasonable goal. And we can start thinking about the 2, 5, 10, 50 and 100 million IUPAC names. Those are,
at the current speed, rather unlikely to be reached from the open access literature anytime soon. That brings
us to the question of what will get us there. Well, I have some ideas.</p>

<h3 id="idea-1-name-variations">Idea 1: name variations</h3>

<p>First, I am figuring out some ways to make variants of names: not variants based on hyphens and spaces (that’s too easy),
but actual variations of the chemical structures. For example, I could exhaustively replace “methoxy” with “ethoxy”,
and iterate over the halogens and acyl chain lengths. I have little doubt that I can easily grow the list 5-fold
with this approach, maybe even 10-fold.</p>
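<p>To make the idea concrete, here is a minimal sketch of such an enumeration in Python. It is purely string-based, the substitution groups are hypothetical examples, and the function name <code class="language-plaintext highlighter-rouge">explode</code> is just an illustration. The generated strings are not guaranteed to be valid (or preferred) IUPAC names, so each variant would still have to be checked, for example with OPSIN.</p>

```python
import re
from itertools import product

# Hypothetical substitution groups: a fragment found in a name may be swapped
# for any other member of its own group.
GROUPS = [
    ("methoxy", "ethoxy"),
    ("fluoro", "chloro", "bromo", "iodo"),
]
FRAGMENT_TO_GROUP = {frag: group for group in GROUPS for frag in group}
# Longest fragments first, so "methoxy" is not matched as "m" + "ethoxy".
PATTERN = re.compile("|".join(sorted(FRAGMENT_TO_GROUP, key=len, reverse=True)))


def explode(name, limit=1000):
    """Return up to `limit` string variants of `name`, the original included."""
    pieces = []  # literal text (1-tuples) alternating with tuples of alternatives
    pos = 0
    for match in PATTERN.finditer(name):
        pieces.append((name[pos:match.start()],))
        pieces.append(FRAGMENT_TO_GROUP[match.group()])
        pos = match.end()
    pieces.append((name[pos:],))

    variants = []
    for combo in product(*pieces):
        variants.append("".join(combo))
        if len(variants) == limit:  # guard against combinatorial blow-up
            break
    return variants


for variant in explode("4-bromo-2-methoxyphenol"):
    print(variant)
```

<p>One halogen position and one alkoxy position already give 4 × 2 = 8 strings, which is why long names with many such fragments need the <code class="language-plaintext highlighter-rouge">limit</code> guard.</p>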

<h3 id="idea-2-hallucination">Idea 2: hallucination</h3>

<p>Another idea is that I could use tools that can generate IUPAC names for a limited set of compounds.
I once wrote code for alkanes myself and if I can find that, I may be able to generate additional names.
But perhaps more realistic is that I train a deep learning model and have it generate names for all compounds in
Wikidata (~1.5 million) or PubChem (&gt;100 million). STOUT needed 81 million compounds
(doi:<a href="https://doi.org/10.1186/s13321-021-00512-4">10.1186/s13321-021-00512-4</a>), but I don’t need a good model;
I just need a model that comes up with new, valid names. Hallucinated names, but valid.</p>

<p>While the list of valid names grows, I can retrain the deep-learned model and repeat. As long as the diversity
remains high enough, one could hypothesize that the deep learning will learn new tricks. And then,
that should be a near infinite source of additional names.</p>

<h3 id="idea-3-semi-closed-access-literature">Idea 3: (semi-)closed access literature</h3>

<p>Also, I haven’t touched closed access articles yet. This is all based on the collection of full texts
in <a href="https://europepmc.org/">Europe PMC</a>. For example, I could start with the green open access articles
in (Dutch) university repositories, particularly those of universities with large chemistry departments. PDF to text
tools are mature enough that this will provide a new source. Oh, and perhaps PhD theses, which are now
also increasingly archived in university repositories under open access. And that reminds me of a Dutch
project two decades ago doing exactly that. I wish I remembered the name.</p>

<h3 id="idea-4-alternatives-to-oscar4-and-europe-pmc">Idea 4: alternatives to Oscar4 and Europe PMC</h3>

<p>So, the first round of named entity recognition was with Europe PMC itself, as explained in
<a href="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html">the first post</a>. The move
to Oscar4 helped a lot. But there exist many other chemical NER tools, like the one described in
doi:<a href="https://doi.org/10.1093/bioinformatics/btn181">10.1093/bioinformatics/btn181</a>. And those may
find an additional number of names, even with just the literature I already covered.</p>

<p>Well, you get the idea.</p>

<h2 id="iccs-poster-rejected">ICCS poster rejected</h2>

<p>Unfortunately, the <a href="https://iccs-nl.org/">ICCS poster</a> abstract did not make the cut. The score was high enough,
but they received many abstracts and had to make a selection (of course, I am part of the ICCS organization,
and have more details of how it came about). I really like the project, and am eager to write up a paper around
it.</p>]]></content><author><name>Egon Willighagen</name></author><category term="iupac" /><category term="textmining" /><category term="oscar" /><category term="cito:citesForInformation:10.1186/s13321-021-00512-4" /><category term="cito:citesAsPotentialSolution:10.1093/bioinformatics/btn181" /><category term="europepmc" /><summary type="html"><![CDATA[Two and a half month into the One Million IUPAC Names project, we passed the third milestone, the one for 100 thousand IUPAC names (doi:10.5281/zenodo.15266459). Time for an update.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/iupac-names-commits.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/iupac-names-commits.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">One Million IUPAC names</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html" rel="alternate" type="text/html" title="One Million IUPAC names" /><published>2025-03-08T00:00:00+00:00</published><updated>2025-03-08T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html"><![CDATA[<p>Names of chemicals are part of the human user experience when browsing a chemical database. And literature too,
of course. Chemical names are also not easy to use, and what a chemical name means is not always clear.
This is why the <a href="https://en.wikipedia.org/wiki/International_Union_of_Pure_and_Applied_Chemistry">IUPAC</a>
started standardizing the nomenclature in chemistry, resulting in the <em>IUPAC names</em>. Each IUPAC name uniquely defines
a chemical structure. For example, <em>methane</em> is the IUPAC name for the chemical CH<sub>4</sub>.</p>

<p>So, when propagating chemical structures from the <a href="https://chem-bla-ics.linkedchemistry.info/2025/02/13/beiltein-journal-has-bioschemas.html">Beilstein Bioschemas feed</a>,
I was looking for names, IUPAC or not, ideally the name used in the article. When I asked about this,
the question came up if they could autogenerate IUPAC names, for which
<a href="https://doi.org/10.1038/s41598-021-94082-y">various</a>
<a href="https://doi.org/10.1186/s13321-021-00535-x">new</a>
<a href="https://doi.org/10.1186/s13321-021-00512-4">tools</a>
<a href="https://doi.org/10.1186/s13321-024-00941-x">exist</a>
(I think I am missing one from an American team, but cannot find the reference),
along with multiple established commercial tools.
Because the IUPAC nomenclature is a long list of naming rules, priorities, etc., a rule-based
algorithm is logical, but newer methods take a deep-learning approach.</p>

<p>Back to the chemical annotation of chemistry literature. This is of obvious interest: you want
to know where we can read more about a certain chemical. We need the chemical structures in
a database for that, linked to the articles. This is, of course, one of the original studies
of <em>cheminformatics</em>. But authors of the chemical literature do not provide this routinely
(<a href="https://chem-bla-ics.linkedchemistry.info/2025/02/13/beiltein-journal-has-bioschemas.html">this post</a>
shows a few exceptions, but it is still all too rare), so manual and automated curation
is needed, e.g. as done by <a href="https://en.wikipedia.org/wiki/Chemical_Abstracts_Service">Chemical Abstracts</a>.</p>

<p>Third, <a href="https://wikidata.org/">Wikidata</a> has <a href="https://scholia.toolforge.org/chemical/">about 1.4 million</a>
chemical compounds and many names. A <a href="https://www.wikidata.org/wiki/Wikidata:Property_proposal/Pending#IUPAC_name">property proposal for IUPAC names</a>
has long been pending, but once accepted in one form or another, it will require IUPAC names too.</p>

<h2 id="one-million-iupac-names">One million IUPAC names</h2>

<p>Thus, the idea came up, can we create a set of 1 million unique IUPAC names found in literature?
I asked on the <a href="https://elixir-europe.org/">ELIXIR Europe</a> slack channel if <a href="https://europepmc.org/">Europe PMC</a>
had such a dataset (doi:<a href="https://doi.org/10.1093/nar/gkad1085">10.1093/nar/gkad1085</a>). I knew they had been adding chemical
<a href="https://scholia.toolforge.org/topic/Q403574">named-entity recognition</a> (NER) results in
<a href="https://europepmc.org/Annotations">their annotation API</a>. I learned they used <a href="https://www.ebi.ac.uk/chebi/">ChEBI</a>.
Melanie Vollmar and Summer Rosonovski of Europe PMC gave useful information and support.
<a href="https://cpm.lumc.nl/research/bioinformatics-224/magnus-palmblad-5">Magnus Palmblad</a> also replied
and provided Python code to use the Europe PMC API to fetch names it returns and see if those
are IUPAC names. Well, that’s easy. We have <a href="https://opsin.ch.cam.ac.uk/">OPSIN</a> for that
(see doi:<a href="https://doi.org/10.1021/ci100384d">10.1021/ci100384d</a>).</p>

<p>Unfortunately, the Europe PMC NER results are not ideal for IUPAC names. Just scanning
some five or six organic chemistry journals returned some 8 thousand IUPAC names in open access
articles. But it quickly started to be too limited: each set of articles returned
increasingly few new names. The reason is simple: the NER is too <em>greedy</em> and as a
result, does not easily recognize longer IUPAC names. It is too happy with a substring
of the IUPAC name. For example, when it encounters the IUPAC name <em>5-Bromo-1H-indole-3-carboxylic acid</em>,
it settles for <em>indole-3-carboxylic acid</em>:</p>

<p><img src="/assets/images/greedy.png" alt="" /></p>
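<p>One conceivable workaround for such truncated hits, shown here only as a hedged sketch, is to grow the reported span to the left until the previous whitespace and then validate the longer candidate with a name parser. This is not how Europe PMC or OSCAR work internally; the function <code class="language-plaintext highlighter-rouge">extend_left</code> and the example sentence are made up for illustration.</p>

```python
def extend_left(text, start, end):
    """Grow the span [start, end) leftwards until whitespace or the start of the text.

    A hit that begins in the middle of a token, like "indole-3-carboxylic acid"
    inside "5-Bromo-1H-indole-3-carboxylic acid", is widened to the token start.
    """
    while start and not text[start - 1].isspace():
        start -= 1
    return text[start:end]


text = "Compound 3, 5-Bromo-1H-indole-3-carboxylic acid, was obtained."
hit = "indole-3-carboxylic acid"  # what a substring-happy NER might report
start = text.find(hit)
candidate = extend_left(text, start, start + len(hit))
print(candidate)
```

<p>The widened candidate may pick up leading punctuation such as an opening bracket, so it should only be kept if a parser like OPSIN accepts it.</p>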

<h2 id="open-source-chemistry-analysis-routines">Open-Source Chemistry Analysis Routines</h2>

<p>During my PhD, in 2003, when I worked a few months with Prof. <a href="https://scholia.toolforge.org/author/Q908710">Peter Murray-Rust</a> (University of Cambridge)
and Prof. Janet Thornton (EMBL-EBI), I learned about the research by <a href="https://scholia.toolforge.org/author/Q28946549">Sam Adams</a>
(doi:<a href="https://doi.org/10.1039/B411699M">10.1039/B411699M</a>), <a href="https://scholia.toolforge.org/author/Q133040220">Joe Townsend</a>
(doi:<a href="https://doi.org/10.1039/B411033A">10.1039/B411033A</a>), and <a href="https://scholia.toolforge.org/author/Q90318722">Peter Corbett</a>
(doi:<a href="https://doi.org/10.1007/11875741_11">10.1007/11875741_11</a>). One of the tools that used
this research was (is) <a href="https://scholia.toolforge.org/topic/Q133037490">OSCAR</a>,
short for <em>Open-Source Chemistry Analysis Routines</em> (see <a href="https://blogs.ch.cam.ac.uk/pmr/2009/05/16/opsin-and-oscar-chemical-language-processing/">this detailed write up by Peter MR</a>).
Later, in 2010, I visited Peter again, as a postdoc, in Cambridge, and then
<a href="https://chem-bla-ics.linkedchemistry.info/2010/10/15/working-on-oscar-for-three-months.html">worked on the OSCAR project</a> too.
And while OSCAR did a lot more, the integration of <a href="https://chem-bla-ics.linkedchemistry.info/2010/12/26/oscar-training-data-models-etc.html">Corbett’s NER research</a>
made OSCAR the obvious follow-up step in finding IUPAC names in literature.</p>

<p>And because <a href="https://chem-bla-ics.linkedchemistry.info/2011/09/27/almost-year-ago-i-started-position-with.html">OSCAR4 had been integrated into Bioclipse</a>
(doi:<a href="https://doi.org/10.1186/1758-2946-3-41">10.1186/1758-2946-3-41</a>) and I had this ported to Bacting already
(doi:<a href="https://doi.org/10.21105/joss.02558">10.21105/joss.02558</a>), using this was trivial.
The use of Europe PMC is different now, however: we are no longer using the Annotations API,
but only use Europe PMC to find open access articles and to get the full text in XML format.
That allows a simple XPath search for <code class="language-plaintext highlighter-rouge">&lt;p&gt;</code> elements; the resulting strings are passed to OSCAR4,
and the recognized names are checked with OPSIN.
With this approach, processing two of the five or six journals we explored earlier,
we found another 40+ thousand IUPAC names. Quite a success, I am tempted to say.</p>
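<p>The shape of that workflow can be sketched in Python as follows. This is a hedged outline, not the actual script: OSCAR4 and OPSIN are Java libraries, so the <code class="language-plaintext highlighter-rouge">recognize</code> and <code class="language-plaintext highlighter-rouge">is_valid_name</code> callables below are hypothetical stand-ins, and the sample XML is a toy document rather than a real Europe PMC record.</p>

```python
import xml.etree.ElementTree as ET


def paragraphs(xml_text):
    """Yield the plain text of every <p> element, found with the XPath './/p'."""
    root = ET.fromstring(xml_text)
    for p in root.findall(".//p"):
        # itertext() also collects text inside inline markup such as <i> or <sub>.
        yield "".join(p.itertext())


def extract_names(xml_text, recognize, is_valid_name):
    """Run NER on each paragraph and keep only candidates the validator accepts.

    `recognize` takes a string and returns candidate names (stand-in for OSCAR4);
    `is_valid_name` takes a name and returns True or False (stand-in for OPSIN).
    """
    found = set()
    for text in paragraphs(xml_text):
        for candidate in recognize(text):
            if is_valid_name(candidate):
                found.add(candidate)
    return found


# Toy stand-ins, only to make the sketch runnable.
KNOWN = {"propan-2-ol", "benzoic acid"}


def toy_recognize(text):
    return [name for name in KNOWN | {"aspirin"} if name in text]


def toy_is_valid(name):
    return name in KNOWN


sample = (
    "<article><body>"
    "<p>We used <i>propan-2-ol</i> and aspirin.</p>"
    "<p>Then benzoic acid was added.</p>"
    "</body></article>"
)
print(sorted(extract_names(sample, toy_recognize, toy_is_valid)))
```

<p>Keeping the recognizer and the validator as plain callables makes it easy to swap in another NER tool without touching the XML handling.</p>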

<h2 id="a-blue-obelisk-project">A Blue Obelisk project</h2>

<p>So, I started a new <a href="https://blueobelisk.github.io/">Blue Obelisk</a> project,
<a href="https://github.com/BlueObelisk/iupac-names">iupac-names</a>, to collect 1M IUPAC names. For researchers
to use, learn from, etc. Just IUPAC names. Not even the chemical structure, nor the link to the
articles. The first is trivial to do with OPSIN, so the matching SMILES do not need to be stored.
Links to literature are tricky because of the aforementioned issues, and we only want to know
which (partial) IUPAC names occur in literature. If you really want to know in which articles
that IUPAC name is found, you can simply do a search in Europe PMC.</p>

<p>And because we only store IUPAC names, these are very basic facts (this is an IUPAC name, as defined
by OPSIN being able to generate a SMILES for this structure, and that string occurs in
some article) and we can share them as CCZero. We <a href="is:issue" title="milestone release">defined various milestones</a>,
and I am happy that the first two have been reached within two weeks:</p>

<ul>
  <li><a href="https://github.com/BlueObelisk/iupac-names/releases/tag/milestone-10k">Milestone 10k</a> (doi:<a href="https://doi.org/10.5281/zenodo.14965762">10.5281/zenodo.14965762</a>)</li>
  <li><a href="https://github.com/BlueObelisk/iupac-names/releases/tag/milestone-50k">Milestone 50k</a> (doi:<a href="https://doi.org/10.5281/zenodo.14978557">10.5281/zenodo.14978557</a>)</li>
</ul>

<p>This second milestone has 53848 unique names, but as literature goes, there are interesting
variations, some likely because of typesetting leading to spaces added and missing. If
we ignore spaces and hyphens, we have 50534 names left (hence the milestone). But IUPAC
names are also not fully unique, partly because of Unicode character variations and Greek
letter alternatives, and you may wonder how many different chemical structures this set
reflects. While not perfect, the Standard InChI gives some lower limit, and we find 36528
InChIKeys in this second milestone.</p>

<p>Now, we need twenty times as much to reach the 1M IUPAC names, but we have many, many
more open access articles to process. The bottleneck seems to be mostly our workflow.</p>

<h3 id="can-you-contribute">Can you contribute?</h3>

<p>Yes, of course! This is an open science project. But please keep in mind the narrow focus of this
project: only IUPAC names which can be found in (open access) literature. This project does not accept
autogenerated names (PubChem would have given us many millions already), nor IUPAC names from existing
databases. Ideally, you are able to show the code you use to extract/find those names in literature.</p>

<h3 id="can-i-use-these-names">Can I use these names?</h3>

<p>First of all, this is what the CCZero license and open science nature of this project is about: reuse.
We love to hear how you are using these names, though, and we encourage you to write up how you
are using them. You can use <a href="https://datacite.org/">DataCite</a> to cite the release you used,
and citing this blog post by DOI is also possible.</p>

<h3 id="does-it-support-my-language-too">Does it support my language too?</h3>

<p>No, at this moment it only supports IUPAC names in English. Dutch, French, Spanish, or Chinese
IUPAC names are valid, but currently not supported. See also
<a href="https://chem-bla-ics.linkedchemistry.info/2010/12/30/text-mining-chemistry-from-dutch-or.html">this post</a>.</p>

<h3 id="will-there-be-a-publication">Will there be a publication?</h3>

<p>Magnus and I intend so. We already submitted an abstract to the <a href="https://iccs-nl.org/">International Conference on Chemical Structures</a>,
which has <a href="https://www.biomedcentral.com/collections/ICCS25">a Collection in the Journal of Cheminformatics</a>.
If the abstract gets accepted, of course, we can submit there. Otherwise, we will look for another venue,
likely <a href="https://en.wikipedia.org/wiki/Diamond_open_access">diamond open access</a>.</p>

<h3 id="where-is-your-script">Where is your script?</h3>

<p>Ah, fair point. We did not decide on the final license yet. I have used two scripts based on the template
by Magnus. As soon as we have finalized the license, we will make those available.</p>]]></content><author><name>Egon Willighagen</name></author><category term="iupac" /><category term="cheminf" /><category term="justdoi:10.1038/s41598-021-94082-y" /><category term="justdoi:10.1186/s13321-021-00512-4" /><category term="justdoi:10.1186/s13321-021-00535-x" /><category term="justdoi:10.1186/s13321-024-00941-x" /><category term="justdoi:10.1021/ci100384d" /><category term="oscar" /><category term="justdoi:10.1039/B411699M" /><category term="justdoi:10.1039/B411033A" /><category term="justdoi:10.1007/11875741_11" /><category term="textmining" /><category term="cito:usesMethodIn:10.1186/1758-2946-3-41" /><category term="cito:usesMethodIn:10.21105/JOSS.02558" /><category term="cito:usesMethodIn:10.1093/nar/gkad1085" /><category term="cito:citesAsEvidence:10.5281/zenodo.14965762" /><category term="cito:citesAsEvidence:10.5281/zenodo.14978557" /><category term="europepmc" /><summary type="html"><![CDATA[Names of chemicals are part of the human user experience when browsing a chemical database. And literature too, of course. Chemical names are also not easy to use, and what a chemical name means is not always clear. This is why the IUPAC started a standardizing nomenclature in chemistry, the IUPAC names. Each IUPAC name uniquely defines the chemical structure it defines. 
For example, methane is the IUPAC name for the chemical CH4.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/greedy.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/greedy.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Elsevier’s new text mining initiative is a step sideways</title><link href="https://chem-bla-ics.linkedchemistry.info/2014/02/15/elseviers-new-text-mining-initiative-is.html" rel="alternate" type="text/html" title="Elsevier’s new text mining initiative is a step sideways" /><published>2014-02-15T00:00:00+00:00</published><updated>2014-02-15T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2014/02/15/elseviers-new-text-mining-initiative-is</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2014/02/15/elseviers-new-text-mining-initiative-is.html"><![CDATA[<p>Elsevier’s <a href="http://www.elsevier.com/about/universal-access/content-mining-policies">new ideas on text mining</a> are getting a lot
<a href="http://www.nature.com/news/elsevier-opens-its-papers-to-text-mining-1.14659">attention</a> now. Sadly, they get it wrong, again.
On the bright side, all other publishers, which are <a href="http://www.nature.com/news/elsevier-opens-its-papers-to-text-mining-1.14659">expected to follow this year</a>,
can learn from this mistake.</p>

<p>Because if done right, the publishers can even help science forward, despite currently crippling progress. That sounds harsh, and surely
they have done a lot of good for science. In fact, we would not be where we are now without the publishers. But things have
changed. With the internet anyone can be a publisher. We see this with blogs, we see this with <a href="http://lulu.com/">Lulu.com</a>.
And, contrary to what some misinformed people think, this is independent from peer review. Publishers were important because they
provided a channel to disseminate knowledge. But paper publishing is no longer the most efficient way. In fact, in terms
of value, paper has been overtaken for some years now.</p>

<p>And we need more added value. Not the shipping of the knowledge, but keeping up is the issue. And there too, publishing is
inefficient: human language is nice for sharing ideas and concepts, but it fails at disseminating raw facts: measured data.
Anyone who has tried creating a data set to find patterns knows this: extracting the information is a lot of effort, mostly
caused by the broken paper publishing model. This is most apparent in some research domains where data repositories exist,
but sadly this applies to a small minority of data types.</p>

<p>Now, text mining seems in that sense the wrong question: why try to recover knowledge that should have gone into
repositories in the first place? I agree. However, we cannot just throw away all the knowledge kept in these papers, and
certainly not as long as people keep insisting on seeing only papers as scientific success. We are slowly seeing this
improve, but only very slowly. Things that were apparent to me as a student 20 years ago, are the things that scholars are
still struggling with today. Depressing indeed, but it does help you grow a good sense of patience.</p>

<p>And now, Elsevier wants to make a step forward, wants to be leading in science dissemination again. And they come up with
an intermediate solution between actual knowledge dissemination and profit: they come up with a license-model, increasing
their monopoly on knowledge and trying to lure the scientist into a non-commercial license. From a money-making
perspective this is what society expects from them. For someone who likes to see societal problems solved, this is
disappointing. They had a great opportunity to lead the field.</p>

<p>Now, is it all bad? Not at all. It’s a step, but not the step I would have liked to see. It will be a success: because the
CC-BY-NC data that will come out of it will be part of the web of knowledge. No one will care about the NC part, except
all those SMEs in Europe that work on products to help society, which will find it much harder to collaborate with other
companies, because they cannot share the knowledge they created from analyzing the literature (does Elsevier want a monopoly
in this analysis?).</p>

<p>Nor will many in the academic community complain. Surely, those who have worried about this will. But scholars
at universities do not care about NC licenses. After all, universities are not commercial. Asking a student to pay
30 thousand euro for a year is surely not commercial. That is the consensus. But I note that this consensus has not been
tried in court, and I am looking forward to the day it will happen. Elsevier will likely not challenge this, and silently
accept this situation. Just like Microsoft never made a big deal out of people copying office versions of their operating
system for use at home: you do not bite the hand that feeds you (too hard). You rather
<a href="http://svpow.com/2013/12/06/elsevier-is-taking-down-papers-from-academia-edu/">go after others</a>, like
<a href="http://academia.edu/">Academia.edu</a>. It will not be scholars Elsevier will enforce the NC on, and it will not be large
companies either: if any, it will be the SMEs. Support them, and do not agree with the license.</p>

<p>Well, it was a nice opportunity for Elsevier. I only see my choice to sign <a href="http://thecostofknowledge.com/">The Cost of Knowledge</a>
reaffirmed.</p>

<p>The choice of the NC clause is totally useless in any context of dissemination. I call for Elsevier to at least add this
option, if they are serious about improving: text mining is provided to subscribers, via a decent API, adhering to:</p>

<ol>
  <li>Facts extracted from literature are licensed CCZero and attribution is paid (facts are copyright free in most parts of the world)</li>
  <li>Output can contain “snippets” of the original text under international “fair use” concepts, and licensed as CC-BY</li>
</ol>

<p>Any scientist is expected to attribute the source of information in the first place, and it is kind of sad Elsevier is on
such a bad footing with their audience that they feel this must be enforced via a contract, but that is not a problem. I also
see no reason to deviate from international law about “fair use”; I do understand this is probably an ill-defined concept,
but 200 characters seems pretty limited to me, as facts can be spread over sentences longer than this.</p>

<p>I know that many will disagree on the CCZero license, and many will feel awkward about giving away data. It has value, right?
It’s your property, right? I am not going to argue against that. But personally I do not understand how it aligns with the
idea of scientific dissemination. Holding back knowledge as part of making knowledge available? How exactly does that make
sense? Importantly, just like with software, Open is not the same as Without-Cost! Hosting and sharing Open Data also costs
money (particularly, if it is 1 TB of data). Those are different concepts.</p>

<p>However, I also stress that the scholars have a great responsibility here: I call for all Elsevier journal editorial
boards to not accept this deal either. In fact, all editorial boards have great say in this: it’s them who make a journal
valuable. I also call on all scholars to be aware of the consequences of selling away their copyright. That is a choice in the
current era. There are plenty of means to disseminate your science <em>without</em> (much) cost, and APC is a flawed argument.</p>

<p>The current step by Elsevier, after all the effort from many, is not a step forward, it’s a step sideways. Elsevier,
I know you can do better. Are you willing?</p>

<p>I am willing, and have been supporting science by making data available as CCZero. However, I also am happy if others
are not ready for this, or have other reasons not to. It is not always under their control. For example, I have heard
stories where data has been used by politicians as small change to get industry to test their products for safety.
I also accept that getting funding as a scholar is hard work, often not paid for, and that it is hard to give away
your only security of a future career. Then again, we all know what data is valuable, has already given its value,
or is of no use to you anymore. And in this latter case I ask you to consider making data available: data of no use
to you anymore, but that could be valuable to others. Make it available, and get cited, and get value out of it that
you would not have received had it sat on some hard disk, where it would probably be lost in five years.</p>

<p>I also fully understand this is my opinion. Thus, not all data I make available is CCZero: I fully respect copyright
and license from others; in fact, I often feel I do much more than scientists who object to Open licenses, and who
just take data as their own as they please. That is why I insist often on clear copyright and license information.
Because if missing, default (local) law applies.</p>

<p>If you want to read more analysis, please refer to the following posts:</p>

<ol>
  <li><a href="http://www.nature.com/news/elsevier-opens-its-papers-to-text-mining-1.14659">Elsevier opens its papers to text-mining</a></li>
  <li><a href="http://blogs.ch.cam.ac.uk/pmr/2014/02/06/elseviers-tdm-terms-tac-can-they-force-us-to-copyright-data-2/">#elsevier’s TDM Terms (TaC): Can they force us to copyright data? (2)</a></li>
  <li><a href="http://blogs.ch.cam.ac.uk/pmr/2014/02/10/natures-recent-news-article-on-text-and-data-mining-was-an-unacceptable-marketing-exercise-i-ask-them-to-renounce-licensing/">Nature’s recent “news” article on Text and Data Mining was unacceptable [redacted]; I ask them to renounce licensing.</a></li>
  <li><a href="http://blogs.ch.cam.ac.uk/pmr/2014/02/10/natures-recent-news-article-on-text-and-data-mining-was-an-unacceptable-marketing-exercise-i-ask-them-to-renounce-licensing/#comment-150096">“Dear Peter,”, Richard van Noorden</a></li>
  <li><a href="http://blogs.ch.cam.ac.uk/pmr/2014/02/14/reply-to-richard-van-noorden/">Reply to Richard van Noorden</a></li>
</ol>]]></content><author><name>Egon Willighagen</name></author><category term="publishing" /><category term="textmining" /><category term="cito:citesAsEvidence:10.1038/506017a" /><summary type="html"><![CDATA[Elsevier’s new ideas on text mining are getting a lot attention now. Sadly, they get it wrong, again. On the bright side, all other publishers, which are expected to follow this year, can learn from this mistake.]]></summary></entry><entry><title type="html">Text mining chemistry from Dutch or Swedish texts</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/12/30/text-mining-chemistry-from-dutch-or.html" rel="alternate" type="text/html" title="Text mining chemistry from Dutch or Swedish texts" /><published>2010-12-30T00:00:00+00:00</published><updated>2010-12-30T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/12/30/text-mining-chemistry-from-dutch-or</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/12/30/text-mining-chemistry-from-dutch-or.html"><![CDATA[<p><a href="http://oscar3-chem.sf.net/">Oscar</a> is a text miner. It mines in text for chemistry.
<a href="https://bitbucket.org/wwmm/oscar4/">Oscar4</a> is the next iteration of Oscar
code that I worked on in the past three months, with Lezan, Sam, and David. I blogged about
aspects of Oscar4 at several occasions:</p>

<ul>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/10/15/working-on-oscar-for-three-months.html">Working on Oscar for three months <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/10/21/oscar-text-mining-in-taverna.html">Oscar text mining in Taverna <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="http://chem-bla-ics.blogspot.com/2010/10/multiple-unit-test-inheritance-with.html">Multiple unit test inheritance with JExample</a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/10/28/oscar4-java-api-chemical-name.html">Oscar4 Java API: chemical name dictionaries <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/11/18/oscar4-command-line-utilities.html">Oscar4 command line utilities <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="http://chem-bla-ics.blogspot.com/2010/11/installing-oscar.html">Installing Oscar</a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/11/29/adding-new-dictionary-to-oscar.html">Adding a new dictionary to Oscar <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with.html">Status update on BJOC analysis with Oscar and ChemicalTagger <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with_11.html">Status update on BJOC analysis with Oscar and ChemicalTagger #2 <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/supramolecular-chemistry.html">Supramolecular chemistry <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/12/23/status-update-on-bjoc-analysis-with_23.html">Status update on BJOC analysis with Oscar and ChemicalTagger #3 <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/12/26/oscar-training-data-models-etc.html">Oscar: training data, models, etc <i class="fa-solid fa-recycle fa-xs"></i></a></li>
</ul>

<p>These posts will serve as some initial critical mass for a draft report I plan to finish
today. I might have to blog some further posts with diagrams, here and there. This post is
actually one of them, and discusses something where Oscar can be expected to go next, now
that the design is cleaned up (though this effort is not halted now) and it has become
possible again to extend it. The over <a href="https://hudson.ch.cam.ac.uk/job/oscar4/lastBuild/testReport/">250 unit tests</a>
make this a lot easier too.</p>

<p>One aspect where I expect Oscar to go in 2011 is the support for other languages. To a very
large extent this is based on multi-language support in the dictionaries, as well as having
training data in a particular language. This also provides some context to my earlier post
about the <a href="https://chem-bla-ics.linkedchemistry.info/2010/12/26/oscar-training-data-models-etc.html">need for an Oscar training data repository <i class="fa-solid fa-recycle fa-xs"></i></a>.</p>

<p>This extension opens a number of options: analysis of patent literature in other languages,
monitoring of press releases in other languages, and news items in local newspapers, etc.
For example, it could analyse <a href="http://www.c2w.nl/energierijke-gistcel.119621.lynkx">this C2W news item</a>
on <a href="http://en.wikipedia.org/wiki/Yeast">yeast</a> cells:</p>

<p><img src="/assets/images/c2w.png" alt="" /></p>

<p>There are many use cases for such localized text mining. And it surely matters for determining
the impact of research.</p>

<p>Oscar has various places where language specifics are found. For example, in tokenization of a
text. One step here is the detection of sentence ends. This is done in most western languages
with a period, exclamation mark, question mark, etc. But periods (dots) are also used in
abbreviations. Similarly, colons can be used in chemical names. But every language comes
with different abbreviations that need to be recognized.</p>
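
<p>To make the abbreviation problem concrete, here is a minimal sketch of a sentence splitter that takes a
language-specific list of non-sentence-endings. The class <code>SentenceSplitter</code>, its constructor, and the
abbreviation lists are assumptions made up for this illustration; they are not the Oscar API and not the content
of the actual NonSentenceEndings class.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Oscar code: split text into sentences while
// skipping language-specific abbreviations that end with a period.
class SentenceSplitter {

    // Assumed example abbreviations per language code; illustrative only.
    private static final Map<String, Set<String>> NON_ENDINGS = Map.of(
        "en", Set.of("Prof.", "Dr.", "e.g.", "i.e."),
        "nl", Set.of("Prof.", "Dr.", "bijv.", "o.a.")
    );

    private final Set<String> nonEndings;

    SentenceSplitter(Locale locale) {
        // Unknown languages fall back to an empty abbreviation list.
        this.nonEndings = NON_ENDINGS.getOrDefault(locale.getLanguage(), Set.of());
    }

    List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            if (current.length() > 0) current.append(' ');
            current.append(token);
            boolean looksLikeEnd =
                token.endsWith(".") || token.endsWith("!") || token.endsWith("?");
            // A token only closes the sentence if it is not a known abbreviation.
            if (looksLikeEnd && !nonEndings.contains(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) sentences.add(current.toString());
        return sentences;
    }

    public static void main(String[] args) {
        String dutch = "Gebruik bijv. propaan. Klaar.";
        System.out.println(new SentenceSplitter(Locale.forLanguageTag("nl")).split(dutch));
        System.out.println(new SentenceSplitter(Locale.US).split(dutch));
    }
}
```

<p>With the Dutch list the example text is split into two sentences; with the English list the Dutch
abbreviation <em>bijv.</em> is mistaken for a sentence end, which gives three.</p>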

<p>Currently, some abbreviations are found in <a href="https://bitbucket.org/wwmm/oscar4/src/005ffa00a69d/oscar4-core/src/main/java/uk/ac/cam/ch/wwmm/oscar/document/NonSentenceEndings.java">NonSentenceEndings</a>.
In the past three months, we have been cleaning up the code, and restructured the source code,
making it easier to detect such places. This class will likely undergo further refactoring, to
make the list of such non-sentence-endings configurable via files or so. What I expect to see
is that you initiate Oscar like this:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Oscar</span> <span class="n">oscar</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Oscar</span><span class="o">(</span><span class="nc">Locale</span><span class="o">.</span><span class="na">US</span><span class="o">);</span>
</code></pre></div></div>

<p>This might actually even make a nice student summer project. The biggest challenge will be in making a good
corpus of training data, like the SciBorg training data that was used for training Oscar3.</p>

<p>But the whole normalization is tainted with English language specifics too. For example, the normalizer
will have to ‘normalize’ the question marks, for which there exist several
<a href="http://en.wikipedia.org/wiki/Question_mark#Stylistic_variants">unicode variations</a>.
But the normalized variant is language dependent. For example, Greek and Armenian have different characters
(see <a href="http://en.wikipedia.org/wiki/Question_mark#Opening_and_closing_question_marks">this page</a>),
and then we have not even started talking about right-to-left scripts.</p>

<p>Besides localized dictionaries, a localized Oscar will also benefit from a localized <a href="http://opsin.ch.cam.ac.uk/">OPSIN</a>.
OPSIN seems to recognize the Dutch <a href="https://opsin.ch.cam.ac.uk/opsin/propaan.png">propaan</a>, but not
<a href="https://opsin.ch.cam.ac.uk/opsin/benzeen.png">benzeen</a>. I am not going to look at that soon, but if you are
interested, I recommend checking out Rich’s
<a href="https://doi.org/10.59350/bbrwt-e5n35">posts <i class="fa-solid fa-recycle fa-xs"></i></a>
<a href="https://doi.org/10.59350/vtadn-tdt17">about <i class="fa-solid fa-recycle fa-xs"></i></a>
<a href="https://doi.org/10.59350/nbtxd-kdz73">forking <i class="fa-solid fa-recycle fa-xs"></i></a>
OPSIN and writing patches.</p>

<p>Getting Oscar going for other languages is a challenge, but also offers new opportunities. Just email the
<a href="http://sourceforge.net/mailarchive/forum.php?forum_name=oscar3-chem-developers">oscar mailing list</a>
if you are interested and need help.</p>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="textmining" /><category term="justdoi:10.59350/vtadn-tdt17" /><category term="justdoi:10.59350/nbtxd-kdz73" /><category term="justdoi:10.59350/bbrwt-e5n35" /><summary type="html"><![CDATA[Oscar is a text miner. It mines in text for chemistry. Oscar4 is the next iteration of Oscar code that I worked on in the past three months, with Lezan, Sam, and David. I blogged about aspects of Oscar4 at several occasions:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/c2w.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/c2w.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Oscar: training data, models, etc</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/12/26/oscar-training-data-models-etc.html" rel="alternate" type="text/html" title="Oscar: training data, models, etc" /><published>2010-12-26T00:00:00+00:00</published><updated>2010-12-26T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/12/26/oscar-training-data-models-etc</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/12/26/oscar-training-data-models-etc.html"><![CDATA[<p><a href="https://sourceforge.net/projects/oscar3-chem/">Oscar</a> uses a Maximum Entropy Markov Model (MEMM) based on <a href="http://en.wikipedia.org/wiki/N-gram">n-grams</a>.
Peter Corbett has written this up (doi:<a href="https://doi.org/10.1186/1471-2105-9-S11-S4">10.1186/1471-2105-9-S11-S4</a>). So, it basically is statistics
once more. If you really want a proper bioinformatics education, do your PhD at a (proteo)chemometrics department.</p>

<p>N-grams are word parts of n characters. For example, the trigrams of <a href="http://en.wikipedia.org/wiki/Acetic_acid">acetic acid</a>
include <code class="language-plaintext highlighter-rouge">ace</code>, <code class="language-plaintext highlighter-rouge">cid</code>, <code class="language-plaintext highlighter-rouge">tic</code>, <code class="language-plaintext highlighter-rouge">eti</code>, and <code class="language-plaintext highlighter-rouge">aci</code>. N-grams of length four include acid, etic, and acet. The MEMM assigns weights to
these n-grams, and based on that decides if something is indeed a <em>named entity</em> (in Oscar terminology). For example,
consider the <code class="language-plaintext highlighter-rouge">acet</code> n-gram: acetone should be matched, but <code class="language-plaintext highlighter-rouge">facet</code>, which contains the same n-gram, should not.</p>
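
<p>As an illustration of what these character n-grams look like, the following small sketch lists them for a
single word. The class name <code>CharNGrams</code> is made up for this example and is not part of Oscar; the
MEMM weighting itself is not shown.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Oscar code: list the character n-grams of a word.
class CharNGrams {

    // Returns every substring of length n, in order of appearance.
    static List<String> of(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(of("acetic", 3)); // [ace, cet, eti, tic]
        System.out.println(of("acid", 3));   // [aci, cid]
        // Both words contain "acet", so one n-gram alone cannot decide.
        System.out.println(of("acetone", 4).contains("acet")); // true
        System.out.println(of("facet", 4).contains("acet"));   // true
    }
}
```

<p>Because both acetone and facet contain <code>acet</code>, a single n-gram cannot separate them; the model
has to weigh it together with the other n-grams of the word.</p>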

<p>Put this in the perspective of the ongoing refactoring of the Oscar software. We are changing normalization (e.g. converting
all Unicode hyphen alternatives into one specific hyphen), updating the tokenizer (e.g. changing the list of
non-sentence-endings like <em>Prof.</em>). It is clear this changes the n-grams typical for chemical-like things. Worse,
the weights are tuned towards the known n-grams, and statistical models are generally a bit overtrained for the
data, or, at least, specific for it.</p>
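
<p>The hyphen normalization mentioned above could look roughly like the sketch below. The class
<code>HyphenNormalizer</code> and the chosen list of code points are assumptions for the sake of illustration;
the actual Oscar normalizer may cover a different set of characters.</p>

```java
// Hypothetical helper, not the actual Oscar normalizer.
class HyphenNormalizer {

    // Assumed, non-exhaustive list of hyphen-like characters: HYPHEN,
    // NON-BREAKING HYPHEN, FIGURE DASH, EN DASH, and MINUS SIGN.
    private static final String HYPHEN_LIKE = "\u2010\u2011\u2012\u2013\u2212";

    // Replaces every hyphen-like character with the ASCII hyphen-minus.
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(HYPHEN_LIKE.indexOf(c) >= 0 ? '-' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input uses U+2010; the output uses the plain ASCII hyphen.
        System.out.println(normalize("2\u2010propanol"));
    }
}
```

<p>After such a step, a name written with any of these hyphen variants yields the same n-grams, which is
exactly why a change in the normalizer shifts the n-gram distribution the model was trained on.</p>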

<p>Now, if the distribution of n-grams changes, the weights in the model need to be updated too, to not degrade
the model performance. So, Oscar is useless if we cannot retrain its MEMM component after a refactoring. If
that were impossible, we would have effectively created an <em>intellectual monopoly</em>.</p>

<p>Thus, what the Oscar project needs, is one or more free sets of annotated literature, which can be used to
train new MEMM models. The SciBorg corpus was used to train the current Oscar3 and Oscar4 models. This data
(copyright <a href="http://rsc.org/">RSC</a>) will very likely be available under a <a href="http://creativecommons.org/licenses/">Creative Commons</a>
license (RSC++), but may have the NC clause, which would not be good for developing a business model around
the open source Oscar (such as providing a high-performance web service via a subscription service). I have
recently written up <a href="http://chem-bla-ics.blogspot.com/2010/12/re-why-i-and-you-should-avoid-nc.html">the problems the NC clause introduces</a>,
and some <a href="http://chem-bla-ics.blogspot.com/2010/12/blog-post.html">examples of commercial Open Source cheminformatics projects</a>.</p>

<p>We need not focus only on this SciBorg data, however. In fact, we will need multiple models anyway. For
example, the SciBorg papers (42, if I am not mistaken) cover a particular kind of literature. So, using it
introduces the risk of analysing papers outside the application domain. Furthermore, I am very
interested (and others indicated so too) in using Oscar for other languages. Surely, English is the major
language, but there are many use cases for an Oscar that works for other languages.</p>

<p>Therefore, what we need in the Oscar project is a registry of training (/test) data, annotated itself
with metadata around how that data was created (what quality assurance, what kind of named entity types,
how many domain experts were involved, etc), test results for those data sets, etc. My time on the Oscar
project is almost over, and I have no clue when I will be able to invest the same amount of time into the
project as I did in the past three months. But the creation of this registry is a clear step that must be
taken in the Oscar4 development.</p>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="textmining" /><category term="justdoi:10.1186/1471-2105-9-S11-S4" /><category term="inchikey:QTBSBXVTEAMEQO-UHFFFAOYSA-N" /><category term="inchikey:CSCPPACGZOOCGX-UHFFFAOYSA-N" /><summary type="html"><![CDATA[Oscar uses a Maximum Entropy Markov Model (MEMM) based on n-grams. Peter Corbett has written this up (doi:10.1186/1471-2105-9-S11-S4). So, it basically is statistics once more. If you really want a proper bioinformatics education, so do your PhD at a (proteo)chemometrics department.]]></summary></entry><entry><title type="html">Oscar4 command line utilities</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/11/18/oscar4-command-line-utilities.html" rel="alternate" type="text/html" title="Oscar4 command line utilities" /><published>2010-11-18T00:00:00+00:00</published><updated>2010-11-18T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/11/18/oscar4-command-line-utilities</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/11/18/oscar4-command-line-utilities.html"><![CDATA[<p>One goal of my three month project is to take Oscar4 to the community. We want to get it used more, and we need
a larger development community. Oscar4 and the related technologies do a good, sometimes excellent, job, but
have to be maintained, just like any other piece of code. To make using it easier, we are developing new APIs,
as well as two user-oriented applications: <a href="https://chem-bla-ics.linkedchemistry.info/2010/10/21/oscar-text-mining-in-taverna.html">a Taverna 2 plugin <i class="fa-solid fa-recycle fa-xs"></i></a>,
and command line utilities. The <a href="https://chem-bla-ics.linkedchemistry.info/2010/10/28/oscar4-java-api-chemical-name.html">Oscar4 Java API <i class="fa-solid fa-recycle fa-xs"></i></a>
has slightly evolved in the last three weeks, removing some complexity. In this post, I will introduce the command
line utilities.</p>

<h2 id="oscar4">Oscar4</h2>

<p>Most people will be interested mainly in the full Oscar4 program, to extract chemical entities. Oscar3 was
also capable of extracting data (like <a href="https://chem-bla-ics.linkedchemistry.info/2006/09/08/chemical-archeology-oscar3-to.html">NMR spectra <i class="fa-solid fa-recycle fa-xs"></i></a>),
but that is not yet being ported. The OscarCLI program takes input, extracts chemicals, and where possible resolves
them into connection tables (viz. InChI).</p>

<p>To extract chemicals from a line of text (e.g. <em>“This is propane.”</em>), you do:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>java <span class="nt">-cp</span> oscar4-cli-4.0-SNAPSHOT.jar <span class="se">\</span>
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI <span class="se">\</span>
  This is propane.
propane: <span class="nv">InChI</span><span class="o">=</span>1/C3H8/c1-3-2/h3H2,1-2H3
</code></pre></div></div>

<p>For larger chunks of text it is easier to route it via <a href="http://en.wikipedia.org/wiki/Standard_streams">stdin</a>,
for which we can use the <code class="language-plaintext highlighter-rouge">-stdin</code> option:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">echo</span> <span class="s2">"This is propane."</span> | <span class="se">\</span>
  java <span class="nt">-cp</span> oscar4-cli-4.0-SNAPSHOT.jar <span class="se">\</span>
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI <span class="se">\</span>
  <span class="nt">-stdin</span>
propane: <span class="nv">InChI</span><span class="o">=</span>1/C3H8/c1-3-2/h3H2,1-2H3
</code></pre></div></div>

<p>That way, we can easily process large plain text files (output omitted):</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>largeFile.txt | <span class="se">\</span>
  java <span class="nt">-cp</span> oscar4-cli-4.0-SNAPSHOT.jar <span class="se">\</span>
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI <span class="se">\</span>
  <span class="nt">-stdin</span>
</code></pre></div></div>

<p>If you prefer RDF output, for further integration, use the <code class="language-plaintext highlighter-rouge">-output text/turtle</code> option:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>largeFile.txt | <span class="se">\</span>
  java <span class="nt">-cp</span> oscar4-cli-4.0-SNAPSHOT.jar <span class="se">\</span>
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI <span class="se">\</span>
  <span class="nt">-stdin</span> <span class="nt">-output</span> text/turtle
</code></pre></div></div>

<p>This returns RDF using the <a href="http://code.google.com/p/semanticchemistry/">CHEMINF</a> ontology like:</p>

<div class="language-turtle highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">@prefix</span><span class="w"> </span><span class="nn">dc:</span><span class="w">  </span><span class="p">.</span><span class="w">
</span><span class="kd">@prefix</span><span class="w"> </span><span class="nn">rdfs:</span><span class="w">  </span><span class="p">.</span><span class="w">
</span><span class="kd">@prefix</span><span class="w"> </span><span class="nn">ex:</span><span class="w">  </span><span class="p">.</span><span class="w">
</span><span class="kd">@prefix</span><span class="w"> </span><span class="nn">cheminf:</span><span class="w">  </span><span class="p">.</span><span class="w">
</span><span class="kd">@prefix</span><span class="w"> </span><span class="nn">sio:</span><span class="w"> </span><span class="p">.</span><span class="w">

</span><span class="nn">ex:</span><span class="n">entity0
</span><span class="w">  </span><span class="nn">rdfs:</span><span class="n">subClassOf</span><span class="w"> </span><span class="nn">cheminf:</span><span class="n">CHEMINF_000000</span><span class="w"> </span><span class="p">;</span><span class="w">
  </span><span class="nn">dc:</span><span class="n">label</span><span class="w"> </span><span class="s">"propane"</span><span class="w"> </span><span class="p">;</span><span class="w">
  </span><span class="nn">cheminf:</span><span class="n">CHEMINF_000200</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="k">a</span><span class="w"> </span><span class="nn">cheminf:</span><span class="n">CHEMINF_000113</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">sio:</span><span class="n">SIO_000300</span><span class="w"> </span><span class="s">"InChI=1/C3H8/c1-3-2/h3H2,1-2H3"</span><span class="w"> </span><span class="p">.</span><span class="w">
  </span><span class="p">]</span><span class="w"> </span><span class="p">.</span><span class="w">
</span></code></pre></div></div>

<p>We can, however, also use <a href="http://jericho.htmlparser.net/docs/index.html">Jericho</a> to extract text from HTML pages, made
available with the <code class="language-plaintext highlighter-rouge">-html</code> option, and pulling in a <a href="http://www.beilstein-journals.org/bjoc/">Beilstein Journal of Organic Chemistry</a>
paper with <a href="http://en.wikipedia.org/wiki/Wget">wget</a>:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>wget <span class="nt">-qO-</span> https://doi.org/10.3762/bjoc.6.122 | <span class="se">\</span>
  java <span class="nt">-cp</span> oscar4-cli-4.0-SNAPSHOT.jar <span class="se">\</span>
  uk.ac.cam.ch.wwmm.oscar.oscarcli.OscarCLI <span class="se">\</span>
  <span class="nt">-stdin</span> <span class="nt">-html</span>
</code></pre></div></div>

<p>This will return 271 chemical entities recognized in the text, matching 48 unique chemical structures.</p>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="textmining" /><category term="beilstein" /><category term="inchikey:ATUOYWHBWRKTHZ-UHFFFAOYSA-N" /><summary type="html"><![CDATA[One goal of my three month project is to take Oscar4 to the community. We want to get it used more, and we need a larger development community. Oscar4 and the related technologies do a good, sometimes excellent, job, but have to be maintained, just like any other piece of code. To make using it easier, we are developing new APIs, as well as two user-oriented applications: a Taverna 2 plugin , and command line utilities. The Oscar4 Java API has slightly evolved in the last three weeks, removing some complexity. In this post, I will introduce the command line utilities.]]></summary></entry><entry><title type="html">Working on Oscar for three months</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/10/15/working-on-oscar-for-three-months.html" rel="alternate" type="text/html" title="Working on Oscar for three months" /><published>2010-10-15T00:00:00+00:00</published><updated>2010-10-15T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/10/15/working-on-oscar-for-three-months</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/10/15/working-on-oscar-for-three-months.html"><![CDATA[<p>As Peter <a href="https://blogs.ch.cam.ac.uk/pmr/2010/10/11/update-and-real-excitement/">announced <i class="fa-solid fa-recycle fa-xs"></i></a> in his blog, and I tweeted earlier, I have started as postdoctoral
research associate in <a href="http://www-pmr.ch.cam.ac.uk/wiki/Main_Page">Peter’s group</a> at the <a href="http://www.cam.ac.uk/">University of Cambridge</a>,
to work the next three months on <a href="https://oscar3-chem.sf.net">Oscar</a>, a chemical text mining tool. My tasks will focus on programmatical
plumbing instead of method development, and I am aiming at integration with <a href="http://cdktaverna.wordpress.com/installing-cdk-taverna/">CDK-Taverna</a>
(see doi:<a href="http://dx.doi.org/10.1186/1471-2105-11-159">10.1186/1471-2105-11-159</a>), which is currently being ported to
<a href="http://www.taverna.org.uk/">Taverna 2.2</a> by Andreas. <a href="http://sea36.blogspot.com/">Sam</a> and Lezan have been working on the refactoring
as well, and will help me out with the gory details of the current code.</p>

<p>The source code of Oscar4 is available from <a href="https://bitbucket.org/wwmm/oscar4">this BitBucket project</a>, and you can monitor the code
state on <a href="https://hudson.ch.cam.ac.uk/job/oscar4/">this Hudson page</a>. The project I will be working on is in collaboration with the
<a href="http://www.ebi.ac.uk/chebi/">ChEBI</a> project, and today we met up with various people in the group, and set out some really interesting
use cases.</p>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="textmining" /><category term="chebi" /><category term="doi:10.1186/1471-2105-11-159" /><summary type="html"><![CDATA[As Peter announced in his blog, and I tweeted earlier, I have started as postdoctoral research associate in Peter’s group at the University of Cambridge, to work the next three months on Oscar, a chemical text mining tool. My tasks will focus on programmatical plumbing instead of method development, and I am aiming at integration with CDK-Taverna (see doi:10.1186/1471-2105-11-159, and which is currently being ported to Taverna 2.2 by Andreas). Sam and Lezan having been working on the refactoring as well, and will help me out with the gory details of the current code.]]></summary></entry><entry><title type="html">Chemo::Blogs #2</title><link href="https://chem-bla-ics.linkedchemistry.info/2006/12/06/chemoblogs-2.html" rel="alternate" type="text/html" title="Chemo::Blogs #2" /><published>2006-12-06T00:00:00+00:00</published><updated>2006-12-06T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2006/12/06/chemoblogs-2</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2006/12/06/chemoblogs-2.html"><![CDATA[<p>Because no one picked up my <a href="https://chem-bla-ics.linkedchemistry.info/2006/09/15/chemoblogs-1.html">Chemo::Blogs <i class="fa-solid fa-recycle fa-xs"></i></a> suggestion, I will now
officially claim the blog series title. However, unlike the original <a href="http://bioblogs.wordpress.com/">Bio::Blogs</a> series,
I will not summarize interesting blogs, but just spam you with websites I recently marked as
<a href="http://del.icio.us/egonw/toblog">toblog on del.icio.us</a>.</p>

<h2 id="semantics-and-text-mining">Semantics and Text Mining</h2>

<p><a href="http://evan.prodromou.name/">Evan Prodromou</a> wrote about <a href="http://evan.prodromou.name/RDFa_vs_microformats">RDFa vs microformats</a>.
The latter are commonly used in <a href="https://chem-bla-ics.linkedchemistry.info/2006/02/06/tagging-blog-items.html">enhancing blog semantics <i class="fa-solid fa-recycle fa-xs"></i></a>, and
for example used by <a href="http://postgenomic.com/wiki/doku.php?id=markup">PostGenomic.com</a>. While RDFa is more explicit, e.g. by using
namespaced markup, we have to wait until XHTML2 to see it working. I do not think chemists are using tags a lot yet, but let me
propose the following microformats: <span class="inchi"><a href="http://google.com/search?q=1/CH4/h1H4">1/CH4/h1H4</a></span> and
<span class="chemicalcompound">methane</span>. Standard JavaScript and CSS will then do the rest. (Think: addressing newlines,
auto <a href="http://wwmm-svc.ch.cam.ac.uk/wwmm/html/googleinchiserver.html">googling-for-inchi</a>, etc).</p>
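
<p>A minimal sketch of how a text mining tool could pick up these two proposed microformats, assuming only the class names
<code>inchi</code> and <code>chemicalcompound</code> from the paragraph above. The parser class, the function name and the
sample sentence are hypothetical and only serve as illustration; this is Python's standard <code>html.parser</code>, not
an existing microformat library.</p>

```python
from html.parser import HTMLParser


class ChemMicroformatParser(HTMLParser):
    """Collect the text inside spans that carry one of the proposed classes."""

    # Class names taken from the proposal; everything else is illustrative.
    CLASSES = ("inchi", "chemicalcompound")

    def __init__(self):
        super().__init__()
        self.found = []   # (class name, text) tuples, in document order of closing
        self._stack = []  # one (class name or None, text parts) entry per open span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.CLASSES), None)
        self._stack.append((match, []))

    def handle_data(self, data):
        # Text belongs to every open microformat span, including nested ones.
        for cls, parts in self._stack:
            if cls is not None:
                parts.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            cls, parts = self._stack.pop()
            if cls is not None:
                self.found.append((cls, "".join(parts).strip()))


def extract_chemistry(html):
    """Return all (class name, text) pairs found in an HTML fragment."""
    parser = ChemMicroformatParser()
    parser.feed(html)
    parser.close()
    return parser.found


# Hypothetical blog sentence using both microformats.
sample = (
    '<p>I do like <span class="chemicalcompound">methane</span> '
    '(<span class="inchi"><a href="http://google.com/search?q=1/CH4/h1H4">'
    "1/CH4/h1H4</a></span>).</p>"
)
print(extract_chemistry(sample))
```

<p>Because the markup is plain HTML with agreed class names, the same few lines would work on any blog that adopts the
convention, without needing to recognize the chemistry in the running text.</p>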

<p>The reason why using microformats is interesting is text mining, of various kinds. Whether it is setting up a molecule-article
link database, or <a href="https://chem-bla-ics.linkedchemistry.info/2006/02/25/hacking-inchi-support-into.html">finding hot molecules in blogspace <i class="fa-solid fa-recycle fa-xs"></i></a>,
adding semantics will help tools like <a href="https://chem-bla-ics.linkedchemistry.info/2006/09/08/chemical-archeology-oscar3-to.html">OSCAR3 to mine chemistry <i class="fa-solid fa-recycle fa-xs"></i></a>.
Some time ago <a href="https://chem-bla-ics.linkedchemistry.info/2006/05/07/open-text-mining-interface-and.html">OTMI was proposed by Nature <i class="fa-solid fa-recycle fa-xs"></i></a>,
and they have now set up a <a href="http://www.opentextmining.org/wiki/Main_Page">dedicated web site</a> to explain their view on text mining.
<a href="http://www.zacker.com/">Zack Rosen</a> has a good idea why <a href="http://www.zacker.org/semantic-web-research-isnt-working">RDF Semantic web research isn’t working</a>.</p>

<h2 id="blogspace">Blogspace</h2>

<p>There are a few new chemistry blogs I want to mention (and already added to <a href="https://chem-bla-ics.linkedchemistry.info/2006/08/25/chemical-blogspace.html">Chemical blogspace <i class="fa-solid fa-recycle fa-xs"></i></a>):
<a href="http://blog.chembark.com/">ChemBark</a>, <a href="http://www.lirico.co.uk/wp/">lirico</a> which has an interesting
<a href="http://www.lirico.co.uk/wp/?cat=8">chemoinformatics section</a>, and <a href="http://ashutoshchemist.blogspot.com/">The Curious Wavefunction</a>.
Worth reading indeed.</p>

<p><a href="http://plindenbaum.blogspot.com/">Pierre’s YOKOFAKUN</a> deserves a paragraph of his own. He recently blogged about
<a href="http://plindenbaum.blogspot.com/2006/11/bio2rdf.html">bio2rdf</a> which provides an <a href="http://bio2rdf.org/">RDF interface to biochemical knowledge</a>
via <a href="http://lsid.sourceforge.net/">Life Science Identifiers</a> (LSID), <a href="http://plindenbaum.blogspot.com/2006/11/wwwoboeditorg.html">OBOEdit</a>
which is a Java-based ontology editor, and <a href="http://plindenbaum.blogspot.com/2006/12/visual-unix-pipeline.html">Amadea</a>
which is a <a href="http://taverna.sf.net/">Taverna</a>- and <a href="http://www.knime.org/">KNIME</a>-like tool for setting up UNIX pipes.</p>

<h2 id="online-embl-symposium">Online EMBL Symposium</h2>

<p>A few EMBL PhD students are having the <a href="http://virtualsymposium.predocs.org/">First Online EMBL PhD Symposium</a> (catchy name, or … ;)
Anyway, discussions are held on IRC, and it has a rather interesting Web2.0 session. All
<a href="http://virtualsymposium.predocs.org/media">media is available on the website</a> but requires registration right now.
After the conference it will become open access to all. <a href="http://www.blogger.com/profile/6833158">Jean-Claude</a> contributed
<em>The UsefulChem Project: Open Source Chemistry Research using Blogs and Wikis</em> to the
<a href="http://virtualsymposium.predocs.org/media/participants-contributions/">Participants’ Contributions section</a>, and I had
a poster on <em>Distributing molecular information over the Internet</em>, discussing CMLRSS, blog aggregators, CML and other things.
The IRC session was logged and is <a href="http://virtualsymposium.predocs.org/chat/discussion-about-the-influence-of-web-2-0-on-science-tuesday-december-6-2006-16-00-cet/">available here</a>.</p>

<h2 id="literature">Literature</h2>

<p>Finally, I want to mention three recent articles. The first is a recent write-up by Bourne and Friedberg about
<em>Ten Simple Rules for Selecting a Postdoctoral Position</em> (DOI: <a href="https://doi.org/10.1371/journal.pcbi.0020121">10.1371/journal.pcbi.0020121</a>).
With the end of my current postdoc position nearing, rather useful reading. Some time ago I blogged about a
<a href="https://chem-bla-ics.linkedchemistry.info/2006/05/11/new-open-access-journal-source-code.html">New open access journal Source Code for Biology and Medicine <i class="fa-solid fa-recycle fa-xs"></i></a>,
and the journal is now up and running. Details can be read in the first editorial (DOI: <a href="https://doi.org/10.1186/1751-0473-1-1">10.1186/1751-0473-1-1</a>).
The third article I would like to mention is <em>Scientific Software Development Is Not an Oxymoron</em> by Baxter
(DOI: <a href="https://doi.org/10.1371/journal.pcbi.0020087">10.1371/journal.pcbi.0020087</a>), though I do not think it has new insights.</p>

<p>OK, this was a rather lengthy write up, but really needed to clean up my toblog section :)</p>]]></content><author><name>Egon Willighagen</name></author><category term="blog" /><category term="rdf" /><category term="textmining" /><category term="cb" /><category term="justdoi:10.1371/journal.pcbi.0020121" /><category term="justdoi:10.1186/1751-0473-1-1" /><category term="justdoi:10.1371/journal.pcbi.0020087" /><summary type="html"><![CDATA[Because no one picked up my Chemo::Blogs suggestion, I will now officially claim the blog series title. However, unlike the original Bio::Blogs series, I will not summarize interesting blogs, but just spam you with websites I recently marked as toblog on del.icio.us.]]></summary></entry><entry><title type="html">Chemical Archeology: OSCAR3 to NMRShiftDB.org</title><link href="https://chem-bla-ics.linkedchemistry.info/2006/09/08/chemical-archeology-oscar3-to.html" rel="alternate" type="text/html" title="Chemical Archeology: OSCAR3 to NMRShiftDB.org" /><published>2006-09-08T00:00:00+00:00</published><updated>2006-09-08T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2006/09/08/chemical-archeology-oscar3-to</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2006/09/08/chemical-archeology-oscar3-to.html"><![CDATA[<p>Chemical Archeology (see <a href="http://wiki.cubic.uni-koeln.de/blog/pivot/entry.php?id=7#body">Christoph’s comment</a>) is the
process of extracting chemical information from old journal articles. Some time ago,
<a href="http://wwmm.ch.cam.ac.uk/blogs/corbett/">Peter Corbett</a> from the group of <a href="http://wwmm.ch.cam.ac.uk/blogs/murrayrust/">Peter Murray-Rust</a>
visited the <a href="http://almost.cubic.uni-koeln.de/jrg/">CUBIC</a> to talk to us about
<a href="http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3">Oscar3</a> which can do just that. That day, we already
<a href="https://chem-bla-ics.linkedchemistry.info/2006/06/22/text-mining-for-chemistry-using-oscar3.html">hooked OPSIN into Bioclipse <i class="fa-solid fa-recycle fa-xs"></i></a>.</p>

<p>Oscar3, however, is capable of more than the name2structure of OPSIN (see also
<a href="https://doi.org/10.1039/b411033a">10.1039/b411033a</a>); it can take a plain text file with an experimental section
with details on the synthesis of small organic compounds, and analyze the chemistry in that. This functionality has been
available as <a href="http://www.rsc.org/Publishing/ReSourCe/AuthorGuidelines/AuthoringTools/index.asp">an RSC authoring tool</a>
for some time now (see also <a href="https://doi.org/10.1039/b411699m">10.1039/b411699m</a>). Unfortunately, what publishers put
online (PDF and HTML) is much more difficult to process with Oscar3: those formats are often optimized for display,
not for machine processing. The HTML can be cleaned up, but there is no general approach.</p>

<p><a href="http://wiki.cubic.uni-koeln.de/blog/">Christoph Steinbeck</a> is going to present at the
<a href="http://www.chemistry.org/portal/a/c/s/1/acsdisplay.html?DOC=meetings%5Csanfrancisco2006%5Chome.html">upcoming ACS meeting</a>
the use of Oscar3 for extraction of NMR spectra from old journal articles, in preparation for submission to the
<a href="http://www.nmrshiftdb.org/">NMRShiftDB.org</a> (see the <a href="http://wiki.cubic.uni-koeln.de/blog/pivot/entry.php?id=4#body">abstract</a>
of <a href="http://oasys2.confex.com/acs/232nm/techprogram/P981204.HTM">CINF 101</a>).</p>

<p>Since the full Oscar3 was not hooked into <a href="http://www.bioclipse.net/">Bioclipse</a> yet, I had some work to do. It took me
some time to figure out how to properly configure Oscar3, and what additional things I had to do to clean up the HTML
used by publishers to get Oscar3 to extract NMR spectra (thanx to PeterC for hints!). I also had to tweak the Oscar3
code itself here and there, but that’s what open source is about :) (Peter, if you are reading this: I have a number
of patches for the Oscar3 code in <a href="http://svn.sourceforge.net/viewvc/bioclipse/trunk/bc_oscar/">bc_oscar</a>;
let me know if you’re interested in them.)</p>

<p>This is the end result:</p>

<p><img src="/assets/images/oscar1.png" alt="" /></p>

<p>Note especially the hierarchy in the resource navigator on the left. The misc folder contains all the chemistry found in the article. More importantly, for six molecules it fully detected the experimental section! For 3-(2-Oxocyclooctanyl)-3-phenylpropan-1-al (InChI=1/C17H22O2/c18-13-12-15(14-8-4-3-5-9-14)16-10-6-1-2-7-11-17(16)19/h3-5,8-9,13,15-16H,1-2,6-7,10-12H2) it derived the molecular structure (with OPSIN), and a few spectra: H-NMR, high-resolution MS and IR.</p>

<p>So, if you attend the ACS meeting: make sure to visit Christoph’s CINF 101 presentation!</p>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="bioclipse" /><category term="acs" /><category term="chemistry" /><category term="justdoi:10.1039/b411033a" /><category term="textmining" /><category term="justdoi:10.1039/b411699m" /><category term="nmrshiftdb" /><summary type="html"><![CDATA[Chemical Archeology (see Christoph’s comment) is the process of extracting chemical information from old journal articles. Some time ago, Peter Corbett from the group of Peter Murray-Rust visited the CUBIC to talk to us about Oscar3 which can do just that. That day, we already hooked OPSIN into Bioclipse .]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/oscar1.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/oscar1.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>