<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://chem-bla-ics.linkedchemistry.info/feed/by_tag/europepmc.xml" rel="self" type="application/atom+xml" /><link href="https://chem-bla-ics.linkedchemistry.info/" rel="alternate" type="text/html" /><updated>2026-06-15T12:00:19+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/feed/by_tag/europepmc.xml</id><title type="html">chem-bla-ics</title><subtitle>Chemblaics (pronounced chem-bla-ics) is the science that uses open science and computers to solve problems in chemistry, biochemistry and related fields.</subtitle><author><name>Egon Willighagen</name></author><entry><title type="html">One Million IUPAC names #5: a new approach and 400k names</title><link href="https://chem-bla-ics.linkedchemistry.info/2026/05/02/one-million-iupac-names-5-a-new-approach.html" rel="alternate" type="text/html" title="One Million IUPAC names #5: a new approach and 400k names" /><published>2026-05-02T00:00:00+00:00</published><updated>2026-05-02T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2026/05/02/one-million-iupac-names-5-a-new-approach</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2026/05/02/one-million-iupac-names-5-a-new-approach.html"><![CDATA[<p>About fifteen months ago a new project started: <a href="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html">One Million IUPAC names</a>:</p>

<blockquote>
  <p>Thus, the idea came up, can we create a set of 1 million unique IUPAC names found in literature?</p>
</blockquote>

<p>We started out using <a href="https://europepmc.org/">Europe PMC</a> to get JATS XML files for the full texts of open access articles.
Parsing the XML is easy and the text paragraphs are passed through OSCAR and OPSIN. That has not changed.</p>

<p>What did change last weekend is something that had long been on my to-do list (but life interfered). The first approach
was to ask for named entities using the Europe PMC APIs. But I quickly realized that with OSCAR and OPSIN we could
get more names out of the articles. The next step was to move from Google Colab to a command line script.
That gave another boost, as explained in <a href="https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html">this second post in the series</a>.
We reached 200 thousand names in <a href="https://chem-bla-ics.linkedchemistry.info/2025/06/09/one-million-iupac-names.html">June 2025</a>
but then growth slowed down again. <a href="https://chem-bla-ics.linkedchemistry.info/2025/08/09/one-million-iupac-names-4.html">Two months later</a>
we only had 75 thousand more. However, plenty of discussion was happening and there turned out to be
other, larger collections of IUPAC names under an open license. Millions of names, actually.</p>

<p>But another problem emerged. We were still using the Europe PMC API and were basically asking for open access
articles between two dates. In practice, the API could only answer requests spanning one to at most three days. Beyond that,
timeouts and 404s became an issue. Moreover, because these dates are publication dates and not the dates
on which the JATS files were deposited, I had to go back to previous months and redo the queries. That gave another
5 thousand names since last August. Something had to change.</p>
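
<p>To illustrate the date-window problem, here is a minimal Python sketch (not the script used in this project) that splits a period into windows of at most three days and builds one search URL per window. The endpoint and the <code>OPEN_ACCESS</code> and <code>FIRST_PDATE</code> query fields are assumptions about the Europe PMC search syntax; check them against the API documentation before relying on them.</p>

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Assumed endpoint of the Europe PMC REST search service
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def date_windows(start, end, max_days=3):
    """Yield (first, last) date pairs covering start..end, each at most max_days long."""
    current = start
    while end >= current:
        last = min(current + timedelta(days=max_days - 1), end)
        yield current, last
        current = last + timedelta(days=1)


def search_url(first, last):
    """Build a search URL for open access articles published in one window."""
    # The field names in this query are assumptions, not confirmed by the post
    query = f"OPEN_ACCESS:y AND FIRST_PDATE:[{first.isoformat()} TO {last.isoformat()}]"
    return SEARCH_URL + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    for first, last in date_windows(date(2025, 8, 1), date(2025, 8, 7)):
        print(search_url(first, last))
```

<p>Small windows keep each request answerable, but a month takes at least ten requests, and every window has to be queried again later to catch articles that were deposited after their publication date.</p>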

<h2 id="the-new-approach">The new approach</h2>

<p>Europe PMC, however, also provides the JATS XML files as downloads on <a href="https://europepmc.org/ftp/oa/">their FTP site</a>.
Already in <a href="https://chem-bla-ics.linkedchemistry.info/2025/08/09/one-million-iupac-names-4.html">August 2025</a> I had
a prototype and knew it would change the game. These gzipped XML files are about 150 to 250 MB. Unzipped, about 1 GB each.
Better, these files are based on Europe PMC identifiers, hopefully resolving the issue with using dates in the queries.</p>

<p>Now, parsing a 1 GB XML file is a total non-issue. I have done it plenty of times before. Just use a
<a href="https://en.wikipedia.org/wiki/Simple_API_for_XML">Simple API for XML</a> (SAX) parser. This is a streaming parser
that gives you full control over how to parse things. It is ideal for this situation: you just keep the current
paragraph of text in memory and release it when done with that paragraph. That is, you do not have to read
the full file into memory, just the bits you are interested in. I used this for my Chemical Markup Language
patches for Jmol and JChemPaint back in the nineties.</p>

<p>Last weekend I finally made the jump: use SAX to extract the <code class="language-plaintext highlighter-rouge">&lt;p&gt;</code> elements one by one, run OSCAR on
them, filter with OPSIN, output the names, and clear the memory. Effectively, a Groovy script processes
each gzipped file in about 1 to 2 hours.</p>
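
<p>The actual script is written in Groovy and calls OSCAR and OPSIN, which are Java libraries. Purely as an illustration of the streaming idea, here is a minimal sketch using the SAX parser from the Python standard library. The class and function names are hypothetical, and the callback stands in for the name detection step.</p>

```python
import gzip
import xml.sax


class ParagraphHandler(xml.sax.ContentHandler):
    """Collects the text of each p element and hands it to a callback."""

    def __init__(self, on_paragraph):
        super().__init__()
        self.on_paragraph = on_paragraph
        self.depth = 0    # nesting depth inside p elements
        self.chunks = []  # text of the current paragraph only

    def startElement(self, name, attrs):
        if name == "p":
            self.depth += 1

    def characters(self, content):
        if self.depth > 0:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "p":
            self.depth -= 1
            if self.depth == 0:
                self.on_paragraph("".join(self.chunks))
                self.chunks = []  # release the paragraph before reading on


def stream_paragraphs(source, on_paragraph):
    """Parse a file name or binary stream, calling on_paragraph once per paragraph."""
    xml.sax.parse(source, ParagraphHandler(on_paragraph))


def stream_gzipped(path, on_paragraph):
    """Decompress on the fly, so the unzipped XML is never held in memory as a whole."""
    with gzip.open(path, "rb") as stream:
        stream_paragraphs(stream, on_paragraph)
```

<p>Calling <code class="language-plaintext highlighter-rouge">stream_gzipped(path, print)</code> on one of the downloaded files would print the paragraphs one at a time. Memory use is bounded by the longest paragraph, not by the size of the file.</p>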

<p>The output is a mesmerizing stream of scientific literature (which I will use until someone points me to a Java
CLI library that creates a Matrix-style falling letters equivalent), though less so as a static image:</p>

<p><img src="/assets/images/jats_analysis.png" alt="" /></p>

<p>In this plot, an <code class="language-plaintext highlighter-rouge">x</code> means a new article to be processed. Each <code class="language-plaintext highlighter-rouge">.</code> and <code class="language-plaintext highlighter-rouge">o</code> that follows is a single <code class="language-plaintext highlighter-rouge">&lt;p&gt;</code>
element and the difference is that an <code class="language-plaintext highlighter-rouge">o</code> means at least one IUPAC name was detected in the paragraph.</p>

<p>Each gzipped file gives 400 to 500 new IUPAC names. Indeed, going from 288 thousand to 300 thousand
was a matter of a day and a half. And earlier this afternoon, after about 230 gzipped files, we passed
400 thousand IUPAC names. Now, I am going back in time, and the sizes of these files are shrinking:
another 500 files further back the size has dropped to around 125 MB, so a rough estimate suggests that
we will end up with 650 to 700 thousand names this way. This will be completed in a few weeks (and it will take that long mostly
because I first need to focus on other things again, as I can use our computing cluster to do this).</p>

<p>Regarding the original goal, fortunately, we are still publishing at a higher rate every year, and
more and more articles are available as open access. So, I still have good hopes we will reach the
<em>1 million IUPAC names</em>. Also, keep in mind, we know how to boost this by simple name variations to
several millions, even with the <a href="https://codeberg.org/BlueObelisk/iupac-names/commit/30ddfd96c3ec6e6a5840be0ada1bdbd40972490e">400 thousand</a>
we have today.</p>

<p>Oh, and <a href="https://github.com/BlueObelisk/iupac-names/issues/4">our next milestone</a> will be in the pocket
before I visit <a href="https://cheminf.uni-jena.de/">Christoph Steinbeck’s cheminformatics team</a> in Jena!</p>]]></content><author><name>Egon Willighagen</name></author><category term="iupac" /><category term="textmining" /><category term="xml" /><category term="europepmc" /><summary type="html"><![CDATA[About fifteen months ago a new project started: One Million IUPAC names:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/jats_analysis.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/jats_analysis.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Where does the WikiPathways Cited In information come from?</title><link href="https://chem-bla-ics.linkedchemistry.info/2026/01/10/where-does-the-wikipathways-cited-in-information-come-from.html" rel="alternate" type="text/html" title="Where does the WikiPathways Cited In information come from?" /><published>2026-01-10T00:00:00+00:00</published><updated>2026-01-10T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2026/01/10/where-does-the-wikipathways-cited-in-information-come-from</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2026/01/10/where-does-the-wikipathways-cited-in-information-come-from.html"><![CDATA[<p>I have been wanting to blog about this since this summer, but with everything going on, I never really got around to it.
What is this <em>Cited In</em> feature of <a href="https://wikipathways.org/">WikiPathways</a> and where does that information come from?
If you have not noticed this yet, this is what it looks like for <a href="https://www.wikipathways.org/instance/WP4846">WP4846</a>:</p>

<p><img src="/assets/images/wp_cited_in.png" alt="" /></p>

<p>Recently, I was close to writing up the context, because it is related to a new feature of the profile pages, where you
now can look up citations to pathways that you first authored (see
<a href="https://chem-bla-ics.linkedchemistry.info/2025/11/30/wikipathways-curation-reports-on-profile-pages.html">this post</a>).
And it also relates to the data I have been collecting around <a href="https://chem-bla-ics.linkedchemistry.info/tag/cito">citation intention annotations</a>:
articles that cite one of the WikiPathways papers and mention a specific pathway, could be considered <em>cito:usesDataFrom</em>
(see doi:<a href="https://doi.org/10.1186/s13321-023-00683-2">10.1186/s13321-023-00683-2</a>).</p>

<p>A third angle to citations to specific WikiPathways is the following. WikiPathways is used a lot in data analyses and
in putting experimental data in biological context. How researchers do this varies a lot, in multiple ways. But just
thinking about this factually, research outputs cite specific biological pathways. And there are some interesting
phenomena there. Back in 2015 at the Metabolomics Society meeting in San Francisco (apparently, I
blogged about the meeting only <a href="https://chem-bla-ics.blogspot.com/2015/06/metsoc2015-converting-smiles-annotation.html">once</a>?),
when I visited the 500+ posters looking for interesting biological pathways, there were a lot of studies
on different species, different diseases, different toxicities. The biological response had one thing in common:
it always was the TCA cycle that was key (see doi:<a href="https://doi.org/10.1096/FJ.11-203091">10.1096/FJ.11-203091</a> for
a 2012 comparison of TCA models).</p>

<p>Thus, with so many articles mentioning specific pathways and deriving biological knowledge from this, what is
reasonable to expect? Do we expect <em>co-citation</em> effects? That is, if two articles found the same set of pathways
of interest to their data, is the data showing a similar biological response? Do we expect something
like the above TCA cycle in metabolomics, similar to the notion of <em>frequent hitters</em> (see
doi:<a href="https://doi.org/10.1021/jm010934d">10.1021/jm010934d</a>)?</p>

<p>Of course, to test this hypothesis we need data, and that is where the <em>Cited In</em> feature comes in. At the time of
writing of this blog post, we can see on <a href="https://www.wikipathways.org/browse/citedin.html">this page</a>
that 878 pathways have been cited a total of 2715 times. We are getting somewhere. This blog
post will not analyze this data, which is one reason why I had not blogged about it. But from the
above you can understand that I want to :)</p>

<h2 id="the-cited-in-feature">The Cited In feature</h2>

<p>This <em>Cited In</em> feature was introduced along with the new website (see doi:<a href="https://doi.org/10.1093/nar/gkad960">10.1093/nar/gkad960</a>),
where we changed how GPML files are stored and how web pages are created from them.
Because we are no longer confined to the MediaWiki platform (which has served the project for very long,
very effectively), it is easier to integrate information from other sources. For example,
from literature databases. This feature was developed by <a href="https://orcid.org/0000-0001-5706-2163">Alex Pico</a>
at the Gladstone Institutes (see <a href="https://github.com/wikipathways/wikipathways-database/commit/840234adfd581730d86553910c078401351606ce">this 2022 commit</a>),
where he uses the <a href="https://www.ncbi.nlm.nih.gov/books/NBK25497/">NCBI eUtils API</a> to access
<a href="https://pmc.ncbi.nlm.nih.gov/">PubMed Central</a>.
The data is then collected into <a href="https://github.com/wikipathways/wikipathways-database/blob/main/downstream/citedin_lookup.yml">this YAML file</a>
which then gets used to generate webpage content (like the section in the above screenshot
and the page mentioning the current statistics).</p>

<h2 id="where-is-the-data-coming-from">Where is the data coming from?</h2>

<p>As just explained, originally the data was only coming from NCBI.
However, because I found many articles citing specific pathways that were not picked up by this
approach, and I wanted more data, I started searching <a href="https://europepmc.org/">Europe PMC</a>, the European
partner of PubMed Central. However, I am not automating this. I want to see the data, the articles, and
how people cite the pathways. I need to see that so that I can better understand how people are
using the data/knowledge from WikiPathways. I can no longer keep up with checking why people are citing
my own research, but <a href="https://chem-bla-ics.linkedchemistry.info/2010/10/31/citeulike-cito-use-case-1-wordles.html">I once did</a>.
I learn(-ed) a lot from that.</p>

<p>I normally use a search that requires the word “WikiPathways” to be
<a href="https://europepmc.org/search?query=wikipathways">mentioned in the article</a> (it is in most, but
not all of them; citing literature you extend sounds like a core scholarly value, but is factually
not systematically complied with), and then manually search for “WP”. With close to 1000
PubMed Central articles mentioning WikiPathways in 2025, most of them available as full text,
I can see if they cite specific pathways. A good number of articles mention the WikiPathways
identifier, e.g. the aforementioned <code class="language-plaintext highlighter-rouge">WP4846</code>. If the article only mentions a pathway title,
I cannot confidently identify which pathway is cited, so I exclude it.</p>
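
<p>I do this search by hand, but the pattern being looked for is simple enough to sketch. The Python snippet below is a hypothetical helper, not part of the workflow described here. It assumes identifiers always have the form <code class="language-plaintext highlighter-rouge">WP</code> followed by digits, as in <code class="language-plaintext highlighter-rouge">WP4846</code>, and it only flags candidates that would still need a human look.</p>

```python
import re

# Assumed identifier shape: the letters WP followed by one or more digits
WP_ID = re.compile(r"\bWP\d+\b")


def find_pathway_ids(text):
    """Return the distinct WikiPathways identifiers in text, in order of first appearance."""
    seen = []
    for match in WP_ID.finditer(text):
        if match.group() not in seen:
            seen.append(match.group())
    return seen
```

<p>A pathway that is only mentioned by title gives no match, which mirrors the exclusion rule above.</p>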

<p>I originally started out manually editing the YAML file where the citations are collected,
but by now use <a href="https://github.com/wikipathways/wikipathways-database/blob/main/scripts/citedin_fromFile.R">a script similar to Alex’ R script</a>.
This makes it far easier to scale up, as I just have to populate a three column TSV file,
which is used by my R script to update the YAML file. This manual approach ensures that
I am not looking at text mining results, but see the citation of the WikiPathways identifier
with my own eyes. That’s just how I like it.</p>

<p>The full history of the YAML file content can be found on <a href="https://github.com/wikipathways/wikipathways-database/commits/main/downstream/citedin_lookup.yml">this GitHub page</a>
and <a href="https://github.com/wikipathways/wikipathways-database/blame/main/downstream/citedin_lookup.yml">this <em>git blame</em></a>
tells you if the information came from PubMed Central via the API, or was added by me:</p>

<p><img src="/assets/images/wp_cited_in_git_blame.png" alt="" /></p>

<p>This is Open Science in action: added transparency and making it easier for anyone to verify,
so that no one needs to be stuck in (dis)trust.</p>

<p>Of course, as we know from the CiTO ontology and real-world data, there are so
many different reasons why journal articles are cited (just <a href="https://chem-bla-ics.linkedchemistry.info/2024/08/07/cito-updates.html">an example</a>)
that the data in the YAML file and on the WikiPathways website in the <em>Cited In</em> feature
does not have direct meaning. Just like a high citation count for an article or
even a journal impact factor cannot be directly interpreted (despite so many researchers
blindly doing just that).</p>

<h2 id="whats-next">What’s next?</h2>

<p>Well, while I did not do any analysis yet, and do not even know yet how many citations we need to
reach some level of statistical significance, there are some observations I can mention:</p>

<ul>
  <li>if your analysis included anything like linking your data to pathways, citing those pathways is
a good way to give credit to the researchers that created that pathway</li>
  <li>if you cite data, please cite that as accurately as possible, see e.g. DataCite</li>
  <li>I wish all journal articles citing specific pathways from WikiPathways would include the pathway identifier</li>
  <li>I congratulate those authors that even mentioned the revision of the pathway! Well done!</li>
</ul>

<p>And about biological interpretation, our group has long published that the fact that some genes with
differential data map to a pathway does not imply that that pathway is really affected.
Gene-set enrichment and over-representation analysis are a starting point, not a conclusion.
I wish more people were aware of the work in our (now)
<a href="https://cris.maastrichtuniversity.nl/en/organisations/translational-genomics/">Translational Genomics research group</a>.
Like that of <a href="https://orcid.org/0000-0002-7699-8191">Martina Kutmon</a> (now as
<a href="https://www.maastrichtuniversity.nl/research/maastricht-centre-systems-biology-and-bioinformatics">MaCSBio<sup>2</sup></a>),
whom I have had the pleasure of collaborating with for quite some years now (and long-time
architect of WikiPathways).</p>

<p>There is so much more I want to write up about WikiPathways, but I will leave it at this
for now.</p>]]></content><author><name>Egon Willighagen</name></author><category term="wikipathways" /><category term="europepmc" /><category term="doi:10.1186/S13321-023-00683-2" /><category term="cito:citesAsDataSource:10.1096/FJ.11-203091" /><category term="cito:obtainsBackgroundFrom:10.1021/jm010934d" /><category term="doi:10.1093/NAR/GKAD960" /><summary type="html"><![CDATA[I have been wanting to blog about this since this summer, but with everything going on, I never really got around to it. What is this Cited In feature of WikiPathways and where does that information come from? If you have not noticed this yet, this is what it looks like for WP4846:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/wp_cited_in.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/wp_cited_in.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">WikiPathways curation reports on profile pages</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/11/30/wikipathways-curation-reports-on-profile-pages.html" rel="alternate" type="text/html" title="WikiPathways curation reports on profile pages" /><published>2025-11-30T00:00:00+00:00</published><updated>2025-11-30T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/11/30/wikipathways-curation-reports-on-profile-pages</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/11/30/wikipathways-curation-reports-on-profile-pages.html"><![CDATA[<p>I have been running automated curation tests for many years now, at least <a href="https://chem-bla-ics.linkedchemistry.info/2018/10/11/two-presentations-at-wikipathways-2018.html">from before 2018</a>.
Because this has been done without funding, it has not been as nicely integrated: for example, it first needs the RDF generation to be integrated
in the GitHub Action. So, I still run them regularly (often in the morning during breakfast). Meanwhile, the <a href="https://www.wikipathways.org/wikipathways-collection/index2">curation tests</a>
help the project to monitor and maintain the quality of the pathways. The curation reports have been integrated into pathway pages for some
time now.</p>

<p><img src="/assets/images/wpCurationBadge.png" alt="" /></p>

<p>We have now integrated this curation badge into the author and community pages on the (not so) <a href="https://www.wikipathways.org/">new WikiPathways website</a>
too. Authors can now find curation reports for pathways they started and also for the community pages:</p>

<p><img src="/assets/images/4a0a20557574c3ae.png" alt="" /></p>

<p>A second new feature is the “Citations” tab on both pages, which links to <a href="https://europepmc.org/">Europe PMC</a>
with a dedicated search for articles mentioning those author or community pathways:</p>

<p><img src="/assets/images/270716cef8d30481.png" alt="" /></p>

<p>We hope you like it!</p>]]></content><author><name>Egon Willighagen</name></author><category term="wikipathways" /><category term="curation" /><category term="europepmc" /><summary type="html"><![CDATA[I have been running automated curation tests for many years now, at least from before 2018. Because it has been done without funding, it has not been as nicely integrated, and depends, for example, first on the RDF generation to be integrated in the GitHub Action. So, I still run them regularly (often in the morning during breakfast). Meanwhile, the curation tests help the project to monitor and maintain the quality of the pathways. The curation reports have been integrated into pathway pages for some time now.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/wpCurationBadge.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/wpCurationBadge.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">The Internet Journal of Chemistry</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/08/11/the-internet-journal-of-chemistry.html" rel="alternate" type="text/html" title="The Internet Journal of Chemistry" /><published>2025-08-11T00:00:00+00:00</published><updated>2025-08-11T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/08/11/the-internet-journal-of-chemistry</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/08/11/the-internet-journal-of-chemistry.html"><![CDATA[<p>The <a href="https://scholia.toolforge.org/topic/Q27211732">Internet Journal of Chemistry</a> (IJC, issn:1099-8292) was one of the first scientific journals to get
published on the world wide web (part of <em>the Internet</em>), see doi:<a href="https://doi.org/10.1080/00987913.2000.10764578">10.1080/00987913.2000.10764578</a>.
Issues were published from 1998 to 2004. But because it predates
systematic archiving of webpages by libraries, a lot is lost. The nature of the journal, however, makes it unique, and quite
a number of articles are cited a lot, and should be part of the <em>scientific record</em>.
But I soon realized it actually is quite hard to track down content of the journal. I knew some articles were available
online as <em>author accepted manuscripts</em>. One of those was my own first (and single) author-article, self-archived on
Zenodo (doi:<a href="https://doi.org/10.5281/zenodo.1495470">10.5281/zenodo.1495470</a>), green open access style.</p>

<p>I wanted to see what I could recover, and here I describe what I did and what could be done next.</p>

<h2 id="a-list-of-all-articles">A list of all articles</h2>

<p>The first step is actually to create a list of all articles published in the IJC and collect as much metadata about
them as possible. With just over 100 articles, I decided to use Wikidata, as a machine-readable database, supporting the curation and reporting. I wanted at least
two independent sources, and for Wikidata, use public resources. That means, while Web of Science does have a list of
all articles, I only used this for validation, and <strong>not</strong> as information source. Instead, I used citations to IJC
articles and, of course, the Internet Archive (IA). It turns out <a href="https://web.archive.org/web/*/http://www.ijc.com/abstracts/*">a query like this</a>
does wonders (well, for the abstracts; I did not find full-texts archived on IA):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://web.archive.org/web/*/http://www.ijc.com/abstracts/*
</code></pre></div></div>
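
<p>The same wildcard lookup can, as far as I know, also be done programmatically via the CDX server of the Wayback Machine. The Python sketch below only builds the request URL and turns a result row into a link. The endpoint and the parameter names are assumptions from memory and should be checked against the Internet Archive documentation.</p>

```python
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX server
CDX_URL = "https://web.archive.org/cdx/search/cdx"


def cdx_query(url_pattern, limit=None):
    """Build a CDX query URL listing captures that match url_pattern."""
    params = {
        "url": url_pattern,          # a trailing * asks for a prefix match
        "output": "json",            # rows as JSON arrays instead of plain text
        "collapse": "urlkey",        # keep one row per distinct URL
        "fl": "timestamp,original",  # only the fields needed to rebuild a link
    }
    if limit is not None:
        params["limit"] = str(limit)
    return CDX_URL + "?" + urlencode(params)


def capture_link(timestamp, original):
    """Turn one CDX row into a Wayback Machine link."""
    return f"https://web.archive.org/web/{timestamp}/{original}"
```

<p>For example, <code class="language-plaintext highlighter-rouge">cdx_query("www.ijc.com/abstracts/*")</code> gives a URL for the abstracts pattern above, and each returned row can be passed to <code class="language-plaintext highlighter-rouge">capture_link</code>.</p>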

<p>I found that all but one article had the abstract archived in the IA. Here’s <a href="https://web.archive.org/web/20000925050415/http://www.ijc.com/abstracts/abstract2n8.html">an example</a>:</p>

<p><img src="/assets/images/ia_ijc_abstract.png" alt="" /></p>

<p>This gave me a lot of information to add to Wikidata: title, publication date, volume, article number, keywords, an abstract,
and, of course, the list of authors. Some authors I know personally, many I did not. But it did allow me to enter all
articles into Wikidata along with the authors, using “author” (<a href="https://www.wikidata.org/wiki/Property:P50">P50</a>) or
“author name string” (<a href="https://www.wikidata.org/wiki/Property:P2093">P2093</a>).</p>

<h2 id="the-article-authors">The article authors</h2>

<p>It also turned out that multiple authors listed their IJC article on their public ORCID profile.
That greatly helped identification. I managed to <a href="https://w.wiki/Ezda">link many authors</a> to mostly existing Wikidata items:</p>

<p><img src="/assets/images/ijc_authors.png" alt="" /></p>

<p>I already mentioned that I used Wikidata to collect this information. Besides the <a href="https://scholia.toolforge.org/venue/Q27211732">interactive visualization with Scholia</a>,
it also gave me the option to track my progress with SPARQL queries. For example, <a href="https://w.wiki/Ezdf">this query</a> helped
me do that author FAIR-ification:</p>

<p><img src="/assets/images/ijc_sparql1.png" alt="" /></p>

<p>You can see here two columns with author information, one for P50 and the other for P2093. There is quite some
identification left to be done, and additional information is welcome:</p>

<p><img src="/assets/images/ijc_sparql2.png" alt="" /></p>
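
<p>The queries behind these screenshots are the ones linked above. As a rough idea of how such a progress report can be written, here is a sketch of my own, wrapped in Python so the URL can be built. It assumes the articles are linked to the journal with “published in” (P1433), and it counts per article how many authors are linked as items (P50) and how many are still plain name strings (P2093).</p>

```python
from urllib.parse import urlencode

# Wikidata item of the Internet Journal of Chemistry, taken from the Scholia links
IJC = "Q27211732"


def author_status_query(venue):
    """SPARQL counting, per article in a venue, linked authors and name-only authors."""
    return f"""
SELECT ?article
       (COUNT(DISTINCT ?author) AS ?linked)
       (COUNT(DISTINCT ?authorName) AS ?nameOnly)
WHERE {{
  ?article wdt:P1433 wd:{venue} .               # published in (assumed property)
  OPTIONAL {{ ?article wdt:P50 ?author }}         # author as a Wikidata item
  OPTIONAL {{ ?article wdt:P2093 ?authorName }}   # author name string
}}
GROUP BY ?article
ORDER BY DESC(?nameOnly)
""".strip()


def query_url(sparql):
    """URL that asks the Wikidata Query Service for JSON results."""
    return "https://query.wikidata.org/sparql?" + urlencode({"query": sparql, "format": "json"})
```

<p>Articles at the top of the result are the ones with the most identification work left. Note that two different authors with an identical name string would be counted once in this sketch.</p>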

<h2 id="sources">Sources</h2>

<p>So, that brings us to this list of sources:</p>

<ul>
  <li>Internet Archive: abstracts and metadata</li>
  <li>ORCID profiles: ORCIDs of (some) authors</li>
  <li>Google Scholar: metadata and citations</li>
  <li>Web of Science: independent list for external validation</li>
</ul>

<p>Because there is plenty of work left to be done and I hope the collected information will further spread
in library collections, I added sources as much as possible. <a href="https://w.wiki/Em9i">This query</a> lists for all
articles the Web of Science identifier (recorded so that everyone can check the consistency), the link
to the Internet Archive-d abstract page, and a link to a known full text (five).</p>

<p>If you wonder, neither <a href="https://openalex.org/works?page=1&amp;filter=primary_location.source.id:s32147083">OpenAlex</a>
nor <a href="https://europepmc.org/search?query=JOURNAL%3A%28%22Internet%20Journal%20of%20Chemistry%22%29">Europe PMC</a> has a full list.</p>

<h2 id="whats-next">What’s next?</h2>

<p>I do not have formal training in archiving, but I am happy with the minimal viable metadata collection.
I know more can be done (and would love to hear your pointers and suggestions): more author identities,
better coverage of keyword annotation, etc. But I think an important addition is adding citations
to and from the IJC articles. The journal predates efforts like the <a href="https://i4oc.org/">I4OC</a> and
<a href="https://opencitations.net/">Open Citations</a>, so I may have to manually recover citations from Google Scholar.
I will have to report on that later. But you can enjoy the citations that are
<a href="https://scholia.toolforge.org/venue/Q27211732#Citations">already there</a>. And now that we have sufficient metadata,
I can use this to find more full texts.</p>

<p>Btw, I have made contact with Prof. <a href="https://scholia.toolforge.org/author/Q28420106">Steven Bachrach</a>,
who founded the journal and was the Editor-in-Chief.</p>]]></content><author><name>Egon Willighagen</name></author><category term="publishing" /><category term="wikidata" /><category term="scholia" /><category term="doi:10.5281/ZENODO.1495470" /><category term="cito:citesAsEvidence:10.1080/00987913.2000.10764578" /><category term="europepmc" /><summary type="html"><![CDATA[The Internet Journal of Chemistry (IJC, issn:1099-8292) was one of the first scientific journals to get published on the world wide web (part of the Internet), see doi:10.1080/00987913.2000.10764578. Issues were published from 1998 to 2004. But because it predates systematic archiving of webpages by libraries, a lot is lost. The nature of the journal, however, makes it unique, and quite a number of articles are cited a lot, and should be part of the scientific record. But I soon realized it actually is quite hard to track down content of the journal. I knew some articles have been author accepted manuscripts online. One of that was my own first (and single) author-article, self-archived on Zenodo (doi:10.5281/zenodo.1495470), green open access style.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/ia_ijc.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/ia_ijc.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">One Million IUPAC names #4: a lot is happening</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/08/09/one-million-iupac-names-4.html" rel="alternate" type="text/html" title="One Million IUPAC names #4: a lot is happening" /><published>2025-08-09T00:00:00+00:00</published><updated>2025-08-09T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/08/09/one-million-iupac-names-4</id><content type="html" 
xml:base="https://chem-bla-ics.linkedchemistry.info/2025/08/09/one-million-iupac-names-4.html"><![CDATA[<p>A lot is happening. If you have been following this project more closely, you may have already seen some interesting updates, but
I will post it here too. First, a quick recap. In March I started a new <a href="http://blueobelisk.org/">Blue Obelisk</a> project to
<a href="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html">collect CCZero IUPAC names</a>
from primary literature (paper still pending). It turned out we can automate that without violating any laws or licenses.
In April I reported on <a href="https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html">some tweaks</a>
boosting the efficiency of the use of the API. I also reported on some possible further steps, including how to use the extracted
names to create a larger set. Indeed, in June I could <a href="https://chem-bla-ics.linkedchemistry.info/2025/06/09/one-million-iupac-names.html">report to have passed the 200k IUPAC names</a>,
which with the idea from April gave us more than 1M IUPAC names.</p>

<p>In this post I want to give an update.</p>

<h2 id="275k-iupac-names">275k IUPAC names</h2>

<p>I have continued running the scripts to detect new IUPAC names in full text, open access papers in <a href="https://europepmc.org/">Europe PMC</a>,
but something more awesome actually did much more since the <a href="https://chem-bla-ics.linkedchemistry.info/2025/06/09/one-million-iupac-names.html">June post</a>:
in July I received a <a href="https://github.com/BlueObelisk/iupac-names/pull/13">pull request</a> from <a href="https://github.com/mnietfeld">mnietfeld</a>
with more than 40 thousand unique and new IUPAC names from the <a href="https://www.beilstein-journals.org/bjoc/">Beilstein Journal of Organic Chemistry</a>
(see also <a href="https://www.linkedin.com/posts/beilstein-institut_openaccess-bjoc-fair-activity-7351596602660167681-0Z0r/">their LinkedIn post</a> or
<a href="https://archive.is/DZOnP">this archived version</a> that doesn’t require an account).
While Europe PMC provides these articles too (this journal was actually one of the first I analyzed), a lot of these names come from the supplementary
information, which Europe PMC does not provide. Thanks!</p>

<p>This is focusing on names from primary literature, but there is more happening. Because I want to restrict the above project to
names from primary literature (and supplementary information is still that), I have not been sure what to do with other collections
yet, and they have been coming in. I have been <a href="https://github.com/BlueObelisk/iupac-names/issues?q=is%3Aissue%20label%3Aother">taking notes</a>
in the project issue tracker, for future reference (like now, here). I have not forgotten about these!</p>

<h2 id="other-large-collections-of-iupac-names">Other large collections of IUPAC names</h2>

<p><strong>4M, CCZero</strong><br />
Let’s start with the news yesterday. The <a href="https://www.ebi.ac.uk/about/teams/chemical-biology-services/">Chemical Biology Services team</a>
<a href="https://chembl.blogspot.com/2025/08/unleashing-4-million-iupac-names-into.html">released 4 million IUPAC names from patent literature as CCZero</a>!
The CCZero license/waiver makes it compatible with our list. Their Zenodo release:</p>

<blockquote>
  <p>… contains IUPAC names text-mined from patents (US, WIPO, EPO, Chinese, Japanese).</p>
</blockquote>

<p>The post also includes a nice example of the complexity of IUPAC names which makes the counting of unique names tricky:
<code class="language-plaintext highlighter-rouge">O-methylphenol</code> and <code class="language-plaintext highlighter-rouge">o-methylphenol</code>. Thanks, Noel and the rest of the EMBL-EBI team!</p>
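
<p>A minimal sketch of why this matters for counting, assuming the names are plain strings: the capital <code class="language-plaintext highlighter-rouge">O-</code> puts the methyl group on the oxygen, while the lowercase <code class="language-plaintext highlighter-rouge">o-</code> means <em>ortho</em>, so folding case before counting would merge two different compounds. The function name and the sample list are made up for illustration and are not part of the project scripts.</p>

```python
def count_unique(names, fold_case=False):
    """Count distinct names, optionally ignoring letter case."""
    seen = set()
    for name in names:
        key = name.strip()
        if fold_case:
            # Case folding treats O-methylphenol and o-methylphenol as one name.
            key = key.casefold()
        seen.add(key)
    return len(seen)


# Hypothetical sample: two different compounds that differ only in case,
# plus one name that was merely capitalized at the start of a sentence.
names = ["O-methylphenol", "o-methylphenol", "phenol", "Phenol"]

print(count_unique(names))                  # case-sensitive count
print(count_unique(names, fold_case=True))  # case-insensitive count
```

<p>Neither count is right for this sample: the case-sensitive one keeps the two compounds apart but counts the capitalized name twice, and the case-insensitive one does the opposite.</p>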

<p><strong>2.3 million, CC-BY</strong><br />
And then <a href="https://github.com/haydn-jones">Haydn Jones</a> was one of the earliest <a href="https://github.com/BlueObelisk/iupac-names/issues/9">to chime in</a>,
and <a href="https://doi.org/10.5281/zenodo.15077270">released 2.3 million IUPAC names</a> under the CC-BY license.</p>

<p><strong>850k, CCZero</strong><br />
Wikidata also turns out to have many IUPAC names. <a href="https://github.com/Adafede/">Adriano</a> found more than 850 thousand IUPAC
names, see <a href="https://github.com/Adafede/wd-labels-to-iupac">this project</a>.</p>

<p>Next week I will do some comparisons of the datasets with a clear Creative Commons license.</p>

<h2 id="even-more">Even more</h2>

<p>Beyond these five data releases, there is more. PubChem and other databases have millions of names, but often these are
generated by proprietary software. These IUPAC name collections may be under some license agreement, and thus not compatible
with Open Science. This is why it is so important that we very clearly know where these names are coming from.</p>

<p><strong>5-6 million, license unclear</strong><br />
I also learned about <a href="https://chempile.lamalab.org/">ChemPile</a>, which <a href="https://www.linkedin.com/in/adrian-mirza-chem/">Adrian Mirza</a>
explained to me has <a href="https://www.linkedin.com/feed/update/urn:li:activity:7330626142611062784">about 5-6 million IUPAC names</a>.
But the source of this list of names is not yet clear to me.</p>

<p><strong>Names from PhD theses and preprints</strong><br />
I also want to give a shout-out to <a href="https://github.com/BlueObelisk/iupac-names/issues/15">Peter Murray-Rust</a>’s proposal
to start extracting IUPAC names from PhD theses. There have been projects to extract chemistry from PhD theses in the
past, and this will yield a lot of unique names. Please ping Peter, if you want to get involved in his idea!</p>

<h2 id="whats-next">What’s next</h2>

<p>I am so excited about all these efforts and very grateful for the contribution by Beilstein. I really hope more Open Science
publishers will follow, like perhaps the Royal Society of Chemistry for which it should be easy, with their
<a href="https://chem-bla-ics.linkedchemistry.info/2007/02/01/rsc-first-publisher-to-go-semantic.html">Project Prospect</a> background!</p>

<p>I am also excited by the release by ChEMBL under CCZero. That will allow the <a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry">WikiProject Chemistry</a>
to use this for Wikidata!</p>

<p>So, I have one week left to write the article about the work we started in March. The outlook is bright. I played last
week with the Europe PMC full text downloads and can confirm that should yield thousands of additional names from the
full texts. A single download file gave me more than two thousand new unique names. I think the 500k IUPAC names
is absolutely in reach with purely the full texts from Europe PMC.</p>

<p>This brings us to the end of 2025. By then, we should have many millions of openly licensed IUPAC names.
And by March 2026, I hope we will have reached the 1M IUPAC names extracted from primary literature. That will require some
creativity and enthusiasm, but sounds feasible!</p>]]></content><author><name>Egon Willighagen</name></author><category term="iupac" /><category term="beilstein" /><category term="chembl" /><category term="cito:citesAsRecommendedReading:10.5281/zenodo.16755947" /><category term="inchikey:RDOXTESZEPMUJZ-UHFFFAOYSA-N" /><category term="inchikey:QWVGKYWNOKOFNN-UHFFFAOYSA-N" /><category term="cito:citesAsRecommendedReading:10.5281/zenodo.15077270" /><category term="europepmc" /><summary type="html"><![CDATA[A lot is happening. If you have been following this project more closesly, you may have already seen some interesting updates, but I will post it here too. First, a quick recap. In March I started a new Blue Obelisk project to collect CCZero IUPAC names from primary literature (paper still pending). It turned out we can automate that, while legally not violating any laws or licenses. In April I reported on some tweaks boosting the efficiency of the use of the API. I also reported on some possible further steps, including how to use the extracted names to create a larger set. Indeed, in June I could report to have passed the 200k IUPAC names, which with the idea from April gave us more than 1M IUPAC names.]]></summary></entry><entry><title type="html">Archiving, but not really</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/08/06/archiving-but-not-really.html" rel="alternate" type="text/html" title="Archiving, but not really" /><published>2025-08-06T00:00:00+00:00</published><updated>2025-08-06T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/08/06/archiving-but-not-really</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/08/06/archiving-but-not-really.html"><![CDATA[<p><a href="https://sauropods.win/@mike">Mike Taylor</a> wrote up <a href="https://doi.org/10.59350/svpow.24000">a post</a> about the various things a journal article is doing,
the first being <em>a scientific report</em>. We put a lot of money into establishing a scientific track record. In the past 30 years
how we publish our research and how we archive it has changed significantly. If you read my blog more often, you know I have
been critical of the performance of many publishers. Springer Nature was so disappointing that after 5 years I
<a href="https://chem-bla-ics.linkedchemistry.info/2021/06/11/conflict-of-interest-or-why-i-am.html">stepped down</a>
as Editor-in-Chief (of two) of the <a href="https://en.wikipedia.org/wiki/Journal_of_Cheminformatics">Journal of Cheminformatics</a>.
There is so much that must be <a href="https://chem-bla-ics.linkedchemistry.info/2024/09/16/publishing.html">done better</a>.</p>

<p>But in the most recent iteration, triggered by some work for <a href="https://www.wikipathways.org/">WikiPathways</a>, I was using
<a href="https://europepmc.org/">Europe PMC</a> to find articles that
mention <em>WikiPathways</em> and then search in the full text for the string <code class="language-plaintext highlighter-rouge">WP</code>, as a trigger for the possible mention of
WikiPathways pathway identifiers, which look like <code class="language-plaintext highlighter-rouge">WP4846</code>. The use of <em>compact (resource) identifiers</em>
(see doi:<a href="https://doi.org/10.1038/sdata.2018.29">10.1038/sdata.2018.29</a>) is minimal, but at least some articles use identifiers.</p>
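
<p>The identifier search can be sketched roughly like this, assuming the full text is available as a plain string and that an identifier is simply <code class="language-plaintext highlighter-rouge">WP</code> followed by digits. The pattern, the function name, and the sample sentence (including <code class="language-plaintext highlighter-rouge">WP1234</code>) are assumptions for illustration, not the actual script.</p>

```python
import re

# Assumed pattern: the letters WP followed by one or more digits, as in WP4846.
WP_PATTERN = re.compile(r"\bWP\d+\b")


def find_pathway_ids(text):
    """Return the distinct WP-style identifiers in text, in order of first appearance."""
    found = []
    for match in WP_PATTERN.findall(text):
        if match not in found:
            found.append(match)
    return found


# Made-up sentence; WP1234 is only a stand-in for a second identifier.
sample = "We used WikiPathways pathways WP4846 and WP4846 again, plus WP1234 (WPX is not an identifier)."
print(find_pathway_ids(sample))
```

<p>A bare <code class="language-plaintext highlighter-rouge">WP</code> without digits is only a trigger to look closer; the sketch ignores it.</p>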

<p>That allows me to extend our WikiPathways knowledge graph of <a href="https://www.wikipathways.org/browse/citedin">articles citing specific pathways</a>.
At the time of writing, we collected 2509 citations from 440 different articles to 883 different pathways. Now,
I want to blog about that more, but it’s related to an observation.</p>

<h2 id="information-loss">Information loss</h2>
<p>Now, back in the late nineties I learned about GNU/Linux and after playing with Red Hat and Suse, I settled on Debian.
One of the things I learned is that, generally, information corruption (like data loss) is an absolute red flag, a no-go,
a total showstopper.</p>

<p>And then we have this in publishing, the one area where data corruption must also be a no-go:</p>

<p><img src="/assets/images/imageResolutionLoss.png" alt="" /></p>

<p>In this image, the left side shows a screenshot of the publisher version of the article and on the right side
the version in <a href="https://pmc.ncbi.nlm.nih.gov/">PubMed Central</a> (PMC). PMC has been an important project to archive full text versions of articles:</p>

<blockquote>
  <p>11.2 million articles are archived in PMC.</p>
</blockquote>

<p>So, this is <strong>really bad</strong>! The archived version is not really useful. As a human I already struggle to read the
degraded image, let alone an algorithm.</p>

<p>Does that matter? Yes, projects like the awesome
<a href="https://pfocr.wikipathways.org/">Pathway Figure OCR</a> (see doi:<a href="https://doi.org/10.1186/s13059-020-02181-2">10.1186/s13059-020-02181-2</a>)
depend on images to be FAIR enough to extract information. (Side note: yes, these images should be vector
graphics, but commercial publishers decided about twenty years ago that they could not care enough.)</p>

<p>At this moment, I do not know where the information is lost. Maybe PubMed Central is storing the images in a low
resolution. Maybe the publisher provides PMC with a low resolution image. But to me, this must be solved as soon
as possible. This is utterly unacceptable.</p>

<p>I wonder what the authors of the article (doi:<a href="https://doi.org/10.1186/s13287-025-04166-z">10.1186/s13287-025-04166-z</a>)
I took as example think of this.</p>]]></content><author><name>Egon Willighagen</name></author><category term="publishing" /><category term="cito:citesAsRecommendedReading:10.59350/svpow.24000" /><category term="cito:citesAsRecommendedReading:10.1186/s13059-020-02181-2" /><category term="cito:describes:10.1186/s13287-025-04166-z" /><category term="cito:obtainsBackgroundFrom:10.1038/sdata.2018.29" /><category term="europepmc" /><summary type="html"><![CDATA[Mike Taylor wrote up a post about the various things a journal article is doing, the first being a scientific report. We put a lot of money in establishing a scientific track record. In the past 30 years how we publish our research and how we archive it has changed significantly. If you read my blog more often, you know I have been critical of the performance of many publishers. Springer Nature was so disappointing that after 5 years I stepped down as Editor-in-Chief (of two) of the Journal of Cheminformatics. There is so much that must be done better.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/imageResolutionLoss.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/imageResolutionLoss.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Curation is an essential part of doing research</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/06/29/curation-is-an-essential-part-of-doing-research.html" rel="alternate" type="text/html" title="Curation is an essential part of doing research" /><published>2025-06-29T00:00:00+00:00</published><updated>2025-06-29T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/06/29/curation-is-an-essential-part-of-doing-research</id><content type="html" 
xml:base="https://chem-bla-ics.linkedchemistry.info/2025/06/29/curation-is-an-essential-part-of-doing-research.html"><![CDATA[<p>Depending on your exact definition of doing science, keeping track as precisely as possible of your observations
is an essential part of doing science. The precision should be high enough that mistakes are obvious. This pattern is,
of course, not limited to doing science and we see this in open source development too. Unfortunately, in the
modern way of doing science, this is not getting the attention it should get. Worse, narratives (stories)
about the research, in the form of journal articles, are generally considered more important than a precise
description of the observations.</p>

<p>Is that a big issue? Hell, yes. Where do you think the FAIR ideas came from? And why has FAIR, in ten years, not
brought about the change it was hoping for?</p>

<p>For me, my fascination for curation started as a student, around 1995, with the <em>Dictionary on Organic Chemistry</em>.
At that time, my interest came from wanting to learn about chemistry and biology. During my M.Sc. and PhD, it was
obvious how essential it was to deriving correct scientific conclusions from your experiment. Data, knowledge,
and software alike, imo. And because curation is expensive and should not have to be repeated, I prefer to do it as
Open Science.</p>

<h2 id="curation">Curation</h2>

<p>Of course, curation has been part of doing science, but to a large extent it is a separate step from doing science.
It is done by database developers, librarians, and chemo- and bioinformaticians. For example, Chemical Abstracts
Service (CAS) <a href="https://en.wikipedia.org/wiki/Chemical_Abstracts_Service">started over 100 years ago</a> and started
indexing chemical structures in 1965. The curation is an ongoing process, <a href="https://chem-bla-ics.linkedchemistry.info/2022/05/22/new-cas-common-chemistry-in-2021.html">also for old records</a>.</p>

<p><a href="https://www.biocuration.org/dissemination/who-are-we/">Biocuration</a> is getting
<a href="https://scholia.toolforge.org/topic/Q54987878#publications-per-year">more and more attention</a>:</p>

<p><img src="/assets/images/biocuration.png" alt="" /></p>

<p>The recognition and rewarding by having the <a href="https://www.biocuration.org/">International Society for Biocuration</a>
(ISB, <a href="https://scholia.toolforge.org/organization/Q23809291">Scholia page</a>) should not be underestimated
(doi:<a href="https://doi.org/10.1038/455047A">10.1038/455047A</a>). Their <a href="https://scholia.toolforge.org/event-series/Q106486148">Annual International Biocuration Conferences</a>
have been running since <a href="https://scholia.toolforge.org/event/Q109408101">2005</a>. And with their
awards, they give the biocuration work recognition and, literally, rewarding:</p>

<ul>
  <li><a href="https://scholia.toolforge.org/award/Q106045191">Biocuration Career Award</a> (2016-2021)</li>
  <li><a href="https://scholia.toolforge.org/award/Q118947746">Excellence in Biocuration Early Career Award</a> (2022-)</li>
  <li><a href="https://scholia.toolforge.org/award/Q119882229">Excellence in Biocuration Advanced Career Award</a> (2022-)</li>
  <li><a href="https://scholia.toolforge.org/award/Q106045103">Exceptional Contribution to Biocuration Award</a> (2017-)</li>
</ul>

<h2 id="my-curation-curriculum-vitae">My curation Curriculum Vitae</h2>

<p>I don’t have a good <em>curation CV</em>. To a large extent that is because the curation has been part of a study. The curation
itself does not get recognized, and only the <em>journal article</em> does. With datasets slowly getting more recognition,
so does data curation, but data curation is not really part of how we do FAIR at this moment, and via this route
it is not getting the attention it should get.</p>

<p>But since I have been updating <a href="https://egonw.github.io/cv/">my CV anyway</a>, I dug up some curation I am proud
of:</p>

<ul>
  <li>the Dictionary on Organic Chemistry, which no longer exists, but it started my Open Science chemistry research</li>
  <li>the <a href="Blue Obelisk Data Repository">Blue Obelisk Data Repository</a> (BODR), which has been part of various
GNU/Linux distributions (see also doi:<a href="https://doi.org/10.1021/ci050400b">10.1021/ci050400b</a>).
A new version is <a href="https://chem-bla-ics.blogspot.com/2013/08/the-blue-obelisk-data-repositorys-10.html">long overdue</a></li>
  <li>I contributed hundreds of NMR spectra with uncommon nuclei to <a href="https://sourceforge.net/projects/nmrshiftdb2/files/data/">NMRShiftDb</a></li>
  <li>Wikidata, see <a href="https://chem-bla-ics.linkedchemistry.info/2025/05/25/new-preprint-scholia-chemistry-access-to-chemistry-in-wikidata.html">this preprint</a>,
but also many small projects, like adding CXSMILES for polymers, and <a href="https://laurendupuis.github.io/Scholia_tutorial/">main subject annotation in Scholia</a></li>
  <li>WikiPathways (see <a href="https://chem-bla-ics.linkedchemistry.info/tag/wikipathways">these blog posts</a>), where I started
<a href="https://classic.wikipathways.org/index.php?title=Special:Contributions&amp;dir=prev&amp;target=Egonw&amp;month=&amp;year=">curating metabolites in 2012</a>,
set up <a href="https://chem-bla-ics.linkedchemistry.info/2018/10/11/two-presentations-at-wikipathways-2018.html">a computer-assisted curation platform</a>
<a href="https://chem-bla-ics.linkedchemistry.info/2016/07/02/two-apache-jena-sparql-query.html">using SPARQL</a>, and
was an early curator of <a href="https://chem-bla-ics.linkedchemistry.info/2020/10/31/sars-cov-2-covid-19-and-open-science.html">SARS-CoV-2 biological processes</a></li>
  <li>citation intent annotation with the Citation Typing Ontology, see this <a href="https://scholia.toolforge.org/cito/">Scholia overview</a></li>
  <li>nanosafety ontology and data: the <a href="https://github.com/enanomapper/ontologies">eNanoMapper Ontology</a> (ENMO),
<a href="https://figshare.com/search?q=nanowiki">NanoWiki</a>, <a href="https://nanocommons.github.io/specifications/jrc/">JRC nanomaterial index</a> and
<a href="https://nanocommons.github.io/erm-database/">the ERM identifier database</a></li>
  <li>made RDF for supplementary information (e.g. <a href="http://chem-bla-ics.linkedchemistry.info/2018/09/16/data-curation-5-inspiration-95.html">this NanoE-Tox spreadsheet</a>) and
full databases, like <a href="https://chem-bla-ics.linkedchemistry.info/2011/04/21/chembl-09-as-rdf.html">ChEMBL</a> and
<a href="https://chem-bla-ics.linkedchemistry.info/2009/09/04/nmrshiftdb-enters-rdfopenmoleculesnet-2.html">NMRShiftDb <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li>organized <a href="https://chem-bla-ics.linkedchemistry.info/2019/10/14/chemcuration-2019-poster-conference.html">an online ChemCuration event</a> (inspired by the ISB annual meetings!)</li>
</ul>

<p>I am also curating my blog, which was <a href="https://chem-bla-ics.linkedchemistry.info/2023/08/18/last-post-here-freebie-model-online.html">originally on blogger.com but is being ported to Markdown with extra annotation</a>.
That includes <a href="https://chem-bla-ics.linkedchemistry.info/2023/07/27/archiving-and-updating-my-blog.html">updating URLs</a>
and annotation of blog posts <a href="https://chem-bla-ics.linkedchemistry.info/2005/10/21/viagra-saves-environment.html">with chemicals</a>,
<a href="https://chem-bla-ics.linkedchemistry.info/2024/10/24/vhp4safety.html">grants</a>, and
<a href="https://chem-bla-ics.linkedchemistry.info/2025/02/08/cito-for-blog-citations.html">intention-typed citations</a>.</p>

<h2 id="long-tail">Long tail</h2>

<p>Of course, I have my Wikipedia edits, and contributed to projects like <a href="https://github.com/biopragmatics/bioregistry/commits/main/?author=egonw">Bioregistry.io</a>,
<a href="https://fairsharing.org/users/596">FAIRsharing</a>, regularly submit <a href="https://form.typeform.com/to/SWoxIY?typeform-source=altmetric.typeform.com">missed mentions to Altmetric.com</a>,
etc. There is a long tail in curation. And there is a lot of curation hidden in <a href="https://scholar.google.com/citations?user=u8SjMZ0AAAAJ&amp;hl=en">my literature list</a>.</p>

<p>And that long tail matters to me. I want every researcher to pick up the challenge to curate their own
research output. Put your experimental data in databases, add important provenance, get the details right.
This is essential to reduce the cost of doing research, and that is more important than ever.</p>

<p>BTW, I must note that our bioinformatics team colleagues too have done a tremendous amount of biocuration,
in WikiPathways (<a href="https://scholia.toolforge.org/author/Q43744369">Denise</a>, <a href="https://scholia.toolforge.org/author/Q28025534">Freddie</a>,
<a href="https://scholia.toolforge.org/author/Q19851164">Susan</a>), in nanosafety (<a href="https://scholia.toolforge.org/author/Q99306396">Jeaphianne</a>,
<a href="https://scholia.toolforge.org/author/Q86442640">Ammar</a>), and in toxicology (<a href="https://scholia.toolforge.org/author/Q42369611">Marvin</a>),
just to name a few. Often together with B.Sc. and M.Sc. students (which <a href="https://europepmc.org/article/med/26557796">can work really well</a>).</p>

<h2 id="award-nomination">Award nomination</h2>

<p>And I hope this makes it clear why I am delighted to have been <a href="https://www.biocuration.org/community/biocuration-career-awards/excellence-in-biocuration-advanced-career-award-2025/">nominated last week</a>
for an ISB <em>Excellence in Biocuration Advanced Career Award</em>. The list of past awardees is impressive,
as are the other nominations:
<a href="https://scholia.toolforge.org/author/Q89869027">Laurel Cooper</a>, Oregon State University/USA,
<a href="https://scholia.toolforge.org/author/Q57227590">Steven Marygold</a>, University of Cambridge/UK,
<a href="https://scholia.toolforge.org/author/Q111430202">Saurabh Raghuvanshi</a>, University of Delhi/India, and
<a href="https://scholia.toolforge.org/author/Q59674797">Kimberly Van Auken</a>, California Institute of Technology/USA.</p>

<p>It’s an honor to be listed alongside these other nominees and being nominated is a great recognition! With a
<em>thank you</em> to the person who proposed my nomination.</p>]]></content><author><name>Egon Willighagen</name></author><category term="curation" /><category term="openscience" /><category term="justdoi:10.1038/455047A" /><category term="doi:10.1021/CI050400B" /><category term="nmrshiftdb" /><category term="europepmc" /><summary type="html"><![CDATA[Depending on your exact definition of doing science, keeping track as precise as possible of your observations is an essential part of doing science. The precision should be high enough that mistakes are obvious. This pattern is, of course, not limited to doing science and we see this in open source development too. Unfortunately, in the modern way of doing science, this is not getting the attention it should get. Worse, with narratives (stories) about the research, in the form of journal articles, are generally considered more important that a precise description of the observations.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/biocuration.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/biocuration.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">All BioHackrXiv preprints and BioHackathon RSS feeds</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/06/22/all-biohackrxiv-preprints-and-biohackathon-rss-feeds.html" rel="alternate" type="text/html" title="All BioHackrXiv preprints and BioHackathon RSS feeds" /><published>2025-06-22T00:00:00+00:00</published><updated>2025-06-22T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/06/22/all-biohackrxiv-preprints-and-biohackathon-rss-feeds</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/06/22/all-biohackrxiv-preprints-and-biohackathon-rss-feeds.html"><![CDATA[<p>One thing I was still missing in <a 
href="https://biohackrxiv.org">BioHackrXiv</a> was a place with an overview
of: 1. all biohackathons, 2. all preprints linked to a biohackathon, 3. an RSS feed for new papers of a biohackathon.
Of course, there is the <a href="https://biohackrxiv.org/discover">BioHackrXiv discover</a> service, but the biohackathon
is not a metadata field and I cannot filter based on it. And, of course, there is Scholia, but not all preprints
are notable (so far, a good number had CiTO annotation that at least made them somewhat notable). Thus,
they are not all listed in <a href="https://scholia.toolforge.org/venue/Q115450084">this venue page</a> and neither
on <a href="https://scholia.toolforge.org/event-series/Q109379759">this overview of collections of preprints linked to BioHackathon Europe meetings</a>.</p>

<p>Additionally, it also does not have an RSS feed, and
<a href="https://pluralistic.net/2024/10/16/keep-it-really-simple-stupid/read-receipts-are-you-kidding-me-seriously-fuck-that-noise">we should indeed be using RSS more</a>.
So, I hacked something up and impressions were positive. Based on Jekyll and the experiences I had with this
blog, I modelled individual articles as blog posts and biohackathons as tags. That automatically gave me
the RSS feeds:</p>

<ul>
  <li><a href="https://index.biohackrxiv.org/feed.xml">feed for new BioHackrXiv preprints</a> (or <a href="https://index.biohackrxiv.org/feed.json">this JSON Feed</a>)</li>
  <li><a href="https://index.biohackrxiv.org/feed/by_tag/BH21EU.xml">feed for Europe BioHackathon 2021</a> (based on <a href="https://index.biohackrxiv.org/tag/BH21EU">the BH21EU tag</a>)</li>
</ul>
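
<p>The modelling idea can be sketched in a few lines, assuming each preprint is a post with a list of tags and every tag gets its own feed path, following the <code class="language-plaintext highlighter-rouge">feed/by_tag/BH21EU.xml</code> pattern above. This only illustrates the grouping, not how Jekyll builds the feeds; the titles and the <code class="language-plaintext highlighter-rouge">BHXXXX</code> tag are placeholders.</p>

```python
from collections import defaultdict


def feeds_by_tag(posts):
    """Group post titles under a per-tag feed path such as feed/by_tag/BH21EU.xml."""
    feeds = defaultdict(list)
    for title, tags in posts:
        for tag in tags:
            feeds[f"feed/by_tag/{tag}.xml"].append(title)
    return dict(feeds)


# Placeholder posts: (title, tags); BHXXXX is a hypothetical biohackathon tag.
posts = [
    ("Preprint A", ["BH21EU"]),
    ("Preprint B", ["BH21EU"]),
    ("Preprint C", ["BHXXXX"]),
]

for path, titles in feeds_by_tag(posts).items():
    print(path, titles)
```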

<p>It’s still very much in progress, but it’s now live at <a href="https://index.biohackrxiv.org/">index.biohackrxiv.org</a>:</p>

<p><img src="/assets/images/biohackrxiv_index.png" alt="" /></p>

<h2 id="extras">Extras</h2>

<p>Some extra goodies already there include:</p>

<ul>
  <li>the <a href="https://en.wikipedia.org/wiki/Altmetrics">Altmetric.com donut</a></li>
  <li>links to <a href="https://scholia.toolforge.org/">Scholia</a></li>
  <li>links to <a href="https://europepmc.org/">Europe PMC</a></li>
</ul>

<p>The <a href="https://index.biohackrxiv.org/tags/">overview of biohackathons</a> looks like this (the tag size follows the number
of preprints for that biohackathon):</p>

<p><img src="/assets/images/biohackrxiv_biohackathons.png" alt="" /></p>]]></content><author><name>Egon Willighagen</name></author><category term="biohackrxiv" /><category term="europepmc" /><summary type="html"><![CDATA[One thing I was still missing in BioHackrXiv was a place with an overview of: 1. all biohackathons, 2. all preprints linked to a biohackathon, 3. an RSS feed for new papers of a biohackathon. Of course, there is the BioHackrXiv discover service, but the biohackathon is not a metadata field and I cannot filter based on it. And, of course, there is Scholia, but not all preprints are notable (so far, a good number had CiTO annotation that at least made them somewhat notable). Thus, they are not all listed in this venue page and neither on this overview of collections of preprints linked to BioHackathon Europe meetings.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/biohackrxiv_index.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/biohackrxiv_index.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">One Million IUPAC names #3: the 200 thousand milestone and 1 million IUPAC names</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/06/09/one-million-iupac-names.html" rel="alternate" type="text/html" title="One Million IUPAC names #3: the 200 thousand milestone and 1 million IUPAC names" /><published>2025-06-09T00:00:00+00:00</published><updated>2025-06-09T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/06/09/one-million-iupac-names</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/06/09/one-million-iupac-names.html"><![CDATA[<p>I could not find the time earlier to report (<a href="https://chem-bla-ics.linkedchemistry.info/2025/06/08/iccs2025-1-back-in-noordwijkerhout.html">reason</a>),
but three weeks ago we passed the fourth milestone release of the CCZero IUPAC names found in literature collection. This release contains
200026 IUPAC names, 168702 unique names, reflecting 116207 unique InChIKeys. Time for an update of the
<a href="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html">One Million IUPAC names</a> project.</p>
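
<p>To make the three numbers concrete, here is a minimal sketch of how such counts relate, assuming every extracted name comes with the InChIKey of the structure it resolves to. The unique-name count is lower than the total because the same name turns up more than once, and the InChIKey count is lower again, presumably because several names can describe the same structure. The records and keys below are placeholders, not data from the release.</p>

```python
def summarize(records):
    """records: list of (name, inchikey) pairs, one pair per extracted name."""
    total = len(records)
    unique_names = len({name for name, _ in records})
    unique_keys = len({key for _, key in records})
    return total, unique_names, unique_keys


# Placeholder records; KEY-A and KEY-B stand in for real InChIKeys.
records = [
    ("methoxybenzene", "KEY-A"),
    ("anisole", "KEY-A"),         # different name, same structure
    ("methoxybenzene", "KEY-A"),  # same name found a second time
    ("2-methylphenol", "KEY-B"),
]

print(summarize(records))
```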

<p>The current count actually is just above 230 thousand IUPAC names, but further growth may require new approaches,
such as the <a href="https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html">four ideas</a>
I posted earlier. I have gone through all full-text Open Access articles provided by the <a href="https://europepmc.org/RestfulWebService">Europe PMC API</a>.
Now, this list is not static, but I wanted to start using their <a href="https://europepmc.org/downloads">bulk downloads</a> anyway.</p>

<h2 id="the-current-results">The current results</h2>

<p>I have been looking at the names coming in. Some are short, others long. The complexity is fascinating and I will
have to brush up my cheminformatics skills to make chemical space plots and visualize the structural diversity.
I also note the current workflow does a good job with Unicode characters, and we have plenty of names
like <code class="language-plaintext highlighter-rouge">ε,ε-carotene-3,3’-dione</code>. There are also names that I do not expect to be really valid, like
<code class="language-plaintext highlighter-rouge">hydroxymethyl methacrylate-</code> that end with a hyphen (41 in total), but their overall count is low.
And OPSIN is happy with it, so the name fits the rules.</p>
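
<p>Both checks are easy to sketch, assuming the names are held as a list of strings. The function names and the tiny sample are made up for illustration and are not the scripts used for the release.</p>

```python
def trailing_hyphen(names):
    """Names that end with a hyphen and may be fragments of a longer name."""
    return [n for n in names if n.endswith("-")]


def longest(names, k=10):
    """The k longest distinct names with their lengths, shortest first."""
    ranked = sorted(set(names), key=lambda n: (len(n), n))
    return [(len(n), n) for n in ranked[-k:]]


# Made-up sample; only the first name is taken from the text above.
names = ["hydroxymethyl methacrylate-", "phenol", "methoxybenzene", "phenol"]

print(trailing_hyphen(names))
print(longest(names, k=2))
```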

<p>The ten longest names (so far) are these (with the lengths 322, 324, 332, 357, 371, 373, 376, 421, 429, and 626):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(5Z)-3-ethyl-5-[[4-[15-[7-[(Z)-(3-ethyl-4-oxo-2-sulfanylidene-1,3-thiazolidin-5-ylidene)methyl]-2,1,3-benzothiadiazol-4-yl]-9,9,18,18-tetra(nonyl)-5,14-dithiapentacyclo[10.6.0.03,10.04,8.013,17]octadeca-1(12),2,4(8),6,10,13(17),15-heptaen-6-yl]-2,1,3-benzothiadiazol-7-yl]methylidene]-2-sulfanylidene-1,3-thiazolidin-4-one
(Z)-[[4-[[(Z)-N’-carbamoyl-N-[2-[2-[2-[[3-[(4S)-6,8-dichloro-2-methyl-3,4-dihydro-1H-isoquinolin-4-yl]phenyl]sulfonylamino]ethoxy]ethoxy]ethyl]carbamimidoyl]amino]butylamino]-[2-[2-[2-[[3-[(4S)-6,8-dichloro-2-methyl-3,4-dihydro-1H-isoquinolin-4-yl]phenyl]sulfonylamino]ethoxy]ethoxy]ethylamino]methylene]urea dihydrochloride
2-((Z)-2-((6-(4-(6-((Z)-(1-(dicyanomethylene)-5,6-difluoro-3-oxo-1H-inden-2(3H)-ylidene)methyl)-4,4-bis(2-ethylhexyl)-4H-cyclopenta[1,2-b:5,4-b′]dithiophen-2-yl)-2,3-bis(hexyloxy)phenyl)-4-(5,7-diethylundecan-6-yl)-4H-cyclopenta[1,2-b:5,4-b′]dithiophen-2-yl)methylene)-5,6-difluoro-3-oxo-2,3-dihydro-1H-inden-1-ylidene)malononitrile
(2S,4S,5R,6R)‐5‐acetamido‐2‐[(2S,3R,4R,5S,6R)‐5‐[(2S,3R,4R,5R,6R)‐3‐acetamido‐4,5‐dihydroxy‐6‐(hydroxymethyl)oxan‐2‐yl]oxy‐2‐[(2R,3S,4R,5R,6R)‐4,5‐dihydroxy‐2‐(hydroxymethyl)‐6‐[(E,2S,3R)‐3‐hydroxy‐2‐(octadecanoylamino)octadec‐4‐enoxy]oxan‐3‐yl]oxy‐3‐hydroxy‐6‐(hydroxymethyl)oxan‐4‐yl]oxy‐4‐hydroxy‐6‐[(1R,2R)‐1,2,3‐trihydroxypropyl]oxane‐2‐carboxylic acid
(2R,3S,4R,5R,7S,9S,10S,11R,12S,13R)-7-[(benzylcarbamoyl)oxy]-2-(1-{[(2R,3R,4R,5R,6R)-5-hydroxy-3,4-dimethoxy-6-methyltetrahydro-2H-pyran-2-yl]oxy}propan-2-yl)-10-{[(2S,3R,6R)-3-hydroxy-4-(methoxyimino)-6-methyltetrahydro-2H-pyran-2-yl]oxy}-3,5,7,9,11,13-hexamethyl-6,14-dioxo-12-{[(2S,5R,7R)-2,4,5-trimethyl-1,4-oxazepan-7-yl]oxy}oxacyclotetradecan-4-yl 3-methylbutanoate
2-[4-[2-[[(2R)-1-[[(4R,7S,10S,13R,16S,19R)-10-(4-aminobutyl)-4-[[(2R,3R)-1,3 dihydroxybutan-2-yl]carbamoyl]-7-[(1R)-1-hydroxyethyl]-16-[(4-hydroxyphenyl)methyl]-13-(1H-indol3-ylmethyl)-6,9,12,15,18-pentaoxo-1,2-dithia-5,8,11,14,17-pentazacycloicos-19-yl]amino]-1-oxo-3 phenylpropan-2-yl]amino]-2-oxoethyl]-7,10-bis(carboxymethyl)-1,4,7,10-tetrazacyclododec-1-yl]acetic acid
(2R,3S,4R,5R,7S,9S,10S,11R,12S,13R)-12-{[(2R,4R,5S,6S)-4,5-dihydroxy-4,6-dimethyltetrahydro-2H-pyran-2-yl]oxy}-7-hydroxy-2-(1-{[(2R,3R,4R,5R,6R)-5-hydroxy-3,4-dimethoxy-6-methyltetrahydro-2H-pyran-2-yl]oxy}propan-2-yl)-10-{[(2S,3R,6R)-3-hydroxy-4-(methoxyimino)-6-methyltetrahydro-2H-pyran-2-yl]oxy}-3,5,7,9,11,13-hexamethyl-6,14-dioxooxacyclotetradecan-4-yl 3-methylbutanoate
(2S,4S,5R,6R)‐5‐acetamido‐2‐[(2S,3R,4R,5S,6R)‐5‐[(2S,3R,4R,5R,6R)‐3‐acetamido‐5‐hydroxy‐6‐(hydroxymethyl)‐4‐[(2R,3R,4S,5R,6R)‐3,4,5‐trihydroxy‐6‐(hydroxymethyl)oxan‐2‐yl]oxyoxan‐2‐yl]oxy‐2‐[(2R,3S,4R,5R,6R)‐4,5‐dihydroxy‐2‐(hydroxymethyl)‐6‐[(E,2S,3R)‐3‐hydroxy‐2‐(octadecanoylamino)octadec‐4‐enoxy]oxan‐3‐yl]oxy‐3‐hydroxy‐6‐(hydroxymethyl)oxan‐4‐yl]oxy‐4‐hydroxy‐6‐[(1R,2R)‐1,2,3‐trihydroxypropyl]oxane‐2‐carboxylic acid
(2R,3S,4R,5R,7S,9S,10S,11R,12S,13R)-12-{[(2R,4R,5S,6S)-4,5-dihydroxy-4,6-dimethyltetrahydro-2H-pyran-2-yl]oxy}-2-(1-{[(2R,3R,4R,5R,6R)-5-hydroxy-3,4-dimethoxy-6-methyltetrahydro-2H-pyran-2-yl]oxy}propan-2-yl)-10-{[(2S,3R,6R)-3-hydroxy-4-(methoxyimino)-6-methyltetrahydro-2H-pyran-2-yl]oxy}-3,5,7,9,11,13-hexamethyl-7-({[2-(2-methyl-5-nitro-1H-imidazol-1-yl)ethyl]carbamoyl}oxy)-6,14-dioxooxacyclotetradecan-4-yl 3-methylbutanoate
N-[(2S,3R,4R,5S,6R)-5-[(2S,3R,4R,5S,6R)-3-amino-5-[(2S,3R,4R,5S,6R)-3-amino-5-[(2S,3R,4R,5S,6R)-3-amino-5-[(2S,3R,4R,5S,6R)-3-amino-5-[(2S,3R,4R,5S,6R)-3-amino-5-[(2S,3R,4R,5S,6R)-3-amino-4,5-dihydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-2-[(2R,3S,4R,5R,6S)-5-amino-6-[(2R,3S,4R,5R,6R)-5-amino-4,6-dihydroxy-2-(hydroxymethyl)oxan-3-yl]oxy-4-hydroxy-2-(hydroxymethyl)oxan-3-yl]oxy-4-hydroxy-6-(hydroxymethyl)oxan-3-yl]carbamate
</code></pre></div></div>

<p>That last compound has the InChIKey <code class="language-plaintext highlighter-rouge">DKPKDPKJVDQUPD-XGBIXEJNSA-M</code> and can be found neither with Google nor in PubChem.
It looks like this:</p>

<p><img src="/assets/images/iupac_626.png" alt="" /></p>

<p>There are <a href="https://pubchem.ncbi.nlm.nih.gov/#query=N-%5B(2S%2C3R%2C4R%2C5S%2C6R)-5-%5B(2S%2C3R%2C4R%2C5S%2C6R)-3-amino-5-%5B(2S%2C3R%2C4R%2C5S%2C6R)-3-amino-5-%5B(2S%2C3R%2C4R%2C5S%2C6R)-3-amino-5-%5B(2S%2C3R%2C4R%2C5S%2C6R)-3-amino-5-%5B(2S%2C3R%2C4R%2C5S%2C6R)-3-amino-5-%5B(2S%2C3R%2C4R%2C5S%2C6R)-3-amino-4%2C5-dihydroxy-6-(hydroxymethyl)oxan-2-yl%5Doxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl%5Doxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl%5Doxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl%5Doxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl%5Doxy-4-hydroxy-6-(hydroxymethyl)oxan-2-yl%5Doxy-2-%5B(2R%2C3S%2C4R%2C5R%2C6S)-5-amino-6-%5B(2R%2C3S%2C4R%2C5R%2C6R)-5-amino-4%2C6-dihydroxy-2-(hydroxymethyl)oxan-3-yl%5Doxy-4-hydroxy-2-(hydroxymethyl)oxan-3-yl%5Doxy-4-hydroxy-6-(hydroxymethyl)oxan-3-yl%5Dcarbamate">some closely related compounds</a>,
though.</p>
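<p>A ranking like the ten longest names above can be reproduced with a few lines of code. The following is a minimal
sketch in Python, not the script used for this post: the function name <code class="language-plaintext highlighter-rouge">longest_names</code> and the example
list are made up for illustration, and it assumes that length is counted in Unicode code points, which may differ
from how the lengths above were calculated.</p>

```python
def longest_names(names, n=10):
    """Return the n longest unique names as (length, name) pairs, shortest first."""
    unique = set(names)
    # Sort by length, then alphabetically, so that ties are resolved deterministically.
    ranked = sorted(unique, key=lambda name: (len(name), name))
    return [(len(name), name) for name in ranked[-n:]]


# Hypothetical input; the real list would be read from the milestone release.
example = ["ethanol", "propan-2-ol", "ethanol", "2-methylpropan-2-ol", "methane"]
for length, name in longest_names(example, n=3):
    print(length, name)
```

<p>Duplicate names are removed first, so a name that was found in many articles is only ranked once.</p>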

<h2 id="chemicals-only-published-about-once">Chemicals only published about once</h2>

<p>Some <a href="https://doi.org/10.59350/rzepa.28802">related data was blogged</a> by <a href="https://orcid.org/0000-0002-8635-8390">Henry Rzepa</a> last week,
with this quote by Lee from CAS:</p>

<blockquote>
  <p>38.5% of the current substances have only 1 reference</p>
</blockquote>

<p>Apparently, based on <a href="https://www.cas.org/support/documentation/chemical-substances">CAS Registry</a> data,
almost 2 in 5 chemical structures have been published only once, and about 3 in 5 have been published
at least twice. I agree with Henry here: with the organic chemistry literature in mind, I would have
expected that 38.5% to be higher.</p>

<p>Anyway, since this project is not tracking in which articles IUPAC names are found, I have no data to study this.</p>

<h2 id="1-million-iupac-names">1 million IUPAC names</h2>

<p>So, the primary goal of this project is to reach one million IUPAC names. We are currently at around 23%.
Not bad, considering we started in February. And we have plenty of untouched literature left.</p>

<p>But I also applied <a href="https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html">idea 1</a>,
the name variations. The idea is that this way I can explode the number of compounds. For the compound above,
just enumerating all replacements of <code class="language-plaintext highlighter-rouge">OH</code> with <code class="language-plaintext highlighter-rouge">OMe</code> and <code class="language-plaintext highlighter-rouge">OEt</code> would already help a lot.</p>

<p>Because I wanted to make sure I could answer positively at the ICCS if we made it to one million
CCZero IUPAC names, I implemented a very simple enumeration script. Really dumb approach. But the
results are interesting. I started with the 200026 names from the milestone. If I
<a href="https://github.com/BlueObelisk/iupac-names/blob/main/explode.groovy">explode</a> these names,
I get 1,377,127 IUPAC names, well above the target. Even if I remove names that differ only in which Unicode
character is used for the hyphens, I still have 1,162,107 IUPAC names.</p>
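<p>Removing variations that are only due to Unicode hyphens can be sketched as follows. This Python snippet is an
illustration and not the code in <code class="language-plaintext highlighter-rouge">explode.groovy</code>: the set of hyphen-like characters
in <code class="language-plaintext highlighter-rouge">HYPHEN_LIKE</code> and both function names are assumptions. Names like the fourth one in the
list of longest names use U+2010 instead of the ASCII hyphen, which is the kind of difference this collapses.</p>

```python
# Unicode characters that are often used where an ASCII hyphen-minus is meant.
# This selection is an assumption for illustration, not the project's list.
HYPHEN_LIKE = "\u2010\u2011\u2012\u2013\u2212"

# Translation table that maps each hyphen-like character to "-".
_TO_ASCII_HYPHEN = str.maketrans({ch: "-" for ch in HYPHEN_LIKE})


def normalize_hyphens(name):
    """Return the name with all hyphen-like characters replaced by '-'."""
    return name.translate(_TO_ASCII_HYPHEN)


def count_unique_after_normalization(names):
    """Count the names that remain distinct once hyphens are normalized."""
    return len({normalize_hyphens(name) for name in names})


# Hypothetical input: the first two names differ only in the hyphen character.
names = [
    "oxane-2-carboxylic acid",
    "oxane\u20102\u2010carboxylic acid",
    "propan-2-ol",
]
print(count_unique_after_normalization(names))
```

<p>Only the hyphens are touched here. Other typographic variation, such as primes and apostrophes, would need its own mapping.</p>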

<p>Something interesting that I do not fully understand yet, however, is the following.
When I calculate the number of unique InChIKeys for the milestone, I get 117,726 keys, and when I do
this for the list of name variations, I get 203,979 keys. So, while the IUPAC name list is about five
times as long, the list of InChIKeys is not even twice as long. Well, I guess that is why this is called
research.</p>]]></content><author><name>Egon Willighagen</name></author><category term="iupac" /><category term="textmining" /><category term="inchikey:DKPKDPKJVDQUPD-XGBIXEJNSA-M" /><category term="cito:containsAssertionFrom:10.59350/rzepa.28802" /><category term="europepmc" /><summary type="html"><![CDATA[I could not find the time earlier to report (reason), but three weeks ago we passed the fourth milestone release of the CCZero IUPAC names found in literature collection. This release contains 200026 IUPAC names, 168702 unique names, reflecting 116207 unique InChIKeys. Time for an update of the One Million IUPAC names project.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/iupac_626.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/iupac_626.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">One Million IUPAC names #2: the 100 thousand milestone</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html" rel="alternate" type="text/html" title="One Million IUPAC names #2: the 100 thousand milestone" /><published>2025-04-27T00:00:00+00:00</published><updated>2025-04-27T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html"><![CDATA[<p>Two and a half month into the <a href="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html">One Million IUPAC Names</a>
project, we passed <a href="https://github.com/BlueObelisk/iupac-names/releases/tag/milestone-100k">the third milestone</a>,
the one for 100 thousand IUPAC names (doi:<a href="https://doi.org/10.5281/zenodo.15266459">10.5281/zenodo.15266459</a>).
Time for an update.</p>

<p>This milestone release took a bit longer. Going from 50 to 100 thousand is a bigger step than from 10 to 50
thousand, and the open access chemistry literature was already done by then. Basically, I ran out of open access
chemistry publications. The scripts are now finding names in all (open access) literature, and the number of
new names per article is a lot lower. Still about one in every 20 to 30 articles. But the diversity in names
is not really going down, which is important.</p>

<p>The first few weeks, I used Google Colab to run a Jupyter notebook, initially created by
<a href="https://cpm.lumc.nl/research/bioinformatics-224/magnus-palmblad-5">Magnus</a>, but having to process more articles
to get a reasonable number of new IUPAC names required longer and longer jobs, and Google Colab
is not really a good fit for that (well, the free version anyway). So, I started using a local script. That turned out
to be able to handle up to 20 thousand articles in one go and runs at least twice as fast. Moreover, I can
run three of them in parallel.</p>

<p>And that had impact. With each commit adding around 1000 new IUPAC names, the number of commits went up remarkably
last week:</p>

<p><img src="/assets/images/iupac-names-commits.png" alt="" /></p>

<p>At the current speed, I think we’ll make it to 150k soon and I added a new milestone for 200k, which sounds
doable in the next three weeks. That also means that 1M extracted IUPAC names from literature has become
a reasonable goal. And we can start thinking about the 2, 5, 10, 50 and 100 million IUPAC names. Those are,
at the current speed, rather unlikely to be reached from the open access literature anytime soon. That brings
us to the question, what will? Well, I have some ideas.</p>

<h3 id="idea-1-name-variations">Idea 1: name variations</h3>

<p>First, I am figuring out some ways to make variants of names (no, not based on hyphens and spaces; that’s too easy),
but actual variations of the chemical structures. For example, I could exhaustively replace “methoxy” with “ethoxy”,
and iterate over the halogens and acyl chain lengths. I have little doubt that I can easily grow the list 5-fold
with this approach, maybe even 10-fold.</p>
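<p>To make the idea concrete, here is a minimal sketch in Python of such an enumeration. It is an assumption about how
this could work, not existing project code: the function name <code class="language-plaintext highlighter-rouge">single_site_variants</code> is made up,
it changes only one position per generated name rather than all combinations, and it works on plain text, so every
generated name would still have to be checked with a tool like OPSIN.</p>

```python
import re

HALOGENS = ("fluoro", "chloro", "bromo", "iodo")
HALOGEN_PATTERN = re.compile("|".join(HALOGENS))


def single_site_variants(name):
    """Return names that differ from `name` by exactly one text substitution."""
    found = set()
    # Swap one "methoxy" at a time for "ethoxy".
    for match in re.finditer("methoxy", name):
        found.add(name[:match.start()] + "ethoxy" + name[match.end():])
    # Swap one halogen prefix at a time for each of the other halogens.
    for match in HALOGEN_PATTERN.finditer(name):
        for halogen in HALOGENS:
            if halogen != match.group():
                found.add(name[:match.start()] + halogen + name[match.end():])
    return found


for variant in sorted(single_site_variants("1-chloro-4-methoxybenzene")):
    print(variant)
```

<p>Applying the function again to its own output gives the names with two changes, and so on, which is where the
multiplication of the list would come from. Because the substitution is purely textual, it can produce names where
the locants or the alphabetical order of substituents no longer follow the rules.</p>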

<h3 id="idea-2-hallucination">Idea 2: hallucination</h3>

<p>Another idea is that I could use tools that can generate IUPAC names for a limited set of compounds.
I once wrote code for alkanes myself and if I can find that, I may be able to generate additional names.
But perhaps more realistic is that I train a deep learning model and have it generate names for all compounds in
Wikidata (~1.5 million) or PubChem (&gt;100 million). STOUT needed 81 million compounds
(doi:<a href="https://doi.org/10.1186/s13321-021-00512-4">10.1186/s13321-021-00512-4</a>), but I don’t need a good model;
I just need a model that comes up with new, valid names. Hallucinated names, but valid.</p>

<p>While the list of valid names grows, I can retrain the deep-learned model and repeat. As long as the diversity
remains high enough, one could hypothesize that the deep learning will learn new tricks. And then,
that should be a near infinite source of additional names.</p>

<h3 id="idea-3-semi-closed-access-literature">Idea 3: (semi-)closed access literature</h3>

<p>Also, I haven’t touched closed access articles yet. This is all based on the collection of full texts
in <a href="https://europepmc.org/">Europe PMC</a>. For example, I could start with the green open access articles
in (Dutch) university repositories, particularly those with large chemistry departments. PDF to text
tools are mature enough that this will provide a new source. Oh, and perhaps PhD theses, which are now
also increasingly archived in university repositories under open access. And that reminds me of a Dutch
project two decades ago doing exactly that. I wish I remembered the name.</p>

<h3 id="idea-4-alternatives-to-oscar4-and-europe-pmc">Idea 4: alternatives to Oscar4 and Europe PMC</h3>

<p>So, the first round of named entity recognition was with Europe PMC itself, as explained in
<a href="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html">the first post</a>. The move
to Oscar4 helped a lot. But there exist many other chemical NER tools, for example
(doi:<a href="https://doi.org/10.1093/bioinformatics/btn181">10.1093/bioinformatics/btn181</a>). And those may
find an additional number of names, even with just the literature I already covered.</p>

<p>Well, you get the idea.</p>

<h2 id="iccs-poster-rejected">ICCS poster rejected</h2>

<p>Unfortunately, the <a href="https://iccs-nl.org/">ICCS poster</a> abstract did not make the cut. The score was high enough,
but they received many abstracts and had to make a selection (of course, I am part of the ICCS organization,
and have more details of how it came about). I really like the project, and am eager to write up a paper around
it.</p>]]></content><author><name>Egon Willighagen</name></author><category term="iupac" /><category term="textmining" /><category term="oscar" /><category term="cito:citesForInformation:10.1186/s13321-021-00512-4" /><category term="cito:citesAsPotentialSolution:10.1093/bioinformatics/btn181" /><category term="europepmc" /><summary type="html"><![CDATA[Two and a half month into the One Million IUPAC Names project, we passed the third milestone, the one for 100 thousand IUPAC names (doi:10.5281/zenodo.15266459). Time for an update.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/iupac-names-commits.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/iupac-names-commits.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>