<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://chem-bla-ics.linkedchemistry.info/feed/by_tag/oscar.xml" rel="self" type="application/atom+xml" /><link href="https://chem-bla-ics.linkedchemistry.info/" rel="alternate" type="text/html" /><updated>2026-06-15T12:00:19+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/feed/by_tag/oscar.xml</id><title type="html">chem-bla-ics</title><subtitle>Chemblaics (pronounced chem-bla-ics) is the science that uses open science and computers to solve problems in chemistry, biochemistry and related fields.</subtitle><author><name>Egon Willighagen</name></author><entry><title type="html">One Million IUPAC names #2: the 100 thousand milestone</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html" rel="alternate" type="text/html" title="One Million IUPAC names #2: the 100 thousand milestone" /><published>2025-04-27T00:00:00+00:00</published><updated>2025-04-27T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html"><![CDATA[<p>Two and a half month into the <a href="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html">One Million IUPAC Names</a>
project, we passed <a href="https://github.com/BlueObelisk/iupac-names/releases/tag/milestone-100k">the third milestone</a>,
the one for 100 thousand IUPAC names (doi:<a href="https://doi.org/10.5281/zenodo.15266459">10.5281/zenodo.15266459</a>).
Time for an update.</p>

<p>This milestone release took a bit longer. Going from 50 to 100 thousand is a bigger step than from 10 to 50
thousand, but the open access chemistry literature was already done by then. Basically, I ran out of open access
chemistry publications. The scripts are now finding names in all (open access) literature, and the number of
new names per article is a lot lower: still about one new name in every 20 to 30 articles. But the diversity in names
is not really going down, which is important.</p>

<p>The first few weeks, I used Google Colab to run a Jupyter notebook, initially created by
<a href="https://cpm.lumc.nl/research/bioinformatics-224/magnus-palmblad-5">Magnus</a>, but having to process more articles
to get a reasonable number of new IUPAC names required longer and longer jobs, and Google Colab
is not really fit for that (well, the free version anyway). So, I started using a local script. That turned out
to be able to handle up to 20 thousand articles in one go and runs at least twice as fast. Moreover, I can
run three of them in parallel.</p>

<p>And that had impact. With each commit around 1000 new IUPAC names, the number of commits went up remarkably
last week:</p>

<p><img src="/assets/images/iupac-names-commits.png" alt="" /></p>
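
<p>As a rough sanity check on these numbers, here is a back-of-the-envelope sketch. It assumes that one commit corresponds to one full batch of 20 thousand articles, which is a simplification and not something the commit history guarantees:</p>

```python
# Back-of-the-envelope estimate; the one-commit-per-batch mapping is an assumption.
ARTICLES_PER_BATCH = 20_000        # upper limit the local script handles in one go
ARTICLES_PER_NEW_NAME = (20, 30)   # about one new name per 20 to 30 articles


def names_per_batch(articles, articles_per_name):
    """Expected number of new names from one batch of articles."""
    return articles // articles_per_name


high = names_per_batch(ARTICLES_PER_BATCH, ARTICLES_PER_NEW_NAME[0])
low = names_per_batch(ARTICLES_PER_BATCH, ARTICLES_PER_NEW_NAME[1])
print(f"{low} to {high} new names per batch")

# Batches needed to go from 100 thousand to 200 thousand names (rounded up)
remaining = 200_000 - 100_000
fewest = -(-remaining // high)
most = -(-remaining // low)
print(f"{fewest} to {most} batches to reach 200k")
```

<p>Under these assumptions a full batch gives several hundred up to about a thousand new names, which is in line with the size of the commits.</p>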

<p>At the current speed, I think we’ll make it to 150k soon and I added a new milestone for 200k, which sounds
doable in the next three weeks. That also means that 1M extracted IUPAC names from literature has become
a reasonable goal. And we can start thinking about the 2, 5, 10, 50 and 100 million IUPAC names. Those are,
at the current speed, rather unlikely to be reached from the open access literature anytime soon. That brings
us to the question: what will? Well, I have some ideas.</p>

<h3 id="idea-1-name-variations">Idea 1: name variations</h3>

<p>First, I am figuring out some ways to make variants of names. No, not based on hyphens and spaces (that’s too easy),
but actual variations of the chemical structures. For example, I could exhaustively replace “methoxy” with “ethoxy”,
and iterate the halogens and acyl chain lengths. I have little doubt that I can easily grow the list 5-fold
with this approach, maybe even 10-fold.</p>
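
<p>A minimal sketch of what such a substitution could look like. The substituent families and the function name are made up for illustration, and every generated string would still need to be validated (for example with OPSIN) before it counts as a name:</p>

```python
import re

# Illustrative substituent families (an assumption, not a complete list);
# any member of a family may be swapped for any other member.
FAMILIES = [
    ["methoxy", "ethoxy"],
    ["fluoro", "chloro", "bromo", "iodo"],
]


def variants(name):
    """Return all names that differ from `name` by one substituent swap."""
    found = set()
    for family in FAMILIES:
        for old in family:
            # \b stops "ethoxy" from matching inside "methoxy"
            for match in re.finditer(r"\b" + old, name, flags=re.IGNORECASE):
                for new in family:
                    if new == old:
                        continue
                    # keep the capitalisation of the original substituent
                    if match.group()[0].isupper():
                        new = new.capitalize()
                    found.add(name[:match.start()] + new + name[match.end():])
    return sorted(found)


for variant in variants("5-Bromo-1H-indole-3-carboxylic acid"):
    print(variant)
```

<p>The word boundary makes this deliberately conservative: it skips substituents that carry a multiplier prefix, such as the “chloro” in “dichloro”.</p>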

<h3 id="idea-2-hallucination">Idea 2: hallucination</h3>

<p>Another idea is that I could use tools that can generate IUPAC names for a limited set of compounds.
I once wrote code for alkanes myself and if I can find that, I may be able to generate additional names.
But perhaps more realistic is that I train a deep learning model and have it generate names for all compounds in
Wikidata (~1.5 million) or PubChem (&gt;100 million). STOUT needed 81 million compounds
(doi:<a href="https://doi.org/10.1186/s13321-021-00512-4">10.1186/s13321-021-00512-4</a>), but I don’t need a good model;
I just need a model that comes up with new, valid names. Hallucinated names, but valid.</p>

<p>While the list of valid names grows, I can retrain the deep-learned model and repeat. As long as the diversity
remains high enough, one could hypothesize that the deep learning will learn new tricks. And then,
that should be a near infinite source of additional names.</p>

<h3 id="idea-3-semi-closed-access-literature">Idea 3: (semi-)closed access literature</h3>

<p>Also, I haven’t touched closed access articles yet. This is all based on the collection of full texts
in <a href="https://europepmc.org/">Europe PMC</a>. For example, I could start with the green open access articles
in (Dutch) university repositories, particularly those with large chemistry departments. PDF to text
tools are mature enough that this will provide a new source. Oh, and perhaps PhD theses, which are now
also increasingly archived in university repositories under open access. And that reminds me of a Dutch
project two decades ago doing exactly that. I wish I remembered the name.</p>

<h3 id="idea-4-alternatives-to-oscar4-and-europe-pmc">Idea 4: alternatives to Oscar4 and Europe PMC</h3>

<p>So, the first round of named entity recognition was with Europe PMC itself, as explained in
<a href="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html">the first post</a>. The move
to Oscar4 helped a lot. But there exist many other chemical NER tools, like
(doi:<a href="https://doi.org/10.1093/bioinformatics/btn181">10.1093/bioinformatics/btn181</a>). And those may
find an additional number of names, even with just the literature I already covered.</p>

<p>Well, you get the idea.</p>

<h2 id="iccs-poster-rejected">ICCS poster rejected</h2>

<p>Unfortunately, the <a href="https://iccs-nl.org/">ICCS poster</a> abstract did not make the cut. The score was high enough,
but they received many abstracts and had to make a selection (of course, I am part of the ICCS organization,
and have more details of how it came about). I really like the project, and am eager to write up a paper around
it.</p>]]></content><author><name>Egon Willighagen</name></author><category term="iupac" /><category term="textmining" /><category term="oscar" /><category term="cito:citesForInformation:10.1186/s13321-021-00512-4" /><category term="cito:citesAsPotentialSolution:10.1093/bioinformatics/btn181" /><category term="europepmc" /><summary type="html"><![CDATA[Two and a half month into the One Million IUPAC Names project, we passed the third milestone, the one for 100 thousand IUPAC names (doi:10.5281/zenodo.15266459). Time for an update.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/iupac-names-commits.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/iupac-names-commits.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">One Million IUPAC names</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html" rel="alternate" type="text/html" title="One Million IUPAC names" /><published>2025-03-08T00:00:00+00:00</published><updated>2025-03-08T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html"><![CDATA[<p>Names of chemicals are part of the human user experience when browsing a chemical database. And literature too,
of course. Chemical names are also not easy to use, and what a chemical name means is not always clear.
This is why the <a href="https://en.wikipedia.org/wiki/International_Union_of_Pure_and_Applied_Chemistry">IUPAC</a>
started standardizing the nomenclature in chemistry, resulting in the <em>IUPAC names</em>. Each IUPAC name uniquely defines
a chemical structure. For example, <em>methane</em> is the IUPAC name for the chemical CH<sub>4</sub>.</p>

<p>So, when propagating chemical structures from the <a href="https://chem-bla-ics.linkedchemistry.info/2025/02/13/beiltein-journal-has-bioschemas.html">Beilstein Bioschemas feed</a>,
I was looking for names, IUPAC or not, ideally the name used in the article. When I asked about this,
the question came up if they could autogenerate IUPAC names, for which
<a href="https://doi.org/10.1038/s41598-021-94082-y">various</a>
<a href="https://doi.org/10.1186/s13321-021-00535-x">new</a>
<a href="https://doi.org/10.1186/s13321-021-00512-4">tools</a>
<a href="https://doi.org/10.1186/s13321-024-00941-x">exist</a>
(I think I am missing one from an American team, but cannot find the reference),
along with multiple established commercial tools.
Because the IUPAC nomenclature is a long list of naming rules, priorities, etc, a rule-based
algorithm is logical, but newer methods take a deep-learning approach.</p>

<p>Back to the chemical annotation of chemistry literature. This is of obvious interest: you want
to know where we can read more about a certain chemical. We need the chemical structures in
a database for that, linked to the articles. This is, of course, one of the original studies
of <em>cheminformatics</em>. But authors of the chemical literature do not provide this routinely
(<a href="https://chem-bla-ics.linkedchemistry.info/2025/02/13/beiltein-journal-has-bioschemas.html">this post</a>
shows a few exceptions, but it is still all too rare), so manual and automated curation
is needed, e.g. done by <a href="https://en.wikipedia.org/wiki/Chemical_Abstracts_Service">Chemical Abstracts</a>.</p>

<p>Third, <a href="https://wikidata.org/">Wikidata</a> has <a href="https://scholia.toolforge.org/chemical/">about 1.4 million</a>
chemical compounds and many names. A <a href="https://www.wikidata.org/wiki/Wikidata:Property_proposal/Pending#IUPAC_name">property proposal for IUPAC names</a>
has been long pending, but once accepted in one form or another, will require IUPAC names too.</p>

<h2 id="one-million-iupac-names">One million IUPAC names</h2>

<p>Thus, the idea came up, can we create a set of 1 million unique IUPAC names found in literature?
I asked on the <a href="https://elixir-europe.org/">ELIXIR Europe</a> slack channel if <a href="https://europepmc.org/">Europe PMC</a>
had such a dataset (doi:<a href="https://doi.org/10.1093/nar/gkad1085">10.1093/nar/gkad1085</a>). I knew they had been adding chemical
<a href="https://scholia.toolforge.org/topic/Q403574">named-entity recognition</a> (NER) results in
<a href="https://europepmc.org/Annotations">their annotation API</a>. I learned they used <a href="https://www.ebi.ac.uk/chebi/">ChEBI</a>.
Melanie Vollmar and Summer Rosonovski of Europe PMC gave useful information and support.
<a href="https://cpm.lumc.nl/research/bioinformatics-224/magnus-palmblad-5">Magnus Palmblad</a> also replied
and provided Python code to use the Europe PMC API to fetch names it returns and see if those
are IUPAC names. Well, that’s easy. We have <a href="https://opsin.ch.cam.ac.uk/">OPSIN</a> for that
(see doi:<a href="https://doi.org/10.1021/ci100384d">10.1021/ci100384d</a>).</p>

<p>Unfortunately, the Europe PMC NER results are not ideal for IUPAC names. Just scanning
some 5, 6 organic chemistry journals returned some 8 thousand IUPAC names in open access
articles. But it quickly started to be too limited: each set of articles returned
increasingly few new names. The reason is simple: the NER is too <em>greedy</em> and as a
result, does not easily recognize longer IUPAC names. It is too happy with a substring
of the IUPAC name. For example, when it encounters the IUPAC name <em>5-Bromo-1H-indole-3-carboxylic acid</em>,
it settles for <em>indole-3-carboxylic acid</em>:</p>

<p><img src="/assets/images/greedy.png" alt="" /></p>
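
<p>One way to spot such truncated hits is to look at the characters directly around the recognized span. The following is a heuristic sketch for illustration only, not the logic of the Europe PMC annotations or of any NER tool; the function name and the rule are assumptions:</p>

```python
def looks_truncated(text, start, end):
    """Guess whether text[start:end] is only a fragment of a longer name.

    Heuristic: a hit that is glued to a letter, digit, or hyphen on either
    side probably continues beyond the recognized span.
    """
    before = text[start - 1] if start > 0 else " "
    after = text[end] if end != len(text) else " "
    return any(ch.isalnum() or ch == "-" for ch in (before, after))


sentence = "We prepared 5-Bromo-1H-indole-3-carboxylic acid in two steps."
fragment = "indole-3-carboxylic acid"
start = sentence.index(fragment)
print(looks_truncated(sentence, start, start + len(fragment)))
```

<p>For the example above the hit is preceded by a hyphen, so it is flagged as a likely fragment of a longer name.</p>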

<h2 id="open-source-chemistry-analysis-routines">Open-Source Chemistry Analysis Routines</h2>

<p>During my PhD, in 2003, when I worked a few months with Prof. <a href="https://scholia.toolforge.org/author/Q908710">Peter Murray-Rust</a> (University of Cambridge)
and Prof. Janet Thornton (EMBL-EBI), I learned about the research by <a href="https://scholia.toolforge.org/author/Q28946549">Sam Adams</a>
(doi:<a href="https://doi.org/10.1039/B411699M">10.1039/B411699M</a>), <a href="https://scholia.toolforge.org/author/Q133040220">Joe Townsend</a>
(doi:<a href="https://doi.org/10.1039/B411033A">10.1039/B411033A</a>), and <a href="https://scholia.toolforge.org/author/Q90318722">Peter Corbett</a>
(doi:<a href="https://doi.org/10.1007/11875741_11">10.1007/11875741_11</a>). One of the tools that used
this research was (is) <a href="https://scholia.toolforge.org/topic/Q133037490">OSCAR</a>,
short for <em>Open-Source Chemistry Analysis Routines</em> (see <a href="https://blogs.ch.cam.ac.uk/pmr/2009/05/16/opsin-and-oscar-chemical-language-processing/">this detailed write up by Peter MR</a>).
Later, in 2010, I visited Peter again, as postdoc, in Cambridge, and then
<a href="https://chem-bla-ics.linkedchemistry.info/2010/10/15/working-on-oscar-for-three-months.html">worked on the OSCAR project</a> too.
And while OSCAR did a lot more, the integration of <a href="https://chem-bla-ics.linkedchemistry.info/2010/12/26/oscar-training-data-models-etc.html">Corbett’s NER research</a>
made OSCAR the obvious follow-up step in finding IUPAC names in literature.</p>

<p>And because <a href="https://chem-bla-ics.linkedchemistry.info/2011/09/27/almost-year-ago-i-started-position-with.html">OSCAR4 had been integrated into Bioclipse</a>
(doi:<a href="https://doi.org/10.1186/1758-2946-3-41">10.1186/1758-2946-3-41</a>) and I had this ported to Bacting already
(doi:<a href="https://doi.org/10.21105/joss.02558">10.21105/joss.02558</a>), using this was trivial.
The use of Europe PMC is different now, however: we are no longer using the Annotations API,
but only use Europe PMC to find open access articles and to get the full text in XML format.
That allows a simple XPath search on <code class="language-plaintext highlighter-rouge">&lt;p&gt;</code> elements; the resulting string is passed to OSCAR4,
and the recognized names are checked with OPSIN.
And with this approach, processing two of the five or six journals we earlier explored,
we find another 40+ thousand IUPAC names. Quite a success, I am tempted to say.</p>
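
<p>The paragraph extraction step can be sketched with the Python standard library. This is not the actual script: the sample XML is a tiny stand-in for a full-text document, and <code class="language-plaintext highlighter-rouge">find_names</code> and <code class="language-plaintext highlighter-rouge">is_valid</code> are placeholders for the OSCAR4 and OPSIN calls, which are not shown here:</p>

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for a full-text XML document (not real Europe PMC output).
SAMPLE = """<article><body>
<sec><title>Results</title>
<p>Compound <bold>1</bold> is 5-bromo-1H-indole-3-carboxylic acid.</p>
<p>It was recrystallised from ethanol.</p>
</sec></body></article>"""


def paragraphs(xml_text):
    """Return the plain text of every p element, with inline markup removed."""
    root = ET.fromstring(xml_text)
    return ["".join(p.itertext()).strip() for p in root.findall(".//p")]


def extract_names(xml_text, find_names, is_valid):
    """Run NER on each paragraph and keep the names that validate.

    find_names and is_valid are placeholders for the OSCAR4 and OPSIN steps.
    """
    names = set()
    for text in paragraphs(xml_text):
        names.update(n for n in find_names(text) if is_valid(n))
    return names


for text in paragraphs(SAMPLE):
    print(text)
```

<p>Joining <code class="language-plaintext highlighter-rouge">itertext()</code> matters here: names in articles are often interrupted by inline markup such as italics or bold compound numbers, and taking only the direct text of the element would cut them off.</p>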

<h2 id="a-blue-obelisk-project">A Blue Obelisk project</h2>

<p>So, I started a new <a href="https://blueobelisk.github.io/">Blue Obelisk</a> project,
<a href="https://github.com/BlueObelisk/iupac-names">iupac-names</a>, to collect 1M IUPAC names. For researchers
to use, learn from, etc. Just IUPAC names. Not even the chemical structure, nor the link to the
articles. The first is trivial to do with OPSIN, so the matching SMILES do not need to be stored.
Links to literature are tricky because of the aforementioned issues, and we only want to know
which (partial) IUPAC names occur in literature. If you really want to know in which articles
that IUPAC name is found, you can simply do a search in Europe PMC.</p>

<p>And because we only store IUPAC names, these are very basic facts (that this is an IUPAC name, as defined
by OPSIN being able to generate a SMILES for it, and that the string occurs in
some article), and we can share them as CCZero. We <a href="is:issue" title="milestone release">defined various milestones</a>,
and I am happy that the first two have been reached within two weeks:</p>

<ul>
  <li><a href="https://github.com/BlueObelisk/iupac-names/releases/tag/milestone-10k">Milestone 10k</a> (doi:<a href="https://doi.org/10.5281/zenodo.14965762">10.5281/zenodo.14965762</a>)</li>
  <li><a href="https://github.com/BlueObelisk/iupac-names/releases/tag/milestone-50k">Milestone 50k</a> (doi:<a href="https://doi.org/10.5281/zenodo.14978557">10.5281/zenodo.14978557</a>)</li>
</ul>

<p>This second milestone has 53848 unique names, but as literature goes, there are interesting
variations, some likely because of typesetting leading to spaces added and missing. If
we ignore spaces and hyphens, we have 50534 names left (hence the milestone). But IUPAC
names are also not fully unique, partly because of Unicode character variations and Greek
letter alternatives, and you may wonder how many different chemical structures this set
reflects. While not perfect, the Standard InChI gives some lower limit, and we find 36528
InChIKeys in this second milestone.</p>
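
<p>Counting names while ignoring spaces and hyphens can be sketched as follows. This is a simplified illustration with made-up input, not the script used for the milestone; it only strips whitespace and the ASCII hyphen and does nothing about Unicode dashes or Greek letter alternatives:</p>

```python
def normalise(name):
    """Key that ignores whitespace and ASCII hyphens."""
    return "".join(ch for ch in name if not ch.isspace() and ch != "-")


def unique_names(names):
    """Keep the first spelling seen for each normalised key."""
    seen = {}
    for name in names:
        seen.setdefault(normalise(name), name)
    return list(seen.values())


found = [
    "5-bromo-1H-indole-3-carboxylic acid",
    "5-bromo-1H-indole-3-carboxylicacid",    # missing space
    "5-bromo-1H- indole-3-carboxylic acid",  # extra space
    "2-chloropyridine",
]
print(len(found), "names,", len(unique_names(found)), "after normalisation")
```

<p>The key is only used for comparison; the spelling as found in the article is what gets kept.</p>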

<p>Now, we need twenty times as much to reach the 1M IUPAC names, but we have many, many
more open access articles to process. The bottleneck seems to be mostly our workflow.</p>

<h3 id="can-you-contribute">Can you contribute?</h3>

<p>Yes, of course! This is an open science project. But please keep in mind the narrow focus of this
project: only IUPAC names which can be found in (open access) literature. This project does not accept
autogenerated names (PubChem would have given us many millions already), nor IUPAC names from existing
databases. Ideally, you are able to show the code you use to extract/find those names in literature.</p>

<h3 id="can-i-use-these-names">Can I use these names?</h3>

<p>First of all, this is what the CCZero license and open science nature of this project is about: reuse.
We love to hear how you are using these names, tho, and we encourage you to write up how you
are using them. You can use <a href="https://datacite.org/">DataCite</a> to cite the release you used,
and citing this blog post by DOI is also possible.</p>

<h3 id="does-it-support-my-language-too">Does it support my language too?</h3>

<p>No, at this moment it only supports IUPAC names in English. Dutch, French, Spanish, or Chinese
IUPAC names are valid, but currently not supported. See also
<a href="https://chem-bla-ics.linkedchemistry.info/2010/12/30/text-mining-chemistry-from-dutch-or.html">this post</a>.</p>

<h3 id="will-there-be-a-publication">Will there be a publication?</h3>

<p>Magnus and I intend so. We already submitted an abstract to the <a href="https://iccs-nl.org/">International Conference on Chemical Structures</a>,
which has <a href="https://www.biomedcentral.com/collections/ICCS25">a Collection in the Journal of Cheminformatics</a>.
If the abstract gets accepted, of course, we can submit there. Otherwise, we will look for another venue,
likely <a href="https://en.wikipedia.org/wiki/Diamond_open_access">diamond open access</a>.</p>

<h3 id="where-is-your-script">Where is your script?</h3>

<p>Ah, fair point. We did not decide on the final license yet. I have used two scripts based on the template
by Magnus. As soon as we have finalized the license, we will make those available.</p>]]></content><author><name>Egon Willighagen</name></author><category term="iupac" /><category term="cheminf" /><category term="justdoi:10.1038/s41598-021-94082-y" /><category term="justdoi:10.1186/s13321-021-00512-4" /><category term="justdoi:10.1186/s13321-021-00535-x" /><category term="justdoi:10.1186/s13321-024-00941-x" /><category term="justdoi:10.1021/ci100384d" /><category term="oscar" /><category term="justdoi:10.1039/B411699M" /><category term="justdoi:10.1039/B411033A" /><category term="justdoi:10.1007/11875741_11" /><category term="textmining" /><category term="cito:usesMethodIn:10.1186/1758-2946-3-41" /><category term="cito:usesMethodIn:10.21105/JOSS.02558" /><category term="cito:usesMethodIn:10.1093/nar/gkad1085" /><category term="cito:citesAsEvidence:10.5281/zenodo.14965762" /><category term="cito:citesAsEvidence:10.5281/zenodo.14978557" /><category term="europepmc" /><summary type="html"><![CDATA[Names of chemicals are part of the human user experience when browsing a chemical database. And literature too, of course. Chemical names are also not easy to use, and what a chemical name means is not always clear. This is why the IUPAC started a standardizing nomenclature in chemistry, the IUPAC names. Each IUPAC name uniquely defines the chemical structure it defines. 
For example, methane is the IUPAC name for the chemical CH4.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/greedy.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/greedy.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Oscar4 paper: text mining in Bioclipse (and everywhere else, of course)</title><link href="https://chem-bla-ics.linkedchemistry.info/2011/11/01/oscar4-paper-text-mining-in-bioclipse.html" rel="alternate" type="text/html" title="Oscar4 paper: text mining in Bioclipse (and everywhere else, of course)" /><published>2011-11-01T00:00:00+00:00</published><updated>2011-11-01T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2011/11/01/oscar4-paper-text-mining-in-bioclipse</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2011/11/01/oscar4-paper-text-mining-in-bioclipse.html"><![CDATA[<p>The <a href="http://www.jcheminf.com/content/3/1/41">Oscar4 paper</a> (CC-BY, just like the screenshots of the paper below) was out already some days now, but the formatting has finished:</p>

<p><img src="/assets/images/oscar4Paper.png" alt="" /></p>

<p>I spotted a rogue <code class="language-plaintext highlighter-rouge">http://</code> in the code example b) in <a href="http://www.jcheminf.com/content/3/1/41#IDAE2JBD">Appendix B</a>:</p>

<p><img src="/assets/images/oscar4Paper2.png" alt="" /></p>

<p>I’ll see what I can do about that, but the API might evolve a bit anyway.</p>

<p>That leaves me to mention that <a href="http://chem-bla-ics.blogspot.com/2011/09/almost-year-ago-i-started-position-with.html">Bioclipse has an Oscar extension</a>
(<a href="http://www.bioclipse.net/">Bioclipse</a> has a lot of functionality nowadays, in fact),
and that I <a href="http://chem-bla-ics.blogspot.com/2010/12/text-mining-chemistry-from-dutch-or.html">blogged several times on Oscar4</a>
when I was working with the other authors on the refactoring last year.</p>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="bioclipse" /><category term="myexperiment" /><category term="doi:10.1186/1758-2946-3-41" /><summary type="html"><![CDATA[The Oscar4 paper (CC-BY, just like the screenshots of the paper below) was out already some days now, but the formatting has finished:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/oscar4Paper.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/oscar4Paper.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Bioclipse-Oscar4 - Text mining in Bioclipse</title><link href="https://chem-bla-ics.linkedchemistry.info/2011/09/27/almost-year-ago-i-started-position-with.html" rel="alternate" type="text/html" title="Bioclipse-Oscar4 - Text mining in Bioclipse" /><published>2011-09-27T00:00:00+00:00</published><updated>2011-09-27T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2011/09/27/almost-year-ago-i-started-position-with</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2011/09/27/almost-year-ago-i-started-position-with.html"><![CDATA[<p>Almost a year ago I <a href="https://chem-bla-ics.linkedchemistry.info/2010/10/15/working-on-oscar-for-three-months.html">started a position <i class="fa-solid fa-recycle fa-xs"></i></a>
with <a href="http://blogs.ch.cam.ac.uk/pmr/">Peter Murray-Rust</a> to work on Oscar for three months (see this overview of results;
a paper by the full Oscar team (Sam, David, Dan, Lezan) is pending, and I’m really happy to have been able to contribute
bits to the project). Since then, I have had little time :( That’s how it goes, with post-hopping, unfortunately.
One thing I did do after that, was write a <a href="https://github.com/bioclipse/bioclipse.oscar">Bioclipse plugin</a>.</p>

<p>I was asked recently via <a href="http://www.linkedin.com/in/egonw">LinkedIn</a> if I was planning a Bioclipse-Oscar plugin, and
I realized that I forgot to blog about it. So, here goes. The <code class="language-plaintext highlighter-rouge">oscar</code> manager I implemented follows the
<a href="https://chem-bla-ics.linkedchemistry.info/2010/10/28/oscar4-java-api-chemical-name.html">Oscar API <i class="fa-solid fa-recycle fa-xs"></i></a>, and these
methods are available: <code class="language-plaintext highlighter-rouge">extractText()</code>, <code class="language-plaintext highlighter-rouge">findNamedEntities()</code>,  <code class="language-plaintext highlighter-rouge">findResolvedNamedEntities()</code>.</p>

<p>When I wrote the plugin, I also uploaded an <a href="http://www.myexperiment.org/workflows/2117.html">example workflow to MyExperiment</a>.
The code is:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Demo showing the Oscar text mining functionality</span>
<span class="c1">// in Bioclipse</span>
<span class="kd">var</span> <span class="nx">html</span> <span class="o">=</span> <span class="nx">bioclipse</span><span class="p">.</span><span class="nf">download</span><span class="p">(</span>
  <span class="dl">"</span><span class="s2">http://dx.doi.org/10.3762/bjoc.6.133</span><span class="dl">"</span><span class="p">,</span>
  <span class="dl">"</span><span class="s2">text/html</span><span class="dl">"</span>
<span class="p">)</span>
<span class="kd">var</span> <span class="nx">text</span> <span class="o">=</span> <span class="nx">oscar</span><span class="p">.</span><span class="nf">extractText</span><span class="p">(</span><span class="nx">html</span><span class="p">);</span>
<span class="c1">// the next step may take some time, while</span>
<span class="c1">// initializing the Oscar software for the</span>
<span class="c1">// first time</span>
<span class="kd">var</span> <span class="nx">mols</span> <span class="o">=</span> <span class="nx">oscar</span><span class="p">.</span><span class="nf">findResolvedNamedEntities</span><span class="p">(</span><span class="nx">text</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">file</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">/Oscar Demo/extractedMols.sdf</span><span class="dl">"</span><span class="p">;</span>
<span class="nx">cdk</span><span class="p">.</span><span class="nf">saveSDFile</span><span class="p">(</span><span class="nx">file</span><span class="p">,</span> <span class="nx">mols</span><span class="p">);</span>
<span class="nx">ui</span><span class="p">.</span><span class="nf">open</span><span class="p">(</span><span class="nx">file</span><span class="p">);</span>
</code></pre></div></div>

<p>The code will extract chemical entities, and open a molecules table in <a href="http://www.bioclipse.net/">Bioclipse</a>:</p>

<p><img src="/assets/images/oscarDemo2.png" alt="" /></p>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="bioclipse" /><category term="beilstein" /><summary type="html"><![CDATA[Almost a year ago I started a position with Peter Murray-Rust to work on Oscar for three months (see this overview of results; a paper by the full Oscar team (Sam, David, Dan, Lezan) is pending, and I’m really happy to have been able to contribute bits to the project). Since then, I have had little time :( That’s how it goes, with post-hopping, unfortunately. One thing I did do after that, was write a Bioclipse plugin.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/oscarDemo2.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/oscarDemo2.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Text mining chemistry from Dutch or Swedish texts</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/12/30/text-mining-chemistry-from-dutch-or.html" rel="alternate" type="text/html" title="Text mining chemistry from Dutch or Swedish texts" /><published>2010-12-30T00:00:00+00:00</published><updated>2010-12-30T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/12/30/text-mining-chemistry-from-dutch-or</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/12/30/text-mining-chemistry-from-dutch-or.html"><![CDATA[<p><a href="http://oscar3-chem.sf.net/">Oscar</a> is a text miner. It mines in text for chemistry.
<a href="https://bitbucket.org/wwmm/oscar4/">Oscar4</a> is the next iteration of Oscar
code that I worked on in the past three months, with Lezan, Sam, and David. I blogged about
aspects of Oscar4 at several occasions:</p>

<ul>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/10/15/working-on-oscar-for-three-months.html">Working on Oscar for three months <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/10/21/oscar-text-mining-in-taverna.html">Oscar text mining in Taverna <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/10/26/multiple-unit-test-inheritance-with.html">Multiple unit test inheritance with JExample <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/10/28/oscar4-java-api-chemical-name.html">Oscar4 Java API: chemical name dictionaries <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/11/18/oscar4-command-line-utilities.html">Oscar4 command line utilities <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="http://chem-bla-ics.blogspot.com/2010/11/installing-oscar.html">Installing Oscar</a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/11/29/adding-new-dictionary-to-oscar.html">Adding a new dictionary to Oscar <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with.html">Status update on BJOC analysis with Oscar and ChemicalTagger <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with_11.html">Status update on BJOC analysis with Oscar and ChemicalTagger #2 <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/supramolecular-chemistry.html">Supramolecular chemistry <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/12/23/status-update-on-bjoc-analysis-with_23.html">Status update on BJOC analysis with Oscar and ChemicalTagger #3 <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li><a href="https://chem-bla-ics.linkedchemistry.info/2010/12/26/oscar-training-data-models-etc.html">Oscar: training data, models, etc <i class="fa-solid fa-recycle fa-xs"></i></a></li>
</ul>

<p>These posts will serve as some initial critical mass for a draft report I plan to finish
today. I might have to blog some further posts with diagrams, here and there. This post is
actually one of them, and discusses something where Oscar can be expected to go next, now
that the design is cleaned up (though this effort is not halted now) and it has become
possible again to extend it. The over <a href="https://hudson.ch.cam.ac.uk/job/oscar4/lastBuild/testReport/">250 unit tests</a>
make this a lot easier too.</p>

<p>One direction where I expect Oscar to go in 2011 is support for other languages. To a very
large extent this depends on multi-language support in the dictionaries, as well as on having
training data in a particular language. This also provides some context to my earlier post
about the <a href="https://chem-bla-ics.linkedchemistry.info/2010/12/26/oscar-training-data-models-etc.html">need for an Oscar training data repository <i class="fa-solid fa-recycle fa-xs"></i></a>.</p>

<p>This extension opens a number of options: analysis of patent literature in other languages,
monitoring of press releases in other languages, and news items in local newspapers, etc.
For example, it could analyse <a href="http://www.c2w.nl/energierijke-gistcel.119621.lynkx">this C2W news item</a>
on <a href="http://en.wikipedia.org/wiki/Yeast">yeast</a> cells:</p>

<p><img src="/assets/images/c2w.png" alt="" /></p>

<p>There are many use cases for such localized text mining. And it surely matters for determining
the impact of research.</p>

<p>Oscar has various places where language specifics are found, for example in the tokenization of a
text. One step here is the detection of sentence ends. In most western languages a sentence ends
with a period, exclamation mark, question mark, etc. But periods (dots) are also used in
abbreviations. Similarly, colons can be used in chemical names. And every language comes
with its own abbreviations that need to be recognized.</p>
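<p>To make the abbreviation problem concrete, here is a minimal sketch of sentence-end detection with a list of non-sentence-endings. This is not Oscar's actual tokenizer: the class name, the method, and both abbreviation lists are made up for illustration.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not the Oscar4 tokenizer.
final class SentenceSplitterSketch {

    // Hypothetical abbreviation list; a real list would be language specific.
    static final Set<String> ENGLISH_ABBREVIATIONS = Set.of("Prof.", "Dr.", "e.g.", "i.e.");

    // Splits on '.', '!' and '?' when followed by whitespace or end of text,
    // unless the token ending in that character is a known abbreviation.
    static List<String> split(String text, Set<String> abbreviations) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean terminator = c == '.' || c == '!' || c == '?';
            boolean atBoundary = i + 1 == text.length()
                    || Character.isWhitespace(text.charAt(i + 1));
            if (!terminator || !atBoundary) {
                continue; // e.g. the dot in "2.5" is not followed by whitespace
            }
            // Walk back to the start of the whitespace-delimited token.
            int tokenStart = i;
            while (tokenStart > start && !Character.isWhitespace(text.charAt(tokenStart - 1))) {
                tokenStart--;
            }
            String token = text.substring(tokenStart, i + 1);
            if (abbreviations.contains(token)) {
                continue; // a non-sentence-ending such as "Prof."
            }
            sentences.add(text.substring(start, i + 1).trim());
            start = i + 1;
        }
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(split("Prof. Smith added 2.5 g of NaCl. It dissolved.", ENGLISH_ABBREVIATIONS));
        // The Dutch abbreviation "bijv." is only handled when it is in the list.
        System.out.println(split("Zie bijv. de gistcel.", Set.of("bijv.")));
        System.out.println(split("Zie bijv. de gistcel.", ENGLISH_ABBREVIATIONS));
    }
}
```

<p>With the English list, the Dutch sentence is cut in two after <em>bijv.</em>, which is exactly the kind of language-specific failure described above.</p>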

<p>Currently, some abbreviations are found in <a href="https://bitbucket.org/wwmm/oscar4/src/005ffa00a69d/oscar4-core/src/main/java/uk/ac/cam/ch/wwmm/oscar/document/NonSentenceEndings.java">NonSentenceEndings</a>.
In the past three months, we have been cleaning up and restructuring the source code,
making it easier to detect such places. This class will likely undergo further refactoring, to
make the list of such non-sentence-endings configurable via files or so. What I expect to see
is that you instantiate Oscar like this:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Oscar</span> <span class="n">oscar</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Oscar</span><span class="o">(</span><span class="nc">Locale</span><span class="o">.</span><span class="na">US</span><span class="o">);</span>
</code></pre></div></div>
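<p>How such a locale would be used internally is still open. One possible shape, purely as a sketch, is a lookup from the language of the <code class="language-plaintext highlighter-rouge">Locale</code> to a list of non-sentence-endings, with a fallback to English. The class name, method name, and lists below are hypothetical and not part of Oscar; in practice the lists could be loaded from files.</p>

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; none of these names exist in Oscar4.
final class LocaleAbbreviationsSketch {

    // Hypothetical per-language lists, keyed by ISO 639 language code.
    static final Map<String, Set<String>> BY_LANGUAGE = Map.of(
            "en", Set.of("Prof.", "Dr.", "e.g.", "i.e."),
            "nl", Set.of("prof.", "dr.", "bijv.", "o.a."));

    // Returns the list for the locale's language, falling back to English.
    static Set<String> nonSentenceEndings(Locale locale) {
        return BY_LANGUAGE.getOrDefault(locale.getLanguage(), BY_LANGUAGE.get("en"));
    }

    public static void main(String[] args) {
        System.out.println(nonSentenceEndings(Locale.US));
        System.out.println(nonSentenceEndings(Locale.forLanguageTag("nl-NL")));
        // No French list is defined here, so this falls back to English.
        System.out.println(nonSentenceEndings(Locale.FRANCE));
    }
}
```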

<p>This might actually even make a nice student summer project. The biggest challenge will be in making a good
corpus of training data, like the SciBorg training data that was used for training Oscar3.</p>

<p>But the whole normalization is tainted with English language specifics too. For example, the normalizer
will have to ‘normalize’ the question marks, for which there exist several
<a href="http://en.wikipedia.org/wiki/Question_mark#Stylistic_variants">Unicode variations</a>.
But the normalized variant is language dependent. For example, Greek and Armenian have different characters
(see <a href="http://en.wikipedia.org/wiki/Question_mark#Opening_and_closing_question_marks">this page</a>),
and then we have not even started talking about right-to-left scripts.</p>
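<p>As a rough sketch of what language-dependent normalization could look like (this is not Oscar's normalizer, and the class and method names are made up): the fullwidth and small question mark variants are always mapped to <code class="language-plaintext highlighter-rouge">?</code>, while the Greek question mark, which looks like a semicolon, is only mapped for Greek text. Other languages, like Armenian, would need their own rules.</p>

```java
import java.util.Locale;

// Illustrative sketch only; not the Oscar4 normalizer.
final class QuestionMarkNormalizerSketch {

    static String normalize(String text, Locale locale) {
        boolean greek = "el".equals(locale.getLanguage());
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // U+FF1F FULLWIDTH QUESTION MARK and U+FE56 SMALL QUESTION MARK
            boolean variant = c == '\uFF1F' || c == '\uFE56';
            // U+037E GREEK QUESTION MARK, or a plain semicolon in Greek text
            boolean greekMark = greek && (c == '\u037E' || c == ';');
            out.append(variant || greekMark ? '?' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Why\uFF1F", Locale.US));
        // A semicolon stays a semicolon in English text.
        System.out.println(normalize("acid; base", Locale.US));
        // In Greek text the same character marks a question.
        System.out.println(normalize("\u03A4\u03B9;", Locale.forLanguageTag("el")));
    }
}
```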

<p>Besides localized dictionaries, Oscar will also benefit from a localized <a href="http://opsin.ch.cam.ac.uk/">OPSIN</a>.
It seems to recognize the Dutch <a href="https://opsin.ch.cam.ac.uk/opsin/propaan.png">propaan</a>, but not
<a href="https://opsin.ch.cam.ac.uk/opsin/benzeen.png">benzeen</a>. I am not going to look at that soon, but if you are
interested, I recommend checking out Rich’s
<a href="https://doi.org/10.59350/bbrwt-e5n35">posts <i class="fa-solid fa-recycle fa-xs"></i></a>
<a href="https://doi.org/10.59350/vtadn-tdt17">about <i class="fa-solid fa-recycle fa-xs"></i></a>
<a href="https://doi.org/10.59350/nbtxd-kdz73">forking <i class="fa-solid fa-recycle fa-xs"></i></a>
OPSIN and writing patches.</p>

<p>Getting Oscar going for other languages is a challenge, but also offers new opportunities. Just email the
<a href="http://sourceforge.net/mailarchive/forum.php?forum_name=oscar3-chem-developers">oscar mailing list</a>
if you are interested and need help.</p>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="textmining" /><category term="justdoi:10.59350/vtadn-tdt17" /><category term="justdoi:10.59350/nbtxd-kdz73" /><category term="justdoi:10.59350/bbrwt-e5n35" /><summary type="html"><![CDATA[Oscar is a text miner. It mines in text for chemistry. Oscar4 is the next iteration of Oscar code that I worked on in the past three months, with Lezan, Sam, and David. I blogged about aspects of Oscar4 at several occasions:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/c2w.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/c2w.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Oscar: training data, models, etc</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/12/26/oscar-training-data-models-etc.html" rel="alternate" type="text/html" title="Oscar: training data, models, etc" /><published>2010-12-26T00:00:00+00:00</published><updated>2010-12-26T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/12/26/oscar-training-data-models-etc</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/12/26/oscar-training-data-models-etc.html"><![CDATA[<p><a href="https://sourceforge.net/projects/oscar3-chem/">Oscar</a> uses a Maximum Entropy Markov Model (MEMM) based on <a href="http://en.wikipedia.org/wiki/N-gram">n-grams</a>.
Peter Corbett has written this up (doi:<a href="https://doi.org/10.1186/1471-2105-9-S11-S4">10.1186/1471-2105-9-S11-S4</a>). So, it basically is statistics
once more. If you really want a proper bioinformatics education, do your PhD at a (proteo)chemometrics department.</p>

<p>N-grams are word parts of n characters. For example, the trigrams of <a href="http://en.wikipedia.org/wiki/Acetic_acid">acetic acid</a>
include <code class="language-plaintext highlighter-rouge">ace</code>, <code class="language-plaintext highlighter-rouge">cid</code>, <code class="language-plaintext highlighter-rouge">tic</code>, <code class="language-plaintext highlighter-rouge">eti</code>, and <code class="language-plaintext highlighter-rouge">aci</code>. N-grams of length four include acid, etic, and acet. The MEMM assigns weights to
these n-grams, and based on that decides if something is indeed a <em>named entity</em> (in Oscar terminology). For example,
consider the <code class="language-plaintext highlighter-rouge">acet</code> n-gram: acetone should be matched, but the word <code class="language-plaintext highlighter-rouge">facet</code> should not.</p>
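<p>Extracting the character n-grams of a word takes only a few lines. The following is a minimal sketch, with a made-up class name, and is not the code Oscar uses:</p>

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the Oscar4 n-gram code.
final class NGramSketch {

    // Returns all substrings of length n, from left to right.
    static List<String> ngrams(String word, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            result.add(word.substring(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("acetic", 3)); // prints [ace, cet, eti, tic]
        // Both words share the 4-gram "acet", so that n-gram alone cannot decide.
        System.out.println(ngrams("acetone", 4));
        System.out.println(ngrams("facet", 4));
    }
}
```

<p>Because <em>acetone</em> and <em>facet</em> share the same 4-gram, it is the combination of weighted n-grams that has to make the difference.</p>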

<p>Put this in the perspective of the ongoing refactoring of the Oscar software. We are changing normalization (e.g. converting
all Unicode hyphen alternatives into one specific hyphen) and updating the tokenizer (e.g. changing the list of
non-sentence-endings like <em>Prof.</em>). It is clear this changes the n-grams typical for chemical-like things. Worse,
the weights are tuned towards known n-grams, and statistical models are generally a bit overtrained for the
data, or, at least, specific for it.</p>
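<p>The hyphen normalization mentioned above can be sketched as follows. Which characters Oscar actually folds together is not listed here, so the set below is my assumption, and the class name is made up:</p>

```java
// Illustrative sketch only; the character set is an assumption.
final class HyphenNormalizerSketch {

    // Maps U+2010 HYPHEN, U+2011 NON-BREAKING HYPHEN, U+2012 FIGURE DASH,
    // U+2013 EN DASH and U+2212 MINUS SIGN to the ASCII hyphen-minus.
    static String normalizeHyphens(String text) {
        return text.replaceAll("[\u2010\u2011\u2012\u2013\u2212]", "-");
    }

    public static void main(String[] args) {
        // Before normalization these two spellings produce different n-grams.
        String withUnicodeHyphen = "1,2\u2010dichloroethane";
        String withAsciiHyphen = "1,2-dichloroethane";
        System.out.println(withUnicodeHyphen.equals(withAsciiHyphen));
        System.out.println(normalizeHyphens(withUnicodeHyphen).equals(withAsciiHyphen));
    }
}
```

<p>Any change to such a mapping changes which n-grams the model sees, which is why the weights need retraining afterwards.</p>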

<p>Now, if the distribution of n-grams changes, the weights in the model need to be updated too, to not degrade
the model performance. So, Oscar is useless if we cannot retrain its MEMM component after a refactoring. If
that would be impossible, we would have effectively created an <em>intellectual monopoly</em>.</p>

<p>Thus, what the Oscar project needs, is one or more free sets of annotated literature, which can be used to
train new MEMM models. The SciBorg corpus was used to train the current Oscar3 and Oscar4 models. This data
(copyright <a href="http://rsc.org/">RSC</a>) will very likely be available under a <a href="http://creativecommons.org/licenses/">Creative Commons</a>
license (RSC++), but may have the NC clause, which would not be good for developing a business model around
the opensource Oscar (such as providing a high-performance web service via a subscription service). I have
recently written up <a href="http://chem-bla-ics.blogspot.com/2010/12/re-why-i-and-you-should-avoid-nc.html">the problems the NC clause introduces</a>,
and some <a href="http://chem-bla-ics.blogspot.com/2010/12/blog-post.html">examples of commercial Open Source cheminformatics projects</a>.</p>

<p>We need not focus only on this SciBorg data, however. In fact, we will need multiple models anyway. For
example, the SciBorg papers (42, if I am not mistaken) cover a particular kind of literature. So, a model trained
on them introduces the risk of being used to analyse papers outside its application domain. Furthermore, I am very
interested (and others indicated so too) in using Oscar for other languages. Surely, English is the major
language, but there are many use cases for an Oscar that works for other languages.</p>

<p>Therefore, what we need in the Oscar project is a registry of training (/test) data, annotated itself
with metadata around how that data was created (what quality assurance, what kind of named entity types,
how many domain experts were involved, etc), test results for those data sets, etc. My time on the Oscar
project is almost over, and I have no clue when I will be able to invest the same amount of time into the
project as I did in the past three months. But the creation of this registry is a clear step that must be
taken in the Oscar4 development.</p>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="textmining" /><category term="justdoi:10.1186/1471-2105-9-S11-S4" /><category term="inchikey:QTBSBXVTEAMEQO-UHFFFAOYSA-N" /><category term="inchikey:CSCPPACGZOOCGX-UHFFFAOYSA-N" /><summary type="html"><![CDATA[Oscar uses a Maximum Entropy Markov Model (MEMM) based on n-grams. Peter Corbett has written this up (doi:10.1186/1471-2105-9-S11-S4). So, it basically is statistics once more. If you really want a proper bioinformatics education, so do your PhD at a (proteo)chemometrics department.]]></summary></entry><entry><title type="html">Status update on BJOC analysis with Oscar and ChemicalTagger #3</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/12/23/status-update-on-bjoc-analysis-with_23.html" rel="alternate" type="text/html" title="Status update on BJOC analysis with Oscar and ChemicalTagger #3" /><published>2010-12-23T00:00:00+00:00</published><updated>2010-12-23T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/12/23/status-update-on-bjoc-analysis-with_23</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/12/23/status-update-on-bjoc-analysis-with_23.html"><![CDATA[<p>The <a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with_11.html">two <i class="fa-solid fa-recycle fa-xs"></i></a>
<a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with.html">earlier <i class="fa-solid fa-recycle fa-xs"></i></a> posts
in this series showed screenshots of results of Oscar, but the title also promised results by Lezan’s
<a href="http://www-ucc.ch.cam.ac.uk/products/software/chemicaltagger">ChemicalTagger</a>. Sam
helped with getting the HTML pages online via the Cambridge Hudson installation. Where
Oscar finds named entities (chemical compounds, processes, etc), ChemicalTagger finds
roles, like solvent, acid, base, catalyst. Roles are properties of chemical compounds
in certain situations. Ethanol is not always a solvent, sometimes it is a Xmas present.
The current output is not entirely where I want to go yet, but makes it easy to see which
solvents are frequently found in the BJOC corpus:</p>

<p><img src="/assets/images/chemtag1.png" alt="" /></p>

<p>This screenshot of an analysis of 15 BJOC papers shows that AcOEt (is that the
<a href="http://lab.chempedia.com/questions/427/are-etoac-and-acoet-the-same">same as EtOAc?</a>)
is mentioned as solvent three times in <a href="http://www.ncbi.nlm.nih.gov/sites/ppmc/articles/PMC1399459">PMC1399459</a>.
Brine, however, is mentioned as solvent in three papers.</p>

<p>As said, these <a href="https://hudson.ch.cam.ac.uk/job/oscar4-chebi/ws/target/output/bjoc.html">two</a>
<a href="https://hudson.ch.cam.ac.uk/job/oscar4-chebi/ws/target/output/roles.html">pages</a> contain
RDF and the tables are sortable. Hudson recompiles them automatically when I update the
source code to create the HTML+RDFa. So, go ahead, send me bug reports, feature requests,
and patches!</p>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="chemicaltagger" /><category term="beilstein" /><category term="justdoi:10.1186/1860-5397-1-11" /><summary type="html"><![CDATA[The two earlier posts in this series showed screenshots of results of Oscar, but the title also promised results by Lezan’s ChemicalTagger. Sam helped with getting the HTML pages online via the Cambridge Hudson installation. Where Oscar find named entities (chemical compounds, processes, etc), ChemicalTagger finds roles, like solvent, acid, base, catalyst. Roles are properties of chemical compounds in certain situations. Ethanol is not always a solvent, sometimes it is a Xmas present. The current output is not entirely where I want to go yet, but makes it easy which solvents are frequently found in the BJOC corpus:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/chemtag1.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/chemtag1.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Supramolecular chemistry</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/supramolecular-chemistry.html" rel="alternate" type="text/html" title="Supramolecular chemistry" /><published>2010-12-11T00:20:00+00:00</published><updated>2010-12-11T00:20:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/12/11/supramolecular-chemistry</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/12/11/supramolecular-chemistry.html"><![CDATA[<p>Some smart software developer once said to not optimize your code too early. However, not caring about it at all does not help either.
Some basic knowledge of memory management can keep you going. That is, I just ran into the limits of <a href="http://oscar3-chem.sf.net/">Oscar</a>
and ChemicalTagger. As I blogged earlier today, I am <a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with.html">analyzing the BJOC literature <i class="fa-solid fa-recycle fa-xs"></i></a>,
but Lezan and I are running into a reproducible out-of-memory exception. At first I thought it was a memory leak, as it was the 95th
paper it fell over on, but after we optimized our code a bit, by reusing classes, the problem remained and turned out to be not in
recreating objects (though the code is significantly faster now), but in a single BJOC paper being too large.</p>

<p>The particular paper is not even ridiculously large, though it has an amazing 800 references! The paper, <em>Molecular recognition of
organic ammonium ions in solution using synthetic receptors</em> (doi:<a href="https://doi.org/10.3762/bjoc.6.32">10.3762/bjoc.6.32</a>), is in
fact an interesting review paper on supramolecular chemistry. The molecules I worked on (see one below) in my own supramolecular chemistry
time (doing an M.Sc. minor (a six-month practical) with Peter Buijnsters in organic chemistry in the
<a href="http://www.molchem.science.ru.nl/about.php">group of Prof. Nolte</a>), are actually of the type they review, though surfactants are
not particularly covered in it.</p>

<p><img src="/assets/images/surfactant2002.png" alt="" /></p>

<p>Yeah, supramolecular chemistry has this nice level of complexity; it is so supramolecular, that it is currently outside the scope
of the molecular analysis of Oscar and ChemicalTagger ;)</p>

<ul>
  <li>Buijnsters, P. J. J. A.; García-Rodríguez, C. L.; Willighagen, E. L.; Sommerdijk, N. A. J. M.; Kremer, A.; Camilleri, P.; Feiters, M. C.;
Nolte, R. J. M.; Zwanenburg, B. (2002). Cationic Gemini Surfactants Based on Tartaric Acid: Synthesis, Aggregation, Monolayer Behaviour,
and Interaction with DNA. European Journal of Organic Chemistry, 2002 (8), 1397-1406.
DOI:<a href="https://doi.org/csqhbn">10.1002/1099-0690(200204)2002:8&lt;1397::AID-EJOC1397&gt;3.0.CO;2-6</a></li>
</ul>]]></content><author><name>Egon Willighagen</name></author><category term="oscar" /><category term="beilstein" /><category term="chemistry" /><category term="justdoi:10.3762/bjoc.6.32" /><category term="inchikey:ZXBXTKLPJKVXBD-KKLWWLSJSA-N" /><summary type="html"><![CDATA[Some smart software developer once said to not optimize your code too early. However, not caring about it at all does not help either. Some basic knowledge of memory management can keep you going. That is, I just ran into the limits of Oscar and ChemicalTagger. As I blogged earlier today, I am analyzing the BJOC literature , but Lezan and I are running into a reproducible out-of-memory exception. At first I thought it was a memory leak, as it was the 95th paper if fell over on, but after we optimized our code a bit, by reusing classes, the problem remained and turned out to be not in recreating objects (though the code is significantly faster now), but in a single BJOC paper being too large.]]></summary></entry><entry><title type="html">Status update on BJOC analysis with Oscar and ChemicalTagger #2</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with_11.html" rel="alternate" type="text/html" title="Status update on BJOC analysis with Oscar and ChemicalTagger #2" /><published>2010-12-11T00:10:00+00:00</published><updated>2010-12-11T00:10:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with_11</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with_11.html"><![CDATA[<p><img src="/assets/images/bjoc1.png" alt="" /></p>

<p>A quick update on the <a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with.html">post of this morning <i class="fa-solid fa-recycle fa-xs"></i></a>. The above screenshot
shows the progress of the reporting of text mining results using <a href="http://oscar3-chem.sf.net/">Oscar</a> on the <a href="http://www.beilstein-journals.org/bjoc/">BJOC literature</a>.
I think I am almost ready to analyze the full corpus, with a blacklist put in place for <a href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/supramolecular-chemistry.html">large papers <i class="fa-solid fa-recycle fa-xs"></i></a>.
What you see is the same kind of JQuery-enabled sortable list in the HTML view, and a <a href="http://en.wikipedia.org/wiki/SPARQL">SPARQL query</a>
in <a href="http://chem-bla-ics.blogspot.com/2010/07/rdfadev-htmlrdfa-development-with.html">RDFaDev</a>, to list all papers that mention
<a href="http://www.dhmo.org/facts.html">DHMO</a> (in the first 10 of all 350 BJOC papers) by its <a href="http://en.wikipedia.org/wiki/International_Chemical_Identifier">InChI</a>.</p>

<p>Importantly, IMHO, it is using the <a href="http://code.google.com/p/semanticchemistry/">CHEMINF ontology</a>.</p>]]></content><author><name>Egon Willighagen</name></author><category term="beilstein" /><category term="oscar" /><category term="inchi" /><summary type="html"><![CDATA[]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/bjoc1.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/bjoc1.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Status update on BJOC analysis with Oscar and ChemicalTagger</title><link href="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with.html" rel="alternate" type="text/html" title="Status update on BJOC analysis with Oscar and ChemicalTagger" /><published>2010-12-11T00:00:00+00:00</published><updated>2010-12-11T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2010/12/11/status-update-on-bjoc-analysis-with.html"><![CDATA[<p><img src="/assets/images/bjoc.png" alt="" /></p>

<p>This screenshot shows the current status of the <a href="http://oscar3-chem.sf.net/">Oscar</a> analysis results of the
<a href="http://www.beilstein-journals.org/bjoc/">BJOC literature</a>. The results are logged as an HTML+RDFa page, as I explained before in
<a href="http://chem-bla-ics.blogspot.com/2010/07/scripts-logs-as-htmlrdfa-mix-free-text.html">Scripts logs as HTML+RDFa: mix free text reporting with CSV</a>.
The page is interactive, using <a href="http://jquery.com/">JQuery</a> goodies to allow table sorting.</p>]]></content><author><name>Egon Willighagen</name></author><category term="beilstein" /><category term="oscar" /><summary type="html"><![CDATA[]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/bjoc.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/bjoc.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>