<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://chem-bla-ics.linkedchemistry.info/feed/by_tag/curation.xml" rel="self" type="application/atom+xml" /><link href="https://chem-bla-ics.linkedchemistry.info/" rel="alternate" type="text/html" /><updated>2026-06-15T12:00:19+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/feed/by_tag/curation.xml</id><title type="html">chem-bla-ics</title><subtitle>Chemblaics (pronounced chem-bla-ics) is the science that uses open science and computers to solve problems in chemistry, biochemistry and related fields.</subtitle><author><name>Egon Willighagen</name></author><entry><title type="html">WikiPathways curation reports on profile pages</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/11/30/wikipathways-curation-reports-on-profile-pages.html" rel="alternate" type="text/html" title="WikiPathways curation reports on profile pages" /><published>2025-11-30T00:00:00+00:00</published><updated>2025-11-30T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/11/30/wikipathways-curation-reports-on-profile-pages</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/11/30/wikipathways-curation-reports-on-profile-pages.html"><![CDATA[<p>I have been running automated curation tests for many years now, at least <a href="https://chem-bla-ics.linkedchemistry.info/2018/10/11/two-presentations-at-wikipathways-2018.html">from before 2018</a>.
Because this has been done without funding, the tests are not as nicely integrated as they could be: they depend, for example, on the RDF generation first being integrated
in the GitHub Action. So, I still run them regularly (often in the morning during breakfast). Meanwhile, the <a href="https://www.wikipathways.org/wikipathways-collection/index2">curation tests</a>
help the project to monitor and maintain the quality of the pathways. The curation reports have been integrated into pathway pages for some
time now.</p>

<p><img src="/assets/images/wpCurationBadge.png" alt="" /></p>

<p>We have now integrated this curation badge into the author and community pages on the (not so) <a href="https://www.wikipathways.org/">new WikiPathways website</a>
too. Authors can now find curation reports for pathways they started and also for the community pages:</p>

<p><img src="/assets/images/4a0a20557574c3ae.png" alt="" /></p>

<p>A second new feature is the “Citations” tab on both pages, which links to <a href="https://europepmc.org/">Europe PMC</a>
with a dedicated search for articles mentioning those author or community pathways:</p>

<p><img src="/assets/images/270716cef8d30481.png" alt="" /></p>

<p>We hope you like it!</p>]]></content><author><name>Egon Willighagen</name></author><category term="wikipathways" /><category term="curation" /><category term="europepmc" /><summary type="html"><![CDATA[I have been running automated curation tests for many years now, at least from before 2018. Because it has been done without funding, it has not been as nicely integrated, and depends, for example, first on the RDF generation to be integrated in the GitHub Action. So, I still run them regularly (often in the morning during breakfast). Meanwhile, the curation tests help the project to monitor and maintain the quality of the pathways. The curation reports have been integrated into pathway pages for some time now.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/wpCurationBadge.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/wpCurationBadge.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Curation is an essential part of doing research</title><link href="https://chem-bla-ics.linkedchemistry.info/2025/06/29/curation-is-an-essential-part-of-doing-research.html" rel="alternate" type="text/html" title="Curation is an essential part of doing research" /><published>2025-06-29T00:00:00+00:00</published><updated>2025-06-29T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2025/06/29/curation-is-an-essential-part-of-doing-research</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2025/06/29/curation-is-an-essential-part-of-doing-research.html"><![CDATA[<p>Depending on your exact definition of doing science, keeping track as precise as possible of your observations
is an essential part of doing science. The precision should be high enough that mistakes are obvious. This pattern is,
of course, not limited to doing science and we see this in open source development too. Unfortunately, in the
modern way of doing science, this is not getting the attention it should get. Worse, narratives (stories)
about the research, in the form of journal articles, are generally considered more important than a precise
description of the observations.</p>

<p>Is that a big issue? Hell, yes. Where do you think the FAIR ideas came from? And why has FAIR, in ten years, not
brought about the change it was hoping for?</p>

<p>For me, my fascination for curation started as a student, around 1995, with the <em>Dictionary on Organic Chemistry</em>.
At that time, my interest came from wanting to learn about chemistry and biology. During my M.Sc. and PhD, it was
obvious how essential it was to deriving correct scientific conclusions from your experiment. Data, knowledge,
and software alike, imo. And because curation is expensive and should not have to be repeated, I prefer to do it as
Open Science.</p>

<h2 id="curation">Curation</h2>

<p>Of course, curation has been part of doing science, but to a large extent it is a separate step from doing science.
It is done by database developers, librarians, and chemo- and bioinformaticians. For example, Chemical Abstracts
Service (CAS) <a href="https://en.wikipedia.org/wiki/Chemical_Abstracts_Service">started over 100 years ago</a> and started
indexing chemical structures in 1965. The curation is an ongoing process, <a href="https://chem-bla-ics.linkedchemistry.info/2022/05/22/new-cas-common-chemistry-in-2021.html">also for old records</a>.</p>

<p><a href="https://www.biocuration.org/dissemination/who-are-we/">Biocuration</a> is getting
<a href="https://scholia.toolforge.org/topic/Q54987878#publications-per-year">more and more attention</a>:</p>

<p><img src="/assets/images/biocuration.png" alt="" /></p>

<p>The recognition and reward that come with having the <a href="https://www.biocuration.org/">International Society for Biocuration</a>
(ISB, <a href="https://scholia.toolforge.org/organization/Q23809291">Scholia page</a>) should not be underestimated
(doi:<a href="https://doi.org/10.1038/455047A">10.1038/455047A</a>). Their <a href="https://scholia.toolforge.org/event-series/Q106486148">Annual International Biocuration Conferences</a>
have been running since <a href="https://scholia.toolforge.org/event/Q109408101">2005</a>. And with their
awards, they give biocuration work recognition and, literally, rewards:</p>

<ul>
  <li><a href="https://scholia.toolforge.org/award/Q106045191">Biocuration Career Award</a> (2016-2021)</li>
  <li><a href="https://scholia.toolforge.org/award/Q118947746">Excellence in Biocuration Early Career Award</a> (2022-)</li>
  <li><a href="https://scholia.toolforge.org/award/Q119882229">Excellence in Biocuration Advanced Career Award</a> (2022-)</li>
  <li><a href="https://scholia.toolforge.org/award/Q106045103">Exceptional Contribution to Biocuration Award</a> (2017-)</li>
</ul>

<h2 id="my-curation-curriculum-vitae">My curation Curriculum Vitae</h2>

<p>I don’t have a good <em>curation CV</em>. To a large extent, that is because the curation has been part of a study. The curation
itself does not get recognized, and only the <em>journal article</em> does. With datasets slowly getting more recognition,
so does data curation, but data curation is not really part of how we do FAIR at this moment, and via this route
it is not getting the attention it deserves.</p>

<p>But since I have been updating <a href="https://egonw.github.io/cv/">my CV anyway</a>, I dug up some curation I am proud
of:</p>

<ul>
  <li>the Dictionary on Organic Chemistry, which no longer exists, but it started my Open Science chemistry research</li>
  <li>the <a href="Blue Obelisk Data Repository">Blue Obelisk Data Repository</a> (BODR), which has been part of various
GNU/Linux distributions (see also doi:<a href="https://doi.org/10.1021/ci050400b">10.1021/ci050400b</a>).
A new version is <a href="https://chem-bla-ics.blogspot.com/2013/08/the-blue-obelisk-data-repositorys-10.html">long overdue</a></li>
  <li>I contributed hundreds of NMR spectra with uncommon nuclei to <a href="https://sourceforge.net/projects/nmrshiftdb2/files/data/">NMRShiftDb</a></li>
  <li>Wikidata, see <a href="https://chem-bla-ics.linkedchemistry.info/2025/05/25/new-preprint-scholia-chemistry-access-to-chemistry-in-wikidata.html">this preprint</a>,
but also many small projects, like adding CXSMILES for polymers, and <a href="https://laurendupuis.github.io/Scholia_tutorial/">main subject annotation in Scholia</a></li>
  <li>WikiPathways (see <a href="https://chem-bla-ics.linkedchemistry.info/tag/wikipathways">these blog posts</a>), where I started
<a href="https://classic.wikipathways.org/index.php?title=Special:Contributions&amp;dir=prev&amp;target=Egonw&amp;month=&amp;year=">curating metabolites in 2012</a>,
set up <a href="https://chem-bla-ics.linkedchemistry.info/2018/10/11/two-presentations-at-wikipathways-2018.html">a computer-assisted curation platform</a>
<a href="https://chem-bla-ics.linkedchemistry.info/2016/07/02/two-apache-jena-sparql-query.html">using SPARQL</a>, and
was an early curator of <a href="https://chem-bla-ics.linkedchemistry.info/2020/10/31/sars-cov-2-covid-19-and-open-science.html">SARS-CoV-2 biological processes</a></li>
  <li>citation intent annotation with the Citation Typing Ontology, see this <a href="https://scholia.toolforge.org/cito/">Scholia overview</a></li>
  <li>nanosafety ontology and data: the <a href="https://github.com/enanomapper/ontologies">eNanoMapper Ontology</a> (ENMO),
<a href="https://figshare.com/search?q=nanowiki">NanoWiki</a>, <a href="https://nanocommons.github.io/specifications/jrc/">JRC nanomaterial index</a> and
<a href="https://nanocommons.github.io/erm-database/">the ERM identifier database</a></li>
  <li>made RDF for supplementary information (e.g. <a href="http://chem-bla-ics.linkedchemistry.info/2018/09/16/data-curation-5-inspiration-95.html">this NanoE-Tox spreadsheet</a>) and
full databases, like <a href="https://chem-bla-ics.linkedchemistry.info/2011/04/21/chembl-09-as-rdf.html">ChEMBL</a> and
<a href="https://chem-bla-ics.linkedchemistry.info/2009/09/04/nmrshiftdb-enters-rdfopenmoleculesnet-2.html">NMRShiftDb <i class="fa-solid fa-recycle fa-xs"></i></a></li>
  <li>organized <a href="https://chem-bla-ics.linkedchemistry.info/2019/10/14/chemcuration-2019-poster-conference.html">an online ChemCuration event</a> (inspired by the ISB annual meetings!)</li>
</ul>

<p>I am also curating my blog, which was <a href="https://chem-bla-ics.linkedchemistry.info/2023/08/18/last-post-here-freebie-model-online.html">originally on blogger.com but is being ported to Markdown with extra annotation</a>.
That includes <a href="https://chem-bla-ics.linkedchemistry.info/2023/07/27/archiving-and-updating-my-blog.html">updating URLs</a>
and annotation of blog posts <a href="https://chem-bla-ics.linkedchemistry.info/2005/10/21/viagra-saves-environment.html">with chemicals</a>,
<a href="https://chem-bla-ics.linkedchemistry.info/2024/10/24/vhp4safety.html">grants</a>, and
<a href="https://chem-bla-ics.linkedchemistry.info/2025/02/08/cito-for-blog-citations.html">intention-typed citations</a>.</p>

<h2 id="long-tail">Long tail</h2>

<p>Of course, I have my Wikipedia edits, have contributed to projects like <a href="https://github.com/biopragmatics/bioregistry/commits/main/?author=egonw">Bioregistry.io</a> and
<a href="https://fairsharing.org/users/596">FAIRsharing</a>, regularly submit <a href="https://form.typeform.com/to/SWoxIY?typeform-source=altmetric.typeform.com">missed mentions to Altmetric.com</a>,
etc. There is a long tail in curation. And there is a lot of curation hidden in <a href="https://scholar.google.com/citations?user=u8SjMZ0AAAAJ&amp;hl=en">my literature list</a>.</p>

<p>And that long tail matters to me. I want every researcher to pick up the challenge to curate their own
research output. Put your experimental data in databases, add important provenance, get the details right.
This is essential to reduce the cost of doing research, and that is more important than ever.</p>

<p>BTW, I must note that our bioinformatics team colleagues too have done a tremendous amount of biocuration,
in WikiPathways (<a href="https://scholia.toolforge.org/author/Q43744369">Denise</a>, <a href="https://scholia.toolforge.org/author/Q28025534">Freddie</a>,
<a href="https://scholia.toolforge.org/author/Q19851164">Susan</a>), in nanosafety (<a href="https://scholia.toolforge.org/author/Q99306396">Jeaphianne</a>,
<a href="https://scholia.toolforge.org/author/Q86442640">Ammar</a>), and in toxicology (<a href="https://scholia.toolforge.org/author/Q42369611">Marvin</a>),
just to name a few. Often together with B.Sc. and M.Sc. students (which <a href="https://europepmc.org/article/med/26557796">can work really well</a>).</p>

<h2 id="award-nomination">Award nomination</h2>

<p>And I hope this makes it clear why I am delighted that I was <a href="https://www.biocuration.org/community/biocuration-career-awards/excellence-in-biocuration-advanced-career-award-2025/">nominated last week</a>
for an ISB <em>Excellence in Biocuration Advanced Career Award</em>. The list of past awardees is impressive,
as are the other nominations:
<a href="https://scholia.toolforge.org/author/Q89869027">Laurel Cooper</a>, Oregon State University/USA,
<a href="https://scholia.toolforge.org/author/Q57227590">Steven Marygold</a>, University of Cambridge/UK,
<a href="https://scholia.toolforge.org/author/Q111430202">Saurabh Raghuvanshi</a>, University of Delhi/India, and
<a href="https://scholia.toolforge.org/author/Q59674797">Kimberly Van Auken</a>, California Institute of Technology/USA.</p>

<p>It’s an honor to be listed alongside these other nominees, and being nominated is a great recognition! With a
<em>thank you</em> to the person who proposed my nomination.</p>]]></content><author><name>Egon Willighagen</name></author><category term="curation" /><category term="openscience" /><category term="justdoi:10.1038/455047A" /><category term="doi:10.1021/CI050400B" /><category term="nmrshiftdb" /><category term="europepmc" /><summary type="html"><![CDATA[Depending on your exact definition of doing science, keeping track as precise as possible of your observations is an essential part of doing science. The precision should be high enough that mistakes are obvious. This pattern is, of course, not limited to doing science and we see this in open source development too. Unfortunately, in the modern way of doing science, this is not getting the attention it should get. Worse, with narratives (stories) about the research, in the form of journal articles, are generally considered more important that a precise description of the observations.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/biocuration.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/biocuration.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">new: “CAS Common Chemistry in 2021: Expanding Access to Trusted Chemical Information for the Scientific Community”</title><link href="https://chem-bla-ics.linkedchemistry.info/2022/05/22/new-cas-common-chemistry-in-2021.html" rel="alternate" type="text/html" title="new: “CAS Common Chemistry in 2021: Expanding Access to Trusted Chemical Information for the Scientific Community”" /><published>2022-05-22T00:00:00+00:00</published><updated>2022-05-22T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2022/05/22/new-cas-common-chemistry-in-2021</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2022/05/22/new-cas-common-chemistry-in-2021.html"><![CDATA[<p>Open 
Science is happening. The merits are no longer theoretical or idealistic but tangible. Research is faster than ever, more vetted than ever (think PubPeer),
more cited than ever. To be fair, this is not just because of Open Science, but open access causes readership causes impact causes citations. When new people and
organizations start adopting Open Science this warms my heart.</p>

<p>So, when I was asked to work with <a href="https://www.cas.org/">Chemical Abstracts Service</a> (CAS) on a new, bigger than ever version of
<a href="https://commonchemistry.cas.org/">Common Chemistry</a> (which started as a project between CAS and Wikipedia), I welcomed the project. I don’t quite
remember the first meetings, but roughly my task became to work with the new content and match this against Wikidata and Wikipedia. It aligned well
with <a href="https://bridgedb.github.io/">BridgeDb</a>, <a href="https://scholia.toolforge.org/">Scholia</a>, and our metabolomics research, so I even could find
sufficient research time for it. This work is now published in the <a href="https://pubs.acs.org/journal/jcisd8">JCIM</a>:
<em>CAS Common Chemistry in 2021: Expanding Access to Trusted Chemical Information for the Scientific Community</em>
(doi:<a href="https://doi.org/10.1021/acs.jcim.2c00268">10.1021/acs.jcim.2c00268</a>).</p>

<p><img src="/assets/images/images_medium_ci2c00268_0003.png" alt="" /> <br />
<em>Figure 2 from the article. Detailed record for caffeine in CAS Common Chemistry (image: CC-BY).</em></p>

<p>About Wikidata, the paper writes (CC-BY):</p>

<blockquote>
  <p>The latest release of CAS Common Chemistry has also supported updates and corrections to CAS RNs in Wikidata and Wikipedia. (22)
InChIKeys were calculated from CAS SMILES using Bacting 0.0.31 (23) with the Chemistry Development Kit 2.7.1 (24) and were
matched with content in Wikidata. The CAS RNs were then compared. References to CAS Common Chemistry were added for CAS RNs
that matched. Mismatches have been shared with the Wikidata and Wikipedia communities so that they can manually review and
correct the misleading entries using CAS Common Chemistry as a reference. Because Wikidata also curates identifiers from
other data sources, validated CAS RNs in Wikidata may also be used to cross-reference with other resources. Scripts are
provided in the Supporting Information.</p>
</blockquote>

<p>The alignment is a continuous process, as new chemical compounds get added to Wikidata on a weekly basis. The comparison of
Common Chemistry with Wikidata and Wikipedia resulted in a wealth of curation data, e.g. inconsistent CAS numbers linked to
InChIKeys, where Common Chemistry had a different match than Wikidata or Wikipedia.</p>
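
<p>The comparison step can be sketched roughly as follows. This is not the script from the Supporting Information (that work used Bacting and the CDK); it is a minimal Python illustration that assumes both sources have already been reduced to a mapping from InChIKey to CAS RN. The function name <code>compare_cas</code> and the placeholder identifiers are hypothetical.</p>

```python
def compare_cas(wikidata, common_chemistry):
    """Compare CAS RNs for InChIKeys that occur in both mappings.

    Both arguments map an InChIKey to a CAS RN (assumed input shape).
    Returns a tuple (confirmed, mismatches).
    """
    confirmed = []   # InChIKeys where both sources agree on the CAS RN
    mismatches = []  # (InChIKey, Wikidata CAS RN, Common Chemistry CAS RN)
    for inchikey, wd_cas in sorted(wikidata.items()):
        cc_cas = common_chemistry.get(inchikey)
        if cc_cas is None:
            continue  # nothing in Common Chemistry to compare against
        if cc_cas == wd_cas:
            confirmed.append(inchikey)
        else:
            mismatches.append((inchikey, wd_cas, cc_cas))
    return confirmed, mismatches


# Placeholder data, not real identifiers
wikidata = {"KEY-A": "00-00-1", "KEY-B": "00-00-2", "KEY-C": "00-00-3"}
common_chemistry = {"KEY-A": "00-00-1", "KEY-B": "00-00-9"}

confirmed, mismatches = compare_cas(wikidata, common_chemistry)
print(confirmed)
print(mismatches)
```

<p>In this sketch nothing is written back: the confirmed list would become references, and the mismatches would go into a report for manual review.</p>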

<p>CAS registry numbers were not added to Wikidata in this process, only confirmed or reported as different. The latter
allowed manual curation by the community, which it did. Reports <a href="https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Chemistry/CAS_Validation_Results">look like this</a>.
When an InChIKey-CAS RN combination in Wikidata was confirmed, it was recorded as a reference, like this:</p>

<p><img src="/assets/images/Screenshot_20220522_084233.png" alt="" /> <br />
<em>Screenshot of Wikidata with two references, one reflecting a confirmation
by the English Wikipedia (potentially the result of the original Common Chemistry
project) and the second as outcome of the now published project.</em></p>

<p>Thanks to everyone on this project and <a href="https://orcid.org/0000-0001-9316-9400">Andrea Jacobs</a>
particularly for leading this open science project.</p>]]></content><author><name>Egon Willighagen</name></author><category term="curation" /><category term="chemistry" /><category term="cas" /><category term="doi:10.1021/ACS.JCIM.2C00268" /><category term="bioclipse" /><category term="cdk" /><category term="bridgedb" /><summary type="html"><![CDATA[Open Science is happening. The merits are no longer theoretical or idealistic but tangible. Research is faster than ever, more vetted than ever (think PubPeer), more cited than ever. Fairly, not just because of Open Science, but open access causes readership causes impact causes citations. When new people and organizations start adopting Open Science this warms my hearth.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/images_medium_ci2c00268_0003.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/images_medium_ci2c00268_0003.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">ChemCuration 2019 Poster Conference: Call for Posters</title><link href="https://chem-bla-ics.linkedchemistry.info/2019/10/14/chemcuration-2019-poster-conference.html" rel="alternate" type="text/html" title="ChemCuration 2019 Poster Conference: Call for Posters" /><published>2019-10-14T00:00:00+00:00</published><updated>2019-10-14T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2019/10/14/chemcuration-2019-poster-conference</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2019/10/14/chemcuration-2019-poster-conference.html"><![CDATA[<p><span style="width: 40%; display: block; margin-left: auto; margin-right: auto; float: right">
<img src="/assets/images/Screenshot_20191014_174204.png" /> <br />
Twitter profile.
</span></p>

<p><em>It giet oan!</em> That is a Frisian phrase for something unlikely that is going to happen, particularly related to the
<a href="https://en.wikipedia.org/wiki/Elfstedentocht">Elfstedentocht</a>.</p>

<p><strong>ChemCuration 2019</strong> is a go. The <a href="https://chemcuration.github.io/chemcuration2019/">website is online</a>, the
<a href="https://twitter.com/chemcuration">Twitter account</a> and <a href="https://twitter.com/hashtag/chemcur2019">hashtag are ready</a>,
we got a poster prize, and here is the call for posters!</p>

<blockquote>
  <p>On December 3 the first ChemCuration conference will take place. ChemCuration 2019 is a one day, online-only conference around data curation and curated data in the chemistry domain. During the entire conference day, you can participate by tweeting about the poster that you uploaded, along with the meeting hashtag, and responding to questions about your poster in the 24 hours of the conference day. The poster must be available in an online repository (e.g. Zenodo or Figshare) under the CCZero, CC-BY or CC-BY-SA license prior to the conference.</p>

  <p>This is the meeting scope: anything around data curation and curated data of open science data in chemistry. This includes but is not limited to: 1. a new release of curated open data; 2. FAIR metadata around open data; and 3. open source tools for data curation.</p>

  <p><strong>How do I participate in ChemCuration?</strong><br />
You can participate in this online poster conference by presenting your poster on Twitter
during the conference day. You do this by first archiving your poster via Figshare or Zenodo,
with an open license (e.g. CCZero or CC-BY). Then, during the day you tweet an image of
(part of) your digital poster with the <a href="https://twitter.com/hashtag/chemcur2019">#chemcur2019</a>
hashtag, a short summary, and a link to your online poster with its DOI. The archived poster
should be a regular A0 poster (WxH = 841 x 1189 mm or 33.1 x 46.8 in)</p>

  <p><strong>Do I need to register?</strong><br />
Registration is not obligatory to participate. However, if you would like to be eligible
for a poster prize, then registration is required, by Nov. 30th, 2019. The registration form
is found at <a href="https://github.com/chemcuration/chemcuration2019/issues/new/choose">https://github.com/chemcuration/chemcuration2019/issues/new/choose</a></p>

  <p>More information can be found on the website (<a href="https://chemcuration.github.io/chemcuration2019/">https://chemcuration.github.io/chemcuration2019/</a>)
and on Twitter <a href="https://twitter.com/chemcuration">https://twitter.com/chemcuration</a></p>
</blockquote>]]></content><author><name>Egon Willighagen</name></author><category term="curation" /><category term="chemistry" /><summary type="html"><![CDATA[Twitter profile.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/Screenshot_20191014_174204.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/Screenshot_20191014_174204.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">ChemCuration: a small trick to fix the SMILES of glucuronides</title><link href="https://chem-bla-ics.linkedchemistry.info/2019/10/09/chemcuration-small-trick-to-fix-smiles.html" rel="alternate" type="text/html" title="ChemCuration: a small trick to fix the SMILES of glucuronides" /><published>2019-10-09T00:00:00+00:00</published><updated>2019-10-09T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2019/10/09/chemcuration-small-trick-to-fix-smiles</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2019/10/09/chemcuration-small-trick-to-fix-smiles.html"><![CDATA[<p><span style="width: 40%; display: block; margin-left: auto; margin-right: auto; float: right">
<img src="/assets/images/Screenshot_20191008_144049.png" /> <br />
Glucuronide functional group.
</span></p>

<p>With the <a href="https://chemcuration.github.io/chemcuration2019/">ChemCuration 2019</a> online poster conference nearing,
my upcoming talks about chemistry in <a href="https://wikidata.org/">Wikidata</a> (also needing curation), and the much longer process
of curation of metabolite (-like) structures in <a href="https://wikipathways.org/">WikiPathways</a>, I decided that something I
tweeted earlier this week is actually quite useful, and therefore something I should really write up in my lab notebook.</p>

<p><a href="https://en.wikipedia.org/wiki/Glucuronide">Glucuronide</a> is an example of a (biological) functional group. And there are several
databases that do not always represent the stereochemistry correctly. That is an interoperability (and thus FAIR) problem.
Correcting this is not trivial, particularly if you have to redraw the same glucuronide group again and again.</p>

<p>So, not looking forward to that, I invested a bit of time to find a <a href="http://opensmiles.org/">SMILES</a> trick. What if I had
a SMILES snippet that I could easily copy/paste and attach to the SMILES of the chemical structure it is attached to? Here
goes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>O1[C@H](C(O)=O)[C@H]([C@H](O)[C@@H](O)[C@@H]1O9)O.
</code></pre></div></div>

<p>I just realized that <a href="https://twitter.com/egonwillighagen/status/1181573810543321088">the original 3 I used</a> can better be
a <code class="language-plaintext highlighter-rouge">9</code>, which is less likely to occur in the SMILES of the rest of the molecule. The period at the end is also deliberate.
That way, I can just copy-paste the SMILES of the rest directly after that period. Then I get a disconnected structure, but
I only have to put a 9 next to the atom that is binding to the glucuronide. So, let’s say the R group is methane, I get:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>O1[C@H](C(O)=O)[C@H]([C@H](O)[C@@H](O)[C@@H]1O9)O.C9
</code></pre></div></div>
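
<p>As a minimal sketch, the same copy/paste step could be scripted in Python. The function name <code>attach_glucuronide</code> and its <code>atom_end</code> parameter are made up for illustration, and no cheminformatics toolkit is used, so the result is not validated as chemically sensible SMILES.</p>

```python
# The snippet from above: glucuronide with an open ring-closure 9 and a trailing period
GLUCURONIDE = "O1[C@H](C(O)=O)[C@H]([C@H](O)[C@@H](O)[C@@H]1O9)O."


def attach_glucuronide(rest_smiles, atom_end):
    """Append rest_smiles after the snippet and close ring bond 9.

    atom_end is the string position just after the atom in rest_smiles
    that binds to the glucuronide (1 for the first one-letter atom).
    """
    if "9" in rest_smiles:
        # Conservative check: a 9 anywhere in the rest could clash with our label
        raise ValueError("rest_smiles already contains a 9")
    if atom_end not in range(1, len(rest_smiles) + 1):
        raise ValueError("atom_end must point just after an atom in rest_smiles")
    return GLUCURONIDE + rest_smiles[:atom_end] + "9" + rest_smiles[atom_end:]


print(attach_glucuronide("C", 1))
```

<p>With <code>"C"</code> and position 1 this gives the same string as the methane example above. For bracket atoms or two-letter elements, <code>atom_end</code> has to point after the closing bracket or the second letter.</p>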

<p>Now, next stop: <code class="language-plaintext highlighter-rouge">CoA</code> and other common biological tags.</p>]]></content><author><name>Egon Willighagen</name></author><category term="chemistry" /><category term="curation" /><category term="wikidata" /><category term="wikipathways" /><category term="smiles" /><summary type="html"><![CDATA[Glucuronide functional group.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/Screenshot_20191008_144049.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/Screenshot_20191008_144049.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Two presentations at WikiPathways 2018 Summit #WP18Summit</title><link href="https://chem-bla-ics.linkedchemistry.info/2018/10/11/two-presentations-at-wikipathways-2018.html" rel="alternate" type="text/html" title="Two presentations at WikiPathways 2018 Summit #WP18Summit" /><published>2018-10-11T00:00:00+00:00</published><updated>2018-10-11T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2018/10/11/two-presentations-at-wikipathways-2018</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2018/10/11/two-presentations-at-wikipathways-2018.html"><![CDATA[<p><span style="width: 40%; display: block; margin-left: auto; margin-right: auto; float: right">
<img src="/assets/images/430px-Wp18summit.png" />
</span></p>

<p>Found my way back to my room a few kilometers from the San Francisco city center, after a third day at the
<a href="https://gladstone.org/WP18Summit">WikiPathways 2018 Summit</a> at the <a href="https://gladstone.org/">Gladstone Institutes</a>
in Mission Bay, celebrating 10 years of the project, which I only joined
<a href="http://chem-bla-ics.blogspot.com/2012/01/first-month-back-in-nl.html">some six and a half years ago</a>.</p>

<p>The Summit was awesome and the whole trip was awesome. The flight was long, with a stop in Seattle. I always get a
bit nervous about layovers (having missed my plane twice before…), but a stop in Seattle is interesting, with a great view of
<a href="https://tools.wmflabs.org/reasonator/?q=Q194057&amp;lang=en">Mt. Rainier</a>, which is quite a sight from an airplane too.
Alex picked us up from the airport and the Airbnb is great (HT to Annie for being a great host), from which we can even
see the Golden Gate Bridge.</p>

<p>The Sunday was surreal. With some 27 degrees Celsius the choice to visit the beach and stand, for the first time,
in the Pacific was great. I had the great pleasure to meet Dario and his family and played volleyball at a beach
for the first time in some 28 years. Apparently, there was an airshow nearby and several shows were visible from
our site, including a very long show by the <a href="https://www.instagram.com/p/BopjukvhUK1/">Blue Angels</a>.
Thanks for a great afternoon!</p>

<p>Sunday evening Adam hosted us for a <a href="https://www.wikipathways.org/index.php/WikiPathways:Team">WikiPathways team</a> dinner.
His place gave a great view on San Francisco, the Bay Bridge, etc. Because Chris was paying attention, we actually got
to see <a href="https://www.space.com/42068-amazing-spacex-rocket-launch-photos-not-aliens.html">the SpaceX rocket launch</a>
(no, my photo is not so impressive :). Well, I cannot express in words how cool that is, to see a rocket escape
Earth’s gravity with your own eyes.</p>

<p>And the Summit had not even started yet.</p>

<p>I will have quite a lot to write up about the meeting itself. It was a great line up of speakers, great workshops,
awesome discussions, and a high density of very knowledgeable people. I think we need 5M to implement just the ideas
that came up in the past three days. And it would be well invested. Anyway, more about that later. Make sure to keep
an eye on the <a href="https://github.com/wikipathways">GitHub repo for WikiPathways</a>.</p>

<p>That leaves me only, right now, to return to the title of this post. And here they are, my two contributions to this summit:</p>

<p><a href="https://doi.org/10.5281/zenodo.3544361"><img src="/assets/images/wpsummit_PDF_slide1_andabit.png" alt="" /></a></p>

<div style="margin-bottom: 5px;">
<strong> <a href="https://zenodo.org/records/3544361" target="_blank" title="Automated Curation with Internal and External Validation">Automated Curation with Internal and External Validation</a> </strong> from <strong><a href="https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Willighagen,+Egon%22" target="_blank">Egon Willighagen</a></strong> </div>

<p><br /></p>

<p><a href="https://doi.org/10.5281/zenodo.3544363"><img src="/assets/images/wpsummit_PDF2_slide1_andabit.png" alt="" /></a></p>

<div style="margin-bottom: 5px;">
<strong> <a href="https://zenodo.org/records/3544363" target="_blank" title="Using WikiPathways with Its Resource Description Framework Format">Using WikiPathways with Its Resource Description Framework Format</a> </strong> from <strong><a href="https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Willighagen,+Egon%22" target="_blank">Egon Willighagen</a></strong> </div>]]></content><author><name>Egon Willighagen</name></author><category term="curation" /><category term="wikipathways" /><category term="justdoi:10.5281/zenodo.3544361" /><category term="justdoi:10.5281/zenodo.3544363" /><summary type="html"><![CDATA[]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/430px-Wp18summit.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/430px-Wp18summit.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Data Curation: 5% inspiration, 95% frustration (cleaning up data inconsistencies)</title><link href="https://chem-bla-ics.linkedchemistry.info/2018/09/16/data-curation-5-inspiration-95.html" rel="alternate" type="text/html" title="Data Curation: 5% inspiration, 95% frustration (cleaning up data inconsistencies)" /><published>2018-09-16T00:00:00+00:00</published><updated>2018-09-16T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2018/09/16/data-curation-5-inspiration-95</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2018/09/16/data-curation-5-inspiration-95.html"><![CDATA[<p><span style="width: 40%; display: block; margin-left: auto; margin-right: auto; float: right">
<img src="/assets/images/Screenshot_20180916_170915.png" /> <br />
Slice of the spreadsheet in the supplementary info.
</span></p>

<p>Just a bit of cleaning I scripted today for a number of toxicology end points in a database published some time ago in the
zero-APC Open Access (CC-BY) journal <a href="https://www.beilstein-journals.org/bjnano/">Beilstein Journal of Nanotechnology</a>,
NanoE-Tox (doi:<a href="https://www.beilstein-journals.org/bjnano/articles/6/183">10.3762/bjnano.6.183</a>).</p>

<p>The curation I am doing is to redistribute the data in the eNanoMapper database (see doi:<a href="https://doi.org/10.3762/bjnano.6.165/">10.3762/bjnano.6.165</a>)
and thus with ontology annotation (see doi:<a href="https://doi.org/10.1186/s13326-015-0005-5">10.1186/s13326-015-0005-5</a>):</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">recognizedToxicities</span> <span class="o">=</span> <span class="o">[</span>
  <span class="s2">"EC10"</span><span class="o">:</span> <span class="s2">"http://www.bioassayontology.org/bao#BAO_0001263"</span><span class="o">,</span>
  <span class="s2">"EC20"</span><span class="o">:</span> <span class="s2">"http://www.bioassayontology.org/bao#BAO_0001235"</span><span class="o">,</span>
  <span class="s2">"EC25"</span><span class="o">:</span> <span class="s2">"http://www.bioassayontology.org/bao#BAO_0001264"</span><span class="o">,</span>
  <span class="s2">"EC30"</span><span class="o">:</span> <span class="s2">"http://www.bioassayontology.org/bao#BAO_0000599"</span><span class="o">,</span>
  <span class="s2">"EC50"</span><span class="o">:</span> <span class="s2">"http://www.bioassayontology.org/bao#BAO_0000188"</span><span class="o">,</span>
  <span class="s2">"EC80"</span><span class="o">:</span> <span class="s2">"http://purl.enanomapper.org/onto/ENM_0000053"</span><span class="o">,</span>
  <span class="s2">"EC90"</span><span class="o">:</span> <span class="s2">"http://www.bioassayontology.org/bao#BAO_0001237"</span><span class="o">,</span>
  <span class="s2">"IC50"</span><span class="o">:</span> <span class="s2">"http://www.bioassayontology.org/bao#BAO_0000190"</span><span class="o">,</span>
  <span class="s2">"LC50"</span><span class="o">:</span> <span class="s2">"http://www.bioassayontology.org/bao#BAO_0002145"</span><span class="o">,</span>
  <span class="s2">"MIC"</span><span class="o">:</span>  <span class="s2">"http://www.bioassayontology.org/bao#BAO_0002146"</span><span class="o">,</span>
  <span class="s2">"NOEC"</span><span class="o">:</span> <span class="s2">"http://purl.enanomapper.org/onto/ENM_0000060"</span><span class="o">,</span>
  <span class="s2">"NOEL"</span><span class="o">:</span> <span class="s2">"http://purl.enanomapper.org/onto/ENM_0000056"</span>
<span class="o">]</span>
</code></pre></div></div>

<p>With 402(!) variants left. Many do not have an ontology term yet, and I
<a href="https://github.com/enanomapper/ontologies/issues/143">filed a feature request</a>.</p>

<p>Units:</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">recognizedUnits</span> <span class="o">=</span> <span class="o">[</span>
  <span class="s2">"g/L"</span><span class="o">:</span> <span class="s2">"g/L"</span><span class="o">,</span>
  <span class="s2">"g/l"</span><span class="o">:</span> <span class="s2">"g/l"</span><span class="o">,</span>
  <span class="s2">"mg/L"</span><span class="o">:</span> <span class="s2">"mg/L"</span><span class="o">,</span>
  <span class="s2">"mg/ml"</span><span class="o">:</span> <span class="s2">"mg/ml"</span><span class="o">,</span>
  <span class="s2">"mg/mL"</span><span class="o">:</span> <span class="s2">"mg/mL"</span><span class="o">,</span>
  <span class="s2">"µg/L of food"</span><span class="o">:</span> <span class="s2">"µg/L"</span><span class="o">,</span>
  <span class="s2">"µg/L"</span><span class="o">:</span> <span class="s2">"µg/L"</span><span class="o">,</span>
  <span class="s2">"µg/mL"</span><span class="o">:</span> <span class="s2">"µg/mL"</span><span class="o">,</span>
  <span class="s2">"mg Ag/L"</span><span class="o">:</span> <span class="s2">"mg/L"</span><span class="o">,</span>
  <span class="s2">"mg Cu/L"</span><span class="o">:</span> <span class="s2">"mg/L"</span><span class="o">,</span>
  <span class="s2">"mg Zn/L"</span><span class="o">:</span> <span class="s2">"mg/L"</span><span class="o">,</span>
  <span class="s2">"µg dissolved Cu/L"</span><span class="o">:</span> <span class="s2">"µg/L"</span><span class="o">,</span>
  <span class="s2">"µg dissolved Zn/L"</span><span class="o">:</span> <span class="s2">"µg/L"</span><span class="o">,</span>
  <span class="s2">"µg Ag/L"</span><span class="o">:</span> <span class="s2">"µg/L"</span><span class="o">,</span>
  <span class="s2">"fmol/L"</span><span class="o">:</span> <span class="s2">"fmol/L"</span><span class="o">,</span>

  <span class="s2">"mmol/g"</span><span class="o">:</span> <span class="s2">"mmol/g"</span><span class="o">,</span>
  <span class="s2">"nmol/g fresh weight"</span><span class="o">:</span> <span class="s2">"nmol/g"</span><span class="o">,</span>
  <span class="s2">"µg Cu/g"</span><span class="o">:</span> <span class="s2">"µg/g"</span><span class="o">,</span>
  <span class="s2">"mg Ag/kg"</span><span class="o">:</span> <span class="s2">"mg/kg"</span><span class="o">,</span>
  <span class="s2">"mg Zn/kg"</span><span class="o">:</span> <span class="s2">"mg/kg"</span><span class="o">,</span>
  <span class="s2">"mg Zn/kg  d.w."</span><span class="o">:</span> <span class="s2">"mg/kg"</span><span class="o">,</span>
  <span class="s2">"mg/kg of dry feed"</span><span class="o">:</span> <span class="s2">"mg/kg"</span><span class="o">,</span>
  <span class="s2">"mg/kg"</span><span class="o">:</span> <span class="s2">"mg/kg"</span><span class="o">,</span>
  <span class="s2">"g/kg"</span><span class="o">:</span> <span class="s2">"g/kg"</span><span class="o">,</span>
  <span class="s2">"µg/g dry weight sediment"</span><span class="o">:</span> <span class="s2">"µg/g"</span><span class="o">,</span>
  <span class="s2">"µg/g"</span><span class="o">:</span> <span class="s2">"µg/g"</span>
<span class="o">]</span>
</code></pre></div></div>
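
<p>A minimal sketch of how such a lookup can be applied, assuming the raw unit strings only differ by surrounding
whitespace. The <code class="language-plaintext highlighter-rouge">normalizeUnit</code> helper is hypothetical (it is not part of the original script) and the map is
just an excerpt of the one above:</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// excerpt of the mapping above, repeated to keep the sketch self-contained
def recognizedUnits = [
  "mg/L": "mg/L",
  "mg Ag/L": "mg/L",
  "µg/g dry weight sediment": "µg/g"
]

// hypothetical helper: returns the cleaned-up unit, or null if the variant is not recognized (yet)
def normalizeUnit(String raw, Map units) {
  return units[raw?.trim()]
}

assert normalizeUnit("mg Ag/L", recognizedUnits) == "mg/L"
assert normalizeUnit(" µg/g dry weight sediment ", recognizedUnits) == "µg/g"
assert normalizeUnit("ppm", recognizedUnits) == null
</code></pre></div></div>

<p>Returning <code class="language-plaintext highlighter-rouge">null</code> for unknown variants makes it easy to list what still needs a decision.</p>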

<p>Oh, and don’t get me started on the actual values, with endpoint values given as single numbers, ranges, values with errors, etc. That variety is
not the problem, but the lack of FAIR-ness makes the whole thing really hard to process. I now have something like:</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prop</span> <span class="o">=</span> <span class="n">prop</span><span class="o">.</span><span class="na">replace</span><span class="o">(</span><span class="s2">","</span><span class="o">,</span> <span class="s2">"."</span><span class="o">)</span>
<span class="k">if</span> <span class="o">(</span><span class="n">prop</span><span class="o">.</span><span class="na">substring</span><span class="o">(</span><span class="mi">1</span><span class="o">).</span><span class="na">contains</span><span class="o">(</span><span class="s2">"-"</span><span class="o">))</span> <span class="o">{</span>
  <span class="n">rdf</span><span class="o">.</span><span class="na">addTypedDataProperty</span><span class="o">(</span>
    <span class="n">store</span><span class="o">,</span> <span class="n">endpointIRI</span><span class="o">,</span> <span class="s2">"${oboNS}STATO_0000035"</span><span class="o">,</span>
    <span class="n">prop</span><span class="o">,</span> <span class="s2">"${xsdNS}string"</span>
  <span class="o">)</span>
  <span class="n">rdf</span><span class="o">.</span><span class="na">addDataProperty</span><span class="o">(</span>
    <span class="n">store</span><span class="o">,</span> <span class="n">endpointIRI</span><span class="o">,</span> <span class="s2">"${ssoNS}has-unit"</span><span class="o">,</span> <span class="n">units</span>
  <span class="o">)</span>
<span class="o">}</span> <span class="k">else</span> <span class="k">if</span> <span class="o">(</span><span class="n">prop</span><span class="o">.</span><span class="na">contains</span><span class="o">(</span><span class="s2">"±"</span><span class="o">))</span> <span class="o">{</span>
  <span class="n">rdf</span><span class="o">.</span><span class="na">addTypedDataProperty</span><span class="o">(</span>
    <span class="n">store</span><span class="o">,</span> <span class="n">endpointIRI</span><span class="o">,</span> <span class="s2">"${oboNS}STATO_0000035"</span><span class="o">,</span>
    <span class="n">prop</span><span class="o">,</span> <span class="s2">"${xsdNS}string"</span>
  <span class="o">)</span>
  <span class="n">rdf</span><span class="o">.</span><span class="na">addDataProperty</span><span class="o">(</span>
    <span class="n">store</span><span class="o">,</span> <span class="n">endpointIRI</span><span class="o">,</span> <span class="s2">"${ssoNS}has-unit"</span><span class="o">,</span> <span class="n">units</span>
  <span class="o">)</span>
<span class="o">}</span> <span class="k">else</span> <span class="k">if</span> <span class="o">(</span><span class="n">prop</span><span class="o">.</span><span class="na">contains</span><span class="o">(</span><span class="s2">"&lt;"</span><span class="o">))</span> <span class="o">{</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
  <span class="n">rdf</span><span class="o">.</span><span class="na">addTypedDataProperty</span><span class="o">(</span>
    <span class="n">store</span><span class="o">,</span> <span class="n">endpointIRI</span><span class="o">,</span> <span class="s2">"${ssoNS}has-value"</span><span class="o">,</span> <span class="n">prop</span><span class="o">,</span>
    <span class="s2">"${xsdNS}double"</span>
  <span class="o">)</span>
  <span class="n">rdf</span><span class="o">.</span><span class="na">addDataProperty</span><span class="o">(</span>
    <span class="n">store</span><span class="o">,</span> <span class="n">endpointIRI</span><span class="o">,</span> <span class="s2">"${ssoNS}has-unit"</span><span class="o">,</span> <span class="n">units</span>
  <span class="o">)</span>
<span class="o">}</span>
</code></pre></div></div>
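
<p>The ranges and values with errors are stored as plain strings above. As a rough sketch of a possible next step, and not
something the script does, they could be split into numbers. The <code class="language-plaintext highlighter-rouge">parseValue</code> helper and the keys of the
returned map are hypothetical, and it assumes ranges are written with a hyphen (so a value in scientific notation like
<code class="language-plaintext highlighter-rouge">1e-5</code> would be misread as a range):</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// hypothetical helper: splits a value string into its numeric parts
def parseValue(String prop) {
  prop = prop.replace(",", ".").trim()
  if (prop.substring(1).contains("-")) {
    // start searching at index 1, so a leading minus sign is not taken for a range
    def idx = prop.indexOf("-", 1)
    return [
      type: "range",
      min: prop.substring(0, idx).trim().toDouble(),
      max: prop.substring(idx + 1).trim().toDouble()
    ]
  } else if (prop.contains("±")) {
    def parts = prop.split("±")
    return [
      type: "error",
      value: parts[0].trim().toDouble(),
      error: parts[1].trim().toDouble()
    ]
  } else if (prop.contains("&lt;")) {
    return [type: "upperBound", value: prop.replace("&lt;", "").trim().toDouble()]
  }
  return [type: "value", value: prop.toDouble()]
}

assert parseValue("1,5-3,0").min == 1.5d
assert parseValue("1,5-3,0").max == 3.0d
assert parseValue("12 ± 3").error == 3.0d
assert parseValue("&lt;0.1").type == "upperBound"
assert parseValue("-2.5").value == -2.5d
</code></pre></div></div>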

<p>But let me make clear: I can actually do this, add more data to the eNanoMapper database (with
<a href="https://tools.wmflabs.org/scholia/github/vedina">Nina</a>), only because the developers of this database made their data
available under an Open license (CC-BY, to be precise), allowing me to reuse, modify (change format), and redistribute
it. Thanks to the authors. Data curation is expensive, whether I do it or the authors of the database do. They
already did a lot of data curation. But only because of Open licenses, <strong>we only have to do this once</strong>.</p>]]></content><author><name>Egon Willighagen</name></author><category term="curation" /><category term="toxicology" /><category term="nanosafety" /><category term="cito:usesDataFrom:10.3762/bjnano.6.183" /><category term="doi:10.3762/BJNANO.6.165" /><category term="doi:10.1186/S13326-015-0005-5" /><summary type="html"><![CDATA[Slice of the spreadsheet in the supplementary info.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/Screenshot_20180916_170915.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/Screenshot_20180916_170915.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Two Apache Jena SPARQL query performance observations</title><link href="https://chem-bla-ics.linkedchemistry.info/2016/07/02/two-apache-jena-sparql-query.html" rel="alternate" type="text/html" title="Two Apache Jena SPARQL query performance observations" /><published>2016-07-02T00:00:00+00:00</published><updated>2016-07-02T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2016/07/02/two-apache-jena-sparql-query</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2016/07/02/two-apache-jena-sparql-query.html"><![CDATA[<p><span style="width: 50%; display: block; margin-left: auto; margin-right: auto; float: right">
<img src="/assets/images/jenaSlow.png" />
</span></p>

<p>Doing searches in RDF stores is commonly done with SPARQL queries. I have been using this with <a href="http://chem-bla-ics.blogspot.nl/2016/06/new-paper-using-semantic-web-for-rapid.html">the semantic web translation of WikiPathways</a>
by <a href="https://twitter.com/andrawaag">Andra</a> to find common content issues, though sometimes combined with some additional Java code.
For example, to find <a href="http://www.ncbi.nlm.nih.gov/pubmed">PubMed</a> identifiers that are not numbers.</p>
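
<p>A minimal sketch of what such additional code can look like, here in Groovy rather than Java. This is not the actual test code,
and the identifiers are made up for illustration; in practice they would come from the results of a SPARQL query:</p>

<div class="language-groovy highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// hypothetical identifiers, standing in for SPARQL query results
def pubmedIds = ["12345678", "PMID:234567", " 9876543", "10.1371/journal.pcbi.1004989"]

// a PubMed identifier should consist of digits only; ==~ must match the whole string
def notNumbers = pubmedIds.findAll { !(it ==~ /\d+/) }

notNumbers.each { println "Not a number: '${it}'" }
</code></pre></div></div>

<p>Matching the whole string also flags identifiers with stray whitespace or a prefix, which are typical data entry issues.</p>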

<p>Based on <a href="http://orcid.org/0000-0003-3477-7443">Ryan</a>’s work on interactions, a more complex curation query I
recently wrote in reply to issues that <a href="https://twitter.com/xanderpico">Alex</a> ran into with converting pathways to
BioPAX, is to find interactions that convert a gene to another gene. Such interactions occurred in <a href="http://wikipathways.org/">WikiPathways</a>
because graphically you do not see the difference. I originally had this query:</p>

<div class="language-sparql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span><span class="w"> </span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="nv">?organismName</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nv">?organism</span><span class="p">)</span><span class="w"> </span><span class="nv">?page</span><span class="w">
       </span><span class="nv">?gene1</span><span class="w"> </span><span class="nv">?gene2</span><span class="w"> </span><span class="nv">?interaction</span><span class="w">
</span><span class="k">WHERE</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nv">?gene1</span><span class="w"> </span><span class="k">a</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">GeneProduct</span><span class="w"> </span><span class="p">.</span><span class="w">
  </span><span class="nv">?gene2</span><span class="w"> </span><span class="k">a</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">GeneProduct</span><span class="w"> </span><span class="p">.</span><span class="w">
  </span><span class="nv">?interaction</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">source</span><span class="w"> </span><span class="nv">?gene1</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">wp</span><span class="o">:</span><span class="ss">target</span><span class="w"> </span><span class="nv">?gene2</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="k">a</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">Conversion</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">dcterms</span><span class="o">:</span><span class="ss">isPartOf</span><span class="w"> </span><span class="nv">?pathway</span><span class="w"> </span><span class="p">.</span><span class="w">
  </span><span class="nv">?pathway</span><span class="w"> </span><span class="nn">foaf</span><span class="o">:</span><span class="ss">page</span><span class="w"> </span><span class="nv">?page</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">wp</span><span class="o">:</span><span class="ss">organismName</span><span class="w"> </span><span class="nv">?organismName</span><span class="w"> </span><span class="p">.</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">ASC</span><span class="p">(</span><span class="nv">?organism</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>This query properly found all gene-gene conversions to be fixed. However, it was also horribly slow with my
<a href="http://junit.org/">JUnit</a>/<a href="https://jena.apache.org/">Apache Jena</a> setup. The same query runs very efficiently on <a href="http://sparql.wikipathways.org/">the Virtuoso-based SPARQL end point</a>.
I had been trying to speed it up in the past, but without much success. Instead, I ended up batching the
testing on our Jenkins instance. But this got a bit silly, with at some point subsets of less than 100 pathways.</p>

<h2 id="observation-1">Observation #1</h2>

<p>So, I <a href="https://twitter.com/egonwillighagen/status/748817658758344704">turned to twitter</a>, and quite soon got
<a href="https://twitter.com/xbib/status/748818534457716736">three</a> <a href="https://twitter.com/jervenbolleman/status/748820145028550656">useful</a>
<a href="https://twitter.com/soilandreyes/status/748891148182257664">leads</a>. The first two suggestions did not speed things up, but did help me rule out possible causes.
Of course, there is literature about optimizing, like this recent paper by Antonis (doi:<a href="http://doi.org/10.1016/j.websem.2014.11.003">10.1016/j.websem.2014.11.003</a>),
but I haven’t been able to convert this knowledge into practical steps either. After ruling out these options (though I kept the
<code class="language-plaintext highlighter-rouge">sameTerm()</code> suggestion), I realized it had to be the first two triples with the variables <code class="language-plaintext highlighter-rouge">?gene1</code> and <code class="language-plaintext highlighter-rouge">?gene2</code>. So,
<a href="https://github.com/BiGCAT-UM/WikiPathwaysCurator/commit/b8283419b252bd8525631d5035d086a15d0773e0">I tried using <em>FILTER</em> there too</a>,
resulting in this query:</p>

<div class="language-sparql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WHERE</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nv">?interaction</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">source</span><span class="w"> </span><span class="nv">?gene1</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">wp</span><span class="o">:</span><span class="ss">target</span><span class="w"> </span><span class="nv">?gene2</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="k">a</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">Conversion</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">dcterms</span><span class="o">:</span><span class="ss">isPartOf</span><span class="w"> </span><span class="nv">?pathway</span><span class="w"> </span><span class="p">.</span><span class="w">
  </span><span class="nv">?pathway</span><span class="w"> </span><span class="nn">foaf</span><span class="o">:</span><span class="ss">page</span><span class="w"> </span><span class="nv">?page</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">wp</span><span class="o">:</span><span class="ss">organismName</span><span class="w"> </span><span class="nv">?organismName</span><span class="w"> </span><span class="p">.</span><span class="w">
  </span><span class="k">FILTER</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="nb">sameTerm</span><span class="p">(</span><span class="nv">?gene1</span><span class="p">,</span><span class="w"> </span><span class="nv">?gene2</span><span class="p">))</span><span class="w">
  </span><span class="k">FILTER</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="p">{</span><span class="nv">?gene1</span><span class="w"> </span><span class="k">a</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">GeneProduct</span><span class="p">}</span><span class="w">
  </span><span class="k">FILTER</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="p">{</span><span class="nv">?gene2</span><span class="w"> </span><span class="k">a</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">GeneProduct</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">ASC</span><span class="p">(</span><span class="nv">?organism</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>That did it! The time to run the query halved. Not so surprising, in retrospect, but it all depends on the SPARQL engine
and which parts of the query it runs first. Apparently, Jena’s SPARQL engine starts at the top. This seems to be confirmed by
<a href="https://twitter.com/soilandreyes/status/748891148182257664">the third comment I got</a>. However, I always understood
that an engine can also start at the bottom.</p>

<h2 id="observation-2">Observation #2</h2>

<p>But that’s not all. This speed-up made me wonder about something else. The problem clearly seems to be how the engine decides
which parts of the query to run first. So, what if I remove further choices in what to run first? That leads me to
<a href="https://twitter.com/egonwillighagen/status/748844395701506048">a second observation</a>. It helps significantly if you
reduce the number of subgraphs the engine later has to “merge”. Instead, if possible, use
<a href="https://www.w3.org/TR/sparql11-query/#propertypaths">property paths</a>. That, again, roughly halved the runtime of the query.
I ended up with the query below, which, obviously, no longer gives me access to the pathway resources, but I can live
with that:</p>

<div class="language-sparql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WHERE</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nv">?interaction</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">source</span><span class="w"> </span><span class="nv">?gene1</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">wp</span><span class="o">:</span><span class="ss">target</span><span class="w"> </span><span class="nv">?gene2</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="k">a</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">Conversion</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">dcterms</span><span class="o">:</span><span class="ss">isPartOf</span><span class="o">/</span><span class="nn">foaf</span><span class="o">:</span><span class="ss">page</span><span class="w"> </span><span class="nv">?pathway</span><span class="w"> </span><span class="p">;</span><span class="w">
    </span><span class="nn">dcterms</span><span class="o">:</span><span class="ss">isPartOf</span><span class="o">/</span><span class="nn">wp</span><span class="o">:</span><span class="ss">organismName</span><span class="w"> </span><span class="nv">?organismName</span><span class="w"> </span><span class="p">.</span><span class="w">
  </span><span class="k">FILTER</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="nb">sameTerm</span><span class="p">(</span><span class="nv">?gene1</span><span class="p">,</span><span class="w"> </span><span class="nv">?gene2</span><span class="p">))</span><span class="w">
  </span><span class="k">FILTER</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="p">{</span><span class="nv">?gene1</span><span class="w"> </span><span class="k">a</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">GeneProduct</span><span class="p">}</span><span class="w">
  </span><span class="k">FILTER</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="p">{</span><span class="nv">?gene2</span><span class="w"> </span><span class="k">a</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">GeneProduct</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">ASC</span><span class="p">(</span><span class="nv">?organism</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>I’m hoping these two observations may help other with using Apache Jena with unit and integrated testing of RDF generation too.</p>]]></content><author><name>Egon Willighagen</name></author><category term="curation" /><category term="wikipathways" /><category term="sparql" /><category term="rdf" /><category term="justdoi:10.1016/j.websem.2014.11.003" /><summary type="html"><![CDATA[]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/jenaSlow.png" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/jenaSlow.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">New Paper: “Using the Semantic Web for Rapid Integration of WikiPathways with Other Biological Online Data Resources”</title><link href="https://chem-bla-ics.linkedchemistry.info/2016/06/25/new-paper-using-semantic-web-for-rapid.html" rel="alternate" type="text/html" title="New Paper: “Using the Semantic Web for Rapid Integration of WikiPathways with Other Biological Online Data Resources”" /><published>2016-06-25T00:00:00+00:00</published><updated>2016-06-25T00:00:00+00:00</updated><id>https://chem-bla-ics.linkedchemistry.info/2016/06/25/new-paper-using-semantic-web-for-rapid</id><content type="html" xml:base="https://chem-bla-ics.linkedchemistry.info/2016/06/25/new-paper-using-semantic-web-for-rapid.html"><![CDATA[<p><a href="http://micelio.be/">Andra Waagmeester</a> published a paper on his work on a semantic web version of the <a href="https://wikipathways.org/">WikiPathways</a>
(doi:<a href="https://doi.org/10.1371/journal.pcbi.1004989">10.1371/journal.pcbi.1004989</a>). The paper outlines the design decisions, shows
<a href="https://sparql.wikipathways.org/">the SPARQL endpoint</a>, and several example SPARQL queries. These include federated queries, like a mashup
with <a href="https://www.disgenet.org/">DisGeNET</a> (doi:<a href="https://doi.org/10.1093/database/bav028">10.1093/database/bav028</a>) and EMBL-EBI’s
<a href="https://www.ebi.ac.uk/gxa/home">Expression Atlas</a>. That results in nice visualisations like this:</p>

<p><img src="/assets/images/journal.pcbi.1004989.g002.PNG" alt="" /></p>

<p>If you have the relevant information in the pathway, these pathways can help a lot in understanding what is going on biologically.
And, of course, they are used for exactly that a lot.</p>

<h2 id="press-release">Press release</h2>

<p>Because press releases have become an interesting tool in knowledge dissemination, I wanted to learn what it takes to get one out. This
involved the people at <a href="http://journals.plos.org/ploscompbiol/">PLOS Computational Biology</a> and the press offices of the Gladstone Institutes
and our Maastricht University (<a href="https://gladstone.org/about-us/news/easy-integration-biological-knowledge-improves-understanding-diseases">press release 1</a>,
<a href="https://www.maastrichtuniversity.nl/news/easy-integrating-biological-knowledge-improves-understanding-diseases">press release 2 EN</a>/<a href="https://www.maastrichtuniversity.nl/nl/nieuws/eenvoudigere-integratie-van-biologische-kennis-verbetert-begrip-van-ziekten">NL</a>).
There is already one thing I learned in retrospect, and I am pissed with myself that I did not think of this: you should always have a
graphic supporting your story. I have been doing this for a long time in my blog now (sometimes I still forget), but did not think of
that in the press release. The press release was picked up by three outlets, though all basically as we presented it to them (thanks to
<a href="http://altmetric.com/">Altmetric.com</a>):</p>

<p><img src="/assets/images/pressReleaseUptake.png" alt="" /></p>

<h2 id="sparql">SPARQL</h2>

<p>But what makes me appreciate this piece of work, and WikiPathways itself, is how it creates a central hub of biological knowledge.
Pathway databases capture knowledge not easily embedded in generally structured (relational) databases. As such, expressing this
in the RDF format seems simple enough. The thing I really love about this approach is that your queries become machine readable
stories, particularly when you start using human readable variants of SPARQL for this. And you can
<a href="http://chem-bla-ics.blogspot.nl/2009/08/bioclipse-and-sparql-end-points-2.html">share these queries with the online scientific community with, for example, myExperiment</a>.</p>

<p>There are two ways I have used SPARQL on WikiPathways data for metabolomics: 1. curation; 2. statistics. Data analysis
is harder, because in the RDF world scientific lenses are needed to accommodate the chemical structural-temporal
complexity of metabolites. For curation, we have long used SPARQL for unit tests to support the curation of WikiPathways.
Moreover, I have manually used the SPARQL endpoint to find curation tasks. But now that the paper is out, I can blog about
this more. For now, <a href="http://www.wikipathways.org/index.php/Help:WikiPathways_Sparql_queries">many example SPARQL queries can be found in the WikiPathways wiki</a>.
It features several queries showing statistics, but also some for curation. This is an example query I use to improve the
interoperability of WikiPathways with <a href="https://wikidata.org/">Wikidata</a> (also for <a href="https://bridgedb.org/">BridgeDb</a>):</p>

<div class="language-sparql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span><span class="w"> </span><span class="k">DISTINCT</span><span class="w"> </span><span class="nv">?metabolite</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nv">?metabolite</span><span class="w"> </span><span class="k">a</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">Metabolite</span><span class="w"> </span><span class="p">.</span><span class="w">
  </span><span class="k">OPTIONAL</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nv">?metabolite</span><span class="w"> </span><span class="nn">wp</span><span class="o">:</span><span class="ss">bdbWikidata</span><span class="w"> </span><span class="nv">?wikidata</span><span class="w"> </span><span class="p">.</span><span class="w"> </span><span class="p">}</span><span class="w">
  </span><span class="k">FILTER</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="nb">BOUND</span><span class="p">(</span><span class="nv">?wikidata</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Feel free to give this query a go at <a href="https://sparql.wikipathways.org/">sparql.wikipathways.org</a>!</p>
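
<p>The same pattern can be turned into a statistic. The following is only a sketch and not one of the queries from the
WikiPathways wiki: it counts the metabolites without a Wikidata mapping instead of listing them. It assumes that the endpoint
supports SPARQL 1.1 (for <code>COUNT</code> and <code>FILTER NOT EXISTS</code>) and that the <code>wp:</code> prefix stands for
the namespace given in the <code>PREFIX</code> line. The variable name <code>?missing</code> is made up for this example.</p>

<div class="language-sparql highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumption: wp: is the WikiPathways vocabulary namespace
PREFIX wp: &lt;http://vocabularies.wikipathways.org/wp#&gt;

# Count the metabolites that have no Wikidata mapping
SELECT (COUNT(DISTINCT ?metabolite) AS ?missing) WHERE {
  ?metabolite a wp:Metabolite .
  # Same effect as OPTIONAL plus FILTER (!BOUND(?wikidata)) in the query above
  FILTER NOT EXISTS { ?metabolite wp:bdbWikidata ?wikidata . }
}
</code></pre></div></div>

<p>Running this before and after a round of curation would give a simple number to track progress with.</p>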

<h2 id="triptych">Triptych</h2>

<p>This paper completes a nice triptych of three papers about WikiPathways in the past 6 months. Thanks to
the whole community and <a href="http://www.wikipathways.org/index.php/Special:ContributionScores">the very many contributors</a>!
All three papers are linked below.</p>]]></content><author><name>Egon Willighagen</name></author><category term="wikipathways" /><category term="curation" /><category term="sparql" /><category term="rdf" /><category term="justdoi:10.1093/database/bav028" /><category term="doi:10.1371/JOURNAL.PCBI.1004989" /><category term="wikidata" /><category term="doi:10.1093/NAR/GKV1024" /><category term="justdoi:10.1371/journal.pcbi.1004941" /><summary type="html"><![CDATA[Andra Waagmeester published a paper on his work on a semantic web version of the WikiPathways (doi:10.1371/journal.pcbi.1004989). The paper outlines the design decisions, shows the SPARQL endpoint, and several examples SPARQL queries. These include federates queries, like a mashup with DisGeNET (doi:10.1093/database/bav028) and EMBL-EBI’s Expression Atlas. That results in nice visualisations like this:]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://chem-bla-ics.linkedchemistry.info/assets/images/journal.pcbi.1004989.g002.PNG" /><media:content medium="image" url="https://chem-bla-ics.linkedchemistry.info/assets/images/journal.pcbi.1004989.g002.PNG" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>