chem-bla-ics

Open Science Retreat #2: CiTO Nanopublications

2024-04-02T00:00:00+00:00

During the Open Science Retreat I organized a short session where we looked into typing citation intentions using a new nanopublication template. First, let’s describe nanopublications (originally used in doi:10.3233/ISU-2010-0613) a bit. Scholia gives a nice overview of (macro?)publications on the topic. The nanopub.net website describes that “a nanopublication is a small knowledge graph snippet with metadata that is treated as an independent (scientific) publication.” The knowledge graph, it continues, can be anything from an opinion to the link between a disease and a gene (doi:10.1109/ESCIENCE.2018.00024).

Now, in this post I will document an update of how we can use nanopublications for citation intention annotation, and compare this to existing solutions. I have been collecting and indexing the CiTO intention annotations in Wikidata and visualizing the corpus with Scholia at scholia.toolforge.org/cito/. There are currently 22 journal articles with explicit CiTO annotation, largely thanks to a Journal of Cheminformatics pilot (e.g. see doi:10.1186/s13321-023-00683-2). Recently, the preprint/report server BioHackrXiv started CiTO support too, also visible in the statistics on Scholia with another 17 papers. A third source is data sets from bibliometric-like studies, as explained in this post. Nanopublications would be a fourth solution.

So, why another solution? Like the datasets (assuming DataCite approaches), nanopublications have clear provenance, but the overhead of, and time needed for, creating a dataset with citation intent annotations can be limiting. And because nanopublications can be linked to ORCID identifiers, we can even discover which citation intent annotations are created by the original authors of articles. Another advantage is that nanopubs are basically RDF and we can query them easily, allowing the citation intentions to migrate to Wikidata. Scholia already saw an update to recognize nanopublications as a unique kind of reference (see the new Wikidata property Nanopublication identifier (P12545)).

NanoDash template

So, if we can make it easy for people to define nanopublications with CiTO citation intent annotations, then we can start formalizing intent annotations from a much wider range of use cases. For example, we can annotate historically important discussions. Anyone can retrospectively annotate all their own articles, making them more FAIR. And if we use DOI links, then it no longer is limited to journal articles, but we can use it for software and data citations too. This is where a recent template comes in, created by Tobias Kuhn, one of the main nanopub developers:

This nanopublication template defines the minimal needs of the assertion, along with useful provenance and nanopub info. Basically, the assertion defines that one DOI is a ScholarlyWork and, using CiTO, defines that it cites one or more other works (with DOI). For each citation, one can select any of the known CiTO intent types, e.g. ‘extends’ or ‘uses method in’, as in this nanopublication created with this template:

SPARQL-ing CiTO annotations

Besides the template, Tobias also started a SPARQL query to which I added restrictions that the citing and cited resources need to have a DOI, giving us this query:

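prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix np: <http://www.nanopub.org/nschema#>
prefix npa: <http://purl.org/nanopub/admin/>
prefix npx: <http://purl.org/nanopub/x/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix dct: <http://purl.org/dc/terms/>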

select ?np ?label ?subj ?citationrel ?obj ?date where {
  graph npa:graph {
    ?np npa:hasValidSignatureForPublicKey ?pubkey .
    ?np dct:created ?date .
    ?np np:hasAssertion ?assertion .
    optional { ?np rdfs:label ?label . }
    filter not exists { ?npx npx:invalidates ?np ; npa:hasValidSignatureForPublicKey ?pubkey . }
    filter not exists { ?np npx:hasNanopubType npx:ExampleNanopub . }
  }
  graph ?assertion {
    ?subj ?citationrel ?obj .
    filter(regex(str(?citationrel), "^http://purl.org/spar/cito/.*$"))
    filter(regex(str(?subj), "doi.org/10"))
    filter(regex(str(?obj), "doi.org/10"))
  }
}
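To make the three FILTER clauses in the assertion graph concrete, here is a minimal Python sketch that applies the same regular expressions to a handful of triples. The sample triples and the function name `is_doi_citation` are made up for illustration; only the two patterns come from the query itself. SPARQL's regex() matches anywhere in the string, which is why `re.search` is used.

```python
import re

# The same patterns as in the SPARQL FILTER clauses.
CITO_PATTERN = re.compile(r"^http://purl.org/spar/cito/.*$")
DOI_PATTERN = re.compile(r"doi.org/10")


def is_doi_citation(subj, rel, obj):
    """True if the predicate is a CiTO property and both ends look like DOIs."""
    return bool(
        CITO_PATTERN.search(rel)
        and DOI_PATTERN.search(subj)
        and DOI_PATTERN.search(obj)
    )


# Hypothetical triples, as they could appear in an assertion graph.
triples = [
    ("https://doi.org/10.1234/example.a",
     "http://purl.org/spar/cito/extends",
     "https://doi.org/10.1234/example.b"),
    # Not a CiTO predicate: filtered out.
    ("https://doi.org/10.1234/example.a",
     "http://purl.org/dc/terms/creator",
     "https://orcid.org/0000-0000-0000-0000"),
    # Citing resource has no DOI: filtered out.
    ("https://example.org/not-a-doi",
     "http://purl.org/spar/cito/usesMethodIn",
     "https://doi.org/10.1234/example.b"),
]

kept = [t for t in triples if is_doi_citation(*t)]
for subj, rel, obj in kept:
    print(subj, rel, obj)
```

Only the first triple survives, which mirrors what the query returns for the ?subj, ?citationrel, and ?obj columns.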

This includes 6 citation intentions defined by 4 nanopublications added during the Open Science Retreat:

RAUjZE1JMu by me for a paper by Marija Purgar
RAXgI–5gc by Christian Meesters
RATZNhd3l_j by Taichi Oichi
RA6Q6wxSYy by Niklas Hohmann

From nanopublications to Wikidata

Now, this query also provides me with enough information to propagate the citation intent (a fact?) to Wikidata and cite the original nanopublication as reference. With a variation of the above SPARQL query, I can get the five most recent new nanopublications, convert them to QuickStatements, and then enjoy them in Wikidata. This is written up in this Bacting script.

The script needs to handle some situations. For example, it will not add items for DOIs not already in Wikidata. So, if either of the two DOIs is not known in Wikidata, then nothing gets added. If they both are, then it will add the citation intent. There are alternative solutions, but in practice that doesn’t matter: the QuickStatements are the same in all situations, and QuickStatements will only add the new information.
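The decision logic described here can be sketched in a few lines of Python. This is not the Bacting script itself, just an illustration under assumptions: the DOI-to-item and intent-to-item lookups are plain dictionaries with placeholder QIDs, ‘cites work’ (P2860) is assumed as the main property, and P3712 is assumed as the qualifier that carries the intent. Only the Nanopublication identifier property (P12545) is taken from the text above.

```python
def to_quickstatements(rows, doi_to_qid, intent_to_qid,
                       cites_prop="P2860",     # assumption: 'cites work'
                       intent_prop="P3712",    # assumption: qualifier for the intent
                       nanopub_prop="P12545"):  # Nanopublication identifier
    """Turn query rows into tab-separated QuickStatements (V1 syntax) lines."""
    lines = []
    for row in rows:
        citing = doi_to_qid.get(row["subj"])
        cited = doi_to_qid.get(row["obj"])
        intent = intent_to_qid.get(row["citationrel"])
        # Skip the row unless both works (and the intent) are already known.
        if citing is None or cited is None or intent is None:
            continue
        # In QuickStatements a reference uses 'S' instead of 'P'.
        source = "S" + nanopub_prop[1:]
        lines.append("\t".join([
            citing, cites_prop, cited,
            intent_prop, intent,
            source, '"%s"' % row["np_id"],
        ]))
    return lines


# Placeholder identifiers, not real Wikidata items or nanopublications.
doi_to_qid = {
    "https://doi.org/10.1234/example.a": "Q_CITING",
    "https://doi.org/10.1234/example.b": "Q_CITED",
}
intent_to_qid = {"http://purl.org/spar/cito/extends": "Q_EXTENDS"}
rows = [
    {"subj": "https://doi.org/10.1234/example.a",
     "citationrel": "http://purl.org/spar/cito/extends",
     "obj": "https://doi.org/10.1234/example.b",
     "np_id": "RA_EXAMPLE"},
    # The cited DOI is unknown, so this row produces no statement.
    {"subj": "https://doi.org/10.1234/example.a",
     "citationrel": "http://purl.org/spar/cito/extends",
     "obj": "https://doi.org/10.1234/unknown",
     "np_id": "RA_EXAMPLE2"},
]

statements = to_quickstatements(rows, doi_to_qid, intent_to_qid)
print("\n".join(statements))
```

Passing the property identifiers as parameters keeps the sketch honest about which modelling choices are assumed rather than confirmed.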

This is what it will look like in Wikidata:

And this is what it looks like (yellow) when we compare the contributions from nanopublications now with the other sources:

New paper: “Wikidata subsetting: approaches, tools, and evaluation”

2024-02-13T00:00:00+00:00

Just before the end of the year, the Wikidata subsetting: approaches, tools, and evaluation paper by Seyed Amir Hosseini Beghaeiraveri et al. got published (doi:10.3233/SW-233491). I am really excited our group (i.e. Ammar and Denise) has been able to contribute to this. I think it also is a great example of the power of hackathons to bring together people.

To me, subsetting of Wikidata (or any large knowledge graph) is important for a couple of reasons. First, there can be practical reasons. Scholia, for example, is computationally expensive, and the idea we explore in the Alfred P. Sloan Foundation grant for Scholia (doi:10.3897/rio.5.e35820) was that a subset of Wikidata would make it more performant and potentially more environmentally friendly.

A second reason is more about the scientific process. When doing an analysis and when you want to make the reasoning transparent, you want to share the analyzed data as part of the research output (basically, the “data”). For example, the data may have undergone some curation, or you combined data from two or more different sources. And you will want to share this as part of the scientific process. Resharing a full dump of the larger knowledge base would not be practical for at least two reasons: duplication of huge data, and a lot of unrelated content makes it hard for peers to find the bits of interest to the study.

Subsetting may be useful here. This paper evaluates a number of different subsetting approaches. Myself, I am particularly excited about the idea that we can take a shape expression (e.g. ShEx) as input. I still love the idea that I take the SPARQL queries in my analyses, convert them into shapes automatically, and then get a subset that returns the exact same results as the query would on the full dataset.

American Chemical Society Fall 2023 meeting

2023-09-09T00:00:00+00:00

About four weeks ago the Fall 2023 American Chemical Society meeting (#ACSFall2023) took place. I have attended a few ACS meetings in person and even organized a symposium at the 2010 ACS meeting in Boston. This time, I did not participate in person, though visiting San Francisco again would have been nice. I gave two presentations (slides doi:10.5281/zenodo.8255394), but have not uploaded my slides of the first presentation to Zenodo yet.

The theme of the meeting was data, and this resulted in a wealth of presentations with cheminformatics. What is striking here is that a lot of work has not changed so much in 20 years, except for the scale. What I missed here was the large open data sets, but generally the level of open science was heartwarming! So many preprint mentions, GitHub repositories, and Zenodo deposits. The Blue Obelisk was truly ahead of its time, but it is a delight to see the field of chemistry catch up. I can now say a lot about peer review, and why the field is not benefitting from all the experience that exists in the field because people publish in the wrong journals, but that is for another time.

I attended multiple sessions, which is a bit of a challenge, doing this remotely from Central European Summer Time (CEST). Of course, the Sunday started with the Chemical informatics (R)evolution: Towards Democratization and Open Science session, where I had my first talk, and later that day the Enhance your Data - Smart Ways to Metadata and Knowledge Graphs session, where I gave a second talk, about Bioschemas’ ChemicalSubstance and MolecularEntity. Sadly, I had to leave that meeting early because it was getting too late.

There were so many interesting sessions, I could not attend everything. I also have to go back to all my notes and isolate things I want to follow up on, prominently open datasets.

More later.

Boiling points in Wikidata

2023-08-12T00:00:00+00:00

Some days ago, I started adding boiling points to Wikidata, referenced from Basic Laboratory and Industrial Chemicals (wikidata:Q22236188), David R. Lide’s ‘a CRC quick reference handbook’ from 1993 (well, the edition I have). But Wikidata wants pressure (wikidata:P2077) info at which the boiling point (wikidata:P2102) was measured. Rightfully so. But I had not added those yet, because it slows me down and can be automated with QuickStatements.

I just need a few SPARQL queries to list the statements to which the qualifiers need to be added. Basically, all boiling points which have the book as a reference and that do not have the pressure info. First, there are values with ‘unknown value’, which results in blank nodes (by the time you read this, they likely are already fixed):

SELECT ?cmp ?bp ?pressure WHERE {
  ?cmp p:P2102 ?bpStatement .
  ?bpStatement prov:wasDerivedFrom/pr:P248 wd:Q22236188 ;
    ps:P2102 ?bp .
  ?bpStatement pq:P2077 ?pressure .
  FILTER (contains(str(?pressure), "http://"))
}

So, to get the list of statements for which I want to write the QuickStatements, those which do not have any P2077 qualifier yet, I use this query:

SELECT ?cmp WHERE {
  ?cmp p:P2102 ?bpStatement .
  ?bpStatement prov:wasDerivedFrom/pr:P248 wd:Q22236188 ;
    ps:P2102 ?bp .
  MINUS { ?bpStatement pq:P2077 ?pressure }
}

At the time of writing, this lists 54 boiling points.

I can have the WDQS create CSV-styled QuickStatements with:

SELECT (SUBSTR(STR(?cmp),32) AS ?qid) ?P2102 ?qal2077 WHERE {
  ?cmp p:P2102 ?bpStatement .
  ?bpStatement prov:wasDerivedFrom/pr:P248 wd:Q22236188 ;
    ps:P2102 ?P2102 .
  MINUS { ?bpStatement pq:P2077 ?pressure }
  BIND ("101.325U21064807" AS ?qal2077)
}

Here, the SPARQL variables double as QuickStatements instructions. Finally, note the use of “U21064807”, which refers to the Wikidata item for kilopascal (wikidata:Q21064807).
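Two details of that last query are easy to check in plain Python: SUBSTR in SPARQL counts from 1, so position 32 is the first character after the 31-character entity prefix, and the pressure value is simply the amount with “U” plus the numeric part of the unit item glued on. The helper names below are made up for this sketch.

```python
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def sparql_substr(text, start):
    """Mimic SPARQL's SUBSTR(text, start), which is 1-based."""
    return text[start - 1:]


def quantity_with_unit(amount, unit_qid):
    """Format a quantity for QuickStatements: amount, 'U', unit item number."""
    return f"{amount}U{unit_qid[1:]}"


# The prefix is 31 characters long, so the Q-number starts at position 32.
print(len(ENTITY_PREFIX))

# Using the book item as an example entity IRI.
qid = sparql_substr(ENTITY_PREFIX + "Q22236188", 32)
print(qid)

# 101.325 kilopascal, with kilopascal being wikidata:Q21064807.
pressure = quantity_with_unit("101.325", "Q21064807")
print(pressure)
```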

I also need to “add” the boiling point again, to make sure QuickStatements knows which statement to add the qualifier to. I think this can be done better, but I am not sure how to target statements directly. This is not foolproof: I noted that this approach ignores the situation where there are two statements with the (exact) same boiling point, but different error margins. But that I will monitor and, where needed, correct manually.

History, provenance, detail

2023-08-08T00:00:00+00:00

Just a quick note: I just love the level of detail Wikidata allows us to use. One of the marvels is the practice of named as, which can be used in statements for subjects and objects. The notion and importance here is that things are referred to in different ways, and these properties allow us to link the interpretation with the source. For example, Max Born’s seminal work Zur Quantenmechanik (doi:10.1007/BF01328531) uses a very short notation to cite other literature, as footnotes, and DOIs did not exist yet.

So, in Wikidata, you can capture this like this:

Blog planets: blogging about Debian, GNOME, Wikimedia, FSFE, and many more

2023-08-04T00:00:00+00:00

I am still an avid user of RSS/Atom feeds. I use Feedly daily, partly because of their easy to use app. My blog is part of Planet RDF, a blog planet. Blog planets aggregate blogs from many people around a certain topic. It’s like a forum, but open, free, community driven. It’s exactly what the web should be.

It turned out that planets do still exist, so I started a small corner on Wikidata: Q121134938, and a number of existing blog planets:

The software used to run these planets is ancient, though. We need a new generation of software, replacing things like Planet. And I want something people can easily host on GitHub or GitLab Pages or the likes.

I created a minimal shape expression, but the Wikidata items for the planets still lack a lot of information that can be added. First, we can think of them as venues, perhaps, where people “publish” their work. Second, we can annotate the blog planets with ‘main subject’ for the topics they cover. Or we can list the people that are “author” on the planet; most planets are very transparent about which blogs they aggregate.

Love to see where this is going. Who knows? Maybe we will see Postgenomic (see doi:10.1186/1471-2105-8-487) and Chemical blogspace resurface :)

Finding Mastodon accounts with Wikidata (a few SPARQL queries)

2022-11-21T00:00:00+00:00

There are multiple initiatives to support the migration from Twitter to Mastodon (see also this blog post). But Wikidata, which has been tracking Mastodon accounts of things in its database, should not be forgotten here:

So, here are some Wikidata SPARQL queries to see the uptake:

Universities with Mastodon
All Mastodon accounts in Wikidata (or subset with also a Twitter account)
Nobel Prize winners with Mastodon
Academic journals with Mastodon
People with Mastodon that published in a PLOS journal (you can pick another publisher)
Find your co-authors with your ORCID (just replace my ORCID with yours)

If you find yourself missing, back in April I tweeted (sorry) how you can find yourself and others in Wikidata and how to add your or their Mastodon account.

Wikidata script for SMILES, SMARTS, and CXSMILES depiction

2022-11-12T00:00:00+00:00

In August I reported about 2D depiction of (CX)SMILES in Wikidata via linkouts (going back to 2017). Based on a script by Magnus Manske, I wrote a Wikidata gadget that uses the same CDK Depict (VHP4Safety mirror) to depict the 2D structure in Wikidata itself:

Note the depiction of the undefined (CIP) stereochemistry on two atoms. Thanks to Adriano and John for working that out.

More about CXSMILES in Wikidata in this Dagstuhl meeting results write up.

Biology, ACPs, lipids, cheminformatics, and Dagstuhl

2022-08-01T00:00:00+00:00

Already 3 months ago I visited Dagstuhl for the second time. The weather was much better than in the January right before the start of the pandemic. The first time, I attended the Computational Metabolomics meeting with the focus From Cheminformatics to Machine Learning, and one of the things we concerned ourselves with was how to do computation with compound classes (see Section 3.6 and this online book). We know how to handle SMILES and we know how to do substructure searching with SMARTS, but what if you have compound classes or lipid classes? Biology is a greasy business.

From a WikiPathways perspective there is additional complexity, with modified proteins involved in lipid metabolism, the acyl-carrier proteins. They look like this, and the R group is a protein:

We have quite a few of them in WikiPathways and they also show up in ChEBI (and likely Reactome), LIPID MAPS, and KEGG.

During this year’s Dagstuhl we used one session to continue working on it (report pending). Part of the results is that Wikidata (see doi:10.7554/eLife.52614 and doi:10.7554/eLife.70780) now has a property for CXSMILES. CDK 2.0 (doi:10.1186/s13321-017-0220-4) already supported CXSMILES and the above image is actually created with CDK Depict (thx to John!).

So, that means I can now start adding all those ACPs to Wikidata :) Here’s hexadecanoyl-[acp] (or this Scholia page):

Adding disclosures to Wikidata with Bioclipse

2016-03-20T00:00:00+00:00

Last week the huge, bi-annual ACS meeting took place (#ACSSanDiego), during which commonly new drug (leads) are disclosed. This time too, like this one tweeted by Bethany Halford:

Because getting this information out in the open is important, I think it’s a good idea to add them to Wikidata (see doi:10.3897/rio.1.e7573). So, with Bioclipse (doi:10.1186/1471-2105-8-59) I redrew the structure:

I previously blogged about how to add chemicals to Wikidata, but I realized that I wanted to also use Bioclipse to automate this process a bit. So, I wrote this script to generate the SMILES, InChI, and InChIKey, double check the compound is not already in Wikidata (using the Wikidata SPARQL endpoint), and look up the PubChem compound identifier (example SMILES).

smiles = "CCCC"

mol = cdk.fromSMILES(smiles)
ui.open(mol)

inchiObj = inchi.generate(mol)
inchiShort = inchiObj.value.substring(6)
key = inchiObj.key // key = "GDGXJFJBRMKYDL-FYWRMAATSA-N"

sparql = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?compound WHERE {
  ?compound wdt:P235 "$key" .
}
"""

if (bioclipse.isOnline()) {
  results = rdf.sparqlRemote(
    "https://query.wikidata.org/sparql", sparql
  )
  missing = results.rowCount == 0
} else {
  missing = true
}

formula = cdk.molecularFormula(mol)

// Create the Wikidata QuickStatement,
// see https://tools.wmflabs.org/wikidata-todo/quick_statements.php

item = "LAST" // set to Qxxxx if you need to append info,
              // e.g. item = "Q22579236"

pubchemLine = ""
if (bioclipse.isOnline()) {
  pcResults = pubchem.search(key)
  if (pcResults.size == 1) {
    cid = pcResults[0]
    pubchemLine = "$item\tP662\t\"$cid\""
  }
}

if (!missing) {
  println "===================="
  println "Already in Wikidata as " + results.get(1,"compound")
  println "===================="
} else {
  statement = """
    CREATE
    
    $item\tDen\t\"chemical compound\"
    $item\tP233\t\"$smiles\"
    $item\tP274\t\"$formula\"
    $item\tP234\t\"$inchiShort\"
    $item\tP235\t\"$key\"
    $pubchemLine
  """

  println "===================="
  println statement
  println "===================="
}

The output of this script is a QuickStatement for Magnus Manske’s tool, which is a list of commands to create and edit entries in Wikidata. IMPORTANT: it’s not meant to automate editing Wikidata! I only automate creating the input, which I carefully check (e.g. checking all stereochemistry is defined)! Note how Bioclipse opens up the structure in a viewer with ui.open(). You need to enable the tool first, but if you have an account, this is not too hard. Of course, the advantage is that it is a lot quicker. I have a similar script to create QuickStatements starting with only a ChEMBL identifier.

The QuickStatement for GDC-0853 looks like:

    CREATE
    
    LAST Den "chemical compound"
    LAST P233 "O=C1C(=CC(=CN1C)c2ccnc(c2CO)N4C(=O)c3cc5c(n3CC4)CC(C)(C)C5)Nc6ncc(cc6)N7CCN(C[C@@H]7C)C8COC8"
    LAST P274 "C37H44N8O4"
    LAST P234 "1S/C37H44N8O4/c1-23-18-42(27-21-49-22-27)9-10-43(23)26-5-6-33(39-17-26)40-30-13-25(19-41(4)35(30)47)28-7-8-38-34(29(28)20-46)45-12-11-44-31(36(45)48)14-24-15-37(2,3)16-32(24)44/h5-8,13-14,17,19,23,27,46H,9-12,15-16,18,20-22H2,1-4H3,(H,39,40)/t23-/m0/s1"
    LAST P235 "WNEODWDFDXWOLU-QHCPKHFHSA-N"
    LAST P662 "86567195"

The first line creates a new Wikidata item, while the next ones add information about this compound. GDC-0853 is now also Q23304817. The label I added manually afterwards. Note how the Bioclipse script found the PubChem identifier, using the InChIKey. I also use this approach to add compounds to Wikidata that we have in WikiPathways.
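For readers without Bioclipse, the string assembly part of the script can be sketched in plain Python. This is only an illustration: the function name `build_quickstatements` is made up, the identifiers are passed in rather than computed (no CDK, InChI, or PubChem calls are made), and, unlike the script above, the sketch only writes CREATE when the item is "LAST". The example values are those of butane, matching the example SMILES in the script.

```python
def build_quickstatements(smiles, formula, inchi, inchikey,
                          pubchem_cid=None, item="LAST"):
    """Assemble tab-separated QuickStatements for one chemical compound."""
    # Same as inchiObj.value.substring(6): drop the leading "InChI=".
    inchi_short = inchi[len("InChI="):]
    rows = [
        (item, "Den", "chemical compound"),  # English description
        (item, "P233", smiles),              # canonical SMILES
        (item, "P274", formula),             # chemical formula
        (item, "P234", inchi_short),         # InChI
        (item, "P235", inchikey),            # InChIKey
    ]
    if pubchem_cid is not None:
        rows.append((item, "P662", str(pubchem_cid)))  # PubChem CID
    # Only ask for a new item when we are not appending to an existing one.
    lines = ["CREATE"] if item == "LAST" else []
    lines += ['%s\t%s\t"%s"' % row for row in rows]
    return "\n".join(lines)


statement = build_quickstatements(
    smiles="CCCC",
    formula="C4H10",
    inchi="InChI=1S/C4H10/c1-3-4-2/h3-4H2,1-2H3",
    inchikey="IJDNQMDRQITEOD-UHFFFAOYSA-N",
    pubchem_cid=7843,
)
print(statement)
```

As with the Bioclipse script, the output is meant to be inspected by hand before it is pasted into QuickStatements.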