chem-bla-ics

GoatCounter, Rogue Scholar and more new things

2024-07-21T00:00:00+00:00

About a year ago I started migrating my blogger.com blog to a git-version-controlled, Markdown-based blogging platform. I have to say, it has been a happy year. It actually is awesome to port old blog posts (follow that here) and to see what I have been working on some 17, 18 years ago.

I do have a nasty bug to fix that causes the conversion of the Markdown to HTML to scale badly. The system is doing some indexing at the wrong time, and is probably redoing all indexing for each post. Kudos if you spot it.

But while still being on a Jekyll learning curve, some nice things have happened since I started. This blog started with InChIKeys, as demonstrated in this post, which adds this molecule page. On my wishlist is still a CMLRSS-based feed.

Newer are the things I have worked on since, including a few that readers of my blog may be interested in learning about. First, I started counting visitors again, but with the GDPR-compliant GoatCounter. I have been using my social network as advisory board, and knowing what people find interesting matters to me.

The second thing is listing in The Rogue Scholar. This is a new platform, like a blog planet, perhaps a bit like (the late) Chemical blogspace and (the late) Postgenomic.com, but so far without the extraction of journal articles (though it did start recognizing some references), chemicals, and conferences. Instead, they offer archiving by the Internet Archive, DOIs for your blog posts, ePub and PDF downloads, and JATS. They just passed the milestone of 100 participating blogs! Please do check it out, it’s an awesome service.

A final thing I want to mention here is that my blog now has an archive page, which sometimes can be useful.

Let’s see what I can say next year, when my blog celebrates its 20th birthday :)

cdk2024 #3: an unexpected downstream project

2024-06-16T00:00:00+00:00

In the CDK2024 grant we wrote about updating various software projects using the Chemistry Development Kit. We even wrote that “[r]equired API changes will be publicly shared and disseminated with the Groovy Cheminformatics with the Chemistry Development Kit book (egonw.github.io/cdkbook/)”. The Groovy Cheminformatics with the Chemistry Development Kit book is a project that has run since 2009.

commit c5cbf9b5dd49baf582afc595c9cbafc714c5199f
Author: Egon Willighagen 
Date:   Fri Apr 10 12:34:42 2009 +0200

    Initial copy of the current draft; converted into separate project for easier branching
    for tunes of the book for workshops and sorts

The original version was in LaTeX and sold online via Lulu.com. Because all code examples were run (the first public edition had 72 pages with 75 code examples), like RMarkdown or Jupyter Notebooks by design, I was able to make many releases. The big advantage of this was that when API changes happened, this would be visible by code not compiling or by output changing.

At some point I open sourced the book (doi:10.6084/M9.FIGSHARE.2057790.V1) and then realized that I can convert the book to Markdown:

commit 2630699aa280200188f2ae9ef3f0698964926752
Author: Egon Willighagen 
Date:   Mon Dec 24 16:59:14 2018 +0100

    Create chapter3.md

This is the version available at egonw.github.io/cdkbook/ for some time now. So, now that for SMARTCyp I need to update the visualization, I went back to my book of code examples (I have a collection of more than 200 examples), but then found that the chapter on Depiction was missing. I was not looking forward to this, because I know that the code examples predate a massive improvement by John Mayfield of the rendering stack and I never got around to seeing if the examples from the book work well enough with that new API (one is actually updated).

That is when I realized that the Groovy Cheminformatics book actually also is a downstream project that needs updating. I have been doing this already and it’s fairly smooth so that I did not think of including it in the grant, other than updating the Migration chapter. I now had enough time to dive into this project. I need that, because the goal of the project is also to learn about all the meta science aspects of project maintenance, roles, communication, etc. Therefore also this blog post: we need a track record, to collect data.

Anyway, porting the first script went fairly easily, but I am now running into a stacktrace:

Processing  RenderSelection.groovyin
doing RenderSelection.out ...
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
/home/egonw/var/Projects/hub/cdkbook-source/code/RenderSelection.groovy: 39: unable to resolve class ExternalHighlightGenerator
 @ line 39, column 16.
   generators.add(new ExternalHighlightGenerator());
                  ^
org.codehaus.groovy.syntax.SyntaxException: unable to resolve class ExternalHighlightGenerator
 @ line 39, column 16.

That brings us to the task of how to find where that class is coming from, which happens to be something I already had to write up for RingSearch. Dependency galore.
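One way to track down where a class lives is to scan the jar files on disk for a matching `.class` entry. The following is a minimal Python sketch of that idea, not the procedure written up for RingSearch: the function name `find_class_in_jars` is made up for this illustration, and it assumes the relevant jars sit somewhere below a single directory (for example the local Maven repository).

```python
import zipfile
from pathlib import Path


def find_class_in_jars(class_name, jar_dir):
    """Return (jar path, entry name) pairs for jars holding a class with this simple name.

    Hypothetical helper: it only looks at entry names inside the archives,
    it does not load or inspect the classes themselves.
    """
    suffix = "/" + class_name + ".class"
    hits = []
    for jar in sorted(Path(jar_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                for entry in archive.namelist():
                    # Match packaged classes and classes in the default package
                    if entry.endswith(suffix) or entry == class_name + ".class":
                        hits.append((str(jar), entry))
        except zipfile.BadZipFile:
            # Skip files that are not valid archives
            continue
    return hits
```

Calling `find_class_in_jars("ExternalHighlightGenerator", Path.home() / ".m2" / "repository")` would then list candidate jars, and the entry name gives the package to import. If nothing is found, the class may have been renamed or removed in the installed version.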

References

10.6084/M9.FIGSHARE.2057790.V1

Two meetings: ELIXIR Toxicology and FAIR4ChemNL

2024-06-10T00:00:00+00:00

Noting that in the coming week I am not attending the ELIXIR All Hands in Uppsala. Having lived in (and around) Uppsala for more than three years, I am disappointed, and even more so with the first stories from colleagues coming in. But it has been a way too busy year, I have much to finish up, and I need to take care of myself too. I am not 32 anymore.

But in the past two weeks I did attend two workshops. The first was a workshop by the ELIXIR Toxicology Community, which was held in Utrecht/NL. The programme was around FAIR and included two really nice hands-on sessions where we developed drafts for FAIR Cookbook recipes (see also doi:10.1038/s41597-023-02166-3) and for FAIR Implementation Profiles (doi:10.1007/978-3-030-65847-2_13). We will write up a BioHackrXiv report.

The second workshop was last week, the FAIR4ChemNL workshop, which was also held in Utrecht/NL. The topic was FAIR in chemistry, and we discussed various aspects. There was a significant participant group from the German NFDI4Cat project (“Cat” is short for (chemical) catalysis), which recently published a nice analysis of several ontologies (doi:10.1186/s13321-024-00807-2). And there was also a lot of mention of RDF and SPARQL.

I think it is time for a new special issue around semantic web technologies.

New paper: FAIR assessment of nanosafety data reusability with community standards

2024-06-10T00:00:00+00:00

Ammar is finishing up his PhD thesis with his research on the use of FAIR towards predictive toxicology. Or, “AI ready”, as the term FAIR is now sometimes explained. Any computational method needs good data, and just FAIR is not enough. It needs to meet community standards, as formalized in R1.3. To me, this includes meeting community standards like minimal reporting standards. Indeed, in the EU NanoSafety Cluster the notion that FAIR data also needs to be scientifically good data is well noted.

In this paper (doi:10.1038/s41597-024-03324-x), Ammar explores this notion and compiles more than 200 maturity indicators in the category R1.3 resulting from 12 different community standards. For example, this includes minimal reporting standards. There is overlap in needs, but they often also have a different focus. The conclusion here: different (re)use cases have different needs, and data not usable for one use case can be sufficiently FAIR for another. Of course, ideally, it would be FAIR enough for all use cases.

Ammar formalizes the maturity indicators and links the common maturity indicators to various use cases. That means that when you determine the indicator values for your data, people can immediately look up how this data can be reused. And, the generator of the data can immediately see how the data would need to be improved to widen the reusability. How FAIR can we get?

His proposal has already been further explored in two other papers, one around data sharing (doi:10.1038/s41596-024-00993-1, see also this blog post) and one around QSAR modelling (doi:10.1016/j.impact.2023.100475, see also this blog post).

The below screenshot shows what an analysis using this approach can look like:

New paper: A template wizard for the cocreation of machine-readable data-reporting to harmonize the evaluation of (nano)materials

2024-05-27T00:00:00+00:00

I was about to call this blog post From spreadsheets to RDF, after the post last week. But then I decided to just use the pattern I typically use. Why I wanted to use that shorter term in the first place was that one of the things I like about the AMBIT software (of OpenTox and eNanoMapper fame) is its RDF support (see doi:10.1186/1756-0500-4-487). But RDF, ontologies, those are hard things. And unlike mathematics, we do not have simple objects like integer numbers or simple operators. Well, I think we do, and we talk about them. But there is no obligatory education. Just like any biologist needs to know what 1 + 2 means, I think any biologist needs basic knowledge about how knowledge graphs work. But that sometimes feels like a taboo, like cursing in the life sciences church.

So, there we are. This is where spreadsheets come in. If done well, they combine aspects of knowledge graphs with usability and can even cover a good bit of the learnability. This is what is described in this new paper about templates in the EU NanoSafety Cluster: A template wizard for the cocreation of machine-readable data-reporting to harmonize the evaluation of (nano)materials (doi:10.1038/s41596-024-00993-1).

The learnability comes in with the spreadsheet templates (“this is how we did it”) and a “wizard” around it guides the user with the selection of a template but also can provide feedback on the template. The technical term for that is “validator”, but it can be thought of as a spelling checker. Computers are good at finding contradictions (the lack of a pattern), though less good at ranking the alternatives (which is the cause of hallucinations in AI approaches).

And to return to the RDF, software like AMBIT can read these templates, use the semantics linked to the template, and make the FAIR static spreadsheets (good for archiving on Zenodo!) available as FAIR interactive data (good for exploration and machine learning), and as RDF (good for data integration).

Congrats to Nina and the various EU NanoSafety Cluster projects!

New paper: From papers to RDF-based integration of physicochemical data and adverse outcome pathways for nanomaterials

2024-05-20T00:00:00+00:00

Making something FAIR is hard, particularly when you do more than making something findable. We’ve seen before that making something usefully findable requires deep indexing, and already that continues to be difficult, because we are not seeing it enough. So, when I thought to convert a paper led by Hoet’s lab in Leuven into machine-actionable RDF to make it FAIR, I gravely underestimated the amount of work. Jeaphianne et al. did an awesome job on this work (doi:10.1186/s13321-024-00833-0).

The idea was simple: write up which nanomaterial (type) activates which molecular initiating event. It would simply annotate each material with a unique identifier to link it to databases like eNanoMapper and NanoCommons and it would use unique identifiers for the Adverse Outcome Pathway (AOP) key events. As such, it would make a direct link in the growing linked open data cloud between the AOPs and the nanomaterial databases.

Unfortunately, it was quickly discovered that actually reusing this new dataset requires rich annotation (metadata!) of the materials and the materials from the source paper were not yet in material databases. And then the cumbersome work started, resulting in a very rich data model describing the key events, the materials, the assays used, and the original papers themselves:

But the work has not finished yet. The paper assigned ERM identifiers to all included materials, and now these need to be added to the new ERM Identifier Database under development.

cdk2024 #2: publishing grant proposals

2024-05-18T00:00:00+00:00

Publishing grant proposals is still not very common. The proposal published in Research Ideas and Outcomes (doi:10.3897/rio.10.e124884) for the NWO Open Science grant for the CDK is, however, not the first and hopefully not the last. Interestingly, it is already cited in (the German) Wikipedia. It is used there to support a statement about which tools use the Chemistry Development Kit.

References

10.3897/RIO.10.E124884

cdk2024 #1: NWO Open Science grant for the Chemistry Development Kit

2024-04-07T00:00:00+00:00

We recently got awarded our second NWO Open Science grant (OSF23.2.097), this time for the Chemistry Development Kit (CDK). “We” here is me and Alyanne de Haan, René van der Ploeg, and Marc Teunis from Hogeschool Utrecht. The proposal has been submitted for public dissemination in RIO Journal, like we did with the first NWO Open Science grant.

The project formally started on April 1 but we had our kick-off meeting in Maastricht on April 4-5. We were joined by Javier and on the second day by Marvin, and Ozan from our BiGCaT research group in Maastricht. During this hackathon, I gave a (repeat) presentation about the history of the CDK which also included the problem that software using the CDK does not always use the most recent version.

And that, upgrading tools using the CDK with the latest CDK version, is the main topic of this grant (work package 2, WP2). The full proposal has the focus list of tools, but most of it is also listed in the issue tracker we have set up as project management tool on GitHub.

Second, we actually hacked together on the first two tools: one from our focus list, and another that we were asked to have a look at too, SMARTCyp. The latest version uses RDKit (doi:10.1093/bioinformatics/btz037), but the original version uses the CDK (doi:10.1021/ml100016x).

We downloaded the source code of SMARTCyp 2.4.2 and started taking notes. Javier set up a Maven build environment and we updated a lot of code, and we now seem quite close to a version that can be tested by people that have integrated SMARTCyp in other tools. This is based on CDK 2.9 and if you ignore the 2D depiction glitch, it looks like it was a nice first choice:

On a final note, we plan to carefully record our steps, in an open notebook science approach, with the intention to extract general upgrade steps. For example, we will update the Migration section of the Groovy Cheminformatics with the Chemistry Development Kit book.

Open Science Retreat #2: CiTO Nanopublications

2024-04-02T00:00:00+00:00

During the Open Science Retreat I organized a short session where we looked into typing citation intentions using a new nanopublication template. First, let’s describe nanopublications (originally used in doi:10.3233/ISU-2010-0613) a bit. Scholia gives a nice overview of (macro?)publications on the topic. The nanopub.net website describes that “a nanopublication is a small knowledge graph snippet with metadata that is treated as an independent (scientific) publication.” The knowledge graph, it continues, can be anything from an opinion to the link between a disease and a gene (doi:10.1109/ESCIENCE.2018.00024).

Now, in this post I will document an update of how we can use nanopublications for citation intention annotation, and compare this to existing solutions. I have been collecting and indexing the CiTO intention annotations in Wikidata and visualizing the corpus with Scholia at scholia.toolforge.org/cito/. There are currently 22 journal articles with explicit CiTO annotation, largely thanks to a Journal of Cheminformatics pilot (e.g. see doi:10.1186/s13321-023-00683-2). Recently, the preprint/report server BioHackrXiv started CiTO support too, also visible in the statistics on Scholia with another 17 papers. A third source is data sets from bibliometric-like studies, as explained in this post. Nanopublications would be a fourth solution.

So, why another solution? Like datasets (assuming DataCite approaches), nanopublications have clear provenance, but the overhead of and time needed for creating a dataset with citation intent annotations can be limiting. And because nanopublications can be linked to ORCID identifiers, we can even discover which citation intent annotations are created by the original authors of articles. Another advantage is that nanopubs are basically RDF and we can query them easily, allowing the citation intentions to migrate to Wikidata. Scholia already saw an update to recognize nanopublications as a unique kind of reference (see the new Wikidata property Nanopublication identifier (P12545)).

NanoDash template

So, if we can make it easy for people to define nanopublications with CiTO citation intent annotations, then we can start formalizing intent annotations from a much wider range of use cases. For example, we can annotate historically important discussions. Anyone can retrospectively annotate all their own articles, making them more FAIR. And if we use DOI links, then it no longer is limited to journal articles, but we can use it for software and data citations too. This is where a recent template created by Tobias Kuhn, one of the main nanopub developers, comes in:

This nanopublication template defines the minimal needs of the assertion, along with useful provenance and nanopub info. Basically, the assertion defines that one DOI is a ScholarlyWork and, using CiTO, defines that it cites one or more article works (with DOI). For each citation, one can select any of the known CiTO intent types, e.g. ‘extends’ or ‘uses method in’, as in this nanopublication created with this template:
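As a rough illustration of the shape of such an assertion (not the actual output of the template), the triples can be sketched in Python as N-Triples lines. The CiTO namespace is the one used in the query further down. The FaBiO IRI for ScholarlyWork and the helper names `doi_iri` and `citation_assertion` are assumptions made for this sketch.

```python
CITO = "http://purl.org/spar/cito/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Assumption: the citing work is typed with FaBiO's ScholarlyWork class
SCHOLARLY_WORK = "http://purl.org/spar/fabio/ScholarlyWork"


def doi_iri(doi):
    """Turn a bare DOI into a resolvable IRI."""
    return "https://doi.org/" + doi


def citation_assertion(citing_doi, citations):
    """Return N-Triples lines; citations is a list of (CiTO intent, cited DOI) pairs."""
    subject = doi_iri(citing_doi)
    # First triple: the citing DOI is a scholarly work
    lines = [f"<{subject}> <{RDF_TYPE}> <{SCHOLARLY_WORK}> ."]
    # One triple per typed citation, with the CiTO intent as predicate
    for intent, cited_doi in citations:
        lines.append(f"<{subject}> <{CITO}{intent}> <{doi_iri(cited_doi)}> .")
    return lines
```

For example, `citation_assertion("10.1234/citing", [("extends", "10.1234/cited-a"), ("usesMethodIn", "10.1234/cited-b")])` gives three triples for a made-up citing DOI. A real nanopublication wraps such an assertion together with provenance and publication info graphs.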

SPARQL-ing CiTO annotations

Besides the template, Tobias also started a SPARQL query to which I added restrictions that the citing and cited resources need to have a DOI, giving us this query:

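prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix np: <http://www.nanopub.org/nschema#>
prefix npa: <http://purl.org/nanopub/admin/>
prefix npx: <http://purl.org/nanopub/x/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix dct: <http://purl.org/dc/terms/>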

select ?np ?label ?subj ?citationrel ?obj ?date where {
  graph npa:graph {
    ?np npa:hasValidSignatureForPublicKey ?pubkey .
    ?np dct:created ?date .
    ?np np:hasAssertion ?assertion .
    optional { ?np rdfs:label ?label . }
    filter not exists { ?npx npx:invalidates ?np ; npa:hasValidSignatureForPublicKey ?pubkey . }
    filter not exists { ?np npx:hasNanopubType npx:ExampleNanopub . }
  }
  graph ?assertion {
    ?subj ?citationrel ?obj .
    filter(regex(str(?citationrel), "^http://purl.org/spar/cito/.*$"))
    filter(regex(str(?subj), "doi.org/10"))
    filter(regex(str(?obj), "doi.org/10"))
  }
}

This includes 6 citation intentions defined by 4 nanopublications added during the Open Science Retreat:

RAUjZE1JMu by me for a paper by Marija Purgar
RAXgI–5gc by Christian Meesters
RATZNhd3l_j by Taichi Oichi
RA6Q6wxSYy by Niklas Hohmann
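The three filters in the query can also be mirrored on the client side, for instance when post-processing triples from another source. This is a small Python sketch under that assumption. The function names are hypothetical, and unlike the SPARQL regex it escapes the dot in `doi.org`.

```python
import re

CITO_PREFIX = "http://purl.org/spar/cito/"
# Same idea as the "doi.org/10" filter in the query, with the dot escaped
DOI_PATTERN = re.compile(r"doi\.org/10")


def is_doi_cito_citation(subj, rel, obj):
    """True if the predicate is a CiTO property and both ends are DOI IRIs."""
    return (
        rel.startswith(CITO_PREFIX)
        and DOI_PATTERN.search(subj) is not None
        and DOI_PATTERN.search(obj) is not None
    )


def intent_name(rel):
    """Return the local name of a CiTO property, e.g. 'extends'."""
    return rel[len(CITO_PREFIX):]
```

A triple with a `cito:extends` predicate between two DOI IRIs passes the check, while a citation to a plain web page, or a non-CiTO predicate, does not.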

From nanopublications to Wikidata

Now, this query also provides me with enough information to propagate the citation intent (a fact?) to Wikidata and cite the original nanopublication as reference. With a variation of the above SPARQL query, I can get the five most recent new nanopublications, convert them to QuickStatements, and then enjoy them in Wikidata. This is written up in this Bacting script.

The script needs to handle some situations. For example, it will not add items for DOIs not already in Wikidata. So, if neither of the two DOIs is known in Wikidata, then nothing gets added. If they both are, then it will add the citation intent. There are alternative solutions, but in practice that doesn’t matter: the QuickStatements output is the same in all situations, and QuickStatements will only add the new information.
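The decision logic can be sketched as follows. This is not the Bacting script itself, just a hedged Python illustration of the idea. It assumes a DOI-to-item lookup with uppercase DOI keys, the "cites work" property (P2860) with a P3712 qualifier for the intent, and the nanopublication identifier (P12545) as the reference. The function name and the item identifiers in the usage note are made up.

```python
def to_quickstatement(citing_doi, cited_doi, intent_qid, nanopub_id, doi_to_qid,
                      cites_prop="P2860", intent_prop="P3712", nanopub_prop="S12545"):
    """Return one tab-separated QuickStatements line, or None if a DOI is unknown.

    doi_to_qid is a hypothetical lookup from uppercase DOI to Wikidata item id.
    The property defaults are assumptions; the S prefix marks a reference.
    """
    citing = doi_to_qid.get(citing_doi.upper())
    cited = doi_to_qid.get(cited_doi.upper())
    if citing is None or cited is None:
        # Do not create new items: skip citations with a DOI not yet in Wikidata
        return None
    return "\t".join([
        citing, cites_prop, cited,        # statement: citing work cites cited work
        intent_prop, intent_qid,          # qualifier: the citation intent
        nanopub_prop, f'"{nanopub_id}"',  # reference: the nanopublication
    ])
```

With a lookup such as `{"10.1234/A": "Q1", "10.1234/B": "Q2"}`, a call for those two DOIs yields a single statement line, while a call involving any other DOI returns `None` and nothing is sent to QuickStatements.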

This is what it will look like in Wikidata:

And this is what it looks like (yellow) when we compare the contributions from nanopublications now with the other sources:

Open Science Retreat #1: impressions

2024-03-31T00:00:00+00:00

Last week I attended the Open Science Retreat (#osr24nl) in a quiet and relaxing region in North-Holland. The meeting was how I like all meetings to be (and I count myself lucky many of my meetings are like this): open, welcoming, constructive, diverse, and intellectually challenging. Not all scientific meetings are like this and it is easy to end up going to obligatory meetings where the discussions are of a different level. Therefore, great thanks to the organizers, but also to all participants, who showed not just a heart for open science (getting pretty common), but also a drive to advocate for open science. Finally, I like to thank the people that joined me in creating nanopublications for CiTO annotations (will blog about that later), and Sadik and Marija, with whom I worked on exploring using Wikibase for capturing knowledge about research waste in ecology (more about that later too).