In this paper (doi:10.1038/s41597-024-03324-x), Ammar explores this notion and compiles more than 200 maturity indicators in the R1.3 category, drawn from 12 different community standards, including, for example, minimal reporting standards. Their needs overlap, but they often also have a different focus. The conclusion: different (re)use cases have different needs, and data not usable for one use case can be sufficiently FAIR for another. Ideally, of course, it would be FAIR enough for all use cases.

Ammar formalizes the maturity indicators and links them to various use cases. That means that when you determine the indicator values for your data, people can immediately look up how the data can be reused. And the generator of the data can immediately see how the data would need to be improved to widen its reusability. How FAIR can we get?
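To make the idea concrete, here is a minimal sketch of such a lookup in Python. The indicator and use-case names are invented for illustration; the actual formalization is the one in Ammar's paper.

```python
# Hypothetical sketch: linking R1.3 maturity indicators to (re)use cases.
# Indicator and use-case names below are made up for illustration only.

# Which indicators each use case needs before the data is FAIR *enough* for it.
REQUIREMENTS = {
    "QSAR modelling": {"has_units", "has_chemical_identifier", "has_endpoint_protocol"},
    "meta-analysis": {"has_units", "has_sample_size"},
}

def reusable_for(indicator_values: set) -> list:
    """Return the use cases whose required indicators are all satisfied."""
    return [use_case for use_case, needed in REQUIREMENTS.items()
            if needed <= indicator_values]

def missing_for(indicator_values: set, use_case: str) -> set:
    """What the data generator still has to add to unlock a given use case."""
    return REQUIREMENTS[use_case] - indicator_values

data = {"has_units", "has_sample_size"}
print(reusable_for(data))                   # ['meta-analysis']
print(missing_for(data, "QSAR modelling"))  # the two indicators still missing
```

The two directions of the lookup match the two audiences in the paragraph above: reusers see what the data supports now, and generators see what to improve next.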

His proposal has already been further explored in two other papers, one around data sharing (doi:10.1038/s41596-024-00993-1, see also this blog post) and one around QSAR modelling (doi:10.1016/j.impact.2023.100475, see also this blog post).

The below screenshot shows what an analysis using this approach can look like:

The study has two sides to it: first, it looks into how far we are with QSAR in the field of nanosafety. We have limited data, but this paper brings together 34 data sets, and the model building explores many different possible factors. Now, as a scholar, I would really want to know which factors are really important. We have been studying this for some time, e.g. in the earlier RRegrs paper (doi:10.1186/S13321-015-0094-2). Basically, I think we still don’t really understand the relation between the data characteristics and the modelling options. When is data rich enough to move from classification to regression? How much experimental data do we need for the model to sufficiently capture a certain applicability domain?

Actually, I think the rise of deep learning approaches shows us a few things: more data actually does help. But also, with enough data, the representation becomes less important to the overall pattern. There are even hints that deep learning needs a certain level of noise. Has anyone studied that phenomenon yet?

Now, the reader of this paper will not be disappointed. The design is complex and there are many small hints about what worked and what did not. But this brings us to the other side of this story.

The second side of this paper is the question whether the level of FAIR-ness helps this QSAR modelling. Earlier, Ammar studied the R1.3 aspects of nanosafety research. The R1.3 guiding principle expects that (Meta)data meet domain-relevant community standards. Ammar’s research (preprint doi:10.26434/CHEMRXIV-2022-L8VK8-V2) shows we can link this to actual reuse, where QSAR is one of those use cases. In their July paper, they show how we can integrate the use of the community standards in a reproducible way to support nanosafety research.

The following screenshot from the article (Figure 2, CC-BY) shows the relation between R1.3 maturity indicators and QSAR variables:

I think Furxhi and Ammar may actually have introduced a new community standard: this is how nanoQSAR research should be done from now on. Irini and Ammar, thanks for this great collaboration!

The first component of the new statistics plugin is the `IMatrixImplementation`
extension point.
The bc_jama plugin provides a JAMA-based
extension for this, but other implementations are possible, and possibly useful.
The second component provided by the new statistics plugin is the MatrixResource, a BioResource for documents (e.g. files on the hard disk) that represent a matrix. However, Bioclipse can also create such matrices on the fly, and these do not necessarily have to be stored on disk, as is usual for BioResources. This makes it possible for other plugins to create matrices from other resources: for example, the CDK plugin can now have an action that converts an SDF file into a QSAR data matrix.
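The conversion step itself is simple in principle: per-molecule descriptor values become the rows of a rectangular matrix. Bioclipse's actual implementation is Java, so the following is only a language-agnostic sketch of the idea, with invented names:

```python
# Sketch (not Bioclipse's actual API) of building a QSAR data matrix:
# one row per molecule, one column per descriptor.
def build_matrix(descriptor_values):
    """Convert per-molecule {descriptor: value} dicts into a sorted list of
    column names plus a row-major rectangular matrix (NaN where missing)."""
    columns = sorted({name for mol in descriptor_values for name in mol})
    rows = [[mol.get(name, float("nan")) for name in columns]
            for mol in descriptor_values]
    return columns, rows

cols, rows = build_matrix([
    {"XLogP": 1.2, "TPSA": 20.2},
    {"XLogP": 0.7, "TPSA": 63.6},
])
print(cols)   # ['TPSA', 'XLogP']
print(rows)   # [[20.2, 1.2], [63.6, 0.7]]
```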

The MatrixResource can be edited using a plain text editor, or with a more visually attractive graphical editor based on the KTable SWT widget:

The next step is to work on column and row names, and replace those uninformative X’s. As you can see in the Properties View, I also need to tweak adding and removing advanced properties a bit. And then it is time to have the CDK plugin create a QSAR data matrix.

The article discusses CDK’s QSAR capabilities (the class designs and an overview of the provided descriptors), the 3D model builder (see also C. Hoppe, CDK News, 1(2):4-5) and the interface to the statistical software R (see also CDK News, vol. 2, issue 1). The article is part of a small special issue on Computational Applications in Medicinal Chemistry.

CDK’s QSAR package comes with one main requirement: **the outcome of QSAR descriptor calculations must be reproducible**.
*“Science must be reproducible”*; I’m sure someone once said this :) Therefore, each QSAR descriptor has a specification
pointing to a unique algorithm found in an ontology (see diagram below). This QSAR descriptor ontology is maintained by
the qsar.sf.net project, which is project-independent, and even welcomes proprietary programs to
discuss interoperability.

And calculated descriptors are explicitly linked to this specification again, though it is up to the user to do with this what he wants:

Note that code has evolved since this publication, so class, interface and method names may have changed a bit.
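The principle itself is easy to sketch: every calculated value carries a reference to the exact algorithm specification it came from. CDK does this in Java (and, as noted, its class names have evolved), so the following Python sketch only illustrates the idea; the class names and the ontology URL are invented for this example:

```python
# Illustrative sketch of the reproducibility principle behind CDK's QSAR
# package. These class names and the ontology URL are NOT CDK's actual API.
from dataclasses import dataclass

@dataclass(frozen=True)
class DescriptorSpecification:
    reference: str              # entry in the QSAR descriptor ontology
    implementation_id: str      # which implementation was used
    implementation_vendor: str

@dataclass(frozen=True)
class DescriptorValue:
    specification: DescriptorSpecification  # provenance travels with the value
    value: float

SPEC = DescriptorSpecification(
    reference="http://example.org/qsar-descriptors#xlogP",  # hypothetical
    implementation_id="XLogPDescriptor",
    implementation_vendor="The Chemistry Development Kit",
)

# Any calculated value stays explicitly linked to the algorithm that made it:
result = DescriptorValue(specification=SPEC, value=1.2)
print(result.specification.reference)
```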

- 2001: oxygen paths of length 3, doi:10.1021/ci000116e
- 2002: a molecular shape descriptor, doi:10.1021/ci000100o
- 2003: molecular signature, doi:10.1021/ci020345w
- 2004: 4D-fingerprint, doi:10.1021/ci049898s
- 2005: summed NMR shift difference, doi:10.1021/ci049643e

If you know additional new descriptors, or feel like discussing one or more of the above, please leave a comment.

with quite OK prediction results (R=0.9880). But I was not quite comfortable with the coefficient for the \(p3\) variable.
The article did not calculate significances for the coefficients, so it was not obvious from the article whether it was useful
to include it. I then looked at the range for `p3`, which was 110-150; so, the maximal influence this variable can have is
\(150*0.006 = 0.9\). Now, the experimental values given in the article were rounded to integers, indicating that the maximal
effect of the `p3` variable is smaller than the experimental error! It’s even worse when you consider the difference between the
min and max value (40): then the influence would be even smaller (assuming that most modelling methods would put the mean temperature
effect in the offset, 151 in this case).

Today, I reread an article with a similar issue. The model was something like:

\[y = -0.81 + 0.03*p1 + 0.009*p2\]

Here, \(max(p2)-min(p2)\) is smaller than 100, so the maximal effect of the variable would be on the order of 0.9, which is of the same order as the root mean square error of prediction (RMSEP) for this model. Indeed, the article already states that the coefficient is only significant at the 95% level, and not at the 99% level. But, without having calculated the RMSEP for a model without the p2 variable, I would guess that leaving it out would give equally good prediction results.
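The back-of-the-envelope check is the same in both articles, and worth spelling out once; the coefficients and ranges below are the numbers discussed above:

```python
def max_effect(coefficient, vmin, vmax):
    """Largest change a variable can contribute to the prediction over its
    observed range: |coefficient| times the spread of the variable."""
    return abs(coefficient) * (vmax - vmin)

# First article: p3, coefficient 0.006, range 110-150 (spread of 40).
# About 0.24, i.e. below the +/-0.5 rounding error of the reported values.
print(max_effect(0.006, 110, 150))

# Second article: p2, coefficient 0.009, spread smaller than 100.
# Below 0.9, i.e. roughly the size of the model's RMSEP.
print(max_effect(0.009, 0, 100))
```

If the largest possible contribution of a variable is smaller than the experimental or prediction error, the model cannot distinguish that contribution from noise, which is the point of both complaints.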

Concluding, I would say that the `p2` variable does not include relevant information.

Do you think it is reasonable to include the `p2` variable in the second model?