Data Curation: 5% inspiration, 95% frustration (cleaning up data inconsistencies)

Slice of the spreadsheet in the supplementary info.

Just some bit of cleaning I scripted today for a number of toxicology end points in a database published some time ago the zero-APC Open Access (CC_BY) journal Beilstein of Journal of Nanotechnology, NanoE-Tox (doi:10.3762/bjnano.6.183).

The curation I am doing is to redistribute the data in the eNanoMapper database (see doi:10.3762/bjnano.6.165) and thus with ontology annotation (see doi:10.1186/s13326-015-0005-5):

recognizedToxicities = [
  "EC10": "http://www.bioassayontology.org/bao#BAO_0001263",
  "EC20": "http://www.bioassayontology.org/bao#BAO_0001235",
  "EC25": "http://www.bioassayontology.org/bao#BAO_0001264",
  "EC30": "http://www.bioassayontology.org/bao#BAO_0000599",
  "EC50": "http://www.bioassayontology.org/bao#BAO_0000188",
  "EC80": "http://purl.enanomapper.org/onto/ENM_0000053",
  "EC90": "http://www.bioassayontology.org/bao#BAO_0001237",
  "IC50": "http://www.bioassayontology.org/bao#BAO_0000190",
  "LC50": "http://www.bioassayontology.org/bao#BAO_0002145",
  "MIC":  "http://www.bioassayontology.org/bao#BAO_0002146",
  "NOEC": "http://purl.enanomapper.org/onto/ENM_0000060",
  "NOEL": "http://purl.enanomapper.org/onto/ENM_0000056"
]

With 402(!) variants left. Many do not have an ontology term yet, and I filed a feature request.

Units:

recognizedUnits = [
  "g/L": "g/L",
  "g/l": "g/l",
  "mg/L": "mg/L",
  "mg/ml": "mg/ml",
  "mg/mL": "mg/mL",
  "µg/L of food": "µg/L",
  "µg/L": "µg/L",
  "µg/mL": "µg/mL",
  "mg Ag/L": "mg/L",
  "mg Cu/L": "mg/L",
  "mg Zn/L": "mg/L",
  "µg dissolved Cu/L": "µg/L",
  "µg dissolved Zn/L": "µg/L",
  "µg Ag/L": "µg/L",
  "fmol/L": "fmol/L",

  "mmol/g": "mmol/g",
  "nmol/g fresh weight": "nmol/g",
  "µg Cu/g": "µg/g",
  "mg Ag/kg": "mg/kg",
  "mg Zn/kg": "mg/kg",
  "mg Zn/kg  d.w.": "mg/kg",
  "mg/kg of dry feed": "mg/kg",
  "mg/kg": "mg/kg",
  "g/kg": "g/kg",
  "µg/g dry weight sediment": "µg/g",
  "µg/g": "µg/g"
]

Oh, and don’t get me started on actual values, with endpoint values, as ranges, errors, etc. That variety is not the problem, but the lack of FAIR-ness makes the whole really hard to process. I now have something like:

prop = prop.replace(",", ".")
if (prop.substring(1).contains("-")) {
  rdf.addTypedDataProperty(
    store, endpointIRI, "${oboNS}STATO_0000035",
    prop, "${xsdNS}string"
  )
  rdf.addDataProperty(
    store, endpointIRI, "${ssoNS}has-unit", units
  )
} else if (prop.contains("±")) {
  rdf.addTypedDataProperty(
    store, endpointIRI, "${oboNS}STATO_0000035",
    prop, "${xsdNS}string"
  )
  rdf.addDataProperty(
    store, endpointIRI, "${ssoNS}has-unit", units
  )
} else if (prop.contains("<")) {
} else {
  rdf.addTypedDataProperty(
    store, endpointIRI, "${ssoNS}has-value", prop,
    "${xsdNS}double"
  )
  rdf.addDataProperty(
    store, endpointIRI, "${ssoNS}has-unit", units
  )
}

But let me make clear: I can actually do this, add more data to the eNanoMapper database (with Nina), only because the developers of this database made their data available under an Open license (CC-BY, to be precise), allowing me to reuse, modify (change format), and redistribute it. Thanks to the authors. Data curation is expensive, whether I do it, or if the authors of the database did. They already did a lot of data curation. But only because of Open licenses, we only have to do this once.

This post was originally posted as https://chem-bla-ics.blogspot.com/2018/09/data-curation-5-inspiration-95.html.