Downloading all currently released BridgeDb identifier mapping databases

The BridgeDb project (doi:10.1186/1471-2105-11-5) (and ELIXIR recommended interoperability resource) has several aims, all around identifier mapping:

provide a Java API for identifier mapping
provide ID mappings (two flavors: with and without semantic meaning)
provide services (R package, OpenAPI webservice)
track the history of identifiers

The last one is more recent and two aspects are under development here: secondary identifiers and dead identifiers. More about that in some future post. About the first and the third I am also not going to tell much in this post. Just follow the above links.

I do want to say something in this post about the actually identifier mapping databases, in particular those we distribute as Apache Derby files, the storage format used by the Java libraries. These are the files you download if you want mapping databases for PathVisio (doi:10.1371/journal.pcbi.1004085). BridgeDb has mapping files for various things and some example databases the data it maps between:

genes and proteins: Ensembl, UniProt, NCBI Gene
metabolites; HMDB, ChEBI, LIPID MAPS, Wikidata, CAS
publications: DOI, PubMed
macromolecular complexes: Complex Portal, Wikidata

The BridgeDb API is agnostic to the things it can map identifiers for.

Downloading mapping files: BridgeDb has an BioSchemas-powered web page with an overview of the latest released mapping files. It looks like this:

This webpage is the result from the cyber attack in late 2019, disrupting a good bit of the infrastructure. This is why we renewed the website, including the download page. The new page actually is hosted on GitHub as a Markdown file, but this is where things get interesting. The Markdown file is actually autogenerated from a JSON file with all the info. Everything, including the BioSchemas annotation is created from that. Basically, JSON gets converted into Markdown (with a custom script), which gets converted into HTML by a GitHub Action/Pages. So, when someone releases a new mapping file on Zenodo or Figshare, they only have to send me a pull request with updated JSON file.

Now, previously, downloading all released mapping files, for example for the BridgeDb webservice, was a bit complicated. The information was a HTML file generated by the webserver for a folder. No metadata. Nuno wrote code to extract the relevant info and download all the files. However, since the information is now available in a public JSON file, it is a lot easier. The following code uses wget and jq, two tools readily available on the popular operating systems. Have fun!

!/bin/bash

wget -nc https://bridgedb.github.io/data/gene.json
wget -nc https://bridgedb.github.io/data/corona.json
wget -nc https://bridgedb.github.io/data/other.json

jq -r '.mappingFiles | .[] | "\(.file)=\(.downloadURL)"' gene.json > files.txt
jq -r '.mappingFiles | .[] | "\(.file)=\(.downloadURL)"' corona.json >> files.txt
jq -r '.mappingFiles | .[] | "\(.file)=\(.downloadURL)"' other.json >> files.txt

for FILE in $(cat files.txt)
do
  readarray -d = -t splitFILE<<< "$FILE"
  echo ${splitFILE[0]}
  wget -nc -O ${splitFILE[0]} ${splitFILE[1]}
done

Actually, while writing this blog post, I notice the code can be further simplified.

This post was originally posted as https://chem-bla-ics.blogspot.com/2021/02/downloading-all-currently-released.html.