The Bioregistry is a curated database and accompanying web application that act as a:
prefix:identifier to HTML and structured content providers. Some other well-known resolvers are Identifiers.org and Name-To-Thing.
The Bioregistry is different from other registries. It is:
The Bioregistry is not a:
The goal of the Inspector Javert's Xref Database was to extract all the xrefs from OBO ontologies in the OBO Foundry. However, most ontologies took a lot of creative freedom in what prefixes they used to refer to which resources, and they therefore had to be normalized. Unfortunately, most did not appear in popular registries like MIRIAM, so the Bioregistry was created to store this information and facilitate downstream data integration. Later, the Bioregistry became a tool that enabled the investigation of the discrepancies between MIRIAM, OBO Foundry, OLS, and other biological registries.
This two-sided comparison shows how well the Bioregistry covers each external registry. In the case of Wikidata, it's difficult to know exactly how many relevant properties there are.
More information about each external resource covered in this image can be found on the registries page. Additional external registries that have been queued for alignment can be listed via the GitHub issue tracker.
While many of the resources reported above are finite, Wikidata is a bit more difficult. Because it is a general-purpose ontology (for lack of a better word), it contains many properties that are irrelevant for the Bioregistry. Further, its properties that are relevant are labeled in a variety of ways. The GAS service might provide a solution that enables graph traversal over the various hierarchies of properties (see this).
Biological databases, ontologies, and resources will continue to be generated as we learn about new and exciting phenomena, so the medium-term plan to grow the Bioregistry is to continue to cover resources that are not covered by the other registries it references. New external registries can be suggested on the Bioregistry GitHub issue tracker using the External Registry label. Further, there are contribution guidelines on the GitHub site to help guide potential contributors towards small but meaningful first contributions. It is expected that all contributors will be listed as co-authors in the eventual manuscript describing this resource.
While the Bioregistry has been completely aligned with the OBO Foundry, the OLS, MIRIAM, Wikidata, and Name-to-Thing, the coverage summaries show that it does not completely cover other resources. This is due to missing or low-quality metadata associated with records in those resources. In many cases, there is not enough information to determine what the resource is, the resource has moved to a non-obvious new location, the resource has been superseded by another resource (e.g., TrEMBL is now a part of UniProt), or the resource has been decommissioned. In the case of BioPortal, records are not automatically added to the Bioregistry because of the generally poor quality of its ontologies, which cannot be automatically mapped to the Bioregistry through other means.
After normalization and integration in the Bioregistry, it's possible to investigate the overlap between
pairs of other registries. It can be seen that the MIRIAM and Name-to-Thing (N2T) registries are
effectively the same because N2T imports from MIRIAM. It can also be seen that OLS and OBO Foundry have
a very high overlap, where the OLS includes several ontologies that are not included in OBO Foundry.
Notably, this discrepancy contains the highly regarded Experimental Factor Ontology (EFO).
Pairwise Overlap Comparisons
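The pairwise comparison described above boils down to set operations over normalized prefixes. The following is a minimal sketch, not the Bioregistry's own code: the prefix sets are tiny, hypothetical stand-ins for the real registries, and the function name pairwise_overlap is made up for illustration.

```python
from itertools import combinations

# Hypothetical, heavily truncated prefix sets (after normalization).
registries = {
    "miriam": {"chebi", "go", "hgnc", "ncbitaxon"},
    "n2t": {"chebi", "go", "hgnc", "ncbitaxon"},
    "obofoundry": {"chebi", "go", "ncbitaxon"},
    "ols": {"chebi", "go", "ncbitaxon", "efo"},
}


def pairwise_overlap(registries):
    """Return {(a, b): (shared, only_in_a, only_in_b)} for every registry pair."""
    result = {}
    for a, b in combinations(sorted(registries), 2):
        left, right = registries[a], registries[b]
        result[a, b] = (left & right, left - right, right - left)
    return result


overlaps = pairwise_overlap(registries)
shared, only_obo, only_ols = overlaps["obofoundry", "ols"]
print(sorted(only_ols))  # prefixes in the toy OLS set that the toy OBO Foundry set lacks
```

In this toy data the two identical sets mirror the MIRIAM and N2T situation, and the single extra OLS prefix mirrors the EFO case.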
The following chart shows how often entries in the Bioregistry have few or many references to external registries. A few resources appear in all external registries, such as the NCBI Taxonomy database. However, the OBO Foundry and OLS do not include controlled vocabularies that aren't technically ontologies, which severely limits their ability to cover some of the most used resources like the HGNC. Entries with no references are uniquely curated in the Bioregistry.
Licenses are only directly available from OBO Foundry and the OLS. Wikidata contains some licensing information, but more work would be needed to handle it.
However, even internally, neither the OBO Foundry nor OLS uses a consistent nomenclature for licenses, so they were remapped using this ruleset. Further, some licenses that are inappropriate for data (e.g., Apache 2.0 License, GNU GPL 3.0 License, BSD License) appeared infrequently and were collapsed into "Other". Other uncommon licenses were likewise collapsed into "Other". Afterwards, there were still several conflicts between the license reported in OBO Foundry and the one reported in OLS, in which case both were added to the tally. In the majority of the conflicts, OBO Foundry reported CC-BY and OLS reported CC 0.
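The remapping and tallying described above can be sketched as follows. This is an illustration under stated assumptions, not the actual ruleset: the LICENSE_RULES table, the raw license strings, and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical remapping rules: lowercased raw string -> normalized label.
LICENSE_RULES = {
    "cc-by": "CC-BY",
    "cc by 4.0": "CC-BY",
    "cc0": "CC 0",
    "cc-0": "CC 0",
}


def normalize_license(raw):
    """Map a raw license string to a normalized label; unmatched strings become "Other"."""
    if raw is None:
        return None
    return LICENSE_RULES.get(raw.strip().lower(), "Other")


def tally_licenses(pairs):
    """Tally (OBO Foundry, OLS) license pairs; on conflict, count both labels."""
    counter = Counter()
    for obo_raw, ols_raw in pairs:
        labels = {normalize_license(obo_raw), normalize_license(ols_raw)} - {None}
        counter.update(labels)
    return counter


# Made-up input rows for illustration only.
tally = tally_licenses([
    ("CC-BY", "CC0"),              # conflict: both labels are counted
    ("CC BY 4.0", "cc-by"),        # agreement after normalization: counted once
    ("Apache 2.0 License", None),  # no rule matches: collapsed into "Other"
    (None, None),                  # no license reported by either source
])
print(tally)
```

Using a set per entry means agreeing sources contribute one count while conflicting sources contribute one count per label, which matches the tallying rule stated above.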
One of the original goals of using IRIs as identifiers for biomedical entities was that they could be resolved in the browser. This isn't strictly enforced, and even worse it's the case that many resources don't have any IRIs associated with them at all. The following chart shows roughly the distribution of how many providers are available for each resource. Luckily, most have one or more, but some don't have any.
The OLS is the only registry that actually consumes the data it references, and is therefore the only registry that reports version information. The OBO Foundry also references versioned data, but does not consume it and therefore cannot report version information. Wikidata also contains version information for some databases, but is not currently viable for generally tracking version information. The other registries (e.g., MIRIAM, N2T) do not report version information, as their resolution services are independent of the data versions. Alternatively, the Bioversions project sets out to be a registry-independent solution for identifying current versions of different databases, ontologies, and resources.
Only a few registries report patterns (e.g., MIRIAM, Prefix Commons, Wikidata), though OBO Foundry ontologies are usually consistent in using the ^<prefix>:\d{7}$ pattern. However, this isn't a rule, so it can't be assumed without inspecting some terms from the ontology itself. The Bioregistry also has a place to curate patterns for all the entries that do not have one imported from an external registry.
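As a small illustration of the conventional OBO pattern, the sketch below builds the regular expression for a given prefix and checks a few strings against it. The helper names are hypothetical, and the convention still has to be confirmed per ontology, as noted above.

```python
import re


def obo_style_pattern(prefix):
    r"""Compile the conventional ^<prefix>:\d{7}$ pattern for one prefix."""
    return re.compile(rf"^{re.escape(prefix)}:\d{{7}}$")


def follows_obo_convention(prefix, curie):
    """Return True if the CURIE is the prefix followed by exactly seven digits."""
    return obo_style_pattern(prefix).match(curie) is not None


print(follows_obo_convention("GO", "GO:0008150"))
print(follows_obo_convention("GO", "GO:123"))
```

Running the check over a sample of an ontology's terms is one way to test whether the convention holds before assuming it.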
It's typically difficult to propose new Wikidata properties to go along with databases, but anyone can add entities corresponding to databases. This is one part of the Bioregistry that will require lots of manual effort. Eventually, we can develop a minimum information standard for entries in the Bioregistry that would be convincing enough for the Wikidata property gatekeepers and the MIRIAM registry.
Some CURIEs cannot be resolved in the Bioregistry. There are three typical reasons, illustrated by the examples nope:1234, chebi:ABCD, and gmelin:1466.
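The sketch below shows how a resolver might report why a CURIE fails. It is an assumption-laden illustration: the miniature REGISTRY, its patterns, and the example.org URL are hypothetical, and the three failure modes it distinguishes (unregistered prefix, identifier not matching the prefix's pattern, no provider available) are a guess suggested by the three example CURIEs above, not a statement of the Bioregistry's actual logic.

```python
import re

# Hypothetical miniature registry; patterns and URL formats are illustrative only.
REGISTRY = {
    "chebi": {"pattern": r"^\d+$", "uri_format": "https://example.org/chebi/{}"},
    "gmelin": {"pattern": r"^\d+$", "uri_format": None},
}


def resolve(curie):
    """Return (url, reason); url is None when the CURIE cannot be resolved."""
    prefix, sep, identifier = curie.partition(":")
    if not sep:
        return None, "not a CURIE"
    entry = REGISTRY.get(prefix.lower())
    if entry is None:
        return None, "unregistered prefix"
    pattern = entry.get("pattern")
    if pattern and not re.match(pattern, identifier):
        return None, "identifier does not match pattern"
    if not entry.get("uri_format"):
        return None, "no provider"
    return entry["uri_format"].format(identifier), "ok"


for example in ["nope:1234", "chebi:ABCD", "gmelin:1466", "chebi:1234"]:
    print(example, resolve(example))
```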
While entries in the Bioregistry are supposed to represent nomenclature authorities, this is not always true because it imports from external sources that don't enforce this constraint. For example, the Comparative Toxicogenomics Database uses NCBI Gene for naming genes and MeSH for naming diseases and chemicals. Identifiers.org has minted three prefixes (ctd.gene, ctd.disease, and ctd.chemical) that mostly reflect the entries of the authorities for which they are providers. Another example is ValidatorDB, which provides information based on Protein Data Bank records.
An even more exotic example is the Gene Ontology Annotations resource provided by the EBI, because it serves several types of identifiers, including those from UniProt, RNA Central, and the ComplexPortal. This is more similar to providers like the OLS and OntoBee, since it can accept prefix/identifier pairs instead of just identifiers for the single prefix that it serves.
Each entry in the Bioregistry now has a slot "provides" that can codify these connections, and we may begin annotating all entries with a "type" in the future such that resources and providers, which have many common aspects to their metadata schemata, can be more easily listed and curated in the same place. Further discussion about this is taking place on biopragmatics/bioregistry#32. The first commits to the repo related to that are available here.
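To make the "provides" slot concrete, the following sketch models the CTD example as plain dictionaries. Only the slot name "provides" and the three ctd.* prefixes come from the text; the record layout, the ncbigene and mesh keys, and the helper functions are simplified assumptions rather than the Bioregistry's actual schema.

```python
# Hypothetical, simplified records keyed by prefix.
entries = {
    "ncbigene": {"name": "NCBI Gene"},
    "mesh": {"name": "MeSH"},
    "ctd.gene": {"name": "CTD Gene", "provides": "ncbigene"},
    "ctd.disease": {"name": "CTD Disease", "provides": "mesh"},
    "ctd.chemical": {"name": "CTD Chemical", "provides": "mesh"},
}


def canonical_prefix(prefix):
    """Follow "provides" links until reaching an entry that provides for nothing else."""
    seen = set()
    while "provides" in entries.get(prefix, {}):
        if prefix in seen:
            raise ValueError(f"cycle in 'provides' links at {prefix}")
        seen.add(prefix)
        prefix = entries[prefix]["provides"]
    return prefix


def providers_of(authority):
    """List the prefixes that declare themselves providers for an authority."""
    return sorted(p for p, e in entries.items() if e.get("provides") == authority)


print(canonical_prefix("ctd.gene"))
print(providers_of("mesh"))
```

With links like these, a CURIE written with a provider prefix could be rewritten to the authority's prefix before integration.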
In the OBO file format, terms have a description field which allows for the specification of a list of CURIEs to consider as provenance. Often, this will point to PubMed identifiers or Wikipedia pages. However, many resources create their own prefix with which they identify the original curator. For example, in the Gene Ontology, there is a prefix GOC that often appears in CURIEs with the initials of the curator, such as in GOC:vw. Unfortunately, this information is hard to deconvolve because GOC has not been registered with Identifiers.org or another resource, the identifiers are not MIRIAM compliant, and it's not obvious to whom vw refers since there is no (obvious) resource to resolve these.
In the example of the Human Phenotype Ontology, whose prefix is hp, the prefix HPO is used to denote curators in the provenance for descriptions. Luckily, they use slightly more informative tags such as HPO:skoehler, which can be easily attributed to Sebastian Köhler, one of its main contributors. However, it would be much more informative to use a CURIE for the ORCID identifier for this author, orcid:0000-0002-5316-1399, which immediately addresses all concerns shared across GOC and HPO.
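The substitution suggested above can be sketched as a lookup over provenance CURIEs. This is a minimal illustration: the mapping table and function are hypothetical, only the HPO:skoehler entry is taken from the text, and the PubMed identifier in the usage line is a made-up placeholder.

```python
# Hypothetical lookup table from curator tags to ORCID CURIEs.
CURATOR_TO_ORCID = {
    "HPO:skoehler": "orcid:0000-0002-5316-1399",
}

# Prefixes used for curator tags, as described in the text.
CURATOR_PREFIXES = {"GOC", "HPO"}


def normalize_provenance(curies):
    """Swap known curator tags for ORCID CURIEs; collect curator tags left unmapped."""
    normalized, unresolved = [], []
    for curie in curies:
        if curie in CURATOR_TO_ORCID:
            normalized.append(CURATOR_TO_ORCID[curie])
            continue
        if curie.partition(":")[0] in CURATOR_PREFIXES:
            unresolved.append(curie)
        normalized.append(curie)
    return normalized, unresolved


normalized, unresolved = normalize_provenance(["PMID:12345", "HPO:skoehler", "GOC:vw"])
print(normalized)
print(unresolved)
```

Tags such as GOC:vw stay unchanged and are reported separately, since nothing in the tag itself identifies the curator.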