Bioregistry Summary

What is the Bioregistry?

The Bioregistry is a curated database and accompanying web application that act as a:

Registry: A collection of prefixes and metadata for ontologies, controlled vocabularies, and other semantic spaces. Some other well-known registries for prefixes are the OBO Foundry, Identifiers.org, the OLS, BioPortal, and Prefix Commons.
Metaregistry: A collection of metadata about registries and mappings between their constituent prefixes. For example, ChEBI appears in all example registries from above.
Resolver: A tool for mapping compact URIs (CURIEs) of the form prefix:identifier to HTML and structured content providers. Some other well-known resolvers are Identifiers.org and Name-To-Thing.
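
As a rough illustration of the resolver role, the following Python sketch splits a CURIE into its prefix and identifier and expands it with a URI template. The `URI_FORMATS` table, its example.org URLs, and the lowercasing of prefixes are assumptions made for illustration. They are not the Bioregistry's actual data or API.

```python
# Hypothetical prefix -> URI template table (illustrative URLs only).
URI_FORMATS = {
    "chebi": "https://example.org/chebi/{}",
    "ncbitaxon": "https://example.org/taxonomy/{}",
}


def parse_curie(curie):
    """Split a CURIE of the form prefix:identifier at the first colon."""
    prefix, sep, identifier = curie.partition(":")
    if not sep or not prefix or not identifier:
        raise ValueError(f"not a CURIE: {curie!r}")
    # Assumption: prefixes are compared case-insensitively.
    return prefix.lower(), identifier


def resolve(curie):
    """Return a URL for the CURIE, or None if the prefix has no template."""
    prefix, identifier = parse_curie(curie)
    template = URI_FORMATS.get(prefix)
    return template.format(identifier) if template else None
```

Under these assumptions, `resolve("chebi:24867")` expands to a URL under the toy template, while a CURIE with an unknown prefix yields `None`.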

The Bioregistry is different from other registries. It is:

Open Source: Anyone can make updates by pull requesting against the single source of truth JSON "database". There are clear contribution guidelines and decentralized review, led by the community. Automated testing notifies contributors of mistakes or omissions before review.
Automated: Continuous integration is used to automatically update the database weekly and align with other registries. It suggests new curations to help maintain the community's semantic health.
Extensible: The underlying codebase can also be improved to add new features. It's all written in idiomatic, high quality Python code.
Community Driven: Governed by public, well-defined contribution guidelines, code of conduct, and project governance to promote the project's inclusivity and longevity.

The Bioregistry is not a:

Lookup Service: A service that provides information about an entity based on its prefix/identifier pair. Some well-known lookup services are the OLS, AberOWL, OntoBee, and BioPortal.
Registry of Databases: A service that keeps track of databases. Some well-known database registries are FAIRsharing and re3data. Note that databases often mint one or more identifier schemas, which might be appropriate for inclusion in the Bioregistry.

Motivation

The goal of the Inspector Javert's Xref Database was to extract all the xrefs from OBO ontologies in the OBO Foundry. However, most ontologies took a lot of creative freedom in what prefixes they used to refer to which resources, and they therefore had to be normalized. Unfortunately, most did not appear in popular registries like MIRIAM, so the Bioregistry was created to store this information and facilitate downstream data integration. The Bioregistry evolved into a tool that enabled the investigation of the discrepancies between MIRIAM, OBO Foundry, OLS, and other biological registries. Later, it was generalized to support domains outside of biomedicine and the life sciences.

Bioregistry Coverage

This two-sided comparison shows how well the Bioregistry covers each external registry. In the case of Wikidata, it's difficult to know exactly how many relevant properties there are.

Bioregistry Coverage

More information about each external resource covered in this image can be found on the registries page. Additional external registries that have been queued for alignment can be listed via the GitHub issue tracker.

How Complete is the Bioregistry?

While many of the resources reported above are finite, Wikidata is a bit more difficult. Because it is a general-purpose ontology (for lack of a better word), it contains many properties that are irrelevant for the Bioregistry. Further, its relevant properties are labeled in a variety of ways. The graph analytics service (GAS, see example usage) might provide a solution that enables graph traversal over the various hierarchies of properties (see this).

Ontologies, controlled vocabularies, databases, and other resources that mint (persistent) identifiers will continue to be generated as we learn about new and exciting phenomena, so the medium-term plan to grow the Bioregistry is to continue to cover resources that are not covered by the other resources it references.

New external registries can be suggested on the project's issue tracker using the External Registry label. Further, there are contribution guidelines on the GitHub site to help guide potential contributors towards small but meaningful first contributions. It is expected that all contributors will be listed as co-authors in the eventual manuscript describing this resource.

Why does the Bioregistry not completely cover some external registries?

While the Bioregistry has been completely aligned with the OBO Foundry, the OLS, MIRIAM, Wikidata, and Name-to-Thing, the coverage summaries show that it does not completely cover other resources. This is due to missing or low-quality metadata associated with records in those resources. In many cases, there is not enough information to determine what the resource is, the resource has moved to a non-obvious new location, the resource has been superseded by another resource (e.g., TrEMBL is now a part of UniProt), or the resource has been decommissioned. In the case of BioPortal, records are not automatically added to the Bioregistry because of the generally poor quality of its ontologies that cannot be automatically mapped to the Bioregistry through other means.

Overlap between External Registries

After normalization and integration in the Bioregistry, it's possible to investigate the overlap between pairs of other registries. It can be seen that the MIRIAM and Name-to-Thing (N2T) registries are effectively the same because N2T imports from MIRIAM. It can also be seen that OLS and OBO Foundry have a very high overlap, where the OLS includes several ontologies that are not included in OBO Foundry. Notably, this discrepancy contains the highly regarded Experimental Factor Ontology (EFO).
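
The pairwise comparison can be sketched with plain set operations. In the Python example below, the prefix sets are toy data invented for illustration, and the Jaccard index is just one plausible overlap measure. The source does not state which measure the charts use.

```python
from itertools import combinations

# Toy sets of normalized prefixes per registry (invented for illustration).
registries = {
    "miriam": {"chebi", "go", "hgnc", "taxonomy"},
    "n2t": {"chebi", "go", "hgnc", "taxonomy"},
    "obofoundry": {"chebi", "go"},
    "ols": {"chebi", "go", "efo"},
}


def pairwise_overlap(registries):
    """Return {(a, b): (shared count, Jaccard index)} for each registry pair."""
    result = {}
    for a, b in combinations(sorted(registries), 2):
        shared = registries[a] & registries[b]
        union = registries[a] | registries[b]
        jaccard = len(shared) / len(union) if union else 0.0
        result[a, b] = (len(shared), jaccard)
    return result


overlaps = pairwise_overlap(registries)
```

With this toy data, the MIRIAM and N2T sets are identical, and the OLS set contains everything in the OBO Foundry set plus `efo`, mirroring the situation described above.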

Pairwise Overlap Comparisons

Highly Conserved Resources

The following chart shows how often entries in the Bioregistry have few or many references to external registries. A few resources appear in all external registries, such as the NCBI Taxonomy database. However, the notable exclusion of controlled vocabularies that aren't technically ontologies from the OBO Foundry and OLS severely limits their ability to cover some of the most used resources like the HGNC. Entries with no references are uniquely curated in the Bioregistry.

Reference Counts
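
A distribution like this can be computed by counting the external references on each entry. The following Python sketch uses invented toy data, including the hypothetical `my.local.vocab` entry. The layout of the `mappings` dictionary is also an assumption rather than the Bioregistry's actual schema.

```python
from collections import Counter

# Toy entries: Bioregistry prefix -> {external registry: external prefix}.
mappings = {
    "ncbitaxon": {
        "miriam": "taxonomy",
        "n2t": "taxonomy",
        "obofoundry": "ncbitaxon",
        "ols": "ncbitaxon",
    },
    "hgnc": {"miriam": "hgnc", "n2t": "hgnc"},
    "my.local.vocab": {},  # hypothetical entry with no external references
}


def reference_histogram(mappings):
    """Count how many entries have 0, 1, 2, ... external references."""
    return Counter(len(refs) for refs in mappings.values())


histogram = reference_histogram(mappings)
```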

Licensing

Licenses are only directly available from OBO Foundry and the OLS. Wikidata contains some licensing information, but more code would need to be written to handle it.

However, even internally, neither the OBO Foundry nor the OLS uses a consistent nomenclature for licenses, so they were remapped before comparison. Further, some licenses that were inappropriate for data (e.g., Apache 2.0 License, GNU GPL 3.0 License, BSD License) appeared infrequently and were collapsed into "Other". Other uncommon licenses were likewise collapsed into "Other". Afterwards, there were still several conflicts between the license reported by OBO Foundry and the one reported by OLS, in which case both were added to the tally. In the majority of the conflicts, OBO Foundry reported CC-BY and OLS reported CC0.
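
The remapping and tallying steps could look roughly like the Python sketch below. The `REMAP` spellings and the `KEEP` set are assumptions chosen for illustration. The actual mapping used for the chart is not given here.

```python
from collections import Counter

# Hypothetical spelling -> canonical name remapping.
REMAP = {
    "CC-BY 4.0": "CC-BY",
    "CC BY 4.0": "CC-BY",
    "CC0 1.0": "CC0",
    "CC-0": "CC0",
}
KEEP = {"CC-BY", "CC0"}  # anything else is collapsed into "Other"


def normalize(license_name):
    """Map a raw license string to a canonical label, or None if missing."""
    if license_name is None:
        return None
    canonical = REMAP.get(license_name, license_name)
    return canonical if canonical in KEEP else "Other"


def tally(records):
    """Tally licenses over (obo_license, ols_license) pairs; either may be None."""
    counts = Counter()
    for obo, ols in records:
        # A set counts agreeing licenses once and conflicting licenses twice.
        counts.update({normalize(obo), normalize(ols)} - {None})
    return counts


counts = tally([
    ("CC-BY 4.0", "CC BY 4.0"),    # agreement: one CC-BY
    ("CC-BY 4.0", "CC0 1.0"),      # conflict: both are counted
    ("Apache 2.0 License", None),  # collapsed into Other
])
```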

Providers

One of the original goals of using URIs as identifiers for entities was that they could be resolved in the browser. This isn't strictly enforced and, even worse, many resources don't have any URIs associated with them at all. The following chart shows the approximate distribution of how many providers are available for each resource. Luckily, most have one or more, but some don't have any.

Provider Coverage

Other Attributes

Attributes Coverage

Versioning

The OLS is the only registry that actually consumes the data it references, and is therefore the only registry that reports version information. The OBO Foundry also references versioned data, but does not consume it and therefore cannot report version information. Wikidata also contains version information for some databases, but is not currently viable for generally tracking version information. The other registries (e.g., MIRIAM, N2T) do not report version information as their resolution services are independent of the data versions. Alternatively, the Bioversions project sets out to be a registry-independent solution for identifying current versions of different databases, ontologies, and resources.

Pattern

Only a few registries report patterns (e.g., MIRIAM, Prefix Commons, Wikidata), though OBO Foundry ontologies are usually consistent in using the ^<prefix>:\d{7}$ pattern. However, this isn't a rule, so it can't be assumed without inspecting some terms from the ontology itself. The Bioregistry also has a place to curate patterns for all the entries that do not have one imported from an external registry.
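
As a small illustration, the conventional OBO pattern can be built and checked with Python's `re` module. The helper names `default_obo_pattern` and `matches` are hypothetical and exist only for this sketch.

```python
import re


def default_obo_pattern(prefix):
    """Build the conventional pattern: the prefix, a colon, then seven digits."""
    # re.escape guards against prefixes containing regex metacharacters.
    return re.compile(rf"^{re.escape(prefix)}:\d{{7}}$")


def matches(curie, pattern):
    """Return True if the whole CURIE conforms to the compiled pattern."""
    return pattern.fullmatch(curie) is not None


go = default_obo_pattern("GO")
```

Here `matches("GO:0008150", go)` holds, while a CURIE with too few digits or a non-numeric identifier does not match. Since the seven-digit convention is not a rule, a check like this is only a starting point before inspecting real terms.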

Wikidata Database

It's typically difficult to propose new Wikidata properties to go along with databases, but anyone can add entities corresponding to databases. This is one part of the Bioregistry that will require lots of manual effort. Eventually, we can develop a minimum information standard for entries in the Bioregistry that would be convincing enough for the Wikidata property gatekeepers and the MIRIAM registry.

Unexpected usage of CURIEs

Example Unresolvable CURIEs

Some CURIEs cannot be resolved in the Bioregistry. There are three typical reasons:

The prefix is not registered with the Bioregistry.
Example: nope:1234.
The prefix has a validation pattern and the identifier does not match it.
Example: chebi:ABCD.
There are no providers available for the prefix.
Example: gmelin:1466.
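
The three failure modes can be sketched as successive checks. In the Python example below, the `REGISTRY` contents, the patterns, and the example.org URL are toy assumptions rather than the Bioregistry's real records or API.

```python
import re

# Toy registry (assumed for illustration): local identifier pattern and URI template.
REGISTRY = {
    "chebi": {"pattern": r"^\d+$", "uri_format": "https://example.org/chebi/{}"},
    "gmelin": {"pattern": r"^\d+$", "uri_format": None},
}


def why_unresolvable(curie):
    """Return a reason string if the CURIE cannot be resolved, else None."""
    prefix, _, identifier = curie.partition(":")
    entry = REGISTRY.get(prefix)
    if entry is None:
        return "prefix is not registered"
    pattern = entry.get("pattern")
    if pattern and not re.fullmatch(pattern, identifier):
        return "identifier does not match the pattern"
    if not entry.get("uri_format"):
        return "no providers available"
    return None
```

Run against the three examples above, the function reports one distinct reason each, and it returns `None` for a CURIE such as `chebi:24867` that passes all three checks in this toy registry.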

Resources that Aren't Authorities

While entries in the Bioregistry are supposed to represent nomenclature authorities, this is not always true because it imports from external sources that don't enforce this constraint. For example, the Comparative Toxicogenomics Database uses NCBI Gene for naming genes and MeSH for naming diseases and chemicals. Identifiers.org has minted 3 prefixes (ctd.gene, ctd.disease, and ctd.chemical) that mostly reflect the entries of the authorities for which they are providers. Another example is ValidatorDB, which provides information based on Protein Databank records.

An even more exotic example is the Gene Ontology Annotations provided by the EBI because it provides for several types of identifiers, including those from UniProt, RNA Central, and the ComplexPortal. This is more similar to providers like the OLS and OntoBee since it can accept prefix/identifier pairs instead of just identifiers for the single prefix that it serves.

Each entry in the Bioregistry now has a slot "provides" that can codify these connections, and we may begin annotating all entries with a "type" in the future such that resources and providers, which have many common aspects to their metadata schemata, can be more easily listed and curated in the same place. Further discussion about this is taking place on biopragmatics/bioregistry#32. The first commits to the repo related to that are available here.
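
To make the idea of the "provides" slot concrete, here is a minimal Python sketch. The record layout, the target prefixes `ncbigene` and `mesh`, and the helper `to_authority` are assumptions inferred from the CTD description above. They are not copied from the Bioregistry's data.

```python
# Hypothetical records; "provides" names the authority a provider prefix serves.
ENTRIES = {
    "ctd.gene": {"provides": "ncbigene"},
    "ctd.disease": {"provides": "mesh"},
    "ctd.chemical": {"provides": "mesh"},
    "ncbigene": {},
    "mesh": {},
}


def to_authority(curie):
    """Rewrite a provider CURIE to use the prefix of the authority it provides for."""
    prefix, sep, identifier = curie.partition(":")
    target = ENTRIES.get(prefix, {}).get("provides")
    # CURIEs whose prefix is unknown or is already an authority pass through unchanged.
    return f"{target}:{identifier}" if sep and target else curie
```

Under these assumptions, `to_authority("ctd.gene:1234")` gives `ncbigene:1234`, while a CURIE that already uses an authority prefix is returned as-is.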

Misguided Attribution

In the OBO file format, terms have a description field which allows for the specification of a list of CURIEs to consider as provenance. Often, this will point to PubMed identifiers or Wikipedia pages. However, many resources create their own prefix with which they identify the original curator. For example, in the Gene Ontology, there is a prefix GOC that often appears in CURIEs with the initials of the curator such as in GOC:vw. Unfortunately, this information is hard to deconvolve because GOC has not been registered with Identifiers.org or another resource, the identifiers are not MIRIAM compliant, and it's not obvious to whom vw refers since there is no (obvious) resource to resolve these.

In the example of the Human Phenotype Ontology, whose prefix is hp, the prefix HPO is used to denote curators in the provenance for descriptions. Luckily, they use slightly more informative tags such as HPO:skoehler, which can be easily attributed to Sebastian Köhler, one of its main contributors. However, it would be much more informative to use a CURIE for the ORCID identifier for this author, orcid:0000-0002-5316-1399, which immediately addresses all concerns shared across GOC and HPO.