An overview of registries covering biomedical ontologies, controlled vocabularies, and databases.
Data Models
A 🟢 means the field is required. A 🟡 means it is
part of the schema, but not required or incomplete on some entries. A 🔴 means that
it is not part of the metadata schema. For lookup services like the OLS, some fields (i.e., Example ID,
Default Provider, Alternate Providers) are omitted because inclusion would be redundant.
Data Model Score
The weighted sum of green dots, less valuable yellow dots, and some negatively weighted red dots. Higher is
better.
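As a rough illustration of how such a weighted sum could be computed, the sketch below scores one registry's row. The weights, the status labels, and the choice of which missing fields are penalized are all hypothetical; the actual values are not stated here.

```python
# Hypothetical weights; the real values behind the score are not given in this text.
GREEN_WEIGHT = 1.0   # field is required
YELLOW_WEIGHT = 0.5  # field is optional or incomplete
# Only some red dots are penalized; this mapping is an assumption for illustration.
RED_PENALTIES = {"license": -1.0}


def data_model_score(statuses: dict[str, str]) -> float:
    """Sum the weight of each field's status ("green", "yellow", or "red")."""
    total = 0.0
    for field, status in statuses.items():
        if status == "green":
            total += GREEN_WEIGHT
        elif status == "yellow":
            total += YELLOW_WEIGHT
        else:
            # Red dots count as zero unless the field is explicitly penalized.
            total += RED_PENALTIES.get(field, 0.0)
    return total


example_row = {"name": "green", "homepage": "yellow", "license": "red", "version": "red"}
example_score = data_model_score(example_row)
```

With these assumed weights, the example row scores 1.0 + 0.5 - 1.0 + 0.0.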
Name
This field denotes if a name is required, optional, or never captured for each record in the registry.
Homepage
This field denotes if a homepage is required, optional, or never captured for each record in the registry.
Description
This field denotes if a description is required, optional, or never captured for each record in the registry.
Example
This field denotes if an example local unique identifier is required, optional, or never captured for each record in the registry.
Pattern
This field denotes if a regular expression pattern for matching local unique identifiers is required, optional, or never captured for each record in the registry.
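To show what such a pattern is used for, here is a minimal sketch that checks a local unique identifier against a regular expression. The seven-digit pattern is only an illustrative assumption, and `is_valid_local_id` is a hypothetical helper name.

```python
import re

# Illustrative pattern (an assumption): exactly seven digits.
PATTERN = re.compile(r"^\d{7}$")


def is_valid_local_id(local_id: str) -> bool:
    """Return True if the whole local unique identifier matches the pattern."""
    return PATTERN.fullmatch(local_id) is not None
```

Note that the pattern applies to the local unique identifier alone, so a string that still carries its prefix (such as `GO:0008150`) does not match.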
Provider
This field denotes if a URI format string for converting local unique identifiers into URIs is required, optional, or never captured for each record in the registry.
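The following sketch shows how a URI format string might be applied. It assumes a `$1` placeholder marks where the local unique identifier goes; that placeholder convention, the `example.org` URL, and the function name are assumptions for illustration.

```python
PLACEHOLDER = "$1"  # assumed placeholder convention


def format_uri(uri_format: str, local_id: str) -> str:
    """Substitute a local unique identifier into a URI format string."""
    if PLACEHOLDER not in uri_format:
        raise ValueError(f"URI format string lacks {PLACEHOLDER}: {uri_format}")
    return uri_format.replace(PLACEHOLDER, local_id)


example_uri = format_uri("https://example.org/go/$1", "0008150")
```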
Alternate Providers
This field denotes if additional/secondary URI format strings for converting local unique identifiers into URIs are required, optional, or never captured for each record in the registry.
Synonyms
This field denotes if alternative prefixes (e.g., taxonomy for NCBITaxon) are required, optional, or never captured for each record in the registry.
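One use of synonyms is normalizing an alternative prefix to its preferred form. The sketch below builds a lookup index from a hypothetical record structure; only the taxonomy/NCBITaxon pair comes from the text above, and the case-insensitive matching is an assumption.

```python
# Hypothetical record structure; only the NCBITaxon/taxonomy pair is from the text.
REGISTRY = {"NCBITaxon": {"synonyms": ["taxonomy"]}}


def build_index(registry: dict) -> dict[str, str]:
    """Map each lowercased prefix and synonym to its preferred prefix."""
    index = {}
    for prefix, record in registry.items():
        index[prefix.lower()] = prefix
        for synonym in record.get("synonyms", []):
            index[synonym.lower()] = prefix
    return index


INDEX = build_index(REGISTRY)


def normalize_prefix(prefix: str):
    """Return the preferred prefix, or None if the prefix is unknown."""
    return INDEX.get(prefix.lower())
```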
License
This field denotes if capturing the data license is required, optional, or never captured for each record in the registry.
Version
This field denotes if capturing the current data version is required, optional, or never captured for each record in the registry.
Contact
This field denotes if capturing the primary responsible person's contact information (e.g., name, ORCID, email) is required, optional, or never captured for each record in the registry.
Notes: Several of Wikidata's fields can be accessed indirectly with alternative SPARQL queries.
Non-English language registries in the OntoPortal Alliance were not considered.
Capabilities and Qualities
This section provides a systematic evaluation and comparison of
the capabilities of each registry.
Quality Score
The number of green dots in each row.
Structured Data
This field denotes whether the registry provides structured access to its data. For example, this can be through an API (e.g., FAIRsharing, OLS) or a bulk download (e.g., OBO Foundry) in a structured file format. A counter-example is a site that must be scraped to acquire its content (e.g., the NCBI GenBank).
Bulk Data
This field denotes whether the registry provides a bulk dump of its data. For example, the OBO Foundry provides its bulk data in a file and Identifiers.org provides its bulk data through an API endpoint. A counter-example is FAIRsharing, which requires slow, expensive pagination through its data. Another counter-example is HL7, which requires manually navigating a form to download its content. While GenBank is not structured, it is still bulk downloadable.
No Authentication
This field denotes whether the registry provides access to its data without an API key. For example, Identifiers.org requires no API key. As a counter-example, BioPortal requires an API key for access to its structured data.
Automatable Download
This field denotes whether the registry makes its data downloadable in an automated way. This includes websites that have bulk downloads, paginated API downloads, or even require scraping. A counter-example is HL7, whose download cannot be automated due to the need to interact with a web form.
Permissive License
This field denotes whether the registry uses a license that permits reuse and/or remixing. Based on the OBO
Foundry's FP-001 "openness" principle, this
includes Creative Commons CC BY 3.0, CC BY 4.0, and CC Zero. This explicitly does not include resources
licensed with share-alike clauses, no derivatives clauses, or ones that are missing license statements
entirely.
Prefix Search
This field denotes if the registry provides either a dedicated page for searching for prefixes (e.g., AberOWL has a dedicated search page) or a contextual search (e.g., AgroPortal has a prefix search built into its homepage).
Prefix Provider
This field denotes if the registry provides information about its own prefixes either
in the form of a web page or an API endpoint. These can be accessed
through a stable URL into which a prefix from the registry can be formatted.
CURIE Resolver
This field denotes if the registry can act as a resolver, i.e., it redirects to an external
page about a given biomedical concept or entity based on its CURIE and
the registry's internal metadata about the prefix's associated
URI format string.
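The resolver behaviour described above can be sketched as a lookup followed by string substitution. The in-memory table, the `$1` placeholder convention, the `example.org` URL, and the function name are hypothetical; a real resolver would return the result as an HTTP redirect.

```python
# Hypothetical prefix-to-URI-format table standing in for a registry's metadata.
URI_FORMATS = {"go": "https://example.org/go/$1"}


def resolve(curie: str):
    """Return the redirect target for a CURIE, or None if the prefix is unknown."""
    prefix, separator, local_id = curie.partition(":")
    if not separator or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    uri_format = URI_FORMATS.get(prefix)
    if uri_format is None:
        return None
    # "$1" is the assumed placeholder for the local unique identifier.
    return uri_format.replace("$1", local_id)
```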
CURIE Lookup
This field denotes if the registry acts as a lookup service, i.e., it gives information
about a given biomedical concept or entity based on its CURIE.
This section provides a systematic evaluation and comparison of
the governance and standard operating procedures for each registry. We generated the following list of
objective, measurable metrics:
Are there clear, public policies on what content can be added to the registry?
Are there clear, public policies on who is allowed to add content to the registry?
Are there clear, public policies on why/how content is edited, deprecated, or removed from the registry?
Are community members able to petition for updates to resources that they do not "own", for example, if
there is a typo in the metadata?
Does the community have clear, public policies for handling records that have been abandoned by the
submitter/responsible person?
Are there clear, public guidelines on how to contribute to the registry? We argue that open contribution,
e.g., via a request in an issue tracker or directly by creating a pull
request, is preferable because it better engages other community members and stakeholders.
Does the registry make its data available under a data-appropriate, permissive, well-understood license
(e.g., CC Zero or CC BY 4.0)?
Does the registry make its underlying code open source under version control?
Are there similar appropriate policies for the code with respect to contribution and moderation as
previously described for the content of the registry?
Does the community have a public issue tracker related to both curation and technical issues with the
registry? A counter-example is that some communities require petitioning the moderator(s) privately by
email.
Are there clear, public, up-to-date resources listing who has the technical ability to make updates to the
registry (i.e., the community moderator(s))?
Are the community moderators responsive on the issue tracker? This can be compared between communities using
measurements like how many total issues are open on the tracker, how many have been unanswered by a
moderator for more than a certain amount of time, how quickly issues are closed on average, etc.
Is there a clear, public governance structure for inducting/removing community moderators?
Are the moderators from heterogeneous institutions/scientific domains?
Are contributions from the community attributed (both on a technical level, e.g., by associating ORCID
identifiers to records, and also during scientific publication, e.g., as acknowledgments or including
contributors as co-authors)?
Does the community have a clear, public code of conduct?
Do the moderators (or wider community) organize discussions, such as community meetings or workshops?
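The responsiveness measurements suggested in the list above (open issues, issues left unanswered beyond a threshold, average time to close) could be computed along the following lines. This is only a sketch: the issue record layout, the field names, and the 30-day threshold are assumptions, not part of any registry's tooling.

```python
from datetime import datetime, timedelta
from statistics import mean


def responsiveness(issues: list[dict], now: datetime,
                   stale_after: timedelta = timedelta(days=30)) -> dict:
    """Summarize tracker activity from hypothetical issue records.

    Each record has "opened", "closed" (or None), and
    "first_moderator_reply" (or None) as datetime values.
    """
    open_issues = [i for i in issues if i["closed"] is None]
    stale = [
        i for i in open_issues
        if i["first_moderator_reply"] is None and now - i["opened"] > stale_after
    ]
    days_to_close = [
        (i["closed"] - i["opened"]).days for i in issues if i["closed"] is not None
    ]
    return {
        "open": len(open_issues),
        "stale_unanswered": len(stale),
        "mean_days_to_close": mean(days_to_close) if days_to_close else None,
    }
```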
We surveyed a subset of these questions; the results are presented in the table below. First, each field is
explained.
Governance Score
The sum of the following boolean fields plus some additional logic. One point is deducted from registries
with an internally focused scope.
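A minimal sketch of this score follows. The field names and the `"internal"` scope label are hypothetical, and since the "additional logic" is not specified here, only the documented one-point deduction is modelled.

```python
def governance_score(fields: dict[str, bool], scope: str) -> int:
    """Count the true boolean fields, then deduct one point for internal scope."""
    score = sum(1 for value in fields.values() if value)
    if scope == "internal":  # assumed label for an internally focused registry
        score -= 1
    return score


# Hypothetical field names, for illustration only.
example_fields = {
    "accepts_external_contributions": True,
    "public_version_controlled_data": True,
    "issue_tracker": False,
}
```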
Accepts External Contributions
This field denotes if the registry (in theory) accepts external contributions, either via suggestion or proactive improvement. This field does not pass judgement on the difficulty of this process from the perspective of the submitter nor the responsiveness of the registry. This field does not consider the ability of insiders (i.e., people with private relationships to the maintainers) to effect change.
Public Version-Controlled Data
This field denotes if the registry stores its data in a publicly available version control system, such as GitHub or GitLab.
Issue Tracker
This field denotes the public issue tracker for issues related to the code and data of the repository.
Review Team
This field denotes whether the registry's reviewers/moderators for external contributions are known. If there is a well-defined, maintained listing, then it can be marked as public. If it can be inferred, e.g., from reading the commit history on a version control system, then it can be marked as inferrable. A closed review team, e.g., like that of Identifiers.org, can be marked as private. Resources that do not accept external contributions can be marked with N/A. An unmoderated registry like Prefix.cc is marked with 'democratic'.
Scope
This field denotes the scope of prefixes which the registry covers. For example, some registries are limited to ontologies, some have a full scope over the life sciences, and some are general purpose.
Status
This field denotes the maintenance status of the repository. An active repository is still being maintained and is also responsive to external requests for improvement. An unresponsive repository is still being maintained in some capacity but is not responsive to external requests for improvement. An inactive repository is no longer being proactively maintained (though it may receive occasional patches).
The semantic web and ontology communities are bound to the use of IRIs as identifiers and therefore are very
interested in the interconversion between compact identifiers (i.e., CURIEs) and IRIs. While the Bioregistry
provides many tools for one-way conversion from CURIEs to IRIs, there are several related packages that help
parse CURIEs from IRIs:
The @geneontology/dbxrefs Node.js package
translates CURIEs into URLs using the Gene Ontology Registry.
The curie-util-py Python package more generally
loads JSON-LD files to convert between IRIs and CURIEs.
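To make the interconversion concrete, the sketch below compresses an IRI into a CURIE and expands it back using a small prefix map. It does not use either package listed above; the two-entry map and the function names are assumptions for illustration, and longest-match selection is one reasonable way to handle overlapping URI prefixes.

```python
# Illustrative prefix map; a real one would be loaded from a registry or a JSON-LD context.
PREFIX_MAP = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
}


def compress(iri: str):
    """Return a CURIE for the IRI using the longest matching URI prefix, else None."""
    best = None
    for prefix, uri_prefix in PREFIX_MAP.items():
        if iri.startswith(uri_prefix) and (best is None or len(uri_prefix) > len(best[1])):
            best = (prefix, uri_prefix)
    if best is None:
        return None
    return f"{best[0]}:{iri[len(best[1]):]}"


def expand(curie: str):
    """Return the IRI for a CURIE, or None if its prefix is not in the map."""
    prefix, _, local_id = curie.partition(":")
    uri_prefix = PREFIX_MAP.get(prefix)
    return None if uri_prefix is None else uri_prefix + local_id
```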