Introduction

Information on the functions and physicochemical qualities of biological molecules, such as chemical compounds and gene products, is essential not only for elucidating and recognizing biological phenomena but also for developing various biobased products, for example, drugs, foods, and materials. We studied a simple method to collect and leverage such information in order to provide a rich research environment for researchers, developers, and engineers. We investigated the interconnection of biological knowledge on chemical compounds, drugs, gene products, diseases, and biological phenomena. The goal was to retrieve reliable information on chemical compounds, drugs, and gene products using knowledge graphs (KGs) developed from biological ontologies and Resource Description Framework (RDF) data.

We developed NBDC NikkajiRDF from the Japan Chemical Substance Dictionary (Nikkaji) [1], which is one of the largest databases of chemical compounds in Japan [2, 3]. Nikkaji includes 3.5 million chemical compounds, of which 6,454 have at least one of the 694 application examples (e.g., “hypotensive drug,” “artificial colorant”). NikkajiRDF uses InChI and InChIKey as unique chemical identifiers. InChI, developed by the International Union of Pure and Applied Chemistry and the National Institute of Standards and Technology, is a non-proprietary identifier of chemical compounds [4], and InChIKey is a hashed version of the full InChI. InChI/InChIKey simplifies mapping between chemical database IDs and facilitates the collection of corresponding chemicals. NikkajiRDF uses the standard ontologies adopted in PubChem [5] and ChEMBL [6], including the Chemical Information Ontology [7] and the Semanticscience Integrated Ontology (SIO) [8]. Consequently, users can perform SPARQL searches with these ontologies (Fig. 1). NikkajiRDF links its chemical compounds to those in more than 30 other databases that share the same InChIKey. Following the UniChem [9] approach, we created RDF triples that link these compounds using skos:closeMatch. Users can download the RDF data from the Life Science Database Archive [2] and NBDC RDF Portal [3] websites. SPARQL searches can be performed using the endpoint [10].
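The InChIKey-based linking described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout, the placeholder IDs and keys, and the function name `close_match_triples` are all assumptions made for illustration. It only shows how entries that share an InChIKey could be connected with skos:closeMatch.

```python
from collections import defaultdict

# Hypothetical records (database, local ID, InChIKey); the IDs and keys are
# placeholders for illustration, not real database entries.
RECORDS = [
    ("nikkaji", "N-0001", "AAAAAAAAAAAAAA-BBBBBBBBBB-N"),
    ("chebi", "C-0001", "AAAAAAAAAAAAAA-BBBBBBBBBB-N"),
    ("chembl", "M-0001", "AAAAAAAAAAAAAA-BBBBBBBBBB-N"),
    ("chebi", "C-0002", "CCCCCCCCCCCCCC-DDDDDDDDDD-N"),
]


def close_match_triples(records, source="nikkaji"):
    """Link each source entry to entries of other databases with the same InChIKey."""
    by_key = defaultdict(list)
    for database, local_id, inchikey in records:
        by_key[inchikey].append((database, local_id))

    triples = []
    for entries in by_key.values():
        subjects = [e for e in entries if e[0] == source]
        others = [e for e in entries if e[0] != source]
        for s_db, s_id in subjects:
            for o_db, o_id in others:
                triples.append((f"{s_db}:{s_id}", "skos:closeMatch", f"{o_db}:{o_id}"))
    return sorted(triples)


triples = close_match_triples(RECORDS)
```

In this toy input, only the first InChIKey is shared with a NikkajiRDF entry, so the second ChEBI record produces no link.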

Fig. 1 Integration of NikkajiRDF with major databases of chemical compounds using InChI/InChIKey

Interlinking Ontology for Biological Concepts (IOBC), previously referred to as the “Refined JST thesaurus” [11], contains approximately 80,000 biological concepts, including biological phenomena, diseases, molecular functions, gene products, chemical compounds, drugs, and medical procedures. It also contains approximately 20,000 related concepts in basic chemistry and environmental science [12]. The concepts are structured by “subclass of” and 35 additional relations, for example, “has function,” “has role,” “has quality,” and “is participant in.” Each concept is labeled in both English and Japanese. The ontology can be browsed and downloaded from its BioPortal [13] page [12], and a SPARQL endpoint has also been prepared [14].

Information on the functions/roles/applications of chemical compounds, drugs, and gene products is crucial in developing pharmaceutical products and discovering new materials for medical treatment. NikkajiRDF contains a large number of chemical compounds identified by InChI/InChIKey; however, it lacks information on their functions/roles/applications. In contrast, IOBC covers various biological phenomena, including diseases, as well as chemical compounds, drugs, and gene products. However, these items lack unique identifiers, such as InChI/InChIKey and protein IDs (e.g., UniProtKB accession number [15]), which enable easy mapping to biological molecules and drugs in other data resources. These data sources should therefore be combined to efficiently collect the functions/roles/applications.

In addition to information on chemical compounds [16], this study aimed to collect and interconnect biochemical and genomic knowledge to find information on drugs and biological molecules, such as gene products, by combining NikkajiRDF, IOBC, and other open-source data. Using ontological knowledge and unique identifiers, namely InChI/InChIKey, UniProtKB accession numbers, and GeneIDs, helps infer the functions/roles/applications of a larger number of chemical compounds and gene products.

The rest of the paper is organized as follows: Sect. 2 reviews related works that describe representative open-source knowledge and ontologies to collect information on the functions/roles/applications of biological molecules. Section 3 describes the inference of the chemical compounds’ functions/roles/applications through combinations of NikkajiRDF, ChEBI, and IOBC. Section 4 presents a method of creating KGs from IOBC and extending the KGs using existing external databases, thesauri, and ontologies. We also demonstrate the inference of chemical compounds and gene products in biological phenomena and diseases using the KGs. Section 5 summarizes our conclusions and discusses future work.

Related Works

ChEBI is a major chemical database and ontology [17] of approximately 90,000 chemical compounds, identified through InChI and InChIKey, with 1,000 roles and applications. ChEBI is frequently used to annotate and classify chemical compounds through InChI/InChIKey in various databases, such as PubChem and ChEMBL. However, ChEBI covers far fewer chemical compounds than other chemical databases such as NikkajiRDF, which contains information on approximately 3.5 million chemical compounds. Thus, it is necessary to prepare knowledge bases and establish a method to infer the functions/roles/applications of many chemical compounds.

DBpedia is a project that extracts structured information from Wikipedia [18] using RDF [19]. Wikidata is a knowledge base that allows every user to extend and edit stored information [20]. DBpedia and Wikidata are widely used for cross-domain knowledge and have recently been attempting to integrate chemical information [19, 21]. DBpedia and Wikidata contain information on approximately 18,000 and 150,000 chemical compounds, respectively. However, these numbers are smaller than those of NikkajiRDF, PubChem, and ChEMBL.

From the DBpedia [22] and Wikidata [23] public SPARQL endpoints, users can collect information on biological and chemical functions/roles/applications by performing SPARQL queries. However, DBpedia uses only annotation properties, “dcterms:subject” and “rdfs:seeAlso,” to describe the information, instead of specific properties such as “has function (sio:SIO_000225)” and “has role (sio:SIO_000228).” For example, the roles and applications of “Caffeine,” such as “Anxiogenics” and “Effect_of psychoactive_drugs_on_animals,” are described as objects of “dcterms:subject” and “rdfs:seeAlso,” respectively. Moreover, the objects of these properties include information other than functions/roles/applications, for example, categories.

This shows that users must select the relevant information manually. Wikidata faces the same problem because it uses the property “wdt:P31 (instance of)” to describe the functions/roles/applications of compounds, and the objects of this property include information other than functions/roles/applications. Hence, we are incorporating specific properties, “has function” and “has role,” into Wikidata to describe this information. At present, therefore, DBpedia and Wikidata are not well suited to the efficient collection of functions/roles/applications of chemical compounds.

ChEMBL and PubChem are major chemical compound databases that offer downloadable RDF data. ChEMBL provides a public SPARQL endpoint for collecting the original data. Currently, PubChem does not provide a public SPARQL endpoint; however, the PubChem Classification Browser [24] and PUG REST [25] are available to search for and collect information.

UniProt is a large protein knowledge base that provides information on functions, subcellular locations, molecular interactions, structures, amino acid sequences, similar proteins, and so on. Many biological databases adopt the UniProtKB accession number as their protein identifier. The data are available from the public SPARQL endpoint [26], and the RDF data are effective in exploring life sciences.

DisGeNET [27] is a database that contains gene–disease associations, collected by expert human curation and through text-mining methods from many public data sources and the scientific literature. We can retrieve RDF data of the gene–disease associations from the SPARQL endpoint under the Open Database License [28].

Open Pharmacological Concepts Triple Store (Open PHACTS) [29], Bio2RDF [30], Chem2Bio2RDF [31], and RIKEN MetaDatabase [32] are databases for research and development to collect information on chemical compounds and gene products, integrated using semantic technologies such as RDF. Researchers and engineers can retrieve and leverage innovative drug discovery information from these databases.

Open PHACTS is an open innovation platform for drug discovery. Using semantic approaches, several linked open data, such as ChEMBL, Human Disease Ontology [33], and WikiPathways [34], are integrated. Information on chemical targets, assays, biological activities, and diseases is retrieved using keyword search, API, and Apps. Data are provided in various formats: RDF/Turtle, JSON, and XML. However, the Open PHACTS Linked Data API and associated services were closed in March 2019.

Bio2RDF applies semantic web technology to integrate life-science databases. Public databases, such as NCBI’s Entrez Gene [35], Online Mendelian Inheritance in Man (OMIM) [36], Kyoto Encyclopedia of Genes and Genomes (KEGG) [37], and DrugBank [38], are converted to the RDF format through RDF conversion programs from XML, SQL, and TEXT. On the project page, RDF data are accessible from the SPARQL endpoint [39].

Chem2Bio2RDF is a project that collects information on chemical compounds/drugs and proteins/genes through the chemogenomics approach. The datasets include information on protein–protein interactions, diseases, side effects, and literature, linked to Bio2RDF, and Linked Open Drug Data [40]. This was designed for polypharmacology, pathway inhibition, and adverse drug reaction analysis. At present, Semantic Link Association Prediction [41] for drug target prediction based on Chem2Bio2RDF datasets is available; however, its SPARQL endpoint is unavailable.

RIKEN MetaDatabase is an RDF platform containing RIKEN’s original databases, bioresources (e.g., FANTOM [42], mouse resources [43]), and external databases (e.g., PDB [44]). Using standard ontologies (e.g., SIO and Phenotype and Trait Ontology [45]), users can collect the metadata linked to other datasets.

Compared with the above-mentioned projects, our IOBC-based datasets have the following features: (1) They contain relationships between instance-level information on various types of life-science knowledge (e.g., the relationship between a biological phenomenon, Fibrinolysis, and the succeeding disease, Fibrinolytic purpura, in Fig. 4 in Sect. 4.1). (2) IOBC serves as a hub for integrating various life-science concepts, such as chemical compounds, gene products, biological phenomena, and diseases (Fig. 5 in Sect. 4.2). (3) Using the ontological structures and KGs (Figs. 6, 7, 8, 9, and 10 in Sect. 4.3), new facts (e.g., biological molecular functions) can be inferred from the information integrated through IOBC and other ontologies.

Fig. 2 Inference of the roles and applications of NikkajiRDF’s chemicals using ChEBI. It is inferred that “Aspirin” has “non-steroidal anti-inflammatory drug” as an application and “Brønsted acid” as a chemical role. This diagram is visualized on a web service: https://www.kanzaki.com/works/2009/pub/graph-draw. Prefix chebi: http://purl.obolibrary.org/obo/

Inference of Chemicals’ Functions/Roles/Applications Using Ontological Structure

Inference of Chemical Compounds for Functions/Roles/Applications Using ChEBI and NikkajiRDF

In this section, we infer the functions/roles/applications of chemical compounds using linked open data and ontologies. NikkajiRDF has approximately 3.5 million chemical compounds; however, most of them lack application examples. We therefore integrated NikkajiRDF with ChEBI using InChIKey to annotate NikkajiRDF chemical compounds with ChEBI’s roles and applications. Beforehand, ChEBI [46] and NikkajiRDF data [47] were stored in a triple store to enable SPARQL execution. Through this direct integration, 280 ChEBI role and application terms could be assigned to 2,926 NikkajiRDF chemical compounds. Next, ChEBI’s roles/applications were inferred for NikkajiRDF’s chemical compounds using ChEBI’s ontological structure. The following SPARQL query was performed.

figure a

Figure 2 shows the inference of the roles and applications of NikkajiRDF’s chemical compound “Aspirin” using ChEBI. The inference process is as follows: (1) ChEBI chemical compounds that have the same InChIKey as NikkajiRDF chemical compounds were found using the property skos:closeMatch in the NikkajiRDF structure (e.g., ChEBI’s “acetylsalicylic acid” and NikkajiRDF’s “Aspirin”). (2) The upper chemical compounds (e.g., oxoacid) were found using the property rdfs:subClassOf in ChEBI’s structure. (3) The roles/applications of the upper chemical compounds were collected and assigned to the lower chemical compounds in the ChEBI structure (e.g., “Brønsted acid” to “Aspirin”). Thus, chemical compounds inherit the roles/applications of their ontologically upper chemical compounds through the ChEBI structure.
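The three inference steps can be sketched in Python over a toy in-memory graph. This is an illustration under stated assumptions, not the SPARQL query used in the study: the dictionaries `CLOSE_MATCH`, `SUBCLASS_OF`, and `HAS_ROLE` and the function `infer_roles` are hypothetical, and the hierarchy is flattened so that “acetylsalicylic acid” sits directly under “oxoacid,” whereas the real ChEBI hierarchy contains intermediate classes.

```python
# Toy data for illustration only; the real ChEBI hierarchy is deeper and the
# placement of each role on a particular class is an assumption.
CLOSE_MATCH = {"nikkaji:Aspirin": "chebi:acetylsalicylic acid"}
SUBCLASS_OF = {"chebi:acetylsalicylic acid": {"chebi:oxoacid"}}
HAS_ROLE = {
    "chebi:acetylsalicylic acid": {"non-steroidal anti-inflammatory drug"},
    "chebi:oxoacid": {"Brønsted acid"},
}


def infer_roles(nikkaji_compound):
    """Return roles of the matched ChEBI class and of all its upper classes."""
    start = CLOSE_MATCH.get(nikkaji_compound)  # step 1: skos:closeMatch
    if start is None:
        return set()
    roles, seen, stack = set(), set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        roles |= HAS_ROLE.get(node, set())        # step 3: collect roles
        stack.extend(SUBCLASS_OF.get(node, ()))   # step 2: climb rdfs:subClassOf
    return roles


aspirin_roles = infer_roles("nikkaji:Aspirin")
```

With this toy graph, “Aspirin” receives both its directly asserted application and the chemical role inherited from the upper class, mirroring the example in Fig. 2.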

At least one of 1,062 ChEBI role and application terms was assigned to each of 18,386 NikkajiRDF chemical compounds through the ChEBI ontological structure. This indicates that the number of NikkajiRDF chemical compounds annotated with roles/applications increased approximately threefold after inference, because 6,454 chemical compounds originally had at least one of the 694 application examples corresponding to ChEBI’s roles/applications. This result is downloadable [48].

Inference of Chemical Compounds for Functions/Roles/Applications Using IOBC and NikkajiRDF

As mentioned previously, NikkajiRDF has approximately 3.5 million chemical compounds. In contrast, IOBC has 17,180 organic chemicals, inorganic chemicals, and drugs, which do not have InChI/InChIKey. A total of 5,781 of these chemical compounds have information on biological and chemical functions (e.g., Apoptosis [iobc:200906039143928462]), roles (e.g., antirheumatic drug [iobc:200906008284879667]), and chemical involvements in biological phenomena and diseases (e.g., hepatitis B [iobc:200906000547096041]). In particular, information on the involvement of chemical compounds in biological phenomena is unique to IOBC.

Fig. 3 Inference of the biological and chemical functions, roles, and chemical involvements in the biological phenomena of IOBC’s chemicals derived from NikkajiRDF. It is inferred that “Dopamine” would be involved with “Catecholamine cardiomyopathy” with which the upper class “catecholamine” is involved. This diagram is visualized on a web service: https://www.kanzaki.com/works/2009/pub/graph-draw

We implemented the Lexical OWL Ontology Matcher (LOOM) algorithm [49] to match the labels of the NikkajiRDF and IOBC chemical compounds. LOOM is a simple lexical algorithm for producing mappings. It takes two ontologies represented in OWL and produces pairs of related concepts from the two ontologies. The label-comparison function removes delimiters such as spaces, underscores, and parentheses. Then, it uses an approximate string comparison technique that allows a mismatch of at most one character in strings with length greater than four and no mismatches in shorter strings [49]. The LOOM algorithm is widely used in life-science resources such as BioPortal because it achieves high precision in its mappings [49] and is easy to implement.
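The label comparison described above can be sketched as follows. This is a simplified reading of the description, not the LOOM reference implementation: the function names are hypothetical, lowercasing is an added assumption, “one mismatch” is interpreted here as an edit distance of at most one (substitution, insertion, or deletion), and the length threshold is applied to the shorter of the two normalized labels.

```python
import re


def normalize(label):
    """Remove spaces, underscores, and parentheses; lowercasing is an assumption."""
    return re.sub(r"[\s_()]", "", label).lower()


def within_one_edit(a, b):
    """True if a and b differ by at most one substitution, insertion, or deletion."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # ensure a is the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]          # one extra character in b at position i


def loom_match(label1, label2):
    """Exact match for short labels; one mismatch allowed when both exceed length four."""
    a, b = normalize(label1), normalize(label2)
    if min(len(a), len(b)) > 4:
        return within_one_edit(a, b)
    return a == b
```

For example, `loom_match("Tranexamic acid", "tranexamic_acid")` holds after delimiter removal, whereas two four-character labels such as "HMDP" and "HMDA" must match exactly and therefore do not.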

In our project, two life-science experts reviewed the algorithm results. If they found false-positive errors, they removed them. If their opinions were divided, they discussed them, and selected one of the opinions. In contrast, we did not evaluate false-negative errors of the mapping, because acquiring the information to calculate them was difficult.

As a result of executing the LOOM algorithm, a total of 10,576 NikkajiRDF chemical compounds were mapped to IOBC. Two experts reviewed the results of the mapping algorithm and found 68 false-positives, which were subsequently removed. In this case, there was no difference of opinion between the experts. The precision of the LOOM algorithm was 0.99 (10,508/10,576). For example, NikkajiRDF contained two entries whose labels were “HMDP,” namely, sti:200907088719956119 and sti:200907015329956587, whereas IOBC contained one entry whose label was “HMDP,” iobc:200906046710073151. In this case, the experts confirmed that iobc:200906046710073151 corresponded to sti:200907015329956587 through database descriptions such as their structural information.

Euzenat and Shvaiko [50] have classified ontology matching (mapping) algorithms into two types: element-level techniques and structure-level techniques. Moreover, they have subclassified the former into five categories including string-based techniques and formal resource-based techniques; in contrast, the latter has been subclassified into four categories including graph-based techniques and taxonomy-based techniques. Harrow [51] demonstrated some applications of ontology mapping in the fields of biomedical science.

We focused on taxonomy-based techniques, which utilize information on the upper concepts in ontological structures. We conducted a preliminary experiment to compare the performance of the LOOM algorithm alone with that of the combination of LOOM and taxonomy-based techniques to gauge any improvement in the ontology mapping. If chemical compounds with defined structures in NikkajiRDF and IOBC contain basic chemical structures, such as phosphonic acid and polynuclear aromatic compounds, the chemical compounds can be related to these basic chemical structures using skos:broader. In this preliminary experiment, we examined whether the 68 false-positives produced by the LOOM algorithm alone could be reduced by using not only label information but also basic chemical structures.

Consequently, we confirmed that using both chemical structures and label information decreased the number of errors to 59 by removing 9 false-positives, which indicates an improved precision. For example, as mentioned earlier, two NikkajiRDF chemical compounds have the label HMDP, namely sti:200907088719956119 and sti:200907015329956587, and one IOBC chemical compound has the same label, namely iobc:200906046710073151. Both sti:200907015329956587 and iobc:200906046710073151 have the common basic chemical structure “phosphonic acid”; in contrast, sti:200907088719956119 does not have this structure. Therefore, we confirmed that the NikkajiRDF chemical compound “sti:200907015329956587” and the IOBC chemical compound “iobc:200906046710073151” were the same chemical compound. The result obtained by this chemical compound mapping was equivalent to that derived from expert manual curation based on the structural information on these chemical compounds.
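The HMDP case can be expressed as a small filtering step on top of label matching. The sketch below is illustrative only: the `BROADER` dictionary stands in for skos:broader links, the function name `filter_by_shared_structure` is hypothetical, and the empty set for sti:200907088719956119 simply encodes that it lacks the “phosphonic acid” structure, since its actual basic structures are not given in the text.

```python
# skos:broader links to basic chemical structures, taken from the HMDP example.
# The empty set only records that no "phosphonic acid" link exists.
BROADER = {
    "sti:200907015329956587": {"phosphonic acid"},
    "sti:200907088719956119": set(),
    "iobc:200906046710073151": {"phosphonic acid"},
}


def filter_by_shared_structure(candidates, target, broader):
    """Keep label-matched candidates that share a basic structure with the target."""
    target_structures = broader.get(target, set())
    return [c for c in candidates if broader.get(c, set()) & target_structures]


label_matches = ["sti:200907088719956119", "sti:200907015329956587"]
kept = filter_by_shared_structure(label_matches, "iobc:200906046710073151", BROADER)
```

Only the candidate that shares “phosphonic acid” with the IOBC entry survives the filter, which matches the outcome of the expert curation described above.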

Furthermore, by appropriately leveraging the ontology mapping algorithms mentioned above for biomedical concepts, we would be able to discover new relations among biomedical concepts, such as equivalent and overlapping relations, which cannot be identified using only string comparison techniques such as LOOM. For example, the ontology mapping system AgreementMakerLight (AML) [52] implements several matching algorithms: (1) “The LexicalMatcher,” which finds literal full name matches between the lexicon entries of two ontologies, (2) “The ThesaurusMatcher,” which finds literal full name matches involving synonyms inferred from an automatically generated thesaurus, and (3) “The XRefMatcher,” which uses cross-reference information among data sources. In AML’s matching tasks using anatomy, phenotype, and disease datasets, the authors demonstrated that not only precision but also recall and F-measure were improved simply by optimizing the algorithm parameters or combining several algorithms [52].

Furthermore, using the IOBC ontological structure, at least one of 432 biological and chemical functions, roles, and chemical involvements in biological phenomena could be inferred for 5,038 extended chemical compounds (Fig. 3 and Table 1). Inference using the ontology enabled the assignment of more chemical compound functions, roles, and involvements in biological phenomena than was possible without the ontology. For the case of “is participant in” with Inference: Yes in Table 1, the SPARQL query and result are available in [53].

Table 1 Inference results of IOBC chemical compounds’ functions, roles, and involvements in biological phenomena of the inheritance from upper-class chemical compounds

Inference of the Chemicals’ Functions/Roles/Applications using KGs

Creating KGs from IOBC

In previous works [54], we inferred functions of gene products and subcellular components using IOBC’s ontological structure: “is-a” and “whole-part” relationships. The inference examples included (1) the inheritance of a function “biological transport” of “ABC transporter” to the lower-class “P-glycoprotein,” and (2) the inheritance of a function “RNA splicing” of “splicing factor” to the whole structure “spliceosome.”

Fig. 4 A part of the Fibrinolysis network. This graph is visualized using Cytoscape (http://www.cytoscape.org/)

Aside from the “is-a” and “whole-part” relationships, we leverage more than 30 relations within IOBC for the functions/roles/applications/qualities of chemical compounds, drugs, and gene products. The primary focus was on the relationships between a preceding biological phenomenon (e.g., Fibrinolysis [iobc:200906057747871335]) and the succeeding disease (e.g., Fibrinolytic purpura [iobc:200906056051568500]). These relationships are described using the property “precedes [xkos:precedes]” within IOBC (Fig. 4). Gene products that regulate or promote a biological phenomenon preceding a disease are considered potential candidates for disease-related gene products. IOBC has 35 properties, such as “has function,” “precedes,” and “is participant in,” to describe the relationships between the concepts [11, 33]. This makes it possible to discover potential candidate genes precisely by performing a SPARQL search.

In another study [55], we developed two KGs from IOBC: the Fibrinolysis network (Fig. 4) [56] and the Bone metabolic turnover network (BMT network) [57]. A SPARQL query was performed to create the KGs. Each KG was constructed as the collection of concepts connected with “Fibrinolysis” or “BMT [iobc:200906094913122330],” respectively, within three steps. Next, we stored them in a triple store. Then, we inferred associations between chemical compounds and diseases from both KGs.
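Collecting the concepts connected to a seed concept within three steps amounts to a bounded breadth-first traversal. The following Python sketch illustrates the idea on toy triples; it is not the SPARQL query used to build the KGs. The function name `neighborhood`, the placeholder concepts A to D, and the choice to follow edges in both directions are assumptions; only the “Fibrinolysis precedes Fibrinolytic purpura” triple comes from the text.

```python
from collections import deque


def neighborhood(triples, seed, max_steps=3):
    """Return the triples whose subject and object lie within max_steps of the seed."""
    adjacency = {}
    for s, _, o in triples:
        adjacency.setdefault(s, set()).add(o)  # edges are followed in both
        adjacency.setdefault(o, set()).add(s)  # directions (an assumption)

    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not expand beyond the step limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)

    return [t for t in triples if t[0] in distance and t[2] in distance]


# Toy triples; A, B, C, and D are placeholder concepts.
TRIPLES = [
    ("Fibrinolysis", "precedes", "Fibrinolytic purpura"),
    ("A", "is participant in", "Fibrinolysis"),
    ("B", "subclass of", "A"),
    ("C", "subclass of", "B"),
    ("D", "subclass of", "C"),
]

subgraph = neighborhood(TRIPLES, "Fibrinolysis", max_steps=3)
```

In this toy input, concept D is four steps away from “Fibrinolysis,” so the triple that mentions it is left out of the resulting subgraph.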

In Sect. 4, in addition to chemical compounds, we inferred the involvement of gene products in biological processes and diseases using the Fibrinolysis network and the BMT network. The involvement of any chemical compound or gene product in diseases can be inferred using information on the diseases that biological phenomena precede.

Fig. 5 Data schema of the IOBC’s KG extended by other data sources

Extending the KGs using existing databases, thesauri and ontologies

IOBC contains various biological concepts, such as chemical compounds, gene products, proteins, biological processes, and diseases. However, these concepts did not have sufficient external links to other databases, thesauri, and ontologies. Thus, for the Fibrinolysis network and the BMT network, which comprise 181 IOBC concepts in total, we executed the LOOM algorithm [49] (see Sect. 3.2) with a SPARQL search to match the labels and synonyms of resources between IOBC and major RDF data (e.g., ChEBI, PubChem, ChEMBL, Medical Subject Headings (MeSH) [58], UniProt, and Gene Ontology (GO) [59]). Two experts checked the results and manually removed 1 false-positive. In this case, there was no difference of opinion between the experts. The precision of the LOOM algorithm was 0.99 (461/462). From the true-positive data, we created triples between IOBC and other RDF resources using the property skos:exactMatch. In the triples, we used as resource identifiers both the original URIs (e.g., http://purl.uniprot.org/uniprot/P02675) and the corresponding URIs provided by Identifiers.org [60] (e.g., http://identifiers.org/uniprot/P02675). Next, we stored them in the triple store that contained IOBC.
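The use of both URI forms in the skos:exactMatch triples can be sketched as below. The two URI patterns follow the examples given in the text; the function names and the IOBC subject `iobc:EXAMPLE` are hypothetical placeholders, because the IOBC concept matched to P02675 is not stated here.

```python
def uniprot_uris(accession):
    """Return the original UniProt URI and the corresponding Identifiers.org URI."""
    return [
        f"http://purl.uniprot.org/uniprot/{accession}",
        f"http://identifiers.org/uniprot/{accession}",
    ]


def exact_match_triples(iobc_concept, accession):
    """Create one skos:exactMatch triple per URI form for a confirmed mapping."""
    return [(iobc_concept, "skos:exactMatch", uri) for uri in uniprot_uris(accession)]


# "iobc:EXAMPLE" is a placeholder subject, not a real IOBC identifier.
example_triples = exact_match_triples("iobc:EXAMPLE", "P02675")
```

Emitting both forms lets a query match the resource whichever URI style an external dataset uses.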

We collected relationships between GO concepts, such as biological processes, and the related human proteins provided by UniProt from AmiGO 2 [61]. From the collected relationships, we created triples using the property “has function” [sio:SIO_000225] (e.g., “uniprotkb:P05155 [SERPING1]” “has function” “Fibrinolysis [go:GO_0042730]”). For the resource URIs, we used both the original URIs and the URIs provided by Identifiers.org. Finally, we stored the triples in the triple store. Consequently, the KGs consisted of IOBC concepts and the corresponding concepts derived from other RDF data (e.g., UniProt) (Fig. 5). By performing a federated SPARQL search on the endpoints (e.g., the UniProt SPARQL endpoint [26]), we interconnected the IOBC KGs and other RDF data.

Inference of Chemicals for Functions/Roles/Applications Using KG

In the extended KGs, the Fibrinolysis network, and the BMT network, we performed the following SPARQL search to infer chemical compounds and gene products’ involvement in diseases.

figure b

Fig. 6 Associations between chemicals, namely chemical compounds, drugs and gene products, and diseases in the Fibrinolysis network (1/2)

Fig. 7 Associations between chemicals, namely chemical compounds, drugs and gene products, and diseases in the Fibrinolysis network (2/2)

Consequently, we discovered 7 PubChem substances, 5 ChEBI compounds and drugs, 13 MeSH chemicals, 325 UniProt proteins (e.g., uniprotkb:P05155), and 7 Complex Portal complexes strongly involved in 16 kinds of diseases (e.g., Fibrinolytic purpura) in the Fibrinolysis network (Figs. 6 and 7) based on the ontological structures and relationships (e.g., rdfs:subClassOf). In the BMT network (Figs. 8, 9, and 10), we discovered 39 PubChem substances, 1 ChEMBL compound, 2 ChEBI compounds and drugs, 51 MeSH chemicals, 377 UniProt proteins (e.g., uniprotkb:Q99572), and 6 RNAcentral ncRNAs strongly involved in 15 kinds of diseases (e.g., Osteolysis).

We discovered 5 chemical compounds related to Fibrinolytic purpura in the Fibrinolysis network, including anagrelide (chebi:CHEBI_142290), anagrelide hydrochloride (chebi:CHEBI_55345), 6-aminohexanoic acid (chebi:CHEBI_16586), and tranexamic acid (chebi:CHEBI_48669). However, we could not confirm these relationships in Bio2RDF, Chem2Bio2RDF, or RIKEN MetaDatabase. The chemical compounds and gene products discovered are potential candidates, and future studies should validate these inferred results biologically and clinically.

Table 2 PubChem SIDs, ChEBI IDs, and MeSH Unique IDs for the disease-related chemical compounds inferred from the Fibrinolysis network
Table 3 Information on the therapeutic uses and clinical trials for disease-related chemical compounds inferred from the Fibrinolysis network (1/2)
Table 4 Information on the therapeutic uses and clinical trials for disease-related chemical compounds inferred from the Fibrinolysis network (2/2)

Furthermore, using the Comparative Toxicogenomics Database (CTD) [62] and PubChem, we investigated whether the disease-related chemical compounds inferred in the Fibrinolysis network (Figs. 6 and 7) have been authorized as drugs for the inferred diseases. We confirmed that none of the disease-related chemical compounds had been authorized as drugs for the inferred diseases, such as Fibrinolytic purpura (Fig. 6), and that no clinical trials had been conducted for them. Table 2 shows the PubChem SIDs, ChEBI IDs, and MeSH Unique IDs for the disease-related chemical compounds inferred from the Fibrinolysis network. Moreover, life-science experts manually collected the chemical identifiers that link to information on therapeutic uses and clinical trials, such as PubChem CIDs, DrugBank IDs, and CTD IDs, from the internal and external links of the PubChem SIDs, PubChem CIDs, and MeSH Unique IDs, respectively. Tables 3 and 4 summarize, for each disease-related chemical compound, the information on PubChem therapeutic uses and clinical trials, whether the compound is categorized as Approved (A) or Investigational (I) in DrugBank, and whether it is categorized as Therapeutic (T) or Marker/Mechanism (M) in the CTD (see Table 4 legend) for at least one disease other than the inferred diseases.

Chemical compounds that have information on PubChem therapeutic uses and are categorized as “A” in DrugBank or “T” in the CTD have already been used in medical treatment. Thus, if their medical efficacy for the inferred diseases is confirmed, we expect drug development costs and time to decrease because human toxicity tests and pharmacokinetic studies have already been performed on these chemical compounds. Such information about the disease-related chemical compounds that the KG infers, that is, drug candidates, would be useful for drug repositioning, which refers to the development of existing drugs for new medical indications.

Some diseases in IOBC contain external links: MeSH, International Statistical Classification of Diseases and Related Health Problems, 10th Revision [63], OMIM, National Drug File—Reference Terminology [64], National Cancer Institute Thesaurus (NCIt) [65], and DisGeNET. Using the Fibrinolysis network (Fig. 7), we found 325 UniProt proteins as thromboembolism-related gene products.

We performed the following federated SPARQL search via the DisGeNET SPARQL endpoint to integrate information on gene–disease associations in DisGeNET with the KG (Fig. 11).

figure c

Consequently, we discovered 13 disease-related proteins suggested by both the IOBC KG and DisGeNET (e.g., uniprotkb:P00734) [66]. We also found 18 disease-related proteins suggested only by DisGeNET (e.g., uniprotkb:P08519) and 312 disease-related proteins suggested only by the IOBC KG (e.g., uniprotkb:Q9P126). This suggests that the gene products supported by both the IOBC KG and DisGeNET may be stronger disease candidates than those suggested by only one of the two sources.
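The comparison of the two sources is a set partition, which can be sketched in Python as follows. This is an illustration rather than the federated SPARQL query itself: the function name `compare_sources` is hypothetical, and the input lists contain only the three example accessions quoted above instead of the full result sets.

```python
def compare_sources(kg_proteins, disgenet_proteins):
    """Split disease-related proteins by the source(s) that suggest them."""
    kg, disgenet = set(kg_proteins), set(disgenet_proteins)
    return {
        "both": kg & disgenet,
        "kg_only": kg - disgenet,
        "disgenet_only": disgenet - kg,
    }


# Only the example accessions mentioned in the text; the real sets are larger.
kg_examples = ["uniprotkb:P00734", "uniprotkb:Q9P126"]
disgenet_examples = ["uniprotkb:P00734", "uniprotkb:P08519"]

partition = compare_sources(kg_examples, disgenet_examples)
```

The reported counts are consistent with such a partition: the 13 proteins found in both sources plus the 312 found only in the IOBC KG give the 325 UniProt proteins reported for the Fibrinolysis network.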

Fig. 8 Associations between chemicals, namely chemical compounds, drugs and gene products, and diseases in the BMT network (1/3)

Fig. 9 Associations between chemicals, namely chemical compounds, drugs and gene products, and diseases in the BMT network (2/3)

Fig. 10 Associations between chemicals, namely chemical compounds, drugs and gene products, and diseases in the BMT network (3/3)

Fig. 11 Interconnection between KG developed from IOBC and DisGeNET RDF using MeSH disease

Conclusions

The Semantic Automated Discovery and Integration is a framework that assists in extracting chemical information using SPARQL [67]. Machine learning on RDF data and KGs has also been used to find drug targets and predict side effects [68]. These results are being actively discussed; however, researchers without specialized knowledge and skills may find it difficult to prepare the execution environments for these methods.

We integrated biological knowledge on chemical compounds, gene products, biological processes, and diseases. We constructed KGs from NikkajiRDF and IOBC to facilitate the easy collection of biochemical and genomic information on the Internet, particularly information on the functions and roles of chemical compounds and gene products, as well as their involvement in biological processes, including diseases. Valuable biochemical and genomic data sources dispersed globally should be findable, accessible, interoperable, and reusable in accordance with the FAIR principles [69]. InChI/InChIKey, as a chemical identifier based on the steric structure, and other major identifiers in biological databases, thesauri, and ontologies, such as the UniProtKB accession number, are necessary for integrating chemical compounds and gene products among different data sources. A federated search on SPARQL endpoints, such as that of the NBDC RDF Portal, is also important. However, federated search from the public DBpedia SPARQL endpoint [22] to other SPARQL endpoints is currently unavailable.

We are evaluating the effectiveness of knowledge expansion and inference using KGs and ontologies in the field of bioresources. So far, we have confirmed that they assist in finding new bioresource usages. For example, using KGs created from IOBC, NikkajiRDF, and other data sources, we can discover that coumarin (sti:200907007165179824), efficiently produced by tobacco cells, is not only a chemical compound related to oxidative stress and plant defense responses [70] but is also used in fluorescent dyes (chebi:CHEBI_51121) and as an anticoagulant (snomedct:373307003).

In the future, the utilization of information on the interactions between chemical compounds, gene products, and metabolic and signal transduction pathways will facilitate more extensive and precise collection and prediction of chemical compounds’ and gene products’ associations with biological phenomena, along with the corresponding side effects. This will improve drug discovery, selection of effective medical treatments, and application of materials.