ERLKG

Entity Representation Learning and Knowledge Graph-based association analysis of COVID-19 through mining of unstructured biomedical corpora

 
 

Motivation

Scientists have to pore over hundreds of papers and articles in order to understand the behaviour of an unknown entity. This is time-consuming (it can take years) and, at the same time, depends heavily on the expertise of the researcher.

We instead developed a human-out-of-the-loop pipeline, ERLKG, that ingests scholarly articles and produces a list of the top 20 known entities most similar to the unknown entity (here, COVID-19). With the help of various Natural Language Processing (NLP) and Knowledge Graph (KG) techniques, this pipeline reduces work that would take years down to just a few minutes. Based on the characteristics of these related entities, scientists can get a better idea of how the unknown entity behaves.

NLP is powerful, right?


Key Contributions

  • Generic, end-to-end, human-out-of-the-loop pipeline to perform rapid association analysis of any biomedical entity with other existing entities from a corpus of the same domain. In this work we demonstrate our pipeline using COVID-19.
  • We benchmark multiple KG embedding techniques on the task of link prediction and demonstrate that simple embedding methods provide comparable performance on simple structured KGs.
  • We propose two datasets for intrinsic evaluation, namely COV19_729 and COV19_25. The datasets contain 729 and 25 entities, respectively, along with their types (chemical, protein or disease). Each entity has a corresponding physician rating (on a scale of 0 to 5) which measures its association with COVID-19.

Intuition

Objective: Extract biomedical entities related to COVID-19 from scientific, unstructured text.

Two main ideas helped us in coming up with a solution for the problem:

  • When you are on Facebook, you are connected with your friends, have pages that you like, and so on. When a new user joins the platform and gets connected with you, Facebook will recommend your friends to him/her, since the probability of you and the new user sharing a similar friend group is relatively high.
  • On a separate note, the old adage, "a man is known by the company he keeps", has been consistently used throughout ML, including NLP which led to the development of word2vec and other word embedding techniques.

Say we have an interconnected graph of existing biomedical entities (knowledge graph) of drugs, chemicals and proteins. If connections between COVID-19 and nodes in the graph (friends) can be correctly identified (based on the first idea), we can then realize a new approach to understand the novel coronavirus as well as find possible drugs for its treatment (based on the second idea).

But the assumption of a pre-existing graph of interconnected biomedical entities is wrong, as such a knowledge graph (KG) does not exist. We therefore have to build one ourselves, from scratch. Fortunately, Allen AI has released a dataset, the COVID-19 Open Research Dataset Challenge (CORD-19)[1], a vast collection of curated research papers specifically related to the coronavirus and a gold mine for our research problem.

The initial challenge comes in the form of developing said KG from CORD-19. To this end, we need to identify biomedical entities present in research articles (nodes of the graph) as well as the type of relationship between them (edges of the graph). This will allow us to construct a rich knowledge graph of different biomedical entities and their interactions. Formally, the first task of extracting biomedical entities from text is known as Named Entity Recognition (NER) whereas the second task of identifying the type of interactions is known as Relation Extraction (RE).

 
ERLKG_pipeline_bird_eye_view.png

Fig. 1: A broad overview of the developed pipeline, ERLKG

Methodology

ERLKGPipeline1_OLD.jpg

Fig. 2: The various modules of the ERLKG pipeline in detail

Dataset Preprocessing

CORD-19 requires pre-processing for downstream NLP tasks: tokenization at both the sentence (sentence tokenization) and word (word tokenization) level, as well as the addition of POS (Part of Speech) tags for richer per-token information. We used the spaCy library for this purpose.
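A minimal sketch of this step with spaCy (illustrative only; our actual pipeline configuration may differ, and POS tagging additionally needs a trained model such as en_core_web_sm, which a blank pipeline does not include):

```python
import spacy

# A blank English pipeline with a rule-based sentencizer is enough to
# demonstrate sentence and word tokenization without downloading a model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def preprocess(text):
    """Split raw text into sentences, each a list of word tokens."""
    doc = nlp(text)
    return [[token.text for token in sent] for sent in doc.sents]

sentences = preprocess("Remdesivir is an antiviral. It targets the viral polymerase.")
# Two sentences, each tokenized into words (punctuation becomes its own token).
```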

Named Entity Recognition

Proteins, chemicals and diseases are the biomedical entities of interest that need to be extracted from running text. We therefore use SciBERT[2] on the CORD-19 dataset to tag said entities. SciBERT is a BERT variant which has been trained on 1.14M full text scientific research papers of which 18% articles are from the computer science domain and 82% are from the broad biomedical domain. SciBERT is a perfect fit for this task since most CORD-19 papers belong to the biomedical domain.

In order to achieve better performance, we fine-tuned SciBERT on three different datasets, namely JNLPBA[3] (for proteins), the NCBI disease corpus[4] (for diseases) and CHEMDNER[5] (for chemicals).
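The fine-tuned taggers emit token-level labels; a small illustrative helper (not from the paper) shows how BIO-style tags are typically collapsed into the entity spans seen in Fig. 3:

```python
def bio_to_spans(tokens, tags):
    """Collapse token-level BIO tags into (entity text, entity type) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                spans.append(current)
            current = ([token], tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current[0].append(token)      # entity continues
        else:                             # outside any entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(" ".join(words), label) for words, label in spans]

entities = bio_to_spans(
    ["Chloroquine", "inhibits", "coronavirus", "replication"],
    ["B-CHEMICAL", "O", "B-DISEASE", "O"],
)
# entities == [("Chloroquine", "CHEMICAL"), ("coronavirus", "DISEASE")]
```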

Fig. 3: The input sentence gets tagged with disease and chemical mentions by the NER module. Here, “coronavirus” is a disease and the rest of the tagged entities are chemicals.


Relation Extraction

Based on the now-tagged CORD-19 corpus, we need to understand how the entities are related to one another. Accordingly, we connect them via edges and construct our Knowledge Graph (KG). We use SciBERT for this task as well, with only sentences containing more than one tagged entity being fed to the model.

We target only those entity pairs that have chemical-protein and chemical-disease relationships (protein-disease links can be found indirectly). To achieve better performance, we again fine-tune SciBERT, this time on the CHEMPROT[6] (chemical-protein) and BC5CDR[7] (chemical-disease) datasets, as they contain the said relationships.
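A hypothetical helper (names ours, not the paper's) illustrates the pairing step: from a sentence's tagged entities, build only the candidate pairs the RE classifier is asked to label, i.e. chemical-protein and chemical-disease:

```python
from itertools import product

def candidate_pairs(entities):
    """From (name, type) tuples, emit chemical-protein and chemical-disease
    candidate pairs; protein-disease links are derived indirectly later."""
    by_type = {"CHEMICAL": [], "PROTEIN": [], "DISEASE": []}
    for name, etype in entities:
        by_type[etype].append(name)
    pairs = []
    for other in ("PROTEIN", "DISEASE"):
        pairs.extend(product(by_type["CHEMICAL"], by_type[other]))
    return pairs

pairs = candidate_pairs(
    [("chloroquine", "CHEMICAL"), ("ACE2", "PROTEIN"), ("malaria", "DISEASE")]
)
# pairs == [("chloroquine", "ACE2"), ("chloroquine", "malaria")]
```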

Fig. 4: The tagged sentences from the NER module get processed by the RE module to extract the related entities, as shown above.


KG construction

From the NER module, we obtained the biomedical entities that form the nodes of a graph, while from the RE module we obtained the relationship types among said entities, which form the edges between nodes. This gives us a huge graph with multiple disconnected components and repeated nodes. In order to form a single KG, all repeated nodes are merged and their edges handled accordingly. An abstraction of the process is provided in Fig. 5.
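The merge step can be sketched in a few lines (a simplification under our own assumptions; the real pipeline's normalization of entity mentions is likely more involved than lowercasing):

```python
from collections import defaultdict

def build_kg(triples):
    """Fold (head, relation, tail) triples into one adjacency map, so that
    repeated mentions of an entity collapse into a single node holding the
    union of its edges."""
    graph = defaultdict(set)
    for head, relation, tail in triples:
        h, t = head.lower(), tail.lower()  # toy normalization for merging
        graph[h].add((relation, t))
        graph[t].add((relation, h))
    return graph

# Two triples mentioning "Chloroquine"/"chloroquine" merge into one node.
kg = build_kg([
    ("Chloroquine", "chemical-disease", "malaria"),
    ("chloroquine", "chemical-protein", "ACE2"),
])
```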

Now that our KG is constructed, we need to take advantage of the rich information it holds. There are many ways to do this, but one of the proven methods is to learn vector representations of each node using the type of incoming and outgoing edges as well as neighbourhood information for a given node.

Fig. 5: By aggregating the RE module’s output, a rich graph is constructed.


Learning latent representation of biomedical entities

As stated earlier, in order to take advantage of the rich information contained in our KG, one of the proven methods is to learn dense vector representations of its nodes. Just like word2vec[8] learns vector embeddings of words and groups similar terms together (for example, "house" and "home"), we need a method that provides such vector representations for the nodes in our KG, so that similar entities can be grouped together. This is exactly what Knowledge Graph embedding techniques achieve.

We explored many KG embedding techniques, such as TransD[9], TransE[10] and RotatE[11], and compared the results. The basic idea of these techniques is to keep the vectors of connected nodes closer to each other than to all other nodes in the graph, thus segregating unrelated entities from related ones. We use OpenKE[12], an open-source framework, to try the different KG embedding techniques. Fig. 6 gives an abstract overview of such methods.
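The intuition behind the translation-based family can be shown in a few lines of numpy (toy vectors of our own making, not embeddings learned by OpenKE):

```python
import numpy as np

def transe_score(h, r, t):
    """TransE models a relation as a translation: for a true triple
    (h, r, t) we want h + r ≈ t, so the negated distance is the score."""
    return -np.linalg.norm(h + r - t)  # higher score = more plausible triple

h = np.array([1.0, 0.0])          # head entity vector (made up)
r = np.array([0.0, 1.0])          # relation vector (made up)
t_true = np.array([1.0, 1.0])     # h + r lands exactly on t_true
t_false = np.array([5.0, -3.0])   # an unrelated entity

# The true tail scores higher than the unrelated one.
assert transe_score(h, r, t_true) > transe_score(h, r, t_false)
```

Training then amounts to nudging the vectors so that observed triples score higher than corrupted ones; TransD and RotatE refine the same idea with relation-specific projections and rotations in complex space, respectively.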

graph_embeddings.png

Fig. 6: A pictorial representation of the concept of learning dense vectors of nodes in a KG.
Taken from AmpliGraph

Generating entities similar to COVID-19

After learning the latent representations, we now have embeddings for each entity, including COVID-19. The embeddings are such that similar entities lie closer to each other in the learned vector space, while entities with dissimilar properties lie further apart. Thus, in order to find entities associated with COVID-19, we simply take the cosine similarity of COVID-19's vector with the vectors of all other biomedical entities, rank them, and pick the top k as our final list. This k can be decided by the user.
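This ranking step can be sketched as follows (the embeddings below are random stand-ins for the learned vectors, and the entity names are placeholders):

```python
import numpy as np

def top_k_similar(query_vec, entity_vecs, names, k):
    """Rank entities by cosine similarity to the query vector, return top k."""
    sims = entity_vecs @ query_vec / (
        np.linalg.norm(entity_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    order = np.argsort(-sims)[:k]          # indices of the k highest scores
    return [(names[i], float(sims[i])) for i in order]

rng = np.random.default_rng(0)
covid = rng.normal(size=8)                 # stand-in for the COVID-19 vector
vecs = rng.normal(size=(100, 8))           # stand-ins for all other entities
names = [f"entity_{i}" for i in range(100)]
ranked = top_k_similar(covid, vecs, names, k=20)
```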


Experiment and Evaluation

Our evaluation strategy for ERLKG was to perform both extrinsic evaluation, in the form of a link prediction task, and intrinsic evaluation.

Link Prediction

Link prediction is the task of predicting the existence of an edge between two entities. If our generated KG is correct, we should be able to achieve a good score on said task by feeding the vectors learned by graph embedding techniques into a classification model.

The test and validation sets are created by removing edges from the graph and adding an equal number of randomly sampled false links (node pairs that were not connected in the graph). The test and validation sets contain 10 percent and 5 percent of the true links, respectively. We use GCN-AE[13] as our model for the task. The Average Precision (AP) and ROC score of each setting is noted and used to benchmark the embedding types, as can be seen in Table 1. We can see that RotatE and TransD perform the best.

Method ROC AP
RotatE (Sun et al., 2019) 0.858 0.887
TransD (Ji et al., 2015) 0.860 0.883
TransE (Bordes et al., 2013) 0.853 0.877
DistMult (Yang et al., 2015) 0.855 0.883
ComplEx (Trouillon et al., 2016) 0.852 0.881
Node2Vec (Grover and Leskovec, 2016) 0.821 0.849

Table 1: Link Prediction performance of different KG embedding techniques on the test set using the GCN-AE model
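The two metrics in Table 1 are standard and easy to reproduce with scikit-learn; the labels and scores below are toy values of our own, not outputs of the GCN-AE model:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# 1 = held-out true link, 0 = randomly sampled false link.
y_true = np.array([1, 1, 1, 0, 0, 0])
# Edge probabilities a link-prediction model might assign (made up).
y_score = np.array([0.9, 0.7, 0.4, 0.6, 0.2, 0.1])

roc = roc_auc_score(y_true, y_score)           # 8 of 9 pos/neg pairs ordered correctly
ap = average_precision_score(y_true, y_score)  # precision averaged at each true link
```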

Intrinsic Evaluation

From the link prediction task, we know that both TransD and RotatE are good graph embedding techniques, but in order to find a clear winner, we decided to perform another round of evaluation. This time, we went for intrinsic evaluation, which is the process of evaluating a model with the direct involvement of human judgement. As with our KG predicament, there is no intrinsic evaluation dataset for COVID-19. We thus propose two new datasets, COV19_729 and COV19_25, created with the help of medical experts. The following details how we developed them:

  • COV19_729: After generating the KG, a list of all entities was supplied to a medical practitioner, who clubbed the terms into 3 groups based on their relatedness to COVID-19, i.e., NOT RELATED, PARTIALLY RELATED and HIGHLY RELATED. It was found that the HIGHLY RELATED group contains far fewer entities than the other two. Thus, in order to reduce bias, the expert sampled a nearly equal number of entities from each group, resulting in a final dataset comprising 729 entities. This dataset was then shuffled and passed on to two independent medical experts, who rated each sample on how related the entity is to COVID-19, on a scale of 0 (NOT RELATED) to 5 (HIGHLY RELATED).

    The inter-rater agreement (kappa score) was found to be 0.5116, which lies in the moderate agreement range. We therefore averaged the ratings and propose a relatively large intrinsic evaluation dataset, COV19_729, for benchmarking COVID-19-related embedding techniques. Table 2 shows a snapshot of COV19_729.

    Entity Cosine with COVID-19 Rating by medical expert
    retinoic acid inducible gene-1 (protein) -0.0799 0
    hydroxyprolinol (chemical) -0.0183 2
    acute asthma attacks (disease) 0.0530 1
    pc18 (chemical) 0.1540 1
    immunodominant epitopes (protein) 0.1894 2
    receptor binding domain (protein) 0.4644 5
    spike glycoprotein (protein) 0.4138 4

    Table 2: A snapshot of the COV19_729 dataset. The second column denotes cosine scores generated by TransD.

  • COV19_25: The top 100 entities given by TransD and RotatE were selected and the intersection of the two lists was taken, then passed on to a medical practitioner. The expert recommended a list of 25 relevant entities out of the provided set. This list was then sent to another expert, who rated the entities based on their relatedness to COVID-19. The result was named COV19_25.

    If you want to know the reason behind proposing two separate intrinsic evaluation datasets, then please do give the paper a read.

Once the datasets were complete, we performed intrinsic evaluation by generating cosine similarity scores from both TransD and RotatE for the biomedical entities in each dataset and computing the Pearson and Spearman correlations between the generated scores and the expert ratings. The cosine similarity score for each entity was computed with respect to the COVID-19 embedding vector obtained from our proposed pipeline. As can be seen from Table 3, TransD outperformed RotatE on both datasets and is hence the selected graph embedding technique in our case.

Entity List Spearman Correlation Pearson Correlation
COV19_729 (TransD) 0.2186 0.2117
COV19_729 (RotatE) 0.1933 0.1879
COV19_25 (TransD) 0.4570 0.4348
COV19_25 (RotatE) 0.4240 0.4105

Table 3: Pearson and Spearman Correlation values between the ratings and the cosine similarity scores generated by both TransD and RotatE.
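The numbers in Table 3 come from standard rank and linear correlation; a toy sketch with scipy (the values below are made up, loosely echoing the snapshot in Table 2):

```python
from scipy.stats import pearsonr, spearmanr

# Cosine similarities to COVID-19 and the corresponding expert ratings
# (illustrative values, not the actual COV19_729 data).
cosines = [-0.08, -0.02, 0.05, 0.15, 0.19, 0.46, 0.41]
ratings = [0, 2, 1, 1, 2, 5, 4]

rho, _ = spearmanr(cosines, ratings)  # rank (monotonic) agreement
r, _ = pearsonr(cosines, ratings)     # linear agreement
```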


Results

Using ERLKG, we are able to rapidly mine important biomedical entities that are highly associated with COVID-19. The final result for chemical entities is given in Fig. 7. Entities color-coded blue have a higher cosine similarity value than entities color-coded yellow. We can see that entities like “Flutamide”, “Lopinavir” and “Carfilzomib” are highly associated with COVID-19. On further investigation, we found that most of these entities are actually being considered for COVID-19 treatment.

For further results we highly encourage the readers to check out the full paper.

final_results.png

Fig. 7: Our final results of biomedical entities closely related to COVID-19. Blue coloured entities are much closer to the central node, in terms of cosine similarity, than the yellow coloured ones.




References

  1. Wang, Lucy Lu, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk et al. "CORD-19: The Covid-19 Open Research Dataset." ArXiv (2020).
  2. Beltagy, Iz, Kyle Lo, and Arman Cohan. "SciBERT: A pretrained language model for scientific text." arXiv preprint arXiv:1903.10676 (2019).
  3. Kim, Jin-Dong, Tomoko Ohta, Yoshimasa Tsuruoka, Yuka Tateisi, and Nigel Collier. "Introduction to the bio-entity recognition task at JNLPBA." In Proceedings of the international joint workshop on natural language processing in biomedicine and its applications, pp. 70-75. 2004.
  4. Doğan, Rezarta Islamaj, Robert Leaman, and Zhiyong Lu. "NCBI disease corpus: a resource for disease name recognition and concept normalization." Journal of biomedical informatics 47 (2014): 1-10.
  5. Krallinger, Martin, Obdulia Rabal, Florian Leitner, Miguel Vazquez, David Salgado, Zhiyong Lu, Robert Leaman et al. "The CHEMDNER corpus of chemicals and drugs and its annotation principles." Journal of cheminformatics 7, no. 1 (2015): 1-17.
  6. Taboureau, Olivier, Sonny Kim Nielsen, Karine Audouze, Nils Weinhold, Daniel Edsgärd, Francisco S. Roque, Irene Kouskoumvekaki et al. "ChemProt: a disease chemical biology database." Nucleic acids research 39, no. suppl_1 (2010): D367-D372.
  7. Li, Jiao, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. "BioCreative V CDR task corpus: a resource for chemical disease relation extraction." Database 2016 (2016).
  8. Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems 26 (2013): 3111-3119.
  9. Ji, Guoliang, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. "Knowledge graph embedding via dynamic mapping matrix." In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers), pp. 687-696. 2015.
  10. Bordes, Antoine, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. "Translating embeddings for modeling multi-relational data." Advances in neural information processing systems 26 (2013): 2787-2795.
  11. Sun, Zhiqing, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. "Rotate: Knowledge graph embedding by relational rotation in complex space." arXiv preprint arXiv:1902.10197 (2019).
  12. Han, Xu, Shulin Cao, Xin Lv, Yankai Lin, Zhiyuan Liu, Maosong Sun, and Juanzi Li. "Openke: An open toolkit for knowledge embedding." In Proceedings of the 2018 conference on empirical methods in natural language processing: system demonstrations, pp. 139-144. 2018.
  13. Kipf, Thomas N., and Max Welling. "Variational graph auto-encoders." arXiv preprint arXiv:1611.07308 (2016).