Escolar Documentos
Profissional Documentos
Cultura Documentos
COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation
COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation
Abstract
To combat COVID-19, clinicians and scien- 30000
05-07
05-14
05-21
05-28
06-04
06-11
06-18
06-25
tities, relations and events). We then exploit
the constructed multimedia knowledge graphs
(KGs) for question answering and report gen- Figure 1: The Growing Number of COVID-19 Papers
eration, using drug repurposing as a case study. at PubMed
Our framework also provides detailed contex-
tual sentences, subfigures and knowledge sub-
graphs as evidence. All of the data, KGs, re- are 140K+ related papers, nearly 2.7K new papers
ports, resources and shared services are pub- per day (see Figure 1). This knowledge bottleneck
licly available1 . causes significant delay in the development of vac-
cines and drugs for COVID-19. More intelligent
1 Introduction
knowledge discovery technologies need to be de-
The COVID-19 pandemic has rapidly changed our veloped to enable researchers to more quickly and
work styles and behaviors in previously unimag- accurately access and digest relevant knowledge
inable ways, including scientific research. Scien- from literature.
tists across the world have dropped everything to The second challenge is quality due to the
fight COVID-19. Practical progress at combat- rise and rapid, extensive publications of preprint
ing COVID-19 highly depends on effective search, manuscripts without pre-publication peer review.
analysis, discovery, assessment and extension of Many research results about coronavirus from dif-
these research results. However, clinicians and sci- ferent research labs and sources are redundant,
entists are facing two unique barriers on digesting complementary or event conflicting with each other,
these research papers. while some false information has been promoted at
The first challenge is quantity. Such a bottle- both formal publication venues and social media
neck in knowledge access is exacerbated during platforms such as Twitter. As a result, some of the
a pandemic when the increased investment on rel- policy responses to the virus, and public perception
evant research would lead to even faster growth of it, have been based on misleading, and at times
of literature than usual. For example, till April erroneous, claims. The isolation of these knowl-
28, 2020, at PubMed2 there are 19,443 papers re- edge resources makes it hard, if not impossible, for
lated to coronavirus; as of June 13, 2020, there researchers to connect dots that exist in separate
1
http://blender.cs.illinois.edu/covid19/ resources to obtain insights. Thus, it is challeng-
2
https://www.ncbi.nlm.nih.gov/pubmed/ ing to draw useful conclusions based on previous
Figure 2: COVID-KG Overview: From Data to Semantics to Knowledge
ing (QA) methods usually rely on word-level or ple subgraph summarizing the connections between
sentence-level semantic meaning matching. The Losartan and cathepsin L pseudogene 2. The paths
questions from existing shared tasks are limited to are generated by traversing the constructed KG,
non-experts (e.g., Corona Virus Update?) or too and are ranked the frequency of paths in the KG.
high-level (e.g., What is known about transmission, We construct a subgraph for each query by merg-
incubation, and environmental stability?). Most of ing the paths of top ranked paths. Each edge is
current QA systems are trained from Wikipedia assigned a salience score by aggregating the scores
articles, and thus they will not be effective for of paths passing through it,
the COVID-19 domain where most answers are
not explicitly written in a single sentence or docu-
ment. In sharp contrast, we develop a QA compo- 3.2 Knowledge-driven Sentence Matching
nent based on a combination of knowledge graph
matching and distributional semantic matching. It
provides fast and effective answering of questions In addition to knowledge elements, we also present
about drugs, diseases, chemical entities and genes related sentences as evidence. We use BioBert (Lee
from any angle. We build knowledge graph index- et al., 2020) pre-trained language model to repre-
ing and searching functions to facilitate users to sent each sentence along with its left and right
pose queries to search effectively and efficiently. neighboring sentences as local contexts. Using the
We also support semantic matching from the con- same architecture computed on all respective sen-
structed KGs and related texts by accepting multi- tences and user query, we aggregate the sequence
hop queries. embedding layer, the last hidden layer in the BERT
architecture with average pooling (Reimers and
A common category of queries is about the con- Gurevych, 2019). We use the similarity between
nections between two entities. Given two entities the embedding representations of each sentence
as query, we generate a subgraph covering salient and query to extract the most relevant sentences as
paths between them to show how they are con- evidence. Table 1 shows some answer examples
nected through other entities. Figure 3 is an exam- from our QA component.
Question # of Answers Example Answers
Which genes are related 687 AP2 associated kinase 1,
1. Current indication: what is the drug class?
to COVID-19? myeloperoxidase, thiore- What is it currently approved to treat?
doxin
Which chemicals are re- 3,142 acetoacetic acid, Chlo-
2. Molecular structure (symbols desired, but a
lated to COVID-19? rine, Zymosan pointer to a reference is also useful)
Which genes are related 2,168 DEK proto-oncogene,
to COVID-19 that can be neclear receptor corepres-
3. Mechanism of action i.e. inhibits viral entry,
transferred from its simi- sor 1 replication, etc (w/ a pointer to data)
lar diseases?
Which chemicals are re- 327 Ampicillin, Quercetin, 4. Was the drug identified by manual or compu-
lated to COVID-19 that Zoledronic Acid tation screen?
can be transferred from
its similar diseases? 5. Who is studying the drug? (Source/lab name)
6. In vitro Data available (cell line used, assays
Table 1: Knowledge-driven QA Output Examples run, viral strain used, cytopathic effects, toxi-
city, LD50, dosage response curve, etc)
7. Animal Data Available (what animal model,
3.3 Evidence Mining
LD50, dosage response curve, etc)
Queries also often include entity types instead of 8. Clinical trials on going (what phase, facility,
entity instances, which requires us to extract evi- target population, dosing, intervention etc)
dence sentences based on type or pattern match- 9. Funding source
ing. We have developed E VIDENCE M INER (Wang 10. Has the drug shown evidence of systemic tox-
et al., 2020b,c), a web-based system that allows icity?
a users query as a natural language statement or 11. List of relevant sources to pull data from.
an inquired relationship at the meta-symbol level We use three drugs suggested by DARPA biol-
(e.g., CHEMICAL, PROTEIN) and automatically ogists as case studies: Benazepril, Losartan, and
retrieves textual evidence from a background cor- Amodiaquine. Our KG results for many other drugs
pora of COVID-19. In Figure 9, the top results are visualized at our website4 . We use the follow-
of E VIDENCE M INER for the query (“SARS-COV- ing list of chemicals/genes related to COVID-19,
2”, “Losartan”) indicates that “Losartan” can be a suggested by DARPA biologists:
potential drug treatment for “COVID-19”. • BM1 00870 BM1 06175 BM1 16375 BM1 17125
BM1 22385 BM1 30360 BM1 33735 BM1 56245
BM1 56735 BM1 00870 BM1 06175 BM1 16375
BM1 17125 BM1 22385 BM1 30360 BM1 33735
BM1 56245 BM1 56735 CATB-10270 CATB-1418
CATB-1674 CATB-16A CATB-16D2 CATB-1852
CATB-1874 CATB-2744 CATB-3098 CATB-348
CATB-3483 CATB-5880 CATB-84 CATB-912 CATD
CATHY CATK CATL CATL-LIKE CTS12 CTS3
CTS6 CTS7 CTS7-PS CTS8 CTS8L1 CTS8-PS
CTSA CTSA.L CTSB CTSBA CTSBB CTSB.L
CTSB-PS CTSB.S CTSC CTSC.L CTSC.S CTSD
CTSD2 CTSD.S CTSE CTSEAL CTSE.L CTSE.S
CTSF CTSF.L CTSG CTSH CTSH.L CTSH-PS
CTSJ CTSK CTSK1 CTSK.L CTSL CTSL.1 CTSL3
CTSL3P CTSLA CTSLB CTSLL CTSL.L CTSLL3
CTSLP1 CTSLP2 CTSLP3 CTSLP4 CTSLP6
CTSLP8 CTSM CTSM-PS CTSM-PS2 CTSO
CTSO.L CTSQ CTSQL2 CTSR CTSS CTSS1
CTSS.2 CTSS2.1 CTSS2.2 CTSSL CTSS.L CTSS.S
CTSV CTSV.L CTSW CTSW.L CTSZ CTSZ.L
Figure 9: Top results of EvidenceMiner for the query
CTSZ.S LOAG 18685 SMP 013040.1 SMP 034410.1
(“SARS-COV-2”, “Losartan”). SMP 067050 SMP 067060 SMP 085010 SMP 085180
SMP 103610 SMP 105370 SMP 158410 SMP 158420
SMP 179950 TSP 01409 TSP 02382 TSP 02383
TSP 03306 TSP 07747 TSP 10129 TSP 10493
4 A case study on Drug Repurposing TSP 11596 LMAN1 LMAN1L LMAN1.L LMAN1.S
Report Generation LMAN2 LMAN2L MBL1P MBL2 ACE2 FURIN
TMPRSS2
4.1 Task and Data With the knowledge discovered and more pa-
A human written report about drug repurposing pers about this topic published every day, we need
usually includes answers for the following typical 4
http://blender.cs.illinois.edu/
questions. covid19/visualization.html
lopinavir-ritonavir
drug combination
COVID-19 subgraphs and image segmentation and analysis
Severe
Acute Coronavirus
results. Table 2 shows some example answers.
Respiratory Infections
Syndrome cathepsin D Question Example Answers
Drug Class angiotensin-converting enzyme (ACE) inhibitors
Disease hypertension
Figure 10: Connections Involving Coronavirus Related [PMID:32314699 (PMC7253125)] Past medical his-
Diseases Q1
tory was significant for hypertension, treated with
Evidence amlodipine and benazepril, and chronic back pain.
Sentences [PMID:32081428 (PMC7092824)] On the other
hand, many ACE inhibitors are currently used to
to keep up to date with the latest development. treat hypertension and other cardiovascular diseases.
Among them are captopril, perindopril, ramipril,
For this purpose, we download new COVID-19 lisinopril, benazepril, and moexipril.
Disease COVID-19
papers on a daily basis from three Application [PMID:32081428 (PMC7092824)] By using a
Programming Interfaces (APIs): NCBI PMC API, molecular docking approach, an earlier study iden-
tified N-(2-aminoethyl)-1 aziridine-ethanamine as a
NCBI Pubtator API and CORD-19 archive. We Q4 novel ACE2 inhibitor that effectively blocks the
Evidence SARS-CoV RBD-mediated cell fusion.
provide incremental updates including new papers, Sentences This has provided a potential candidate and lead
removed papers and updated papers, and their meta- compound for further therapeutic drug development.
Meanwhile, biochemical and cell-based assays can
data information at our website5 . be established to screen chemical compound libraries
to identify novel inhibitors.
Disease cardiovascular disease
4.2 Results [PMID:22800722 (PMC7102827)] The in vitro half-
maximal inhibitory concentration (IC50) values of
Up to June 14, 2020 we have collected 140K Evidence food-derived ACE inhibitory peptides are about 1000
Q6 Sentences fold higher than that of synthetic captopril but they
papers. We choose 25,534 peer-reviewed pa- have higher in vivo activities than would be expected
pers and construct the KG. The current KG in- from their in vitro activities.....
Disease COVID-19
cludes 7,230 Diseases, 9,123 Chemicals and 50,864 [PMID:32336612 (PMC7167588)] Two trials of
genes, 1,725,518 chemical-gene links, 5,556,670 losartan as additional treatment for SARS-CoV-2 in-
fection in hospitalized (NCT04312009) or not hos-
chemical-disease links, and 7,7844,574 gene- Q8 pitalized (NCT04311177) patients have been an-
nounced, supported by the background of the huge
disease links. The KG has got more than 800+ adverse impact of the ACE Angiotensin II AT1 re-
downloads. Evidence ceptor axis over-activity in these patients.
Sentences [PMID:32350632 (PMC7189178)] To address the
Several clinicians and medical school students role of angiotensin in lung injury, there is an ongoing
clinical trial to examine whether losartan treatment
in our team have manually reviewed the drug repur- affects outcomes in COVID-19 associated ARDS
posing reports for three drugs, and also the knowl- (NCT04312009).
[PMID:32439915 (PMC7242178)] Losartan was
edge graphs connecting 41 drugs and COVID-19 also the molecule chosen in two trials recently started
in the United States by the University of Minnesota
related chemicals/genes. Preliminary results show to treat patients with COVID-19 (clinical trials.gov
that most of our output are informative, valid and NCT04311177 and NCT 104312009).
sound. For instance, after the coronavirus enters Table 2: Example Answers for Questions in Drug Re-
the cell in the lungs, it can cause a severe dis- purposing Reports
ease called Acute Respiratory Distress Syndrome
(ARDS). This condition causes the release of in-
flammatory molecules in the body named cytokines 5 Related Work
such as Interleukin-2, Interleukin-6, Tumor Necro-
sis Factor, and Interleukin-10. We see all of these There has been a lot of previous work on extract-
connections in our results, such as the examples ing biomedical entities (Krallinger et al., 2013; Lu
shown in Figure 3 and Figure 10. Some results et al., 2015; Leaman and Lu, 2016; Habibi et al.,
are a little surprising to scientists and they think 2017; Crichton et al., 2017; Wang et al., 2018;
it’s worth further investigation. For example, in Beltagy et al., 2019; Alsentzer et al., 2019; Wei
Figure 3 we can see that Lusartan is connected to et al., 2019; Wang et al., 2020d), relations (Uzuner
tumor protein p53 which is related to lung cancer. et al., 2011; Krallinger et al., 2011; Segura-Bedmar
Our final generated reports6 are shared publicly. et al., 2013; Bui et al., 2014; Peng et al., 2016; Wei
For each question, our framework provides an- et al., 2015; Peng et al., 2017; Quirk and Poon,
swers along with detailed evidences, knowledge 2016; Luo et al., 2017; Wei et al., 2019; Peng et al.,
2019, 2020), and events (Ananiadou et al., 2010;
5
http://blender.cs.illinois.edu/ Van Landeghem et al., 2013; Nédellec et al., 2013;
covid19/
6
http://blender.cs.illinois.edu/ Deléger et al., 2016; Wei et al., 2019; Li et al.,
covid19/DrugRe-purposingReport_V2.0.docx 2019; ShafieiBavani et al., 2020) from biomed-
ical corpora to construct KGs. Recently, Hope Matthew McDermott. 2019. Publicly available clini-
et al. (2020); Ilievski et al. (2020); Wolinski (2020); cal BERT embeddings. In Proceedings of the 2nd
Clinical Natural Language Processing Workshop,
Ahamed and Samad (2020) build KGs based on
pages 72–78, Minneapolis, Minnesota, USA. Asso-
CORD-19 (Wang et al., 2020a). ciation for Computational Linguistics.
Most of the recent biomedical QA work is
driven by the BioASQ initiative (Tsatsaronis et al., Sophia Ananiadou, Sampo Pyysalo, Junichi Tsujii, and
2015) with many algorithms developed (Yang et al., Douglas B Kell. 2010. Event extraction for sys-
tems biology by text mining the literature. Trends
2015, 2016; Chandu et al., 2017; Kraus et al., in biotechnology, 28(7):381–390.
2017). There are COVID-19 question answering
live systems coming from BioASQ (COVIDASK7 , Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. Scib-
AUEB8 ), and search engines (Kricka et al., 2020; ert: Pretrained language model for scientific text. In
EMNLP.
Esteva et al., 2020; Hope et al., 2020; Taub Tabib
et al., 2020) have also been built. Our work ad- Quoc-Chinh Bui, Peter MA Sloot, Erik M Van Mul-
vances state-of-the-art by extending the knowledge ligen, and Jan A Kors. 2014. A novel feature-
elements to more fine-grained types, incorporating based approach to extract drug–drug interactions
image analysis and cross-media knowledge ground- from biomedical text. Bioinformatics, 30(23):3365–
3371.
ing, and knowledge graph matching into question
answering. Khyathi Chandu, Aakanksha Naik, Aditya Chan-
drasekar, Zi Yang, Niloy Gupta, and Eric Nyberg.
6 Conclusions and Future Work 2017. Tackling biomedical text summarization:
Oaqa at bioasq 5b. In BioNLP 2017, pages 58–66.
We have developed a novel framework COVID-
KG that automatically transforms massive scien- Gamal Crichton, Sampo Pyysalo, Billy Chiu, and Anna
tific literature corpus into organized, structured and Korhonen. 2017. A neural network multi-task learn-
ing approach to biomedical named entity recogni-
actionable knowledge graphs. With COVID-KG, tion. BMC Bioinf., 18(1):368.
researchers and clinicians are able to obtain trust-
worthy and non-trivial answers from scientific liter- Allan Peter Davis, Cynthia J Grondin, Robin J Johnson,
ature, and thus focus on more important hypothesis Daniela Sciaky, Benjamin L King, Roy McMorran,
Jolene Wiegers, Thomas C Wiegers, and Carolyn J
testing, and prioritize the analysis efforts for can-
Mattingly. 2016. The comparative toxicogenomics
didate exploration directions. In our ongoing work database: update 2017. Nucleic acids research.
we have created a new ontology that includes 77
entity subtypes and 58 event subtypes, and we are Louise Deléger, Robert Bossy, Estelle Chaix,
re-building an end-to-end joint neural IE system Mouhamadou Ba, Arnaud Ferré, Philippe Bessieres,
and Claire Nédellec. 2016. Overview of the bacteria
following this new ontology. In the future we plan biotope task at bionlp shared task 2016. In Proceed-
to extend COVID-KG to automate the creation of ings of the 4th BioNLP shared task workshop, pages
new hypotheses by predicting new links. Inspired 12–22.
from our recent success at multimedia event extrac-
tion (Li et al., 2020), we will create a multimedia Andre Esteva, Anuprit Kale, Romain Paulus, Kazuma
Hashimoto, Wenpeng Yin, Dragomir Radev, and
common semantic space for literature and apply Richard Socher. 2020. Co-search: Covid-19 in-
it to improve cross-media knowledge grounding, formation retrieval with semantic search, question
inference and transfer. answering, and abstractive summarization. arXiv
preprint arXiv:2006.09595.
Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Haibo Zhang, Josef M Penninger, Yimin Li, Nanshan
Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Zhong, and Arthur S Slutsky. 2020. Angiotensin-
Funk, Rodney Kinney, Ziyang Liu, William Merrill, converting enzyme 2 (ace2) as a sars-cov-2 receptor:
et al. 2020a. Cord-19: The covid-19 open research molecular mechanisms and potential therapeutic tar-
dataset. ArXiv. get. Intensive care medicine, 46(4):586–590.
Jin Guang Zheng, Daniel Howsmon, Boliang Zhang,
Juergen Hahn, Deborah McGuinness, James
Hendler, and Heng Ji. 2014. Entity linking for
biomedical literature. In BMC Medical Informatics
and Decision Making.
Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang,
Shuchang Zhou, Weiran He, and Jiajun Liang. 2017.
East: an efficient and accurate scene text detector. In
Proceedings of the IEEE conference on Computer Vi-
sion and Pattern Recognition, pages 5551–5560.