
COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation


Qingyun Wang1, Manling Li1, Xuan Wang1, Nikolaus Parulian1, Guangxing Han2,
Jiawei Ma2, Jingxuan Tu3, Ying Lin1, Haoran Zhang1, Weili Liu1, Aabhas Chauhan1,
Yingjun Guan1, Bangzheng Li1, Ruisong Li1, Xiangchen Song1, Heng Ji1, Jiawei Han1,
Shih-Fu Chang2, James Pustejovsky3, Jasmine Rah4, David Liem5, Ahmed Elsayed6,
Martha Palmer6, Clare Voss7, Cynthia Schneider8, Boyan Onyshkevych8
1 UIUC 2 Columbia University 3 Brandeis University 4 UW 5 UCLA 6 CU 7 ARL 8 DARPA
hengji@illinois.edu, hanj@illinois.edu, sc250@columbia.edu

arXiv:2007.00576v2 [cs.CL] 6 Jul 2020

Abstract

To combat COVID-19, clinicians and scientists all need to digest the vast amount of relevant biomedical knowledge in literature to understand the disease mechanism and the related biological functions. We have developed a novel and comprehensive knowledge discovery framework, COVID-KG, which leverages novel semantic representation and external ontologies to represent text and images in the input literature data, and then extracts fine-grained multimedia knowledge elements (entities, relations and events). We then exploit the constructed multimedia knowledge graphs (KGs) for question answering and report generation, using drug repurposing as a case study. Our framework also provides detailed contextual sentences, subfigures and knowledge subgraphs as evidence. All of the data, KGs, reports, resources and shared services are publicly available1.

1 http://blender.cs.illinois.edu/covid19/

1 Introduction

The COVID-19 pandemic has rapidly changed our work styles and behaviors in previously unimaginable ways, including scientific research. Scientists across the world have dropped everything to fight COVID-19. Practical progress in combating COVID-19 highly depends on effective search, analysis, discovery, assessment and extension of these research results. However, clinicians and scientists are facing two unique barriers to digesting these research papers.

The first challenge is quantity. Such a bottleneck in knowledge access is exacerbated during a pandemic, when the increased investment in relevant research leads to even faster growth of literature than usual. For example, as of April 28, 2020, there were 19,443 papers related to coronavirus at PubMed2; as of June 13, 2020, there are 140K+ related papers, with nearly 2.7K new papers per day (see Figure 1). This knowledge bottleneck causes significant delay in the development of vaccines and drugs for COVID-19. More intelligent knowledge discovery technologies need to be developed to enable researchers to more quickly and accurately access and digest relevant knowledge from literature.

2 https://www.ncbi.nlm.nih.gov/pubmed/

Figure 1: The Growing Number of COVID-19 Papers at PubMed
[Plot residue: y-axis ticks from 20000 to 30000; x-axis dates from 04-30 to 06-25]

The second challenge is quality, due to the rise and rapid, extensive publication of preprint manuscripts without pre-publication peer review. Many research results about coronavirus from different research labs and sources are redundant, complementary or even conflicting with each other, while some false information has been promoted at both formal publication venues and social media platforms such as Twitter. As a result, some of the policy responses to the virus, and public perception of it, have been based on misleading, and at times erroneous, claims. The isolation of these knowledge resources makes it hard, if not impossible, for researchers to connect dots that exist in separate resources to obtain insights. Thus, it is challenging to draw useful conclusions based on previous
research effectively.

Figure 2: COVID-KG Overview: From Data to Semantics to Knowledge

Let's consider drug repurposing as a case study. Besides the long process of clinical trials and biomedical experiments, another major cause for the long process is the complexity of the problem involved and the difficulty of drug discovery in general. The current clinical trials for drug repurposing mainly rely on symptoms, by considering drugs that can treat diseases with similar symptoms. However, there are too many drug candidates and too much misinformation published from multiple sources. Clinicians and scientists thus urgently need help to obtain a reliable ranked list of drugs with detailed evidence. In addition to a ranked list of drugs, clinicians and scientists also aim to gain new insights into the underlying molecular and cellular mechanisms of COVID-19, and into which pre-existing conditions may affect the mortality and severity of this disease.

To tackle these two challenges we propose a new framework, COVID-KG, to accelerate scientific discovery and build a bridge between clinicians and biology scientists, as illustrated in Figure 2. COVID-KG starts by reading existing papers to build multimedia knowledge graphs (KGs), in which nodes are entities/concepts and edges represent relations involving these entities, extracted from both text and images. Given the KGs enriched with path ranking and evidence mining, COVID-KG answers natural language questions effectively. Using drug repurposing as a case study, for 11 typical questions that human experts aim to explore, we integrate our techniques to generate a comprehensive report for each candidate drug. Preliminary assessment by expert clinicians and medical school students shows that our generated reports are informative and sound.

2 Multimedia Knowledge Graph Construction

Figure 3: Constructed KG Connecting Losartan (candidate drug in COVID-19) and cathepsin L pseudogene 2 (gene related to coronavirus).

2.1 Coarse-grained Text Knowledge Extraction

We apply our state-of-the-art biomedical Information Extraction (IE) system (Wang et al., 2019; Li et al., 2019; Li and Ji, 2019; Zheng et al., 2014; Huang et al., 2017) to build knowledge graphs (KGs), in which nodes are entities/concepts and edges are the relations and events involving these entities. This system consists of three components: (1) coarse-grained entity extraction and entity linking for four entity types: Gene nodes, Disease nodes, Chemical nodes, and Organism. We follow the entity ontology defined in the Comparative Toxicogenomics Database (CTD) (Davis et al., 2016), and obtain a Medical Subject Headings (MeSH) Unique ID for each mention. (2) Based on the MeSH Unique IDs, we further link all entities to the CTD and extract 133 subtypes of relations such as Marker/Mechanism, Therapeutic, and Increase
Expression, and validate them based on document-level co-occurrence through distant supervision. These relation types include Gene-Chemical Interaction Relationships, Chemical-Disease Associations, Gene-Disease Associations, Chemical-GO Enrichment Associations and Chemical-Pathway Enrichment Associations. (3) Event extraction: we extract 13 Event types and the roles of entities involved in these events, including Gene expression, Transcription, Localization, Protein catabolism, Binding, Protein modification, Phosphorylation, Ubiquitination, Acetylation, Deacetylation, Regulation, Positive regulation, and Negative regulation. Figure 3 shows an example of the constructed knowledge graph.

2.2 Fine-grained Text Entity Extraction

Figure 4: Example of Fine-grained Entity Extraction
[Figure content: NER result visualization on a CORD-19 passage about ACE2 as a SARS-CoV-2 receptor, with entity mentions tagged by type, e.g., "ACE2" and "spike proteins" as GENE_OR_GENOME, "SARS-CoV-2" and "SARS-CoV" as CORONAVIRUS, "phylogenetic" as EVOLUTION, "bat" and "pangolins" as WILDLIFE, "mice" and "rats" as EUKARYOTE, and "Van der Waals forces" as PHYSICAL_SCIENCE]

However, questions from experts often involve fine-grained knowledge elements, such as "Which amino acids in glycoprotein (a spike protein of COVID-19) are most related to Glycan (CHEMICAL)?" In order to be able to answer these questions, we have incorporated 75 fine-grained entity types automatically annotated by CORD-NER (Wang et al., 2020d) into the constructed KG. CORD-NER covers many new entity types specifically related to COVID-19 studies (e.g., coronaviruses, viral proteins, evolution, materials, substrates and immune responses), which may benefit research on COVID-19 related viruses, spreading mechanisms, and potential vaccines. Figure 4 shows some examples of the annotation results on a CORD-19 paper (Zhang et al., 2020).

2.3 Image Processing and Cross-media Entity Grounding

Figures in biomedical papers contain rich information uniquely manifested in the visual modality, such as molecular structures, microscopic images, dosage response curves, relational diagrams, and other visual types. We have developed a visual IE subsystem to extract the visual information from figure images to enrich the knowledge graph. We start by designing a pipeline and automatic tools, shown in Figure 5, to extract figures from papers in the CORD-19 dataset and segment figures into close to half a million subfigures. Then, we perform cross-modal entity grounding to ground entities mentioned in captions or referring text to visual objects in the subfigures.

Figure 5: System pipeline for automatic figure extraction and subfigure segmentation.

One main challenge for figure analysis lies in the lack of figures stored in separate image files. Most figures are embedded as part of the PDF files of the papers. We employ Deepfigures (Siegel et al., 2018) to automatically detect and extract figures from each PDF document. Each figure is associated with text in its caption or referring context (main body text referring to the figure). In this way, a figure can be coarsely attached to a KG entity if the entity is mentioned in the associated text.

To further delineate semantic and visual information contained in each subfigure, we have developed a pipeline to segment individual subfigures and then align each subfigure with its corresponding sub-caption. We employ Figure-separator (Tsutsui and Crandall, 2017) to detect and separate all non-overlapping image regions. Meanwhile, subfigures in a figure are typically marked with alphabetical letters (e.g., A, B, C, etc). We use deep neural networks (Zhou et al., 2017) to detect text in the figures and use OCR tools (Smith, 2007) to automatically recognize text information within
each figure. To distinguish subfigure marker text from the text labels that annotate figure content, we use location proximity between text labels and subfigures to locate subfigure text markers. Location information of such text markers can also be used to merge multiple image regions into a single subfigure. At the end, each subfigure is segmented, and associated with its corresponding subcaption and referring context.

Figure 6: Examples of segmenting a figure into subfigures and aligning them with subcaption text

The segmented subfigures and associated text labels provide rich information that can be used to expand the KG constructed from text captions. For example, as shown in Figure 7, we apply a classifier to detect subfigure images that contain molecular structures. Then by linking specific drug names extracted from within-figure text to the drug entity in the coarse KG constructed from the caption text, a cross-modal expanded KG can be constructed that links specific molecular structure images to corresponding drug entities in the KG.

Figure 7: Expanding knowledge graph through subfigure segmentation and cross-modal entity grounding.

2.4 Knowledge Graph Semantic Visualization

In order to enhance the exploration and discovery of the information mined from the COVID-19 literature through the algorithms discussed in previous sections, we have been developing techniques to create semantic visualizations over large datasets of complex networks of biomedical relations. Semantic visualization allows for visualization of user-defined subsets of these relations through interactive semantically typed tag clouds and heat maps. This allows researchers to get a global view of selected relation subtypes drawn from hundreds or thousands of papers at a single glance. This in turn allows for the ready identification of novel relationships that would typically be missed by directed keyword searches or simple unigram word cloud or heatmap displays.3

3 https://www.semviz.org/

We first build a data index from the datasets, and then create a Kibana dashboard out of the generated data indices. Each Kibana dashboard has a collection of visualizations that are designed to interact with each other. Dashboards are implemented as web applications. The navigation of a dashboard is mainly through clicking and searching. For example, clicking the protein keyword EIF2AK2 in the tag cloud named “Enzyme proteins participating Modification relations” adds a constraint on the type of proteins in modifications. Correspondingly, all the other visualizations will be changed.

One unique feature of the SemViz semantic visualization is the creation of dense tag clouds and dense heatmaps, through a process of parameter reduction over relations, allowing for the visualization of relation sets as tag clouds and multiple chained relations as heatmaps. Figure 8 illustrates such a dense heatmap, where a functionally typed protein is implicated in a disease relation (e.g., “those proteins that are down regulators of INF which are implicated in obesity”).

Figure 8: Regulatory Processes-Disease Interactions Heatmap

3 Knowledge-driven Question Answering

3.1 KG Matching and Path Ranking

With the constructed knowledge graphs from massive updated scientific literature and domain knowledge, we can support many biomedical hypothesis related queries and inference tools for working scientists and clinicians. Current question answering (QA) methods usually rely on word-level or sentence-level semantic meaning matching. The questions from existing shared tasks are either aimed at non-experts (e.g., Corona Virus Update?) or too high-level (e.g., What is known about transmission, incubation, and environmental stability?). Most current QA systems are trained on Wikipedia articles, and thus they will not be effective for the COVID-19 domain, where most answers are not explicitly written in a single sentence or document. In sharp contrast, we develop a QA component based on a combination of knowledge graph matching and distributional semantic matching. It provides fast and effective answering of questions about drugs, diseases, chemical entities and genes from any angle. We build knowledge graph indexing and searching functions to help users pose queries and search effectively and efficiently. We also support semantic matching from the constructed KGs and related texts by accepting multi-hop queries.

A common category of queries is about the connections between two entities. Given two entities as query, we generate a subgraph covering salient paths between them to show how they are connected through other entities. Figure 3 is an example subgraph summarizing the connections between Losartan and cathepsin L pseudogene 2. The paths are generated by traversing the constructed KG, and are ranked by the frequency of paths in the KG. We construct a subgraph for each query by merging the top-ranked paths. Each edge is assigned a salience score by aggregating the scores of paths passing through it.

3.2 Knowledge-driven Sentence Matching

In addition to knowledge elements, we also present related sentences as evidence. We use the BioBERT (Lee et al., 2020) pre-trained language model to represent each sentence along with its left and right neighboring sentences as local contexts. We encode every candidate sentence and the user query with the same architecture, and aggregate the sequence embedding layer (the last hidden layer in the BERT architecture) with average pooling (Reimers and Gurevych, 2019). We use the similarity between the embedding representations of each sentence and the query to extract the most relevant sentences as evidence. Table 1 shows some answer examples from our QA component.
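The sentence matching step of Section 3.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy token vectors and the `toy_encode_tokens` function are hypothetical stand-ins for BioBERT's last hidden layer, the local context is modeled by pooling the tokens of a sentence together with its left and right neighbors, and cosine similarity is assumed as the similarity measure (the text only says "similarity").

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[str], List[Vector]]

# Hypothetical toy token vectors standing in for BioBERT's last hidden layer.
TOY_VECTORS = {
    "losartan": [1.0, 0.0, 0.0],
    "ace2": [0.8, 0.2, 0.0],
    "hypertension": [0.0, 1.0, 0.0],
    "bat": [0.0, 0.0, 1.0],
}


def toy_encode_tokens(text: str) -> List[Vector]:
    """Return one vector per whitespace token; unknown tokens map to zeros."""
    return [TOY_VECTORS.get(tok, [0.0, 0.0, 0.0]) for tok in text.lower().split()]


def average_pool(token_vectors: Sequence[Sequence[float]]) -> Vector:
    """Average per-token vectors into a single sequence embedding."""
    if not token_vectors:
        raise ValueError("cannot pool an empty sequence")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed_with_context(sentences: Sequence[str], index: int, encode: Encoder) -> Vector:
    """Pool a sentence together with its left and right neighbors (assumed context model)."""
    window = sentences[max(0, index - 1): index + 2]
    token_vectors: List[Vector] = []
    for sentence in window:
        token_vectors.extend(encode(sentence))
    return average_pool(token_vectors)


def rank_evidence(query: str, sentences: Sequence[str], encode: Encoder,
                  top_k: int = 3) -> List[Tuple[float, str]]:
    """Score every sentence against the query and return the top_k by similarity."""
    query_vec = average_pool(encode(query))
    scored = [
        (cosine(query_vec, embed_with_context(sentences, i, encode)), sentence)
        for i, sentence in enumerate(sentences)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


SENTENCES = [
    "losartan blocks the ace2 axis",
    "hypertension is common",
    "a bat origin was found",
]

if __name__ == "__main__":
    for score, sentence in rank_evidence("losartan ace2", SENTENCES, toy_encode_tokens):
        print(f"{score:.3f}  {sentence}")
```

In a real setting, `toy_encode_tokens` would be replaced by a function that returns the last-hidden-layer token vectors of a pre-trained biomedical language model. The pooling and ranking logic would stay the same.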
Question | # of Answers | Example Answers
Which genes are related to COVID-19? | 687 | AP2 associated kinase 1, myeloperoxidase, thioredoxin
Which chemicals are related to COVID-19? | 3,142 | acetoacetic acid, Chlorine, Zymosan
Which genes are related to COVID-19 that can be transferred from its similar diseases? | 2,168 | DEK proto-oncogene, nuclear receptor corepressor 1
Which chemicals are related to COVID-19 that can be transferred from its similar diseases? | 327 | Ampicillin, Quercetin, Zoledronic Acid

Table 1: Knowledge-driven QA Output Examples

3.3 Evidence Mining

Queries also often include entity types instead of entity instances, which requires us to extract evidence sentences based on type or pattern matching. We have developed EvidenceMiner (Wang et al., 2020b,c), a web-based system that accepts a user's query as a natural language statement or an inquired relationship at the meta-symbol level (e.g., CHEMICAL, PROTEIN) and automatically retrieves textual evidence from a background corpus of COVID-19 literature. In Figure 9, the top results of EvidenceMiner for the query (“SARS-COV-2”, “Losartan”) indicate that “Losartan” can be a potential drug treatment for “COVID-19”.

Figure 9: Top results of EvidenceMiner for the query (“SARS-COV-2”, “Losartan”).

4 A Case Study on Drug Repurposing Report Generation

4.1 Task and Data

A human written report about drug repurposing usually includes answers for the following typical questions.

1. Current indication: what is the drug class? What is it currently approved to treat?
2. Molecular structure (symbols desired, but a pointer to a reference is also useful)
3. Mechanism of action i.e. inhibits viral entry, replication, etc (w/ a pointer to data)
4. Was the drug identified by manual or computation screen?
5. Who is studying the drug? (Source/lab name)
6. In vitro Data available (cell line used, assays run, viral strain used, cytopathic effects, toxicity, LD50, dosage response curve, etc)
7. Animal Data Available (what animal model, LD50, dosage response curve, etc)
8. Clinical trials on going (what phase, facility, target population, dosing, intervention etc)
9. Funding source
10. Has the drug shown evidence of systemic toxicity?
11. List of relevant sources to pull data from.

We use three drugs suggested by DARPA biologists as case studies: Benazepril, Losartan, and Amodiaquine. Our KG results for many other drugs are visualized at our website4. We use the following list of chemicals/genes related to COVID-19, suggested by DARPA biologists:

• BM1 00870 BM1 06175 BM1 16375 BM1 17125
BM1 22385 BM1 30360 BM1 33735 BM1 56245
BM1 56735 BM1 00870 BM1 06175 BM1 16375
BM1 17125 BM1 22385 BM1 30360 BM1 33735
BM1 56245 BM1 56735 CATB-10270 CATB-1418
CATB-1674 CATB-16A CATB-16D2 CATB-1852
CATB-1874 CATB-2744 CATB-3098 CATB-348
CATB-3483 CATB-5880 CATB-84 CATB-912 CATD
CATHY CATK CATL CATL-LIKE CTS12 CTS3
CTS6 CTS7 CTS7-PS CTS8 CTS8L1 CTS8-PS
CTSA CTSA.L CTSB CTSBA CTSBB CTSB.L
CTSB-PS CTSB.S CTSC CTSC.L CTSC.S CTSD
CTSD2 CTSD.S CTSE CTSEAL CTSE.L CTSE.S
CTSF CTSF.L CTSG CTSH CTSH.L CTSH-PS
CTSJ CTSK CTSK1 CTSK.L CTSL CTSL.1 CTSL3
CTSL3P CTSLA CTSLB CTSLL CTSL.L CTSLL3
CTSLP1 CTSLP2 CTSLP3 CTSLP4 CTSLP6
CTSLP8 CTSM CTSM-PS CTSM-PS2 CTSO
CTSO.L CTSQ CTSQL2 CTSR CTSS CTSS1
CTSS.2 CTSS2.1 CTSS2.2 CTSSL CTSS.L CTSS.S
CTSV CTSV.L CTSW CTSW.L CTSZ CTSZ.L
CTSZ.S LOAG 18685 SMP 013040.1 SMP 034410.1
SMP 067050 SMP 067060 SMP 085010 SMP 085180
SMP 103610 SMP 105370 SMP 158410 SMP 158420
SMP 179950 TSP 01409 TSP 02382 TSP 02383
TSP 03306 TSP 07747 TSP 10129 TSP 10493
TSP 11596 LMAN1 LMAN1L LMAN1.L LMAN1.S
LMAN2 LMAN2L MBL1P MBL2 ACE2 FURIN
TMPRSS2

4 http://blender.cs.illinois.edu/covid19/visualization.html

With the knowledge discovered and more papers about this topic published every day, we need
to keep up to date with the latest development. For this purpose, we download new COVID-19 papers on a daily basis from three Application Programming Interfaces (APIs): NCBI PMC API, NCBI Pubtator API and CORD-19 archive. We provide incremental updates including new papers, removed papers and updated papers, and their metadata information at our website5.

5 http://blender.cs.illinois.edu/covid19/

Figure 10: Connections Involving Coronavirus Related Diseases
[Figure content: nodes include lopinavir-ritonavir drug combination, COVID-19, Severe Acute Respiratory Syndrome, Coronavirus Infections, and cathepsin D]

4.2 Results

Up to June 14, 2020 we have collected 140K papers. We choose 25,534 peer-reviewed papers and construct the KG. The current KG includes 7,230 Diseases, 9,123 Chemicals and 50,864 genes, 1,725,518 chemical-gene links, 5,556,670 chemical-disease links, and 77,844,574 gene-disease links. The KG has received more than 800 downloads.

Several clinicians and medical school students in our team have manually reviewed the drug repurposing reports for three drugs, and also the knowledge graphs connecting 41 drugs and COVID-19 related chemicals/genes. Preliminary results show that most of our output is informative, valid and sound. For instance, after the coronavirus enters the cell in the lungs, it can cause a severe disease called Acute Respiratory Distress Syndrome (ARDS). This condition causes the release of inflammatory molecules in the body called cytokines, such as Interleukin-2, Interleukin-6, Tumor Necrosis Factor, and Interleukin-10. We see all of these connections in our results, such as the examples shown in Figure 3 and Figure 10. Some results are a little surprising to scientists, who consider them worth further investigation. For example, in Figure 3 we can see that Losartan is connected to tumor protein p53, which is related to lung cancer.

Our final generated reports6 are shared publicly. For each question, our framework provides answers along with detailed evidence, knowledge subgraphs and image segmentation and analysis results. Table 2 shows some example answers.

6 http://blender.cs.illinois.edu/covid19/DrugRe-purposingReport_V2.0.docx

Q1
  Drug Class: angiotensin-converting enzyme (ACE) inhibitors
  Disease: hypertension
  Evidence Sentences:
    [PMID:32314699 (PMC7253125)] Past medical history was significant for hypertension, treated with amlodipine and benazepril, and chronic back pain.
    [PMID:32081428 (PMC7092824)] On the other hand, many ACE inhibitors are currently used to treat hypertension and other cardiovascular diseases. Among them are captopril, perindopril, ramipril, lisinopril, benazepril, and moexipril.
Q4
  Disease: COVID-19
  Evidence Sentences:
    [PMID:32081428 (PMC7092824)] By using a molecular docking approach, an earlier study identified N-(2-aminoethyl)-1 aziridine-ethanamine as a novel ACE2 inhibitor that effectively blocks the SARS-CoV RBD-mediated cell fusion. This has provided a potential candidate and lead compound for further therapeutic drug development. Meanwhile, biochemical and cell-based assays can be established to screen chemical compound libraries to identify novel inhibitors.
Q6
  Disease: cardiovascular disease
  Evidence Sentences:
    [PMID:22800722 (PMC7102827)] The in vitro half-maximal inhibitory concentration (IC50) values of food-derived ACE inhibitory peptides are about 1000 fold higher than that of synthetic captopril but they have higher in vivo activities than would be expected from their in vitro activities.....
Q8
  Disease: COVID-19
  Evidence Sentences:
    [PMID:32336612 (PMC7167588)] Two trials of losartan as additional treatment for SARS-CoV-2 infection in hospitalized (NCT04312009) or not hospitalized (NCT04311177) patients have been announced, supported by the background of the huge adverse impact of the ACE Angiotensin II AT1 receptor axis over-activity in these patients.
    [PMID:32350632 (PMC7189178)] To address the role of angiotensin in lung injury, there is an ongoing clinical trial to examine whether losartan treatment affects outcomes in COVID-19 associated ARDS (NCT04312009).
    [PMID:32439915 (PMC7242178)] Losartan was also the molecule chosen in two trials recently started in the United States by the University of Minnesota to treat patients with COVID-19 (clinical trials.gov NCT04311177 and NCT 104312009).

Table 2: Example Answers for Questions in Drug Repurposing Reports

5 Related Work

There has been a lot of previous work on extracting biomedical entities (Krallinger et al., 2013; Lu et al., 2015; Leaman and Lu, 2016; Habibi et al., 2017; Crichton et al., 2017; Wang et al., 2018; Beltagy et al., 2019; Alsentzer et al., 2019; Wei et al., 2019; Wang et al., 2020d), relations (Uzuner et al., 2011; Krallinger et al., 2011; Segura-Bedmar et al., 2013; Bui et al., 2014; Peng et al., 2016; Wei et al., 2015; Peng et al., 2017; Quirk and Poon, 2016; Luo et al., 2017; Wei et al., 2019; Peng et al., 2019, 2020), and events (Ananiadou et al., 2010; Van Landeghem et al., 2013; Nédellec et al., 2013; Deléger et al., 2016; Wei et al., 2019; Li et al., 2019; ShafieiBavani et al., 2020) from biomedical corpora to construct KGs. Recently, Hope et al. (2020); Ilievski et al. (2020); Wolinski (2020); Ahamed and Samad (2020) build KGs based on CORD-19 (Wang et al., 2020a).

Most of the recent biomedical QA work is driven by the BioASQ initiative (Tsatsaronis et al., 2015), with many algorithms developed (Yang et al., 2015, 2016; Chandu et al., 2017; Kraus et al., 2017). There are live COVID-19 question answering systems coming from BioASQ (COVIDASK7, AUEB8), and search engines (Kricka et al., 2020; Esteva et al., 2020; Hope et al., 2020; Taub Tabib et al., 2020) have also been built. Our work advances the state of the art by extending the knowledge elements to more fine-grained types, and by incorporating image analysis, cross-media knowledge grounding, and knowledge graph matching into question answering.

7 https://covidask.korea.ac.kr/
8 http://cslab241.cs.aueb.gr:5000/

6 Conclusions and Future Work

We have developed a novel framework, COVID-KG, that automatically transforms a massive scientific literature corpus into organized, structured and actionable knowledge graphs. With COVID-KG, researchers and clinicians are able to obtain trustworthy and non-trivial answers from scientific literature, and thus focus on more important hypothesis testing, and prioritize the analysis efforts for candidate exploration directions. In our ongoing work we have created a new ontology that includes 77 entity subtypes and 58 event subtypes, and we are re-building an end-to-end joint neural IE system following this new ontology. In the future we plan to extend COVID-KG to automate the creation of new hypotheses by predicting new links. Inspired by our recent success at multimedia event extraction (Li et al., 2020), we will create a multimedia common semantic space for literature and apply it to improve cross-media knowledge grounding, inference and transfer.

References

Sabber Ahamed and Manar Samad. 2020. Information mining for COVID-19 research from a large volume of scientific literature. arXiv preprint arXiv:2004.02085.

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Sophia Ananiadou, Sampo Pyysalo, Junichi Tsujii, and Douglas B Kell. 2010. Event extraction for systems biology by text mining the literature. Trends in biotechnology, 28(7):381–390.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: Pretrained language model for scientific text. In EMNLP.

Quoc-Chinh Bui, Peter MA Sloot, Erik M Van Mulligen, and Jan A Kors. 2014. A novel feature-based approach to extract drug–drug interactions from biomedical text. Bioinformatics, 30(23):3365–3371.

Khyathi Chandu, Aakanksha Naik, Aditya Chandrasekar, Zi Yang, Niloy Gupta, and Eric Nyberg. 2017. Tackling biomedical text summarization: OAQA at BioASQ 5b. In BioNLP 2017, pages 58–66.

Gamal Crichton, Sampo Pyysalo, Billy Chiu, and Anna Korhonen. 2017. A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinf., 18(1):368.

Allan Peter Davis, Cynthia J Grondin, Robin J Johnson, Daniela Sciaky, Benjamin L King, Roy McMorran, Jolene Wiegers, Thomas C Wiegers, and Carolyn J Mattingly. 2016. The comparative toxicogenomics database: update 2017. Nucleic acids research.

Louise Deléger, Robert Bossy, Estelle Chaix, Mouhamadou Ba, Arnaud Ferré, Philippe Bessieres, and Claire Nédellec. 2016. Overview of the bacteria biotope task at BioNLP shared task 2016. In Proceedings of the 4th BioNLP shared task workshop, pages 12–22.

Andre Esteva, Anuprit Kale, Romain Paulus, Kazuma Hashimoto, Wenpeng Yin, Dragomir Radev, and Richard Socher. 2020. Co-search: COVID-19 information retrieval with semantic search, question answering, and abstractive summarization. arXiv preprint arXiv:2006.09595.

Maryam Habibi, Leon Weber, Mariana Neves, David Luis Wiegandt, and Ulf Leser. 2017. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics, 33(14):i37–i48.

Tom Hope, Jason Portenoy, Kishore Vasan, Jonathan Borchardt, Eric Horvitz, Daniel S Weld, Marti A Hearst, and Jevin West. 2020. Scisight: Combining faceted navigation and research group detection for COVID-19 exploratory scientific search. arXiv preprint arXiv:2005.12668.
Lifu Huang, Jonathan May, Xiaoman Pan, Heng Ji, Xiang Ren, Jiawei Han, Lin Zhao, and James Hendler. 2017. Liberal entity extraction: Rapid construction of fine-grained entity typing systems. In Big Data, Mar 2017, 5(1):19–31.

Filip Ilievski, Daniel Garijo, Hans Chalupsky, Naren Teja Divvala, Yixiang Yao, Craig Rogers, Ronpeng Li, Jun Liu, Amandeep Singh, Daniel Schwabe, et al. 2020. Kgtk: A toolkit for large knowledge graph manipulation and analysis. arXiv preprint arXiv:2006.00088.

Martin Krallinger, Florian Leitner, Obdulia Rabal, Miguel Vazquez, Julen Oyarzabal, and Alfonso Valencia. 2013. Overview of the chemical compound and drug name recognition (chemdner) task. In BioCreative challenge evaluation workshop, volume 2, page 2. Citeseer.

Martin Krallinger, Miguel Vazquez, Florian Leitner, David Salgado, Andrew Chatr-Aryamontri, Andrew Winter, Livia Perfetto, Leonardo Briganti, Luana Licata, Marta Iannuccelli, et al. 2011. The protein-protein interaction tasks of biocreative iii: classification/ranking of articles and linking bio-ontology concepts to full text. BMC bioinformatics, 12(S8):S3.

Milena Kraus, Julian Niedermeier, Marcel Jankrift, Sören Tietböhl, Toni Stachewicz, Hendrik Folkerts, Matthias Uflacker, and Mariana Neves. 2017. Olelo: a web application for intuitive exploration of biomedical literature. Nucleic acids research, 45(W1):W478–W483.

Larry J Kricka, Sergei Polevikov, Jason Y Park, Paolo Fortina, Sergio Bernardini, Daniel Satchkov, Valentin Kolesov, and Maxim Grishkov. 2020. Artificial intelligence-powered search tools and resources in the fight against covid-19. EJIFCC, 31(2):106.

Robert Leaman and Zhiyong Lu. 2016. Taggerone: joint named entity recognition and normalization with semi-markov models. Bioinformatics, 32(18):2839–2846.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Diya Li, Lifu Huang, Heng Ji, and Jiawei Han. 2019. Biomedical event extraction based on knowledge-driven tree-lstm. In Proc. 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT2019).

Diya Li and Heng Ji. 2019. Syntax-aware multi-task graph convolutional networks for biomedical relation extraction. In Proc. EMNLP2019 Workshop on Health Text Mining and Information Analysis.

Manling Li, Alireza Zareian, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, and Shih-Fu Chang. 2020. Cross-media structured common space for multimedia event extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2557–2568, Online. Association for Computational Linguistics.

Yanan Lu, Donghong Ji, Xiaoyuan Yao, Xiaomei Wei, and Xiaohui Liang. 2015. CHEMDNER system with mixed conditional random fields and multi-scale word clustering. J. Cheminf., 7(S1):S4.

Yuan Luo, Özlem Uzuner, and Peter Szolovits. 2017. Bridging semantics and syntax with graph algorithms - state-of-the-art of extracting biomedical relations. Briefings in bioinformatics, 18(1):160–178.

Claire Nédellec, Robert Bossy, Jin-Dong Kim, Jung-Jae Kim, Tomoko Ohta, Sampo Pyysalo, and Pierre Zweigenbaum. 2013. Overview of bionlp shared task 2013. In Proceedings of the BioNLP shared task 2013 workshop, pages 1–7.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph lstms. Transactions of the Association for Computational Linguistics, 5:101–115.

Yifan Peng, Qingyu Chen, and Zhiyong Lu. 2020. An empirical study of multi-task learning on BERT for biomedical text mining. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 205–214, Online. Association for Computational Linguistics.

Yifan Peng, Chih-Hsuan Wei, and Zhiyong Lu. 2016. Improving chemical disease relation extraction with rich features and weakly labeled data. Journal of cheminformatics, 8(1):53.

Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 58–65, Florence, Italy. Association for Computational Linguistics.

Chris Quirk and Hoifung Poon. 2016. Distant supervision for relation extraction beyond the sentence boundary. arXiv preprint arXiv:1609.04873.

Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Isabel Segura-Bedmar, Paloma Martínez, and María Herrero-Zazo. 2013. SemEval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 341–350, Atlanta, Georgia, USA. Association for Computational Linguistics.

Elaheh ShafieiBavani, Antonio Jimeno Yepes, Xu Zhong, and David Martinez Iraola. 2020. Global locality in biomedical relation and event extraction. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 195–204, Online. Association for Computational Linguistics.

Noah Siegel, Nicholas Lourie, Russell Power, and Waleed Ammar. 2018. Extracting scientific figures with distantly supervised neural networks. In Proceedings of the 18th ACM/IEEE on joint conference on digital libraries, pages 223–232.

Ray Smith. 2007. An overview of the tesseract ocr engine. In Ninth international conference on document analysis and recognition (ICDAR 2007), volume 2, pages 629–633. IEEE.

Hillel Taub Tabib, Micah Shlain, Shoval Sadde, Dan Lahav, Matan Eyal, Yaara Cohen, and Yoav Goldberg. 2020. Interactive extractive search over biomedical corpora. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 28–37, Online. Association for Computational Linguistics.

George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. 2015. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics, 16(1):138.

Satoshi Tsutsui and David J Crandall. 2017. A data driven approach for compound figure separation using convolutional neural networks. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 533–540. IEEE.

Özlem Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2011. 2010 i2b2/va challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.

Sofie Van Landeghem, Jari Björne, Chih-Hsuan Wei, Kai Hakala, Sampo Pyysalo, Sophia Ananiadou, Hung-Yu Kao, Zhiyong Lu, Tapio Salakoski, Yves Van de Peer, et al. 2013. Large-scale event extraction from literature with multi-level gene normalization. PloS one, 8(4):e55814.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, et al. 2020a. Cord-19: The covid-19 open research dataset. ArXiv.

Qingyun Wang, Lifu Huang, Zhiying Jiang, Kevin Knight, Heng Ji, Mohit Bansal, and Yi Luan. 2019. Paperrobot: Incremental draft generation of scientific ideas. In Proc. The 57th Annual Meeting of the Association for Computational Linguistics (ACL2019).

Xuan Wang, Yingjun Guan, Weili Liu, Aabhas Chauhan, Enyi Jiang, Qi Li, David Liem, Dibakar Sigdel, John Caufield, Peipei Ping, et al. 2020b. Evidenceminer: Textual evidence discovery for life sciences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 56–62.

Xuan Wang, Weili Liu, Aabhas Chauhan, Yingjun Guan, and Jiawei Han. 2020c. Automatic textual evidence mining in covid-19 literature. arXiv preprint arXiv:2004.12563.

Xuan Wang, Xiangchen Song, Yingjun Guan, Bangzheng Li, and Jiawei Han. 2020d. Comprehensive named entity recognition on cord-19 with distant or weak supervision. arXiv preprint arXiv:2003.12218.

Xuan Wang, Yu Zhang, Xiang Ren, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, Curtis Langlotz, and Jiawei Han. 2018. Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics, page bty869.

Chih-Hsuan Wei, Alexis Allot, Robert Leaman, and Zhiyong Lu. 2019. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Research, 47(W1):W587–W593.

Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Jiao Li, Thomas C Wiegers, and Zhiyong Lu. 2015. Overview of the biocreative v chemical disease relation (cdr) task. In Proceedings of the fifth BioCreative challenge evaluation workshop, volume 14.

Francis Wolinski. 2020. Visualization of diseases at risk in the covid-19 literature. arXiv preprint arXiv:2005.00848.

Zi Yang, Niloy Gupta, Xiangyu Sun, Di Xu, Chi Zhang, and Eric Nyberg. 2015. Learning to answer biomedical factoid & list questions: Oaqa at bioasq 3b. CLEF (Working Notes), 1391.

Zi Yang, Yue Zhou, and Eric Nyberg. 2016. Learning to answer biomedical questions: Oaqa at bioasq 4b. In Proceedings of the Fourth BioASQ workshop, pages 23–37.

Haibo Zhang, Josef M Penninger, Yimin Li, Nanshan Zhong, and Arthur S Slutsky. 2020. Angiotensin-converting enzyme 2 (ace2) as a sars-cov-2 receptor: molecular mechanisms and potential therapeutic target. Intensive care medicine, 46(4):586–590.

Jin Guang Zheng, Daniel Howsmon, Boliang Zhang, Juergen Hahn, Deborah McGuinness, James Hendler, and Heng Ji. 2014. Entity linking for biomedical literature. In BMC Medical Informatics and Decision Making.

Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. 2017. East: an efficient and accurate scene text detector. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 5551–5560.
