Richard C. Kiefer Robert R. Freimuth, PhD Christopher G. Chute, MD, DrPH Jyotishman Pathak, PhD
AMIA Clinical Research Informatics (CRI) Summit March 20, 2013
Outline
Research goals
Data sources
Creation of endpoints
SPARQL queries
Results
Analysis
Future research
Background
Advances have led to a heavy influx of large amounts of association data interlinking genes, SNPs, proteins, diseases and drugs.
The Linked Data paradigm provides the requisite backbone for building applications that can address analysis of this information.
As of September 2012, more than 300 datasets with approximately 30 billion Resource Description Framework (RDF) triples have been connected via more than 500 million links.
The Linked Clinical Data (LCD) Project
Objective
[...] SPARQL querying capabilities using three public knowledgebases:
NCBI database for single nucleotide polymorphisms (dbSNP)
Online Mendelian Inheritance in Man (OMIM)
Gene Wiki Plus (GeneWiki+)
[...] six chronic diseases:
Arthritis
Asthma
Diabetes
Dementia
GeneWiki+
[...] retrieval of large amounts of information which also includes additional public data sources.
[...] detailed entries on human genes with half of them being linked to the Disease Ontology. There are 19,000 articles on SNPs with a tenth of them being directly associated to a disease.
[...] (approximately 80MB) allowed us to experiment with SPARQL queries which could not be executed within the GeneWiki+ environment.
The Linked Clinical Data (LCD) Project 2013 MFMER | slide-7
OMIM
Online Mendelian Inheritance in Man (OMIM) is a comprehensive knowledgebase of human genes and genetic disorders that is distributed by the U.S. National Center for Biotechnology Information (NCBI) to support research in genomics.
[...] human genes and genetic disorders, including around 5,400 phenotypes with a clinical synopsis.
[...] SPARQL endpoint contains 2,895,940 triples.
dbSNP
The NCBI database for single nucleotide polymorphisms (dbSNP) contains information about genetic variations. Variations in dbSNP are linked to a genetic locus using Entrez Gene identifiers.
[...] more than 58.5 million reference SNPs and 187.8 million submitted SNPs for Homo sapiens.
[...] dbSNP SPARQL endpoint; therefore, we downloaded the database and created our own endpoint.
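As a rough illustration of what exposing relational dbSNP data as RDF involves, the sketch below maps (SNP, gene) rows to N-Triples lines. It is not the project's actual mapping: the base URIs, the `partOf` predicate URI, and the gene identifier in the sample row are all placeholders assumed for this example.

```python
# Hypothetical URI scheme; the slides do not state the one the project used.
SNP_BASE = "http://example.org/dbsnp/"
GENE_BASE = "http://example.org/entrezgene/"
PART_OF = "http://example.org/vocab/partOf"


def rows_to_ntriples(rows):
    """Map (rs_id, entrez_gene_id) rows to N-Triples lines."""
    lines = []
    for rs_id, gene_id in rows:
        subject = f"<{SNP_BASE}{rs_id}>"
        obj = f"<{GENE_BASE}{gene_id}>"
        lines.append(f"{subject} <{PART_OF}> {obj} .")
    return lines


# Illustrative row only; "0000" is a placeholder, not a real Entrez Gene ID.
example_rows = [("rs3184504", "0000")]
snp_gene_triples = rows_to_ntriples(example_rows)
print("\n".join(snp_gene_triples))
```

An RDF view performs this kind of mapping at query time rather than materializing the triples, which avoids duplicating a database of this size.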
Extraction OMIM/dbSNP
[...] to create scripts allowing the load of data into MySQL. Using Virtuoso, an endpoint was created through an RDF view of the database.
To extract the disease-gene associations from OMIM, we devised a simple SPARQL query for execution at Bio2RDF's OMIM endpoint.
Federated queries were run against OMIM to get the genes which have been given a causal relationship to each of the six chronic diseases. Including dbSNP in the federated query provided the SNPs which make up those genes.
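The federated step described above might look something like the following sketch, which builds a SPARQL 1.1 query with two `SERVICE` clauses. The `ex:` vocabulary, the predicate names, and the endpoint URLs are assumptions made for illustration; the actual Bio2RDF and dbSNP schemas used in the project are not shown in the slides.

```python
def build_federated_query(disease_label, omim_endpoint, dbsnp_endpoint):
    """Return a SPARQL 1.1 query string that joins two endpoints via SERVICE."""
    # Escape backslashes and double quotes so the label is a valid string literal.
    safe_label = disease_label.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
PREFIX ex: <http://example.org/vocab/>
SELECT DISTINCT ?gene ?snp WHERE {{
  SERVICE <{omim_endpoint}> {{
    ?disease ex:label "{safe_label}" .
    ?gene ex:causes ?disease .
  }}
  SERVICE <{dbsnp_endpoint}> {{
    ?snp ex:partOf ?gene .
  }}
}}
""".strip()


# Placeholder endpoint URLs, not the real service addresses.
asthma_query = build_federated_query(
    "Asthma",
    "http://example.org/omim/sparql",
    "http://example.org/dbsnp/sparql",
)
print(asthma_query)
```

The join on `?gene` only works if both endpoints identify genes with the same URIs, so a real query may need an extra mapping step between identifier schemes.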
Extraction GeneWiki+
[...] snapshot of GeneWiki+ data as RDF files, and loaded the dataset in a local Virtuoso server to create a SPARQL endpoint.
[...] disease-SNP associations (in addition to the disease-gene associations).
[...] which have been given a causal relationship to each of the six chronic diseases. That result set was UNIONed to also retrieve the SNPs having a stated causal relationship.
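A query combining the gene results with the SNP results through `UNION` could be shaped roughly as below. This is only a sketch: the `ex:Gene` and `ex:SNP` classes, the `ex:causes` predicate, and the disease URI are hypothetical stand-ins for whatever the GeneWiki+ export actually uses.

```python
def build_union_query(disease_uri):
    """Return a SPARQL query retrieving genes and SNPs linked to one disease."""
    return f"""
PREFIX ex: <http://example.org/vocab/>
SELECT DISTINCT ?entity ?kind WHERE {{
  {{
    ?entity a ex:Gene ;
            ex:causes <{disease_uri}> .
    BIND("gene" AS ?kind)
  }}
  UNION
  {{
    ?entity a ex:SNP ;
            ex:causes <{disease_uri}> .
    BIND("snp" AS ?kind)
  }}
}}
""".strip()


# Placeholder disease URI for illustration.
union_query = build_union_query("http://example.org/disease/Asthma")
print(union_query)
```

Tagging each branch with `?kind` keeps the two result sets distinguishable after they are merged.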
SPARQL GeneWiki+
Results Validation
Analysis
For Asthma, all 17 genes found in OMIM had a PubMed citation. Only 50 of the 113 genes mentioned by GeneWiki+ had PubMed citations.
Our technique can deduce disease-gene associations that were not explicitly stated in the RDF graph, although further validation and verification is required before such data can be applied effectively.
[Figure: graph linking rs3184504, SH2B3 and Asthma through partOf and associatedWith edges, with one association marked INFERRED]
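One way to read the deduction illustrated here is as a simple rule: if a SNP is part of a gene and the SNP is associated with a disease, then the gene is associated with that disease. The sketch below applies that rule to plain tuples. The direction of the rule is an assumption drawn from the statement that disease-gene associations are deduced; the slides do not spell out the actual query or reasoner.

```python
def infer_gene_disease(triples):
    """Infer (gene, associatedWith, disease) from SNP-level statements.

    Assumed rule: snp partOf gene AND snp associatedWith disease
                  => gene associatedWith disease.
    """
    # Index each SNP to the genes it is stated to be part of.
    part_of = {}
    for s, p, o in triples:
        if p == "partOf":
            part_of.setdefault(s, set()).add(o)

    # Associations already stated explicitly should not be reported as inferred.
    stated = {(s, o) for s, p, o in triples if p == "associatedWith"}

    inferred = set()
    for s, p, o in triples:
        if p == "associatedWith":
            for gene in part_of.get(s, ()):
                if (gene, o) not in stated:
                    inferred.add((gene, "associatedWith", o))
    return inferred


example_triples = [
    ("rs3184504", "partOf", "SH2B3"),
    ("rs3184504", "associatedWith", "Asthma"),
]
inferred = infer_gene_disease(example_triples)
print(inferred)
```

Results produced this way are candidates only, in line with the caveat that further validation and verification is required.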
Analysis
[...] GeneWiki+ based on the query but the relationship to the disease was mentioned within the text of the article in OMIM.
[...] with Asthma in the query results, but was mentioned and referenced at the website.
Deductions
[...] associations in GeneWiki+ was significantly higher than OMIM. Such a finding is contrary to our initial hypothesis.
[...] extracted from GeneWiki+ and OMIM was significantly less (<5% on average) for all the six diseases.
[...] with Cancer was highest followed by Diabetes and Arthritis. This is consistent with the number of genome-wide association studies conducted.
[...] where the gene is associated with a disease but the SNP's association cannot be asserted. Therefore, SNP comparisons could not be analyzed.
OMIM data did not reflect page content.
GeneWiki+ surfaces more data, but only half had PubMed citations to support it.
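The "only half" figure can be checked against the Asthma counts reported earlier in the deck (17 of 17 OMIM genes and 50 of 113 GeneWiki+ genes with PubMed citations). The helper below is a small illustrative calculation; its name is an assumption, not something taken from the slides.

```python
def citation_coverage(cited, total):
    """Return the percentage of genes with a PubMed citation, to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * cited / total, 1)


# Asthma counts as reported in the Analysis slide.
omim_asthma = citation_coverage(17, 17)
genewiki_asthma = citation_coverage(50, 113)
print(omim_asthma, genewiki_asthma)
```

For Asthma this gives 100.0% for OMIM and 44.2% for GeneWiki+, so "half" is a slightly generous rounding for that disease.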
[...] data between public endpoints and private medical data to investigate patient genotypes and phenotypes.
Thank You!
http://informatics.mayo.edu/LCD