
Mining Genotype-Phenotype Associations from Public Knowledge Sources via Semantic Web Querying

Richard C. Kiefer Robert R. Freimuth, PhD Christopher G. Chute, MD, DrPH Jyotishman Pathak, PhD
AMIA Clinical Research Informatics (CRI) Summit March 20, 2013

Outline

Research goals (Why?)
Data sources (Who?)
Creation of endpoints (Where?)
SPARQL queries (How?)
Results (What?)
Analysis (Huh?)
Future research (Next?)

The Linked Clinical Data (LCD) Project

2013 MFMER | slide-2

Background

Advances have led to a heavy influx of large amounts of association data interlinking genes, SNPs, proteins, diseases and drugs.

The Linked Data paradigm provides the requisite backbone for building applications that can address analysis of this information.

In the W3C Linked Open Data (LOD) project, as of September 2012, more than 300 datasets with approximately 30 billion Resource Description Framework (RDF) triples have been connected via more than 500 million links.

The LOD cloud, September 2012


DBpedia: Extracting structured data from Wikipedia


@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbterm: <http://dbpedia.org/property/> .

dbpedia:Amsterdam
    dbterm:officialName "Amsterdam" ;
    dbterm:longd "4" ;
    dbterm:longm "53" ;
    dbterm:longs "32" ;
    dbterm:website <http://www.amsterdam.nl> ;
    dbterm:populationUrban "1364422" ;
    dbterm:areaTotalKm "219" ;
    ...

dbpedia:ABN_AMRO
    dbterm:location dbpedia:Amsterdam ;
    ...

Objective

Our goal is to demonstrate the applicability of SPARQL querying capabilities using three public knowledgebases:

NCBI database for single nucleotide polymorphisms (dbSNP)
Online Mendelian Inheritance in Man (OMIM)
Gene Wiki Plus (GeneWiki+)

To extract disease-gene-SNP associations for six chronic diseases:

Arthritis
Asthma
Diabetes
Dementia
Cancer
Obesity

GeneWiki+

GeneWiki+ enables semantic queries for retrieval of large amounts of information, which also includes additional public data sources.

GeneWiki+ contains approximately 11,000 detailed entries on human genes, with half of them linked to the Disease Ontology. There are 19,000 articles on SNPs, with a tenth of them directly associated with a disease.

Creating a local dump of the RDF data files (approximately 80 MB) allowed us to experiment with SPARQL queries which could not be executed within the GeneWiki+ environment.

OMIM

The Online Mendelian Inheritance in Man (OMIM) is a comprehensive knowledgebase of human genes and genetic disorders that is distributed by the U.S. National Center for Biotechnology Information (NCBI) to support research in genomics.

OMIM contains over 21,500 detailed entries on human genes and genetic disorders, including around 5,400 phenotypes with a clinical synopsis.

The OMIM data accessible via the Bio2RDF SPARQL endpoint contains 2,895,940 triples.

dbSNP

The NCBI database for single nucleotide polymorphisms (dbSNP) contains information about genetic variations. Variations in dbSNP are linked to a genetic locus using Entrez Gene identifiers.

In the June 2012 build of dbSNP there were more than 58.5 million reference SNPs and 187.8 million submitted SNPs for Homo sapiens.

To the best of our knowledge, there is no public dbSNP SPARQL endpoint; therefore, we downloaded the database and created our own endpoint.

dbSNP and OMIM RDF graphs


Extraction OMIM/dbSNP

Perl scripts were run against dbSNP build 134 to create scripts allowing the load of data into MySQL. Using Virtuoso, an endpoint was created through an RDF view of the database.

To extract the disease-gene associations from OMIM, we devised a simple SPARQL query for execution at Bio2RDF's OMIM endpoint.

Federated queries were run against OMIM to get the genes which have been given a causal relationship to each of the six chronic diseases. Including dbSNP in the federated query provided the SNPs that belong to those genes.
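The federated step described on this slide could look roughly like the sketch below. It is not the query used in the study. The `ex:` vocabulary (`ex:label`, `ex:causedByGene`, `ex:locatedInGene`), the local endpoint URL, and the helper name `build_federated_query` are all hypothetical and chosen only for illustration. Only the SPARQL 1.1 `SERVICE` keyword and the overall shape (genes from OMIM first, then SNPs from dbSNP) follow the description.

```python
# Hypothetical vocabulary and endpoint; the real OMIM/dbSNP predicates differ.
EX = "http://example.org/vocab#"
DBSNP_ENDPOINT = "http://localhost:8890/sparql"  # assumed local Virtuoso endpoint


def build_federated_query(disease, dbsnp_endpoint=DBSNP_ENDPOINT):
    """Build a SPARQL 1.1 query string: genes linked to a disease, plus the
    SNPs that a second endpoint (reached via SERVICE) records for those genes."""
    # Lower-case the search term and escape it for use in a SPARQL string literal.
    needle = disease.lower().replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        f"PREFIX ex: <{EX}>",
        "SELECT DISTINCT ?gene ?snp WHERE {",
        "  # Genes given a causal relationship to a matching phenotype.",
        "  ?phenotype ex:label ?label ;",
        "             ex:causedByGene ?gene .",
        f'  FILTER(CONTAINS(LCASE(STR(?label)), "{needle}"))',
        "  # SNPs that the remote dbSNP endpoint places in those genes.",
        f"  SERVICE <{dbsnp_endpoint}> {{",
        "    ?snp ex:locatedInGene ?gene .",
        "  }",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_federated_query("Asthma"))
```

The sketch only builds the query text. Sending it to an endpoint is left out because the endpoint URLs and vocabularies would have to be replaced with real ones first.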

SPARQL OMIM & dbSNP federation


Extraction GeneWiki+

We downloaded the February 12th, 2012 snapshot of GeneWiki+ data as RDF files, and loaded the dataset in a local Virtuoso server to create a SPARQL endpoint.

Unlike OMIM, GeneWiki+ explicitly specifies disease-SNP associations (in addition to the disease-gene associations).

A SPARQL query was run to retrieve the genes which have been given a causal relationship to each of the six chronic diseases. That result set was UNIONed to also retrieve the SNPs having a stated causal relationship.
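A query of the kind described here, one branch for genes and one for SNPs joined with `UNION`, might be sketched as follows. This is an illustration under assumptions, not the query from the study. The `ex:` classes and the `ex:isCausalFor` predicate are hypothetical stand-ins for the GeneWiki+ vocabulary, and `build_union_query` is an invented helper name.

```python
EX = "http://example.org/vocab#"  # hypothetical namespace, not the GeneWiki+ one


def build_union_query(disease_iri):
    """Build a SPARQL query string returning genes and SNPs that have a
    stated causal relationship to the given disease, tagged by kind."""
    lines = [
        f"PREFIX ex: <{EX}>",
        "SELECT DISTINCT ?entity ?kind WHERE {",
        "  {",
        "    # Branch 1: genes with a stated causal relationship.",
        f"    ?entity a ex:Gene ; ex:isCausalFor <{disease_iri}> .",
        '    BIND("gene" AS ?kind)',
        "  } UNION {",
        "    # Branch 2: SNPs with a stated causal relationship.",
        f"    ?entity a ex:SNP ; ex:isCausalFor <{disease_iri}> .",
        '    BIND("snp" AS ?kind)',
        "  }",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_union_query("http://example.org/disease/Asthma"))
```

Tagging each row with `?kind` is a design choice of this sketch, so that gene and SNP results can be separated after a single query.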

SPARQL GeneWiki+


Results Gene comparison

We compared the genes/SNPs returned from both queries and calculated results.

The uniqueness of genes was based on a comparison of their Entrez-Gene IDs rather than gene symbols/names.
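Comparing two result sets by Entrez Gene ID comes down to set operations. The sketch below shows one way to do it. The function name, the placeholder IDs in the usage example, and the overlap definition (shared IDs divided by the union of both sets) are assumptions, since the slides do not say how the overlap percentage was computed.

```python
def compare_gene_sets(omim_ids, genewiki_ids):
    """Compare two collections of Entrez Gene IDs and summarize the overlap."""
    omim, genewiki = set(omim_ids), set(genewiki_ids)
    shared = omim & genewiki
    union = omim | genewiki
    return {
        "omim_only": sorted(omim - genewiki),
        "genewiki_only": sorted(genewiki - omim),
        "shared": sorted(shared),
        # Assumed definition: shared IDs as a percentage of all distinct IDs.
        "overlap_pct": 100.0 * len(shared) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    # Placeholder IDs for illustration only; they are not real query results.
    summary = compare_gene_sets([101, 102, 103], [103, 104])
    print(summary)
```

Keying on numeric IDs rather than symbols avoids counting the same gene twice when two sources use different names or synonyms for it.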

Results Validation

SPARQL queries discovered 130 unique genes associated with asthma.

A little over half of those gene associations were validated through a PubMed citation.
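The "little over half" figure can be checked against the per-source asthma counts reported on the Analysis slide (17 of 17 OMIM genes and 50 of 113 GeneWiki+ genes with a PubMed citation). The arithmetic below assumes the two asthma gene sets do not overlap. The matching total of 130 suggests this, but the slides do not state it.

```python
# Asthma counts as reported in the slides.
omim_genes, omim_cited = 17, 17
genewiki_genes, genewiki_cited = 113, 50

# Assumption: the two asthma gene sets are disjoint, so the totals simply add.
total_genes = omim_genes + genewiki_genes
total_cited = omim_cited + genewiki_cited

validated_pct = round(100 * total_cited / total_genes, 1)
genewiki_pct = round(100 * genewiki_cited / genewiki_genes, 1)

print(total_genes, total_cited, validated_pct, genewiki_pct)
```

Under that assumption, 67 of 130 genes (about 51.5%) carry a citation, while the GeneWiki+ share on its own is about 44.2%.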

Analysis

For Asthma, all 17 genes found in OMIM had a PubMed citation. Only 50 of the 113 genes mentioned by GeneWiki+ had PubMed citations.

Our technique can deduce disease-gene associations that were not explicitly stated in the RDF graph, although further validation and verification is required before such data can be applied effectively.

[Diagram: rs3184504 partOf SH2B3; rs3184504 associatedWith Asthma; SH2B3 associatedWith Asthma (INFERRED)]
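The deduction in the rs3184504 example can be written as a simple join over triples: if a SNP is `partOf` a gene and the same SNP is `associatedWith` a disease, a gene-disease association is inferred. The sketch below is a minimal in-memory illustration of that rule, not the authors' implementation. The function name and the plain-tuple representation of triples are assumptions.

```python
def infer_gene_disease(triples):
    """Infer (gene, 'associatedWith', disease) triples from SNP-level facts.

    Rule: SNP partOf gene AND SNP associatedWith disease
          => gene associatedWith disease (unless already stated).
    """
    part_of = [(s, o) for s, p, o in triples if p == "partOf"]
    associated = [(s, o) for s, p, o in triples if p == "associatedWith"]

    inferred = {
        (gene, "associatedWith", disease)
        for snp, gene in part_of
        for subject, disease in associated
        if subject == snp
    }
    # Keep only associations that were not explicitly present in the input.
    return inferred - set(triples)


if __name__ == "__main__":
    graph = [
        ("rs3184504", "partOf", "SH2B3"),
        ("rs3184504", "associatedWith", "Asthma"),
    ]
    print(infer_gene_disease(graph))
```

As the slide notes, associations produced this way are candidates that still need validation, since a SNP-level association does not by itself establish a causal role for the gene.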

Analysis

Some genes appeared to be unique to GeneWiki+ based on the query, even though their relationship to the disease was mentioned within the text of the article in OMIM.

For example, DPP10 did not show up as associated with Asthma in the OMIM query results, but the association was mentioned and referenced on the OMIM website.

Deductions

The total number of unique disease-gene associations in GeneWiki+ was significantly higher than in OMIM. Such a finding is contrary to our initial hypothesis.

The overlap between the genes extracted from GeneWiki+ and OMIM was small (<5% on average) across all six diseases.

Overall, the number of unique genes associated with Cancer was highest, followed by Diabetes and Arthritis. This is consistent with the number of genome-wide association studies conducted.

Limitations and Future Work

The OMIM-dbSNP query returns a list of SNPs whose gene is associated with a disease, but the SNP's own association with the disease cannot be asserted. We therefore could not analyze SNP comparisons.

OMIM data did not reflect page content.

GeneWiki+ surfaces more data, but only about half of the associations had supporting PubMed citations.

Future work will use federated queries to join data between public endpoints and private medical data to investigate patient genotypes and phenotypes.

Thank You!

http://informatics.mayo.edu/LCD
