Overview
Introduction and Background
- Challenges in dbGaP database
- Objective of the study
Genotype
Phenotype
Phenotypes: diseases, signs and symptoms, clinical attributes, etc.
dbGaP currently hosts 400+ studies, 2500+ datasets, and 130,000+ phenotype variables.
Introduction
With the advancements in genome-wide association studies (GWAS), public repositories of genotype and phenotype data, such as the database of Genotypes and Phenotypes (dbGaP), have become increasingly available online (1). The proper use or reuse of GWAS data could promote exploratory research, novel scientific discovery, validation of existing findings, and reduction of cost and time for research. However, data in such public repositories are not collected in a standardized or harmonized way, and hence it is challenging to reuse them. For example, as illustrated in Table 1, variables are often named without following a specific naming convention, or are labeled with abbreviated codes that do not convey specific meaning. Many of these variables are accompanied by variable descriptions that can help users understand what data the variable intends to represent. However, keyword searches applied to variable descriptions do not always provide accurate results due to many syntactic and lexical complexities associated with narrative text, such as use of negation and synonyms (3).

Unstandardized representation of phenotype variables results in incomplete and inaccurate data retrieval.
- No specific naming convention
- No specific meaning in abbreviated codes
Table 1. Phenotype variable height in dbGaP
Idiosyncratic ways of representing the variable height

Variable ID          Variable Name          Variable Description
phv00071000.v1       htcm                   Standing height at follow up visit
phv00165340.v1.p2    ESP_HEIGHT_BASELINE    Standing height in cm at baseline
phv00083471.v1.p2    lunghta4               HEIGHT (cm)
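The effect of idiosyncratic names on retrieval can be sketched with the three rows of Table 1. This is an illustration only: the `keyword_search` helper is hypothetical and is not part of dbGaP or PFINDR, and the pairing of IDs, names, and descriptions assumes they are listed in the same order in the table.

```python
# Rows taken from Table 1 (pairing by order of appearance is an assumption).
TABLE_1 = [
    ("phv00071000.v1", "htcm", "Standing height at follow up visit"),
    ("phv00165340.v1.p2", "ESP_HEIGHT_BASELINE", "Standing height in cm at baseline"),
    ("phv00083471.v1.p2", "lunghta4", "HEIGHT (cm)"),
]


def keyword_search(keyword, field):
    """Hypothetical helper: return variable IDs whose chosen field
    contains the keyword, ignoring case."""
    index = {"name": 1, "description": 2}[field]
    needle = keyword.lower()
    return [row[0] for row in TABLE_1 if needle in row[index].lower()]


# Searching variable names misses the abbreviated codes (htcm, lunghta4).
name_hits = keyword_search("height", "name")

# Searching variable descriptions finds every height variable in this sample.
description_hits = keyword_search("height", "description")

print("name hits:", name_hits)
print("description hits:", description_hits)
```

In this small sample the description search recovers all three variables, which is consistent with the focus on variable descriptions. It does not address negation or synonyms, which a plain substring match cannot handle.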
Idiosyncrasies in variable names are a major challenge to utilizing the data included in dbGaP. Standardizing phenotype variables in such a way that supports an accurate and complete search against dbGaP data is one of the main purposes of the Phenotype Finder in Data Resource (PFINDR) program funded by the National Heart, Lung, and Blood Institute (NHLBI). As the first step towards standardizing the phenotype variables in dbGaP, we tested the feasibility of using an existing information model for clinical data, the Clinical Element Models (CEM) developed by GE
http://www.ncbi.nlm.nih.gov/gap
Idiosyncrasies of phenotype variables make it difficult to identify relevant data with a sufficient level of accuracy.
Standardization of phenotype variables is important.
Focus on variable descriptions.
[Figure: Original workflow and new workflow. Labels: Data user; Free text search; Structured (advanced) search; Unsorted, flat list results.]
Objective
Goal: Investigate an information model based approach to standardizing phenotype variables in dbGaP.
Phase I
Representing phenotype variable descriptions in dbGaP using CEM template models

1. Randomly retrieve 200 non-demographic phenotype variable descriptions from two phenotype data dictionaries in dbGaP.
2. Manually conduct the modeling using the six CEM template models.

Results
1. 115 unique variables
2. CEM template models represented 70% of phenotype variable descriptions and are overly complex.

Mapping results
TABLE 4. RESULTS OF MAPPING PHENOTYPE NAMES TO CEM

Phenotype categories
Diseases and Disorders     0    116    0     5     7    128
Procedures                 0      0    0     0     0      0
Signs and Symptoms         2     19    2     2    56     81
Medications                0      0    0     0     0      0
Anatomical                 0      0    0     0     0      0
                          25    143   48    24   139    379
Topics                                        Number of variables (%)
Diseases and Disorders                        1 (0.87)
Findings (excluding Disease or Disorder)      70 (60.87)
Medications                                   2 (1.74)
Laboratory tests                              8 (6.96)
Not applicable                                30 (26.09)
Unknown                                       4 (3.48)
Total number                                  115
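The percentages in the topic table follow directly from the counts and the total of 115 unique variables. A minimal sketch of that arithmetic (the dictionary and variable names are illustrative and do not come from the slides):

```python
# Counts per topic, copied from the table above.
topic_counts = {
    "Diseases and Disorders": 1,
    "Findings (excluding Disease or Disorder)": 70,
    "Medications": 2,
    "Laboratory tests": 8,
    "Not applicable": 30,
    "Unknown": 4,
}

# The total should match the 115 unique variables reported in the results.
total = sum(topic_counts.values())

# Share of each topic as a percentage, rounded to two decimals as in the table.
percentages = {
    topic: round(100 * count / total, 2)
    for topic, count in topic_counts.items()
}

for topic, pct in percentages.items():
    print(f"{topic}: {topic_counts[topic]} ({pct:.2f})")
print("Total number:", total)
```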
TABLE 5. CATEGORIES OF THE PHENOTYPE VARIABLE AND RELEVANT CEM TEMPLATE MODELS USED
20 3 0
2 6 0
44 2 0
10 7 0
21 32 23
97 50 23
CEM template models used: Diseases and Disorders; Signs and Symptoms; Medication, Signs and Symptoms; Laboratory Tests, Signs and Symptoms; ---
as in dbGaP).
Phase II Methods
MetaMap
Mapping
Develop rules
Algorithmic process
Phase II Results
Our information model was constructed with 10 semantic role classes.

No.  Semantic role class name    Examples
1    Topic                       Disease, Signs and symptoms
2    Subject of information      Patient, family members
3    Informer                    Doctor
4    Certainty                   Diagnosed, confirmed
5    Situational Context         While sleeping, after birth
6    Temporal modifier           Last month, since last visit
7    Extent modifier             Loudly, excessive
8    Health outcomes             Hospitalization
9    Body site                   Right leg, lower back
10   Quantity Qualifier          How many, count
Mapping Example 1
Mom has lung cancer diagnosed by doctor last year
Subject of Information: Mom
Quantity Qualifier:
Informer: doctor
Body site:
Topic: lung cancer
Certainty: diagnosed
Situational Context:

Quantity Qualifier:
Informer:
Topic: pain
Certainty:
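The mapping in Example 1 can be sketched as a dictionary lookup. This is a simplified stand-in rather than the pipeline from the slides: the Phase II methods list MetaMap and rule development, whereas the code below uses a small hand-written lexicon and a greedy longest-match scan. The `LEXICON` contents and the `map_description` function are hypothetical, and treating "last year" as a Temporal modifier is an assumption based on the examples given for that class.

```python
# Hypothetical mini-lexicon: phrase -> semantic role class.
# It stands in for MetaMap output plus hand-developed rules.
LEXICON = {
    "mom": "Subject of information",
    "lung cancer": "Topic",
    "diagnosed": "Certainty",
    "doctor": "Informer",
    "last year": "Temporal modifier",  # assumption, not shown on the slide
}


def map_description(description):
    """Scan left to right and tag the longest lexicon phrase at each position."""
    tokens = description.lower().split()
    max_len = max(len(phrase.split()) for phrase in LEXICON)
    roles = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in LEXICON:
                roles.append((LEXICON[phrase], phrase))
                i += n
                break
        else:
            i += 1  # token is not in the lexicon; skip it
    return roles


mapping = map_description("Mom has lung cancer diagnosed by doctor last year")
for role, phrase in mapping:
    print(f"{role}: {phrase}")
```

Longest-match lookup keeps "lung cancer" together as one Topic instead of tagging "lung" and "cancer" separately. A real system would also need to handle punctuation, negation, and synonyms, which this sketch ignores.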
Conclusions
We developed an information model for a simple NLP algorithm to standardize phenotype variables.
[Figure: Data user workflow. Labels: Free text search; Structured (advanced) search; Unsorted, flat list results.]
Acknowledgements
University of California San Diego
Division of Biomedical Informatics
Lucila Ohno-Machado, MD, PhD
Wendy Chapman, PhD
Mike Conway, PhD
Jihoon Kim, MS
Mindy Ross, MD, MBA
Melissa Tharp, BS
Current and past PFINDR team members: Dr. Xiaoqian Jiang, Dr. Neda Alipanah, Stephanie Feudjio Feupe, Rebacca Walker, Asher Garland, Jing Zhang, Ustun Yildiz, Karen Truong, Vinay Venkatesh, Rafael Talavera
Collaborator:
Hua Xu, PhD (Vanderbilt University)