Standardizing Phenotype Variable in The Database of Genotypes and Phenotypes

Standardizing Phenotype Variables in the Database of Genotypes and Phenotypes (dbGaP) based on Information Models
Ko-Wei Lin, DVM, PhD

Alexander Hsieh, Seena Farzaneh, BS, Son Doan, PhD, Hyeoneui Kim, RN, MPH, PhD
Division of Biomedical Informatics, University of California, San Diego, La Jolla, CA
2013 AMIA Summit on Translational Bioinformatics, March 18, 2013
Overview
Introduction and Background
- Challenges in dbGaP database
- Objective of the study
Materials, Methods and Results
- Phase I: Test of CEM template models
- Phase II: Build information model
Conclusions
Current Status and Future Direction
dbGaP: database of Genotypes and Phenotypes
- Developed by the National Center for Biotechnology Information (NCBI)
- Data repository of studies such as Genome-Wide Association Studies (GWAS), allowing researchers to investigate associations between genotype and phenotype
- GWAS: examine common genetic variants in different individuals to see if any variant is associated with a trait
- Phenotypes: diseases, signs and symptoms, clinical attributes, etc.
- Currently hosts 400+ studies, 2,500+ datasets, 130,000+ phenotype variables
- Reuse of dbGaP data: promote research discovery, validate existing findings, reduce time and cost, advance translational medical research.
large portion of the clinical variables contained in dbGaP.
Introduction
With the advancements in genome-wide association studies (GWAS), public repositories of genotype and phenotype data, such as the database of Genotypes and Phenotypes (dbGaP), have become increasingly available online (1). The proper use or reuse of GWAS data could promote exploratory research, novel scientific discovery, validation of existing findings, and reduction of cost and time for research. However, data in such public repositories are not collected in a standardized or harmonized way, and hence it is challenging to reuse them. For example, as illustrated in Table 1, variables are often named without following a specific naming convention, or are labeled with abbreviated codes that do not convey specific meaning. Many of these variables are accompanied by variable descriptions that can help users understand what data the variable intends to represent. However, keyword searches applied to variable descriptions do not always provide accurate results due to many syntactic and lexical complexities associated with narrative text, such as use of negation and synonyms (3).
Challenges in current dbGaP
- Unstandardized representation of phenotype variables results in incomplete and inaccurate data retrieval.
- No specific naming convention
- No specific meaning in abbreviated codes
- Idiosyncratic ways of representing the variable height
Table 1. Phenotype variable height in dbGaP
Variable ID          Variable Name          Variable Description
phv00071000.v1       htcm                   Standing height at follow up visit
phv00165340.v1.p2    ESP_HEIGHT_BASELINE    Standing height in cm at baseline
phv00083471.v1.p2    lunghta4               HEIGHT (cm)
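The retrieval problem illustrated by Table 1 can be sketched in a few lines of Python. This is only an illustration: the three records are copied from Table 1, while the `keyword_search` helper and its case-insensitive substring matching are assumptions made here, not a description of how dbGaP search works.

```python
# Variable IDs, names, and descriptions copied from Table 1.
VARIABLES = [
    ("phv00071000.v1", "htcm", "Standing height at follow up visit"),
    ("phv00165340.v1.p2", "ESP_HEIGHT_BASELINE", "Standing height in cm at baseline"),
    ("phv00083471.v1.p2", "lunghta4", "HEIGHT (cm)"),
]


def keyword_search(keyword, field):
    """Return IDs of variables whose chosen field contains the keyword.

    field is 1 for the variable name and 2 for the variable description.
    Matching is a plain case-insensitive substring test (an assumption made
    for this illustration, not dbGaP's actual search behavior).
    """
    needle = keyword.lower()
    return [row[0] for row in VARIABLES if needle in row[field].lower()]


name_hits = keyword_search("height", field=1)
description_hits = keyword_search("height", field=2)

print(name_hits)         # only the variable whose name spells out "HEIGHT"
print(description_hits)  # all three height variables
```

Searching the names misses `htcm` and `lunghta4`, while searching the descriptions finds all three, which is consistent with the decision to focus on variable descriptions.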
Idiosyncrasies in variable names are a major challenge to utilizing the data included in dbGaP. Standardizing phenotype variables in a way that supports an accurate and complete search against dbGaP data is one of the main purposes of the Phenotype Finder IN Data Resources (PFINDR) program funded by the National Heart, Lung, and Blood Institute (NHLBI). As the first step towards standardizing the phenotype variables in dbGaP, we tested the feasibility of using an existing information model for clinical data, the Clinical Element Models (CEM) developed by GE
http://www.ncbi.nlm.nih.gov/gap
- Idiosyncrasies of phenotype variables make it difficult to identify relevant data with a sufficient level of accuracy.
- Standardization of phenotype variables is important.
- Focus on variable descriptions.
PFINDR program (Phenotype Finder IN Data Resources)

PhenDisco: Phenotype Discoverer
[Figure: PhenDisco data flow, comparing the original workflow (free-text search, unsorted flat-list results) with the new workflow. Labeled components: Data submitter; Study Description Annotator; Phenotype Variable Annotator; semiautomated standardization and annotation with feedback/confirmation; Demographics variables (DIVER); Other variables; Clustering; Query support (Query Parser, Structured Query Interface, Ranking Algorithms); Data user; Structured search; Ranked results/Relevance feedback.]
Objective
- Goal: Investigate an information-model-based approach to standardizing phenotype variables in dbGaP.
- Phase I: Test the feasibility of CEM template models to formally represent the phenotype variable descriptions in dbGaP.
- Phase II: Develop our own information models and apply them to variable standardization.
- The ultimate goal is to develop a Natural Language Processing (NLP) based system that algorithmically standardizes the phenotype variables in PhenDisco.
The Clinical Element Model (CEM)

- Developed by the GE Healthcare/Intermountain Healthcare Data Modeling and Terminology Team
- Supports sharing computable meaning during data exchange between different systems
- A logical structure for representing detailed clinical data models
CEM Template Models

- Serve as the basis for creating a CEM
- 6 domains: Diseases and Disorders, Procedures, Signs and Symptoms, Medications, Anatomical Sites, and Laboratory Tests
Signs and Symptoms CEM Template Model:
Attribute                    Value set or source
Alleviating_factor           UMLS relations {manages, treats, prevents}
associatedCode               SNOMED CT, UMLS CUI
Body_laterality              {superior, inferior, medial, lateral, distal, proximal, dorsal, ventral}
Body_location                UMLS relation {location_of}
Body_side                    {left, right, bilateral, unmarked}
Conditional                  {true, false}
Course                       {unmarked, changed, increased, decreased, improved, worsened, resolved}
Duration                     Temporal Link
End_time                     Temporal Link
Exacerbating_factor          UMLS relations {complicates, disrupts}
Generic                      {true, false}
Negation_indicator           {negationAbsent, negationPresent}
Relative_temporal_context    Temporal Link
Severity                     UMLS relation {degree_of}
Start_time                   Temporal Link
Subject                      {patient, familyMember, donorFamilyMember, donorOther, other}
Uncertainty_indicator        {indicatorPresent, indicatorAbsent}
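To make the template concrete, the enumerated attributes of the Signs and Symptoms template can be written as a small lookup structure with a validity check. This is a minimal sketch, not the CEM tooling itself: the value sets are copied from the slide, while the `invalid_attributes` helper and the example instance are hypothetical.

```python
# Enumerated value sets copied from the Signs and Symptoms template slide.
# Attributes bound to UMLS relations, codes, or temporal links are omitted
# because they cannot be checked against a fixed list.
SIGNS_AND_SYMPTOMS_VALUE_SETS = {
    "Body_laterality": {"superior", "inferior", "medial", "lateral",
                        "distal", "proximal", "dorsal", "ventral"},
    "Body_side": {"left", "right", "bilateral", "unmarked"},
    "Conditional": {"true", "false"},
    "Course": {"unmarked", "changed", "increased", "decreased",
               "improved", "worsened", "resolved"},
    "Generic": {"true", "false"},
    "Negation_indicator": {"negationAbsent", "negationPresent"},
    "Subject": {"patient", "familyMember", "donorFamilyMember",
                "donorOther", "other"},
    "Uncertainty_indicator": {"indicatorPresent", "indicatorAbsent"},
}


def invalid_attributes(instance):
    """Return the attributes whose value falls outside the template's value set."""
    return sorted(
        attr
        for attr, value in instance.items()
        if attr in SIGNS_AND_SYMPTOMS_VALUE_SETS
        and value not in SIGNS_AND_SYMPTOMS_VALUE_SETS[attr]
    )


# Hypothetical instance for a description such as "pain in left knee".
example = {
    "Body_side": "left",
    "Subject": "patient",
    "Negation_indicator": "negationAbsent",
}
print(invalid_attributes(example))                # no violations
print(invalid_attributes({"Body_side": "both"}))  # "both" is not in the value set
```

Even this reduced form carries eight enumerated attributes, most of which a typical one-line variable description never fills.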
Phase I: Representing phenotype variable descriptions in dbGaP using CEM template models
Material and Methods
1. Randomly retrieve 200 non-demographic phenotype variable descriptions from two phenotype data dictionaries in dbGaP.
2. Manually conduct the modeling using the six CEM template models.
Results
1. 115 unique variables
2. CEM template models represented 70% of the phenotype variable descriptions and are overly complex.
Mapping results
TABLE 4. RESULTS OF MAPPING PHENOTYPE NAMES TO CEM
(Exact, Broad, Narrow, and Related are the mapped categories, N=240.)
Phenotype category        Exact  Broad  Narrow  Related  Not mapped  Total
Diseases and Disorders        0    116       0        5           7    128
Procedures                    0      0       0        0           0      0
Signs and Symptoms            2     19       2        2          56     81
Medications                   0      0       0        0           0      0
Anatomical Sites              0      0       0        0           0      0
Labs                         20      2      44       10          21     97
Other                         3      6       2        7          32     50
Unknown                       0      0       0        0          23     23
Total number                 25    143      48       24         139    379
TABLE 5. CATEGORIES OF THE PHENOTYPE VARIABLE AND RELEVANT CEM TEMPLATE MODELS USED
Topics                                      Number of variables (%)   CEM template models used
Diseases and Disorders                      1 (0.87)                  Diseases and Disorders
Findings (excluding Disease or Disorder)    70 (60.87)                Signs and Symptoms
Medications                                 2 (1.74)                  Medication, Signs and Symptoms
Laboratory tests                            8 (6.96)                  Laboratory Tests, Signs and Symptoms
Not applicable                              30 (26.09)                ---
Unknown                                     4 (3.48)
Total number                                115
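The reported 70% coverage can be checked against the Table 5 counts. The sketch below assumes that "represented" means the variables in the four topic rows for which a CEM template model is listed; that reading is an assumption, not something the slides state.

```python
# Counts copied from Table 5 (topic -> number of variables).
TOPIC_COUNTS = {
    "Diseases and Disorders": 1,
    "Findings (excluding Disease or Disorder)": 70,
    "Medications": 2,
    "Laboratory tests": 8,
    "Not applicable": 30,
    "Unknown": 4,
}

# Assumption: these are the topics for which Table 5 lists a template model.
REPRESENTED_TOPICS = [
    "Diseases and Disorders",
    "Findings (excluding Disease or Disorder)",
    "Medications",
    "Laboratory tests",
]

total = sum(TOPIC_COUNTS.values())
represented = sum(TOPIC_COUNTS[topic] for topic in REPRESENTED_TOPICS)
coverage = 100 * represented / total

print(f"{represented}/{total} = {coverage:.1f}%")  # 81/115 = 70.4%
```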
Phase II Methods
[Flow diagram. Labeled elements: randomly select 300 variable descriptions; mapping; MetaMap; eHost; information model (semantic roles); test generalizability of the model; develop rules; algorithmic process.]
eHost: South BR et al. BioNLP 2012, pages 130-139. http://code.google.com/p/ehost/
Phase II Results
- Our information model was constructed with 10 semantic role classes.
No.  Semantic role class name    Examples
1    Topic                       Disease, Signs and symptoms
2    Subject of information      Patient, family members
3    Informer                    Doctor
4    Certainty                   Diagnosed, confirmed
5    Situational Context         While sleeping, after birth
6    Temporal modifier           Last month, since last visit
7    Extent modifier             Loudly, excessive
8    Health outcomes             Hospitalization
9    Body site                   Right leg, lower back
10   Quantity Qualifier          How many, count
- Our model fully represented the key concepts in the 600 phenotype variable descriptions.
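One way to hold the ten semantic role classes in code is an enumeration plus a small record type. This is a sketch under assumptions: the role names come from the table, whereas `AnnotatedDescription`, its `unfilled` method, and the example description are hypothetical and not part of the authors' system.

```python
from dataclasses import dataclass, field
from enum import Enum


class SemanticRole(Enum):
    """The ten semantic role classes listed in the Phase II results."""
    TOPIC = "Topic"
    SUBJECT_OF_INFORMATION = "Subject of information"
    INFORMER = "Informer"
    CERTAINTY = "Certainty"
    SITUATIONAL_CONTEXT = "Situational context"
    TEMPORAL_MODIFIER = "Temporal modifier"
    EXTENT_MODIFIER = "Extent modifier"
    HEALTH_OUTCOMES = "Health outcomes"
    BODY_SITE = "Body site"
    QUANTITY_QUALIFIER = "Quantity qualifier"


@dataclass
class AnnotatedDescription:
    """A variable description with the text span assigned to each role."""
    text: str
    roles: dict = field(default_factory=dict)  # SemanticRole -> text span

    def unfilled(self):
        """Return the roles that have no span in this description."""
        return [role for role in SemanticRole if role not in self.roles]


# Hypothetical description assembled from the example column of the table.
example = AnnotatedDescription(
    text="Excessive snoring while sleeping",
    roles={
        SemanticRole.TOPIC: "snoring",
        SemanticRole.EXTENT_MODIFIER: "excessive",
        SemanticRole.SITUATIONAL_CONTEXT: "while sleeping",
    },
)
print(len(SemanticRole), len(example.unfilled()))
```

Leaving unfilled roles out of the dictionary keeps each annotation as sparse as the description it represents.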
Mapping Example 1
Mom has lung cancer diagnosed by doctor last year
- Subject of information: Mom
- Informer: doctor
- Topic: lung cancer
- Certainty: diagnosed
- Temporal modifier: last year
- Not filled: Quantity qualifier, Body site, Health outcomes, Extent modifier, Situational context
Mapping Example 2
Minor pain in lower back after running
- Subject of information: Subject
- Body site: lower back
- Topic: pain
- Extent modifier: minor
- Situational context: after running
- Not filled: Quantity qualifier, Informer, Certainty, Health outcomes, Temporal modifier
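The two mapping examples can be reproduced with a toy lexicon lookup. This is not the authors' NLP system: the `LEXICON` entries are hand-written here to cover only these two sentences, and the `tag` function is a hypothetical helper. It also does not infer the implied "Subject" of the second example, since nothing in the text names it.

```python
import re

# Hand-written phrase -> role lexicon covering only the two slide examples.
LEXICON = {
    "mom": "Subject of information",
    "doctor": "Informer",
    "lung cancer": "Topic",
    "pain": "Topic",
    "diagnosed": "Certainty",
    "last year": "Temporal modifier",
    "lower back": "Body site",
    "minor": "Extent modifier",
    "after running": "Situational context",
}


def tag(description):
    """Return {role: matched phrase} using whole-word lexicon lookup."""
    lowered = description.lower()
    roles = {}
    for phrase, role in LEXICON.items():
        # \b keeps "pain" from matching inside a longer word such as "painting".
        if re.search(rf"\b{re.escape(phrase)}\b", lowered):
            roles[role] = phrase
    return roles


example_1 = tag("Mom has lung cancer diagnosed by doctor last year")
example_2 = tag("Minor pain in lower back after running")
print(example_1)
print(example_2)
```

A lexicon this small cannot generalize; it only shows the shape of the output that a rule-based tagger built on the information model would need to produce.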
Conclusions
- We developed an information model for a simple NLP algorithm to standardize phenotype variables.
- Our experience showed that direct analysis of the phenotype variable descriptions in dbGaP is an important component of developing a workable information model.
Current Status and Future Direction

- We have developed a system for tagging the phenotype variables with two main semantic roles, topic and subject of information; the system achieved 69% accuracy in semantic tagging.
- We plan to process all phenotype variables in dbGaP and add them to the pipeline. We will evaluate whether this improves the accuracy of phenotype queries in PhenDisco.
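The slides do not say how tagging accuracy was computed. A common reading, assumed here, is the share of predicted role labels that match a manually assigned reference. The function name and the four toy labels below are hypothetical and are not the study's data.

```python
def tagging_accuracy(predicted, gold):
    """Return the fraction of predicted labels that equal the gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy labels for illustration only.
gold = ["Topic", "Subject of information", "Topic", "Topic"]
predicted = ["Topic", "Topic", "Topic", "Subject of information"]
print(tagging_accuracy(predicted, gold))  # 0.5
```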
[Figure: PhenDisco data flow, showing the original and new workflows.]
Acknowledgements
University of California San Diego, Division of Biomedical Informatics
- Lucila Ohno-Machado, MD, PhD
- Wendy Chapman, PhD
- Mike Conway, PhD
- Jihoon Kim, MS
- Mindy Ross, MD, MBA
- Melissa Tharp, BS
Current and past PFINDR team members: Dr. Xiaoqian Jiang, Dr. Neda Alipanah, Stephanie Feudjio Feupe, Rebacca Walker, Asher Garland, Jing Zhang, Ustun Yildiz, Karen Truong, Vinay Venkatesh, Rafael Talavera
Collaborator: Hua Xu, PhD (Vanderbilt University)
Funding: NIH/NHLBI (National Heart, Lung, and Blood Institute) grant UH2HL108785
