
Biomedical and Clinical Natural Language Processing

Ozlem Uzuner
Meliha Yetisgen
Amber Stubbs

Welcome
Ozlem Uzuner
Associate Professor, University at Albany

Meliha Yetisgen
Assistant Professor, University of Washington

Amber Stubbs
Assistant Professor, Simmons College

Outline
Introduction to clinical and biomedical NLP
Research questions in clinical and biomedical NLP
Data and annotation processes
Methods
Open questions and future directions


Introduction to Clinical and Biomedical NLP

Clinical vs. Biomedical NLP


Clinical NLP
Focus on documentation related to patient care
Electronic Health Records
Both in-patient and out-patient documentation

Biomedical NLP
Focus on scientific discoveries about biology, physiology, and medicine
Journal articles, clinical trials, webpages, ...

Ever-growing Volumes of Data


Clinical text:
Millions of patients*
Percent of adults who had contact with a health care professional in the past year: 82.1%
Percent of children who had contact with a health care professional in the past year: 92.8%

Number of visits (to physician offices, hospital outpatient and emergency departments): 1.2 billion (actual number reported by CDC in 2010)
Hospital inpatient care
Number of discharges: 35.1 million
Discharges per 10,000 population: 1,139.6
Average length of stay in days: 4.8

Biomedical text:
PubMed contains more than 23 million biomedical articles from MEDLINE, life science journals, and online books
~500,000 new records are added each year
13.1 million abstracts, and 14.2 million full-text

*http://www.cdc.gov/nchs/fastats/physician_visits.htm

Text Processing Needs in Clinical Practice

Cohort selection, phenotyping:
Finding groups of patients that satisfy particular criteria

Decision support systems:
Is there a common practice for treating the disease I am observing? What were the rates of success and what is the best practice?

Quality assurance in the hospital:
Who saw these patients?

Text Processing Needs in Biomedical Domain

Cohort selection, phenotyping:
Finding groups of patients that satisfy particular criteria

Decision support systems:
Is there a common practice for treating the disease I am observing? What were the rates of success and what is the best practice?

Quality assurance in the hospital:
Who saw these patients?

Clinical and Biomedical NLP


Two closely-related complementary domains
Focus of this tutorial on clinical NLP

Clinical Documents
HISTORY OF PRESENT ILLNESS: Mrs. [Huntington] is a
77-year-old-woman with long standing hypertension
who presented as a Walk-in to me at the [Bronx]
Health Center on [DATE]. Recently had been started
q.o.d. on Clonidine since [DATE] to taper off of the
drug. Was told to start Zestril 20 mg. q.d. again. The
patient was sent to the Emergency Unit for direct
admission for cardioversion and anticoagulation, with
the Cardiologist, Dr. [Swasissz] to follow.
SOCIAL HISTORY: Lives alone, has one daughter living
in [Spring]. Is a non-smoker, and does not drink
alcohol.
HOSPITAL COURSE AND TREATMENT: During
admission, the patient was seen by Cardiology, Dr.
[Tylenol], was started on IV Heparin, Sotalol 40 mg PO
b.i.d. increased to 80 mg b.i.d., and had an
echocardiogram. By [DATE] the patient had better
rate control and blood pressure control but remained
in atrial fibrillation. On [DATE], the patient was felt to
be medically stable

Clinical Documents
The patient is a 46 year old woman with a history of Q
wave myocardial infarction with right ventricular
infarct in October 1992. Peak CK's were 2300.
Catheterization showed 100% RCA lesion which was
treated with angioplasty reduced to 20-30% stenosis.
Subsequent catheterization October 92 , July 92 and
September 92 for atypical chest pain , showed clean
coronaries. Exercise tread mill test in September 92 ,
the patient went three minutes and 31 seconds with
standard Bruce protocol and stopped secondary to
atypical chest pain. Maximum heart rate 162 , blood
pressure 176/90 , no ST or T wave changes. In April
92 she ruled out for myocardial infarction by enzymes
and EKG , after presenting with prolonged chest pain.
VQ scan was low probability. Chest CT ruled out
aortic dissection. The patient now presents to the
hospital with 24 hours of right sided chest pain ,
stating that it was squeezing in her right breast , felt
to be between the shoulder blades. She complained
of shortness of breath , dizziness , weakness and
nausea , no palpitations were noted

Clinical Documents
Pt recently hospitalized 7/19/06 for chf exacerbation
( diastolic dysfunction ) 2nd to dietary and medicine
noncompliance ( salty foods , stopped her HCTZ ) and
continued to smoke. Pt diuresed and sent home on new
lasix 60qam 40qpm regimen. Pt noticed steady decline in
functional status during the last 3 weeks because of SOB.
at baseline should sat 85% on ra , 95% on 6L02NC at rest
and ambulation. ( on home o2 ) but now , can't ambulate ,
satting 83-89% on 6l at rest. also notes pnd , orthopnea. Pt
notes intermittent chest pain on and off lasting 5 minutes
not associated with exertion or any other cardiac sx. 8/15
dobuta mibi-> ischemia in d1 territory. 11/19 :echo->ef
60% , Pa pressure 48 + RA. no valve dz. rv enlarged and
hypokinetic. A/P: pump: decompesated CHF ( diastolic
dysfxn , ? cor pulmonale component ) 2nd to diet/med
non-compliance. uptitrate captopril , continue iv lasix 60
qd with goal net neg 2 liters , daily weights , strict I and
O. check cxray. Switched to po lasix 10/06 , back to
lisinopril for d/c Fri. ischemia: has + mibi in past , but no
further workup to d1 lesion. can't get ecasa 2nd to vWD.
continue BB , will hold off on statin since not
hyperlipidemic. rate:tele.

Clinical Language
Domain-specific, jargon, idioms
Telegraphic, with misspellings, incomplete sentences
Speculations, hypotheses, and negations
Some structure
*Slide courtesy of Brett South. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.

Clinical Language
Linguistic variation

Derivation
mediastinal = mediastinum

Inflection
opacity = opacities; cough = coughed

Synonymy
Addison's Disease: Addison melanoderma, adrenal insufficiency, adrenocortical insufficiency, asthenia pigmentosa, bronzed disease, melasma addisonii, ...
Chest wall tenderness: chest wall did demonstrate some slight tenderness when the patient had pressure applied to the right side of the thoracic cage

*Slide courtesy of Brett South. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.

Clinical Language
Polysemy

General polysemy

Patient was prescribed codeine upon discharge
The discharge was yellow and purulent

Acronyms and Abbreviations
APC: activated protein c, adenomatosis polyposis coli, adenomatous polyposis coli, antigen presenting cell, aerobic plate count, advanced pancreatic cancer, age period cohort, alfalfa protein concentrated, allophycocyanin, anaphase promoting complex, anoxic preconditioning, anterior piriform cortex, antibody producing cells, atrial premature complex, ...

*Slide courtesy of Brett South. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.

Clinical Language
Negation and uncertainty
Approximately half of all clinical concepts in dictated reports are negated*

Explicit negation
The mediastinum is not widened
Mediastinal widening: absent

Implicit negation
Lungs are clear upon auscultation
Rales/crackles: absent
Rhonchi: absent
Wheezing: absent

Uncertainty
Treated for a presumptive sinusitis

*Slide courtesy of Brett South. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.
*Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. Evaluation of negation phrases in narrative clinical reports. Proc AMIA Symp. 2001:105-9.
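The distinction between explicit negation, implicit negation, and uncertainty can be illustrated with a minimal trigger-word sketch. The trigger lists and the function name `assertion_status` are assumptions made for illustration; they are not taken from the tutorial or from any published lexicon.

```python
import re

# Hypothetical, deliberately tiny trigger lists; real systems use far larger lexicons.
NEGATION_TRIGGERS = ["no", "not", "denies", "denied", "without", "absent"]
UNCERTAINTY_TRIGGERS = ["presumptive", "probable", "probably", "possible", "suggests"]


def _has_trigger(sentence, triggers):
    """Return True if any trigger occurs as a whole word in the sentence."""
    pattern = r"\b(" + "|".join(map(re.escape, triggers)) + r")\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


def assertion_status(sentence):
    """Label a sentence as 'negated', 'uncertain', or 'asserted'."""
    if _has_trigger(sentence, NEGATION_TRIGGERS):
        return "negated"
    if _has_trigger(sentence, UNCERTAINTY_TRIGGERS):
        return "uncertain"
    return "asserted"


for example in (
    "The mediastinum is not widened",
    "Treated for a presumptive sinusitis",
    "Lungs are clear upon auscultation",
):
    print(example, "->", assertion_status(example))
```

The third sentence comes back as "asserted": a surface trigger list has no way to know that clear lungs imply absent rales, rhonchi, and wheezing, which is why implicit negation needs inference rather than pattern matching.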

Clinical Language
Hypotheses
It was felt that the patient probably had a cerebrovascular accident involving the left side of the brain. Other differentials entertained were perhaps seizure and the patient being post-ictal when he was found, although this consideration is less likely.
R/O out pneumonia.
*Slide courtesy of Brett South. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.

Clinical Language
Implica0on

Audience for patient reports is physicians
That puts lay people at a disadvantage when interpreting these records.

Requires inference

Sentence level inference
There were hazy opacities in the lower lobes -> Localized infiltrate

Report level inference
Localized infiltrates -> Probable pneumonia

*Slide courtesy of Brett South. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.

Clinical Language
More inference
Fever
Temperature 38.5C

Oxygen desaturation
Oxygen saturation low
Oxygen saturation 85% on room air

*Slide adapted from Brett South. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.
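Inferring a finding from a reported value, as in the two examples above, can be sketched as a value extraction step followed by a threshold test. Both cut-offs below (38.0 C for fever, 90% for low saturation) and the function name `infer_findings` are assumptions chosen for illustration; the slides do not state thresholds.

```python
import re

FEVER_THRESHOLD_C = 38.0      # assumed cut-off, not from the slides
LOW_SATURATION_PERCENT = 90   # assumed cut-off, not from the slides


def infer_findings(text):
    """Map numeric measurements mentioned in the text to the findings they imply."""
    findings = []
    temp = re.search(r"temperature\s+(\d+(?:\.\d+)?)\s*C\b", text, re.IGNORECASE)
    if temp and float(temp.group(1)) >= FEVER_THRESHOLD_C:
        findings.append("fever")
    sat = re.search(r"oxygen saturation\s+(\d+)\s*%", text, re.IGNORECASE)
    if sat and int(sat.group(1)) < LOW_SATURATION_PERCENT:
        findings.append("oxygen desaturation")
    return findings


print(infer_findings("Temperature 38.5C. Oxygen saturation 85% on room air."))
```

The qualitative mention "Oxygen saturation low" contains no number, so this sketch would miss it; covering it would need a separate lexical rule.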

Clinical Language
Temporality

Past medical history
History of CHF presenting with shortness of left-sided chest pain.

Hypothetical or non-specific mentions
He should return for fever or increased shortness of breath.

Temporal course of disease
Patient presents with chest pain
After administration of nitroglycerin, the chest pain resolved.

*Slide adapted from Brett South. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.

Clinical Language
Report structure
Anatomic Location sometimes in section header
NECK: no adenopathy.

Some sections carry more weight
IMPRESSION: atelectasis

Some reports contain pasted (or templated) text difficult to process
Cardiovascular: [ ] Angina [ ] MI [x ] HTN [ ] CHF [ ] PVD [ ] DVT [ ] Arrhythmias [ ] Previous PTCA [ ] Previous Cardiac Surgery [ ] Negative - Denies CV problems

*Slide courtesy of Brett South. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.
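Templated text like the cardiovascular checklist above defeats sentence-oriented tools, but its regularity can be exploited directly. The following is a minimal sketch, assuming the checkbox always precedes its label and that labels never contain a "[" character; `parse_checkboxes` is a hypothetical helper name.

```python
import re

# Matches a checkbox such as "[ ]" or "[x ]" followed by its label,
# where the label runs until the next "[" or the end of the text.
CHECKBOX = re.compile(r"\[\s*([xX]?)\s*\]\s*([^\[]+)")


def parse_checkboxes(text):
    """Return a dict mapping each label to True (checked) or False (unchecked)."""
    return {label.strip(): mark != "" for mark, label in CHECKBOX.findall(text)}


template = "Cardiovascular: [ ] Angina [ ] MI [x ] HTN [ ] CHF [ ] PVD [ ] DVT"
status = parse_checkboxes(template)
print([label for label, checked in status.items() if checked])
```

Only HTN is checked here, so every other label is an unasserted item rather than a finding. A keyword search over the raw text would wrongly report angina, MI, and CHF as mentioned.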

Biomedical Documents
Grows in size at a dramatic pace

2014 Medline baseline contains:
22,376,811 citations
13,214,810 abstracts

Biomedical Language
Contains domain specific rich and evolving vocabulary
Concepts introduced when new discoveries are presented

Very structured
Journal abstracts and articles usually follow similar section structure

Sentences are very grammatical but include highly ambiguous terms
Neurofibromatosis 2 [disease]
Neurofibromin 2 [protein]
Neurofibromatosis 2 gene [gene]


Research Problems in Clinical and Biomedical NLP

Clinical and Biomedical Applications

Applications focus on making the information hidden in the vast amounts of data available in biomedical literature and clinical records more accessible, in order to improve biomedical research and patient care.

Sample Clinical NLP Application

Does the patient drink?
Classify patient into 3 classes: Heavy consumption, Moderate consumption, None

Application:
Retrospective cohort study of high B12 levels as ICU mortality predictor

Hypotheses:
High B12 levels are associated with liver function
Alcohol consumption impacts liver

Data:
The Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-II) database
B12 measurements for ~2,000 adult patients

Structured data:
ICD9 codes of alcohol-based illness, e.g., delirium tremens (291*)

*Slide courtesy of Dina Demner-Fushman. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.
Callaghan F.M., Leishear K., Abhyankar S., Demner-Fushman D., McDonald C.J. (2014) High vitamin B12 levels are not associated with increased mortality risk for ICU patients after adjusting for liver function: a cohort study. ESPEN J. 2014 Apr 1;9(2):e76-e83.

Clinical NLP Application

Search for alcohol OR drink OR ...

Retrieval results:
Yes: The patient is known to have a history of alcohol abuse
Maybe: tox screen only significant for alcohol level of 273
No: She denied any intravenous drug use, tobacco, or alcohol
Maybe: He was counseled against the use of alcohol
Maybe: He stated that he would not drink alcohol in the future
No: Major Surgical or Invasive Procedure: Alcohol septal ablation.
Yes: Quit smoking 'many' yrs ago. smoked 1 ppweek. used to drink 10-12 hard drinks every other day. last drink before goig to OSH. resolved to quit now. no IVDU.
No: her husband is a heavy drinker and often isolates himself to drink alone
No: the patient was seen to drink a tremendous amount of water

*Slide courtesy of Dina Demner-Fushman. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.
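The weakness of plain keyword retrieval can be reproduced in a few lines. This sketch runs the two-term query over four of the sentences above, paired with their labels from the slide. Prefix matching (so that "drinker" also hits) and the name `keyword_hits` are assumptions made for illustration.

```python
import re

# Keyword query from the slide: alcohol OR drink (prefix matching is an assumption).
QUERY = re.compile(r"\b(alcohol|drink)\w*", re.IGNORECASE)

# (label from the slide, sentence)
SENTENCES = [
    ("Yes", "The patient is known to have a history of alcohol abuse"),
    ("No", "She denied any intravenous drug use, tobacco, or alcohol"),
    ("No", "Major Surgical or Invasive Procedure: Alcohol septal ablation."),
    ("No", "the patient was seen to drink a tremendous amount of water"),
]


def keyword_hits(sentences):
    """Return the (label, sentence) pairs whose text matches the keyword query."""
    return [(label, s) for label, s in sentences if QUERY.search(s)]


hits = keyword_hits(SENTENCES)
relevant = sum(1 for label, _ in hits if label == "Yes")
print(f"{len(hits)} sentences retrieved, {relevant} actually indicate alcohol use")
```

All four sentences are retrieved although only one is labeled "Yes". The other three illustrate negation, a non-beverage sense of "alcohol", and a different object of "drink".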

Clinical NLP Application

Is alcohol used in its beverage sense?
Polysemy (concept extraction and word sense disambiguation)
What are synonyms, hypernyms, and hyponyms of alcohol?
Synonymy (concept extraction)
Does the patient drink alcohol or not?
Negation
Is alcohol use asserted, equivocal, modal or hypothetical?
Uncertainty
Who is drinking?
Experiencer, relation extraction
What is the patient drinking?
Relation, event extraction
Is the patient drinking currently and regularly?
Temporal relations, timeline generation

*Slide courtesy of Dina Demner-Fushman. D'Avolio L.W., Demner-Fushman D., South B.R. (2013). Tutorial on An Introduction to Clinical Natural Language Processing, Fall Symposium of AMIA, 2013.

Clinical NLP Research Questions

Open questions at all levels of NLP research.

Low-level tasks
Sentence boundary detection
Tokenization
Part-of-speech tagging
Stemming
Shallow parsing (chunking)
Text segmentation

High-level tasks
Spelling/grammatical error identification and recovery
Named entity recognition and information extraction
Word sense disambiguation
Negation and uncertainty
Relationship extraction
Temporal reasoning

Nadkarni P.M., Ohno-Machado L., Chapman W.W. (2011). Natural language processing: an introduction. Journal of the American Medical Informatics Association 2011 Sep-Oct; 18(5): 544-551. doi: 10.1136/amiajnl-2011-00046

Low-level Clinical NLP Research Questions

Sentence boundary detection: complicated by
abbreviations and titles, e.g., m.g., Dr.
lists and templates, e.g., MI [x], SOB[]

Tokenization: complicated by
characters typically used as token boundaries, e.g., 10 mg/day, N-acetylcysteine.

Morphological decomposition: complicated by
compound words, e.g., nasogastric

Text segmentation: complicated by
problem-specific needs, e.g., sections, including Chief Complaint, Past Medical History, HEENT, etc.

In general, systems developed for non-clinical text often work less well on clinical narratives.

Nadkarni P.M., Ohno-Machado L., Chapman W.W. (2011). Natural language processing: an introduction. Journal of the American Medical Informatics Association 2011 Sep-Oct; 18(5): 544-551. doi: 10.1136/amiajnl-2011-00046
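The effect of abbreviations on sentence boundary detection can be shown by comparing a naive period-based splitter with one that consults an abbreviation list. This is a minimal sketch: the abbreviation set is a hypothetical handful drawn from the sample notes, and both function names are illustrative.

```python
import re

# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"dr.", "mg.", "q.d.", "q.o.d.", "b.i.d."}


def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]


def abbreviation_aware_split(text):
    """Split after a period unless the token carrying it is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


note = ("Was told to start Zestril 20 mg. q.d. again. "
        "The patient was sent to the Emergency Unit.")
print(len(naive_split(note)), "pieces from the naive splitter")
print(len(abbreviation_aware_split(note)), "sentences with the abbreviation list")
```

The naive splitter breaks the first sentence at "mg." and "q.d.", while the list-based version keeps it whole. The list-based version has its own failure mode: it never ends a sentence on an abbreviation, which is common in telegraphic notes.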

High-level Clinical NLP Research Questions

Spelling/grammatical error identification and recovery: complicated by
highly synthetic phrases
incorrectly used homophones, e.g., sole/soul, their/there

Named entity recognition (NER): complicated by
Word/phrase order variation:
perforated duodenal ulcer vs. duodenal ulcer, perforated
chest tenderness vs. chest wall shows slight tenderness on pressure
Derivation:
Mediastinum vs. mediastinal
Inflection:
opacity vs. opacities
Synonymy:
Addison's disease vs. adrenocortical insufficiency
Acronyms:
APC vs. activated protein C vs. adenomatous polyposis coli vs. ...

Nadkarni P.M., Ohno-Machado L., Chapman W.W. (2011). Natural language processing: an introduction. Journal of the American Medical Informatics Association 2011 Sep-Oct; 18(5): 544-551. doi: 10.1136/amiajnl-2011-00046

High-level Clinical NLP Research Questions

Negation and uncertainty identification: complicated by
Explicit and implicit assertions, e.g., Patient denies chest pain, Lungs are clear upon auscultation
Uncertainty, e.g., the ill-defined density suggests pneumonia.
Uncertainty in the reasoning processes, e.g., The patient probably has a left-sided cerebrovascular accident; postconvulsive state is less likely.

Nadkarni P.M., Ohno-Machado L., Chapman W.W. (2011). Natural language processing: an introduction. Journal of the American Medical Informatics Association 2011 Sep-Oct; 18(5): 544-551. doi: 10.1136/amiajnl-2011-00046

High-level Clinical NLP Research Questions

Relationships are complicated by
The nature of the entities, which affects:
Relationship extraction, e.g., causal relations
Coreference resolution
Temporal reasoning

Nadkarni P.M., Ohno-Machado L., Chapman W.W. (2011). Natural language processing: an introduction. Journal of the American Medical Informatics Association 2011 Sep-Oct; 18(5): 544-551. doi: 10.1136/amiajnl-2011-00046


Datasets and the Annotation Processes

Biomedical and Clinical Corpora

There are various biomedical corpora annotated for syntax and semantics

MedTag: A collection of biomedical annotations (MEDLINE abstracts): the AbGene corpus of annotated sentences of genes and protein named entities, the MedPost corpus of part of speech tagged sentences and the GENETAG corpus for named entity identification used for BioCreAtIvE I.
TREC Genomics Track: A set of data collections provided by TREC Genomics Track useful for development and evaluation of retrieval and text categorization strategies in the biomedical domain.
BioCreative corpus: Dataset produced by the BioCreative assessment, text passages relevant for GO annotations of human proteins.
GENIA corpus: Annotated corpus of literature related to the MeSH terms: Human, Blood Cells, and Transcription Factors.
Yapex corpus: Training and test data for the protein tagger (NER) YAPEX.
PASBio: Predicate-argument structures of biomedical literature.
LLL05 dataset: Genic Interaction Extraction Challenge: protein/gene interactions IE data set
IEPA corpus: The Interaction Extraction Performance Assessment corpus
BioText Data: Dataset for extraction of disease/treatment entities relations
BioText NC Semantics Dataset: Dataset of Noun Compound Semantics used in experiments described in articles
PennBioIE: UPenn Biomedical Information Extraction datasets of annotated PubMed abstracts: CYP450 domain and oncology domain
Medstract corpus: Biomedical annotation corpus useful for acronym definition and coreference resolution
OHSUMED text collection: Document collection used for the TREC-9 contest.
BMC corpus: Open access corpus of full text articles provided by BioMed Central.
FetchProt corpus: Full text journal articles from the biological domain analyzed for experiments on proteins.
PDG Bio-sentence splitter corpus: Small collection of text data sets derived from PubMed abstracts to develop and assess sentence splitting tools.
Bio1 corpus: annotated corpus, same field as GENIA, but annotated to small top-level ontology.

9 years of i2b2 NLP Challenges

All data available at http://www.i2b2.org/NLP/DataSets with a data use agreement

i2b2 NLP Shared-task Challenges

Goals:
Understanding the key information in narrative patient records
Through extraction and classification tasks
To enable phenotype extraction from narrative patient records

Run as shared-tasks
Training data made available
Testing set held out for evaluation
Evaluation performed by i2b2

i2b2 NLP Datasets, from i2b2 Challenges

Task | Classification (C) or Extraction (E) | Data Sources | Annotation Protocol | Training/Testing Records
De-identification | (not shown) | Partners Healthcare (PH) | Standard | 669 / 220
Smoking Status | (not shown) | PH | Standard by experts | 398 / 104
Obesity Diagnosis | (not shown) | PH | Standard by experts | 730 / 507
Medication Extraction | E & C | PH | Community annotation | 17 / 251
Concepts, Assertions, and Relations | E & C | PH, MIMIC II, University of Pittsburgh Medical Center (UPMC) | Standard, then machine validation | 349 / 477
Coreference Resolution | E & C | PH, MIMIC II, UPMC | Standard, then machine validation | 492 / 322
Temporal Relations | E & C | PH, MIMIC II | Standard | 190 / 120
Heart disease risk factors | (not shown) | PH Longitudinal records | (not shown) | (not shown)


Challenge 2006 De-identification

Automatic de-identification
Patient,
Doctor,
Location,
Hospital,
Date,
...

Uzuner Ö., Luo Y., Szolovits P. (2007). Evaluating the State-of-the-Art in Automatic De-identification. Journal of the American Medical Informatics Association. September 2007;14(5):550-563.

Discharge Summaries

HISTORY OF PRESENT ILLNESS: Mrs. [Huntington] is a 77-year-old-woman with
long standing hypertension who presented as a Walk-in to me at the [Bronx]
Health Center on [DATE]. Recently had been started q.o.d. on Clonidine
since [DATE] to taper off of the drug. Was told to start Zestril 20 mg. q.d.
again. The patient was sent to the Emergency Unit for direct admission for
cardioversion and anticoagulation, with the Cardiologist, Dr. [Swasissz] to
follow.
SOCIAL HISTORY: Lives alone, has one daughter living in [Spring]. Is a non-smoker,
and does not drink alcohol.
HOSPITAL COURSE AND TREATMENT: During admission, the patient was seen by
Cardiology, Dr. [Tylenol], was started on IV Heparin, Sotalol 40 mg PO b.i.d.
increased to 80 mg b.i.d., and had an echocardiogram. By [DATE] the patient
had better rate control and blood pressure control but remained in atrial
fibrillation. On [DATE], the patient was felt to be medically stable

Callouts on the slide:
Mrs. [Huntington]: What if patient had Huntington's disease?
Dr. [Swasissz]: Misspelled or foreign name?
Dr. [Tylenol]: What if the patient has to take Tylenol?
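The questions raised on this slide suggest why de-identification cannot rely on dictionaries alone: the same string can be a name in one context and a disease or drug in another. One way to use context is to key on titles, as in the minimal sketch below. The two patterns, the placeholder format, and the function name `deidentify` are assumptions for illustration, and the test sentence is invented to combine both ambiguities.

```python
import re

# Context patterns: a title such as "Mrs." or "Dr." signals that the next
# capitalized token is a name, whatever that token looks like on its own.
PATIENT = re.compile(r"\b(Mr|Mrs|Ms)\.\s+([A-Z][a-z]+)")
DOCTOR = re.compile(r"\bDr\.\s+([A-Z][a-z]+)")


def deidentify(text):
    """Replace names that follow a title with category placeholders."""
    text = PATIENT.sub(lambda m: m.group(1) + ". [PATIENT]", text)
    text = DOCTOR.sub("Dr. [DOCTOR]", text)
    return text


# Invented sentence: the same strings appear once as names and once as medical terms.
sample = ("Mrs. Huntington was seen by Dr. Tylenol and was given "
          "Tylenol for Huntington's disease.")
print(deidentify(sample))
```

Only the title-marked occurrences are replaced, so the drug and the disease stay intact. Names that appear without a title, locations, hospitals, and dates are all outside what this sketch handles.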


Challenge 2006
Smoking Status
Document classification into the following classes:

Smoker
Current smoker
Past smoker
Non-smoker

Uzuner Ö., Goldstein I., Luo Y., Kohane I. (2008). Identifying Patient Smoking Status from Medical Discharge Records. Journal of the American Medical Informatics Association. January 2008;15(1):14-24.

Smoking Status Sample Input/Output



HISTORY OF PRESENT ILLNESS: Mrs. [Huntington] is a 77-year-old-woman with
long standing hypertension who presented as a Walk-in to me at the [Bronx]
Health Center on [DATE]. Recently had been started q.o.d. on Clonidine
since [DATE] to taper off of the drug. Was told to start Zestril 20 mg. q.d.
again. The patient was sent to the Emergency Unit for direct admission for
cardioversion and anticoagulation, with the Cardiologist, Dr. [Swasissz] to
follow.
SOCIAL HISTORY: Lives alone, has one daughter living in [Spring]. Is a non-smoker,
and does not drink alcohol.
HOSPITAL COURSE AND TREATMENT: During admission, the patient was seen by
Cardiology, Dr. [Tylenol], was started on IV Heparin, Sotalol 40 mg PO b.i.d.
increased to 80 mg b.i.d., and had an echocardiogram. By [DATE] the patient
had better rate control and blood pressure control but remained in atrial
fibrillation. On [DATE], the patient was felt to be medically stable
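Document-level smoking status classification can be approximated with ordered rules, most specific first. This is a rule-based sketch under stated assumptions: the phrase lists are hypothetical, returning `None` when no evidence is found is a choice made here, and none of it reflects how challenge participants actually built their systems.

```python
import re

# Ordered rules: the first matching pattern decides the class.
# "Non-smoker" must precede "Smoker" because "non-smoker" contains "smoker".
RULES = [
    ("Non-smoker", re.compile(r"\b(non-?smoker|never smoked|does not smoke)\b", re.I)),
    ("Past smoker", re.compile(r"\b(quit smoking|former smoker|ex-smoker)\b", re.I)),
    ("Current smoker", re.compile(r"\b(continue[sd]? to smoke|current smoker|still smokes)\b", re.I)),
    ("Smoker", re.compile(r"\b(smoker|smokes|smoking)\b", re.I)),
]


def smoking_status(record):
    """Return the first matching class label, or None if nothing matches."""
    for label, pattern in RULES:
        if pattern.search(record):
            return label
    return None


print(smoking_status("SOCIAL HISTORY: Lives alone. Is a non-smoker, and does not drink alcohol."))
```

For the sample record above the sketch returns "Non-smoker", driven entirely by the social history sentence. Records with conflicting evidence in different sections would need more than first-match rules.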


Challenge 2008
Obesity Diagnosis
Obesity and 15 of its co-morbidities
Asthma, atherosclerotic cardiovascular disease (CAD), congestive heart failure (CHF), depression, diabetes mellitus (DM), gallstones / cholecystectomy, gastroesophageal reflux disease (GERD), gout, hypercholesterolemia, hypertension (HTN), hypertriglyceridemia, obstructive sleep apnea (OSA), osteoarthritis (OA), peripheral vascular disease (PVD), and venous insufficiency

To be classified on a record level
Present (Y): the patient has the disease
Absent (N): the patient does not have the disease
Questionable (Q): the patient may have the disease
Unmentioned (U): the disease is not mentioned in the record


Challenge 2009
Medication extraction
Medications and information related to them from medical discharge summaries
Medication names
Dosages
Modes
Frequencies
Durations
Reasons
List/narrative
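Several of the fields listed above (name, dosage, mode, frequency) often appear in a fixed order in discharge summaries, which a single pattern can capture. The sketch below is a hypothetical illustration: the regular expression, its short route and frequency vocabularies, and the function name `extract_medications` are all assumptions, and durations, reasons, and the list/narrative distinction are out of its reach.

```python
import re

# Hypothetical pattern: capitalized drug name, numeric dose in mg,
# optional route (mode), optional frequency abbreviation.
MEDICATION = re.compile(
    r"(?P<name>[A-Z][a-z]+)\s+"
    r"(?P<dosage>\d+\s*mg)\.?"
    r"(?:\s+(?P<mode>PO|IV))?"
    r"(?:\s+(?P<frequency>q\.o\.d\.|q\.d\.|b\.i\.d\.))?"
)


def extract_medications(text):
    """Return one dict of fields per medication mention found by the pattern."""
    return [m.groupdict() for m in MEDICATION.finditer(text)]


note = "was started on IV Heparin, Sotalol 40 mg PO b.i.d. increased to 80 mg b.i.d."
for medication in extract_medications(note):
    print(medication)
```

On this sentence the pattern finds Sotalol with its dosage, mode, and frequency. It misses Heparin, whose route precedes the name and which has no dose, and it misses the later increase to 80 mg. That kind of variation is what makes the task harder than it first appears.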

Medication extraction

Extraction task
Medications and information related to them

Classification task
Whether a piece of information is related to a medication

Uzuner Ö., Solti I., Cadag E. (2010). Extracting Medication Information from Clinical Text. Journal of the American Medical Informatics Association. 2010;17:514-518 doi:10.1136/jamia.2010.003947.
Uzuner Ö., Solti I., Xia F., Cadag E. (2010). Community Annotation Experiment for Ground Truth Generation for the i2b2 Medication Challenge. Journal of the American Medical Informatics Association. 2010;17:519-523 doi:10.1136/jamia.2010.004200.


Challenge 2010
Relation Extraction
Three tier task
Clinical concepts
Assertions on concepts
Relations of concepts


Challenge 2014 Heart Disease Risk Factors

Four separate tracks under one title
Heart disease risk factors
De-identification
Software evaluation
Novel data use

Aims

Use annotation to answer a complex clinical question:
How does heart disease (specifically coronary artery disease) progress in diabetic patients?
Question developed in consultation with i2b2 MD/PhDs who study diabetes and cardiology

De-identify a new corpus of clinical narratives

Corpus selection

297 patients
2-5 records per patient
1304 records total

Postdoc MD selected records

Cohort selection
All patients diagnosed with diabetes
1/3 start with CAD
1/3 develop CAD over course of records
1/3 never develop CAD

Task description for heart disease track

Identify risk factors for CAD in diabetic patients:
hyperlipidemia/hypercholesterolemia,
hypertension,
obesity,
a family history of premature CAD, and
being a smoker

Categorize temporality of risk factors
Present before, during, after, or some combinations of those relative to date of record (document creation time, DCT)

Gold standard creation

All annotators with medical backgrounds
1 medical doctor
6 registered nurses
1 medical assistant

Each file triple-annotated
Gold standard: any tags appearing in 2/3 or more of the annotations
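The "2/3 or more" rule amounts to a per-tag vote across the three annotations of a file. A minimal sketch, assuming each annotation is represented as a set of tag strings (the real annotations carry more structure than that) and using the hypothetical function name `gold_standard`:

```python
from collections import Counter


def gold_standard(annotations):
    """Keep every tag that appears in at least two thirds of the annotations."""
    # Count each tag once per annotator, even if that annotator repeated it.
    counts = Counter(tag for tags in annotations for tag in set(tags))
    # Integer comparison of n / len(annotations) >= 2/3 avoids float rounding.
    return {tag for tag, n in counts.items() if 3 * n >= 2 * len(annotations)}


triple_annotated = [
    {"hypertension", "obesity"},
    {"hypertension"},
    {"hypertension", "smoker", "obesity"},
]
print(sorted(gold_standard(triple_annotated)))
```

With three annotators the rule keeps any tag that at least two of them marked, so a tag proposed by a single annotator ("smoker" here) is dropped.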

Training/testing

60-40 split between training and testing
790 records in the training set
514 records in the testing set
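The record counts can be checked against the corpus total and the stated split with a few lines of arithmetic. The constants are the figures from the slides; the helper name `split_fractions` is illustrative.

```python
TRAIN_RECORDS = 790
TEST_RECORDS = 514
TOTAL_RECORDS = 1304  # from the corpus selection slide


def split_fractions(train, test):
    """Return the training and testing shares of the combined record count."""
    total = train + test
    return train / total, test / total


train_share, test_share = split_fractions(TRAIN_RECORDS, TEST_RECORDS)
print(TRAIN_RECORDS + TEST_RECORDS == TOTAL_RECORDS)
print(f"training {train_share:.1%}, testing {test_share:.1%}")
```

The two sets add up to the 1304 records of the corpus, and the shares come to roughly 60.6% and 39.4%, consistent with the rounded 60-40 description.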

How to obtain the data?

Data available with a use agreement
Download data use agreement from
http://www.i2b2.org/NLP/DataSets


Methods

Depending on the problem, a wide range of methods is applied:
Rule based
Statistical
Hybrid

Example clinical projects from UW-BioNLP lab*
Extracting structure and semantics from clinical text
Section segmentation
Assertion analysis

Clinical applications
Phenotype modeling in the ICU
Pneumonia predictor
Acute lung injury predictor

Information extraction from radiology notes
Clinically important incidental recommendation extractor

*UW-BioNLP Lab: http://depts.washington.edu/bionlp/index.html

Section segmentation
HISTORY OF PRESENT ILLNESS:
This is an 85 year old man initially admitted to the Plastic Surgery Service for evaluation of a
left facial mass. Subsequently, CMED CCU was consulted and he was transferred to our
Service postoperatively.
PAST MEDICAL HISTORY:
His past medical history is significant for prostate cancer, benign prostatic hypertrophy,
hypothyroidism , status post radiation for non Hodgkin 's lymphoma, chronic painless
hematuria , degenerative joint disease and history of a murmur. Last colonoscopy, five
years ago. Dementia.
ALLERGIES:
No known drug allergies.
MEDICATIONS:
1. Levothyroxine.
2. Lasix.
3. Proscar.
4. Aeroseb.
5. Ancef.
PHYSICAL EXAMINATION:
On examination, he is afebrile. Vital signs, stable. Elderly man, somewhat cachectic . Head,
eye, ears, nose and throat, polypoid lesion just inferior to the left zygoma, elevated
superiorly, with visible bone. No exudate. Minimal bleeding. Regular rate and rhythm. Clear
to auscultation. Nontender, nondistended.
HOSPITAL COURSE:
He was initially admitted to CMED for resection and repair of this left facial lesion. He also
had consults from Urology for his hematuria as well as Medicine preoperatively and CMED
CCU. He went to the Operating Room on 2016-03-10 with Urology for hematuria where he
had a cystoscopy transurethral resection of prostate placement. He then went to the
Operating Room on 2016-03-14 where he had ...

Framework
We create a section header ontology for a given note type (e.g., radiology reports, discharge summaries)
We annotate a small set of documents for each of the report types
We train a two-level classifier
First classifier identifies the boundaries of the sections
Second classifier identifies the section type

Radiology report ontology

Clinical Information
  Clinical History
Exam Details
  Exam
  Comparison
  Contrast
  Procedure
Findings
  Findings
Impression
  Impression
  Attending Statement

Discharge report ontology

General patient information
  Admit date
  Discharge date
  Service
Provider information
  Admit physician
  Attending physician
  Discharge physician
  Attending surgeon
Condition before admission
  Admission diagnoses
  History
  Reason for admission
  Medications
Medical history
  Allergies
  Past medical history
  Past surgical history
  Family history
  Gynecological history
  Social history
Hospital course
  Consultation
  Procedures
  Hospital course
  Studies
  Physical
Conditions at discharge
  Discharge diagnoses
  Other diagnoses
  Physician exam on discharge
  Disposition
  Condition
Discharge instructions
  Discharge instructions
  Discharge medications
  Follow-up
Addenda
  Attending statement
  Note

Annotation Process
[Figure: the example discharge summary, with each section labeled by its section type]
HISTORY OF PRESENT ILLNESS: | History of present illness
MEDICAL HISTORY: | Past medical history
ALLERGIES: | Allergies
MEDICATIONS: | Medications
PHYSICAL EXAMINATION: | Physical Examination
HOSPITAL COURSE: | Hospital Course

Two-level classifier
Line Labeling
Labels each line of the report with one of the following three labels
B: Beginning of section
I: Inside of section
O: Outside of section

Section Type Labeling
Labels each section with a section type based on the content of an identified section (e.g., Impression)

Classifier: MaxEnt
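
To make the hand-off between the two levels concrete, the sketch below groups lines into sections once each line carries a B, I, or O label. It assumes the line labels are already predicted; the function name `bio_to_sections` and the handling of a stray I label are choices made for this example, not details taken from the released software.

```python
def bio_to_sections(lines, tags):
    """Group lines into sections from per-line B/I/O labels."""
    sections, current = [], None
    for line, tag in zip(lines, tags):
        if tag == "B":
            # A new section starts; close the open one, if any.
            if current:
                sections.append(current)
            current = [line]
        elif tag == "I" and current is not None:
            current.append(line)
        else:
            # "O", or an "I" with no open section: close and skip.
            if current:
                sections.append(current)
            current = None
    if current:
        sections.append(current)
    return sections


lines = ["ALLERGIES:", "No known drug allergies.", "",
         "MEDICATIONS:", "1. Levothyroxine."]
tags = ["B", "I", "O", "B", "I"]
sections = bio_to_sections(lines, tags)
```

Each resulting section (header line plus body lines) would then be passed to the second classifier, which assigns the section type.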

Features
Features for line labeling
Type | Features
Text features | isAllCaps, isTitleCaps, containsNumber, beginsWithNumber, numTokens, numPreBlanklines, numPostBlanklines, firstToken, secondToken, unigram
Tag features | prevTag, prevTwoTags, tagChainLength

Features for section labeling
Type | Features
Header features | Same as tag features, only the header line is used
Body features | avgLineLength, numLines, docPosition, containsList, unigram
Tag features | prevTag
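
Several of the text features for line labeling can be computed directly from the raw lines. The sketch below is one plausible reading of the feature names; the exact definitions used in the original system are not given on the slide, so every definition here is an assumption.

```python
def line_features(lines, i):
    """Compute a subset of the text features for line i of a note."""
    line = lines[i]
    tokens = line.split()
    letters = [c for c in line if c.isalpha()]
    alpha_tokens = [t for t in tokens if t[0].isalpha()]

    def blank_run(start, step):
        # Count consecutive blank lines moving away from line i.
        count, j = 0, start
        while 0 <= j < len(lines) and not lines[j].strip():
            count += 1
            j += step
        return count

    return {
        "isAllCaps": bool(letters) and all(c.isupper() for c in letters),
        "isTitleCaps": bool(alpha_tokens)
        and all(t[0].isupper() for t in alpha_tokens),
        "containsNumber": any(c.isdigit() for c in line),
        "beginsWithNumber": bool(tokens) and tokens[0][0].isdigit(),
        "numTokens": len(tokens),
        "numPreBlanklines": blank_run(i - 1, -1),
        "numPostBlanklines": blank_run(i + 1, 1),
        "firstToken": tokens[0].lower() if tokens else "",
        "secondToken": tokens[1].lower() if len(tokens) > 1 else "",
    }


note = ["MEDICATIONS:", "1. Levothyroxine.", "", "PHYSICAL EXAMINATION:"]
header_feats = line_features(note, 0)
item_feats = line_features(note, 1)
```

The tag features (prevTag, prevTwoTags, tagChainLength) depend on earlier predictions and would be added during decoding rather than here.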

Results*
Discharge Summaries
Dataset: 430 notes
Performance:
Precision: 91.1%
Recall: 92.4%
F: 91.8%

Radiology Reports
Dataset: 100 notes
Performance:
Precision: 93.1%
Recall: 91.1%
F: 92.1%

Software released at: http://depts.washington.edu/bionlp/software.htm

*M. Tepper, D. Capurro, F. Xia, L. Vanderwende, M. Yetisgen-Yildiz. Statistical Section Segmentation in Free-Text Clinical Records. In Proceedings of LREC 2012.

Assertion analysis
Assertion classification: Given a medical problem concept mentioned in a clinical note (e.g., chest pain), the purpose of a system solving this task is to determine whether the concept is:
present
absent
conditional
hypothetical
possible
not associated with the patient (not patient).

Assertion classification (examples)

The patient was then followed in the cardiac critical care unit where he had evidence of anoxic encephalopathy. (present)
Heart was regular with a I/VI systolic ejection murmur without jugular venous distention. (absent)
He does become slightly short of breath when lifting furniture. (conditional)
If you have fevers please contact your PCP or return to the emergency room. (hypothetical)
The patient was continued on antibiotics for possible pneumonia. (possible)
Father had coronary artery disease. (not patient)

Annotated corpus
Part of the 2010 Informatics for Integrating Biology and the Bedside (i2b2)/Veterans Affairs (VA) shared-task challenge:
21 teams competed for this task

Dataset:
826 clinical reports (mostly discharge summaries):
349 training documents (11,968 instances)
477 test documents (18,550 instances)

Distribution of instances per assertion category:
69% present
20% absent
4.5% each for hypothetical and possible
<1% each for conditional and not patient

Related work
de Bruijn et al. (2011):
state-of-the-art at the 2010 i2b2/VA challenge
93.62 micro-averaged F-measure (primary metric)
two layers of classifiers for computing concept predictions
wide range of engineered features

Roberts and Harabagiu (2011):
93.94 micro-averaged F-measure
various feature type optimization techniques

Kim et al. (2011):
94.17 micro-averaged F-measure
79.76 macro-averaged F-measure
Implemented additional features with a focus on improving the performance of minority classes

Assertion classification framework

SVM-based classification framework* (LIBLINEAR)
SPLAT preprocessing: tokenization, lemmatization, Porter stemming, and constituent and dependency parsing
Optimization of feature types over the training set (greedy forward/backward feature selection)
Extracted a wide diversity of features:
basic features (lexical, NegEx, ConText, etc.)
section features (statistical section chunker*)
category specific (N-grams highly correlated with an assertion category)
assertion focus features
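
The greedy optimization over feature types can be illustrated with a forward-only pass. This sketch assumes an `evaluate` callback that returns a score (for example, cross-validated F-measure on the training set) for a list of feature types; the callback, the function name, and the stopping rule are all assumptions, and the backward (removal) step mentioned on the slide is not shown.

```python
def greedy_forward_selection(feature_types, evaluate):
    """Add one feature type at a time while the score keeps improving."""
    selected = []
    best = evaluate(selected)
    remaining = list(feature_types)
    while remaining:
        # Score every candidate extension of the current selection.
        scored = [(evaluate(selected + [ft]), ft) for ft in remaining]
        score, ft = max(scored, key=lambda pair: pair[0])
        if score <= best:
            break  # no candidate improves the score
        selected.append(ft)
        remaining.remove(ft)
        best = score
    return selected, best


# Toy scoring function with made-up additive gains per feature type.
toy_gains = {"basic": 0.90, "section": 0.02, "noise": -0.01}


def toy_evaluate(feature_list):
    return sum(toy_gains[ft] for ft in feature_list)


chosen, chosen_score = greedy_forward_selection(list(toy_gains), toy_evaluate)
```

In a real setting `evaluate` would train and score the SVM for each candidate set, which makes the search expensive but keeps it simple.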

*C.A. Bejan, L. Vanderwende, F. Xia, M. Yetisgen-Yildiz. Assertion modeling and its role in clinical phenotype identification. J Biomed Inform. 2013;46(1):68-74.

Basic features
Encode the surrounding contextual information of the medical concept at the sentence level:
word, lemma, and stem uni/bi/tri-grams occurring before and after the concept
right sparse stem trigram (e.g., * lift furnitur)
while lifting furniture, if lifting furniture, after lifting furniture
tests for the presence of special tokens (question mark, conjunction)
concept stems
the closest preposition to the left
tests for the presence of negative prefixes (ab, de, di, il, im, in, ir, re, un, no, mel, mal, and mis)
output of ConText and NegEx (window size = 6)
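
Two of these basic features are easy to sketch: the negative-prefix test and the right sparse stem trigram, where the first of the three stems after the concept is replaced by a wildcard. The code assumes stems are already available (the slide names Porter stemming, which is not implemented here), and the function names are illustrative.

```python
NEGATIVE_PREFIXES = ("ab", "de", "di", "il", "im", "in", "ir", "re",
                     "un", "no", "mel", "mal", "mis")


def has_negative_prefix(word):
    """True if the word starts with one of the listed prefixes."""
    return word.lower().startswith(NEGATIVE_PREFIXES)


def right_sparse_trigram(stems, concept_end):
    """Trigram after the concept with its first stem wildcarded.

    `concept_end` is the index of the last stem of the concept.
    Returns None when fewer than three stems follow the concept.
    """
    window = stems[concept_end + 1:concept_end + 4]
    if len(window) < 3:
        return None
    return " ".join(["*"] + window[1:])


# "short of breath when lifting furniture", already stemmed.
stems = ["short", "of", "breath", "when", "lift", "furnitur"]
sparse = right_sparse_trigram(stems, concept_end=2)
```

The wildcard lets "while", "if", and "after" share one feature. Note that a plain prefix test also fires on words such as "infection", so it is a weak signal on its own.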

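Category specific features

N-gram features highly correlated with a specific assertion category
Original columns: Ngram, L/R, Absent, NP, Cond., Hypo., Possible, Present, Score (only the values below were recoverable; "-" marks a lost cell)

Ngram | Category | Count | Score
found to have | Present | 35 | 1.0
which showed | Present | 27 | 1.0
showed | Present | 153 | .99
revealed | Present | 83 | .97
developed | Present | 54 | .96
on exertion | Conditional | 11 | 1.0
with exertion | Conditional | - | 1.0
when | Conditional | - | .58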

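Category specific features

Original columns: Ngram, L/R, Absent, NP, Cond., Hypo., Possible, Present, Score (only the values below were recoverable)

Ngram | Category | Count | Score
no evidence * | Absent | 90 | 1.0
no * or | Absent | 61 | 1.0
did not * | Absent | 33 | 1.0
no | Absent | 1000 | .99
r/o | Possible | 14 | 1.0
questionable | Possible | 12 | 1.0
possible | Possible | 50 | .98
question of | Possible | 16 | .94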

Semantic features
The patient was then followed in the cardiac critical care unit where he had evidence of anoxic encephalopathy. (present)
Heart was regular with a I/VI systolic ejection murmur without jugular venous distention. (absent)
He does become slightly short of breath when lifting furniture. (conditional)
If you have fevers please contact your PCP or return to the emergency room. (hypothetical)
The patient was continued on antibiotics for possible pneumonia. (possible)
Father had coronary artery disease. (not patient)

Semantic features
Assertion cues:
BioScope negation cues (e.g., not, without, absence of)
BioScope speculative (or hedge) cues (e.g., suggest, possible, might)
temporal signals from TimeBank (e.g., after, while, on, at)
kinship terms from Longman English Dictionary (e.g., mother, brother)

Semantic features:
encode the connection between assertion cues and medical concepts
capture the meaning of assertion cues
help classifiers decide whether or not a concept is within the focus of an assertion cue

Semantic features
the closest negative cue in the left token context window (size=8)
the first assertion cue on the path in the dependency tree between the concept and root
the first verb on the dependency tree path between the medical concept and root
the modal auxiliary verb associated with the first verb on the dependency tree path between the medical concept and the closest assertion cue
the sequence of POS labels between the closest left assertion cue and the medical concept
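
The first feature in this list, the closest negative cue within a left window of 8 tokens, needs no parser and can be sketched directly. The small cue set below is illustrative only (the slides name BioScope as the actual source of cues), multi-word cues such as "absence of" are not handled, and the function name is an assumption.

```python
# Illustrative single-token negation cues; the real cue list is larger.
NEGATION_CUES = {"no", "not", "without"}


def closest_left_cue(tokens, concept_start, cues=NEGATION_CUES, window=8):
    """Return the nearest cue within `window` tokens left of the concept."""
    lowest = max(0, concept_start - window)
    for i in range(concept_start - 1, lowest - 1, -1):
        if tokens[i].lower() in cues:
            return tokens[i].lower()
    return None


sentence = ("Heart was regular with a I/VI systolic ejection murmur "
            "without jugular venous distention .").split()
# Concept "jugular venous distention" starts at token index 10.
cue_for_distention = closest_left_cue(sentence, concept_start=10)
# Concept "systolic ejection murmur" starts at token index 6.
cue_for_murmur = closest_left_cue(sentence, concept_start=6)
```

The dependency-path features in the rest of the list would require a parser and are outside the scope of this sketch.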
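
Assertion classification results*

Class distribution: absent 20%, not patient <1%, conditional <1%, hypothetical 4.5%, possible 4.5%, present 69%
P = precision, R = recall, MF = macro-averaged F, mF = micro-averaged F

Training set
System | absent P | absent R | not patient P | not patient R | conditional P | conditional R | hypothetical P | hypothetical R | possible P | possible R | present P | present R | MF | mF
basic | 95.77 | 95.66 | 85.48 | 57.61 | 76.19 | 31.07 | 94.26 | 88.33 | 79.36 | 64.67 | 95.05 | 97.81 | 77.93 | 94.48
+sect | 96.20 | 95.78 | 92.68 | 82.61 | 73.33 | 32.04 | 95.36 | 91.55 | 79.73 | 65.42 | 95.5 | 97.89 | 81.65 | 94.96
+spec | 96.50 | 95.78 | 92.94 | 85.87 | 80.39 | 39.81 | 95.51 | 91.55 | 84.87 | 72.34 | 95.97 | 98.16 | 84.55 | 95.55
+focus | 96.87 | 96.37 | 95.18 | 85.87 | 82.35 | 40.78 | 95.54 | 92.17 | 87.03 | 74.02 | 96.2 | 98.31 | 85.42 | 95.89

Test set
System | absent P | absent R | not patient P | not patient R | conditional P | conditional R | hypothetical P | hypothetical R | possible P | possible R | present P | present R | MF | mF
Kim et al. 2011 | 96.31 | 94.71 | 97.52 | 81.38 | 81.25 | 30.41 | 92.07 | 87.45 | 78.3 | 54.36 | 94.46 | 98.07 | 79.76 | 94.17
Our system | 95.71 | 93.88 | 91.79 | 84.83 | 80.00 | 30.41 | 92.42 | 86.75 | 83.16 | 55.95 | 94.51 | 98.28 | 79.96 | 94.23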
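
The relationship between the per-category scores and the macro-averaged F (MF) can be checked with a few lines of code. The sketch below reads the 12 per-category numbers of the "basic" row as precision/recall pairs in the order absent, not patient, conditional, hypothetical, possible, present; that reading is an assumption about the column layout, supported by the fact that it reproduces the reported MF of 77.93.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f(pr_pairs):
    """Unweighted mean of the per-category F-measures."""
    return sum(f_measure(p, r) for p, r in pr_pairs) / len(pr_pairs)


# (precision, recall) per category for the "basic" configuration.
basic = [
    (95.77, 95.66),  # absent
    (85.48, 57.61),  # not patient
    (76.19, 31.07),  # conditional
    (94.26, 88.33),  # hypothetical
    (79.36, 64.67),  # possible
    (95.05, 97.81),  # present
]
basic_macro_f = macro_f(basic)  # about 77.93, the reported MF
```

Because every category counts equally, the low recall on the rare conditional class pulls MF well below the micro-averaged F, which is dominated by the frequent present and absent classes. The micro-averaged value cannot be recomputed from this table alone, since it needs per-category instance counts.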

*C.A. Bejan, L. Vanderwende, F. Xia, M. Yetisgen-Yildiz. Assertion modeling and its role in clinical phenotype identification. J Biomed Inform. 2013;46(1):68-74.

Open source software

Available at:
http://depts.washington.edu/bionlp/software.htm
Applied on various note types:
UW Medicine:
Radiology reports, microbiology notes, operative notes, admit notes, discharge summaries, ICU progress notes, ...

Fred Hutch:
Pathology notes

Examples of Clinical Application Projects
1. Phenotype modeling in the ICU
Pneumonia predictor
Acute lung injury predictor

2. Information extraction from radiology notes
Clinically important incidental recommendation extractor

Application #1: Phenotype modeling in ICU
Aim: Developing an automated screening tool that accurately identifies critical illness phenotypes in the ICU
Cohort selection for phenotype-genotype correlation (community acquired pneumonia (CAP))
Phenotype surveillance: Predicting the likelihood of a patient acquiring a given phenotype the next day in ICU (ventilator associated pneumonia (VAP))

Data - unstructured
All free-text notes generated during ICU stay
Admit notes
ICU daily progress notes
Acute care daily progress notes
Transfer notes
Cardiology daily progress notes
Respiratory therapy notes
Radiology notes (chest x-rays)
Microbiology notes
Discharge summary

91

Data - structured (ventilator associated pneumonia)
Vital signs (temperature, blood pressure, heart rate, oxygen saturation)
Ventilator settings (tidal volume, FiO2, PEEP, respiratory rate), laboratory values (white blood cell count, pH, PaO2, PCO2)
Frequency and character of endotracheal aspirates
Ventilator bundle elements (e.g., head of bed position)
Compliance with oral hygiene and chlorhexidine mouthwash
Date, time, and type of antibiotic therapy administered

CAP cohort selection

System architecture (baseline*)
F-score = 50.70
[Figure: pipeline with components Patient records, Feature Extractor, MetaMap, Training Data, Pneumonia Learner, Test Data, Pneumonia Predictor (output: Positive / Negative)]

*M. Yetisgen-Yildiz, B.J. Glavan, F. Xia, L. Vanderwende, M.M. Wurfel. Extraction of Pneumonia Cases from Free-Text Intensive Care Unit Reports. Proceedings of AMIA'2011.
*M. Yetisgen-Yildiz, B.J. Glavan, F. Xia, L. Vanderwende, M.M. Wurfel. Identifying Patients with Pneumonia from Free-Text Intensive Care Unit Reports. Proceedings of Learning from Unstructured Clinical Text Workshop of ICML'2011.
*C.A. Bejan, L. Vanderwende, M.M. Wurfel, and M. Yetisgen-Yildiz. Assessing Pneumonia Identification from Time-Ordered Narrative Reports. Proceedings of AMIA'12.


CAP cohort selection

System architecture (extension*)
F-score = 85.71
[Figure: pipeline with components Patient records, Ranked words, Ranked concepts, Feature Extractor, MetaMap, Assertion Classifier, Training Data, Pneumonia Learner, Test Data, Pneumonia Predictor (output: Positive / Negative)]

*C.A. Bejan, L. Vanderwende, F. Xia, M. Wurfel, M. Yetisgen-Yildiz. Pneumonia identification using statistical feature selection. J Am Med Inform Assoc. 2012;19(5):817-23.

Time-of-onset prediction* (VAP)

Instance(patient, prediction timepoint)
[Figure: timeline for Patient A, Day 0 through Day 7, with the current prediction timepoint marked]
InstanceLabel(Patient A, Day 0) = -
InstanceLabel(Patient A, Day 1) = -
InstanceLabel(Patient A, Day 2) = +

Lookback period (lp)
[Figure: the same timeline for Patient A with the lookback window before the current prediction timepoint]
F-score = 76.46 (lp = 2)
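
The instance construction can be sketched as follows. Several details are assumptions made for illustration, since the slide does not spell them out: each instance pools the notes from the prediction day and the `lookback` preceding days, and an instance is labeled "+" from the onset day onward. The function name and data layout are hypothetical.

```python
def build_instances(notes_by_day, onset_day, lookback):
    """Create one (day, text, label) instance per ICU day.

    `notes_by_day` maps a day index to a list of note strings.
    `onset_day` is the day the phenotype is first present, or None.
    """
    instances = []
    for day in sorted(notes_by_day):
        # Pool notes from the lookback window ending at `day`.
        window = [d for d in range(day - lookback, day + 1)
                  if d in notes_by_day]
        text = " ".join(note for d in window for note in notes_by_day[d])
        positive = onset_day is not None and day >= onset_day
        instances.append((day, text, "+" if positive else "-"))
    return instances


patient_a = {0: ["note0"], 1: ["note1"], 2: ["note2"], 3: ["note3"]}
instances = build_instances(patient_a, onset_day=2, lookback=2)
```

A longer lookback gives the classifier more context per instance but blurs the day-level signal, which is why the lookback period is treated as a tunable parameter.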


*C.A. Bejan, L. Vanderwende, H.L. Evans, M.M. Wurfel, M. Yetisgen-Yildiz. On-time clinical phenotype prediction based on narrative reports. Proceedings of the American Medical Informatics Association Fall Symposium (AMIA'13). Washington DC, November, 2013. (Distinguished Paper Award)

Other experiments
Acute lung injury prediction from chest x-rays
M. Yetisgen-Yildiz, C.A. Bejan, M.M. Wurfel. Identification of patients with acute lung injury from free-text chest x-ray reports. Proceedings of BioNLP Workshop of ACL'2013, 2013.

CPIS score prediction from chest x-rays
M. Tepper, H.L. Evans, F. Xia, M. Yetisgen-Yildiz. Modeling Annotator Rationales with Application to Pneumonia Classification. Proceedings of Expanding the Boundaries of Health Informatics Using AI Workshop of AAAI'2013, 2013.

Application #2: Information extraction from radiology notes
Goal: Clinically important recommendation extraction.
Setting: patient, clinician, radiologist
The clinician orders a radiology test for the patient
The radiologist takes the X-ray and writes a radiology report, which is sent back to the clinician

Example radiology report

Reason for test: Prostate cancer surveillance
Incidental finding: 6-mm left lung nodule

*M. Yetisgen-Yildiz, M.L. Gunn, F. Xia, T.H. Payne. Automatic Identification of Critical Follow-Up Recommendation Sentences in Radiology Reports. Proceedings of AMIA 2011.
*M. Yetisgen-Yildiz, M.L. Gunn, F. Xia, T.H. Payne. A Text Processing Methodology to Extract Recommendation Information from Radiology Reports. J Biomed Inform, 2013; 46(2):354-62.

Architecture

99

Features
Feature Type | Features
Syntactic | unigram, bigram, tense, stemmedVerb, includesModalVerb, includesTemporalPhrase
Knowledge-based | umlsConcept, umlsSemanticType, radlexConcept
Structural | sectionType
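
A few of the syntactic features can be approximated without any external tools. The sketch below covers unigram, bigram, includesModalVerb, and includesTemporalPhrase; the modal verb list, the temporal pattern, and the example sentence are all made up for illustration, and the knowledge-based features (UMLS, RadLex) are omitted because they need external resources.

```python
import re

# Illustrative cue lists; the original system's lists are not given.
MODAL_VERBS = {"should", "could", "may", "might", "can", "would", "must"}
TEMPORAL_PHRASE = re.compile(
    r"\bin\s+\d+(?:-\d+)?\s+(?:day|week|month|year)s?\b", re.IGNORECASE)


def sentence_features(sentence):
    """Binary bag-of-ngrams plus two cue-based flags for one sentence."""
    tokens = re.findall(r"[A-Za-z0-9']+", sentence.lower())
    feats = {"uni=" + t: 1 for t in tokens}
    feats.update({"bi=" + a + "_" + b: 1
                  for a, b in zip(tokens, tokens[1:])})
    feats["includesModalVerb"] = int(any(t in MODAL_VERBS for t in tokens))
    feats["includesTemporalPhrase"] = int(
        bool(TEMPORAL_PHRASE.search(sentence)))
    return feats


# Made-up recommendation sentence, not taken from a real report.
rec = sentence_features(
    "Follow-up CT is recommended in 6 months and should be compared.")
other = sentence_features("No acute abnormality.")
```

A sentence-level classifier would take these dictionaries as input and decide whether each sentence contains a follow-up recommendation.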

Performance:
P: 0.74 R: 0.78 F: 0.76

Next steps: identify details of recommendation
Reason for recommendation
Recommended test
Time frame

New NIH grant with start date March 1st, 2014.

Error analysis for both applications

Current text representation is not sufficient to capture the meaning.
Assertion classes are not powerful enough
(1) present, (2) absent, (3) conditional, (4) hypothetical, (5) possible, (6) not associated with the patient

False Positive Examples from ALI:
Report #1: Diffuse lung opacities consistent with pulmonary edema.
Report #2: There has been gradual improvement of diffuse lung opacities consistent with pulmonary edema.

Error analysis (Cont.)

Some reports include very complex information that cannot be represented with a bag-of-words approach
Example microbiology note:
2+ Stenotrophomonas maltophilla: Timentin (Ticarcillin/Clavulanic Acid) MIC is 128 Mcg/ml Resistant, Moxifloxacin MIC is 4.0 Mcg/ml. No CLSI interpretive criteria available for this organism antibiotic combination.

Summary of error analysis

To extract meaningful information, more sophisticated representation approaches are needed for clinical text!

Current research
Events in clinical text
How are events represented in clinical text?
How can we extract events?

Report types:
Microbiology notes
Longitudinal chest x-ray notes

Microbiology notes
Microbiology laboratory culture tests are ordered to (1) identify sources of bacterial infection, (2) decide between differential diagnoses, and (3) adjust antibiotic treatments
Unlike other report types, microbiology notes change over time as more information is available about the culture

Example microbiology findings

60,000 Col/mL Staphylococcus aureus, coagulase positive: See other cultures of Bronchoalveolar Lavage from the same day for sensitivities
>100000 Col/mL Beta hemolic Streptococous not A, C, or G.
2+ Stenotrophomonas maltophilla: Timentin (Ticarcillin/Clavulanic Acid) MIC is 128 Mcg/ml Resistant, Moxifloxacin MIC is 4.0 Mcg/ml No CLSI interpretive criteria available for this organism antibiotic combination.
No Methicillin Resistant Staphyloccus aureus is isolated.

Event definition
Main attributes:
Organism: bacteria, flora, fungus, yeast (e.g., Stenotrophomonas maltophilla)
Organism quantity: quantity (e.g., >10000 col/ml)
Rating: cell confluence rating (e.g., 1+, 2+)
Drug: drug name that was tested (e.g., Moxifloxacin)
Drug resistance: susceptibility of the organism to the drug (e.g., resistant)
MIC: minimum inhibitory concentration (e.g., 2.0 Mcg/ml)

Additional attributes:
No-growth mention: no growth mention (e.g., no growth)
No-growth measurement: time measurement of no growth (e.g., 2 days)
Specimen description: reference specimen (e.g., lower respiratory culture from endotracheal tube)
Specimen date: minimum span that identifies the reference specimen date (e.g., same day)
Specimen attribute: attributes that the reference uses (e.g., reference sensitivities)

Relations

equivalentRefOf: organism-to-organism, or drug-to-drug
hasQuantity: organism-to-organism quantity
hasRating: organism-to-rating
measuredBy: no growth-to-no growth measurement
hasDrugDesc: organism-to-drug, organism-to-drug resistance
hasResistance: drug-to-drug resistance, organism-to-drug resistance
hasMIC: drug-to-MIC
hasAttrRefIn: organism-to-specimen description, organism quantity-to-specimen description
timestamp: specimen description-to-specimen date
attr: specimen description-to-reference item

Annota0on examples

109

Corpus*
1442 microbiology reports from UW Medical Center
100 reports were double annotated by a medical student and a biomedical informatics PhD student
Entity level
Kappa: 0.977
F-score: 0.964

Event level (exact entity and relation)
F-score: 0.960
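
For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's kappa for two annotators who each assign one label per item. Whether the reported value was computed exactly this way (for instance, at the token or entity level) is not stated on the slide, so treat this as a generic illustration; the labels in the example are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label rates.
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_a = ["Organism", "Organism", "Drug", "Drug"]
annotator_b = ["Organism", "Organism", "Drug", "Organism"]
kappa = cohen_kappa(annotator_a, annotator_b)
```

Here the annotators agree on 3 of 4 items (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5.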

*W. Yim, H. Evans, and M. Yetisgen. An annotated corpus for events in microbiology notes. To appear in Proceedings of AMIA 2014.

Entity extraction

Rule based:
Rating: cell confluence rating (e.g., 1+, 2+)
MIC: minimum inhibitory concentration (e.g., 2.0 Mcg/ml)
No-growth mention: no growth mention (e.g., no growth)

Hybrid (rules + logistic regression to prune false positives):
Organism quantity: quantity (e.g., >10000 col/ml)
Drug: drug name that was tested
Drug resistance: susceptibility of the organism to the drug
No-growth measurement: time measurement of no growth (e.g., 2 days)
Specimen date: minimum span that identifies the reference specimen date (e.g., same day)
Specimen attribute: attributes that the reference uses (e.g., reference sensitivities)

Statistical (sequential classification with conditional random fields):
Organism: bacteria, flora, fungus, yeast
Specimen description: reference specimen (e.g., lower respiratory culture from endotracheal tube)
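
The rule-based entities lend themselves to regular expressions. The patterns below are a sketch written against the example findings shown earlier in this section; they are not the rules used in the original system, and the assumed rating range (1+ to 4+) and the exact MIC phrasing are guesses.

```python
import re

# Rating such as "2+" as a standalone token (range 1-4 is assumed).
RATING = re.compile(r"(?<!\S)[1-4]\+(?!\S)")
# MIC value in phrases like "MIC is 4.0 Mcg/ml".
MIC = re.compile(r"MIC\s+is\s+(\d+(?:\.\d+)?)\s*Mcg/ml", re.IGNORECASE)
NO_GROWTH = re.compile(r"\bno\s+growth\b", re.IGNORECASE)


def extract_rule_entities(text):
    """Return the rule-based entity mentions found in one note."""
    return {
        "rating": RATING.findall(text),
        "mic": MIC.findall(text),
        "no_growth": NO_GROWTH.findall(text),
    }


finding = ("2+ Stenotrophomonas maltophilla: Timentin "
           "(Ticarcillin/Clavulanic Acid) MIC is 128 Mcg/ml Resistant, "
           "Moxifloxacin MIC is 4.0 Mcg/ml.")
entities = extract_rule_entities(finding)
```

In the hybrid setting described above, high-recall patterns like these would generate candidates and a logistic regression classifier would prune the false positives.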

Entity extraction performance
Columns: TP, FP, FN, P, R, F (rows with fewer values are incomplete; values are listed in their original order)

RULE
Rating: 453, 0.99, 0.99
MIC: 83, 0.98, 0.99
No-growth: 26

HYBRID
No-growth measure: 26
Specimen: 124, 0.99, 0.97, 0.98
Reference: 134
Drug resistance: 262, 0.99, 0.99, 0.99
Organism quantity: 738, 18, 0.99, 0.97, 0.98
Drug: 304, 0.99, 0.97, 0.98

STAT.
Organism: 1281, 94, 123, 0.93, 0.91, 0.92
Specimen: 109, 13, 27, 0.89, 0.80, 0.85

Event extraction performance

 | P | R | F
All entities | 0.968 | 0.952 | 0.960
Relations | 0.915 | 0.860 | 0.886

Events in Radiology Reports

Radiology reports contain rich semantic vocabulary designed to interpret imaging data with free-text
Events in this context are descriptions of relevant disease processes found by radiologists on imaging studies

Events with change of state


ICU Day #1: Diffuse lung opacities consistent
with pulmonary edema.
ICU Day #1: No change in diffuse lung
opacities consistent with pulmonary edema.
ICU Day #2: Diffuse lung opacities consistent
with pulmonary edema have worsened.
ICU Day #2: No change in diffuse lung
opacities consistent with edema.
ICU Day #3: There has been gradual
improvement of diffuse lung opacities
consistent with pulmonary edema.
115

Event description
Main attributes
loc: anatomical location
attr: something being measured or observed (e.g., volume, opacity)
val: a possible value for the attr (e.g., clear)
cos: change of state compared to other reports for the same patient (e.g., unchanged)
ref: a link to the report(s) that the change of state is compared to (e.g., prior examination)

Example annotations

A snippet featuring an event annotation connecting all five fields of the COS tuple.

A snippet featuring shared entities between events

Corpus*
1008 sentences from 1344 chest x-ray notes
7173 entities
4128 relations
2101 event tuples

Agreement:
3 annotators annotated 100 snippets
Entity annotation: Kappa = 0.902
Event annotation: Kappa = 0.716

*P. Klassen, F. Xia, L. Vanderwende, M. Yetisgen. Annotating Clinical Events in Text Snippets for Phenotype Detection. To appear in Proceedings of International Conference on Language Resources and Evaluation (LREC). Reykjavik, Iceland, May, 2014.

*L. Vanderwende, F. Xia, M. Yetisgen-Yildiz. Annotating Change of State for Clinical Events. Proceedings of The 1st Workshop on EVENTS: Definition, Detection, Coreference, and Representation Workshop of NAACL'2013, 2013.

Event extraction
Sequential classification for entity recognition
SVM for relation classification
Entity extraction performance:
P: 0.94 R: 0.95 F: 0.95

Relation extraction experiments are ongoing

Future steps
Running experiments with the VAP data to see whether features extracted from microbiology and radiology events improve VAP classification

Role of Local Context in Understanding Clinical Sublanguage
Ozlem Uzuner
MIT CSAIL and SUNY, Albany

Joint Work with: Tawanda Sibanda, Tian He, Jonathan Mailoa, and Peter Szolovits

Contributions and Take Home Message
We can extract key information from clinical narratives using a statistical representation of local context
Local context improves system performance in information extraction and goes a long way towards creating an accurate interpretation of the information buried in clinical sublanguage

Local Context
Characteristics of the target word (TW) and of the words immediately surrounding the TW
Lexical and orthographic features
Syntactic features
Semantic features

De-identification
Privacy concerns related to medical records
Health Insurance Portability and Accountability Act (HIPAA)
17 pieces of textual Personal Health Information (PHI)

PHI in discharge summaries:
First and last names of patients, their health proxies, family members
Identification numbers
Telephone, fax, pager numbers
Geographic locations
Dates
Hospital names
First and last names of doctors

Discharge Summaries

HISTORY OF PRESENT ILLNESS: Mrs. [Huntington] is a 77-year-old-woman with long standing hypertension who presented as a Walk-in to me at the [Bronx] Health Center on [DATE]. Recently had been started q.o.d. on Clonidine since [DATE] to taper off of the drug. Was told to start Zestril 20 mg. q.d. again. The patient was sent to the Emergency Unit for direct admission for cardioversion and anticoagulation, with the Cardiologist, Dr. [Swasissz] to follow.
SOCIAL HISTORY: Lives alone, has one daughter living in [Spring]. Is a non-smoker, and does not drink alcohol.
HOSPITAL COURSE AND TREATMENT: During admission, the patient was seen by Cardiology, Dr. [Tylenol], was started on IV Heparin, Sotalol 40 mg PO b.i.d. increased to 80 mg b.i.d., and had an echocardiogram. By [DATE] the patient had better rate control and blood pressure control but remained in atrial fibrillation. On [DATE], the patient was felt to be medically stable

Callouts:
[Huntington]: What if patient had Huntington's disease?
[Swasissz]: Misspelled or foreign name?
[Tylenol]: What if the patient has to take Tylenol?

Related Work
Named Entity Recognition (NER)
Exploit both the characteristics of the names of the entities and contextual clues related to these entities (Bikel et al.; McCallum et al.; Riloff and Jones; ...)

Bio-Named Entity Recognition
Exploit various feature sets including surface and syntactic features, word formation patterns, morphological patterns, POS tags, etc. (Collier et al.; Yu et al.; ...)

De-identification
Combinations of statistical and rule-based approaches
Most statistical approaches focused on sub-categories of PHI (Taira et al.; Thomas et al.; ...)
Approaches that target full de-identification use dictionaries, rules, and patterns (Gupta et al.; Douglass; ...)

Local Context for De-identification

Local context: Characteristics of the target word (TW) and of the words immediately surrounding the TW
Lexical and orthographic features:
The target word (TW) itself
The word before and after the TW
The bigram before and after the TW
Capitalization, punctuation, numbers, word length

Syntactic features:
Part of speech (POS) of TW, of the word before, and of the word after
Syntactic bigrams

Semantic features:
Presence of TW, of the word before, and of the word after in relevant dictionaries
MeSH ID
The heading of the section in which TW appears
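
The lexical, orthographic, and dictionary-based parts of this feature set can be sketched for a single target word as follows. The feature names, the padding token, and the tiny dictionaries are assumptions for illustration; POS tags, syntactic bigrams, and MeSH IDs are left out because they depend on external tools.

```python
def local_context_features(tokens, i, dictionaries, section_heading):
    """Local-context features for the target word tokens[i]."""

    def tok(j):
        # Lowercased neighbor, or a padding symbol outside the sentence.
        return tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"

    tw = tokens[i]
    feats = {
        "tw": tw.lower(),
        "prev_word": tok(i - 1),
        "next_word": tok(i + 1),
        "prev_bigram": tok(i - 2) + " " + tok(i - 1),
        "next_bigram": tok(i + 1) + " " + tok(i + 2),
        "is_capitalized": tw[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tw),
        "has_punctuation": any(not c.isalnum() for c in tw),
        "length": len(tw),
        "section_heading": section_heading,
    }
    # Dictionary membership of the TW and its immediate neighbors.
    for name, entries in dictionaries.items():
        feats["tw_in_" + name] = tw.lower() in entries
        feats["prev_in_" + name] = tok(i - 1) in entries
        feats["next_in_" + name] = tok(i + 1) in entries
    return feats


tokens = ["seen", "by", "Cardiology", ",", "Dr.", "Tylenol", ",", "was"]
toy_dicts = {"drugs": {"tylenol"}, "names": {"john"}}
feats = local_context_features(tokens, 5, toy_dicts, "HOSPITAL COURSE")
```

In this example the dictionary says "Tylenol" is a drug, while the preceding "Dr." suggests a doctor name; giving the classifier both signals is what lets it resolve such ambiguous tokens.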


Local Context for De-identification

Local context: Characteristics of the target word (TW) and of the words immediately surrounding the TW
Lexical and orthographic features:
Syntactic features:
Part of speech (POS) of TW, of the word before, and of the word after
Syntactic bigrams
Transferred to
Transferred immediately to
Transferred later to

Semantic features:
The heading of the section in which TW appears

Syntactic Information
From the output of the Link Grammar Parser (Sleator and Temperley, 1991)

[Figure: link diagram over "LEFT-WALL John smokes two packs daily ." with link labels Wd, Ss, Op, Dmc, MVa, Xp]

Op links verbs to their plural objects
Dmc links determiners to their plural nouns
MVa connects verbs to their modifiers
Xp links periods to words

Syntactic Bigrams

[Figure: link diagram over "LEFT-WALL John smokes" with links Wd and Ss, marked as the left syntactic bigram; "two packs, NONE" marked as the right syntactic bigram]

Local Context for De-identification

Local context: Characteristics of the target
word (TW) and of the words immediately
surrounding the TW
Lexical and orthographic features:
Syntactic features:
Semantic features:
Presence of TW, of the word before, and of the word
after in relevant dictionaries
MeSH ID
The heading of the section in which TW appears



Stat De-id
Multi-class SVM (linear kernel)
with local context
Determine if a word is:
PHI
Patient name
Doctor name
Date
Phone
ID
Hospital name
Location
Non-PHI
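
To make the idea of a multi-class linear SVM over local-context features concrete, here is a small self-contained sketch. The slides state only that Stat De-id is a multi-class SVM with a linear kernel; the one-vs-rest decomposition, the subgradient training loop, the hyperparameters (`epochs`, `lr`, `lam`), the reduced label set, and the toy features are all assumptions chosen so the example runs without external libraries. A real system would more likely use an established SVM package.

```python
from collections import defaultdict

# Illustrative subset of the PHI classes; the slide lists seven PHI classes.
LABELS = ["PATIENT", "DOCTOR", "DATE", "NON_PHI"]

# Hypothetical training examples: (set of active binary features, gold label).
TOY_EXAMPLES = [
    ({"prev=mr.", "capitalized"}, "PATIENT"),
    ({"prev=dr.", "capitalized"}, "DOCTOR"),
    ({"prev=on", "has_digit"}, "DATE"),
    ({"prev=the"}, "NON_PHI"),
]


def train_one_vs_rest(examples, labels, epochs=200, lr=0.1, lam=0.01):
    """Train one L2-regularized hinge-loss linear classifier per label."""
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for feats, gold in examples:
            active = feats | {"<bias>"}
            for label, w in weights.items():
                y = 1.0 if gold == label else -1.0
                margin = y * sum(w[f] for f in active)
                # Shrink all weights: gradient of the L2 regularization term.
                for f in w:
                    w[f] *= 1.0 - lr * lam
                # Hinge-loss subgradient step when the margin is violated.
                if margin < 1.0:
                    for f in active:
                        w[f] += lr * y
    return weights


def predict(weights, feats):
    """Return the label whose linear classifier gives the highest score."""
    active = feats | {"<bias>"}
    return max(
        weights,
        key=lambda label: sum(weights[label].get(f, 0.0) for f in active),
    )


if __name__ == "__main__":
    model = train_one_vs_rest(TOY_EXAMPLES, LABELS)
    for feats, gold in TOY_EXAMPLES:
        print(sorted(feats), "gold:", gold, "predicted:", predict(model, feats))
```

Each word is classified independently from its own feature set, which mirrors the local-context framing of the slide: no decision depends on the labels assigned to neighbouring words.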


Evaluation
Compare with a Heuristic + Dictionary (H+D)
approach that benefits from dictionaries, rules,
and patterns (Douglass)

Compare with approaches that benefit from
wider context, i.e., IdentiFinder (Bikel et al.) and
a Conditional Random Field (CRF) De-identifier
Wider context: Characteristics and dependencies of
the entities in the sentence containing the target

Methods
Stat De-id: Cross-validated
CRFD: Cross-validated
H+D: Rule-based
IdentiFinder: Obtained pre-trained on newswire


Data

                          Number of tokens
PHI category   Random corpus   Authentic corpus   Challenge corpus
Non-PHI               17,874            112,669            444,127
Patient                1,048                294              1,737
Doctor                   311                738              7,697
Location                  24                 88                518
Hospital                 600                656              5,204
Date                     735              1,953              7,651
ID                        36                482              5,110
Phone                     39                 32                271

Table 1: Number of words in each PHI category in the corpora.



Data

PHI category   Number of ambiguous tokens in the challenge corpus
Non-PHI        39,374
Patient           158
Doctor          1,083
Location           44
Hospital        1,910
Date               81
ID                  4
Phone               1

Table 2: Distribution of words, i.e., tokens, that are ambiguous between PHI and non-PHI.


Data

Corpus      Patients in   Doctors in    Locations in     Hospitals in     Dates in
            names dict.   names dict.   location dict.   hospital dict.   month dict.
Random      86.45%        86.50%        87.5%            87.5%            12.65%
Authentic   78.57%        70.33%        54.55%           80.18%           21.97%
Challenge   14.10%        17.20%        11.40%           26.59%           5.15%

Corpus      Non-PHI in    Non-PHI in       Non-PHI in        Non-PHI in
            names dict.   location dict.   hospitals dict.   month dict.
Random      15.87%        9.19%            14.10%            0.07%
Authentic   16.12%        10.19%           12.74%            0.02%
Challenge   15.36%        11.32%           8.61%             0.06%

Table 3: Percentage of words that appear in name, location, hospital, and
month dictionaries used by Stat De-id and by the H+D approach.
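
A coverage figure of the kind reported in Table 3 can be computed as the share of tokens in a category that appear in a given dictionary. The sketch below is only an illustration: the function name `dictionary_coverage` and the case-insensitive exact-match rule are assumptions, since the slides do not say how tokens were matched against dictionary entries.

```python
def dictionary_coverage(tokens, dictionary):
    """Return the percentage of tokens found in the dictionary.

    Matching is case-insensitive exact match (an assumption).
    """
    if not tokens:
        return 0.0
    entries = {entry.lower() for entry in dictionary}
    hits = sum(1 for token in tokens if token.lower() in entries)
    return 100.0 * hits / len(tokens)
```

For example, if three of four hypothetical patient-name tokens are in the names dictionary, the function returns 75.0. Low coverage for PHI in a corpus indicates that dictionary lookup alone will miss most PHI there.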


Evaluation Metrics
Precision
Recall
F-measure
Only on the PHI
Aggregate over all PHI
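
The metrics above can be sketched at the token level as follows, counting only PHI decisions and pooling all PHI categories into one precision, recall, and F-measure. The function name `phi_prf`, the `NON_PHI` label string, and the choice to count a PHI token given the wrong PHI category as both a false positive and a false negative are assumptions; the slides do not state how such confusions were scored.

```python
def phi_prf(gold, predicted, non_phi="NON_PHI"):
    """Token-level precision, recall, and F-measure aggregated over all PHI."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p != non_phi and p == g:
            tp += 1          # PHI token labeled with the correct category
        else:
            if p != non_phi:
                fp += 1      # predicted PHI, but gold disagrees
            if g != non_phi:
                fn += 1      # gold PHI that was missed or mislabeled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Correctly labeled non-PHI tokens never enter the counts, so the very large non-PHI class cannot inflate the scores.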


Results
F-measures on PHI in the Randomly Re-id'ed Corpus
F-measures on PHI in the Authentic Corpus
F-measures on PHI in the Challenge Corpus

All differences significant at alpha=0.05.


F-measure Comparison of Local Context Features on PHI

[Bar chart: F-measure (0.00 to 100.00) on the Random, Authentic, and Challenge corpora. Legend: All Stat De-id / All CRFD; All local context features of IdentiFinder. Data labels in extraction order: 97.63, 95.33, 96.82, 94.28, 98.03, 95.08]

All differences significant at alpha=0.05.



Contributions
We can:
De-identify medical discharge
summaries using a statistical
representation of local context

We showed that:
The stronger the local context,
the better the performance
when using an SVM


Open Questions and Future
Directions
