
Using Artificial Intelligence to Leverage Precision Medicine on EHR Data: A Review

Sarah Mullin, MS

State University of New York at Buffalo
Department of Biomedical Informatics
Buffalo, NY, United States

Abstract—Prediction algorithms using machine and deep learning can play a role in precision medicine. As data increases in the form of electronic health records, new methods for generalizable automatic prediction of diseases, risk, and mortality will be invaluable. The review found that the majority of the literature makes use of machine learning, rather than deep learning, techniques in medicine. Issues with EHR data for secondary research use were dealt with in multiple ways. In addition, most algorithms do not use clinical text in their models and note this as a limitation.

Keywords—EHR, machine learning, deep learning, precision medicine
1. INTRODUCTION

Precision medicine, or personalized medicine, is a new wave of medicine that forgoes the "one-size-fits-all" approach in favor of disease prevention and treatment tailored to the differences among patients.1 Clinical trials often exclude under-represented or highly complex patients, and therefore no precedent for these types of patients exists in typical forms of evidence-based medicine. Parameters of a patient's health are contained within the Electronic Health Record (EHR) or Electronic Medical Record (EMR), in structured data, including pharmaceutical data, ICD-9 codes, and lab values, and in unstructured clinical notes. As of 2015, the basic EHR had been adopted by 84% of non-federal U.S. hospitals, and that number continues to grow.2 EHRs and EMRs contain millions of records of digital longitudinal observational data that are specific to a patient and can aid in creating secondary-use measures and risk predictions for precision medicine.

Millions of records and multitudes of variables, or features, lead us to big data solutions for prediction in the form of artificial intelligence. Artificial intelligence, by way of unsupervised and supervised machine learning, extracts patterns from raw data. Deep learning allows computers to learn from experience and to understand the world in a hierarchy of concepts.3 Deterministic and generative learning algorithms, including regression, classification, and clustering techniques, as well as combinations of multiple types of models or mixed distribution models, can give us better predictive models that are generalizable to a larger and more diverse population, not just the patients who fit the strict inclusion/exclusion criteria of clinical trials or cohort studies.3; 4 Therefore, we can potentially predict personalized medical outcomes for almost all patients using EHR data, by looking for patterns across an entire population that a single clinician could not see.

Although EHR data is rich, there are many caveats associated with using the data for secondary research. Multiple papers have illustrated the issues and quality of EHR data for secondary research use. These include a vast amount of missing values, inaccuracy, insufficient granularity, an incomplete patient story, inconsistency, and a mismatch between research protocols and EHR data recording.5; 6

In addition, the majority of data is locked in free-text notes. For instance, one study clearly shows the inconsistency between free text and codified structured components, stating, "of the 3068 patients who had ICD-9-CM diagnoses for pancreatic cancer, only 1589 had corresponding disease documentation in pathology reports."5 Not only can data be inconsistent, but relying on codified components alone does not show the complete clinical picture of the patient, and it can introduce bias and miss data components that are present only in the clinical text. Therefore, either manual extraction or extraction by natural language processing should be done for accurate prediction algorithms.

A review of personalized prediction algorithms using learning methods on EHR data was therefore done to assess the current literature. Since the popularity of using EHR data for secondary use is increasing, as is the number of patient records, special attention was given to how the caveats associated with using EHR data for secondary research were dealt with in the models.

2. METHODS

The PRISMA statement for reporting reviews was used to guide the items included and the analysis.7 Since EHR data is observational by nature, risk of bias was not assessed in the conventional way outlined by Cochrane or PRISMA. Instead, selective outcome reporting and the reporting of conventional parameters for learning applications, such as training and test set sizes, were analyzed.
2.1. Objectives

Since machine learning and deep learning are relatively new applications to the field of medicine, a main objective of the review was to assess the current scope of the literature on predictive learning models built on EHR data for precision medicine. Assessing accuracy outcomes and the learning methods applied to particular disease, disorder, and risk assessments, in addition to how EHR data complexities and issues are overcome, can be a beneficial learning tool for future research aimed at predictive modeling in medicine.

A second objective involves assessing the use of EHR data for precision medicine. As noted previously, the complexities and issues of using EHR data for research are abundant. Therefore, the following questions are assessed: How do the studies address the main issues that arise with using EHR data, especially with respect to missing data? From what source (structured codes or unstructured clinical texts) do diagnoses come? Are temporal distinctions taken into account?
2.2. Data Sources and Searches

A literature search was done using Medline and Google Scholar, where 'machine learning,' 'artificial intelligence,' 'electronic health records,' and 'precision medicine' are all MeSH headings. Searches took the form '((Machine Learning) AND Electronic Health Records) AND Personalized Medicine' for all variations and combinations of the headings. In addition, searches with the non-MeSH headings 'deep learning' and 'personalized medicine' were done in Google Scholar. Reference mining was done to elicit additional citations. Each database was searched from inception until October 8, 2017.
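For illustration only, the Medline arm of such a search could be scripted; the sketch below assumes Biopython's Entrez interface and reuses the query string above. The review itself queried Medline and Google Scholar directly, not through code.

```python
# Illustrative sketch only: running the Medline query programmatically via
# Biopython's Entrez interface (pip install biopython). The email address
# and retmax are placeholders; NCBI requires a contact address.
from Bio import Entrez

Entrez.email = "you@example.org"

query = "((Machine Learning) AND Electronic Health Records) AND Personalized Medicine"
handle = Entrez.esearch(db="pubmed", term=query, retmax=500,
                        datetype="pdat", mindate="1900", maxdate="2017/10/08")
record = Entrez.read(handle)
print(record["Count"], "records found; first IDs:", record["IdList"][:5])
```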
2.3. Study Selection

Abstracts of the papers found from the search criteria were reviewed for studies of original research from journal publications or conference proceedings. Reviews and editorials were not included. Papers from non-peer-reviewed journals, and from journals whose impact factor had been removed, were also excluded. From abstract review, only those fitting the criteria of using a learning algorithm to predict outcomes for precision medical use with an EHR data source were included. Prediction analytics included risk prediction, disease prediction, and temporal prediction, such as survival analysis. Papers that discussed the extraction of data from clinical notes using Natural Language Processing (NLP) or electronic phenotyping were excluded unless they had a prediction component.

2.4. Data Extraction and Synthesis

Outcomes that directly and indirectly impacted our objectives were included in the analysis. An abstraction form was created, and outcomes fell into three categories: general information, outcomes pertaining to learning algorithms, and outcomes pertaining to issues with EHR data. General information outcomes included author, publication date, location, medical domain, and funding. Learning algorithm outcomes included sample size, the particular testing and training sample sizes, the particular method of learning (i.e., deep learning, supervised, unsupervised, etc.), the method of evaluation (AUC, accuracy, sensitivity, specificity), and the type of outcome (binary, multi-class, continuous). Finally, EHR outcomes included data sources, whether structured and/or unstructured data was included, how missing data was handled, how temporal distinctions were accounted for, and the terminologies or ontologies used for coded data. Narrative synthesis, as opposed to a primarily statistical synthesis, was used to synthesize the included papers' findings, relying primarily on the use of text to analyze similarities and differences among the papers.8
3. RESULTS

3.1. Study Selection

The search strategy found 201 records, and 2 more were found from reference mining. Thirty-eight papers met the broad criteria and had their abstracts screened. Since the topic is relatively new in the field of medicine, only 10 papers met the full extent of the inclusion/exclusion criteria. The trial flow can be found in Figure 1.

Fig. 1. Trial flow of the search strategy and inclusion/exclusion criteria.

3.2. Study Characteristics

General paper outcomes can be found in Table 1. Three of the ten papers are conference proceedings, and the remaining seven are from various journals. The most recent impact factor of each journal is reported. The papers' medical domains range from conditions that are easy to assess and predict with codified diagnoses and prescriptions, such as diabetes and sepsis, to more difficult ones, such as heart failure and abdominal pain, which do not necessarily have associated codes, labs, and prescriptions.
3.3. Learning Algorithms

The majority of prediction algorithms were two-class binary algorithms that made use of supervised classification methods such as logistic regression, SVM, k-nearest neighbors, and Bayesian supervised techniques (Table 2). The multi-class outcomes used Gaussian Mixture Models and K-means, which allow data points to be assigned to clusters, and Classification and Regression Trees (CART), which combine two methods of learning to predict outcomes based on a tree structure of decisions. There were no continuous outcomes. Two of the ten papers used deep learning.9; 10
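As an illustration of this comparison pattern (not code from any of the reviewed papers), a minimal scikit-learn sketch on synthetic stand-in data might look as follows; the feature matrix is a placeholder for patient-level EHR features such as labs, vitals, and code indicators.

```python
# Minimal sketch of the two-class supervised comparison described above,
# using scikit-learn on synthetic data. Rows = patients, columns = features,
# y = a binary outcome such as disease onset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "k-nearest neighbors": KNeighborsClassifier(),
    "naive Bayes": GaussianNB(),
}
for name, model in models.items():
    # AUC estimated by 5-fold cross-validation, mirroring the reviewed papers
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```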
Machine learning and deep learning algorithms often make use of hundreds of thousands of pieces of training data in order to achieve adequate accuracy on the test set. However, the supply of medical EHR data is currently limited to hundreds or thousands of records, owing to non-interoperable data sources and repositories. Cross-validation, which six of the ten papers use, allows a proportion of the available data to be used for training while making use of all of the data to assess performance.4 However, this is computationally expensive: multiple training runs have to be processed, and exploring combinations of settings, which changes the complexity of the model and its hyperparameters, adds to the number of runs.4 These small sample sizes may be why more computationally heavy deep learning models have not been assessed in the clinical domain. In addition, a general rule of thumb is to use as much data as possible to train the model; however, Anderson, 2016 and Panahiazar, 2015 put the majority of their data in their test sets.11; 12
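A minimal sketch of why hyperparameter search multiplies the cost of cross-validation, again with synthetic data and scikit-learn assumed (none of the reviewed papers published this exact procedure):

```python
# k-fold cross-validation over a hyperparameter grid: the run count grows as
# (number of folds) x (number of parameter combinations).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(), param_grid, cv=10, scoring="roc_auc")
search.fit(X, y)  # 10 folds x 9 settings = 90 fits, plus one final refit

print("best setting:", search.best_params_)
print("best mean AUC:", round(search.best_score_, 3))
```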
Models were assessed using Area Under the Curve (AUC), accuracy in terms of misclassifications over the total, sensitivity, specificity, RMSE, and F-score. With higher values indicating better outcomes, AUC ranged from 0.65 to 1, with the majority in the 0.7 to 0.9 range.
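For concreteness, these measures can be computed for a single model on a held-out test set as in the following sketch (synthetic data; a classification threshold of 0.5 is assumed):

```python
# Computing the reported evaluation measures on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=25, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]       # predicted risk
pred = (prob >= 0.5).astype(int)             # assumed decision threshold

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("AUC:        ", round(roc_auc_score(y_te, prob), 3))
print("accuracy:   ", round(accuracy_score(y_te, pred), 3))
print("sensitivity:", round(tp / (tp + fn), 3))
print("specificity:", round(tn / (tn + fp), 3))
print("F-score:    ", round(f1_score(y_te, pred), 3))
```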
3.4. The Complexity of the EHR

A limitation of using EHR data is that the majority of the data is locked in free text in the clinical notes. This can be extracted through manual review, which occurred in a few studies, or through automatic phenotyping or natural language processing methods. The results of the studies can be found in Table 3. Six of the ten studies used only structured data components: ICD-9 codes, vital signs, demographics, and lab values. Two of the remaining four studies used data that had already been preprocessed into data repositories from an EHR. Only one study used both structured and unstructured data, using cTAKES for extraction.13
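cTAKES maps note text to UMLS concepts; as a deliberately naive stand-in, the sketch below shows only the general idea of converting free text into binary features. The patterns are hypothetical, and a real clinical NLP system would also handle negation, context, and misspellings.

```python
# Deliberately naive stand-in for cTAKES-style concept extraction: turn a
# free-text note into binary features. Illustrates why text can add signal
# that diagnosis codes alone may miss.
import re

CONCEPTS = {  # hypothetical concept patterns, not a real UMLS mapping
    "pancreatic_cancer": r"pancreatic\s+(cancer|carcinoma|adenocarcinoma)",
    "appendicitis": r"appendicitis",
}

def note_features(note: str) -> dict:
    """Return one binary feature per concept pattern found in the note."""
    text = note.lower()
    return {name: bool(re.search(pat, text)) for name, pat in CONCEPTS.items()}

print(note_features("Pathology consistent with pancreatic adenocarcinoma."))
# {'pancreatic_cancer': True, 'appendicitis': False}
```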
Six of the ten studies used the ICD-9 or ICD-10 terminology to distinguish diagnoses. Other terminologies that are better structured and have more intricate and accurate hierarchies, such as SNOMED CT, were not used. In addition, Churpek, 2016 made no mention of where the study's data came from specifically, other than stating which EHR was used, and did not mention a terminology for coding.14

Missing data was dealt with in a variety of ways. Two studies deleted all participants who had missing values.11; 15 Three longitudinal studies imputed using values from previous time blocks, or the median of the variable if there was no prior value. One study did not mention how it dealt with missing data.13
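A small pandas sketch of the reported imputation patterns, carrying values forward from previous time blocks with a median fallback, plus the observed-value indicator used by one study (the lab and its values are hypothetical):

```python
# Imputation patterns from Table 3: last value carried forward within each
# patient, median fallback when no prior value exists, and a missingness
# indicator recording whether the value was actually observed.
import numpy as np
import pandas as pd

labs = pd.DataFrame({
    "patient":    [1, 1, 1, 2, 2],
    "time_block": [0, 1, 2, 0, 1],
    "lactate":    [2.1, np.nan, 3.0, np.nan, 1.4],
})

labs["lactate_observed"] = labs["lactate"].notna()            # indicator
labs["lactate"] = labs.groupby("patient")["lactate"].ffill()  # previous block
labs["lactate"] = labs["lactate"].fillna(labs["lactate"].median())  # fallback
print(labs)
```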
With many data points in the EHR to extract, feature selection, and the method used for it, plays an important role. Some studies had specific outcomes in mind, like the PAS score, and therefore only the PAS components were extracted.13 Others had feature selection manually curated by experts. Four studies made use of automatic feature selection and reduction techniques.9; 16-18
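The cited studies each used their own reduction techniques; as one generic example of data-driven selection (an assumption for illustration, not the method of references 9 or 16-18), an L1-penalized model can prune a wide EHR feature matrix:

```python
# One generic data-driven reduction: fit an L1-penalized logistic regression
# and keep the features with nonzero coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=800, n_features=200, n_informative=15,
                           random_state=3)

selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X, y)

kept = np.flatnonzero(selector.get_support())
print(f"kept {kept.size} of {X.shape[1]} features")
```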
The majority of studies (60%) took time into account, which is important when considering the years of data a patient accumulates in an EHR.

4. DISCUSSION AND CONCLUSION

Using machine and deep learning methods for prediction and risk models, personalized methods can provide improved accuracy and generalizability over global and local models. As EHR data continues to grow, it provides the researcher with a multitude of data particular to a population. The majority of studies did not have access to a large sample size on which to train their models, and that could have led to lower AUC and accuracy assessments on the test data. If these models are to be used in the clinical setting, they should be re-trained with more data, and a new independent test set should be used. In addition, the choice of outcome type, binary or limited multi-class, is not representative of the complexity of medicine. Therefore, algorithms may continue to become more complex and more representative of a real-world, useful medical model with multiple disease classifications, like DeepPatient.9

EHR data is littered with issues, and each paper dealt with these issues, such as missing data, in its own way, dependent on the model. In addition, only one paper made use of clinical text. Using only codified structured data could lead to a poorly trained model that is not representative of specific patients, because only certain aspects of the patient are present. Therefore, unstructured clinical notes should be used for predictive modelling.

Finally, one should always compare multiple types of models to find the best test accuracy. Multiple papers did this and found the most suitable model for the EHR data in terms of accuracy and AUC.

Despite the issues presented with EHR data, machine and deep learning algorithms can provide an efficient way to assess patterns that can be generalized to a population with high accuracy using observational data. As we learn to overcome these issues with better techniques, these algorithms will improve, providing a beneficial way to make personalized predictions.

ACKNOWLEDGMENT

This work has been supported in part by NIH NLM Training Grant T15LM012495-01.

REFERENCES

1. The Precision Medicine Initiative. 2017. [accessed 2017 October 8]. https://www.fda.gov/ScienceResearch/SpecialTopics/PrecisionMedicine/default.htm.
2. Henry J, Pylypchuk Y, Searcy T, Patel V. 2016. Adoption of electronic health record systems among US non-federal acute care hospitals: 2008-2015. The Office of the National Coordinator for Health Information Technology.
3. Goodfellow I, Bengio Y, Courville A. 2016. Deep learning. MIT Press.
4. Bishop CM. 2006. Pattern recognition and machine learning. Springer.
5. Botsis T, Hartvigsen G, Chen F, Weng C. 2010. Secondary use of EHR: data quality issues and informatics opportunities. Summit on Translational Bioinformatics. 2010:1-5.
6. Hersh WR, Weiner MG, Embi PJ, Logan JR, Payne PRO, Bernstam EV, Lehmann HP, Hripcsak G, Hartzog TH, Cimino JJ et al. 2013. Caveats for the use of operational electronic health record data in comparative effectiveness research. Medical Care. 51(8 Suppl 3):S30-S37.
7. Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JP, Clarke M, Devereaux PJ, Kleijnen J, Moher D. 2009. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. PLoS Medicine. 6(7):e1000100.
8. McKenzie J, Ryan R, Di Tanna G. 2014. Cochrane Consumers and Communication Review Group: cluster randomised controlled trials.
9. Miotto R, Li L, Kidd BA, Dudley JT. 2016. Deep Patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports. 6:26094.
10. Che C et al. 2017. An RNN architecture with dynamic temporal matching for personalized predictions of Parkinson's disease. Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM.
11. Panahiazar M, Taslimitehrani V, Pereira N, Pathak J. 2015. Using EHRs and machine learning for heart failure survival analysis. Studies in Health Technology and Informatics. 216:40.
12. Anderson JP, Parikh JR, Shenfeld DK, Ivanov V, Marks C, Church BW, Laramie JM, Mardekian J, Piper BA, Willke RJ et al. 2016. Reverse engineering and evaluation of prediction models for progression to type 2 diabetes: an application of machine learning using electronic health records. Journal of Diabetes Science and Technology. 10(1):6-18.
13. Deleger L, Brodzinski H, Zhai H, Li Q, Lingren T, Kirkendall ES, Alessandrini E, Solti I. 2013. Developing and evaluating an automated appendicitis risk stratification algorithm for pediatric patients in the emergency department. Journal of the American Medical Informatics Association. 20(e2):e212-e220.
14. Churpek MM, Yuen TC, Winslow C, Meltzer DO, Kattan MW, Edelson DP. 2016. Multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards. Critical Care Medicine. 44(2):368.
15. Gultepe E, Green JP, Nguyen H, Adams J, Albertson T, Tagkopoulos I. 2014. From vital signs to clinical outcomes for patients with sepsis: a machine learning basis for a clinical decision support system. Journal of the American Medical Informatics Association. 315-325.
16. Wu J, Roy J, Stewart WF. 2010. Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Medical Care. 48(6):S106-S113.
17. Wang Y, Wu P, Liu Y, Weng C, Zeng D. 2016. Learning optimal individualized treatment rules from electronic health record data. IEEE International Conference on Healthcare Informatics. 2016:65-71.
18. Desautels T, Calvert J, Hoffman J, Jay M, Kerem Y, Shieh L, Shimabukuro D, Chettipally U, Feldman MD, Barton C. 2016. Prediction of sepsis in the intensive care unit with minimal electronic health record data: a machine learning approach. JMIR Medical Informatics. 4(3).
Paper ID | Journal | Funding | Medical Domain
Wu, 2010 | Medical Care, June 2010 (2016 impact factor: 3.081) | None mentioned | Heart failure
Panahiazar, 2015 | Conference paper from Medinfo 2015, published under IOS Press | AHRQ | Heart failure
Wang, 2016 | Conference paper from the 2016 IEEE International Conference on Healthcare Informatics (ICHI) | NIH | Diabetes
Anderson, 2016 | Journal of Diabetes Science and Technology, 2016 (peer-reviewed, no impact factor) | Pfizer, Inc. | Diabetes/pre-diabetes
Gultepe, 2013 | Journal of the American Medical Informatics Association (2015 impact factor: 3.698) | Center for Information Technology Research in the Interest of Society (CITRIS), the National Center for Advancing Translational Sciences (NCATS), and NIH | Sepsis
Deleger, 2013 | Journal of the American Medical Informatics Association (2015 impact factor: 3.698) | Cincinnati Children's Hospital Medical Center | Abdominal pain
Churpek, 2016 | Critical Care Medicine (impact factor: 7.050) | Institutional Clinical and Translational Science Award grant, Philips HealthCare | Medical-surgical wards, predicting clinical deterioration
Desautels, 2016 | JMIR Medical Informatics | Dascena | Sepsis in the intensive care unit (ICU) population
Miotto, 2016 | Nature Scientific Reports (2016 2-year impact factor: 4.259) | None mentioned | Entire medical domain
Che, 2017 | Proceedings of the 2017 SIAM International Conference on Data Mining | Michael J. Fox Foundation for Parkinson's Research and funding partners; Chao Che is supported by NSFC No. 91546123; Fei Wang is partially supported by NSF IIS-1650723; Jiayu Zhou is supported in part by ONR N00014-14-1-0631, NSF IIS-1565596, and IIS-1615597 | Parkinson's disease

Table 1. General Paper Information
Paper ID | Outcome | Type | Sample Size | Learning Method | Assessment (best outcome) | All Method Outcomes Reported*
Wu, 2010 | Heart failure diagnosis | binary | 536 (multi-center), k-fold cross-validation | Logistic regression, SVM, boosting | AUC: 0.77, logistic regression (BIC) | yes
Panahiazar, 2015 | Survival 1, 2, and 5 years after diagnosis; classify between 2- and 5-year survival | binary | 5044 (31% train/69% test) | Logistic regression/SVM/AdaBoost/random forest (extended (EX) vs. baseline (BL) model) | AUC: 0.8, 0.81 (EX model random forest, logistic regression, 1 year) | yes
Wang, 2016 | Propensity of treatment assignment (personalized treatment); individualized decision tree so post-treatment HbA1c will be under control (8 personalized outcomes) | multi-class | 357 (1725 longitudinal) in sulfonylureas group vs. 203 (982 longitudinal) in insulin group, k-fold cross-validation | Compares reinforcement learning to outcome-weighted learning (CART) (optimal individualized treatment rule) | ITR estimate has lower HbA1c with outcome-weighted learning | yes
Anderson, 2016 | Progression to type II diabetes; progression to pre-type II diabetes | binary | 24,331 (train)/189,082 (test) | Reverse engineering and forward simulation (Bayesian scoring algorithm) | AUC: 0.78 (diabetes); AUC: 0.7 (pre-diabetes) | yes
Gultepe, 2013 | High or low serum lactate levels | binary | 741 (590 non-sepsis controls and 151 sepsis), k-fold cross-validation | Naïve Bayes (NB), Gaussian mixture model (GMM), and hidden Markov model (HMM) | Accuracy: 0.73, AUC: 0.73 (mortality); accuracy: 0.99, AUC: 1 (lactate level) | yes
Deleger, 2013 | High risk for acute appendicitis with a PAS ≥ 7; equivocal risk with a PAS of 3-6; low risk with a PAS ≤ 2 | multi-class | 2,100 (90% train/10% test), k-fold cross-validation | Conditional random fields (CRF)/rule-based AI | Compared to gold standard: average recall 0.869, precision 0.867; a baseline system using only structured components was unable to classify high-risk patients | yes
Churpek, 2016 | Composite of ward cardiac arrest, ward-to-ICU transfer, or death on the wards without attempted resuscitation | binary | 269,999 (60% train/40% test), k-fold cross-validation | Logistic regression, decision trees, k-nearest neighbors, random forest, SVM, gradient boosted machine, neural network | Random forest: AUC 0.80 [95% CI 0.80-0.80] | yes
Desautels, 2016 | Septic or not | binary | 1,840 septic ICU stays vs. 17,214 non-septic ICU stays, k-fold cross-validation | InSight: posterior probability classification | AUC: 0.8799 [SD 0.0056] | yes
Miotto, 2016 | 10 classes of disease | multi-class | 200,000 (train)/5,000 (validation)/76,214 (test) | DeepPatient, RawFeat, K-means, GMM, ICA | DeepPatient highest: AUC 0.773, accuracy 0.773, F-score 0.181 | yes
Che, 2017 | Diagnosis is either idiopathic PD (case) or "no PD nor other neurological disorder" (control) | binary | 683 (466 cases, 217 controls) (80% train/10% validation/10% test) | Recurrent neural networks, dynamic time warping, logistic regression, SVM | Personalized logistic regression and SVM had the lowest RMSE (0.65, 0.69) compared to multiclass; personalized SVM had the highest F-score (0.77) | yes

Table 2. Learning Method Paper Information
* As a measure of bias, each paper was checked to see whether an evaluation assessment was reported for each learning method used.
Paper ID | Data Sources | Unstructured/Structured Data | Terminology/Ontology | Missing Data | Feature Extraction | Temporal Distinction
Wu, 2010 | ICD-9 diagnoses, clinical notes, labs | Structured | ICD-9 | Indicator for whether value was obtained | 179 features: reduced by variable selection methods (described) | Yes
Panahiazar, 2015 | ICD-9 diagnoses, clinical notes, labs, prescriptions | Structured | ICD-9 | Removed | Cardiologists manually extracted | No
Wang, 2016 | ICD-10/ICD-9 diagnoses, labs, prescriptions | Structured | ICD-10, ICD-9 | Inverse probability weighting (IPW) (outcome); imputed using proximity in RF (feature) | Data-driven feature selection | Yes
Anderson, 2016 | ICD-9 diagnoses, prescriptions, lab values, vital signs | Structured, labs | ICD-9 | Discretized with a missing category for imputation | 442 features; manual variable selection | No
Gultepe, 2013 | Vital signs, mortality, coded diagnosis | Structured, labs | No mention | Removed | 5 features: manual variable selection | No
Deleger, 2013 | ED physician and nursing notes; codified components | Both: unstructured data (manually extracted and using cTAKES) | UMLS thesaurus | Not mentioned | 8 features: manually chosen based on the PAS score | No
Churpek, 2016 | Vital signs, laboratory values, demographic values | No description | No mention | Values from previous time blocks were used, or the median | 8 features: manually extracted | Yes
Desautels, 2016 | MIMIC-II data repository** | N/A | ICD-9, CPT | Values from previous time blocks were used | Data-driven feature selection | Yes
Miotto, 2016 | ICD-9 diagnoses/Mt. Sinai disease classification; classification of prediction of 78 diseases and 60,238 clinical descriptors | Structured | ICD-9 | Input corruption using the masking noise algorithm | Data-driven feature selection; feature learning algorithm used 704,587 patients | Yes
Che, 2017 | Parkinson's Progression Markers Initiative (PPMI) data repository** | N/A | PPMI code set | Values from previous time blocks were used | 319 features: manually extracted | Yes

Table 3. EHR Paper Information
** These papers mention how the data in the repository came from an EHR. They do not discuss how EHR data could be used directly instead of from a warehouse.
