Escolar Documentos
Profissional Documentos
Cultura Documentos
ÜBERSICHTSARBEIT
Deutsches Ärzteblatt International | Dtsch Arztebl Int 2015; 112: 137–42 137
MEDICINE
138 Deutsches Ärzteblatt International | Dtsch Arztebl Int 2015; 112: 137–42
MEDICINE
FIGURE Computational
and biostatistical
Biostatistical complexity of
complexity various big data
Method transfer analyses/prob-
lems
Marker-dependent
modification of
treatment over time
Multiple
measurements Big data
over time
Multiple sources,
e.g., levels or
databases
Computational
complexity
Selected Large number of Raw data from
biomarkers processed molecular high-throughput
markers platforms
each patient, based on the measured characteristics and show similar patterns with reference to these patient
influencing future measurements. Therefore, not only groups is desired, biclustering procedures are available
the temporal nature of the measurements but also the (20).
temporal pattern of treatment decisions have to be In contrast, “supervised” approaches have a particu-
taken into account when, for instance, comparing lar target criterion, e.g., prediction of 1-year survival
patients with one another and determining the best based on the gene expression profile of the tumor at the
treatment options. Precisely this combination forms the time of diagnosis. A crucial aspect of these procedures
foundation of personalized medicine. is the automated selection of a small number of patient
Figure 1 depicts the computational and biostatistical characteristics or, for example, gene expression
complexity of the various big data analyses in different parameters, that are suitable for prediction. Another im-
scenarios. portant distinction lies in whether and to what extent
the respective approaches are based on a statistical
Techniques model, i.e., a mathematically explicitly specified form
A characteristic feature of big data scenarios is that the of connection between the observed parameters.
accumulated data are hard to handle by means of con- Model-based approaches stem from classical statistics
ventional methods. The difficulty begins with the very (see [21] for extensions of regression models), while
first step of analysis, namely description of the data. model-free approaches are often rooted in computer
With 10 potential biomarkers, for example, a table of science (22). Prominent model-based approaches are
mean values would typically be generated, but with regularized regression procedures (23) and logic
10 000 or more possible markers such a table is no regression (24). Well-known model-free approaches in-
longer useful. Particularly in big data applications, clude random forests (22) and support vector machines
pattern recognition, i.e., the detection of relevant, (25).
potentially frequent patterns, has to be supported by Model-based approaches bear a greater resemblance
machine learning techniques that automatically recog- to the statistical methods used in clinical studies.
nize patterns and can yield a dimension reduction or However, while clinical studies are designed to quan-
preselection (17). tify the influence of a parameter—typically the effect
So-called “unsupervised” machine learning of a treatment—precisely, i.e., unbiased and with low
techniques seek to detect, for instance, the frequent variability, analysis of a large number of potential
simultaneous occurrence of certain patient character- parameters, e.g., candidate biomarkers, comes at a
istics. One example is the “bump hunting” approach, price. Namely, important markers are identified, but
which gradually refines the definition criteria of their effects can no longer be estimated without distor-
frequent groups of individuals (18). Alternatively, clus- tion (26).
tering approaches identify groups of similar patients In model-based approaches the available data are
(19). If, for example, identification of biomarkers that summarized in the shape of an estimated model, on the
Deutsches Ärzteblatt International | Dtsch Arztebl Int 2015; 112: 137–42 139
MEDICINE
140 Deutsches Ärzteblatt International | Dtsch Arztebl Int 2015; 112: 137–42
MEDICINE
provide descriptive analysis for large volumes of data, 16. Horn JDV, Toga AW: Human neuroimaging as a big data science.
Brain Imaging Behav 2013; 2: 323–31.
the traditional first step of statistical analysis.
17. James G, Witten D, Hastie T, Tibshirani R: An introduction to
● It is a special feature of medical data that procedures statistical learning. New York: Springer 2013.
for data analysis have to consider the weighting of 18. Friedman JH, Fisher NI: Bump hunting in high-dimensional data.
individual patient characteristics, e.g., age and sex in Stat Comput 1999; 9: 123–43.
relation to thousands of gene expression values. 19. Andreopoulos B, An A, Wang X, Schroeder M: A roadmap of
clustering algorithms: finding a match for a biomedical application.
● Analysis of causality from jointly acquired treatment Brief Bioinform 2009; 10: 297–314.
data and molecular markers must take account of the 20. Eren K, Deveci M, Küçüktunç O, Çatalyürek ÜVV: A comparative
temporal sequence, e.g., by adaptation of existing analysis of biclustering algorithms for gene expression data. Brief
procedures for observational data. Bioinform 2013; 14: 279–92.
● The increasing popularity of big data approaches is 21. Binder H, Porzelius C, Schumacher M: An overview of techniques
for linking high-dimensional molecular data to time-to-event
leading to wider availability of the corresponding data endpoints by risk prediction models. Biom J 2011; 53: 170–89.
analysis techniques, with beneficial consequences for 22. Breiman L: Random Forests. Mach Learn 2001; 45: 5–32.
medical science.
23. Witten DM, Tibshirani R: Survival analysis with high-dimensional
covariates. Stat Methods Med Res 2010; 19: 29–51.
Deutsches Ärzteblatt International | Dtsch Arztebl Int 2015; 112: 137–42 141
MEDICINE
24. Ruczinski I, Kooperberg C, LeBlanc M: Logic Regression. J Comput 34. Gaber MM, Zaslavsky A, Krishnaswamy S: Mining data streams: a
Graph Stat 2003; 12: 475–511. review. ACM Sigmod Record 2005; 34: 18–26.
25. Evers L, Messow CM: Sparse kernel methods for high-dimensional 35. Binder H, Schumacher M: Allowing for mandatory covariates in
survival data. Bioinformatics 2008; 24: 1632–8. boosting estimation of sparse high-dimensional survival models.
26. Porzelius C, Schumacher M, Binder H: Sparse regression tech- BMC Bioinformatics 2008; 9: 14.
niques in low-dimensional survival settings. Stat Comput 2010; 20: 36. Aalen Røysland O, Gran JM, Ledergerber B: Causality, mediation
151–63. and time: a dynamic viewpoint. J R Stat Soc A 2012; 175: 831–61.
27. Breiman L: Statistical modeling: The two cultures. Stat Sci 2001; 37. Andersen PK, Liest K: Attenuation caused by infrequently updated
16: 199–231. covariates in survival analysis. Biostatistics 2003; 4: 633–49.
28. Boulesteix ALL, Janitza S, Kruppa J, König IR: Overview of random 38. Daniel RM, Cousens SN, De Stavola BL, Kenward MG, Sterne JAC:
forest methodology and practical guidance with emphasis on Methods for dealing with time-dependent confounding. Stat Med
computational biology and bioinformatics. Wiley Interdiscip Rev 2012; 32: 1584–618.
Data Min Knowl Discov 2012; 2: 493–507. 39. Ibrahim JG, Chu H, Chen LM: Basic concepts and methods for joint
29. Kruppa J, Liu Y, Biau G, et al.: Probability estimation with machine models of longitudinal and survival data. J Clin Oncol 2010; 28:
learning methods for dichotomous and multicategory outcome: 2796–801.
theory. Biom J 2014; 56: 534–63. 40. Gran JM, Røysland K, Wolbers M, et al.: A sequential Cox approach
30. Glez-Peña D, Díaz F, Hernández JM, Corchado JM, Fdez-Riverola F: for estimating the causal effect of treatment in the presence of
geneCBR: a translational tool for multiple-microarray analysis and time-dependent confounding applied to data from the Swiss HIV
integrative information retrieval for aiding diagnosis in cancer Cohort Study. Stat Med 2010; 29: 2757–68.
research. BMC Bioinformatics 2009; 10: 187.
31. Binder H, Müller T, Schwender H, et al.: Cluster-localized sparse
logistic regression for SNP data. Statl Appl Genet Mol 2012; 11: 4. Corresponding author
32. Reich BJ, Bondell HD: A spatial dirichlet process mixture model for Prof. Dr. oec. pub. Harald Binder
clustering population genetics data. Biometrics 2010; 67: 381–90. Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI) der
Universitätsmedizin der Johannes-Gutenberg-Universität Mainz
33. Toh S, Platt R: Is size the next big thing in epidemiology? Obere Zahlbacher Straße 69, 55101 Mainz, Germany
Epidemiology 2013; 24: 349–51. binderh@uni-mainz.de
142 Deutsches Ärzteblatt International | Dtsch Arztebl Int 2015; 112: 137–42