
FORUM

Big Data in Biology and Medicine


Based on material from a joint workshop with representatives of the international
Data-Enabled Life Science Alliance, July 4, 2013, Moscow, Russia

O. P. Trifonova1*, V. A. Il’in2,3, E. V. Kolker4,5, A. V. Lisitsa1


1
Orekhovich Research Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Pogodinskaya Str. 10,
Bld. 8, Moscow, Russia, 119121
2
Scientific Research Center “Kurchatov Institute,” Academician Kurchatov Sq. 1, Moscow, Russia 123182
3
Skobel’tsyn Research Institute of Nuclear Physics, Lomonosov Moscow State University, Leninskie Gory 1, Bld. 58,
Moscow, Russia, 119992
4
DELSA Global, USA
5
Seattle Children’s Research Institute, 1900 9th Ave Seattle, WA 98101, USA
*E-mail: oxana.trifonova@gmail.com

The task of extracting new knowledge from large data sets is designated by the term “Big Data.” To put it simply, the Big Data phenomenon is when the results of your experiments cannot be imported into an Excel file. By some estimates, the volume of Twitter chats over a year is several orders of magnitude larger than the volume of a person’s memory accumulated during his/her entire life. Compared to Twitter, all the data on human genomes constitute a negligibly small amount [1]. The problem of converting data sets into knowledge, brought up by the U.S. National Institutes of Health in 2013, is the primary area of interest of the Data-Enabled Life Science Alliance (DELSA, www.delsaglobal.org) [2].

Why have the issues of computer-aided collection of Big Data created incentives for the formation of the DELSA community, which includes over 80 world-leading researchers focused on the areas of medicine, health care, and applied information science? This new trend was discussed by the participants of the workshop “Convergent Technologies: Big Data in Biology and Medicine.”

The total number of workshop participants was 35, including representatives of research institutes dealing with the analysis of large experimental data sets and of commercial companies developing information systems. The workshop participants delivered 16 short reports aimed at discussing how the manipulation of large data sets relates to the issues of medicine and health care.

The workshop was opened by Prof. Eugene Kolker, who presented a report on behalf of the Data-Enabled Life Science Alliance (DELSA, www.delsaglobal.org). The alliance supports the globalization of bioinformatics approaches in the life sciences and the establishment of scientific communities in the field of “omics.” The main idea is to accelerate the translation of the results of biomedical research to satisfy the needs of the community.

Large data sets that need to be stored, processed, and analyzed accumulate in many scientific fields besides biology; there is nothing surprising about this fact. Large data sets in the field of high-energy physics imply several dozen petabytes; in biology, this number is lower by an order of magnitude, although it also approaches the petabyte scale. The question discussed during the workshop was what Russian researchers should focus on in the Big Data world: molecular biology in the “omics” format, integrative biology in brain modeling, or the social sciences?

The tasks of working with large data sets can be subdivided into two groups: (1) data that are obtained interactively and need to be processed immediately and (2) a large body of accumulated data requiring comprehensive interpretation. The former category is typical of commercial systems such as Google, Twitter, and Facebook. Repositories of genomic and proteomic data exemplify the latter type.

Systems for handling large data arrays are being developed at the Institute for System Programming, Russian Academy of Sciences, with special attention to the poorly structured and ambiguous data that are typical of the medical and biological fields. Collections of software utilities and packages, as well as distributed programming frameworks running on clusters of several hundred to several thousand nodes, are employed to implement smart methods for data search, storage, and analysis. Projects such as Hadoop (http://hadoop.apache.org/), Data-Intensive Computing, and NoSQL are used to run searches and context mechanisms when

VOL. 5 № 3 (18) 2013 | Acta naturae | 13



[Figure: From problem to solution. The experts in the field of data processing, the Data-Enabled Life Science Alliance (DELSA), are ready to take on the NIH challenge.]

handling data sets on a number of modern websites.

Prof. Konstantin Anokhin (Scientific Research Center “Kurchatov Institute”) spoke on the fundamentally novel discipline of connectomics, which is focused on handling data sets by integrating data obtained at various organizational levels. Large bodies of data will accumulate in the field of neuroscience because of the merging of two fundamental factors. First, an enormous amount of results obtained using high-resolution analytical methods has been accumulated in the neurosciences. Second, the main concern of scientists is whole-brain functioning and how that function is projected onto the system (mind, thought, action), rather than the function of individual synapses. Obtaining data on the functioning of the brain as a system involves visualization techniques: high-resolution computed tomography, light microscopy, and electron microscopy. Megaprojects on brain simulation have already been launched (e.g., the Human Brain Project in Europe); the investments in obtaining new experimental data will be devalued with time, while the analysis of the resulting data will become the highest priority.

Extraction and interpretation of information from existing databases using novel analytical algorithms will play a key role in the science of the future. The existence of a large number of open information sources, including various databases and search systems, often impedes the search for the desired data. According to Andrey Lisitsa (Research Institute of Biomedical Chemistry, Russian Academy of Medical Sciences), existing interactomics databases coincide by no more than 55% [3]. The goal in handling large data sets is to obtain a noncontradictory picture when integrating data taken from different sources.

The concept of dynamic profiling of a person’s health, or of the state of an underlying chronic disease, using entire sets of high-throughput data without reducing the dataset to the size of diagnostic biomarker panels is being developed at the Research Center of Medical Genetics of the Russian Academy of Medical Sciences. The description of a normal human tissue requires one to integrate several thousand quantifiable variables that may be derived using genome, transcriptome, and/or proteome profiling techniques; composite, integrative measures may be used to quantify the distance that separates any two samples. However, as each human organism has both individual genetic predispositions and a history of environmental exposure, the traditional concept of an averaged norm would not be appropriate for personalized medicine applications in its true sense. Instead, Prof. Ancha Baranova introduced the concept of a multidimensional space occupied by a set of normal tissue samples and the tissue-specific centers within this space (“the ideal state of the tissue”). Diseased tissues will be located at a greater distance from the center than healthy ones. The proposed approach allows one to abandon binary (yes/no) predictions and to show the departure of a given tissue sample as a point in an easily understandable line graph that places each sample in the context of other samples collected from patients with the same condition and associated with survival and other post-hoc measures.

Prof. Vsevolod Makeev (Institute of General Genetics, Russian Academy of Sciences) asserted in his report that we will be dealing with large data sets more frequently in the near future. There will be two types of data: data pertaining to the individual genome (the 1000 Genomes Project), which are obtained once and subsequently stored in databases to be downloaded when required; and data pertaining to transcriptome or proteome analysis, which is conducted on a regular basis in order to obtain an integrative personal omics profile [4]. There are several providers of such data in the case of genomes; Russian laboratories can use these repositories and employ their own bioinformatics approaches to arrive at new results [5].

The flow of dynamic data for individuals (results of monitoring the parameters of the organism) will increase as modern analytical methods are adopted. Researchers will face the need for rapid processing of continuously obtained data and for transferring the information to repositories for further annotation and automated decision-making. There emerges the need to modify the technology of data storage and transfer to ensure a more rapid exchange of information. Cloud services for storing and transferring large sets of data already exist (e.g., Amazon S3).

The development of more rapid methods of mathematical analysis also plays a significant role. The report delivered by Ivan Oseledets (Institute of Computational Mathematics, Russian Academy of Sciences) focused on a mathematical apparatus for the compact representation of multidimensional arrays based on tensor trains (the tensor train, or TT, format). Multidimensional tasks constantly emerge in biomedical applications; the TT-format allows one to identify the key variables that are sufficient to describe the system or process under study.

Medical data need to be processed interactively so that a preliminary diagnosis can be made no later than several minutes after the data have been obtained. The “Progress” company is currently developing a system for remote monitoring of medical indicators using mobile devices and the cellular network for data transfer (Telehealth, report by Oleg Gashnikov). This method makes it possible to provide 24-hour out-of-hospital monitoring of a patient, which is expected to reduce the cost of medical services in the future. At this stage, techniques for forming alarm patterns are to be developed based on accumulated data, with the algorithms tailored to each patient.

The report on the problem of collecting and processing the geolocation data that are accumulated by mobile network operators and collected by aggregators, such as Google, Facebook, and AlterGeo, appeared at first glance to lie beyond the workshop’s topic. The lecturer, Artem Wolftrub (leading developer at Gramant Ltd.), reported that a number of papers have been published by a group led by Alex Pentland and David Lazer (Massachusetts Institute of Technology) since 2009, substantiating that the analysis of geodata can be no less informative for predicting socially important diseases than the genome is. Environmental factors (the so-called exposome) play a significant role in the pathogenesis of multigene diseases. Data regarding the exposome can be obtained in sufficient detail by analyzing a person’s relocations, by comparing the general regularities of population migrations, and by identifying the patterns that correlate with health risks (e.g., the development of cardiovascular diseases or obesity [6]).

In their discussions, the workshop participants mentioned the Watson supercomputer in various contexts. This supercomputer was designed by IBM to provide answers to questions (theoretically, any questions!) formulated in natural language. It is one of the first examples of an expert system utilizing the Big Data principle. In 2011, it was announced that the supercomputer would be used to process poorly structured data sets in order to solve medical and health care problems [7].

When analyzing the problem of Big Data in biology and medicine, one should note that these disciplines have been characterized by the accumulation of large data sets describing the results of observations since the era of natural philosophy. During the genomic era, the aim of data accumulation seemed understandable. However, once the technical aspect was solved and the genome deciphered, it turned out that the data were poorly related to the problems of health maintenance [8].

In the post-genomic era, biomedical science has returned to the level of phenomenological description oriented towards data collection only, without an understanding of the prospects for its further interpretation. The Human Proteome Project is one such example: data for each protein are collected; however, it is not always a given that these data can be used in the applied problems of in vitro diagnostics. Another example is the Human Connectome Project, which is aimed at accumulating data on signal transduction between neurons in the expectation that, once accumulated to a certain critical level, these data will allow one to simulate human brain activity using a computer.

In summary, the workshop participants noted that the Big Data phenomenon is related to the newly available capacity of modern technogenic media to generate and store data; however, there is no clear understanding as to the reason and purpose for the accumulation of such data. Russian scientists should primarily focus on analyzing Big Data so that the data array can be converted into hypotheses verifiable by a point-wise biochemical experiment. The task of getting acquainted with the data accumulated within the “Connectome” Project is bound to be the main direction of development for the Russian subgroup of DELSA.
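The distance-from-center scoring described above in Prof. Baranova’s concept can be made concrete with a minimal sketch. All data, dimensions, and variable names below are hypothetical illustrations, not taken from the workshop materials: each omics profile is z-normalized against a reference set of normal profiles, and its Euclidean distance from the centroid (“the ideal state of the tissue”) yields a single interpretable departure score instead of a binary (yes/no) prediction.

```python
import math
from statistics import mean, stdev

def distance_from_center(sample, normals):
    """Euclidean distance of one profile from the centroid of
    normal profiles, computed in per-variable z-score space so that
    variables on different scales are comparable. In z-space the
    centroid of the normals is the zero vector."""
    cols = list(zip(*normals))
    mu = [mean(c) for c in cols]
    sd = [stdev(c) or 1.0 for c in cols]   # guard against zero variance
    z = [(x - m) / s for x, m, s in zip(sample, mu, sd)]
    return math.sqrt(sum(v * v for v in z))

# Hypothetical 3-variable profiles (real profiles would have thousands).
normals = [[10.0, 5.0, 1.0], [11.0, 5.5, 0.9], [9.5, 4.8, 1.1], [10.5, 5.2, 1.0]]
healthy_like = [10.2, 5.1, 1.0]
diseased_like = [20.0, 9.0, 3.0]

# The diseased-like sample lies farther from the tissue-specific center.
assert distance_from_center(diseased_like, normals) > distance_from_center(healthy_like, normals)
```

Plotting such scores for a cohort on a single axis gives the kind of line graph mentioned above, placing each sample in the context of others with the same condition.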

REFERENCES
1. Hesla L. Particle physics tames big data // Symmetry. August 01, 2012. (http://www.symmetrymagazine.org/article/august-2012/particle-physics-tames-big-data).
2. Kolker E., Stewart E., Ozdemir V. // OMICS. 2012. V. 16. № 3. P. 138–147.
3. Lehne B., Schlitt T. // Human Genomics. 2009. № 3. P. 291–297.
4. Li-Pook-Than J., Snyder M. // Chemistry & Biology. 2013. № 20. P. 660–666.
5. Tsoy O.V., Pyatnitskiy M.A., Kazanov M.D., Gelfand M.S. // BMC Evolutionary Biology. 2012. № 12. (doi: 10.1186/1471-2148-12-200).
6. Pentland A., Lazer D., Brewer D., Heibeck T. // Studies in Health Technology and Informatics. 2009. № 149. P. 93–102.
7. Wakeman N. IBM’s Watson heads to medical school // Washington Technology. February 17, 2011. (http://washingtontechnology.com/articles/2011/02/17/ibm-watson-next-steps.aspx).
8. Bentley D.R. // Nature. 2004. V. 429. № 6990. P. 440–445.

