
Data Mining in Bioinformatics

presented by:
Ji ZhiLiang (HT006436A)
Hu YanJun (HT016111W)
Li ChangQin (HT016121U)
General Content

Presented by:
Ji ZhiLiang
What is Bioinformatics?

Bioinformatics is the computer-assisted data management discipline that helps us gather, analyze, and represent biological information in order to understand life's processes.
What is Data mining?

Data mining: the extraction of hidden predictive information from large databases.
Why need Data mining?
♦ Massive biological information in various forms
• The Human Genome Project: over 22.1 billion bases of raw sequence data, comprising overlapping fragments totaling 3.9 billion bases; many tens of thousands of genes have been identified from the genome sequence. Analysis of the current sequence shows 38,000 predicted genes confirmed by experimental evidence.
• Swiss-Prot database: more than 10,000 proteins
• PubMed: more than 12,000,000 biological abstracts available for information extraction, and the number is still growing
Why need Data mining? (cont.)
♦ Complex relationships among the
biological data
Why need Data mining? (cont.)
♦ The biopharmaceutical industry is
generating more chemical and biological
screening data than it knows how to handle.
As a result, deciding which target and lead
compound to develop further is often a long
and arduous task.
Why need Data mining? (cont.)

♦ Traditional data analysis methods are not adequate to deal with the enormous data flow.
Most common techniques in
Data mining
♦ Artificial neural networks: Non-linear predictive models that
learn through training and resemble biological neural networks in
structure.
♦ Decision trees: Tree-shaped structures that represent sets of
decisions.
♦ Genetic algorithms: Optimization techniques that use processes
such as genetic combination, mutation, and natural selection in a
design based on the concepts of evolution.
♦ Nearest neighbor method: A technique that classifies each
record in a dataset based on a combination of the classes of the k
record(s) most similar to it in a historical dataset (where k ≥ 1).
Sometimes called the k-nearest neighbor technique.
♦ Rule induction: The extraction of useful if-then rules from data
based on statistical significance.
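Of the techniques above, the nearest neighbor method is simple enough to sketch directly. The two-cluster toy data and the choice of k below are purely illustrative:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify a query point by majority vote of its k nearest
    training records (Euclidean distance). `train` is a list of
    (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify(train, (0.5, 0.5)))  # A
print(knn_classify(train, (5.5, 5.5)))  # B
```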
Data mining applications in
Bioinformatics
♦ Decision Tree: predict the possible cause of
diseases from clinical records; phylogenetic data
mining of gene sequences from public databases;
♦ Artificial neural networks: Drug safety study
and new drug development;
♦ Rule induction: Pattern in clinical pathway;
Data mining applications in
Bioinformatics (cont.)
Approaches of Data mining in
Bioinformatics
♦ Influence-based mining: complex and granular
(as opposed to linear) data in large databases are
scanned for influences between specific data sets,
and this is done along many dimensions and in
multi-table formats. These systems find
applications wherever there are significant cause-
and-effect relationships between data sets—as
occurs, for example, in large and multivariant
gene expression studies, which are behind areas
such as pharmacogenomics.
Approaches of Data mining in
Bioinformatics (cont.)
♦ Affinity-based mining: large and complex data
sets are analyzed across multiple dimensions, and
the data-mining system identifies data points or
sets that tend to be grouped together. These
systems differentiate themselves by providing
hierarchies of associations and showing any
underlying logical conditions or rules that account
for the specific groupings of data. This approach is
particularly useful in biological motif analysis,
whereby it is important to distinguish "accidental"
or incidental motifs from ones with biological
significance.
Approaches of Data mining in
Bioinformatics (cont.)
♦ Time delay data mining: the data set is
not available immediately and in complete
form, but is collected over time. The
systems designed to handle such data look
for patterns that are confirmed or rejected as
the data set increases and becomes more
robust. This approach is geared towards
long-term clinical trial analysis and
multicomponent mode of action studies, for
example.
Approaches of Data mining in
Bioinformatics (cont.)

♦ Trend-based mining: the software analyzes
large and complex data sets in terms of any
changes that occur in specific data sets over
time. The data sets can be user-defined, or
the system can uncover them itself.
Essentially, the system reports on anything
that is changing over time.
Approaches of Data mining in
Bioinformatics (cont.)
♦ Comparative data mining: it focuses on
overlaying large and complex data sets that
are similar to each other and comparing
them. This is particularly useful in all forms
of clinical trial meta analyses, where data
collected at different sites over different
time periods, and perhaps under similar but
not always identical conditions, need to be
compared. Here, the emphasis is on finding
dissimilarities, not similarities.
Approaches of Data mining in
Bioinformatics (cont.)
♦ Predictive data mining: data mining alone
is lacking somewhat if it is unable to also
offer a framework for making simulations,
predictions, and forecasts, based on the data
sets it has analyzed. It combines pattern
matching, influence relationships, time set
correlations, and dissimilarity analysis to
offer simulations of future data sets.
Case study

Richards G, Rayward-Smith VJ, Sonksen PH, Carey S and Weng C. Data Mining for Indicators of Early Mortality in a Database of Clinical Records. (2001). Artificial Intelligence in Medicine. 22: 215-231.
Case Study

Li ChangQin
HT016121U
Outline
♦ Introduction
♦ Data preparation
♦ Rule induction
♦ Rule evaluation
♦ Interpretation
Introduction
♦ How KDD can be used in bioinformatics
(an actual medical system)
♦ Medical data has increased dramatically
♦ Manual analysis is not adequate
♦ Data mining is necessary
Objective
♦ To find the relationship between the first
observations of diabetes and early death
Data preparation
♦ Classification of cases
♦ Missing data
♦ Leveling of ages of observation
Classification of cases
♦ A threshold value of 60 years of age was
chosen to discriminate between those who
have died young and those who have not
♦ ‘age of death’ <=60, Class T; ‘age of death’
>60, Class F
♦ The records of the people who have died
can be used
♦ If age>60 and not died then Class F
Classification of cases

♦ Category C is discarded
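The classification scheme above can be sketched in a few lines; the function and field names (`age`, `died`) are hypothetical stand-ins for the actual record fields:

```python
def classify_record(age, died):
    """Assign a mortality class per the scheme above:
    T = died at or before 60, F = survived past 60,
    C = alive and not yet past 60 (undetermined, discarded)."""
    if died:
        return "T" if age <= 60 else "F"
    return "F" if age > 60 else "C"

print(classify_record(55, died=True))   # T
print(classify_record(70, died=True))   # F
print(classify_record(65, died=False))  # F
print(classify_record(45, died=False))  # C (discarded)
```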
Result of Classification

♦ The result of this stage was a file (D60) of 3971 records, randomly split into a training set (2/3, D60_train) and a testing set (1/3, D60_test)
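The random 2/3–1/3 split can be sketched as follows; the record IDs and seed are illustrative, not taken from the paper:

```python
import random

def split_dataset(records, train_frac=2/3, seed=0):
    """Randomly split records into training and testing sets,
    as done to produce D60_train (2/3) and D60_test (1/3)."""
    rng = random.Random(seed)
    shuffled = records[:]          # copy so the original is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

d60 = list(range(3971))            # stand-in IDs for the 3971 records
train, test = split_dataset(d60)
print(len(train), len(test))       # 2647 1324
```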
Missing data
♦ Most records and most attributes have a
significant proportion of blank fields
♦ Solutions:
a. discard all attributes or all records with
missing data
b. estimate the missing values by reference
to existing values
Example

        Attri1   Attri2   …
Rec1    1        2
Rec2    2        (missing)
…       2        3
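Option (b) above, estimating missing values by reference to existing values, can be sketched with a simple column-mean fill. This is one common strategy, not necessarily the one the paper's software uses; `None` marks a blank field:

```python
def impute_column_means(rows):
    """Fill missing values (None) with the column mean of the
    observed entries in that column."""
    ncols = len(rows[0])
    means = []
    for j in range(ncols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(ncols)]
            for r in rows]

rows = [[1, 2], [2, None], [None, 3]]   # the example table above
print(impute_column_means(rows))        # [[1, 2], [2, 2.5], [1.5, 3]]
```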
Leveling of ages of observation
♦ Those who die young (<=60) will tend to
have been examined at a younger age than
those who do not; this introduces a bias,
making the classes not directly comparable
♦ Three methods to solve this problem
Methods for leveling
♦ Deletion
♦ Duplication
♦ Adjustment
♦ The leveled training sets were saved as files
D60_del, D60_dup, D60_adj
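The first of the three methods, deletion, can be sketched as follows; the band and class field names are hypothetical, and the real software's banding scheme may differ:

```python
from collections import Counter
import random

def level_by_deletion(records, band_key, class_key, seed=0):
    """Deletion-based leveling: within each band of `band_key`
    (e.g. age at observation), randomly delete records from the
    over-represented class until both classes have equal counts."""
    rng = random.Random(seed)
    bands = {}
    for r in records:
        bands.setdefault(r[band_key], []).append(r)
    out = []
    for band in bands.values():
        by_cls = {}
        for r in band:
            by_cls.setdefault(r[class_key], []).append(r)
        if len(by_cls) < 2:
            continue  # only one class in this band: drop it entirely
        m = min(len(v) for v in by_cls.values())
        for v in by_cls.values():
            out.extend(rng.sample(v, m))
    return out

# Toy example: one age band with 3 T records and 1 F record
recs = ([{"age_band": 40, "cls": "T"}] * 3
        + [{"age_band": 40, "cls": "F"}])
leveled = level_by_deletion(recs, "age_band", "cls")
print(Counter(r["cls"] for r in leveled))  # equal counts per class
```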
Results for leveling
Rule induction
♦ Using a simulated annealing algorithm
♦ The following sets were defined

A = {r | condition of rule r is satisfied}
B = {r | conclusion of rule r is satisfied}
C = {r | condition and conclusion of rule r are satisfied}
♦ a=|A|, b=|B|, c=|C|
♦ Accuracy (confidence) = c/a; Coverage (sensitivity) = c/b
♦ Maximize c, minimize (a-c) and minimize (b-c)
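The counts and scores above translate directly into code; the records and the nerve-damage rule below are invented for illustration:

```python
def rule_metrics(records, condition, conclusion):
    """Compute a = |A|, b = |B|, c = |C| and the derived scores:
    accuracy (confidence) = c/a, coverage (sensitivity) = c/b."""
    a = sum(1 for r in records if condition(r))
    b = sum(1 for r in records if conclusion(r))
    c = sum(1 for r in records if condition(r) and conclusion(r))
    return {"a": a, "b": b, "c": c,
            "accuracy": c / a, "coverage": c / b}

# Hypothetical rule: "nerve damage implies Class T"
records = [{"nerve_damage": True,  "cls": "T"},
           {"nerve_damage": True,  "cls": "T"},
           {"nerve_damage": True,  "cls": "F"},
           {"nerve_damage": False, "cls": "T"},
           {"nerve_damage": False, "cls": "F"}]
m = rule_metrics(records,
                 condition=lambda r: r["nerve_damage"],
                 conclusion=lambda r: r["cls"] == "T")
print(m)  # a=3, b=3, c=2; accuracy and coverage both 2/3
```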
Rule evaluation on training data
♦ Based on the training data, rules not significant
at the 5% level were rejected
♦ Rules with coverage less than 5% were also
rejected
♦ Then 9 of the 41 rules were rejected
Rule evaluation on testing data
♦ Based on the testing data, rules not significant
at the 5% level were rejected
♦ 22 of the remaining 32 rules were then rejected,
leaving only 10 rules for final use
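The significance filter can be sketched as a 2x2 chi-squared test built from the rule counts a, b, c defined earlier (assuming this is how the test was set up; the example counts are invented):

```python
def chi_squared_rule(a, b, c, n):
    """Chi-squared statistic (1 d.f.) for the 2x2 table derived from
    a rule's counts: a = |condition|, b = |conclusion|, c = |both|,
    n = total records."""
    obs = [[c, a - c],              # condition true:  conclusion yes / no
           [b - c, n - a - b + c]]  # condition false: conclusion yes / no
    rows = [sum(r) for r in obs]
    cols = [obs[0][0] + obs[1][0], obs[0][1] + obs[1][1]]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (obs[i][j] - expected) ** 2 / expected
    return chi2

# A rule fails the 5% significance test when chi2 < 3.84 (1 d.f.)
print(chi_squared_rule(a=100, b=120, c=80, n=400))  # ~158.7, significant
```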
Interpretation
♦ Rules indicate that nerve damage is the
most important indicator of early death

♦ A recent medical study confirmed the rules mined by KDD
Conclusion

Hu YanJun
Conclusion on Bio-info DM
♦ First-visit observations and early mortality
are significantly associated
♦ KDD methods proved useful in the above study
♦ To determine whether the findings are general,
statistical analysis is required to confirm that
the association is a "real" association
KDD process
♦ KDD (Knowledge Discovery in Databases)
techniques are being developed to identify
patterns within the data that can be exploited
♦ The following are the corresponding techniques
that can be used in KDD
Data preparation
♦ The pre-processing stage
– Data not clean
– Too many attributes
– Discrete data needed
– Missing data-can not delete easily!
Data preparation (cont)
♦ Usually the original data is derived from a
relational database
♦ It is reformatted into a spreadsheet-style
format suitable for data mining
Data preparation (cont)
♦ This extensive pre-processing work is
typical for clinical records
– Data collected for each patient vary
considerably
– Some unreliability
– Some patients have long-term records while
other patients have only a single visit
Leveling
♦ Some attributes may introduce a bias that
influences the whole training data
♦ The training set was adjusted so that the
distribution of one specific attribute is
almost the same across classes
Leveling (cont)
♦ The leveling objective in this case is to
eliminate the effect of age of observation
from the training data
♦ Stripping unnecessary attributes, only
important attributes left, such as gender,
marital status …
FSS (Feature Subset Selection)
♦ Applied in order to compensate for missing values
♦ Uses an entropy-based information measure
♦ Not useful in all cases, especially when the
distinction between FSS and rule induction
becomes blurred
Data mining
♦ The interface provides users with extensive
control of the simulated annealing algorithm
♦ The algorithm uses a heuristic search
strategy to find the "best" rule
♦ Users can control various parameters in
order to optimize the search
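A minimal sketch of such a search, assuming a single numeric threshold as the rule parameter (the real system searches a far richer rule space, and all data and tuning values below are invented):

```python
import math
import random

def anneal_threshold(data, fitness, t0=50.0, temp=10.0, cooling=0.95,
                     steps=500, seed=0):
    """Simulated-annealing search for a rule threshold t that
    maximizes `fitness`. Worse moves are accepted with probability
    exp(delta / temp), letting the search escape local optima; the
    temperature decays geometrically each step."""
    rng = random.Random(seed)
    t = best_t = t0
    f = best_f = fitness(data, t0)
    for _ in range(steps):
        cand = t + rng.uniform(-5, 5)          # perturb the threshold
        fc = fitness(data, cand)
        delta = fc - f
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            t, f = cand, fc
            if f > best_f:
                best_t, best_f = t, f
        temp *= cooling
    return best_t, best_f

# Fitness from the earlier slide: maximize c, minimize (a-c) and (b-c),
# for the hypothetical rule "value <= t implies Class T".
data = [(v, "T") for v in range(30, 60)] + [(v, "F") for v in range(60, 90)]

def fitness(data, t):
    a = sum(1 for v, _ in data if v <= t)                    # |A|
    b = sum(1 for _, cls in data if cls == "T")              # |B|
    c = sum(1 for v, cls in data if v <= t and cls == "T")   # |C|
    return c - (a - c) - (b - c)

best_t, best_f = anneal_threshold(data, fitness)
print(best_t, best_f)
```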
Data mining (cont)
♦ Comprehensive pre-processing facilities are
included
♦ The generated rules were simple to
understand
– In the medical domain the primary objective was
explanation rather than prediction
Data mining (cont)
♦ This data mining software can efficiently
handle the missing values
– Medical databases typically have a high
proportion of missing values
– Many of the discovered rules were based on
attributes with 50% missing values
Rule evaluation
♦ Some rules have low levels of accuracy and
coverage
– They may be produced by chance

♦ The χ² test was a useful method


Rule evaluation (cont)
♦ Needs further investigation
– A very large number of rules are generated and
evaluated
♦ Further work
– Incorporating χ² into the fitness function of the
simulated annealing algorithm.
End

♦ Questions and Answers
