presented by:
Ji ZhiLiang (HT006436A)
Hu YanJun (HT016111W)
Li ChangQin (HT016121U)
General Content
Presented by:
Ji ZhiLiang
What is Bioinformatics?
Li changqing
HT016121U
Outline
♦ Introduction
♦ Data preparation
♦ Rule induction
♦ Rule evaluation
♦ Interpretation
Introduction
♦ How KDD can be used in bioinformatics
(an actual medical system)
♦ Medical data has increased dramatically
♦ Manual analysis is not adequate
♦ Data mining is necessary
Objective
♦ To find the relationship between the first
observations of diabetes and early death
Data preparation
♦ Classification of cases
♦ Missing data
♦ Leveling of ages of observation
Classification of cases
♦ A threshold value of 60 years of age was
chosen to discriminate between those who
have died young and those who have not
♦ ‘age of death’ <=60, Class T; ‘age of death’
>60, Class F
♦ The records of people who have died
can be used
♦ If age >60 and the patient has not died, then Class F
Classification of cases (cont)
♦ Category C is discarded
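The classification rules above can be sketched in a few lines of Python. This is only an illustration: the function name `classify` and the record layout are hypothetical, and since the slides do not define Category C, the sketch assumes it means living patients aged 60 or under, whose outcome is not yet known.

```python
from typing import Optional

THRESHOLD = 60  # years, the cut-off stated in the slides


def classify(age: int, died: bool) -> Optional[str]:
    """Return 'T', 'F', or None for a record that would be discarded.

    `age` is the age at death for deceased patients and the current age
    otherwise (an assumption about how the data is stored).
    """
    if died:
        return "T" if age <= THRESHOLD else "F"
    if age > THRESHOLD:
        return "F"  # alive and already past the threshold
    return None     # outcome unknown: assumed to be Category C


# Made-up records: (age, died)
records = [(55, True), (72, True), (65, False), (40, False)]
labels = [classify(age, died) for age, died in records]
```

Records labelled `None` would be dropped before rule induction.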
Result of Classification
Attri1 Attri2 …
Rec1 1 2
Rec2 2
… 2 3
Leveling of ages of observation
♦ Those who die young (<=60) tend to have
been examined at a younger age than those
who do not, so a bias occurs and the two
classes are not directly comparable
♦ Three methods to solve this problem
Methods for leveling
♦ Deletion
♦ Duplication
♦ Adjustment
♦ The leveled training sets were saved as files
D60_del, D60_dup, D60_adj
Results for leveling
Rule induction
♦ Using simulated annealing algorithm
♦ The following sets were defined
Hu YanJun
Conclusion on Bio-info DM
♦ First-visit observations and early mortality
show a significant association
♦ KDD methods are useful in this kind of study
♦ To find out whether the findings are general,
statistical analysis is required to confirm that the
association is a “real” association
KDD process
♦ KDD (Knowledge Discovery in Databases)
methods are being developed to identify patterns
within the data that can be exploited
♦ The following are corresponding techniques
that can be used in KDD
Data preparation
♦ The pre-processing stage
– Data is not clean
– Too many attributes
– Discrete data needed
– Missing data cannot simply be deleted!
Data preparation (cont)
♦ Usually the original data is derived from a
relational database
♦ The data is reformatted into a spreadsheet-style
layout suitable for data mining
Data preparation (cont)
♦ This extensive pre-processing work is
typical for clinical records
– Data collected for each patient vary
considerably
– Some records are unreliable
– Some patients have records covering a long period while
other patients only have a single visit
Leveling
♦ Some attributes may be biased and so may
influence the whole training set
♦ The training set is adjusted so that the
distribution of one specific attribute is
almost the same in each class
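One plausible reading of leveling by deletion is sketched below. The slides do not spell out the procedure, so everything here is an assumption: the function name `level_by_deletion`, the 10-year age bands, and the idea of randomly dropping surplus records so that each band holds the same number of Class T and Class F cases.

```python
import random
from collections import defaultdict


def level_by_deletion(records, band_width=10, seed=0):
    """Balance classes within each age band by deleting surplus records.

    `records` is a list of (age_at_observation, label) pairs, where the
    label is 'T' or 'F'. Band width and random deletion are assumptions.
    """
    rng = random.Random(seed)
    bands = defaultdict(lambda: {"T": [], "F": []})
    for age, label in records:
        bands[age // band_width][label].append((age, label))

    leveled = []
    for band in sorted(bands):
        groups = bands[band]
        # Keep as many records per class as the smaller class has.
        keep = min(len(groups["T"]), len(groups["F"]))
        for label in ("T", "F"):
            leveled.extend(rng.sample(groups[label], keep))
    return leveled


# Made-up data: younger observations are over-represented in Class T.
data = [(42, "T"), (45, "T"), (47, "T"), (44, "F"),
        (63, "F"), (66, "F"), (61, "T")]
leveled = level_by_deletion(data)
```

After this step the age-of-observation distribution is the same for both classes at band resolution, at the cost of discarding data.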
Leveling (cont)
♦ The objective of leveling in this case is to
eliminate the effect of age of observation
from the training data
♦ Unnecessary attributes are stripped, leaving
only important attributes such as gender,
marital status …
FSS (Feature Subset Selection)
♦ Applied in order to compensate for missing
values
♦ Uses an entropy-based information measure
♦ Not useful in all cases, especially when the
distinction between FSS and rule induction
becomes blurred
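As an illustration of an entropy-based measure, the sketch below computes the information gain of a discrete attribute and simply skips records where that attribute is missing. The slides do not say which measure or which missing-value treatment was used, so both choices are assumptions, and the attribute values are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Gain of one discrete attribute; records with value None are skipped."""
    pairs = [(v, y) for v, y in zip(values, labels) if v is not None]
    if not pairs:
        return 0.0
    known_labels = [y for _, y in pairs]
    remainder = 0.0
    for value in {v for v, _ in pairs}:
        subset = [y for v, y in pairs if v == value]
        remainder += len(subset) / len(pairs) * entropy(subset)
    return entropy(known_labels) - remainder


# Toy data: None marks a missing value.
gender = ["m", "f", "m", None, "f", "m"]
marital = ["single", "single", "married", "married", None, "married"]
label = ["T", "T", "F", "F", "T", "F"]

gain_gender = information_gain(gender, label)
gain_marital = information_gain(marital, label)
```

Attributes would then be ranked by gain and only the highest-scoring subset passed on to rule induction.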
Data mining
♦ The interface provides users with extensive
control of the simulated annealing algorithm
♦ The algorithm uses a heuristic search
strategy to find the “best” rule
♦ Users can control various parameters in
order to optimize the search
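A minimal sketch of a simulated annealing search for a rule is given below. It is not the software described in the slides: the rule form (a single `attribute == value` test predicting Class T), the fitness function, and the parameters `start_temp`, `cooling` and `steps` are all hypothetical stand-ins for the controls a user might tune.

```python
import math
import random


def fitness(rule, records):
    """Correctly covered minus wrongly covered records for attr == value -> 'T'."""
    attr, value = rule
    score = 0
    for attrs, label in records:
        if attrs[attr] == value:
            score += 1 if label == "T" else -1
    return score


def neighbour(rule, domains, rng):
    """Pick a new value, and half of the time re-draw the attribute too."""
    attr, _ = rule
    if rng.random() < 0.5:
        attr = rng.randrange(len(domains))
    return (attr, rng.choice(domains[attr]))


def anneal(records, domains, start_temp=2.0, cooling=0.95, steps=500, seed=0):
    rng = random.Random(seed)
    current = (0, domains[0][0])
    best = current
    temp = start_temp
    for _ in range(steps):
        candidate = neighbour(current, domains, rng)
        delta = fitness(candidate, records) - fitness(current, records)
        # Accept improvements always, worse rules with probability exp(delta / temp).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = candidate
        if fitness(current, records) > fitness(best, records):
            best = current
        temp *= cooling  # geometric cooling schedule
    return best


# Made-up discrete data: attribute 0 is gender, attribute 1 is a test level.
domains = [["m", "f"], ["low", "high"]]
records = [(("m", "high"), "T"), (("f", "high"), "T"), (("m", "low"), "F"),
           (("f", "low"), "F"), (("m", "high"), "T"), (("f", "low"), "F")]
best_rule = anneal(records, domains)
```

A higher starting temperature or slower cooling lets the search accept more bad moves early on, which helps it escape poor rules at the cost of more evaluations.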
Data mining (cont)
♦ Comprehensive pre-processing facilities are
included
♦ The generated rules were simple to
understand
– In the medical domain the primary objective was
explanation rather than prediction
Data mining (cont)
♦ This data mining software can efficiently
handle missing values
– Medical databases typically have a high
proportion of missing values
– Many of the discovered rules were based on
attributes with 50% missing values
Rule evaluation
♦ Some rules have low levels of accuracy and
coverage
– They may be produced by chance
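To make accuracy and coverage concrete, the sketch below scores one rule on a handful of made-up records. The slides do not define the two measures, so the definitions here are assumptions: accuracy is the share of matched records that belong to the target class, and coverage is the share of target-class records that the rule matches. A missing value is treated as not matching.

```python
def evaluate(rule, records, target="T"):
    """Return (accuracy, coverage) of `rule` for the target class.

    `rule` is a function that takes a record's attribute dict and
    returns True when the rule's condition holds.
    """
    matched = [label for attrs, label in records if rule(attrs)]
    in_class = sum(1 for _, label in records if label == target)
    hits = sum(1 for label in matched if label == target)
    accuracy = hits / len(matched) if matched else 0.0
    coverage = hits / in_class if in_class else 0.0
    return accuracy, coverage


# Made-up records; None marks a missing value.
records = [
    ({"marital": "single"}, "T"),
    ({"marital": "single"}, "F"),
    ({"marital": "married"}, "T"),
    ({"marital": None}, "T"),
    ({"marital": "single"}, "T"),
]


def is_single(attrs):
    # A missing value compares unequal, so the record is not matched.
    return attrs.get("marital") == "single"


accuracy, coverage = evaluate(is_single, records)
```

A rule that scores low on both measures over a small sample is the kind that could easily arise by chance and would need further checking.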