Microarrays: Experimental Protocol
[Figure: cells of interest and a reference sample; strand pairing (5' to 3') during hybridization on the array.]
Copyright Russ B. Altman
Affymetrix chip: compare two types of chips
[Figure: the two chip types side by side.]

Matrix of Expression
[Figure: rows are genes (Gene 1, Gene 2, …), columns are experiments/conditions (E1, E2, …); one column is added per experiment.]
Reorder Rows for Clustering
[Figure: the expression matrix (genes by conditions E1 E2 E3) before and after its rows are reordered so that genes with similar patterns sit next to each other.]
Cluster Analysis Result
[Figure: clustered expression matrix.]

Methods for Clustering
• Hierarchical Clustering
• K-means
• Trillions of others.
Hierarchical Clustering
• Easy to understand & implement
• Can decide how big to make clusters by choosing the “cut” level of the hierarchy
• Can be sensitive to bad data
• Can have problems interpreting the tree
• Can have local minima

Builds trees from cluster analysis, grouping genes by common patterns of expression. Most commonly used method for microarray data.
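The agglomerative idea behind these bullets can be sketched in a few lines. This pure-Python single-linkage toy is illustrative, not the lecture's implementation; the expression values are made up, and the target cluster count stands in for choosing the "cut" level.

```python
# Single-linkage agglomerative clustering: start with every gene in
# its own cluster and repeatedly merge the two clusters whose closest
# members are nearest, until the desired number of clusters remains.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(points, n_clusters):
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

expression = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.9]]
print(single_linkage(expression, 2))  # genes 0,1 group; genes 2,3 group
```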
Graphical Representation
Two features: f1 (x-coordinate) and f2 (y-coordinate)

K-means
(Computationally attractive)
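A minimal Lloyd's-algorithm k-means sketch on the two features f1 (x) and f2 (y). The data points and the first-k initialization are illustrative choices, not from the lecture.

```python
# Lloyd's k-means: alternate between assigning each point to its
# nearest center and moving each center to the mean of its group.

def kmeans(points, k, iters=20):
    centers = [list(p) for p in points[:k]]  # naive init: first k points
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        # assignment step: each point joins its nearest center
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            groups[i].append(p)
        # update step: move each center to the mean of its group
        for gi, g in enumerate(groups):
            if g:
                centers[gi] = [sum(vals) / len(g) for vals in zip(*g)]
    return centers, groups

points = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.5, 5.5)]
centers, groups = kmeans(points, 2)
print(sorted(len(g) for g in groups))  # the two tight pairs: [2, 2]
```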
Clustering vs. Classification
Two features: f1 (x-coordinate) and f2 (y-coordinate). Clusters are found first; then external labels are applied for classification, so the RED group and BLUE group are now labeled.
[Figure: the labeled two-feature plot, with thresholds A on the x-axis and B on the y-axis.]
Classification uses previous knowledge, so it can detect weaker signal, but may be biased by WRONG previous knowledge. Methods:
• Naïve Bayes
• Decision Trees
• Support Vector Machines
• Linear Model

Linear Model
Each gene, g, has a list of n measurements, one per condition: [f1 f2 f3 … fn].
PREDICT RED if high value for A and low value for B (high weight on x coordinate, negative weight on y).
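The slide's linear rule can be written out directly: a positive weight on f1 and a negative weight on f2. The weights and threshold below are illustrative, not fitted values.

```python
# Linear model: PREDICT RED when the weighted sum of the feature
# vector [f1..fn] is high (positive weight on f1, negative on f2).

def linear_score(f, w):
    return sum(wi * fi for wi, fi in zip(w, f))  # weighted sum

def predict(f, w=(1.0, -1.0), threshold=0.0):
    return "RED" if linear_score(f, w) > threshold else "BLUE"

print(predict((2.0, 0.5)))  # high f1, low f2 -> RED
print(predict((0.5, 2.0)))  # low f1, high f2 -> BLUE
```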
Logistic Regression
p = probability of being in the group of interest
f = vector of expression measurements

log(p/(1-p)) = β·f
or
p = e^(β·f) / (1 + e^(β·f))

Classifying Lymphomas

p(f) = probability of the data
p(f|group 1) = probability of the data given that the gene is in group 1
p(group 1) = probability of group 1 for a given gene (the prior)

Can just multiply these probabilities (or add their logs), which are easy to compute by counting up frequencies in the set of “known” members of group 1. Choose a cutoff probability for saying “Group 1 member.”
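The logistic model log(p/(1-p)) = β·f is easy to evaluate directly. The weight vector `beta` below is an illustrative choice, not one fitted to real lymphoma data.

```python
import math

# Logistic regression back-transform: p = e^(β·f) / (1 + e^(β·f)).

def logistic_p(beta, f):
    z = sum(b * x for b, x in zip(beta, f))   # β·f
    return math.exp(z) / (1.0 + math.exp(z))  # map to a probability

beta = (1.0, -0.5)
print(logistic_p(beta, (0.0, 0.0)))  # 0.5: β·f = 0 sits on the boundary
```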
Naïve Bayes
If P(Red | x = A) * P(Red | y = 0) is HIGH, assign to RED.
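The multiply-the-frequencies recipe from the slides can be sketched with toy discrete features. The example data, labels, and add-one smoothing are my additions.

```python
from collections import defaultdict

# Naïve Bayes by counting: estimate per-feature conditional
# frequencies in the "known" members of each group, then multiply
# them with the prior and compare scores across groups.

def train(examples):
    # examples: list of (feature_tuple, label)
    counts = defaultdict(lambda: defaultdict(int))
    label_counts = defaultdict(int)
    for feats, label in examples:
        label_counts[label] += 1
        for i, v in enumerate(feats):
            counts[label][(i, v)] += 1
    return counts, label_counts

def score(feats, label, counts, label_counts, total):
    # prior * product of per-feature conditionals (add-one smoothing)
    s = label_counts[label] / total
    for i, v in enumerate(feats):
        s *= (counts[label][(i, v)] + 1) / (label_counts[label] + 2)
    return s

examples = [(("high", "low"), "RED"), (("high", "low"), "RED"),
            (("low", "high"), "BLUE"), (("low", "high"), "BLUE")]
counts, label_counts = train(examples)
red = score(("high", "low"), "RED", counts, label_counts, len(examples))
blue = score(("high", "low"), "BLUE", counts, label_counts, len(examples))
print("RED" if red > blue else "BLUE")  # RED: its frequencies are higher
```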
Decision Trees
If x < A and y > B => BLUE
If y < B, or (y > B and x > A) => RED
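These rules translate directly into code as a tiny tree; the threshold values of A and B below are illustrative, not from the lecture.

```python
# The slide's decision-tree rules written out directly.

A, B = 1.0, 1.0

def tree_predict(x, y):
    if x < A and y > B:
        return "BLUE"
    # all remaining regions (y < B, or y > B with x > A) are RED
    return "RED"

print(tree_predict(0.5, 2.0))  # BLUE
print(tree_predict(2.0, 0.5))  # RED
```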
Support Vector Machines
Draw a line that passes close to the members of the two different groups that are the most difficult to distinguish.
[Figure: support vectors and decision line. Left panel: one point left out. Right panel: the bad point put back in; the boundary line can be penalized for bad predictions, with the PENALTY based on distance from the line.]
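The penalize-by-distance idea is the soft margin. Below is a minimal soft-margin linear SVM trained by sub-gradient descent on the hinge loss; the data, learning rate, and regularization strength are illustrative, and a real analysis would use a library implementation.

```python
# Soft-margin linear SVM via hinge-loss sub-gradient descent: points
# on the wrong side of (or inside) the margin incur a penalty
# proportional to how far they sit from the decision line.

def svm_train(points, labels, lr=0.1, lam=0.01, epochs=200):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):  # y is +1 or -1
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:  # violates the margin: hinge penalty applies
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:           # safely classified: only shrink the weights
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

points = [(1.0, 1.0), (1.0, 2.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, 1, 1]
w, b = svm_train(points, labels)
preds = [1 if w[0] * x + w[1] * y + b > 0 else -1 for x, y in points]
print(preds)
```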
[Figure: normalized RMS error (0.15 to 0.25) vs. percent of entries missing (0 to 20) for SVDimpute and KNNimpute. Panels: the complete data set; the data set with 30% of entries missing (missing values appear black); the data set with missing values estimated by the KNNimpute algorithm.]
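The KNNimpute idea can be sketched as: fill each missing entry with the average of that column's values in the K most similar genes, where similarity is Euclidean distance over the columns both genes have observed. The matrix, the `None` missing-value convention, and k=2 below are toy choices.

```python
# KNNimpute sketch: for a gene with a hole, rank the other genes by
# distance over shared observed columns and average the k nearest
# genes' values in the missing column.

def knn_impute(matrix, k=2):
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            if v is not None:
                continue
            candidates = []
            for i2, other in enumerate(matrix):
                if i2 == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    d = sum((a - b) ** 2 for a, b in shared) ** 0.5
                    candidates.append((d, other[j]))
            candidates.sort()
            nearest = candidates[:k]
            filled[i][j] = sum(val for _, val in nearest) / len(nearest)
    return filled

expr = [[1.0, 2.0, 3.0],
        [1.1, 2.1, 3.1],
        [1.0, None, 3.0],
        [9.0, 9.0, 9.0]]
print(knn_impute(expr)[2][1])  # average of the two nearest genes' values
```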
Choosing best “indicator” genes
Which gene or genes “predict” the class the best? Such a gene might be a good candidate for a biomarker.

Principal Component Analysis of Conditions in Microarray Experiment
[Figure: principal components of the conditions, compared with the clusters reported in the original literature; one projection is less kurtotic.]
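The first principal component can be found with power iteration on the covariance matrix; this pure-Python sketch is a minimal stand-in for a full PCA of the conditions, and the 2-D data are illustrative.

```python
# First principal component via power iteration: center the data,
# form the covariance matrix, and repeatedly multiply a vector by it;
# the vector converges to the top eigenvector (the first PC).

def first_pc(data, iters=100):
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    centered = [[x - m for x, m in zip(row, means)] for row in data]
    dim = len(means)
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    v = [1.0] * dim
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]  # renormalize each step
    return v

data = [[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.1]]
pc = first_pc(data)
print(pc)  # both components near 0.7: the diagonal is the main axis
```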
“Interesting” Projections

[Figure: the 500 bp immediately 5' (upstream) of the ORF is extracted and fed into a MOTIF FINDING PROGRAM; the remainder of the sequence is not used.]
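Only the extraction step of that pipeline is sketched here: take the 500 bp immediately upstream (5') of each ORF start, then hand those windows to a motif finder. The toy genome and coordinates are illustrative assumptions.

```python
# Extract the 500 bp upstream window for each forward-strand ORF
# start; windows that would run off the chromosome start are clipped.

def upstream_regions(genome, orf_starts, length=500):
    return [genome[max(0, s - length):s] for s in orf_starts]

genome = "A" * 1000 + "CGT" * 400
regions = upstream_regions(genome, [600, 1200])
print([len(r) for r in regions])  # [500, 500]
```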