
Microarray data analysis: clustering and classification methods

Russ B. Altman
BMI 214 / CS 274

Microarrays: DNA Base Pairing

[Figure: two antiparallel DNA strands (5'→3' against 3'→5') with complementary bases paired: A with T, G with C.]


Microarrays: Experimental Protocol

[Figure: known DNA sequences are spotted on a glass slide (a typical DNA array for yeast); mRNA is isolated from the cells of interest and from a reference sample, then hybridized to the array.]

Affymetrix chip technology

Instead of putting down intact genes on the chip, these chips put down N-mers of a certain length (around 20) systematically, by synthesizing the N-mers directly on the spots.

Labelled mRNA is then added to the chip, and a *pattern* of binding (based on which 20-mers are in the mRNA sequence) is seen.

Bioinformatics is used to deduce the mRNA sequences that are present.

Affymetrix fabrication

[Figure: fabrication of an Affymetrix chip.]


Affymetrix chip

[Figure: an Affymetrix chip.]

Compare two types of chips

[Figure: the two chip types side by side. REMEMBER: control/reference DNA is also competing for hybridization (shown in green).]



Reproducibility of data sets

• Preparation of mRNA and experimental design
• Hybridization of RNA to DNA
  – Sequence-specific effects
  – Length-related effects
• Quality of spotted genes on array
  – Proper sequence spotted evenly
• Finding and digitizing spot intensities
• Comparability of experiments on same chip, experiments on different chips

What are expression arrays good for?

• Follow a population of (synchronized) cells over time, to see how expression changes (vs. baseline).
• Expose cells to different external stimuli and measure their response (vs. baseline).
• Take cancer cells (or other pathology) and compare to normal cells.
• (Also some non-expression uses, such as assessing presence/absence of sequences in the genome)

Matrix of Expression

[Figure: the expression matrix grows one column per experiment. Rows are Gene 1 … Gene N; columns are Experiment/Conditions E1, then E1 E2, and so on.]

Matrix of Expression / Reorder Rows for Clustering

[Figure: the full N-gene × 3-experiment matrix (E1 E2 E3); clustering reorders the rows so that genes with similar expression patterns across E1–E3 sit next to each other.]

Why do we care about clustering expression data?

If two genes are expressed in the same way, they may be functionally related.

If a gene has unknown function, but clusters with genes of known function, this is a way to assign its general function.

We may be able to look at high-resolution measurements of expression and figure out which genes control which other genes.

E.g., if the peak in cluster 1 always precedes the peak in cluster 2, perhaps cluster 1 turns cluster 2 on?

Average of clustered wave forms

[Figure: typical "wave forms" observed as cluster averages (note: not lots of bumps).]


Cluster Analysis Result

[Figure: result of a cluster analysis.]

Methods for Clustering

• Hierarchical Clustering
• Self Organizing Maps
• K-means
• Trillions of others.


Need a distance metric for two n-dimensional vectors (e.g., for n expression measurements)

1. Euclidean distance:
   D(X, Y) = sqrt[ (x1-y1)^2 + (x2-y2)^2 + … + (xn-yn)^2 ]

2. Correlation coefficient:
   R(X, Y) = cov(X, Y) / (sd(X) sd(Y))
           = 1/n * SUM[ (xi - x̄)/σx * (yi - ȳ)/σy ]
   where σx = sqrt( E(x^2) - E(x)^2 ) and E(x) = expected value of x = average of x.

3. Other choices for distance too…

Hierarchical Clustering

Used in Eisen et al. (Nodes = genes or groups of genes. Initially all nodes are rows of the data matrix.)

1. Compute matrix of all distances (they used the correlation coefficient).
2. Find the two closest nodes.
3. Merge them by averaging their measurements (weighted).
4. Compute distances from the merged node to all others.
5. Repeat until all nodes are merged into a single node.

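The merge loop above is short enough to sketch directly. Below is a minimal Python illustration (not Eisen et al's actual Cluster software) of both distance choices and average-linkage merging; the toy matrix and all function names are made up for the example.

```python
import numpy as np

def euclidean(x, y):
    # D(X, Y) = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def correlation_distance(x, y):
    # 1 - Pearson correlation: 0 for identical profiles, up to 2 for anti-correlated.
    return 1.0 - np.corrcoef(x, y)[0, 1]

def hierarchical_cluster(data, dist=correlation_distance):
    """Average-linkage agglomerative clustering of the rows of `data`.
    Returns merge records (node_a, node_b, distance), following the slide:
    repeatedly merge the two closest nodes into their weighted average."""
    nodes = {i: (data[i].astype(float), 1) for i in range(len(data))}  # id -> (profile, size)
    merges, next_id = [], len(data)
    while len(nodes) > 1:
        ids = list(nodes)
        # Steps 1-2: all pairwise distances; pick the closest pair.
        d, a, b = min((dist(nodes[a][0], nodes[b][0]), a, b)
                      for i, a in enumerate(ids) for b in ids[i + 1:])
        # Step 3: merge by a size-weighted average of the two profiles.
        (pa, na), (pb, nb) = nodes.pop(a), nodes.pop(b)
        nodes[next_id] = ((na * pa + nb * pb) / (na + nb), na + nb)
        merges.append((a, b, d))   # steps 4-5: repeat until one node remains
        next_id += 1
    return merges

# Toy example: 4 genes x 3 conditions.
expr = np.array([[1.0, 2.0, 3.0], [1.1, 2.1, 2.9],
                 [3.0, 2.0, 1.0], [2.9, 1.8, 1.2]])
print(hierarchical_cluster(expr))
```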

Hierarchical Clustering / How many clusters?

[Figure: dendrograms over the rows of the E1–E3 expression matrix; cutting the tree at different heights yields different numbers of clusters (e.g., 7 vs. 4).]

Hierarchical Clustering

• Easy to understand & implement
• Can decide how big to make clusters by choosing the "cut" level of the hierarchy
• Can be sensitive to bad data
• Can have problems interpreting the tree
• Can have local minima

Most commonly used method for microarray data.

[Figure: trees built from cluster analysis group genes by common patterns of expression.]

K-means

(Computationally attractive)

1. Generate random points ("cluster centers") in n dimensions.
2. Compute the distance of each data point to each of the cluster centers.
3. Assign each data point to the closest cluster center.
4. Compute the new cluster center position as the average of the points assigned to it.
5. Loop to (2); stop when the cluster centers do not move very much.

Graphical Representation

[Figure: data points plotted by two features, f1 (x-coordinate) and f2 (y-coordinate), falling into two clusters, A and B.]
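As a sketch of the five steps (names are hypothetical; step 1 says random points, while seeding from randomly chosen data points, as here, is a common variant that avoids empty clusters):

```python
import numpy as np

def kmeans(data, k, iters=100, tol=1e-4, seed=None):
    """Minimal K-means following the slide's five steps."""
    rng = np.random.default_rng(seed)
    # 1. Start cluster centers at k randomly chosen data points.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # 2. Distance of every point to every center.
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        # 3. Assign each point to its closest center.
        labels = d.argmin(axis=1)
        # 4. New center = average of the points assigned to it.
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        # 5. Loop; stop when the centers barely move.
        if np.linalg.norm(new - centers) < tol:
            break
        centers = new
    return labels, centers
```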

Self Organizing Maps

Used by Tamayo et al. (uses the same idea of nodes)

1. Generate a simple (usually) 2D grid of nodes (x, y).
2. Map the nodes into n-dimensional expression vectors (initially randomly), e.g. (x, y) -> [0 0 0 x 0 0 0 y 0 0 0 0 0].
3. For each data point, P, change all node positions so that they move towards P. Closer nodes move more than far nodes.
4. Iterate for a maximum number of iterations, and then assess the position of all nodes.

SOM equations for updating node positions

f_{i+1}(N) = f_i(N) + τ(d(N, N_P), i) * [P - f_i(N)]

f_i(N) = position of node N at iteration i
P = position of the current data point
P - f_i(N) = vector from N to P
τ = weighting factor or "learning rate"; dictates how much to move N towards P
τ(d(N, N_P), i) = 0.02 T / (T + 100 i) for d(N, N_P) < cutoff radius, else 0
T = maximum number of iterations
τ decreases with iteration and with the distance of N to P.
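A sketch of this update loop using the slide's literal τ schedule (note that this τ is flat inside the cutoff radius and zero outside, so how much "closer nodes move more" depends on the radius chosen); grid size, radius, and all names are illustrative:

```python
import numpy as np

def train_som(data, grid_w=3, grid_h=2, T=1000, radius=1.5, seed=None):
    """SOM sketch with the slide's learning rate
    tau(d, i) = 0.02 * T / (T + 100 * i) for grid distance d < radius, else 0."""
    rng = np.random.default_rng(seed)
    # Low-dimensional grid coordinates of each node...
    grid = np.array([(x, y) for x in range(grid_w) for y in range(grid_h)], float)
    # ...mapped (initially randomly) into expression space.
    pos = rng.uniform(data.min(), data.max(), size=(len(grid), data.shape[1]))
    for i in range(T):
        P = data[rng.integers(len(data))]                  # current data point
        winner = np.linalg.norm(pos - P, axis=1).argmin()  # N_P, node closest to P
        tau = 0.02 * T / (T + 100 * i)                     # decays with iteration i
        # Move every node within the cutoff radius (in grid space) towards P.
        near = np.linalg.norm(grid - grid[winner], axis=1) < radius
        pos[near] += tau * (P - pos[near])
    return grid, pos
```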

Graphical Representation

[Figure: data plotted by two features, f1 (x-coordinate) and f2 (y-coordinate); cluster A.]

SOMs

Works well if we use the appropriate structure:
• Impose a partial structure on the cluster problem as a start
• Easy to implement
• Pretty fast
• Let the clusters move towards the data
• Easy to visualize results
• Can be sensitive to starting structure
• No guarantee of convergence to good clusters.

Clustering Lymphomas

[Figure: clustering of lymphoma samples using 143 GC-specific genes.]


Clustering vs. Classification

Clustering uses the primary data to group together measurements, with no information from other sources. Often called "unsupervised machine learning."

Classification uses known groups of interest (from other sources) to learn the features associated with these groups in the primary data, and to create rules for associating the data with the groups of interest. Often called "supervised machine learning."

Graphical Representation

[Figure: data plotted by two features, f1 (x-coordinate) and f2 (y-coordinate), with two groups, A and B.]

Clusters / Apply external labels for classification

[Figure: left, the two clusters found from features f1 (x-coordinate) and f2 (y-coordinate); right, the same points with external labels applied: the RED group and BLUE group are now labeled.]

Tradeoffs

Clustering is not biased by previous knowledge, but therefore needs a stronger signal to discover clusters.

Classification uses previous knowledge, so it can detect a weaker signal, but it may be biased by WRONG previous knowledge.

Methods for Classification

Linear Models
Logistic Regression
Naïve Bayes
Decision Trees
Support Vector Machines


Linear Model

Each gene, g, has a list of n measurements, one per condition: [f1 f2 f3 … fn].

Associate each gene with a 1 if it is in the group of interest, otherwise a 0.

Compute weights to optimize the ability to predict whether genes are in the group of interest or not:

Predicted group = SUM[ weight(i) * fi ]

If fi always occurs in group-1 genes, then its weight is high. If it never occurs, the weight is low.

Assumes that a weighted combination works.

[Figure: PREDICT RED if high value for A and low value for B (high weight on x coordinate, negative weight on y).]
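The slides don't specify the optimizer; an ordinary least-squares fit of the 0/1 labels is one minimal way to compute the weights. A sketch (function names are hypothetical):

```python
import numpy as np

def fit_linear_classifier(F, labels):
    """F: genes x features; labels: 1 if the gene is in the group of interest, else 0.
    Learns weights so that SUM[ weight(i) * f_i ] + bias approximates the label."""
    X = np.hstack([F, np.ones((len(F), 1))])        # append a bias column
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)  # least-squares weights
    return w

def predict_group(F, w, cutoff=0.5):
    X = np.hstack([F, np.ones((len(F), 1))])
    return (X @ w >= cutoff).astype(int)            # 1 = predicted in-group
```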

Logistic Regression

p = probability of being in the group of interest
f = vector of expression measurements

log( p / (1-p) ) = β·f

or

p = e^(β·f) / (1 + e^(β·f))

Use optimization methods to find the β that maximizes the difference between the two groups. Then the equation can be used to estimate the membership of a gene in a group.

Classifying Lymphomas

[Figure: classifying lymphoma samples.]
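One standard optimization method is gradient ascent on the log-likelihood; a sketch (learning rate, iteration count, and names are illustrative):

```python
import numpy as np

def fit_logistic(F, y, lr=0.1, iters=2000):
    """Finds beta for log(p / (1 - p)) = beta . f by gradient ascent."""
    X = np.hstack([F, np.ones((len(F), 1))])   # bias term folded into beta
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # p = e^(beta.f) / (1 + e^(beta.f))
        beta += lr * X.T @ (y - p) / len(y)    # gradient of the log-likelihood
    return beta

def membership_probability(F, beta):
    X = np.hstack([F, np.ones((len(F), 1))])
    return 1.0 / (1.0 + np.exp(-X @ beta))     # estimated group membership
```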

Bayes Rule for Classification

Bayes' Rule: p(hypothesis|data) = p(data|hypothesis) p(hypothesis) / p(data)

p(group 1|f) = p(f|group 1) p(group 1) / p(f)

p(group 1|f) = probability that the gene is in group 1 given the expression data
p(f) = probability of the data
p(f|group 1) = probability of the data given that the gene is in group 1
p(group 1) = probability of group 1 for a given gene (the prior)

Naïve Bayes

Assume all expression measurements for a gene are independent.
Assume p(f) and p(group 1) are constant.

p(f|group 1) = p(f1 & f2 … fn | group 1)
             = p(f1|group 1) * p(f2|group 1) * … * p(fn|group 1)

Can just multiply these probabilities (or add their logs), which are easy to compute by counting up frequencies in the set of "known" members of group 1.

Choose a cutoff probability for saying "Group 1 member."
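Counting frequencies requires discrete features, so one concrete reading is to binarize each measurement (say, above vs. below the per-condition median) and count within the known group members. A sketch; the pseudocount `alpha` and the prior term are my additions, and all names are hypothetical:

```python
import numpy as np

def train_naive_bayes(F_bin, y, alpha=1.0):
    """F_bin: genes x features of 0/1 values (e.g. 1 = above the condition median).
    Counts per-feature frequencies in each group; alpha is a pseudocount so an
    unseen feature value never zeroes out the whole product."""
    model = {}
    for g in (0, 1):
        Fg = F_bin[y == g]
        p1 = (Fg.sum(axis=0) + alpha) / (len(Fg) + 2 * alpha)  # P(f_i = 1 | group g)
        model[g] = (np.log(p1), np.log(1 - p1), np.log(np.mean(y == g)))
    return model

def log_posterior(f, model, g):
    lp1, lp0, lprior = model[g]
    # Independence assumption: add the logs instead of multiplying probabilities.
    return lprior + np.sum(np.where(f == 1, lp1, lp0))

def classify(f, model):
    # "Cutoff": here, simply assign to group 1 if it has the higher posterior.
    return int(log_posterior(f, model, 1) > log_posterior(f, model, 0))
```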

Naïve Bayes

[Figure: if P(Red | x = A) * P(Red | y = 0) is HIGH, assign the point to RED.]

Decision Trees

Consider an n-dimensional graph of all data points (f, the gene expression vectors).

Try to learn cutoff values for each fi that separate the different groups.
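The core step is finding, for each feature, the cutoff that best separates the groups; a tree learner then applies this search recursively to the resulting subsets. A brute-force sketch of that single-split search (names illustrative):

```python
import numpy as np

def best_cutoff(F, y):
    """Scan every feature f_i and every candidate threshold, and return the
    (feature, cutoff, accuracy) of the split that best separates groups 0/1."""
    best = (None, None, 0.0)
    for i in range(F.shape[1]):
        for cut in np.unique(F[:, i]):
            pred = (F[:, i] > cut).astype(int)
            # Either side of the cutoff may correspond to group 1.
            acc = max(np.mean(pred == y), np.mean(pred != y))
            if acc > best[2]:
                best = (i, cut, acc)
    return best
```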

Decision Trees

[Figure: learned rules partition the plane: if x < A and y > B => BLUE; if y < B, or (y > B and x > A) => RED.]

Support Vector Machines

Draw a line that passes close to the members of the two different groups that are the most difficult to distinguish.

Label those difficult members the "support vectors." (Remember, all points are vectors.)

For a variety of reasons (discussed in the tutorial, and to some degree in the Brown et al paper), this choice of line is a good one for classification, given many choices.

Support Vectors and Decision Line

[Figure, left: the decision line with one point left out. Right: the bad point put back in; the boundary line can be penalized for bad predictions, with the PENALTY based on distance from the line.]

Choose the boundary line that is closest to both support vectors

[Figure: the maximum-margin boundary between the two groups; the distance from the boundary to the support vectors is 1/||w||.]

Notes about SVMs

If the points are not easily separable in n dimensions, we can add dimensions (similar to how we mapped the low-dimensional SOM grid points into expression dimensions).

The dot product is used as the measure of distance between two vectors, but it can be generalized to an arbitrary function of the features (expression measurements), as discussed in Brown et al and the associated Burges tutorial.

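A minimal sketch of a linear soft-margin SVM trained by subgradient descent, tying together the margin (distance 1/||w|| from the boundary to the support vectors) and the distance-based penalty from the earlier slide. This is an illustration, not the method of Brown et al; names and hyperparameters are made up:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on ||w||^2 / 2 + C * sum_j max(0, 1 - y_j (w.x_j + b)),
    with labels y in {-1, +1}. Shrinking ||w|| widens the margin, while the
    hinge term penalizes points in proportion to how far they cross it."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xj, yj in zip(X, y):
            if yj * (xj @ w + b) < 1:        # inside the margin or misclassified
                w -= lr * (w - C * yj * xj)  # regularizer + hinge subgradient
                b += lr * C * yj
            else:
                w -= lr * w                  # only the regularizer acts
    return w, b

def classify(X, w, b):
    return np.sign(X @ w + b)                # which side of the decision line
```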


Other informatics issues

• Low-level image processing of spots to assess the amount of fluorescence.
• Need to deal with missing values (due to experimental artifacts, etc…)
• Need to decide how much of a change is significant (e.g. a "2-fold increase" in expression).
• Creation of databases with the info (SMD)

[Figure: a typical DNA array for yeast.]


Estimate Missing Values

[Figure, left: three images — the complete data set; the data set with 30% of entries missing (missing values appear black); and the data set with missing values estimated by the KNNimpute algorithm. Right: normalized RMS error (0.15–0.25) vs. percent of entries missing (0–20) for four strategies — filled with zeros (highest error), row average, SVDimpute, and KNNimpute (lowest error).]

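A sketch in the spirit of KNNimpute (the published method uses a distance-weighted average; this simplified version uses a plain mean, and all names are illustrative):

```python
import numpy as np

def knn_impute(M, k=10):
    """Fill each missing entry (NaN) of the genes x conditions matrix M with
    the mean of that column over the k most similar genes, where similarity
    is RMS distance over the conditions both genes have observed."""
    out = M.astype(float).copy()
    for i, j in zip(*np.where(np.isnan(M))):
        neighbors = []
        for r in np.where(~np.isnan(M[:, j]))[0]:       # rows observed at column j
            shared = ~np.isnan(M[i]) & ~np.isnan(M[r])  # conditions seen by both
            if r != i and shared.any():
                d = np.sqrt(np.mean((M[i, shared] - M[r, shared]) ** 2))
                neighbors.append((d, r))
        neighbors.sort()
        nearest = [r for _, r in neighbors[:k]]
        if nearest:                                      # else: leave entry missing
            out[i, j] = M[nearest, j].mean()
    return out
```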

Choosing best "indicator" genes

Which gene or genes "predict" the class best?

Such a gene might be a good candidate for a biomarker.

Allows you to focus attention on a small number of genes, instead of the large number required to get perfect discrimination between groups.

Principal Component Analysis of Conditions in a Microarray Experiment

[Figure: scree plot of variance (0–2.5) against principal component (1–7).]


PCA

[Figure: projection onto the first two principal components, with the clusters reported in the original literature overlaid.]

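PCA itself is a few lines via the SVD of the mean-centered matrix; this sketch returns both the 2-D projection plotted above and the per-component variance fractions behind a scree plot like the one on the previous slide (names illustrative):

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top principal components and report
    the fraction of total variance each component explains."""
    Xc = X - X.mean(axis=0)                       # mean-center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (len(X) - 1)                   # variance along each component
    return Xc @ Vt[:n_components].T, var / var.sum()
```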

Independent Components Analysis: Look for interesting genes

Find the projection where the distribution maximizes a measure of non-normality.
Use kurtosis as the measure of non-normality.

[Figure: two projections of the data — wA*After + wB*Before (more kurtotic) and w'A*After + w'B*Before (less kurtotic).]

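A brute-force sketch of the idea: sample random unit directions and keep the projection with the largest kurtosis. Real ICA implementations (e.g. FastICA) optimize such a non-normality measure directly rather than sampling; names here are illustrative:

```python
import numpy as np

def excess_kurtosis(z):
    # 0 for a normal distribution; large for heavy-tailed, "peaky" ones.
    z = (z - z.mean()) / z.std()
    return np.mean(z ** 4) - 3.0

def most_kurtotic_projection(X, n_trials=2000, seed=None):
    """Sample random unit vectors w and return the one maximizing the
    kurtosis of the 1-D projection X @ w."""
    rng = np.random.default_rng(seed)
    best_w, best_k = None, -np.inf
    for _ in range(n_trials):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)
        k = excess_kurtosis(X @ w)
        if k > best_k:
            best_w, best_k = w, k
    return best_w, best_k
```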

"Interesting" Projections

• max variance = PCA
• max kurtosis / max non-Gaussianity = ICA (normal = not interesting; PJ Huber '85, Jones & Sibson '87)
• max information

[Figure: the 500 bp 5' of each ORF is fed to a MOTIF FINDING PROGRAM, which separates sites from non-sites (the remainder).]


Imagine other array technologies

Protein chips to assess the interaction of proteins (lay down proteins, then label others and look for binding events).

Discovered MSE Promoter

Consensus: TTTTGTGACAT

>YLL004W  ORC3   1  74    ATTTGTGTCAT
>YML066C  x      1  336   TTTTGTGTCAT
>YFR028C  CDC14  1  35    TTCTGTGACTT
>YNL018C  x      2  r444  TTTTGTGGCAC
                    r484  ATTTGTGACGT
>YJL212C  x      1  r3    TTCTGTGACGT
>YNL174W  x      1  283   ATCTGTGACAT
>YBR069C  VAP1   1  350   TTTTGTGGCAT
>YDR191W  HST4   1  384   TTTAGTGACAT
>YHR124W  NDT80  2  414   TTTTGTGTCAC
                    r212  TTTTGTGTCAT

[Figure annotations: "the regulatory element that recognizes this promoter…"; "multiple elements".]

Reproducibility of data sets

http://bioinformatics.oupjournals.org/cgi/reprint/18/3/405.pdf

• Preparation of mRNA and experimental design
• Hybridization of RNA to DNA
  – Sequence-specific effects
  – Length-related effects
• Quality of spotted genes on array
  – Proper sequence spotted evenly
• Finding and digitizing spot intensities
• Comparability of experiments on same chip, experiments on different chips
