
Microarray data analysis: clustering and classification methods

Russ B. Altman
BMI 214 / CS 274

Microarrays: DNA Base Pairing

[Figure: two antiparallel DNA strands (5'→3' against 3'→5') with complementary bases paired: A with T, G with C.]


Microarrays: Experimental Protocol

[Figure: known DNA sequences are spotted on a glass slide (a typical DNA array for yeast); mRNA is isolated from the cells of interest and from a reference sample, then hybridized to the array.]

Affymetrix chip technology

Instead of putting down intact genes on the chip, these chips put down N-mers of a certain length (around 20) systematically, by synthesizing the N-mers directly on the spots.

Labelled mRNA is then added to the chip, and a *pattern* of binding (based on which 20-mers are in the mRNA sequence) is seen.

Bioinformatics is used to deduce the mRNA sequences that are present.

Affymetrix fabrication

[Figure: fabrication of an Affymetrix chip.]


Affymetrix chip

[Figure: an Affymetrix chip.]

Compare two types of chips

[Figure: the two chip types side by side. REMEMBER: control/reference DNA is also competing for hybridization (shown in green).]



Reproducibility of data sets

• Preparation of mRNA and experimental design
• Hybridization of RNA to DNA
  – Sequence-specific effects
  – Length-related effects
• Quality of spotted genes on array
  – Proper sequence spotted evenly
• Finding and digitizing spot intensities
• Comparability of experiments on same chip, experiments on different chips

What are expression arrays good for?

• Follow a population of (synchronized) cells over time, to see how expression changes (vs. baseline).
• Expose cells to different external stimuli and measure their response (vs. baseline).
• Take cancer cells (or other pathology) and compare to normal cells.
• (Also some non-expression uses, such as assessing presence/absence of sequences in the genome)

Matrix of Expression

[Figure: the expression matrix grows one column per experiment. Rows are Gene 1 … Gene N; columns are Experiment/Conditions E1, then E1 E2, and so on.]

Matrix of Expression / Reorder Rows for Clustering

[Figure: the full N-gene × 3-experiment matrix (E1 E2 E3); clustering reorders the rows so that genes with similar expression patterns across E1–E3 sit next to each other.]

Why do we care about clustering expression data?

If two genes are expressed in the same way, they may be functionally related.

If a gene has unknown function, but clusters with genes of known function, this is a way to assign its general function.

We may be able to look at high-resolution measurements of expression and figure out which genes control which other genes.

E.g., if the peak in cluster 1 always precedes the peak in cluster 2, perhaps cluster 1 turns cluster 2 on?

Average of clustered wave forms

[Figure: typical "wave forms" observed as cluster averages (note: not lots of bumps).]


Cluster Analysis Result

[Figure: result of a cluster analysis.]

Methods for Clustering

• Hierarchical Clustering
• Self Organizing Maps
• K-means
• Trillions of others.


Need a distance metric for two n-dimensional vectors (e.g., for n expression measurements)

1. Euclidean distance:
   D(X, Y) = sqrt[ (x1-y1)^2 + (x2-y2)^2 + … + (xn-yn)^2 ]

2. Correlation coefficient:
   R(X, Y) = cov(X, Y) / (sd(X) sd(Y))
           = 1/n * SUM[ (xi - x̄)/σx * (yi - ȳ)/σy ]
   where σx = sqrt( E(x^2) - E(x)^2 ) and E(x) = expected value of x = average of x.

3. Other choices for distance too…

Hierarchical Clustering

Used in Eisen et al. (Nodes = genes or groups of genes. Initially all nodes are rows of the data matrix.)

1. Compute matrix of all distances (they used the correlation coefficient).
2. Find the two closest nodes.
3. Merge them by averaging their measurements (weighted).
4. Compute distances from the merged node to all others.
5. Repeat until all nodes are merged into a single node.

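The merge loop above is short enough to sketch directly. Below is a minimal Python illustration (not Eisen et al's actual Cluster software) of both distance choices and average-linkage merging; the toy matrix and all function names are made up for the example.

```python
import numpy as np

def euclidean(x, y):
    # D(X, Y) = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def correlation_distance(x, y):
    # 1 - Pearson correlation: 0 for identical profiles, up to 2 for anti-correlated.
    return 1.0 - np.corrcoef(x, y)[0, 1]

def hierarchical_cluster(data, dist=correlation_distance):
    """Average-linkage agglomerative clustering of the rows of `data`.
    Returns merge records (node_a, node_b, distance), following the slide:
    repeatedly merge the two closest nodes into their weighted average."""
    nodes = {i: (data[i].astype(float), 1) for i in range(len(data))}  # id -> (profile, size)
    merges, next_id = [], len(data)
    while len(nodes) > 1:
        ids = list(nodes)
        # Steps 1-2: all pairwise distances; pick the closest pair.
        d, a, b = min((dist(nodes[a][0], nodes[b][0]), a, b)
                      for i, a in enumerate(ids) for b in ids[i + 1:])
        # Step 3: merge by a size-weighted average of the two profiles.
        (pa, na), (pb, nb) = nodes.pop(a), nodes.pop(b)
        nodes[next_id] = ((na * pa + nb * pb) / (na + nb), na + nb)
        merges.append((a, b, d))   # steps 4-5: repeat until one node remains
        next_id += 1
    return merges

# Toy example: 4 genes x 3 conditions.
expr = np.array([[1.0, 2.0, 3.0], [1.1, 2.1, 2.9],
                 [3.0, 2.0, 1.0], [2.9, 1.8, 1.2]])
print(hierarchical_cluster(expr))
```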

Hierarchical Clustering / How many clusters?

[Figure: dendrograms over the rows of the E1–E3 expression matrix; cutting the tree at different heights yields different numbers of clusters (e.g., 7 vs. 4).]

Hierarchical Clustering

• Easy to understand & implement
• Can decide how big to make clusters by choosing the "cut" level of the hierarchy
• Can be sensitive to bad data
• Can have problems interpreting the tree
• Can have local minima

Most commonly used method for microarray data.

[Figure: trees built from cluster analysis group genes by common patterns of expression.]

K-means

(Computationally attractive)

1. Generate random points ("cluster centers") in n dimensions.
2. Compute the distance of each data point to each of the cluster centers.
3. Assign each data point to the closest cluster center.
4. Compute the new cluster center position as the average of the points assigned to it.
5. Loop to (2); stop when the cluster centers do not move very much.

Graphical Representation

[Figure: data points plotted by two features, f1 (x-coordinate) and f2 (y-coordinate), falling into two clusters, A and B.]
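As a sketch of the five steps (names are hypothetical; step 1 says random points, while seeding from randomly chosen data points, as here, is a common variant that avoids empty clusters):

```python
import numpy as np

def kmeans(data, k, iters=100, tol=1e-4, seed=None):
    """Minimal K-means following the slide's five steps."""
    rng = np.random.default_rng(seed)
    # 1. Start cluster centers at k randomly chosen data points.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # 2. Distance of every point to every center.
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        # 3. Assign each point to its closest center.
        labels = d.argmin(axis=1)
        # 4. New center = average of the points assigned to it.
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        # 5. Loop; stop when the centers barely move.
        if np.linalg.norm(new - centers) < tol:
            break
        centers = new
    return labels, centers
```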

Self Organizing Maps

Used by Tamayo et al. (uses the same idea of nodes)

1. Generate a simple (usually) 2D grid of nodes (x, y).
2. Map the nodes into n-dimensional expression vectors (initially randomly), e.g. (x, y) -> [0 0 0 x 0 0 0 y 0 0 0 0 0].
3. For each data point, P, change all node positions so that they move towards P. Closer nodes move more than far nodes.
4. Iterate for a maximum number of iterations, and then assess the position of all nodes.

SOM equations for updating node positions

f_{i+1}(N) = f_i(N) + τ(d(N, N_P), i) * [P - f_i(N)]

f_i(N) = position of node N at iteration i
P = position of the current data point
P - f_i(N) = vector from N to P
τ = weighting factor or "learning rate"; dictates how much to move N towards P
τ(d(N, N_P), i) = 0.02 T / (T + 100 i) for d(N, N_P) < cutoff radius, else 0
T = maximum number of iterations
τ decreases with iteration and with the distance of N to P.
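A sketch of this update loop using the slide's literal τ schedule (note that this τ is flat inside the cutoff radius and zero outside, so how much "closer nodes move more" depends on the radius chosen); grid size, radius, and all names are illustrative:

```python
import numpy as np

def train_som(data, grid_w=3, grid_h=2, T=1000, radius=1.5, seed=None):
    """SOM sketch with the slide's learning rate
    tau(d, i) = 0.02 * T / (T + 100 * i) for grid distance d < radius, else 0."""
    rng = np.random.default_rng(seed)
    # Low-dimensional grid coordinates of each node...
    grid = np.array([(x, y) for x in range(grid_w) for y in range(grid_h)], float)
    # ...mapped (initially randomly) into expression space.
    pos = rng.uniform(data.min(), data.max(), size=(len(grid), data.shape[1]))
    for i in range(T):
        P = data[rng.integers(len(data))]                  # current data point
        winner = np.linalg.norm(pos - P, axis=1).argmin()  # N_P, node closest to P
        tau = 0.02 * T / (T + 100 * i)                     # decays with iteration i
        # Move every node within the cutoff radius (in grid space) towards P.
        near = np.linalg.norm(grid - grid[winner], axis=1) < radius
        pos[near] += tau * (P - pos[near])
    return grid, pos
```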

Graphical Representation

[Figure: data plotted by two features, f1 (x-coordinate) and f2 (y-coordinate); cluster A.]

SOMs

Works well if we use the appropriate structure:
• Impose a partial structure on the cluster problem as a start
• Easy to implement
• Pretty fast
• Let the clusters move towards the data
• Easy to visualize results
• Can be sensitive to starting structure
• No guarantee of convergence to good clusters.

Clustering Lymphomas

[Figure: clustering of lymphoma samples using 143 GC-specific genes.]


Clustering vs. Classification

Clustering uses the primary data to group together measurements, with no information from other sources. Often called "unsupervised machine learning."

Classification uses known groups of interest (from other sources) to learn the features associated with these groups in the primary data, and to create rules for associating the data with the groups of interest. Often called "supervised machine learning."

Graphical Representation

[Figure: data plotted by two features, f1 (x-coordinate) and f2 (y-coordinate), with two groups, A and B.]

Clusters / Apply external labels for classification

[Figure: left, the two clusters found from features f1 (x-coordinate) and f2 (y-coordinate); right, the same points with external labels applied: the RED group and BLUE group are now labeled.]

Tradeoffs

Clustering is not biased by previous knowledge, but therefore needs a stronger signal to discover clusters.

Classification uses previous knowledge, so it can detect a weaker signal, but it may be biased by WRONG previous knowledge.

Methods for Classification

Linear Models
Logistic Regression
Naïve Bayes
Decision Trees
Support Vector Machines


Linear Model

Each gene, g, has a list of n measurements, one per condition: [f1 f2 f3 … fn].

Associate each gene with a 1 if it is in the group of interest, otherwise a 0.

Compute weights to optimize the ability to predict whether genes are in the group of interest or not:

Predicted group = SUM[ weight(i) * fi ]

If fi always occurs in group-1 genes, then its weight is high. If it never occurs, the weight is low.

Assumes that a weighted combination works.

[Figure: PREDICT RED if high value for A and low value for B (high weight on x coordinate, negative weight on y).]
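The slides don't specify the optimizer; an ordinary least-squares fit of the 0/1 labels is one minimal way to compute the weights. A sketch (function names are hypothetical):

```python
import numpy as np

def fit_linear_classifier(F, labels):
    """F: genes x features; labels: 1 if the gene is in the group of interest, else 0.
    Learns weights so that SUM[ weight(i) * f_i ] + bias approximates the label."""
    X = np.hstack([F, np.ones((len(F), 1))])        # append a bias column
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)  # least-squares weights
    return w

def predict_group(F, w, cutoff=0.5):
    X = np.hstack([F, np.ones((len(F), 1))])
    return (X @ w >= cutoff).astype(int)            # 1 = predicted in-group
```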

Logistic Regression

p = probability of being in the group of interest
f = vector of expression measurements

log( p / (1-p) ) = β·f

or

p = e^(β·f) / (1 + e^(β·f))

Use optimization methods to find the β that maximizes the difference between the two groups. Then the equation can be used to estimate the membership of a gene in a group.

Classifying Lymphomas

[Figure: classifying lymphoma samples.]
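One standard optimization method is gradient ascent on the log-likelihood; a sketch (learning rate, iteration count, and names are illustrative):

```python
import numpy as np

def fit_logistic(F, y, lr=0.1, iters=2000):
    """Finds beta for log(p / (1 - p)) = beta . f by gradient ascent."""
    X = np.hstack([F, np.ones((len(F), 1))])   # bias term folded into beta
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # p = e^(beta.f) / (1 + e^(beta.f))
        beta += lr * X.T @ (y - p) / len(y)    # gradient of the log-likelihood
    return beta

def membership_probability(F, beta):
    X = np.hstack([F, np.ones((len(F), 1))])
    return 1.0 / (1.0 + np.exp(-X @ beta))     # estimated group membership
```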

Bayes Rule for Classification

Bayes' Rule: p(hypothesis|data) = p(data|hypothesis) p(hypothesis) / p(data)

p(group 1|f) = p(f|group 1) p(group 1) / p(f)

p(group 1|f) = probability that the gene is in group 1 given the expression data
p(f) = probability of the data
p(f|group 1) = probability of the data given that the gene is in group 1
p(group 1) = probability of group 1 for a given gene (the prior)

Naïve Bayes

Assume all expression measurements for a gene are independent.
Assume p(f) and p(group 1) are constant.

p(f|group 1) = p(f1 & f2 … fn | group 1)
             = p(f1|group 1) * p(f2|group 1) * … * p(fn|group 1)

Can just multiply these probabilities (or add their logs), which are easy to compute by counting up frequencies in the set of "known" members of group 1.

Choose a cutoff probability for saying "Group 1 member."
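Counting frequencies requires discrete features, so one concrete reading is to binarize each measurement (say, above vs. below the per-condition median) and count within the known group members. A sketch; the pseudocount `alpha` and the prior term are my additions, and all names are hypothetical:

```python
import numpy as np

def train_naive_bayes(F_bin, y, alpha=1.0):
    """F_bin: genes x features of 0/1 values (e.g. 1 = above the condition median).
    Counts per-feature frequencies in each group; alpha is a pseudocount so an
    unseen feature value never zeroes out the whole product."""
    model = {}
    for g in (0, 1):
        Fg = F_bin[y == g]
        p1 = (Fg.sum(axis=0) + alpha) / (len(Fg) + 2 * alpha)  # P(f_i = 1 | group g)
        model[g] = (np.log(p1), np.log(1 - p1), np.log(np.mean(y == g)))
    return model

def log_posterior(f, model, g):
    lp1, lp0, lprior = model[g]
    # Independence assumption: add the logs instead of multiplying probabilities.
    return lprior + np.sum(np.where(f == 1, lp1, lp0))

def classify(f, model):
    # "Cutoff": here, simply assign to group 1 if it has the higher posterior.
    return int(log_posterior(f, model, 1) > log_posterior(f, model, 0))
```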

Naïve Bayes

[Figure: if P(Red | x = A) * P(Red | y = 0) is HIGH, assign the point to RED.]

Decision Trees

Consider an n-dimensional graph of all data points (f, the gene expression vectors).

Try to learn cutoff values for each fi that separate the different groups.
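The core step is finding, for each feature, the cutoff that best separates the groups; a tree learner then applies this search recursively to the resulting subsets. A brute-force sketch of that single-split search (names illustrative):

```python
import numpy as np

def best_cutoff(F, y):
    """Scan every feature f_i and every candidate threshold, and return the
    (feature, cutoff, accuracy) of the split that best separates groups 0/1."""
    best = (None, None, 0.0)
    for i in range(F.shape[1]):
        for cut in np.unique(F[:, i]):
            pred = (F[:, i] > cut).astype(int)
            # Either side of the cutoff may correspond to group 1.
            acc = max(np.mean(pred == y), np.mean(pred != y))
            if acc > best[2]:
                best = (i, cut, acc)
    return best
```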

Decision Trees

[Figure: learned rules partition the plane: if x < A and y > B => BLUE; if y < B, or (y > B and x > A) => RED.]

Support Vector Machines

Draw a line that passes close to the members of the two different groups that are the most difficult to distinguish.

Label those difficult members the "support vectors." (Remember, all points are vectors.)

For a variety of reasons (discussed in the tutorial, and to some degree in the Brown et al paper), this choice of line is a good one for classification, given many choices.

Support Vectors and Decision Line

[Figure, left: the decision line with one point left out. Right: the bad point put back in; the boundary line can be penalized for bad predictions, with the PENALTY based on distance from the line.]

Choose the boundary line that is closest to both support vectors

[Figure: the maximum-margin boundary between the two groups; the distance from the boundary to the support vectors is 1/||w||.]

Notes about SVMs

If the points are not easily separable in n dimensions, we can add dimensions (similar to how we mapped the low-dimensional SOM grid points into expression dimensions).

The dot product is used as the measure of distance between two vectors, but it can be generalized to an arbitrary function of the features (expression measurements), as discussed in Brown et al and the associated Burges tutorial.

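A minimal sketch of a linear soft-margin SVM trained by subgradient descent, tying together the margin (distance 1/||w|| from the boundary to the support vectors) and the distance-based penalty from the earlier slide. This is an illustration, not the method of Brown et al; names and hyperparameters are made up:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on ||w||^2 / 2 + C * sum_j max(0, 1 - y_j (w.x_j + b)),
    with labels y in {-1, +1}. Shrinking ||w|| widens the margin, while the
    hinge term penalizes points in proportion to how far they cross it."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xj, yj in zip(X, y):
            if yj * (xj @ w + b) < 1:        # inside the margin or misclassified
                w -= lr * (w - C * yj * xj)  # regularizer + hinge subgradient
                b += lr * C * yj
            else:
                w -= lr * w                  # only the regularizer acts
    return w, b

def classify(X, w, b):
    return np.sign(X @ w + b)                # which side of the decision line
```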


Other informatics issues

• Low-level image processing of spots to assess the amount of fluorescence.
• Need to deal with missing values (due to experimental artifacts, etc…)
• Need to decide how much of a change is significant (e.g. a "2-fold increase" in expression).
• Creation of databases with the info (SMD)

[Figure: a typical DNA array for yeast.]


Estimate Missing Values

[Figure, left: three images — the complete data set; the data set with 30% of entries missing (missing values appear black); and the data set with missing values estimated by the KNNimpute algorithm. Right: normalized RMS error (0.15–0.25) vs. percent of entries missing (0–20) for four strategies — filled with zeros (highest error), row average, SVDimpute, and KNNimpute (lowest error).]

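A sketch in the spirit of KNNimpute (the published method uses a distance-weighted average; this simplified version uses a plain mean, and all names are illustrative):

```python
import numpy as np

def knn_impute(M, k=10):
    """Fill each missing entry (NaN) of the genes x conditions matrix M with
    the mean of that column over the k most similar genes, where similarity
    is RMS distance over the conditions both genes have observed."""
    out = M.astype(float).copy()
    for i, j in zip(*np.where(np.isnan(M))):
        neighbors = []
        for r in np.where(~np.isnan(M[:, j]))[0]:       # rows observed at column j
            shared = ~np.isnan(M[i]) & ~np.isnan(M[r])  # conditions seen by both
            if r != i and shared.any():
                d = np.sqrt(np.mean((M[i, shared] - M[r, shared]) ** 2))
                neighbors.append((d, r))
        neighbors.sort()
        nearest = [r for _, r in neighbors[:k]]
        if nearest:                                      # else: leave entry missing
            out[i, j] = M[nearest, j].mean()
    return out
```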

Choosing best "indicator" genes

Which gene or genes "predict" the class best?

Such a gene might be a good candidate for a biomarker.

Allows you to focus attention on a small number of genes, instead of the large number required to get perfect discrimination between groups.

Principal Component Analysis of Conditions in a Microarray Experiment

[Figure: scree plot of variance (0–2.5) against principal component (1–7).]


PCA

[Figure: projection onto the first two principal components, with the clusters reported in the original literature overlaid.]

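PCA itself is a few lines via the SVD of the mean-centered matrix; this sketch returns both the 2-D projection plotted above and the per-component variance fractions behind a scree plot like the one on the previous slide (names illustrative):

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top principal components and report
    the fraction of total variance each component explains."""
    Xc = X - X.mean(axis=0)                       # mean-center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (len(X) - 1)                   # variance along each component
    return Xc @ Vt[:n_components].T, var / var.sum()
```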

Independent Components Analysis: Look for interesting genes

Find the projection where the distribution maximizes a measure of non-normality.
Use kurtosis as the measure of non-normality.

[Figure: two projections of the data — wA*After + wB*Before (more kurtotic) and w'A*After + w'B*Before (less kurtotic).]

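A brute-force sketch of the idea: sample random unit directions and keep the projection with the largest kurtosis. Real ICA implementations (e.g. FastICA) optimize such a non-normality measure directly rather than sampling; names here are illustrative:

```python
import numpy as np

def excess_kurtosis(z):
    # 0 for a normal distribution; large for heavy-tailed, "peaky" ones.
    z = (z - z.mean()) / z.std()
    return np.mean(z ** 4) - 3.0

def most_kurtotic_projection(X, n_trials=2000, seed=None):
    """Sample random unit vectors w and return the one maximizing the
    kurtosis of the 1-D projection X @ w."""
    rng = np.random.default_rng(seed)
    best_w, best_k = None, -np.inf
    for _ in range(n_trials):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)
        k = excess_kurtosis(X @ w)
        if k > best_k:
            best_w, best_k = w, k
    return best_w, best_k
```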

"Interesting" Projections

• max variance = PCA
• max kurtosis / max non-Gaussianity = ICA (normal = not interesting; PJ Huber '85, Jones & Sibson '87)
• max information

[Figure: the 500 bp 5' of each ORF is fed to a MOTIF FINDING PROGRAM, which separates sites from non-sites (the remainder).]


Imagine other array technologies

Protein chips to assess the interaction of proteins (lay down proteins, then label others and look for binding events).

Discovered MSE Promoter

Consensus: TTTTGTGACAT

>YLL004W  ORC3   1  74    ATTTGTGTCAT
>YML066C  x      1  336   TTTTGTGTCAT
>YFR028C  CDC14  1  35    TTCTGTGACTT
>YNL018C  x      2  r444  TTTTGTGGCAC
                    r484  ATTTGTGACGT
>YJL212C  x      1  r3    TTCTGTGACGT
>YNL174W  x      1  283   ATCTGTGACAT
>YBR069C  VAP1   1  350   TTTTGTGGCAT
>YDR191W  HST4   1  384   TTTAGTGACAT
>YHR124W  NDT80  2  414   TTTTGTGTCAC
                    r212  TTTTGTGTCAT

[Figure annotations: "the regulatory element that recognizes this promoter…"; "multiple elements".]

Reproducibility of data sets

http://bioinformatics.oupjournals.org/cgi/reprint/18/3/405.pdf

• Preparation of mRNA and experimental design
• Hybridization of RNA to DNA
  – Sequence-specific effects
  – Length-related effects
• Quality of spotted genes on array
  – Proper sequence spotted evenly
• Finding and digitizing spot intensities
• Comparability of experiments on same chip, experiments on different chips
