Improved Statistical Test

My Intel paper. Now it is finally published.

Published by: Jared Friedman on Oct 11, 2006
Copyright: Attribution Non-commercial







Improved Statistical Test for Multiple-Condition Microarray Data Integrating ANOVA with
Jared Friedman
This research was done at The Rockefeller University and supervised by Brian Kirk
1. 180 East End Ave, New York, NY, 10128. Email: vicvic@att.net. Phone: 212-
2. Email: KirkB@Rockefeller.edu. Phone: 212-327-7064
Microarrays are exciting new biological instruments. Microarrays promise to have important applications in many fields of biological research, but they are currently fairly inaccurate instruments and this inaccuracy increases their expense and reduces their usefulness. This research introduces a new technique for analyzing data from some types of microarray experiments that promises to produce more accurate results. Essentially, the method works by identifying patterns in the genes.
Motivation: Statistical significance analysis of microarray data is currently a pressing problem. A multitude of statistical techniques have been proposed, and in the case of simple two condition experiments these techniques work well, but in the case of multiple condition experiments there is additional information that none of these techniques take into account. This information is the shape of the expression vector for each gene, i.e., the ordered set of expression measurements, and its usefulness lies in the fact that genes that are actually affected biologically by some experimental circumstance tend to fall into a relatively small number of clusters. Genes that do appear to fall into these clusters should be selected for by the significance test; those that do not should be selected against.
Results: Such a test was successfully designed and tested using a large number of artificially generated data sets. Where the above assumption of the correlation between clustering and significance is true, this test gives considerably better performance than conventional measures. Where the assumption is not entirely true, the test is robust to the
Microarrays are part of an exciting new class of biotechnologies that promise to allow monitoring of genome-wide expression levels (1). Although the technique presented in this paper is quite general and would, in principle, work with other types of data, it has been developed for microarray data. Microarray experiments work by testing expression levels of some set of genes in organisms exposed to at least two different conditions, and usually then trying to determine which (if any) of these genes are actually changing in response to the change in experimental condition (2). Unfortunately, the random variation in microarrays is often large relative to the changes to be detected. This makes it very hard to eliminate false positives, genes whose random fluctuations make them erroneously seem to respond to the experimental condition, and false negatives, genes whose random fluctuations mask the fact that they are actually responding to the experimental condition. It is standard statistical practice to handle this problem by using replicates, i.e., several microarrays run in identical settings (1-9). Although replicates are highly effective, the expense of additional replicates makes it worthwhile to minimize the number needed by using more effective statistical tests.
When there are only two conditions in an experiment, the conventional statistical test for differential expression is the t-test (2,6). But Long et al. (5, also 6,7) realized that by running a separate t-test on each gene, valuable information is lost, namely the fact that the actual population standard deviation of all the genes is rather similar. Due to the small number of replicates in microarray experiments there are usually a few genes that, by chance, appear to have extremely low standard deviations and thus register even a very small change as highly significant by a conventional t-test. In reality, and as additional replicates would confirm, the standard deviations of these genes are much higher, and the change is not at all significant. In response to this issue, Long et al. used Bayesian statistics to derive a method of transforming standard deviations that, in essence, moves outlying standard deviations closer to the mean standard deviation. Unfortunately, the resulting algorithm, which they call Cyber-T, works only with two condition experiments. Since this research is concerned with multiple condition experiments, it was necessary to expand the method to work with this type of experiment, the standard statistical test for which is called ANOVA (ANalysis Of VAriance) (8). Here we report the successful extension of the Cyber-T algorithm to multiple condition experiments, creating the algorithm we call Cyber-ANOVA.
In microarray experiments involving several conditions, the expression vector of each gene formed by consecutive measurements of its expression level has a distinctive shape, which contains useful information. The more conditions in the experiment (and thus numbers in the expression vector), the more distinctive will be the shape. Often biologists are interested in finding genes with expression vectors of similar shape, and this interest has generated a well-developed process for grouping genes by shape known as clustering (9). The basis of clustering techniques is the concept of correlation, the similarity between two vectors. We chose to use the Pearson correlation, because this measure generally captures more accurately the actual biological idea of correlation (2). With the Pearson correlation, as elsewhere, distance is defined as one minus the correlation and measures the dissimilarity between two vectors. Using this measure of distance, it is possible to solve the problem of taking a large number of unorganized vectors and placing them into clusters of similar shapes: this is known as K-means clustering (12). Once the K-means algorithm has been completed, it is often useful to determine how well each gene has clustered. This is done by computing a representative vector, often called a center, for each cluster and calculating the distance between each gene and the center of that gene's cluster (14,15).
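The distance measure described above can be sketched in a few lines of Python. This is only an illustration, not the author's code (which the paper says is written in Visual Basic for Applications). The function names are hypothetical, and taking the cluster center to be the component-wise mean of the member vectors is an assumption; the text only calls it "a representative vector".

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a perfectly flat vector has no shape, so treat it as uncorrelated.
        return 0.0
    return cov / (norm_u * norm_v)

def pearson_distance(u, v):
    """Distance as defined in the text: one minus the correlation (range 0 to 2)."""
    return 1.0 - pearson(u, v)

def cluster_center(vectors):
    """Assumed center: the component-wise mean of the vectors in the cluster."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

Two vectors with the same shape but different scale have a distance near 0, while vectors of opposite shape have a distance near 2.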
The fundamental thesis of this research is that genes actually affected by experimental circumstances tend to fall into a relatively small number of clusters, and that this information can be used to make a more accurate statistical test. While it is in principle possible, it is not often the case that large numbers of affected genes behave in seemingly random and entirely dissimilar patterns. Given all the clusters of clearly significant genes, genes that fit well in a cluster are more likely to be differentially expressed than those that do not. Unfortunately, this hypothesis is very difficult to test, for no microarray experiments have been replicated enough to allow for accurate determination of differentially expressed genes. The studies that have come closest to this conclusion are Zhao et al. (10) and Jiang et al. (11). Both studies ran K-Means clustering on multiple condition data and found several distinct cluster shapes that accounted for all clearly significant genes.
We used this connection between clustering and significance to create a more accurate statistical test whose essential steps are the following. First Cyber-ANOVA is run on the entire dataset and the highly significant genes are clustered. The other genes are placed into the cluster in which they fit best, and the distance between each gene and the center of its cluster is computed. There are now two values for each gene, the P value (probability) from the Cyber-ANOVA test and the D value (distance) from the clustering. These two values are combined and the resulting value is more accurate than either one alone.
The potential advantage of this algorithm was validated by the creation and successful testing of a large number of artificial data sets. Artificial data was chosen over real experimental data for two main reasons: (1) in experimental data, uncertainty always exists over which genes are actually differentially expressed; this presents major problems in determining how well some given algorithm works; and (2) no single microarray experiment is actually representative of all possible microarray experiments: to determine accurately the performance advantage of an algorithm, a large number of very different experiments must be used, a prohibitively expensive and difficult process. Artificial creation of data sets allows every parameter of the data to be perfectly controlled and the algorithm's performance can be tested on every possible combination of parameters.
The basic method of analysis was to generate an artificial data set from a number of parameters, run the clustering algorithm on the data and measure its performance, run Cyber-ANOVA on that data and on similar data but with additional replicates and measure its performance, and then compare the results. The performance gain of the clustering algorithm is probably best expressed as equivalent replicate gain: the number of additional replicates that would be needed to produce the results of the clustering algorithm using a standard Cyber-ANOVA. On reasonable simulated data, we found that the benefit was approximately one additional replicate added to an original two. With additional improvements to the algorithm design, this saving could probably be increased still further, and the advantage is clearly significant enough to warrant use in actual microarray experiments.

Algorithm Design
The algorithm is implemented as a large (~2000 lines) macro script in Microsoft Visual Basic for Applications controlling Microsoft Excel. The essential steps of the algorithm are the following:
1. Run Cyber-ANOVA on all genes;
2. Adjust P values for multiple testing;
3. Use the results of (2) to choose an appropriate number of most significant genes;
4. Cluster this selection of genes using K-Means;
5. Take the remaining genes and group each into the cluster with which it correlates best;
6. Compute the distance between each gene and the center of its cluster;
7. Compute a combined measurement of significance for each gene.
1. Run Cyber-ANOVA on all genes. Cyber-ANOVA is just a simple extension of Cyber-T, but its novelty causes it to warrant some discussion. Cyber-T takes as input a list of genes, means in each condition, standard deviations in each condition, and number of replicates. It gives as output a set of P-values much like those from a conventional t-test, but more accurate. The improvement is based on the idea of regularizing the standard deviations, essentially moving outlying standard deviations closer to the mean. At the same time, it recognizes that standard deviation may change significantly (and inversely) with intensity level, and thus genes are regularized only with genes of similar intensity. The algorithm works by first computing for each gene a background standard deviation, the average standard deviation of the hundred genes with closest intensity, and then combining the background standard deviation with the observed standard deviation in a type of weighted average. A standard t-test is then run, but using the regularized standard deviation instead of the observed one.
Cyber-ANOVA works by the same method, taking as input means and standard deviations in any number of conditions. Each background standard deviation is calculated using the intensities of the genes in their own condition, and the values are combined using the same formula. But because there are more than two conditions, a t-test cannot be applied; instead we use ANOVA, inputting the regularized standard deviation instead of the observed one. The result gives a dramatic increase in accuracy, an equivalent replicate gain of roughly one.
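The regularization step can be illustrated with a short Python sketch. The text describes the combination only as "a type of weighted average", so the specific formula below (a prior weight `nu0` on the background variance, in the style usually attributed to Cyber-T) and the default `nu0=10` are assumptions, as are all function names. The sketch stops at the F statistic computed from summary statistics with equal replicates per condition; converting it to a P value needs an F-distribution routine, which the Python standard library does not provide.

```python
import math

def background_sd(means, sds, i, window=100):
    """Average SD of the `window` genes closest in mean intensity to gene i
    (gene i itself is included, which is an assumption)."""
    nearest = sorted(range(len(means)), key=lambda j: abs(means[j] - means[i]))[:window]
    return sum(sds[j] for j in nearest) / len(nearest)

def regularized_sd(observed_sd, bg_sd, n, nu0=10):
    """Assumed weighted average of background and observed variance.
    n is the number of replicates; nu0 is a hypothetical prior weight."""
    variance = (nu0 * bg_sd ** 2 + (n - 1) * observed_sd ** 2) / (nu0 + n - 2)
    return math.sqrt(variance)

def anova_f(group_means, group_sds, n):
    """One-way ANOVA F statistic from per-condition means and (regularized) SDs,
    assuming n replicates in every condition."""
    k = len(group_means)
    grand_mean = sum(group_means) / k
    ms_between = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    ms_within = sum(sd ** 2 for sd in group_sds) / k
    return ms_between / ms_within
```

A gene whose observed standard deviation is implausibly small is pulled toward the background value, so it no longer produces an inflated F statistic.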
2. Adjust P values for multiple testing. To run the clustering algorithm, we need to select a number of genes that are highly significant to form the seed clusters around which all other genes will cluster. The difficult step in the process is deciding how many genes to select: it is critical to get at least a few seeds for all clusters but also critical not to have too many false positives. Since the first requirement is impossible to determine by calculation, we use the second. In order to choose an appropriate number of expected false positives, we must have an accurate estimate of the false positive rate. The simplest way of estimating the false positive rate is a Bonferroni correction, which simply multiplies each P value by the total number of tests. This method assumes that all genes are independent, and tends to be too conservative, particularly with small numbers of replicates, and a number of more accurate techniques have been proposed. However, for this step, only a rough estimate is necessary, and thus the Bonferroni correction suffices.
3. Use the results of (2) to choose an appropriate number of most significant genes. Once the expected false positive rate for any number of the most significant genes has been estimated, some number of genes must be selected such that the number of expected false positives is some percentage of the number of selected genes. There is no particular correct way to define this percentage; it depends on how significant the genes in the least significant cluster are, which cannot be determined directly. However, we have found that the algorithm is tolerant to deviations from optimality (Table 6).
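Steps 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions: reading the Bonferroni-scaled P value of the k-th ranked gene as the expected number of false positives among the top k genes is an interpretation of the text, and the threshold `max_fp_fraction` and the function names are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each P value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def select_seed_genes(p_values, max_fp_fraction=0.01):
    """Return indices of the largest top-k set whose expected number of
    false positives (m * k-th smallest P value) is at most
    max_fp_fraction * k. Returns an empty list if no k qualifies."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    best_k = 0
    for k in range(1, m + 1):
        expected_fp = m * p_values[order[k - 1]]
        if expected_fp <= max_fp_fraction * k:
            best_k = k
    return order[:best_k]
```

For example, `select_seed_genes([1e-6, 2e-6, 0.5, 0.9])` keeps only the first two genes, because admitting the third would raise the expected false positives far above one percent of the selection.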
4. Cluster this selection of genes using K-Means. The chosen group of most significant genes is clustered using the K-Means algorithm with the Pearson correlation. The primary problem with all K-means clustering algorithms is that the number of clusters must be declared at the start (13). But like the determination of the number of genes to include in the initial clustering, there is no problem as long as K is rather high. This can be explained conceptually by considering what happens as K increases to differentially expressed and non-differentially expressed genes. If there are only differentially expressed genes in the original set, then as K is raised, genes that were originally placed in one cluster will separate into multiple clusters. Fortunately, this is guaranteed by the process of K-Means to lower the D values and create better cluster shapes. If there are non-differentially expressed genes, then as K is increased these genes will tend to separate out from the differentially expressed genes and form their own clusters, a highly desirable occurrence. Unfortunately, if K becomes very large, the noise genes may start to cluster very well with other noise genes, forming erroneously low distances, and the other noise genes added in the next step will have a better chance of incidentally finding a cluster with which they correlate well. In principle, too large a K might be partially monitored by checking for very small clusters and for clusters with very high average distances, both of which could be removed and their genes forced to join a different cluster. These methods have not yet been
5. Take the remaining genes and group each into the cluster with which it correlates best. The clusters formed in step 4 act as seeds for the remaining genes. The clusters formed optimally include all of the cluster shapes of actually differentially expressed genes and no others. It was possible to form these clusters, even if they were less than ideal, by using only a subset of the genes in the data set. Now it is possible to look at the remaining genes and to see how well each belongs in one of the clusters. Ideally, each differentially expressed gene would correlate perfectly with exactly one cluster and each noise gene would correlate poorly with all of them. In practice, of course, this is far from the case, but the principle remains. Since each gene is, of course, only expected to correlate well with one cluster, each gene in the rest of the dataset is added to the cluster with which it correlates best.
6. Compute the distance between each gene and the center of its cluster. The measurement of how well each gene belongs in its cluster is now determined by computing the Pearson correlation between each gene and the center of its cluster. The resulting correlations are converted to distances (distance = 1 - correlation). These distance values are the key numbers produced by the first half of the algorithm. They represent whether a gene vector's shape is similar to the shape of gene vectors known to be significant. The basis of this research is that low distances tend to imply differentially expressed genes. Unfortunately, distance values alone are not an accurate measure of significance: ranking genes by distance alone produces results far worse than the original Cyber-ANOVA. Instead, the distance ("D") values must be combined with the original P values to give new values better than either alone.
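Steps 5 and 6 amount to a nearest-center assignment under the Pearson distance. The sketch below assumes the seed cluster centers from step 4 are already available as plain lists; the function names are illustrative and not taken from the original script.

```python
import math

def pearson(u, v):
    """Pearson correlation; returns 0.0 for a flat vector (an assumption)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0

def assign_to_best_cluster(gene, centers):
    """Return (cluster index, D value) for the center the gene correlates with best.
    D is the distance 1 - correlation to that center."""
    correlations = [pearson(gene, center) for center in centers]
    best = max(range(len(centers)), key=lambda k: correlations[k])
    return best, 1.0 - correlations[best]
```

A gene that rises across conditions lands in the rising cluster with a D value near zero, while a noisy flat gene is still assigned somewhere but typically with a larger D value.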
7. Compute a combined measurement of significance for each gene. The goal now is to take the P value and the D value and combine them in some way to make a new value. Just as with the calculation of the previous probabilities, what we are really interested in is finding the probability of a non-differentially expressed gene with a certain standard deviation attaining both a P value less than or equal to the observed P value and a D value less than or equal to the observed D value. Unfortunately, the theoretical basis for the connection between P and D values is not well defined, and it may be very complicated and experiment-dependent; there may well be no useful analytic solution to this problem. Instead, we generate the distribution of P and D values for each experiment. First, thousands of non-differentially expressed genes are generated using the standard deviation that would be expected at their intensity level, and their P and D values are calculated. Then, for each gene in the data set, the number of non-differentially expressed genes that have both P and D values equal to or lower than the P and D values of that gene is counted. That number is divided by the number of non-differentially expressed genes generated, and that quotient gives an approximate probability.
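The counting procedure of step 7 is simple enough to sketch directly. The code assumes the simulated non-differentially expressed genes have already been reduced to a list of (P, D) pairs; the names are hypothetical. Breaking ties in the combined value by the P value is one way to rank genes whose combined value comes out as zero.

```python
def combined_value(p, d, null_pairs):
    """Fraction of simulated null genes whose P and D values are both
    less than or equal to the observed P and D values."""
    hits = sum(1 for p0, d0 in null_pairs if p0 <= p and d0 <= d)
    return hits / len(null_pairs)

def rank_genes(genes, null_pairs):
    """genes: list of (name, p, d). Returns names, most significant first.
    Ties in the combined value are broken by the P value."""
    scored = [(combined_value(p, d, null_pairs), p, name) for name, p, d in genes]
    return [name for _, _, name in sorted(scored)]
```

The resolution of the combined value is limited by the number of simulated genes: with N null pairs, no nonzero value smaller than 1/N can be produced.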
It is important to note that for highly significant genes, this method does not give accurate probabilities, for if 10^n non-differentially expressed genes are generated, only genes with an actual probability of occurring of about 10^-n are likely to be generated. Thus, genes in the data set that are actually significant to a probability of below 10^-n will almost certainly not have any generated genes more significant and will all be assigned probabilities of zero. But this is not a major concern, since these genes are so significant that there is little question that they are differentially expressed, and their precise rank is not likely to be important. If it is, then the genes with zero combined values can be ranked internally by their P values. In this case, within the very most significant genes, the algorithm will not have had any effect.
Generation of the Data
To test the performance of the algorithm, a number of artificial data sets were generated. The goal was to see how much benefit could be derived from the algorithm and how different parameter combinations would impact the results. Above all, the data is intended to mimic real microarray data, at least in all respects that are likely to significantly influence the results of the test. The basic process of data generation was to use certain parameters to create a mean value and standard deviation for each gene and condition and to use a random number generator assuming a normal distribution to generate plausible experimental values. Some of the most important parameters are the numbers of differentially and non-differentially expressed genes, number of groups, cluster shapes, standard deviations, and fold changes.
For each data set, about 8500 genes were created, with the number of differentially expressed genes ranging from about 100 to about 500. The rest were genes whose mean value did not change across conditions, but that mean value did vary between different non-differentially expressed genes. The number of groups used is a key determinant of the performance of the algorithm, more groups causing better performance. Most of the data sets generated used six groups, but a set with twenty groups was tried.
The cluster shapes of non-differentially expressed genes are, of course, flat lines, for that is the definition of being non-differentially expressed. The cluster shapes of the differentially expressed genes were more complicated. They are drawn loosely from Zhao et al. (10), but really the specific shapes chosen are not all that important. An important property of the Pearson correlation is that, given some expression vector, the probability of some other expression vector having a distance to the original vector below some value is the same regardless of the shape of the original vector; it depends only on the number of conditions (8). On the other hand, given two original vectors, the probability of a test vector having a distance below some value to either one of them does depend on the shapes of the original vectors. As intuition suggests, vectors of opposite shape (like a linear increasing vector and linear decreasing vector) make it more likely for a test vector to cluster well with one of them. To the extent that cluster shape does affect the algorithm, the more similar the cluster shapes, the better the algorithm will perform. Cluster shapes can be graphed and assigned names based on their shape. Data was generated using up to twelve cluster shapes (Figure 1a, 1b).
Figure 1A. Six of the Twelve Cluster Shapes
Figure 1B. The Other Six Cluster Shapes
Panel titles: Linear Gradient (two panels), Concave Up Gradient Up, Concave Down Gradient Up, Concave Up Gradient Down, Concave Down Gradient Down, Cliff Up, Cliff Down. Horizontal axis: conditions 1 to 6.
The method of assigning standard deviations was chosen to be as realistic as possible. Cyber-ANOVA takes into account the fact that standard deviation often changes significantly with absolute intensity in microarray experiments, and the generated data model this phenomenon. The data of Long et al. (5), which is freely available, was graphed, mean intensity vs. standard deviation, and a quadratic regression calculated. This regression equation is an explicit function for standard deviation in terms of mean, and it was used with a modification to calculate standard deviations for the generated data. The modification is required because it is not the case that standard deviation depends solely on intensity. The mean intensity vs. standard deviation graph does not make a perfect line; there is considerable width to that curve, and this too was simulated. The standard deviation itself was randomized with a small meta-standard deviation. The assumption of normality in this generation is probably false, but the change in the algorithm performance caused by introducing the meta-randomization at all is so small that it seems highly unlikely that a better model would produce a significant difference.
To take full advantage of the standard deviation dependence on mean, non-differentially expressed genes were generated with means separated by hundredths between 8 and 16 (on a log2 scale). This created a considerable range of standard deviations for each data set. The means of differentially expressed genes were generated starting from 8 different baselines, the integers from 8 to 15.
Each differentially expressed gene was generated at five different fold changes: 1.5, 1.7, 2.0, 3.0, and 6.0 on an unlogged scale. The six and three fold change genes generally made up the seed clustering group, and the 1.5 and 1.7 fold change genes were responsible for most of the difference in performance between algorithms.
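The generation process described in this section can be sketched as below. Several details are assumptions made only for illustration: the quadratic coefficients in `SD_COEFFS` are invented placeholders (the paper fits them to the Long et al. data but does not list them), the shape is encoded as per-condition values between 0 and 1, the meta-standard deviation of 0.02 is arbitrary, and the standard deviation is re-drawn for every gene and condition.

```python
import math
import random

# Hypothetical coefficients (a, b, c) of sd = a*mean^2 + b*mean + c on a log2 scale.
# They only reproduce the qualitative trend: SD falls as intensity rises.
SD_COEFFS = (0.002, -0.08, 0.9)

def expected_sd(mean, coeffs=SD_COEFFS):
    """Standard deviation predicted from mean intensity by the quadratic model."""
    a, b, c = coeffs
    return a * mean ** 2 + b * mean + c

def condition_means(baseline, shape, fold_change):
    """Per-condition log2 means. `shape` holds values in [0, 1]; a value of 1
    means the full fold change (given on an unlogged scale) above baseline."""
    shift = math.log2(fold_change)
    return [baseline + s * shift for s in shape]

def generate_gene(baseline, shape, fold_change, n_replicates, rng, meta_sd=0.02):
    """Return one list of replicate values per condition, drawn from normal
    distributions whose SD is itself randomized with a small meta-SD."""
    gene = []
    for mean in condition_means(baseline, shape, fold_change):
        sd = max(1e-6, rng.gauss(expected_sd(mean), meta_sd))
        gene.append([rng.gauss(mean, sd) for _ in range(n_replicates)])
    return gene
```

A non-differentially expressed gene is the special case `fold_change=1.0`, which gives a flat line of means; for example `generate_gene(9.0, [0] * 6, 1.0, 2, random.Random(0))` yields six conditions with two replicates each.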
Scoring Algorithms
Once data has been generated and assigned P values using Cyber-ANOVA and combined values using the clustering algorithm, there must be some way to compare the performance. We introduce a new method for determining the performance of an algorithm, which we believe fixes a shortcoming in the standard method.
The most common method of scoring appears to be a consistency test (1-7), which works the following way. Say there is a data set with N genes known to be differentially expressed. First, some significance test is performed and each gene assigned a significance level. Next, the genes are ranked by significance level, most significant first. Out of the top N most significant genes, the number D of genes that are actually differentially expressed is determined, and the score is D / N.
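The consistency test just described takes only a few lines; the function and argument names here are illustrative.

```python
def consistency_score(ranked_genes, truly_de):
    """ranked_genes: gene names, most significant first.
    truly_de: set of genes known to be differentially expressed (N of them).
    Returns D / N, where D counts true genes among the top N ranks."""
    n = len(truly_de)
    d = sum(1 for gene in ranked_genes[:n] if gene in truly_de)
    return d / n
```

Note that the score depends only on which genes fall in the top N: the rankings `["a", "b", "c", "e", "d", "f"]` and `["a", "b", "c", "d", "f", "e"]` both score 2/3 against the true set `{"a", "b", "e"}`, even though the first places the missed gene much higher.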
The problem with the consistency test is that it ignores the fact that it often will matter whether the (N-D) genes are listed consecutively right after the Nth gene or at the very bottom of the list. Biologists would prefer that the differentially expressed genes be listed as close to the top as possible, for it is easier to find an interesting gene or group of genes if it is higher on the list.
As a solution, we introduce an algorithm called rank sum. Rank sum also takes a list of genes ranked by significance level, but computes the score differently. Rank sum is equal to the sum of the ranks of all the N differentially expressed genes minus (1+2+3+...+N), the minimum score possible. Lower rank sums indicate better performance, and a perfect significance test gives a rank sum of zero.
But rank sum has significant flaws, too, for it gives too much weight to the genes placed after the Nth rank. In the worst case, one single misplaced gene could change the score from 0 to the number of genes in the experiment, 8000 in this case. Here is a successful modification. Instead of summing the actual ranks, we sum F(Rank) for all differentially expressed genes, where F is some increasing concave-down function. There is no particular right choice for F, for the preferred function will vary depending on the experiment and experimenter, but both logarithmic and polynomial (to a power less than, say, 1/2) functions give almost identical results.
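Both the plain and the modified rank sum can be written as one function with a pluggable weight. The names are hypothetical, the natural logarithm is just one of the concave choices the text mentions, and subtracting the minimum possible score from the modified version (so that a perfect ranking again scores zero) is an assumption carried over from the plain rank sum.

```python
import math

def rank_sum(ranked_genes, truly_de, weight=lambda rank: rank):
    """Sum of weight(rank) over the truly differentially expressed genes,
    minus the smallest value that sum could take (ranks 1..N)."""
    n = len(truly_de)
    total = sum(weight(rank)
                for rank, gene in enumerate(ranked_genes, start=1)
                if gene in truly_de)
    minimum = sum(weight(rank) for rank in range(1, n + 1))
    return total - minimum

def modified_rank_sum(ranked_genes, truly_de):
    """Rank sum with a concave-down weight (here log) to damp far-misplaced genes."""
    return rank_sum(ranked_genes, truly_de, weight=math.log)
```

With the true set `{"a", "b", "c"}`, moving gene `c` from rank 4 to rank 6 raises the plain rank sum from 1 to 3, but the logarithmic version only from about 0.29 to about 0.69.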
The modified rank sum successfully determines how well a significance test has ranked the genes. But the score from the rank sum itself is not particularly useful information; it must be placed in a context. Since the goal of any significance test is ultimately to reduce the number of replicates necessary to attain accurate results, we convert the rank sum to a new measure that we call equivalent replicate gain. It is based on the equivalent replicate number, the number of replicates needed to reach the same rank sum using only a standard Cyber-ANOVA. Specifically, this is done by running Cyber-ANOVA on several data sets with parameters all identical except for the number of replicates. The graph of number of replicates vs. rank sum is drawn and a regression calculated. The clustering algorithm is then run and a rank sum obtained. That rank sum is entered into the inverse of the regression function to find the approximate number of replicates required to attain the same performance using only Cyber-ANOVA. The equivalent replicate gain is the difference between that number of replicates and the number of replicates actually used. Another useful measure is the equivalent replicate gain percentage, the equivalent replicate gain divided by the equivalent replicate number. This gives an approximation of the percent cost saving possible by using the algorithm and fewer replicates. The combination of the techniques of rank sum and equivalent replicate gain percentage has been very successful.
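The conversion from rank sum to equivalent replicate gain can be sketched as follows. The text does not say what form of regression was used, so the straight-line least-squares fit here is purely an assumption, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def equivalent_replicates(replicates, rank_sums, observed_rank_sum):
    """Invert the (assumed linear) replicates-vs-rank-sum regression for
    Cyber-ANOVA to find the replicate count matching the observed rank sum."""
    slope, intercept = fit_line(replicates, rank_sums)
    return (observed_rank_sum - intercept) / slope

def replicate_gain(equivalent, used):
    """Return (equivalent replicate gain, gain as a fraction of the
    equivalent replicate number)."""
    gain = equivalent - used
    return gain, gain / equivalent
```

As a check on the definitions, an equivalent replicate number of 3.34 obtained with 2 actual replicates gives a gain of 1.34 and a gain percentage of about 40.1%, which matches the corresponding row of Table 2.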

We have generated a number of artificial data sets using a wide range of parameters and found the equivalent replicate gain percentage of the algorithm in many situations. The most important parameters were found to be the number of conditions, the number of cluster shapes, the number of replicates, the number of non-differentially expressed genes in the initial clustering, and the number of clusters omitted from the initial clustering.
In absolutely ideal situations, the algorithm can give an enormous benefit. An experiment with twenty conditions, one cluster shape, two replicates, and perfect initial clustering approaches such an ideal situation and gives excellent results (Table 1).
Number of Conditions    Equivalent Replicate    Equivalent Replicate Gain Percentage
6                       3.43                    41.7%
20                      6.28                    68.1%
Table 1. Performance of the clustering algorithm on data with two numbers of conditions, one cluster shape, two replicates, and perfect initial clustering.
A more realistic case would involve fewer conditions, say, six, and particularly with this smaller number of conditions, the number of cluster shapes becomes a factor in the performance of the algorithm. The performance benefit is still quite significant: recall that equivalent replicate gain percentage is an approximation of the cost saving due to fewer replicates. As expected, the performance benefit is reduced by additional cluster shapes (Table 2).
Number of Cluster Shapes    Equivalent Replicate    Equivalent Replicate Gain Percentage
12                          2.84                    29.6%
2                           3.34                    40.1%
Table 2. Performance of the clustering algorithm on data with two numbers of cluster shapes, using six conditions, two replicates, and perfect initial clustering.
Even when identical parameters are used to generate the data, the rank sum of any algorithm will fluctuate between data sets differing only by random number generation. Fortunately, this variation appears to affect Cyber-ANOVA and the clustering algorithm equally, for the equivalent replicate gain percentage does not change significantly (Table 3). This invariance has the beneficial effect of making it unnecessary to run multiple data sets for each parameter combination; more than one adds little information.
Data Set    Equivalent Replicate    Equivalent Replicate Gain Percentage
1           3.34                    40.1%
2           3.47                    42.3%
3           3.29                    39.2%
4           3.25                    38.5%
Table 3. Performance of the clustering algorithm on data generated randomly four times using identical parameters: six conditions, two replicates, two cluster shapes, and perfect initial clustering.
Change in the standard deviation or in the fold changes of the genes will certainly affect the rank sum from both Cyber-ANOVA and clustering. Over small changes, the equivalent replicate gain percentage is not seriously affected, but if the standard deviation becomes so large as to corrupt the initial clustering, the performance loss is significant (Table 4).
Standard Deviation (in multiples of the normal SD used)    Equivalent Replicate    Equivalent Replicate Gain Percentage
x1                                                         3.34                    40.1%
x2                                                         3.14                    36.3%
x4                                                         2.63                    24.0%
Table 4. Performance of the clustering algorithm on data with three levels of standard deviation, using six conditions, two replicates, two cluster shapes, and perfect initial clustering.
As with Cyber-ANOVA, and in fact all statistical tests, the equivalent replicate gain percentage of the algorithm decreases as the number of replicates becomes larger and the test becomes more accurate (Table 5). If the initial clustering is correct, there is no loss in performance, but in experiments with parameters similar to those of Table 5, the algorithm probably ceases to be worthwhile after four replicates.
Algorithm Tested    Number of Replicates    Equivalent Replicate    Equivalent Replicate Gain Percentage
Clustering          2                       2.84                    29.6%
Clustering          4                       4.74                    15.7%
Clustering          8                       8.12                    1.51%
Cyber-ANOVA         2                       2.93                    31.7%
Cyber-ANOVA         4                       5.23                    23.5%
Cyber-ANOVA         8                       8.75                    8.60%
Table 5. Performance of the clustering algorithm on data with three numbers of replicates, using six conditions, twelve cluster shapes, and perfect initial clustering. The rate of decrease in performance at higher number of replicates is compared with that rate in Cyber-ANOVA. Note that for the clustering algorithm, the equivalent replicate gain percentage is expressed as a gain from Cyber-ANOVA, but that for Cyber-ANOVA, the equivalent replicate gain percentage is expressed as a gain from a standard ANOVA test. Thus, on the scale that Cyber-ANOVA is being compared to, the clustering algorithm result for a two replicate data set is equivalent to the plain ANOVA of four replicates.
The most serious impact to the algorithm's performance occurs if the initial clustering is done incorrectly. The most difficult and important step is choosing the number of genes to include in the initial group; if either too many or too few are chosen, the performance of the algorithm will be diminished significantly. The ultimate goal is to choose a set of genes that includes all the clusters of differentially expressed genes without including any non-differentially expressed genes. It is true that this might be very difficult to do with experimental data, but it is fortunately true that the algorithm is robust to small departures from optimality (Table 6). The only time that the algorithm's performance drops below the performance of Cyber-ANOVA is when almost all of the clusters are left out.
Number of         Number of          Number of             Number of [non-DE]    [Equivalent    Replicate
Genes Selected    False Positives    [clusters omitted]    genes included        Replicate]     Gain
3                 0.005              10                    0                     1.57           -21.6
10                0.02               9                     0                     1.93           -3.3
20                0.05               5                     0                     2.11           5.18
30                0.09               3                     0                     2.12           5.84
50                0.24               0                     0                     2.34           14.7
100               1.3                0                     0                     2.63           23.9
300               91                 0                     1                     2.55           21.4
400               300                0                     23                    2.37           15.8
500               496                0                     95                    2.09           4.33
Table 6. Performance of the clustering algorithm on data with several expected false positive
rates for the initial clustering, using six conditions, two replicates, and twelve cluster shapes.

When the correlation between distance and significance is strong, the potential cost saving
of the algorithm is substantial. It is interesting to note that for normal data with
two replicates such as that in Table 5, the percent equivalent replicate gain between Cyber-ANOVA
and Clustering is roughly equal to the performance gain between standard ANOVA and
Cyber-ANOVA, on the order of 30%.
This performance degrades in some cases, but the same is true of Cyber-ANOVA
and of other algorithms (3, 4). One situation that causes the percentage equivalent
replicate gain of all these algorithms to decrease is a higher number of replicates. Unfortunately,
the performance decreases more rapidly in the clustering algorithm. This phenomenon is
perhaps best explained by a quality of information argument: additional replicates increase the
quality of information of the P values, but have little effect on the quality of the information of
the D values. Additional replicates will cause differentially expressed genes to cluster better, in a
sense reducing the number of false negatives, but will not reduce the false positives, because
non-differentially expressed genes are just as likely to randomly have low distances given any
number of replicates. Thus, as the number of replicates increases, the D values gradually cause
less improvement.
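
The quality of information argument can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the D value is taken to be a correlation distance (1 minus the Pearson correlation) between a gene's averaged expression vector and a single cluster shape, the noise is Gaussian, and the shape, noise level, and gene count are arbitrary illustrative choices, not values from the experiments reported here.

```python
import math
import random


def correlation_distance(u, v):
    """Return 1 - Pearson correlation (0 = same shape, 2 = opposite shape)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    norm = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    if norm == 0.0:
        return 1.0  # a flat vector carries no shape information
    return 1.0 - sum(a * b for a, b in zip(du, dv)) / norm


def mean_distance(shape, signal, replicates, genes, rng, sd=2.0):
    """Average distance to `shape` over simulated genes.

    signal=1.0 simulates differentially expressed genes that follow the
    cluster shape; signal=0.0 simulates non-differentially expressed genes.
    """
    total = 0.0
    for _ in range(genes):
        profile = [
            signal * s
            + sum(rng.gauss(0.0, sd) for _ in range(replicates)) / replicates
            for s in shape
        ]
        total += correlation_distance(profile, shape)
    return total / genes


rng = random.Random(0)
shape = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical six-condition cluster shape

de_2 = mean_distance(shape, 1.0, 2, 2000, rng)
de_8 = mean_distance(shape, 1.0, 8, 2000, rng)
null_2 = mean_distance(shape, 0.0, 2, 2000, rng)
null_8 = mean_distance(shape, 0.0, 8, 2000, rng)

print("differentially expressed:", de_2, de_8)
print("non-differentially expressed:", null_2, null_8)
```

In this toy setting the average distance of the differentially expressed genes shrinks as replicates are added, while the average distance of the non-differentially expressed genes stays near 1 for any number of replicates, because averaging pure noise changes its scale but not the distribution of its shape.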
The most serious degradation of the clustering algorithm's performance, however, occurs in a
situation that does not affect the other algorithms: bad initial clustering. Reducing the
severity of this problem will be the primary direction for further research on this algorithm.
Some possible solutions include (1) clustering all the genes but weighting their significance in the
clustering algorithm by P value; and (2) setting the number of genes to use as seeds by trying
many values and finding the point above which new clusters stop appearing. Regardless
of whether this issue can be resolved in the general case, if in some experiment it is strongly
suspected (perhaps on biological grounds) that no clusters have been omitted from the initial set
of genes, this clustering method can be used and safely expected to produce a substantial
performance benefit.
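
Proposed solution (2) could be sketched as follows. This is a minimal illustration, not the clustering procedure used in this research: it assumes genes are already ranked by P value, substitutes a simple greedy "leader" clustering with a correlation-distance threshold, and stops at the first seed count after which the number of clusters stays flat. All names, the threshold, and the `patience` parameter are hypothetical.

```python
import math


def correlation_distance(u, v):
    """Return 1 - Pearson correlation between two expression vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    norm = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    if norm == 0.0:
        return 1.0
    return 1.0 - sum(a * b for a, b in zip(du, dv)) / norm


def count_clusters(vectors, threshold):
    """Greedy leader clustering: a vector starts a new cluster when it is
    farther than `threshold` from every existing cluster leader."""
    leaders = []
    for vec in vectors:
        if all(correlation_distance(vec, lead) > threshold for lead in leaders):
            leaders.append(vec)
    return len(leaders)


def choose_seed_count(ranked_vectors, candidates, threshold, patience=2):
    """Return the first candidate seed count after which the cluster count
    stays flat for `patience` further candidates; otherwise the largest."""
    counts = [count_clusters(ranked_vectors[:n], threshold) for n in candidates]
    for i in range(len(candidates) - patience):
        if all(c == counts[i] for c in counts[i + 1:i + 1 + patience]):
            return candidates[i]
    return candidates[-1]


# Toy data: three cluster shapes over six conditions, genes ranked by P value.
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
B = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
C = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]
ranked = [A, A, B, A, B, C, C, A, B, C]

chosen = choose_seed_count(ranked, candidates=[2, 4, 6, 8, 10], threshold=0.5)
print(chosen)
```

On real data the cluster count would rarely become perfectly flat, since non-differentially expressed genes admitted at large seed counts can form spurious clusters, so a tolerance on the growth rate might be a more practical stopping rule than exact equality.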
1. Dudoit, S., Yang, Y., Callow, M.J., and Speed, T.P. (2000) Statistical methods for identifying
differentially expressed genes in replicated cDNA microarray experiments. Technical report
#578, Stat Dept, UC-Berkeley.
2. Knudsen, S. (2002) A Biologist's Guide to Analysis of DNA Microarray Data. John Wiley & Sons.
3. Pan, W. (2002) A comparative review of statistical methods for discovering differentially
expressed genes in replicated microarray experiments. Bioinformatics 18(4):546-54.
4. Tusher, V.G., Tibshirani, R., and Chu, G. (2001) Significance analysis of microarrays applied
to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98:5119-5121.
5. Long, D., Mangalam, H., Chan, B., Tolleri, L., Hatfield, G., and Baldi, P. Improved statistical
inference from DNA microarray data using analysis of variance and a Bayesian statistical
framework. Journal of Biological Chemistry 276(3):19937-19944.
6. Baldi, P., and Brunak, S. (1998) Bioinformatics: The Machine Learning Approach. MIT
Press, Cambridge MA.
7. Baldi, P., and Long, A.D. (2001) A Bayesian framework for the analysis of microarray
expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics
8. Kreyszig, E. (1970) Mathematical Statistics: Principles and Methods. John Wiley & Sons,
9. Ewens, W., and Grant, G. (2001) Statistical Methods in Bioinformatics: An Introduction.
Springer-Verlag, NY.
10. Zhao, R., Gish, K., Murphy, M., Yin, Y., Notterman, D., Hoffman, W., Tom, E., Mack, D.,
and Levine, A. (2000) Analysis of p53-regulated gene expression patterns using oligonucleotide
arrays. Genes and Development 14(8):981-993.
11. Jiang, M., Ryu, J., Kiraly, M., Duke, K., Reinke, V., and Kim, S.K. (2001) Genome-wide analysis of
developmental and sex-regulated gene expression profiles in Caenorhabditis elegans. Proc. Natl.
Acad. Sci. USA 98(1):218-23.
12. Duda, R.O., and Hart, P.E. (1973) Pattern Classification and Scene Analysis. John Wiley and Sons.
13. Sharan, R., and Shamir, R. CLICK: a clustering algorithm with applications to gene
expression analysis. In Proceedings of the 2000 Conference on Intelligent Systems for
Molecular Biology (ISMB00), La Jolla, CA, 307-316.
14. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., and Levine, A.J.
(1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal
colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96:6745-6750.
15. Sherlock, G. Analysis of large-scale gene expression data. Curr Opin Immunol 2000
