Improved Statistical Test

Published by Jared Friedman

My Intel paper. Now it is finally published.

https://pt.scribd.com/doc/1/Improved-Statistical-Test

01/29/2016
**Multiple-Condition Microarray Data Integrating ANOVA with Clustering**

Jared Friedman¹

This research was done at The Rockefeller University and supervised by Brian Kirk²

1. 180 East End Ave, New York, NY, 10128. Email: vicvic@att.net. Phone: 212-249-0134

2. Email: KirkB@Rockefeller.edu. Phone: 212-327-7064

Summary

Microarrays are exciting new biological instruments. Microarrays promise to have important applications in many fields of biological research, but they are currently fairly inaccurate instruments and this inaccuracy increases their expense and reduces their usefulness. This research introduces a new technique for analyzing data from some types of microarray experiments that promises to produce more accurate results. Essentially, the method works by identifying patterns in the genes.

Abstract

Motivation: Statistical significance analysis of microarray data is currently a pressing problem. A multitude of statistical techniques have been proposed, and in the case of simple two-condition experiments these techniques work well, but in the case of multiple-condition experiments there is additional information that none of these techniques takes into account. This information is the shape of the expression vector for each gene, i.e., the ordered set of expression measurements, and its usefulness lies in the fact that genes that are actually affected biologically by some experimental circumstance tend to fall into a relatively small number of clusters. Genes that do appear to fall into these clusters should be selected for by the significance test; those that do not should be selected against.

Results: Such a test was successfully designed and tested using a large number of artificially generated data sets. Where the above assumption of the correlation between clustering and significance is true, this test gives considerably better performance than conventional measures. Where the assumption is not entirely true, the test is robust to the deviation.

Introduction

Microarrays are part of an exciting new class of biotechnologies that promise to allow monitoring of genome-wide expression levels (1). Although the technique presented in this paper is quite general and would, in principle, work with other types of data, it has been developed for microarray data. Microarray experiments work by testing expression levels of some set of genes in organisms exposed to at least two different conditions, and usually then trying to determine which (if any) of these genes are actually changing in response to the change in experimental condition (2). Unfortunately, the random variation in microarrays is often large relative to the changes one is trying to detect. This makes it very hard to eliminate false positives, genes whose random fluctuations make them erroneously seem to respond to the experimental condition, and false negatives, genes whose random fluctuations mask the fact that they are actually responding to the experimental condition. It is standard statistical practice to handle this problem by using replicates, i.e., several microarrays run in identical settings (1-9). Although replicates are highly effective, the expense of additional replicates makes it worthwhile to minimize the number needed by using more effective statistical tests.

When there are only two conditions in an experiment, the conventional statistical test for differential expression is the t-test (2,6). But Long et al. (5, also 6,7) realized that by running a separate t-test on each gene, valuable information is lost, namely the fact that the actual population standard deviation of all the genes is rather similar. Due to the small number of replicates in microarray experiments there are usually a few genes that, by chance, appear to have extremely low standard deviations and thus register even a very small change as highly significant by a conventional t-test. In reality, and as additional replicates would confirm, the standard deviations of these genes are much higher, and the change is not at all significant. In response to this issue, Long et al. used Bayesian statistics to derive a method of transforming standard deviations that, in essence, moves outlying standard deviations closer to the mean standard deviation. Unfortunately, the resulting algorithm, which they call Cyber-T, works only with two-condition experiments. Since this research is concerned with multiple-condition experiments, it was necessary to expand the method to work with this type of experiment, the standard statistical test for which is called ANOVA (ANalysis Of VAriance) (8). Here we report the successful extension of the Cyber-T algorithm to multiple-condition experiments, creating the algorithm we call Cyber-ANOVA.

In microarray experiments involving several conditions, the expression vector of each gene formed by consecutive measurements of its expression level has a distinctive shape, which contains useful information. The more conditions in the experiment (and thus numbers in the expression vector), the more distinctive will be the shape. Often biologists are interested in finding genes with expression vectors of similar shape, and this interest has generated a well-developed process for grouping genes by shape known as clustering (9). The basis of clustering techniques is the concept of correlation, the similarity between two vectors. We chose to use the Pearson correlation, because this measure generally captures more accurately the actual biological idea of correlation (2). With the Pearson correlation, as elsewhere, distance is defined as one minus the correlation and measures the dissimilarity between two vectors. Using this measure of distance, it is possible to solve the problem of taking a large number of unorganized vectors and placing them into clusters of similar shapes: this is known as K-means clustering (12). Once the K-means algorithm has been completed, it is often useful to determine how well each gene has clustered. This is done by computing a representative vector, often called a center, for each cluster and calculating the distance between each gene and the center of that gene's cluster (14,15).
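As an illustration only, the distance measure and the cluster center described above can be sketched in Python. The function names are hypothetical, and taking the element-wise mean as the center is an assumption (the text says only that a representative vector is computed).

```python
import math

def pearson_correlation(u, v):
    """Pearson correlation of two equal-length expression vectors.

    Note: a perfectly flat vector has zero variance, so this sketch
    would divide by zero for it; real code needs to handle that case.
    """
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def pearson_distance(u, v):
    """Distance = 1 - correlation: 0 for identical shapes, 2 for opposite ones."""
    return 1.0 - pearson_correlation(u, v)

def cluster_center(vectors):
    """Assumed center: the element-wise mean of the cluster's vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

Because the Pearson correlation ignores offset and scale, two genes that rise in the same pattern have a distance near zero even if one is expressed at a much higher level than the other.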

The fundamental thesis of this research is that genes actually affected by experimental circumstances tend to fall into a relatively small number of clusters, and that this information can be used to make a more accurate statistical test. While it is in principle possible, it is not often the case that large numbers of affected genes behave in seemingly random and entirely dissimilar patterns. Given all the clusters of clearly significant genes, genes that fit well in a cluster are more likely to be differentially expressed than those that do not. Unfortunately, this hypothesis is very difficult to test, for no microarray experiments have been replicated enough to allow for accurate determination of differentially expressed genes. The studies that have come closest to this conclusion are Dao et al. (10) and Jiang et al. (11). Both studies ran K-means clustering on multiple-condition data and found several distinct cluster shapes that accounted for all clearly significant genes.

We used this connection between clustering and significance to create a more accurate statistical test whose essential steps are the following. First, Cyber-ANOVA is run on the entire dataset and the highly significant genes are clustered. The other genes are placed into the cluster in which they fit best, and the distance between each gene and the center of its cluster is computed. There are now two values for each gene, the P value (probability) from the Cyber-ANOVA test and the D value (distance) from the clustering. These two values are combined and the resulting value is more accurate than either one alone.

The potential advantage of this algorithm was validated by the creation and successful testing of a large number of artificial data sets. Artificial data was chosen over real experimental data for two main reasons: (1) in experimental data, uncertainty always exists over which genes are actually differentially expressed; this presents major problems in determining how well some given algorithm works; and (2) no single microarray experiment is actually representative of all possible microarray experiments, so to determine accurately the performance advantage of an algorithm, a large number of very different experiments must be used, a prohibitively expensive and difficult process. Artificial creation of data sets allows every parameter of the data to be perfectly controlled, and the algorithm's performance can be tested on every possible combination of parameters.

The basic method of analysis was to generate an artificial data set from a number of parameters, run the clustering algorithm on the data and measure its performance, run Cyber-ANOVA on that data and on similar data but with additional replicates and measure its performance, and then compare the results. The performance gain of the clustering algorithm is probably best expressed as equivalent replicate gain: the number of additional replicates that would be needed to produce the results of the clustering algorithm using a standard Cyber-ANOVA. On reasonable simulated data, we found that the benefit was approximately one additional replicate added to an original two. With additional improvements to the algorithm design, this saving could probably be increased still further, and the advantage is clearly significant enough to warrant use in actual microarray experiments.

Methods

Algorithm Design

The algorithm is implemented as a large (~2000 lines) macro script in Microsoft Visual Basic for Applications controlling Microsoft Excel. The essential steps of the algorithm are the following:

1. Run Cyber-ANOVA on all genes;
2. Adjust P values for multiple testing;
3. Use the results of (2) to choose an appropriate number of most significant genes;
4. Cluster this selection of genes using K-means;
5. Take the remaining genes and group each into the cluster with which it correlates best;
6. Compute the distance between each gene and the center of its cluster;
7. Compute a combined measurement of significance for each gene.

1. Run Cyber-ANOVA on all genes. Cyber-ANOVA is just a simple extension of Cyber-T, but its novelty causes it to warrant some discussion. Cyber-T takes as input a list of genes, means in each condition, standard deviations in each condition, and number of replicates. It gives as output a set of P values much like those from a conventional t-test, but more accurate. The improvement is based on the idea of regularizing the standard deviations, essentially moving outlying standard deviations closer to the mean. At the same time, it recognizes that standard deviation may change significantly (and inversely) with intensity level, and thus genes are regularized only with genes of similar intensity. The algorithm works by first computing for each gene a background standard deviation, the average standard deviation of the hundred genes with closest intensity, and then combining the background standard deviation with the observed standard deviation in a type of weighted average. A standard t-test is then run, but using the regularized standard deviation instead of the observed one.
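The regularization idea can be sketched as follows. This is a minimal illustration, not the Cyber-T implementation: the exact weighting formula is not given in the text, so the variance-weighted average below and its `prior_weight` parameter are assumptions, and the function names are hypothetical.

```python
import math

def background_sd(intensities, sds, index, window=100):
    """Average SD of the `window` genes closest in intensity to gene `index`.

    In this sketch the gene itself counts as one of its own neighbours.
    """
    target = intensities[index]
    nearest = sorted(range(len(sds)),
                     key=lambda i: abs(intensities[i] - target))[:window]
    return sum(sds[i] for i in nearest) / len(nearest)

def regularized_sd(observed_sd, bg_sd, n_replicates, prior_weight=10):
    """Assumed weighting: blend the two variances, trusting the observed
    one more as the number of replicates grows."""
    numerator = prior_weight * bg_sd ** 2 + (n_replicates - 1) * observed_sd ** 2
    denominator = prior_weight + n_replicates - 1
    return math.sqrt(numerator / denominator)
```

With two replicates and an observed standard deviation far below the background value, the regularized value lands close to the background, which is what prevents a chance-low standard deviation from making a tiny change look highly significant.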

Cyber-ANOVA works by the same method, taking as input means and standard deviations in any number of conditions. Each background standard deviation is calculated using the intensities of the genes in their own condition, and the values are combined using the same formula. But because there are more than two conditions, a t-test cannot be applied; instead we use ANOVA, inputting the regularized standard deviation instead of the observed one. The result gives a dramatic increase in accuracy, an equivalent replicate gain of roughly one.
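For readers who want to see how ANOVA can be run from summary statistics alone, here is a hedged sketch of the one-way F statistic computed from per-condition means and standard deviations. It assumes the same number of replicates in every condition, and the function name is hypothetical; in the Cyber-ANOVA setting the standard deviations passed in would be the regularized ones.

```python
def anova_f_from_summary(means, sds, n_replicates):
    """One-way ANOVA F statistic from per-condition summary statistics.

    Assumes equal replicate counts, so the grand mean is the mean of the
    condition means and the within-group variance is the mean of the
    per-condition variances.
    """
    k = len(means)
    grand_mean = sum(means) / k
    ms_between = n_replicates * sum((m - grand_mean) ** 2 for m in means) / (k - 1)
    ms_within = sum(s ** 2 for s in sds) / k
    return ms_between / ms_within
```

The P value then follows from the F distribution with k - 1 and k(n - 1) degrees of freedom, for example through `scipy.stats.f.sf` if SciPy is available. Whether the degrees of freedom should be adjusted for the regularization is not stated in the text.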

2. Adjust P values for multiple testing. To run the clustering algorithm, we need to select a number of genes that are highly significant to form the seed clusters around which all other genes will cluster. The difficult step in the process is deciding how many genes to select: it is critical to get at least a few seeds for all clusters but also critical not to have too many false positives. Since the first requirement is impossible to determine by calculation, we use the second. In order to choose an appropriate number of expected false positives, we must have an accurate estimate of the false positive rate. The simplest way of estimating the false positive rate is a Bonferroni correction, which simply multiplies each P value by the total number of tests. This method assumes that all genes are independent, and tends to be too conservative, particularly with small numbers of replicates, and a number of more accurate techniques have been proposed. However, for this step, only a rough estimate is necessary, and thus the Bonferroni correction suffices.
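The Bonferroni correction is simple enough to show in a few lines. The sketch below uses a hypothetical function name and caps the adjusted values at 1, a common convention that the text does not mention.

```python
def bonferroni(p_values):
    """Multiply each P value by the number of tests, capping at 1.0."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]
```

For a data set of several thousand genes, a raw P value must therefore be very small to survive the correction, which is why the method is described as conservative.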

3. Use the results of (2) to choose an appropriate number of most significant genes. Once the expected false positive rate for any number of the most significant genes has been estimated, some number of genes must be selected such that the number of expected false positives is some percentage of the number of selected genes. There is no particular correct way to define this percentage; it depends on how significant the genes in the least significant cluster are, which cannot be determined directly. However, we have found that the algorithm is tolerant to deviations from optimality (Table 6).
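One possible reading of this selection rule is sketched below. It is an interpretation, not the original procedure: the expected number of false positives among the top k genes is estimated as the k-th smallest P value times the number of tests, and the largest k that keeps this below a chosen fraction of k is used. The function name and the default fraction of 5% are hypothetical.

```python
def select_seed_genes(p_values, max_false_fraction=0.05):
    """Return indices of the most significant genes to use as seeds.

    Assumed rule: keep the largest k for which (k-th smallest P value)
    * (number of tests) <= max_false_fraction * k.
    """
    n_tests = len(p_values)
    order = sorted(range(n_tests), key=lambda i: p_values[i])
    best_k = 0
    for k, gene_index in enumerate(order, start=1):
        expected_false_positives = p_values[gene_index] * n_tests
        if expected_false_positives <= max_false_fraction * k:
            best_k = k
    return order[:best_k]
```

Raising `max_false_fraction` admits more seed genes at the cost of more noise in the initial clustering, which is the trade-off the paragraph above describes.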

4. Cluster this selection of genes using K-means. The chosen group of most significant genes is clustered using the K-means algorithm with the Pearson correlation. The primary problem with all K-means clustering algorithms is that the number of clusters must be declared at the start (13). But like the determination of the number of genes to include in the initial clustering, there is no problem as long as K is rather high. This can be explained conceptually by considering what happens to differentially expressed and non-differentially expressed genes as K increases. If there are only differentially expressed genes in the original set, then as K is raised, genes that were originally placed in one cluster will separate into multiple clusters. Fortunately, this is guaranteed by the process of K-means to lower the D values and create better cluster shapes. If there are non-differentially expressed genes, then as K is increased these genes will tend to separate out from the differentially expressed genes and form their own clusters, a highly desirable occurrence. Unfortunately, if K becomes very large, the noise genes may start to cluster very well with other noise genes, forming erroneously low distances, and the other noise genes added in the next step will have a better chance of incidentally finding a cluster with which they correlate well. In principle, too large a K might be partially monitored by checking for very small clusters and for clusters with very high average distances, both of which could be removed and their genes forced to join a different cluster. These methods have not yet been implemented.
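A compact sketch of K-means with the Pearson distance is given below, under stated assumptions: centers are initialized from the first K vectors (an arbitrary choice made here so the example is deterministic), centers are updated as element-wise means, and a flat vector is treated as uncorrelated with everything. None of these details are specified in the text, and the names are hypothetical.

```python
import math

def pearson_distance(u, v):
    """1 - Pearson correlation; flat vectors are treated as uncorrelated."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm = math.sqrt(sum((a - mean_u) ** 2 for a in u) *
                     sum((b - mean_v) ** 2 for b in v))
    return 1.0 if norm == 0 else 1.0 - cov / norm

def kmeans_pearson(vectors, k, max_iterations=100):
    """Cluster expression vectors by shape; returns (labels, centers)."""
    # Assumption: start from the first k vectors. Shuffle the input first
    # if a random start is preferred.
    centers = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iterations):
        new_labels = [min(range(k), key=lambda c: pearson_distance(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

The monitoring ideas mentioned above (dropping very small clusters or clusters with high average distance) would be applied to the returned labels and centers; they are not part of this sketch.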

5. Take the remaining genes and group each into the cluster with which it correlates best. The clusters formed in step 4 act as seeds for the remaining genes. The clusters formed optimally include all of the cluster shapes of actually differentially expressed genes and no others. It was possible to form these clusters, even if they were less than ideal, by using only a subset of the genes in the data set. Now it is possible to look at the remaining genes and to see how well each belongs in one of the clusters. Ideally, each differentially expressed gene would correlate perfectly with exactly one cluster and each noise gene would correlate poorly with all of them. In practice, of course, this is far from the case, but the principle remains. Since each gene is, of course, only expected to correlate well with one cluster, each gene in the rest of the dataset is added to the cluster with which it correlates best.

6. Compute the distance between each gene and the center of its cluster. The measurement of how well each gene belongs in its cluster is now determined by computing the Pearson correlation between each gene and the center of its cluster. The resulting correlations are converted to distances (distance = 1 - correlation). These distance values are the key numbers produced by the first half of the algorithm. They represent whether a gene vector's shape is similar to the shape of gene vectors known to be significant. The basis of this research is that low distances tend to imply differentially expressed genes. Unfortunately, distance values alone are not an accurate measure of significance: ranking genes by distance alone produces results far worse than the original Cyber-ANOVA. Instead, the distance ("D") values must be combined with the original P values to give new values better than either alone.
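Steps 5 and 6 amount to a nearest-center lookup. The sketch below, with hypothetical names, assigns each remaining gene to the seed cluster it correlates best with and records the resulting D value; treating flat vectors as uncorrelated is an assumption.

```python
import math

def pearson_distance(u, v):
    """1 - Pearson correlation; flat vectors are treated as uncorrelated."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm = math.sqrt(sum((a - mean_u) ** 2 for a in u) *
                     sum((b - mean_v) ** 2 for b in v))
    return 1.0 if norm == 0 else 1.0 - cov / norm

def assign_to_clusters(genes, centers):
    """For each gene return (index of best cluster, D value to its center)."""
    results = []
    for gene in genes:
        distances = [pearson_distance(gene, center) for center in centers]
        best = min(range(len(centers)), key=distances.__getitem__)
        results.append((best, distances[best]))
    return results
```

Since the best-correlating cluster is by definition the one with the smallest distance, the D value of a gene is simply the minimum of its distances to all seed centers.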

7. Compute a combined measurement of significance for each gene. The goal now is to take the P value and the D value and combine them in some way to make a new value. Just like with the calculation of the previous probabilities, what we are really interested in is finding the probability of a non-differentially expressed gene with a certain standard deviation attaining both a P value less than or equal to the observed P value and a D value less than or equal to the observed D value. Unfortunately, the theoretical basis for the connection between P and D values is not well defined, and it may be very complicated and experiment-dependent; there may well be no useful analytic solution to this problem. Instead, we generate the distribution of P and D values for each experiment. First, thousands of non-differentially expressed genes are generated using the standard deviation that would be expected at their intensity level, and their P and D values are calculated. Then, for each gene in the data set, the number of non-differentially expressed genes that have both P and D values equal to or lower than the P and D values of that gene is counted. That number is divided by the number of non-differentially expressed genes generated, and that quotient gives an approximate probability.
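The counting procedure of step 7 can be written directly. In this sketch the simulated null genes are represented only by their P and D values, and the function name is hypothetical.

```python
def combined_significance(p_observed, d_observed, null_p_values, null_d_values):
    """Fraction of simulated non-differentially expressed genes whose P and D
    values are both less than or equal to the observed gene's values."""
    count = sum(1 for p, d in zip(null_p_values, null_d_values)
                if p <= p_observed and d <= d_observed)
    return count / len(null_p_values)
```

The resolution of the result is limited by the size of the simulation: with M simulated genes, no value between 0 and 1/M can be returned, so sufficiently significant genes all receive a combined value of zero.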

It is important to note that for highly significant genes, this method does not give accurate probabilities, for if 10⁵ non-differentially expressed genes are generated, only genes with an actual probability of occurring of about 10⁻⁵ or more are likely to be generated. Thus, genes in the data set that are actually significant to a probability of below 10⁻⁵ will almost certainly not have any generated genes more significant and will all be assigned probabilities of zero. But this is not a major concern, since these genes are so significant that there is little question that they are differentially expressed, and their precise rank is not likely to be important. If it is, then the genes with zero combined values can be ranked internally by their P values. In this case, within the very most significant genes, the algorithm will not have had any effect.

Generation of the Data

To test the performance of the algorithm, a number of artificial data sets were generated. The goal was to see how much benefit could be derived from the algorithm and how different parameter combinations would impact the results. Above all, the data is intended to mimic real microarray data, at least in all respects that are likely to significantly influence the results of the test. The basic process of data generation was to use certain parameters to create a mean value and standard deviation for each gene and condition and to use a random number generator assuming a normal distribution to generate plausible experimental values. Some of the most important parameters are the numbers of differentially and non-differentially expressed genes, number of groups, cluster shapes, standard deviations, and fold changes.

For each data set, about 8500 genes were created, with the number of differentially expressed genes ranging from about 100 to about 500. The rest were genes whose mean value did not change across conditions, but that mean value did vary between different non-differentially expressed genes. The number of groups used is a key determinant of the performance of the algorithm, more groups causing better performance. Most of the data sets generated used six groups, but a set with twenty groups was tried.

The cluster shapes of non-differentially expressed genes are, of course, flat lines, for that is the definition of being non-differentially expressed. The cluster shapes of the differentially expressed genes were more complicated. They are drawn loosely from Dao et al. (10), but really the specific shapes chosen are not all that important. An important property of the Pearson correlation is that, given some expression vector, the probability of some other expression vector having a distance to the original vector below some value is the same regardless of the shape of the original vector; it depends only on the number of conditions (8). On the other hand, given two original vectors, the probability of a test vector having a distance below some value to either one of them does depend on the shapes of the original vectors. As intuition suggests, vectors of opposite shape (like a linear increasing vector and linear decreasing vector) make it more likely for a test vector to cluster well with one of them. To the extent that cluster shape does affect the algorithm, the more similar the cluster shapes, the better the algorithm will perform. Cluster shapes can be graphed and assigned names based on their shape. Data was generated using up to twelve cluster shapes (Figure 1A, 1B).

Figure 1A. Six of the Twelve Cluster Shapes. (Plot of Log Expression Level, 5 to 11, against Condition, 1 to 6. Shapes shown: Spike, Dip, Cliff Up, Cliff Down, Hill, Valley.)

Figure 1B. The Other Six Cluster Shapes. (Plot of Log Expression Level, 5 to 11, against Condition, 1 to 6. Shapes shown: Linear Gradient Up, Linear Gradient Down, Concave Up Gradient Up, Concave Down Gradient Up, Concave Up Gradient Down, Concave Down Gradient Down.)

The method of assigning standard deviations was chosen to be as realistic as possible. Cyber-ANOVA takes into account the fact that standard deviation often changes significantly with absolute intensity in microarray experiments, and the generated data model this phenomenon. The data of Long et al. (5), which is freely available, was graphed as mean intensity vs. standard deviation, and a quadratic regression was calculated. This regression equation is an explicit function for standard deviation in terms of mean, and it was used with a modification to calculate standard deviations for the generated data. The modification is required because it is not the case that standard deviation depends solely on intensity. The mean intensity vs. standard deviation graph does not make a perfect line; there is considerable width to that curve, and this too was simulated. The standard deviation itself was randomized with a small meta-standard deviation. The assumption of normality in this generation is probably false, but the change in the algorithm performance caused by introducing the meta-randomization at all is so small that it seems highly unlikely that a better model would produce a significant difference.

To take full advantage of the standard deviation dependence on mean, non-differentially expressed genes were generated with means separated by hundredths between 8 and 16 (on a log2 scale). This created a considerable range of standard deviations for each data set. The means of differentially expressed genes were generated starting from 8 different baselines, the integers from 8 to 15.

Each differentially expressed gene was generated at five different fold changes: 1.5, 1.7, 2.0, 3.0, and 6.0 on an unlogged scale. The six- and three-fold change genes generally made up the seed clustering group, and the 1.5- and 1.7-fold change genes were responsible for most of the difference in performance between algorithms.
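The generation process can be illustrated with a short sketch. Everything numeric here is a placeholder: the quadratic coefficients in `sd_from_mean` are invented for illustration (the fitted regression is not given in the text), the cluster shape is represented as a list of values between 0 and 1, and the meta-standard deviation step is omitted. All names are hypothetical.

```python
import math
import random

def sd_from_mean(mean, coefficients=(0.002, -0.08, 1.0)):
    """Hypothetical quadratic SD-vs-intensity curve (placeholder coefficients),
    chosen only so that SD falls as log2 intensity rises from 8 to 16."""
    a, b, c = coefficients
    return max(0.05, a * mean * mean + b * mean + c)

def generate_gene(baseline, shape, fold_change, n_replicates, rng):
    """Simulate one gene: a list of conditions, each a list of replicates.

    `shape` gives, per condition, the fraction (0 to 1) of the full change
    applied; the full change is log2(fold_change) above the baseline.
    """
    amplitude = math.log2(fold_change)
    gene = []
    for fraction in shape:
        mean = baseline + amplitude * fraction
        sd = sd_from_mean(mean)
        gene.append([rng.gauss(mean, sd) for _ in range(n_replicates)])
    return gene
```

A non-differentially expressed gene corresponds to a shape of all zeros, for example `generate_gene(10.0, [0] * 6, 1.0, 2, random.Random(0))`, while a "cliff up" gene might use the shape `[0, 0, 0, 1, 1, 1]`.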

Scoring Algorithms

Once data has been generated and assigned P values using Cyber-ANOVA and combined values using the clustering algorithm, there must be some way to compare the performance. We introduce a new method for determining the performance of an algorithm, which we believe fixes a shortcoming in the standard method.

The most common method of scoring appears to be a consistency test (1-7), which works the following way. Say there is a data set with N genes known to be differentially expressed. First, some significance test is performed and each gene assigned a significance level. Next, the genes are ranked by significance level, most significant first. Out of the top N most significant genes, the number D of genes that are actually differentially expressed is determined, and the score is D / N.
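The consistency test is easy to state in code. In this sketch genes are identified by arbitrary labels and the function name is hypothetical.

```python
def consistency_score(ranked_genes, truly_expressed):
    """D / N: the fraction of the top N ranked genes that are truly
    differentially expressed, where N is the number of such genes."""
    n = len(truly_expressed)
    hits = sum(1 for gene in ranked_genes[:n] if gene in truly_expressed)
    return hits / n
```

Note that the score looks only at which genes fall inside the top N; where the missed genes ended up further down the list has no effect on it.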

The problem with the consistency test is that it ignores the fact that it often will matter whether the (N-D) genes are listed consecutively right after the Nth gene or at the very bottom of the list. Biologists would prefer that the differentially expressed genes be listed as close to the top as possible, for it is easier to find an interesting gene or group of genes if it is higher on the list.

As a solution, we introduce an algorithm called rank sum. Rank sum also takes a list of genes ranked by significance level, but computes the score differently. Rank sum is equal to the sum of the ranks of all the N differentially expressed genes minus (1 + 2 + 3 + … + N), the minimum score possible. Lower rank sums indicate better performance, and a perfect significance test gives a rank sum of zero.

But rank sum has significant flaws, too, for it gives too much weight to the genes placed after the Nth rank. In the worst case, one single misplaced gene could change the score from 0 to the number of genes in the experiment, 8000 in this case. Here is a successful modification. Instead of summing the actual ranks, we sum for all differentially expressed genes F(Rank), where F is some increasing concave-down function. There is no particular right choice for F, for the preferred function will vary depending on the experiment and experimenter, but both logarithmic and polynomial (to a power less than, say, 1/2) functions give almost identical results.
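Both scores can be sketched briefly. The names are hypothetical, and one detail is an assumption: the modified score below subtracts the minimum possible value so that a perfect ranking still scores zero, which the text states only for the unmodified rank sum.

```python
import math

def rank_sum(ranked_genes, truly_expressed):
    """Sum of the 1-based ranks of the truly expressed genes, minus the
    minimum possible sum 1 + 2 + ... + N."""
    n = len(truly_expressed)
    total = sum(rank for rank, gene in enumerate(ranked_genes, start=1)
                if gene in truly_expressed)
    return total - n * (n + 1) // 2

def modified_rank_sum(ranked_genes, truly_expressed, f=math.log):
    """Same idea with an increasing concave-down function f applied to each
    rank (natural log by default). Subtracting the minimum is an assumption."""
    n = len(truly_expressed)
    total = sum(f(rank) for rank, gene in enumerate(ranked_genes, start=1)
                if gene in truly_expressed)
    minimum = sum(f(rank) for rank in range(1, n + 1))
    return total - minimum
```

A polynomial alternative can be passed as `f=lambda r: r ** 0.4`. Because f flattens out at high ranks, a single gene misplaced at the bottom of the list costs far less than it does under the plain rank sum.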

The modified rank sum successfully determines how well a significance test has ranked the genes. But the score from the rank sum itself is not particularly useful information; it must be placed in a context. Since the goal of any significance test is ultimately to reduce the number of replicates necessary to attain accurate results, we convert the rank sum to a new measure that we call equivalent replicate gain. It is based on the equivalent replicate number, the number of replicates needed to reach the same rank sum using only a standard Cyber-ANOVA. Specifically, this is done by running Cyber-ANOVA on several data sets with parameters all identical except for the number of replicates. The graph of number of replicates vs. rank sum is drawn and a regression calculated. The clustering algorithm is then run and a rank sum obtained. That rank sum is entered into the inverse of the regression function to find the approximate number of replicates required to attain the same performance using only Cyber-ANOVA. The equivalent replicate gain is the difference between that number of replicates and the number of replicates actually used. Another useful measure is the equivalent replicate gain percentage, the equivalent replicate gain divided by the equivalent replicate number. This gives an approximation of the percent cost saving possible by using the algorithm and fewer replicates. The combination of the techniques of rank sum and equivalent replicate gain percentage has been very successful.
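The conversion from rank sum to equivalent replicate gain can be sketched as follows. The form of the regression is not specified in the text, so this illustration swaps in piecewise-linear interpolation between calibration points as a stand-in; the names are hypothetical.

```python
def equivalent_replicates(calibration, target_rank_sum):
    """Invert a replicates-vs-rank-sum calibration curve.

    `calibration` is a list of (n_replicates, rank_sum) pairs obtained from
    plain Cyber-ANOVA runs; rank sum is assumed to fall as replicates rise.
    Linear interpolation stands in for the unspecified regression.
    """
    points = sorted(calibration)
    for (n0, s0), (n1, s1) in zip(points, points[1:]):
        if s0 > s1 and s1 <= target_rank_sum <= s0:
            return n0 + (s0 - target_rank_sum) * (n1 - n0) / (s0 - s1)
    raise ValueError("target rank sum is outside the calibrated range")

def replicate_gain(calibration, target_rank_sum, replicates_used):
    """Return (equivalent replicate number, gain, gain as a fraction)."""
    equivalent = equivalent_replicates(calibration, target_rank_sum)
    gain = equivalent - replicates_used
    return equivalent, gain, gain / equivalent
```

As a check on the definition, an equivalent replicate number of 3.34 obtained from two actual replicates gives a gain of 1.34 and a gain percentage of 1.34 / 3.34, or about 40.1%.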

Results

We have generated a number of artificial data sets using a wide range of parameters and found the equivalent replicate gain percentage of the algorithm in many situations. The most important parameters were found to be the number of conditions, the number of cluster shapes, the number of replicates, the number of non-differentially expressed genes in the initial clustering, and the number of clusters omitted from the initial clustering.

In absolutely ideal situations, the algorithm can give an enormous benefit. An experiment with twenty conditions, one cluster shape, two replicates, and perfect initial clustering approaches such an ideal situation and gives excellent results (Table 1).

| Number of Conditions | Equivalent Replicate Number | Equivalent Replicate Gain Percentage |
|---|---|---|
| 6 | 3.43 | 41.7% |
| 20 | 6.28 | 68.1% |

Table 1. Performance of the clustering algorithm on data with two numbers of conditions, one cluster shape, two replicates, and perfect initial clustering.

A more realistic case would involve fewer conditions, say, six, and particularly with this smaller number of conditions, the number of cluster shapes becomes a factor in the performance of the algorithm. The performance benefit is still quite significant: recall that equivalent replicate gain percentage is an approximation of the cost saving due to fewer replicates. As expected, the performance benefit is reduced by additional cluster shapes (Table 2).

| Number of Cluster Shapes | Equivalent Replicate Number | Equivalent Replicate Gain Percentage |
|---|---|---|
| 12 | 2.84 | 29.6% |
| 2 | 3.34 | 40.1% |

Table 2. Performance of the clustering algorithm on data with two numbers of cluster shapes, using six conditions, two replicates, and perfect initial clustering.

Even when identical parameters are used to generate the data, the rank sum of any algorithm will fluctuate between data sets differing only by random number generation. Fortunately, this variation appears to affect Cyber-ANOVA and the clustering algorithm equally, for the equivalent replicate gain percentage does not change significantly (Table 3). This invariance has the beneficial effect of making it unnecessary to run multiple data sets for each parameter combination; more than one adds little information.

| Data Set | Equivalent Replicate Number | Equivalent Replicate Gain Percentage |
|---|---|---|
| 1 | 3.34 | 40.1% |
| 2 | 3.47 | 42.3% |
| 3 | 3.29 | 39.2% |
| 4 | 3.25 | 38.5% |

Table 3. Performance of the clustering algorithm on data generated randomly four times using identical parameters: six conditions, two replicates, two cluster shapes, and perfect initial clustering.

Change in the standard deviation or in the fold changes of the genes will certainly affect the rank sums from both Cyber-ANOVA and clustering. Over small changes, the equivalent replicate gain percentage is not seriously affected, but if the standard deviation becomes so large as to corrupt the initial clustering, the performance loss is significant (Table 4).

| Standard Deviation (in multiples of the normal SD used elsewhere) | Equivalent Replicate Number | Equivalent Replicate Gain Percentage |
|---|---|---|
| x1 | 3.34 | 40.1% |
| x2 | 3.14 | 36.3% |
| x4 | 2.63 | 24.0% |

Table 4. Performance of the clustering algorithm on data with three levels of standard deviation, using six conditions, two replicates, two cluster shapes, and perfect initial clustering.

Like Cyber-ANOVA, and in fact all statistical tests, the equivalent replicate gain percentage of the algorithm decreases as the number of replicates becomes larger and the test becomes more accurate (Table 5). If the initial clustering is correct, there is no loss in performance, but in experiments with parameters similar to those of Table 5, the algorithm probably ceases to be worthwhile after four replicates.

| Algorithm Tested | Number of Replicates | Equivalent Replicate Number | Equivalent Replicate Gain Percentage |
|---|---|---|---|
| Clustering | 2 | 2.84 | 29.6% |
| Clustering | 4 | 4.74 | 15.7% |
| Clustering | 8 | 8.12 | 1.51% |
| Cyber-ANOVA | 2 | 2.93 | 31.7% |
| Cyber-ANOVA | 4 | 5.23 | 23.5% |
| Cyber-ANOVA | 8 | 8.75 | 8.60% |

Table 5. Performance of the clustering algorithm on data with three numbers of replicates, using six conditions, twelve cluster shapes, and perfect initial clustering. The rate of decrease in performance at higher numbers of replicates is compared with that rate in Cyber-ANOVA. Note that for the clustering algorithm, the equivalent replicate gain percentage is expressed as a gain from Cyber-ANOVA, but that for Cyber-ANOVA, the equivalent replicate gain percentage is expressed as a gain from a standard ANOVA test. Thus, on the scale that Cyber-ANOVA is being compared to, the clustering algorithm result for a two-replicate data set is equivalent to the plain ANOVA of four replicates.
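One way to see how the two scales in Table 5 chain together is sketched below. This is only an illustration: the text does not say how the "four replicates" figure was obtained, and the linear interpolation between the tabulated Cyber-ANOVA rows, along with all names in the code, is an assumption made here.

```python
def interpolate(x, points):
    """Piecewise-linear interpolation through (x, y) points sorted by x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the tabulated range")


# Cyber-ANOVA rows of Table 5:
# actual replicates -> equivalent plain-ANOVA replicates.
cyber_to_plain = [(2, 2.93), (4, 5.23), (8, 8.75)]

# Clustering with two replicates is worth 2.84 Cyber-ANOVA replicates (Table 5).
clustering_in_cyber_units = 2.84

# Assumed step: read the Cyber-ANOVA scale at 2.84 replicates by interpolation.
clustering_in_plain_units = interpolate(clustering_in_cyber_units, cyber_to_plain)
print(f"{clustering_in_plain_units:.2f} plain-ANOVA replicates")
```

Under this assumption the two-replicate clustering result lands just under four plain-ANOVA replicates, in line with the statement in the caption.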

The most serious impact on the algorithm's performance occurs if the initial clustering is done incorrectly. The most difficult and important step is choosing the number of genes to include in the initial group; if either too many or too few are chosen, the performance of the algorithm will be diminished significantly. The ultimate goal is to choose a set of genes that includes all the clusters of differentially expressed genes without including any non-differentially expressed genes. This might be very difficult to do with experimental data, but fortunately the algorithm is robust to small departures from optimality (Table 6). The only time that the algorithm's performance drops below the performance of Cyber-ANOVA is when almost all of the clusters are left out.

| Number of Genes Selected | Expected Number of False Positives | Number of Clusters Omitted | Number of Non-Differentially Expressed Genes Included | Equivalent Replicate Number | Equivalent Replicate Gain Percentage |
|---|---|---|---|---|---|
| 3 | 0.005 | 10 | 0 | 1.57 | -21.6 |
| 10 | 0.02 | 9 | 0 | 1.93 | -3.3 |
| 20 | 0.05 | 5 | 0 | 2.11 | 5.18 |
| 30 | 0.09 | 3 | 0 | 2.12 | 5.84 |
| 50 | 0.24 | 0 | 0 | 2.34 | 14.7 |
| 100 | 1.3 | 0 | 0 | 2.63 | 23.9 |
| 300 | 91 | 0 | 1 | 2.55 | 21.4 |
| 400 | 300 | 0 | 23 | 2.37 | 15.8 |
| 500 | 496 | 0 | 95 | 2.09 | 4.33 |

Table 6. Performance of the clustering algorithm on data with several expected false positive rates for the initial clustering, using six conditions, two replicates, and twelve cluster shapes.

Discussion

The potential cost saving of the algorithm is highly significant in cases when the correlation between distance and significance is strong. It is interesting to note that for normal data with two replicates, such as that in Table 5, the percent equivalent replicate gain between Cyber-ANOVA and clustering is roughly equal to the performance gain between standard ANOVA and Cyber-ANOVA, on the order of 30%.

It is true that this performance degrades in some cases, but this is also true of Cyber-ANOVA, and also of other algorithms (3, 4). One situation that causes the percentage equivalent replicate gain of all these algorithms to decrease is a higher number of replicates. Unfortunately, the performance decreases more rapidly in the clustering algorithm. This phenomenon is perhaps best explained by a quality of information argument: additional replicates increase the quality of information of the P values, but have little effect on the quality of the information of the D values. Additional replicates will cause differentially expressed genes to cluster better, in a sense reducing the number of false negatives, but will not reduce the false positives, because non-differentially expressed genes are just as likely to randomly have low distances given any number of replicates. Thus, as the number of replicates increases, the D values gradually cause less improvement.

The most serious degradation of the clustering algorithm's performance, however, occurs in a situation that does not affect the other algorithms, that of bad initial clustering. Reducing the severity of this problem will be the primary direction for further research on this algorithm. Some possible solutions include (1) clustering all the genes but weighting their significance in the clustering algorithm by P value; and (2) setting the number of genes to use as seeds by trying many values and finding the point above which new clusters stop appearing to form. Regardless of whether this issue can be resolved in the general case, if in some experiment it is strongly suspected (perhaps on biological grounds) that no clusters have been omitted from the initial set of genes, this clustering method can be used and safely expected to produce a substantial performance benefit.
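Proposed solution (2) could be approached along the following lines. This is a minimal sketch under stated assumptions, not a method from this paper: the function name, the `patience` parameter, and the stopping rule are all hypothetical, and the cluster counts are made-up numbers standing in for the output of whatever clustering routine is used.

```python
def seed_count_at_plateau(cluster_counts, patience=2):
    """Return the smallest seed-gene count after which no new clusters appear.

    cluster_counts: list of (number_of_seed_genes, number_of_clusters_found)
        pairs, sorted by number_of_seed_genes.
    patience: how many consecutive larger settings must show no new clusters
        before the count is treated as having levelled off (hypothetical knob).
    """
    for i, (n_genes, n_clusters) in enumerate(cluster_counts):
        later = cluster_counts[i + 1 : i + 1 + patience]
        if len(later) == patience and all(c <= n_clusters for _, c in later):
            return n_genes
    return None  # no plateau observed in the range tried


# Made-up counts for illustration only; not results from the paper.
trial = [(10, 3), (20, 7), (30, 9), (50, 12), (100, 12), (200, 12)]
print(seed_count_at_plateau(trial))
```

A rule of this kind only detects where the cluster count levels off; whether that point coincides with the best-performing seed set would still need to be checked against data with known differentially expressed genes.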

References

1. Dudoit, S., Yang, Y., Callow, M.J., and Speed, T.P. (2000). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical report #578, Stat Dept, UC-Berkeley.

2. Knudsen, S. (2002). A Biologist's Guide to Analysis of DNA Microarray Data. John Wiley & Sons.

3. Pan, W. (2002). A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18(4):546-54.

4. Tusher, V.G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98:5119-5121.

5. Long, D., Mangalam, H., Chan, B., Tolleri, L., Hatfield, G., and Baldi, P. Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. Journal of Biological Chemistry 276(3):19937-19944.

6. Baldi, P., and Brunak, S. (1998). Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge MA.

7. Baldi, P., and Long, A.D. (2001). A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics 17:509-519.

8. Kreyszig, E. (1970). Mathematical Statistics: Principles and Methods. John Wiley & Sons, Inc.

9. Ewens, W., and Grant, G. (2001). Statistical Methods in Bioinformatics: An Introduction. Springer-Verlag, NY.

10. Zhao, R., Gish, K., Murphy, M., Yin, Y., Notterman, D., Hoffman, W., Tom, E., Mack, D., and Levine, A. (2000). Analysis of p53-regulated gene expression patterns using oligonucleotide arrays. Genes and Development 14(8):981-993.

11. Jiang, M., Ryu, J., Kiraly, M., Duke, K., Reinke, V., and Kim, S.K. Genome-wide analysis of developmental and sex-regulated gene expression profiles in Caenorhabditis elegans. Proc. Natl. Acad. Sci. USA 2001 Jan 2;98(1):218-23.

12. Duda, R.O., and Hart, P.E. (1973). Pattern Classification and Scene Analysis. John Wiley and Sons.

13. Sharan, R., and Shamir, R. CLICK: a clustering algorithm with applications to gene expression analysis. In Proceedings of the 2000 Conference on Intelligent Systems for Molecular Biology (ISMB00), La Jolla, CA, 307-316.

14. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., and Levine, A.J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96:6745-6750.

15. Sherlock, G. Analysis of large-scale gene expression data. Curr Opin Immunol 2000 Apr;12(2):201-5.
