
Stat 13 Final Review

A. Probability tables to use.
B. Variance algebra, correlation, covariance, regression
C. Probability and conditional probability

Stat 13 Final Review
A. Probability tables to use.

Before (midterm):
Normal distribution
How to standardize?
Mean
Variance = (standard deviation)^2
Review lecture 3, especially slide 4

After:
Chi-square distribution
t-distribution
Degrees of freedom (d.f.)
1. For one sample, d.f. = n - 1 (lecture 12, slide 9; lectures 13, 14)
2. For frequency/count data, d.f. = number of cells - 1 - number of parameters estimated (lectures 15, 16, 20/21)
3. For linear regression, d.f. = sample size - 2 (lecture 25)

Pearson's chi-square
Sum of (observed - expected)^2 / expected
For a test of independence, the degrees of freedom equal (# columns - 1)(# rows - 1)

Stat 13 Final Review
Part B

Before (midterm):
Variance algebra, confidence interval
Independent: Var(X ± Y) = Var(X) + Var(Y)
Dependent: Var(X ± Y) = Var(X) + Var(Y) ± 2 Cov(X, Y)
Standard error of the mean
Lectures 6, 7

After:
Regression line: slope equals r [SD(Y)/SD(X)], where r is the correlation coefficient
Lectures 23, 24, 25
Correlation = Cov(X, Y) / [SD(X) SD(Y)]

Consistency: if you use n - 1 in computing the SD, then use n - 1 when averaging the products

Practice: step by step for covariance, variance, and correlation coefficient.

 x    y   X-EX   Y-EY   product   (X-EX)^2   (Y-EY)^2
 2    4    -5    -1.5     7.5        25        2.25
 4    3    -3    -2.5     7.5         9        6.25
 6    6    -1     0.5    -0.5         1        0.25
 8    5     1    -0.5    -0.5         1        0.25
10    8     3     2.5     7.5         9        6.25
12    7     5     1.5     7.5        25        2.25

EX = 7, EY = 5.5, SD(X) = 3.4, SD(Y) = 1.7, Cov = 29/6
SD(X) = sqrt(35/3) = 3.4
Use the population version, so divide by n (here n = 6)

Corr = Cov / [SD(X) SD(Y)] = 0.828
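The step-by-step calculation can be checked with a short script. This is a minimal sketch; the six (x, y) pairs are assumed to be the practice data, chosen because they reproduce EX = 7, EY = 5.5, Cov = 29/6 and Corr = 0.828, and the helper names are illustrative only.

```python
import math

# Assumed practice data (consistent with EX = 7, EY = 5.5, Cov = 29/6).
x = [2, 4, 6, 8, 10, 12]
y = [4, 3, 6, 5, 8, 7]

def mean(values):
    return sum(values) / len(values)

def population_cov(a, b):
    """Average product of deviations, dividing by n (population version)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)

var_x = population_cov(x, x)   # Var(X) = Cov(X, X)
var_y = population_cov(y, y)
cov_xy = population_cov(x, y)
corr = cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))

print(f"EX = {mean(x)}, EY = {mean(y)}")
print(f"SD(X) = {math.sqrt(var_x):.2f}, SD(Y) = {math.sqrt(var_y):.2f}")
print(f"Cov = {cov_xy:.4f}, Corr = {corr:.3f}")
```

Dividing by n - 1 instead of n in both the SDs and the covariance would leave the correlation unchanged, which is the consistency point made earlier.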

Algebra for variance, covariance

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
Var(X) = Cov(X, X)
Var(X + a) = Var(X)
Cov(X + a, Y + b) = Cov(X, Y)
Cov(aX, bY) = ab Cov(X, Y)
Var(aX) = a^2 Var(X)
Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z)
Cov(X + Y, V + W) = Cov(X, V) + Cov(X, W) + Cov(Y, V) + Cov(Y, W)

TRICK: pretend all means are zero; then (X + Y)(V + W) = XV + XW + YV + YW

Lecture 7: Accuracy of the sample mean X-bar
Var(X-bar) = Var(X) divided by the sample size n
What is X-bar? It is called the sample mean.
Standard error of the mean = SD(X-bar) = SD(X) divided by the square root of n
As the sample size increases, the sample mean becomes more and more accurate in estimating the population mean
Sample size needed to meet an accuracy requirement
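A small sketch of how the standard error shrinks with n, and of the reverse question of how large n must be. The SD of 15 and the target standard error of 1 are hypothetical numbers chosen for illustration, not values from the lecture.

```python
import math

def standard_error(sd, n):
    """SD of the sample mean: SD(X) / sqrt(n)."""
    return sd / math.sqrt(n)

def sample_size_for_se(sd, target_se):
    """Smallest n such that SD(X) / sqrt(n) <= target_se."""
    return math.ceil((sd / target_se) ** 2)

sd = 15.0  # hypothetical population SD
for n in (1, 4, 25, 100):
    print(n, standard_error(sd, n))

n_needed = sample_size_for_se(sd, target_se=1.0)
print("n needed for SE <= 1:", n_needed)
```

Because the standard error falls like 1/sqrt(n), halving it takes four times as many observations.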

Stat 13 Final Review
Part C
Probability function: mean and standard deviation; lectures 19, 20, 21
Conditional probability: tree, table; you should know how to update a probability (Bayes' theorem); lectures 17, 18

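Binomial and Poisson
You should remember the binomial:
P(X = x) = C(n, x) p^x (1 - p)^(n - x), where C(n, x) is the binomial coefficient "n choose x"
I will provide the Poisson formula in the exam; you should know how to use it:
P(X = x) = e^(-λ) λ^x / x!, where e = 2.71828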

Office hours next week
Monday, Wednesday 3-4 pm
My office: Geology 4608

Lecture 3: Normal distribution, stem-leaf, histogram

Idealized population: a box of infinitely many tickets; each ticket has a value.
Random variable and probability statement P(X < 85)
Notation, Greek letters: mean (expected value) and standard deviation, E(X) = μ, SD(X) = σ, Var(X) = σ^2
Examples
Empirical distribution: stem-leaf, histogram
Three variants of histogram: frequency, relative frequency, density (called "standardized" in the book)
Same shape with different vertical scale
Density = relative frequency / length of interval

Given a box of tickets with values that come from a normal distribution with mean 75 and standard deviation 15, what is the probability that a randomly selected ticket will have a value less than 85?
Let X be the number selected (a random variable).
Pr(X < 85).

How does the normal table work?

Start from Z = 0.0, then Z = 0.1
Increasing pattern observed
On the negative side of Z
Use symmetry

How to standardize?
Find the mean
Find the standard deviation
Z = (X - mean) / SD
Reverse questions:
How to recover X from Z?
How to recover X from a percentile?

Suppose 20 percent of the students fail the exam
What is the passing grade?
Go from the percentage to Z, using the normal table
Convert Z into X, using X = mean + Z times SD

Probability for an interval
P(60 < X < 85)
Draw the curve (locate the mean and the endpoints of the interval)
= P(X < 85) - P(X < 60), where
P(X < 60) = P(Z < (60 - 75)/15) = P(Z < -1) = 1 - P(Z < 1) = 1 - .841 = about .16
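The table lookups can be cross-checked with Python's standard library. This is a sketch, not a replacement for knowing how to read the table; it assumes the exam scores in the passing-grade question follow the same normal distribution (mean 75, SD 15) as the ticket example.

```python
from statistics import NormalDist

scores = NormalDist(mu=75, sigma=15)

# Forward questions: probabilities from values.
p_below_85 = scores.cdf(85)
p_below_60 = scores.cdf(60)            # same as P(Z < -1)
p_interval = p_below_85 - p_below_60   # P(60 < X < 85)

# Reverse question: the grade below which 20 percent of students fall.
z_20 = NormalDist().inv_cdf(0.20)      # percentile -> Z
passing_grade = 75 + z_20 * 15         # X = mean + Z times SD

print(f"P(X < 85) = {p_below_85:.3f}")
print(f"P(X < 60) = {p_below_60:.3f}")
print(f"P(60 < X < 85) = {p_interval:.3f}")
print(f"Z = {z_20:.3f}, passing grade = {passing_grade:.1f}")
```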

Lecture 12: Brownian motion, chi-square distribution, d.f.
Adjusted schedule ahead
Chi-square distribution (lots of supplementary material, come to class!!!): 1 lecture
Hypothesis testing (about the SD of measurement error) and P-value (why n - 1? supplement): 1 lecture
Chi-square test for model validation (chapter 11)
Probability calculation (chapter 4)
Binomial distribution and Poisson (chapter 5, supplement; horse-kick deaths cavalry data, hitting the lottery, SARS infection)
Correlation, prediction, regression (supplement)
t-distribution, F-distribution

Slide 9 of Lecture 12
R^2 = (X1 - A)^2 + (X2 - A)^2 + ... + (Xn - A)^2; A = (X1 + ... + Xn)/n = average
Follows a chi-square distribution with n - 1 degrees of freedom

If the variance of each normal X is σ^2,
then D^2/σ^2 follows a chi-square distribution with n degrees of freedom, and
R^2/σ^2 follows a chi-square distribution with n - 1 degrees of freedom; this is also true even if the mean of the normal distribution (for each X) is not zero (why?)

Lecture 13: Chi-square and sample variance

Finish the discussion of the chi-square distribution from lecture 12
Expected value of the sum of squares equals n - 1.
Why divide by n - 1 in computing the sample variance?
It gives an unbiased estimate of the true variance of the measurement error
Testing a hypothesis about the true SD of the measurement error
Confidence interval for the true SD of the measurement error.

Slide 4, Lecture 13

Measurement error = reading from an instrument - true value

One biotech company specializing in microarray gene expression profiling claims it can measure the expression level of a gene with an error of size .1 (that is, after testing their method numerous times, they found that the standard deviation of their measurement errors is 0.1). The distribution of errors follows a normal distribution with mean 0 (unbiased).

Cells from a tumor tissue of a patient are sent to this company for a microarray assay. To assure consistency, the company repeats the assay 4 times. The result for one gene, P53 (the most well-studied tumor suppressor gene), is 1.1, 1.4, 1.5, 1.2.

Is there enough evidence to reject the company's claim about the accuracy of measurement? Note that the sample SD is sqrt(0.1/3), about 0.18, which is bigger than 0.1.
This problem can be solved by using the chi-squared distribution. We ask how likely it is to observe a sample SD this big; if the probability is small, then we have good evidence that the claim may be false. (next lecture)

Lecture 14: chi-square test, P-value

Measurement error (review from lecture 13)
Null hypothesis; alternative hypothesis
Evidence against the null hypothesis
Measuring the strength of evidence by P-value
Presetting the significance level
Conclusion
Confidence interval

The test statistic is obtained by experience or statistical training; it depends on the formulation of the problem and on how the data are related to the hypothesis.
Find the strength of evidence by P-value: for a future set of data (generated under the null hypothesis), compute the probability that the summary test statistic will be as large as or even greater than the one obtained from the current data. If the P-value is very small, then either the null hypothesis is false or you are extremely unlucky. So a statistician will argue that this is strong evidence against the null hypothesis.
If the P-value is smaller than a prespecified level (called the significance level, 5% for example), then the null hypothesis is rejected.

Back to the microarray example
H0: true SD σ = 0.1 (denote 0.1 by σ0)
H1: true SD σ > 0.1 (because this is the main concern; you don't care if the SD is small)
Summary:
Sample SD (s) = square root of (sum of squares/(n - 1)) = 0.18,
where sum of squares = (1.1 - 1.3)^2 + (1.2 - 1.3)^2 + (1.4 - 1.3)^2 + (1.5 - 1.3)^2 = 0.1, n = 4
The ratio s/σ0 = 1.8; is it too big?
The P-value consideration:
Suppose a future data set (n = 4) will be collected.
Let s be the sample SD from this future data set; it is random; so what is the probability that s/σ0 will be as big as or bigger than 1.8? P(s/σ0 > 1.8)

P(s/σ0 > 1.8)
But to find the probability we need to use the chi-square distribution:
Recall that sum of squares / true variance follows a chi-square distribution;
therefore, equivalently, we compute
P(future sum of squares/σ0^2 > sum of squares from the currently available data/σ0^2) (recall σ0 is the value claimed under the null hypothesis).

Once again, if data were generated again, then sum of squares / true variance is random and follows a chi-squared distribution with n - 1 degrees of freedom, where sum of squares = sum of squared distances between each data point and the sample mean
Note: sum of squares = (n - 1) × sample variance = (n - 1)(sample SD)^2
P-value = P(chi-square random variable > computed value from data) = P(chi-square random variable > 10.0)
The value computed from the available data = .10/.01 = 10 (note sum of squares = .1, true variance under H0 = .1^2 = .01)
For our case, n = 4, so look at the chi-square distribution with d.f. = 3; from the table we see:
upper-tail probability .025 at 9.348
upper-tail probability .01 at 11.34
The P-value is between .025 and .01; reject the null hypothesis at the 5% significance level
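The table only brackets the P-value; it can also be computed directly. The sketch below uses a standard closed-form series for the chi-square upper tail with integer degrees of freedom. The helper `chi2_sf` is a hypothetical name written for this illustration, not something from the course materials.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with integer df > x)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: start from the df = 1 tail, then add half-integer terms
    # (x/2)^(j - 1/2) / Gamma(j + 1/2) for j = 1 .. (df - 1)/2.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= half / (j - 0.5)
        total += math.exp(-half) * term
    return total

readings = [1.1, 1.4, 1.5, 1.2]
sigma0 = 0.1                      # SD claimed under the null hypothesis
n = len(readings)
sample_mean = sum(readings) / n
sum_sq = sum((r - sample_mean) ** 2 for r in readings)

statistic = sum_sq / sigma0 ** 2  # follows chi-square with n - 1 d.f. under H0
p_value = chi2_sf(statistic, n - 1)

print(f"sum of squares = {sum_sq:.3f}, statistic = {statistic:.2f}")
print(f"P-value = {p_value:.4f}")
```

The computed P-value falls between .01 and .025, in agreement with the table lookup.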

Confidence interval
A 95% confidence interval for the true variance σ^2 is
(sum of squares/C2, sum of squares/C1),
where C1 and C2 are the cutting points from the chi-square table with d.f. = n - 1 so that
P(chi-square random variable > C1) = .975
P(chi-square random variable > C2) = .025
This interval is derived from
P(C1 < sum of squares/σ^2 < C2) = .95
For our data, sum of squares = .1; from the d.f. = 3 row of the table, C1 = .216, C2 = 9.348; so the confidence interval for σ^2 is .0107 to .463; how about the confidence interval for σ?
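One way to answer the closing question: since the square root is an increasing function, taking square roots of both endpoints of the interval for σ^2 gives an interval for σ with the same coverage. A minimal sketch using the table cut points quoted above:

```python
import math

sum_sq = 0.1
c1, c2 = 0.216, 9.348   # chi-square table cut points for d.f. = 3

# 95% interval for the variance: (sum of squares / C2, sum of squares / C1)
var_low, var_high = sum_sq / c2, sum_sq / c1

# Interval for the SD: square roots of the variance endpoints
sd_low, sd_high = math.sqrt(var_low), math.sqrt(var_high)

print(f"variance: {var_low:.4f} to {var_high:.4f}")
print(f"SD:       {sd_low:.3f} to {sd_high:.3f}")
```

The interval for σ runs from roughly 0.10 to 0.68, so with only 4 readings the claimed value 0.1 sits right at the lower edge.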

Lecture 15: Categorical data and chi-square tests
Continuous variable: height, weight, gene expression level, lethal dosage of an anticancer compound, etc.
Ordinal variable
Categorical variable: sex, profession, political party, blood type, eye color, phenotype, genotype
Questions: does smoking cause lung cancer? Do smokers have a higher lung cancer rate?
Do the 4 nucleotides, A, T, G, C, occur equally likely?

Lecture 16: chi-square test (continued)
Suppose 160 pairs of consecutive nucleotides are selected at random.
Are the data compatible with the independent occurrence assumption?

       A    T    G    C
  A   15   10   13    7
  T   10   13    7   10
  G   10   10   10   10
  C    5   12   10    8

Independence implies that the joint probability equals the product of the marginal probabilities
Let P(first nucleotide = A) = PA1, P(first nucleotide = T) = PT1, and so on
Let P(second nucleotide = A) = PA2, P(second nucleotide = T) = PT2, and so on
P(AA) = PA1 PA2
P(AT) = PA1 PT2
We do not assume PA1 = PA2 and so on

Expected values in ( ); d.f. = (# of rows - 1)(# of columns - 1)

        A            T            G            C
  A   15 (11.25)   10 (12.66)   13 (11.25)    7 (9.84)
  T   10 (10)      13 (11.25)    7 (10)      10 (8.75)
  G   10 (10)      10 (11.25)   10 (10)      10 (8.75)
  C    5 (8.75)    12 (9.84)    10 (8.75)     8 (7.66)

Pearson's chi-square statistic = 166.8 > 27.88. P-value < .001

Simple or composite hypothesis
Simple: parameters are completely specified
Composite: parameters are not specified and have to be estimated from the data

Loss of 1 degree of freedom per parameter estimated
Number of parameters estimated = (# of rows - 1) + (# of columns - 1)
So the d.f. for the chi-square test is # of cells - 1 - (# of rows - 1) - (# of columns - 1) = (# of rows - 1)(# of columns - 1)

Test of independence in a contingency table

Are SARS death rates independent of countries? Data from the LA Times, as of Monday 5 pm (from Wednesday, April 30, 2003)

          China   Hong Kong   Singapore   Canada   others
cases      3303        1557         199      344      243
deaths      148         138          23       21       11

d.f. = 1 times 4 = 4; but wait, convert to a death/alive table first

d.f. = 4; expected values in ( )

          China           Hong Kong     Singapore    Canada        others        total
deaths     148 (199.5)     138 (94)      23 (12)      21 (20.8)     11 (14.7)     341
alive     3155 (3103.5)   1419 (1463)   176 (187)    323 (323.2)   232 (228.3)   5305
total     3303            1557          199          344           243           5646

Pearson's chi-square statistic = 47.67 > 18.47; P-value < .001; reject the null hypothesis, data incompatible with the independence assumption
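The expected counts and Pearson's statistic for the SARS table can be reproduced in a few lines. This is a sketch of the general recipe (expected = row total × column total / grand total) applied to the counts above; variable names are illustrative.

```python
cases = [3303, 1557, 199, 344, 243]      # China, Hong Kong, Singapore, Canada, others
deaths = [148, 138, 23, 21, 11]
alive = [c - d for c, d in zip(cases, deaths)]   # convert to a death/alive table
observed = [deaths, alive]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Under independence: expected count = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(col_totals) - 1)

print("expected deaths:", [round(e, 1) for e in expected[0]])
print(f"Pearson's statistic = {statistic:.2f}, d.f. = {df}")
```

The same code works for any rows-by-columns table of counts, since only the row and column totals enter the expected values.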

Lectures 20/21: Poisson distribution
As a limit of the binomial when n is large and p is small.
A theorem by Siméon Denis Poisson (1781-1840).
Parameter λ = np = expected value
As n is large and p is small, the binomial probability can be approximated by the Poisson probability function
P(X = x) = e^(-λ) λ^x / x!, where e = 2.71828
Ion channel modeling: n = number of channels in cells and p is the probability of opening for each channel

Binomial and Poisson approximation

 x    Binomial (n = 100, p = .01)    Poisson (λ = 1)
 0    .366032                        .367879
 1    .36973                         .367879
 2    .184865                        .183940
 3    .06099                         .061313
 4    .014942                        .015328
 5    .002898                        .003066
 6    .000463                        .000511
 7    .000063                        .000073

Advantage: no need to know n and p; estimate the parameter λ from data
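The comparison table can be regenerated from the two probability functions. A minimal sketch using only the standard library (the function names are illustrative):

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) p^x (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) lam^x / x!."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

n, p = 100, 0.01
lam = n * p   # Poisson parameter = np = expected value

print(" x   binomial   Poisson")
for x in range(8):
    print(f"{x:2d}   {binomial_pmf(x, n, p):.6f}   {poisson_pmf(x, lam):.6f}")
```

Trying a larger p (say n = 100, p = .3) shows the approximation getting visibly worse, which is why the rule asks for large n and small p.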

X = number of deaths     0     1     2     3     4    total
frequencies            109    65    22     3     1     200

200 yearly reports of death by horse kick from 10 cavalry corps over a period of 20 years in the 19th century by Prussian officials.

Number of deaths   Data frequencies   Poisson probability   Expected frequencies
0                  109                .5435                 108.7
1                   65                .3315                  66.3
2                   22                .101                   20.2
3                    3                .0205                   4.1
4                    1                .003                    0.6
total              200

Pool the last two cells and conduct a chi-square test to see if the Poisson model is compatible with the data or not. Degrees of freedom: 4 - 1 - 1 = 2. Pearson's statistic = .304; P-value is .859 (you can only tell it is between .95 and .2 from the table in the book); accept the null hypothesis, data compatible with the model
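The fit can be sketched as follows. The counts for 3 and 4 deaths (3 and 1) are assumed here from the classic horse-kick data set; they agree with the total of 200 and the expected frequencies shown. Depending on rounding and on how the pooled tail is handled, the statistic comes out near 0.30 rather than exactly .304.

```python
import math

observed = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}   # deaths -> number of reports (3, 1 assumed)
n_reports = sum(observed.values())

# Estimate the Poisson parameter by the sample mean (one parameter estimated).
lam = sum(k * count for k, count in observed.items()) / n_reports

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

expected = {k: n_reports * poisson_pmf(k, lam) for k in observed}

# Pool the last two cells (3 and 4 deaths) before computing Pearson's statistic.
obs_cells = [observed[0], observed[1], observed[2], observed[3] + observed[4]]
exp_cells = [expected[0], expected[1], expected[2], expected[3] + expected[4]]

statistic = sum((o - e) ** 2 / e for o, e in zip(obs_cells, exp_cells))
df = len(obs_cells) - 1 - 1           # cells - 1 - parameters estimated
p_value = math.exp(-statistic / 2)    # exact chi-square upper tail when d.f. = 2

print(f"lambda = {lam}")
print("expected:", [round(e, 1) for e in exp_cells])
print(f"statistic = {statistic:.3f}, d.f. = {df}, P-value = {p_value:.3f}")
```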

Rutherford and Geiger (1910)
A polonium source was placed a short distance from a small screen. For each of 2608 eighth-minute intervals, they recorded the number of alpha particles impinging on the screen
Other related applications in medical imaging: X-ray, PET scan (positron emission tomography), MRI

# of particles   Observed frequency   Expected freq.
 0                 57                   54
 1                203                  211
 2                383                  407
 3                525                  526
 4                532                  508
 5                408                  394
 6                273                  254
 7                139                  140
 8                 45                   68
 9                 27                   29
10                 10                   11
11+                 6                    6

Pearson's chi-squared statistic = 12.955; d.f. = 12 - 1 - 1 = 10
Poisson parameter = 3.87; P-value is about .23. Accept the null hypothesis: data are compatible with the Poisson model

Poisson process for modeling the number of event occurrences in a spatial or temporal domain
Homogeneity: rate of occurrence is uniform
Independent occurrence in non-overlapping areas
Non-clumping

Stat 13 lecture 25: regression (continued; SE, t and chi-square)
Simple linear regression model:
Y = β0 + β1 X + ε
Assumption: ε is normal with mean 0 and variance σ^2
The fitted line is obtained by minimizing the sum of squared residuals; that is, finding β0 and β1 so that
(Y1 - β0 - β1 X1)^2 + ... + (Yn - β0 - β1 Xn)^2 is as small as possible
This method is called the least squares method

The least squares line is the same as the regression line discussed before
It follows that the estimated slope β1-hat can be computed by
r [SD(Y)/SD(X)] = [Cov(X, Y)/(SD(X) SD(Y))] [SD(Y)/SD(X)]
= Cov(X, Y)/Var(X) (this is the same as the equation for β1-hat on page 518)
The intercept is estimated by putting x = 0 in the regression line, yielding the equation on page 518
Therefore, there is no need to memorize the equation for the least squares line; computationally it is advantageous to use Cov(X, Y)/Var(X) instead of r [SD(Y)/SD(X)]

Finding residuals and estimating the variance of ε
Residuals = differences between Y and the regression line (the fitted line)
An unbiased estimate of σ^2 is
[sum of squared residuals]/(n - 2)
Why divide by (n - 2)?
The degrees of freedom are n - 2 because two parameters were estimated
[sum of squared residuals]/σ^2 follows a chi-square distribution with n - 2 degrees of freedom.

Hypothesis testing for the slope
The slope estimate β1-hat is random
It follows a normal distribution with mean equal to the true β1 and variance equal to σ^2/[n Var(X)]
Because σ^2 is unknown, we have to estimate it from the data; the SE (standard error) of the slope estimate is equal to the square root of the above, with σ^2 replaced by its estimate

t-distribution
Suppose an estimate θ-hat is normal with mean θ and variance c σ^2.
Suppose σ^2 is estimated by s^2, which is related to a chi-squared distribution
Then (θ-hat - θ)/sqrt(c s^2) follows a t-distribution with degrees of freedom equal to the chi-square degrees of freedom

An example
Determining small quantities of calcium in the presence of magnesium is a difficult problem for analytical chemists. One method involves the use of alcohol as a solvent.
The data below show the results when applying it to 10 mixtures with known quantities of CaO. The second column gives the amount of CaO recovered.
Questions of interest: test to see if the intercept is 0; test to see if the slope is 1.

X: CaO present   Y: CaO recovered   Fitted value   Residual
 4.0              3.7                3.751         -.051
 8.0              7.8                7.73           .070
12.5             12.1               12.206         -.106
16.0             15.6               15.688         -.088
20.0             19.8               19.667          .133
25.0             24.5               24.641         -.141
31.0             31.1               30.609          .491
36.0             35.5               35.583         -.083
40.0             39.4               39.562         -.161
40.0             39.5               39.562         -.062

Least squares estimates (standard error in parentheses):

Constant    -0.228090 (0.137840)
Predictor    0.994757 (5.219485E-3)

R squared (squared correlation): 0.999780
Sigma hat (estimate of SD(ε)): 0.206722
Number of cases: 10
Degrees of freedom: 8

Intercept: -.22809/.1378 = -1.6547
Slope: (1 - 0.994757)/5.219485E-3 = 1.0045
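The output above can be reproduced from the ten data points. This sketch computes the slope as Cov(X, Y)/Var(X), as recommended earlier. The standard error of the intercept uses the usual textbook formula sigma-hat × sqrt(1/n + x-bar^2/Sxx), which is an addition here and not stated in the slides.

```python
import math

x = [4.0, 8.0, 12.5, 16.0, 20.0, 25.0, 31.0, 36.0, 40.0, 40.0]   # CaO present
y = [3.7, 7.8, 12.1, 15.6, 19.8, 24.5, 31.1, 35.5, 39.4, 39.5]   # CaO recovered
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)                          # equals n * Var(X)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))    # equals n * Cov(X, Y)

slope = sxy / sxx                      # Cov(X, Y) / Var(X)
intercept = y_bar - slope * x_bar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
df = n - 2                             # two parameters estimated
sigma_hat = math.sqrt(sum(r ** 2 for r in residuals) / df)

se_slope = sigma_hat / math.sqrt(sxx)                             # sqrt(sigma^2 / [n Var(X)])
se_intercept = sigma_hat * math.sqrt(1 / n + x_bar ** 2 / sxx)    # standard formula (assumption)

t_intercept = intercept / se_intercept     # test: intercept = 0
t_slope = (slope - 1) / se_slope           # test: slope = 1

print(f"intercept = {intercept:.6f} (SE {se_intercept:.6f}), t = {t_intercept:.4f}")
print(f"slope     = {slope:.6f} (SE {se_slope:.6e}), t = {t_slope:.4f}")
print(f"sigma hat = {sigma_hat:.6f}, d.f. = {df}")
```

Both t ratios are then compared with a t-distribution with 8 degrees of freedom; the sign of the slope ratio only reflects whether 1 is subtracted from the estimate or the other way round.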
