Outline
Chapter 1
  Data types (discrete, continuous, categorical)
  Problem (3 different aspects)
  Populations (target, study, sample)
  Representations of data
    Graphical: histograms, CDFs, boxplots
    Numerical: mean, standard deviation, IQR
  Bivariate data
    Relative risk
    Correlation coefficient
Chapter 2
  Review of probability distributions
  Random PPDAC examples
Chapter 3
  Binomial Model
  Response Model
  Regression Model
  Maximum Likelihood Estimation
Chapter 4
  Sampling distributions for estimators
  Introduction to new distributions: Gaussian, Chi-squared, t
  Confidence Interval
  Hypothesis Testing
  Confidence Intervals and Hypothesis Testing with the likelihood function
Chapter 5
  Testing for independence with categorical variates
  Model checking and assessment for assumptions
Chapter 6
  Comparison: 2 sample t-tests, paired t-test
  Causality: testing for association, blocking, randomization and repetition, matching
  Prediction: prediction intervals for response, prediction intervals for regression
Confidence Intervals using the Relative Likelihood Function
Define the likelihood function:
$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$$
Define the relative likelihood function as:
$$R(\theta) = \frac{L(\theta)}{L(\hat{\theta})}$$
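Graph the relative likelihood function:
Draw a horizontal line at 0.1. The θ-coordinates of the two points where the line crosses the curve are the endpoints of the 10% likelihood interval, which serves as an approximate 95% confidence interval (its coverage is in fact closer to 97%; a cut-off of about 0.15 matches 95% more closely)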
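The graphical recipe above can also be carried out numerically. The sketch below is only an illustration: it assumes the density $f(x; \theta) = (\theta + 1)x^{\theta}$ on (0, 1), which is the model implied by the log-likelihood $l(\theta) = n\ln(\theta + 1) + \theta\sum \ln x_i$ used in the examples of this review, and the data values are hypothetical, not taken from the slides.

```python
import numpy as np

# Hypothetical observations in (0, 1); not taken from the slides.
x = np.array([0.92, 0.79, 0.90, 0.65, 0.86, 0.47, 0.73, 0.97, 0.94, 0.77])
n = x.size
sum_log_x = np.log(x).sum()

def log_likelihood(theta):
    """l(theta) = n ln(theta + 1) + theta * sum(ln x_i), valid for theta > -1."""
    return n * np.log(theta + 1.0) + theta * sum_log_x

# Maximum likelihood estimate for this assumed model (closed form).
theta_hat = -n / sum_log_x - 1.0

def relative_likelihood(theta):
    """R(theta) = L(theta) / L(theta_hat), computed on the log scale."""
    return np.exp(log_likelihood(theta) - log_likelihood(theta_hat))

# "Draw the horizontal line at 0.1": keep the grid points where R(theta) >= 0.1.
thetas = np.linspace(-0.999, 20.0, 200_000)
inside = thetas[relative_likelihood(thetas) >= 0.1]
lower, upper = inside.min(), inside.max()

print(f"theta_hat = {theta_hat:.3f}, 10% likelihood interval = ({lower:.3f}, {upper:.3f})")
```

A grid search is used here because it mirrors reading the interval off the graph; a root finder would give the same endpoints more precisely.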
Hypothesis Testing using the Likelihood Function
1) Define the null hypothesis, define the alternate hypothesis
2) Define the test statistic, identify the distribution, calculate the observed value
3) Calculate the p-value
The test statistic:
$$D = 2[l(\tilde{\theta}) - l(\theta_0)]$$
Distribution of D:
$$D \sim \chi^2_{n-p}$$
Observed value of D:
$$d = 2[l(\hat{\theta}) - l(\theta_0)]$$
P-value:
$$P(D \geq d)$$
Example
The observed value of the test statistic:
$$d = 2[l(\hat{\theta}) - l(\theta_0)]$$
where the log-likelihood function is
$$l(\theta) = n\ln(\theta + 1) + \theta\sum_{i=1}^{n} \ln x_i$$
Model Assessment
We've been assuming that the data we collect fit a specific model (Binomial, Response, etc.)
With these models come many assumptions, including independence
In this chapter, we analyze our data to see whether we are actually able to use these models to fit our data
Independence with Binary Variates
We want to see if we can assume two binary variates (represented by 2 random variables X and Y) are independent
This is essentially another type of hypothesis test
Since a binary variate is just a categorical variate with 2 categories, this test can be extended to two categorical variates
Define:
Let X represent the binary variate gender (Male = 0, Female = 1)
Let Y represent the binary variate smoker (Non-Smoker = 0, Smoker = 1)
Let n be the sample size
Let us collect our observed data and present it in the following frequency table:

                   Male (X=0)   Female (X=1)   Total
Non-Smoker (Y=0)   a            b              a+b
Smoker (Y=1)       c            d              c+d
Total              a+c          b+d            n=a+b+c+d
If X and Y are independent then:
Expected frequency of male smokers is $n \cdot P(X = 0) \cdot P(Y = 1)$
Expected frequency of male non-smokers is $n \cdot P(X = 0) \cdot P(Y = 0)$
Expected frequency of female smokers is $n \cdot P(X = 1) \cdot P(Y = 1)$
Expected frequency of female non-smokers is $n \cdot P(X = 1) \cdot P(Y = 0)$
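Using the observed frequency table

                   Male (X=0)   Female (X=1)   Total
Non-Smoker (Y=0)   a            b              a+b
Smoker (Y=1)       c            d              c+d
Total              a+c          b+d            n=a+b+c+d

we estimate each marginal probability by the corresponding total divided by n:
$$P(X = 0) = \frac{a+c}{n}, \quad P(X = 1) = \frac{b+d}{n}$$
$$P(Y = 0) = \frac{a+b}{n}, \quad P(Y = 1) = \frac{c+d}{n}$$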
Creating our expected frequency table

                   Male (X=0)               Female (X=1)             Total
Non-Smoker (Y=0)   e1 = n P(X=0) P(Y=0)     e2 = n P(X=1) P(Y=0)     a+b
Smoker (Y=1)       e3 = n P(X=0) P(Y=1)     e4 = n P(X=1) P(Y=1)     c+d
Total              a+c                      b+d                      n=a+b+c+d
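As with any other hypothesis testing question, we need to define the test statistic.
Test Statistic:
$$S = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i}$$
where $o_i$ and $e_i$ are the observed and expected frequencies in cell i, and n in this sum counts the cells of the table (4 for two binary variates), not the sample size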
p-value $= P(S \geq s)$
Make your conclusion:
Reject (small p-value): there is evidence that X and Y are not independent
Do not reject: the data are consistent with X and Y being independent
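The steps above (expected frequencies, test statistic, p-value) can be sketched in a few lines. The counts below are hypothetical and only illustrate the layout of the frequency table. The sketch also assumes that S is compared with a chi-squared distribution with 1 degree of freedom, the usual choice for a 2 x 2 table, which the slides do not state explicitly.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical counts laid out as in the frequency table:
# rows = Non-Smoker (Y=0), Smoker (Y=1); columns = Male (X=0), Female (X=1).
observed = np.array([[30.0, 35.0],    # a, b
                     [20.0, 15.0]])   # c, d

n = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)   # a+b, c+d
col_totals = observed.sum(axis=0, keepdims=True)   # a+c, b+d

# Under independence, e = n * P(X = x) * P(Y = y) = (row total) * (column total) / n.
expected = row_totals * col_totals / n

# Observed value of S = sum over cells of (o - e)^2 / e.
s_obs = ((observed - expected) ** 2 / expected).sum()

# p-value = P(S >= s), assuming S ~ chi-squared with 1 degree of freedom.
p_value = chi2.sf(s_obs, df=1)

print(f"s = {s_obs:.4f}, p-value = {p_value:.4f}")
```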
Example
Observed value:
$$s = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i}$$
P-value:
$$P(S \geq s)$$
Model Assessment
For the regression model, we have the following assumptions when fitting our data
1) The expectation of Y is a linear function of the explanatory variate
2) The model used is Gaussian
3) The $Y_i$'s are independent
4) The model has a constant variance
The expectation of Y is a linear function of the explanatory variate
The model assumes that $E[Y_i]$ is a linear function of $x_i$
If we plot $y_i$ vs. $x_i$ we should see a linear relationship
The model used is Gaussian
In the model, we assume $R \sim G(0, \sigma)$ and thus $Y \sim G(\alpha + \beta x, \sigma)$
How do we check if this assumption is reasonable? Residuals
Rearranging the model, $R = Y - (\alpha + \beta x)$
A realization of R becomes $r_i = y_i - (\alpha + \beta x_i)$
An estimated residual is $\hat{r}_i = y_i - (\hat{\alpha} + \hat{\beta} x_i) = y_i - \hat{y}_i$
Graphically, $\hat{r}_i$ is the vertical distance from the line of best fit to our observed response variate
We can check the Gaussian assumption by plotting a QQ plot
Plot the sample quantiles of the estimated residuals against the theoretical Gaussian quantiles; if the points fall close to a straight line, then the Gaussian assumption is reasonable
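As a rough illustration of the residual calculation and the QQ-plot coordinates, the sketch below borrows the height and weight sample from the regression prediction example in this review. Using that sample here is an assumption made for illustration only, and `scipy.stats.probplot` is just one convenient way to obtain the quantile pairs.

```python
import numpy as np
from scipy import stats

# Height (cm) and body weight (kg) sample from the regression example in this review.
x = np.array([172.0, 162.0, 180.0, 170.0, 174.0])
y = np.array([60.0, 54.0, 72.0, 65.0, 64.0])

# Least squares estimates for the model Y = alpha + beta * x + R.
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# Fitted responses and estimated residuals r_i = y_i - (alpha_hat + beta_hat * x_i).
fitted = alpha_hat + beta_hat * x
residuals = y - fitted

# QQ-plot coordinates: theoretical Gaussian quantiles vs. ordered residuals.
# r_qq is the correlation of the QQ points; values near 1 suggest a straight line.
(theoretical_q, sample_q), (slope, intercept, r_qq) = stats.probplot(residuals, dist="norm")

for q, r in zip(theoretical_q, sample_q):
    print(f"{q:8.3f}  {r:8.3f}")
```

Plotting `fitted` against `residuals` gives the second diagnostic plot used for the independence and constant variance checks.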
The $Y_i$'s are independent
We will check these assumptions by plotting the estimated residuals, $\hat{r}_i$, against the fitted response, $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$
If our assumptions are true, we should see a random pattern centered around 0
The $Y_i$'s have Constant Variance
If the $Y_i$'s have constant variance, we should see residuals evenly distributed around zero
Non-constant variance: funnel shaped
Comparison
Recall that in Chapter 1 we learned there were three different aspects (types of problem): Descriptive, Causative, Predictive
Chapter 6 looks at techniques for solving each of the 3 problems
The descriptive aspect of the problem could involve comparing two different populations
In this section, we will learn how to conduct hypothesis tests that allow us to conclude whether there is a difference between 2 populations
The question asked is "is there a difference between the mean values of the 2 populations?"
Essentially, the hypothesis tested is whether the parameter for each population is equal
$$H_0: \mu_1 = \mu_2$$
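2 sample t-tests (Response Model)
Two populations
$$Y_{1j} = \mu_1 + R_{1j}$$
$$Y_{2j} = \mu_2 + R_{2j}$$
The estimator for each population is
$$\tilde{\mu}_1 = \frac{\sum_{j=1}^{n_1} Y_{1j}}{n_1}, \quad \tilde{\mu}_2 = \frac{\sum_{j=1}^{n_2} Y_{2j}}{n_2}$$
The sampling distribution for each estimator is
$$\tilde{\mu}_1 \sim G\left(\mu_1, \frac{\sigma}{\sqrt{n_1}}\right), \quad \tilde{\mu}_2 \sim G\left(\mu_2, \frac{\sigma}{\sqrt{n_2}}\right)$$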
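In the hypothesis test, we want to see if the two parameters $\mu_1$ and $\mu_2$ are equal, so let's look at the r.v. $\tilde{\mu}_1 - \tilde{\mu}_2$
What is the sampling distribution of $\tilde{\mu}_1 - \tilde{\mu}_2$ under the assumption $\mu_1 = \mu_2$?
$$\tilde{\mu}_1 \sim G\left(\mu_1, \frac{\sigma}{\sqrt{n_1}}\right), \quad \tilde{\mu}_2 \sim G\left(\mu_2, \frac{\sigma}{\sqrt{n_2}}\right)$$
$$\tilde{\mu}_1 - \tilde{\mu}_2 \sim G\left(0, \sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right)$$
Standardize:
$$\frac{\tilde{\mu}_1 - \tilde{\mu}_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim G(0, 1)$$
Replacing $\sigma$ with its estimator $\tilde{\sigma}$ gives
$$\frac{\tilde{\mu}_1 - \tilde{\mu}_2}{\tilde{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$$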
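The pooled estimate of the variance is
$$\hat{\sigma}^2 = \frac{(n_1 - 1)\hat{\sigma}_1^2 + (n_2 - 1)\hat{\sigma}_2^2}{n_1 + n_2 - 2}$$
and the test statistic is
$$T = \frac{\tilde{\mu}_1 - \tilde{\mu}_2}{\tilde{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$$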
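Example
$$\hat{\mu}_1 = 71.3, \quad \hat{\sigma}_1 = 10.2, \quad n_1 = 47$$
$$\hat{\mu}_2 = 68.7, \quad \hat{\sigma}_2 = 11.3, \quad n_2 = 36$$
$$\hat{\sigma}^2 = \frac{(n_1 - 1)\hat{\sigma}_1^2 + (n_2 - 1)\hat{\sigma}_2^2}{n_1 + n_2 - 2} = \frac{(47 - 1)10.2^2 + (36 - 1)11.3^2}{47 + 36 - 2} = 10.6892^2$$
so the pooled estimate of the standard deviation is $\hat{\sigma} = 10.6892$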
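The arithmetic in this example can be checked with a short script. The summary statistics are those given in the example; the two-sided p-value is an assumption, since the alternative hypothesis is not stated on the slide.

```python
import math
from scipy.stats import t as t_dist

# Summary statistics from the example.
mu1_hat, sigma1_hat, n1 = 71.3, 10.2, 47
mu2_hat, sigma2_hat, n2 = 68.7, 11.3, 36

# Pooled variance and standard deviation with n1 + n2 - 2 degrees of freedom.
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * sigma1_hat**2 + (n2 - 1) * sigma2_hat**2) / df
pooled_sd = math.sqrt(pooled_var)

# Observed value of T under H0: mu1 = mu2.
t_obs = (mu1_hat - mu2_hat) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))

# Two-sided p-value (assumed alternative: mu1 != mu2).
p_value = 2 * t_dist.sf(abs(t_obs), df)

print(f"pooled sd = {pooled_sd:.4f}, t = {t_obs:.4f}, p-value = {p_value:.4f}")
```

The observed statistic comes out at roughly 1.10 on 81 degrees of freedom, so under these assumptions there is little evidence against equal means.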
Paired t-Tests
In the prior pages, we looked at two sample t-tests
A stronger test is called the paired t-test
This test only works if the two samples we collect are actually data for the same group of n units, but at different times
The paired t-test involves reducing the two data sets to one by finding the difference within each pair of data, and working with this single data set
Then we conduct a usual t-test / hypothesis test on this single data set of differences
Causation
The causative aspect of a problem looks at the relationship between the explanatory and response variates
Recall that in Chapter 1 we looked at 2 concepts that describe the relationship between X and Y: Relative Risk and Association
Association involves calculating the correlation coefficient
$$r = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
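To make the formula concrete, here is a small sketch that computes $S_{XY}$, $S_{XX}$, $S_{YY}$ and r directly. It borrows the height and weight sample from the regression prediction example in this review; pairing that sample with this slide is an assumption made for illustration.

```python
import numpy as np

# Height (cm) and body weight (kg) sample from the regression example in this review.
x = np.array([172.0, 162.0, 180.0, 170.0, 174.0])
y = np.array([60.0, 54.0, 72.0, 65.0, 64.0])

# Sums of squares and cross-products about the sample means.
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_xx = np.sum((x - x.mean()) ** 2)
s_yy = np.sum((y - y.mean()) ** 2)

# Sample correlation coefficient r = S_XY / sqrt(S_XX * S_YY).
r = s_xy / np.sqrt(s_xx * s_yy)

print(f"S_XY = {s_xy:.1f}, S_XX = {s_xx:.1f}, S_YY = {s_yy:.1f}, r = {r:.4f}")
```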
In this course, we only have the skills to test for association
This involves testing the hypothesis $H_0: \beta = 0$ in the regression model
If $\beta = 0$, then we can say there is no association between X and Y
Example
$$t = \frac{\hat{\beta} - 0}{SE(\tilde{\beta})}$$
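A hedged sketch of this test follows. It assumes the standard error is $\hat{\sigma}/\sqrt{S_{XX}}$, with $\hat{\sigma}^2$ equal to the residual sum of squares divided by n - 2, that t is compared with a $t_{n-2}$ distribution, and that the alternative is two-sided. The data are the height and weight sample from the regression prediction example in this review, used here for illustration only.

```python
import numpy as np
from scipy.stats import t as t_dist

# Height (cm) and body weight (kg) sample from the regression example in this review.
x = np.array([172.0, 162.0, 180.0, 170.0, 174.0])
y = np.array([60.0, 54.0, 72.0, 65.0, 64.0])
n = x.size

# Estimates for the shifted model Y = alpha + beta * (x - x_bar) + R.
s_xx = np.sum((x - x.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
beta_hat = s_xy / s_xx
alpha_hat = y.mean()

# Estimate of sigma from the residuals, using n - 2 degrees of freedom (assumption).
residuals = y - (alpha_hat + beta_hat * (x - x.mean()))
sigma_hat = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Observed test statistic for H0: beta = 0 and its two-sided p-value.
se_beta = sigma_hat / np.sqrt(s_xx)
t_obs = (beta_hat - 0.0) / se_beta
p_value = 2 * t_dist.sf(abs(t_obs), df=n - 2)

print(f"beta_hat = {beta_hat:.4f}, t = {t_obs:.3f}, p-value = {p_value:.4f}")
```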
Association does NOT imply causation
The course notes talk about why this is the case and how we can avoid making the wrong assumption using three techniques: Blocking, Repetition and Randomization, Matching
Confounding
Association does not imply causation
There could be a third, hidden variate that is related to both the explanatory and response variates and produces the apparent relationship: this is called confounding
The difficulty with confounding variates is identifying them in the first place; if we miss them, we will draw the wrong conclusion about the relationship between the explanatory and response variates
If we can identify the confounding variates, then there are tools we can use when designing experimental plans to account for these variates
Blocking
If we've identified the confounding variate, we neutralize its effect by collecting samples where the units have the same value for the confounding variate
The Chicken Example:
Response variate: growth rate of chickens
Explanatory variate: protein in diet
Confounding variate: gender of the chickens
Blocking: look at samples of only male chickens and samples of only female chickens
This eliminates the gender effect, and the experimenter is able to look at the effect of protein in the diet on the growth rate of chickens
Replication and Randomization
If we cannot identify or control the confounding variate, we can also try to neutralize its effects by randomly assigning the values of our controlled variate to the units in the experimental plan
The Medicine Example:
Response variate: survival rate
Explanatory variate: type of treatment
Confounding variates: medical history / health of the patient
Using randomization and replication to assign the treatment type to each unit tends to produce two groups that are well balanced in terms of their health / medical history
This removes the effect of the confounding variates as much as possible
Matching and Observational Plans
In observational plans, the experimenter cannot control the variates
The method of matching is used: each unit being observed is compared with a control unit that has very similar characteristics (this is similar to blocking)
Thus if there is a difference in the value observed between the sampled unit and the control unit, the difference is unlikely to be explained by the matched characteristics
Prediction
The predictive aspect of a problem involves using our collected data to estimate a value for a unit to be randomly selected from the population
We will look at prediction intervals for: Response, Regression
The Model
$$Y = \mu + R, \quad Y \sim G(\mu, \sigma)$$
The predicted unit: $Y_0$
Since $Y_0$ follows the response model, then
$$Y_0 \sim G(\mu, \sigma)$$
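What would be a logical choice to use as our predicted value? The average
We need the estimator $\tilde{\mu}$ for the mean parameter:
$$\tilde{\mu} = \frac{\sum_{i=1}^{n} Y_i}{n} \quad \text{(from MLE)}$$
$$\tilde{\mu} \sim G\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \quad \text{(sampling distribution)}$$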
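If we look at the difference between the unit to be predicted and our estimator of the population average, then we have the random variable $Y_0 - \tilde{\mu}$
$$Y_0 \sim G(\mu, \sigma), \quad \tilde{\mu} \sim G\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
Since $Y_0$ is independent of the sample, the variances add:
$$Y_0 - \tilde{\mu} \sim G\left(0, \sigma\sqrt{1 + \frac{1}{n}}\right)$$
Standardizing gives
$$\frac{Y_0 - \tilde{\mu}}{\sigma\sqrt{1 + \frac{1}{n}}} \sim G(0, 1)$$
Replacing $\sigma$ with an estimator gives
$$\frac{Y_0 - \tilde{\mu}}{\tilde{\sigma}\sqrt{1 + \frac{1}{n}}} \sim t_{n-1}$$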
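Constructing a 95% Prediction Interval for $Y_0$ ($\sigma$ unknown)
Our ultimate goal: $a \leq Y_0 \leq b$
Since
$$\frac{Y_0 - \tilde{\mu}}{\tilde{\sigma}\sqrt{1 + \frac{1}{n}}} \sim t_{n-1}$$
we can make the probability statement:
$$P\left(\left|\frac{Y_0 - \tilde{\mu}}{\tilde{\sigma}\sqrt{1 + \frac{1}{n}}}\right| \leq c\right) = 0.95$$
$$P\left(-c \leq \frac{Y_0 - \tilde{\mu}}{\tilde{\sigma}\sqrt{1 + \frac{1}{n}}} \leq c\right) = 0.95$$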
Example
Let Y be the response variate representing body weight (kg). The following sample is collected: 60 54 72 65 64
Construct a 95% prediction interval for the body weight of someone we randomly select from the population.
$$\hat{\mu} \pm c\,\hat{\sigma}\sqrt{1 + \frac{1}{n}}$$
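The interval for this sample can be computed as follows. This is a sketch under two assumptions: $\hat{\sigma}$ is the sample standard deviation with divisor n - 1, and c is the 0.975 quantile of the $t_{n-1}$ distribution.

```python
import numpy as np
from scipy.stats import t as t_dist

# Body weight sample (kg) from the example.
y = np.array([60.0, 54.0, 72.0, 65.0, 64.0])
n = y.size

mu_hat = y.mean()                   # estimate of the mean
sigma_hat = y.std(ddof=1)           # sample standard deviation (divisor n - 1, assumed)
c = t_dist.ppf(0.975, df=n - 1)     # P(-c <= T <= c) = 0.95 for T ~ t_{n-1}

# 95% prediction interval: mu_hat +/- c * sigma_hat * sqrt(1 + 1/n).
half_width = c * sigma_hat * np.sqrt(1 + 1 / n)
lower, upper = mu_hat - half_width, mu_hat + half_width

print(f"95% prediction interval: ({lower:.2f}, {upper:.2f}) kg")
```

With these five observations the interval works out to roughly (42.8, 83.2) kg; it is wide because n is small and the factor $\sqrt{1 + 1/n}$ accounts for the variability of the new unit itself.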
Prediction
The Model
$$Y = \alpha + \beta x_i + R$$
But for our purposes, we will use a shifted version of the model
$$Y = \alpha + \beta(x_i - \bar{x}) + R$$
The Model
$$Y = \alpha + \beta(x_i - \bar{x}) + R$$
The predicted unit: $Y_0$
We want to predict $Y_0$ given the subgroup $x_i = x_0$
Since $Y_0$ follows the regression model, then
$$Y_0 \sim G(\alpha + \beta(x_0 - \bar{x}), \sigma)$$
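What would be a logical choice to use as our predicted value?
The average given the subgroup $x_i = x_0$, whose estimator we will denote $\tilde{\mu}(x_0)$
Regression model:
$$Y = \alpha + \beta(x_i - \bar{x}) + R$$
Average of the subgroup $x_i = x_0$:
$$E[Y \mid x_0] = \alpha + \beta(x_0 - \bar{x}), \quad \text{estimated by} \quad \tilde{\mu}(x_0) = \tilde{\alpha} + \tilde{\beta}(x_0 - \bar{x})$$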
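Using Maximum Likelihood Estimation we obtain the estimators
$$\tilde{\alpha} = \frac{\sum_{i=1}^{n} Y_i}{n} = \bar{Y}$$
$$\tilde{\beta} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{XY}}{S_{XX}}$$
The sampling distributions of these two estimators are
$$\tilde{\alpha} \sim G\left(\alpha, \frac{\sigma}{\sqrt{n}}\right), \quad \tilde{\beta} \sim G\left(\beta, \frac{\sigma}{\sqrt{S_{XX}}}\right)$$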
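What is the sampling distribution of $\tilde{\mu}(x_0) = \tilde{\alpha} + \tilde{\beta}(x_0 - \bar{x})$?
$$\tilde{\alpha} \sim G\left(\alpha, \frac{\sigma}{\sqrt{n}}\right), \quad \tilde{\beta} \sim G\left(\beta, \frac{\sigma}{\sqrt{S_{XX}}}\right)$$
$$\tilde{\mu}(x_0) \sim G\left(\alpha + \beta(x_0 - \bar{x}), \; \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}\right)$$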
If we look at the difference between the unit to be predicted and our estimator of the subgroup average, then we have the random variable $Y_0 - \tilde{\mu}(x_0)$
$$Y_0 \sim G(\alpha + \beta(x_0 - \bar{x}), \sigma)$$
$$\tilde{\mu}(x_0) \sim G\left(\alpha + \beta(x_0 - \bar{x}), \; \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}\right)$$
The obvious next step would be to determine the sampling distribution of $Y_0 - \tilde{\mu}(x_0)$
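$$Y_0 - \tilde{\mu}(x_0) \sim G\left(0, \; \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}\right)$$
Standardizing gives
$$\frac{Y_0 - \tilde{\mu}(x_0)}{\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}} \sim G(0, 1)$$
Estimating sigma gives
$$\frac{Y_0 - \tilde{\mu}(x_0)}{\tilde{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}} \sim t_{n-2}$$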
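Constructing a 95% Prediction Interval for $Y_0$ ($\sigma$ unknown)
Our ultimate goal: $a \leq Y_0 \leq b$
Since
$$\frac{Y_0 - \tilde{\mu}(x_0)}{\tilde{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}} \sim t_{n-2}$$
we can make the probability statement:
$$P\left(\left|\frac{Y_0 - \tilde{\mu}(x_0)}{\tilde{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}}\right| \leq c\right) = 0.95$$
$$P\left(-c \leq \frac{Y_0 - \tilde{\mu}(x_0)}{\tilde{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}} \leq c\right) = 0.95$$
$$P\left(-c\,\tilde{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}} \leq Y_0 - \tilde{\mu}(x_0) \leq c\,\tilde{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}\right) = 0.95$$
$$P\left(\tilde{\mu}(x_0) - c\,\tilde{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}} \leq Y_0 \leq \tilde{\mu}(x_0) + c\,\tilde{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}\right) = 0.95$$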
Upper and lower bounds of a regression prediction interval:
$$\hat{\alpha} + \hat{\beta}(x_0 - \bar{x}) \pm c\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}$$
Example
Let Y be the response variate representing body weight (kg) and X be the explanatory variate representing body height (cm). The following sample is collected:

i     1     2     3     4     5
xi    172   162   180   170   174
yi    60    54    72    65    64

$$\hat{\alpha} + \hat{\beta}(x_0 - \bar{x}) \pm c\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}}$$
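A minimal sketch of this calculation is given below. The function name `regression_prediction_interval` and the height `x0 = 175.0` are hypothetical choices for illustration (the slide does not give a value of $x_0$). The sketch also assumes $\hat{\sigma}^2$ is the residual sum of squares divided by n - 2, and that c is the 0.975 quantile of $t_{n-2}$.

```python
import numpy as np
from scipy.stats import t as t_dist

# Height (cm) and body weight (kg) sample from the example.
x = np.array([172.0, 162.0, 180.0, 170.0, 174.0])
y = np.array([60.0, 54.0, 72.0, 65.0, 64.0])

def regression_prediction_interval(x, y, x0, level=0.95):
    """Prediction interval for Y0 at x0 under Y = alpha + beta*(x - x_bar) + R."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar = x.mean()

    # Estimates for the shifted model: alpha_hat = y_bar, beta_hat = S_XY / S_XX.
    s_xx = np.sum((x - x_bar) ** 2)
    beta_hat = np.sum((x - x_bar) * (y - y.mean())) / s_xx
    alpha_hat = y.mean()

    # Estimate of sigma from the residuals, n - 2 degrees of freedom (assumption).
    residuals = y - (alpha_hat + beta_hat * (x - x_bar))
    sigma_hat = np.sqrt(np.sum(residuals ** 2) / (n - 2))

    # Quantile c with P(-c <= T <= c) = level for T ~ t_{n-2}.
    c = t_dist.ppf(0.5 + level / 2, df=n - 2)

    centre = alpha_hat + beta_hat * (x0 - x_bar)
    half_width = c * sigma_hat * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    return centre - half_width, centre + half_width

# Hypothetical height of 175 cm for the unit to be predicted.
lower, upper = regression_prediction_interval(x, y, x0=175.0)
print(f"95% prediction interval at x0 = 175: ({lower:.2f}, {upper:.2f}) kg")
```

The interval is narrowest at $x_0 = \bar{x}$ and widens as $x_0$ moves away from the mean height, because of the $(x_0 - \bar{x})^2 / S_{XX}$ term.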
Outline
Chapter 1
  Data types (discrete, continuous, categorical)
  Problem (3 different aspects)
  Populations (target, study, sample)
  Representations of data
    Graphical: histograms, CDFs, boxplots
    Numerical: mean, standard deviation, IQR
  Bivariate data
    Relative risk
    Correlation coefficient
Chapter 2
  Review of probability distributions
  Random PPDAC examples
PPDAC
Concept Review
From the previous example:
Target population, study population, sample, unit
Response vs. explanatory variates
Aspects: Descriptive, Causative, Predictive
Histograms: bin width, frequency histogram
Outline
Chapter 3
  Binomial Model
  Response Model
  Regression Model
  Maximum Likelihood Estimation
MLE
$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$$
$$l(\theta) = n\ln(\theta + 1) + \theta\sum_{i=1}^{n} \ln(x_i)$$
Concept Review
From the previous example:
Maximum Likelihood Estimation Method
Define likelihood function
Define log-likelihood function
Differentiate with respect to the parameter
Set to zero
Solve for the parameter
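The five steps above can be traced on the log-likelihood $l(\theta) = n\ln(\theta + 1) + \theta\sum \ln(x_i)$ used in this review. Differentiating gives $l'(\theta) = n/(\theta + 1) + \sum \ln(x_i)$, and setting this to zero gives $\hat{\theta} = -n/\sum \ln(x_i) - 1$. The sketch below checks that result numerically; the data are hypothetical values in (0, 1), and the function names are illustrative only.

```python
import numpy as np

# Hypothetical observations in (0, 1); not taken from the slides.
x = np.array([0.92, 0.79, 0.90, 0.65, 0.86, 0.47, 0.73, 0.97, 0.94, 0.77])
n = x.size
sum_log_x = np.log(x).sum()

def log_likelihood(theta):
    """Step 2: l(theta) = n ln(theta + 1) + theta * sum(ln x_i)."""
    return n * np.log(theta + 1.0) + theta * sum_log_x

def score(theta):
    """Step 3: derivative of the log-likelihood with respect to theta."""
    return n / (theta + 1.0) + sum_log_x

# Steps 4 and 5: set the derivative to zero and solve for theta.
theta_hat = -n / sum_log_x - 1.0

print(f"theta_hat = {theta_hat:.4f}, score at theta_hat = {score(theta_hat):.2e}")
```

Because the second derivative $-n/(\theta + 1)^2$ is negative, the stationary point is a maximum.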
Outline
Chapter 4
  Sampling distributions for estimators
  Introduction to new distributions: Gaussian, Chi-squared, t
  Confidence Interval
  Hypothesis Testing
  Confidence Intervals and Hypothesis Testing with the likelihood function
Confidence Intervals
Concepts Review
From the previous example:
Confidence intervals for the response model, sigma unknown
Structure of a symmetric confidence interval
Hypothesis Testing
For a paired t-test, we create a new set of data

i     Diff      i     Diff
1     0.48      9     0.46
2     0.53      10    0.76
3     0.52      11    3.09
4     0.21      12    0.26
5     -0.05     13    0.34
6     0.44      14    0.32
7     0.41      15    -0.07
8     0.68      16    0.33
Test statistic:
$$T = \frac{\tilde{\mu}_D - 0}{\tilde{\sigma}_D / \sqrt{n}} \sim t_{n-1}$$
P-value
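The observed value of this test statistic for the 16 differences in the table can be computed as below. The two-sided p-value is an assumption, since the alternative hypothesis is not shown on the slide, and $\hat{\sigma}_D$ is taken to be the sample standard deviation with divisor n - 1.

```python
import numpy as np
from scipy.stats import t as t_dist

# The 16 paired differences from the table, in unit order 1 to 16.
diffs = np.array([0.48, 0.53, 0.52, 0.21, -0.05, 0.44, 0.41, 0.68,
                  0.46, 0.76, 3.09, 0.26, 0.34, 0.32, -0.07, 0.33])
n = diffs.size

mu_d_hat = diffs.mean()            # mean difference
sigma_d_hat = diffs.std(ddof=1)    # sample standard deviation (divisor n - 1, assumed)

# Observed test statistic for H0: mu_D = 0, compared with t_{n-1}.
t_obs = (mu_d_hat - 0.0) / (sigma_d_hat / np.sqrt(n))

# Two-sided p-value (assumed alternative: mu_D != 0).
p_value = 2 * t_dist.sf(abs(t_obs), df=n - 1)

print(f"mean diff = {mu_d_hat:.4f}, t = {t_obs:.3f}, p-value = {p_value:.4f}")
```

Note that unit 11 (difference 3.09) is far from the other values and has a strong influence on both the mean and the standard deviation.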
For a 2 sample t-test, we have two populations, with 2 sets of data
Test statistic:
$$T = \frac{\tilde{\mu}_1 - \tilde{\mu}_2}{\tilde{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$$
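$$\hat{\sigma}^2 = \frac{(n_1 - 1)\hat{\sigma}_1^2 + (n_2 - 1)\hat{\sigma}_2^2}{n_1 + n_2 - 2} = \frac{(16 - 1)2.48^2 + (16 - 1)2.91^2}{16 + 16 - 2} = 2.704^2$$
so the pooled estimate of the standard deviation is $\hat{\sigma} = 2.704$
Observed value of the test statistic:
$$t = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
P-value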
Concepts Review
From the previous example:
Hypothesis Testing
Define the null hypothesis
Define the test statistic, identify the distribution, calculate the observed value of the test statistic
Calculate the p-value
2 sample t-test
Paired t-test