
Biostatistics Workbook

Field Epidemiology and Lab Training Programs (FELTP)
DRAFT
Department of Health and Human Services
Centers for Disease Control and Prevention
Coordinating Office for Global Health
Office of Capacity Development and Program Coordination
Division of Epidemiology and Surveillance Capacity Development
Acknowledgements:
We thank the following for their time and efforts in developing the content of this workbook:
Donna Jones
Michael A. Joseph
Jennifer Scharff
Nadine Sunderland
Content Review:
Edmond Maes
Peter Nsubuga
Biostatistics Workbook
DRAFT: Aug. 28, 2007
Table of Contents
How to Use this Workbook
Introduction to Biostatistics
Scales of Measurement
Frequency Distributions
Central Location and Dispersion
Measures of Central Tendency
Measures of Dispersion
Probability and the Normal Distribution
Probability Distribution
Normal Distribution
Central Limit Theorem
Statistical Inference
Confidence Interval Around a Mean
Confidence Interval Around a Proportion
Hypothesis Testing: Two-Sample t-test
Confidence Interval Estimation: Two-Sample t-test
Hypothesis Testing: z-test for Difference in Proportions
Confidence Interval Estimation: z-test for Difference in Proportions
Hypothesis Testing: Paired t-test
Confidence Interval Estimation: Paired t-test
Fisher's Exact Test
Chi-Square Test for Independence
Confidence Intervals for Case-Control and Cohort Studies
Confidence Intervals: Odds Ratios and Relative Risks
Sample Size
Sample Size for Descriptive Studies
Sample Size for Analytic Studies
Correlation and Regression Analysis
Pearson Product-Moment Correlation Coefficient
Simple Linear Regression
One-Way Analysis of Variance (ANOVA)
References
Appendix 1: Answer Key
Appendix 2: Distribution Tables
Student's t Table
Standard Normal z
Chi-Square Distribution
F Distribution
How to Use this Workbook
This workbook is intended as a resource for students in introductory biostatistics courses. It provides students with step-by-step guidance through example problems calculated by hand and with readily available statistical software programs. Practice problems are given, along with an answer key, so that students are able to solidify what they have learned in their biostatistics courses.
The workbook may also be used as a reference once a student has completed a biostatistics course. Though it does not provide detailed information on the theory of biostatistical concepts, it will serve as a refresher as to what statistical test should be used in a given situation and how to do the calculations that accompany that test.
Introduction to Biostatistics
This workbook provides an overview of basic biostatistics topics including scales of measurement, central location and dispersion, normal distribution, tests of statistical inference, sample size, and correlation and regression analysis. Following the description are examples and practice problems to be completed both by hand and with the aid of a statistical computer program. These examples and practice problems will give you an opportunity to apply the concepts to situations that you may find in the field. Datasets for the practice problems are either included in the workbook or on the accompanying CD. As you complete the practice problems, you may check your work by referring to the answer key located in Appendix 1.
This workbook is meant as a supplemental text and is not intended to replace your regular biostatistics course. However, we all need a friendly reminder from time to time. For this reason, we have included definitions of commonly used terms in biostatistics for your reference.
Data: The raw material of statistics. Data generally consist of numbers obtained by measurement or by counting in a population or sample. For example, a nurse may record the temperature of patients (a measurement) or count the number of patients with a temperature above normal.
Variable: The term for a characteristic that differs among members of a population or sample, such as height. Because the characteristic is not constant, it is variable. Variables can be qualitative or quantitative, continuous or discrete. Random variables take values that cannot be predicted exactly in advance and are the most useful for statistical purposes.
Population: A collection of entities. A statistical population refers to the largest collection of entities in which we have an interest. For example, we may be interested in looking at women of reproductive age who have had one child. Therefore, our population is limited to only those women aged 15-45 who have one child.
Sample: Part of a population. A sample of the example population of women aged 15-45 with one child might consist of an estimated 25 percent of the population.
Parameter: A descriptive measure computed from the data of a population.
Statistic: A descriptive measure computed from the data of a sample. Statistics is a field which examines the collection, organization, summarization, and analysis of data and draws inferences about a population through observation of a sample.
Descriptive Statistics: Methods for presenting and summarizing data. Descriptive statistics allow us to understand general patterns in a large quantity of data without conducting a formal test of a hypothesis.
Inferential Statistics: Statistics used to reach a conclusion about a population based on information gathered from a sample of that population. Involves estimation or hypothesis testing.
Statistical Symbols
$\mu$: population mean          $\sigma$: population standard deviation
$\bar{x}$: sample mean          $s$: sample standard deviation
$x_{.50}$: median
Scales of Measurement
There are four commonly recognized scales of measurement for variables.
Overview
Scales of measurement allow you to categorize data in order to provide information about the characteristic being measured.
The type of scale used in measuring data affects the type and amount of information that can be obtained. This affects how data will be treated statistically.
Recognizing the different scales of measurement and understanding their implications for analyzing data will also assist you in creating questionnaires for epidemiologic studies.
Nominal Scale
The nominal scale classifies persons or things based on a qualitative assessment of the characteristic being assessed. It neither includes information on quantity or amount nor does it indicate more than or less than.
Example: Gender (male or female) is a common nominal variable used in epidemiologic studies.
Example: Country telephone codes are an example of numeric variables that do not indicate more or less (country code 82 is not more than country code 37).
Ordinal Scale
The ordinal scale also classifies persons or things based on the characteristic being assessed but does indicate more than or less than. In this sense, it provides more information than the nominal scale. However, the ordinal scale does not indicate how much more than or less than.
Example: Rating students' performance as being poor, average, good, or excellent indicates how well students perform and provides a basis for comparison. However, it does not indicate how much better an excellent performance is compared to a good one.
Interval Scale
The interval scale has the same characteristics as the ordinal scale (classifying persons or things based on the characteristic assessed and indicating more than or less than), but the interval scale also indicates how much more than or less than. What the interval scale does not do is indicate a true zero point, meaning that a value of zero does not represent the absence of the characteristic being measured. Additionally, ratios made with two numbers in the interval scale do not have meaning.
Example: Temperature measured in degrees Celsius or Fahrenheit is an interval variable in that different values can tell you how much more or less. However, there is no true zero point. The value of zero in temperature does not indicate absence of temperature. Also, when comparing two temperatures, their ratio is not meaningful. We would not say that a 90 degree temperature is twice as hot as a 45 degree temperature.
Ratio Scale
The ratio scale includes all the characteristics of the interval scale but does indicate a true zero point.
Example: Height and weight measurements indicate how much more or less, but also have a true zero point. A weight of zero indicates an absence of weight.
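The difference between the interval and ratio scales can be checked numerically. The sketch below is only an illustration and is not part of the workbook's exercises. It assumes that the 90 and 45 degree temperatures in the interval scale example are in degrees Fahrenheit (the text does not state the unit) and uses made-up weights of 80 kg and 40 kg; the function names are hypothetical.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


def kilograms_to_pounds(weight_kg):
    """Convert a weight from kilograms to pounds."""
    return weight_kg * 2.20462


# Interval scale: the ratio changes when the unit changes,
# because neither temperature scale has a true zero point.
ratio_fahrenheit = 90 / 45
ratio_celsius = fahrenheit_to_celsius(90) / fahrenheit_to_celsius(45)

# Ratio scale: the ratio is the same in any unit,
# because zero means "no weight" in both kilograms and pounds.
ratio_kg = 80 / 40
ratio_lb = kilograms_to_pounds(80) / kilograms_to_pounds(40)

print(f"Temperature ratio in Fahrenheit: {ratio_fahrenheit:.2f}")
print(f"Temperature ratio in Celsius:    {ratio_celsius:.2f}")
print(f"Weight ratio in kilograms:       {ratio_kg:.2f}")
print(f"Weight ratio in pounds:          {ratio_lb:.2f}")
```

The temperature ratio depends on the unit chosen, which is why "twice as hot" is not a meaningful statement, while the weight ratio stays at 2 in either unit.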
Scales of Measurement: SUMMARY
Nominal
  - Classifies persons or things based on a qualitative assessment
  - Similar or dissimilar but not more or less
  - Can be numeric, but there is no implication of more or less
Ordinal
  - Classifies persons or things based on a qualitative assessment
  - More or less but not how much more or less
Interval
  - Indicates how much more or less
  - Does not contain a true zero point
  - Cannot create meaningful ratios of two numbers
Ratio
  - Includes all the characteristics of the interval scale, but contains a true zero point
Practice: Scales of Measurement
Identify the scale described in each situation below:
1. Temperature of patients at a health facility
2. The weight of children under five at a weekly baby weighing
3. The religion of families in a village
4. The length of time spent in the hospital
5. The diagnosis of patients upon admission to the hospital
Related Concepts
Frequency Distribution
Frequency Distributions
One of the most common ways to summarize data for better understanding and clearer presentation is through a frequency distribution. A frequency distribution is a presentation of the number of times (or the frequency) that each value (or group of values) occurs in the study population.
A frequency distribution helps to give a picture of the shape of the distribution of the data. Data is unimodal if it has only one peak, bimodal if it has two peaks, and multimodal if there are more than two peaks. Measures of dispersion will help you to form a clearer picture of the distribution of the data by describing the width, or the spread, of the data. We will discuss this in more detail in the section titled Measures of Dispersion.
A frequency distribution can be displayed as a table, a bar chart, a histogram, or a frequency polygon. Each display should be clearly labeled with the frequencies it shows. The method usually depends on the type of variable being described.
Overview
Frequency distributions show how often each value for a variable occurs in a sample or population.
Example: Malaria cases may be reported on a frequency-by-month basis in order to determine the high-risk months in the year.
Categorical variables are qualitative in nature and are best displayed as a table or a bar chart.
Table
A frequency table simply shows the number of times each specific observation appears in a sample or population.
Cases of Malaria    Frequency
Monday              6
Tuesday             4
Wednesday           2
Thursday            5
Friday              3
Saturday            4
Total               24
Bar chart
A bar chart, like a table, displays the number of observations in each category, but provides a better visual representation.
[Figure: bar chart titled "Cases of Malaria"; x-axis: day of the week (Monday to Saturday); y-axis: Frequency (0 to 7)]
Numerical variables are quantitative in nature and are best displayed as a frequency histogram or a frequency polygon.
Frequency histogram
A frequency histogram shows the frequencies relative to each other. The width of each bar is proportional to the class interval that it represents. Typically there are no spaces between bars in a frequency histogram, though you may see them constructed with spaces at times.
[Figure: histogram titled "Frequency of Malaria Cases in the Past Year"; x-axis: Number of Cases (0, 1, 2, 3, 3+); y-axis: People (0 to 25)]
Frequency polygon
A frequency polygon encloses the same area under the line that a histogram displays within the bars. Each point is plotted at the midpoint of a class interval. Though a frequency polygon may look like a line graph, a frequency polygon must be closed at the ends.
[Figure: frequency polygon titled "Frequency of Malaria Cases in the Past Year"; x-axis: Number of Cases (0, 1, 2, 3, 3+), closed at both ends; y-axis: People (0 to 25)]

Numerical variables may need to be grouped for presentation if the number of values is large or the variable is continuous. The box below gives guidelines on how to group variables.
Relative Frequency
Often it is useful to know the proportion of the values that fall within a specific category or group. This is obtained by dividing the number of values in that category by the total number in the sample. This is referred to as the relative frequency and is presented as a proportion (values from 0.0 to 1.0) or a percent (values from 0% to 100%).
When reporting either the frequency or the relative frequency in table or graph form, make sure that all data is clearly labeled.
Cases of Malaria    Frequency    Percent    Cum Percent
Monday              6            25.0       25.0
Tuesday             4            16.7       41.7
Wednesday           2            8.3        50.0
Thursday            5            20.8       70.8
Friday              3            12.5       83.3
Saturday            4            16.7       100.0
Total               24           100.0      100.0
In the table above, the relative frequency is presented as a percent of the whole.
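The Percent and Cum Percent columns can be reproduced with a few lines of code. The following is a minimal sketch in Python, not the method the workbook itself uses (the workbook relies on hand calculation, Epi Info, and Excel); the function name relative_frequency_table is hypothetical. The counts are the malaria cases by day from the table above.

```python
# Frequency of malaria cases by day, taken from the table above.
cases_by_day = {
    "Monday": 6,
    "Tuesday": 4,
    "Wednesday": 2,
    "Thursday": 5,
    "Friday": 3,
    "Saturday": 4,
}


def relative_frequency_table(counts):
    """Return rows of (category, frequency, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for category, count in counts.items():
        cumulative += count
        percent = round(100 * count / total, 1)
        cumulative_percent = round(100 * cumulative / total, 1)
        rows.append((category, count, percent, cumulative_percent))
    return rows


table = relative_frequency_table(cases_by_day)
for category, count, percent, cumulative_percent in table:
    print(f"{category:<10} {count:>3} {percent:>6} {cumulative_percent:>6}")
```

The cumulative percent is computed from the running count rather than by adding rounded percents, so rounding errors do not accumulate.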
Grouping Variables
Continuous numeric variables must often be regrouped into categories for analysis purposes. Listed below are some general guidelines to use when grouping variables:
- Create class intervals that are mutually exclusive and include all data. It should be clear where one interval stops and the next one begins. No value should fall into more than one interval.
- Use a large number of narrow class intervals for the initial analysis. All intervals should be the same size. You can combine intervals later if needed, but it is impossible to break intervals down further without referring back to the original data.
- Use natural or meaningful groupings when possible. There are many groupings, such as five-year age intervals and body mass index (BMI) categories, which are used frequently and, therefore, have become standard. Some groupings have been established by organizations such as WHO or CDC.
- Create a separate category for unknowns. This will avoid confusion when comparing subgroup observations (n) to the total number of observations (N).
Step-by-Step Example: Frequency Distributions
Use the data below to create frequency distributions. This might represent a class of master's students. First, create a frequency table for Gender, then display the same information in a bar chart. Next, create a histogram of Number of children. Also, display this information in a frequency polygon.
Subject    Gender    Age    Number of children    Marital Status*
1          M         32     1                     M
2          M         35     0                     M
3          F         28     0                     S
4          M         45     3                     D
5          F         47     3                     M
6          F         36     2                     D
7          M         29     1                     S
8          M         31     0                     S
9          F         42     2                     D
10         F         44     2                     M
*M = married, S = single, D = divorced
Step    Example
1. Create a frequency table.
Determine the number of observations for each category under Gender. Display this in a table.
Gender    Frequency
Female    5
Male      5
2. Create a bar chart.
Display the frequency of the observations for Gender in a bar chart.
[Figure: bar chart titled "Gender of Participants"; x-axis: Gender (Male, Female); y-axis: Frequency (0 to 6)]
3. Create a histogram.
Display the frequency of the observations for Number of children in a histogram.
[Figure: histogram titled "Number of Children of Participants"; x-axis: Number of Children (0, 1, 2, 3); y-axis: frequency (0 to 3.5)]
4. Create a frequency polygon.
Display the frequency for Number of children as a polygon.
[Figure: frequency polygon titled "Number of Children of Participants"; x-axis: Children (0, 1, 2, 3), closed at both ends; y-axis: frequency (0 to 3.5)]
5. Describe the data.
There are an equal number of men and women among the participants. The frequency distribution shows that the variable Number of children is bimodal in nature. The majority of participants have either no children or two children.
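The counts in this example can be double-checked in code. The sketch below is one possible way to do so in Python using the standard library's Counter; it is an illustration rather than part of the workbook's procedure. The Gender and Number of children values are copied from the example data table, and the text bar chart is a simple stand-in for the figures.

```python
from collections import Counter

# Gender and Number of children for subjects 1 to 10, from the example data.
gender = ["M", "M", "F", "M", "F", "F", "M", "M", "F", "F"]
children = [1, 0, 0, 3, 3, 2, 1, 0, 2, 2]

gender_counts = Counter(gender)
children_counts = Counter(children)

# Frequency table for Gender.
for category in sorted(gender_counts):
    print(f"{category}: {gender_counts[category]}")

# Text version of the histogram for Number of children.
for value in sorted(children_counts):
    print(f"{value} | {'#' * children_counts[value]}")

# The values with the highest frequency are the peaks of the distribution.
highest = max(children_counts.values())
peaks = sorted(v for v, count in children_counts.items() if count == highest)
print("Most frequent number(s) of children:", peaks)
```

Two values share the highest frequency, which is what makes the distribution bimodal.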
Practice: Frequency Distributions
Using the following dataset, create visual representations of the frequency distributions for the variables.
Subject    Gender    Age    Number of children    Marital Status
1          M         32     1                     M
2          M         35     0                     M
3          F         28     0                     S
4          M         45     3                     D
5          F         47     3                     M
6          F         36     2                     D
7          M         29     1                     S
8          M         31     0                     S
9          F         42     2                     D
10         F         44     2                     M
1. Create a frequency table for the variable Marital Status. (Include the cumulative percent.)
2. Show the same information in a bar chart.
3. Draw a frequency histogram for the variable Age. Group the ages in intervals of five before beginning.
4. Display the same information in a frequency polygon.
Space has been provided on the following pages to complete your work.
Step    Practice Space
1. Create a frequency table.
2. Create a bar chart.
3. Create a histogram.
4. Create a frequency polygon.
5. Describe the dataset.
Epi Info Example: Frequency Distributions
You are attending a fictitious international conference. Demographic data was collected on the attendees. Use what you know about frequency distributions to summarize the data. First, create a table and a bar chart of the categorical variable, Occupation. Then, create a histogram and a frequency polygon for the continuous numerical variable, Weight_kg. The dataset is called Frequency_Dist and is found in the Bios_Workbook_Examples.mdb database.
Frequency Table
Step    Example
1. READ the dataset.
Open Epi Info and choose Analyze Data.
Select READ under Data Analysis Commands.
Open Frequency_Dist in the database, Bios_Workbook_Examples.mdb.
2. Create a frequency table.
Select the FREQUENCIES command.
In the Frequency drop-down box, highlight the variable that you want to examine. For this example, highlight Occupation.
Click OK.
3. Describe the data.
You should see a frequency table on your screen that looks like the one below:
This table provides information on the variable Occupation by presenting frequencies and relative frequencies.
Bar Chart
1. Make a frequency bar chart in Epi Info.
Choose GRAPH under Statistics.
In the Graph Type drop-down box, choose Bar (default).
In the box labeled 1st Title | 2nd Title, type Occupation of Participants. This is the title of your chart.
Under X-Axis, choose Occupation as the Main Variable.
Under Y-Axis, Show Value of Count (default).
Click OK.
2. Describe the data.
Epi Info will give you the graph below:
Notice that the graph represents the exact numbers listed in the table created previously.
You can make a bar chart of the percentage of participants in each occupation by choosing Show Value of Count % under Y-Axis.
Histogram
1. Make a histogram in Epi Info.
Choose GRAPH under Statistics.
Under Graph Type, choose Histogram.
Create a title for your graph.
Choose Weight_kg as the main variable and Show Value of Count.
Notice when you select Histogram as the Graph Type, you are given the option to create intervals. This allows you to group the variable Weight_kg without creating a new variable. Using the Intervals option makes the data easier to view. If you create a FREQUENCIES table you can see that there are nearly 50 different weights recorded. It may not be useful to have each one listed separately.
To create intervals, look at the column marked X-Axis.
Type 5 in the first space under Interval.
Type 45 in the space under First Value.
Click OK.
2. Describe the data.
Now the graph you see will present the weight of participants in 5 kg intervals.
Epi Info Practice: Frequency Distributions
Use the dataset from the fictitious conference (Frequency_Dist) once again to create frequency distributions for Height_cm and Preferred Language in Epi Info.
1. Create a frequency table of Preferred Language in Epi Info.
2. Make a frequency bar chart of Preferred Language in Epi Info.
3. Make a histogram of Height in Epi Info.
Review each of these displays and describe the dataset.
Step    Practice Space
4. Describe the dataset using the frequency charts and graphs that you have created.
Excel Example: Frequency Distributions
Now use Excel to create a frequency polygon for the continuous numerical variable, Weight_kg. The dataset is called Frequency_Dist and is found in the Bios_Workbook_Examples.mdb database.
1. Create a frequency polygon in Excel.
a. Open Excel and import the dataset.
From the toolbar, select Data.
Highlight Import External Data.
Choose Import Data.
Locate Frequency_Dist in the Bios_Workbook_Examples.mdb database.
Click Open.
The dataset should appear as an Excel spreadsheet.
b. Create a frequency table for Weight_kg.
Copy the variable Weight_kg by highlighting the column. Press Ctrl+C to copy. Choose a blank cell on the spreadsheet and paste the variable by pressing Ctrl+V.
In the cell next to the variable heading, type Interval. Complete the column by entering the intervals that you have chosen for the data. In this case, create intervals of 5, beginning with 45-49 and continuing until 100-104. You should anchor the intervals by including <=44 and >=105. The first and last intervals should have a frequency of zero.
The next column will be titled Bin. Bin is a word used by Excel to define interval limits. In this column, we tell Excel how to read the intervals that we have created. The first number in the bin array, n, tells Excel to find all observations less than or equal to that number. The second number, p, tells Excel to locate all observations that occur between n+1 and p. This continues down the column, with each number serving as the upper limit of its interval. Excel reports any observations greater than the final number separately, in one additional cell below the last bin, if the highlighted output range extends that far.
Create the bin by typing in the highest number that should be included in each interval. For the first number in the bin, Excel will look for all observations less than or equal to that number. For the last interval (>=105), enter 105; in this example no weights are that large, so its frequency is zero.
Weight (kg)    Intervals    BIN    Frequency
73             <=44         44
52             45-49        49
68             50-54        54
93             55-59        59
71             60-64        64
86             65-69        69
69             70-74        74
74             75-79        79
60             80-84        84
58             85-89        89
91             90-94        94
95             95-99        99
48             100-104      104
59             >=105        105
67
75
87
Your final column will be called Frequency. We will let Excel calculate the frequencies for us.
Highlight the Frequency column by clicking on the first cell under the heading and dragging the mouse until the shaded area equals the length of the Bin column. Do not include the column label (Frequency) when highlighting.
Under Insert in the toolbar, choose Function.
Select the function FREQUENCY. You may have to search for it by typing the word frequency at the prompt.
Click OK.
You will see the following box:
Click on the chart icon to the right of the box labeled Data_array. Highlight all the values for the variable Weight_kg.
Click on the chart icon again to return to the function box.
c. Create a frequency polygon.
Click on the chart icon to the right of the box labeled Bins_array.
Highlight all the values in the Bin column.
Click on the chart icon again to return to the function box.
Press Control and Shift together and hit Enter while continuing to hold the other two keys down. (DO NOT CLICK OK!)
The number of observations included in each interval will be shown in the chart. You now have a frequency table. Note that there is a frequency of zero at the high end and at the low end of the weight intervals. You will need this in order to create a frequency polygon correctly.
Using the frequency table that you just made, highlight all the values in the Frequency column.
Under Insert in the toolbar, select Chart.
Choose Chart Type: Line. The first line graph in the second row is preferred because it shows the midpoints in the graph.
Click Next.
A frequency polygon will appear.
To correctly label the polygon, choose the Series tab.
Click the chart icon next to the box labeled Category (X) axis labels.
Highlight the values in the column, Intervals.
Your chart should now be labeled similar to the one below:
Click Next.
Choose Title to give your chart a title and label the X axis.
Click Finish.
2. Describe the data.
[Figure: frequency polygon titled "Weight of Conference Participants"; x-axis: Weight in kg, in intervals from <=44 to >=105; y-axis: frequency (0 to 9)]
This distribution is unimodal because one peak is higher than the rest. The majority of participants' weights fall to the left of the peak. Most participants weigh less than 84 kg.
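The binning step of the Excel example can also be sketched in Python. This is an illustration under stated assumptions, not the workbook's procedure: the hypothetical function frequency mirrors how Excel's FREQUENCY function assigns values to bins (each value is counted at the first bin limit that it does not exceed, and one extra count holds values above the last limit). The weights are only the 17 values visible in the table excerpt of the example, not the full Frequency_Dist dataset, so the counts will not match the chart described above.

```python
from bisect import bisect_left

# The 17 weights (kg) visible in the example table; the full dataset is larger.
weights = [73, 52, 68, 93, 71, 86, 69, 74, 60, 58, 91, 95, 48, 59, 67, 75, 87]

# Upper limits of the intervals, as typed in the Bin column: 44, 49, ..., 104, 105.
bins = list(range(44, 105, 5)) + [105]


def frequency(data, bin_limits):
    """Count values per bin: previous limit < value <= current limit.

    The returned list has one more entry than bin_limits; the last entry
    counts values greater than the final limit.
    """
    counts = [0] * (len(bin_limits) + 1)
    for value in data:
        # bisect_left finds the first limit that is >= value.
        counts[bisect_left(bin_limits, value)] += 1
    return counts


counts = frequency(weights, bins)
labels = ["<=44"] + [f"{low}-{low + 4}" for low in range(45, 105, 5)] + [">=105", ">105"]
for label, count in zip(labels, counts):
    print(f"{label:>8}: {count}")
```

The first and last counts are zero, which is the anchoring at both ends that a frequency polygon needs in order to be closed.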
Excel Practice: Frequency Distributions
Use the dataset from the fictitious conference (Frequency_Dist) to create a frequency polygon for Height_cm in Excel.
1. Create a frequency polygon of Height in Excel.
Use your graph to answer the following questions.
Step    Practice Space
2. Describe the dataset using the frequency polygon.
3. How is this similar to the histogram that you created in Epi Info?
Related Concepts
Central Location and Dispersion
Central Location and Dispersion
Measures of central location and dispersion are generally referred to as descriptive statistics because they describe the distribution of the dataset. A frequency distribution provides a picture of the number of times that each value of a variable occurs, but does not quantify the spread of the data. In order to gain a clearer picture of how data is distributed, we will calculate:
- Measures of central tendency: mean, median, and mode
- Measures of dispersion: range, variance, standard deviation, and standard error
Through these measures, the data begins to take shape. When combined with the frequency distribution, we can visualize the distribution of the data. We obtain the number and height of the peaks in the distribution from the frequency distribution. Measures of dispersion allow us to obtain an idea of the width, or the spread, of the distribution of the data.
Data can be either symmetric or skewed. If the distribution can be divided at its center into two halves that are approximately mirror images of each other, we can say that the data is symmetric. If one tail of a unimodal distribution is longer than the other tail, then the data is skewed, meaning that the data is not spread evenly. Data can be either right-skewed or left-skewed. If data is skewed to the right, it will rise quickly to a peak and have a long tail on the right. The opposite is true for data that is skewed to the left.
Measures of Central Tendency
Mean
The mean is simply the arithmetic average of the data and is calculated by taking the sum of all values in the dataset and dividing that total by the number of values in the dataset. The mean is the most commonly used measure of central tendency.
$$\bar{x} = \frac{\sum x}{n}$$
Median
The median is the 50th percentile of the values in a dataset and represents the literal middle of the data. The median is found by arranging all values in the dataset in numerical order and then choosing the middle value. If the number of values in a dataset is even, take the mean of the two middle numbers to find the median.
Mode
The mode represents the value that is found most frequently in a set of numbers. Note that it is possible to have more than one mode. In the following set of numbers, {8, 7, 8, 8, 9, 6, 5, 6, 4, 6, 7}, the modes are 8 and 6, since each is included in the dataset three times. This dataset is referred to as bimodal because it has two modes. It is also possible not to have a mode in a set of numbers. In the following set of numbers, {5, 4, 9, 7, 6, 3, 8}, there is no number which occurs more frequently than any other. Therefore, there is no mode.
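The two mode examples above can be verified with a short function. This is a minimal sketch in Python; the function name find_modes is hypothetical, and it follows the convention used in the text that a set in which every distinct value occurs equally often has no mode.

```python
from collections import Counter


def find_modes(values):
    """Return the most frequent value(s) in sorted order.

    An empty list means there is no mode: every distinct value
    occurs the same number of times.
    """
    counts = Counter(values)
    highest = max(counts.values())
    all_equal = all(count == highest for count in counts.values())
    if len(counts) > 1 and all_equal:
        return []
    return sorted(value for value, count in counts.items() if count == highest)


bimodal_set = [8, 7, 8, 8, 9, 6, 5, 6, 4, 6, 7]
no_mode_set = [5, 4, 9, 7, 6, 3, 8]

print(find_modes(bimodal_set))
print(find_modes(no_mode_set))
```

Python's built-in statistics.multimode would return every value for the second set, so the explicit check for equal frequencies is what reproduces the "no mode" answer given in the text.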
Overview
Measures of central tendency are used to describe the data in the sample by giving an idea of the center and the distribution of the data.
There are three common measures of central tendency: mean, median, and mode.
Formula: For instance, the arithmetic mean is calculated as follows:
$$\bar{x} = \frac{\sum x}{n}$$
Comparison of mean, median, and mode
When you are told to average the data, it is generally expected that you will take the mean. Technically, however, the average could refer to the mean, the median, or the mode of the data. The mean is able to give us the most information about the dataset as a whole, especially when combined with the standard deviation. Therefore, we prefer to use the mean when we can.
There are certain advantages to the median. The median is resistant to skewing, the result of an outlier causing the mean of the data to shift either to the left or to the right. It is not affected by extreme values the way the mean is, and it is more representative of the center of the data when the data is asymmetrical.
Let's consider skewed data. Look at the graph of the population distribution by state in the United States.
[Figure: bar chart titled "Population of the United States by State"; x-axis: State, ordered from most populous (California) to least populous (Wyoming); y-axis: Population (0 to 40,000,000); the mean and median are marked on the graph]
The states appearing on the left side of the graph have a much larger population than the other states. Because of this, we expect the mean to be higher in value than the median. The calculated mean in this sample is 5,811,968.706, which is marked on the graph above. The median is 4,173,405, also marked on the graph. The mean in this example is greater than the median. A general rule to follow is that if the data is skewed either to the left or to the right, the median represents the data better than the mean. If a sample is normally distributed, the mean and median will be nearly the same. With symmetrical data, the mode will be similar as well.
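The effect of an extreme value on the mean and the median can be seen on a very small scale. The sketch below uses Python's standard statistics module; the numbers are made up purely for illustration and are not taken from the state population data.

```python
from statistics import mean, median

# Made-up illustrative values, not data from the workbook.
values = [2, 3, 3, 4, 5]
values_with_outlier = values + [50]

mean_before = mean(values)
median_before = median(values)
mean_after = mean(values_with_outlier)
median_after = median(values_with_outlier)

print(f"Without outlier: mean = {mean_before:.2f}, median = {median_before}")
print(f"With outlier:    mean = {mean_after:.2f}, median = {median_after}")
```

A single large value pulls the mean far to the right while the median barely moves, which is the same pattern that makes the mean state population larger than the median.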
When the sample size is small, the mode may represent the data most accurately. It is possible that in bimodal data, the modes will be a more accurate description as well. The mode is also frequently used to describe qualitative data. For example, you might find a modal diagnosis, or use the mode to describe medical diagnoses by stating the diagnosis that was seen most frequently over a given period of time.
Step-by-Step Example: Mean, Median, Mode
The following are ages of patients seen by the doctor for a broken bone in the past month:
15 17 20 14 16 15 17 22 18 13 15 14 16 18 20
Use the data to answer the following questions:
- What is the mean age of the patients?
- What is the median age of the patients?
- What is the modal age of the patients?
- Which measure is the most representative of the sample?
Step    Example
1. Find the mean, $\bar{x}$, of the sample.
$$\bar{x} = \frac{\sum x}{n} = \frac{15+17+20+14+16+15+17+22+18+13+15+14+16+18+20}{15} = \frac{250}{15} \approx 16.7$$
2. Find the median of the sample.
First line the numbers up in numerical order:
13 14 14 15 15 15 16 16 17 17 18 18 20 20 22
Find the middle number, which is the 8th of the 15 values:
13 14 14 15 15 15 16 [16] 17 17 18 18 20 20 22
There are 7 numbers on either side of the marked value, thus 16 is the median.
3. Find the mode of the sample.
13 14 14 15 15 15 16 16 17 17 18 18 20 20 22
The number that appears most, at three times, in this dataset is 15. Therefore, 15 is the mode.
CentralLocationandDispersion
BiostatisticsWorkbook 37
DRAFT:Aug.28,2007
Step Example
4. Which
statisticis
most
representative
ofthecenter
ofthe
dataset?
Inthiscase,themeanandthemedianarenearlyequal.
Therefore,wecanassumethatthecurveisnormally
distributedandthemeanrepresentsthecenterofthecurve.
Ifthemeanandthemedianaredifferent,wecanassume
thatthedataisskewedandthemedianwillgenerallybe
moreappropriate.
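The hand calculation above can be cross-checked in a few lines of code. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and are not part of the workbook or of Epi Info.

```python
import statistics

# Ages of the 15 patients from the step-by-step example
ages = [15, 17, 20, 14, 16, 15, 17, 22, 18, 13, 15, 14, 16, 18, 20]

mean_age = statistics.mean(ages)      # sum of the values divided by n
median_age = statistics.median(ages)  # middle value of the sorted list
modal_age = statistics.mode(ages)     # most frequently occurring value

print(f"mean   = {mean_age:.1f}")
print(f"median = {median_age}")
print(f"mode   = {modal_age}")
```

Comparing the printed mean and median is a quick way to repeat the Step 4 check on any new dataset.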
Practice: Mean, Median, Mode

In order to determine if there is a relationship between age and the number of visits to the doctor, you decide to count the number of doctor visits that individuals make over the course of a year. Below is the data that you have collected:

Individual   Age   Visits
1            45    15
2            60    8
3            52    22
4            46    9
5            23    2
6            52    15
7            37    3
8            33    13

Describe the average age of your sample and the average number of doctor visits made by an individual using the mean, median, and mode.

Step 1. Find the mean, $\bar{x}$, using $\bar{x} = \frac{\sum x}{n}$.

Step 2. Find the median.

Step 3. Find the mode.

Step 4. Which statistic is most representative of the center of the dataset and why?
Epi Info Example: Mean, Median, Mode

Using the same patient age data from the Step by Step Example above, we can find the mean, median, and mode in a few simple steps using Epi Info.

Step 1. Use Epi Info to determine descriptive statistics.
   a. READ the dataset.
      • Open Epi Info and choose Analyze Data.
      • Select READ in Data Analysis Commands.
      • Highlight Central_Tendency from the Data Source Bios_Workbook_Examples.
      • Click OK.
   b. Find the MEANS of the data.
      • Select MEANS from the Commands column under Statistics.
      • Choose Age from the dropdown box under Means of.
      • Click OK.

Step 2. Identify the mean, median, and mode of the data.
   The output gives you the mean, the median, and the mode. Epi Info reports the mean to be 16.7, the median to be 16.0, and the mode to be 15.0. This does not differ from the hand calculations that we performed previously.

Step 3. Interpret the results.
   As we determined earlier, the mean and the median are nearly equal. This is what we expect when the data is roughly symmetric, so the mean represents the center of the data well. If the mean and the median are noticeably different, we can assume that the data is skewed and the median will generally be more appropriate.
Epi Info Practice: Mean, Median, Mode

You are weighing babies from 9 AM to 11 AM at an under-five clinic in the village. Your results are as follows:

Age (months)   Length (cm)   Weight (kg)
21             77            9.8
34             87            11.5
23             84            10.8
30             92            14.0
27             85            12.0
24             82            10.8
31             87            11.6
26             85            11.8
22             85            12.4
32             86            12.0

Use Epi Info to find the mean, median, and mode. Then, answer the questions that follow. The dataset you are working from is called BabyWeighing. Remember to open the dataset in Epi Info by using the READ command.

Step 1. Identify the mean, median, and mode of the data.
   Length:               Weight:
   Mean ______           Mean ______
   Median _____          Median _____
   Mode ______           Mode ______

Step 2. What is the average length and weight of babies that came into the clinic on this morning?

Step 3. What can you determine about the distribution of the data based on your results?
Related Concepts
• Measures of Dispersion
• Normal Distribution
Measures of Dispersion

In the previous section, we discussed methods of describing the center of the data. Now we want to examine ways to describe the spread of the data, or how far each data point is from the center.

Range: The range of the data is the difference between the largest observation (maximum value) and the smallest observation (minimum value) in a set of data.

range = maximum − minimum
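Interquartile Range (IQR): The interquartile range is the difference between the 25th percentile (1st quartile, $Q_1$) and the 75th percentile (3rd quartile, $Q_3$) in a set of data. This measurement describes the spread of the middle 50 percent of the observations and is, therefore, less likely to be influenced by outliers or extreme values. In an ordered dataset of $n$ observations, $Q_1$ is the value at position $\frac{n+1}{4}$ and $Q_3$ is the value at position $\frac{3(n+1)}{4}$. Writing $x_{(k)}$ for the value at position $k$:

$$\mathrm{IQR} = Q_3 - Q_1 = x_{\left(\frac{3(n+1)}{4}\right)} - x_{\left(\frac{n+1}{4}\right)}$$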
Overview
Measures of dispersion describe variability of data in a sample by describing the spread of the data.

Formulas:
Range = maximum − minimum
Interquartile range: $\mathrm{IQR} = Q_3 - Q_1$, where $Q_1$ is the value at position $\frac{n+1}{4}$ and $Q_3$ is the value at position $\frac{3(n+1)}{4}$ in the ordered data
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ OR $s^2 = \frac{n\sum x_i^2 - \left(\sum x_i\right)^2}{n(n-1)}$
Standard deviation: $s = \sqrt{s^2}$
Standard error: $SE = \frac{s}{\sqrt{n}}$
Variance ($s^2$): The variance represents the amount of spread or variability around the mean of a set of data. Because the variance is in units squared, we find the standard deviation to describe our data in the proper units. The symbol $s^2$ is used when we are referring to the variance of a sample and the symbol $\sigma^2$ (pronounced sigma squared) when we are referring to the variance of a population.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \quad \text{OR} \quad s^2 = \frac{n\sum x_i^2 - \left(\sum x_i\right)^2}{n(n-1)}$$
Standard Deviation (s): The standard deviation of a set of data is the square root of the variance. It describes roughly the average distance of the observations from the mean of the sample and is used as a measure of variability to describe the spread of the data. A large standard deviation represents a wide spread because the observations are far from the mean. When we refer to the standard deviation of a population, we use the symbol σ (sigma).

$$s = \sqrt{s^2}$$
Standard Error (SE): The standard error is the standard deviation of the sampling distribution of the means, rather than the observations themselves. The smaller the standard error, the closer any given sample mean is likely to be to the true population mean.

$$SE = \frac{s}{\sqrt{n}}$$
Step by Step Example: Measures of Dispersion

Using the data below, follow the instructions to identify the measures of dispersion for Age.

Individual   Age   Visits
1            45    15
2            60    8
3            52    22
4            46    9
5            23    2
6            52    15
7            37    3
8            33    13
Minimum, maximum, and range

Step 1. Identify the minimum value of Age.
   The minimum value is the lowest value in the sample. In this case, it is 23.

Step 2. Identify the maximum value of Age.
   The maximum value is the highest value in the sample. In this case, it is 60.

Step 3. Determine the range of Age.
   max − min = range
   60 − 23 = 37
   37 is the range of the sample.

Step 4. State your conclusions.
   The observations in Age cover a range of 37 years.
Interquartile Range

Step 1. Arrange observations of the variable Age in order of increasing value.
   1) 23
   2) 33
   3) 37
   4) 45
   5) 46
   6) 52
   7) 52
   8) 60

Step 2. Find the position of the 1st ($Q_1$) and 3rd ($Q_3$) quartiles.

$$\text{position of } Q_1 = \frac{n+1}{4} = \frac{8+1}{4} = 2.25$$

$$\text{position of } Q_3 = \frac{3(n+1)}{4} = \frac{3(8+1)}{4} = 6.75$$

Step 3. Locate each number indicated in the dataset.
   $Q_1$, with a position of 2.25, is one fourth of the way between the 2nd and 3rd observations in the set. The 2nd value is 33 and the 3rd is 37, so

$$Q_1 = 33 + \frac{1}{4}(37 - 33) = 33 + 1 = 34$$

   $Q_3$, with a position of 6.75, is three fourths of the way between the 6th and 7th observations in the set. The 6th value is 52 and the 7th value is also 52. Therefore, $Q_3 = 52$.

Step 4. Find the difference between $Q_1$ and $Q_3$ to determine the interquartile range.

$$\mathrm{IQR} = Q_3 - Q_1 = 52 - 34 = 18$$

Step 5. State your conclusions.
   The middle 50 percent of the data has a range of 18. This means that the middle half of all the observations in Age is spread across 18 years.
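The position rule used in the steps above can be written as a short function. This is a minimal sketch in Python, assuming the workbook's (n + 1)/4 convention with linear interpolation between neighboring observations; the function name `quartile` is illustrative. Statistical packages use several different quartile conventions, so their output may differ slightly from this hand method.

```python
def quartile(values, k):
    """Return the k-th quartile (k = 1, 2 or 3) using the (n + 1) position rule."""
    data = sorted(values)
    n = len(data)
    position = k * (n + 1) / 4      # 1-based position, possibly fractional
    lower = int(position)           # observation at or just below the position
    fraction = position - lower     # how far to move toward the next observation
    if lower < 1:                   # position falls before the first observation
        return float(data[0])
    if lower >= n:                  # position falls at or past the last observation
        return float(data[-1])
    below, above = data[lower - 1], data[lower]
    return below + fraction * (above - below)

ages = [45, 60, 52, 46, 23, 52, 37, 33]

q1 = quartile(ages, 1)
q3 = quartile(ages, 3)
iqr = q3 - q1
data_range = max(ages) - min(ages)

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}, range = {data_range}")
```

The same function can be applied to the Visits column when working through the practice exercise.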
Variance, standard deviation, and standard error

Step 1. Find the mean of the dataset.
   1) 23
   2) 33
   3) 37
   4) 45
   5) 46
   6) 52
   7) 52
   8) 60

$$\bar{x} = \frac{23+33+37+45+46+52+52+60}{8} = \frac{348}{8} = 43.5$$

Step 2. Calculate the variance using the formula below.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

$$s^2 = \frac{1}{8-1}\left[(23-43.5)^2 + (33-43.5)^2 + (37-43.5)^2 + (45-43.5)^2 + (46-43.5)^2 + 2(52-43.5)^2 + (60-43.5)^2\right]$$

$$s^2 = \frac{1}{7}\left[420.25 + 110.25 + 42.25 + 2.25 + 6.25 + 2(72.25) + 272.25\right]$$

$$s^2 = \frac{1}{7}(998) = 142.57$$

Step 3. Calculate the standard deviation.

$$s = \sqrt{s^2} = \sqrt{142.57} = 11.94$$

Step 4. Calculate the standard error of the means.

$$SE = \frac{s}{\sqrt{n}} = \frac{11.94}{\sqrt{8}} = 4.22$$

Step 5. State your conclusions.
   The observations are, on average, about 11.94 years away from the mean. If we were to take many samples of this size from the same population, the sample means would typically fall about 4.22 years from the actual population mean.
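As a cross-check on the arithmetic above, the following minimal Python sketch computes the variance with both the definitional and the computational formula, then derives the standard deviation and standard error. The variable names are illustrative.

```python
import math
import statistics

ages = [45, 60, 52, 46, 23, 52, 37, 33]
n = len(ages)

mean_age = statistics.mean(ages)

# Definitional formula: sum of squared deviations divided by (n - 1)
variance = sum((x - mean_age) ** 2 for x in ages) / (n - 1)

# Computational formula: [n * sum(x^2) - (sum x)^2] / [n * (n - 1)]
variance_alt = (n * sum(x * x for x in ages) - sum(ages) ** 2) / (n * (n - 1))

sd = math.sqrt(variance)   # standard deviation, in years
se = sd / math.sqrt(n)     # standard error of the mean

print(f"mean = {mean_age}, variance = {variance:.2f}")
print(f"sd = {sd:.2f}, se = {se:.2f}")
```

Both variance formulas give the same result; the computational form is convenient when only the sum and the sum of squares are available.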
Practice: Measures of Dispersion

Use the same dataset to describe the dispersion of the observations of the variable Visits.

Individual   Age   Visits
1            45    15
2            60    8
3            52    22
4            46    9
5            23    2
6            52    15
7            37    3
8            33    13

Minimum, maximum, and range

Step 1. Identify the minimum value of Visits.

Step 2. Identify the maximum value of Visits.

Step 3. Determine the range of Visits.
   max − min = range

Step 4. State your conclusions.

Interquartile Range

Step 1. Arrange observations of the variable Visits in order of increasing value.

Step 2. Find the position of the 1st ($Q_1$) and 3rd ($Q_3$) quartiles.

$$\text{position of } Q_1 = \frac{n+1}{4} \qquad \text{position of } Q_3 = \frac{3(n+1)}{4}$$

Step 3. Locate each number indicated in the dataset.

Step 4. Find the difference between $Q_1$ and $Q_3$ to determine the interquartile range.

$$\mathrm{IQR} = Q_3 - Q_1$$

Step 5. State your conclusions.

Variance, standard deviation, and standard error

Step 1. Find the mean of the variable Visits.

Step 2. Calculate the variance using the formula below.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Step 3. Calculate the standard deviation.

$$s = \sqrt{s^2}$$

Step 4. Calculate the standard error of the means.

$$SE = \frac{s}{\sqrt{n}}$$

Step 5. State your conclusions.
Epi Info Example: Measures of Dispersion

Use the table below (dataset BabyWeighing) to find measures of dispersion for the variable Age in Epi Info. First find the maximum, minimum, range, and interquartile range. Then calculate the variance, the standard deviation, and the standard error.

Age (months)   Length (cm)   Weight (kg)
21             77            9.8
34             87            11.5
23             84            10.8
30             92            14.0
27             85            12.0
24             82            10.8
31             87            11.6
26             85            11.8
22             85            12.4
32             86            12.0

Step 1. READ the dataset in Epi Info.
   • Open Epi Info and choose Analyze Data.
   • Select READ and open the database, Bios_Workbook_Examples. Choose the dataset BabyWeighing.
   • Click OK.

Step 2. Find the MEANS of the dataset.
   • Select MEANS under the Statistics heading.
   • In the dropdown menu for Means Of, choose Age_in_months.
   • Click OK.

Step 3. Use the output to determine the range and the interquartile range.
   The output provides you with the maximum and the minimum in the data. Find the difference to determine the range.
   Range = maximum − minimum
   Range = 34 − 21 = 13
   The output also provides the 25th percentile, equal to $Q_1$, and the 75th percentile, equal to $Q_3$, so that we can determine the interquartile range.
   IQR = $Q_3 - Q_1$
   IQR = 31 − 23 = 8

Step 4. Use the output to identify the variance and standard deviation of the variable.
   Variance = 20.67
   Standard Deviation = 4.55
   If we want to calculate the standard error, we simply divide the standard deviation by the square root of the number of observations:

$$SE = \frac{4.5461}{\sqrt{10}} = 1.44$$

Step 5. Describe the variable in terms of dispersion.
   The range of the variable Age_in_months is 13 months. The middle half of the data spans 8 months. The observations are, on average, about 4.55 months from the mean of the data. If we were to take many samples of this size from the same population, the sample means would typically fall about 1.44 months from the actual population mean.
Epi Info Practice: Measures of Dispersion

Use the same dataset, BabyWeighing, to practice describing data in terms of dispersion with the help of Epi Info. Determine the range and interquartile range and identify the variance, standard deviation, and the standard error of the variable Length.
• Find the MEANS of the dataset in Epi Info.
• Use the output to answer the following questions.

Step 1. Determine the range and the interquartile range.
   Range = ______
   IQR = ______

Step 2. Identify the variance and standard deviation of the variable.
   s = ______
   $s^2$ = ______

Step 3. Describe the variable in terms of dispersion.

Related Concepts
• Normal Distribution
Probability and the Normal Distribution

Up to this point, we have focused on descriptive statistics. We have simply been organizing and summarizing data that has been collected. We also want to explore some methods for drawing conclusions about populations based solely on data that we have for a sample of that population. Because we can never be certain that our conclusions based on this sample accurately represent the target population, we refer to this as inferential statistics. Inferential statistics is based on probability theory, or the science of uncertainty. The following sections describe how probability theory allows us to make inferences about a population based on data obtained from a sample of that population.
Probability Distribution

Probability is an indicator of the likelihood that an event or condition will occur. Some describe it as the long-run relative frequency of the event in repeated trials under similar conditions. It reflects the proportion of the population with the condition or event. For example, if 40% of workers in a factory are female, the probability that a randomly selected worker will be female is 40%. Stated another way, if we randomly select n workers, the expected number of females in the sample is n × 40%. Likewise, the expected number of males is n × (100% − 40%), or n × 60%.

Probability can also be used to consider continuous variables (not just conditions or events as noted above). It can indicate the likelihood of a value in a particular range. For example, if 5% of men at the factory have a height over 180 cm, the probability that a randomly selected man will have a height over 180 cm is 5%.

Probability distributions represent the probability of the different outcomes (e.g. male, female) for a sample selection. The relationship between the values of a variable and the probabilities of their occurrence can be summarized in a probability distribution.

If we select a single worker from this factory, the probability distribution for the possible outcomes for gender is simple.
Possible outcome     Probability
Male                 0.60
Female               0.40

If we select three workers, then the probability distribution becomes more complicated.

Possible outcomes    Probability
All male             0.216 = (0.60 × 0.60 × 0.60)
2 male, 1 female     0.432 = 3 × (0.60 × 0.60 × 0.40)
2 female, 1 male     0.288 = 3 × (0.40 × 0.40 × 0.60)
All female           0.064 = (0.40 × 0.40 × 0.40)

The factor of 3 in the mixed outcomes counts the three possible orders in which the one worker of the other gender can be selected.

Overview
A probability distribution is a distribution of data based on the likelihood that an event or indicator will occur in a sample of the population. Knowledge of the probability distribution of a variable allows us to draw conclusions about a population based on data taken from a sample of that population.
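The three-worker probabilities can be reproduced with the binomial formula, which multiplies the probability of one particular ordering by the number of orderings that give the same outcome. The sketch below is a minimal Python illustration; it assumes that each selection is independent with the same 40% chance of choosing a female worker (reasonable for a large factory), and the function name `prob_females` is illustrative.

```python
from math import comb

P_FEMALE = 0.40   # proportion of female workers, from the text
N_WORKERS = 3     # number of workers selected

def prob_females(k, n=N_WORKERS, p=P_FEMALE):
    """Probability of exactly k females among n independently selected workers."""
    # comb(n, k) counts the orderings; p**k * (1-p)**(n-k) is one ordering
    return comb(n, k) * p**k * (1 - p) ** (n - k)

distribution = {k: prob_females(k) for k in range(N_WORKERS + 1)}

for k, prob in distribution.items():
    print(f"{k} female, {N_WORKERS - k} male: {prob:.3f}")
```

Because the four outcomes cover every possibility, their probabilities add up to 1.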
There are several model or theoretical probability distributions that will allow us to determine the probability of a given value for a random variable even if we do not have (or know) the full probability distribution for that variable. These probability distributions are given or calculated by mathematical formulae called probability functions. We can apply the model to create a probability density curve where the height of the curve reflects the frequency of the individual values and the area in an interval under the curve reflects the proportion of a population in that interval. This is also a probability distribution.

Examples of probability and other distributions include the normal, binomial, Poisson, Chi-square, F, and t distributions. For the sake of simplicity, the only distribution we will cover in this workbook is the normal distribution.

Related Concepts
• Normal Distribution
Normal Distribution

The normal distribution is the most famous and important of the theoretical probability distributions for two main reasons. First, for many variables we encounter in the health field (e.g. height, blood pressure, hemoglobin level, etc.), it is a good description of the distribution of the variable. Second, and more importantly, the normal distribution has a central role in statistical analysis as it is used as the probability distribution of the sample means. Calculations based on the normal distribution are used to derive confidence intervals and determine p values for quantitative data, proportions, and rates.

Characteristics of a normal distribution:
• It is specified by two parameters: the population mean and the standard deviation.
• It is symmetrical around the mean, bell-shaped, and unimodal. This is why the normal curve is frequently referred to as the bell curve.
• The mean, median, and mode are all in the middle of the curve.
• The total area under the curve above the x axis is one square unit, with 50% of the area to the right of the mean and 50% to the left of the mean.

According to the Empirical Rule:
• The area bounded by one standard deviation to the right and one standard deviation to the left of the mean represents approximately 68% of the values.
• The area bounded by two standard deviations to the right and two to the left represents approximately 95% of the values.
• Approximately 99.7% of the values will be within three standard deviations of the mean.

This is demonstrated in the graph on the next page:

Overview
The normal distribution is a bell-shaped curve with both the mean and the median at the center of the curve.
The standard normal distribution is a distribution of data with a mean of zero and a standard deviation of one. It allows different populations to be compared to each other.
Formula: The formula below is used to calculate the standard score, or the z score, when comparing normally distributed populations.

$$z = \frac{x - \mu}{\sigma}$$
Knowing the mean and standard deviation of a normal distribution allows one to determine the following values:
• The proportion of individuals who fall into any range of values
• The percentile at which a given value falls
• The value which corresponds to a given percentile

Below is a frequency distribution of the height of men in the US population, characterized by a normal distribution with a mean of 171.5 cm and a standard deviation of 6.5 cm.

[Figure: normal curve of male height, centered at μ = 171.5 cm]

Given that the mean height of men in the US is 171.5 cm (μ = 171.5 cm) and the standard deviation is 6.5 cm (σ = 6.5 cm), and using our knowledge of the normal curve, we know the following information:
• 68.3% of men are between 165 and 178 cm (μ ± 1σ = 171.5 ± 6.5)
• 95.5% of men are between 158.5 and 184.5 cm (μ ± 2σ = 171.5 ± 2 × 6.5)
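These proportions can be checked numerically. The following is a minimal sketch using `NormalDist` from Python's standard `statistics` module, with the mean and standard deviation quoted in the text; the helper name `proportion_between` is illustrative.

```python
from statistics import NormalDist

MU, SIGMA = 171.5, 6.5                 # mean and SD of male height (cm), from the text
heights = NormalDist(mu=MU, sigma=SIGMA)

def proportion_between(dist, low, high):
    """Area under the normal curve between low and high."""
    return dist.cdf(high) - dist.cdf(low)

within_1sd = proportion_between(heights, MU - 1 * SIGMA, MU + 1 * SIGMA)
within_2sd = proportion_between(heights, MU - 2 * SIGMA, MU + 2 * SIGMA)
within_3sd = proportion_between(heights, MU - 3 * SIGMA, MU + 3 * SIGMA)

print(f"within 1 SD (165.0 to 178.0 cm): {within_1sd:.1%}")
print(f"within 2 SD (158.5 to 184.5 cm): {within_2sd:.1%}")
print(f"within 3 SD (152.0 to 191.0 cm): {within_3sd:.1%}")
```

The same helper works for any range of heights, not only whole multiples of the standard deviation.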
What if we want to know specific information such as:
• What proportion of men are over 180 cm?
• What height value is at the 10th percentile?

Statisticians have devised a method to transform all normal distributions so that they use the same scale. This is known as the standard normal distribution.

The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. A normal distribution can be compared with other normal distributions by converting it to a standard normal distribution using the formula shown below. The standard normal distribution specifies how far an individual value is from the mean in units of the standard deviation, which allows us to calculate a standard score. The standard score is a way of expressing an individual value in terms of standard deviation units. The standard score, referred to as the z score, is calculated as (observed value − mean) divided by the standard deviation. The formula is below:

$$z = \frac{x - \mu}{\sigma}$$

The z score will also be referred to as a test statistic. Each distribution has a corresponding test statistic. The z score corresponds with the standard normal distribution.
Example: Using the Standard Normal Distribution

Given a normal distribution of male heights with μ = 171.5 cm and σ = 6.5 cm, what is the proportion of men taller than 180 cm?

$$z = \frac{x - \mu}{\sigma} = \frac{180 - 171.5}{6.5} = \frac{8.5}{6.5} = 1.31$$

Now that we know the z score, we must find the area of the standard normal curve above 1.31.

[Figure: standard normal curve with 0 and 1.31 marked on the horizontal axis]

In order to find the area of the curve that is represented by the z score, 1.31, we must refer to the standard normal z distribution table located in Appendix 2.

On the Standard Normal z Table, locate the z score 1.31. Under the column labeled z, find the value 1.3. The row labeled z will provide you with the hundredths place of your z score, so follow it over until 0.01. If you place one finger on 1.3 and one finger on 0.01 and follow those paths until your two fingers meet, you find the value 0.9049. Use the excerpt from the Standard Normal z Table on the following page to help you locate the z score.

This table will give us the area of the curve located to the LEFT of the z score. As you can see by the diagram, we want to find the area of the curve located to the RIGHT of the z score. To find the area to the right of the z score, we subtract 0.9049 from 1.

1 − 0.9049 = 0.0951

Therefore, approximately 9.5% (0.0951 × 100%) of the curve is above 180 cm (or more than 1.31 SD above the mean). We can also say that men whose heights are 180 cm and above are taller than 90.5% of American men. Thus, a height of 180 cm represents approximately the 90th percentile.
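The table lookup above can be mirrored in code. This minimal Python sketch uses `NormalDist` from the standard `statistics` module in place of the printed z table; the z score is rounded to two decimals to match the way the table is read, and the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 171.5, 6.5    # population mean and SD (cm), from the example
x = 180.0                 # height of interest (cm)

z = (x - mu) / sigma                  # standard score
z_rounded = round(z, 2)               # the printed table is read to two decimals

standard_normal = NormalDist()        # mean 0, standard deviation 1
area_left = standard_normal.cdf(z_rounded)   # area to the LEFT of z (table value)
area_right = 1 - area_left                   # area to the RIGHT of z

print(f"z = {z_rounded}")
print(f"area to the left  = {area_left:.4f}")
print(f"area to the right = {area_right:.4f}")
```

Using the unrounded z score instead changes the result only in the third or fourth decimal place.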
To practice using the table for the standard normal distribution, answer the following question.

What height value is at the 10th percentile? We manipulate the formula to solve for x rather than z:

$$x = \mu + (z \times \sigma)$$

where:
x is the observed value
μ is the population mean (given)
σ is the population standard deviation (given)
z comes from the standard normal distribution

To find the answer to this problem, first look up the z score from the table in Appendix 2 which corresponds to the lowest 10% of the area beneath the curve. This area will be on the left-hand side of the curve. Do this by reversing the steps we previously used to find the area.
Locate the area closest to 0.10 in the z table. Then follow the row and column to identify the z score that it is associated with. You should find a z score of −1.28.

$$x = \mu + (z \times \sigma)$$

$$x = 171.5 + (-1.28 \times 6.5)$$

$$x = 171.5 - 8.32 = 163.18$$

The 10th percentile is approximately 163.2 cm. This means that 10% of American men are 163.2 cm or shorter and 90% of American men are taller than 163.2 cm.
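The reverse lookup can also be done in software. The sketch below is a minimal Python illustration using `NormalDist.inv_cdf` from the standard `statistics` module. Software returns an unrounded z score (about −1.2816), so the result may differ in the last digit from a hand calculation that uses a z value read from the printed table.

```python
from statistics import NormalDist

mu, sigma = 171.5, 6.5    # population mean and SD (cm), from the example
percentile = 0.10         # lowest 10% of the area under the curve

# z score with 10% of the standard normal curve to its left
z_10 = NormalDist().inv_cdf(percentile)

# Convert the z score back to a height: x = mu + z * sigma
height_10 = mu + z_10 * sigma

# Equivalent shortcut: ask the height distribution for its 10th percentile
height_10_direct = NormalDist(mu, sigma).inv_cdf(percentile)

print(f"z = {z_10:.4f}")
print(f"10th percentile height = {height_10:.2f} cm")
```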
Practice: Using the Standard Normal Distribution

You have attended an HIV/AIDS training where a pretest and a posttest were given in order to measure knowledge gained. Pretest scores are included in the table below. Use the table to answer the following questions.

Pretest Scores: HIV Knowledge
         Females   Males
Mean     60        40
SD       12        10
N        138       97

1. If a male gets a score of 70, what is his z score?
2. What is the z score for a female with a score of 35?
3. What score for females is equivalent to a male's score of 78?

Related Concepts
• Central Limit Theorem
Central Limit Theorem

Not all data is normally distributed. Data that is not normally distributed requires different tests in order to properly analyze and compare it. Fortunately, if we have an adequately large sample size (n > 30), the sampling distribution of the mean tends to approach normality and we are able to treat it as normal. This concept is known as the Central Limit Theorem.

Just as we calculated the standard deviation for a distribution of individual values around a mean, we now can calculate a similar measure of variability for a series of samples from the population. This is the Standard Error of the statistic and measures the precision of the statistic (mean or proportion) as an estimate of the population mean or population proportion. It indicates the degree to which a sample statistic reflects the true population value.

The standard error is the basis for calculating confidence intervals and conducting hypothesis tests for means and proportions. This allows us to make generalizations about a larger group of individuals based on a subset or sample. As you know, most epidemiologic studies are carried out with the aim of learning about a characteristic in a target population. It is rarely feasible to study every individual. Therefore, we usually compare exposures or disease within a sample of the population. A major role of statistics is to allow us to generalize results from a sample to the larger group and understand how accurately that generalization reflects the actual population mean (or proportion).

Overview
The sampling distribution of sample statistics (mean or proportion) will look normally distributed for large sample sizes.
Simply, if the sample size is large (typically n > 30), the distribution of sample means or sample proportions approximates a normal distribution.
Formula:

$$SE = \frac{s}{\sqrt{n}}$$

Thus, the standard error becomes smaller as n gets bigger, meaning that the larger the sample size, the closer the sample mean, $\bar{x}$, is likely to be to the population mean, μ.
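A small simulation can make the Central Limit Theorem more concrete. The following minimal Python sketch draws many samples from a deliberately skewed population and looks at the resulting sample means. The exponential population, the sample size of 40, the number of samples, and the random seed are all illustrative assumptions and do not come from the workbook.

```python
import math
import random
import statistics

random.seed(2007)      # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 40       # n > 30, as suggested in the text
NUM_SAMPLES = 5000     # number of repeated samples

def draw_sample_mean(n):
    """Mean of n draws from a right-skewed population (mean 1, SD 1)."""
    sample = [random.expovariate(1.0) for _ in range(n)]
    return statistics.fmean(sample)

sample_means = [draw_sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)   # should be close to 1
sd_of_means = statistics.stdev(sample_means)     # observed standard error
theoretical_se = 1.0 / math.sqrt(SAMPLE_SIZE)    # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"SD of sample means:   {sd_of_means:.3f}")
print(f"sigma / sqrt(n):      {theoretical_se:.3f}")
```

Increasing `SAMPLE_SIZE` shrinks the spread of the sample means, which is the behavior the standard error formula describes.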
Standard Deviation vs. Standard Error
• Both are measures of variation in a dataset.
• Standard deviation is a measure of variation of individual observations from the mean in a set of data.
• Standard error of the mean measures the standard deviation of the sample means.

Related Concepts
• Statistical Inference
Statistical Inference

For individual values we use the z score to tell us how far an individual value is from the mean of the sample. Any sample will have an element of random error, meaning that by chance it may not look exactly like the population from which it was drawn. Inferential statistics allows us to quantify the amount of random error.

The steps for conducting inferential statistical tests are similar for each test:
1. State the null and alternative hypotheses.
2. Determine the decision rule.
3. Conduct the appropriate test.
4. Interpret the results.

1. State the null and alternative hypotheses

Hypotheses are formulated based on proving or disproving the status quo, or what we currently regard to be true. Each time we test a new idea, we are in actuality comparing it to our old idea of what is already known. For example, if we know chloroquine to be an effective malaria drug, then when we test the effectiveness of a new drug such as sulfadoxine-pyrimethamine, we use the old drug, chloroquine, as the baseline. Thus, our expectation is that chloroquine works and there will be no difference found by using the new drug. This becomes the null hypothesis, or $H_0$. The alternate hypothesis ($H_A$), often referred to as the research hypothesis, then represents the possibility that there is a difference between the new drug and the old drug. As we know, a difference can be either higher or lower, better or worse. If we are testing for any difference, we will use a two-tailed test. If we are testing for a difference in one specified direction, we use a one-tailed test. Using the same level of significance (alpha value), a two-tailed test is more stringent than a one-tailed test.
2. Determine the decision rule

An alpha value (α) determines the level of significance at which you will conduct your test. This value is chosen by the researcher. The most common alpha value, and one which is considered an acceptable level of significance by researchers worldwide, is 0.05, or 5 percent. You will also see an alpha value of 0.10, but anything above that is generally considered to be too lenient to account for differences beyond those which are random or coincidental occurrences.

We can generally determine the results of hypothesis testing in three ways: 1) by comparing a calculated value ($t_{calc}$) to a critical value ($t_{crit}$), 2) by comparing the alpha value to a p value, and 3) by determining if the value specified in the null hypothesis is contained within the limits of a confidence interval. The calculated value is also referred to as the test statistic and is calculated through the use of descriptive statistics for the sample. A critical value is identified by using the correct table. An alpha value, as previously discussed, is specified by the researcher and will be given. The p value corresponds to the value of the computed test statistic and can be found in some tables, or determined using a statistical software package.

When the value of the computed test statistic exceeds the critical value (i.e., $t_{calc} > t_{crit}$), we can reject the null hypothesis. When α > p, we can also reject the null hypothesis. Lastly, if the value specified in the null hypothesis is not contained within the limits of our confidence interval, we can once again reject the null hypothesis. Note that when we are not able to reject the null, we use the phrase "fail to reject the null." We never accept the null. We only reject it or fail to reject it. By rejecting the null, we conclude that the data supports our alternative hypothesis.
3. Conduct the appropriate test

There are several different test statistics that you must choose from when testing for statistical significance. The test statistic you will use depends on the known parameters of the variable. If a population standard deviation (σ) is known, then we use the z test. With the exception of tests of proportion or very small populations, we will generally know only the standard deviation of a sample (s), in which case we use the t test. Therefore, when talking about statistical tests in general, we are referring to the t distribution. The t distribution looks very similar to the normal z distribution, but the tails on either side of the curve are longer.

Let us now revisit the general formula for the construction of a test statistic:

$$\text{test statistic} = \frac{\text{sample statistic} - \text{hypothesized population parameter}}{\text{standard error of the relevant sample statistic}}$$

For continuous data analyzed using the two-sample t test, the numerator compares the difference between the two sample means, $(\bar{x}_1 - \bar{x}_2)$, referred to as the sample statistic or point estimate here, with the difference that would be expected under a true null hypothesis (i.e., $H_0: \mu_1 - \mu_2 = 0$), referred to as the hypothesized population parameter, which often equals zero. The denominator is made up by the standard error, which serves as our measure of variability.
4. Interpret the results

The distribution tables that you will need in order to interpret results when conducting tests by hand are included at the end of this workbook. They include the Student's t table, the standard normal z distribution, and the chi-square distribution tables. Tables needed to complete the exercises presented in this workbook are included in Appendix 2.
Confidence Interval Around a Mean

The sample mean ($\bar{x}$) estimates the population mean (μ) but supplies no information on the variability or our confidence in the estimate. For this reason, we use confidence intervals.

The interval estimate makes use of the Central Limit Theorem and the z score. We first determine how confident we want to be in our estimate. The most common level of confidence is 95%. As we learned with the Empirical Rule, a feature of the normal curve is that 95% of the values will be within two standard deviations of the mean. This value of 2 is rounded up from the more exact value of 1.96. Thus the probability (P) that z falls between −1.96 and +1.96 is 0.95, or 95%.

$$P(-1.96 \le z \le +1.96) = 0.95$$

If we substitute our formula, $\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$, for z, we get

$$P\left(-1.96 \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le +1.96\right) = 0.95$$

After some algebra, we end up with the formula for the 95% confidence interval around the mean as:

$$P\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

In words, the probability that the population mean lies within plus or minus 1.96 standard errors of the sample mean is 95%. The multiplier 1.96 was chosen from the standard z table with an alpha of 0.05. If, for example, we wanted to calculate a 99% confidence interval, we would use the z score that corresponds with an alpha of 0.01. (Note that it is the standard error of the mean that we are multiplying by the z score.)

Thus, the 95% confidence interval is:

$$\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$

Overview
The confidence interval of the mean gives the range of plausible values for the true population mean.
95% of the time, the population mean will be within approximately two standard errors of the sample mean.
Formula:

$$95\%\ \mathrm{CI} = \left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$
Step by Step Example: Confidence Interval Around a Mean

You want to determine the mean blood pressure among government employees. In order to do this, you measure the blood pressure of 200 employees. Use the descriptive statistics below to determine a 95% confidence interval around the mean.

n = 200
$\bar{x}$ = 127 mmHg
s = 13

Step 1. Calculate the standard error of the mean.

$$SE = \frac{s}{\sqrt{n}} = \frac{13}{\sqrt{200}} = 0.92$$

Step 2. Find the lower limit of the 95% confidence interval.

$$95\%\ LL = \bar{x} - 1.96(SE) = 127 - 1.96(0.92) = 127 - 1.80 = 125.2$$

Step 3. Find the upper limit of the 95% confidence interval.

$$95\%\ UL = \bar{x} + 1.96(SE) = 127 + 1.96(0.92) = 127 + 1.80 = 128.8$$

Step 4. Interpret the 95% confidence interval.
   The 95% confidence interval is (125.2, 128.8). This means that with repeated random sampling, 95% of the intervals constructed in this way will contain the true population mean. We are, therefore, 95% confident that this is one of those intervals and the true mean of the population (μ) is between 125.2 and 128.8.
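The four steps can be combined into one short function. This is a minimal Python sketch using the normal (z) multiplier, as in the hand calculation; the function name `mean_confidence_interval` and its parameters are illustrative. Tools that use the t distribution instead will give a very slightly wider interval for the same data.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, s, n, confidence=0.95):
    """Return (lower, upper) limits of a z-based confidence interval for a mean."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when confidence is 0.95
    se = s / math.sqrt(n)                     # standard error of the mean
    margin = z * se                           # half-width of the interval
    return sample_mean - margin, sample_mean + margin

# Blood pressure example: n = 200, mean = 127 mmHg, s = 13
lower, upper = mean_confidence_interval(127.0, 13.0, 200)

print(f"95% CI: ({lower:.1f}, {upper:.1f})")
```

Passing `confidence=0.99` gives the wider interval that corresponds to an alpha of 0.01.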
Practice: Confidence Interval Around a Mean

You record gestational age at birth for live births in the past month at three primary health facilities in the region. Calculate a 95% confidence interval around the mean.

n = 350
$\bar{x}$ = 37.5 weeks
s = 12.2

Step 1. Calculate the standard error of the mean.

$$SE = \frac{s}{\sqrt{n}}$$

Step 2. Find the lower limit of the 95% confidence interval.

$$95\%\ LL = \bar{x} - 1.96(SE)$$

Step 3. Find the upper limit of the 95% confidence interval.

$$95\%\ UL = \bar{x} + 1.96(SE)$$

Step 4. Interpret the 95% confidence interval.
OpenEpi Example: Confidence Interval Around a Mean
Using the same blood pressure data as before, use OpenEpi to calculate a 95%
confidence interval around the mean.
n = 200
$\bar{x}$ = 127 mmHg
s = 13
Step    Example
1. Open the OpenEpi application.
   From the OpenEpi menu choose Mean CI under the heading Continuous Variables.
2. Enter the descriptive statistics as prompted.
   Click on Enter New Data.
   The data entry screen will open.
   Use the given information to fill in the boxes.
   Notice that you only need to provide one of the standard deviation, the
   standard error, or the variance. You do not need to provide all three.
   Since the standard deviation is given, this is the statistic that we will use.
   Because our population is large and unknown, we can use the default number,
   999999999, to represent the population size. If you have a known
   population, specify that number here.
3. Calculate the 95% confidence interval.
   Click on the button labeled Calculate.
   A popup will open displaying the results of the calculation. Note that you
   must set your browser to allow popups in order to view the results.
4. Interpret the results.
   Choose the 95% confidence interval corresponding with the t test, since we
   do not know the variance of the population, only the standard deviation of
   the sample.
   The 95% confidence interval is (125.2, 128.8).
   With repeated random sampling, 95% of the intervals constructed this way
   will contain the true population mean. We are, therefore, 95% confident
   that this is one of those intervals and the true mean of the population
   ($\mu$) is between 125.2 and 128.8.
Excel Example: Confidence Interval Around a Mean
We can find a confidence interval around a mean using descriptive statistics in
Excel as well. Use the same blood pressure data that we used in the previous
example.
Step    Example
1. Select the confidence interval function in Excel.
   In a blank worksheet, choose Insert from the toolbar. From the dropdown
   menu, select Function.
   Type "confidence interval" in the box labeled Search for a function. The
   function for confidence intervals, CONFIDENCE, will appear as your only
   option.
   Alternatively, you can scroll down the list of functions until you find the
   one labeled CONFIDENCE.
   Click on OK.
2. Enter the descriptive statistics.
   You will be prompted to enter the alpha, standard deviation, and sample
   size. Since we are calculating a 95% confidence interval,
   $\alpha = 1.00 - 0.95$ and is therefore 0.05.
   Click on OK.
   The result will then be displayed on the worksheet in the cell marked by
   your cursor.
   The result is the margin of error, the equivalent of z(SE).
3. Calculate the 95% confidence interval.
   Therefore, we can calculate the 95% confidence interval by subtracting
   1.80 from, and adding 1.80 to, our sample mean of 127.
   95% LL = 127 - 1.80 = 125.2
   95% UL = 127 + 1.80 = 128.8
4. Interpret your results.
   The 95% confidence interval is (125.2, 128.8). This means that with
   repeated random sampling, 95% of the intervals constructed this way will
   contain the true population mean. We are, therefore, 95% confident that
   this is one of those intervals and the true mean of the population
   ($\mu$) is between 125.2 and 128.8.
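To see what the CONFIDENCE result represents, the sketch below reproduces the same margin of error outside Excel. It assumes, as stated in step 2, that the value is z multiplied by the standard error; `confidence_margin` is a hypothetical name chosen for this illustration, and `statistics.NormalDist` comes from the Python standard library (version 3.8 or later).

```python
import math
from statistics import NormalDist


def confidence_margin(alpha, sd, n):
    """Return z * (sd / sqrt(n)) for a two-sided interval.

    Hypothetical helper: the z value is looked up from the standard
    normal distribution at probability 1 - alpha/2.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    return z * sd / math.sqrt(n)


# Blood pressure example: alpha = 0.05, s = 13, n = 200
margin = confidence_margin(0.05, 13, 200)
print(round(margin, 2))
print(round(127 - margin, 1), round(127 + margin, 1))
```

The margin is subtracted from and added to the sample mean, exactly as in step 3.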
You can also use Excel to find the confidence interval around the mean if you are
given a dataset instead of descriptive statistics.
Excel Example: Confidence Interval Around a Mean
For this example, we will use the dataset Sit/Lie. Calculate a 95% confidence
interval around the mean for the variable Sitting.
Step    Example
1. Import the dataset into Excel.
   Import the dataset, two sample t, by using the directions in the box below.
   To open a dataset in Excel:
   o Choose the heading Data from the toolbar.
   o Click on Import External Data.
   o Click on Import Data.
   o Open the folder where you have stored the database.
   o Choose the table that you will be working from.
   o Click OK.
   o Choose where you would like to put the data by selecting a cell of the
     current worksheet or selecting a new worksheet. Click OK.
2. Calculate the 95% confidence interval using Excel.
   Choose Tools from the toolbar. Select Data Analysis from the dropdown box.
   Highlight Descriptive Statistics and click OK. The Descriptive Statistics
   dialogue box will open.
   Click on the chart icon next to the text box marked Input Range.
   Highlight the column for the variable Sitting by clicking on the letter
   which corresponds with the column.
   Click on the chart icon in the box labeled Descriptive Statistics to return
   to the dialogue box.
   Check the box next to Labels in First Row.
   Next, choose your output options. A new worksheet is chosen as the default,
   but if you would like your output to appear on the same worksheet as your
   dataset, select the first option under Output options, Output Range. Click
   on the icon next to the text box. Choose the area where you would like your
   output to appear by clicking on a cell. Click on the icon again to return
   to the dialogue box.
   Check the boxes next to Summary statistics and Confidence level for Mean.
   Click OK.
3. Use the output to calculate the confidence interval.
   Notice that the output does not actually provide you with a confidence
   interval. Instead, the row labeled Confidence Level gives a number which
   represents the distance from the mean to each limit. To find the
   confidence interval around the mean, subtract this number from, and add
   this number to, the mean.
   95% CI = $\bar{x} \pm$ confidence level
   = 80.95 - 14.13, 80.95 + 14.13
   = 66.82, 95.08
4. Interpret the results.
   The 95% confidence interval around the mean is (66.82, 95.08). With
   repeated random sampling, 95% of the intervals constructed this way will
   contain the true population mean. We are, therefore, 95% confident that
   this is one of those intervals and the true mean of the population
   ($\mu$) is between 66.82 and 95.08.
Excel or OpenEpi Practice: Confidence Interval Around a Mean
Using the data from the HIV Knowledge pretest, calculate the 95% confidence
interval around the mean score for females in either Excel or OpenEpi.
Pretest Scores: HIV Knowledge
         Females   Males
Mean     60        40
SD       12        10
N        138       97
For additional practice, calculate the 95% confidence interval around the mean
score for males by using the computer application that you did not previously
use.
Step    Practice Space
1. Open the appropriate application.
2. Enter the descriptive statistics.
3. Calculate the 95% confidence interval.
4. Interpret your results.
Related Concepts
Confidence Interval Around a Proportion
Confidence Interval: Two Sample t Test
Confidence Interval Around a Proportion
The Central Limit Theorem also applies when considering a distribution of
sample proportions, when the sample size is large enough. The sampling
distribution would be constructed similarly as for the mean. However, the
characteristics of the sampling distribution will be different, as this is a
binomial distribution. We will be estimating the population proportion rather
than the population mean. Since the binomial distribution is a sampling
distribution for p, its mean equals the population proportion and its standard
deviation represents the standard error (SE).
n = sample size or number of trials
p = probability of success
1 - p = probability of failure
SE of the proportion = $\sqrt{\frac{p(1-p)}{n}}$
o As the sample size, n, increases, the binomial distribution becomes very
  close to a normal distribution due to the central limit theorem.
o Therefore, the normal distribution can be used to calculate confidence
  intervals and do hypothesis tests.
o If np and n(1 - p) are both equal to 10 or more, then the normal
  approximation may be used.
Similar to the method used to calculate a confidence interval around a mean, to
calculate the 95% confidence interval around a proportion, we first calculate the
standard error of the proportion and then use the same formula:
$95\%\ CI = p \pm 1.96\sqrt{\frac{p(1-p)}{n}}$
Overview
o The confidence interval around a proportion gives the range of plausible
  values for the true population proportion.
o 95% of the time, the population proportion will be within approximately
  two standard errors of the sample proportion.
o Formula:
  $95\%\ CI = \left(p - 1.96\sqrt{\frac{p(1-p)}{n}},\ p + 1.96\sqrt{\frac{p(1-p)}{n}}\right)$
Step by Step Example: Confidence Interval Around a Proportion
Out of 212 pregnant women tested for HIV, 53 had positive results. Use this
information to find a 95% confidence interval for the population proportion.
Step    Example
1. Identify p and 1 - p.
   p, the proportion of successes = $\frac{53}{212} = 0.25$
   1 - p, the proportion of failures = 1 - 0.25 = 0.75
2. Calculate the 95% lower limit.
   $95\%\ LL = p - 1.96\sqrt{\frac{p(1-p)}{n}}$
   $95\%\ LL = 0.25 - 1.96\sqrt{\frac{0.25(0.75)}{212}}$
   $= 0.25 - 1.96\sqrt{\frac{0.1875}{212}}$
   $= 0.25 - 1.96\sqrt{0.00088}$
   $= 0.25 - (1.96 \times 0.0297)$
   $= 0.25 - 0.0583 = 0.1917$
3. Calculate the 95% upper limit.
   $95\%\ UL = p + 1.96\sqrt{\frac{p(1-p)}{n}}$
   $95\%\ UL = 0.25 + 1.96\sqrt{\frac{0.25(0.75)}{212}}$
   $= 0.25 + 0.0583$
   $= 0.3083$
4. Interpret the interval.
   The 95% confidence interval is (0.19, 0.31). With repeated random
   sampling, 95% of intervals calculated will contain the true proportion of
   the population. We are 95% confident that this is one of those intervals
   and the prevalence of HIV in the population is between 19% and 31%.
Note: You will see (1 - p) referred to as q later in this workbook, as well as in
many biostatistics texts.
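The same steps can be sketched in a few lines of code. This is a minimal illustration under the workbook's normal approximation; the function name `proportion_confidence_interval` is an assumption for this example, and the check that np and n(1 - p) are at least 10 follows the rule stated earlier in this section.

```python
import math


def proportion_confidence_interval(successes, n, z=1.96):
    """Return (p, lower limit, upper limit) using the normal approximation.

    Hypothetical helper for illustration only.
    """
    p = successes / n
    q = 1 - p
    # Rule from this section: np and n(1 - p) should both be 10 or more
    if n * p < 10 or n * q < 10:
        raise ValueError("normal approximation may not be appropriate")
    se = math.sqrt(p * q / n)   # standard error of the proportion
    return p, p - z * se, p + z * se


# HIV example: 53 positive results out of 212 women tested
p, lower, upper = proportion_confidence_interval(53, 212)
print(round(p, 2), round(lower, 2), round(upper, 2))
```

Raising an error when the sample is too small is a design choice for this sketch; an exact method would be the usual alternative in that situation.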
Practice: Confidence Interval Around a Proportion
Upon testing 250 confirmed AIDS cases, you find that 116 are positive for
tuberculosis. Find the 95% confidence interval around the proportion of AIDS
patients infected with TB.
Step    Practice Space
1. Identify p and 1 - p.
2. Calculate the 95% lower limit.
   $95\%\ LL = p - 1.96\sqrt{\frac{p(1-p)}{n}}$
3. Calculate the 95% upper limit.
   $95\%\ UL = p + 1.96\sqrt{\frac{p(1-p)}{n}}$
4. Interpret the interval.
OpenEpi Example: Confidence Interval Around a Proportion
Using the previous example, we will demonstrate how to calculate a 95%
confidence interval around a proportion. Out of 212 pregnant women tested for
HIV, 53 had positive results. Use this information to find a 95% confidence
interval for the population proportion in OpenEpi.
Step    Example
1. Open the OpenEpi application.
   From the OpenEpi menu choose Proportion under the heading Counts.
2. Enter the proportion data as prompted.
   Click on Enter New Data.
   The data entry screen will open.
   Use the given information to fill in the boxes. The numerator will always
   consist of the number of successes (here, 53). The denominator is the size
   of the population or sample (here, 212).
3. Calculate the 95% confidence interval.
   Click on the button labeled Calculate.
   A popup will open displaying the results of the calculation. Note that you
   must set your browser to allow popups in order to view the results.
4. Interpret the results.
   OpenEpi calculates the 95% confidence interval by using several different
   methods. Though the editors recommend looking at the Mid-P Exact first, it
   is the Wald (Normal Approximation) that corresponds most closely with our
   hand calculations.
   The 95% confidence interval is (0.19, 0.31). With repeated random
   sampling, 95% of intervals calculated will contain the true proportion of
   the population. We are 95% confident that this is one of those intervals
   and the prevalence of HIV in the population is between 19% and 31%.
OpenEpi Practice: Confidence Interval Around a Proportion
There has been a meningitis outbreak. You find that in one school, three
students out of an enrolled 400 have been infected with meningitis. Use OpenEpi
to calculate a 95% confidence interval.
Step    Practice Space
1. Open the OpenEpi application.
2. Enter the proportion data as prompted.
3. Calculate the 95% confidence interval.
4. Interpret the results.
Related Concepts
Confidence Interval: z test of Proportions
Hypothesis Testing: Two Sample t test
Used for continuous data, the t test is one of the most commonly used statistical
tests performed in the public health and clinical literature.
Overview
Test employed to evaluate the null hypothesis ($H_0$) that the population means
are equal versus the alternative hypothesis ($H_a$) that the population means
are different. This test is used to compare the means of two independent
samples.
Example: Comparing the difference in mean blood pressure for a sample of
refugees to that of a sample of host country residents.
Formula:
$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}}$
Where:
$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
and is referred to as the pooled variance.
Assumptions:
o Two independent random samples
o Normally distributed population
o Equal, but unknown variances in the two populations
(Note: There is a method to compare two samples with unequal variances called
Satterthwaite's method. Please refer to a biostatistics text for further
explanation.)
Type of variables: Continuous
Decision rule: If the calculated value of t ($t_{calc}$) is greater than the
critical value of t ($t_{crit}$), then we can reject the null hypothesis.
Table used: Student's t table
Hypothesis testing using the t test allows us to determine whether the observed
difference between the mean values of two groups is statistically significant.
A vital component used in the calculation of the standard error for the two sample
t test is the pooled variance, denoted $s_p^2$. As indicated above, a major
assumption necessary for the validity of the two sample t test is that the
variances are unknown, but assumed to be equal. We can justify this assumption
by dividing the variance of one sample by the variance of the second sample
($s_1^2 / s_2^2$). If $s_1^2 / s_2^2$ is less than three, assume that the
variances are approximately equal. The closer this value is to one, the more
nearly equal the variances are. When this assumption is justified, a pooled
estimate of the common variance ($s_p^2$) can be calculated, which estimates the
overall variance of the entire study population.
The pooled estimate is obtained by computing the weighted average of the two
sample variances. The sample variances ($s_1^2$ and $s_2^2$) are weighted
according to the number of observations in each. If the sample sizes are equal
($n_1 = n_2$), this weighted average is the mean of the two sample variances. If
the two groups are of unequal size ($n_1 \neq n_2$), the pooled variance is
calculated as follows:
$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
Our test statistic follows the Student's t distribution with $n_1 + n_2 - 2$
degrees of freedom.
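The variance check and the pooled estimate described above can be sketched as follows. The function names and the sample values are hypothetical, chosen only for illustration; dividing the larger variance by the smaller one is an assumption of this sketch so that the ratio is always at least one.

```python
def variance_ratio(s1, s2):
    """Ratio of the larger sample variance to the smaller one (assumed order)."""
    v1, v2 = s1 ** 2, s2 ** 2
    return max(v1, v2) / min(v1, v2)


def pooled_variance(n1, s1, n2, s2):
    """Weighted average of two sample variances, weights (n - 1)."""
    return ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)


# Hypothetical samples: n1 = 10 with s1 = 4, n2 = 20 with s2 = 5
ratio = variance_ratio(4, 5)
if ratio < 3:                       # rule of thumb from the text above
    sp2 = pooled_variance(10, 4, 20, 5)
    df = 10 + 20 - 2                # degrees of freedom for the t test
    print(ratio, round(sp2, 3), df)

# With equal sample sizes the pooled variance is the plain average
print(pooled_variance(10, 4, 10, 5))
```

The last line illustrates the remark that equal sample sizes reduce the weighted average to the mean of the two sample variances.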
Step by Step Example: Hypothesis Testing Two Sample t test
Can we conclude that infants born at a low income area clinic, on the average,
tend to be lighter than those born at a clinic serving a high income population
area? Within the past month, a student has collected data on birthweights
(grams) from a random sample of 80 deliveries at a high income population
serving clinic (High) and 100 deliveries at a low income population serving
clinic (Low). The relevant information is summarized below in the table. Let
alpha equal 0.05.
Clinic             n      $\bar{x}$   s
High Clinic (1)    80     2800        100
Low Clinic (2)     100    2650        82
Step    Example
1. State the null and alternative hypotheses.
   The researcher will determine if the mean value for one group is lower
   than that of the other, so a one sided test of our hypotheses is indicated.
   Our null hypothesis states that the mean birthweight of babies born at the
   high income clinic ($\mu_1$) is less than or equal to that of babies who
   are born at the low income clinic ($\mu_2$). The null hypothesis is
   written as:
   $H_0: \mu_1 \leq \mu_2$
   The alternative hypothesis states that the mean birthweight of babies born
   at the high income clinic ($\mu_1$) is greater than that of those born at
   the low income clinic ($\mu_2$), and is written as:
   $H_a: \mu_1 > \mu_2$
   Another way of stating the hypotheses is below. Here you are stating that
   the difference between the two population means ($\mu_D$) is less than or
   equal to zero (null) or the difference is greater than zero (alternative).
   $H_0: \mu_1 - \mu_2 \leq 0$
   $H_a: \mu_1 - \mu_2 > 0$
2. State the decision rule.
   Using a one sided test with an alpha value of 0.05 and
   $n_1 + n_2 - 2 = 178$ df, the critical value of the test statistic is
   1.645. We obtain this value from the Student's t table. Note that 178
   degrees of freedom is not on the table, so we approximate it by using
   infinity ($\infty$).
   Thus, we should reject $H_0$ if $t_{calc} > 1.645$.
3. Calculate the value of the test statistic.
   Computing the value of the test statistic involves several steps. The
   formula we will follow is
   $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}}$
   a. Calculate the difference in sample means.
      $\bar{x}_d = \bar{x}_1 - \bar{x}_2$
      Begin by computing the difference in sample means:
      $(\mu_1 - \mu_2)$ is assumed to be 0 because our null hypothesis states
      that there is no difference between the two populations.
      $(\bar{x}_1 - \bar{x}_2)$ is computed as: $2800 - 2650 = 150$
   b. Compute the value of the pooled variance.
      $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
      The pooled variance is calculated as:
      $s_p^2 = \frac{79(100)^2 + 99(82)^2}{178} = 8177.955$
   c. Find the value for the standard error.
      $SE = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$
      This will be the denominator of the $t_{calc}$ equation.
      Using the pooled variance calculated above, the standard error is
      computed as:
      $\sqrt{\frac{8177.955}{80} + \frac{8177.955}{100}} = 13.56$
   d. Determine the value of $t_{calc}$.
      $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}}$
      Specifically, we are taking our calculations from parts a and c and
      substituting those into our formula.
      $t_{calc} = \frac{150 - 0}{13.56} = 11.06$
4. State the statistical decision.
   We reject $H_0$ since the value of our test statistic, $t_{calc} = 11.06$,
   exceeds the t critical value of 1.645. We therefore have evidence that our
   test statistic falls in the rejection region.
5. Report the p value.
   For this test, a p value < 0.005 is obtained from the Student's t table.
   You should have expected such a small p value, since we found a large value
   for our test statistic.
6. State the practical conclusion.
   In our example, there is sufficient evidence at the 5% level of
   significance to indicate that infants born at the clinic serving a low
   income population area, on the average, tend to be lighter than those born
   at the clinic serving a high income population area.
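The arithmetic in step 3 can be reproduced with a short script. The sketch below is illustrative only; the function name `two_sample_t` is an assumption, and the critical value of 1.645 is taken from the decision rule in step 2 rather than computed.

```python
import math


def two_sample_t(n1, mean1, s1, n2, mean2, s2, null_diff=0.0):
    """Return (pooled variance, standard error, t) for a pooled t test.

    Hypothetical helper for illustration; null_diff is the difference
    in population means assumed under the null hypothesis.
    """
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 / n1 + sp2 / n2)
    t = ((mean1 - mean2) - null_diff) / se
    return sp2, se, t


# Birthweight example: High clinic (1) and Low clinic (2)
sp2, se, t_calc = two_sample_t(80, 2800, 100, 100, 2650, 82)
t_crit = 1.645                      # one sided, alpha = 0.05, from step 2
reject_null = t_calc > t_crit
print(round(sp2, 3), round(se, 2), round(t_calc, 2), reject_null)
```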
Practice: Hypothesis Testing Two Sample t test
An MPH student on an internship with the Korle Bu Teaching Hospital in Accra,
Ghana is interested in comparing the demographic characteristics of HIV positive
tuberculosis patients with those that are HIV negative. In particular, she would
like to determine whether the two populations have the same mean age. A
sample of 45 HIV positive patients has a mean age of 23.6 years and standard
deviation of 5.9 years. A sample of 28 patients who are HIV negative has a mean
age of 35.2 years and standard deviation of 17.9 years. Test the null hypothesis
that the two populations of patients have the same mean age at the 0.05 level of
significance.
Step    Practice Space
1. State the null and alternative hypotheses.
2. State the decision rule.
3. Calculate the value of the test statistic.
   a. Calculate the difference in sample means.
      $\bar{x}_d = \bar{x}_1 - \bar{x}_2$
   b. Compute the value of the pooled variance.
      $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
   c. Find the value for the standard error.
      $SE = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$
   d. Determine the value of $t_{calc}$.
      $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}}$
4. State the statistical decision.
5. Report the p value.
6. State the practical conclusion.
Epi Info Example: Hypothesis Testing Two Sample t test
We will use the birthweight data from the Step by Step Example above to conduct
a two sample t test in Epi Info. We are determining whether infants born at a low
income area clinic tend to have a lower birthweight than those born at a clinic
serving an area with a high income population. For this statistical test, we will
use a one tailed analysis since we want to know specifically whether babies born
at the clinic serving a low income population area, on the average, tend to be
lighter than those born at the clinic serving a high income population area, and
not only if the birthweights differ. Assume an $\alpha$ of 0.05.
Step    Example
1. State the null and alternative hypotheses.
   $H_0: \mu_1 \leq \mu_2$ or $\mu_1 - \mu_2 \leq 0$
   (Babies born in the high income area clinic weigh less than or equal to
   those born in a clinic serving a low income area.)
   $H_a: \mu_1 > \mu_2$ or $\mu_1 - \mu_2 > 0$
   (Babies born in the high income area clinic weigh more than those babies
   born in a clinic serving a low income area.)
2. State the decision rule.
   We will choose an alpha value of 0.05 in order to compare our results with
   the computer program to those which we previously calculated by hand.
   If $\alpha > p$, we can reject the null hypothesis.
   In addition, if we know the critical t value, then if
   $t_{calc} > t_{crit}$, we can reject the null hypothesis.
3. Execute the two sample t test.
   a. READ the database file.
      Open Epi Info and choose Analyze Data.
      Choose the table two_sample_t from the dataset Bios_Workbook_Examples.
   b. Select the MEANS command.
      Use the arrow under Means of to scroll through the variables. Choose
      Birthweight.
      Scroll down under Cross-tabulate by Value of and choose Clinic.
      Click on OK.
      Scroll down to find the descriptive statistics in the output.
4. Report the p value and/or the calculated t value.
   Our p value given in the output is 0.00.
   We have found a t statistic of 11.05, which differs only slightly from the
   t statistic calculated by hand (11.06) in the Step by Step Example. This
   could be due to rounding errors that we made in our calculations.
   Note that Epi Info uses an alpha value of 0.05 and a two tailed test as
   defaults.
5. State the statistical decision.
   Since our p value of 0.00* is less than the alpha of 0.05, we have
   sufficient evidence to conclude that there is a significant difference
   between birthweights in the two clinics.
   Remember that we can find our critical t value by using the Student's t
   table. In this case it is 1.645 (use the total observations to find N and
   the total degrees of freedom). Since our calculated t is 11.0545 and is
   greater than 1.645, we can confirm the ability to reject the null
   hypothesis.
6. State the practical conclusion.
   Because $p < \alpha$ and $t_{calc} > t_{crit}$, we can reject the null
   hypothesis and state that there is a significant difference between
   birthweights at the two clinics.
   We might use this information to go back and determine the cause of the
   differences. The cause may be due to the community in which the clinic is
   located or due to the amenities of the clinic. Further investigation would
   be needed in order to determine this.
*It is important to know that although computer printouts will often display
p values as 0.0000, a p value will NEVER equal zero, but can come close. It is
better to report it as p < 0.0001.
Note: When using computer programs to conduct statistical tests, the p value
is reported at the same time as $t_{calc}$, and many times $t_{crit}$ is not
reported at all. Therefore, we often use the decision rule $p < \alpha$ to
determine statistical significance instead of $t_{calc} > t_{crit}$.
Epi Info Practice: Hypothesis Testing Two Sample t test
There was an outbreak of cholera among students in a village school. You were
given a record of those infected by the school director. Of the students infected
with cholera, you want to determine if there is a significant difference in the age
of the infected by gender. Use the t test in Epi Info to determine if there is a
significant difference (alpha = 0.05) between the mean ages of males and
females infected with cholera. Use the table AgeInSchool from the dataset
Bios_Workbook_Examples.
Step    Practice Space
1. State the null and alternative hypotheses.
2. State the decision rule.
3. Perform a two sample t test in Epi Info.
4. State the statistical decision.
5. State the practical conclusion.
Related Concepts
Confidence Interval: Two Sample t test
Hypothesis Testing: Paired Samples t test
Confidence Interval Estimation: Two Sample t test
Overview
Applied to estimate the difference in population means, $\mu_1 - \mu_2$, by an
interval that communicates information regarding the probable magnitude of
$\mu_1 - \mu_2$.
Formula: A $100(1 - \alpha)\%$ confidence interval (CI) for $\mu_1 - \mu_2$ is
given by:
$(\bar{x}_1 - \bar{x}_2) \pm t_{crit}\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$
Where:
$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
is referred to as the pooled variance.
The number of df used in determining the value of $t_{crit}$ is
$n_1 + n_2 - 2$.
Assumptions:
o Two independent random samples
o Normally distributed population
o Equal, but unknown variances
Type of variables: Continuous
Decision rule: With repeated sampling, the true difference in the population
means will be included in a 95% confidence interval 95 percent of the time.
Thus, if the confidence interval contains the value of zero, there is no
significant difference between the means. If zero is not included in the
interval, we can reject the null hypothesis and say that there is a significant
difference between the means.
Step by Step Example: Confidence Interval Two Sample t test
Recall that previously we followed the critical value approach to determine
whether infants born at a low income area clinic, on the average, tend to be
lighter than those born at a clinic serving a high income population area.
Suppose now we are interested in constructing a 95% confidence interval for the
difference in population means to solve this problem.
The data are once again provided here:
Clinic      n      $\bar{x}$   s
High (1)    80     2800        100
Low (2)     100    2650        82
Step    Example
1. State the null and alternative hypotheses and the decision rule.
   $H_0: \mu_d = 0$
   $H_A: \mu_d \neq 0$
   If the calculated 95% confidence interval excludes the value of zero, then
   there is enough evidence to reject the null hypothesis.
2. Calculate a point estimate of $\mu_1 - \mu_2$.
   We use $(\bar{x}_1 - \bar{x}_2) = 2800 - 2650 = 150$ as the point estimate.
3. Determine the reliability coefficient ($t_{crit}$).
   When we enter the Student's t table with 100 + 80 - 2 = 178 degrees of
   freedom (approximate using $\infty$ df) and a desired confidence level of
   0.95, we find the reliability coefficient is 1.96.
4. Compute the pooled estimate of the common population variance.
   $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
   The pooled variance is calculated as:
   $s_p^2 = \frac{79(100)^2 + 99(82)^2}{178} = 8177.955$
5. Compute the standard error of the point estimate.
   $\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$
   Thus, the standard error is computed as:
   $\sqrt{\frac{8177.955}{80} + \frac{8177.955}{100}} = 13.56$
6. Calculate the desired confidence interval.
   $(\bar{x}_1 - \bar{x}_2) \pm t_{crit}\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$
   Our 95% CI for $\mu_1 - \mu_2$ is computed as follows:
   Lower limit: $150 - (1.96 \times 13.56) = 150 - 26.59 = 123.41$
   Upper limit: $150 + (1.96 \times 13.56) = 150 + 26.59 = 176.59$
   Thus, the 95% CI is (123.41, 176.59).
7. Interpret this interval.
   We are 95% confident that the mean difference in birthweights for infants
   born at the two clinics is between 123.41 and 176.59 grams. Since the
   interval excludes zero, which would signify no difference between the two
   means, the null hypothesis is rejected and we conclude that the difference
   between the two population means is statistically significant.
   As expected, we have reached the same conclusion here as we did using the
   critical value approach previously.
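Steps 2 through 6 can be combined into one short calculation. The sketch below is an illustration under the equal variance assumption used in this example; the function name `difference_confidence_interval` is hypothetical, and the reliability coefficient of 1.96 is passed in from step 3 rather than looked up.

```python
import math


def difference_confidence_interval(n1, mean1, s1, n2, mean2, s2, t_crit=1.96):
    """Return (lower, upper) limits for mu1 - mu2 using the pooled variance.

    Hypothetical helper for illustration only.
    """
    point_estimate = mean1 - mean2
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 / n1 + sp2 / n2)
    margin = t_crit * se
    return point_estimate - margin, point_estimate + margin


# Birthweight example: High clinic (1) and Low clinic (2)
lower, upper = difference_confidence_interval(80, 2800, 100, 100, 2650, 82)
excludes_zero = not (lower <= 0 <= upper)   # decision rule from step 1
print(round(lower, 2), round(upper, 2), excludes_zero)
```

Because the script carries the unrounded standard error through the calculation, small differences in the last decimal place from a hand calculation with rounded values are possible in other examples.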
Practice: Confidence Interval Two Sample t test
Recall that we previously followed the critical value approach to determine
whether or not there is a difference in the mean age of tuberculosis patients who
are infected with HIV versus those who are not infected with HIV. Suppose now
we are interested in constructing a 95% CI for the difference in population means
to solve this problem. The data are once again provided here as follows:
Patient     n     $\bar{x}$   s
HIV+ (1)    45    23.6        5.9
HIV- (2)    28    35.2        17.9
Step    Practice Space
1. State the null and alternative hypotheses and the decision rule.
2. Calculate a point estimate of $\mu_1 - \mu_2$.
3. Determine the reliability coefficient ($t_{crit}$).
4. Compute the pooled estimate of the common population variance.
   $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
5. Compute the standard error of the point estimate.
   $\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$
6. Calculate the desired confidence interval.
   $(\bar{x}_1 - \bar{x}_2) \pm t_{crit}\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$
7. Interpret this interval.
OpenEpi Example: Confidence Interval Two Sample t test
You may have noticed that Epi Info does not include the confidence interval
around the mean in the descriptive statistics. Confidence intervals are something
that Epi Info for Windows version 3.3.2 does not calculate. Therefore, you may
use the descriptive statistics that Epi Info provides to calculate the confidence
interval by hand, or you may use another program that will calculate the
confidence interval for you. OpenEpi has this capability, as do most
statistical programs such as SAS or SPSS.
We will once again be using the dataset, 2 sample t test, as an example. Use
the descriptive statistics that you found through Epi Info in the Epi Info
Example above to calculate the 95% confidence interval.
Remember that you will need to allow popups in order to use this application.
Step    Example
1. State the null and alternative hypotheses.
   $H_0: \mu_d = 0$
   $H_A: \mu_d \neq 0$
2. State the decision rule.
   If the calculated 95% confidence interval excludes the value of zero, then
   there is enough evidence to reject the null hypothesis.
3. Calculate the 95% confidence interval using OpenEpi.
   From the OpenEpi menu, choose the option t test. This is located within
   the Continuous Variables folder. The page that opens gives you an overview
   of the test and an example.
   Click on Enter New Data.
   Refer to the descriptive statistics that we found in Epi Info to enter the
   data.
   Allow Group 1 to be the clinic for the high income population and Group 2
   to be the clinic for the low income population. The confidence interval
   should be set at 95%, the default. Complete the remainder of the table
   with the sample size, mean, and standard deviation of each group.
   Note that if you provide the standard deviation, you do not need to provide
   the standard error.
   Click Calculate.
4. Interpret the results.
   Note that in the output you are given the lower and upper confidence
   limits for both equal variances and unequal variances. You must choose
   which confidence interval applies to your data. In this case, we can
   assume equal variance and so the first interval reported,
   (123.224, 176.776), is the one that interests us.
5. State the conclusion.
   The 95% confidence interval for the mean difference is (123.224, 176.776).
   With repeated sampling, 95% of the intervals constructed this way will
   contain the true difference in population means. Since the interval does
   not include the value of 0, we can conclude that there is a significant
   difference between the two population means.
Excel Example: Confidence Interval Two Sample t test
If it is known that the two samples in a two sample t test have equal variances,
then we can take the difference of the means and use the pooled standard
deviation to find the confidence interval.
If you have unequal variances or cannot assume that the variances are equal,
another option is to calculate the 95% confidence interval of the mean for each
sample that you are comparing. If there is no overlap in the confidence
intervals, then we have significant evidence to reject the null and say that
there is a difference between the two groups. If the two confidence intervals
overlap, this check does not show a significant difference. Note that the check
is conservative: overlapping intervals do not by themselves prove that the
means are equal, and a formal test may still be needed.
Two examples of how to calculate the confidence interval around the mean of a
sample in Excel are given in the Excel Examples under Confidence Interval
Around a Mean.
Step    Example
1. State the null and alternative hypotheses.
   $H_0: \mu_1 = \mu_2$
   $H_A: \mu_1 \neq \mu_2$
2. State the decision rule.
   If the calculated 95% confidence intervals do not overlap, then there is a
   significant difference between the two groups.
3. Use Excel to calculate the confidence interval.
   High clinic 95% CI = $\bar{x} \pm$ confidence level from Excel
   = 2800 - 22.26, 2800 + 22.26
   = 2777.74, 2822.26
   Low clinic 95% CI = $\bar{x} \pm$ confidence level from Excel
   = 2650 - 16.27, 2650 + 16.27
   = 2633.73, 2666.27
4. State the statistical decision.
   Because we are looking at two independent samples, we can compare the two
   means by comparing their confidence intervals. If the confidence intervals
   overlap, then this method does not detect a significant difference in the
   two means.
   Since the 95% confidence interval for birthweights at the High clinic does
   not overlap with the 95% confidence interval for the Low clinic, we can
   deduce that there is a significant difference between the means. In
   addition, we can infer the direction of the difference from the relative
   positions of the two intervals.
5. State the practical conclusion.
   Infants born at the low income serving clinic, on average, have a lower
   birthweight than infants born at the high income serving clinic.
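The overlap comparison used in this example can be written as a small check. This is a sketch only; `intervals_overlap` is a hypothetical helper, and the interval limits are the ones calculated in step 3.

```python
def intervals_overlap(a, b):
    """Return True if intervals a = (low, high) and b = (low, high) share any value."""
    return a[0] <= b[1] and b[0] <= a[1]


# 95% confidence intervals from step 3 (mean +/- Excel confidence level)
high_clinic = (2800 - 22.26, 2800 + 22.26)
low_clinic = (2650 - 16.27, 2650 + 16.27)

if intervals_overlap(high_clinic, low_clinic):
    print("Intervals overlap: no significant difference detected by this check")
else:
    print("Intervals do not overlap: significant difference between the means")
```

When the intervals do overlap, a formal two sample test is the safer way to reach a decision, since overlapping intervals can still correspond to a significant difference.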
OpenEpi Practice: Confidence Interval Two Sample t test
Determine if there is a difference in birthweights between those that received a
nutritional supplement during pregnancy and those that did not by finding the
95% confidence interval. Use the dataset Birthweight.
Step    Practice Space
1. State the null and alternative hypotheses.
2. State the decision rule.
3. Use Epi Info or Excel to calculate the descriptive statistics.
4. Find the 95% confidence interval in OpenEpi.
5. Interpret the results.
6. State the conclusion.
Related Concepts
Confidence Interval Around a Mean
Hypothesis Testing: Two Sample t Test
Hypothesis Testing: z test for Difference in Proportions
Overview
Test employed to evaluate the null hypothesis ($H_0$) that the population
proportions are equal versus the alternative hypothesis ($H_a$) that the
population proportions are different.
Example: If there is 6 percent HIV prevalence in the general population, should
we be concerned about a certain group with an HIV prevalence of 9 percent?
Formula:
$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{\bar{p}\bar{q}}{n_1} + \frac{\bar{p}\bar{q}}{n_2}}}$
Where:
$\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$ represents the pooled estimate of the
hypothesized common proportion.
$\bar{q} = 1 - \bar{p}$.
$\hat{p}_1 - \hat{p}_2$ = difference in sample proportions.
Assumptions:
o $n_1\hat{p}_1$, $n_1\hat{q}_1$, $n_2\hat{p}_2$, $n_2\hat{q}_2$ are each
  greater than or equal to 5
o Independent samples
o Random samples
o Each $\hat{p}$ approximately normally distributed in accordance with the
  central limit theorem
Type of variables: Proportions, discrete
Decision rule: If the calculated value of z ($z_{calc}$) is greater than the
critical value of z ($z_{crit}$), then we can reject the null.
Table used: Standard normal z table
An epidemiologist is often interested in comparing the results between groups as
opposed to evaluating a single group, since comparison is critical in
epidemiologic research. The z test for difference between two proportions is
applied on discrete data to determine whether or not it is reasonable to conclude
that two population proportions are unequal.
For data analyzed using the z test for difference in proportions, the numerator
compares the difference between the two sample proportions
($\hat{p}_1 - \hat{p}_2$), referred to as the sample statistic or point estimate,
with the difference that would be expected under a true null hypothesis (i.e.,
$H_0: p_1 - p_2 = 0$), which often equals zero, or no difference. The
denominator is the standard error, which serves as our measure of variability.
Since we always conduct hypothesis tests under the assumption of a true null
hypothesis, by hypothesizing that $p_1 = p_2$ in the null hypothesis (i.e., no
difference in population proportions), our best estimate of the population
proportion is obtained by pooling the data from both samples. Thus, if $x_1$ and
$x_2$ are the numbers of successes obtained from the two samples, the pooled
estimate of the population proportion is denoted by $\bar{p}$, and is calculated
as:
$\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$
As you will see, this pooled estimate is a critical component in determining the
standard error of the difference in proportions.
Step-by-Step Example: Hypothesis Testing z-test for Difference in Proportions
Suppose we have the following results obtained from samples of males and females aged 15 to 35 years taken from a district hospital in Mutare, Zimbabwe. Can we conclude from the data that there is a difference between the proportions of males and females who receive treatment for tuberculosis (TB)? Let $\alpha$ equal 0.05.
Gender     n     Treated for Tuberculosis
Males      160   20
Females    200   50
Step | Example
1. State the null and alternative hypotheses.
Since we are interested in determining whether the proportion of females treated for TB is either higher or lower than the proportion of males treated for TB, a two-tailed test is required.
Our null hypothesis states that the proportions of males and females who are treated for TB are equal. The null hypothesis is written as:
$H_0: p_m = p_f$
The alternative hypothesis states that the two population proportions are unequal and is written as:
$H_a: p_m \neq p_f$
An alternative way of stating the hypotheses is as follows:
$H_0: p_m - p_f = 0$
$H_a: p_m - p_f \neq 0$
2. State the decision rule.
Using a 2-sided alpha level of 0.05, the critical z value (based on the z table) is equal to 1.96. Thus, we should reject $H_0$ when $|z_{calc}| > 1.96$.
3. Calculate the value of the test statistic.
$$z_{calc} = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{\bar{p}\,\bar{q}}{n_1} + \dfrac{\bar{p}\,\bar{q}}{n_2}}}$$
a. Calculate the difference in sample proportions.
Begin by calculating the sample proportions for males and females, respectively. Each proportion is found by dividing the number who have received treatment by the sample size.
$\hat{p}_m = \dfrac{20}{160} = 0.125$
$\hat{p}_f = \dfrac{50}{200} = 0.250$
Thus, the difference in sample proportions ($\hat{p}_m - \hat{p}_f$) is computed as:
$0.125 - 0.250 = -0.125$
b. Determine values for $\bar{p}$ and $\bar{q}$.
$\bar{p} = \dfrac{x_1 + x_2}{n_1 + n_2}$, $\bar{q} = 1 - \bar{p}$
Before determining the value of the standard error, let us first calculate the pooled proportion $\bar{p}$ and $\bar{q}$:
$\bar{p} = \dfrac{20 + 50}{160 + 200} = 0.194$
$\bar{q} = 1 - 0.194 = 0.806$
c. Find the value for the standard error.
$$SE = \sqrt{\bar{p}\,\bar{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
The standard error is computed as:
$$SE = \sqrt{(0.194)(0.806)\left(\frac{1}{160} + \frac{1}{200}\right)} = 0.0419$$
d. Determine the value of $z_{calc}$.
$$z_{calc} = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{\bar{p}\,\bar{q}}{n_1} + \dfrac{\bar{p}\,\bar{q}}{n_2}}} = \frac{-0.125 - 0}{0.0419} = -2.98$$
4. State the statistical decision.
We reject $H_0$ since the absolute value of our test statistic, $|z_{calc}| = 2.98$, is greater than the z critical value of 1.96. Our test statistic therefore falls in the rejection region.
5. State the practical conclusion.
There is sufficient evidence at the 5% level of significance to indicate that there is a difference in the proportion of males versus females who are treated for tuberculosis in the sampled populations.
6. Report the p-value.
A p-value of 0.0028 is obtained from the standard normal z table. Thus, there is a 0.28% probability of obtaining a $z \geq 2.98$ or a $z \leq -2.98$ when the null hypothesis is true.
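The hand calculation above can be checked with a short script. The following is a minimal sketch in Python, which is an assumption of this illustration (the workbook itself uses Epi Info, OpenEpi, and Excel); the variable names are hypothetical. It uses the complementary error function from the standard library in place of the z table to obtain the two-sided p-value.

```python
import math

# Counts from the Mutare example: number treated for TB and sample size.
x_m, n_m = 20, 160   # males
x_f, n_f = 50, 200   # females

# Sample proportions and their difference (males minus females).
p_m = x_m / n_m
p_f = x_f / n_f
diff = p_m - p_f

# Large-sample assumption: each n*p-hat and n*q-hat is at least 5.
assumption_ok = all(c >= 5 for c in (x_m, n_m - x_m, x_f, n_f - x_f))

# Pooled proportion under the null hypothesis and its complement.
p_bar = (x_m + x_f) / (n_m + n_f)
q_bar = 1 - p_bar

# Standard error of the difference, using the pooled estimate.
se = math.sqrt(p_bar * q_bar * (1 / n_m + 1 / n_f))

# Test statistic; the hypothesized difference under H0 is 0.
z_calc = (diff - 0) / se

# Two-sided p-value: P(|Z| >= |z_calc|) for a standard normal Z.
p_value = math.erfc(abs(z_calc) / math.sqrt(2))

print(f"z_calc = {z_calc:.2f}, two-sided p = {p_value:.4f}")
print("Reject H0" if abs(z_calc) > 1.96 else "Fail to reject H0")
```

Because the script carries full precision rather than rounded intermediate values, its p-value may differ from the table-based value in the last digit.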
Practice: Hypothesis Testing z-test for Difference in Proportions
An MPH student working as an intern at the Ministry of Education in Harare, Zimbabwe, is interested in determining whether a gender difference exists in the participation in formal education. Among a sample of 2657 girls surveyed, it was found that 991 completed secondary level education, while 1250 of 2637 boys completed secondary level education. Do the data provide sufficient evidence to indicate unequal participation in formal education of girls versus boys? Test at the 0.10 level of significance.
Step | Practice Space
1. State the null and alternative hypotheses.
2. State the decision rule.
3. Calculate the value of the test statistic.
$$z_{calc} = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{\bar{p}\,\bar{q}}{n_1} + \dfrac{\bar{p}\,\bar{q}}{n_2}}}$$
a. Calculate the difference in sample proportions.
b. Determine values for $\bar{p}$ and $\bar{q}$.
$\bar{p} = \dfrac{x_1 + x_2}{n_1 + n_2}$, $\bar{q} = 1 - \bar{p}$
c. Find the value for the standard error.
$$SE = \sqrt{\bar{p}\,\bar{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
d. Determine the value of $z_{calc}$.
$$z_{calc} = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{\bar{p}\,\bar{q}}{n_1} + \dfrac{\bar{p}\,\bar{q}}{n_2}}}$$
4. State the statistical decision.
5. Report the p-value.
6. State the practical conclusion.
Epi Info Example: Hypothesis Testing z-test for Difference in Proportions
You may have learned in your biostatistics course that the square of a standard normal z statistic follows the chi-square distribution with 1 degree of freedom. Therefore, though we cannot perform a z-test in Epi Info, we can use the TABLES command to find the z statistic.
Using the previous example (refer to page 107), let's test for a difference in those who are treated for tuberculosis.
Step | Example
1. State the null and alternative hypotheses.
$H_0: p_1 = p_2$
(There is no difference in treatment for the two populations, men and women.)
$H_a: p_1 \neq p_2$
(There is a significant difference in treatment between the two populations.)
We can also express this as:
$H_0: p_1 - p_2 = 0$
$H_a: p_1 - p_2 \neq 0$
2. State the decision rule.
$\alpha = 0.05$
Reject the null hypothesis when $p < \alpha$.
3. Find the z difference in proportions in Epi Info.
a. READ the database.
Since $|z| = \sqrt{\chi^2}$, we can use chi-square with one degree of freedom (1 df) to approximate the z statistic.
READ ztest_example from the dataset Bios_Workbook_Examples.mdb.
b. Perform the TABLES command.
Select TABLES from the list of Statistics.
Choose Gender as the Exposure Variable.
Choose Treated as the Outcome Variable.
Click OK.
4. Calculate the z statistic.
You will see a 2x2 table like the one below:
To find the results, look at the STATISTICAL TESTS section in the Single Table Analysis.
Though the z statistic is not calculated, we can obtain it from chi-square, since the square of the z statistic follows the chi-square distribution with one degree of freedom.
$|z| = \sqrt{\chi^2}$. Therefore, $\sqrt{8.87} = 2.98$, with a p-value of 0.004.
5. State the statistical decision.
Since the p-value of 0.004 is less than the $\alpha$ of 0.05, we can reject the null hypothesis.
6. State the practical conclusion.
There is evidence to support a significant difference between TB treatment in males and females. Further examination is recommended to determine the direction of the difference. Then we will identify possible reasons behind this difference in order to provide equal treatment to all.
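The relationship between the chi-square statistic and the z statistic used above can be verified numerically. The sketch below, written in Python as an assumption of this illustration, applies the standard uncorrected Pearson chi-square shortcut formula for a 2x2 table to the TB counts and compares its square root with the pooled z statistic. The cell labels a, b, c, d are hypothetical names, and the script does not reproduce Epi Info's own output.

```python
import math

# 2x2 table from the TB example: rows are gender, columns are
# (treated, not treated).
a, b = 20, 140   # males
c, d = 50, 150   # females
n = a + b + c + d

# Uncorrected Pearson chi-square for a 2x2 table (1 degree of freedom).
chi_sq = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
z_from_chi = math.sqrt(chi_sq)

# Pooled two-proportion z statistic computed directly.
p1, p2 = a / (a + b), c / (c + d)
p_bar = (a + c) / n
se = math.sqrt(p_bar * (1 - p_bar) * (1 / (a + b) + 1 / (c + d)))
z_calc = (p1 - p2) / se

print(f"chi-square = {chi_sq:.2f}, sqrt = {z_from_chi:.2f}, |z| = {abs(z_calc):.2f}")
```

Note that the square root recovers only the magnitude of z; the sign (the direction of the difference) has to be read from the proportions themselves.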
Epi Info Practice: Hypothesis Testing z-test for Difference in Proportions
You are retrospectively examining a new treatment given to patients with pneumonia. Use the dataset New_Treatment.mdb to see if the proportion of those that died is significantly different from the proportion of those that survived.
Step | Practice Space
1. State the null and alternative hypotheses.
2. State the decision rule.
3. Use Epi Info to calculate chi-square with 1 degree of freedom.
4. Calculate the z statistic from chi-square.
5. State the statistical decision.
6. State the practical conclusion.
Related Concepts
Confidence Interval: z-test for Difference in Proportions
Confidence Interval Estimation: z-test for Difference in Proportions
Overview
Applied to estimate the difference in population proportions, $p_1 - p_2$, by an interval that communicates information regarding the probable magnitude of $p_1 - p_2$.
Formula: A $100(1 - \alpha)\%$ confidence interval for $p_1 - p_2$ is given by:
$$(\hat{p}_1 - \hat{p}_2) \pm z_{crit}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$
Assumptions:
o $n_1\hat{p}_1$, $n_1\hat{q}_1$, $n_2\hat{p}_2$, $n_2\hat{q}_2$ are each greater than or equal to 5
o Independent samples
o Random samples
o Each $\hat{p}$ approximately normally distributed in accordance with the central limit theorem
Type of variables: Proportions, discrete
Decision rule: With repeated sampling, 95 percent of the 95% confidence intervals constructed in this way will contain the true difference of the proportions. If the value of zero is included, there is no evidence to suggest a significant difference between the proportions. If the constructed interval does not include zero, the interval provides evidence that the two population proportions are not equal; thus the results are statistically significant and we can reject the null.
Where:
The value of $z_{crit}$ to use in constructing the interval is found in the standard normal distribution table.
Step-by-Step Example: Confidence Interval z-test for Difference in Proportions
Recall that previously we followed the critical value approach using data from a hospital in Mutare, Zimbabwe, to determine whether there was a difference in the proportion of men versus women who are treated for TB. Suppose now we are interested in constructing a 95% CI for the difference in population proportions. The data are once again provided here as follows:
Gender     n     Treated for Tuberculosis
Males      160   20
Females    200   50
Step | Example
1. State the null and alternative hypotheses and determine the decision rule.
$H_0: p_1 - p_2 = 0$
$H_A: p_1 - p_2 \neq 0$
If the value of zero is not included in our calculated 95% confidence interval, we have sufficient evidence to reject the null hypothesis.
2. Calculate a point estimate of $p_1 - p_2$.
We use $\hat{p}_1 - \hat{p}_2 = -0.125$ as the point estimate.
3. Determine the reliability coefficient.
The reliability factor from the standard normal z table at a desired confidence level of 0.95 (two-sided) is 1.96.
4. Compute the standard error of the point estimate.
$$SE = \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$
The estimated standard error of the difference between sample proportions is:
$$\sqrt{\frac{(0.125)(0.875)}{160} + \frac{(0.25)(0.75)}{200}} = 0.0403$$
5. Calculate the desired confidence interval.
$$(\hat{p}_1 - \hat{p}_2) \pm z_{crit}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$
Our 95% CI for $p_1 - p_2$ is computed as follows:
Lower limit: $-0.125 - (1.96)(0.0403) = -0.125 - 0.0790 = -0.204$
Upper limit: $-0.125 + (1.96)(0.0403) = -0.125 + 0.0790 = -0.046$
The limits of our 95% CI are (-0.204, -0.046).
6. Interpret this interval.
We are 95% confident that the true difference in population proportions of males versus females who are treated for TB at this hospital is somewhere between -20.4 and -4.6 percentage points. Specifically, we are 95% confident that the proportion of males who are treated for TB at this hospital is anywhere between 4.6 and 20.4 percentage points lower than the proportion of females who are treated for TB at this hospital. Since the interval does not contain the null difference of zero, we conclude that the two population proportions are not equal, thus indicating our findings are statistically significant, and we reject the null.
As expected, we have reached the same conclusion here as we did using the critical value approach previously.
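The interval calculation in this example can be reproduced with a few lines of code. This is a minimal sketch in Python (an assumption of this illustration; the variable names are hypothetical). It uses the unpooled standard error from the confidence interval formula and the table value 1.96 as the reliability coefficient.

```python
import math

# Counts from the Mutare example: number treated for TB and sample size.
x1, n1 = 20, 160   # males
x2, n2 = 50, 200   # females

p1, p2 = x1 / n1, x2 / n2
q1, q2 = 1 - p1, 1 - p2

# Point estimate of p1 - p2.
point_estimate = p1 - p2

# Unpooled standard error: each sample keeps its own proportion.
se = math.sqrt(p1 * q1 / n1 + p2 * q2 / n2)

# Reliability coefficient for 95% confidence, taken from the z table.
z_crit = 1.96

lower = point_estimate - z_crit * se
upper = point_estimate + z_crit * se

print(f"SE = {se:.4f}")
print(f"95% CI for p1 - p2: ({lower:.3f}, {upper:.3f})")
print("Zero inside interval:", lower <= 0 <= upper)
```

Unlike the hypothesis test, the interval does not pool the two samples, because no assumption of equal proportions is being made when estimating the difference.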
Practice: Confidence Interval z-test for Difference in Proportions
Recall that previously we followed the critical value approach to determine whether a gender difference exists in the participation in secondary education among school-aged children in Harare, Zimbabwe. Suppose now we are interested in constructing a 95% CI for the difference in population proportions. The data are provided here once again as follows:
Group   n      Completed secondary level education
Girls   2657   991
Boys    2637   1250
Step | Practice Space
1. State the null and alternative hypotheses and determine the decision rule.
2. Calculate a point estimate of $p_1 - p_2$.
3. Determine the reliability coefficient.
4. Compute the standard error of the point estimate.
$$SE = \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$
5. Calculate the desired confidence interval.
$$(\hat{p}_1 - \hat{p}_2) \pm z_{crit}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$
6. Interpret this interval.
Epi Info Example: Confidence Interval z-test for Difference in Proportions
We will use ztest_example.mdb to find the 95% confidence interval around the difference in proportions in Epi Info. Epi Info will label the output as a risk difference, but as you should know from your epidemiology courses, a risk difference is also a difference in proportions, and therefore the calculations are the same.
Step | Example
1. State the null and alternative hypotheses.
$H_0: p_1 = p_2$
(There is no difference in treatment for the two populations, men and women.)
$H_a: p_1 \neq p_2$
(There is a significant difference in treatment between the two populations.)
We can also express this as:
$H_0: p_1 - p_2 = 0$
$H_a: p_1 - p_2 \neq 0$
2. State the decision rule.
We can reject the null hypothesis if the confidence interval does not contain the value of 0.
3. Find the 95% confidence interval for the z-test difference in proportions in Epi Info.
a. READ the database.
READ ztest_example from the dataset Bios_Workbook_Examples.mdb.
b. Perform the TABLES command.
Select TABLES from the list of Statistics.
Choose Gender as the Exposure Variable.
Choose Treated as the Outcome Variable.
Click OK.
You will see a 2x2 table like the one below:
To find the results, look at the STATISTICAL TESTS section in the Single Table Analysis.
Though the confidence interval for the z-test of difference in proportions is not listed, we can approximate it by looking at the 95% confidence interval for risk difference.
4. State the statistical decision.
Our 95% confidence interval is (4.61, 20.39).
5. Interpret the results.
We are 95% confident that the true difference in population proportions of males versus females who are treated for TB at this hospital is somewhere between 4.61% and 20.39%. Specifically, we are 95% confident that the proportion of males who are treated for TB at this hospital is anywhere between 4.6% and 20.4% lower than the proportion of females who are treated for TB at this hospital. Since the interval does not contain the null difference of zero, we conclude that the two population proportions are not equal, thus indicating our findings are statistically significant, and we reject the null.
Epi Info Practice: Confidence Interval z-test for Difference in Proportions
Use the dataset New_Treatment.mdb to calculate the 95% confidence interval around the difference in proportions using Epi Info.
Step | Practice Space
1. State the null and alternative hypotheses.
2. State the decision rule.
3. Find the 95% confidence interval for the z-test of difference in proportions in Epi Info.
Note: If you are given descriptive statistics only and not a complete dataset, you can also calculate the 95% confidence interval for the z-test of difference in proportions in OpenEpi. Use the command Two by Two Tables and look at the output in Stratum 1 labeled Risk Differences.
4. State the statistical decision.
5. Interpret the results.
Related Concepts
Hypothesis Testing: z-test of Difference in Proportions
Confidence Interval Around a Proportion
Hypothesis Testing: Paired t-test
The paired t-test differs from the two-sample independent t-test in that the paired t-test compares observations that have been collected from matched pairs or repeated measurements on the same individuals or items.
Overview
Description: Used on related samples of data to test the null hypothesis that the population mean of the paired differences of the two samples is zero.
Example: Comparing pre-test with post-test results of a class of students, or the degree of fever before and after a fever reducer has been taken.
Formula:
$$t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$$
Assumptions:
o Simple random sample
o Normally distributed population of differences
o Samples related and not independent
Type of variables: Related, continuous
Decision rule: If the calculated value of t ($t_{calc}$) is greater than the critical value of t ($t_{crit}$), then we can reject the null.
Table used: Student's t table
Where:
$$s_d = \sqrt{\frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}}$$
is referred to as the standard deviation of the differences.
Step-by-Step Example: Hypothesis Testing Paired t-test
An MPH student on an internship in South Africa is informed that South African black women are more likely than their racial counterparts to be prone to obesity. She visits a prominent weight loss center and receives permission to examine the records (randomly selected) of 8 black women. She is interested in determining whether the healthy eating plan recommended by the dietician was effective in encouraging patients to lose weight. The following table gives the weights (lbs) for 8 subjects measured both at baseline and after 6 months. Do these data provide sufficient evidence, at the 0.05 level of significance, to indicate that the diet regimen is effective for weight loss?
Subject   Baseline   After 6 months
1         310        263
2         295        251
3         287        249
4         305        259
5         270        233
6         323        267
7         277        242
8         299        265
Step | Example
1. State the null and alternative hypotheses.
A one-sided test of our hypotheses is indicated since the MPH student is interested in learning whether or not the diet was effective in weight loss only and not weight change. Thus, our hypotheses are as follows.
Our null hypothesis states that weight after the diet regimen is greater than or equal to weight before the diet regimen; thus the mean difference (baseline minus after) is less than or equal to 0.
$H_0: \mu_d \leq 0$
Our alternate hypothesis states that weight after the diet regimen is less than weight before the diet regimen; thus the mean difference is greater than 0.
$H_a: \mu_d > 0$
2. State the decision rule.
Using a one-sided test with an alpha value of 0.05 and $n - 1 = 7$ df, the critical value of the test statistic, according to the Student's t table, is 1.8946. Thus, we should reject $H_0$ if $t_{calc} > 1.8946$.
3. Calculate the value of the test statistic.
$$t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$$
a. Calculate values of the differences.
$d_i = x_1 - x_2$
Subject   Baseline   After 6 months   Difference ($d_i$)
1         310        263              47
2         295        251              44
3         287        249              38
4         305        259              46
5         270        233              37
6         323        267              56
7         277        242              35
8         299        265              34
b. Calculate the mean of the differences, $\bar{d}$.
$\bar{d} = \dfrac{\sum d_i}{n} = \dfrac{337}{8} = 42.1$
c. Calculate the sum of the squared deviations.
$\sum (d_i - \bar{d})^2 = 394.9$
Subject   Baseline   After 6 months   $d_i$   $\bar{d}$   $(d_i - \bar{d})$   $(d_i - \bar{d})^2$
1         310        263              47      42.1        4.9                 24.01
2         295        251              44      42.1        1.9                 3.61
3         287        249              38      42.1        -4.1                16.81
4         305        259              46      42.1        3.9                 15.21
5         270        233              37      42.1        -5.1                26.01
6         323        267              56      42.1        13.9                193.21
7         277        242              35      42.1        -7.1                50.41
8         299        265              34      42.1        -8.1                65.61
$\sum d_i = 337$
d. Calculate $s_d$.
$$s_d = \sqrt{\frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}} = \sqrt{\frac{394.9}{7}} = 7.51$$
e. Compute the value of the standard error.
$$\frac{s_d}{\sqrt{n}} = \frac{7.51}{\sqrt{8}} = 2.655$$
f. Determine the value of $t_{calc}$.
$$t_{calc} = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}} = \frac{42.1 - 0}{2.655} = 15.86$$
4. State the statistical decision.
We reject $H_0$ since the value of our test statistic ($t_{calc} = 15.86$) exceeds the t critical value of 1.8946. We therefore have sufficient evidence to reject the null hypothesis.
5. Report the p-value.
For this test, a p-value of < 0.005 is obtained from the Student's t table.
6. State the practical conclusion.
At the 0.05 level of significance, the data provide sufficient evidence to indicate that there was weight loss after the intervention.
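Steps 3a through 3f above can be traced in code. The following is a minimal sketch in Python (an assumption of this illustration; variable names are hypothetical) that uses only the standard library. The critical value 1.8946 is copied from the Student's t table as in the example rather than computed, because the standard library has no t distribution.

```python
import math
import statistics

# Weights (lbs) from the worked example.
baseline = [310, 295, 287, 305, 270, 323, 277, 299]
after = [263, 251, 249, 259, 233, 267, 242, 265]

# a. Differences, baseline minus after (positive means weight loss).
d = [b - a for b, a in zip(baseline, after)]
n = len(d)

# b. Mean of the differences.
d_bar = statistics.mean(d)

# c./d. Sample standard deviation of the differences (n - 1 denominator).
s_d = statistics.stdev(d)

# e. Standard error of the mean difference.
se = s_d / math.sqrt(n)

# f. Test statistic; the hypothesized mean difference under H0 is 0.
t_calc = (d_bar - 0) / se

# One-sided critical value for alpha = 0.05 and 7 df, from the t table.
t_crit = 1.8946
reject_null = t_calc > t_crit

print(f"d_bar = {d_bar:.3f}, s_d = {s_d:.2f}, SE = {se:.3f}, t = {t_calc:.2f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

The script keeps the unrounded mean of 42.125, so intermediate values can differ slightly from the hand calculation, which rounds the mean to 42.1.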
Practice: Hypothesis Testing Paired t-test
A random sample of 10 MPH students was selected to participate in a study to assess physiological changes that occur immediately before and after completing a standardized examination. The following table gives the systolic blood pressure readings for 10 students measured immediately before and after taking a standardized examination. Do these data provide sufficient evidence, at the 0.01 level of significance, to indicate an increase in systolic blood pressure from before to after the examination?
Systolic Blood Pressure
Subject   Before   After
1         115      128
2         112      115
3         107      106
4         119      128
5         115      122
6         138      145
7         126      132
8         105      109
9         104      102
10        115      117
Step | Practice Space
1. State the null and alternative hypotheses.
2. State the decision rule.
3. Calculate the value of the test statistic.
a. Calculate values of the differences.
$d_i = x_1 - x_2$
b. Calculate the mean of the differences, $\bar{d}$.
$\bar{d} = \dfrac{\sum d_i}{n}$
c. Calculate the sum of the squared deviations.
$\sum (d_i - \bar{d})^2$
d. Calculate $s_d$.
$$s_d = \sqrt{\frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}}$$
e. Compute the value of the standard error.
$\dfrac{s_d}{\sqrt{n}}$
f. Determine the value of $t_{calc}$.
$$t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$$
4. State the statistical decision.
5. Report the p-value.
6. State the practical conclusion.
Excel Example: Hypothesis Testing Paired t-test
Now we will take the same example of weight loss in South Africa (found on page 126) and perform the test using data analysis software. A paired t-test can be conducted more easily in Excel than in Epi Info.
Before beginning, we must open our dataset in Excel by following the directions on page 70.
Step | Example
1. State the null and alternative hypotheses.
We will perform a one-tailed t-test since we are only interested in learning if the women lose weight and not simply whether there is a change in weight.
$H_0: \mu_d \leq 0$
(The mean difference in weight, baseline minus after the diet, is less than or equal to zero, showing no weight loss.)
$H_a: \mu_d > 0$
(The mean difference in weight, baseline minus after the diet, is greater than zero, showing weight loss in the women.)
2. State the decision rule.
We will choose an alpha value of 0.05 in order to compare our results from the computer program with those which we previously calculated by hand.
$\alpha = 0.05$
If $p < \alpha$, we can reject the null hypothesis.
3. Calculate the value of the test statistic.
From the menu bar, click on Tools.
Select Data Analysis from the dropdown menu.
Highlight t-test: Paired Two Sample for Means and click OK. The screen will look like the one below:
a. Choose the Input for Variable 1 Range.
Select all the data for the Baseline variable.
b. Choose the Input for Variable 2 Range.
Select all the data for the variable After 6 Months.
Note that if you include the labels with the data, you need to check the box marked Labels.
c. Specify the Hypothesized Mean Difference.
Type the mean according to the null hypothesis. In this case, it is zero.
d. Choose Output range and click OK.
The output should resemble the one below:
t-Test: Paired Two Sample for Means
                               Baseline     After 6 months
Mean                           295.75       253.625
Variance                       304.78571    144.8392857
Observations                   8            8
Pearson Correlation            0.9357478
Hypothesized Mean Difference   0
df                             7
t Stat                         15.863686
P(T<=t) one-tail               4.796E-07
t Critical one-tail            1.8945775
P(T<=t) two-tail               9.591E-07
t Critical two-tail            2.3646226
4. Report the p-value and/or the calculated t.
The Excel output provides two different p-values. One is for a one-tailed test and one for a two-tailed test. We had decided to conduct a one-tailed test, so the one-tailed p-value of 0.00000048 is the one that we will consider.
We have found a calculated t statistic of 15.86, which is the same as we previously calculated on page 128.
5. State the statistical decision.
Since our one-tailed p-value of 0.00000048 is less than our alpha of 0.05, we have sufficient evidence to reject the null hypothesis.
In addition, the one-tailed critical value of 1.89 was provided to us in the Excel output. This is the same value that we found earlier by using the Student's t table. Since our calculated t, or t statistic, is greater than the one-tailed critical t value, our decision to reject the null hypothesis has been confirmed. There is a difference in the weights of the subjects at baseline and after six months.
6. State the practical conclusion.
Because $p < \alpha$ and $t_{calc} > t_{crit}$, we can reject our null hypothesis and state that after six months, subjects weighed less than they did at baseline. This implies that the diet regimen the subjects engaged in was associated with weight loss.
Note that Excel would be very inconvenient to use if we had a large dataset. In this case, we could calculate the necessary descriptive statistics, the mean of the differences and their standard deviation, using Epi Info and calculate the test statistic, t, by hand.
Let's do an experiment and see what would happen if we analyzed this data using a non-paired (independent), two-sample t-test.
                               Paired t-test             2-Sample t-test (unequal variances)
                               Variable 1   Variable 2   Variable 1   Variable 2
Mean                           295.75       253.625      295.75       253.625
Variance                       304.7857     144.8393     304.7857     144.8393
Observations                   8            8            8            8
Pearson Correlation            0.935748
Hypothesized Mean Difference   0                         0
df                             7                         12
t Stat                         15.86369                  5.619008
P(T<=t) one-tail               4.8E-07                   5.63E-05
t Critical one-tail            1.894578                  1.782288
P(T<=t) two-tail               9.59E-07                  0.000113
t Critical two-tail            2.364623                  2.178813
Note the large differences between the two outputs, particularly between the two t statistics. In this example, both have a significant p-value, but that will not always be the case. This should demonstrate to you why it is important to apply the appropriate test to the data being analyzed.
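The gap between the two t statistics in the comparison above comes from the standard error each test uses. The sketch below, in Python (an assumption of this illustration; names are hypothetical), computes both statistics from the same weights. The unequal-variances statistic and its Welch-Satterthwaite degrees of freedom are the standard textbook formulas, not taken from the workbook; the df value comes out fractional and appears as 12 in the table above.

```python
import math
import statistics

baseline = [310, 295, 287, 305, 270, 323, 277, 299]
after = [263, 251, 249, 259, 233, 267, 242, 265]
n1, n2 = len(baseline), len(after)

# Paired test: work with the within-subject differences.
d = [b - a for b, a in zip(baseline, after)]
t_paired = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

# Independent (unequal variances) test: treat the columns as unrelated.
v1 = statistics.variance(baseline) / n1
v2 = statistics.variance(after) / n2
t_welch = (statistics.mean(baseline) - statistics.mean(after)) / math.sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(f"paired t = {t_paired:.3f} (df = {len(d) - 1})")
print(f"unpaired t = {t_welch:.3f} (df = {df_welch:.2f})")
```

The numerators are identical (the difference in means equals the mean difference); only the denominators differ. Pairing removes the large between-subject variation in weight, which is why the paired standard error is so much smaller here.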
Excel Practice: Hypothesis Testing Paired t-test
You are trying to determine if heart rate changes in 10-15 year olds based on their position. You take the pulse of each child for 60 seconds while they are sitting and then while they are lying down. Use Excel to conduct a paired t-test on the data Sit/Lie that is found in Bios_Workbook_Examples. Use an alpha value of 0.05.
Step | Practice Space
1. State the null and alternative hypotheses.
2. State the decision rule.
3. Conduct a paired t-test in Excel.
4. Report the p-value and/or the calculated t.
5. State the statistical decision.
6. State the practical conclusion.
Related Concepts
Confidence Interval: Paired t-test
Hypothesis Testing: Two-Sample t-test
Confidence Interval Estimation: Paired t-test
Overview
Description: Applied to estimate the population mean change, $\mu_d$, by making use of related observations resulting from non-independent samples, with an interval that communicates information regarding the probable magnitude of $\mu_d$.
Formula: A $100(1 - \alpha)\%$ confidence interval for $\mu_d$ is given by:
$$\bar{d} \pm t_{crit}\,\frac{s_d}{\sqrt{n}}$$
Assumptions:
o Differences constitute a simple random sample
o Normally distributed population of differences
Type of variables: Paired or related, continuous
Decision rule: With repeated sampling, 95 percent of the 95% confidence intervals constructed in this way will contain the true mean difference. If the value of zero is included, there is no evidence of a significant difference between the means. If the constructed interval does not include zero, we say that the interval provides evidence that zero is not a candidate for the population mean difference; thus your results are statistically significant.
Where:
$$s_d = \sqrt{\frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}}$$
is referred to as the standard deviation of the differences.
Step-by-Step Example: Confidence Interval Paired t-test
Recall that previously we followed the critical value approach to determine whether 8 subjects who engaged in a 6-month healthy dieting plan experienced weight loss. Suppose we are now interested in constructing a 95% CI for the population mean difference.
Data on weight (lbs) before and after the diet are presented here again as follows:
Subject   Baseline   After 6 months
1         310        263
2         295        251
3         287        249
4         305        259
5         270        233
6         323        267
7         277        242
8         299        265
Step | Example
1. State the null and alternative hypotheses and determine the decision rule.
$H_0: \mu_d = 0$
$H_A: \mu_d \neq 0$
If our calculated 95% confidence interval does not include the value of zero, then there is sufficient evidence to reject the null hypothesis.
2. Calculate a point estimate of $\mu_d$.
We use $\bar{d} = 42.1$ as the point estimate.
3. Determine the reliability coefficient.
When we enter the Student's t table with $8 - 1 = 7$ degrees of freedom and a desired confidence level of 0.95 (two-sided), we find the reliability coefficient is 2.3646.
4. Compute the standard deviation of the differences.
$$s_d = \sqrt{\frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}}$$
The standard deviation of the differences was found previously to be $s_d = 7.51$.
5. Compute the standard error of the point estimate.
The standard error was previously found to be
$$\frac{s_d}{\sqrt{n}} = \frac{7.51}{\sqrt{8}} = 2.655$$
6. Calculate the desired confidence interval.
$$\bar{d} \pm t_{crit}\,\frac{s_d}{\sqrt{n}}$$
Our 95% CI for $\mu_d$ is computed as follows:
Lower limit: $42.1 - (2.3646 \times 2.655) = 42.1 - 6.3 = 35.8$
Upper limit: $42.1 + (2.3646 \times 2.655) = 42.1 + 6.3 = 48.4$
Thus, the 95% CI is (35.8, 48.4).
7. Interpret this interval.
We are 95% confident that, on average, subjects tended to lose anywhere between 35.8 and 48.4 lbs after the 6-month dieting plan. Since the interval excludes zero, the null hypothesis is rejected, and we conclude our results are statistically significant. There is a difference between the baseline weights and the weights after 6 months.
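The interval above can be reproduced directly from the raw weights. This is a minimal sketch in Python (an assumption of this illustration; names are hypothetical). The reliability coefficient 2.3646 is copied from the Student's t table for 7 degrees of freedom, as in the example, because the standard library does not provide t quantiles.

```python
import math
import statistics

baseline = [310, 295, 287, 305, 270, 323, 277, 299]
after = [263, 251, 249, 259, 233, 267, 242, 265]

# Paired differences, baseline minus after.
d = [b - a for b, a in zip(baseline, after)]
n = len(d)

d_bar = statistics.mean(d)                    # point estimate of mu_d
se = statistics.stdev(d) / math.sqrt(n)       # standard error of d_bar

# Two-sided 95% reliability coefficient for n - 1 = 7 df (from the t table).
t_crit = 2.3646

margin = t_crit * se
lower, upper = d_bar - margin, d_bar + margin

print(f"95% CI for mu_d: ({lower:.2f}, {upper:.2f})")
print("Zero inside interval:", lower <= 0 <= upper)
```

The margin computed here is the same quantity that spreadsheet software typically reports as the confidence level for the mean of the Difference column.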
Practice: Confidence Interval Paired t-test
Recall that previously we followed the critical value approach to determine whether the blood pressures for 10 students increased after taking a standardized examination. Suppose now we are interested in constructing a 95% CI for the population mean difference.
Data on systolic blood pressure before and after the test are presented below:
Systolic Blood Pressure
Subject   Before   After
1         115      128
2         112      115
3         107      106
4         119      128
5         115      122
6         138      145
7         126      132
8         105      109
9         104      102
10        115      117
Step | Practice Space
1. State the null and alternative hypotheses and determine the decision rule.
2. Calculate a point estimate of $\mu_d$.
3. Determine the reliability coefficient.
4. Compute the standard deviation of the differences.
$$s_d = \sqrt{\frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}}$$
5. Compute the standard error of the point estimate.
$\dfrac{s_d}{\sqrt{n}}$
6. Calculate the desired confidence interval.
$$\bar{d} \pm t_{crit}\,\frac{s_d}{\sqrt{n}}$$
7. Interpret this interval.
Excel Example: Confidence Interval Paired t-test
We will now use Excel to learn if those in the diet group described on page 137 lost weight. Use the dataset Paired_ttest from Bios_Workbook_Examples.mdb.
Step | Example
1. State the null and alternative hypotheses.
$H_0: \mu_d = 0$
$H_A: \mu_d \neq 0$
2. Determine the decision rule.
If our calculated 95% confidence interval does not include the value of zero, then there is sufficient evidence to reject the null hypothesis.
3. Find the difference of the pairs.
Open Excel and import the dataset into a spreadsheet.
Label a new column Difference.
In the first cell below the label, type =abs(
Click on the first cell in the Baseline column.
Type - (a minus sign).
Click on the first cell in the After 6 Months column.
Press Enter.
You now should have a difference of 47 between the baseline and after 6 months measurements.
Click on the lower right-hand corner of the cell and drag down so that the column Difference is highlighted.
The difference for each row is now included in the Difference column and your Excel chart should look like the one below:
4. Find the confidence level for the mean.
Click on Tools from the menu bar.
Select Data Analysis from the dropdown menu.
Highlight Descriptive Statistics and click OK.
Select all the data for the Difference variable.
Select an empty cell for the Output Range.
Check the box for Summary statistics.
Check the box for Confidence Level for Mean. The default for the confidence level should be 95%.
Click OK.
Note that if you include the labels with the data, you need to check the box marked Labels.
5. Calculate the confidence interval.
The confidence level will be displayed in the last row of your output. In order to find the 95% confidence interval, you must add this number to, and subtract it from, the mean of the data.
Lower limit = 42.13 - 6.28 = 35.85
Upper limit = 42.13 + 6.28 = 48.41
6. Interpret this interval.
The 95% confidence interval is (35.85, 48.41). If we were to take repeated samples from the current population and construct an interval from each, about 95% of those intervals would contain the true mean difference. Because the CI does not include zero, we can reject the null and say that there is a difference in weights, and in particular, the diet regimen is associated with weight loss.
Excel Practice: Confidence Interval Paired t-test
Use the Sit/Lie data from the database Bios_Workbook_Examples to determine if there is a difference in the pulse of a person depending on their position, based on a 95% confidence interval.
Step | Practice Space
1. State the null and alternative hypotheses.
2. Determine the decision rule.
3. Find the difference of the pairs in Excel.
4. Find the confidence level for the mean of the difference.
5. Calculate the confidence interval.
6. Interpret this interval.
RelatedConcepts
ConfidenceIntervalAroundaMean
HypothesisTesting:Pairedttest
Fisher's Exact Test
The chi-square test actually provides us with an approximation of the Fisher's Exact test. The Fisher's Exact test calculates the exact probability of obtaining the observed results or results that are more extreme. In short, it calculates the exact p-value so that charts are not needed to estimate the p-value. We must use the Fisher's Exact test when more than 20% of expected cell counts are less than 5. Note that in a classic 2x2 table a single cell is already 25% of the table, so if any one expected cell value is less than 5, we must use the Fisher's Exact test to test for significance of association. Otherwise, we can estimate the exact p-value using the chi-square test.
Overview
The Fisher's Exact test examines the significance of the association between two categorical variables arranged in a 2x2 table.
It is used to find a p-value in cases where the sample size is small (i.e. typically when n < 20) and more than 20% of the expected cell frequencies are less than 5.
Can be used with count data only and not rate data.
Formula:
$$p = \sum_{j=a}^{\min(n_1,\, m_1)} \frac{{}_{n_1}C_{j}\;{}_{n_0}C_{m_1-j}}{{}_{N}C_{m_1}}$$
Decision Rule: If α > p, there is sufficient evidence to reject the null hypothesis.
Where:
C is a combination.
Min instructs you to use the lesser of the two values.
The 2x2 table is labeled as below.
          E      Ē         Totals
D         j      m1 - j    m1
D̄         c      d         m0
Totals    n1     n0        N
Step-by-Step Example: Fisher's Exact Test
To complete your MPH thesis, you have been assigned to the Kenyatta National Hospital Intensive Care Unit (ICU) in Nairobi, Kenya. You are particularly interested in determining whether a gender difference exists in knowledge among health care workers of the ICU of hospital policies to reduce transmission of hospital-acquired infections. Summarizing the results of your brief survey, the following data are obtained:
Knowledge of policies to reduce transmission of nosocomial infections
Gender    Yes   No   Totals
Male      4     1    5
Female    1     3    4
Totals    5     4    9
Do these data, at the 0.05 alpha level, provide sufficient evidence to indicate a difference in knowledge of hospital policies to reduce transmission of nosocomial infections between men and women?
Step Example
1. State the null and alternative hypothesis.
$H_0$: There is no difference in knowledge of hospital policies between men and women.
$H_A$: There is a difference in knowledge of hospital policies between men and women.
2. State the decision rule.
If p < α, then we can reject the null hypothesis.
Step Example
3. Find the p-value using Fisher's Exact test.
$$p = \sum_{j=a}^{\min(n_1,\, m_1)} \frac{{}_{n_1}C_{j}\;{}_{n_0}C_{m_1-j}}{{}_{N}C_{m_1}}$$
$$p = \sum_{j=4}^{5} \frac{{}_{5}C_{j}\;{}_{4}C_{5-j}}{{}_{9}C_{5}} = \frac{{}_{5}C_{4}\;{}_{4}C_{1}}{{}_{9}C_{5}} + \frac{{}_{5}C_{5}\;{}_{4}C_{0}}{{}_{9}C_{5}}$$
$$= \frac{\dfrac{5!}{4!\,1!} \times \dfrac{4!}{1!\,3!}}{\dfrac{9!}{5!\,4!}} + \frac{\dfrac{5!}{5!\,0!} \times \dfrac{4!}{0!\,4!}}{\dfrac{9!}{5!\,4!}}$$
$$= \frac{5 \times 4}{\dfrac{9 \times 8 \times 7 \times 6}{4 \times 3 \times 2 \times 1}} + \frac{1 \times 1}{\dfrac{9 \times 8 \times 7 \times 6}{4 \times 3 \times 2 \times 1}} = \frac{20}{126} + \frac{1}{126} = \frac{21}{126} = 0.1667$$
Remember from previous math courses that 0! = 1.
4. State your conclusion.
The one-tailed p-value is 0.1667. Since the hypotheses that we are testing are nondirectional, we find the two-tailed p-value by multiplying the one-tailed p-value by 2.
The two-tailed p-value of 0.3333, also larger than the alpha of 0.05, does not suggest significant evidence to reject the null hypothesis.
There is insufficient evidence of a difference between men and women in knowledge of hospital policies.
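The hand calculation can be reproduced with Python's standard library. This is a minimal sketch of the summation formula, assuming the cell labels used in this section (a = 4, n1 = 5, n0 = 4, m1 = 5); the function name `fisher_one_tailed` is hypothetical.

```python
from math import comb

def fisher_one_tailed(a, n1, n0, m1):
    """Sum the table probabilities for j = a .. min(n1, m1).

    a  : observed count in the first cell
    n1 : first column total, n0 : second column total
    m1 : first row total
    """
    denominator = comb(n1 + n0, m1)  # N choose m1
    upper = min(n1, m1)
    numerator = sum(comb(n1, j) * comb(n0, m1 - j) for j in range(a, upper + 1))
    return numerator / denominator

one_tailed = fisher_one_tailed(a=4, n1=5, n0=4, m1=5)
two_tailed = min(1.0, 2 * one_tailed)  # doubling rule used in this example
print(round(one_tailed, 4), round(two_tailed, 4))
```

Both values are larger than the alpha of 0.05, so the conclusion is the same as in step 4.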
Practice: Fisher's Exact Test
You have a sample of 16 people that attended the weekly market. Of those 16, 11 have developed cholera. In trying to determine the source of the outbreak, you realize that all 16 ate salad at the market. Eight ate from one salad lady and the other eight had salad from a different vendor. Your count data are included in the table below:
               Salad 1   Salad 2   Total
Diseased       5         6         11
Nondiseased    3         2         5
Total          8         8         16
Do these data, at the 5% alpha level, indicate a difference in contraction of cholera based on the salad eaten?
Step PracticeSpace
1. State the null and alternative hypothesis.
2. State the decision rule.
3. Find the p-value using Fisher's Exact test.
$$p = \sum_{j=a}^{\min(n_1,\, m_1)} \frac{{}_{n_1}C_{j}\;{}_{n_0}C_{m_1-j}}{{}_{N}C_{m_1}}$$
Step PracticeSpace
4. State your conclusion.
Epi Info Example: Fisher's Exact Test
Using the same data that we used to complete the step-by-step example problem, we will now conduct a Fisher's Exact test using Epi Info.
Knowledge of policies to reduce transmission of nosocomial infections
Gender    Yes   No   Totals
Male      4     1    5
Female    1     3    4
Totals    5     4    9
Do these data, at the 0.05 alpha level, provide sufficient evidence to indicate a difference in knowledge of hospital policies to reduce transmission of nosocomial infections between men and women?
Since we have two categorical variables (i.e. gender: M/F and Knowledge: Y/N) and are interested in determining whether a difference or association exists, the appropriate statistical test to analyze our data is the chi-square test. Because the counts in the 2x2 table are low (less than 5), however, our preferred method for analysis is the Fisher's Exact test.
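The choice between the two tests can be checked by computing the expected cell counts. The sketch below assumes the rule stated earlier in this section (use Fisher's Exact test when more than 20% of expected counts are below 5); the helper name `expected_counts` is made up for illustration.

```python
def expected_counts(table):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

observed = [[4, 1],   # Male:   Yes, No
            [1, 3]]   # Female: Yes, No
expected = expected_counts(observed)

# Flatten the table and find the share of expected counts below 5.
cells = [e for row in expected for e in row]
share_below_5 = sum(e < 5 for e in cells) / len(cells)
use_fisher = share_below_5 > 0.20
print([round(e, 2) for e in cells], use_fisher)
```

Here every expected count is below 5, so the rule points to the Fisher's Exact test.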
Step Example
1. State the null and alternative hypotheses.
$H_0$: There is no difference between women and men in the knowledge of hospital policies.
$H_A$: A difference exists between women and men in the knowledge of hospital policies.
2. Determine the decision rule.
If p < α, reject the null hypothesis.
Step Example
3. Run StatCalc from the main Epi Info menu.
From the toolbar on the Epi Info menu screen, select Utilities.
Click on the first option, StatCalc.
a. Enter the data into the 2x2 table.
On the StatCalc menu, choose the 1st option: Tables (2x2, 2xn) by hitting Enter on your keyboard.
A 2x2 table will appear on the screen.
Enter the appropriate values for cells a, b, c and d (in that order) as they appear in the given table.
Hit the Enter key after entering each value.
Hit <F4> to calculate.
Step Example
b. Calculate the p-value.
After hitting <F4>, totals for the 2x2 table will be calculated and the following results will appear.
It is at this point that we would enter our second stratum if we had stratified data. Pressing <F2> will open a second 2x2 table for a stratified analysis.
The output gives you the results for a number of chi-square tests as well as 1-tailed and 2-tailed p-values for Fisher's Exact test.
Beneath these calculations, there is a warning which reads, "An expected cell value is less than 5. Fisher exact results recommended."
This tells us that the output we want to look at is the output for Fisher's Exact. Notice that the 1-tailed p-value is exactly what we calculated in the step-by-step example.
Step Example
4. State your conclusions.
Using the p-value results for a two-tailed test (p = 0.2063492) we would fail to reject the null hypothesis. Thus, at the 0.05 alpha level, we have insufficient evidence to indicate a difference in knowledge of hospital policies to reduce transmission of nosocomial infections between men and women.
Note that the two-tailed p-value given in Epi Info and the two-tailed p-value that we calculated by hand are different. Epi Info uses a different method of calculating the two-tailed Fisher's Exact p-value. Though Epi Info's calculations are less conservative, both methods are correct.
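The workbook does not say which two-tailed method Epi Info uses. One common convention adds up the probabilities of every possible table that is no more likely than the observed one, and for these data it reproduces the reported value. The sketch below illustrates that convention only; treating it as Epi Info's method is an assumption, and the function names are hypothetical.

```python
from math import comb

def table_probability(j, n1, n0, m1):
    """Probability of the table whose first cell equals j, with margins fixed."""
    return comb(n1, j) * comb(n0, m1 - j) / comb(n1 + n0, m1)

def fisher_two_tailed_sum(a, n1, n0, m1):
    """Add the probabilities of all tables no more likely than the observed one."""
    lowest = max(0, m1 - n0)   # smallest value the first cell can take
    highest = min(n1, m1)      # largest value the first cell can take
    p_observed = table_probability(a, n1, n0, m1)
    tolerance = 1e-12          # guards against floating-point ties
    return sum(
        table_probability(j, n1, n0, m1)
        for j in range(lowest, highest + 1)
        if table_probability(j, n1, n0, m1) <= p_observed + tolerance
    )

print(round(fisher_two_tailed_sum(a=4, n1=5, n0=4, m1=5), 7))
```

Under this convention the two-tailed value is 26/126, which is smaller than the doubled one-tailed value of 42/126 and therefore less conservative.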
Epi Info Practice: Fisher's Exact Test
Suppose the objective of a study was to determine whether active class participation in introductory epidemiology classes would increase class final approval rating at the end of the semester. To study the effect of student participation and approval of final course grade in epidemiology, students were interviewed in each of two separate introductory epidemiology classes. One class had active student participation; the other did not. Each selected student was asked whether he or she generally approved of his/her final course grade in epidemiology at the end of the semester. The results of the interviews are shown in the table:
                    Generally Approve   Do Not Approve
Participation       6                   2
No Participation    3                   10
Use Epi Info (or OpenEpi, if you prefer) to determine whether the differences in approval rating between classes are significant at the alpha = 0.05 level.
Note: You can also calculate a p-value using OpenEpi. In the table of contents, choose 2x2 table and enter your count data. Add or delete strata by clicking the appropriate button. When all data is entered, click on the button marked Calculate. Look at the output for the correct stratum for the Fisher's Exact p-value.
Step PracticeSpace
1. State the null and alternative hypotheses.
2. Determine the decision rule.
3. Calculate the p-value using the Fisher's Exact test.
4. State your conclusions.
Related Concepts
Chi-Square Test for Independence
Confidence Interval: Relative Risk
Chi-Square Test for Independence
In an epidemiologic study we might wish to determine if a person's risk of becoming ill with a certain disease is dependent on some other variable (e.g., living in an urban, suburban, or rural area) or if the risk of becoming ill is independent of that variable.
Overview
Description: Tests the hypothesis that two discrete variables are independent against the alternative that they are not independent. A chi-square test will tell us if there is a significant difference found among any of the independent variables, but will not specify where the difference lies.
Example: Is the risk of infant mortality (high versus low) independent or dependent on the length of time (< 6 months versus ≥ 6 months) that the mother exclusively breastfeeds?
Formula:
$$\chi^2 = \sum_{i=1}^{k} \left[ \frac{(O_i - E_i)^2}{E_i} \right]$$
Assumptions:
o At least 80% of the expected values in the distribution should be greater than 5
o None of the expected values in the distribution should be less than 1
o Observations are independent
Type of variables: Discrete
Decision rule: If the observed value of $\chi^2$ ($\chi^2_{obs}$) is greater than or equal to the critical value of $\chi^2$ ($\chi^2_{crit}$), then we can reject $H_0$.
Table used: Chi-square distribution
Where:
O = observed value
E = expected value
For observations i to k
If two variables were independent, we would expect that the distribution of illness would be similar for each possible outcome of the second variable. For example, if the risk of becoming ill was independent of where individuals lived (urban, suburban, and rural areas), we would expect that the risk of becoming ill (and, therefore, the distribution of ill vs. not ill) would be similar for each category.
A table which would allow us to display the actual numbers for these two variables, and then to calculate the chi-square statistic based on observed and expected values, is called a contingency or chi-square table.
                       Variable 1
Variable 2    Criteria a   Criteria b   Criteria c   Total
Criteria 1    n_1a         n_1b         n_1c         n_1k
Criteria 2    n_2a         n_2b         n_2c         n_2k
Total         n_ka         n_kb         n_kc         N
The comparison of the calculated (obtained) chi-square statistic to a critical chi-square value, derived using a specified level of significance (α) and degrees of freedom (df), allows us to make a statement regarding the independence of the two variables.
Step-by-Step Example: Chi-Square Test for Independence
We want to determine if the risk of becoming ill with a certain disease in a population is independent of location of residence among urban, suburban, and rural areas.
                  Residence
Illness    Urban   Suburban   Rural   Total
Ill        61      69         64      194
Not Ill    617     642        312     1571
Total      678     711        376     1765
Step Example
1. State the null and alternative hypotheses.
Null hypothesis ($H_0$): The risk of becoming ill is independent of the location of residence.
Alternative hypothesis ($H_A$): The risk of becoming ill is not independent of location of residence.
Step Example
2. State the decision rule.
With an alpha of 0.05,
If $\chi^2_{obs} \geq \chi^2_{crit}$, then reject $H_0$.
If $\chi^2_{obs} < \chi^2_{crit}$, then fail to reject $H_0$.
3. Calculate expected frequency for all cells in the chi-square table using:
$$\text{expected frequency} = \frac{\text{row total} \times \text{column total}}{\text{population total}}$$
In this example, the observed number of persons living in urban areas who became ill was 61.
a. The total for row (ill) is 194.
b. The total for column (urban) is 678.
c. The total number in the population is 1765.
Using the formula:
$$\frac{194 \times 678}{1765} = \frac{131532}{1765} = 74.52$$
So, while the observed value was 61, the expected value was 74.52.
Continue the calculations for all cells in the table.
Obs [Exp]   Urban          Suburban       Rural
Ill         61 [74.52]     69 [78.15]     64 [41.33]
Not Ill     617 [603.48]   642 [632.85]   312 [334.67]
4. Calculate the chi-square value (obtained chi-square), using the observed ($O_i$) and expected values ($E_i$).
$$\chi^2 = \sum_{i=1}^{k} \left[ \frac{(O_i - E_i)^2}{E_i} \right]$$
$$\chi^2 = \frac{(61 - 74.52)^2}{74.52} + \frac{(69 - 78.15)^2}{78.15} + \frac{(64 - 41.33)^2}{41.33} + \frac{(617 - 603.48)^2}{603.48} + \frac{(642 - 632.85)^2}{632.85} + \frac{(312 - 334.67)^2}{334.67}$$
$$= 2.45 + 1.07 + 12.43 + 0.30 + 0.13 + 1.54 = 17.92$$
Step Example
5. Find the critical chi-square value.
Finding the critical chi-square value first requires calculating the degrees of freedom (df). For chi-square, the degrees of freedom are equal to:
df = (# rows - 1)(# columns - 1)
= (2 - 1)(3 - 1) = 2 degrees of freedom
Using the Chi-Square Distribution table, with a level of significance (α) of 0.05 and 2 df, the critical chi-square value equals 5.991.
6. Apply the decision rule.
In this example, the observed chi-square value is greater than the critical chi-square value.
17.92 (obtained $\chi^2$) > 5.991 (critical $\chi^2$)
Therefore, we reject $H_0$.
7. State the practical conclusion.
Illness is not independent of the location of residence.
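Steps 3 through 6 can be reproduced with a short script. This is a sketch under stated assumptions: it uses unrounded expected values, so the statistic differs slightly from the hand calculation, and the shortcut for the critical value (-2 × ln α) holds only when df = 2. The function name is made up for illustration.

```python
from math import log

def chi_square_statistic(table):
    """Return (chi-square, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency: row total * column total / population total.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

observed = [[61, 69, 64],      # Ill:     Urban, Suburban, Rural
            [617, 642, 312]]   # Not ill: Urban, Suburban, Rural
chi2, df = chi_square_statistic(observed)

# For df = 2 only, the critical value has the closed form -2 * ln(alpha).
critical = -2 * log(0.05)
print(round(chi2, 2), df, round(critical, 3), chi2 >= critical)
```

For other degrees of freedom, look up the critical value in the Chi-Square Distribution table as described in step 5.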
Practice: Chi-Square Test for Independence
In a cross-sectional survey administered to a random sample of 100 attendees of a local health fair, the following 2x2 table was constructed after reviewing the data:
                  Current Smoker
Diabetes    Yes   No   Totals
Yes         50    25   75
No          20    5    25
Totals      70    30   100
We want to determine, using a level of significance (α) of 0.05, if the risk of having diabetes in the surveyed population is related to smoking.
Step Practice Space
1. State the null and alternative hypotheses.
Step PracticeSpace
2. State the decision rule.
3. Calculate expected frequency for all cells in the chi-square table using:
$$\text{expected frequency} = \frac{\text{row total} \times \text{column total}}{\text{population total}}$$
4. Calculate the chi-square value (obtained chi-square), using the observed ($O_i$) and expected values ($E_i$).
$$\chi^2 = \sum_{i=1}^{k} \left[ \frac{(O_i - E_i)^2}{E_i} \right]$$
5. Find the critical chi-square value. Use the Chi-Square Distribution table in the Appendices.
6. Apply the decision rule.
7. State the practical conclusion.
Epi Info Example: Chi-Square Test for Independence
Using the example described on page 156, we will now calculate the risk of becoming ill based on location of residence using Epi Info.
Step Example
1. State the null and alternative hypotheses.
Null hypothesis ($H_0$): the risk of becoming ill is independent of the location of residence.
Alternative hypothesis ($H_A$): the risk of becoming ill is not independent of location of residence.
2. State the decision rule.
α = 0.05
If α > p, we can reject the null hypothesis.
3. Perform the Chi-Square test using Epi Info.
READ the dataset.
Select Chi_Square from the database Bios_Workbook_Examples.
Select the TABLES command.
Choose Residence from the drop-down box as the Exposure Variable.
In this case, Ill is the Outcome Variable of interest.
Click OK.
The output should look like the one that follows:
Step Example
4. Apply the decision rule.
Because α > p, we reject the null hypothesis. There is significant evidence suggesting that illness is not independent of residence.
5. Interpret the results.
In the results, look for the Chi-square, the degrees of freedom (df), and the probability (equivalent to the p-value). Compare the chi-square result (obtained) from Epi Info with the critical chi-square value from the table in the appendix using the df and whatever desired level of significance you wish to use. Epi Info should report a Chi-square of 17.93 with 2 df. If you compare 17.93 to the value found in calculating the problem by hand, 17.92, we find that the two values are almost the same. The difference may be due to rounding in our hand calculations.
By default, Epi Info calculates a probability based on the 0.05 (α) level of significance. Therefore, you will come to the same conclusion in your decision statement as you would using the chi-square table. Epi Info reports a p-value of 0.0001, which is significant when α = 0.05.
This is the same response that we found through hand calculations.
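The reported p-value can be checked by hand for this particular example. With exactly 2 degrees of freedom the upper-tail probability of a chi-square statistic has the closed form $e^{-\chi^2/2}$; the sketch below relies on that special case and is not a general chi-square p-value routine. The function name is hypothetical.

```python
from math import exp

def chi_square_p_value_df2(chi2):
    """Upper-tail probability of a chi-square statistic with exactly 2 df."""
    return exp(-chi2 / 2)

p_value = chi_square_p_value_df2(17.93)  # statistic reported in the example
alpha = 0.05
decision = "reject H0" if alpha > p_value else "fail to reject H0"
print(f"p = {p_value:.4f}: {decision}")
```

Rounded to four decimal places this agrees with the p-value of 0.0001 in the example.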
Epi Info Practice: Chi-Square Test for Independence
You would like to find out if the number of employees that a company has impacts whether HIV/AIDS education is offered to them within the company. You survey 71 companies, asking each about the number of employees they have and whether they offer HIV/AIDS education to their employees. Use the dataset AIDSEducation in Bios_Workbook_Examples.mdb to conduct a chi-square test of independence in Epi Info. α = 0.05
Step Practice Space
1. State the null and alternative hypotheses.
2. State the decision rule.
3. Use Epi Info to perform the chi-square test for independence.
4. Apply the decision rule.
5. State the practical conclusion.
Confidence Intervals for Case-Control and Cohort Studies
Instead of being based around a mean, as a confidence interval for a descriptive study is, confidence intervals for analytic studies are based on a measure of association. A measure of association quantifies the strength of any association found between the exposure and the outcome variable of interest. The measure of association most commonly used in a cohort study is the relative risk (RR), also called the risk ratio. The odds ratio (OR) is used as the measure of association in case-control studies. Cross-sectional studies use a prevalence ratio or a prevalence odds ratio.
In the following section, we will explore one way of using both the RR and the OR to make inferences based on data. Because both the RR and the OR are ratios, or proportions, the null hypothesis differs from the null used when looking at differences.
Ratios can be evaluated by using the following criteria:
RR = 1.0 indicates identical risk in both groups
RR > 1.0 indicates an increased risk for the exposed group compared to the unexposed group
RR < 1.0 indicates a decreased risk for the exposed group compared to the unexposed group
One method to make inferences about ratios calculated from analytic studies is to calculate confidence intervals around the ratios.
Confidence Intervals: Odds Ratios and Relative Risks
CIs are constructed based on the assumption that the point estimate is normally distributed. Since point estimates of the OR and RR have sampling distributions that are heavily skewed towards the right (that is, these estimates can never be less than zero, but they can theoretically assume any positive value up to infinity), the assumption of underlying normality is no longer met. However, as the sampling distributions of the natural logarithms (ln) of both the OR and RR are approximately normally distributed, more accurate CIs are constructed by working in the world of natural logarithms.
Overview
Confidence intervals (CIs) are constructed to quantify the amount of variability (precision) around the point estimate (i.e. the computed sample OR or RR).
They provide us with a range of plausible values for the population OR or RR.
Formula:
95% CI for OR = $e^{\ln(OR) \,\pm\, 1.96\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}}$
95% CI for RR = $e^{\ln(RR) \,\pm\, 1.96\sqrt{\frac{b}{a(a+b)}+\frac{d}{c(c+d)}}}$
Assumptions: The point estimate is normally distributed.
Decision rule: If the 95% CI does not include the null value (i.e. an OR or RR equal to 1), the null hypothesis is rejected and the results are considered statistically significant.
Where:
ln = natural logarithm
e = antilogarithm
a = number exposed and diseased
b = number exposed but not diseased
c = number not exposed but diseased
d = number neither exposed nor diseased
Before introducing the formulas used to construct CIs for the OR and RR, let's revisit the general formula we have been using to construct CIs:
CI = (point estimate) ± (reliability coefficient) × (standard error)
Rather than using the computed sample OR or RR as the point estimate in our examples, we will instead use the natural logarithm of the OR or RR (written as ln (OR) or ln (RR)). After taking the antilogarithm of the calculated numbers, we will then have our desired CI, allowing us to make a statement concerning statistical significance.
Step-by-Step Example: Confidence Interval for Odds Ratio
Suppose we have the following data taken from a hypothetical case-control study examining risk factors for obesity among 9th graders of a particular high school in Brooklyn, NY. Of particular interest to researchers is a family history of eating an unbalanced diet. The table below presents the number of cases (obese teenagers) and controls (nonobese teenagers) reporting a family history of eating an unbalanced diet.
                    Obesity Status
Family History      Cases   Controls   Totals
Unbalanced Diet     110     165        275
Balanced Diet       125     300        425
Totals              235     465        700
Since case-control studies do not provide information on the incidence of disease in the study population, our best measure of association to quantify the relationship between exposure and disease is the odds ratio (OR), which serves as an approximation of the relative risk (RR). As you recall, the OR is computed as
$$OR = \frac{a \times d}{b \times c}$$
where a, b, c and d are as defined previously.
We compute the OR for our data as:
$$OR = \frac{110 \times 300}{165 \times 125} = 1.60$$
We see that the odds of having had a family history of eating an unbalanced diet are 60% higher among obese teenagers (cases) than among nonobese teens (controls).
Is this result statistically significant? One way to answer this question is by constructing and interpreting a confidence interval for the OR.
Step Example
1. State the null and alternative hypotheses and determine the decision rule.
$H_0$: OR = 1
$H_A$: OR ≠ 1
If the calculated 95% confidence interval does not include the value of one, then there is sufficient evidence to reject the null hypothesis.
2. Calculate the point estimate.
In this example, the point estimate is the natural logarithm of the OR (ln OR), found by using your calculator:
ln 1.60 = 0.4700
3. Determine the reliability coefficient.
Since we are interested in a 95% CI for the OR, the reliability coefficient is 1.96, as you learned earlier. Most epidemiologists use 95% CIs, although 90% and 99% CIs are also used.
4. Calculate the standard error.
standard error (ln OR) = $\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}$
standard error (ln OR) = $\sqrt{\frac{1}{110}+\frac{1}{165}+\frac{1}{125}+\frac{1}{300}}$ = 0.1627
5. Put it all together to compute CI for the ln (OR).
point estimate ± 1.96 × standard error
A 95% CI for the ln (OR) = 0.47000363 ± 1.96 × 0.16274166
= (0.15102998, 0.78897728)
6. Take the antilogarithm to get back to the actual OR.
Using your calculator, find the antilog of the calculated numbers:
$(e^{0.15102998},\; e^{0.78897728})$ = (1.16, 2.20)
Step Example
7. Apply the decision rule.
If the 95% CI does not include the null value of 1, the null hypothesis (i.e. OR = 1) is rejected and the results are considered statistically significant.
If the 95% CI does include the null value of 1, the null hypothesis is not rejected and the results are not statistically significant.
Since our CI excludes the null value of 1, we say that the OR is not likely to be equal to 1 and, therefore, our results are statistically significant.
There is a statistically significant relationship between obesity status and presence of a family history of eating an unbalanced diet.
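The seven steps can be collected into one short function. The sketch below assumes the cell labels a, b, c and d defined in the Overview and a reliability coefficient of 1.96; the function name `odds_ratio_ci` is made up for illustration.

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Return (OR, lower, upper) using the interval built on the log scale.

    a, b: exposed cases and exposed controls
    c, d: unexposed cases and unexposed controls
    """
    odds_ratio = (a * d) / (b * c)
    se_ln_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Build the interval around ln(OR), then take the antilogarithm.
    lower = exp(log(odds_ratio) - z * se_ln_or)
    upper = exp(log(odds_ratio) + z * se_ln_or)
    return odds_ratio, lower, upper

or_estimate, lower, upper = odds_ratio_ci(110, 165, 125, 300)
print(f"OR = {or_estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
print("Excludes 1:", not (lower <= 1 <= upper))
```

Changing `z` to 1.645 or 2.576 would give the 90% or 99% intervals mentioned in step 3.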
Note on interpreting CIs:
In addition to stating whether your results are statistically significant, it is always good practice to provide the reader with a practical interpretation of the CI. So, how do we interpret this CI? A useful statistical interpretation is that we are 95% confident that the true OR is between 1.16 and 2.20 because, in repeated sampling, about 95% of the intervals constructed in the manner of the present single interval would include the true OR.
Providing a more practical interpretation of this CI, we can also say that on the basis of these results we would expect, with 95% confidence, to find somewhere between a 1.16 and 2.20 times greater likelihood among obese teenagers (cases) of having a family history of eating an unbalanced diet compared to nonobese teenagers (controls).
Practice: Confidence Interval for Odds Ratio
You are trying to determine if breastfeeding has an impact on future development of asthma in a child. Use the following data derived from a case-control study to calculate the 95% confidence interval for the odds ratio.
                   Asthma Status
Feeding History    Cases   Controls   Totals
Formula Fed        250     150        400
Breastfed          100     300        400
Totals             350     450        800
Step Practice Space
1. State the null and alternative hypotheses and determine the decision rule.
2. Calculate the point estimate.
3. Determine the reliability coefficient.
4. Calculate the standard error.
standard error (ln OR) = $\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}$
5. Put it all together to compute CI for the ln (OR).
point estimate ± 1.96 × standard error
Step PracticeSpace
6. Take the antilogarithm to get back to the actual OR.
7. Apply the decision rule.
If the 95% CI does not include the null value of 1, the null hypothesis (i.e. OR = 1) is rejected and the results are considered statistically significant.
If the 95% CI does include the null value of 1, the null hypothesis is not rejected and the results are not statistically significant.
Epi Info Example: Confidence Interval for Odds Ratio
Epi Info is able to simplify these calculations quite a bit. To demonstrate this, we will calculate the confidence interval for the odds ratio of obesity and an unbalanced diet using the example on page 168. Note that in this dataset, those that are cases and considered obese will show up as Yes and those that are controls and considered nonobese will be No.
Step Example
1. State the null and alternative hypotheses and determine the decision rule.
$H_0$: OR = 1
$H_A$: OR ≠ 1
If the calculated 95% confidence interval does not include the value of one, then there is sufficient evidence to reject the null hypothesis.
2. READ the dataset into Epi Info.
Click on the READ command and open the dataset, Odds_Ratio.
3. Use TABLES to find the odds ratio and corresponding 95% confidence interval.
Click on the TABLES command.
Choose Unbalanced_Diet as the Exposure Variable and Group as the Outcome Variable.
Click OK.
Step Example
4. Interpret the results.
In the Epi Info output, notice that the table looks like the 2x2 table that we saw in the step-by-step example.
From the output, we look at the cross product odds ratio and determine that it is 1.60. The corresponding 95% confidence interval has a lower limit of 1.16 and an upper limit of 2.20. Since the CI excludes the value of 1, we can say that the OR is statistically significant.
With these easy steps, we have received the same results as when we did the calculations by hand. The odds ratio of 1.60 tells us that obese teenagers (cases) are 1.60 times as likely as nonobese teenagers (controls) to have a family history of an unbalanced diet. Since the corresponding confidence interval does not contain the value of one, we can conclude that there is a statistically significant relationship between obesity in teens and a family history of an unbalanced diet. We can expect, with 95% confidence, between a 1.16 and 2.20 times greater chance that obese teenagers will have grown up in a family without a balanced diet as compared to nonobese teenagers.
Epi Info Practice: Confidence Interval for Odds Ratio
You have conducted a study to determine if multi-micronutrient supplements taken during pregnancy increase a newborn's weight at birth. Use the dataset Birthweight to create a confidence interval for the odds ratio in Epi Info. Birth weights have been grouped into two categories: below 2500 grams and above 2500 grams. Use these categories to find the confidence interval.
Step Practice Space
1. State the null and alternative hypotheses and determine the decision rule.
2. Divide Weight_at_birth into two groups.
3. Use TABLES to find the odds ratio and corresponding 95% confidence interval.
4. Interpret the results.
Step-by-Step Example: Confidence Interval for Relative Risk
Recall that the RR is a measure of association used to compare the incidence of disease among the exposed to that of the unexposed, and is the main measure of association used in study designs where we are able to get incidence data directly, as in the cohort and experimental study designs.
Suppose we have the following information presented in the table below depicting results from a hypothetical prospective cohort study of 2000 students at the University of Zambia comparing the mortality rate following a recent diagnosis of tuberculosis among 1000 subjects receiving a standard treatment to the mortality rate among 1000 subjects receiving a new treatment. The 2x2 table is displayed here:
                  Death from heart attack
Treatment Type    Yes    No    Totals
Standard          750    250   1000
New               600    400   1000
Totals            1350   650   2000
As you recall, the RR is computed as
$$RR = \frac{a / (a + b)}{c / (c + d)}$$
where a, b, c, and d are as defined previously.
We compute the RR for our data as:
$$RR = \frac{a / (a + b)}{c / (c + d)} = \frac{750 / 1000}{600 / 1000} = 1.25$$
These data indicate that the risk of death from a heart attack among those who were treated with the standard therapy is 25% higher than that among those who received the new treatment.
So, is this RR statistically significant?
Step Example
1. State the null and alternative hypotheses and determine the decision rule.
$H_0$: RR = 1
$H_A$: RR ≠ 1
If the calculated 95% confidence interval does not include the value of one, then there is sufficient evidence to reject the null hypothesis.
2. Calculate the point estimate.
In this example, the point estimate is the natural logarithm of the RR (ln RR), found by using your calculator:
ln 1.25 = 0.22314355
3. Determine the reliability coefficient.
Once again, as we are interested in a 95% CI for the RR, our reliability coefficient remains 1.96.
4. Calculate the standard error.
standard error (ln RR) = $\sqrt{\frac{b}{a(a+b)}+\frac{d}{c(c+d)}}$
standard error (ln RR) = $\sqrt{\frac{250}{750(1000)}+\frac{400}{600(1000)}}$ = 0.03162272
5. Put it all together to compute CI for the ln (RR).
point estimate ± 1.96 × standard error
A 95% CI for the ln (RR) = 0.22314355 ± 1.96 × 0.03162272
= 0.22314355 ± 0.06198054
= (0.16116301, 0.28512409)
6. Take the antilogarithm to get back to the actual RR.
Using your calculator, find the antilog of the calculated numbers:
$(e^{0.16116301},\; e^{0.28512409})$ = (1.17, 1.33)
Step Example
7. Apply the decision rule.
If the 95% CI does not include the null value of 1, the null hypothesis (i.e. RR = 1) is rejected and the results are considered statistically significant.
If the 95% CI does include the null value of 1, the null hypothesis is not rejected and the results are not statistically significant.
Since our CI excludes the null value of 1, we say 1 is not a candidate for the RR and, therefore, our results are statistically significant.
There is a statistically significant relationship between treatment type and death from a heart attack. We are 95% confident that there is anywhere between a 17 to 33 percent increased risk of death from a heart attack among those who were treated with the standard method of therapy versus those who received the new therapy.
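The same calculation for the relative risk can be written as a short function. As before, this is a sketch assuming the cell labels a, b, c and d from the Overview and a reliability coefficient of 1.96; the function name `relative_risk_ci` is hypothetical.

```python
from math import exp, log, sqrt

def relative_risk_ci(a, b, c, d, z=1.96):
    """Return (RR, lower, upper) using the interval built on the log scale.

    a, b: exposed with and without the outcome
    c, d: unexposed with and without the outcome
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed
    se_ln_rr = sqrt(b / (a * (a + b)) + d / (c * (c + d)))
    # Build the interval around ln(RR), then take the antilogarithm.
    lower = exp(log(rr) - z * se_ln_rr)
    upper = exp(log(rr) + z * se_ln_rr)
    return rr, lower, upper

rr, lower, upper = relative_risk_ci(750, 250, 600, 400)
print(f"RR = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
print("Excludes 1:", not (lower <= 1 <= upper))
```

Only the standard error formula differs from the odds ratio calculation; the log-scale construction and the decision rule are the same.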
Practice: Confidence Interval for Relative Risk
Using the asthma study, determine if children who were breastfed versus formula fed are less likely to develop asthma by calculating the 95% confidence interval of the relative risk.
                   Asthma Status
Feeding History    Cases   Controls   Totals
Formula Fed        250     150        400
Breastfed          100     300        400
Totals             350     450        800
Step Practice Space
1. State the null and alternative hypotheses and determine the decision rule.
Step PracticeSpace
2. Calculate the point estimate.
$$RR = \frac{a / (a + b)}{c / (c + d)}$$
3. Determine the reliability coefficient.
4. Calculate the standard error.
standard error (ln RR) = $\sqrt{\frac{b}{a(a+b)}+\frac{d}{c(c+d)}}$
5. Put it all together to compute CI for the ln (RR).
point estimate ± 1.96 × standard error
6. Take the antilogarithm to get back to the actual RR.
Step PracticeSpace
7. Apply the decision rule.
If the 95% CI does not include the null value of 1, the null hypothesis (i.e. RR = 1) is rejected and the results are considered statistically significant.
If the 95% CI does include the null value of 1, the null hypothesis is not rejected and the results are not statistically significant.
Epi Info Example: Confidence Interval for Relative Risk
We will now use Epi Info to simplify our calculations and determine the relative risk of dying from a heart attack after having received the standard or the new treatment. Refer to page 173 for more background.
Step Example
1. State the null and alternative hypotheses and determine the decision rule.
$H_0$: RR = 1
$H_A$: RR ≠ 1
If the calculated 95% confidence interval does not include the value of one, then there is sufficient evidence to reject the null hypothesis.
2. READ the dataset.
Click on the READ command and choose New_Treatment.mdb.
3. Use TABLES to find the relative risk and corresponding 95% confidence interval.
Click on the TABLES command.
Choose StandardTreatment as the Exposure Variable and Death as the Outcome Variable.
Click on OK.
Step Example
4. Interpret the results.
Note the table given by Epi Info looks like the 2x2 table given in the step-by-step example.
The Epi Info output has given us a relative risk of 1.25. The corresponding 95% confidence interval has a lower limit of 1.17 and an upper limit of 1.33. From this we determine that those who were treated with the standard therapy have a 25% greater risk of dying from a heart attack than those who received the new therapy. We can also conclude that this risk is statistically significant since our confidence interval does not contain the value of one. (A relative risk of one would mean that there is no greater or lesser risk in either group.) We are 95% confident that there is between a 17% and 33% increase in the risk of death from a heart attack among those who were treated with the standard therapy versus those who received the new therapy. These results agree with the results we found in our step-by-step example.
Epi Info Practice: Confidence Interval for Relative Risk
Use the dataset Birthweight to determine if a longer interval after a woman's last live birth impacts the birth weight of the next infant. Remember that you will have to divide the weight at birth into two groups: 2500 grams and above, and below 2500 grams. Hint: Use the categorized birth interval provided for you. Intervals are grouped into those longer than 24 months and those 24 months or less.
Step   Practice Space
1. State the null and alternative hypotheses and determine the decision rule.
2. Group the variables.
3. Use TABLES to find the relative risk and corresponding 95% confidence interval.
4. Interpret the results.
RelatedConcepts
Sample Size
Planning your sample size is very important to the design of analytic and descriptive studies. Sample size should be considered early in study planning. A study with a sample size that is larger than needed wastes resources, but a sample size that is too small can also waste resources if the result is uninformative and study objectives are not reached. Sample size is integral to the statistical power of a study, which allows the researcher to evaluate the role of chance in explaining the study findings. A study with inadequate sample size, and thus inadequate power, does not allow one to determine whether a lack of statistical significance is due to no association between exposure and disease or to the lack of power.
The method of determining sample size depends on several factors:
1. Variability in the target population. If the variability is unknown, we must assume maximum variability in the population.
2. Desired precision in the estimate
3. Desired confidence in the estimate
4. Feasibility
The target population is the group of people about whom you want to make an estimate.
The sample is the part of the target population you measure to make an estimate about the entire target population.
It is often not possible to examine an entire population. Sampling allows you to examine a proportion of the population and to use the result of the sample to estimate the proportion of the population with the characteristic for which you are looking. In order for the sample population proportion to accurately reflect the target population proportion, every member of the target population must have an equal (or at least known) probability of being in the sample. The sample size usually does not change with the size of the target population. Finite population correction adjusts the size of the sample needed when there is a small target population (RightSize).
Remember that power is the probability of correctly rejecting the null hypothesis when it is indeed false. It is indicated by 1 − β, where β (beta) is the probability of failing to reject the null hypothesis when it is false (or the probability of making a Type II error).
Sample Size for Descriptive Studies
Simple random samples
If a sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected, it is a simple random sample.
Steps in determining sample size for a descriptive study (mean, proportion) drawn from a simple random sample
1. Estimate the proportion (p) of the subjects with the outcome or the standard deviation (s) of the estimated mean
2. Specify the desired precision or the width of confidence interval (d)
3. Specify the confidence level (z)
Overview
Two random sampling methods that provide equal probability of selection from the target population are simple random sampling and cluster surveys (Daniels, p. 7).
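Formulas:
Simple random sample size (mean)
\[ n = \frac{z^2 s^2}{d^2} \]
Proportion
\[ n = \frac{z^2 p(1-p)}{d^2} \]
Assumptions: Sample is independent and randomly selected.
Where:
z = standard normal value for the chosen confidence level, which reflects the accepted risk of Type I error (1.96 for 95% confidence)
s = standard deviation
p = expected prevalence
q = 1 − p
d = absolute precision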
Simple Random Sample Size Based on Estimating the Mean
If d represents the minimum detectable difference, s represents the standard deviation, and z represents the 1 − α/2 percentile of a standard normal distribution, then the appropriate sample size would be
\[ n = \frac{z^2 s^2}{d^2} \]
Specifying the values for the desired precision and confidence level is easy and simply depends on the needs of the investigator. Estimating the standard deviation requires some knowledge of the population variance, which is generally unknown. The options are to
1) Draw a sample from the target population and use the variance from this sample to estimate the population variance (s^2),
2) Look at previous studies on similar populations for the estimate, or
3) If the target population is thought to be normally distributed, you can use the fact that the range r (highest to lowest value) is approximately equal to 6 standard deviations, thus s = r/6, but you would still need to know something about the population values.
Step-by-Step Example: Simple Random Sample Size - Mean
You visit a clinic that desires to measure the mean weight of infants born there. The estimated standard deviation of weight is 400 g. Calculate the sample size required to ensure a confidence level of 95% (z = 1.96) and a precision within 100 g.
Step   Example
1. Define the terms.
s = 400
z = 1.96
d = 100
2. Calculate the simple random sample size.
\[ n = \frac{z^2 s^2}{d^2} = \frac{1.96^2 \times 400^2}{100^2} = \frac{3.8416 \times 160000}{10000} = 61.47 \]
3. State the results.
The calculated sample size is 61.47, so we need 62 infants. Because we cannot include a fraction of a person in our sample, we round up to the next whole number.
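The calculation above can be checked with a few lines of Python. This is a minimal sketch; the function name `sample_size_mean` is hypothetical and the inputs are the values from the infant weight example.

```python
import math

def sample_size_mean(s, d, z=1.96):
    """Simple random sample size for estimating a mean: n = z^2 s^2 / d^2.

    Returns the raw value and the value rounded up to a whole subject.
    """
    n = (z ** 2) * (s ** 2) / (d ** 2)
    return n, math.ceil(n)

# Infant weight example: s = 400 g, d = 100 g, 95% confidence (z = 1.96).
n_raw, n_needed = sample_size_mean(s=400, d=100)
print(f"n = {n_raw:.2f}, rounded up to {n_needed}")
```

Because d is squared in the denominator, halving the desired precision (for example, from 100 g to 50 g) quadruples the required sample size.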
Practice: Simple Random Sample Size - Mean
You visit a clinic that desires to measure the mean hemoglobin (Hgb) of a sample of pregnant women in ANC. The estimated standard deviation of Hgb is 1.8. Calculate the sample size required to ensure a confidence level of 95% (z = 1.96) and a precision within 0.3.
Step   Practice Space
1. Define the terms.
2. Calculate the simple random sample size.
\[ n = \frac{z^2 s^2}{d^2} \]
3. State the results.
Most statistical computer programs do not allow calculation of the sample size based on the mean, but calculate sample size based on proportions or prevalence in a population.
Simple Random Sample Size Based on Estimating the Proportion
P is the estimated proportion in the population. You can estimate P from a pilot sample, from other research, or from information that allows the investigator to estimate the upper bound for the proportion. Finally, if there is no information, then use the estimate of 50% (p = 0.5), which will yield the maximum value for n.
\[ n = \frac{z^2 p(1-p)}{d^2} \]
Step-by-Step Example: Simple Random Sample Size - Proportion
You would like to determine coverage of the measles vaccine in children under five in a country of 5 million people. Calculate the sample size needed if the proportion of vaccinated children under five has previously been found to be 8%. You would like to ensure a confidence level of 95% (z = 1.96) and a precision of 5% (d = 0.05).
Step   Example
1. Define the terms.
p = 0.08
z = 1.96
d = 0.05
2. Calculate the simple random sample size.
\[ n = \frac{z^2 p(1-p)}{d^2} \]
\[ n = \frac{1.96^2 \times 0.08 \times (1 - 0.08)}{0.05^2} = \frac{0.2827}{0.0025} = 113.1 \]
3. State the results. The sample size is 113.
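The proportion formula can be sketched in Python in the same way. The function name `sample_size_proportion` is hypothetical. Note that the raw value is 113.1: the workbook reports 113, while rounding up to the next whole number, as in the mean example, gives 114, the figure OpenEpi reports later in this section.

```python
import math

def sample_size_proportion(p, d, z=1.96):
    """Simple random sample size for estimating a proportion: n = z^2 p(1-p) / d^2."""
    return (z ** 2) * p * (1 - p) / (d ** 2)

# Measles coverage example: p = 0.08, d = 0.05, 95% confidence.
n_raw = sample_size_proportion(p=0.08, d=0.05)

# With no prior information, p = 0.5 gives the largest possible n.
n_worst_case = sample_size_proportion(p=0.5, d=0.05)

print(f"n = {n_raw:.1f} (rounded up: {math.ceil(n_raw)})")
print(f"worst case with p = 0.5: {n_worst_case:.2f}")
```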
Practice: Simple Random Sample Size - Proportion
Calculate the necessary sample size to find the prevalence of low weight for height in a community with a population of about 300,000. Literature leads you to estimate the population prevalence to be somewhere around 18%. You want 5% precision.
Note: The above formulas for sample size assume that you are using random sampling and the sampled population is large enough that you do not need the finite population correction. Finite population correction adjusts the sample size for situations in which the target population is small enough that the sample population is more than 5% of the target population. The formula is:
\[ n = \frac{N z^2 pq}{d^2 (N-1) + z^2 pq} \]
Both RightSize and Epi Info StatCalc have a field to include the size of the target population so that this may be taken into account when calculating the sample size.
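As a rough illustration of the finite population correction, the sketch below applies the formula above. The function name `sample_size_proportion_fpc` and the small target population of 1,000 are hypothetical; p, d, and z are taken from the measles example.

```python
def sample_size_proportion_fpc(N, p, d, z=1.96):
    """Sample size for a proportion with finite population correction.

    n = N z^2 pq / (d^2 (N - 1) + z^2 pq), where q = 1 - p.
    """
    pq = p * (1 - p)
    return (N * z ** 2 * pq) / (d ** 2 * (N - 1) + z ** 2 * pq)

# Hypothetical small target population: the correction lowers n noticeably.
n_small = sample_size_proportion_fpc(N=1_000, p=0.08, d=0.05)

# Country of 5 million (measles example): the correction is negligible.
n_large = sample_size_proportion_fpc(N=5_000_000, p=0.08, d=0.05)

print(f"N = 1,000:     n = {n_small:.1f}")
print(f"N = 5,000,000: n = {n_large:.1f}")
```

For the large population the result is essentially the uncorrected 113.1, which is why the correction can be ignored unless the sample is a sizeable fraction of the target population.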
Step   Practice Space
1. Define the terms.
2. Calculate the simple random sample size.
\[ n = \frac{z^2 p(1-p)}{d^2} \]
3. State the results.
Epi Info Example: Simple Random Sample Size - Proportion
We will use the step-by-step example from page 185 to calculate the sample size in Epi Info's StatCalc.
Step   Example
1. Open StatCalc from the Epi Info toolbar.
StatCalc is found under Utilities.
Note that StatCalc will only allow you to calculate the sample size based on proportions of simple random samples.
2. Choose Sample Size and Power from the menu.
Choose Population Survey from the subheadings.
3. Enter the correct values into the program.
Because the population we plan to sample from is large, we can leave the default for population size. We know the estimated proportion to be 8%. Since we want a precision of 5%, our worst acceptable result is 8% − 5%, or 3%.
4. Press <F4> to calculate the sample size.
Since we have chosen a 95% confidence level, we look at the sample size in the row that corresponds to 95% confidence.
5. State the results. Our sample size for a simple random sample is 113.
OpenEpi Example: Simple Random Sample Size - Proportion
Once again, we will use the step-by-step example from page 185 to calculate the sample size. However, this time we will demonstrate the use of OpenEpi.
Step   Example
1. Choose Proportion (under Sample Size) from the menu.
The screen will read, "Sample Size for a Proportion or Descriptive Study."
2. Click on the button, Enter New Data.
Enter the population size, the anticipated frequency (prevalence), and the precision. The design effect, which we will discuss in Cluster Sample Surveys, remains 1.
3. Calculate the sample size by clicking Calculate.
Since we have chosen a 95% confidence level, we look at the sample size in the row that corresponds to 95% confidence.
4. State the results. The calculated sample size is 114.
Practice: Simple Random Sample Size - Proportion
Calculate the necessary sample size to find the prevalence of low weight for height in a community of 3 million. Literature leads you to estimate the population prevalence to be somewhere around 18%. You want 5% precision. You may use either Epi Info or OpenEpi to complete this exercise.
1. Open the appropriate program.
2. Enter the correct data.
3. Calculate the sample size.
Step   Practice Space
4. State the results.
Cluster survey samples
In general, people do not live evenly distributed. They tend to cluster in towns and villages. In a typical survey, most of the time you spend in the field is traveling from subject to subject, not doing interviews. Cluster surveys dramatically decrease the cost in time and transport because, although interviewers have some travel cost between clusters, they travel only short distances within a cluster. Although you need to interview more persons in a cluster sample than in a simple random sample, it is often more practical and less expensive to do a cluster survey.
Calculating the sample size needed for a cluster survey can be difficult because most people do not have a good idea of how to estimate things like design effect or rate of homogeneity, which are needed to calculate the sample size for a cluster survey.
As with simple random samples, cluster survey analysis is based on the assumption that each member of the target population has an equal (or known) chance of being in the sample and that there is minimal bias and error. If this is not the case, the estimates will not be accurate.
Design Effect
In both a cluster survey and a simple random sample survey, every member of the target population has an equal chance of being in the sample, but, if you collect your sample in clusters, you may increase the variability (or variance) of the sample because the sampling design selects subjects whose results are not independent from each other.
The design effect measures the increase in variance of the sample when using a cluster sample rather than a simple random sample. The design effect in a simple random sample will always be 1. Sometimes people calculate sample sizes for cluster surveys by multiplying the simple random sample size by 2. This has the advantage of being simple to calculate, but it may lead to sample sizes that are either too large or too small. This is because design effect is influenced by how much the variables tend to cluster in the field.
The design effect increases the overall sample size and is usually estimated to be 2 in immunization cluster surveys.
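The rule of thumb described above, multiplying the simple random sample size by the design effect, can be sketched as follows. The function name `cluster_sample_size` is hypothetical, and the default design effect of 2 is only the conventional estimate mentioned in the text, not a measured value.

```python
import math

def cluster_sample_size(n_srs, design_effect=2.0):
    """Inflate a simple random sample size by the design effect and round up."""
    return math.ceil(n_srs * design_effect)

# Using the simple random sample size from the measles example (113.1):
n_cluster = cluster_sample_size(113.1)                  # design effect assumed to be 2
n_simple = cluster_sample_size(113.1, design_effect=1)  # simple random sample

print(f"cluster survey: {n_cluster}, simple random sample: {n_simple}")
```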
RelatedConcepts
SampleSizeforAnalyticStudies
Note: RightSize is a program developed by the CDC to teach and determine sample size calculations. The information in the previous section is taken from this program.
For more information on the topic of sample size, particularly the calculation of cluster survey samples and determination of design effect, please refer to this program at http://www.cdc.gov/descd/materials.html#rightsize.
You may also use OpenEpi to calculate cluster survey sample size by appropriately adjusting the design effect.
Sample Size for Analytic Studies
Overview
Analytic studies require calculating the sample size of both the case and the control groups.
When calculating the sample size for analytic studies, take into account the risk of making a type I error and control for it with z_{α/2}.
The power of a study is the probability of avoiding a type II error and can be controlled through sample size calculations by choosing z_{1−β}.
Formula:
\[ n_1 = \frac{\left[ z_{\alpha/2}\sqrt{(r+1)\,\bar{p}\,\bar{q}} + z_{1-\beta}\sqrt{r\,p_1 q_1 + p_2 q_2} \right]^2}{r\,(p_1 - p_2)^2}, \qquad n_2 = r\,n_1 \]
Assumptions: The sample is independent and randomly selected.
Where:
Variable    Cohort                                   Case-control
n_1         number of exposed                        number of cases
n_2         number of unexposed                      number of controls
z_{α/2}     z-score for two-tailed test based on α level
z_{1−β}     z-score for one-tailed test based on β level
r           unexposed:exposed                        controls:cases
p_1         proportion of exposed with disease       proportion of cases with exposure
q_1         1 − p_1
p_2         proportion of unexposed with disease     proportion of controls with exposure
q_2         1 − p_2
And:
\[ p_1 = \frac{p_2\,(OR)}{1 + p_2\,(OR - 1)} \quad \text{or} \quad p_1 = p_2\,(RR) \]
\[ \bar{p} = \frac{p_1 + r\,p_2}{r + 1}, \qquad \bar{q} = 1 - \bar{p} \]
Values required for a sample size calculation of an analytic study are:
• Two-sided confidence level: Most often this is a 95% confidence interval and may be referred to as precision.
• Power: Most often this is 80% or 90%.
• Ratio of unexposed to exposed in the planned sample (r): If there are to be an equal number of unexposed and exposed, then r = 1. If there are to be twice as many unexposed as exposed, then r = 2.0.
• Percent of unexposed with outcome: An estimate of the percentage of unexposed individuals that will develop (or have) the outcome of interest. For example, in a cohort study, this is the percentage of unexposed individuals who will develop the outcome of interest during the study. In a cross-sectional study, this is the estimated prevalence of disease among the unexposed. This statistic is often found from previous studies or reports.
• Measure of association: This estimate is based on previous studies or reports and should reflect the minimum effect that the investigator considers worth detecting. Two examples are the relative risk and the odds ratio.
The sample size formula by Fleiss is:
\[ n_1 = \frac{\left[ z_{\alpha/2}\sqrt{(r+1)\,\bar{p}\,\bar{q}} + z_{1-\beta}\sqrt{r\,p_1 q_1 + p_2 q_2} \right]^2}{r\,(p_1 - p_2)^2}, \qquad n_2 = r\,n_1 \]
If n_1 = n_2 then the formula simplifies to:
\[ n_1 = \frac{\left[ z_{\alpha/2}\sqrt{2\,\bar{p}\,\bar{q}} + z_{1-\beta}\sqrt{p_1 q_1 + p_2 q_2} \right]^2}{(p_1 - p_2)^2} \]
When the input is provided as an odds ratio rather than the proportion of exposed with disease, the proportion of exposed with disease is calculated as:
\[ p_1 = \frac{p_2\,(OR)}{1 + p_2\,(OR - 1)} \]
For a cross-sectional study where you want to calculate a prevalence odds ratio, this is the formula one would use.
When the input is provided as a risk ratio or relative risk (RR) rather than the proportion of exposed with disease, the proportion of exposed with disease is calculated as:
\[ p_1 = p_2\,(RR) \]
This is also the formula used for a cross-sectional study where you want to calculate a prevalence ratio.
Step-by-Step Example: Sample Size - Cohort Study
You wish to conduct a cohort study of intermittent preventive treatment (IPT) use in relation to risk of anemia among pregnant women. Previous studies have shown that 15% of those that received IPT have anemia and 25% of those who did not receive IPT have anemia. Calculate the sample size for your study. You set α at 0.05 (two-sided) and β at 0.20. Assume that n_1 = n_2.
Step   Example
1. Identify p_1, p_2, q_1, and q_2. Define z_{1−α/2} and z_{1−β}.
p_1 = 0.15   q_1 = (1 − p_1) = 1.0 − 0.15 = 0.85
p_2 = 0.25   q_2 = (1 − p_2) = 1.0 − 0.25 = 0.75
z_{1−α/2} = 1.96
z_{1−β} = 0.84
2. Calculate \bar{p} and \bar{q}.
\[ \bar{p} = \frac{p_1 + r\,p_2}{r + 1}, \qquad \bar{q} = 1 - \bar{p} \]
\[ \bar{p} = \frac{0.15 + 1(0.25)}{1 + 1} = \frac{0.4}{2} = 0.2 \]
\bar{q} = 1 − 0.2 = 0.8
3. Calculate the sample size.
\[ n_1 = \frac{\left[ z_{\alpha/2}\sqrt{(r+1)\,\bar{p}\,\bar{q}} + z_{1-\beta}\sqrt{r\,p_1 q_1 + p_2 q_2} \right]^2}{r\,(p_1 - p_2)^2} \]
Because r = 1, we can use the simplified equation.
\[ n_1 = \frac{\left[ 1.96\sqrt{2(0.2)(0.8)} + 0.84\sqrt{0.15(0.85) + 0.25(0.75)} \right]^2}{(0.15 - 0.25)^2} \]
\[ n_1 = \frac{\left[ 1.96\sqrt{0.32} + 0.84\sqrt{0.315} \right]^2}{(0.1)^2} \]
\[ n_1 = \frac{\left[ 1.1087 + 0.4714 \right]^2}{(0.1)^2} = \frac{(1.5801)^2}{0.01} = 249.70 \]
We do not need to calculate n_2 because r = 1; therefore n_2 = 249.70.
4. State your results. For this cohort study, we will need a total sample size of about 500 participants. Half of our sample (n_1 = 250) will have been exposed to IPT and the other half (n_2 = 250) will not have been exposed to IPT.
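The hand calculation above can be reproduced with a short Python sketch of the Fleiss formula. The function name `fleiss_sample_size` is hypothetical. The default z values are the rounded ones used in the workbook (1.96 and 0.84); `statistics.NormalDist` from the Python standard library is shown only as one way to obtain z values for other α and power levels.

```python
import math
from statistics import NormalDist

def fleiss_sample_size(p1, p2, r=1.0, z_alpha=1.96, z_beta=0.84):
    """Fleiss sample size without continuity correction.

    Returns (n1, n2), where n2 = r * n1.
    """
    q1, q2 = 1 - p1, 1 - p2
    p_bar = (p1 + r * p2) / (r + 1)
    q_bar = 1 - p_bar
    bracket = (z_alpha * math.sqrt((r + 1) * p_bar * q_bar)
               + z_beta * math.sqrt(r * p1 * q1 + p2 * q2))
    n1 = bracket ** 2 / (r * (p1 - p2) ** 2)
    return n1, r * n1

# IPT cohort example: p1 = 0.15 (exposed), p2 = 0.25 (unexposed), r = 1.
n1, n2 = fleiss_sample_size(p1=0.15, p2=0.25)
print(f"n1 = {n1:.2f}, n2 = {n2:.2f}, total = {math.ceil(n1) + math.ceil(n2)}")

# Unrounded z values for alpha = 0.05 (two-sided) and power = 0.80.
z_alpha_exact = NormalDist().inv_cdf(1 - 0.05 / 2)
z_beta_exact = NormalDist().inv_cdf(0.80)
```

Using the unrounded z values changes the result only slightly, so software output may differ from the hand calculation by a participant or so even before any continuity correction is applied.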
Practice: Sample Size - Cohort Study
You wish to conduct a study to determine if women who take multiple micronutrient supplements during pregnancy are more likely to deliver babies at what is considered a normal birth weight (2500 g or above). Literature indicates that 55% of those that received the supplement delivered babies 2500 g or more and 30% of those who did not receive the supplement delivered babies 2500 g or more. Calculate the sample size for your study. Set α at 0.05 (two-sided) and β at 0.20. Assume that n_1 = n_2.
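Step   Practice Space
1. Identify p_1, p_2, q_1, and q_2. Define z_{1−α/2} and z_{1−β}.
2. Calculate \bar{p} and \bar{q}.
\[ \bar{p} = \frac{p_1 + r\,p_2}{r + 1}, \qquad \bar{q} = 1 - \bar{p} \]
3. Calculate the sample size.
\[ n_1 = \frac{\left[ z_{\alpha/2}\sqrt{(r+1)\,\bar{p}\,\bar{q}} + z_{1-\beta}\sqrt{r\,p_1 q_1 + p_2 q_2} \right]^2}{r\,(p_1 - p_2)^2} \]
4. Interpret your results.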
Epi Info Example: Sample Size - Cohort Study
Using the proposed IPT study from page 193, use Epi Info to calculate the sample size.
Step   Example
1. Open StatCalc from the Epi Info toolbar.
StatCalc is found under the Utilities heading.
2. Choose Sample Size and Power from the menu.
Choose Cohort or Cross-Sectional from the subheadings.
3. Enter the correct values into the program.
Fill in 95% for the confidence level and 80% for the power level. The ratio remains 1:1. The expected frequency of disease in the unexposed group is p_2, or 25%.
We only need to fill in the information for one of the bottom three. We have information for the percent diseased among the exposed, 15%.
4. Press <F4> to calculate the sample size.
The top line of the results displays the sample size for the values that we have entered.
5. State the results. Our sample size for the cohort study is 540.
This differs from the value that we obtained with our hand calculations. This is due to a continuity correction and will become clearer in the next example.
OpenEpi Example: Sample Size - Cohort Study
We will use the same example to calculate the sample size in OpenEpi.
Step   Example
1. Choose Cohort/RCT (under Sample Size) from the menu.
The screen will read, "Sample Size: Unmatched Cross-Sectional & Cohort Studies."
2. Click on the button, Enter New Data.
Enter the confidence level, the desired power, the ratio, and the percent of unexposed with the outcome of interest.
As in Epi Info, we only need to fill in the information for one of the bottom four. We have information for the percent diseased among the exposed. The remaining values will be calculated by OpenEpi.
3. Calculate the sample size by clicking Calculate.
4. State the results. The calculated sample size is 500 using the Fleiss equation, which is the equation we used in our hand calculations.
Note that the result found in Epi Info is also reported as "Fleiss with CC," or continuity correction.
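The workbook does not give the continuity-correction formula. The sketch below assumes the programs apply the standard Fleiss correction, n_cc = (n/4)[1 + sqrt(1 + 2(r+1)/(n r |p_2 − p_1|))]^2, to the uncorrected n_1; the function name `fleiss_continuity_correction` is hypothetical. Under that assumption the IPT example reproduces the total of 540 reported by Epi Info.

```python
import math

def fleiss_continuity_correction(n1, p1, p2, r=1.0):
    """Apply the Fleiss continuity correction (assumed form) to an uncorrected n1."""
    inner = 1 + 2 * (r + 1) / (n1 * r * abs(p2 - p1))
    return (n1 / 4) * (1 + math.sqrt(inner)) ** 2

# IPT cohort example: uncorrected n1 = 249.70 from the hand calculation.
n1_cc = fleiss_continuity_correction(n1=249.70, p1=0.15, p2=0.25)
total_cc = 2 * math.ceil(n1_cc)  # r = 1, so both groups are the same size

print(f"corrected n1 = {n1_cc:.1f}, total = {total_cc}")
```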
Practice: Sample Size - Cohort Study
You would like to see if those who use water filters are less likely to become infected with schistosomiasis (bilharzia) in a community with a population of 150,000. The prevalence of schistosomiasis in the community before a water filter campaign was thought to be 15%. Other communities that have implemented the use of water filters have found a prevalence of 6%. Calculate the sample size for your study using either StatCalc or OpenEpi. Set α at 0.05 (two-sided) and β at 0.20. Assume that n_1 = n_2.
1. Open the appropriate program.
2. Enter the correct data.
3. Calculate the sample size.
Step   Practice Space
4. State the results.
Step-by-Step Example: Sample Size - Case-Control Study
You will conduct a case-control study of the possible relationship between cholera among conference attendees and eating at restaurant X. You recently conducted a descriptive study which revealed that 10% of attendees had eaten at restaurant X. You expect an OR of 3.0 and you choose a two-sided α of 0.05 and a β of 0.20. Assume that n_1 = n_2.
Step   Example
1. Identify p_1, p_2, q_1, and q_2. Define z_{1−α/2} and z_{1−β}.
\[ p_1 = \frac{p_2\,(OR)}{1 + p_2\,(OR - 1)} \]
\[ p_1 = \frac{0.10 \times 3.0}{1 + 0.10\,(3.0 - 1)} = \frac{0.30}{1.2} = 0.25 \]
q_1 = 1.0 − 0.25 = 0.75
p_2 = 0.10   q_2 = 1.0 − 0.10 = 0.90
z_{1−α/2} = 1.96
z_{1−β} = 0.84
2. Calculate \bar{p} and \bar{q}.
\[ \bar{p} = \frac{p_1 + r\,p_2}{r + 1}, \qquad \bar{q} = 1 - \bar{p} \]
\[ \bar{p} = \frac{0.25 + 1(0.10)}{1 + 1} = \frac{0.35}{2} = 0.175 \]
\[ \bar{q} = 1 - 0.175 = 0.825 \]
3. Calculate the sample size.
\[ n_1 = \frac{\left[ z_{\alpha/2}\sqrt{(r+1)\,\bar{p}\,\bar{q}} + z_{1-\beta}\sqrt{r\,p_1 q_1 + p_2 q_2} \right]^2}{r\,(p_1 - p_2)^2} \]
\[ n_1 = \frac{\left[ 1.96\sqrt{2(0.175)(0.825)} + 0.84\sqrt{0.25(0.75) + 0.1(0.9)} \right]^2}{(0.25 - 0.1)^2} \]
\[ n_1 = \frac{\left[ 1.96\sqrt{0.28875} + 0.84\sqrt{0.2775} \right]^2}{(0.15)^2} \]
\[ n_1 = \frac{\left[ 1.0532 + 0.4425 \right]^2}{(0.15)^2} = \frac{(1.4957)^2}{(0.15)^2} = 99.4 \]
4. State your results. The sample size for the cases of this case-control study is 100. Because n_1 = n_2, we should also have 100 controls. Thus, the total sample size will be 200.
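The case-control calculation follows the same pattern once p_1 has been derived from the odds ratio. The sketch below is illustrative only; the function names `exposure_in_cases` and `fleiss_n1` are hypothetical, and the z values are the rounded ones used in the workbook.

```python
import math

def exposure_in_cases(p2, odds_ratio):
    """Proportion of cases exposed, from exposure in controls (p2) and the expected OR."""
    return p2 * odds_ratio / (1 + p2 * (odds_ratio - 1))

def fleiss_n1(p1, p2, r=1.0, z_alpha=1.96, z_beta=0.84):
    """Fleiss sample size for group 1 (cases), without continuity correction."""
    p_bar = (p1 + r * p2) / (r + 1)
    q_bar = 1 - p_bar
    bracket = (z_alpha * math.sqrt((r + 1) * p_bar * q_bar)
               + z_beta * math.sqrt(r * p1 * (1 - p1) + p2 * (1 - p2)))
    return bracket ** 2 / (r * (p1 - p2) ** 2)

# Cholera example: 10% of controls exposed, expected OR = 3.0, r = 1.
p1 = exposure_in_cases(p2=0.10, odds_ratio=3.0)
n_cases = fleiss_n1(p1=p1, p2=0.10)
n_total = 2 * math.ceil(n_cases)

print(f"p1 = {p1:.2f}, n1 = {n_cases:.1f}, total = {n_total}")
```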
Practice: Sample Size - Case-Control Study
You plan to conduct a study to determine the effects of mining on lung disease. In a region of the country where mining is the main source of income, 60% of men ages 15 to 49 are miners. Based on the literature, you expect an OR of 3.5 and you choose a two-sided α of 0.05 and a β of 0.20. Assume that n_1 = n_2.
Step   Practice Space
1. Identify p_1, p_2, q_1, and q_2. Define z_{1−α/2} and z_{1−β}.
\[ p_1 = \frac{p_2\,(OR)}{1 + p_2\,(OR - 1)} \]
2. Calculate \bar{p} and \bar{q}.
\[ \bar{p} = \frac{p_1 + r\,p_2}{r + 1}, \qquad \bar{q} = 1 - \bar{p} \]
3. Calculate the sample size.
\[ n_1 = \frac{\left[ z_{\alpha/2}\sqrt{(r+1)\,\bar{p}\,\bar{q}} + z_{1-\beta}\sqrt{r\,p_1 q_1 + p_2 q_2} \right]^2}{r\,(p_1 - p_2)^2} \]
4. Interpret your results.
Epi Info Example: Sample Size - Case-Control Study
Using the proposed cholera study from page 199, use Epi Info to calculate the sample size.
Step   Example
1. Open StatCalc from the Epi Info toolbar.
StatCalc is found under the Utilities heading.
2. Choose Sample Size and Power from the menu.
Choose Unmatched Case-Control from the subheadings.
3. Enter the correct values into the program.
Fill in 95% for the confidence level and 80% for the power level. The ratio remains 1:1. The expected frequency of exposure in the control group is p_2, or 10%.
We only need to fill in the information for one of the bottom two. We have information for the odds ratio, 3.
4. Press <F4> to calculate the sample size.
The top line of the results displays the sample size for the values that we have entered.
5. State the results. Our total sample size for the case-control study is 224.
This differs from the value that we obtained with our hand calculations. This is due to a continuity correction and will become clearer in the next example.
OpenEpi Example: Sample Size - Case-Control Study
We will use the same example to calculate the sample size in OpenEpi.
Step   Example
1. Choose Unmatched CC (under Sample Size) from the menu.
The screen will read, "Sample Size: Unmatched Case-Control Study."
2. Click on the button, Enter New Data.
Enter the confidence level, the desired power, the ratio, and the percent of controls with the exposure of interest.
As in Epi Info, we only need to fill in the information for one of the bottom two. We have information for the odds ratio. The remaining value will be calculated by OpenEpi.
3. Calculate the sample size by clicking Calculate.
4. State the results. The calculated sample size is 200 using the Fleiss equation, which is the equation we used in our hand calculations.
Note that the result found in Epi Info is also reported as "Fleiss with CC," or continuity correction.
Notice that both Epi Info and OpenEpi specify that the case-control study is unmatched. If you plan on conducting a matched, or paired, case-control study, you must calculate your sample size differently. This calculation is beyond the scope of this workbook.
Practice: Sample Size - Case-Control Study
You plan to conduct a study to determine the effects of ANC visits on gestational age at birth. Approximately 12% of women attend two or more ANC visits during pregnancy. Based on the literature, you expect an OR of 1.8 and you choose a two-sided α of 0.05 and a β of 0.20. Assume that n_1 = n_2. You may use either Epi Info or OpenEpi to find the sample size.
1. Open the appropriate program.
2. Enter the correct data.
3. Calculate the sample size.
Step   Practice Space
4. State the results.
RelatedConcepts
SampleSizeforDescriptiveStudies
Correlation and Regression Analysis
Regression describes the probable relationship between two variables. The utility of regression analysis is that it allows us to predict the value of one variable based on another once the relationship between the two has been established.
Correlation differs from regression in that it measures the actual strength of the relationship between the two variables.
Pearson Product Moment Correlation Coefficient
The correlation coefficient is a measure of the strength of the linear relationship between two continuous variables. It is a dimensionless number; it has no units of measurement. The correlation ranges from −1 to +1. A correlation of ±1 indicates a perfect linear relationship, whereas a correlation of 0 indicates no linear relationship; however, a nonlinear relationship might exist.
Overview
Description: Pearson Product Moment Correlation Coefficient (or simply the correlation coefficient) is used to quantify the strength of the linear relationship between two continuous random variables. The population correlation is denoted by the Greek letter ρ (rho).
Example: Age and blood pressure are positively correlated. As age increases, blood pressure tends to increase.
Formula: The estimator of the population correlation, denoted by r, the sample correlation, is calculated using the formula:
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[ n\sum x^2 - \left(\sum x\right)^2 \right]\left[ n\sum y^2 - \left(\sum y\right)^2 \right]}} \]
To make inferences about the population correlation ρ based on the sample correlation r, we use the following t test statistic:
\[ t = r\sqrt{\frac{n-2}{1-r^2}} \]
Assumptions: If we assume the pairs of continuous observations (x_i, y_i) were obtained randomly and that both X and Y are normally distributed, then the test statistic has a t distribution with n − 2 degrees of freedom only when the null hypothesis is true.
Where: t = Number of standard deviations r is from 0
r = Sample correlation coefficient
n = Sample size
Suppose that we are interested in a pair of continuous random variables, each of which is measured on the same set of students. For example, we might wish to investigate the relationship between the score attained on a mathematics skills assessment exam and biostatistics final exam grade for a group of 10 first-year MPH students. The data are presented in the table below. Compute the correlation between math skills score and biostatistics final exam grade, and then determine whether the correlation is significantly different from zero.
Student MathSkillsScore BiostatsFinalExam
1 39 65
2 43 78
3 21 52
4 64 82
5 57 92
6 47 89
7 28 73
8 75 98
9 34 56
10 52 75
Before we conduct a correlation analysis, we should always examine the nature of the data in a two-way scatterplot. A scatterplot, also referred to as a scatter diagram, is a graph that may be used to represent the relationship between two variables. We can often determine whether a linear relationship exists between x and y simply by examining the graph. The bivariate scatterplot of these scores and grades is given below. While the points do not lie exactly on a line, it appears that the data tend to be positively linearly related. That is, biostatistics final exam grades tend to increase with increasing math skills assessment test score.
[Figure: Scatterplot of math skills assessment test score and biostats final exam grade. x-axis: math_score (20 to 80); y-axis: bios_grade (50 to 100)]
Step-by-Step Example: Pearson Product Moment Correlation Coefficient
Use the data above to determine if the math skills score and the biostatistics final exam grade are correlated.
Step   Example
1. Prepare the standard table used to assist in calculating sums.
Grade   Math score
y       x       yx        y^2       x^2
65      39      2,535     4,225     1,521
78      43      3,354     6,084     1,849
52      21      1,092     2,704     441
82      64      5,248     6,724     4,096
92      57      5,244     8,464     3,249
89      47      4,183     7,921     2,209
73      28      2,044     5,329     784
98      75      7,350     9,604     5,625
56      34      1,904     3,136     1,156
75      52      3,900     5,625     2,704
760     460     36,854    59,816    23,634   (column sums)
2. Compute correlation coefficient, r.
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[ n\sum x^2 - \left(\sum x\right)^2 \right]\left[ n\sum y^2 - \left(\sum y\right)^2 \right]}} \]
\[ r = \frac{10(36{,}854) - 460(760)}{\sqrt{\left[ 10(23{,}634) - (460)^2 \right]\left[ 10(59{,}816) - (760)^2 \right]}} = 0.84 \]
3. Interpret the sample correlation coefficient.
Based on this sample, there appears to be a strong positive linear relationship between the score attained on the math skills assessment exam and final biostats exam grade.
4. Determine statistical significance. Begin by stating appropriate null and alternative hypotheses.
Our hypotheses are as follows:
H_0: ρ = 0 (no correlation in the underlying population)
H_a: ρ ≠ 0 (there is a correlation in the underlying population).
5. State the decision rule. Using a two-sided test with an alpha value of 0.05 and n − 2 = 8 df, the critical value of the test statistic is ±2.306. Thus, we should reject H_0 if |t_calc| > 2.306.
6. Calculate the value of the test statistic.
\[ t = r\sqrt{\frac{n-2}{1-r^2}} \]
\[ t = 0.84\sqrt{\frac{10-2}{1-(0.84)^2}} = 4.37 \]
7. State the statistical decision. We reject H_0 since the value of our test statistic, t_calc = 4.37, is greater than the t critical value of 2.306; that is, our test statistic falls in the rejection region.
8. Report the p value. p < 0.01
9. State the practical conclusion. There is evidence that the true population correlation coefficient is different from zero. Biostatistics final exam grade increases as the math skills assessment score increases; therefore, the correlation is positive.
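The nine steps above can be reproduced with a short Python sketch that uses the same sums-based formula. The function names `pearson_r` and `t_statistic` are hypothetical; the data are the ten student scores from the example, and the critical value 2.306 is the one stated in step 5.

```python
import math

# Data from the worked example (10 first-year MPH students).
math_score = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]   # x
bios_grade = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]   # y

def pearson_r(x, y):
    """Sample correlation coefficient from the column sums (steps 1 and 2)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def t_statistic(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom (step 6)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = pearson_r(math_score, bios_grade)
t = t_statistic(r, len(math_score))
reject_h0 = abs(t) > 2.306   # critical value for alpha = 0.05, 8 df (step 5)

print(f"r = {r:.2f}, t = {t:.2f}, reject H0: {reject_h0}")
```

The code carries r at full precision into the t statistic, so the last digit of t can differ slightly from a hand calculation that first rounds r to 0.84.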
Practice: Pearson Product Moment Correlation Coefficient
Suppose you have data on age and systolic blood pressure (SBP) from a random sample of 16 adult females. The data are presented as follows:
Subject   Age (years)   SBP (mmHg)
1 22 131
2 24 116
3 28 114
4 29 123
5 30 117
6 32 122
7 35 121
8 41 171
9 47 111
10 49 133
11 51 130
12 51 133
13 56 145
14 57 141
15 63 155
16 77 217
Let's first examine the nature of the relationship between the two variables. The dependent variable, SBP, is plotted on the vertical, or y, axis and the independent variable, age, is plotted on the horizontal, or x, axis. The scatterplot is presented here:
[Figure: Scatterplot of age and SBP. x-axis: Age (20 to 80); y-axis: SBP (100 to 220)]
Thus, it appears that age and SBP are positively linearly related. That is, increasing age tends to be associated with increasing SBP. Now let's determine the value of the correlation coefficient.
Step   Practice Space
1. Prepare the standard table used to assist in calculating sums.
SBP   age
y     x     yx    y^2    x^2
131 22
116 24
114 28
123 29
117 30
122 32
121 35
171 41
111 47
133 49
130 51
133 51
145 56
141 57
155 63
217 77
2180 692
2. Compute correlation coefficient, r.
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[ n\sum x^2 - \left(\sum x\right)^2 \right]\left[ n\sum y^2 - \left(\sum y\right)^2 \right]}} \]
3. Interpret the sample correlation coefficient.
4. State the null and alternative hypotheses.
5. State the decision rule.
6. Calculate the value of the test statistic.
7. State the statistical decision.
8. Report the p value.
9. State the practical conclusion.
Excel Example: Pearson Product Moment Correlation Coefficient
Use the example on page 208 to find the correlation coefficient in Excel.
Step   Example
PearsonProductMomentCorrelationCoefficient
BiostatisticsWorkbook 214
DRAFT:Aug.28,2007
Step   Example
1. Look at the data in a scatterplot to determine if a linear relationship exists.
   Import the dataset MathScore into Excel.
   Select Insert from the toolbar.
   Choose Chart from the drop-down box.
   From the chart types offered, choose XY (Scatter) and click on Next.
   Enter your Data Range by clicking on the icon next to the text box and then highlighting the two variables of interest: Math Skills Score and Bios Final Exam. Make sure that Columns and not Rows is selected, as our data is arranged in columns.
2. Compute the correlation coefficient, r.
   Select Tools from the toolbar.
   Select Data Analysis.
   Choose Correlation.
   Click OK.
   Highlight the columns under the variables Math Skills Score and Biostats Final Exam. Make sure to check the box next to Labels if you have included them. (It is generally helpful to include them, as the resulting table is automatically labeled.)
   Select your Output Range and click OK.
3. Interpret the sample correlation coefficient.
   You will see a chart like the one below:
   The cell in which our two variables intersect is the cell that we are interested in. The Pearson's correlation coefficient (r) = 0.84. This is representative of the strong positive linear relationship that is shown in the scatterplot.
You can determine the statistical significance of the correlation coefficient by performing a t test on the data. Please refer to page 91 for details on conducting a t test.
Excel Practice: Pearson Product Moment Correlation Coefficient
The dataset Birthweight provides information on birth weight and the interval between the current and the previous live birth. Use Excel to create a scatterplot, then determine whether there is a linear relationship between the variables Birth Interval and Weight at Birth. If there is, calculate the correlation coefficient, r, in Excel.
1. Look at the data in a scatterplot to determine if a linear relationship exists.
2. Compute the correlation coefficient, r.
Step   Practice Space
3. Interpret the sample correlation coefficient.
Related Concepts
Simple Linear Regression
Simple Linear Regression
Overview
Description: Simple linear regression analyzes the linear relationship between two continuous random variables by allowing us to investigate the change in the dependent variable corresponding to a given change in the independent variable.
Example:
Formula:
SIMPLE LINEAR REGRESSION MODEL (SAMPLE MODEL)

$$\hat{y}_i = b_0 + b_1 x$$

where:
$\hat{y}$ = Estimated, or predicted, y value
$b_0$ = Unbiased estimate of the regression intercept
$b_1$ = Unbiased estimate of the regression slope
$x$ = Value of the independent variable

LEAST SQUARES EQUATIONS

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

algebraic equivalent:

$$b_1 = \frac{\sum xy - \dfrac{\left(\sum x\right)\left(\sum y\right)}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$

Assumptions:
o Individual values of the error are statistically independent of one another.
o The distribution of all possible values of the error, e, is normal.
o The distributions of possible error values have equal variances for all values of x.
o The means of the dependent variable, y, for all specified values of the independent variable, x, can be connected by a straight line called the population regression model.
Step-by-Step Example: Simple Linear Regression
Refer back to the correlation analysis example. Rather than quantifying the strength of the association between math skills assessment score and biostatistics final exam grade, we might be interested in predicting the change in final exam grade that corresponds to a given change in math skills assessment score. In this case, biostatistics final exam grade is the response (or dependent) variable, and math skills assessment score is the explanatory (or independent) variable.
1. Prepare the standard table used to assist in calculating sums.

Grade (y)   Math score (x)   yx       y^2      x^2
65          39               2,535    4,225    1,521
78          43               3,354    6,084    1,849
52          21               1,092    2,704    441
82          64               5,248    6,724    4,096
92          57               5,244    8,464    3,249
89          47               4,183    7,921    2,209
73          28               2,044    5,329    784
98          75               7,350    9,604    5,625
56          34               1,904    3,136    1,156
75          52               3,900    5,625    2,704
Sum: 760    460              36,854   59,816   23,634

2. Calculate values for slope, $b_1$, and intercept, $b_0$.

$$b_1 = \frac{\sum xy - \dfrac{\left(\sum x\right)\left(\sum y\right)}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{36{,}854 - \dfrac{(460)(760)}{10}}{23{,}634 - \dfrac{(460)^2}{10}} = 0.76556$$

$$b_0 = \bar{y} - b_1 \bar{x} = 76 - 0.76556(46) = 40.78$$

3. Put it all together to find the least squares line, $\hat{y} = b_0 + b_1 x$.
   The least squares regression line is:

$$\hat{y} = 40.78 + 0.766(x)$$

4. Interpret the regression equation.
   The y-intercept of the fitted line is 40.78. Theoretically, this is the mean final biostatistics exam grade that corresponds to a math skills assessment score of zero. The slope of the line is 0.766, implying that for each increase of 1 point on the math skills test, the biostatistics final exam grade is predicted to increase by 0.766 points on average.
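The slope and intercept above can be reproduced with a few lines of code. This is a minimal sketch in Python (standard library only) that applies the algebraic form of the least squares equations to the ten grade and math score pairs; the function name `least_squares` is an illustrative choice, not something defined in the workbook.

```python
# Math skills assessment score (x) and biostatistics final exam grade (y).
math_score = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grade = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]


def least_squares(x, y):
    """Return (b0, b1) using the algebraic-equivalent least squares equations."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    # Slope: corrected sum of cross-products over corrected sum of squares of x.
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: the fitted line passes through (mean of x, mean of y).
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = least_squares(math_score, grade)
print(f"y-hat = {b0:.2f} + {b1:.3f}(x)")
```

Working from the column sums rather than from deviations mirrors the standard table in Step 1, so each intermediate total can be checked against the table row by row.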
Practice: Simple Linear Regression
Refer back to the data on age and SBP from a random sample of 16 adult females. The data are presented here again as follows:
Subject   Age (years)   SBP (mmHg)
1 22 131
2 24 116
3 28 114
4 29 123
5 30 117
6 32 122
7 35 121
8 41 171
9 47 111
10 49 133
11 51 130
12 51 133
13 56 145
14 57 141
15 63 155
16 77 217
We determined the correlation coefficient for these data to be r = 0.726. We are now interested in creating a prediction equation that expresses SBP as a function of age.
Step   Practice Space
1. Prepare the standard table used to assist in calculating sums.

Age (years), x   SBP (mmHg), y   xy   x^2   y^2
22               131
24               116
28               114
29               123
30               117
32               122
35               121
41               171
47               111
49               133
51               130
51               133
56               145
57               141
63               155
77               217
Sum

2. Calculate values for slope, $b_1$, and intercept, $b_0$.
3. Put it all together to find the least squares line, $\hat{y} = b_0 + b_1 x$.
4. Interpret the regression equation.
Epi Info Example: Simple Linear Regression
We will use Epi Info to determine the line of regression for the example on page 218.
Step   Example
1. READ the data.
   READ the dataset, Linear_Regres.
2. Select LINEAR REGRESSION under Advanced Statistics on the Epi Info toolbar.
   Select SBP_mmHg as the Outcome Variable and Age_Years as the Other Variable.
   Click OK.
3. Find the least squares line, $\hat{y} = b_0 + b_1 x$.
   Use the Constant, 82.461, to represent $b_0$ and Age_Years to represent $b_1$. Thus, the equation becomes $\hat{y} = 82.461 + 1.244x$.
4. Interpret the equation.
   $\hat{y} = 82.46 + 1.24(\text{age})$
   For every one-year increase in age, the SBP is expected to increase by 1.24 mmHg on average.
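Once the line has been fitted, it can be used to produce predicted values. The sketch below, in Python, plugs the coefficients reported in the Epi Info output into the prediction equation. The function name `predict_sbp` and the ages 50 and 51 are hypothetical choices made for illustration; both ages lie inside the observed range of 22 to 77 years, since predictions outside that range are not supported by the data.

```python
# Coefficients as reported in the Epi Info output above.
B0 = 82.461  # intercept (the "Constant")
B1 = 1.244   # slope for Age_Years


def predict_sbp(age_years):
    """Predicted mean SBP (mmHg) for a given age, from y-hat = b0 + b1 * x."""
    return B0 + B1 * age_years


# Hypothetical ages chosen for illustration only.
sbp_at_50 = predict_sbp(50)
sbp_at_51 = predict_sbp(51)

print(f"Predicted SBP at age 50: {sbp_at_50:.1f} mmHg")
print(f"Change for one additional year: {sbp_at_51 - sbp_at_50:.3f} mmHg")
```

The difference between the two predictions equals the slope, which is the interpretation given in Step 4.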
Epi Info Practice: Simple Linear Regression
Use the dataset Birthweight to predict weight at birth based on the previous birth interval.
Step   Practice Space
1. Select LINEAR REGRESSION under Advanced Statistics on the Epi Info toolbar.
2. Find the least squares line using the equation $\hat{y} = b_0 + b_1 x$.
3. Interpret the equation.
Related Concepts
One-Way Analysis of Variance
One-Way Analysis of Variance (ANOVA)
Overview
Description: ANOVA is employed to evaluate the null hypothesis ($H_0$) that the population means are equal versus the alternative hypothesis ($H_a$) that at least one population mean differs. This technique is typically used when you wish to compare at least 3 population means. The term "one-way" indicates that there is a single factor or characteristic that distinguishes the various populations from each other.
Formulas:
The within-groups variability, denoted here as $s_W^2$, is computed as:

$$s_W^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \dots + (n_k - 1)s_k^2}{n - k}$$

The between-groups variability, denoted here as $s_B^2$, is computed as:

$$s_B^2 = \frac{n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 + \dots + n_k(\bar{x}_k - \bar{x})^2}{k - 1}$$

Where:

$$\bar{x} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \dots + n_k\bar{x}_k}{n}$$

represents the overall mean.
Assumptions
The samples are randomly and independently selected from their respective populations.
The populations are normally distributed with means $\mu_1, \mu_2, \dots, \mu_k$ and equal variances, $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2 = \sigma^2$.
One-way ANOVA analyzes variance from two components: some of the variation comes from differences among the group means, and the rest comes from differences among the subjects within each group. The within-groups variability, $s_W^2$, is simply a pooled estimate of the common variance, $\sigma^2$, and is similar to the pooled variance used in the independent samples t test. The overall mean is defined as the overall average of the n observations that make up the k different groups.
If the variability within the k different populations is small relative to the variability among their respective means, this suggests that the population means are in fact different. A formal test of significance to examine the null hypothesis that the population means are identical uses the following F test statistic:

$$F = \frac{s_B^2}{s_W^2}$$
Step-by-Step Example: One-Way ANOVA
Suppose you were asked to analyze the following data, obtained from a dietician who examined 4 groups of overweight females for a period of 6 months. A total of 20 women were randomly assigned to one of 4 groups. The 1st group changed neither their diet nor level of physical activity. The 2nd group changed to a healthy dieting plan but did not participate in an exercise program. The 3rd group exercised vigorously but did not alter their eating habits. The 4th group exercised regularly and ate a healthy diet. At the end of the study, total weight loss was measured for each individual (Note: a positive number indicates a decrease in weight; a negative number corresponds to an increase). Data are presented in the table below. You are interested in determining whether there is any evidence of a difference in mean weight loss across the four groups. Test at the 0.05 alpha level of significance.
Group1 Group2 Group3 Group4
5 2 8 12
2 8 0 6
3 4 2 15
2 12 6 8
0 4 2 10
$\bar{x}_1 = 1.6$   $\bar{x}_2 = 6$   $\bar{x}_3 = 3.6$   $\bar{x}_4 = 10.2$
$s_1 = 2.70$   $s_2 = 4.00$   $s_3 = 3.29$   $s_4 = 3.49$
Overall mean: $\bar{x} = 5.35$
Step   Example
1. State the appropriate null and alternative hypotheses.
   $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$ (mean weight loss is identical for the 4 groups)
   $H_a$: at least one mean differs from the others
2. State the decision rule.
   For an F distribution with k - 1 = 4 - 1 = 3 numerator df and n - k = 20 - 4 = 16 denominator df, at the 0.05 level we will reject $H_0$ if $F_{calc} > F_{crit}$ (3.24).
3. Compute values for each individual group mean and standard deviation, and the overall group mean, $\bar{x}$.
   Check your calculations with the answers provided in the table above.
4. Calculate within-groups variability, $s_W^2$.

$$s_W^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \dots + (n_k - 1)s_k^2}{n - k} = \frac{(4)(2.70)^2 + (4)(4)^2 + (4)(3.29)^2 + (4)(3.49)^2}{20 - 4} = \frac{185.1768}{16} = 11.57355$$

5. Calculate between-groups variability, $s_B^2$.

$$s_B^2 = \frac{n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 + \dots + n_k(\bar{x}_k - \bar{x})^2}{k - 1}$$

$$s_B^2 = \frac{5(1.6 - 5.35)^2 + 5(6 - 5.35)^2 + 5(3.6 - 5.35)^2 + 5(10.2 - 5.35)^2}{3} = \frac{205.35}{3} = 68.45$$

6. Calculate the value of the test statistic.

$$F_{calc} = \frac{s_B^2}{s_W^2} = \frac{68.45}{11.57355} = 5.91$$
7. Report the p value. 0.005 < p < 0.01
8. State the statistical decision.
   Since $F_{calc} > F_{crit}$ (5.91 > 3.24), we have evidence that our test statistic falls in the rejection region. Therefore, we reject the null hypothesis.
9. State the practical conclusion.
   We conclude that the mean weight loss is not identical for the 4 populations.
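Steps 4 through 6 can be checked with a short script that works from the group sizes, means, and standard deviations reported in the table, exactly as the hand calculation does. This is a minimal sketch in Python; the function name `f_from_summary` is illustrative, and the critical value 3.24 is taken from the decision rule in Step 2 rather than computed.

```python
def f_from_summary(ns, means, sds):
    """One-way ANOVA F statistic from per-group n, mean, and standard deviation."""
    n_total = sum(ns)
    k = len(ns)
    # Overall mean: group means weighted by group size.
    grand_mean = sum(n * m for n, m in zip(ns, means)) / n_total
    # Within-groups variability: pooled variance with n - k degrees of freedom.
    s2_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds)) / (n_total - k)
    # Between-groups variability with k - 1 degrees of freedom.
    s2_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means)) / (k - 1)
    return grand_mean, s2_within, s2_between, s2_between / s2_within


ns = [5, 5, 5, 5]
means = [1.6, 6.0, 3.6, 10.2]
sds = [2.70, 4.00, 3.29, 3.49]

grand_mean, s2_within, s2_between, f_calc = f_from_summary(ns, means, sds)
F_CRIT = 3.24  # critical value for 3 and 16 df at alpha = 0.05, from Step 2

print(f"overall mean = {grand_mean:.2f}")
print(f"s2_W = {s2_within:.5f}, s2_B = {s2_between:.2f}, F = {f_calc:.2f}")
print("reject H0" if f_calc > F_CRIT else "fail to reject H0")
```

Because the standard deviations are rounded to two decimals, the result agrees with the hand calculation but may differ slightly from software that works from the raw observations.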
Practice: One-Way Analysis of Variance
You are working with researchers in the OB/GYN Department at the Korle Bu Teaching Hospital in Accra, Ghana. They are interested in determining whether a difference exists in the pain scores recorded from pregnant women 2 days before their scheduled caesarian section. Ten women each were randomly selected to report their pain using one of 3 scales: the Visual Analog Scale (VAS), the Box Numerical Scale (BNS), and the Verbal Rating Scale (VRS). For comparative purposes, responses from all surveys were translated to a numerical scale ranging from 0 to 10, with zero representing one extreme (e.g. no pain) and 10 representing the other extreme (e.g. the worst pain possible). Do the data provide sufficient evidence to indicate a difference in pain among the 3 groups?
VAS BNS VRS
6 3 4
6 4 4
8 3 4
4 8 4
3 6 5
0 7 8
1 7 6
7 0 0
8 7 5
7 5 5
$\bar{x}_1 = 5.0$   $\bar{x}_2 = 5$   $\bar{x}_3 = 4.5$
$s_1 = 2.87$   $s_2 = 2.49$   $s_3 = 2.01$
Overall mean: $\bar{x} = 4.8333$
Note: The ANOVA tells us only if at least one mean differs significantly from the other means in the group, but does not specify which mean (or means) is different. Therefore, the next question you might ask involves the nature of the differences among the groups. Which means are statistically significantly different from the others? To answer this question, refer to a computer program and perform a multiple comparisons post-test. This procedure is beyond the scope of this workbook.
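When the raw observations are available, the F statistic can also be computed directly from the data rather than from rounded summary statistics. The following is a minimal sketch in Python using the standard library `statistics` module and the pain scores from the table above; the function name `one_way_f` is an illustrative choice. It is offered as a way to check a completed hand calculation, not as a replacement for working through the steps.

```python
from statistics import mean, variance

# Pain scores reported on each scale (10 women per scale).
vas = [6, 6, 8, 4, 3, 0, 1, 7, 8, 7]
bns = [3, 4, 3, 8, 6, 7, 7, 0, 7, 5]
vrs = [4, 4, 4, 4, 5, 8, 6, 0, 5, 5]


def one_way_f(groups):
    """One-way ANOVA F statistic computed from raw observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # variance() returns the sample variance (divisor n - 1) for each group.
    s2_within = sum((len(g) - 1) * variance(g) for g in groups) / (n_total - k)
    s2_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    return s2_between / s2_within


f_calc = one_way_f([vas, bns, vrs])
print(f"F = {f_calc:.2f} with {3 - 1} and {30 - 3} degrees of freedom")
```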
Step   Practice Space
1. State the appropriate null and alternative hypotheses.
2. State the decision rule.
3. Compute values for each individual group mean and standard deviation, and the overall group mean, $\bar{x}$.
4. Calculate within-groups variability, $s_W^2$.
5. Calculate between-groups variability, $s_B^2$.
6. Calculate the value of the test statistic.
7. Report the p value.
8. State the statistical decision.
9. State the practical conclusion.
Excel Example: One-Way Analysis of Variance
Using the same dietician's data as we used on page 224, determine whether there is any evidence of a difference in mean weight loss across the four groups. Using Excel, test at the 0.05 alpha level of significance. The dataset, ANOVAExample, can be found in Biostats_Workbook_Examples.mdb.
Step   Example
1. State the appropriate null and alternative hypotheses.
   $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$ (mean weight loss is identical for the 4 groups)
   $H_a$: at least one mean differs
2. State the decision rule.
   For an F distribution with k - 1 = 4 - 1 = 3 numerator df and n - k = 20 - 4 = 16 denominator df, at the 0.05 level we will reject $H_0$ if $F_{calc} > F_{crit}$ (3.24).
3. Analyze in Excel.
   a. Import the dataset into Excel.
      Open ANOVAexample.
      Choose Tools from the toolbar.
      Select Data Analysis from the drop-down box.
      Choose the test ANOVA: Single Factor.
   b. Perform the test.
      For Input Range, highlight all data in the table.
      Choose whether the data is grouped by columns or rows. In this case, it is arranged by columns.
      Be sure to check the Labels box if you have included the labels.
      Click OK.
4. Interpret the results.
   The above results show a calculated F value of 5.91 and a critical F value of 3.24. $F_{calc} > F_{crit}$, therefore we have evidence to reject $H_0$. In addition, the p value of 0.006 < 0.05, reinforcing our decision to reject the null.
5. State the practical conclusion.
   Since $F_{calc} > F_{crit}$, we can reject the null hypothesis and state that at least one group is significantly different than the other groups.
Excel Practice: One-Way Analysis of Variance
The same pain scales described on page 226 were tested at another hospital in Accra. Use the dataset ANOVApractice in Excel to determine whether this sample shows a difference in mean pain score among the pain scales.
Step   Practice Space
1. State the appropriate null and alternative hypotheses.
2. State the decision rule.
3. Analyze in Excel.
4. Interpret the results.
5. State the practical conclusion.
Related Concepts
Simple Linear Regression
References
Biostatistics for the Health Sciences (Blair and Taylor, 1999).
Daniel, Wayne W. Biostatistics: A Foundation for Analysis in the Health Sciences, Seventh Edition. John Wiley & Sons, 1999.
Fleiss JL. Statistical Methods for Rates and Proportions. John Wiley & Sons, 1981.
Kelsey JL, Whittemore AS, Evans AS, Thompson WD. Methods in Observational Epidemiology. Oxford University Press, 1996.
OpenEpi, www.openepi.com. The Open Source Initiative, 2006.
PEPI
RightSize
Ryan, Philip. A Short Course in Elementary Biostatistics. 1998.
Appendices
1: Answer Key
2: Tables
   Normal Curve Table
   Critical Values of Student's t Distribution
   Critical Values of F Distribution
   Critical Values of the Chi-Square Distribution
Appendix 1: Answer Key
Scales of Measurement (page 10)
1. Interval
2. Ratio
3. Nominal
4. Ratio
5. Ordinal
Frequency Distributions (page 18)
1.
Marital Status   Frequency   Percent   Cumulative Percent
Single           3           30        30
Married          4           40        70
Divorced         3           30        100
2. [Figure: bar chart of Marital Status. Categories: Single, Married, Divorced; vertical axis from 0 to 4.5.]
3. [Figure: chart of Age. Categories: <=25, 26-29, 30-34, 35-39, 40-44, 45-49, >=50; vertical axis from 0 to 2.5.]
4. [Figure: chart of Age. Categories: <=25, 26-29, 30-34, 35-39, 40-44, 45-49, >=50; vertical axis from 0 to 2.5.]
6. The frequency of married participants is the highest of the three categories. There are an equal number of single and divorced participants. Age, in five-year categories, is equally distributed.
Frequency Distributions in Epi Info (page 25)
A majority of participants chose English as their preferred language. More than twice as many participants prefer English to French or Spanish, the next most preferred languages. The height of participants rises slowly until it reaches the peak of 170-180 cm, after which it drops sharply.
Frequency Distributions in Excel (page 31)
Most participants are between 170 and 179 cm. More participants are shorter than 170 cm than taller than 179 cm. There is a sharp drop-off in heights above 179 cm. The frequency polygon mirrors the shape of the histogram created in Epi Info.
Measures of Central Tendency (page 37)
1. Age: 43.5
   Visits: 10.88
2. Age: The median is between 45 and 46, so the median age is 45.5 years.
   Visits: The median is between 9 and 13, so the median number of visits is $\frac{9 + 13}{2} = \frac{22}{2} = 11$ visits.
3. Age: 52
   Visits: 15
4. Age: The mean and the median are fairly close at 43.5 years and 45.5 years respectively. The mode, however, does not follow this pattern. The data may be slightly skewed, in which case we may choose to use the median. The average age of people in our sample is 45.5 years.
   Visits: In this case the mean and median are almost exactly the same, suggesting that the data are approximately symmetric. The mode, once again, does not follow the pattern. The average number of visits to the doctor in one year is 11.
Measures of Central Tendency in Epi Info (page 39)
Length: Mean 85.0, Median 85.0, Mode 85.0
Weight: Mean 11.7, Median 11.7, Mode 10.8
The average length of babies is 85.0 cm.
The average weight is about 11.7 kg.
Both length and weight are normally distributed in this case.
Measures of Dispersion (page 46)
Minimum, maximum, and range
1. The minimum value is 2.
2. The maximum value is 22.
3. Range: 22 - 2 = 20
4. The data vary over a range of 20 visits.
Interquartile Range
IQR = 15 - 5 = 10
The middle half of the data (between the 25th percentile and the 75th percentile) is within 10 visits.
Variance, standard deviation, and standard error
$s^2 = 44.98$
$s = 6.71$
$SE = 2.37$
Each observation in Visits is an average of 6.7 units from the mean. If we were to take many samples from the population, we would find that the mean of each sample would be an average of 2.37 units from the mean of the population.
Measures of Dispersion in Epi Info (page 51)
1. Range = 92 - 77 = 15
   IQR = 87 - 84 = 3
2. $S^2 = 14.67$
   $S = 3.83$
3. The range of the observations in the variable Length_cm is 15 cm. The middle 50 percent of the observations fall within a range of 3 cm. Each observation is an average of 3.83 units from the sample mean.
Standard Normal Distribution (page 60)
1. $z_M = \frac{70 - 40}{10} = 3$
2. $z_F = \frac{35 - 60}{12} = -2.08$
3. $z_M = \frac{78 - 40}{10} = 3.8$, so $x_F = 60 + 3.8(12) = 105.6$
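The three answers above all use the same two relationships: a z score is the distance from the mean in standard deviation units, and a raw score can be recovered from a z score by reversing that calculation. A minimal sketch in Python follows; the function names are illustrative, and the means and standard deviations (40 and 10 for the M group, 60 and 12 for the F group) are the values that appear in the answers.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def from_z(z, mean, sd):
    """Raw score that corresponds to a given z score."""
    return mean + z * sd


z1 = z_score(70, 40, 10)   # answer 1, M group
z2 = z_score(35, 60, 12)   # answer 2, F group
z3 = z_score(78, 40, 10)   # answer 3, M group
x_f = from_z(z3, 60, 12)   # equivalent score in the F group

print(f"z1 = {z1:.2f}, z2 = {z2:.2f}, z3 = {z3:.2f}, x_F = {x_f:.1f}")
```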
Confidence Interval Around a Mean (page 67)
The gestational ages for the three health facilities sampled in that region have a 95% confidence interval of (36.22, 38.78). We can state with 95% confidence that the true population mean for the region lies between 36.22 and 38.78 weeks.
Confidence Interval Around a Mean in Excel or OpenEpi (page 74)
The 95% confidence interval is (57.98, 62.02). With repeated random sampling, 95% of intervals calculated will contain the true mean of the population. We are 95% confident that this is one of those intervals and the mean score of women lies between 57.98 and 62.02.
Confidence Interval Around a Proportion (page 79)
The 95% confidence interval is (0.40, 0.52). With repeated random sampling, 95% of intervals calculated will contain the true proportion of the population. We are 95% confident that this is one of those intervals and the prevalence of TB in those infected with AIDS is between 40% and 52%.
Confidence Interval Around a Proportion in OpenEpi (page 83)
The 95% confidence interval is (0.00, 0.016). With repeated random sampling, 95% of intervals calculated will contain the true proportion of the population. We are 95% confident that this is one of those intervals and the prevalence of HIV in the population is between 0% and 1.6%.
Hypothesis Testing: Two Sample t Test (page 89)
We reject $H_0$ since the value of our test statistic, $t_{calc} = 4.02$, exceeds the t critical value of 1.9945. We therefore have evidence that our test statistic falls in the rejection region.
OR
Since a p value of 0.01 is less than the $\alpha$ of 0.05, there is sufficient evidence at the 5% level of significance to indicate that there is a difference in the mean age of HIV-infected and non-infected patients with tuberculosis.
Hypothesis Testing: Two Sample t Test in Epi Info (page 94)
Since 0.00 < 0.05, $p < \alpha$. Therefore, there is sufficient evidence to reject $H_0$. At the 5% level, there is a statistically significant difference between the ages of boys and girls who suffer from cholera.
Confidence Interval: Two Sample t Test (page 99)
We are 95% confident that the mean age of tuberculosis patients who are HIV+ is between 5.85 and 17.35 years younger than the mean age of tuberculosis patients who are HIV-. Since the interval excludes zero, the null hypothesis is rejected, and we conclude that the two population means are statistically significantly different.
Confidence Interval: Two Sample t Test in OpenEpi (page 105)
The 95% confidence interval is (124.52, 526.95). Since the interval excludes zero, the null hypothesis is rejected, and we conclude that the two population means are statistically significantly different. There is a difference between birth weights of those that received a supplement during pregnancy and those that did not.
Hypothesis Testing: z Test for Difference in Proportions (page 110)
We reject $H_0$ since the value of our test statistic, $z_{calc} = -7.43$, is less than the z critical value of -1.645. We therefore have evidence that our test statistic falls in the rejection region.
OR
Since the calculated p value (< 0.0002) is less than $\alpha$ (0.10), there is sufficient evidence at the 10% level of significance to indicate unequal participation in formal education of girls versus boys among the sampled populations.
Hypothesis Testing: z Test for Difference in Proportions in Epi Info (page 114)
Epi Info calculates a $\chi^2$ of 51.2821. Therefore, $z_{calc} = 7.16$ with a p value of < 0.0001. Since $p < \alpha$, we reject the null at the 5% level of significance. There is sufficient evidence to indicate a difference in the proportion of those that died on the old treatment and those that died on the new treatment.
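The step from the chi-square value to the z statistic relies on the fact that, for a 2 x 2 table (1 degree of freedom), the uncorrected chi-square statistic is the square of the z statistic for the difference in two proportions. A minimal sketch of that conversion in Python, using the chi-square value reported by Epi Info above:

```python
import math

chi_square = 51.2821  # value reported by Epi Info for the 2 x 2 table

# With 1 degree of freedom, chi-square = z squared, so |z| is its square root.
# The sign of z depends on which proportion is subtracted from which.
z_calc = math.sqrt(chi_square)

print(f"|z| = {z_calc:.2f}")
```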
Confidence Interval: z Test for Difference in Proportions (page 119)
The 95% confidence interval is (0.0745, 0.1275). We can interpret this confidence interval as meaning that we are 95% confident that the true difference in population proportions of girls versus boys who completed secondary education is between 7.45% and 12.75%. Since the interval does not contain the null difference of zero, we conclude that the two population proportions are not equal.
Confidence Interval: z Test for Difference in Proportions in Epi Info (page 122)
The 95% confidence interval is (10.95, 19.05). We are 95% confident that the true difference in proportions of those who die after being exposed to the old treatment and those that die after being exposed to the new treatment lies between 10.95% and 19.05%. Since the interval does not contain the null difference of zero, we conclude that the two population proportions are not equal, indicating our findings are statistically significant.
Hypothesis Testing: Paired t Test (page 129)
We reject $H_0$ since the value of our test statistic, $t_{calc} = 3.32$, exceeds the t critical value of 2.821. We therefore have evidence that our test statistic falls in the rejection region.
OR
Since 0.0005 < p < 0.005, the data provide sufficient evidence to indicate an increase in SBP before and after taking the examination at the 0.01 level of significance.
Hypothesis Testing: Paired t Test in Excel (page 135)
Since 0.00 < 0.05, $p < \alpha$. Therefore we can reject $H_0$. Alternatively: $t_{calc}$ of 6.10 > $t_{crit}$ of 2.09. Therefore, we can reject $H_0$. According to a two-tailed t test, there is sufficient evidence to conclude that there is a statistical difference in pulses when children are sitting versus when they are lying down.
Confidence Interval: Paired t Test (page 138)
The 95% confidence interval is (1.98, 7.62). We are 95% confident that, on average, MPH subjects tended to have a change in SBP of between 1.98 and 7.62 units after the examination. Since the interval excludes zero, we reject the null hypothesis.
Confidence Interval: Paired t Test in Excel (page 142)
The 95% confidence interval is (8.52, 16.78). We are 95% confident that the true mean difference lies between 8.52 and 16.78. Because this interval does not include the null value of zero, we can reject the null and conclude that there is a significant difference between the two groups.
Fisher's Exact Test (page 148)
p = 0.8590. Since $p > \alpha$, we fail to reject the null at $\alpha$ = 0.05. There is not sufficient evidence to conclude that there is a difference in contraction of cholera based on the salad eaten.
Fisher's Exact Test in Epi Info (page 152)
p = 0.0318. Since $p < \alpha$, we can reject the null hypothesis. There is sufficient evidence to conclude that approval ratings from those that participate differ from approval ratings from those that do not participate in class. However, because p is approaching $\alpha$, further testing may be needed.
Chi-Square Test for Independence (page 158)
Since $\chi^2_{calc} = 1.587$ (less than the critical value of 3.841), there is insufficient evidence at the 0.05 level to conclude that smoking and the presence of diabetes are related.
Chi-Square Test for Independence in Epi Info (page 162)
$\chi^2 = 20.74$, p = 0.0001
Since $\alpha > p$, we can reject the null hypothesis and conclude that the availability of HIV/AIDS education at a company is not independent of the number of employees in that company.
Confidence Interval for Odds Ratio (page 168)
The 95% confidence interval for the odds ratio is (3.6906, 6.7761). Because the confidence interval does not contain the value of 1, we can reject the null hypothesis. We are 95% confident that the true odds ratio lies between 3.69 and 6.78, meaning that children that were formula fed are between 3.69 and 6.78 times as likely to develop asthma than children that were breastfed.
Confidence Interval for Odds Ratio in Epi Info (page 172)
Epi Info has calculated an odds ratio of 0.09 with a 95% confidence interval of (0.03, 0.31). Because the confidence interval does not include the null value of 1, the odds ratio is statistically significant and we can reject the null hypothesis. We expect, with 95% confidence, to find that women who take a micronutrient supplement during pregnancy are between 0.03 and 0.31 times as likely to have infants above 2500 g at birth than women who do not take a supplement during pregnancy.
Confidence Interval for Relative Risk (page 175)
The 95% confidence interval is (2.08, 3.15). We are 95% confident that the true relative risk lies between 2.08 and 3.15. Because this interval does not contain the value one, there is enough evidence to reject the null hypothesis. There is a difference in the development of asthma between children who were breastfed and those who were not breastfed in infancy.
Confidence Interval for Relative Risk in Epi Info (page 179)
The relative risk is reported as 3.29 with a 95% confidence interval of (1.67, 6.47). Because the confidence interval does not include the value 1, there is a significant difference between the exposed and the unexposed. There is a significant difference in the risk of low birth weight between infants born within two years of the last birth and those born more than two years after the last birth.
Simple Random Sample Mean (page 184)
n = 138.30. The sample size is 139.
Simple Random Sample Proportion (page 185)
The sample size is 227.
Simple Random Sample Proportion in Epi Info/OpenEpi (page 189)
The sample size is 227.
Sample Size Cohort Study (page 194)
For this cohort study, we will need a total sample size of about 118 participants. Half of our sample (59) will have been exposed to the supplement during pregnancy and the other half (59) will not have been exposed to the supplement.
Sample Size Cohort Study in Epi Info/OpenEpi (page 198)
In Epi Info: The sample size is 406. There will be 203 exposed and 203 unexposed.
In OpenEpi: The sample size is 360. There will be 180 exposed and 180 unexposed.
Sample Size Case Control Study (page 200)
The total sample size is 104. There will be 52 cases and 52 controls.
Sample Size Case Control Study in Epi Info/OpenEpi (page 204)
In Epi Info: The total sample size is 704. There will be 352 cases and 352 controls.
In OpenEpi: The total sample size is 754. There will be 377 cases and 377 controls.
Pearson Product Moment Correlation Coefficient (page 210)
r = 0.73. Based on this sample, there appears to be a positive linear relationship between age and SBP.
If we continue this problem in order to determine statistical significance of r, we calculate a t statistic of 3.95. This is greater than the critical t value of 2.145, therefore this relationship is statistically significant. In addition, the p value of < 0.01 is less than the $\alpha$ of 0.05. Thus, there is evidence that the true population correlation coefficient is different from zero. SBP increases with increasing age; therefore, the correlation is positive.
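The t statistic quoted above can be reproduced from r and n with the standard test statistic for a correlation coefficient, $t = r\sqrt{n - 2} / \sqrt{1 - r^2}$, on n - 2 degrees of freedom. This formula is not written out in this part of the workbook, so the sketch below, in Python, should be read as an assumption about how the 3.95 was obtained. It uses r = 0.726 (the unrounded value stated in the regression practice) and the critical value 2.145 given in the answer.

```python
import math


def t_for_r(r, n):
    """t statistic for testing whether a population correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = 0.726        # sample correlation between age and SBP
n = 16           # number of subjects
df = n - 2
T_CRIT = 2.145   # two-tailed critical value at alpha = 0.05 with 14 df

t_calc = t_for_r(r, n)
reject_null = abs(t_calc) > T_CRIT

print(f"t = {t_calc:.2f} on {df} df; reject H0: {reject_null}")
```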
Pearson Product Moment Correlation Coefficient in Excel (page 216)
r = 0.94. This is representative of the strong positive linear relationship that is shown in the scatterplot.
Simple Linear Regression (page 219)
$\hat{y} = 82.46 + 1.2437x$
The y-intercept of the fitted line is 82.46. Theoretically, this is the mean SBP that corresponds to a person's age of 0. In this example, however, an age of zero does not make sense. The slope of the line is 1.2437, implying that for every 1-year increase in a subject's age, the SBP is predicted to increase by 1.2437 mmHg on average.
Simple Linear Regression in Epi Info (page 222)
$\hat{y} = 2207.386 + 11.800x$
In this population, the predicted mean weight at birth for a birth interval of zero months is 2207 grams. This weight increases by nearly 12 grams for every month in the interval between this birth and the one before it. Therefore, we would expect an infant born 24 months following the last live birth to weigh approximately 142 g more at birth than an infant born 12 months following the last live birth.
One-Way Analysis of Variance (page 226)
$F_{calc}$ (0.14) < $F_{crit}$ (3.35). Since $F_{calc}$ is smaller than $F_{crit}$, we fail to reject the null hypothesis. There is no significant difference among any of the three groups.
One-Way Analysis of Variance in Excel (page 230)
$F_{calc}$ (1.00) < $F_{crit}$ (3.22). Our p value is 0.37. Therefore, we fail to reject the null hypothesis. There is no significant difference among any of the three groups.
Appendix 2: Distribution Tables
Student's t Table
df\p 0.4 0.25 0.1 0.05 0.025 0.01 0.005 0.0005
1 0.32492 1 3.077684 6.313752 12.7062 31.82052 63.65674 636.6192
2 0.288675 0.816497 1.885618 2.919986 4.30265 6.96456 9.92484 31.5991
3 0.276671 0.764892 1.637744 2.353363 3.18245 4.5407 5.84091 12.924
4 0.270722 0.740697 1.533206 2.131847 2.77645 3.74695 4.60409 8.6103
5 0.267181 0.726687 1.475884 2.015048 2.57058 3.36493 4.03214 6.8688
6 0.264835 0.717558 1.439756 1.94318 2.44691 3.14267 3.70743 5.9588
7 0.263167 0.711142 1.414924 1.894579 2.36462 2.99795 3.49948 5.4079
8 0.261921 0.706387 1.396815 1.859548 2.306 2.89646 3.35539 5.0413
9 0.260955 0.702722 1.383029 1.833113 2.26216 2.82144 3.24984 4.7809
10 0.260185 0.699812 1.372184 1.812461 2.22814 2.76377 3.16927 4.5869
11 0.259556 0.697445 1.36343 1.795885 2.20099 2.71808 3.10581 4.437
12 0.259033 0.695483 1.356217 1.782288 2.17881 2.681 3.05454 4.3178
13 0.258591 0.693829 1.350171 1.770933 2.16037 2.65031 3.01228 4.2208
14 0.258213 0.692417 1.34503 1.76131 2.14479 2.62449 2.97684 4.1405
15 0.257885 0.691197 1.340606 1.75305 2.13145 2.60248 2.94671 4.0728
16 0.257599 0.690132 1.336757 1.745884 2.11991 2.58349 2.92078 4.015
17 0.257347 0.689195 1.333379 1.739607 2.10982 2.56693 2.89823 3.9651
18 0.257123 0.688364 1.330391 1.734064 2.10092 2.55238 2.87844 3.9216
19 0.256923 0.687621 1.327728 1.729133 2.09302 2.53948 2.86093 3.8834
20 0.256743 0.686954 1.325341 1.724718 2.08596 2.52798 2.84534 3.8495
21 0.25658 0.686352 1.323188 1.720743 2.07961 2.51765 2.83136 3.8193
22 0.256432 0.685805 1.321237 1.717144 2.07387 2.50832 2.81876 3.7921
23 0.256297 0.685306 1.31946 1.713872 2.06866 2.49987 2.80734 3.7676
24 0.256173 0.68485 1.317836 1.710882 2.0639 2.49216 2.79694 3.7454
25 0.25606 0.68443 1.316345 1.708141 2.05954 2.48511 2.78744 3.7251
26 0.255955 0.684043 1.314972 1.705618 2.05553 2.47863 2.77871 3.7066
27 0.255858 0.683685 1.313703 1.703288 2.05183 2.47266 2.77068 3.6896
28 0.255768 0.683353 1.312527 1.701131 2.04841 2.46714 2.76326 3.6739
29 0.255684 0.683044 1.311434 1.699127 2.04523 2.46202 2.75639 3.6594
30 0.255605 0.682756 1.310415 1.697261 2.04227 2.45726 2.75 3.646
inf 0.253347 0.67449 1.281552 1.644854 1.95996 2.32635 2.57583 3.2905
Standard Normal z
0 0.01 0.02 0.03 0.04 0.05 0.06 0.07
0 0 0.004 0.008 0.012 0.016 0.0199 0.0239 0.0279
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675
0.2 0.0793 0.0832 0.0871 0.091 0.0948 0.0987 0.1026 0.1064
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443
0.4 0.1554 0.1591 0.1628 0.1664 0.17 0.1736 0.1772 0.1808
0.5 0.1915 0.195 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486
0.7 0.258 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794
0.8 0.2881 0.291 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.334
1 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.377 0.379
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.398
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292
1.5 0.4332 0.4345 0.4357 0.437 0.4382 0.4394 0.4406 0.4418
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.475 0.4756
2 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808
2.1 0.4821 0.4826 0.483 0.4834 0.4838 0.4842 0.4846 0.485
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911
2.4 0.4918 0.492 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932
2.5 0.4938 0.494 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.496 0.4961 0.4962
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.497 0.4971 0.4972
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985
3 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989
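Each entry in the standard normal table is the area under the curve between 0 and z, that is, the cumulative probability at z minus 0.5. The sketch below recomputes entries with the Python standard library; the function names are illustrative and not part of the workbook.

```python
from statistics import NormalDist


def area_zero_to_z(z):
    """Area under the standard normal curve between 0 and z."""
    return NormalDist().cdf(z) - 0.5


def table_row(row_label, n_columns=8):
    """Recompute one table row: row_label + 0.00, 0.01, ..., rounded to 4 places."""
    return [round(area_zero_to_z(row_label + k / 100), 4) for k in range(n_columns)]


if __name__ == "__main__":
    print(table_row(1.0))                   # compare with the row labeled 1
    print(round(area_zero_to_z(1.96), 4))   # row 1.9, column 0.06
```

To read the table by hand, split z into its first decimal (the row) and its second decimal (the column): z = 1.96 is found in row 1.9 under column 0.06.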
Chi-Square Distribution
df\area 0.995 0.99 0.975 0.95 0.9 0.75 0.5 0.25 0.1 0.05 0.025 0.01 0.005
1 0.00004 0.00016 0.00098 0.00393 0.01579 0.10153 0.45494 1.3233 2.70554 3.84146 5.02389 6.6349 7.87944
2 0.01003 0.0201 0.05064 0.10259 0.21072 0.57536 1.38629 2.77259 4.60517 5.99146 7.37776 9.21034 10.59663
3 0.07172 0.11483 0.2158 0.35185 0.58437 1.21253 2.36597 4.10834 6.25139 7.81473 9.3484 11.34487 12.83816
4 0.20699 0.29711 0.48442 0.71072 1.06362 1.92256 3.35669 5.38527 7.77944 9.48773 11.14329 13.2767 14.86026
5 0.41174 0.5543 0.83121 1.14548 1.61031 2.6746 4.35146 6.62568 9.23636 11.0705 12.8325 15.08627 16.7496
6 0.67573 0.87209 1.23734 1.63538 2.20413 3.4546 5.34812 7.8408 10.64464 12.59159 14.44938 16.81189 18.54758
7 0.98926 1.23904 1.68987 2.16735 2.83311 4.25485 6.34581 9.03715 12.01704 14.06714 16.01276 18.47531 20.27774
8 1.34441 1.6465 2.17973 2.73264 3.48954 5.07064 7.34412 10.21885 13.36157 15.50731 17.53455 20.09024 21.95495
9 1.73493 2.0879 2.70039 3.32511 4.16816 5.89883 8.34283 11.38875 14.68366 16.91898 19.02277 21.66599 23.58935
10 2.15586 2.55821 3.24697 3.9403 4.86518 6.7372 9.34182 12.54886 15.98718 18.30704 20.48318 23.20925 25.18818
11 2.60322 3.05348 3.81575 4.57481 5.57778 7.58414 10.341 13.70069 17.27501 19.67514 21.92005 24.72497 26.75685
12 3.07382 3.57057 4.40379 5.22603 6.3038 8.43842 11.34032 14.8454 18.54935 21.02607 23.33666 26.21697 28.29952
13 3.56503 4.10692 5.00875 5.89186 7.0415 9.29907 12.33976 15.98391 19.81193 22.36203 24.7356 27.68825 29.81947
14 4.07467 4.66043 5.62873 6.57063 7.78953 10.16531 13.33927 17.11693 21.06414 23.68479 26.11895 29.14124 31.31935
15 4.60092 5.22935 6.26214 7.26094 8.54676 11.03654 14.33886 18.24509 22.30713 24.99579 27.48839 30.57791 32.80132
16 5.14221 5.81221 6.90766 7.96165 9.31224 11.91222 15.3385 19.36886 23.54183 26.29623 28.84535 31.99993 34.26719
17 5.69722 6.40776 7.56419 8.67176 10.08519 12.79193 16.33818 20.48868 24.76904 27.58711 30.19101 33.40866 35.71847
18 6.2648 7.01491 8.23075 9.39046 10.86494 13.67529 17.3379 21.60489 25.98942 28.8693 31.52638 34.80531 37.15645
19 6.84397 7.63273 8.90652 10.11701 11.65091 14.562 18.33765 22.71781 27.20357 30.14353 32.85233 36.19087 38.58226
20 7.43384 8.2604 9.59078 10.85081 12.44261 15.45177 19.33743 23.82769 28.41198 31.41043 34.16961 37.56623 39.99685
21 8.03365 8.8972 10.2829 11.59131 13.2396 16.34438 20.33723 24.93478 29.61509 32.67057 35.47888 38.93217 41.40106
22 8.64272 9.54249 10.98232 12.33801 14.04149 17.23962 21.33704 26.03927 30.81328 33.92444 36.78071 40.28936 42.79565
23 9.26042 10.19572 11.68855 13.09051 14.84796 18.1373 22.33688 27.14134 32.0069 35.17246 38.07563 41.6384 44.18128
24 9.88623 10.85636 12.40115 13.84843 15.65868 19.03725 23.33673 28.24115 33.19624 36.41503 39.36408 42.97982 45.55851
25 10.51965 11.52398 13.11972 14.61141 16.47341 19.93934 24.33659 29.33885 34.38159 37.65248 40.64647 44.3141 46.92789
26 11.16024 12.19815 13.8439 15.37916 17.29188 20.84343 25.33646 30.43457 35.56317 38.88514 41.92317 45.64168 48.28988
27 11.80759 12.8785 14.57338 16.1514 18.1139 21.7494 26.33634 31.52841 36.74122 40.11327 43.19451 46.96294 49.64492
28 12.46134 13.56471 15.30786 16.92788 18.93924 22.65716 27.33623 32.62049 37.91592 41.33714 44.46079 48.27824 50.99338
29 13.12115 14.25645 16.04707 17.70837 19.76774 23.56659 28.33613 33.71091 39.08747 42.55697 45.72229 49.58788 52.33562
30 13.78672 14.95346 16.79077 18.49266 20.59923 24.47761 29.33603 34.79974 40.25602 43.77297 46.97924 50.89218 53.67196
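The chi-square table lists, for each number of degrees of freedom, the value that leaves the stated area in the upper (right) tail. The sketch below recomputes such values with the Python standard library only. It is one possible approach, not the method used to produce the workbook: the cumulative probability is computed from a series for the regularized incomplete gamma function, and the critical value is found by bisection. The function names are illustrative.

```python
import math


def chi2_cdf(x, df):
    """Lower-tail probability P(X <= x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 0.0
    a, h = df / 2.0, x / 2.0
    # Series for the lower incomplete gamma function:
    # gamma(a, h) = h**a * exp(-h) * sum_{n>=0} h**n / (a * (a+1) * ... * (a+n))
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total):
        term *= h / (a + n)
        total += term
        n += 1
    # Divide by Gamma(a) on the log scale to avoid overflow.
    return total * math.exp(-h + a * math.log(h) - math.lgamma(a))


def chi2_critical(upper_tail_area, df):
    """Value with the given area to its right, found by bisection on the CDF."""
    target = 1.0 - upper_tail_area
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < target:  # expand until the target is bracketed
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    print(round(chi2_critical(0.05, 1), 5))   # compare with row 1, column 0.05
    print(round(chi2_critical(0.05, 10), 5))  # compare with row 10, column 0.05
```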
F Distribution
df2/df1 1 2 3 4 5 6 7 8
1 161.4476 199.5 215.7073 224.5832 230.1619 233.986 236.7684 238.8827
2 18.5128 19 19.1643 19.2468 19.2964 19.3295 19.3532 19.371
3 10.128 9.5521 9.2766 9.1172 9.0135 8.9406 8.8867 8.8452
4 7.7086 6.9443 6.5914 6.3882 6.2561 6.1631 6.0942 6.041
5 6.6079 5.7861 5.4095 5.1922 5.0503 4.9503 4.8759 4.8183
6 5.9874 5.1433 4.7571 4.5337 4.3874 4.2839 4.2067 4.1468
7 5.5914 4.7374 4.3468 4.1203 3.9715 3.866 3.787 3.7257
8 5.3177 4.459 4.0662 3.8379 3.6875 3.5806 3.5005 3.4381
9 5.1174 4.2565 3.8625 3.6331 3.4817 3.3738 3.2927 3.2296
10 4.9646 4.1028 3.7083 3.478 3.3258 3.2172 3.1355 3.0717
11 4.8443 3.9823 3.5874 3.3567 3.2039 3.0946 3.0123 2.948
12 4.7472 3.8853 3.4903 3.2592 3.1059 2.9961 2.9134 2.8486
13 4.6672 3.8056 3.4105 3.1791 3.0254 2.9153 2.8321 2.7669
14 4.6001 3.7389 3.3439 3.1122 2.9582 2.8477 2.7642 2.6987
15 4.5431 3.6823 3.2874 3.0556 2.9013 2.7905 2.7066 2.6408
16 4.494 3.6337 3.2389 3.0069 2.8524 2.7413 2.6572 2.5911
17 4.4513 3.5915 3.1968 2.9647 2.81 2.6987 2.6143 2.548
18 4.4139 3.5546 3.1599 2.9277 2.7729 2.6613 2.5767 2.5102
19 4.3807 3.5219 3.1274 2.8951 2.7401 2.6283 2.5435 2.4768
20 4.3512 3.4928 3.0984 2.8661 2.7109 2.599 2.514 2.4471
21 4.3248 3.4668 3.0725 2.8401 2.6848 2.5727 2.4876 2.4205
22 4.3009 3.4434 3.0491 2.8167 2.6613 2.5491 2.4638 2.3965
23 4.2793 3.4221 3.028 2.7955 2.64 2.5277 2.4422 2.3748
24 4.2597 3.4028 3.0088 2.7763 2.6207 2.5082 2.4226 2.3551
25 4.2417 3.3852 2.9912 2.7587 2.603 2.4904 2.4047 2.3371
26 4.2252 3.369 2.9752 2.7426 2.5868 2.4741 2.3883 2.3205
27 4.21 3.3541 2.9604 2.7278 2.5719 2.4591 2.3732 2.3053
28 4.196 3.3404 2.9467 2.7141 2.5581 2.4453 2.3593 2.2913
29 4.183 3.3277 2.934 2.7014 2.5454 2.4324 2.3463 2.2783
30 4.1709 3.3158 2.9223 2.6896 2.5336 2.4205 2.3343 2.2662
40 4.0847 3.2317 2.8387 2.606 2.4495 2.3359 2.249 2.1802
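The values in the F table are consistent with an upper-tail area of 0.05; this is an inference from the numbers, since the table does not state its significance level here. One column can be spot-checked without special software: when df1 = 2, the upper-tail probability has the closed form P(F > f) = (1 + 2f/df2)^(-df2/2), which can be solved for f. The sketch below applies that formula; the function name and the default `alpha` are illustrative assumptions.

```python
def f_critical_df1_2(df2, alpha=0.05):
    """Upper-tail critical value of F with df1 = 2 and the given df2.

    Solves (1 + 2f/df2) ** (-df2/2) = alpha for f.
    Valid only for df1 = 2; other columns need the incomplete beta function.
    """
    return (df2 / 2.0) * (alpha ** (-2.0 / df2) - 1.0)


if __name__ == "__main__":
    # Compare with the column headed 2 in the table above.
    for df2 in (1, 2, 10, 40):
        print(df2, round(f_critical_df1_2(df2), 4))
```

A second check: each entry in the df1 = 1 column equals the square of the two-sided 0.05 critical value of t with df2 degrees of freedom. For df2 = 10, 2.22814 squared is about 4.9646.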
