
OLS Data Analysis in R

Dino Christenson & Scott Powell, Ohio State University, November 20, 2007

Introduction to R: Outline
I. Data Description
II. Data Analysis
  i. Command functions
  ii. Hand-rolling
III. OLS Diagnostics & Graphing
IV. Functions and loops
V. Moving forward

11/20/2007

Christenson & Powell: Intro to R

Data Analysis: Descriptive Stats
R has several built-in commands for describing data
The list() command can output all elements of an object

Data Analysis: Descriptive Stats
The summary() command can be used to describe all variables contained within a data frame
The summary() command can also be used with individual variables
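A minimal sketch of both uses of summary(). The data frame `dat` and its values are simulated stand-ins (an assumption, not the data from the talk); only the variable names echo the slides.

```r
# Simulated stand-in data; names mirror the slides, values are made up.
set.seed(1)
dat <- data.frame(
  income   = runif(22, 30, 50),   # hypothetical: income in thousands of dollars
  presvote = runif(22, 35, 65)    # hypothetical: presidential vote share
)

summary(dat)          # min, quartiles, median and mean for every column
s <- summary(dat$income)   # the same summary for a single variable
s
```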

Data Analysis: Descriptive Stats
Simple plots can also provide familiarity with the data
The hist() command produces a histogram for any given data values

Data Analysis: Descriptive Stats
Simple plots can also provide familiarity with the data
The plot() command can produce both univariate and bivariate plots for any given objects
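A short sketch of hist() and plot() on simulated values. The vectors `income` and `repvshr` are assumptions made up for illustration; they are not the data used in the talk.

```r
# Simulated stand-in vectors (assumed, not the original data).
set.seed(1)
income  <- runif(22, 30, 50)                      # hypothetical: thousands of dollars
repvshr <- 20 + 0.8 * income + rnorm(22, sd = 8)  # hypothetical vote share

h <- hist(income)        # histogram of a single variable
plot(income)             # univariate plot: values against their index
plot(income, repvshr)    # bivariate plot: scatterplot of two variables
```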

Data Analysis: Descriptive Stats
Other Useful Commands
sum, mean, var, sd, range, min, max, median, di, cor, summary

Data Analysis: Regression
As mentioned above, one of the big perks of using R is flexibility.
R comes with its own canned linear regression command: lm(y ~ x)
However, we're going to use R to make our own OLS estimator. Then we will compare with the canned procedure, as well as Stata.

Data Analysis: Regression
First, let's take a look at our code for the hand-rolled OLS estimator
The Holy Grail: $(X'X)^{-1}X'Y$
We need a single matrix of independent variables
The cbind() command takes the individual variable vectors and combines them into one x variable matrix
A 1 is included as the first element to account for the constant.

Data Analysis: Regression
With the x and y matrices complete, we can now manipulate them to produce coefficients.
After performing the divine multiplication, we can observe the estimates by entering the object name (in this case b).
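The hand-rolled steps can be sketched roughly as follows. The original code slide did not survive extraction, so this is a reconstruction under assumptions: the data are simulated, and income is measured in thousands of dollars to keep the matrix inversion well conditioned. The object names x, y and b follow the slides.

```r
# Simulated stand-in data (assumed; not the data from the talk).
set.seed(1)
n <- 22
income   <- runif(n, 30, 50)    # hypothetical: thousands of dollars
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 10 + 0.5 * income + 0.4 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

# One matrix of independent variables; the leading 1 becomes the constant column.
x <- cbind(1, income, presvote, pressup)
y <- as.matrix(repvshr)

# b = (X'X)^{-1} X'y
b <- solve(t(x) %*% x) %*% t(x) %*% y
b   # entering the object name prints the estimates
```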


Data Analysis: Regression
To find the standard errors, we need to compute both the variance of the residuals and the cov matrix of the x's.
The sqrt of the diagonal elements of this var-cov matrix will give us the standard errors.
Other test statistics can be easily computed.
View the standard errors.
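A sketch of the standard-error computation described here, again on simulated data (an assumption). The names e, s2, vcov.b, se, t.stat and p.val are illustrative choices, not necessarily those used in the original code.

```r
# Simulated stand-in data (assumed; not the data from the talk).
set.seed(1)
n <- 22
income   <- runif(n, 30, 50)
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 10 + 0.5 * income + 0.4 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

x <- cbind(1, income, presvote, pressup)
y <- as.matrix(repvshr)
b <- solve(t(x) %*% x) %*% t(x) %*% y

e      <- y - x %*% b                        # residuals
s2     <- sum(e^2) / (nrow(x) - ncol(x))     # residual variance with n - k degrees of freedom
vcov.b <- s2 * solve(t(x) %*% x)             # variance-covariance matrix of the estimates
se     <- sqrt(diag(vcov.b))                 # standard errors
t.stat <- b / se                             # t statistics
p.val  <- 2 * pt(abs(t.stat), df = nrow(x) - ncol(x), lower.tail = FALSE)

se   # view the standard errors
```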


Data Analysis: Regression
Time to Compare
Use the lm() command to estimate the model using R's canned procedure
As we can see, the estimates are very similar
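One way to line up the two sets of estimates, sketched on simulated data (an assumption). The model formula matches the ols.model1 call shown later in the slides.

```r
# Simulated stand-in data (assumed; not the data from the talk).
set.seed(1)
n <- 22
income   <- runif(n, 30, 50)
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 10 + 0.5 * income + 0.4 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

# Hand-rolled estimator
x <- cbind(1, income, presvote, pressup)
b <- solve(t(x) %*% x) %*% t(x) %*% repvshr

# Canned procedure
ols.model1 <- lm(repvshr ~ income + presvote + pressup)

# Side-by-side comparison of the coefficients
comparison <- cbind(hand.rolled = as.vector(b), lm = coef(ols.model1))
comparison
```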

Data Analysis: Regression
Time to Compare
We can also see how both the hand-rolled and canned OLS procedures stack up to Stata
Use the reg command to estimate the model
As we can see, the estimates are once again very similar

Data Analysis: Regression
Other Useful Commands
lm - Linear Model
glm - Generalized Linear Model
lme - Mixed Effects
multinom - Multinomial Logit
anova
optim - General Optimizer

OLS Diagnostics in R
Post-estimation diagnostics are key to data analysis
  We want to make sure we estimated the proper model
  Besides, Irfan will hurt you if you neglect to do this
Furthermore, diagnostics allow us the opportunity to show off some of R's graphs
  R's real strength is that it has virtually unlimited graphing capabilities
  Of course, such strength on R's part is dependent on your knowledge of both R and statistics
  Still, with just some basics we can do some cool graphs

OLS Diagnostics in R
What could be unjustifiably driving our data?
  Outlier: unusual observation
  Leverage: ability to change the slope of the regression line
  Influence: the combined impact of strong leverage and outlier status
  According to John Fox, influence = leverage * outliers


OLS Diagnostics: Leverage
Recall our ols model
ols.model1 <- lm(formula = repvshr ~ income + presvote + pressup)

Our measure of leverage is the $h_i$ or hat value
  It's just the predicted values written in terms of $h_i$
  Where $h_{ij}$ is the contribution of observation $Y_i$ to the fitted value $\hat{Y}_j$
  If $h_{ij}$ is large, then the ith observation has a significant impact on the jth fitted value
  So, skipping the formulas, we know that the larger the hat value the greater the leverage of that observation
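For readers who do want the formula that was skipped: the hat values are the diagonal of the hat matrix $H = X(X'X)^{-1}X'$, which maps the observed Y to the fitted values. The sketch below checks this against hatvalues() on simulated data (an assumption); H and h are illustrative names.

```r
# Simulated stand-in data (assumed; not the data from the talk).
set.seed(1)
n <- 22
income   <- runif(n, 30, 50)
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 10 + 0.5 * income + 0.4 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

ols.model1 <- lm(formula = repvshr ~ income + presvote + pressup)
x <- cbind(1, income, presvote, pressup)

H <- x %*% solve(t(x) %*% x) %*% t(x)   # hat matrix: fitted values = H %*% y
h <- diag(H)                            # hat values h_i

# The diagonal matches R's built-in hatvalues()
all.equal(h, unname(hatvalues(ols.model1)))

# The average hat value equals (number of columns of x) / (number of rows of x)
mean(h)
ncol(x) / nrow(x)
```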


OLS Diagnostics: Leverage
Find the hat values
hatvalues(ols.model1)

Calculate the average hat value
avg.mod1 <- ncol(x)/nrow(x)

OLS Diagnostics: Leverage
But a picture is worth a hundred numbers?
Graph the hat values with lines for the average, twice the avg (large samples) and three times the avg (small samples) hat values
plot(hatvalues(ols.model1))
abline(h=1*(ncol(x))/nrow(x))
abline(h=2*(ncol(x))/nrow(x))
abline(h=3*(ncol(x))/nrow(x))
identify(hatvalues(ols.model1))
identify lets us select the data points in the new graph
[Figure: index plot of hatvalues(ols.model1) with reference lines at one, two and three times the average hat value]
State #2 is over twice the avg
Nothing above three times

OLS Diagnostics: Outliers
Can we find any data points that are unusual for Y given the Xs?
Use studentized residuals
  $u_i^* = \dfrac{u_i}{\hat{\sigma}_{u(-i)}\sqrt{1 - h_i}}$
  We can see whether there is a significant change in the model
  If their absolute values are larger than 2, then the corresponding observations are likely to be outliers
rstudent(ols.model1)
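The studentized residual can be rebuilt from the ordinary residuals and the hat values. The sketch below does this on simulated data (an assumption) and compares the result with rstudent(); s2.del and u.star are illustrative names, and the leave-one-out variance uses the standard deletion identity rather than refitting the model n times.

```r
# Simulated stand-in data (assumed; not the data from the talk).
set.seed(1)
n <- 22
income   <- runif(n, 30, 50)
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 10 + 0.5 * income + 0.4 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

ols.model1 <- lm(formula = repvshr ~ income + presvote + pressup)

u <- resid(ols.model1)            # ordinary residuals u_i
h <- hatvalues(ols.model1)        # hat values h_i
p <- length(coef(ols.model1))     # number of estimated coefficients

# Residual variance with observation i left out
s2.del <- (sum(u^2) - u^2 / (1 - h)) / (n - p - 1)

# Studentized residuals: u_i / (sigma_(-i) * sqrt(1 - h_i))
u.star <- u / sqrt(s2.del * (1 - h))

all.equal(u.star, rstudent(ols.model1))
which(abs(u.star) > 2)            # candidate outliers by the rule of thumb
```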


OLS Diagnostics: Outliers
Again, let's plot them with lines for 2 & -2
States 2 and 3 appear to be outliers, or darn close
We should definitely take a look at what makes these states unusual
  Perhaps there is a mistake in data entry
  Perhaps the model is misspecified in terms of functional form (forthcoming) or omitted vars
  Maybe you can throw out your bad observation
  If you must include the bad observation, try robust regression
[Figure: index plot of rstudent(ols.model1) with reference lines at 2 and -2]

OLS Diagnostics: Influence
Cook's D gives a kind of summary for each observation's influence
  $D_i = \dfrac{u_i'^2}{k + 1} \times \dfrac{h_i}{1 - h_i}$, where $u_i'$ is the standardized residual
If Cook's D is greater than 4/(n - k - 1), then the observation is said to exert undue influence
Let's just plot it
plot(cookd(ols.model1))
abline(h=4/(nrow(x)-ncol(x)))
identify(cookd(ols.model1))
States 2 and (maybe) 3 are in the trouble zone
[Figure: index plot of cookd(ols.model1) with the cutoff line]
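A sketch of the Cook's D check using only base R. It swaps in cooks.distance() from the stats package in place of cookd(), and runs on simulated data (an assumption); d, d.hand and cutoff are illustrative names.

```r
# Simulated stand-in data (assumed; not the data from the talk).
set.seed(1)
n <- 22
income   <- runif(n, 30, 50)
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 10 + 0.5 * income + 0.4 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

ols.model1 <- lm(formula = repvshr ~ income + presvote + pressup)

d <- cooks.distance(ols.model1)       # base-R Cook's D
k <- length(coef(ols.model1)) - 1     # number of regressors, excluding the constant
cutoff <- 4 / (n - k - 1)

plot(d, ylab = "Cook's D")
abline(h = cutoff, lty = 2)
which(d > cutoff)                     # observations above the rule-of-thumb cutoff

# The same quantity from the formula: standardized residual and hat value
r <- rstandard(ols.model1)
h <- hatvalues(ols.model1)
d.hand <- r^2 / (k + 1) * h / (1 - h)
all.equal(d, d.hand)
```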

OLS Diagnostics: Influence
For a host of measures of influence, including dfbetas and dffits
influence.measures(ols.model1)

dfbeta gives the influence of an observation on the coefficients, or the change in an IV's coefficient caused by deleting a single observation
Simple commands for partial regression plots can be found on Fox's website
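To make the "deleting a single observation" idea concrete, the sketch below refits the model without the first observation and sets the coefficient change next to the first row of dfbeta(). The data frame dat is simulated (an assumption), and refit and change are illustrative names.

```r
# Simulated stand-in data (assumed; not the data from the talk).
set.seed(1)
n <- 22
dat <- data.frame(
  income   = runif(n, 30, 50),
  presvote = runif(n, 35, 65),
  pressup  = runif(n, 65, 95)
)
dat$repvshr <- 10 + 0.5 * dat$income + 0.4 * dat$presvote +
  0.1 * dat$pressup + rnorm(n, sd = 8)

ols.model1 <- lm(repvshr ~ income + presvote + pressup, data = dat)

infl <- influence.measures(ols.model1)   # dfbetas, dffits, Cook's D, hat values, ...
summary(infl)                            # lists only the observations R flags

db <- dfbeta(ols.model1)                 # one row per observation, one column per coefficient

# Refit without observation 1 and compute the change in each coefficient
refit  <- lm(repvshr ~ income + presvote + pressup, data = dat[-1, ])
change <- coef(ols.model1) - coef(refit)

# The two rows should agree in magnitude
rbind(dfbeta = db[1, ], refit.change = change)
```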

OLS Diagnostics: Normality
Is our data distributed normally? Was it correct to use a linear model?
Use a quantile plot (qq plot) to check
  Plots the empirical quantiles of the studentized residuals against the quantiles of a normal distribution
  Looking for obs on a straight line
  In R it is simple to plot the error bands as well
  Deviation requires us to transform our variables
qq.plot(ols.model1, distribution="norm")
The problems are again 2 and 13, with 3, 22 and 14 bordering on trouble this time around
[Figure: normal quantile plot of the studentized residuals of ols.model1, with observations 2, 14, 22, 3 and 13 labeled]
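A base-R approximation of the quantile plot, for readers without the package that provides qq.plot(). It uses qqnorm() and qqline() on the studentized residuals and does not draw the error bands; the data are simulated (an assumption).

```r
# Simulated stand-in data (assumed; not the data from the talk).
set.seed(1)
n <- 22
income   <- runif(n, 30, 50)
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 10 + 0.5 * income + 0.4 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

ols.model1 <- lm(formula = repvshr ~ income + presvote + pressup)

r  <- rstudent(ols.model1)
qq <- qqnorm(r, ylab = "Studentized residuals")   # empirical vs normal quantiles
qqline(r)                                         # reference line through the quartiles
```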

OLS Diagnostics: Normality
A simple density plot of the studentized residuals helps to determine the nature of our data
The apparent deviation from the normal curve is not severe, but there certainly seems to be a slight negative skew
[Figure: density plot of rstudent(ols.model1); N = 22, bandwidth = 0.4217]

OLS Diagnostics: Error Variance
We can also easily look for heteroskedasticity
Plotting the residuals against the fitted values and the continuous independent variables lets us examine our statistical model for the presence of unbalanced error variance
par(mfrow=c(2,2))
plot(resid(ols.model1)~fitted.values(ols.model1))
plot(resid(ols.model1)~income)
plot(resid(ols.model1)~presvote)
plot(resid(ols.model1)~pressup)
[Figure: 2x2 panel of resid(ols.model1) plotted against fitted.values(ols.model1), income, presvote and pressup]

OLS Diagnostics: Error Variance
Formal tests for heteroskedasticity are available from the lmtest library
library(lmtest)
bptest(ols.model1) will give you the Breusch-Pagan test stat
gqtest(ols.model1) will give you the Goldfeld-Quandt test stat
hmctest(ols.model1) will give you the Harrison-McCabe test stat
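To show what a Breusch-Pagan style statistic measures, here is a by-hand sketch in base R: regress the squared residuals on the regressors and use n times the R-squared of that auxiliary regression. This is the studentized (Koenker) form of the test; whether it matches the default output of bptest() exactly is an assumption worth checking against lmtest itself. The data are simulated, and aux, bp and p.value are illustrative names.

```r
# Simulated stand-in data (assumed; not the data from the talk).
set.seed(1)
n <- 22
income   <- runif(n, 30, 50)
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 10 + 0.5 * income + 0.4 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

ols.model1 <- lm(formula = repvshr ~ income + presvote + pressup)

u2  <- resid(ols.model1)^2                        # squared residuals
aux <- lm(u2 ~ income + presvote + pressup)       # auxiliary regression

bp      <- n * summary(aux)$r.squared             # test statistic
df      <- length(coef(aux)) - 1                  # one degree of freedom per regressor
p.value <- pchisq(bp, df = df, lower.tail = FALSE)

c(BP = bp, df = df, p.value = p.value)
```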


OLS Diagnostics: Collinearity
Finally, let's look out for collinearity
To get the variance inflation factors
vif(ols.model1)

Let's look at the condition index from the perturb library
library(perturb)
colldiag(ols.model1)

Issue here is the largest condition index
If it is larger than 30, Houston, we have...
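The variance inflation factor for a regressor is 1 / (1 - R²), where R² comes from regressing that regressor on the others. The base-R sketch below computes it by hand on simulated data (an assumption), so it does not depend on the package that supplies vif(); dat, preds and vif.hand are illustrative names.

```r
# Simulated stand-in data (assumed; not the data from the talk).
set.seed(1)
n <- 22
dat <- data.frame(
  income   = runif(n, 30, 50),
  presvote = runif(n, 35, 65),
  pressup  = runif(n, 65, 95)
)
preds <- c("income", "presvote", "pressup")

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest
vif.hand <- sapply(preds, function(v) {
  others <- setdiff(preds, v)
  r2 <- summary(lm(reformulate(others, response = v), data = dat))$r.squared
  1 / (1 - r2)
})
vif.hand

# Cross-check: the VIFs are the diagonal of the inverse correlation matrix
diag(solve(cor(dat[, preds])))
```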



OLS Diagnostics: Shortcut
My favorite shortcut command to get you four essential diagnostic plots after you run your model
plot(ols.model1, which=1:4)
Now you have no excuse not to run some diagnostics!
Btw, look at the high residuals in the rvf plot for 14, 13 and 3, suggesting outliers
[Figure: four diagnostic panels for ols.model1: Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook's distance]

The Final Act: Loops and Functions
As was mentioned above, R's biggest asset is its flexibility. Loops and functions directly utilize this asset.
Loops can be implemented for a number of purposes, essentially whenever repeated actions are needed (e.g. simulations).
Functions allow us to create our own commands. This is especially useful when a canned procedure does not exist.
We will create our own OLS function with the hand-rolled code used earlier.

Loops
for loops are the most common and the only type of loop we will look at today.
The first loop command at the right shows simple loop iteration.
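The loop shown on the original slide did not survive extraction, so the following is a stand-in of the same kind: a minimal for loop that visits each value of an index and stores a result. The vector name squares is a made-up example.

```r
# A simple for loop: i takes the values 1, 2, ..., 5 in turn
squares <- numeric(5)          # pre-allocate a vector for the results
for (i in 1:5) {
  squares[i] <- i^2            # store the square of the current index
  print(i)                     # show which iteration we are on
}
squares
```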

Loops
However, we can also see how loops can be a little more useful.
The second example at right (although inefficient) calculates the mean of income
Note how the index accesses elements of the income vector.
Loops and Monte Carlo
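A sketch of what such a loop might look like, since the original code is not recoverable from the slide. The income vector is simulated (an assumption), and total and loop.mean are illustrative names.

```r
# Simulated stand-in for the income vector (assumed values).
set.seed(1)
income <- runif(22, 30, 50)

total <- 0
for (i in seq_along(income)) {
  total <- total + income[i]     # the index i picks out one element at a time
}
loop.mean <- total / length(income)

loop.mean
mean(income)    # the built-in function gives the same answer in one step
```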


Functions
Now we will make our own linear regression function using our hand-rolled OLS code
Functions require inputs (which are the objects to be utilized) and arguments (which are the commands that the function performs)
The actual estimation procedure does not change. However, some changes are made.

Functions
First, we have to tell R that we are creating a function. We'll name it ols.
This lets us generalize the procedure to multiple objects.
Second, we have to tell the function what we want returned or what we want the output to look like.
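One possible shape for such a function, sketched under assumptions: the original ols code is not recoverable from the slides, so the argument names (y, x), the returned table layout and the simulated data are all illustrative choices.

```r
# A hand-rolled OLS function: y is the outcome vector, x a matrix of regressors.
ols <- function(y, x) {
  nm <- colnames(x)
  if (is.null(nm)) nm <- paste0("x", seq_len(ncol(x)))   # fallback names
  x <- cbind(1, x)                                  # add the constant
  xtx.inv <- solve(t(x) %*% x)
  b  <- as.vector(xtx.inv %*% t(x) %*% y)           # coefficients
  e  <- y - as.vector(x %*% b)                      # residuals
  s2 <- sum(e^2) / (nrow(x) - ncol(x))              # residual variance
  se <- sqrt(diag(s2 * xtx.inv))                    # standard errors
  # Tell the function what to return: a small coefficient table
  out <- cbind(Estimate = b, Std.Error = se, t.value = b / se)
  rownames(out) <- c("(Intercept)", nm)
  out
}

# Simulated stand-in data (assumed; not the data from the talk).
set.seed(1)
n <- 22
income   <- runif(n, 30, 50)
presvote <- runif(n, 35, 65)
pressup  <- runif(n, 65, 95)
repvshr  <- 10 + 0.5 * income + 0.4 * presvote + 0.1 * pressup + rnorm(n, sd = 8)

fit <- ols(repvshr, cbind(income, presvote, pressup))
fit
```

Because the function only needs an outcome vector and a regressor matrix, the same call works for any other set of variables.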


Functions
OLS: Hand-rolled vs Function

Functions
Implementing our new function ols, we get precisely the output that we asked for.
We can check this against the results produced by the standard lm function.


Favorite Resources
Invaluable Resources online
  The R manuals: http://cran.r-project.org/manuals.html
  Fox's slides: http://socserv.mcmaster.ca/jfox/Courses/R-course/index.html
  Faraway's book: http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
  Anderson's ICPSR lectures using R: http://socserv.mcmaster.ca/andersen/icpsr.html
  Arai's guide: http://people.su.se/~ma/R_intro/
  UCLA notes: http://www.ats.ucla.edu/stat/SPLUS/default.htm
  Keele's intro guide: http://www.polisci.ohio-state.edu/faculty/lkeele/RIntro.pdf

Great R books
  Verzani's book: http://www.amazon.com/Using-Introductory-Statistics-John-Verzani/dp/1584884509
  Maindonald and Braun's book: http://www.amazon.com/Data-Analysis-Graphics-Using-R/dp/0521813360
