Correlation and Simple Regression
A. Introduction
B. Correlation
C. Significance of a Correlation Coefficient
D. How to Interpret a Correlation Coefficient
E. Partial Correlations
F. Linear Regression
G. Partitioning the Total Sum of Squares
H. Producing a Scatter Plot and the Regression Line
I. Adding a Quadratic Term to the Regression Equation
J. Transforming Data
INTRODUCTION
A common statistic for indicating the strength of a linea
two continuous variables is called the Pearson correla
just correlation coefficient (there are other types of co
son is the most commonly used). The correlation coeffi
-1 to +1. A positive correlation means that as values o
of the other variable also tend to increase. A small or z
us that the two variables are unrelated. Finally, a negati
an inverse relationship between the variables: as one go
This chapter also deals with simple regression (o
gression bears some similarity to correlation but is usefu
variable, given the value of another. Regression also yie
between the two variables, not just the strength of as
(more than one independent variable) is covered in Cha
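For reference, the Pearson correlation coefficient described above is conventionally defined as follows. This is the standard textbook definition, supplied here as a reminder rather than taken from the chapter; the symbols x_i, y_i, and n (the number of paired observations) are notation assumed for this sketch.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le +1
```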
DATA CORR_EG;
INPUT GENDER $ HEIGHT WEIGHT AGE;
DATALINES;
M 68 155 23
F 61 99 20
M 70 205 45
M 69 170 .
F 65 125 30
M 72 220 48
;
Example of a Correlation Matrix
The CORR Procedure
Variables: HEIGHT WEIGHT AGE
Simple Statistics
Variable       Minimum       Maximum
HEIGHT        61.00000      72.00000
WEIGHT        99.00000     220.00000
AGE           20.00000      48.00000
7
AGE        0.86614    0.92496    1.00000
           0.0257     0.0082
           6          6
This notation can save considerable computation time.
C. SIGNIFICANCE OF A CORRELATION COEFFICIENT
You may ask, "How large a correlation coefficient do I need to show that two variables are correlated?" Each time PROC CORR prints a correlation coefficient, it also prints a probability associated with the coefficient. That number gives the probability of obtaining a sample correlation coefficient as large as or larger than the one obtained by chance alone (i.e., when the variables in question actually have zero correlation).
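As a minimal sketch of where these probabilities appear, the step below requests correlations among the three numeric variables of the height/weight example. It assumes the data set is named CORR_EG, as in the later PROC REG and PROC GPLOT steps of this chapter.

```sas
* Assumes the CORR_EG data set (GENDER, HEIGHT, WEIGHT, AGE) already exists;
PROC CORR DATA=CORR_EG;
   TITLE "Example of a Correlation Matrix";
   VAR HEIGHT WEIGHT AGE;
RUN;
```

In the printed matrix, each cell shows the correlation coefficient, the probability beneath it, and the number of observations used when that number differs between pairs.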
cients between all pairs of
esults in a very large num-
of correlation coefficients
EST=number option with
cending order.
of variables and another
le. PROC CORR will then
in the WITH list against
the variables IQ and GPA
o recorded a student score
rrelation between the IQ

relationship; it does not imply causality. For example, w
positive correlation between the number of hospitals i
the number of household pets in each state. Does this
sick and therefore make more hospitals necessary? Dou
planation is that both variables (number of pets and nu
to population size.
Being SIGNIFICANT is not the same as being IMP
is, knowing the significance of the correlation coefficie
Once we know that our correlation coefficient is signifi
need to look further in interpreting the importance of th
moment to explain what we mean by the significance of
pose we have a population of x and y values in which th
imagine a plot of this population as shown in the followin

Correlation with Population Correla
[Figure: scatter plot of the population of x and y values]
D. HOW TO INTERPRET A CORRELATION COEFFICIENT
One of the best ways to interpret a correlation coefficient (r) is to look at the square of the coefficient (r-squared); r-squared can be interpreted as the proportion of variance in one of the variables that can be explained by variation in the other variable. As an example, our height/weight correlation is .97. Thus, r-squared is .94. We can say that 94% of the variation in weights can be explained by variation in height (or vice versa). Also, (1 - .94), or 6%, of the variance of weight is due to factors other than height variation. With this interpretation in mind, if we examine a correlation coefficient of .4, we see that only .16, or 16%, of the variance of one variable is explained by variation in the other.
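The arithmetic behind these two cases can be written out as a short worked example, using only the values quoted above:

```latex
\begin{aligned}
r = .97 &\;\Rightarrow\; r^2 = (.97)^2 = .9409 \approx .94, \qquad 1 - r^2 \approx .06 \\
r = .40 &\;\Rightarrow\; r^2 = (.40)^2 = .16, \qquad\qquad\;\; 1 - r^2 = .84
\end{aligned}
```

A correlation of .4 therefore leaves 84% of the variance unexplained, even if it is statistically significant.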
(say blood pressure and heart rate) for each of the 20 su
a correlation coefficient using 20 pairs of data points. Su
measurements of blood pressure and heart rate for eac
valid correlation coefficient using all 200 data points (1
because the 10 points for each subject are not independ
to compute the mean blood pressure and heart rate for
values to compute the correlation coefficient. Having d
we correlated mean values when we are interpreting th
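One way the approach described above (average within each subject, then correlate the averages) might be carried out is sketched below. The data set name REPEATED and the variable names SUBJECT, SBP, and HR are hypothetical; they are not taken from the chapter.

```sas
* Hypothetical input: data set REPEATED with one row per measurement
  and variables SUBJECT, SBP (blood pressure), and HR (heart rate);
PROC MEANS DATA=REPEATED NWAY NOPRINT;
   CLASS SUBJECT;                        * one output row per subject;
   VAR SBP HR;
   OUTPUT OUT=SUBJ_MEANS MEAN=MEAN_SBP MEAN_HR;
RUN;

* Correlate the per-subject means: one independent pair per subject;
PROC CORR DATA=SUBJ_MEANS;
   VAR MEAN_SBP MEAN_HR;
RUN;
```

With 20 subjects, the resulting coefficient is based on 20 independent pairs rather than 200 dependent ones.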
E. PARTIAL CORRELATIONS
F. LINEAR REGRESSION
Given a person's height, what would be his or her predicted weight? How can we best define the relationship between height and weight? By studying the following graph, we see that the relationship is approximately linear. That is, we can imagine drawing a straight line on the graph with most of the data points being only a short distance from the line. The vertical distance from each data point to this line is called a residual.
[Figure: scatter plot of WEIGHT (vertical axis) versus HEIGHT (horizontal axis, 61 to 72)]
MODEL dependent variable(s) = independent-
RUN;
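For the height/weight data, the general form might be filled in as sketched below. The data set name CORR_EG and the title are taken from elsewhere in this chapter; the QUIT statement is added here only to end the interactive procedure.

```sas
* Regress WEIGHT (dependent) on HEIGHT (independent);
PROC REG DATA=CORR_EG;
   TITLE "Regression Line for Height-Weight Data";
   MODEL WEIGHT = HEIGHT;
RUN;
QUIT;
```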
Regression Line for Height-Weight Data
The REG Procedure
Model: MODEL1
Dependent Variable: WEIGHT
Number of Observations Read
Number of Observations Used
Analysis of Variance
                            Sum of    M
Source                     Squares    Squ
Model              1         11880    11
Error                    703.38705    140.67
Corrected Total    6         12584
Variable       Estimate       Error    t Value    Pr > |t|
Intercept    -592.64458    81.54217      -7.27      0.0008
HEIGHT         11.19127     1.21780       9.19      0.0003
y = a + bx
Given any height, we can now predict the weight. For example, the predicted weight of a 70-inch-tall person is:
WEIGHT = -592.64 + 11.19 x 70 = 190.66 lb.
defined by our height/weight
For example, the predicted
190.66 lb.
m labeled "Standard Error,"
probabilities (Pr > |t|) test
the
true slope and intercept
hance alone, a value as large
, error can be thought of in
ects the accuracy with which
of the slope is 1.22. We can
taking two (approximately)
dence interval for our slope
r example is small (n = 7),
ard errors above and below
n is less than 30.) Going to
n - 2 and level of signifi-
p = .05 is 2.57. Our 95%
= 3.14.
put, we see values for Root
tr. Root MSE is the square
n of the residuals. R-square

G. PARTITIONING THE TOTAL SUM OF SQUARES
The top portion of the output from PROC REG presen
variance table for the regression. It takes the variation
breaks it out into various sources. To understand this ta
weight. That weight can be thought of as the mean wei
minus) a certain amount; because the individual is taller
regression tells us that taller people are heavier. Finally
to the fact that the prediction is less than perfect. So we
down due to the regression, and then up or down again
analysis-of-variance table breaks these components apa
of each through the sum of squares.
The total sum of squares is the sum of squared dev
from the grand mean. This total sum of squares (SS) ca
the sum of squares due to regression (or model) and th
times called residual). One portion, called the Sum of S
put, reflects the deviation of each weight from the PR
portion reflects deviations between the PREDICTED
called the sum of squares due to the MODEL in the out
Square is the sum of squares divided by the degrees of f
seven data points. The total degrees of freedom is equal
has 1 df. The error degrees of freedom is the total df min
5. We can think of the Mean Square as the respective va
deviation) of each of these two portions of variation. O
deviations about the regression line are small (error mea
viation between the predicted values and the mean (m
have a good regression line. To compare these mean squ
F = (Mean square model) / (Mean square error)
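As a worked check, the partition can be traced with the values printed in the analysis-of-variance table for the height/weight regression. The sums of squares are shown rounded in that output, so the results below are approximate; the error degrees of freedom (5) follow from 6 total df minus 1 model df.

```latex
\begin{aligned}
SS_{\text{total}} &= SS_{\text{model}} + SS_{\text{error}}
   &&\Rightarrow\; 12584 \approx 11880 + 703.39 \\
MS_{\text{error}} &= \frac{SS_{\text{error}}}{df_{\text{error}}}
   = \frac{703.38705}{5} \approx 140.68 \\
F &= \frac{MS_{\text{model}}}{MS_{\text{error}}}
   \approx \frac{11880}{140.68} \approx 84.4 \\
R^2 &= \frac{SS_{\text{model}}}{SS_{\text{total}}}
   \approx \frac{11880}{12584} \approx .944
\end{aligned}
```

With a single independent variable, F equals the square of the t value for the slope (9.19 squared is about 84.5), which agrees with the ratio above up to rounding.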
H. PRODUCING A SCATTER PLOT AND THE REGRESSION LINE
To plot our height/weight data, we can use PROC GPLOT as follows:
PROC GPLOT DATA=CORR_EG;
PLOT WEIGHT*HEIGHT;
RUN;
If you want to override the default plotting symbol (a plus sign), you can add a SYMBOL statement before running the PROC GPLOT, like this:
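A minimal sketch of such a statement, assuming the DOT symbol and the CORR_EG data set used elsewhere in this section:

```sas
* Use large dots instead of the default plus sign;
SYMBOL VALUE=DOT;
PROC GPLOT DATA=CORR_EG;
   PLOT WEIGHT*HEIGHT;
RUN;
QUIT;
```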
Some other plotting symbols besides DOT (large black dots) that are useful are CIRCLE, SQUARE, TRIANGLE, and PLUS. To see a complete list of plotting symbols, enter SYMBOL in your SAS help window or see the Online Doc. You can also select VALUE=NONE if you also select an interpolation method (such as JOIN or SMOOTH) and you don't want to see individual data points on the plot.
There are two ways to have SAS show us the data points and the regression line. First, PROC REG allows the use of a PLOT statement. The form is:
PLOT y_variable * x_variable / options;
Y_variable and x_variable are the names of the variables to be plotted on the y- and x-axes, respectively. PROC REG will plot the data points and show the regression line. In addition, there are some special "variable names" that can be used with a PLOT statement. In particular, the names PREDICTED. and RESIDUAL. (the periods are part of these keywords) are used to plot predicted and residual values.
SYMBOL1 VALUE=DOT COLOR=BLACK;
PROC REG DATA=CORR_EG;
TITLE "Regression and Residual Plots";
MODEL WEIGHT = HEIGHT;
PLOT WEIGHT * HEIGHT
     RESIDUAL. * HEIGHT;
RUN;
QUIT;
[Figure: Regression and Residual Plots. Fitted line: WEIGHT = -592.64 + 11.191 HEIGHT; horizontal axis: HEIGHT]
By using the GPLOT procedure, you can show the scatter plot of WEIGHT by HEIGHT, the regression line, and two types of 95% confidence intervals. The most common type of confidence interval about a regression line is an interval that indicates where you believe the mean value of y is most likely to lie. The other, less often used, confidence interval is an interval for individual data points. Limits of this interval determine the likely location of individual y-values. The following PROC GPLOT statements will produce a scatter plot of WEIGHT by HEIGHT and both types of confidence intervals. As you can see, the confidence interval for the mean of y is much narrower than the one for individual data points.
GOPTIONS CSYMBOL=BLACK;
SYMBOL1 VALUE=DOT;
SYMBOL2 VALUE=NONE I=RLCLM95;
SYMBOL3 VALUE=NONE I=RLCLI95 LINE=3;
PROC GPLOT DATA=CORR_EG;
TITLE "Regression Lines and 95% CI's";
PLOT WEIGHT * HEIGHT = 1
     WEIGHT * HEIGHT = 2
     WEIGHT * HEIGHT = 3 / OVERLAY;
RUN;
QUIT;
When you run this model, you will get an r-squared of .9743, an improvement over the .9441 obtained with the linear model. The following residual plot shows that the distribution of the residuals is more random than the earlier plot. A caution or two is in order here. First, remember that this is an example with a very small data set and that the original correlation is quite high. Second, although it is possible to also enter cubic terms, etc., one should keep in mind that results need to be interpretable.
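A quadratic model of the kind discussed above might be set up as in the following sketch. The data set name CORR_EG2 and the variable name HEIGHT2 are hypothetical names chosen for this illustration; the statements that originally defined the model are not present in this section.

```sas
* Add a squared term to a copy of the height/weight data;
DATA CORR_EG2;
   SET CORR_EG;
   HEIGHT2 = HEIGHT**2;       * hypothetical name for the quadratic term;
RUN;

* Fit WEIGHT on HEIGHT and HEIGHT squared, then plot the residuals;
PROC REG DATA=CORR_EG2;
   MODEL WEIGHT = HEIGHT HEIGHT2;
   PLOT RESIDUAL. * HEIGHT;
RUN;
QUIT;
```

Comparing the residual plot from this model with the one from the linear model is what supports the statement that the residuals look more random.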
[Figure: residual plot for the model with the quadratic term; vertical axis from -7.5 to 12.5]
J. TRANSFORMING DATA
PROC GPLOT DATA=HEART;
PLOT HR*DOSE;
RUN;
PROC REG DATA=HEART;
MODEL HR = DOSE;
RUN;
lustrate some additional steps
here are data collected from

Investigating the Dose/HR Relationship
The REG Procedure
Model: MODEL1
Dependent Variable: HR
Source             DF    Sum of Squares    Mean Square
Model               1         233.48441      233.48441
Error               8          38.11559        4.76445
Corrected Total     9         271.60000
Parameter Estimates
                   Parameter    Standard
Variable     DF     Estimate       Error    t Value    Pr > |t|
Intercept           60.70833     1.04492      58.10      <.0001
DOSE                 0.44288     0.06326       7.00      0.0001
[Figure: Investigating the Dose/HR Relationship; HR (vertical axis) plotted against DOSE (horizontal axis)]
is written as
To create a new variable, which is the lo
DATA HEART;
INPUT DOSE HR @@;
LDOSE = LOG(DOSE);
LABEL LDOSE = "Log of Dose";
DATALINES;
2 60 2 58 4 63 4 62 8 67 8 65 16 70 16 70 32
;
SYMBOL VALUE=DOT;
PROC GPLOT DATA=HEART;
PLOT HR*LDOSE;
RUN;
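The regression on the transformed variable, whose output follows, could be requested with a step along these lines. This is a sketch that reuses the HEART data set and the LDOSE variable defined above; the title matches the one shown in the output.

```sas
* Regress heart rate on the natural log of dose;
PROC REG DATA=HEART;
   TITLE "Investigating the Dose/HR Relationship";
   MODEL HR = LDOSE;
RUN;
QUIT;
```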
Investigating the Dose/HR Relationship
The REG Procedure
Model: MODEL1
Dependent Variable: HR
Model               1    266.45000    266.45000    413.90    <.0001
Error               8      5.15000      0.64375
Corrected Total     9    271.60000
Parameter Estimates
                            Parameter    Standard
Variable    Label    DF      Estimate       Error    t Value    Pr >
[Figure: Investigating the Dose/HR Relationship; HR (vertical axis) plotted against Log of Dose (horizontal axis)]
5.1 Given the following data:
x Z
I 3 1
7 l-) 7
8 t2
3 4 t4
A
7 1
DATA EXAM;
INPUT (Q1-Q8)(1.);
DATALINES;
10101010
11111111
11110101
01100000
11110001
11111111
11111101
11111101
10110101
00010110
;
Starting with this data set, create a new SAS data set wit
raw score on the test (the sum of Q1 to Q8). Using this da
each of the eight questions with the total test score. This
correlation coefficient.
40 150
50 148
DATA SCORES;
DO SUBJECT = 1 TO 100;
IF RANUNI(1357) LT .5 THEN GROUP = 'A';
ELSE GROUP = 'B';
MATH = ROUND(RANNOR(1357)*20 + 550 + 10*(GROUP EQ 'A'));
SCIENCE = ROUND(RANNOR(1357)*15 + .4*MATH + 300);
ENGLISH = ROUND(RANNOR(1357)*20 + 500 + .05*SCIENCE +
.05*MATH);
SPELLING = ROUND(RANNOR(1357)*15 + 500 + .1*ENGLISH);
VOCAB = ROUND(RANNOR(1357)*5 + 400 + .1*SPELLING +
.2*ENGLISH);
PHYSICAL = ROUND(RANNOR(1357)*20 + 550);
OVERALL = ROUND(MEAN(MATH, SCIENCE, ENGLISH, SPELLING, VOCAB,
PHYSICAL));
OUTPUT;
END;
RUN;
(a) Generate a correlation matrix of all the test scores plus the overall score. Use the option to omit the simple statistics produced by PROC CORR.
(b) PHYSICAL is independent of all the other test scores but is correlated to OVERALL. Why?
(c) Write the statements to correlate each of the scores with OVERALL.
From the data for X and Y in Problem 5.1:
(a) Compute a regression line (Y on X). Y is the dependent variable, X the independent variable.
(b) What are the slope and intercept?
(c) Are they significantly different from zero?
5.6 (a) Using data set SCORES from Problem 5.4, compute a regression equation for predicting the SCIENCE score from the MATH score. Make one plot showing the data
can be explained by the fact
he correlation between SBP
CORES.

blood pressure (DBP). Create a SAS data set (DOSE_
compute two regression equations. One for SBP vers
DOSE. Produce one plot of SBP versus DOSE, anothe
SBP) versus DOSE, one of DBP versus DOSE, and fina
DOSE. Use the PLOT statement of PROC REG to do
Dose    Systolic Blood Pressure
4       180
4       190
4       178
8       170
8       180
8       168
16      160
16      172
16      170
32      140
32      130
32      128
5.9 Generate:
(a) A plot of Y versus X (data from Problem 5.1).
(b) A plot of the regression line and the original data o
5.10 Repeat Problem 5.8 using the natural log of dose (LOG
pare the fit statistics.
Which fit is better for SBP? Which
5.11 Given the data set:
COUNTY    POP    HOSPITAL    F
1          35     1
2          88     5
3           5     0
4          55     3
5          75     4
6         125
7         225     7
8         500    10
5.12 Repeat Problem 5.4 (a), except do the analysis separately for GROUP='A' and GROUP='B'. What is the effect on the magnitude of the correlations and the p-values?
5.13 What's wrong with the following program? (Note: There may be missing values for X, Y, and Z.)
1 DATA MANY_ERR;
2 INPUT X Y Z;
3 IF X LE 0 THEN X=1;
4 IF Y LE 0 THEN Y=1;
5 IF Z LE 0 THEN Z=1;
6 LOGX = LOG(X);
7 LOGY = LOG(Y);
8 LOGZ = LOG(Z);
9 DATALINES;
L23
.18
4 1"0
18LL
;
10 PROC CORR DATA=MANY_ERR / PEARSON SPEARMAN;
11 VAR X-LOGZ;
12 RUN;