Cody - Smith - Chap 5

Correlation and Simple Regression
A. Introduction
B. Correlation
C. Significance of a Correlation Coefficient
D. How to Interpret a Correlation Coefficient
E. Partial Correlations
F. Linear Regression
G. Partitioning the Total Sum of Squares
H. Producing a Scatter Plot and the Regression Line
I. Adding a Quadratic Term to the Regression Equation
J. Transforming Data
A. INTRODUCTION
A common statistic for indicating the strength of a linea
two continuous variables is called the Pearson correla
just correlation coefficient (there are other types of co
son is the most commonly used). The correlation coeffi
-1 to +1. A positive correlation means that as values o
of the other variable also tend to increase. A small or z
us that the two variables are unrelated. Finally, a negati
an inverse relationship between the variables: as one go
This chapter also deals with simple regression (o
gression bears some similarity to correlation but is usefu
variable, given the value of another. Regression also yie
between the two variables, not just the strength of as
(more than one independent variable) is covered in Cha
DATA CORR_EG;
   INPUT GENDER $ HEIGHT WEIGHT AGE;
DATALINES;
M 68 155 23
F 61 99 20
M 70 205 45
M 69 170 .
F 65 125 30
M 72 220 48
;
PROC CORR DATA=CORR_EG;
   TITLE "Example of a Correlation Matrix";
   VAR HEIGHT WEIGHT AGE;
RUN;
The output from this program
Example of a Correlation Matrix
The CORR Procedure
Variables: HEIGHT WEIGHT AGE
Simple Statistics
Variable    N    Mean         Std Dev     Sum
HEIGHT      7    66.85714     3.97612     468.00000
WEIGHT      7    155.57143    45.79613    1089
AGE         6    31.16667     12.41639    187.00000
Simple Statistics
Variable    Minimum     Maximum
HEIGHT      61.00000    72.00000
WEIGHT      99.00000    220.00000
AGE         20.00000    48.00000
7
WEIGHT      0.97165     1.00000     0.92496
            0.0003                  0.0082
AGE         0.86614     0.92496     1.00000
            0.0257      0.0082
            6           6
PROC CORR gives us some simple descriptives
VAR list along with a correlation matrix. If you examin
column in this matrix, you will find the correlation coeff
(the second number), and the number of data pairs us
(third number). If the number of data pairs is the sam
ables, the third number in each group is not printed. In
the top of the table.
In this listing, we see that the correlation betwe
.97165, and the significance level is .0003; the correlation
.86614 (p = .0257); the correlation between WEIGHT
Let's concentrate for a moment on the HEIGHT and W
p-value computed for this correlation indicates that it is
relation this large by chance if the sample of seven sub
tion whose correlation was zero. Remember that this i
this large are quite rare in social science data.
To generate a correlation matrix (correlations be
tion of the variables), use the following general syntax
PROC CORR options;
   VAR list-of-variables;
RUN;
The term "list-of-variables" should be replaced wi
rated by spaces. If no options are selected, Pearson corre
scriptive statistics are computed. As we discuss later, se
want simple statistics to be printed, you would submit the following code:
PROC CORR DATA=CORR_EG PEARSON SPEARMAN NOSIMPLE;
   TITLE "Example of a Correlation Matrix";
   VAR HEIGHT WEIGHT AGE;
RUN;
The CORR procedure will compute correlation coefficients between all pairs of
variables in the VAR list. If the list of variables is large, this results in a very large
number of coefficients. If you want to see only a limited number of correlation
coefficients (the ones with the highest absolute values), include the BEST=number
option with PROC CORR. This results in these correlations being listed in descending order.
If all you want are the correlations between a subset of variables and another
subset of variables in a data set, a WITH statement is available. PROC CORR will then
compute a correlation coefficient between every variable in the WITH list and
every variable in the VAR list. For example, suppose we had the variables IQ and GPA
(grade point average) in a data set called RESULTS. We also recorded a student's score
on 10 tests (TEST1-TEST10). If we want to see only the correlation between the IQ
and GPA versus each of the 10 tests, the syntax is:
PROC CORR DATA=RESULTS;
   VAR IQ GPA;
   WITH TEST1-TEST10;
RUN;
This notation can save considerable computation time.
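As a rough illustration of the BEST= option and the WITH statement, the sketch below builds a small simulated data set and runs PROC CORR both ways. The data set name RESULTS_DEMO, the seed, and every coefficient in the DATA step are assumptions made up for this example; only the variable names IQ, GPA, and TEST1-TEST10 come from the text.

```sas
*** Hypothetical data: 50 students with IQ, GPA, and 10 test scores;
DATA RESULTS_DEMO;
   ARRAY TEST{10} TEST1-TEST10;
   DO STUDENT = 1 TO 50;
      IQ  = ROUND(100 + 15*RANNOR(1357));
      GPA = ROUND(3.0 + 0.02*(IQ - 100) + 0.4*RANNOR(1357), .01);
      DO I = 1 TO 10;
         TEST{I} = ROUND(70 + 0.3*(IQ - 100) + 10*RANNOR(1357));
      END;
      OUTPUT;
   END;
   DROP I;
RUN;

*** For each variable, list only its three largest correlations;
PROC CORR DATA=RESULTS_DEMO NOSIMPLE BEST=3;
   VAR IQ GPA TEST1-TEST10;
RUN;

*** Correlate IQ and GPA with each test, and nothing else;
PROC CORR DATA=RESULTS_DEMO NOSIMPLE;
   VAR IQ GPA;
   WITH TEST1-TEST10;
RUN;
```

The second PROC CORR prints a 10-by-2 table (one row per WITH variable) instead of the full 12-by-12 matrix that the VAR list alone would produce.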
C. SIGNIFICANCE OF A CORRELATION COEFFICIENT
You may ask, "How large a correlation coefficient do I need to show that two variables
are correlated?" Each time PROC CORR prints a correlation coefficient, it also prints
a probability associated with the coefficient. That number gives the probability of
obtaining a sample correlation coefficient as large as or larger than the one obtained by
chance alone (i.e., when the variables in question actually have zero correlation).
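One way to see where such a probability comes from is the usual t test for a Pearson correlation, t = r x sqrt(n - 2) / sqrt(1 - r**2), with n - 2 degrees of freedom. This formula is a standard result rather than something shown in the PROC CORR listing, so treat the DATA step below as a hedged cross-check; the variable names are arbitrary.

```sas
DATA _NULL_;
   R = 0.97165;                          * HEIGHT/WEIGHT correlation from the listing;
   N = 7;                                * number of data pairs;
   T = R * SQRT(N - 2) / SQRT(1 - R**2);
   P = 2 * (1 - PROBT(ABS(T), N - 2));   * two-tailed probability;
   PUT T= 8.3 P= 8.4;                    * written to the SAS log;
RUN;
```

For the height/weight correlation, the result should land close to the significance level of .0003 that PROC CORR reports.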
relationship; it does not imply causality. For example, w
positive correlation between the number of hospitals i
the number of household pets in each state. Does this
sick and therefore make more hospitals necessary? Dou
planation is that both variables (number of pets and nu
to population size.
Being SIGNIFICANT is not the same as being IMP
is, knowing the significance of the correlation coefficie
Once we know that our correlation coefficient is signifi
need to look further in interpreting the importance of th
moment to explain what we mean by the significance of
pose we have a population of x and y values in which th
imagine a plot of this population as shown in the followin
Correlation with Population Correla
[Figure: scatter plot of x and y values]
Testing for significance tells us if the correlation found in the sample is large
enough to indicate a strong likelihood that there is actually a nonzero correlation in
the (larger) population. In this sample, a nonzero correlation would occur, which may
or may not be significant and must be tested anew for each correlation.
D. HOW TO INTERPRET A CORRELATION COEFFICIENT
One of the best ways to interpret a correlation coefficient (r) is to look at the square of
the coefficient (r-squared); r-squared can be interpreted as the proportion of variance
in one of the variables that can be explained by variation in the other variable. As an
example, our height/weight correlation is .97. Thus, r-squared is .94. We can say that
94% of the variation in weights can be explained by variation in height (or vice versa).
Also, (1 - .94), or 6%, of the variance of weight is due to factors other than height
variation. With this interpretation in mind, if we examine a correlation coefficient of .4,
we see that only .16, or 16%, of the variance of one variable is explained by variation in
the other.
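The arithmetic in this section is easy to reproduce. A minimal sketch, with arbitrary variable names, that squares the two correlation coefficients discussed above and writes the explained and unexplained proportions to the SAS log:

```sas
DATA _NULL_;
   DO R = 0.97, 0.40;
      R_SQUARED   = R**2;             * proportion of variance explained;
      UNEXPLAINED = 1 - R_SQUARED;    * proportion due to other factors;
      PUT R= 4.2 R_SQUARED= 6.4 UNEXPLAINED= 6.4;
   END;
RUN;
```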
(say blood pressure and heart rate) for each of the 20 su
a correlation coefficient using 20 pairs of data points. Su
measurements of blood pressure and heart rate for eac
valid correlation coefficient using all 200 data points (1
because the 10 points for each subject are not independ
to compute the mean blood pressure and heart rate for
values to compute the correlation coefficient. Having d
we correlated mean values when we are interpreting th
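The approach described here (average each subject's repeated measurements, then correlate the subject means) might be sketched as follows. The data set VITALS, its variable names, the seed, and the simulated values are all hypothetical; they stand in for 20 subjects with 10 readings each.

```sas
*** Hypothetical data: 20 subjects, 10 readings of BP and HR each;
DATA VITALS;
   DO SUBJECT = 1 TO 20;
      SUBJ_BP = 120 + 10*RANNOR(2468);
      DO READING = 1 TO 10;
         BP = ROUND(SUBJ_BP + 5*RANNOR(2468));
         HR = ROUND(40 + 0.25*SUBJ_BP + 4*RANNOR(2468));
         OUTPUT;
      END;
   END;
   DROP SUBJ_BP;
RUN;

*** One observation per subject, holding the mean BP and mean HR;
PROC MEANS DATA=VITALS NWAY NOPRINT;
   CLASS SUBJECT;
   VAR BP HR;
   OUTPUT OUT=SUBJ_MEANS MEAN=MEAN_BP MEAN_HR;
RUN;

*** Correlation based on 20 independent pairs, not 200 dependent ones;
PROC CORR DATA=SUBJ_MEANS NOSIMPLE;
   VAR MEAN_BP MEAN_HR;
RUN;
```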
E. PARTIAL CORRELATIONS
A researcher may wish to determine the strength of the
ables when the effect of other variables has been remov
is by computing a partial correlation. To remove the e
from a correlation, use a PARTIAL statement to list th
want to remove. Using the CORR_EG data set from Se
tial correlation between HEIGHT and WEIGHT with
PROC CORR DATA=CORR_EG NOSIMPLE;
   TITLE "Example of a Partial Correlation";
   VAR HEIGHT WEIGHT;
   PARTIAL AGE;
RUN;
As you can see in the following listing, the partial
and WEIGHT is now lower than before (.91934),
(p = .0212).
Example of a Partial Correlation
The CORR Procedure
1 Partial Variables: AGE
2 Variables: HEIGHT WEIGHT
1.00000
F. LINEAR REGRESSION
Given a person's height, what would be his or her predicted weight? How can we best
define the relationship between height and weight? By studying the following graph,
we see that the relationship is approximately linear. That is, we can imagine drawing a
straight line on the graph with most of the data points being only a short distance from
the line. The vertical distance from each data point to this line is called a residual.
[Figure: scatter plot of WEIGHT (vertical axis) versus HEIGHT (horizontal axis, 61 to 72)]
   MODEL dependent variable(s) = independent-
RUN;
Using our height/weight program, we add the fo
equation for the regression line:
PROC REG DATA=CORR_EG;
   TITLE "Regression Line for Height-Weight Data";
   MODEL WEIGHT = HEIGHT;
RUN;
The output from the previous procedure is as follow
Regression Line for Height-Weight Data
The REG Procedure
Model: MODEL1
Dependent Variable: WEIGHT
Number of Observations Read
Number of Observations Used
Analysis of Variance
Source             DF    Sum of Squares    Mean Square
Model               1    11880             11880
Error               5    703.38705         140.67
Corrected Total     6    12584
Variable     Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    -592.64458            81.54217          -7.27      0.0008
HEIGHT       11.19127              1.21780            9.19      0.0003
Look first at the last two lines. There is an estimate for two parameters, INTERCEPT
and HEIGHT. The general equation for a straight line can be written as:
y = a + bx
where a = intercept, b = slope.
We can write the equation for the "best" straight line defined by our height/weight
data as:
WEIGHT = -592.64 + 11.19 x HEIGHT
Given any height, we can now predict the weight. For example, the predicted
weight of a 70-inch-tall person is:
WEIGHT = -592.64 + 11.19 x 70 = 190.66 lb.
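The same prediction can be tabulated for several heights at once. This is only a sketch: the data set name PREDICT and the range of heights are assumptions, and the coefficients are the rounded values used in the equation above.

```sas
DATA PREDICT;
   INTERCEPT = -592.64;    * rounded estimate from the listing;
   SLOPE     = 11.19;      * rounded estimate from the listing;
   DO HEIGHT = 62 TO 72 BY 2;
      PRED_WEIGHT = INTERCEPT + SLOPE*HEIGHT;
      OUTPUT;
   END;
   KEEP HEIGHT PRED_WEIGHT;
RUN;

PROC PRINT DATA=PREDICT NOOBS;
   FORMAT PRED_WEIGHT 7.2;
RUN;
```

The row for HEIGHT = 70 reproduces the 190.66 lb. worked out by hand.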
Under the heading "Parameter Estimates" are columns labeled "Standard Error,"
"t Value," and "Pr > |t|." The t values and the associated probabilities (Pr > |t|) test
the hypothesis that the parameter is actually zero. That is, if the true slope and intercept
were zero, what would the probability be of obtaining, by chance alone, a value as large
as or larger than the one actually obtained? The standard error can be thought of in
much the same way as the standard error of the mean. It reflects the accuracy with which
we know the true slope and intercept.
In our case, the slope is 11.19, and the standard error of the slope is 1.22. We can
therefore form a 95% confidence interval for the slope by taking two (approximately)
standard errors above and below the mean. The 95% confidence interval for our slope
is 8.75 to 13.63. Actually, since the number of points in our example is small (n = 7),
we really should go to a t-table to find the number of standard errors above and below
the mean for a 95% confidence interval. (This is true when n is less than 30.) Going to
a t-table, we look under degrees of freedom (df) equal to n - 2 and level of
significance (two-tail) equal to .05. The value of t for df = 5 and p = .05 is 2.57. Our 95%
confidence interval is then 11.19 plus or minus 2.57 x 1.22 = 3.14.
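Instead of a printed t-table, the critical value can be looked up with the TINV function. The DATA step below is a sketch of that calculation using the estimates from the listing; the variable names are arbitrary.

```sas
DATA _NULL_;
   SLOPE = 11.19127;                  * estimate from the listing;
   SE    = 1.21780;                   * standard error of the slope;
   N     = 7;
   T_CRIT     = TINV(0.975, N - 2);   * two-tailed .05 level, 5 df;
   HALF_WIDTH = T_CRIT * SE;
   LOWER      = SLOPE - HALF_WIDTH;
   UPPER      = SLOPE + HALF_WIDTH;
   PUT T_CRIT= 6.3 HALF_WIDTH= 6.3 LOWER= 7.3 UPPER= 7.3;
RUN;
```

The limits come out at roughly 8.1 to 14.3, noticeably wider than the two-standard-error approximation. PROC REG can also print confidence limits for the parameter estimates if the CLB option is added to the MODEL statement.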
Inspecting the Analysis of Variance portion of the output, we see values for Root
MSE, R-Square, Dependent Mean, Adj R-Sq, and Coeff Var. Root MSE is the square
root of the error variance. That is, it is the standard deviation of the residuals.
R-square
variable: WEIGHT in this case. Coeff Var. is the coeffic
deviation expressed as a percent of the mean. Finally, A
tion coefficient corrected for the number of indepen
This adjustment has the effect of decreasing the value
typically small but becomes larger and more important
dependent variables (see Chapter 9).
G. PARTITIONING THE TOTAL SUM OF SQUARES
The top portion of the output from PROC REG presen
variance table for the regression. It takes the variation
breaks it out into various sources. To understand this ta
weight. That weight can be thought of as the mean wei
minus) a certain amount; because the individual is taller
regression tells us that taller people are heavier. Finally
to the fact that the prediction is less than perfect. So we
down due to the regression, and then up or down again
analysis-of-variance table breaks these components apa
of each through the sum of squares.
The total sum of squares is the sum of squared dev
from the grand mean. This total sum of squares (SS) ca
the sum of squares due to regression (or model) and th
times called residual). One portion, called the Sum of S
put, reflects the deviation of each weight from the PR
portion reflects deviations between the PREDICTED
called the sum of squares due to the MODEL in the out
Square is the sum of squares divided by the degrees of f
seven data points. The total degrees of freedom is equal
has 1 df. The error degrees of freedom is the total df min
5. We can think of the Mean Square as the respective va
deviation) of each of these two portions of variation. O
deviations about the regression line are small (error mea
viation between the predicted values and the mean (m
have a good regression line. To compare these mean squ
F = Mean square model / Mean square error
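To make the partitioning concrete, the sums of squares from the height/weight listing can be plugged into the F ratio by hand. This DATA step is a sketch; the variable names are arbitrary, and the model sum of squares is the rounded value shown in the listing.

```sas
DATA _NULL_;
   SS_MODEL = 11880;       DF_MODEL = 1;
   SS_ERROR = 703.38705;   DF_ERROR = 5;
   MS_MODEL = SS_MODEL / DF_MODEL;              * mean square, model;
   MS_ERROR = SS_ERROR / DF_ERROR;              * mean square, error;
   F        = MS_MODEL / MS_ERROR;
   P        = 1 - PROBF(F, DF_MODEL, DF_ERROR); * probability of a larger F;
   R_SQUARE = SS_MODEL / (SS_MODEL + SS_ERROR); * share of total SS due to model;
   PUT F= 8.2 P= 8.4 R_SQUARE= 6.4;
RUN;
```

F comes out at roughly 84, which is the square of the t value of 9.19 for the slope, as expected when there is a single independent variable.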
H. PRODUCING A SCATTER PLOT AND THE REGRESSION LINE
To plot our height/weight data, we can use PROC GPLOT as follows:
PROC GPLOT DATA=CORR_EG;
   PLOT WEIGHT*HEIGHT;
RUN;
If you want to override the default plotting symbol (a plus sign), you can add a
SYMBOL statement before running the PROC GPLOT, like this:
SYMBOL VALUE=DOT COLOR=BLACK;
***You can abbreviate VALUE as V= and COLOR as C=;
PROC GPLOT DATA=CORR_EG;
   PLOT WEIGHT*HEIGHT;
RUN;
Some other plotting symbols besides DOT (large black dots) that are useful are
CIRCLE, SQUARE, TRIANGLE, and PLUS. To see a complete list of plotting symbols,
enter SYMBOL in your SAS help window or see the Online Doc™. You can also
select VALUE = NONE if you also select an interpolation method (such as JOIN or
SMOOTH) and you don't want to see individual data points on the plot.
There are two ways to have SAS show us the data points and the regression line.
First, PROC REG allows the use of a PLOT statement. The form is:
PLOT y_variable * x_variable / options;
Y_variable and x_variable are the names of the variables to be plotted on the y- and
x-axes, respectively. PROC REG will plot the data points and show the regression line.
In addition, there are some special "variable names" that can be used with a PLOT
statement. In particular, the names PREDICTED. and RESIDUAL. (the periods are
part of these keywords) are used to plot predicted and residual values.
SYMBOL1 VALUE=DOT COLOR=BLACK;
PROC REG DATA=CORR_EG;
   TITLE "Regression and Residual Plots";
   MODEL WEIGHT = HEIGHT;
   PLOT WEIGHT * HEIGHT
        RESIDUAL. * HEIGHT;
RUN;
QUIT;
The keyword RESIDUAL. stands for the resid
minus the predicted value). The SYMBOL statements
plotting symbol and the color to be black. The second
HEIGHT (it uses the symbols defined in SYMBOL1
shown here:
Regression and Residual Pl
WEIGHT = -592.64 + 11.191 HEIGHT
[Figure: two plots from PROC REG: WEIGHT versus HEIGHT with the fitted regression line, and the residuals versus HEIGHT]
By using the GPLOT procedure, you can show the scatter plot of WEIGHT by
HEIGHT, the regression line, and two types of 95% confidence intervals. The most
common type of confidence interval about a regression line is an interval that indicates where
you believe the mean value of y is most likely to lie. The other, less often used confidence
interval is an interval for individual data points. Limits of this interval determine the
likely location of individual y-values. The following PROC GPLOT statements will
produce a scatter plot of WEIGHT by HEIGHT and both types of confidence intervals. As
you can see, the confidence interval for the mean of y is much narrower than the one for
individual data points.
GOPTIONS CSYMBOL=BLACK;
SYMBOL1 VALUE=DOT;
SYMBOL2 VALUE=NONE I=RLCLM95;
SYMBOL3 VALUE=NONE I=RLCLI95 LINE=3;
PROC GPLOT DATA=CORR_EG;
   TITLE "Regression Lines and 95% CI's";
   PLOT WEIGHT * HEIGHT = 1
        WEIGHT * HEIGHT = 2
        WEIGHT * HEIGHT = 3 / OVERLAY;
RUN;
QUIT;
The first SYMBOL statement requests dots as the plotting symbols. The second
and third SYMBOL statements set VALUE equal to 'NONE,' which hides the points.
The I= (interpolation) option in SYMBOL2 requests a regression line and the 95%
CI about the mean of y (RLCLM95). The SYMBOL3 statement asks for the confidence
[Figure: "Regression Lines and 95% CI's": WEIGHT versus HEIGHT with the regression line and both confidence bands]
I. ADDING A QUADRATIC TERM TO THE REGRESSION EQUATION
The plot of residuals, shown in the previous section, sug
(height squared) might improve the model since the po
rather, form a curve that could be fit by a second-orde
ter deals mostly with linear regression, let us quickly sh
this possible quadratic relationship between height and
line in the original DATA step to compute a variab
squared. After the INPUT statement, include a line suc
HEIGHT2 = HEIGHT * HEIGHT;
or
HEIGHT2 = HEIGHT**2;
   PLOT R. * HEIGHT;
   ***Note: R. is short for RESIDUAL.;
RUN;
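Pulling these pieces together, a complete run might look like the sketch below. It assumes the CORR_EG data set from earlier in the chapter already exists; the new data set name CORR_EG2 is made up for this example, and the MODEL statement is inferred from the text's description of adding height squared as a second independent variable.

```sas
*** Assumes CORR_EG was created earlier in the session;
DATA CORR_EG2;
   SET CORR_EG;
   HEIGHT2 = HEIGHT**2;      * quadratic term;
RUN;

PROC REG DATA=CORR_EG2;
   TITLE "Residual Plot with Quadratic Term";
   MODEL WEIGHT = HEIGHT HEIGHT2;
   PLOT R. * HEIGHT;         * R. is short for RESIDUAL.;
RUN;
QUIT;
```

Creating HEIGHT2 in a second DATA step with SET avoids retyping the raw data; adding the assignment after the INPUT statement in the original DATA step, as the text suggests, gives the same result.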
When you run this model, you will get an r-squared of .9743, an improvement over the
.9441 obtained with the linear model. The following residual plot shows that the
distribution of the residuals is more random than in the earlier plot. A caution or two is in
order here. First, remember that this is an example with a very small data set and that
the original correlation is quite high. Second, although it is possible to also enter cubic
terms, etc., one should keep in mind that results need to be interpretable.
[Figure: "Residual Plot with Quadratic Term": residuals plotted against HEIGHT for the model containing HEIGHT and HEIGHT2]
J. TRANSFORMING DATA
Another regression example is provided here to demonstrate some additional steps
that may be necessary when doing regression. Shown here are data collected from
10 people:
8 16
9 32
10 32
Let's write a SAS program to define this collect
heart rate.
DATA HEART;
   INPUT DOSE HR @@;
   ***The double @ at the end of the INPUT
      read several observations from one da
      instruction not to move the pointer t
      reach the bottom of the DATA step (de
DATALINES;
2 60 2 58 4 63 4 62 8 67 8 65 16 70 16 70 3
;
SYMBOL VALUE=DOT COLOR=BLACK I=SM;
***I = SM produces a smooth line through th
   follow SM by a number from 0 to 99 to contro
   should try to touch each data point. (low va
   lots of wiggle high values such as SM80, giv
   Also, you can add to the end of this
   are not sorted (example: I=SM80S);
PROC GPLOT DATA=HEART;
   PLOT HR*DOSE;
RUN;
PROC REG DATA=HEART;
   MODEL HR = DOSE;
RUN;
The resulting graph and the PROC REG outpu
Investigating the Dose/HR Relationship
The REG Procedure
Model: MODEL1
Dependent Variable: HR
Model              1    233.48441    233.48441
Error              8    38.11559     4.76445
Corrected Total    9    271.60000
Root MSE          2.18276     R-Square    0.8597
Dependent Mean    66.20000    Adj R-Sq    0.8421
Coeff Var         3.29722
Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1    60.70833              1.04492           58.10      <.0001
DOSE          1    0.44288               0.06326            7.00      0.0001
[Figure: "Investigating the Dose/HR Relationship": HR plotted against DOSE]
To create a new variable, which is the lo
DATA HEART;
   INPUT DOSE HR @@;
   LDOSE = LOG(DOSE);
   LABEL LDOSE = "Log of Dose";
DATALINES;
2 60 2 58 4 63 4 62 8 67 8 65 16 70 16 70 32
LOG is a SAS function that yields the natural (
value is within the parentheses.
We can now plot log dose versus heart rate and c
PROC REG DATA=HEART;
   TITLE "Investigating the Dose/HR Relationship";
   MODEL HR = LDOSE;
RUN;
SYMBOL VALUE=DOT;
PROC GPLOT DATA=HEART;
   PLOT HR*LDOSE;
RUN;
Output from the previous statements is shown on
transforming variables with caution. Keep in mind that w
one should not refer to the variable as in the untransform
the "log of dosage" as "dosage." Some variables are freque
of groups, and magnitudes of earthquakes are usually pre
transformation.
Investigating the Dose/HR Relationship
The REG Procedure
Model: MODEL1
Dependent Variable: HR
Model              1    266.45000    266.45000    413.90    <.0001
Error              8    5.15000      0.64375
Corrected Total    9    271.60000
Root MSE          0.80234     R-Square
Dependent Mean    66.20000    Adj R-Sq
Coeff Var         1.21199
Parameter Estimates
Variable     Label          DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    Intercept       1    55.25000              0.59503           92.85      <.0001
LDOSE        Log of Dose     1    5.26584               0.25883           20.34      <.0001
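Because the independent variable is now the log of dose, predictions have to be computed from LOG(DOSE), not from DOSE itself. The following sketch applies the parameter estimates from the listing to the five dose levels; the data set name PRED_HR and the variable PREDICTED are assumptions for illustration.

```sas
DATA PRED_HR;
   DO DOSE = 2, 4, 8, 16, 32;
      LDOSE     = LOG(DOSE);                   * natural log, as in the model;
      PREDICTED = 55.25000 + 5.26584*LDOSE;    * estimates from the listing;
      OUTPUT;
   END;
RUN;

PROC PRINT DATA=PRED_HR NOOBS;
   FORMAT LDOSE 6.3 PREDICTED 6.2;
RUN;
```

Each doubling of the dose adds the same amount to the predicted heart rate (the slope times the natural log of 2, or about 3.65), which is what a straight line on the log scale implies.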
[Figure: "Investigating the Dose/HR Relationship": HR plotted against Log of Dose]
5.1 Given the following data:
X Y Z
I 3 1
7 l-) 7
8 t2
3 4 t4
A
7 1
(a) Write a SAS program and compute the Pearson

and Y; X and Z. What is the significance of each?
(b) Change the correlation request to produce a corr
tion coefficient between each variable versus every
5.2 Ten students take an eight-item test. The responses to e
correct or correct) are stored in variables Q1 to Q8 in
following program to create this data set.
DATA EXAM;
   INPUT (Q1-Q8)(1.);
DATALINES;
10101010
11111111
11110101
01100000
11110001
11111111
11111101
11111101
10110101
00010110
Starting with this data set, create a new SAS data set wit
raw score on the test (the sum of Q1 to Q8). Using this da
each of the eight questions with the total test score. This
correlation coefficient.
40 150
50 148
How much of the variance of SBP (systolic blood pressure) can be explained by the fact
that there is variability in AGE? (Use SAS to compute the correlation between SBP
and AGE.)
5.4 Run the following program to create a SAS data set called SCORES.
DATA SCORES;
   DO SUBJECT = 1 TO 100;
      IF RANUNI(1357) LT .5 THEN GROUP = 'A';
      ELSE GROUP = 'B';
      MATH = ROUND(RANNOR(1357)*20 + 550 + 10*(GROUP EQ 'A'));
      SCIENCE = ROUND(RANNOR(1357)*15 + .4*MATH + 300);
      ENGLISH = ROUND(RANNOR(1357)*20 + 500 + .05*SCIENCE +
                .05*MATH);
      SPELLING = ROUND(RANNOR(1357)*15 + 500 + .1*ENGLISH);
      VOCAB = ROUND(RANNOR(1357)*5 + 400 + .1*SPELLING +
              .2*ENGLISH);
      PHYSICAL = ROUND(RANNOR(1357)*20 + 550);
      OVERALL = ROUND(MEAN(MATH, SCIENCE, ENGLISH, SPELLING, VOCAB,
                PHYSICAL));
      OUTPUT;
   END;
RUN;
(a) Generate a correlation matrix of all the test scores plus the overall score. Use the
option to omit the simple statistics produced by PROC CORR.
(b) PHYSICAL is independent of all the other test scores but is correlated to
OVERALL. Why?
(c) Write the statements to correlate each of the scores with OVERALL.
5.5 From the data for X and Y in Problem 5.1:
(a) Compute a regression line (Y on X). Y is the dependent variable, X the independent
variable.
(b) What are the slope and intercept?
(c) Are they significantly different from zero?
5.6 (a) Using data set SCORES from Problem 5.4, compute a regression equation for
predicting the SCIENCE score from the MATH score. Make one plot showing the data
blood pressure (DBP). Create a SAS data set (DOSE_
compute two regression equations. One for SBP vers
DOSE. Produce one plot of SBP versus DOSE, anothe
SBP) versus DOSE, one of DBP versus DOSE, and fina
DOSE. Use the PLOT statement of PROC REG to do
Dose    Systolic Blood Pressure
 4      180
 4      190
 4      178
 8      170
 8      180
 8      168
16      160
16      172
16      170
32      140
32      130
32      128
5.9 Generate:
(a) A plot of Y versus X (data from Problem 5.1).
(b) A plot of the regression line and the original data o
5.10 Repeat Problem 5.8 using the natural log of dose (LOG
pare the fit statistics.
Which fit is better for SBP? Which
5.11 Given the data set:
COUNTY    POP    HOSPITAL    F
1          35     1
2          88     5
3           5     0
4          55     3
5          75     4
6         125
7         225     7
8         500    10
5.12 Repeat Problem 5.4 (a), except do the analysis separately for GROUP='A' and
GROUP='B'. What is the effect on the magnitude of the correlations and the p-values?
5.13 What's wrong with the following program? (Note: There may be missing values for X, Y,
and Z.)
1  DATA MANY_ERR;
2  INPUT X Y Z;
3  IF X LE 0 THEN X=1;
4  IF Y LE 0 THEN Y=1;
5  IF Z LE 0 THEN Z=1;
6  LOGX = LOG(X);
7  LOGY = LOG(Y);
8  LOGZ = LOG(Z);
9  DATALINES;
L23
.18
4 1"0
18LL
;
10 PROC CORR DATA=MANY_ERR / PEARSON SPEARMAN;
11 VAR X-LOGZ;
12 RUN;
