
Statistics Spring 2008

Lab #3 Correlation
Defined: The measure of the strength and direction of the linear relationship between two
variables.
Variables: IV is continuous, DV is continuous
Relationship: Relationship amongst variables
Example: Relationship between height and weight.
Assumptions: Normality. Linearity.
1. Graphing - Scatterplot
The first step of any statistical analysis is to plot the data graphically. In terms of correlation, graphical
plots are called scatterplots. Scatterplots can show you visually the strength of the relationship between the
variables, the direction of the relationship between the variables, and whether outliers exist.
Correlation is the measure of the strength and direction of the relationship between the variables.
a. Correlations can vary between -1 and 1.
b. Direction of the relationship can be either positive or negative. A positive relationship is indicated by a
positive value (e.g., ranging from 0 to 1). A negative relationship is indicated by a negative value (e.g.,
ranging from 0 to -1). An example of a positive relationship is the relationship between height and
weight. The higher the outcome on one variable, the higher the outcome on the other variable. An
example of a negative relationship is the relationship between exercise and weight. The higher the
outcome on one variable, the lower the outcome on the other variable.
c. Strength of the relationship is measured from 0 to +/-1. The farther the value is away from 0, the
stronger the relationship. The approximate criteria for strength are 0 for no effect, .1 for a small effect,
.3 for a medium effect, and .5 for a large effect. Notice those values can be either positive or negative,
depending upon the direction of the relationship, so a .2 and a -.2 relationship indicate the same strength,
but different direction.
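If you want to see what is behind the number, here is a minimal Python sketch (not part of the SPSS procedure) that computes a Pearson correlation by hand and labels its direction and approximate strength. The height and weight values are made up for illustration, and treating the .1/.3/.5 criteria as lower cut-offs is my assumption; the criteria are only approximate.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def describe(r):
    """Label direction and approximate strength (assumed cut-offs .1/.3/.5)."""
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    size = abs(r)  # strength ignores the sign
    if size >= 0.5:
        strength = "large"
    elif size >= 0.3:
        strength = "medium"
    elif size >= 0.1:
        strength = "small"
    else:
        strength = "negligible"
    return direction, strength

# Hypothetical data: heights (inches) and weights (pounds) for six people
height = [60, 62, 65, 68, 70, 72]
weight = [115, 120, 140, 155, 160, 180]

r = pearson_r(height, weight)
print(round(r, 2), describe(r))
print(describe(0.2), describe(-0.2))  # same strength, different direction
```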
What does a scatterplot look like? Below are 9 scatterplots that show three examples of a positive relationship
in the top row (perfect, strong, weak), three examples of a negative relationship in the middle row (perfect,
strong, weak), and three examples of no relationship in the bottom row.
(graph taken from bla website)
(graph taken from wikipedia)
How do I graph a scatterplot?
1. Select Graphs --> Legacy Dialogs --> Scatter
2. Click "Simple", and "Define"
3. Move appropriate variables into the "Y axis" and "X axis"
4. Click OK.
Output below is for two questions: "commit1" and "commit3". Notice that there is a positive relationship.
From this scatterplot, I would anticipate that the correlation between the two variables is positive and medium
in size.
What is the purpose of graphing the scatterplot? The purpose of graphing the scatterplot is to look at the
relationship between the variables and determine if there are any problems/issues with the data or if the
scatterplot indicates anything unique or interesting about the data, such as:
a. How are the data dispersed? For example, in the scatterplot above, it appears all the scores are grouped
in the top right quadrant. What does this imply about the questions and/or data in your study? It
appears that subjects answered both commit1 and commit3 on the higher part of the scale. In other
words, most subjects feel that most people brought to trial did in fact commit the crime (commit1) and
that most people convicted by juries did in fact commit the crime (commit3). Thus, when discussing
these variables in your paper, just talking about the size and direction of the correlation does not tell
the whole story. If you want to also discuss descriptive analysis of the data, you could talk about how
the data are distributed at the high end of the scale. In other words, just presenting the correlational
analysis (e.g., r = .35, p < .001) may mislead the reader about an interesting distribution of the data.
b. Are there outliers? A scatterplot is useful for "eyeballing" the presence of outliers. Just as a histogram
is useful for "eyeballing" univariate outliers, the scatterplot is useful for "eyeballing" bivariate outliers.
In a later section I describe how to statistically analyze whether or not bivariate outliers exist.
c. Is the relationship linear?
What is linearity? Linearity is a straight-line relationship between variables.
Why is linearity important? Correlation and regression tests rest upon the assumption of linearity because
they only capture linear relationships. Not all relationships are linear. Just as not all variables are normally
distributed in the real world, not all relationships are supposed to be linear. For example, there could be a
non-linear relationship for USA immigrants between length of residence in USA and depression. It is a U-shaped
relationship. Depression levels start high during the first few years of initial resettlement, then decrease for a
while as they adjust to the new environment, and then increase again later in life. Another example of a
non-linear relationship is mortality and water consumption. Absence of water increases mortality, middle levels of
water decrease mortality, but too much water increases mortality.
Correlation and regression only capture linear relationships. For example, all correlations below have the same
size and direction, r = .81. BUT only the top-left graph is appropriate for correlational analysis because the
other three graphs depict data that can NOT be captured by the formulas for correlation and regression.
(graph taken from wikipedia)
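To see how badly a correlation can miss a non-linear pattern, here is a small Python sketch with made-up numbers (these are not the data in the graphs above). The U-shaped variable is perfectly predictable from x, yet its Pearson correlation with x is zero, while the straight-line variable gives a correlation of 1.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predictor, centered on its middle value
x = [-3, -2, -1, 0, 1, 2, 3]

# U-shaped outcome: high at both extremes, low in the middle
y_u = [v * v for v in x]

# Straight-line outcome for comparison
y_line = [2 * v + 1 for v in x]

r_u = pearson_r(x, y_u)
r_line = pearson_r(x, y_line)
print("U-shaped relationship: r =", round(r_u, 3))
print("Linear relationship:   r =", round(r_line, 3))
```

A correlation near zero here does not mean "no relationship"; it means "no linear relationship", which is why you look at the scatterplot first.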
2. Bivariate (and Multivariate) outliers
How do I identify bivariate and multivariate outliers? The procedure for identifying bivariate outliers is the
same as for identifying multivariate outliers. The procedure is called Mahalanobis Distances, and it calculates
the distance of particular scores from the center cluster of remaining cases. The procedure creates a new
column at the end of the data file containing a calculated score for each subject. The newly calculated score is
based upon the specific variables entered into the analysis. Thus, you could calculate many different
Mahalanobis Distances where you enter different sets of variables into the analysis. For each analysis, a
separate score for each subject is created in a new column at the end of the data file. The Mahalanobis
Distances score for each subject is considered an outlier if it exceeds a "critical value".
a. The critical value is determined by a table at the back of most textbooks. You can also find the table at
this webpage - http://www.ento.vt.edu/~sharov/PopEcol/tables/chisq.html
b. The table involves the "Rejection Regions" for a Chi-Square test. Remember back to the first day of
class when we talked about probability distributions and "Rejection Regions". The Rejection Regions
for the chi-square test are based upon two factors: the probability level you set, and the degrees of
freedom. We will talk later in-depth about these concepts, but for right now what you need to know is
that:
c. degrees of freedom for this test is equal to the number of variables under investigation. Thus, if you are
analyzing a bivariate relationship, then degrees of freedom = 2. If you are analyzing 3 variables, then
degrees of freedom = 3, and so forth
d. the probability level we set for this test is p < .001.
e. so, if you look at the table, you find the degrees of freedom, then scan to the right until you get to the
column associated with 0.001. That is your critical value. For example, the critical value for a bivariate
relationship is 13.82.
f. Any Mahalanobis Distances score above that critical value is an outlier.
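If you are curious where 13.82 comes from, the bivariate case can be checked without a table. With 2 degrees of freedom, the probability of a chi-square value larger than x is exp(-x/2), so the critical value is -2 times the natural log of the probability level. This shortcut is a sketch that only works for degrees of freedom = 2; for 3 or more variables, use the table (or a statistics package).

```python
import math

def chi_square_critical_df2(p):
    """Critical value of a chi-square test with 2 degrees of freedom.

    Uses the closed form P(X > x) = exp(-x / 2), which holds only for df = 2.
    """
    return -2.0 * math.log(p)

critical = chi_square_critical_df2(0.001)
print(round(critical, 2))  # matches the table entry for df = 2, p = .001
```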
Here is how to calculate Mahalanobis Distances scores:
1. Select Analyze --> Regression --> Linear
2. Move all the variables under investigation into the "Independent(s)" box. In the "Dependent" box, move the
subject number variable (numb). For example, if you are interested in the bivariate outlier analysis for
"commit1" and "commit3", you move both those variables into the "Independent(s)" box, and move "numb" into
the "Dependent" box.
3. Click "Save", and click "Mahalanobis"
4. Click OK.
The newly created variable is saved as "MAH_1"
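For those who want to see the calculation itself, below is a rough Python sketch of a bivariate Mahalanobis score. It assumes the saved score is the squared distance of each case from the means, scaled by the sample covariance matrix, which is the quantity a chi-square critical value applies to. The function name and the 20 cases are hypothetical (19 clustered cases plus one extreme case), not the class data.

```python
def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample variances and covariance (n - 1 in the denominator)
    sxx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    syy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2 x 2 covariance matrix
    scores = []
    for a, b in zip(x, y):
        dx, dy = a - mean_x, b - mean_y
        # d' * inverse(S) * d written out for the 2 x 2 case
        scores.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return scores

# Hypothetical data: 19 cases clustered near (82, 87) plus one extreme case
x = [80 + (i % 5) for i in range(19)] + [10]
y = [85 + ((2 * i) % 5) for i in range(19)] + [87]

CRITICAL = 13.82  # chi-square critical value for 2 variables at p < .001
d2 = mahalanobis_sq(x, y)
flagged = [i for i, score in enumerate(d2) if score > CRITICAL]
print("Maximum:", round(max(d2), 2), "flagged cases:", flagged)
```

Note that a single case can never score higher than (n - 1) squared divided by n, so with very small samples nothing can exceed 13.82 no matter how extreme it looks.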
Output below is for "commit1" and "commit3". In the "Residuals Statistics" box, look for "Mahal. Distance".
Look at the "Maximum" score. If that number exceeds your critical value, then an outlier exists. In this case,
with 2 variables, the critical value is 13.82. The "Maximum" listed in the output is above 30. Thus, we have at
least one outlier.

You identify the outlier(s) by sorting the data by this new variable "MAH_1", and then scroll to the bottom of
the list to find the highest valued scores. You can sort by: Data --> Sort Cases. In this case, we find 11
subjects that have scores above 13.82.
Notice, however, that multivariate outlier analysis is just as arbitrary as univariate outlier analysis. The
threshold level is arbitrarily determined, just as the threshold levels for univariate outliers
of 1.5*IQR and 3*IQR are arbitrarily determined. Plus, the "eyeball" method of the scatterplot does show some
differences when compared to the statistical method of using Mahalanobis Distances scores. For example, if
you look at the scatterplot for our two variables (see above), can you identify which 11 subjects are the ones
deemed outliers by the Mahalanobis Distances analysis?
3. Correlation
A correlation is easy to conduct:
1. Select Analyze --> Correlate --> Bivariate
2. Move all variables into the "Variable(s)" window.
3. Click OK.
Output below is for two questions "commit1" and "commit3". The "Correlations" box tells you three pieces
of information: n = sample size, pearson = size and direction of the relationship, Sig. = significance level. In
essence, the "Pearson Correlation" tells you size and direction of the hypothetical line that can be drawn
through the data; and "Significance" tells you the probability of getting a line at least this steep by chance
alone. More specifically, the "Significance" represents a test of whether the line is different from a flat line
(e.g. a flat line would be represented by a Pearson correlation = 0). For the data below, there is a positive and
medium relationship between the variables, and the probability of a result this strong arising by chance if the
true line were flat is p < .001.
Another useful piece of information is R², the coefficient of determination. This is the amount of variance
explained by another variable. R² is not provided in the output, but you can calculate R² by squaring the
Pearson Correlation. In our example, therefore, .352 x .352 = .124. If you multiply this by 100, you convert
the value into a percentage. Thus, in our example, commit1 explains 12.4% of the variance in commit3, and
vice versa. This also means that 87.6% of the variance is unaccounted for because 100 - 12.4 = 87.6.
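The arithmetic above is easy to check. This short Python sketch simply repeats it with the Pearson correlation of .352 from our example; the variable names are my own.

```python
r = 0.352  # Pearson correlation between commit1 and commit3 from the output

r_squared = r ** 2                        # coefficient of determination
percent_explained = 100 * r_squared       # as a percentage
percent_unexplained = 100 - percent_explained

print(round(r_squared, 3))            # proportion of variance explained
print(round(percent_explained, 1))    # percent of variance explained
print(round(percent_unexplained, 1))  # percent of variance unaccounted for
```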
WRITE-UP: The report of a correlational study should include the strength of the relationship and the
significance level. Some researchers also include the descriptive statistics of each variable. Some researchers
also include the R².
a. "There was a positive correlation between the two variables, r = .35, p < .001."
b. "There was a positive correlation between the belief about what percent of people brought to trial did
in fact commit the crime (M = 78.39%, SD = 16.33) and the belief about what percent of people
convicted by juries did in fact commit the crime (M = 83.22%, SD = 1?.?4), r = .35, p < .001."
c. "There was a positive correlation between the two variables, r = .35, p < .001, with an R² = .124."
EVALUATION:
a. You evaluate correlational analysis by looking at the direction of the relationship between the
variables. Is it in the same direction as the research hypothesis?
b. You then look at the significance level. Is the relationship significant? Remember that significance is
related to sample size. In a small sample (n=30) you may have correlations that don't reach significance,
but if the sample size was larger (n=100), the same correlation would be significant. Also, remember that
sample size does not typically affect the strength of the relationship, only the probability that the result
was due to chance.
c. You then look at the size of the relationship. Is it strong or weak? Just because the hypothesis is
confirmed in the predicted direction does not indicate if the relationship between the variables is strong
or important. Strength of the relationship is measured from 0 to +/-1. The farther the value is away
from 0, the stronger the relationship. The approximate criteria for strength are 0 for no effect, .1 for a
small effect, .3 for a medium effect, and .5 for a large effect. Notice those values can be either positive
or negative, depending upon the direction of the relationship, so a .2 and a -.2 relationship indicate the
same strength, but different direction.
d. You can also look at R². In terms of percentage of variance explained, small is 1%, medium is 9%, and
large is 25%.
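To make the point about sample size in (b) concrete, here is a Python sketch. SPSS reports the significance level for you; behind it is a t statistic, t = r x sqrt((n - 2) / (1 - r²)), which is a standard formula and not something covered in this lab. The r = .30 is a made-up medium correlation, and the two critical values are the two-tailed p = .05 entries from a t table for 28 and 98 degrees of freedom.

```python
import math

def t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation differs from 0."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Two-tailed critical t values at p = .05, taken from a standard t table
# (degrees of freedom = n - 2)
T_CRITICAL = {30: 2.048, 100: 1.984}

r = 0.30  # hypothetical medium-sized correlation, same in both samples
results = {}
for n in (30, 100):
    t = t_statistic(r, n)
    results[n] = t > T_CRITICAL[n]
    print("n =", n, "t =", round(t, 2), "significant:", results[n])
```

The correlation is identical in both samples; only the verdict on significance changes.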
4. Correlation - Multiple
When you conduct correlations, you typically enter MANY variables simultaneously into the analysis, and the
output provides all possible bivariate relationships. For example:
1. Select Analyze --> Correlate --> Bivariate
2. Move all variables into the "Variable(s)" window.
3. Click OK.
Output below is for the "forensic" items and "innocence" items. Notice the diagonal is always "1" because
there is a perfect correlation between a variable and itself. Also notice that sample size is different for each
bivariate relationship because the default in correlation is "pairwise" deletion. Also, notice that the matrix is a
mirror of itself along the diagonal, so the information is depicted twice for each bivariate combination.
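The three things to notice in the matrix (a diagonal of 1, a different n per pair, and the mirror image) can be reproduced in a few lines. This is only a sketch of pairwise deletion with made-up responses; the item names and values are hypothetical, and None marks a missing answer.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairwise_correlations(data):
    """Return {(a, b): (r, n)} using only cases complete on BOTH a and b."""
    matrix = {}
    for a in data:
        for b in data:
            pairs = [(u, v) for u, v in zip(data[a], data[b])
                     if u is not None and v is not None]
            xs = [p[0] for p in pairs]
            ys = [p[1] for p in pairs]
            matrix[(a, b)] = (pearson_r(xs, ys), len(pairs))
    return matrix

# Hypothetical responses from 7 subjects; None = missing answer
data = {
    "forensic1": [1, 2, 3, 4, 5, None, 7],
    "forensic2": [2, 1, 4, 3, None, 6, 5],
    "innocence1": [7, 6, None, 4, 3, 2, 1],
}

matrix = pairwise_correlations(data)
for (a, b), (r, n) in matrix.items():
    print(a, b, "r =", round(r, 3), "n =", n)
```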
Just as you can have correlational output of multiple variables simultaneously, you can have scatterplots of
multiple variables simultaneously. The only limitation is that if there are more than 3 variables simultaneously
the scatterplots get so small as to be relatively useless. You conduct multiple scatterplots simultaneously by:
1. Select Graphs --> Legacy Dialogs --> Scatter
2. Click "Matrix", and "Define"
3. Move appropriate variables into the "Matrix variables" box
4. Click "Options" and "exclude cases variable by variable"
5. Click OK.
Output below is for the first three "forensic items".
5. Correlation - Partial
Partial correlation is the relationship between two variables while controlling for a third variable. The purpose
is to find the unique variance between two variables while eliminating the variance from a third variable. The
diagram below from your textbook page 13? graphically represents the purpose of partial correlation.
You typically only conduct partial correlation when the third variable has shown a relationship to one or both
of the primary variables. In other words, you typically first conduct correlational analysis on all variables so
that you can see whether there are significant relationships amongst the variables, including any "third
variables" that may have a significant relationship to the variables under investigation. In addition to this
statistical prerequisite, you also want some theoretical reason why the third variable would be impacting the
results.
How to conduct partial correlation:
1. Select Analyze --> Correlate --> Partial
2. Move variables into the "Variable(s)" window.
3. Move the variable you want to control for into the "Controlling" box
4. Click "Options" and click "Zero Order correlations" and click "Exclude cases pairwise"
(by clicking "zero order correlations", the output will show both the relationships amongst the variables
while controlling for the third variable, and ALSO the relationships amongst the variables without
controlling for the third variable. This is useful so that you can easily see the difference between
controlling for the variable and not controlling for the variable.)
5. Click OK.
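With one control variable, the partial correlation can be worked out from the three zero-order correlations using the standard first-order formula. The Python sketch below shows that formula; only the .352 comes from this lab, and the two correlations with the third variable are made-up values for illustration.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y while controlling for z (one control)."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# r_xy is the commit1-commit3 correlation from the earlier output.
# r_xz and r_yz are HYPOTHETICAL correlations with the third variable.
weak_control = partial_r(0.352, 0.15, 0.10)

# A hypothetical case where the third variable accounts for everything:
# the zero-order r of .25 drops to 0 once z is controlled.
spurious = partial_r(0.25, 0.50, 0.50)

print(round(weak_control, 3), round(spurious, 3))
```

When the third variable is only weakly related to the two primary variables, controlling for it barely changes the correlation; when it is strongly related to both, the correlation can shrink a lot.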
Output below is for the relationship between "commit1" and "commit3" while controlling for "prosecutor1".
I included "prosecutor1" as the controlling variable because: (1) statistically, it shows a significant relationship
to both commit1 and commit3. You can see that significant relationship in the top part of the "Correlations"
box below which presents the correlations without controlling for a third variable; (2) theoretically, it is
possible that the reason why there is a positive correlation between commit1 and commit3 is because
prosecutor1 asks "whom do you trust more, defense attorneys or prosecutors", so it is possible the reason why
subjects believe defendants brought to trial and convicted at trial are guilty (commit1 and commit3) is because
they trust the prosecutor over the defense attorney.
Thus, given this plausible (statistical and theoretical) third-variable relationship, it is interesting to note that
controlling for "prosecutor1" did not lower the strength of the relationship between commit1 and commit3 by
that much because the outcome while controlling for prosecutor1 was r = .341, p < .001. In other words, the
relationship between commit1 and commit3 is NOT due to subjects trusting the prosecutor.
You can conduct Partial Correlation with more than just 1 third-variable. You can include as many
third-variables as you wish.
6. Correlation - Point-biserial Correlation
Point-biserial Correlations are conducted when one of the variables is dichotomous, which means it's a
categorical variable with only two categories, such as gender: male, female.
FYI: The Point-biserial Correlation is analogous to a "t-test", which we will cover in later weeks. A "t-test"
is conducted when you are interested in the relationship between a categorical independent variable (such as
gender: male, female) and a continuous dependent variable (such as belief in the death penalty on 1-7 scale).
You conduct a Point-biserial Correlation the same way that you conduct a regular correlation:
1. Select Analyze --> Correlate --> Bivariate
2. Move all variables into the "Variable(s)" window.
3. Click OK.
The output, Write-up, and interpretation are the same as for a regular correlation.
FYI - If you want to analyze the tests from your classes using the Point-biserial Correlation, you would need
to first create a new dichotomous variable (e.g., 1=answered correctly, 2=answered incorrectly).
See "Lab2 Descriptives" for "6. Transforming categorical variables into other categorical variables".
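The reason the regular correlation procedure works here is that a point-biserial correlation is just a Pearson correlation in which one variable takes only two values. The Python sketch below checks this with made-up data: it compares a regular Pearson correlation against the textbook point-biserial formula, (M1 - M0) / SD x sqrt(p x q). The 0/1 coding and the exam scores are my own illustration; with a 1/2 coding like the one above, the size is the same and only the sign depends on which category gets the higher code.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def point_biserial(group, score):
    """Point-biserial correlation; group must be coded 0 or 1."""
    n = len(score)
    ones = [s for g, s in zip(group, score) if g == 1]
    zeros = [s for g, s in zip(group, score) if g == 0]
    mean_1, mean_0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
    p = len(ones) / n   # proportion coded 1
    q = 1 - p           # proportion coded 0
    mean_all = sum(score) / n
    # Standard deviation of all scores, with n in the denominator
    sd = math.sqrt(sum((s - mean_all) ** 2 for s in score) / n)
    return (mean_1 - mean_0) / sd * math.sqrt(p * q)

# Hypothetical data: 1 = answered an item correctly, 0 = incorrectly,
# paired with each student's total exam score
correct = [1, 1, 0, 1, 0, 0, 1, 1]
exam = [88, 92, 70, 85, 65, 72, 90, 79]

r_pb = point_biserial(correct, exam)
r_regular = pearson_r(correct, exam)
print(round(r_pb, 3), round(r_regular, 3))
```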