Module 9 (Correlation Analysis)

Reading Material #9 (Correlation Analysis)
---------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------------------------------
Gabino P. Petilos, Ph.D.
FIC, EDRE 231, 2nd Sem 11-12

CORRELATION ANALYSIS

INTRODUCTION

It has been said that research is conducted in order to find relationships between or
among variables. When factors or variables are related in some systematic pattern, so that a
change in the value of one is associated with a concurrent change in the value of the other,
we say that they are correlated. Thus, we know that ability level is correlated with academic
performance based on our common observation that students belonging to a high ability level
tend to show better academic performance while those belonging to a low ability level tend
to show poor academic performance.

In statistics, we not only establish the existence of certain correlations but also
measure the direction and the degree of correlation. Ideally, we want to know the
correlation between two variables X and Y in a given population (Figure 1). The correlation
between these variables is denoted by the symbol ρ (rho), called the population correlation
coefficient. Since it is not always feasible to study the entire population, we attempt to
describe the correlation between X and Y by drawing a random sample from the population.
We denote the estimate of the parameter ρ by the sample correlation coefficient r.

If a sample is used to estimate the amount of correlation between two variables,
significance testing is called for to find out if the variables in the actual population are indeed
significantly related. For this reason, correlation analysis employs both descriptive and
inferential statistics.

Correlation analysis is concerned with the linear relationship between two variables.
It aims to determine the direction (whether positive or negative) as well as the strength
(whether weak, moderate, or strong) of linear association between two variables. When
two variables vary in the same direction, we say that the variables are positively correlated.
For example, it has been shown that IQ and academic performance are positively correlated.
This means that a person who has high IQ would tend to have a good academic performance
in school and in turn a person's good academic performance is usually associated with his
high IQ.

[Figure 1. A population in which the correlation between X and Y is ρ = ?, and a sample
drawn from it in which the correlation between X and Y is r = ?.]

Other examples of variables which are positively correlated are:

- Grade in Mathematics and Grade in Physics;
- Work performance and Level of morale;
- Number of hours spent in studying and Grades in mathematics.

On the other hand, when two variables vary in opposite directions, the variables
are said to be negatively correlated. Examples of variables which exhibit negative
correlation are:

- Academic achievement and Hours per week of watching TV
- Time spent in typing practice and Number of typing errors
- Absenteeism and Job satisfaction

Variables that are not linearly correlated have zero correlation. For instance, height
of students and their ability level have a zero correlation. In this example, it does not make
sense to associate a particular value of height to a particular ability level. As another
example, there is zero correlation between size of shoes and level of income of bank
managers!

The direction and strength of linear correlation between variables may be described
using a statistical device called a scatter plot or scatter diagram. Examples of scatter plots
are given in Figure 1. Here, the scatter plots from (a) to (c) illustrate a positive correlation
between the two variables in varying strengths while (d) to (f) illustrate a negative
correlation, also in varying strengths. The scatter plots in (a) and (d) illustrate a perfect
correlation between the two variables while those in (g) and (h) illustrate a zero correlation.

[Figure 1. Examples of Scatter Plots. Panels: (a) perfect positive, (b) strong positive,
(c) weak positive, (d) perfect negative, (e) strong negative, (f) weak negative,
(g) zero correlation, (h) zero correlation.]

A strong correlation between two variables can occur even when not all points fall on the
line of relationship, provided they are close to it. If the points lie far from the
line, the correlation is said to be weak (or low) [see graphs (c) and (f)].

When the points do not tend to follow the path of a straight line, the correlation is
said to be zero. This is illustrated by the scatter plots in (g) and (h). Note that zero
correlation between two variables does not necessarily mean that the variables are not
related. In (g), for instance, there is zero linear correlation between the variables, yet they
are related in a quadratic sense.

THE CORRELATION COEFFICIENT

As mentioned earlier, the scatter diagram is a visual device which is useful in
characterizing the direction and strength of linear correlation between two variables. The
direction of relationship is perhaps easy to discern in a scatter diagram. However,
interpretation of the strength of linear correlation using a scatter diagram is not easy since it
is open to various interpretations when viewed by different persons.

The correlation coefficient is another tool by which the direction and strength of
linear correlation between two variables may be described. As a measure of correlation, the
correlation coefficient ranges in value from -1.0 to +1.0. Thus if ρ (rho) represents the
population correlation coefficient, then

-1.0 ≤ ρ ≤ +1.0

If ρ = 1.0, the variables are said to be perfectly correlated in a positive sense. If the
value is -1.0, the variables are perfectly correlated in a negative sense. A value of ρ = 0
indicates a zero linear correlation between the two variables. Figure 2 illustrates the
descriptive interpretation of the correlation coefficient.

[Figure 2. Interpretation of ρ. The value of ρ is shown on a scale from -1.0 to +1.0 with
the following marks: -1.0 perfect negative correlation; -0.5 moderate negative correlation;
0 zero correlation; +0.5 moderate positive correlation; +1.0 perfect positive correlation.
Direction of correlation: NEGATIVE CORRELATION below 0 and POSITIVE CORRELATION above 0.
Strength of correlation: weak or low correlation near 0; strong or high correlation near
-1.0 and +1.0.]


Since ρ is usually unknown, it has to be estimated from sample data. The estimate
of ρ is called a sample correlation coefficient. We present in the succeeding discussion
techniques of estimating the population correlation coefficient.

PEARSON'S PRODUCT-MOMENT CORRELATION COEFFICIENT

Although there are several measures of correlation, the most common and most useful
one is Pearson's product-moment correlation, denoted by r. This measure of
correlation is used when both variables are measured in at least the interval scale. The
computational formula for Pearson's r is given by

$$ r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N\sum X^{2} - (\sum X)^{2}\right]\left[N\sum Y^{2} - (\sum Y)^{2}\right]}} $$ (Equation 1)

Pearson's r is a parametric measure of correlation. The following assumptions
must be satisfied when using Pearson's r:

1. Both variables X and Y must be measured in at least the interval scale;
2. Observations are sampled from a bivariate normal distribution; and
3. The variables are linearly related.

Example 1. The table below shows experimental data for the observed pairs (x, y). Find the
value of r.

x    2    3    7    4    6    8    5
y    3    5    8    5    7   10    4

Solution: For the purposes of this example, let us assume that the first two assumptions
above have been satisfied. To determine whether the third assumption is also satisfied, we
construct a scatter plot for the given data. This scatter plot is shown below.

[Scatter plot of the data in Example 1, with x on the horizontal axis and y on the vertical
axis, both running from -1 to 10.]

Clearly, the scatter plot suggests a linear relationship between the two variables. To
determine the extent of correlation between the two variables, we compute the value of r
using Equation 1. The following worksheet illustrates how this value is computed.

   X      Y      X²     Y²     XY
   2      3       4      9      6
   3      5       9     25     15
   7      8      49     64     56
   4      5      16     25     20
   6      7      36     49     42
   8     10      64    100     80
   5      4      25     16     20
ΣX = 35   ΣY = 42   ΣX² = 203   ΣY² = 288   ΣXY = 239

Based on the obtained sums, the value of r is given by:

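$$ r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N\sum X^{2} - (\sum X)^{2}\right]\left[N\sum Y^{2} - (\sum Y)^{2}\right]}} = \frac{7(239) - (35)(42)}{\sqrt{\left[7(203) - 35^{2}\right]\left[7(288) - 42^{2}\right]}} $$

$$ = \frac{1673 - 1470}{\sqrt{[1421 - 1225][2016 - 1764]}} = 0.9134 $$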

The sample correlation coefficient is usually rounded to the nearest hundredth.
Hence in the above example, we have r = +0.91. Since this value is positive and is close to
1.0, we say that the correlation between X and Y is positive, and that the strength of
correlation is high or strong.
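
The computation in Example 1 can be checked with a short script. The following is a minimal sketch in Python, not part of the original material; the function name `pearson_r` is an illustrative assumption, and the data are the (X, Y) pairs listed in the worksheet.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r via the computational formula (Equation 1)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# (X, Y) pairs as listed in the Example 1 worksheet
x = [2, 3, 7, 4, 6, 8, 5]
y = [3, 5, 8, 5, 7, 10, 4]

r = pearson_r(x, y)
print(round(r, 4))  # 0.9134
```

The function mirrors the worksheet: it accumulates the same five sums and then substitutes them into Equation 1.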

Interpretation of r

It is often difficult to decide what value of r indicates a low, moderate, or high degree
of correlation. This decision involves the size of r. With the help of Figure 2, we can say that
values close to ±1.0 indicate high or strong correlation between the two variables. On the
other hand, values close to 0 indicate low or weak correlation while values that cluster
around ±0.5 indicate moderate correlation. There are some books, however, that offer a
table for interpreting the values of r. The table below, for instance, is taken from the book of
Best & Khan (1989).

Coefficient Interpretation
0 to .20 Negligible
.21 to .40 Low (Weak)
.41 to .60 Moderate
.61 to .80 Substantial
.81 to 1.0 High to Very High

Since there is little agreement on the meaning of the terms "high" or "low"
correlation, it is very important that the researcher interpret the obtained measure of
correlation within the context in which it was obtained (May, et al., 1990). For example, a
correlation coefficient of 0.72 between political affiliation and religion in social science
research may be interpreted as high. The same value, however, may be interpreted as low
when used as a measure of reliability or validity of standardized tests. Also, it is often
tempting to say that an r value of 0.80 is twice as strong as an r value of 0.40. Such an
interpretation is incorrect since the correlation scale is not ratio or interval but rather an
ordinal one.

Another consideration in interpreting a correlation coefficient is when the value is 0.
In general, a value of r = 0 does not mean that the variables are not related. As shown in
Figure 1 (g), a value of 0 merely implies that there is no linear association between the two
variables. Moreover, a value of r different from 0 cannot be construed to mean that one
variable causes the other: if two variables are correlated, it does not follow that one of
them causes the other.

One meaningful interpretation of r involves the concept of the coefficient of
determination, which is denoted by r². This value gives us a measure of the amount of
variation in one variable which can be attributed to the variation of the other variable and
vice versa. Thus, if r = 0.91, r² = 0.8281 or 82.81%, which means that 82.81% of the variation
in one variable is accounted for by the variation of the other variable and vice versa. The
coefficient of determination is a very important and useful concept in regression analysis.

Testing the Significance of r

If the value of the correlation coefficient is obtained from sample data, the
researcher would often want to know whether the variables are in fact related in the actual
population from which the sample was drawn. The hypothesis of interest is about whether
the population correlation coefficient ρ is zero or not. Thus the following null hypothesis
must be tested using the obtained sample correlation coefficient.

Null Hypothesis:         H0: ρ = 0 (There is no correlation between X and Y)
Alternative Hypothesis:  Ha: ρ ≠ 0 (For a Non-directional Test)
                         Ha: ρ < 0 or ρ > 0 (For a Directional Test)

To test whether the obtained Pearson's r is significantly different from zero, a t-test
could be used if N < 30, or a z-test if N > 30. The test statistics are given below:

$$ t = r\sqrt{\frac{N - 2}{1 - r^{2}}}, \qquad \text{d.f.} = N - 2 $$ (Equation 2)

Thus, for instance, if r = .91 and N = 7, we have

$$ t = (.91)\sqrt{\frac{7 - 2}{1 - (.91)^{2}}} = 4.9078 $$

At α = 0.05 (two-tailed) and d.f. = 5, the corresponding critical value of t is 2.571. Since the
absolute value of the computed t-value exceeds the critical value, we reject the null
hypothesis. We conclude that the relationship between the two variables cannot be
attributed to chance.
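
The test above can be reproduced in a few lines. This is a minimal sketch in Python under the same values (r = .91, N = 7); the function name `t_for_r` is an assumption made for illustration, and the critical value 2.571 is simply the table value quoted in the text rather than something the script computes.

```python
import math

def t_for_r(r, n):
    """t statistic for H0: rho = 0 (Equation 2), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r, n = 0.91, 7
t = t_for_r(r, n)
t_critical = 2.571  # two-tailed, alpha = 0.05, d.f. = 5 (table value quoted in the text)

print(round(t, 4))          # 4.9078
print(abs(t) > t_critical)  # True, so the null hypothesis is rejected
```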

When analyzed using SPSS, we obtain the following output. This table provides
both descriptive and inferential information about the correlation between the variables
X and Y. In this table, the value of Pearson's r is 0.913; hence the correlation is positive
and the strength of linear correlation is high. Also, the associated p-value is .004 (two-tailed),
which is less than α = 0.05; hence the null hypothesis is rejected.
Correlations
                                X         Y
X    Pearson Correlation        1         .913**
     Sig. (2-tailed)                      .004
     N                          7         7
Y    Pearson Correlation        .913**    1
     Sig. (2-tailed)            .004
     N                          7         7
**. Correlation is significant at the 0.01 level (2-tailed).

Another test statistic for testing the null hypothesis about the value of the population
correlation coefficient is the z-test. This test statistic is particularly useful when the
hypothesized value of the population correlation coefficient is different from zero, as for
example H0: ρ = ρ0 (ρ0 ≠ 0). In using this test, we first apply Fisher's transformation to
the obtained value of r to get the corresponding z-value. For a given r, the transformed
value of z is given by

$$ z = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right). $$

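For this variable, the mean and standard deviation are given by

$$ \mu_z = \frac{1}{2}\ln\left(\frac{1 + \rho_0}{1 - \rho_0}\right) \quad\text{and}\quad \sigma_z = \frac{1}{\sqrt{n - 3}}, $$

where ρ0 is the hypothesized value of the population correlation coefficient (so that the
mean is 0 when ρ0 = 0).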

Therefore, the quantity $Z = \frac{z - \mu_z}{\sigma_z}$ is a standard score which follows the standard
normal distribution. Using the same values of r and N in Example 1, for example, we have

$$ z = \frac{1}{2}\ln\left(\frac{1.91}{.09}\right) = 1.5275 \quad\text{and}\quad \sigma_z = \frac{1}{\sqrt{7 - 3}} = 0.5. $$

Thus, $Z = \frac{1.5275 - 0}{0.5} = 3.055$, which is significant at the α = .05 level of significance
(two-tailed).
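
The z-test based on Fisher's transformation can be sketched as follows. This is an illustrative Python example, not the author's procedure; the function names and the `rho0` parameter (the hypothesized population value, 0 in this example) are assumptions, and the two-tailed p-value is obtained from the standard library's `NormalDist`.

```python
import math
from statistics import NormalDist

def fisher_z(r):
    """Fisher's transformation: z = 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def z_test_for_r(r, n, rho0=0.0):
    """Standard score for H0: rho = rho0 based on Fisher's transformation."""
    z_r = fisher_z(r)
    mu_z = fisher_z(rho0)           # mean of z under the null hypothesis
    sigma_z = 1 / math.sqrt(n - 3)  # standard deviation of z
    return (z_r - mu_z) / sigma_z

Z = z_test_for_r(0.91, 7)
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(Z)))

print(round(fisher_z(0.91), 4))  # 1.5275
print(round(Z, 3))               # 3.055
print(p_two_tailed < 0.05)       # True
```

Passing a nonzero `rho0` covers the case the text singles out, in which the hypothesized population correlation differs from zero.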

Pearson's product-moment correlation is the most popular measure of
correlation. However, as was pointed out earlier, this measure is appropriate only when
both variables are measured in at least the interval scale. When the assumptions on the use
of r are not met, it is not advisable to use Pearson's r. Instead, we estimate the
population correlation using other measures of correlation. The succeeding discussion
considers other measures of correlation for cases in which the scale of measurement is not
interval or one of the assumptions (normality and linearity) is violated.

OTHER MEASURES OF CORRELATION

The Spearman's Rank Order Correlation (r_s)

The Spearman's rho (r_s)ᵃ is a measure of correlation based on the difference between
ranks of the values of two variables X and Y. It is used when both variables are measured in
at least the ordinal scale. Spearman's rho is the nonparametric counterpart of
Pearson's r. Unlike Pearson's r, this measure does not make assumptions about normality
of distribution of the paired data.

The formula for computing Spearman's r_s is given by

$$ r_s = 1 - \frac{6\sum d^{2}}{N(N - 1)(N + 1)} $$ (Equation 3)

where d is the difference between the ranks of paired values of X and Y, and N is the total
number of cases.

When ranking the data, 1 is usually treated as the lowest rank corresponding to the
lowest score value of the variable, followed by 2 for the next higher score, etc. Thus,
higher ranks correspond to higher scores while lower ranks correspond to lower scores. You
have to adopt this rule of ranking numbers because this is the convention used in analyzing
ordinal data using nonparametric statistics. (Note: The same value of Σd² in Equation 3 is
obtained if we assign rank 1 to the highest score instead of rank 1 to the lowest score. Check
this!)

Another important rule that you should remember is the assignment of ranks for tied
scores. The rule is very simple: think of the scores as if they were distinct, get their ranks,
and assign the average of their ranks as the ranks of the tied scores. Let us illustrate these
rules by considering an example.

ᵃ We do not use the symbol ρ for Spearman's rho since ρ is our symbol for the population correlation coefficient.

Suppose we have the following scores in an achievement test in science: 47, 43, 46,
40, 43, 47, 47, 48. The scores in ascending order with their ranks if they were distinct and
actual ranks considering the tied scores are shown as follows:

Score                              40    43    43    46    47    47    47    48
Ranks if scores were distinct       1     2     3     4     5     6     7     8
Actual ranks (with tied scores)     1   2.5   2.5     4     6     6     6     8

Rank of each 43: (2 + 3) / 2 = 2.5
Rank of each 47: (5 + 6 + 7) / 3 = 6

Thus, the two 43s are ranked 2.5 each while the three 47s are ranked 6 each.
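
The ranking rule for tied scores can be expressed as a small routine. The following is a minimal sketch in Python; the function name `average_ranks` is an assumption for illustration, and the approach (rank of the first tied position plus half the number of remaining ties) is just one simple way to obtain the average rank.

```python
def average_ranks(scores):
    """Rank scores from lowest (rank 1) upward; tied scores share the average rank."""
    ordered = sorted(scores)
    rank_of = {}
    for value in set(ordered):
        first = ordered.index(value) + 1          # rank of the first tied score if distinct
        count = ordered.count(value)              # number of tied scores
        rank_of[value] = first + (count - 1) / 2  # average of ranks first .. first+count-1
    return [rank_of[s] for s in scores]

scores = [47, 43, 46, 40, 43, 47, 47, 48]
print(average_ranks(scores))
# [6.0, 2.5, 4.0, 1.0, 2.5, 6.0, 6.0, 8.0]
```

The ranks are returned in the original order of the scores, which is the form needed when pairing them with the ranks of a second variable.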

Example 2. The following hypothetical data are the grades of 7 students in mathematics and
statistics. Estimate the strength of correlation between the variables using
Spearman's rank order correlation coefficient.

X (Grade in Math)    Y (Grade in Statistics)
       86                     88
       78                     78
       79                     78
       85                     86
       87                     90
       90                     88
       87                     78

Solution: The necessary computations are indicated in the following table based on the
ranks of the values of X and values of Y denoted by R_X and R_Y, respectively.

  X     Y     R_X    R_Y    d = R_Y - R_X    d²
 86    88     4      5.5         1.5        2.25
 78    78     1      2           1          1
 79    78     2      2           0          0
 85    86     3      4           1          1
 87    90     5.5    7           1.5        2.25
 90    88     7      5.5        -1.5        2.25
 87    78     5.5    2          -3.5       12.25
                                      Σd² = 21

Since N = 7 and Σd² = 21, it follows that

$$ r_s = 1 - \frac{6(21)}{(7)(48)} = 0.625 $$

or r_s = 0.63. Hence, there is a substantial positive correlation between students' grades in
mathematics and grades in statistics.

Testing the Significance of r_s

The test statistic for testing the significance of r_s is similar to Equation 2. The
test statistic is given by

$$ t = r_s\sqrt{\frac{N - 2}{1 - r_s^{2}}}, \qquad \text{d.f.} = N - 2. $$ (Equation 4)

Using Example 2, the relevant null hypothesis and the corresponding alternative
hypothesis can be stated as follows:

H0: There is no significant correlation between grades in mathematics and grades in statistics.
Ha: There is a significant correlation between grades in mathematics and grades in statistics.

Since r_s = 0.625 and N = 7, the computed t-value is given by

$$ t = (.625)\sqrt{\frac{7 - 2}{1 - (.625)^{2}}} = 1.7903 $$

We use a two-tailed test because the alternative hypothesis is non-directional. At
α = 0.05 (two-tailed) and d.f. = 5, the corresponding critical value of t is 2.571. Since the
absolute value of the computed t-value did not exceed the critical value, the null hypothesis
cannot be rejected. We say that the data did not provide sufficient evidence to reject the
null hypothesis.
--------------------------------------------------------------------------------------------------------------------------
Note: When a statistical test IS NOT SIGNIFICANT, we accept the null hypothesis. Accepting the null
hypothesis, however, does not mean that it (the null hypothesis) is true because we only considered
one sample out of the so many possible samples from the population.
--------------------------------------------------------------------------------------------------------------------------

We present below the SPSS output for the same data analyzed using Spearman's
rank-order correlation coefficient. Note that the p-value = 0.151 > α = 0.05. Hence, the
variables are NOT significantly correlated. Note the discrepancy between the value we
obtained using the formula (0.625) and the value in the SPSS output, which is 0.604. The
difference arises from the tied observations: Equation 3 is exact only when there are no ties,
and with ties present statistical software typically computes r_s as Pearson's r applied to
the ranks. (Research on this.)
Correlations
                                                                  Grade in       Grade in
                                                                  Mathematics    Statistics
Spearman's rho   Grade in Mathematics   Correlation Coefficient   1.000          .604
                                        Sig. (2-tailed)           .              .151
                                        N                         7              7
                 Grade in Statistics    Correlation Coefficient   .604           1.000
                                        Sig. (2-tailed)           .151           .
                                        N                         7              7
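
The gap between the hand-computed 0.625 and the software value 0.604 can be explored numerically. The sketch below, in Python, computes r_s twice from the ranks in the Example 2 worksheet: once with Equation 3 and once as Pearson's r applied to the ranks. That the second approach is what the software does in the presence of ties is an assumption here, though it does reproduce the reported value.

```python
import math

# Ranks taken from the Example 2 worksheet
rank_x = [4, 1, 2, 3, 5.5, 7, 5.5]
rank_y = [5.5, 2, 2, 4, 7, 5.5, 2]
n = len(rank_x)

# Equation 3: based on the squared rank differences
sum_d2 = sum((ry - rx) ** 2 for rx, ry in zip(rank_x, rank_y))
rs_equation3 = 1 - 6 * sum_d2 / (n * (n - 1) * (n + 1))

# Pearson's r applied to the ranks (deviation-score form)
mean_x = sum(rank_x) / n
mean_y = sum(rank_y) / n
sxy = sum((rx - mean_x) * (ry - mean_y) for rx, ry in zip(rank_x, rank_y))
sxx = sum((rx - mean_x) ** 2 for rx in rank_x)
syy = sum((ry - mean_y) ** 2 for ry in rank_y)
rs_on_ranks = sxy / math.sqrt(sxx * syy)

print(round(rs_equation3, 3))  # 0.625
print(round(rs_on_ranks, 3))   # 0.604
```

When no scores are tied, the two computations give the same value, which is why Equation 3 is commonly presented as the formula for r_s.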

The Point-Biserial Correlation Coefficient (r_pb)

In the previous discussion, we considered the correlation between the variables X
and Y when both are measured in at least the interval scale or ordinal scale. If, for instance,
the variable X is a dichotomy (categorical with 2 categories) and Y is measured in the
interval scale, neither Pearson's r nor Spearman's rank order correlation coefficient is
appropriate as a measure of correlation.

The point biserial correlation coefficient, which is denoted by r_pb, is a measure of
correlation which is appropriate when one variable is a dichotomy and the other is measured
in at least the interval scale. For instance, if one wants to know the strength of correlation
between gender and mathematics performance, then the point biserial correlation
coefficient will be an appropriate measure. The formula for the point biserial coefficient is
given by

$$ r_{pb} = \frac{M_p - M_q}{SD_X}\sqrt{pq} $$ (Equation 5)

where M_p = the mean score of those in one category of the dichotomised variable
      M_q = the mean score of those scoring in the other category
      p = the proportion scoring in the first category
      q = the proportion scoring in the other category
      SD_X = the standard deviation of the interval variable.

NOTE: This formula is discussed on page 95 and the example is given on page 96 of Module 12.

There is another formula for the point-biserial correlation coefficient which is slightly
different from Equation 5. The formula makes use of the number of cases in each category of
the dichotomized variable and is given by

$$ r_{pb} = \frac{M_1 - M_0}{SD_X}\sqrt{\frac{n_1 n_0}{n(n - 1)}} $$ (Equation 6)

where M_1 = the mean score of the scores in category 1 of the dichotomised variable
      M_0 = the mean score of the scores in category 0 of the dichotomised variable
      n_1 = the number of cases in category 1
      n_0 = the number of cases in category 0
      n = the total number of cases (n_1 + n_0)
      SD_X = the standard deviation of the interval variable.

The test of significance of r_pb is given by the test statistic

$$ t = r_{pb}\sqrt{\frac{n - 2}{1 - r_{pb}^{2}}}, \qquad \text{d.f.} = n - 2. $$ (Equation 7)

Note the similarity between this test statistic and the test statistics for testing
Pearson's r and Spearman's r_s (Equations 2 and 4).

Example 3. Are graduates from private high schools better than graduates from public
schools? Suppose we have the entrance test scores of 6 students who
graduated from private schools (coded 1) and 8 students who graduated from
public high schools (coded 0) as follows:

Score    89  78  94  86  85  79  81  82  96  90  88  75  87  84
School    1   0   1   0   1   0   0   0   1   0   1   0   0   1

a) Compute the correlation coefficient between type of high school graduated
from and entrance test score using Equation 6.
b) State the null hypothesis and the corresponding alternative hypothesis.
c) Test the null hypothesis at α = .05.

Solution: a) We first categorize the scores into two groups coded 1 and 0 as shown below.

Group Coded 1 (Private)     Group Coded 0 (Public)
         89                          78
         94                          86
         85                          79
         96                          81
         88                          82
         84                          90
                                     75
                                     87
n_1 = 6, M_1 = 89.33        n_0 = 8, M_0 = 82.25
n = 14
SD_X = 5.9927 (s.d. of all scores combined)

Using the summary values in the table, we have

$$ r_{pb} = \frac{89.33 - 82.25}{5.9927}\sqrt{\frac{(6)(8)}{14(14 - 1)}} = 0.607 $$

or r_pb = 0.61.

b) The null hypothesis and the corresponding alternative hypothesis based on
the given problem are as follows:

H0: There is no significant relationship between type of high school
    graduated from and score in the entrance test.
Ha: Graduates from private schools are better than graduates from public
    schools in terms of scores in the college entrance test.

c) To test the significance of the obtained r_pb, we compute the test statistic using
the values r_pb = 0.607 and n = 14. Thus,

$$ t = (.607)\sqrt{\frac{14 - 2}{1 - (.607)^{2}}} = 2.646 $$

Because the alternative hypothesis is directional, we use a one-tailed test. The
critical value of t at α = 0.05 and d.f. = 12 (one-tailed) is 1.782. Since the absolute value of the
computed t-value exceeds the critical value, the null hypothesis is rejected. Based on the
hypothetical data, we conclude that there is a significant correlation between type of school
graduated from and performance in the college entrance test. The specific relationship
between the given variables can be specified by using the mean scores of the two groups.
Thus, we say that graduates from private schools generally perform better on the entrance
test than graduates from public high schools.
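
The computations in parts (a) and (c) of Example 3 can be checked with a short script. This is a minimal sketch in Python using Equation 6 and Equation 7; the variable names are illustrative assumptions, and SD_X is taken to be the sample standard deviation (divisor n - 1), which is the choice that reproduces the value 5.9927 given in the solution.

```python
import math
from statistics import mean, stdev

scores = [89, 78, 94, 86, 85, 79, 81, 82, 96, 90, 88, 75, 87, 84]
school = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1]  # 1 = private, 0 = public

group1 = [s for s, c in zip(scores, school) if c == 1]
group0 = [s for s, c in zip(scores, school) if c == 0]
n1, n0 = len(group1), len(group0)
n = n1 + n0

sd_x = stdev(scores)  # sample s.d. of all scores combined (divisor n - 1)

# Equation 6
r_pb = (mean(group1) - mean(group0)) / sd_x * math.sqrt(n1 * n0 / (n * (n - 1)))

# Equation 7, d.f. = n - 2
t = r_pb * math.sqrt((n - 2) / (1 - r_pb ** 2))

print(round(r_pb, 3))  # 0.607
print(round(t, 3))     # 2.646
```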

Remarks:

1. The same conclusion is arrived at when a t-test for independent samples is conducted.
Using the pooled variance estimate, the computed t-value is 2.646 (which is equal to the
computed value in the test of significance of r_pb) with a two-tailed p-value of 0.021.
Since the test is one-tailed, the p-value is 0.021/2 = 0.0105ᵇ, which is less than α = 0.05,
as shown in the computer output. Hence, the mean scores of 89.33 and 82.25 are
significantly different in favor of students who graduated from private schools.

Group Statistics (Entrance test score)
Type of high school     N     Mean       Std. Deviation    Std. Error Mean
Private                 6     89.3333    4.80278           1.96073
Public                  8     82.2500    5.06388           1.79035

Independent Samples Test (Entrance test score)
                               Levene's Test for
                               Equality of Variances    t-test for Equality of Means
                               F        Sig.            t        df        Sig. (2-tailed)
Equal variances assumed        .043     .839            2.646    12        .021
Equal variances not assumed                             2.668    11.235    .022

ᵇ The p-value associated with the computed t of 2.646 is 0.021 (2-tailed). Since the test is one-tailed and the
difference is in the hypothesized direction, this p-value must be divided by 2 (0.021 / 2 = 0.0105).

Using the computed value of t, the value of r_pb can actually be computed by solving for
r_pb in the formula $t = r_{pb}\sqrt{\frac{n - 2}{1 - r_{pb}^{2}}}$. This value is given by

$$ r_{pb} = \sqrt{\frac{t^{2}}{t^{2} + \text{d.f.}}}, $$

where t is the computed t-value and d.f. = n_1 + n_2 - 2. From the computer output, we have
t = 2.646 and the corresponding degrees of freedom is 12. Hence we have,

$$ r_{pb} = \sqrt{\frac{t^{2}}{t^{2} + \text{d.f.}}} = \sqrt{\frac{2.646^{2}}{2.646^{2} + 12}} = 0.607 $$

which is identical to the value obtained using the formula for r_pb. This is the reason why
Equation 6 is preferred over Equation 5.
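
The step from Equation 7 to the expression for r_pb in terms of t is not shown in the text. It can be sketched as follows, writing d.f. = n - 2 and assuming t and r_pb have the same sign (as they do in this example):

```latex
\begin{align*}
t &= r_{pb}\sqrt{\frac{n - 2}{1 - r_{pb}^{2}}} \\
t^{2}\,\bigl(1 - r_{pb}^{2}\bigr) &= r_{pb}^{2}\,(n - 2) \\
t^{2} &= r_{pb}^{2}\,\bigl(t^{2} + n - 2\bigr) \\
r_{pb}^{2} &= \frac{t^{2}}{t^{2} + \text{d.f.}}, \qquad \text{d.f.} = n - 2 \\
r_{pb} &= \sqrt{\frac{t^{2}}{t^{2} + \text{d.f.}}}
\end{align*}
```

The square root gives the magnitude of r_pb; its sign follows the sign of t, that is, the direction of the difference between the two group means.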

2. Another reason which justifies the use of Equation 6 instead of Equation 5 is that the
value of r_pb can be obtained using the same formula as Pearson's r. However, the
categories of the dichotomous variable are first coded as 1 and 0 (other codes are NOT
acceptable). We construct a worksheet similar to the one used for computing Pearson's r. The
variables to be correlated are X [school type (coded 1 and 0)] and Y (entrance test score).
The worksheet is shown as follows:

X (School Type)    Y (Entrance Test Score)    X²     Y²       XY
      1                     89                 1     7921     89
      0                     78                 0     6084      0
      1                     94                 1     8836     94
      0                     86                 0     7396      0
      1                     85                 1     7225     85
      0                     79                 0     6241      0
      0                     81                 0     6561      0
      0                     82                 0     6724      0
      1                     96                 1     9216     96
      0                     90                 0     8100      0
      1                     88                 1     7744     88
      0                     75                 0     5625      0
      0                     87                 0     7569      0
      1                     84                 1     7056     84
ΣX = 6    ΣY = 1194    ΣX² = 6    ΣY² = 102298    ΣXY = 536

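Therefore,

$$ r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N\sum X^{2} - (\sum X)^{2}\right]\left[N\sum Y^{2} - (\sum Y)^{2}\right]}} = \frac{14(536) - (6)(1194)}{\sqrt{\left[14(6) - (6)^{2}\right]\left[14(102298) - (1194)^{2}\right]}} = 0.607 = r_{pb} $$

as obtained before.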
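
The agreement between Pearson's formula on the 1/0 coded data and the point-biserial formulas can be checked numerically. The Python sketch below applies Equation 1 to the coded data and also evaluates Equation 5. One assumption is made loudly here: for Equation 5, SD_X is taken as the population standard deviation (divisor n), since the text does not say which form is intended; under that assumption the two results coincide.

```python
import math
from statistics import mean, pstdev

scores = [89, 78, 94, 86, 85, 79, 81, 82, 96, 90, 88, 75, 87, 84]
school = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1]  # 1 = private, 0 = public
n = len(scores)

# Pearson's r (Equation 1) with X = school code and Y = entrance test score
sum_x, sum_y = sum(school), sum(scores)
sum_x2 = sum(x * x for x in school)
sum_y2 = sum(y * y for y in scores)
sum_xy = sum(x * y for x, y in zip(school, scores))
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Equation 5, ASSUMING SD_X is the population s.d. (divisor n)
p = sum_x / n
q = 1 - p
m_p = mean([s for s, c in zip(scores, school) if c == 1])
m_q = mean([s for s, c in zip(scores, school) if c == 0])
r_eq5 = (m_p - m_q) / pstdev(scores) * math.sqrt(p * q)

print(round(r, 3))      # 0.607
print(round(r_eq5, 3))  # 0.607
```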

Chi-Square Basedᶜ Measures of Correlation

In our discussion of the Chi-square test, we mentioned that the test can be used to
establish independence of variables. When the null hypothesis is rejected using the Chi-
Square test, we conclude that the variables ARE NOT independent, which means that the
variables are correlated. The Chi-square value, however, does not give information as to the
strength of correlation between the variables.

We can estimate the strength of correlation by using the computed value of the Chi-square
statistic (hence the term Chi-Square based). These measures are described as crude
measures because they are not as accurate or reliable as the Pearson-based
measures (r, r_s, and r_pb).

Also, the technique employed here is different from the Pearson based measures in
the sense that one tries to establish first whether the variables are significantly correlated or
not using the Chi-square statistic. The strength of correlation is computed only when the Chi-
square test is significant. We outline below the computation of some Chi-square based
measures of correlation.

A. Contingency Coefficient:

$$ C = \sqrt{\frac{\chi^{2}}{\chi^{2} + N}} $$

where χ² is the computed Chi-square value and N is the grand total in the contingency table.

This measure is used when the contingency table is a square table with at least
three categories for each variable.

Example 4. The Chi-square value based on the contingency table below is 27.160, which is
significant at $\alpha = 0.05$. Estimate the strength of correlation between interest in
sports and social class using the contingency coefficient.

                            Social Class
Interest in Sports    Working   Middle   Upper   TOTAL
High                     12       45        7      64
Moderate                 24       40       21      85
Low                      21       14       23      58
Total                    57       99       51     207

Solution: The contingency table is a 3x3 square table. Hence the contingency coefficient is
an appropriate measure of correlation. Using the values $\chi^2 = 27.160$ and $N = 207$,
we have

$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}} = \sqrt{\frac{27.160}{27.160 + 207}} = 0.34.$$

Hence interest in sports and social class are significantly correlated (since the computed
Chi-square value is significant) and the strength of correlation is 0.34 (weak).

(c) Siegel, S. & Castellan, J. (1988). Nonparametric Statistics (2nd ed.). New York: McGraw-Hill Book Company.
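
The computation in Example 4 can be reproduced with a short Python sketch. It recomputes the Chi-square statistic from the observed counts (expected count = row total times column total divided by N) and then applies the contingency coefficient formula. The function names `chi_square` and `contingency_coefficient` are hypothetical helpers introduced only for illustration.

```python
import math

def chi_square(table):
    """Pearson Chi-square statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n

def contingency_coefficient(chi2, n):
    """C = sqrt(chi2 / (chi2 + N))."""
    return math.sqrt(chi2 / (chi2 + n))

# Rows: High, Moderate, Low interest; columns: Working, Middle, Upper class
sports = [[12, 45, 7],
          [24, 40, 21],
          [21, 14, 23]]

chi2, n = chi_square(sports)
c = contingency_coefficient(chi2, n)
print(round(chi2, 2), n, round(c, 2))  # 27.16 207 0.34
```

The sketch only measures strength; whether the Chi-square value is significant still has to be judged against the critical value for the table's degrees of freedom.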

B. Cramer's Coefficient: $C_r = \sqrt{\dfrac{\chi^2}{N(L-1)}}$, where

$\chi^2$ is the computed Chi-square value;
$N$ is the grand total in the contingency table; and
$L$ is either the number of rows or the number of columns, whichever is smaller.

Cramer's coefficient is applied when the contingency table is non-square.

Example 5. The table below is taken from page 5 of Reading Material #11. The computed
Chi-square value based on this contingency table is 20.67, which is significant at
$\alpha = 0.05$. Estimate the strength of correlation between method of teaching and
academic performance using Cramer's coefficient.

                             Method of Teaching
Performance Category     Lecture   Modular   CAI   TOTAL
Above Satisfactory           9       20       18     47
Satisfactory                12       18       21     51
Fair                        15       10        8     33
Below Satisfactory          24       12        6     42
Total                       60       60       53    173

Solution: As gleaned from the problem, $\chi^2 = 20.67$ and $N = 173$. Also, the contingency
table is non-square and the smaller number of categories is 3, hence $L = 3$.
Therefore, we have

$$C_r = \sqrt{\frac{\chi^2}{N(L-1)}} = \sqrt{\frac{20.67}{173(3-1)}} = 0.24.$$

Based on the computed Cramer's coefficient, there is a significant correlation
between method of teaching and academic performance but the strength of
correlation is weak (0.24).
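
A minimal Python sketch of the Cramer's coefficient computation in Example 5 follows. It takes the Chi-square value reported in the example (20.67) as given rather than recomputing it; the function name `cramers_coefficient` and its parameters are illustrative assumptions.

```python
import math

def cramers_coefficient(chi2, n, n_rows, n_cols):
    """C_r = sqrt(chi2 / (N * (L - 1))), with L the smaller table dimension."""
    smaller = min(n_rows, n_cols)  # L in the formula
    return math.sqrt(chi2 / (n * (smaller - 1)))

# Example 5: a 4x3 table with N = 173 and the reported Chi-square value
c_r = cramers_coefficient(chi2=20.67, n=173, n_rows=4, n_cols=3)
print(round(c_r, 2))  # 0.24
```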

C. Phi Coefficient: $\phi = \sqrt{\dfrac{\chi^2}{N}}$, where $\chi^2$ is the computed Chi-square value and $N$ is the
grand total in the contingency table.

The Phi-coefficient is used only for 2x2 contingency tables.

Example 6. A survey of 300 undergraduate and 100 graduate students from a large
university was conducted to determine their opinions on autonomous status of
colleges. The following contingency table was generated from the survey.

                  Level of Education
Opinion        Undergrad   Graduate   Total
Favor             100         70       170
Not Favor         200         30       230
Total             300        100       400

Find out if there is a significant correlation between opinion and level of
education at $\alpha = 0.05$. If significant, estimate the strength of correlation.

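Solution: We are given a 2x2 table, thus we can compute the $\chi^2$ value using the formula

$$\chi^2 = \frac{N(ad - bc)^2}{(a+c)(b+d)(a+b)(c+d)}$$

with d.f. = 1, where $a$ and $b$ are the cell frequencies in the first row of the table and
$c$ and $d$ are those in the second row.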

Note that the expected frequencies are all greater than 5 (check this). Thus,

$$\chi^2 = \frac{N(ad - bc)^2}{(a+c)(b+d)(a+b)(c+d)} = \frac{400[(100)(30) - (200)(70)]^2}{(300)(100)(170)(230)} = 41.262.$$

The critical value of $\chi^2$ at $\alpha = 0.05$ and d.f. = 1 is 3.84. Since the computed
$\chi^2$ is greater than the critical value, the null hypothesis is rejected, which means
that the opinions of the students regarding the autonomous status of colleges depend
on the level of education.

Using the Phi-coefficient, the estimated strength of correlation is

$$\phi = \sqrt{\frac{\chi^2}{N}} = \sqrt{\frac{41.262}{400}} = 0.32 \text{ (weak)}.$$
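
The steps of Example 6 can be sketched in Python as follows: check the expected frequencies, compute the Chi-square value with the 2x2 shortcut formula, compare it with the critical value 3.84, and only then compute the Phi-coefficient. All function names are hypothetical, and the cells a, b, c, d are assumed to be read row by row from the table.

```python
import math

def expected_frequencies(a, b, c, d):
    """Expected counts (row total * column total / N) for a 2x2 table."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[r * k / n for k in cols] for r in rows]

def chi_square_2x2(a, b, c, d):
    """Shortcut Chi-square formula for a 2x2 table (d.f. = 1)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

def phi_coefficient(chi2, n):
    """phi = sqrt(chi2 / N)."""
    return math.sqrt(chi2 / n)

# Example 6: Favor = (100, 70), Not Favor = (200, 30)
a, b, c, d = 100, 70, 200, 30
expected = expected_frequencies(a, b, c, d)
chi2 = chi_square_2x2(a, b, c, d)
critical_value = 3.84  # from the text: alpha = 0.05, d.f. = 1

print(expected)        # [[127.5, 42.5], [172.5, 57.5]]
print(round(chi2, 3))  # 41.262
if chi2 > critical_value:
    # Strength is estimated only when the test is significant
    phi = phi_coefficient(chi2, a + b + c + d)
    print(round(phi, 2))  # 0.32
```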

---------------------------------------------------------------------------------------------------------------------------
