
# Correlation Coefficient (r)

- Regression Analysis

[Figure: a series of scatterplots of Y against X; one plot marks a single observation (Xi, Yi).]
- The correlation coefficient is based on the covariance.
- For a sample, the covariance is calculated as:

$$s_{xy} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{N - 1}$$
- Interpretation: Covariance tells us how variation in one variable "goes with" variation in another variable ("covary").
- When two variables are statistically independent, their covariance = 0. The converse does not hold in general: a covariance of 0 rules out only a linear relationship.
- Positive relationships are indicated by a + value, negative relationships by a – value.
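- Problem with covariance as a measure of association? Its magnitude depends on the units in which X and Y are measured, so values cannot be compared across pairs of variables.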
- Correlation Coefficient (Pearson's r)
- A way of standardizing the covariance.
- $r_{xy} = \dfrac{s_{xy}}{s_x s_y}$
- Interpretation: Measures the strength of a linear relationship.
- $-1 \le r \le 1$
- X and Y are uncorrelated (no linear relationship) iff $r_{xy} = 0$. Independent variables are always uncorrelated, but uncorrelated variables need not be independent.
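The covariance and Pearson's r can be computed directly from their definitions. The sketch below is a minimal illustration using only the Python standard library; the data values and the function names `covariance` and `pearson_r` are invented for this example and are not part of the original material. It shows why standardizing matters: changing the units of Y changes the covariance but leaves r unchanged.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance, using the N - 1 denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of the sample standard deviations."""
    s_x = sqrt(covariance(x, x))  # the covariance of a variable with itself is its variance
    s_y = sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y_cm = [150, 160, 165, 172, 180]   # Y measured in centimetres
y_m = [v / 100 for v in y_cm]      # the same Y measured in metres

# The covariance shrinks by a factor of 100 when Y is rescaled...
print(covariance(x, y_cm), covariance(x, y_m))
# ...while r is the same (up to floating-point rounding) in both units.
print(pearson_r(x, y_cm), pearson_r(x, y_m))
```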
- What explains variation in the generosity of state welfare expenditures? (STATES – 55)
- What explains variation in the generosity of state welfare expenditures? (STATES – 1570)
  - 738 – Poverty rate
  - 739 – Median Family Income
  - 1644 – % Clinton
  - 1715 – Female State Legislators
- Regression is concerned with the dependence of one variable (the dependent variable, measured at the interval/ratio level) on one or more other variables (independent variables, measured at the interval, ratio, ordinal or nominal levels).
- Y is used as the dependent variable and X as the independent variable.
- The correlation coefficient measures the strength of a linear association between two variables measured at the interval level.
- In a scatterplot: the degree to which the points in the plot cluster around a "best-fitting" line.
- The purpose of regression analysis is to determine exactly what that line is (i.e. to estimate the equation for the line).
- The regression line represents predicted values of Y based on the value of X.
[Figure: scatterplots of Y against X with a fitted line; one observation (Xi, Yi) is marked.]
$Y_i = a + bX_i$

- a = Intercept, or constant = the value of Y when X = 0
- b = Slope coefficient = the change (+ or -) in Y given a one-unit increase in X
$Y_i = a + bX_i + e_i$

- Residual ($e_i$): for every observation, the difference between the observed value of Y and the value predicted by the regression line ("prediction errors")
- Using statistical calculations, for any relationship between X and Y, we can determine the best-fitting line for the relationship.
- This means finding specific values for a and b for the regression equation

$Y_i = a + bX_i + e_i$

- Regression analysis finds the line that minimizes the sum of squared residuals.
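For a single independent variable, the values of a and b that minimize the sum of squared residuals have a standard closed form: b is the sum of cross-products of deviations divided by the sum of squared deviations of X, and a is the mean of Y minus b times the mean of X. The sketch below applies those formulas to invented data; the function names `fit_line` and `sum_squared_residuals` are hypothetical and chosen only for this illustration.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx            # slope
    a = mean_y - b * mean_x    # intercept: the line passes through (mean_x, mean_y)
    return a, b

def sum_squared_residuals(x, y, a, b):
    """Sum of squared differences between observed Y and the line a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
best = sum_squared_residuals(x, y, a, b)

# Nudging either coefficient away from the fitted value makes the fit worse.
worse_slope = sum_squared_residuals(x, y, a, b + 0.1)
worse_intercept = sum_squared_residuals(x, y, a + 0.1, b)
print(a, b, best, worse_slope, worse_intercept)
```

Comparing `best` with the two perturbed sums is a quick sanity check that the fitted line is the minimizer, not a proof.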

$Y_i = a + bX_i + e_i$

- a = the expected value of Y when X = 0
- b = the expected change in Y given a one-unit increase in X

$Y_i = a + bX_i + e_i$

- We can calculate a predicted value for the dependent variable for any value of X by using the equation for the regression line:

$\hat{Y}_i = a + bX_i$
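[Figure: three annotated plots of a regression line of Y against X, showing the intercept, the slope (b) as the change in Y for one unit of X between Xi and Xj, and the residual ei as the gap between the observed Yi and the predicted Yi at Xi.]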
- Research Question: Did the butterfly ballot result in an unusual number of votes for Pat Buchanan in the 2000 election in Palm Beach Co.?
- Did it cost Al Gore the election?
- Unit of analysis – Fla. counties (67)
- Dependent variable (Y) – vote for Buchanan in 2000
- Independent variable (X) – vote for Buchanan in 1996 Republican primary
[Figure: scatterplot of Florida counties, each point labeled with the county name. X axis: Buchanan vote in 1996 (0 to 15000). Y axis: Buchanan vote in 2000 (0 to 4000). Palm Beach sits far above every other county.]
Y = 12.957 + .101(X)

- Intercept (a) = 12.957
- Slope (b) = .101

2000 vote = 12.957 + .101(1996 vote)

- To generate a predicted value for Palm Beach in 2000, we can plug the appropriate X value into the regression equation and solve for Y.
- In 1996, Buchanan received 8788 votes in Palm Beach. Our prediction for Palm Beach in 2000 based on this regression is thus:

12.957 + .101 × 8788 ≈ 903.45

(What does this tell us?)
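The prediction step amounts to evaluating the regression equation at one value of X. The sketch below does this with the coefficients as printed in the text; the function name `predict_2000_vote` is hypothetical. Note that the rounded coefficients give roughly 900.5, slightly below the 903.45 reported above. One plausible explanation, stated here as an assumption, is that the reported figure was computed from unrounded coefficients.

```python
INTERCEPT = 12.957   # a, as printed in the text
SLOPE = 0.101        # b, as printed in the text (assumed to be rounded)

def predict_2000_vote(vote_1996):
    """Predicted 2000 Buchanan vote for a county, given its 1996 primary vote."""
    return INTERCEPT + SLOPE * vote_1996

palm_beach_1996 = 8788   # Buchanan's 1996 vote in Palm Beach, from the text
predicted = predict_2000_vote(palm_beach_1996)
print(round(predicted, 1))
```

Either way, the model predicts on the order of 900 votes for Buchanan in Palm Beach in 2000.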

[Figure: scatterplot of Buchanan vote in 2000 against Buchanan vote in 1996 for Florida counties, annotated with Buchanan's vote in Palm Beach in the 2000 election and the predicted vote for Buchanan based on the 1996 vote (from the regression model).]
- We can calculate the residual for any observation by first calculating the predicted value for Y, and then subtracting the predicted value from the observed value of Y:

$e_i = Y_i - \hat{Y}_i$

- For any observation in our data, the residual represents the "prediction error" for that observation (based on the regression equation).
[Figure: scatterplot of Buchanan vote in 2000 against Buchanan vote in 1996 for Florida counties, annotated with the actual vote for Buchanan in Palm Beach in the 2000 election, the predicted vote for Buchanan based on the 1996 vote (from the regression model), and the gap between the two.]

Palm Beach residual: 3407 - 903.45 = 2503.55
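The Palm Beach residual can be reproduced as a small worked calculation. The sketch below uses the observed and predicted values exactly as given in the text; the function name `residual` is hypothetical. It also prints the ratio of observed to predicted votes, which gives a sense of scale for the prediction error.

```python
def residual(observed, predicted):
    """Prediction error: observed value of Y minus the value predicted by the line."""
    return observed - predicted

actual_2000 = 3407        # observed Buchanan vote in Palm Beach, from the text
predicted_2000 = 903.45   # predicted vote for Palm Beach, from the text

e = residual(actual_2000, predicted_2000)
print(round(e, 2))                              # size of the prediction error in votes
print(round(actual_2000 / predicted_2000, 2))   # observed vote as a multiple of the prediction
```

The observed vote is well over three times the predicted vote, which is what makes Palm Beach stand out in the scatterplot.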
- Testing for statistical significance for the slope
- The p-value: the probability of observing a sample slope at least as far from zero as the one we observe in our sample IF THE NULL HYPOTHESIS IS TRUE (the null hypothesis being that the slope is zero)
- P-values closer to zero indicate stronger evidence against the null hypothesis (.05 is usually the threshold for statistical significance)
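Statistical software usually reports the slope's p-value from a t-test. As a standard-library-only illustration of the same idea, the sketch below substitutes a permutation test: it shuffles Y to simulate a world in which the null hypothesis is true, then counts how often the shuffled slope is at least as far from zero as the observed one. This is a different technique from the t-test, and the data, function names, and number of shuffles are all assumptions made for this example.

```python
import random

def slope(x, y):
    """Least-squares slope of Y on X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    return s_xy / s_xx

def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Share of shuffles whose slope is at least as far from zero as the observed slope."""
    rng = random.Random(seed)      # fixed seed so the result is reproducible
    observed = abs(slope(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)      # breaks any real link between X and Y
        if abs(slope(x, shuffled)) >= observed:
            extreme += 1
    return extreme / n_shuffles

# Hypothetical data with a strong linear relationship, invented for illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 3.8, 6.1, 8.2, 9.7, 12.4, 13.9, 16.2]

p = permutation_p_value(x, y)
print(p)
```

With data this strongly related, almost no shuffle reproduces a slope as steep as the observed one, so the estimated p-value falls well below .05.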
- The R-squared = the proportion of variation in the dependent variable (Y) explained by the independent variable (X).
- In bivariate regression analysis it is simply the square of the correlation coefficient (r).
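The equality between R-squared and the square of r in the bivariate case can be checked numerically. The sketch below computes R-squared as one minus the ratio of residual variation to total variation in Y, computes r separately from its definition, and compares the two. The data and the function name `r_squared_and_r` are invented for illustration.

```python
from math import sqrt

def r_squared_and_r(x, y):
    """Return (R-squared from the fitted line, Pearson's r) for one X and one Y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)   # total variation in Y

    b = s_xy / s_xx
    a = mean_y - b * mean_x
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation

    r_squared = 1 - ss_res / s_yy
    r = s_xy / sqrt(s_xx * s_yy)   # the N - 1 terms cancel, so sums of deviations suffice
    return r_squared, r

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [3.0, 4.5, 4.0, 7.5, 8.0, 9.5]

r_squared, r = r_squared_and_r(x, y)
print(r_squared, r ** 2)   # the two values agree up to floating-point rounding
```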
- Intercept (a)
- Slope (b)
- Predicted values of Y
- Residuals
- P-value for the slope
- R-squared