
 Correlation Coefficient (r)

 Regression Analysis
[Figures: scatterplots of Y against X; one marks a single observation (Xi, Yi) at its Xi and Yi axis positions.]
 The correlation coefficient is based on the covariance.
 For a sample, the covariance is calculated as:
$s_{xy} = \dfrac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{N - 1}$
 Interpretation: Covariance tells us how variation in one
variable “goes with” variation in another variable
(“covary”).
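 Two variables are uncorrelated (linearly unrelated)
when their covariance = 0. Statistically independent
variables always have a covariance of 0, but a
covariance of 0 does not by itself establish independence.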
 Positive relationships indicated by + value,
negative relationships by a – value.
 Problem with Covariance as a measure of
association?
 Correlation Coefficient (Pearson’s r)
 A way of standardizing the covariance.
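 $r_{xy} = \dfrac{s_{xy}}{s_x s_y}$, where $s_x$ and $s_y$ are the sample standard deviations of X and Y
 Interpretation: Measures the strength of a linear
relationship.
 $-1 \le r \le 1$
 X and Y are uncorrelated (no linear relationship) iff $r_{xy} = 0$; independence implies $r_{xy} = 0$, but not the reverse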
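The standardization step can be illustrated with a short sketch. The data below are hypothetical (not from the STATES file), and the function names are chosen only for this example. It computes the sample covariance and Pearson's r by hand, then rescales X from dollars to cents to show that the covariance changes with the units of measurement while r does not.

```python
import math

def sample_covariance(x, y):
    """Sample covariance s_xy with the N - 1 denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))  # variance is the covariance of X with itself
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Hypothetical data: X measured in dollars, Y in arbitrary units.
x_dollars = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
x_cents = [100 * v for v in x_dollars]  # same information, different units

cov_dollars = sample_covariance(x_dollars, y)
cov_cents = sample_covariance(x_cents, y)
r_dollars = pearson_r(x_dollars, y)
r_cents = pearson_r(x_cents, y)

print(cov_dollars, cov_cents)  # covariance is 100 times larger in cents
print(r_dollars, r_cents)      # r is the same in either unit
```

Because the size of the covariance depends on the units of X and Y, it is hard to read as a measure of strength; dividing by the two standard deviations removes the units.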
 What explains variation in the generosity of
state welfare expenditures? (STATES – 55)
 What explains variation in the generosity of
state welfare expenditures? (STATES – 1570)
 738 – Poverty rate
 739 – Median Family Income
 1644 - %Clinton
 1715 – Female State Legislators
 Regression is concerned with dependence of one variable
(the dependent variable, measured at the interval/ratio
level) on one or more other variables (independent
variables, measured at the interval, ratio, ordinal or
nominal levels).

 Bivariate vs. Multivariate regression analysis

 Y is used as the dependent variable and X as the independent variable.
 The correlation coefficient measures the
strength of a linear association between two
variables measured at the interval level

 In a scatterplot: the degree to which the points in the plot cluster around a “best-fitting” line
 The purpose of regression analysis is to
determine exactly what that line is (i.e. to
estimate the equation for the line)

 The regression line represents predicted values of Y based on the value of X
[Figures: scatterplots of Y against X with a best-fitting line; one observation is marked at its Xi and Yi axis positions.]
Yi = a + bXi
a = Intercept, or Constant = The value of Y when X = 0

b = Slope coefficient = The change (+ or -) in Y given a one unit increase in X
Yi = a + bXi + ei

Residual (ei): for every observation, the difference between the observed value of Y and the regression line (“prediction errors”)
 Using statistical calculations, for any relationship between
X and Y, we can determine the best-fitting line for the
relationship

 This means finding specific values for a and b for the regression equation

Yi = a + bXi + ei
 Regression analysis finds the line that
minimizes the sum of squared residuals

Yi = a + bXi + ei
 a = the expected value of Y when X=0
 b = the expected change in Y given a one
unit increase in X
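A minimal sketch of how a and b can be found, assuming the standard closed-form least-squares solution (the slides do not show the formulas, so the expressions for `b` and `a` below are supplied from general statistics, and the data are hypothetical). It also checks that moving the coefficients away from the fitted values increases the sum of squared residuals.

```python
def fit_line(x, y):
    """Return (a, b) that minimize the sum of squared residuals for Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the fitted line passes through the two means
    return a, b

def sum_squared_residuals(x, y, a, b):
    """Sum of squared differences between observed Y and a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_line(x, y)
best = sum_squared_residuals(x, y, a, b)
# Any other line, for example one with a nudged intercept and slope, fits worse.
worse = sum_squared_residuals(x, y, a + 0.1, b - 0.05)

print(a, b)
print(best, worse)
```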

Yi = a + bXi + ei
 We can calculate a predicted value for the
dependent variable for any value of X by
using the regression equation for the
regression line:
$\hat{Y}_i = a + bX_i$
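[Figures: the regression line plotted on X and Y axes, annotated in turn with the intercept, the slope (b) over one unit of X between Xi and Xj, and the residual ei between the observed and predicted Yi.]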
 Research Question: Did the butterfly
ballot result in an unusual number of
votes for Pat Buchanan in the 2000
election in Palm Beach Co.?
 Did it cost Al Gore the election?
 Unit of analysis – Fla. Counties (67)
 Dependent variable (Y) – vote for Buchanan
in 2000
 Independent variable (X) – vote for
Buchanan in 1996 Republican primary
[Figure: scatterplot of the Florida counties, each labeled by name. X axis: Buchanan Vote in 1996 (0 to 15,000). Y axis: vote for Buchanan in 2000 (0 to 4,000). Palm Beach is plotted well above the other counties.]
Y = 12.957 + .101(X)
Intercept (a) = 12.957, Slope (b) = .101

2000 vote = 12.957 + .101(1996 Vote)

 To generate a predicted value for Palm Beach in 2000, we
could simply plug in the appropriate X value in the regression
equation and solve for Y.

 Regression equation: 2000 vote = 12.957 + .101(1996 Vote)

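 In 1996, Buchanan received 8788 votes in Palm Beach. Our prediction for Palm Beach in 2000 based on this regression is thus:
12.957 + .101*8788 = 903.45
(With the rounded coefficients shown, this expression works out to about 900.5; the reported 903.45 likely reflects unrounded coefficients.)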

(What does this tell us?)
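The plug-in step can be sketched in a few lines. The function name is hypothetical, and the coefficients are the rounded values printed above, so the result is about 900.5, a few votes below the reported 903.45. The sketch also confirms the two interpretations of the coefficients: the prediction at X = 0 is the intercept, and one more 1996 vote raises the prediction by the slope.

```python
# Rounded coefficients as printed in the regression equation.
INTERCEPT = 12.957
SLOPE = 0.101

def predict_2000_vote(vote_1996):
    """Predicted 2000 Buchanan vote for a county, given its 1996 primary vote."""
    return INTERCEPT + SLOPE * vote_1996

palm_beach_predicted = predict_2000_vote(8788)
at_zero = predict_2000_vote(0)                             # equals the intercept
one_more = predict_2000_vote(8789) - palm_beach_predicted  # equals the slope

print(round(palm_beach_predicted, 3))
print(at_zero, round(one_more, 3))
```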


[Figure: scatterplot of the Florida counties (X axis: Buchanan Vote in 1996), annotated for Palm Beach. Actual vote for Buchanan in Palm Beach, 2000 election: 3,407. Predicted vote for Buchanan based on 1996 vote (from regression model): 903.45.]
 We can calculate the residual for any
observation by first calculating the predicted
value for Y, and then subtracting the
predicted value from the observed value of Y:
$e_i = Y_i - \hat{Y}_i$
 For any observation in our data, the residual
represents the “prediction error” for that
observation (based on the regression
equation)
$e_i = Y_i - \hat{Y}_i$
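As a small worked example, the residual for Palm Beach can be computed from the two values reported in the slides: the actual 2000 vote (3,407) and the predicted vote from the regression model (903.45). The function name is chosen only for this illustration.

```python
def residual(observed, predicted):
    """Prediction error: observed value of Y minus the predicted value of Y."""
    return observed - predicted

# Values reported in the slides for Palm Beach County.
actual_vote = 3407
predicted_vote = 903.45

palm_beach_residual = residual(actual_vote, predicted_vote)
print(round(palm_beach_residual, 2))
```

A positive residual means the county's observed vote lies above the regression line; here Buchanan received roughly 2,500 more votes than the model predicts.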
[Figure: scatterplot of the Florida counties (X axis: Buchanan Vote in 1996), annotated for Palm Beach. Actual vote: 3,407. Predicted vote from regression model: 903.45. Palm Beach residual: 3407 - 903.45 = 2503.55.]
 Testing for statistical significance for the slope

 The p-value: the probability of observing a sample slope at least as far from zero as the one we are observing in our sample IF THE NULL HYPOTHESIS (that the true slope is zero) IS TRUE

 P-values closer to zero are stronger evidence against the null hypothesis (.05 is usually the threshold for statistical significance)
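The definition of the p-value can be made concrete with a simulation. The sketch below uses a permutation test, which is a substitute chosen for illustration and not the t-test that regression software usually reports: it shuffles Y to imitate a world where the null hypothesis is true, then counts how often the shuffled slope is at least as far from zero as the observed one. The data, function names, number of shuffles, and seed are all assumptions made for this example.

```python
import random

def slope(x, y):
    """Least-squares slope of Y on X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Share of shuffled datasets with a slope at least as far from zero as the observed one."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any X-Y link, imitating the null hypothesis
        if abs(slope(x, shuffled)) >= observed:
            hits += 1
    # Add 1 to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical data with a strong upward trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2]

p = permutation_p_value(x, y)
print(p)
```

With a trend this strong, almost no shuffle produces a slope as large as the observed one, so the estimated p-value falls well below .05.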
 The R-squared = the proportion of variation
in the dependent variable (Y) explained by
the independent variable (X).

 In bivariate regression analysis it is simply the square of the correlation coefficient (r)
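A short check of this identity, using hypothetical data and function names chosen for the example: R-squared is computed as one minus the share of variation in Y left unexplained by the regression line, and then compared with the square of Pearson's r.

```python
import math

def r_squared(x, y):
    """Proportion of the variation in Y explained by a bivariate regression on X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    unexplained = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # squared residuals
    total = sum((yi - mean_y) ** 2 for yi in y)                         # total variation in Y
    return 1 - unexplained / total

def pearson_r(x, y):
    """Pearson's correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r2 = r_squared(x, y)
r = pearson_r(x, y)
print(r2, r ** 2)  # the two values agree up to floating-point rounding
```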
 Intercept (a)
 Slope (b)
 Predicted values of Y
 Residuals
 P-value for the slope
 R-squared