
SPS 580 Lecture 7

Data Mining Dummy


Variables notes
I. THE LINEARITY ASSUMPTION
A. It's called multiple LINEAR regression because Y is assumed to be a linear function of
each X variable: Y = a + B1(X1) + B2(X2) + B3(X3) . . .
a. We like linear models because they are an intuitive way to talk about the effects of
each variable (intervening, control) and the difference between the zero order and the partial
B. Violations of linear assumption: Curvilinearity
a. Look at it with the zero order relationship.
b. If the relationship is curvilinear, then the linear slope doesn't do as good a job of
predicting Y as some other alternatives.
c. In most cases the linear model is a pretty accurate predictor, so it's not usually the end
of the world.
d. We're about to learn a way to deal with the situation when there is a curvilinear
relationship.
C. Violations of linear assumption: Interactions
a. Look at them by examining the conditional slopes in the three-variable graph.
b. If there is an interaction, then the effect of an X variable on Y is not linear because
the magnitude of the slope DEPENDS on a third variable. In most cases interactions
are not significant.
c. But when they are, it IS the end of the world. You have to do the analysis separately
for the groups involved in the interaction or incorporate an interaction term in the
linear regression model.
d. We'll learn how to deal with them in a couple of weeks.
II. SUPPRESSOR EFFECT
a. Not a violation of linearity, rather an unusual outcome of causal analysis
Three variable path diagram

b. Happens when the SIGN of the indirect path B2 * B3 is opposite from the SIGN of
the direct path B1.
c. If this happens, then B1 > ZERO ORDER (in absolute value) and you get an estimate of a
suppression effect rather than explanation.
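The sign rule in (b) and (c) can be checked with a little arithmetic. This is only a sketch: it assumes standardized path coefficients, so that the zero order equals the direct path plus the product of the two indirect paths. The function name and all the numbers are made up for illustration.

```python
def zero_order(b1, b2, b3):
    """Zero order X1 -> Y = direct path B1 + indirect path B2 * B3.

    Assumes standardized path coefficients in a three-variable model.
    """
    return b1 + b2 * b3


# Explanation: direct and indirect paths have the same sign,
# so the partial (B1 = 0.20) is smaller than the zero order.
explained = zero_order(b1=0.20, b2=0.50, b3=0.40)

# Suppression: the indirect path is negative while the direct path is positive,
# so the zero order is SMALLER than the partial (B1 = 0.20).
suppressed = zero_order(b1=0.20, b2=0.50, b3=-0.30)

print("zero order, explanation case:", round(explained, 2))
print("zero order, suppression case:", round(suppressed, 2))
```

In the second case, controlling for X2 makes the X1 effect look bigger, not smaller.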
III. WHY IS CAUSAL ANALYSIS IMPORTANT?
Intervening variables often show points of policy input . . . Let's say you knew higher income
people were moving out of a neighborhood, and that they would often explain their reasons
for doing so in terms of neighborhood pessimism.
You want to reduce neighborhood turnover.
Income → Neighborhood Pessimism → Move out
??? → Neighborhood Pessimism

You can't do much about income, but you might be able to find things that cause pessimism
that you can affect.
IV. HOW TO FIND GOOD INTERVENING VARIABLES
X1 → Y (pretend data, not PQ)

How do you find variables such that, when you put them inside the causal chain X1 → Y, the
partial is less than the zero order?
A. Reflect on your own experience or talk to people
B. Literature: an article or a report
C. Data mining: there will be certain statistical relationships between X1, X2 and Y
V. DATA MINING . . . For an intervening variable to explain part/all of the X1 → Y
relationship, two conditions have to be met . . .

CONDITION 1: The explanatory variable X2 has a significant impact on Y

X2 → Y is significant (pretend data, not PQ)
Fear is a cause of pessimism
This is the intervening CAUSAL process. It comes from psychological theory, literature, and
observational studies; it is a reflection of social process. This is the reason you like stx
CONDITION 2: Groups that differ on the independent variable X1 differ on the
explanatory variable X2
X1 → X2 (pretend data, not PQ)
Income groups differ on Fear
In order for fear to be a reason income causes pessimism, higher income people have to be
less fearful than lower income people.
In order for X2 to explain X1 → Y, X1 has to be a cause of X2


VI. OUTCOME OF SUCCESSFUL DATA MINING


A. Mechanically, when you control for X2 the partial is lower than the zero order

X1 → Y controlling X2 (pretend data, not PQ)

In this case Partial = 0


B. Intuitively, in the X1 → Y relationship you think you're looking at groups that differ on X1.
But actually we're looking at groups that differ on X1 and also X2.
So you need to control for X2 to see the impact of X1 alone (Partial).
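The two conditions and the "Partial = 0" outcome can be illustrated with the standard two-predictor formula for a standardized partial slope, computed from the three zero order correlations. This is a sketch, not the course's procedure: the function name is hypothetical and the correlations are made up in the same spirit as the pretend data above.

```python
def partial_slope(r_y1, r_y2, r_12):
    """Standardized slope of X1 on Y, controlling for X2.

    r_y1: zero order correlation of X1 with Y
    r_y2: zero order correlation of X2 with Y   (Condition 1)
    r_12: zero order correlation of X1 with X2  (Condition 2)
    """
    return (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)


# Made-up numbers: income (X1), fear (X2), pessimism (Y).
# Both conditions hold and fear carries the whole income effect: partial is 0.
full_explanation = partial_slope(r_y1=-0.30, r_y2=0.60, r_12=-0.50)

# Condition 2 fails (income groups do not differ on fear):
# the partial is the same as the zero order, so nothing is explained.
condition_2_fails = partial_slope(r_y1=-0.30, r_y2=0.60, r_12=0.0)

print("partial, both conditions met:", round(full_explanation, 3))
print("partial, condition 2 fails:  ", round(condition_2_fails, 3))
```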
VII. SO HOW DO YOU DATA MINE FOR (OTHER) INTERVENING VARIABLES?
A. Get a list of candidate intervening variables from the same survey years . . .
1. Read a book in the past month -- readers less pessimistic
2. Frequency of using the local park in the past month -- park users less pessimistic
3. Employment status -- unemployed more pessimistic


(CODING EMPSTAT)

B. Recode the candidate variables, look at the xtabs to see if two conditions are met
1. First check X2 → Y to see if the explanatory variable actually causes pessimism

[Xtab annotations: doesn't make the cut; makes the cut weakly; makes the cut]

2. Then check X1 → X2 to see if income groups actually differ on it


[Xtab annotations: Park use might be OK; Unemp fails; Working v. Other seems important;
Weak; Sweet; Fails]

BOTTOM LINE: Go with LF status coded (1 = working, 0 = other)
Xtab results -- since Y = dichot(0,1)

Pessimism = .40 - .166 (Income) + .042 (Labor Force Status)
Slope for LF status is significant, but the impact of the control variable isn't very great
Worse yet . . . there might be an interaction effect
VIII. DEALING WITH CURVILINEARITY
A. Start by looking at how to deal with Ordinal (3+) variables
Education is a very important variable for a lot of public policy analysis
It's not really usable as an interval variable, not across the full range, and
not in the US context
But you don't want to lose the gradient; it is usually best to treat it as an
ordinal variable

B. Recode the variable into (k) ordinal categories . . . as shown above

Do a xtab or table of means -- depending on whether
Y is dichotomous (0,1) or interval (3+)
Look at the pattern in the data
Line goes down: higher education, lower pessimism
Not much difference between some college vs. HSG/trade
C. Don't think of the pattern in the data as a line
Think of the pattern in the data as (k-1) separate CONTRASTS
...

[WARNING DATA ANALYSIS METHOD AHEAD]


D. Think of each of the (k-1) contrasts as something that is measured with a (0,1)
dichotomous variable.
(0,1) dichotomies created this way are known as DUMMY VARIABLES
With (k) categories of education, we need (k-1) dummy variables to estimate the
available contrasts
The left-out category is called the reference category (in this case 0-11 yrs of education)
A dummy var measures the difference between the contrast category and the reference category
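The idea of (k-1) contrasts against a reference category can be sketched from a table of means. The proportions below are made up for illustration and are not the course data; the variable names are hypothetical.

```python
# Made-up proportion pessimistic in each education category (illustration only).
pessimism_by_education = {
    "0-11": 0.50,          # reference category
    "HSG/trade": 0.42,
    "some college": 0.40,
    "college grad": 0.30,
}

reference = "0-11"

# One contrast per non-reference category: k categories give k-1 contrasts.
# Each contrast is (category mean) minus (reference category mean).
contrasts = {
    category: round(value - pessimism_by_education[reference], 3)
    for category, value in pessimism_by_education.items()
    if category != reference
}

for category, difference in contrasts.items():
    print(f"{category} vs. {reference}: {difference:+.3f}")
```

With no other variables in the equation, each dummy variable's slope equals the corresponding difference in means.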
E. Creating (K-1) Dummy Variables To Analyze The Impact Of An Ordinal Variable
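* Dummy for HSG (education = 1) versus the reference category 0-11 (education = 0).
RECODE education (0=0) (1=1) (2=0) (3=0) (ELSE=9) INTO educHSG.
VARIABLE LABELS educHSG 'dummy var HSG vs 0-11'.
* Dummy for any college (education = 2) versus the reference category 0-11.
RECODE education (0=0) (1=0) (2=1) (3=0) (ELSE=9) INTO educANYCOLL.
VARIABLE LABELS educANYCOLL 'dummy any coll vs 0-11'.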

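* Dummy for college graduates (education = 3) versus the reference category 0-11.
RECODE education (0=0) (1=0) (2=0) (3=1) (ELSE=9) INTO educCOLLGRAD.
VARIABLE LABELS educCOLLGRAD 'dummy coll grad vs 0-11'.
* Code 9 marks cases with no valid education code on all three dummies.
MISSING VALUES educHSG educANYCOLL educCOLLGRAD (9).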

The result will be (k-1) variables, each of which codes the ENTIRE SAMPLE . . .

Regression works the same as before, except that instead of having one education variable
there are now 3 dummy variables measuring the effects of education. Whenever you estimate the
effect of education, put all (k-1) dummy vars in the regression equation together
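For readers working outside SPSS, the same recode logic can be sketched in Python. This assumes education is coded 0 = 0-11 yrs, 1 = HSG, 2 = any college, 3 = college grad, as the RECODE statements imply; the function name is hypothetical and None stands in for the missing-value code 9.

```python
def make_dummies(education):
    """Return (educHSG, educANYCOLL, educCOLLGRAD) for one respondent.

    Every valid case gets a 0 or 1 on ALL three dummies, so each dummy
    codes the entire sample. Invalid codes become None (missing).
    """
    if education not in (0, 1, 2, 3):
        return (None, None, None)
    return (int(education == 1), int(education == 2), int(education == 3))


# Made-up education codes; 8 stands for an invalid or missing code.
sample = [0, 1, 2, 3, 8]
rows = [make_dummies(code) for code in sample]

for code, row in zip(sample, rows):
    print(code, row)
```

The reference category (0-11 yrs) is the one row that is 0 on all three dummies.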
F. For the ZERO ORDER, there are now (k-1) slopes and t-tests
All 3 are significant
D1 and D2 are pretty similar to each other
G. To test Education as a control variable, enter ALL (k-1) dummy variables together in the
multiple regression equation along with income . . .

Income effect is reduced substantially
Education makes a pretty big difference as an explanatory variable

H. The prediction equation works the same way too
Regression equation . . .
Predicted avg(Y) = .513 - .130*(Income) - .077*(educHSG) - .093*(educANYCOLL) - .200*(educCOLLGRAD)

predicted values
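The predicted values follow from plugging each combination of income and education into the regression equation. The sketch below uses the coefficients from the equation above and assumes income is coded as a (0,1) dichotomy, which these notes do not state explicitly; the function and dictionary names are hypothetical.

```python
# Coefficients copied from the regression equation in these notes.
INTERCEPT = 0.513
INCOME_SLOPE = -0.130
EDUCATION_SLOPES = {
    "0-11 (reference)": 0.0,   # all three dummies are 0
    "educHSG": -0.077,
    "educANYCOLL": -0.093,
    "educCOLLGRAD": -0.200,
}


def predicted_pessimism(income, education):
    """Predicted avg(Y) for one income value and one education category.

    Only one education dummy can be 1 at a time, so only one
    education slope is added to the prediction.
    """
    return INTERCEPT + INCOME_SLOPE * income + EDUCATION_SLOPES[education]


# ASSUMPTION: income is a (0,1) dichotomy.
for income in (0, 1):
    for education in EDUCATION_SLOPES:
        value = predicted_pessimism(income, education)
        print(f"income={income}  {education:<18} {value:.3f}")
```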

IX. DUMMY VARIABLES ARE THE MAIN TECHNIQUE FOR DEALING WITH CURVILINEARITY
A. Example: Client = WBEZ, wants to target fundraising
Commission research to explore the extent to which Education → Listen to public radio
and reasons why this might be the case
B. ZERO ORDER RESULTS
The listenership variable is nominal (4 cat), so to proceed with causal
analysis, I'm going to recode it into a dichotomy


Curvilinear relationship . . .
. . . is significant: Chi sq(3) = 135, p < .04, phi = .204
Examine Contrasts
Minimal difference: HSG/Trade vs. 0-11
Small difference: Some College vs. 0-11
Large difference: College Grad vs. 0-11

One of the DUMMIES is not significant

Conclusion . . .

C. INTERVENING VARIABLE:
Theory . . . . . .

Education (X1) → Politically Independent (X2) → Listen to public radio

Recode X2 to Independent vs. other

REGRESSION ANALYSIS
Education effect is still curvilinear
Independence isn't significant
X. HOW TO SUMMARIZE THE ZERO ORDER AND PARTIAL EFFECTS OF AN
ORDINAL/NOMINAL VARIABLE MEASURED WITH DUMMY VARIABLES
Independence doesn't explain much of the relationship between
education and listenership

