Sps 580 Lecture 7 Data Mining Dummy Variables Notes

I. THE LINEARITY ASSUMPTION
A. It's called multiple LINEAR regression because Y is assumed to be a linear function of
each X variable: Y = a + B1(X1) + B2(X2) + B3(X3) + . . .
a. We like linear models because they give an intuitive way to talk about the effect of
each variable (intervening, control) and about the difference between the zero order and the partial.
B. Violations of the linearity assumption: Curvilinearity
a. Look for it in the zero order relationship.
b. If the relationship is curvilinear, then the linear slope doesn't do as good a job of
predicting Y as some other alternatives.
c. In most cases the linear model is still a pretty accurate predictor, so curvilinearity is not usually the end of the
world.
d. We're about to learn a way to deal with the situation when there is a curvilinear
relationship.
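A minimal Python sketch of point b, using made-up group means (pretend data, not from the lecture) and assuming the groups are the same size: fit one straight line through the category means and see how far the line misses each one.

```python
# Pretend data (not from the lecture): mean of Y at four levels of X,
# flat at first and then dropping sharply, i.e. a curvilinear pattern.
levels = [0, 1, 2, 3]
group_means = [0.50, 0.48, 0.45, 0.20]


def linear_fit(xs, ys):
    """Return the least-squares (intercept, slope) for a single X variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = linear_fit(levels, group_means)
line_predictions = [intercept + slope * x for x in levels]

# How far the straight line misses each group mean (observed - predicted).
misses = [m - p for m, p in zip(group_means, line_predictions)]

for x, m, p in zip(levels, group_means, line_predictions):
    print(f"X={x}: observed {m:.3f}, linear prediction {p:.3f}")
```

With these pretend numbers the line is too high at both ends and too low in the middle. Misses that change sign in a systematic way like this are what curvilinearity looks like.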
C. Violations of the linearity assumption: Interactions
a. Look for them by examining the conditional slopes in the three-variable graph.
b. If there is an interaction, then the effect of an X variable on Y is not linear because
the magnitude of the slope DEPENDS on a third variable. In most cases interactions
are not significant.
c. But when they are, it IS the end of the world. You have to do the analysis separately
for the groups involved in the interaction or incorporate an interaction term in the
linear regression model.
d. We'll learn how to deal with them in a couple of weeks.
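A small Python sketch of what "the slope DEPENDS on a third variable" means, using made-up cell means (pretend data) and assuming both X and the third variable Z are (0,1) dichotomies. The names X, Z and b1 to b3 are illustrative, not from the lecture.

```python
# Pretend cell means of Y (not from the lecture), keyed by (z, x),
# where X is the (0,1) independent variable and Z is a (0,1) third variable.
cell_means = {
    (0, 0): 0.40, (0, 1): 0.38,   # Z = 0: X barely matters
    (1, 0): 0.45, (1, 1): 0.20,   # Z = 1: X matters a lot
}


def conditional_slope(z):
    """Slope of Y on a (0,1) X within one level of Z: a difference of means."""
    return cell_means[(z, 1)] - cell_means[(z, 0)]


slope_when_z0 = conditional_slope(0)
slope_when_z1 = conditional_slope(1)

# Coefficients of Y = a + b1*X + b2*Z + b3*(X*Z) that reproduce the four cells.
a = cell_means[(0, 0)]
b1 = slope_when_z0
b2 = cell_means[(1, 0)] - cell_means[(0, 0)]
b3 = slope_when_z1 - slope_when_z0   # the interaction term

print(f"slope of X when Z=0: {slope_when_z0:+.2f}")
print(f"slope of X when Z=1: {slope_when_z1:+.2f}")
print(f"interaction term b3: {b3:+.2f}")
```

The interaction term is just the gap between the two conditional slopes. When that gap is zero, a single slope for X describes both groups and the linear model is fine.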
II. SUPPRESSOR EFFECT
a. Not a violation of linearity, but rather an unusual outcome of causal analysis.
[Figure: three-variable path diagram]
b. It happens when the SIGN of the indirect path B2 * B3 is opposite to the SIGN of
the direct path B1.
c. If this happens, then B1 > ZERO ORDER (in magnitude, in the usual case) and you get an estimate of a suppression
effect rather than explanation.
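The sign rule can be checked with arithmetic. The Python sketch below assumes the usual three-variable path decomposition, zero order = B1 + (B2 * B3), which these notes imply but do not write out. All numbers are made up.

```python
def zero_order(b1, b2, b3):
    """Zero order effect = direct path (b1) + indirect path (b2 * b3)."""
    return b1 + b2 * b3


def describe(b1, b2, b3):
    """Label the outcome by comparing the sign of the indirect path with b1."""
    indirect = b2 * b3
    if indirect == 0 or b1 == 0:
        return "no comparison"
    return "suppression" if (indirect > 0) != (b1 > 0) else "explanation"


# Made-up numbers. Explanation: direct and indirect paths are both negative.
explained = zero_order(-0.10, 0.50, -0.30)     # -0.10 + (-0.15)
# Suppression: direct path negative, indirect path positive.
suppressed = zero_order(-0.20, 0.50, 0.30)     # -0.20 + (+0.15)

print(describe(-0.10, 0.50, -0.30), round(explained, 2))
print(describe(-0.20, 0.50, 0.30), round(suppressed, 2))
```

In the second case the zero order (about -.05) understates the direct effect (-.20), so controlling for the intervening variable makes the effect of X1 look bigger, not smaller.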
III. WHY IS CAUSAL ANALYSIS IMPORTANT?
Intervening variables often show points of policy input . . . Let's say you knew higher income
people were moving out of a neighborhood, and that they would often explain their reasons
for doing so in terms of neighborhood pessimism.
You want to reduce neighborhood turnover.
[Path diagram: Income -> Neighborhood Pessimism -> Move out; ??? -> Neighborhood Pessimism]
You can't do much about income, but you might be able to find things that cause pessimism
that you can affect.
IV. HOW TO FIND GOOD INTERVENING VARIABLES
[Table: X1 -> Y (pretend data, not PQ)]
How do you find variables that, when you put them inside the causal chain X1 -> Y, make the partial
less than the zero order?
A. Reflect on your own experience or talk to people
B. Literature: an article or a report
C. Data mining: there will be certain statistical relationships between X1, X2 and Y
V. DATA MINING . . . For an intervening variable to explain part/all of the X1 -> Y
relationship, two conditions have to be met . . .
CONDITION 1: The explanatory variable X2 has a significant impact on Y
X2 -> Y is significant
[Table: X2 -> Y (pretend data, not PQ)]
Fear is a cause of pessimism
This is the intervening CAUSAL process. It comes from psychological theory, literature,
observational studies; it is a reflection of social process. This is the reason you like stx
CONDITION 2: Groups that differ on the independent variable X1 differ on the
explanatory variable X2
X1 -> X2
[Table: X1 -> X2 (pretend data, not PQ)]
Income groups differ on Fear
In order for fear to be a reason
income causes pessimism, higher income people have to be less fearful than lower income
people.
In order for X2 to explain X1 -> Y,
X1 has to be a cause of X2
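A minimal Python sketch of checking the two conditions from xtab-style counts. The counts are made up (pretend data, not PQ), the variable names are illustrative, and the sketch only compares percentages. It does not run a significance test.

```python
# Pretend counts (not from the lecture and not PQ data).
# Key: (income, fear) with 1 = high income / fearful, 0 = low income / not fearful.
# Value: (number pessimistic, number of people in the cell).
cells = {
    (0, 1): (42, 70), (0, 0): (6, 30),    # low income
    (1, 1): (18, 30), (1, 0): (14, 70),   # high income
}


def pct_pessimistic(fear):
    """Share pessimistic among everyone at one level of fear (X2 -> Y)."""
    pess = sum(p for (_, f), (p, _n) in cells.items() if f == fear)
    total = sum(n for (_, f), (_p, n) in cells.items() if f == fear)
    return pess / total


def pct_fearful(income):
    """Share fearful within one income group (X1 -> X2)."""
    fearful = cells[(income, 1)][1]
    total = cells[(income, 1)][1] + cells[(income, 0)][1]
    return fearful / total


condition_1 = pct_pessimistic(1) - pct_pessimistic(0)   # X2 -> Y
condition_2 = pct_fearful(1) - pct_fearful(0)           # X1 -> X2

print(f"Condition 1, fearful vs not fearful on pessimism: {condition_1:+.2f}")
print(f"Condition 2, high vs low income on fear: {condition_2:+.2f}")
```

Both differences are large in this pretend example, so fear would pass both checks as a candidate intervening variable.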
VI. OUTCOME OF SUCCESSFUL DATA MINING
A. Mechanically, when you control for X2 the partial is lower than the zero order
[Table: X1 -> Y controlling X2 (pretend data, not PQ); in this case Partial = 0]
B. Intuitively, in the X1 -> Y relationship you think you're looking at groups that differ on X1.
But actually we're looking at groups
that differ on X1 and also X2.
So you need to control for X2 to see
the impact of X1 alone (Partial).
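A Python sketch of the mechanical outcome, with made-up counts (pretend data, not PQ) built so that the partial comes out to 0. The zero order is the income difference in pessimism ignoring fear. The partial is the same difference taken separately within each level of fear.

```python
# Pretend counts (not from the lecture and not PQ data).
# Key: (income, fear) with 1 = high income / fearful, 0 = low income / not fearful.
# Value: (number pessimistic, number of people in the cell).
cells = {
    (0, 1): (42, 70), (0, 0): (6, 30),    # low income
    (1, 1): (18, 30), (1, 0): (14, 70),   # high income
}


def pct_pessimistic(income, fear=None):
    """Share pessimistic in an income group, optionally within one fear level."""
    keys = [(income, f) for f in (0, 1) if fear is None or f == fear]
    pess = sum(cells[k][0] for k in keys)
    total = sum(cells[k][1] for k in keys)
    return pess / total


# Zero order: high income vs low income, ignoring fear.
zero_order = pct_pessimistic(1) - pct_pessimistic(0)

# Partial: the same comparison, holding fear constant.
partials = {
    fear: pct_pessimistic(1, fear) - pct_pessimistic(0, fear)
    for fear in (0, 1)
}

print(f"zero order: {zero_order:+.2f}")
for fear, diff in partials.items():
    print(f"partial within fear={fear}: {diff:+.2f}")
```

The income groups differ on pessimism only because they differ on fear. Once fear is held constant, the income difference disappears.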
VII. SO HOW DO YOU DATA MINE FOR (OTHER) INTERVENING VARIABLES?
A. Get a list of candidate intervening variables from the same survey years . . .
1. Read a book in the past month -- readers less pessimistic
2. Frequency of using the local park in the past month -- park users less pessimistic
3. Employment status -- unemployed more pessimistic
[Table: coding of EMPSTAT]
B. Recode the candidate variables, then look at the xtabs
to see if the two conditions are met
1. First check X2 -> Y to see if the explanatory variable actually causes pessimism
[Xtab annotations: doesn't make the cut; makes the cut weakly; makes the cut]
2. Then check X1 -> X2 to see if income groups actually differ on it
[Xtab annotations: Park use might be OK; Unemp fails; Working v. Other seems important; Weak; Sweet; Fails]
BOTTOM LINE: Go with LF status
coded (1 = working, 0 = other)
Xtab results -- since Y = dichot (0,1)
Pessimism = .40 - .166 (Income) + .042 (Labor Force Status)
The slope for LF status is significant,
but the impact of the control variable
isn't very great.
Worse yet . . . there might be an
interaction effect
VIII. DEALING WITH CURVILINEARITY
A. Start by looking at how to deal with Ordinal (3+) variables
Education is a very important
variable for a lot of public policy
analysis.
It's not really usable as an interval
variable, not across the full range, and
not in the US context.
But you don't want to lose the
gradient, so it is usually best to treat it as an
ordinal variable.
B. Recode the variable into (k) ordinal categories . . . as shown above
Do an xtab or a table of means -- depending on whether
Y is dichotomous (0,1) or interval (3+)
Look at the pattern in the data:
the line goes down: higher education, lower
pessimism
not much difference between some college and HSG/trade
C. Don't think of the pattern in the data as a line.
Think of the pattern in the data
as (k-1) separate CONTRASTS
...
[WARNING DATA ANALYSIS METHOD AHEAD]

D. Think of each of the (k-1) contrasts as something that is measured with a (0,1)
dichotomous variable.
(0,1) dichotomies created this way are
known as DUMMY VARIABLES.
With (k) categories of education, we need
(k-1) dummy variables to estimate the
available contrasts.
The left-out category is called the reference category (in this case 0-11 yrs of education).
A dummy variable measures the difference between the contrast category and the reference category.
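A short Python sketch of the contrast idea, using made-up proportions pessimistic in each education category (pretend numbers, not the lecture's results). With no other variables in the equation, each dummy's slope is its category mean minus the reference category mean.

```python
# Pretend proportions pessimistic by education (not the lecture's numbers).
REFERENCE = "0-11 yrs"
pessimism_by_education = {
    "0-11 yrs": 0.50,
    "HSG/trade": 0.42,
    "some college": 0.40,
    "college grad": 0.28,
}

reference_mean = pessimism_by_education[REFERENCE]

# One contrast per non-reference category: k categories give k-1 contrasts.
contrasts = {
    category: mean - reference_mean
    for category, mean in pessimism_by_education.items()
    if category != REFERENCE
}

for category, difference in contrasts.items():
    print(f"{category} vs {REFERENCE}: {difference:+.2f}")
```

Nothing forces the three contrasts to be evenly spaced, which is why this approach can follow a curvilinear pattern that a single slope cannot.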
E. Creating (K-1) Dummy Variables To Analyze The Impact Of An Ordinal Variable
RECODE education (0=0) (1=1) (2=0) (3=0) (ELSE=9) INTO educHSG.
VARIABLE LABELS educHSG 'dummy var HSG vs 0-11'.
RECODE education (0=0) (1=0) (2=1) (3=0) (ELSE=9) INTO educANYCOLL.
VARIABLE LABELS educANYCOLL 'dummy any coll vs 0-11'.
RECODE education (0=0) (1=0) (2=0) (3=1) (ELSE=9) INTO educCOLLGRAD.
VARIABLE LABELS educCOLLGRAD 'dummy coll grad vs 0-11'.
MISSING VALUES educHSG educANYCOLL educCOLLGRAD (9).
The result will be (k-1) variables, each of which codes the ENTIRE SAMPLE . . .
Regression works the same as before, except that instead of having one education variable there are
now 3 dummy variables measuring the effects of education. Whenever you estimate the effect of
education, put all (k-1) dummy vars in the regression equation together.
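For readers who want to see the same coding logic outside SPSS, here is a rough Python equivalent of the three RECODE commands. The function name make_dummies is hypothetical. The education codes 0 to 3 and the missing value 9 follow the syntax above.

```python
MISSING = 9

# Which education code turns each dummy on (0 = 0-11 yrs is the reference).
DUMMY_TARGETS = {"educHSG": 1, "educANYCOLL": 2, "educCOLLGRAD": 3}


def make_dummies(education):
    """Return the (k-1) dummy values for one respondent's education code."""
    if education not in (0, 1, 2, 3):
        # Mirrors (ELSE=9): anything outside the valid codes is missing.
        return {name: MISSING for name in DUMMY_TARGETS}
    return {
        name: int(education == target)
        for name, target in DUMMY_TARGETS.items()
    }


for code in (0, 1, 2, 3, 7):
    print(code, make_dummies(code))
```

Every respondent gets a value on all three dummies. The reference category is the only one coded 0 on all of them, and no valid respondent is coded 1 on more than one.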
F. For the ZERO ORDER, there are now (k-1) slopes and (k-1) t-tests
All 3 are significant
D1 and D2 are pretty similar to each other
G. To test Education as a control variable, enter ALL (k-1) dummy variables together in the
multiple regression equation along with income . . .
The income effect is reduced substantially:
education makes a pretty big difference as an explanatory variable
H. The prediction equation works the same way too
Regression equation . . .
Predicted avg(Y) = .513 - .130*(Income) - .077*(educHSG) - .093*(educANYCOLL) - .200*(educCOLLGRAD)
[Table: predicted values]
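The predicted values can be worked out by plugging 0s and 1s into the equation. The Python sketch below uses the coefficients reported above and assumes Income is coded as a (0,1) dichotomy, which the notes do not state. The function name is hypothetical.

```python
INTERCEPT = 0.513
SLOPES = {
    "Income": -0.130,
    "educHSG": -0.077,
    "educANYCOLL": -0.093,
    "educCOLLGRAD": -0.200,
}

# Education codes from the RECODE step: which dummy (if any) is switched on.
DUMMY_FOR_CODE = {0: None, 1: "educHSG", 2: "educANYCOLL", 3: "educCOLLGRAD"}


def predicted_pessimism(income, education_code):
    """Predicted avg(Y) for one income value (assumed 0 or 1) and education code."""
    value = INTERCEPT + SLOPES["Income"] * income
    dummy = DUMMY_FOR_CODE[education_code]
    if dummy is not None:   # reference category (0-11 yrs) adds nothing
        value += SLOPES[dummy]
    return value


for income in (0, 1):
    for code in (0, 1, 2, 3):
        print(f"income={income} education={code}: "
              f"{predicted_pessimism(income, code):.3f}")
```

Only one education dummy is ever switched on for a given person, so each prediction uses the intercept, the income term, and at most one education slope.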
IX. DUMMY VARIABLES ARE THE MAIN TECHNIQUE FOR DEALING WITH CURVILINEARITY
A. Example: Client = WBEZ, which wants to target fundraising
Commission research to explore the extent to which Education -> Listening to public radio,
and reasons why this might be the case
B. ZERO ORDER RESULTS
The listenership variable is nominal
(4 cat), so to proceed with causal
analysis, I'm going to recode it into a
dichotomy
Curvilinear relationship . . .
. . . is significant: Chi sq(3) = 135, p < .04, phi = .204
Examine Contrasts
Minimal difference HSG/Trade vs. 0-11
Small difference Some College vs. 0-11
Large difference College Grad vs. 0-11
One of the DUMMIES is not significant
Conclusion . . .
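A Python sketch of how a chi-square with 3 degrees of freedom and a phi come out of a 4 x 2 xtab (four education categories by listener vs. non-listener). The counts are made up and are not the WBEZ data. The sketch assumes phi is computed as the square root of chi-square divided by N, which the notes do not spell out.

```python
import math

# Pretend counts (not the lecture's data).
# Rows: 0-11, HSG/trade, some college, college grad.
# Columns: (non-listener, listener).
table = [[180, 20], [270, 30], [255, 45], [210, 90]]


def chi_square(table):
    """Return (chi-square, N) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


chi2, n = chi_square(table)
df = (len(table) - 1) * (len(table[0]) - 1)
phi = math.sqrt(chi2 / n)   # assumed formula for phi

# Under the same assumed formula, the reported Chi sq = 135 and phi = .204
# would imply a sample size of about chi-square / phi squared.
implied_n = 135 / 0.204 ** 2

print(f"chi-square({df}) = {chi2:.1f}, N = {n}, phi = {phi:.3f}")
print(f"implied N for the reported results: about {implied_n:.0f}")
```

The chi-square says whether education and listening are related at all. The (k-1) contrasts are what show where along the education range the difference actually is.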
C. INTERVENING VARIABLE:
Theory . . . . . .
Education (X1) -> Politically Independent (X2) -> Listen to public radio
recode X2 to Independent vs. other
REGRESSION ANALYSIS
Education effect is still curvilinear
Independence isn't significant
X. HOW TO SUMMARIZE THE ZERO ORDER AND PARTIAL EFFECTS OF AN
ORDINAL/NOMINAL VARIABLE MEASURED WITH DUMMY VARIABLES
Independence doesn't explain much of the relationship between
education and listenership