POMPEU FABRA UNIVERSITY - DEPARTMENT OF POLITICAL AND SOCIAL SCIENCES

Master Social studies and demography 2015 Techniques Statistical Analysis 1 Group 2
Gerrit Kreffer Final assignment - December 2015
_______________________________________________________________________________________________

1. Our model
We start with the model developed for the midterm assignment, but this time we use the new ESS
round 7 data (ESS7) instead of ESS1, and add an extra question to the scale. We rescale our
dependent variable antimi to range from 0 to 10.
[Figure: Opinion of Dutch nationals on the entry of immigrants in the ESS 2014. Vertical axis:
percent (ticks from 5 to 35). Categories: allow many, allow some, allow a few, allow none.]

To explain this variation we start with the variables from our midterm model that are still
available in the ESS7. With forward and backward selection we find that the coefficients of religious
attendance and income are no longer significant with this dataset. We leave them out but are still not
satisfied with the low predictive value of the model (R-squared 0.087). Going back to the literature we
find that it is not so much the individual position of the respondents that explains the anti-attitudes,
but rather the respondents' view of the consequences for society. We therefore add imbgeco (attitude
on whether immigration is bad or good for the economy; continuous scale from 0, bad, to 10, good for
the economy) and aesfdrk (answer to the question whether the respondent is afraid to go out in the
dark, which tells us about feelings outside the home; categorical, 4 values from very safe to very
unsafe). With the regress command we estimate our changed regression model. The newly added
variables prove significant and make age and happiness insignificant. By exchanging these variables
we are now able to explain 33% of the variation. Our model:
antimi = 2.73 + 0.08 eduyrs - 0.12 lrscale + 0.66 imbgeco - 0.39 aesfdrk.

2. Misspecification
To detect misspecification we run the Ramsey regression specification error test (RESET) for omitted
variables. The significant F statistic we find suggests some sort of functional form problem. Two
possible sources of misspecification are the omission of relevant variables, causing biased
parameters, and non-linearity/heteroscedasticity (biased standard errors).
The skewness of the distribution of y could cause this problem. Our y has a skewness of -0.30 and a
kurtosis of 2.49. The sktest on antimi indicates a significant lack of normality. Because we have a
large sample, we cannot depend on the significance test alone; we also need to look at the actual size
of the skewness and kurtosis, as well as at a histogram. We see that the distribution of y is not normal,
which is partly an effect of the scale we have built, and we see some extreme observations on both
sides. Comparing the normal curve with the density curve (the latter estimating the population
distribution), the score 7 stands out in particular, but so do the scores that correspond to the original
answer categories of the antimi questions (0, 3 and 10).

To correct for the dependent variable's skewness we transform the dependent variable with a semi-log
transformation. The logarithmic transformation compresses the variable's scale and should result in
a more symmetric variable. The corrected model has much smaller coefficients and standard errors:
lnant = 1.14 + 0.01 eduyrs - 0.02 lrscale + 0.12 imbgeco - 0.06 aesfdrk.
This is because the transformed dependent variable now has a smaller range of values (from 0 to 2.3).
The t tests for the coefficients are still significant.
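
As a rough check on the scale of the transformed variable and on how semi-log coefficients can be read, the following Python sketch may help. It assumes the natural logarithm was used (the upper bound of 2.3 matches ln 10) and applies the standard rule that a coefficient b in a log-linear model corresponds to a change of 100*(exp(b) - 1) percent in the untransformed outcome per unit of the predictor. The function name pct_change is illustrative only.

```python
import math

# Upper end of the transformed scale, assuming a natural logarithm: ln(10).
upper = math.log(10)

# Two coefficients of the semi-log model as reported in the text.
b_imbgeco = 0.12
b_eduyrs = 0.01

def pct_change(b):
    """Percent change in the untransformed outcome per one-unit change in x."""
    return 100 * (math.exp(b) - 1)

print(round(upper, 2))                  # top of the lnant scale
print(round(pct_change(b_imbgeco), 1))  # effect of one point on imbgeco, in percent
print(round(pct_change(b_eduyrs), 1))   # effect of one year of education, in percent
```

Under these assumptions, one extra point on imbgeco goes with a score that is roughly 13% higher on the original antimi scale, and one extra year of education with a score that is roughly 1% higher.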

3. Heteroscedasticity
We plot the residuals of our corrected model against the continuous independent variable eduyrs. The
variance of the residuals varies with years of education: the variation in attitudes regarding
immigration diminishes somewhat going from 10 to 19 years of education. Eduyrs=20 stands for 20 or
more years of education, which leads to some clustering on the right.

[Figure: Residual plot and fitted line for years of education completed. Horizontal axis: years of
full-time education completed (10 to 20). Series: residuals, fitted values.]

We perform the White test for the H0 that there is homoscedasticity. An alternative is the
Breusch-Pagan test, but that test assumes that the error terms are normally distributed and the White
test does not, so we choose the White test. It calculates a statistic W = nR^2, which follows a chi2
distribution. We get a high chi2 of 80.7 (significant), which tells us that the variance of the residuals
is not constant. There is unrestricted heteroscedasticity.
A solution to heteroscedasticity is to use robust standard errors. We can use them here because we
have a large sample. Estimating with the robust option gives the same coefficients for our model, but
the t-values testing their significance change. However, all coefficients are still significantly different
from zero. Robust standard errors are normally larger than the usual OLS standard errors. In our case
the standard errors for the constant, eduyrs and imbgeco were a bit higher, for lrscale more than
double, and for aesfdrk slightly lower. Of course we prefer lower standard errors, but thanks to the
heteroscedasticity-robust procedure we can maintain our model:
lnant = 1.14 + 0.01 eduyrs - 0.02 lrscale + 0.12 imbgeco - 0.06 aesfdrk.

4. Logistic regression
Exploring useful variables and associations for our model we find that political interest is correlated
with antimi. People against immigration tend to say they are not interested in politics, sometimes
meaning they are not interested in (Dutch) politics (which is mainly pro-immigration and opposed to
the anti-immigration party PVV). We use intpolit as a binary dependent variable. It has the following
variation: 70.6% are interested in politics (score 1) and 29.4% are not (score 0). In other words, in our
sample the odds that a respondent is interested in politics are 2.4: for every person not interested in
politics there are on average 2.4 persons who are. As independent variables in our logistic regression
analysis we use the number of years of education eduyrs (mean 13.9 years, ranging from 6 to 20) and
the dummy male (46.4% men, score 1, and 53.6% women, score 0, in our sample).
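
The step from shares to odds can be reproduced with a few lines of Python. This is only a sketch of the arithmetic using the two percentages quoted above; the helper odds_to_prob is an illustrative name.

```python
# Sample shares quoted in the text.
p_interested = 0.706
p_not_interested = 0.294

# Odds: probability of the event divided by probability of the non-event.
odds = p_interested / p_not_interested

def odds_to_prob(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

print(round(odds, 2))                # odds of being interested
print(round(odds_to_prob(odds), 3))  # recovers the share of interested respondents
```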
This time we use logistic regression, a type of regression that is used for binary outcome variables, to
estimate our new regression model. This overcomes problems we would encounter with OLS
regression, such as predicted probabilities outside the range from 0 to 1. Predicting the logit of
intpolit is a way of predicting the probability of being interested in politics: the logit is the logarithm
of the odds, and the odds are the ratio of the probability of being interested to the probability of not
being interested.
intpolit            Coef.   Std. Err.        z    P>|z|    [95% Conf. Interval]

male
  male           .5127305    .1167181     4.39    0.000     .2839672    .7414937
eduyrs           .1249315     .016624     7.52    0.000      .092349     .157514
_cons           -1.034593    .2302337    -4.49    0.000    -1.485842   -.5833429

The coefficients we get from logit, which estimate the logit score, are difficult to interpret directly.
In our case we find that the coefficients for the constant, male and eduyrs are statistically significant.
The coefficients of eduyrs and male are positive, which means that the log-odds of intpolit are higher
for men and go up with more years of education, holding the other predictor constant. To calculate the
probability of being politically interested for a man with six years of education we apply the inverse
logit to the outcome of -1.034593 + 0.5127305*1 + 0.1249315*6 = 0.228 and get a probability of about
56%. (Applying the standard normal distribution function to this value, as display normal does, would
give 59%, but that corresponds to the probit link and not to the logit link.)
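
The calculation for a man with six years of education can be checked with the short Python sketch below, which uses only the coefficients from the logit table. For a logit model the linear predictor is turned into a probability with the inverse logit (logistic) function; the standard normal distribution function is included only for comparison, because it belongs to the probit model and gives a somewhat higher value (about 0.59) for the same linear predictor. The function names are illustrative.

```python
import math

# Logit coefficients as reported in the coefficient table.
b_cons, b_male, b_eduyrs = -1.034593, 0.5127305, 0.1249315

def inv_logit(x):
    """Logistic function: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def std_normal_cdf(x):
    """Standard normal distribution function (the probit link, for comparison)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Linear predictor (log-odds) for a man (male = 1) with six years of education.
xb = b_cons + b_male * 1 + b_eduyrs * 6

p_logit = inv_logit(xb)        # probability implied by the logit model
p_normal = std_normal_cdf(xb)  # what the normal CDF would give for the same value

print(round(xb, 4), round(p_logit, 3), round(p_normal, 3))
```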
To be able to say more after this regression we exponentiate the coefficients to get the odds ratios.
intpolit       Odds Ratio   Std. Err.        z    P>|z|    [95% Conf. Interval]

male
  male           1.669844    .1949011     4.39    0.000     1.328389    2.099069
eduyrs           1.133071    .0188362     7.52    0.000     1.096748    1.170597
_cons            .3553711    .0818184    -4.49    0.000     .2263117    .5580298

All the odds ratios are significant. The value 0.36 for the constant expresses the odds of women
with (hypothetical) 0 years of education being politically interested: for every 100 women with 0
eduyrs who have no political interest there are about 36 who are politically interested.
The odds ratio 1.67 for male expresses the odds of men relative to the odds of women, at any given
number of years of education, of being interested. The odds of being interested are 67% higher for
men than for women.
The odds ratio 1.13 for eduyrs expresses the odds of individuals with a given number of years of
education relative to the odds of individuals with one year less of education. With every one-year
increase in eduyrs there is a 13% increase in the odds of being interested.
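
The odds ratios are the exponentiated logit coefficients, which can be verified with this Python sketch. It uses only the coefficients from the logit table; the dictionary names are illustrative.

```python
import math

# Logit coefficients as reported in the coefficient table.
coefs = {"male": 0.5127305, "eduyrs": 0.1249315, "_cons": -1.034593}

# Exponentiating a log-odds coefficient gives the corresponding odds ratio
# (for the constant: the baseline odds).
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

# Percent change in the odds for one extra year of education.
pct_change_eduyrs = 100 * (odds_ratios["eduyrs"] - 1)

for name, value in odds_ratios.items():
    print(name, round(value, 4))
print(round(pct_change_eduyrs, 1))
```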
We do a goodness-of-fit test for this model, a test of the null hypothesis that the model truly holds in
the population, measuring the difference between observed responses and the expected values. With
the commands estat ic and fitstat we get a selection of measures of fit for the logit of intpolit. There
are basically three ways of assessing fit: the likelihood ratio, pseudo R2 measures and Bayesian
measures. With our binary dependent variable intpolit we do not really have the individual-level
variation that R2 refers to, so we leave the pseudo R2 measures aside. The Bayesian method is less
accepted. Thus we choose the likelihood ratio as our goodness-of-fit test. We get a likelihood ratio
LR chi2 of 84.41, which is significant; our full model can be preferred over the null model with no
predictors because the full model is more likely. The model we estimated and found significant is:
logit(prob(intpolit=1))= -1.03 + 0.12 eduyrs + 0.51 male
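
To see how strongly the likelihood ratio test rejects the null model, the p-value can be computed by hand. The sketch below assumes 2 degrees of freedom, because the full model has two predictors (eduyrs and male) more than the constant-only model. For a chi2 distribution with 2 degrees of freedom the upper-tail probability has the closed form exp(-x/2), so no statistics library is needed. The function name is illustrative.

```python
import math

lr_chi2 = 84.41  # likelihood ratio statistic reported in the text
df = 2           # assumption: two predictors added to the constant-only model

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square distribution with exactly 2 df."""
    return math.exp(-x / 2.0)

p_value = chi2_sf_df2(lr_chi2)
print(p_value)
```

The resulting p-value is far below any conventional significance level, which is why statistical software displays it as 0.0000.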

5. Interaction effects
We add the variable bigood to the model developed in paragraph 4. We use our imbgeco variable
(see paragraph 1) to generate it. Bigood is a categorical variable with 3 values and appears in the
analysis as two dummies, with the first category as reference. Bigood has the value 1 if the
respondent thinks immigration is bad for the economy (36% have this opinion), 2 if the respondent
thinks it is neither good nor bad (26%) and 3 if the respondent thinks it is good for the economy
(38%). Estimating our model with the command logit, or we find that with the addition of bigood all
the variables and the model as a whole remain significant.
intpolit          Odds Ratio   Std. Err.        z    P>|z|    [95% Conf. Interval]

eduyrs              1.117493    .0191032     6.50    0.000     1.080672    1.155569
male
  male              1.664226    .1974605     4.29    0.000     1.318916    2.099943
bigood
  immnotgood~d      2.089687     .308097     5.00    0.000     1.565247    2.789843
  immgood           2.166213    .2956633     5.66    0.000     1.657761    2.830612
_cons               .2732432    .0655394    -5.41    0.000     .1707585    .4372366

Looking at the odds ratios, the constant is now estimated at 0.27. These are the odds that a woman
with 0 years of education who thinks immigration is bad for the economy will be interested in politics.
These base odds are lower than in our simpler model in paragraph 4, as people who think immigration
is bad for the economy are markedly less interested in politics. With all the odds ratios greater than 1,
all other categories are more interested. The odds ratio 2.17 for immgood expresses the odds of people
thinking immigration is good for the economy relative to the odds of people thinking it is bad, holding
male and eduyrs constant.
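
Because odds ratios act multiplicatively on the base odds, the odds for another category follow by multiplication. The Python sketch below takes the constant and the immgood odds ratio from the table and converts both sets of odds to probabilities for a woman with (hypothetical) 0 years of education. The helper name odds_to_prob is illustrative.

```python
# Values taken from the odds ratio table of the model with bigood.
base_odds = 0.2732432   # woman, 0 eduyrs, thinks immigration is bad for the economy
or_immgood = 2.166213   # odds ratio: immigration good vs. bad for the economy

def odds_to_prob(o):
    """Convert odds to a probability."""
    return o / (1 + o)

# Odds for the same woman if she thinks immigration is good for the economy.
odds_immgood = base_odds * or_immgood

p_bad = odds_to_prob(base_odds)
p_good = odds_to_prob(odds_immgood)
print(round(p_bad, 3), round(odds_immgood, 3), round(p_good, 3))
```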
We test for interaction by adding the interaction effect between the categorical variable bigood and
eduyrs to our model. Looking at the z statistics of the interaction terms, there is no significant
interaction between eduyrs and immgood or between eduyrs and immnotgoodnotbad. We also test
for an interaction effect between bigood and eduyrs using the log-likelihood ratio. Testing the H0 that
the original model a, which is nested in model b with the interaction, fits as well as model b, we
cannot reject H0. We leave this interaction out of our model.
We also test for interaction by adding the interaction effect between the categorical variable bigood
and male to our model, and reach the same conclusions. There is no significant interaction between
male and immgood or between male and immnotgoodnotbad. Testing the same H0 for model a against
model b with this interaction, using the log-likelihood ratio, we cannot reject H0.
In the following graphs we plot the probabilities that the possible combinations of bigood and male
are politically interested, with and without the interaction effect. The graphs show a possible
interaction between the two predictors, but the confidence intervals overlap, and this does not lead
us to include the interaction terms in our model.
[Figure, two panels. Panel 1: Predictive Margins of bigood#male with 95% CIs; vertical axis
Pr(Intpolit), .5 to .9; horizontal axis gender is male (female, male); series immbad,
immnotgoodnotbad, immgood. Panel 2: Adjusted Predictions of male#bigood with 95% CIs; vertical
axis Pr(Intpolit), .5 to .9; horizontal axis immigration good for economy (immbad,
immnotgoodnotbad, immgood); series female, male.]

Keeping eduyrs constant, the probability that men are politically interested is significantly higher than
the probability for women, except in the immnotgoodnotbad category.
The probability of being interested in politics is significantly lower for people who think that
immigration is bad for the economy than for the notgoodnotbad and good categories. The differences
we see in the probabilities of these last two groups are not significant.

6. Probit
In this paragraph we estimate a probit regression model. In the logistic regression model, we
predicted the probability of being politically interested through a linear combination of different
independent variables. To ensure that the predicted probabilities remained between the limits of 0 and
1, the probability of being interested underwent a logit transformation.
An alternative is the probit transformation used in probit models. We can interpret the estimated
coefficients in the same way as in logistic regression, except that now it is the value of the inverse
distribution function of the standard normal distribution (the z score), instead of the log-odds, that
changes by the size of the coefficient for each one-unit change in the corresponding independent
variable.
After running probit with the variables used in paragraph 4, we find a likelihood ratio chi2 of 84.85
with a p-value of 0.0000, telling us that the model as a whole is statistically significant, that is, it fits
significantly better than the model with no predictors.
intpolit            Coef.   Std. Err.        z    P>|z|    [95% Conf. Interval]

eduyrs           .0748023    .0097982     7.63    0.000     .0555981    .0940065
male
  male            .310493    .0693722     4.48    0.000      .174526      .44646
_cons           -.6101249    .1381459    -4.42    0.000     -.880886   -.3393639

All coefficients are significantly different from zero. We get coefficients which are more or less the
coefficients we got from logit divided by 1.66: eduyrs 0.075 (logit 0.12), male 0.31 (logit 0.51). The
probit regression coefficients give the predicted change in the probit or z score for a one-unit change
in the predictor. The constant is -0.61 (the logit constant was -1.03). This is the predicted probit or z
score of women with zero years of education. Display normal gives a probability of 27% that they are
politically interested. This is comparable with the odds we found for this category in paragraph 4
(0.36, which corresponds to a probability of about 26%).
For a one-unit increase in eduyrs, there is a 0.07 increase in the z score of the latent variable
governing the probability of being interested in politics, net of the effect of male. Being male instead
of female increases this z score by 0.31.
To calculate the probability of being politically interested for a man with six years of education we
display normal the outcome of -.6101249+(0.310493*1)+(0.0748023*6) and get a probability of 56%.
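
The two probabilities and the ratio between the logit and probit coefficients can be reproduced with the Python sketch below. It uses the coefficients from the two coefficient tables and builds the standard normal distribution function from math.erf, which plays the role of Stata's normal() here. The variable names are illustrative.

```python
import math

# Probit coefficients as reported in the probit table.
b_cons, b_male, b_eduyrs = -0.6101249, 0.310493, 0.0748023

def std_normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Woman with zero years of education: only the constant contributes.
p_woman_0 = std_normal_cdf(b_cons)

# Man with six years of education.
p_man_6 = std_normal_cdf(b_cons + b_male * 1 + b_eduyrs * 6)

# Ratio of the logit coefficients (paragraph 4) to the probit coefficients.
logit = {"_cons": -1.034593, "male": 0.5127305, "eduyrs": 0.1249315}
probit = {"_cons": b_cons, "male": b_male, "eduyrs": b_eduyrs}
ratios = {name: logit[name] / probit[name] for name in logit}

print(round(p_woman_0, 3), round(p_man_6, 3))
print({name: round(r, 2) for name, r in ratios.items()})
```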
We estimate the population-averaged marginal effect of the independent interval variable (covariate)
eduyrs at the two values of the independent categorical (factor) variable male, 0 = female and 1=
male.
. margins, dydx(c.eduyrs) at(male=(0 100)) vsquish
Average marginal effects                          Number of obs = 1,551
Model VCE    : OIM
Expression   : Pr(intpolit), predict()
dy/dx w.r.t. : eduyrs
1._at        : male = 0
2._at        : male = 100

                         Delta-method
                dy/dx    Std. Err.        z    P>|z|    [95% Conf. Interval]

eduyrs
  _at
    1         .026478    .0032523     8.14    0.000     .0201036    .0328523
    2        .0223844    .0029115     7.69    0.000      .016678    .0280909

The average marginal effect of eduyrs is 0.026 for women and 0.022 for men. That means that for
women each additional year of education increases the probability of being politically interested by
2.6 percentage points (0.026) on average, and that for men each additional year of education
increases this probability by 2.2 percentage points (0.022) on average. Though the effect for men is
smaller, they start at 0 eduyrs with a higher probability of being interested than women and stay
ahead, although the distance becomes smaller with more years of education and there is some
overlap of the 95% confidence intervals. See the following graphs.
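
In a probit model the marginal effect of a continuous predictor at a given point is the standard normal density at the linear predictor times the coefficient. The Python sketch below evaluates this at the sample mean of eduyrs (13.9, as quoted in paragraph 4). Note the assumption: this is a marginal effect at the mean, not the average marginal effect that margins computes over all observations, so the values should be close to the reported ones but need not match them exactly.

```python
import math

# Probit coefficients as reported in the probit table.
b_cons, b_male, b_eduyrs = -0.6101249, 0.310493, 0.0748023
mean_eduyrs = 13.9  # sample mean quoted in paragraph 4 (assumed to apply here)

def std_normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def marginal_effect_eduyrs(male):
    """Effect of one extra year of education on Pr(intpolit) at mean eduyrs."""
    xb = b_cons + b_male * male + b_eduyrs * mean_eduyrs
    return std_normal_pdf(xb) * b_eduyrs

me_women = marginal_effect_eduyrs(0)
me_men = marginal_effect_eduyrs(1)
print(round(me_women, 4), round(me_men, 4))
```

The smaller effect for men follows from their higher linear predictor: further from zero the normal density is lower, so the same coefficient translates into a smaller change in probability.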

[Figure, two panels. Panel 1: Adjusted Predictions of male with 95% CIs; vertical axis
Pr(Intpolit), .2 to .8; horizontal axis years of full-time education completed (9 to 20); series
female, male. Panel 2: Average Marginal Effects of eduyrs with 95% CIs; vertical axis Effects on
Pr(Intpolit), .015 to .035; horizontal axis gender is male (female, male).]

7. Conclusions for our model


We started this article with a regression model explaining 8.7% of anti-migration attitudes in the
Netherlands. Using the latest data from the ESS round 7 and the variables and models we introduced
and tested, and also using stepwise selection to see which variables we should include, we can now
propose a better model explaining 28.5% of the attitudes:
lnant = 0.99 + 0.01 eduyrs - 0.02 lrscale - 0.08 aesfdrk - 0.001 male + 0.12 imbgeco.
We now have two background variables as predictors, male and eduyrs, and one that could be
classified as economic, imbgeco. Lrscale and aesfdrk can be seen as social. Given the predictive
power of the added economic opinion variable, we could look for a variable expressing opinions on the
social effect of immigration to further add to the predictive power of our model. The logarithm used in
the current model makes the model less transparent and more difficult to interpret, so we should also
look at the problems with our dependent variable and at alternative solutions for fixing them.
