Socioeconomic and demographic characteristics of each household obtained from the CHTS, such as
household size, number of employees, number of students, number of vehicles, income category and
household type, are used to develop a Poisson Regression Model (PRM), a Negative Binomial
Regression Model (NBRM) and a Zero Inflated Negative Binomial Regression Model (ZINB). A total
of 5 continuous variables and 5 categorical variables are used in the model development.
2. Literature Review
Even though multiple linear regression techniques are widely used, they suffer from three major
drawbacks: (i) the number of trips is treated as a continuous variable whereas it is discrete in
nature; (ii) a negative number of trips can be predicted when a lot of zeros are present in the
data; and (iii) non-trip-making behavior cannot be observed, i.e. zero trips can never be estimated
in linear regression [1–3].
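The second drawback can be illustrated with a small simulation. The sketch below uses simulated counts rather than the CHTS sample; the covariate `x` and the coefficients (-2, 1) are arbitrary assumptions chosen only to make the effect visible. It compares fitted values from a linear model with those from a Poisson model.

```r
# Simulated count data; x and the coefficients (-2, 1) are illustrative assumptions
set.seed(1)
x <- seq(0, 5, length.out = 200)
y <- rpois(length(x), lambda = exp(-2 + x))

fit_lm  <- lm(y ~ x)                      # linear regression on the counts
fit_prm <- glm(y ~ x, family = poisson)   # Poisson regression with log link

pred_lm  <- fitted(fit_lm)
pred_prm <- fitted(fit_prm)

# The linear fit can dip below zero at small x; the Poisson mean exp(.) cannot
min(pred_lm)
min(pred_prm)
```

Because the Poisson mean is the exponential of a linear predictor, its fitted values are positive by construction, whereas nothing constrains the straight line.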
To overcome the difficulties of discrete nature and non-negativity, count regression models
such as the Poisson Regression Model (PRM), the Negative Binomial Regression Model (NBRM), and
the Cumulative Logistic Regression Model (also called Ordered Logit Model, OLM) have been
tested and used recently [1–3].
Jang (2005) developed and tested four types of household-level trip generation count
models, namely PRM, NBRM, Zero Inflated Poisson Model (ZIP) and Zero Inflated Negative
Binomial Regression Model (ZINB), for the city of Jeonju, Korea, using survey data from 4,416
households. Using the home-based work trip travel surveys for the years 2002 and 2006 for the
city of Seoul, Korea, Chang et al. (2014) developed a variety of trip generation models such as
Multiple Linear Regression, Tobit model, PRM and OLM. Huntsinger et al. (2013) developed
an OLM using the 1995 North Carolina Household Travel Survey. A summary of the regression
techniques and the corresponding explanatory variables used in the referenced journal articles is
shown in Table 1.
Table 1. Summary of Regression type and the Key Explanatory Variables
Research by
Jang, T. Y.
Huntsinger, L. F., et al.
Linear Regression
From the literature as well as engineering judgement, the household demographics and the
personal characteristics of the household head are expected to influence the number of trips made
by a household on a given day. The explanatory variables chosen for this study and their nature
are shown in Table 2.
Sl. No  Variable Name  Description                                    Type         Number of Levels  Remarks
1                      Week day on which the trip diary was recorded  Categorical                    L1 - Monday; ... L7 - Sunday
2                      Income                                         Categorical
3                      Dwelling Unit Type                             Categorical
4       person_count                                                  Continuous
5       student_count                                                 Continuous
6       vehicle_count                                                 Continuous
7       bike_count                                                    Continuous
Personal Characteristics
8       new_age_head                                                  Categorical
9       gender_head                                                   Categorical                    1 - Male; 2 - Female
10      edu_head                                                      Categorical
11      emp_status                                                    Categorical                    1 - Yes; 2 - No
12      toll_pass                                                     Categorical                    1 - Yes; 2 - No
13      transit_use                                                   Categorical                    1 - Yes; 2 - No
Statistic       vehicle_count  persons_count  worker_count  student_count  license_count  bike_count  trip_count
# Observations  1976           1976           1976          1976           1976           1976        1976
Sum             3943           5061           2563          1208           3893           3018        14666
Mean            2.00           2.56           1.30          0.61           1.97           1.53        7.42
Variance        0.89           1.68           0.81          0.89           0.71           2.76        34.56
Std Deviation   0.94           1.30           0.90          0.95           0.84           1.66        5.88
COV             0.47           0.51           0.69          1.55           0.43           1.09        0.79
Rows with incomplete values (column assignment not recoverable): # zero values: 61, 381, 1269, 37, 743, 227; Maximum: 10, 25; Range: 10, 25. No values survive for # missing values, Minimum and Median.
From Figure 1(b), it can be clearly seen that the trip frequency is not symmetrically distributed
about the mean; it looks similar to a Poisson distribution but does not have monotonically
decreasing frequency values. A Poisson count model is therefore a reasonable starting point.
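$$\mu_i = E[y_i \mid x_i] = \exp(x_i \beta) \tag{1}$$
$$\Pr[Y_i = y_i \mid x_i] = \frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!} \tag{2}$$
Where $\mu_i$ = mean parameter, $y_i$ = number of trips per day of household $i$, $x_i$ = vector of
explanatory variables of household $i$ and $\beta$ = vector of coefficients. The mean, $\mu_i$, can
vary for each household based on its demographics. Parameters of PRM are obtained using Maximum
Likelihood (ML) techniques. These parameter estimates are asymptotically normal, consistent and
asymptotically efficient.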
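As a small worked example of the Poisson model, the sketch below evaluates the mean parameter and the trip probabilities for one household. The covariate values and coefficients are hypothetical and are not estimates from this study.

```r
# Hypothetical household covariates and coefficients (illustrative only)
x_i  <- c(1, 3, 1)            # intercept term, household size, number of vehicles
beta <- c(0.8, 0.25, 0.05)

mu_i <- exp(sum(x_i * beta))  # mean parameter: mu_i = exp(x_i beta)

y <- 0:5
p_formula <- exp(-mu_i) * mu_i^y / factorial(y)  # Poisson probabilities written out
p_builtin <- dpois(y, lambda = mu_i)             # same probabilities from base R

all.equal(p_formula, p_builtin)
```

In practice the coefficients are not chosen by hand but estimated by maximum likelihood, for example with `glm(..., family = poisson)`.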
A critical assumption made in using the Poisson model is that the trips generated within a
household are independent of one another, i.e. a trip made by a household for one purpose, say
shopping, does not induce another trip from the household (to satisfy the no-memory property of
the Poisson distribution), and that the trips made by different households are independent of each
other. Another limiting assumption of PRM is that the conditional mean of the trip rate is equal to
its conditional variance, which is called equidispersion. With this assumption, the
variance varies in proportion to the mean of the distribution, accounting for the observed
heterogeneity.
For the Poisson distribution, as the mean increases, the mass of the distribution moves towards
the right, and hence the model can underpredict the count of zeros. Also, quite often
the conditional variance of the trip rate is greater (smaller) than the conditional mean, which
is called overdispersion (underdispersion). If overdispersion is present, PRM parameter estimates
are still consistent but inefficient. This leads to underestimated standard errors and
spuriously large Z-values, indicating some variables to be significant that may otherwise be
insignificant. Negative Binomial regression can handle the unobserved heterogeneity by adding an
error term in the exponent of the mean parameter. The estimated PRM model is subjected to
Cameron and Trivedi's (1990) test for overdispersion.
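A quick, informal check for overdispersion is the Pearson dispersion statistic of a fitted Poisson model; values well above 1 point to overdispersion. The sketch below uses simulated overdispersed counts with a hypothetical `hh_size` covariate (assumptions for illustration, not the study data) and is not a substitute for the Cameron and Trivedi test. The last line is only a rough unconditional indication, using the variance and mean of trip_count from the descriptive statistics.

```r
set.seed(42)
n <- 2000
hh_size <- sample(1:6, n, replace = TRUE)   # hypothetical household size
mu <- exp(1 + 0.25 * hh_size)               # assumed mean structure
trips <- rnbinom(n, size = 2, mu = mu)      # overdispersed counts

fit <- glm(trips ~ hh_size, family = poisson)

# Pearson chi-square divided by the residual degrees of freedom
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion

# Unconditional variance-to-mean ratio of trip_count from the summary statistics
34.56 / 7.42
```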
4.1.2. Negative Binomial Regression Model
The Negative Binomial model (Equation 4) can be obtained by adding an error term to the mean
parameter of Equation 2.
$$\tilde{\mu}_i = E[y_i \mid x_i, \varepsilon_i] = \exp(x_i \beta + \varepsilon_i) = \mu_i \exp(\varepsilon_i) \tag{3}$$
Where $\varepsilon_i$ is a random error term assumed to be uncorrelated with the independent variables.
The NBRM count model can be written as in Equation 4. The term $\exp(\varepsilon_i)$ is called the
unobserved heterogeneity term and could reflect specification errors such as an omitted exogenous
variable, or other sources of randomness.
$$\Pr[Y_i = y_i \mid x_i, \varepsilon_i] = \frac{\exp(-\mu_i \exp(\varepsilon_i))\,(\mu_i \exp(\varepsilon_i))^{y_i}}{y_i!} \tag{4}$$
The term $\exp(\varepsilon_i)$ is assumed to be Gamma distributed with mean 1 and variance $\alpha$.
Introducing the error term into the mean parameter does not affect the conditional mean, but the
conditional variance varies with the conditional mean as shown in Equation 5.
$$\mathrm{Var}[y_i \mid x_i] = E[y_i \mid x_i]\,(1 + \alpha\,E[y_i \mid x_i]) \tag{5}$$
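The mean and variance relationship of the negative binomial model can be checked numerically by mixing a Poisson count with a gamma heterogeneity term. The sketch below is a simulation under assumed values: the mean is set to the sample mean of trip_count for illustration, and the gamma variance `alpha` is an arbitrary choice rather than an estimate.

```r
set.seed(7)
n     <- 200000
mu    <- 7.42      # mean used for illustration (sample mean of trip_count)
alpha <- 0.5       # assumed variance of the gamma heterogeneity term

# Gamma heterogeneity with mean 1 and variance alpha
delta <- rgamma(n, shape = 1 / alpha, rate = 1 / alpha)
y <- rpois(n, lambda = mu * delta)    # Poisson conditional on the heterogeneity

c(sample_mean = mean(y), theory_mean = mu)
c(sample_var = var(y), theory_var = mu * (1 + alpha * mu))

# The resulting marginal distribution is negative binomial with size = 1 / alpha
p_nb  <- dnbinom(0:3, size = 1 / alpha, mu = mu)
p_sim <- sapply(0:3, function(k) mean(y == k))
round(cbind(p_nb, p_sim), 4)
```

The simulated mean stays close to the Poisson mean, while the variance grows with the square of the mean, which is the behavior a Poisson model cannot reproduce.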
Some households may record zero trips not because of randomness but because they actually do not
make any trips. This produces a large proportion of zeros in the data, and NBRM may
not be able to predict these zeros. To take this into account, zero-inflated models are used.
4.1.3. Zero Inflated Negative Binomial Regression Model
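ZINB models predict the number of zeros by two separate processes. One process models the
households that always make zero trips, and the other is a count model that predicts the number of
trips, including zero trips, made by households that actually produce trips. In this study a logit
model has been used for the zero-inflation process and an NBRM count model for the count process.
The count probabilities are obtained using Equations 6.1 and 6.2.
$$\Pr[Y_i = 0 \mid x_i] = \psi_i + (1 - \psi_i)\exp(-\tilde{\mu}_i) \tag{6.1}$$
$$\Pr[Y_i = y_i \mid x_i] = \frac{(1 - \psi_i)\exp(-\tilde{\mu}_i)\,\tilde{\mu}_i^{y_i}}{y_i!}, \quad y_i = 1, 2, \ldots \tag{6.2}$$
Where $\psi_i$ is the probability, given by the logit model, that household $i$ always makes zero
trips, and $\tilde{\mu}_i = \mu_i \exp(\varepsilon_i)$ is the mean of the count process including the
unobserved heterogeneity term.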
The appropriateness of using ZINB has been verified using Vuong's test.
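The two-process structure of a zero-inflated model can be sketched as a probability function. The version below uses the marginal negative binomial form from base R (`dnbinom`), i.e. with the gamma heterogeneity already integrated out; `zinb_prob` is a hypothetical helper and the parameter values are assumed for illustration only.

```r
# Zero-inflated negative binomial probabilities.
# psi: probability of the always-zero state; mu, theta: NB mean and size (theta = 1 / alpha)
zinb_prob <- function(y, mu, theta, psi) {
  count_part <- dnbinom(y, size = theta, mu = mu)
  ifelse(y == 0, psi + (1 - psi) * count_part, (1 - psi) * count_part)
}

# Illustrative (assumed) parameter values
mu <- 7.4; theta <- 2; psi <- 0.10

p0_nb   <- dnbinom(0, size = theta, mu = mu)   # zeros expected from the count process alone
p0_zinb <- zinb_prob(0, mu, theta, psi)        # zeros after adding the always-zero households
c(p0_nb = p0_nb, p0_zinb = p0_zinb)

sum(zinb_prob(0:2000, mu, theta, psi))         # probabilities sum to (numerically) one
```

Comparing `p0_nb` with `p0_zinb` shows how the always-zero process raises the predicted share of zero-trip households above what the count model alone would give.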
[Equation (8) not recoverable; surviving values: 9.76, <2.22e-16]
Table 4. None of the values is beyond the allowable limit of 10, so there is no multicollinearity
among the explanatory variables considered.
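For a continuous regressor, the variance inflation factor equals 1/(1 - R²), where R² comes from regressing that variable on the remaining explanatory variables. The sketch below computes it by hand on simulated data; the variable names mirror the study but the values are assumptions. The GVIF reported for factors with several levels generalizes this idea and is not computed here.

```r
set.seed(3)
n <- 500
persons_count <- rpois(n, 2) + 1
license_count <- rbinom(n, size = persons_count, prob = 0.75)  # correlated with household size
vehicle_count <- rpois(n, 0.8 * license_count + 0.2)

X <- data.frame(persons_count, license_count, vehicle_count)

# VIF of each column: regress it on all the others and use 1 / (1 - R^2)
vif_manual <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
any(vif_manual > 10)   # TRUE would flag problematic multicollinearity
```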
4.3.2. Exogeneity of Explanatory Variables and Error Term
The main assumption of NBRM is that the error terms are uncorrelated with the explanatory
variables. This assumption is checked by plotting the residuals against the continuous independent
variables, as shown in the figures below.
Table 4. GVIF values for the count models developed
                       GVIF values for
Variable          Df   PRM    NBRM   ZINB
new_income             1.11   1.11   1.42
new_residence          1.07   1.07   1.14
dow                    1.01   1.01   1.19
transit_use            1.06   1.06   2.49
persons_count          2.05   2.01   5.06
student_count          1.87   1.79   2.24
worker_count           1.56   1.63   3.14
vehicle_count          1.40   1.39   3.89
bike_count             1.19   1.19   1.69
new_age_head           1.12   1.12   2.65
gender_head            1.04   1.04   1.47
edu_head               1.05   1.04   1.46
emp_status_head        1.37   1.41   1.74
toll_pass              1.08   1.08   1.92
license_count                        4.94
Figure 2. Residuals vs Explanatory variables for Exogeneity constraint (a) NBRM, (b) ZINB
The residual plots with respect to the independent variables reveal the heterogeneity
present in the data. The graphs also hint that the error terms and the explanatory variables
are not independent of each other, given the decreasing pattern of residuals at higher values of
the explanatory variables in both NBRM and ZINB. The presence of endogeneity makes the coefficient
estimates biased.
PRM
NBRM
Zero
-3.7470
-0.4983*
0.1255**
0.0648
-0.0416
0.0423
-0.0424
-0.4273
-0.0004
-0.0804
-0.1093
.
-0.7034
-0.2522
-0.2251
-0.2165
0.3367
0.5552*
0.2144
0.3068*
-0.2064
-0.1766
(Intercept)
new_income2
1.1778
0.0797**
new_income3
new_income4
new_residence2
new_residence3
dow2
0.1570***
0.0729*
-0.0385
0.0543*
0.0533
.
0.0888
0.1692**
0.0716
-0.0343
0.0428
0.0040
dow3
dow4
dow5
dow6
dow7
transit_use2
persons_count
student_count
worker_count
0.1078***
0.0994**
0.1438***
-0.0214
-0.1045**
-0.1617***
0.1804***
0.0416**
0.0685***
0.0870
0.1014
0.0994
-0.0218
-0.1029
-0.1676***
0.2269***
0.0294
0.0660*
vehicle_count
.
0.0204
0.0328***
0.0259
0.0651
0.0824
0.0798
0.0113
-0.0465
-0.1506***
0.2364***
0.0240
.
0.0459
0.0076
0.0315**
0.0229*
new_age_head2
new_age_head3
0.0180
-0.0821
0.0097
-0.0673
0.0877
0.0446
gender_head2
edu_head2
0.0077
-0.0326
0.0160
-0.0674
0.0124
0.0191
edu_head3
edu_head4
0.0753
0.0595
0.0848
0.1258
bike_count
1.0854
ZINB
Count
1.1407
0.0378
edu_head5
0.0969
0.1962 ***
edu_head6
emp_status_head2
toll_pass2
0.1654 **
0.0171
-0.0275
0.1761
0.1411
0.0044
-0.0254
license_count
.
0.1566
0.2152**
0.2155*
0.0183
.
-0.0521
-0.0058
-0.0757
-0.1133
1.5604
.
1.9330
-0.0962
.
0.9036
0.7274
0.8127
0.5164
0.8038
0.1395
-0.2975
-0.3911*
Table 6. Incident Rate Ratios of the variables considered in NBRM and ZINB
Variable
(Intercept)
NBRM
2.961***
.
1.093
1.184 **
1.074
0.966
1.044
ZINB
Count_model
3.129***
Zero_inflation
0.024***
1.039
0.608*
1.134**
1.067
0.959
1.043
0.652
1.000
0.923
0.896
1.004
0.959
dow3
dow4
dow5
dow6
dow7
transit_use2
persons_count
student_count
1.091
1.107
1.105
0.978
0.902
0.846***
1.255***
1.030
worker_count
1.068*
vehicle_count
1.026
1.067
1.086
1.083
1.011
0.955
0.860***
1.267***
1.024
.
1.047
1.008
.
0.495
0.777
0.798
0.805
1.400
1.742*
1.239
1.359
0.813*
1.032**
1.023*
new_age_head2
1.010
1.092
new_age_head3
0.935
1.046
gender_head2
1.016
1.012
edu_head2
0.935
1.019
edu_head3
1.061
edu_head4
edu_head6
emp_status_head2
1.089
.
1.193
1.152
1.004
toll_pass2
0.975
new_income2
new_income3
new_income4
new_residence2
new_residence3
dow2
bike_count
edu_head5
license_count
1.134
.
0.838
0.927
0.893
4.761
6.910
0.908
2.468
2.070
1.169
2.254
1.240**
1.676
1.240*
1.019
.
0.949
0.994
2.234
1.150
.
.
.
0.743
0.676*
Since the ZINB could predict the proportion of zeros more accurately, we use ZINB for making
inferences.
The exponentiated coefficients of the zero-inflation model are interpreted as odds ratios. The
baseline odds of being among the households with zero trip count are 0.024. For factor variables,
the odds ratios are interpreted as follows: a household in income level 2 has a 40% decrease in the
odds of producing zero trips compared to a household in income category 1, holding the other
variables constant.
For continuous variables, like the number of students in the household, each unit increase reduces
the odds of producing a zero trip count by approximately 19%. Significant variables contributing to
the zero-inflation process are license count, student count and income level 2. Each unit increase
in the license count decreases the odds of a zero trip count by 32%, controlling for all other
variables.
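The percentages quoted for the zero-inflation part follow directly from the odds ratios. The sketch below takes 0.608 and 0.676 from Table 6 as the odds ratios behind the 40% and 32% figures; that pairing is an assumption read from the text, since the extracted table layout is ambiguous.

```r
# Odds ratios quoted for the zero-inflation part (pairing with variables is assumed)
odds_ratio <- c(income_level_2 = 0.608, license_count = 0.676)

# Percentage change in the odds of being a zero-trip household
pct_change <- 100 * (odds_ratio - 1)
round(pct_change, 1)

# Converting the baseline odds of 0.024 to a probability: odds / (1 + odds)
baseline_prob <- 0.024 / (1 + 0.024)
round(baseline_prob, 4)
```

Note that an odds ratio describes a change in odds, not in probability; the last two lines show the conversion for the baseline household.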
The count model of ZINB indicates that a household in income level 2 ($50k-$100k annual
income) makes 3.9% more trips than a household in income level 1. A household in income
level 3 or 4 produces 13.4% or 6.7% more trips, respectively, than a household in income level 1.
The IRR values for day of the week indicate how the number of trips produced on the other days
compares with the number of trips produced on a Monday. For example, a
given household produces 4% fewer trips on Tuesday (dow2) and 8.6% more trips on Thursday
(dow4) relative to the number of trips made on Monday. Similar inferences can be made for the other
week days as well.
A household in which the head does not use transit makes 14% fewer trips
than a household whose head does use transit. For each unit increment in the household size, the
number of trips produced is multiplied by 1.267. Similarly, for each unit increment in the
student count of a household, the trips per day are multiplied by 1.024. For each additional
employee in the household, the household's trip rate is multiplied by 1.047. Each
additional vehicle in a household multiplies its trip rate by 1.008. Similarly, each additional
bike in the household multiplies the trip rate by 1.023.
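Because incident rate ratios act multiplicatively, their effects compound across units. The short sketch below uses the IRR values quoted in the paragraph above; the scenario of two additional household members is a hypothetical example.

```r
# Incident rate ratios quoted in the text for the ZINB count model
irr <- c(persons_count = 1.267, student_count = 1.024,
         worker_count = 1.047, vehicle_count = 1.008, bike_count = 1.023)

# Percentage change in expected trips per one-unit increase
round(100 * (irr - 1), 1)

# Effects compound multiplicatively: two additional household members
irr[["persons_count"]]^2

# An IRR is the exponential of the model coefficient, so the coefficient is log(IRR)
round(log(irr), 4)
```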
Households whose head is aged 30–59 years or >60 years make
9.2% and 4.6% more trips, respectively, than households whose head is aged < 30 years. A household
with a female head produces 1.2% more trips than its counterpart with a male head. Households
whose head has a higher degree make more trips compared to the base category. For
example, households whose head is either an undergraduate or a graduate produce 24% more trips
than households whose head has less than 12th grade education. A counterintuitive result
is that households whose head is not employed make 1.9% more trips than their
counterparts. Households whose head does not have a toll pass make 5% fewer trips than
households whose head has a toll pass. Each additional license multiplies the trips
by 0.996, which is very close to one, so the license count hardly alters the trip rate.
6. Conclusion
Three different count models have been developed for estimating the number of trips produced per
day at household level for Orange County. It was found that the sample data violate the
equidispersion criterion of the Poisson regression model, and that NBRM underpredicts the number of
zero trip counts. A ZINB with a binary logit model for zero inflation was developed, and
household size, number of employees, income category, bike count, education level of the
household head and possession of a toll pass were found to be significant variables in the model.
Household income category, day of the week, number of students in the household,
number of bikes in the household, age and education level of the household head, and number of
people possessing licenses are determined to be significant variables for the zero-inflation model.
The study sought to identify the preliminary significant variables for household-level
trip production and is limited in scope in terms of variable selection and model diagnostics.
In future, the model can be used to understand the effects of various factors influencing
the household trip generation process.
7. References
1. Jang, T. Y. Count Data Models for Trip Generation. J. Transp. Eng. 131, 444–450 (2005).
2. Chang, J. S., Jung, D., Kim, J. & Kang, T. Comparative analysis of trip generation
models: results using home-based work trips in the Seoul metropolitan area. Transp. Lett.
6, 78–88 (2014).
3.
4.
5.
6.
7.
8.
9.
10.
11.
12. Kazuki Yoshida, Models for excess zeros using pscl package (Hurdle and zero-inflated
regression models) and their interpretations, https://rpubs.com/kaz_yos/pscl-2, Date
Accessed: 03/09/2016
8. Appendix A
# Converting variables to factors
totdata$dow <- as.factor(totdata$dow)
totdata$new_income <- as.factor(totdata$new_income)
totdata$transit_use <- as.factor(totdata$transit_use)
totdata$new_residence <- as.factor(totdata$new_residence)
totdata$new_age_head <- as.factor(totdata$new_age_head)
totdata$gender_head <- as.factor(totdata$gender_head)
totdata$edu_head <- as.factor(totdata$edu_head)
totdata$emp_status_head <- as.factor(totdata$emp_status_head)
totdata$toll_pass <- as.factor(totdata$toll_pass)
# 1.1. Descriptive Summary Statistics using 'pastecs' package
boxplot(trip_count)
# removing the outliers observed from boxplot [points beyond Q3 + 1.5*IQR]
trimdata <- subset(totdata,trip_count < 26)
detach(totdata);attach(trimdata)
# 1.2. Histogram of trip count
hist(trip_count, main = "Histogram of Trips per day")
t <- table(trip_count); plot(t)
library(pastecs)
write.csv(stat.desc(trimdata),"trim_summary_stats1.csv")
# 1.3. Correlation check
prelim <- data.frame(trip_count, persons_count, worker_count, student_count,
                     bike_count, vehicle_count, license_count)
r <- cor(prelim)
plot(prelim)
# 2. Poisson regression model
library(sandwich);library(msm);library(ggplot2);library(AER)
# Poisson model with all 14 explanatory variables
psntrip <- glm(trip_count ~ new_income + new_residence + dow + transit_use +
                 persons_count + student_count + worker_count + vehicle_count +
                 bike_count + new_age_head + gender_head + edu_head +
                 emp_status_head + toll_pass,
               family = "poisson", data = trimdata)
summary(psntrip)
write.csv(coef(summary(psntrip)),"PRMcoef.csv")
# 2.1. Goodness of Fit tests
psntrip0 <- update(psntrip, .~1)
# 2.1.1. likelihood ratio test
llratio_p <- -2*(logLik(psntrip0)-logLik(psntrip))
p1 <- pchisq(llratio_p, df = 27,lower.tail = FALSE)
# 2.1.2 McFadden R2
library(pscl) # Library to compute various pseudo R2 values
psn_pr2 <- pR2(psntrip)
msfR2 <- psn_pr2[["McFadden"]]
# 2.2. Cameron and Trivedi overdispersion test
dispersiontest(psntrip)
# 2.3.1. Check for Multi collinearity
write.csv(vif(psntrip),'gvifvalues_psn.csv')
# 3. NBRM
library(foreign);library(MASS)
nbrm <- glm.nb(trip_count ~ new_income + new_residence + dow + transit_use +
                 persons_count + student_count + worker_count + vehicle_count +
                 bike_count + new_age_head + gender_head + edu_head +
                 emp_status_head + toll_pass,
               data = trimdata)
summary(nbrm)
nb_coef <- coef(summary(nbrm))
write.csv(nb_coef, "NBRMCoef.csv")
plot(table(round(nbrm$fitted.values, 1)))
# Incident rate ratios: exponentiate the estimates only, not the SE, z and p columns
exp_nb_coef <- exp(nb_coef[, "Estimate"])
write.csv(exp_nb_coef, "exp_NBRMCoef.csv")
# 3.1 Model Goodness-of-Fit Measures
# 3.1.1.likelihood ratio test
library(pscl) # Library to compute various pseudo R2 values
nbrm0 <- glm.nb(trip_count ~ 1, data = trimdata)
summary(nbrm0)
anova(nbrm, nbrm0)
# 3.1.2. Critical chi-square value; df = 27 slope parameters, as in the Poisson LR test
qchisq(0.95, 27)
# 3.1.3.McFadden R2
p2 <-pR2(nbrm)
p2[["McFadden"]]
# 3.2. Diagnostics (Dfbetas, dffit, covarianceratio, cooks distance, hat matrix value)
imnbrm <- influence.measures(nbrm)
attributes(imnbrm)
colnames(imnbrm$infmat)
View(imnbrm$infmat)
# 3.3.Check for Multicollinearity
vif(nbrm)
write.csv(vif(nbrm),'gvifvalues_nbrm.csv')
# 3.3.1 Check for exogeneity of explanatory variables.
# plot(nbrm,which = 1, main = "NBRM")
# 4.2.5. Vuong test for checking zero inflated models with normal models
vuong(zinbt,nbrm)
# 4.2.6 ZINB multi collinearity check
write.csv(vif(zinbt),'gvifvalues_zinb.csv')
zinblogy <- log(zinbt$fitted.values)
#4.2.6.1.Independence of error term
# plot(zinbt$residuals, zinbt$fitted.values, main = "ZINB", xlab = "log predicted values", ylab = "Residuals")
zifit <- zinbt$fitted.values
zires <- zinbt$residuals
plot(log(zifit),zires, main = "ZINB",ylab = "Residuals", xlab = "Log Fitted Values")
par(mfrow = c(2,2))
plot(persons_count,zires)
plot(student_count,zires)
plot(vehicle_count,zires)
plot(bike_count,zires)
par(mfrow = c(1,1))
# 4.3 prediction using ZINB
# 4.3.1. Mean zero trips predicted
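przinb <- predict(zinbt, type = "zero")    # probability of the always-zero state
muzinb <- predict(zinbt, type = "count")   # mean of the negative binomial count part
# Assuming zinbt was fitted with dist = "negbin", the zero probability of the count
# part is the negative binomial one, dnbinom(0, ...), rather than the Poisson exp(-mu)
pred_zinb <- przinb + (1 - przinb) * dnbinom(0, size = zinbt$theta, mu = muzinb)
mean(pred_zinb)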
# 4.3.2 prediction of trip count by ZINB
pred_zinb1 <- predict(zinbt,type = "response")
pred_zinb_prob <- predict(zinbt,type = "prob")