Count Trip Generation Models

Count Regression Model for Household Level Trip Generation
A case study of Orange County, Southern California

Koti Reddy Allu (23406771)
1. Introduction and Motivation

Trip generation is the first, and an important, step of any transportation planning framework.
Generally, trip generation models are developed at the zonal level. The demographic, socioeconomic
and land use factors of each zone influence the type and quantum of trips generated. The present study
is restricted to developing a household level trip generation model. Various methods, such as growth
factor models, category analysis, multiple classification analysis and regression techniques,
are used for trip estimation.
Multiple linear regression techniques can consider many explanatory variables that
are plausible contributing factors for trip generation at the household level, and they have been widely used
in academia and industry. These methods treat the number of trips as a
continuous variable even though it is discrete in nature, and they can predict
negative trip rates, which is clearly not possible.
Count models have been widely used in econometrics, ecological studies and
bioinformatics. In transportation, they are mainly used for modelling the number of
accidents per year, queue lengths per unit time, driver route changes per week, etc., but less often
for household level trip generation.
The objective of the present study is to develop an appropriate count regression model for
the total number of household trips generated on a given day for Orange County using the California
Household Travel Survey (CHTS 2010-2012) data. CHTS was a comprehensive household level
survey conducted in the state of California during the years 2010-2012. As part of this study, 2401
households of Orange County have been covered.
The socioeconomic and demographic factors of each household obtained from the CHTS, such as household size,
number of employees, number of students, number of vehicles, income category and household
type, are used to develop a Poisson Regression
Model (PRM), a Negative Binomial Regression Model (NBRM) and a Zero Inflated Negative
Binomial Regression Model (ZINB). A total of 5 continuous variables and 5 categorical variables
are used in the model development.
2. Literature Review
Even though multiple linear regression techniques are widely used, they suffer from three major
drawbacks: (i) the number of trips is treated as a continuous variable whereas it is discrete in
nature; (ii) negative numbers of trips can be predicted when many zeros are present in the
data; and (iii) non-trip-making behavior cannot be represented, i.e. the probability of zero trips cannot be estimated
in linear regression [1-3].
To overcome the difficulties of discrete nature and non-negativity, count regression models
such as the Poisson Regression Model (PRM), the Negative Binomial Regression Model (NBRM) and the
Cumulative Logistic Regression Model (also called the Ordered Logit Model, OLM) have been
tested and used recently [1-3].
Jang (2005) developed and tested four types of household level trip generation count
models, namely PRM, NBRM, Zero Inflated Poisson Model (ZIP) and Zero Inflated Negative
Binomial Regression Model (ZINB), for the city of Jeonju, Korea, using survey data from 4,416
households. Using the home-based work trip travel surveys for the years 2002 and 2006 for the
city of Seoul, Korea, Chang et al. (2014) developed a variety of trip generation models such as
multiple linear regression, Tobit model, PRM and OLM. Huntsinger et al. (2013) developed
an OLM using the 1995 North Carolina Household Travel Survey. A summary of the regression
techniques and the corresponding explanatory variables used in the referenced studies is shown in
Table 1.
Table 1. Summary of Regression type and the Key Explanatory Variables
| Research by | Trip Generation Method | Key Explanatory Variables |
|---|---|---|
| Chang, J. S., et al. | Linear Regression, Tobit Regression, Poisson Regression, Ordered Logit, Category Analysis, Multiple Classification Analysis | Household Income, Household Size, Number of Cars Owned, Number of Employees, Number of Preschool Children, Land use areas |
| Jang, T. Y. | Poisson Regression, Zero-inflated Poisson Regression, Negative Binomial Regression | Household Size, No. of vehicles, Income Category, Age of Head, No. of workers, etc. |
| Huntsinger, L. F., et al. | Cumulative Logistic Regression Model | Household Size, Number of Household Workers, Income Group, Number of Vehicles, Total children, etc. |
| Zenina, N., et al. | Linear Regression | Land use, Number of employed people, Number of Public Transport Vehicles, Number of Intersections, etc. |
| Kim, N. S., et al. | Poisson Regression, Negative Binomial Regression | Age, Education, Income, Non-Residential Unit Size, Degree of Urbanism, etc. |
Based on the literature as well as engineering judgement, the household demographics and the
personal characteristics of the household head are expected to influence the number of trips made by a
household on a given day. The explanatory variables chosen for this study and their types are shown
in Table 2.
3. Initial summary of the Data

3.1. Data collation and Cleansing
The survey data from CHTS has been organized separately at each level of aggregation like
household level, individual level, trip level, activity level, etc., by using a common key identifier
for the household. The required variables for this study have been extracted and collated from three
sheets namely, household characteristics file, activity characteristics file and person characteristics
file.
Table 2. Household level Socio - Demographic Characteristics
| Sl. No | Variable Name | Description | Type | Levels / Remarks |
|---|---|---|---|---|
| 1 | Day of the Week | Week day on which the trip diary was recorded | Categorical | L1 - Monday; ...; L7 - Sunday |
| | Household Level Demographics | | | |
| 2 | Income | Income level of the household | Categorical | L1 - $0-$50k; L2 - $50k-$100k; L3 - $100k-$150k; L4 - >$150k |
| 3 | Dwelling Unit Type | The type of household being surveyed | Categorical | L1 - Single Family; L2 - Mobile Home; L3 - Apartments |
| 4 | person_count | Number of persons in the household | Continuous | |
| 5 | student_count | Number of students in the household | Continuous | |
| 6 | vehicle_count | Number of vehicles in the household | Continuous | |
| 7 | bike_count | Number of bikes in the household | Continuous | |
| | Personal Characteristics | | | |
| 8 | new_age_head | Age of the household head | Categorical | L1 - 1 to 29 yrs; L2 - 30 to 59 yrs; L3 - >60 yrs |
| 9 | gender_head | Gender of the household head | Categorical | 1 - Male; 2 - Female |
| 10 | edu_head | Educational level of the household head | Categorical | 1 - Grade 12 or less; 2 - High school graduate; 3 - Some college credit but no degree; 4 - Associate or technical school degree; 5 - Bachelor's or undergraduate degree; 6 - Graduate degree |
| 11 | emp_status | Employment status of the household head | Categorical | 1 - Yes; 2 - No |
| 12 | toll_pass | Possession/usage of a toll pass | Categorical | 1 - Yes; 2 - No |
| 13 | transit_use | Indicator variable for transit use | Categorical | 1 - Yes; 2 - No |
3.2. Descriptive Statistics of the Key variables

The income information for 269 of the 2402 households was not available, as the respondents either
did not know it or were not willing to share it. Similarly, the information on some other variables
was not provided. So, only the survey data for the remaining 2133 households is
used for this study. The initial boxplot (Figure 1) of trip count shows that trip counts greater than
26 are outliers, so only observations with trip counts less than 26 are considered for
count model development. The trimmed dataset contains 2041 observations. The summary
statistics of the continuous variables are shown in Table 3.
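The boxplot-based trimming rule can be illustrated with a minimal R sketch. The trip counts are simulated, not the CHTS values, and the names trips, upper_fence and trimmed are hypothetical. The fence is computed from quantiles here, whereas boxplot() uses hinges, so the two can differ slightly.

```r
# Simulated trip counts standing in for the survey data (assumption, not CHTS values)
set.seed(1)
trips <- c(rnbinom(500, mu = 7, size = 2), 40, 55)

# Upper fence of the boxplot rule: Q3 + 1.5 * IQR
q <- quantile(trips, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Keep only observations at or below the fence
trimmed <- trips[trips <= upper_fence]
c(fence = upper_fence, kept = length(trimmed), dropped = length(trips) - length(trimmed))
```

The same fence applied to the study data would reproduce the cut-off read from Figure 1 only if the quartiles match those used by the boxplot.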
Figure 1. (a). Box plot of the Number of trips per household (b). Frequency chart of number of trips per day
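Table 3. Summary Statistics for the continuous variables
| Statistic | vehicle_count | persons_count | worker_count | student_count | license_count | bike_count | trip_count |
|---|---|---|---|---|---|---|---|
| # Observations | 1976 | 1976 | 1976 | 1976 | 1976 | 1976 | 1976 |
| # Zero values | 61 | - | 381 | 1269 | 37 | 743 | 227 |
| # Missing values | - | - | - | - | - | - | - |
| Minimum | - | - | - | - | - | - | - |
| Maximum | - | - | - | - | - | - | 25 |
| Range | - | - | - | - | - | - | 25 |
| Sum | 3943 | 5061 | 2563 | 1208 | 3893 | 3018 | 14666 |
| Median | - | - | - | - | - | - | - |
| Mean | 2.00 | 2.56 | 1.30 | 0.61 | 1.97 | 1.53 | 7.42 |
| Variance | 0.89 | 1.68 | 0.81 | 0.89 | 0.71 | 2.76 | 34.56 |
| Std Deviation | 0.94 | 1.30 | 0.90 | 0.95 | 0.84 | 1.66 | 5.88 |
| COV | 0.47 | 0.51 | 0.69 | 1.55 | 0.43 | 1.09 | 0.79 |
Cells marked "-" are not legible in the source. The Maximum and Range rows also list a value of 10 for one of the other variables.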
From Figure 1(b), it can be clearly seen that the trip frequency is not symmetrically distributed
about the mean. It looks similar to a Poisson distribution, although the frequencies do not decrease
monotonically. A Poisson count model is therefore a reasonable starting point.
4. Methodology and Model selection

4.1. Methodology and Models
The total number of trips generated by a household on a given day is a non-negative integer
of discrete nature [4]. The widely used linear regression models treat the trip rate as a continuous
variable and can produce negative values or fail to reproduce the large proportion of
zeros in the data. Also, the homoscedasticity assumption is very weak, as the variance of the trip
rate can differ between households. For example, a household that falls into a high
income category can have more variation in its trip pattern than a household in a low
income category, which will mostly make work trips and show less variation. Count models like PRM,
NBRM and ZINB can address both the non-negative discrete nature and the heterogeneity.
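The contrast between the two model families can be sketched in R on simulated data. The variable hh_size and the coefficients used to generate the counts are hypothetical and only serve to show that a linear fit can return a negative prediction while the Poisson mean, being an exponential of the linear predictor, cannot.

```r
# Simulated household sizes and trip counts (assumed values, not CHTS data)
set.seed(42)
n <- 500
hh_size <- sample(1:6, n, replace = TRUE)
trips <- rpois(n, lambda = exp(-1.5 + 0.6 * hh_size))

# Linear regression treats the count as continuous
lin <- lm(trips ~ hh_size)
# Poisson regression models the mean as exp(linear predictor)
psn <- glm(trips ~ hh_size, family = poisson)

# Predicted trips for a one-person household
new_hh <- data.frame(hh_size = 1)
pred_lin <- predict(lin, new_hh)                      # may fall below zero
pred_psn <- predict(psn, new_hh, type = "response")   # always positive
c(linear = unname(pred_lin), poisson = unname(pred_psn))
```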
4.1.1. Poisson Regression Model
In PRM, the probability of a count is determined by a Poisson distribution whose mean is obtained
as a function of the independent variables. PRM is given by Equation 1, with the mean parameter defined in Equation 2.
$$\Pr[Y_i = y_i \mid x_i] = \frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!} \qquad (1)$$
$$\mu_i = E[y_i \mid x_i] = \exp(x_i \beta) \qquad (2)$$
where y_i = number of trips per day of household i, x_i is its vector of explanatory variables and \beta is the vector of coefficients. The mean, \mu_i, can vary for each household
based on its demographics. Parameters of PRM are obtained using Maximum Likelihood (ML)
techniques. These parameters are asymptotically normal, consistent and asymptotically efficient.
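For reference, the log likelihood maximised by the ML procedure can be written out explicitly. This is the standard result for a Poisson model with an exponential mean function, stated here with y_i the trip count of household i, x_i its explanatory variables, \beta the coefficient vector and n the number of households; it is not taken from the original derivation.

```latex
\ln L(\beta) = \sum_{i=1}^{n} \left[ y_i \, x_i \beta - \exp(x_i \beta) - \ln(y_i!) \right]

% First-order condition: residuals are orthogonal to the regressors
\frac{\partial \ln L(\beta)}{\partial \beta}
  = \sum_{i=1}^{n} \left( y_i - \exp(x_i \beta) \right) x_i = 0
```

The first-order condition has no closed-form solution, which is why the estimates are obtained iteratively.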
A critical assumption made in using the Poisson model is that the trips generated
within a household are independent of each other, i.e. a trip made by a household for one purpose, say
shopping, does not yield another trip from the household (to satisfy the no memory property of the
Poisson distribution), and also that the trips made by different households are independent of each
other. Another limiting assumption of PRM is that the conditional mean of the trip rate is equal to the
conditional variance of the trip rate, which is called equidispersion. With this assumption, the
variance can vary in proportion to the mean of the distribution, accounting for the observed
heterogeneity.
For the Poisson distribution, as the mean of the distribution increases, the mass of the
distribution moves towards the right, and hence the model can under-predict the count of zeros. Also, quite often
the conditional variance of the trip rate is greater (smaller) than the conditional mean, which
is called over-dispersion (under-dispersion). If over-dispersion is present, PRM parameter estimates
are still consistent but inefficient. This leads to underestimated standard errors and
spuriously large Z-values, indicating some variables to be significant that may be insignificant
otherwise. Negative Binomial regression can handle the unobserved heterogeneity by adding an
error term in the exponent of the mean parameter. The estimated PRM is subjected to the
Cameron and Trivedi (1990) test for over-dispersion.
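A regression-based over-dispersion check of this kind can be sketched in base R. The data are simulated with known over-dispersion, and the auxiliary quantity below is intended to mirror the default form of the test, in which the variance is (1 + c) times the mean; its exact correspondence with the implementation in AER::dispersiontest is an assumption.

```r
# Simulated over-dispersed counts (assumed parameters, not CHTS data)
set.seed(7)
n <- 2000
x <- rnorm(n)
mu <- exp(1 + 0.3 * x)
y <- rnbinom(n, mu = mu, size = 1.5)

# Fit the Poisson model whose equidispersion assumption is being tested
fit <- glm(y ~ x, family = poisson)
mu_hat <- fitted(fit)

# Auxiliary variable: its mean estimates c in Var[y] = (1 + c) * mean
aux <- ((y - mu_hat)^2 - y) / mu_hat
dispersion <- 1 + mean(aux)
z_stat <- sqrt(n) * mean(aux) / sd(aux)
p_value <- pnorm(z_stat, lower.tail = FALSE)   # one-sided: over-dispersion
c(dispersion = dispersion, z = z_stat, p = p_value)
```

A dispersion estimate well above one with a small p-value points to over-dispersion, which is the pattern reported for the study data.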
4.1.2. Negative Binomial Regression Model
The Negative Binomial model (Equation 4) can be obtained by adding an error term to the mean
parameter of Equation 2.
$$\tilde{\mu}_i = E[y_i \mid x_i, \varepsilon_i] = \exp(x_i \beta + \varepsilon_i) \qquad (3)$$
where \varepsilon_i is a random error term assumed to be uncorrelated with the independent variables.
The NBRM count model can be written as in Equation 4. The term \exp(\varepsilon_i) is called the unobserved
heterogeneity term and could reflect specification errors such as an omitted exogenous variable, or
other sources of randomness.
$$\Pr[Y_i = y_i \mid x_i, \varepsilon_i] = \frac{\exp(-\mu_i \exp(\varepsilon_i))\,(\mu_i \exp(\varepsilon_i))^{y_i}}{y_i!} \qquad (4)$$
The heterogeneity term \exp(\varepsilon_i) is assumed to be Gamma distributed with mean 1 and variance \alpha.
Introducing the error term into the mean parameter does not affect the conditional mean, but the
conditional variance varies with the conditional mean as shown in Equation 5.
$$\mathrm{Var}[y_i \mid x_i] = E[y_i \mid x_i]\,(1 + \alpha\,E[y_i \mid x_i]) \qquad (5)$$
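The mean and variance relationship of the negative binomial model can be checked numerically with a short R sketch. A Poisson count is drawn with its mean multiplied by a Gamma term of mean 1 and variance alpha; the values of mu and alpha are hypothetical.

```r
# Gamma-Poisson mixture with assumed mean and dispersion (not estimated values)
set.seed(11)
n <- 200000
mu <- 5
alpha <- 0.4

# Heterogeneity term: Gamma with mean 1 and variance alpha
g <- rgamma(n, shape = 1 / alpha, scale = alpha)
# Poisson count whose mean is scaled by the heterogeneity term
y <- rpois(n, mu * g)

# Sample moments against the theoretical variance mu * (1 + alpha * mu)
theory_var <- mu * (1 + alpha * mu)
c(mean = mean(y), var = var(y), theory_var = theory_var)
```

The sample mean stays close to mu while the sample variance is close to mu * (1 + alpha * mu), well above the Poisson variance of mu.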
Some households may record zero trips not because of randomness but because they
actually do not make any trips. This produces a large proportion of zeros in the data, and NBRM may
not be able to predict these zeros. To take this into account, zero inflated models are used.
4.1.3. Zero Inflated Negative Binomial Regression Model
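ZINB models generate zeros through two separate processes. One process models the
households that always make zero trips, and the other is a count model that also allows
zero trips for households that do produce trips. In this study a logit model has been used
for the zero inflation process and an NBRM for the count process. The count
probabilities are obtained using Equations 6.1 and 6.2.
$$\Pr[Y_i = 0 \mid x_i] = \pi_i + (1 - \pi_i)\,g(0 \mid x_i) \qquad (6.1)$$
$$\Pr[Y_i = y_i \mid x_i] = (1 - \pi_i)\,g(y_i \mid x_i), \quad y_i = 1, 2, \ldots \qquad (6.2)$$
where \pi_i is the probability, from the logit model, that household i always makes zero trips and g(\cdot \mid x_i) is the negative binomial probability of the count process.
The appropriateness of using ZINB has been verified using Vuong's test.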
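The two-part structure of the zero inflated probabilities can be written as a small R function. This is a sketch with hypothetical parameter values; dzinb, pi0, mu and theta are illustrative names, with theta the size parameter of dnbinom() and pi0 the probability of the always-zero state.

```r
# Zero inflated negative binomial probability mass function (illustrative helper)
dzinb <- function(y, mu, theta, pi0) {
  # Structural zeros contribute pi0 at y == 0; the count process contributes the rest
  (y == 0) * pi0 + (1 - pi0) * dnbinom(y, mu = mu, size = theta)
}

# Assumed parameter values, not estimates from the study
mu <- 7.4
theta <- 2
pi0 <- 0.08

p0_nb <- dnbinom(0, mu = mu, size = theta)    # zero probability without inflation
p0_zinb <- dzinb(0, mu, theta, pi0)           # zero probability with inflation
total <- sum(dzinb(0:1000, mu, theta, pi0))   # probabilities sum to one
c(p0_nb = p0_nb, p0_zinb = p0_zinb, total = total)
```

Averaging the zero probability over households, each with its own mu and pi0, gives the predicted share of zero trip counts for a fitted model.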
4.2. Model Selection

Initially a PRM is developed with all the variables listed in Table 2 as explanatory
variables. The estimates of the coefficients of the model and their significance levels are shown in
Table 5. The likelihood ratio test shows that the overall model is significant. PRM shows that 15 out
of 27 explanatory variables are significant.
4.2.1. Test for over-dispersion

The equidispersion assumption of the PRM is verified by carrying out the regression-based test
proposed by Cameron and Trivedi (1990) [5]. The over-dispersion test indicates that the conditional
variance is 3.17 times (Z = 20.81, p value = 2.2e-16) larger than the conditional mean. So, the
coefficients in PRM are consistent but inefficient, and some of the variables that are shown to be
significant may not truly be significant because their standard errors are underestimated.
To overcome the problem of over-dispersion, an NBRM with the same explanatory
variables is developed, and the estimates of the coefficients for this model are shown in Table 5.
NBRM indicates that only 5 of the explanatory variables are significant at the 5% significance level and
two more variables would be significant if the 10% level is considered.
4.2.2. Test for the zero trip count
The mean percentage of zero trip counts predicted by NBRM is obtained as the mean, over all
households, of the negative binomial probability of zero, with the estimated theta of the
NBRM as the dispersion parameter and the expected trip rate of the household as the mean. The model predicts that 4.8%
(Appendix A, 3.6.1) of the trip counts would be zero, whereas 11.5% of the trip counts in the
observed sample are zero.
NBRM under-predicts the zero trip counts, so a zero inflated NBRM (ZINB) has been
developed with a binomial logit model for zero inflation, using all the variables as
explanatory variables in both the zero inflation model and the count model. For ZINB, the mean percentage of zero trip
counts predicted is computed as the expected value of the probability of a zero trip count (Equation
6.1). ZINB predicts 10.5% zero trip counts (Appendix A, 4.3.1), which is very close to the observed
value of 11.5%.
4.2.3. Goodness of Fit Tests

Two measures are mainly used for assessing the statistical significance and fit of the models developed.
4.2.3.1. Likelihood ratio test
The likelihood ratio test is used to verify the overall significance of the count models developed.
The likelihood ratio statistic, defined as negative two times the difference between the log likelihood values
of the intercept-only model and the full model (Equation 7), is used for checking the overall
model significance. The statistic X^2 is \chi^2 distributed with degrees of freedom equal to the number of
parameter coefficients in the full model minus one.
$$X^2 = -2\,(LL(0) - LL(\beta)) \qquad (7)$$
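The likelihood ratio computation can be sketched in R on simulated data. The regressors x1 and x2 and their coefficients are hypothetical; the statistic is negative two times the difference between the intercept-only and full log likelihoods, with degrees of freedom equal to the number of full-model coefficients minus one.

```r
# Simulated Poisson data with two assumed regressors (not CHTS variables)
set.seed(3)
n <- 800
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.4)
y <- rpois(n, exp(0.8 + 0.25 * x1 + 0.4 * x2))

# Full model and intercept-only model
full <- glm(y ~ x1 + x2, family = poisson)
null <- update(full, . ~ 1)

# Likelihood ratio statistic and its chi-square p-value
lr <- as.numeric(-2 * (logLik(null) - logLik(full)))
df <- length(coef(full)) - 1
p_value <- pchisq(lr, df = df, lower.tail = FALSE)
c(lr = lr, df = df, p = p_value)
```

For a Poisson GLM the same statistic equals the drop from the null deviance to the residual deviance reported by summary().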
4.2.3.2. McFadden pseudo R^2

The McFadden R^2 statistic is computed using Equation (8). This statistic lies between zero and one.
The closer the value is to one, the better the model fits.
$$R^2 = 1 - \frac{LL(\beta)}{LL(0)} \qquad (8)$$
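A minimal R sketch of the McFadden statistic, again on simulated data with a hypothetical regressor x, shows how the two log likelihoods enter the ratio.

```r
# Simulated Poisson data (assumed coefficients, not study estimates)
set.seed(5)
n <- 800
x <- rnorm(n)
y <- rpois(n, exp(1 + 0.3 * x))

# Log likelihoods of the full and intercept-only models
full <- glm(y ~ x, family = poisson)
null <- glm(y ~ 1, family = poisson)
ll_full <- as.numeric(logLik(full))
ll_null <- as.numeric(logLik(null))

# McFadden pseudo R-squared: one minus the ratio of log likelihoods
mcfadden <- 1 - ll_full / ll_null
mcfadden
```

Values for count models are typically much lower than the R^2 of a linear regression, so the statistic is best used to compare models fitted to the same data.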
4.2.3.3. Vuong's test for ZINB

To check whether ZINB is a statistically significant improvement over the standard NBRM, a Vuong test has
been conducted. The test indicates that ZINB fits significantly better than NBRM.
| Vuong Z Statistic | p-Value |
|---|---|
| 9.76 | <2.22e-16 |
4.3. Check for Assumptions

4.3.1. Multicollinearity check
All three models developed are checked for multicollinearity among their explanatory variables
using Variance Inflation Factors (VIF). For factor variables with more than one degree of freedom (df), the
generalized VIF is adjusted as GVIF^(1/(2*df)). The computed GVIF values are shown in
Table 4. None of the values is beyond the allowable limit of 10, so there is no evidence of serious multicollinearity
among the explanatory variables considered.
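The generalized VIF can be sketched in base R from the correlation matrix of the design columns. The data and the names income, persons and workers are hypothetical. This design-based version follows the determinant formula for GVIF; the vif() function used in Appendix A works on the fitted model, so agreement with its output for count models is an assumption.

```r
# Simulated explanatory variables (assumed, not CHTS data)
set.seed(9)
n <- 600
income <- factor(sample(1:4, n, replace = TRUE))   # 4-level factor, 3 dummy columns
persons <- rpois(n, 2) + 1
workers <- pmin(persons, rbinom(n, 3, 0.5))        # correlated with persons

# Design matrix without the intercept and the term each column belongs to
X <- model.matrix(~ income + persons + workers)
assign_idx <- attr(X, "assign")[-1]
R <- cor(X[, -1])
terms <- c("income", "persons", "workers")

# GVIF = det(R_term) * det(R_others) / det(R)
gvif <- sapply(seq_along(terms), function(k) {
  idx <- which(assign_idx == k)
  det(R[idx, idx, drop = FALSE]) * det(R[-idx, -idx, drop = FALSE]) / det(R)
})
df <- as.vector(table(assign_idx))
data.frame(term = terms, GVIF = gvif, Df = df, adjusted = gvif^(1 / (2 * df)))
```

For a term with a single degree of freedom the GVIF reduces to the ordinary VIF, 1 / (1 - R^2), from regressing that variable on the others.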
4.3.2. Exogeneity of Explanatory variables and error term
The main assumption of NBRM is that the error terms are uncorrelated with the explanatory
variables. This assumption is verified by plotting the independent continuous variables against the
residuals, as shown in Figure 2.
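Table 4. GVIF values for the count models developed
| Variable | Df | PRM | NBRM | ZINB |
|---|---|---|---|---|
| new_income | - | 1.11 | 1.11 | 1.42 |
| new_residence | - | 1.07 | 1.07 | 1.14 |
| dow | - | 1.01 | 1.01 | 1.19 |
| transit_use | - | 1.06 | 1.06 | 2.49 |
| persons_count | - | 2.05 | 2.01 | 5.06 |
| student_count | - | 1.87 | 1.79 | 2.24 |
| worker_count | - | 1.56 | 1.63 | 3.14 |
| vehicle_count | - | 1.40 | 1.39 | 3.89 |
| bike_count | - | 1.19 | 1.19 | 1.69 |
| new_age_head | - | 1.12 | 1.12 | 2.65 |
| gender_head | - | 1.04 | 1.04 | 1.47 |
| edu_head | - | 1.05 | 1.04 | 1.46 |
| emp_status_head | - | 1.37 | 1.41 | 1.74 |
| toll_pass | - | 1.08 | 1.08 | 1.92 |
| license_count | - | - | - | 4.94 |
Cells marked "-" are not legible in the source or not applicable; license_count enters only the ZINB model.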
Figure 2. Residuals vs Explanatory variables for Exogeneity constraint (a) NBRM, (b) ZINB
The residual plots with respect to the independent variables clearly show the heterogeneity
present in the data. These graphs also hint that the error terms and the explanatory variables
are dependent on each other, given the decreasing pattern of residuals with higher values of the
explanatory variables in both NBRM and ZINB. The presence of endogeneity will make the coefficient
estimates biased.
5. Results and Discussion

The incidence rate ratios (IRR) of the coefficients of the count models and the odds ratios of the coefficients of
the zero inflation model can be used for making inferences and comparisons. The exponentiated
coefficients for the NBRM and ZINB are presented in Table 6.
Table 5. Coefficients of the parameters for the PRM, NBRM and ZINB
Model
Exp. Variable
PRM
NBRM
Zero
-3.7470
-0.4983*
0.1255**
0.0648
-0.0416
0.0423
-0.0424
-0.4273
-0.0004
-0.0804
-0.1093
.
-0.7034
-0.2522
-0.2251
-0.2165
0.3367
0.5552*
0.2144
0.3068*
-0.2064
-0.1766
(Intercept)
new_income2
1.1778
0.0797**
new_income3
new_income4
new_residence2
new_residence3
dow2
0.1570***
0.0729*
-0.0385
0.0543*
0.0533
.
0.0888
0.1692**
0.0716
-0.0343
0.0428
0.0040
dow3
dow4
dow5
dow6
dow7
transit_use2
persons_count
student_count
worker_count
0.1078***
0.0994**
0.1438***
-0.0214
-0.1045**
-0.1617***
0.1804***
0.0416**
0.0685***
0.0870
0.1014
0.0994
-0.0218
-0.1029
-0.1676***
0.2269***
0.0294
0.0660*
vehicle_count
.
0.0204
0.0328***
0.0259
0.0651
0.0824
0.0798
0.0113
-0.0465
-0.1506***
0.2364***
0.0240
.
0.0459
0.0076
0.0315**
0.0229*
new_age_head2
new_age_head3
0.0180
-0.0821
0.0097
-0.0673
0.0877
0.0446
gender_head2
edu_head2
0.0077
-0.0326
0.0160
-0.0674
0.0124
0.0191
edu_head3
edu_head4
0.0753
0.0595
0.0848
0.1258
bike_count
1.0854
ZINB
Count
1.1407
0.0378
edu_head5
0.0969
0.1962 ***
edu_head6
emp_status_head2
toll_pass2
0.1654 **
0.0171
-0.0275
0.1761
0.1411
0.0044
-0.0254
license_count
.
0.1566
0.2152**
0.2155*
0.0183
.
-0.0521
-0.0058
-0.0757
-0.1133
1.5604
.
1.9330
-0.0962
.
0.9036
0.7274
0.8127
0.5164
0.8038
0.1395
-0.2975
-0.3911*
| Statistic | PRM | NBRM | ZINB |
|---|---|---|---|
| Likelihood ratio test, -2(LL(0) - LL(\beta)) | 2596 (df = 27) | 568 (df = 27) | 810.8 (df = 31) |
| McFadden R^2 | 0.16 | 0.05 | 0.069 |
*** - 0.1% significance; ** - 1% significance; * - 5% significance; . - 10% significance
Table 6. Incidence Rate Ratios of the variables considered in NBRM and ZINB
Variable
(Intercept)
NBRM
2.961***
.
1.093
1.184 **
1.074
0.966
1.044
ZINB
Count_model
3.129***
Zero_inflation
0.024***
1.039
0.608*
1.134**
1.067
0.959
1.043
0.652
1.000
0.923
0.896
1.004
0.959
dow3
dow4
dow5
dow6
dow7
transit_use2
persons_count
student_count
1.091
1.107
1.105
0.978
0.902
0.846***
1.255***
1.030
worker_count
1.068*
vehicle_count
1.026
1.067
1.086
1.083
1.011
0.955
0.860***
1.267***
1.024
.
1.047
1.008
.
0.495
0.777
0.798
0.805
1.400
1.742*
1.239
1.359
0.813*
1.032**
1.023*
new_age_head2
1.010
1.092
new_age_head3
0.935
1.046
gender_head2
1.016
1.012
edu_head2
0.935
1.019
edu_head3
1.061
edu_head4
edu_head6
emp_status_head2
1.089
.
1.193
1.152
1.004
toll_pass2
0.975
new_income2
new_income3
new_income4
new_residence2
new_residence3
dow2
bike_count
edu_head5
license_count
1.134
.
0.838
0.927
0.893
4.761
6.910
0.908
2.468
2.070
1.169
2.254
1.240**
1.676
1.240*
1.019
.
0.949
0.994
2.234
1.150
.
.
.
0.743
0.676*
*** 0.1% significance; ** - 1% significance; * - 5% significance; . - 10% significance
Since the ZINB predicts the proportion of zeros more accurately, we use ZINB for making
inferences.
The exponentiated coefficients of the zero inflation model are interpreted as odds ratios. The baseline odds of
being among the households with a zero trip count is 0.024. For factor variables, the odds ratios are
interpreted as follows: a household in income level 2 has a 40% decrease in the odds of producing zero trips
compared to a household in income category 1, holding the other variables constant.
For continuous variables, such as the number of students in the household, each unit increase reduces the odds
of producing a zero trip count by approximately 19%. The significant variables contributing to
the zero inflation process are license count, student count and income level 2. Each unit increase
in the license count decreases the odds of a zero trip count by 32%, controlling for all other variables.
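The conversion from an odds ratio to a percentage change in odds is simple arithmetic, sketched here in R. The odds ratios are the values quoted in the text and Table 6 for the zero inflation model; the vector names are illustrative.

```r
# Odds ratios of the zero inflation model as reported in the text and Table 6
odds_ratio <- c(income2 = 0.608, student_count = 0.813, license_count = 0.676)

# Percentage change in the odds of a zero trip count per level or unit
pct_change <- (odds_ratio - 1) * 100
round(pct_change, 1)

# Baseline odds converted to a probability: odds / (1 + odds)
baseline_odds <- 0.024
baseline_prob <- baseline_odds / (1 + baseline_odds)
round(baseline_prob, 4)
```

The resulting changes of about -39%, -19% and -32% correspond to the rounded figures used in the discussion, and the baseline odds of 0.024 translate to a probability of roughly 2.3%.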
The count model of ZINB indicates that a household in income level 2 ($50k-$100k annual
income) produces 3.9% more trips than a household in income level 1. A household falling in income
levels 3 and 4 produces 13.4% and 6.7% more trips, respectively, than a household in income level 1.
The IRR values for day of the week indicate how the number of trips produced on the other
days compares with the number of trips produced on a Monday. For example, a
given household produces 4% fewer trips on Tuesday (dow2) and 8.6% more trips on Thursday
(dow4) relative to the number of trips made on Monday. Similar inferences can be made for the other
week days as well.
A household in which the head of the household does not use transit makes 14% fewer trips
than a household in which the head does use transit. For each unit increment in the household size, the
number of trips produced gets multiplied by 1.267. Similarly, for each unit increment in the
student count of a household, the trips per day get multiplied by 1.024. For each unit
increment of employees in the household, the household's trip rate gets multiplied by 1.047. Each
addition of a vehicle to a household multiplies its trip rate by 1.008. Similarly, each addition
of a bike in the household multiplies the trip rate by 1.023.
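Because the incidence rate ratios act multiplicatively, the effect of several additional units compounds. A short R sketch using the household size ratio of 1.267 quoted above illustrates this; the variable names are illustrative.

```r
# Incidence rate ratio for household size as reported in the text
irr_persons <- 1.267

# Trip rate multiplier for 0 to 3 additional household members
extra_persons <- 0:3
multiplier <- irr_persons^extra_persons
round(multiplier, 3)

# The underlying count model coefficient is the log of the ratio
coef_persons <- log(irr_persons)
round(coef_persons, 4)
```

Two additional members therefore multiply the expected trips by about 1.61 rather than by 1 + 2 * 0.267, other variables held constant.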
Households with the age of the head in the categories of 30-59 years and >60 years make
9.2% and 4.6% more trips than households with the age of the head < 30 years. A household with
a female head produces 1.2% more trips than its counterpart with a male head. Households
whose head has a higher degree make more trips compared to the base category. For
example, households whose head holds an undergraduate or graduate degree produce 24% more trips
than households whose head has a grade 12 or lower education. A counterintuitive result
is that households whose head is not employed make 1.9% more trips than their
counterparts. Households whose head does not have a toll pass make 5% fewer trips than
households whose head has a toll pass. Each additional license multiplies the trips
by 0.996, which is very close to one, so the license count hardly alters the trip rate.
6. Conclusion
Three different count models have been developed for estimating the number of trips produced per
day at the household level for Orange County. It was found that the sample data violates the equidispersion
assumption of the Poisson regression model, and that NBRM under-predicts the number of
zero trip counts. A ZINB with a binary logit model for zero inflation was developed, and it was found that
household size, number of employees, income category, bike count, education level of the
household head and possession of a toll pass are significant variables in the count model.
Household income category, day of the week, number of students in the household,
number of bikes in the household, age and education level of the household head and number of people
possessing licenses are determined to be significant variables for the zero inflation model.
The study tried to identify the preliminary significant variables for household level
trip production and is limited in scope in terms of variable selection and diagnostic checks of
the models. In future, the model can be used to understand the effects of various factors influencing
the household trip generation process.
7. References
1. Jang, T. Y. Count Data Models for Trip Generation. J. Transp. Eng. 131, 444-450 (2005).
2. Chang, J. S., Jung, D., Kim, J. & Kang, T. Comparative analysis of trip generation models: results using home-based work trips in the Seoul metropolitan area. Transp. Lett. 6, 78-88 (2014).
3. Huntsinger, L. F., Rouphail, N. M. & Bloomfield, P. Trip Generation Models Using Cumulative Logistic Regression. J. Urban Plan. Dev. 139, 176-184 (2013).
4. Williams, R. Models for Count Outcomes. 1-25 (2015). at <https://www3.nd.edu/~rwilliam/xsoc73994/CountModels.pdf>
5. Cameron, A. C. & Trivedi, P. K. Regression-based tests for overdispersion in the Poisson model. J. Econom. 46, 347-364 (1990).
6. http://data.princeton.edu/R/glms.html; date accessed: 03/10/2016
7. http://www.ats.ucla.edu/stat/r/dae/zinbreg.htm; date accessed: 03/05/2016
8. Simon P. Washington, Matthew G. Karlaftis, Fred Mannering, Statistical and Econometric Methods for Transportation Data Analysis, CRC Press, 283-300 (2011)
9. Colin Cameron, Pravin K. Trivedi, Regression Analysis of Count Data, Cambridge University Press, 80-88
10. Joseph M. Hilbe, Negative Binomial Regression, Cambridge University Press, 77-137 (2011)
11. Kelly Black, http://www.cyclismo.org/tutorial/R/index.html; date accessed: 02/05/2016
12. Kazuki Yoshida, Models for excess zeros using pscl package (Hurdle and zero-inflated regression models) and their interpretations, https://rpubs.com/kaz_yos/pscl-2; date accessed: 03/09/2016
8. Appendix A
# Converting variables to factors
totdata$dow <- as.factor(totdata$dow)
totdata$new_income <- as.factor(totdata$new_income)
totdata$transit_use <- as.factor(totdata$transit_use)
totdata$new_residence <- as.factor(totdata$new_residence)
totdata$new_age_head <- as.factor(totdata$new_age_head)
totdata$gender_head <- as.factor(totdata$gender_head)
totdata$edu_head <- as.factor(totdata$edu_head)
totdata$emp_status_head <- as.factor(totdata$emp_status_head)
totdata$toll_pass <- as.factor(totdata$toll_pass)
# 1.1. Descriptive Summary Statistics using 'pastecs' package
boxplot(trip_count)
# removing the outliers observed from boxplot [points beyond Q3 + 1.5*IQR]
trimdata <- subset(totdata,trip_count < 26)
detach(totdata);attach(trimdata)
# 1.2. Histogram of trip count
hist(trip_count, main = "Histogram of Trips per day")
t <- table(trip_count); plot(t)
library(pastecs)
write.csv(stat.desc(trimdata),"trim_summary_stats1.csv")
# 1.3. Correlation check
prelim <- data.frame(trip_count, persons_count, worker_count, student_count,
                     bike_count, vehicle_count, license_count)
r <- cor(prelim)
plot(prelim)
# 2. Poisson regression model
library(sandwich);library(msm);library(ggplot2);library(AER)
psntrip <- glm(trip_count ~ new_income + new_residence + dow + transit_use +
                 persons_count + student_count + worker_count + vehicle_count +
                 bike_count + new_age_head + gender_head + edu_head +
                 emp_status_head + toll_pass,
               family = "poisson", data = trimdata)
summary(psntrip)
write.csv(coef(summary(psntrip)),"PRMcoef.csv")
# 2.1. Goodness of Fit tests
psntrip0 <- update(psntrip, .~1)
# 2.1.1. likelihood ratio test
llratio_p <- -2*(logLik(psntrip0)-logLik(psntrip))
p1 <- pchisq(llratio_p, df = 27,lower.tail = FALSE)
# 2.1.2 McFadden R2
library(pscl) # Library to compute various pseudo R2 values
psn_pr2 <- pR2(psntrip)
msfR2 <- psn_pr2[["McFadden"]]
# 2.2. Cameron and Trivedi overdispersion test
dispersiontest(psntrip)
# 2.3.1. Check for Multi collinearity
write.csv(vif(psntrip),'gvifvalues_psn.csv')
# 3. NBRM
library(foreign);library(MASS)
nbrm <- glm.nb(trip_count ~ new_income + new_residence + dow + transit_use +
                 persons_count + student_count + worker_count + vehicle_count +
                 bike_count + new_age_head + gender_head + edu_head +
                 emp_status_head + toll_pass, data = trimdata)
summary(nbrm)
nb_coef <-coef(summary(nbrm))
write.csv(coef(summary(nbrm)),"NBRMCoef.csv")
plot(table(round(nbrm$fitted.values,1)))
exp_nb_coef <- exp(nb_coef)
write.csv(exp_nb_coef,"exp_NBRMCoef.csv")
# 3.1 Model Goodness-of-Fit Measures
# 3.1.1.likelihood ratio test
library(pscl) # Library to compute various pseudo R2 values
nbrm0 <- glm.nb(trip_count~1)
summary(nbrm0)
anova(nbrm,nbrm0)
# 3.1.2.check for x2 value
qchisq(0.95,14)
# 3.1.3.McFadden R2
p2 <-pR2(nbrm)
p2[["McFadden"]]
# 3.2. Diagnostics (Dfbetas, dffit, covarianceratio, cooks distance, hat matrix value)
imnbrm <- influence.measures(nbrm)
attributes(imnbrm)
colnames(imnbrm$infmat)
View(imnbrm$infmat)
# 3.3.Check for Multicollinearity
vif(nbrm)
write.csv(vif(nbrm),'gvifvalues_nbrm.csv')
# 3.3.1 Check for exogenity of expalanatory variables.
# plot(nbrm,which = 1, main = "NBRM")
# 3.4. Check for endogenity

nbfit <- nbrm$fitted.values
nbres <- nbrm$residuals
plot(log(nbfit),nbres, main = "NBRM",ylab = "Residuals", xlab = "Log Fitted Values")
par(mfrow = c(2,2))
plot(persons_count,nbres)
plot(student_count,nbres)
plot(vehicle_count,nbres)
plot(bike_count,nbres)
par(mfrow = c(1,1))
# Writing down coefficients of the model to csv file
write.csv(coef(summary(nbrm)),"nbrmcoef.csv")
# 3.5. profiling of likelihood function to calculate the confidence intervals for coefficients
(est <- cbind(Estimate = coef(nbrm),confint(nbrm)))
exp(est)
#3.6. Prediction using NBRM
# 3.6.1 Predicting the # zeros by NBRM
zobs <- trip_count == 0
munb <- exp(predict(nbrm))
nbtheta <- nbrm$theta
znb <- dnbinom(0,mu=munb,size = nbtheta) # Probability of a zero trip count predicted by NBRM for each household
c(obs=mean(zobs), nb=mean(znb))
# 3.6.2 Predicting the trips with NBRM
pred_nbrm <- predict(nbrm,type = "response")
# 4.1. Zero Inflated Poisson Model
library(boot)
# zipt <- zeroinfl(trip_count ~ new_income + new_residence + dow + transit_use +
#                    persons_count + student_count + worker_count + license_count +
#                    vehicle_count + bike_count |
#                    new_income + new_residence + dow + transit_use + persons_count +
#                    student_count + worker_count + license_count + vehicle_count +
#                    bike_count, data = trimdata)
# summary(zipt)
# 4.2.Zero Inflated Negative Bionomial
zinbt <- zeroinfl(trip_count ~ new_income + new_residence + dow + transit_use +
                    persons_count + student_count + worker_count + vehicle_count +
                    bike_count + new_age_head + gender_head + edu_head +
                    emp_status_head + toll_pass + license_count,
                  data = trimdata, dist = "negbin")
summary(zinbt)
#4.2.2. coefficients of the estimates
zinb_coef <- coef(zinbt)
zinb_coef <- matrix(zinb_coef,ncol = 2)
colnames(zinb_coef) <- c("Count_model","Zero_inflation_model")
zinb_coef <- as.data.frame(zinb_coef)
write.csv(zinb_coef,"coef_zinb.csv")
#4.2.2.1. Exponentiated coefficients

exp_zinb_coef <- exp(zinb_coef)
write.csv(exp_zinb_coef,"exp_coef_zinb.csv")
# 4.2.3 Overall model statistical significance
zinbt0 <- update(zinbt, . ~1)
summary(zinbt0)
# log likelihood ratio test
llratio_zinb <- - 2*(logLik(zinbt0)-logLik(zinbt))
pchisq(llratio_zinb ,df = 26,lower.tail = FALSE)
# 4.2.4 McFadden's pseudo R2
mcfr2_zinb <- (1-(logLik(zinbt)/logLik(zinbt0)))
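
The likelihood ratio statistic and McFadden's pseudo R2 depend only on the log-likelihoods of the full and intercept-only models. The sketch below reproduces both on simulated data, using a base-R Poisson GLM as a stand-in for the ZINB model (an assumption made so that the example runs without extra packages). It also takes the degrees of freedom from the logLik objects rather than entering them by hand. All object names are hypothetical.

```r
# Simulated data; the Poisson GLM is a stand-in, not the model used in the study
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))

fit_full <- glm(y ~ x, family = poisson)  # model with the covariate
fit_null <- update(fit_full, . ~ 1)       # intercept-only model

ll_full <- logLik(fit_full)
ll_null <- logLik(fit_null)

# Likelihood ratio statistic; df is the difference in estimated parameters
lr_stat <- as.numeric(-2 * (ll_null - ll_full))
lr_df <- attr(ll_full, "df") - attr(ll_null, "df")
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R2: 1 - logLik(full) / logLik(null)
mcfadden_r2 <- 1 - as.numeric(ll_full) / as.numeric(ll_null)
```

Reading the degrees of freedom from attr(logLik(...), "df") avoids having to recount parameters whenever covariates are added or dropped.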
# 4.2.5. Vuong test comparing the zero-inflated model with the standard (non-inflated) NB model
vuong(zinbt,nbrm)
# 4.2.6 ZINB multicollinearity check
write.csv(vif(zinbt),'gvifvalues_zinb.csv')
zinblogy <- log(zinbt$fitted.values)
# 4.2.6.1. Independence of error term
# plot(zinbt$residuals,zinbt$fitted.values,main = "ZINB",xlab = "log predicted values",
#      ylab = "Residuals")
zifit <- zinbt$fitted.values
zires <- zinbt$residuals
plot(log(zifit),zires, main = "ZINB",ylab = "Residuals", xlab = "Log Fitted Values")
par(mfrow = c(2,2))
plot(persons_count,zires)
plot(student_count,zires)
plot(vehicle_count,zires)
plot(bike_count,zires)
par(mfrow = c(1,1))
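# 4.3. Prediction using ZINB
# 4.3.1. Mean probability of zero trips predicted
przinb <- predict(zinbt,type = "zero")   # zero-inflation probability
muzinb <- predict(zinbt,type = "count")  # mean of the negative binomial count component
# The count component is negative binomial, so its zero probability is dnbinom(0, ...), not exp(-mu)
pred_zinb <- przinb + (1-przinb)*dnbinom(0, mu = muzinb, size = zinbt$theta)
mean(pred_zinb)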
# 4.3.2 prediction of trip count by ZINB
pred_zinb1 <- predict(zinbt,type = "response")
pred_zinb_prob <- predict(zinbt,type = "prob")
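
In a zero-inflated model the probability of observing zero trips mixes two sources: the inflation probability pi and the zero mass of the count component. With a negative binomial count component this is pi + (1 - pi) * (theta / (theta + mu))^theta, whereas exp(-mu) is the corresponding term for a zero-inflated Poisson model. The sketch below contrasts the two using hypothetical values of pi, mu and theta (assumptions for illustration, not estimates from this study).

```r
# Hypothetical inflation probabilities, count means and dispersion (assumptions)
pi_zero <- c(0.05, 0.10, 0.20)
mu_count <- c(3, 6, 9)
theta <- 2

# Zero probability with a negative binomial count component (ZINB)
p0_zinb <- pi_zero + (1 - pi_zero) * dnbinom(0, mu = mu_count, size = theta)

# Zero probability with a Poisson count component (ZIP), shown for comparison
p0_zip <- pi_zero + (1 - pi_zero) * exp(-mu_count)

# For the same mean, the negative binomial puts more mass at zero than the Poisson
cbind(p0_zinb, p0_zip)
```

Because the negative binomial is overdispersed relative to the Poisson, using the Poisson kernel with a ZINB fit would understate the predicted share of zero-trip households.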
