
Count Regression Model for Household Level Trip Generation

A case study of Orange County, Southern California


Koti Reddy Allu (23406771)

1. Introduction and Motivation


Trip generation is the first, and an important, step of any transportation planning framework.
Trip generation models are generally developed at the zonal level, where the demographic, socioeconomic
and land use factors of each zone influence the type and quantum of trips generated. The present study
is restricted to developing a household level trip generation model. Various methods, such as growth
factor models, category analysis, multiple classification analysis and regression techniques,
are used for trip estimation.
Multiple linear regression techniques can consider many explanatory variables that
are plausible contributing factors for trip generation at the household level, and they have been widely used
in academia and industry. However, these methods treat the number of trips as a
continuous variable even though it is discrete in nature, and they can estimate
negative trip rates, which is clearly not possible.
Count models have been widely used in econometrics, ecological studies and
bioinformatics. In transportation, they are mainly used for modelling the number of
accidents per year, queue lengths per unit time, driver route changes per week, etc., but less often
for household level trip generation.
The objective of the present study is to develop an appropriate count regression model for
the total number of household trips generated on a given day for Orange County using the California
Household Travel Survey (CHTS 2010-2012) data. CHTS was a comprehensive household level
survey conducted in the state of California during the years 2010-2012. As part of this survey, 2401
households of Orange County were covered.

The socioeconomic and demographic factors of each household obtained from the CHTS, such as
household size, number of employees, number of students, number of vehicles, income category and
household type, are used to develop a Poisson Regression Model (PRM), a Negative Binomial
Regression Model (NBRM) and a Zero Inflated Negative Binomial Regression Model (ZINB).
A total of 5 continuous variables and 5 categorical variables
are used in the model development.

2. Literature Review
Even though multiple linear regression techniques are widely used, they suffer from three major
drawbacks: (i) the number of trips is treated as a continuous variable whereas it is discrete in
nature; (ii) a negative number of trips can be predicted when a lot of zeros are present in the
data; and (iii) non-trip-making behavior cannot be observed, i.e. zero trips can never be estimated
in linear regression [1-3].
To overcome the difficulties of discrete nature and non-negativity, count regression models
such as the Poisson Regression Model (PRM), the Negative Binomial Regression Model (NBRM) and
the Cumulative Logistic Regression Model (also called the Ordered Logit Model, OLM) have been
tested and used recently [1-3].
Jang (2005) developed and tested four types of household level trip generation count
models, namely PRM, NBRM, Zero Inflated Poisson Model (ZIP) and Zero Inflated Negative
Binomial Regression Model (ZINB), for the city of Jeonju, Korea, using survey data of 4,416
households. Using the home-based work trip travel surveys of 2002 and 2006 for the
city of Seoul, Korea, Chang et al. (2014) developed a variety of trip generation models such as
multiple linear regression, the Tobit model, PRM and OLM. Huntsinger et al. (2013) developed
an OLM using the 1995 North Carolina Household Travel Survey. A summary of the regression
techniques and the corresponding explanatory variables used in the referenced papers is shown in
Table 1.
Table 1. Summary of Regression type and the Key Explanatory Variables

Research by | Trip Generation Method | Key Explanatory Variables
Chang, J. S., et al. | Linear Regression, Tobit Regression, Poisson Regression, Ordered Logit, Category Analysis, Multiple Classification Analysis | Household Income, Household Size, Number of Cars Owned, Number of Employees, Number of Preschool children, Land use areas
Jang, T. Y. | Poisson Regression, Zero-inflated Poisson Regression, Negative Binomial Regression | Household Size, No. of vehicles, Income Category, Age of Head, No. of workers, etc.
Huntsinger, L. F., et al. | Cumulative Logistic Regression Model | Household Size, Number of Household Workers, Income Group, Number of Vehicles, total children, etc.
Zenina, N., et al. | Linear Regression | Land use, Number of employed people, Number of Public Transport Vehicles, Number of Intersections, etc.
Kim, N. S., et al. | Poisson Regression, Negative Binomial Regression | Age, Education, Income, Non-Residential Unit Size, Degree of Urbanism, etc.

Based on the literature as well as engineering judgement, the household demographics and the
personal characteristics of the household head are expected to influence the number of trips made by a
household on a given day. The explanatory variables chosen for this study and their nature are shown
in Table 2.

3. Initial summary of the Data


3.1. Data collation and Cleansing
The survey data from CHTS is organized separately at each level of aggregation (household level,
individual level, trip level, activity level, etc.), with a common key identifier
for the household. The variables required for this study have been extracted and collated from three
files, namely the household characteristics file, the activity characteristics file and the person
characteristics file.


Table 2. Household level Socio - Demographic Characteristics

Sl. No | Variable Name | Description | Type | Number of Levels | Remarks
1 | Day of the Week | Week day on which the trip diary was recorded | Categorical |  | L1 - Monday; ...; L7 - Sunday
Household Level Demographics
2 | Income | Income level of the household | Categorical |  | L1 - $0-$50k; L2 - $50k-$100k; L3 - $100k-$150k; L4 - >$150k
3 | Dwelling Unit Type | The type of household being surveyed | Categorical |  | L1 - Single Family; L2 - Mobile Home; L3 - Apartments
4 | person_count | Number of persons in the household | Continuous |  | 
5 | student_count | Number of students in the household | Continuous |  | 
6 | vehicle_count | Number of vehicles in the household | Continuous |  | 
7 | bike_count | Number of bikes in the household | Continuous |  | 
Personal Characteristics
8 | new_age_head | Age of the household head | Categorical |  | L1 - 1 to 29 yrs; L2 - 30 to 59 yrs; L3 - >60 yrs
9 | gender_head | Gender of the household head | Categorical |  | 1 - Male; 2 - Female
10 | edu_head | Educational level of the household head | Categorical |  | 1 - Grade 12 or less; 2 - High school graduate; 3 - Some college credit but no degree; 4 - Associate or technical school degree; 5 - Bachelor's or undergraduate degree; 6 - Graduate degree
11 | emp_status | Employment status of household head | Categorical |  | 1 - Yes; 2 - No
12 | toll_pass | Possession/usage of a toll pass | Categorical |  | 1 - Yes; 2 - No
13 | transit_use | Indicator variable for transit use | Categorical |  | 1 - Yes; 2 - No


3.2. Descriptive Statistics of the Key variables


Income information was not available for 269 of the 2402 households, as the respondents either
did not know or were not willing to share it. Similarly, information on some other variables
was not provided. So, only the survey data for the remaining 2133 households is
used for this study. The initial boxplot (Figure 1) of trip count shows that trip counts greater than
26 are outliers, so only observations with trip counts less than 26 are considered for
count model development. The trimmed dataset contains 2041 observations. The summary
statistics of the continuous variables are shown in Table 3.
Figure 1. (a). Box plot of the Number of trips per household (b). Frequency chart of number of trips per day
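The boxplot rule used for trimming (points beyond Q3 + 1.5*IQR, as noted in Appendix A) can be sketched as follows. This is only an illustration on simulated counts: the vector `trips` and the simulation parameters are assumptions and not the CHTS data, and `quantile()` is used for the quartiles, whereas `boxplot()` works with hinges that can differ slightly.

```r
# Illustrative only: simulated overdispersed trip counts, not the CHTS sample
set.seed(1)
trips <- rnbinom(2000, mu = 7.4, size = 2)

# Upper boxplot fence: third quartile plus 1.5 times the interquartile range
q <- quantile(trips, c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])

# Keep only the observations at or below the fence
trimmed <- trips[trips <= upper_fence]
c(fence = upper_fence, kept = length(trimmed),
  dropped = length(trips) - length(trimmed))
```

Trimming on a fixed cut-off such as `trip_count < 26` is the same idea with the fence rounded to an integer count.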

Table 3. Summary Statistics for the continuous variables

Statistic | vehicle_count | persons_count | worker_count | student_count | license_count | bike_count | trip_count
# Observations | 1976 | 1976 | 1976 | 1976 | 1976 | 1976 | 1976
# zero values | 61 |  | 381 | 1269 | 37 | 743 | 227
# missing values |  |  |  |  |  |  | 
Minimum |  |  |  |  |  |  | 
Maximum |  |  |  |  |  |  | 25
Range |  |  |  |  |  |  | 25
Sum | 3943 | 5061 | 2563 | 1208 | 3893 | 3018 | 14666
Median |  |  |  |  |  |  | 
Mean | 2.00 | 2.56 | 1.30 | 0.61 | 1.97 | 1.53 | 7.42
Variance | 0.89 | 1.68 | 0.81 | 0.89 | 0.71 | 2.76 | 34.56
Std Deviation | 0.94 | 1.30 | 0.90 | 0.95 | 0.84 | 1.66 | 5.88
COV | 0.47 | 0.51 | 0.69 | 1.55 | 0.43 | 1.09 | 0.79


From Figure 1(b), it can be clearly seen that the trip frequency is not symmetrically distributed
about the mean. It looks similar to a Poisson distribution, although the frequency values do not
decrease monotonically. A Poisson count model is therefore a reasonable option to start with.
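A quick arithmetic check on the summary statistics shows why equidispersion is worth testing before settling on PRM. The sketch below uses only the values reported for trip_count in Table 3 (mean 7.42, variance 34.56) and assumes that the 227 zero values listed there belong to trip_count, which agrees with the 11.5% observed share of zeros quoted later. It describes the marginal distribution only, not the conditional one examined by the formal test.

```r
# Values reported for trip_count in Table 3
trip_mean <- 7.42
trip_var  <- 34.56
n_obs     <- 1976
n_zero    <- 227

# Under a Poisson distribution the variance equals the mean, so this ratio would be near 1
var_to_mean <- trip_var / trip_mean

# Share of zeros implied by a single Poisson distribution with this mean, versus observed
zero_share_poisson  <- dpois(0, lambda = trip_mean)
zero_share_observed <- n_zero / n_obs

round(c(var_to_mean = var_to_mean,
        zero_poisson = zero_share_poisson,
        zero_observed = zero_share_observed), 4)
```

The ratio is well above one and the observed share of zeros is far larger than a single Poisson distribution would imply, which is consistent with the later move to NBRM and ZINB.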

4. Methodology and Model selection


4.1. Methodology and Models
The total number of trips generated by a household on a given day is a non-negative integer
and is discrete in nature [4]. The widely used linear regression models treat the trip rate as a continuous
variable and can produce negative values or fail to reproduce the large proportion of
zeros in the data. Also, the homoscedasticity assumption is unlikely to hold, as the variance of the trip
rate can differ across households. For example, a household in a high
income category can have more variation in its trip pattern than a household in a low
income category, which will mostly make work trips with less variation. Count models like PRM,
NBRM and ZINB can address both the non-negative discrete nature and the heterogeneity.
4.1.1. Poisson Regression Model
In PRM, the probability of a count is determined by a Poisson distribution whose mean is obtained
as a function of the independent variables. PRM is given by Equation 1.

$$\Pr[Y_i = y_i \mid x_i] = \frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!} \qquad (1)$$

with the mean parameter

$$\mu_i = E[y_i \mid x_i] = \exp(x_i \beta) \qquad (2)$$

where $y_i$ = number of trips per day of household $i$, $x_i$ is the vector of its explanatory variables
and $\beta$ is the vector of coefficients. The mean $\mu_i$ can vary for each household
based on its demographics. The parameters of PRM are obtained using Maximum Likelihood (ML)
techniques. These parameters are asymptotically normal, consistent and asymptotically efficient.
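How the mean parameter and the Poisson probability fit together can be sketched in a few lines. The coefficient values and the household below are hypothetical and are chosen only for illustration; they are not estimates from this study.

```r
# Hypothetical coefficients and household, for illustration only
beta <- c(intercept = 1.0, persons_count = 0.2, vehicle_count = 0.05)
x    <- c(1, 3, 2)  # intercept term, 3 persons, 2 vehicles

# Log link: the mean is the exponential of the linear predictor
mu <- exp(sum(x * beta))

# Poisson probability of observing y trips, written out and via dpois()
y <- 0:30
p_manual <- exp(-mu) * mu^y / factorial(y)
p_dpois  <- dpois(y, lambda = mu)

all.equal(p_manual, p_dpois)
```

Because of the exponential link, the predicted mean is always positive, which removes the negative trip rates that a linear model can produce.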
A critical assumption made to use the Poisson model is that the trips generated
within a household are independent of each other, i.e. a trip made by a household for one purpose, say
shopping, does not yield another trip from the household (to satisfy the memoryless property of
the Poisson process), and also that the trips made by different households are independent of each
other. Another limiting assumption of PRM is that the conditional mean of the trip rate is equal to its
conditional variance, which is called equidispersion. With this assumption, the
variance varies with the mean of the distribution, which accounts for the observed
heterogeneity.
For the Poisson distribution, as the mean increases, the mass of the
distribution moves towards the right, and hence the model can under-predict the count of zeros. Also, quite often
the conditional variance of the trip rate is greater (smaller) than the conditional mean, which
is called overdispersion (underdispersion). If overdispersion is present, the PRM parameter estimates
would still be consistent but inefficient. This leads to underestimated standard errors and
spuriously large Z-values, indicating some variables to be significant which may be insignificant
otherwise. Negative Binomial regression can handle the unobserved heterogeneity by adding an
error term in the exponent of the mean parameter. The estimated PRM is subjected to
Cameron and Trivedi's (1990) test for overdispersion.
4.1.2. Negative Binomial Regression Model
The Negative Binomial model (Equation 4) can be obtained by adding an error term to the mean
parameter of Equation 2.

$$\tilde{\mu}_i = E[y_i \mid x_i, \varepsilon_i] = \exp(x_i \beta + \varepsilon_i) \qquad (3)$$

where $\varepsilon_i$ is a random error term assumed to be uncorrelated with the independent variables.
The NBRM count model can be written as in Equation 4. The term $\exp(\varepsilon_i)$ is called the unobserved
heterogeneity term and could reflect specification errors such as an omitted exogenous variable, or
other sources of randomness.

$$\Pr[Y_i = y_i \mid x_i, \varepsilon_i] = \frac{\exp(-\mu_i \exp(\varepsilon_i))\,(\mu_i \exp(\varepsilon_i))^{y_i}}{y_i!} \qquad (4)$$

The term $\exp(\varepsilon_i)$ is assumed to be Gamma distributed with mean 1 and variance $\alpha$.
Introducing the error term into the mean parameter does not affect the conditional mean, but the
conditional variance varies with the conditional mean as shown in Equation 5.

$$\mathrm{Var}[y_i] = E[y_i]\,(1 + \alpha\,E[y_i]) \qquad (5)$$
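The variance relationship in Equation 5 can be checked numerically. In the sketch below the mean and the dispersion value are hypothetical; `dnbinom()` is called with `size = 1 / alpha`, which is the parameterization under which the negative binomial is the Poisson-gamma mixture described above.

```r
# Hypothetical mean and dispersion, for illustration only
mu    <- 7.42   # conditional mean of the trip count
alpha <- 0.5    # variance of the gamma heterogeneity term

# Negative binomial probabilities with mean mu and size = 1 / alpha
y <- 0:2000
p <- dnbinom(y, size = 1 / alpha, mu = mu)

# Moments computed directly from the probability mass function
nb_mean <- sum(y * p)
nb_var  <- sum((y - nb_mean)^2 * p)

c(mean = nb_mean, variance = nb_var, equation5 = mu * (1 + alpha * mu))
```

The mean is unchanged by the heterogeneity term, while the variance grows quadratically with the mean; as alpha approaches zero the model collapses back to the Poisson case.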

Some households might have zero trips not because of randomness but because
they actually do not make any trips. This produces a large proportion of zeros in the data, and NBRM may
not be able to predict these zeros. To take this into account, zero inflated models are used.
4.1.3. Zero Inflated Negative Binomial Regression Model
ZINB models predict the number of zeros by two separate processes. One process models the
households that always make zero trips, and the other is a count model that predicts the trip counts,
including zeros, of households that actually produce trips. In this study a logit model has been used
for the zero inflation process and an NBRM for the count process. The count
probabilities are obtained using Equations 6.1 and 6.2.

$$\Pr[y_i = 0] = p_i + (1 - p_i)\exp(-\mu_i) \qquad (6.1)$$

$$\Pr[y_i = y] = \frac{(1 - p_i)\exp(-\mu_i)\,\mu_i^{y}}{y!}, \quad y = 1, 2, \ldots \qquad (6.2)$$

where $p_i$ is the probability, from the logit model, that household $i$ always makes zero trips and
$\mu_i$ is the mean of the count process.
The appropriateness of using ZINB has been verified using Vuong's test.
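The two-process structure can be sketched as a probability function. One assumption should be noted: the count process below uses the negative binomial probability from `dnbinom()`, whereas Equations 6.1 and 6.2 are written with the Poisson term. The helper name `dzinb` and the parameter values are hypothetical.

```r
# Hypothetical helper: zero-inflated negative binomial probability of y trips
# p     = probability of the always-zero process (from the logit part)
# mu    = mean of the count process
# theta = negative binomial size (dispersion) parameter
dzinb <- function(y, p, mu, theta) {
  count_part <- (1 - p) * dnbinom(y, size = theta, mu = mu)
  ifelse(y == 0, p + count_part, count_part)
}

# Made-up parameter values for one household
p <- 0.08; mu <- 7.5; theta <- 2.5

probs <- dzinb(0:500, p, mu, theta)
c(total = sum(probs), zero = probs[1],
  zero_nb_only = dnbinom(0, size = theta, mu = mu))
```

The zero probability of the mixture is always at least as large as that of the count process alone, which is what allows the model to absorb excess zeros.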

4.2. Model Selection


Initially, a PRM is developed with all the variables described in Table 2 as explanatory
variables. The estimates of the coefficients of the model and their significance levels are shown in
Table 5. The likelihood ratio test shows that the overall model is significant. PRM shows that 15 out
of 27 explanatory variables are significant.


4.2.1. Test for overdispersion


The equidispersion assumption of the PRM is verified by carrying out the regression-based test
proposed by Cameron and Trivedi (1990) [5]. The overdispersion test indicates that the conditional
variance is 3.17 times (Z = 20.81, p-value < 2.2e-16) the conditional mean. So, the
coefficients in PRM are consistent but inefficient, and some of the variables shown to be
significant may not truly be significant because their standard errors are underestimated.
To overcome the problem of overdispersion, an NBRM with the same explanatory
variables is developed, and the estimates of its coefficients are shown in Table 5.
NBRM indicates that only 5 of the explanatory variables are significant at the 5% significance level, and
two more variables would be significant if a 10% level is considered.
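The idea behind the regression-based overdispersion test can be sketched by hand on simulated data. This is an illustration in the spirit of the test, using the alternative that the variance is a constant multiple of the mean; whether it reproduces the output of `dispersiontest()` exactly is an assumption, and `x` and `y` are simulated, not CHTS variables.

```r
# Simulated overdispersed counts; x and y are illustrative, not CHTS variables
set.seed(42)
n <- 2000
x <- rnorm(n)
y <- rnbinom(n, mu = exp(1.5 + 0.3 * x), size = 2)

# Fit the Poisson model and take its fitted conditional means
fit <- glm(y ~ x, family = poisson)
mu_hat <- fitted(fit)

# Under equidispersion E[(y - mu)^2 - y] = 0, so this auxiliary variable averages zero
aux <- ((y - mu_hat)^2 - y) / mu_hat

# z statistic for H0: variance = mean against H1: variance = dispersion * mean
z_stat     <- sqrt(n) * mean(aux) / sd(aux)
dispersion <- mean(aux) + 1
p_value    <- pnorm(z_stat, lower.tail = FALSE)

c(dispersion = dispersion, z = z_stat, p_value = p_value)
```

A dispersion estimate well above one with a large z statistic is the pattern reported for the PRM in this study.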
4.2.2. Test for the zero trip count
The mean percentage of zero trip counts predicted by NBRM is obtained as the mean, over all
households, of the negative binomial probability at zero, with the theta value of the
NBRM as the dispersion (size) parameter and the expected trip rate of each household as the mean. The model predicts that 4.8%
(Appendix A, 3.6.1) of the trip counts would be zero, whereas 11.5% of the trip counts in the observed
sample are zero.
NBRM thus under-predicts the zero trip counts, so a zero inflated NBRM (ZINB) has been
developed with a binomial logit model for zero inflation, which uses all the variables as its
explanatory variables, along with the count model. For ZINB, the mean percentage of zero trip
counts predicted is computed as the expected value of the probability of a zero trip count (Equation
6.1). ZINB predicts 10.5% zero trip counts (Appendix A, 4.3.1), which is very close to the observed
value of 11.5%.
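The comparison of predicted and observed zero shares can be sketched as below. Everything here is simulated: `mu_hat` stands in for fitted household means, `theta` for the estimated dispersion, and about 8% of households are assumed to be always-zero, so the numbers are not those of the study.

```r
# Hypothetical fitted means for 1000 households and a hypothetical theta
set.seed(7)
mu_hat <- exp(rnorm(1000, mean = 1.9, sd = 0.4))
theta  <- 2.5

# Simulated "observed" counts with an extra always-zero group of about 8%
always_zero <- rbinom(1000, 1, 0.08) == 1
trips <- ifelse(always_zero, 0, rnbinom(1000, mu = mu_hat, size = theta))

# Household-level probability of zero under the NB model, then averaged
p_zero_nb <- dnbinom(0, mu = mu_hat, size = theta)
c(observed = mean(trips == 0), nb_predicted = mean(p_zero_nb))
```

When a structural always-zero group exists, the averaged negative binomial zero probability falls short of the observed share, which is the gap that motivates the zero inflation component.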


4.2.3. Goodness of Fit Tests


Two measures are mainly used for assessing the statistical significance and fit of the models developed.
4.2.3.1. Likelihood ratio test
The likelihood ratio test is used to verify the overall significance of the count models developed.
The likelihood ratio statistic, defined as the negative of twice the difference in log likelihood values
of the intercept-only model and the full model, as shown in Equation 7, is used for checking the overall
model significance. The statistic $X^2$ is $\chi^2$ distributed with degrees of freedom equal to the number of
parameter coefficients in the full model minus one.

$$X^2 = -2\,(LL(0) - LL(\beta)) \qquad (7)$$

4.2.3.2. McFadden pseudo $R^2$


The McFadden $R^2$ statistic is computed using Equation 8. This statistic lies between zero and one.
The closer the value is to one, the better the model fits.

$$R^2 = 1 - \frac{LL(\beta)}{LL(0)} \qquad (8)$$
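Both measures can be computed directly from the log likelihoods of the full and intercept-only models. The sketch below uses simulated data and a Poisson model for brevity; the variable names are illustrative.

```r
# Simulated data; x1, x2 and y are illustrative names
set.seed(123)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rpois(n, lambda = exp(1 + 0.4 * x1))

full <- glm(y ~ x1 + x2, family = poisson)
null <- glm(y ~ 1, family = poisson)   # intercept-only model

ll_full <- as.numeric(logLik(full))
ll_null <- as.numeric(logLik(null))

# Likelihood ratio statistic and its chi-square p-value
lr_stat <- -2 * (ll_null - ll_full)
lr_df   <- length(coef(full)) - 1
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden pseudo R-squared
mcfadden <- 1 - ll_full / ll_null

c(lr = lr_stat, df = lr_df, p = p_value, mcfadden = mcfadden)
```

For count models the McFadden value is typically far below the R-squared values familiar from linear regression, so low values such as those reported here are not unusual.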

4.2.3.3. Vuong's test for ZINB


To check whether ZINB is a statistically significant improvement over the standard NBRM, the Vuong test has
been conducted, and ZINB was found to be statistically much better than NBRM.

Vuong Z statistic | p-value
9.76 | < 2.22e-16

4.3. Check for Assumptions


4.3.1. Multicollinearity check
All three models developed are checked for multicollinearity among their explanatory variables
using Variance Inflation Factors (VIF). For factor variables with more than one degree of freedom, a
generalized VIF (GVIF) is computed and scaled as GVIF^(1/(2*df)). The computed GVIF values are shown in
Table 4. None of the values is beyond the allowable limit of 10, so there is no serious multicollinearity
among the explanatory variables considered.
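What a variance inflation factor measures can be sketched for continuous predictors. This is the classical VIF of a linear model computed on simulated data; the variable names mirror the paper's only for readability, and the values produced by a VIF routine for a fitted GLM can differ because they are based on the estimated coefficient covariance.

```r
# Simulated continuous predictors, not the CHTS data
set.seed(99)
n <- 1000
persons_count <- rpois(n, 2.5) + 1
license_count <- rbinom(n, size = persons_count, prob = 0.75)  # tied to household size
vehicle_count <- rpois(n, 0.5 + 0.6 * license_count)
X <- cbind(persons_count, license_count, vehicle_count)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the others
vif_by_regression <- sapply(seq_len(ncol(X)), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
})
names(vif_by_regression) <- colnames(X)

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_by_inverse <- diag(solve(cor(X)))

round(rbind(vif_by_regression, vif_by_inverse), 3)
```

Predictors that are largely determined by the others, such as license count and household size here, show the highest values.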
4.3.2. Exogeneity of Explanatory variables and error term
The main assumption of NBRM is that the error terms are uncorrelated with the explanatory
variables. This assumption is checked by plotting the continuous independent variables against the
residuals, as shown in Figure 2.
Table 4. GVIF values for the count models developed

Variable | Df | GVIF for PRM | GVIF for NBRM | GVIF for ZINB
new_income |  | 1.11 | 1.11 | 1.42
new_residence |  | 1.07 | 1.07 | 1.14
dow |  | 1.01 | 1.01 | 1.19
transit_use |  | 1.06 | 1.06 | 2.49
persons_count |  | 2.05 | 2.01 | 5.06
student_count |  | 1.87 | 1.79 | 2.24
worker_count |  | 1.56 | 1.63 | 3.14
vehicle_count |  | 1.40 | 1.39 | 3.89
bike_count |  | 1.19 | 1.19 | 1.69
new_age_head |  | 1.12 | 1.12 | 2.65
gender_head |  | 1.04 | 1.04 | 1.47
edu_head |  | 1.05 | 1.04 | 1.46
emp_status_head |  | 1.37 | 1.41 | 1.74
toll_pass |  | 1.08 | 1.08 | 1.92
license_count |  |  |  | 4.94

Figure 2. Residuals vs Explanatory variables for Exogeneity constraint (a) NBRM, (b) ZINB


The residual plots with respect to the independent variables clearly show the heterogeneity
present in the data. These graphs also hint that the error terms and the explanatory variables
depend on each other, since the residuals show a decreasing pattern at higher values of the
explanatory variables in both NBRM and ZINB. The presence of endogeneity will make the coefficient
estimates biased.

5. Results and Discussion


The incidence rate ratios (IRR) of the coefficients of the count models and the odds ratios of the coefficients of the
zero inflation model can be used for making inferences and comparisons. The exponentiated
coefficients for the NBRM and ZINB are presented in Table 6.
Table 5. Coefficients of the parameters for the PRM, NBRM and ZINB

Exp. Variable | PRM | NBRM | ZINB Count | ZINB Zero
(Intercept) | 1.1778 | 1.0854 | 1.1407 | -3.7470
new_income2 | 0.0797** | 0.0888 . | 0.0378 | -0.4983*
new_income3 | 0.1570*** | 0.1692** | 0.1255** | -0.4273
new_income4 | 0.0729* | 0.0716 | 0.0648 | -0.0004
new_residence2 | -0.0385 | -0.0343 | -0.0416 | -0.0804
new_residence3 | 0.0543* | 0.0428 | 0.0423 | -0.1093
dow2 | 0.0533 | 0.0040 | -0.0424 | -0.7034 .
dow3 | 0.1078*** | 0.0870 | 0.0651 | -0.2522
dow4 | 0.0994** | 0.1014 | 0.0824 | -0.2251
dow5 | 0.1438*** | 0.0994 | 0.0798 | -0.2165
dow6 | -0.0214 | -0.0218 | 0.0113 | 0.3367
dow7 | -0.1045** | -0.1029 | -0.0465 | 0.5552*
transit_use2 | -0.1617*** | -0.1676*** | -0.1506*** | 0.2144
persons_count | 0.1804*** | 0.2269*** | 0.2364*** | 0.3068*
student_count | 0.0416** | 0.0294 | 0.0240 | -0.2064
worker_count | 0.0685*** | 0.0660* | 0.0459 . | -0.1766
vehicle_count | 0.0204 . | 0.0259 | 0.0076 | -0.0757
bike_count | 0.0328*** | 0.0315** | 0.0229* | -0.1133
new_age_head2 | 0.0180 | 0.0097 | 0.0877 | 1.5604
new_age_head3 | -0.0821 | -0.0673 | 0.0446 | 1.9330 .
gender_head2 | 0.0077 | 0.0160 | 0.0124 | -0.0962
edu_head2 | -0.0326 | -0.0674 | 0.0191 | 0.9036 .
edu_head3 | 0.0753 | 0.0595 | 0.1258 | 0.7274
edu_head4 | 0.0969 | 0.0848 | 0.1566 . | 0.8127
edu_head5 | 0.1962*** | 0.1761 | 0.2152** | 0.5164
edu_head6 | 0.1654** | 0.1411 | 0.2155* | 0.8038
emp_status_head2 | 0.0171 | 0.0044 | 0.0183 | 0.1395
toll_pass2 | -0.0275 | -0.0254 | -0.0521 . | -0.2975
license_count |  |  | -0.0058 | -0.3911*

Likelihood ratio test [-2(LL(0) - LL(β))]: PRM 2596 (df = 27); NBRM 568 (df = 27); ZINB 810.8 (df = 31)
McFadden pseudo R2: PRM 0.16; NBRM 0.05; ZINB 0.069
*** - 0.1% significance; ** - 1% significance; * - 5% significance; . - 10% significance

Table 6. Incidence Rate Ratios of the variables considered in NBRM and ZINB

Variable | NBRM | ZINB Count_model | ZINB Zero_inflation
(Intercept) | 2.961*** | 3.129*** | 0.024***
new_income2 | 1.093 . | 1.039 | 0.608*
new_income3 | 1.184** | 1.134** | 0.652
new_income4 | 1.074 | 1.067 | 1.000
new_residence2 | 0.966 | 0.959 | 0.923
new_residence3 | 1.044 | 1.043 | 0.896
dow2 | 1.004 | 0.959 | 0.495 .
dow3 | 1.091 | 1.067 | 0.777
dow4 | 1.107 | 1.086 | 0.798
dow5 | 1.105 | 1.083 | 0.805
dow6 | 0.978 | 1.011 | 1.400
dow7 | 0.902 | 0.955 | 1.742*
transit_use2 | 0.846*** | 0.860*** | 1.239
persons_count | 1.255*** | 1.267*** | 1.359
student_count | 1.030 | 1.024 | 0.813*
worker_count | 1.068* | 1.047 . | 0.838
vehicle_count | 1.026 | 1.008 | 0.927
bike_count | 1.032** | 1.023* | 0.893 .
new_age_head2 | 1.010 | 1.092 | 4.761
new_age_head3 | 0.935 | 1.046 | 6.910 .
gender_head2 | 1.016 | 1.012 | 0.908
edu_head2 | 0.935 | 1.019 | 2.468 .
edu_head3 | 1.061 | 1.134 | 2.070
edu_head4 | 1.089 | 1.169 . | 2.254
edu_head5 | 1.193 . | 1.240** | 1.676
edu_head6 | 1.152 | 1.240* | 2.234
emp_status_head2 | 1.004 | 1.019 | 1.150
toll_pass2 | 0.975 | 0.949 . | 0.743
license_count |  | 0.994 | 0.676*

*** - 0.1% significance; ** - 1% significance; * - 5% significance; . - 10% significance

Since ZINB predicts the proportion of zeros more accurately, we use ZINB for making
inferences.
The exponentiated coefficients of the zero inflation model are interpreted as odds ratios. The baseline odds of
being among the households with zero trip count are 0.024. For factor variables, the odds ratios are
interpreted as follows: a household in income level 2 has about a 40% decrease in the odds of producing zero trips
compared to a household in income category 1, holding the other variables constant.
For continuous variables, such as the number of students in the household, each unit increase reduces the odds
of producing a zero trip count by approximately 19%. The significant variables in
the zero inflation process are license count, student count and income level 2. Each unit increase
in the license count decreases the odds of a zero trip count by 32%, controlling for all other variables.
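The step from a logit coefficient to an odds ratio, and from odds to a probability, can be sketched with two of the zero inflation coefficients reported in Table 5. The baseline refers to a household with every factor at its reference level and every count variable at zero, so the probabilities below are illustrative rather than typical.

```r
# Zero-inflation coefficients reported in Table 5 (logit scale)
b_intercept <- -3.7470
b_income2   <- -0.4983

# Exponentiating gives the baseline odds and the odds ratio
baseline_odds <- exp(b_intercept)
or_income2    <- exp(b_income2)

# Percentage change in odds for income level 2 relative to level 1
pct_change_odds <- (or_income2 - 1) * 100

# Odds convert to probability as odds / (1 + odds); plogis() does the same from the logit
p_baseline <- baseline_odds / (1 + baseline_odds)
p_income2  <- plogis(b_intercept + b_income2)

round(c(baseline_odds = baseline_odds, or_income2 = or_income2,
        pct_change_odds = pct_change_odds,
        p_baseline = p_baseline, p_income2 = p_income2), 4)
```

A percentage change in odds is not the same as a percentage change in probability, although the two are close when the baseline probability is small, as it is here.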
The count model of ZINB indicates that a household in income level 2 ($50k-$100k annual
income) makes 3.9% more trips than a household in income level 1. A household falling in income
levels 3 and 4 produces 13.4% and 6.7% more trips, respectively, than a household in income level 1.
The IRR values for day of the week indicate how the number of trips produced on the other days
compares with the number of trips produced on a Monday. For example, a
given household produces 4% fewer trips on Tuesday (dow2) and 8.6% more trips on Thursday
(dow4) relative to the number of trips made on Monday. Similar inferences can be made for the other
week days as well.
A household whose head does not use transit makes 14% fewer trips
than a household whose head does use transit. For each unit increment in the household size, the
number of trips produced is multiplied by 1.267. Similarly, for each unit increment in the
student count of a household, the trips per day are multiplied by 1.024. For each unit
increment of employees in the household, the household's trip rate is multiplied by 1.047. Each
additional vehicle in a household multiplies its trip rate by 1.008. Similarly, each additional
bike in the household multiplies the trip rate by 1.023.
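The conversion from count model coefficients to incidence rate ratios and percentage changes can be sketched with three of the ZINB count coefficients reported in Table 5.

```r
# ZINB count-model coefficients reported in Table 5 (log scale)
b <- c(persons_count = 0.2364, transit_use2 = -0.1506, new_income3 = 0.1255)

# Incidence rate ratio and the implied percentage change in expected trips
irr <- exp(b)
pct_change <- (irr - 1) * 100

# Effects are multiplicative: two additional persons scale trips by IRR^2, not 2 * IRR
two_more_persons <- unname(irr["persons_count"]^2)

round(irr, 3)
round(pct_change, 1)
round(two_more_persons, 3)
```

Because the model is log-linear, effects of several changes combine by multiplying the ratios rather than by adding the percentages.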
Households whose head is in the age categories of 30-59 years and >60 years make
9.2% and 4.6% more trips, respectively, than households whose head is younger than 30 years. A household with
a female head produces 1.2% more trips than its counterpart with a male head. Households
whose head has a higher degree make more trips compared to the base category. For
example, households whose head holds either an undergraduate or a graduate degree produce 24% more trips
than households whose head has a grade 12 or lower education. A counterintuitive result
is that households whose head is not employed make 1.9% more trips than their
counterparts. Households whose head does not have a toll pass make 5% fewer trips than
households whose head has a toll pass. Each additional license multiplies the trips
by 0.994, which is very close to one, so the license count hardly alters the trip rate.

6. Conclusion
Three different count models have been developed for estimating the number of trips produced per
day at the household level for Orange County. It was found that the sample data violates the equidispersion criterion of the Poisson regression model, and that NBRM under-predicts the number of
zero trip counts. A ZINB with a binary logit model for zero inflation was therefore developed, in which
household size, number of employees, income category, bike count, education level of the
household head and possession of a toll pass are found to be significant variables in the count model.
Household income category, day of the week, number of students in the household,
number of bikes in the household, age and education level of the household head and number of people
possessing licenses are determined to be significant variables for the zero inflation model.
The study tried to identify the preliminary significant variables for household level
trip production, and it is limited in scope in terms of variable selection and diagnostic checks of
the models. In future, the model can be used to understand the effects of various factors influencing
the household trip generation process.


7. References
1. Jang, T. Y. Count Data Models for Trip Generation. J. Transp. Eng. 131, 444-450 (2005).
2. Chang, J. S., Jung, D., Kim, J. & Kang, T. Comparative analysis of trip generation models: results using home-based work trips in the Seoul metropolitan area. Transp. Lett. 6, 78-88 (2014).
3. Huntsinger, L. F., Rouphail, N. M. & Bloomfield, P. Trip Generation Models Using Cumulative Logistic Regression. J. Urban Plan. Dev. 139, 176-184 (2013).
4. Williams, R. Models for Count Outcomes. 1-25 (2015). at <https://www3.nd.edu/~rwilliam/xsoc73994/CountModels.pdf>
5. Cameron, A. C. & Trivedi, P. K. Regression-based tests for overdispersion in the Poisson model. J. Econom. 46, 347-364 (1990).
6. http://data.princeton.edu/R/glms.html; date accessed: 03/10/2016
7. http://www.ats.ucla.edu/stat/r/dae/zinbreg.htm; date accessed: 03/05/2016
8. Simon P. Washington, Matthew G. Karlaftis, Fred Mannering, Statistical and Econometric Methods for Transportation Data Analysis, CRC Press, 283-300 (2011)
9. Colin Cameron, Pravin K. Trivedi, Regression Analysis of Count Data, Cambridge University Press, 80-88
10. Joseph M. Hilbe, Negative Binomial Regression, Cambridge University Press, 77-137 (2011)
11. Kelly Black, http://www.cyclismo.org/tutorial/R/index.html, date accessed: 02/05/2016
12. Kazuki Yoshida, Models for excess zeros using pscl package (Hurdle and zero-inflated regression models) and their interpretations, https://rpubs.com/kaz_yos/pscl-2, date accessed: 03/09/2016


8. Appendix A
# Converting variables to factors
totdata$dow <- as.factor(totdata$dow)
totdata$new_income <- as.factor(totdata$new_income)
totdata$transit_use <- as.factor(totdata$transit_use)
totdata$new_residence <- as.factor(totdata$new_residence)
totdata$new_age_head <- as.factor(totdata$new_age_head)
totdata$gender_head <- as.factor(totdata$gender_head)
totdata$edu_head <- as.factor(totdata$edu_head)
totdata$emp_status_head <- as.factor(totdata$emp_status_head)
totdata$toll_pass <- as.factor(totdata$toll_pass)
# 1.1. Descriptive summary statistics using the 'pastecs' package
attach(totdata)  # the calls below refer to columns by bare name
boxplot(trip_count)
# removing the outliers observed from boxplot [points beyond Q3 + 1.5*IQR]
trimdata <- subset(totdata, trip_count < 26)
detach(totdata); attach(trimdata)
# 1.2. Histogram of trip count
hist(trip_count, main = "Histogram of Trips per day")
t <- table(trip_count); plot(t)
library(pastecs)
write.csv(stat.desc(trimdata), "trim_summary_stats1.csv")
# 1.3. Correlation check among the count variables
prelim <- data.frame(trip_count, persons_count, worker_count, student_count,
                     bike_count, vehicle_count, license_count)
r <- cor(prelim)
plot(prelim)
# 2. Poisson regression model
library(sandwich); library(msm); library(ggplot2); library(AER)
psntrip <- glm(trip_count ~ new_income + new_residence + dow + transit_use +
                 persons_count + student_count + worker_count + vehicle_count +
                 bike_count + new_age_head + gender_head + edu_head +
                 emp_status_head + toll_pass,
               family = "poisson", data = trimdata)
summary(psntrip)
write.csv(coef(summary(psntrip)), "PRMcoef.csv")
# 2.1. Goodness of Fit tests
psntrip0 <- update(psntrip, .~1)
# 2.1.1. likelihood ratio test
llratio_p <- -2*(logLik(psntrip0)-logLik(psntrip))
p1 <- pchisq(llratio_p, df = 27,lower.tail = FALSE)

# 2.1.2 McFadden R2
library(pscl) # Library to compute various pseudo R2 values
psn_pr2 <- pR2(psntrip)
msfR2 <- psn_pr2[["McFadden"]]
# 2.2. Cameron and Trivedi overdispersion test
dispersiontest(psntrip)
# 2.3.1. Check for Multi collinearity
write.csv(vif(psntrip),'gvifvalues_psn.csv')
# 3. NBRM
library(foreign); library(MASS)
nbrm <- glm.nb(trip_count ~ new_income + new_residence + dow + transit_use +
                 persons_count + student_count + worker_count + vehicle_count +
                 bike_count + new_age_head + gender_head + edu_head +
                 emp_status_head + toll_pass,
               data = trimdata)
summary(nbrm)
nb_coef <- coef(summary(nbrm))
write.csv(nb_coef, "NBRMCoef.csv")
# distribution of the fitted values of the NBRM defined above
plot(table(round(nbrm$fitted.values, 1)))
# exponentiated coefficient table; only the Estimate column is meaningful after exp()
exp_nb_coef <- exp(nb_coef)
write.csv(exp_nb_coef, "exp_NBRMCoef.csv")
# 3.1 Model Goodness-of-Fit Measures
# 3.1.1.likelihood ratio test
library(pscl) # Library to compute various pseudo R2 values
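nbrm0 <- glm.nb(trip_count ~ 1, data = trimdata)  # intercept-only model
summary(nbrm0)
anova(nbrm, nbrm0)
# 3.1.2. critical chi-square value at 5% (27 slope coefficients, as reported in Table 5)
qchisq(0.95, 27)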
# 3.1.3.McFadden R2
p2 <-pR2(nbrm)
p2[["McFadden"]]
# 3.2. Diagnostics (Dfbetas, dffit, covarianceratio, cooks distance, hat matrix value)
imnbrm <- influence.measures(nbrm)
attributes(imnbrm)
colnames(imnbrm$infmat)
View(imnbrm$infmat)
# 3.3.Check for Multicollinearity
vif(nbrm)
write.csv(vif(nbrm),'gvifvalues_nbrm.csv')
# 3.3.1 Check for exogeneity of explanatory variables.
# plot(nbrm,which = 1, main = "NBRM")

# 3.4. Check for endogeneity


nbfit <- nbrm$fitted.values
nbres <- nbrm$residuals
plot(log(nbfit),nbres, main = "NBRM",ylab = "Residuals", xlab = "Log Fitted Values")
par(mfrow = c(2,2))
plot(persons_count,nbres)
plot(student_count,nbres)
plot(vehicle_count,nbres)
plot(bike_count,nbres)
par(mfrow = c(1,1))
# Writing down coefficients of the model to csv file
write.csv(coef(summary(nbrm)),"nbrmcoef.csv")
# 3.5. profiling of likelihood function to calculate the confidence intervals for coefficients
(est <- cbind(Estimate = coef(nbrm),confint(nbrm)))
exp(est)
#3.6. Prediction using NBRM
# 3.6.1 Predicting the # zeros by NBRM
zobs <- trip_count == 0
munb <- exp(predict(nbrm))
nbtheta <- nbrm$theta
znb <- dnbinom(0, mu = munb, size = nbtheta) # Probability of zero trips predicted by NBRM for each household
c(obs=mean(zobs), nb=mean(znb))
# 3.6.2 Predicting the trips with NBRM
pred_nbrm <- predict(nbrm,type = "response")
# 4.1. Zero Inflated Poisson Model
library(boot)
# zeroinfl() comes from the 'pscl' package loaded in step 2.1.2
# zipt <- zeroinfl(trip_count ~ new_income + new_residence + dow + transit_use +
#                    persons_count + student_count + worker_count + license_count +
#                    vehicle_count + bike_count |
#                    new_income + new_residence + dow + transit_use + persons_count +
#                    student_count + worker_count + license_count + vehicle_count +
#                    bike_count, data = trimdata)
# summary(zipt)
# 4.2. Zero Inflated Negative Binomial
zinbt <- zeroinfl(trip_count ~ new_income + new_residence + dow + transit_use +
                    persons_count + student_count + worker_count + vehicle_count +
                    bike_count + new_age_head + gender_head + edu_head +
                    emp_status_head + toll_pass + license_count,
                  data = trimdata, dist = "negbin")
summary(zinbt)
#4.2.2. coefficients of the estimates
zinb_coef <- coef(zinbt)
zinb_coef <- matrix(zinb_coef,ncol = 2)
colnames(zinb_coef) <- c("Count_model","Zero_inflation_model")
zinb_coef <- as.data.frame(zinb_coef)
write.csv(zinb_coef,"coef_zinb.csv")

#4.2.2.1. Exponentiated coefficients


exp_zinb_coef <- exp(zinb_coef)
write.csv(exp_zinb_coef,"exp_coef_zinb.csv")
# 4.2.3 Overall model statistical significance
zinbt0 <- update(zinbt, . ~1)
summary(zinbt0)
# log likelihood ratio test
llratio_zinb <- - 2*(logLik(zinbt0)-logLik(zinbt))
pchisq(llratio_zinb ,df = 26,lower.tail = FALSE)
# 4.2.4 McFaddens R2
mcfr2_zinb <- (1-(logLik(zinbt)/logLik(zinbt0)))

# 4.2.5. Vuong test for checking zero inflated models with normal models
vuong(zinbt,nbrm)
# 4.2.6 ZINB multi collinearity check
write.csv(vif(zinbt),'gvifvalues_zinb.csv')
zinblogy <- log(zinbt$fitted.values)
#4.2.6.1.Independence of error term
# plot(zinbt$residuals, zinbt$fitted.values, main = "ZINB", xlab = "log predicted values", ylab = "Residuals")
zifit <- zinbt$fitted.values
zires <- zinbt$residuals
plot(log(zifit),zires, main = "ZINB",ylab = "Residuals", xlab = "Log Fitted Values")
par(mfrow = c(2,2))
plot(persons_count,zires)
plot(student_count,zires)
plot(vehicle_count,zires)
plot(bike_count,zires)
par(mfrow = c(1,1))
# 4.3 prediction using ZINB
# 4.3.1. Mean zero trips predicted
przinb <- predict(zinbt,type = "zero")
muzinb <- predict(zinbt,type = "count")
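# Equation 6.1 as written: exp(-mu) is the Poisson probability of zero.
# For the negative binomial count process the corresponding term would be
# dnbinom(0, mu = muzinb, size = zinbt$theta).
pred_zinb <- przinb + (1-przinb)*exp(-muzinb)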
mean(pred_zinb)
# 4.3.2 prediction of trip count by ZINB
pred_zinb1 <- predict(zinbt,type = "response")
pred_zinb_prob <- predict(zinbt,type = "prob")
