Transportation Research Part D: S.D. Oduro, Q.P. Ha, H. Duc

Transportation Research Part D 49 (2016) 188–202
Vehicular emissions prediction with CART-BMARS hybrid models
S.D. Oduro a, Q.P. Ha a,*, H. Duc b
a Faculty of Engineering & IT, University of Technology Sydney, Australia
b Office of Environment & Heritage, Sydney, Australia
Abstract

Vehicular emission models play a key role in the development of reliable air quality modeling systems. To minimize uncertainties associated with these models, it is essential to match the high-resolution requirements of emission models with up-to-date information. However, these models are usually based on average trip speed, not on environmental parameters such as ambient temperature or on vehicle motion characteristics such as speed, acceleration, load and power. This degrades their predictive performance. In this paper, we propose to use the non-parametric Classification and Regression Trees (CART), the Boosting Multivariate Adaptive Regression Splines (BMARS) algorithm and a combination of them in hybrid models to improve the accuracy of vehicular emission prediction using on-board measurements and chassis dynamometer testing. The experimental comparison of the proposed CART-BMARS hybrid model with the BMARS and artificial neural network (ANN) algorithms demonstrates its effectiveness and efficiency in estimating vehicular emissions.

Keywords: Vehicular emissions; On-board emission measurement; Chassis dynamometer testing; CART-BMARS; ANNs

© 2016 Elsevier Ltd. All rights reserved.
1. Introduction
Poor air quality has become a serious problem in recent years in many cities and their surrounding areas due to increasing population, motor vehicles and industries. To be environmentally sustainable, efforts have been made to improve energy efficiency and to reduce air pollutant emissions on both the generation and consumption sides (Azzi et al., 2015). Among air pollutants coming from all sources, anthropogenic emissions have been the main concern in air-quality modeling and control. The problem is exacerbated as the world demand for transport is projected to increase by 45% by the year 2030 (IEA, 2009), alongside steady growth in the vehicular population in urban areas. This will increase the number of motor vehicles and consequently the emissions impact. As vehicular emissions are produced at ground level, they directly harm the receptor population (Elkafoury et al., 2015).
The transport sector is growing quickly and provides convenient and quick access to any geographical location. However, it also brings disadvantages such as noise, congestion and pollutant emissions, including carbon monoxide (CO), nitrogen oxides (NOX), total volatile hydrocarbon (THC) and carbon dioxide (CO2), which are primarily responsible for global warming (Tong et al., 2014; Sonawane et al., 2012). The amount of CO2 emitted per distance traveled is directly proportional to fuel consumption, with every liter of gasoline burned releasing about 2.4 kg of CO2 (Goel and Guttikunda, 2015). The problem of vehicular emissions becomes more severe when the traffic flow is congested or interrupted, especially when delays and disruptions occur frequently. These phenomena are regularly observed at traffic intersections, junctions and signalized roadways, where traffic-related characteristics combined with road and vehicle conditions contribute to the level of emissions.

* Corresponding author. E-mail address: quangha@eng.uts.edu.au (Q.P. Ha).
http://dx.doi.org/10.1016/j.trd.2016.09.012
1361-9209/© 2016 Elsevier Ltd. All rights reserved.

Many research initiatives have been undertaken to model and predict the complexity of vehicle emissions in order to control transport air pollution (Holmes and Morawska, 2006). However, the mechanisms by which these emissions affect the atmosphere and degrade urban air quality are not completely understood. Consequently, comprehensive and accurate models of vehicle emissions are essential to safeguard urban air quality, to recognize potential changes in the climate, and to justify imposing new regulations. It is vital to increase the ability of policy-makers to reach sound and reasonable decisions about vehicle emissions and air quality in order to maintain environmental sustainability.
Air quality models are indispensable tools for assessing the impact of air pollutants on human health and urban development. The most critical part of assessment studies is to know the present as well as future air quality levels. In this paper, we aim to improve the prediction accuracy of emissions modeling based on data collected from chassis dynamometer and on-board measurement systems. Dynamometer testing is one of the three typical vehicle tailpipe emission measurement methods, alongside on-board measurement and remote sensing; emissions are measured under laboratory conditions during a driving cycle that simulates vehicle road operations (Frey and Kim, 2009). Real-world on-board emissions measurement is widely recognized as a desirable approach for quantifying emissions from vehicles, since data are collected under real-world conditions at any location travelled by the vehicle (Pandian et al., 2009). Using on-board measurements, variability in traffic emissions resulting from changes in roadway characteristics, vehicle location and operation mode, driver, or other factors can be represented and analyzed more reliably than with the other methods (Boroujeni and Frey, 2014). This is because measurements are obtained during real-world driving, which eliminates the concern about non-representativeness that is often an issue with dynamometer testing, and at any location, which eliminates the siting restrictions inherent in remote sensing. Although the on-board measuring technique appears more promising, any emissions modeling approach still needs effective statistical techniques to improve the prediction accuracy of emission factors.
To adequately model traffic emissions, the Multivariate Adaptive Regression Splines (MARS) technique has proven to be promising (Oduro et al., 2015). However, influential factors such as vehicle speed, acceleration, load, power and ambient temperature have not been fully considered therein. To enhance the prediction performance by taking these factors into account, in this paper we integrate the Classification and Regression Trees (CART) technique with Boosting Multivariate Adaptive Regression Splines (BMARS): CART provides a regression tree that helps predict the continuous dependent variables of the BMARS regression model (Li et al., 2010). Our purpose is to obtain highly accurate estimates from emission models built on the dynamometer testing and on-board measurement data. The effectiveness of the proposed approach is then assessed by dividing the data into two parts, one for building the model (learning) and the other for validating it (testing).
Among machine learning methods, artificial neural networks (ANNs), in particular multilayer feedforward networks trained with the back-propagation algorithm, have been widely applied in recent decades to environmental modeling (Elbayoumi et al., 2015), with good performance reported for various vehicular emissions models (Nagendra and Khare, 2006; Najafi et al., 2009; Ghobadian et al., 2009) and for the prediction of air pollution profiles in a region (Kurt and Oktay, 2010; Ha et al., 2015). Therefore, it is worth comparing the results from the CART-BMARS hybrid model developed in this paper with those obtained by using the BMARS, MARS and ANN techniques.
The paper is organized as follows. After the introduction, Section 2 presents the development of the CART, MARS, BMARS and ANN algorithms. Section 3 describes the vehicle information and the data collection procedure. Section 4 presents the results obtained with all the mentioned methods and discusses their advantages. Finally, Section 5 concludes the paper.
2. Vehicular emissions models
The proposed model for vehicular emissions modeling is based on a combination of CART and MARS techniques coupled
with a boosting algorithm to improve the learning performance.
2.1. CART modeling
The Classification and Regression Trees (CART) technique is a non-parametric approach that forms classification or regression trees depending on whether the dependent variable is categorical or numerical (Breiman et al., 1984). CART begins with the root node at the top of the tree, which contains the entire data for the training run (Yap et al., 2011). A node in the CART model is either a terminal node, i.e. a node without children, or a non-terminal node, i.e. a node with children (Chen, 2011). The algorithm builds a binary decision tree from the main splitters found by CART. Here, to take into account not only speed but also acceleration, load, power and ambient temperature in our vehicular emissions model, a regression tree is obtained from the CART analysis, as shown typically in Fig. 1. Cells that meet the condition within a node go to the left side, while the remaining cells go to the right side.
The initial set of observations is divided into groups at the terminal nodes, or leaves, of the tree. The goal is to find a tree that allows for a good distribution of data with the lowest possible relative error of prediction. Each branch of the tree ends with one or two terminal nodes, and each observation falls into exactly one terminal node, defined by a unique set of rules (Tayyebi and Pijanowski, 2014). CART initially builds an overgrown model so that stopping rules do not prevent it from extracting the correct patterns in the data during the training run, which prevents under-fitting. Subsequently, the model is pruned back by penalizing model complexity and removing those splits that do not improve the accuracy significantly, which prevents over-fitting. The tree structure represents a series of splits for different predictors, where predictor variables in the emission data are organized hierarchically, i.e. levels in the tree are representative of the variables' levels of significance. In CART, splits are found by search algorithms that classify data into binary or multiple classes (Breiman et al., 1984) by checking all unique values across the range of data values for different predictors (Ayoubloo et al., 2011).

Fig. 1. Regression tree from CART analysis.

The CART algorithm calculates the probability $P(k)$ of the emission variables in the root node of the tree using relative frequencies in the entire learning data, $P(k) = N_k / N$, $k = 1, 2, \ldots, K$, where $N_k$ is the number of cells corresponding to emission variable $k$ from the entire data $N$ (Loh, 2010). Let $P(k, t)$ denote the probability of emission variable $k$ and $N_k(t)$ be the number of cells in node $t$ belonging to class $k$, then

$$P(k, t) = P(k)\,\frac{N_k(t)}{N_k}. \tag{1}$$

Now let $P(k \mid t)$ denote the conditional probability that the CART algorithm classifies correctly the emission variables and $P(t) = \sum_k P(k, t)$, then

$$P(k \mid t) = \frac{P(k, t)}{P(t)}. \tag{2}$$

In this paper, to measure the inequality among values of emission variables, we use the Gini index as a node impurity function. The splitting rule for each unique value in the predictors is applied to find the best split of fragment data (Breiman et al., 1984) from a uniform cost, i.e. the misclassification cost is equal for all classes:

$$d(t) = \sum_{k=1}^{K} \sum_{j=1}^{K-1} P(j \mid t)\,P(k \mid t) = \frac{1}{2}\left(1 - \sum_{k=1}^{K} P^2(k \mid t)\right), \tag{3}$$

or a non-uniform cost:

$$d(t) = \sum_{k=1}^{K} \sum_{j=1}^{K-1} P(j \mid t)\,P(k \mid t)\,C(j \mid k), \tag{4}$$

where $C(j \mid k)$ represents the cost of misclassifying a cell that belongs to emission variable $k$ into emission variable $j$.
To get the best split in node $t$, we look for the one that minimizes the node impurity function, or the misclassification cost $d(t)$, in the children of node $t$ (Loh, 2010). To make each subset more homogeneous than the previous node, the following gain function compares the distribution of data before and after splitting:

$$\Delta d(s, t) = d(t) - P_L\,d(t_L) - P_R\,d(t_R), \tag{5}$$

where $P_L$ and $P_R$ are the proportions of cells going to the left node $t_L$ and the right node $t_R$, respectively. The gain function (5) can be used to determine the goodness of a split, e.g. split $s$ for node $t$ (Paulsen et al., 2011): the splitting value adopted at node $t$ is the one that maximizes this gain, i.e. minimizes the diversity remaining after the split. All the predictor data set records are assigned to one of the terminal nodes, which represent the particular class or subset of emissions variables. The training data together with this node information are supplied for MARS modeling.
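
As an illustration of the impurity function (3) and the gain function (5), the following minimal Python sketch scores candidate thresholds on a single predictor. The function names, the toy speed values and the two-level emission classes are assumptions made for this example only; they are not taken from the study's data or software.

```python
from collections import Counter


def gini(labels):
    """Node impurity d(t) = 0.5 * (1 - sum_k P(k|t)^2), following Eq. (3)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 0.5 * (1.0 - sum((c / n) ** 2 for c in Counter(labels).values()))


def split_gain(x, labels, threshold):
    """Gain of Eq. (5): d(t) - P_L d(t_L) - P_R d(t_R) for the rule x <= threshold."""
    left = [k for xi, k in zip(x, labels) if xi <= threshold]
    right = [k for xi, k in zip(x, labels) if xi > threshold]
    p_left = len(left) / len(labels)
    p_right = len(right) / len(labels)
    return gini(labels) - p_left * gini(left) - p_right * gini(right)


def best_split(x, labels):
    """Try midpoints between consecutive distinct values; return (threshold, gain)."""
    values = sorted(set(x))
    candidates = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    scored = ((c, split_gain(x, labels, c)) for c in candidates)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy node: speed (km/h) with a two-level emission class per cell.
speed = [10.0, 20.0, 30.0, 40.0]
emission_class = ["low", "low", "high", "high"]
threshold, gain = best_split(speed, emission_class)
```

In this toy node the split at 25 km/h separates the two classes completely, so the gain equals the parent impurity. A full CART implementation would repeat the search over all predictors and recurse into the children.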
2.2. MARS modeling
The Multivariate Adaptive Regression Splines (MARS) technique, also for non-parametric regression, uses a series of basis functions to model complex (such as non-linear) relationships (Friedman, 1991). Its main purpose is to predict the values of a continuous dependent variable, $y$ $(n \times 1)$, from a set of $p$ independent explanatory variables, $X$ $(n \times p)$, which in our case are the emissions factors mentioned above. The MARS model can be represented as:

$$y = f(X) + e, \tag{6}$$

where $f$ is a weighted sum of basis functions that depend on $X$ and $e$ is an error vector of dimension $n \times 1$. MARS provides greater flexibility to explore the non-linear relationship between a response variable and predictor variables by fitting the data with piecewise linear regression functions. It does not require a priori assumptions about the underlying functional relationship between dependent and independent variables. Instead, this relation is uncovered from a set of coefficients and piecewise polynomial basis functions (BFs) of degree $q$ that are derived entirely from the regression data $(y, X)$. The MARS regression model is constructed by fitting basis functions to distinct intervals of the independent variables. Generally, piecewise polynomials, also called splines, have pieces smoothly connected together. The joining points of the polynomials are called knots, nodes or breakdown points, denoted by $t$. For a spline of degree $q$, each segment is a polynomial function. MARS uses two-sided truncated power functions as spline basis functions, described by the following equations (Abdel-Ati and Haleem, 2011):

$$[-(x - t)]_+^q = \begin{cases} (t - x)^q, & \text{if } x < t, \\ 0, & \text{otherwise,} \end{cases} \tag{7}$$

$$[+(x - t)]_+^q = \begin{cases} (x - t)^q, & \text{if } x > t, \\ 0, & \text{otherwise,} \end{cases} \tag{8}$$

where $q \geq 0$ is the power to which the splines are raised and which determines the degree of smoothness of the resultant function estimate. As an example, a pair of splines for $q = 1$ at the knot $t = 0.5$ is presented in Fig. 2.
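
The truncated power functions (7) and (8) can be written in a few lines of Python. This is only a sketch: the function names and the evaluation grid are assumptions chosen to mirror the example with q = 1 and knot t = 0.5.

```python
def hinge_left(x, t, q=1):
    """Left truncated power function of Eq. (7): (t - x)^q if x < t, else 0."""
    return (t - x) ** q if x < t else 0.0


def hinge_right(x, t, q=1):
    """Right truncated power function of Eq. (8): (x - t)^q if x > t, else 0."""
    return (x - t) ** q if x > t else 0.0


knot = 0.5
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
left_values = [hinge_left(x, knot) for x in grid]    # [0.5, 0.25, 0.0, 0.0, 0.0]
right_values = [hinge_right(x, knot) for x in grid]  # [0.0, 0.0, 0.0, 0.25, 0.5]
```

Each function is zero on one side of the knot and grows linearly on the other, which is what lets a weighted sum of such pairs follow a piecewise linear trend.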
Fig. 2. A graphical representation of a spline basis function: the pair $(t - x)_+$ and $(x - t)_+$ plotted against X, with the knot at t = 0.5.

The two-sided truncated functions of the independent variables are basis functions that describe the underlying phenomena. The global MARS model is defined as (Put et al., 2004):

$$\hat{y} = b_0 + \sum_{m=1}^{M} b_m h_m(X), \tag{9}$$

where $\hat{y}$ is the predicted response; $b_0$ is the coefficient of the constant basis function; $h_m(X)$ is the $m$th basis function, which can be a single spline function or an interaction of two (or more) spline functions; $b_m$ is the coefficient of the $m$th basis function; and $M$ is the number of basis functions included in the MARS model. To fit a MARS model, three main steps are applied. In the first step, i.e. the constructive phase, basis functions are added to the model using a forward stepwise procedure. The predictor and the knot location that contribute significantly to the model accuracy are selected. In this step, interactions are also introduced to examine whether they could improve the model fit. In the second step, to improve the prediction, the redundant basis functions are removed one at a time using a backward stepwise procedure. MARS utilizes the generalized cross-validation (GCV) criterion, a measure of lack-of-fit, to find the overall best model from a sequence of fitted models (Hastie et al., 2009):

$$\mathrm{GCV} = \frac{\frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{f}(X_i)\right)^2}{\left[1 - \frac{\tilde{C}(M)}{N}\right]^2}, \tag{10}$$

where $\left[1 - \tilde{C}(M)/N\right]^2$ is a complexity function, and $\tilde{C}(M)$ is defined as $\tilde{C}(M) = C(M) + dM$, in which $C(M)$ is the number of parameters to be fit and the smoothing parameter $d$ is a user-defined cost for each basis function optimization. The higher the cost $d$ is, the more basis functions will be eliminated (Put et al., 2004). Finally, the third step selects the optimal MARS model, based on an evaluation of the prediction characteristics of the different fitted MARS models.
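
To make the role of the penalty d in Eq. (10) concrete, the sketch below evaluates the GCV criterion for given predictions. The function name, argument names and the default d = 3 are assumptions for illustration; the text does not state the value of d used in the study.

```python
def gcv(y, y_hat, n_basis, n_params, d=3.0):
    """GCV of Eq. (10), with effective parameter count C~(M) = C(M) + d * M.

    y, y_hat : observed and fitted responses (equal-length sequences)
    n_basis  : M, the number of basis functions in the model
    n_params : C(M), the number of parameters being fit
    d        : cost per basis function optimization (assumed default)
    """
    n = len(y)
    mean_sq_error = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / n
    c_tilde = n_params + d * n_basis
    if c_tilde >= n:
        raise ValueError("effective number of parameters must be smaller than N")
    return mean_sq_error / (1.0 - c_tilde / n) ** 2
```

For the same residuals, a larger d inflates the criterion for models with many basis functions, which is why the backward pass then removes more of them.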
2.3. BMARS modeling
Boosting has been widely used for predictive modeling as it offers an efficient, simple technique for additive modeling (Breiman et al., 1984) that can convert weak learners into a potentially strong learner, i.e. a classifier well correlated with the true classification. Boosting builds a succession of models iteratively, training on the examples and re-weighting them at each iteration. Finally, each model, or weak classifier, is weighted according to its performance and combined with the other weak classifiers using voting (for classification) or averaging (for regression) to create a final model. The main advantages of boosting are that it can use any classification algorithm as a base learner, reduces model instability and has high predictive performance. The boosting algorithm, based on a multiplicative weight-update technique (Freund and Schapire, 1997), has been successfully applied to several benchmark machine learning problems using supervised learning. Basically, a minimization algorithm such as least squares (LS) can be used to boost a strong learner from multiple weak learners, whereby a new classifier is created from the results of the previously generated classifiers by focusing on misclassified samples. The algorithm increases the weights of incorrectly classified samples and decreases the weights of those classified correctly. The LS boosting problem can be formulated as follows. Let $x$ denote the feature vector and $y$ the response. Given some samples $\{y_i, x_i\}_{i=1}^{N}$, the goal is to obtain an estimate or approximation $\hat{F}(x)$ of the function $F^*(x)$ mapping $x$ to $y$ that minimizes some specified loss function $L(y, F(x))$ over the joint distribution of all $(y, x)$ values:

$$F^* = \arg\min_{F} L(y, F(x)), \tag{11}$$

where the squared error loss is given by $L(y, F) = (y - F)^2 / 2$ and the pseudo-response is obtained as

$$\tilde{y}_i = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} = y_i - F_{m-1}(x_i), \quad i = 1, 2, \ldots, N. \tag{12}$$

Thus, for $i = 1, 2, \ldots, N$, the minimization of the data-based estimate of the expected loss gives

$$(\rho_m, a_m) = \arg\min_{a, \rho} \sum_{i=1}^{N} \left[\tilde{y}_i - \rho\,h(x_i; a)\right]^2, \tag{13}$$

where $h(x; a)$ is the weak learner with basis functions $\{h(x; a_m)\}_{m=1}^{M}$ and $\rho_m$ is the corresponding multiplier. The LS-boost algorithm (Jerome, 2001), tuned to the problem of vehicular emissions prediction, has been described in Oduro et al. (2015), using Boosting Multivariate Adaptive Regression Splines (BMARS).
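
The LS-boost recursion of Eqs. (12) and (13) can be sketched as follows. To keep the example self-contained, the weak learner here is a one-split regression stump on a single predictor rather than a MARS model, and a shrinkage factor is added; both choices, along with all function names, are assumptions of this sketch and not the BMARS implementation of Oduro et al. (2015).

```python
def fit_stump(x, r):
    """Least-squares stump on one predictor: returns (threshold, left_mean, right_mean)."""
    best = None
    values = sorted(set(x))
    for a, b in zip(values, values[1:]):
        thr = (a + b) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= thr]
        right = [ri for xi, ri in zip(x, r) if xi > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((ri - left_mean) ** 2 for ri in left)
               + sum((ri - right_mean) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, xi):
    thr, left_mean, right_mean = stump
    return left_mean if xi <= thr else right_mean


def ls_boost(x, y, n_rounds=50, shrinkage=0.5):
    """Start from the mean, then repeatedly fit the weak learner to the residuals."""
    f0 = sum(y) / len(y)
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]  # pseudo-response, Eq. (12)
        stump = fit_stump(x, residual)                   # least-squares fit, Eq. (13)
        stumps.append(stump)
        pred = [pi + shrinkage * stump_predict(stump, xi) for xi, pi in zip(x, pred)]
    return f0, shrinkage, stumps


def boost_predict(model, xi):
    f0, shrinkage, stumps = model
    return f0 + shrinkage * sum(stump_predict(s, xi) for s in stumps)
```

Because the stump already returns least-squares leaf means, the optimal multiplier in Eq. (13) is 1 for this weak learner; the shrinkage factor simply takes a smaller step than that optimum at each round.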
2.4. CART-BMARS hybrid modeling
Here, we propose to incorporate a regression tree from CART modeling into the BMARS algorithm (Oduro et al., 2015) to improve the performance of air pollution prediction. CART builds the regression trees for predicting continuous dependent variables in the regression model. In this hybrid technique, the data sets are first passed through CART to generate node information. The training data together with the node information are then supplied for training the BMARS. A rationale for the integration of CART and BMARS is that, from a practical point of view, CART can handle missing values in the database by substituting surrogate splitters, which are back-up rules that closely mimic the action of the primary splitting rule. This feature is not shared by many artificial intelligence approaches (Li et al., 2010). A flowchart of the proposed CART-BMARS hybrid model is shown in Fig. 3, wherein boosting is adopted to improve estimation performance by adjusting the weights of the classifiers.

Fig. 3. Flowchart of CART-BMARS hybrid model.
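
The data flow of the hybrid model, in which CART node information is attached to the training data before the boosted learner is trained, can be sketched as below. The sketch uses a single regression split as a stand-in for a full, pruned CART tree, and the function names and toy cells are hypothetical; how the study encodes node information for BMARS is not specified in the text.

```python
def sse(values):
    """Sum of squared deviations from the mean (regression node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_regression_split(X, y):
    """Return (feature_index, threshold) minimizing the children's total SSE."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for a, b in zip(values, values[1:]):
            thr = (a + b) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, thr)
    return best[1], best[2]


def add_node_info(X, split):
    """Append the terminal-node id (0 = left, 1 = right) as an extra predictor."""
    j, thr = split
    return [row + [0 if row[j] <= thr else 1] for row in X]


# Hypothetical cells: [speed (km/h), acceleration (m/s^2)] and an emission rate.
X = [[10.0, 0.1], [20.0, 0.9], [60.0, 0.2], [70.0, 0.8]]
y = [1.0, 1.2, 3.0, 3.1]
split = best_regression_split(X, y)
X_augmented = add_node_info(X, split)  # rows now carry their node id
```

The augmented rows would then be passed to the boosted regression learner in place of the raw predictors.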
2.5. ANN modeling
To verify the merits of the CART-BMARS hybrid model, an ANN-based model is constructed so that their predictive capabilities can be compared. In the present study, the multilayer feedforward neural network is trained by the back-propagation network (BPN) algorithm to fit the training pairs. Here, the Levenberg-Marquardt algorithm with a log-sigmoid activation function is used to update the network weights due to its high generalization capability. It is important to determine the optimum network architecture to achieve reliable results. This task still relies on trial and error, even though several heuristic relations have been proposed to determine the appropriate number of neurons in the hidden layer (Rafiai and Moosavi, 2012). The architecture of the proposed ANNs is presented in Fig. 4, wherein the inputs are vehicle speed, acceleration, load, power and ambient temperature, and the outputs include NOX, CO, CO2 and THC. Here, the Root Mean Square Error (RMSE) is chosen as the loss function to be minimized, as its convexity, symmetry and differentiability make it well suited to optimization.
Fig. 4. Proposed ANN architecture.
3. Vehicular emission information and statistical evaluation
This section presents the collection of vehicular emissions data and preparation of datasets as well as the statistical eval-
uation of the output parameters.
3.1. Data collection
Vehicular emissions data used in this study were supplied by the Roads and Maritime Services (RMS) of New South Wales (NSW), Department of Vehicle Emission, Compliance Technology Operation. Ten (10) vehicles were used for the test, with emissions data collected on a second-by-second basis. The test vehicles were Toyota, Mitsubishi, Holden, Ford and Nissan models from 2009 and 2010, with engine displacements ranging from 1.8 L to 2.0 L. Emissions from these vehicles were recorded in two ways: using a chassis dynamometer set-up and using a Horiba On-Board Measurement System (OBS-2000). Each drive cycle lasted for 556 s, yielding 556 data points.
The laboratory dynamometer set-up was coupled to drive lines connected directly to the wheel hubs of the vehicle via a set of rollers upon which the vehicle was placed. These rollers can be adjusted to simulate driving resistance. During testing, the vehicle was tied down so that it remained stationary while a driver operated it according to a predetermined time-speed profile, with a given gear change pattern, displayed on a monitor. The vehicle was considered to be driven at the speed required at each stage of the driving cycle, since experienced drivers are able to closely match an established speed profile.
The same vehicles were also tested with the Horiba On-Board Measurement System (OBS-2000). The equipment was composed of two on-board gas analysers, a laptop computer equipped with data logger software, a power supply unit, a tailpipe attachment, and other accessories. The OBS-2000 collected the emission data via a global positioning system (GPS). Although the instrument measured other air pollutants, the focus of this paper is on CO, CO2, THC and NOX emissions. To log the correct values of the measured emissions and other required parameters, the software was configured with a set of values provided by Horiba Instruments, Inc. In addition, a delay in the logging, attributed to the time taken to convert the measured concentrations from analog to digital output, was also accounted for by Horiba with appropriate adjustments in the data analysis spreadsheets.
3.2. Preparation of training datasets
The same datasets used for the CART-BMARS hybrid model analysis are applied here for modeling and evaluating the prediction performance of the ANN-based model. Training a neural network architecture can be seen as a nonlinear optimization problem in which the task is to find the set of parameters, i.e. synaptic weights, such that the network output is as close as possible to the desired output. Notably, the 556 values in the experimental dataset were subject to a secondary emission correction by NSW RMS. Previous studies have shown that different ratios for training and testing data were required (Oduro et al., 2015). In the present study, 70% (390) of the total experimental data was randomly selected for training the neural network, 15% (83) for network cross-validation to avoid over-fitting, and the remaining 15% (83) for testing the performance of the trained network. The data were first normalized as

$$R_N = \frac{R_A - R_{\min}}{R_{\max} - R_{\min}}, \tag{14}$$

where $R_A$ is the actual value, $R_{\min}$ and $R_{\max}$ are the minimum and maximum values of $R$, and $R_N$ is the normalized value of $R$, which lies within the range from 0 to 1.
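
A minimal Python sketch of the normalization in Eq. (14) and of a random 70/15/15 partition is given below. The function names and the fixed random seed are assumptions for illustration; the validation and test sizes are rounded from 15% of the record count and the training set takes the remainder, which reproduces the 390/83/83 partition for 556 records.

```python
import random


def min_max_normalize(values):
    """Scale values to [0, 1] as in Eq. (14): (R_A - R_min) / (R_max - R_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series cannot be min-max normalized")
    return [(v - lo) / (hi - lo) for v in values]


def split_indices(n, val_frac=0.15, test_frac=0.15, seed=0):
    """Randomly partition range(n) into training, validation and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = round(val_frac * n)
    n_test = round(test_frac * n)
    n_train = n - n_val - n_test
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(556)
```

Note that the minimum and maximum in this sketch are taken over whatever series is passed in; whether the study computed them on the full dataset or on the training part only is not stated.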
3.3. Statistical evaluation of output parameters
After normalization, the data were randomized and the ANN was trained and tested against the experimental data of vehicular emissions. To evaluate the prediction performance of the proposed ANN model, we consider the R-squared, or coefficient of determination ($R^2$), as a validation criterion:

$$R^2 = 1 - \left(\frac{\sum_{i=1}^{N} (t_i - y_i)^2}{\sum_{i=1}^{N} y_i^2}\right). \tag{15}$$

The performance of the ANN-based predictions is evaluated by regression analysis of the predicted outputs and the target outputs. $R^2$ is used to assess the strength of this relationship; values closer to +1 indicate a stronger positive linear relationship. Discrepancies between the predicted outputs ($y$) and the target outputs ($t$) are judged by the root mean squared error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - t_i)^2}, \tag{16}$$

where $N$ is the number of data points used for validation, $t$ is the actual output and $y$ is the predicted output value.
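
Both validation criteria are straightforward to compute. The sketch below follows Eqs. (15) and (16) as written, so the denominator of R-squared is the sum of squared predicted outputs rather than the mean-centered sum of squares used in the more common definition; the function names are assumptions for this example.

```python
import math


def r_squared(targets, predictions):
    """R^2 as written in Eq. (15): 1 - sum (t_i - y_i)^2 / sum y_i^2."""
    numerator = sum((t - y) ** 2 for t, y in zip(targets, predictions))
    denominator = sum(y ** 2 for y in predictions)
    return 1.0 - numerator / denominator


def rmse(targets, predictions):
    """Root mean squared error of Eq. (16)."""
    n = len(targets)
    return math.sqrt(sum((y - t) ** 2 for t, y in zip(targets, predictions)) / n)


# Hypothetical target and predicted outputs.
t_values = [1.0, 2.0, 3.0]
y_values = [1.0, 2.0, 4.0]
r2_value = r_squared(t_values, y_values)   # 1 - 1/21
rmse_value = rmse(t_values, y_values)      # sqrt(1/3)
```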
Table 1
Comparison of HYBRID, BMARS, MARS and CART models (RMSE).

Model      HYBRID (RMSE)    BMARS (RMSE)     MARS (RMSE)      CART (RMSE)
NOX-OBS    1.001 × 10^-4    3.367 × 10^-4    4.243 × 10^-4    4.244 × 10^-4
NOX-DYN    2.478 × 10^-4    3.411 × 10^-4    4.652 × 10^-4    5.276 × 10^-4
CO-OBS     1.041 × 10^-4    2.945 × 10^-4    3.872 × 10^-4    4.145 × 10^-4
CO-DYN     2.143 × 10^-4    3.254 × 10^-4    5.276 × 10^-4    6.243 × 10^-4
CO2-OBS    1.214 × 10^-4    2.845 × 10^-4    3.992 × 10^-4    4.249 × 10^-4
CO2-DYN    2.478 × 10^-4    3.214 × 10^-4    4.652 × 10^-4    4.356 × 10^-4
THC-OBS    1.015 × 10^-4    2.946 × 10^-4    3.978 × 10^-4    3.284 × 10^-4
THC-DYN    2.784 × 10^-4    4.002 × 10^-4    5.115 × 10^-4    5.013 × 10^-4

Table 2
Comparison of HYBRID, BMARS, MARS and CART models (R2).

Model      HYBRID (R2)    BMARS (R2)    MARS (R2)    CART (R2)
NOX-OBS    0.951          0.739         0.624        0.624
NOX-DYN    0.818          0.706         0.593        0.504
CO-OBS     0.962          0.757         0.656        0.608
CO-DYN     0.881          0.723         0.504        0.476
CO2-OBS    0.906          0.774         0.642        0.623
CO2-DYN    0.854          0.656         0.608        0.608
THC-OBS    0.907          0.757         0.672        0.656
THC-DYN    0.809          0.608         0.518        0.534

Table 3
ANNs architecture and prediction accuracy.

Model           Hidden layer neuron number    RMSE
ANN-NOX-OBS     11                            2.244 × 10^-4
ANN-NOX-DYN     12                            3.921 × 10^-4
ANN-CO-OBS      8                             2.278 × 10^-4
ANN-CO-DYN      10                            3.651 × 10^-4
ANN-CO2-OBS     9                             2.375 × 10^-4
ANN-CO2-DYN     13                            2.952 × 10^-4
ANN-THC-OBS     7                             2.662 × 10^-4
ANN-THC-DYN     14                            2.978 × 10^-4

4. Results and discussions
Based on the experimental data from the on-board system (OBS) and dynamometer (DYN) testing, the proposed CART-BMARS hybrid model is implemented for emissions prediction, with speed, acceleration, power, load and ambient temperature as predictors and the emissions of the air pollutants NOX, CO, CO2 and THC as responses. The results of the hybrid model computed using all the available data for on-board and dynamometer testing appear to have similar interpretations. All five predictor variables play crucial roles in predicting the vehicle emissions with all the mentioned models. However, an analysis of variance (ANOVA) from the MARS model indicated that the two most important variables were load and speed, with acceleration, power and ambient temperature having less effect on emissions. To ensure a fair comparison, the same training and test datasets were used for each model. The LS-boost algorithm for regression with squared error loss is implemented in this paper. In the following, the predictive performance of the models is compared from two perspectives. First, three learning techniques, namely CART, MARS and BMARS, are compared with the hybrid one to identify the best performance for emissions prediction. Then, a comprehensive analysis is conducted to demonstrate the predictive performance of the proposed CART-BMARS model against the ANN one.

Fig. 5. Regression plots corresponding to the designed ANNs model: NOX and CO.

Fig. 6. Regression plots corresponding to the designed ANNs model: CO2 and THC.

4.1. Comparison of CART-BMARS with BMARS, MARS and CART
Tables 1 and 2 list respectively the RMSE and R-squared of the proposed CART-BMARS hybrid model in comparison with the CART, MARS and BMARS models for the pollutants NOX, CO, CO2 and THC, for both on-board system (OBS) and dynamometer testing (DYN). These tables show that the hybrid and BMARS models have smaller RMSE than the CART and MARS ones. For NOX-OBS, for example, R2 rises from 0.624 (CART and MARS) to 0.739 (BMARS) and 0.951 (hybrid). Combining CART and MARS with boosting techniques in the hybrid model makes the algorithm relatively insensitive to the number of iterations, and its R2 and RMSE values remain within a relatively stable range. Because both algorithms are forward additive, they can adaptively search for optimal results during the modeling process; this makes the model stable throughout the iteration range. Here, boosting turns the weaker classifier in the emission predicted variables into a stronger classifier (Breiman et al., 1984) and then builds many complementary classifiers in order to find a highly accurate classifier on the training set by combining the weak hypotheses. The outcome of the proposed model is a higher R2 value and a lower RMSE. By contrast, the selection of the generalized cross-validation (GCV) criterion in the CART and MARS models tends to be sensitive and can overfit the model. These relatively poor selections degrade the results of the CART or MARS models, which are therefore unstable and less accurate (Hastie et al., 2009). This suggests the robustness of the hybrid algorithm and its capability of improving the accuracy of the MARS model in vehicular emissions prediction.
In general, boosting can improve the prediction accuracy of a particular learning model. As can be seen, the CART-BMARS
hybrid and BMARS perform better than CART or MARS alone. The analysis shows that combining CART with BMARS to form a
hybrid model is superior to a non-boosting strategy or a strategy without a regression tree. Furthermore, this shows that
the hybrid model strategy used in this paper has improved the prediction accuracy and generalization ability of the
emission models. From Tables 1 and 2, it can be observed that the hybrid model outperformed BMARS in terms of goodness of
fit and prediction accuracy for all pollutants (NOX, CO, CO2 and THC) using both DYN and OBS systems. The hybrid model
also takes advantage of the capability of BMARS to handle non-linearity in the data. However, since boosting is sensitive
to noise in the data, its performance may be affected in the presence of noise, depending on the boosting method used.
While MARS has a potential problem of over-fitting and is hence subject to computational complexity, the CART-BMARS
hybrid model can effectively handle the corresponding noise interval and the missing values within the database at every
step. Thus, the proposed technique is able to adequately address the problem of boosting's sensitivity to data noise and
ultimately improve the prediction performance of the emissions model. This explains why its prediction performance is
better than that of BMARS.
Fig. 7. Comparison of Hybrid and ANNs models with experimental data for NOX and CO emissions.
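The way boosting improves on a single weak learner can be illustrated with a minimal sketch. It assumes a squared-error loss, for which each round fits a weak learner to the current residuals. One-split regression stumps are used here in place of the MARS base learner purely to keep the example short, and the data, the number of rounds and the learning rate are all hypothetical rather than taken from the paper.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # the largest value cannot split the data
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, xi):
    """Evaluate a fitted stump at a single input value."""
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost(x, y, rounds=50, rate=0.1):
    """Return the sum of squared errors before boosting and after each round."""
    prediction = [sum(y) / len(y)] * len(y)  # start from the mean response
    history = [sum((yi - pi) ** 2 for yi, pi in zip(y, prediction))]
    for _ in range(rounds):
        # For squared loss, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        prediction = [pi + rate * stump_predict(stump, xi)
                      for xi, pi in zip(x, prediction)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)))
    return history


# Synthetic data for illustration only (not the paper's measurements).
x = [float(i) for i in range(20)]
y = [abs(xi - 10.0) for xi in x]
sse_history = boost(x, y)
```

In this sketch the training error never increases from one round to the next, which mirrors the general observation that an ensemble of weak hypotheses can fit better than any single one. It does not model the noise handling attributed to the CART stage.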
4.2. Comparison of CART-BMARS with ANNs
In this work, we use the BPN, which is adequate for predicting vehicular emissions. The accuracy of neural network
prediction generally depends on the number of hidden layers and the number of neurons in each layer. To find a suitable
architecture, a number of neural network architectures were tested by varying the number of hidden neurons from 2 to 15,
with 5 inputs (speed, acceleration, load, power and ambient temperature) and 1 output for each pollutant (NOX, CO, CO2
and THC).
Fig. 8. Comparison of Hybrid and ANNs with experimental data for CO2 and THC emissions. Panels: (a) CO2 on-board, (b) CO2
dynamometer, (c) THC on-board, (d) THC dynamometer.
The results are listed in Table 3, showing the air pollutant models, the corresponding number of hidden neurons, and the
accuracy in terms of RMSE. The remaining data, set aside for testing and validation, were then used to check the
predictive capabilities of the trained model. Comparisons of the ANN outputs with the target values of the experimental
data are shown in the regression plots of Figs. 5 and 6 for all four air pollutants with the on-board and dynamometer
testing systems. As observed from the graphs, the high correlation between the predicted and the experimental values
demonstrates that the model succeeded in predicting major emissions from vehicles. The regression plots yield high R2
values close to 1 for both on-board and dynamometer tests, indicating that ANNs are a useful method for the prediction of
vehicular emissions.
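For readers who wish to reproduce this kind of comparison, the two indices can be computed as in the sketch below. It assumes the common definitions of RMSE and of R2 as one minus the ratio of residual to total sum of squares; the definitions given in Section 3.3 take precedence if they differ. The observed and predicted values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical emission readings, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

error = rmse(observed, predicted)
fit = r_squared(observed, predicted)
```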
The performance of the proposed CART-BMARS and ANNs models was compared with the experimental dataset, as shown in
Figs. 7 and 8 for NOX, CO, CO2 and THC. As observed, the results show excellent performance indices for both the
CART-BMARS hybrid and ANNs models and are in agreement with those of other researchers using the same methodology for
different applications (Manzie et al., 2007; Sorek-Hamer et al., 2013).
The models' performance and efficiency features are listed in Table 4 to compare the merits of CART-BMARS and ANNs.
The results therein, together with those in Tables 1 and 3, confirm the advantages of the CART-BMARS hybrid model for
all pollutant emissions considered. In addition, it is faster than ANNs, as its processing (CPU) time is smaller in all
cases, as shown in Table 4. Another distinctive aspect is that it can identify the contribution of each variable to the
emissions prediction through the analysis of variance (ANOVA) decomposition. The model output is expressed in a more
interpretable way, in the form of segmented functions defined on different intervals, and may provide additional
information about how changes in the input data can affect the output.
Table 4
Performance comparison between Hybrid and ANNs models.
Model              Processing time (s)    R2
Hybrid-NOX-OBS     6                      0.951
Hybrid-NOX-DYN     8                      0.817
ANN-NOX-OBS        22                     0.814
ANN-NOX-DYN        23                     0.659
Hybrid-CO-OBS      7                      0.962
Hybrid-CO-DYN      9                      0.879
ANN-CO-OBS         19                     0.859
ANN-CO-DYN         20                     0.701
Hybrid-CO2-OBS     7                      0.904
Hybrid-CO2-DYN     9                      0.854
ANN-CO2-OBS        21                     0.843
ANN-CO2-DYN        24                     0.783
Hybrid-THC-OBS     7                      0.906
Hybrid-THC-DYN     10                     0.808
ANN-THC-OBS        23                     0.803
ANN-THC-DYN        25                     0.767
Fig. 9. Performance evaluation of Hybrid and ANN models.
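The comparison in Table 4 can be condensed with a short script. The sketch below uses the processing times and R2 values transcribed from Table 4; the dictionary layout and function names are illustrative choices, not part of the original study.

```python
# (processing time in seconds, R2) per model, pollutant and test system,
# transcribed from Table 4.
TABLE_4 = {
    ("Hybrid", "NOX", "OBS"): (6, 0.951), ("Hybrid", "NOX", "DYN"): (8, 0.817),
    ("ANN", "NOX", "OBS"): (22, 0.814), ("ANN", "NOX", "DYN"): (23, 0.659),
    ("Hybrid", "CO", "OBS"): (7, 0.962), ("Hybrid", "CO", "DYN"): (9, 0.879),
    ("ANN", "CO", "OBS"): (19, 0.859), ("ANN", "CO", "DYN"): (20, 0.701),
    ("Hybrid", "CO2", "OBS"): (7, 0.904), ("Hybrid", "CO2", "DYN"): (9, 0.854),
    ("ANN", "CO2", "OBS"): (21, 0.843), ("ANN", "CO2", "DYN"): (24, 0.783),
    ("Hybrid", "THC", "OBS"): (7, 0.906), ("Hybrid", "THC", "DYN"): (10, 0.808),
    ("ANN", "THC", "OBS"): (23, 0.803), ("ANN", "THC", "DYN"): (25, 0.767),
}


def mean_time_and_r2(model):
    """Average processing time and R2 over all pollutants and test systems."""
    rows = [value for key, value in TABLE_4.items() if key[0] == model]
    mean_time = sum(time for time, _ in rows) / len(rows)
    mean_r2 = sum(r2 for _, r2 in rows) / len(rows)
    return mean_time, mean_r2


def hybrid_wins_everywhere():
    """Check that the hybrid is both faster and better fitting in every case."""
    cases = {(pollutant, system) for _, pollutant, system in TABLE_4}
    return all(
        TABLE_4[("Hybrid", p, s)][0] < TABLE_4[("ANN", p, s)][0]
        and TABLE_4[("Hybrid", p, s)][1] > TABLE_4[("ANN", p, s)][1]
        for p, s in cases
    )


hybrid_time, hybrid_r2 = mean_time_and_r2("Hybrid")
ann_time, ann_r2 = mean_time_and_r2("ANN")
```

Averaged over the eight cases, the hybrid takes about 7.9 s against about 22.1 s for the ANNs, with a mean R2 of about 0.885 against about 0.779.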
The effectiveness of the combination of CART and BMARS as a hybrid method for vehicular emissions can be explained as
follows: (i) the proposed hybrid model is computationally more efficient owing to its capability of dividing the
predictor space into multiple knots and then fitting a spline function between them, hence requiring less trial and
error than the ANNs model, and (ii) the CART-BMARS technique allows for effectively removing data noise and reducing the
sensitivity of boosting to noise in the emissions data, while the final number of basis functions can be determined from
a preset maximum value. The performance of the proposed method against the ANNs model is further evaluated in terms of
RMSE, as shown in Fig. 9. Therein, the hybrid model RMSE (in blue; see footnote 1) reduces gradually and gets closer to
the validation data (in green), unlike that of the ANNs model (in red), for both the on-board and dynamometer testing
systems. These results suggest that the CART-BMARS hybrid model can constitute a valuable alternative for predicting
vehicular emissions.
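The idea of placing knots in the predictor space and fitting piecewise-linear pieces between them can be sketched with the hinge functions commonly used as MARS basis functions. The single predictor, the knot at 40 and the coefficients below are hypothetical values chosen for illustration; they are not fitted values from this study.

```python
def hinge_right(x, knot):
    """Basis function that is zero up to the knot and linear above it."""
    return max(0.0, x - knot)


def hinge_left(x, knot):
    """Mirror basis function that is linear below the knot and zero above it."""
    return max(0.0, knot - x)


def emission_model(speed):
    """Toy additive model: intercept plus two weighted hinge functions.

    The intercept, knot location and coefficients are assumptions made
    for this example only.
    """
    knot = 40.0
    return 2.0 + 0.5 * hinge_right(speed, knot) + 0.25 * hinge_left(speed, knot)


at_knot = emission_model(40.0)
above_knot = emission_model(60.0)
below_knot = emission_model(20.0)
```

Each term is active on only one side of its knot, so the contribution of a predictor over a given interval can be read directly from the corresponding coefficient.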
5. Conclusion
In this paper, a CART-BMARS hybrid method has been proposed to estimate the nonlinear relationship between vehicular
pollutant emissions and predictor variables such as speed, acceleration, load, power and ambient temperature. The hybrid
model is implemented with piecewise-linear BFs, which effectively address the problem of nonlinearity and uncertainty in
the emissions data and improve the prediction accuracy of the model. The new hybrid model is developed to overcome the
shortcomings of MARS and BMARS models, improving the performance of the emissions model. The proposed hybrid algorithm is
then compared with a multilayer BPN trained and tested with the Levenberg-Marquardt optimization algorithm. It can be
observed that, among all techniques mentioned, the proposed CART-BMARS hybrid model exhibits excellent prediction
performance for all pollutant emissions using both on-board and dynamometer testing systems.
Acknowledgements
The data used for this study were provided by the Road and Maritime Service, Department of Vehicle Emission, Compliance
Technology & Compliance Operations, NSW Office of Environment & Heritage, and HORIBA, Australia.
References
Abdel-Ati, M., Haleem, K., 2011. Analyzing angle crashes at unsignalized intersections using machine learning techniques. Accid. Anal. Prev. 43, 461-470.
Ayoubloo, M.K., Azamathulla, H.M., Jabbari, E., Zanganeh, M., 2011. Predictive model-based for the critical submergence of horizontal intakes in open channel flows with different clearance bottoms using CART, ANN and linear regression approaches. Expert Syst. Appl. 38 (8), 10114-10123. Available: <http://www.sciencedirect.com/science/article/pii/S095741741100279X>.
Azzi, M., Duc, H., Ha, Q., 2015. Towards sustainable energy usage in the power generation and construction sectors: a case study of Australia. Automat. Constr. 59, 122-127.
Boroujeni, B.Y., Frey, H.C., 2014. Road grade quantification based on global positioning system data obtained from real-world vehicle fuel use and emissions measurements. Atmos. Environ. 85, 179-186. Available: <http://www.sciencedirect.com/science/article/pii/S1352231013009709>.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth International Group.
Chen, M.-Y., 2011. Predicting corporate financial distress based on integration of decision tree classification and logistic regression. Expert Syst. Appl. 38 (9), 11261-11272. Available: <http://www.sciencedirect.com/science/article/pii/S0957417411003976>.
Elbayoumi, M., Ramli, N.A., Yusof, N.F.F.M., 2015. Development and comparison of regression models and feedforward backpropagation neural network models to predict seasonal indoor PM2.5 concentrations in naturally ventilated schools. Atmos. Pollut. Res. 6 (6), 1013-1023. Available: <http://www.sciencedirect.com/science/article/pii/S1309104215000136>.
Elkafoury, A., Negm, A.M., Bady, M.F., Aly, M.H., 2015. Modeling vehicular CO emissions for time headway-based environmental traffic management system. Procedia Technol. 19, 341-348. Available: <http://www.sciencedirect.com/science/article/pii/S221201731500050X>.
Freund, Y., Schapire, R., 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119-139.
Frey, H.C., Kim, K., 2009. In-use measurement of the activity, fuel use, and emissions of eight cement mixer trucks operated on each of petroleum diesel and soy-based B20 biodiesel. Transp. Res. Part D: Transp. Environ. 14 (8), 585-592. Available: <http://www.sciencedirect.com/science/article/pii/S1361920909001084>.
Friedman, J.H., 1991. Multivariate adaptive regression splines. Ann. Stat. 19 (1), 1-144.
Ghobadian, B., Rahimi, H., Nikbakht, A., Najafi, G., Yusaf, T., 2009. Diesel engine performance and exhaust emission analysis using waste cooking biodiesel fuel with an artificial neural network. Renew. Energy 34 (4), 976-982. Available: <http://www.sciencedirect.com/science/article/pii/S0960148108003108>.
Goel, R., Guttikunda, S.K., 2015. Evolution of on-road vehicle exhaust emissions in Delhi. Atmos. Environ. 105, 78-90. Available: <http://www.sciencedirect.com/science/article/pii/S1352231015000680>.
Ha, Q., Wahid, H., Duc, H., Azzi, M., 2015. Enhanced radial basis function neural networks for ozone level estimation. Neurocomputing 155, 62-70. Available: <http://www.sciencedirect.com/science/article/pii/S0925231214017123>.
Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
Holmes, N., Morawska, L., 2006. A review of dispersion modelling and its application to the dispersion of particles: an overview of different dispersion models available. Atmos. Environ. 40 (30), 5902-5928. Available: <http://www.sciencedirect.com/science/article/pii/S1352231006006339>.
IEA, 2009. World Energy Outlook (2008). International Energy Agency, Paris.
Jerome, H.F., 2001. Greedy Function Approximation: A Gradient Boosting Machine. Institute of Mathematical Statistics: Prentice-Hall.
Kurt, A., Oktay, A.B., 2010. Forecasting air pollutant indicator levels with geographic models 3 days in advance using neural networks. Expert Syst. Appl. 37 (12), 7986-7992. Available: <http://www.sciencedirect.com/science/article/pii/S095741741000504X>.
Footnote 1: For interpretation of color in Fig. 9, the reader is referred to the web version of this article.
Li, H., Sun, J., Wu, J., 2010. Predicting business failure using classification and regression tree: an empirical comparison with popular classical statistical methods and top classification mining methods. Expert Syst. Appl. 37 (8), 5895-5904. Available: <http://www.sciencedirect.com/science/article/pii/S0957417410000552>.
Loh, W.-Y., 2010. Tree-structured classifiers. Wiley Interdiscipl. Rev.: Comput. Stat. 2 (3), 364-369. http://dx.doi.org/10.1002/wics.86.
Manzie, C., Watson, H., Halgamuge, S., 2007. Fuel economy improvements for urban driving: hybrid vs. intelligent vehicles. Transp. Res. Part C: Emerg. Technol. 15 (1), 1-16. Available: <http://www.sciencedirect.com/science/article/pii/S0968090X06000908>.
Nagendra, S.S., Khare, M., 2006. Artificial neural network approach for modelling nitrogen dioxide dispersion from vehicular exhaust emissions. Ecol. Model. 190 (1-2), 99-115. Available: <http://www.sciencedirect.com/science/article/pii/S0304380005002280>.
Najafi, G., Ghobadian, B., Tavakoli, T., Buttsworth, D., Yusaf, T., Faizollahnejad, M., 2009. Performance and exhaust emissions of a gasoline engine with ethanol blended gasoline fuels using artificial neural network. Appl. Energy 86 (5), 630-639. Available: <http://www.sciencedirect.com/science/article/pii/S0306261908002407>.
Oduro, S., Metia, S., Duc, H., Hong, G., Ha, Q., 2015. Multivariate adaptive regression splines models for vehicular emission prediction. Visualiz. Eng. 3 (1). http://dx.doi.org/10.1186/s40327-015-0024-4.
Oduro, S., Metia, S., Duc, H., Ha, Q., 2015. Predicting carbon monoxide emissions with multivariate adaptive regression splines MARS and artificial neural networks ANNs. In: The 32nd International Symposium on Automation and Robotics in Construction and Mining, June 2015, pp. 912-920.
Pandian, S., Gokhale, S., Ghoshal, A.K., 2009. Evaluating effects of traffic and vehicle characteristics on vehicular emissions near traffic intersections. Transp. Res. Part D: Transp. Environ. 14 (3), 180-196. Available: <http://www.sciencedirect.com/science/article/pii/S1361920908001521>.
Paulsen, P., Smulders, F., Tichy, A., Aydin, A., Hck, C., 2011. Application of classification and regression tree CART analysis on the microflora of minced meat for classification according to reg. EC 2073/2005. Meat Sci. 88 (3), 531-534. Available: <http://www.sciencedirect.com/science/article/pii/S0309174011000556>.
Put, R., Xu, Q., Massart, D., Heyden, Y.V., 2004. Multivariate adaptive regression splines MARS in chromatographic quantitative structure-retention relationship studies. J. Chromatogr. A 1055 (1-2), 11-19. Available: <http://www.sciencedirect.com/science/article/pii/S0021967304015614>.
Rafiai, H., Moosavi, M., 2012. An approximate ANN-based solution for convergence of lined circular tunnels in elasto-plastic rock masses with anisotropic stresses. Tunn. Undergr. Space Technol. 27 (1), 52-59. Available: <http://www.sciencedirect.com/science/article/pii/S0886779811000848>.
Sonawane, N.V., Patil, R.S., Sethi, V., 2012. Health benefit modelling and optimization of vehicular pollution control strategies. Atmos. Environ. 60, 193-201. Available: <http://www.sciencedirect.com/science/article/pii/S1352231012006449>.
Sorek-Hamer, M., Strawa, A., Chatfield, R., Esswein, R., Cohen, A., Broday, D., 2013. Improved retrieval of PM2.5 from satellite data products using non-linear methods. Environ. Pollut. 182, 417-423. Available: <http://www.sciencedirect.com/science/article/pii/S0269749113004247>.
Tayyebi, A., Pijanowski, B.C., 2014. Modeling multiple land use changes using ANN, CART and MARS: comparing tradeoffs in goodness of fit and explanatory power of data mining tools. Int. J. Appl. Earth Obs. Geoinf. 28, 102-116. Available: <http://www.sciencedirect.com/science/article/pii/S0303243413001554>.
Tong, Y., Wang, X., Zhai, J., Niu, X., Liu, L., 2014. Theoretical predictions and field measurements for potential natural ventilation in urban vehicular tunnels with roof openings. Build. Environ. 82, 450-458. Available: <http://www.sciencedirect.com/science/article/pii/S0360132314002959>.
Yap, B.W., Ong, S.H., Husain, N.H.M., 2011. Using data mining to improve assessment of credit worthiness via credit scoring models. Expert Syst. Appl. 38 (10), 13274-13283. Available: <http://www.sciencedirect.com/science/article/pii/S0957417411006749>.
