Decision Trees and Random Forests

About this module


Decision Trees are one of the most widely-used and popular classification techniques in Machine
Learning. Random forests belong to a class of Machine Learning methods called "ensemble"
methods, where many models (in this case Decision Trees) are generated from the training data
and fused together into an ensemble to make predictions.

By the end of this module, you will learn how to implement these two extremely powerful
Machine Learning models using Python and scikit-learn. For every classification model built
with scikit-learn, you should follow four main steps:

1. Building or instantiating the classification model (using either default, pre-defined or


optimised parameters),

2. Training the model,

3. Testing the model, and

4. Reporting metrics and evaluating the performance and generalisation ability of the
constructed model.

Validation techniques will be applied throughout these steps to avoid cases of overfitting (or
underfitting). Finally, you will learn how to optimise the hyperparameters of a model as a way of
boosting its overall performance.
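The following minimal sketch previews these four steps using a Decision Tree; it assumes X and y already hold the feature matrix and class labels used throughout this module:

PYTHON
# Sketch of the four steps (assumes X and y are already loaded)
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

# 1. Build/instantiate the model (here with a pre-defined parameter)
model = DecisionTreeClassifier(max_depth=3)

# 2. Train the model
model.fit(XTrain, yTrain)

# 3. Test the model
yPred = model.predict(XTest)

# 4. Report metrics
print(metrics.accuracy_score(yTest, yPred))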

Decision Trees
Decision Trees are one of the most widely-used and popular classification techniques in Machine
Learning due to their

Simplicity: not many parameters need to be tuned and there’s no need to normalise the data
before using them.

Scalability: the classification process requires fewer operations than other classification models
(such as KNN).

Interpretability: a decision tree is easy to visualise and interpret, and can provide valuable
insights about the data.

Efficiency: decision trees can handle both numerical and categorical data, as well as
missing values.


Decision Tree classifiers construct classification models in the form of a tree structure. A
decision tree progressively splits the training set into smaller subsets. Each node of the tree
represents a subset of the data.

Figure 1. Data partitioning and corresponding tree representation.

The final (decision) tree consists of three types of nodes:

A root node that has no incoming edges and zero or more outgoing edges.

Internal nodes, each of which has exactly one incoming edge, and two or more outgoing
edges.

Finally, leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing
edges. In a decision tree, each leaf node is assigned a class label.

Root and internal nodes contain feature test conditions to separate samples based on different
characteristics, and leaf nodes are linked to the final decision of the model. Once a new sample is
presented to the model, the test conditions are applied until a leaf node is reached, and the class
linked to that leaf node is reported as the result.
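To make this concrete, here is a toy sketch of the classification process; the node structure, features and thresholds below are entirely hypothetical, not taken from the course dataset:

PYTHON
# Toy sketch: routing a sample through hypothetical test conditions.
# Each internal node tests one feature against a threshold; leaves hold labels.
tree = {"feature": 0, "threshold": 5.0,
        "left":  {"label": "non-returning"},
        "right": {"feature": 1, "threshold": 2.5,
                  "left":  {"label": "returning"},
                  "right": {"label": "non-returning"}}}

def classify(node, sample):
    while "label" not in node:          # apply the test conditions...
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]                # ...until a leaf node is reached

print(classify(tree, [7.0, 2.0]))
> returning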

Basic ideas behind growing a tree


Each split in the data is made in order to optimise a splitting criterion (Gini
impurity, information gain, variance reduction; a Gini impurity sketch follows this list)
and to separate the class instances between the resulting regions.

Leaf nodes need to be as pure as possible, which means that the samples at a leaf
node should mostly belong to the same class.

Nodes at the top of the tree are typically impure, which means that the samples they
contain still represent both classes.

As a result, splits are chosen so that the overall impurity of the tree is minimised.
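To illustrate the impurity idea, here is a small sketch of the Gini impurity of a node, computed from the class labels of the samples it contains (a pure node scores 0; a 50/50 split of two classes scores 0.5):

PYTHON
# Sketch: Gini impurity of a node from the labels of its samples
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(counts.sum())    # class proportions at the node
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))
> 0.0
print(gini([0, 1, 0, 1]))
> 0.5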

Split the data into training and test sets


Before we start with the actual model building process, we need to ensure the generalisation
ability of our classifier (remember that generalisation is the capacity of a model to perform well
on data that has not been used for the training phase).

Training and testing a classification model on the same dataset is a methodological mistake: a
model that would just repeat the labels of the samples that it has just seen would overestimate
the score and would fail to predict anything useful on yet-unseen data, leading to poor
generalisation performance.

To use different datasets for training and testing, we need to split the online retail dataset into
two mutually exclusive datasets, the training set and the test set; this validation approach is
referred to as the holdout method and is depicted as follows:

Figure 2. Holdout approach (random split into two disjoint datasets, the train and test set).

In Python, you need to use the train_test_split() function available through


sklearn.cross_validation
(http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html). The function
takes two mandatory arguments, which are the matrix X holding the input data and the vector y
holding the targets (class labels).

PYTHON
# Split into training and test sets

XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1) (1)

1. The random_state argument specifies a value for the seed of the random generator. By
setting this seed to a particular value, each time the code is run, the split between train and
test datasets will be exactly the same. If this value is not specified, a different split will be
output each time since the random generator driving the split will be seeded by a pseudo-
random number.


The output of train_test_split consists of four arrays. XTrain and yTrain are the two arrays
you use to train your model. XTest and yTest are the two arrays that you use to evaluate your
model. By default, scikit-learn splits the data so that 25% of it is used for testing, but you can also
specify the proportion of data you want to use for training and testing.
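For instance, to hold out 30% of the data for testing instead of the default 25%, you could pass the test_size argument (a quick sketch):

PYTHON
# Hold out 30% of the data for testing instead of the default 25%
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.3, random_state=1)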

As previously, you can check the sizes of the different training and test sets by using the shape
attribute:

PYTHON
# Print the dimensionality of the individual splits

print("XTrain dimensions: ", XTrain.shape)


print("yTrain dimensions: ", yTrain.shape)
print("XTest dimensions: ", XTest.shape)
print("yTest dimensions: ", yTest.shape)

This results in the following output:

PYTHON
> XTrain dimensions: (1498, 10)
> yTrain dimensions: (1498, )
> XTest dimensions: (500, 10)
> yTest dimensions: (500, )

You can also investigate how the class labels are distributed within the yTest vector by using the
itemfreq function from module 2:

PYTHON
# Calculate the frequency of classes in yTest

yFreq = scipy.stats.itemfreq(yTest)
print(yFreq)
> [[ 0 59]
[ 1 441]] (1)

1. In this case, we can see that yTest includes 59 random samples of class 0 (non-returning
customers) and 441 random samples of class 1 (returning customers).

Building a Decision Tree


Decision Tree classifiers construct classification models in the form of a tree structure. A
decision tree progressively splits the training set into smaller subsets. Each node of the tree
represents a subset of the data. Once a new sample is presented, it is classified according to the
test condition generated for each node of the tree.

Try building a simple decision tree with 3 layers (See Decision Tree documentation
(http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for the
arguments that can be passed to the classifier).


PYTHON
# Building the classification model using a pre-defined parameter
dtc = DecisionTreeClassifier(max_depth=3) (1)

# Train the model


dtc.fit(XTrain, yTrain)

# Test the model


yPred = dtc.predict(XTest)

1. By setting the parameter max_depth to 3, we stipulate that the decision tree will have no more
than 3 links between the root node and any leaf.

Calculate validation metrics for your classifier


In a classification task, once you have created your predictive model, you will always need to
evaluate it. Evaluation functions help you to do this by reporting the performance of the model
through four main performance metrics:

precision,

sensitivity (or recall, true positive rate),

specificity (or true negative rate), and

overall accuracy.

As opposed to overall classification accuracy, the first three metrics are class-specific: they may
differ for the two classes, whereas the overall accuracy may remain the same. To understand
these metrics, it is useful to create a confusion matrix, which records all the true positive, true
negative, false positive and false negative values.

We can compute the confusion matrix for our classifier using the confusion_matrix function
in the metrics module.

PYTHON
# Get the confusion matrix for your classifier using metrics.confusion_matrix

mat = metrics.confusion_matrix(yTest, yPred) (1)


print (mat)

> [[ 32 27]
[ 19 422]] (2)

1. Compute the confusion matrix for our predictions. Remember that the test data contain
observations that are not in the training data.

2. The first value in the first row (32) is the number of True Positives (TP); the second value in
the first row (27) is the number of False Negatives (FN); the first value in the second row (19)
is the number of False Positives (FP), and the second value in the second row (422) is the
number of True Negatives (TN). We could represent this schematically as follows:

[[ TP FN ]
 [ FP TN ]]

Validation Metrics
Accuracy: Accuracy is the overall "correctness" of the model and is calculated as the
number of correctly classified observations divided by the total number of
observations. Accuracy is defined by

Accuracy = (tp + tn)/total

where tp and tn are the numbers of true positive and true negative predictions and
total is the total number of instances.

Precision (for a class): Precision is a measure of the accuracy for a specific class: it
reports the proportion of correct classifications for that class. It is defined by:

Precision = tp/(tp + fp)

where tp and fp are the numbers of true positive and false positive predictions for the
considered class, e.g. the ability to correctly classify a customer as being returning or non-
returning. tp + fp is the total number of elements labelled as belonging to the
considered class by the classifier.

Recall, aka. Sensitivity, True positive rate (for a class): Recall reports the ability of a
model to select instances of a certain class from a dataset, e.g. a classifier that has high
sensitivity with regards to the non-returning class will do well at correctly classifying
customers as being non-returning (although this may make it more likely to incorrectly
include more returning customers in this class). It is defined by:

Recall = Sensitivity = tp/(tp + fn)

where tp and fn are the numbers of true positive and false negative predictions for the
considered class. tp + fn is the total number of elements that actually belong to the
considered class.

Specificity, True negative rate (for a class): Specificity reports the ability of the
model to correctly exclude class non-members in a dataset from the class, e.g. a
classifier that has high specificity wrt the non-returning class will do well at correctly
excluding returning customers from the class (although this may make it more likely to
miss non-returning customers). It is defined by:

Specificity = tn/(fp + tn)

where tn and fp are the numbers of true negative and false positive predictions for the
considered class. fp + tn is the total number of elements that should not be included in
the class.

F1-score (for a class): This measures the accuracy of the model with respect to a
particular class, and is the harmonic mean of precision and recall. It is defined by:


F1 = 2 * (precision * recall)/(precision + recall)
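As a sanity check, all of these metrics can be computed by hand from the confusion matrix above; the sketch below uses the TP/FN/FP/TN values reported earlier, with class 0 as the positive class:

PYTHON
# Sketch: computing the validation metrics by hand from the confusion matrix
tp, fn, fp, tn = 32.0, 27.0, 19.0, 422.0

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)            # sensitivity, true positive rate
specificity = tn / (fp + tn)            # true negative rate
f1          = 2 * (precision * recall) / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2),
      round(specificity, 2), round(f1, 2))
> 0.91 0.63 0.54 0.96 0.58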

Because performance metrics are such an important step of model evaluation, scikit-learn offers
a wrapper around these functions, metrics.classification_report , to facilitate their
computation. It also offers the function metrics.accuracy_score to compute the overall
accuracy.

PYTHON
# Report the metrics using metrics.classification_report

print (metrics.classification_report(yTest, yPred))


print ("Overall Accuracy: ", round(metrics.accuracy_score(yTest, yPred), 2))

The performance summary and accuracy will be as follows:

PYTHON
precision recall f1-score support

0 0.63 0.54 0.58 59


1 0.94 0.96 0.95 441

avg / total 0.90 0.91 0.91 500

Overall Accuracy: 0.91

Boundary visualisation of Decision Trees


The border between two neighbouring regions of different classes is known as the decision
boundary. We have provided you with a pre-defined function in the visplots library called
dtDecisionPlot that allows you to visualise it. You can check the arguments passed to this
function by using the help command. In addition to the mandatory arguments, the function
visplots.dtDecisionPlot takes as optional arguments the ones from the
DecisionTreeClassifier function, so you can have a look at the documentation
(http://scikit-
learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
.

PYTHON
#### Check the arguments of the function
help(visplots.dtDecisionPlot)

# Plot the boundary of Decision Trees


visplots.dtDecisionPlot(XTrain, yTrain, XTest, yTest, header, max_depth = 3)


Figure 3. Decision Boundary of a Tree with max depth 3.

Boundary interpretation
The decision region for a decision tree is rectilinear ("stair-like" or "box-like" surfaces) with
segments parallel to the input axes since each test condition involves only a single
attribute. In this case, the boundary defines a decision region for each class.

The more splits are performed (i.e. the deeper the tree), the more detailed the model becomes.
Despite the power and simplicity of this algorithm, decision-tree learners have three main
drawbacks:

They can be very unstable as they are sensitive to small changes in the training data: a small
change can lead to a totally different tree (high variance).

They can easily overfit. Decision-tree learners can create over-complex trees that do not
generalise the data well.

Their decision boundary is non-smooth.

Bias-variance trade-off
One of the fundamental concepts of Machine Learning is that of overfitting and generalisation.

When a trained model performs extremely well on the training dataset but fails to predict new,
unseen data, the model is suffering from the effect of high variance. This situation is called
overfitting, leading to poor generalisation performance.


Similarly, a model might be too "simple", unable to capture the true relationship in the data that
we observe. These models would be said to have high bias; they do not fit the data very well,
which leads to a high generalisation error on new test data. This situation is called underfitting.

There is a fundamental trade-off between model complexity and the possibility of high bias or
high variance for all Machine Learning algorithms. Such an example is presented in Figure 4.
We can see from this picture that initially both the training and test error are quite high (hence
the accuracy will be low) as the model is too simplistic and thus unable to learn and predict the
data accurately (case of under-fitting). However, as the boundaries become more and more
complex, capturing noise in the data, the test error tends to increase while the training error
steadily decreases; this is a case of overfitting.

The main task during the optimisation of any Machine Learning model is to find the optimal
"sweet spot" where the model minimises simultaneously both the bias and the variance. One
very powerful tool at our disposal is validation. It is a common approach used to help us to
detect and avoid cases of over- and underfitting.
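As a sketch of how validation exposes this trade-off in practice, you could track training and test accuracy while the tree is allowed to grow deeper (this assumes the XTrain/XTest split created earlier); training accuracy keeps rising with depth, while test accuracy eventually plateaus or drops:

PYTHON
# Sketch: training vs test accuracy as model complexity (max_depth) grows
from sklearn.tree import DecisionTreeClassifier

for depth in [1, 2, 3, 5, 10, 20]:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=1)
    dt.fit(XTrain, yTrain)
    print(depth,
          round(dt.score(XTrain, yTrain), 2),   # training accuracy
          round(dt.score(XTest, yTest), 2))     # test accuracy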


Figure 4. Classification: over-fitting vs under-fitting.

Figure Interpretation


The green lines across the panel represent different models (Model 1-3) that have been
trained on a dataset (top row). The prediction errors are recorded below each panel. From
left to right, the models have increasing complexity. At first glance Model 3 seems perfect
with a prediction error of 0. However, when new data (the test data in the second row) are
submitted to this model, it soon becomes apparent that it no longer performs well. This is
referred to as overfitting. At the other end of the spectrum, Model 1 is too simple and does not
capture the relationship that is being classified; it performs poorly on both the training and test
data (underfitting). Model 2 strikes the right balance: it has a low prediction error and is
generalisable.

A thorough tutorial on misleading modelling, overfitting, cross-validation, and the bias-variance


trade-off can be found here
(http://online.cambridgecoding.com/notebooks/cca_admin/misleading-modelling-overfitting-crossvalidation-
and-the-biasvariance-tradeoff)
.

Ensemble models
Ensemble learning (or modelling) involves the combination of several accurate and diverse
models to solve a single prediction problem. It works by generating multiple models, which
learn and make predictions independently. Those predictions are then combined into a single
(mega) prediction that should be as good as or better than the prediction made by any single
classifier. An ensemble is itself a supervised learning model, as it can be trained and used to make
predictions.

There are two main families of ensemble methods:

Averaging methods: where several estimators are built independently and then their
predictions are averaged or combined by a voting scheme. Averaging methods attempt to
reduce the variance of the single base estimators. Examples include: Bagging methods and
Forests of randomised trees, among others.

Boosting methods: in this ensemble model, base estimators are built sequentially, the
motivation being to combine several weak models to produce a powerful ensemble. Boosting
methods attempt to reduce the bias of the combined estimator. Examples include AdaBoost
and Gradient Tree Boosting, among others.

Random Forests
The random forests model is an ensemble method since it aggregates a group of decision trees
into an ensemble (http://scikit-learn.org/stable/modules/ensemble.html). Unlike single decision trees
which are likely to suffer from high variance or high bias (depending on how they are tuned),
Random Forests use averaging to find a natural balance between the two extremes.

A forest of uncorrelated trees is built using a CART-like procedure, combined with
randomised node optimisation and "bagging" (bootstrap aggregating). Bagging helps to reduce
variance, improve unstable procedures and avoid overfitting. So, for each tree to be learned, we:

1. Subsample the data randomly with replacement.

2. Select a subset of the original features.

3. Apply the learning procedure only to the subsample drawn and the features selected.

4. Once many models are generated, their predictions can be combined into a single (mega)
prediction using majority vote or averaging; this combined prediction should be better, on
average, than the prediction made by any single model.

The percentage of data to grow each tree is arbitrary, but a widely used choice is 63% (.632 rule).
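The 63% figure comes from the bootstrap itself: when n samples are drawn with replacement from n, each sample is left out with probability (1 - 1/n)^n ≈ e^-1 ≈ 0.368, so roughly 63.2% of the distinct samples appear in each bootstrap. A quick sketch to verify this empirically:

PYTHON
# Sketch: ~63.2% of distinct samples appear in a bootstrap of size n
import numpy as np

n = 100000
bootstrap = np.random.choice(n, size=n, replace=True)  # draw n indices with replacement
print(len(np.unique(bootstrap)) / float(n))
> 0.632  (value will vary slightly from run to run)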

Figure 5. Example of how bootstrap works when applied to decision trees.

Boosting
Bagging is not the only option for creating an ensemble of trees; a very popular alternative is
Boosting. Boosting does not subsample the data: it uses the entire dataset multiple times,
giving increasing importance to samples that are hard to classify. This strategy also
assigns a different importance to each tree, which is used to weight its vote during
classification.
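As a sketch of the boosting alternative, scikit-learn provides AdaBoostClassifier, which builds shallow trees sequentially and re-weights hard-to-classify samples; the settings below are illustrative, not tuned for the course dataset:

PYTHON
# Sketch: a boosted ensemble of shallow decision trees (AdaBoost)
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=150, random_state=1)
ada.fit(XTrain, yTrain)
print(round(ada.score(XTest, yTest), 2))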

Building a Random Forest




Start by building a simple Random Forest model, which consists of 150 independently trained
decision trees, using the RandomForestClassifier function. For further details and examples
on how to construct a Random Forest, see the help page of RandomForestClassifier
(http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

PYTHON
# Build a Random Forest classifier with 150 decision trees

rf = RandomForestClassifier(n_estimators=150, random_state=1) (1)


rf.fit(XTrain, yTrain)
predRF = rf.predict(XTest)

print(metrics.classification_report(yTest, predRF))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, predRF),2))

This results in the following output:

PYTHON
precision recall f1-score support

0 0.67 0.64 0.66 59


1 0.95 0.96 0.95 441

avg / total 0.92 0.92 0.92 500

Overall Accuracy: 0.92

1. By using the random_state attribute, you ensure that your results remain the same every
time you re-run this script. Otherwise, if removed, you may notice that your results are
different to the ones presented here (and will be different every time you run this script).
This is due to the random nature of random forests, since every predictor is trained with a
bootstrap of the data, which is a random sampling with replacement. Also, in every tree
there is some randomness in how the subset of attributes for training is selected.

Visualising the RF accuracy


You can also visualise how the overall test accuracy is affected by an increase in n_estimators
(the number of decision trees) in your model. To do so, you can use the provided
rfAvgAcc function from visplots :

PYTHON
# Visualise average accuracy with an increasing number of trees

visplots.rfAvgAcc(rfModel=rf, XTest=XTest, yTest=yTest)

This results in the following output:


Figure 6. Average testing accuracy with the increasing number of decision trees in the Random
Forest.

Feature importance
Random forests allow you to compute a heuristic for determining how “important” a feature is
in predicting a target. This heuristic measures the change in prediction accuracy when a split is
introduced on a given feature. The larger the change in accuracy attributable to splits on the
feature, the more "important" we deem the feature to be.

You can use the feature_importances_ attribute of the RF classifier to get the relative
importance of each feature, which you can then visualise using a simple bar plot.


PYTHON
# Display the importance of the features in a barplot

# sorting the features according to their importance


importance_sorted_idx = np.argsort(rf.feature_importances_) (1)
names = header[0:10]

data = [
Bar(
x=rf.feature_importances_[importance_sorted_idx],
y=names[importance_sorted_idx],
orientation = 'h',
)
]

layout = Layout(
xaxis=dict(title = "Importance of features"),
yaxis=dict(title = "Features"),
width=800,
margin=Margin(
l=250,
r=50,
b=100,
t=50,
pad=4
)
)

fig = dict(data=data, layout=layout)

iplot(fig)

1. argsort returns the indices that would sort the features by their importance.

Which results in the following graph:


Figure 7. How “important” a feature is in predicting a target.

Boundary visualisation
You can visualise the classification boundary created by the Random Forest using the
visplots.rfDecisionPlot function. You can check the arguments passed to this function by
using the help command. In addition to the mandatory arguments, the function
visplots.rfDecisionPlot takes as optional arguments the ones from the
RandomForestClassifier function, so you can have a look at the documentation
(http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

PYTHON
#### Check the arguments of the function
help(visplots.rfDecisionPlot)

# Plot the boundary of Random Forest


visplots.rfDecisionPlot(XTrain, yTrain, XTest, yTest, header, n_estimators=150)


Figure 8. Decision boundary for a Random Forest classifier

Boundary interpretation
A Random Forest boundary is a combination of the decision boundaries
of many decision trees (remember that the decision region for a decision tree is rectilinear
with "stair-like" or "box-like" surfaces). Therefore, the decision region of a Random Forest
also has a rectilinear boundary composed of axis-parallel segments.

Parameter tuning and grid search


Many classification algorithms and pattern recognition techniques have one or more parameters
that need to be tuned. Such parameters are often referred to as hyperparameters. They are called
hyperparameters because they are not adapted internally as the Machine
Learning algorithm runs; rather, we have to set them to values that we want to explore.

Selecting the optimal values of these parameters - a process commonly referred to as "tuning" -
for a given classification problem is no trivial task, and can greatly affect a model’s performance.
A common approach towards optimising the hyperparameters is to apply a "three-way split",
where one subset (usually 30% of the data using a holdout approach) of the original data is left
aside during the whole training process as a test set to evaluate the generalisation ability of the
model, whereas the remaining data are further split into training (used to fit the model) and
validation (used to tune the model parameters) dataset(s) using holdout or cross-validation (or
other validation techniques).
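A minimal sketch of such a three-way split, using two successive calls to train_test_split (the proportions here are illustrative):

PYTHON
# Sketch: three-way split into training / validation / test sets
# First hold out 30% of the data as the final test set...
XTmp, XTest, yTmp, yTest = train_test_split(X, y, test_size=0.3, random_state=1)

# ...then split the remainder into training and validation sets
XTrain, XVal, yTrain, yVal = train_test_split(XTmp, yTmp, test_size=0.25, random_state=1)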


Figure 9. Summary of the process of optimal parameters selection. For each parameter
combination, the data are split into training and test sets a specified number of times. The
accuracy is then calculated from the mean of the models trained and evaluated on these splits, i.e.
given a cross validation accuracy score. The best parameter combination is the one with the
highest cross validation accuracy score. The figure has been extracted from this tutorial
(http://online.cambridgecoding.com/notebooks/cca_admin/misleading-modelling-overfitting-crossvalidation-
and-the-biasvariance-tradeoff)
.

K-fold cross-validation
K-fold cross-validation allows you to estimate how your model will perform on completely
unseen (real) test data, using only the dataset you are given. You do this by repeatedly splitting
your dataset into training and (pseudo-)"test" sets and evaluating the accuracy each time, i.e. on
each repetition, one portion of your dataset "pretends" to come from outside the dataset. By
repeating the validation process for different splits of the data, you can have more confidence
that your estimate of the model's accuracy will generalise and be "robust".

The K in K-fold validation refers to the number of times you will be splitting the data. For
example, if K=3:


Figure 10. K-fold validation where K=3. Accuracy is calculated for the model trained from each
split so that in the case of K=3, a1 is the accuracy of the model trained on the first split, a2 is the
accuracy of the model trained on the second split, and a3 is the accuracy of the model trained on
the third split.

The accuracy is then calculated by taking the average of the accuracies of the models trained on
each split, i.e. the mean of a1, a2 and a3.
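A sketch of this procedure using cross_val_score (available, like train_test_split, through sklearn.cross_validation); K is set by the cv argument:

PYTHON
# Sketch: 3-fold cross-validation of a decision tree
from sklearn.cross_validation import cross_val_score
from sklearn.tree import DecisionTreeClassifier

scores = cross_val_score(DecisionTreeClassifier(max_depth=3), XTrain, yTrain, cv=3)
print(scores)          # a1, a2, a3: one accuracy score per fold
print(scores.mean())   # the cross-validation accuracy estimate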

Grid search
The traditional way of performing hyperparameter optimisation is by applying grid search,
which is simply an exhaustive search through a manually specified subset of the
hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some
performance metric, typically measured by cross-validation on the training set or evaluation on
a held-out validation set.


Figure 11. In the context of hyperparameter optimisation, you perform K-fold cross-validation
together with grid search to get a more robust estimate of the model performance associated
with specific hyperparameter values. The figure has been extracted from
https://blog.cambridgecoding.com/2016/04/03/scanning-hyperspace-how-to-tune-machine-
learning-models/

Tuning Random Forests with grid search


Random forests have several parameters that can be tuned, e.g. n_estimators , max_features ,
max_depth and min_samples_leaf (the minimum number of samples in a leaf) are just some
of the parameters to be optimised. To view the full list of arguments that can be optimised for a
Random Forest, you can use the help() function:

PYTHON
# View the list of arguments to be optimised

help(RandomForestClassifier())

Here are the main parameters that you can tune:

n_estimators: The number of trees in the forest. A larger number of trees is preferable as it
will decrease the variance in predictions, but it is also more computationally expensive. In
addition, results will stop getting significantly better beyond a critical number of trees.


max_features: The size of the random subsets of features to consider when splitting a node.
The smaller the subset of features is, the greater the reduction of variance, but also the
greater the increase in bias. Empirically, it has been found that good default values are
max_features=sqrt(n_features) (default case) for classification tasks (where
n_features is the number of features in the data).

max_depth: The maximum number of links between the root of the tree and the leaves. The
smaller it is, the simpler the decision boundary will be.

min_samples_split: The minimum number of samples required to split an internal node; by
default this is set to 2 in scikit-learn, resulting in fully developed trees.

min_samples_leaf: The minimum number of samples required to be at a leaf node; by
default this is set to 1. Increasing this value can result in leaves that contain more samples
and, in some cases, can prevent the overfitting of a tree.

Good results are often achieved when setting max_depth=None in combination with
min_samples_split=1 . Bear in mind though that using these values might result in models that
consume a lot of memory. In addition, note that with scikit-learn, the RandomForestClassifier
uses bootstrap samples by default ( bootstrap=True ). (This is not the case for the
ExtraTreeClassifier, where bootstrap=False by default).

Remember, the best parameter values should always be cross-validated. The grid search below
evaluates the RF classifier for every possible combination of parameters specified by the
dictionary parameters using 5-fold cross-validation. Accuracy (the default scoring of
GridSearchCV) is used as the measure to evaluate the model and find the best parameter set:


PYTHON
# Conduct a grid search with 5-fold cross-validation using the dictionary of parameters

# Parameters you can investigate include:


n_estimators = np.arange(5, 100, 25)
max_depth = np.arange(1, 35, 5)
# percentage of features to consider at each split
max_features = np.linspace(0.1, 1.0, 3)
parameters = [{'n_estimators': n_estimators,
'max_depth': max_depth,
'max_features': max_features}]

gridCV = GridSearchCV(RandomForestClassifier(), param_grid=parameters, cv=5, n_jobs=4) (1)
gridCV.fit(XTrain, yTrain)

# Print the optimal parameters


best_n_estim = gridCV.best_params_['n_estimators']
best_max_depth = gridCV.best_params_['max_depth']
best_max_features = gridCV.best_params_['max_features']

print ('Best parameters: n_estimators=', best_n_estim,


'max_depth=', best_max_depth,
'max_features=', best_max_features)

> Best parameters: n_estimators=80 max_depth=6 max_features=1.0 (2)

1. The n_jobs argument can be used for parallelisation to speed up the tuning process.

2. You may notice that your results are different from the ones presented here (affecting also
the overall accuracy and relevant metrics) since we haven’t used the random_state
argument within the RandomForestClassifier to ensure the results remain the same
every time we run this script.

You may also choose to include in your dictionary of parameters and grid search any of the
following options:

PYTHON
# Additional parameters to investigate could include
# max_features = [1, 3, 10] or max_features: ['auto', 'sqrt', 'log2']
# min_samples_split = [1, 3, 10]
# min_samples_leaf = [1, 3, 10]
# bootstrap = [True, False]
# criterion = ["gini", "entropy"]

Visualising the grid search results in a heatmap


You can also graphically represent the results of the grid search using a heatmap:


PYTHON
# Create a heatmap to visualise the results of the grid search with cross-validation

# isolating the results with the best value of max_features

scores_best_max_features = [s for s in gridCV.grid_scores_
                            if s[0]['max_features'] == best_max_features]
scores = visplots.rf_organise_scores(scores_best_max_features, n_estimators, max_depth)

data = [
Heatmap(
x=n_estimators,
y=max_depth,
z=scores.T,
colorscale='Blues',
reversescale=True,
colorbar=dict(
title="Classification Accuracy",
nticks=10
)
)
]

layout = Layout(
xaxis = dict(title="Number of estimators", tickvals=n_estimators),
yaxis = dict(title="Max Depth", tickvals= max_depth),
height = 700,
)

fig = dict(data=data, layout=layout)

iplot(fig)

Your plot may look as follows:


Figure 12. Grid search scores from the tuning of the RF hyperparameters

Testing and evaluating the generalisation performance


When evaluating the resulting model, it is important to do so on held-out samples that were not
seen during the grid search process (XTest). So, we test our model on the independent XTest
dataset using the optimal parameters:

PYTHON
# Build the classifier using the *optimal* parameters detected by grid search

clfRDF = RandomForestClassifier(n_estimators=best_n_estim,
max_depth=best_max_depth,
max_features = best_max_features)

clfRDF.fit(XTrain, yTrain)
predRF = clfRDF.predict(XTest)

print (metrics.classification_report(yTest, predRF))


print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, predRF),2))

Which may result in the following output:


PYTHON
precision recall f1-score support

0 0.66 0.63 0.64 59


1 0.95 0.96 0.95 441

avg / total 0.92 0.92 0.92 500

Overall Accuracy: 0.92

Wrap up of Module 4
A decision tree model consists of a root node, internal nodes and terminal (leaf) nodes. Root
nodes and internal nodes contain feature test conditions, while terminal nodes assign the
class label.

Decision Trees are interpretable classification models.

Random Forests reduce variance by combining many independently trained decision trees
through averaging or majority voting. The individual trees are trained on data drawn by
bootstrap sampling with replacement.

Random Forests, like other "ensemble learning" methods, use many models from the training
data to make predictions. In the case of random forests, the individual models are decision
trees.

Last updated 2016-11-25 06:32:37 GMT
