Second, we want to know how many trees
are enough for the Eval to converge.
From Figure 3, we can see that the Eval is
almost the same for forests with 100, 200,
and 400 trees. Thus, a random forest with
more than 100 trees and 128 splits per node
is the best configuration for our data. The
best Eval achieved by the random forest is 0.123857.
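The convergence check above can be sketched as a simple sweep over the number of trees; this is only an illustration using scikit-learn and a synthetic dataset (`make_classification`) in place of our real data, and plain validation accuracy in place of the Eval metric.

```python
# Sketch: sweep the number of trees and watch the validation score level off.
# Assumes scikit-learn; make_classification stands in for the real data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for n_trees in (50, 100, 200, 400):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf.fit(X_tr, y_tr)
    scores[n_trees] = rf.score(X_va, y_va)

for n_trees, acc in scores.items():
    print(n_trees, acc)
```

With enough trees the score changes very little between successive settings, which is the convergence behaviour Figure 3 shows for 100, 200, and 400 trees.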
3.2
GBDT
For GBDT, there are two important parameters: one is the maximum number of
leaves per tree, and the other is the number
of trees. From Figures 4 and 5, we
can see that the best parameters are 32 leaves
per tree and 16 trees in total. The best
Eval achieved by GBDT is 0.122872.
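These two knobs map directly onto standard GBDT implementations; the sketch below uses scikit-learn's `GradientBoostingClassifier` on synthetic stand-in data (the actual experiments may have used a different GBDT library).

```python
# Sketch: the two GBDT parameters discussed above, on stand-in data.
# Assumes scikit-learn; make_classification replaces the real dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

# Best setting per Figures 4 and 5: 32 leaves per tree, 16 trees in total.
gbdt = GradientBoostingClassifier(max_leaf_nodes=32, n_estimators=16,
                                  random_state=0)
gbdt.fit(X_tr, y_tr)
acc = gbdt.score(X_va, y_va)
print(acc)
```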
3.3
Neural Network
3.4
Model Comparison
3.4.1
Best Model
In the sections above, we have shown the experimental results of three machine learning
algorithms with various sets of parameters,
and the results indicate that the random
forest has the best performance, with
gradient boosted decision trees performing almost
as well. Therefore, we
choose the random forest, and the voting ensemble
of the random forest and gradient boosted decision trees, as our best two models for both
track 1 and track 2. The detailed parameters of
our best random forest classifier are shown
in Table 2. As for the voting classifier, we
reused the result of the previous model and
made it vote with our best GBDT classifier,
whose parameters are documented in
Table 3. The voting method is
soft voting [2], which simply averages
the predicted probabilities in the binary classification.
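Soft voting over two binary classifiers reduces to averaging their positive-class probabilities and thresholding; a minimal sketch, again assuming scikit-learn and synthetic stand-in data:

```python
# Sketch of soft voting [2]: average the two models' predicted probabilities
# for the binary task and threshold at 0.5. Stand-in data, not our real set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
gbdt = GradientBoostingClassifier(max_leaf_nodes=32, n_estimators=16,
                                  random_state=0).fit(X_tr, y_tr)

# Soft voting: mean of the positive-class probabilities of both models.
p_vote = (rf.predict_proba(X_va)[:, 1] + gbdt.predict_proba(X_va)[:, 1]) / 2
y_pred = (p_vote >= 0.5).astype(int)
acc = (y_pred == y_va).mean()
print(acc)
```

Because the RF is fitted once and reused, the voting step adds only the GBDT training cost on top of the stored RF, matching how we reused the previous model's result.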
3.4.2
Efficiency

model   Eval       running time
RF      0.123857   28m17s
GBDT    0.122872   25s
NN      0.130037   1m8s
[Table 2: parameters of our best random forest classifier. The column labels were lost in extraction; the surviving values are 1, 128, 40, 500, Bagging, and the fragment "store a RF model".]

[Table 3: parameters of our best GBDT classifier. The column labels were lost in extraction; the surviving values are 32, 1, 0.1, and 16 (per Section 3.2, 32 is the number of leaves per tree and 16 the number of trees).]

Team Work

Name                  (lost)   (lost)
feature engineering   0.7      0.3
model tuning          0.5      0.5

(The member names in the "Name" row were lost in extraction.)

References