1.3 ReLU

The Rectified Linear Unit is the activation function that fires a neuron according to the following:

$$\sigma(x) = \max(0, x) \tag{1}$$

Using ReLU instead of tanh or sigmoid activations can prevent the vanishing gradient problem from happening.

$$\sum_{k \in c[l]} \frac{\partial E(W)}{\partial a_k}\,\frac{\partial a_k}{\partial a_l} = \sum_{k \in c[l]} \delta_k\, w_{kl}\, \frac{\partial \sigma(a_l)}{\partial a_l} \tag{2}$$

As shown in Equation 2 (where $\delta_k = \partial E(W)/\partial a_k$), the backpropagation step requires the derivative of the activation with respect to the previous activation. That implies that the derivative of the activation influences the gradient update directly.

$$\frac{\partial \sigma(a_l)}{\partial a_l} = \sigma(a_l)\bigl(1 - \sigma(a_l)\bigr) \tag{3}$$

$$\frac{\partial \sigma(a_l)}{\partial a_l} =
\begin{cases}
1, & \text{if } a_l > 0 \\
0, & \text{otherwise}
\end{cases} \tag{4}$$

For any parameter, the derivative of the sigmoid activation (Eq. 3) always yields small values (at most 0.25), whose repeated multiplication gradually leads to the vanishing gradient problem, whereas the derivative of ReLU (Eq. 4) is exactly 0 or 1, so the learning procedure can avoid the vanishing gradient problem. Consequently, the deep architecture manages to learn its weight parameters comparably faster.
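To make the contrast concrete, here is a minimal NumPy sketch (not from the paper) that evaluates Eq. 3 and Eq. 4 and raises each derivative to the tenth power, mimicking the repeated chain-rule products of Eq. 2 across ten hypothetical layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(a):
    # Eq. 3: sigma(a) * (1 - sigma(a)), always in (0, 0.25]
    s = sigmoid(a)
    return s * (1.0 - s)

def d_relu(a):
    # Eq. 4: exactly 1 for positive pre-activations, 0 otherwise
    return np.where(a > 0, 1.0, 0.0)

a = 2.0                            # a hypothetical pre-activation
print(d_sigmoid(a) ** 10)          # ~1.6e-10: the gradient vanishes
print(float(d_relu(a)) ** 10)      # 1.0: the gradient passes through
```

The depth of ten is arbitrary; the point is that the sigmoid factors shrink geometrically while the ReLU factors do not.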
1.4 Dropout

Dropout is a regularization technique for neural networks that excludes randomly selected neurons during the training step so as to avoid overfitting the training set. Concretely, neurons that are dropped out in the forward propagation are not allowed to participate in the backpropagation step. That makes the network less sensitive to the specific weights of individual neurons.
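As an illustration, here is a minimal NumPy sketch of such a dropout mask; the 0.5 drop probability and the inverted-dropout rescaling are common-practice assumptions rather than details given in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop=0.5):
    # Randomly selected neurons are zeroed out; survivors are rescaled
    # (inverted dropout) so the expected activation stays the same.
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask, mask

def dropout_backward(grad_out, mask):
    # Dropped neurons (mask == 0) receive no gradient, so they do not
    # participate in the backpropagation step.
    return grad_out * mask

h = rng.standard_normal((4, 8))        # a hypothetical hidden activation
h_drop, mask = dropout_forward(h)
grad_in = dropout_backward(np.ones_like(h), mask)
```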
1.5 Softmax

In this network, softmax is used to make the final decision by outputting a probability for each of the classes 0-9 for an arbitrary example, according to the following:

$$P(y_i \mid W_i, x) = \frac{\exp(W_i^{T} x)}{\sum_j \exp(W_j^{T} x)} \tag{5}$$

The class with the highest probability is chosen as the predicted label, which then enters the evaluation step.
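Below is a minimal NumPy sketch of Eq. 5 followed by the argmax decision; the 784-dimensional input (a flattened 28×28 image) and the random weights are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax_probs(W, x):
    # Eq. 5: P(y_i | W_i, x) over the ten classes 0-9
    logits = W @ x                  # W_i^T x for every class i
    logits -= logits.max()          # stabilize the exponentials
    e = np.exp(logits)
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 784))  # hypothetical class weights
x = rng.random(784)                 # hypothetical flattened example
p = softmax_probs(W, x)
y_hat = int(np.argmax(p))           # highest probability -> predicted label
```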
2 Adjustment

2.1 The first layer exception

Figure 1: The first ConvLayer exclusion

As shown in Figure 1, when the first convolutional layer is abandoned, the raw image vector is connected to the second layer directly, resulting in 64 feature maps after the 5×5 filters convolve it. Then, the pooling step reduces the spatial size of the feature maps to 14×14. Afterwards, the network starts to learn its parameters as a complete model.
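A sketch of this truncated model in Keras might look as follows, assuming 28×28 grayscale inputs and 'same' padding (which is what makes the pooled size come out to 14×14); neither detail is stated explicitly in the text, and the dropout rate is likewise assumed:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # First ConvLayer removed: the raw image feeds the (former) second
    # layer directly, a 5x5 convolution producing 64 feature maps.
    tf.keras.layers.Conv2D(64, (5, 5), padding="same", activation="relu",
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),  # spatial size: 28x28 -> 14x14
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),          # rate assumed, see Sec. 1.4
    tf.keras.layers.Dense(10, activation="softmax"),  # Sec. 1.5
])
```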
2.2 The second layer exception
3 Result

As these models were implemented and trained with a learning rate of $10^{-4}$, the accuracy of each model was observed at every 100th epoch (Figure 4). ReLU and dropout exclusions can presumably induce slow convergence and overfitting.
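Reusing the `model` sketch from Section 2.1 above, a minimal training and evaluation loop under the stated learning rate of $10^{-4}$ could look like this; the Adam optimizer, epoch count, batch size, and the MNIST dataset (implied by the 0-9 labels) are all assumptions:

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0   # (N, 28, 28, 1), scaled to [0, 1]
x_test = x_test[..., None] / 255.0

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128, verbose=0)
_, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"test accuracy: {test_acc:.4f}")  # compare against Table 1
```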
           model 1   model 2   model 3
Accuracy   0.9585    0.9627    0.9561

Table 1: The test accuracies

In each of these models, features of different sizes had been extracted.
After testing, the test accuracies of the models (see Table 1) turned out to have similar values. Even though these models outclassed a sole fully-connected architecture, they are obviously incomplete compared to the Deep Convolutional Architecture.
In conclusion, the assumption mentioned in the first section has evidently been confirmed. At the very least in this experiment, the test results have revealed that these models are insufficient to make predictions for this problem and need to be developed further, which suggests that using multiple convolutional layers can make up for those deficiencies and that such layers can learn more advanced features from an example.