
Image recognition with convolutional neural networks

Introduction
The main objective of this work is to understand the models used to train algorithms for
image pattern recognition, which are also used in many other fields, such as social data
mining or speech and music recognition.
We first give an introduction to neural networks with a basic approach: training a network with the gradient descent method.
What is a neural network?
A neural network consists of a combination of logical units called perceptrons or artificial
neurons, which are programmed to emulate the behaviour of a human neuron by receiving
an input and computing an output using weights and biases.
The idea of a perceptron can be expressed with the following equation:

\[
\text{output} =
\begin{cases}
0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\[4pt]
1 & \text{if } \sum_j w_j x_j > \text{threshold}
\end{cases}
\]
where x_j are the inputs coming from the given data or from other perceptrons and w_j are the weights of those inputs. We can also express the previous output function in terms of a bias b := -threshold:

\[
\text{output} =
\begin{cases}
0 & \text{if } w \cdot x + b \le 0 \\[4pt]
1 & \text{if } w \cdot x + b > 0
\end{cases}
\]

The output of a perceptron is 0 if the scalar product of the inputs with their weights is lower than or equal to a given threshold, and 1 if it is higher. Perceptrons are organized in layers, and several layers together form a neural network. From now on we use the notation z := w · x + b.
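As an illustration, here is a minimal sketch of a single perceptron in Python; the function name and the example weights are chosen for illustration only and do not come from the text.

```python
import numpy as np

def perceptron_output(w, x, b):
    """Return 1 if w.x + b > 0, otherwise 0."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# Example: a perceptron that fires only when both inputs are 1 (an AND gate)
w = np.array([1.0, 1.0])
b = -1.5
print(perceptron_output(w, np.array([1, 1]), b))  # 1
print(perceptron_output(w, np.array([1, 0]), b))  # 0
```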

The first layer is called the input layer; it receives the information that has to be processed by the network (image pixels, song beats, audio frames, etc.). After that come the hidden layers, which process that information so that the output layer can produce a result according to a target.
A problem with such a perceptron output function is that slight changes in the weights and thresholds (from now on, biases) can induce big changes in the output, which is not desirable. So instead we use the sigmoid function as output function, which behaves similarly but has a smooth shape: it takes all real values between 0 and 1 instead of only the two discrete values of the previously described function.
This ensures a continuity property, that is, small changes in the input cause small changes in the output. So instead of defining the output as above, we improve the output function as follows:

\[
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w \cdot x + b.
\]

This way, when z = w · x + b is a large positive number the resulting output tends to 1, and when it is a large negative number the output tends to 0.
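As a small sketch (the numeric examples are only illustrative), the sigmoid activation can be written as:

```python
import numpy as np

def sigmoid(z):
    """Smooth activation: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Large positive z gives an output close to 1; large negative z gives an output close to 0
print(sigmoid(10.0))   # ~0.99995
print(sigmoid(-10.0))  # ~0.00005
print(sigmoid(0.0))    # 0.5
```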

So we have briefly described which elements form a neural network. Let's see now how it works.
We want the network to produce a desired output when we feed it a given input. For example, we give as input a matrix of pixels containing a handwritten digit, and we want the network to return the number from 0 to 9 written in the picture.
The output that the network should give is called the target value.
In the first example we are going to train the network algorithm by approximating the
most suitable biases and weights so that when we give the network an input image with a
digit it returns the same number that we see in the picture.
To do that, the first approach is the stochastic gradient descent method, a method for minimizing multivariate functions. We are going to use it to find the minimum of a cost function that measures how different the network outputs are from the target outputs:

\[
C(w, b) = \frac{1}{2n} \sum_x \| y(x) - a \|^2
\]

where w and b are the weights and biases of the perceptrons, n is the number of training inputs, x are the inputs, y(x) are the output targets, and a are the network outputs when x is the input (a also depends on w and b, but that is omitted for the sake of simplicity).
The gradient descent method consists of approaching the minimum of the function by stepwise subtracting the gradient vector, since the gradient of a multivariate function indicates the direction of maximum growth. So if we subtract the gradient, multiplied by a small scalar, from the function variables, we move in the direction of maximum decrease, towards lower values of the function. Once subtracting the gradient no longer gives us a lower value of C, we are close to the minimum and stop the algorithm. Expressed as a formula, for each step:

\[
v \;\rightarrow\; v' = v - \eta \nabla C
\]

where v represents the vector of weights and biases and \eta (the learning rate) is a small quantity that we multiply by the gradient so that we approach the minimum stepwise without overshooting it.
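Here is a minimal sketch of one gradient descent step, assuming a function grad_C (an illustrative name) that returns the gradient of C at the current parameters:

```python
import numpy as np

def gradient_descent_step(v, grad_C, eta=0.1):
    """Move the parameter vector v one step against the gradient of C."""
    return v - eta * grad_C(v)

# Toy example: minimize C(v) = v1^2 + v2^2, whose gradient is 2*v
grad_C = lambda v: 2 * v
v = np.array([3.0, -4.0])
for _ in range(100):
    v = gradient_descent_step(v, grad_C, eta=0.1)
print(v)  # close to [0, 0], the minimum of C
```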
The next step is to calculate the gradient. To do that, we introduce the backpropagation algorithm.

The backpropagation algorithm uses the following notation and equalities:

w^l_{jk} is the weight from the k-th neuron in the (l-1)-th layer to the j-th neuron in the l-th layer.

a^l_j = \sigma\!\left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right) is the activation of the j-th neuron in the l-th layer.

b^l_j is the bias of the j-th neuron in the l-th layer.

L is the total number of layers.

The error of the output layer is

\[
\delta^L_j = \frac{\partial C}{\partial a^L_j}\, \sigma'(z^L_j),
\]

and in matrix-based form

\[
\delta^L = \nabla_a C \odot \sigma'(z^L),
\]

where \odot denotes the Hadamard product, which is the elementwise product.

Knowing those equations, we have a way of calculating the gradient, which we subtract at each step (multiplied by a given learning rate) so that we get closer to the optimal (w, b).
Weights and biases initialization: we initialize the weight and bias vectors with a Gaussian (normal) distribution with mean 0 and standard deviation 1.
The steps of the backpropagation algorithm are:

1. Initialize the weights and biases.
2. Set the activation a^1 for the input layer.
3. For each layer l = 2, 3, ..., L compute the corresponding activations a^l = \sigma(w^l a^{l-1} + b^l).
4. Compute the output error vector \delta^L = \nabla_a C \odot \sigma'(z^L).
5. Backpropagate the error: for layers l = L-1, ..., 2 calculate \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l).
6. Output: the gradient is given by \partial C / \partial w^l_{jk} = a^{l-1}_k \delta^l_j and \partial C / \partial b^l_j = \delta^l_j.

Once we know how to calculate the gradient, we continue with the gradient descent method applied to mini-batches of the training dataset.
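As a concrete illustration of these steps, the sketch below applies backpropagation to one training example for a fully connected network with sigmoid activations and quadratic cost; the function and variable names are illustrative and the layer structure is assumed, not taken from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, weights, biases):
    """Gradients of the quadratic cost for one training example (x, y)."""
    # Feedforward: store the weighted inputs z and the activations layer by layer
    activation, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # Output error: delta^L = (a^L - y) * sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_w = [np.outer(delta, activations[-2])]
    grad_b = [delta]
    # Backpropagate the error through layers L-1, ..., 2
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_w.insert(0, np.outer(delta, activations[-l - 1]))
        grad_b.insert(0, delta)
    return grad_w, grad_b
```

To train with mini-batches, the gradients returned for each example in a batch would be averaged and plugged into the update rule above.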

Cross-entropy function
After learning the most basic neural network model, we are going to see that this approach has some inconveniences and how to solve them.

The main problem is that this algorithm learns slowly: for very low or very large values of z (= w · x + b) the derivative of the sigmoid function is close to zero, so learning will be very slow, since (for a single neuron with quadratic cost)

\[
\frac{\partial C}{\partial w} = (a - y)\,\sigma'(z)\,x
\qquad \text{and} \qquad
\frac{\partial C}{\partial b} = (a - y)\,\sigma'(z),
\]
so we introduce the cross-entropy cost function, which solves this problem:

\[
C = -\frac{1}{n} \sum_x \big[\, y \ln a + (1 - y) \ln (1 - a) \,\big].
\]

The partial derivative with respect to the weights is this time

\[
\frac{\partial C}{\partial w_j} = -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \sigma'(z)\, x_j,
\]

which simplifies, since \sigma'(z) = \sigma(z)(1 - \sigma(z)), to

\[
\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j\, (\sigma(z) - y).
\]
This is a better expression for the learning speed: the greater the error \sigma(z) - y, the faster the algorithm will learn, since the weight update is proportional to the error itself and the \sigma'(z) factor has cancelled out.
Similarly, we compute the derivative with respect to the bias, which is

\[
\frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z) - y).
\]

These changes improve the learning speed for large initial errors.
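A brief sketch of the cross-entropy cost and its gradient for a single sigmoid neuron, using the averaged form given above; the array shapes and names are assumptions made for the example:

```python
import numpy as np

def cross_entropy_cost(a, y):
    """C = -(1/n) * sum over examples of [y ln a + (1 - y) ln(1 - a)]."""
    n = y.shape[0]
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / n

def cross_entropy_grad(x, a, y):
    """Gradients w.r.t. weights and bias: note that sigma'(z) has cancelled out."""
    n = y.shape[0]
    error = a - y                  # sigma(z) - y for each example
    grad_w = (x.T @ error) / n     # (1/n) * sum_x x_j (sigma(z) - y)
    grad_b = np.sum(error) / n
    return grad_w, grad_b
```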

Softmax
Another approach to the problem of learning slowdown is softmax layers. A softmax layer is a modified version of the output layer in which we do not apply the sigmoid function to the weighted inputs z^L_j; we apply the softmax function instead. The activation of the j-th neuron of the output layer L is

\[
a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}.
\]

This function has the property of forming a probability distribution, which means that all the outputs of a softmax layer are values between 0 and 1 and they sum to 1.
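A small sketch of the softmax activation, showing the probability-distribution property; the shift by max(z) is a common numerical-stability trick, not part of the definition above:

```python
import numpy as np

def softmax(z):
    """Map the weighted inputs z of the output layer to a probability distribution."""
    e = np.exp(z - np.max(z))  # shifting z does not change the result, but avoids overflow
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
print(a)         # every entry is between 0 and 1
print(a.sum())   # 1.0
```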
Why is softmax good for learning slowdown?
Consider the log-likelihood cost function

\[
C \equiv -\ln a^L_y,
\]

where y is the desired output and L is the last layer. This cost function has the needed properties: if the network is confident that the output is y, then a^L_y is close to 1, so -\ln a^L_y is close to 0.
Differentiating the log-likelihood cost function we get error expressions similar to the ones derived from the cross-entropy cost function:

\[
\frac{\partial C}{\partial b^L_j} = a^L_j - y_j
\qquad \text{and} \qquad
\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k\, (a^L_j - y_j).
\]

We know from the previous error equations that these expressions for the cost derivatives imply that we do not have the learning slowdown problem.
Overfitting
With the described model we find another problem: our network has many parameters, and the predictions won't improve after a limited number of epochs because of the overfitting phenomenon. The idea is that if we have enough parameters we can build a model that fits any dataset, but that doesn't mean our model is good at predicting new data.
To solve the overfitting problem we introduce a regularization term in the cost function. For the quadratic cost this gives

\[
C = \frac{1}{2n} \sum_x \| y(x) - a^L \|^2 + \frac{\lambda}{2n} \sum_w w^2.
\]

We can also add the regularization term to the cross-entropy cost function,

\[
C = -\frac{1}{n} \sum_x \big[\, y \ln a^L + (1 - y) \ln (1 - a^L) \,\big] + \frac{\lambda}{2n} \sum_w w^2,
\]

or to a general cost function C_0:

\[
C = C_0 + \frac{\lambda}{2n} \sum_w w^2.
\]
Although there is no widely accepted theoretical explanation of why regularization gives better results for data fitting, the idea is to penalize large weights, because heuristically they would capture details of the training set instead of learning the general pattern of the training data.
There are many other modifications of the cost function and regularization techniques that we are not going to explain here, but the technique explained above (L2 regularization) is widely used.
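As an illustrative sketch (lambda, eta and the names are assumptions for the example, and grads stands for the gradients of the unregularized cost C_0), the L2 term changes the cost and the weight-update step as follows; note that each update now also shrinks the weights towards zero:

```python
import numpy as np

def regularized_cost(cost0, weights, lam, n):
    """C = C_0 + (lambda / 2n) * sum of squared weights."""
    return cost0 + (lam / (2 * n)) * sum(np.sum(w ** 2) for w in weights)

def update_weights(weights, grads, eta, lam, n):
    """Gradient descent step with L2 regularization: w -> (1 - eta*lam/n) w - eta * dC0/dw."""
    return [(1 - eta * lam / n) * w - eta * g for w, g in zip(weights, grads)]
```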
Another regularization technique is dropout, which consists of training the network with half of the hidden-layer neurons randomly deleted, repeating this with different random deletions, and then averaging the outputs of the differently modified networks. The idea is that if most of the modified networks agree on a specific result, we assume the networks that disagree are making mistakes.
We can also reduce overfitting by artificially expanding the training data, for example by taking small rotations of the images. This technique improves the goodness of fit on test data.
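A tiny sketch of this kind of augmentation, assuming the images are 2-D NumPy arrays; scipy's rotate is used here only as one convenient choice:

```python
import numpy as np
from scipy.ndimage import rotate

def augment_with_rotations(images, labels, angles=(-10, 10)):
    """Add slightly rotated copies of every image, keeping the original label."""
    new_images, new_labels = list(images), list(labels)
    for img, lab in zip(images, labels):
        for angle in angles:
            new_images.append(rotate(img, angle, reshape=False))
            new_labels.append(lab)
    return new_images, new_labels
```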
Weight initialization to avoid slowdown:
As mentioned before, the first idea for initializing the weights and biases was to set them to values drawn from a Gaussian distribution with mean 0 and standard deviation 1. It turns out that if we do so it is likely that |z| will be very large, so the output of the neuron \sigma(z) will be very close to 0 or 1; the neuron will be saturated and learning will be slow.
If we instead initialize each weight as a Gaussian random variable with mean 0 and standard deviation 1/\sqrt{n_{in}}, where n_{in} is the number of input weights of the neuron (the biases can keep standard deviation 1), we get much better results, since most values of z are then close to 0 and the derivative of \sigma is not close to 0, which makes learning faster.
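A minimal sketch of this initialization for a fully connected network, assuming (as is common) that only the weights are rescaled while the biases keep standard deviation 1; the layer sizes are illustrative:

```python
import numpy as np

def initialize_parameters(sizes, rng=np.random.default_rng()):
    """Weights ~ N(0, 1/sqrt(n_in)); biases ~ N(0, 1)."""
    weights = [rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [rng.normal(0.0, 1.0, size=n_out) for n_out in sizes[1:]]
    return weights, biases

# Example: a 784-30-10 network, as used for handwritten digit images
weights, biases = initialize_parameters([784, 30, 10])
```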
