
PRIMER

What are artificial neural networks?



Anders Krogh
Artificial neural networks have been applied to problems ranging from speech recognition to prediction of protein
secondary structure, classification of cancers and gene prediction. How do they work and what might they be good for?

Anders Krogh is at the Bioinformatics Centre, Department of Biology and Biotech Research and Innovation Centre, University of Copenhagen, Ole Maaloes Vej 5, DK-2200 Copenhagen, Denmark. e-mail: krogh@binf.ku.dk

When it comes to tasks other than number crunching, the human brain possesses many advantages over a digital computer. We can quickly recognize a face, even when seen from the side in bad lighting in a room full of other objects. We can easily understand speech, even that of an unknown person in a noisy room. Despite years of focused research, computers are far from performing at this level. The brain is also remarkably robust; it does not stop working just because a few cells die. Compare this to a computer, which will not normally survive any degradation of the CPU (central processing unit). But perhaps the most fascinating aspect of the brain is that it can learn. No programming is necessary; we do not need a software upgrade just because we want to learn to ride a bicycle.

The computations of the brain are done by a highly interconnected network of neurons, which communicate by sending electric pulses through the neural wiring consisting of axons, synapses and dendrites. In 1943, McCulloch and Pitts modeled a neuron as a switch that receives input from other neurons and, depending on the total weighted input, is either activated or remains inactive. The weight, by which an input from another cell is multiplied, corresponds to the strength of a synapse—the neural contacts between nerve cells. These weights can be both positive (excitatory) and negative (inhibitory). In the 1960s, it was shown that networks of such model neurons have properties similar to the brain: they can perform sophisticated pattern recognition, and they can function even if some of the neurons are destroyed. The demonstration, in particular by Rosenblatt, that simple networks of such model neurons called ‘perceptrons’ could learn from examples stimulated interest in the field, but after Minsky and Papert1 showed that simple perceptrons could solve only the very limited class of linearly separable problems (see below), activity in the field diminished. Nonetheless, the error back-propagation method2, which can make fairly complex networks of simple neurons learn from examples, showed that these networks could solve problems that were not linearly separable. NETtalk, an application of an artificial neural network for machine reading of text, was one of the first widely known applications3, and many followed soon after. In the field of biology, the exact same type of network as in NETtalk was also applied to prediction of protein secondary structure4; in fact, some of the best predictors still use essentially the same method. Another big wave of interest in artificial neural networks started, and led to a fair deal of hype about magical learning and thinking machines. Some of the important early works are gathered in ref. 5.

Artificial neural networks are inspired by the early models of sensory processing by the brain. An artificial neural network can be created by simulating a network of model neurons in a computer. By applying algorithms that mimic the processes of real neurons, we can make the network ‘learn’ to solve many types of problems. A model neuron is referred to as a threshold unit and its function is illustrated in Figure 1a. It receives input from a number of other units or external sources, weighs each input and adds them up. If the total input is above a threshold, the output of the unit is one; otherwise it is zero. Therefore, the output changes from 0 to 1 when the total weighted sum of inputs is equal to the threshold. The points in input space satisfying this condition define a so-called hyperplane. In two dimensions, a hyperplane is a line, whereas in three dimensions, it is a normal plane. Points on one side of the hyperplane are classified as 0 and those on the other side as 1. It means that a classification problem can be solved by a threshold unit if the two classes can be separated by a hyperplane. Such problems, as illustrated in three dimensions in Figure 1b, are said to be linearly separable.

Learning
If the classification problem is separable, we still need a way to set the weights and the threshold, such that the threshold unit correctly solves the classification problem. This can be done in an iterative manner by presenting examples with known classifications, one after another. This process is called learning or training, because it resembles the process we go through when learning something. Simulation of learning by a computer involves making small changes in the weights and the threshold each time a new example is presented in such a way that the classification is improved. The training can be implemented by various different algorithms, one of which will be outlined below.

During training, the hyperplane moves around until it finds its correct position in space, after which it will not change so much. This is nicely illustrated in two dimensions by the program Neural Java (http://lcn.epfl.ch/tutorial/english/index.html); follow the link “Adaline, Perceptron and Backpropagation,” use red and blue dots to represent two classes and select “play.”
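To make the threshold unit of Figure 1a concrete, here is a minimal Python sketch of the computation just described. It is purely illustrative and not part of the original primer: the two inputs, the weights and the threshold are arbitrary numbers chosen by hand, not values learned from data.

```python
# A McCulloch-Pitts threshold unit: weigh each input, add them up and
# compare the sum to a threshold. Points where the weighted sum equals
# the threshold form the hyperplane separating the two classes.

def threshold_unit(inputs, weights, threshold):
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

weights = [1.0, -1.0]    # positive weight ~ excitatory, negative ~ inhibitory
threshold = 0.5

print(threshold_unit([1.2, 0.1], weights, threshold))   # prints 1
print(threshold_unit([0.2, 0.9], weights, threshold))   # prints 0
```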

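The iterative training just described can be sketched with the classical perceptron learning rule, in which the weights and the threshold are nudged slightly whenever an example is misclassified. This is one simple choice of training algorithm, shown only as an illustration; the toy data below are invented and linearly separable.

```python
import random

def train_threshold_unit(examples, n_inputs, rate=0.1, epochs=100):
    """Present labeled examples repeatedly, correcting the weights and the
    threshold a little after every misclassified example."""
    weights = [random.uniform(-0.1, 0.1) for _ in range(n_inputs)]
    threshold = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            total = sum(w * x for w, x in zip(weights, inputs))
            output = 1 if total >= threshold else 0
            error = target - output              # -1, 0 or +1
            # Small change that moves the hyperplane toward the correct side.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            threshold -= rate * error
    return weights, threshold

# Toy training set: class 1 when the first input dominates the second.
data = [([1.0, 0.1], 1), ([0.9, 0.3], 1), ([0.1, 1.0], 0), ([0.2, 0.8], 0)]
weights, threshold = train_threshold_unit(data, n_inputs=2)
```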



Figure 1 Artificial neural networks. (a) Graphical representation of the McCulloch-Pitts model neuron or threshold unit. The threshold unit receives input from N other units or external sources, numbered from 1 to N. Input i is called xi and the associated weight is called wi. The total input to a unit is the weighted sum over all inputs, Σi=1..N wixi = w1x1 + w2x2 + … + wNxN. If this were below a threshold t, the output of the unit would be 0, and 1 otherwise. Thus, the output can be expressed as g(Σi=1..N wixi − t), where g is the step function, which is 0 when the argument is negative and 1 when the argument is nonnegative (the actual value at zero is unimportant; here, we chose 1). The so-called transfer function, g, can also be a continuous sigmoid as illustrated by the red curve. (b) Linear separability. In three dimensions, a threshold unit can classify points that can be separated by a plane. Each dot represents input values x1, x2 and x3 to a threshold unit. Green dots correspond to data points of class 0 and red dots to class 1. The green and red crosses illustrate the ‘exclusive or’ function—it is not possible to find a plane (or a line in the x1, x2 plane) that separates the green dots from the red dots. (c) Feed-forward network. The network shown takes seven inputs, has five units in the hidden layer and one output. It is said to be a two-layer network because the input layer does not perform any computations and is not counted. (d) Over-fitting. The eight points shown by plusses lie on a parabola (apart from a bit of ‘experimental’ noise). They were used to train three different neural networks. The networks all take an x value as input (one input) and are trained with a y value as desired output. As expected, a network with just one hidden unit (green) does not do a very good job. A network with 10 hidden units (blue) approximates the underlying function remarkably well. The last network with 20 hidden units (purple) over-fits the data; the training points are learned perfectly, but for some of the intermediate regions the network is overly creative.

Let us consider an example problem, for which an artificial neural network is readily applicable. Of two classes of cancer, only one responds to a certain treatment. As there is no simple biomarker to discriminate the two, you decide to try to use gene expression measurements of tumor samples to classify them. Assume you measure gene expression values for 20 different genes in 50 tumors of class 0 (nonresponsive) and 50 of class 1 (responsive). On the basis of these data, you train a threshold unit that takes an array of 20 gene expression values as input and gives 0 or 1 as output for the two classes, respectively. If the data are linearly separable, the threshold unit will classify the training data correctly.

But many classification problems are not linearly separable. We can separate the classes in such nonlinear problems by introducing more hyperplanes; that is, by introducing more than one threshold unit. This is usually done by adding an extra (hidden) layer of threshold units, each of which does a partial classification of the input and sends its output to a final layer, which assembles the partial classifications into the final classification (Fig. 1c). Such a network is called a multi-layer perceptron or a feed-forward network. Feed-forward neural networks can also be used for regression problems, which require continuous outputs, as opposed to binary outputs (0 and 1). By replacing the step function with a continuous function, the neural network outputs a real number. Often a ‘sigmoid’ function—a soft version of the threshold function—is used (Fig. 1a). The sigmoid function can also be used for classification problems by interpreting an output below 0.5 as class 0 and an output above 0.5 as class 1; often it also makes sense to interpret the output as the probability of class 1.

In the above example, one could, for instance, have a situation where class 1 is characterized by either a highly expressed gene 1 and a silent gene 2 or a silent gene 1 and a highly expressed gene 2; if neither or both of the genes are expressed, it is a class 0 tumor. This corresponds to the ‘exclusive or’ function from logic, and it is the canonical example of a nonlinearly separable function (Fig. 1b). In this case, it would be necessary to use a multi-layer network to classify the tumors.

Back-propagation
The previously mentioned back-propagation learning algorithm works for feed-forward networks with continuous output. Training starts by setting all the weights in the network to small random numbers. Now, for each input example the network gives an output, which is initially random. We measure the squared difference between this output and the desired output—the correct class or value. The sum of all these numbers over all training examples is called the total error of the network. If this number were zero, the network would be perfect, and the smaller the error, the better the network.

By choosing the weights that minimize the total error, one can obtain the neural network that best solves the problem at hand. This is the same as in linear regression, where the two parameters characterizing the line are chosen such that the sum of squared differences between the line and the data points is minimal. This can be done analytically in linear regression, but there is no analytical solution for a feed-forward neural network with hidden units. In back-propagation, the weights and thresholds are changed each time an example is presented, such that the error gradually becomes smaller. This is repeated, often hundreds of times, until the error no longer changes. An illustration can be found at the Neural Java site above by following the link “Multi-layer Perceptron (with neuron outputs in {0;1}).”

In back-propagation, a numerical optimization technique called gradient descent makes the math particularly simple; the form of the equations gave rise to the name of this method. There are some learning parameters (called learning rate and momentum) that need tuning when using back-propagation, and there are other problems to consider. For instance, gradient descent is not guaranteed to find the global minimum of the error, so the result of the training depends on the initial values of the weights. However, one problem overshadows the others: that of over-fitting.
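As a concrete illustration of these ideas, the short sketch below trains a feed-forward network with one hidden layer of sigmoid units by plain gradient descent on the squared error, using the ‘exclusive or’ problem from the example above. It is a bare-bones teaching sketch rather than a faithful reimplementation of ref. 2: there is no momentum term, the learning rate and number of iterations are arbitrary, and, as noted in the text, gradient descent from an unlucky random start can occasionally get stuck in a poor local minimum.

```python
import numpy as np

rng = np.random.default_rng(1)

# The 'exclusive or' problem: class 1 when exactly one input is high.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Start with small random weights; the thresholds are kept as bias terms.
W1, b1 = rng.normal(scale=0.5, size=(2, 3)), np.zeros((1, 3))   # input -> hidden
W2, b2 = rng.normal(scale=0.5, size=(3, 1)), np.zeros((1, 1))   # hidden -> output
rate = 1.0

for _ in range(10000):
    # Forward pass: compute the network output for every training example.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # The total error is the sum of squared differences from the desired
    # outputs; propagate its gradient backward from output to hidden units.
    d_out = (output - y) * output * (1 - output)
    d_hid = (d_out @ W2.T) * hidden * (1 - hidden)

    # Gradient-descent step: small weight changes that reduce the error.
    W2 -= rate * hidden.T @ d_out
    b2 -= rate * d_out.sum(axis=0, keepdims=True)
    W1 -= rate * X.T @ d_hid
    b1 -= rate * d_hid.sum(axis=0, keepdims=True)

# Outputs above 0.5 are read as class 1; they should approach 0, 1, 1, 0.
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))
```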

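The cross-validation procedure described below for estimating generalization (split the data into ten parts, train on nine and test on the held-out part, rotating through all ten parts) can be sketched as follows. The network training itself is abstracted away: train and test are placeholder functions supplied by the caller, and the toy data and ‘majority class’ model are invented purely so that the sketch runs on its own.

```python
import random

def cross_validate(examples, k, train, test):
    """Split the examples into k sets, train on k-1 of them and test on the
    held-out set, repeating so that every set is used once for testing.
    Returns the average test score as an estimate of generalization ability."""
    # Note: as the text cautions, very similar examples should not end up
    # in different sets; a purely random split does not guarantee this.
    shuffled = examples[:]
    random.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [ex for j in range(k) if j != i for ex in folds[j]]
        model = train(training)
        scores.append(test(model, held_out))
    return sum(scores) / k

# Self-contained toy run: a 'model' that always predicts the majority class.
data = [([random.random()], random.randint(0, 1)) for _ in range(100)]
majority = lambda examples: max((t for _, t in examples),
                                key=[t for _, t in examples].count)
accuracy = lambda model, examples: sum(t == model for _, t in examples) / len(examples)
print(cross_validate(data, k=10, train=majority, test=accuracy))
```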



Over-fitting occurs when the network has too many parameters to be learned from the number of examples available, that is, when a few points are fitted with a function with too many free parameters (Fig. 1d). Although this is true for any method for classification or regression, neural networks seem especially prone to overparameterization. For instance, a network with 10 hidden units for solving our example problem would have 221 parameters: 20 weights and a threshold for each of the 10 hidden units, and 10 weights and a threshold for the output unit. This is too many parameters to be learned from 100 examples. A network that overfits the training data is unlikely to generalize well to inputs that are not in the training data.

There are many ways to limit over-fitting (apart from simply making small networks), but the most common include averaging over several networks, regularization and using methods from Bayesian statistics6.

To estimate the generalization performance of the neural network, one needs to test it on independent data, which have not been used to train the network. This is usually done by cross-validation, where the data set is split into, for example, ten sets of equal size. The network is then trained on nine sets and tested on the tenth, and this is repeated ten times, so all the sets are used for testing. This gives an estimate of the generalization ability of the network; that is, its ability to classify inputs that it was not trained on. To get an unbiased estimate, it is very important that the individual sets do not contain examples that are very similar.

Extensions and applications
Both the simple perceptron with a single unit and the multi-layer network with multiple units can easily be generalized to prediction of more than two classes by just adding more output units. Any classification problem can be coded into a set of binary outputs. In the above example, we could, for instance, imagine that there are three different treatments, and for a given tumor we may want to know which of the treatments it responds to. This could be solved using three output units—one for each treatment—which are connected to the same hidden units.

Neural networks have been applied to many interesting problems in different areas of science, medicine and engineering, and in some cases they provide state-of-the-art solutions. Neural networks have sometimes been used haphazardly for problems where simpler methods would probably have given better results, giving them a somewhat poor reputation among some researchers.

There are other types of neural networks than the ones described here, such as Boltzmann machines, unsupervised networks and Kohonen nets. Support vector machines7 are closely related to neural networks. To read more, I suggest the books by Chris Bishop6,8, a rather old book I coauthored9 or the book by Duda et al.10. There are numerous programs to use for making artificial neural networks trained with your own data. These include extensions or plug-ins to Excel, Matlab and R (http://www.r-project.org/) as well as code libraries and large commercial packages. The FANN library (http://leenissen.dk/fann/) is recommended for serious applications. It is open source and written in the C programming language, but can be called from, for example, Perl and Python programs.

1. Minsky, M.L. & Papert, S.A. Perceptrons (MIT Press, Cambridge, 1969).
2. Rumelhart, D.E., Hinton, G.E. & Williams, R.J. Nature 323, 533–536 (1986).
3. Sejnowski, T.J. & Rosenberg, C.R. Complex Systems 1, 145–168 (1987).
4. Qian, N. & Sejnowski, T.J. J. Mol. Biol. 202, 865–884 (1988).
5. Anderson, J.A. & Rosenfeld, E. (eds). Neurocomputing: Foundations of Research (MIT Press, Cambridge, 1988).
6. Bishop, C.M. Neural Networks for Pattern Recognition (Oxford University Press, Oxford, 1995).
7. Noble, W.S. Nat. Biotechnol. 24, 1565–1567 (2006).
8. Bishop, C.M. Pattern Recognition and Machine Learning (Springer, New York, 2006).
9. Hertz, J.A., Krogh, A. & Palmer, R. Introduction to the Theory of Neural Computation (Addison-Wesley, Redwood City, 1991).
10. Duda, R.O., Hart, P.E. & Stork, D.G. Pattern Classification (Wiley Interscience, New York, 2000).

