
What Are Artificial Neural Networks

Before delving into the solution of real-world problems using neural networks, a definition of neural networks will be presented. It is important to know the conditions under which this style of problem solving excels and what its limitations are. At the core of neural computation are the concepts of distributed, adaptive, and nonlinear computing. Neural networks perform computation in a very different way from conventional computers, where a single central processing unit sequentially dictates every piece of the action. Neural networks are built from a large number of very simple processing elements that individually deal with pieces of a big problem. A processing element (PE) simply multiplies an input by a set of weights and nonlinearly transforms the result into an output value (table lookup). The principles of computation at the PE level are deceptively simple. The power of neural computation comes from the massive interconnection among the PEs, which share the load of the overall processing task, and from the adaptive nature of the parameters (weights) that interconnect the PEs.

Normally, a neural network will have several layers of PEs. This chapter covers only the most basic feedforward architecture, the multilayer perceptron (MLP). Other feedforward architectures, as well as those with recurrent connections, are addressed in the Tutorials chapter. The diagram below illustrates a simple MLP. The circles are the PEs arranged in layers. The left column is the input layer, the middle column is the hidden layer, and the right column is the output layer. The lines represent weighted connections (i.e., a scaling factor) between PEs.

Figure: A simple multilayer perceptron

By adapting its weights, the neural network works towards an optimal solution based on a measurement of its performance. For supervised learning, the performance is explicitly measured in terms of a desired signal and an error criterion.
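The PE computation just described (a weighted sum followed by a nonlinear transform) can be sketched in a few lines. Python is used here purely for illustration; the tanh nonlinearity, the 2-3-1 layer sizes, and the weight values are assumptions for the sketch, not NeuroSolutions internals.

```python
import math
import random

def pe_output(inputs, weights, bias):
    """A single processing element: weighted sum, then a nonlinear squash."""
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    return math.tanh(net)  # tanh assumed as the nonlinear transfer function

def mlp_forward(x, layers):
    """Feed an input vector through a list of layers.
    Each layer is a list of (weights, bias) pairs, one per PE."""
    for layer in layers:
        x = [pe_output(x, w, b) for w, b in layer]
    return x

# A tiny 2-3-1 MLP with small random weights (hidden and output layers)
random.seed(0)
hidden = [([random.uniform(-0.1, 0.1) for _ in range(2)], 0.0) for _ in range(3)]
output = [([random.uniform(-0.1, 0.1) for _ in range(3)], 0.0)]
y = mlp_forward([0.5, -0.2], [hidden, output])
```

The power of the network comes not from any single `pe_output` call but from many such units adapting their weights in concert.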
For the unsupervised case, the performance is implicitly measured in terms of a learning law and topology constraints.

Getting Data into the Network

The input and desired response data that was collected and coded must be written to one of the four file formats that NeuroSolutions accepts: ASCII, column-formatted ASCII, binary, and bitmap (bmp image). Column-formatted ASCII is the most commonly used, since it is directly exportable from commercial spreadsheet programs. Each column of a column-formatted ASCII file represents one channel of data (i.e., input into one PE). Each channel may be used for the input or the desired output, or it may be ignored. The desired response data can be written to the same file as the input data, or the two can be written to separate files.

The first line (row) of the file should contain the column headings, not actual data (see figure below). Each group of spaces and/or tabs indicates a break in the columns. In order for the program to detect the correct number of columns, the column headings must not contain spaces.

Figure: Example of a column-formatted ASCII file

The remaining lines contain the individual samples of data. The data elements (values) are separated by spaces and/or tabs. The number of data elements on each line must match the number of column headings on the first line. The data elements can be either numeric or symbolic. There is a facility to automatically convert symbolic data to numeric data.

The remaining three file types are simply read as a sequential stream of floating-point values. Non-formatted ASCII files contain numeric values separated by tabs and/or spaces; any non-numeric values are simply ignored. Bitmap files can be either 16-color or 256-color. Each pixel of the image is converted to a value from 0 to 1, based on its intensity level. Binary files contain raw data, such that each 4-byte segment contains a floating-point value. Many numerical software packages export their data to this type of format.
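A column-formatted ASCII file of the kind described above can be read with a short routine. This Python sketch follows the rules in the text (first row holds the headings, columns split on runs of spaces/tabs, every row must match the heading count); the sample channel names and the leave-symbols-as-strings choice are illustrative assumptions.

```python
def read_column_ascii(lines):
    """Parse column-formatted ASCII: first row = headings, rest = samples.
    Columns are separated by runs of spaces and/or tabs."""
    headings = lines[0].split()          # headings must contain no spaces
    channels = {h: [] for h in headings}
    for line in lines[1:]:
        values = line.split()
        if len(values) != len(headings):
            raise ValueError("row does not match number of column headings")
        for h, v in zip(headings, values):
            try:
                channels[h].append(float(v))  # numeric data element
            except ValueError:
                channels[h].append(v)         # symbolic data, converted later
    return channels

# Hypothetical three-channel file: two numeric inputs and one symbolic column
sample = [
    "height weight gender",
    "1.70  65.0  male",
    "1.62  54.5  female",
]
data = read_column_ascii(sample)
```

Each dictionary key then corresponds to one channel, which can be assigned as input, desired output, or ignored.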

Cross Validation

During training, the input and desired data will be repeatedly presented to the network. As the network learns, the error will drop towards zero. Lower error, however, does not always mean a better network. It is possible to overtrain a network. Cross validation is a highly recommended criterion for stopping the training of a network. Although highly recommended, it is not required. One will often want to try several networks using just training data in order to see which works best, and then use cross validation for the final training. When using cross validation, the next step is to decide how to divide your data into a training set and a validation set, also called the test set. The network is trained with the training set, and its performance is checked with the test set.

The neural network will find the input-output map by repeatedly analyzing the training set. This is called the network training phase. Most of the neural network design effort is spent in the training phase. Training is normally slow because the network's weights are being updated based on the error information. At times, training will strain the patience of the designer. But a carefully controlled training phase is indispensable for good performance, so be patient. NeuroSolutions code was written to take maximum advantage of your computer's resources. Hardware accelerators for NeuroSolutions that are totally transparent to the user are forthcoming.

There is a need to monitor how well the network is learning. One of the simplest methods is to observe how the cost, which is the squared difference between the network's output and the desired response, changes over training iterations. This graph of the output error versus iteration is called the learning curve. The figure below shows a typical learning curve. Note that in principle the learning curve decreases exponentially to zero or a small constant. One should be satisfied with a small error.
How small depends upon the situation, and your judgment must be used to find what error value is appropriate for your problem. The training phase also holds the key to an accurate solution, so the criterion to stop training must be very well delineated. The goal of the stop criterion is to maximize the network's generalization.

Figure: Typical learning curve

It is relatively easy to adapt the weights in the training phase to provide a good solution to the training data. However, the best test of a network's performance is to apply data that it has not yet seen. Take a stock prediction example. One can train the network to predict the next day's stock price using data from the previous year. This is the training data. But what is really needed is to use the trained network to predict tomorrow's price. To test the network, one must freeze the weights after the training phase and apply data that the network has not seen before. If the training is successful and the network's topology is correct, it will apply its past experience to this data and still produce a good solution. If this is the case, then the network is able to generalize based on the training set.

What is interesting is that a network with enough weights will always learn the training set better as the number of iterations is increased. However, neural network researchers have found that this decrease in the training set error is not always coupled with better performance in the test set. When the network is trained too much, it memorizes the training patterns and does not generalize well. A practical way to find a point of better generalization is to set aside a small percentage (around 10%) of the training set and use it for cross validation. One should monitor the error in both the training set and the validation set. When the error in the validation set increases, the training should be stopped, because the point of best generalization has been reached.
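The stopping rule just described, holding out part of the training data and stopping when the held-out error starts to rise, can be sketched as a small loop. The training callback, the simulated error values, and the stop-at-first-increase policy are illustrative assumptions, not the exact NeuroSolutions mechanism.

```python
def train_with_cross_validation(train_step, val_error, max_iters=1000):
    """Stop training when the cross-validation error stops improving.
    train_step(): runs one training iteration.
    val_error(): returns current error on the held-out validation set."""
    best = float("inf")
    for it in range(max_iters):
        train_step()
        err = val_error()
        if err >= best:           # validation error increased: overtraining
            return it, best       # point of best generalization
        best = err
    return max_iters, best

# Simulated validation errors: they fall, then rise as memorization begins
errors = iter([0.9, 0.5, 0.3, 0.25, 0.27, 0.31])
stop_iter, best_err = train_with_cross_validation(lambda: None,
                                                  lambda: next(errors))
```

In practice one would tolerate a few non-improving iterations before stopping, since real validation curves are noisy.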
Cross validation is one of the most powerful methods to stop the training. Other methods are discussed under Network Training.

Network Topology

After taking care of the data collection and the organization of the training sets, one must select the network's topology. An understanding of the topology as a whole is needed before the number of hidden layers and the number of PEs in each layer can be estimated. This discussion will focus on multilayer perceptrons (MLPs) because they are the most common. A multilayer perceptron with two hidden layers is a universal mapper: if the number of PEs in each layer and the training time are not constrained, then mathematicians can prove that the network has the power to solve any problem. This is a very important result, but it is only an existence proof, so it does not say how such networks can be designed. The problem left to the experimentalist (like you) is to find out what the right combination of PEs and layers is to solve the problem with acceptable training times and performance.

This result indicates that there is probably no need for more than two hidden layers. A common recommendation is to start with a single hidden layer. In fact, unless you're sure that the data is not linearly separable, you may want to start without any hidden layers. The reason is that networks train progressively slower when layers are added. It is just like tapping a stream of water. If you take too much water in the first couple of taps, there will be less and less water available for later taps. In multilayer neural networks, one can think of the water as the error generated at the output of the network. This error is propagated back through the network to train the weights, and it is attenuated at each layer due to the nonlinearities. So if a topology with many layers is chosen, the error available to train the first layer's weights will be very small, and training times can become excruciatingly slow. As you may expect from the emphasis on training, training times are the bottleneck in neural computation (it has been shown that training times grow exponentially with the number of dimensions of the network's inputs), so all efforts should be made to make training easier.

Network Training

Training is the process by which the free parameters of the network (i.e., the weights) get their optimal values. The weights are updated using either supervised or unsupervised learning. This chapter focuses on the MLP, so the details of unsupervised learning are not covered here (see the on-line documentation for the NeuralBuilder). With supervised learning, the network is able to learn from the input and the error (the difference between the output and the desired response). The ingredients for supervised learning are therefore the input, the desired response, the definition of error, and a learning law. Error is typically defined through a cost function.
Good network performance should result in a small value for the cost. A learning law is a systematic way of changing the weights such that the cost is minimized. In supervised learning, the most popular learning law is backpropagation. The network is trained in an attempt to find the optimal point on the performance surface, as defined by the cost definition. A simple performance surface is illustrated in the figure below. This network has only one weight, so its performance surface can be completely represented using a 2D graph. The x-axis represents the value of the weight, while the y-axis is the resulting cost. This performance surface is easy to visualize because it is contained within a two-dimensional space. In general, the performance surface is contained within an N+1 dimensional space, where N is the number of weights in the network.

Backpropagation changes each weight of the network based on its localized portion of the input signal and its localized portion of the error. The change has to be proportional to (a scaled version of) the product of these two quantities. The mathematics may be complicated, but the idea is very simple. When this algorithm is used for weight change, the state of the system is doing gradient descent: moving in the direction opposite to the largest local slope on the performance surface. In other words, the weights are being updated in the downhill direction.

Figure: Simple performance surface

The beauty of backpropagation is that it is simple and can be implemented efficiently in computers. The drawbacks are just as important: the search for the optimal weight values can get caught in local minima, i.e., the algorithm thinks it has arrived at the best possible set of weights even though there are other solutions that are better. Backpropagation is also slow to converge. In making the process simple, the search direction is noisy, and sometimes the weights do not move in the direction of the minimum.
Finally, the learning rates must be set heuristically.

The problems of backpropagation can be reduced. The slowness of convergence can be improved by speeding up the original gradient descent learning. NeuroSolutions provides several faster search algorithms, such as Quickprop, Delta Bar Delta, and momentum. Momentum learning is often recommended due to its simplicity and efficiency with respect to the standard gradient.

Most gradient search procedures require the selection of a step size. The idea is that the larger the step size, the faster the minimum will be reached. However, if the step size is too large, the algorithm will diverge and the error will increase instead of decrease. If the step size is too small, it will take too long to reach the minimum, which also increases the probability of getting caught in local minima. It is recommended that you start with a large step size. If the simulation diverges, then reset the network and start over with a smaller step size. Starting with a large step size and decreasing it until the network becomes stable finds a value that will solve the problem in fewer iterations. Small step sizes should be utilized to fine-tune the convergence in the later stages of training.
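The gradient descent and momentum updates discussed above can be sketched for a single weight. The toy quadratic performance surface, the step size, and the momentum constant here are illustrative assumptions, not NeuroSolutions defaults.

```python
def gradient(w):
    """Gradient of a toy 1-D performance surface J(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)

def descend(w, step, momentum=0.0, iters=50):
    """Plain gradient descent (momentum=0) or momentum learning.
    Each update moves opposite the local slope (downhill), plus a
    fraction of the previous weight change."""
    delta = 0.0
    for _ in range(iters):
        delta = -step * gradient(w) + momentum * delta
        w += delta
    return w

w_plain = descend(0.0, step=0.1)               # standard gradient descent
w_mom = descend(0.0, step=0.1, momentum=0.5)   # momentum learning
```

Both runs settle near the minimum at w = 3; raising `step` far enough (here above 1.0 for the plain case) makes the update overshoot and diverge, which is exactly the too-large-step-size failure described above.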

Another issue is how to choose the initial weights. The search must start someplace on the performance surface, and that place is given by the initial condition of the weights. In the absence of any a priori knowledge, and to avoid symmetry conditions that can trap the search algorithm, the weights should be started at random values. However, the network's PEs have saturating nonlinearities, so if the weight values are very large, the PE can saturate. If the PE saturates, the error that goes through it becomes zero, and previous layers may not adapt. Small random weight values will put every PE in the linear region of the sigmoid at the beginning of learning. NeuroSolutions uses uniformly distributed random numbers generated with a variance that is configurable per layer. If the networks are very large, one should further observe how many inputs each PE has and divide the variance of the random number generator by this value.

The stop criteria for learning are very important. The stop criterion based on the error in the cross validation set was explained earlier. Other methods limit the total number of iterations (and hence the training time), stopping the training regardless of the network's performance. Another method stops training when the error reaches a given value. Since the error is a relative quantity, and the length of time needed for the simulation to get there is unknown, this may not be the best stop criterion. Another alternative is to stop on incremental error. This method stops the training at the point of diminishing returns, when an iteration is only able to decrease the error by a negligible amount. However, the training can be prematurely stopped by this criterion, because performance surfaces may have plateaus where the error changes very little from iteration to iteration.

Probing

A successful neural network simulation requires the specification of many parameters, and the performance is highly dependent on the choice of these parameters.
A productive way to assess the adequacy of the chosen parameters is to observe the signals that flow inside the network. NeuroSolutions has an amazingly powerful set of probing tools. One can observe signals flowing in the network, weights changing, errors being propagated, and, most importantly, the cost, all while the network is working. This means that you do not need to wait until the end of training to find out that the learning rate was set too high. All probes within NeuroSolutions belong to one of two categories: static probes and temporal probes. The big difference is that the first kind accesses instantaneous data, while the second accesses data over a window in time. The temporal probes have a buffer that stores past values, so one can visualize the signals as they change during learning. Fourier transforms provide a look at the frequency composition of such signals. There is also a probe that provides a 3-D representation of the state space.

Breadboard

NeuroSolutions has one document type, the breadboard. Simulations are constructed and run on breadboards. With NeuroSolutions, designing a neural network is very similar to prototyping an electronic circuit. With an electronic circuit, components such as resistors, capacitors, and inductors are first arranged on a breadboard. NeuroSolutions instead uses neural components such as axons, synapses, and probes. The components are then connected together to form a circuit. The electronic circuit passes electrons between its components. The circuit (i.e., neural network) of NeuroSolutions passes activity between its components, and is termed a data flow machine. Finally, the circuit is tested by inputting data and probing the system's response at various points. An electronic circuit would use an instrument, such as an oscilloscope, for this task. A NeuroSolutions network uses one or more of its components within the probes family (e.g., the MegaScope).
Networks are constructed on a breadboard by selecting components from the palettes, stamping them on the breadboard, and then interconnecting them to form a network topology. Once the topology is established and its components have been configured, a simulation can be run. An example of a functional breadboard is illustrated in the figure below.

Figure: Example of a breadboard (single hidden layer MLP)

New breadboards are created by selecting New from the File menu. This will create a blank breadboard titled "Breadboard1.nsb". The new breadboard can later be saved. Saving a breadboard saves the topology, the configuration of each component, and (optionally) their weights. Therefore, a breadboard may be saved at any point during training and then restored later. The saving of weights is a parameter setting for each component that contains adaptive weights. This parameter can be set for all components on the breadboard or just for selected ones.

The TestingWizard

After training a network, you will want to test its performance on data that it was not trained with. The TestingWizard automates this procedure by providing an easy way to produce the network output for the testing dataset that you defined within the NeuralExpert or NeuralBuilder, or for a new dataset not yet defined. To launch the TestingWizard from within NeuroSolutions, go to the Tools menu and choose "TestingWizard", or click the "Testing" toolbar button. If your breadboard was built with the NeuralExpert, you may alternatively click the "Test" button in the upper-left corner of the breadboard. Online help is available from all TestingWizard panels. To access help, click the Help button in the lower-left corner of the wizard.

Cross Validation

Cross validation computes the error in a test set at the same time that the network is being trained with the training set. It is known that the MSE will keep decreasing in the training set, but may start to increase in the test set. This happens when the network starts "memorizing" the training patterns. The Termination page of the activation control inspector can be used to monitor the cross validation set error and automatically stop the training when that error is not improving. The easiest way to understand the mechanics of cross validation is to use the NeuralBuilder or NeuralExpert to build a simple network that has cross validation. The Static Inspector is used to configure the switching between the testing and training phases of the simulation. The File components each contain a Training data set and a Cross-Validation data set (see the Data Set property page). The Cross-Validation data can either be a different segment of the same file or a different file. There is an additional set of Probes and a ThresholdTransmitter for monitoring the cross validation phase of the simulation. Observe the Access Data Set setting of the Access property page for these components.
Confusion Matrix

A confusion matrix is a simple methodology for displaying the classification results of a network. The confusion matrix is defined by labeling the desired classifications on the rows and the predicted classifications on the columns. For each exemplar, a 1 is added to the cell entry defined by (desired classification, predicted classification). Since we want the predicted classification to be the same as the desired classification, the ideal situation is to have all the exemplars end up on the diagonal cells of the matrix (the diagonal that connects the upper-left corner to the lower-right). Observe these two examples:

Figure: Confusion Matrix Example 1

Figure: Confusion Matrix Example 2

In example 1 we have perfect classification. Every male subject was classified by the network as male, and every female subject was classified as female. There were no males classified as females, or vice versa. In example 2 we have imperfect classification: 9 females were classified incorrectly by the network as males, and 5 males were classified as females.

In NeuroSolutions, a confusion matrix is created by attaching a probe to one of the Confusion Matrix access points of the ErrorCriterion component. One option is to display the results as the raw number of exemplars classified for each combination of desired and actual outputs, as shown in the above examples. The other option is to display each cell as a percentage of the exemplars for the desired class. In this format, each row of the matrix sums to 100.
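The bookkeeping behind a confusion matrix is simple enough to sketch directly. The class labels echo the male/female example above; the function itself is an illustration of the row/column convention, not the ErrorCriterion component.

```python
def confusion_matrix(desired, predicted, classes):
    """Rows = desired class, columns = predicted class.
    Each exemplar adds 1 to cell (desired, predicted)."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for d, p in zip(desired, predicted):
        matrix[index[d]][index[p]] += 1
    return matrix

# Five exemplars: one male and one female are misclassified
desired   = ["male", "male", "female", "female", "female"]
predicted = ["male", "female", "female", "female", "male"]
cm = confusion_matrix(desired, predicted, ["male", "female"])
# Diagonal cells hold the correctly classified exemplars.
```

Dividing each row by its sum and multiplying by 100 gives the percentage display described above, in which every row sums to 100.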

Performance Measures

The Performance Measures access point of the ErrorCriterion component provides six values that can be used to measure the performance of the network for a particular data set.

MSE

The mean squared error is simply two times the average cost (see the access points of the ErrorCriterion component). The formula for the mean squared error is:

MSE = ( Σj Σi (dij − yij)² ) / (N · P)

where P is the number of output PEs, N is the number of exemplars in the data set, yij is the network output for exemplar i at PE j, and dij is the desired output for exemplar i at PE j.

NMSE

The normalized mean squared error is the MSE normalized by the variance of the desired output data:

NMSE = MSE / σd²

where σd² is the variance of the desired output.

r

The correlation coefficient between the network's output and the desired output.

% Error

The percent error is the average, over all exemplars and output PEs, of the absolute difference between the desired and actual outputs, expressed as a percentage of the desired output:

% Error = 100 · ( Σj Σi |dij − yij| / |dij| ) / (N · P)

Note that this value can easily be misleading. For example, say that your output data is in the range of 0 to 100. For one exemplar your desired output is 0.1 and your actual output is 0.2. Even though the two values are quite close, the percent error for this exemplar is 100.

AIC

Akaike's information criterion (AIC) is used to measure the tradeoff between training performance and network size. The goal is to minimize this term to produce a network with the best generalization:

AIC(k) = N · ln(MSE) + 2k

where k is the number of network weights and N is the number of exemplars.

MDL

Rissanen's minimum description length (MDL) criterion is similar to the AIC in that it tries to combine the model's error with the number of degrees of freedom to determine the level of generalization. The goal is to minimize this term:

MDL(k) = N · ln(MSE) + 0.5 · k · ln(N)
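For a single output channel, the first four measures can be computed in a few lines under their standard definitions (an assumption where the original formulas are not reproduced here); this sketch is an illustration of those definitions, not the ErrorCriterion implementation.

```python
def performance_measures(desired, output):
    """MSE, NMSE, correlation coefficient r, and percent error for one
    output channel. Standard definitions are assumed."""
    n = len(desired)
    mse = sum((d - y) ** 2 for d, y in zip(desired, output)) / n
    mean_d = sum(desired) / n
    var_d = sum((d - mean_d) ** 2 for d in desired) / n
    nmse = mse / var_d                       # MSE normalized by desired variance
    mean_y = sum(output) / n
    cov = sum((d - mean_d) * (y - mean_y)
              for d, y in zip(desired, output)) / n
    var_y = sum((y - mean_y) ** 2 for y in output) / n
    r = cov / (var_d ** 0.5 * var_y ** 0.5)  # correlation coefficient
    pct = 100.0 * sum(abs(d - y) / abs(d)
                      for d, y in zip(desired, output)) / n
    return mse, nmse, r, pct

# Nearly perfect prediction on three exemplars
mse, nmse, r, pct = performance_measures([1.0, 2.0, 3.0], [1.1, 2.0, 2.9])
```

Note that AIC and MDL need only the MSE plus the weight count k and exemplar count N, so they fall out of the same computation.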
