Parallel Implementations of Recurrent Neural Network Learning
Uroš Lotrič, Andrej Dobnikar
{uros.lotric, andrej.dobnikar}@fri.uni-lj.si
Abstract.
Keywords:
Introduction
In recent years the commercial computer industry has been undergoing a massive
shift towards parallel and distributed computing. This shift was mainly initiated
by the current limitations of semiconductor manufacturing. New developments
are also reflected in the area of computationally intensive applications: by fully exploiting the capabilities of the underlying hardware architecture, impressive enhancements in algorithm performance can be achieved with a low to moderate investment of time and money.
Today, clusters of loosely coupled desktop computers represent an extremely popular infrastructure for the implementation of parallel algorithms. Processes running on computing nodes in the cluster communicate with each other through
messages. The message passing interface (MPI) is a standardized and portable
implementation of this concept, providing several abstractions that simplify the
use of parallel computers with distributed memory [1].
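As a minimal sketch of this concept (with illustrative names, not code from the paper), every MPI process computes a partial value, and the partial values are combined with a single collective message-passing call:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double partial, total;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* id of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

        partial = 1.0 / (rank + 1);             /* this process's share */

        /* sum the partial values across all processes */
        MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d processes: %f\n", size, total);

        MPI_Finalize();
        return 0;
    }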
Recently, the development of powerful graphics processing units (GPUs) has
made high-performance parallel computing possible by using commercial discrete
general-purpose graphics cards [2]. There are many technologies, among which
Nvidia's compute unified device architecture (CUDA) is the most popular [3]. It
includes C/C++ software development tools, function libraries and a hardware
abstraction mechanism that hides GPU hardware architecture from developers.
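As a minimal sketch of the CUDA programming model (illustrative only, not the paper's implementation), a kernel that applies a sigmoid function element-wise, one thread per element:

    #include <cuda_runtime.h>

    /* Each thread computes the sigmoid of one array element; the
       hardware schedules thousands of such threads concurrently. */
    __global__ void sigmoid(const float *u, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = 1.0f / (1.0f + expf(-u[i]));
    }

    /* host side, one thread per element:
       sigmoid<<<(n + 255) / 256, 256>>>(d_u, d_y, n); */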
Algorithms for the implementation of various neural networks can take advantage of parallel hardware architectures [4-6]. This stems from the concurrency inherently present in the neural network models themselves. Training of neural network models is computationally expensive and time consuming, especially in cases when large neural networks and/or large data sets of input-output samples are considered. However, each particular neural network model has its own characteristics, and even the same model with different parameters and training data sets may behave differently on the same parallel hardware. From this point of view, universal solutions are unattainable, although the applicable parallelization concepts remain very similar.
The recurrent neural network is one of the most general types of neural networks [7]. Feedback connections give the recurrent neural network the ability to memorize. A fully connected recurrent neural network, with the output of each neuron connected to all neurons, is presented in Fig. 1. Assume a recurrent neural network with m inputs and n neurons, the first l of them being connected to the
outputs. At time t, input sample x(t) together with the current outputs of the
neurons y(t) is presented to the neural network. For easier notation, input to
the neurons can be written as the vector z(t) = (x(t), y(t), 1) with m + n + 1
elements, the last element representing the bias. During the learning process,
knowledge is stored in the weights of the connections w_ij, where index i runs over the neurons and index j over the elements of the vector z. The output value of the i-th neuron is given by the non-linear sigmoid function of the weighted sum
y_i(t+1) = \frac{1}{1 + e^{-u_i(t)}} , \qquad u_i(t) = \sum_{j=1}^{m+n+1} w_{ij} z_j(t) . \quad (1)
The goal of the learning process is to minimize the total error accumulated over all T input-output samples of the training set,

E = \sum_{t=1}^{T} E(t) , \qquad E(t) = \frac{1}{2} \sum_{i=1}^{l} e_i(t)^2 , \quad (2)
with e_i(t) = d_i(t) − y_i(t) being the difference between the desired and the calculated value of the i-th neuron. Recurrent neural networks attempt to acquire the dynamics of the system, and therefore the input-output pairs should be presented in their causal order.
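A plain C sketch of one time step following (1) and (2); the function and variable names are illustrative assumptions, not the paper's code:

    #include <math.h>
    #include <stdlib.h>

    /* One time step of the fully connected recurrent neural network,
       following (1) and (2). w is row-major with n rows and m+n+1
       columns, x holds the m inputs, y the n neuron outputs, and d the
       l desired output values. Returns the squared error E(t). */
    float forward_step(const float *w, const float *x, float *y,
                       const float *d, int m, int n, int l)
    {
        int k = m + n + 1;
        float *z = malloc(k * sizeof *z);
        float *ynew = malloc(n * sizeof *ynew);
        float E = 0.0f;

        /* z(t) = (x(t), y(t), 1), the last element being the bias */
        for (int j = 0; j < m; j++) z[j] = x[j];
        for (int j = 0; j < n; j++) z[m + j] = y[j];
        z[m + n] = 1.0f;

        /* weighted sum and sigmoid, Eq. (1) */
        for (int i = 0; i < n; i++) {
            float u = 0.0f;
            for (int j = 0; j < k; j++) u += w[i * k + j] * z[j];
            ynew[i] = 1.0f / (1.0f + expf(-u));
        }
        for (int i = 0; i < n; i++) y[i] = ynew[i];

        /* squared error over the l output neurons, Eq. (2) */
        for (int i = 0; i < l; i++) {
            float e = d[i] - y[i];
            E += 0.5f * e * e;
        }

        free(z);
        free(ynew);
        return E;
    }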
Many algorithms for recurrent neural network learning are known [7]. The most common approaches apply gradient-based techniques such as back-propagation through time or real-time recurrent learning. The problem with both is the expensive computation of gradients and slow convergence when large recurrent neural networks are used. One alternative is learning with heuristic approaches that mimic the computation of gradients but have much smaller computational requirements. Such an algorithm is the linear reward-penalty algorithm, or LRP correction scheme, known from the field of learning automata [8].
The basic idea of the LRP correction scheme is to change the probabilities of possible changes in individual weights (actions), based on a given response from the environment. When an action is rewarded, its probability is increased. Conversely, when an action is penalized, its probability is decreased. To preserve the total probability of all actions, the probabilities of the non-selected actions are proportionally reduced in the first case and increased in the second case.
In neural network learning, an action represents a change of a single weight by a given value ∆w [9]. There are two actions associated with each weight: one increases the weight and the other decreases it. In the presented fully connected recurrent neural network there are N_w = n(m + n + 1) weights, leading to N_a = 2N_w possible actions, while the response of the environment in the LRP correction scheme is simply represented by the error function given in (2).
At the beginning of the learning process all actions have equal probabilities, p_k(0) = 1/N_a, k = 1, ..., N_a. During the learning process the probabilities and weights are updated according to the following scheme. Suppose that in learning step s the action a is rewarded. In this case the probabilities of the actions are updated as
p_k(s+1) = p_k(s) + \begin{cases} \lambda\,[1 - p_a(s)] , & k = a \\ -\lambda\, p_k(s) , & k \neq a \end{cases} \quad (3)

with λ, 0 < λ < 1, being the correction constant, and the weight change is
accepted. Conversely, when in learning step s the action a is penalized, the
corresponding weight is returned to the previous value and the probabilities of
actions become
p_k(s+1) = p_k(s) + \begin{cases} -\lambda\, p_a(s) , & k = a \\ \lambda \left[ \frac{1}{N_a - 1} - p_k(s) \right] , & k \neq a \end{cases} \quad (4)
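A sketch of the probability updates (3) and (4) in C; the function name and lambda are illustrative:

    /* LRP probability update following (3) and (4). Probabilities
       start at p_k(0) = 1/Na and sum to one after every update. */
    void lrp_update(float *p, int Na, int a, int rewarded, float lambda)
    {
        float pa = p[a];            /* probability of the chosen action */
        for (int k = 0; k < Na; k++) {
            if (rewarded)           /* Eq. (3): reinforce action a */
                p[k] += (k == a) ? lambda * (1.0f - pa)
                                 : -lambda * p[k];
            else                    /* Eq. (4): suppress action a */
                p[k] += (k == a) ? -lambda * pa
                                 : lambda * (1.0f / (Na - 1) - p[k]);
        }
    }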
Fig. 2. Pseudo-code of the LRP correction scheme for a recurrent neural network. Parts of the code suitable for parallelization are indicated by a number surrounded by two asterisks.

Fig. 3. Modified pseudo-code of the LRP correction scheme for a recurrent neural network. The for r loop, suitable for parallelization, is indicated by *4*.
In this case a suitable partitioning involves splitting the for r loop among P processes. Communication is basically needed only to obtain the cumulative error E. When the processes do not share memory, each process needs its own copy of the weights. Therefore, it is additionally necessary to update the weights after each iteration of the for s loop, as shown in the sketch below.
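A sketch of this communication pattern with MPI; the names and the helper eval are hypothetical, and the loop bodies follow the pseudo-code of Fig. 3:

    #include <mpi.h>

    /* Each process evaluates its slice of the for r loop; only the
       cumulative error E and the weights are exchanged. */
    float cumulative_error(float (*eval)(int), int r_first, int r_last,
                           float *weights, int Nw)
    {
        float Elocal = 0.0f, E = 0.0f;

        for (int r = r_first; r < r_last; r++)   /* this process's slice */
            Elocal += eval(r);

        /* sum the partial errors over all P processes */
        MPI_Allreduce(&Elocal, &E, 1, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        /* with distributed memory, each process holds its own copy of
           the weights, so they are re-synchronized (here from process 0)
           after each iteration of the for s loop */
        MPI_Bcast(weights, Nw, MPI_FLOAT, 0, MPI_COMM_WORLD);

        return E;
    }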
Fig. 4.
Experimental Work
In this section the performance of the proposed hardware architectures on the original and modified learning algorithms is assessed. In all cases the fully connected recurrent neural network was trained to identify an unknown discrete dynamic system, in our case a finite state machine which performs the time-delayed exclusive-or function xor(3), y(t) = x(t−2) ⊕ x(t−3). There are 1000 binary input-output samples in the training data set.
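A possible way to generate such a training set (a sketch; the random input sequence is an assumption, as the paper does not state how the inputs were produced):

    #include <stdlib.h>

    /* 1000 binary input-output samples of the time-delayed exclusive
       or, y(t) = x(t-2) XOR x(t-3). */
    #define T_SAMPLES 1000

    void make_xor3(float x[T_SAMPLES], float y[T_SAMPLES])
    {
        for (int t = 0; t < T_SAMPLES; t++) {
            x[t] = (float)(rand() % 2);        /* random binary input */
            y[t] = (t >= 3)
                 ? (float)((int)x[t - 2] ^ (int)x[t - 3])
                 : 0.0f;                       /* no history yet */
        }
    }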
Fig. 5. Scaling of computation time and speedup with the number of neurons.

On the GPU, the computation of (1) is performed
concurrently for all neurons, and therefore an approximately linear increase in
computation time with an increasing number of neurons is observed. The local
peaks at 400 and 800 neurons on the speedup curve are a consequence of the
GPU hardware architecture.
Although some speedup, defined as the ratio between standalone and parallel computation time, is observed in the case of cluster computing, the usage of the nodes is far from efficient, with less than 20% utilization. In addition, the linear dependence of the speedup on the number of neurons shows that the GPU is not fully utilized when the learning algorithm is run on small neural networks.
Fig. 6.
Conclusion
Neural networks offer a high degree of internal parallelism, which makes them perfect candidates for implementation on parallel hardware architectures. This paper compared two affordable parallel hardware architectures on the problem of learning in a fully connected recurrent neural network.
The presented results show that computing clusters provide very limited speedup when the internal structure of the neural network is parallelized. Results are far more promising when the processing is performed in batches rather than online. The development of graphics processing units now offers highly parallel hardware platforms to users. The performance of graphics processing units improves with an increasing number of concurrent operations, and they therefore represent a perfect target platform for neural network computation. Their main drawbacks are computation in single-precision floating point and development tools that effectively require the user to understand the particularities of the hardware in order to benefit from it.
References
1. Quinn, M.: Parallel Programming in C with MPI and OpenMP. McGraw Hill,
Boston (2003)
2. Halfhill, T.R.: Parallel Processing with CUDA. Microprocessor Report, http://www.MPRonline.com (2008)
3. Nvidia: Nvidia CUDA Compute Unified Device Architecture, Programming Guide, Version 1.1. http://nvidia.com/cuda (2007)
4. Seiffert, U.: Artificial neural networks on massively parallel computer hardware. In: ESANN 2002 Proceedings, Bruges, Belgium (2002) 319-330
5. Lotrič, U., Dobnikar, A.: Parallel implementations of feed-forward neural network using MPI and C# on .NET platform. In: Ribeiro, B., et al. (eds.): Adaptive and Natural Computing Algorithms: Proceedings of the International Conference in Coimbra, Portugal (2005) 534-537
6. Catanzaro, B., Sundaram, N., Keutzer, K.: Fast support vector machine training and classification on graphics processors. In: McCallum, A., Roweis, S. (eds.): Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland (2008) 104-111
7. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice-Hall, New Jersey (1999)
8. Narendra, K., Thathachar, M.A.L.: Learning Automata: An Introduction. Prentice-Hall, New Jersey (1989)
9. Šter, B., Gabrijel, I., Dobnikar, A.: Impact of learning on the structural properties of neural networks. Lect. Notes Comput. Sci., Part 2, Springer, Vienna (2007) 63-70
10. Deino Software: DeinoMPI - High Performance Parallel Computing for Windows.
http://mpi.deino.net (2008)