
Parallel implementations of recurrent neural network learning

Uroš Lotrič, Andrej Dobnikar

Faculty of Computer and Information Science,
University of Ljubljana, Slovenia
{uros.lotric, andrej.dobnikar}@fri.uni-lj.si

Abstract. Neural networks have proved to be effective in solving a wide range of problems. As problems become more and more demanding, they require larger neural networks, and the time used for learning is consequently greater. Parallel implementations of learning algorithms are therefore vital for a useful application. Implementation, however, strongly depends on the features of the learning algorithm and the underlying hardware architecture. For this experimental work a dynamic problem was chosen which implicates the use of recurrent neural networks and a learning algorithm based on the paradigm of learning automata. Two parallel implementations of the algorithm were applied: one on a computing cluster using the MPI and OpenMP libraries and one on a graphics processing unit using the CUDA library. The performance of both parallel implementations justifies the development of parallel algorithms.

Keywords: Neural networks, Cluster computing, GPU computing

1 Introduction

In recent years the commercial computer industry has been undergoing a massive
shift towards parallel and distributed computing. This shift was mainly initiated
by the current limitations of semiconductor manufacturing. New developments are also reflected in the areas of computationally intensive applications: by fully exploiting the capabilities of the underlying hardware architecture, impressive enhancements in algorithm performance can be achieved with a low to moderate investment of time and money.
Today, clusters of loosely coupled desktop computers represent an extremely
popular infrastructure for implementation of parallel algorithms. Processes running on computing nodes in the cluster communicate with each other through
messages. The message passing interface (MPI) is a standardized and portable
implementation of this concept, providing several abstractions that simplify the
use of parallel computers with distributed memory [1].
Recently, the development of powerful graphics processing units (GPUs) has
made high-performance parallel computing possible by using commercial discrete
general-purpose graphics cards [2]. There are many technologies, among which

Uro Lotri, Andrej Dobnikar

Nvidia's compute unied device architecture (CUDA) is the most popular [3]. It
includes C/C++ software development tools, function libraries and a hardware
abstraction mechanism that hides GPU hardware architecture from developers.
Algorithms for implementation of various neural networks can take advantage of parallel hardware architectures [4-6]. This comes from the concurrency inherently present in the neural network models themselves. Training of neural network models is computationally expensive and time consuming, especially in cases when large neural networks and/or large data sets of input-output samples are considered. However, each particular neural network model has its own characteristics, and even the same model with different parameters and training data sets may lead to different behaviors on the same parallel hardware. From that point of view, finding universal solutions is impossible, although the applicable parallelization concepts become very similar.

2 Fully Connected Recurrent Neural Network

The recurrent neural network is one of the most general types of neural networks [7]. Feed-back connections enable the recurrent neural network to memorize. A fully connected recurrent neural network with outputs from each neuron
connected to all neurons is presented in Fig. 1.

Fig. 1. Fully connected recurrent neural network with m = 2 inputs, n = 5 neurons and l = 3 outputs.

Assume a recurrent neural network with m inputs and n neurons, the first l of them being connected to the
outputs. At time t, input sample x(t) together with the current outputs of the
neurons y(t) is presented to the neural network. For easier notation, input to
the neurons can be written as the vector z(t) = (x(t), y(t), 1) with m + n + 1
elements, the last element representing the bias. During the learning process,
knowledge is stored in the weights on connections wij, where index i runs over the neurons and index j over the elements of the vector z. The output value of
i-th neuron is given by the non-linear sigmoid function of the weighted sum
\[
  y_i(t+1) = \frac{1}{1 + e^{-u_i(t)}}\,, \qquad
  u_i(t) = \sum_{j=1}^{m+n+1} w_{ij}\, z_j(t)\,.
  \tag{1}
\]

The objective of a neural network learning algorithm is to find a set of weights


that minimizes the error function on the given data set of input-output samples
(x(t), d(t)), t = 0, . . . , T ,
\[
  E = \sum_{t=1}^{T} E(t)\,, \qquad
  E(t) = \frac{1}{2} \sum_{i=1}^{l} e_i(t)^2
  \tag{2}
\]

with ei(t) = di(t) − yi(t) being the difference between the desired and calculated value of the i-th neuron. The recurrent neural networks attempt to acquire the dynamics of the system, and therefore input-output pairs should be presented in causal order.
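To make the forward computation in (1) and the error accumulation in (2) concrete, the following C++ sketch implements one step of the fully connected recurrent network and the total error over a data set. It is an illustrative reconstruction under the definitions above, not the authors' code; the names Rnn, forward_step and total_error are hypothetical.

#include <cmath>
#include <vector>

struct Rnn {
    int m, n, l;              // inputs, neurons, output neurons
    std::vector<double> w;    // weights, n x (m + n + 1), row-major
};

// Computes y(t+1) from the input sample x(t) and the previous outputs y(t), following (1).
std::vector<double> forward_step(const Rnn& net, const std::vector<double>& x,
                                 const std::vector<double>& y) {
    const int width = net.m + net.n + 1;
    std::vector<double> y_next(net.n);
    for (int i = 0; i < net.n; ++i) {
        double u = 0.0;
        for (int j = 0; j < net.m; ++j) u += net.w[i * width + j] * x[j];
        for (int j = 0; j < net.n; ++j) u += net.w[i * width + net.m + j] * y[j];
        u += net.w[i * width + net.m + net.n];      // bias term, z_{m+n+1} = 1
        y_next[i] = 1.0 / (1.0 + std::exp(-u));     // sigmoid of the weighted sum
    }
    return y_next;
}

// Accumulates the error (2) over all samples; d[t] holds the l desired outputs.
double total_error(const Rnn& net, const std::vector<std::vector<double>>& x,
                   const std::vector<std::vector<double>>& d) {
    std::vector<double> y(net.n, 0.0);              // y(0) = 0
    double E = 0.0;
    for (std::size_t t = 0; t < x.size(); ++t) {
        y = forward_step(net, x[t], y);
        for (int i = 0; i < net.l; ++i) {
            const double e = d[t][i] - y[i];
            E += 0.5 * e * e;
        }
    }
    return E;
}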
Many algorithms for recurrent neural network learning are known [7]. The
most standard approaches apply gradient-based techniques such as back propagation through time or real time recurrent learning. The problem of both is
expensive computation of gradients and slow convergence when large recurrent
neural networks are applied. One of the alternatives is learning with heuristic approaches that mimic the computation of gradients but with much smaller computational requirements. Such an algorithm is the linear reward-penalty algorithm, or LRP correction scheme, known from the field of learning automata [8].
The basic idea of the LRP correction scheme is to change the probabilities
of possible changes in individual weights (actions), based on a given response
from the environment. When an action is rewarded, its probability is increased.
Contrarily, when an action is penalized, its probability is decreased. To preserve
the total probability of all actions, the probabilities of non-selected actions are
proportionally reduced in the rst case and increased in the second case.
In neural network learning, an action represents a change in a single weight
for a given value Δw [9]. There are two actions associated with each weight:
one increases the weight and the other decreases it. In the presented fully connected recurrent neural network there are Nw = n(m + n + 1) weights leading
to Na = 2Nw possible actions, while the response of the environment from the
LRP correction scheme is simply represented by the error function given in (2).
At the beginning of the learning process all actions have equal probabilities,
pk (0) = 1/Na , k = 1, . . . , Na . During the learning process the probabilities and
weights are updated according to the following scheme. Suppose that in learning
step s the action a is rewarded. In this case the probabilities of actions are
updated as

\[
  p_k(s+1) = p_k(s) +
  \begin{cases}
    \lambda\,[1 - p_a(s)]\,, & k = a \\
    -\lambda\, p_k(s)\,,     & k \neq a
  \end{cases}
  \tag{3}
\]


with λ, 0 < λ < 1, being the correction constant, and the weight change is
accepted. Conversely, when in learning step s the action a is penalized, the
corresponding weight is returned to the previous value and the probabilities of
actions become
\[
  p_k(s+1) = p_k(s) +
  \begin{cases}
    -\lambda\, p_a(s)\,,                                 & k = a \\
    \lambda \left[\dfrac{1}{N_a - 1} - p_k(s)\right]\,,  & k \neq a
  \end{cases}
  \tag{4}
\]
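A minimal C++ sketch of the update rules (3) and (4) is given below. It assumes a single correction constant lambda, as in the text; the function name update_probabilities is illustrative, not the authors' routine.

#include <vector>

// Applies scheme (3) when the chosen action a is rewarded and scheme (4) when it is penalized.
void update_probabilities(std::vector<double>& p, int a, bool rewarded, double lambda) {
    const int Na = static_cast<int>(p.size());
    for (int k = 0; k < Na; ++k) {
        if (rewarded) {                                   // scheme (3)
            if (k == a) p[k] += lambda * (1.0 - p[a]);
            else        p[k] -= lambda * p[k];
        } else {                                          // scheme (4)
            if (k == a) p[k] -= lambda * p[a];
            else        p[k] += lambda * (1.0 / (Na - 1) - p[k]);
        }
    }
}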

3 Exploiting Concurrency in Training Algorithm

Algorithms can be efficiently parallelized by following the methodology proposed


by Ian Foster [1]. It consists of four design steps: partitioning, communication,
agglomeration and mapping. The focus of the first two is to find as much concurrency as possible, while the latter two consider the requirements of the underlying
hardware architecture. In the partitioning step, the data and/or computations
are divided into small tasks that can be computed in parallel. In the communication step, data that has to be passed between tasks is identified. Communication
represents the overhead of parallel designs and should be kept as low as possible.
In the agglomeration step, small tasks are grouped into agglomerated tasks to
improve performance, mainly by reducing communication. In the mapping step
the agglomerated tasks are assigned to the processing units. Usually there are
as many agglomerated tasks as there are independent processing units.
The pseudo-code of the learning algorithm for the recurrent neural network based on the LRP correction scheme is given in Fig. 2.
randomly initialize neural network weights wij
initialize probabilities for actions pk(0)
for s ← 1 to S do
    randomly choose action a and adequately change the corresponding weight
    calculate the response of the environment
        initialize variables: Eold ← E, E ← 0, y(0) ← 0
        for t ← 1 to T do                  // over all input-output samples *3*
            for i ← 1 to n do              // over all neurons *2*
                calculate yi(t + 1)
            end for i
            update error, E ← E + E(t)
        end for t
    update probabilities, if E < Eold use (3) else (4)
    for k ← 1 to Na do                     // over all actions *1*
        update probability pk(s + 1)
    end for k
end for s

Fig. 2. Pseudo-code of the LRP correction scheme for a recurrent neural network. Parts of the code suitable for parallelization are indicated by a number surrounded by two asterisks.


The most obvious portion of code suitable for parallelization is the updating of probabilities, identified by *1* in Fig. 2. The for k loop can be partitioned into Na small tasks, each
of them responsible for updating one probability, either by (3) or (4). Small
tasks only need to send their results to the task that chooses a new action. It is
also straightforward to parallelize the propagation of signals through the neural
network. The for i loop, identified by *2* in Fig. 2, can also be split into n small tasks, each calculating the output of one neuron following (1). However, the result yi(t + 1) obtained by each task must be broadcast to all other tasks in order to make the calculation of neuron outputs in the next time step possible. In both identified cases, the computation is not very time demanding; therefore, fast communication is the key issue for successful parallelization.
While using intra-processor communication can still be profitable in the specified situations, inter-processor communication is certainly too slow. In cases of
slow communication between processors the computation time of each task must
be large compared to the time needed for communication. Unfortunately, the
given algorithm does not exhibit such concurrency.
In cases where the number of concurrent processes is small compared to the
number of input-output samples T, a slight modification of the learning algorithm leads to an efficient parallelization, also for systems with slow inter-processor
communication. More precisely, instead of parallelizing the for k and for i
loops, one can decide to parallelize the for t loop, marked *3* in Fig. 2. The
causality between consecutive input-output samples in the for t loop prevents
one from directly parallelizing it. Parallelization is only possible if the data set
of T input-output samples is split into P parts of approximately T /P samples,
on which the response of the environment can be calculated separately and
afterwards brought together. Instead of a single initialization of the vector y
at the beginning of the response calculation in Fig. 2, additional initializations
are needed for each part separately, which causes transitional phenomena on the
outputs of the neurons. The modified portion of the code is presented in Fig. 3.

calculate the response of the environment
    initialize variables: Eold ← E, E ← 0
    for r ← 1 to P do                      // over all P parts *4*
        initialize variables: y(⌊(r − 1)T/P⌋) ← 0
        for t ← ⌊(r − 1)T/P⌋ + 1 to ⌊rT/P⌋ do
            for i ← 1 to n do
                calculate yi(t + 1)
            end for i
            update error, E ← E + E(t)
        end for t
    end for r

Fig. 3. Modified pseudo-code of the LRP correction scheme for a recurrent neural network. The for r loop, suitable for parallelization, is indicated by *4*.


In this case suitable partitioning involves splitting the for r loop into P
processes. Communication is basically needed only to get the cumulative error
E. When the processes do not share memory, each process needs its own copy of
the weights. Therefore, it is additionally necessary to update the weights after
each iteration of the for s loop.
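A possible shared-memory realization of the for r loop is sketched below with OpenMP, reusing the Rnn structure and forward_step function from the earlier sketch. Each thread processes its own part of roughly T/P consecutive samples with a freshly zeroed output vector, and the partial errors are combined by the reduction clause. This is an illustrative sketch under those assumptions, not the authors' implementation.

#include <omp.h>
#include <vector>

double response(const Rnn& net, const std::vector<std::vector<double>>& x,
                const std::vector<std::vector<double>>& d, int P) {
    const int T = static_cast<int>(x.size());
    double E = 0.0;
    #pragma omp parallel for reduction(+:E) num_threads(P)
    for (int r = 0; r < P; ++r) {
        std::vector<double> y(net.n, 0.0);       // separate initialization for each part
        const int t_begin = r * T / P;
        const int t_end   = (r + 1) * T / P;
        for (int t = t_begin; t < t_end; ++t) {
            y = forward_step(net, x[t], y);      // inner loop over neurons
            for (int i = 0; i < net.l; ++i) {
                const double e = d[t][i] - y[i];
                E += 0.5 * e * e;
            }
        }
    }
    return E;
}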

4 Parallel Hardware Architectures

Parallelization of the recurrent network learning algorithm based on the LRP


correction scheme was examined on two distributed hardware platforms: on a
computing cluster and on graphics processing units.

4.1 Commodity Computing Cluster


Currently, the most popular and affordable parallel computers are clusters of
commodity desktop computers. Processes running on the computing nodes in a
cluster communicate with each other through messages. The Message Passing
Interface (MPI), the standardized and portable implementation of communication through messages, is most commonly used to make parallelization on such
systems feasible. Unfortunately, commodity clusters are typically not balanced
between computation speed and communication speed: the communication network is usually quite slow compared to the speed of the processors. Therefore,
in the process of parallel algorithm design it is important to be aware of slow
communication.
In the present work, a commodity cluster composed of four nodes, each having
an Intel Core Duo 6700 processor running at 2.66 GHz and 2 GB of RAM, is
used. The nodes are connected over a 1 Gb Ethernet switch, as shown in Fig. 4a.
The DeinoMPI [10] implementation of the MPI standard [1] is used on the Windows
XP operating system. The application utilizes MPI and OpenMP [1] function
calls.
Many modern commodity clusters are made of dual-core or even quad-core
multiprocessors. The MPI standard supports communication between processing cores inside the same multiprocessor in the same way as between processors
belonging to distinct computers. In this case the interaction between MPI processes running on the same multiprocessor will happen via message passing.
Some additional time can be gained by using only one MPI process per multiprocessor and, within this process, forking threads to occupy unused cores.
OpenMP is a standardized software library that supports such thread creation
and interaction among cores via the concept of shared variables. Because of the lower communication overhead, forking threads with OpenMP function calls inside multiprocessors is preferable to a pure MPI implementation and usually leads to faster programs.
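The hybrid organization described above can be sketched as follows: one MPI process per node evaluates the error of its local share of the samples (for example with the OpenMP-parallel response() from the previous sketch running on its cores), and the partial errors are combined with MPI_Allreduce. The routine cluster_response and the data layout are illustrative assumptions, not the paper's actual code.

#include <mpi.h>
#include <vector>

double cluster_response(const Rnn& net,
                        const std::vector<std::vector<double>>& x_local,
                        const std::vector<std::vector<double>>& d_local,
                        int threads_per_node) {
    // Each process keeps its own copy of the weights and its own slice of the
    // T samples, so only the scalar error needs to be communicated.
    double E_local = response(net, x_local, d_local, threads_per_node);
    double E_total = 0.0;
    MPI_Allreduce(&E_local, &E_total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return E_total;
}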


Fig. 4. Hardware architectures of a) a computing cluster and b) a graphics processing unit.

4.2 Graphics Processing Units


Graphics processing units (GPUs) are nowadays extending their initial role as specialized 2D and 3D graphics accelerators to high-performance computing devices designed for computing-intensive and memory-intensive highly parallel computation. Nvidia's compute unified device architecture (CUDA) is the most popular library, also featuring development tools.
CUDA represents GPUs as computing devices capable of executing a very
large number of threads in parallel. For example, the Nvidia GeForce 8800GT
GPU, used in our further experiments, consists of 14 multiprocessors, which can
use 1 GB of device memory. Each multiprocessor consists of 8 scalar processors
with 16 kB of shared memory and 8192 32-bit registers, allowing computation in single-precision floating point. Its architecture is schematically represented in
Fig. 4b.
GPUs feature memory access bandwidth an order of magnitude higher than
ordinary CPUs. For example, when there is no conflict, the shared memory inside the multiprocessor can be accessed as quickly as a register. Despite the very high bandwidth, access to the device memory suffers from high latency, measured in hundreds of GPU cycles. In addition, CUDA features texture memory. Although the shared memory and the device memory are not cached, the texture memory is. Reading data from the texture memory instead of the device memory can thus result in performance benefits.
In our work a desktop computer with an Intel Core Duo 8400 Processor and
4 GB of RAM with 64-bit Windows XP installed hosts two Nvidia GeForce 8800
GT graphics processing units.
According to the CUDA programming model the computation is organized
into grids, which are executed sequentially. Each grid is organized as a set of
thread blocks, in which threads are executed concurrently and can cooperate by efficiently sharing data inside a multiprocessor. A maximum of 512 threads can run in parallel in each thread block. Unfortunately, threads in different blocks of the same grid cannot communicate and synchronize with each
other. Moreover, thread blocks of the same grid have the same size and their
threads execute the same kernel. A kernel is a portion of an application, a function, that is executed on the GPU. It is coded in annotated C/C++ language
with CUDA extensions.
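As an illustration, a CUDA kernel computing the neuron outputs of (1) with one thread per neuron, corresponding to the for i loop marked *2* in Fig. 2, could look like the sketch below. The kernel and buffer names are assumptions; the paper's actual kernels are not reproduced here.

__global__ void neuron_outputs(const float* w,   // n x (m + n + 1) weights
                               const float* z,   // concatenated (x(t), y(t), 1)
                               float* y_next,    // outputs y(t + 1)
                               int n, int width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float u = 0.0f;
        for (int j = 0; j < width; ++j)
            u += w[i * width + j] * z[j];
        y_next[i] = 1.0f / (1.0f + expf(-u));    // single-precision sigmoid
    }
}

// Host-side launch, e.g. with 256 threads per block:
//   int threads = 256, blocks = (n + threads - 1) / threads;
//   neuron_outputs<<<blocks, threads>>>(d_w, d_z, d_y_next, n, m + n + 1);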
GPU performance critically depends on finding high degrees of parallelism. A typical application running on the GPU must express thousands of threads in order to effectively use the underlying hardware. In practice, a large number of thread blocks is needed to ensure that the computing power of the GPU is efficiently utilized [3]. Utilization of the GPU heavily depends on the
size of the global data set, the maximum amount of local data in multiprocessors
that threads in thread blocks can share, the number of thread processors in the
GPU, the number of registers each thread requires, as well as the sizes of the
GPU local memories. When analyzing an algorithm and data, a programmer has
to be aware of the underlying hardware in order to find the optimal number of
threads and blocks that will keep the GPU fully utilized.

5 Experimental Work

In this section the performance of the proposed hardware architectures on the original and modified learning algorithms is assessed. In all cases the fully connected recurrent neural network was trained to identify an unknown discrete dynamic system, in our case a finite state machine which performs the time-delayed exclusive-or xor(3) function, y(t) = x(t − 2) ⊕ x(t − 3). There are 1000 binary
input-output samples in the training data set.
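For reference, such a training set can be generated as in the short sketch below: random binary inputs and the delayed target y(t) = x(t−2) xor x(t−3). The function name and the handling of the first few samples (undefined history set to zero) are illustrative assumptions.

#include <cstdlib>
#include <vector>

void make_xor3_data(int T, std::vector<int>& x, std::vector<int>& d) {
    x.resize(T);
    d.resize(T);
    for (int t = 0; t < T; ++t)
        x[t] = std::rand() % 2;                       // random binary input
    for (int t = 0; t < T; ++t)
        d[t] = (t >= 3) ? (x[t - 2] ^ x[t - 3]) : 0;  // y(t) = x(t-2) xor x(t-3)
}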

5.1 Original LRP correction scheme


In this case only the for loops indicated by *1* and *2* in Fig. 2 were parallelized. In the case of the computing cluster, the results are given only for the
setup in which all four nodes were utilized. Communication between nodes was
performed using the MPI library, while parallelization inside the node was done
using pragma directives of the OpenMP standard. Source code was compiled
using a Microsoft C/C++ compiler. On the other hand, only one GPU was used
to parallelize the original algorithm.
Processing times, normalized to 1000 iterations, and speedups of both architectures are given in Fig. 5 for a range of neural network sizes. For comparison,
the processing times of the standalone application, exploiting only one core of
the Intel Core Duo 6700 processor, are presented.
It is obvious that the cluster is not appropriate for parallelization of the
original LRP correction scheme, since communication overwhelms computation
by a large margin. A linear relationship between processing time and the number of neurons is expected, since the length of the messages increases linearly with the number of neurons.

Fig. 5. Performance of the original LRP correction scheme as a function of the number of neurons on different architectures: a) processing time and b) speedup.

On the GPU, the computation of (1) is performed
concurrently for all neurons, and therefore an approximately linear increase in
computation time with an increasing number of neurons is observed. The local peaks at 400 and 800 neurons on the speedup curve are a consequence of the
GPU hardware architecture.
Although some speedup, defined as the ratio between standalone and parallel computation time, is observed in the case of cluster computing, the usage of nodes is far from efficient, with less than 20% utilization of nodes. In addition,
the linear dependence of the speedup on the number of neurons shows that the
GPU is not fully utilized when the learning algorithm is running on small neural
networks.

5.2 Modified LRP correction scheme


Due to the unpromising parallelization of the original LRP correction scheme,
only the for r loop marked by *4* in Fig. 3 was parallelized on the computing
cluster in this case. On the other hand, the second GPU was utilized for parallelization of the for r loop in Fig. 3, while inside each GPU the parallelization
scheme from the original algorithm was further used. Processing times for both
parallel architectures and the standalone application are given in Fig. 6a.
When parallelizing the for r loop, far more time is spent in processing than
in communication. Therefore, the processing on the cluster of four dual-core nodes is sped up by approximately a factor of eight.

Fig. 6. Performance of the modified LRP correction scheme as a function of the number of neurons on different architectures: a) processing time and b) speedup.

On the GPUs a similar dependence is observed as in the case of the original LRP correction scheme,
except that the processing times are approximately halved whilst the speedups
are doubled.

6 Conclusion

Neural networks offer a high degree of internal parallelism, which makes them perfect candidates for implementation on parallel hardware architectures. This paper compared two affordable parallel hardware architectures on the problem
of learning in a fully connected recurrent neural network.
The presented results show that computing clusters provide a very limited
speedup when parallelizing the internal structure of the neural network. Results
are far more promising when the processing is performed in batches and not
online. The development of graphics processing units now offers highly parallel hardware platforms to users. The performance of graphics processing units improves with an increasing number of concurrent operations, and therefore they represent a perfect target platform for neural network computation. Their main drawbacks are computation in single-precision floating point and the development tools, which to some extent require that the user understands the particularities of the hardware in order to benefit from it.

References
1. Quinn, M.: Parallel Programming in C with MPI and OpenMP. McGraw Hill, Boston (2003)
2. Halfhill, T.R.: Parallel Processing With CUDA. Microprocessor Report, http://www.MPRonline.com (2008)
3. Nvidia: Nvidia CUDA Compute Unified Device Architecture, Programming Guide, Version 1.1. http://nvidia.com/cuda (2007)
4. Seiffert, U.: Artificial neural networks on massively parallel computer hardware. ESANN 2002 proceedings, Bruges, Belgium (2002) 319-330
5. Lotrič, U., Dobnikar, A.: Parallel implementations of feed-forward neural network using MPI and C# on .NET platform. In: Ribeiro, B. et al. (eds.), Adaptive and Natural Computing Algorithms: Proceedings of the International Conference in Coimbra, Portugal (2005) 534-537
6. Catanzaro, B., Sundaram, N., Keutzer, K.: Fast support vector machine training and classification on graphics processors. In: McCallum, A., Roweis, S. (eds.), Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland (2008) 104-111
7. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice-Hall, New Jersey (1999)
8. Narendra, K., Thathachar, M.A.L.: Learning Automata: An Introduction. Prentice-Hall, New Jersey (1989)
9. Šter, B., Gabrijel, I., Dobnikar, A.: Impact of learning on the structural properties of neural networks. Lecture Notes in Computer Science, part 2, Springer, Vienna (2007) 63-70
10. Deino Software: DeinoMPI - High Performance Parallel Computing for Windows. http://mpi.deino.net (2008)
