
Efficient Parallel Implementation of Evolutionary Algorithms on GPGPU Cards

Ogier Maitre¹, Nicolas Lachiche¹, Philippe Clauss¹, Laurent Baumes², Avelino Corma², Pierre Collet¹
{maitre,lachiche,clauss,collet}@lsiit.u-strasbg.fr
{baumesl,acorma}@itq.upv.es
¹ LSIIT, University of Strasbourg, France
² Instituto de Tecnologia Quimica UPV-CSIC, Valencia, Spain

Abstract. A parallel solution to the implementation of evolutionary algorithms is proposed, where the most costly part of the whole evolutionary algorithm computation (the population evaluation) is deported to a GPGPU card. Experiments are presented for two benchmark examples on two models of GPGPU cards: first a "toy" problem is used to illustrate some noticeable behaviour characteristics, before a real problem is tested out. Results show a speed-up of up to 100 times compared to an execution on a standard micro-processor. To our knowledge, this solution is the first showing such efficiency with GPGPU cards. Finally, the EASEA language and its compiler are also extended to allow users to easily specify and generate efficient parallel implementations of evolutionary algorithms using GPGPU cards.

1 Introduction

Among the manycore architectures available nowadays, GPGPU (General Purpose Graphic Processing Unit) cards offer one of the most attractive cost/performance ratios. However, programming such machines is a difficult task. This paper focuses on a specific kind of resource-consuming application: evolutionary algorithms. It is well known that such algorithms offer efficient solutions to many optimization problems, but they usually require a great number of evaluations, making processing power a limit on standard micro-processors. However, their algorithmic structure clearly exhibits resource-costly computation parts that can be naturally parallelized. GPGPU programming constraints nevertheless require dedicated operations for efficient parallel execution, one of the main performance-relevant constraints being the time needed to transfer data from the host memory to the GPGPU memory.
This paper starts by presenting evolutionary algorithms, and studying them
to determine where parallelization could take place. Then, GPGPU cards are
presented in section 3, and a proposition on how evolutionary algorithms could
be parallelized on such cards is made and described in section 4. Experiments are
made on two benchmarks and two NVidia cards in section 5 and some related
works are described in section 7. Finally, results and future developments are
discussed in the conclusion.
2 Presentation of Evolutionary Algorithms

In [5], Darwin suggests that species evolve through two main principles: variation
in the creation of new children (that are not exactly like their parents) and
survival of the fittest, as many more individuals of each species are born than
can possibly survive.
Evolutionary Algorithms (EAs) [9] get their inspiration from this paradigm
to suggest a way to solve the following interesting question. Given:

1. a difficult problem for which no computable way of finding a good solution is known and where a solution is represented as a set of parameters,
2. a limited record of previous trials that have all been evaluated.

How can one use the accumulated knowledge to choose a new set of parameters to try out (and therefore do better than a random search)? EAs rely on artificial Darwinism to do just that: they create new potential solutions from variations on good individuals, and keep a constant population size through selection of the best solutions. The Darwinian inspiration for this paradigm leads to borrowing
some specific vocabulary from biology: given an initial set of evaluated potential
solutions (called a population of individuals), parents are selected among the best
to create children thanks to genetic operators (that Darwin called “variation”
operators), such as crossover and mutation. Children (new potential solutions)
are then evaluated and from the pool of parents and children, a replacement
operator selects those that will make it to the new generation before the loop is
started again.
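The loop just described can be sketched in host-side C++ as follows. This is a minimal illustration only, not the implementation used in the paper: the real-valued genome, the sphere fitness, the 2-tournament selection and all names (`evaluate`, `tournament`, `evolve`) are hypothetical choices made for the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Minimal sketch of the generic evolutionary loop of figure 1.
struct Individual {
    std::vector<float> genome;
    float fitness;
};

// Hypothetical problem-specific fitness: sphere function (to be minimised).
float evaluate(const Individual& ind) {
    float sum = 0.f;
    for (float x : ind.genome) sum += x * x;
    return sum;
}

// 2-tournament selection: the better of two random parents.
const Individual& tournament(const std::vector<Individual>& pop) {
    const Individual& a = pop[std::rand() % pop.size()];
    const Individual& b = pop[std::rand() % pop.size()];
    return (a.fitness < b.fitness) ? a : b;
}

std::vector<Individual> evolve(int popSize, int dim, int generations) {
    // Initialisation: every individual is created independently.
    std::vector<Individual> pop(popSize);
    for (auto& ind : pop) {
        ind.genome.resize(dim);
        for (float& x : ind.genome) x = 2.f * std::rand() / RAND_MAX - 1.f;
        ind.fitness = evaluate(ind);      // the costly, parallelisable step
    }
    for (int g = 0; g < generations; g++) {
        // Variation: children are created independently from selected parents.
        std::vector<Individual> children(popSize);
        for (auto& child : children) {
            const Individual& p1 = tournament(pop);
            const Individual& p2 = tournament(pop);
            child.genome.resize(dim);
            for (int i = 0; i < dim; i++)  // uniform crossover + small mutation
                child.genome[i] = ((std::rand() & 1) ? p1 : p2).genome[i]
                                  + 0.01f * (2.f * std::rand() / RAND_MAX - 1.f);
            child.fitness = evaluate(child);  // parallelisable as well
        }
        // Replacement: keep the popSize best of parents + children (sequential).
        pop.insert(pop.end(), children.begin(), children.end());
        std::sort(pop.begin(), pop.end(),
                  [](const Individual& a, const Individual& b) {
                      return a.fitness < b.fitness;
                  });
        pop.resize(popSize);
    }
    return pop;
}
```

Note that the replacement step above is the only one kept sequential, matching the analysis of this section.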

Fig. 1. Generic evolutionary loop.


2.1 Parallelization of a generic evolutionary algorithm

The algorithm presented on figure 1 contains several steps that may or may not
be independent. To start with, population initialisation is inherently parallel,
because all individuals are created independently (usually with random values).
Then, all newly created individuals need to be evaluated. But since they
are all evaluated independently using a fitness function, evaluation of the pop-
ulation can be done in parallel. It is interesting to note that in evolutionary
algorithms, evaluation of individuals is usually the most CPU-consuming step
of the algorithm, due to the high complexity of the fitness function.
Once a parent population has been obtained (by evaluating all the individu-
als of the initial population), one needs to create a new population of children. In
order to create a child, it is necessary to select some parents on which variation
operators (crossover, mutation) will be applied. In evolutionary algorithms, se-
lection of parents is also parallelizable because one parent can be selected several
times, meaning that independent selectors can select whoever they wish without
any restrictions.
Creation of a child out of the selected parents is also a totally independent
step: a crossover operator needs to be called on the parents, followed by a mu-
tation operator on the created child.
So up to now, all steps of the evolutionary loop are inherently parallel but
for the last one: replacement. In order to preserve diversity in the successive
generations, the (N + 1)-th generation is created by selecting some of the best
individuals of the parents+children populations of generation N. However, if an individual is allowed to appear several times in the new generation, it could rapidly become predominant in the population, thereby inducing a loss of diversity that would reduce the exploratory power of the algorithm.
Therefore, evolutionary algorithms impose that all individuals of the new
generation are different. This is a real restriction on parallelism, since it means
that the selection of N survivors cannot be made independently, otherwise a
same individual could be selected several times by several independent selectors.
Finally, one could wonder whether several generations could evolve in parallel.
The fact that generation (N + 1) is based on generation N invalidates this idea.

3 GPGPU architecture

GPGPU and classic CPU designs are very different. GPGPUs come from the gaming industry and are designed for 3D rendering, and they inherit specific features from this usage. For example, they feature several hundred execution units grouped into SIMD bundles that have access to a small amount of shared memory (16KB on the NVidia 8800GTX that was used for this paper), a large memory space (several hundred megabytes), a special access mode for texture memory and a hardware scheduling mechanism.
The 8800GTX GPGPU card features 128 stream processors (compared to
4 general purpose processors on the Intel Quad Core) even though both chips
share a similar number of transistors (681 million for the 8800GTX vs 582 million
for the Intel Quad Core). This can be done thanks to a simplified architecture that has some serious drawbacks. For instance, not all stream processors are independent: they are grouped into SIMD bundles (16 SPMD bundles of 8 SIMD units on the 8800GTX, which saves 7 fetch and dispatch units). Moreover, space-consuming cache memory is simply not available on GPGPUs, meaning that all memory accesses (which take only a few cycles on a CPU if the data is already in the cache) cost several hundred cycles.
Fortunately, some workarounds are provided. For instance, the hardware scheduling mechanism runs a bundle of threads called a warp at the same time, swapping between warps as soon as a thread of the current warp stalls on a memory access, so memory latency can be hidden by warp scheduling. But there is a limit to what can be done: it is important to have enough parallel tasks to be scheduled while waiting for the memory. A thread's state is not saved into memory: it stays on the execution unit (as in the hyper-threading mechanism), so the number of registers used by a task directly limits the number of tasks that can be scheduled on a bundle of stream processors. Finally, there is a limit on the number of schedulable warps (24 warps, i.e. 768 threads, on the 8800GTX).
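As a rough illustration of how register usage caps scheduling, the following sketch estimates the number of schedulable threads on one bundle, using the 8192 registers and 768-thread limit quoted in this paper for the 8800GTX. The formula is a deliberate simplification: real CUDA occupancy rules also round register allocations per warp.

```cpp
#include <algorithm>
#include <cassert>

// Simplified occupancy estimate for one bundle (multiprocessor) of an
// 8800GTX-class card: 8192 registers and at most 768 schedulable threads.
// This only illustrates why per-thread register usage limits the number
// of tasks; actual hardware allocation granularity is coarser.
int schedulableThreads(int regsPerThread,
                       int regsPerBundle = 8192,
                       int maxThreads    = 768) {
    int regLimited = regsPerBundle / regsPerThread;
    return std::min(regLimited, maxThreads);
}
```

For instance, a fitness function compiled to 64 registers per thread would cut the schedulable thread count from 768 down to 128, leaving far less latitude to hide memory latency.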
All these quirks make it very difficult for standard programs to exploit the
real power of these graphic cards.

4 Parallel implementation on a GPGPU card

As has been shown in section 2.1, it is possible to parallelize most of the evolutionary loop. However, whether it is worth running everything in parallel on the GPGPU card is another matter: in [8, 7, 11], the authors implemented complete algorithms on GPGPU cards, but clearly show that doing so is very difficult, for rather small performance gains.
Rather than going this way, the choice made for this paper was to keep everything simple, and to start by experimenting with the obvious idea of only parallelizing children evaluation, based on the three following considerations.

1. Implementing the complete evolutionary engine on the GPGPU card is very complex, so it seems preferable to start by parallelizing only one part of the algorithm.
2. Usually, in evolutionary algorithms, execution of the evolutionary engine (selection of parents, creation of children, replacement step) is extremely fast compared to the evaluation of the population.
3. If the evolutionary engine is kept on the host CPU, the genomes of the individuals need to be transferred to the GPGPU only once per generation. If the selection and variation operators (crossover, mutation) had been implemented on the GPGPU, it would have been necessary to get the population back on the host CPU at every generation for the replacement step.
Evaluation of the population on the GPGPU is a massively parallel process that suits an SPMD/SIMD computing model well, because standard evolutionary algorithms use the same evaluation function to evaluate all individuals³.
Individual evaluations are grouped into structures called Blocks, which implement a group of threads that can be executed on the same bundle. Dispatching individual evaluations across these structures is very important in order to maximize the load on the whole GPGPU. Indeed, as seen in section 3, a bundle has limited scheduling capacity, depending on the hardware scheduling device
or register limitations. The GPGPU computing unit must have enough registers
to execute all the tasks of a block at the same time, representing the scheduling
limit. The 8800GTX card has a scheduling capacity of 768 threads, and 8192
registers. So one must make sure that the number of individuals in a block is
not greater than the scheduling capacity, and that there are enough individuals
on a bundle in order to maximize this capacity. In this paper, the implemented
algorithm spreads the population into n × k blocks, where n is the number of bundles on the GPGPU and k is the integer ceiling of popSize/(n × schedLimit). This simple
algorithm yields good results in tested cases. However, a strategy to automati-
cally adapt blocks definitions to computation complexity either by using a static
or dynamic approach needs to be investigated in some future work.
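The block-dispatching rule above can be written as a small host-side helper; the `numBlocks` name and its parameter names are illustrative, not taken from the implementation.

```cpp
#include <cassert>

// Number of blocks used to spread popSize evaluations over a GPGPU with
// n bundles, each able to schedule schedLimit threads:
//   k = ceil(popSize / (n * schedLimit)),  giving n * k blocks in total.
int numBlocks(int popSize, int n, int schedLimit) {
    int perPass = n * schedLimit;
    int k = (popSize + perPass - 1) / perPass;   // integer ceiling
    return n * k;
}
```

With the 8800GTX figures from section 3 (n = 16 bundles, schedLimit = 768 threads), any population of up to 12,288 children fits in a single block per bundle.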
When a population of children is ready to be evaluated, it is copied onto the
GPGPU memory. All the individuals are evaluated with the same evaluation
function, and results (fitnesses) are sent back to the host CPU, that contains
the original individuals and manages the populations.
On a standard host CPU EA implementation, an individual is made of a genome plus other information, such as its fitness and statistics (whether it has recently been evaluated or not, ...). Transferring n genomes one by one onto the GPGPU would therefore result in n transfers and individuals scattered all over GPGPU memory. Such a number of memory transfers would have been unacceptable. So it was chosen to ensure spatial locality by keeping all genomes contiguous in host CPU memory, and to send the whole children population in one single transfer. Experiments showed that in one particular case, with a large number of children, the transfer time went from 80 seconds with scattered data down to 180µs with a buffer of contiguous data.
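The contiguous-buffer idea can be sketched as follows; the `Child` structure and the `packGenomes` helper are hypothetical names, and the actual device transfer call (e.g. one `cudaMemcpy` of the whole buffer) is omitted.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Host-side sketch: pack all children genomes into one contiguous buffer
// so that a single transfer moves the whole population, instead of one
// transfer per individual scattered across memory.
struct Child {
    std::vector<float> genome;   // fitness, statistics, ... stay on the host
};

std::vector<float> packGenomes(const std::vector<Child>& pop, int genomeSize) {
    std::vector<float> buffer(pop.size() * genomeSize);
    for (size_t i = 0; i < pop.size(); i++)
        std::memcpy(&buffer[i * genomeSize], pop[i].genome.data(),
                    genomeSize * sizeof(float));
    return buffer;   // one device copy of buffer.data() replaces pop.size() copies
}
```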
Our implementation uses only global memory since in the general case, the
evaluation function does not generate significant data reuse that would justify the
use of the small 16KB shared memory or the texture cache. Indeed with shared
memory, the time saved in data accesses is generally wasted by data transfers
between global memory and shared memory. Notice that shared memory is not
accessible from the host CPU part of the algorithm. Hence one has to first copy
data into global memory, and in a second step into shared memory.
The chosen implementation strategy exhibits the main overhead risk as being the time spent transferring the population onto the GPGPU memory. Hence the (computation time)/(transfer time) ratio needs to be large enough to effectively take advantage of the GPGPU card. Experiments on data transfer rate show that on the 8800GTX, a 500 MB/s bandwidth was reached, which is much lower than the advertised 4 GB/s maximum bandwidth. However, experiments presented in the following show that this rate is quite acceptable even for very simple evaluation functions.

³ This is not the case in Genetic Programming, where, on the opposite, all individuals are different functions that are tested on a common learning data set.

5 Experiments

Two implementations have been tested: a toy problem that contains interesting tuneable parameters allowing observation of the behaviour of the GPGPU card, and a much more complex real-world problem, to make sure that the GPGPU processors are also able to run more complex fitness functions. In fact, the 400 code lines of the real-world evaluation function were programmed by a chemist who has no idea how to use a GPGPU card.

5.1 The Weierstrass benchmark program

Weierstrass-Mandelbrot test functions, defined as:

W_{b,h}(x) = Σ_{i=1}^{∞} b^{-ih} sin(b^i x)  with b > 1 and 0 < h < 1,

are very interesting to use as a test case of CPU usage in evolutionary computation since they provide two parameters that can be adjusted independently.

Fig. 2. Left: Host CPU (top) and CPU+8800GTX (bottom) time for 10 generations of the Weierstrass problem on an increasing population size. Right: CPU+8800GTX curve only, for increasing numbers of iterations and increasing population sizes.

Theory defines Weierstrass functions as an infinite sum of sines. Programmers perform a finite number of iterations to compute an approximation of the function. The number of iterations is closely related to the host CPU time spent in the evaluation function.
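Such a finite approximation might look like the following host-side sketch, summing the first `iterations` terms of W_{b,h} over each gene of the genome, with b = 2 and h = 0.25 as in the EASEA evaluator of figure 5 (the sum is taken from i = 1, as in the definition above; the function and parameter names are illustrative).

```cpp
#include <cassert>
#include <cmath>

// Finite-iteration approximation of the Weierstrass-Mandelbrot function
// W_{b,h}(x) = sum_{i>=1} b^{-ih} sin(b^i x), accumulated over a genome of
// `dim` floats; the per-gene magnitudes are summed into one fitness value.
float weierstrass(const float* x, int dim, int iterations,
                  float b = 2.f, float h = 0.25f) {
    float res = 0.f;
    for (int i = 0; i < dim; i++) {
        float val = 0.f;
        for (int k = 1; k <= iterations; k++)
            val += std::pow(b, -k * h) * std::sin(std::pow(b, (float)k) * x[i]);
        res += std::fabs(val);
    }
    return res;
}
```

Both the iteration count and `dim` scale the cost of one evaluation, which is exactly why the two parameters make such a convenient benchmark.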
Another parameter that can also be adjusted is the dimension of the prob-
lem: a 1,000 dimension Weierstrass problem takes 1,000 continuous parameters,
meaning that its genome is an array of 1,000 float values, while a 10 dimen-
sion problem only takes 10 floats. The 10 dimension problem will evidently take
much less time to evaluate than the 1,000 dimension problem. But since eval-
uation time also depends on the number of iterations, tuning both parameters
provides many configurations combining genome size and evaluation time.
Figure 2 left shows the time taken by the evolutionary algorithm to compute
10 generations on both a 3.6GHz Pentium computer and an 8800GTX GPGPU
card, for 1000 dimensions, 120 iterations and a number of evaluations per gen-
eration growing from 16 to 4,096 individuals (number of children = 100% of the
population). This represents the total time, including what is serially done on the host CPU (population management, crossovers, mutations, selections, ...), on both architectures (host CPU only and host CPU+GPU).
For 4,096 evaluations (×10 generations), the host CPU spends 2,100 seconds
while the host CPU and 8800GTX only spends 63 seconds, resulting in a speedup
of 33.3.
Figure 2 right shows the same GPGPU curve, for different iteration counts.
On this second figure, one can see that the 8800GTX card steadily takes in
more individuals to evaluate in parallel without much difference in evaluation
time until the threshold number of 2048 individuals is reached, after which it
gets saturated. Beyond this value, evaluation time increases linearly with the
number of individuals, which is normal since the parallel card is already working
on a full load. It is interesting to see that with 10 iterations, the curve before
and after 2048 has nearly the same slope, meaning that for 10 iterations, the
time spent in the evaluation function is negligible, so the curve mainly shows
the overhead time.

Fig. 3. Left: determination of genome size overhead on a very short evaluation. Right: same curves as figure 2 right, but for the GTX260 card.

Since using a GPGPU card induces a necessary overhead, it is interesting to determine when it becomes advantageous to use an 8800GTX card. Figure 3 left shows that on a small problem (a 10 dimension Weierstrass function with 10 iterations, running in virtually no time), this threshold is met between 400 and 600 individuals, depending on whether the genome size is 40 bytes or 4 kilobytes, which is quite a big genome.
The steady line (representing the host CPU) shows an evaluation time slightly
shorter than 0.035 milliseconds, which is very short, even on a 3.6GHz computer.
The 3 GPGPU curves show that indeed, the size of the genomes has an impact
when individuals are passed to the GPGPU card for evaluations. On this figure,
evaluation is done on a 10 dimension Weierstrass function corresponding to a 40
bytes size (the 8800GTX card only accepts floats). The additional genome data
is not used on the 2 kilobytes and 4 kilobytes genomes, in order to isolate the
time taken to transfer large size genomes to the GPGPU for all the population.
On figure 3 right, the same measures are shown with a recently acquired
GTX260 NVidia card. One can see that with this card, total time is only 20
seconds for a population of 5,000 individuals while the 8800GTX card takes 60
seconds and the 3.6GHz Pentium takes 2,100 seconds. So where the 8800GTX
vs host CPU speedup was 33.3, the GTX260 vs host CPU speedup is about 105,
which is quite impressive for a card that only costs around $250.00.

5.2 Application to a real-world problem

In materials science, knowledge of the structure at an atomistic/molecular level is required for any advanced understanding of a material's performance, due to the intrinsic link between the structure of the material and its useful properties. It is therefore essential that methods to study structures are developed.
Rietveld refinement techniques [10] can be used to extract structural details
from an X-Ray powder Diffraction (XRD) pattern [2, 1, 4], provided an approx-
imate structure is known. However, if a structural model is not available, its
determination from powder diffraction data is a non-trivial problem. The struc-
tural information contained in the diffracted intensities is obscured by systematic
or accidental overlap of reflections in the powder pattern.
As a consequence, the application of structure determination techniques
which are very successful for single crystal data (primarily direct methods) is, in
general, limited to simple structures. Here, we focus on inherently complex struc-
tures of a special type of crystalline materials whose periodic structure is a 4-
connected 3 dimensional net such as alumino-silicates, silico-alumino-phosphates
(SAPO), alumino-phosphates (AlPO), etc...
The genetic algorithm is employed in order to find "correct" locations, e.g. from a connectivity point of view, of T atoms. As the distance T−T for bonded atoms lies in a fixed range [dmin, dmax], the connectivity of each new configuration of T atoms can be evaluated. The fitness function corresponds to the number of defects in the structure, and Fitness = f1 + f2 is defined as follows:

1. All T atoms should be linked to 4 and only 4 neighbouring T atoms, so:
   f1 = Abs(4 − Number_of_Neighbours)
2. No T atom should be too close to another, i.e. T−T < dmin, so:
   f2 = Number_of_Too_Close_T_Atoms
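A direct O(n²) sketch of this defect-counting fitness might look as follows. The `Atom` structure, the single-precision distance computation and the function names are illustrative assumptions (not the chemist's 400-line code), and f2 here counts each too-close pair once per direction.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>

struct Atom { float x, y, z; };

// Defect-counting fitness sketch: for each T atom, f1 penalises a neighbour
// count (within [dmin, dmax]) different from 4, and f2 counts atoms that are
// closer than dmin. A fitness of 0 means a defect-free configuration.
float defectFitness(const std::vector<Atom>& atoms, float dmin, float dmax) {
    int f1 = 0, f2 = 0;
    for (size_t i = 0; i < atoms.size(); i++) {
        int neighbours = 0;
        for (size_t j = 0; j < atoms.size(); j++) {
            if (i == j) continue;
            float dx = atoms[i].x - atoms[j].x;
            float dy = atoms[i].y - atoms[j].y;
            float dz = atoms[i].z - atoms[j].z;
            float d = std::sqrt(dx * dx + dy * dy + dz * dz);
            if (d >= dmin && d <= dmax) neighbours++;
            if (d < dmin) f2++;            // counted once per ordered pair here
        }
        f1 += std::abs(4 - neighbours);    // exactly 4 bonded neighbours wanted
    }
    return (float)(f1 + f2);
}
```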
Speedup on this chemical problem: As mentioned earlier, the source code
came from our chemist co-author, who is not a programming expert (but is
nevertheless capable of creating some very complex code) and knows nothing
about GPGPU architecture and its use.

Fig. 4. Left: evaluation times for increasing population sizes on host CPU (top) and host CPU + GTX260 (bottom). Right: CPU + GTX260 total time.

First, while the 3.60GHz CPU evaluated 20,000 individuals in only 23 seconds, which seemed really fast considering the very complex evaluation function, the GPGPU version took around 80 seconds, which was disappointing.
When looking at the genome of the individuals, it appeared that it was coded
in a strange structure, i.e. an array of 4 pointers towards 4 other arrays of 3 floats.
This structure seemed much too complex to access, so it was suggested to flatten
it into a unique array of 12 floats which was easy to do, but unfortunately, the
whole evaluation function was made of pointers to some parts of the previous
genome structure. After some hard work, the whole code was made flat and pointer-less, and evaluation time for the 20,000 individuals instantly dropped from 80 seconds down to 0.13 seconds. One conclusion to draw from this experience is that, as expected, GPGPUs are not very good at allocating, copying and de-allocating memory.
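The flattening that fixed the problem can be illustrated as follows; both structure names are hypothetical, and the 4 × 3 layout matches the genome described above (4 pointers to 4 arrays of 3 floats, flattened into 12 floats).

```cpp
#include <cassert>
#include <cstddef>

// Original, GPGPU-unfriendly layout: accessing a value requires chasing a
// pointer into four separately heap-allocated arrays.
struct PointerGenome {
    float* part[4];          // 4 heap-allocated arrays of 3 floats each
};

// Flattened, GPGPU-friendly layout: one contiguous array of 12 floats that
// transfers in a single block and needs no pointer dereferencing on device.
struct FlatGenome {
    float v[12];             // part p, component c lives at index p * 3 + c
};

// Accessing element (p, c) of the flat genome:
inline float get(const FlatGenome& g, int p, int c) { return g.v[p * 3 + c]; }
```

The flat structure is also what makes the single contiguous population transfer of section 4 possible.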
Back to the host CPU, the new function now took 7.66s to evaluate 20,000
individuals, meaning that all in all, the speedup offered by the GPGPU card
is nearly 60 on the new GTX260 (figure 4 left). Figure 4 right shows only the
GTX260 curve.

6 EASEA: Evolutionary Algorithm specification language

EASEA⁴ [3] is a software platform that was originally designed to help non-expert programmers try out evolutionary algorithms on their applied problems without the need to implement a complete algorithm. It was revived to integrate the parallelization of children evaluation on NVidia GPGPU cards.
EASEA is a specification language that comes with its dedicated compiler. Out of the weierstrass.ez specification program shown in fig. 5, the EASEA compiler will output a C++ source file that implements a complete evolutionary algorithm to minimize the Weierstrass function.

\User classes :
GenomeClass {
  float x[SIZE];
}

\GenomeClass::initialiser :
for (int i=0; i<N; i++) {
  Genome.x[i] = random(-1.0, 1.0);
}

\GenomeClass::mutator :
for (int i=0; i<N; i++)
  if (tossCoin(pMutProb)) {
    Genome.x[i] += SIGMA*random(0.,1.);
    Genome.x[i] = MAX(X_MIN, MIN(X_MAX, Genome.x[i]));
  }

\GenomeClass::crossover :
for (i=0; i<N; i++) {
  float a = (float)randomLoc(0.,1.);
  if (&child1)
    child1.x[i] = a*parent1.x[i] + (1-a)*parent2.x[i];
}

\GenomeClass::evaluator :
float res = 0., b=2.;
float h=.25, val[SIZE];
int i, k;
for (i=0; i<N; i++) {
  val[i] = 0.;
  for (k=0; k<ITER; k++)
    val[i] += pow(b,-(float)k*h) * sin(pow(b,(float)k)*Genome.x[i]);
  res += Abs(val[i]);
}
return (res);

Fig. 5. Evolutionary algorithm minimizing the Weierstrass function in EASEA.

⁴ EASEA (pron. [i:zi:]) stands for EAsy Specification for Evolutionary Algorithms.
The -cuda option has been added to the compiler in order to automatically
output a source code that will parallelize the evaluation of the children popula-
tion on a CUDA-compatible NVidia card out of the same .ez specification code,
therefore allowing anyone who wishes to use GPGPU cards to do so without any
knowledge about GPGPU programming. However, a couple of guidelines included in the latest version of the EASEA user manual should be followed, such as using flat genomes rather than pointers to matrices, fitness functions that do not use more registers than are available on a GPGPU unit, and floats rather than doubles⁵.

⁵ Although available on all recent GPGPU cards, using double precision variables will apparently slow down calculations considerably on current GPGPU cards; this has not been tested yet, as all this work was done on an 8800GTX card, which can only manipulate floats.
Adding the -cuda option to this compiler is very important, since it not
only allows replication of the presented work, but also gives non GPGPU expert
programmers the possibility to run their own code on these powerful parallel
cards.

7 Related Work

Even though many papers have been written on the implementation of Genetic
Programming algorithms on GPGPU cards, only three papers were found on the
implementation of standard evolutionary algorithms on these cards.
In [11], Yu et al. implement a refined fine-grained algorithm with a 2D
toroidal population structure stored as a set of 2D textures, which imposes re-
strictions on mating individuals (that must be neighbours). Other constraints
arise, such as the need to store a matrix of random numbers in GPGPU memory
for future reference, since there is no random number generator on the card.
Still, a 10-fold speedup is obtained, but on a huge population of 512 × 512 individuals.
In [7], Fok et al. find that standard genetic algorithms are ill-suited to run
on GPGPUs because of such operators as crossovers “that would slow down
execution when executed on the GPGPU” and therefore choose to implement
a crossover-less Evolutionary Programming algorithm [6] here again entirely on
the GPGPU card. The obtained speedup of their parallel EP “ranges from 1.25
to 5.02 when the population size is large enough.”
In [8], Li et al. implement a Fine Grained Parallel Genetic Algorithm, once again entirely on the GPGPU, to "avoid massive data transfer." Strangely, they implement a binary genetic algorithm even though GPGPUs have no bitwise operators, and therefore go to a lot of trouble to implement simple genetic operators.
To our knowledge, no paper has proposed the simple approach of only parallelizing the evaluation of the population on the GPGPU card.

8 Conclusion and future developments

Results show that deporting the children population onto the GPGPU for a par-
allel evaluation yields quite significant speedups of up to 100 on a $250 GTX260
card, in spite of the overhead induced by the population transfer.
Being faster by around 2 orders of magnitude is a real breakthrough in evolutionary computation, as it will allow applied scientists to find new results in their domains. Researchers on artificial evolution will then need to modify their algorithms to adapt them to such speeds, which will probably lead to premature convergence, for instance. Then, unlike many other works that are difficult (if not impossible) to replicate, know-how on the parallelization of evolutionary algorithms has been integrated into the EASEA language. Researchers who would like to try out these cards can simply specify their algorithm using EASEA, and the compiler will parallelize the evaluation.
Still, many improvements can be expected. Load balancing could probably be improved, in order to maximize bundle throughput. Using texture cache memory may be interesting for evaluation functions that repeatedly access genome data. Automatic use of shared memory could also yield good results, particularly for local variables in the evaluation function.
Finally, an attempt to implement evolutionary algorithms on Sony/Toshiba/IBM Cell multicore chips is currently being made. Its integration into the EASEA language would allow comparing the performance of GPGPU and Cell architectures on identical programs.

References
1. L. A. Baumes, M. Moliner, and A. Corma. Design of a full-profile matching so-
lution for high-throughput analysis of multi-phases samples through powder x-ray
diffraction. Chemistry - A European Journal, In Press.
2. L. A. Baumes, M. Moliner, N. Nicoloyannis, and A. Corma. A reliable methodology
for high throughput identification of a mixture of crystallographic phases from
powder x-ray diffraction data. CrystEngComm, 10:1321–1324, 2008.
3. P. Collet, E. Lutton, M. Schoenauer, and J. Louchet. Take it EASEA. In Parallel Problem Solving from Nature VI, pages 891–901. Springer, LNCS, 2000.
4. A. Corma, M. Moliner, J. M. Serra, P. Serna, M. J. Diaz-Cabanas, and L. A.
Baumes. A new mapping/exploration approach for ht synthesis of zeolites. Chem-
istry of Materials, pages 3287–3296, 2006.
5. C. Darwin. On the Origin of Species by Means of Natural Selection or the Preser-
vation of Favoured Races in the Struggle for Life. John Murray, London, 1859.
6. D. B. Fogel. Evolving artificial intelligence. Technical report, 1992.
7. K.-L. Fok, T.-T. Wong, and M.-L. Wong. Evolutionary computing on consumer
graphics hardware. Intelligent Systems, IEEE, 22(2):69–78, March-April 2007.
8. J.-M. Li, X.-J. Wang, R.-S. He, and Z.-X. Chi. An efficient fine-grained parallel
genetic algorithm based on gpu-accelerated. In Network and Parallel Computing
Workshops, 2007. NPC Workshops. IFIP International Conference on, pages 855–
862, 2007.
9. K. De Jong. Evolutionary Computation: a Unified Approach. MIT Press, 2005.
10. R. A. Young. The Rietveld Method. OUP and International Union of Crystallog-
raphy, 1993.
11. Q. Yu, C. Chen, and Z. Pan. Parallel genetic algorithms on programmable graphics hardware. In Advances in Natural Computation – ICNC 2005, Proceedings, Part III, volume 3612 of LNCS, pages 1051–1059, Changsha, August 27–29 2005. Springer.
