
World Applied Programming, Vol (5), No (1), January 2015, pp. 1-7
ISSN: 2222-2510
TI Journals, www.tijournals.com
Copyright 2015. All rights reserved for TI Journals.

A New High Performance GPU-based Approach to Prime Numbers Generation

Amin Nezarat *
Department of Computer, Payame Noor University, I.R. of Iran

M. Mahdi Raja
Computer Department, Shiraz University, Shiraz, Iran

GH. Dastghaybifard
Computer Department, Shiraz University, Shiraz, Iran
*Corresponding author: aminnezarat@yazdpnu.ac.ir

Keywords: GPU, Prime Numbers, Sieve Algorithm, Parallel Computing, CUDA

Abstract

SIMD parallelization is one of the most useful ways to decrease computation time and increase the performance of computation-intensive algorithms. One way to do this is to run processes on several machines, using platforms such as MPI or OpenMP and distributing the workload through message passing or shared memory. Another popular, high-performance option is an array of graphics processors (GPU), which this paper uses to present a new technique for storing data and speeding up the sieve algorithm with CUDA code. The method shows a clear improvement in computation time and memory usage for generating prime numbers compared with a CPU implementation.

1. Introduction

The importance of finding large prime numbers in cryptography is well known, and many mathematicians are looking for efficient ways to find large primes quickly and at low cost. On the parallel-programming side, the goal is to design or adapt algorithms so the calculation can be split across several computers, clusters, or distributed systems and reach the result faster. Many parallelization techniques exist today [1,3,15]. One is to use execution threads that share memory at the user or operating-system level, each thread doing a part of the whole job simultaneously. The performance gain can be small, because running one process or several makes little difference when the total power of a single CPU core stays the same. Another technique is to pass messages between several machines (or cores), each doing a part of the total computation, using platforms such as MPI or OpenMP. In [5], parallel prime number generation outperformed the sequential algorithm; the authors also found that creating threads speeds up execution only when the data volume is high, with no speedup on small inputs.
Another parallelization method that is very popular among experts is to use the GPU instead of the CPU. In this structure, calculations run on arrays of GPU cores: each core is individually weak, but there are many of them. Two types of memory are involved, a dedicated per-core memory and a shared memory [2,3,5,17]. Each GPU has several core blocks and each block has several cores. At the start of execution, all blocks take part without any message passing between them; instead they exchange data through shared memory.
1.1 MPI
MPI (Message Passing Interface) is a dedicated protocol for transferring messages between computers. Designed for large parallel machines, it offers good performance and can be used on different architectures. One of its implementations, MPICH, is freely available for download on the internet [4,11,18]. In this project we use the latest version, MPICH2, which lets programmers compare shared and distributed memory approaches, making it convenient to use threads alone, MPI alone, or both together [6,7].
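A minimal MPI sketch in C may clarify the message-passing model described above; it is our own illustration (not code from the paper), with rank 0 sending a small block of numbers to rank 1 over MPI_COMM_WORLD.

/* Illustrative MPI sketch (ours): rank 0 sends a block of candidate numbers
   to rank 1. Compile with mpicc and launch with mpiexec -n 2 from an MPICH2
   installation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data[4] = {101, 103, 107, 109};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* send to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d ... %d\n", data[0], data[3]);
    }
    MPI_Finalize();
    return 0;
}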

1.2 Pthread
Pthread (the POSIX threads library) is an interface for creating threads and transferring data between them within a running program. The main idea is to use threads in the background to raise the performance of each part of the program that can be executed in parallel [8,9].
This is done by dividing CPU time into small slices and dedicating each slice to a thread, which gives the ability to switch between threads and to split the calculation time. We can even transfer shared information between these threads, each of which uses its own dedicated time and space, when applying a divide-and-conquer algorithm. These threads can also play the role of virtual computation nodes within each processor.
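The thread-per-subrange idea can be sketched with Pthreads as follows; this is our own illustrative example (not the paper's code), where each thread counts primes in its own part of the range by trial division, acting as a virtual computation node on one CPU.

/* Illustrative Pthreads sketch (ours): THREADS workers each scan a sub-range
   of [0, N) and count primes by trial division. */
#include <pthread.h>
#include <stdio.h>

#define N       100000
#define THREADS 4

static long counts[THREADS];

static int is_prime(long n)
{
    if (n < 2) return 0;
    for (long d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

static void *worker(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / THREADS), hi = lo + N / THREADS;
    for (long n = lo; n < hi; n++)
        counts[id] += is_prime(n);
    return NULL;
}

int main(void)
{
    pthread_t t[THREADS];
    long total = 0;
    for (long i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < THREADS; i++) {
        pthread_join(t[i], NULL);
        total += counts[i];
    }
    printf("primes below %d: %ld\n", N, total);
    return 0;
}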

2. CUDA, GPU Programming

Many scientists and researchers turned to GPU technology and began using it shortly after its announcement. Using the GPU for non-graphical computation is called GPGPU. Before GPUs, the limiting factor was the number of cores: parallel jobs ran on cluster nodes with at best 4 cores each, and more power meant adding nodes at considerable cost. Now we can migrate processes to GPUs, which offer a very high core count, low complexity, and high parallel computation power while they are otherwise idle [10,12]. Each GPU has several blocks and each block has many tiny cores that can execute a parallel algorithm simultaneously across all blocks. Each core has a very limited amount of memory for execution bookkeeping (PCB, registers, etc.); in addition, each block has a shared memory accessible to all of its cores. A GPU has at least 32 cores, which is comparable to running a parallel program across many networked computers in a supercomputer, with the important difference that in a GPU all these connections sit on one very small piece of hardware, giving much higher performance and speed [9,10,16]. The GPU architecture is shown in Figure 1.

Figure 1. GPU and CPU Architecture

To write programs for the GPU we used CUDA, an NVidia product. CUDA consists of several C and Fortran programming libraries whose code runs directly on the GPU. In CUDA terminology the GPU is the Device and the CPU is the Host: the code is written in C with CUDA extensions, compiled and started on the Host, which then launches CUDA kernels on the Device so that Host and Device execute simultaneously [11].
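The Host/Device split can be illustrated with a minimal CUDA sketch of our own (the kernel name, sizes, and launch configuration are our assumptions, not the paper's): the Host allocates Device memory, launches a trivial kernel, and copies the result back.

// Minimal illustrative Host/Device sketch (ours): each GPU thread handles one index.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void markEvens(unsigned char *flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (i % 2 == 0);           // Device code: runs on the GPU
}

int main()
{
    const int n = 1024;
    unsigned char host[1024], *dev;
    cudaMalloc(&dev, n);                          // allocate Device memory
    markEvens<<<(n + 255) / 256, 256>>>(dev, n);  // Host launches the kernel
    cudaMemcpy(host, dev, n, cudaMemcpyDeviceToHost);
    printf("flags[10] = %d\n", host[10]);
    cudaFree(dev);
    return 0;
}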
In CUDA, cudaMemcpy can copy data in either direction between GPU and CPU memory. The call runs at over a gigabyte per second between GPU memory and pinned CPU memory (page-locked memory allocated with cudaMallocHost); ordinary unpinned CPU memory copies run at about half this bandwidth. One conceptual limitation of cudaMemcpy is that both the GPU and CPU buffers must be contiguous groups of bytes; no stride or derived data types are allowed.
The per-copy setup cost M is substantial, around 10 microseconds per call, which is faster than Ethernet but much worse than most high-performance networks. The per-byte transfer time is much better, limited mostly by the PCI Express bus to about a gigabyte per second, so GPU-CPU memory copy bandwidth is competitive with modern networks.
A naive blocking implementation for contiguous GPU-to-GPU data transmission is thus:
// Sending CPU's code: copy the block from GPU memory into a pinned CPU buffer
cudaMemcpy(cpuBuffer, gpuBuffer, n, cudaMemcpyDeviceToHost);
// Receiving CPU's code: copy the received block from the CPU buffer into GPU memory
cudaMemcpy(gpuBuffer, cpuBuffer, n, cudaMemcpyHostToDevice);
Figure 2.GPU to GPU data transmission code

We expect this implementation to take CPU time of roughly 2(M + n t_b), where M is the per-copy setup cost and t_b the per-byte copy time (one copy on each side). The only minor detail to note is that cpuBuffer should already have been allocated as pinned CPU memory, for higher copy bandwidth. It is best to manage this pinned memory in a single place in the code, especially since pinned memory allocations currently take absurdly long, millions of nanoseconds, as shown in Figure 3 [13,14].
Function                                  Performance Model                             Tolerance
CUDA kernel writing to GPU memory         4.9 x 10^3 ns + n x 0.012 ns/byte             +0% / -3%
CUDA kernel writing to mapped CPU RAM     5.3 x 10^3 ns + n x 0.5 ns/byte               +0% / -0%
C++ new and delete (unpinned CPU RAM)     100 ns + n x 0.7 x 10^-3 ns/byte              +94% / -8%
cudaMallocHost (pinned CPU RAM)           3.5 x 10^6 ns + n x 0.7 ns/byte               +0% / -23%
cudaMalloc (GPU memory)                   1.49 x 10^3 ns if n < 2.05 x 10^3 bytes,      +0% / -8%
                                          else 196.67 x 10^3 ns + n x 0.13 ns/byte

Figure 3. Measured performance of CUDA functions operating on n bytes. GPU memory is faster than mapped CPU RAM, and pinned allocation is expensive.
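Since cudaMallocHost calls cost millions of nanoseconds (Figure 3), a common pattern is to allocate the pinned staging buffer once and reuse it for every GPU-CPU copy. The following is a minimal sketch of our own (the function name and the single-threaded host are assumptions, not the paper's code):

// Illustrative sketch (ours): keep one pinned buffer and grow it only when needed,
// so the expensive cudaMallocHost call is paid rarely instead of on every transfer.
#include <cstddef>
#include <cuda_runtime.h>

static void *pinnedBuffer(size_t bytes)
{
    static void  *buf = 0;
    static size_t cap = 0;
    if (bytes > cap) {
        if (buf) cudaFreeHost(buf);     // release the old, smaller buffer
        cudaMallocHost(&buf, bytes);    // expensive pinned allocation: do it rarely
        cap = bytes;
    }
    return buf;
}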

3. Main Method

Primality checking has always been a hard and computationally complex problem. The classical Cyclotomy test runs in O((log n)^(c log log log n)) time, where n is the number under test and c is a constant. Researchers have since reduced the complexity of deterministic tests to about O((log n)^7.5) [15,16]. The goal of this project is to address this cost, and we compare our proposed approach with the original one to measure the performance gain.
We take the sieve method as our starting point and apply some modifications to make it parallel [12,13]. Eratosthenes' algorithm takes the numbers up to n and repeatedly removes the multiples of each prime, starting with 2; the numbers remaining in the array at the end are exactly the primes. The challenge is that a very large amount of memory is needed to hold the numbers, and many passes are needed to remove the multiples. We therefore need a technique that makes the sieve more efficient in both memory size and runtime cost.
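As a point of reference, a compact sequential sieve can pack the flags into a bit array, 32 numbers per machine word, as the paper does on the GPU. The following sketch is our own baseline illustration (not the paper's code):

/* Illustrative sequential bit-array sieve (ours): a set bit marks a composite,
   so the whole table for N numbers needs only N/8 bytes. */
#include <stdio.h>
#include <stdlib.h>

#define IS_SET(b, i)  ((b)[(i) >> 5] &  (1u << ((i) & 31)))
#define SET(b, i)     ((b)[(i) >> 5] |= (1u << ((i) & 31)))

long sieve(long N)
{
    unsigned *bits = calloc((N >> 5) + 1, sizeof *bits);
    long count = 0;
    for (long p = 2; p * p < N; p++)
        if (!IS_SET(bits, p))
            for (long m = p * p; m < N; m += p)   /* start at p*p: smaller multiples are already marked */
                SET(bits, m);
    for (long i = 2; i < N; i++)
        count += !IS_SET(bits, i);
    free(bits);
    return count;
}

int main(void)
{
    printf("primes below 10,000,000: %ld\n", sieve(10000000));
    return 0;
}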
In addition, by using parallel threads together with a shared memory we can execute many instructions at the same time. In other words, we can assign a block of data to each executing thread or processor while all outputs are shared with the other processing units.
In this paper, the code modifications yield several performance gains: a speedup together with lower memory requirements. We use bit arrays, since the numbers to be processed form arrays that must be stored compactly; each array cell marks 32 numbers and the index points to the numbers it covers. In this scheme, numbers are marked by shifting bits, as shown in Figure 4.
__global__ void primesSMKernel( unsigned char *g_prepattern, unsigned char *g_bitbuffer,
                                unsigned int bitbufsize, unsigned int NUM )
{
    extern unsigned char __shared__ sdata[];
    if (NUM > 0) {
        __register__ unsigned int *isdata        = (unsigned int*)&sdata[0];
        __register__ unsigned int *g_ibitbuffer  = (unsigned int*)g_bitbuffer;
        __register__ unsigned int *g_iprepattern = (unsigned int*)g_prepattern;
        __register__ unsigned int num    = bitbufsize / 4;   // round down
        __register__ unsigned int remain = bitbufsize % 4;   // remainder
        __register__ const unsigned int idx = threadIdx.x;
        // initialize shared memory with precomputed bit pattern for primes 2, 3, 5, 7, 11, 13
        for (__register__ int i = 0; i < num; i += blockDim.x)
            if (i + idx < num) isdata[i + idx] = g_iprepattern[i + idx];
        if (idx < remain) sdata[4*num + idx] = g_prepattern[4*num + idx];
        __syncthreads();
        unsigned int __shared__ firstprime;
        __register__ unsigned int sqrt_N = ceil(sqrtf((float)NUM));
        if (threadIdx.x == 0) {
            firstprime = 17;   // start marking multiples of primes beginning with prime 17
            sdata[0] = 0x53;   // 2 is prime, 3 is prime, 5 is prime, 7 is prime, the rest in this byte isn't
            sdata[1] = 0xd7;   // 11 is prime, 13 is prime
        }
        __syncthreads();
        while (firstprime <= sqrt_N) {
            // mark out all multiples of "firstprime", starting with firstprime squared
            for (unsigned int i = (firstprime + idx) * firstprime;
                 i < NUM; i += firstprime * blockDim.x)
                sdata[i>>3] |= (1 << (i & 7));
            __syncthreads();
            // search for next prime (unmarked number) in the bit array using a single thread

Figure 4. CUDA code for marking prime numbers (continued in Figure 5)

In addition, numbers that were already marked in earlier passes are not processed again, which avoids repeating calculations. For example, the number 6 is marked in the first pass as a multiple of 2, so in the next pass it is not marked again as a multiple of 3; this reduces the number of iterations and helps raise performance.

3.1 Parallelization
As mentioned before, there are many parallelization techniques for speeding up an algorithm's execution. In [20], the authors implemented the Sieve and Naive algorithms in parallel using MPI and OpenMP shared memory. They found that combining OpenMP with Sieve + Naive gives a dramatic performance increase for large number ranges, because all processors need access to a large amount of data.
Given the increasing use of graphics cards in everyday computing and the huge parallel computation power of these devices, in this paper we present an optimized algorithm implemented for GPU cards. On the other hand, a GPU does not have enough space to keep all the data in its memory, which ranges from several MBs up to one or two GBs and costs considerably more as it grows.
We therefore split the data into several categories. The important point is that the high cost of transferring these data between GPU and CPU badly hurts program performance if the amount of data sent or received exceeds the available shared memory of the GPU or the transfer rate is high.
The solution is to remove the multiples of the primes from 2 to 17 on the CPU and then send the remaining numbers to the GPU for the next steps. The CPU organizes the data into blocks, and each block takes part in the next round of calculation. The proposed algorithm uses bit arrays to store and transfer the data.
            if (threadIdx.x == 0)
                for (firstprime = firstprime + 1; firstprime < NUM; firstprime++)
                    if ((sdata[firstprime>>3] & (1 << (firstprime & 7))) == 0) break;
            __syncthreads();
        }
        // coalesced and bank-conflict free 32 bit integer copy from shared to global
        for (__register__ int i = 0; i < num; i += blockDim.x)
            if (i + idx < num)
                g_ibitbuffer[i + idx] = isdata[i + idx];
        // copy remaining bytes
        if (idx < remain) g_bitbuffer[4*num + idx] = sdata[4*num + idx];
    }
}

Figure 5. CUDA code for finding the next prime and copying the results back (continuation of Figure 4)


We perform this pre-processing on the CPU because that data has already been processed one step earlier. The copied data is specific to each GPU block: each block receives its own data and passes the array inside the block on for execution. Each execution thread then takes its data from shared memory and performs the computation itself.
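A host-side sketch of this flow (ours, under assumptions about block and thread counts; the kernel signature is taken from Figure 4) would pre-mark the multiples of the small primes on the CPU, copy the bit pattern to the GPU, and launch the kernel with the bit buffer mirrored in dynamic shared memory:

// Illustrative host code (ours): pre-sieve the small primes on the CPU, then launch
// the kernel from Figure 4. NUM must be small enough for the bit buffer to fit in
// shared memory; 256 threads per block is an assumption.
#include <cstdlib>
#include <cuda_runtime.h>

void runSieve(unsigned int NUM)
{
    unsigned int bitbufsize = (NUM + 7) / 8;                 // one bit per number
    unsigned char *pattern = (unsigned char *)calloc(bitbufsize, 1);
    int small[] = {2, 3, 5, 7, 11, 13};
    for (int k = 0; k < 6; k++)
        for (unsigned int m = 2 * small[k]; m < NUM; m += small[k])
            pattern[m >> 3] |= 1 << (m & 7);                 // mark multiples of the small primes

    unsigned char *g_prepattern, *g_bitbuffer;
    cudaMalloc(&g_prepattern, bitbufsize);
    cudaMalloc(&g_bitbuffer, bitbufsize);
    cudaMemcpy(g_prepattern, pattern, bitbufsize, cudaMemcpyHostToDevice);

    primesSMKernel<<<1, 256, bitbufsize>>>(g_prepattern, g_bitbuffer, bitbufsize, NUM);

    cudaMemcpy(pattern, g_bitbuffer, bitbufsize, cudaMemcpyDeviceToHost);
    cudaFree(g_prepattern);
    cudaFree(g_bitbuffer);
    free(pattern);
}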


Figure 6. Thread hierarchy


The data are distributed so that a thread can start a new computation step as soon as it is ready to apply the same operation to the data available to it. In other words, each thread chooses its offset from its thread id and block id inside the GPU, and in each round it removes some of the non-prime numbers.
__register__ unsigned int *isdata        = (unsigned int*)&sdata[0];
__register__ unsigned int *g_ibitbuffer  = (unsigned int*)g_bitbuffer;
__register__ unsigned int *g_iprepattern = (unsigned int*)g_prepattern;
__register__ unsigned int num    = bitbufsize / 4;   // round down
__register__ unsigned int remain = bitbufsize % 4;   // remainder
__register__ const unsigned int idx = threadIdx.x;
// initialize shared memory with precomputed bit pattern for primes 2 to 13
for (__register__ int i = 0; i < num; i += blockDim.x)
    if (i + idx < num) isdata[i + idx] = g_iprepattern[i + idx];
if (idx < remain) sdata[4*num + idx] = g_prepattern[4*num + idx];
// K is the block-specific offset
unsigned long long K = NUM * blockIdx.x + NUM;
__syncthreads();
unsigned int __shared__ firstprime;
__register__ unsigned int sqrt_KN = ceil(sqrtf((float)(K + NUM)));
if (threadIdx.x == 0) firstprime = 17;   // start marking multiples of primes beginning with prime 17
__syncthreads();
while (firstprime <= sqrt_KN) {
    // compute an offset such that we instantly enter the range K...K+NUM in the loop below;
    // because 64-bit division is costly, only the first thread computes it
    unsigned int __shared__ offset;
    if (threadIdx.x == 0) {
        offset = 0;
        if (K >= firstprime*firstprime) offset = (K - firstprime*firstprime) / firstprime;
    }
    __syncthreads();
    // mark out all multiples of "firstprime" that fall into this thread block, starting with firstprime squared
    for (unsigned long long i = (offset + firstprime + idx) * firstprime; i < K + NUM;
         i += firstprime * blockDim.x)
        if (i >= K) sdata[(i - K) >> 3] |= (1 << ((i - K) & 7));
    __syncthreads();
    // search for the next prime (unmarked number) in the reference bit array using a single thread
    if (threadIdx.x == 0)
        for (firstprime = firstprime + 1; firstprime < NUM; firstprime++)
            if ((d_prebitbuffer[firstprime>>3] & (1 << (firstprime & 7))) == 0) break;
    __syncthreads();
}
// byte-by-byte uncoalesced and bank-conflict-prone copy from shared to global memory
__register__ unsigned int byteoff = bitbufsize * blockIdx.x + bitbufsize;
for (__register__ int i = 0; i < bitbufsize; i += blockDim.x)
    if (i + idx < bitbufsize) g_bitbuffer[byteoff + i + idx] = sdata[i + idx];

Figure 7. CUDA kernel code for the multi-block sieve

3.2 Speedup Calculation


In comparing different parallelizations of a mathematical algorithm, running time and memory use are the most important metrics for choosing the better implementation. The communication time between processors is denoted tcomm and the computation time tcomp. The total cost of the parallel run is then [6,9]

tp = tcomp + tcomm

and the speedup factor, which measures the gain over the sequential program with running time ts, is

S = ts / tp

We can calculate the cost in two ways:


1- Based on the GPU cores.
2- Based on an array of processes equal to one block.
In the second view we calculate tcomm and tcomp for each block, treating the block as a processor unit with its own shared memory, since the transfers are done from and to these shared memories.
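As a sketch of how these quantities combine (our reading of the standard cost model, not an equation given explicitly in the paper), with p parallel units, GPU cores in the first view or blocks in the second, and sequential running time ts:

\[
  t_p = \frac{t_{comp}}{p} + t_{comm},
  \qquad
  S = \frac{t_s}{t_p} = \frac{t_s}{t_{comp}/p + t_{comm}} .
\]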

4. Results

After the implementation, we executed the code in both parallel and sequential environments. Two GPUs were used to run the program: an NVidia GeForce 9600 GT with 16 CUDA cores and an MSI N570GTX with 480 CUDA cores. Each CUDA core of the 9600 GT runs at 550 MHz, for a total of about 208 GFlops. For the sequential test we used a dual-core CPU running at 2.4 GHz per core.
The final results are presented below.
Table 1. Comparison between CPU and GPU runtime on the GeForce 9600 GT

Count of numbers    tGPU (ms)    tCPU (ms)    Count of primes
10,000,000          0.372        0.752        11312
40,000,000          200.78       640.20       2,433,654
200,000,000         4125.68      8351.31      11,078,937

As mentioned before, the aggregate GPU computation power of the GT 9600 (16 cores at 550 MHz) is taken as 8.59 GHz. In order to make the comparison more accurate, the GPU times are balanced by the GPU/CPU clock ratio 8.59 / 4.8 = 1.79.
Table 2. Comparison between CPU and GPU runtime on the GeForce 9600 GT after balancing time

Count of numbers    tGPU (ms)    tCPU (ms)    Improvement
10,000,000          0.666        0.752        12%
40,000,000          359.41       640.20       44%
200,000,000         7384.97      8351.31      12%
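As a check on the tables (our reading of the numbers, not a derivation given in the paper), the balanced GPU time appears to be the raw GPU time scaled by the clock ratio of 1.79, and the Improvement column compares it with the CPU time; for the 40,000,000 row of Tables 1 and 2:

\[
  t_{GPU}^{bal} = 1.79 \times 200.78 \approx 359.4\ \text{ms},
  \qquad
  \frac{t_{CPU} - t_{GPU}^{bal}}{t_{CPU}} = \frac{640.20 - 359.41}{640.20} \approx 44\% .
\]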

In the next step we executed the algorithm on the N570GTX GPU, which has 480 cores at 750 MHz; the results are shown in Tables 3 and 4. The aggregate computation power of this GPU is taken as 351.5 GHz, giving a GPU/CPU ratio of 351.5 / 4.8 = 73.22.
The performance speedup is shown in Tables 3 and 4. As the tables show, the improvement ranges from 12% to 44%, depending on the GPU memory size and the number of cores. To be conservative we ignore the 40,000,000 case and take the minimum improvement of about 12%, which corresponds to a speedup factor of roughly tCPU / tGPU = 1 / (1 - 0.12), about 1.14.

Table 3. Comparison between CPU and GPU runtime on the N570GTX

Count of numbers    tGPU (ms)    tCPU (ms)
10,000,000          0.0071       0.752
40,000,000          5.913        640.20
200,000,000         100.042      8351.31

Table 4. Comparison between CPU and GPU runtime on the N570GTX after balancing time

Count of numbers    tGPU (ms)    tCPU (ms)    Improvement
10,000,000          0.5198       0.752        31%
40,000,000          432.97       640.20       33%
200,000,000         7325.08      8351.31      13%


5. Conclusion

In this paper we improved the Eratosthenes prime number computation algorithm, applied a new structure so it can run on the GPU, and ran tests to measure the improvement obtained from parallelization. We experimented on GPUs with different powers and core counts (8.59 GHz and 351.5 GHz aggregate). With the applied method, transferring the data as bit arrays solved the memory problem of this algorithm on the GPU and also improved performance. We conclude that using GPU processor arrays with low per-core computation power but a high number of cores gives a good speedup, since we are using a SIMD system. In addition, the hardware communication between GPU cores provides very fast computation and memory read/write, because it is much faster than the network between the nodes of a cluster or supercomputer. For future work we would like to combine MPI and CUDA to harness the computation power of several GPUs working together.

References
[1] Message Passing Interface Forum (MPIF), "MPI: A Message-Passing Interface Standard," November 2003.
[2] D. R. Butenhof, Programming with POSIX Threads. Addison-Wesley, 1997.
[3] Electronic Frontier Foundation (EFF), Cooperative Computing Awards, March 1999. http://www.eff.org/awards/coop.php
[4] H. N. Gabow, Introduction to Algorithms. University of Colorado, 2006.
[5] H. Halberstam and H. E. Richert, Sieve Methods. Academic Press, London, 1974.
[6] T. H. Myer and I. E. Sutherland, "On the design of display processors," Commun. ACM, vol. 11, no. 6, pp. 410-414, 1968.
[7] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879-899, May 2008.
[8] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick, "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures," in SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1-12. Available: http://www.eecs.berkeley.edu/kdatta/pubs/sc08.pdf
[9] C. Huang, O. S. Lawlor, and L. V. Kale, "Adaptive MPI," in Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 03), College Station, Texas, October 2003. Available: http://charm.cs.uiuc.edu/research/ampi/
[10] O. S. Lawlor, M. Page, and J. Genetti, "MPIglut: Powerwall programming made easier," Journal of WSCG, pp. 130-137, 2008. Available: http://www.cs.uaf.edu/sw/mpiglut.
[11] Z. Fan, F. Qiu, and A. Kaufman, "ZippyGPU: Programming toolkit for general-purpose computation on GPU clusters," poster at Supercomputing '06 Workshop on General-Purpose GPU Computing: Practice and Experience. ACM, 2006. Available: http://gpgpu.org/static/sc2006/workshop/SBU ZippyGPUAbstract.pdf
[12] A. Moerschell and J. D. Owens, "Distributed texture memory in a multi-GPU environment," in Graphics Hardware, 2006. Available: http://graphics.idav.ucdavis.edu/publications/print pub?pub id=886
[13] J. A. Stuart and J. D. Owens, "Message passing on data-parallel architectures," in Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, March 2009. Available: http://graphics.idav.ucdavis.edu/publications/print pub?pub id=959
[14] D. A. Patterson, "Latency lags bandwidth," Commun. ACM, vol. 47, no. 10, pp. 71-75, 2004.
[15] R. Farber, "CUDA, supercomputing for the masses: Part 12," Dr. Dobb's Journal, May 2009. Available: http://www.ddj.com/cpp/217500110
[16] B. Carter, The Game Asset Pipeline. Charles River Media, 2004. Available: http://books.google.com/books?id=Sr2HKC46ImcC&pg=PA113
[17] P. Micikevicius, "3D finite difference computation on GPUs using CUDA," in GPGPU-2: Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units. New York, NY, USA: ACM, 2009, pp. 79-84.
[18] L. Wesolowski, "An application programming interface for general purpose graphics processing units in an asynchronous runtime system," Master's thesis, Dept. of Computer Science, University of Illinois, 2008. Available: http://charm.cs.uiuc.edu/papers/08-12
