1 Introduction
The main goal of earth exploration is to provide the oil and gas industry with
knowledge of the earth’s subsurface structure to detect where oil can be found
and recovered. To do so, large-scale seismic surveys of the earth are performed,
and the data recorded undergoes complex iterative processing to extract a ge-
ological model of the earth. The data are then interpreted by experts to help
decide where to build oil recovery infrastructure [1].
In practice, seismic data processing is divided into two steps. The first step
applies signal processing algorithms to normalize the signal over the entire survey.
This work was supported by grants from the National High Technology Research and
Development Program of China (863 Program) No.2007AA01A127 and the Special-
ized Research Fund for the Doctoral Program of Higher Education (New Faculty)
2007006028.
Y. Dou, R. Gruber, and J. Joller (Eds.): APPT 2009, LNCS 5737, pp. 165–176, 2009.
c Springer-Verlag Berlin Heidelberg 2009
Fig. 1. Relationship between source, receiver, scatter point and their corresponding
migration curve in PKTM. TS is the sending time from the source to the scatter point.
TR is the reflecting time from the scatter point to the receiver.
(Figure: traces are read from the local disk into a memory buffer and broadcast, in the computing loop, to migration processes 0, 1 and 2.)
The main reason behind this evolution is that the GPU is specialized for
compute-intensive, highly parallel computation, which is exactly what graphics
rendering is about, and is therefore designed so that more transistors are
devoted to data processing than to data caching and flow control. When GPUs are
used as general platforms to exploit data-level parallelism (DLP) for non-graphics
applications, they are known as General-Purpose GPUs (GPGPUs).
As a leader in the seismic data processing industry, Compagnie Gen-
erale de Geophysique (CGG) has evaluated the GPGPU and the Compute Unified
Device Architecture (CUDA) as accelerating platforms for its migration software
[1]. There is a substantial body of research on DLP programming models for GPUs.
I. Buck et al. presented the Brook system for GPGPU [5]. Brook extends C with
simple data-parallel constructs, enabling the use of the GPU as a stream-
ing coprocessor. D. Tarditi et al. presented Accelerator, a system that instead
uses data parallelism to program GPUs for general-purpose uses [6]. PeakStream
Corp. developed a DLP programming platform for GPGPU [7]: a software
development platform offering an easy-to-use stream programming model for
multi-core processors and accelerators such as GPUs.
In this paper, we demonstrate how to use CUDA [4] on NVidia GeForce 8800GT
and Tesla C870 GPUs to exploit the data-level parallelism in a practical
CR-PKTM program.
for (loopcount1) {
    for (loopcount2) {
        ...
        while (condition1) {
            ...
            if (condition2) {
                ...
                for (loopcount3) { ... }
            } else if (condition3) {
                ...
                for (loopcount4) { ... }
            } else {
                for (loopcount5) { ... }
            }
            ...
        } // while
    } // for loopcount2
} // for loopcount1
Compared with the Kirchhoff migration CPU code of PeakStream [7], the
practical CR-PKTM program has more branches, one more loop level and more
complicated floating point calculations. As is well known, branches hurt the
efficiency of the SIMD instructions of a GPGPU at runtime.
2.1 Prototype I
The practical CR-PKTM program in Fig. 3 contains four nested loops. The
outer two loops select the coordinates to be migrated in the inner
two loops. Rewriting the inner two loops from CPU code to CUDA
code is an easy way to utilize the GPGPU. Fig. 4 illustrates how Prototype-I
works. For every selected coordinate, we send the input data from CPU memory
to GPU memory, start CUDA kernels on the GPU [4], calculate the migration
results, and then send the results back to CPU memory.
However, Prototype-I has serious bandwidth issues. Because the input
data for every trace, including the originally collected data, the pre-processed
data and the result array, exceed 100M bytes, the average transfer
overhead between CPU memory and GPU memory is 150–160ms (for about 300M of
bidirectional data), given an ideal transfer rate of about 5GB/s and a
practical rate of about 2GB/s. Although the GPU can finish the computation for
every trace in 5ms, the total cost of calculation plus data transfer
is much higher than that of the original CPU code, which takes less than 15ms on
an Intel P4 3.0G.
2.2 Prototype II
A deeper study of the CR-PKTM program in Fig. 3 shows that the input
data for every trace include large data arrays, more than 100M bytes in total, which
record the migration result and are partly reused by subsequent traces. We can keep
these arrays in GPU memory until they are no longer needed. With 512M of GPU
memory, we can keep up to 300 traces of data on the GPU.
Fig. 5 presents the flowchart of Prototype-II. Compared to Prototype-I,
Prototype-II pre-sends the large data arrays to GPU memory before the loop,
and transfers only about 1M bytes between CPU memory and GPU memory
for every trace. The transfer overhead between the two memories is then less
than 1ms per trace. Because the CUDA code of the inner two loops finishes
in 5ms, the GPU code on an NVidia GeForce 8800GT runs more
than 4 times faster than the CPU code on an Intel P4 3.0G.
2.3 Prototype III
(Figure: flowchart of Prototype-III, Step0 through Step4.)
The for loop with loopcount4 is also rewritten to CUDA code in Step4. The for
loop with loopcount5 is never executed in practice, so we simply ignore it.
Step2 decides which step, Step3 or Step4, will be executed next.
Step3 and Step4 are carefully designed CUDA kernels for NVidia's SIMD cores.
According to the iteration counts of the inner loops, Step3 triggers at
least 3000 threads and Step4 triggers at least 1000 threads, respectively.
Prototype-III redesigns the original CPU code to better fit the GPU and CUDA
features, and runs more than 7.2 times faster than the CPU code on an Intel
P4 3.0G.
2.4 Prototype IV
(GPU Time profiler summary: step1 (291), memcopy (11), step4 (290), step3 (290), step2 (290).)
(Figure: flowchart of Prototype-IV: Step0, Step1, Step2, then Step3 or Step4 for every trace.)
(Plot: execution time in ms, about 430 to 510, versus the number of buffered traces, 1 to 150.)
Fig. 9. Parallelizing Step1 and Step2 by buffering their inputs and outputs for multiple
traces
(GPU Time profiler summary: step1 (14), memcopy (330), step4 (270), step3 (270), step2 (14).)
We can use CUDA streams to overlap the I/O time with kernel execution
time. Fig. 11 shows how many streams should be applied to get the best runtime
performance; the best stream number is 5 under this scenario.
Prototype-IV with stream support is more than 16.3 times faster than
its CPU version on an Intel P4 3.0G.
3 Floating Point Errors
for(...)
{
    /* KTM1 is a fixed-point time index with 12 fractional bits;
       the shift recovers the integer sample index. */
    int ITM1 = KTM1 >> 12;
    /* Second-difference stencil on WAVE, weighted by TA1 and accumulated
       into WOT (Fortran-style array accesses kept from the original code). */
    WOT(IT,KF,MC,N4,NOFF) = WOT(IT,KF,MC,N4,NOFF)
        - TA1*( WAVE(ITM1-KP1,1,1,NTNEW,NBAND)
              - WAVE(ITM1,1,1,NTNEW,NBAND)
              - WAVE(ITM1,1,1,NTNEW,NBAND)
              + WAVE(ITM1+KP1,1,1,NTNEW,NBAND));
    KTM1 = KTM1 + KDELT;   /* advance the fixed-point time index */
    TA1  = TA1 + ADELT;    /* advance the weight by its increment */
}
Table 1. Relative floating point errors between CPU and GPU results
4 Performance Evaluation
We implemented the practical CR-PKTM program on NVidia 8800GT and Tesla
C870 GPUs, which have 512M and 1G of GPU memory, respectively; both support
PCIE-16X and CUDA 2.0. The GPUs achieve up to 336 GFLOPs and 350 GFLOPs of
single-precision floating point performance, respectively. The host machine of
the 8800GT has an Intel P4 3.0G CPU and 2G of DDR400 memory; the Tesla host has
an AMD Athlon64 3000+ CPU and 2G of DDR400 memory. The operating system is
Linux 2.4.21 and the GCC version is 3.2.3.
For 30000 traces of input data, Prototype-III and Prototype-II on the 8800GT are
7.2 and 4 times faster than the CPU code on the Intel P4 3.0G, respectively,
as shown in Fig. 13. Prototype-IV on the 8800GT and on the Tesla C870 is 16.3
and 11.6 times faster than the CPU code, respectively. It is interesting that
the 8800GT is faster than the Tesla C870, although this is not a strictly
"apple-to-apple" comparison because the two have different host machines.
(Plot: execution time in ms on a logarithmic axis, 10,000 to 10,000,000.)
Fig. 14. Final images on CPU and GPU. The left image was generated by the CPU
code, the right image by the GPU code.
Prototype-I on the 8800GT is much slower, almost 10 times, than the CPU code
on the P4, because of the heavy transfer overhead between CPU and GPU memories
discussed in Section 2.1.
Fig. 14 shows the final images generated by the CPU and GPU CR-PKTM
programs for the same input traces. Although Section 3 describes how
floating point errors can be a serious issue when implementing CR-PKTM on
GPUs, the final images show no distinct differences and are all acceptable to
geophysicists.
5 Conclusion
For seismic data processing, GPGPU is an appropriate accelerating platform.
Many seismic data processing applications, like CR-PKTM, accept single-
precision floating point results, and compared with double precision, single
precision is where the GPU's strength lies in terms of performance and power
consumption.
However, porting the original CPU code to GPGPU code is not a "free lunch".
It is not easy to transform sequential CPU code, in C or Fortran, into
data-parallel GPU code with hundreds or thousands of threads that suits the
SIMD cores.
In this paper, we introduced a series of GPGPU prototypes for a practical
CR-PKTM program and presented the nontrivial code migration work. We hope
this work will be helpful for future GPGPU applications, especially seismic
data processing applications, and for GPGPU programmers.
References
1. Deschizeaux, B., Blanc, J.Y.: Imaging Earth’s Subsurface Using CUDA, http://
developer.download.nvidia.com/books/gpu_gems_3/samples/gems3_ch38.pdf
2. Taner, M.T., Koehler, F.: Velocity spectra-digital computer derivation and
application of velocity functions. Geophysics 34, 859–881 (1969)
3. Zhao, C.H., Shi, X.H., Yan, H.H., Wang, L.: Exploiting coarse-grained data
parallelism in seismic processing. In: Proceedings of the 2008 Workshop on
Architectures and Languages for Throughput Applications, held in conjunction
with the 35th International Symposium on Computer Architecture, Beijing,
China (2008)
4. NVidia: NVidia CUDA Compute Unified Device Architecture Programming Guide,
Version 2.0 (2008)
5. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M.,
Hanrahan, P.: Brook for GPUs: Stream Computing on Graphics Hardware,
pp. 777–786. ACM Press, New York (2004)
6. Tarditi, D., Puri, S., Oglesby, J.: Accelerator: Using Data Parallelism to Program
GPUs for General-Purpose Uses. In: Proceedings of ASPLOS 2006, pp. 325–335
(2006)
7. Papakipos, M.: The PeakStream Platform: High-Productivity Software
Development for Multi-Core Processors. Whitepaper, PeakStream Corp. (2007)