
A Practical Approach of Curved Ray Prestack Kirchhoff Time Migration on GPGPU

Xiaohua Shi, Chuang Li, Xu Wang, and Kang Li

School of Computer Science, Beihang University, Beijing 100083, China
xhshi@buaa.edu.cn, whlichuang@126.com, {xu.wang,kang.li}@sei.buaa.edu.cn

Abstract. We introduced four prototypes of General Purpose GPU solutions using
the Compute Unified Device Architecture (CUDA) on NVidia GeForce 8800GT and
Tesla C870 for a practical Curved Ray Prestack Kirchhoff Time Migration program,
which is one of the most widely adopted imaging methods in the seismic data
processing industry. We presented how to re-design and re-implement the original
CPU code as efficient GPU code step by step. We demonstrated optimization methods,
such as how to reduce the overhead of memory transfers over the PCI-E bus, how to
significantly increase the number of kernel threads on the GPU cores, how to buffer
the inputs and outputs of the CUDA kernel modules, and how to use CUDA streams to
overlap memory transfers with GPU kernel execution, to improve the runtime
performance on GPUs. We analyzed the floating point errors between CPUs and GPUs,
and presented the images generated by the CPU and GPU programs for the same
real-world seismic data inputs. Our final approach, Prototype-IV on an NVidia
GeForce 8800GT, is 16.3 times faster than its CPU version on Intel's P4 3.0G.

Keywords: General Purpose GPU, Prestack Kirchhoff Time Migration, CUDA.

1 Introduction
The main goal of earth exploration is to provide the oil and gas industry with
knowledge of the earth’s subsurface structure to detect where oil can be found
and recovered. To do so, large-scale seismic surveys of the earth are performed,
and the recorded data undergo complex iterative processing to extract a geological
model of the earth. The data are then interpreted by experts to help decide where
to build oil recovery infrastructure [1].
In practice, seismic data processing is divided into two steps. The first step
applies signal processing algorithms to normalize the signal over the entire survey
and to increase the signal-to-noise ratio. Hundreds of mathematical algorithms are
available during this step, from which geophysical specialists select particular
candidates for the seismic data based on experience. The second step, which is the
most time-consuming one, is designed to correct for the effects of changing
subsurface media velocity on the wave propagation through the earth. The Prestack
Kirchhoff Time Migration (PKTM) algorithm used in the second step is one of the
most widely adopted imaging methods in the seismic data processing industry.

⋆ This work was supported by grants from the National High Technology Research and
Development Program of China (863 Program) No. 2007AA01A127 and the Specialized
Research Fund for the Doctoral Program of Higher Education (New Faculty) 2007006028.

Fig. 1. Relationship between source, receiver, scatter point and their corresponding
migration curves in PKTM. TS is the travel time from the source to the scatter
point, and TR is the travel time from the scatter point to the receiver.
Fig. 1 shows the relationship between the source, receiver and scatter point,
as well as their corresponding migration curves, used in the PKTM algorithm. It
assumes that the energy of a sampled point on an input trace is the superposition
of the reflections from all the underground scatter points that have the same
travel time. The purpose of the migration processing is to spread the points on
an input trace to all possible scatter points in 3D space. Each input trace is
independent of the others when it is migrated, which makes the problem well suited
to parallelization on a cluster. After all input traces are migrated, the migrated
samples are accumulated to obtain the migrated image. The algorithm is extremely
time-consuming because of the huge number of iterations at runtime.
A PKTM program, especially a Curved Ray PKTM (CR-PKTM) program [2], usually runs
for days or weeks to process a typical seismic job on clusters with hundreds of
machines. Fig. 2 illustrates a typical approach of CR-PKTM on a cluster [3].
Processes 1–N on different nodes receive the same amount of trace data as input,
and the calculation work on each node is almost the same. Clearly, one efficient
way to improve the overall performance of CR-PKTM is to improve the average
calculation performance of each node in the cluster.
In a matter of just a few years, the programmable graphics processing unit has
evolved into an absolute computing workhorse. With multiple cores driven by very
high memory bandwidth, today's GPUs offer incredible resources for both graphics
and non-graphics processing [4]. GPUs can achieve hundreds or even thousands of
GFLOPS, compared to general-purpose CPUs, which so far deliver only dozens of
GFLOPS.

Fig. 2. A parallelized CR-PKTM program on a cluster: Process 0 reads traces from
local disk and broadcasts them, while Processes 1, 2, ... receive the traces; each
process runs the migration computing loop with its own memory buffer and local disk.

The main reason behind such an evolution is that the GPU is specialized for
compute-intensive, highly parallel computation - exactly what graphics rendering
is about - and therefore is designed such that more transistors are devoted to
data processing rather than data caching and flow control. When GPUs are
used as general platforms to exploit data-level-parallelism (DLP) for non-graphic
applications, they are known as General Purpose GPUs (GPGPUs).
As a leader in the seismic data processing industry, Compagnie Generale de
Geophysique (CGG) has evaluated GPGPU and the Compute Unified Device Architecture
(CUDA) as accelerating platforms in its migration software [1]. There has been
considerable research on DLP programming models for GPUs. I. Buck et al. presented
the Brook system for GPGPU [5]. Brook extends C to include simple data-parallel
constructs, enabling the use of the GPU as a streaming coprocessor. D. Tarditi
et al. presented Accelerator, a system that uses data parallelism to program GPUs
for general-purpose uses [6]. PeakStream Corp. developed a DLP programming platform
for GPGPU [7]. The PeakStream platform was a software development platform that
offered an easy-to-use stream programming model for multi-core processors and
accelerators such as GPUs.
In this paper, we demonstrate how to use CUDA [4] on NVidia GeForce 8800GT and
Tesla C870 GPUs to exploit data-level parallelism for a practical CR-PKTM program.

2 Implementing Curved Ray Prestack Kirchhoff Time Migration on GPGPU

Fig. 3 presents the simplified pseudo code of a practical CR-PKTM program.
There are four nested loop levels in the program. The outer two levels survey the
incoming floating-point values that represent different coordinates on the earth's
surface, choose the appropriate candidates, and pass them to the inner two loops
to be migrated on a particular cluster node.

for(loopcount1){
  for(loopcount2){ ...
    while(condition1){ ...
      if(condition2){ ...
        for(loopcount3){... ...}
      }else if(condition3){ ...
        for(loopcount4){... ...}
      }else{
        for(loopcount5){... ...}
      }
      ... ...
    }//while
  }//for loopcount2
}//for loopcount1

Fig. 3. Simplified pseudo code of a practical CR-PKTM program

Compared with the Kirchhoff migration CPU code of PeakStream [7], the practical
CR-PKTM program has more branches, one more loop level, and more complicated
floating point calculations. As is well known, branches hurt the efficiency of
the SIMD instructions of a GPGPU at runtime.

2.1 Prototype I

There are four nested loop levels in the practical CR-PKTM program in Fig. 3. The
outer two levels select appropriate coordinates to be migrated in the inner two
loops. Rewriting the inner two loops from CPU code to CUDA code is an easy way to
utilize the GPGPU. Fig. 4 illustrates how Prototype-I works. For every selected
coordinate, we send the input data from CPU memory to GPU memory, start CUDA
kernels on the GPU [4], calculate the migration results, and then send the results
back to CPU memory.

Fig. 4. Flowchart of Prototype-I: for every trace, send data from CPU memory to
GPU memory, start kernels on the GPU, and send the results from GPU memory back
to CPU memory.
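To make the data flow concrete, the following host-side sketch mirrors the
Prototype-I structure described above; the kernel body, NUM_TRACES, and all buffer
sizes are illustrative assumptions rather than the original implementation.

// Prototype-I (sketch): every trace pays the full round trip over the PCI-E bus.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void migrate_kernel(const float *in, float *image, int n) {
    // Grid-stride loop; the body only stands in for the real migration math.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        image[i] += in[i & 1023];
}

int main() {
    const int    NUM_TRACES = 300;
    const size_t IN_BYTES   = 1u << 20;      // ~1M bytes of per-trace samples
    const size_t IMG_BYTES  = 100u << 20;    // >100M bytes of result data
    const int    N          = (int)(IMG_BYTES / sizeof(float));

    float *h_in  = (float *)calloc(IN_BYTES  / sizeof(float), sizeof(float));
    float *h_img = (float *)calloc(IMG_BYTES / sizeof(float), sizeof(float));
    float *d_in, *d_img;
    cudaMalloc((void **)&d_in,  IN_BYTES);
    cudaMalloc((void **)&d_img, IMG_BYTES);

    for (int t = 0; t < NUM_TRACES; ++t) {
        // 1. CPU mem. to GPU mem. -- the large result array travels every time.
        cudaMemcpy(d_in,  h_in,  IN_BYTES,  cudaMemcpyHostToDevice);
        cudaMemcpy(d_img, h_img, IMG_BYTES, cudaMemcpyHostToDevice);
        // 2. Start kernels on GPU.
        migrate_kernel<<<4096, 256>>>(d_in, d_img, N);
        // 3. GPU mem. to CPU mem.
        cudaMemcpy(h_img, d_img, IMG_BYTES, cudaMemcpyDeviceToHost);
    }

    printf("migrated %d traces\n", NUM_TRACES);
    cudaFree(d_in); cudaFree(d_img);
    free(h_in); free(h_img);
    return 0;
}

In this structure the two host-to-device copies plus the copy back move on the
order of hundreds of megabytes per trace, which is the source of the transfer cost
discussed next.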



However, there are serious bandwidth issues in Prototype-I. Because the input data
for every trace, including the originally collected data, the pre-processed data
and the result array, amount to more than 100M bytes, the average transfer overhead
between CPU memory and GPU memory could be 150–160 ms (for about 300M bytes of
bidirectional data), with an ideal transfer rate of about 5 GB/s and a practical
transfer rate of about 2 GB/s. Although the GPU can finish every thread within
5 ms, the total cost of calculation and data transfer is much higher than that of
the original CPU code, which could be less than 15 ms on Intel's P4 3.0G.

2.2 Prototype II

With a deeper study of the CR-PKTM program in Fig. 3, we found that the input data
for every trace include large data arrays of more than 100M bytes, which record
the migration result and are partly reused by the following traces. We can keep
these arrays in GPU memory until they are no longer needed. With 512M bytes of
GPU memory, we can keep up to 300 traces of data in GPU memory.
Fig. 5 presents the flowchart of Prototype-II. Compared to Prototype-I,
Prototype-II pre-sends the large data arrays to GPU memory before the loop, and
only transfers about 1M bytes between CPU memory and GPU memory for every trace.
The transfer overhead between the two memories is less than 1 ms per trace.
Because the CUDA code of the inner two loops can be finished in 5 ms, the GPU
code on an NVidia GeForce 8800GT can be more than 4 times faster than the CPU
code on Intel's P4 3.0G.

Fig. 5. Flowchart of Prototype-II: the large arrays are sent to GPU memory once;
for every trace, about 1M bytes are sent to GPU memory, the kernels are started on
the GPU, and about 1M bytes are sent back to CPU memory; finally the large arrays
are sent back to CPU memory.
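A minimal host-side sketch of the Prototype-II idea follows; the function, kernel,
and sizes are hypothetical, and the small per-trace copy back to CPU memory shown
in Fig. 5 is omitted for brevity.

// Prototype-II (sketch): the large arrays stay resident in GPU memory,
// so only ~1M bytes per trace cross the PCI-E bus inside the loop.
#include <cuda_runtime.h>

__global__ void migrate_kernel(const float *trace, float *image, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        image[i] += trace[i & 1023];          // placeholder for the migration math
}

void run_prototype2(const float *h_traces, float *h_image,
                    int num_traces, size_t trace_bytes, size_t image_bytes) {
    float *d_trace, *d_image;
    cudaMalloc((void **)&d_trace, trace_bytes);
    cudaMalloc((void **)&d_image, image_bytes);

    // Send the large arrays to GPU memory once, before the trace loop.
    cudaMemcpy(d_image, h_image, image_bytes, cudaMemcpyHostToDevice);

    const int    n            = (int)(image_bytes / sizeof(float));
    const size_t trace_floats = trace_bytes / sizeof(float);
    for (int t = 0; t < num_traces; ++t) {
        // Only ~1M bytes of per-trace input travel here (<1 ms per trace).
        cudaMemcpy(d_trace, h_traces + (size_t)t * trace_floats,
                   trace_bytes, cudaMemcpyHostToDevice);
        migrate_kernel<<<4096, 256>>>(d_trace, d_image, n);
    }

    // Send back the accumulated large arrays to CPU memory after the loop.
    cudaMemcpy(h_image, d_image, image_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_trace);
    cudaFree(d_image);
}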



Although Prototype-II dramatically decreased the transfer overhead between CPU
and GPU memories, the straightforwardly translated CUDA code did not take good
advantage of the powerful SIMD cores of the GPU. For every incoming trace, at most
256 coordinates are selected for the inner loops, which means at most 256 threads
are triggered on the GPU. The NVidia GeForce 8800GT GPU has 14 multiprocessors,
each with 8 stream processors, so every stream processor runs fewer than 3 threads
on average. That means most stream processors are idle during the calculation.
Furthermore, there are a lot of branch instructions in the inner loops. These
branches also seriously hurt the runtime efficiency of the SIMD cores.

2.3 Prototype III

Fig. 6 demonstrates the flowchart of Prototype-III. The original CR-PKTM program
has been separated into five steps, Step0–Step4. Step0 runs on the CPU; it
initializes the input data and sends them to GPU memory.
Step1–Step4 run on the GPU as CUDA kernels. Step1 starts one thread for every
incoming coordinate; it surveys every coordinate, selects the appropriate
candidates, does some pre-migration calculations, and sends the results to Step2.
Step2 starts one thread for every appropriate coordinate and performs the same
calculation work as the 3rd layer loop (the while loop) in Fig. 3. There is an
if-elseif-else conditional statement in the 3rd layer loop: the for loop with
loopcount3 is rewritten as CUDA code in Step3, and the for loop with loopcount4
is rewritten as CUDA code in Step4. The for loop with loopcount5 is never executed
in practice, so we simply ignore it.
Step2 decides which step, Step3 or Step4, will be executed next. Step3 and Step4
are CUDA kernels carefully designed for NVidia's SIMD cores. According to the
iteration counts of the inner loops, Step3 triggers at least 3000 threads and
Step4 at least 1000 threads, respectively.

Fig. 6. Flowchart of Prototype-III: the large arrays are sent to GPU memory once;
for every trace, Step0 runs on the CPU and Step1, Step2, and then Step3 or Step4
run on the GPU; finally the large arrays are sent back to CPU memory.
Prototype-III redesigned the original CPU code to better fit the GPU and CUDA
features, and improved the runtime performance by more than 7.2 times compared
with the CPU code on Intel's P4 3.0G.
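The following sketch captures the launch structure of the four steps; the kernel
bodies are trivial placeholders and the routing array written by Step2 is our own
assumption, so only the decomposition and the rough thread counts reflect the
description above.

// Prototype-III (sketch): the original loop nest becomes four CUDA kernels.
#include <cuda_runtime.h>

__global__ void step1_select(const float *coords, int *selected, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // one thread per incoming coordinate
    if (i < n) selected[i] = (coords[i] > 0.0f);        // placeholder candidate test
}

__global__ void step2_route(const int *selected, int *branch, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // one thread per candidate
    if (i < n) branch[i] = selected[i] ? (i & 1) : -1;  // placeholder: route to Step3 or Step4
}

__global__ void step3_migrate(float *image, int work) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // >= 3000 threads in practice
    if (i < work) image[i] += 1.0f;                     // placeholder for the loopcount3 body
}

__global__ void step4_migrate(float *image, int work) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // >= 1000 threads in practice
    if (i < work) image[i] += 0.5f;                     // placeholder for the loopcount4 body
}

void migrate_one_trace(const float *d_coords, int *d_selected, int *d_branch,
                       float *d_image, int num_coords, int work3, int work4) {
    // Step1: survey every incoming coordinate and pick the candidates.
    step1_select<<<(num_coords + 255) / 256, 256>>>(d_coords, d_selected, num_coords);
    // Step2: per-candidate work of the 3rd-layer while loop; d_branch records
    // whether Step3 or Step4 has to run for each candidate.
    step2_route<<<(num_coords + 255) / 256, 256>>>(d_selected, d_branch, num_coords);
    // Step3/Step4: the former loopcount3/loopcount4 loops, spread over
    // thousands of threads so the SIMD cores stay busy.
    step3_migrate<<<(work3 + 255) / 256, 256>>>(d_image, work3);
    step4_migrate<<<(work4 + 255) / 256, 256>>>(d_image, work4);
}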

2.4 Prototype IV

Using the CUDA Profiler to analyze the runtime performance of Prototype-III, we
found that Step2 dominated the execution time, as shown in Fig. 7. Step2 starts
one thread for every appropriate coordinate and performs the same calculation work
as the 3rd layer loop (the while loop) in Fig. 3. At most 256 threads are triggered
in Step2 for every input trace. However, for GPUs like the 8800GT or Tesla C870
with more than 100 cores, this thread number is too small to utilize the cores
well; these threads spend more time waiting for I/O than executing kernel code.
One efficient way to improve the runtime performance of Step2 is to increase its
thread count. Because Step2 uses the output of Step1 as input, we applied input
and output buffers to both steps. The buffers hold the input and output data of
multiple traces. With these buffers, Step1 and Step2 can start N*256 threads
before Step3 and Step4, respectively, where N is the number of buffered traces.
Fig. 8 presents the flowchart of Prototype-IV.
Fig. 9 shows how many traces should be buffered to get the best runtime performance
in Step1 and Step2. For the 8800GT and Tesla C870 GPUs we used, the best trace
number is 20. That means up to 20*256 = 5120 threads are triggered in Step1 and
Step2, as sketched below.
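The sketch reuses the placeholder step1_select and step2_route kernels from the
Prototype-III sketch in Sect. 2.3; the constants mirror the numbers above, and
everything else is illustrative.

// Prototype-IV batching (sketch): Step1 and Step2 are launched once over
// N buffered traces, exposing N*256 threads instead of at most 256.
const int N_TRACES         = 20;                           // best buffer size per Fig. 9
const int COORDS_PER_TRACE = 256;                          // at most 256 candidates per trace
const int BATCH            = N_TRACES * COORDS_PER_TRACE;  // 5120 threads

void run_batched_step1_step2(const float *d_coords, int *d_selected, int *d_branch) {
    // One launch covers all buffered traces instead of one launch per trace.
    step1_select<<<BATCH / 256, 256>>>(d_coords, d_selected, BATCH);
    step2_route <<<BATCH / 256, 256>>>(d_selected, d_branch, BATCH);
}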
Fig. 10 shows the profiling data of Prototype-IV. The memcopy function, rather
than any of the steps, now dominates the execution time. The memcopy calls before
Step3 and Step4 send a large array, named WAVE, from CPU to GPU memory.

Fig. 7. CUDA profiling data of Prototype-III: GPU time shares of step1 (291 calls),
memcopy (11), step4 (290), step3 (290), and step2 (290).



Fig. 8. Flowchart of Prototype-IV: the large arrays are sent to GPU memory once;
Step0, Step1, and Step2 run once for every N buffered traces, Step3 and Step4 run
for every trace, and the large arrays are sent back to CPU memory at the end.

Fig. 9. Parallelizing Step1 and Step2 by buffering their inputs and outputs for
multiple traces: runtime (ms) versus the number of buffered traces (1–150).

Fig. 10. CUDA profiling data of Prototype-IV: GPU time shares of step1 (14 calls),
memcopy (330), step4 (270), step3 (270), and step2 (14).



Fig. 11. Streaming Prototype-IV: runtime (ms) for stream counts from none (N/A)
up to 20.

We can use CUDA streams to overlap the I/O time with the kernel execution time.
Fig. 11 shows how many streams should be used to get the best runtime performance;
in this scenario the best stream number is 5. Prototype-IV with stream support is
more than 16.3 times faster than its CPU version on Intel's P4 3.0G.
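A minimal sketch of the stream overlap follows, assuming the WAVE data are copied
in chunks with cudaMemcpyAsync from pinned host memory while kernels for other
chunks execute; the chunk size, kernel body, and the way WAVE is split are our
assumptions, and only the five streams follow Fig. 11.

// Stream overlap in Prototype-IV (sketch): the WAVE transfer for one chunk
// runs concurrently with the Step3/Step4-style kernels of another chunk.
#include <cuda_runtime.h>

__global__ void migrate_chunk(const float *wave, float *image, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) image[i] += wave[i];                 // placeholder migration body
}

int main() {
    const int    NUM_STREAMS = 5;                   // best value per Fig. 11
    const int    CHUNKS      = 20;
    const size_t CHUNK_BYTES = 4u << 20;            // 4M-byte WAVE chunk (assumed)
    const int    N           = (int)(CHUNK_BYTES / sizeof(float));

    float *h_wave;                                  // pinned host memory is required
    cudaMallocHost((void **)&h_wave, CHUNKS * CHUNK_BYTES);   // for copy/compute overlap
    float *d_wave, *d_image;
    cudaMalloc((void **)&d_wave,  NUM_STREAMS * CHUNK_BYTES);
    cudaMalloc((void **)&d_image, NUM_STREAMS * CHUNK_BYTES);

    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s) cudaStreamCreate(&streams[s]);

    for (int c = 0; c < CHUNKS; ++c) {
        int s = c % NUM_STREAMS;
        float *d_w = d_wave  + (size_t)s * N;
        float *d_i = d_image + (size_t)s * N;
        // The async copy in stream s overlaps with kernels queued in other streams.
        cudaMemcpyAsync(d_w, h_wave + (size_t)c * N, CHUNK_BYTES,
                        cudaMemcpyHostToDevice, streams[s]);
        migrate_chunk<<<(N + 255) / 256, 256, 0, streams[s]>>>(d_w, d_i, N);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NUM_STREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_wave);
    cudaFree(d_wave);
    cudaFree(d_image);
    return 0;
}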

3 Floating Point Errors


The Curved Ray PKTM accumulates floating point errors. Fig. 12 shows a piece of
CR-PKTM code from the image generating phase. The image buffer WOT is accumulated
over and over before the final image is generated, so any floating point errors
that do exist are accumulated over and over as well.
Both the 8800GT and Tesla C870 have IEEE-compliant additions and multiplications.
However, the two operations are often combined into a single multiply-add
instruction, fmad, which truncates the intermediate result of the multiplication
and has a maximum error of more than 0.5 ulp (unit in the last place). For other
operations, such as divisions and sqrtf, the maximum ulp error can be as large as
3–4. CPU operations have similar ulp problems as well. That means the two
different types of processors may produce different calculation results for the
same code fragment, cf. Fig. 12.

for(...)
{
    int ITM1 = KTM1 >> 12;

    WOT(IT,KF,MC,N4,NOFF) = WOT(IT,KF,MC,N4,NOFF)
        - TA1*( WAVE(ITM1-KP1,1,1,NTNEW,NBAND)
                -WAVE(ITM1,1,1,NTNEW,NBAND)
                -WAVE(ITM1,1,1,NTNEW,NBAND)
                +WAVE(ITM1+KP1,1,1,NTNEW,NBAND));

    KTM1 = KTM1 + KDELT;
    TA1  = TA1 + ADELT;
}

Fig. 12. Sample CR-PKTM code at the image generating phase

Table 1. Relative floating point errors between CPU and GPU results

Relative Error %    1 Trace     10 Traces   50 Traces   100 Traces  300 Traces
0                   26974088    26960697    26787671    26584427    26041297
≤ 0.0001            124528      104584      225252      358052      734273
≤ 0.001             5340        34188       78441       135211      268185
≤ 0.01              0           3460        8164        16866       31859
≤ 0.1               1           377         873         1869        3787
≤ 1                 1           83          490         1230        5129
≤ 10                14          310         2000        4180        13981
≤ 100               43          286         1022        1989        4920
> 100               17          47          119         208         601
Errors/Total %      0.479       0.529       1.167       1.917       3.921
For instance, the integer number KTM1 in Fig. 12 is rounded from a floating point
value. It is right-shifted by 12 bits to obtain an index into the array WAVE. If
the CPU and GPU arrive at different KTM1 values, such as 13000704 and 13000703,
they will get different shifted indexes, 3174 and 3173. The two different indexes
into WAVE then cause entirely different calculation results for WOT, as the
following snippet illustrates.
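The two KTM1 values below are the ones quoted above; the snippet only mimics how
the shift in Fig. 12 turns a last-digit difference into different WAVE indexes.

// A one-unit difference in the rounded KTM1 selects different WAVE samples.
#include <cstdio>

int main() {
    int ktm1_cpu = 13000704;            // hypothetical CPU result
    int ktm1_gpu = 13000703;            // hypothetical GPU result, off by one

    int itm1_cpu = ktm1_cpu >> 12;      // 3174, as in the code of Fig. 12
    int itm1_gpu = ktm1_gpu >> 12;      // 3173

    // Different indexes pick different WAVE samples, so the accumulated WOT
    // values diverge even though the inputs differed only in the last digit.
    printf("CPU index %d, GPU index %d\n", itm1_cpu, itm1_gpu);
    return 0;
}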
Table 1 shows the relative errors of the final images between the CPU and GPU
code. The outputs of the CPU code, which are assumed to be more accurate, are
selected as the baseline. The relative error rate accumulates trace by trace,
from 0.479% after 1 trace to 3.921% after 300 traces.

4 Performance Evaluation
We implemented the practical CR-PKTM program on NVidia 8800GT and Tesla C870
GPUs, which have 512M and 1G of GPU memory, respectively; both support PCIE-16X
and CUDA 2.0. The GPUs can achieve up to 336 GFLOPS and 350 GFLOPS of
single-precision floating point performance, respectively. The host machine of
the 8800GT has an Intel P4 3.0G CPU and 2G of DDR400 memory; the Tesla's host has
an AMD Athlon64 3000+ CPU and 2G of DDR400 memory. The operating system is Linux
2.4.21 and the GCC version is 3.2.3.
For 30000 traces of input data, Prototype-III and Prototype-II on the 8800GT are
7.2 times and 4 times faster than the CPU code on Intel's P4 3.0G, respectively,
as shown in Fig. 13. Prototype-IV on the 8800GT and on the Tesla C870 is 16.3 and
11.6 times faster than the CPU code, respectively. Interestingly, the 8800GT is
faster than the Tesla C870, although this is not a strictly "apple-to-apple"
comparison because they have different host machines.

Fig. 13. Performance (runtime in ms, logarithmic scale) of the CPU code on Intel
P4 3.0G and of Prototype-I, Prototype-II, Prototype-III, Prototype-IV (8800GT),
and Prototype-IV (Tesla) for 300, 3000, and 30000 traces.

Fig. 14. Final images on CPU and GPU. The left image was generated by the CPU
code; the right image was generated by the GPU code.

Prototype-I on the 8800GT is much slower, almost 10 times, than the CPU code on
the P4, because of the significantly heavy transfer overhead between CPU and GPU
memories, as discussed in Section 2.1.
Fig. 14 shows the final images generated by the CPU and GPU CR-PKTM programs for
the same input traces. Although Section 3 shows that floating point errors can be
a serious issue when implementing CR-PKTM on GPUs, the final images show no
distinct difference and are all acceptable to geophysicists.

5 Conclusion
For seismic data processing, GPGPU is an appropriate accelerating platform.
Many seismic data processing applications, like CR-PKTM, accept the single-
precision results of floating point calculation. As we known so far, comparing
with the double precision, the single precision is the strength of GPU in terms
of performance and power consumption.

However, porting the original CPU code to GPGPU code is not a "free lunch". It is
not easy to transform sequential CPU code, C or Fortran programs, into
data-parallel GPU code with hundreds or thousands of threads that suits the SIMD
cores.
In this paper, we introduced a series of GPGPU prototypes for a practical CR-PKTM
program and presented the non-trivial code migration work. We hope this work will
be helpful for future GPGPU applications, especially seismic data processing
applications, and for GPGPU programmers.

References
1. Deschizeaux, B., Blanc, J.Y.: Imaging Earth's Subsurface Using CUDA,
   http://developer.download.nvidia.com/books/gpu_gems_3/samples/gems3_ch38.pdf
2. Taner, M.T., Koehler, F.: Velocity spectra-digital computer derivation and
   application of velocity functions. Geophysics 34, 859–881 (1969)
3. Zhao, C.H., Shi, X.H., Yan, H.H., Wang, L.: Exploiting coarse-grained data
   parallelism in seismic processing. In: Proceedings of the 2008 Workshop on
   Architectures and Languages for Throughput Applications, held in conjunction
   with the 35th International Symposium on Computer Architecture, Beijing,
   China (2008)
4. NVidia: NVidia CUDA Compute Unified Device Architecture Programming Guide,
   Version 2.0 (2008)
5. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M.,
   Hanrahan, P.: Brook for GPUs: Stream Computing on Graphics Hardware,
   pp. 777–786. ACM Press, New York (2004)
6. Tarditi, D., Puri, S., Oglesby, J.: Accelerator: Using Data Parallelism to
   Program GPUs for General-Purpose Uses. In: Proceedings of ASPLOS 2006,
   pp. 325–335 (2006)
7. Papakipos, M.: The PeakStream Platform: High-Productivity Software Development
   for Multi-Core Processors. Whitepaper, PeakStream Corp. (2007)
