
Efficient GPU implementation of bioinformatics

applications

Nuno Miguel Trindade Marcos

Thesis to obtain the Master of Science Degree in

Information Systems and Computer Engineering


Supervisors: Prof. Nuno Filipe Valentim Roma
Prof. Pedro Filipe Zeferino Tomás

Examination Committee:
Chairperson: Prof. José Carlos Martins Delgado
Supervisor: Prof. Nuno Filipe Valentim Roma
Member of the Committee: Prof. David Manuel Martins de Matos

November 2014
Acknowledgments

First of all, I would like to thank Professors Nuno Roma and Pedro Tomás, my supervisors in this work,
for their guidance and collaboration. Without their constant patience and tolerance it would not have
been possible to finish this work. To them, a heartfelt thank you.
Next, I would like to thank the colleagues who accompanied me throughout the course, in particular
David Gaspar for his encouragement and for all the moments we shared, Jhonny Aldeia for all his help
during the course, and Dionisio Sousa, Rui Mestre and Artur Ferreira, who helped me and gave me the
motivation to keep going. In addition, a special thanks to my friend Pedro Monteiro for the help with his
work and for the constant encouragement.
I would also like to thank my friends Daniela Coelho, Miguel Matos, David Dias, João Velez and Pedro
Chagas for all their support during this work, and my godson Tiago Carreira for repeatedly pushing me
to finish this work and for all his help throughout it.
Besides them, I would also like to thank my colleagues at Premium Minds, who were always available
to help me and to take over my tasks in my absence, in particular Márcio Nóbrega, André Soares, Renil
Lacmane and Afonso Vilela.
Last, but most importantly, I would like to thank my parents and my brother for all the strength and
motivation that allowed me to reach the end of this work, and especially my girlfriend Ana Daniela for
her motivation in the final stretch of this work.

Abstract

Biological sequence data is becoming more accessible to researchers around the world. In particular,
rich databases of protein and DNA sequence data are already made available to biologists, and their size
is increasing every day. However, all this information needs to be processed and classified.
Several bioinformatics algorithms, such as the Needleman-Wunsch and the Smith-Waterman algorithm,
have been proposed for this purpose. Both consist of dynamic programming schemes that allow the use
of parallelism to achieve better execution performance. Under this context,
this thesis proposes the integration of two previously presented parallel implementations: an adaptation
of the SWIPE implementation, for multi-core CPUs that exploits SIMD vectorial instructions, and an
implementation of the Smith-Waterman algorithm for GPU platforms (CUDASW++ 2.0). Accordingly,
the presented work offers a unified solution that tries to take advantage of all the computational resources
made available in heterogeneous platforms, composed of CPUs and GPUs, by integrating a
dynamic load balancing layer. The obtained results show that the attained speedup can reach values as
high as 6x, when executing in a quad-core CPU and two distinct GPUs.

Keywords

Bioinformatics Algorithms; Sequence Alignment; Smith-Waterman Algorithm; Heterogeneous Parallel Architectures; Load Balancing; CUDA.
Resumo

Nowadays, the amount of genetic information available to researchers keeps growing. Databases with
genetic information are available on the Internet and increase in size every day. In order to be used in
Biology, all this information needs to be processed and classified. To classify it, there are several
bioinformatics algorithms, such as the Needleman-Wunsch algorithm and the Smith-Waterman algorithm.
Both consist of the execution of multiple iterations, which allows their parallelization in order to obtain
better execution performance. Two of the existing parallel implementations are an adaptation of Rognes'
SWIPE implementation, presented by Pedro Monteiro, based on CPU thread-level parallelization, and
CUDASW++ 2.0, presented by Liu et al., based on thread-level and data-level parallelization on GPUs.
Considering both solutions, this work proposes a heterogeneous orchestration that, using both of them,
is able to process sequences on the CPU cores and on the GPUs available in the machine. In addition to
this implementation, an additional layer responsible for balancing the data among the different workers is
proposed. The results show that the execution can reach a speedup higher than 6x when executed with
four CPU cores and two distinct GPUs.

Palavras Chave

Bioinformatics Algorithms; Sequence Alignment; Smith-Waterman Algorithm; Heterogeneous Parallel Architectures; Load Balancing Module; CUDA.

Contents

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Parallel Architectures 5
2.1 Flynn’s Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 CPU - Central Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 GPU - Graphics Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 CPU vs GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Hybrid Solution: Accelerated Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 CUDA - Compute Unified Device Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6.1 Definition and Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6.2 Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6.3 Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Open Computing Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Sequence Alignment in Bioinformatics 19


3.1 Alignment Scoring Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Optimal Alignment Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Needleman-Wunsch Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Smith-Waterman Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Heuristic Sub-Optimal Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 FASTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 BLAST - Basic Local Alignment Search Tool . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Parallel Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.1 CPU Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.1.A Wozniak . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.1.B Farrar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.1.C SWIPE (Rognes) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.4.1.D Pedro Monteiro’s Implementation . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.2 GPU Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2.A Manavski’s Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2.B CUDASW++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.3 Discussion on the Presented implementations . . . . . . . . . . . . . . . . . . . . 35

4 Heterogeneous Parallel Alignment MultiSW 38


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2.1 CPU Worker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1.A CPU Wrapper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.2 GPU Worker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.2.A Asynchronous Transfers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.2.B CUDA Streams in Kernel Execution . . . . . . . . . . . . . . . . . . . . . 43
4.2.2.C Loading Sequences with Execution . . . . . . . . . . . . . . . . . . . . . 43
4.3 Application Execution Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 Implementation Details and Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4.1 Database File Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4.2 Database Sequences Pre-Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4.3 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.5 Dynamic Load-balancing Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5 Experimental Results 50
5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1.1 Experimental Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Evaluating Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.1 Scenario A - Single CPU core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.2 Scenario B - Four CPU cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.3 Scenario C - Single GPU - GeForce GTX 780 Ti . . . . . . . . . . . . . . . . . . . 54
5.3.4 Scenario D - Single GPU - GeForce GTX 660 Ti . . . . . . . . . . . . . . . . . . . 55
5.3.5 Scenario E - Four CPU cores + Single GPU Execution . . . . . . . . . . . . . . . . 55
5.3.6 Scenario F - Four CPU cores + Double GPUs Execution . . . . . . . . . . . . . . . 57
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

6 Conclusions and Future Work 59


6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

Bibliography 61

List of Figures

2.1 NVIDIA GK110 Kepler Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9


2.2 Coalesced memory access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 GPU Memory organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 CPU and GPU architectures [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Graphics Processing Unit (GPU) vs Central Processing Unit (CPU) GFLOPS comparison
[1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Fermi Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 CUDA Kernel definition and invocation example [1] . . . . . . . . . . . . . . . . . . . . . . 16
2.8 Execution Flows Representation [1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1 Pairwise Alignment Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20


3.2 Needleman-Wunsch alignment matrix example . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Smith-Waterman alignment matrix example . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 FASTA algorithm step 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 FASTA algorithm step 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.6 FASTA algorithm step 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.7 FASTA algorithm step 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.8 Multi-sequence vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.9 Rognes’ Algorithm core instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.10 Sequences Database in several chunks [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.11 Processing Block - Message [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.12 Processing Block FIFOs [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.13 Coalesced Subject Sequence Arrangement [3]. . . . . . . . . . . . . . . . . . . . . . . . . 33
3.14 Coalesced Global Memory Access [3]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.15 Program workflow of CUDASW++ 3.0 [4]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.1 Heterogeneous Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


4.2 MultiSW block diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Master Worker Model [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4 Master Worker Model [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5 CPU Wrapper Function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.6 Execution Sequence Diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.7 Workers execution not balanced. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.8 Workers execution balanced. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.1 Processing times considering a single CPU core execution and a processing block with
30000 sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 Processing times for 4 CPU cores, considering a block size of 30,000 sequences. . . . . 54
5.3 Processing Times for single GPU in Machine A, considering blocks size of 65,000 se-
quences. Total execution time about 6.35 seconds. . . . . . . . . . . . . . . . . . . . . . . 54
5.4 Processing Times for single GPU in Machine B, considering blocks size of 65,000 se-
quences. Total execution time about 7.38 seconds. . . . . . . . . . . . . . . . . . . . . . . 55
5.5 Processing Times for 4 CPU cores and a GeForce GTX780 Ti GPU, considering CPU
blocks of 30,000 sequences and GPU blocks of 65,000 sequences. Total execution time
was 6.112 seconds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.6 Number of Sequences Processed by CPU cores and GPU. . . . . . . . . . . . . . . . . . 56
5.7 Processing Times for 4 cores CPU, GPU A and GPU B, considering the initial block size
of 30,000 sequences blocks to the CPU solution and 65,000 to the GPU solution. Total
execution time of 4.957 seconds. Near some of the iteration blocks it is presented the new
considered block size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.8 Number of Sequences Processed by CPU, GPU A, and GPU B workers. . . . . . . . . . . 58

List of Tables

2.1 Flynn’s Taxonomy [5]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

5.1 Execution Speedups. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

List of Acronyms

ALU Arithmetic Logic Unit

AMD Advanced Micro Devices - North American Technology Company

APU Accelerated Processing Unit

BLOSUM Blocks Substitution Matrix

CPU Central Processing Unit

CUDA Compute Unified Device Architecture

GPC Graphics Processing Clusters

GPGPU General-Purpose Computation on Graphics Hardware

GPU Graphics Processing Unit

NVIDIA North American Company that invented the GPUs in 1999

PAM Point Accepted Mutation

SMX Streaming Multiprocessor

SP Streaming Processor

1 Introduction

Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


1.1 Motivation

Nowadays, numerous databases spread all over the world host large amounts of biological data,
and they are growing exponentially in size as the genomes of more species are sequenced. Specifi-
cally, rich databases of protein and DNA sequence data are available on the Internet. The outcome of
DNA sequencing work is very broad and can lead to many potential benefits in distinct fields such as
molecular medicine (aiming at improved diagnosis of disease, drug design, etc.), bioarchaeology and evo-
lution (study of evolution and similarity between organisms, among others), DNA forensics (identification
of crime or catastrophe victims, establishment of paternity and other family relationships, among others),
as well as agriculture and bioprocessing (disease- and drought-resistant crops, biopesticides, edible
vaccines to incorporate into food products, among others) [6].
There are several online knowledge bases containing information on millions of genes [7]:

• GenBank DNA database [8];

• National Center for Biotechnology Information (NCBI) [9];

• Universal Protein Resource (UniProt) [10];

• Nucleotide sequence database (EMBL) [11];

• Swiss-Prot [12];

• TrEMBL [13].

With this proliferation of data comes a large computational cost to perform a genetic sequence align-
ment between new genetic information and the online databases. As a consequence, genetic sequence
alignment is considered to be one of the application domains which require further improvements in
the execution speed, mostly because it involves several computationally intensive tasks, as well as
databases whose size will continue to increase. This is leading researchers to look for even faster, high-
throughput alignment tools, which can give an efficient response to this intensive growth.
One of the best known bioinformatics algorithms is the Smith-Waterman algorithm [14]. This algorithm
is presented in Section 3.2.2 and consists of the alignment of two sequences, using a score matrix ap-
proach. To accelerate it, several parallel computer architectures have recently been exploited: multi-core
processors; multiple processors installed on a single motherboard; and multiple computers connected
through a common network, in clusters or computer grids [15]. These architectures are described in
Chapter 2, while the corresponding implementations, which explore thread-level and data-level parallelism
using Central Processing Unit (CPU) and Graphics Processing Unit (GPU) architectures, are presented in Section 3.4.
Under this context, this thesis proposes the integration of two previously presented parallel implementa-
tions: an adaptation of the SWIPE implementation [16], for multi-core CPUs, that exploits SIMD vectorial
instructions [2], and an implementation of the Smith-Waterman algorithm for GPU platforms (CUDASW++
2.0) [17]. Accordingly, the presented work offers a unified solution that tries to take advantage of all the
computational resources made available in heterogeneous platforms, composed of CPUs and GPUs, by
integrating a convenient dynamic load-balancing layer.


This implementation was extensively evaluated considering several execution scenarios, combining both
kinds of workers.

1.2 Objectives

The aim of the present work is, considering two existing Smith-Waterman algorithm implementations,
to implement a unified solution that takes advantage of all the computational resources made available
in heterogeneous platforms composed of CPUs and GPUs, by integrating a convenient dynamic load-
balancing layer. With this load-balancing layer, it is expected that the execution times of the multiple
workers remain balanced along the execution timeline, as explained in Section 4.5. This way, it is
possible to minimize the waiting times between workers and to guarantee that all the workers finish their
work at approximately the same time. Besides the load-balancing layer, several optimizations of the GPU
module are presented in Chapter 4.

1.3 Document Outline

This thesis is organized as follows. First, Chapter 2 presents the main characteristics of parallel
architectures, describing the ones considered in this work, the CPU and the GPU; the Compute Unified
Device Architecture (CUDA) is also presented and described in that chapter. Next, Chapter 3 briefly
presents the sequence alignment algorithms and some of the considered applications. Chapter 4
presents the developed work, MultiSW. Finally, the results of this implementation and the corresponding
discussion are presented in Chapter 5.

2 Parallel Architectures
Contents
2.1 Flynn’s Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 CPU - Central Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 GPU - Graphics Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 CPU vs GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Hybrid Solution: Accelerated Processing Unit . . . . . . . . . . . . . . . . . . . . . . 14
2.6 CUDA - Compute Unified Device Architecture . . . . . . . . . . . . . . . . . . . . . . . 14
2.6.1 Definition and Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6.2 Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6.3 Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Open Computing Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17


According to Almasi et al. [18], a parallel computer architecture can be defined as "That collection
of processing elements that communicate and cooperate to solve large problems fast". Taking this
definition into consideration, we realize that there are several types of parallel computing architectures
that use different memory organizations and communication topologies, as well as different processor
execution models.
In Section 2.1, we review Flynn’s taxonomy, which classifies parallel architectures into four different
classes. According to the several possible approaches, four different types of parallelism may be defined:
bit-level parallelism; instruction-level parallelism; data-level parallelism; and task/thread-level parallelism.
In the remaining sections, we describe the main parallel computing architectures that are used nowa-
days: the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), and architectures that
combine both, the Accelerated Processing Unit (APU). Finally, we present the parallel-programming
model used by NVIDIA GPUs, the Compute Unified Device Architecture (CUDA).

2.1 Flynn’s Taxonomy

In 1966, Michael J. Flynn proposed a simple model that is still used to categorize computers,
taking into account the parallelism in instruction execution and memory data accesses. Flynn looked at
the parallelism in the instruction and data streams1 called for by the instructions at the most constrained
component of the multiprocessor, and placed all existing computers in four distinct categories [19], as
defined below and presented in Table 2.1.

Table 2.1: Flynn’s Taxonomy [5].

                    Single Instruction    Multiple Instruction
   Single Data           SISD                   MISD
   Multiple Data         SIMD                   MIMD

1. Single instruction stream, single data stream (SISD): This category corresponds to the unipro-
cessor model. One example of this is the conventional sequential computer based on the Von
Neumann architecture, i.e., a uniprocessor computer which can only perform one single instruc-
tion at a time.

2. Single instruction stream, multiple data streams (SIMD): The same instruction is executed by
multiple processors using different data streams. SIMD computers exploit data-level parallelism,
by applying the same operations to multiple items of data in parallel. Many current CPUs use
this kind of architecture by supporting instruction set extensions. Examples of this are MMX,
introduced by Intel [20], and the SSEx family of Streaming SIMD Extensions, an evolution of the
MMX architecture. The Advanced Vector Extensions (AVX) are another SIMD extension proposed
by Intel. This category is also followed by the programming model used in Graphics Processing
Units (CUDA and OpenCL), described in Section 2.6; a minimal intrinsics sketch is shown after this list.
1 The concept of stream refers to the sequence of data or instructions as seen by the machine during the execution of a program.


3. Multiple instruction streams, single data stream (MISD): This category indicates the use of
multiple independently executing functional units operating on a single stream of data, forwarding
the results from one functional unit to the next [5].

4. Multiple instruction streams, multiple data streams (MIMD): Each processor fetches its own
instructions and operates on its own dataset. This model exploits thread-level parallelism, since
multiple threads operate in parallel. Examples of this architecture are the current processors with
multi-threading support. Other examples are distributed systems and computer clusters.
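
As an illustration of the data-level parallelism exposed by these SIMD extensions, the following minimal
sketch (plain C with SSE2 intrinsics, not taken from any of the implementations discussed later in this
thesis) adds eight pairs of 16-bit alignment scores with a single instruction; the function and array names
are purely illustrative.

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Adds eight 16-bit cell scores at once instead of one at a time. */
void add_scores_sse2(const short *a, const short *b, short *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* load 8 shorts from a */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);  /* load 8 shorts from b */
    __m128i vr = _mm_adds_epi16(va, vb);               /* saturated add of all 8 lanes */
    _mm_storeu_si128((__m128i *)out, vr);              /* store the 8 results */
}

A scalar loop would need eight separate additions; here the SIMD unit performs all eight at once, which
is exactly the kind of data-level parallelism exploited by the vectorized Smith-Waterman implementations
presented in Section 3.4.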

2.2 CPU - Central Processing Unit

The central processing unit (CPU) is the computer hardware unit responsible for interpreting and ex-
ecuting the program instructions. One of the first commercial CPU microprocessors was the Intel 4004
presented by Intel in 1971.

A CPU is usually composed of the following components [21]:

• Arithmetic Logic Unit (ALU) - Responsible for the execution of logical and arithmetic operations;

• Control Unit – Decodes instructions, gets operands and controls the execution point;

• Registers – Memory cells of the CPU that store data needed by the CPU to execute the instruc-
tions;

• CPU interconnection - communication channels among the control unit, ALU, and registers.

Nowadays, in order to reduce power consumption and to process multiple tasks simultaneously and
more efficiently, commercial CPUs are built with multi-core technology, typically having between 4 and 16
execution cores. This way, a multi-core CPU can process 4 or more instructions at a time, following the
MIMD parallelization model. Some solutions that take advantage of this parallel processing on Intel CPUs
are presented in Section 3.4.

2.3 GPU - Graphics Processing Unit

A Graphics Processing Unit (GPU) is the processing unit present in every graphics card. This unit
is designed specifically for performing the complex mathematical and geometric calculations that are
necessary for graphics rendering. Although GPUs were originally developed to process and display
computer graphics, they have also been used for processing general-purpose operations, leading to the
General-Purpose Computation on Graphics Hardware (GPGPU) paradigm.
There are several frameworks that adapt GPU programming to this paradigm. The best known ones
are OpenCL and NVIDIA's CUDA, presented in Section 2.6. Early approaches to computing on GPUs
cast computations into a graphics framework, allocating buffers (arrays) and writing shaders (kernel
functions).

GPUs provide massive parallel execution resources and high memory bandwidth. Among the most
popular GPU-accelerated application domains, we can mention:

• Higher Education and Supercomputing (numerical analytics, physics and weather and climate fore-
casting, for example);

• Defense and Intelligence applications (such as geospatial visualization);

• Computational Finance (financial analysis, etc.);

• Media and Entertainment (animation, modeling and rendering, color correction and grain manage-
ment, editing, review and stereo tools, encoding and digital distribution, etc.).

In Figure 2.1, we present the NVIDIA GK110 GPU architecture. The GeForce GTX 780 Ti GPU used in
our work is built on this architecture, which is part of the Kepler GPU family.
As shown in this figure, the GK110 architecture has several Graphics Processing
Clusters (GPC)2, organized in a scalable array. Each GPC contains several Streaming Multiprocessors
(SMXs), which perform the executions and run the CUDA kernels presented below in this document.
The design of the SMX has been evolving rapidly since the introduction of the first CUDA-capable hard-
ware in 2006, with four major revisions, codenamed Tesla, Fermi, Kepler and Maxwell [22]. Kepler’s new
Streaming Multiprocessor, called SMX, has significantly more CUDA cores than the SM of Fermi GPUs.
Each SMX contains thousands of registers that can be partitioned among the threads under execution,
several caches, warp schedulers (presented below in this document) that can quickly switch contexts
between threads and issue instructions to warps that are ready to execute, and execution cores for
integer and floating-point operations [1]. A GPU is connected to a host through a high-speed I/O bus slot
(a PCI-Express bus in current systems). The considered GPU model contains four GPCs, fifteen SMXs
and six 64-bit memory controllers [23].
In Figure 2.3, we present the architecture of a GPU based on CUDA, composed of a set of stream
multi-processors sharing a global memory.
In addition to the shared memory, each SMX is composed of [22]:

• thousands of registers that can be partitioned among threads of execution;

• several kind of memory caches (explained below);

• warp schedulers that can quickly switch contexts between threads and issue instructions to warps
that are ready to execute and

• Execution cores for integer and floating-point operations.

The memory system for current NVIDIA GPUs is more complex, as we now explain.

2 also known as Streaming Processors (SPs)


Figure 2.1: NVIDIA GK110 Kepler Architecture


Memory
NVIDIA GPUs include a complex memory system. The threads in a block (see Section 2.6) can be
organized as an array of 1, 2 or 3 dimensions, which favours a memory access pattern known as
coalesced access.
A coalesced memory access can be explained as follows: since all threads in a warp (explained below
in this chapter) execute the same instruction, when the threads of a warp execute a load instruction
the hardware detects whether they access consecutive memory locations. The most favorable global
memory access is achieved when the same instruction makes all threads in a warp access consecutive
global memory locations. In this case, the hardware coalesces all memory accesses into a consolidated
access to consecutive DRAM locations. If thread 0 accesses location n, thread 1 accesses location n + 1,
..., and thread 31 accesses location n + 31, then all these accesses are coalesced, that is, combined
into one single access.

Figure 2.2: Coalesced memory access
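
To make this pattern concrete, the following minimal CUDA C sketch (illustrative only; the kernel names
and the stride value are assumptions for the example, not code from this work) contrasts a coalesced
read with a strided, non-coalesced one.

// Consecutive threads read consecutive addresses: the accesses of a warp
// are combined into a few wide DRAM transactions (coalesced).
__global__ void read_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Consecutive threads read addresses 'stride' elements apart: each thread
// touches a different DRAM segment, so the warp needs many separate
// transactions (non-coalesced, much lower effective bandwidth).
__global__ void read_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(size_t)i * stride % n];
}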

A coalesced memory access occurs when the address locality and alignment meet certain criteria,
taking advantage of the wide data bus of the main memory, as presented in Figure 2.2. Specifically,
the types of memory present in a GK110 architecture NVIDIA GPU are [24]:

• Global memory: the largest one (typically greater than 1 GB), but with high latency, low bandwidth
(when compared with the other types), and is not cached. The effective bandwidth of global mem-
ory depends heavily on the memory access pattern (e.g. coalesced access generally improves
bandwidth).

• Local memory: readable and writable per-thread memory with very limited size (16 kB per thread)
and is not cached. Access to this memory is as expensive as access to global memory.

• Constant memory: read-only memory with limited size (typically 64 kB) and cached. The reading
cost scales with the number of different addresses read by all threads. Reading from constant
memory can be as fast as reading from a register.

• Texture memory: read-only memory that is mapped and allocated in global memory. This memory
can be used like a cache.

• Shared memory: fast on-chip memory of limited size (16 kB per block), readable and writable
on a per-block basis. This memory can only be accessed by the threads within a thread block and is
divided into equally-sized banks that can be accessed simultaneously by each thread. Accessing
this memory is as fast as accessing a register as long as there are no bank conflicts.

• Registers: readable and writable per-thread registers. These are the fastest memory to access,
but the amount of registers is limited.

Figure 2.3: GPU Memory organization

The transfer of data between the Host and the GPU is done using direct memory access (DMA), and
can operate concurrently with both the host and the GPU computation units.

As said before, there are some programming models used in the GPGPU context, such as the CUDA
and OpenCL models. Section 2.6 presents the CUDA programming model.


2.4 CPU vs GPU

With the observed increase of the computational demands imposed by the gaming market, the man-
ufacturers of GPUs had to propose powerful processing units, in order to allow gamers to run their
increasingly graphically demanding games. A direct consequence is that the GPU has become one of
the most powerful and cost-effective pieces of computer hardware. Consequently, GPUs are no longer
applied exclusively to displaying computer graphics. An increasing interest of researchers and developers
in the potential of GPUs for applications with large amounts of computation has arisen over the last
few years.

Today, CPUs in consumer devices have several cores in a chip, and each of them has some ALUs that
perform the arithmetic and logical operations (see Figure 2.4). In comparison, the GPUs have hundreds
or even thousands of cores, each one with four units: one floating point unit, a logic unit, a move or
compare unit and a branch unit. An advantage of GPUs is the ability to perform multiple simultaneous
operations, up to an order of magnitude of 103 , since there are hundreds of execution cores in a single
GPU.

Figure 2.4: CPU and GPU architectures [1].

According to Owens et al. [25], one of the major architectural differences between CPUs and GPUs
is the fact that CPUs are optimized to achieve high performance in sequential code, with some of the
processing stages dedicated to extracting instruction-level parallelism with techniques such as branch
prediction and out-of-order execution. On the other hand, GPUs, with their entirely parallel computing
nature, allow processing stages to be more focused on computation. This allows achieving a higher level
of arithmetic intensity with around the same number of transistors as CPUs.
Regarding execution performance, one of the metrics that has been used is floating-point operations per
second (FLOPS). As shown in Figure 2.5, during the last years GPUs have surpassed CPUs in this
measure of theoretical peak performance.

In order to compare a sequential and a parallel software implementation, the fundamental metric is
the speedup, whose expression is:

    Speedup = t_sequential / t_parallel        (2.1)


Figure 2.5: GPU vs CPU GFLOPS comparison [1]

It gives a ratio that indicates how much faster a parallelized system is when compared to a sequential
one. For example, an execution that takes 60 seconds sequentially and 10 seconds in parallel attains a
speedup of 6x.
When comparing with CPUs, some advantages and disadvantages can be identified with respect to
GPUs [25, 26]:

Advantages:

• Faster and Cheaper;

• Fully programmable processing units that support vectorized floating-point operations[27];

• Very flexible and powerful, with the introduction of new capabilities in modern GPUs, such as high-
level language support for programming the vertex and pixel pipelines. Other features are the
implementation of vertex texture access, the full branching support in the vertex pipeline, and the
limited branching capability in the fragment pipeline.

Disadvantages:

• Memory transfers between host and device can slow the whole application;

• Complex memory management, since there are several limitations regarding memory size (which
is limited) and the memory organization, which is hierarchical (see Section 2.3 for the CUDA
memory model);

• Only applications with highly parallelizable sections can benefit from the full GPU execution
power.


2.5 Hybrid Solution: Accelerated Processing Unit

Nowadays, new hybrid solutions are appearing in the market, such as the Accelerated Processing
Units (APUs) by Advanced Micro Devices (AMD). This new hardware is based on a single processor
chip that combines CPU and GPU elements into a unified architecture.

Examples of these APUs are the AMD Fusion [28], Kaveri, Athlon and Sempron series.
In this architecture, the x86 CPU cores and the programmable GPU cores share a common path to the
system memory. The key aspect to highlight is that the x86 CPU cores and the vector engines are
attached to the system memory through the same high-speed bus and memory controller. This feature
allows the AMD Fusion architecture to alleviate the fundamental PCIe constraint that traditionally has
limited performance on a discrete GPU. The Fusion architecture obviates the need for PCIe accesses
to and from the GPU, improving application performance [29]. However, the graphics cores that have
been placed on current APUs are not meant to be competitive with high-end or even mid-range discrete
graphics cards [30]. Recently, in November 2013, Sony introduced an AMD 1.6 GHz APU on the Playsta-
tion 4 console. This was the fastest APU produced by AMD when the console was presented. Despite
the Playstation APU being Sony property, AMD took some of its features and included them in their
consumer APUs, improving the processing power of the available APUs.

2.6 CUDA - Compute Unified Device Architecture

2.6.1 Definition and Architecture

The Compute Unified Device Architecture (CUDA) is a parallel-programming model and software en-
vironment, designed by NVIDIA in order to deliver all the performance of NVIDIA's GPU technology to
general-purpose GPU computing. It was first introduced in March 2007, and, since then, more than 100
million CUDA-enabled GPUs have been sold.

This programming model implements a MIMD parallel processing paradigm, since it divides the ex-
ecution flow between groups, with the result that every group is independent from the others. Inside each
group, an adapted form of SIMD parallelism is adopted, named single-instruction, multiple-thread (SIMT),
where many threads execute the same function. In CUDA, the GPU is denoted as the "device" and the
CPU is referred to as the "host". "Kernel" refers to the function that runs on the device. Using this
nomenclature, the host invokes kernel executions on the device.

Current NVIDIA graphics cards are composed of streaming multi-processors. In these, the kernel
function runs in parallel. This execution is done according to a special execution flow (explained in
Section 2.6.3). Figure 2.6 presents the Fermi architecture of NVIDIA’s graphic cards.


Figure 2.6: Fermi Architecture

On Kepler, each multiprocessor has 192 processing cores, while on Fermi each multiprocessor has
a group of 32 SPs. The high-end Kepler has 15 multiprocessors, for a total of 2880 cores (15 ∗ 192), and
the Fermi accelerators have 16 multiprocessors, for a total of 512 cores (32 ∗ 16). Another difference is
the shared memory size. On Kepler, each SMX has 64 KB of on-chip memory that can be configured as
48 KB of Shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache
just like the Fermi GPUs. The memory types available in NVIDIA's graphics cards are explained more
fully in Section 2.3.

Another important difference is related to the maximum number of active warps (groups of 32
threads that execute the kernel code at a time) that can exist in each multiprocessor. When one
warp stalls on a memory operation, the multiprocessor selects another ready warp and switches to that
one. This way, the cores can be productive as long as there is enough parallelism to keep them busy
[31]. Tesla supports up to 32 active warps on each multiprocessor, and Fermi supports up to 48.

In order to allow its use by a great number of developers, NVIDIA based its language on the C/C++
programming language and added some specific keywords in order to expose some special features of
CUDA. This new language is called CUDA C and its compiler is NVCC [24].

2.6.2 Programming Model

CUDA C is an extension of the C programming language with some reserved keywords. CUDA C extends
C by allowing the programmer to define C functions, called kernels, that, when called, are executed N
times in parallel by N different CUDA threads. The __global__ keyword declares a function as being
a kernel; it is executed on the device and can only be invoked by the host using a specific syntax
configuration, <<< ... >>>, as shown in Figure 2.7. Each thread that executes the kernel is given
a unique thread ID that is accessible within the kernel through the built-in threadIdx variable. These
kernel functions must be highly parallelized, in order to obtain maximum efficiency for the application [1].

The basic entities involved in the execution of the heterogeneous programming model are the host,
which is traditionally the CPU, and the devices, which are GPUs in this case.
The execution flow for a simple CUDA application can be:

1. Allocate device memory

2. Copy memory from host to device

3. Invoke Kernel

4. Copy memory from device to host

5. Free device memory
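
A minimal CUDA C sketch of this five-step flow is shown below; the vector-addition kernel, its launch
configuration and all identifiers are illustrative assumptions, not code from the implementation developed
in this thesis.

#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // unique index of this thread
    if (i < n)
        c[i] = a[i] + b[i];
}

void run_vec_add(const float *h_a, const float *h_b, float *h_c, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;

    cudaMalloc(&d_a, bytes);                              // 1. allocate device memory
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // 2. copy memory host -> device
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);       // 3. invoke kernel

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // 4. copy memory device -> host
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);          // 5. free device memory
}

The <<<blocks, threads>>> launch syntax and the built-in threadIdx and blockIdx variables used here
are the ones discussed in the following paragraphs.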

Figure 2.7 illustrates a kernel definition and a kernel invocation. The two values within the triple angle
brackets, 1 and N, represent respectively the dimension of the execution grid (the total number of blocks)
and the dimension of each block (the number of threads per block that will run the kernel). These
numbers are specified by the programmer, but are limited by the maximum number of blocks and threads
supported by the adopted GPU [1].

Figure 2.7: CUDA Kernel definition and invocation example [1]

The next section presents the execution model.

2.6.3 Execution Model

In CUDA, the execution flow is organized by a hierarchy that is represented in Figure 2.8. Threads
represent the fundamental flow of parallel execution and are executed by core processors [32].
A set of threads is called a thread block. Thread blocks are executed on multi-processors and do
not migrate over multi-processors. Several concurrent thread blocks can reside on one multi-processor.


Figure 2.8: Execution Flows Representation [1]

This number is delimited by multi-processor resources (shared memory and register file). Finally, a set
of thread blocks is called a grid. One kernel is launched as a grid.
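
As a small illustration of this hierarchy, the following sketch (illustrative only; the matrix dimensions and
names are assumptions) launches a two-dimensional grid of two-dimensional blocks and computes each
thread's global coordinates.

__global__ void touch_matrix(float *m, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column of this thread
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row of this thread
    if (col < width && row < height)
        m[row * width + col] += 1.0f;
}

void launch_touch_matrix(float *d_m, int width, int height)
{
    dim3 block(16, 16);                                  // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,          // blocks needed along x
              (height + block.y - 1) / block.y);         // blocks needed along y
    touch_matrix<<<grid, block>>>(d_m, width, height);   // one kernel launch = one grid
}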

2.6.4 Limitations

Despite the versatility offered by this architecture, GPUs have some limitations, particularly in mem-
ory management and allocation. Memory transfer time between the host and the device represents an
overhead that delays execution time, since data has to be transferred to the device before being pro-
cessed. Afterwards, the results of data processing need to be transferred from the device to the host.
This overhead can become larger, since limited system bus bandwidth and system bus contention can
increase the latency between the host and device components.

2.7 Open Computing Language

Open Computing Language (OpenCL) is an open standard that can be used not only for program-
ming NVIDIA GPUs, but also to program CPUs and GPU devices from different manufacturers,
providing a portable language for programming in the context of GPGPU.
As with the CUDA technology, the OpenCL language denotes as kernel the execution code block that
will run on the GPU. The difference between a CUDA kernel and an OpenCL kernel relates to the fact
that an OpenCL kernel is compiled at run-time, which increases the run time of this solution. In addition,
CUDA has the advantage of being developed by the same company that develops the hardware where
it runs, so better performance at execution time is expected.
This technology is used by the GPGPU community alongside the CUDA programming model.

3 Sequence Alignment in Bioinformatics
Contents
3.1 Alignment Scoring Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Optimal Alignment Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Needleman-Wunsch Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Smith-Waterman Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Heuristic Sub-Optimal Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 FASTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 BLAST - Basic Local Alignment Search Tool . . . . . . . . . . . . . . . . . . . . . 25
3.4 Parallel Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.1 CPU Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.2 GPU Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.3 Discussion on the Presented implementations . . . . . . . . . . . . . . . . . . . . 35


Sequence alignment is a fundamental procedure in Bioinformatics, specifically used for molecular
sequence analysis, which attempts to identify the maximally homologous subsequences among sets
of long sequences [14]. In the scope of this thesis, the processing of biological sequences consisting
of a single, continuous molecule of nucleic acid or protein was considered [33]. While DNA sequences
can be expressed by four symbols (corresponding to the four nucleotides A, C, T and G), the amino
acids in proteins can be expressed by 22 symbols: A, B, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S,
T, V, W, Y, Z.
When comparing sequences, one looks for patterns that diverged from a common ancestor by a process
of mutation and selection. According to Dewey et al. [34], the main objectives of sequence alignment
are to establish input data for phylogenetic analysis, to determine the evolutionary history of a set of
sequences, to discover a common motif3 in a set of sequences; to characterize the set of sequences
and also for building profiles for database sequence searching.

The considered mutational processes involved in the alignments are residue substitutions, residue
insertions, and residue deletions. Insertions and deletions are commonly referred to as gaps [35].
The basic idea in the aligning process of two sequences (of possibly different sizes) is to write one on
top of the other and break them into smaller pieces by inserting spaces in one or the other so that
identical sub-sequences are eventually aligned in a one-to-one correspondence. Naturally, spaces are
not inserted in both sequences in the same position. Figure 3.1 illustrates an alignment between the
sequences A="ACAAGACAGCGT" and B="AGAACAAGGCGT".

Figure 3.1: Pairwise Alignment Example

In order to understand all the steps involved in the algorithms that will be presented in Sections 3.2
and 3.3, we need to go through some of the concepts employed: the Scoring Model and the concept
of the Gap Penalties (Section 3.1). After explaining these concepts, this chapter will provide a brief
overview on optimal sequence alignment algorithms (Section 3.2) and heuristic sequence alignment al-
gorithms (Section 3.3).
Finally, by taking into account the parallel architectures presented in Chapter 2, we present some im-
plementations of sequence alignment using parallel architectures, based either on the CPU (Section 3.4.1)
or on the GPU (Section 3.4.2).

3.1 Alignment Scoring Model

Many sequence alignment algorithms are based on a scoring model, which classifies the several
matching and mismatching patterns according to predefined score values. The simplest approaches
assign a positive constant value to a match between both residues. Alternatively,
3 Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function.


instead of using fixed score values when there is a match in the alignment, biologists frequently use
scoring schemes that take into account physicochemical properties or evolutionary knowledge of the
sequences being aligned. This is common when protein sequences are compared. The most known
schemas are Point Accepted Mutation (PAM) and Blocks Substitution Matrix (BLOSUM) alphabet-weight
scoring schemes, which are usually implemented by a substitution matrix.
The BLOSUM matrices were developed by Henikoff & Henikoff, in 1992, to detect more distant rela-
tionships. In particular, BLOSUM50 and BLOSUM62 are being widely used for pairwise alignment and
database searching.

Substitution matrices allow for the possibility of giving a negative score for a mismatch, which is
sometimes called an approximate or partial match.
Just like the score values, the gap penalty can be represented by a constant value or by using one
of the following models. In these models, the gap open/start penalty (d) represents the cost of starting
a gap, while the gap-extension penalty (e) represents the cost of extending a gap by one more space.
The standard cost associated with a gap of length g is given either by a linear score given
by [35]:
γ(g) = −gd (3.1)

or by using the affine score:


γ(g) = −d − (g − 1)e (3.2)

The gap-extension penalty e is usually set to a value less than the gap-open penalty (d), allowing long
insertions and deletions to be penalized less than they would be by the linear gap cost. This is desirable
when gaps of a few residues are expected almost as frequently as gaps of a single residue [35].
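
The two gap models translate directly into code; the following small C sketch (the parameter values in
the final comment are only an illustrative assumption) returns the penalty of a gap of length g under
each model.

/* Linear gap model (Equation 3.1): every gap position costs d. */
int linear_gap_penalty(int g, int d)
{
    return -g * d;
}

/* Affine gap model (Equation 3.2): opening a gap costs d, each further extension costs e. */
int affine_gap_penalty(int g, int d, int e)
{
    return -d - (g - 1) * e;
}

/* For example, with the (illustrative) values d = 10 and e = 1, a gap of length 5
   costs -50 under the linear model but only -14 under the affine model.            */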

3.2 Optimal Alignment Algorithms

The optimal alignment of two DNA or protein sequences is the alignment that maximizes the sum of
pair-scores minus any penalty for the introduced gaps [35].

Optimal Alignment algorithms include:

• Global Alignment algorithms, which align every residue in both sequences. One example is the
Needleman-Wunsch algorithm, which we present in Section 3.2.1.

• Local Alignment algorithms, which only consider parts of the sequences and obtain the best sub-
sequence alignments, i.e., the identification of common molecular subsequences [14]. One example
is the Smith-Waterman algorithm, which we present in Section 3.2.2.


3.2.1 Needleman-Wunsch Algorithm

In 1970, Needleman & Wunsch [36] proposed the following algorithm. Given two molecular se-
quences, A = a1 a2 ...an and B = b1 b2 ...bm , the goal is to return an alignment matrix H which indicates
the optimal global-alignment score between both sequences.
In order to understand this algorithm, consider the following definitions:

• H(i, j) represents the similarity score of two sequences A and B, ending at position i and j;

• s(ai , bj ) is the score for each aligned pair of residues. This value can be defined by a constant
value, or can be obtained using scoring matrices like PAM or BLOSUM, for the protein sequences;

• Wk , Wl represent the gap penalties, according to the considered gap model.

Each matrix cell is filled with the maximum value that results from Equation 3.3.

H(i, j) = max { H(i−1, j−1) + s(ai, bj),  if ai and bj are similar symbols
                H(i−k, j) − Wk,           if ai is at the end of a deletion of length k        (3.3)
                H(i, j−l) − Wl,           if bj is at the end of a deletion of length l }

This equation is repeatedly applied in order to fill in the matrix with the H(i, j) values, by calculating
the value in the bottom right-hand corner of each square of four cells from one of the remaining three
values [36]. By definition, the value in the bottom-right cell of the entire matrix, H(n,m), corresponds to
the best score for an alignment between A and B. Figure 3.2 illustrates the algorithm with the alignment
between sequences A="AACGTT" and B="ATGTT". The obtained score was 13 and the best global
alignment is indicated by the green arrows in the figure.

Figure 3.2: Needleman-Wunsch alignment matrix example
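
A compact C sketch of this matrix fill is given below, assuming a linear gap model restricted to single-step
gaps (k = l = 1), the usual border initialization, and a user-supplied substitution function standing for
s(ai, bj); it returns only the final score H(n, m) and omits the trace-back. All names are illustrative.

#include <stdlib.h>

/* Global-alignment score (Needleman-Wunsch) with a linear gap cost 'gap'.
   score(a, b) stands for s(ai, bj), e.g. a substitution-matrix lookup.    */
int nw_score(const char *A, int n, const char *B, int m,
             int (*score)(char, char), int gap)
{
    int cols = m + 1;
    int *H = malloc((size_t)(n + 1) * cols * sizeof(int));

    for (int i = 0; i <= n; i++) H[i * cols] = -i * gap;   /* usual first-column initialization */
    for (int j = 0; j <= m; j++) H[j] = -j * gap;          /* usual first-row initialization    */

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int diag = H[(i - 1) * cols + (j - 1)] + score(A[i - 1], B[j - 1]);
            int up   = H[(i - 1) * cols + j] - gap;        /* gap inserted in B */
            int left = H[i * cols + (j - 1)] - gap;        /* gap inserted in A */
            int best = diag > up ? diag : up;
            H[i * cols + j] = best > left ? best : left;   /* Equation 3.3 with k = l = 1 */
        }

    int result = H[n * cols + m];                          /* best global score H(n, m) */
    free(H);
    return result;
}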

3.2.2 Smith-Waterman Algorithm

In 1981, Smith and Waterman [14] proposed a dynamic programming algorithm4 that computes the
similarity scores corresponding to the maximally homologous subsequences among sets of long se-
quences. Given two sequences A = a1 a2 ...an and B = b1 b2 ...bm , the goal of this algorithm is to return an
alignment matrix H which indicates the optimal local alignments between both sequences. For each cell,
this algorithm computes the similarity value between the current symbol of sequence A and the current
4 Dynamic programming is a programming method that solves problems by combining the solutions to their subproblems[37].


symbol of sequence B. This algorithm has some data dependencies, since each cell of the alignment
matrix depends on its left, upper and upper-left neighbors.
In this algorithm, we consider the same definitions of H(i, j), s(ai , bj ), Wk and Wl used in the Needleman-
Wunsch algorithm (Section 3.2.1).
Receiving the sequences A and B as input, this algorithm begins with the initialization of the first
column and the first row, which is given by:

H(k, 0) = H(0, l) = 0, for 0 ≤ k ≤ n and 0 ≤ l ≤ m        (3.4)

Then the algorithm computes the similarity score H(i, j) by using the following equation:

H(i, j) = max { H(i−1, j−1) + s(ai, bj),  if ai and bj are similar symbols
                H(i−k, j) − Wk,           if ai is at the end of a deletion of length k
                H(i, j−l) − Wl,           if bj is at the end of a deletion of length l        (3.5)
                0,                        otherwise }

The output for the algorithm is the optimal local alignment of sequence A and sequence B with max-
imum score. Unlike the Needleman-Wunsch algorithm, the Smith-Waterman algorithm always gives
matrix scores greater than or equal to 0.

In order to get all the optimal local alignments between sequences A and B, a trace-back algorithm
starts from the highest score in the whole matrix and ends at a score of 0.

Figure 3.3 presents the optimal local alignments between sequence A: WPCIWWPC and sequence
B: IIWPC. In this example, the BLOSUM50 scoring matrix is used in order to obtain the s(ai, bj) value.
The gap penalty is -5. The optimal local alignments between sequences A and B are represented by the
cells with a green background. These alignments occurred between the subsequences WPC of A and
WPC of B.

Figure 3.3: Smith-Waterman alignment matrix example
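
Compared with the Needleman-Wunsch sketch of Section 3.2.1, the local-alignment recurrence only
needs the borders and every negative cell clamped to zero (Equations 3.4 and 3.5) and the overall
maximum tracked; a minimal C sketch under the same assumptions (linear single-step gaps, user-supplied
s(ai, bj), illustrative names) is shown below.

#include <stdlib.h>

/* Best local-alignment score (Smith-Waterman) with a linear gap cost 'gap'. */
int sw_score(const char *A, int n, const char *B, int m,
             int (*score)(char, char), int gap)
{
    int cols = m + 1;
    int *H = calloc((size_t)(n + 1) * cols, sizeof(int));  /* Equation 3.4: borders start at 0 */
    int best = 0;

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int h    = H[(i - 1) * cols + (j - 1)] + score(A[i - 1], B[j - 1]);
            int up   = H[(i - 1) * cols + j] - gap;
            int left = H[i * cols + (j - 1)] - gap;
            if (up   > h) h = up;
            if (left > h) h = left;
            if (h < 0)    h = 0;              /* local alignment: scores never drop below zero */
            H[i * cols + j] = h;
            if (h > best) best = h;           /* the highest cell gives the optimal local score */
        }

    free(H);
    return best;
}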


3.3 Heuristic Sub-Optimal Algorithms

Although providing optimal solutions, the described algorithms are characterized by a quadratic com-
plexity, O(mn), where m is the size of sequence A and n the size of sequence B. This becomes evident on
large databases with a high number of residues. The current protein UniProt Swiss-Prot [12] database
contains hundreds of millions of residues; for a query sequence of length one thousand, approximately
10^11 matrix cells must be evaluated to search the complete database. At ten million matrix cells per
second, which is reasonable for a single workstation at the time this is being written, this would take
10,000 seconds, i.e., around three hours [35].
Heuristic algorithms address this issue at the expense of not guaranteeing to find the optimal solution.
Examples of these algorithms are FASTA and BLAST, presented in Section 3.3.1 and in Sec-
tion 3.3.2.

3.3.1 FASTA

The FASTA algorithm (also known as "fast A" which stands for "FAST-All") was presented by Pear-
son & Lipman in 1985 [38] and further improved in 1988 [39]. This algorithm finds local high-scoring
alignments with a multistep approach, starting from exact short word matches, passing through maximal-scoring
ungapped extensions, and finally identifying gapped alignments.
This algorithm can be described in four steps [35]:

• Step 1 (Figure 3.4): locate all identically matching words of length ktup (specifies the size of the
word) between the two sequences. For proteins, ktup is typically 1 or 2, for DNA it may be 4 or 6.
It then looks for diagonals with many mutually supporting word matches.

Figure 3.4: FASTA algorithm step 1.

• Step 2 (Figure 3.5): search for the best diagonals, extending the exact word matches to find
maximal scoring ungapped regions (and, in the process, possibly joining together several seed
matches).

• Step 3 (Figure 3.6): check if any of these ungapped regions can be joined by a gapped region,
allowing for gap costs.

• Step 4 (Figure 3.7): the highest scoring candidate matches in a database search are realigned
using the full dynamic programming algorithm, but restricted to a subregion of the dynamic pro-


Figure 3.5: FASTA algorithm step 2.

Figure 3.6: FASTA algorithm step 3.

gramming matrix forming a band around the candidate heuristic match. This step uses a standard
dynamic programming algorithm, such as Needleman-Wunsch or Smith-Waterman, to get the final
scores.

Figure 3.7: FASTA algorithm step 4.

There is a tradeoff between speed and sensitivity in the choice of the ktup parameter: higher values
of ktup are faster, but more likely to miss truly significant matches. To achieve sensitivities close to those of
the optimal algorithms for protein sequences, ktup needs to be set to 1.
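As an illustration of step 1, the following is a minimal C sketch of ktup word matching for DNA sequences, assuming ktup = 4 and a query of at most 4096 residues (both values are arbitrary choices for this sketch). It counts the word hits per diagonal, which is the information the subsequent steps use to select the best diagonals.

#include <string.h>

#define KTUP   4
#define NWORDS (1 << (2 * KTUP))      /* 4^KTUP possible DNA words */

static int code(char c) {             /* map A,C,G,T to 0..3 */
    switch (c) { case 'A': return 0; case 'C': return 1;
                 case 'G': return 2; default:  return 3; }
}

/* Counts ktup word hits per diagonal d = (i - j) + (m - 1), where i indexes
 * the database sequence and j the query. diag must hold n + m - 1 counters. */
void count_diagonal_hits(const char *query, int m,
                         const char *db, int n, int *diag) {
    static int first[NWORDS];         /* first query position of each word (+1) */
    static int next[4096];            /* chained positions; assumes m <= 4096   */
    memset(first, 0, sizeof(first));
    memset(diag, 0, (n + m - 1) * sizeof(int));

    for (int j = 0; j + KTUP <= m; j++) {        /* index all query words */
        int w = 0;
        for (int k = 0; k < KTUP; k++) w = (w << 2) | code(query[j + k]);
        next[j] = first[w]; first[w] = j + 1;
    }
    for (int i = 0; i + KTUP <= n; i++) {        /* scan the database sequence */
        int w = 0;
        for (int k = 0; k < KTUP; k++) w = (w << 2) | code(db[i + k]);
        for (int j = first[w]; j != 0; j = next[j - 1])
            diag[(i - (j - 1)) + (m - 1)]++;     /* one more hit on this diagonal */
    }
}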

3.3.2 BLAST - Basic Local Alignment Search Tool

The Basic Local Alignment Search Tool (BLAST) was presented by Altschul et al. in 1990 [40],
and finds regions of local similarity between sequences. The program compares nucleotide or protein
sequences to sequence databases and calculates the statistical significance of matches. BLAST can


be used to infer functional and evolutionary relationships between sequences as well as to help identify
members of gene families [9]. This algorithm is most effective with polypeptide5 sequences and uses a
scoring matrix (BLOSUM, PAM, etc.) to find the maximal segment pair (MSP) for two sequences, defined
as locally optimal if the score cannot be improved either by lengthening or shortening the segment pair.
This algorithm is the most widely used for the alignment of protein-coding sequences6 .
The BLAST algorithm steps are [40]:

1. Compile a list of high-scoring words

• Given a length parameter w and a threshold parameter T , find all the w-length substrings
(words) of the database sequences that align with words from the query with an alignment
score higher than T . This is called a hit in BLAST.

• Discard those words that score below T (these are assumed to carry too little information to
be useful starting seeds)

2. Scan the database for hits

• When T is high, the search will be rapid, but potentially informative matches will be missed.

3. Extend the hits

• Attempt to extend this match to see if it is part of a longer segment that scores above the
MSP score S

• Report only those hits that yield a score above S

From the score S it is also possible to calculate an expectation score E, which is an estimate of how
many local alignments of at least this score would be expected given the characteristics of the query
sequence and database.
The original BLAST did not permit gaps, so it would find relatively short regions of similarity, and it was
often necessary to extend the alignment manually or with a second alignment tool.
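As a rough illustration of the hit-extension step, the following C sketch extends a word hit to the right without gaps, stopping when the running score drops a fixed amount below the best score seen so far; the drop-off threshold and all names are assumptions of this sketch, not parameters taken from the description above. A symmetric leftward extension would complete the segment pair.

/* Extends a word hit at query position qi / database position di to the right,
 * keeping the best ungapped segment score. s() is the substitution score and
 * xdrop is how far the score may fall below the best before extension stops. */
int extend_right(const char *q, int qlen, int qi,
                 const char *d, int dlen, int di,
                 int (*s)(char, char), int xdrop) {
    int score = 0, best = 0;
    while (qi < qlen && di < dlen) {
        score += s(q[qi++], d[di++]);
        if (score > best) best = score;          /* new best segment end        */
        else if (best - score > xdrop) break;    /* score dropped too far: stop */
    }
    return best;
}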

3.4 Parallel Implementations

The Smith-Waterman algorithm is the best-known algorithm in this context and has been explored in many software
implementations, each improving the execution times and optimizing the parallelization method.
Concerning the parallelization method, the implementations presented in this section follow different
approaches to parallelism and can be grouped according to their level of paral-
lelism [2]:

• Coarse-Grained Parallelism: an example of this kind of parallelism is the master/worker model
adopted in our work, where a single processor, named master, sends work to the
workers. In parallel sequence alignment, the sequence database is split into n parts, and
5 short chains of amino acid monomers linked by peptide (amide) bonds.
6 http://cmns.umd.edu/


each worker node processes one of those parts. When a worker finishes, it sends
the results back to the master and gets more parts to process, until the processing of all parts is
finished. This parallel method is used in the implementation proposed in this work, presented
in Chapter 4.

• Fine-Grained Parallelism: examples of this parallelization methodology are the implementation presented by
Wozniak [41], which takes advantage of the Visual Instruction Set (VIS) of the SUN ULTRA SPARC
processors, and the Farrar and Rognes implementations [16, 42], which take advantage of the Stream-
ing SIMD Extensions (SSE) technologies available in most modern Intel processors. All these
implementations are presented in Sections 3.4.1.A, 3.4.1.B and 3.4.1.C.

• Intermediate-Grained Parallelism: this kind of implementation is the most explored nowadays,
with the growth of General-Purpose computing on Graphics Processing Units (GPGPU), taking advantage
of GPUs to parallelize the execution of the algorithm. Section 3.4.2.A presents Manavski’s
solution [43], one of the first to consider the CUDA framework for modern NVIDIA GPUs. The
CUDASW++ implementation, introduced by Liu et al. [3, 17], is presented in Section 3.4.2.B.

3.4.1 CPU Implementations

In this section, we survey the state of the art on CPU-based implementations of the Smith-Waterman
algorithm.

3.4.1.A Wozniak

One of the first parallel implementations of the Smith-Waterman algorithm was presented in 1997
by A. Wozniak [41], who proposed parallelizing the algorithm at the instruction level by exploiting
specialized video instructions, which are SIMD-like in their design. In particular, this implementation
uses the Visual Instruction Set (VIS) instructions found in the SUN ULTRA SPARC processors.
These VIS instructions can be used to compute four rows of the Smith-Waterman score matrix
in parallel, enabling data-level parallelization: they use special 64-bit registers, making it
possible to add two sets of four 16-bit integers and obtain four 16-bit results with a single instruction.
This implementation reaches over 18 million matrix cell updates per second on a single ULTRA SPARC
running at 167 MHz. The global performance scales with the number of processors used, reaching
200 million matrix cell updates per second with 12 processors.
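The anti-diagonal order exploited by Wozniak can be sketched in plain C as follows: all cells with the same index sum i + j depend only on the two previous anti-diagonals, so they are mutually independent and can be computed in parallel (for instance, with VIS or SSE vectors). This scalar sketch only illustrates the traversal order; the cell-update function and its signature are hypothetical.

/* Traverses the (n+1) x (m+1) matrix H by anti-diagonals d = i + j.
 * Cells of the same anti-diagonal are independent of each other, since each
 * one only depends on anti-diagonals d-1 and d-2, and can therefore be
 * computed in parallel. Row 0 and column 0 are assumed to be initialized. */
void fill_antidiagonal(int *H, int n, int m,
                       int (*cell)(int *H, int n, int m, int i, int j)) {
    for (int d = 2; d <= n + m; d++) {
        int i_lo = (d - m > 1) ? d - m : 1;
        int i_hi = (d - 1 < n) ? d - 1 : n;
        for (int i = i_lo; i <= i_hi; i++) {     /* independent cells: vectorizable */
            int j = d - i;
            H[i * (m + 1) + j] = cell(H, n, m, i, j);
        }
    }
}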

3.4.1.B Farrar

In order to optimize the performance of the original Smith-Waterman algorithm, Michael Farrar also
proposed in 2006 [42] a SIMD solution to parallelize the algorithm at the data level. This solution takes
advantage of three different optimizations. The first one is called query profile and was presented by


Rognes and Seeberg [44]. It avoids looking up the substitution score between the two residues for every cell of the Smith-
Waterman matrix by pre-computing a query profile, parallel to the query, for each possible residue. The
calculation of H(i, j) then requires just the addition of the pre-calculated score to the previous H value. The
query profile is stored in memory at 16-byte boundaries. By aligning the profile at a 16-byte boundary,
the values are read with a single aligned load instruction, which is faster than reading unaligned data.
Another optimization proposed by Farrar is the use of the SSE2 instructions, available on Intel proces-
sors. To maximize the number of cells calculated per instruction, the SIMD SSE2 registers are divided
into their smallest unit possible. The 128-bit wide registers are divided into 16 8-bit elements for pro-
cessing. One instruction can therefore operate on 16 cells in parallel. Dividing the register into 8-bit
elements limits the cell’s range to between 0 and 255. In most cases, the scores fit in the 8-bit range
unless the sequences are long and similar. If a query’s score exceeds the cell’s maximum, that query is
recalculated using a higher precision.
Finally, Farrar proposed the lazy-F evaluation. To avoid calculating every cell of the matrix,
this optimization makes the algorithm skip the F contribution while F remains at zero (thus not
contributing to the value of H). To avoid incorrect results, this optimization has a second-pass loop
that corrects all the matrix cells that were not properly calculated in the first pass. This second-pass loop is exe-
cuted until all elements of F are less than H − Ginit , Ginit being the gap open penalty. According to the
presented results, this algorithm achieves over 3 billion cell updates per second using a 2.0 GHz Xeon
Core 2 Duo processor [42].
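A minimal sketch of the query profile described above, assuming an alphabet of 24 residue codes and 8-bit scores stored with a bias (names and sizes are illustrative): for each possible database residue r, the profile stores the scores against every query position contiguously, so that the inner loop replaces the two-dimensional substitution-matrix lookup with a sequential read that can also be loaded 16 bytes at a time.

#include <stdlib.h>

#define ALPHA 24                       /* number of residue codes */

/* Builds a query profile: profile[r*m + j] = matrix[r][q[j]] + bias, stored as
 * unsigned bytes so that the whole row for residue r can be streamed
 * sequentially (and loaded 16 bytes at a time with aligned SSE loads). */
unsigned char *build_query_profile(const unsigned char *q, int m,
                                   const signed char matrix[ALPHA][ALPHA],
                                   int bias) {
    unsigned char *profile = malloc((size_t)ALPHA * m);
    for (int r = 0; r < ALPHA; r++)
        for (int j = 0; j < m; j++)
            profile[r * m + j] = (unsigned char)(matrix[r][q[j]] + bias);
    return profile;
}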

3.4.1.C SWIPE (Rognes)

Building on Farrar’s implementation, in 2011 Torbjørn Rognes proposed SWIPE, an effi-
cient parallel solution based on SIMD instructions [16], which allows running the Smith-Waterman search
more than six times faster. SWIPE performs rapid local alignment searches in amino acid or nucleotide
sequence databases.

SWIPE compares sixteen residues from sixteen different database sequences in parallel for the same
query residue. This operation is carried out using Intel SSE2 vectors consisting of sixteen independent
bytes (Figure 3.8).

Another important characteristic of this algorithm is the use of a compact code of ten instructions writ-
ten in assembly, which constitute the core of the inner loop of the computations. These ten instructions
are presented in Figure 3.9 and compute in parallel the values for each vector of 16 cells in independent
alignment matrices. The exact selection of instructions and their order is important; this part of the code
was therefore hand-coded in assembly to maximize performance. In this figure, H represents the main
score vector. The H vector is saved in the N vector for the next cell on the diagonal. E and F represent
the score vectors for alignments ending in a gap in the query and database sequence, respectively. P is
the vector of substitution scores for the database sequences versus the query residue q (see temporary


Figure 3.8: Multi-sequence vectors.

score profiles below). Q represents the vector of gap open plus gap extension penalty. R represents
the gap extension penalty vector. S represents the current best score vector. All vectors, except N are
initialized prior to this code.

Figure 3.9: Rognes’ algorithm core instructions.
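As an illustration of the kind of operations performed by this inner loop, the following is a simplified sketch of one cell update for 16 independent alignments, using SSE2 intrinsics on biased unsigned bytes. It is not Rognes’ hand-tuned assembly, only a hedged approximation of the operations it performs, with variable names mirroring the description above.

#include <emmintrin.h>

/* One cell update for 16 independent alignments (unsigned bytes with a bias).
 * h: diagonal score vector, p: substitution scores, e/f: gap-ending scores,
 * q: gap open + extend penalty, r: gap extend penalty, s: running best score. */
static inline void swipe_cell(__m128i *h, __m128i *e, __m128i *f,
                              __m128i p, __m128i q, __m128i r, __m128i *s) {
    __m128i x = _mm_adds_epu8(*h, p);         /* H(i-1,j-1) + score (saturated) */
    x = _mm_max_epu8(x, *e);                  /* best of match/mismatch and E   */
    x = _mm_max_epu8(x, *f);                  /* ... and F                      */
    *s = _mm_max_epu8(*s, x);                 /* track best local score         */
    *e = _mm_max_epu8(_mm_subs_epu8(*e, r),   /* extend an existing gap         */
                      _mm_subs_epu8(x, q));   /* or open a new one              */
    *f = _mm_max_epu8(_mm_subs_epu8(*f, r),
                      _mm_subs_epu8(x, q));
    *h = x;                                   /* becomes H for the next cell    */
}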

Using a 375-residue query sequence, SWIPE achieved 106 billion cell updates per second (106 GCUPS)
on a dual Intel Xeon X5650 six-core processor system, which is more than six times faster than software
based on Farrar’s approach (the previous fastest implementation).

3.4.1.D Pedro Monteiro’s Implementation

Extending the Rognes implementation, Pedro Monteiro proposed in his Master’s Thesis [2] an ex-
tension to the presented thread-level parallelization model, exploring fine-grained parallelization in an
inter-task SIMD solution. In this implementation, the database sequences are split into several
chunks, and each chunk is processed using the Rognes execution module, as presented in
Figure 3.10. This implementation explores both intra-task and inter-task parallelization.


Figure 3.10: Sequences Database in several chunks [2].

To support this implementation, Pedro Monteiro’s solution proposes a different basic processing ele-
ment, represented by a structure called message or processing block, which is presented in Figure 3.11.

Figure 3.11: Processing Block - Message [2].

The message presented in Figure 3.11 contains all the needed elements for one processing iteration
by one of the system workers.

To avoid an excessive number of processing blocks waiting to be processed by the system workers,
Pedro Monteiro’s solution also implements two First In, First Out (FIFO) lists, through which the master is
able to communicate asynchronously with all the workers and vice versa (Figure 3.12). To support the
inclusion of these two queues in the solution, Pedro Monteiro also introduced multiple synchronization barriers
at the distinct access moments.

Figure 3.12: Processing Block FIFOs [2].
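The exact layout of the processing block and of the FIFOs is the one given in Figures 3.11 and 3.12; purely as a hypothetical sketch of the underlying idea (a self-contained work message plus a mutex- and condition-protected FIFO between master and workers), one could write the following, where all fields and names are assumptions:

#include <pthread.h>

#define BLOCK_SEQS 16

typedef struct processing_block {            /* one unit of work (a "message") */
    int nseqs;                               /* sequences in this block        */
    const char *seq[BLOCK_SEQS];             /* pointers to the residues       */
    int seqlen[BLOCK_SEQS];
    struct processing_block *next;
} processing_block;

typedef struct {                             /* FIFO shared by master and workers */
    processing_block *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
} block_fifo;

void fifo_push(block_fifo *q, processing_block *b) {
    pthread_mutex_lock(&q->lock);
    b->next = NULL;
    if (q->tail) q->tail->next = b; else q->head = b;
    q->tail = b;
    pthread_cond_signal(&q->not_empty);      /* wake one waiting worker */
    pthread_mutex_unlock(&q->lock);
}

processing_block *fifo_pop(block_fifo *q) {  /* blocks until work is available */
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)
        pthread_cond_wait(&q->not_empty, &q->lock);
    processing_block *b = q->head;
    q->head = b->next;
    if (q->head == NULL) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return b;
}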

In this way, the solution obtained significant speedups on the Dell PowerEdge R810 processing
platform, attaining a performance of more than 71 GCUPS with 32 parallel
worker threads on a distributed-memory architecture, which is nearly 2.5 times faster than SWIPE
running on a different memory architecture [2].

3.4.2 GPU Implementations

We now present some of the GPU-based implementations of the Smith-Waterman algorithm found
in the literature.

3.4.2.A Manavski’s Implementation

In order to get a fast implementation of the Smith-Waterman algorithm on commodity GPU hardware
using CUDA, Manavski et al. [43] proposed what they refer to as "the first solution based
on commodity hardware that efficiently computes the exact Smith-Waterman algorithm". In this imple-
mentation, they used an optimization of the Smith-Waterman algorithm previously proposed by Rognes
and Seeberg [16]. This optimization consists in pre-computing the query profile parallel to the query
sequence for each possible residue, in order to avoid the lookup of s(ai , bj ) in the internal cycle of the
algorithm. Thus, the random accesses to the substitution matrix are replaced by sequential ones. In
their implementation, the query profile is stored in GPU texture memory space, since it is a low latency
memory.

The strategy that was adopted in this implementation consists of making each GPU thread compute
the whole alignment of the query sequence with one database sequence. Before that, the database is
ordered and stored in the global memory of the GPU, while the query-profile is saved into texture mem-
ory. Another optimization of this implementation is the inclusion of an initialization process, where the
number of available computational resources is automatically detected. This number will help achieve
dynamic load balancing. After this step, the database is divided into as many segments as the number
of stream-processors present in the GPU. Each stream-processor then computes the alignment of the


query with one database sequence.

To analyze the obtained performance, Manavski’s implementation was compared with three previ-
ous implementations. This performance was measured by running the application both on single and
on double GPU configurations. The first comparison that was carried out is with Liu’s implementation
of the Smith-Waterman algorithm based on OpenGL instructions. The obtained results show that this
implementation is 18 times faster than Liu’s [45]. The second comparison was made with BLAST and
SSEARCH algorithms [46, 47]. The obtained results show that this implementation is up to 30 times
faster than SSEARCH and up to 2.4 times faster than BLAST. Finally, the last test compares this implemen-
tation with Farrar’s implementation [42], showing a three-fold performance increase.

3.4.2.B CUDASW++

Just like the algorithm presented above, CUDASW++ is an optimized implementation of the Smith-
Waterman algorithm using CUDA. It was proposed by Liu et al. [3] and uses the computational power of
CUDA-enabled GPUs to accelerate Smith-Waterman algorithm sequence database searches.

Liu et al. presented two different approaches for the parallelization of the algorithm: inter-task paral-
lelization and intra-task parallelization. In inter-task parallelization, each task is assigned to exactly one
thread and dimBlock tasks are performed in parallel by different threads in a thread block. In Intra-task
parallelization, each task is assigned to one thread block and all dimBlock threads in the thread block
cooperate to perform the task in parallel, exploiting the parallel characteristics of cells in the minor diag-
onals.

In order to achieve the best performance, their implementation uses two stages. The first stage
exploits inter-task parallelization and the second stage exploits intra-task parallelization. The transition
between these stages is separated by a defined threshold; only when the query sequence length is
above that threshold are the alignments carried out in the second stage.
Besides this two-stage process, their implementation uses three techniques to improve the perfor-
mance: coalesced subject sequence arrangement, coalesced global memory access, and cell block
division method.

Coalesced subject sequence arrangement (Figure 3.13) - For inter-task parallelization, the sorted
subject sequences are arranged in an array where the symbols of each sequence are restricted to
be stored in the same column, from top to bottom, and all sequences are arranged in increasing
length order from left to right and top to bottom in the array. For intra-task parallelization, the
sorted subject sequences are stored sequentially in an array, row by row, from the top-left corner to
the bottom-right corner; all symbols of a sequence are restricted to be stored in the same row, from
left to right. The texture cache can be utilized in order to achieve maximum performance on coalesced
access patterns.


Figure 3.13: Coalesced Subject Sequence Arrangement [3].

Coalesced global memory access (Figure 3.14) - This technique explores memory organization pat-
terns in order to achieve the best performance. All threads in a half-warp should access the
intermediate results in a coalesced pattern; thus, the words accessed by all threads in a half-warp
must lie in the same memory segment. To achieve this, the intermediate results of all threads in a
half-warp are allocated in the form of an array, keeping them at contiguous memory addresses (see
the sketch after this list).

Figure 3.14: Coalesced Global Memory Access [3].

Cell block division method - This method consists of dividing the alignment matrix into cell blocks of
equal size for inter-task parallelization.
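As a generic illustration of this coalescing requirement (and not of the actual CUDASW++ code), the following CUDA sketch interleaves the per-thread intermediate values so that consecutive threads of a (half-)warp access consecutive words:

// Hypothetical CUDA sketch: each thread stores its k-th intermediate value at
// index k * span + tid, so that, for a fixed k, the threads of a (half-)warp
// touch consecutive 4-byte words and their accesses coalesce into a single
// memory transaction.
__global__ void coalesced_update(int *intermediate, int elems_per_thread) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int span = gridDim.x * blockDim.x;        // total number of threads
    for (int k = 0; k < elems_per_thread; k++) {
        int idx = k * span + tid;             // interleaved layout: coalesced
        intermediate[idx] += 1;               // dummy update of an intermediate result
    }
}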

When executing their implementation using a single-GPU version, CUDASW++ [3], achieves a per-
formance value of about 10 GCUPS on an NVIDIA GeForce GTX 280 graphics card. In a multi-GPU
version, it achieves a performance of up to 16 GCUPS on an NVIDIA GeForce GTX 295 graphics card,
which has two G200 GPU-chips on a single card.

Meanwhile, the same authors have proposed a new version of this implementation, CUDASW++ 2.0 [17].
In this new version, they proposed three different implementations: an optimized SIMT SW algorithm,
a basic vectorized SW algorithm and a partitioned vectorized SW algorithm.

Optimized SIMT SW algorithm - This implementation is an optimized version of CUDASW++ focused
on its first stage, with the introduction of two optimizations: a sequential query profile
and the utilization of a packed data format. The packed data format is used in the re-organization
of each subject sequence; four successive residues of each subject sequence are packed together
and represented using the uchar4 vector data type. When using the cell block division method, the
four residues loaded by one texture fetch are further stored in shared memory for the use of the
inner loop.

Basic Vectorized SW algorithm - This implementation is based on Michael Farrar’s striped SW imple-
mentation [42]. It directly maps Farrar’s implementation onto CUDA, based on the virtualized SIMD
vector programming model. As seen before, Farrar denotes as F values that part of the similarity
values for H(i, j) which derives from the same column: H(i−k, j) − Wk . The lazy-F loop technique avoids
these calculations of similarity scores when running the algorithm. This technique states that, for
most cells of the matrix H(i, j), the value of H(i−k, j) − Wk remains at zero and does not contribute
to the value of H; only when H is greater than Wk will F start to influence the value of H.
For the computation of each column of the alignment matrix, the striped SW algorithm consists
of two loops: an inner loop calculating local alignment scores postulating that F values do not
contribute to the corresponding H values, and a lazy-F loop correcting any errors introduced from
the calculations of the inner loop. This algorithm uses a striped query profile.

Partitioned Vectorized SW algorithm - In this implementation, the algorithm first divides a query se-
quence into a series of non-overlapping, consecutive small partitions, according to a pre-specified
partition length. Then, it aligns the query sequence to a subject sequence, partition by partition,
considering each one a new query sequence. Finally, it constructs a striped query profile for each
partition.

Concerning performance evaluation, just like in the first version of CUDASW++ implementation, Liu
et al. use two different approaches: a single GPU implementation (NVIDIA Geforce GTX 280) and a
multi-GPU implementation (Geforce GTX 295).
The optimized SIMT SW algorithm achieves an average performance of 16.5 GCUPS on Geforce
GTX 280. The same algorithm, when running on GTX 295, achieves an average performance of 27.2
GCUPS. The partitioned vectorized algorithm achieved an average performance of 15.3 GCUPS using
a gap penalty of 10-2 k (gap open penalty of 10 and gap extension penalty of 2); an aver-
age performance of 16.3 GCUPS using a gap penalty of 20-2 k; and an average performance of 16.8
GCUPS using a gap penalty of 40-3 k on GTX 280. The same partitioned vectorized algorithm, when
running on GTX 295, achieved an average performance of 22.9 GCUPS using a gap penalty of 10-2 k;
an average performance of 24.8 GCUPS using a gap penalty of 20-2 k; and an average performance of
26.2 GCUPS using a gap penalty of 40-3 k.
When comparing this algorithm with the first CUDASW++ implementation, the optimized SIMT algorithm
runs 1.74 times faster on the GTX 280 and 1.72 times faster on the GTX 295. The partitioned vectorized algorithm
runs between about 1.58 and 1.77 times faster on the GTX 280 and between about 1.45 and 1.66 times faster on the GTX
295.
In 2013, Liu et al. [4] presented the third version of this algorithm, CUDASW++ 3.0. This implementation


couples CPU and GPU SIMD instructions and carries out concurrent CPU and GPU computations. For
the CPU computation, this algorithm employs SSE-based vector execution units as accelerators. Be-
sides the inclusion of CPU implementation, this version has investigated for the first time a GPU SIMD
parallelization, based on the CUDA PTX SIMD video instructions to gain more data parallelism beyond
the SIMT execution model. Moreover, sequence alignment workloads are automatically distributed over
CPUs and GPUs based on their respective computing capabilities. The GPU implementations were
specified for GPUs based on the Kepler architecture7 . In order to balance the runtimes of CPU and
GPU computations, they have dynamically distributed all sequence alignment workloads over CPUs and
GPUs, as per their compute power. For the computation on CPUs, Liu et al. [4] have employed the
streaming SIMD extensions (SSE) based vector execution units and multithreading to speed up the SW
algorithm. The program workflow is presented in Figure 3.15.

Figure 3.15: Program workflow of CUDASW++ 3.0 [4].

Evaluation on the Swiss-Prot database shows that CUDASW++ 3.0 gains a performance improve-
ment over CUDASW++ 2.0 of up to 2.9 and 3.2 times, with a maximum performance of 119.0 and 185.6 GCUPS,
on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively.
In addition, Liu et al.’s CUDASW++ 3.0 algorithm [4] has demonstrated good speedups over other top-
performing tools: SWIPE and BLAST+.

3.4.3 Discussion on the Presented implementations

The Smith-Waterman algorithm is one of the most used bioinformatics algorithms. As a
result, many solutions based on it have been proposed in recent years. In the previous sections, we
presented the main parallel implementations of this algorithm.

Considering task parallelization, we observe two types of implementation:

• intra-task parallelization considers the parallelization within a single alignment, breaking the se-
quences into multiple parts;

• inter-task parallelization considers the parallelization where multiple database or query sequences
are processed simultaneously, considering a single query sequence and breaking the database
into several sequences, parallelizing at the sequence level.

7 http://www.nvidia.com/object/nvidia-kepler.html


Concerning the presented CPU implementations, Wozniak’s implementation [41], exploring instruction-
level parallelism, achieved a significant performance improvement over the original Smith-Waterman algo-
rithm implementations. In addition, Wozniak proposed a different processing approach, named anti-
diagonal (the matrix is processed along its diagonals), following the intra-task parallelization method. This algorithm achieves
18 million cell updates per second on a single processor.

This optimization was then explored and surpassed by Michael Farrar [42], exploiting the Intel SSE in-
structions present on modern Intel processors. This implementation considers a striped pattern in the
query sequence access and achieved a performance of over 3 billion cell updates per second (3 GCUPS),
reaching a speedup of approximately 8 times over the previous SIMD implementations.

Finally, Rognes’ solution [16] explores not only instruction-level parallelism, with the usage of
Intel’s Streaming SIMD Extensions (SSE) on ordinary CPUs, but also data parallelism, implement-
ing the master/worker model. This implementation can use an inter-task approach, considering the
execution of one query alignment against one database sequence, but also an intra-task
approach, splitting the several database sequences between the several workers configured in the
environment. This model was implemented on Intel processors with the SSE3 instruction set extension,
such as the Intel Core i7. SWIPE achieved performances of over 9 GCUPS for a single thread and up
to 106 GCUPS for 24 parallel threads.

In the GPU context, the two solutions presented in Section 3.4.2 use NVIDIA’s CUDA. Manavski’s [43]
implementation implements the query profile, presented by Farrar [42], in order to avoid the lookup step
of the internal cycle that calculates the s(ai , bj ), pre-computing the query profile parallel to the query
sequence. This optimization removes the random accesses to the score matrix replacing them by se-
quential accesses to the query profile. The strategy adopted by this implementation makes each GPU
thread compute the whole alignment of the query sequence with one database sequence, in an inter-task
parallelization approach. Another optimization was pre-ordering the database sequences. This algo-
rithm achieved speeds of more than 3.5 GCUPS, less than Rognes [16], but faster than any other
previous attempt available on commodity hardware [43]. On the other hand, Liu’s CUDASW++ [17]
application also considers the query profile presented by Farrar [42], and proposes three different op-
timizations. The first optimization (Optimized SIMT SW algorithm) follows an intra-task
parallelization model, using a packed data format in the re-organization of each subject sequence. In
it, four successive residues of each subject sequence are packed together and represented using the
uchar4 vector data type. When using the cell block division method, the four residues loaded by one
texture fetch are further stored in shared memory for the use of the inner loop. The second one (Basic
Vectorized SW algorithm) is based on Michael Farrar’s striped SW implementation [42]. It directly maps
Farrar’s implementation onto CUDA, based on the virtualized SIMD vector programming model. Far-
rar denotes as F values that part of the similarity values for H(i, j) which derives from the same column:
H(i−k, j) − Wk . The lazy-F loop is a technique used by Farrar to avoid these calculations of similarity
scores when running this algorithm. This technique states that, for most cells of the matrix H(i, j), the
value of H(i−k, j) − Wk remains at zero and does not contribute to the value of H; only when H is greater than Wk
will F start to influence the value of H. Finally, the third optimization is the partitioned vectorized SW
algorithm, where the algorithm first divides a query sequence into a series of non-overlapping, consecu-
tive small partitions, according to a pre-specified partition length. It then aligns the query sequence to a
subject sequence partition by partition, considering each partition a new query sequence, and constructs
a striped query profile for each partition.
Considering performance, CUDASW++2.0, the optimized SIMT SW algorithm achieves an average per-
formance of 16.5 GCUPS on Geforce GTX 280. The same algorithm, when running on GTX 295,
achieves an average performance of 27.2 GCUPS. The partitioned vectorized algorithm achieved an
average performance of 15.3 GCUPS using a gap penalty of 10-2 k; an average performance of 16.3
GCUPS using a gap penalty of 20-2 k; and an average performance of 16.8 GCUPS using a gap penalty
of 40-3 k on GTX 280. The same partitioned vectorized algorithm, when running on GTX 295, achieved
an average performance of 22.9 GCUPS using a gap penalty of 10-2 k; an average performance of 24.8
GCUPS using a gap penalty of 20-2 k; and an average performance of 26.2 GCUPS using a gap penalty
of 40-3 k. The third implementation, CUDASW++ 3.0, gains a performance improvement over CUDASW++
2.0 of up to 2.9 and 3.2 times, with a maximum performance of 119.0 and 185.6 GCUPS on a single-GPU
GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively, and in addition shows
significant speedups over other top-performing tools: SWIPE and BLAST+.

Considering the presented values and all these implementations, this work implements an ap-
plication that combines Rognes’ implementation with Liu’s CUDASW++ 2.0 implementation, in or-
der to develop a master/worker model that speeds up the execution of the
Smith-Waterman algorithm. This application aligns one query sequence against a database
file with thousands of sequences, applying dynamic load balancing when obtaining the chunks of work,
so as to make the most efficient use of each of the solution’s workers.

4. Heterogeneous Parallel Alignment MultiSW

Contents
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2.1 CPU Worker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.2 GPU Worker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Application Execution Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 Implementation Details and Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4.1 Database File Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4.2 Database Sequences Pre-Loading . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4.3 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.5 Dynamic Load-balancing Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49


4.1 Introduction

In Section 3.4, a set of parallel implementations of the Smith-Waterman algorithm proposed in recent
years was presented. Considering two of those solutions, CUDASW++ 2.0
by Liu et al. [17] (Section 3.4.2.B) and Pedro Monteiro’s SWIPE extension [2] (Section 3.4.1.D), our work
proposes an efficient parallel implementation of the Smith-Waterman algorithm, named Mul-
tiSW.
This implementation consists of the orchestration of both applications’ execution modules in a single
solution, exploiting multiple CPU cores and the NVIDIA GPUs that may be available on the
running machine, in a heterogeneous approach, as presented in Figure 4.1. Each of these
modules is called a worker, so we have the CPU workers (Section 4.2.1) and the GPU workers (Sec-
tion 4.2.2). The MultiSW application includes a load-balancing abstraction layer, in order to efficiently
split the database sequences during the execution; this layer is explained in Section 4.5. Another im-
plemented optimization is a wrapper8 function for the CPU worker execution (Section 4.2.1.A),
proposed in order to improve the CPU worker’s execution
time. Besides these improvements on the CPU side, several optimizations were also proposed for the
GPU worker (Section 4.2.2).
During the execution, the proposed MultiSW application receives multiple arguments from the
prompt, specifying the running parameters. It then prepares all the execution structures (presented
in Section 4.4.3) and coordinates the execution of all the available work among the available workers
(specified at invocation time). This coordination process is referred to as the Orchestration process.

Figure 4.1: Heterogeneous Architecture

This way, multiple parallelization techniques are considered in a single software solution, in a medium-
grained parallelization approach where multiple database sequences are processed simultaneously, as
will be explained in Section 4.2.
In this kind of application, the main objective is to process all data in the minimum execution time,
leading to the maximum execution speedup (a concept explained below in Section 5.2). Considering both

8 A wrapper function is a subroutine in a software library or a computer program whose main purpose is to call a second

subroutine or a system call with little or no additional computation.


execution workers, the execution time is directly related to the amount of data (database
sequences) processed in each iteration. Due to the base implementations considered in the implementation of
MultiSW, it was necessary to create several auxiliary processing structures (see Section 4.4.3). Sec-
tion 4.2 presents the architecture of this solution and the adaptation of the existing solutions that enables it.
To improve MultiSW, Section 4.5 presents a model that changes the block size across run-time iterations,
in order to minimize the application’s total execution time.

4.2 Architecture

The solution’s architecture is presented in Figure 4.2. The orchestration can be considered the
application’s core: it invokes the CPU and GPU implementations to execute work
that consists of processing alignments between database sequences and the query sequence. Both
workers are adapted from the considered applications (Pedro Monteiro’s solution, Section 3.4.1.D, and CUDASW++ 2.0) to
this thesis’ solution. This adaptation is explained below.


Figure 4.2: MultiSW block diagram.

In order to adapt both solutions to this work, the considered model was the master/worker model originally
proposed by Pedro Monteiro in his SWIPE extension [2]. A possible representation of this model’s execution
is shown in Figure 4.3. The splitting of the database into multiple chunks represents the inter-task parallelization
model introduced by Pedro Monteiro in his solution.

During the execution, all running workers, both the CPU and the GPU ones, repeatedly obtain new work
to process by invoking the function get_fasta_sequences(). This function loads
the next database sequences to process from the database file specified at the application’s
run time. The access to this function is protected by a pthread_mutex_t, to ensure that only one worker
can obtain sequences at a time. The worker then gets the respective processing block from
the profile_seqs structure. GPU workers use the processing block structure presented in Section 4.4.3.
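The following is a minimal sketch of how a worker obtains its next block under this scheme. Only get_fasta_sequences() and the pthread_mutex_t protection come from the description above; the prototype of get_fasta_sequences() and the wrapper itself are assumptions of this sketch.

#include <pthread.h>

/* Assumed prototype: loads up to max_seqs sequences, returns how many were
 * loaded and writes the index of the first one (the real signature differs). */
extern int get_fasta_sequences(int max_seqs, int *first_seq);

/* Mutex that serializes access to the shared database cursor. */
static pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical wrapper: only one worker advances the database position at a time. */
int get_work(int max_seqs, int *first_seq) {
    pthread_mutex_lock(&db_lock);
    int n = get_fasta_sequences(max_seqs, first_seq);
    pthread_mutex_unlock(&db_lock);
    return n;    /* 0 means the whole database has been handed out */
}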


Figure 4.3: Master Worker Model [2]

Sections 4.2.1 and 4.2.2 present the adaptations of each existing application, so that their original code
can be used within the orchestration implementation. A CPU wrapper function
is also presented, used to minimize the number of multi-threaded accesses to the global shared variables that need synchroniza-
tion amongst all execution threads.

4.2.1 CPU Worker

The CPU worker of our work consists of the adaptation of Pedro Monteiro’s solution [2], trans-
forming the master of the original master/worker model into one of our workers, since in the original
implementation the master thread controls all the execution and creates new process-
ing work for the workers. In the original implementation, the master thread creates processing blocks of
16 sequences, blocking the other workers’ access to the get_fasta_sequences() function (explained above)
in every execution iteration. This represents an efficiency problem in the final solution, because
of the resulting small work granularity over the database sequences, so a CPU wrapper function was developed
in our work (Section 4.2.1.A) to avoid this problem.
The architecture of the original solution was not changed, and the application still works as repre-
sented in Figure 4.4.

Figure 4.4: Master Worker Model [2]

So, in our implementation, the worker itself creates the processing blocks to be processed. It gets the
database sequences from the CPU wrapper function and then creates the 16-sequence
blocks to be inserted in the processing queue. Besides the CPU wrapper implementation, some of
the initial functions were adapted, since the original database file format considered was the BLAST sequence
format [48], while our implementation works with the FASTA [49] database file format. This demanded that
the initialization functions were changed in order to support this different file format.

4.2.1.A CPU Wrapper

When a worker gets a new execution block, it is necessary to guarantee that the method used to
obtain the database sequences does not block the access of the other execution workers. In the
CPU implementation, this is enforced with a mutex that serializes all concurrent accesses to
these shared variables. Pedro Monteiro’s implementation [2] considers executable blocks of only 16
sequences, and the getwork() method (explained above) that fetches those sequences was blocking the access
of the other workers while obtaining them. Therefore, to avoid the CPU worker fetching only 16
sequences at a time and repeatedly blocking the other workers, this work introduces a CPU wrapper
function that fetches a much larger block (the default value is 30000 sequences), so that the other workers do not
have to wait for many small accesses. After that, the CPU worker creates processing blocks from the block obtained by this
wrapper, taking 16 sequences from it at a time (as shown in Figure 4.5).


Figure 4.5: CPU Wrapper Function.

4.2.2 GPU Worker

The GPU module considers several GPU workers, each one assigned to a physical NVIDIA GPU
device. The number of running GPUs is specified in the prompt at run time. The application creates a
CPU pthread for each one of the considered GPUs. This thread runs a function named gpu_worker()
and this function gets the execution database sequences from the get_fasta_sequences() function, run-
ning all the preparation and execution flows from the original CUDASW++ implementation [17]. Liu et
al.’s solution works with FASTA sequences, so it was not necessary to change the sequence-preparation
functions.
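A hedged sketch of how one pthread per GPU can be launched, with each thread binding to its device before entering the work loop, is shown below. The gpu_worker() name comes from the description above, but the argument structure and the loop body are assumptions.

#include <pthread.h>
#include <cuda_runtime.h>

typedef struct { int device_id; } gpu_args;   /* hypothetical per-thread argument */

static void *gpu_worker(void *arg) {
    gpu_args *a = (gpu_args *)arg;
    cudaSetDevice(a->device_id);              /* bind this host thread to one GPU */
    /* ... loop: obtain a block of database sequences, copy it to the device,
           run the CUDASW++ kernels, register the execution time ... */
    return NULL;
}

void start_gpu_workers(int num_gpus, pthread_t *threads, gpu_args *args) {
    for (int g = 0; g < num_gpus; g++) {
        args[g].device_id = g;
        pthread_create(&threads[g], NULL, gpu_worker, &args[g]);
    }
}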
To minimize the application execution time, some optimizations that reduce the execution time for each
iteration of the worker are presented. A CUDA Stream is "a sequence of operations that execute on
the device in the order in which they are issued by the host code. While operations within a stream are
guaranteed to execute in the prescribed order, operations in different streams can be interleaved and,


when possible, they can even run concurrently" [50]. Using CUDA streams, the memory transfers
between the host and the device can be made asynchronous (Section 4.2.2.A). With streams it is also
possible to overlap the execution of different kernels (Section 4.2.2.B). Besides these, the loading of the next
sequences to process is done in parallel with the execution of kernels on the device side (Section 4.2.2.C).

4.2.2.A Asynchronous Transfers

By creating CUDA streams, assigning them to data transfers, and changing memory copies to their
asynchronous variants (appending Async to the name of the transfer instruction), the data
transfer call no longer blocks the host execution, as shown in the following code:

//...
cudaStreamCreate(&mystream1);
// For the copy to be truly asynchronous, hostArray must be page-locked
// (allocated with cudaMallocHost/cudaHostAlloc); the call returns immediately.
cudaMemcpyAsync(deviceArray, hostArray, size, cudaMemcpyHostToDevice, mystream1);
kernel<<<gridDim, blockDim>>>(otherDataArray); // works on independent data
//...

4.2.2.B CUDA Streams in Kernel Execution

It is possible to execute two different kernel functions at the same time, if the data processed by each
one is different and independent:

//...
cudaStreamCreate(&mystream1);
cudaStreamCreate(&mystream2);
// Launches issued into different streams may overlap on the device
// (the third launch parameter is the dynamic shared memory size, here 0).
kernel<<<1, N, 0, mystream1>>>(DataArray);
kernel<<<1, N, 0, mystream2>>>(differentDataArray);
//...

4.2.2.C Loading Sequences with Execution

It is also possible to execute host code during the kernel execution on the device. This way, the
three operations are executed in parallel:

//...
cudaStreamCreate(&mystream1);
cudaStreamCreate(&mystream2);
kernel<<<1, N, 0, mystream1>>>(DataArray);      // asynchronous launch
get_fasta_sequences();                          // host function runs meanwhile
kernel<<<1, N, 0, mystream2>>>(differentDataArray);
//...


4.3 Application Execution Flow

The application execution flow is presented in Figure 4.6. At the beginning of the application, the
workers are started (the number of CPU cores and GPUs chosen for the execution is specified in the
prompt, using the t parameter for the CPU threads and the g parameter for the GPUs). Considering the
CPU wrapper as a single large CPU worker, this implementation only considers one CPU worker, since it is the
CPU module master that accesses the CPU wrapper function and gets the work for all the CPU module
sub-workers. At the same time, the GPU workers also start to get blocks of database sequences to process.
At the end of the execution, after all database sequences have been processed, the CPU module workers and
GPU workers are terminated with the pthread_exit function and control returns to the main function, which
shows the best alignment scores and finishes the application execution.

Figure 4.6: Execution Sequence Diagram.

4.4 Implementation Details and Optimizations

4.4.1 Database File Format

As for the database file format, Pedro Monteiro’s implementation [2] only considers the BLAST se-
quence format, while CUDASW++ only considers the FASTA format. All the functions used from Pedro
Monteiro’s implementation were adapted to use the FASTA database sequence file format.


4.4.2 Database Sequences Pre-Loading

The BLAST file format indexes all the sequences present in the file and can be used to efficiently
obtain the desired sequences at run time. On the other hand, the FASTA file format is not
indexed, which makes it difficult to locate specific sequences during the execution. The solution was to
create the profile_seqs structure and pre-load all the sequences of the file into it. This way, it
becomes possible to index the desired sequence at run time.
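A minimal sketch of this pre-loading step, assuming a simplified in-memory layout (the actual profile_seqs structure in the code base differs), reads the FASTA file once and stores each header together with the concatenated residue lines, so that sequences can later be indexed directly:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {            /* simplified stand-in for one profile_seqs entry */
    char *header;           /* text after '>'                                 */
    char *residues;         /* concatenated sequence lines                    */
    size_t len;
} fasta_seq;

/* Loads every sequence of a FASTA file into memory; returns the count. */
size_t preload_fasta(const char *path, fasta_seq **out) {
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    size_t count = 0, cap = 1024;
    fasta_seq *seqs = calloc(cap, sizeof(fasta_seq));
    char line[4096];
    while (fgets(line, sizeof(line), f)) {
        line[strcspn(line, "\r\n")] = '\0';
        if (line[0] == '>') {                      /* new sequence starts */
            if (count == cap) { cap *= 2; seqs = realloc(seqs, cap * sizeof(fasta_seq)); }
            seqs[count].header = strdup(line + 1);
            seqs[count].residues = strdup("");
            seqs[count].len = 0;
            count++;
        } else if (count > 0) {                    /* append a residue line */
            size_t add = strlen(line);
            seqs[count - 1].residues = realloc(seqs[count - 1].residues,
                                               seqs[count - 1].len + add + 1);
            memcpy(seqs[count - 1].residues + seqs[count - 1].len, line, add + 1);
            seqs[count - 1].len += add;
        }
    }
    fclose(f);
    *out = seqs;
    return count;
}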

4.4.3 Data Structures

In order to organize the code and make it possible to separate the blocks into several files with distinct
purposes, several structures were created in this implementation to keep the running arguments of the appli-
cation.
The first structure is named execution_params and contains all the execution parameters of the appli-
cation:

typedef struct execution_parameters {

char *progname; // application name


char *matrixname; // score matrix name used for the scoring model
char *databasename; // name of the database used
char *queryname; // query sequence name
long maxmatches; // maximum show results
long minscore; // minimum show score
long threads; // number of cpu cores
long blocksize; // blocksize for the cpu worker
long p_blocksize; // size of the profile blocksize
long workerIter; // contains number of worker iterations
long nodes; // execution nodes in cpu
long gpu_enable; // number of gpus enabled
long gpu_max_seqs; // blocksize for gpu execution

long gapopen; // gap open value


long gapextend; // gap extended value

BYTE gap_open_penalty; // gap open penalty


BYTE gap_extend_penalty; // gap extended penalty

} execution_params;

Another used structure is query_seq_parameters. This one contains the query sequence parame-
ters:

typedef struct query_seq_parameters {

int qlen; // query length


int qlen_aligned; // query length aligned by 8


char* filename; // query filename


char *description; // query description read of the file

BYTE *query_sequence; // query sequence residues


BYTE *query_sequence_padded; // query sequence residues padded by 8

} query_seq_params;

Another one is the score_matrixes structure. This structure keeps the matrixes and the score limit
for the SWIPE execution modes. These values are read by the score matrix file indicated in the prompt
when executing the application:

typedef struct score_matrixes {

long SCORELIMIT_7; // score limit for worker7 execution


long SCORELIMIT_16; // score limit for worker16 execution
long SCORELIMIT_63; // score limit for worker64 execution

char *score_matrix_7; // score matrix for worker7


short *score_matrix_16; // score matrix for worker7
long *score_matrix_63; // score matrix for worker63

} score_matrixes;

Another structure created is profile_params that keeps the information about the database profile
parameters:

typedef struct profile_params {

// Genbank NCBI Format


int fd_psq; // file descriptor for BLAST psq file
int fd_phr; // file descriptior for BLAST phr file

// Fasta Format
int using_fasta; // flag to indicate if file is in FASTA file format
int fd_fasta; // file descriptor for FASTA file format
FILE* fasta_file; // file pointer
int fasta_pos; // next considered sequence

UINT32 *adr_pin; // address pin variable for BLAST mode execution

off_t len_psq; // offset of psq BLAST file


off_t len_phr; // offset of phr BLAST file
off_t len_pin; // offset of pin BLAST file

char *dbname; // database name obtained from the file


char *dbdate; // database date obtained from the file

unsigned seqcount; // total number of sequences presented in database file


unsigned longest; // longest sequence of the database file


unsigned long totalaa; // total number of amino acids for all the sequences in the database file

unsigned long phroffset; // offset of phr BLAST file


unsigned long psqoffset; // offset of phr BLAST file

} profile_params;

To enable the use of these structures, all implementation functions were changed to return the struc-
tures back to the application’s main function. The main application then sends the initialized execution
structures to both kinds of workers: CPU and GPU.

4.5 Dynamic Load-balancing Layer

"Load balancing is dividing the amount of work that a computer has to do between two or more
computers so that more work gets done in the same amount of time and, in general, all users get
served faster"9 . In our implementation, the processing data units are the database sequences
that need to be aligned against the query sequence. The execution time of each worker iteration is
directly affected by the size of the processed block. To make the implementation more efficient and to
balance the execution times of all workers, this implementation also includes a load-balancing
module that dynamically adjusts the block size of the work obtained by each worker.
The load-balancing layer dynamically adjusts the block size of each worker (the concept of block size
is presented above). In this implementation, all workers were considered equal. The only difference
is that the default CPU worker block size is 30000 sequences and the default GPU block size is 65000 sequences. These
sizes were defined taking into account the average execution time of each processing module.
Imagine the scenario presented in Figure 4.7. In this case, worker A spends almost twice as much
time as worker B to process its block. If the application were to finish its execution when
worker B finishes its iteration, worker A would not yet have processed all of its information. This way, the solution
does not take full advantage of each worker.


Figure 4.7: Workers execution not balanced.

In order to minimize this inefficiency, in each iteration the proposed load-balancing layer adjusts the
block size of each worker, so that the execution times become as close as possible. In the example above,
this means adjusting the block size in order to reduce the execution time of worker A.
In the developed model, the following variables were considered:

9 http://searchnetworking.techtarget.com/definition/load-balancing



Figure 4.8: Workers execution balanced.

• blocksize(w, i) - Represents the block size computed by worker w in iteration i;

• Texecution (w, i) - Represents the execution time of worker w in iteration i;

• Tminexecution (i) - Represents the minimum execution time for all workers in iteration i.

When a worker finishes its execution, it calls the registerExecutionTime function, which registers the
execution time and processed block size for the deviceNum worker. This function updates the attributes of
the current worker execution and calls a method named adjustBlockSizes that recomputes all workers’
block sizes.
The first worker to finish its work (the fastest one) increases its block size by 10% relative to the
previous block size. Thus, its next iteration’s block size is:

blocksize = blocksize × 1.1 (4.1)

After all the workers finish their execution, the new block size of each worker is calculated taking into
account the fastest worker’s execution and the time spent in that execution. This time is presented in
Equation 4.2 and is given by the function getMinExecTime().

Tminexecution = getMinExecTime()    (4.2)

Then, for each worker, the block size of the next iteration can be calculated by the following expression (the block size must be
an integer value, so the ceil function is used to round up the value obtained by the formula):

blockSize(i) = ceil( (Tminexecution × blockSize(i − 1)) / Texecution (i − 1) )    (4.3)

Consider an execution example like the one presented in Figure 4.7, where worker A is the CPU
worker and worker B is the GPU worker. The initialization values are given by:

blocksize(0, 0) = 30000 (4.4)

blocksize(1, 0) = 65000 (4.5)

Texecution (0, 0) = 0 seconds (4.6)


Texecution (1, 0) = 0 seconds (4.7)

These values, presented in Equations 4.4, 4.5, 4.6 and 4.7, represent the initial execution state
of the workers in our implementation.
After the first worker finishes its execution, its block size is updated and its execution time
is registered:
blocksize(1, 0) = 65000 × 1.1 = 71500 (4.8)

Texecution (1, 0) = 2.01 seconds (4.9)

Texecution (0, 0) = 3.5 seconds (4.10)

Then the load balancing layer updates the block size of other workers.

blocksize(0, 0) = ceil( (2.01 × 30000) / 3.5 ) = 17228    (4.11)
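A minimal C sketch of this balancing logic, implementing Equations 4.1 and 4.3, is presented below. The names registerExecutionTime and adjustBlockSizes come from the description above, but their signatures, the global arrays and the iteration bookkeeping are assumptions of this sketch.

#include <math.h>

#define MAX_WORKERS 8

static long   blocksize[MAX_WORKERS];   /* block size for each worker's next iteration  */
static double exec_time[MAX_WORKERS];   /* last measured Texecution(w, i)               */
static int    num_workers;
static int    finished_this_iter;       /* workers that already reported this iteration */

/* Recomputes every worker's block size from the fastest execution time (Equation 4.3). */
static void adjustBlockSizes(void) {
    double tmin = exec_time[0];
    for (int w = 1; w < num_workers; w++)
        if (exec_time[w] < tmin) tmin = exec_time[w];
    for (int w = 0; w < num_workers; w++)
        blocksize[w] = (long)ceil(tmin * blocksize[w] / exec_time[w]);
}

/* Called by worker w when it finishes an iteration that took 'seconds'.
 * Synchronization between concurrently reporting workers is omitted for brevity. */
void registerExecutionTime(int w, double seconds) {
    exec_time[w] = seconds;
    if (finished_this_iter++ == 0)                   /* fastest worker of the iteration */
        blocksize[w] = (long)(blocksize[w] * 1.1);   /* Equation 4.1: grow by 10%       */
    if (finished_this_iter == num_workers) {         /* everyone reported: rebalance    */
        adjustBlockSizes();
        finished_this_iter = 0;
    }
}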

4.6 Conclusion

The main objective of this work is to study efficient implementations of the Smith-Waterman algorithm
on heterogeneous architectures. For this algorithm, as seen in Chapter 3, several efficient implementations
have been proposed, such as Pedro Monteiro’s implementation, which extends the initial SWIPE proposal
presented by Rognes, and Liu et al.’s CUDASW++ 2.0 implementation.
Considering the execution modules of both implementations, this work takes each of them as the
base of its execution workers, in order to process all of the work chunks that exist in the supplied
database file as fast as possible.
Thus, the architecture of the MultiSW implementation presented in Section 4.2 represents an orchestra-
tion between the execution modules, considering all the synchronization mechanisms and the access to the
database file with the sequences to process. In order to accomplish an orchestration of the two different
solutions, it was necessary to understand and adapt both modules so that they could access the same database
file and process the sequences in the FASTA format.
Complementing this orchestration, optimizations are proposed for the GPU module, introducing
CUDA streams in the data-transfer code and in the invocation of kernels on the execution GPUs. These
GPU worker optimizations are presented in Section 4.2.
Finally, in order to ensure that the several system workers process the correct amount of data, an extra
load-balancing module is introduced, with the objective of adjusting the size of the blocks processed
along the execution of the application, in such a way that the finishing time of the application
is the minimum possible, as represented in the example presented in Section 4.5.

5. Experimental Results
Contents
5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1.1 Experimental Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Evaluating Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.1 Scenario A - Single CPU core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.2 Scenario B - Four CPU cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.3 Scenario C - Single GPU - GeForce GTX 780 Ti . . . . . . . . . . . . . . . . . . . 54
5.3.4 Scenario D - Single GPU - GeForce GTX 660 Ti . . . . . . . . . . . . . . . . . . . 55
5.3.5 Scenario E - Four CPU cores + Single GPU Execution . . . . . . . . . . . . . . . 55
5.3.6 Scenario F - Four CPU cores + Double GPUs Execution . . . . . . . . . . . . . . 57
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58


The performance of our implementation - MultiSW - was evaluated by considering multiple execution
scenarios. The results are presented below in section 5.3.

5.1 Experimental Setup

To correctly benchmark the implemented solution, the considered experimental setup was a Linux-based
workstation with the following characteristics:
Machine:

• Intel(R) Core(TM) i7 4770K @ 3.5GHz (CPU);

• Four Kingston HyperX DDR3 CL9 8GB @ 1.6GHz Memory RAM modules;

• ASUS Z87-Pro Motherboard;

• GPU A - MSI GeForce GTX 780 Ti Gaming 3GB DDR5;

• GPU B - GeForce GTX 660 Ti 2GB DDR5;

The code was compiled for a 64-bit Linux operating system using the Intel C compiler version 13.1.3
and the NVIDIA Compiler release 6.5.
When comparing the GPUs used, it is easy to identify which one will obtain the best results. The GeForce
GTX 780 Ti has more CUDA processing cores (2880) than the GeForce GTX 660 Ti (1344), so it can run the
kernels with a higher degree of parallelism. Another big difference is the memory bandwidth, which in the
GTX 780 Ti is 336 GB/s, whereas in the GTX 660 Ti it is only 144.2 GB/s, less than half. The memory
interface is also different: 384 bits in the first and 192 bits in the second. It is therefore expected
that the first GPU runs the kernel functions faster, and transfers the data more quickly, than
the second GPU.

5.1.1 Experimental Dataset

The query sequence that was used in the experimental scenarios was the IFNA6 interferon, alpha 6 [Homo sapiens (human)] [51], with 189 residues.
The considered database was release 2014_02 of the UniProtKB/Swiss-Prot [52] sequence database in the FASTA format, repeated 5 times in the file. This database contains 542,503 sequences of several sizes, comprising 192,888,369 amino acids abstracted from 226,190 references. The total number of processed sequences is therefore 2,712,515.


5.2 Evaluating Metrics

In order to compare the considered scenarios, the speedup metric will be used. This metric measures how much faster an optimized implementation is than the base implementation. It is given by Equation 5.1:

$\mathrm{speedup} = \dfrac{t_{\mathrm{sequential}}}{t_{\mathrm{parallel}}}$  (5.1)
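
For instance, using the execution times obtained later in this chapter for the single CPU core and the four CPU core scenarios (Sections 5.3.1 and 5.3.2), the speedup of the 4-core execution over the single-core one is $\mathrm{speedup} = 31.52/15.55 \approx 2.03$.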

5.3 Results

This section presents the multiple considered scenarios and their results when running the application with various execution parameter configurations. It starts with the simplest scenario, corresponding to a single CPU core execution, and finishes with the most complex configuration, an orchestration of workers based on a multicore CPU and multiple GPUs that processes all the available work. Before gathering the experimental results, the execution block sizes for each kind of worker were pre-adjusted, by running the application with several block size configurations and keeping the ones that led to the best overall execution times.
Each execution scenario was executed ten times, and the presented results correspond to the average of these execution times. An iteration corresponds to the time that the application spends processing one block of the size defined for the execution worker.
In each presented scenario, for the global orchestration, the CPU execution worker corresponds to the CPU Wrapper module presented in Section 4.2.1.A, regardless of the execution being done with one or four CPU cores. The iteration time may vary, because the processed sequences have different sizes.
The block sizes considered in the experimental results for the CPU and the GPU were adjusted by varying the block sizes and checking the best execution times, using only one CPU core and a single GPU. The obtained CPU block size was 30,000 sequences and the default GPU block size was 65,000 sequences.


5.3.1 Scenario A - Single CPU core

Considering a single CPU core execution, the total execution time was about 31.52 seconds, as
shown in Figure 5.1.


Figure 5.1: Processing times considering a single CPU core execution and a processing block of 30,000 sequences.

The multiple grey-colored blocks represent the execution of each CPU wrapper iteration, considering its size of 30,000 sequences. These iteration execution times vary between 0.0688 and 2.391 seconds and, all together, they represent the total execution time of about 31.52 seconds. The preparation time at the beginning of each iteration is about 0.0009 seconds and is therefore not visible in the figure presented above. The differences between the iteration execution times are explained by the different sequence sizes in each iteration: for bigger sequences, the iteration execution time will be larger.

5.3.2 Scenario B - Four CPU cores

Considering a 4-core CPU execution, the total execution time was about 15.55 seconds, as shown
in Figure 5.2.



Figure 5.2: Processing times for 4 CPU cores, considering a block size of 30,000 sequences.

The distinct grey-colored blocks represent the time to process a block of 30,000 sequences (corresponding to a CPU wrapper iteration) by the four CPU cores. These iteration values vary between 0.046 and 0.957 seconds.
The total execution time was about 15.55 seconds. The reason why the solution with four CPU cores is not four times faster than the single CPU core one is the synchronization between the multiple threads, together with the data partitioning and organization times; as a consequence, the obtained speedup is not linear, as it ideally would be.

5.3.3 Scenario C - Single GPU - GeForce GTX 780 Ti

Considering a single GeForce GTX 780 Ti GPU execution, the total execution time was 6.35 seconds, as shown in Figure 5.3.


Figure 5.3: Processing times for a single GPU (GPU A - GeForce GTX 780 Ti), considering a block size of 65,000 sequences. Total execution time of about 6.35 seconds.

The figure presents several grey-colored execution blocks, each one representing the time to process 65,000 database sequences against the query sequence. These iteration values vary between 0.118 and 0.266 seconds.


Considering the several optimizations mentioned in Section 4.2.2, especially the use of the CUDA Streams provided by NVIDIA in its framework, it is possible to significantly reduce the preparation time between iterations and to obtain the best overall execution times.
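
The following minimal sketch illustrates this technique, assuming two CUDA streams and double-buffered device arrays; all the names used (swKernel, d_seqs, h_seqs, d_scores, h_scores, grid, threads, numBlocks, blockBytes, scoreBytes) are hypothetical placeholders and do not correspond to the actual CUDASW++ code.

    // Sketch of overlapping data transfers with kernel execution using CUDA
    // streams. Host buffers are assumed to be page-locked (cudaHostAlloc),
    // otherwise the asynchronous copies fall back to synchronous behaviour.
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);

    for (int b = 0; b < numBlocks; ++b) {
        int s = b % 2;  // alternate between the two streams

        // Copy the next block of database sequences while the other stream
        // may still be executing its kernel.
        cudaMemcpyAsync(d_seqs[s], h_seqs[b], blockBytes,
                        cudaMemcpyHostToDevice, stream[s]);

        // The kernel is queued in the same stream, so it starts as soon as
        // its own copy finishes, independently of the other stream.
        swKernel<<<grid, threads, 0, stream[s]>>>(d_seqs[s], d_scores[s]);

        // Copy the partial scores back, still asynchronously.
        cudaMemcpyAsync(h_scores[b], d_scores[s], scoreBytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams before using the results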

5.3.4 Scenario D - Single GPU - GeForce GTX 660 Ti

Considering a single GeForce GTX 660 Ti GPU execution, the total execution time was about 7.38 seconds, as shown in Figure 5.4.


Figure 5.4: Processing times for a single GPU (GPU B - GeForce GTX 660 Ti), considering a block size of 65,000 sequences. Total execution time of about 7.38 seconds.

The figure presents several grey-colored execution blocks. Each one of them represents the time to process 65,000 database sequences against the query sequence. The total execution time was 7.38 seconds, and the iteration values vary between 0.126 and 0.304 seconds.

5.3.5 Scenario E - Four CPU cores + Single GPU Execution

In this scenario, the considered workers are the four CPU cores and the GeForce GTX 780 Ti GPU. The total execution time was about 6.112 seconds, as shown in Figure 5.5:



Figure 5.5: Processing times for 4 CPU cores and a GeForce GTX 780 Ti GPU, considering CPU blocks of 30,000 sequences and GPU blocks of 65,000 sequences. Total execution time of 6.112 seconds.

This time is better than it would be with the GeForce GTX 660 Ti, since the GeForce GTX 780 Ti executes faster, as presented in Scenarios C and D.
Figure 5.6 presents the number of sequences processed by each kind of worker. The CPU worker processed 817,087 sequences, while the GPU worker processed 1,895,428 sequences.


Figure 5.6: Number of Sequences Processed by CPU cores and GPU.

The orchestration represented in this scenario is better than the single GPU execution, but a linear speedup was not achieved, since the number of synchronization points increases with the number of workers in the orchestration.

Figure 5.5 also presents the evolution of the dynamic block size along the execution time. These values are shown next to the corresponding execution blocks, both for the GPU worker and for the CPU worker. The CPU worker starts with a block size of 30,000 sequences and finishes with a size of 15,000, whereas the GPU worker starts with a block size of 65,000 sequences and finishes with a size of 40,000. For both workers, the number of sequences to process next decreases along the execution time, as a result of the way the load balancing module works. The iteration execution times for the GPU worker vary between 0.082 and 0.316 seconds, while for the CPU worker they vary between 0.072 and 0.258 seconds.
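
The exact adjustment rule belongs to the load balancing module described in Chapter 4; purely as an illustrative sketch, one simple rule that is consistent with the behaviour observed in the figures is to scale each worker's next block size by the ratio between a common target iteration time and the iteration time just measured. The function below is an assumption for illustration only, not the actual MultiSW algorithm.

    #include <stddef.h>

    /* Illustrative block size adaptation rule (an assumption, not necessarily
     * the rule used by the MultiSW load balancing module): the next block
     * size is scaled so that every worker tends towards the same target
     * iteration time. */
    size_t next_block_size(size_t current_size, double measured_time,
                           double target_time, size_t min_size, size_t max_size) {
        double scaled = (double)current_size * (target_time / measured_time);
        if (scaled < (double)min_size) scaled = (double)min_size;  /* lower bound */
        if (scaled > (double)max_size) scaled = (double)max_size;  /* upper bound */
        return (size_t)scaled;
    }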

5.3.6 Scenario F - Four CPU cores + Double GPUs Execution

The last considered scenario is composed by the 4-core CPU execution and both available GPUs: the GeForce GTX 780 Ti (GPU A) and the GeForce GTX 660 Ti (GPU B).

As expected, this execution was the fastest one, although not the most efficient, taking about 4.957 seconds, as shown in Figure 5.7.

[Figure 5.7 shows, for each worker, the evolution of the block size along the execution: CPU cores: 30,000 → 33,000 → 25,387 → 17,843 → 16,372 → 15,000; GPU A: 65,000 → 58,633 → 58,415 → 56,279 → 40,000; GPU B: 65,000 → 44,601 → 40,000.]

Figure 5.7: Processing times for the 4-core CPU, GPU A and GPU B, considering an initial block size of 30,000 sequences for the CPU solution and of 65,000 sequences for the GPU solution. Total execution time of 4.957 seconds. Near some of the iteration blocks, the newly considered block size is presented.

Figure 5.7 shows the execution blocks for the three workers. The CPU worker starts with a block size of 30,000 and finishes with a block size of 15,000. The execution times for this worker range from 0.06 to 0.379 seconds, the execution times for the GPU A worker range between 0.067 and 0.394 seconds and, finally, the execution times for the GPU B worker range between 0.067 and 0.520 seconds.
The number of sequences computed by each worker is presented in Figure 5.8. The CPU worker processed 411,817 sequences, the GPU A worker computed 1,241,496 sequences and the GPU B worker processed 1,059,202 sequences. The lower amount processed by the CPU worker is explained by its block size, which is kept smaller than the GPU workers' block sizes in order to improve the overall performance.



Figure 5.8: Number of Sequences Processed by CPU, GPU A, and GPU B workers.

5.4 Summary

As shown in Table 5.1, considering the multiple scenarios, Scenario F, presented in Section 5.3.6, achieved a speedup of about 6.4x when compared with the single CPU core execution presented in Scenario A (Section 5.3.1).

Configuration                            Execution Time (s)   Speedup
Single CPU core                                   31.52          1.00
Four CPU cores                                    15.55          2.03
GeForce GTX 780 Ti                                 6.350         4.96
GeForce GTX 660 Ti                                 7.380         4.271
Four CPU cores + GeForce GTX 780 Ti                6.112         5.16
Four CPU cores + 2 GPUs                            4.96          6.36

Table 5.1: Execution speedups of the considered scenarios.

Increasing the number of workers in the orchestration also increases the amount of synchronization needed between the involved threads. This causes execution delays and makes the workers wait longer. This situation is minimized by the load balancing layer included in our solution, since the block sizes are adapted so that the workers finish at similar times. However, the load balancing module has some limitations, since the total number of sequences to process is not known at the beginning of the application.
Despite these limitations, as can be verified in Table 5.1, the orchestration obtained relatively good speedups for the different execution scenarios, improving with the inclusion of each new worker in the execution.

6 Conclusions and Future Work

Contents
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60


6.1 Conclusions

Multiple solutions have been proposed in recent years to cope with the large amount of biological information produced every day. Exploiting parallel architectures based on CPUs and GPUs enables a faster processing of these data. Hence, in our work, a solution that combines both was proposed in order to obtain better results. Under this context, this thesis proposed the integration of two previously presented parallel implementations: an adaptation of the SWIPE implementation [16], for multi-core CPUs that exploits SIMD vectorial instructions [2], and an implementation of the Smith-Waterman algorithm for GPU platforms (CUDASW++ 2.0) [17]. Accordingly, the presented work offers a unified solution that tries to take advantage of all the computational resources that are made available in heterogeneous platforms, composed by CPUs and GPUs, by integrating a convenient dynamic load balancing layer. The obtained results, presented in Chapter 5, show that the attained speedup can reach values as high as 6x, when executing in a quad-core CPU and two distinct GPUs.

6.2 Future Work

The presented solution already considers both intra-task and inter-task processing approaches. However, in the CPU module, it would be worth exploring additional inter-task approaches. Another possible future work item is to add an extra thread to the solution to prepare all the GPU work, as already happens in the CPU module.

The new Kepler NVIDIA GPUs provide a technology designated as Dynamic Parallelism, which allows the creation of new chunks of work without the need for new data transfers between the device and the host. Thus, it is possible to spend less time transferring data between the device and the host, optimizing the total execution time.
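
As a purely illustrative sketch of how this feature could be used (assuming a device of compute capability 3.5 or higher and compilation with relocatable device code, -rdc=true), a parent kernel can launch new work directly from the device; the kernels and the partitioning below are hypothetical and do not correspond to the actual MultiSW code.

    // Sketch of CUDA Dynamic Parallelism: child kernels are launched by the
    // GPU itself, so new chunks of work are created without an extra round
    // trip to the host and without additional host-device data transfers.
    __global__ void childKernel(const char *seqs, int *scores, int first, int count) {
        // ... process sequences [first, first + count) of the resident chunk ...
    }

    __global__ void parentKernel(const char *seqs, int *scores,
                                 int numChunks, int chunkSize) {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            for (int c = 0; c < numChunks; ++c) {
                // Device-side launch: no host intervention is required.
                childKernel<<<1, 256>>>(seqs, scores, c * chunkSize, chunkSize);
            }
        }
    }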

Finally, another possible optimization is to take further advantage of the Load Balancing module, by improving the algorithm used in this module, so that the block sizes processed by the available workers get closer to their optimal values.

Bibliography

[1] NVIDIA CUDA - NVIDIA CUDA C Programming Guide, February 2014. URL http://docs.nvidia.
com/cuda/pdf/CUDA_C_Programming_Guide.pdf.

[2] Pedro Matos Monteiro. Profiling biological applications for parallel implementation in multicore computers. Master's thesis, Av. Rovisco Pais, 1, November 2012.

[3] Yongchao Liu, Douglas L. Maskell, and Bertil Schmidt. CUDASW++: optimizing Smith-Waterman
sequence database searches for CUDA-enabled graphics processing units. BMC research notes,
2(1):73+, 2009. ISSN 1756-0500. doi: 10.1186/1756-0500-2-73. URL http://dx.doi.org/10.
1186/1756-0500-2-73.

[4] Yongchao Liu, Adrianto Wirawan, and Bertil Schmidt. Cudasw++ 3.0: accelerating smith-waterman
protein database search by coupling cpu and gpu simd instructions. BMC Bioinformatics, 14(1):
117, 2013. ISSN 1471-2105. doi: 10.1186/1471-2105-14-117. URL http://www.biomedcentral.
com/1471-2105/14/117.

[5] M. J. Flynn. Very high-speed computing systems. Proc. IEEE, 54(12):1901–1909, December 1966.

[6] Eberly College of Arts and Sciences. http://eberly.wvu.edu/, 2014. Accessed on October 9, 2014.

[7] Daniel Reiter Horn, Mike Houston, and Pat Hanrahan. ClawHMMer: A streaming HMMer-search implementation. In Supercomputing, 2005.

[8] GenBank DNA database. http://www.ncbi.nlm.nih.gov/genbank/, 2014. Accessed on October 9, 2014.

[9] National Center for Biotechnology Information (NCBI). http://www.ncbi.nlm.nih.gov/, 2014. Accessed on October 9, 2014.

[10] Universal Protein Resource (UniProt). http://www.uniprot.org/, 2014. Accessed on October 9, 2014.

[11] Nucleotide sequence database (EMBL). http://www.ebi.ac.uk/ena/, 2014. Accessed on October 9, 2014.

[12] Swiss-Prot. http://www.ebi.ac.uk/uniprot/, 2014. Accessed on October 9, 2014.

[13] TrEMBL. http://www.ebi.ac.uk/uniprot/, 2014. Accessed on October 9, 2014.

[14] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of


molecular biology, 147(1):195–197, March 1981. ISSN 0022-2836. URL http://view.ncbi.nlm.
nih.gov/pubmed/7265238.

[15] D.E. Culler, J.P. Singh, and A. Gupta. Parallel computer architecture: a hardware/software ap-
proach. The Morgan Kaufmann Series in Computer Architecture and Design. Morgan Kauf-
mann Publishers, 1999. ISBN 9781558603431. URL http://books.google.pt/books?id=
gftcVOn7iGsC.

[16] Torbjorn Rognes. Faster Smith-Waterman database searches with inter-sequence SIMD par-
allelisation. BMC Bioinformatics, 12(1):221+, June 2011. ISSN 1471-2105. doi: 10.1186/
1471-2105-12-221. URL http://dx.doi.org/10.1186/1471-2105-12-221.

[17] Yongchao Liu, Bertil Schmidt, and Douglas Maskell. CUDASW++2.0: enhanced Smith-Waterman
protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstrac-
tions. BMC Research Notes, 3(1):93+, 2010. ISSN 1756-0500. doi: 10.1186/1756-0500-3-93.
URL http://dx.doi.org/10.1186/1756-0500-3-93.

[18] G.S. Almasi and A. Gottlieb. Highly parallel computing. The Benjamin/Cummings series in com-
puter science and engineering. Benjamin/Cummings Pub. Co., 1994. ISBN 9780805304435. URL
http://books.google.pt/books?id=rohQAAAAMAAJ.

[19] J.L. Hennessy, D.A. Patterson, and A.C. Arpaci-Dusseau. Computer architecture: a quantitative
approach. Number vol. 1 in The Morgan Kaufmann Series in Computer Architecture and De-
sign. Morgan Kaufmann, 2007. ISBN 9780123704900. URL http://books.google.pt/books?id=
57UIPoLt3tkC.

[20] Alex Peleg and Uri Weiser. MMX Technology Extension to the Intel Architecture. IEEE Micro, 16
(4):42–50, 1996. ISSN 0272-1732. doi: 10.1109/40.526924. URL http://dx.doi.org/10.1109/
40.526924.

[21] W. Stallings. Computer Organization and Architecture: Designing for Performance. Prentice Hall,
2010. ISBN 9780136073734. URL http://books.google.es/books?id=-7nM1DkWb1YC.

[22] N. Wilt. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Pearson Education,
2013. ISBN 9780133261509. URL http://books.google.pt/books?id=ynydqKP225EC.

[23] NVIDIA. Kepler GK110 whitepaper, 2012. URL http://www.nvidia.com/content/PDF/kepler/


NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.

[24] Jason Sanders and Edward Kandrot. CUDA by Example: An Introduction to General-Purpose
GPU Programming. Addison-Wesley Professional, 1st edition, 2010. ISBN 0131387685,
9780131387683.


[25] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and
Timothy J. Purcell. A survey of general-purpose computation on graphics hardware. Computer
Graphics Forum, 26(1):80–113, 2007. ISSN 1467-8659. doi: 10.1111/j.1467-8659.2007.01012.x.

[26] Xiaoqing Tang. Introduction to general purpose GPU computing. University of Rochester - Class
Lecture, March, 16 2011. URL http://www.cs.rochester.edu/~kshen/csc258-spring2011/
lectures/student_Tang.pdf.

[27] John Nickolls and William J. Dally. The gpu computing era. IEEE Micro, 30(2):56–69, March 2010.
ISSN 0272-1732. doi: 10.1109/MM.2010.41. URL http://dx.doi.org/10.1109/MM.2010.41.

[28] AMD. AMD Fusion website. http://www.amd.com/us/products/technologies/fusion/Pages/fusion.aspx. Accessed on January 1, 2012.

[29] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. On the efficacy of a fused cpu+gpu processor (or
apu) for parallel computing. In Proceedings of the 2011 Symposium on Application Accelerators in
High-Performance Computing, SAAHPC ’11, pages 141–149, Washington, DC, USA, 2011. IEEE
Computer Society. ISBN 978-0-7695-4448-9. doi: 10.1109/SAAHPC.2011.29. URL http://dx.
doi.org/10.1109/SAAHPC.2011.29.

[30] Math Smith. What is an APU? [Technology explained]. http://www.makeuseof.com/tag/apu-technology-explained/, February 18, 2011. Accessed on January 1, 2012.

[31] Michael Wolfe. Understanding the CUDA Data Parallel Threading Model A Primer.
http://www.pgroup.com/lit/articles/insider/v2n1a5.htm, February 2010. accessed in 7-1-2012.

[32] Brent Oster and Greg Ruetsch. Getting started with CUDA. In NVISION 2008, The World of
Visual Computing. NVIDIA, 2008. URL http://www.nvidia.com/content/nvision2008/tech_
presentations/CUDA_Developer_Track/NVISION08-Getting_Started_with_CUDA.pdf.

[33] Biological sequences. http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/BIOSEQ.HTM, 2014. Accessed on October 9, 2014.

[34] Colin Dewey. Multiple sequence alignment. http://www.biostat.wisc.edu/bmi576/lectures/multiple-alignment.pdf, Fall 2011. Accessed on December 21, 2011.

[35] Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence Anal-
ysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, July 1998.
ISBN 0521629713.

[36] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in
the amino acid sequence of two proteins. J. Mol. Biol., 48:443–453, 1970.

[37] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. Introduction to
Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001. ISBN 0070131511.


[38] D. J. Lipman and W. R. Pearson. Rapid and Sensitive protein Similarity Searches. Science, 227:
1435–1441, March 1985.

[39] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proceedings
of the National Academy of Sciences of the United States of America, 85(8):2444–2448, April 1988.
ISSN 0027-8424. doi: 10.1073/pnas.85.8.2444. URL http://dx.doi.org/10.1073/pnas.85.8.
2444.

[40] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local
alignment search tool. Journal of Molecular Biology, 215(3):403–410, Oct 1990. URL citeseer.
nj.nec.com/akutsu99identification.html.

[41] A. Wozniak. Using video-oriented instructions to speed up sequence comparison. Computer Appli-
cations in the Biosciences, 13(2):145–150, 1997. URL http://dblp.uni-trier.de/db/journals/
bioinformatics/bioinformatics13.html#Wozniak97.

[42] Michael Farrar. Striped Smith–Waterman speeds database searches six times over other SIMD
implementations. Bioinformatics, 23:156–161, January 2007. ISSN 1367-4803. doi: http://dx.doi.
org/10.1093/bioinformatics/btl582. URL http://dx.doi.org/10.1093/bioinformatics/btl582.

[43] Svetlin A Manavski and Giorgio Valle. CUDA compatible GPU cards as efficient hardware
accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9(Suppl 2):S10,
2008. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2323659&tool=
pmcentrez&rendertype=abstract.

[44] T. Rognes and E. Seeberg. Six-fold speed-up of Smith-Waterman sequence database searches
using parallel processing on common microprocessors. Bioinformatics (Oxford, England), 16(8):
699–706, August 2000. ISSN 1367-4803. doi: 10.1093/bioinformatics/16.8.699. URL http://dx.
doi.org/10.1093/bioinformatics/16.8.699.

[45] Bio-sequence database scanning on a GPU, April 2006. doi: 10.1109/ipdps.2006.1639531. URL
http://dx.doi.org/10.1109/ipdps.2006.1639531.

[46] NCBI BLAST. http://blast.ncbi.nlm.nih.gov/Blast.cgi, 2014. Accessed on October 9, 2014.

[47] EBI. SSEARCH algorithm. http://www.ebi.ac.uk/Tools/sss/, 2014. Accessed on October 9, 2014.

[48] BLAST DB format. http://selab.janelia.org/people/farrarm/blastdbfmtv4/blastdbfmt.html, 2014. Accessed on October 9, 2014.

[49] FASTA DB format. http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml, 2014. Accessed on October 9, 2014.

[50] NVIDIA. Overlap data transfers and CUDA executions. http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/, 2014. Accessed on October 9, 2014.


[51] NCBI. IFNA6 interferon, alpha 6 [Homo sapiens (human)]. http://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=ShowDetailView&TermToSearch=3443, 2014. Accessed on October 9, 2014.

[52] UniProt. UniProtKB/Swiss-Prot release 2014_02. http://www.uniprot.org/downloads, 2014. Accessed on October 9, 2014.

[53] Kamran Karimi, Neil G Dickson, and Firas Hamze. A performance comparison of CUDA and
OpenCL. Read, cs.PF(1):12, 2010. URL http://arxiv.org/abs/1005.2581.

[54] E.A. Lee. The problem with threads. Computer, 39(5):33 – 42, may 2006. ISSN 0018-9162. doi:
10.1109/MC.2006.180.

[55] Microsoft. MMX, SSE, and SSE2 intrinsics. http://msdn.microsoft.com/en-us/library/y0dh78ez(v=vs.90).aspx, May 2011. Accessed on May 14, 2014.

[56] W. Stallings. Computer Organization and Architecture: Designing for Performance. Prentice Hall,
2010. ISBN 9780136073734. URL http://books.google.es/books?id=-7nM1DkWb1YC.

[57] J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of pro-
gressive multiple sequence alignment through sequence weighting, position-specific gap penalties
and weight matrix choice. Nucleic Acids Research, 22(22):4673–4680, November 1994. ISSN
1362-4962. doi: 10.1093/nar/22.22.4673. URL http://dx.doi.org/10.1093/nar/22.22.4673.

[58] N. Whitehead and A. Fit-Florea. Precision & performance: Floating point and ieee 754 compliance
for nvidia gpus. nVidia technical white paper, 2011.
