Master of Technology
In
Computer Application
By
Somen Barma
(2005JCA2428)
Under the guidance of
Dr. Kolin Paul
(Department of Computer Science and Engineering)
This is to certify that the project entitled “Design, development and performance
evaluation of multiprocessor systems on FPGA” submitted by Somen Barma in partial
fulfillment of the requirement for the award of the degree of Master of Technology in
Computer Applications to the Indian Institute of Technology, Delhi, is a record of bona
fide work carried out by him under my supervision and guidance.
ACKNOWLEDGEMENT
It is my pleasure and privilege to express my deep sense of gratitude, indebtedness, and
thankfulness towards my guide, Dr. Kolin Paul, for his guidance, constant supervision,
and continuous inspiration and support throughout the course of this work. His valuable
suggestions and critical evaluation have greatly helped me in the successful completion
of the work.
I am also thankful to Prof. M. Balakrishnan for showing keen interest in solving critical
problems in this project.
I am also thankful to all those who helped me, directly or indirectly, in the completion
of this work.
CONTENTS
CERTIFICATE ………………………………………………………….………… I
ACKNOWLEDGEMENT ………………………………………………………… II
CONTENTS …….……………………………………………………………… III
ABSTRACT …………………………………………………………………… IV
CHAPTER 4 PARALLELIZATION OF SMITH-WATERMAN ALGORITHM …. 35
4.1 Local Alignment: Smith-Waterman algorithm …………………………. 35
4.2 Parallelism in Smith-Waterman and the multiprocessor system ………… 36
4.3 Speed-up obtained with the Smith-Waterman algorithm …………………….. 42
REFERENCES …………………………………………………………………………. 46
GLOSSARY …………………………………………………………………………….. 48
ABSTRACT
To get more functionality at the OS level, we chose uClinux as our operating system. The
OS was built for our system using the available tool chain. For the OS to reside in, we
added DDR RAM to the system. The system now also has an Ethernet card and can
handle network packets.
CHAPTER 1 INTRODUCTION
During the past decade, there has been a dramatic increase in the number of applications
within the commercial, medical and military market requiring very high input/output
bandwidth and real-time processing power. Powerful embedded systems, also offering
flexible configurability and cost-effective features, are needed to support particular
requirements of such applications.
A system in which all the microprocessors are identical and are treated uniformly is
called a symmetric multiprocessor (SMP) system. Here an attempt has been made to
develop an SMP system on an FPGA.
integrated circuits. The CLB is the basic logic unit in an FPGA. Exact numbers and
features vary from device to device, but every CLB consists of a configurable switch
matrix with 4 or 6 inputs, some selection circuitry (MUXes, etc.), and flip-flops.
While the CLB provides the logic capability, flexible interconnect routing carries the
signals between CLBs and to and from the I/Os. Routing comes in several flavors, from
interconnect between neighboring CLBs, to fast horizontal and vertical long lines
spanning the device, to global low-skew routing for clocking and other global signals.
The design software hides the interconnect routing task from the user unless specified
otherwise, significantly reducing design complexity.
We have used an FPGA board from Xilinx. The device on the board is an XC2VP30,
package ff896, speed grade -7. Xilinx also ships the EDK software, which can be used
for functional specification, synthesis, place and route, and finally downloading and
debugging the system.
Today many embedded products spread their solution over several chips, making it
larger, more expensive, and more power-hungry. The SoC solution has existed for a long
time on Application Specific Integrated Circuit (ASIC) devices but is rather new on
FPGAs. FPGAs have become bigger, faster, and cheaper, and are now able to handle an
SoC solution. As FPGAs have grown, they can now host a soft processor, an Intellectual
Property (IP) core implemented using logic primitives. A key benefit is configurability:
it is possible to add only what is needed in the design. A trade-off is performance; a hard
processor is faster but less configurable and more expensive [1]. More and more
companies are therefore looking into the possibility of using an SoC on an FPGA with a
soft processor, which makes it easier to develop and evaluate the solution.
1.2 Microblaze:
The soft-core processor used for this project is MicroBlaze [2]. The MicroBlaze
embedded soft processor core is a reduced instruction set computer (RISC) with a
5-stage pipeline, optimized for implementation in Xilinx field programmable gate arrays
(FPGAs). Figure 1.2.1 shows a functional block diagram of the MicroBlaze core.
Many aspects of MicroBlaze can be configured at compile time owing to the
configurable nature of FPGAs. The cache structure, peripherals, and interfaces can be
customized to the application. In addition, hardware support for certain operations, such
as multiplication, division, and floating-point arithmetic, can be added or removed.
MicroBlaze does not have a memory management unit. It can run at speeds of up to
150 MHz. It has the following features.
The processor’s fixed feature set includes:
• Thirty-two 32-bit general purpose registers
• 32-bit instruction word with three operands and two addressing modes.
• 32-bit address bus.
• Single issue pipeline.
The list below consists of some additional features that can be added to the MicroBlaze
[12].
• Hardware barrel shifter - A digital circuit that can shift data any number of bits in
one operation. A vital component in floating point operations
• Hardware divider
• Instruction and data cache
• On-chip peripheral bus (OPB)
• Local memory bus (LMB)
• Fast Simplex Link (FSL)
• Xilinx CacheLink
1.2.1 Registers:
• Volatile registers (caller-save) are temporary registers and do not retain their
values across function calls. The volatile registers are R3-R12; R3 and R4 are
used for returning values to the caller function, and R5-R12 are used to pass
parameters.
• Non-volatile registers (callee-save) keep their values across function calls. The
non-volatile registers are R19-R31.
• Dedicated registers are the remaining registers. Registers R14-R17 are used to
store return addresses from interrupts, sub-routines, traps, and exceptions. R0
always reads as 0, and R1 is used as the stack pointer. These registers should not
be used for anything else.
1.2.2 Bus Interface:
MicroBlaze has several bus interfaces, used for different purposes. It follows the Harvard
architecture, in which separate paths are used for data and instruction accesses. An
advantage of the Harvard architecture is that it makes it possible to read both instructions
and data from memory at the same time [2].
• On-chip Peripheral Bus
The OPB is a fully synchronous bus that provides access to both on-chip and off-chip
peripherals. The bus is not intended to connect directly to the processor [3].
• Local Memory Bus
The LMB is a fast local bus used to connect MicroBlaze to high-speed peripherals, mainly
Block RAM (BRAM). The LMB makes it possible to access BRAM in one clock cycle [12].
• Fast Simplex Link Bus
The FSL is a one-way point-to-point communication bus used between an output FIFO
and an input FIFO. There is support for up to eight master and slave interfaces, and
data can be transferred in two clock cycles [12].
• Xilinx CacheLink
The Xilinx CacheLink (XCL) interface is a high-speed bus for external memory
communication and is only available when the caches are enabled. XCL can be combined
with an OPB where one cache uses XCL and the other one uses an OPB bus. Memory
located outside the cache-able area is accessed through OPB or LMB [12].
• Debug Interface
The debug interface is used with the Microprocessor Debug Module (MDM) and is
controlled through the JTAG port by the Xilinx Microprocessor Debugger (XMD) [12].
1.3 Power PC 405:
1. Central Processing unit: The PowerPC 405 central-processing unit (CPU) implements
a 5-stage instruction pipeline consisting of fetch, decode, execute, write-back, and load
write-back stages. The fetch and decode logic sends a steady flow of instructions to the
execute unit. All instructions are decoded before they are forwarded to the execute unit.
Instructions are queued in the fetch queue if execution stalls. Up to two branches are
processed simultaneously by the fetch and decode logic. If a branch cannot be resolved
prior to execution, the fetch and decode logic predicts how that branch is resolved.
64 KB, 256 KB, 1 MB, 4 MB and 16 MB. Multiple page sizes can improve memory
efficiency and minimize the number of TLB misses.
Initially the MicroBlazes were tested with the standalone [4] BSP. In the later systems
too, all processors other than the MicroBlaze running uClinux run on the standalone
BSP. The standalone Board Support Package provides certain standard APIs, which
applications written in C may use. It is designed for use when an application accesses
board or processor features directly (without an intervening OS layer).
1.5 uClinux:
On VM Linux, whenever an application tries to write off the top of the stack, an
exception is flagged and more memory is mapped in at the top of the stack to allow it to
grow. Under uClinux no such luxury is available, as the stack must be allocated at
compile time, so a stack overflow can cause crashes. Also, instead of a dynamic heap,
uClinux uses a global memory pool, which is essentially the kernel's free memory pool.
The ISR (interrupt service routine) and kernel tasks share a common counting
semaphore. Therefore, if a kernel task is holding the semaphore, the ISR may have to
wait a long time to get executed. Secondly, kernel tasks are non-preemptive, so even a
high-priority user application may have to wait. This makes the response time of uClinux
longer; it is therefore not a complete hard real-time OS.
However, a critical issue is that we cannot use a single uClinux image for all the
MicroBlazes, because the system does not have provision for a single interrupt controller
shared by all of them. Secondly, there is always a cache coherency problem, for which
we do not have a hardware solution.
CHAPTER 2 METHODOLOGY AND WORK DONE
Initially a system was built with a single soft processor (MicroBlaze). The MicroBlaze is
instantiated with its local BRAM support. As MicroBlaze has a Harvard architecture, the
BRAM is connected to it through both an instruction BRAM controller and a data
BRAM controller [5]. The MicroBlaze sits on the OPB (On-chip Peripheral Bus).
Once we had this system, the next step was to add the other processor. The next
processor was added to the system using the wizard; the same result could also have
been obtained by editing the MHS file directly. The newly added processor first has to
be provided with its own BRAM, where the data and instructions local to the processor
will reside. Consequently we require two more controllers for that BRAM too. A block
diagram of the resulting system is shown in figure 2.1.1.
But this ability causes another problem: resource conflict. The bus arbiter grants the bus
to each MicroBlaze in a round-robin fashion every cycle. So it may happen that while
one MicroBlaze is transferring data, say to the UART, the other one gets hold of the bus
and starts transmitting too. This can be solved by the bus-parking facility of the bus, so
that until one MicroBlaze finishes, the other does not get a chance.
However, this scheme has a flaw. Suppose there are multiple resources on the OPB bus
and the two processors are using different resources. If we block the bus, one of the two
will have to wait unnecessarily. This is removed with the introduction of a custom IP.
The sync IP solves the problem. Whenever a process on a MicroBlaze wants the bus, it
registers its processor ID and process ID in a software-addressable register for a
particular resource, i.e., it locks the resource. Later on, the same processor can release
the lock. With this approach, however, there may be chances of deadlock. For the above
system, both MicroBlazes have BSP support but no operating system.
After the hardware platform design is complete, one can generate an FPGA configuration
bitstream. We use XPS to build the bitstream and the netlist. At this point we have
only a "hardware bitstream"; it is not ready to be applied to the FPGA until the
software component of the embedded system is included.
After the embedded software development is complete, we can choose one of the
following ways to run it on the hardware:
1. The executable can be merged into the bitstream to initialize on-chip memory, so
that it gets loaded into on-chip memory, ready to execute, every time the FPGA is
configured.
2. During prototyping, XPS can dynamically download the executable to the board
via the JTAG cable connected to the FPGA. In this case, we select a bootloop to
be merged into the bitstream to initialize on-chip memory so that the processor
remains in a static state until software downloading can be completed.
3. For production systems, one can store executables residing in off-chip memory
regions in a non-volatile memory device, such as flash Programmable ROM
(PROM) or along with the configuration bitstream in a System ACE™ device. In
this case, one would configure a bootloader executable to be merged into the
hardware bitstream to initialize on-chip memory. Then each time the FPGA is
configured or reset, the bootloader copies the application executable to a suitable
(volatile) memory device and starts it running.
We have used JTAG for downloading the ELF file along with the bitstream to the
board. The standard output for the system is the UART.
The system has been tested with matrix addition. It works as follows. The first
MicroBlaze has the data on which the matrix addition is to be performed; it writes the
data to the shared BRAM.
Until the writing of the data is complete, the other MicroBlaze waits, polling a particular
bit that flags that the data has been written completely.
Then MicroBlaze 2 starts reading the data. Once done, it calculates the sum and writes
the result back to a different location in the BRAM. The first one then picks the result up
and displays it on the UART (the hyperterminal).
Figure 2.2.1 Schematic representation of Matrix addition
The above-described system has then been extended by adding DDR SDRAM. This
RAM is required to accommodate the OS (uClinux), and it is also attached to the OPB
bus. The caches for MicroBlaze 1 have now been instantiated.
2.3.1 Cache
The cache consists of an instruction cache and a data cache, each controlled by a bit in
the MSR [12]. They are 1-way associative (direct mapped): each block (a collection of
data containing the requested word) can only be placed at one location in the cache [6].
The memory can be divided into cache-able and non-cache-able segments, making it
possible to specify exactly what to cache. The only address space that cannot be cached
is the LMB address space. The data cache uses write-through: the cache is mirrored in
main memory by writing to memory on each cache write [6]. The cache can be used
with either the OPB interface, the dedicated XCL interface, or a combination. The
differences between these interfaces are [12]:
• CacheLink uses a 4-word cache block (critical word first). It takes the requested
word and the next three, which increases the hit rate; OPB uses a single word
cache block.
• CacheLink uses a dedicated bus interface for memory accesses. This reduces
traffic on the OPB bus.
• The CacheLink interface requires a specialized memory controller interface. The
OPB interface uses standard OPB memory controllers.
• CacheLink allows posted write accesses on write-misses; OPB caches require the
write access to complete before execution resumes (data cache only).
To be able to debug the MicroBlazes and the programs, we have to connect the second
MicroBlaze to the XMD debug module [8] as well. This is done by connecting the
following ports of the XMD module to the MicroBlaze.
The following figure shows the functional diagram of XMD module.
An Ethernet card has also been included in the new design, to support network
applications on the system. It has the following features.
• CSMA/CD compliant operation at 10 Mbps and 100 Mbps in half duplex mode
• Supports unicast, multicast, and broadcast transmit and receive modes as well
as promiscuous and 64 entry Contents Addressable Memory (CAM) based
receive modes
The EMAC [10] Interface design is a soft intellectual property (IP) core designed for
implementation in several Xilinx FPGAs. It supports the IEEE Std. 802.3 Media
Independent Interface (MII) to industry standard Physical Layer (PHY) devices and
communicates to a processor via an IBM On-Chip Peripheral Bus (OPB) interface. The
design provides a 10 Megabits per second (Mbps) and 100 Mbps (also known as Fast
Ethernet) EMAC Interface. This design includes many of the functions and the flexibility
found in dedicated Ethernet controller devices currently on the market.
In this step, the uClinux auto-configure mechanism is used to map/export the EDK
hardware design to the uClinux kernel build mechanism. This is done by configuring and
generating the BSP. The library generator uses the configuration information and the
hardware design database to export an auto-config.in file, which contains all the
information about the hardware design. This auto-config.in file is in the uClinux
configuration file format. The uClinux sources, makefiles, and other scripts are built with
conditional code to generate the correct software based on the hardware described in the
auto-config.in file. This flow effectively means that retargeting uClinux for any hardware
setup can be done quickly. To retarget uClinux, regenerate the BSP for the new hardware
setup. By rebuilding uClinux with the newly generated auto-config.in file, an image
targeting the new hardware setup is compiled by uClinux tools. The tools and related
sources can be obtained from petalogix [15]. To generate the uClinux image we have to
use a linux system. Once the auto-config is placed in the proper place in linux
distribution, going to the tool chain we can use the following commands to generate the
image.
1. Go to the uClinux distribution directory and run:
cd home/devel/uclinux/src/uClinux-dist
make clean
make xconfig
2. Set the vendor option to Xilinx.
3. Choose the OS as uClinux auto.
4. Select the other libraries.
Once the kernel image is ready, we take it back to the host system. From the host
system we download the image to the RAM using the hardware debug module, with
the following commands:
cd binaries
dow -data tmicro.bin 0x30000000
mwr 0x100 0
rwr 5 0x100
rwr pc 0x30000000
con
That is, we load the kernel image tmicro.bin to memory location 0x30000000, set the
program counter to this address, and ask the processor to start.
The applications in Xilkernel can be run in two ways. One method is to build the
applications into the kernel, having only one executable file. The other method is to have
each application as a separate executable file, the same method that is used in a
conventional operating system where the kernel is a separate image file and the
applications are separate files.
Here we first built the application into the kernel by putting it into the source path and
updating the makefile. Later on, however, we built only the specific application. To
download the application, we connect to the MicroBlaze system over the network and
run an FTP session from the host machine, downloading the application to the directory
/var/tmp. We then change the mode of the file and execute it. This approach helped us
avoid the tedious job of rebuilding the OS image every time.
The application accepts packets over the network using the TCP/IP protocol. The
payload is then copied to a particular location in the DDR RAM. The other MicroBlaze
picks the data up from that location and performs the required operation in parallel.
The following are possible ways of creating a shared-memory multiprocessor system
(taken from John Williams' report). We evaluate the pros and cons and finally choose a
custom design for our implementation.
Figure 2.6.1.1 Implicit Multiprocessing
Another way of achieving multiprocessing (considering the limitation mentioned in the
last section) is to run every MicroBlaze with its own copy of the OS. In such a system,
N CPUs sit on a shared bus with private address zones within shared physical memory
and a common shared-memory region with IPC protocols. Since every MicroBlaze has a
separate copy of the OS, this wastes a lot of memory.
2.7 Modified two Microblaze system:
While working with the previous system, we came across a problem with the booting of
the OS. Since all the other peripherals as well as the DDR RAM sat on the same OPB,
there could be bus contention. MB0, which is responsible for booting and running the
OS and acts as the master of the system, was not always able to hold the bus for the
whole period required for booting. There is also a reported bug related to this issue: it is
difficult to have multiple MicroBlazes on the same bus with one of them using the DDR
RAM for the OS. We therefore tried to overcome the issue by adding a PLB (Processor
Local Bus) to the system, on which the DDR RAM loaded with the kernel image
resides. The master MicroBlaze sits on one OPB and the rest of the MicroBlazes on
another OPB. This system also has scope for further scaling by attaching more buses
and MicroBlazes to them.
In order to achieve parallelism with minimum memory overhead and bus-scaling issues,
we have developed a system with a master-slave relationship. The master MicroBlaze is
the only MicroBlaze running the uClinux OS. The remaining MicroBlaze processors act
as slaves and use the standalone BSP. The MicroBlaze running uClinux is responsible
for providing the benefits of an advanced OS to the complete system and for various
synchronization activities. In the application explored in the project, the master
processor collects data over the network and makes it available to the other processors.
The above figure shows the architecture of the system.
There are bridges (OPB2PLB and PLB2OPB) which allow the MicroBlazes to access
the DDR RAM and other peripherals on different buses, which, though not on their own
OPB, are mapped into their address space.
The Xilinx 64-bit Processor Local Bus (PLB) consists of a bus control unit, a watchdog
timer, and separate address, write, and read data path units with a three-cycle only
arbitration feature.
Some works, which are specific to PLB arbitration logics, are:
1. Single Read Transfer Bus Time-Out,
2. Single Write Transfer Bus Time-Out,
3. Line Read Transfer Bus Time-Out,
4. Line Write Transfer Bus Time-Out,
5. Burst Read Transfer Bus Time-Out,
6. Burst Write Transfer Bus Time-Out,
7. Pipelined Read Transfer,
8. Pipelined Write Transfer
Arbitration Priority: The Xilinx PLB implements fixed priority when two or more
masters have the same priority inputs. The priority order in this case is Master 0,
Master 1, Master 2, ..., Master N.
Clock and Power Management: The IBM PLB Arbiter core supports clock and power
management by gating clocks to all internal registers and providing a sleep-request
signal to a central clock and power management unit in the system. This sleep-request
signal is asserted by the arbiter to indicate when it is permissible to shut off its clocks.
The On-chip Peripheral Bus (OPB) to Processor Local Bus (PLB) bridge translates OPB
transactions into PLB transactions. It functions as a slave on the OPB side and as a
master on the PLB side. Since the MicroBlazes are on the OPB bus, we have added this
OPB2PLB bridge so that they can communicate with the DDR RAM, the shared
memory. The bridge performs both the address translation and the bus-protocol
translation for the MicroBlazes.
High Level Description: The following figure shows a schematic diagram of the bridge
The OPB interface is designed with a pipelined architecture to improve timing and to
support high clock frequencies. Input and output signals to the OPB are driven through
flip-flops for better timing. Pipelining introduces some additional latency, since some
signals are delayed through registers; however, it balances transaction latency against
higher clock frequencies.
The bridge decoder takes care of the fact that the same location is not addressed both by
a master on the PLB and by the bridge itself. To avoid such a situation, certain
(user-changeable) parameters are provided.
The bridge provides four channels to address four contiguous memory regions on the
PLB side from the master on the OPB side. Each channel has one C_RNG0_BASEADDR
and one C_RNG0_HIGHADDR, which correspond to the address range of that channel
that will be accessed by the bridge on the PLB. The PLB base addresses are those
obtained after translating the channel address from the OPB side.
The PLB2OPB bridge acts as a master on the OPB bus and as a slave on the PLB bus.
Its user-changeable parameters are similar to those of the OPB2PLB bridge. This bridge
translates transactions on the PLB bus to the OPB bus.
2.7.3 System View:
2.8 Heterogeneous multiprocessor system with Power PC:
The previously described homogeneous multiprocessor system has been further modified
into a heterogeneous system by adding a PowerPC to it. The PowerPC sits on the PLB
bus. We have tested the system with some simple programs, which confirmed that all
the hardware units are working.
CHAPTER 3 EXPERIMENTS ON THE DEVELOPED SYSTEM
Once the building of the hardware system was done, it was time to experiment with it.
We started with the goal of running the compiled kernel image discussed in section 2.4
on the system developed (the modified two-MicroBlaze system) discussed in 2.7. The
MicroBlaze running uClinux gives the flexibility of an operating system, whereas the
standalone MicroBlazes in the same system provide raw computational power.
We got the following output in the hyperterminal. It is the boot sequence of uClinux.
IP Protocols: ICMP, UDP, TCP
IP: routing cache hash table of 2048 buckets, 16Kbytes
TCP: Hash tables configured (established 16384 bind 32768)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
VFS: Mounted root (romfs filesystem) readonly.
Freeing init memory: 52K
flatfsd: Nonexistent or bad flatfs (-114), creating new one...
/bin/flatfsd: mtd.c: 156: flat_dev_close: Assertion `flatinfo.fd != -1' failed.
flatfsd: mtd.c: 156: flat_dev_close: Assertion `flatinfo.fd != -1' failed.
Abort
Setting hostname:
Setting up interface lo:
Starting DHCP client:
Starting thttpd:
eth0: Link carrier lost.
uclinux-auto login:
An application using the DES cryptography algorithm was used in our setup to measure
the gains of the architecture described above. The application accepting network packets,
as described in section 2.5, was executed on the MicroBlaze with uClinux. The
standalone MicroBlazes ran the DES algorithm to encrypt portions of data placed in
RAM by the master MicroBlaze. The following run-times were achieved using varying
numbers of MicroBlaze processors:
CHAPTER 4 PARALLELIZATION OF SMITH WATERMAN ALGORITHM
The most important fact of biological sequence analysis is that in biomolecular
sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually
implies significant functional or structural similarity.
In many applications two strings may not be highly similar in their entirety but may
contain regions that are highly similar. The task is to find and extract a pair of regions,
one from each of the two given strings, that exhibit high similarity. This is called the
local alignment or local similarity problem, and the Smith-Waterman algorithm is one
approach to it.
The Smith-Waterman algorithm is a database search algorithm developed by T.F. Smith
and M.S. Waterman, based on the earlier model of Needleman and Wunsch, who
proposed the original algorithm. The Smith-Waterman algorithm uses dynamic
programming to find the best local alignment between any two given sequences. Based
on certain criteria, usually a scoring matrix, scores and weights are assigned to each
character-to-character comparison: positive for exact matches, and usually negative for
mismatches and insertions/deletions. The exact scores are based on the scoring matrix.
The scores are then added together and the highest-scoring alignment is reported.
The generic Smith-Waterman algorithm [18] is given below:
1 Declare an nxn similarity matrix
2 Initialize the top row (i=0) and left column (j=0) with 0
3 for i = 1; i < length(sequence); i++ do
4 for j = 1; j < length(sequence); j++ do
5 F(i,j) = max (0, F(i-1,j-1) + s(xi,yj), F(i-1,j) - d, F(i,j-1) - d)
6 Save index of term that contributed to the calculated value in F(i,j)
7 end for
8 end for
9 Find maximum value in nxn matrix
10 Using saved indices in (6), traceback to first 0 encountered
where F(i, j) stands for the value of the optimal local suffix alignment for the given
index pair (i, j).
The time complexity of the algorithm is O(n²). Here we have tried to divide the
computation among the processors to obtain a speed-up on the multiprocessor system.
To do that, we have to parallelize the computation; the trouble spot, and the target for
parallelization, is the calculation of the table.
Before approaching the parallelization problem, let us first look at how the flow of
computation proceeds. The following table is an example of local alignment using the
Smith-Waterman algorithm.
C A G C G T T G
0 0 0 0 0 0 0 0 0
A 0 0 2 0 0 0 0 0 0
G 0 0 0 4 2 2 0 0 2
G 0 0 0 2 3 4 2 0 2
T 0 0 0 0 1 2 6 4 2
A 0 0 2 0 0 0 4 5 3
C 0 2 0 1 2 0 2 3 4
Here the costs assumed are
Gap=-2;
Match=2;
Mismatch=-1
The flow of computation moves along the diagonals; that is, the diagonals get calculated
one by one. The following figure gives a pictorial representation for two strings of
length 4.
From the above diagrams it is obvious that if we compute the cells on each diagonal on
different processors, we may gain a computational advantage.
4.2.1 The parallelization: a Locally Sequential, Globally Parallel (LSGP)
approach:
To calculate new values on a fresh diagonal, the data dependencies are only on the
previously computed diagonals. The following figure shows how two processors can be
used for parallel computation.
The dotted lines represent the cells that will be computed by processor one, and the
solid lines represent the cells that will be computed by the other processor; this is, in
essence, the LSGP approach. For the first two diagonals, both processors compute all
the cells. After that the division of work begins. The first processor keeps track of the
number of cells to be computed by the other processor and by itself. Both of them
compute the corner points first and then go on to calculate the other points. Processor
one (P1) calculates the right-most cells and writes the result back to the shared memory.
Similarly, the second processor (P2) does the same for the left corner point of its part of
the diagonal. Although the data need not be transmitted every time, doing so brings
symmetry to the process, so at the cost of some extra overhead the programming
complexity is reduced a great deal. Along with the cell count, P1 also lets P2 know the
index of the right-most point, so that the latter can start working from that end. The
indices are easy to calculate because all the cells that need to be calculated lie on the
same diagonal. This approach can be extended to more processors. But with a higher
number of processors, the increments and decrements of the number of cells calculated
by each processor on each diagonal have to follow a specific pattern. Here we propose
such a pattern, which we explain with a three-processor example.
Cells computed by each processor on successive diagonals (upper half of the table, then lower half):

diagonal length   P1  P2  P3
       4           2   1   1
       5           2   2   1
       6           2   2   2
       7           3   2   2
       8           3   3   2

      15           5   5   5
      14           5   5   4
      13           5   4   4
      12           4   4   4
      11           5   4   3
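The counts above are consistent with splitting each diagonal as evenly as possible among the processors. A sketch of such a split (our formulation, not necessarily the exact rule used in the thesis):

```python
def split_diagonal(length, P):
    """Divide `length` cells of one anti-diagonal among P processors,
    giving the earlier processors the extra cells when it does not divide evenly."""
    base, extra = divmod(length, P)
    return [base + (1 if p < extra else 0) for p in range(P)]

# growing diagonals of lengths 4..8 split among three processors:
# 4 -> [2, 1, 1], 5 -> [2, 2, 1], 6 -> [2, 2, 2], 7 -> [3, 2, 2], 8 -> [3, 3, 2]
```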
We assume that the function Smith_Waterman(i,j) calculates the value of a cell given its index, and that the function Trace_back() finally produces the required alignment. The variables number_of_cells_for_P1 and number_of_cells_for_P2 keep track of the cells each processor has to calculate on each diagonal. The two functions write_to_SM() and Read_from_SM() allow the processors to write data to and read data from the Shared Memory, respectively. The processors have data dependencies: which cell's data has to be brought in from the shared memory is determined by each processor from its local table. If the required data is already available, nothing is done; otherwise the processor fetches the data, stores it in its local memory, and then proceeds with the computation.
Processor 1:
// the total table is of n x n size
Calculate the first two diagonals locally.
number_of_cells_for_P1 = 2;
number_of_cells_for_P2 = 1;
For ( i = 2 to n-1 )
{
    If ( i mod 2 = 0 )
    {
        number_of_cells_for_P1 = number_of_cells_for_P1 + 1;
    }
    If ( i mod 3 = 0 )
    {
        number_of_cells_for_P2 = number_of_cells_for_P2 + 1;
        write_to_SM(number_of_cells_for_P2);
    }
    row_index_for_P2 = i - number_of_cells_for_P1 - 1;
    column_index_for_P2 = i + number_of_cells_for_P1;
    write_to_SM(row_index_for_P2);
    write_to_SM(column_index_for_P2);
    iteration = 0;
    While ( iteration NOT EQUAL TO number_of_cells_for_P1 )
    {
        result = Smith_Waterman(row_index, column_index);
        write_to_SM(result);
        row_index = row_index + 1;
        column_index = column_index - 1;
        iteration = iteration + 1;
    }
}
// up to this point P1 calculates cells from the upper triangle of the table.
For ( k = 0 to n-1 )
{
    // number_of_cells_for_P2 is decremented and number_of_cells_for_P1
    // is incremented following the pattern described above.
    If ( result required for calculating cell (n-1, (n-1) - number_of_cells_for_P1) is not available )
        Read_from_SM(result);
    iteration = 0;
    While ( iteration NOT EQUAL TO number_of_cells_for_P1 )
    {
        result = Smith_Waterman(row_index, column_index);
        write_to_SM(result);
        row_index = row_index + 1;
        column_index = column_index - 1;
        iteration = iteration + 1;
    }
}
// End of the pseudo code for processor P1
Processor 2:
// the total table is of n x n size
Calculate the first two diagonals locally.
For ( k = 0 to n-1 )
{
    Read_from_SM(number_of_cells_for_P2);
    Read_from_SM(row_index_for_P2);
    Read_from_SM(column_index_for_P2);
    iteration = 0;
    While ( iteration NOT EQUAL TO number_of_cells_for_P2 )
    {
        If ( results for all three corners are not available )
            Read_from_SM(result);
        result = Smith_Waterman(row_index, column_index);
        If ( iteration = 1 )
        {
            write_to_SM(result);
        }
        row_index = row_index - 1;
        column_index = column_index + 1;
        iteration = iteration + 1;
    }
}
// this part is for the upper half of the table.
For ( k = 0 to n-1 )
{
    Read_from_SM(number_of_cells_for_P2);
    Read_from_SM(row_index_for_P2);
    Read_from_SM(column_index_for_P2);
    iteration = 0;
    While ( iteration NOT EQUAL TO number_of_cells_for_P2 )
    {
        result = Smith_Waterman(row_index, column_index);
        write_to_SM(result);
        row_index = row_index + 1;
        column_index = column_index - 1;
        iteration = iteration + 1;
    }
}
// End of the pseudo code for processor P2
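The division of work in the pseudo code above can be checked against a serial fill with a compact single-process simulation: the table is computed diagonal by diagonal, and on each diagonal the cells are split into two independent work lists standing in for P1 and P2 (the table H itself models the shared memory; the function names are ours):

```python
def sw_cell(H, q, d, i, j, match=2, mismatch=-1, gap=-2):
    s = match if q[i - 1] == d[j - 1] else mismatch
    return max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)

def sw_serial(q, d):
    """Plain row-by-row fill, used as the reference result."""
    H = [[0] * (len(d) + 1) for _ in range(len(q) + 1)]
    for i in range(1, len(q) + 1):
        for j in range(1, len(d) + 1):
            H[i][j] = sw_cell(H, q, d, i, j)
    return H

def sw_two_workers(q, d):
    """Diagonal-parallel fill: the first half of each diagonal plays P1,
    the second half plays P2; within a diagonal the order is irrelevant."""
    n, m = len(q), len(d)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for diag in range(2, n + m + 1):
        cells = [(i, diag - i) for i in range(max(1, diag - m), min(n, diag - 1) + 1)]
        half = (len(cells) + 1) // 2
        for work_list in (cells[:half], cells[half:]):  # independent work lists
            for i, j in work_list:
                H[i][j] = sw_cell(H, q, d, i, j)
    return H
```

Both fills produce identical tables for the example strings of section 4.1, confirming that cells on the same diagonal can be computed in any order.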
Speed-Up Comparisons
[Figure: number of cycles (up to 10,000) versus string length (0 to 14) for the one- (P1), two- (P2), and three-processor (P3) configurations.]
The graph above was obtained by implementing the parallelization of the Smith-Waterman algorithm described in section 4.2.
CHAPTER 5 DISCUSSION
uClinux is an operating system for microcontrollers that do not have an MMU (memory management unit). For that reason we cannot run a single copy of the OS to control all of the MicroBlazes. However, it is advantageous to have one of the processors running it: that processor works as a master and can be made to handle many applications.
For the decryption algorithm we have the OS running on the master MicroBlaze, which is responsible for placing the data in the RAM. The other MicroBlazes keep checking for the availability of the data and, when it is found, proceed with the computation. But as the size of the data increases, the traffic on the PLB increases. Moreover, the master MicroBlaze's OS resides in DDRAM, which is on the same PLB. As a consequence, performance falls off as the data size increases.
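The master/worker handshake described above can be sketched as a shared "mailbox" with a valid flag (a single-process illustration; the names are ours, and on real hardware the ordering of the two writes would need a memory barrier):

```python
class Mailbox:
    """Stand-in for a shared-RAM buffer the master fills and a worker polls."""
    def __init__(self):
        self.valid = False
        self.data = None

def master_put(box, data):
    box.data = data    # the master writes the payload first...
    box.valid = True   # ...then raises the flag the workers poll

def worker_poll(box):
    if not box.valid:
        return None    # nothing available yet; keep polling
    box.valid = False  # consume the entry
    return box.data
```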
CHAPTER 6 CONCLUSION AND FUTURE SCOPE
The results presented in the last section indicate that adding multiple processors, and therefore more compute power, can provide significant gains. The gain obtained varies from application to application. To avoid bus-contention delays and achieve better results, there is also a need for scheduling transactions on the bus. The results also indicate that one cannot expect linear speed-up by simply adding more processors in a shared-memory architecture. For the decryption application chosen, running the system with four MicroBlaze processors gave a 2.29x speed-up. Adding more MicroBlaze processors yields little further gain because of the increasing number of bus-contention delays. However, if the data transfers can be made parallel using FSL (Fast Simplex Link) links, the model can be scaled up further with more processors.
For the Smith-Waterman algorithm the speed-up is around 1.7x for a three-processor system with a string length of 12. We can expect better performance with longer strings, since the computation grows while the data transferred per diagonal remains the same.
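The sub-linear scaling can be quantified as parallel efficiency, the speed-up divided by the processor count; with the figures reported here:

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speed-up actually achieved."""
    return speedup / processors

# decryption:      2.29x on four processors  -> about 57% efficiency
# Smith-Waterman:  1.7x on three processors  -> about 57% efficiency
```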
Very long strings cannot be used here, as the Local Memory of the processors is small. Using the local memory, rather than the shared Block RAM, for all memory purposes is one of the strategies for increasing speed. So if the given architecture is to be used, pruning methods [19] should be applied to reduce the string lengths.
REFERENCES
[2] MicroBlaze Processor Reference Guide, Embedded Development Kit EDK 8.2i (UG081 v6.0), June 1, 2006.
[4] Standalone Board Support Package, EDK 8.2i, June 23, 2006.
[5] LMB BRAM Interface Controller (v1.00b), DS452, February 22, 2006.
[6] On-Chip Peripheral Bus V2.0 with OPB Arbiter (v1.10c), DS401, December 2, 2005.
[8] Microprocessor Debug Module (MDM) (v2.00a), DS450, February 22, 2006.
[11] D. McCullough. uClinux for Linux programmers. Linux Journal, Volume 2004, Issue 123 (July 2004), page 7. ISSN 1075-3583. ACM Press, July 2004.
[13] J. Williams. MicroBlaze uClinux FAQ, 2006-02-16. http://www.itee.uq.edu.au/~jwilliams/mblazeuclinux/Documentation/FAQ.html
[14] D. Stepner, N. Rajan, and D. Hui. Embedded application design using a real-time OS. In Proceedings of the 36th ACM/IEEE Design Automation Conference, pages 151-156. ACM Press, August 1999.
[15] J. Williams. What does PetaLogix mean for the MicroBlaze uClinux community. PetaLogix, 2006-02-16. http://www.petalogix.org/ news events/petalogix announce
[16] OS and Libraries Document Collection, Xilinx. http://www.xilinx.com /ise/embedded /edk71 docs/oslibs rm.pdf
[17] PowerPC™ 405 Processor Block Reference Guide, UG018 (v2.0), August 20, 2004.
[19] F. Ahmed. Pruning algorithm to reduce the search space of the Smith-Waterman algorithm. Department of Electrical and Computer Engineering, Lafayette College, Easton, PA.
GLOSSARY
FPGA Field Programmable Gate Array
SMP Symmetric Multiprocessor System
CLB Configurable Logic Block
EDK Embedded Development Kit
IP Intellectual Property
SoC System on Chip
RISC Reduced Instruction Set Computer
OPB On-chip Peripheral Bus
LMB Local Memory Bus
FSL Fast Simplex Link
BRAM Block RAM
XCL Xilinx Cache Link
MDM Microprocessor Debug Module
MMU Memory Management Unit
TLB Translation Lookaside Buffer
BSP Board Support Package
ELF Executable and Linkable Format
MHS Microprocessor Hardware Specification
MSS Microprocessor Software Specification
OPB2PLB OPB to PLB bridge