
DESIGN, DEVELOPMENT AND PERFORMANCE

EVALUATION OF MULTIPROCESSOR SYSTEMS ON


FPGA

A dissertation submitted in partial fulfillment of the requirements


for the degree of

Master of Technology
in
Computer Applications

By
Somen Barma
(2005JCA2428)
Under the guidance of
Dr. Kolin Paul
(Department of Computer Science and Engineering)

Indian Institute of Technology Delhi


May 2007
CERTIFICATE

This is to certify that the project entitled “Design, development and performance
evaluation of multiprocessor systems on FPGA” submitted by Somen Barma in partial
fulfillment of the requirement for the award of the degree of Master of Technology in
Computer Applications to the Indian Institute of Technology, Delhi, is a record of bona fide work carried out by him under my supervision and guidance.

Dr. Kolin Paul


Department of Computer Science and Engineering
Indian Institute of Technology, New Delhi

ACKNOWLEDGEMENT

It is my pleasure and privilege to express my deep sense of gratitude, indebtedness and thankfulness towards my guide, Dr. Kolin Paul, for his guidance, constant supervision and continuous inspiration and support throughout the course of this work. His valuable suggestions and critical evaluation have greatly helped me in the successful completion of the work.
I am also thankful to Prof. M. Balakrishnan for showing keen interest in solving critical
problems in this project.
I am also thankful to all those who helped me directly or indirectly in completion of this
work.

New Delhi Somen Barma


20th May, 2007 2005JCA2428
IIT, Delhi.

CONTENTS

CERTIFICATE …………………………………………………………………… I
ACKNOWLEDGEMENT ……………………………………………………… II
CONTENTS ……………………………………………………………………… III
ABSTRACT ……………………………………………………………………… IV

CHAPTER 1 INTRODUCTION ……………………………………………….. 7
1.1 Background ………………………………………………….. 8
1.2 MicroBlaze ………………………………………………….. 9
1.3 PowerPC 405 ………………………………………………….. 12
1.4 Stand-alone board support package ……………………………………. 13
1.5 uClinux ………………………………………………….. 14

CHAPTER 2 METHODOLOGY AND WORK DONE …………………………… 16
2.1 Designing a two-processor system …………………………………. 16
2.2 Matrix addition on the system ………………………………………… 18
2.3 Two-MicroBlaze system with DDR RAM …………………………….. 19
2.4 Building uClinux …………………………………………………. 22
2.5 Creating a new application for Ethernet packet handling ………………. 23
2.6 MicroBlaze shared-memory multiprocessing ………………………….... 24
2.7 Modified two-MicroBlaze system …………………………………… 26
2.8 Heterogeneous multiprocessor system with PowerPC ……………….. 31

CHAPTER 3 EXPERIMENTS ON THE DEVELOPED SYSTEM ……………… 32
3.1 Running uClinux …………………………………………………… 32
3.2 Speed-up obtained with the decryption algorithm ………………………… 33

CHAPTER 4 PARALLELIZATION OF THE SMITH-WATERMAN ALGORITHM …. 35
4.1 Local alignment: the Smith-Waterman algorithm …………………………. 35
4.2 Parallelism in Smith-Waterman on a multiprocessor system ………… 36
4.3 Speed-up obtained with the Smith-Waterman algorithm …………………….. 42

CHAPTER 5 DISCUSSION …………………………………………………………… 44

CHAPTER 6 CONCLUSION AND FUTURE SCOPE ……………………………… 45

REFERENCES …………………………………………………………………………. 46

GLOSSARY …………………………………………………………………………….. 48

ABSTRACT

In embedded systems, multiple processors can be used for performance enhancement. Moreover, on FPGAs the cost and risk involved in developing such a system using soft-core processors are much lower. The goal of the project was to build a multiprocessor system that can be used for many embedded applications with enhanced performance.

We started with a single-processor system using Xilinx's soft-core processor MicroBlaze, then developed a two-processor system and sorted out certain resource-conflict issues. Until then the two processors were supported by independent stand-alone OSes with limited capability and functionality. On this system we tried out a matrix addition application. The data was transferred to the processors through a shared BRAM (Block RAM) sitting on the OPB.

To get more functionality at the OS level we chose uClinux as our OS. The OS was built for our system using the available tool chain. For the OS to reside in, we added DDR RAM to the system. The system now also has an Ethernet card and can handle network packets.

Finally we obtained both a homogeneous and a heterogeneous system. To evaluate the performance of the system we chose two applications: decryption and Smith-Waterman local alignment.

CHAPTER 1 INTRODUCTION

During the past decade, there has been a dramatic increase in the number of applications
within the commercial, medical and military market requiring very high input/output
bandwidth and real-time processing power. Powerful embedded systems, also offering
flexible configurability and cost-effective features, are needed to support particular
requirements of such applications.

The predominant method of providing a solution to this is the multiprocessor system. This is due to several reasons: the possibility of using the best processing element for a particular function; the possibility of using off-the-shelf components; the possibility of allocating tasks with different timing characterizations separately (periodic and sporadic tasks, tasks with hard and soft real-time constraints) and thus using the most appropriate local scheduling policy; the possibility of minimizing communication by allocating cooperating tasks to the same subsystem; and the possibility of placing processing elements close to related sensors so that distributed data can be managed in a distributed manner.

A system where all the processors are identical and treated alike is called a symmetric multiprocessor (SMP) system. Here an attempt has been made to develop an SMP system on an FPGA.

An FPGA (Field Programmable Gate Array) is a programmable logic device. "Field programmable" means that the FPGA's function is defined by a user's program rather than by the manufacturer of the device. A typical integrated circuit performs a particular function defined at the time of manufacture. In contrast, the FPGA's function is defined by a program written by someone other than the device manufacturer. Depending on the particular device, the program is either 'burned in' permanently or semi-permanently as part of a board assembly process, or is loaded from an external memory each time the device is powered up. This user programmability gives access to complex integrated designs without the high engineering costs associated with application-specific integrated circuits.

The CLB (Configurable Logic Block) is the basic logic unit in an FPGA. Exact numbers and features vary from device to device, but every CLB consists of a configurable switch matrix with 4 or 6 inputs, some selection circuitry (MUXes, etc.), and flip-flops.

While the CLB provides the logic capability, flexible interconnect routing carries signals between CLBs and to and from the I/Os. Routing comes in several flavors: routing designed to interconnect between CLBs, fast horizontal and vertical long lines spanning the device, and global low-skew routing for clocking and other global signals. The design software hides the interconnect routing task from the user unless specified otherwise, significantly reducing design complexity.

We have used an FPGA board from Xilinx. The device is an XC2VP30, package ff896, speed grade -7. Xilinx also ships the EDK software, which is used for functional specification, synthesis, place and route, and finally downloading and debugging the system.

1.1 Background:

Today many embedded products spread their solution over several chips, making it bigger, more expensive and more power-hungry. The SoC solution has long existed on Application Specific Integrated Circuit (ASIC) boards but is rather new on FPGAs. FPGAs have become bigger, faster and cheaper, and are now able to handle a SoC solution, including a soft processor, which is an Intellectual Property (IP) core implemented using logic primitives. A key benefit is configurability: it is possible to add only what is needed in the design. A trade-off is performance: a hard processor is faster but less configurable and more expensive [1]. More and more companies are therefore looking into the possibility of using SoC on an FPGA with a soft processor, which makes it easier to develop and evaluate the solution.

1.2 Microblaze:

The soft-core processor used for this project is MicroBlaze [2]. The MicroBlaze embedded processor soft core is a reduced instruction set computer (RISC) with a five-stage pipeline, optimized for implementation in Xilinx field programmable gate arrays (FPGAs). Figure 1.2.1 shows a functional block diagram of the MicroBlaze core.

Many aspects of the MicroBlaze can be configured at compile time owing to the
configurable nature of FPGAs. Cache structure, peripherals, and interfaces can be
customized to the application. In addition, hardware support for certain operations, such
as multiplication, division, and floating-point arithmetic, can be added or removed.

Figure 1.2.1 MicroBlaze core block diagram

MicroBlaze does not have a memory management unit. It can run at speeds of up to 150 MHz.
The processor's fixed feature set includes:

• Thirty-two 32-bit general purpose registers
• 32-bit instruction word with three operands and two addressing modes.
• 32-bit address bus.
• Single issue pipeline.

The list below consists of some additional features that can be added to the MicroBlaze
[12].

• Hardware barrel shifter - a digital circuit that can shift data any number of bits in one operation; a vital component in floating-point operations
• Hardware divider
• Instruction and data cache
• On-chip peripheral bus (OPB)
• Local memory bus (LMB)
• Fast Simplex Link (FSL)
• Xilinx CacheLink

1.2.1 Registers:

MicroBlaze provides two kinds of registers, general-purpose registers and special


purpose registers [2].

• Volatile registers (caller-save) are temporary registers and do not retain their values across function calls. The volatile registers are R3-R12: R3 and R4 are used for returning values to the caller function, and R5-R12 are used to pass parameters.
• Non-volatile registers (callee-save) keep their values across function calls. The non-volatile registers are R19-R31.
• Dedicated registers are the remaining registers. R14-R17 are used to store return addresses from interrupts, sub-routines, traps and exceptions; R0 always reads as 0, and R1 holds the stack pointer. These registers should not be used for anything else.

1.2.2 Bus Interface:

MicroBlaze has several bus interfaces used for different purposes. It follows the Harvard architecture, where separate paths are used for data and instruction accesses. An advantage of the Harvard architecture is that instructions and data can be read from memory at the same time [2].
• On-chip Peripheral Bus
The OPB is a fully synchronous bus that provides access to both on-chip and off-chip peripherals. The bus is not intended to connect directly to the processor [3].
• Local Memory Bus
The LMB is a fast local bus used to connect MicroBlaze to high-speed peripherals, mainly Block RAM (BRAM). The LMB makes it possible to access BRAM in one clock cycle [12].
• Fast Simplex Link Bus
FSL is a one-way point-to-point communication bus used between an output FIFO device and an input FIFO device. It supports up to eight master and slave interfaces, and data can be transferred in two clock cycles [12].
• Xilinx CacheLink
The Xilinx CacheLink (XCL) interface is a high-speed bus for external memory communication and is only available when the caches are enabled. XCL can be combined with the OPB, where one cache uses XCL and the other uses the OPB bus. Memory located outside the cacheable area is accessed through the OPB or LMB [12].
• Debug Interface
The debug interface is used with the Microprocessor Debug Module (MDM) and is controlled through the JTAG port by the Xilinx Microprocessor Debugger (XMD) [12].

An interesting comparison of the synthesizable processors MicroBlaze, LEON2 and OpenRISC 2000 is presented in a master's thesis from Chalmers University [3]. It compares performance, configurability and usability. The MicroBlaze version used is 2.10.a; it performs well in the benchmarks, but it was found not to follow the floating-point standard.

1.3 PowerPC 405:

The PowerPC 405 [17] is a 32-bit implementation of the PowerPC embedded-environment architecture, which was derived from the PowerPC architecture and tailored to the requirements of embedded system development. The original PowerPC architecture defines 64-bit processors with a 32-bit subset; the 405 is a 32-bit processor.
Some other features of this processor are as follows:

1. Memory management optimized for embedded software environments.


2. Cache-management instructions for optimizing performance and memory control in complex applications, which are graphically and numerically intensive.
3. A device-control-register address space for managing on-chip peripherals such
as memory controllers.
4. A dual-level interrupt structure and interrupt-control instructions.
5. Multiple timer resources.

1.3.1 PowerPC 405 Hardware Organization:

1. Central Processing unit: The PowerPC 405 central-processing unit (CPU) implements
a 5-stage instruction pipeline consisting of fetch, decode, execute, write-back, and load
write-back stages. The fetch and decode logic sends a steady flow of instructions to the
execute unit. All instructions are decoded before they are forwarded to the execute unit.
Instructions are queued in the fetch queue if execution stalls. Up to two branches are
processed simultaneously by the fetch and decode logic. If a branch cannot be resolved
prior to execution, the fetch and decode logic predicts how that branch is resolved.

2. Memory Management Unit: The PowerPC 405 supports 4 GB of flat (non-segmented) address space. The memory management unit (MMU) provides address translation, protection functions, and storage attribute control for this address space. The MMU supports demand-paged virtual memory using multiple page sizes of 1 KB, 4 KB, 16 KB, 64 KB, 256 KB, 1 MB, 4 MB and 16 MB. Multiple page sizes can improve memory efficiency and minimize the number of TLB misses.

Figure 1.3.1.1 Block diagram of the PowerPC 405 [17]

1.4 Stand-alone board support package:

Initially the MicroBlazes were tested with the stand-alone BSP [4]. In the later systems, too, all processors other than the MicroBlaze running uClinux run on the stand-alone BSP. The stand-alone BSP provides certain standard APIs, which applications written in C may use. It is designed for use when an application accesses board or processor features directly (without an intervening OS layer).

1.5 uClinux:

uClinux (pronounced "you-see-linux") is a port of regular Linux intended for microprocessors that do not have a Memory Management Unit (MMU). It is a soft real-time OS [10]. In short, an MMU translates logical addresses into physical addresses. All requests for data are sent to the MMU, which decides whether the data is in RAM or needs to be fetched from disk; it also decides whether the process has the right to access the memory it is trying to reach [25]. Without an MMU there is no memory protection or virtual memory, leaving the programmer with a greater responsibility not to write over other processes' memory. The most noticeable effect for the programmer is that vfork() is used instead of fork(); both create a child process that differs from the parent only in its PID [9]. uClinux has been ported to many processor architectures; Motorola's ColdFire and DragonBall, Blackfin, ARM7TDMI and MicroBlaze are the most used. It exists as a derivative of Linux kernels 2.0, 2.4 and 2.6, but it is the 2.4 derivative that has been ported to the largest number of microprocessors [13].

On VM (virtual memory) Linux, whenever an application tries to write off the top of the stack, an exception is flagged and more memory is mapped in at the top of the stack to allow it to grow. Under uClinux no such luxury is available, as the stack must be allocated at compile time, so a stack overflow can sometimes cause a crash. Also, instead of a dynamic heap, uClinux uses a global memory pool that is basically the kernel's free memory pool.

An ISR (interrupt service routine) and a kernel task share a common counting semaphore. Therefore, if a kernel task is holding the semaphore, an ISR may have to wait a long time before it gets executed. Secondly, kernel tasks are non-preemptive, so even a high-priority user application may have to wait. This makes the response time of uClinux longer, so it is not a complete hard real-time OS.

However, a critical issue is that we cannot use a single uClinux image for all the MicroBlazes. This is because the system does not have provision for a single interrupt controller shared by all of them. Secondly, there is always a cache coherency problem, for which we do not have a hardware solution.

CHAPTER 2 METHODOLOGY AND WORK DONE

2.1 Designing a two-processor system:

Initially a system was built with a single processor (MicroBlaze). The MicroBlaze is instantiated with its local BRAM support. Since MicroBlaze uses a Harvard architecture, the BRAM is connected to it through both an instruction BRAM controller and a data BRAM controller [5]. The MicroBlaze sits on the OPB (On-chip Peripheral Bus).

Once we had this system, the next step was to add the other processor. The second processor was added using the wizard, although the same result could have been obtained by editing the MHS file. The new processor must first be provided with its own BRAM, where the data and instructions local to that processor will reside; this in turn requires two more controllers for that BRAM. A block diagram of the resulting system is shown in Figure 2.1.1.

Figure 2.1.1 Basic two Microblaze system


The BRAM sitting on the OPB acts as a shared memory for the two processors. Any address on the OPB is visible to both MicroBlazes. MicroBlaze uses memory-mapped I/O, so both of them can access any resource whatsoever on the OPB bus.

But this ability causes another problem: resource conflict. In every cycle the bus arbiter grants the bus to the MicroBlazes in round-robin fashion, so it may happen that while one MicroBlaze is transferring data, say to the UART, the other gets hold of the bus and starts transmitting too. This can be solved by the bus-parking facility of the bus, so that until one MicroBlaze finishes, the other does not get a chance.

However, this scheme has a flaw. Suppose there are multiple resources on the OPB bus and the two processors are using different resources; if we park the bus, one of the two still has to wait. This is removed by introducing a custom IP.

The sync IP solves this problem. Whenever a process on a MicroBlaze wants the bus, it registers its processor ID and process ID in a software-addressable register for a particular resource, i.e. it locks the resource. Later on, the processor can release the lock. With this approach, however, there is a chance of deadlock. In the above system both MicroBlazes have BSP support but no operating system.

After the hardware platform design is complete, one can generate an FPGA configuration bitstream. We use XPS to build the bitstream and the netlist. At this point we have only a "hardware bitstream", which is not ready to be applied to an FPGA until the software component for the embedded system is included.

After the embedded software development is complete, we can choose one of the
following ways to run it on the hardware:

1. If the application executable resides in on-chip memory regions, it is possible to merge the Executable and Linkable Format (ELF) file into the hardware bitstream so that it gets loaded into on-chip memory, ready to execute, every time the FPGA is configured.
2. During prototyping, XPS can dynamically download the executable to the board via the JTAG cable connected to the FPGA. In this case, we select a bootloop to be merged into the bitstream to initialize on-chip memory so that the processor remains in a static state until software downloading is completed.
3. For production systems, one can store executables residing in off-chip memory regions in a non-volatile memory device, such as a flash Programmable ROM (PROM), or along with the configuration bitstream in a System ACE™ device. In this case, one would configure a bootloader executable to be merged into the hardware bitstream to initialize on-chip memory. Then, each time the FPGA is configured or reset, the bootloader copies the application executable to a suitable (volatile) memory device and starts it running.

We have used JTAG for downloading the ELF file along with the bitstream to the board. The standard output for the system is the UART.

2.2 Matrix addition on the system:

The system has been tested with matrix addition and works as follows. The first MicroBlaze has the data on which the matrix addition is to be performed and writes it to the shared BRAM.

Until the writing of the data is complete, the other MicroBlaze waits, polling a flag bit that signals that the data has been written completely.

Then MicroBlaze 2 starts reading the data; once done, it computes the sum and writes the result back to a different location in the BRAM. The first MicroBlaze then picks the result up and displays it on the UART (the hyperterminal).

Figure 2.2.1 Schematic representation of Matrix addition

2.3 Two-MicroBlaze system with DDR RAM:

The above-described system has been extended by adding a DDR SDRAM. This RAM is required to accommodate the OS (uClinux) and is also attached to the OPB bus. The caches for MicroBlaze 1 have now been instantiated.

2.3.1 Cache

The cache consists of an instruction cache and a data cache, each controlled by a bit in the MSR [12]. They are 1-way associative (direct-mapped): each block (a collection of data containing the requested word) can be placed in only one location in the cache [6]. The memory can be divided into cacheable and non-cacheable segments, making it possible to specify exactly what to cache. The only address space that cannot be cached is the LMB address space. The data cache uses write-through: the cache is mirrored in main memory by writing to memory on each cache write [6]. The cache can be used with either the OPB interface, the dedicated XCL interface, or a combination. The differences between these interfaces are [12]:
• CacheLink uses a 4-word cache block (critical word first). It fetches the requested word and the next three, which increases the hit rate; OPB uses a single-word cache block.
• CacheLink uses a dedicated bus interface for memory accesses. This reduces traffic on the OPB bus.
• The CacheLink interface requires a specialized memory controller interface. The OPB interface uses standard OPB memory controllers.
• CacheLink allows posted write accesses on write misses. OPB caches require the write access to complete before execution resumes (data cache only).

2.3.2 XMD debug module:

To be able to debug the MicroBlazes and their programs, we have to connect the second MicroBlaze to the XMD debug module [8] as well. This is done by connecting the following ports of the XMD module to the MicroBlaze:

PORT DBG_CAPTURE_0 = DBG_CAPTURE_s
PORT DBG_CLK_0 = DBG_CLK_s
PORT DBG_REG_EN_0 = DBG_REG_EN_s
PORT DBG_TDI_0 = DBG_TDI_s
PORT DBG_TDO_0 = DBG_TDO_s
PORT DBG_UPDATE_0 = DBG_UPDATE_s

The following figure shows the functional diagram of the XMD module.

Figure 2.3.2.1 Block diagram of XMD debugger

2.3.3 Ethernet Card:

An Ethernet card has also been included in the new design to support network applications on the system. It has the following features:

• CSMA/CD compliant operation at 10 Mbps and 100 Mbps in half duplex mode

• Programmable PHY reset signal

• Internal loop-back capability

• Optional support of jumbo frames up to 9K bytes in length

• Supports unicast, multicast, and broadcast transmit and receive modes as well as promiscuous and 64-entry Content Addressable Memory (CAM) based receive modes

• Automatic source address field insertion, overwrite, or pass-through for transmission

The EMAC [10] Interface design is a soft intellectual property (IP) core designed for
implementation in several Xilinx FPGAs. It supports the IEEE Std. 802.3 Media
Independent Interface (MII) to industry standard Physical Layer (PHY) devices and
communicates to a processor via an IBM On-Chip Peripheral Bus (OPB) interface. The
design provides a 10 Megabits per second (Mbps) and 100 Mbps (also known as Fast
Ethernet) EMAC Interface. This design includes many of the functions and the flexibility
found in dedicated Ethernet controller devices currently on the market.

2.4 Building uClinux:

In this step, the uClinux auto-configure mechanism is used to map/export the EDK
hardware design to the uClinux kernel build mechanism. This is done by configuring and
generating the BSP. The library generator uses the configuration information and the
hardware design database to export an auto-config.in file, which contains all the
information about the hardware design. This auto-config.in file is in the uClinux
configuration file format. The uClinux sources, makefiles, and other scripts are built with
conditional code to generate the correct software based on the hardware described in the
auto-config.in file. This flow effectively means that retargeting uClinux for any hardware
setup can be done quickly. To retarget uClinux, regenerate the BSP for the new hardware
setup. By rebuilding uClinux with the newly generated auto-config.in file, an image
targeting the new hardware setup is compiled by uClinux tools. The tools and related
sources can be obtained from PetaLogix [15]. To generate the uClinux image we have to use a Linux system. Once the auto-config.in file is placed in the proper place in the uClinux distribution, we can use the following commands in the tool chain to generate the image.

1. Go to the uClinux distribution directory and run:
cd home/devel/uclinux/src/uClinux-dist
make clean
make xconfig
2. Set the vendor option to Xilinx.
3. Choose the OS as uClinux (auto).
4. Select the other libraries.

Once the kernel image is ready, we take it back to the host system. From the host system we download the image to the RAM using the hardware debug module, with the following commands:
cd binaries
dow -data tmicro.bin 0x30000000
mwr 0x100 0
rwr 5 0x100
rwr pc 0x30000000
con
That is, we load the kernel image tmicro.bin to memory location 0x30000000, then set the program counter to this address and ask the processor to start.

2.5 Creating a new application for Ethernet packet handling:

Applications [14] can be developed separately or as a software application project in XPS; they can also be created in the Software Development Kit (SDK) by importing the XPS design. All applications need to link to the Xilkernel library in order to get the kernel functionality [16]. The kernel and its applications can then be downloaded to the target within the EDK and SDK.

The applications in Xilkernel can be run in two ways. One method is to build the
applications into the kernel, having only one executable file. The other method is to have
each application as a separate executable file, the same method that is used in a conventional operating system, where the kernel is a separate image file and the applications are separate files.

Here we first built the application by putting it into the source path and updating the makefile. Later on, however, we built only the specific application. To download the application we connected the host machine to the MicroBlaze system over the network and ran an FTP session to the system, downloading the application to the directory /var/tmp; we then changed the mode of the file and executed it. This approach let us avoid the tedious job of rebuilding the OS image every time.

The application accepts packets over the network using the TCP/IP protocol. The payload is then copied to a particular location in DDR RAM. The other MicroBlaze picks the data up from that location and performs the required operation in parallel.

2.6 MicroBlaze shared-memory multiprocessing:

The following are possible ways of creating a shared-memory multiprocessor system (taken from John Williams' report). We evaluate the pros and cons and finally choose a custom design for our implementation.

2.6.1 Implicit Multiprocessing:

Using this mechanism, parallelism is hidden by the OS and the hardware. A single copy of the operating system runs and controls all the processors in the system. The following figure shows an implementation of an N-processor SMP (Symmetric Multiprocessor). However, the MicroBlaze soft-core processor does not have hardware support for cache coherency, and a software implementation of cache coherency has a severe performance impact. The MicroBlaze core also lacks distributed interrupt management.

Figure 2.6.1.1 Implicit Multiprocessing

2.6.2 Explicit Multiprocessing:

Another way of achieving multiprocessing (given the limitations mentioned in the last section) is to run every MicroBlaze with its own copy of the OS. In such a system, N CPUs sit on a shared bus, each with a private address zone within the shared physical memory, plus a common shared-memory region used with IPC protocols. Since every MicroBlaze has a separate copy of the OS, this leads to a lot of wasted memory.

Figure 2.6.2.1 Explicit Multiprocessing

2.7 Modified two Microblaze system:

While working with the previous system we came across a problem with booting the OS. Since all the other peripherals as well as the DDR RAM sat on the same OPB, there could be bus contention. MB0, which is responsible for booting and running the OS and acts as the master of the system, was not always able to hold the bus for the whole period required to boot. A bug has also been reported related to this issue: it is difficult to have multiple MicroBlazes on the same bus while one of them uses the DDR RAM for the OS. So we overcame the issue by adding a PLB (Processor Local Bus) to the system, on which the DDR RAM loaded with the kernel image resides. The master MicroBlaze sits on one OPB and the rest of the MicroBlazes on another OPB. This system can also be scaled up further by attaching more buses and more MicroBlazes to them.

Figure 2.7.1 Multiprocessor system with DDR RAM on PLB

In order to achieve parallelism with minimum memory overhead and bus-scaling issues, we have developed a system with a master-slave relationship. The master MicroBlaze is the only MicroBlaze in the system running the uClinux OS. The remaining MicroBlaze processors act as slaves and use the stand-alone BSP. The MicroBlaze running uClinux is responsible for providing all the benefits of an advanced OS to the complete system and for various synchronization activities. In the application explored in the project, the master processor collects data over the network and makes it available to the other processors. The figure above shows the architecture of the system.

Bridges (OPB2PLB and PLB2OPB) allow the MicroBlazes to access the DDR RAM and other peripherals on different buses, which, though not on their own OPB, are mapped into their address space.

2.7.1 Processor Local Bus (PLB):

The Xilinx 64-bit Processor Local Bus (PLB) consists of a bus control unit, a watchdog
timer, and separate address, write, and read data path units with a three-cycle-only
arbitration feature.
The transfer types specific to the PLB arbitration logic are:
1. Single Read Transfer Bus Time-Out,
2. Single Write Transfer Bus Time-Out,
3. Line Read Transfer Bus Time-Out,
4. Line Write Transfer Bus Time-Out,
5. Burst Read Transfer Bus Time-Out,
6. Burst Write Transfer Bus Time-Out,
7. Pipelined Read Transfer,
8. Pipelined Write Transfer

Arbitration Priority: The Xilinx PLB implements fixed priority when two or more
masters have the same priority inputs. The priority order in this case is Master 0,
Master 1, Master 2, ..., Master N.
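As an illustration of fixed-priority arbitration, the selection rule can be sketched as follows; this is a small behavioural model written for this discussion, not the actual IBM arbiter logic:

```c
#include <assert.h>

/* Illustrative model of fixed-priority arbitration: among the masters
 * currently requesting the bus (bit i set = Master i requesting), the
 * lowest-numbered master wins. Not the actual IBM arbiter RTL. */
int plb_arbitrate(unsigned request_bits)
{
    for (int i = 0; i < 32; i++)
        if (request_bits & (1u << i))
            return i;   /* Master i has the highest fixed priority */
    return -1;          /* no master is requesting */
}
```

With Masters 1 and 2 requesting simultaneously, Master 1 wins; Master 0 would win over both.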

Clock and Power Management: The IBM PLB Arbiter Core supports clock and power
management by gating clocks to all internal registers and providing a sleep request
signal to a central clock and power management unit in the system. This sleep request
signal is asserted by the IBM PLB Arbiter to indicate when it is permissible to shut
off clocks to the arbiter.

2.7.2 OPB TO PLB (OPB2PLB) Bridge:

The On-Chip Peripheral Bus (OPB) to Processor Local Bus (PLB) Bridge translates OPB
transactions into PLB transactions. It functions as a slave on the OPB side and a
master on the PLB side. Since the MicroBlazes are on the OPB and the shared memory (the
DDRRAM) is on the PLB, this OPB2PLB bridge is required for them to communicate. The
bridge performs both the address translation and the bus protocol translation for the
MicroBlazes.

High Level Description: The following figure shows a schematic diagram of the bridge.

Figure 2.7.2.1 Schematic diagram of the OPB2PLB Bridge

The OPB interface is designed with a pipelined architecture to improve timing and to
support high clock frequencies. Input and output signals to the OPB are driven through
flip-flops for better timing. Pipelining introduces some additional latency into the
design, since some signals are delayed through registers; however, it balances
transaction latency against higher clock frequencies.

Address resolution and deadlock prevention:

The bridge decoder ensures that the same location is not addressed both by a master on
the PLB and by the bridge itself. To avoid such a situation, certain user-changeable
parameters are provided.

The bridge provides four channels to address four contiguous memory regions on the PLB
side from the master on the OPB side. Each channel has one C_RNG0_BASEADDR and one
C_RNG0_HIGHADDR parameter, which define the address range of that channel that the
bridge will access on the PLB. The PLB base addresses are those obtained after
translating the channel address used on the OPB.
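The range decode can be sketched as below. The structure and the example address window are assumptions chosen for illustration; the real bridge implements this check in hardware, configured through the channel base/high address parameters named above.

```c
/* Sketch of the bridge's range decode: an OPB address is forwarded to
 * the PLB only if it falls inside one of the four configured channels.
 * The ranges mirror the C_RNGx_BASEADDR/C_RNGx_HIGHADDR parameters
 * mentioned in the text; the values below are hypothetical. */
typedef struct { unsigned base, high; } addr_range_t;

static const addr_range_t channels[4] = {
    { 0x30000000u, 0x3FFFFFFFu },   /* e.g. DDRRAM window (hypothetical) */
    { 0x00000000u, 0x00000000u },   /* unused channels: high == base     */
    { 0x00000000u, 0x00000000u },
    { 0x00000000u, 0x00000000u },
};

/* Returns the channel index that claims the address, or -1 if the
 * address is not mapped through the bridge. */
int opb2plb_decode(unsigned addr)
{
    for (int i = 0; i < 4; i++)
        if (channels[i].high > channels[i].base &&
            addr >= channels[i].base && addr <= channels[i].high)
            return i;
    return -1;
}
```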

2.7.3 PLB TO OPB (PLB2OPB) Bridge:

This bridge acts as a master on the OPB and as a slave on the PLB. Its user-changeable
parameters are similar to those of the OPB2PLB Bridge. The bridge translates
transactions on the PLB into OPB transactions.

2.7.4 System View:

Figure 2.7.4.1 System View

2.8 Heterogeneous multiprocessor system with Power PC:

Figure 2.8.1 Heterogeneous multiprocessor system

The previously described homogeneous multiprocessor system has further been modified
into a heterogeneous system by adding a PowerPC to it. The PowerPC now sits on the PLB.
We have tested the system with some simple programs, which confirmed that all the
hardware units are working.

CHAPTER 3 EXPERIMENTS ON THE DEVELOPED SYSTEM

3.1 Running uClinux:

Once the hardware system had been built, it was time to experiment with it. We started
with the goal of running the compiled kernel image discussed in section 2.4 on the
modified two-MicroBlaze system discussed in section 2.7. The MicroBlaze running uClinux
gives the flexibility of an operating system, whereas the standalone MicroBlaze in the
same system provides raw computational power.
We obtained the following output, the uClinux boot sequence, on the HyperTerminal:

Linux version 2.4.32-uc0 (jca052428@vindhyachal) (gcc version 3.4.1 ( Xilinx EDK 8.1 Build EDK_I.17 090206 )) #1 Sun Nov 26 15:54:08 IST 2006
On node 0 totalpages: 65536
zone(0): 65536 pages.
zone(1): 0 pages.
zone(2): 0 pages.
CPU: MICROBLAZE
Kernel command line:
Console: xmbserial on UARTLite
Calibrating delay loop... 49.86 BogoMIPS
Memory: 256MB = 256MB total
POSIX conformance testing by UNIFIX
xgpio #1 at 0x40020000 mapped to 0x40020000
xgpio #2 at 0x40040000 mapped to 0x40040000
Xilinx GPIO registered
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
eth0: using sgDMA mode.
eth0: Xilinx EMAC #0 at 0x40C00000 mapped to 0x40C00000, irq=1
eth0: id 2.0l; block id 11, type 1
uclinux[mtd]: RAM probe address=0x3016209c size=0xd8000
uclinux[mtd]: root filesystem index=0
NET4: Linux TCP/IP 1.0 for NET4.0

IP Protocols: ICMP, UDP, TCP
IP: routing cache hash table of 2048 buckets, 16Kbytes
TCP: Hash tables configured (established 16384 bind 32768)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
VFS: Mounted root (romfs filesystem) readonly.
Freeing init memory: 52K
flatfsd: Nonexistent or bad flatfs (-114), creating new one...
/bin/flatfsd: mtd.c: 156: flat_dev_close: Assertion `flatinfo.fd != -1' failed.
flatfsd: mtd.c: 156: flat_dev_close: Assertion `flatinfo.fd != -1' failed.
Abort
Setting hostname:
Setting up interface lo:
Starting DHCP client:
Starting thttpd:
eth0: Link carrier lost.

uclinux-auto login:

3.2 Speed up obtained with Decryption algorithm:

An application using the DES cryptography algorithm was used in our setup to measure
the gains of the architecture described above. The application accepting network
packets as described in section 2.5 was executed on the MicroBlaze running uClinux. The
standalone MicroBlazes ran the DES algorithm, encrypting portions of data placed in RAM
by the master MicroBlaze. The following run-times were achieved using varying numbers
of MicroBlaze processors:

Figure 3.2.1 Time taken by different numbers of processors


Figure 3.2.2 Speed up obtained with different number of processors

CHAPTER 4 PARALLELIZATION OF SMITH WATERMAN ALGORITHM

4.1 Local Alignment: the Smith-Waterman algorithm:

The most important fact of biological sequence analysis is that in biomolecular
sequences (DNA, RNA or amino acid sequences), high sequence similarity usually implies
significant functional or structural similarity.

In many applications two strings may not be highly similar in their entirety but may
contain regions that are highly similar. The task is to find and extract a pair of
regions, one from each of the two given strings, that exhibit high similarity. This is
called the local alignment or local similarity problem, and the Smith-Waterman
algorithm is one approach to it. The Smith-Waterman algorithm is a database search
algorithm developed by T.F. Smith and M.S. Waterman, based on the earlier model of
Needleman and Wunsch. It uses dynamic programming to find the best local alignment
between any two given sequences. Based on certain criteria, usually a scoring matrix,
scores and weights are assigned to each character-to-character comparison: positive for
exact matches, and usually negative for substitutions and insertions/deletions. The
exact scores are based on the scoring matrix. The scores are then added together and
the highest-scoring alignment is reported.
The generic Smith-Waterman algorithm [18] is given below:
1 Declare an n×n similarity matrix F
2 Initialize the top row (i=0) and left column (j=0) with 0
3 for i = 1; i < length(sequence); i++ do
4   for j = 1; j < length(sequence); j++ do
5     F(i,j) = max(0, F(i-1,j-1) + s(xi,yj), F(i-1,j) - d, F(i,j-1) - d)
6     Save the index of the term that contributed to the calculated value in F(i,j)
7   end for
8 end for
9 Find the maximum value in the n×n matrix
10 Using the indices saved in step 6, trace back to the first 0 encountered

where F(i, j) stands for the value of the optimal local suffix alignment for the given
index pair (i, j).
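A minimal C sketch of the fill phase of this recurrence, using the costs of section 4.2 (gap = -2, match = 2, mismatch = -1); the traceback bookkeeping of step 6 is omitted:

```c
#include <string.h>

#define N        16    /* maximum sequence length for this sketch */
#define GAP      (-2)
#define MATCH      2
#define MISMATCH (-1)

static int max4(int a, int b, int c, int d)
{
    int m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

/* Fills the similarity matrix F for sequences x (columns) and y (rows)
 * and returns the maximum cell value, i.e. the score of the best local
 * alignment. */
int smith_waterman(const char *x, const char *y, int F[N + 1][N + 1])
{
    int lx = (int)strlen(x), ly = (int)strlen(y), best = 0;

    memset(F, 0, sizeof(int) * (N + 1) * (N + 1)); /* row i=0, column j=0 stay 0 */
    for (int i = 1; i <= ly; i++)
        for (int j = 1; j <= lx; j++) {
            int s = (y[i - 1] == x[j - 1]) ? MATCH : MISMATCH;
            F[i][j] = max4(0,
                           F[i - 1][j - 1] + s,   /* diagonal: match/mismatch */
                           F[i - 1][j] + GAP,     /* gap */
                           F[i][j - 1] + GAP);    /* gap */
            if (F[i][j] > best)
                best = F[i][j];
        }
    return best;
}
```

For the sequences of Table 4.2.1 (x = CAGCGTTG, y = AGGTAC) this returns 6, the maximum value in the table.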

The Smith-Waterman algorithm is similar to the Needleman-Wunsch algorithm and differs
primarily in that the traceback starts from the maximum value in the n×n matrix and
runs to the first 0 encountered, rather than from the lower-right corner to the
upper-left corner. This difference in the traceback procedure yields the best alignment
between subsequences of the original sequences rather than an optimal alignment between
the entire sequences.

The time complexity of the algorithm is O(n²). Here we have tried to divide the
computation among different processors to obtain a speed-up on the multiprocessor
system. To do that we have to parallelize the computation; the hot spot is the
calculation of the table, which is the target for parallelization.

4.2 Parallelism in Smith Waterman and multiple processors System:

Before approaching the problem of parallelization, let us first look at how the flow of
computation proceeds. The following table is an example of local alignment using the
Smith-Waterman algorithm.

        C  A  G  C  G  T  T  G
     0  0  0  0  0  0  0  0  0
  A  0  0  2  0  0  0  0  0  0
  G  0  0  0  4  2  2  0  0  2
  G  0  0  0  2  3  4  2  0  2
  T  0  0  0  0  1  2  6  4  2
  A  0  0  2  0  0  0  4  5  3
  C  0  2  0  1  2  0  2  3  4

Table 4.2.1 An example of local alignment

Here the costs assumed are:
Gap = -2
Match = 2
Mismatch = -1

The final optimal alignment is

A G C G T
A G _ G T

The flow of computation moves along the anti-diagonals: the diagonals are calculated
one by one. The following figure gives a pictorial representation for a pair of strings
of length 4.

Figure 4.2.1 The way in which the computation flows [19]

From the above diagram it is clear that if we compute the cells of each diagonal on
different processors, we can exploit this pattern of computation for a speed-up.
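The wavefront order sketched in the figure can be written down directly: all cells (i, j) with i + j = d lie on anti-diagonal d and depend only on diagonals d-1 and d-2, so the cells of one diagonal are mutually independent and can be split across processors. A small C sketch of this visiting order:

```c
/* Records the wavefront (anti-diagonal) visiting order of an n-by-n
 * table with 1-based indices: diagonal d holds the cells with
 * i + j == d. Returns the total number of cells visited. */
int diagonal_order(int n, int out_i[], int out_j[])
{
    int k = 0;
    for (int d = 2; d <= 2 * n; d++)       /* diagonals, in dependency order */
        for (int i = 1; i <= n; i++) {
            int j = d - i;
            if (j >= 1 && j <= n) {        /* cell (i, j) lies on diagonal d */
                out_i[k] = i;
                out_j[k] = j;
                k++;
            }
        }
    return k;
}
```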

4.2.1 The parallelization: a Locally Sequential, Globally Parallel (LSGP) approach:

When calculating new values on a fresh diagonal, the only data dependencies are on the
previously computed diagonals. The following figure shows how two processors can be
used for parallel computation.

Figure 4.2.1.1 an illustration of LSGP approach

The dotted lines represent the cells that will be computed by processor one and the
solid lines the cells that will be computed by the other processor, in essence the LSGP
approach. Up to the first two diagonals both processors compute all the cells. After
that the division of work begins. The first processor keeps track of the number of
cells to be computed by itself and by the other processor. Both of them compute the
corner points first and then calculate the other points. Processor one (P1) calculates
the right-most cells and writes the result back to the shared memory. Similarly, the
second processor (P2) does the same for the left corner point of its diagonal. Although
the data do not always need to be transmitted, doing so brings symmetry to the process,
so at the cost of some extra overhead the programming complexity is greatly reduced.
Along with the cell count, P1 also lets P2 know the index of the right-most point, so
that the latter can start working from that end. The indices are easy to calculate
because all the cells to be calculated lie on the same diagonal. This approach can be
extended to more processors, but with a higher number of processors the number of cells
calculated by each processor for each diagonal has to be incremented or decremented
following a specific pattern. Here we propose such a pattern, which we explain with a
three-processor example.

Incrementing pattern   Number of cells   Decrementing pattern   Number of cells
P1  P2  P3                               P1  P2  P3
 2   1   1             4                  5   5   5             15
 2   2   1             5                  5   5   4             14
 2   2   2             6                  5   4   4             13
 3   2   2             7                  4   4   4             12
 3   3   2             8                  5   4   3             11
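The incrementing pattern above is simply a balanced split of each diagonal's length among the processors, which can be sketched as follows. (The decrementing side of the table differs slightly in its last row, so this function is an illustration of the idea, not an exact reproduction of the table.)

```c
/* Balanced split of one anti-diagonal's cells among nproc processors:
 * each processor gets len/nproc cells and the first (len % nproc)
 * processors get one extra cell. This reproduces the incrementing
 * pattern tabulated above. */
void split_diagonal(int len, int nproc, int cells[])
{
    for (int p = 0; p < nproc; p++)
        cells[p] = len / nproc + (p < len % nproc ? 1 : 0);
}
```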

4.2.2 Pseudo code for the processors

We assume the function Smith_waterman(i,j) calculates the value of a cell when provided
with its index, and the function Trace_back() finally gives the required alignment. The
variables number_of_cells_for_P1 and number_of_cells_for_P2 keep track of the cells
each processor has to calculate for each diagonal. The two functions write_to_SM() and
Read_from_SM() allow the processors to write data to and read data from the shared
memory, respectively. The processors have data dependencies: each processor determines
which cell's data must be brought in from shared memory by looking at its local table.
If the required data are already available, it does nothing; otherwise it fetches the
data, stores them in its local memory first, and then proceeds with the other
computations.
Processor 1:
// the total table is of n×n size
Calculate the first two diagonals locally.
number_of_cells_for_P1 = 2;
number_of_cells_for_P2 = 1;
For (i = 2 to n-1)
    If (i mod 2 = 0)
    Then
    {
        number_of_cells_for_P1 = number_of_cells_for_P1 + 1;
    }
    If (i mod 3 = 0)
    Then
    {
        number_of_cells_for_P2 = number_of_cells_for_P2 + 1;
        write_to_SM(number_of_cells_for_P2);
    }
    row_index_for_P2 = i - number_of_cells_for_P1 - 1;
    column_index_for_P2 = i + number_of_cells_for_P1;
    write_to_SM(row_index_for_P2);
    write_to_SM(column_index_for_P2);
    iteration = 0;
    While (iteration NOT EQUAL TO number_of_cells_for_P1)
    {
        result = Smith_Waterman(row_index, column_index);
        write_to_SM(result);
        row_index = row_index + 1;
        column_index = column_index - 1;
        iteration = iteration + 1;
    }
// up to this part P1 calculates cells from the upper triangle of the table.
For (k = 0 to n-1)
{
    // decrement of number_of_cells_for_P2 and increment of number_of_cells_for_P1
    If (result required for calculation of the cell (n-1, (n-1) - number_of_cells_for_P1)
        not available)
        Read_from_SM(result);
    iteration = 0;
    While (iteration NOT EQUAL TO number_of_cells_for_P1)
    {
        result = Smith_Waterman(row_index, column_index);
        write_to_SM(result);
        row_index = row_index + 1;
        column_index = column_index - 1;
        iteration = iteration + 1;
    }
} // End of the pseudocode for Processor P1

Processor 2:
// the total table is of n×n size
Calculate the first two diagonals locally.
For (k = 0 to n-1)
{
    Read_from_SM(number_of_cells_for_P2);
    Read_from_SM(row_index_for_P2);
    Read_from_SM(column_index_for_P2);
    iteration = 0;
    While (iteration NOT EQUAL TO number_of_cells_for_P2)
    {
        If (result for all three corners not available)
            Read_from_SM(result);
        result = Smith_Waterman(row_index, column_index);
        If (iteration = 1)
        {
            write_to_SM(result);
        }
        row_index = row_index - 1;
        column_index = column_index + 1;
        iteration = iteration + 1;
    }
} // this part is for the upper half of the table.
For (k = 0 to n-1)
{
    Read_from_SM(number_of_cells_for_P2);
    Read_from_SM(row_index_for_P2);
    Read_from_SM(column_index_for_P2);
    iteration = 0;
    While (iteration NOT EQUAL TO number_of_cells_for_P2)
    {
        result = Smith_Waterman(row_index, column_index);
        write_to_SM(result);
        row_index = row_index + 1;
        column_index = column_index - 1;
        iteration = iteration + 1;
    }
} // code completed for the 2nd processor.

4.3 Speed up obtained with Smith-Waterman algorithm:

[Figure: Speed Up Comparisons — number of cycles versus string length (0 to 14) for the
one-, two- and three-processor systems (P1, P2, P3)]

Figure 4.3.1 Speed up in the case of the Smith-Waterman algorithm

The above graph was obtained when we implemented the parallelization of the
Smith-Waterman algorithm described in section 4.2.

CHAPTER 5 DISCUSSION

uClinux is an operating system for microcontrollers that do not have an MMU (memory
management unit). That is the reason we cannot run a single copy of the OS to control
all of the MicroBlazes. However, it is advantageous to have one of the processors run
it: that processor works as the master and can be made to handle many applications.

For the decryption application, the OS runs on the master MicroBlaze, which is
responsible for placing the data in RAM. The other MicroBlazes keep checking for the
availability of the data and, when it is found, proceed with the computation. But as
the size of the data increases, the traffic on the PLB increases. The master
MicroBlaze's OS is also in the DDRRAM, which is on the same PLB. As a consequence,
performance falls off as the data size increases.
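The handshake described above can be sketched as a flag-based producer/consumer protocol over shared memory. The layout, the flag values, and the XOR stand-in for the DES kernel are all hypothetical, chosen only to illustrate the polling that loads the PLB:

```c
#include <stdint.h>

/* Hedged sketch of the master/slave handshake over shared DDRRAM. On
 * the real system these volatile accesses would travel over the PLB. */
#define FLAG_EMPTY 0u
#define FLAG_READY 1u
#define FLAG_DONE  2u

typedef struct {
    volatile uint32_t flag;      /* handshake word polled by the slave  */
    uint32_t          len;       /* number of valid bytes in data[]     */
    uint8_t           data[256]; /* block placed in RAM by the master   */
} shared_block_t;

/* Stand-in for the DES kernel: XOR with a fixed byte (illustrative). */
void xor_process(uint8_t *d, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        d[i] ^= 0x55;
}

/* Slave side: spin until the master marks the block ready, process it,
 * then signal completion. */
void slave_loop_once(shared_block_t *blk, void (*process)(uint8_t *, uint32_t))
{
    while (blk->flag != FLAG_READY)
        ;                        /* busy-wait: this polling traffic is
                                    what loads the PLB as data grows    */
    process(blk->data, blk->len);
    blk->flag = FLAG_DONE;       /* tell the master the block is done   */
}
```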

The Smith-Waterman algorithm shows a speed-up with multiple processors; here we have
not used uClinux. However, this speed-up becomes visible only after a certain string
length is crossed. For example, the two-processor system shows a speed-up only after
string length 7, and the three-processor system shows a speed-up over the two-processor
system only after string length 10. This is because of the overhead incurred by data
transactions relative to the number of computations each processor performs: the API
(application program interface) for reading or writing data to or from shared memory
takes approximately 13 cycles, whereas the computation time for one single cell is
65 cycles.

CHAPTER 6 CONCLUSION AND FUTURE SCOPE

The results presented in the last section indicate that adding multiple processors, and
therefore more compute power, can provide substantial gains. The gain obtained varies
from application to application. To avoid bus contention delays and achieve better
results, there is also a need for scheduling transactions on the bus. The results also
indicate that one cannot expect linear speed-up by simply adding more processors in a
shared-memory architecture. For the decryption application chosen, running the system
with four MicroBlaze processors gave a 2.29x speed-up. Adding more MicroBlaze
processors yields little further gain because of the increasing number of bus
contention delays. However, if the data transfers can be made parallel with FSL (Fast
Simplex Link) links, the model can be scaled further to a larger number of processors.

For the Smith-Waterman algorithm the speed-up is around 1.7x for the three-processor
system with a string length of 12. We can expect better performance with strings of
greater length, since the computation increases while the data transactions per
diagonal remain the same.

Strings of very great length cannot be used here, as the local memory of the processors
is small. Using the local memory, and not the shared Block RAM, for all memory purposes
is one of the strategies for increasing the speed. So to use the given architecture,
pruning methods [19] should be applied to reduce the string lengths.

REFERENCES

[1] What is a soft processor?, Xilinx, 2006-01. http://www.xilinx.com/ipcenter/processor_central/microblaze/doc/mb_faq.pdf.

[2] MicroBlaze Processor Reference Guide Embedded Development Kit EDK 8.2i
(UG081 v6.0) June 1, 2006

[3] D. Mattson and M. Christensson. Evaluation of synthesizable CPU cores. Master's thesis, 2004.

[4] Standalone Board Support Package EDK 8.2i, June 23, 2006.

[5] LMB BRAM Interface Controller (v1.00b) DS452 February 22, 2006

[6] On-Chip Peripheral Bus V2.0 with OPB Arbiter (v1.10c), DS401, December 2, 2005.

[7] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, Third Edition. Morgan Kaufmann, May 2002.

[8] Microprocessor Debug Module (MDM) (v2.00a), DS450, February 22, 2006.

[9] Linux Operating System and Linux Distributions. http://linux.about.com.

[10] Linux for Real Time Requirements, SoftTech Solutions P Ltd. www.isofttech.com/downloads/Linux-RTOS.pdf.

[11] D. McCullough. uClinux for Linux programmers. Linux Journal, Volume 2004, Issue 123 (July 2004), page 7. ISSN 1075-3583. ACM Press, July 2004.

[12] MicroBlaze Processor Reference Guide, Xilinx. http://www.xilinx.com/ise/embedded/edk71docs/mb_ref_guide.pdf.

[13] MicroBlaze uClinux FAQ, John Williams, 2006-02-16. http://www.itee.uq.edu.au/~jwilliams/mblazeuclinux/Documentation/FAQ.html.

[14] D. Stepner, N. Rajan, and D. Hui. Embedded application design using a real-time OS. In Proceedings of the 36th ACM/IEEE Conference on Design Automation, pages 151-156. ACM Press, August 1999.

[15] J. Williams. What does PetaLogix mean for the MicroBlaze uClinux community, PetaLogix, 2006-02-16. http://www.petalogix.org/news_events/petalogix_announce.
[16] OS and Libraries Document Collection, Xilinx. http://www.xilinx.com/ise/embedded/edk71docs/oslibs_rm.pdf.

[17] PowerPC™ 405 Processor Block Reference Guide ,UG018 (v2.0) August 20, 2004

[18] A Parallel Implementation of the Smith-Waterman Algorithm for Massive Sequences Searching, Hsien-Yu Liao, Meng-Lai Yin, Yi Cheng, Electrical and Computer Engineering Department, California State Polytechnic University, Pomona. hsienyuliao@csupomona.edu, myin@csupomona.edu.

[19] Pruning algorithm to reduce the search space of the Smith-Waterman algorithm,
Farhan Ahmed Department of Electrical and Computer Engineering Lafayette College,
Easton, PA.

[20] A Parallel Implementation of Smith-Waterman Sequence Comparison Algorithm, Brian Hang Wai Yang, December 6, 2002.

GLOSSARY
FPGA Field Programmable Gate Arrays.
SMP Symmetric Multiprocessor System
CLB Configurable Logic Block
EDK Embedded Development Kit
IP Intellectual Property
SoC System on Chip
RISC Reduced Instruction Set Computer
OPB On-chip Peripheral Bus
LMB Local Memory Bus
FSL Fast Simplex Link
BRAM Block RAM
XCL Xilinx Cache Link
MDM Microprocessor Debug Module
MMU Memory Management Unit
TLB Translation Lookaside Buffer
BSP Board Support Package
ELF Executable and Linkable Format
MHS Microprocessor Hardware Specification
MSS Microprocessor Software Specification
OPB2PLB OPB to PLB bridge

