Accelerating Matrix Product on Reconfigurable Hardware for Image Processing

Accelerating matrix product on reconfigurable hardware for image processing applications
F. Bensaali, A. Amira and A. Bouridane

Abstract: Matrix multiplication is central to many applications, including image and signal processing. The suitability of reconfigurable hardware devices, in the form of field programmable gate arrays (FPGAs), is investigated as a low-cost solution for implementing two matrix multipliers, one for 3-D affine transformations and one for colour space conversion. The first solution, a parallel multiplier for the large matrix products that arise with large 3-D models, is also used to evaluate the performance of the Celoxica fixed-point library and Xilinx CoreGen. The second is a novel architecture for the efficient implementation of a colour space converter (CSC) based on distributed arithmetic (DA) principles. Both multipliers have been developed and implemented on the Celoxica RC1000-PP FPGA development board. Results show that the first, parallel FPGA-based multiplier can achieve the performance of a graphics card when performing 3-D affine transformations, while the second multiplier, which is fully pipelined and platform-independent, has a low latency (8 cycles) and is capable of a sustained data rate of over 234 mega-conversions per second.
1 Introduction
Matrix algorithms are commonly used in the areas of graph theory, numerical algorithms, digital control and signal processing. A close examination of these algorithms reveals that many of their fundamental actions involve matrix operations, such as matrix multiplication, which requires enormous computing power. Moreover, medical imaging, 3-D image manipulation, edge detection for object recognition and other applications involve large matrix multiplications [1]. Multiplying matrices of this size takes a great deal of computation time because the complexity is O(N^3), where N is the dimension of a square matrix. Because most current applications require higher computational throughput, many researchers have tried to improve the performance of matrix multiplication. Even with improvements such as Strassen's algorithm-based partitioning for sequential matrix multiplication [2], performance is limited. For this reason, parallel approaches, which have complexity O(N^3/p) when using p parallel processors, have been examined for decades [1].

As part of an ongoing research project to develop a hardware accelerator for image and signal processing algorithms based on matrix computations at Queen's University of Belfast [3–5], this paper proposes:
- A parallel matrix multiplier for 3-D affine transformations, together with an evaluation of the performance of Xilinx CoreGen [6] and the Celoxica fixed-point library [7] for the implementation.
- A novel architecture for colour space conversion based on matrix–vector multiplication using DA principles.

DA is a bit-level rearrangement of a multiply–accumulate operation that hides the multiplications. DA distributes arithmetic operations rather than grouping them as multipliers do. Conventional DA, called ROM-based DA, decomposes the variable input of the inner product to bit level in order to generate precomputed data. ROM-based DA uses a ROM table to store the precomputed data, which makes it regular and efficient in its use of silicon area in a VLSI implementation. The advantage of the ROM-based DA approach is its efficiency of implementation. The basic operations required are a sequence of ROM look-ups, additions, subtractions and shift operations on the input data sequence [8]. Examples of the use of DA can be found in [8–10].

The target hardware for the implementation and verification of the proposed architectures is a Celoxica RC1000-PP PCI-based FPGA development board equipped with a Xilinx XCV2000E Virtex FPGA [7, 11]. This paper is organised in three parts: the first is concerned with 3-D affine transformations, the second with colour space conversion, and the third gives a general conclusion.

© IEE, 2005. IEE Proceedings online no. 20040838. doi:10.1049/ip-cds:20040838. Paper first received 9th December 2003 and in revised form 7th June 2004. Originally published online: 3rd June 2005. The authors are with the School of Computer Science, Queen's University of Belfast, Belfast BT7 1NN, UK.

2 3-D affine transformations
Computer graphics algorithms are generally computationally expensive, which is why considerable effort goes into accelerating them by any reasonable means. The traditional sources of speedup are faster processors, parallelism or dedicated hardware. Recent advances in digital circuit technology, especially the rapid development of FPGAs, offer an alternative route to acceleration. Several researchers have investigated implementing such algorithms on FPGAs. In [12], various techniques for improving the cost effectiveness of graphics applications are described. Methods for exploiting custom data formats and datapath widths, and for optimising graphics operations, such as texture mapping and hidden-surface
IEE Proc.-Circuits Devices Syst., Vol. 152, No. 3, June 2005
removal, have been studied. Customised architectures have been implemented on Xilinx 4000 and Virtex FPGAs in [12] using Handel-C, a C-like language supporting parallelism and flexible data sizes. Singh and Bellec [13] have shown that FPGAs can implement simple and complex graphics algorithms with a performance level that places them comfortably between custom graphics chips and general processors with specialised graphics instruction sets. A new method of hardware texture mapping, in which texture images are synthesised using FPGAs, was presented in [14]. The conclusion from that work was that, using FPGAs, procedural textures can be synthesised at high speed with low hardware cost. The aim of this work is to use FPGAs as a low-cost accelerator, to develop and implement a matrix multiplier for 3-D affine transformations using Handel-C, and to evaluate the performance of Xilinx CoreGen and the Celoxica fixed-point library for the implementation.
2.1 Review

In computer graphics the most popular method for representing an object is the polygon mesh model. In the simplest case, a polygon mesh is a structure that consists of polygons represented by a list of (x, y, z) co-ordinates that are the polygon vertices. Thus the information stored to describe an object is ultimately a list of points or vertices [15] (Fig. 1).

3-D affine transformations are the transformations that involve rotation, scaling, shear and translation. A matrix can represent an affine transformation, and a set of affine transformations can be combined into a single overall affine transformation. Technically, an affine transformation is made up of any combination of linear transformations (rotation, scaling and shear) followed by translation (translation itself is not a linear transformation) [15].

A set of vertices or three-dimensional points belonging to an object can be transformed into another set of points by a linear transformation. Matrix notation is used in computer graphics to describe such transformations. Using matrix notation, a vertex V is transformed to V* (* denotes the transformed vertex) under translation, scaling and rotation, which are the most commonly used transformations in computer graphics, as

$$V^* = D + V; \quad V^* = SV; \quad V^* = RV \qquad (1)$$

where D is a translation vector, and S and R are the scaling and rotation matrices [15, 16]. A uniform representation of all transformations in matrix notation is necessary for implementing these transformations in hardware. As it is not possible to describe translation in matrix notation in Cartesian co-ordinates, homogeneous co-ordinates have to be used. It is, however, very easy to transform Cartesian into homogeneous co-ordinates and vice versa. In a homogeneous system a vertex V(x, y, z) is represented as V(X, Y, Z, w) for any scale factor w ≠ 0. The three-dimensional Cartesian co-ordinate representation is then

$$x = X/w; \quad y = Y/w; \quad z = Z/w \qquad (2)$$

In computer graphics w is always taken to be one and the matrix representation of a point is (x y z 1)^T. Translation can now be treated as a matrix multiplication operation, like the other two transformations, and becomes

$$\begin{pmatrix} x^* \\ y^* \\ z^* \\ 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = T \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \qquad (3)$$

Therefore in homogeneous co-ordinates it is possible to describe any transformation in matrix notation:

$$\begin{pmatrix} x^* \\ y^* \\ z^* \\ 1 \end{pmatrix} = \begin{pmatrix} A & D & G & J \\ B & E & H & K \\ C & F & I & L \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \qquad (4)$$

This universal matrix for transformations can be divided into four function blocks:

$$\left(\begin{array}{c|c} \text{scaling and rotation} & \text{translation} \\ \hline \text{part of the homogeneous representation} & 1 \end{array}\right) \qquad (5)$$

The matrix representations for the two other most commonly used transformations are as follows:

Scaling: V* = SV

$$S = \begin{pmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \qquad (6)$$
Here S_x, S_y and S_z are the scaling factors. For uniform scaling, S_x = S_y = S_z.

Rotation: To rotate an object in three-dimensional space, an axis of rotation needs to be specified. This can have any spatial orientation in three-dimensional space, but it is easier to consider rotations about axes that are parallel to one of the co-ordinate axes. The transformation matrices for rotation about the X, Y and Z-axes, respectively, are

$$R_x = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta & 0 \\ 0 & \sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad R_y = \begin{pmatrix} \cos\theta & 0 & \sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad R_z = \begin{pmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \qquad (7)$$

Fig. 1 Data structure for object representation (a cube polygon mesh stored as faces A to F, a vertex list per face, e.g. 0 3 2 1 and 3 7 6 2, and the vertex co-ordinates (x0, y0, z0) to (x7, y7, z7))

Figure 3 shows two 3-D objects, a foot skeleton (N = 2154) and a face (N = 5597).

Fig. 3 Examples of 3-D objects
a Foot skeleton contains 2154 vertices b Face contains 5597 vertices
It is worth noting that a sequence of transformations can be represented by one matrix

$$T = T_1 T_2 \cdots T_N \qquad (8)$$

Figure 2 shows examples of a cube containing eight vertices under different transformations.
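As an illustration of how the homogeneous matrices compose, the following minimal Python sketch builds translation, scaling and Z-axis rotation matrices and applies a combined transform to one vertex. The function names (`mat_mul`, `mat_vec`, `translation`, `scaling`, `rotation_z`) are illustrative choices and do not come from the paper; the sketch assumes the column-vector convention V* = TV, so the right-most matrix in a product acts first.

```python
import math

def mat_mul(a, b):
    """Product of two 4x4 matrices given as lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def mat_vec(m, v):
    """Product of a 4x4 matrix and a 4-element column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    """Homogeneous translation matrix."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def scaling(sx, sy, sz):
    """Homogeneous scaling matrix."""
    return [[sx, 0, 0, 0], [0, sy, 0, 0], [0, 0, sz, 0], [0, 0, 0, 1]]

def rotation_z(theta):
    """Homogeneous rotation about the Z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

# Rotate by 90 degrees about Z, then translate by (1, 0, 0).
# With column vectors the right-most matrix is applied first.
T = mat_mul(translation(1, 0, 0), rotation_z(math.pi / 2))

vertex = [1.0, 0.0, 0.0, 1.0]  # (x y z 1)^T with w = 1
combined = mat_vec(T, vertex)
step_by_step = mat_vec(translation(1, 0, 0),
                       mat_vec(rotation_z(math.pi / 2), vertex))
```

Applying the single matrix `T` gives the same vertex as applying the two transformations one after the other, which is why a whole sequence of transformations costs only one matrix product per vertex.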
Fig. 2 3-D transformation examples (panels: original position, y-axis translation, z-axis rotation, x-axis scaling)
2.2 3-D affine transformations-based large matrix multiplication

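Consider an object represented with N vertices. The new position (NP) of the object when applying a transformation can be calculated as follows:

$$NP = T \times OP \qquad (9)$$

where T is the transformation matrix, OP is a (4, N) matrix containing the old vertex positions and NP is a (4, N) matrix containing the new vertex positions:

$$\begin{pmatrix} x^*_0 & x^*_1 & \cdots & x^*_{N-1} \\ y^*_0 & y^*_1 & \cdots & y^*_{N-1} \\ z^*_0 & z^*_1 & \cdots & z^*_{N-1} \\ 1 & 1 & \cdots & 1 \end{pmatrix} = \begin{pmatrix} A & D & G & J \\ B & E & H & K \\ C & F & I & L \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_0 & x_1 & \cdots & x_{N-1} \\ y_0 & y_1 & \cdots & y_{N-1} \\ z_0 & z_1 & \cdots & z_{N-1} \\ 1 & 1 & \cdots & 1 \end{pmatrix} \qquad (10)$$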
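The product NP = T × OP can be sketched in a few lines of Python. This is only an illustration of equations (9) and (10) in software, not the hardware multiplier described later; the function name `transform_object`, the cube data and the example matrix are assumptions made for the example.

```python
def transform_object(t, vertices):
    """Return new vertex positions NP = T x OP for a list of (x, y, z) tuples."""
    n = len(vertices)
    # OP is a 4 x N matrix: one homogeneous column (x y z 1)^T per vertex.
    op = [[v[0] for v in vertices],
          [v[1] for v in vertices],
          [v[2] for v in vertices],
          [1] * n]
    # Plain triple loop: 4 x 4 x N multiply-accumulate operations.
    np_matrix = [[sum(t[r][k] * op[k][c] for k in range(4)) for c in range(n)]
                 for r in range(4)]
    # Drop the homogeneous row and return (x*, y*, z*) tuples.
    return [tuple(np_matrix[r][c] for r in range(3)) for c in range(n)]

# Unit cube with eight vertices (hypothetical test object).
cube = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
        (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)]

# Uniform scaling by 2 combined with a translation of 5 along X.
T = [[2, 0, 0, 5],
     [0, 2, 0, 0],
     [0, 0, 2, 0],
     [0, 0, 0, 1]]

moved = transform_object(T, cube)
```

Each column of OP is independent of the others, which is what allows the columns to be distributed over several processing elements.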
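Fig. 4 Hardware–software tools used for the implementation
a Handel-C design flow (C code partitioned between the host processor, compiled with MS Visual C++, and the FPGA hardware, written in Handel-C and taken through Celoxica DK2, EDIF, Xilinx place and route and bitstream generation)
b Schematic view of the FPGA/banks part of the RC1000-PP board (XCV2000E connected to SRAM banks 0 to 3, to the PCI bus via DMA, and to the 8-bit control and status ports)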
2.3 FPGA implementation
2.3.1 Implementation approach: Handel-C is a high-level language that is at the heart of a hardware compilation system known as the Celoxica Development Kit (DK) [7], which is designed to compile programs written in a C-like high-level language into synchronous hardware. One of the advantages of using hardware is the ability to exploit parallelism directly. Because standard C is a sequential language, Handel-C has additional constructs to support the parallelisation of code and to allow fine control over what hardware is generated. DK produces a netlist file, which is used during the place and route stage to generate the image or bitstream file [7] (Fig. 4).

The RC1000-PP co-processor board used is a standard PCI bus card equipped with a large FPGA chip. It has 8 Mbytes of SRAM directly connected to the FPGA in four 32-bit wide memory banks. All are accessible by the FPGA and by any device on the PCI bus. The following methods of data transfer from the host PC or the environment to the FPGA are available:
- Bulk transfer of data between the FPGA and the PCI bus is performed through memory banks 0–3.
- Streams of bytes are most conveniently communicated through the unidirectional 8-bit control and status ports (Fig. 4).

The RC1000-PP board is supported by a macro library that simplifies the process of initialising and talking to the hardware. This library comprises a set of driver functions with the following functionality:
- initialisation and selection of a board
- handling of FPGA configuration files
- data transfer between the PC and the RC1000-PP board
- functions to help with error checking and debugging.

These library functions can be included in a C or C++ program that runs on the host PC and performs data transfer via the PCI bus [7].

Figure 5 shows the proposed parallel matrix multiplier (PMM), which can be used to perform the matrix multiplication described in Section 2.2. The multiplier has been implemented using Handel-C and compiled using DK version 2 (DK2) on the RC1000-PP board. The four external memories are exploited to perform this multiplication. Since the vertex co-ordinates are real numbers, floating-point or fixed-point representations can be used. Celoxica provides two libraries (floating-point and fixed-point), which allow different widths to be specified, thus allowing designers to use the minimum number of bits to represent data and consequently generate smaller hardware. The comparison of the two libraries in [17] shows that the floating-point representation has large resource requirements and consequently lower performance. If the range of real values that must be represented is small, or can be scaled to make it smaller, fixed-point arithmetic is one way of providing cheap, fast non-integer support. Fixed-point arithmetic is appropriate for our application because the range of the values is small. The proposed architecture consists of p identical PEs, where p should be a multiple of four. Each PE comprises a fixed-point multiply
Fig. 5 Proposed parallel fixed-point matrix multiplier for 3-D affine transformations
PE: processor element; SP: storage processor; SE: storage element
accumulator (MAC) and a register for final result storage. The MAC has been implemented using two approaches: (i) MAC based on the Celoxica fixed-point library: This is a device-independent hardware library that allows the width of the fractional and integer parts of the number to be defined and provides macros to execute arithmetic operations [7]. (ii) MAC based on Xilinx CoreGen: Xilinx's CoreGen utility contains many designs that can often save the programmer time, and Xilinx CoreGen blocks can be integrated with a Handel-C program using the interface declaration [6]. Two components have been used: a parallel signed integer multiplier and a parallel signed integer adder, which are suitable for the Xilinx XCV2000E-6 Virtex-E FPGA (Fig. 6).
In both cases, the vertex co-ordinates are represented with 22 bits (14 bits for the fractional part, seven bits for the integer part and one sign bit). The input transform matrix T is partitioned into four row-wise blocks, which gives one row per block. Each block is stored in one of the four available banks. The matrix OP is partitioned into four column-wise blocks and, like the blocks of T, each block is stored in one of the banks. Because different elements stored in the same SRAM cannot be accessed simultaneously, four buffers (TBuf1, TBuf2, TBuf3, TBuf4) store the four rows of T to avoid memory conflicts. Data are transferred from the banks to the buffers. Columns of the matrix OP are transferred from the SRAMs to the PEs in parallel. Each PE computes one element of the output matrix NP. The four storage processors (SPs), which have access to the PE registers, transfer the final results to the banks and operate as an interface between the p PEs and the four memory banks. Each group of four PEs works as a matrix–vector multiplier. Therefore, four PEs are used to compute the new position of a vertex. The entire computation of the matrix NP can be carried out in [2 × (4 × N)/p + BI + N/NB] clock cycles, where:
- 2 is the number of clock cycles needed by the multiply accumulator for one accumulation
- N is the number of object vertices
- p is the number of PEs used
- BI = 4 is the number of clock cycles needed for buffer initialisation
- NB = 4 is the number of memory banks available
- N/NB is the number of clock cycles needed for final result storage.

It is worth noting that the number of vertices N is always rounded up to a multiple of four. Therefore, the last partition of the matrix OP is padded with, at most, three columns of zeros (e.g. for N = 2154 two columns of zeros are appended to the matrix OP; in this case each partition has a size of 539 × 4). Since the number of clock cycles needed to perform a transformation is determined by the slowest processor element, adding columns of zeros does not affect it. Table 1 illustrates the performance obtained for the proposed architecture when using the two different approaches for the MAC implementation.

Table 1: Area/speed implementation report for the proposed parallel fixed-point matrix multiplier for two different approaches

| MAC used | Number of PEs | Area (%) | Speed (MHz) |
|---|---|---|---|
| Celoxica fixed-point library | 4 | 17 | 22 |
| Celoxica fixed-point library | 8 | 35 | 20 |
| Celoxica fixed-point library | 16 | 75 | 17 |
| Xilinx CoreGen | 4 | 10 | 35 |
| Xilinx CoreGen | 8 | 20 | 30 |
| Xilinx CoreGen | 16 | 42 | 24 |

Fig. 6 Fixed-point MAC (a signed integer multiplier takes two 22-bit operands and produces a 44-bit product, which is shifted right by 14 bits to 30 bits and fed to a 32-bit signed integer adder)
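The 22-bit format and the multiply, rescale and accumulate stages of Fig. 6 can be modelled with ordinary integers. The sketch below is a software approximation under stated assumptions, not the Handel-C or CoreGen implementation: the helper names (`to_fixed`, `to_float`, `mac`, `row_times_column`) are invented for the example, rounding to nearest is assumed when quantising, Python's arithmetic right shift stands in for the shift stage so that negative products keep their sign, and the 32-bit width of the accumulator is not enforced.

```python
FRAC_BITS = 14   # fractional bits of the vertex format
WORD_BITS = 22   # 1 sign bit + 7 integer bits + 14 fractional bits

def to_fixed(x):
    """Quantise a real number to a signed 22-bit integer with 14 fractional bits."""
    raw = int(round(x * (1 << FRAC_BITS)))
    lo, hi = -(1 << (WORD_BITS - 1)), (1 << (WORD_BITS - 1)) - 1
    if not lo <= raw <= hi:
        raise OverflowError("value does not fit the 22-bit fixed-point format")
    return raw

def to_float(raw):
    """Convert a fixed-point integer back to a Python float."""
    return raw / (1 << FRAC_BITS)

def mac(acc, a, b):
    """One multiply-accumulate step on fixed-point operands.

    The 22 x 22-bit product has 28 fractional bits; shifting right by 14
    returns it to 14 fractional bits before it is added to the accumulator.
    """
    product = a * b
    return acc + (product >> FRAC_BITS)

def row_times_column(row, column):
    """Inner product of one row of T with one homogeneous vertex column."""
    acc = 0
    for t, o in zip(row, column):
        acc = mac(acc, to_fixed(t), to_fixed(o))
    return acc

# First row of a translation by 2.5 along X applied to the vertex (1.25, -3, 0.5).
x_new = to_float(row_times_column([1.0, 0.0, 0.0, 2.5], [1.25, -3.0, 0.5, 1.0]))
```

All values in this example are exactly representable with 14 fractional bits, so `x_new` equals 3.75 with no quantisation error; values outside roughly ±128 are rejected by `to_fixed`.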
The buffers in our multiplier have been implemented using look-up tables (LUTs). The multiplier can perform transformations on an object with up to 2^18 vertices. The implementation using the MAC based on Xilinx CoreGen performs better than the one based on the Celoxica library because the cores used are well suited to the FPGA chip on our board.

Expensive 3-D cards exist that support the manipulation of co-ordinate transformations. Our PC (Pentium 4 CPU, 2.00 GHz) is equipped with an ATI RADEON FSC 32 MB graphics card, which belongs to this category of cards. This card delivers immersive, realistic colour and 3-D graphics at the fastest possible frame rate. The RADEON Charisma Engine, which takes care of the geometry processing, and the Pixel Tapestry Architecture, which is the rendering engine, support full transformation, clipping and lighting for improved 3-D detail. With full support for DirectX and OpenGL, it accelerates all of today's top 3-D games. Table 2 shows the performance obtained, in terms of minimum period and computation time, for the RADEON graphics card and for our FPGA implementation with different numbers of PEs when performing a transformation on the foot skeleton object, which contains 2154 vertices.

It can be seen from Table 2 that an improvement in the MAC gives better results in terms of computation time. Based on the number of clock cycles obtained for our design, a frequency of 37 MHz (minimum period 1/37 μs) with p = 16 would give a computation time lower than that of the graphics card and would therefore be enough to outperform it. The performance of the matrix multiplier dedicated to 3-D affine transformations demonstrates that the FPGA can be used as an effective low-cost solution. Although the RADEON FSC 32 MB card is approximately twice as fast, a lower-cost alternative would be preferable for an application that does not require the additional performance.

Table 2: Computation time comparison of the proposed structure with the RADEON FSC 32 MB graphics card

| Platform | MAC used | Number of PEs | Minimum period (μs) | Computation time (μs) |
|---|---|---|---|---|
| FPGA | Celoxica fixed-point library | 4 | 1/22 | 220.47 |
| FPGA | Celoxica fixed-point library | 8 | 1/20 | 134.825 |
| FPGA | Celoxica fixed-point library | 16 | 1/17 | 31.905 |
| FPGA | Xilinx CoreGen | 4 | 1/35 | 138.585 |
| FPGA | Xilinx CoreGen | 8 | 1/30 | 89.883 |
| FPGA | Xilinx CoreGen | 16 | 1/24 | 22.600 |
| RADEON FSC 32 MB graphics card | | | | 14.806 |

Fig. 7 Proposed system for 3-D affine transformations on FPGA (the graphical user interface, OpenGL and the 3-D object database on the host exchange the transform matrix T, the old vertex position matrix OP and the new vertex position matrix NP by DMA with memory banks 0 to 3 of the XCV2000E, which hosts the parallel matrix multiplier)
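The clock-cycle expression given in Section 2.3.1 is easy to evaluate in software. The sketch below evaluates it literally, with N first rounded up to a multiple of the number of banks as the text describes; it is not meant to reproduce the measured figures in Table 2. The function names and the ceiling division used when p does not divide the MAC term exactly are assumptions made for the example.

```python
def padded_vertices(n, banks=4):
    """Round the vertex count up to a multiple of the number of memory banks."""
    return -(-n // banks) * banks

def clock_cycles(n, p, mac_cycles=2, bi=4, nb=4):
    """Evaluate 2 x (4 x N) / p + BI + N / NB for a padded vertex count.

    n  : number of object vertices
    p  : number of processing elements (a multiple of four)
    bi : clock cycles needed for buffer initialisation
    nb : number of memory banks
    """
    n = padded_vertices(n, nb)
    mac_term = -(-(mac_cycles * 4 * n) // p)  # ceiling division (assumption)
    return mac_term + bi + n // nb

def computation_time_us(n, p, f_mhz):
    """Computation time in microseconds at a clock frequency given in MHz."""
    return clock_cycles(n, p) / f_mhz

# Foot skeleton object (N = 2154) on four PEs clocked at 22 MHz.
cycles = clock_cycles(2154, 4)
time_us = computation_time_us(2154, 4, 22)
```

For the foot skeleton the padded vertex count is 2156, so four PEs need 4312 MAC cycles, 4 cycles of buffer initialisation and 539 cycles of result storage.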
2.3.2 Proposed environment for 3-D affine transformations on FPGA: Figure 7 shows a general view of the entire proposed system. The environment consists of a host application (GUI), a 3-D object database, the open graphics library (OpenGL) [18] and the single FPGA chip coprocessor based on the RC1000-PP development board.
- 3-D object database: contains the 3-D model files (.OBJ, .3DS, .ASE).
- OpenGL: the specification of a powerful set of more than 350 graphics routines for 2-D and 3-D graphics processing. OpenGL includes facilities for defining and rendering 3-D primitives such as points, lines, polygons, spheres and cones; viewing 2-D projections of 3-D scenes; manipulating co-ordinate transformations; lighting (light sources, material properties); and texture mapping.
- Coprocessor: performs the 3-D affine transformations.
- Host application (GUI): implemented using Borland C++, it gives the user the ability to select a 3-D model from the 3-D object database and display it in the available 3-D viewer. The user can apply different algorithms to the object, such as texturing, lighting and anti-aliasing, which involve calls to OpenGL functions. Since C++ does not support fixed-point formats, a floating-point to fixed-point converter has been implemented. The vertex co-ordinates are converted from floating-point to fixed-point before the DMA transfer is performed. The inverse operation is applied to the result in order to reconstruct the transformed 3-D model.

3 Colour space conversion
Colour is a visual sensation produced by light in the visible region of the spectrum incident on the retina. Since the human visual system has three types of colour photoreceptor cone cells, three components are necessary and sufficient to describe a colour [19]. Colour spaces (also called colour models or colour systems) provide a standard method of defining and representing colours. There are many existing colour spaces and most of them represent each colour as a point in a 3-D co-ordinate system. Each colour space is optimised for a well-defined application area [20]. The three most popular colour models are RGB (used in computer graphics); YIQ, YUV and YCrCb (used in video systems); and CMYK (used in colour printing). All of the colour spaces can be derived from the RGB information supplied by devices such as cameras and scanners.

Processing an image in the RGB colour space, with a set of RGB values for each pixel, is not the most efficient method. To speed up some processing steps, many broadcast, video and imaging standards use luminance and colour difference video signals, such as YCrCb, which makes a mechanism for converting between formats necessary [21]. Several cores for RGB to YCrCb conversion designed for FPGA implementation can be found on the market, such as the cores proposed by Amphion Ltd [21], CAST Inc [22] and ALMA Tech [23]. The aim of this work is to propose a novel architecture for RGB to YCrCb colour space conversion based on matrix–vector multiplication using DA ROM accumulator principles.

In the rest of this Section, the gamma-corrected RGB values are denoted R′G′B′. Decomposing an R′G′B′ colour image into one luminance image and two chrominance images is the method that has been used in most commercial applications, such as face detection [24, 25], as well as in the JPEG and MPEG imaging standards [26, 27]. The calculation of Y′CrCb colour components from R′G′B′ components consumes up to 40% of the processing power in a highly optimised decoder [27]. Accelerating this operation would therefore help to accelerate the whole process.
3.1 Converting from R′G′B′ to Y′CrCb

A colour in the R′G′B′ colour space is converted to the Y′CrCb colour space using the following equation [23]:

$$\begin{pmatrix} Y' \\ Cr \\ Cb \end{pmatrix} = \begin{pmatrix} 0.257 & 0.504 & 0.098 & 16 \\ 0.439 & -0.368 & -0.071 & 128 \\ -0.148 & -0.291 & 0.439 & 128 \end{pmatrix} \begin{pmatrix} R' \\ G' \\ B' \\ 1 \end{pmatrix} \qquad (11)$$

The inverse conversion can be carried out using the following equation [23]:

$$\begin{pmatrix} R' \\ G' \\ B' \end{pmatrix} = \begin{pmatrix} 1.164 & 1.596 & 0.0 & -222.912 \\ 1.164 & -0.813 & -0.392 & 135.616 \\ 1.164 & 0.0 & 2.017 & -276.8 \end{pmatrix} \begin{pmatrix} Y' \\ Cr \\ Cb \\ 1 \end{pmatrix} \qquad (12)$$

3.1.1 Mathematical background: Consider an L × M image (L: image height, M: image width) (Fig. 8). Let us represent each image pixel by b_ijk (0 ≤ i ≤ L−1, 0 ≤ j ≤ M−1, 0 ≤ k ≤ 2), where

$$\begin{cases} b_{ij0} = R'_{ij} & \text{the red component of the pixel in row } i \text{ and column } j \\ b_{ij1} = G'_{ij} & \text{the green component of the pixel in row } i \text{ and column } j \\ b_{ij2} = B'_{ij} & \text{the blue component of the pixel in row } i \text{ and column } j \end{cases} \qquad (13)$$

The image can be converted using the following mathematical formula:

$$\begin{pmatrix} \mathbf{c}_{00} & \cdots & \mathbf{c}_{0,M-1} \\ \mathbf{c}_{10} & \cdots & \mathbf{c}_{1,M-1} \\ \vdots & & \vdots \\ \mathbf{c}_{L-1,0} & \cdots & \mathbf{c}_{L-1,M-1} \end{pmatrix} = A \otimes \begin{pmatrix} \mathbf{b}_{00} & \cdots & \mathbf{b}_{0,M-1} \\ \mathbf{b}_{10} & \cdots & \mathbf{b}_{1,M-1} \\ \vdots & & \vdots \\ \mathbf{b}_{L-1,0} & \cdots & \mathbf{b}_{L-1,M-1} \end{pmatrix}, \quad \mathbf{c}_{ij} = \begin{pmatrix} c_{ij0} \\ c_{ij1} \\ c_{ij2} \end{pmatrix}, \quad \mathbf{b}_{ij} = \begin{pmatrix} b_{ij0} \\ b_{ij1} \\ b_{ij2} \\ 1 \end{pmatrix} \qquad (14)$$

where the operation ⊗ can be defined as follows. Each vector (c_ij0, c_ij1, c_ij2)^T is the result of the product

$$\begin{pmatrix} c_{ij0} \\ c_{ij1} \\ c_{ij2} \end{pmatrix} = \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \end{pmatrix} \begin{pmatrix} b_{ij0} \\ b_{ij1} \\ b_{ij2} \\ 1 \end{pmatrix}$$

where c_ijk represent the output image colour space components and

$$A = \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \end{pmatrix}$$

represents one of the constant matrices in (11) and (12). The c_ijk elements can be computed using the following equation:

$$c_{ijk} = \sum_{m=0}^{3} a_{km} b_{ijm}, \quad 0 \le i \le L-1;\ 0 \le j \le M-1;\ 0 \le k \le 2 \qquad (15)$$

where {a_km} are l-bit constants and {b_ijm} are written in the unsigned binary representation as shown in (16)

$$b_{ijm} = \sum_{l=0}^{W-1} b_{ijm,l}\, 2^l, \quad 0 \le i \le L-1;\ 0 \le j \le M-1;\ 0 \le m \le 2 \qquad (16)$$

where b_ijm,l is the lth bit of b_ijm, which is zero or one, and W is the word length, which represents the resolution of each colour component of a pixel. Substituting (16) in (15):

$$c_{ijk} = \sum_{m=0}^{3} a_{km} \left( \sum_{l=0}^{W-1} b_{ijm,l}\, 2^l \right) = \sum_{l=0}^{W-1} \left( \sum_{m=0}^{3} a_{km} b_{ijm,l} \right) 2^l \qquad (17)$$

Define

$$Z_l = \sum_{m=0}^{3} a_{km} b_{ijm,l} \qquad (18)$$

Therefore, c_ijk can be computed as

$$c_{ijk} = \sum_{l=0}^{W-1} Z_l\, 2^l \qquad (19)$$

The idea is that, since the term Z_l depends on the b_ijm,l values and has only 2^4 possible values, it is possible to precompute these values and store them in ROMs. An input set of four bits (b_ij0,l, b_ij1,l, b_ij2,l, b_ij3,l) is used as an address to retrieve the corresponding Z_l value. The ROM content depends on the coefficients of the constant matrix A.

Fig. 8 R′G′B′ to Y′CrCb conversion

These intermediate results are accumulated in W clock cycles to produce the c_ijk coefficients. Since all the input components are in the range 0–255, eight bits (W = 8) are required to represent them. Equation (19) becomes

$$c_{ijk} = \sum_{l=0}^{7} Z_l\, 2^l \qquad (20)$$

that is, summing column by column over the bit index l with the weights 2^0, 2^1, ..., 2^7:

$$c_{ijk} = \sum_{l=0}^{7} \left( a_{k0} b_{ij0,l} + a_{k1} b_{ij1,l} + a_{k2} b_{ij2,l} + a_{k3} b_{ij3,l} \right) 2^l \qquad (21)$$

Since the element b_ij3 is always equal to one,

$$b_{ij3,l} = \begin{cases} 1 & \text{for } l = 0 \\ 0 & \text{for } l \ne 0 \end{cases} \qquad (22)$$

Equation (20) can be rewritten as

$$c_{ijk} = \sum_{l=0}^{7} Z_l\, 2^l + a_{k3} \qquad (23)$$

where

$$Z_l = \sum_{m=0}^{2} a_{km} b_{ijm,l} \qquad (24)$$

Writing PP_ijkl for the partial product contributed by bit l,

$$c_{ijk} = a_{k3} + \sum_{l=0}^{7} PP_{ijkl}\, 2^l, \quad PP_{ijkl} = a_{k0} b_{ij0,l} + a_{k1} b_{ij1,l} + a_{k2} b_{ij2,l} \qquad (25)$$

It is worth mentioning that the size of the ROMs has been reduced to 2^3. Table 3 gives the content of each ROM.

3.2 Proposed architecture
Equation (23) can be mapped into the proposed architecture as shown in Fig. 9. The architecture consists of eight identical PE_n (0 ≤ n ≤ 7). Each PE comprises three parallel signed integer adders, three n-bit right shifters and one ROMs block. Each ROMs block consists of three ROMs of size 2^3 each (Fig. 10). The ROM content depends on the coefficients of the matrix A, which in turn depend on the conversion type.
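The ROM-and-accumulate scheme of equations (23) and (24) can be modelled in software as follows. This is a behavioural sketch under stated assumptions rather than a description of the hardware: it uses Python floats instead of the 13-bit fixed-point values of the real ROMs, it loops over the eight bit planes sequentially where the architecture uses eight pipelined PEs, and the names `build_rom` and `da_convert` are invented for the example. The matrix coefficients are those of equation (11).

```python
# Constant matrix A of equation (11): rows Y', Cr, Cb; columns R', G', B', offset.
A_RGB_TO_YCRCB = [
    [0.257, 0.504, 0.098, 16.0],
    [0.439, -0.368, -0.071, 128.0],
    [-0.148, -0.291, 0.439, 128.0],
]

def build_rom(row):
    """Precompute the 2^3 entries Z_l for one matrix row.

    The 3-bit address is (b_ij0,l, b_ij1,l, b_ij2,l) with b_ij0,l as the
    most significant bit.
    """
    a0, a1, a2 = row[:3]
    return [a0 * ((addr >> 2) & 1) + a1 * ((addr >> 1) & 1) + a2 * (addr & 1)
            for addr in range(8)]

def da_convert(pixel, matrix=A_RGB_TO_YCRCB, width=8):
    """Convert one (R', G', B') pixel of 8-bit components by distributed arithmetic."""
    roms = [build_rom(row) for row in matrix]
    out = []
    for k, row in enumerate(matrix):
        acc = row[3] + 0.5                      # accumulator starts at a_k3 + 0.5
        for l in range(width):                  # one bit plane per step
            addr = 0
            for component in pixel:             # gather bit l of R', G', B'
                addr = (addr << 1) | ((component >> l) & 1)
            acc += roms[k][addr] * (1 << l)     # add Z_l weighted by 2^l
        out.append(int(acc))                    # truncate the fractional part
    return tuple(out)

white = da_convert((255, 255, 255))
black = da_convert((0, 0, 0))
red = da_convert((255, 0, 0))
```

No multiplication by a pixel value takes place: each step is a table look-up followed by a weighting by a power of two and an addition, which is the property the architecture exploits.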
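Table 3: Content of the ROM i

| b_ij0,l | b_ij1,l | b_ij2,l | Content of the ROM i |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | a_i2 |
| 0 | 1 | 0 | a_i1 |
| 0 | 1 | 1 | a_i1 + a_i2 |
| 1 | 0 | 0 | a_i0 |
| 1 | 0 | 1 | a_i0 + a_i2 |
| 1 | 1 | 0 | a_i0 + a_i1 |
| 1 | 1 | 1 | a_i0 + a_i1 + a_i2 |

Fig. 10 ROMs block structure (ROM1, ROM2 and ROM3)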
Fig. 9 Proposed architecture based on DA principles (eight cascaded PEs, each containing a block of three ROMs addressed by the bits b_ij0,l, b_ij1,l and b_ij2,l for l = 0 to 7; the ROM outputs of stage l carry a shift of l bits and are added to the running sums; the adders of the first PE are initialised with a_03 + 0.5, a_13 + 0.5 and a_23 + 0.5; the outputs are c_ij0, c_ij1 and c_ij2)
PE: processor element
It is worth noting that the architecture has a latency of W clock cycles and a throughput rate of one conversion per clock cycle. The entire image conversion can be carried out in (Latency + (L × M) × Throughput) = 8 + (L × M) clock cycles, whereas with the standard algorithm (Fig. 11) the conversion takes (3 × 4 × L × M) clock cycles, where (3 × 4) is the size of the constant matrix A. Figure 12 shows the functional analysis diagram of the proposed architecture. Table 4 gives the content of the ROMs used for R′G′B′ to Y′CrCb conversion. The proposed architecture can also be used for the inverse conversion (Y′CrCb to R′G′B′) and for other conversions based on matrix–vector multiplication. The content of the ROMs for Y′CrCb to R′G′B′ conversion is shown in Table 5. The precomputed partial products are stored in the ROMs using a 13-bit fixed-point representation (seven bits for the integer part, one sign bit and five bits for the fractional part), and 13-bit arithmetic is used inside the architecture. The architecture's inputs and outputs are represented using eight bits and the outputs are rounded. Rounding usually examines the fractional part and, if it is greater than or equal to 0.5, increases the result by one. This implies a test followed by another arithmetic operation. A more efficient way to round a number is to add 0.5 to the result and truncate the fractional part. This technique has been applied in our proposed architecture. The initial value of each adder in the first PE is set in advance to (a_i3 + 0.5), where 0 ≤ i ≤ 2.
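The equivalence of the two rounding methods can be checked on integers that carry five fractional bits, as in the 13-bit format described above. The sketch is illustrative only: the function names are invented, and Python integers stand in for the 13-bit hardware words.

```python
FRAC_BITS = 5                      # fractional bits of the 13-bit format
HALF = 1 << (FRAC_BITS - 1)        # the value 0.5 in this format

def round_by_compare(raw):
    """Conventional rounding: test the fractional part, then conditionally add one."""
    integer = raw >> FRAC_BITS
    fraction = raw & ((1 << FRAC_BITS) - 1)
    return integer + 1 if fraction >= HALF else integer

def round_by_offset(raw):
    """Rounding as used in the architecture: add 0.5, then truncate."""
    return (raw + HALF) >> FRAC_BITS

# Check every value representable in a signed 13-bit word.
all_equal = all(round_by_compare(v) == round_by_offset(v)
                for v in range(-(1 << 12), 1 << 12))

rounded_up = round_by_offset(int(2.5 * (1 << FRAC_BITS)))     # 2.5 in fixed point
rounded_down = round_by_offset(int(2.25 * (1 << FRAC_BITS)))  # 2.25 in fixed point
```

Because the offset can be folded into the constant a_i3 that already initialises the first adder, the second method costs no extra operation at run time.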
3.3 FPGA implementation

Like the previous implementation, the proposed architecture based on the DA technique has been implemented and verified using the Celoxica RC1000-PP FPGA development board. The architecture consumes 193 slices and can run with a maximum clock frequency of 188 MHz. The parallel signed adders have been implemented using Xilinx's CoreGen utility. The shifters and the ROM initialisation have been implemented using VHDL. In order to make a fair and consistent comparison with the existing FPGA-based colour space converters, the XCV50E-8 FPGA device has
for i = 1 to L do                    // scanning image rows
    for j = 1 to M do                // scanning image columns
        for k = 1 to 3 do            // scanning the three RGB values of a pixel
            for m = 1 to 4 do        // scanning columns of the constant conversion matrix
                cijk += akm × bijm
            end for
        end for
    end for
end for
Fig. 11 Pseudocode for the standard algorithm
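The loop nest of Fig. 11 can be written out as runnable code to make the cycle counts concrete. This is a minimal sketch, not the authors' software: the function names are hypothetical, one multiply-accumulate is assumed to cost one clock cycle, and the matrix A uses the widely quoted BT.601-style coefficients with offsets 16 and 128, which are assumptions here.

```python
# Assumed 3 x 4 conversion matrix; the last column holds the offsets that
# multiply the constant 1 appended to every pixel.
A = [
    [0.257, 0.504, 0.098, 16.0],     # Y'
    [0.439, -0.368, -0.071, 128.0],  # Cr
    [-0.148, -0.291, 0.439, 128.0],  # Cb
]


def convert_standard(image, a=A):
    """Convert an L x M image of (R, G, B) tuples; also count multiply-accumulates."""
    macs = 0
    out = []
    for row in image:                      # i: image rows
        out_row = []
        for r, g, b in row:                # j: image columns
            vec = (r, g, b, 1)
            c = [0.0, 0.0, 0.0]
            for k in range(3):             # k: output components
                for m in range(4):         # m: columns of the constant matrix
                    c[k] += a[k][m] * vec[m]
                    macs += 1
            out_row.append(tuple(c))
        out.append(out_row)
    return out, macs


def standard_cycles(rows, cols):
    """Cycle count of the loop nest, one multiply-accumulate per cycle."""
    return 3 * 4 * rows * cols


def pipelined_cycles(rows, cols, latency=8):
    """Cycle count of a fully pipelined design producing one pixel per cycle."""
    return latency + rows * cols


converted, macs = convert_standard([[(0, 0, 0), (255, 255, 255)]])
```

For a 512 × 512 image, `standard_cycles` gives 3 145 728 cycles against 262 152 for `pipelined_cycles`, a ratio of about 12, which is the size of the constant matrix.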
Clock cycle  PE1     PE2     PE3     PE4     PE5     PE6     PE7     PE8     Output
1st CC       PP00k0  delay   delay   delay   delay   delay   delay   delay
2nd CC       PP01k0  PP00k1  delay   delay   delay   delay   delay   delay
3rd CC       PP02k0  PP01k1  PP00k2  delay   delay   delay   delay   delay
...          ...     ...     ...     ...     ...     ...     ...     ...
7th CC       PP06k0  PP05k1  PP04k2  PP03k3  PP02k4  PP01k5  PP00k6  delay
8th CC       PP07k0  PP06k1  PP05k2  PP04k3  PP03k4  PP02k5  PP01k6  PP00k7  C00
9th CC       PP08k0  PP07k1  PP06k2  PP05k3  PP04k4  PP03k5  PP02k6  PP01k7  C01
...
Fig. 12 Functional analysis diagram
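The staggered schedule of Fig. 12 follows a simple rule: PE p starts work on a pixel one clock cycle after PE p-1, and each PE handles one bit position. The sketch below reproduces that rule under these assumptions; the function names are hypothetical and the tuples stand in for the PP labels of the figure.

```python
NUM_PES = 8  # one processing element per input bit


def schedule(cycle):
    """For a 1-based clock cycle, list what each PE works on.

    Each entry is (pixel_index, bit_position), or None while the PE is
    still waiting for the pipeline to fill (the 'delay' cells).
    """
    row = []
    for pe in range(1, NUM_PES + 1):
        pixel = cycle - pe              # PE p lags PE 1 by p - 1 cycles
        row.append((pixel, pe - 1) if pixel >= 0 else None)
    return row


def completed_pixel(cycle):
    """Index of the pixel whose result leaves the last PE in this cycle, if any."""
    return cycle - NUM_PES if cycle >= NUM_PES else None


first = schedule(1)     # only PE1 is busy
eighth = schedule(8)    # pipeline full, first result available
```

In this model the first result appears in the 8th clock cycle and one further result appears in every cycle after that, which is the latency of 8 and throughput of one stated for the architecture.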
Table 4: Content of the ROMs (R'G'B' to Y'CrCb)
R'ij0,l  G'ij1,l  B'ij2,l  ROM1    ROM2    ROM3
0        0        0        0       0       0
0        0        1        0.098   -0.071  0.439
0        1        0        0.504   -0.368  -0.291
0        1        1        0.602   -0.439  0.148
1        0        0        0.257   0.439   -0.148
1        0        1        0.355   0.368   0.291
1        1        0        0.761   0.071   -0.439
1        1        1        0.859   0       0
Table 5: Content of the ROMs (Y'CrCb to R'G'B')
Y'ij0,l  Crij1,l  Cbij2,l  ROM1    ROM2    ROM3
0        0        0        0       0       0
0        0        1        0       -0.392  2.017
0        1        0        1.596   -0.813  0
0        1        1        1.596   -1.205  2.017
1        0        0        1.164   1.164   1.164
1        0        1        1.164   0.772   3.181
1        1        0        2.76    0.351   1.164
1        1        1        2.76    -0.041  3.181
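The way the ROM contents relate to the conversion coefficients can be illustrated with a small sketch of the distributed arithmetic idea: each ROM holds the eight possible sums of one matrix row's three coefficients, and the result is built by looking up one entry per input bit and weighting it by the bit position. The code uses floating point rather than the 13-bit fixed-point words, and the names `build_rom` and `da_component` are hypothetical. The coefficients in the example are the luma values that appear as single-bit entries in ROM1 of Table 4.

```python
def build_rom(a0, a1, a2):
    """Eight precomputed partial sums, addressed by one bit from each input.

    The address is (r << 2) | (g << 1) | b, so entry 0b101 holds a0 + a2.
    """
    return [r * a0 + g * a1 + b * a2
            for r in (0, 1) for g in (0, 1) for b in (0, 1)]


def da_component(rom, pixel, offset=0.0, width=8):
    """Compute a0*x0 + a1*x1 + a2*x2 + offset using ROM lookups only."""
    acc = offset
    for bit in range(width):
        r, g, b = ((value >> bit) & 1 for value in pixel)
        acc += rom[(r << 2) | (g << 1) | b] * (1 << bit)   # shift by bit weight
    return acc


rom_y = build_rom(0.257, 0.504, 0.098)
pixel = (200, 100, 50)
via_rom = da_component(rom_y, pixel)
direct = 0.257 * 200 + 0.504 * 100 + 0.098 * 50
```

No multiplier is needed in this scheme: each of the eight bit positions costs one lookup, one shift and one addition, which is what allows one processing element per bit in a pipelined design.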
Table 6: Performance comparison with existing CSC cores
Design                 Slices  Throughput (mega-conversions/s)
Proposed architecture  193     234
CAST, Inc [22]         222     112
ALMA, Tech [23]        222     105
Amphion Ltd [21]       204     90
been targeted. Table 6 shows the performance obtained for the proposed architecture in terms of area consumed and achievable speed. The proposed architecture improves on the existing FPGA implementations in both respects: it occupies 193 slices and sustains 234 mega-conversions/s, whereas the existing cores occupy 204 to 222 slices and sustain 90 to 112 mega-conversions/s. In addition, Fig. 13 illustrates the R'G'B' Baboon image (512 × 512) converted to the Y'CrCb format using:
FPGA implementation: the entire image conversion can be carried out in approximately 1.2 ms.
Software-based implementation: using a 2.0 GHz Pentium 4 processor with 1 Gbyte of SDR RAM and C++ Builder V5.0, the entire image conversion can be carried out in approximately 126 ms.
It can be seen that the FPGA implementation produces the same converted image as the software conversion, about 100 times faster.
In the second part of this paper, a novel platform-independent and fully pipelined architecture based on the DA approach for R'G'B' to Y'CrCb conversion has been reported. The proposed architecture has a low latency and a high throughput rate. It can be used for other conversions based on matrix-vector multiplication by setting up the ROM content in advance. In addition, the architecture has been implemented and verified using the Celoxica RC1000-PP FPGA development board.
4 Conclusions
FPGAs have grown in capacity, improved in performance and decreased in cost. They have become a viable solution for performing computationally intensive tasks, with the ability to tackle applications that have traditionally required custom chips or programmable DSP devices. Owing to the importance and the use of matrix multiplication in many image and signal processing applications, two matrix multipliers have been proposed for FPGA implementation in this paper. The first proposed multiplier is dedicated to 3-D affine transformations, while the second one is for colour space conversion. The two multipliers have been implemented and verified using the Celoxica RC1000-PP FPGA development board. Results obtained for the first multiplier have shown that a low-cost FPGA implementation can achieve the performance of a graphics card when performing 3-D affine transformations. The performance of the second multiplier has been assessed in terms of the area used and the maximum clock frequency, and the results show that it runs at a higher frequency and occupies less area than existing systems.
5 References
1 Bensaali, F., Amira, A., and Bouridane, A.: An FPGA based coprocessor for large matrix product implementation. Proc. IEEE Int. Conf. on Field-Programmable Technology (FPT'03), Tokyo, Japan, December 2003, pp. 292-295
2 Huss-Lederman, S., Jacobson, E.M., Tsao, A., Turnbull, T., and Johnson, J.R.: Implementation of Strassen's algorithm for matrix multiplication. Presented at ACM/IEEE Conf. on Supercomputing, PA, USA, November 1996
3 Amira, A., Bouridane, A., Milligan, P., and Belatreche, A.: Design of efficient architectures for discrete orthogonal transforms using bit level systolic structures, IEE Proc., Comput. Digit. Tech., 2002, 149, (1), pp. 17-24
4 Bensaali, F., Amira, A., Uzun, I.S., and Ahmedsaid, A.: An FPGA implementation of 3-D affine transformations. Proc. 10th IEEE Int. Conf. on Electronics, Circuits and Systems (ICECS'03), Sharjah, UAE, December 2003, Vol. 2, pp. 715-718
5 Bensaali, F., Amira, A., Uzun, I.S., and Ahmedsaid, A.: Efficient implementation of large parallel matrix product for DOTs. Presented at Int. Conf. on Computer, Communication and Control Technologies (CCCT'03), FL, USA, July 2003
6 Xilinx CoreGen and Handel-C, Application note AN 58 v1.0, 2001
7 URL: www.celoxica.com
8 Amira, A., Bouridane, A., Milligan, P., and Roula, M.: Novel FPGA implementations of Walsh-Hadamard transforms for signal processing, IEE Proc., Vis. Image Signal Process., 2001, 148, (6), pp. 377-383
9 Ohlsson, H., and Wanhammar, L.: Maximally fast numerically equivalent state-space recursive digital filters using distributed arithmetic. Proc. IEEE Nordic Signal Processing Symp. (NORSIG2000), Kolmarden, Sweden, June 2000, pp. 295-298
10 Gustafsson, O., and Wanhammar, L.: Implementation of a digital beamformer in an FPGA using distributed arithmetic. Proc. IEEE Nordic Signal Processing Symp. (NORSIG2000), Kolmarden, Sweden, June 2000, pp. 295-298
11 URL: www.xilinx.com
12 Styles, H., and Luk, W.: Customising graphics applications: techniques and programming interface. Proc. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, USA, April 2000, pp. 77-87
13 Singh, S., and Bellec, P.: Virtual hardware for graphics applications using FPGAs. Proc. IEEE Workshop on FPGAs for Custom Computing Machines, Los Alamitos, CA, USA, April 1994, pp. 49-58
Fig. 13 Baboon (512 × 512) test image
a Original R'G'B' image
b Converted image in Y'CrCb format using the proposed system
c Software-based implementation (C++)
14 Ye, A.G., and Lewis, D.M.: Procedural texture mapping on FPGAs. Proc. ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays, Monterey, CA, USA, February 1999, pp. 112-120
15 Watt, A.: 3-D computer graphics (Addison-Wesley, 2000)
16 Ferguson, R.S.: Practical algorithms for 3-D computer graphics (A K Peters, 2001)
17 Eadie, D., Shevlin, F., and Nisbet, A.: Correction of geometric image distortion using FPGAs, Proc. SPIE - Int. Soc. Opt. Eng., 2003, 4877, pp. 28-37
18 URL: www.opengl.org
19 Payette, B.: Color space converter: R'G'B' to Y'CrCb. Xilinx Application Note, XAPP637, V1.0, September 2002
20 Gonzalez, R.C., and Woods, R.E.: Digital image processing (Prentice Hall Inc, 2002, 2nd edn.)
21 Color space converters, Datasheet, Amphion Semiconductor Ltd, (www.amphion.com) DS6400 V1.1, April 2002
22 CSC color space converter, Application note, CAST Inc, (www.castinc.com) April 2002
23 High performance color space converter, Datasheet, ALMA Technologies, (www.alma-tech.com) May 2002
24 Albiol, A., Torres, L., and Delp, E.J.: An unsupervised color image segmentation algorithm for face detection applications. Proc. Int. Conf. on Image Processing, October 2001, Vol. 2, pp. 681-684
25 Kuchi, P., Gabbur, P., Bhat, P.S., and David, S.: Human face detection and tracking using skin color modelling and connected component operators, IETE J. Res., 2002, 48, pp. 289-293
26 Mitchell, J.L., and Pennebaker, W.B.: MPEG video compression standard (Chapman & Hall, 1996)
27 Bartkowiak, M.: Optimisations of color transformation for real time video decoding. Presented at Digital Signal Processing for Multimedia Communications and Services, EURASIP ECMCS 2001, Budapest, Hungary, September 2001

IEE Proc.-Circuits Devices Syst., Vol. 152, No. 3, June 2005
