
J Real-Time Image Proc (2008) 3:3–20
DOI 10.1007/s11554-007-0070-9

SPECIAL ISSUE

A real-time motion estimation FPGA architecture


Konstantinos Babionitakis · Gregory A. Doumenis · George Georgakarakos · George Lentaris · Kostantinos Nakos · Dionysios Reisis · Ioannis Sifnaios · Nikolaos Vlassopoulos

Received: 10 May 2007 / Accepted: 12 December 2007 / Published online: 8 January 2008
© Springer-Verlag 2007

Abstract A motion estimation architecture allowing the execution of a variety of block-matching search techniques is presented in this paper. The ability to choose the most efficient search technique with respect to speeding up the process and locating the best matching target block leads to the improvement of the quality of service and the performance of the video encoding. The proposed architecture is pipelined to efficiently support a large set of currently used block-matching algorithms including Diamond Search, 3-step, MVFAST and PMVFAST. The proposed design executes the algorithms by providing a set of instructions common for all the block-matching algorithms and a few instructions accommodating the specific actions of each technique. Moreover, the architecture supports the use of different search techniques at the block level. The results and performance measurements of the architecture have been validated on FPGA supporting a maximum throughput of 30 frames/s with frame size 1,024 × 768.

Keywords FPGA · Video encoding · Motion estimation · Real time · Block-matching

1 Introduction Towards achieving real-time performance in multimedia applications, researchers and engineers have presented solutions for optimizing the modules of video encoders. Among the most demanding tasks in video encoding is the Motion Estimation (ME) process, because it requires more memory space and it is computationally more intensive when compared to the other processes performed in the encoder. Targeting the improvement of the Motion Estimation performance, a number of algorithms have been presented in the literature [1–5]. The effort in producing new algorithms has targeted the improvement of the computational complexity while keeping the resulting QoS as close as possible to that of the Full Search algorithm [6]. Furthermore, a large set of architectures has been presented for letting the new ME algorithms operate efficiently and especially in real time. The effort in architectures has focused on the real-time performance and the optimization of the hardware resources and the power consumption [7–15]. Reaching the optimal QoS with the use of the Full Search algorithm and minimal hardware resources has been the target of several research groups [10, 12, 14]. On the other hand, research effort has focused on designing and implementing techniques for improving the power consumption and throughput capacity of Motion Estimation architectures [16]. These techniques rely on fast block-matching ME algorithms [1–4], which make trade-offs between the computational complexity and the efficiency of the searching strategy. These algorithms, though, can present strong performance discrepancies depending on the type (size, motion activity, etc.) of the input video. To avoid binding the performance of the Motion Estimation engine to the performance of individual algorithms, research groups present reprogrammable devices that support

K. Babionitakis · G. Lentaris · K. Nakos · D. Reisis (✉) · N. Vlassopoulos Electronics Laboratory, Department of Physics, National and Kapodistrian University of Athens, Athens, Greece e-mail: dreisis@phys.uoa.gr G. A. Doumenis · G. Georgakarakos · I. Sifnaios Global Digital Technologies S.A., Athens, Greece e-mail: gregory.doumenis@gdt.gr



multiple search strategies to accommodate the needs of many different applications [7–9, 17, 18]. Targeting the enhancement of the performance of the Motion Estimation process, this paper presents an architecture letting different algorithms operate at run time. The architecture provides an infrastructure allowing the encoder to choose the best performing algorithm with respect to any criterion selected by the user, for the specifics of each video sequence. The Motion Estimation organization constitutes a reprogrammable processing engine with the ability to perform a variety of the most commonly used block-matching ME algorithms, including the MVFAST and PMVFAST [19]. It can be characterized as an Application Specific Processor (ASP) with its own customized Instruction Set. The key idea in designing the proposed Motion Estimation engine is the development of an Instruction Set Architecture (ISA) core, which uses a set of instructions common to the execution of all the block-matching algorithms. In order to accommodate the individual requirements of each algorithm, the ISA includes algorithm-specific instructions. In addition, parallelizing the operation of the processing elements to work on 16 pairs of pixels concurrently and pipelining the load/store operations results in real-time performance. The advantages of the proposed Motion Estimation engine when compared to other reprogrammable ME architectures in the literature [7–9, 17, 18] are, first, the ISA design allowing a large number of different ME algorithms to be executed with a minimized number of instructions and cycles. Second, the efficient design of the memory storing the search area and of the pipelined computations leads to the enhancement of the performance and the minimization of the required hardware resources.
Finally, the third is the speculative execution technique predicting the outcome of the Sum of Absolute Differences calculations, which enables the prefetching of pixels, maximizes the utilization of the resources and further enhances the performance. The architecture can be realized as a stand-alone IP core applicable to any video compression standard. To validate the design, the proposed Motion Estimation engine has replaced a conventional Motion Estimation architecture integrated within an H.264 real-time encoder [20] operating on a Xilinx Virtex 2 Pro FPGA at 50 MHz. We implement three representative block-matching algorithms, namely the PMVFAST, the MVFAST and the Diamond Search, and we compare their performance using the PSNR as quality metric. Further, we illustrate the efficiency of the algorithms' real-time execution in the motion estimation engine. The metric of the efficiency is the output bit-rate that each algorithm achieves with respect to the number of clock cycles that each algorithm requires.

The paper is organized as follows: the next section gives an overview of the ME architecture. Sections 3, 4 and 5 describe the modules of the architecture: the memory organization, the processing module and the control, respectively. Section 6 presents the results and performance measurements and, finally, Sect. 7 concludes the paper.

2 Overview of the ME architecture The proposed Motion Estimation architecture consists of three modules: the Data Cache module, the Control module and the module processing the Sum of Absolute Differences (SAD module), as depicted in Fig. 1. The Data Cache module buffers pixels communicated by the Motion Estimation and the other parts of the encoder. The buffering is accomplished by a Random Access Memory (RAM) storing the pixels of the search area and the current macro block (MB). The storing of these reusable pixels reduces significantly the bandwidth required for the ME communication. Additionally, the Data Cache module interpolates the pixel values on-the-fly to 1/2 and 1/4 resolutions, which are used by the ME when performing sub-pixel search steps. The Control module is responsible for three tasks, with each task performed by one of three distinct entities. The Top Level Controller synchronizes the ME engine with the host encoder. The Search Algorithm Controller executes the ME algorithm and handles the SAD module, and the INFO Controller serves as an information collector for the ME engine. The communication among the Control's entities involves handshaking signals designed to synchronize the modules and to preserve the consistency of the overall ME process. The SAD module is organized as a Single-Instruction Multiple-Data (SIMD) machine to speed up the computations of the Sum of Absolute Differences metric for two distinct MBs. The mapping of the functionality of the ME architecture onto the above mentioned modules has been derived by studying the processing requirements of all the supported

Fig. 1 Motion Estimation engine architecture



fast ME algorithms and by categorizing these requirements into those common to all algorithms and those specific to each algorithm. The analysis has shown that all the algorithms can utilize a common finite state machine, which handles their computation flow. Further, with respect to their processing requirements, all the algorithms utilize a processing module, which calculates the error metrics (the sums of absolute differences) resulting from the matching of the current MB to a variety of positions within the search area. To accommodate these requirements this work has focused on the design of an efficient pipeline comparing MBs, which is formed by the Data Cache and the SAD modules. The operations within the pipeline are dictated by the Control module, which realizes a flexible application specific processor executing a given algorithm. The proposed architecture constitutes a generic solution for implementing a large set of ME algorithms in real time. It can be used either as a reprogrammable architecture, which can switch to any operating algorithm in real time, or as an architecture which can be configured at compile time to accommodate a specific algorithm or a subset of these. The latter case achieves a relative optimization of the architecture with respect to the gate count, memory size and power dissipation. To achieve this goal we have followed the approach of developing an ISA which can support the ME engine in either of the above mentioned two roles. A set with a limited number of instructions constitutes the building block of the ISA. This basic block consists of the instructions that are common to all the ME real-time implementations. Moreover, the basic block of instructions is sufficient to support a number of low complexity algorithms, such as the Diamond, the 3-step search, the 4-step search, etc. The ISA is completed with more complex instructions enabling the ME engine to execute more involved and recently developed ME algorithms.
The full set of instructions can be used to support the proposed ME engine in reaching two targets. The first is to let the engine serve as a reprogrammable architecture in real time. The second is to let a reduced part of the ME architecture with a subset of the instruction set execute a specific ME algorithm. The purpose of these algorithm-specific architecture instantiations is to save on the hardware resources and power dissipation. In this case the architecture is configured at compile time, making the entire design suited for FPGA implementation. A representative set of algorithms supported by the ME engine includes the Three-Step Search (TSS), New Three-Step Search (NTSS), Four-Step Search (FSS), Diamond Search (DS), Two-Dimensional Logarithmic Search (TDL), Binary Search (BS), Orthogonal Search Algorithm (OSA), One-Step Search (OSS), Cross-Search Algorithm (CSA), Spiral Search (SS), Hexagonal Search Algorithm (HSA), Size-Based Predictors Selection Hexagon Search (SBPSHS), Cross-Diamond-Hexagonal Search Algorithms (CDHS), Motion Vector Field Adaptive Search Technique (MVFAST) and Predictive Motion Vector Field Adaptive Search Technique (PMVFAST). The specifications of the ME engine include the storage capability of the search area having size 48 × 48. The system has been designed to use one reference frame. The sub-pixel processing supporting the proposed architecture is independent of the integer-pixel searching technique and it is available for activation after the integer search termination. It generates half and quarter pixels only for the candidate blocks that are horizontally and vertically (not diagonally) offset to the integer search result. These sub-pixel processing features are included in the execution of all the ME algorithms. Also, the sub-pixel processing architecture is included in the reprogrammable architecture and it can be included in all the ME engine's instantiations configured at compile time. The proposed ME engine also features a technique to realize speculative execution. The idea is that while the SAD module processes an arbitrary block, the Control Unit will fetch and decode instructions based on a speculation of the SAD module's result. The speculative execution increases the speed of the ME engine because it improves two operations. The first is the block matching process calculating a SAD result, which will point out the next block to be matched. If the speculation is not included, then until the SAD calculation is completed the control will not initiate any instructions regarding the following matching process, because the next block to be matched is not yet known and it cannot be loaded to the SAD module. The speculation computations result in an indication regarding the SAD of the current block. Therefore, the result indicates whether the current block is a candidate for best matching and it also indicates which is going to be the next block to be matched.
This allows the control to issue instructions continuously so that no idle periods appear in the instruction initiation. The second operation is the pipeline processing along the main architecture pipeline (Memory-to-SAD), which by using the speculation remains fully utilized throughout the entire search period. This is because the Data Cache is informed about the next chosen candidate block (to be matched) a number of cycles before the SAD completes its computations on the last data of the current block. Hence, the Data Cache can load the communication pipeline with the next block's data at the cycle which follows the cycle used to load the current block's last data.
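To make the role of such search programs concrete, the following is a minimal software model of the Three-Step Search, one of the basic algorithms the engine supports. This is a sketch, not the ISA itself: the frame layout, the toy data and the SAD helper are illustrative assumptions.

```python
def sad(cur, ref, cx, cy):
    """Sum of absolute differences between block `cur` and the
    same-sized candidate block of `ref` with origin (cx, cy)."""
    n = len(cur)
    return sum(abs(cur[j][i] - ref[cy + j][cx + i])
               for j in range(n) for i in range(n))

def three_step_search(cur, ref, ox, oy, step=4):
    """Three-Step Search: inspect the centre and its eight
    neighbours at distance `step`, recentre on the best match,
    halve the step, and stop after the step-1 round."""
    n = len(cur)
    best = (ox, oy)
    while step >= 1:
        cx, cy = best
        candidates = [(cx + dx * step, cy + dy * step)
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
        # discard candidates whose block falls outside the frame
        candidates = [(x, y) for x, y in candidates
                      if 0 <= x <= len(ref[0]) - n
                      and 0 <= y <= len(ref) - n]
        best = min(candidates, key=lambda p: sad(cur, ref, *p))
        step //= 2
    return best

# toy frame: the block at (5, 3) of the reference frame is an
# exact match for the current block
ref = [[7 * x + 13 * y for x in range(24)] for y in range(24)]
cur = [row[5:9] for row in ref[3:7]]
print(three_step_search(cur, ref, 4, 4))  # -> (5, 3)
```

In the proposed engine the same control flow is expressed as ISA instructions, with the SAD evaluations executed by the SIMD SAD module rather than in software.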

3 Data Cache The major computational effort in the ME process is dedicated to the search for an MB position within the search


area which minimizes the error metric. The Data Cache module provides a fast local buffer for the search area data and it speeds up the MB-searching calculations of the ME architecture. The 48 × 48 pixel area, which is to be examined with respect to the current MB, is prefetched to the Data Cache from the external frame buffer and it is cached locally in zero-wait-state memory. The proposed architecture introduces a cache memory organization which allows the module to maintain a throughput of 16 pixels per cycle. The 16 pixels are communicated in parallel at each clock cycle to the SAD calculation module and hence, the scanning of one MB is completed in 16 clock cycles. Furthermore, the module organization allows efficient accesses to the frame buffer (external memory). The Data Cache accesses data in the frame buffer by performing one data burst operation during prefetching at the beginning of each MB processing step. The result of this operation is that the frame buffer bus remains free of the ME engine's requests for the remaining period. This access pattern allows the RAM device which stores the frame buffer to be used as a shared memory with other system peripherals, and thus it leads to the optimization of the design cost and to a more straightforward SoC integration. The Data Cache module uses a RAM to store the pixel luminance values of the reference frame (previous frame). For each MB processed by the ME engine the Data Cache module has to fetch the related frame region from the previous frame buffer and to forward pixels of this region to the SAD module. The module efficiently realizes these tasks by utilizing a multiple-bank, interleaved memory organization and by pipelining the operations. The memory has been designed to optimize the MB movement from the previous frame buffer (external memory) to the Data Cache and minimize the corresponding required bandwidth.
This is accomplished by designing the memory structure to efficiently support the order of the MB movement, which follows the raster scan order of the MB processing during the encoding of each frame. The details of the operations performed by the components of the Data Cache and the organization of the interleaved memory are given in the following subsections.
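As a rough software sketch of what this raster-scan-driven refresh costs per MB step, and of the cyclic x-coordinate translation it induces (both detailed in Sect. 3.2), consider the model below. The constants and the zero-based step numbering are assumptions layered on the paper's description.

```python
SEARCH_W = 48   # search-area width in pixels (3 MB columns)
MB = 16         # macroblock size in pixels

def x_translate(x, mb_step):
    """Cyclic x-coordinate translation caused by rotating the MB
    columns of the search-area memory; steps are numbered from 0
    here, matching the paper's steps 1..3 (period of three)."""
    phase = mb_step % 3
    if phase == 0:
        return x
    if phase == 1:
        return (x + MB) % SEARCH_W
    return (x - MB) % SEARCH_W

def new_mbs_to_fetch(mb_step):
    """MBs fetched from the frame buffer at each step of an MB
    line: the full 3x3 window at the start of the line, then only
    the three right-hand MBs thanks to the window overlap."""
    return 9 if mb_step == 0 else 3

print(x_translate(0, 1), x_translate(0, 2))      # -> 16 32
print(new_mbs_to_fetch(0), new_mbs_to_fetch(7))  # -> 9 3
```

The 9-to-3 reduction is the bandwidth saving the raster-scan refresh policy buys for every MB after the first one of a line.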

Fig. 2 Organization of the Data Cache module

3.1 Data Cache functions Figure 2 shows the overall organization of the module, with the data path depicted with heavy arrows and the control path with light arrows. The Search Area RAM component stores the 48 × 48 pixel-window centered around the collocated MB of the previous frame (following the H.264 standard recommendations [21]). We note here that the SAD can request as an input any 4 × 4 pixel-block of the latter pixel-window and addresses the pixel-block by referencing the plane coordinates of its origin (top-left pixel coordinates). The Address Generator translates the requests for 4 × 4 pixel-blocks into addresses for pixels in the RAM. The Mux Logic selects and rearranges the output of the Search Area RAM to form the requested 4 × 4 pixel-block as a 16-pixel word in raster scan order and forwards this to the Interpolator. The Interpolator component is the last stage of the Data Cache pipeline. It provides to the SAD either the 1/2 and 1/4 interpolated sub-pixel values of each input pixel-block, or the pixel-block itself. If the ME engine is operating at sub-pixel MB search points then the interpolated output will be selected. Else, in the case of integer search points, the pixel-block will be forwarded unaltered to the SAD. The Data Cache module additionally includes the Current MB component storing the pixel values of the current MB. The Current MB component forwards any 4 × 4 pixel-block of the current MB to the SAD module and synchronizes this forward operation with the output of the Mux Logic. The Control component provides local scheduling for the module's operations, it interfaces with the SAD and the Control modules and it generates the addresses to the previous frame buffer.

3.2 Search Area RAM organization The Search Area RAM component has been designed to accommodate two distinct operations: the store operations copying MB data from the previous frame buffer (external memory) to the Search Area RAM, and the fetch operations forwarding pixel data from the Search Area RAM to the SAD module. The store operations occur at the beginning of each new MB processing step of the ME engine. During the store operation, the component has to fetch the 48 × 48 pixel area from the previous frame buffer. This area consists of 9 MBs, as shown in Fig. 3 left (within the heavy square). The Search Area RAM component fetches these 9 MBs of the search area from the frame buffer on an MB basis. The order of the MBs is specified by the FORDER instruction,


Fig. 3 Data Cache refresh policy. Left initial configuration. Right configuration at the next MB step

which will be explained in Sect. 5.2. The order of the fetching of the MBs can be specified in the program so that the ME architecture can begin the searching operations while the Data Cache module is still loading the Search Area RAM component with data coming from the external frame buffer. This feature is exploited by the fast ME algorithms which have predefined MB search patterns, such as the Three-Step Search, the Diamond Search and other algorithms. A straightforward approach to design the store operations is to fetch the 9 MBs using the designated fetch order at the beginning of each new MB processing step. The proposed architecture improves this approach by applying a strategy based on the order of processing MBs within a video frame. The frame is scanned in row-major order and with direction from the left to the right, also called the raster scan order. At the beginning of an MB line, the Data Cache uses the straightforward approach to fetch the search area. Given the raster scan order, for subsequent MBs in the same MB line note that contiguous MBs have overlapping search windows, as shown in Fig. 3 (the figure shows the search windows of two horizontally contiguous MBs). The left part of the figure shows the initial configuration with the search window centered about MB F. On the right, during the next MB step, the search window has moved one MB to the right and is centered about G. To minimize the volume of the communicated data between the Data Cache and the frame buffer memory, the Data Cache module fetches only the three non-overlapped MBs D, H and L, replacing the A, E and I of the previous window. Further, using this replacing scheme the indices of the MB columns have to be rotated at each MB step. The column index rotation is demonstrated at the top of Fig. 3. Consider for example the case of the center of the search window. During the initial configuration (left) the center of the search window resides within MB column 2 (F). At the next MB step (right) this center resides within MB column 3 (G). Therefore, by using this scheme the coordinates supplied by the SAD module to the Data Cache have to be translated. It is straightforward to show that the coordinate translation involves only the x (horizontal) coordinate of the (x,y) pair. Moreover, the translation performed is cyclic with a period of three MB steps: at step 1 x is left

untranslated, at step 2 x becomes (x + 16) mod 48 and at step 3 x becomes (x − 16) mod 48. To support the fetch operations with a throughput of up to 16 pixels per clock, the Search Area RAM component is organized as an 8-bank interleaved memory (Fig. 4). Each bank is a dual-port RAM core with one read port and one write port. The word length is 32 bits and hence, each word can store four 8-bit luma pixels. The Data Cache stores the 48 × 48 pixel search area of the previous frame buffer as shown at the top left of the figure. The pixels are interleaved horizontally using an 8-bank scheme, banks A through H, as shown at the bottom right of the figure. Each word in each bank stores 4 vertically contiguous pixels. To allow the fetching of one complete 4 × 4 block at each clock cycle with arbitrary coordinates, the mapping of the pixels onto the banks is altered every 4 vertical pixels, as shown in Fig. 4. This organization is able to fetch and/or store concurrently 16 pixels in the form of the 4 × 4 pixel-block. The pixels of the 48 × 48 search area are stored in the Search Area RAM as follows: divide the search area into 12 row-strips (with id 0 ≤ id ≤ 11) of size 4 × 48 pixels each. A column (4 × 1 pixels) of a row-strip constitutes a word in the RAM. The words of the even-numbered row-strips are stored in the RAM with word w_k, 0 ≤ k ≤ 47, located at bank w, where w = k mod 8. Conversely, for odd-numbered row-strips, each word w_k is mapped to bank w = (k + 4) mod 8. The key idea of this organization is the efficient fetch operation of the 4 × 4 pixel-blocks to the SAD module. The Address Generator translates a (x,y) block-origin request to the address l = 6·⌊y/4⌋ + ⌊x/8⌋ of RAM bank w = (4·(⌊y/4⌋ mod 2) + x) mod 8 (considering that (0,0) corresponds to the origin of the collocated MB in the center of the search area). To illustrate that the requested pixels are stored in contiguous banks, four pixels per bank, we consider the request of an arbitrary 4 × 4 pixel-block with origin (x,y). Note that (x,y) belongs to a row-strip U with rows φ, 4·⌊y/4⌋ ≤ φ ≤ 4·⌊y/4⌋ + 3, and is located in bank w. The Mux Logic outputs the pixels (x + i, y), 0 ≤ i ≤ 3, located in the (w + i) mod 8 RAM banks respectively. Note that these four pixels constitute the upper row of the requested 4 × 4 pixel-block. In

Fig. 4 Mapping of the Search Area Pixels onto the banks of the Data Cache memory
addition, the Mux Logic outputs all the remaining three rows of the 4 × 4 pixel-block. These are the pixels (x + i, y + j), 0 ≤ i ≤ 3 and 1 ≤ j ≤ 3, located in the bank (w + i) mod 8 if the pixel (x + i, y + j) belongs to the same row-strip as (x,y), or in the bank (w + i + 4) mod 8 with address l + 6 if the pixel belongs to the row-strip below U (i.e. rows j with j > 4·⌊y/4⌋ + 3). The architecture of the Search Area RAM with the Address Generator and the Mux Logic is shown in Fig. 5. The (x,y) coordinates are the input to the request translator, which computes the above mentioned transformation resulting in the address l and the bank of the first pixel w. The translator generates 8 pixel addresses for the eight banks. w specifies the rotation offset (number of shifts) by which the address barrel shifter has to rotate the addresses. Each address points to a 4-pixel word within a bank. Therefore, the organization provides at the output of the memory banks 32 pixels, which contain the requested 16 pixels. Note that each 4-pixel word output can contain from none to four pixels located within the requested pixel-block. The Mux Logic consists of the data barrel shifter and the output multiplexer, which are designed to extract the 16 requested pixels out of the 32 given by the memory banks. The data barrel shifter and the output multiplexer are controlled by the request translator, which computes the control words for these components.
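The bank and address mapping can be modeled in a few lines. The formulas below are a reconstruction from the description above (the exact hardware encoding may differ), and the exhaustive check illustrates why any 4 × 4 block can be read in a single cycle: its words always fall into distinct banks.

```python
def word_of(x, y):
    """Bank and address of the 4x1-pixel word holding pixel (x, y)
    of the 48x48 search area. Reconstruction of the mapping in the
    text: word k of an even row-strip sits in bank k mod 8, of an
    odd row-strip in bank (k + 4) mod 8; each strip contributes
    48 / 8 = 6 words per bank."""
    strip = y // 4                       # row-strip id, 0..11
    bank = (x + 4 * (strip % 2)) % 8
    addr = 6 * strip + x // 8
    return bank, addr

def block_words(x, y):
    """Set of (bank, address) words covering the 4x4 block with
    origin (x, y)."""
    return {word_of(x + i, y + j) for i in range(4) for j in range(4)}

# Any 4x4 block needs at most 8 words, and they always land in
# distinct banks, so one read per bank suffices for a
# single-cycle fetch.
for x in range(45):
    for y in range(45):
        words = block_words(x, y)
        assert len({bank for bank, _ in words}) == len(words)
```

Note also that a block straddling a strip boundary reads the lower strip's words at address l + 6, exactly as the text states, since 6·(strip + 1) + ⌊x/8⌋ = l + 6.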

3.3 Sub-pixel interpolation The Interpolator component computes the sub-pixel interpolations of each pixel-block. It receives as input one 4 × 4 pixel-block at each cycle from the output of the Mux Logic. The output depends on the mode the Control has selected. In the integer Motion Estimation mode, the input pixel-block is forwarded unaltered to the SAD. In the half-pixel and quarter-pixel modes the Interpolator component computes and outputs 4 pixels at the half-pixel and quarter-pixel resolutions, respectively. The half-pixel interpolations according to the H.264 standard recommendations [21] are derived by applying a 6-tap FIR filter with coefficients [1, −5, 20, 20, −5, 1], and the quarter-pixel interpolations consist of the averaging of nearby pixels. The proposed Interpolator implements a simplified 4-tap filter with coefficients [−4, 20, 20, −4], obtained by merging the first with the second taps and the fifth with the sixth taps (refer to Fig. 6). The figure shows the interpolated sub-pixel generation for each pixel-block. Integer pixel positions for a 4 × 4 pixel-block are shown lightly shaded (A through L). Generated pixels at 1/2 resolution are labeled as primed letters (A′ through J′) and at 1/4 resolution are labeled with lower case, normal and
Fig. 5 Data Cache Memory Architecture


Fig. 6 Interpolated pixel generation positions for half pixels (left) and quarter pixels (right)

primed letters (a through j and a′ through j′). The FIR filter organization is depicted in Fig. 7. The modification of the FIR leads to several advantages in the FPGA implementation. The simplified FIR implementation requires less hardware resources due to the FIR's shorter length, but also due to the simplification of two of the coefficients to power-of-two values (−4). These multiplications are reduced to arithmetic binary shift operations and sign inversions. The remaining two coefficients (20) can be realized by using a shift-add operation (20x = 16x + 4x). Also, the divisions by 2 and 32 are realized by shift operations. Furthermore, in the simplified version the input pixels to the FIR all lie within a 4 × 4 pixel-block. For this reason all the sub-pixel calculations can be accomplished by using only the data included in this pixel-block. Although this simplification leads to a significantly improved hardware resources utilization, it results in a marginal degradation of the SNR performance.
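A minimal software model of the simplified filter, using only shifts and adds as described above; the rounding offset and the clipping follow the usual H.264 convention and are assumptions of this sketch.

```python
def half_pel(a, b, c, d):
    """Simplified 4-tap half-pixel filter [-4, 20, 20, -4] using
    shifts and adds only: 20x = (x << 4) + (x << 2) and
    -4x = -(x << 2); the accumulated sum is normalized by 32
    (>> 5). Rounding (+16) and clipping to 8 bits are modelling
    assumptions in the usual H.264 style."""
    acc = (b << 4) + (b << 2) + (c << 4) + (c << 2) \
          - (a << 2) - (d << 2)
    return min(255, max(0, (acc + 16) >> 5))

print(half_pel(7, 7, 7, 7))      # a flat area stays flat -> 7
print(half_pel(0, 255, 255, 0))  # clipped to the 8-bit range -> 255
```

Since the coefficients sum to 32, a constant input reproduces itself, which is the sanity check the first call demonstrates.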

Fig. 7 Interpolation filters for 1/2 and 1/4 sub-pixel calculations

The Interpolator component is a pipeline organization and consists of the Transpose unit and 4 parallel FIR units. The input to the Transpose unit is a 4 × 4 pixel-block. This unit conditionally transposes the pixel-block if vertical interpolations are required. Each FIR unit will conduct the simplified 4-tap FIR filter calculations if the 1/2 or 1/4 search steps are requested. The last stage of the FIR computations includes an averaging unit which computes the average of the integer and FIR-interpolated pixels to produce the 1/4 pixel resolution. Figure 7 depicts the architecture and the dataflow of one FIR unit. Each 4 pixels are input to one FIR unit, which applies the simplified FIR calculations and forwards the result through the half-pel output. The four half-pixel results of the four parallel FIR units, in the case of the Transpose unit configured for a no-transpose operation, are depicted in Fig. 6 (on the left side) as K, C, F and J. In the transpose operation the respective results are shown in Fig. 6 (on the left side) as A, B, G and D. The FIR unit also calculates the quarter-resolution pixels and forwards these to the SAD module through the quarter-pel output (refer to Fig. 7). The quarter-pixels are computed via a straightforward averaging operation which involves the generated half-pixels and their adjacent integer pixels. The multiplexer supplies the averaging component with either the left-adjacent (resp. top-adjacent) or the right-adjacent (resp. bottom-adjacent) integer pixel. Hence, the FIR unit can compute either the 1/4 displacement or the 3/4 displacement of each pixel-block. This situation is depicted in Fig. 6 right: for a no-transpose operation the 1/4 displacement is i, c, f, j while the 3/4 displacements are their primed counterparts. In the same figure, in the transpose operation the 1/4 displacement is a, b, g, d and the respective primed letters denote the 3/4 pixel displacement.
To minimize the latency of the pipeline, the component design also includes a bypass path from the input directly to the output, which is used when the ME engine is performing integer pixel MB search steps.
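The quarter-pixel averaging step can be sketched as follows; the (+1, shift-right-by-1) rounding is an assumption of this model, not stated explicitly in the text.

```python
def quarter_pels(left_px, half_px, right_px):
    """1/4 and 3/4 displacement sub-pixels: average the generated
    half-pixel with the left-adjacent and right-adjacent integer
    pixels respectively, mirroring the multiplexer that selects
    the neighbour (rounding convention assumed)."""
    return ((left_px + half_px + 1) >> 1,
            (right_px + half_px + 1) >> 1)

print(quarter_pels(10, 20, 30))  # -> (15, 25)
```

For vertical interpolations the same averaging applies with the top-adjacent and bottom-adjacent neighbours, after the Transpose unit has reoriented the block.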



Note that in sub-pixel resolutions, the input throughput to the Interpolator is fully utilized by consuming all the 16 pixels of the 4 × 4 pixel-block. Due to the FIR calculations though, the Memory-to-SAD pipeline has a throughput of 4 pixels per clock, which is 1/4 of the throughput achieved when the ME engine operates in integer search mode.

4 SAD calculator The SAD module compares a 16 × 16 region of the search area ($C_{ij}$) with the current MB ($R_{ij}$). The metric used to illustrate the resemblance between the two pixel-blocks is the Sum of Absolute Differences, given by Eq. 1:

$$\mathrm{SAD} = \sum_{i=0}^{15} \sum_{j=0}^{15} \lvert C_{ij} - R_{ij} \rvert \qquad (1)$$

Fig. 8 SAD module architecture
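A direct software rendering of Eq. 1, together with a sketch of the add/subtract pairing trick used in the first adder-tree stage (described in the following paragraph); this is a model of the arithmetic, not the RTL.

```python
def sad16(cur, cand):
    """Eq. 1: sum of |C_ij - R_ij| over a 16x16 block."""
    return sum(abs(cand[i][j] - cur[i][j])
               for i in range(16) for j in range(16))

def paired_stage(d1, d2):
    """First adder-tree stage of the SAD module: d1 passes through
    an absolute converter, while d2 is added or subtracted
    according to its sign bit, so only half of the 16 differences
    need absolute-value hardware. The result equals |d1| + |d2|."""
    a = d1 if d1 >= 0 else -d1        # absolute converter
    return a + d2 if d2 >= 0 else a - d2

print(paired_stage(-3, 5))  # -> 8
print(paired_stage(4, -2))  # -> 6
```

Eight such paired stages feed the rest of the depth-4 adder tree, which is why that tree needs add/subtract elements in its first level rather than plain adders.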

The SAD module realizes the search-point inspection process of the ME search algorithm. To process the inspection, the Control module provides the SAD module with a (x,y) vector specifying the position of the target block (candidate) in the search area. Based on this vector, the SAD module initiates the requests to the Data Cache module to fetch the pixel-blocks required for calculating the SAD. The resulting SAD value is sent back to the Control module for further processing. The SAD module operates on 16 pairs of pixels in parallel. To complete the 16 × 16 region inspection the SAD module generates 16 consecutive fetch-requests. Each request to the Data Cache addresses two distinct 4 × 4 pixel-blocks: the first is a sub-block of the Current MB and the second is a sub-block of the candidate block. The inspection process efficiently utilizes the fully pipelined data path, which connects the Data Cache with the SAD module and has a latency of 4 cycles. The pipeline (Memory-to-SAD) can achieve 100% utilization because the SAD is able to initiate a new MB comparison at the clock cycle following the addressing of the 16th (last) pixel-block of the previous MB comparison. The 16 pairs of pixels, given as input to the SAD module, are directly forwarded to 16 subtractors. The absolute values of these differences are summed within a depth-4 adder tree. For area optimization reasons we use the following technique regarding the calculations of the absolute values: only 8 of the 16 differences are transformed with the use of absolute converters (each utilizing a 9-bit inverter, a 9-bit adder and a 9-bit 2-to-1 multiplexer). The remaining 8 differences are added if they are positive, or subtracted if they are negative, to/from the transformed values by checking their sign bit. For this reason, the first stage of the depth-4 adder tree uses add/subtract elements instead of ordinary adders (Fig. 8). The proposed design of the SAD

modules data path benets the architecture not only in terms of hardware resources, but also in terms of performance because it is optimized to include the dedicated high-speed components in the FPGA environment. The adder tree of the SAD module is the nal stage of the Memory-to-SAD pipeline. The root of this tree is a 16-bit accumulator terminating the ow of the pipeline. The accumulation process lasts for 16 consecutive clock cycles, unless an abort signal occurs. The abort signal will occur if the partially accumulated value turns larger than the best_sad value provided by the Control module (the SAD metric corresponding to the best candidate block discovered by that time). To accommodate the proposed speculative execution technique, the SAD module noties the Control module regarding the applicability of the candidate block before the termination of the SAD metric calculation. This notication is, more precisely, an estimation of whether the outcome of the SAD computation is going to be smaller than the best_sad. The estimation is computed 15 times during the 16-cycle accumulation process and it involves a comparison of the dynamically growing SAD metric with a fraction of the best_sad. The fraction of the best_sad increases proportionally to the number of the SAD results 1 2 3 accumulated at each cycle 16; 16; 16; etc.) and this value is sent to the speculation calculator comparator (shown in Fig. 8). The idea of using a fraction of the best_sad proportional to the processed part of the candidate block is to achieve SAD comparisons normalized with respect to the fraction of MB scanned so far. These fractions are calculated with simple shift-add operations in a depth-2 adder tree (Fig. 8 includes the corresponding circuitry: speculations calculator) using maximal bit precision (e.g.

123

J Real-Time Image Proc (2008) 3:320


9 16 1 1 1 16 ; 11 1 1 16 ; etc.). In the case of a false 2 16 2 8 estimation, which is the case of a notication differing the preceding notications, an exception signal is generated and forwarded to the Control module (Sect. 5.3 describes the exception handling). The exception signal does not affect the accumulation process, and thus it causes no delay to the SAD module. The internal SAD controller is realized as two distinct state machines to efciently implement the two aforementioned control functions of the SAD module: the SAD estimation and the pixel requests. Note here that the pixelblocks are fetched in a sparse manner (chessboard) rather than a strict raster scan order. This sampling technique results in a faster SAD value convergence and thus, by getting better SAD estimations it provides to the ME algorithm the information to compute the region similarities faster than the strict raster scan. Finally, when the ME engine performs in the sub-pixel mode the entire SAD organization operates as in the integer mode with a few differences: The formation of a candidate block in the sub-pixel case involves 64 requests issued by the SAD to the Data Cache (instead of the 16 of the integer mode) because 4 integer pixels produce a single sub-pixel value. The 12 out of the 16 adder tree inputs are given 0s as input and the speculative execution is deactivated. The architecture involves a multiplexer circuit, which selects a 4-pixel row or a 4-pixel column of the 16 pixels of the pixel-block belonging to the current MB.
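The accumulation, abort, and per-cycle estimation rules above can be modeled behaviorally as follows. This is a software sketch of the mechanism, not the RTL: the function names are illustrative, and the blocks are given as 16 groups of 16 pixel pairs (one group per clock cycle):

```python
def sad_with_abort(cur, cand, best_sad):
    """Accumulate the SAD over 16 cycles (16 pixel pairs per cycle).
    Mirrors the abort rule: stop once the partial sum exceeds best_sad.
    Returns (sad_or_None, cycles_used)."""
    partial = 0
    for k in range(16):
        partial += sum(abs(a - b) for a, b in zip(cur[k], cand[k]))
        if partial > best_sad:       # abort signal
            return None, k + 1
    return partial, 16

def estimate_better(partial, cycles_done, best_sad):
    """Per-cycle speculative estimate: compare the growing SAD with the
    proportional fraction (cycles_done/16) of best_sad, here done by
    cross-multiplying to stay in integers (hardware uses shift-adds)."""
    return partial * 16 <= best_sad * cycles_done
```

For instance, if each cycle contributes 16 to the sum and best_sad is 100, the abort fires on the seventh cycle, once the partial sum (112) exceeds best_sad.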
5 Control module

The Control module of the proposed ME engine consists of three distinct components (Fig. 1): the Top Level component, the INFO component and the Search Algorithm component. The Top Level component triggers the other two components of the Control module and the Data Cache storing process. The first component to be triggered is the INFO component. The algorithm execution and the data caching follow the termination of the INFO process. This initiation sequence has been specified to accommodate the encoder organization, which has a single bus connecting the ME engine with the host encoder. The design of the ME engine includes interfaces of the Data Cache and the INFO component to the encoder's bus and, therefore, these components cannot use the bus at the same time. Also, this initiation sequence is designated because the ME functions depend on the outcome of the INFO process.

The INFO component is a stand-alone state machine serving as an information collector. The design of this component depends on the specifics of the interface between the ME engine and the other modules of the entire encoder. The ME engine uses the INFO component to fetch information data. Such data are the motion vectors of the neighboring MBs (left, top-left, top, top-right) and the motion vector of the collocated MB of the previous frame. Other information retrieved by the INFO component consists of the calculated SAD values of these five MBs and the data regarding the frame dimensions and the position of the current MB in the frame. The above information is stored in registers to ensure parallel access to up to five information data in one clock cycle. The Search Algorithm component is the core of the Control module. It handles the SAD module, decides the sequence for loading blocks and implements the branching steps of the search algorithm. The following subsections describe in detail the architecture of the Search Algorithm component, the supported ISA and how the organization handles the feature of speculative execution and the sub-pixel processing.

5.1 The architecture of the Search Algorithm component

The architecture of the Search Algorithm component is depicted in Fig. 9. It consists of an instruction memory, a sequencer, a decoder and the necessary arithmetic/logic circuitry to execute the instructions. The output of this component is the (x,y) coordinates of the target block. This output is forwarded either to the SAD module during the search process or to the host encoder when the target block becomes the final MB predictor. The Instruction Memory stores the program performing the search algorithm. The memory is 16 bits wide, while the depth depends on the number and the complexity of the stored programs. The Search Algorithm component includes an Instruction Register, a Program Counter and an Instruction Decoder. Multiple ME algorithms are stored in the ROM occupying different address spaces. The user can select the ME algorithm at the start of each MB process. The control signal choosing the ME search algorithm is synchronized with the start of each MB process.

Fig. 9 Search Algorithm component architecture

The Execute subcomponent shown in Fig. 9 utilizes a set of 8 16-bit registers: two 16-bit registers (current_position and best_move) tracking the path of the target block on the search area, one 16-bit register (check) used for comparisons and 5 general purpose registers. Both the current_position and best_move registers store (x,y) pointers on the search area corresponding to good candidate blocks; current_position points to the best candidate block discovered until the current algorithm iteration, while best_move points to the best candidate block discovered during the current iteration (using coordinates relative to the coordinates stored in current_position). The data of
these two registers are added at the end of each iteration to form the origin of the latest MB predictor. The check register stores intermediate algorithm results for comparison with specific instruction operands, accommodating the branching steps of the algorithm. For example, among other data, this register stores the relative position of the latest MB predictor, which is used in deciding the flow of the program by branching. Finally, the 5 general purpose registers support the initialization steps of advanced algorithms (PMVFAST, MVFAST), the exception handling (discussed in Sect. 5.3) and the SAD value storing.

The Search Algorithm component can be configured to include an auxiliary tracking memory to store the already searched points. The memory must be of size 32 × 32 bits, organized as a bit map indicating the previously visited search points of the entire 48 × 48 pixel search area. Further, this bit map can be implemented with 32 registers, resulting in significant time savings because the bit map can be initialized in a single clock cycle and consulted without memory delays. Note that a straightforward ME algorithm implementation without the auxiliary memory can lead to redundant search operations in the search area. The approach followed in the proposed design is to avoid the utilization of a bit map implemented either as a register file or as an auxiliary tracking memory. To achieve an efficient implementation without keeping track of the visited points, the proposed design realizes the ME applications by focusing on the efficiency of the ISA code implementation. The ISA allows such efficient coding through the extensive use of branching steps. An example illustrating such an efficient use of the proposed ISA is given in the following subsection.

The design of the Search Algorithm component, described above, is based on the analysis of a large number of fast block-matching ME algorithms. The register organization reflects the recursion that almost every ME algorithm uses: find a good search point and improve the result, if possible, with respect to that particular point. The arithmetic/logic organization reflects the needs originating from the refinement strategy common to all algorithms. Moreover, it has been designed to utilize distinct structures accommodating the different calculations of the advanced algorithms. These distinct processing blocks (shown in Fig. 9 as application dedicated circuits) can be either included in the reprogrammable ME architecture or omitted at compile time in the case of a specific application instantiation.
5.2 Instruction Set Architecture

The goal in designing the ISA of the Search Algorithm component is to provide the programmer with a simple tool to program the ME engine. The outcome is a very small set of application-specific instructions making feasible the implementation of almost any known fast motion estimation algorithm (or even a novel one). The instructions are divided into two sets: the generic set, containing the instructions common to all the search techniques, and the algorithm specific set, used in a limited number of algorithms. Below we give the specifications first of the generic instruction set and then of the algorithm specific instruction set. The instructions are 2 bytes in length. We use the 4 MSBs for the instruction's opcode and the remaining 12 bits for the operands, as shown in Table 1. In detail:

INIT determines a reference point on the search area (typically (0,0)). The Sx and Sy operands are stored to the
current_position register, which stores the motion vector acquired from all the past algorithmic iterations.

SAD triggers the SAD module. The Rx and Ry operands are relative to the current_position coordinates and define a search point. During the execution of this instruction, Rx and Ry are added to the current_position register and then passed to the SAD module. Upon completion of the SAD calculations, if the SAD module computes a better metric than that kept in the best_sad register, the Rx, Ry operands are stored to the best_move register and the best_sad register is updated with the new value. Additionally, the SAD instruction is designed to support predefined operands referring explicitly to each of the six following motion vectors: each of the four neighboring blocks (left, top-left, top, top-right), their median value, or the motion vector of the collocated block from the previous frame.

UPDATE is used at the end of each algorithmic iteration to conclude the motion vector refinement. UPDATE accumulates the best_move data to the current_position register, loads the best_move data to the check register and resets the best_move register.

FORDER determines the order according to which the Data Cache fetches MBs from the previous frame buffer. The 48 × 48 pixel region contains 9 MBs (Fig. 4). The collocated MB is always fetched first (MB4). The remaining 8 MBs can be indexed using 3 bits. Hence, the order of up to 4 MBs can be given in one instruction, given that the 16-bit instruction word contains a maximum of 4 MB indices (MB1 to MB4). Programs execute the FORDER instruction twice to set the complete fetching order of the 8 MBs, as follows: the FORDER operands are stored in a length-8 array consisting of 3-bit registers able to forward data to each other (parallel-to-serial transformation). The first four entries of the array hold the operands of the first decoded FORDER, while the second four entries hold the operands of the second FORDER. Each time an MB fetch is completed, the operands in the array are shifted so that a new MB index is forwarded to the Data Cache.

CMP compares the contents of the check register with the operands Cx, Cy. On equality, the CMP flag is raised. The JUMPC instruction consults this flag to determine if the branch will be taken. CMP usually follows the UPDATE instruction.

JUMPC is used for branching. The offset operand is added to the Program_Counter only when the proper flag is raised. The proper flag is determined by the extop operand: if extop is 0 the jump is unconditional; if extop is 1 the CMP flag is checked; if extop is either 2 or 3 the respective CMPSP flags are checked.

TSTOP conditionally terminates the MB search process if the value of the best_sad register is smaller than the threshold operand. Operand f indicates that the threshold value calculated by the FIXTHR instruction will be used for comparison.

END terminates the integer search process. The current_position register holds the motion vector of the current MB, while the best_sad register holds the SAD metric that corresponds to the current MB's predictor.
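The register semantics of INIT, SAD and UPDATE can be summarized with a small behavioral model. The class and method names are illustrative assumptions, and the SAD result is passed in as a plain value standing in for the SAD module:

```python
class SearchState:
    """Behavioral sketch of the Execute subcomponent's register file."""
    def __init__(self):
        self.current_position = (0, 0)   # MV from all past iterations
        self.best_move = (0, 0)          # best candidate of this iteration
        self.check = (0, 0)              # comparison register for CMP
        self.best_sad = float("inf")

    def INIT(self, sx, sy):
        self.current_position = (sx, sy)

    def SAD(self, rx, ry, sad_value):
        # (rx, ry) is relative to current_position; keep it as best_move
        # when the (simulated) SAD module reports a better metric
        if sad_value < self.best_sad:
            self.best_sad = sad_value
            self.best_move = (rx, ry)

    def UPDATE(self):
        x, y = self.current_position
        bx, by = self.best_move
        self.current_position = (x + bx, y + by)
        self.check = self.best_move      # consulted by a following CMP
        self.best_move = (0, 0)
```

After INIT(0,0), SAD(1,0) with metric 100 and SAD(0,1) with metric 80, an UPDATE moves current_position to (0,1) and loads (0,1) into check for the next CMP.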

In addition to these generic instructions, which can be used to program a variety of search algorithms, we integrated the following algorithm specific instructions in order to accommodate the MVFAST and PMVFAST techniques:

CMPSP performs certain comparisons (PMVFAST technique). Operand extop indicates the exact operation to be performed. CMPSP raises specific flags for the next JUMPC to take the branch.

FIXTHR is the first part of the PMVFAST initialization step. It computes the two metric thresholds depending on the neighboring MB information. The final values are stored to general purpose registers of the Control module.

PREDMV is the second part of the PMVFAST initialization step. The median of the neighboring motion vectors is stored to a general purpose register of the Control module.

MBACT computes the motion activity of the current MB (MVFAST technique). To do so, the maximum city-block length of the neighboring motion vectors is compared to the operands L1, L2. MBACT stores the result to the check register.

Figure 10 depicts a segment of the One Step Search (OSS) program, implementing the vertical search steps of the OSS algorithm. Loop A moves the target block downwards in the search area until a local minimum is found, while Loop B moves the target block upwards. Note that this technique bypasses the already searched points, because the search direction is based on the relative position of the target block at each step. The same technique is used to program algorithms with more complicated search patterns, avoiding redundant SAD initiations without the use of the tracking memory that keeps track of the visited positions. Even with the extensive use of branching steps, a single program does not exceed 150 instructions (for instance, DS requires 84 instructions and PMVFAST requires 146 instructions, while their straightforward implementations using the tracking memory require 20 and 50 instructions, respectively).
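A sketch of how a 16-bit instruction word with a 4-bit opcode might be packed and unpacked. The two 6-bit two's-complement operand fields are an assumption made for SAD-style (Rx, Ry) operands, not the exact field layout of Table 1:

```python
def encode(opcode, op_a, op_b):
    """Pack a 4-bit opcode (bits 15-12) and two 6-bit signed operands
    (bits 11-6 and 5-0) into one 16-bit instruction word."""
    return ((opcode & 0xF) << 12) | ((op_a & 0x3F) << 6) | (op_b & 0x3F)

def decode(word):
    """Unpack the word; sign-extend each 6-bit operand field."""
    def s6(v):
        return v - 64 if v & 0x20 else v
    return (word >> 12) & 0xF, s6((word >> 6) & 0x3F), s6(word & 0x3F)
```

Six-bit signed fields cover displacements of −32 to +31, which is sufficient for relative moves inside a 48 × 48 search area.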

Table 1 Instruction Set Architecture of the programmable ME (bits 15-12 hold the opcode; bits 11-0 hold the operand fields)

INIT: Sx, Sy
END: -
UPDATE: -
TSTOP: f, threshold
FORDER: MB1, MB2, MB3, MB4
CMP: Cx, Cy
JUMPC: extop, offset
CMPSP: extop
PREDMV: -
FIXTHR: -
MBACT: L1, L2
SAD: Rx, Ry

Fig. 10 Segment of an example program based on the proposed ISA

5.3 Speculative execution and sub-pixel processing

All the ISA instructions are decoded and executed in one clock cycle, except SAD, which invokes the SAD module's process. Due to the bandwidth and the depth of
the Memory-to-SAD pipe, the SAD instruction can consume up to 20 clock cycles in the worst case. To avoid stalling the Control module for so long, we use a speculative execution technique. This technique is based on the specifics of the execution of the SAD instruction during the ME algorithm's search process. An arbitrary SAD(xi,yi) instruction returns a SAD value corresponding to the (xi,yi) candidate block. The returned SAD value is compared to the best_sad value, corresponding to the best candidate block discovered by that time, to determine whether (xi,yi) points to a better MB predictor. We note that the program flow does not depend on the returned SAD value itself, but rather on the comparison of this value with the best_sad value. Further, the forthcoming SAD comparison can be estimated based on the intermediate results of the pending SAD instruction. This estimation is calculated at the speculation calculator and forwarded to the Control module as if it were the final comparison outcome. Hence, the Control module continues with the program execution, working in parallel with the SAD module. The speculative execution is blocked when an instruction depends on the final SAD value of a candidate block (TSTOP, CMPSP) or when the following SAD(xi+1,yi+1) instruction begins. In the latter case, the SAD(xi+1,yi+1) is decoded and the (xi+1,yi+1) pointer is sent to the Data Cache module. This operation allows the Data Cache module to forward the pixels of the candidate block (xi+1,yi+1) to the SAD module immediately after the last pixel of the block (xi,yi) has been forwarded. Hence, the speculative execution eliminates the empty slots in the Memory-to-SAD pipe and achieves its 100% utilization. It also improves the utilization of the instruction pipe because the instruction fetching/decoding is not stalled.

The speculative execution technique involves an exception handling mechanism. The SAD module sends an exception signal to the Control module if a false estimation occurs during the candidate block processing. The exception handling cancels the speculatively executed instructions, restoring the state of the algorithm at the time of the specific SAD initiation. This rollback is achieved as follows: each time a new SAD process is initiated, the three registers describing the algorithm's state (current_position, best_move, Program_Counter) and the Rx, Ry operands are stored temporarily to general purpose registers. These values are restored when an exception occurs. The best_sad register is not included in the exception handling, as this register is updated only when the SAD process has finished (not based on estimations). Note here that a false estimation does not result in a SAD re-initiation, because the SAD process is not affected by the SAD estimations; the only time loss is in terms of speculative operations.

After the termination of the program execution, the ME engine enters the sub-pixel mode of operation. The Search Algorithm component controls the sub-pixel process in the same way as the integer-pixel process, by using separate control flags. The sub-pixel algorithm is not reprogrammable and it involves two iterations. During the first iteration, the Search Algorithm component issues four consecutive requests to the SAD module regarding the four half-pixel candidates adjacent to the integer predictor (up, down, left, right). The integer motion vector is then refined and points to the best half-pixel candidate of the first iteration. During the second iteration, the Search Algorithm component issues two consecutive requests regarding the quarter-pixel candidates adjacent to the best half-pixel predictor and aligned to the same direction with it (left/right or up/down). The half-pixel motion vector is then refined and points to the best quarter-pixel candidate of the second iteration. The sub-pixel mode can be deactivated at run-time for energy conservation or can be omitted at compile time for area optimization. Finally, note that when the sub-pixel process exceeds its allocated time budget, the control of the ME engine forwards to the host encoder the best predictor discovered so far (the same holds for the integer-pixel process).
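The two-iteration sub-pixel refinement can be sketched as follows, working in quarter-pel units. Here sad_of stands in for the SAD module and the function name is an illustrative assumption:

```python
def subpixel_refine(sad_of, mv, best_sad):
    """Refine an integer MV (given in quarter-pel units, i.e. multiples
    of 4) with one half-pel iteration and one quarter-pel iteration."""
    x, y = mv
    best = (x, y)
    # iteration 1: four half-pel candidates around the integer predictor
    for c in [(x, y - 2), (x, y + 2), (x - 2, y), (x + 2, y)]:
        s = sad_of(*c)
        if s < best_sad:
            best_sad, best = s, c
    # iteration 2: two quarter-pel candidates along the same axis as the
    # half-pel refinement (left/right or up/down)
    dx, dy = best[0] - x, best[1] - y
    if dx:
        quarter = [(best[0] - 1, best[1]), (best[0] + 1, best[1])]
    elif dy:
        quarter = [(best[0], best[1] - 1), (best[0], best[1] + 1)]
    else:
        quarter = []                     # no half-pel improvement found
    for c in quarter:
        s = sad_of(*c)
        if s < best_sad:
            best_sad, best = s, c
    return best, best_sad
```

The sketch mirrors the candidate count of the hardware: at most four SAD requests in the half-pel iteration and two in the quarter-pel iteration.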
6 FPGA implementation and performance evaluation

The proposed Motion Estimation engine has been implemented on an FPGA and several block-matching ME algorithms have been executed to validate the architecture and measure its performance. The ME engine has been integrated with an H.264 real-time encoder to test its functionality and performance with various video streams.

6.1 FPGA implementation

The FPGA used to realize the ME engine is the Xilinx Virtex 2 Pro 40. The ME engine has been realized at a cost of 2,127 slices (equivalent to 46,125 gates) and 11 block RAMs, operating at 50 MHz. Table 2 summarizes the distribution of the FPGA resources among the three ME modules. Note here that the total RAM and ROM memory size (which does not include the registers) varies with respect to the search area size and the selected ME algorithm (the implementation presented in Table 2 makes use of a 48 × 48 search area and a ROM containing the PMVFAST algorithm).

Table 2 FPGA place and route results

ME module        Eq. gates   Slices   Memory (bits)
Data Cache       27,727      1,123    20,480
SAD calculator   5,699       312      -
Control          12,699      692      2,336
Total            46,125      2,127    22,816

The proposed architecture features some significant improvements compared to other results presented in the literature. The authors in [17] describe a reprogrammable ME engine with a more flexible ISA. In terms of FPGA slices, [17] uses resources similar to the architecture presented in this paper, but it does not include functions such as the sub-pixel estimation, the parallel-pixel SAD calculation, the speculative execution and the complicated data caching (note also that it utilizes a smaller search area of 32 × 32 pixels). As a result, the throughput that [17] can sustain is much smaller than the desirable 1,024 × 768 at 30 fps. The authors in [18] present a VLIW processor for block-matching Motion Estimation using approximately 9,000 slices, which is more than four times the amount of resources used by the architecture proposed in this paper. The efficiency, defined as the fraction of throughput over area (hardware resources), is reported as similar for both architectures (in [18] and in the present paper), but the techniques in [18] use a significantly smaller number of candidate blocks during the search process. Further, the memory organization in [18] features less bank utilization per pixel-block request than the proposed Data Cache and makes use of additional pixel buffers (hardware resources), as it does not support parallel reads of vertically contiguous pixels. The authors in [7] propose a programmable ME engine utilizing a separate RISC processor. The advantage of our approach lies in the fact that the ISA is optimized with respect to the ME functions, resulting in fewer instruction fetching operations per macro-block process and more efficient resource utilization. The area comparison between [7] and the proposed architecture clearly favors the proposed ME engine (note that we compare separately the control units and the units realizing the SAD calculations). Besides these reprogrammable approaches, the proposed architecture requires less area even when compared to architectures implementing individual fast ME algorithms (e.g. the FOSS architecture in [16]). Compared to architectures implementing the Full Search (FS) algorithm, the proposed architecture follows the rule of the trade-off between execution time and hardware resources; the FS implementations mentioned in [12] and [15] require a significantly larger amount of hardware resources in order to complete their calculations on time. A software implementation running on a soft-IP processor integrated in a SoC in [22] results in a VLSI area of 8 mm² using a 0.09 μm process. An optimized version of the 3-D Recursive Search algorithm requires a 50 MHz system clock for encoding 30 frames/s at SD resolution. An example implementation of the proposed ME engine utilizes approximately 1 mm² using a 0.18 μm process and requires a comparable clock frequency (around 40 MHz). Finally, in [23] the SAD is based on carry-save adders (CSAs) and the authors have designed an application specific circuit for the input of the SAD to improve the performance of the architecture in bottlenecks. The proposed architecture is an efficient pipeline design without bottlenecks. Therefore, it is improved with respect to the resources (area optimized) compared to [23], because (Sect. 4) the adder tree is implemented with dedicated add/subtract components instead of customized carry-save adders, and the technique used at the input stage of the tree saves half of the absolute value converters instead of reducing the ripple-carry chains.

6.2 Configurations at compile time

The proposed architecture can be utilized as a ME engine programmable in real time to support the previously mentioned algorithms, or it can be configured at compile time to support specific algorithmic functionality. These architecture instantiations target an efficient ME implementation for the early developed and simpler ME algorithms, for the MVFAST implementation or the PMVFAST, or for a configuration which does not include the sub-pixel processing. Table 3 shows the hardware resources used for each case. The first line shows the resources of the complete reprogrammable ME engine and the second the slices in the case that the sub-pixel processing is excluded. The reduction in the case of the sub-pixel exclusion is relatively small because the sub-pixel processing organization has been area (resource) optimized. The third line presents the ME engine realizing the simpler algorithms including the sub-pixel processing capabilities. The fourth line lists the resources required for the MVFAST execution with sub-pixel processing and the last line presents the resources for the PMVFAST realization with sub-pixel processing.

Fig. 11 PSNR versus bit-rate for 140 frames of the coastguard sequence encoded with three ME algorithms (Diamond, MVFAST, PMVFAST)

6.3 Performance results

To verify and test the proposed design, the ME engine has been integrated in an H.264 encoder and has been deployed on a prototyping platform [20]. The host encoder utilizes a CAVLC entropy encoder and uses 16 × 16 intra-frame predictions. The board features two FPGAs, one Xilinx Virtex II Pro (XCV2P40) and one Spartan 2 (XC2S150). It also provides 16 MB of SDRAM for the frame buffer, 512 kB of SRAM and 8 MB of flash RAM for storage of test input video sequences, an add-on sensor for real-time image acquisition and interfaces for USB, serial and host processor connections. The design achieves a maximum clock frequency of 50 MHz on the FPGA board and can sustain a throughput of 30 frames/s with each frame having a resolution of 1,024 × 768.

We have tested the ME architecture's performance with the PMVFAST, MVFAST and Diamond Search algorithms. We have used five well-known CIF (352 × 288) sequences at 25 frames/s: foreman, coastguard, highway, mobile and news. Figures 11, 12, 13 and 14 depict results regarding the prediction quality and speed of the three ME algorithms. First, to measure the quality of the block matchings, we plot the PSNR versus bit-rate curves (Figs. 11 and 12) of the reconstructed images. The only parameter varying during these encodings is the quantization step of the host encoder. Second, to measure the speed of these algorithms, we examine the size of the encoded bit-streams while narrowing the time window of the search process (Figs. 13 and 14). More specifically, we stop the search process at a predefined number of clock cycles, forcing the ME engine to report the best macro-block predictor found so far. The resulting bit rate implies the effectiveness of the prediction (small bit rates indicate useful predictions). Note that the credibility of these measurements depends on the stability of the image quality. We have approximated this stability by using a constant quantization step.

The above PSNR tests show that the algorithms in discussion have performed as expected: PMVFAST performs better than MVFAST, which in turn performs better than DS. Figures 11 and 12 depict the performance of the algorithms and their differences in a limited search time window (400 cycles). The differences between any two of the above-tested algorithms decrease when the cycle budget increases; in some cases (e.g. the mobile CIF sequence and
Table 3 FPGA slice-count for various configurations

Configuration type                    Slices   Overall reduction (%)
ME engine for all algorithms          2,127    -
Omitting sub-pixel operations         1,837    14
Supporting only simple algorithms     1,724    19
Supporting simple and MVFAST alg.     1,896    11
Supporting simple and PMVFAST alg.    2,057    4

Fig. 12 PSNR versus bit-rate for 180 frames of the mobile sequence encoded with three ME algorithms (Diamond, MVFAST, PMVFAST)

Fig. 13 Output bit-rate versus ME search time (clock cycles per MB) for 70 frames of the foreman sequence (Q = 34)

for the algorithms performing for more than 700 cycles) the performance is practically the same for all three algorithms. More interesting are the results of comparing the algorithms' performance with respect to the search time window (Figs. 13 and 14). Each algorithm improves the bit rate significantly when it is allowed to perform up to a certain number of clock cycles. This improvement reaches an upper bound (lower bit rate) at a certain number of cycles. The upper bound, as well as the number of cycles at which this limit appears, is distinct for each algorithm and for each sequence. Beyond this knee-point each algorithm shows negligible performance improvement. The fact that all the knee-points appear before the completion of 700 cycles allows the use of the time surplus for extended sub-pixel processing and/or for the early termination of the ME process to reduce the power dissipation (at least for low resolution images). Regarding the algorithms' performance evaluation, the cycle budget analysis is in accordance with the
Fig. 14 Output bit-rate versus ME search time (clock cycles per MB) for 90 frames of the highway sequence (Q = 32)

PSNR analysis. PMVFAST exhibits the lowest bit rates during the entire search process. Note here that the results of the rst 100200 cycles (either way of minor importance) show peculiar performance due to the overhead of the initialization steps: advanced algorithms like MVFAST and especially PMVFAST show relatively lesser performance (than they are expected), because they execute a number of pre-calculations delaying the actual block comparisons useful for frame predictions. The above tests show a performance comparison regarding only integer-search ME algorithms. The subpixel prediction process has been deactivated during these tests and its effect on the performance has been tested separately. Figure 15 depicts the proposed ME engines performance improvement when using sub-pixel predictions. We construct three PSNR plots representing three different cases and using as example algorithm the MVFAST. The rst shows the performance of the proposed ME engine using no sub-pixel predictions. The second curve shows the performance of the proposed ME engine, which has been realized to involve only horizontal/vertical sub-pixel predictions. For sake of comparison we have included the third case showing the performance of the sub-pixel process checking every possible sub-pixel displacement. The third case is realized with a software bit accurate model developed for comparing the FPGA ME implementation performance with the bit accurate expected results. In the second and the third cases, the sub-pixel mode of operation is activated after the integer-search process termination. The results show a quality loss of the proposed sub-pixel approach with respect to the ordinary approach, because the proposed ME engine uses less subpixel candidates and 4-tap instead of 6-tap interpolation lters. 
This performance degradation is expected, as the proposed ME engine is designed to accommodate the computationally intensive sub-pixel process with a processing organization optimized with respect to the hardware resources (note that the sub-pixel processing organization occupies only 14% of the overall architecture). Finally, we evaluate the proposed ISA with respect to the number of instructions used to implement and execute fast ME algorithms. Table 4 presents the results of the execution of three algorithms (4SS, DS, MVFAST) on the proposed ISA, on a general purpose processor (GPP) ISA and on the ISA described in [17], using various CIF and QCIF test sequences. The results measuring the efficiency of the ISA include the total number of instructions per program (the static code segment size, i.e., the ROM depth) and the average number of cycles required to process a macroblock. Note here that by GPP ISA we refer to x86 architectures, and that the cycles counted in this case correspond to the number of the executed instructions (assembly

Fig. 15 ME engine performance for various sub-pixel configurations. [Figure: PSNR (dB) versus bitrate (Mbps) for the news sequence (300 frames), comparing the cases "All Sub-Pixels ON", "Diagonal Sub-Pixels OFF" and "All Sub-Pixels OFF"]

instructions, enumerated with profiling tools). Further, the programs executed by the GPP are the bit-accurate models of the programs implemented with the proposed ISA. As shown in Table 4, the customized architectures lead to shorter programs than the GPP architectures, and thus to programs with significantly fewer instruction fetches per macroblock. Reducing the instruction fetches results in significant clock cycle savings. This becomes very important in real-time applications: for instance, the proposed ME engine can use 16 of these cycles to examine a new search point. As also shown in Table 4, the proposed ISA is more efficient than the ISA described in [17]. Since [17] involves RISC-oriented instructions, it offers slightly better flexibility when programming a novel ME algorithm, but leads to much bigger programs with more instruction fetches than the ISA proposed in this paper. Overall, Table 4 verifies that, due to its efficient ISA, the parallel pixel processing and the speculative execution, the proposed ME engine requires substantially fewer cycles than other reprogrammable ME architectures.
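To put the measured cycle counts in context, the real-time cycle budget per macroblock follows directly from the frame parameters stated in the abstract (30 frames/s at 1024 x 768). The sketch below assumes, purely for illustration, a 100 MHz clock; the actual clock rate is not given in this section:

```python
def cycles_per_mb(width, height, fps, clock_hz):
    # Macroblocks per frame (16x16 blocks) times the frame rate gives the
    # macroblock rate; dividing the clock into it gives the cycle budget.
    macroblocks = (width // 16) * (height // 16)
    return clock_hz // (macroblocks * fps)

budget = cycles_per_mb(1024, 768, 30, 100_000_000)  # 100 MHz is an assumption
print(budget)  # roughly a thousand cycles available per macroblock
```

Under this assumed clock the budget is on the order of 1,000 cycles per macroblock, which is why the few-hundred-cycle figures reported for the proposed engine leave comfortable headroom, while tens of thousands of GPP cycles per macroblock do not.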

7 Conclusion

This work has presented the design and the implementation of a Motion Estimation processing engine, which is able to execute a variety of the most commonly used block-matching algorithms. The ME architecture is suitable for
Table 4 Static code size and runtime cycles per macroblock

Architecture       Four step         Diamond           MVFAST
                   Size    Cycles    Size    Cycles    Size    Cycles
GPP                1,793   87,219    1,670   73,332    2,493   45,870
Dias et al. [17]   365     5,248     460     4,352     744     2,944
Proposed           88      722       84      660       135     511

real-time and low area applications. The proposed design introduced a data cache based on an interleaved memory structure, which allows the parallel access of 16 pixels formed as a 4 x 4 pixel block. The data cache organization and the parallel pixel access lead to optimized memory bank utilization as well as optimized data movement from this cache memory to the Sum of Absolute Differences processing module. The design of the SAD processing module introduced the use of the speculation calculator, which at each clock cycle estimates the expected SAD metric. This technique, called speculative execution, exploits the expected SAD to allow the control to continue its operations instead of halting during the SAD operations, and furthermore, it leads to the continuous fetching of pixels from the data cache to the SAD module. These techniques and the SIMD parallel organization of the SAD processing module result in the elimination of bottlenecks and almost 100% use of the pipeline. They therefore achieve high throughput and efficient utilization of the pipeline and the other hardware resources. As a result, the execution of the PMVFAST and the MVFAST requires fewer cycles than those available to achieve the requested quality of service. The design of the instruction set has focused on developing commands which allow a large number of currently used ME algorithms to perform in real time. The instruction set provides four additional application-specific instructions to accommodate the special needs of advanced ME algorithms such as the MVFAST and the PMVFAST in reaching their real-time performance. Moreover, the design of the instructions allows the code of the ME process to avoid redundant searches and computations while the algorithm examines various positions in the search area. Hence, the instruction set leads to efficient ME algorithm execution and saves on the cost of an auxiliary memory.
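The datapath summarized above, 16 pixels fetched as a 4 x 4 block and reduced by a SIMD SAD unit, can be modeled behaviorally as follows (function names are ours, not the paper's; in the real engine each 4 x 4 tile costs one cycle, so a 16 x 16 macroblock takes 16 cycles per candidate search point):

```python
def sad_4x4(cur, ref, cy, cx, ry, rx):
    # One "cycle" of the SIMD datapath: 16 absolute differences over a
    # 4x4 pixel block, reduced by the adder tree into a partial SAD.
    return sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
               for i in range(4) for j in range(4))

def sad_macroblock(cur, ref, mb_y, mb_x, mv_y, mv_x):
    # A 16x16 macroblock is covered by 16 such tiles, i.e. 16 cycles
    # per candidate motion vector (mv_y, mv_x).
    return sum(sad_4x4(cur, ref, mb_y + ty, mb_x + tx,
                       mb_y + mv_y + ty, mb_x + mv_x + tx)
               for ty in range(0, 16, 4) for tx in range(0, 16, 4))

# Toy frames: the reference differs from the current frame by +2 per pixel,
# so the zero-motion SAD of any macroblock is 2 * 256 = 512.
cur = [[(3 * y + 5 * x) % 256 for x in range(32)] for y in range(32)]
ref = [[v + 2 for v in row] for row in cur]
print(sad_macroblock(cur, ref, 8, 8, 0, 0))
```

This software model is sequential; the hardware's interleaved memory banks are what make the 16 pixel reads of each tile conflict-free in a single cycle.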
Finally, the experimental data show that, compared to previously published results, the proposed ME engine is very efficient with respect to the resources and the features that it includes. Our future work will focus on how to exploit the key advantage of the design, namely the ability to choose the search algorithm at run time according to specified criteria. Based on the fact that the architecture allows the use of different search techniques at the level of processing a macroblock, research will be conducted on the efficiency of the search techniques with respect to speeding up the process and locating the best matching target block. The expected result is further improvement of the quality of service and of the performance of the video encoding.
Acknowledgments This work has been supported in part by the PAVET 05-181 project of the Greek General Secretariat for Information Systems.


References

1. Tourapis, A.M., Au, O.C., Liou, M.L.: Highly efficient predictive zonal algorithms for fast block-matching motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 10, October (2002)
2. Tourapis, A.M., Au, O.C., Liou, M.L.: Predictive motion vector field adaptive search technique (PMVFAST): enhancing block based motion estimation. In: Proceedings of Visual Communications and Image Processing (VCIP'01) (2001)
3. Liu, Y., Oraintara, S.: Complexity comparison of fast block-matching motion estimation algorithms. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, May (2004)
4. Cheung, C., Po, L.: Novel cross-diamond-hexagonal search algorithms for fast block motion estimation. IEEE Transactions on Multimedia, vol. 7, no. 1, February (2005)
5. Ruiz, V., Fotopoulos, V., Skodras, A.N., Constantinides, A.G.: An 8 x 8-block based motion estimation using Kalman filter. In: IEEE International Conference on Image Processing, October (1997)
6. Richardson, I.: H.264 and MPEG-4 Video Compression: Video Coding for Next Generation Multimedia. Wiley, London (2003)
7. Li, T., Li, S., Shen, C.: A novel configurable motion estimation architecture for high-efficiency MPEG-4/H.264 encoding. In: IEEE Proceedings of the ASP-DAC, January (2005)
8. Alfonso, D., Rovati, F., Pau, D., Celetto, L.: An innovative, programmable architecture for ultra-low power motion estimation in reduced memory MPEG-4 encoder. IEEE Trans. Consum. Electron. 48(3), 702-708 (2002)
9. Dutta, S., Wolf, W.: A flexible parallel architecture adapted to block-matching motion-estimation algorithms. IEEE Transactions on Circuits and Systems for Video Technology, February (1996)
10. Rahman, C.A., Badawy, W.: A quarter pel full search block motion estimation architecture for H.264/AVC. In: IEEE International Conference on Multimedia and Expo, July (2005)
11. Choi, J., Togawa, N., Yanagisawa, M., Ohtsuki, T.: VLSI architecture for a flexible motion estimation with parameters. In: IEEE Proceedings of the 15th International Conference on VLSI Design (2002)
12. Kim, M., Hwang, I., Chae, S.: A fast VLSI architecture for full-search variable block size motion estimation in MPEG-4 AVC/H.264. In: IEEE Proceedings of the ASP-DAC, January (2005)
13. Liu, Z., Huang, Y., Song, Y., Goto, S., Ikenaga, T.: Hardware-efficient propagate partial SAD architecture for variable block size motion estimation in H.264/AVC. In: ACM Proceedings of the 17th Great Lakes Symposium on VLSI, March (2007)
14. He, W., Zhao, M., Tsui, C., Mao, Z.: A scalable frame-level pipelined architecture for FSBM motion estimation. In: IEEE 20th International Conference on VLSI Design, January (2007)
15. Huang, Y.W., Chien, S.Y., Hsieh, B.Y., Chen, L.G.: Global elimination algorithm and architecture design for fast block matching motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 6, June (2004)
16. Ramachandran, S., Srinivasan, S.: FPGA implementation of a novel, fast motion estimation algorithm for real-time video compression. In: ACM Proceedings of the 2001 9th International Symposium on FPGAs, February (2001)
17. Dias, T., Momcilovic, S., Roma, N., Sousa, L.: Adaptive motion estimation processor for autonomous video devices. EURASIP J. Embed. Syst. 2007(1) (2007)
18. Beric, A., Sethuraman, R., Peters, H., van Meerbergen, J., de Haan, G., Pinto, C.A.: A 27 mW 1.1 mm2 motion estimator for picture-rate up-converter. In: Proceedings of the 17th International Conference on VLSI Design, vol. 17, pp. 1083-1088, January (2004)
19. Tourapis, A.M., Au, O.C., Liou, M.L.: Fast block-matching motion estimation using predictive motion vector field adaptive search technique (PMVFAST). ISO/IEC JTC1/SC29/WG11 MPEG2000/M5866, March (2000)
20. Babionitakis, K., Lentaris, G., Nakos, K., Vlassopoulos, N., Reisis, D., Doumenis, G., Georgakarakos, G., Sifnaios, J.: An efficient H.264 VLSI advanced video encoder. In: 13th IEEE International Conference on Electronics, Circuits and Systems, pp. 545-548, December (2006)
21. H.264 Advanced Video Coding for Generic Audiovisual Services. ITU-T, May (2003)
22. Waerdt, J., Slavenburg, G., Itegem, J., Vassiliadis, S.: Motion estimation performance of the TM3270 processor. In: Proceedings of the 2005 ACM Symposium on Applied Computing, pp. 850-856 (2005)
23. Wong, S., Stougie, B., Cotofana, S.: Alternatives in FPGA-based SAD implementations. In: IEEE International Conference on Field-Programmable Technology, pp. 449-452, December (2002)
Author Biographies
Konstantinos Babionitakis received his MSc in Electronics and Automation and his BSc in Physics from the Department of Physics of the National and Kapodistrian University of Athens. His research interests include parallel algorithms and efficient VLSI implementations.

Gregory A. Doumenis received his Diploma in Electronic Engineering (1990) and his PhD in broadband communication systems (1994) from the Electrical Engineering Department of the National Technical University of Athens (NTUA), Greece. Since 1998 he has been a Senior Research Associate of the Telecommunications Networks Laboratory of the NTUA. His research interests include multimedia systems and applications, performance analysis of embedded systems, broadband networks, digital systems architectures, and video coding and transmission. He co-founded GDT SA in 1999 and has been President of the Board and Chief Technology Officer since 2001.

George Georgakarakos was born in Athens, Greece in 1976. He received his Degree in Electrical Engineering and Computer Technology from the University of Patras in 2002. His diploma thesis focused on an advanced VLSI implementation of the Internet Protocol. He is experienced in FPGA-based system design.

George Lentaris received his degree in Physics and his MSc in Electrical Automation from the National and Kapodistrian University of Athens in 2004 and 2006, respectively. He is currently a PhD student in Computer Science at the Physics Department, NKUA. His research interests include algorithms and parallel architectures for digital signal processing.

Kostantinos Nakos is a PhD candidate at the Department of Physics of the University of Athens, with research subject Parallel and Real-Time Architectures for Signal, Image and Video Processing. In 2003 he was awarded his MSc in Electronics and Automation and in 2001 his degree in Physics, both from the Department of Physics in Athens.
His research interests lie in parallel algorithms, reconfigurable and parallel architectures, computational complexity and theory of algorithms, and computational geometry.

Dionysios Reisis received his Ptychion in Electrical Engineering from the University of Patras, Greece, in 1983, and his MSc and PhD degrees in Computer Engineering from the Department of Electrical and Computer Engineering of the University of Southern California,


USA, in 1989, with the PhD thesis Parallel Algorithms on Mesh Connected Computer with Static and Reconfigurable Buses. This research involved parallel mesh computers and algorithms for signal, image and graph processing, vision and VLSI architectures. In 1990 he started cooperating with the Telecommunications Lab of the Division of Computer Science of NTUA as a research associate. In 1991 he was elected Lecturer and in 1996 Assistant Professor at the Electronics Laboratory of the Department of Physics of the University of Athens (NKUA). His interests include parallel architectures and algorithms for image and graph signal processing with applications in the VLSI environment, as well as real-time hardware design and efficient algorithm design for telecommunication system support. Apart from his participation in projects (ONR, DARPA, NSF, RACE, ACTS, IST), he has publications in scientific journals (IEEE Transactions on Computers, on PAMI) and conferences (ICPP, IPPS, MIT on VLSI, CVPR, UUCP, ISS etc.) and has co-written

books on parallel processing. He is a reviewer for journals (IEEE Transactions, JPDC, SIAM Journal on Computing etc.) and conferences (ICPP, IPPS, CVPR etc.).

Ioannis Sifnaios was born in Athens, Greece in 1974. He received his diploma in Electrical Engineering in 1997 from the Electrical Engineering Department of NTUA. His research interests include digital systems architectures, hardware implementation of video codecs, telecommunication algorithms and protocols, and formal design techniques.

Nikolaos Vlassopoulos graduated from the Department of Physics at the National and Kapodistrian University of Athens in 2000. He continued his studies, attaining an MSc in Electronics Automation in 2002. He is currently studying for his PhD on the subject Parallel and Reconfigurable Architectures. His research interests lie in computational complexity and theory of algorithms, parallel algorithms, computational models and cryptography.

