
A Novel 32-bit Scalable Multiplier Architecture

Yeshwant Kolla
SUN Microsystems, Inc Burlington, MA, 01803, USA

Yong-Bin Kim
Dept. of ECE, Northeastern University, Boston, MA, 02115, USA

John Carter
School of Computing, University of Utah, Salt Lake City, UT, USA

Yeshwant.Kolla@sun.com

ybk@ece.neu.edu

retrac@cs.utah.edu

ABSTRACT
In this paper, we present a novel hybrid multiplier architecture that has the regularity of linear array multipliers and the performance of tree multipliers, and is highly scalable to higher-order multiplication. This multiplier topology is highly conducive to an electronic design automation (EDA) tool based implementation. A 32-bit version of this multiplier has been implemented in a 0.25µm technology using a standard ASIC design methodology and one variation of the standard design methodology. This 32-bit multiplier has a latency of 3.56ns.

Categories and Subject Descriptors


B.7.1 [Microprocessors and microcomputers]:

General Terms
Multiplier, Design, Performance

Keywords
Multiplier, Architecture, CMOS VLSI

1. INTRODUCTION

Hardware multipliers are among the most widely used datapath elements. This basic arithmetic unit is one of the most critical elements in high-speed microprocessors, digital signal processors, graphics processors, and many other application-specific processors. In addition, multiplication is a high-latency operation, requiring between 2 and 8 cycles to complete [1]. Consequently, the performance of a multiplier has a direct effect on the throughput and cycle time of a system. Conventional linear array multipliers have very good regularity, but the latency through these multipliers tends to be high: usually the number of levels of addition is equal to the number of bits of the multiplier. To improve on the latency of linear array multipliers, tree topologies [4, 2, 3] are used. Tree topologies use parallelism in performing the different steps of multiplication in order to obtain the speed-up. However, conventional tree structures are less regular than linear array topologies, and their interconnect wires tend to be long and complex. In addition, the varying lengths of the interconnects cause delay-balancing problems. This leads to significant glitching at the outputs of the logic gates, resulting in higher power dissipation. The longer interconnects that result from complex interconnections require larger transistors to drive the additional load. This increase in transistor sizes has a domino effect, rippling all the way to the source driver, resulting in an overall area increase. We propose a multiplier architecture with the performance advantage of trees and the regularity of linear array multipliers.

2. BINARY MULTIPLIER ARCHITECTURE

2.1 Generation of PPs

An m-bit Booth recoding reduces the number of partial products, and therefore the depth of addition, by almost a factor of m. But for 3-bit and higher recoding, the latencies involved in the generation of hard multiples like 3MD and 5MD are usually very high. This is because the 3MD and 5MD multiples are usually generated from the basic multiples MD and 2MD by a carry-propagate addition, which is a high-latency operation. In addition, for 2-bit recoding the multiplicand-multiple generation and multiplier scanning can be done in one step, which greatly simplifies the recoding mechanism. So the proposed multiplier architecture uses 2-bit modified-Booth recoding for the generation of PPs. A combined implementation of generating the multiples and scanning the multiplier group can be realized with the logic shown in Figure 1.
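The 2-bit modified-Booth recoding used for PP generation (Section 2.1) can be modeled in a few lines of software. The sketch below is illustrative only (the function name and the small unsigned operands are assumptions, not the paper's circuit): each overlapping 3-bit scan of the multiplier selects a signed digit in {-2, -1, 0, +1, +2}, so only the easy multiples MD and 2MD are ever needed.

```python
# Illustrative model of 2-bit (radix-4) modified-Booth recoding.
# Assumptions (not from the paper): unsigned operands, small enough
# that the top recoded digit is non-negative.
def booth2_partial_products(md: int, mr: int, bits: int = 32):
    """Return the shifted, signed partial products of md * mr."""
    digit_of = {0b000: 0, 0b001: +1, 0b010: +1, 0b011: +2,
                0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    mr_ext = mr << 1                    # implicit 0 below the LSB
    pps = []
    for i in range(0, bits, 2):
        group = (mr_ext >> i) & 0b111   # overlapping 3-bit scan
        pps.append((digit_of[group] * md) << i)  # only MD and 2MD needed
    return pps

# The bits/2 partial products sum to the full product:
assert sum(booth2_partial_products(12345, 6789)) == 12345 * 6789
```

The model produces bits/2 recoded digits; the additional partial product that full modified-Booth recoding introduces is what motivates the 9-to-2 compressor stage in Section 3.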

2.2 Summation of PPs


The novelty of the proposed multiplier architecture lies in the summation of PPs. The proposed technique, called hybrid tree addition using 8-to-2 compressors, implements a hybrid structure using a combination of linear sub-array addition and tree addition. One of the biggest drawbacks of a conventional tree structure is its scalability to higher-order multiplication: the tree topology is well suited for 16-bit or lower-order multipliers, but the routing complexity increases drastically beyond 32-bit multiplication. To overcome the scalability issues, we propose to divide the summation array into 8-bit sub-arrays and to add these sub-arrays using a tree structure. Dividing the array into 8-bit sub-arrays greatly increases the scalability of the tree topology. The 8-bit sub-arrays are added up

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GLSVLSI'03, April 28-29, 2003, Washington, DC, USA. Copyright 2003 ACM 1-58113-677-3/03/0004 ...$5.00.


simultaneously using a structure called the 8-to-2 compressor. The following sub-section discusses the design of the 8-to-2 compressor.

Figure 1: 2-bit Modified Booth Partial Product Generator (inputs: the multiplicand bits and the M and 2M multiples, plus the multiplier recode bits Lb and Hb; output: one partial product row).
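To make the sub-array scheme concrete, the reduction can be tracked as simple row-count bookkeeping. The numbers below (17 partial-product rows for a 32-bit multiplier, split into a 9-row and an 8-row group) follow the organization described in Section 3; the code is an illustrative sketch, not the hardware.

```python
# Illustrative row-count bookkeeping for the proposed hybrid summation.
rows = 17                        # PP rows for a 32-bit multiplier
sub_arrays = [9, 8]              # 9-to-2 and 8-to-2 compressor groups
assert sum(sub_arrays) == rows

# Each k-to-2 compressor reduces its sub-array to a carry-save pair,
# so 2 * len(sub_arrays) rows reach the top-level tree.
rows_left = 2 * len(sub_arrays)  # 4 rows
while rows_left > 2:             # 4-to-2 stages merge pairs of pairs
    rows_left = 2 * (rows_left // 4) + rows_left % 4
assert rows_left == 2            # final carry-save product (sum + carry)
```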

2.2.1 8-to-2 Compressor


Figure 2: A weight-n 8-to-2 Compressor Block Diagram (primary inputs PIn(n) and stage-propagate inputs SPIn(n); primary outputs POut(n) and POut(n+1), and stage-propagate outputs SPOut(n+1)).

Table 1: 8-to-2 Compressor Truth Table. Note: Nin(n) represents the total number of inputs with weight n equal to 1, and Nout(n+1) represents the total number of outputs with weight n+1 equal to 1.

Nin(n):     0  1  2  3  4  5  6  7  8  9 10 11 12 13
Nout(n+1):  0  0  1  1  2  2  3  3  4  4  5  5  6  6
Nout(n):    0  1  0  1  0  1  0  1  0  1  0  1  0  1
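The counting behavior recorded in Table 1 (every pair of ones entering at weight n leaves as a single 1 at weight n+1, and a leftover 1, if any, stays at weight n) can be checked with a small model. The function below is an illustrative sketch, not the gate-level design:

```python
# Counting model of a weight-n 8-to-2 compressor column, per Table 1:
# up to 13 ones arrive at weight n (8 primary + 5 stage-propagate
# inputs); floor(N/2) ones leave at weight n+1 and N mod 2 at weight n.
def compress_column(ones_in: int) -> tuple[int, int]:
    assert 0 <= ones_in <= 13
    return ones_in // 2, ones_in % 2    # (Nout(n+1), Nout(n))

# The column preserves arithmetic value: N = 2*Nout(n+1) + Nout(n).
for n in range(14):
    hi, lo = compress_column(n)
    assert 2 * hi + lo == n
assert compress_column(13) == (6, 1)    # last row of Table 1
```

The invariant 2*Nout(n+1) + Nout(n) = Nin(n) is what makes the compressor a lossless carry-save reduction step.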
A block diagram of an 8-to-2 compressor with a weight of n is shown in Figure 2. This structure takes eight primary inputs (PIn) and produces two primary outputs (POut). The eight primary inputs and one of the primary outputs have a weight of n, while the other primary output has a weight of n+1. This compressor also has five stage-propagate inputs (SPIn) of weight n and five stage-propagate outputs (SPOut) of weight n+1. The truth table for this structure is shown in Table 1. The 8-to-2 compressor can be implemented in a number of different ways. One obvious approach is a direct logic implementation of the truth table shown in Table 1, using logical optimizations to reduce the number of levels of logic along the critical path. Though this approach

may provide the smallest number of gate delays, interconnect delays tend to be excessive because of the lack of emphasis on the regularity of the design. A more regular and structured approach is to use carry-save adders as the basic building block, composing the structure in either a tree topology or a linear topology. A tree-like implementation of the 8-to-2 compressor is shown in Figure 3. This structure is more regular and more structured than the direct logic implementation, but still has some amount of interconnect and layout complexity.

Figure 3: Tree 8-to-2 Compressor Block (six 3-to-2 adders, with stage-propagate connections to the previous and next stages).

Figure 4 shows a linear implementation of the 8-to-2 compressor. This is a highly regular structure and the routing is fairly simple. Though the linear 8-to-2 compressor has six carry-save adders along its critical path, in comparison to four in the case of the tree implementation, the linear structure is easily implementable and highly scalable. The linear 8-to-2 compressor is also very conducive to a bit-slice layout, helping decrease layout time.

Figure 4: Linear 8-to-2 Compressor Block (a chain of six 3-to-2 adders, with stage-propagate connections to the previous and next stages).

3. 8-TO-2 TREE

The top-level tree structure that adds the results from the 8-bit sub-arrays to produce the final carry-save product is shown in Figure 5. As the figure depicts, the implementation requires a stage of 9-to-2 compressors in addition to a stage of 8-to-2 compressors. The 9-to-2 compressor stage is needed to account for the additional partial product arising from performing modified-Booth recoding. Creating a 9-to-2 compressor from a linear 8-to-2 compressor is fairly simple. Though this multiplication topology using linear 8-to-2 compressors has an additional delay of two CSAs compared to an ideal tree topology, this performance impact should be overcome by the longer wire delays in tree topologies due to complex routing and lower regularity. In addition, though the total number of CSAs required by the 8-to-2 tree is almost on par with that required by conventional tree topologies, the complexity of routing in the latter should result in a higher routing track requirement, requiring larger area. This tree topology ensures that the depth of the tree structure at the top level is not more than two stages for a 32-bit multiplier; beyond three stages the routing becomes significantly complicated. Depending on the performance requirements of the multiplier, each of the sub-arrays can be implemented in either a linear or a tree approach. Also, depending on the bit-width of the multiplier, the summation tree is divided into multiple levels of sub-arrays. This reduces the length and complexity of global routes while absorbing some of the complexity locally within the sub-arrays. Finally, the proposed tree structure scales very well to higher-order multipliers, while the routing complexity of conventional tree multipliers increases exponentially with the order of the multiplier. Ease of scaling and lower routing complexity become very important factors when using design automation tools.

4. RESULTS

The performance of the multiplier implemented using a standard ASIC design methodology and a variation, namely the base-cell methodology, are compared for the typical-case library and operating conditions. In the base-cell flow, carry-save adders are used as the basic building blocks for the multiplier. The results presented are obtained using a 0.25µm technology. The performance analysis is presented using two common design metrics: area and timing. The results obtained after just the synthesis stage and after the complete physical design stage are presented. The synthesis and place & route results are presented for the standard methodology and the base-cell variation in a bottom-up fashion, i.e. starting at the base-cell level. The Compress82 block generated using the standard methodology creates a very complex interconnect-based design as compared to the very regular structure created by the base-cell design methodology. Due to the way timing-critical designs are handled by automated tools, the standard flow generates a logically-optimal design by minimizing the number of levels of logic along the critical path. This makes the designs slightly faster, but in order to reduce logic along the critical path, the design usually ends up having complicated wiring. As a result of the increased wire loads due to the complicated wiring, the drivers along these paths need to be upsized to make them meet path timing requirements. This upsizing of the drivers, compounded by the rippling effect of upsizing downstream drivers along these paths, results in a huge area penalty. This effect can be seen in Tables 2 and 3. In the case of Compress82 in Table 2, a speed-up of 16% is obtained at the expense of a 290% area penalty, and in the case of Compress92 in Table 3, a speed-up of 30% is obtained

Figure 5: A Section of a 32-bit 8-to-2 Tree Structure (input buses feed a stage of 9-to-2 compressors and a stage of 8-to-2 compressors, whose carry-save outputs are merged by 4-to-2 compressors onto the output bus).

at the expense of a 254% area penalty. These tables include both the after-synthesis and after-place-&-route numbers to show the advantages of timing-driven place & route, which provides better timing than synthesis but can take a bigger area penalty due to upsizing some drivers based on wire lengths.

Table 2: Performance comparison of the Compress82 base cell for the standard methodology and the variation, after synthesis and after place & route.

                      After Synth              After P&R
                  Standard   Variation    Standard   Variation
Area (in µm²)      3119.82      820.20     3678.42      943.71
Timing (in ns)        3.84        4.29        2.51        2.92

Table 3: Performance comparison of the Compress92 base cell for the standard methodology and the variation, after synthesis and after place & route.

                      After Synth              After P&R
                  Standard   Variation    Standard   Variation
Area (in µm²)      3254.02      953.91     4602.94     1298.88
Timing (in ns)        4.00        5.00        2.62        3.39

5. CONCLUSION AND DISCUSSION

In the case of the multiplier, the synthesis results provide a more accurate picture of the improvements of the base-cell variation over the standard flow because of some tool shortcomings. The base cells are treated as sub-blocks within the multiplier, and the physical design tool requires a certain amount of keep-out area around sub-blocks. This area around each base cell results in a huge penalty for both timing and area. One way to improve timing would be to add these base cells to the library so that the physical design tool can treat them as leaf cells not requiring the keep-out area. From Table 4, the multiplier design after synthesis generated using the standard flow has a speed-up of 14% with an area penalty of 66% over the base-cell variation, whereas the design after place & route using the standard flow has a speed-up of 33% with an area penalty of just 46%.

Table 4: Performance comparison of the 32-bit Multiplier for the standard methodology and the variation, after synthesis and after place & route.

                      After Synth              After P&R
                  Standard   Variation    Standard   Variation
Area (in µm²)    234986.95   141663.92   463080.25   317419.56
Timing (in ns)        4.93        5.60        3.56        4.72

6. REFERENCES

[1] H. Al-Twaijry and M. Flynn. Multipliers and datapaths. Technical report, Stanford University, December 1994.
[2] L. Dadda. Some schemes for parallel multipliers. Alta Frequenza, 34(5):349-356, March 1965.
[3] M. Santoro. Design and Clocking of VLSI Multipliers. Ph.D. thesis, Stanford University, October 1989.
[4] C. Wallace. A suggestion for a fast multiplier. IEEE Transactions on Electronic Computers, EC-13:14-17, February 1964.
