
16-bit Booth Multiplier with 32-bit Accumulate
Marc Mosko CMPE223 Independent Study December 1, 2000
Table of Contents

Introduction
Basic Design
  Performance Estimates
  Booth Multiplier
VHDL Source Code
  Code Overview
  I/O Register Design
  Example Register Access
  Source Code
  Source Code Hierarchy
  VHDL Code Versions
Overflow Logic
Magic Layout
  Design Hierarchy
  RSIM Calibration
  Optimization
References
VHDL Source Code
  Addcell.vhd
  Adder.vhd
  Booth.vhd
  claN.vhd
  driverN.vhd
  latch.vhd
  mult.vhd
  mult_cla.vhd
  mult_pipe.vhd
  multreg.vhd
  pp.vhd
Layout and Schematics
  Overall Floorplan
  Addcell
  Booth Encoder
  Csa_cond
  csa_first
  Csa_mid
  Csa_last
  Csa_2
  Csa_4
  Csa_4b
  Csa_8
  Fa_tg
  Fa_tg2
  Invchain
  Mcell
  Ppmux
  Ppmuxfa
  Rwire
  Wiring cells (passive)

Introduction
This report presents three main topics we investigated as part of a project to build a Booth encoded multiply/accumulate VLSI chip. The original scope of work included synthesizing VHDL code using the Mentor Graphics tools. Exemplar was the VHDL compiler. Leonardo Spectrum was the synthesizer. Since my team, which included Kevin Delaney, did not meet a Mosis deadline, our chip funding was lost. Since we did not actually fabricate a chip, we cannot discuss the success of our results. Likewise, VHDL synthesis using the Exemplar tools was not very successful, so we do not discuss synthesis results except in passing. The main points we cover are the basic architecture, our VHDL code, and a Magic layout in place of logic synthesis. The work presented here, except as cited, is almost entirely my own. Teamwork with Kevin Delaney had some influence on the VHDL code, since he was primarily working on the synthesis portion of the project.
Due to length considerations, we have attached VHDL code only for our main modules and have not included any of the test scripts or stimulus files. These are available on-line or via the CSE file system, and are very similar to other work we have submitted. We have also excluded material previously reported in our report of Spring 2000.
This report is available on-line at http://www.cse.ucsc.edu/~mmosko/cmpe223/report2. There is also a mirror under /projects/kestrel/users/mult/marc/report2. The web URL and file mirror
contain all source code, test suites, and other material not reprinted here.
Basic Design
The goal of the multiplier is to compute X[15:0] * Y[15:0] + W[31:0] = Z[31:0] and OVRFLW. OVRFLW is the multiply-accumulate overflow. We discuss OVRFLW in more detail below. It is not simply the carry-out of the final addition.
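The arithmetic goal can be captured in a few lines of behavioral code. The sketch below is only a reference model for Z, assuming X and Y are 16-bit two's-complement values and W is a plain 32-bit value; the function names are hypothetical and OVRFLW is deliberately left out, since its definition is discussed separately. The operands come from the stimulus-file example in the Example Register Access section (Y is the bit pattern 1010111101100101).

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def mac(x, y, w):
    """Return Z[31:0] for X*Y + W, with X and Y taken as signed 16-bit values.

    OVRFLW is not modeled here (assumption: only the 32-bit result is needed).
    """
    product = to_signed16(x) * to_signed16(y)
    return (product + (w & 0xFFFFFFFF)) & 0xFFFFFFFF


# Operands from the register-access example: X = BE43, Y = AF65, W = 2D04417F.
print(hex(mac(0xBE43, 0xAF65, 0x2D04417F)))  # prints 0x41b71eee
```

A model like this is convenient for generating expected values for boundary and random test cases.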
Our multiplier is based on a Booth encoded array multiplier design in [3,4]. The 32-bit adder we use for the final addition is from [1,2,4]. We used a Carry-Select Adder (CSA) since it has a fairly regular layout and good performance.
The VHDL design is a 3-stage pipeline with I/O registers and a common 16-bit I/O bus. A complete transaction takes 7 cycles: load X, load Y, load W_H, load W_L, Multiply, read Z_H, read Z_L. Our design can pipeline the multiply with loading a value, such as the next operation's X, so in a stream we are down to 6 cycles. The 6 or 7 cycle length is a limitation of the 16-bit I/O bus, which follows from the pin count of the die.
The Magic layout is a non-pipelined design in the AMI 0.5 micron, 5 volt process. It does not have I/O registers or a pad frame. This design is double-rail throughout (except input and final output) and uses pass-logic as much as possible. The original design in [3] uses CPL logic chains sometimes 3 deep. We had many errors in RSIM with this design. We initially thought that RSIM could not handle this style, so we limited the CPL logic to 1 level before restoring with inverters. We found rather late in the design that RSIM does handle the pass-logic style. Our errors were from improperly sized transistors that did not pass 1 or sometimes 0 with enough force to drive the whole CPL NMOS chain. [3] also uses cross-coupled minimum sized PMOS latches to restore the swing to output inverters. RSIM did not correctly simulate the swing restore, so we had to remove the cross-coupled latches.
We have verified correct operation of both the VHDL and Magic circuits with several boundary cases and 10,000 random multiply/accumulates. The VHDL test cases ran through the I/O registers while the Magic cases were raw arithmetic computations. The Magic layout had many problems, particularly with the carry-select adder design, which uses pass-logic. As of the writing of this report, we verified 10,000 random cases on the Magic layout with one error. We have fixed that error, but do not have time to rerun the whole batch. It takes about 8 hours to run all cases (we used four machines at 2 hours each).
Performance Estimates
Based on timing estimates from Leonardo Spectrum, we believe the VHDL system will run at about 48 MHz on a 2-phase clock (about 7ns per phase, 3 phases). We do not believe these estimates are very good, since Spectrum does not easily incorporate a 2-phase clock. It also does not work with transparent latches, so we had to replace all latches with non-transparent DFFs. A fair amount of guesswork went into creating the stimulus files.
The Magic layout runs at 181 MHz (5.5 ns) for the whole 16x16+32 operation, over 3 times faster than the VHDL synthesized design. This number is based on the slowest of 10,000 random multiply-accumulate operations and the path output for OVRFLW. Since OVRFLW depends on Z[31], it is the slowest signal in the system. In general, we have each transition down to 0.1ns or less. There are a handful of 0.2ns transitions in the critical path. The critical path has 61 transitions.
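The quoted figures can be cross-checked with simple arithmetic. The sketch below assumes that the 48 MHz estimate corresponds to three phases of about 7 ns each and that the 5.5 ns Magic delay is spread over the 61 critical-path transitions; the variable names are only illustrative.

```python
# VHDL estimate: three phases of about 7 ns each (assumed reading of the text).
vhdl_period_ns = 3 * 7.0
vhdl_mhz = 1000.0 / vhdl_period_ns

# Magic layout: 5.5 ns for the whole 16x16+32 operation.
magic_period_ns = 5.5
magic_mhz = 1000.0 / magic_period_ns

speedup = vhdl_period_ns / magic_period_ns
avg_transition_ns = magic_period_ns / 61  # 61 transitions in the critical path

print(f"VHDL estimate : {vhdl_mhz:.1f} MHz")         # 47.6 MHz
print(f"Magic layout  : {magic_mhz:.1f} MHz")        # 181.8 MHz
print(f"Speedup       : {speedup:.1f}x")             # 3.8x
print(f"Avg transition: {avg_transition_ns:.3f} ns") # 0.090 ns
```

The average of roughly 0.09 ns per transition agrees with the statement that most transitions are at 0.1 ns or less.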
Because of uncertainty in both the Leonardo timing and the RSIM calibration, it is possible that both results are substantially off. The section RSIM Calibration describes our approach to calibrating RSIM for the AMI C5N 0.5 micron process.
The original fast adder in the Magic layout was a 2-2-4-4-4-8-8 CSA adder. This design, based on [2], assumes that all inputs arrive at about the same time. That is not the case here.
Generally, the last bit to the adder is around Z[16], so one might wish to experiment with a 2-24-4-2-2-4-4-4 or other variant. Please see our comments on the CSA adder in the Optimization section below.
We had time to try a 2-2-4-4-4-4-4-8 adder, and our maximum time dropped from 7ns to 6.1ns for 4EF9 x E1DC + 287CF2D0 (we have since reduced the time even further). Our intuition above was correct. We have not had time to experiment with other adder configurations.
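To make the block notation concrete, the following functional sketch models a carry-select adder with a configurable list of block widths. It assumes the widths are listed from the least significant block upward, which is the usual convention but is not stated in the text, and it models only the logic, not the timing that motivated the different configurations.

```python
def carry_select_add(a, b, blocks, width=32):
    """Add two unsigned integers with a carry-select structure.

    blocks lists the block widths, least significant block first (assumed).
    Returns (sum modulo 2**width, carry_out).
    """
    if sum(blocks) != width:
        raise ValueError("block widths must add up to the adder width")
    result, carry, shift = 0, 0, 0
    for size in blocks:
        mask = (1 << size) - 1
        a_blk = (a >> shift) & mask
        b_blk = (b >> shift) & mask
        # In hardware both candidate sums are computed in parallel.
        sum0 = a_blk + b_blk      # candidate for carry-in = 0
        sum1 = a_blk + b_blk + 1  # candidate for carry-in = 1
        chosen = sum1 if carry else sum0  # mux driven by the real carry
        result |= (chosen & mask) << shift
        carry = chosen >> size
        shift += size
    return result, carry


original = [2, 2, 4, 4, 4, 8, 8]
revised = [2, 2, 4, 4, 4, 4, 4, 8]
print(carry_select_add(0xFFFFFFFF, 1, original))  # (0, 1)
print(carry_select_add(0xFFFFFFFF, 1, revised))   # (0, 1)
```

Both configurations compute the same sum; they differ only in where the select muxes sit, which matters when the adder inputs arrive at different times.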
Booth Multiplier
The multipliers presented in [3,4] are essentially the same. [3] refines [4] with comparisons of transistor logic families (e.g. CMOS vs. CPL) and presents optimized transistor sizes for a 6b x 6b multiplier. [3] used a 0.8 micron CMOS in BiCMOS process at 3.3V and 20 MHz. In general, we reused the sizes in [3], but only as relative measures, since we have a smaller technology. While this is not optimal, it gave us a quick basis, where transistors are approximately scaled to one another. We present schematics of each component in a later section with the Magic cell layouts. For now, we wish to present only a high-level floorplan.
Figure 2 shows the floorplan of a 6-bit Booth multiplier with 12-bit accumulate. It is essentially the same as the 16-bit multiplier. We will use the 6-bit version in our present discussion, since the floorplan fits on a single page. The vertical dashed lines are continuations of the X inputs. We used dashed lines to make it easier to see the regular wiring pattern.
The 12-bit accumulate requires a set of full adders as shown. The first five bits of the accumulate, W[4:0], use the first Booth row for addition. W5 cannot be added with X5. X5 is a sign bit but W5 is not. Therefore, we must add W5 with an adder on the outside of the array. A standard array multiplier has a fast adder outside the array. Along the bottom of the array, a sum bit is Z_i = C_j + S_{j+1}. S and C are the sum and carry outputs of the bottom Booth row ppfa components. In our case of a 6-bit multiplier, j = i-5. To add in a third bit, W_i, we use a full adder to compute the sum S_i and carry C_i from C_j + S_{j+1} + W_i. We may then use a fast adder to compute Z_i = C_{i-1} + S_i.
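The reduction of three operands to two with a row of independent full adders can be sketched as follows. The model assumes the bottom-row carry and sum vectors have already been aligned to the same bit weights (the j versus j+1 indexing in the text), and the function names are illustrative only.

```python
def full_adder_row(a, b, c, width):
    """Apply one full adder per bit position; no carry ripples between bits."""
    mask = (1 << width) - 1
    a, b, c = a & mask, b & mask, c & mask
    sum_bits = a ^ b ^ c                         # S_i for every bit position
    carry_bits = (a & b) | (a & c) | (b & c)     # C_i for every bit position
    return sum_bits, carry_bits


def accumulate(array_carry, array_sum, w, width):
    """Add W to the array outputs: full-adder row first, then one fast add."""
    s, c = full_adder_row(array_carry, array_sum, w, width)
    # The fast adder adds the carry out of bit i-1 to the sum of bit i.
    return s + (c << 1)


print(accumulate(0b1011, 0b0110, 0b1101, 4))  # 30, i.e. 11 + 6 + 13
```

Because every full adder in the row works independently, the only carry propagation left is inside the single fast adder.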
We had a choice of using 4:2 compressors or the style previously described. We chose the latter style since it allowed us to use a fast adder (CSA in this case) for the long addition and single full adders in parallel. Since the single full adders have no ripple carry, this style seems to work well.

[Figure 1. PPFA Floorplan. Labels on the ppfa cell: bth_in, pp_out, Carry, X_i, W_i, X_{i-1}, Sum.]

The design in [3,4] uses a sign extension mechanism between Booth-encoded rows. There is no constant offset, as in the current Kestrel multiplier. The sign extension uses the partial product output (pp_out) from the left-most column of the previous row. The pp_out output is the output of the partial product mux before the adder. Using this and the ff output of the previous row's sgn component, the sign extender computes new outputs to carry the sign to the next row. This technique uses one additional column in the array.

[Figure 2. General layout of Booth multiplier with accumulate. The 6-bit floorplan has inputs x0 to x5, y0 to y5, and w0 to w11. It shows three Booth rows, each with a 5-bit Booth encoder, a sgn cell, partial-product cells (pp, ppfa), and an add cell; the encoders take (gnd, y0, y1), (y1, y2, y3), and (y3, y4, y5). One HA and six FA cells sit alongside w5 through w11, and a 12-bit CSA adder produces the final sum.]

VHDL Source Code

This section presents the most recent VHDL source code for a pipelined Booth encoded multiplier. The code presented below may be found in the directory http://www.cse.ucsc.edu/~mmosko/cmpe223/report2/vhdl. There are several other versions under /projects/kestrel/users/mult/marc/vhdl/booth-1. The code presented here is mostly based on the leo directory (short for Leonardo, the synthesizer). The last part of this section makes brief comments on the other versions.
We used C++ to model the Booth multiply/accumulate before writing the VHDL code. The link is http://www.cse.ucsc.edu/~mmosko/cmpe223/report2/cpp. There are three versions. The first
is an 8-bit adder using 4:2 compressors. The second is a 16-bit also with 4:2 compressors. The third is a 16-bit with a fast adder. We shall not discuss this code any further in the interests of space.
Code Overview
The multiplier implemented by the VHDL code is shown in Fig. 4. The figure does not show all the details, such as clocking or tri-state register control. All I/O signals are phi_2, since we
assume there will be setup time in phi_1 and that the external system is single phased.

Signal      Direction  Purpose
busio_s2h   In/Out     16-bit input/output from multiplier.
ovrflw_s2h  Out        OVRFLW output
bussel_s2h  In         3-bit multiplexed register selection.
bs_s2h      In         Input from bus (H) or multiplier (L). Applies to Z registers.
rw_s2h      In         Read from bus (H) or write to bus (L).
me_s2h      In         Multiplier Enable (perform the calculation)
rst_s2h     In         Reset all registers to 0.
clk         In         phi_1H, phi_1L, phi_2H, phi_2L

The VHDL code mirrors the design in Fig. 4 as closely as possible. We made some abstractions. An abstract data type in VHDL replaces the one-hot booth encoding. This allows the synthesizer to use whatever technique it chooses. The adders are abstract + signs, not actual fast adder implementations. The synthesizer may then use whatever style is appropriate.
Referring to Fig. 4, there are two main sections to the multiplier. The top section is the chip I/O consisting of six 16-bit registers, an overflow register, a 16-bit common I/O bus, and control signals. The bottom section is the pipelined Booth multiplier. The multiplier begins computations on phi_2 and the results are ready on the following cycle's phi_2. We discuss the I/O registers in detail in the next section. Here, we will focus on the multiplier.
The first pipeline stage consists of four Booth encoders and an array multiplier. The outputs of the array multiplier are latched at the end of phi_2. The first pipeline stage accumulates on the lower 15 bits of W. The 16th bit cannot be accumulated here, since it would interfere with the multiplier's sign bit. W[15] is not a sign bit, but X[15] is a sign bit.
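For readers unfamiliar with Booth encoding, the sketch below shows the standard radix-4 recoding that eight encoders (four per pipeline stage) would perform on a 16-bit Y. It assumes the five Booth control values correspond to the digits -2, -1, 0, +1, and +2, which is the conventional choice but is an assumption here, and it models only the arithmetic, not the one-hot signals or the sign extension.

```python
def to_signed(value, width=16):
    """Interpret the low `width` bits of value as two's complement."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def booth_digits(y, width=16):
    """Radix-4 Booth recode of y into width/2 digits, each in -2..+2."""
    y &= (1 << width) - 1
    digits = []
    prev = 0  # implicit bit to the right of y[0] (gnd in Fig. 2)
    for i in range(0, width, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)  # y[i-1] + y[i] - 2*y[i+1]
        prev = b1
    return digits


def booth_multiply(x, y, width=16):
    """Sum the Booth partial products; each row is weighted by 4**row."""
    x_signed = to_signed(x, width)
    total = 0
    for row, digit in enumerate(booth_digits(y, width)):
        total += (digit * x_signed) << (2 * row)
    return total


print(len(booth_digits(0xAF65)))       # 8 digits: rows 0-3 in stage 1, rows 4-7 in stage 2
print(booth_multiply(0xBE43, 0xAF65))  # 347266415
```

Each digit selects 0, ±X, or ±2X as a partial product, so a 16-bit multiplier needs only eight rows of partial-product cells instead of sixteen.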
The second pipeline stage consists of four more Booth encoders reading from latched Y values. The computation begins on phi_1 and the results are latched at the end of the phase. There is another array multiplier, which continues the multiplication process. There is also an unsigned
8-bit adder to sum the results from the first pipeline section. Leonardo synthesized an Inverted Nibble adder. Since this addition is independent of the results in the second pipeline stage, we can perform this addition with little overhead.
According to Leonardo timing estimates, the 8-bit adder is not necessarily free. The 8-bit adder takes a comparable amount of time to the 4-row Booth multiplier. In fact, in some timing runs the 8-bit adder took longer than the multiply, indicating that it might not be a good idea to try the addition in this pipeline stage. Because of problems we had with the Leonardo timing estimates, we did not finish an analysis of this question. We would surmise that since all pipeline stages have the same period, the second pipeline stage with 8-bit accumulate would still take less time than the 24-bit accumulate in the third stage.
The third pipeline stage contains an array of 16 full adders that compute, in parallel, the accumulate of W[31:15]. The results of this accumulate are then summed in a 24-bit fast adder. Leonardo synthesized a Carry-Look-Ahead (CLA) adder. The third pipeline stage also generates the signal medly_s2h (not shown), which clocks in the Z value to registers 4 and 5.
I/O Register Design

The I/O registers follow the schematic in Fig. 3. The signal medly_s2h is used to clock in the value from multin_v2h. It only applies to the Z registers. The multiplier generates the delayed multiplier enable signal, medly, as part of the pipeline. The output signal store_s2h feeds both multout_s2h and busout_s2h. The tri-state drivers for the I/O bus are located in different VHDL code because of problems we had with the VHDL compiler. The register drives the bus when sel_s2h and not(rw_s2h) is true, otherwise it is tri-state.

[Figure 3. Multiplier I/O Register. Elements shown: two 2:1 muxes and a DFF (D, CLK, RST, Q). Signals shown: multin_v2h, busin_s2h, bs_s2h, csel_q2h, csel_s2h, medly_s2h, rden_s2h, sel_s2h, rw_s2h, w2_s2h, store_s2h, rst_s2h, rst_q2h, phi_2h.]

Example Register Access

The sample stimulus file below shows how the register control signals work. The first row is a header row for our present explanation. Line is the hex line number, from file 30 (see below), mult.txt. The Line number is used for debugging the VHDL and serves no other purpose.

Line-----A-Bus output-------Bus input--------BBB-C-D-E-F
00000122---1011111001000011-1011111001000011-000-1-1-1-0
00000123---1010111101100101-1010111101100101-001-0-1-0-0
00000124-0-1011111111111111-ZZZZZZZZZZZZZZZZ-100-1-0-0-0
00000125-0-0000000000000010-ZZZZZZZZZZZZZZZZ-101-1-0-0-0
00000126---0010110100000100-0010110100000100-010-1-1-0-0
00000127---0100000101111111-0100000101111111-011-1-1-0-0
00000128---0000110110010110-0000110110010110-000-1-1-1-0
00000129---1001001000110001-1001001000110001-001-0-1-0-0
0000012a-0-0100000110110111-ZZZZZZZZZZZZZZZZ-100-1-0-0-0
0000012b-0-0001111011101110-ZZZZZZZZZZZZZZZZ-101-1-0-0-0

Column A is the expected carry out (ovrflw_s2h), which is set when reading from the multiplier. Bus Output is the expected bus output. Bus input is the external driver to the bus. Column B is the bussel_s2h signal. Column C is the bs_s2h signal. Column D is the rw_s2h signal. Note that rw_s2h only affects the Z registers. Column E is the me_s2h signal. Column F is rst_s2h.

Prior to line 122, values were loaded into the X, Y, and W registers. On line 122, we enable me_s2h, which latches the X, Y, and W register values into the multiplier. Because of our I/O register design, we may simultaneously load a new value into a register while reading the register. Line 122 loads a new value 1011111001000011 (BE43) into the X register by selecting register 0 via bussel_s2h and asserting rw_s2h.

Line 123 loads a new value into the Y register and simultaneously stores the multiplier output into the Z registers. The new value 1010111101100101 (AF65) is stored in the Y register by selecting register 1 via bussel_s2h and asserting rw_s2h. The multiplier result is stored in both Z registers by lowering bs_s2h such that registers 4 and 5 read from their multiplier input rather than the bus. This also clocks in the ovrflw value to its D-flipflop. When bs_s2h is 0, one does not need to select the registers with bussel_s2h. If one wanted to store a value in a Z register from the bus, one would need to use the bussel_s2h and rw_s2h signals.

Lines 124 and 125 read the Z values from registers 4 and 5. This is done by using bussel_s2h and lowering rw_s2h to write to the tri-state bus. We actually drive ovrflw_s2h at all times from a D-flipflop. In our test stimulus file, column A is marked "-" (don't care) on lines other than 124, 125, 12a, and 12b.
Lines 126 and 127 load the W values (2D04417F). Lines 128 and 129 are similar to lines 122 and 123. They load the next X and Y values and compute BE43 * AF65 + 2D04417F = 41B71EEE. We read the Z values in lines 12a and 12b.
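The fixed-width stimulus rows are easy to decode mechanically. The parser below is a hypothetical helper, not part of the report's test scripts; it assumes the column positions of the ten sample rows shown in this section and maps both "Z" (undriven bus) and "-" (don't care) to None.

```python
def parse_stimulus(line):
    """Split one fixed-width stimulus row into named fields."""
    def bits(text):
        # 'Z' marks an undriven bus and '-' a don't-care; both become None.
        return int(text, 2) if set(text) <= {"0", "1"} else None

    return {
        "line": int(line[0:8], 16),
        "ovrflw": bits(line[9]),       # column A
        "bus_out": bits(line[11:27]),
        "bus_in": bits(line[28:44]),
        "bussel": bits(line[45:48]),   # column B
        "bs": bits(line[49]),          # column C
        "rw": bits(line[51]),          # column D
        "me": bits(line[53]),          # column E
        "rst": bits(line[55]),         # column F
    }


row_12a = parse_stimulus("0000012a-0-0100000110110111-ZZZZZZZZZZZZZZZZ-100-1-0-0-0")
row_12b = parse_stimulus("0000012b-0-0001111011101110-ZZZZZZZZZZZZZZZZ-101-1-0-0-0")
z = (row_12a["bus_out"] << 16) | row_12b["bus_out"]
print(hex(z))  # prints 0x41b71eee
```

Combining the two Z reads reproduces the 32-bit result quoted in the walk-through above.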

[Figure 4. General VHDL Multiplier Layout. Chip I/O: ovrflw_s2h (out); rw_s2h, bs_s2h, rst_s2h, and me_s2h (in); bussel_s2h[2:0] (in) feeding a 3:8 demux; BUSIO_S2H[15:0] (io); clk[3:0] (in), where clk = phi1_h/l, phi2_h/l. Registers: reg 0 x[15:0], reg 1 y[15:0], reg 2 w[31:16], reg 3 w[15:0], reg 4 Z[31:16], reg 5 Z[15:0], and a dff for Ovrflw fed by the ovrflw logic. Multiplier: two 16 x 4 Booth Multiplier blocks (Booth, 20b out) separated by pipeline registers, a 9-bit y latch, an 8b unsigned add, a row of full adders (FA), and a 24b CLA, producing z[7:0], z[15:8], and z[31:16].]

Source Code
The table below lists the 59 files that are part of the VHDL code. Generally, there are three or four files associated with each major component. For the component foo, there would be foo.vhd, which is the instantiation of the entity and architecture. foo_test.vhd is a test script that uses foo.vhd as a component (UUT). The test script reads stimulus from foo.txt. Sometimes there will be a foogen.{pl|cc} to generate the stimulus.

Fileno  File Name              Description
1       addcell.txt            Test file stimulus
2       addcell.vhd            Implements +1 when booth sign negative
3       addcell_test.vhd       Test script for addcell
4       adder.txt              Test file stimulus
5       adder.vhd              Single bit and N bit full adder
6       adder_test.vhd
7       adder15.txt            Test file stimulus
8       adder15gen.pl          Generates stimulus for exhaustive 15-bit adder, incorrect carry out
9       adderN_test.vhd        A 15-bit adder test using "adder.vhd"
10      adk.vhd                Cell library for Leonardo
11      booth.txt              Test file stimulus
12      booth.vhd              Booth type (abstracts one-hot), booth encoder, sign propagation
13      booth_test.vhd
14      claN.vhd               An n-bit carry-lookahead adder (abstract "plusN" and "uplusN")
15      claN_test.vhd
16      converts.vhd           Text conversions for test cases
17      dffrN.txt              Test file stimulus
18      dffrNgen.pl            Generates test cases
19      dffrN_test.vhd         Test script for "latchrN" (transparent latch w/ reset) -- poorly named file
20      dft.map                Ini file for Exemplar DFT program
21      driverN.txt            Test file stimulus
22      driverN.vhd            N-bit driver
23      driverN_gen.pl         Generates test cases
24      driverN_test.vhd       Test script for "driverN"
25      latch.vhd              dff rise, dff fall, latch, gated dff, gated latch, and N-bit versions
26      marc.pl                Generates Leonardo timing analysis script for "mult_frame"
27      marc2.pl               Generates Leonardo timing analysis script (better version)
28      marc3.pl               Generates Leonardo timing analysis script (different clocking)
29      modelsim.ini           Ini file for Exemplar VHDL
30      mult.txt               Test file stimulus
31      mult.vhd               The whole multiplier with I/O registers
32      mult_test.vhd          Test for "mult.vhd"
33      mult_test_1.vhd        Test for synthesized "mult.vhd" (uses std_ulogic)
34      mult_cla.vhd           The CLA adder used by the multiplier
35      mult_framegen          Generates test cases (boundary and random)
36      mult_framegen.cc       C++ source code for mult_framegen
37      mult_frame.tcl         A timing analysis file example
38      mult_frame.txt         Test stimulus
39      mult_frame.vhd         The booth array multiplier and CLA adder
40      mult_frame_test.vhd    Test for "mult_frame"
41      mult_frame_test_u.vhd  Test for "mult_frame" (synthesized)
42      mult_pipe.txt          Test stimulus
43      mult_pipe.vhd          The booth array multiplier
44      mult_pipe_test.vhd     Old test script -- out of date
45      multgen.cc             Generates test cases for "mult.vhd"
46      multreg.vhd            N-bit multiplier register using "dffr_fall" and an input buffer
47      multregN.txt           Test stimulus for 4-bit register, non-exhaustive, out of date (for latch, not dff)
48      multreg_test.vhd       Test script for 4-bit register
49      Mymake                 CSH script to create everything
50      plusN_test.vhd         Tests abstract N-bit adder
51      pp.vhd                 Partial-product cells (ppmux, ppfa, ppfapp)
52      ppfa.txt               Test cases for "ppfa"
53      ppfagen.pl             Generates test cases for "ppfa"
54      ppmux.txt              Test cases for "ppmux"
55      ppfa_test.vhd          Test script
56      ppfapp_test.vhd        Test script
57      ppmux_test.vhd         Test script
58      sgn.txt                Test stimulus for "sgn"
59      sgn_test.vhd           Test script for "sgn"

Source Code Hierarchy

The top most entity is mult from mult.vhd (file 31). It uses the components multregn (file 46), mult_cla (file 34), and mult_pipe (file 43). Mult.vhd also decodes bussel_s2h into an 8-bit one-hot bus bus_sel_h[7:0] that feeds the six instances of multregn. The component mult also computes bus_wr_h[7:0] as a one-hot control signal to a set of 16-bit tri-state buffers for each I/O register's busout_s2h signal. Finally, the code computes the ovrflw signal.
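The register-select decode can be illustrated with a small behavioral sketch. It assumes bus_sel_h is a plain 3:8 one-hot decode of bussel_s2h and that a register drives the bus only when it is selected and rw_s2h is low, as stated for the I/O register; the function names are hypothetical and the real VHDL may differ in detail.

```python
def decode_bussel(bussel):
    """3:8 decode: return the 8-bit one-hot bus_sel_h value for bussel_s2h."""
    return 1 << (bussel & 0b111)


def bus_write_enables(bussel, rw):
    """One-hot tri-state enables (bus_wr_h): active only when rw_s2h is low."""
    return decode_bussel(bussel) if rw == 0 else 0


print(format(decode_bussel(4), "08b"))         # 00010000 selects register 4, Z[31:16]
print(format(bus_write_enables(4, 0), "08b"))  # 00010000 register 4 drives the bus
print(format(bus_write_enables(4, 1), "08b"))  # 00000000 bus left tri-stated
```

Only six of the eight decoded lines are used, one for each instance of multregn.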
The component multregn instantiates an N-bit I/O register, as described above. It uses the components dffrN_fall (file 25) and buf (file 22). DffrN_fall is an N-bit D-flipflop with reset clocked on the falling edge. Buf is a 1-bit buffer. We had to play some tricks with signal buffering to ensure proper fan-out. Leonardo had trouble with our source code and generating proper fan-out. We believe the problem was that we did not follow a strict hierarchy structure of combinatorial logic followed by registers.
The component mult_cla instantiates a 24-bit fast adder. It uses the component plusN (file 14). PlusN is an abstracted + operation in VHDL with some added logic to compute the carry. We used to have mult_cla and mult_pipe in the same source file as part of the same component. We separated them at some point because of timing simulation problems with Leonardo.
The component mult_pipe is the most complex component. It instantiates the 16-bit Booth multiplier with W accumulate less the final fast adder. This component uses 11 sub-components. The component addcell (file 2) implements the addcell in Fig. 2. The component adder (file 5) is a 1-bit full adder used for the W accumulate. The component uplusN (file 14) computes an unsigned N-bit addition. This is for the 8-bit adder in Fig. 4. The component booth_encode (file 12) implements the Booth encoder and defines an abstract data type for the 5-bit Booth control signal. The components ppfa and ppfapp (both file 51) implement the ppfa component of Fig. 2. The difference is that ppfapp has the extra pp_out signal. The component sgn (file 51) implements the sign extender of Fig. 2. The components dffr_fall, dffrN_fall, gdffr_fall, and gdffrN_fall (all file 25) implement single bit and N-bit D-flipflops with reset. The g versions are gated and have a tri-state Enable input (no longer used).
Inside mult_pipe, we used to drive each pipeline stage from a gated transparent latch. By using a gated latch, we could conserve power by eliminating spurious transitions while computing the previous pipeline stage. At the end of the first pipeline stage, for instance, we would latch the data at the end of phi_2 and enable the tri-state output at the beginning of phi_1. We used the components glatchrN, etc. When we switched to the DFF, there was no reason to continue using a gated version, since the flipflop is not transparent. Thus, the gdff and dff components are identical except for an extra Enable signal that does nothing. We preserved the Enable input so that there were no changes to our code semantics.
VHDL Code Versions

There are six versions of the VHDL code. The code that best synthesizes is in a directory called leo under /projects/kestrel/users/mult/marc/vhdl/booth-1. The leo code was the basis for the code described here. We have cleaned up the leo code and added back some files we had dropped because of synthesis.
The original code is under double. This version is a rather literal translation of [3,4] into VHDL and carries over the double-rail structure. We had extensive test scripts in-line with components. We also tested for meta-values, such as X or Z. Synthesis does not support meta-values, so we had to remove those tests. Synthesis does not like text I/O functions, so we also separated our test routines to external source files. The version single removes the double-rail nature and had better synthesis results. Kevin Delaney found a cell library for Leonardo, the synthesis tool. The cell library is called ADK. We began using the ADK cell library in the source tree adk. The adk tree is a non-pipelined multiplier, adk-pipe is a pipelined multiplier, and adk-pipe-cla is a pipelined multiplier with a carry-look-ahead adder. We hard-coded the CLA structure with a behavioral description. In our final version, we steered away from being so specific and just use a + sign.
We learned several things from these many versions and our efforts at synthesis. In our opinion, one should try to be as abstract as possible and let the synthesizer figure out the specifics. One must be aware of automatic register generation and what sort of statements will not synthesize. Apart from those concerns, we would recommend staying away from gate-level specifics. When one tries to enforce a specific structure, there is usually competition with the synthesizer and no one wins. There are directives to give the synthesizer guidelines for specific modules, but we did not have much success with them.
We also found that Leonardo had several limitations. First, it does not support transparent latches in timing estimates, so one should try to use D-flipflops. A single-phase clock works much better in the timing analysis too. While Leonardo will support arbitrary register placement, the manual states that it works much better if modules are constructed as combinatorial logic followed by registers in a consistent manner. Do not place registers haphazardly in the middle of a VHDL module. One may use input or output registers, but do not mix them.
Overflow Logic
A multiply-accumulate where all words are n-bit does not have overflow. Our architecture, however, does have the potential for overflow, since the accumulate is twice the word size of the multiplier/multiplicand.
            WZ
  XY     00  01  11  10
  00      0   1   0   0
  01      0   0   0   1
  11      0   1   0   0
  10      0   0   0   1
Figure 5. Karnaugh Map (overflow)
We compute a signed overflow from the following two assertions for Z[m:0] = X[n:0] * Y[n:0] + W[m:0], where in our case n=15 and m=31. There is overflow if (1) x*y > 0, w > 0, and z < 0, or (2) x*y < 0, w < 0, and z > 0. Symbolically, this may be computed as
$$t = X_n \oplus Y_n$$
$$\mathit{ovrflw} = \overline{(W_m + t)}\, Z_m + (W_m\, t)\, \overline{Z_m}$$
The value t is true if X and Y are of opposite signs, and thus their product is negative. In logic, this expression takes one XOR, 2 NAND, 3 NOR, and 2 INV.
Using the Karnaugh map in Fig. 5, we may construct a custom logic element. We find OVRFLW is zero when $\overline{W}(\overline{Z} + X\overline{Y} + \overline{X}Y) + W(Z + XY + \overline{X}\,\overline{Y})$ is true and is one when $\overline{W}Z(XY + \overline{X}\,\overline{Y}) + W\overline{Z}(X\overline{Y} + \overline{X}Y)$ is true. We have dropped the subscripts, which should be understood. We may implement this in CMOS with 26 gates. It should be rather quick, because all signals except Z are available very early in the calculation. We can position the Z-dependent gates on the output line. Refer to the Schematics section for the layout and gate schematic.
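As a cross-check of the overflow logic, the Python sketch below evaluates the sign-bit expression against the Karnaugh map of Fig. 5 and against exact arithmetic. It is only an illustration: the function names are ours, and the arithmetic check uses a scaled-down word size (4-bit operands with an 8-bit accumulate, an assumption made so the search stays exhaustive) rather than the 16/32-bit sizes of the design.

```python
from itertools import product

def ovrflw_t(x, y, w, z):
    """Overflow from sign bits using t = X xor Y (all arguments are 0 or 1)."""
    t = x ^ y
    return ((1 - (w | t)) & z) | ((w & t) & (1 - z))

def ovrflw_sop(x, y, w, z):
    """Sum-of-products form read directly off the Karnaugh map."""
    same = (x & y) | ((1 - x) & (1 - y))      # X and Y have the same sign
    diff = (x & (1 - y)) | ((1 - x) & y)      # X and Y have opposite signs
    return ((1 - w) & z & same) | (w & (1 - z) & diff)

# Rows are XY = 00, 01, 11, 10; columns are WZ = 00, 01, 11, 10.
KMAP = {(0, 0): [0, 1, 0, 0], (0, 1): [0, 0, 0, 1],
        (1, 1): [0, 1, 0, 0], (1, 0): [0, 0, 0, 1]}
COLS = [(0, 0), (0, 1), (1, 1), (1, 0)]

def check_kmap():
    for (x, y), row in KMAP.items():
        for (w, z), expected in zip(COLS, row):
            assert ovrflw_t(x, y, w, z) == expected
            assert ovrflw_sop(x, y, w, z) == expected
    return True

def sign(v):
    return 1 if v < 0 else 0

def check_arithmetic(n_bits=4, m_bits=8):
    """Compare the sign-bit flag with true overflow of x*y + w (small words)."""
    lo_x, hi_x = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    lo_w, hi_w = -(1 << (m_bits - 1)), (1 << (m_bits - 1)) - 1
    for x, y in product(range(lo_x, hi_x + 1), repeat=2):
        for w in range(lo_w, hi_w + 1):
            exact = x * y + w
            z = (exact - lo_w) % (1 << m_bits) + lo_w   # wrap to m_bits, signed
            overflowed = int(exact != z)
            assert ovrflw_t(sign(x), sign(y), sign(w), sign(z)) == overflowed
    return True
```

Calling check_kmap() and check_arithmetic() raises an AssertionError if either form disagrees with the map or with the exact sum.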
Magic Layout
The table below lists the 53 files that make up the Magic layout. In general, there are three types of files, similar to the VHDL directory structure. For the component foo, the file foo.mag is the Magic cell. foo.cmd is the RSIM command file that runs a test suite. Some components will have a foo.{pl|c|cc} program to generate the test cases. Sometimes, there is a foo_head.cmd file with the header portion of the CMD file independent of the test cases. There is also a csa subdirectory with a VHDL model of the CSA adder. To view these files with the recompiled Magic, set the environment variable CAD_HOME=/projects/kestrel/users/mult/tools and execute Magic as magic -TSCN3ME_SUBM.30 from $CAD_HOME/bin.

Fileno  File Name            Description
60      Addcell.cmd          RSIM command file w/ exhaustive stimulus
61      Addcell.mag          Generates the +1 for negative Booth encoding
62      broute.mag           A wiring channel
63      bth.cmd              RSIM command file w/ exhaustive stimulus
64      bth.mag              Booth encoding and sign propagation
65      bthbuf.mag           Inverter chain for booth lines
66      bthroute.mag         Wire routing for "bth" cell
67      bwire.mag            Wiring channel
68      csa/converts.vhd     VHDL text conversions for debugging
69      csa/csa.pl           Perl script to generate test stimulus
70      csa/csa.txt          Test stimulus for 4-bit CSA
71      csa/csa.vhd          VHDL model of CSA components
72      csa/csa_4bit.vhd     VHDL model of 4-bit CSA
73      csa/csa_gen.vhd      VHDL to generate internal test signals for RSIM testing
74      csa/csa_test.vhd     VHDL test script for 4-bit CSA adder
75      csa/csamid.txt       Test signals from CSA_MID for RSIM use
76      csa/modelsim.ini     Exemplar VHDL INI file
77      csa_2.mag            CSA 2-bit chain
78      csa_32.c             Generates test vectors for a 4-bit CSA
79      csa_32.cmd           RSIM command file w/ stimulus for random tests of 32-bit adder
80      csa_32.mag           4-bit test adder (now broken)
81      csa_32_head.cmd      "Header" section of RSIM stimulus file
82      csa_4.cmd            RSIM command file w/ exhaustive stimulus
83      csa_4.mag            CSA 4-bit chain
84      csa_4.pl             Perl script to generate test stimulus
85      csa_8.mag            CSA 8-bit chain
86      csa_cond.cmd         RSIM command file w/ exhaustive stimulus
87      csa_cond.mag         CSA conditional input section
88      csa_first.mag        CSA first cell in multi-bit chain
89      csa_last.mag         CSA last cell in multi-bit chain
90      csa_mid.cmd          RSIM command file w/ exhaustive stimulus
91      csa_mid.mag          CSA middle cell in multi-bit chain
92      csa_wire.mag         Used in CSA_32
93      fa.cmd               RSIM command file w/ exhaustive stimulus
94      fa.mag               Full adder CPL style
95      fa_cmos.mag          Full adder CMOS style
96      fa_tg.cmd            RSIM command file w/ exhaustive stimulus
97      fa_tg.mag            Full adder w/ 1 level deep TG style for ppfa cell
98      fa_tg2.mag           Full adder w/ 1 level deep TG style for W sum
99      invchain.mag         Single-rail to double-rail inverter chain
100     invtop.mag           Top row inverter chains for X and W
101     mcell.cmd            RSIM command file w/ exhaustive stimulus
102     mcell.mag            Multiplier cell (ppmuxfa and wiring)
103     mult_head.cmd        Header file for RSIM (no test cases)
104     mult_add.cmd         RSIM file with random tests
105     mult_add.mag         16x16 Booth multiplier with 32-bit accumulate
106     mult_add_head.cmd    RSIM file header
107     multgen.cc           C++ program to generate "mult_add" test cases
108     ppmux.cmd            RSIM command file w/ exhaustive stimulus
109     ppmux.mag            TG style partial product mux
110     ppmuxfa.mag          ppmux with full adder (fa_tg)
111     rwire.mag            Wiring channel and inverters to drive CSA
112     wroute.mag           Wiring channel
113     wtop.mag             Wiring channel
Design Hierarchy
We will begin our description of the design hierarchy with the inputs and work towards the outputs. Our description captures a high-level view of the design and does not penetrate to deep detail except where the detail is very important. We also call attention to our innovations, as opposed to design borrowed from the references.
The top-level cell is mult_add.mag. This cell has some glue wiring and all the raw input/output. The X input is via the cell invtop[15:0]/X_H. The W input connects directly to the wires Wn_H, where n ranges from 15 to 31 and to the cells invtop[14:0]/X_H. The Y input connects directly to the wires Yn_H, where n ranges from 0 to 15. The output Z connects to the Sn_H outputs of various CSA cells. The OVRFLW output connects to ovrflw_0/ovrflw_h.
The X and W[14:0] inputs pass through the cell array invtop. These are inverter chains along the top of the multiplier to generate the proper drive for the long X wires. The W inverts are small, since those signals only drive the adder in the top row of the multiplier. The X signals must drive about 0.450 pF. The cell invtop connects directly to the multiplier array cells, mcell.
The Y input connects to the cell bth along the left side of the multiplier. The bth cell produces the 5-bit one-hot Booth encoding of the Y word [3]. The bth cell also computes the sign propagation [3]. There are three Y inputs per bth cell, with one input common between two cells. Each bth cell generates a double-rail Y signal with a small inverter chain. The output of the bth cell is the 5-bit word bth_p1, bth_p2, bth_z0, bth_m1, bth_m2, indicating, respectively, x1, x2, x0, x-1, x-2 for a particular row. There are also ff_out and pp_out outputs for the sign propagation. Each bth cell abuts to a bthbuf cell in column 2 of the multiplier.
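The encoding performed by each bth cell can be illustrated with a short Python sketch. It follows the truth table of booth_encode and the bit positions of the bth_types package in Booth.vhd; the function name booth_rows and the integer representation of the one-hot word are our own choices for illustration, not part of the layout.

```python
# Bit positions of the one-hot word, as in the bth_types package.
BTH_M1, BTH_M2, BTH_P2, BTH_P1, BTH_Z0 = 4, 3, 2, 1, 0

# (Y[2i+1], Y[2i], Y[2i-1]) -> (one-hot bit position, multiple of X selected)
ENCODE = {
    (0, 0, 0): (BTH_Z0, 0),
    (0, 0, 1): (BTH_P1, 1),
    (0, 1, 0): (BTH_P1, 1),
    (0, 1, 1): (BTH_P2, 2),
    (1, 0, 0): (BTH_M2, -2),
    (1, 0, 1): (BTH_M1, -1),
    (1, 1, 0): (BTH_M1, -1),
    (1, 1, 1): (BTH_Z0, 0),
}

def booth_rows(y, bits=16):
    """Return one (one_hot, digit) pair per row for a signed `bits`-wide y."""
    pattern = y & ((1 << bits) - 1)   # two's-complement bit pattern of y
    padded = pattern << 1             # append the implicit Y[-1] = 0
    rows = []
    for i in range(bits // 2):
        # Three inputs per row; the top input of row i is shared with row i+1.
        trip = (padded >> (2 * i)) & 0b111
        key = ((trip >> 2) & 1, (trip >> 1) & 1, trip & 1)
        pos, digit = ENCODE[key]
        rows.append((1 << pos, digit))
    return rows
```

For a 16-bit Y this yields eight rows, and the selected multiples weighted by powers of four sum back to Y, which is the property the partial-product array relies on.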
The cell bthbuf has five inverter chains to generate the proper drive for the 5-bit Booth encoding. Each signal must drive about 0.600 pF. We use a 4-stage inverter chain with a size multiplier of 2.85 per stage. Bthbuf connects to the multiplier body of mcells.
The main array cell is mcell. It contains three components: ppmux, fa_tg, and wroute. Ppmux is a pass-logic multiplexer to select the proper X input based on the Booth encoding for the row [3]. The cell fa_tg is a double-rail transmission-gate based full adder [3]. It calculates the sum and carry in parallel. There are four output inverters for the sum and carry-out. We added four input inverters for the B_H/B_L inputs, one pair of inverters for each of the carry and sum logic. We found there was too much back-pressure from the transmission gates and it caused uncertainty in RSIM about who was driving whom. Wroute is a wire channel routing cell to pass horizontal and vertical signals. The sum-out connects two columns to the right while the carry-out connects one column to the right. The X signals pass directly down.
Along the right side of the mcell array is a column of addcell. Addcell checks the row's Booth encoding and generates a double-rail 0 or 1 output [3]. If the Booth encoding is negative, it generates the 1 output. The cell also passes the sum and carry outputs from mcell through to the next column. Addcell connects to a column of rwire, which is a vertical wiring channel to connect Addcell to the fast adder in the right hand column. Rwire has a pair of inverters to drive the CSA cells, since we had problems just making the output drivers in fa_tg larger.
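The role of the addcell bit can be sketched in Python. The sketch assumes, as the +1 output suggests, that a negative row delivers the bitwise complement of the selected multiple of X and that addcell supplies the missing 1 to complete the two's-complement negation. The helper names and the integer model are ours, and the sketch ignores the W accumulate and the sign-propagation outputs.

```python
def booth_digits(y, bits=16):
    """Radix-4 Booth digits of signed y, least significant row first."""
    padded = (y & ((1 << bits) - 1)) << 1        # bit pattern with Y[-1] = 0
    table = [0, 1, 1, 2, -2, -1, -1, 0]          # index: Y[2i+1] Y[2i] Y[2i-1]
    return [table[(padded >> (2 * i)) & 7] for i in range(bits // 2)]

def multiply(x, y, bits=16):
    """Sum the Booth partial products of x*y in a 2*bits-wide accumulator."""
    width = 2 * bits
    mask = (1 << width) - 1
    acc = 0
    for i, digit in enumerate(booth_digits(y, bits)):
        magnitude = (abs(digit) * x) & mask      # 0, x or 2x, sign-extended
        if digit < 0:
            row = ~magnitude & mask              # complemented partial product
            plus_one = 1                         # the 1 generated by addcell
        else:
            row, plus_one = magnitude, 0
        acc = (acc + ((row + plus_one) << (2 * i))) & mask
    # Reinterpret the accumulator as a signed value.
    return acc - (1 << width) if acc >> (width - 1) else acc
```

For example, multiply(3, -5) returns -15; dropping the plus_one term would leave each negative row off by one unit at its own weight.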
The fast adder in the right hand column is a 16-bit CSA, designed as 2-2-4-4-4 [1,2,4]. We found the references [1,2,4] slightly confusing and contradictory. The signal names were not consistent, despite having the same authors and essentially the same designs. Our original attempt to lay out the design from [1] failed. We wrote a VHDL description of the CSA in [4] (files 69-77). The VHDL description tightly modeled the gate-level design. Once we prototyped the adder and tested operation, we designed a modular lego-block cell library.
The basic CSA blocks are csa_cond, csa_first, csa_mid, and csa_last [4]. Csa_cond is a subcomponent of the other three. It is a double-rail pass-transistor mux to compute the conditional sum and carry bits. One must always use csa_first and csa_last. For a three or more bit adder, one inserts the necessary number of csa_mid cells. We created three adder sizes, csa_2, csa_4 (and csa_4b), and csa_8. Each of these cells has a 2-inverter driver chain for the double-rail carry-in input. This is necessary, since load varies widely between the three cells. The RSIM estimates are 0.081pF, 0.133pF, and 0.243pF for the 2, 4, and 8-bit cells (see the Optimization section below). The cells csa_2 and csa_4 are designed for use along the right side of the multiplier. The cells csa_4b and csa_8 are designed for the bottom of the multiplier.
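The block structure of the adder can be modelled behaviorally in Python. The sketch below is a simplification: it treats each block as computing its result for both possible carry-in values and then selecting one, which captures the block-level idea but not the bit-level conditional-sum circuitry of csa_cond. The function name csa_add and the default block list (the full 32-bit 2-2-4-4-4-8-8 chain) are assumptions made for illustration.

```python
def csa_add(a, b, cin=0, blocks=(2, 2, 4, 4, 4, 8, 8)):
    """Add two unsigned words block by block, least significant block first."""
    result, shift, carry = 0, 0, cin
    for width in blocks:
        mask = (1 << width) - 1
        a_blk = (a >> shift) & mask
        b_blk = (b >> shift) & mask
        # Both candidate results are formed before the carry-in is known.
        candidates = [a_blk + b_blk + c for c in (0, 1)]
        chosen = candidates[carry]        # the carry-in only drives the select
        result |= (chosen & mask) << shift
        carry = chosen >> width           # carry-out feeds the next block
        shift += width
    return result, carry
```

For instance, csa_add(0xFFFFFFFF, 1) returns (0, 1), and passing blocks=(2, 2, 4, 4, 4) models the 16-bit adder along the right side of the multiplier.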
We had to make many substantial changes to the CSA designs in [1,2,4]. The original designs used extensive pass-logic. RSIM showed many unknown errors in our original layouts. We corrected some by inserting intermediate inverters. Other errors, which we originally thought were problems with RSIM and pass logic, ended up being insufficient logic-1 drive from fa_tg for the number of threshold drops. By widening the pass-logic NMOS gates and increasing the output inverter drive of fa_tg, we were able to overcome many of these problems. To fully solve the problem, we also had to change the last CPL pass-transistor in the sum chain to a full transmission gate and insert some extra inverters before the CSA cell. The original designs also had minimum-size PMOS latches for swing-restore, but these never simulated correctly in RSIM. We entirely removed them from the design.
The bottom row of mcell connects downward to a row of bwire, a wire routing channel. Below the channel is a row of fa_tg2. These full adders sum the carry-out, sum-out, and W values for each output bit. The output of the full adders then passes through the wiring channel broute and drives the bottom 16 bits of the CSA adder. The last 16 bits of the CSA adder are made up of a 4-4-8 design using csa_4b and csa_8.
 1. capm2a      .00003   ; 2nd metal cap -- area, pf/sq-micron
 2. capm2p      .00020   ; 2nd metal cap -- perimeter, pf/micron
 3. capma       .00006   ; 1st metal cap -- area, pf/sq-micron
 4. capmp       .00020   ; 1st metal cap -- perimeter, pf/micron
 5. cappa       .00005   ; poly cap -- area, pf/sq-micron
 6. cappp       .00020   ; poly cap -- perimeter, pf/micron
 7. capda       .00030   ; n-diffusion cap -- area, pf/sq-micron
 8. capdp       .00040   ; n-diffusion cap -- perimeter, pf/micron
 9. cappda      .00050   ; p-diffusion cap -- area, pf/sq-micron
10. cappdp      .00040   ; p-diffusion cap -- perimeter, pf/micron
11. capga       .00215   ; gate cap -- area, pf/sq-micron
12. lambda      0.3      ; microns/lambda
13. lowthresh   0.4      ; logic low threshold as a normalized voltage
14. highthresh  0.6      ; logic high threshold as a normalized voltage
15. cntpullup   0
16. diffperim   0
17. subparea    0
18. diffext     0
19. ; n-channel resistance
20. resistance n-channel dynamic-high   5     1   2302.30
21. resistance n-channel dynamic-low    5     1   1251.10
22. resistance n-channel static         5     1   1206.74
23. ; p-channel resistance
24. resistance p-channel dynamic-high   13.5  1   943.97
25. resistance p-channel static         13.5  1   970.84
26. resistance p-channel dynamic-low    13.5  1   1820.00
RSIM Calibration
We calibrated RSIM with Hspice parameters from the AMI C5N lot T06F NMOS transistor. Parameters 1-6 in the listing above came directly from the MOSIS run. We calculated the diffusion parameters for a 1x1 square using the appropriate SPICE parameters and the equation shown below. We also used SPICE parameters to calculate the gate capacitance. Items 13-18 above were left as-is from the original PRM file. Items 19-26 came from MOSIS.
We calculated the gate capacitance and drain capacitance following the SPICE calculations presented in [5, pp. 188ff]. Gate capacitance has two components, the intrinsic and extrinsic, which are summed for the total.
$$C_{gin} = W L\, C_{ox}$$
and
$$C_{gex} = W C_{gso} + W C_{gdo} + 2 L\, C_{gbo}.$$
The parameters for the gate-source, gate-drain, and gate-body capacitances came from the Hspice parameters. They are, respectively, $1.93 \times 10^{-10}$ F/m, $1.93 \times 10^{-10}$ F/m, and $1.00 \times 10^{-9}$ F/m. The gate oxide thickness is $1.38 \times 10^{-6}$ m. Since RSIM uses a unit measurement per area, we set W and L to 1. The drain capacitance is given by the following, where CJ, VJ, PB, MJ, CJSW, and MJSW are SPICE parameters. Their values are 4.22E-4, 2.5, 0.984, 3.49E-10, 1.20E-1. We used an area of 1 and a perimeter of 4.
$$C_j = \mathrm{Area} \cdot CJ \left(1 + \frac{VJ}{PB}\right)^{-MJ} + \mathrm{Perim} \cdot CJSW \left(1 + \frac{VJ}{PB}\right)^{-MJSW}$$
Optimization
Based on RSIM capacitance estimates, we optimized several key inverter chains for maximum performance. We used the techniques from [4, pp. 221ff], which are similar to those presented in CMPE222. It uses the recursion $C_i = \alpha_{i-1} C_{i-1}$. The size multiplier is $\ln(\alpha) = \ln(C_L / C_{in}) / n$. One may estimate the optimal number of stages from $n_{opt} = \ln(C_L / C_{in})$.
For the input capacitance, we used estimates of 9 fF and 14 fF for, respectively, a 3/5 and 4/8 first stage. These numbers came from RSIM estimates. The output drive capacitance also came from RSIM. Long wires, such as the booth-encoded selectors, could range between 0.5 pF and 0.6 pF. We generally fixed n based on layout considerations.
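The sizing rule can be made concrete with a short Python sketch. It assumes a uniform size multiplier per stage, so that n stages satisfy multiplier**n = CL/Cin; the function names are ours. With the 0.6 pF load and 9 fF first-stage input capacitance quoted in this report and n fixed at 4, the multiplier comes out close to the 2.85 per stage used for the bthbuf chains.

```python
import math

def size_multiplier(c_load, c_in, n):
    """Per-stage size multiplier of an n-stage chain: (CL / Cin) ** (1 / n)."""
    return (c_load / c_in) ** (1.0 / n)

def optimal_stages(c_load, c_in):
    """Estimate of the optimal stage count, n_opt = ln(CL / Cin)."""
    return math.log(c_load / c_in)

def stage_input_caps(c_load, c_in, n):
    """Input capacitance of each stage when every stage grows by the same factor."""
    m = size_multiplier(c_load, c_in, n)
    return [c_in * m ** i for i in range(n)]

# Booth select line: about 0.6 pF load, 9 fF first stage, n fixed at 4 by layout.
bth_multiplier = size_multiplier(0.6e-12, 9e-15, 4)
```

Here bth_multiplier evaluates to roughly 2.86, and optimal_stages(0.6e-12, 9e-15) is about 4.2, so four stages is close to the estimate for that load.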
When generating double-rail signals from single-rail inputs, we usually use 2-inverter/3-inverter trees or 3-inverter/4-inverter trees. Sometimes this was sub-optimal, since we used fewer but larger inverters based on layout restrictions. The layout restrictions came from the standard cell size we selected early in the design process.
The CSA adder is designed as a 2-2-4-4-4-8-8 chain, based on [2]. Using Magic estimates of input capacitance for the carry-in, we designed an input driver for each of csa_2, csa_4, and csa_8 to optimize the performance of each element. The component csa_last generates the car_h, car_l carry outputs with a 6/6 inverter that then drives a 3/3 transmission gate for the carry select. Thus, csa_last has low drive ability.
From Magic, the input capacitances of csa_2, csa_4, and csa_8 are, respectively, 0.081 pF, 0.133 pF, and 0.243 pF. From path output, it would take 0.3 ns, 0.9 ns, and 1.4 ns, respectively, to charge each component. Because of space restrictions, we could only have a two-stage driver for each of the high and low inputs. The optimal for csa_4 and csa_8 would have been a three-stage driver. With the drivers in place, the input capacitances of the components reduced to 0.012 pF, 0.013 pF, and 0.013 pF (0.034 pF for csa_8_0). The transition times dropped to 0.1 ns each, except for csa_8_0, which has a 0.2 ns transition time. The component csa_8_0 is the first adder along the bottom, so there is a long M2 wire from csa_4_2 to csa_8_0 and the 6/6 output inverter in csa_4_2 cannot drive the line. The schematics for each component show the details of the input inverters.
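The remark about three-stage drivers can be reproduced with the stage-count estimate n_opt = ln(CL / Cin). The sketch below assumes the 9 fF first-stage input capacitance of a 3/5 inverter as Cin; the report does not state which first stage was assumed for these drivers, so that choice and the names used are ours.

```python
import math

C_IN = 9e-15   # assumed first-stage input capacitance (3/5 inverter), in farads

# Carry-in loads from Magic, in farads.
LOADS = {"csa_2": 0.081e-12, "csa_4": 0.133e-12, "csa_8": 0.243e-12}

def optimal_stage_count(c_load, c_in=C_IN):
    """Round n_opt = ln(CL / Cin) to the nearest whole number of stages."""
    return max(1, round(math.log(c_load / c_in)))

STAGES = {name: optimal_stage_count(c) for name, c in LOADS.items()}
```

Under this assumption the estimate gives two stages for csa_2 and three stages for csa_4 and csa_8, which agrees with the two-stage compromise being sub-optimal only for the larger cells.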
The 2-2-4-4-4-8-8 design assumes that all inputs arrive at the same time. In our multiplier case, that is not true. The input to the first 8 bit adder actually arrives last. One might experiment with different designs, such as 2-2-4-8-2-2-4-8.
We found that the ff output of the cell bth drove about 0.139 pF but only had a 12/16 output inverter. We redesigned it as a 2-inverter chain of 4/6 and 12/18. Using a 4/6 rather than a 3/5 reduced the size of the second inverter by 2. Going from a 28 of input capacitance down to 10 also helped. This one change improved performance by approximately 15% overall.
References
1. Abu-Khater, I.S.; Bellaouar, A.; Elmasry, M.I.; Yan, R.H., Circuit/architecture for low-power high-performance 32-bit adder, Fifth Great Lakes Symposium on VLSI, Buffalo, NY, USA, 16-18 March 1995, pp. 74-7.
2. Abu-Khater, I.S.; Yan, R.H.; Bellaouar, A.; Elmasry, M.I., A 1-V low-power high-performance 32-bit conditional sum adder, Symposium on Low Power Electronics, Digest of Technical Papers, San Diego, CA, USA, 10-12 Oct. 1994, pp. 66-7.
3. Abu-Khater, I.S.; Yan, R.H.; Bellaouar, A.; Elmasry, M.I., Circuit Techniques for CMOS Low-Power High-Performance Multipliers, IEEE Journal of Solid-State Circuits, v. 31, no. 10, Oct. 1996, pp. 1535-1546.
4. Bellaouar, A. and M.I. Elmasry, Low-Power Digital VLSI Design: Circuits and Systems, Kluwer Academic Publishers, Boston: 1995.
5. Weste, N.H.E. and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd Ed., Addison-Wesley Publishing, Reading, MA: 1992.
VHDL Source Code

Addcell.vhd
------------------------------------------------------------------------
-- Add Cell from "Low-power Digital VLSI Design" by
-- Bellaouar and Elmasry.
-- Returns 1 if Booth encoding is negative else 0
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use work.bth_types.all;

entity addcell is
  port (bth : in  std_ulogic_vector(4 downto 0);
        sum : out std_ulogic);
end addcell;

-- description of adder using concurrent signal assignments
architecture rtl of addcell is
begin
  sum <= bth(bth_m1) or bth(bth_m2);
end rtl;
Adder.vhd
------------------------------------------------------------------------
-- Single-bit adder
------------------------------------------------------------------------
library IEEE, adk;
use IEEE.std_logic_1164.all;

entity adder is
  port (a_h   : in  std_ulogic;
        b_h   : in  std_ulogic;
        c_h   : in  std_ulogic;
        sum_h : out std_ulogic;
        car_h : out std_ulogic);
end adder;

architecture rtl of adder is
  component fadd1 is
    port (A  : in  STD_LOGIC;
          B  : in  STD_LOGIC;
          CI : in  STD_LOGIC;
          S  : out STD_LOGIC;
          CO : out STD_LOGIC);
  end component;

  signal a : std_logic;
  signal b : std_logic;
  signal c : std_logic;
  signal s : std_logic;
  signal t : std_logic;
begin
  a <= a_h;
  b <= b_h;
  c <= c_h;
  fa: fadd1 port map (a => a, b => b, ci => c, s => s, co => t);
  sum_h <= s;
  car_h <= t;
end rtl;

------------------------------------------------------------------------
-- N-bit adder
-- The width of the adder is determined by generic N
-- From ModelSim examples
-- Uses simple ripple
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use work.adder;

entity adderN is
  generic(N : integer := 15);
  port (a_h   : in  std_ulogic_vector(N downto 1);
        b_h   : in  std_ulogic_vector(N downto 1);
        c_h   : in  std_ulogic;
        sum_h : out std_ulogic_vector(N downto 1);
        car_h : out std_ulogic);
end adderN;

-- structural implementation of the N-bit adder
architecture ripple of adderN is
  component adder
    port (a_h   : in  std_ulogic;
          b_h   : in  std_ulogic;
          c_h   : in  std_ulogic;
          sum_h : out std_ulogic;
          car_h : out std_ulogic);
  end component;

  signal carry : std_ulogic_vector(0 to N);
begin
  carry(0) <= c_h;
  car_h    <= carry(N);

  -- instantiate a single-bit adder N times
  gen: for I in 1 to N generate
    add: adder port map(
      a_h   => a_h(I),
      b_h   => b_h(I),
      c_h   => carry(I - 1),
      sum_h => sum_h(I),
      car_h => carry(I));
  end generate;
end ripple;
Booth.vhd
------------------------------------------------------------------------
-- Constants used by Booth functions
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;

package bth_types is
  constant bth_m1 : integer := 4;
  constant bth_m2 : integer := 3;
  constant bth_p2 : integer := 2;
  constant bth_p1 : integer := 1;
  constant bth_z0 : integer := 0;
end bth_types;

------------------------------------------
-- Booth encoder for row j
------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use work.bth_types.all;

entity booth_encode is
  port(in_h  : in  std_ulogic_vector (2 downto 0);
       bth_h : out std_ulogic_vector (4 downto 0));
end booth_encode;

architecture rtl of booth_encode is
begin
  -- input "in_h" is Y(2i+1) Y(2i) Y(2i-1) MSB order
  -- See bth.vhd for booth types
  bth_h <= "10000" when (in_h="101" or in_h="110") else
           "01000" when (in_h="100") else
           "00100" when (in_h="011") else
           "00010" when (in_h="010" or in_h="001") else
           "00001";
end rtl;

------------------------------------------------------------------------
-- Sign propagation
-- taken from "Circuit Techniques for CMOS Low-Power High-Performance Multipliers"
-- by Abu-Khater, Bellaouar, and Elmasry in IEEE J. Solid-State Circuits v.31 (10)
-- Oct 1996 pp. 1535ff
--
-- double rail
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;

entity sgn is
  port (pp_h     : in  std_ulogic;
        ff_h     : in  std_ulogic;
        pp_out_h : out std_ulogic;
        ff_out_h : out std_ulogic);
end sgn;

architecture rtl of sgn is
begin
  pp_out_h <= pp_h xor ff_h;
  ff_out_h <= pp_h or ff_h;
end rtl;
claN.vhd
------------------------------------------------------------------------
-- N-bit Carry-Lookahead adder
-- The width of the adder is determined by generic N
-- From Altera examples
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use work.adder;

entity claN is
  generic(N : positive);
  port (a_h   : in  std_ulogic_vector(N-1 downto 0);
        b_h   : in  std_ulogic_vector(N-1 downto 0);
        c_h   : in  std_ulogic;
        sum_h : out std_ulogic_vector(N-1 downto 0);
        car_h : out std_ulogic);
end claN;

architecture behavioral of claN is
  signal h_sum      : std_ulogic_vector(N-1 downto 0);
  signal car_gen    : std_ulogic_vector(N-1 downto 0);
  signal car_prop   : std_ulogic_vector(N-1 downto 0);
  signal car_intern : std_ulogic_vector(N-1 downto 1);
begin
  h_sum    <= a_h XOR b_h;
  car_gen  <= a_h AND b_h;
  car_prop <= a_h OR b_h;

  PROCESS( car_gen, car_prop, car_intern, c_h )
  BEGIN
    car_intern(1) <= car_gen(0) OR (car_prop(0) AND c_h);
    inst: FOR i IN 1 to N-2 LOOP
      car_intern(i+1) <= car_gen(i) OR (car_prop(i) AND car_intern(i));
    END LOOP;
    car_h <= car_gen(N-1) OR (car_prop(N-1) AND car_intern(N-1));
  END PROCESS;

  sum_h(0) <= h_sum(0) XOR c_h;
  sum_h(N-1 downto 1) <= h_sum(N-1 downto 1) XOR car_intern(N-1 downto 1);
end behavioral;

------------------------------------------------------------------------
-- N-bit adder
-- The width of the adder is determined by generic N
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;  -- usign = unsigned addition

entity plusN is
  generic(N : positive);
  port (a_h   : in  std_ulogic_vector(N-1 downto 0);
        b_h   : in  std_ulogic_vector(N-1 downto 0);
        c_h   : in  std_ulogic;
        sum_h : out std_ulogic_vector(N-1 downto 0);
        car_h : out std_ulogic);
end plusN;

architecture behavioral of plusN is
  signal x : std_logic_vector(N-1 downto 0);
  signal y : std_logic_vector(N-1 downto 0);
  signal w : std_logic_vector(N-1 downto 0);
  signal z : std_logic_vector(N-1 downto 0);
  signal a : signed (N-1 downto 0);
  signal b : signed (N-1 downto 0);
  signal c : signed (N-1 downto 0);
  signal s : signed (N-1 downto 0);

  signal t4_h : std_ulogic;
  signal t5_h : std_ulogic;
begin
  x <= To_StdLogicVector(a_h);
  y <= To_StdLogicVector(b_h);

  -- need to extend the carry to the same size as X and Y.
  w(N-1 downto 1) <= (others => '0');
  w(0) <= c_h;

  a <= SIGNED(x);
  b <= SIGNED(y);
  c <= SIGNED(w);
  s <= a + b + c;
  z <= STD_LOGIC_VECTOR(s);
  sum_h <= To_StdULogicVector(z);

  ---------------------------------------------------------------
  -- 8) The overflow from signed addition
  --    An overflow is defined when
  --    1) x > 0 and y > 0 and z < 0 or
  --    2) x < 0 and y < 0 and z > 0
  ---------------------------------------------------------------
  t4_h  <= not (x(N-1)) and not (y(N-1)) and z(N-1);
  t5_h  <= (x(N-1) and y(N-1)) and not z(N-1);
  car_h <= (t4_h or t5_h);
end behavioral;

------------------------------------------------------------------------
-- unsigned N-bit adder
-- The width of the adder is determined by generic N
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity uplusN is
  generic(N : positive);
  port (a_h   : in  std_ulogic_vector(N-1 downto 0);
        b_h   : in  std_ulogic_vector(N-1 downto 0);
        c_h   : in  std_ulogic;
        sum_h : out std_ulogic_vector(N-1 downto 0);
        car_h : out std_ulogic);
end uplusN;

architecture behavioral of uplusN is
  signal x : std_logic_vector(N downto 0);
  signal y : std_logic_vector(N downto 0);
  signal w : std_logic_vector(N downto 0);
  signal z : std_logic_vector(N downto 0);
  signal a : unsigned (N downto 0);
  signal b : unsigned (N downto 0);
  signal c : unsigned (N downto 0);
  signal s : unsigned (N downto 0);
begin
  x(N-1 downto 0) <= To_StdLogicVector(a_h);
  y(N-1 downto 0) <= To_StdLogicVector(b_h);
  x(N) <= '0';
  y(N) <= '0';

  -- need to extend the carry to the same size as X and Y.
  w(N downto 1) <= (others => '0');
  w(0) <= c_h;

  a <= UNSIGNED(x);
  b <= UNSIGNED(y);
  c <= UNSIGNED(w);
  s <= a + b + c;
  z <= STD_LOGIC_VECTOR(s);
  sum_h <= To_StdULogicVector(z(N-1 downto 0));
  car_h <= z(N);
end behavioral;
driverN.vhd
------------------------------------------------------------------------
-- N-bit driver
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;

entity buf is
  port (signal Q : out std_ulogic;
        signal D : in  std_ulogic);
end buf;

architecture behavior of buf is
begin
  Q <= '0' when (D = '0' or D = 'L') else
       '1' when (D = '1' or D = 'H') else
       'U' when D = 'U' else
       'X';
end behavior;

library IEEE;
use IEEE.std_logic_1164.all;

entity driverN is
  generic(N : positive);
  port (signal Q : out std_ulogic_vector(N-1 downto 0);
        signal D : in  std_ulogic_vector(N-1 downto 0));
end driverN;

-- structural implementation of the N-bit driver
architecture structural of driverN is
  component buf is
    port (signal Q : out std_ulogic;
          signal D : in  std_ulogic);
  end component;
begin
  gen0: for i in 0 to N-1 generate
    b: buf port map (Q => Q(i), D => D(i));
  end generate;
end structural;
latch.vhd
1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. ------------------------------------------------------------------------- N-bit LATCH with reset -- The width of the latch is determined by generic N -----------------------------------------------------------------------library IEEE; use IEEE.std_logic_1164.all; entity dffr_fall is port ( Rst Clk signal D signal Q end dffr_fall;
: : : :
in std_ulogic; in std_ulogic; in std_ulogic; out std_ulogic);
architecture behavior of dffr_fall is begin process(Rst, Clk, D) begin if Rst = '1' then Q <= '0'; elsif clk'event and Clk = '0' then Q <= D; end if; end process; end behavior; -----------------------------------------------------------------------library IEEE; use IEEE.std_logic_1164.all; entity gdffr_fall is port ( Rst Clk Enable signal D signal Q end gdffr_fall;
: : : : :
in std_ulogic; in std_ulogic; in std_ulogic; in std_ulogic; out std_ulogic);
architecture behavior of gdffr_fall is
begin
  -- Note: Enable is declared on the entity but is not referenced in this
  -- architecture; as written it behaves the same as dffr_fall.
  process(Rst, Clk, D)
  begin
    if Rst = '1' then
      Q <= '0';
    elsif clk'event and Clk = '0' then
      Q <= D;
    end if;
  end process;
end behavior;

------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;

entity dffr_rise is
  port ( Rst      : in  std_ulogic;
         Clk      : in  std_ulogic;
         signal D : in  std_ulogic;
         signal Q : out std_ulogic);
end dffr_rise;
architecture behavior of dffr_rise is
begin
  process(Rst, Clk, D)
  begin
    if Rst = '1' then
      Q <= '0';
    elsif clk'event and Clk = '1' then
      Q <= D;
    end if;
  end process;
end behavior;

------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;

entity dffrN_fall is
  generic(N : positive);
  port ( Rst      : in  std_ulogic;
         Clk      : in  std_ulogic;
         signal D : in  std_ulogic_vector(N-1 downto 0);
         signal Q : out std_ulogic_vector(N-1 downto 0));
end dffrN_fall;

architecture behavior of dffrN_fall is
  component dffr_fall is
    port ( Rst      : in  std_ulogic;
           Clk      : in  std_ulogic;
           signal D : in  std_ulogic;
           signal Q : out std_ulogic);
  end component;
begin
  gen: for j in 0 to N-1 generate
    dffgen: dffr_fall port map (Rst=> Rst, Clk=> Clk, D=> D(j), Q=> Q(j));
  end generate;
end behavior;

------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;

entity gdffrN_fall is
  generic(N : positive);
  port ( Rst      : in  std_ulogic;
         Clk      : in  std_ulogic;
         Enable   : in  std_ulogic;
         signal D : in  std_ulogic_vector(N-1 downto 0);
         signal Q : out std_ulogic_vector(N-1 downto 0));
end gdffrN_fall;

architecture behavior of gdffrN_fall is
  component dffr_fall is
    port ( Rst      : in  std_ulogic;
           Clk      : in  std_ulogic;
           signal D : in  std_ulogic;
           signal Q : out std_ulogic);
  end component;
begin
  gen: for j in 0 to N-1 generate
    dffgen: dffr_fall port map (Rst=> Rst, Clk=> Clk, D=> D(j), Q=> Q(j));
  end generate;
end behavior;

------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;

entity dffrN_rise is
  generic(N : positive);
  port ( Rst      : in  std_ulogic;
         Clk      : in  std_ulogic;
         signal D : in  std_ulogic_vector(N-1 downto 0);
         signal Q : out std_ulogic_vector(N-1 downto 0));
end dffrN_rise;

architecture behavior of dffrN_rise is
  component dffr_rise is
    port ( Rst      : in  std_ulogic;
           Clk      : in  std_ulogic;
           signal D : in  std_ulogic;
           signal Q : out std_ulogic);
  end component;
begin
  gen: for j in 0 to N-1 generate
    dffgen: dffr_rise port map (Rst=> Rst, Clk=> Clk, D=> D(j), Q=> Q(j));
  end generate;
end behavior;

library IEEE;
use IEEE.std_logic_1164.all;

entity latchr is
  port ( Rst      : in  std_ulogic;
         Clk      : in  std_ulogic;
         signal D : in  std_ulogic;
         signal Q : out std_ulogic);
end latchr;
architecture behavior of latchr is
begin
  process(Rst, Clk, D)
  begin
    if Rst = '1' then
      Q <= '0';
    elsif Clk = '1' then
      Q <= D;
    end if;
  end process;
end behavior;

library IEEE;
use IEEE.std_logic_1164.all;

entity latchrN is
  generic(N : positive );
  port ( Rst      : in  std_ulogic;
         Clk      : in  std_ulogic;
         signal D : in  std_ulogic_vector(N-1 downto 0);
         signal Q : out std_ulogic_vector(N-1 downto 0));
end latchrN;
architecture behavior of latchrN is
  component latchr is
    port ( Rst      : in  std_ulogic;
           Clk      : in  std_ulogic;
           signal D : in  std_ulogic;
           signal Q : out std_ulogic);
  end component;

  signal my_clk : std_logic_vector(N/8 downto 0);
  signal my_rst : std_logic_vector(N/8 downto 0);
begin
  process (Clk)
  begin
    clk_buf: for i in 0 to N/8 LOOP
      my_clk(i) <= clk;
    end LOOP;
  end process;

  process (Rst)
  begin
    rst_buf: for i in 0 to N/8 LOOP
      my_rst(i) <= rst;
    end LOOP;
  end process;

  gen: for j in 0 to N-1 generate
    latchgen: latchr port map (Rst=> my_rst(j/8), Clk=> my_clk(j/8),
                               D=> D(j), Q=> Q(j));
  end generate;
end behavior;

------------------------------------------------------------------------
-- N-bit dff with reset : NON-TRANSPARENT ON GATED BUFFER
-- The width of the dff is determined by generic N
-- modified from "VHDL Made Easy!" by Pellerin and Taylor (p. 224)
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;

entity glatchr is
  port ( Rst      : in  std_ulogic;
         Clk      : in  std_ulogic;
         Enable   : in  std_ulogic;
         signal D : in  std_ulogic;
         signal Q : out std_ulogic);
end glatchr;
architecture rtl of glatchr is
  component latchr is
    port ( Rst      : in  std_ulogic;
           Clk      : in  std_ulogic;
           signal D : in  std_ulogic;
           signal Q : out std_ulogic);
  end component;

  signal w : std_ulogic;
begin
  dff: latchr port map ( Rst=> Rst, Clk=> Clk, D=> D, Q=> w );
  Q <= w when Enable='1' else 'Z';
  --Q <= w;
end rtl;

library IEEE;
use IEEE.std_logic_1164.all;

entity glatchrN is
  generic(N : positive );
  port ( Rst      : in  std_ulogic;
         Clk      : in  std_ulogic;
         Enable   : in  std_ulogic;
         signal D : in  std_ulogic_vector(N-1 downto 0);
         signal Q : out std_ulogic_vector(N-1 downto 0));
end glatchrN;

architecture rtl of glatchrN is
  component glatchr is
    port ( Rst      : in  std_ulogic;
           Clk      : in  std_ulogic;
           Enable   : in  std_ulogic;
           signal D : in  std_ulogic;
           signal Q : out std_ulogic);
  end component;

  signal my_clk : std_logic_vector(N/8 downto 0);
  signal my_rst : std_logic_vector(N/8 downto 0);
  signal my_en  : std_logic_vector(N/8 downto 0);
begin
  process (Clk)
  begin
    clk_buf: for i in 0 to N/8 LOOP
      my_clk(i) <= clk;
    end LOOP;
  end process;

  process (Rst)
  begin
    rst_buf: for i in 0 to N/8 LOOP
      my_rst(i) <= rst;
    end LOOP;
  end process;

  process (Enable)
  begin
    -- buffers Enable (the original listing reused the label rst_buf here)
    en_buf: for i in 0 to N/8 LOOP
      my_en(i) <= Enable;
    end LOOP;
  end process;

  gen: for j in 0 to N-1 generate
    gdffrgen: glatchr port map (Rst=> my_rst(j/8), Clk=> my_clk(j/8),
                                Enable=> my_en(j/8), D=> D(j), Q=> Q(j));
  end generate;
end rtl;
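The reset and falling-edge behavior of dffr_fall above can be exercised with a small self-checking testbench. The following is only a sketch: the testbench name dffr_fall_tb, the signal names, and the 10 ns step are assumptions, and it presumes latch.vhd has already been compiled into the work library.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical testbench; not part of the original source tree.
entity dffr_fall_tb is
end dffr_fall_tb;

architecture sim of dffr_fall_tb is
  signal rst, clk, d, q : std_ulogic := '0';
begin
  -- Direct instantiation of the flip-flop defined in latch.vhd
  dut: entity work.dffr_fall(behavior)
    port map (Rst => rst, Clk => clk, D => d, Q => q);

  stim: process
  begin
    -- Asynchronous reset forces Q low regardless of the clock
    rst <= '1';
    wait for 10 ns;
    assert q = '0' report "reset should clear Q" severity error;

    -- A rising edge must not capture D
    rst <= '0'; d <= '1'; clk <= '1';
    wait for 10 ns;
    assert q = '0' report "Q must not change on a rising edge" severity error;

    -- The falling edge captures D
    clk <= '0';
    wait for 10 ns;
    assert q = '1' report "Q should capture D on the falling edge" severity error;

    -- With the clock stable, a change on D is ignored
    d <= '0';
    wait for 10 ns;
    assert q = '1' report "Q should hold while Clk is stable" severity error;

    report "dffr_fall_tb finished";
    wait;
  end process;
end sim;
```

Swapping the instantiated entity for dffr_rise and inverting the clock steps would give the corresponding check for the rising-edge version.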
mult.vhd
------------------------------------------------------------------------
-- N-bit multiplier
-- This is a phi-2 device.
--
--   BusIO_S2H   is the pad i/o bus
--   Ovrflw_S2H  is the overflow output. Should be made an InOut for carry-in
--   BusSEL_S2H  is a chip select, encoded active high
--   BS_S2H      is the input select (bus high, mult low)
--   RW_S2H      is the Read/Write select (read high, write low)
--   ME_S2H      is the Multiplier Enable
--   Rst_S2H     is a reset signal. It is clocked with PHI_2 to ensure
--               that it does not muck with stuff when it is not supposed to.
--               Reset is immediate. There is no 1 cycle delay, like
--               with regular signals.
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
--use work.converts.all;

entity mult is
  port ( BusIO_S2H  : inout std_logic_vector(15 downto 0);
         Ovrflw_S2H : out   std_ulogic;
         BusSEL_S2H : in    std_ulogic_vector(2 downto 0);
         BS_S2H     : in    std_ulogic;
         RW_S2H     : in    std_ulogic;
         ME_S2H     : in    std_ulogic;
         Rst_S2H    : in    std_ulogic;
         PHI_1H     : in    std_ulogic;
         PHI_2H     : in    std_ulogic);
end mult;
architecture structural of mult is

  -- A multiplier register of width N
  component multregn is
    generic(N : positive );
    port ( BusOUT_S2H  : out std_ulogic_vector(N-1 downto 0);
           MultOUT_S2H : out std_ulogic_vector(N-1 downto 0);
           BusIN_S2H   : in  std_ulogic_vector(N-1 downto 0);
           MultIN_V2H  : in  std_ulogic_vector(N-1 downto 0);
           SEL_S2H     : in  std_ulogic;
           BS_S2H      : in  std_ulogic;
           RW_S2H      : in  std_ulogic;
           MEDLY_S2H   : in  std_ulogic;
           Rst_S2H     : in  std_ulogic;
           PHI_1H      : in  std_ulogic;
           PHI_2H      : in  std_ulogic);
  end component;

  component mult_cla is
    generic ( N: positive );
    port( z_v2h   : out std_ulogic_vector(N-1 downto 0);
          car_v2h : out std_ulogic;
          a_v2h   : in  std_ulogic_vector(N-1 downto 0);
          b_v2h   : in  std_ulogic_vector(N-1 downto 0);
          c_v2h   : in  std_ulogic );
  end component;

  -- A 16 x 16 + 32 = 32 + ovrflw
  -- ovrflw is w[31] x[31] y[31] to compute with z[31]
  component mult_pipe is
    port( z_v2h      : out std_ulogic_vector(7 downto 0);
          a_v2h      : out std_ulogic_vector(23 downto 0);
          b_v2h      : out std_ulogic_vector(23 downto 0);
          c_v2h      : out std_ulogic;
          ovrflw_v2h : out std_ulogic_vector(2 downto 0);
          medly_s2h  : out std_ulogic;
          x_s2h      : in  std_ulogic_vector(15 downto 0);
          y_s2h      : in  std_ulogic_vector(15 downto 0);
          w_s2h      : in  std_ulogic_vector(31 downto 0);
          me_s2h     : in  std_ulogic;
          PHI_1H     : in  std_ulogic;
          PHI_2H     : in  std_ulogic;
          Rst_s2h    : in  std_ulogic
        );
  end component;

  -- single bit D flip flop
  component dffr_fall is
    port ( Rst      : in  std_ulogic;
           Clk      : in  std_ulogic;
           signal D : in  std_ulogic;
           signal Q : out std_ulogic);
  end component;

  component buf is
    port ( Q : out std_ulogic;
           D : in  std_ulogic);
  end component;

  -- Buses to/from the multiplier from the registers
  signal bus_x : std_ulogic_vector(15 downto 0);
  signal bus_y : std_ulogic_vector(15 downto 0);
  signal bus_w : std_ulogic_vector(31 downto 0);
  signal bus_z : std_ulogic_vector(31 downto 0);

  -- wiring from multiplier to CLA unit
  signal bus_a   : std_ulogic_vector(23 downto 0);
  signal bus_b   : std_ulogic_vector(23 downto 0);
  signal bus_c   : std_ulogic;
  signal car_out : std_ulogic;

  -- Decoded select bus and write-enable signals
  signal bus_sel_h : std_ulogic_vector(7 downto 0);
  signal bus_wr_h  : std_ulogic_vector(7 downto 0);

  constant Gnd_16 : std_ulogic_vector(15 downto 0) := "0000000000000000";
  signal GND : std_ulogic := '0';
  signal Vdd : std_ulogic := '1';

  -- enable verbose debug output
  signal DBG_EN  : std_ulogic := '0'; -- general (includes OVRFLW)
  signal DBG2_EN : std_ulogic := '0'; -- specific to OVRFLW signal
  signal dbg2    : std_ulogic := '0'; -- computed from DBG_EN or DBG2_EN

  signal medly_s2h  : std_ulogic; -- output from mult_pipe
  signal medly_q2h  : std_ulogic; -- Q2 version to clock in ovrflw_v2h
  signal ovrflw_v2h : std_ulogic; -- computed ovrflw
  signal ovrout_v2h : std_ulogic_vector(2 downto 0); -- output from mult
  signal ovr_s2h    : std_ulogic; -- from dff to output pin, for monitoring
  signal rst_q2h    : std_ulogic; -- Q2 version to reset Ovrflw dff

  -- temporary signals used to compute overflow
  signal t1_h : std_ulogic;
  signal t2_h : std_ulogic;
  signal t3_h : std_ulogic;
  signal t4_h : std_ulogic;
  signal t5_h : std_ulogic;

  -- outputs from registers
  signal feed_r0 : std_ulogic_vector(15 downto 0);
  signal feed_r1 : std_ulogic_vector(15 downto 0);
  signal feed_r2 : std_ulogic_vector(15 downto 0);
  signal feed_r3 : std_ulogic_vector(15 downto 0);
  signal feed_r4 : std_ulogic_vector(15 downto 0);
  signal feed_r5 : std_ulogic_vector(15 downto 0);

  -- buffered clocks
  signal phi_a_1h : std_ulogic_vector(6 downto 0);
  signal phi_a_2h : std_ulogic_vector(6 downto 0);

begin

  --------------------------------------------------------------
  -- Decode the input register select
  bus_sel_h <= "00000001" when (bussel_s2h = "000") else
               "00000010" when (bussel_s2h = "001") else
               "00000100" when (bussel_s2h = "010") else
               "00001000" when (bussel_s2h = "011") else
               "00010000" when (bussel_s2h = "100") else
               "00100000" when (bussel_s2h = "101") else
               "01000000" when (bussel_s2h = "110") else
               "10000000" when (bussel_s2h = "111") else
               "XXXXXXXX";

  bus_wr_h <= "00000001" when (bus_sel_h(0) = '1' and RW_S2H = '0') else
              "00000010" when (bus_sel_h(1) = '1' and RW_S2H = '0') else
              "00000100" when (bus_sel_h(2) = '1' and RW_S2H = '0') else
              "00001000" when (bus_sel_h(3) = '1' and RW_S2H = '0') else
              "00010000" when (bus_sel_h(4) = '1' and RW_S2H = '0') else
              "00100000" when (bus_sel_h(5) = '1' and RW_S2H = '0') else
              "01000000" when (bus_sel_h(6) = '1' and RW_S2H = '0') else
              "10000000" when (bus_sel_h(7) = '1' and RW_S2H = '0') else
              "00000000";

  --------------------------------------------------------------
  busio_s2h <= To_StdLogicVector(feed_r0) when bus_wr_h(0) = '1' else "ZZZZZZZZZZZZZZZZ";
  busio_s2h <= To_StdLogicVector(feed_r1) when bus_wr_h(1) = '1' else "ZZZZZZZZZZZZZZZZ";
  busio_s2h <= To_StdLogicVector(feed_r2) when bus_wr_h(2) = '1' else "ZZZZZZZZZZZZZZZZ";
  busio_s2h <= To_StdLogicVector(feed_r3) when bus_wr_h(3) = '1' else "ZZZZZZZZZZZZZZZZ";
  busio_s2h <= To_StdLogicVector(feed_r4) when bus_wr_h(4) = '1' else "ZZZZZZZZZZZZZZZZ";
  busio_s2h <= To_StdLogicVector(feed_r5) when bus_wr_h(5) = '1' else "ZZZZZZZZZZZZZZZZ";

  --------------------------------------------------------------
  --------------------------------------------------------------
  -- R0 is the X register
  -- R0 never reads from the multiplier (BS = Vdd, MultIn = GND)
  reg_0: multregN
    generic map (16)
    port map (
      BusOUT_S2H  => feed_r0,
      BusIN_S2H   => To_StdULogicVector(busio_s2h),
      MultOut_S2H => bus_x,
      MultIn_V2H  => Gnd_16,
      Sel_s2h     => bus_sel_h(0),
      BS_S2H      => Vdd,
      RW_S2H      => RW_S2H,
      MEDLY_S2H   => MEDLY_Q2H,
      RST_S2H     => RST_S2H,
      PHI_1H      => PHI_1H,
      PHI_2H      => PHI_2H);

  -- R1 is the Y register
  -- R1 never reads from the multiplier (BS = Vdd, MultIn = GND)
  reg_1: multregN
    generic map (16)
    port map (
      BusOUT_S2H  => feed_r1,
      BusIN_S2H   => To_StdULogicVector(busio_s2h),
      MultOut_S2H => bus_y,
      MultIn_V2H  => Gnd_16,
      Sel_s2h     => bus_sel_h(1),
      BS_S2H      => Vdd,
      RW_S2H      => RW_S2H,
      MEDLY_S2H   => MEDLY_Q2H,
      RST_S2H     => RST_S2H,
      PHI_1H      => PHI_1H,
      PHI_2H      => PHI_2H);

  -- R2 is the W(31:16) register
  -- R2 never reads from the multiplier (BS = Vdd, MultIn = GND)
  reg_2: multregN
    generic map (16)
    port map (
      BusOUT_S2H  => feed_r2,
      BusIN_S2H   => To_StdULogicVector(busio_s2h),
      MultOut_S2H => bus_w(31 downto 16),
      MultIn_V2H  => Gnd_16,
      Sel_s2h     => bus_sel_h(2),
      BS_S2H      => Vdd,
      RW_S2H      => RW_S2H,
      MEDLY_S2H   => MEDLY_Q2H,
      RST_S2H     => RST_S2H,
      PHI_1H      => PHI_1H,
      PHI_2H      => PHI_2H);

  -- R3 is the W(15:0) register
  -- R3 never reads from the multiplier (BS = Vdd, MultIn = GND)
  reg_3: multregN
    generic map (16)
    port map (
      BusOUT_S2H  => feed_r3,
      BusIN_S2H   => To_StdULogicVector(busio_s2h),
      MultOut_S2H => bus_w(15 downto 0),
      MultIn_V2H  => Gnd_16,
      Sel_s2h     => bus_sel_h(3),
      BS_S2H      => Vdd,
      RW_S2H      => RW_S2H,
      MEDLY_S2H   => MEDLY_Q2H,
      RST_S2H     => RST_S2H,
      PHI_1H      => PHI_1H,
      PHI_2H      => PHI_2H);

  -- R4 is the Z(31:16) register
  -- R4 & R5 have no MultOut connections
  reg_4: multregN
    generic map (16)
    port map (
      BusOUT_S2H => feed_r4,
      BusIN_S2H  => To_StdULogicVector(busio_s2h),
      MultIn_V2H => bus_z(31 downto 16),
      Sel_s2h    => bus_sel_h(4),
      BS_S2H     => BS_S2H,
      RW_S2H     => RW_S2H,
      MEDLY_S2H  => MEDLY_Q2H,
      RST_S2H    => RST_S2H,
      PHI_1H     => PHI_1H,
      PHI_2H     => PHI_2H);

  -- R5 is the Z(15:0) register
  -- R4 & R5 have no MultOut connections
  reg_5: multregN
    generic map (16)
    port map (
      BusOUT_S2H => feed_r5,
      BusIN_S2H  => To_StdULogicVector(busio_s2h),
      MultIn_V2H => bus_z(15 downto 0),
      Sel_s2h    => bus_sel_h(5),
      BS_S2H     => BS_S2H,
      RW_S2H     => RW_S2H,
      MEDLY_S2H  => MEDLY_Q2H,
      RST_S2H    => RST_S2H,
      PHI_1H     => PHI_1H,
      PHI_2H     => PHI_2H);

  --------------------------------------------------------------
  -- Storage for the Overflow output
  --------------------------------------------------------------
  Rst_q2h   <= Rst_s2h and PHI_2H;
  Medly_q2h <= MEDLY_S2H and PHI_2H;

  dff_ovrflw: dffr_fall
    port map (Q=> ovr_s2h, D=> ovrflw_v2h, Clk=> MEDLY_Q2H, Rst=> Rst_q2h);

  -- allows us to monitor ovrflw_s2h without using a buffered I/O pin
  ovrflw_s2h <= ovr_s2h;

  --------------------------------------------------------------
  -- Connect to the multiplier
  --------------------------------------------------------------
  mult_0: mult_pipe
    port map (
      z_v2h      => bus_z(7 downto 0),
      a_v2h      => bus_a,
      b_v2h      => bus_b,
      c_v2h      => bus_c,
      ovrflw_v2h => ovrout_v2h,
      medly_s2h  => medly_s2h,
      x_s2h      => bus_x,
      y_s2h      => bus_y,
      w_s2h      => bus_w,
      me_s2h     => ME_S2H,
      PHI_1H     => phi_1h,
      PHI_2H     => phi_2h,
      Rst_s2h    => Rst_s2h
    );

  --------------------------------------------------------------
  -- Sum the results. Done as separate entity for synthesis
  --------------------------------------------------------------
  cla_0 : mult_cla
    generic map (24)
    port map (
      z_v2h   => bus_z(31 downto 8),
      car_v2h => car_out,
      a_v2h   => bus_a,
      b_v2h   => bus_b,
      c_v2h   => bus_c
    );

  --------------------------------------------------------------
  -- Compute the overflow
  --   An overflow is defined when
  --   1) x*y > 0 and w > 0 and z < 0 or
  --   2) x*y < 0 and w < 0 and z > 0
  --------------------------------------------------------------
  t1_h <= ovrout_v2h(1) xor ovrout_v2h(0);
  t2_h <= ovrout_v2h(2) nor t1_h;
  t3_h <= ovrout_v2h(2) nand t1_h;
  t4_h <= t2_h and bus_z(31);
  t5_h <= t3_h nor bus_z(31);

  ovrflw_v2h <= t4_h or t5_h;

  --------------------------------------------------------------
  -- Debugs to watch all the variables
  --------------------------------------------------------------
  --monitor(bus_a, "mult: bus_a", DBG_EN);
  --monitor(bus_b, "mult: bus_b", DBG_EN);
  --monitor(bus_c, "mult: bus_c", DBG_EN);
  --monitor(bus_z, "mult: bus_z", DBG_EN);
  --monitor(feed_r4, "feed_r4", '1');
  --monitor(feed_r5, "feed_r5", '1');

end structural;
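The overflow gate network in mult.vhd can be checked exhaustively against its stated definition, since it depends on only four sign bits. The testbench below is a sketch under assumptions: the name ovrflw_tb is hypothetical, ovrout_v2h is taken to be ordered (w[31], x[31], y[31]) from bit 2 down to bit 0 as the component comment suggests, and the sign of x*y is approximated by x[31] xor y[31], which is what the gates themselves use.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical testbench; not part of the original source tree.
entity ovrflw_tb is
end ovrflw_tb;

architecture sim of ovrflw_tb is
  signal w31, x31, y31, z31 : std_ulogic := '0';
  signal t1_h, t2_h, t3_h, t4_h, t5_h : std_ulogic;
  signal ovrflw : std_ulogic;
begin
  -- Same gate network as mult.vhd, with the assumed bit assignment
  -- ovrout_v2h(2) = w31, ovrout_v2h(1) = x31, ovrout_v2h(0) = y31
  t1_h   <= x31 xor y31;
  t2_h   <= w31 nor t1_h;
  t3_h   <= w31 nand t1_h;
  t4_h   <= t2_h and z31;
  t5_h   <= t3_h nor z31;
  ovrflw <= t4_h or t5_h;

  check: process
    variable p      : std_ulogic;  -- sign of the product x*y
    variable expect : std_ulogic;
  begin
    for wv in std_ulogic range '0' to '1' loop
      for xv in std_ulogic range '0' to '1' loop
        for yv in std_ulogic range '0' to '1' loop
          for zv in std_ulogic range '0' to '1' loop
            w31 <= wv; x31 <= xv; y31 <= yv; z31 <= zv;
            wait for 10 ns;  -- let the combinational logic settle
            -- Overflow: both addends share a sign and the sum does not
            p := xv xor yv;
            if (p = wv) and (zv /= wv) then
              expect := '1';
            else
              expect := '0';
            end if;
            assert ovrflw = expect
              report "overflow logic disagrees with the sign rule"
              severity error;
          end loop;
        end loop;
      end loop;
    end loop;
    report "ovrflw_tb finished";
    wait;
  end process;
end sim;
```

Because only the sign bits are examined, a zero product or zero accumulate is treated as positive; the check confirms the gates match the sign rule, not full 32-bit arithmetic.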
mult_cla.vhd
------------------------------------------------------------------------
-- 24-bit CLA as separate entity for synthesis
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;

entity mult_cla is
  generic (N : positive );
  port( z_v2h   : out std_ulogic_vector(N-1 downto 0);
        car_v2h : out std_ulogic;
        a_v2h   : in  std_ulogic_vector(N-1 downto 0);
        b_v2h   : in  std_ulogic_vector(N-1 downto 0);
        c_v2h   : in  std_ulogic
      );
end mult_cla;

architecture rtl of mult_cla is
  component plusN is
    generic( N : positive);
    port ( a_h   : in  std_ulogic_vector(N-1 downto 0);
           b_h   : in  std_ulogic_vector(N-1 downto 0);
           c_h   : in  std_ulogic;
           sum_h : out std_ulogic_vector(N-1 downto 0);
           car_h : out std_ulogic);
  end component;

  component claN is
    generic( N : positive);
    port ( a_h   : in  std_ulogic_vector(N-1 downto 0);
           b_h   : in  std_ulogic_vector(N-1 downto 0);
           c_h   : in  std_ulogic;
           sum_h : out std_ulogic_vector(N-1 downto 0);
           car_h : out std_ulogic);
  end component;
begin
  --add : claN generic map (24)
  add : plusN generic map (24)
    port map( a_h   => a_v2h,
              b_h   => b_v2h,
              c_h   => c_v2h,
              sum_h => z_v2h,
              car_h => car_v2h);
end rtl;
mult_pipe.vhd
------------------------------------------------------------------------
-- Booth encoded carry-save-adder array
--
-- From "Low-power Digital VLSI Design" by Bellaouar and Elmasry.
-- and
-- "Circuit Techniques for CMOS Low-Power High-Performance Multipliers"
-- by Abu-Khater, Bellaouar, Elmasry in IEEE J. Solid-State Circuits v.31 (10)
-- Oct 1996 pp. 1535ff
--
-- z_v2h       Multiply accumulate output (x * y + w) (only low-order 8 bits)
-- a_v2h       goes to fast adder for high-order 24-bits
-- b_v2h
-- c_v2h
-- ovrflw_v2h  3-bits to compute overflow (w[31] x[31] y[31])
-- medly_s2h   Output good at end of phase (see me_s2h, this is delayed)
-- x_s2h       multiplicand
-- y_s2h       multiplier (gets booth encoded)
-- w_s2h       accumulate
-- me_s2h      multiplier enable
-- PHI_1H      clock
-- PHI_2H      clock
-- Rst_s2h     Reset internal registers to 0
--
-- The Y inputs are booth encoded then gated until ME_S2H & PHI_2H.
-- The Y inputs should be applied first to give the booth encoders time
-- to settle. The Y inputs must remain valid until MEDLY_S2H (actually
-- until a 1/2 cycle before...)
------------------------------------------------------------------------
------------------------------------------------------------------------
-- Variables are generally named as follows:
--   name_PtCl
--
-- P = pipe line stage (1, 2, or 3)
-- t = type (s,q,v)
-- C = clock phase (1 or 2)
-- l = logic (L or H)
--
-- examples:
-- sum_0_1v2h = row 0 sum 1st pipe stage, V timing, Phi-2, active high
--
-- Rules:
-- Variables can only be assigned if P and C the same:
--   x_1v2h <= y_1v2h xor a_1s2h;
--
-- To go between phases/stages you need to use a storage device:
--
--   gdffr_fall(Q=> x_2v1h, D=> x_1v2h, Clk=> mdly_q2h, Enable=> mdly_q1h)
--   This clocks in x_1v2h on mdly_q2h and
--   enables the output to x_2v1h on mdly_q1h
------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
--use work.converts.all;

-- We use a fixed width / height for simplicity.
-- Overflow = x*y + w out of range
entity mult_pipe is
  port( z_v2h      : out std_ulogic_vector(7 downto 0);
        a_v2h      : out std_ulogic_vector(23 downto 0);
        b_v2h      : out std_ulogic_vector(23 downto 0);
        c_v2h      : out std_ulogic;
        ovrflw_v2h : out std_ulogic_vector(2 downto 0);
        medly_s2h  : out std_ulogic;
        x_s2h      : in  std_ulogic_vector(15 downto 0);
        y_s2h      : in  std_ulogic_vector(15 downto 0);
        w_s2h      : in  std_ulogic_vector(31 downto 0);
        me_s2h     : in  std_ulogic;
        PHI_1H     : in  std_ulogic;
        PHI_2H     : in  std_ulogic;
        Rst_s2h    : in  std_ulogic
      );
end mult_pipe;
architecture rtl of mult_pipe is
  constant COL : integer := 16;
  constant ROW : integer := 8;

  -- AddCell will add a 0/1 to each row depending on the sign
  -- of the booth encoding.
  component addcell is
    port ( bth : in  std_ulogic_vector(4 downto 0);
           sum : out std_ulogic);
  end component;

  -- A standard full adder
  component adder is
    port ( a_h   : in  std_ulogic;
           b_h   : in  std_ulogic;
           c_h   : in  std_ulogic;
           sum_h : out std_ulogic;
           car_h : out std_ulogic);
  end component;

  -- unsigned addition
  component uplusN is
    generic( N : positive);
    port ( a_h   : in  std_ulogic_vector(N-1 downto 0);
           b_h   : in  std_ulogic_vector(N-1 downto 0);
           c_h   : in  std_ulogic;
           sum_h : out std_ulogic_vector(N-1 downto 0);
           car_h : out std_ulogic);
  end component;

  -- A standard full adder 15 bits wide
  component adderN is
    generic( N : positive);
    port ( a_h   : in  std_ulogic_vector(N-1 downto 0);
           b_h   : in  std_ulogic_vector(N-1 downto 0);
           c_h   : in  std_ulogic;
           sum_h : out std_ulogic_vector(N-1 downto 0);
           car_h : out std_ulogic);
  end component;

  -- Generate a 5-line demultiplexed booth encoding of 3 input bits
  component booth_encode is
    port( in_h  : in  std_ulogic_vector (2 downto 0);
          bth_h : out std_ulogic_vector (4 downto 0));
  end component;

  -- Partial product generator with full adder
  -- Has only SUM (and carry) out
  component ppfa is
    port ( bth   : in  std_ulogic_vector(4 downto 0);
           x1_h  : in  std_ulogic;
           x2_h  : in  std_ulogic;
           s0_h  : in  std_ulogic;
           c0_h  : in  std_ulogic;
           sum_h : out std_ulogic;
           ca1_h : out std_ulogic);
  end component;

  -- Partial product generator with full adder
  -- Has both PP out and SUM (and carry) out
  component ppfapp is
    port ( bth   : in  std_ulogic_vector(4 downto 0);
           x1_h  : in  std_ulogic;
           x2_h  : in  std_ulogic;
           s0_h  : in  std_ulogic;
           c0_h  : in  std_ulogic;
           pp_h  : out std_ulogic;
           sum_h : out std_ulogic;
           ca1_h : out std_ulogic);
  end component;

  -- Sign extender. Computes sign bits to pass to next row.
  -- Adds 2 bits per row. "ff" is the "flag" bit.
  component sgn is
    port ( pp_h     : in  std_ulogic;
           ff_h     : in  std_ulogic;
           pp_out_h : out std_ulogic;
           ff_out_h : out std_ulogic);
  end component;

  -- D flip flop with reset
  component dffr_fall is
    port ( Rst      : in  std_ulogic;
           Clk      : in  std_ulogic;
           signal D : in  std_ulogic;
           signal Q : out std_ulogic);
  end component;

  component dffrN_fall is
    generic(N : positive );
    port ( Rst      : in  std_ulogic;
           Clk      : in  std_ulogic;
           signal D : in  std_ulogic_vector(N-1 downto 0);
           signal Q : out std_ulogic_vector(N-1 downto 0));
  end component;

  -- a gated flipflop
  component gdffr_fall is
    port ( Rst      : in  std_ulogic;
           Clk      : in  std_ulogic;
           Enable   : in  std_ulogic;
           signal D : in  std_ulogic;
           signal Q : out std_ulogic);
  end component;

  -- an N-bit gated flipflop
  component gdffrN_fall is
    generic(N : positive );
    port ( Rst      : in  std_ulogic;
           Clk      : in  std_ulogic;
           Enable   : in  std_ulogic;
           signal D : in  std_ulogic_vector(N-1 downto 0);
           signal Q : out std_ulogic_vector(N-1 downto 0));
  end component;

  -- These are the outputs from the sign extenders
  -- one for each row
  -- pp15 is the pp output of the 15th column of each row
  -- we need 9 sets of wires since we have inputs to row 0 and outputs from row 7

  -- v2 signals in 1st pipe stage, v1 signals in 2nd
  -- (pp1 = 1st stage, pp2 = 2nd, pp3 = 3rd)
  -- There is some overlap here, since in PHI2 we generate pp1_v2h(4) which
  -- is then latched to PHI1
  signal pp_1v2h   : std_ulogic_vector(4 downto 0);
  signal ff_1v2h   : std_ulogic_vector(4 downto 0);
  signal pp15_1v2h : std_ulogic_vector(4 downto 0);

  signal pp_2v1h   : std_ulogic_vector(8 downto 4);
  signal ff_2v1h   : std_ulogic_vector(8 downto 4);
  signal pp15_2v1h : std_ulogic_vector(7 downto 4);
  signal pp_3v2h   : std_ulogic_vector(8 downto 8);

  -- each row has an output from the addcell
  signal add_1v2h : std_ulogic_vector(3 downto 0);
  signal add_2v1h : std_ulogic_vector(7 downto 0); -- these are a cycle later
  signal add_3v2h : std_ulogic_vector(7 downto 4);

  -- each row gets own array. Don't try 2-dimension array.
  -- sum_x_h is the sum output of each column in row X.
  -- ca1_x_h is the carry output of each column in row X.
  -- pre_A_h is the booth encoding for row A before the gate
  -- bth_A_h is the booth encoding for row A after the gate

  -- The V2H signals are outputs from the multiplier body
  -- the S1H signals are outputs from the 1st pipeline registers
  -- the V1H signals are outputs from the 1st pipeline gates
  signal sum_0_1v2h  : std_ulogic_vector(COL downto 0);
  signal car_0_1v2h  : std_ulogic_vector(COL downto 0);
  signal sum_0_2v1h  : std_ulogic_vector(1 downto 0);
  signal car_0_2v1h  : std_ulogic;
  signal bth_pre_0_h : std_ulogic_vector(4 downto 0);
  signal bth_0_1v2h  : std_ulogic_vector(4 downto 0);

  signal sum_1_1v2h  : std_ulogic_vector(COL downto 0);
  signal car_1_1v2h  : std_ulogic_vector(COL downto 0);
  signal sum_1_2v1h  : std_ulogic_vector(1 downto 0);
  signal car_1_2v1h  : std_ulogic;
  signal bth_pre_1_h : std_ulogic_vector(4 downto 0);
  signal bth_1_1v2h  : std_ulogic_vector(4 downto 0);

  signal sum_2_1v2h  : std_ulogic_vector(COL downto 0);
  signal car_2_1v2h  : std_ulogic_vector(COL downto 0);
  signal sum_2_2v1h  : std_ulogic_vector(1 downto 0);
  signal car_2_2v1h  : std_ulogic;
  signal bth_pre_2_h : std_ulogic_vector(4 downto 0);
  signal bth_2_1v2h  : std_ulogic_vector(4 downto 0);

  signal sum_3_1v2h  : std_ulogic_vector(COL downto 0);
  signal car_3_1v2h  : std_ulogic_vector(COL downto 0);
  signal sum_3_2v1h  : std_ulogic_vector(COL downto 0);
  signal car_3_2v1h  : std_ulogic_vector(COL downto 0);
  signal bth_pre_3_h : std_ulogic_vector(4 downto 0);
  signal bth_3_1v2h  : std_ulogic_vector(4 downto 0);

  -- The V1H signals are outputs from the multiplier body
  -- the S2H signals are outputs from the 2nd pipeline registers
  -- the V2H signals are outputs from the 2nd pipeline gates
  signal sum_4_2v1h  : std_ulogic_vector(COL downto 0);
  signal car_4_2v1h  : std_ulogic_vector(COL downto 0);
  signal sum_4_3v2h  : std_ulogic_vector(1 downto 0);
  signal car_4_3v2h  : std_ulogic;
  signal bth_pre_4_h : std_ulogic_vector(4 downto 0);
  signal bth_4_2v1h  : std_ulogic_vector(4 downto 0);

  signal sum_5_2v1h  : std_ulogic_vector(COL downto 0);
  signal car_5_2v1h  : std_ulogic_vector(COL downto 0);
  signal sum_5_3v2h  : std_ulogic_vector(1 downto 0);
  signal car_5_3v2h  : std_ulogic;
  signal bth_pre_5_h : std_ulogic_vector(4 downto 0);
  signal bth_5_2v1h  : std_ulogic_vector(4 downto 0);

  signal sum_6_2v1h  : std_ulogic_vector(COL downto 0);
  signal car_6_2v1h  : std_ulogic_vector(COL downto 0);
  signal sum_6_3v2h  : std_ulogic_vector(1 downto 0);
  signal car_6_3v2h  : std_ulogic;
  signal bth_pre_6_h : std_ulogic_vector(4 downto 0);
  signal bth_6_2v1h  : std_ulogic_vector(4 downto 0);

  signal sum_7_2v1h  : std_ulogic_vector(COL downto 0);
  signal car_7_2v1h  : std_ulogic_vector(COL downto 0);
  signal sum_7_3v2h  : std_ulogic_vector(COL downto 0);
  signal car_7_3v2h  : std_ulogic_vector(COL downto 0);
  signal bth_pre_7_h : std_ulogic_vector(4 downto 0);
  signal bth_7_2v1h  : std_ulogic_vector(4 downto 0);

  -- The first 15 bits go into a full adder array.
  -- The last 17 bits go into a 42 compressor array with W()
  --
  -- These are the a_h() and b_h() inputs and the carry output
  signal fa_a_2v1h   : std_ulogic_vector(7 downto 0);
  signal fa_b_2v1h   : std_ulogic_vector(7 downto 0);
  signal fa_car_2v1h : std_ulogic;

  -- these feed the 24-bit CLA
  -- fa_a_3 is (32 - 8) to accommodate an extra carry bit that we do not use
  signal fa_a_3v2h    : std_ulogic_vector(32 downto 8);
  signal fa_b_3v2h    : std_ulogic_vector(31 downto 8);
  signal fa1_car_3v2h : std_ulogic;

  -- The carry outputs of bit 16's compressor (no longer use 4:2 compressors, but
  -- the name is the same...)
  --signal comp_ca1_3v2h: std_ulogic;
  --signal comp_ca2_3v2h: std_ulogic;

  -- b input and Carry outputs of the 42 compressor array
  -- cout_out_h is the output of the 42 compressors (since z_v2h
  -- is not inout or buffered) no longer use 42 compressors, but name is the same.
  signal comp_b_3v2h   : std_ulogic_vector(15 downto 0);
  signal comp_out_3v2h : std_ulogic_vector(31 downto 0);

  -- some miscellaneous signals used to compute the overflow
  constant GND : std_ulogic := '0';
  constant VDD : std_ulogic := '1';

  -- a modified version of x_s2h to align with the times 2 needed for booth
  -- Use a tempx as the bit-sliced version then assign whole to myx_v2h
  -- A ModelSim technote said this was the way to do it....
  -- We need to pad with a "0" on the right and duplicate x_s2h(15) on left
  -- myx_v2h is also gated on ME_Q2H
  -- myx_s1h/v1h is latched/gated on MDLY_Q1H in the 2nd pipeline stage
  signal myx_1v2h : std_ulogic_vector(COL+1 downto 0);
  signal myx_2v1h : std_ulogic_vector(COL+1 downto 0);
  signal myx_3v2h : std_ulogic;  -- needed in 3rd pipeline stage
  signal myy_2s1h : std_ulogic_vector(COL-1 downto 7);
  signal myy_3v2h : std_ulogic;  -- needed in 3rd pipeline stage
  signal tempx    : std_ulogic_vector(COL+1 downto 0);

  -- The W signal is gated in three places.
  --    W[14:0] is gated on ME_Q2H
  --    W[15] is gated on MDLY_Q1H
  --    W[31:16] is gated on MDLY_Q2H
  -- the array indices are to keep them the same as w_s2h
  signal w_1v2h : std_ulogic_vector(31 downto 0);
  signal w_2v1h : std_ulogic_vector(31 downto 15);
  signal w_3v2h : std_ulogic_vector(31 downto 15);

  -- a temp signal array for the Y input to row 0 booth encoder.
  signal y0_in : std_ulogic_vector(2 downto 0);

  -- timing signals for pipeline registers and gates
  signal me_1q2h : std_ulogic;
  signal me_2s1h : std_ulogic;
  signal me_2q1h : std_ulogic;
  signal me_3s2h : std_ulogic;
  signal me_3q2h : std_ulogic;

  -- Internally guarded RESET on PHI_2
  signal rst_q2h : std_ulogic;

  -- The 1st 8 bits of z are generated in the 2nd pipeline stage
  signal z_2v1h : std_ulogic_vector(7 downto 0);

  signal DBG_EN : std_ulogic := '0';

begin

  -- Generate the internal reset signal
  rst_q2h <= rst_s2h and phi_2h;

  -- Generate internal clocking
  me_1q2h <= me_s2h and phi_2h;
  me_2q1h <= me_2s1h and phi_1h;
  me_3q2h <= (me_3s2h or rst_s2h) and phi_2h;

  dff_clk0: dffr_fall port map( D => me_s2h,  Q=> me_2s1h, CLK=> phi_2h, Rst=> rst_q2h);
  dff_clk1: dffr_fall port map( D => me_2s1h, Q=> me_3s2h, CLK=> phi_1h, Rst=> rst_q2h);

  medly_s2h <= me_3s2h;

  -- Create the gated W signals
  --w_1v2h <= w_s2h when me_1q2h = '1' else (others => 'Z');
  w_1v2h <= w_s2h;

  -- take input from original w_s2h not the tristate version
  wlatch_1: dffrN_fall generic map(17)
    port map (Q=> w_2v1h(31 downto 15), D=> w_s2h(31 downto 15),
              Clk=> me_1q2h, Rst=> Rst_q2h);
  wlatch_2: dffrN_fall generic map(17)
    port map (Q=> w_3v2h, D=> w_2v1h(31 downto 15),
              Clk=> me_2q1h, Rst=> Rst_q2h);

  -- ff_h(0) is always 0
  ff_1v2h(0) <= GND;

  -- construct myx_v2h with 0 on right and x(15) dup'd on left
  tempx(0) <= GND;
  tempx(16 downto 1) <= x_s2h(15 downto 0);
  tempx(17) <= x_s2h(15);
  --myx_1v2h <= tempx when me_1q2h = '1' else (others => 'Z');
  myx_1v2h <= tempx;

  -- The C++ code used a 2d array for cells. I'm just going to use 1d
  -- arrays and make life a little more readable.

  ---------------------------------------------------------------
  -- Generation overview
  -- 1) Generate the sign extender cells, one cell per row
  -- 2) The booth encoders, one cell per row (includes gates on me_q2h and mdly_q1h)
  -- 3) The add cells, one per row
  -- 4) The Multiplier body, 16 columns by 8 rows
  -- 5) An 8-bit FA in pipe stage 2 for the sums from the 1st pipe stage
  -- 6) a 24-bit CLA adder for the remaining bits in the 3rd pipe stage.
  --    Since we must accumulate W, there are also non-ripple FAs feeding the CLA
  ---------------------------------------------------------------

  ---------------------------------------------------------------
  -- 1) Generate the sign extender cells, one cell per row
  ---------------------------------------------------------------
  -- There is one sign cell per row
  COLGEN1: for i in 0 to 3 generate
    sgncell : sgn port map( pp_h => pp15_1v2h(i), ff_h => ff_1v2h(i),
                            pp_out_h => pp_1v2h(i+1), ff_out_h => ff_1v2h(i+1) );
  end generate;

  pipe_pp2: gdffr_fall port map ( Q=> pp_2v1h(4), D=> pp_1v2h(4),
      Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
  pipe_ff2: gdffr_fall port map ( Q=> ff_2v1h(4), D=> ff_1v2h(4),
      Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

  COLGEN2: for i in 4 to 7 generate
    sgncell : sgn port map( pp_h => pp15_2v1h(i), ff_h => ff_2v1h(i),
                            pp_out_h => pp_2v1h(i+1), ff_out_h => ff_2v1h(i+1) );
  end generate;

  pipe_pp3: gdffr_fall port map ( Q=> pp_3v2h(8), D=> pp_2v1h(8),
      Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
  --pipe_ff3: gdffr_fall port map ( Q=> ff_3v2h(8), D=> ff_2v1h(8),
  --    Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

  ---------------------------------------------------------------
  -- 2) The booth encoders, one cell per row
  ---------------------------------------------------------------
  -- Generate the Booth encoders, one per row. Note that row 0 is special
  -- and pads a "0" as LSB.
  y0_in(2 downto 1) <= y_s2h(1 downto 0);
  y0_in(0) <= GND;

  bth_0 : booth_encode port map ( in_h => y0_in,              bth_h => bth_pre_0_h);
  bth_1 : booth_encode port map ( in_h => y_s2h(3 downto 1),  bth_h => bth_pre_1_h);
  bth_2 : booth_encode port map ( in_h => y_s2h(5 downto 3),  bth_h => bth_pre_2_h);
  bth_3 : booth_encode port map ( in_h => y_s2h(7 downto 5),  bth_h => bth_pre_3_h);

  -- Delay y_s2h(15 downto 7) until stage 2
  bth_4 : booth_encode port map ( in_h => myy_2s1h(9 downto 7),   bth_h => bth_pre_4_h);
  bth_5 : booth_encode port map ( in_h => myy_2s1h(11 downto 9),  bth_h => bth_pre_5_h);
  bth_6 : booth_encode port map ( in_h => myy_2s1h(13 downto 11), bth_h => bth_pre_6_h);
  bth_7 : booth_encode port map ( in_h => myy_2s1h(15 downto 13), bth_h => bth_pre_7_h);
  -- Pass the booth encoding through the gated drivers
  --bth_0_1v2h <= bth_pre_0_h when me_1q2h = '1' else (others=>'0');
  --bth_1_1v2h <= bth_pre_1_h when me_1q2h = '1' else (others=>'0');
  --bth_2_1v2h <= bth_pre_2_h when me_1q2h = '1' else (others=>'0');
  --bth_3_1v2h <= bth_pre_3_h when me_1q2h = '1' else (others=>'0');
  --bth_4_2v1h <= bth_pre_4_h when me_2q1h = '1' else (others=>'0');
  --bth_5_2v1h <= bth_pre_5_h when me_2q1h = '1' else (others=>'0');
  --bth_6_2v1h <= bth_pre_6_h when me_2q1h = '1' else (others=>'0');
  --bth_7_2v1h <= bth_pre_7_h when me_2q1h = '1' else (others=>'0');
  bth_0_1v2h <= bth_pre_0_h;
  bth_1_1v2h <= bth_pre_1_h;
  bth_2_1v2h <= bth_pre_2_h;
  bth_3_1v2h <= bth_pre_3_h;
  bth_4_2v1h <= bth_pre_4_h;
  bth_5_2v1h <= bth_pre_5_h;
  bth_6_2v1h <= bth_pre_6_h;
  bth_7_2v1h <= bth_pre_7_h;

  -- store bit y(15) until 3rd pipeline stage for overflow
  glatch_y2 : dffrN_fall generic map (9)
    port map ( Q=> myy_2s1h, D=> y_s2h(15 downto 7),
               Clk=> me_1q2h, Rst=> Rst_q2h);
  glatch_y3 : dffr_fall
    port map ( Q=> myy_3v2h, D=> myy_2s1h(15),
               Clk=> me_2q1h, Rst=> Rst_q2h);

  ---------------------------------------------------------------
  -- 3) The add cells, one per row
  ---------------------------------------------------------------
  -- The Add Cells get mixed up on the indices, since booth encoding is
  -- not a row array. Easiest to just declare each one outside a generate loop
  addcell_0 : addcell port map ( bth => bth_0_1v2h, sum => add_1v2h(0) );
  addcell_1 : addcell port map ( bth => bth_1_1v2h, sum => add_1v2h(1) );
  addcell_2 : addcell port map ( bth => bth_2_1v2h, sum => add_1v2h(2) );
  addcell_3 : addcell port map ( bth => bth_3_1v2h, sum => add_1v2h(3) );
  addcell_4 : addcell port map ( bth => bth_4_2v1h, sum => add_2v1h(4) );
  addcell_5 : addcell port map ( bth => bth_5_2v1h, sum => add_2v1h(5) );
  addcell_6 : addcell port map ( bth => bth_6_2v1h, sum => add_2v1h(6) );
  addcell_7 : addcell port map ( bth => bth_7_2v1h, sum => add_2v1h(7) );

  -- Delay the first 4 to 2nd stage
  gadd1: gdffrN_fall generic map(4)
    port map( Q=> add_2v1h(3 downto 0), D=>add_1v2h,
              Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

  -- Delay the last 4 to 3rd stage
  gadd2: gdffrN_fall generic map(4)
    port map( Q=> add_3v2h(7 downto 4), D=>add_2v1h(7 downto 4),
              Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

  ---------------------------------------------------------------
  -- 4) The Multiplier body, 16 columns by 8 rows
  ---------------------------------------------------------------
  -- i is the column
  ROWGEN: for i in 0 to COL generate
    -- for the PPFA cells, columns 0 to 14 use regular PPFA cells
    -- column 15 uses the PPFAPP cell which has a tap on the PP output of
    -- the mux. This is needed to do the sign extension.
    --
    -- So, ppfa_0(5), for example, would be column 5 of row 0
    -- The first 15 columns get sum/carry inputs from previous row
    -- Columns 15 and 16 get special wiring from the sign extenders
    -- Column 16 also uses the PPFAPP cells
    G0: if( i < COL-1 ) generate
      -- Row 0 is special and gets W() inputs
      ppfa_0: ppfa port map(
        bth => bth_0_1v2h, x1_h => myx_1v2h(i+1), x2_h => myx_1v2h(i),
        s0_h => w_1v2h(i), c0_h => GND,
        sum_h => sum_0_1v2h(i), ca1_h => car_0_1v2h(i));

      -- All other rows get s0_h from 2 columns left and
      -- c0_h from 1 column left from the previous row.
      ppfa_1: ppfa port map(
        bth => bth_1_1v2h, x1_h => myx_1v2h(i+1), x2_h => myx_1v2h(i),
        s0_h => sum_0_1v2h(i+2), c0_h => car_0_1v2h(i+1),
        sum_h => sum_1_1v2h(i), ca1_h => car_1_1v2h(i));
      ppfa_2: ppfa port map(
        bth => bth_2_1v2h, x1_h => myx_1v2h(i+1), x2_h => myx_1v2h(i),
        s0_h => sum_1_1v2h(i+2), c0_h => car_1_1v2h(i+1),
        sum_h => sum_2_1v2h(i), ca1_h => car_2_1v2h(i));
      ppfa_3: ppfa port map(
        bth => bth_3_1v2h, x1_h => myx_1v2h(i+1), x2_h => myx_1v2h(i),
        s0_h => sum_2_1v2h(i+2), c0_h => car_2_1v2h(i+1),
        sum_h => sum_3_1v2h(i), ca1_h => car_3_1v2h(i));

      p00_sum1 : gdffr_fall port map ( Q=> sum_3_2v1h(i), D=> sum_3_1v2h(i),
          Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
      p00_car1 : gdffr_fall port map ( Q=> car_3_2v1h(i), D=> car_3_1v2h(i),
          Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
      -- use the value before the tri-state
      p00_x1 : gdffr_fall port map ( Q=> myx_2v1h(i), D=> tempx(i),
          Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

      ppfa_4: ppfa port map(
        bth => bth_4_2v1h, x1_h => myx_2v1h(i+1), x2_h => myx_2v1h(i),
        s0_h => sum_3_2v1h(i+2), c0_h => car_3_2v1h(i+1),
        sum_h => sum_4_2v1h(i), ca1_h => car_4_2v1h(i));
      ppfa_5: ppfa port map(
        bth => bth_5_2v1h, x1_h => myx_2v1h(i+1), x2_h => myx_2v1h(i),
        s0_h => sum_4_2v1h(i+2), c0_h => car_4_2v1h(i+1),
        sum_h => sum_5_2v1h(i), ca1_h => car_5_2v1h(i));
      ppfa_6: ppfa port map(
        bth => bth_6_2v1h, x1_h => myx_2v1h(i+1), x2_h => myx_2v1h(i),
        s0_h => sum_5_2v1h(i+2), c0_h => car_5_2v1h(i+1),
        sum_h => sum_6_2v1h(i), ca1_h => car_6_2v1h(i));
      ppfa_7: ppfa port map(
        bth => bth_7_2v1h, x1_h => myx_2v1h(i+1), x2_h => myx_2v1h(i),
        s0_h => sum_6_2v1h(i+2), c0_h => car_6_2v1h(i+1),
        sum_h => sum_7_2v1h(i), ca1_h => car_7_2v1h(i));

      p00_sum2 : gdffr_fall port map ( Q=> sum_7_3v2h(i), D=> sum_7_2v1h(i),
          Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
      p00_car2 : gdffr_fall port map ( Q=> car_7_3v2h(i), D=> car_7_2v1h(i),
          Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
    end generate G0;

    -- In column 15, the s0_h input is the "pp" output of the sign extender
    -- pp_h() is indexed by row number.
    G15: if( i = COL-1 ) generate
      ppfa15_0: ppfa port map(
        bth => bth_0_1v2h, x1_h => myx_1v2h(i+1), x2_h => myx_1v2h(i),
        s0_h => GND, c0_h => GND,
        sum_h => sum_0_1v2h(i), ca1_h => car_0_1v2h(i));
      ppfa15_1: ppfa port map(
        bth => bth_1_1v2h, x1_h => myx_1v2h(i+1), x2_h => myx_1v2h(i),
        s0_h => pp_1v2h(1), c0_h => car_0_1v2h(i+1),
        sum_h => sum_1_1v2h(i), ca1_h => car_1_1v2h(i));
      ppfa15_2: ppfa port map(
        bth => bth_2_1v2h, x1_h => myx_1v2h(i+1), x2_h => myx_1v2h(i),
        s0_h => pp_1v2h(2), c0_h => car_1_1v2h(i+1),
        sum_h => sum_2_1v2h(i), ca1_h => car_2_1v2h(i));
      ppfa15_3: ppfa port map(
        bth => bth_3_1v2h, x1_h => myx_1v2h(i+1), x2_h => myx_1v2h(i),
        s0_h => pp_1v2h(3), c0_h => car_2_1v2h(i+1),
        sum_h => sum_3_1v2h(i), ca1_h => car_3_1v2h(i));

      p15_sum1 : gdffr_fall port map ( Q=> sum_3_2v1h(i), D=> sum_3_1v2h(i),
          Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
      p15_car1 : gdffr_fall port map ( Q=> car_3_2v1h(i), D=> car_3_1v2h(i),
          Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
      -- use value before the tri-state (don't use myx_1v2h)
      p15_x1 : gdffr_fall port map ( Q=> myx_2v1h(i), D=> tempx(i),
          Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

      ppfa15_4: ppfa port map(
        bth => bth_4_2v1h, x1_h => myx_2v1h(i+1), x2_h => myx_2v1h(i),
        s0_h => pp_2v1h(4), c0_h => car_3_2v1h(i+1),
        sum_h => sum_4_2v1h(i), ca1_h => car_4_2v1h(i));
      ppfa15_5: ppfa port map(
        bth => bth_5_2v1h, x1_h => myx_2v1h(i+1), x2_h => myx_2v1h(i),
        s0_h => pp_2v1h(5), c0_h => car_4_2v1h(i+1),
        sum_h => sum_5_2v1h(i), ca1_h => car_5_2v1h(i));
      ppfa15_6: ppfa port map(
        bth => bth_6_2v1h, x1_h => myx_2v1h(i+1), x2_h => myx_2v1h(i),
        s0_h => pp_2v1h(6), c0_h => car_5_2v1h(i+1),
        sum_h => sum_6_2v1h(i), ca1_h => car_6_2v1h(i));
      ppfa15_7: ppfa port map(
        bth => bth_7_2v1h, x1_h => myx_2v1h(i+1), x2_h => myx_2v1h(i),
        s0_h => pp_2v1h(7), c0_h => car_6_2v1h(i+1),
        sum_h => sum_7_2v1h(i), ca1_h => car_7_2v1h(i));

      p15_sum2 : gdffr_fall port map ( Q=> sum_7_3v2h(i), D=> sum_7_2v1h(i),
          Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
      p15_car2 : gdffr_fall port map ( Q=> car_7_3v2h(i), D=> car_7_2v1h(i),
          Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
    end generate G15;

    -- In column 16, the s0_h input is the "ff" output of the sign extender
    -- The c0_h input is 0.
    G16: if( i = COL ) generate
      ppfapp_0: ppfapp port map(
        bth => bth_0_1v2h, x1_h => myx_1v2h(i+1), x2_h => myx_1v2h(i),
        s0_h => GND, c0_h => GND, pp_h => pp15_1v2h(0),
        sum_h => sum_0_1v2h(i), ca1_h => car_0_1v2h(i));
      ppfapp_1: ppfapp port map(
        bth => bth_1_1v2h, x1_h => myx_1v2h(i+1), x2_h => myx_1v2h(i),
        s0_h => ff_1v2h(1), c0_h => GND, pp_h => pp15_1v2h(1),
        sum_h => sum_1_1v2h(i), ca1_h => car_1_1v2h(i));
      ppfapp_2: ppfapp port map(
        bth => bth_2_1v2h, x1_h => myx_1v2h(i+1), x2_h => myx_1v2h(i),
        s0_h => ff_1v2h(2), c0_h => GND, pp_h => pp15_1v2h(2),
        sum_h => sum_2_1v2h(i), ca1_h => car_2_1v2h(i));
      ppfapp_3: ppfapp port map(
        bth => bth_3_1v2h, x1_h => myx_1v2h(i+1), x2_h => myx_1v2h(i),
        s0_h => ff_1v2h(3), c0_h => GND, pp_h => pp15_1v2h(3),
        sum_h => sum_3_1v2h(i), ca1_h => car_3_1v2h(i));

      p16_sum1 : gdffr_fall port map ( Q=> sum_3_2v1h(i), D=> sum_3_1v2h(i),
          Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
      p16_car1 : gdffr_fall port map ( Q=> car_3_2v1h(i), D=> car_3_1v2h(i),
          Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
      -- don't use myx_1v2h, use tempx from before the tristate
      p16_x1 : gdffr_fall port map ( Q=> myx_2v1h(i), D=> tempx(i),
          Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

      ppfapp_4: ppfapp port map(
        bth => bth_4_2v1h, x1_h => myx_2v1h(i+1), x2_h => myx_2v1h(i),
        s0_h => ff_2v1h(4), c0_h => GND, pp_h => pp15_2v1h(4),
        sum_h => sum_4_2v1h(i), ca1_h => car_4_2v1h(i));
      ppfapp_5: ppfapp port map(
        bth => bth_5_2v1h, x1_h => myx_2v1h(i+1), x2_h => myx_2v1h(i),
        s0_h => ff_2v1h(5), c0_h => GND, pp_h => pp15_2v1h(5),
        sum_h => sum_5_2v1h(i), ca1_h => car_5_2v1h(i));
      ppfapp_6: ppfapp port map(
        bth => bth_6_2v1h, x1_h => myx_2v1h(i+1), x2_h => myx_2v1h(i),
        s0_h => ff_2v1h(6), c0_h => GND, pp_h => pp15_2v1h(6),
        sum_h => sum_6_2v1h(i), ca1_h => car_6_2v1h(i));
      ppfapp_7: ppfapp port map(
        bth => bth_7_2v1h, x1_h => myx_2v1h(i+1), x2_h => myx_2v1h(i),
        s0_h => ff_2v1h(7), c0_h => GND, pp_h => pp15_2v1h(7),
        sum_h => sum_7_2v1h(i), ca1_h => car_7_2v1h(i));

      p16_sum2 : gdffr_fall port map ( Q=> sum_7_3v2h(i), D=> sum_7_2v1h(i),
          Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
      p16_car2 : gdffr_fall port map ( Q=> car_7_3v2h(i), D=> car_7_2v1h(i),
          Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
    end generate G16;
  end generate;

  -- need to latch bit 17 of "myx", since that is not in the generates above
  glatch_x17 : gdffr_fall port map ( Q=> myx_2v1h(17), D=> myx_1v2h(17),
      Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

  -- need bit 16 (=x_in(15)) in 3rd pipeline stage for overflow
  glatch_x15 : gdffr_fall port map ( Q=> myx_3v2h, D=> myx_2v1h(16),
      Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

  -- These are the tri-state latched outputs going to the adder
  gsum_0_2: gdffrN_fall generic map (2)
    port map ( Q=> sum_0_2v1h, D=> sum_0_1v2h(1 downto 0),
               Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
  gsum_1_2: gdffrN_fall generic map (2)
    port map ( Q=> sum_1_2v1h, D=> sum_1_1v2h(1 downto 0),
               Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
  gsum_2_2: gdffrN_fall generic map (2)
    port map ( Q=> sum_2_2v1h, D=> sum_2_1v2h(1 downto 0),
               Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

  gca1_0_2: gdffr_fall port map ( Q=> car_0_2v1h, D=> car_0_1v2h(0),
      Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
  gca1_1_2: gdffr_fall port map ( Q=> car_1_2v1h, D=> car_1_1v2h(0),
      Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);
  gca1_2_2: gdffr_fall port map ( Q=> car_2_2v1h, D=> car_2_1v2h(0),
      Clk=> me_1q2h, Enable=> me_2q1h, Rst=> Rst_q2h);

  gsum_4_3: gdffrN_fall generic map (2)
    port map ( Q=> sum_4_3v2h, D=> sum_4_2v1h(1 downto 0),
               Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
  gsum_5_3: gdffrN_fall generic map (2)
    port map ( Q=> sum_5_3v2h, D=> sum_5_2v1h(1 downto 0),
               Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
  gsum_6_3: gdffrN_fall generic map (2)
    port map ( Q=> sum_6_3v2h, D=> sum_6_2v1h(1 downto 0),
               Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);

  gca1_4_3: gdffr_fall port map ( Q=> car_4_3v2h, D=> car_4_2v1h(0),
      Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
  gca1_5_3: gdffr_fall port map ( Q=> car_5_3v2h, D=> car_5_2v1h(0),
      Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
  gca1_6_3: gdffr_fall port map ( Q=> car_6_3v2h, D=> car_6_2v1h(0),
      Clk=> me_2q1h, Enable=> me_3q2h, Rst=> Rst_q2h);
  ---------------------------------------------------------------
  -- 5) The full adder array (down right side)
  --    This is an 8-bit adder in phi-1
  ---------------------------------------------------------------
  -- Construct fa_a_h and fa_b_h
  -- fa_a_h and fa_b_h repeat in groups of 2 from (1) to (14)
  fa_a_2v1h(0) <= add_2v1h(0);
  fa_b_2v1h(0) <= sum_0_2v1h(0);

  fa_a_2v1h(1) <= car_0_2v1h;
  fa_a_2v1h(2) <= add_2v1h(1);
  fa_a_2v1h(3) <= car_1_2v1h;
  fa_a_2v1h(4) <= add_2v1h(2);
  fa_a_2v1h(5) <= car_2_2v1h;
  fa_a_2v1h(6) <= add_2v1h(3);
  fa_a_2v1h(7) <= car_3_2v1h(0);

  fa_b_2v1h(1) <= sum_0_2v1h(1);
  fa_b_2v1h(2) <= sum_1_2v1h(0);
  fa_b_2v1h(3) <= sum_1_2v1h(1);
  fa_b_2v1h(4) <= sum_2_2v1h(0);
  fa_b_2v1h(5) <= sum_2_2v1h(1);
  fa_b_2v1h(6) <= sum_3_2v1h(0);
  fa_b_2v1h(7) <= sum_3_2v1h(1);

  -- unsigned addition
  fa_2 : uplusN generic map(8)
    port map( a_h => fa_a_2v1h, b_h => fa_b_2v1h, c_h => GND,
              sum_h => z_2v1h(7 downto 0), car_h => fa_car_2v1h);

  -----------------------------
  -- Go from Stage 2 to Stage 3
  -----------------------------
  gfa_3 : dffrN_fall generic map(8)
    port map ( Q=> comp_out_3v2h(7 downto 0), D=> z_2v1h(7 downto 0),
               Clk=> me_2q1h, Rst=> Rst_q2h);
  gca_3 : dffr_fall
    port map ( Q=> fa1_car_3v2h, D=> fa_car_2v1h,
               Clk=> me_2q1h, Rst=> Rst_q2h);

  fa_a_3v2h(8) <= add_3v2h(4);
  fa_a_3v2h(9)  <= car_4_3v2h;
  fa_a_3v2h(10) <= add_3v2h(5);
  fa_a_3v2h(11) <= car_5_3v2h;
  fa_a_3v2h(12) <= add_3v2h(6);
  fa_a_3v2h(13) <= car_6_3v2h;
  fa_a_3v2h(14) <= add_3v2h(7);
  fa_a_3v2h(15) <= GND;

  fa_b_3v2h(8)  <= sum_4_3v2h(0);
  fa_b_3v2h(9)  <= sum_4_3v2h(1);
  fa_b_3v2h(10) <= sum_5_3v2h(0);
  fa_b_3v2h(11) <= sum_5_3v2h(1);
  fa_b_3v2h(12) <= sum_6_3v2h(0);
  fa_b_3v2h(13) <= sum_6_3v2h(1);
  fa_b_3v2h(14) <= sum_7_3v2h(0);

  -- The first of 17 FAs to accumulate W. The outputs of the
  -- FAs feed fa_a and fa_b which then go to the CLA
  fa_15 : adder port map (
    a_h => car_7_3v2h(0), b_h => sum_7_3v2h(1), c_h => w_3v2h(15),
    sum_h => fa_b_3v2h(15), car_h => fa_a_3v2h(16));

  comp_b_3v2h(15) <= pp_3v2h(8);
  comp_b_3v2h(14 downto 0) <= sum_7_3v2h(16 downto 2);

  fagen: for i in 0 to 15 generate
    fa_x : adder port map(
      a_h => car_7_3v2h(i+1), b_h => comp_b_3v2h(i), c_h => w_3v2h(i+16),
      sum_h => fa_b_3v2h(i+16), car_h => fa_a_3v2h(i+17) );
  end generate;

  a_v2h <= fa_a_3v2h(31 downto 8);
  b_v2h <= fa_b_3v2h;
  c_v2h <= fa1_car_3v2h;
  z_v2h <= comp_out_3v2h(7 downto 0);

  ---------------------------------------------------------------
  -- 8) The overflow from signed addition
  --    An overflow is defined when
  --    1) x*y > 0 and w > 0 and z < 0 or
  --    2) x*y < 0 and w < 0 and z > 0
  --
  -- No longer computed here, since we do not have z[31]
  ---------------------------------------------------------------
  ovrflw_v2h(2) <= w_3v2h(31);
  ovrflw_v2h(1) <= myx_3v2h;
  ovrflw_v2h(0) <= myy_3v2h;

  ---------------------------------------------------------------
  --monitor ( w_3v2h(31), "w_3v2h(31) ", '1');
  --monitor ( myx_3v2h, "myx_3v2h ", '1');
  --monitor ( myy_3v2h, "myy_3v2h ", '1');
  --monitor ( myy_2s1h, "myy_2s1h ", '1');
  --monitor ( y_s2h, "y_s2h ", '1');
end rtl;
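The overflow rule stated in the comment above can be checked with a small software reference model. The following Python sketch is not part of the VHDL design. The function names `to_signed` and `mac_overflow` are hypothetical, and it assumes that the sign of x*y is taken as x(15) xor y(15), which is what the three bits forwarded in ovrflw_v2h (w(31), x(15), y(15)) suggest. The final combination with z(31) happens elsewhere in the design and is only modelled here.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac_overflow(x, y, w):
    """Model z = x*y + w for signed 16-bit x, y and signed 32-bit w.

    Returns (z, overflow), where z is the wrapped 32-bit result and
    overflow follows the rule in the VHDL comment, using sign bits only.
    """
    z = to_signed(x * y + w, 32)
    prod_neg = (x < 0) != (y < 0)   # assumed: x(15) xor y(15)
    w_neg = w < 0                   # w(31)
    z_neg = z < 0                   # z(31)
    overflow = ((not prod_neg) and (not w_neg) and z_neg) or \
               (prod_neg and w_neg and (not z_neg))
    return z, bool(overflow)
```

For example, `mac_overflow(-32768, -32768, 2**31 - 1)` reports an overflow, because a positive product added to a positive accumulator wraps to a negative z.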
multreg.vhd
-----------------------------------------------------------------------
-- N-bit multiplier register
-- This is a phi-2 device.
-- It will add delay to synchronize on Phi 2
-- It is NON-TRANSPARENT
-- Control lines must be S2
-- Input lines are V2
-- Outputs are S2
--
-- BusOUT_S2H outputs to the higher level which does the tristate
-- BusIN_S2H is the input
-- MultOUT_S2H is an output from the register to the multiplier.
--    it is the same as the output to BusIO, but not gated.
-- MultIN_V2H is the input from the multiplier
-- SEL_S2H is a chip select, active HIGH
-- BS_S2H is the input select (bus high, mult low)
-- RW_S2H is the Read/Write select (read high, write low)
--    READ = Read from Bus or Multiplier (based on BS_S2H)
--       Read from BUS requires (BS, RW) = high, SEL = low
--       Read from MULT requires BS=low, RW=high, MEDLY=high
--    WRITE = Write to Bus (MultOut is always on)
--       if RW_S2H high, then BusIO_V1S2H driven Z
-- MEDLY_S2H is the Multiplier Enable Delay line to clock in the
--    result. MEDLY is non-chip register specific.
-- Rst_S2H is a reset signal. It is clocked with PHI_2 to ensure
--    that it does not muck with stuff when it is not supposed to
--    Reset DOES NOT require SEL_S2H active
--    Reset is immediate. There is no 1 cycle delay
-----------------------------------------------------------------------
library IEEE,adk;
use IEEE.std_logic_1164.all;

entity multregN is
  generic(N : positive );
  port ( BusOUT_S2H  : out std_ulogic_vector(N-1 downto 0);
         MultOUT_S2H : out std_ulogic_vector(N-1 downto 0);
         BusIN_S2H   : in std_ulogic_vector(N-1 downto 0);
         MultIN_V2H  : in std_ulogic_vector(N-1 downto 0);
         SEL_S2H     : in std_ulogic;
         BS_S2H      : in std_ulogic;
         RW_S2H      : in std_ulogic;
         MEDLY_S2H   : in std_ulogic;
         Rst_S2H     : in std_ulogic;
         PHI_1H      : in std_ulogic;
         PHI_2H      : in std_ulogic);
end multregN;

architecture structural of multregN is

  --component inv is
  --  port ( Q : out std_ulogic;
  --         D : in std_ulogic);
  --end component;

  component dffrN_fall is
    generic(N : positive );
    port ( Rst : in std_ulogic;
           Clk : in std_ulogic;
           signal D : in std_ulogic_vector(N-1 downto 0);
           signal Q : out std_ulogic_vector(N-1 downto 0));
  end component;

  component driverN is
    generic(N : positive );
    port ( Q : out std_ulogic_vector(N-1 downto 0);
           D : in std_ulogic_vector(N-1 downto 0));
  end component;

  component buf is
    port ( signal Q : out std_ulogic;
           signal D : in std_ulogic);
  end component;

  -- Input data selected from MUX between BUSIO and MULTIN
  signal W2_S2H : std_ulogic_vector(N-1 downto 0);

  -- Output of dff_0
  signal STORE_S1H : std_ulogic_vector(N-1 downto 0);

  -- The phi-2 output before any output drivers (output of dff_1)
  signal STORE_S2H : std_ulogic_vector(N-1 downto 0);

  -- Read enable or write enable signals
  signal RdEn_S2H : std_ulogic;
  signal WrEn_S2H : std_ulogic;

  -- Internal "chip select" signal derived from MUX between
  -- MEDLY and RW/SEL. Since it clocks a register, it must be Q.
  -- The S2 version is an intermediate value.
  signal CSEL_S2H : std_ulogic;
  signal CSEL_Q2H : std_ulogic;

  -- We require double rail for RW_S2
  signal RW_S2L : std_ulogic;

  -- A reset line. We clock it with PHI2.
  signal RST_A_S2H : std_ulogic;
  signal RST_B_S2H : std_ulogic;
  signal RST_A_Q2H : std_ulogic;
  signal RST_B_Q2H : std_ulogic;

  -- Enable verbose debugging (1 = debug)
  signal DBG_EN : std_ulogic := '0';

  -- tap reads the bus and feed feeds the bus
  signal tap  : std_ulogic_vector(N-1 downto 0);
  signal feed : std_ulogic_vector(N-1 downto 0);

  component buf16 is
    port ( A : in STD_LOGIC;
           Y : out STD_LOGIC );
  end component;

  signal w6_h       : std_ulogic;
  signal my_phi_2h  : std_ulogic;
  signal my_rst_s2h : std_ulogic;

begin

  -- These do not synthesize well
  --inv_0: inv port map (Q=> RW_S2L, D=> RW_S2H);
  RW_S2L <= not RW_S2H;

  -- Determine the reset (requires select and PHI2)
  -- No longer require SEL for the RST signal
  my_phi_2h <= phi_2h;
  buf1: buf port map (D=> rst_s2h, Q=> RST_A_S2H);
  buf2: buf port map (D=> rst_s2h, Q=> RST_B_S2H);
  RST_A_Q2H<= my_Phi_2h and RST_A_S2H;
  RST_B_Q2H<= my_Phi_2h and RST_B_S2H;

  RdEn_S2H <= SEL_S2H and RW_S2H;
  csel_s2h <= RdEn_S2H when (BS_S2H = '1') else MEDLY_S2H;
  CSEL_Q2H <= PHI_2H and CSEL_S2H;

  -- Select the register input
  w2_s2h <= BusIN_S2H when (BS_S2H = '1') else MultIn_V2H;

  -- The storage cells. There are two registers since the input
  -- and outputs are expected to be PHI2
  -- dff_0 is the actual storage cell
  dff_0 : dffrN_fall generic map(N)
    port map ( Rst => RST_A_Q2H,
               Clk => CSEL_Q2H,
               D   => W2_S2H,
               Q   => STORE_S2H);

  MultOut_S2H <= STORE_S2H;
  BusOut_S2H <= STORE_S2H;

end structural;
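The select logic of multregN is easier to follow in a phase-level behavioral model. The Python sketch below is a hypothetical reference model, not the author's code. The class name `MultReg` and its `step` method are illustrative, each call stands for one PHI_2 phase in which the gated clock CSEL_Q2H may fire, and it assumes that the reset clears the register to zero and takes priority (the dffrN_fall source is not shown in this section).

```python
class MultReg:
    """Phase-level model of multregN: one step() call per PHI_2 phase."""

    def __init__(self, n=16):
        self.mask = (1 << n) - 1
        self.store = 0  # models STORE_S2H

    def step(self, bus_in, mult_in, sel, bs, rw, medly, rst):
        """Apply the control lines and return the stored value.

        The stored value drives both MultOUT_S2H and BusOUT_S2H.
        """
        if rst:                      # assumed: reset clears to zero, has priority
            self.store = 0
            return self.store
        rd_en = bool(sel) and bool(rw)        # RdEn_S2H <= SEL_S2H and RW_S2H
        csel = rd_en if bs else bool(medly)   # csel_s2h mux on BS_S2H
        if csel:                              # register is clocked only when selected
            self.store = (bus_in if bs else mult_in) & self.mask
        return self.store
```

With BS high the register loads from the bus only when SEL and RW are both high. With BS low it loads the multiplier result whenever MEDLY is high, independent of SEL.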
pp.vhd
-----------------------------------------------------------------------
-- PPMUX cell from "Low-power Digital VLSI Design" by
-- Bellaouar and Elmasry.
-----------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use work.bth_types.all;

entity ppmux is
  -- bth is booth encoding (5 line)
  -- x1 is the current x1 column
  -- x2 is the previous x2 column
  -- pp is the partial product output
  port ( bth  : in std_ulogic_vector(4 downto 0);
         x1_h : in std_ulogic;
         x2_h : in std_ulogic;
         pp_h : out std_ulogic);
end ppmux;

architecture rtl of ppmux is
begin
  pp_h <= x1_h     when (bth(bth_p1) = '1') else
          not x1_h when (bth(bth_m1) = '1') else
          x2_h     when (bth(bth_p2) = '1') else
          not x2_h when (bth(bth_m2) = '1') else
          '0'      when (bth(bth_z0) = '1') else
          'X';
end rtl;

-----------------------------------------------------------------------
-- PPFA cell from "Low-power Digital VLSI Design" by
-- Bellaouar and Elmasry.
-----------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;

entity ppfa is
  -- bth is booth encoding (5 line)
  -- x1 is the current x1 column
  -- x2 is the previous x2 column
  -- s0 is the sum in
  -- c0 is the carry in
  -- pp is the partial-product output from the multiplexor
  -- sum_h is the x1 output (sum out)
  -- ca1_h is the x2 output (carry out)
  port ( bth   : in std_ulogic_vector(4 downto 0);
         x1_h  : in std_ulogic;
         x2_h  : in std_ulogic;
         s0_h  : in std_ulogic;
         c0_h  : in std_ulogic;
         sum_h : out std_ulogic;
         ca1_h : out std_ulogic);
end ppfa;

-- description of adder using concurrent signal assignments
architecture rtl of ppfa is
  component ppmux
    port ( bth  : in std_ulogic_vector(4 downto 0);
           x1_h : in std_ulogic;
           x2_h : in std_ulogic;
           pp_h : out std_ulogic);
  end component;

  component adder
    port ( a_h   : in std_ulogic;
           b_h   : in std_ulogic;
           c_h   : in std_ulogic;
           sum_h : out std_ulogic;
           car_h : out std_ulogic);
  end component;

  signal pp_temp : std_ulogic;
begin
  mux : ppmux port map( bth  => bth,
                        x1_h => x1_h,
                        x2_h => x2_h,
                        pp_h => pp_temp);

  fa : adder port map( a_h  => pp_temp,
                       b_h  => s0_h,
                       c_h  => c0_h,
                       sum_h=> sum_h,
                       car_h=> ca1_h);
end rtl;

-----------------------------------------------------------------------
-- PPFA cell from "Low-power Digital VLSI Design" by
-- Bellaouar and Elmasry.
-- Single rail design
-----------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;

entity ppfapp is
  -- bth is booth encoding (5 line)
  -- x1 is the current x1 column
  -- x2 is the previous x2 column
  -- s0 is the sum in
  -- c0 is the carry in
  -- pp is the partial-product output from the multiplexor
  -- sum is the x1 output (sum out)
  -- ca1 is the x2 output (carry out)
  port ( bth   : in std_ulogic_vector(4 downto 0);
         x1_h  : in std_ulogic;
         x2_h  : in std_ulogic;
         s0_h  : in std_ulogic;
         c0_h  : in std_ulogic;
         pp_h  : out std_ulogic;
         sum_h : out std_ulogic;
         ca1_h : out std_ulogic);
end ppfapp;

architecture rtl of ppfapp is
  component ppmux
    port ( bth  : in std_ulogic_vector(4 downto 0);
           x1_h : in std_ulogic;
           x2_h : in std_ulogic;
           pp_h : out std_ulogic);
  end component;

  component adder
    port ( a_h   : in std_ulogic;
           b_h   : in std_ulogic;
           c_h   : in std_ulogic;
           sum_h : out std_ulogic;
           car_h : out std_ulogic);
  end component;

  signal pp_temp : std_ulogic;
begin
  mux : ppmux port map( bth  => bth,
                        x1_h => x1_h,
                        x2_h => x2_h,
                        pp_h => pp_temp);

  pp_h <= pp_temp;

  fa : adder port map( a_h  => pp_temp,
                       b_h  => s0_h,
                       c_h  => c0_h,
                       sum_h=> sum_h,
                       car_h=> ca1_h);
end rtl;
Layout and Schematics

Overall Floorplan
The top row is invtop. The columns from left to right are bth, bthbuf, mcell array, addcell, rwire, and csa_N. Along the bottom, the blue is the W[31:15] wires. The blue overlaps the component bwire. The next row is the fa_tg2 array for the W accumulate. The bottom row is two instances of csa_8. Between fa_tg2 and csa_8 is a row of broute, which is a wire routing cell. The left M3 is GND and the right M3 is Vdd. These have inter-locking teeth over the multiplier with M3 contact staples to vertical power distribution. The cell size is 3620 x 4106 (including the large M3 wires).

Figure 6. Multiplier Floorplan
Addcell
The addcell is made up of two rows of NMOS transistors followed by two inverters. The NMOS rows make up a pass-transistor mux to compute add_h and add_l in parallel. The pass transistors are 4 and the inverters are 8/8. Below the inverters are two 3 PMOS gates for swing-restore. Since RSIM could not simulate the swing restore, the transistors are disconnected.

Along the bottom of the cell are some M1 connections from the output of the mcell array to the right side adders. One can also see the M3contact on the GND and Vdd vertical M2 lines. The cell size is 346 x 80.
[Schematic: addcell. Labeled signals: bth_p1, bth_p2, bth_z0, bth_m1, bth_m2, add_h, add_l. Labeled device sizes: two 8/8 inverters.]
Booth Encoder
The Booth Encoder serves two functions. First, it generates Zn_H and Zn_L from Yn_H for all three Y inputs. The subcomponent invchain generates the outputs with four inverters (see invchain later). The signals bth_z0, bth_p1, and bth_m1 come from a 6-input OR/AND circuit, shown below. The signals bth_p2 and bth_m2 come from a 3-input NOR, also shown below. The table below summarizes the inputs.

Signal   A     B     C     D     E     F
bth_p1   Z2_H  Z1_H  Z0_L  Z2_H  Z1_L  Z0_H
bth_p2   Z2_H  Z1_L  Z0_L
bth_z0   Z2_H  Z1_H  Z0_H  Z2_L  Z1_L  Z0_L
bth_m1   Z2_L  Z1_H  Z0_L  Z2_L  Z1_L  Z0_H
bth_m2   Z2_L  Z1_H  Z0_H

The second function of the cell bth is to compute the sign extension (called sgn in the VHDL code). The pp output comes from an AND and the ff output from an XOR, both done in pass-logic with restoring inverters. This logic is in the upper right corner of the cell. We found that the output inverters for the ff output were too small, so we increased their size. They drive 0.139 pF. We made them a 2-stage 4/6-12/18 chain. This improved performance by 15% overall. The cell size is 346 x 190.
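The decoding summarized in the table can be sketched behaviorally in VHDL. This is a minimal sketch and not the layout's bth cell: the entity name booth_enc_sketch, its port names, and the use of five separate one-hot outputs (rather than the 5-bit bth vector indexed by the bth_p1 ... bth_m2 constants) are assumptions made for illustration. It also assumes that each table row is active when every signal in one of its groups (A, B, C or D, E, F) is low, and that Zn_H carries the same value as Yn_H. Under those assumptions the table reduces to the standard radix-4 Booth recoding, with z(2) as the most significant bit of the triplet.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical behavioral model of the radix-4 Booth encoding.
-- Names and port shapes are illustrative, not taken from the report.
entity booth_enc_sketch is
  port (
    z      : in  std_ulogic_vector(2 downto 0);  -- z(2) is the MSB of the triplet
    bth_p1 : out std_ulogic;                     -- select +1 * X
    bth_p2 : out std_ulogic;                     -- select +2 * X
    bth_z0 : out std_ulogic;                     -- select  0
    bth_m1 : out std_ulogic;                     -- select -1 * X
    bth_m2 : out std_ulogic);                    -- select -2 * X
end booth_enc_sketch;

architecture rtl of booth_enc_sketch is
begin
  -- Two triplets map to +1, 0, and -1 (the 6-input rows of the table);
  -- one triplet maps to each of +2 and -2 (the 3-input rows).
  bth_p1 <= '1' when (z = "001" or z = "010") else '0';
  bth_p2 <= '1' when (z = "011") else '0';
  bth_z0 <= '1' when (z = "000" or z = "111") else '0';
  bth_m1 <= '1' when (z = "101" or z = "110") else '0';
  bth_m2 <= '1' when (z = "100") else '0';
end rtl;
```

Exactly one output is high for any fully defined triplet, which is the one-hot property the ppmux selection relies on.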
[Schematic: Booth encoder sign-extension logic. Labeled signals: ppx_h, ppx_l, ff_in_h, ff_in_l, pp_in_h, pp_in_l, ff_out_h, ff_out_l, pp_out_h, pp_out_l. Labeled device sizes: 4, 6, 18, 6/9, 8/8, 12/20.]
Booth Buffer
This cell is a set of five inverter chains in-line with the one-hot Booth control signals. The fourth inverter should have been 59/98, not 42/98. We only noticed the size error when writing this document. The 0.6 pF load is from drain capacitance from the 17 ppfa cells and 1 addcell. The M1 wiring along the bottom carries the pp_out and ff_out signals from the Booth encoder sign extension to the mcell array. The two M1 wires between the inverters carry the pp_in signal from the ppmux to the input of the sign extender. The cell size is 346 x 151.

[Schematic: Booth buffer inverter chain. IN_H, inverters 3/5, 8/12, 20/36, 42/98, OUT_H, load 0.600 pF.]
Csa_cond
This cell produces the conditional sum and carry outputs for all CSA cells. The inputs are A_H, A_L, B_H, and B_L. The outputs are S0_L, S1_L, C0_H, C0_L, C1_H, C1_L. The number, 0 or 1, means that the signal applies if the carry-in is, respectively, 0 or 1. The cell uses pass-logic. The original circuit in [1] did not have the inverters. RSIM would not correctly simulate the circuit, since it saw a feedback loop from the drain input to the gate. The vertical M2 is GND. Vdd for the inverters comes in the left side from the wiring channel.

The NMOS pass transistors are 12. The large size helps pass 1 and 0 with minimum resistive drop.

The cell size is 155 x 50.
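The conditional outputs can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the pass-logic cell: the entity name csa_cond_sketch and its ports are hypothetical, and it uses single-rail, active-high signals, whereas the layout is double-rail and produces the sums active-low (S0_L, S1_L).

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical single-rail model of the conditional sum/carry logic.
entity csa_cond_sketch is
  port (
    a  : in  std_ulogic;
    b  : in  std_ulogic;
    s0 : out std_ulogic;   -- sum bit if the carry-in is 0
    s1 : out std_ulogic;   -- sum bit if the carry-in is 1
    c0 : out std_ulogic;   -- carry-out if the carry-in is 0
    c1 : out std_ulogic);  -- carry-out if the carry-in is 1
end csa_cond_sketch;

architecture rtl of csa_cond_sketch is
begin
  s0 <= a xor b;
  s1 <= not (a xor b);
  c0 <= a and b;
  c1 <= a or b;
end rtl;
```

Both candidate results are ready before the carry-in arrives, so the later cells only need to select between them.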
[Schematic: csa_cond pass-logic network. Labeled signals: A_H, A_L, B_H, B_L, AX_H, AX_L, C0_H, C0_L, C1_H, C1_L, S0_L, S1_L. Labeled device sizes: 4/6.]
Figure 7. CSA_COND
csa_first
csa_first is the first cell in a CSA chain. It takes in the carry-in ci_h and ci_l and generates the c0_h, c0_l, c1_h, c1_l signals plus the first sum_h bit. The two large vertical M2 lines are Vdd and GND (left to right). The M2 lines on the right are the ci_h and ci_l signals. These do not extend to the top of the cell, since an inverter chain is laid over this cell when used in the csa_n cells (csa_2, csa_4, csa_8). Note the M3contacts over csa_cond and GND for the M3 GND contacts. The cell height is determined by the width of mcell, since this cell is positioned sideways along the bottom of the adder when used in csa_8. The cell width is arbitrary. The M1 wires that seem to go nowhere used to connect to swing-restoring PMOS latches. They were removed because of RSIM problems. The cell size is 155 x 244.
[Schematic: csa_first, built around csa_cond. Labeled signals: CI_H, CI_L, A_H, A_L, B_H, B_L, C0_H, C0_L, C1_H, C1_L, S0_L, S1_L, so_l, sum_h, CAR_H, CAR_L. Labeled device sizes: 6/6, 8, 8/8.]
Figure 8. CSA_FIRST
Csa_mid
Csa_mid is used between csa_first and csa_last in a multi-bit chain. The cell takes in ci0_h, ci0_l, ci1_h, ci1_l, ci_h, and ci_l at the top of the cell from csa_first or a previous csa_mid. Along the bottom, the cell outputs c0_h, c0_l, c1_h, c1_l and passes through the carry signals. The cell also outputs sum_h on the right side. Csa_mid has no M3 power connections, and relies on the vertical contacts from csa_first for GND and csa_last for Vdd. The M1 wires that seem to go nowhere used to connect to swing-restoring PMOS latches. They were removed because of RSIM problems. Because of problems passing 1 in the sum (bottom row), we made the last NMOS pass transistor a full transmission gate by adding a 5 PMOS gate. The cell size is 155 x 244.
[Schematic: csa_mid, built around csa_cond. Labeled signals: CI0_H, CI0_L, CI1_H, CI1_L, CI_H, CI_L, A_H, A_L, B_H, B_L, C0_H, C0_L, C1_H, C1_L, S0_L, S1_L, C_TOP_H, C_TOP_L, C_BOT_H, C_BOT_L, S_TOP_L, S_BOT_L, so_l, sum_h, CAR_H, CAR_L. Labeled device sizes: 6/6, 8, 8/8, 11, 8/5.]
Figure 9. CSA_MID
Csa_last
Csa_last terminates a CSA chain. It may connect to either csa_first (a 2-bit adder) or csa_mid (a 3 or more bit adder) along the top. Csa_last generates the final carry-out car_h, car_l at the bottom of the cell and a sum bit along the right side. There are M3contacts on the vertical Vdd line in this cell.

Similar to csa_mid, we made the last pass transistor in the sum chain a full transmission gate to help pass 1 values. The cell size is 155 x 244.
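One way to read the csa_first / csa_mid / csa_last chain is as a conditional-sum (carry-select) adder: two carry chains run in parallel, one assuming a chain carry-in of 0 and one assuming 1, and the real carry-in selects between the two precomputed results. The following is a behavioral sketch of that idea only. The entity csa_chain_sketch, the generic N, and the internal names k0, k1, s_top, and s_bot are assumptions made for illustration (s_top and s_bot loosely follow the S_TOP and S_BOT labels in the schematics); the sketch is single-rail and does not model the pass-transistor circuit.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical N-bit conditional-sum chain (behavioral, single-rail).
entity csa_chain_sketch is
  generic (N : positive := 4);
  port (
    a   : in  std_ulogic_vector(N-1 downto 0);
    b   : in  std_ulogic_vector(N-1 downto 0);
    ci  : in  std_ulogic;
    sum : out std_ulogic_vector(N-1 downto 0);
    car : out std_ulogic);
end csa_chain_sketch;

architecture rtl of csa_chain_sketch is
  -- Per-bit conditional values: x = sum if carry-in 0,
  -- g = carry-out if carry-in 0, t = carry-out if carry-in 1.
  signal x, g, t      : std_ulogic_vector(N-1 downto 0);
  -- k0(i), k1(i): carry into bit i assuming the chain carry-in is 0 or 1.
  signal k0, k1       : std_ulogic_vector(N downto 0);
  signal s_top, s_bot : std_ulogic_vector(N-1 downto 0);
begin
  x <= a xor b;
  g <= a and b;
  t <= a or b;

  k0(0) <= '0';
  k1(0) <= '1';

  bits : for i in 0 to N-1 generate
    -- Each conditional carry selects the next conditional carry.
    k0(i+1)  <= t(i) when k0(i) = '1' else g(i);
    k1(i+1)  <= t(i) when k1(i) = '1' else g(i);
    -- Each conditional carry selects between the two conditional sums.
    s_top(i) <= not x(i) when k0(i) = '1' else x(i);
    s_bot(i) <= not x(i) when k1(i) = '1' else x(i);
  end generate;

  -- The real carry-in picks one of the two precomputed results.
  sum <= s_bot when ci = '1' else s_top;
  car <= k1(N) when ci = '1' else k0(N);
end rtl;
```

With N = 4, a = "1011", b = "0110", and ci = '1', the sketch produces sum = "0010" and car = '1' (11 + 6 + 1 = 18).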
[Schematic: csa_last, built around csa_cond. Labeled signals: CI0_H, CI0_L, CI1_H, CI1_L, CI_H, CI_L, A_H, A_L, B_H, B_L, C0_H, C0_L, C1_H, C1_L, S0_L, S1_L, C_TOP_H, C_TOP_L, C_BOT_H, C_BOT_L, S_TOP_L, S_BOT_L, sum_h, CAR_H, CAR_L. Labeled device sizes: 3/3, 6/6, 8, 8/8, 11, 8/5.]
Figure 10. CSA_LAST
Csa_2
Csa_2 is a 2-bit CSA adder. The input inverter drives the carry-in load. The cell size is 346 x 247.

[Schematic: csa_2 carry-in drivers. CI_H, inverters 3/5 and 9/15, CX_H, load 0.081 pF; CI_L, inverters 3/5 and 9/15, CX_L, load 0.081 pF.]
Csa_4
This is a 4-bit adder for the right side of the multiplier. The input inverter chain drives the carry-in load. We used 0.133 pF to calculate the inverter sizes, but measurements for different inputs will result in different capacitive loads. We have measured up to 0.144 pF. The cell size is 692 x 247.

[Schematic: csa_4 carry-in drivers. CI_H, inverters 4/6 and 11/19, CX_H, load 0.133 pF; CI_L, inverters 4/6 and 11/19, CX_L, load 0.133 pF.]
Csa_4b
This is a 4-bit adder designed for the bottom of the multiplier array. The spacing is narrower than in the csa_4 component. The capacitive load is slightly different from that of csa_4. The cell size is 620 x 247.

[Schematic: csa_4b carry-in drivers. CI_H, inverters 4/6 and 11/19, CX_H, load 0.133 pF; CI_L, inverters 4/6 and 11/19, CX_L, load 0.133 pF.]
Csa_8
This is the 8-bit adder designed for the bottom of the multiplier array. As with the other CSA components, the input inverter chain drives the carry-in capacitive load. The cell size is 1240 x 248.

[Schematic: csa_8 carry-in drivers. CI_H, inverters 4/6 and 16/25, CX_H, load 0.243 pF; CI_L, inverters 4/6 and 16/25, CX_L, load 0.243 pF.]
Fa_tg
This full adder, based on [3], is in the body of the mcell. It is a CPL and TG pass-logic hybrid that computes the sum and carry in parallel using double-rail signaling. There are output inverters for each of so_h, so_l, co_h, co_l, the sum and carry signals. The inverter sizes grew and grew as we debugged the design because they must drive the pass-logic of the CSA adder. We eventually put an additional inverter driver in the cell rwire to fix the drive problem. We did not have time to go back and resize these transistors to an optimal point. We believe they are about 25% to 50% too big.

Departing from [3], we added input inverters to the b_h and b_l signals for each of the sum and carry. In the pass logic, the B signals connect to both drains and gates. RSIM did not like this. The cell size is 200 x 155.
[Schematic: fa_tg, annotated "Computes SI + B = SO + CO." Labeled signals: SI_H, SI_L, B_H, B_L, BX_H, BX_L, BY_H, BY_L, Q_H, Q_L, CI_H, CI_L, SO_H, SO_L, CO_H, CO_L. Labeled nodes: J1, J3, J4, J6. Labeled device sizes: 10/13, 9/11, 7/10, 10/15, 5/8.]

Figure 11. FA_TG
Fa_tg2
This cell is essentially the same as fa_tg. The main difference is that the left-side top inputs are no longer carried through to the bottom of the cell. The old X inputs are now a W input. Since W is single-ended, we needed to add a driver chain of 4 inverters to create the differential signal.

For 2BA1 x 4FE3 + 40D9E5F6, we found that the signal SO_L in csa_4b_1/csa_mid_1 was unknown. This signal passes through 3 NMOS transistors. The source was the SO_L from fa_tg2_0[22] for a 0 value. We widened the NMOS of the output inverters from 7 to 9. This fixed the problem. Since this is essentially the same circuit as in fa_tg, we also widened the NMOS of all those output inverters. This had the side benefit of improving performance too.
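When chasing an unknown value like this one, it helps to have a reference result for the test vector. The following is a minimal behavioral sketch of a 16 x 16 multiply with 32-bit accumulate, assuming signed two's-complement operands and wrap-around on overflow; the entity name mac_ref_sketch and its ports are hypothetical and are not part of the report's VHDL hierarchy.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Hypothetical reference model: r = x * y + w, all signed.
entity mac_ref_sketch is
  port (
    x : in  signed(15 downto 0);
    y : in  signed(15 downto 0);
    w : in  signed(31 downto 0);
    r : out signed(31 downto 0));
end mac_ref_sketch;

architecture behav of mac_ref_sketch is
begin
  -- x * y is 32 bits wide in numeric_std; the addition wraps modulo 2**32.
  r <= x * y + w;
end behav;
```

For the vector above (x = 2BA1, y = 4FE3, w = 40D9E5F6, all hexadecimal), this model gives r = 4E7744B9.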
Invchain
A single-rail to double-rail inverter chain. Unfortunately, not much thought went into the inverter sizes. They should be recomputed for equal-delay timing. The cell size is 48 x 45.

[Schematic: invchain. Labeled signals: IN_H, OUT_H, OUT_L. Labeled inverter sizes: 4/8, 8/17, 4/8, 6/13.]
Invtop
These are the single-rail to double-rail input drivers for X[15:0] and W[14:0]. The X inputs must drive a very large load of 0.400 pF to 0.450 pF.

[Schematic: invtop. W driver: WIN_H, WO_H, WO_L, inverter sizes 4/8, 8/17, 4/8, 8/13. X driver: XIN_H, XO_H, XO_L, inverter sizes 3/5, 8/12, 21/32, 56/80, 10/15, 32/48, loads 0.450 pF and 0.400 pF.]
Figure 12. INVTOP
Mcell
Mcell is the array multiplier cell. The three sub-components, ppmux, fa_tg, and wroute are described elsewhere. The cell size is 346 x 155 .
Ppmux
This cell selects the proper X input based on a row's Booth encoding [3]. The x1 and x2 X values come from the vertical X_H, X_L signals along the left. The x-1 and x-2 signals come from a right-hand-side input.

All NMOS gates are 8. The cell size is 105 x 155.
[Schematic: ppmux. Labeled signals: Xi_H, Xi_L, Xi-1_H, Xi-1_L, bth_p1, bth_p2, bth_z0, bth_m1, bth_m2, pp_out_h, pp_out_l, OUT_H, OUT_L.]

Figure 13. PPMUX
Ppmuxfa
This is a sub-component of mcell.
Rwire
This used to be a passive wiring channel. We had persistent problems with intermittent U values out of several CSA cells on the right-hand side of the multiplier. We tried increasing the output inverters of fa_tg, but the problems persisted. We finally put some large inverter drivers in rwire, which abuts the CSA adder. Having the inverter close to the CSA cell solved the intermittent U problem. Since each row produces two bits, there are four inverters (double rail). The T_H, T_L inputs enter from the top and the B_H, B_L inputs enter from the left. The inverter size is arbitrary and large. We ran out of time, so we just made them the smallest square inverter. The cell size is 346 x 71.

[Schematic: rwire. Inputs T_H, T_L, B_H, B_L, each through a 24/24 inverter. Labeled outputs: BX_L, BX_H, B2_H, B2_L.]
Wiring cells (passive)