ECEN 4233 – Goldschmidt Algorithm

ECEN 4233 Implementation of Goldschmidt's Algorithm for 16-Bit Division and Square Root
Cale Spratt and Jeremy Storm, Oklahoma State University
Abstract: Division and square root are common functions in digital systems such as microprocessors. Microprocessors with dedicated hardware for division and square root can perform these functions in fewer clock cycles and with better power efficiency. This implementation uses a 36-bit multiplication unit to implement the Goldschmidt algorithm with enough precision to give accurate results, and it uses a two-stage pipeline to increase the efficiency of the implementation.
Index Terms: Goldschmidt's Algorithm, Carry Propagate Adder, Full Adder, Half Adder, Carry Save Array Multiplier
1 INTRODUCTION
In this implementation, the designed hardware can perform these operations:
a) Division
b) Square Root
c) Multiplication
Division and square root are common mathematical functions. Because of their importance, it is beneficial to create an implementation that runs efficiently and produces accurate, reliable results. To decrease the number of clock cycles required to perform a division or square root, the design has been pipelined, allowing two different divisions or square roots to be in progress at the same time. Implementing Goldschmidt's algorithm requires a multiplier, so with the addition of a few control lines the designed hardware can also perform multiplication, making the module more versatile.
Most applications that use division depend on the precision of the hardware's output. For our implementation we assumed a lower precision: a sign bit, an integer bit, and 14 fractional bits, for values in [1,2) and (-2,-1]. Had the design required a larger range of operands, we would have expanded the precision and modified the initial approximation scheme to keep the number of iterations effective. For a division, the approximation muxes select 0.75 or -0.75 according to the sign; otherwise a square root is assumed and the value is 0.833. It is very important to select the correct approximation to produce a proper division. The value must always fall to the left of the quotient when on an exponentially decreasing graph. Our module uses the control muxes, A and B, to select the registers or input values that are propagated through the multiplier. Specific output values are mapped to specific registers to create the pipelining process.
2 DIVISION
2.1 Goldschmidt Method
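The Goldschmidt method is an iterative method for computing the quotient Q = N/D. The numerator and denominator are multiplied by the same sequence of factors K_i, chosen so that the denominator converges to 1 and the numerator therefore converges to Q. Each iteration gives a better approximation than the last: the error shrinks quadratically, so the number of correct bits roughly doubles with each iteration. The equation for determining the quotient is
$$Q = \frac{N}{D} = \frac{N \cdot K_0 \cdot K_1 \cdots K_i}{D \cdot K_0 \cdot K_1 \cdots K_i}, \qquad K_{i+1} = 2 - D \cdot K_0 \cdot K_1 \cdots K_i$$
K_0 is the initial approximation (IA) provided by the mux unit, selected according to the mathematical operation, and K_1 = 2 - D*K_0 is obtained from it.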
The number of iterations needed is equal to (n/log_2(radix)), where n is the number of bits of precision.
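The iteration described above can be sketched in a few lines of Python. This is a floating-point illustration only, not the 16-bit hardware datapath; the function name `goldschmidt_divide` and the default of five iterations are assumptions made for the example, while the initial approximation of 0.75 is the value stated for division.

```python
def goldschmidt_divide(n, d, ia=0.75, iterations=5):
    """Approximate n / d for d in [1, 2) using Goldschmidt iteration.

    Floating-point sketch; the names and iteration count are illustrative.
    """
    q = n * ia  # N * K0, converges to the quotient
    r = d * ia  # D * K0, converges to 1
    for _ in range(iterations):
        k = 2.0 - r  # next factor: 2 - D*K0*...*Ki
        q *= k       # scale the numerator
        r *= k       # scale the denominator by the same factor
    return q


if __name__ == "__main__":
    print(goldschmidt_divide(1.923, 1.523))
```

With N = 1.923 and D = 1.523 the result should agree closely with the N/D value of 1.262639527 listed in the error analysis.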
HIGH SPEED PROJECT SPRAT, STORM
3 Hardware Modules
Multiplexor A select table:
Mux Select (3 bits): 000 001 010 011 100 101 110 111
Output (16 bits): --Regc ---D Regd IA --N
A. Initial Approximation
The initial approximation (IA) module consists of three 16-bit mux units that cascade a combination of signed approximations based on the mathematical function requested. The input values lie in [1,2) or (-2,-1]. There are only three constant approximations: 0.75, -0.75, and 0.833. The first two are used for division, according to the sign, and 0.833 is used for a square root operation. A 2-bit selection input is provided to the cascaded muxes to select the approximation for the requested operation. Area is 96 gates. Delay is 6.
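As an illustration of how these constants might be stored, the sketch below encodes a real value as a 16-bit word with one sign bit, one integer bit, and 14 fractional bits. Two's complement encoding, rounding to nearest, and the helper names `to_fixed` and `from_fixed` are assumptions made for the example; the paper does not list the stored bit patterns.

```python
FRAC_BITS = 14  # sign bit + integer bit + 14 fractional bits = 16 bits
WORD_MASK = 0xFFFF


def to_fixed(x):
    """Encode x as a 16-bit two's complement word (assumed encoding)."""
    return round(x * (1 << FRAC_BITS)) & WORD_MASK


def from_fixed(word):
    """Decode a 16-bit two's complement word back to a float."""
    if word & 0x8000:      # sign bit set: value is negative
        word -= 1 << 16
    return word / (1 << FRAC_BITS)


if __name__ == "__main__":
    for value in (0.75, -0.75, 0.833):
        word = to_fixed(value)
        print(f"{value:+.3f} -> 0x{word:04X} -> {from_fixed(word):+.6f}")
```

Under these assumptions 0.75 and -0.75 are exactly representable, while 0.833 is rounded to the nearest multiple of 2^-14.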
B. Multiplexers
Multiplexor B selects a 16-bit output value from the initial approximation, register A, and register B. This mux provides the second operand for our CSAM multiplier. The logic table for multiplexor B is listed below. Area is 96 gates. Delay is 6.
Mux Select (2 bits)   Output (16 bits)
00                    Rega
01                    Regb
10                    Rega
11                    IA
We have a total of three multiplexor sets that help implement the expected mathematical operation for each iteration of the pipeline stages. Multiplexor A selects among the 16-bit values from the initial approximation unit, the inputs N and D, and the two register units C and D. A 3-bit mux selection input determines the 16-bit output value that is propagated to our multiplier. All values passed to the multiplier through the muxes are extended by two additional bits for higher precision during multiplication. The selection table for multiplexor A lists all select-bit operations. Area is 288 gates. Delay is 9.
Multiplexor Twos propagates either a 16-bit two or a 16-bit three to the 2's complement module in order to perform either a division or a square root. The table for the logic is provided below. Area is 96 gates. Delay is 6.
Mux Select (1 bit)   Output (16 bits)
0                    16'h4000
1                    16'h6000
The multiplexor for the 2's complement logic unit has an area of 48 gates and a delay of 3, based on the layout of the unit.
C. Signed Carry Save Array Multiplier (CSAM)
The Carry Save Array Multiplier (CSAM) is pipelined so that the external register units can store the appropriate multiplication value. Our multiplier was instantiated with a 36-bit product and two 18-bit inputs from the preceding muxes. This multiplier gives one sign bit and one integer bit, with 16 bits of precision. Three 18-bit register modules with latch control were placed between the multiplication array and the Carry Propagate Adders (CPA). This is required to allow the expected pipelining capability. These registers store the carry and sum values for propagation into the CPAs. The implementation was developed to use 16-bit external registers, so all inputs and outputs have to be extended or truncated to meet the processing requirements. The two most-significant bits are always removed after the CPA sum is calculated. In other words, we perform the same function as a rounding unit would. Area is 2433 gates. Delay is 37.
D. 2's Complement Module
The complement subtractor is integrated to perform both division and square root functions. An input mux determines the complement value designated for the intended mathematical operation. For the division operation, the 2's complement logic uses a 16-bit value of 2 that is supplied as the output of the 2's selection mux. A 16-bit value of 3 is used for a square root operation. Our 2's complement module has 16-bit inputs and outputs, but they have to be modified to include one additional integer bit for the multiplier value being used. We concatenate one bit after the left-most bit of the multiplier value and one bit to the right-most bit of the subtractor value. After the subtraction is complete, the additional integer bit is removed before the output is propagated. Area is 144 gates for 16 bits. Delay is 55.
E. Total Area and Delay
The total area for the entire block is the sum of the areas of the initial approximation unit, the two input muxes, the square root/division mux, the multiplier, and the 2's complement logic unit. This area comes out to 3105 gates. The delay follows the same process and comes out to roughly 122.
3. State Table
3.1 State Table for Division
CLK mux_selecta mux_selectb mux_twos_select rega_out regb_out regc_out regd_out rega_load regb_load regc_load regd_load
6 null null 1
7 100 00 1
0 111 11 1 0 0 0 N*IA 0 0 0 0 0 8
1 111 11 1 0 2-K*IA 0 N*IA 0 D*IA 0 0 1 0
2 110 null 11 null 1 2-K*IA 0 N*IA D*IA 1 0 0 1 9 null null null null
4 100 00 1 2-D*IA
5 011 00 1
2-D*K0*K1 0 0 0 N*K0*K1 N*K0*K1 D*IA D*K0*K1 0 0 1 0 10 100 00 1 11 011 00 1 1 0 0 1
011 null 00 null 1
22-D*K0*K1 2-D*K0*K1 D*K0*K1*K2 0 0 0 N*K0*K1 N*K0*K1*K2 N*K0*K1*K2 D*K0*K1 D*K0*K1 D*K0*K1*K2 null null null null 12 null null 0 0 1 0 13 100 00 1 0 0 1
22D*K0*K1*K2 2-D*K0*K1*K2 D*K0*K1*K2*K3 0 0 0 N*K0*K1*K2 N*K0*K1*K2*K3 N*K0*K1*K2*K3 D*K0*K1*K2 D*K0*K1*K2 D*K0*K1*K2*K3 null null null null 14 11 00 0 0 1 0 1 0 0 1
AUTHOR: CALE SPRATT, JEREMY STORM
22D*K0*K1*K2*K3 2-D*K0*K1*K2*K3 D*K0*K1*K2*K3*K4 0 0 0 N*K0*K1*K2*K3 N*K0*K1*K2*K3*K4 N*K0*K1*K2*K3*K4 D*K0*K1*K2*K3 D*K0*K1*K2*K3 D*K0*K1*K2*K3*K4 null null null null
3.2 State Table for Square Root
0 0 1 0
1 0 0 1
After every three iterations, the 16-bit output from the 2s complement module will be shifted by one bit before propagating to the mux to create the division necessary for a square root.
CLK mux_selecta mux_selectb mux_twos_select rega_out regb_out regc_out regd_out rega_load regb_load regc_load regd_load
0 111 11 0 0 0 0 N*IA 0 0 0 0 0
1 111 11 0 0 3-K*IA 0 N*IA 0 D*IA 0 0 1 0
2 110 null 11 null 0 3-K*IA 0 N*IA D*IA 1 0 0 1 null null null null
4 100 00 0 3-D*IA
5 011 00 0
3-D*K0*K1 0 0 0 N*K0*K1 N*K0*K1 D*IA D*K0*K1 0 0 1 0 1 0 0 1
6 null null 0 3-D*K0*K1
7 100 00 0 3-D*K0*K1
8 011 null 00 null 0 3-D*K0*K1*K2
10 100 00 0 3-D*K0*K1*K2
11 011 00 0 3-D*K0*K1*K2*K3
3-D*K0*K1*K2
0 N*K0*K1 D*K0*K1 null null null null
0 0 0 0 0 N*K0*K1*K2 N*K0*K1*K2 N*K0*K1*K2 N*K0*K1*K2*K3 N*K0*K1*K2*K3 D*K0*K1 D*K0*K1*K2 D*K0*K1*K2 D*K0*K1*K2 D*K0*K1*K2*K3 0 0 1 0 12 13 100 00 0 1 0 0 1 null null null null 14 11 00 0 0 0 1 0 1 0 0 1
null null 0
3-D*K0*K1*K2*K3 3-D*K0*K1*K2*K3 3-D*K0*K1*K2*K3*K4 0 0 0 N*K0*K1*K2*K3 N*K0*K1*K2*K3*K4 N*K0*K1*K2*K3*K4 D*K0*K1*K2*K3 D*K0*K1*K2*K3 D*K0*K1*K2*K3*K4 null null null null
3 Error Analysis
0 0 1 0
1 0 0 1
1 ERROR ANALYSIS
1.1 Error Analysis for Division
The analysis for division and square root was performed using the Excel spreadsheets provided in the course.
1.
N = 1.923   D = 1.523   N/D = 1.262639527   IA = 0.75
q*K          r*K          2-D*Xi       TRUE         Error        #bits
1.44225      1.14225      0.85775      1.262639527  0.179610473  -2.477056622
1.237089938  0.979764938  1.020235063  1.262639527  0.02554959   -5.290556064
1.26212253   0.999590542  1.000409458  1.262639527  0.000516998  -10.91755495
1.262639316  0.999999832  1.000000168  1.262639527  2.11689E-07  -22.17155272
1.262639527  1            1            1.262639527  3.59712E-14  -44.66015
1.262639527  1            1            1.262639527  6.66134E-16  -50.4150375
2.
N = 1.923   D = 1.013   N/D = 1.898321816   IA = 0.75
q*K          r*K          2-D*Xi       TRUE         Error        #bits
1.44225      0.75975      1.24025      1.898321816  0.456071816  -1.132667075
1.788750563  0.942279938  1.057720063  1.898321816  0.109571254  -3.190058739
1.891997357  0.996668394  1.003331606  1.898321816  0.00632446   -7.304842067
1.898300746  0.9999889    1.0000111    1.898321816  2.10706E-05  -15.53440872
1.898321816  1            1            1.898321816  2.33875E-10  -31.99354105
1.898321816  1            1            1.898321816  2.22045E-16  -52
3.
N = -1.012   D = 1.9123   N/D = -0.529205669   IA = 0.75
q*K           r*K          2-D*Xi       TRUE          Error        #bits
-0.759        1.434225     0.565775     -0.529205669  0.229794331  -2.121584885
-0.429423225  0.811448649  1.188551351  -0.529205669  0.099782444  -3.32507019
-0.510391554  0.964448388  1.035551612  -0.529205669  0.018814115  -5.7320408
-0.528536796  0.998736083  1.001263917  -0.529205669  0.000668872  -10.54598202
-0.529204823  0.999998403  1.000001597  -0.529205669  8.45399E-07  -20.17386446
-0.529205669  1            1            -0.529205669  1.35048E-12  -39.4296699
4.
N = 1.012   D = 1.9123   N/D = 0.529205669   IA = 0.75
q*K          r*K          2-D*Xi       TRUE         Error        #bits
0.759        1.434225     0.565775     0.529205669  0.229794331  -2.121584885
0.429423225  0.811448649  1.188551351  0.529205669  0.099782444  -3.32507019
0.510391554  0.964448388  1.035551612  0.529205669  0.018814115  -5.7320408
0.528536796  0.998736083  1.001263917  0.529205669  0.000668872  -10.54598202
0.529204823  0.999998403  1.000001597  0.529205669  8.45399E-07  -20.17386446
0.529205669  1            1            0.529205669  1.35048E-12  -39.4296699
In all of the above cases, the error was smaller than 2^-16 (more than 16 bits of accuracy) after five iterations, showing that the design should be accurate to the number of bits it uses.
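The #bits column in the division tables matches the base-2 logarithm of the absolute error. A minimal Python sketch of that calculation is given below, assuming double-precision floating point in place of the spreadsheet; the function name `division_error_bits` is hypothetical.

```python
import math


def division_error_bits(n, d, ia=0.75, rows=6):
    """Return log2(|N/D - q|) for each row of the division error table.

    Floating-point sketch; the function name and row count are illustrative.
    """
    true_q = n / d
    q, r = n * ia, d * ia  # first row: N*IA and D*IA
    bits = []
    for _ in range(rows):
        err = abs(true_q - q)
        # log2(0) is undefined, so report -inf once the error vanishes
        bits.append(math.log2(err) if err > 0 else float("-inf"))
        k = 2.0 - r  # next factor
        q, r = q * k, r * k
    return bits


if __name__ == "__main__":
    for row, b in enumerate(division_error_bits(1.923, 1.523)):
        print(row, b)
```

The last rows may differ from the spreadsheet values because the error there is at the limit of double-precision resolution.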
1.2 Error Analysis for Square Root
1.
N = 1.231   D = 1.231   N/D = 1   IA = 0.853553391
q*K          r*K          (3-D*Xi)/2   TRUE         Error        #bits
1.050724224  0.896849224  1.051575388  1.109504394  0.05878017   -4.088526657
1.104915733  0.991745555  1.004127223  1.109504394  0.00458866   -7.767711238
1.109475967  0.999948757  1.000025621  1.109504394  2.84273E-05  -15.10236562
1.109504393  0.999999998  1.000000001  1.109504394  1.09252E-09  -29.76969629
1.109504394  1            1            1.109504394  2.22045E-16  -52
1.109504394  1            1            1.109504394  2.22045E-16  -52
2.
N = 1.99   D = 1.99   N/D = 1   IA = 0.853553391
q*K          r*K          (3-D*Xi)/2   TRUE         Error        #bits
1.698571247  1.449821247  0.775089376  1.410673598  0.287897649  -1.796372086
1.316544529  0.870999747  1.064500127  1.410673598  0.094129069  -3.409215861
1.401461818  0.986982526  1.006508737  1.410673598  0.00921178   -6.762304257
1.410583564  0.999872358  1.000063821  1.410673598  9.00338E-05  -13.43917395
1.410673589  0.999999988  1.000000006  1.410673598  8.61919E-09  -26.78980037
1.410673598  1            1            1.410673598  2.22045E-16  -52
3.
N = 1.01   D = 1.01   N/D = 1   IA = 0.853553391
q*K          r*K          (3-D*Xi)/2
0.862088924  0.735838924  1.132080538
0.975954093  0.943055834  1.028472083
1.003741539  0.997521859  1.00123907
1.004985246  0.99999539   1.000002305
1.004987562  1            1
1.004987562  1            1
4.
N = 1.7532   D = 1.7532   N/D = 1   IA = 0.853553391
q*K          r*K          (3-D*Xi)/2
1.496449804  1.277299804  0.861350098
1.288967185  0.947659369  1.026170316
1.322699864  0.997909496  1.001045252
1.324082418  0.99999672   1.00000164
1.324084589  1            1
1.324084589  1            1
After five iterations the error is below 2^-16 (more than 16 bits of accuracy) in all of the simulations, showing that five iterations are sufficient to produce enough accuracy.
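The square root tables follow the recurrence q <- q*K, r <- r*K^2 with K = (3 - r)/2, starting from q = N*IA and r = N*IA^2. The Python sketch below reproduces that recurrence in floating point. The function name `goldschmidt_sqrt` is an assumption made for the example, and the default initial approximation is the 0.853553391 listed in the tables.

```python
def goldschmidt_sqrt(n, ia=0.853553391, iterations=5):
    """Approximate sqrt(n) for n in [1, 2) using Goldschmidt iteration.

    Floating-point sketch; the name and iteration count are illustrative.
    """
    q = n * ia       # converges to sqrt(n)
    r = n * ia * ia  # converges to 1
    for _ in range(iterations):
        k = (3.0 - r) / 2.0  # the halving is a one-bit right shift in hardware
        q *= k
        r *= k * k
    return q


if __name__ == "__main__":
    print(goldschmidt_sqrt(1.231))
```

For N = 1.231 the result should agree closely with the converged value of 1.109504394 in the first square root table.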
4 APPENDICES
4.1 Specialized Full Adder
4.2 Carry Save Array Multiplier
4.3 Subtractor
4.4 Overall Project Design
4.5 MUXa
4. CONCLUSION
The designed hardware can be integrated into a digital system that requires division and square root functions with accuracy to within 16 bits. The module can be implemented in a small area, especially when a multiplication module already exists, since the division module can be built on top of it. The module also gives a drastic speedup for division and square root over implementing those functions in software. To further speed up processing, the design has been pipelined, increasing its possible throughput. The error analysis implies that we should retain 39 bits for a division operation and 52 fractional bits for a square root. The state tables provide the process flow with the corresponding registers, the mux select values, and the mathematical operations occurring after each iteration. The multiplier registers are internal to the module to provide the pipeline capability for the unit. We stall the module after each iteration to ensure that multiplications do not overrun each other.
