Paper-6 - Modified Montgomery Modular Multiplication For RSA Cryptosystem

International Journal of Computational Intelligence and Information Security, September 2011 Vol. 2, No. 9
Modified Montgomery Modular Multiplication for RSA Cryptosystem

Rupali Verma (1), Maitreyee Dutta (2), Renu Vig (3)
(1) PEC University of Technology, Chandigarh, India
(2) National Institute of Technical Teachers Training, Chandigarh, India
(3) University Institute of Engineering & Technology, Panjab University, Chandigarh, India
rupali@pec.ac.in, d_maitreyee@yahoo.co.in, renuvig@hotmail.com
Abstract
Montgomery modular multiplication is used in public key cryptosystems like RSA and ECC. The efficiency of these public key cryptosystems depends on the efficiency of the Montgomery modular multiplication design. This paper proposes a Modified Montgomery Modular Multiplication algorithm. The dependency of the quotient on the partial product is relaxed by making the LSB of the multiplicand zero. The quotient for the next iteration is computed as soon as the partial sum for an iteration is available. This makes the computation of the quotient simple, with a delay of 1 XOR. In our design, the critical path has two levels of carry save addition and simplified quotient determination. These two factors improve the frequency and throughput of the proposed design.
Keywords: public key cryptosystems, modular multiplication, radix 2 design, carry save adders.
1. Introduction
The use of the Internet for e-commerce has made security an important issue in e-transactions. Four basic services, namely confidentiality, authentication, data integrity and non-repudiation, have become essential. Public key cryptosystems like RSA are used for encryption and digital signatures. Encryption and decryption in RSA [1] are based on modular exponentiation, which is achieved by repeated modular multiplications and modular squarings. Hence the efficiency of the RSA cryptosystem depends on the efficiency of modular multiplication. This becomes more critical with large operands: for security reasons the operand sizes in RSA are 1024 bits or more, so efficient RSA designs in terms of area, frequency and throughput are hard to achieve, and the modular multiplication design becomes crucial in an RSA implementation.
Montgomery modular multiplication [2] is an efficient method for modular multiplication and is suitable for hardware implementation since it replaces trial division by the modulus with additions and shift operations. The price paid is the conversion of operands into and out of the Montgomery domain. This cost is negligible for the RSA cryptosystem, where several modular multiplications are done modulo one n (i.e. the same modulus). The critical path in the Montgomery design is the addition of long operands. Many designs have been proposed in the literature to reduce the carry propagation during the addition of long operands. They fall under two categories: systolic arrays and carry save adders. This paper focuses on carry save adder (CSA) architectures of the Montgomery design. McIvor et al. [7] have proposed two variants of the Montgomery design using carry save adders: a five-to-two CSA that has three levels of carry save logic, and a four-to-two CSA with precomputed values and two levels of carry save logic.
Kooroush Manochehri et al. [8] have proposed a new Montgomery design where the multiplicand is increased by one bit and the critical path is improved by simplifying quotient determination. The five-to-two CSA proposed by McIvor is implemented in [8] using that modification. Yuan Yang et al. [10] have improved the critical path by using a CSA architecture which performs the result format conversion, i.e. from redundant form to binary form, using carry save logic. This makes the architecture of the Montgomery multiplier very simple and reduces the critical path to two levels of CSA. Ming Der Shieh et al. [11] have improved the Montgomery design by reducing the number of input operands of the CSA tree and achieving a four-to-two CSA.
In this paper we propose a Modified Montgomery Modular Multiplication design which is a hybrid of two designs: the four-to-two CSA proposed by McIvor [7] and the New Montgomery design proposed by Kooroush Manochehri [8]. The multiplier and multiplicand are increased by one bit. The multiplicand is left shifted by one bit so that its LSB is zero. This makes the LSB of the partial product zero in all iterations and hence makes the quotient independent of the partial product. The quotient for the next iteration is computed as soon as the partial sum for an iteration is available. Bit k of the multiplier, i.e. its MSB, is set to zero, and the number of iterations in modular multiplication becomes k+1. The multiplier bit for the next iteration is computed in parallel with the carry save addition of long operands. The sum of three operands (the multiplicand in redundant form and the modulus) is precomputed so that there are four operands for addition. Hence the proposed design has two levels of CSA with simplified quotient determination, at the expense of an extra iteration. Our design has high frequency and improved throughput.
Section 2 discusses the Montgomery modular multiplication designs. Section 3 gives the proposed design, its key features, architecture and critical path delay. Section 4 discusses RSA exponentiation. Section 5 gives implementation results. Conclusions are in section 6.
2. Montgomery Modular Multiplication

Montgomery modular multiplication is an efficient method since it replaces trial division by the modulus with additions and shift operations, which are easily performed in hardware. The price paid is the conversion of operands into and out of the Montgomery domain. Let a and b be the multiplier and multiplicand, and n be the modulus. To compute the Montgomery modular multiplication (MMM) of a and b modulo n, the operands are first converted to the Montgomery domain, i.e. to their n-residues with respect to r, where $r = 2^k$ and k is the bit length of the operands. A is the n-residue of a with respect to r:
    $A = a \cdot r \pmod{n}$
Similarly,
    $B = b \cdot r \pmod{n}$
The n-residue of an integer with respect to r is obtained by Montgomery modular multiplication of the integer and $r^2 \bmod n$:
    $A = \mathrm{MMM}(a,\ r^2 \bmod n,\ n) = a \cdot (r^2 \bmod n) \cdot r^{-1} \bmod n = a \cdot r \pmod{n}$
The Montgomery product of A and B is defined as
    $\mathrm{Product} = \mathrm{MMM}(A, B, n) = A \cdot B \cdot r^{-1} \bmod n = (a \cdot r \bmod n) \cdot (b \cdot r \bmod n) \cdot r^{-1} \bmod n = a \cdot b \cdot r \bmod n$
The product is converted back from the Montgomery domain to the integer domain:
    $\mathrm{MMM}(\mathrm{Product}, 1, n) = a \cdot b \cdot r \cdot 1 \cdot r^{-1} \bmod n = a \cdot b \bmod n$
Montgomery modular multiplication is suitable where several modular multiplications are done modulo one n (modulus). In the RSA cryptosystem, modular exponentiation is achieved by repeated modular multiplications with the same modulus. Hence the cost of converting the input operands a and b from the integer domain to the Montgomery domain, and the result back to the integer domain, is negligible in cryptosystems like RSA.
In the algorithms below, A_i denotes bit i of A and X_0 denotes the least significant bit of X.
Algorithm 1. Montgomery Modular Multiplication (A, B, n)
    S[0] = 0;
    for i in 0 to k-1 loop
        (1) q_i = (S[i]_0 + A_i * B_0) mod 2;
        (2) S[i+1] = (S[i] + A_i * B + q_i * n) div 2;
    end loop;
    return S[k];
Algorithm 1 is a radix 2 version of Montgomery's multiplication algorithm [5]. The critical path, step (2), is the addition of long operands, which results in a large carry propagation delay. To avoid the carry propagation, many designs based on systolic arrays and carry save adder architectures have been proposed in the literature. Algorithm 2 is based on carry save adders (CSA).
Algorithm 2. Four-to-two CSA Montgomery Multiplication (A1, A2, B1, B2, n)
    D1, D2 = CSR(B1 + B2 + n + 0);
    S1[0] = 0; S2[0] = 0;
    for i in 0 to k-1 loop
        q_i = ((S1[i]_0 + S2[i]_0) + (A_i * (B1_0 + B2_0))) mod 2;
        if A_i = 0 and q_i = 0 then
            S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + 0 + 0) div 2;
        elsif A_i = 1 and q_i = 0 then
            S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + B1 + B2) div 2;
        elsif A_i = 0 and q_i = 1 then
            S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + n + 0) div 2;
        else
            S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + D1 + D2) div 2;
        end if;
    end loop;
    return S1[k], S2[k];
The authors in [7] have proposed this four-to-two CSA design. Its critical path consists of the computation of the quotient and 2 levels of CSA. The two levels are achieved by precomputation of operands. The quotient determination involves a delay of 2 XORs and 1 AND. Therefore the critical path delay is 2 full adders + 4:1 multiplexer + 2 XORs + 1 AND. This design has a large delay in quotient computation, which is addressed in our proposed design.
Below we discuss the Simple Radix 2 Montgomery Multiplier design proposed in [8]. This design improves the critical path by simplifying quotient determination. The multiplier and multiplicand are increased by one bit. The multiplicand B is replaced by 2B so that the first bit of the partial product, A_i * B_0, is zero. Hence the quotient becomes independent of the partial product in all iterations and depends only on the partial sum. This comes at the expense of just one extra clock cycle.
Algorithm 3. New Montgomery Multiplication (A, B, n)
    S[0] = 0;
    B_new = 2B;
    for i in 0 to k loop
        S[i+1] = (S[i] + A_i * B_new + S[i]_0 * n) div 2;
    end loop;
    return S[k+1];
The authors in [8] have implemented this design with the five-to-two CSA proposed by McIvor and have named it the Five-to-two CSA NEW Montgomery Multiplication.
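As a rough illustration of Algorithms 1 and 3 and of the domain conversion described above, the following Python sketch models both loops on ordinary integers. The function names, the toy modulus in the usage note and the `% n` reductions applied between calls are assumptions made for this example; they are not taken from [7] or [8], and the sketch says nothing about hardware timing.

```python
def mmm_radix2(A, B, n, k):
    """Model of Algorithm 1: returns S with S = A*B*2^(-k) (mod n); n must be odd."""
    S = 0
    for i in range(k):
        a_i = (A >> i) & 1                 # multiplier bit A_i
        q_i = (S + a_i * B) & 1            # (S[i]_0 + A_i*B_0) mod 2
        S = (S + a_i * B + q_i * n) >> 1   # sum is even, so the shift is exact
    return S


def mmm_shifted(A, B, n, k):
    """Model of Algorithm 3: multiplicand doubled, k+1 iterations, A < 2^k assumed."""
    S = 0
    B_new = 2 * B                          # LSB of the multiplicand is now 0
    for i in range(k + 1):
        a_i = (A >> i) & 1                 # bit k of A is 0 because A < 2^k
        q_i = S & 1                        # quotient depends on the partial sum only
        S = (S + a_i * B_new + q_i * n) >> 1
    return S


def modmul_via_montgomery(a, b, n, k, mmm=mmm_radix2):
    """a*b mod n through the Montgomery domain (assumes a, b < n < 2^k, n odd)."""
    r2 = pow(2, 2 * k, n)                  # r^2 mod n with r = 2^k
    A = mmm(a, r2, n, k) % n               # A = a*r mod n
    B = mmm(b, r2, n, k) % n               # B = b*r mod n
    P = mmm(A, B, n, k) % n                # P = a*b*r mod n
    return mmm(P, 1, n, k) % n             # back to the integer domain
```

With a toy modulus such as n = 239 and k = 8, `modmul_via_montgomery(a, b, 239, 8)` can be compared against `(a * b) % 239` for either loop. The explicit `% n` stands in for the final correction, which the algorithms above leave out.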
In this architecture, while S1 + S2 + A_i*(B1 + B2) is being calculated, the result of n*(S1_0 + S2_0) (also called the operand for modular reduction) is prepared, so there is no delay in waiting for it. Hence this design improves over the five-to-two CSA, which has more delay. The design, i.e. algorithm 3, was not implemented with the four-to-two CSA. The reason is that the four-to-two CSA needs two bits simultaneously, the multiplier bit and the quotient bit, for computing two levels of carry save addition. Hence the computation of the quotient and of the operand for modular reduction in parallel with carry save addition cannot be applied as such in the four-to-two CSA. We have instead implemented the design in [8], i.e. algorithm 3, with the four-to-two CSA by computing the quotient for iteration i+1 in iteration i, as soon as the partial sum for iteration i is available. The proposed design is discussed in detail in section 3.
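The four-to-two carry save step that algorithm 2 relies on can be sketched in Python as two chained 3:2 compressors acting on whole integers. This is a behavioural model only, under the assumption that arbitrary-precision integers replace fixed-width registers; the helper names `csa` and `csr4` are hypothetical, and the multiplier is simply taken as A = A1 + A2 instead of being resolved bit by bit as in hardware.

```python
def csa(x, y, z):
    """3:2 carry save adder: returns (sum, carry) with sum + carry == x + y + z."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def csr4(w, x, y, z):
    """Four-to-two compression built from two CSA levels."""
    s, c = csa(w, x, y)
    return csa(s, c, z)


def mmm_four_to_two(A1, A2, B1, B2, n, k):
    """Model of algorithm 2; S1 + S2 = A*(B1+B2)*2^(-k) (mod n), n odd."""
    A = A1 + A2                            # assumption: A1 + A2 < 2^k
    D1, D2 = csr4(B1, B2, n, 0)            # precomputed B1 + B2 + n
    S1 = S2 = 0
    for i in range(k):
        a_i = (A >> i) & 1
        # quotient: 2 XORs and 1 AND on the least significant bits
        q_i = ((S1 ^ S2) ^ (a_i & (B1 ^ B2))) & 1
        if a_i == 0 and q_i == 0:
            P1, P2 = 0, 0
        elif a_i == 1 and q_i == 0:
            P1, P2 = B1, B2
        elif a_i == 0 and q_i == 1:
            P1, P2 = n, 0
        else:
            P1, P2 = D1, D2
        T1, T2 = csr4(S1, S2, P1, P2)
        # T2 is even by construction and T1 + T2 is even, so both shifts are exact
        S1, S2 = T1 >> 1, T2 >> 1
    return S1, S2
```

The point of the model is that the quotient bit must be known before the operands P1 and P2 can be selected, which is why the quotient logic sits on the critical path in this variant.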
3. Proposed Algorithm
The proposed design is a hybrid of the two designs, i.e. algorithm 2 and algorithm 3. The quotient determination in algorithm 2 involves a delay of 2 XORs + 1 AND. The proposed design reduces the delay in quotient computation: it makes the quotient independent of the partial product, thus reducing its computational delay to one XOR. Here (A_n)_i denotes bit i of the extended multiplier A_n, and & denotes concatenation.
Algorithm 4. Modified Montgomery Modular Multiplication (A1, A2, B1, B2, n)
    1.  1a. (S1[0], S2[0]) = (0, 0);
        1b. B1_n = 2*B1; B2_n = 2*B2;
        1c. D1, D2 = CSR(B1_n + B2_n + n);
        1d. q_0 = 0; P1 = 0; P2 = 0;
        parallel {
        1e. A1_n = 0 & A1; A2_n = 0 & A2;
        1f. computation of (A_n)_0 for the first iteration }
    2.  for i in 0 to k loop
        2a. if (A_n)_i = 0 and q_i = 0 then P1 = 0; P2 = 0;
            elsif (A_n)_i = 1 and q_i = 0 then P1 = B1_n; P2 = B2_n;
            elsif (A_n)_i = 0 and q_i = 1 then P1 = 0; P2 = n;
            else P1 = D1; P2 = D2;
            end if;
        2b. S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + P1 + P2) div 2;
        2c. q_(i+1) = S1[i+1]_0 xor S2[i+1]_0;    (computation of quotient)
        2d. { computation of (A_n)_(i+1) }    (steps 2a to 2c run in parallel with step 2d)
        end loop;
    3.  return (S1[k+1], S2[k+1]);
In the proposed algorithm, the multiplicand (step 1b) and the multiplier (step 1e) are increased by one bit. The LSB of the multiplicand is made 0 by left shifting the multiplicand by one bit. Since the multiplier is also increased by one bit and its MSB is made zero (step 1e), the number of iterations in modular multiplication becomes k+1, where k is the bit length of the input operands. Our design uses twice the multiplicand and one extra division by 2, and hence gives the required result for modular multiplication. With B1_n = 2*B1 and B2_n = 2*B2:
    $(A \cdot (\mathrm{B1\_n} + \mathrm{B2\_n})) \cdot 2^{-(k+1)} \bmod n = (A \cdot 2(B1 + B2)/2) \cdot 2^{-k} \bmod n = A \cdot (B1 + B2) \cdot 2^{-k} \bmod n$
3.1 The key features of the design are:
3.1.1 Dependency Relaxation and Simplified Quotient Computation
In the proposed design the dependency of the quotient on the partial product in an iteration (as in algorithm 2) is removed. This is achieved by increasing the multiplicand by one bit and making its LSB zero. This makes the LSB of the partial product zero for all iterations. Hence the quotient depends only on the LSB of the partial sum, and in the proposed design it is computed as soon as the partial sum in the iteration is available. The quotient for iteration i+1 is computed in iteration i (as in step 2c). The computation of the quotient is also simplified, with a delay of 1 XOR as compared to the delay of 2 XORs and 1 AND in algorithm 2.
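The steps of algorithm 4 can be followed in a small behavioural Python model. It is a sketch under stated assumptions: registers are replaced by unbounded integers, the multiplier bits are read from A1 + A2 directly (the hardware derives them in parallel in step 2d), and the helper names are hypothetical.

```python
def csa(x, y, z):
    """3:2 carry save adder: sum + carry == x + y + z."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def csr4(w, x, y, z):
    """Two CSA levels compress four operands into two."""
    s, c = csa(w, x, y)
    return csa(s, c, z)


def modified_mmm(A1, A2, B1, B2, n, k):
    """Model of algorithm 4; S1 + S2 = A*(B1+B2)*2^(-k) (mod n), n odd."""
    S1 = S2 = 0                            # step 1a
    B1_n, B2_n = 2 * B1, 2 * B2            # step 1b: LSB of multiplicand is 0
    D1, D2 = csa(B1_n, B2_n, n)            # step 1c: precomputed sum
    q = 0                                  # step 1d
    A_n = A1 + A2                          # step 1e modelled; assumes A1 + A2 < 2^k
    for i in range(k + 1):                 # step 2: k+1 iterations
        a_i = (A_n >> i) & 1
        if a_i == 0 and q == 0:            # step 2a: operand selection
            P1, P2 = 0, 0
        elif a_i == 1 and q == 0:
            P1, P2 = B1_n, B2_n
        elif a_i == 0 and q == 1:
            P1, P2 = 0, n
        else:
            P1, P2 = D1, D2
        T1, T2 = csr4(S1, S2, P1, P2)      # step 2b: two levels of CSA
        S1, S2 = T1 >> 1, T2 >> 1          # div 2 (both halves are even)
        q = (S1 ^ S2) & 1                  # step 2c: quotient for next iteration
    return S1, S2                          # step 3
```

For example, with n = 239 and k = 8 the value `sum(modified_mmm(A, 0, B, 0, 239, 8)) % 239` can be checked against `A * B * pow(2, -8, 239) % 239` (the three-argument `pow` with a negative exponent needs Python 3.8 or later).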
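[Figure: S1[i]_0 and S2[i]_0 feed one XOR gate; B1_0 and B2_0 feed a second XOR gate whose output is ANDed with A_i; a final XOR of the two results gives q_i.]
Fig 1. Quotient computation in Processing Element in algorithm 2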
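The gate network of Fig 1 and the single XOR used in algorithm 4 can be compared exhaustively over all bit combinations. The short Python check below is only an illustration; the function names are made up for this example.

```python
from itertools import product


def quotient_alg2(s1_0, s2_0, b1_0, b2_0, a_i):
    """Quotient logic of algorithm 2: 2 XORs and 1 AND feeding a final XOR."""
    return (s1_0 ^ s2_0) ^ (a_i & (b1_0 ^ b2_0))


def quotient_alg4(s1_0, s2_0):
    """Quotient logic of algorithm 4: a single XOR of the partial-sum LSBs."""
    return s1_0 ^ s2_0


# The gate network matches the arithmetic definition of the quotient bit.
matches_formula = all(
    quotient_alg2(s1, s2, b1, b2, a) == (s1 + s2 + a * (b1 + b2)) % 2
    for s1, s2, b1, b2, a in product((0, 1), repeat=5)
)

# Once the multiplicand LSBs are forced to 0, one XOR is enough.
reduces_to_xor = all(
    quotient_alg2(s1, s2, 0, 0, a) == quotient_alg4(s1, s2)
    for s1, s2, a in product((0, 1), repeat=3)
)
```

Both flags evaluate to `True`, which reflects the argument in the text: doubling the multiplicand removes the A_i term from the quotient.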

[Figure: a single XOR gate with inputs S1[i+1]_0 and S2[i+1]_0 and output q_(i+1).]
Fig 2. Quotient computation in Processing Element in algorithm 4
3.1.2 Two Levels of Carry Save logic
The critical path in the Montgomery design is the addition of long operands. The proposed design achieves 2 levels of carry save logic. At the start of each iteration of the for loop, the multiplier bit and the quotient bit are available. Operands are precomputed so that there are 4 operands for addition in all cases, and hence 2 levels of CSA (step 2b).
3.1.3 Parallel Computation of Multiplier bit
The multiplier bit A_(i+1) for iteration i+1 is computed in iteration i, in parallel with the CSA (steps 2a-2c in parallel with 2d). Hence no extra clock cycle is required for its computation.
3.2. Architecture of Processing Element
The processing element (PE) in the proposed Montgomery design processes all the bits of the operands. The values 0, n, B1_n, B2_n, A1_n, A2_n, D1, D2, P1, P2 and the results of the carry save adders are held in registers implemented using flip flops on the FPGA device. The PE has 2(k+2) 4:1 multiplexers, where k is the bit length of the operands. Carry save adders are used for addition. The multiplier bit A for the next iteration is computed in parallel with the carry save addition.
3.3. Critical Path Delay Analysis
This section compares the critical path delay of our design with the related work in [7]. Our design improves the critical path in Montgomery modular multiplication by having 2 levels of CSA and a simplified computation of the quotient. The critical path delay of our proposed algorithm is shorter by 1 XOR and 1 AND than that of algorithm 2 in [7]. This is due to the simplified quotient computation, at the expense of an extra iteration. Therefore our design takes k+3 clock cycles as compared to k+2 clock cycles for algorithm 2 in [7].
Table 1. Critical Path Delay of Montgomery designs

Algorithm            Loop delay                           Clock cycles   Critical path delay
4:2 [7]              2 CSAs + 4:1 mux + 2 XORs + 1 AND    k+2            2 full adders + 4:1 mux + 2 XORs + 1 AND
Proposed algorithm   2 CSAs + 4:1 mux + 1 XOR             k+3            2 full adders + 4:1 mux + 1 XOR
4. RSA Modular Exponentiation

Modular exponentiation is the main operation in the RSA cryptosystem. Both encryption and decryption in RSA are modular exponentiations. Let M be the plaintext message block to be encrypted, n be the k-bit modulus, and e and d be the public and private exponents respectively. Let C be the ciphertext block computed by encrypting M. Then the ciphertext is $C = M^e \bmod n$. Modular exponentiation is achieved by repeated modular multiplications. Many methods for the computation of modular exponentiation exist [3]. This paper focuses on the implementation of RSA using the binary method, which is based on square and multiply. The left-to-right square and multiply method (MSB first, Most Significant Bit) and the right-to-left square and multiply method (LSB first, Least Significant Bit) for modular exponentiation are implemented using the proposed Montgomery modular multiplication, algorithm 4.
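The two binary methods mentioned above can be sketched in Python on top of a Montgomery multiplication. For brevity the multiplier below is the single-operand doubled-multiplicand loop (not the carry save version of algorithm 4) with a `% n` correction after each call, and all names are assumptions made for this example. It is a functional illustration only and is not hardened for cryptographic use.

```python
def mont_mul(A, B, n, k):
    """Returns A*B*2^(-k) mod n (assumes A < 2^k and n odd)."""
    S = 0
    for i in range(k + 1):
        a_i = (A >> i) & 1
        S = (S + a_i * 2 * B + (S & 1) * n) >> 1
    return S % n


def modexp_msb_first(M, e, n, k):
    """Left-to-right square and multiply: M^e mod n for e >= 1."""
    r2 = pow(2, 2 * k, n)
    Mm = mont_mul(M, r2, n, k)          # M in the Montgomery domain
    C = mont_mul(1, r2, n, k)           # 1 in the Montgomery domain (r mod n)
    for bit in bin(e)[2:]:              # scan exponent bits from the MSB
        C = mont_mul(C, C, n, k)        # square
        if bit == "1":
            C = mont_mul(C, Mm, n, k)   # multiply
    return mont_mul(C, 1, n, k)         # leave the Montgomery domain


def modexp_lsb_first(M, e, n, k):
    """Right-to-left square and multiply: M^e mod n for e >= 1."""
    r2 = pow(2, 2 * k, n)
    P = mont_mul(M, r2, n, k)           # running square M^(2^i)
    C = mont_mul(1, r2, n, k)
    while e:
        if e & 1:
            C = mont_mul(C, P, n, k)    # multiply; independent of the squaring
        P = mont_mul(P, P, n, k)        # square
        e >>= 1
    return mont_mul(C, 1, n, k)
```

In the LSB-first loop the multiplication and the squaring of one iteration do not depend on each other, so hardware can run them side by side; in the MSB-first loop they are sequential.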
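[Figure: registers holding 0, n, B1_n, B2_n, D1 and D2 feed two 4:1 multiplexers, selected by the multiplier bit A and the quotient bit q, which produce P1 and P2; two carry save adders (CSA1, CSA2) combine S1, S2, P1 and P2, followed by right shifts; an XOR of the LSBs of S1 and S2 gives q(i+1); A1_n and A2_n are used to compute the next multiplier bit (A_n)_(i+1).]
Fig 3. Architecture of Processing Element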
5. Implementation Results
The proposed design is coded in VHDL and synthesized in Xilinx ISE 8.1i. The results for the proposed Montgomery design were obtained for 512, 1024 and 2048 bits. The results in table 2 are synthesis results. Area is measured in number of slices. Frequency is in MHz, as reported in the synthesis report. Throughput is defined as the bit length multiplied by the frequency and divided by the number of clock cycles. Table 2 shows the performance of the Montgomery modular multiplication design in [7] and of the proposed design. Table 3 gives area results for our proposed Montgomery design in terms of slices, slice flip flops and LUTs (look-up tables).
Table 2. Performance of Montgomery Modular Multiplication designs, Thr = Throughput

Design                Bit length   Device     Area (no of slices)   Freq (MHz)   Thr (Mb/s)
Four-to-two CSA [7]   512          XC2V1500   5782                  122.03       121.55
Four-to-two CSA [7]   1024         XC2V3000   11520                 111.32       111.10
Four-to-two CSA [7]   2048         XC2V6000   23108                 90.73        90.64
Algo 4                512          XC2V1500   3480                  156.824      155.9
Algo 4                1024         XC2V3000   6953                  136.458      136.05
Algo 4                2048         XC2V4000   14015                 135.587      135.38
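The throughput definition used for Table 2 (bit length times frequency divided by clock cycles) can be reproduced with a few lines of Python. The function name is an assumption made for this example; the frequencies come from Table 2 and the cycle counts k+2 and k+3 from section 3.3.

```python
def throughput_mbps(bits, freq_mhz, cycles):
    """Throughput in Mb/s: bit length * frequency (MHz) / clock cycles."""
    return bits * freq_mhz / cycles


# Proposed design (algorithm 4): k+3 clock cycles per multiplication.
proposed = {k: throughput_mbps(k, f, k + 3)
            for k, f in ((512, 156.824), (1024, 136.458), (2048, 135.587))}

# Four-to-two CSA design of [7]: k+2 clock cycles per multiplication.
four_to_two = {k: throughput_mbps(k, f, k + 2)
               for k, f in ((512, 122.03), (1024, 111.32), (2048, 90.73))}
```

The computed values agree with the throughput column of Table 2 to within about 0.01 Mb/s; the small differences appear to come from rounding or truncation in the published figures.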
Table 3. Area results for proposed Montgomery design
Bit length   Device     No of slices   No of Slice Flip Flops   4 Input LUTs
512          XC2V1500   3480           4259                     5917
1024         XC2V3000   6953           8519                     11766
2048         XC2V4000   14015          17263                    23359
RSA is implemented using the modified Montgomery design. Both the LSB and the MSB binary methods of RSA are coded in VHDL. Synthesis results for RSA with 512 and 1024 bits are given in table 4. Here the public exponent $e = 2^{16} + 1$ is used, and the implementation takes 17 bits for the exponent. The number of clock cycles for RSA encryption is therefore [19(k+3) + (k+1)] for the LSB method and [37(k+3) + (k+1)] for the MSB method, where the k+1 cycles are needed to convert the result from redundant to binary form.
Table 4. RSA Exponentiation using proposed Montgomery design (RSA Encryption)

Design (MSB/LSB)   Bits   Device     Area (slices)   Freq (MHz)   Thr
LSB                512    XC2V3000   8962            117.564      5.84 Mb/s
MSB                512    XC2V3000   6073            117.247      3.06 Mb/s
LSB                1024   XC2V4000   17818           109.914      5.48 Mb/s
MSB                1024   XC2V3000   12040           109.636      2.87 Mb/s
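The cycle counts quoted for RSA encryption can be turned into throughput figures in the same way. The Python sketch below is illustrative only: the function names are hypothetical, and the factors 19 and 37 are taken directly from the text.

```python
def rsa_cycles(k, method):
    """Clock cycles for RSA encryption with e = 2^16 + 1, as stated in the text."""
    mults = {"LSB": 19, "MSB": 37}[method]
    return mults * (k + 3) + (k + 1)   # k+1 cycles convert redundant to binary form


def rsa_throughput_mbps(k, freq_mhz, method):
    """Throughput in Mb/s: bit length * frequency (MHz) / clock cycles."""
    return k * freq_mhz / rsa_cycles(k, method)
```

For k = 512 this gives 10298 cycles for the LSB method and 19568 cycles for the MSB method, and the resulting throughputs agree with Table 4 to within about 0.01 Mb/s.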
6. Conclusions
The proposed Montgomery design improves the critical path by simplifying quotient computation. It saves a delay of 1 XOR and 1 AND in each loop, at the expense of an extra iteration, as compared to algorithm 2 [7]. This is because the computation of the quotient is simplified by making it independent of the partial product. Therefore our design has a higher frequency and throughput than [7]. It also occupies less area than the design in [7]. The design architecture uses two registers, P1 and P2, whose values depend on the quotient bit and the multiplier bit in an iteration. In algorithm 2 [7], different carry save addition paths exist, and one of them is chosen depending on the quotient and multiplier bits. Our design instead assigns one of four values to each of the registers P1 and P2 depending on the quotient bit and the multiplier bit, so only one carry save path exists. The simplified quotient computation also gives a slight hardware reduction. Experimental results show that our design has reduced area and improved frequency and throughput. Hence the design is useful for applications like RSA encryption and decryption.
References
[1] R.L. Rivest, A. Shamir, L. Adleman, (1978), A Method for Obtaining Digital Signatures and Public-Key Cryptosystems, Communications of the ACM, vol. 21, no. 2, pp. 120-126.
[2] Peter L. Montgomery, (1985), Modular Multiplication without Trial Division, Mathematics of Computation, vol. 44, no. 170, pp. 519-521.
[3] Cetin Kaya Koc, (1994), High-Speed RSA Implementation, RSA Laboratories, Tech. Rep.
[4] Holger Orup, (1995), Simplifying Quotient Determination in High-Radix Modular Multiplication, in Proc. 12th IEEE Symp. Computer Arithmetic, pp. 193-199.
[5] C.D. Walter, (1999), Montgomery Exponentiation Needs No Final Subtraction, Electronics Letters, vol. 35, no. 21, pp. 1831-1832.
[6] Thomas Blum and Christof Paar, (2001), High Radix Montgomery Modular Exponentiation on Reconfigurable Hardware, IEEE Transactions on Computers, vol. 50, no. 7, pp. 759-764.
[7] C. McIvor, M. McLoone, and J.V. McCanny, (2004), Modified Montgomery modular multiplication and RSA exponentiation techniques, IEE Proceedings - Computers and Digital Techniques, vol. 151, no. 6, pp. 402-408.
[8] K. Manochehri and S. Pourmozafari, (2005), Modified Radix-2 Montgomery Modular Multiplication to Make It Faster and Simpler, Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC05).
[9] RV Kamala and MB Srinivas, (2006), High Throughput Montgomery Modular Multiplication, IFIP International Conference on Very Large Scale Integration, France.
[10] Y.Y. Zhang, Z. Li, L. Yang, and S.-W. Zhang, (2007), An efficient CSA architecture for Montgomery modular multiplication, Elsevier, Microprocessors and Microsystems, vol. 31, no. 7, pp. 456-459.
[11] M.-D. Shieh, J.-H. Chen, H.-H. Wu, and W.-C. Lin, (2008), A New Modular Exponentiation Architecture for Efficient Design of RSA Cryptosystem, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 9, pp. 1151-1161.