Fast Implementations of RSA Cryptography
M. Shand, J. Vuillemin, Digital Equipment Corp., Paris Research Laboratory (PRL), 85 Av. Victor Hugo, 92500 Rueil-Malmaison, France.
Abstract
We detail and analyse the critical techniques which may be combined in the design of fast hardware for RSA cryptography: chinese remainders, star chains, Hensel's odd division (a.k.a. Montgomery modular reduction), carry-save representation, quotient pipelining and asynchronous carry completion adders. A PAM¹ implementation of RSA which combines all of the techniques presented here is fully operational at PRL: it delivers an RSA secret decryption rate over 600Kb/s for 512b keys, and 165Kb/s for 1Kb keys. This is an order of magnitude faster than any previously reported running implementation. While our implementation makes full use of the PAM's reconfigurability, we can nevertheless derive from our (multiple PAM designs) implementation a (single) gate-array specification whose size is estimated under 100K gates, and speed over 1Mb/s for RSA 512b keys. Each speed-up in the hardware performance of RSA involves a matching gain in software performance, which we also analyse. In addition to the techniques enumerated above, our best software implementation of RSA involves Karatsuba multiplication and specific squaring. Measured on 512b keys, the method leads to software implementations at 2Kb/s on a VAX 8700, 16Kb/s on a 40MHz DECstation 5000/240, and 56Kb/s on a 150MHz Alpha, which is faster than any RSA hardware commercially available in 1992.
1 Introduction
In 1977 Rivest, Shamir and Adleman [RSA 78] introduced an important public key crypto-system based on computing modular exponentials. The security of RSA cryptography ultimately rests on our inability to effectively factor large integers. As a case in point, [LLMP 92] used hundreds of computers world-wide for a number of months in order to explicitly factor the 9-th Fermat number F_9 = 2^(2^9) + 1, a 513 bit integer. Admittedly, this factorization uses special properties of F_9, and fully general techniques for factoring numbers over 512b are even slower. Nevertheless, RSA implementations with key lengths of 512b must be prepared to renew their keys regularly and cannot be used for reliably transmitting any data which must remain secret for more than a few weeks. Longer keys (768b or 1Kb) appear safe within the current state-of-the-art on integer factoring.

The complexity of RSA encoding a km bit message with a k bit key is m times that of computing a k bit modular exponential, which is sk^3 in software² and hk^2 in hardware³. This paper deals with techniques to lower the values of the time constants s and h for both software and hardware implementations of RSA.

During the last five years, we have used RSA cryptography as a benchmark for evaluating the computing power of the PAMs built at PRL (see [BRV 89] and [BRV 92]). This reconfigurable hardware has allowed us to implement, measure and compare over ten successive versions of RSA (see [SBV 91] for details). Our fastest current hardware implementation of RSA relies on reconfigurability in many ways: we use a different PAM design for RSA encryption and decryption; we generate a different hardware modular multiplier for each (different prime) modulus P (the k coefficients in the binary representation of P are hardwired into the logic equations). So, deriving an ASIC from our PAM design is not an automatic task, and we account for these facts⁴ in our gate-array size and speed estimates.

1 Short for Programmable Active Memory: based on Programmable Gate Array (PGA) technology, a PAM is a universal configurable hardware co-processor closely coupled to a standard host computer. The PAM can speed-up many critical software applications running on the host, by executing part of the computations through a specific hardware PAM configuration.
2 Although algorithms with better asymptotic bounds exist, at the bit lengths of interest in RSA, modular multiplication has complexity k^2.
3 Assuming that the area of the hardware is proportional to k.
4 As well as earlier experimental figures obtained by H. Touati and R. Rudell for general PAM to ASIC compiling.
2 RSA Cryptography

Let us recall the ingredients for RSA cryptography: the public modulus M = PQ is the product of two large secret primes P and Q, and the public exponent E and the secret exponent D are inverses modulo (P - 1)(Q - 1). In order to speed-up public encryption, E is chosen to be small: either E = F_0 = 3 as in [K 81], or E = F_4 = 2^16 + 1 as recommended by the CCITT standard.

1. The public encryption:

P(A) = A^E (mod M).

2. The secret decryption S(A) is:

S(A) = A^D (mod M).

3. Encrypt and decrypt are respective inverses:

S(P(A)) = P(S(A)) = A,

modulo M and for all 0 <= A < M.

Since the public exponent (say E = 2^16 + 1) is small, the complexity of the public encryption procedure is only 17sk^2 in software and 17hk in hardware. On current fast micro-processors, this provides an effective RSA public encoding bandwidth of 150Kb/s for k = 512 and 90Kb/s for k = 1024. Since the secret exponent D typically has k = l_2(M) bits, RSA secret decryption is slower than public encryption: 32 times slower for k = 512b and E = F_4, and 512 times slower for k = 1Kb and E = 3.

3 Chinese Remainders

In order to speed-up the RSA secret decryption A^D (mod M) we take advantage of the secret knowledge of the prime decomposition M = PQ [QC 82], and use chinese remainders (mod P) and (mod Q)⁶:

Algorithm 1 (RSA decrypt) In order to compute S(A) = A^D (mod M = PQ):

1. Compute Ap = A (mod P), Aq = A (mod Q).

2. Compute Bp = Ap^Dp (mod P), Bq = Aq^Dq (mod Q), with (precomputed) exponents Dp = D (mod P - 1), Dq = D (mod Q - 1).

3. Compute Sp = Bp·Cp (mod M), Sq = Bq·Cq (mod M), with (precomputed chinese) coefficients Cp = Q^(P-1) (mod M), Cq = P^(Q-1) (mod M).

4. Compute S = Sp + Sq; S(A) = if S >= M then S - M else S. (A software sketch of this algorithm is given at the end of this section.)

With chinese remainders, the software complexity of RSA decryption goes from sk^3 down to (s/4)k^3 + 4sk^2, a speed-up factor of 4 - ε(k), with ε(k) = 4/k < 1/100 for k > 400; the factor ε(k) accounts for the initial modulo reductions and the final chinese recombination. With a single k bit multiplier, the hardware speedup from chinese remainders is only 2 - ε'(k), with ε'(k) = 4/k accounting for the initial and final overhead. A better way to use the same silicon area is to operate two k/2 bit multipliers in parallel, one for each factor of M, as in fig. 1. The exponentiation time becomes hk^3/4 for a speed-up near 4, provided that we can perform the final chinese recombination at the same rate. Taking full advantage of the reconfigurability of the PAM, we have implemented three solutions to this problem:

1. Compute the chinese recombination in software: a 40 MIPS host performs this operation at a rate of more than 600Kb/s on a pair of 512b primes, which easily accommodates our hardware exponentiation rate for 1Kb RSA keys.

2. Assist a slower host computer with a fast enough hardware multiplier, running in parallel with the two exponentiators [SBV 91].

3. Design the two k/2 bit modulo P and Q multipliers so that they can be reconfigured quickly enough into one single k bit multiplier modulo M, which performs the final chinese recombination.

6 The knowledge of P and Q is equivalent to that of D, as we can efficiently factor M = PQ once we know M, E and D. This justifies our use of chinese remainders.
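As an illustration, here is a minimal Python sketch of Algorithm 1 on toy-sized primes; the function name and test values are ours, and a production implementation would of course use full-size keys together with the optimizations of the following sections.

# Minimal Python sketch of Algorithm 1 (RSA decryption through chinese
# remainders), on toy-sized primes; names follow the text (Ap, Bp, Cp, ...).

def rsa_decrypt_crt(A, D, P, Q):
    M = P * Q
    # Step 1: reduce the ciphertext modulo each prime factor.
    Ap, Aq = A % P, A % Q
    # Step 2: two half-size exponentiations with precomputed exponents.
    Dp, Dq = D % (P - 1), D % (Q - 1)
    Bp, Bq = pow(Ap, Dp, P), pow(Aq, Dq, Q)
    # Step 3: precomputed chinese coefficients Cp = Q^(P-1) mod M and
    # Cq = P^(Q-1) mod M; Cp is 1 mod P and 0 mod Q, and conversely.
    Cp, Cq = pow(Q, P - 1, M), pow(P, Q - 1, M)
    Sp, Sq = (Bp * Cp) % M, (Bq * Cq) % M
    # Step 4: recombine, with a single conditional subtraction.
    S = Sp + Sq
    return S - M if S >= M else S

# Toy key: D is the inverse of E modulo (P-1)(Q-1)  (pow(E, -1, m) needs Python 3.8+).
P, Q, E = 61, 53, 17
M = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))
A = 1234
assert rsa_decrypt_crt(pow(A, E, M), D, P, Q) == A

The two calls to pow operate on half-size moduli with half-size exponents, which is where the factor of (nearly) four comes from.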
Both algorithms involve k - 1 = l_2(E) - 1 squares and ν_2(E) modular products, where ν_2(E) = Σ_{i<k} e_i is the number of one bits in the binary representation of the exponent E; clearly, 0 < ν_2(E) <= k and the average value of ν_2(E) is k/2. In software, the choice of either L or H makes no difference with regard to the computation time, whose average is (3/2)sk^3; here s is the time required per bit of modular product. In hardware, algorithm L requires two storage registers (for B and P) while algorithm H gets away with only one storage register (for both B and P). It is natural (see [OK 91]) to implement algorithm L with two logically distinct k bit modular multipliers, one for squaring B, the other for multiplying B and P, as in fig. 2⁷. The number of cycles required for computing a k bit modular exponential in this way is sk^2, where s is the number of cycles required per bit of modular product. In [OK 91] these two multipliers are time-multiplexed on the same physical multiplier, even though the multiplication of B and P happens on average in only half of the cycles allocated to it. This multiplexed implementation is chosen because data dependencies in the inner loop of the modular multiplication algorithm make it impossible to commence the next step on the next cycle.

Figure 2.

When the underlying modular multiplications can be implemented free from such pipeline bubbles, it is natural to implement algorithm H with only one k bit modular multiplier, as in fig. 3, which is used both for squaring B and for multiplying B by P. The number of cycles for computing a k bit modular exponential in this way is sk(k + ν_2(E)), where s is the cycle time per bit of modular product. On the average, this is only 1.5 times slower than the two-multiplier design based on algorithm L, with only half the hardware. It is actually 1.33 times faster on the average than an implementation of algorithm L with a single time-multiplexed hardware multiplier as in [OK 91], because every cycle is productive.

Figure 3.

A star chain is an addition chain in which each step adds the most recent element to some earlier one, so that, for exponentiation, every multiply has the latest result as one of its operands; this restriction makes star chains well suited for hardware structures such as that of fig. 3, because it allows the hardwiring of one of the inputs to the multiplier. Despite this restriction, star chains are known (see [K 81] again) to be almost as efficient as general addition chains. Asymptotically, an optimal star chain requires only about k multiplies, against k + ν_2(E) for the binary methods. The sequence of multiplies only depends upon the exponent E and can thus be computed off-line. However, no efficient algorithm is known for computing the optimal star chain sequence. The modular multiplier in fig. 3 can be modified so as to exponentiate along any star chain, provided that input A is able to take one of its operands from a memory in which intermediate results have been saved. For k = 512b the required memory is less than 64kB.

An alternative to star chains which requires less storage and is easier to compute is to use the β-ary method of exponentiation (β = 2^p). We pre-compute a table of the powers A^j (mod M) for all small j < β. The exponential A^E is then obtained by a repeated sequence of p squarings followed by a multiplication by the appropriate power of A. If we allow the multiply to occur early in the squaring sequence, we only need to store the odd powers of A. This is a simple generalization of algorithm H to radix β, with E = [e_{k/p-1} … e_0]_β and 0 <= e_i < β for i < k/p. The storage required is kβ/2 bits; as the squaring sequence need not start until the second most significant digit, the expected number of modular products is (assuming that k is a multiple of p):

β/2 + (k - p) + (k/p - 1).

In software, for k = 256 (corresponding to a public modulus of 512b) the optimal choice of p is 5 (β = 32) and the average number of multiplies is 1.24k; for k = 512 the optimal p is 6 and the average number of multiplies is 1.22k. With a fast host, the computation of small powers can be done in software while the hardware completes the previous exponentiation, thus eliminating the β/2 term corresponding to the number of products required for building the table.

7 The index of all our figures is stepped by one.
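The definitions of algorithms L and H are not reproduced in this copy; the Python sketch below gives one standard reading of them (L scanning the exponent from low to high bits with two running values, H from high to low with a single running value), together with the β-ary generalization of H. For simplicity the sketch stores all β small powers rather than only the odd ones; the function names and test values are ours.

# Hedged sketch: one standard reading of the binary methods L and H, and of
# the beta-ary generalization of H described in the text (beta = 2**p).

def modexp_L(A, E, M):
    # Low-to-high binary method: keeps the running square B and the running
    # product P (two registers in hardware).
    B, P = A % M, 1
    while E:
        if E & 1:
            P = (P * B) % M
        B = (B * B) % M
        E >>= 1
    return P

def modexp_H(A, E, M):
    # High-to-low binary method: a single running value B (one register).
    B = 1
    for bit in bin(E)[2:]:              # most significant bit first
        B = (B * B) % M
        if bit == '1':
            B = (B * A) % M
    return B

def modexp_bary(A, E, M, p=5):
    # Radix beta = 2**p generalization of H: p squarings per digit, plus one
    # multiplication by a precomputed power of A per nonzero digit.
    beta = 1 << p
    table = [1, A % M]
    for _ in range(beta - 2):
        table.append((table[-1] * A) % M)   # table[j] = A**j mod M
    digits = []
    while E:
        digits.append(E & (beta - 1))
        E >>= p
    B = table[digits[-1]]                   # no squarings for the top digit
    for d in reversed(digits[:-1]):
        for _ in range(p):
            B = (B * B) % M
        if d:
            B = (B * table[d]) % M
    return B

M = 2**89 - 1
A, E = 123456789, 987654321
assert modexp_L(A, E, M) == modexp_H(A, E, M) == modexp_bary(A, E, M) == pow(A, E, M)

In hardware terms, modexp_H corresponds to the one-register datapath of fig. 3, while modexp_L needs the two registers B and P of fig. 2.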
1. Euclid's classical division of a k + p bit number N = [n_{k+p-1} … n_0]_2 by the modulus M processes the high bits of N first.

Algorithm 4 (Euclid's division) Starting from R[0] = [n_{k+p-1} … n_{p+1}]_2, compute:

for i <= p do {
S[i] = 2·R[i] + n_{p-i};
q_i = if S[i] < M then 0 else 1;
R[i+1] = S[i] - q_i·M
};
R = R[p+1].

This yields N = M·Q + R with R < M, and quotient Q = [q_0 … q_p]_2. The corresponding hardware scheme is:

Figure 4.

2. Hensel introduced the odd division around 1900, for computing the inverses of odd 2-adic numbers. This implies that the modulus M = 1 + 2M' must be odd. Processing instead the low bits of N first, and taking each quotient bit q_i from the current low bit of the remainder, p steps yield

N + M·Q = 2^p·R, with R < 2M,

and quotient Q = [q_{p-1} … q_0]_2. The corresponding hardware scheme is:

Figure 5.

Hensel's division computes N·2^(-p) (mod M) rather than Euclid's result N (mod M), making it inappropriate for some applications; however, in hardware terms, Hensel's division has a decisive advantage over Euclid's: it does not require the implementation of a quotient unit. The quotient bit q_i in Hensel's division is simply the current low bit of the R register. Euclid's division requires a full k bit comparison between M and the current value of S[i] in order to compute the corresponding quotient bit q_i. Most previously reported hardware implementations of RSA deal with modular reduction in Euclid's way: they avoid carrying out the full-compare quotient by using an approximate quotient computation which only involves a small number of high bits of S. The resulting redundant quotient has at least one more bit than Euclid strictly demands. Examples of such methods are found in [PV 90] which uses radix 4, [OK 91]⁹ which uses radix 32, and [IWSD 92]. An alternative is given by [T 91], which uses base 4 quotient digits in a redundant 3b per digit quotient system. As a consequence, the modular multiplier in [T 91] has an area which is (at least) 1.5 times larger than the Hensel based implementation which follows. One last advantage of Hensel's division is to allow for quotient pipelining, as shown in section 8.2. We are not aware of any similar pipelining technique for Euclid's division.

9 [OK 91] present their quotient unit as a parallel exhaustive search.
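A small software model makes the contrast concrete. The sketch below is ours (plain Python, radix 2, not a description of the hardware): the Euclid-style routine needs a full comparison against M at every step, while the Hensel-style routine reads its quotient bit directly off the low bit of R.

# Bit-serial software models of the two reduction schemes (radix 2).
# Illustrative sketches only, checking the two identities stated above.

def euclid_divide(N, M, p):
    # High bits first: shift in one bit of N per step and subtract M
    # whenever the partial remainder reaches M (a full-width compare).
    R, Q = N >> (p + 1), 0
    assert R < M
    for i in range(p + 1):
        S = 2 * R + ((N >> (p - i)) & 1)
        q = 0 if S < M else 1
        Q, R = 2 * Q + q, S - q * M
    return Q, R                  # N == M*Q + R and 0 <= R < M

def hensel_divide(N, M, p):
    # Low bits first: since M is odd, the quotient bit is simply the low
    # bit of R; adding q*M clears it, and we shift right.
    assert M & 1
    R, Q = N, 0
    for i in range(p):
        q = R & 1
        Q |= q << i
        R = (R + q * M) >> 1
    return Q, R                  # N + M*Q == (2**p) * R

M, N, p = 0b10011001, 0b101101101101, 6      # M odd; N fits in k + p bits
Q, R = euclid_divide(N, M, p)
assert N == M * Q + R and 0 <= R < M
Q, R = hensel_divide(N, M, p)
assert N + M * Q == (1 << p) * R and R < 2 * M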
6 Modular Product

There are many ways to compute modular products

P = A·B (mod M).

As shown in the previous section, one has to first choose the order in which to process the multiplier. Hensel's scheme (from low bits to high bits) has been used in software implementations of RSA, and [DK 91] present some of its benefits. Even [E90] presents a hardware design for modular exponentiation based on Hensel's scheme. Ours appears to be the first reported working hardware implementation of RSA to operate in this manner. The following is a generalization of an original algorithm in [Mo 85]:

Algorithm 6 (Modular Product) Let A, B, M ∈ N be three integers, each represented by n radix β = 2^p digits:

A = [a_{n-1} … a_0]_β, B = [b_{n-1} … b_0]_β, M = [m_{n-1} … m_0]_β,

with the modulus M relatively prime to β, that is gcd(m_0, β) = 1. We compute a product P = P[n] such that

P = A·β^(-n)·B (mod M),   (1)

by letting P[0] = 0 and evaluating, for t = 0, …, n - 1:

P[t+1] = (A·b_t + P[t] + q_t·M) / β.   (2)

At each step (2) the quotient digit q_t ∈ [0 … β - 1] is chosen so that P[t+1] is an integer:

A·b_t + P[t] + q_t·M = 0 (mod β).

This is achieved by letting

q_t = α·(a_0·b_t + p_0(t)) (mod β),   (3)

where the number α = -M^(-1) (mod β) is pre-computed so that α·m_0 + 1 = 0 (mod β), and p_0(t) = P[t] (mod β) denotes the least significant digit of P[t]. It is easily checked, by induction on t, that

β^t·P[t] = A·[b_{t-1} … b_0]_β + M·[q_{t-1} … q_0]_β,

hence (1) follows for t = n. As a numerical example, let us set β = 2, n = 5, A = 26 = [11010]_2, B = 11 = [01011]_2, M = 19 = [10011]_2 and use Algorithm 6 to compute:

t        0   1   2   3   4
P[t+1]  13  29  24  25  22
q_t      0   1   1   0   1

thus establishing the diophantine equation:

2^5·22 = 26·11 + 22·19.

Observe that the values of P[t] remain bounded: indeed, a simple induction on t establishes that

0 <= P[t] < A + M   (4)

is an invariant of Algorithm 6. It follows that n digits plus one bit are sufficient to represent P[t] < 2β^n, for all t >= 0. Equation (4) also shows that, assuming 0 <= A < M, we can compute P = A·β^(-n)·B (mod M) with 0 <= P < M with just one conditional subtraction following Algorithm 6:

P = if P[n] >= M then P[n] - M else P[n].
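A direct Python transcription of Algorithm 6 can be checked against the numerical example above; the function name is ours, and alpha denotes the precomputed digit -M^(-1) (mod β) of (3).

# Direct software transcription of Algorithm 6 (digit-serial modular
# product, Hensel/Montgomery style), radix beta = 2**p.
import random

def modular_product(A, B, M, n, p):
    beta = 1 << p
    assert M % 2 == 1 and M < beta ** n
    alpha = (-pow(M, -1, beta)) % beta        # alpha*m_0 + 1 = 0 (mod beta)
    P = 0
    for t in range(n):
        b_t = (B >> (p * t)) & (beta - 1)     # digit b_t of B
        q_t = (alpha * ((A % beta) * b_t + P % beta)) % beta      # eq. (3)
        P = (A * b_t + P + q_t * M) // beta   # eq. (2); the division is exact
    return P                                  # P = A * B * beta**(-n) (mod M)

# The beta = 2 example of the text: P[5] = 22, i.e. 2**5 * 22 = 26*11 + 22*19.
assert modular_product(26, 11, 19, n=5, p=1) == 22

# Random checks of congruence (1) and of invariant (4).
M, n, p = 0xB77, 3, 4                         # odd 12-bit modulus, beta = 16
for _ in range(100):
    A, B = random.randrange(M), random.randrange(M)
    P = modular_product(A, B, M, n, p)
    assert (P << (p * n)) % M == (A * B) % M and P < A + M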
For modular exponentials, we avoid performing this reduction after each modular product by letting all intermediate results have two extra bits of precision; all operands can nevertheless be represented with n digits since:

if M < β^n/4 and A, B <= 2M then P[n] < 2M.   (5)

Indeed, since P[n-1] < A + M by (4), and (5) implies b_{n-1} <= β/2 - 1, we have:

P[n] = (P[n-1] + b_{n-1}·A + q_{n-1}·M) / β
     < (A + M + (β/2 - 1)·A + (β - 1)·M) / β
     < A/2 + M <= 2M.

Algorithm 6 requires us to choose the radix β = 2^p in which to decompose the n = k/p digit multiplicand B = [b_{n-1} … b_0]_β. Let us analyse the impact of the choice of radix β on a k bit key RSA implementation.

β = 2^k. In [SBV 91], we compute modular products through a long integer hardware multiplier, which forces the largest base β = 2^k upon us. The modular product is realized by a sequence of three full-length k × k → 2k integer products (a software sketch of this three-product scheme is given below):

1. Compute the 2k bit product C = A·B; let C_0 = C (mod 2^k) and C_1 = C ÷ 2^k, so that C = C_0 + 2^k·C_1.

2. Compute the 2k bit product Q = C_0·α, where α = -M^(-1) (mod 2^k) is precomputed as in Algorithm 6. Keep the k low order bits as Hensel quotient q = Q (mod 2^k).

3. Compute the 2k bit product D = q·M, and add: P = (C + D)/2^k; the number P is such that P = A·2^(-k)·B (mod M) with P <= 2M.

The resulting complexity is high, 6hk^2, and we see that using large radixes is not efficient.

β = 2^32. Let M(p) represent the cost of a p × p → 2p bit multiply. In step (2) of Algorithm 6, the products A·b_t and q_t·M each cost (k/p)·M(p), and the computation of q_t costs M(p). The multiply cost of Algorithm 6 is thus:

2·(k^2/p^2)·M(p) + (k/p)·M(p).   (6)

In software, at the moderate bit lengths of RSA, the cost of multiplication grows quadratically with the length of the operands. Thus the first term of (6) is constant while the second term grows with p. So software should choose as small a radix as possible: typically one machine word.
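As an illustration of the β = 2^k variant, the following sketch uses Python integers in place of the long hardware multiplier of [SBV 91]; as in the text, a factor 2^(-k) is left in the result, and the helper name is ours.

# Sketch of the beta = 2**k scheme: one modular product costs three
# full-length k x k -> 2k integer products.

def modular_product_full_width(A, B, M, k):
    R = 1 << k
    alpha = (-pow(M, -1, R)) % R       # precomputed -M^(-1) mod 2**k
    C = A * B                          # product 1
    C0 = C % R
    q = (C0 * alpha) % R               # product 2: the Hensel quotient
    P = (C + q * M) >> k               # product 3, then an exact shift
    return P                           # P = A * B * 2**(-k) (mod M), P <= 2M

k, M = 16, 61169                       # any odd k-bit modulus
A, B = 12345, 54321
P = modular_product_full_width(A, B, M, k)
assert (P << k) % M == (A * B) % M and P <= 2 * M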
β = 2^2. In dedicated hardware, equation (6) shows that p should also be small. The choice β = 2^2 allows for a trivial calculation of q_t, and permits the use of Booth recoded multiplication: this doubles the multiplier's performance compared to β = 2, at a modest increase (about 1.5 times) in hardware area. Higher radixes, which offer better multiply performance, had to be dismissed, since they involve too much hardware and the computation of the quotient digits is no longer trivial.
With the use of star chains, or the precomputation of small powers, over 75% of the modular product operations have A = B. For k = 512 and p = 64 (using Karatsuba's algorithm), squaring may be implemented 1.77 times more efficiently than general multiplication. This yields an overall speed-up of 1.29. In hardware this would require extra storage for the length 2k intermediate result, but in software memory is cheap.
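For illustration, here is a toy Karatsuba multiplier and its squaring specialization on Python integers; the 64-bit threshold and the function names are ours, and a real implementation would work on arrays of machine words rather than recursing on Python integers.

# Toy Karatsuba multiplication and its squaring specialization on Python
# integers (illustrative only; the threshold is arbitrary).
import random

THRESHOLD = 64        # bits below which we fall back to the built-in product

def karatsuba(x, y):
    n = max(x.bit_length(), y.bit_length())
    if n <= THRESHOLD:
        return x * y
    h = n // 2
    x1, x0 = x >> h, x & ((1 << h) - 1)
    y1, y0 = y >> h, y & ((1 << h) - 1)
    z2 = karatsuba(x1, y1)
    z0 = karatsuba(x0, y0)
    z1 = karatsuba(x1 + x0, y1 + y0) - z2 - z0     # three half-size products
    return (z2 << (2 * h)) + (z1 << h) + z0

def square(x):
    # Squaring: the cross term x1*x0 is computed once and doubled, which is
    # where the savings over a general product come from.
    n = x.bit_length()
    if n <= THRESHOLD:
        return x * x
    h = n // 2
    x1, x0 = x >> h, x & ((1 << h) - 1)
    return (square(x1) << (2 * h)) + (karatsuba(x1, x0) << (h + 1)) + square(x0)

for _ in range(20):
    a, b = random.getrandbits(512), random.getrandbits(512)
    assert karatsuba(a, b) == a * b and square(a) == a * a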
8.2 Quotient Pipelining

In order to speed-up the clock cycle through pipelining, let us define r_t and R_t by:

r_t = P[t] + A·b_t (mod β),   R_t = P[t] + A·b_t - r_t.

Equation (7) becomes:

P[t+1] = (R_t + (r_t + q_t·M)) / β.

We introduce d levels of pipeline by choosing a quotient digit q'_t such that

r_t + q'_t·M = 0 (mod β^(d+1)), with 0 <= q'_t < β^(d+1).

We obtain the modified recurrence (with r_j = q'_j = 0 for j < 0):

P'[t+1] = (R_t + (r_{t-d} + q'_{t-d}·M)/β^d) / β.

This value is related to the old recurrence by:

P[t+1] = P'[t+1] + r_t·β^(-1) + … + r_{t-d+1}·β^(-d) (mod M).

Now, P[n] = A·β^(-n)·B (mod M). Letting the datapath run for d more iterations flushes the remaining corrections through the pipeline.
The choice of d is technology dependent. Making it unnecessarily large consumes cycles and area, but it should be sufficiently large that each step in the distribution of the q_t's through the datapath is no longer than the other critical paths of the datapath. In our implementation on the PAM the pipelined datapath can be clocked at 25ns. In a non-pipelined version the combinatorial distribution of q_t takes over 100ns. Thus, in this particular technology, quotient pipelining gives a speed-up of 4.
9 Summary of Speedups
In the following table we recall the various techniques applied to implementing RSA and quantify the speedup achieved for 1Kb keys.

TECHNIQUE                   Software   Hardware
Chinese remainders          4          4
Precompute small powers     1.2        1.25
Hensel's odd division       1.05       1.5
Karatsuba multiplication    1.22       -
Squaring optimization       1.29       -
Carry completion adder      -          2 - l_2(k)/k
Quotient pipelining         -          4
10 Bibliography
[B 90] E. F. Brickell: A Survey of Hardware Implementations of RSA, in Gilles Brassard, editor, Advances in Cryptology - Crypto '89, pp 368-370, Springer-Verlag, 1990.
[BRV 89] P. Bertin, D. Roncin, J. Vuillemin: Introduction to Programmable Active Memories, in Systolic Array Processors, edited by J. McCanny, J. McWhirter and E. Swartzlander, Prentice Hall, pp 301-309, 1989. Also available as PRL report 3, Digital Equipment Corp., Paris Research Laboratory, 85 Av. Victor Hugo, 92563 Rueil-Malmaison Cedex, France.
[BRV 92] P. Bertin, D. Roncin, J. Vuillemin: Programmable Active Memories: a Performance Assessment, report in preparation, Digital Equipment Corp., Paris Research Laboratory, 85 Av. Victor Hugo, 92563 Rueil-Malmaison Cedex, France, 1992.
[BW 89] D. A. Buell, R. L. Ward: A Multiprecise Integer Arithmetic Package, The Journal of Supercomputing 3, pp 89-107, Kluwer Academic Publishers, Boston, 1989.
[DK 91] S. R. Dussé, B. S. Kaliski Jr.: A Cryptographic Library for the Motorola DSP56000, Proceedings of EUROCRYPT '90, Springer LNCS 473, 1991.
[E90] S. Even: Systolic Modular Multiplication, Proceedings of Crypto '90, pp 619-624, Springer-Verlag, 1990.
[IWSD 92] P. A. Ivey, S. N. Walker, J. M. Stern, S. Davidson: An Ultra-High Speed Public Key Encryption Processor, Proceedings of the IEEE 1992 Custom Integrated Circuits Conference, Boston, Massachusetts, paper 19.6, 1992.
[K 81] D. E. Knuth: The Art of Computer Programming, vol. 2, Seminumerical Algorithms, Addison-Wesley, 1981.
[LLMP 92] A. K. Lenstra, H. W. Lenstra, M. S. Manasse, J. Pollard: The Factorization of the Ninth Fermat Number, Mathematics of Computation, to appear, 1992.
[Mo 85] P. L. Montgomery: Modular Multiplication Without Trial Division, Mathematics of Computation, 44(170):519-521, 1985.
[OK 91] H. Orup, P. Kornerup: A High-Radix Hardware Algorithm for Calculating the Exponential M^E Modulo N, 10th IEEE Symposium on Computer Arithmetic, pp 51-57, 1991.
[PV 90] J. Vuillemin, F. P. Preparata: Practical Cellular Dividers, IEEE Transactions on Computers, 39(5):605-614, 1990.
[QC 82] J.-J. Quisquater, C. Couvreur: Fast Decipherment Algorithm for RSA Public-Key Cryptosystem, Electronics Letters, 18(21):905-907, 1982.
[RSA 78] R. L. Rivest, A. Shamir, L. Adleman: A Method for Obtaining Digital Signatures and Public-Key Cryptosystems, CACM, 21(2):120-126, 1978.
[SBV 91] M. Shand, P. Bertin, J. Vuillemin: Hardware Speedups in Long Integer Multiplication, Computer Architecture News, 19(1):106-114, 1991.
[T 91] N. Takagi: A Radix-4 Modular Multiplication Hardware Algorithm Efficient for Iterative Modular Multiplications, 10th IEEE Symposium on Computer Arithmetic, pp 35-42, 1991.
[X] Xilinx: The Programmable Gate Array Data Book, Product Briefs, Xilinx, Inc., 1987-1992.
[Y 91] Y. Yacobi: Exponentiating Faster with Addition Chains, Proceedings of EUROCRYPT '90, Springer LNCS 473, 1991.
11 Acknowledgments
We thank P. Bertin, F. Morain, R. Razdan and the anonymous referee.