Implementing RLWE-based Schemes Using An RSA Co-Processor
Martin R. Albrecht 1∗, Christian Hanser 2†, Andrea Hoeller 2†,
Thomas Pöppelmann 3†, Fernando Virdia 1‡, Andreas Wallner 2†
1 Information Security Group, Royal Holloway, University of London, UK,
martin.albrecht@royalholloway.ac.uk, Fernando.Virdia.2016@rhul.ac.uk
2 Infineon Technologies Austria AG, firstname.lastname@infineon.com
3 Infineon Technologies AG, Germany, Thomas.Poeppelmann@infineon.com
1 Introduction
The development of an efficient quantum order-finding algorithm by Shor [Sho97] invali-
dated the quantum hardness of factoring and discrete logarithms in Abelian groups. Since
then, there has been a growing effort to develop new public-key encryption and signature
algorithms that can resist cryptanalysis using large-scale general quantum computers. The
resulting constructions are referred to as “quantum safe” or “post-quantum”. Popular
families are code-based, multivariate, isogeny-based and lattice-based cryptography.
In 2016 the US National Institute of Standards and Technology (NIST) started a
multi-year process to standardise post-quantum cryptographic schemes [Nat16].
Furthermore, the European Telecommunications Standards Institute (ETSI) created a
quantum-safe cryptography working group [CCD+15] and in 2016 Google conducted its
first at-scale test of post-quantum cryptography [Lan16]. Whatever we may think of the
∗ The research of Albrecht was supported by EPSRC grant “Bit Security of Learning with Errors for
Post-Quantum Cryptography and Fully Homomorphic Encryption” (EP/P009417/1) and by the European
Union PROMETHEUS project (Horizon 2020 Research and Innovation Program, grant 780701).
† The research of Hanser, Hoeller, Pöppelmann and Wallner was supported by European Union’s Horizon
2020 research and innovation programme under grant agreement No. 779391 (FutureTPM).
‡ The research of Virdia was supported by the EPSRC and the UK government as part of the Centre
for Doctoral Training in Cyber Security at Royal Holloway, University of London (EP/P009301/1).
Approach & outline. The key computational task in {Ring, Module}-LWE encryption/decryption
is to evaluate
$$\mathrm{MulAdd}\big(a(x), b(x), c(x), f(x), q\big) := a(x) \cdot b(x) + c(x) \bmod (f(x), q)$$
for polynomials $a(x), b(x), c(x) \in \mathbb{Z}_q[x]/(f(x))$. In this work, we realise the MulAdd
gadget using a combination of a variant of Kronecker substitution [VZGG13, p. 245] and
low-degree polynomial arithmetic in the spirit of Schönhage’s trick [Sch77]. Kronecker
substitution is a well-known and well-utilised technique in computer algebra to reduce
polynomial multiplication to integer multiplication. Briefly, we start from standard
Kronecker substitution by considering $a(2^\ell) \cdot b(2^\ell) + c(2^\ell) \bmod f(2^\ell)$ where e.g. $a(2^\ell)$
represents the integer obtained by evaluating $a(x)$ at $2^\ell$ for some sufficiently big integer
$\ell$. However, for typical parameter choices, e.g. those of Kyber or NewHope [ADPS16],
this strategy produces integers too large for our hardware multiplier to handle. Thus, in
Section 3 we adapt a technique of Harvey [Har09] to our use case. Harvey proposed Kronecker
variants which permit halving the required bit size of the integers being multiplied at the
cost of doubling the number of multiplications. This provides a worthwhile trade-off for
medium-sized integers where quasi-linear integer multiplication algorithms are not yet
competitive. However, in our context Harvey’s technique on its own still does not suffice to
reduce the integer operands to match our hardware multiplier. Thus, we utilise (low-degree)
polynomial arithmetic on top. Overall, we obtain an implementation which computes the
1 We stress that our variant of Kyber is not interoperable with Kyber as specified in [SAB+ 17]. The
main differences are choices for symmetric functions and that Kyber explicitly requires the usage of the
Number Theoretic Transform (NTT), which we cannot realise efficiently with our approach.
where $\approx_s$ means $\approx$ in each “slot” defined by multiples of $q$. This observation then gives rise
to the I-RLWE problem, which also permits packing several plaintext bits into one large
integer. In [Chu17], a reduction from Ring-LWE to I-RLWE is given, but this reduction
does not consider the noise distribution, only its size.³ In [Ham17], a variant of I-RLWE
over a pseudo-Mersenne field is given to instantiate an MLWE KEM. Similarly, [AJPS17]
can be considered as an integer variant of NTRU.
LWE, the noise distribution does not play a significant role if it provides enough entropy.
4 See http://www.commoncriteriaportal.org/products/#IC.
2 Preliminaries
For $x \in \mathbb{R}$, we write $\lfloor x \rceil$ to mean the closest integer to $x$ (where $\lfloor y + \tfrac{1}{2} \rceil := y + 1$ for $y \in \mathbb{Z}$).
For $a, b \in \mathbb{Z}$, we write $a \bmod^{(+)} b$ for the unique integer $\hat{a} \equiv a \bmod b$ such that $0 \le \hat{a} < b$.
We write $a \bmod^{(-)} b$ for the unique integer $\hat{a} \equiv a \bmod b$ such that $-b/2 \le \hat{a} < b/2$. We
extend this definition to tuples, vectors, matrices and polynomials $a$ over $\mathbb{Z}$ component-wise.
We also write $[a]_b$ for $a \bmod^{(+)} b$. We often write $\{a, \ldots, b\}$ to mean the set $[a, b] \cap \mathbb{Z}$.
We write
$$[\![\text{condition}]\!] := \begin{cases} 1 & \text{if condition is true,} \\ 0 & \text{if condition is false.} \end{cases}$$
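The notation above maps onto a few one-line helpers. The following Python sketch is only an illustration; the function names (`round_half_up`, `mod_plus`, `mod_minus`, `iverson`) are hypothetical and do not come from the paper.

```python
import math


def round_half_up(x: float) -> int:
    """Closest integer to x, where a tie y + 1/2 is rounded up to y + 1."""
    return math.floor(x + 0.5)


def mod_plus(a: int, b: int) -> int:
    """Representative of a modulo b in the range 0 <= r < b."""
    return a % b


def mod_minus(a: int, b: int) -> int:
    """Representative of a modulo b in the range -b/2 <= r < b/2."""
    r = a % b
    return r - b if 2 * r >= b else r


def iverson(condition: bool) -> int:
    """Iverson bracket: 1 if the condition holds, 0 otherwise."""
    return 1 if condition else 0
```

For instance, with the Kyber modulus used later, `mod_minus(7680, 7681)` gives the balanced representative of the residue class of 7680.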
2.2 Kyber
A recent construction relying on the MLWE problem is the Kyber Key Encapsulation Mechanism.
Kyber has been submitted to the NIST PQC standardisation process [SAB+17] and
a variant is also published as an academic paper [BDK+17]. It is defined by an intermediate
IND-CPA secure Public-Key Encryption (PKE) scheme which is then transformed to
an IND-CCA secure KEM using a generic transform [HHK17].⁵ We note that Kyber
unambiguously refers to the IND-CCA secure KEM, i.e. [SAB+17] does not formally
propose a public-key encryption scheme nor a KEM which only claims IND-CPA security.
5 We note that [SAB+17] does not include the Targhi-Unruh tag.
Definition 4 (Simplified Kyber.CPA following [BDK+17]; cf. [SAB+17]). For $n = 256$
let $k, n, q, \eta, d_t, d_u, d_v$ be positive integers. Let $\mathcal{M} = \{0, 1\}^n$ be the plaintext space, where
each message $m \in \mathcal{M}$ can be seen as a polynomial in $R$ with coefficients in $\{0, 1\}$. Define
the functions
let $\chi$ be a centered binomial distribution with support $\{-\eta, \ldots, \eta\}$, and let $\chi_n$ be the
distribution of polynomials of degree $n$ with entries independently sampled from $\chi$. Define
the public-key encryption scheme Kyber.CPA = (Kyber.CPA.Gen, Kyber.CPA.Enc,
Kyber.CPA.Dec) as in Algorithms 1, 2 and 3.
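To make the distribution $\chi$ concrete, the sketch below samples coefficients from a centered binomial distribution by subtracting the Hamming weights of two $\eta$-bit strings. It is an illustration under assumptions: the bit ordering and the way bytes are consumed are our own choices, not necessarily those of the Kyber reference code, and the function name `cbd` merely mirrors the CBD routine named later in the text.

```python
def cbd(buf: bytes, eta: int, n: int = 256) -> list:
    """Map 2*eta*n pseudorandom bits to n coefficients in {-eta, ..., eta}.

    Assumption: bits are read least-significant first from each byte.
    """
    bits = [(byte >> k) & 1 for byte in buf for k in range(8)]
    if len(bits) < 2 * eta * n:
        raise ValueError("not enough input bits")
    coeffs = []
    for i in range(n):
        chunk = bits[2 * eta * i: 2 * eta * (i + 1)]
        # Difference of two eta-bit Hamming weights is centered binomial.
        coeffs.append(sum(chunk[:eta]) - sum(chunk[eta:]))
    return coeffs
```

With $\eta = 4$ each coefficient consumes exactly one byte, so a 256-byte PRF output yields one polynomial.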
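1  $(\rho, \sigma) \xleftarrow{\$} \{0, 1\}^{256} \times \{0, 1\}^{256}$ ;
2  $\vec{A} \xleftarrow{\rho} R_q^{k \times k}$ ;
3  $(\vec{s}, \vec{e}) \xleftarrow{\sigma} \chi_n^k \times \chi_n^k$ ;
4  $\vec{t} \leftarrow \mathrm{Compress}_q(\vec{A}\vec{s} + \vec{e}, d_t)$ ;
5  return $pk_{CPA} := (\vec{t}, \rho)$, $sk_{CPA} := \vec{s}$ ;
Algorithm 1: Kyber.CPA.Gen.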
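Input: $sk_{CPA} = \vec{s}$
Input: $c = (\vec{u}, v)$
1  $\vec{u} \leftarrow \mathrm{Decompress}_q(\vec{u}, d_u)$ ;
2  $v \leftarrow \mathrm{Decompress}_q(v, d_v)$ ;
3  return $\mathrm{Compress}_q(v - \langle \vec{s}, \vec{u} \rangle, 1)$ ;
Algorithm 3: Kyber.CPA.Dec.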
In Kyber, the parameters that define the base ring $R_q$ are fixed at $n = 256$ and
$q = 7681$. The parameters that define key and ciphertext compression are also fixed and
set to $d_u = 11$, $d_v = 3$ and $d_t = 11$. The three different security levels are obtained by
different choices of $k$ and $\eta$. All relevant Kyber parameters are summarised in Table 1.
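The compression functions used in Algorithms 1 and 3 are not reproduced in this excerpt. As a hedged illustration, the sketch below assumes the usual Kyber definitions $\mathrm{Compress}_q(x, d) = \lfloor (2^d/q) \cdot x \rceil \bmod^{(+)} 2^d$ and $\mathrm{Decompress}_q(x, d) = \lfloor (q/2^d) \cdot x \rceil$, applied to a single coefficient; treat it as an assumption rather than the paper's exact text.

```python
Q = 7681  # Kyber modulus used in this paper


def compress(x: int, d: int, q: int = Q) -> int:
    """round((2^d / q) * x) mod 2^d, computed with integers only."""
    return (((x << d) + q // 2) // q) % (1 << d)


def decompress(y: int, d: int, q: int = Q) -> int:
    """round((q / 2^d) * y), computed with integers only (d >= 1)."""
    return (y * q + (1 << (d - 1))) >> d


def centered(a: int, q: int = Q) -> int:
    """Balanced representative of a modulo q."""
    r = a % q
    return r - q if 2 * r >= q else r
```

Round-tripping a coefficient through `compress` and `decompress` changes it by at most about $q/2^{d+1}$, which is why larger $d$ (e.g. $d_u = d_t = 11$) loses little information while $d_v = 3$ is much coarser.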
The performance of an implementation of Kyber depends highly on the speed of the
polynomial multiplication algorithm and the performance of the PRNG instantiations,
as a large amount of pseudorandom data is required when generating $\vec{A} \xleftarrow{\rho} R_q^{k \times k}$ or
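Input: $pk = (\vec{t}, \rho)$
1  $m \xleftarrow{\$} \{0, 1\}^{256}$ ;
2  $m \leftarrow H(m)$ ;
3  $(\hat{K}, r) \leftarrow G(m, H(pk))$ ;
4  $(\vec{u}, v) \leftarrow \mathrm{Kyber.CPA.Enc}(pk, m; r)$ ;
5  $c \leftarrow (\vec{u}, v)$ ;
6  $K \leftarrow H(\hat{K}, H(c))$ ;
7  return $(c, K)$ ;
Algorithm 5: Kyber.Encaps.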
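The data flow of Algorithm 5 can be sketched in a few lines of Python. This is a structural illustration only: `cpa_enc` is a hypothetical callable standing in for Kyber.CPA.Enc, inputs are byte strings, tuples are encoded by plain concatenation (an assumption), and H and G are instantiated with SHA3-256 and SHA3-512 as in the specification-compatible variant discussed in Section 5.

```python
import hashlib
import os


def H(*parts: bytes) -> bytes:
    """Hash H, here SHA3-256 over the concatenated inputs."""
    return hashlib.sha3_256(b"".join(parts)).digest()


def G(*parts: bytes):
    """Hash G, here SHA3-512 split into two 32-byte halves (K_hat, r)."""
    digest = hashlib.sha3_512(b"".join(parts)).digest()
    return digest[:32], digest[32:]


def encaps(pk: bytes, cpa_enc, m: bytes = None):
    """Follow lines 1-7 of Algorithm 5; cpa_enc(pk, m, r) is a placeholder."""
    if m is None:
        m = os.urandom(32)          # line 1: fresh 256-bit message
    m = H(m)                        # line 2
    k_hat, r = G(m, H(pk))          # line 3
    c = cpa_enc(pk, m, r)           # lines 4-5: deterministic given r
    key = H(k_hat, H(c))            # line 6
    return c, key                   # line 7
```

Because the encryption coins `r` are derived from `m` and `pk`, decapsulation can re-encrypt and compare ciphertexts, which is what makes the CCA variant's decryption slower later on.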
6 We refer to https://www.infineon.com/cms/de/product/security-smart-card-solutions/security-controllers/sle-78/ for more information on the SLE 78 family.
7 See http://www-304.ibm.com/jct01003c/common/ssi/ShowDoc.wss?docURL=/common/ssi/rep_sm/1/649/ENUS4767-_h01/index.html.
3 Kronecker
Kronecker substitution is a classical technique in computer algebra for reducing polynomial
arithmetic to large integer arithmetic, cf. [VZGG13, p. 245] and [Har09]. The fundamental
idea behind this technique is that univariate polynomial and integer arithmetic are identical
except for carry propagation in the latter. Thus, coefficients are simply packed into an
integer in such a way as to terminate any possible carry chain. For example, say, we want
to multiply two polynomials $f(x) := x + 2$ and $g(x) := 3x + 4$ in $\mathbb{Z}[x]$. We may write
$f(100) = 100 + 2 = 102$ and $g(100) = 300 + 4 = 304$. Multiplying gives $102 \cdot 304 = 31008$
or $3x^2 + 10x + 8$. In implementations, we use powers of two as evaluation points since this
permits efficient “packing” (polynomial to integer) and “unpacking” (integer to polynomial)
using only cheap bit shifts.
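The same example can be replayed with a power-of-two evaluation point. The Python sketch below is a minimal illustration under simplifying assumptions: coefficients are non-negative and `ell` is large enough that no product coefficient overflows its slot. The helper names `pack`, `unpack` and `ks_mul` are hypothetical.

```python
def pack(p, ell):
    """Evaluate the coefficient list p (lowest degree first) at 2^ell."""
    return sum(c << (ell * i) for i, c in enumerate(p))


def unpack(value, ell, count):
    """Read `count` slots of ell bits each from a non-negative integer."""
    mask = (1 << ell) - 1
    return [(value >> (ell * i)) & mask for i in range(count)]


def ks_mul(f, g, ell):
    """Multiply two polynomials via a single integer multiplication."""
    product = pack(f, ell) * pack(g, ell)
    return unpack(product, ell, len(f) + len(g) - 1)


# f(x) = x + 2 and g(x) = 3x + 4, stored lowest degree first.
print(ks_mul([2, 1], [4, 3], ell=8))
```

Here 8-bit slots suffice because the largest product coefficient is 10; signed coefficients and modular reduction need the additional care described in the remainder of this section.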
In this work, we employ Kronecker substitution for computing
$$\mathrm{MulAdd}\big(a(x), b(x), c(x), f(x)\big) := a(x) \cdot b(x) + c(x) \bmod f(x)$$
Input: $g \in \mathbb{Z}[x]$
Input: $f \in \mathbb{Z}[x]$
Input: bitlength $\ell$
1  return $g(2^\ell) \bmod^{(+)} f(2^\ell)$ ;
Algorithm 7: Snort(g, f, $\ell$).
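Algorithm 7 translates almost literally into Python, since Python integers are arbitrary precision and `%` with a positive modulus already returns the non-negative representative. The sketch below is an illustration only; polynomials are assumed to be coefficient lists with the lowest degree first.

```python
def evaluate(p, x):
    """Evaluate the coefficient list p (lowest degree first) at integer x."""
    return sum(c * x ** i for i, c in enumerate(p))


def snort(g, f, ell):
    """Pack g: evaluate at 2^ell and reduce into [0, f(2^ell))."""
    point = 1 << ell
    return evaluate(g, point) % evaluate(f, point)


# f(x) = x^4 + 1 with 8-bit slots gives the modulus 2^32 + 1.
F_POLY = [1, 0, 0, 0, 1]
print(snort([3, -2, 0, 0], F_POLY, 8))
```

Negative coefficients wrap around the modulus, which is why the unpacking side has to track a possible extra multiple of $f(2^\ell)$.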
$A := \mathrm{Snort}(a, f, \ell)$,
$B := \mathrm{Snort}(b, f, \ell)$,
$C := \mathrm{Snort}(c, f, \ell)$,
and
$D := A \cdot B + C \bmod^{(+)} f(2^\ell)$,
then $\mathrm{Sneeze}(D, f, \ell)$ returns $\{r^{(i)}\}_{i=0}^{n-1}$ where $r^{(i)} = d_i$ for $i \in \{0, \ldots, n-1\}$.
Corollary 1 (Power of two cyclotomic). Let $\alpha, \beta, \gamma$ be as above, let $n$ be a power of 2,
and let $f(x) = x^n + 1$. Let $\delta := n\alpha\beta + \gamma$. Then Lemma 1 applies.
Proof. See Appendix A.
Corollary 2 (Prime cyclotomic). Let $\alpha, \beta, \gamma$ be as above, let $n = p - 1$ where $p$ is prime,
and let $f = \sum_{i=0}^{n} x^i$. Let $\delta := (2n - 1)\alpha\beta + \gamma$. Then Lemma 1 applies.
Proof. See Appendix A.
Proof (Proof of Lemma 1). We need to uniquely encode any possible $d$ as an integer
modulo $f(2^\ell)$. Since the coefficients $d_i$ are $\ell$ bits long, and we need to store $n$ of them,
this means that we require $f(2^\ell) > 2^{n\ell} - 1$.
When Sneeze is called, we set
where the last equality is over the integers, for some $b \in \mathbb{Z}$. Given that
$$\sum_{i=0}^{n-1} d_i 2^{\ell i} \le \delta \, \frac{2^{n\ell} - 1}{2^\ell - 1} \le (2^{\ell-1} - 1) \, \frac{2^{n\ell} - 1}{2^\ell - 1} < 2^{n\ell-1},$$
the assumption that $f(2^\ell) > 2^{n\ell} - 1 > 2^{n\ell-1}$ implies that $b \in \{0, 1\}$.
The main computation in Sneeze is done between lines 3 and 11, hence we define
some conditions on the output of that loop and prove they hold by induction.
Claim. After step $i \in \{0, \ldots, n-1\}$, we have
$$r^{(i)} = d_i + b f_i \tag{2}$$
and
$$G^{[i]} = \sum_{j=i+1}^{n-1} d_j 2^{\ell(j-i-1)} + b \sum_{j=i+1}^{n} f_j 2^{(j-i-1)\ell} \tag{3}$$
for some $t_i \in \mathbb{Z}$ such that $e^{(i)} \in \{0, \ldots, 2^\ell - 1\}$. Similar to before, by definition of $\ell$ and
the fact that $b \in \{0, 1\}$, we have
$$|d_i + b f_i| \le \delta + \varphi < 2^{\ell-1} \quad \text{for all } i \in \{0, \ldots, n-1\}. \tag{4}$$
Hence $t_i \in \{0, 1\}$ for $i < n$. We then set
$$G^{[i]} = \frac{G^{[i-1]} - e^{(i)}}{2^\ell} = \sum_{r=i+1}^{n-1} d_r 2^{\ell(r-i-1)} + b \sum_{j=i+1}^{n} f_j 2^{(j-i-1)\ell} - t_i$$
and balance $e^{(i)} \pmod{2^\ell}$. By the size consideration made in Inequality 4, this amounts
to subtracting $t_i 2^\ell$ from $e^{(i)}$. We keep account of this subtraction by adding back $t_i$ to
$x^{(i)}$. Finally, we assign $r^{(i)} \leftarrow e^{(i)}$. Hence Conditions 2, 3 hold for step $i \ge 1$. Similarly,
we can see that Conditions 2, 3 also hold for step $i = 0$, proving the claim.
By Condition 3, after step $i = n - 1$ we have $G^{[n-1]} = b < 2^\ell$, which would become
the coefficient of an $n$th power of $x$ in $d$. Line 12 takes care of reducing this modulo $f$,
which results in assigning
$$r^{(i)} \leftarrow r^{(i)} - f_i G^{[n-1]} = d_i + b f_i - f_i b = d_i \quad \text{for all } i < n,$$
completing the proof.
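The unpacking argument of the proof can be exercised numerically. Since Algorithm 8 is not reproduced in this excerpt, the following Python sketch is our own simplified rendering of the idea for the special case $f = x^n + 1$ (so $f_0 = f_n = 1$): peel off one $\ell$-bit slot at a time, balance it, push the carry into the remaining integer, and finally remove the leftover bit $b$ from the constant coefficient. All function names are hypothetical.

```python
def snort(g, n, ell):
    """Pack g at 2^ell, reduced into [0, F) for F = 2^(n*ell) + 1."""
    modulus = (1 << (n * ell)) + 1
    return sum(c << (ell * i) for i, c in enumerate(g)) % modulus


def sneeze(packed, n, ell):
    """Recover n signed coefficients from an integer in [0, 2^(n*ell) + 1)."""
    base = 1 << ell
    coeffs = []
    rest = packed
    for _ in range(n):
        e = rest % base
        rest //= base
        if e >= base // 2:      # balance the slot into [-2^(ell-1), 2^(ell-1))
            e -= base
            rest += 1           # account for the borrowed 2^ell
        coeffs.append(e)
    coeffs[0] -= rest           # rest is b in {0, 1}: reduce b * x^n modulo x^n + 1
    return coeffs


def negacyclic_muladd(a, b, c):
    """Reference a*b + c modulo x^n + 1 by schoolbook multiplication."""
    n = len(a)
    d = list(c)
    for i in range(n):
        for j in range(n):
            if i + j < n:
                d[i + j] += a[i] * b[j]
            else:
                d[i + j - n] -= a[i] * b[j]
    return d


n, ell = 4, 8
a, b, c = [1, -2, 3, 0], [0, 1, -1, 2], [1, 1, -1, 0]
modulus = (1 << (n * ell)) + 1
packed = (snort(a, n, ell) * snort(b, n, ell) + snort(c, n, ell)) % modulus
print(sneeze(packed, n, ell), negacyclic_muladd(a, b, c))
```

With $\alpha = 3$, $\beta = 2$, $\gamma = 1$ the bound of Corollary 1 is $\delta = 25$, comfortably below $2^{\ell-1} = 128$, so both printed lists should agree.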
Since operating on $G^{[i]}$ involves integer arithmetic on $n\ell$-bit integers, we may modify
Algorithm 8 to correct carries on $e^{(i)}$ in order to avoid executing line 8 of Algorithm 8.
This variant of the algorithm is given as Algorithm 9. Note that with this change the only
large integer operations are division with remainder modulo $2^\ell$ and thus cheap, while the
final output of the algorithm is the same.
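The proof of Lemma 1 can be directly adapted to the MLWE setting where we let
$$\Big\{ a_i = \sum_{j=0}^{n-1} a_{i,j} x^j \Big\}_{i=1}^{\kappa}, \quad \Big\{ b_i = \sum_{j=0}^{n-1} b_{i,j} x^j \Big\}_{i=1}^{\kappa}, \quad c = \sum_{j=0}^{n-1} c_j x^j$$
with $a_{i,j} \in \{-\alpha, \ldots, \alpha\}$, $b_{i,j} \in \{-\beta, \ldots, \beta\}$, and $c_j \in \{-\gamma, \ldots, \gamma\}$ and want to compute
$\sum_{i=1}^{\kappa} a_i \cdot b_i + c \pmod{f}$, by letting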
performing two multiplications. Note that integer arithmetic is super-linear (e.g. Karatsuba
multiplication is used for medium-sized inputs and has a cost of $3^{\log_2 L}$ for integers of size
$L$, see below) and thus this trade-off produces a noticeable speed-up. The two techniques
are orthogonal and can be combined, which reduces bit sizes by a factor of four at the cost
of increasing the number of multiplications to four. The combined algorithm is referred to
as KS4.
The KS2 algorithm proceeds as follows. Assume $a(x), b(x)$ are such that their product
$c(x) := a(x) \cdot b(x)$ has positive coefficients bounded by $2^{2\ell}$. Let
$$c^{(+)} := c(2^\ell) = a(2^\ell) \cdot b(2^\ell) = \sum_{[i]_2 = 0} c_i 2^{i\ell} + \sum_{[i]_2 = 1} c_i 2^{i\ell}$$
$$c^{(-)} := c(-2^\ell) = a(-2^\ell) \cdot b(-2^\ell) = \sum_{[i]_2 = 0} c_i 2^{i\ell} - \sum_{[i]_2 = 1} c_i 2^{i\ell}$$
since the sum and the difference cancel out either the even or the odd powers. The
coefficients can be either read directly with care to their offset, or dividing the above
quantities by the appropriate power of 2 over the integers.
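The recovery step can be illustrated over the integers, without modular reduction. In the Python sketch below (an illustration with hypothetical helper names), the half-sum of $c^{(+)}$ and $c^{(-)}$ isolates the even-index coefficients and the half-difference isolates the odd-index ones, so each can be read from slots of width $2\ell$ even though the evaluation point is only $2^\ell$.

```python
def evaluate(p, x):
    """Evaluate the coefficient list p (lowest degree first) at integer x."""
    return sum(c * x ** i for i, c in enumerate(p))


def ks2_mul(a, b, ell):
    """Multiply a and b using evaluations at 2^ell and -2^ell.

    Assumes every product coefficient lies in [0, 2^(2*ell)).
    """
    point = 1 << ell
    c_plus = evaluate(a, point) * evaluate(b, point)      # c(2^ell)
    c_minus = evaluate(a, -point) * evaluate(b, -point)   # c(-2^ell)
    even = (c_plus + c_minus) // 2   # sum over even i of c_i * 2^(i*ell)
    odd = (c_plus - c_minus) // 2    # sum over odd i of c_i * 2^(i*ell)
    mask = (1 << (2 * ell)) - 1
    result = []
    for i in range(len(a) + len(b) - 1):
        source = even if i % 2 == 0 else odd
        result.append((source >> (i * ell)) & mask)
    return result


print(ks2_mul([1, 2, 3], [4, 5, 6], ell=4))
```

In this example the largest product coefficient is 28, which needs 5 bits, yet the operands are packed with only 4 bits per coefficient; the price is the second integer multiplication.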
The KS2 algorithm is compatible with arithmetic modulo $f = x^n + 1$, when $n$ is even.
When doing this over $\mathbb{Z}_{f(2^\ell)}$ some care must be taken since reducing $c^{(\cdot)}$ modulo $f(2^\ell)$
may change its parity. In such a case the coefficients can be recovered by either multiplying
$c^{(+)} + c^{(-)}$ by $2^{-1} \bmod^{(+)} f(2^\ell)$ and $c^{(+)} - c^{(-)}$ by $2^{-\ell-1} \bmod^{(+)} f(2^\ell)$, or multiplying
both quantities by a desired power of 2 modulo $f(2^\ell)$ and reading the coefficients with the
appropriate offset. Packing and unpacking are identical to standard Kronecker substitution,
i.e. the proof of Lemma 1 applies directly when working with such an $f$.
On the other hand, adapting packing and unpacking to combine the KS3 algorithm
with modular reduction is somewhat more involved, requiring a fair amount of careful bit
shifting. Implementing this strategy would roughly halve the number of multiplications
required at the cost of a more involved packing/unpacking algorithm. However, since our
packing and unpacking routines already take time comparable to the actual multiplications
they facilitate and since our target platform does not have efficient bit-shift operations, we
did not attempt an implementation of KS3 or KS4.
$$(a_3 b_0 + a_2 b_1 + a_1 b_2 + a_0 b_3)\, x^3 + (a_2 b_0 + a_1 b_1 + a_0 b_2 - a_3 b_3)\, x^2 + (a_1 b_0 + a_0 b_1 - a_3 b_2 - a_2 b_3)\, x + a_0 b_0 - a_3 b_1 - a_2 b_2 - a_1 b_3$$
but we have a multiplier that would only let us work modulo $x^2 + 1$ given the $\ell$ required
by Lemma 1. Letting $y = x^2$, we can write $a(x, y) = a^{(0)}(y) + a^{(1)}(y)\, x$ where
and similarly for $b = b(x, y)$. Then, computing $a(x, y) \cdot b(x, y) \pmod{y^2 + 1}$ can be
if we were to unpack the coefficients of $\hat{C}(x)$. Note that the coefficients on the second
line match our target, but the coefficients on the first line do not (they are not grouped
correctly and the signs do not necessarily match). This can be corrected by using the
identity $y = x^2$ and thus rewriting $x^2 \to y$ and reducing again modulo $y^2 + 1$. From
our intermediate representation $\hat{C}(x) = \hat{C}_0 + \hat{C}_1 x + \hat{C}_2 x^2$, this can be done by defining
$C(x) = C_0 + C_1 x$ with
$$C_0 := \hat{C}_0 + \big(2^\ell \cdot \hat{C}_2 \bmod^{(+)} (2^{2\ell} + 1)\big) \bmod^{(+)} (2^{2\ell} + 1) \quad \text{and} \quad C_1 = \hat{C}_1,$$
and then unpacking $C(x)$ to obtain $a \cdot b \pmod{x^4 + 1}$.
More generally, this can be formally described as follows. Let $n = m \cdot \omega$. Given a
polynomial $p(x) = \sum_{i=0}^{n-1} p_i x^i$ of degree $< n$, we can set $y = x^m$, and then rewrite $p$ as
$$p(x, y) = \big(p_0 + p_{0+m}\, y + \cdots + p_{0+(\omega-1)m}\, y^{\omega-1}\big)\, x^0 + \cdots + \big(p_{m-1} + p_{m-1+m}\, y + \cdots + p_{m-1+(\omega-1)m}\, y^{\omega-1}\big)\, x^{m-1}$$
$\leftarrow \mathrm{Ff}(p, m, i)$, cf. Algorithm 10). The idea is to pack each $p^{(i)}$, $p \in \{a, b, c\}$, into buffers
$P^{(i)} := p^{(i)}(2^\ell) \bmod^{(+)} (2^{\omega\ell} + 1)$ of length $\omega\ell + 1$, and then evaluate
$$a(x, 2^\ell) \cdot b(x, 2^\ell) + c(x, 2^\ell) \bmod^{(+)} (2^{\omega\ell} + 1),$$
where $p(x, 2^\ell) \equiv \sum_{i=0}^{m-1} P^{(i)} x^i$. By Lemma 1, the integer modulo operation will act on the
coefficients as reduction modulo $y^\omega + 1 \equiv x^n + 1 \pmod{y - x^m}$ would.
Working with polynomials $a(x, y), b(x, y)$, the resulting polynomial $a(x, y) \cdot b(x, y)$ will
be a linear combination of monomials of the form $y^i x^j$. If we were to substitute $x^m = y$
back now, we would obtain monomials of degree $\ge n$ every time that $im + j \ge n$, which
we do not want. Furthermore, depending on how we index the $y^i x^j$ in our code, we
may need to combine (“group”) constant coefficients from different monomials
$y^i x^j \ne y^r x^s$ mapping to the same power of $x$.
To better see what adjustments need to be done to the resulting polynomial in $x$, we
look at $a(x, y) \cdot b(x, y) \pmod{y^\omega + 1}$ in detail.
$$
\begin{aligned}
a(x, y) \cdot b(x, y) &= \sum_{i,r=0}^{m-1} a^{(i)}(y)\, b^{(r)}(y)\, x^{i+r} \\
&= \sum_{i,r=0}^{m-1} \sum_{j,s=0}^{\omega-1} a_{i+jm}\, b_{r+sm}\, y^{j+s} x^{i+r} \\
&\equiv \sum_{i,r=0}^{m-1} \sum_{j,s=0}^{\omega-1} (-1)^{[\![ j+s \ge \omega ]\!]}\, a_{i+jm}\, b_{r+sm}\, y^{[j+s]_\omega} x^{i+r} \pmod{y^\omega + 1}
\end{aligned}
$$
Given that $y^{[j+s]_\omega} x^{i+r} \equiv x^{m \cdot [j+s]_\omega + i + r} \pmod{y - x^m}$, we can see that after reducing
modulo $y^\omega + 1$ it will be necessary to further reduce modulo $y - x^m$ whenever
$m \cdot [j+s]_\omega + i + r \ge n$, which can happen only if $i + r \ge m$. We do this by sending any monomial $y^j x^i$
where $i \ge m$ to $y^{j+1} x^{i-m} \pmod{y^\omega + 1}$, or equivalently by mapping monomials $x^i$ with
$i \ge m$ to $2^\ell x^{i-m}$, as done in Line 11 of Algorithm 11. This also takes care of groupings.
Then, we can simply Sneeze every coefficient to obtain the final result. The full procedure
results in Algorithms 10 and 11.
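Since Algorithm 11 is not reproduced in this excerpt, the following Python sketch is our own compact rendering of the splitting procedure described above, for $f = x^n + 1$ and $n = m \cdot \omega$: split each input into $m$ strided sub-polynomials, pack each at $2^\ell$ modulo $2^{\omega\ell} + 1$, multiply the resulting degree-$(m-1)$ polynomials with schoolbook arithmetic over the integers modulo $2^{\omega\ell} + 1$, fold powers $x^i$ with $i \ge m$ back by multiplying with $2^\ell$, and unpack. Function names are hypothetical and the schoolbook inner product is only one possible choice.

```python
def ff(p, m, o):
    """Every m-th coefficient of p, starting at offset o."""
    return p[o::m]


def snort(g, ell, modulus):
    """Pack g at 2^ell and reduce into [0, modulus)."""
    return sum(c << (ell * i) for i, c in enumerate(g)) % modulus


def sneeze(packed, count, ell):
    """Unpack `count` signed coefficients modulo 2^(count*ell) + 1."""
    base = 1 << ell
    out, rest = [], packed
    for _ in range(count):
        e = rest % base
        rest //= base
        if e >= base // 2:
            e -= base
            rest += 1
        out.append(e)
    out[0] -= rest              # leftover bit times y^omega, and y^omega = -1
    return out


def split_muladd(a, b, c, m, ell):
    """a*b + c modulo x^n + 1 using integers of about (n/m)*ell bits."""
    n = len(a)
    omega = n // m
    modulus = (1 << (omega * ell)) + 1
    A = [snort(ff(a, m, i), ell, modulus) for i in range(m)]
    B = [snort(ff(b, m, i), ell, modulus) for i in range(m)]
    acc = [snort(ff(c, m, i), ell, modulus) for i in range(m)] + [0] * (m - 1)
    for i in range(m):          # schoolbook product in x
        for r in range(m):
            acc[i + r] = (acc[i + r] + A[i] * B[r]) % modulus
    for i in range(m, 2 * m - 1):   # x^i -> 2^ell * x^(i-m) for i >= m
        acc[i - m] = (acc[i - m] + (acc[i] << ell)) % modulus
    result = [0] * n
    for i in range(m):
        for j, coeff in enumerate(sneeze(acc[i], omega, ell)):
            result[i + j * m] = coeff
    return result


def negacyclic_muladd(a, b, c):
    """Schoolbook reference for a*b + c modulo x^n + 1."""
    n = len(a)
    d = list(c)
    for i in range(n):
        for j in range(n):
            sign = 1 if i + j < n else -1
            d[(i + j) % n] += sign * a[i] * b[j]
    return d
```

For $n = 8$ the call `split_muladd(a, b, c, m=2, ell=10)` works with 41-bit integers instead of 81-bit ones, and should match `negacyclic_muladd(a, b, c)` as long as the coefficient bound of Lemma 1 holds for the chosen `ell`.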
A possible optimisation could be that of choosing $\ell$ more aggressively. Indeed, we only
ever need to pack polynomials of degree $\omega$, and hence we could use this value in place of
$n$. This would save $\approx \log m$ bits per packed coefficient while still being able to perform
the reduction modulo $y^\omega + 1 \equiv 2^{\omega\ell} + 1$, overall resulting in a saving of size $\approx \omega \log m$
per packed polynomial $p^{(i)}(y)$. In this case one would need to unpack the $P^{(i)}$ before the
second reduction and final grouping, and handle these afterwards on the CPU.
Input: polynomial $g \in R$
Input: step size $m$, dividing $n$
Input: offset $o$
1  $\omega \leftarrow n/m$ ;
2  return $\sum_{j=0}^{\omega-1} g_{m \cdot j + o}\, x^j$ ;
Algorithm 10: Ff(g, m, o). Return a new polynomial containing every $m$th coefficient
of $g$, starting at offset $o$.
5 Implementation
Using the strategies outlined in Sections 3 and 4, we are now ready to fix an implementation
of Kyber and the KyberMulAdd gadget (see Corollary 3) using a big integer multiplier.
We focus on the Kyber768 parameter set (or more precisely a variant) and implement it on
the Infineon SLE 78 (SLE78CLUFX5000) equipped with an RSA, an AES and a SHA-256
co-processor and 16 Kbyte RAM. All our software is native code and written in C and
assembly language.
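1  $\rho \xleftarrow{\$} \mathrm{TRNG}()$ ;  // $\rho \in \{0, 1\}^{256}$ sampled from internal TRNG
2  $\sigma \xleftarrow{\$} \mathrm{TRNG}()$ ;  // $\sigma \in \{0, 1\}^{256}$ sampled from internal TRNG
3  $N \leftarrow 0$ ;
   // Sample $\vec{s}$ and transform to $\vec{S}$
4  for $i = k-1, k-2, \ldots, 0$ do
5      $s_{tmp} \leftarrow \mathrm{CBD}(\mathrm{PRF}(\sigma, N))$ ;
6      $N \leftarrow N + 1$ ;
7      $S_i \leftarrow \mathrm{Snort}(s_{tmp})$ ;
8  end
   // Compute $\vec{A}\vec{s} + \vec{e}$
9  for $i = 0, 1, \ldots, k-1$ do
10     $e \leftarrow \mathrm{CBD}(\mathrm{PRF}(\sigma, N))$ ;
11     $N \leftarrow N + 1$ ;
12     $\hat{T} \leftarrow \mathrm{Snort}(e)$ ;
13     for $j = 0, 1, \ldots, k-1$ do
14         $a_{tmp} \leftarrow \mathrm{Parse}(\mathrm{XOF}(\rho \| i \| j))$ ;
15         $A_{tmp} \leftarrow \mathrm{Snort}(a_{tmp})$ ;
16         $\hat{T} \leftarrow \mathrm{MulAddSingle}(A_{tmp}, S_j, \hat{T})$ ;
17     end
18     $T \leftarrow \mathrm{FinalEll}(\hat{T})$ ;
19     $t_i \leftarrow \mathrm{Sneeze}(T)$ ;
20 end
21 $pk \leftarrow \mathrm{Encode}_{d_t}(\mathrm{Compress}_q(\vec{t}, d_t) \| \rho)$ ;
22 $sk \leftarrow \mathrm{Encode}_{13}(\vec{s} \bmod^{(+)} q)$ ;
23 return $pk_{CPA} := pk$, $sk_{CPA} := sk$ ;
Algorithm 12: Kyber.CPA.Imp.Gen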
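Input: $m \in \mathcal{M}$
Input: $pk_{CPA}$
1  $\vec{t}, \rho \leftarrow \mathrm{Decode}_{d_t}(pk_{CPA})$ ;
2  $\vec{t} \leftarrow \mathrm{Decompress}_q(\vec{t})$ ;
3  $N \leftarrow 0$ ;
   // Sample MLWE secret $\vec{r}$ and transform to $\vec{R}$
4  for $i = k-1, k-2, \ldots, 0$ do
5      $r_{tmp} \leftarrow \mathrm{CBD}(\mathrm{PRF}(\sigma, N))$ ;
6      $N \leftarrow N + 1$ ;
7      $R_i \leftarrow \mathrm{Snort}(r_{tmp})$ ;
8  end
   // Compute $\vec{A}^T \vec{r} + \vec{e}_1$
9  for $i = 0, 1, \ldots, k-1$ do
10     $e \leftarrow \mathrm{CBD}(\mathrm{PRF}(\sigma, N))$ ;
11     $\hat{U}_{tmp} \leftarrow \mathrm{Snort}(e)$ ;
12     $N \leftarrow N + 1$ ;
13     for $j = 0, 1, \ldots, k-1$ do
14         $a_{tmp} \leftarrow \mathrm{Parse}(\mathrm{XOF}(\rho \| i \| j))$ ;
15         $A_{tmp} \leftarrow \mathrm{Snort}(a_{tmp})$ ;
16         $\hat{U}_{tmp} \leftarrow \mathrm{MulAddSingle}(A_{tmp}, R_j, \hat{U}_{tmp})$ ;
17     end
18     $U_{tmp} \leftarrow \mathrm{FinalEll}(\hat{U}_{tmp})$ ;
19     $u_i \leftarrow \mathrm{Sneeze}(U_{tmp})$ ;
20 end
   // Compute $\langle \vec{t}, \vec{r} \rangle + e_2$
21 $\bar{m} \leftarrow \mathrm{EncodeMsg}(m)$ ;
22 $e \leftarrow \mathrm{CBD}(\mathrm{PRF}(\sigma, N))$ ;
23 $e \leftarrow e + \bar{m}$ ;
24 $\hat{V} \leftarrow \mathrm{Snort}(e)$ ;
25 for $i = 0, 1, \ldots, k-1$ do
26     $T_{tmp} \leftarrow \mathrm{Snort}(t_i)$ ;
27     $\hat{V} \leftarrow \mathrm{MulAddSingle}(R_i, T_{tmp}, \hat{V})$ ;
28 end
29 $V \leftarrow \mathrm{FinalEll}(\hat{V})$ ;
30 $v \leftarrow \mathrm{Sneeze}(V)$ ;
   // Encode ciphertext
31 $c_1 \leftarrow \mathrm{Encode}_{d_u}(\mathrm{Compress}_q(\vec{u}, d_u))$ ;
32 $c_2 \leftarrow \mathrm{Encode}_{d_v}(\mathrm{Compress}_q(v, d_v))$ ;
33 return $c := (c_1 \| c_2)$ ;
Algorithm 13: Kyber.CPA.Imp.Enc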
multiplication (see Section 5.3 and Section 5.4). None of our implementations is fully
compatible with the specification, as Kyber is explicitly defined with a specific NTT and
assumes that the pseudorandom polynomials of $\vec{A}$ are already output by the sampler in
the NTT domain.
To expand randomness into a longer bitstream, Kyber originally specifies the use of
various instances from the SHA3 family as PRNG (originally, XOF is SHAKE-128 and
PRF is SHAKE-256). We implemented one version of the samplers that is compatible with
the specification where SHAKE-128 and SHAKE-256 are realised in software. Hardware
acceleration is not possible as our target device does not have a SHA3 hardware accelerator.
The SHA3 implementation written in C has been optimised to some extent with assembly to
remove obvious performance bottlenecks introduced by the compiler. Additionally, we have
implemented a (non-compatible) Kyber variant that is using AES-256 in counter mode to
implement XOF and PRF. A similar approach has been used by Google in their NewHope
experiment where the constant polynomial a was also sampled using AES [Lan16]. Even
though there are some theoretical concerns [ADPS16], this approach appears to be secure
in practice. When AES-256 is chosen as PRNG we can rely on the AES co-processor of
the SLE78CLUFX5000 and do not need to implement AES in software.
A difference that is not noticeable by a user is that we, as previously mentioned, do
not hash the randomness provided to key generation due to the availability of a TRNG.
The hashing of the input randomness in the Kyber specification is intended as a protection
against leakage of the internal state of a random number generator. However, on our
target device we have access to a certified RNG with appropriate post-processing and thus
expensive computation of SHA3-512 is unnecessary.
The implementations of CBD, Parse, Encode, Decode and Decompress$_q$ follow
the C reference implementation and are not particularly optimised using assembly. Our
implementation of CCA-secure Kyber using the FO transformation is denoted as
Kyber.CCA.Imp.Gen for key generation, Kyber.CCA.Imp.Enc for encapsulation and
Kyber.CCA.Imp.Dec for decapsulation, and we straightforwardly follow Algorithms 4 to 6.
The main additional operations demanded by the CCA conversion are the computation
of hash functions to implement random oracles. In one version of our implementation
we follow the specification where H is using SHA3-256 and G is using SHA3-512 and
where SHA3 is implemented in software. Additionally, we implemented a variant where G
is realised by the HMAC-based scheme HKDF [Kra10] using a SHA-256 co-processor and
where H is realised by a call to SHA-256. The usage of HKDF is necessary as the output
of G has to be longer than a single SHA-256 hash.
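The reason HKDF is needed can be seen in a small sketch: SHA-256 yields 32 bytes, whereas G must deliver two 32-byte values. The Python code below follows the generic extract-and-expand structure of HKDF with HMAC-SHA-256 using only the standard library. The empty salt and info strings and the way the 64 bytes are split are assumptions for illustration; the paper does not state these details.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, length: int, salt: bytes = b"", info: bytes = b"") -> bytes:
    """HKDF extract-and-expand with HMAC-SHA-256 (length <= 255 * 32)."""
    if not salt:
        salt = bytes(32)                      # default salt: all-zero block
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()       # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                                 # expand
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def H(data: bytes) -> bytes:
    """H realised by a single SHA-256 call (32-byte output)."""
    return hashlib.sha256(data).digest()


def G(data: bytes):
    """G realised by HKDF: 64 bytes, split into two 32-byte halves."""
    out = hkdf_sha256(data, 64)
    return out[:32], out[32:]
```

On the target device both primitives would be driven through the SHA-256 co-processor rather than a software library.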
bits in total. However, to simplify the packing algorithm we have chosen 32 bits per
coefficient (thus $\ell = 32$) which leads to integers of $64 \cdot 32 = 2048$ bits. This way no shifts
by arbitrary integers are required as everything is immediately word aligned in Snort.
This provides a performance advantage as the SLE 78 needs one cycle for each shift to the
right or left. Moreover, the big integer multiplier is relatively fast and thus the tradeoff
between simpler packing/unpacking and slightly larger integer coefficients turned out to
be favorable. However, on different platforms this may not be the case. An issue that
costs some performance is the correct handling of carry bits caused by negative coefficients
in Snort.
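The carry handling mentioned above can be sketched as follows. This Python illustration is not the device code: it assumes $\ell = 32$, signed coefficients of magnitude below $2^{31}$ and the modulus $2^{32n} + 1$, writes one coefficient per 32-bit word, propagates a borrow for negative coefficients, and finally corrects a leftover borrow using $-2^{32n} \equiv 1$. The names `snort_words` and `words_to_int` are hypothetical.

```python
WORD = 32
MASK = (1 << WORD) - 1


def snort_words(g):
    """Pack signed coefficients into 32-bit words modulo 2^(32*len(g)) + 1.

    Returns len(g) + 1 words (lowest first); the extra top word is 0 or 1.
    """
    words = []
    borrow = 0
    for c in g:
        t = c + borrow
        if t < 0:               # negative slot: wrap and borrow from the next word
            t += 1 << WORD
            borrow = -1
        else:
            borrow = 0
        words.append(t)
    words.append(0)
    if borrow:                  # value is W - 2^(32n), congruent to W + 1
        i = 0
        while words[i] == MASK:
            words[i] = 0
            i += 1
        words[i] += 1
    return words


def words_to_int(words):
    """Reassemble an integer from 32-bit words, lowest word first."""
    return sum(w << (WORD * i) for i, w in enumerate(words))
```

No shift by a non-multiple of the word size is ever needed, which is the property the text relies on for a platform without cheap multi-bit shifts.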
For a single big integer multiplication in MulAddSingle we use the RSA co-processor
on the SLE78CLUFX5000, which has five registers, each slightly longer than 2048 bits.
In a simplified model it is able to compute additions of two registers in 8 cycles while
a multiplication with modular reduction takes roughly 9,300 cycles. However, not all
registers are general purpose. One register is a working register that contains the result of
a computation and is not directly accessible from the CPU. Another register is needed
to store the modulus when performing operations modulo $p$. Thus three registers are
available for temporary results or operands. Naturally, for an integer multiplication modulo
$p$ with $\log_2 p = 2048$, two registers are already occupied with operands.
For KS1 with parameters $(\omega, m) = (64, 4)$ and $\ell = 32$ one option to realise the
polynomial multiplication $\hat{C}(x) \leftarrow A(x) \cdot B(x) \bmod^{(+)} F$ for $A, B, C \in \mathbb{Z}_p$ with $p = F =
2^{\omega\ell} + 1 = 2^{2048} + 1$ described in line 8 of Algorithm 11 would be schoolbook multiplication.
As we have to do polynomial arithmetic modulo $x^4 + 1$ this would lead to $4^2 = 16$
multiplications in $\mathbb{Z}_p$ due to the quadratic complexity of schoolbook multiplication. To
reduce the number of multiplications we have chosen Karatsuba multiplication for our
KS1 implementation of the MulAddSingle function, which leads to 9 multiplications,
17 additions and 16 subtractions in $\mathbb{Z}_p$. These numbers include additions or subtractions
required for the modulo $x^4 + 1$ operation. In general, Karatsuba multiplication leads
to a large number of additions as a trade-off for fewer multiplications. An approach
where the additions are executed on the RSA co-processor would be possible but requires
a lot of transfers. We thus decided to exploit the ability to run the co-processor and
the CPU in parallel. While the RSA co-processor executes a modular multiplication
we compute long integer additions in parallel on the CPU. This can easily be achieved
by the appropriate rearrangement of multiplication and addition/subtraction operations
in the Karatsuba formula. For simplicity, we give a short example for $a(x) = a_0 + a_1 x$
and $b(x) = b_0 + b_1 x$. A polynomial multiplication can be computed with Karatsuba as
$a(x) b(x) = a_0 b_0 + \big((a_1 + a_0)(b_1 + b_0) - a_1 b_1 - a_0 b_0\big)\, x + a_1 b_1 x^2$. Here some additions can
be performed in parallel to multiplications: $T_1 = a_1 \cdot b_1$ and $T_2 = b_1 + b_0$ are computed
in parallel, then $T_3 = a_0 \cdot b_0$ and $T_4 = a_1 + a_0$, then $T_5 = T_2 \cdot T_4$ and $T_6 = T_1 + T_3$. Final
additions and computations are $T_7 = x^2 \cdot T_1$, $T_8 = T_5 - T_6$, $T_9 = x \cdot T_8$, $T_{10} = T_7 + T_9$,
and $T_{11} = T_3 + T_{10}$ where $a(x) b(x) = T_{11}$. Note that in our specific case some additions or
subtractions caused by the modulo $x^4 + 1$ operation are also hidden behind multiplications.
For the remaining additions and subtractions we make use of the co-processor. To save
cycles for transfers we store the result of several additions/subtractions in one register of
the co-processor so that we only have to transfer values into the co-processor and then
read out the final result. The FinalEll function (see line 10 of Algorithm 11) requires 3
multiplications by $2^\ell$. They are implemented on the co-processor using a special command
that allows fast shifting by 32 bits and are thus relatively cheap.
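The multiplication count quoted above can be checked with a small experiment. The Python sketch below applies two levels of Karatsuba to polynomials with four coefficients and counts base multiplications; it uses plain integers as coefficients rather than 2048-bit residues, and it does not model the scheduling of additions on the CPU. The function names are hypothetical.

```python
def karatsuba(a, b, count):
    """Karatsuba product of two equal-length lists (length a power of two).

    count is a one-element list used as a mutable multiplication counter.
    """
    n = len(a)
    if n == 1:
        count[0] += 1
        return [a[0] * b[0]]
    h = n // 2
    a0, a1, b0, b1 = a[:h], a[h:], b[:h], b[h:]
    low = karatsuba(a0, b0, count)
    high = karatsuba(a1, b1, count)
    mid = karatsuba([x + y for x, y in zip(a0, a1)],
                    [x + y for x, y in zip(b0, b1)], count)
    res = [0] * (2 * n - 1)
    for i, v in enumerate(low):
        res[i] += v
        res[i + h] -= v
    for i, v in enumerate(high):
        res[i + 2 * h] += v
        res[i + h] -= v
    for i, v in enumerate(mid):
        res[i + h] += v
    return res


def negacyclic_karatsuba(a, b, count):
    """Product modulo x^n + 1: fold the upper half back with a sign flip."""
    n = len(a)
    full = karatsuba(a, b, count)
    return [full[i] - (full[i + n] if i + n < len(full) else 0)
            for i in range(n)]


counter = [0]
print(negacyclic_karatsuba([1, -2, 3, 0], [0, 1, -1, 2], counter), counter[0])
```

For four coefficients the counter ends at $3 \cdot 3 = 9$, compared with 16 for schoolbook multiplication.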
$13 \cdot 128 = 1664$ bits would be required in total. However, similarly to KS1 we use 16
bits for easier packing/unpacking and end up with integers of size $16 \cdot 128 = 2048$ bits
($\ell = 16$). Computations are then performed on two polynomials modulo $x^2 + 1$. This leads
to $2 \cdot 2^2 = 8$ multiplications in $\mathbb{Z}_p$ for $p = F = 2^{\omega\ell} + 1$ when using schoolbook multiplication.
With Karatsuba a reduction to $2 \cdot 3 = 6$ multiplications would be possible. As the difference
between Karatsuba and schoolbook is small we use schoolbook multiplication to implement
KS2. This allows us to store partial products during schoolbook multiplication in the
free register of the RSA co-processor. This way we can perform additions with the RSA
co-processor and save time as we do not have to retrieve every result from the co-processor
into the memory.
countermeasures against physical attacks. Such attacks are not the focus of our work but
a secured PRNG would be easier to realise with the AES co-processor than by using a
shared SW implementation of SHA3 (see [OSPG18] where this necessity is discussed and
performance of a shared SHA3 is given). With roughly 376,000 cycles used for sampling
in Kyber.CPA.Imp.Gen ($\approx 9 \times$ Parse $+\ 6 \times$ CBD) and roughly 407,000 cycles used
in Kyber.CPA.Imp.Enc ($\approx 9 \times$ Parse $+\ 7 \times$ CBD) the sampling requires only about
10 percent of the overall runtime. Additionally, in Table 3 we have computed the sum
of cycles based on the calls to measured subfunctions for KS1. This gives an overview of
how many cycles can be attributed to each operation. In all three functions the most
cycles are contributed by MulAddSingle and Sneeze. They would be a natural target
for further optimization.
Compared to a Kyber768 implementation that is using the NTT as specified in [SAB+17]
on the SLE 78 in software, our approach of using the co-processor to compute the
KyberMulAdd gadget provides an advantage. On the SLE 78 a single $n = 256$ NTT
costs 997,691 cycles. The computation of Kyber.CPA.Enc for $k = 3$ requires 10 calls to
the NTT⁹ which alone would account for roughly $10 \cdot 997{,}691 \approx 10.0$ million cycles plus
additional overhead from pointwise multiplication and addition.
In case one would want to make our implementation compatible with Kyber as specified
in [SAB+17] in terms of NTT usage and still use the KyberMulAdd gadget we would
have to perform $k^2$ inverse NTTs and then use our multiplication algorithm. This would
add roughly $3^2 \cdot 997{,}691 \approx 9.0$ million cycles to Gen and Enc when executed on the CPU.
It would basically nullify all gains from a different and faster algorithm for polynomial
multiplication.
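The two estimates above are simple products of the measured NTT cost; the short Python check below reproduces them and compares them with the measured KS2 encryption cost from Table 2. The variable names are ours.

```python
NTT_CYCLES = 997_691        # single n = 256 NTT in software (Table 2)
ENC_KS2_CYCLES = 4_747_291  # Kyber.CPA.Imp.Enc with HW-AES and KS2 (Table 2)
K = 3                       # module rank for Kyber768

ntt_calls_in_enc = 10 * NTT_CYCLES      # NTT calls alone for one encryption
inverse_ntt_overhead = K ** 2 * NTT_CYCLES  # k^2 inverse NTTs for compatibility

print(f"NTT calls in Enc:      {ntt_calls_in_enc / 1e6:.1f} million cycles")
print(f"k^2 inverse NTTs:      {inverse_ntt_overhead / 1e6:.1f} million cycles")
print(f"Measured Enc (KS2):    {ENC_KS2_CYCLES / 1e6:.1f} million cycles")
```

The NTT calls alone cost about twice as much as the complete co-processor-based encryption, before pointwise multiplication is even counted.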
All in all, when our Kyber variant that is using the AES co-processor (i.e. AES-HW)
is run on our target device with an average clock frequency of 50 MHz we can execute
Kyber.CPA.Imp.Gen in 72.5 ms, Kyber.CPA.Imp.Enc in 94.9 ms and Kyber.CPA.Imp.Dec
in 28.4 ms.
For the CCA variant the decryption becomes slower due to the re-encryption but the
additional overhead of the hash functions H and G is rather low when the SHA-256
co-processor is used (HW-SHA-256) to compute SHA-256 and HKDF with HMAC-SHA-256.
When H and G are instantiated with SHA3 implemented in software (SW-SHA3) a significant
portion of the computation is now attributed to SHA3. In comparison we can execute
Kyber.CCA.Imp.Gen in 79.6 ms (290.3 ms with SW-SHA3), Kyber.CCA.Imp.Enc
in 102.4 ms (361.0 ms with SW-SHA3) and Kyber.CCA.Imp.Dec in 132.7 ms (394.0
ms with SW-SHA3). An implementation of Kyber that is fully compatible with the
specification [SAB+17] would not achieve practical performance mainly due to the slow
SHA3 PRNG performance and to a lesser extent due to the slower NTT in software. Of
course, further low-level optimization of SHA3 and the NTT could change this picture to
some extent.
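The millisecond figures follow from the cycle counts in Table 2 and the 50 MHz clock. The Python snippet below performs the conversion for the KS2 rows that use the AES and SHA-256 co-processors; the dictionary keys are shortened labels of our own.

```python
CLOCK_HZ = 50_000_000  # average clock frequency stated in the text

CYCLES = {
    "CPA.Gen (HW-AES, KS2)": 3_625_718,
    "CPA.Enc (HW-AES, KS2)": 4_747_291,
    "CPA.Dec (KS2)": 1_420_367,
    "CCA.Gen (HW-AES, HW-SHA-256, KS2)": 3_980_517,
    "CCA.Enc (HW-AES, HW-SHA-256, KS2)": 5_117_996,
    "CCA.Dec (HW-AES, HW-SHA-256, KS2)": 6_632_704,
}


def cycles_to_ms(cycles: int, clock_hz: int = CLOCK_HZ) -> float:
    """Convert a cycle count to milliseconds at the given clock frequency."""
    return cycles / clock_hz * 1000.0


for name, cycles in CYCLES.items():
    print(f"{name:36s} {cycles_to_ms(cycles):7.1f} ms")
```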
Table 2: Performance of our work on the SLE 78 target device in clock cycles.
Operation                                                         Cycles
Snort (KS1)                                                       31,017
Sneeze (KS1)                                                     295,730
MulAddSingle (KS1)                                               201,767
FinalEll (KS1)                                                    28,381
Snort (KS2)                                                       70,015
Sneeze (KS2)                                                     295,331
MulAddSingle (KS2)                                               186,652
FinalEll (KS2)                                                    90,728
NTT (n = 256, in SW)                                             997,691
Pointwise-Multiplication (n = 256, in SW)                        356,549
CBD(PRF(σ, N)) (Software-SHA3)                                 9,341,406
CBD(PRF(σ, N)) (Hardware-AES)                                     31,068
Parse(XOF(ρ||i||j)) (Software-SHA3)                           19,934,170
Parse(XOF(ρ||i||j)) (Hardware-AES)                                21,081
Kyber.CPA.Imp.Gen (HW-AES: PRF/XOF; KS1)                       3,953,224
Kyber.CPA.Imp.Enc (HW-AES: PRF/XOF; KS1)                       5,385,598
Kyber.CPA.Imp.Dec (KS1)                                        1,382,963
Kyber.CPA.Imp.Gen (HW-AES: PRF/XOF; KS2)                       3,625,718
Kyber.CPA.Imp.Enc (HW-AES: PRF/XOF; KS2)                       4,747,291
Kyber.CPA.Imp.Dec (KS2)                                        1,420,367
Kyber.CCA.Imp.Gen (HW-AES: PRF/XOF; HW-SHA-256: H; KS2)        3,980,517
Kyber.CCA.Imp.Enc (HW-AES: PRF/XOF; HW-SHA-256: G, H; KS2)     5,117,996
Kyber.CCA.Imp.Dec (HW-AES: PRF/XOF; HW-SHA-256: G, H; KS2)     6,632,704
Kyber.CCA.Imp.Gen (HW-AES: PRF/XOF; SW-SHA3: H; KS2)          14,512,691
Kyber.CCA.Imp.Enc (HW-AES: PRF/XOF; SW-SHA3: G, H; KS2)       18,051,747
Kyber.CCA.Imp.Dec (HW-AES: PRF/XOF; SW-SHA3: G, H; KS2)       19,702,139
computations are done using co-processors, are expected to lead to different CPU designs
or low-level implementations than that for a high performance embedded microcontroller.
As we use an RSA co-processor for lattice-based cryptography, a natural target for
a comparison is RSA. The cycle counts given in Table 4 for co-processor supported
RSA on our SLE 78 target device are based on the data sheet. With an average clock
frequency of 50 MHz on the SLE 78, RSA encryption can be executed in 6 ms while RSA
decryption with CRT needs 120 ms. In comparison with our work this shows that our
Kyber implementation is one order of magnitude slower for encryption but performs
decryption with similar speed. If RSA is not used with CRT, our Kyber decryption
even outperforms RSA. However, it should be noted that the RSA cycle counts do not
account for padding like Optimal Asymmetric Encryption Padding (OAEP), which is often
used to achieve CCA2 security for RSA. On the other hand, they include countermeasures
against physical attacks (e.g. exponent blinding or message blinding, see [FWA+ 13]) while
our implementation does not.
Publicly available information on the performance of RSA and ECC on various smart
cards running the JavaCard platform can be found in works like [DRHM17, SNS+ 16], the
Bachelor’s thesis of Kvašňovský [Kva16] as well as in the JCAlgTest project10 . Across
the selected cards, the runtime for an RSA2048 encryption function call is in the range
from 8 to 74 ms while RSA2048 decryption takes between 426 and 2,927 ms, or between
140 and 1,569 ms when using the Chinese Remainder Theorem (CRT). On-card key generation
10 See https://www.fi.muni.cz/~xsvenda/jcalgtest/comparative-table.html.
Table 3: Called functions, number of calls, clock cycles, and final sum of clock cycles.
Kyber.CPA.Imp.Gen (KS1)
Function Calls Cycles per function Product
CBD(PRF(σ, N )) (HW-AES) 6 31,068 186,408
Parse(XOF(ρ||i||j)) (HW-AES) 9 21,081 189,729
Snort 15 31,017 465,255
MulAddSingle 9 201,767 1,815,903
Sneeze 3 295,730 887,190
FinalEll 3 28,381 85,143
Encode/Decode - - 400,226
= 4,029,854
Kyber.CPA.Imp.Enc (KS1)
Function Calls Cycles per function Product
CBD(PRF(σ, N )) (HW-AES) 7 31,068 217,476
Parse(XOF(ρ||i||j)) (HW-AES) 9 21,081 189,729
Snort 19 31,017 589,323
MulAddSingle 12 201,767 2,421,204
Sneeze 4 295,730 1,182,920
FinalEll 4 28,381 113,524
Encode/Decode - - 676,453
= 5,390,629
Kyber.CPA.Imp.Dec (KS1)
Function Calls Cycles per function Product
CBD(PRF(σ, N )) (HW-AES) 0 31,068 0
Parse(XOF(ρ||i||j)) (HW-AES) 0 21,081 0
Snort 7 31,017 217,119
MulAddSingle 3 201,767 605,301
Sneeze 1 295,730 295,730
FinalEll 1 28,381 28,381
Encode/Decode - - 365,175
= 1,511,706
[Figure: cycle counts (in units of 10^6 cycles) of Kyber.CPA.Imp.Gen, Kyber.CPA.Imp.Enc and Kyber.CPA.Imp.Dec (KS1), broken down into CBD(PRF(σ, N )) (HW-AES), Parse(XOF(ρ||i||j)) (HW-AES), Snort, MulAddSingle, Sneeze, FinalEll and Encode/Decode.]
for RSA2048 is a complex process with a variable runtime due to the required primality
testing and takes between 6,789 and 44,143 ms. There is also a certain overhead from the
JavaCard platform compared to a pure native implementation, as well as overhead from
various countermeasures against physical attacks.
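As a cross-check of Table 3, the per-operation totals can be recomputed from the number of calls and the per-call cycle counts of Table 2 (KS1). The sketch below is illustrative only: the dictionary names are assumptions, and the Snort count for Dec is taken as 7, the value consistent with the listed product of 217,119 cycles. These sums are estimates built from individual measurements and need not coincide with the measured totals reported in Table 2.

```python
# Per-call cycle counts for the KS1 variant, copied from Table 2.
CYCLES = {
    "CBD": 31_068,
    "Parse": 21_081,
    "Snort": 31_017,
    "MulAddSingle": 201_767,
    "Sneeze": 295_730,
    "FinalEll": 28_381,
}

# Number of calls per function and the Encode/Decode cost, as in Table 3.
OPERATIONS = {
    "Gen": ({"CBD": 6, "Parse": 9, "Snort": 15, "MulAddSingle": 9,
             "Sneeze": 3, "FinalEll": 3}, 400_226),
    "Enc": ({"CBD": 7, "Parse": 9, "Snort": 19, "MulAddSingle": 12,
             "Sneeze": 4, "FinalEll": 4}, 676_453),
    "Dec": ({"CBD": 0, "Parse": 0, "Snort": 7, "MulAddSingle": 3,
             "Sneeze": 1, "FinalEll": 1}, 365_175),
}


def total_cycles(calls, encode_decode):
    """Sum calls * cycles over all functions and add the Encode/Decode cost."""
    return sum(n * CYCLES[name] for name, n in calls.items()) + encode_decode


totals = {op: total_cycles(calls, extra) for op, (calls, extra) in OPERATIONS.items()}
for op, cycles in totals.items():
    print(f"Kyber.CPA.Imp.{op} (KS1): {cycles:,} cycles")
```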
For comparison with other post-quantum schemes we have ported the reference
implementation of ephemeral/CPA-secure NewHope with n = 1024, claiming 255 bits of
post-quantum security, onto our target device. To obtain a fair comparison we also changed
the internal PRNG to use the co-processor-based AES in counter mode and we removed
costly randomness hashing in the key generation. With these modifications the main
bottleneck in NewHope is the computation of NTTs. When comparing the CPA-secure NewHope
implementation (claimed 255-bit security level) with our CPA-secure Kyber (claimed
161-bit security level) in an ephemeral key setting11 , we achieve a factor of 6 better
performance for Alice (Gen+Dec) and a factor of 7 better performance for Bob (Enc).
Note that the implementation of our variant of Kyber that is not using the NTT would
most likely lead to a loss of performance on other platforms. However, the implementation of
Saber on ARM given in [KMRV18] shows that high performance is also possible without
using the NTT when parameters are chosen accordingly.
Most modern general purpose ARM-based microcontroller platforms (e.g. Cortex-M)
have the advantage of a 32-bit architecture and are equipped with a single-cycle or few-cycle
multiplier (optional in Cortex-M0). Thus good performance can be expected for most
arithmetic operations, e.g. the inner loop of the NTT. Open-source implementations of
Kyber768 and NewHope1024 targeting general purpose ARM controllers are available
through the mupq project [va18]. It can be seen that in comparison with such a different
class of devices our CCA-secure Kyber768 implementation of Gen and Enc is slower than
CCA-secure Kyber768 on ARM using the NTT.
list.nist.gov/forum/#!topic/pqc-forum/r9R7OJT6x_c.
Table 4: Comparison of our work with other PKE or KEM schemes on various microcon-
troller platforms in clock cycles.
of the AES co-processor to implement PRF/XOF and a software implementation of the NTT with
997,691 cycles for an NTT on SLE 78 @ 50 MHz.
f Reference implementation of constant time ephemeral NewHope key exchange (n = 1024) [ADPS16]
MHz. For simplification we report the cost of one point multiplication (PM) in Gen, two PMs in Enc
and one PM in Dec.
n Elliptic curve Diffie-Hellman using Curve25519 [Ber06] from [DHH+ 15] on ARM Cortex @ 48 MHz.
Reporting as in l .
cycle multiplier. Here it is also worth considering that on an ARM processor Snort,
Sneeze, and software-based big integer addition are also expected to be significantly faster
due to the more efficient instruction set and larger word size, while the CPU and the
co-processor could still execute in parallel.
From the algorithmic side, in the case of the KS1, ω = 64 implementation of Kyber we
currently require ℓ ≥ 25 bits of precision, and hence opted for using 32 bits. By using the
considerations made in Section 4 about swapping ω for n in the formula for computing ℓ,
we could get down to ℓ ≥ 23, making it possible to save some memory at the cost of a
more complex unpacking.
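The two precision values can be reproduced numerically. The sketch below assumes that the formula for ℓ referred to here is the one used in the prec routines of the proof of concept in Appendix B, i.e. log2(k · n · ⌊q/2⌋ · η + η + max|f_i| + 1) + 1 rounded up; the function name `precision_bits` is hypothetical, and the parameters (k = 3, q = 7681, η = 4) are those of the Kyber class in that listing.

```python
from math import ceil, log2


def precision_bits(k, degree, q, eta, max_abs_f=1):
    """Bits per packed coefficient, following the prec formula of the listing.

    `degree` is n for the current implementation, or the chunk size ω when
    the bound is tightened as discussed in the text (an assumption here).
    """
    bound = k * degree * (q // 2) * eta + eta + max_abs_f + 1
    return ceil(log2(bound) + 1)


K, Q, ETA = 3, 7681, 4
bits_with_n = precision_bits(K, 256, Q, ETA)  # degree n = 256
bits_with_w = precision_bits(K, 64, Q, ETA)   # degree ω = 64

print(bits_with_n, bits_with_w)
```

Under these assumptions the first value corresponds to the current requirement and the second to the tightened bound, which leaves more slack below a 32-bit word per coefficient.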
In a more general direction it appears interesting to investigate whether a performance
advantage can be obtained with schemes specifically designed with the constraints of the
big integer multiplier in mind such as ThreeBears [Ham17] or Mersenne-756839 [AJPS17].
However, we note that these schemes use integer sizes too large for direct handling with
our co-processor. In contrast, MLWE-based schemes immediately allow for a piece-wise
approach. Thus, another interesting target for implementation could be an MLWE-based
scheme that is parameterised with a power-of-two modulus q, e.g. SABER [DKRV17], which
permits an efficient implementation of the strategy from Equation (1). For example, a viable
choice could be a prime-cyclotomic ring for n = 167 with modulus 2^13 such that each ring element
fits directly into a co-processor register. Another approach would be a Kyber instantiation
with a smaller prime modulus q, as we do not have to choose q in a way that a fast NTT
exists. Moreover, our results naturally transfer over to the Dilithium signature scheme
and an implementation on the SLE 78 is a natural next step. However, parameters have
to be adapted for Dilithium, as it uses a larger modulus q = 8380417. Another interesting
question is whether it is possible to efficiently use RSA/ECC co-processors to implement
the NTT by treating the big integer multiplier as a vector processor using smart packing
of coefficients or a variant of Kronecker substitution.
References
[AAB+ 17] Erdem Alkim, Roberto Avanzi, Joppe Bos, Léo Ducas, Antonio de la Piedra,
Thomas Pöppelmann, Peter Schwabe, and Douglas Stebila. Newhope. Technical
report, National Institute of Standards and Technology, 2017. available
at https://csrc.nist.gov/projects/post-quantum-cryptography/
round-1-submissions.
[AD17] Martin R. Albrecht and Amit Deo. Large modulus ring-LWE ≥ module-LWE.
In Tsuyoshi Takagi and Thomas Peyrin, editors, ASIACRYPT 2017, Part I,
volume 10624 of LNCS, pages 267–296. Springer, Heidelberg, December 2017.
[ADPS16] Erdem Alkim, Léo Ducas, Thomas Pöppelmann, and Peter Schwabe. Post-
quantum key exchange - A new hope. In Thorsten Holz and Stefan Savage,
editors, 25th USENIX Security Symposium, USENIX Security 16, pages
327–343. USENIX Association, 2016.
[AJPS17] Divesh Aggarwal, Antoine Joux, Anupam Prakash, and Miklos Santha.
Mersenne-756839. Technical report, National Institute of Standards and
Technology, 2017. available at https://csrc.nist.gov/projects/post-
quantum-cryptography/round-1-submissions.
[AJS16] Erdem Alkim, Philipp Jakubeit, and Peter Schwabe. Newhope on ARM
cortex-m. In Security, Privacy, and Applied Cryptography Engineering - 6th
International Conference, SPACE 2016, pages 332–349, 2016.
[BBE+ 18] Gilles Barthe, Sonia Belaïd, Thomas Espitau, Pierre-Alain Fouque, Benjamin
Grégoire, Mélissa Rossi, and Mehdi Tibouchi. Masking the GLP lattice-based
signature scheme at any order. In Jesper Buus Nielsen and Vincent Rijmen,
editors, EUROCRYPT 2018, Part II, volume 10821 of LNCS, pages 354–384.
Springer, Heidelberg, April / May 2018.
[BDEZ12] Razvan Barbulescu, Jérémie Detrey, Nicolas Estibals, and Paul Zimmermann.
Finding optimal formulae for bilinear maps. In Ferruh Özbudak and Francisco
Rodríguez-Henríquez, editors, Arithmetic of Finite Fields, volume 7369 of
Lecture Notes in Computer Science, pages 168–186. Springer Berlin Heidelberg,
2012.
[BDK+ 17] Joppe W. Bos, Léo Ducas, Eike Kiltz, Tancrède Lepoint, Vadim Lyubashevsky,
John M. Schanck, Peter Schwabe, and Damien Stehlé. CRYSTALS - kyber:
a cca-secure module-lattice-based KEM. IACR Cryptology ePrint Archive,
2017:634, 2017. to appear in IEEE European Symposium on Security and
Privacy 2018, EuroS&P 2018.
[Ber06] Daniel J. Bernstein. Curve25519: New Diffie-Hellman speed records. In Moti
Yung, Yevgeniy Dodis, Aggelos Kiayias, and Tal Malkin, editors, PKC 2006,
volume 3958 of LNCS, pages 207–228. Springer, Heidelberg, April 2006.
[BLP+ 13] Zvika Brakerski, Adeline Langlois, Chris Peikert, Oded Regev, and Damien
Stehlé. Classical hardness of learning with errors. In Dan Boneh, Tim
Roughgarden, and Joan Feigenbaum, editors, 45th ACM STOC, pages 575–
584. ACM Press, June 2013.
[BSJ15] Ahmad Boorghany, Siavash Bayat Sarmadi, and Rasool Jalili. On constrained
implementation of lattice-based cryptographic primitives and schemes on
smart cards. ACM Trans. Embed. Comput. Syst., 14(3):42:1–42:25, April
2015.
[CCD+ 15] Matthew Campagna, Lidong Chen, Dr Özgür Dagdelen, Jintai Ding, Jen-
nifer K. Fernick, Nicolas Gisin, Donald Hayford, Thomas Jennewein, Norbert
Lütkenhaus, Michele Mosca, Brian Neill, Mark Pecen, Ray Perlner, Gré-
goire Ribordy, John M. Schanck, Dr Douglas Stebila, Nino Walenta, William
Whyte, and Dr Zhenfei Zhang. ETSI whitepaper: Quantum safe cryptogra-
phy and security. http://www.etsi.org/images/files/ETSIWhitePapers/
QuantumSafeWhitepaper.pdf, June 2015.
[CDW17] Ronald Cramer, Léo Ducas, and Benjamin Wesolowski. Short stickelberger
class relations and application to ideal-SVP. In Jean-Sébastien Coron and
Jesper Buus Nielsen, editors, EUROCRYPT 2017, Part I, volume 10210 of
LNCS, pages 324–348. Springer, Heidelberg, April / May 2017.
[CHT12] Peter Czypek, Stefan Heyse, and Enrico Thomae. Efficient implementations
of MQPKS on constrained devices. In Emmanuel Prouff and Patrick Schau-
mont, editors, CHES 2012, volume 7428 of LNCS, pages 374–389. Springer,
Heidelberg, September 2012.
[Chu17] Gu Chunsheng. Integer version of ring-LWE and its applications. Cryptology
ePrint Archive, Report 2017/641, 2017. http://eprint.iacr.org/2017/641.
[CLT13] Jean-Sébastien Coron, Tancrède Lepoint, and Mehdi Tibouchi. Practical
multilinear maps over the integers. In Ran Canetti and Juan A. Garay,
editors, CRYPTO 2013, Part I, volume 8042 of LNCS, pages 476–493. Springer,
Heidelberg, August 2013.
[CÖ10] Murat Cenk and Ferruh Özbudak. On multiplication in finite fields. Journal
of Complexity, 26(2):172–186, 2010.
[CS15] Jung Hee Cheon and Damien Stehlé. Fully homomophic encryption over
the integers revisited. In Elisabeth Oswald and Marc Fischlin, editors, EU-
ROCRYPT 2015, Part I, volume 9056 of LNCS, pages 513–536. Springer,
Heidelberg, April 2015.
[dCRVV15] Ruan de Clercq, Sujoy Sinha Roy, Frederik Vercauteren, and Ingrid Ver-
bauwhede. Efficient software implementation of ring-LWE encryption. In
Proceedings of the 2015 Design, Automation & Test in Europe Conference &
Exhibition, DATE 2015, pages 339–344, 2015.
[DHH+ 15] Michael Düll, Björn Haase, Gesine Hinterwälder, Michael Hutter, Christof
Paar, Ana Helena Sánchez, and Peter Schwabe. High-speed curve25519 on
8-bit, 16-bit, and 32-bit microcontrollers. Des. Codes Cryptography, 77(2-
3):493–514, 2015.
[DKRV17] Jan-Pieter D’Anvers, Angshuman Karmakar, Sujoy Sinha Roy, and Frederik
Vercauteren. Saber. Technical report, National Institute of Standards and
Technology, 2017. available at https://csrc.nist.gov/projects/post-
quantum-cryptography/round-1-submissions.
[DRHM17] Petr Dzurenda, Sara Ricci, Jan Hajny, and Lukas Malina. Performance analysis
and comparison of different elliptic curves on smart cards. In International
Conference on Privacy, Security and Trust (PST), 2017. to appear, see
https://www.ucalgary.ca/pst2017/files/pst2017/paper-39.pdf.
[FH07] Haining Fan and M. Anwar Hasan. Comments on “five, six, and seven-term
karatsuba-like formulae”. IEEE Trans. Computers, 56(5):716–717, 2007.
[FWA+ 13] Dirk Feldhusen, Guntram Wicke, Arnold Abromeit, Lex Schoonen, and
Zertifizierungsstelle BSI. Minimum requirements for evaluating side-channel
attack resistance of rsa, dsa and diffie-hellman key exchange implementations.
Technical report, German Federal Office for Information Security - BSI,
1 2013. See https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/
Zertifizierung/Interpretationen/AIS_46_BSI_guidelines_SCA_RSA_
V1_0_e_pdf.pdf?__blob=publicationFile&v=1.
[Ham17] Mike Hamburg. Three bears. Technical report, National Institute of Standards
and Technology, 2017. available at https://csrc.nist.gov/projects/post-
quantum-cryptography/round-1-submissions.
[HBB13] Andreas Hülsing, Christoph Busold, and Johannes Buchmann. Forward secure
signatures on smart cards. In Lars R. Knudsen and Huapeng Wu, editors,
SAC 2012, volume 7707 of LNCS, pages 66–80. Springer, Heidelberg, August
2013.
[HHK17] Dennis Hofheinz, Kathrin Hövelmanns, and Eike Kiltz. A modular analysis
of the Fujisaki-Okamoto transformation. In Yael Kalai and Leonid Reyzin,
editors, TCC 2017, Part I, volume 10677 of LNCS, pages 341–371. Springer,
Heidelberg, November 2017.
[HRS16] Andreas Hülsing, Joost Rijneveld, and Peter Schwabe. ARMed SPHINCS
- computing a 41 KB signature in 16 KB of RAM. In Chen-Mou Cheng,
Kai-Min Chung, Giuseppe Persiano, and Bo-Yin Yang, editors, PKC 2016,
Part I, volume 9614 of LNCS, pages 446–470. Springer, Heidelberg, March
2016.
[KMRV18] Angshuman Karmakar, Jose Maria Bermudo Mera, Sujoy Sinha Roy, and
Ingrid Verbauwhede. Saber on ARM CCA-secure module lattice-based key
encapsulation on ARM. Cryptology ePrint Archive, Report 2018/682, 2018.
https://eprint.iacr.org/2018/682.
[Kra10] Hugo Krawczyk. Cryptographic extraction and key derivation: The HKDF
scheme. In Tal Rabin, editor, CRYPTO 2010, volume 6223 of LNCS, pages
631–648. Springer, Heidelberg, August 2010.
[LP11] Richard Lindner and Chris Peikert. Better key sizes (and attacks) for LWE-
based encryption. In Aggelos Kiayias, editor, CT-RSA 2011, volume 6558 of
LNCS, pages 319–339. Springer, Heidelberg, February 2011.
[LPO+ 17a] Zhe Liu, Thomas Pöppelmann, Tobias Oder, Hwajeong Seo, Sujoy Sinha Roy,
Tim Güneysu, Johann Großschädl, Howon Kim, and Ingrid Verbauwhede.
High-performance ideal lattice-based cryptography on 8-bit AVR microcon-
trollers. ACM Trans. Embedded Comput. Syst., 16(4):117:1–117:24, 2017.
[LPO+ 17b] Zhe Liu, Thomas Pöppelmann, Tobias Oder, Hwajeong Seo, Sujoy Sinha Roy,
Tim Güneysu, Johann Großschädl, Howon Kim, and Ingrid Verbauwhede.
High-performance ideal lattice-based cryptography on 8-bit AVR microcon-
trollers. ACM Trans. Embedded Comput. Syst., 16(4):117:1–117:24, 2017.
[LPR10] Vadim Lyubashevsky, Chris Peikert, and Oded Regev. On ideal lattices and
learning with errors over rings. In Henri Gilbert, editor, EUROCRYPT 2010,
volume 6110 of LNCS, pages 1–23. Springer, Heidelberg, May / June 2010.
[OPG14] Tobias Oder, Thomas Pöppelmann, and Tim Güneysu. Beyond ECDSA and
RSA: lattice-based digital signatures on constrained devices. In The 51st
Annual Design Automation Conference 2014, DAC ’14, pages 110:1–110:6,
2014.
[OSPG18] Tobias Oder, Tobias Schneider, Thomas Pöppelmann, and Tim Güneysu.
Practical CCA2-secure masked Ring-LWE implementations. IACR
TCHES, 2018(1):142–174, 2018. https://tches.iacr.org/index.php/
TCHES/article/view/836.
[RdCR+ 16] Oscar Reparaz, Ruan de Clercq, Sujoy Sinha Roy, Frederik Vercauteren, and
Ingrid Verbauwhede. Additively homomorphic ring-lwe masking. In Post-
Quantum Cryptography - 7th International Workshop, PQCrypto 2016, pages
233–244, 2016.
[Reg09] Oded Regev. On lattices, learning with errors, random linear codes, and
cryptography. Journal of the ACM, 56(6):1–40, Sep 2009.
[S+ 17] William Stein et al. Sage Mathematics Software Version 8.0. The Sage Devel-
opment Team, 2017. http://www.sagemath.org.
[SAB+ 17] Peter Schwabe, Roberto Avanzi, Joppe Bos, Leo Ducas, Eike Kiltz, Tancrede
Lepoint, Vadim Lyubashevsky, John M. Schanck, Gregor Seiler, and Damien
Stehle. Crystals-kyber. Technical report, National Institute of Standards and
Technology, 2017. available at https://csrc.nist.gov/projects/post-
quantum-cryptography/round-1-submissions.
[SBPV07] Kazuo Sakiyama, Lejla Batina, Bart Preneel, and Ingrid Verbauwhede.
HW/SW co-design for public-key cryptosystems on the 8051 micro-controller.
Computers & Electrical Engineering, 33(5-6):324–332, 2007.
[Sch77] Arnold Schönhage. Schnelle multiplikation von polynomen über körpern der
charakteristik 2. Acta Informatica, 7(4):395–398, Dec 1977.
[SNS+ 16] Petr Svenda, Matús Nemec, Peter Sekan, Rudolf Kvasnovský, David Formánek,
David Komárek, and Vashek Matyás. The million-key question - investigating
the origins of RSA public keys. In 25th USENIX Security Symposium, USENIX
Security 16, Austin, TX, USA, August 10-12, 2016., pages 893–910, 2016.
[SSTX09] Damien Stehlé, Ron Steinfeld, Keisuke Tanaka, and Keita Xagawa. Efficient
public key encryption based on ideal lattices. In Mitsuru Matsui, editor,
ASIACRYPT 2009, volume 5912 of LNCS, pages 617–635. Springer, Heidelberg,
December 2009.
[va18] various authors. Post-quantum crypto library for the ARM Cortex-M4. Web-
site, 2018. accessed April 2018, see https://github.com/mupq/pqm4.
[vMOG15] Ingo von Maurich, Tobias Oder, and Tim Güneysu. Implementing QC-MDPC
McEliece encryption. ACM Trans. Embedded Comput. Syst., 14(3):44:1–44:27,
2015.
[VZGG13] Joachim Von Zur Gathen and Jürgen Gerhard. Modern computer algebra.
Cambridge university press, 2013.
[Wen13] Erich Wenger. A lightweight atmega-based application-specific instruction-set
processor for elliptic curve cryptography. In Lightweight Cryptography for
Security and Privacy - Second International Workshop, LightSec 2013, pages
1–15, 2013.
A Cyclotomic gadgets
In this Appendix we prove Corollaries 1 and 2.
Proof (of Corollary 1). We need to verify that
\[ f(2^\ell) > 2^{n\ell} - 1 \tag{5} \]
and that
\[ d_i \in \{-\delta, \dots, \delta\}. \tag{6} \]
Condition 5 holds since $f(2^\ell) = 2^{n\ell} + 1$. Condition 6 follows by explicitly evaluating
\[ d(x) = \sum_{i=0}^{n-1} d_i x^i := a(x) \cdot b(x) + c(x) \pmod{x^n + 1} \]
and hence
\[ \max_{\{a_j\}_j, \{b_k\}_k, \{c_m\}_m} |d_i| \le n\alpha\beta + \gamma =: \delta. \]
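To illustrate the statement just proved, the following plain-Python sketch computes a · b + c mod x^n + 1 with a single integer multiplication modulo f(2^ℓ) = 2^{nℓ} + 1 and compares it with a schoolbook computation. It is a toy example under stated assumptions: α, β and γ are taken to bound the absolute values of the coefficients of a, b and c, the helper names are hypothetical, ℓ is chosen so that 2^{ℓ−1} > δ, and the unpacking mirrors the proof_sneeze routine of Appendix B rather than any optimised implementation.

```python
import random


def snort(poly, l):
    """Pack integer coefficients by evaluating the polynomial at 2**l."""
    return sum(coeff << (l * i) for i, coeff in enumerate(poly))


def sneeze(G, n, l):
    """Unpack G in [0, 2**(n*l) + 1) into n balanced coefficients."""
    p = 1 << l
    r = []
    for _ in range(n):
        e = G % p
        G //= p
        if e > p // 2:  # move the digit into the balanced range and carry
            e -= p
            G += 1
        r.append(e)
    r[0] -= G  # leftover multiple of 2**(n*l), which is -1 modulo 2**(n*l) + 1
    return r


def negacyclic_muladd(a, b, c):
    """Schoolbook a*b + c mod x^n + 1, used as the reference result."""
    n = len(a)
    d = list(c)
    for j in range(n):
        for k in range(n):
            if j + k < n:
                d[j + k] += a[j] * b[k]
            else:
                d[j + k - n] -= a[j] * b[k]
    return d


def muladd_via_integers(a, b, c, alpha, beta, gamma):
    """Compute a*b + c mod x^n + 1 with one big-integer multiplication."""
    n = len(a)
    delta = n * alpha * beta + gamma
    l = delta.bit_length() + 1      # guarantees 2**(l-1) > delta
    F = (1 << (n * l)) + 1          # f(2**l) for f = x^n + 1
    G = (snort(a, l) * snort(b, l) + snort(c, l)) % F
    return sneeze(G, n, l)


random.seed(1)
n, alpha, beta, gamma = 8, 3, 5, 7
a = [random.randint(-alpha, alpha) for _ in range(n)]
b = [random.randint(-beta, beta) for _ in range(n)]
c = [random.randint(-gamma, gamma) for _ in range(n)]
print(muladd_via_integers(a, b, c, alpha, beta, gamma) == negacyclic_muladd(a, b, c))
```

Choosing ℓ one bit larger than the bit length of δ keeps every coefficient strictly inside the balanced digit range, which is what makes the unpacking unambiguous.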
Lemma 2. Let $a = \sum_{i=0}^{n-1} a_i x^i$, $b = \sum_{i=0}^{n-1} b_i x^i$ with $a_i, b_i \in \mathbb{Z}$, and let $f = \sum_{i=0}^{n} x^i$. Let
$c_i := \sum_{j+k=i} a_j b_k$ such that $c := \sum_{i=0}^{2n-2} c_i x^i = a \cdot b$ and let $d := \sum_{i=0}^{n-1} d_i x^i \equiv c \pmod{f}$.
Then
\[ d = \sum_{i=0}^{n-3} (c_i - c_n + c_{i+n+1})\, x^i + (c_{n-2} - c_n)\, x^{n-2} + (c_{n-1} - c_n)\, x^{n-1} \]
and each $d_i$ is a sum of at most $2n - 1$ terms of the form $a_j b_k$.
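Proof. Let $f^{(m)} := \sum_{i=0}^{m} x^i$ (it follows that $f \equiv f^{(n)}$). Since $a$ and $b$ have degree $< n$, we
know that we need to reduce modulo $f$ only the powers $x^{i+n}$ for $i = 0, \dots, n-2$ of $c$. For
$i \ge 1$ we have
\begin{align*}
x^{i+n} &\equiv x^i \left(x^n - f^{(n)}(x)\right) \pmod{f} \\
&= -x^i \left(f^{(n-1)}\right) \\
&= -x^{i-1} \left(x f^{(n-1)}\right) \\
&= -x^{i-1} \left(f^{(n)} - 1\right) \\
&\equiv x^{i-1} \pmod{f},
\end{align*}
while for $i = 0$, $x^n \equiv -f^{(n-1)} \pmod{f}$. Hence, we can write
\begin{align*}
c &= \sum_{i=0}^{2n-2} c_i x^i \\
&= \sum_{i=0}^{n-1} c_i x^i + c_n x^n + \sum_{i=0}^{n-3} c_{n+i+1} x^{n+i+1} \\
&\equiv \sum_{i=0}^{n-1} c_i x^i - \sum_{i=0}^{n-1} c_n x^i + \sum_{i=0}^{n-3} c_{n+i+1} x^i \pmod{f} \\
&\equiv \sum_{i=0}^{n-3} (c_i - c_n + c_{n+i+1})\, x^i + (c_{n-2} - c_n)\, x^{n-2} + (c_{n-1} - c_n)\, x^{n-1} \pmod{f}
\end{align*}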
where by explicit computation $d_{n-1}$ is a sum of $2n-1$ terms $a_j b_k$, $d_{n-2}$ is a sum of $2n-2$ such
terms and, for $i \le n - 3$, $d_i$ has $3n - |i - n + 1| - |n - n + 1| - |n + i + 1 - n + 1| = 2n - 2$
such terms.
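The closed form of Lemma 2 can be checked numerically against a generic polynomial reduction. The following sketch is only a sanity check under the lemma's assumptions (integer coefficients, f = 1 + x + ... + x^n, and n ≥ 2 so that the index ranges are non-empty); the helper names are hypothetical.

```python
import random


def poly_mul(a, b):
    """Schoolbook product of two coefficient lists (lowest degree first)."""
    c = [0] * (len(a) + len(b) - 1)
    for j, aj in enumerate(a):
        for k, bk in enumerate(b):
            c[j + k] += aj * bk
    return c


def reduce_mod_f(c, n):
    """Reduce c modulo f = 1 + x + ... + x^n by long division (f is monic)."""
    r = list(c)
    for i in range(len(r) - 1, n - 1, -1):
        lead = r[i]
        for j in range(n + 1):  # subtract lead * x^(i-n) * f
            r[i - n + j] -= lead
    return r[:n]


def lemma2_formula(c, n):
    """Closed form of c mod f from Lemma 2; c must have 2n - 1 coefficients."""
    d = [c[i] - c[n] + c[i + n + 1] for i in range(n - 2)]
    d.append(c[n - 2] - c[n])
    d.append(c[n - 1] - c[n])
    return d


random.seed(0)
n = 6
a = [random.randint(-5, 5) for _ in range(n)]
b = [random.randint(-5, 5) for _ in range(n)]
c = poly_mul(a, b)
print(lemma2_formula(c, n) == reduce_mod_f(c, n))
```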
Proof (of Corollary 2). We need to verify that
and that
\[ d_i \in \{-\delta, \dots, \delta\}. \tag{8} \]
Condition 7 holds since $f(2^\ell) = 2^{n\ell} + 2^{(n-1)\ell} + \cdots + 1$. Condition 8 follows by explicitly
evaluating
\[ d = \sum_{i=0}^{n-1} d_i x^i \equiv a \cdot b + c \pmod{f} \]
B Proof of Concept
Our high-level proof-of-concept implementation is written in SageMath [S+ 17].
# -*- coding: utf-8 -*-
"""
Kyber using big integer arithmetic - proof-of-concept
"""
from sage.all import parent, ZZ, vector, PolynomialRing, GF
from sage.all import log, ceil, randint, set_random_seed, random_vector, matrix, floor


def BinomialDistribution(eta):
    r = 0
    for i in range(eta):
        r += randint(0, 1) - randint(0, 1)
    return r


# Kyber (sort of)
class Kyber:
    n = 256
    q = 7681
    eta = 4
    k = 3
    D = staticmethod(BinomialDistribution)
    f = [1]+[0]*(n-1)+[1]
    ce = n

    @classmethod
    def key_gen(cls, seed=None):
        """Generate a new public/secret key pair
        :param cls: Kyber class, inherit and change constants to change defaults
        :param seed: seed used for random sampling if provided
        """
        n, q, eta, k, D = cls.n, cls.q, cls.eta, cls.k, cls.D
        return (A, t), s

    @classmethod
    def enc(cls, pk, m=None, seed=None):
        """IND-CPA encryption sans compression
        :param cls: Kyber class, inherit and change constants to change defaults
        :param pk: public key
        :param m: optional message, otherwise all zero string is encrypted
        :param seed: seed used for random sampling if provided
        """
        n, q, eta, k, D = cls.n, cls.q, cls.eta, cls.k, cls.D
        A, t = pk
        if m is None:
            m = (0,)

    @classmethod
    def dec(cls, sk, c, decode=True):
        """IND-CPA decryption
        :param cls: Kyber class, inherit and change constants to change defaults
        :param sk: secret key
        :param c: ciphertext
        :param decode: perform final decoding
        """
        n, q = cls.n, cls.q
        s = sk
        u, v = c
        m = (v - s*u) % f
        m = list(m)
        while len(m) < n:
            m.append(0)
        m = balance(vector(m), q)
        if decode:
            return cls.decode(m, q, n)
        else:
            return m

    @staticmethod
    def decode(m, q, n):
        """Decode vector `m` to `\{0,1\}^n` depending on distance to `q/2`
        """
        return vector(GF(2), n, [abs(e) > q/ZZ(4) for e in m] + [0 for _ in range(n-len(m))])

    @classmethod
    def encap(cls, pk, seed=None):
        """IND-CCA encapsulation sans compression or extra hash
        :param cls: Kyber class, inherit and change constants to change defaults
        :param pk: public key
        :param seed: seed used for random sampling if provided
        """
        n = cls.n
        m = random_vector(GF(2), n)
        m.set_immutable()
        set_random_seed(hash(m))  # NOTE: this is obviously not faithful
        K_ = random_vector(GF(2), n)
        K_.set_immutable()
        r = ZZ.random_element(0, 2**n-1)
        c = cls.enc(pk, m, r)

    @classmethod
    def decap(cls, sk, pk, c):
        """IND-CCA decapsulation
        :param cls: Kyber class, inherit and change constants to change defaults
        :param sk: secret key
        :param pk: public key
        :param c: ciphertext
        """
        n = cls.n
        m = cls.dec(sk, c)
        m.set_immutable()
        set_random_seed(hash(m))  # NOTE: this is obviously not faithful
        K_ = random_vector(GF(2), n)
        K_.set_immutable()
        r = ZZ.random_element(0, 2**n-1)
        c_ = cls.enc(pk, m, r)
        if c == c_:
            return hash((K_, c))  # NOTE: this obviously isn't a cryptographic hash
        else:
            return hash(c)  # NOTE ignoring z


class Nose:
    """
    Snorting (packing) and sneezing (unpacking).
    """

    @staticmethod
    def snort(g, f, p):
        """
        Convert vector `g` in `\ZZ^n` with coefficients bounded by `p/2` in absolute value to
        integer `\bmod f(p)`.

    @staticmethod
    def sneeze(G, f, p):
        """Convert integer `G \bmod f(p)` to vector of integers
        """
        assert(G >= 0 and G < f(p))
        n = f.degree()
        c = 0
        r = []
        for i in range(n):
            e = G % p
            G -= e
            e += c
            G = G // p
            c = int(e > p//2)
            e -= c*p
            r.append(e)
        for i in range(n):
            r[i] -= f[i]*(G+c)
        return r[:n]

    @staticmethod
    def proof_sneeze(G, f, p):
        """Convert integer `G \bmod f(p)` to vector of integers
        """
        assert(G >= 0 and G < f(p))
        n = f.degree()
        r = []
        for i in range(n):
            e = G % p
            G -= e
            G = G // p
            if e > p//2:
                e -= p
                G += 1
            r.append(e)
        for i in range(n):
            r[i] -= f[i]*G
        return r[:n]

    @classmethod
    def prec(cls, scheme):
        """
        Return `\log_2(k ce eta (q-1)/2 + (q-1)/2 + 1) + 1`
        """
        eta, q, k, f, ce = scheme.eta, scheme.q, scheme.k, scheme.f, scheme.ce
        l = log(k*ce*floor(q/ZZ(2))*eta + eta + max([abs(fi) for fi in f]) + 1, 2) + 1
        return l

    @classmethod
    def muladd(cls, scheme, a, b, c, l=None):
        """
        Compute `a \cdot b + c mod f` using big-integer arithmetic
        """
        R, x = PolynomialRing(ZZ, "x").objgen()
        k, f = scheme.k, R(scheme.f)
        if l is None:
            l = ceil(cls.prec(scheme))
# Skipper
    IND-CPA Decryption in 30 multiplication of (64 \cdot 25 =) 1600-bit integers.
    """

    @staticmethod
    def ff(v, offset, start=0):
        """Fast-forward through vector `v` in ``offset`` sized steps starting at ``start``
        :param v: vector
        :param offset: increment in each step
        :param start: start offset
        """
        p = parent(v)
        return p(list(v)[start::offset])
        """
        n, eta, q, k, f = kyber.n, kyber.eta, kyber.q, kyber.k, kyber.f
        l = log(k*n*floor(q/ZZ(2))*eta + eta + max([abs(fi) for fi in f]) + 1, 2) + 1
        return l

    @classmethod
    def muladd(cls, kyber, a, b, c, l=None):
        """
        Compute `a \cdot b + c` using big-integer arithmetic
        """
        m, k = 4, kyber.k
        w = kyber.n//m
        R, x = PolynomialRing(ZZ, "x").objgen()
        f = R([1]+[0]*(w-1)+[1])
        if l is None:
            # Could try passing degree w, but would require more careful
            # sneezing
            l = ceil(cls.prec(kyber))
        F = f(2**l)
        # MUL: 3
        # specific trick for how we multiply degree n=256 polys
        # the coefficients from above need readjustment
        # here doing 2**l * is basically doing y * !!! and if this wraps around
        # it takes care of the - in front
        W = sum((W[0+i] + (2**l * W[m+i] % F))*x**i for i in range(m-1)) + W[m-1]*x**(m-1)
        d = []
        for j in range(w):
            for i in range(m):
                d.append(D[i][j])
        return R(d)

    @classmethod
    def enc(cls, kyber, pk, m=None, seed=None, l=None):
        """IND-CPA encryption sans compression
        :param kyber: Kyber class, inherit and change constants to change defaults
        :param pk: public key
        :param m: optional message, otherwise all zero string is encrypted
        :param seed: seed used for random sampling if provided
        """
        n, q, eta, k, D = kyber.n, kyber.q, kyber.eta, kyber.k, kyber.D
        A, t = pk
        if m is None:
            m = (0,)

    @classmethod
    def dec(cls, kyber, sk, c, l=None, decode=True):
        """Decryption.
        :param kyber: Kyber class, inherit and change constants to change defaults
        :param sk: secret key
        :param c: ciphertext
        :param l: bits of precision
        :param decode: perform final decoding
        """
        n, q = kyber.n, kyber.q
        u, v = c
        s = sk


class Skipper2Negated(Skipper4):
    """
    Kyber using big integer arithmetic
    @classmethod
        :param kyber: Kyber class, inherit and change constants to change defaults
        """
        return Skipper4.prec(kyber)/ZZ(2)

    @classmethod
    def muladd(cls, kyber, a, b, c, l=None):
        """
        Compute `a \cdot b + c` using big-integer arithmetic
        """
        m, k = 2, kyber.k
        w = kyber.n//m
        R, x = PolynomialRing(ZZ, "x").objgen()
        f = R([1]+[0]*(w-1)+[1])
        g = R([1]+[0]*(w//2-1)+[1])
        if l is None:
            l = ceil(cls.prec(kyber))
        F = 2**(w*l) + 1
        # MUL: 2 * k * 3
        Wp = (Ap*Bp + Cp) % F
        Wn = (An*Bn + Cn) % F
        We = (Wp + Wn) % F
        Wo = (Wp - Wn) % F
        Wo, We = (sum((Wo[0+i] + (2**l * We[m+i] % F))*x**i for i in range(m-1)) + Wo[m-1]*x**(m-1)) % F, \
                 (sum((We[0+i] + (2**l * Wo[m+i] % F))*x**i for i in range(m-1)) + We[m-1]*x**(m-1)) % F
        _inverse_of_2_mod_F = F - 2**(w*l-1)
        _inverse_of_2_to_the_l_plus_1_mod_F = F - 2**(w*l-1-l)
        We = (We * _inverse_of_2_mod_F) % F
        Wo = (Wo * _inverse_of_2_to_the_l_plus_1_mod_F) % F
        d = []
        return R(d)
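The two constants used at the end of the listing rely on the identity 2^{wℓ} ≡ −1 (mod 2^{wℓ} + 1), so that inverses of powers of two are available without any modular inversion. The following plain-Python sketch checks this; the function name is hypothetical, and the values w = 128 and ℓ = 13 are illustrative assumptions (w = n/2 for n = 256, and ℓ taken as half of the 25-bit precision, rounded up).

```python
def power_of_two_inverses(w, l):
    """Return F = 2**(w*l) + 1 and the inverses of 2 and 2**(l+1) modulo F."""
    F = 2**(w * l) + 1
    inverse_of_2 = F - 2**(w * l - 1)                    # 2 * (-2**(w*l-1)) = -2**(w*l) = 1 mod F
    inverse_of_2_to_the_l_plus_1 = F - 2**(w * l - 1 - l)
    return F, inverse_of_2, inverse_of_2_to_the_l_plus_1


w, l = 128, 13  # illustrative parameters, see the assumptions above
F, inv_2, inv_2_l1 = power_of_two_inverses(w, l)

print((2 * inv_2) % F == 1)
print((2**(l + 1) * inv_2_l1) % F == 1)
```

Since both inverses are of the form F minus a power of two, multiplying by them amounts to a shift followed by a negation modulo F.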