Friday, October 31, 2014

52 Things: Number 4: The Complexity Class P


This is the fourth blog post talking about '52 Things Every PhD Student Should Know' to do Cryptography, and the first on the topic of Theoretical Computer Science. In this post, I've been asked to define the complexity class P. Having recently joined the Cryptography group at Bristol as a mathematician, I knew very little theoretical computer science when I started my PhD, and I'm sure there will be plenty of others in my situation, so this post will start right from the beginning; you can skip over the parts you already know. First, we'll give an overview of what complexity means and why it matters, then we'll define Turing machines, and finally arrive at the complexity class P, concluding with an example.

Most of the content of this post is a reworking of parts of Introduction to the Theory of Computation by Michael Sipser [1], which I have found hugely helpful.

Section 1: Complexity and Big O Notation

We want to know how difficult a given task is for a computer to do in order to design efficient programs. The trouble is that the processing power of a computer varies massively depending on the hardware (e.g. see last week's '52 Things' blog). So we want a measure of the difficulty of a task that doesn't depend on the specific details of the machine performing the task. One way to do this is to bound the number of operations that a certain model of a computer would take to do it. This is called (time) complexity theory.

Typically, though, the number of operations required will depend on the input to the task and may vary even with inputs of the same length. As a pertinent example, say we design a computer program which tells you whether or not an integer you input is prime. If we give as input the number 256, the program will probably output 'not prime' sooner than if we had given it the number 323 (even though they both have length 9 when written as binary integers, for example), since the first integer has a very small prime factor (2) and the second has larger factors (17 and 19). Therefore we usually opt for a worst-case analysis where we record the longest running time of all inputs of a particular length. So we obtain an algebraic expression $t(n)$ that reflects the longest running time of all inputs of length $n$.

Furthermore, when the input length $n$ becomes very large, we can neglect all but the most dominant term in the expression and also ignore any constant factors. This is called asymptotic analysis; we assume $n$ is enormous and ask roughly how many steps the model of computation will take to 'finish' when given the worst possible input of length $n$, writing our answer in the form $\mathcal{O}\left(t\left(n\right)\right)$. For example, if we find that our process takes $6n^3 - n^2 + 1$ steps, we write that it is $\mathcal{O}\left(n^{3}\right)$, since all other terms can be ignored for very large $n$.
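As a toy illustration of the worst-case idea, here is a small Python sketch (my own example, not from [1]) that counts the trial divisions a naive primality test performs on the two inputs above, plus a same-length prime (317) that forces the worst case:

```python
# Count the iterations a naive trial-division primality test performs.
def trial_division_steps(n):
    steps = 0
    d = 2
    while d * d <= n:
        steps += 1
        if n % d == 0:
            return False, steps        # found a factor early, stop
        d += 1
    return True, steps                 # no factor found: n is prime

for n in (256, 323, 317):              # all three are 9-bit integers
    is_prime, steps = trial_division_steps(n)
    print(n, "prime" if is_prime else "not prime", "after", steps, "trial divisions")
```

The longest of these running times over all inputs of the same length is what $t(n)$ would record for this particular algorithm.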

Section 2: Turing Machines

Now we give the model that is most often used in the kind of calculations performed in Section 1. First, recall that an alphabet is a non-empty finite set and a string is a finite sequence of elements (symbols) from an alphabet. A language is simply a set of strings.

A Turing machine models what real computers can do. Its 'memory' is an infinitely long tape. At any time, each square of the tape is either blank or contains a symbol from some specified alphabet. The machine has a tape head that can move left or right along the tape, one square at a time, and read from and write to that square. At first, the tape is all blank except for the leftmost $n$ squares which constitute the input (none of which can be blank so that it is clear where the input ends). The tape head starts at the leftmost square, reads the first input symbol and then decides what to do next according to a transition function. The transition function depends on what it reads at the square it is currently on and the state that the machine is currently in (like a record of what it has done so far) and returns
  1. a new state
  2. another symbol to write to the square it is on (though this symbol might be the same as what was already written there)
  3. a direction to move in: left or right. 
The machine will continue to move one square at a time, read a symbol, evaluate the transition function, write a symbol and move again, until its state becomes some specified accept state or reject state.
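To make the mechanics concrete, here is a minimal Python sketch of a Turing machine simulator (illustrative only; the example machine, which accepts binary strings ending in '0', and its state names are my own):

```python
# A toy single-tape Turing machine simulator.
def run_tm(transitions, tape, start, accept, reject, blank="_"):
    tape = list(tape)
    state, head = start, 0
    while state not in (accept, reject):
        symbol = tape[head] if head < len(tape) else blank
        state, write, move = transitions[(state, symbol)]   # the transition function
        if head < len(tape):
            tape[head] = write
        else:
            tape.append(write)
        head = max(0, head + (1 if move == "R" else -1))
    return state == accept

# Example machine: scan right, remember the last symbol seen, accept if it was '0'.
transitions = {
    ("scan", "0"): ("saw0", "0", "R"), ("scan", "1"): ("saw1", "1", "R"),
    ("scan", "_"): ("reject", "_", "R"),
    ("saw0", "0"): ("saw0", "0", "R"), ("saw0", "1"): ("saw1", "1", "R"),
    ("saw0", "_"): ("accept", "_", "R"),
    ("saw1", "0"): ("saw0", "0", "R"), ("saw1", "1"): ("saw1", "1", "R"),
    ("saw1", "_"): ("reject", "_", "R"),
}
print(run_tm(transitions, "10110", "scan", "accept", "reject"))  # True
```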

If the machine ends up in the accept state, we say it accepts its input. Similarly it may reject its input. In either case we say the machine halts on its input. But note that it may enter a loop without accepting or rejecting i.e. it may never halt. If a Turing machine accepts every string in some language and rejects all other strings, then we say the machine decides that language. We can think of this as the machine testing whether or not the input string is a member of the language. Given a language, if there is a Turing machine that decides it, we say the language is decidable.

The power of this model comes from the fact that a Turing machine can do everything a real computer can do; this is a consequence of the Church-Turing thesis [2], which says that anything computable by an algorithm is computable by a Turing machine. We define the time complexity class $\mathrm{TIME}\left(t\left(n\right)\right)$ to be the collection of all languages that are decidable by an $\mathcal{O}\left(t\left(n\right)\right)$ time Turing machine. We then turn computational problems into questions about language membership (is an input string a member of a certain language? e.g. does this string representing an integer belong to the language of strings representing prime integers?) and can thereby partition computational problems into time complexity classes.

Section 3: The Complexity Class P

Finally, we arrive at the purpose of this blog post! If $t(n) = n^k$ for some $k > 0$ then $\mathcal{O}\left(t\left(n\right)\right)$ is called polynomial time. The complexity class P is the class of all languages that are decidable in polynomial time by a Turing machine. Since $k$ could be very large, such Turing machines are not necessarily all practical (let alone 'fast'!), but this class is a rough model for what can be realistically achieved by a computer. Note that the class P is fundamentally different from classes of languages whose running time $t(n)$ has $n$ in an exponent, such as $2^n$: these grow much, much faster as $n$ increases, so fast that even if you have a decider for such a language, you may find that the universe ends before it halts on your input!
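In the notation of Section 2, this can be written compactly (a standard formulation, found e.g. in Sipser [1]):

$\mathrm{P} = \bigcup_{k \geq 1} \mathrm{TIME}\left(n^{k}\right).$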

We conclude with an example of a polynomial time problem. Suppose you have a directed graph (a set of nodes and edges where there is at most one edge between any pair of nodes and each edge has an arrow indicating a direction) together with two of its nodes, A and B. If we encode the graph and the two nodes as a single string, we can form the language consisting of those strings representing a graph and two nodes such that it is possible to follow the edges from the first node and eventually arrive at the second. So a decider for this language effectively answers the question of whether there is a path from A to B, called the path problem, by accepting or rejecting the graph and nodes you input. We give a decider for this language and show that it decides in polynomial time.
  1. First put a mark on A. 
  2. Scan all the edges of the graph. If you find an edge from a marked node to an unmarked node, mark the unmarked node.
  3. Repeat the above until you mark no new nodes.
  4. If B is marked, accept. Otherwise, reject. 
This process successively marks the nodes that are reachable from A by a path of length 1, then a path of length 2, and so on. So it is clear that a Turing machine implementing the above is a decider for our language. Now we consider the time complexity of this algorithm. If we couldn't do steps 1 and 4 in polynomial time, our machine would be terrible! So we focus on steps 2 and 3. Step 2 involves searching the input and placing a mark on one square, which is clearly polynomial time in the size of the input. Step 3 repeats step 2 no more times than the number of nodes, which is necessarily less than the size of the input (since the input must encode all the nodes of the graph) and is hence polynomial (in particular, linear) in the size of the input. Therefore the whole algorithm is polynomial time and so we say the path problem is in P.
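The marking algorithm translates almost directly into code; here is a short Python sketch of it (a Turing machine would implement the same idea with marks on its tape):

```python
# Steps 1-4 of the path problem decider above.
def path_exists(edges, a, b):
    marked = {a}                             # 1. put a mark on A
    changed = True
    while changed:                           # 3. repeat until no new marks
        changed = False
        for u, v in edges:                   # 2. scan all the edges
            if u in marked and v not in marked:
                marked.add(v)                #    mark newly reachable nodes
                changed = True
    return b in marked                       # 4. accept iff B is marked

edges = [("A", "C"), ("C", "D"), ("D", "B"), ("B", "A")]
print(path_exists(edges, "A", "B"))          # True: A -> C -> D -> B
```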

Thursday, October 30, 2014

SPA and the AES key schedule

Today's study group was given by Valentina on the CHES 2014 paper titled "Simple Power Analysis on the AES Key Expansion Revisited" by Christophe Clavier, Damien Marion and Antoine Wurcker from the Université de Limoges in France.

To briefly recap, "simple power analysis" (SPA) is the rather misleading moniker for the class of methods that analyse the side-channel information contained within a small number of traces captured during the encryption of a single pair of plaintext and key material. The title is misleading because these methods are anything but simple to perform---the exploitable information leakage contained within a single trace corresponding to the encryption or decryption of a fixed input pair is far, far smaller than what can be obtained by capturing a large number of traces for a variety of input pairs and exploiting them using differential power analysis (DPA).

In the side-channel community there's a growing shift in perception towards viewing side-channel analysis as an auxiliary phase in an enhanced global key search, rather than a stand-alone ‘win-or-lose’ attack. DPA attacks use the additional information available in the larger set of traces and aim to reduce the final enumeration effort to as small an amount as possible. SPA attacks instead face the challenge of having to exploit all the available information in the trace and perform a potentially demanding enumeration phase.

The work of Clavier et al. explores attacks on the AES key scheduling algorithm. It is sufficient for the purposes of this blog to know that the AES key schedule takes (assuming AES-128) one 16-byte secret key and expands it to 11 16-byte round keys, the first of which is the same as the input secret key. The authors explore two masked implementations of the key schedule algorithm---masking aims to combine random values (masks) with sensitive intermediate values, to break any relationships between the intermediate values and observed side-channel information. The first is termed "11-byte entropy boolean masking" and generates 11 individual random bytes, each of which masks all of the bytes within one round key (the 16 bytes of a single round key are masked using the same random mask). The second is termed "16-byte entropy boolean masking", and is essentially orthogonal to the 11-byte scheme: each byte position within a round key is masked with a different random byte, but each round key is masked by the same 16 random bytes. In an ideal case, all of the 11 x 16 = 176 sensitive bytes would be masked by a new random byte each time---the authors claim that their 11- and 16-byte schemes are relevant in scenarios where a device does not have enough memory or available entropy to store or generate this many random values.
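To make the difference concrete, here is a hedged Python sketch of how the two schemes reuse randomness across the 11 x 16 array of round-key bytes (my reading of the paper's description, not its implementation):

```python
import secrets

ROUNDS, BYTES = 11, 16   # 11 round keys of 16 bytes each for AES-128

def mask_11_byte(round_keys):
    # One fresh random byte per round key; all 16 bytes of that round key
    # are XORed with the same mask.
    masks = [secrets.randbits(8) for _ in range(ROUNDS)]
    return [[kb ^ masks[r] for kb in round_keys[r]] for r in range(ROUNDS)]

def mask_16_byte(round_keys):
    # One fresh random byte per byte position; every round key reuses the
    # same 16 masks.
    masks = [secrets.randbits(8) for _ in range(BYTES)]
    return [[kb ^ masks[i] for i, kb in enumerate(round_keys[r])]
            for r in range(ROUNDS)]

round_keys = [[secrets.randbits(8) for _ in range(BYTES)] for _ in range(ROUNDS)]
masked = mask_11_byte(round_keys)   # or mask_16_byte(round_keys)
```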

SPA attacks on the key schedule attempt to reduce the size of the set of possible secret keys as much as possible. In this work, the authors assume that they know the Hamming weight of each key byte XORed with its mask (a common leakage model is that the side-channel information is proportional to the Hamming weight of an intermediate value). This is a strong simplifying assumption---in practice, an attacker would have to profile the key schedule on a second device to work towards an approximation of the side-channel leakage.

The primary contribution of the paper is an adapted "guess and backtrack"-style algorithm for reducing the set of possible keys by exploiting relationships between key bytes within the key schedule, given full knowledge of the leakage corresponding to the targeted intermediate variable of the key byte XORed with its mask. The additional enumeration effort imposed by the presence of the masks is shown to be manageable: with up to 30 trace measurements, their attack drastically reduces the remaining search space in a relatively short amount of time. The attacks are further analysed in the presence of a shuffling countermeasure (the order in which the key bytes are combined with the masks is randomised within each 4-byte column), and the authors discover that they can adapt the algorithm to explore all permutations of the ordering, taking on average several hours to succeed. With the additional assumption of being able to introduce faults in specific parts of the algorithm, this time can be reduced to the order of minutes.

The techniques presented for exploiting the information leakage are intricate and clever. The work motivates an open question for this area of research: how to relax the assumption that the attacker has full knowledge of the information leakage.

Friday, October 24, 2014

Study group: Lattice-based Digital Signatures

This week's study group was given by Emmanuela on the subject of digital signature schemes based on lattices. Because not everyone likes lattices as much as I do, Emmanuela decided to first convey some of the history of lattice signatures to the audience. The seminal work that everyone refers to when they talk about any sort of lattice-based cryptography is of course Ajtai's paper from '96 on the connection between the average and worst cases of certain lattice problems. Around the same time, the NTRU and GGH cryptosystems were proposed, providing both encryption and digital signature schemes. However, both schemes had their initial issues, and it turned out that the security of the digital signature schemes in particular proved problematic. In hindsight it is pretty clear that the problem lies with the 'noise' distribution and that signatures leak information on the secret key.

The next steps towards the security of lattice-based signature schemes were taken in '08, when two independent works described schemes that are provably secure based on the hardness of standard lattice problems. The first is by Gentry, Peikert and Vaikuntanathan, which follows the hash-and-sign paradigm and includes a useful method to sample from the discrete Gaussian distribution, which is used in most modern lattice-based crypto schemes. As the word 'hash' implies, the security proof for this scheme is in the random oracle model. The second scheme is by Lyubashevsky and Micciancio in the standard model, but it only provides one-time secure signatures. These can be converted into fully secure signatures using a tree construction, which requires a logarithmic number (in the security parameter) of applications of the one-time scheme.

These two works inspired two different lines of research. One line focuses on getting the best efficiency in the random oracle model, whereas the other focuses on getting security in the standard model while maintaining a good asymptotic efficiency as well. The focus paper of the study group was in this second line of research: Improved Short Lattice Signatures in the Standard Model by Ducas and Micciancio from this year's Crypto. It combines the 'vanishing trapdoor' technique due to Boyen and the 'confined guessing' method due to Böhl et al. For their lattice trapdoors, they use the work of Micciancio and Peikert from Eurocrypt '12. This non-trivial combination leads to a scheme where the signatures are short (consisting of only one vector) at the cost of having keys consisting of a logarithmic number of vectors. They also propose a stateful scheme which reduces the key sizes to a log log number of vectors and also tightens the reduction, removing a factor introduced by the confined guessing stage as well as tightening the approximation factor of the underlying lattice problem. Interestingly, the schemes by Ducas and Micciancio require the additional algebraic structure of ideal lattices, whereas previous works only use this structure for the efficiency improvement.

In conclusion, the result is a new scheme that compares favourably to previous schemes by either allowing for smaller signatures or smaller keys. But things move fast in the lattice world, as there is already a new paper on the ePrint archive that reduces the keys to a constant number of vectors, at the cost of a bigger approximation factor in the underlying lattice problem. It is still possible to choose parameters such that this approximation is polynomial, but it is also possible to pick them less conservatively, resulting in a subexponential approximation factor. It will be interesting to see whether such choices survive future improvements to cryptanalysis.

Wednesday, October 22, 2014

52 Things: Number 3: Computational and storage power of different form factors

This is the third in a series of blog posts to address the list of '52 Things Every PhD Student Should Know' to do Cryptography. The set of questions has been compiled to give PhD candidates a sense of what they should know by the end of their first year. We will be presenting answers to each of the questions over the next year, one per week, and I am the student assigned to the third question:
Q3: Estimate the relative computational and storage capabilities of
  • a smart-card
  • a micro-controller (i.e. a sensor node)
  • an embedded or mobile computer (e.g., a mobile phone or PDA)
  • a laptop- or desktop-class computer.
To measure the computational capability of a device we could assess the clock speed of its processors. This may be misleading if the processor enables some form of parallelism---two cores running at 2 GHz obviously possess more computational power than a single core running at 2 GHz, and so finding a direct quantitative measure is not a realistic expectation. For specific devices like general purpose graphics cards, often the total FLOPS (floating point operations per second) the device is capable of sustaining is reported (for either single or double precision arithmetic) but even this measure is not a particularly reliable choice when applied to any given problem---indeed, some services facilitate a comparison by benchmarking the performance of different devices on a variety of problem instances---see, for example, CompuBench. Fortunately the range of capabilities of the devices included in the question makes a sufficient answer less dependent on quantitative metrics.
A measure for the storage capabilities of each device is much simpler to find: we can simply compare the approximate number of bytes of information the device is capable of holding on permanent storage.
A smart-card is the least computationally powerful device: obviously clock speeds vary for different implementations, but one might expect to see around a 20 MHz core speed. In terms of storage, a typical smart-card might have around 2 kilobytes (KiB) available.
A microcontroller is "a small computer on a single integrated circuit containing a processor core, memory, and programmable input/output peripherals" [1]. The range of storage and compute capability available will vary significantly according to the exact definition of microcontroller, but taking the suggested sensor node as an example, a typical microcontroller is likely to have similar computational capabilities as a smart-card and slightly more storage available, perhaps in the order of a few KiB to a few megabytes (MiB).
A mobile computer such as a mobile phone has significantly more storage and computing power, and the amount of power available is rapidly increasing over time. Taking the 2008 iPhone and the 2013 Nexus 5 phone as examples, the iPhone used a 412 MHz 32-bit RISC ARM core, whereas the Nexus 5 uses a 2.3 GHz quad-core processor. In terms of storage, if we ignore the ability of some phones to use removable storage, then a high-end phone in 2013 might be expected to provide in the order of 16 to 32 gigabytes (GiB) of storage.
Finally, most laptop or desktop class computers are likely to have more processing power than a mobile phone: the high-end Intel "Haswell" i7-4790K processor contains 4 cores each clocked at 4 GHz, and the AMD "Piledriver" FX-9590 CPU contains 8 cores at 4.7 GHz---note that a direct comparison between these two processors requires more than just assessing core counts and clock speeds! There are other factors that can affect the computing capabilities of a desktop or laptop computer---in particular, the addition of a graphics processing unit can, for certain problems, provide a large increase in performance. The storage capacity of a laptop or desktop can vary tremendously, but a typical amount of storage in a consumer machine might be between hundreds of gigabytes and several terabytes (TiB)---the largest single hard drive capacities are now around 8 TiB.
  • [1] https://en.wikipedia.org/wiki/Microcontroller

Friday, October 17, 2014

Study group: witness-indistinguishable proofs (2014-10-16)

Zero-knowledge (ZK) proofs are a fairly common object in cryptography. What's less common knowledge: zero-knowledge does not compose well. For example, for every interactive ZK prover/verifier pair $(P, V)$, you can build another pair $(\overline P, \overline V)$ that is still a ZK proof of the same language, but running two prover/verifier instances in parallel leaks a witness.

Back in the early days of ZK, Feige and Shamir came up with an alternative notion called witness indistinguishability (WI). This says that (for some NP language) if $v, w$ are two witnesses to a statement $x$ then a proof using $v$ is indistinguishable from one using $w$. For languages where each statement has a unique witness, like discrete logarithms, this property holds trivially, but once there are several witnesses it becomes interesting. For example, a WI proof of a witness to a Pedersen commitment $G^x H^r$ is guaranteed not to reveal $x$, just like the original commitment itself. And WI is closed under composition. In fact, you can do a strong (and composable) form of WI that is information-theoretic and holds even if the verifier knows the witnesses in question, i.e. the proof is statistically independent of the witness.

The second topic we looked at is ZAPs. No-one seems to know what ZAP stands for, but it's a two-round WI proof in the plain model (plain ZK would require at least 3 rounds). The idea: start with a "common random string"-model non-interactive ZK (NIZK) scheme. In round 1, the verifier picks a lot of random strings $r_1, \ldots, r_k$. In round 2, the prover picks one random $r$ and sets $c_i = r_i \oplus r$ to get $k$ different CRS values. The prover then sends a CRS-NIZK proof for each of these values; the verifier accepts if all of them verify. An argument on the probability of a proof for a false statement going through on a random CRS then shows that the soundness error of this construction is negligible in $k$.
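Roughly, the argument is a union bound (a sketch, assuming a false statement admits an accepting NIZK proof on at most an $\epsilon$ fraction of CRS strings): for any fixed prover string $r$, the values $c_i = r_i \oplus r$ are uniform and independent over the verifier's choice of $r_1, \ldots, r_k$, so for statements of length $n$

$\Pr\left[\exists\, r \text{ and false } x \text{ such that all } k \text{ proofs w.r.t. } c_i = r_i \oplus r \text{ verify}\right] \leq 2^{|r|} \cdot 2^{n} \cdot \epsilon^{k},$

which is negligible once $k$ is a sufficiently large polynomial.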

At CRYPTO '03, Barak et al. further showed how to derandomise ZAPs.

Our final application is a 3-round OT scheme. To perform a 1-out-of-2 OT on two bits $b_0, b_1$ held by the prover, the verifier picks an RSA modulus $N = pq$. The prover sends a random string $r$ and the verifier replies with two random elements $y_0, y_1$ and a ZAP w.r.t. $r$ that at least one of them is a quadratic residue $\bmod N$. The prover then picks two random $x_0, x_1$ and sends $y_0^{b_0} \cdot x_0^2$ and $y_1^{b_1} \cdot x_1^2$. The verifier, who knows the factorisation of $N$, can recover one of the bits by testing the quadratic residuosity of the corresponding value. To OT a whole bitstring, this protocol can be run for all bits in parallel. This is where it is important that the ZAP (which is WI) still works under concurrent composition.
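Here is a hedged Python sketch of the quadratic-residuosity core of this OT (the ZAP itself is omitted, and the toy primes, names and parameters are my own):

```python
import random
from math import gcd

p, q = 1009, 1013                 # toy primes; the verifier keeps them secret
N = p * q

def random_unit():
    while True:
        x = random.randrange(2, N)
        if gcd(x, N) == 1:
            return x

def is_qr(a):
    # a is a quadratic residue mod N iff it is one mod both prime factors
    return pow(a, (p - 1) // 2, p) == 1 and pow(a, (q - 1) // 2, q) == 1

# Verifier: chooses which of the two bits to learn (here the second one).
sigma = 1
y = [0, 0]
y[1 - sigma] = pow(random_unit(), 2, N)       # a quadratic residue
while True:                                   # a non-residue in slot sigma
    cand = random_unit()
    if not is_qr(cand):
        y[sigma] = cand
        break
# (The ZAP would prove that at least one of y[0], y[1] is a quadratic residue.)

# Prover: holds bits b_0, b_1 and blinds y_i^{b_i} with a fresh square.
b = [1, 0]
c = [(pow(y[i], b[i], N) * pow(random_unit(), 2, N)) % N for i in range(2)]

# Verifier: in the non-residue slot, c_i is a residue exactly when b_i = 0.
recovered = 0 if is_qr(c[sigma]) else 1
assert recovered == b[sigma]
```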


Thursday, October 16, 2014

52 Things: Number 2: What is the difference between a multi-core processor and a vector processor?


On the face of it, you may be confused as to what the difference is between these two processors. After all, you may be familiar with words like parallel computing and have come across these two different types of processor. So what are the differences between them? This is the question for this week's '52 Things Every Cryptography PhD Student Should Know'. But before we get into the nitty gritty of it, why don't we first have a look at the concept these two different processors are part of, namely parallel computing.

What is parallel computing?


Before answering this question we first need to consider the conventional "serial" model of processing. Let's do so by imagining some problem we need to solve. The way serial computing solves this problem is by viewing it as a number of steps (instructions) which the processor deals with in sequential order. The processor executes each of the instructions in turn and then, at the end, the answer comes out and the problem is solved. Whilst this is a perfectly good way of solving the problem, it does imply a bottleneck in the speed of solving it: namely, the speed of the processor at executing the individual instructions. This is fine if the problem isn't too large, but what happens when we have to deal with larger problems or want to compute things faster? Is there a way of increasing the speed of computation without the bottleneck of the speed of the processor?

The answer as you might have guessed is yes and it comes in the form of something called parallel computing. What parallel computing does to the problem we are trying to solve is to break it down into smaller problems, each of which can be computed separately at the same time. In this way, the problem is distributed over different processing elements which perform each of these different sub problems simultaneously, providing a potentially significant increase in speed of computation – the amount of speed up depends on the algorithm and can be determined by Amdahl's law [1]. So how does this all work? How can you process things in such a way as this? Well two solutions to the problem are multi-core and vector processors.

What is a multi-core processor?


A multi-core processor is a single computing component that carries out parallel computing by using multiple serial processors to do different things at the same time. The sub problems of the bigger problem discussed earlier are each solved by a separate processor allowing programs to be computed in parallel. It's like having multiple people working on a project where each person is given a different task to do, but all are contributing to the same project. This might take some extra organising to do, but the overall speed of getting the project completed is going to be faster.

What is a vector processor?


A vector processor is a processor that executes single instructions (as in a serial processor) but carries them out on multiple data items arranged in one-dimensional arrays (unlike a standard serial processor, which operates on single data items). The idea here is that if you are doing the same thing many times to different pieces of data in a program, rather than executing a single instruction for each piece of data, why not apply the instruction to all of the data at once? The acronym SIMD (Single Instruction Multiple Data) is often used to denote instructions that work in this way.

What is the difference?


So that's the general idea; let's sum up with an example. Let's say we want to roll 4 big stones across a road and it takes one minute to do each roll. The serial processor rolls them one by one and so takes four minutes. The multi-core processor with two cores has two people to roll stones, so each person rolls two stones and it takes two minutes. The vector processor gets a long plank of wood, puts it behind all four stones and pushes them all at once, taking one minute. The multi-core processor has multiple workers; the vector processor has a way of doing the same thing to multiple things at the same time.
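As a rough software analogy (not real SIMD hardware, just an illustration), numpy's vectorised operations below play the role of the plank of wood, while multiprocessing hands separate stones to separate workers:

```python
import numpy as np
from multiprocessing import Pool

def roll(chunk):
    # the "same thing" applied to a chunk of data
    return chunk * 2 + 1

if __name__ == "__main__":
    data = np.arange(1_000_000)

    # Vector-processor style: one operation applied to a whole array at once.
    vectorised = roll(data)

    # Multi-core style: split the data and give each piece to a separate worker.
    with Pool(4) as pool:
        parts = pool.map(roll, np.array_split(data, 4))
    multicore = np.concatenate(parts)

    assert np.array_equal(vectorised, multicore)
```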

[1] http://en.wikipedia.org/wiki/Amdahl%27s_law

Friday, October 10, 2014

Compiler-based side-channel application and masking

This week's study group was led by David McCann and focused on the “Compiler-based Side Channel Vulnerability Analysis and Optimized Countermeasures Application” [1] paper presented at the Design Automation Conference (DAC) 2013. At a high level, the authors consider the output of each instruction executed and determine, for each individual bit, the minimum number of key bits mixed into the result. Using this information, the compiler makes a decision (based on a threshold) on whether or not to mask the intermediate data.

Consider a toy example, the AddRoundKey operation from AES: $s = k \oplus m$, where $m$ is the input matrix, $k$ is the key matrix and $s$ is the resulting state matrix. Each bit of $s$ contains only a single bit of keyed information. The authors define security as a threshold on the minimum number of key bits affecting each intermediate state bit. The exception is an intermediate state bit that has no relation to the key bits (in this case, its security is considered infinite).
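A toy sketch of the idea (not the paper's actual SDFA rules) is to track, for each bit of an intermediate value, the set of key-bit indices mixed into it, and flag any bit whose count falls below the chosen threshold:

```python
THRESHOLD = 8   # hypothetical security threshold on key bits per state bit

def xor_deps(a_deps, b_deps):
    # XOR combines its inputs linearly, so key-bit dependencies accumulate.
    return [a | b for a, b in zip(a_deps, b_deps)]

def needs_mask(deps):
    # A bit depending on no key bits is treated as 'infinitely' secure.
    return [0 < len(d) < THRESHOLD for d in deps]

# AddRoundKey on one byte: s = k XOR m.
key_deps = [{i} for i in range(8)]     # bit i of k depends on key bit i
msg_deps = [set() for _ in range(8)]   # plaintext bits carry no key material
state_deps = xor_deps(key_deps, msg_deps)
print(needs_mask(state_deps))          # every state bit mixes in only one key bit
```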

The authors incorporate the additional stages, referred to as the “security-oriented data flow analysis” (SDFA), into the LLVM compiler framework. This has some immediate advantages over applying masks directly in the source code, specifically, if the source code applies a masking scheme, and the compiler is clever enough to notice, the masking process may be identified as unnecessary computation and be optimised out. In addition to this, only the vulnerable intermediate bits are identified for masking rather than the application as a whole.

The SDFA converts the compiler intermediate-representation (IR) code to a Control-Flow Graph (CFG) where each node represents a statement in the program. The authors go on to define a set of rules for the propagation of keyed information at the input and output of the and, or, cmp, mul, div, mod, store, load and shift operations. I could define all the rules here; however, this would be almost equivalent to re-writing the paper, so if you are interested, please read the full text [1].

In the final section, the authors present experimental results with an implementation of AES-128. The results are compared to implementations presented in [2] and [3]. They discuss different levels of masking and their respective code size and speed performance. The authors highlight the speed-up in performance over existing masked implementations (roughly x2.5).

It is a nice and neat presentation of how to exploit the compiler framework to identify and mask vulnerable intermediate data. I am on the fence about the speed-up results, given that the reference implementations are optimised for 8-bit processors (whereas the target hardware was a 32-bit processor). They also state the code size increase in their work is 'acceptable' given the speed-up. However, there is a threefold increase in size for a first-order tabulated S-Box (attributed to loop unrolling) for a twofold speed-up. Nevertheless, a nice paper with good potential for automation of side-channel countermeasures.




Plaintext Awareness and Signed ElGamal

For the first study group of the academic year, David Bernhard spoke on recent developments in the area of CCA-secure public key encryption. After introducing the security goals of the area and some key previous results, he presented a paper by Yannick Seurin and Joana Treger from CT-RSA 2013 [0], which aims to provide such a scheme via a variant of Schnorr-signed ElGamal encryption...

ElGamal Encryption

We begin with a recap of the ElGamal encryption scheme. ElGamal is a public-key cryptosystem whose security is derived from the hardness of the discrete logarithm problem (DLP). Very briefly, it uses a publicly known cyclic group $G$ (in which the DLP must be hard!) of prime order $p$, with a chosen generator $g$. The user setting up the scheme picks a secret value $x$, and sets their public key to be $h:=g^x$. To encrypt a message $m$ under this public key, one picks a random $y$ and sends the pair $(g^y,mh^y)$, which the initial user can decrypt since they know $x$.
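A minimal Python sketch of textbook ElGamal, using toy parameters I picked purely for illustration (a real implementation uses large standardised groups and a proper message encoding):

```python
import random

q = 1187     # toy safe prime: q = 2p + 1
p = 593      # prime order of the subgroup of quadratic residues mod q
g = 4        # generator of that order-p subgroup

x = random.randrange(1, p)      # secret key
h = pow(g, x, q)                # public key h = g^x

def encrypt(m, h):
    y = random.randrange(1, p)                  # fresh ephemeral randomness
    return pow(g, y, q), (m * pow(h, y, q)) % q

def decrypt(c1, c2, x):
    s = pow(c1, x, q)                           # shared value g^{xy}
    return (c2 * pow(s, -1, q)) % q             # divide out h^y (Python 3.8+)

m = pow(g, 7, q)                # encode the message as a group element
c1, c2 = encrypt(m, h)
assert decrypt(c1, c2, x) == m
```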

ElGamal encryption is IND-CPA secure if (and only if) the Decisional Diffie-Hellman (DDH) assumption holds [1]. However, its ciphertexts are malleable and thus the scheme cannot be IND-CCA2. Indeed, extending it to such a scheme turned out not to be as simple as one might think. Various papers made progress in this direction, taking two rather different approaches. One branch aimed to create a hybrid scheme (e.g. DHIES [2]), reducing security to a stronger assumption on the group and to the security of the symmetric primitive used. The method we will look at is the alternative...

 

Plaintext Awareness

A scheme is said to be plaintext aware in the Random Oracle Model (ROM-PA) if an adversary cannot generate a valid ciphertext without already knowing its plaintext. That is, it encapsulates the behaviour of a good scheme that prevents the adversary from generating valid ciphertexts other than by encrypting plaintexts herself (or doing something she knows to be equivalent to this). The technicalities of this definition are rather complex, but roughly mean that given a ciphertext and the list of random oracle queries made to generate it, one can deduce the underlying plaintext.

Now, the key result for us is that if a scheme is both IND-CPA and ROM-PA secure, then it is IND-CCA2 secure [3].

 

Making ElGamal ROM-PA

Various earlier results had considered designing a plaintext aware scheme by combining ElGamal with a Schnorr signature [1,4], forming a scheme this paper refers to as SS-EG. Whilst SS-EG inherits IND-CPA security from ElGamal, unfortunately it does not add plaintext awareness. The key contribution of this paper [0] was to observe that SS-EG is in some sense "very close", and to define a very similar scheme using Chaum-Pedersen commitments rather than Schnorr signatures. Denoted CPS-EG, the only real difference between the schemes is a slight modification to the variables in the ciphertext and the addition of two more variables to the random oracle query when generating the commitment. It is the second of these differences that is critical, because the extra information supplied to the random oracle (information that an extractor is given direct access to) is sufficient to make the scheme ROM-PA.
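For intuition, here is a hedged sketch of the SS-EG flavour mentioned above (not the paper's CPS-EG variant), reusing the toy group from the previous snippet; the ciphertext carries a Fiat-Shamir style proof of knowledge of the encryption randomness $y$:

```python
import hashlib
import random

q, p, g = 1187, 593, 4          # same toy parameters as before
x = random.randrange(1, p)      # secret key
h = pow(g, x, q)                # public key

def H(*vals):
    data = b"|".join(str(v).encode() for v in vals)
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % p

def encrypt_signed(m, h):
    y = random.randrange(1, p)
    c1, c2 = pow(g, y, q), (m * pow(h, y, q)) % q
    a = random.randrange(1, p)          # Schnorr commitment randomness
    A = pow(g, a, q)
    e = H(A, c1, c2)                    # random-oracle challenge
    z = (a + e * y) % p                 # response proving knowledge of y
    return c1, c2, A, z

def decrypt_signed(c1, c2, A, z, x):
    e = H(A, c1, c2)
    # Reject ciphertexts that do not come with a valid proof of knowledge.
    assert pow(g, z, q) == (A * pow(c1, e, q)) % q
    return (c2 * pow(pow(c1, x, q), -1, q)) % q

m = pow(g, 11, q)
assert decrypt_signed(*encrypt_signed(m, h), x) == m
```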

Making ElGamal IND-CCA2


At this point, one may hope that the scheme is indeed IND-CCA2, but there remains one final twist. There are various ways one may transmit the required data for a signature, of which one traditional method is the Fiat-Shamir style proof, consisting of a commitment, a hashed challenge and an appropriate response (i.e. a non-interactive proof of knowledge). However, there are also alternative representations that are more efficiently packaged, leading to more concise ciphertexts. The particular representation chosen in the original scheme in fact failed to inherit IND-CPA security from the underlying encryption scheme [5].

Returning to the (less concise) Fiat-Shamir scheme ensures the final version of CPS-EG is indeed an IND-CCA2 secure public key encryption scheme.


References

[0] A Robust and Plaintext-Aware Variant of Signed ElGamal Encryption
Yannick Seurin and Joana Treger
CT-RSA 2013

[1] On the Security of ElGamal Based Encryption.
Yiannis Tsiounis and Moti Yung
PKC '98

[2] The Oracle Diffie-Hellman Assumptions and an Analysis of DHIES
M. Abdalla, M. Bellare and P. Rogaway
CT-RSA '01

[3] Relations among notions of security for public-key encryption schemes
M. Bellare, A. Desai, D. Pointcheval and P. Rogaway
Crypto '98

[4] A practical mix
Markus Jakobsson
Eurocrypt '98

[5] Cited by authors as 'personal communication'
Bertram Poettering
Jan 2013

Thursday, October 9, 2014

52 Things: Number 1 : Different Types of Processors

This is the first in a series of blog posts to address the list of '52 Things Every PhD Student Should Know' to do Cryptography. The set of questions has been compiled to give PhD candidates a sense of what they should know by the end of their first year - and as an early warning system to advise PhD students they should get out while they still can! In any case, we will be presenting answers to each of the questions over the next year and I have been volunteered for the (dubious) honour of blogging the very first 'thing'. The first topic is on computer architecture and is presented as the following question:

What is the difference between the following?
- A general-purpose processor.
- A general-purpose processor with instruction-set extensions.
- A special-purpose processor (or co-processor).
- An FPGA.

There is no strict definition of a general-purpose processor, however, it is widely accepted that a processor is general if it is Turing-complete. This captures any processor that is able to compute anything actually computable (i.e. can solve any problem a Turing machine can). I will not delve into the definition of a Turing machine but if I've already lost you then I would recommend brushing up on your Theory of Computation [1]. Note though that this definition has no concept of performance or instruction-set capabilities and, in fact, some researchers have gone through the trouble of proving that you may only need a single instruction to be Turing-complete [2]. In the context of modern processors, most programmable CPUs are considered general purpose.

Being general-purpose usually comes with a penalty in performance. Specifically, a general-purpose processor may be able to compute anything computable, but it will never excel at complex repeated tasks. Given a task that is repeated regularly on a general-purpose processor in a wide variety of applications, a processor designer may incorporate instruction-set extensions to the base micro-architecture to accommodate the task. Functionally, there may be no difference in the micro-architecture's capabilities, but practically there may be huge performance gains for the end-user.

As we're all cryptographers here, I will stick to a crypto example for instruction-set extensions. Consider a desktop machine with an AES encrypted disk. Any read from secondary storage requires a CPU interrupt so that the data blocks can be decrypted before being cached. Given that disk access on a cache miss is already considered terrible, add the decryption routine over the top and you have a bottleneck worth making you re-consider your disk encryption. It should be clear here that AES is our complex repeated task, and given a general-purpose CPU with a simple instruction set, we have no choice but to implement the decryption as a linear stream of standard operations. Intel and AMD both recognised the demand for disk encryption and the penalty AES adds to secondary storage access, and have (since circa 2010) provided the AES-NI x86 instruction-set extension to accelerate AES, and hence disk encryption, on their lines of desktop CPUs.

If you want to fully accelerate a computation, the optimal result will always be a special-purpose processor or an application-specific integrated circuit (ASIC). Here we lose a significant portion of the flexibility granted by a general-purpose processor in exchange for performance gains. These types of processors are often tightly coupled to a general-purpose processor, hence the term co-processor. Note, a co-processor may indeed be in the same package as a general-purpose processor but not necessarily integrated into the general-purpose architecture. Once again, if we turn to modern processor architectures, Intel and AMD have both been integrating sound cards, graphics processors and DSP engines into their CPUs for some time now. The additional functionality is exposed via special-purpose registers, and the co-processor is treated as a separate component which the general-purpose processor must manage.

Finally we turn to Field-Programmable Gate Arrays (FPGAs). The middle-ground between ASIC and general-purpose processors. If an application demands high-performance throughput but also requires (infrequent) modification then an FPGA is probably the answer. To understand how an FPGA works, consider a (very) large electronics breadboard with thousands of logic-gates and lookup tables (multiplexers attached to memory) placed all around the breadboard. If you describe an application as a set of gates and timing constraints then you can wire it together on the breadboard and produce a circuit that will evaluate your application. An FPGA affords the flexibility of being re-programmable whilst producing the dedicated logic to evaluate a target application. The key difference to a general-purpose program is how you design and build your application. In order to get the most out of the hardware you must describe the application as a set of hardware components and events using a hardware description language (Verilog or VHDL). This process is frequently used to prototype general-purpose and special-purpose processors on FPGAs before production. However, it is not without its drawbacks. Designing a program with low-level building blocks becomes very cumbersome for large applications. In addition, the energy consumption and hardware costs are generally higher in comparison to a general-purpose embedded IC. Recently, FPGA manufacturer Xilinx have begun shipping FPGAs with ARM general-purpose cores integrated within a single package [3]. This now makes FPGAs available to the ARM core as a flexible co-processor. As a result, you can build dedicated logic to evaluate your crypto primitives and hence accelerate cryptographic applications.

In summary, general-purpose processors are capable of computing anything computable. The same holds for a general-purpose processor with instruction-set extensions, which may additionally perform much better on particular tasks. A special-purpose processor (or co-processor) is very fast at a particular task but is unable to compute anything outside of that. An FPGA can be used to build all of the above hardware but sacrifices speed for flexibility compared to an ASIC solution.


Monday, October 6, 2014

EM is beginning to lose the non-invasive touch

Anyone familiar with hardware side-channel attacks knows that electromagnetic (EM) field attacks are considered the most practical and forgiving attack vector. This is owing to the non-invasive nature of EM measurements and the ability to localise the leakage source. We note this in contrast to early differential power analysis (DPA) [1] attacks, which require a direct power tap (and often some modification to the target circuitry).

A recent publication at CHES 2014 [2] seeks to challenge the state of EM-field attacks in an attempt to detect and subsequently prevent them. Prior attempts have been made to design EM 'conscious' devices (either by EM noise generation or active EM shielding [3]), but these have come at a heavy cost in area and power consumption, both of which are often of higher priority than security in an integrated circuit (IC) development cycle. This recent publication addresses these constraints and presents a simple and elegant solution to foil EM-field attacks.

First, let's recall the stages of an EM-field side-channel attack. An adversary must first buy/capture/'borrow' a target device. Having successfully gained physical access to the device, they must now establish a method to capture the EM-field side-channel leakage. A standard setup will include a near-field probe, an amplifier and a digitizer/oscilloscope. The probe is placed over an (experimentally determined) area of interest and the adversary records the EM-field radiation during the execution of some cryptographic operations. Additional information may be required for specific attacks (ciphertexts or plaintexts) but we'll stick to a generic scenario here.

In [2], the authors present a design methodology which allows the IC to detect probing attempts and prevent an attack early on in the process. The authors exploit the physical laws of EM-fields and mutual inductance to detect frequency shifts in the side-channel leakage caused by a near-field probe being placed in close proximity to the IC. If we consider the EM-frequency on the surface of the target IC as a result of some inductance ($L$) and capacitance ($C$), we can calculate the frequency as follows:

$f_{LC} \approx \frac{1}{2\pi\sqrt{LC}}$

With the introduction of an EM-field probe, we expect the probe coil to produce its own field and hence some form of mutual inductance ($M$) at the IC surface. The (shifted) EM-frequency at the IC surface will then present as:

$\bar{f}_{LC} \approx \frac{1}{2\pi\sqrt{(L-M)C}}$
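Plugging numbers into these two expressions shows how the relative shift grows with the mutual inductance (a toy illustration, not the paper's detection circuit):

```python
import math

def shift_ratio(m_over_l):
    # f_bar / f = sqrt(L / (L - M)) = 1 / sqrt(1 - M/L)
    return 1.0 / math.sqrt(1.0 - m_over_l)

for m_over_l in (0.01, 0.05, 0.10, 0.20):
    print(f"M/L = {m_over_l:.2f} -> frequency shifted up by "
          f"{(shift_ratio(m_over_l) - 1) * 100:.1f}%")
```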

We expect the mutual inductance to be inversely proportional to the distance between the IC surface and the probe. Hence, as the probe approaches the surface, the frequency shift increases. At a high level, the countermeasure detects the frequency shift and alerts the cryptographic core to the attack. However, as any IC designer will point out, analogue circuitry requires a careful design process and is often restrictive and costly. In addition, capturing a reference frequency may not always be possible if the adversary has unfettered access to the device. The authors realise this and present a clever dual-coil control and detection logic implemented with a standard cell library and a MOS capacitor bank. This allows the entire design workflow to be (semi-)automated, greatly reducing the development time and resource constraints. We'll not go into the details of the design here, but you can pick up all the information from their paper on eprint [2].

As a proof-of-concept design, the authors produced a $0.18\mu m$ process ASIC with an AES core and their EM detection logic. They proceeded to test the ASIC under a few different attack scenarios, ranging from a vanilla EM attack to an adversary who is completely aware of the countermeasure and attempts to circumvent it. In all scenarios, the detection logic was able to pick up the EM probe and assert the control logic to halt the cryptographic operations. Arguably a solid result for the design team. The paper presents the system in a very nice and neat package for IC designers to pick up and implement. With relatively low overhead costs ($2\%$ area, $9\%$ power and $0.2\%$ performance) it is hard to argue against it. However, it is not without a few caveats.

The detection system will not be able to detect all EM attacks, and the authors do acknowledge this in their conclusion; however, they do not discuss it in any great detail. Having no access to their device, I can guess at a few scenarios in which their system is too limited to detect an attack. Primarily (from my understanding), the authors always depackage the device (normally unnecessary when dealing with EM-field side-channel attacks, and defeating the purpose of its non-invasive nature) and measure the probe distance relative to the die surface rather than the package surface. There seems to be little mention of the effectiveness with the device package still intact. Furthermore, their current approach is limited to detecting probes at a maximum of $0.1mm$ from the die surface, whereas EM leakage can be picked up at far greater distances [4]. There is also the prospect that an adversary will not position the probe above the IC itself but over the support circuitry around the IC (i.e. decoupling capacitors and power regulators). In this scenario, the countermeasure will be unable to detect any shift. Finally, there is little discussion of false positives. All electrical devices produce some form of mutual inductance and capacitive coupling, so consider a device deployed in the field with these countermeasures: will placing it near my smartphone (which contains several antennas and ICs) stop the device from performing any cryptographic operations? For its practical shortcomings though, this paper is a solid move in the direction of preventing EM side-channel attacks. The design methodology and workflow make it appealing for practitioners, and the simplicity behind the approach minimises the cost for IC manufacturers; overall a good contribution to the literature.


[1] http://www.cryptography.com/public/pdf/DPA.pdf
[2] https://eprint.iacr.org/2014/541.pdf
[3] http://www.google.com/patents/US6962294
[4] https://www.youtube.com/watch?v=4L8rnYhnLt8

Friday, October 3, 2014

Everybody loves quantum

Today is the last day of the PQC conference in Waterloo, Canada. In combination with the ETSI workshop on standardising post-quantum cryptography next week, the conference and summer school have attracted a varied crowd of cryptography researchers, physicists, government employees and people who care about standards. There were four invited talks this week and in this blog post I will try to summarise two of them. The first one is about quantum computing and the second one is about quantum cryptography. I chose these two because you do not hear too much about them in regular crypto conferences.

The first invited talk was on building a quantum computer by Matteo Mariantoni. He started off by listing the applications for quantum computers, which include quantum simulation of physical and chemical processes, applications to security such as cryptanalysis and quantum crypto, but also a more recent concept aimed at machine learning and artificial intelligence called quantum data processing. His opinion was that we should compare building a quantum computer to things like the Manhattan project, the space race and the current Mars exploration projects. In this comparison, he claims a quantum computer would be easier, cheaper and more useful and he cited how the requirements on technology used in space generated improvements that aid technology for personal use as well.

There are several important steps on the way to an actual quantum computer. Currently, researchers can construct physical qubits, but these are too noisy to carry out any meaningful quantum computation. The solution is to embed the physical qubits into a two-dimensional grid such that they can interact with their nearest neighbours, and to apply error-correcting techniques which eliminate the noise. This forms a single logical qubit. Now, there are some requirements on the physical qubits and gates, like fidelity and readout times, that need to be met before the logical qubit will actually work. At Waterloo they spent two years optimising their set-up, which is based on superconducting qubits, in order to reach the threshold that allows them to create a logical qubit. As an example of why this sort of thing takes two years, he mentioned the task of individually ensuring that none of the screws in the chips with the superconductors are magnetic, which took two months.

Future steps in building a quantum computer consist of performing operations on the single logical qubit, performing operations on multiple logical qubits and eventually combining everything into a quantum computer. His final slide consisted of an estimate of what would be needed to factor a 2000-bit number in 24 hours. It would require 500 million physical qubits, a dedicated nuclear power plant, at least five years of research time and ten years of development time and about 1 billion dollars. The 24 hours is not really a variable in this case, as this comes from the physical implementation of the gates.


The second invited talk was about quantum cryptography, by Nicolas Gisin. He talked about two applications of quantum mechanics in constructive cryptography: Quantum Random Number Generators (QRNGs) and Quantum Key Distribution (QKD). He briefly described a QRNG based on beam splitters, which is conceptually simple but fundamentally random (if quantum mechanics is correct). Current implementations provide about 4 megabits per second of balanced random bits, and solutions have been deployed in practice. An interesting observation he made is that there is also research into different approaches for QRNGs, one of which is based on the photosensors that are also inside most modern smart phones.

However, the biggest part of the talk was about QKD. QKD is based on the fact that quantum mechanics provides non-local distributed randomness and that it is impossible to clone quantum states, which leads to information-theoretic security. I will not describe the protocol here, but the idea is that Alice and Bob exchange information through quantum states (e.g. encoded in photons), and any information learned by Eve causes a perturbation of these states which can be detected by Alice and Bob. There are even methods that, given a bounded amount of tampering by Eve, allow Alice and Bob to compress their randomness such that Eve has no information on the result. One issue is that Alice and Bob need to communicate over a classical channel, which requires authentication to prevent man-in-the-middle attacks. This leaves two options: use an information-theoretically secure solution, which requires a pre-shared key, or use a computationally secure solution. The first results in a "key-growing" scheme, where a pre-shared key turns into a longer key which can then be used for future interactions, whereas the second results in a scheme that is secure as long as the authentication is secure at the time of execution. The reason is that the attack has to be active, which means it cannot be launched at a later point in time. Essentially, QKD is immune to any future progress in cryptanalysis of the authentication scheme. Of course, in reality the implementation matters, and there have been attacks on existing implementations, similar to side-channel attacks on classical cryptography.

In the remainder of the talk, Nicolas described the state of the art in current implementations as well as the open challenges. A solution using fiber optic cables is limited in distance due to noise caused by spurious refractions inside the cable, whereas solutions through the air are hampered by that pesky atmosphere that allows us to breathe. The maximum range of these solutions appears to vary between about 100-400km. The obvious question then becomes how to extend this. One possibility would be to use a satellite, because the atmosphere does not stretch as far upwards. Alice sends her quantum states to the satellite, the satellite moves through orbit and sends the quantum states to Bob upon reaching him. This solution is also being explored at the University of Waterloo. Another option is to use trusted intermediary nodes where the information is converted to classical and back to quantum before it is sent on. Several countries such as the US and China are already planning such networks. The final option Nicolas mentioned is by using intermediary nodes (no longer trusted) to entangle states over longer distances. However, this solution requires some extra research into storing these entangled states until their counterparts arrive, which is currently not yet possible.