Friday, December 19, 2014

Study Group: Efficient Smart Phone Forensics Based on Relevance Feedback

This week's study group was given by Panos on the subject of Efficient Smart Phone Forensics Based on Relevance Feedback[1].  

The digital forensics procedure has four major steps: securing the device, data extraction, data analysis and data presentation. To improve the efficiency of data analysis, a sub-procedure known as forensic triage is used to identify the most relevant information on the phone. In this paper, a ranking system called LIFTR is presented that prioritises the information recovered from Android phones, where the amount of data is huge compared to feature phones.

LIFTR is designed under the following assumptions:
  1. A physical image of the phone has been acquired.
  2. The data is not encrypted; it consists of filtered binary files (the output of the previous procedure).
  3. The data is formatted as lists of strings, e.g. the output of DEC0DE or the strings command.
LIFTR then takes the information parsed by a recovery engine as input, alongside optional suspect information, and outputs a ranked list of the most important NAND pages.

There are two major steps in the design of LIFTR:
  1. Initial ranking. The input to LIFTR is a set of unranked pages. The initial ranking scores the pages according to a relevance metric, producing a ranked list for the next step.
  2. Relevance feedback. After the initial ranking, LIFTR presents some possibly relevant fields of pages to the investigator and asks for them to be labelled, usually manually, as true or false positives. LIFTR then re-sorts the ranking using this labelling information, assigning relevant pages higher ranks and irrelevant pages lower ones. This process can be repeated multiple times to improve performance.

The experimental results show that LIFTR efficiently identifies the relevant pages. However, it does not guarantee that the target information will be included in its result, so all data should still be analysed during an investigation. Nevertheless, LIFTR is an effective tool for improving the efficiency of an investigation.

[1] Saksham Varma, Robert J. Walls, Brian Lynn, and Brian Neil Levine. 2014. Efficient Smart Phone Forensics Based on Relevance Feedback. In Proceedings of the 4th ACM Workshop on Security and Privacy in Smartphones & Mobile Devices (SPSM '14). ACM, New York, NY, USA, 81-91. DOI=10.1145/2666620.2666628

52 Things: Number 11: What are the DLP, CDH and DDH problems?

This is the latest in a series of blog posts to address the list of '52 Things Every PhD Student Should Know To Do Cryptography': a set of questions compiled to give PhD candidates a sense of what they should know by the end of their first year. This blog is the second in the section on Mathematical Background and concerns how group operations can be used to design cryptographic primitives.

As you probably know by now, cryptography often relies on 'hard problems'. That is, we design cryptographic protocols whose security can be proven if we assume that the adversary is unable to solve a certain (mathematical) problem in a reasonable amount of time. This blog post introduces three such problems which are widely used in security proofs. Luckily for me, a) this is really just Group Theory, not Computer Science and b) just two days before writing this I attended a very decent guest lecture by fellow Bristol Crypto researcher Susan Thomson on this exact topic. (That is to say, any inaccuracies in the following can and should be blamed on her!)

The Discrete Logarithm Problem (DLP)

Okay, let $G$ be an abelian group. First we write the operation in $G$ multiplicatively. For any $g \in G$ and any integer $a>1$ let $g^a$ denote $g * g * ... * g$ with $a$ occurrences of $g$. The discrete logarithm problem (DLP) is:

Given $G$, $g$ and $h=g^a$, find $a$. 

Here $a$ is called the discrete logarithm of $h$ with base $g$, hence the name of the problem.

Is the discrete logarithm problem hard? Sometimes, maybe. As a counter-example, let $G$ be the integers under addition. Now it makes sense to write the group operation additively, not multiplicatively, so the same process of repeating the group operation with the same element $g$ is written $g + g + ... + g$ with, say, $a$ summands and, since we're working in the integers, this sum, call it $h$, is just equal to the integer $ag$. Therefore $a$, the discrete logarithm of $h$ to the base $g$, can be found by just dividing $h$ by $g$. For example, if I say to you, "find the discrete logarithm to the base 3 of 18 in the integers under addition", you'd just write $3 + 3 + ... + 3 = 18$ with $a$ summands on the left, simplify this to $3a = 18$ and find that $a = 6$. I could change the group to the integers modulo $N$ for some integer $N>1$ (still under addition) but the problem wouldn't be much harder: one has to solve an equation like $ag \equiv h \: (\mathrm{mod} \: N)$, which is solved by performing the Extended Euclidean Algorithm to find $g^{-1}\: (\mathrm{mod} \: N)$ (if it exists), multiplying this by $h$ and reducing modulo $N$ to obtain $a$. All of this can be done in polynomial time - no good for building secure cryptographic primitives.
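As a quick illustrative sketch (not from the original post), the additive case can be solved in a couple of lines of Python, using the built-in modular inverse, which runs the Extended Euclidean Algorithm under the hood:

```python
# Solving the "discrete logarithm" a*g ≡ h (mod N) in the additive group
# Z_N: multiply h by the inverse of g mod N (assuming gcd(g, N) = 1).
# Purely illustrative sketch.

def additive_dlog(g, h, N):
    """Return a such that a*g ≡ h (mod N), assuming gcd(g, N) == 1."""
    g_inv = pow(g, -1, N)  # modular inverse via extended Euclid (Python 3.8+)
    return (h * g_inv) % N

# Example: find a with 3a ≡ 4 (mod 7)
a = additive_dlog(3, 4, 7)
assert (a * 3) % 7 == 4
print(a)  # 6, since 6*3 = 18 ≡ 4 (mod 7)
```

This is polynomial time in the bit length of $N$, which is exactly why this group is useless for cryptography.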

On the other hand, the DLP in a finite field of prime order viewed as a group under multiplication (after throwing out the zero element) or in elliptic curve groups (more on those next week) is believed to be hard. That is, we do not yet know of any polynomial time algorithms for finding discrete logarithms in these groups. As a concrete example, suppose I ask you, "find the discrete logarithm to the base 3 of 5 in the multiplicative group of the integers modulo 7". This means finding an integer $a$ such that $3^a \equiv 5\: (\mathrm{mod} \: 7)$. Now we are in the multiplicative group, not the additive one, so we really do have to 'take logarithms' somehow, not just multiply by the inverse of 3. In this case, since 7 is fairly small, we can find the answer by just trying all the possibilities one at a time until we find a solution:
  1. $3^1 = 3 \not\equiv 5\: (\mathrm{mod} \: 7)$
  2. $3^2 = 9 \equiv 2 \not\equiv 5 \: (\mathrm{mod} \: 7)$
  3. $3^3 = (3^2)\times3 \equiv 2\times3 = 6 \not\equiv 5\: (\mathrm{mod} \: 7)$
  4. $3^4 = (3^3)\times3 \equiv 6\times3 = 18 \equiv 4 \not\equiv 5\: (\mathrm{mod} \: 7)$
  5. $3^5 = (3^4)\times3 \equiv 4\times 3 = 12 \equiv 5\: (\mathrm{mod} \: 7)$
so $a = 5$. The way we 'bounce around' the integers modulo 7 by repeatedly multiplying by the base 3 (getting 3, 2, 6, 4 and then 5) should give you some intuition as to why DLP seems hard in this setting. If our prime was much larger than 7, say thousands of bits long in binary, even a computer would take a very long time to solve DLP this way (though there are better, sub-exponential time algorithms and it is not proven that no polynomial time algorithm exists for solving DLP in this kind of group).
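The exhaustive search just described is easy to sketch in Python; this is an illustrative toy that takes time exponential in the bit length of $p$, not a serious algorithm:

```python
# Brute-force discrete log in (Z_p)^*: try exponents one at a time,
# mirroring the hand computation above. Illustrative only -- hopeless
# for cryptographically sized primes.

def brute_force_dlog(g, h, p):
    """Return the least a >= 1 with g^a ≡ h (mod p), or None."""
    value = 1
    for a in range(1, p):
        value = (value * g) % p  # maintain g^a incrementally
        if value == h:
            return a
    return None

print(brute_force_dlog(3, 5, 7))  # 5, matching the worked example
```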

The Computational Diffie-Hellman Problem (CDH)
A problem related to DLP is named after Whit Diffie and Martin Hellman, who devised a way for two parties to agree on a secret key over a public channel without revealing it:
  • Alice and Bob publicly agree on a cyclic group $G$ and generator $g$. 
  • Alice chooses a random secret integer $a$ and Bob chooses a random secret integer $b$. 
  • Alice computes $g^a$ and publicly sends this to Bob. Bob computes $g^b$ and publicly sends this to Alice. 
  • Alice and Bob both compute $g^{ab}=(g^a)^b=(g^b)^a$ by raising what they received from the other party to power of their own secret integer. 
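The steps above can be sketched as a toy Python exchange; the tiny prime and base here are illustrative assumptions only, whereas real deployments use standardised groups thousands of bits long (or elliptic curves):

```python
# A toy Diffie-Hellman exchange over (Z_p)^* following the steps above.
# Parameters are deliberately tiny and NOT secure.
import secrets

p = 2039  # a small prime (illustrative only)
g = 7     # a base for the demo

a = secrets.randbelow(p - 2) + 1   # Alice's secret exponent
b = secrets.randbelow(p - 2) + 1   # Bob's secret exponent

A = pow(g, a, p)  # Alice publicly sends g^a
B = pow(g, b, p)  # Bob publicly sends g^b

alice_key = pow(B, a, p)  # (g^b)^a
bob_key = pow(A, b, p)    # (g^a)^b
assert alice_key == bob_key  # both parties now hold g^(ab)
```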
Now $g^{ab}$ is a secret key that can be used for symmetric encryption and decryption by Alice and Bob. But someone listening in to the exchange has in their possession $G$, $g$, $g^a$ and $g^b$. So secrecy of the key $g^{ab}$ depends on this problem, called the Computational Diffie-Hellman Problem (CDH):

Given $G$, $g$, $g^a$ and $g^b$, find $g^{ab}$.

CDH is clearly related to DLP, but which is harder? Well, if I can solve DLP then I can efficiently compute the secret integer $a$ from $g^a$ and then find $g^{ab}$ by raising $g^{b}$ to the power $a$ in the same way Alice does, therefore solving CDH. So anyone who can solve DLP can also solve CDH, meaning DLP is at least as hard as CDH.

The Decisional Diffie-Hellman Problem (DDH)
This is another 'discrete logarithm' style problem used to prove indistinguishability properties. Say Alice and Bob perform the Diffie-Hellman key agreement protocol as above so that $G$, $g$, $g^a$ and $g^b$ are all public and $g^{ab}$ is the shared secret key. Intuitively, the Decisional Diffie-Hellman Problem (DDH) asks whether an adversary can distinguish Alice and Bob's secret key $g^{ab}$ from a random group element of $G$. Formally:

Given  $G$, $g$, $g^a$, $g^b$ and $T_x$ such that $T_0$ is a random element of $G$, $T_1 = g^{ab}$ and $x$ is chosen uniformly at random from $ \lbrace 0,1 \rbrace $, find $x$.

If an adversary can solve DDH (i.e. output the correct value of $x$ with probability non-negligibly greater than $\frac{1}{2}$), then $G$, $g$, $g^a$ and $g^b$ must leak some information about the secret key $g^{ab}$ that distinguishes it from a random group element, even if it can't be computed directly. What should be clear is that if the adversary can solve the computational Diffie-Hellman problem, then they can actually compute $g^{ab}$ and hence trivially distinguish this element from a random group element, thereby solving the decisional Diffie-Hellman problem. So anyone who can solve CDH can also solve DDH, meaning CDH is at least as hard as DDH.

These are the three problems we wanted to talk about, and we've given a sketch proof of their ordering in terms of hardness: DLP is the hardest, then CDH, then DDH. As we've seen, DLP is sometimes easy, making CDH and DDH easy too. So the choice of group $G$ and generator $g$ is very important when doing cryptography!

Wednesday, December 17, 2014

Study Group: Attack on XLS

This week's study group, delivered by Guy, featured a paper by Mridul Nandi attacking the XLS domain completion method. This attack was initially given at DIAC in August, and was formally presented earlier this month at Asiacrypt 2014 in Taiwan. The idea of a domain completion method is to preserve length in an encryption scheme, so that the ciphertext is the same length as the message. More formally, this means turning an encryption scheme that works on blocks of size n into a cipher that works on inputs of arbitrary length. This contrasts with the strategy of padding (for example, to the block length of a block cipher), which incurs ciphertext expansion and is used in many applications and standards. A number of approaches have appeared in the literature, including ciphertext stealing, hash-then-counter (XCB, HCTR, HCH) and applying a block cipher twice to create a one-time pad for the final block (EME, TET, HEH) - but these methods are not applicable to generic blockciphers and lack formal security analysis.

At FSE 2007, Ristenpart and Rogaway presented XLS (eXtended Latin Squares), a domain completion method that inherits the assumed strong pseudorandom permutation property of the underlying blockcipher. The method is conceptually straightforward, utilising a (linear) mixing permutation and bit-flipping, and the security proof is, as you would expect from the authors, relatively easy to follow despite its numerous intricacies. No fewer than six teams have incorporated XLS into their CAESAR submissions.

Despite this acquired confidence, Nandi presents an attack to demonstrate that XLS is in fact not a strong pseudorandom permutation (SPRP), giving an attack that makes just three encryption/decryption oracle queries and succeeds with probability 1/2. In addition, the author shows that replacing the mixing function defined in the original paper with some other stronger linear permutation(s) does not help, and gives a distinguishing attack (with success probability 1/4) against this generalised version of XLS.

Nandi does not precisely pinpoint any incorrect assertions in the proof nor provide any indication to give hope that XLS is fixable, and it will be interesting to see how the authors of the original paper (and the teams relying on XLS in their CAESAR submissions) respond in the coming months.

Friday, December 12, 2014

52 Things: Number 10: What is the difference between the RSA and strong-RSA problem?

This is the latest in a series of blog posts to address the list of '52 Things Every PhD Student Should Know To Do Cryptography': a set of questions compiled to give PhD candidates a sense of what they should know by the end of their first year. This blog post introduces the RSA and Strong-RSA problems and highlights the differences between the two.

Cryptography relies heavily on the assumption that certain mathematical problems are hard to solve in a realistic amount of time. In public-key (asymmetric) cryptography, which is what we'll be focusing on in this blog post, we use the assumed existence of one-way functions, i.e. functions that are easy to compute but difficult to invert. We use problems from algorithmic number theory to produce these functions.


The first difficult problem from number theory to talk about is factoring. Given a composite integer $N$, the factoring problem is to find positive integers $p,q$ such that $N = pq$. Although on the face of it this seems like a very simple problem, it is in fact a very tough, well-studied one. It can be solved in exponential time by checking all the numbers $p = 2, \ldots, \sqrt{N}$, but solving a problem in exponential time is not fast enough. No polynomial-time algorithm has been developed to solve the factoring problem, despite many years of research. Clearly there are examples of $N$ for which this is very easy to solve, for example whenever $N$ is even. Therefore, when thinking about using this in a cryptographic construction, we take $N$ to be very large and the product of two large primes $p,q$.
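The naive exponential-time method just described can be sketched as a few lines of Python (toy sizes only, of course):

```python
# Trial division: check every candidate divisor up to sqrt(N).
# Fine for toy numbers, hopeless for an RSA-sized modulus.
import math

def trial_division(N):
    """Return a nontrivial factor pair (p, q) of composite N, or None."""
    for p in range(2, math.isqrt(N) + 1):
        if N % p == 0:
            return p, N // p
    return None  # N is prime (or 1)

print(trial_division(15))    # (3, 5)
print(trial_division(2021))  # (43, 47)
```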

The RSA Problem

In RSA public-key encryption [1] Alice encrypts a plaintext $M$ using Bob's public key $(n,e)$ to ciphertext $C$ by $C = M^e (\textrm{mod } n)$, where $n$ is the product of two large primes and $e \geq 3$ is an odd integer that is coprime to the order of $\mathbb{Z}_n^{*}$, the group of invertible elements of $\mathbb{Z}_n$. Bob knows the private key $(n,d)$ where $de \equiv 1 (\textrm{mod } \varphi(n))$, with $\varphi(n)$ the order of $\mathbb{Z}_n^{*}$, meaning he can compute $M = C^d (\textrm{mod } n)$. An adversary can eavesdrop on $C$ and knows the public key $(n, e)$; however, the best known way for the adversary to recover $M$ is to find the factors of $n$. The RSA problem is therefore no harder than integer factorisation, but is still believed to be very hard to solve provided a suitable $n$ is chosen.
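A minimal "textbook" RSA round trip with tiny illustrative primes (and no padding, so never use this construction as-is) makes the key relations concrete:

```python
# Toy RSA key generation, encryption and decryption.
# Note d is the inverse of e modulo phi(n), not modulo n.
p, q = 61, 53
n = p * q                 # 3233
phi = (p - 1) * (q - 1)   # 3120
e = 17                    # odd, coprime to phi
d = pow(e, -1, phi)       # 2753, since 17 * 2753 ≡ 1 (mod 3120)

M = 65
C = pow(M, e, n)          # encrypt: M^e mod n
assert pow(C, d, n) == M  # decrypt: C^d mod n recovers M
```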

The Strong RSA Assumption

The strong RSA assumption differs from the RSA assumption in that the adversary may choose the (odd) public exponent $e \geq 3$ themselves: given $n$ and $C$, the task is to find a plaintext $M$ and an exponent $e$ such that $C = M^e (\textrm{mod } n)$. This is at least as easy as the RSA problem, meaning that the strong RSA assumption is, unsurprisingly, a stronger assumption. The RSA problem is now over a quarter of a century old, and public-key encryption schemes have been developed that derive their strength fully from it.

[1] R.L. Rivest, A. Shamir, and L. Adleman. 1978. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Communications of the ACM 21(2), 120-126.

Thursday, December 11, 2014

Password-Protected Secret Sharing

Hugo Krawczyk opened the final day presenting a paper on password-protected secret sharing. He began with a motivating example of protecting a high value secret such as a Bitcoin wallet. Typical storage methods using password authentication at a server are vulnerable to a single point of failure - an attacker just needs to break into one server, and can then mount an offline dictionary attack to guess the user's password and recover the key. One natural cryptographic solution to this problem is to secret share the wallet key across multiple servers (this approach is taken in the commercial solution offered by Dyadic Security), ensuring that an attacker must break into several servers to recover the key. The problem now is that in order to use the Bitcoin wallet, you must authenticate yourself to each of your servers - if this is done with a single password then there are now many points of failure for the system! An offline dictionary attack can now be performed against any one of the servers. Of course, one could choose different, strong passwords for each server, but this is unlikely to be practical for most users.

The notion of password-protected secret sharing (PPSS) gets around this. A PPSS scheme allows a user to secret share a secret among n servers (with threshold t) and store just a single password for authentication. Roughly speaking, the security guarantee is that any attacker breaking into up to t servers and obtaining all of their secret information (shares, long-term keys, password files etc) cannot learn anything about the secret or the password. This implies that the adversary's only hope is to try and guess the password in an online attack: offline attacks with fewer than t+1 servers are useless.

The main contribution of the paper is a new PPSS scheme, which only requires authenticated channels between the servers (except for a one-time setup phase) and is very efficient: authentication is done in one round of communication (2 messages), requiring 2t + 3 exponentiations for the client and 2 for each server. Previous schemes were much less efficient and often required additional assumptions such as a PKI.

The main building block of their construction is an Oblivious PRF (OPRF). Here a server holds a PRF key $k$, the user holds the input $x$ and obtains the output $f_k(x)$, whilst the server learns nothing about the input. This can be done very efficiently with a 'hashed Diffie-Hellman' technique, where the output is defined by $f_k (x) = H(H(x)^k)$ and can be computed with a standard DH exchange requiring just 2 exponentiations per party.

For the PPSS scheme, each server $S_i$ has an OPRF key $k_i$. The user chooses a password $p$, secret shares their secret $s$ into shares $s_i$ and runs the OPRF with each server to obtain $r_i = f_{k_i}(p)$. They then encrypt $s_i$ as $c_i = s_i \oplus r_i$, store $c_i$ at server $S_i$, erase all intermediate information and store the password $p$. To reconstruct, the user retrieves $c_i$ from $S_i$ and runs the OPRF to recover the shares $s_i$. The protocol is nice and simple to describe, but there are many subtle issues that arise with defining and proving security for the OPRF used in the construction: see the paper for details.
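A minimal sketch of the share-wrapping step follows, with the caveat that the real protocol evaluates the PRF obliviously via a blinded DH exchange (the server never sees $p$); here the PRF is evaluated directly for clarity, and the group, hash-to-group map and sizes are illustrative assumptions:

```python
# Sketch of the hashed-DH OPRF value f_k(p) = H(H(p)^k) and the
# share-wrapping step c_i = s_i XOR f_{k_i}(p).
import hashlib
import secrets

P = 2**127 - 1  # a Mersenne prime as a toy group modulus
G = 3

def hash_to_group(x: bytes) -> int:
    """H1: map an input to a group element (toy construction)."""
    e = int.from_bytes(hashlib.sha256(x).digest(), "big")
    return pow(G, e, P)

def oprf(k: int, x: bytes) -> bytes:
    """f_k(x) = H2(H1(x)^k)."""
    inner = pow(hash_to_group(x), k, P)
    return hashlib.sha256(inner.to_bytes(16, "big")).digest()

password = b"correct horse"
k_i = secrets.randbelow(P - 1) + 1   # server S_i's OPRF key
s_i = secrets.token_bytes(32)        # user's secret share for S_i

r_i = oprf(k_i, password)
c_i = bytes(a ^ b for a, b in zip(s_i, r_i))  # stored at S_i

# Reconstruction: rerun the OPRF with the password, unmask the share.
recovered = bytes(a ^ b for a, b in zip(c_i, oprf(k_i, password)))
assert recovered == s_i
```

An attacker holding $c_i$ but not the OPRF key $k_i$ learns nothing useful, which is why offline dictionary attacks against a single server fail.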

Another application of PPSS is for threshold PAKE (password authenticated key exchange), which allows a user to securely exchange keys with any subset of n servers using a single password, provided no more than t servers are corrupted.  They prove a generic composition theorem showing that PPSS can be combined with key exchange to give threshold PAKE in just one round, which was never achieved by any prior work.

Wednesday, December 10, 2014

The Legal Infrastructure Around Information Security in Asia

The second invited talk at Asiacrypt strikingly stood out. It was given by Helaine Leggat, an Australian lawyer and information security professional, and it touched upon cryptography only through several mentions of hash functions. The speaker started by elaborating on her motivation to get involved with legal aspects of information security, namely the advent of international cyber warfare and mass surveillance as well as the following shifts in power: from the West to the East, from nation states to the society, and from real assets to information assets. All of these are facilitated by the ease of communication that the internet provides.

For the rest of her talk, Helaine highlighted various elements of the legal framework concerning information security, with a focus on Asia and Oceania. On the international level, there are model laws by United Nations Commission on International Trade Law (UNCITRAL) such as the UNCITRAL Model Law on Electronic Commerce from 1996 and the UNCITRAL Model Law on Electronic Signatures from 2001. Both serve as templates for national law, that is, they are not applicable in their own right. Furthermore, there are conventions such as the United Nations Convention on the Use of Electronic Communications in International Contracts from 2005 and the Convention on Cybercrime by the Council of Europe, which has been adopted by various countries in the Asia-Pacific.

For the national level, the speaker mostly elaborated on laws governing the relationship between governments and citizens with respect to access to information. There are two aspects of this relationship: citizens' access to government information, and governments' access to information about their citizens. The former is regulated under freedom of information acts in many countries. Those laws usually contain exceptions for state secrets.

Even more controversial is the legislation on privacy. It can be seen as being at the core of the trade-off between freedom and protection. Even though there is a lot of commercial interest in personal data, it seems that nation states lead a war on privacy by developing the most sophisticated malware. Moreover, Helaine mentioned historic privacy legislation that distinguishes between telecommunication and broadcasting, and between telephones and computers. In the age of convergence in the form of smart devices, this makes little sense.

Finally, the speaker turned her attention to a more informal kind of legislation such as standards and best-practice codes. It is noteworthy that, despite their informal nature, standards are sometimes referred to by laws, for example India's Information Technology Act or New Zealand's Privacy Act of 1993, which refers to an ISO standard on privacy.

In the Q&A, Adi Shamir brought up the recent case where the US government asked Microsoft to disclose personal data that is stored in Ireland. Microsoft argues that this data is protected under EU privacy law. In her answer, Helaine pointed to the contradiction between the PATRIOT Act and EU law, and to the fact that different parts of the world have different attitudes towards privacy. This makes corporations operating in Europe care more about their customers' privacy there, at least superficially.

Tuesday, December 9, 2014

Large-scale Computation and Exploitation of RC4 Biases

The excellent first invited talk at Asiacrypt 2014 in Kaohsiung, Taiwan was given by Kenny Paterson (@kennyog) from Royal Holloway, entitled "Big Bias Hunting in Amazonia: Large-scale Computation and Exploitation of RC4 Biases". Kenny outlined the progress of himself and his co-authors in analysing the single and double byte biases in the RC4 keystream in detail, and using the resulting information to construct attacks on cryptographic protocols such as TLS and WPA/TKIP.

Keystream bias and plaintext recovery

The fundamental flaw with RC4 is that (at least) the first 256 bytes of the keystream are not uniformly distributed: rather than the probability of each byte taking each of the 256 possible values 0x00,...,0xFF being 1/256, the actual probabilities exhibit small differences (biases) away from this number. I find that the visual representation of these biases gives the best indication of the nature of the problem---check out the graphs here.

These biases are problematic in situations where a fixed plaintext (think cookie, or password) is repeatedly encrypted under new RC4 keys. Ideally, the distribution of the resulting ciphertexts for these plaintexts would be uniform, but because the keystream is biased, it isn't. This allows an adversary who sees enough of these ciphertexts to use a fairly simple Bayesian analysis to learn about portions of the plaintext. Protocols sitting in this repeated-plaintext context include TLS and WPA/TKIP.

The talk went into detail on how these biases can be exploited to attack these two protocols, but for the purposes of this blog I'm going to stick to discussing how these biases were estimated. For more on the actual attacks, have a look at the RHUL homepage, or a couple of blog posts from Matt Green here and here, or our own blog post here.

Generating the biases

The final section (and motivation for the title) of Kenny's talk outlined how the keystream distribution information was estimated. Estimating the distribution is conceptually simple: generate lots and lots of random 128-bit RC4 keys, record the first 256 bytes of the keystream generated under each key, and sum up the resulting values of each byte (or pair of bytes in the case of double-byte biases) to estimate the probability of each byte value occurring.
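That estimation procedure can be sketched in miniature. The RC4 implementation below is the standard KSA/PRGA, but the trial count is scaled down enormously from the paper's experiments; even so, the strongest single-byte bias (position 2 favouring the value 0) is visible:

```python
# Miniature bias-estimation experiment: generate many random RC4 keys,
# keep the first keystream bytes, and tally how often each value
# appears at each position. The paper needed 2^46 - 2^48 keystreams;
# this sketch uses a few thousand.
import secrets

def rc4_keystream(key: bytes, n: int):
    S = list(range(256))                      # KSA
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    i = j = 0                                 # PRGA
    out = []
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return out

POSITIONS, TRIALS = 4, 20000
counts = [[0] * 256 for _ in range(POSITIONS)]
for _ in range(TRIALS):
    ks = rc4_keystream(secrets.token_bytes(16), POSITIONS)
    for pos, byte in enumerate(ks):
        counts[pos][byte] += 1

# Empirical probability of byte value 0 at keystream position 2
# (index 1): the well-known Mantin-Shamir bias makes this roughly
# 2/256 rather than the uniform 1/256.
print(counts[1][0] / TRIALS)
```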

The key here is the "lots and lots" part of the above paragraph. The biases are small---the uniform probability would be 0.00390625, and from a rough glance at the estimated distributions a typical deviation might be as small as 0.00001. As a consequence, to reduce the variance in the estimation process and to estimate the distributions with sufficient precision to allow a plaintext recovery attack, a very large number of random keys is needed.

How many? In the case of estimating single-byte biases under the WPA/TKIP protocol, the number of keystreams required was 2^48, and for the double-byte biases 2^46 keystreams. Running this kind of computation using standard desktop-level resources is infeasible (unless you have a lot of time on your hands), and so the authors turned to the computational resources available through cloud computing services.

The authors used Amazon's EC2 compute cluster to deploy 128 nodes, each containing Intel Xeon E5-2680 v2 processors clocked at 2.8 GHz, giving a grand total of 4096 Ivy Bridge cores, totalling 8192 virtual cores in the presence of hyperthreading. Kenny gave some interesting anecdotes about issues with managing this amount of hardware, including having to ask Amazon to manually disable controls stopping users from spinning up more than 10 nodes at a time, and maxing out the capacity of the US West data centre.

As well as the computation required to estimate the distribution, storing the distribution is also a challenge---as always, data I/O is lurking in the background ready to pounce just as you think the difficult part is over. The single-byte distribution requires 32 GiB of storage (which I'd assume to be double-precision floating point values), and the double-byte requires a massive 8 TiB. If you're using a remote compute service, then transferring that kind of data onto local storage for further analysis requires significant effort.

Finally, the cost of the computational effort is interesting---clearly, renting compute cycles is cheaper than purchasing the necessary hardware yourself. The total cost was 43K USD; given the total compute time of 63 virtual core years, we can arrive at a cost of around 7 cents per virtual core hour.

The take-home message

For me, a few points stood out as particularly noteworthy. The maxim "attacks always get better" is definitely applicable here: from the initial observation that plaintext recovery might be possible, we've arrived at the point at which widely-used secure protocols can be considered broken when used in tandem with RC4. The results from the team at RHUL and elsewhere also demonstrate that you should keep pushing further with investigations into vulnerabilities, and that it is certainly worth investing the time and resources necessary to fully experiment with the nature of a vulnerability.

Secondly, we again return to the problems that arise at the point of intersection between applied cryptography and real-world deployments. Kenny talked about the "inertia" of implementations of protocols: from the high of 50%+ usage of RC4 in TLS, we're still seeing a usage of 33%. Having the agility to rapidly patch/update all the necessary hardware, firmware and software to fully eliminate the usage of a broken ciphersuite is clearly very difficult. Moti Yung made a good point about the relative seriousness of these cryptographic flaws---whilst a plaintext recovery attack requiring 2^20-30 TLS sessions is clearly bad and necessitates a fix, the kind of software flaw vulnerabilities we've recently observed in almost all of the most popular TLS/networking stacks are far more impactful.