Today Carolyn and Jake gave the first talk in our new series of study groups, entitled "Distinguishers in DPA attacks", focusing on the evolution of research into the optimal choice of distinguisher and discussing current issues and challenges in this area.
In differential power analysis (DPA) attacks, an adversary records power traces corresponding to a known set of plaintexts from a device with a secret key. The power consumed by the device at a particular point in time is partially dependent on the data being processed at that moment, so with careful selection of a point in time the power consumption can be found to be partially dependent on the result of some intermediate function used in the cryptographic algorithm on the device, for example, a combination of a byte of the secret key and the output of an 8-bit AES S-Box. The adversary can then enumerate the result of this intermediate function for all possible key values. Assuming the adversary knows a function that can suitably approximate how the device consumes power for a particular piece of data, he or she can then model the power consumed by the device for each of the intermediate values and for each possible key. Hence, the 'hypothetical' consumed power for the intermediate values created using the correct key guess will be most similar to the real recorded power consumption.
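The hypothesis-building step above can be sketched in a few lines of Python. Everything here is an illustrative simulation: the S-box is a randomly generated 8-bit bijection standing in for a real cipher's S-box (e.g. that of AES), and the Hamming-weight power model is an assumption about how the device leaks.

```python
import random

# Stand-in for a cipher's 8-bit S-box: a random bijection on bytes
# (illustrative only; a real attack would target e.g. the AES S-box).
rng = random.Random(0)
SBOX = list(range(256))
rng.shuffle(SBOX)

def hamming_weight(x):
    return bin(x).count("1")

def hypothetical_power(plaintexts, key_guess):
    """Assumed power model: the device leaks the Hamming weight of
    the S-box output S(p XOR k)."""
    return [hamming_weight(SBOX[p ^ key_guess]) for p in plaintexts]

plaintexts = [rng.randrange(256) for _ in range(1000)]
# One column of hypothetical consumption values per candidate key byte.
hypotheses = {k: hypothetical_power(plaintexts, k) for k in range(256)}
```

A distinguisher then compares each of these 256 columns against the real traces; the comparison methods themselves are discussed below.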
The comparison tool used to compare the hypothetical and real trace values is called a distinguisher. As an aside - in practice a direct comparison isn't always optimal or practicable; distinguishers can essentially be classified into two types: 'comparison' distinguishers that do directly compare real and hypothetical power values, and 'partitioning' distinguishers that use the hypothetical power values to partition the real ones; the idea being that a meaningful partition will be made only when the key is guessed correctly, and the distinguisher then tries to quantify this (see here for a nice explanation). In essence, using either approach, a distinguisher is a statistical test and should be analysed as such - the distinguisher statistic can be computed exactly when we know all of the variables in the system, but as we don't usually have this information in practice, the statistic has to be estimated using the available sample data - the power traces. The consequence is that a distinguisher may have theoretical features or advantages that will not necessarily translate into practical results, because of the quality of the estimator for the distinguisher statistic. The estimator ideally needs to be both consistent and efficient - inefficient estimation procedures can add a potentially unforeseen extra layer of data complexity to an attack.
A recurring theme in DPA research has been the aim of constructing distinguishers that are more efficient (require fewer traces to recover the key) and more generic (require fewer assumptions about the target device; in particular how it leaks power information). The requirement for efficiency is obvious; fewer traces translates to a device being weaker in practice (although this is obviously not a complete metric - what about the cost of profiling in a template attack? etc). Genericity is a slightly more subtle requirement; specialist distinguishers might be more powerful but may also require more assumptions to be made about the device. An interesting recent result in this area (paper here) has found that with nanoscale technologies, inter-device variability becomes so high that the leakage of a particular device model will vary from one instantiation of the device to another. In this situation, even full Bayesian template attacks (theoretically the most powerful type of DPA adversary) will fail. This drives the requirement for some generic distinguisher that can perform equally well irrespective of the leakage function of the device and the assumptions an adversary can make about it. Perhaps the ideal distinguisher therefore is one that is both generic and efficient.
Carolyn and Jake described how research has found that the dual goal of efficiency and genericity is currently infeasible; instead what we see is a trade-off between the two. The more generic distinguishers have a higher data complexity, and hence a cost in efficiency, compared with the less generic distinguishers. Parametric or moment-based statistics typically demand stronger assumptions about the leakage model but require fewer traces, while the jump to exploiting all of the distribution through nonparametric/generic statistics seems to add an unavoidable cost in data complexity. Four particular distinguishers were discussed: correlation, distance-of-means, mutual information analysis and linear regression.
The correlation distinguisher utilises Pearson's correlation coefficient to quantify the linear dependence between the real and hypothesised consumption values. When the two variables are jointly Normally distributed, the correlation coefficient completely characterises their dependence, and although this characterisation degrades for non-Normal distributions, correlation can remain a useful measure of linear dependence. This is useful as long as the hypothesised consumption under the correct key is linearly related to the real consumption (and not under an incorrect key) - i.e. when the power model used by the adversary is accurate up to proportionality. The estimator for Pearson's correlation coefficient is the sample correlation coefficient. This is a consistent estimator as long as the sample moments themselves are consistent, which is usually the case: under the law of large numbers, once a large enough sample is obtained the sample averages converge to the theoretical expectations. These factors combine to make correlation an efficient distinguisher (partly because of its simplicity), but one reliant on having an accurate power model and hence losing genericity. It is worth noting that, up to a point, correlation can still be a successful distinguisher (with a cost in efficiency) in 'generic' scenarios if there are at least some linear dependencies between the real and hypothesised values that can be exploited.
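A minimal sketch of a correlation attack against simulated traces might look as follows. All names and parameters are illustrative assumptions: the S-box is a random 8-bit bijection rather than a real one, and the traces are generated under an assumed Hamming-weight leakage with Gaussian noise.

```python
import random

# Simulation setup (illustrative): random 8-bit bijective S-box,
# Hamming-weight leakage plus Gaussian noise, arbitrary secret key byte.
rng = random.Random(0)
SBOX = list(range(256))
rng.shuffle(SBOX)

def hw(x):
    return bin(x).count("1")

def pearson(xs, ys):
    """Sample (Pearson) correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

SECRET_KEY = 0xA7
plaintexts = [rng.randrange(256) for _ in range(1000)]
traces = [hw(SBOX[p ^ SECRET_KEY]) + rng.gauss(0, 1) for p in plaintexts]

# Score every key guess by the absolute sample correlation between the
# hypothesised and the real consumption values.
scores = {k: abs(pearson([hw(SBOX[p ^ k]) for p in plaintexts], traces))
          for k in range(256)}
recovered_key = max(scores, key=scores.get)
```

Because the simulated leakage really is proportional to the hypothesised Hamming weight, the correct key produces by far the highest correlation here - which is exactly the dependence on an accurate power model described above.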
The original DPA distinguisher, as introduced by Kocher, Jaffe and Jun in their 1998 paper, is the "distance of means" test. This is a 'partitioning'-type distinguisher, unlike correlation; the hypothesised values are used to partition the real traces into sets. In the distance-of-means test, a single bit of the hypothesised leakage values is analysed: the real traces corresponding to plaintexts that produce a hypothetical bit value of 0 are put into one set, and those producing a 1 into another. The intuition is that when the key is hypothesised correctly, the partitioning of the real consumption values will be meaningful and the two partitioned distributions will have different means, whereas under an incorrect hypothesis each partition will contain a random sampling of the global distribution. The downside of the distance-of-means distinguisher is that the signal leaked by one bit of power consumption information is small compared to the amount of noise created by the overlapping distributions contributed by the remaining, ignored bits, and so the efficiency here is low, but again with a benefit in genericity; there are few assumptions required.
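Under the same illustrative simulation assumptions as before (random 8-bit S-box stand-in, assumed Hamming-weight leakage), a single-bit distance-of-means distinguisher might be sketched as:

```python
import random

# Simulation setup (illustrative): random 8-bit bijective S-box,
# Hamming-weight leakage plus Gaussian noise.
rng = random.Random(0)
SBOX = list(range(256))
rng.shuffle(SBOX)

def hw(x):
    return bin(x).count("1")

def distance_of_means(plaintexts, traces, key_guess, bit=0):
    """Partition the real traces on a single bit of the hypothesised
    S-box output and return the absolute difference of the two means."""
    part = ([], [])
    for p, t in zip(plaintexts, traces):
        part[(SBOX[p ^ key_guess] >> bit) & 1].append(t)
    return abs(sum(part[1]) / len(part[1]) - sum(part[0]) / len(part[0]))

SECRET_KEY = 0x3C
plaintexts = [rng.randrange(256) for _ in range(5000)]
traces = [hw(SBOX[p ^ SECRET_KEY]) + rng.gauss(0, 1) for p in plaintexts]

scores = {k: distance_of_means(plaintexts, traces, k) for k in range(256)}
recovered_key = max(scores, key=scores.get)
```

Note that this sketch uses several times more traces than the correlation sketch: only one of the eight leaking bits contributes signal, the other seven act as noise, reflecting the low efficiency discussed above.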
A theoretically more powerful type of generic partition distinguisher is mutual information analysis (MIA), proposed in 2008 at CHES. MIA partitions the real traces using all of the discrete values in the hypothesised leakage space, so again a meaningful partition is created only when the key is guessed correctly. Rather than computing means, MIA computes the reduction in the trace entropy given knowledge of the hypothesised values under a particular key. Another way of describing MIA is that it computes the total amount of information shared between two random variables. In a partitioning attack this means that, for a correct key hypothesis, the conditional (or partitioned) distribution will have the smallest entropy, leading to the highest mutual information. The major benefit of MIA is that it can be made to be entirely generic; the intermediate function targeted acts as an inherent power model and so the adversary does not need to model the device leakage function at all. Another way of saying this is that MIA only requires a meaningful grouping of the leakages, because mutual information itself is maximised by the raw intermediate values, and this grouping is provided inherently by non-injective S-Boxes. The non-injectivity is a fundamental requirement - as mutual information is invariant under permutation, injective intermediate functions such as the AES S-Box lead to constant mutual information values over all key hypotheses. Hence the full genericity of MIA is restricted to non-injective targets such as the DES S-Box. Again we see the trade-off between efficiency and genericity with MIA - the data overhead required to estimate entropies is significant. To perform an MIA attack, first the underlying leakage distribution needs to be estimated, and then this estimate needs to be substituted into estimators for the marginal and conditional entropies.
Additionally, all known entropy estimators are biased; no ideal estimator exists, so in practice there are multiple problematic factors in play when performing MIA: a data overhead, sensitivity to estimator choice and configuration, and unpredictability of performance based on the underlying data. So whilst MIA is generic up to a point, this genericity comes at a considerable cost in efficiency, so much so that, even with a poor power model, correlation may often be the more efficient distinguisher.
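To make the entropy-estimation steps concrete, here is a hedged sketch of an MIA attack using a simple plug-in (histogram) entropy estimator - one of the biased estimators mentioned above. The simulation assumptions are as before (random 8-bit S-box stand-in, Hamming-weight leakage), and the partition is by hypothesised Hamming-weight class rather than the fully generic raw-intermediate-value partition, which as discussed would require a non-injective target.

```python
import math
import random

# Simulation setup (illustrative): random 8-bit bijective S-box,
# Hamming-weight leakage plus Gaussian noise.
rng = random.Random(0)
SBOX = list(range(256))
rng.shuffle(SBOX)

def hw(x):
    return bin(x).count("1")

def hist_entropy(samples, lo, hi, bins=16):
    """Plug-in (histogram) entropy estimate in bits; like all practical
    entropy estimators it is biased."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in samples:
        counts[min(int((s - lo) / width), bins - 1)] += 1
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def mia_score(plaintexts, traces, key_guess, lo, hi):
    """Estimated mutual information: marginal trace entropy minus the
    entropy remaining after partitioning by the hypothesised HW class."""
    parts = {}
    for p, t in zip(plaintexts, traces):
        parts.setdefault(hw(SBOX[p ^ key_guess]), []).append(t)
    n = len(traces)
    conditional = sum(len(v) / n * hist_entropy(v, lo, hi)
                      for v in parts.values())
    return hist_entropy(traces, lo, hi) - conditional

SECRET_KEY = 0x6B
plaintexts = [rng.randrange(256) for _ in range(2000)]
traces = [hw(SBOX[p ^ SECRET_KEY]) + rng.gauss(0, 0.5) for p in plaintexts]
lo, hi = min(traces), max(traces)

scores = {k: mia_score(plaintexts, traces, k, lo, hi) for k in range(256)}
recovered_key = max(scores, key=scores.get)
```

Even in this favourable simulation the result is sensitive to the estimator configuration (number of bins, trace count), which is precisely the practical fragility described above.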
There has been research into distinguishers that are conceptually similar to MIA, such as those based on the Kolmogorov-Smirnov and Cramér-von Mises tests, but they suffer from similar problems. One current, promising avenue of research is into distinguishers based on linear regression techniques. This concept was originally described as a template-style attack, but recent work has adapted the approach to essentially allow for 'on-the-fly' profiling. The idea is to fit a linear regression model of the leakage values against a configuration of selected explanatory variables (typically bits of the targeted intermediate value), and then compute a goodness-of-fit measure to distinguish. This approach is not truly generic and comes with a computational cost, but is very flexible.
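A hedged sketch of such an 'on-the-fly' regression distinguisher, under the same illustrative simulation assumptions as the earlier examples: for each key guess, regress the traces on the bits of the hypothesised S-box output and score the guess by the R² goodness of fit.

```python
import random

# Simulation setup (illustrative): random 8-bit bijective S-box,
# Hamming-weight leakage plus Gaussian noise.
rng = random.Random(0)
SBOX = list(range(256))
rng.shuffle(SBOX)

def hw(x):
    return bin(x).count("1")

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination."""
    n = len(A)
    M = [row[:] + [v] for row, v in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def r_squared(plaintexts, traces, key_guess):
    """Least-squares fit of the traces against an intercept plus the 8 bits
    of the hypothesised S-box output; returns the R^2 goodness of fit."""
    X = [[1.0] + [float((SBOX[p ^ key_guess] >> b) & 1) for b in range(8)]
         for p in plaintexts]
    k, n = 9, len(X)
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * t for row, t in zip(X, traces)) for i in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    mean = sum(traces) / n
    ss_res = sum((t - f) ** 2 for t, f in zip(traces, fitted))
    ss_tot = sum((t - mean) ** 2 for t in traces)
    return 1.0 - ss_res / ss_tot

SECRET_KEY = 0xC5
plaintexts = [rng.randrange(256) for _ in range(400)]
traces = [hw(SBOX[p ^ SECRET_KEY]) + rng.gauss(0, 1) for p in plaintexts]

scores = {k: r_squared(plaintexts, traces, k) for k in range(256)}
recovered_key = max(scores, key=scores.get)
```

The flexibility comes from the model being fitted per key guess rather than assumed in advance: the regression coefficients absorb whatever (linear, bitwise) leakage function the device happens to have, at the computational cost of a least-squares fit per candidate key.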
Perhaps the most important message to take from the talk and discussion is that currently there is no such thing as a generally optimal distinguisher; each choice comes with some trade-off between genericity and efficiency. This means that the best distinguisher choice will only become obvious once all the other variables in the attack scenario are known: the device leakage, the attacker's model, the targeted function, and the noise. Whether it is possible to construct a distinguisher that meets both requirements simultaneously is an ongoing challenge. Additionally, the question of whether it is possible to construct a generic distinguisher suitable for deployment against injective target functions is an interesting open problem.