## Thursday, October 15, 2015

### Inference Attacks on Property Preserving Encrypted Databases

This post will focus on Inference Attacks on Property-Preserving Encrypted Databases by Naveed, Kamara and Wright presented this week at CCS in Denver. As the name suggests, the idea of property-preserving encryption (PPE) is to encrypt data in such a way that some property, such as ordering (OPE), is preserved---i.e. if $a>b$ then $\mathsf{Enc}(a)>\mathsf{Enc}(b)$. Another example is deterministic encryption (DTE) which preserves equality, meaning that if $a=b$ then $\mathsf{Enc}(a)=\mathsf{Enc}(b)$. PPE is of course inherently less secure than semantically-secure encryption, however it opens the door to a number of applications, in particular the context of healthcare. As a straightforward example, if a large database holds patient records and each column is encrypted using some method of PPE, e.g. gender and disease using DTE, age and admission date using OPE (fields such as patient identifiers may not require any functionality). If a medical professional wishes to analyse data of all patients between the ages of 18 and 25 who have suffered from altitude sickness, instead of downloading the entire database (which she would have to do if the database was encrypted using standard, probabilistic encryption), she can calculate $\mathsf{Enc.OPE.age}(18)=c_1$, $\mathsf{Enc.OPE.age}(25)=c_2$ and $\mathsf{Enc.DTE.disease}(\textrm{altitude sickness})=c_3$ and only download the database entries that have an age ciphertext $c$ in range $c_1 \leq c \leq c_2$ and disease ciphertext $c_3$.

This saves bandwidth, and consequently time and money, but we really don't want the database (or queries to it) to leak anything more than the properties claimed by the encryption mechanisms, namely equality for DTE, and order+equality for OPE. Unfortunately, human characteristics and medical diagnoses have low entropy: if an adversary knows that a medical database holds records for patients from one geographic area, then the distributions of age and race are going to be very similar to the general population for that area, and even worse most of the data fields will be highly correlated to data from another nearby hospital. This opens the door to two attack vectors: inference attacks that use publicly-available data to gain information about queries (the focus of the paper), and de-anonymisation attacks (that can be built from inference attacks) that aim to identify entries of the EDB. Of course we can use semantically-secure encryption for sensitive attributes (requiring download of the whole column before analysis can occur) to try to thwart these attacks, but which attributes are sensitive? The answer to that question is beyond the scope of this post but the answer is, in general, as many as possible. Note that attributes may be sensitive not only for patients but also to hospitals, which may want to keep secret how many patients die of certain diseases.

The authors show that it is possible to mount inference attacks on encrypted databases (that are built on tools such as CryptDB and Cipherbase) storing real patient data from 200 US hospitals, using only the ciphertexts and public auxiliary information. The public data is obtained from the Healthcare Cost and Utilization Project (HCUP) National Inpatient Sample (NIS) database. The talk provided a jarring 'context is everything' moment by including a slide that displayed only the words "We attack each hospital individually" in large font. This means that the authors look at each hospital database on an individual basis rather than grouping them, and use the auxiliary info to mount their attacks. A number of tools are used and can be found in the paper. The easiest to explain is frequency analysis on a column encrypted using DTE (authors focus on mortality risk): sort the histogram, compare to the auxiliary histogram and map same-ranked elements to infer which entries correspond to which attribute. The authors' novel $l_p$ optimisation recovered mortality risk and patient death data for at least 99% of the 200 largest hospitals, and recovered disease severity for 100% of patients for at least 51% of those hospitals. Sorting attacks on data encrypted using OPE take columns that are 'dense' (meaning that all values have entries), and allow the attacker to decrypt all values with no knowledge of the key: the authors recovered the admission month and mortality risk of 100% of patients for at least 90% of the 200 largest hospitals. Another novel attack called the cumulative attack recovered disease severity, mortality risk, age, length of stay, admission month, and admission type of at least 80% of patients for at least 95% of the largest 200 hospitals. These results only exploit the databases themselves and not leakage from queries ('ciphertext only'), suggesting that the results are a lower bound on the information that could be gleaned from these encrypted databases.

Interestingly, the following talk in the session was given by Florian Kerschbaum defined a new security model for order-preserving encryption, potentially giving a countermeasure against some of the attacks on OPE given above.