This post will discuss parts of the talk by Angelo De Caro (IBM Zurich) on 'data storage-oriented cryptographic protocols,' given at the workshop on secure and trustworthy computing in Bucharest.
The cloud provides an attractive opportunity for users and enterprises (collectively 'clients') to outsource their storage. Many providers offer low-cost storage with straightforward management and ubiquitous access (from multiple devices). However, users inherently lose direct control, meaning they have no guarantees about the privacy or future availability of their files.
Deduplication is the process by which a provider saves itself storage space (and consequently money) by storing only one copy of each file and keeping track of which users own it. Deduplication is attractive because there is a great deal of redundancy in many contexts, such as media (movies, music etc.), system files/software and email attachments. When storage providers want to deduplicate they have a number of choices:
- file-level/block-level (the latter allows better deduplication in systems where updates to large files are common)
- single-user/cross-user (the latter is more desirable for providers)
- client-side/server-side
The final point introduces some interesting concerns. Server-side deduplication requires the client to send the whole file to the server, which incurs a high bandwidth cost. For client-side deduplication, Alice takes a hash of her file $F$ and sends the value $H(F)$ to the server: if the server has not seen this value before it instructs Alice to upload the file; if it has, deduplication occurs and no upload is needed. As a consequence, possessing $H(F)$ is enough to download the file (Dropbox previously used a system where this was the case), so Alice can send all her friends $H(F)$ where $F$ is a movie and the provider effectively becomes a content distribution network (this is not a privacy problem, but the server has no desire to play that role and has to pay for the bandwidth). This method also creates a covert channel that reveals which files are stored on the server: in the so-called 'salary attack' discussed by Pinkas et al., an adversary Eve who knows what a company's payslips look like can learn an employee's salary by creating a large number of candidate payslips and starting the upload process for each one until the server reports that it already has the file.
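To make the flow concrete, here is a minimal sketch (in Python, with a hypothetical in-memory `Server`) of the client-side deduplication handshake described above; real providers of course keep far richer metadata, but the hash-check logic is the same.

```python
import hashlib

# Hypothetical server with an in-memory store, keyed by SHA-256 digests.
class Server:
    def __init__(self):
        self.store = {}    # digest -> file bytes
        self.owners = {}   # digest -> set of user ids

    def offer(self, digest: str) -> bool:
        """Return True if the client must upload; False means 'already stored'."""
        return digest not in self.store

    def upload(self, user: str, digest: str, data: bytes) -> None:
        assert hashlib.sha256(data).hexdigest() == digest
        self.store[digest] = data
        self.owners.setdefault(digest, set()).add(user)

    def claim(self, user: str, digest: str) -> None:
        # Knowing H(F) alone is enough to become an owner of F.
        self.owners.setdefault(digest, set()).add(user)

def client_upload(server: Server, user: str, data: bytes) -> None:
    digest = hashlib.sha256(data).hexdigest()
    if server.offer(digest):
        server.upload(user, digest, data)  # first copy: full upload
    else:
        server.claim(user, digest)         # deduplicated: no bandwidth spent
```

The boolean answer to `offer` is the covert channel exploited in the salary attack, and the `claim` path is exactly what lets anyone who merely holds $H(F)$ register as an owner of $F$.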
Proofs of Ownership
These issues mean we'd rather use proofs of ownership (PoWs): a way for Alice to convince the server that she actually owns the file, which means short file identifiers such as $H(F)$ can no longer suffice. In 2011 Halevi et al. suggested the following framework: in a preprocessing phase the server stores some short information per file (the file itself lives in secondary storage), and in the proof phase (run only during file upload) there is a challenge/response mechanism. The procedure needs to be bandwidth-efficient, and computation (particularly for the client) should be cheap. The authors suggested a method using Merkle trees with special encodings, where the prover is 'challenged' on certain blocks of the hash tree. In the preprocessing phase the server is sent the file, computes a Merkle tree and stores only the root. To prove ownership, the server asks the client to present sibling paths for $t$ random leaves; the client computes them and the server verifies them against the stored root. This solution is bandwidth-efficient and space-efficient on the server side, but the client has to do quite a lot of computation. A client that knows a large proportion of the file will likely be able to 'cheat', so we need a way to 'spread' the entropy across the file and prevent scenarios like the salary attack. Encoding the file with an erasure code is a good way of doing this (the cheating probability drops to $2^{-t}$), but such constructions are fairly computationally expensive. An alternative approach was taken by Di Pietro and Sorniotti (AsiaCCS 2012), which is considerably more efficient on the client side but worse on the server side, and the challenge values have to be recomputed once they are exhausted.
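Below is a minimal sketch of the plain Merkle-tree challenge/response, without the special encodings of Halevi et al.: the server stores only the root, and the client proves knowledge of $t$ randomly chosen leaves by sending their sibling paths. The block size, hash function and padding rule are assumptions made for the example.

```python
import hashlib, os, random

H = lambda b: hashlib.sha256(b).digest()
BLOCK = 1024

def blocks(data: bytes):
    padded = data + b"\x00" * (-len(data) % BLOCK)
    return [padded[i:i + BLOCK] for i in range(0, len(padded), BLOCK)]

def build_tree(data: bytes):
    level = [H(b) for b in blocks(data)]
    tree = [level]
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last node on odd-sized levels
            level = level + [level[-1]]
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                          # tree[-1][0] is the root

def sibling_path(tree, idx):
    path = []
    for level in tree[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append((idx % 2, level[idx ^ 1]))   # (am I the right child?, sibling hash)
        idx //= 2
    return path

def verify(root, leaf_hash, path):
    h = leaf_hash
    for is_right, sib in path:
        h = H(sib + h) if is_right else H(h + sib)
    return h == root

# Preprocessing: the server receives the file once and stores only the root.
data = os.urandom(8 * BLOCK)
root = build_tree(data)[-1][0]

# Proof phase: the server challenges t = 3 random leaves; the client answers with sibling paths.
client_tree = build_tree(data)           # client rebuilds the tree from its copy of the file
for i in random.sample(range(len(blocks(data))), 3):
    assert verify(root, H(blocks(data)[i]), sibling_path(client_tree, i))
```

The asymmetry mentioned above is visible here: the server stores and checks a single 32-byte root, whereas the honest client must hash the entire file to rebuild the tree.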
Proofs of Retrievability
Alice outsources a file and wants to know that it is retrievable, meaning not only that the server still holds the file but also that it has not been modified. There is a trivial solution: download the file on a regular basis. A better approach is to use a keyed hash function and store $H(k,F)$: to verify, Alice sends $k$ to the server S, asks S to compute $H(k,F)$, and compares the result with her stored value. This is storage-efficient for Alice, but S has to read the entire file and each key $k$ can only be used once. Better still is to use 'sentinels' (short, random strings): Alice embeds sentinels at random positions in $F$, encrypts the result block-wise, sends this file $F'$ to the server and keeps the sentinels. To verify, Alice asks for the block at a sentinel position and checks that it is correct. With this protocol Alice does not need to store all of $F$ and can detect a large erasure (which would destroy at least one sentinel with high probability), but she has to store the sentinels and cannot detect a small erasure. One can improve this by computing $MAC(k,s_i)$ for each sentinel and appending these values to the file (the server doesn't know which sentinels the MACs correspond to), so that Alice only needs to store $k$: she then stores none of $F$ and can still detect a large erasure, but small erasures remain undetectable. The 'small erasure' problem can be solved by applying an error-correcting code, though this makes the scheme more expensive.
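For illustration, here is a minimal sketch of the sentinel construction. The block size and the hash-based 'encryption' are toy assumptions (a real scheme would use a proper block cipher), and a deployed protocol would use each sentinel only once, since an audit reveals its position to the server.

```python
import hashlib, os, random

BLOCK = 16  # toy block size in bytes

def enc_block(block: bytes, key: bytes, index: int) -> bytes:
    # Stand-in for block-wise encryption (e.g. AES); hides whether a block is a sentinel.
    pad = hashlib.sha256(key + index.to_bytes(8, "big")).digest()[:BLOCK]
    return bytes(a ^ b for a, b in zip(block, pad))

def prepare(file_blocks, key, num_sentinels):
    """Alice inserts sentinels at random positions, encrypts block-wise, keeps the sentinels."""
    total = len(file_blocks) + num_sentinels
    positions = set(random.sample(range(total), num_sentinels))
    sentinels, blocks, it = {}, [], iter(file_blocks)
    for i in range(total):
        if i in positions:
            sentinels[i] = os.urandom(BLOCK)
            blocks.append(sentinels[i])
        else:
            blocks.append(next(it))
    outsourced = [enc_block(b, key, i) for i, b in enumerate(blocks)]
    return outsourced, sentinels          # the outsourced file F' goes to the server

def audit(server_blocks, key, sentinels):
    """Alice spot-checks one sentinel position; a large erasure is caught with high probability."""
    pos, s = random.choice(list(sentinels.items()))
    return enc_block(server_blocks[pos], key, pos) == s

# Usage: 14 data blocks plus 2 sentinels, then an audit against an honest server.
data_blocks = [os.urandom(BLOCK) for _ in range(14)]
key = os.urandom(32)
stored, kept = prepare(data_blocks, key, 2)
assert audit(stored, key, kept)
```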
Confidentiality
Both PoWs and PoRs are tangential to the goal of keeping files confidential from an untrusted server. If two clients upload the same file encrypted under their own keys, then we would expect the two ciphertexts to be distinct and (assuming a strong encryption scheme) the server should not be able to tell that they correspond to the same file, which rules out deduplication. Douceur et al. gave an initial solution to this tension: hash the file and use the value $H(F)$ as the encryption key; this was generalised by Bellare et al. (Eurocrypt 2013) as message-locked encryption. Since the encryption is deterministic we can only hope for security when files have high entropy, and indeed the approach admits offline brute-force attacks: given a challenge ciphertext $C^*$, if the message space is small Eve simply hashes each candidate file, encrypts it under the resulting key and checks for a match with $C^*$. The authors of the EC13 paper gave a solution to this problem: their system DupLESS uses an independent key server (KS), and a user engages in an oblivious PRF protocol with KS to derive an encryption key (the key server must enforce a per-client rate-limiting strategy to stop online brute-force attacks). At CCS next month Liu, Asokan and Pinkas will present a paper that removes the need for the key server by distributing its role among the clients using PAKE (Asokan gave a talk on this paper earlier in the workshop).
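The following is a minimal sketch of convergent encryption in the spirit of Douceur et al.; the keystream construction is a toy stand-in for a real deterministic cipher, and DupLESS replaces the key derivation $K = H(F)$ with an oblivious PRF evaluated with the key server.

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    # Toy keystream: SHA-256 in counter mode (a stand-in for, e.g., AES-CTR with a fixed IV).
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def convergent_encrypt(file_bytes: bytes):
    key = hashlib.sha256(file_bytes).digest()    # the key is K = H(F)
    ct = bytes(a ^ b for a, b in zip(file_bytes, keystream(key, len(file_bytes))))
    tag = hashlib.sha256(ct).hexdigest()         # deduplication identifier seen by the server
    return key, tag, ct

# Two clients encrypting the same file independently produce identical ciphertexts,
# so the server can deduplicate them; but anyone able to guess F can recompute K,
# which is exactly the offline brute-force attack described above.
_, tag1, ct1 = convergent_encrypt(b"payslip: Bob, 42000 per year")
_, tag2, ct2 = convergent_encrypt(b"payslip: Bob, 42000 per year")
assert tag1 == tag2 and ct1 == ct2
```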
This presentation complemented Asokan's talk, Florian Kerschbaum's talk about computing on encrypted data (slides here) and the talk given by Marc Lacoste from Orange Labs, who discussed the goals of the Super Cloud project and the security challenges involved in the development of 5G communication standards and the IoT.