Papers tagged ‘Differential privacy’

k-Indistinguishable Traffic Padding in Web Applications

A Web application is a piece of software whose execution is split between the client and the server, and it sends data between the two sides every time you do something significant, perhaps even on every single keystroke. Inevitably, the volume of data has some correlation with the action performed. Since most Web applications are offered to the general public at low or no cost, an adversary can map out much of a target application's internal state machine and develop a model that lets them deduce what someone is doing from an observed sequence of packets. [1] [2] This has been successfully applied (in the lab) to Web applications for highly sensitive tasks, such as tax preparation and medical advice.
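
To make the attack concrete, here is a minimal sketch of the modeling step described above, with invented inputs and byte counts; real attacks such as [1] and [2] build much more sophisticated statistical classifiers, but the underlying idea is the same: probe the application yourself, record what each action looks like on the wire, and then decode a victim's traffic by lookup.

```python
# Hypothetical illustration of traffic-analysis fingerprinting.  The keystrokes
# and response sizes below are made up; a real adversary would collect them by
# probing the target application directly.
signatures = {}                          # observed response size -> input that produced it
for keystroke, size in [("a", 312), ("b", 487), ("c", 290), ("d", 645)]:
    signatures[size] = keystroke         # built from the adversary's own probing runs

victim_traffic = [487, 290, 312]         # response sizes sniffed from an encrypted session
guess = "".join(signatures.get(size, "?") for size in victim_traffic)
print(guess)                             # -> "bca": the victim's (hypothetical) keystrokes
```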

The only known solution to this problem is to pad the amount of data passed over the network in each direction so that the adversary cannot learn anything about the state of the web application from an observed sequence of packets. Unfortunately, the obvious way to do this involves unacceptably high levels of overhead—this paper quotes [1] as observing 21074% overhead for a well-known online tax system. Today’s paper proposes that techniques from the field of privacy-preserving data publishing (PPDP, [3]) be brought to bear on the problem.
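
To get a feel for where the overhead of the obvious defence comes from (presumably something like padding every response up to the largest one the application ever sends), here is a back-of-the-envelope calculation with invented response sizes. The 21074% figure above was measured on a real system in [1]; this toy only shows the mechanism: one unusually large response forces every other response to be padded enormously.

```python
# Hypothetical illustration of pad-everything-to-the-maximum; sizes are made up.
observed_sizes = [90, 120, 350, 800, 60_000]       # bytes; one state returns a huge payload

max_size = max(observed_sizes)
padded_total = max_size * len(observed_sizes)      # every response is padded to max_size
true_total = sum(observed_sizes)
print(f"overhead: {padded_total / true_total - 1:.0%}")   # ~389% even for this small example
```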

The paper is mathematically heavy going and I’m not going to go into details of their exact proposal, because it’s really just a starting point. Instead, I’m going to pull out some interesting observations:

  • More padding is not always better. They show an example case where padding messages to a multiple of 128 bytes is exactly as good as padding them to a multiple of 512 bytes, and padding them to a multiple of 520 bytes is worse. (The sketch after this list shows how that can happen.)

  • Computing the optimal amount of padding for a web application (as they define it) is NP-hard even with complete, ideal knowledge of its behavior. However, there are reasonably fast approximation algorithms; if I understood them correctly, it might even be feasible to apply one of them on the fly, per client.

  • Because optimality is defined over an entire sequence of user interactions with an application, knowing how much padding to send in response to any given request may require the server to predict the future. Again, however, there are approximations that avoid this problem.

  • PPDP offers a whole bunch of different ways to model the degree to which one has mitigated the information leak: k-anonymity is the model they spend the most time on, but there's also l-diversity, (α, k)-anonymity, t-closeness, and so on. I don't follow that literature, but I have the impression people are still actively making up new models and haven't spent a whole lot of time figuring out which is best in which situations.
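
Here is a toy version of the first observation, with invented byte counts (the paper's own example uses different numbers). Take k to be the size of the smallest group of actions that share an observable padded size, i.e. the k of k-indistinguishability: a larger padding bucket does not necessarily raise it, and can lower it.

```python
# Hypothetical action-to-response-size table; the numbers are chosen only to
# reproduce the 128-vs-512-vs-520 effect described in the first bullet above.
from collections import Counter
from math import ceil

action_sizes = {"a": 520, "b": 540, "c": 1030, "d": 1050}    # true response sizes in bytes

def pad(size, bucket):
    """Round a response size up to the next multiple of the bucket size."""
    return ceil(size / bucket) * bucket

for bucket in (128, 512, 520):
    padded = {name: pad(size, bucket) for name, size in action_sizes.items()}
    groups = Counter(padded.values())            # how many actions share each padded size
    k = min(groups.values())                     # worst-case indistinguishability
    overhead = sum(padded.values()) / sum(action_sizes.values()) - 1
    print(f"bucket {bucket:3d}: k = {k}, overhead = {overhead:.1%}")

# With these sizes, buckets of 128 and 512 both give k = 2 (and 128 wastes far
# fewer bytes), while a bucket of 520 gives k = 1: one action remains unique.
```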

The great failing of this and all the other papers I've ever read in this area is operationalization. This paper's algorithms need a complete, or nearly complete, map of all possible sequences of states that a Web application can enter, and the authors shrug off the question of how you create this map in the first place. (I very strongly suspect that task can't be mechanized in the general case; it will turn out to be the Halting Problem in another of its many disguises.) They also don't seem to realize that things like Google search suggestions (the running example in the paper) are constantly changing. (Of course, that makes life harder for the adversary too.) This doesn't mean there is no hope; it only means you need more approximations. I think it ought to be possible to build good-enough automatic padding into CMSes, and maybe even into web application frameworks. Trying that should be a priority for future research.

RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response

Lots of software developers would love to know in detail how people use their work, and to that end, there’s an increasing number of programs that phone home with crash reports, heat maps of which UI elements get used, and even detailed logs of every interaction. The problem, of course, is how to collect as much useful information as possible without infringing on user privacy. Usually, what the developers want are aggregate statistics—how often does this widget get used by everyone, that sort of thing—so the logical method to apply is differential privacy. However, stock differential privacy algorithms assume a central, trusted database, and in this case, the only people who could run the database are the developers—the very same people against whom individual responses should be protected. Also, differential privacy’s mathematical guarantees are eroded if you repeatedly ask the same survey questions of the same population—which is exactly what you want to do to understand how user behavior changes over time.
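
For contrast, this is roughly what the standard central-model approach looks like, the one the paragraph above says does not fit telemetry: a trusted curator holds everyone's raw answers and publishes a noisy aggregate. The data, the epsilon value, and the query are all invented for illustration.

```python
# Hypothetical central-model (trusted-curator) differential privacy: the
# Laplace mechanism applied to a counting query of sensitivity 1.
import numpy as np

rng = np.random.default_rng(0)
raw_answers = rng.integers(0, 2, size=10_000)    # one sensitive bit per user, held centrally

def noisy_count(data, eps):
    """Release the count plus Laplace noise scaled to sensitivity / epsilon."""
    return data.sum() + rng.laplace(scale=1.0 / eps)

print(noisy_count(raw_answers, eps=0.5))

# Both objections from the text show up here: the curator sees raw_answers in
# the clear (and for telemetry the curator is the developer), and by sequential
# composition, re-running the same query T times spends roughly T * eps of the
# privacy budget, so the guarantee erodes over time.
```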

This paper solves both problems with a combination of randomized response and Bloom filters. Randomized response is an old, ingenious idea for asking sensitive yes-no survey questions such that any one person has plausible deniability, but the true aggregate number of yes responses can still be estimated: each survey participant secretly flips a coin before answering, and if it comes up heads they answer yes whether or not that's true. Otherwise they answer honestly. After the fact, everyone can claim to have answered yes because of the coin flip, but to recover the true population statistic one simply doubles the observed number of no answers. Bloom filters (which I'm not going to try to explain) are used to extend this from yes-no questions to reporting sets of strings. Finally, there are two stages of randomization. Given a set of survey questions, the system computes a Permanent randomized response which is, as the name implies, reused until either the answers or the questions change. This prevents privacy erosion, because each user gives the same answer, honest or false, to any given question every time; a nosy server cannot average out the coin tosses. Additional instantaneous randomness is added each time a report is submitted, to prevent the report itself from becoming a fingerprint for that user. The system is said to be in use for telemetry from the Chrome browser, and they give several examples of data collected in practice.
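
A quick simulation of the plain coin-flip scheme just described (classic Warner-style randomized response, not RAPPOR's full Bloom-filter construction) shows the recover-by-doubling step in action. The population size and the true rate are made up.

```python
# Hypothetical simulation of yes/no randomized response with a fair coin.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
truth = rng.random(n) < 0.30                 # 30% of users would honestly answer "yes"

heads = rng.random(n) < 0.5                  # each user's secret coin flip
reported_yes = heads | truth                 # heads -> forced "yes"; tails -> honest answer

# Recover the aggregate exactly as the text says: double the number of "no" answers.
estimated_no = 2 * np.count_nonzero(~reported_yes)
estimated_yes_rate = 1 - estimated_no / n
print(f"true rate 30.0%, estimated {estimated_yes_rate:.1%}")
```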

The Permanent randomized responses illustrate a basic tension in security design: you can often get better security outcomes versus a remote attacker if you keep local state, but that local state is probably high-value information for a local attacker. For example, any system involving trust on first use will store a list of frequently contacted remote peers, with credentials, on the local machine; tampering with that can destroy the system’s security guarantees, and simply reading it tells you, at a minimum, which remote peers you regularly connect to. The TAILS live-CD is intended to leave no trace on the system it’s run on, which means it changes its entry guards on every reboot, increasing the chance of picking a malicious guard. In this case, a local adversary who can read a Chrome browser profile has access to much higher-value data than the Permanent responses, but a cautious user who erases their browser profile on every shutdown, to protect against that local adversary, is harming their own security against the Chrome developers. (This might be the right tradeoff, of course.)

I also wonder what a network eavesdropper learns from the telemetry reports. Presumably they are conveyed over a secure channel, but at a minimum, that still reveals that the client software is installed, and the size and timing of the upload might reveal something about its contents. A crash report, for instance, is probably much bulkier than a statistics ping, and is likely to be submitted immediately rather than on a schedule.

Privacy Games: Optimal User-Centric Data Obfuscation

I attended this year’s PETS conference a couple weeks ago, and for the next several weeks I’m going to be focusing on highlights from there. For variety’s sake, there will be one unrelated thing each week.

I’m starting with a paper whose mathematics is heavier going than what I usually pick. The author claims to have developed an optimal (meta-)strategy for obfuscating a person’s location before revealing it to some sort of service. The person wants to reveal enough information to get whatever utility the service provides, while not telling it exactly where they are, because they don’t entirely trust the service. The service, meanwhile, wants to learn as much as possible about the person’s location; it doesn’t matter why.

There are two standard mathematical techniques for obfuscating a location: differential privacy and distortion privacy. They both work by adding noise to a location value, but they use different metrics for how much the adversary can learn given the noisy value. The paper asserts that differential privacy is sensitive only to the likelihood of the observation given the data, whereas distortion privacy is sensitive to the joint probability of the observation and the data, and therefore to the adversary’s prior knowledge; thus, by guaranteeing both, we encompass all the defense that is theoretically possible. (This is a bold claim and I don’t know enough about this subfield to verify it.) The paper proceeds to demonstrate, first, that you can construct a linear-programming problem whose solution gives an appropriate amount of noise to add to a location value, and second, that this solution is game-theoretically optimal, even against an adversary who knows your strategy and can adapt to it. Then they analyze the degree to which the data has to be obfuscated, comparing this with more limited algorithms from previous literature.
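
To give a flavor of the linear-programming step, here is a small sketch, not the paper's actual program: it solves for a reporting distribution over a handful of locations on a line, minimizing expected distortion subject to differential-privacy-style constraints between every pair of true locations. The toy geography, prior, and epsilon are all assumptions made up for the example.

```python
# Hypothetical LP for location obfuscation (a simplified cousin of the paper's
# construction).  Variables x[i, j] = Pr[report location j | true location i].
import numpy as np
from scipy.optimize import linprog

locations = np.arange(5)                             # five points along a road, 1 km apart
n = len(locations)
d = np.abs(locations[:, None] - locations[None, :]).astype(float)   # distances in km
prior = np.full(n, 1.0 / n)                          # adversary's prior over true locations
eps = 1.0                                            # privacy level per km

# Objective: minimize expected distance between true and reported location.
c = (prior[:, None] * d).ravel()

# Each row of the mechanism must be a probability distribution.
A_eq = np.zeros((n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0
b_eq = np.ones(n)

# Differential-privacy-style constraints: x[i, j] <= exp(eps * d(i, k)) * x[k, j].
rows, b_ub = [], []
for i in range(n):
    for k in range(n):
        if i == k:
            continue
        for j in range(n):
            row = np.zeros(n * n)
            row[i * n + j] = 1.0
            row[k * n + j] = -np.exp(eps * d[i, k])
            rows.append(row)
            b_ub.append(0.0)

res = linprog(c, A_ub=np.vstack(rows), b_ub=np.array(b_ub), A_eq=A_eq, b_eq=b_eq)
mechanism = res.x.reshape(n, n)                      # row i: reporting distribution for location i
print("expected distortion (km):", round(res.fun, 3))
print(np.round(mechanism, 3))
```

The paper's real construction goes further, folding the adversary's own optimal inference strategy into the optimization, which is where the game theory comes in.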

I unfortunately couldn’t make any sense of the comparisons, because the figures are poorly labeled and explained. And the bulk of the paper is just mathematical derivations, to which I have nothing to add. But there is an important point I want to highlight. The game-theoretic analysis only works if everything the adversary already knows (before they learn the obfuscated location) is captured by a distance metric which defines how far the obfuscated location has been displaced. Mathematically, this is fine; any given fact that the adversary might know about the person revealing their location can be modeled by adjusting the metric. (Person is in a city, so their path is constrained to the street grid. Person just left a large park; there is only one coffee shop within radius X of that park. Person never goes to that restaurant.) However, the system doing the obfuscation might not be able to predict all of those factors, and if it distorts the metric too much it might contradict itself (person cannot possibly have walked from A to B in the time available). In the future I’d like to see some analysis of this. What is it plausible for the adversary to know? At what point does the obfuscation break down because it can’t cover all the known facts without contradicting itself?