Papers tagged ‘Adversarial optimization’

Improving Cloaking Detection Using Search Query Popularity and Monetizability

Here’s another paper about detecting search engine spam, contemporary with Detecting Semantic Cloaking on the Web and, in fact, presented as a refinement of that paper. They don’t make any changes to the ‘is this page spamming the search engine?’ algorithm itself; rather, they optimize the scan for pages that are spamming the search engine, by looking at the spammers’ economic incentives.

It obviously does no good to make your site a prominent search hit for keywords that nobody ever searches for. Less obviously, since search engine spammers are trying to make money by sucking people into linkfarms full of ads, they should be focusing on keywords that lots of advertisers are interested in (monetizability in the paper’s jargon). Therefore, the search engine operator should focus its spam-detection efforts on the most-searched and most-advertised keywords. Once you identify a linkfarm, you can weed it out of all your keyword indexes, so this improves search results in the long tail as well as the popular case.
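
As a toy illustration of that prioritization (entirely my own sketch; the keywords, scores, and scan budget below are made up, not from the paper), you could rank keywords by a combined popularity/monetizability score and spend the cloaking-detection budget from the top down:

```python
# Toy sketch of incentive-driven scan prioritization (not from the paper).
# Each keyword has a search popularity and an advertiser-demand
# ("monetizability") score; we scan results for the highest-scoring
# keywords first, on the theory that that's where the spammers are.

keywords = {
    # keyword: (popularity, monetizability) -- invented numbers
    "cheap flights": (0.90, 0.95),
    "mesothelioma":  (0.20, 0.99),
    "cat pictures":  (0.80, 0.10),
    "obscure hobby": (0.05, 0.05),
}

def priority(pop: float, mon: float) -> float:
    """Combined score; a real system would tune this weighting empirically."""
    return pop * mon

scan_budget = 2  # how many keywords we can afford to scan for cloaking
to_scan = sorted(keywords, key=lambda k: priority(*keywords[k]), reverse=True)
for kw in to_scan[:scan_budget]:
    print("scan search results for:", kw)
```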

They quickly verified this hypothesis using logs of the 5000 most popular and the 5000 most advertised search keywords from the MSN search engine (now Bing). Indeed, the spam was strongly skewed toward both high ends—somewhat more toward monetizability than popularity.

That’s really all there is to say about this paper. They had a good hypothesis, they tested it, and the hypothesis was confirmed. I’ll add that this is an additional reason (besides just making money) for search engines to run their own ad brokerages, and a great example of the value of applying economic reasoning to security research.

k-Indistinguishable Traffic Padding in Web Applications

A Web application is a piece of software whose execution is split between the client and the server, and it sends data between the two sides every time you do something significant—even, perhaps, on every single keystroke. Inevitably, the volume of data has some correlation with the action performed. Since most Web applications are offered to the general public at low or no cost, an adversary can map out much of any given application’s internal state machine and develop a model that lets them deduce what someone’s doing from an observed sequence of packets. [1] [2] This has been successfully applied (in the lab) to Web applications for highly sensitive tasks, such as tax preparation and medical advice.
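
To make the attack model concrete, here is a toy sketch of my own (the sizes and actions are invented, not taken from the cited papers): the adversary drives the application themselves, records the packet-size sequence each action produces, and then matches a victim’s observed traffic against that table.

```python
# Toy traffic-analysis fingerprinting sketch (not from the cited papers).
# The adversary exercises the web application on their own account,
# recording the (direction, size) sequence each action produces, then
# looks up a victim's observed sequence in that table.

from typing import Dict, List, Tuple

Trace = Tuple[Tuple[str, int], ...]   # ((direction, size), ...)

fingerprints: Dict[Trace, str] = {
    (("out", 310), ("in", 1420)):               "typed 'a' in search box",
    (("out", 312), ("in", 1890)):               "typed 'ab' in search box",
    (("out", 640), ("in", 5200), ("in", 5200)): "opened tax form",
}

def classify(observed: List[Tuple[str, int]]) -> str:
    """Return the adversary's best guess for the action behind a trace."""
    return fingerprints.get(tuple(observed), "unknown action")

print(classify([("out", 312), ("in", 1890)]))   # -> typed 'ab' in search box
```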

The only known solution to this problem is to pad the amount of data passed over the network in each direction so that the adversary cannot learn anything about the state of the Web application from an observed sequence of packets. Unfortunately, the obvious way to do this involves unacceptably high levels of overhead—this paper quotes [1] as observing 21,074% overhead for a well-known online tax system. Today’s paper proposes that techniques from the field of privacy-preserving data publishing (PPDP, [3]) be brought to bear on the problem.

The paper is mathematically heavy going and I’m not going to go into details of their exact proposal, because it’s really just a starting point. Instead, I’m going to pull out some interesting observations:

  • More padding is not always better. They show an example case where padding messages to a multiple of 128 bytes is exactly as good as padding them to a multiple of 512 bytes, and padding them to a multiple of 520 bytes is worse (see the sketch after this list).

  • Computing the optimal amount of padding for a web application (as they define it) is NP-hard even with complete, ideal knowledge of its behavior. However, there are reasonably efficient approximations—if I understood them correctly, it might even be feasible to apply one of their algorithms on the fly for each client.

  • Because optimality is defined over an entire sequence of user interactions with an application, knowing how much padding to send in response to any given request may require the server to predict the future. Again, however, there are approximations that avoid this problem.

  • PPDP offers a whole bunch of different ways to model the degree to which one has mitigated the information leak: k-anonymity is the model they spend the most time on, but there’s also l-diversity, (α, k)-anonymity, t-closeness, etc. etc. I don’t follow that literature but I have the impression people are still actively making up new models and haven’t spent a whole lot of time on figuring out which is best in which situations.
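
Here is a toy demonstration of the first bullet (the byte counts are invented, not the paper’s example): what matters is not how many padding bytes you add, but how many application states become indistinguishable at each padded size.

```python
# Toy illustration of "more padding is not always better" (sizes invented).
# What matters is the anonymity set: how many distinct application states
# end up with the same padded size.

from collections import defaultdict
from math import ceil

state_sizes = {"A": 515, "B": 525, "C": 2000, "D": 2040}  # bytes per state

def pad_to_multiple(size: int, block: int) -> int:
    return ceil(size / block) * block

def report(block: int) -> None:
    buckets = defaultdict(list)
    for state, size in state_sizes.items():
        buckets[pad_to_multiple(size, block)].append(state)
    overhead = sum(pad_to_multiple(s, block) - s for s in state_sizes.values())
    min_k = min(len(states) for states in buckets.values())
    print(f"block={block:4d}  extra bytes={overhead:5d}  "
          f"smallest anonymity set={min_k}  buckets={dict(buckets)}")

for block in (128, 512, 520):
    report(block)

# block=128: k=2 with the least padding
# block=512: still k=2, but far more padding -- no better than 128
# block=520: more padding than 128, yet k drops to 1 (state A stands alone)
```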

The great failing of this and all the other papers I’ve ever read in this area is operationalization. This paper’s algorithms need a complete (or nearly complete) map of all possible sequences of states that a Web application can enter, and the authors shrug off the question of how you create this map in the first place. (I very strongly suspect that task can’t be mechanized in the general case—it will turn out to be the Halting Problem in another of its many disguises.) They also don’t seem to realize that things like Google search suggestions (the running example in the paper) are constantly changing. (Of course, that makes life harder for the adversary too.) This doesn’t mean there is no hope; it only means you need more approximations. I think it ought to be possible to build good-enough automatic padding into CMSes, and maybe even into web application frameworks. Trying that should be a priority for future research.

Robust and Efficient Elimination of Cache and Timing Side Channels

We’re closing out Unpublished Results Week here with a paper that was submitted to CCS 2015 last May, but doesn’t appear to have made it.

I’ve talked about side channels a little before, but let me repeat why time and cache channels keep coming up and why they’re hard to get rid of: they’re side effects of hardware-design choices that, most of the time, are exactly what you want, if you want your computer to go fast. Think about long versus short division, for instance. It takes less mental effort, and less scratch paper, to manually divide 123456789 by 3 than by 29; the same is true (sort of) for your computer. So if it’s your job to design the CPU, why shouldn’t the integer divide instruction finish quicker in the former case? Especially if you have profiling information that says that most of the time integer divide gets used with small divisors. Memory caches are even more vital—the other day I saw a graph suggesting that it takes on the order of 17,000 clock cycles for the CPU to come all the way back up to full number-crunching speed after a context switch; most of that is probably due to stuff falling out of the cache and having to be pulled back in.

So: side channels are hard to get rid of in hardware because that would involve making design tradeoffs that would hurt performance for the supermajority of code that isn’t crunching sensitive data, and they’re hard to get rid of in software because the hardware, and the compiler, are actively fighting you. (The paper says several times that side-channel-free cryptographic primitives pretty much have to be done in hand-written assembly language, and that getting them right is extremely difficult even by the standards of people who write assembly language by hand.) Wouldn’t it be nice to find a big hammer that would make the problem just … go away? That’s this paper.

A principal design goal for the big hammer in this paper is ease of use. The only thing the application programmer has to do is label the functions that need to have their side channels eliminated as protected. And it’s not a compiler paper; they haven’t come up with a way to make a compiler do the extremely difficult assembly tuning for you. All the magic happens at runtime. Given all that, it works pretty much the only way it could work: it measures how much the timing varies and then it pads out the fast cases to the same length as the slow cases. It also flushes state from the protected function out of as many CPU caches as possible (memory, TLB, branch target buffer, …) after the function is done, and takes steps (page coloring, CPU and memory pinning, realtime scheduling priority) to ensure that the protected function is not perturbed by what malicious code has done before it starts executing. The authors demonstrate reliability by testing several different crypto primitives that definitely do have side-channel information leaks, and showing that the leaks are rendered unavailable in all cases. They also claim the overhead is acceptable (8.5%).
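
Stripped of all the systems engineering, the time-padding part looks something like the sketch below. This is my own illustration in Python, not the paper’s code: the worst-case bound would be measured rather than guessed, and the real system also flushes microarchitectural state and pins resources, which a comment here only gestures at.

```python
# Minimal sketch of time padding (my own illustration; the real system works
# at a much lower level).  Every call to a protected function is stretched
# to a fixed worst-case duration, so its running time no longer depends on
# secret inputs.

import time
from functools import wraps

def protected(worst_case_seconds: float):
    """Decorate a function so each call takes (at least) worst_case_seconds."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.perf_counter() + worst_case_seconds
            result = func(*args, **kwargs)
            # The real system would also flush per-core caches, TLBs, branch
            # predictor state, etc. touched by func at this point.
            while time.perf_counter() < deadline:
                pass                      # busy-wait so the padding is tight
            return result
        return wrapper
    return decorator

@protected(worst_case_seconds=0.01)       # bound would be measured, not guessed
def compare_secret(secret: bytes, guess: bytes) -> bool:
    return secret == guess                # early-exit comparison: leaky timing

print(compare_secret(b"hunter2", b"hunter1"))
```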

I really like the concept and hope to see more in this vein in the future. I can’t speculate on why this particular paper didn’t get accepted at CCS, but I do feel that the overhead measurements were inadequate: they don’t consider the indirect costs of flushing caches and pinning stuff to CPUs all the time. I’d like to see whole-system benchmarking of, say, a shared-hosting HTTPS server. And at least one round of adversarial testing would be nice. It is one thing to show, by your own measurements, that the side channels you already knew about have been closed; it is another thing if someone who spends all their time thinking of side-channel attacks cannot find any in your system.

Privacy Games: Optimal User-Centric Data Obfuscation

I attended this year’s PETS conference a couple weeks ago, and for the next several weeks I’m going to be focusing on highlights from there. For variety’s sake, there will be one unrelated thing each week.

I’m starting with a paper that is mathematically heavier going than what I usually pick. The author claims to have developed an optimal (meta-)strategy for obfuscating a person’s location before revealing it to some sort of service. The person wants to reveal enough information to get whatever utility the service provides, while not telling it exactly where they are, because they don’t entirely trust the service. The service, meanwhile, wants to learn as much as possible about the person’s location—it doesn’t matter why.

There are two standard mathematical techniques for obfuscating a location: differential privacy and distortion privacy. They both work by adding noise to a location value, but they use different metrics for how much the adversary can learn given the noisy value. The paper asserts that differential privacy is sensitive to the likelihood of the observation given the data, while distortion privacy is sensitive to the joint probability of the observation and the data; thus, by guaranteeing both, we encompass all the defense that is theoretically possible. (This is a bold claim and I don’t know enough about this subfield to verify it.) The paper proceeds to demonstrate, first, that you can construct a linear-programming problem whose solution gives an appropriate amount of noise to add to a location value, and second, that this solution is game-theoretically optimal, even against an adversary who knows your strategy and can adapt to it. Then they analyze the degree to which the data has to be obfuscated, comparing this with more limited algorithms from previous literature.
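
To make the linear-programming step a little more concrete, here is a minimal sketch of the kind of LP involved, under heavy simplifying assumptions: a toy one-dimensional world of five locations, a uniform prior, and only the differential-privacy-style (geo-indistinguishability) constraints. The paper’s actual formulation also encodes the distortion-privacy constraints and the game-theoretic argument, and it does not use scipy; that is just a convenient solver for illustration.

```python
# Simplified sketch (not the paper's exact formulation): find an obfuscation
# mechanism x[s, o] = Pr(report o | true location s) that minimizes expected
# distortion subject to differential-privacy-style constraints.
import numpy as np
from scipy.optimize import linprog

n = 5                          # toy locations 0..4 on a line
eps = 0.5                      # privacy parameter
prior = np.full(n, 1.0 / n)    # uniform prior over true locations
d = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))  # |s - o| distance

idx = lambda s, o: s * n + o   # variables flattened row-major

# Objective: minimize expected distance between true and reported location.
c = np.array([prior[s] * d[s, o] for s in range(n) for o in range(n)])

# Equality constraints: each row of x is a probability distribution.
A_eq = np.zeros((n, n * n))
for s in range(n):
    A_eq[s, s * n:(s + 1) * n] = 1.0
b_eq = np.ones(n)

# Inequality constraints (geo-indistinguishability style):
#   x[s, o] <= exp(eps * d(s, s')) * x[s', o]   for all s != s', all o
A_ub, b_ub = [], []
for o in range(n):
    for s in range(n):
        for s2 in range(n):
            if s == s2:
                continue
            row = np.zeros(n * n)
            row[idx(s, o)] = 1.0
            row[idx(s2, o)] = -np.exp(eps * d[s, s2])
            A_ub.append(row)
            b_ub.append(0.0)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
x = res.x.reshape(n, n)
print("expected distortion:", res.fun)
print(np.round(x, 3))          # obfuscation matrix Pr(o | s)
```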

I unfortunately couldn’t make any sense of the comparisons, because the figures are poorly labeled and explained. And the bulk of the paper is just mathematical derivations, to which I have nothing to add. But there is an important point I want to highlight. The game-theoretic analysis only works if everything the adversary already knows (before they learn the obfuscated location) is captured by a distance metric which defines how far the obfuscated location has been displaced. Mathematically, this is fine; any given fact that the adversary might know about the person revealing their location can be modeled by adjusting the metric. (Person is in a city, so their path is constrained to the street grid. Person just left a large park, there is only one coffee shop within radius X of that park. Person never goes to that restaurant.) However, the system doing the obfuscation might not be able to predict all of those factors, and if it distorts the metric too much it might contradict itself (person cannot possibly have walked from A to B in the time available). In the future I’d like to see some analysis of this. What is it plausible for the adversary to know? At what point does the obfuscation break down because it can’t cover all the known facts without contradicting itself?