Papers tagged ‘Privacy’

I Know What You’re Buying: Privacy Breaches on eBay

eBay intends not to let anyone else figure out what you’re in the habit of buying on the site. Because of that, lots of people consider eBay the obvious place to buy things you’d rather your neighbors not know you bought (there is a survey in this very paper confirming as much). However, this paper demonstrates that a determined adversary can figure out what you bought.

(Caveat: This paper is from 2014. I do not know whether eBay has made any changes since it was published.)

eBay encourages both buyers and sellers to leave feedback on each other, the idea being to encourage fair dealing by attaching a persistent reputation to everyone. Feedback is associated with specific transactions, and anyone (whether logged into eBay or not) can see each user’s complete feedback history. Items sold are visible, items bought are not, and buyers’ identities are obscured. The catch is, you can match up buyer feedback with seller feedback by the timestamps, using obscured buyer identities as a disambiguator, and thus learn what was bought. It involves crawling a lot of user pages, but it’s possible to do this in a couple days without coming to eBay’s notice.
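To make the matching step concrete, here is a minimal sketch in Python of the timestamp join. The record fields, and the assumption that the obscured buyer ID is predictable from the real handle (something like “a***b”), are my illustrative guesses, not the paper’s actual scraping pipeline.

```python
# Toy sketch of the timestamp join described above.  Field layouts and the
# form of the obscured buyer ID are illustrative assumptions, not eBay's
# actual formats.
from collections import defaultdict

def obscure(handle):
    """Guess at how an obscured buyer ID might relate to the real handle."""
    return handle[0] + "***" + handle[-1]

def match_purchases(buyer_handle, buyer_feedback, seller_feedback):
    """buyer_feedback:  [(seller, timestamp)] scraped from the buyer's page.
    seller_feedback: [(item, obscured_buyer, timestamp)] scraped from the
    sellers' pages.  Returns (seller, item) pairs the buyer probably bought."""
    by_time = defaultdict(list)
    for item, obscured, ts in seller_feedback:
        by_time[ts].append((item, obscured))

    guesses = []
    for seller, ts in buyer_feedback:
        # The timestamp narrows it down; the obscured ID disambiguates.
        candidates = [item for item, obscured in by_time[ts]
                      if obscured == obscure(buyer_handle)]
        if len(candidates) == 1:
            guesses.append((seller, candidates[0]))
    return guesses
```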

They demonstrate that this is a serious problem by identifying users who purchased gun holsters (eBay does not permit the sale of actual firearms), pregnancy tests, and HIV tests. As an additional fillip, they show that people commonly use the same handle on eBay as on Facebook, so purchase histories can be correlated with all the personal information one can typically extract from a Facebook profile.

Solutions are pretty straightforward and obvious—obscured buyer identities shouldn’t be correlated with their real handles; feedback timestamps should be quantized to weeks or even months; feedback on buyers might not be necessary anymore; eBay shouldn’t discourage use of various enhanced-privacy modes, or should maybe even promote them to the default. (Again, I don’t know whether any of these solutions has been implemented.) The value of the paper is perhaps more in reminding website developers in general that cross-user correlations are always a privacy risk.

RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response

Lots of software developers would love to know in detail how people use their work, and to that end, there’s an increasing number of programs that phone home with crash reports, heat maps of which UI elements get used, and even detailed logs of every interaction. The problem, of course, is how to collect as much useful information as possible without infringing on user privacy. Usually, what the developers want are aggregate statistics—how often does this widget get used by everyone, that sort of thing—so the logical method to apply is differential privacy. However, stock differential privacy algorithms assume a central, trusted database, and in this case, the only people who could run the database are the developers—the very same people against whom individual responses should be protected. Also, differential privacy’s mathematical guarantees are eroded if you repeatedly ask the same survey questions of the same population—which is exactly what you want to do to understand how user behavior changes over time.

This paper solves both problems with a combination of randomized response and Bloom filters. Randomized response is an old, ingenious idea for asking sensitive yes-no survey questions such that any one person has plausible deniability, but the true aggregate rate of yes responses can still be estimated: each survey participant secretly flips a coin before answering, and if it comes up heads they answer yes whether or not that’s true. Otherwise they answer honestly. After the fact, everyone can claim to have answered yes because of the coin flip, but to recover the true population statistic one simply doubles the observed number of no answers. Bloom filters (which I’m not going to try to explain) are used to extend this from yes-no to reporting sets of strings. Finally, there are two stages of randomization. Given a set of survey questions, the system computes a Permanent randomized response which is, as the name implies, reused until either the answers or the questions change. This prevents privacy erosion, because each user is always either answering honestly or falsely to any given question; a nosy server cannot average out the coin tosses. Additional instantaneous randomness is added each time a report is submitted, to prevent the report from being a fingerprint for that user. The system is said to be in use for telemetry from the Chrome browser, and they give several examples of data collected in practice.
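To see why the arithmetic works, here is a tiny simulation of the basic yes/no randomized response the system builds on (not RAPPOR’s full Bloom-filter construction); the population size and the 30% true rate are invented for illustration.

```python
import random

def respond(truth: bool) -> bool:
    """One round of randomized response: heads forces a 'yes',
    tails answers honestly."""
    return True if random.random() < 0.5 else truth

def estimate_yes_rate(responses):
    """A 'no' can only come from an honest 'no' on a tails flip, so the
    observed 'no' rate is about half the true one."""
    observed_no = sum(not r for r in responses) / len(responses)
    return 1 - 2 * observed_no

# Simulated population: 30% of 10,000 people would truthfully answer 'yes'.
truths = [random.random() < 0.3 for _ in range(10_000)]
responses = [respond(t) for t in truths]
print(estimate_yes_rate(responses))   # close to 0.3, yet any individual
                                      # 'yes' is deniable
```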

The Permanent randomized responses illustrate a basic tension in security design: you can often get better security outcomes versus a remote attacker if you keep local state, but that local state is probably high-value information for a local attacker. For example, any system involving trust on first use will store a list of frequently contacted remote peers, with credentials, on the local machine; tampering with that can destroy the system’s security guarantees, and simply reading it tells you, at a minimum, which remote peers you regularly connect to. The TAILS live-CD is intended to leave no trace on the system it’s run on, which means it changes its entry guards on every reboot, increasing the chance of picking a malicious guard. In this case, a local adversary who can read a Chrome browser profile has access to much higher-value data than the Permanent responses, but a cautious user who erases their browser profile on every shutdown, to protect against that local adversary, is harming their own security against the Chrome developers. (This might be the right tradeoff, of course.)

I also wonder what a network eavesdropper learns from the telemetry reports. Presumably they are conveyed over a secure channel, but at a minimum, that still reveals that the client software is installed, and the size of the upload might reveal things. A crash report, for instance, is probably much bulkier than a statistics ping, and is likely to be submitted immediately rather than on a schedule.

Performance and Security Improvements for Tor: A Survey

This week’s non-PETS paper is a broad survey of research into improving either the security, or the performance, or both, of low-latency anonymity networks such as Tor. Nearly all of the research used Tor itself as a testbed, and the presentation here assumes Tor, but most of the work could be generalized to other designs.

There’s been a lot of work on this sort of thing in the eleven years since Tor was first introduced, and this paper does a generally good job of categorizing it, laying out lines of research, indicating which proposals have been integrated into Tor and which haven’t, etc. (I particularly liked the mindmap diagram near the beginning, and the discussion near the end of which problems still need to get solved.) One notable exception is the section on improved cryptography, where you need to have a solid cryptography background to get any idea of what the proposals are, let alone whether they worked. There are also a couple of places where connections to the larger literature of network protocol engineering would have been helpful: for instance, there’s not a single mention of bufferbloat, even though that is clearly an aspect of the congestion problems that one line of research aims to solve. And because it’s not mentioned, it’s not clear whether the researchers doing that work knew about it.

Tor is a difficult case in protocol design because its security goals are—as acknowledged in the original paper describing its design [1]—directly in conflict with its performance goals. Improvements in end-to-end latency, for instance, may make a traffic correlation attack easier. Improvements in queueing fairness or traffic prioritization may introduce inter-circuit crosstalk enabling an attacker to learn something about the traffic passing through a relay. Preferring to use high-bandwidth relays improves efficiency but reduces the number of possible paths that traffic can take. And so on. It is striking, reading through this survey, to see how often an apparently good idea for performance was discovered to have unacceptable consequences for anonymity.

Automated Experiments on Ad Privacy Settings

You’ve probably noticed the creepy effect where you consider buying something online, or maybe just look at a page for something that happens to be for sale, and for weeks afterward you get ads on totally unrelated websites for that thing or similar things. The more reputable online ad brokerages offer a degree of control over this effect (e.g. Google, Microsoft). This study investigates exactly what effect those settings have on the ads observed by automated browsing agents. The basic idea is to set some of the knobs, visit websites that will tell the ad provider something about the simulated customer’s preferences, possibly adjust the knobs again, and finally record what is being advertised on a general-interest website.

A great deal of the paper is devoted to statistical methodology. Because the ad provider is a stateful black box, and one whose behavior may depend on uncontrollable external factors (e.g. that advertiser has exhausted their budget for the month), it’s vital to avoid as many statistical assumptions as possible. They use permutation tests and supervised classification (logistic regression), both of which make minimal assumptions. They’re also very careful about drawing conclusions from their results. I’m not much of a statistician, but it all sounds carefully thought out and plausible, with one exception: heavy reliance on significance testing, which has come in for serious criticism [1] to the point where some journals no longer accept its use at all [2]. This is exactly the sort of research where p-values can mislead; if I were reviewing this prior to publication I would have encouraged the authors to think about how they could present the same conclusions without using significance testing.
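For readers who haven’t met permutation tests, here is a minimal sketch of the idea, with my own toy data and test statistic rather than the authors’: shuffle the group labels many times and ask how often a difference at least as large as the observed one arises by chance.

```python
import random

def permutation_test(group_a, group_b, trials=10_000):
    """Two-sample permutation test on the difference of means.  Returns the
    fraction of label shufflings producing a difference at least as extreme
    as the observed one (a p-value with no distributional assumptions)."""
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(trials):
        random.shuffle(pooled)
        a, b = pooled[:n_a], pooled[n_a:]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            extreme += 1
    return extreme / trials

# Toy data: how many times a particular ad was shown to each simulated agent.
men   = [5, 7, 6, 9, 8, 7]
women = [2, 3, 1, 4, 2, 3]
print(permutation_test(men, women))
```

Because the null distribution comes straight from the data, no assumption about its shape is needed, which is exactly what you want when the thing generating the data is a stateful black box.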

Now, the actual conclusions. Only Google’s ads were tested. (Expanding the tests to additional ad brokers is listed as future work.) They confirm that turning a particular topic (dating) off in the preferences does cause those ads to go away. They observe that two highly sensitive topics (substance abuse, disability) that do trigger targeted ads are not controllable via the preferences; in fact, they are completely invisible on that screen. And the most interesting case came when they set the ad preferences to explicitly reveal a gender (man or woman) and then browsed a bunch of sites related to job searching. Simulated customers who claimed to be men got ads for a career coaching service which promised better odds of being hired into $200K+ executive positions; those who claimed to be women did not see these ads.

This last example clearly reflects the well-known glass ceiling which affects business in general, but (as the authors point out) it’s impossible to tell, from outside the black box, why it shows up in this case. The career-coaching service could have chosen to advertise only to men. Google’s ad-targeting algorithm could have been coded with (likely unconscious) bias in its category structure, or—this is the most probable explanation—its feedback mechanism could have detected that men are more likely to click on ads with those keywords, so it makes that even more likely by showing them preferentially to men. There’s a telling comment at the very end of the paper:

… We consider it more likely that Google has lost control over its massive, automated advertising system. Even without advertisers placing inappropriate bids, large-scale machine learning can behave in unexpected ways.

There’s a lesson here for all the big data companies: the best an unbiased machine learning system can hope to do is produce an accurate reflection of the training set—including whatever biases are in there. If you want to avoid reproducing all the systemic biases of the offline world, you will have to write code to that effect.

An Automated Approach for Complementing Ad Blockers’ Blacklists

Last week’s long PETS paper was very abstract; this week’s paper is very concrete. The authors are concerned that manually-curated blacklists, as currently used by most ad-blocking software, cannot hope to keep up with the churn in the online ad industry. (I saw a very similar talk at WPES back in 2012 [1] which quoted the statistic that the default AdBlock Plus filter list contains 18,000 unique URLs, with new ones added at a rate of five to fifteen every week.) They propose to train a machine classifier on network-level characteristics that differ between ad services and normal web sites, to automate detection of new ad providers and/or third-party privacy-invasive analytics services. (This is the key difference from the paper at WPES 2012: that project used static analysis of JavaScript delivered by third-party services to extract features for their classifier.)

They find that a set of five features provides a reasonably effective classification: proportion of requests that are third-party (for transclusion into another website), number of unique referrers, ratio of received to sent bytes, and proportion of requests including cookies. For the training set they used, they achieve 83% precision and 85% recall, which is reasonable for a system that will be used to identify candidates for manual inspection and addition to blacklists.
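As a rough illustration of how such a classifier might be put together (the field names, the exact feature computations, and the use of scikit-learn’s logistic regression are my assumptions, not the paper’s pipeline):

```python
# Hypothetical per-request record: {"domain", "referrer", "is_third_party",
# "bytes_received", "bytes_sent", "has_cookie"}.  Names are illustrative.
from collections import defaultdict
from sklearn.linear_model import LogisticRegression

def domain_features(requests):
    """Aggregate network-level features of the kind listed above, per domain."""
    per_domain = defaultdict(list)
    for req in requests:
        per_domain[req["domain"]].append(req)

    features = {}
    for domain, reqs in per_domain.items():
        n = len(reqs)
        features[domain] = [
            sum(r["is_third_party"] for r in reqs) / n,        # third-party share
            len({r["referrer"] for r in reqs}),                # unique referrers
            sum(r["bytes_received"] for r in reqs)
                / max(1, sum(r["bytes_sent"] for r in reqs)),  # recv/sent ratio
            sum(r["has_cookie"] for r in reqs) / n,            # cookie share
        ]
    return features

def train(features, labels):
    """labels: {domain: 1 if ad/analytics, else 0}, e.g. derived from an
    existing blacklist.  Unlabeled domains are simply skipped."""
    domains = sorted(set(features) & set(labels))
    X = [features[d] for d in domains]
    y = [labels[d] for d in domains]
    return LogisticRegression(max_iter=1000).fit(X, y)
```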

There are several methodological bits in here which I liked. They use entropy-based discretization and information gain to identify valuable features and discard unhelpful ones. They compare a classifier trained on manually-labeled data (from a large HTTP traffic trace) with a classifier trained on the default AdBlock Plus filter list; both find similar features, but the ABP filter list has better coverage of infrequently used ads or analytics services, whereas the manually labeled training set catches a bunch of common ads and analytics services that ABP missed.

One fairly significant gap is that the training set is limited to cleartext HTTP. There’s a strong trend nowadays toward HTTPS for everything, including ads, but established ad providers are finding it difficult to cut all of their services over quickly, which might provide an opening for new providers—and thus cause a shift toward ad services that the blacklists (and a classifier trained only on cleartext traffic) will have missed.

There’s almost no discussion of false positives. Toward the end there is a brief mention of third-party services, like Gravatar and Flattr, that share a usage pattern with ads and analytics and showed up as false positives. But it’s possible to enumerate common types of third-party services (other than ads and analytics) a priori: outsourced commenting (Disqus, hypothes.is), social media share buttons (Facebook, Twitter), shared hosting of resources (jQuery, Google Fonts), static-content CDNs, etc. Most of these are probably weeded out by the received-to-sent-bytes check, but I would still have liked to see an explicit check of at least a few of them.

And finally, nobody seems to have bothered talking to the people who actually maintain the ABP filter lists to find out how they do it. (I suspect it relies strongly on manual, informal reporting to a forum or something.) If this is to turn into anything more than an experiment, people need to be thinking about integration and operationalization.

Cache Timing Attacks Revisited: Efficient and Repeatable Browser History, OS, and Network Sniffing

Cache timing attacks use the time some operation takes to learn whether or not a datum is in a cache. They’re inherently hard to fix, because the entire point of a cache is to speed up access to data that was used recently. They’ve been studied as far back as Multics. [1] In the context of the Web, the cache in question (usually) is the browser’s cache of resources retrieved from the network, and the attacker (usually) is a malicious website that wants to discover something about the victim’s activity on other websites. [2] [3] This paper gives examples of using cache timing to learn information needed for phishing, discover the victim’s location, and monitor the victim’s search queries. It’s also known to be possible to use the victim’s browsing habits as an identifying fingerprint [4].

The possibility of this attack is well-known, but to date it has been dismissed as an actual risk for two reasons: it was slow, and probing the cache was a destructive operation, i.e. the attacker could only probe any given resource once, after which it would be cached whether or not the victim had ever loaded it. This paper overcomes both of those limitations. It uses Web Workers to parallelize cache probing, and it sets a very short timeout on all its background network requests—so short that the request can only succeed if it’s cached. Otherwise, it will be cancelled and the cache will not be perturbed. (Network requests will always take at least a few tens of milliseconds, whereas the cache can respond in less than a millisecond.) In combination, these achieve two orders of magnitude speedup over the best previous approach, and, more importantly, they render the attack repeatable.
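The timeout trick is easier to see in a toy simulation than in real browser code. The sketch below fakes the cache and the latencies in Python, where the real attack uses Web Workers and actual requests; the 5 ms cutoff and the latency ranges are invented.

```python
import random

CACHE = {"https://bank.example/logo.png"}   # resources the victim has loaded

def fetch_time_ms(url):
    """Pretend fetch: well under a millisecond if cached, tens of
    milliseconds if it has to go to the network."""
    return random.uniform(0.1, 1.0) if url in CACHE else random.uniform(30, 120)

def probe(url, timeout_ms=5.0):
    """Non-destructive probe: a request that cannot finish inside the timeout
    is abandoned before anything gets cached, so the same URL can be probed
    again later."""
    return fetch_time_ms(url) <= timeout_ms   # True means "probably cached"

print(probe("https://bank.example/logo.png"))    # True: victim visited the bank
print(probe("https://casino.example/logo.png"))  # False, and still uncached
```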

What do we do about it? Individual sites can mark highly sensitive resources as not cacheable at all, but this slows the web down, and you’ll never get everyone to do it for everything that could be relevant. Browsers aren’t about to disable caching; it’s too valuable. A possibility (not discussed in this paper) is that the browser could notice the first time a resource is requested from a new domain, and deliberately delay satisfying it from the cache by approximately the amount of time it originally took to load off the network. I’d want to implement and test that to make sure it didn’t leave a hole, but it seems like it would allow us to have our cake and eat it too.
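Here is a back-of-the-envelope sketch of that delay idea, purely illustrative and untested against the attack; the per-origin bookkeeping and the fetch interface are made up.

```python
import time

class TimingSafeCache:
    """Sketch of the idea above: remember how long each resource originally
    took to fetch, and the first time a new origin asks for it, stall the
    cached answer by roughly that long, so a cache hit looks (to that origin)
    like a first fetch."""

    def __init__(self, fetch):
        self.fetch = fetch      # function(url) -> (body, elapsed_seconds)
        self.store = {}         # url -> (body, original fetch time)
        self.served = set()     # (origin, url) pairs already seen

    def get(self, url, requesting_origin):
        if url not in self.store:
            body, elapsed = self.fetch(url)       # genuine network fetch
            self.store[url] = (body, elapsed)
            self.served.add((requesting_origin, url))
            return body
        body, elapsed = self.store[url]
        if (requesting_origin, url) not in self.served:
            time.sleep(elapsed)   # pay the network cost once per origin
            self.served.add((requesting_origin, url))
        return body
```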

In closing, I want to point out that this is a beautiful example of the maxim that attacks always get better. Cryptographers have been banging that drum for decades, trying to get people to move away from cipher primitives that are only a little weak before it’s too late, without much impact. The same, it seems, goes for information leaks.

Privacy Games: Optimal User-Centric Data Obfuscation

I attended this year’s PETS conference a couple weeks ago, and for the next several weeks I’m going to be focusing on highlights from there. For variety’s sake, there will be one unrelated thing each week.

I’m starting with a paper which is heavier mathematical going than I usually pick. The author claims to have developed an optimal (meta-)strategy for obfuscating a person’s location before revealing it to some sort of service. The person wants to reveal enough information to get whatever utility the service provides, while not telling it exactly where they are, because they don’t entirely trust the service. The service, meanwhile, wants to learn as much as possible about the person’s location—it doesn’t matter why.

There are two standard mathematical techniques for obfuscating a location: differential privacy and distortion privacy. They both work by adding noise to a location value, but they use different metrics for how much the adversary can learn given the noisy value. The paper asserts that differential privacy is sensitive to the likelihood of the observation given the data, while distortion privacy is sensitive to the joint probability of the observation and the data; thus, by guaranteeing both, we encompass all the defense that is theoretically possible. (This is a bold claim and I don’t know enough about this subfield to verify it.) The paper proceeds to demonstrate, first, that you can construct a linear-programming problem whose solution gives an appropriate amount of noise to add to a location value, and second, that this solution is game-theoretically optimal, even against an adversary who knows your strategy and can adapt to it. Then they analyze the degree to which the data has to be obfuscated, comparing this with more limited algorithms from previous literature.
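To give a flavor of what “construct a linear-programming problem” means here, below is a toy version in Python. It is emphatically not the paper’s formulation (which also folds in distortion privacy and the game-theoretic analysis); it just picks a noise distribution over a handful of discrete locations that minimizes expected displacement subject to a differential-privacy-style constraint. All the numbers, and the use of scipy, are my own.

```python
import numpy as np
from scipy.optimize import linprog

# Toy setup: four locations on a line, a public prior over where the user
# is, and a distance metric.  All numbers are invented.
locs = np.array([0.0, 1.0, 2.0, 3.0])
prior = np.array([0.4, 0.3, 0.2, 0.1])
eps = 0.7                                    # privacy parameter
n = len(locs)
dist = np.abs(locs[:, None] - locs[None, :])

def idx(s, o):                               # p[s, o] flattened row-major
    return s * n + o

# Objective: expected distance between true and reported location.
c = np.array([prior[s] * dist[s, o] for s in range(n) for o in range(n)])

# Differential-privacy-style constraints:
#   p[s, o] <= exp(eps * d(s, s')) * p[s', o]   for all s != s' and all o.
A_ub, b_ub = [], []
for s in range(n):
    for s2 in range(n):
        if s == s2:
            continue
        for o in range(n):
            row = np.zeros(n * n)
            row[idx(s, o)] = 1.0
            row[idx(s2, o)] = -np.exp(eps * dist[s, s2])
            A_ub.append(row)
            b_ub.append(0.0)

# Each row of p must be a probability distribution.
A_eq = np.zeros((n, n * n))
for s in range(n):
    A_eq[s, s * n:(s + 1) * n] = 1.0
b_eq = np.ones(n)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
p = res.x.reshape(n, n)
print(np.round(p, 3))   # obfuscation mechanism: one row per true location
```

The paper’s contribution is showing that a much more elaborate program of this general shape is optimal even against an adversary who knows the mechanism and adapts to it.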

I unfortunately couldn’t make any sense of the comparisons, because the figures are poorly labeled and explained. And the bulk of the paper is just mathematical derivations, to which I have nothing to add. But there is an important point I want to highlight. The game-theoretic analysis only works if everything the adversary already knows (before they learn the obfuscated location) is captured by a distance metric which defines how far the obfuscated location has been displaced. Mathematically, this is fine; any given fact that the adversary might know about the person revealing their location can be modeled by adjusting the metric. (Person is in a city, so their path is constrained to the street grid. Person just left a large park, there is only one coffee shop within radius X of that park. Person never goes to that restaurant.) However, the system doing the obfuscation might not be able to predict all of those factors, and if it distorts the metric too much it might contradict itself (person cannot possibly have walked from A to B in the time available). In the future I’d like to see some analysis of this. What is it plausible for the adversary to know? At what point does the obfuscation break down because it can’t cover all the known facts without contradicting itself?

Security Analysis of Consumer-Grade Anti-Theft Solutions Provided by Android Mobile Anti-Virus Apps

Today we have another analysis of Android device security against scenarios where the owner loses control of the phone, by the same researchers who wrote Security Analysis of Android Factory Resets. Here, instead of looking at what happens when the owner deliberately erases a phone in order to sell it, they study what happens when the phone is stolen and the owner tries to wipe or disable it remotely. A remote disable mechanism is commonly included in anti-virus programs for Android, and this study looks at ten such mechanisms.

As with factory reset, the core finding boils down to “none of them work.” The root causes are slightly different, though. The core of Android, naturally, is at pains to prevent normal applications from doing something as destructive as initiating a memory wipe, rendering the phone unresponsive to input, or changing passwords. To be properly effective, the anti-theft program needs administrative privileges, which can be revoked at any time, so there’s an inherent race between the owner activating the anti-theft program and the thief disabling it. To make matters worse, UX bugs make it easy for these programs to appear to be installed correctly but not have the privileges they need; implementation bugs (possibly caused by unclear Android documentation) may leave loopholes even when the program was installed correctly; and several device vendors added backdoor capabilities (probably for in-store troubleshooting) that allow a thief to bypass the entire thing—for those familiar with Android development: if the phone is turned off and then plugged into a computer, it boots into recovery mode and activates ADB.

There is a curious omission in this paper: since 2013, Google itself has provided a remote lock/wipe feature as part of the Google Apps bundle that’s installed on most Android devices [1]. Since the Apps bundle is, nowadays, Google’s way of getting around vendors who can’t be bothered to patch the base operating system in a timely fashion, it has all the privileges it needs to do this job correctly, and the feature should be available regardless of Android base version. The UX is reasonable, and one would presume that the developers are at least somewhat less likely to make mistakes in the implementation. This paper, however, doesn’t mention that at all, despite (correctly) pointing out that this is a difficult thing for a third-party application to get right and the device vendors should step up.

Practical advice for phone-owners continues to be: encrypt the phone on first boot, before giving it any private information, and use a nontrivial unlock code. Practical advice for antivirus vendors, I think, is you’re not testing your software adversarially enough. The implementation bugs, in particular, suggest that the vendors’ test strategy confirmed that remote lock/wipe works, when set up correctly, but did not put enough effort into thinking of ways to defeat the lock.

Why Doesn’t Jane Protect Her Privacy?

Today’s paper is very similar to What Deters Jane from Preventing Identification and Tracking on the Web? and shares an author. The main difference is that it’s about email rather than the Web. The research question is the same: why don’t people use existing tools for enhancing the security and privacy of their online communications? (In this case, specifically tools for end-to-end encryption of email.) The answers are also closely related. As before, many people think no one would bother snooping on them because they aren’t important people. They may understand that their webmail provider reads their email in order to select ads to display next to it, but find this acceptable, and believe that the provider can be trusted not to do anything else with its knowledge. They may believe that the only people in a position to read their email are nation-state espionage agencies, and that trying to stop them is futile. All of these are broadly consistent with the results for the Web.

A key difference, though, is that users’ reported understanding of email-related security risks is often about a different category of threat that end-to-end encryption doesn’t help with: spam, viruses, and phishing. In fact, it may hurt: one of Gmail’s (former) engineers went on record with a detailed argument for why their ability to read all their users’ mail was essential to their ability to filter spam. [1] I’m not sure that isn’t just a case of not being able to see out of their local optimum, but it certainly does make the job simpler. Regardless, it seems to me that spam, viruses, and phishing are a much more visible and direct threat to the average email user’s personal security than any sort of surveillance. Choosing to use a service that’s very good at filtering, even at some cost in privacy, therefore strikes me as a savvy choice rather than an ignorant one. Put another way, I think a provider of end-to-end encrypted email needs to demonstrate that it can filter junk just as effectively if it wants to attract users away from existing services.

(In the current world, encryption is a signal of not being spam, but in a world where most messages were encrypted, spammers would start using encryption, and so would your PHB who keeps sending you virus-infected spreadsheets that you have to look at for your job.)

Another key difference is that you can unilaterally start using Tor, anti-tracking browser extensions, and so on, but you can’t unilaterally start encrypting your email. You can only send encrypted email to people who can receive encrypted email. Right now, that means there is a strong network effect against the use of encrypted email. There’s not a single word about this in the paper, and I find that a serious omission. It does specifically say that they avoided asking people about their experiences (if any) with PGP and similar software because they didn’t want to steer their thinking that way, but I think that was a mistake. It means they can’t distinguish what people think about email privacy in general, from what they think about end-to-end encryption tools that they may have tried, or at least heard of. There may be a substantial population of people who looked into PGP just enough to discover that it’s only useful if the recipient also uses it, and don’t think about it anymore unless specifically prompted.

Security Analysis of Android Factory Resets

Unlike the last two papers, today’s isn’t theoretical at all. It’s a straightforward experimental study of whether or not any data can be recovered from an Android phone that’s undergone a factory reset. Factory reset is supposed to render it safe to sell a phone on the secondhand market, which means any personal data belonging to the previous owner should be erased at least thoroughly enough that the new owner cannot get at it without disassembling the phone. (This is NIST logical media sanitization, for those who are familiar with those rules—digital sanitization, where the new owner cannot extract any data short of taking an electron microscope to the memory chip, would of course be better, but in the situations where the difference matters, merely having a cell phone may mean you have already failed.)

The paper’s core finding is also straightforward: due to a combination of UI design errors, major oversights in the implementation of the factory-reset feature, driver bugs, and the market’s insistence that flash memory chips have to pretend to be traditional spinning rust hard disks even though they don’t work anything like that, none of the phones in the study erased everything that needed to get erased. The details are confusingly presented, but they’re not that important anyway. It ends with a concrete set of recommendations for future improvements, which is nice to see.

There are a few caveats. The study only covers Android 2.3 through 4.3; it is unclear to me whether the situation is improved in newer versions of Android. They don’t discuss device encryption at all—this is at least available in all versions studied, and should completely mitigate the problem provided that a factory reset successfully wipes out the old encryption keys, and that there’s no way for unencrypted data to survive the encryption process. (Unfortunately, the normal setup flow for a new phone, at least in the versions I’ve used, asks for a bunch of sensitive info before offering to encrypt the phone. You can bypass this but it’s not obvious how.) It would be great if someone tested whether encryption is effective in practice.

Beyond the obvious, this is also an example of Android’s notorious update problem. Many of the cell carriers cannot be bothered to push security updates—or any updates—to phones once they have been sold. [1] They are also notorious for making clumsy modifications to the software that harm security, usability, or both. [2] In this case, this means driver bugs that prevent the core OS from erasing everything it meant to erase, and failure to deploy patches that made the secure erase mechanism work better. Nobody really has a good solution to that. I personally am inclined to say that neither the telcos nor Google should be in the business of selling phones, and conversely, the consumer electronics companies who are in that business should not have the opportunity to make any modifications to the operating system. Whatever the hardware is, it’s got to work with a stock set of drivers, and the updates have to come direct from Google. That would at least simplify the problem.

(Arguably Google shouldn’t be in the OS business either, but that’s a more general antitrust headache. One thing at a time.)