Papers tagged ‘Anonymity’

I Know What You’re Buying: Privacy Breaches on eBay

eBay intends not to let anyone else figure out what you’re in the habit of buying on the site. Because of that, lots of people consider eBay the obvious place to buy things you’d rather your neighbors not know you bought (a survey in this very paper confirms this). However, this paper demonstrates that a determined adversary can figure out what you bought.

(Caveat: This paper is from 2014. I do not know whether eBay has made any changes since it was published.)

eBay encourages both buyers and sellers to leave feedback on each other, the idea being to encourage fair dealing by attaching a persistent reputation to everyone. Feedback is associated with specific transactions, and anyone (whether logged into eBay or not) can see each user’s complete feedback history. Items sold are visible, items bought are not, and buyers’ identities are obscured. The catch is that you can match up buyer feedback with seller feedback by the timestamps, using obscured buyer identities as a disambiguator, and thus learn what was bought. It involves crawling a lot of user pages, but it’s possible to do this in a couple of days without coming to eBay’s notice.
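To make the matching step concrete, here is a minimal sketch of the join. Everything about the data model is my assumption, not the paper's: the tuple layouts, the `obscure` masking rule (keep the first and last character of the handle), and the idea that both views of a feedback event carry an identical timestamp string.

```python
from collections import defaultdict

def obscure(handle):
    # Hypothetical masking rule (an assumption, not eBay's actual scheme):
    # keep the first and last character of the handle.
    return f"{handle[0]}***{handle[-1]}"

def link_purchases(buyer_page, seller_page):
    """Join two crawled feedback views on (seller, timestamp).

    buyer_page:  list of (buyer_handle, seller, timestamp), item hidden
    seller_page: list of (seller, item, obscured_buyer, timestamp)
    Returns a list of (buyer_handle, item) pairs that match unambiguously.
    """
    index = defaultdict(list)
    for seller, item, masked, ts in seller_page:
        index[(seller, ts)].append((masked, item))

    linked = []
    for buyer, seller, ts in buyer_page:
        # The obscured identity disambiguates feedback sharing a timestamp.
        candidates = [item for masked, item in index.get((seller, ts), [])
                      if masked == obscure(buyer)]
        if len(candidates) == 1:
            linked.append((buyer, candidates[0]))
    return linked

# Made-up records for illustration only.
buyer_page = [
    ("alice99", "medshop", "2014-03-01T10:15:00"),
    ("bob", "medshop", "2014-03-01T10:15:00"),
]
seller_page = [
    ("medshop", "pregnancy test", "a***9", "2014-03-01T10:15:00"),
    ("medshop", "holster", "b***b", "2014-03-01T10:15:00"),
]
print(link_purchases(buyer_page, seller_page))
```

Both sample records share a timestamp, so the timestamp alone would be ambiguous; it is the obscured identity that settles which buyer got which item.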

They demonstrate that this is a serious problem by identifying users who purchased gun holsters (eBay does not permit the sale of actual firearms), pregnancy tests, and HIV tests. As an additional fillip, they show that people commonly use the same handle on eBay as on Facebook, and therefore purchase histories can be correlated with all the personal information one can typically extract from a Facebook profile.

Solutions are pretty straightforward and obvious—obscured buyer identities shouldn’t be correlated with their real handles; feedback timestamps should be quantized to weeks or even months; feedback on buyers might not be necessary anymore; eBay shouldn’t discourage use of various enhanced-privacy modes, or should maybe even promote them to the default. (Again, I don’t know whether any of these solutions has been implemented.) The value of the paper is perhaps more in reminding website developers in general that cross-user correlations are always a privacy risk.
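Of these, timestamp quantization is the easiest to picture. A minimal sketch, assuming feedback times are plain `datetime` values and that "quantized to weeks" means rounding down to the start of the week (the function name and the Monday convention are my choices, not the paper's):

```python
from datetime import datetime, timedelta

def quantize_to_week(ts):
    """Round a timestamp down to Monday 00:00 of its week."""
    monday = ts - timedelta(days=ts.weekday())
    return monday.replace(hour=0, minute=0, second=0, microsecond=0)

# Two feedback events from the same week become indistinguishable by time.
a = quantize_to_week(datetime(2014, 3, 5, 10, 15))
b = quantize_to_week(datetime(2014, 3, 9, 23, 59))
print(a, b, a == b)
```

After quantization, every transaction a seller completed in a given week is a candidate match for a given piece of buyer feedback, so the join on timestamps stops being unambiguous for any reasonably busy seller.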

20,000 In League Under The Sea: Anonymous Communication, Trust, MLATs, and Undersea Cables

Today’s paper takes another shot at modeling how the physical topology of the Internet affects the security of Tor against passive adversaries with the ability to snoop on a lot of traffic. It’s by some of the same people who wrote Defending Tor from Network Adversaries and is quite closely related.

Most of the work of this paper goes into building a flexible, formal threat model, which Tor client software could (in principle) use to inform its routing decisions. Acknowledging that there’s always going to be a good deal of uncertainty about what adversaries are out there and what they are capable of, they make two key design decisions. The model is probabilistic (based on a Bayesian belief network), and it takes user input. For instance, if you have reason to think the government of Transbelvia has it in for you, you can instruct Tor to avoid paths that Transbelvia might be able to snoop on, and the model will expand that out to all the ways they might do that. Conversely, if you trust a particular organization you might like to preferentially use its guards or exit nodes, and it can do that too.

The model is very thorough about different ways a government might be able to snoop on network traffic—not just relays physically hosted in the country, but ASes and IXPs (Transbelvia hosts a major IXP for Eastern Europe), submarine cable landing sites (not relevant for a landlocked country), mutual legal assistance treaties (MLATs) which might be used to have another country do some snooping on Transbelvia’s behalf, and even hacking into and subverting routers at interesting points in the connectivity graph. (The pun in the title refers to their analysis of how MLATs could allow several of the usual suspects to snoop on 90+% of all submarine cable traffic, even though they host hardly any cable landings themselves.) Equally important, it can be expanded at need when new techniques for spying are revealed.

I think something like this is going to be an essential building block if we want to add any spy-aware routing algorithm to Tor, but I have two serious reservations. First, the simpler and less important one: right now all Tor clients make routing decisions more or less the same way (there have been small changes to the algorithm over time, but everyone is strongly encouraged to stay close to the latest client release anyway, if only because of bugs). If clients don’t all make routing decisions the same way, then that by itself might be usable to fingerprint them, and thus to cut down the number of people who might’ve taken some action, from all Tor users to all Tor users who make routing decisions like THIS. If highly personalized threat models are allowed, the latter group might be just one person.

Second, and rather more serious, the user-input aspect of this system is going to require major user experience research and design to have any hope of not being worse than the problem it’s trying to solve. It’s not just a matter of putting a friendly face on the belief language (although that does need to happen)—the system will need to educate its users in the meaning of what it is telling them, and it will need to walk them through the consequences of their choices. And it might need to provide nudges if there’s a good reason to think the user’s assessment of their threat model is flat-out wrong (even just making that judgement automatically is fraught with peril—but so is not making that judgement).

Performance and Security Improvements for Tor: A Survey

This week’s non-PETS paper is a broad survey of research into improving either the security, or the performance, or both, of low-latency anonymity networks such as Tor. Nearly all of the research used Tor itself as a testbed, and the presentation here assumes Tor, but most of the work could be generalized to other designs.

There’s been a lot of work on this sort of thing in the eleven years since Tor was first introduced, and this paper does a generally good job of categorizing it, laying out lines of research, indicating which proposals have been integrated into Tor and which haven’t, etc. (I particularly liked the mindmap diagram near the beginning, and the discussion near the end of which problems still need to get solved.) One notable exception is the section on improved cryptography, where you need to have a solid cryptography background to get any idea of what the proposals are, let alone whether they worked. There are also a couple of places where connections to the larger literature of network protocol engineering would have been helpful: for instance, there’s not a single mention of bufferbloat, even though that is clearly an aspect of the congestion problems that one line of research aims to solve. And because it’s not mentioned, it’s not clear whether the researchers doing that work knew about it.

Tor is a difficult case in protocol design because its security goals are—as acknowledged in the original paper describing its design [1]—directly in conflict with its performance goals. Improvements in end-to-end latency, for instance, may make a traffic correlation attack easier. Improvements in queueing fairness or traffic prioritization may introduce inter-circuit crosstalk enabling an attacker to learn something about the traffic passing through a relay. Preferring to use high-bandwidth relays improves efficiency but reduces the number of possible paths that traffic can take. And so on. It is striking, reading through this survey, to see how often an apparently good idea for performance was discovered to have unacceptable consequences for anonymity.

What Deters Jane from Preventing Identification and Tracking on the Web?

If you do a survey, large majorities of average people will say they don’t like the idea of other people snooping on what they do online. [1] [2] Yet the existing bolt-on software that can prevent such snooping (at least somewhat) isn’t used by nearly as many people. The default explanation is that the software is hard to install and use correctly. [3] [4]

This paper presents a complementary answer: maybe people don’t realize just how ubiquitous or invasive online snooping is, so the benefit seems not worth the hassle. The authors interviewed a small group about their beliefs concerning identification and tracking. (They admit that the study group skews young and technical, and plan to broaden the study in the future.) Highlights include: People are primarily concerned about data they explicitly provide to some service—social network posts, bank account data, buying habits—and may not even be aware that ad networks and the like can build up comprehensive profiles of online activity even if all they do is browse. They often have heard a bunch of fragmentary information about cookies and supercookies and IP addresses and so on, and don’t know how this all fits together or which bits of it to worry about. Some people thought that tracking was only possible for services with which they have an account, while they are logged in (so they log out as soon as they’re done with the service). There is also general confusion about which security threats qualify as identification and tracking—to be fair, just about all of them can include some identification or tracking component. The consequences of being tracked online are unclear, leading people to underestimate the potential harm. And finally, many of the respondents assume they are not important people and therefore no one would bother tracking them. All of these observations are consistent with earlier studies in the same vein, e.g. Rick Wash’s Folk Models of Home Computer Security.

The authors argue that this means maybe the usability problems of the bolt-on privacy software are overstated, and user education about online security threats (and the mechanism of the Internet in general) should have higher priority. I think this goes too far. It seems more likely to me that because people underestimate the risk and don’t particularly understand how the privacy software would help, they are not motivated to overcome the usability problems. I am also skeptical of the effectiveness of user education. The mythical average users may well feel, and understandably so, that they should not need to know exactly what a cookie is, or exactly what data gets sent back and forth between their computers and the cloud, or the internal structure of that cloud. Why is it that the device that they own is not acting in their best interest in the first place?

Identifying Webbrowsers in Encrypted Communications

If you are a website, it is fairly easy to identify the web browser in use by each of your visitors, even if they take steps to suppress the blatant things like the User-Agent header. [1] [2] It is so easy, in fact, that researchers typically try to make it harder for themselves, trying instead to identify individual users even as they move around, change IP addresses, flush their cookies, etc. [3] [4]

If you are a passive eavesdropper in between the browser and the website, and the network traffic is encrypted, and particularly if you are isolated from the client’s IP address by anonymizing relays (e.g. Tor), the task should logically be much harder. Or is it? The authors of this short paper did the most obvious thing: capture packet traces and throw them at an off-the-shelf machine-learning classifier. The feature vectors seen by the classifier are not described as clearly as I’d like, but I think they divided the trace into equal-length intervals and aggregated packet sizes in each direction in each interval; this is also one of the most basic and obvious things to do (the future work bit talks a little about better feature engineering). Despite the lack of tuning, they report 70–90% classification accuracy on a four-way choice among browsers (Chrome, Firefox, IE, Tor Browser) and 30–80% accuracy for a 13-way choice among browser and plugin combinations (by which they seem to mean whether or not JavaScript and Flash were enabled). For comparison, chance performance would be 25% on the four-way choice and 7.7% on the 13-way choice.
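Here is a minimal sketch of the feature extraction as I understand it. This is my reading of an under-described step, not the authors' code; the packet tuple layout, the function name, and the choice to sum bytes (rather than, say, count packets) are all assumptions.

```python
def interval_features(packets, duration, n_intervals):
    """Aggregate packet sizes per direction over equal-length intervals.

    packets: iterable of (time_seconds, size_bytes, direction), where
             direction is +1 for client-to-server and -1 for the reverse.
    Returns a list of length 2 * n_intervals: outgoing byte totals for
    each interval, followed by incoming byte totals for each interval.
    """
    width = duration / n_intervals
    features = [0] * (2 * n_intervals)
    for t, size, direction in packets:
        if not 0 <= t < duration:
            continue  # ignore packets outside the observation window
        i = min(int(t / width), n_intervals - 1)
        offset = 0 if direction > 0 else n_intervals
        features[offset + i] += size
    return features

# A made-up two-second trace, split into two one-second intervals.
trace = [(0.1, 100, +1), (0.2, 1500, -1), (1.5, 200, +1), (1.9, 300, +1)]
print(interval_features(trace, duration=2.0, n_intervals=2))
```

Vectors like this need only packet sizes, directions, and timing, all of which remain visible to an eavesdropper when the payload is encrypted.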

This is a short workshop paper, so it is to be expected that the results are a little crude and have missing pieces. The authors already know they need to tune their classifier. I hope someone has told them about ROC curves; raw accuracies make me grind my teeth. Besides that, the clear next step is to figure out what patterns the classifiers are picking up on, and then how to efface those patterns. I think it’s quite likely that the signal they see depends on gross characteristics of the different HTTP implementations used by each browser; for instance, at time of publication, Chrome implemented SPDY and QUIC, and the others didn’t.

The paper contains some handwaving in the direction of being able to fingerprint individual users with this information, but I’d want to see detectable variation among installations of the same browser before I’d be convinced that’s possible.

Defending Tor from Network Adversaries: A Case Study of Network Path Prediction

In a similar vein as Tuesday’s paper, this is an investigation of how practical it might be to avoid exposing Tor circuits to traffic analysis by an adversary who controls an Autonomous System. Unlike Tuesday’s paper, they assume that the adversary does not manipulate BGP to observe traffic that they shouldn’t have seen, so the concern is simply to ensure that the two most sensitive links in the circuit—from client to entry, and from exit to destination—do not pass through the same AS. Previous papers have suggested that the Tor client should predict the AS-level paths involved in these links, and select entries and exits accordingly [1] [2]. This paper observes that AS path prediction is itself a difficult problem, and that different techniques can give substantially different results. Therefore, they collected traceroute data from 28 Tor relays and compared AS paths inferred from these traces with those predicted from BGP monitoring (using the algorithm of On AS-Level Path Inference [3]).
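The selection rule itself is simple once you have AS paths; the hard part, as the paper says, is getting paths you can trust. A minimal sketch of the rule, assuming the client already holds a predicted AS path for each candidate link (the relay names and function names are hypothetical, and the AS numbers come from the range reserved for documentation):

```python
def shared_ases(client_to_entry, exit_to_destination):
    """ASes that would see both sensitive links of a circuit."""
    return set(client_to_entry) & set(exit_to_destination)

def safe_pairs(entry_paths, exit_paths):
    """Return (entry, exit) pairs whose two AS paths are disjoint.

    entry_paths: {entry_relay: [AS numbers on the client-to-entry path]}
    exit_paths:  {exit_relay:  [AS numbers on the exit-to-destination path]}
    """
    return [(entry, exit_)
            for entry, path_in in entry_paths.items()
            for exit_, path_out in exit_paths.items()
            if not shared_ases(path_in, path_out)]

# Made-up predicted paths, for illustration only.
entry_paths = {"guardA": [64496, 64497, 64498], "guardB": [64496, 64499]}
exit_paths = {"exitX": [64500, 64497, 64501], "exitY": [64502, 64503]}
print(safe_pairs(entry_paths, exit_paths))
```

Here guardA with exitX is rejected because AS 64497 appears on both paths. If the predicted paths are wrong, so is the answer, which is why the accuracy of path prediction matters so much.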

The core finding is that traceroute-based AS path inference does indeed give substantially different results from BGP-based path prediction. The authors assume that traceroute is more accurate; the discrepancy is consistently described as an error in the BGP-based prediction, and (since BGP-based prediction tends to indicate exposure to more different ASes) as overstating the risk exposure of any given Tor link. This seems unjustified to me. The standard traceroute algorithm is known to become confused in the presence of load-balancing routers, which are extremely common in the backbone [4]; refinements have been proposed (and implemented in the scamper tool used in this paper) but have problems themselves [5] [6]. More fundamentally, traceroute produces a snapshot: these UDP packets did take this route just now. Tor links are relatively long-lived TCP connections (tens of minutes) which could easily be rerouted among several different paths over their lifetime. I think it would be better to say that BGP path prediction produces a more conservative estimate of the ASes to which a Tor link could be exposed, and to highlight figuring out which one is more accurate as future work.

A secondary finding is that AS-aware path selection by the Tor client interacts poorly with the guard policy, in which each Tor client selects a small number of entry nodes to use for an extended period. These nodes must be reliable and high-bandwidth; the economics of running a reliable, high-bandwidth Internet server mean that they are concentrated in a small number of ASes. Similar economics apply to the operation of exit nodes, plus additional legal headaches; as a result, it may not be possible to find any end-to-end path that obeys both the guard policy and the AS-selection policy. This situation is, of course, worsened if you take the more conservative, BGP-based estimation of AS exposure.

I’ve been concerned for some time that guards might actually be worse for anonymity than the problem they are trying to solve. The original problem statement [7] is that if you select an entry node at random for each circuit, and some fraction of entry nodes are malicious, with high probability you will eventually run at least one circuit through a malicious entry. With guards, either all your circuits pass through a malicious entry for an extended period of time, or none do. My fundamental concern with this is, first, having all your traffic exposed to a malicious entry for an extended period is probably much worse for your anonymity than having one circuit exposed every now and then; second, the hypothetical Tor adversary has deep pockets and can easily operate reliable high-bandwidth nodes, which are disproportionately likely to get picked as guards. Concentration of guards in a small number of ASes only makes this easier for the adversary; concentration of guards together with exits in a small number of ASes makes it even easier. It’s tempting to suggest a complete about-face, preferentially choosing entry nodes from the low-bandwidth, short-lived population and using them only for a short time; this would also mean that entry nodes could be taken from a much broader pool of ASes, and it would be easier to avoid overlap with the AS-path from exit to destination.
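The arithmetic behind the original problem statement is worth spelling out. A minimal sketch under a deliberately crude model of my own (not taken from [7]): each entry selection independently lands on a malicious relay with probability `f`, and a guard, once chosen, is used for every circuit in the period.

```python
def p_any_malicious_entry(f, n_circuits):
    """Without guards: chance that at least one of n_circuits
    independently chosen entries is malicious."""
    return 1.0 - (1.0 - f) ** n_circuits

def p_guard_malicious(f, n_guards=1):
    """With guards: chance that at least one chosen guard is malicious,
    in which case circuits through it are exposed for the whole period."""
    return 1.0 - (1.0 - f) ** n_guards

f = 0.05  # illustrative value only
print(p_any_malicious_entry(f, 100))  # close to 1
print(p_guard_malicious(f, 1))        # equal to f, up to rounding
```

In this model the expected number of exposed circuits is the same either way (`f` times the number of circuits); what guards change is the distribution, trading near-certain occasional exposure for a small chance of total exposure. Which of those is worse is exactly the question I am raising.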

Anonymity on QuickSand: Using BGP to Compromise Tor

One of the oldest research threads regarding Tor is trying to figure out how close you could get in real life to the global passive adversary that’s known to be able to deanonymize all communications. This is a new entry in that line of research, from HotNets 2014.

At the largest scale, the global Internet is administratively divided into autonomous systems (ASes) that exchange traffic, using BGP for configuration. Any given AS can only communicate with a small number of direct peers, so a stream of packets will normally pass through many different ASes on the way to its destination. It’s well-known that AS-operated backbone routers are in an excellent position to mount traffic-correlation attacks on Tor, particularly if they collude [1] [2]. The key observation in this paper is that, by manipulating BGP, a malicious AS can observe traffic that wouldn’t naturally flow through it.

BGP is an old protocol, originally specified in 1989; like most of our older protocols, it assumes that all participants are cooperative and honest. Any backbone router can announce that it is now capable of forwarding packets to a prefix (set of IP addresses) and the rest of the network will believe it. Incidents where traffic is temporarily redirected to an AS that either can’t get it to the destination at all, or can only do so suboptimally, are commonplace, and people argue about how malicious these are. [3] [4] [5] Suppose an adversary can observe one end of a Tor circuit—perhaps they control the ISP for a Tor client. They also have some reason to suspect a particular destination for the traffic. They use BGP to hijack traffic to the suspected destination, passing it on so that the client doesn’t notice anything. They can now observe both ends of the circuit and confirm their suspicions. They might not get to see traffic in both directions, but the authors also demonstrate that a traffic-correlation attack works in principle even if you can only see the packet flow in one direction, thanks to TCP acknowledgments.
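To illustrate why a false announcement works, here is a toy model of one common form of hijack: announcing a more-specific prefix, which wins under longest-prefix-match forwarding. This is my simplification, not necessarily the technique used in the paper; real BGP route selection weighs much more than prefix length, and the prefixes and AS numbers below come from ranges reserved for documentation.

```python
import ipaddress

def origin_as(routes, address):
    """Pick the AS whose announced prefix most specifically covers address.

    routes: list of (ipaddress network, AS number) announcements.
    Returns None if no announced prefix covers the address.
    """
    addr = ipaddress.ip_address(address)
    matches = [(net, asn) for net, asn in routes if addr in net]
    if not matches:
        return None
    # Longest prefix wins, whoever announced it and however honestly.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

legit = (ipaddress.ip_network("203.0.113.0/24"), 64500)
hijack = (ipaddress.ip_network("203.0.113.0/25"), 64511)

print(origin_as([legit], "203.0.113.10"))          # legitimate AS
print(origin_as([legit, hijack], "203.0.113.10"))  # hijacker's AS
```

Nothing in the lookup checks whether AS 64511 has any right to the address space, which is the "everyone is honest" assumption at work. To stay unnoticed as described above, the hijacker would then forward the traffic on to its real destination.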

Making this worse, high-bandwidth, long-lived Tor relays (which are disproportionately likely to be used for either end of a circuit) are clustered in a small number of ASes worldwide. This means an adversary can do dragnet surveillance by hijacking all traffic to some of those ASes; depending on its own position in the network, this might not even appear abnormal. The adversary might even be one of those ASes, or an agency in a position to lean on its operators.

The countermeasures proposed in this paper are pretty weak; they would only operate on a timescale of hours to days, whereas a BGP hijack can happen, and stop happening, in a matter of minutes. I don’t see a good fix happening anywhere but in the routing protocol itself. Unfortunately, routing protocols that do not assume honest participants are still a topic of basic research. (I may get to some of those papers eventually.) There are proposals for adding a notion of “this AS is authorized to announce this prefix” to BGP [6], but those have all the usual problems with substituting “I trust this organization” for “I have verified that this data is accurate.”