Archives, page 3 | Readings in Information Security

All Your Biases Belong To Us: Breaking RC4 in WPA-TKIP and TLS

Reviewed 10 August 2015

Cryptanalysis, Random number generation, RC4, TKIP, TLS

USENIX Security Symposium 2015

RC4 is a very widely used stream cipher, developed in the late 1980s. As wikipedia puts it, RC4 is remarkable for its speed and simplicity in software, but it has weaknesses that argue against its use in new systems. Today’s paper demonstrates that it is even weaker than previously suspected.

To understand the paper you need to know how stream ciphers work. The core of a stream cipher is a random number generator. The encryption key is the starting seed for the random number generator, which produces a keystream—a sequence of uniformly random bytes. You then combine the keystream with your message using the mathematical XOR operation, and that’s your ciphertext. On the other end, the recipient knows the same key, so they can generate the same keystream, and they XOR the ciphertext with the keystream and get the message back.

If you XOR a ciphertext with the corresponding plaintext, you learn the keystream. This wouldn’t be a big deal normally, but, the basic problem with RC4 is that its keystream isn’t uniformly random. Some bytes of the keystream are more likely to take specific values than they ought to be. Some bytes are more likely to not take specific values. And some bytes are more likely to take the same value as another byte in the keystream. (The ABSAB bias, which is mentioned often in this paper, is an example of that last case: you have slightly elevated odds of encountering value A, value B, a middle sequence S, and then A and B again.) All of these are referred to as biases in the RC4 keystream.

To use RC4’s biases to decrypt a message, you need to get your attack target to send someone (not you) the same message many times, encrypted with many different keys. You then guess a keystream which exhibits as many of the biases as possible, and you XOR this keystream with all of the messages. Your guess won’t always be right, but it will be right slightly more often than chance, so the correct decryption will also appear slightly more often than chance. It helps that it doesn’t have to be exactly the same message. If there is a chunk that is always the same, and it always appears in the same position, that’s enough. It also helps if you already know some of the message; that lets you weed out bad guesses faster, and exploit more of the biases.

Asking the attack target to send the same message many times might seem ridiculous. But remember that many Internet protocols involve headers that are fixed or nearly so. The plaintext of a TKIP-encrypted message, for instance, will almost always be the WiFi encapsulation of an IPv4 packet. If you know the IP addresses of your target and of the remote host it’s communicating with, that means you know eight bytes of the message already, and they’ll always be the same and in the same position. The paper goes into some detail about how to get a Web browser to make lots and lots of HTTPS requests with a session cookie (always the same, but unknown—the secret you want to steal) in a predictable position, with known plaintext on either side of it.

All this was well-known already. What’s new in this paper is: first, some newly discovered biases in the RC4 keystream, and a statistical technique for finding even more; second, improved attacks on TKIP and TLS. Improved means that they’re easier to execute without anyone noticing, and that they take less time. Concretely, the best known cookie-stealing attack before this paper needed to see $13 \cdot 2^{30}$ HTTPS messages (that’s about 14 billion) and this paper cuts it to $9 \cdot 2^{27}$ messages (1.2 billion), which takes only 75 hours to run. That’s entering the realm of practical in real life, if only for a client computer that is left on all the time, with the web browser running.

It’s a truism in computer security (actually, security in general) that attacks only ever get better. What’s important, when looking at papers like this, is not so much the feasibility of any particular paper’s attack, but the trend. The first-known biases in RC4 keystream were discovered only a year after the algorithm was published, and since then there’s been a steady drumbeat of researchers finding more biases and more effective ways to exploit them. That means RC4 is no good, and everyone needs to stop using it before someone finds an attack that only takes a few minutes. Contrast the situation with AES, where there are no known biases, and fifteen years of people looking for some kind of attack has produced only a tiny improvement over brute force.

Advice to the general public: Does this affect you? Yes. What can you do about it? As always, make sure your Web browsers are fully patched up—current versions of Firefox, Chrome, and IE avoid using RC4, and upcoming versions will probably take it out altogether. Beyond that, the single most important thing for you to do is make sure everything you’ve got that communicates over WiFi—router, computers, tablets, smartphones, etc.—are all set to use WPA2 and CCMP only. (The configuration interface may refer to CCMP as AES-CCMP or just AES; in this context, those are three names for the same thing.) The alternatives WEP, WPA, and TKIP all unavoidably involve RC4 and are known to be broken to some extent. Any WiFi-capable widget manufactured after 2006 has no excuse for not supporting WPA2 with CCMP. It should definitely be possible to set your home router up this way; unfortunately, many client devices don’t expose the necessary knobs.

Detecting Semantic Cloaking on the Web

Reviewed 7 August 2015

Machine learning, Spam, Web

Baoning Wu, Brian D. Davison

International World Wide Web Conference 2006

Now for something a little different: today’s paper is about detecting search engine spam. Specifically, it’s about detecting when a Web site presents different content to a search engine’s crawler than it does to human visitors. As the article points out, this can happen for benign or even virtuous reasons: a college’s front page might rotate through a selection of faculty profiles, or a site might strip out advertising and other material that is only relevant to human visitors when it knows it’s talking to a crawler. However, it also happens when someone wants to fool the search engine into ranking their site highly for searches where they don’t actually have relevant material.

To detect such misbehavior, obviously one should record each webpage as presented to the crawler, and then again as presented to a human visitor, and compare them. The paper is about two of the technical challenges which arise when you try to execute this plan. (They do not claim to have solved all of the technical challenges in this area.) The first of these is, of course, how do you program a computer to tell when a detected difference is spam, versus when it is benign? and here they have done something straightforward: supervised machine classification. You could read this paper just as a case study in semi-automated feature selection for a machine classifier, and you would learn something. (Feature selection is somewhat of a black art—features that appear to be highly predictive may be accidents of your sample, and features that logically should be predictive might not work out at all. In this case, the positive features they list seem plausibly motivated, but several of the negative features (features which are anticorrelated with spamming) seem likely to be accidental. I would have liked to see more analysis of why each feature is predictive.)

The second technical challenge is less obvious: sites are updated frequently. You don’t want to mistake an update for any kind of variation between the crawl result and the human-visitor result. But it’s not practical to visit a site simultaneously as the crawler and as the human, just because of how many sites a crawl has to touch (and if you did, the spammers might be able to figure out that your human visit was an audit). Instead, you could visit the site repeatedly as each and see if the changes match, but this is expensive. The paper proposes to weed out sites that don’t change at all between the crawler visit and the human visit, and do the more expensive check only to the sites that do. A refinement is to use a heuristic to pick out changes that are more likely to be spam: presence of additional keywords or links in the crawler version, relative to the human version. In their tests, this cuts the number of sites that have to be investigated in detail by a factor of 10 (and could do better by refining the heuristic further). These kinds of manual filter heuristics are immensely valuable in classification problems when one of the categories (no cloaking) is much larger than the other(s), both because it reduces the cost of running the classifier (and, in this case, the cost of data collection), and because machine-learning classifiers often do better when the categories all have roughly the same number of examples.

This paper shouldn’t be taken as the last word in this area: it’s ten years old, its data set is respectable for an experiment but tiny compared to the global ’net, and false positive and negative rates of 7% and 15% (respectively) are much too large for production use. The false positive paradox is your nemesis when you are trying to weed spammers out of an index of 10⁹ websites. We know from what little they’ve said about it in public (e.g. [1] [2]) that Google does something much more sophisticated. But it is still valuable as a starting point if you want to learn how to do this kind of research yourself.

Automated Detection and Fingerprinting of Censorship Block Pages

Reviewed 5 August 2015

Censorship, Machine learning, Web

Ben Jones, Tzu-Wen Lee, Nick Feamster, Phillipa Gill

Internet Measurement Conference 2014

This short paper, from IMC last year, presents a re-analysis of data collected by the OpenNet Initiative on overt censorship of the Web by a wide variety of countries. Overt means that when a webpage is censored, the user sees an error message which unambiguously informs them that it’s censored. (A censor can also act deniably, giving the user no proof that censorship is going on—the webpage just appears to be broken.) The goal of this reanalysis is to identify block pages (the error messages) automatically, distinguish them from normal pages, and distinguish them from each other—a new, unfamiliar format of block page may indicate a new piece of software is in use to do the censoring.

The chief finding is that block pages can be reliably distinguished from normal pages just by looking at their length: block pages are typically much shorter than normal. This is to be expected, seeing that they are just an error message. What’s interesting, though, is that this technique works better than techniques that look in more detail at the contents of the page. I’d have liked to see some discussion of what kinds of misidentification appear for each technique, but there probably wasn’t room for that. Length is not an effective tactic for distinguishing block pages from each other, but term frequency is (they don’t go into much detail about that).

One thing that’s really not clear is how they distinguish block pages from ordinary HTTP error pages. They mention that ordinary errors introduce significant noise in term-frequency clustering, but they don’t explain how they weeded them out. It might have been done manually; if so, that’s a major hole in the overall automated-ness of this process.

20,000 In League Under The Sea: Anonymous Communication, Trust, MLATs, and Undersea Cables

Reviewed 3 August 2015

Anonymity, Autonomous systems, Belief networks, BGP, Routing, Theory, Threat modeling, Tor

Aaron D. Jaggard, Aaron Johnson, Sarah Cortes, Paul Syverson, Joan Feigenbaum

Privacy Enhancing Technologies 2015

Today’s paper takes another shot at modeling how the physical topology of the Internet affects the security of Tor against passive adversaries with the ability to snoop on a lot of traffic. It’s by some of the same people who wrote Defending Tor from Network Adversaries and is quite closely related.

Most of the work of this paper goes into building a flexible, formal threat model, which Tor client software could (in principle) use to inform its routing decisions. Acknowledging that there’s always going to be a good deal of uncertainty about what adversaries are out there and what they are capable of, they make two key design decisions. The model is probabilistic (based on a Bayesian belief network), and it takes user input. For instance, if you have reason to think the government of Transbelvia has it in for you, you can instruct Tor to avoid paths that Transbelvia might be able to snoop on, and the model will expand that out to all the ways they might do that. Conversely, if you trust a particular organization you might like to preferentially use its guards or exit nodes, and it can do that too.

The model is very thorough about different ways a government might be able to snoop on network traffic—not just relays physically hosted in the country, but ASes and IXPs (Transbelvia hosts a major IXP for Eastern Europe), submarine cable landing sites (not relevant for a landlocked country), mutual legal assistance treaties (MLATs) which might be used to have another country do some snooping on Transbelvia’s behalf, and even hacking into and subverting routers at interesting points in the connectivity graph. (The pun in the title refers to their analysis of how MLATs could allow several of the usual suspects to snoop on 90+% of all submarine cable traffic, even though they host hardly any cable landings themselves.) Equally important, it can be expanded at need when new techniques for spying are revealed.

I think something like this is going to be an essential building block if we want to add any spy-aware routing algorithm to Tor, but I have two serious reservations. First, simplest, but less important, right now all Tor clients make routing decisions more-or-less the same way (there have been small changes to the algorithm over time, but everyone is strongly encouraged to stay close to the latest client release anyway, just because of bugs). If clients don’t all make routing decisions the same way, then that by itself might be usable to fingerprint them, and thus cut down the number of people who might’ve taken some action, from all Tor users to all Tor users who make routing decisions like THIS. If highly personalized threat models are allowed, the latter group might be just one person.

Second, and rather more serious, the user-input aspect of this system is going to require major user experience research and design to have any hope of not being worse than the problem it’s trying to solve. It’s not just a matter of putting a friendly face on the belief language (although that does need to happen)—the system will need to educate its users in the meaning of what it is telling them, and it will need to walk them through the consequences of their choices. And it might need to provide nudges if there’s a good reason to think the user’s assessment of their threat model is flat-out wrong (even just making that judgement automatically is fraught with peril—but so is not making that judgement).

Scandinista! Analyzing TLS Handshake Scans and HTTPS Browsing by Website Category

Reviewed 31 July 2015

HTTPS, Large sample

Andrew Hills

Workshop on Surveillance & Technology 2015

Today’s paper is a pilot study, looking into differences in adoption rate of HTTPS between various broad categories of websites (as defined by Alexa). They looked at raw availabilty of TLS service on port 443, and they also checked whether an HTTP request for the front page of each Alexa domain would redirect to HTTPS or vice versa. This test was conducted only once, and supplemented with historical data from the University of Michigan’s HTTPS Ecosystem project.

As you would expect, there is a significant difference in the current level of HTTPS availability from one category to another. They only show this information for a few categories, but the pattern is not terribly surprising: shopping 82%, business 70%, advertisers 45%, adult 36%, news 30%, arts 26%. (They say The relatively low score for Adult sites is surprising given that the industry has a large amount of paid content, but I suspect this is explained by that industry’s habit of outsourcing payment processing, plus the ubiquitous (not just in the adult category) misapprehension that only payment processing is worth bothering to encrypt.) It is also not surprising to find that more popular sites are more likely to support HTTPS. And the enormous number of sites that redirect their front page away from HTTPS is depressing, but again, not surprising.

What’s more interesting to me is the trendlines, which show a steady, slow, category-independent, linear uptake rate. There’s a little bit of a bump in adult and news around 2013 but I suspect it’s just noise. (The response growth over time figure (number 2), which appears to show a category dependence, is improperly normalized and therefore misleading. You want to look only at figure 1.) The paper looks for a correlation with the Snowden revelations; I’d personally expect that the dominant causal factor here is the difficulty of setting up TLS, and I’d like to see them look for correlations with major changes in that: for instance, Cloudflare’s offering no-extra-cost HTTPS support [1], Mozilla publishing a server configuration guide [2], or the upcoming Let’s Encrypt no-hassle CA [3]. It might also be interesting to look at uptake rate as a function of ranking, rather than category; it seems like the big names are flocking to HTTPS lately, it would be nice to know for sure.

The study has a number of methodological problems, which is OK for a pilot, but they need to get fixed before drawing serious conclusions. I already mentioned the normalization problem in figure 2: I think they took percentages of percentages, which doesn’t make sense. The right thing would’ve been to just subtract the initial level seen in figure 1 from each line, which (eyeballing figure 1) would have demonstrated an increase of about 5% in each category over the time period shown, with no evidence for a difference between categories. But before we even get that far, there’s a question of the difference between an IP address (the unit of the UMich scans), a website (the unit of TLS certificates), and a domain (the unit of Alexa ranking). To take some obvious examples: There are hundreds, if not thousands, of IP addresses that will all answer to the name of www.google.com. Conversely, Server Name Indication permits one IP address to answer for dozens or even hundreds of encrypted websites, and that practice is even more common over unencrypted HTTP. And hovering around #150 in the Alexa rankings is amazonaws.com, which is the backing store for at least tens of thousands of different websites, each of which has its own subdomain and may or may not have configured TLS. The correct primary data sources for this experiment are not Alexa and IPv4 scanning, but DNS scanning and certificate transparency logs. (A major search engine’s crawl logs would also be useful, if you could get your hands on them.) Finally, they should pick one set of 10-20 mutually exclusive, exhaustive categories (one of which would have to be Other) and consistently use them throughout the paper.

Performance and Security Improvements for Tor: A Survey

Reviewed 29 July 2015

Anonymity, Privacy, Protocol design, Survey paper, Tor, Traffic analysis

Mashael AlSabah, Ian Goldberg

2015

This week’s non-PETS paper is a broad survey of research into improving either the security, or the performance, or both, of low-latency anonymity networks such as Tor. Nearly all of the research used Tor itself as a testbed, and the presentation here assumes Tor, but most of the work could be generalized to other designs.

There’s been a lot of work on this sort of thing in the eleven years since Tor was first introduced, and this paper does a generally good job of categorizing it, laying out lines of research, indicating which proposals have been integrated into Tor and which haven’t, etc. (I particularly liked the mindmap diagram near the beginning, and the discussion near the end of which problems still need to get solved.) One notable exception is the section on improved cryptography, where you need to have a solid cryptography background to get any idea of what the proposals are, let alone whether they worked. There are also a couple of places where connections to the larger literature of network protocol engineering would have been helpful: for instance, there’s not a single mention of bufferbloat, even though that is clearly an aspect of the congestion problems that one line of research aims to solve. And because it’s not mentioned, it’s not clear whether the researchers doing that work knew about it.

Tor is a difficult case in protocol design because its security goals are—as acknowledged in the original paper describing its design [1]—directly in conflict with its performance goals. Improvements in end-to-end latency, for instance, may make a traffic correlation attack easier. Improvements in queueing fairness or traffic prioritization may introduce inter-circuit crosstalk enabling an attacker to learn something about the traffic passing through a relay. Preferring to use high-bandwidth relays improves efficiency but reduces the number of possible paths that traffic can take. And so on. It is striking, reading through this survey, to see how often an apparently good idea for performance was discovered to have unacceptable consequences for anonymity.

Automated Experiments on Ad Privacy Settings

Reviewed 27 July 2015

Advertising, Machine learning, Privacy

Amit Datta, Michael Carl Tschantz, Anupam Datta

Privacy Enhancing Technologies 2015

You’ve probably noticed the creepy effect where you consider buying something online, or maybe just look at a page for something that happens to be for sale, and for weeks afterward you get ads on totally unrelated websites for that thing or similar things. The more reputable online ad brokerages offer a degree of control over this effect (e.g.: Google, Microsoft). This study investigates exactly what effect those settings have on the ads observed by automated browsing agents. The basic idea is to set some of the knobs, visit websites that will tell the ad provider something about the simulated customer’s preferences, possibly adjust the knobs again, and finally record what is being advertised on a general-interest website.

A great deal of the paper is devoted to statistical methodology. Because the ad provider is a stateful black box, and one whose behavior may depend on uncontrollable external factors (e.g. that advertiser has exhausted their budget for the month), it’s vital to avoid as many statistical assumptions as possible. They use permutation tests and supervised classification (logistic regression), both of which make minimal assumptions. They’re also very careful about drawing conclusions from their results. I’m not much of a statistician, but it all sounds carefully thought out and plausible, with one exception: heavy reliance on significance testing, which has come in for serious criticism [1] to the point where some journals no longer accept its use at all [2]. This is exactly the sort of research where p-values can mislead; if I were reviewing this prior to publication I would have encouraged the authors to think about how they could present the same conclusions without using significance testing.

Now, the actual conclusions. Only Google’s ads were tested. (Expanding the tests to additional ad brokers is listed as future work.) They confirm that turning a particular topic (dating) off in the preferences does cause those ads to go away. They observe that two highly sensitive topics (substance abuse, disability) that do trigger targeted ads are not controllable via the preferences; in fact, they are completely invisible on that screen. And the most interesting case is when they set the ad preferences to explicitly reveal a gender (man or woman) then browsed a bunch of sites related to job searching. Simulated customers who claimed to be men got ads for a career coaching service which promised better odds of being hired into $200K+ executive positions; those who claimed to be women did not see these ads.

This last example clearly reflects the well-known glass ceiling which affects business in general, but (as the authors point out) it’s impossible to tell, from outside the black box, why it shows up in this case. The career-coaching service could have chosen to advertise only to men. Google’s ad-targeting algorithm could have been coded with (likely unconscious) bias in its category structure, or—this is the most probable explanation—its feedback mechanism could have detected that men are more likely to click on ads with those keywords, so it makes that even more likely by showing them preferentially to men. There’s a telling comment at the very end of the paper:

… We consider it more likely that Google has lost control over its massive, automated advertising system. Even without advertisers placing inappropriate bids, large-scale machine learning can behave in unexpected ways.

There’s a lesson here for all the big data companies: the best an unbiased machine learning system can hope to do is produce an accurate reflection of the training set—including whatever biases are in there. If you want to avoid reduplicating all the systemic biases of the offline world, you will have to write code to that effect.

Tor’s Usability for Censorship Circumvention

Reviewed 24 July 2015

Circumvention, Tor, Usable security

David Fifield, Linda N. Lee, Serge Egelman, David Wagner

Hot Topics in Privacy Enhancing Technologies 2015

This is a report on a pilot usability study. The authors ran five journalists (there aren’t any more details than that) through the process of installing, activating, and using the Tor Browser for a small number of canned tasks, identifying a number of problems:

… people did have difficulty with installing Tor Browser (principally because of the Gatekeeper code-signing feature on OS X), did not understand what many of the many options meant, and were confused about why certain things were happening.

They are going to do a much larger study, and were soliciting feedback on experimental design. I have only two things to say. First, the proposal is to do a large test of 200 users and then, presumably, start making changes to the software to improve usability. The problem with this is, it is very likely that subtle (yet serious) UX issues are being masked out by the more blatant ones: no matter how many people you experiment on, you won’t detect the subtle problems until the blatant ones are fixed. Therefore, it would be far more valuable to do a series of smaller user studies, improving the software based on the results of each study before doing the next one. This strategy also ensures that the research results do get incorporated into the product, rather than being lost in the shuffle once the paper is published.

The other point is more of a hypothesis about what would be good to aim for. To use Tor in a way that genuinely improves your security outcomes, you need to understand what it is doing and why, and to do that you have to wrap your head around some concepts that may be unfamiliar—especially if you haven’t previously needed to understand the Internet itself in any kind of detail. (For instance, the fact that every IP packet is labeled with its source and destination is obvious once you think about it, but I never thought about it to a lot of people.) There probably needs to be a training manual, and this manual needs to take the attitude that yeah, this is a little tricky, and you have to think about it some, but don’t panic, you can understand it. Shoot for the we understand tone said to characterize Rust compiler errors (warning: Reddit). The place I’ve seen this done best, personally, was the tutorial and concepts guide for GnuCash, which took just this tone with regard to double-entry bookkeeping—also somewhat notorious for its inscrutability. (Note: I read this a long time ago, and I don’t know whether its current edition is still like that.)

The Harmful Consequences of Postel’s Maxim

Reviewed 22 July 2015

Langsec, Philosophy of security, Protocol design

Martin Thomson

2015

Postel’s Maxim of protocol design (also known as the Robustness Principle or the Internet Engineering Principle) is Be liberal in what you accept, conservative in what you send. It was first stated as such (by Jon Postel) in the 1979 and 1980 specifications (e.g. RFC 760) of the protocol that we now call IPv4. [1] It’s been tremendously influential, for instance quoted as an axiom in Tim Berners-Lee’s design principles for the Web [2] but has also come in for a fair bit of criticism [3] [4]. (An expanded version of the principle, in RFC 1122, anticipates many of these criticisms and is well worth reading if you haven’t.) Now we have an Internet-Draft arguing that it is fatally flawed:

… there are negative long-term consequences to interoperability if an implementation applies Postel’s advice. Correcting the problems caused by divergent behavior in implementations can be difficult or impossible.

and arguing that instead

Protocol designs and implementations should be maximally strict.

Generating fatal errors for what would otherwise be a minor or recoverable error is preferred, especially if there is any risk that the error represents an implementation flaw. A fatal error provides excellent motivation for addressing problems.

The primary function of a specification is to proscribe behavior in the interest of interoperability.

This is the first iteration of an Internet-Draft, so it’s not intended to be done, so rather than express an opinion as such, I want to put forward some examples of real-world situations from the last couple decades of Internet protocol design that the author may or may not have considered, and ask how he feels they should be / have been handled. I also invite readers to suggest further examples where strictness, security, upward compatibility, incremental deployment, ergonomics, and so on may be in tension.

The original IP and TCP (v4) specifications left a number of bits reserved in their respective packet headers. In 2001 the ECN specification gave meaning to some of those bits. It was almost immediately discovered that many intermediate routers would silently discard packets with the ECN bits set; in consequence, fourteen years later ECN is still quite rarely used, even though there are far fewer such routers than there were in 2001. [5] [6]
Despite the inclusion of a protocol version number in SSL/TLS, and a clear specification of how servers were supposed to react to clients offering a newer protocol than they supported, servers that drop connections from too-new clients are historically quite common, so until quite recently Web browsers would retry such connections with an older protocol version. This enables a man-in-the-middle to force negotiation of an old, possibly insecure version, even if both sides support something better. [7] [8] [9] Similar to the ECN situation, this problem was originally noticed in 2001 and continues to be an issue in 2015.
Cryptographic protocols (such as TLS) can be subverted—and I mean complete breach of confidentiality subverted—if they reveal why a message failed to decrypt, or how long it took to decrypt / fail to decrypt a message, to an attacker that can forge messages. [10] [11] To close these holes it may be necessary to run every message through the complete decryption process even if you already know it’s going to fail.
In the interest of permitting future extensions, HTML5 [12] and CSS [13] take pains to specify exact error recovery behavior; the idea is that older software will predictably ignore stuff it doesn’t understand, so that authors can be sure of how their websites will look in browsers that both do and don’t implement each shiny new feature. However, this means you can predict how the CSS parser will parse HTML (and vice versa). And in conjunction with the general unreliability of MIME types it means you used to be able to exploit that to extract information from a document you shouldn’t be able to read. [14] (Browsers fixed this by becoming pickier about MIME types.)

An Automated Approach for Complementing Ad Blockers’ Blacklists

Reviewed 20 July 2015

Machine learning, Privacy, Web

David Gugelmann, Markus Happe, Bernhard Ager, Vincent Lenders

Privacy Enhancing Technologies 2015

Last week’s long PETS paper was very abstract; this week’s paper is very concrete. The authors are concerned that manually-curated blacklists, as currently used by most ad-blocking software, cannot hope to keep up with the churn in the online ad industry. (I saw a very similar talk at WPES back in 2012 [1] which quoted the statistic that the default AdBlock Plus filter list contains 18,000 unique URLS, with new ones added at a rate of five to 15 every week.) They propose to train a machine classifier on network-level characteristics that differ between ad services and normal web sites, to automate detection of new ad providers and/or third-party privacy-invasive analytics services. (This is the key difference from the paper at WPES 2012: that project used static analysis of JavaScript delivered by third-party services to extract features for their classifier.)

They find that a set of five features provides a reasonably effective classification: proportion of requests that are third-party (for transclusion into another website), number of unique referrers, ratio of received to sent bytes, and proportion of requests including cookies. For the training set they used, they achieve 83% precision and 85% recall, which is reasonable for a system that will be used to identify candidates for manual inspection and addition to blacklists.

There are several methodological bits in here which I liked. They use entropy-based discretization and information gain to identify valuable features and discard unhelpful ones. They compare a classifier trained on manually-labeled data (from a large HTTP traffic trace) with a classifier trained on the default AdBlock Plus filter list; both find similar features, but the ABP filter list has better coverage of infrequently used ads or analytics services, whereas the manually labeled training set catches a bunch of common ads and analytics services that ABP missed.

One fairly significant gap is that the training set is limited to cleartext HTTP. There’s a strong trend nowadays toward HTTPS for everything, including ads, but established ad providers are finding it difficult to cut all their services over efficiently, which might provide an opportunity for new providers—and thus cause a shift toward providers that have been missed by the blacklists.

There’s almost no discussion of false positives. Toward the end there is a brief mention of third-party services like Gravatar and Flattr, that share a usage pattern with ads/analytics and showed up as false positives. But it’s possible to enumerate common types of third-party services (other than ads and analytics) a priori: outsourced commenting (Disqus, hypothes.is), social media share buttons (Facebook, Twitter), shared hosting of resources (jQuery, Google Fonts), static-content CDNs, etc. Probably, most of these are weeded out by the ratio of received to sent bytes check, but I would still have liked to see an explicit check of at least a few of these.

And finally, nobody seems to have bothered talking to the people who actually maintain the ABP filter lists to find out how they do it. (I suspect it relies strongly on manual, informal reporting to a forum or something.) If this is to turn into anything more than an experiment, people need to be thinking about integration and operationalization.