Papers tagged ‘Web’

Improving Cloaking Detection Using Search Query Popularity and Monetizability

Reviewed 9 October 2015

Adversarial optimization, Economics, Machine learning, Spam, Web

Kumar Chellapilla, David Maxwell Chickering

Adversarial Information Retrieval on the Web 2006

Here’s another paper about detecting search engine spam, contemporary with Detecting Semantic Cloaking on the Web and, in fact, presented as a refinement to that paper. The authors don’t make any changes to the “is this page spamming the search engine?” algorithm itself; rather, they optimize the scan for pages that are spamming the search engine by looking at the spammers’ economic incentives.

It obviously does no good to make your site a prominent search hit for keywords that nobody ever searches for. Less obviously, since search engine spammers are trying to make money by sucking people into linkfarms full of ads, they should be focusing on keywords that lots of advertisers are interested in (monetizability in the paper’s jargon). Therefore, the search engine operator should focus its spam-detection efforts on the most-searched and most-advertised keywords. Once you identify a linkfarm, you can weed it out of all your keyword indexes, so this improves search results in the long tail as well as the popular case.
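As a rough illustration of the prioritization described above, here is a minimal Python sketch that picks the queries to scan first by taking the union of the most-searched and most-advertised keywords. The function name, the two dictionaries of counts, and the cutoff are all hypothetical; the paper does not publish code, and its actual sampling procedure may differ.

```python
from typing import Dict, List


def queries_to_scan(search_counts: Dict[str, int],
                    ad_bids: Dict[str, int],
                    top_n: int) -> List[str]:
    """Return the union of the top_n most popular and top_n most monetizable queries.

    Both inputs are hypothetical: search_counts maps a query to how often it
    was searched, ad_bids maps a query to how many advertisers bid on it.
    """
    most_popular = sorted(search_counts, key=search_counts.get, reverse=True)[:top_n]
    most_monetizable = sorted(ad_bids, key=ad_bids.get, reverse=True)[:top_n]

    # Keep first-seen order and drop duplicates. Monetizable queries go first
    # purely as an illustrative choice of scan order.
    seen = set()
    ordered = []
    for query in most_monetizable + most_popular:
        if query not in seen:
            seen.add(query)
            ordered.append(query)
    return ordered
```

Any page flagged while scanning these queries can then be removed from every keyword index, which is where the long-tail benefit comes from.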

They performed a quick verification of this hypothesis with the help of logs of the 5000 most popular and 5000 most advertised search keywords, from the MSN search engine (now Bing). Indeed, the spam was strongly skewed toward both high ends—somewhat more toward monetizability than popularity.

That’s really all there is to say about this paper. They had a good hypothesis, they tested it, and the hypothesis was confirmed. I’ll add that this is an additional reason (besides just making money) for search engines to run their own ad brokerages, and a great example of the value of applying economic reasoning to security research.

k-Indistinguishable Traffic Padding in Web Applications

Reviewed 7 October 2015

Adversarial optimization, Differential privacy, Information leakage, Side channels, Web

Wen Ming Liu, Lingyu Wang, Kui Ren, Pengsu Cheng, Mourad Debbabi

Privacy Enhancing Technologies 2012

A Web application is a piece of software whose execution is split between the client and the server, and it sends data between the two sides every time you do something significant, perhaps even on every single keystroke. Inevitably, the volume of data has some correlation with the action performed. Since most Web applications are offered to the general public at low or no cost, an adversary can map out much of an application’s internal state machine and develop a model that lets them deduce what someone’s doing from an observed sequence of packets. [1] [2] This has been successfully applied (in the lab) to Web applications for highly sensitive tasks, such as tax preparation and medical advice.

The only known solution to this problem is to pad the amount of data passed over the network in each direction so that the adversary cannot learn anything about the state of the web application from an observed sequence of packets. Unfortunately, the obvious way to do this involves unacceptably high levels of overhead—this paper quotes [1] as observing 21074% overhead for a well-known online tax system. Today’s paper proposes that techniques from the field of privacy-preserving data publishing (PPDP, [3]) be brought to bear on the problem.

The paper is mathematically heavy going and I’m not going to go into details of their exact proposal, because it’s really just a starting point. Instead, I’m going to pull out some interesting observations:

More padding is not always better. They show an example case where padding messages to a multiple of 128 bytes is exactly as good as padding them to a multiple of 512 bytes, and padding them to a multiple of 520 bytes is worse.
Computing the optimal amount of padding for a web application (as they define it) is NP-hard even with complete, ideal knowledge of its behavior. However, there are reasonably efficient and speedy approximations—if I understood them correctly, it might even be feasible to apply one of their algorithms on the fly with each client.
Because optimality is defined over an entire sequence of user interactions with an application, knowing how much padding to send in response to any given request may require the server to predict the future. Again, however, there are approximations that avoid this problem.
PPDP offers a whole bunch of different ways to model the degree to which one has mitigated the information leak: k-anonymity is the model they spend the most time on, but there’s also l-diversity, (α, k)-anonymity, t-closeness, etc. etc. I don’t follow that literature but I have the impression people are still actively making up new models and haven’t spent a whole lot of time on figuring out which is best in which situations.
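To make the first observation concrete, here is a small Python sketch of ceiling padding and the resulting k-anonymity level. It is a simplification under stated assumptions: each user action produces exactly one message of a fixed size, the four sizes are invented for illustration, and the paper’s own model (which considers whole sequences of interactions) is considerably richer.

```python
from collections import Counter
from typing import Iterable, List


def pad_to_multiple(size: int, granularity: int) -> int:
    """Round size up to the next multiple of granularity (ceiling padding)."""
    return -(-size // granularity) * granularity


def anonymity_level(sizes: Iterable[int], granularity: int) -> int:
    """Smallest number of actions that share one padded size (the k in k-anonymity)."""
    padded = Counter(pad_to_multiple(s, granularity) for s in sizes)
    return min(padded.values())


def total_overhead(sizes: List[int], granularity: int) -> int:
    """Total number of padding bytes added across all actions."""
    return sum(pad_to_multiple(s, granularity) - s for s in sizes)


# Hypothetical message sizes in bytes, one per distinct user action.
EXAMPLE_SIZES = [100, 120, 515, 600]
```

With these invented sizes, padding to multiples of 128 and of 512 both leave every action indistinguishable from one other (k = 2), but the 512-byte version adds far more bytes. Padding to multiples of 520 isolates the 600-byte action in a group of its own (k = 1), so the coarser granularity is strictly worse here.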

The great failing of this and all the other papers I’ve ever read in this area is operationalization. This paper’s algorithms need a complete, or nearly so, map of all possible sequences of states that a Web application can enter, and the authors shrug off the question of how you create this map in the first place. (I very strongly suspect that task can’t be mechanized in the general case—it will turn out to be the Halting Problem in another of its many disguises.) They also don’t seem to realize that things like Google search suggestions (the running example in the paper) are constantly changing. (Of course, that makes life harder for the adversary too.) This doesn’t mean there is no hope; it only means you need more approximations. I think it ought to be possible to build good-enough automatic padding into CMSes, and maybe even into web application frameworks. Trying that should be a priority for future research.

Inferring Mechanics of Web Censorship Around the World

Reviewed 5 October 2015

Censorship, Methodology, Web

John-Paul Verkamp, Minaxi Gupta

Free and Open Communications on the Internet 2012

I’ve talked a bunch about papers that investigate what is being censored online in various countries, but you might also want to know how it’s done. There are only a few ways it could be done, and this paper does a good job of laying them out:

By DNS name: intercept DNS queries either at the router or the local DNS relay, and return either “no such host” or the address of a server that will hand out errors for everything.
By IP address: in an intermediate router, discard packets intended for particular servers, and/or respond with TCP RST packets (which make the client disconnect) or forged responses. (In principle, an intermediate router could pretend to be the remote host for an entire TCP session, but it doesn’t seem that anyone does.)
By data transferred in cleartext: again in an intermediate router, allow the initial connection to go through, but if blacklisted keywords are detected then forge a TCP RST.
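The three options above suggest a simple decision procedure for a single probe. The Python sketch below is only an illustration: the `ProbeResult` fields and the `guess_mechanism` function are hypothetical names, not the paper’s tooling, and a real measurement would need to allow for CDNs that legitimately return different addresses in different countries, and for sites that are simply down.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Hypothetical summary of one in-country probe of a single URL."""
    dns_answer: Optional[str]    # address returned in-country, None if lookup failed
    expected_ip: Optional[str]   # address seen from an uncensored vantage point
    tcp_connected: bool          # did the TCP handshake complete?
    reset_after_request: bool    # did a TCP RST arrive after the request was sent?


def guess_mechanism(probe: ProbeResult) -> str:
    """Map one probe to the most likely blocking mechanism (a crude heuristic)."""
    # A missing or unexpected DNS answer points at DNS-level interference.
    if probe.dns_answer is None or probe.dns_answer != probe.expected_ip:
        return "dns"
    # Correct address, but no connection: packets to that IP are being dropped or reset.
    if not probe.tcp_connected:
        return "ip"
    # The connection opened, then was reset once the request content was visible.
    if probe.reset_after_request:
        return "keyword"
    return "none observed"
```

The order of the checks matters: each later check is only meaningful if the earlier stage of the connection succeeded.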

There are a few elaborations and variations, but those are the basic options if you are implementing censorship in the backbone of the network. The paper demonstrates that all are used. Censorship could also, of course, be done at either endpoint, but that is much less common (though not unheard of) and the authors of this paper ruled it out of scope. It’s important to understand that the usual modes of encryption used on the ’net today (e.g. HTTPS) do not conceal either the DNS name or the IP address of the remote host, but do conceal the remainder of an HTTP request. Pages of an HTTPS-only website cannot be censored individually, but the entire site can be censored by its DNS name or server IP address. This is why GitHub was being DDoSed a few months ago to try to get them to delete repositories being used to host circumvention tools [1]: Chinese censors cannot afford to block the entire site, as it is too valuable to their software industry, but they have no way to block access to the specific repositories they don’t like.

Now, if you want to find out which of these scenarios is being carried out by any given censorious country, you need to do detailed network traffic logging, because at the application level, several of them are indistinguishable from the site being down or the network being unreliable. This also means that the censor could choose to be stealthy: if Internet users in a particular country expect to see an explicit message when they try to load a blocked page, they might assume that a page that always times out is just broken. [2] The research contribution of this paper is in demonstrating how to tell the scenarios apart, through a combination of packet logging and carefully tailored probes from hosts in-country. They could have explained themselves a bit better: I’m not sure they bothered to try to discriminate “packets are being dropped at the border router” from “packets are being dropped by a misconfigured firewall on the site itself,” for instance. Also, I’m not sure whether it’s worth going to the trouble of packet logging, frankly. You should be able to get the same high-level information by comparing the results you get from country A with those you get from country B.

Another common headache in this context is knowing whether the results you got from your measurement host truly reflect what a normal Internet user in the country would see. After all, you are probably using a commercial data center or academic network that may be under fewer restrictions. This problem is one of the major rationales for Encore, which I discussed a couple weeks ago [3]. This paper nods at that problem but doesn’t really dig into it. To be fair, they did use personal contacts to make some of their measurements, so those may have involved residential ISPs, but they are (understandably) vague about the details.

I Know What You’re Buying: Privacy Breaches on eBay

Reviewed 28 September 2015

Anonymity, Information leakage, Privacy, User tracking, Web

Tehila Minkus, Keith W. Ross

Privacy Enhancing Technologies 2014

eBay means to keep anyone else from figuring out what you’re in the habit of buying on the site. Because of that, lots of people consider eBay the obvious place to buy things they’d rather their neighbors not know about (a survey in this very paper confirms this). However, this paper demonstrates that a determined adversary can figure out what you bought.

(Caveat: This paper is from 2014. I do not know whether eBay has made any changes since it was published.)

eBay encourages both buyers and sellers to leave feedback on each other, the idea being to encourage fair dealing by attaching a persistent reputation to everyone. Feedback is associated with specific transactions, and anyone (whether logged into eBay or not) can see each user’s complete feedback history. Items sold are visible, items bought are not, and buyers’ identities are obscured. The catch is, you can match up buyer feedback with seller feedback by the timestamps, using obscured buyer identities as a disambiguator, and thus learn what was bought. It involves crawling a lot of user pages, but it’s possible to do this in a couple days without coming to eBay’s notice.
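The linkage described above is essentially a join on seller and timestamp, with the obscured buyer identity used to break ties. The following Python sketch shows the idea on made-up records. The record layouts, the function names, and in particular the obscuring rule (first and last character of the handle around `***`) are assumptions for illustration only, not eBay’s actual page format or scheme.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SellerRecord:
    """Feedback as shown on a seller's page: item visible, buyer obscured."""
    seller: str
    item: str
    obscured_buyer: str
    timestamp: str


@dataclass(frozen=True)
class BuyerRecord:
    """Feedback as shown on a buyer's page: seller visible, item hidden."""
    buyer: str
    seller: str
    timestamp: str


def could_be(handle: str, obscured: str) -> bool:
    """Hypothetical obscuring rule: keep first and last characters, e.g. 'a***e'."""
    return len(handle) >= 2 and obscured == f"{handle[0]}***{handle[-1]}"


def link_purchases(buyer_records: List[BuyerRecord],
                   seller_records: List[SellerRecord]) -> List[Tuple[str, str]]:
    """Return (buyer handle, item) pairs recovered by matching the two views."""
    by_seller_and_time: Dict[Tuple[str, str], List[SellerRecord]] = {}
    for rec in seller_records:
        by_seller_and_time.setdefault((rec.seller, rec.timestamp), []).append(rec)

    linked = []
    for b in buyer_records:
        for s in by_seller_and_time.get((b.seller, b.timestamp), []):
            # The obscured identity disambiguates records with equal timestamps.
            if could_be(b.buyer, s.obscured_buyer):
                linked.append((b.buyer, s.item))
    return linked
```

The expensive part in practice is the crawl that collects the two sets of records; the join itself is cheap.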

They demonstrate that this is a serious problem by identifying users who purchased gun holsters (eBay does not permit the sale of actual firearms), pregnancy tests, and HIV tests. As an additional fillip they show that people commonly use the same handle on eBay as Facebook and therefore purchase histories can be correlated with all the personal information one can typically extract from a Facebook profile.

Solutions are pretty straightforward and obvious—obscured buyer identities shouldn’t be correlated with their real handles; feedback timestamps should be quantized to weeks or even months; feedback on buyers might not be necessary anymore; eBay shouldn’t discourage use of various enhanced-privacy modes, or should maybe even promote them to the default. (Again, I don’t know whether any of these solutions has been implemented.) The value of the paper is perhaps more in reminding website developers in general that cross-user correlations are always a privacy risk.
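As an illustration of the timestamp mitigation, here is a minimal Python sketch that quantizes a feedback timestamp to the start of its week. Choosing Monday midnight as the bucket boundary is an arbitrary assumption; the point is only that every transaction in the same week becomes indistinguishable by time.

```python
from datetime import datetime, timedelta


def quantize_to_week(ts: datetime) -> datetime:
    """Return midnight on the Monday of the week containing ts."""
    monday = ts - timedelta(days=ts.weekday())
    return monday.replace(hour=0, minute=0, second=0, microsecond=0)
```

After quantization, a timestamp join like the one used in the attack would match every transaction a seller completed that week, rather than a single one.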

Encore: Lightweight Measurement of Web Censorship with Cross-Origin Requests

Reviewed 16 September 2015

Censorship, Ethics, Methodology, Surveillance, Web

Sam Burnett, Nick Feamster

SIGCOMM 2015

As I’ve mentioned a few times here before, one of the biggest problems in measurement studies of Web censorship is taking the measurement from the right place. The easiest thing (and this may still be difficult) is to get access to a commercial VPN exit or university server inside each country of interest. But commercial data centers and universities have ISPs that are often somewhat less aggressive about censorship than residential and mobile ISPs in the same country—we think. [1] And, if the country is big enough, it probably has more than one residential ISP, and there’s no reason to think they behave exactly the same. [2] [3] What we’d really like is to enlist spare CPU cycles on a horde of residential computers across all of the countries we’re interested in.

This paper proposes a way to do just that. The authors propose to add a script to globally popular websites which, when the browser is idle, runs tests of censorship. Thus, anyone who visits the website will be enlisted. The first half of the paper is a technical demonstration that this is possible, and that you get enough information out of it to be useful. Browsers put a bunch of restrictions on what network requests a script can make—you can load an arbitrary webpage in an invisible <iframe>, but you don’t get notified of errors and the script can’t see the content of the page; conversely, <img> can only load images, but a script can ask to be notified of errors. Everything else is somewhere in between. Nonetheless, the authors make a compelling case for being able to detect censorship of entire websites with high accuracy and minimal overhead, and a somewhat less convincing case for being able to detect censorship of individual pages (with lower accuracy and higher overhead). You only get a yes-or-no answer for each thing probed, but that is enough for many research questions that we can’t answer right now. Deployment is made very easy, a simple matter of adding an additional third-party script to websites that want to participate.

The second half of the paper is devoted to ethical and practical considerations. Doing this at all is controversial—in a box on the first page, above the title of the paper, there’s a statement from the SIGCOMM 2015 program committee, saying the paper almost got rejected because some reviewers felt it was unethical to do anything of the kind without informed consent by the people whose computers are enlisted to make measurements. SIGCOMM also published a page-length review by John Byers, saying much the same thing. Against this, the authors argue that informed consent in this case is of dubious benefit, since it does not reduce the risk to the enlistees, and may actually be harmful by removing any traces of plausible deniability. They also point out that many people would need a preliminary course in how Internet censorship works and how Encore measures it before they could make an informed choice about whether to participate in this research. Limiting the pool of enlistees to those who already have the necessary technical background would dramatically reduce the scale and scope of measurements. Finally they observe that the benefits of collecting this data are clear, whereas the risks are nebulous. In a similar vein, George Danezis wrote a rebuttal of the public review, arguing that the reviewers’ concerns are based on a superficial understanding of what ethical research in this area looks like.

Let’s be concrete about the risks involved. Encore modifies a webpage such that web browsers accessing it will, automatically and invisibly to the user, also access a number of unrelated webpages (or resources). By design, those unrelated webpages contain material which is considered unacceptable, perhaps to the point of illegality, in at least some countries. Moreover, it is known that these countries mount active MITM attacks on much of the network traffic exiting the country, precisely to detect and block access to unacceptable material. Indeed, the whole point of the exercise is to provoke an observable response from the MITM, in order to discover what it will and won’t respond to.

The MITM has the power to do more than just block access. It almost certainly records the client IP address of each browser that accesses undesirable material, and since it’s operated by a state, those logs could be used to arrest and indict people for accessing illegal material. Or perhaps the state would just cut off their Internet access, which would be a lesser harm but still a punishment. It could also send back malware instead of the expected content (we don’t know if that has ever happened in real life, but very similar things have [4]), or turn around and mount an attack on the site hosting the material (this definitely has happened [5]). It could also figure out that certain accesses to undesirable material are caused by Encore and ignore them, causing the data collected to be junk, or it could use Encore itself as an attack vector (i.e. replacing the Encore program with malware).

In addition to the state MITM, we might also want to worry about other adversaries in a position to monitor user behavior online, such as employers, compromised coffee shop WiFi routers, and user-tracking software. Employers may have their own list of material that people aren’t supposed to access using corporate resources. Coffee shop WiFi is probably interested in finding a way to turn your laptop into a botnet zombie; any unencrypted network access is a chance to inject some malware. User-tracking software might become very confused about what someone’s demographic is, and start hitting them with ads that relate to whatever controversial topic Encore is looking for censorship of. (This last might actually be a Good Thing, considering the enormous harms behavioral targeting can do. [6])

All of these are harms to someone. It’s important to keep in mind that, except for poisoning the data collected by Encore (a harm to the research itself), all of them can happen in the absence of Encore. Malware, ad networks, embedded videos, embedded “like” buttons, third-party resources of any kind: all of these can and do cause a client computer to access material without its human operator’s knowledge or consent, including accesses to material that some countries consider undesirable. Many of them also offer an active MITM the opportunity to inject malware.

The ethical debate over this paper has largely focused on increased risk of legal, or quasilegal, sanctions taken against people whose browsers were enlisted to run Encore tests. I endorse the authors’ observation that informed consent would actually make that risk worse. Because there are so many reasons a computer might contact a network server without its owner’s knowledge, people already have plausible deniability regarding accesses to controversial material (i.e. “I never did that, it must have been a virus or something”). If Encore told its enlistees what it was doing and gave them a chance to opt out, it would take that away.

Nobody involved in the debate knows how serious this risk really is. We do know that many countries are not nearly as aggressive about filtering the Internet as they could be, [7] so it’s reasonable to think they can’t be bothered to prosecute people just for an occasional attempt to access stuff that is blocked. It could still be that they do prosecute people for bulk attempts to access stuff that is blocked, but Encore’s approach (many people doing a few tests each) would tend to avoid that. But there’s enough uncertainty that I think the authors should be talking to people in a position to know for certain: lawyers and activists from the actual countries of interest. There is not one word in either the paper or the reviews to suggest that anyone has done this. The organizations that the authors are talking to (Citizen Lab, Oxford Internet Institute, the Berkman Center) should have appropriate contacts already or be able to find them reasonably quickly.

Meanwhile, all the worry over legal risks has distracted from worrying about the non-legal risks. The Encore authors are fairly dismissive of the possibility that the MITM might subvert Encore’s own code or poison the results; I think that’s a mistake. They consider the extra bandwidth costs Encore incurs, but they don’t consider the possibility of exposing the enlistee to malware on a page (when they load an entire page). More thorough monitoring and reportage on Internet censorship might cause the censor to change its behavior, and not necessarily for the better—for instance, if it’s known that some ISPs are less careful about their filtering, that might trigger sanctions against them. These are just the things I can think of off the top of my head.

In closing, I think the controversy over this paper is more about the community not having come to an agreement about its own research ethics than it is about the paper itself. If you read the paper carefully, you’ll see that the IRB at each author’s institution did not review this research; they declined to engage with it. This was probably a correct decision from the boards’ point of view, because an IRB’s core competency is medical and psychological research. (They’ve come in for criticism in the past for reviewing sociological studies as if they were clinical trials.) They do not, in general, have the background or expertise to review this kind of research. There are efforts underway to change that: for instance, there was a Workshop on Ethics in Networked Systems Research at the very same conference that presented this paper. (I wish I could have attended.) Development of a community consensus here will, hopefully, lead to better handling of future, similar papers.

Parking Sensors: Analyzing and Detecting Parked Domains

Reviewed 14 September 2015

Advertising, Machine learning, Malware, Software ecology, Web

Thomas Vissers, Wouter Joosen, Nick Nikiforakis

Network and Distributed System Security Symposium 2015

In the same vein as the ecological study of ad injectors I reviewed back in June, this paper does an ecological analysis of domain parking. Domain parking is the industry term of art for the practice of registering a whole bunch of domain names you don’t have any particular use for but hope someone will buy, and while you wait for the buyer, sticking a website consisting entirely of ads in the space, usually with “This domain is for sale!” at the top in big friendly letters. Like many ad-driven online business models, domain parking can be mostly harmless or it can be a lead-in to outright scamming people and infesting their computers with malware, and the research question in this paper is, how bad does it get?

In order to answer that question, the authors carried out a modest-sized survey of 3000 parked domains, identified by trawling the DNS for name servers associated with 15 known parking services. (Also like many other online businesses, domain parking runs on an affiliate marketing basis: lots of small fry register the domains and then hand the keys over to big services that do the actual work of maintaining the websites with the ads.) All of these services engage in all of the abusive behavior one would expect: typosquatting, aggressive behavioral ad targeting, drive-by malware infection, and feeding visitors to scam artists and phishers. I do not come away with a clear sense of how common any of these attacks are relative to the default parking page of advertisements and links—they have numbers, but they’re not very well organized, and different sets of parking pages were used in each section (discussing a different type of abuse) which makes it hard to compare across sections.

I’m most interested in the last section of the paper, in which they develop a machine classifier that can distinguish parking pages from normal webpages, based on things like the amount of text that is and isn’t a hyperlink, number of external links, total volume of resources drawn from third-party sources, and so on. The bulk of this section is devoted to enumerating all of the features that they tested, but doesn’t do a terribly good job of explaining which features wound up being predictive. Algorithmic choices also seem a little arbitrary. They got 97.9% true positive rate and 0.5% false positive rate out of it, though, which says to me that this isn’t a terribly challenging classification problem and probably most anything would have worked. (This is consistent with the intuitive observation that you, a human, can tell at a glance when you’ve hit a parking page.)

Detecting Semantic Cloaking on the Web

Reviewed 7 August 2015

Machine learning, Spam, Web

Baoning Wu, Brian D. Davison

International World Wide Web Conference 2006

Now for something a little different: today’s paper is about detecting search engine spam. Specifically, it’s about detecting when a Web site presents different content to a search engine’s crawler than it does to human visitors. As the article points out, this can happen for benign or even virtuous reasons: a college’s front page might rotate through a selection of faculty profiles, or a site might strip out advertising and other material that is only relevant to human visitors when it knows it’s talking to a crawler. However, it also happens when someone wants to fool the search engine into ranking their site highly for searches where they don’t actually have relevant material.

To detect such misbehavior, obviously one should record each webpage as presented to the crawler, and then again as presented to a human visitor, and compare them. The paper is about two of the technical challenges which arise when you try to execute this plan. (They do not claim to have solved all of the technical challenges in this area.) The first of these is, of course, how to program a computer to tell when a detected difference is spam and when it is benign. Here they have done something straightforward: supervised machine classification. You could read this paper just as a case study in semi-automated feature selection for a machine classifier, and you would learn something. (Feature selection is somewhat of a black art: features that appear to be highly predictive may be accidents of your sample, and features that logically should be predictive might not work out at all. In this case, the positive features they list seem plausibly motivated, but several of the negative features (features which are anticorrelated with spamming) seem likely to be accidental. I would have liked to see more analysis of why each feature is predictive.)

The second technical challenge is less obvious: sites are updated frequently. You don’t want to mistake an update for any kind of variation between the crawl result and the human-visitor result. But it’s not practical to visit a site simultaneously as the crawler and as the human, just because of how many sites a crawl has to touch (and if you did, the spammers might be able to figure out that your human visit was an audit). Instead, you could visit the site repeatedly as each and see if the changes match, but this is expensive. The paper proposes to weed out sites that don’t change at all between the crawler visit and the human visit, and do the more expensive check only on the sites that do. A refinement is to use a heuristic to pick out changes that are more likely to be spam: presence of additional keywords or links in the crawler version, relative to the human version. In their tests, this cuts the number of sites that have to be investigated in detail by a factor of 10 (and could do better by refining the heuristic further). These kinds of manual filter heuristics are immensely valuable in classification problems when one of the categories (no cloaking) is much larger than the other(s), both because they reduce the cost of running the classifier (and, in this case, the cost of data collection), and because machine-learning classifiers often do better when the categories all have roughly the same number of examples.
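The filtering step described above can be sketched in a few lines of Python. This is only a crude stand-in under stated assumptions: the function names are hypothetical, terms and links are extracted with simple regular expressions rather than a real HTML parser, and the paper’s actual heuristic may differ (for instance, by applying thresholds to the number of extra terms or links).

```python
import re
from typing import Set


def terms(html: str) -> Set[str]:
    """Lowercased words in the page text, with tags crudely stripped."""
    text = re.sub(r"<[^>]*>", " ", html)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def links(html: str) -> Set[str]:
    """Targets of href attributes, extracted with a simple regular expression."""
    return set(re.findall(r'href=["\']([^"\']+)["\']', html, flags=re.IGNORECASE))


def needs_detailed_check(crawler_copy: str, browser_copy: str) -> bool:
    """Decide whether a page should go on to the expensive classifier."""
    # Identical copies cannot be cloaking, so they are filtered out immediately.
    if crawler_copy == browser_copy:
        return False
    # Only extra material shown to the crawler is treated as suspicious.
    extra_terms = terms(crawler_copy) - terms(browser_copy)
    extra_links = links(crawler_copy) - links(browser_copy)
    return bool(extra_terms or extra_links)
```

Note the asymmetry: a page that shows extra content only to human visitors (advertising, say) is not forwarded, which matches the benign cases mentioned earlier in the review.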

This paper shouldn’t be taken as the last word in this area: it’s ten years old, its data set is respectable for an experiment but tiny compared to the global ’net, and false positive and negative rates of 7% and 15% (respectively) are much too large for production use. The false positive paradox is your nemesis when you are trying to weed spammers out of an index of 10⁹ websites. We know from what little they’ve said about it in public (e.g. [1] [2]) that Google does something much more sophisticated. But it is still valuable as a starting point if you want to learn how to do this kind of research yourself.
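The false positive paradox mentioned above is easy to see with a little arithmetic. The Python sketch below computes what fraction of flagged sites are actually cloaking, given the error rates quoted in this review. The prevalence figure used in the follow-up note (1% of sites cloaking) is purely an assumption for illustration; neither the paper nor this review states one.

```python
def precision(prevalence: float,
              false_positive_rate: float,
              false_negative_rate: float) -> float:
    """Fraction of flagged sites that really are spam (Bayes' rule)."""
    true_positive_rate = 1.0 - false_negative_rate
    flagged_spam = true_positive_rate * prevalence
    flagged_clean = false_positive_rate * (1.0 - prevalence)
    return flagged_spam / (flagged_spam + flagged_clean)


# Error rates from the review; the 1% prevalence is a hypothetical input.
ASSUMED_PREVALENCE = 0.01
EXAMPLE_PRECISION = precision(ASSUMED_PREVALENCE, 0.07, 0.15)
```

Under that assumed prevalence, only about 11% of flagged sites would really be cloaking; the rest would be innocent sites caught by the 7% false positive rate. On an index of 10⁹ sites that would mean tens of millions of false alarms.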

Automated Detection and Fingerprinting of Censorship Block Pages

Reviewed 5 August 2015

Censorship, Machine learning, Web

Ben Jones, Tzu-Wen Lee, Nick Feamster, Phillipa Gill

Internet Measurement Conference 2014

This short paper, from IMC last year, presents a re-analysis of data collected by the OpenNet Initiative on overt censorship of the Web by a wide variety of countries. Overt means that when a webpage is censored, the user sees an error message which unambiguously informs them that it’s censored. (A censor can also act deniably, giving the user no proof that censorship is going on—the webpage just appears to be broken.) The goal of this reanalysis is to identify block pages (the error messages) automatically, distinguish them from normal pages, and distinguish them from each other—a new, unfamiliar format of block page may indicate a new piece of software is in use to do the censoring.

The chief finding is that block pages can be reliably distinguished from normal pages just by looking at their length: block pages are typically much shorter than normal. This is to be expected, seeing that they are just an error message. What’s interesting, though, is that this technique works better than techniques that look in more detail at the contents of the page. I’d have liked to see some discussion of what kinds of misidentification appear for each technique, but there probably wasn’t room for that. Length is not an effective tactic for distinguishing block pages from each other, but term frequency is (they don’t go into much detail about that).
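A length-based detector of the kind described above can be very small. The Python sketch below assumes that a reference length for the uncensored page is available (for example, from a fetch made outside the country) and uses an invented threshold of 30%; neither the function names nor the threshold come from the paper.

```python
def relative_length_difference(observed_len: int, reference_len: int) -> float:
    """Length difference as a fraction of the longer of the two pages."""
    longer = max(observed_len, reference_len)
    if longer == 0:
        return 0.0
    return abs(observed_len - reference_len) / longer


def looks_like_block_page(observed_len: int,
                          reference_len: int,
                          threshold: float = 0.3) -> bool:
    """Flag a response whose length differs sharply from the reference copy.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    return relative_length_difference(observed_len, reference_len) > threshold
```

A check this simple cannot, on its own, tell a block page from any other short response.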

One thing that’s really not clear is how they distinguish block pages from ordinary HTTP error pages. They mention that ordinary errors introduce significant noise in term-frequency clustering, but they don’t explain how they weeded them out. It might have been done manually; if so, that’s a major hole in the overall automated-ness of this process.

An Automated Approach for Complementing Ad Blockers’ Blacklists

Reviewed 20 July 2015

Machine learning, Privacy, Web

David Gugelmann, Markus Happe, Bernhard Ager, Vincent Lenders

Privacy Enhancing Technologies 2015

Last week’s long PETS paper was very abstract; this week’s paper is very concrete. The authors are concerned that manually-curated blacklists, as currently used by most ad-blocking software, cannot hope to keep up with the churn in the online ad industry. (I saw a very similar talk at WPES back in 2012 [1] which quoted the statistic that the default AdBlock Plus filter list contains 18,000 unique URLs, with new ones added at a rate of 5 to 15 every week.) They propose to train a machine classifier on network-level characteristics that differ between ad services and normal web sites, to automate detection of new ad providers and/or third-party privacy-invasive analytics services. (This is the key difference from the paper at WPES 2012: that project used static analysis of JavaScript delivered by third-party services to extract features for their classifier.)

They find that a set of five features provides a reasonably effective classification: proportion of requests that are third-party (for transclusion into another website), number of unique referrers, ratio of received to sent bytes, and proportion of requests including cookies. For the training set they used, they achieve 83% precision and 85% recall, which is reasonable for a system that will be used to identify candidates for manual inspection and addition to blacklists.

There are several methodological bits in here which I liked. They use entropy-based discretization and information gain to identify valuable features and discard unhelpful ones. They compare a classifier trained on manually-labeled data (from a large HTTP traffic trace) with a classifier trained on the default AdBlock Plus filter list; both find similar features, but the ABP filter list has better coverage of infrequently used ads or analytics services, whereas the manually labeled training set catches a bunch of common ads and analytics services that ABP missed.
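For readers who haven’t met information gain before, here is a small sketch of how it scores a candidate cut point on a numeric feature. It is an illustration under stated assumptions, not the paper’s pipeline: it makes a single binary split, whereas entropy-based discretization normally recurses with a stopping criterion, and the feature values and labels are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the samples at `threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent distinct values."""
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(midpoints, key=lambda t: information_gain(values, labels, t))

# Invented data: share of each host's requests that are third-party.
third_party_share = [0.0, 0.125, 0.25, 0.75, 0.875, 1.0]
labels = ["site", "site", "site", "ad", "ad", "ad"]

cut = best_threshold(third_party_share, labels)
print(cut, information_gain(third_party_share, labels, cut))  # 0.5 1.0
```

A feature whose best cut yields a gain near zero tells the classifier almost nothing and is a candidate for discarding.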

One fairly significant gap is that the training set is limited to cleartext HTTP. There’s a strong trend nowadays toward HTTPS for everything, including ads, but established ad providers are finding it difficult to cut all their services over efficiently, which might provide an opportunity for new providers—and thus cause a shift toward providers that have been missed by the blacklists.

There’s almost no discussion of false positives. Toward the end there is a brief mention of third-party services like Gravatar and Flattr, which share a usage pattern with ads/analytics and showed up as false positives. But it’s possible to enumerate common types of third-party services (other than ads and analytics) a priori: outsourced commenting (Disqus, hypothes.is), social media share buttons (Facebook, Twitter), shared hosting of resources (jQuery, Google Fonts), static-content CDNs, etc. Probably, most of these are weeded out by the received-to-sent-bytes ratio check, but I would still have liked to see an explicit check of at least a few of these.

And finally, nobody seems to have bothered talking to the people who actually maintain the ABP filter lists to find out how they do it. (I suspect it relies strongly on manual, informal reporting to a forum or something.) If this is to turn into anything more than an experiment, people need to be thinking about integration and operationalization.

Cache Timing Attacks Revisited: Efficient and Repeatable Browser History, OS, and Network Sniffing

Reviewed 15 July 2015

Fingerprinting, Information leakage, Privacy, Web

Chetan Bansal, Sören Preibusch, Natasa Milic-Frayling

International Information Security and Privacy Conference 2015

Cache timing attacks use the time some operation takes to learn whether or not a datum is in a cache. They’re inherently hard to fix, because the entire point of a cache is to speed up access to data that was used recently. They’ve been studied as far back as Multics. [1] In the context of the Web, the cache in question (usually) is the browser’s cache of resources retrieved from the network, and the attacker (usually) is a malicious website that wants to discover something about the victim’s activity on other websites. [2] [3] This paper gives examples of using cache timing to learn information needed for phishing, discover the victim’s location, and monitor the victim’s search queries. It’s also known to be possible to use the victim’s browsing habits as an identifying fingerprint [4].

The possibility of this attack is well-known, but to date it has been dismissed as an actual risk for two reasons: it was slow, and probing the cache was a destructive operation, i.e. the attacker could only probe any given resource once, after which it would be cached whether or not the victim had ever loaded it. This paper overcomes both of those limitations. It uses Web Workers to parallelize cache probing, and it sets a very short timeout on all its background network requests—so short that the request can only succeed if it’s cached. Otherwise, it will be cancelled and the cache will not be perturbed. (Network requests will always take at least a few tens of milliseconds, whereas the cache can respond in less than a millisecond.) In combination, these achieve two orders of magnitude speedup over the best previous approach, and, more importantly, they render the attack repeatable.
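The short-timeout trick is easy to see in a simulation. The sketch below is not the paper’s code (which runs in the browser with Web Workers); it is a Python stand-in in which a set plays the role of the browser cache, `asyncio.sleep` plays the role of network latency, and the URLs and timings are invented.

```python
import asyncio

VISITED = "https://example.test/visited.png"      # hypothetical URLs
UNVISITED = "https://example.test/unvisited.png"

CACHE = {VISITED}  # stand-in for the victim's browser cache

async def fetch(url):
    """Simulated fetch: instant on a cache hit, slow on a miss."""
    if url not in CACHE:
        await asyncio.sleep(0.2)  # simulated network round trip
        CACHE.add(url)            # only a completed fetch populates the cache
    return url

async def probe(url, timeout=0.02):
    """Report whether `url` was cached, cancelling the fetch on timeout."""
    try:
        await asyncio.wait_for(fetch(url), timeout)
        return True
    except asyncio.TimeoutError:
        return False  # cancelled before completion, so the cache is untouched

async def main():
    urls = [VISITED, UNVISITED]
    # Probe all URLs concurrently, then repeat to show the probe is repeatable.
    first = await asyncio.gather(*(probe(u) for u in urls))
    second = await asyncio.gather(*(probe(u) for u in urls))
    return first, second

first, second = asyncio.run(main())
print(first, second)  # [True, False] [True, False]
```

Because the miss is cancelled before it completes, the second round of probes sees exactly the same cache state as the first, which is what makes the attack repeatable.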

What do we do about it? Individual sites can mark highly sensitive data not to be cached at all, but this slows the web down, and you’ll never get everyone to do it for everything that could be relevant. Browsers aren’t about to disable caching; it’s too valuable. A possibility (not in this paper) is that we could notice the first time a resource was loaded from a new domain, and deliberately delay satisfying it from the cache by approximately the amount of time it took to load off the network originally. I’d want to implement and test that to make sure it didn’t leave a hole, but it seems like it would allow us to have our cake and eat it too.
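Here is roughly what I have in mind, as an untested sketch rather than a vetted defense. The class and method names are made up, and instead of actually sleeping, `lookup` just returns the delay the browser would impose.

```python
class DelayingCache:
    """Toy cache that slows down the first hit from each new requesting origin."""

    def __init__(self):
        self._entries = {}  # url -> (body, seconds the original network load took)
        self._seen = set()  # (requesting origin, url) pairs already served

    def store(self, url, body, load_seconds, origin):
        """Record a completed network load made on behalf of `origin`."""
        self._entries[url] = (body, load_seconds)
        self._seen.add((origin, url))

    def lookup(self, url, origin):
        """Return (body, artificial_delay_seconds), or None on a cache miss."""
        if url not in self._entries:
            return None
        body, load_seconds = self._entries[url]
        if (origin, url) in self._seen:
            return body, 0.0  # this origin has loaded it before: serve at full speed
        self._seen.add((origin, url))
        return body, load_seconds  # first hit from a new origin: mimic the network

cache = DelayingCache()
logo = "https://cdn.example/logo.png"  # hypothetical resource
cache.store(logo, b"png-bytes", 0.08, origin="https://shop.example")

print(cache.lookup(logo, "https://shop.example"))  # (b'png-bytes', 0.0)
print(cache.lookup(logo, "https://evil.example"))  # (b'png-bytes', 0.08)
```

The prying origin’s first probe takes as long as a real network load would, so a hit and a miss look the same from where it sits; whether that holds up against a determined attacker is exactly the part that needs testing.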

In closing, I want to point out that this is a beautiful example of the maxim that attacks always get better. Cryptographers have been banging that drum for decades, trying to get people to move away from cipher primitives that are only a little weak before it’s too late, without much impact. The same, it seems, goes for information leaks.