Web, page 2 | Readings in Information Security

Today we have a study of ad injection software, which runs on your computer and inserts ads into websites that didn’t already have them, or replaces the website’s ads with its own. (The authors concentrate on browser extensions, but there are several other places where such programs could be installed with the same effect.) Such software is, in 98 out of 100 cases (figure taken from the paper), not intentionally installed; instead it is side-loaded, packaged together with something else that the user intended to install, or else it is loaded onto the computer by malware.

The injected ads cannot easily be distinguished from ads that a website intended to run, by the person viewing the ads or by the advertisers. A website subjected to ad injection, however, can figure it out, because it knows what its HTML page structure is supposed to look like. This is how the authors detected injected ads on a variety of Google sites; they say that they developed software that can be reused by anyone, but I haven’t been able to find it. They say that Content-Security-Policy should also work, but that doesn’t seem right to me, because page modifications made by a browser extension should, in general, be exempt from CSP.
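To make the "site knows its own page structure" idea concrete, here is a minimal sketch of one way such a check could work. It is not the authors' tool, which I have not seen. It assumes the site has some way to get the rendered DOM back from the browser as serialized HTML, and that it keeps an allowlist of hosts it legitimately loads resources from. `EXPECTED_HOSTS`, `ResourceCollector`, and `unexpected_hosts` are all hypothetical names invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical allowlist: hosts this site expects to load resources from.
EXPECTED_HOSTS = {"example.com", "static.example.com"}


class ResourceCollector(HTMLParser):
    """Collects the hosts of external scripts, frames, and images in a page."""

    WATCHED_TAGS = {"script", "iframe", "img"}

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag not in self.WATCHED_TAGS:
            return
        src = dict(attrs).get("src")
        if src:
            # Relative URLs have no hostname and are treated as same-site.
            host = urlparse(src).hostname
            if host:
                self.hosts.add(host)


def unexpected_hosts(rendered_html, expected=EXPECTED_HOSTS):
    """Return resource hosts present in the rendered page but not expected."""
    collector = ResourceCollector()
    collector.feed(rendered_html)
    return collector.hosts - set(expected)
```

A real deployment would have to run the comparison in the page itself and report back, and an injector that rewrites the site's existing ad slots instead of adding new elements would need a finer-grained comparison than a host allowlist.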

The bulk of the paper is devoted to characterizing the ecosystem of ad-injection software: who makes it, how does it get onto people’s computers, what does it do? Like the malware ecosystem [1] [2], the core structure of this ecosystem is a layered affiliate network, in which a small number of vendors market ad-injection modules which are added to a wide variety of extensions, and broker ad delivery and clicks from established advertising exchanges. Browser extensions are in an ideal position to surveil the browser user and build up an ad-targeting profile, and indeed, all of the injectors do just that. Ad injection is often observed in conjunction with other malicious behaviors, such as search engine hijacking, affiliate link hijacking, social network spamming, and preventing uninstallation, but it’s not clear whether the ad injectors themselves are responsible for that (it could equally be that the extension developer is trying to monetize by every possible means).

There are some odd gaps. There is no mention of click-fraud; it is easy for an extension to forge clicks, so I’m a little surprised the authors did not discuss the possibility. There is also no discussion of parasitic repackaging. This is a well-known problem with desktop software, with entire companies whose business model is to take software that someone else wrote and gives away for free, package it together with ad injectors and worse, and arrange to be what people find when they try to download that software. [3] [4] It wouldn’t surprise me if these companies were also responsible for an awful lot of the problematic extensions discussed in the paper.

An interesting tidbit, not followed up on, is that ad injection is much more common in South America, parts of Africa, South Asia, and Southeast Asia than in Europe, North America, Japan, or South Korea. (They don’t have data for China, North Korea, or parts of Africa.) This could be because Internet users in the latter regions are more likely to know how to avoid deceptive software installers and malicious extensions, or, perhaps, just less likely to click on ads in general.

The solutions presented in this paper are rather weak: more aggressive weeding of malicious extensions from the Chrome Web Store and similar repositories, and reaching out to ad exchanges to encourage them to refuse service to injectors (if they can detect them, anyway). A more compelling solution would probably start with a look at who profits from bundling ad injectors with their extensions, and what alternative sources of revenue might be viable for them. Relatedly, I would have liked to see some analysis of what the problematic extensions’ overt functions were. There are legitimate reasons for an extension to add content to all webpages, e.g. [5] [6], but extension repositories could reasonably require more careful scrutiny of an extension that asks for that privilege.

It would also help if the authors acknowledged that the only difference between an ad injector and a legitimate ad provider is that the latter only runs ads on sites with the site’s consent. All of the negative impact to end users (behavioral tracking, pushing organic content below the fold or under interstitials, slowing down page loads, and so on) is present with site-solicited advertising. And the same financial catch-22 is present for website proprietors as for extension developers: advertising is one of the few proven ways to earn revenue for a website, but it doesn’t work all that well, and it harms your relationship with your end users. In the end I think the industry has to find some other way to make money.

If you are a website, it is fairly easy to identify the web browser in use by each of your visitors, even if they take steps to suppress the blatant things like the User-Agent header. [1] [2] It is so easy, in fact, that researchers typically try to make it harder for themselves, trying instead to identify individual users even as they move around, change IP addresses, flush their cookies, etc. [3] [4]

If you are a passive eavesdropper in between the browser and the website, and the network traffic is encrypted, and particularly if you are isolated from the client’s IP address by anonymizing relays (e.g. Tor), the task should logically be much harder. Or is it? The authors of this short paper did the most obvious thing: capture packet traces and throw them at an off-the-shelf machine-learning classifier. The feature vectors seen by the classifier are not described as clearly as I’d like, but I think they divided the trace into equal-length intervals and aggregated packet sizes in each direction in each interval; this is also one of the most basic and obvious things to do (the future work bit talks a little about better feature engineering). Despite the lack of tuning, they report 70–90% classification accuracy on a four-way choice among browsers (Chrome, Firefox, IE, Tor Browser) and 30–80% accuracy for a 13-way choice among browser and plugin combinations, by which they seem to mean whether or not JavaScript and Flash were enabled. For comparison, chance performance would be 25% accuracy on the four-way choice and 7.7% on the 13-way choice.
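Since the paper is vague on this point, here is a minimal sketch of the feature extraction as I read it: slice the trace into equal-length intervals and sum packet sizes per direction in each. The packet representation, the function names, and the choice to drop packets beyond the last interval are my assumptions, not details from the paper.

```python
from typing import List, Tuple

# Assumed packet representation: (timestamp in seconds, size in bytes,
# direction), with direction +1 for client-to-server, -1 for server-to-client.
Packet = Tuple[float, int, int]


def interval_features(trace: List[Packet], interval: float,
                      n_intervals: int) -> List[int]:
    """Sum bytes per direction in each equal-length interval of a trace.

    The result has 2 * n_intervals entries, laid out as
    [out_0, in_0, out_1, in_1, ...].
    """
    features = [0] * (2 * n_intervals)
    if not trace:
        return features
    start = min(t for t, _, _ in trace)
    for t, size, direction in trace:
        idx = int((t - start) / interval)
        if idx >= n_intervals:
            continue  # assumption: ignore packets past the observation window
        offset = 0 if direction > 0 else 1
        features[2 * idx + offset] += size
    return features


def chance_accuracy(n_classes: int) -> float:
    """Accuracy of uniform random guessing among equally likely classes."""
    return 1.0 / n_classes
```

A fixed-length vector like this can be fed to any standard classifier, and `chance_accuracy` gives the baseline that the reported accuracies should be judged against.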

This is a short workshop paper, so it is to be expected that the results are a little crude and have missing pieces. The authors already know they need to tune their classifier. I hope someone has told them about ROC curves; raw accuracies make me grind my teeth. Besides that, the clear next step is to figure out what patterns the classifiers are picking up on, and then how to efface those patterns. I think it’s quite likely that the signal they see depends on gross characteristics of the different HTTP implementations used by each browser; for instance, at time of publication, Chrome implemented SPDY and QUIC, and the others didn’t.
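For readers who haven't met ROC curves: instead of one accuracy number, you sweep the classifier's decision threshold and plot true-positive rate against false-positive rate. The sketch below computes those points and the area under the curve for a single one-vs-rest decision (say, "is this Chrome or not"). It assumes the classifier emits a numeric score per trace, which the paper does not describe; the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) points.

    `scores` are classifier confidences for the positive class; `labels`
    are truthy for positive examples. The threshold sweeps from high to low.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example sharing this score,
        # so tied scores are treated as a single threshold.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An area of 0.5 corresponds to guessing and 1.0 to perfect separation, regardless of how many classes there are or how unbalanced they are, which is exactly what raw accuracy fails to convey.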

The paper contains some handwaving in the direction of being able to fingerprint individual users with this information, but I’d want to see detectable variation among installations of the same browser before I’d be convinced that’s possible.

Papers tagged ‘Web’

Ad Injection at Scale: Assessing Deceptive Advertisement Modifications

Identifying Webbrowsers in Encrypted Communications