Papers tagged ‘Advertising’

Parking Sensors: Analyzing and Detecting Parked Domains

In the same vein as the ecological study of ad injectors I reviewed back in June, this paper does an ecological analysis of domain parking. Domain parking is the industry term of art for the practice of registering a whole bunch of domain names you don’t have any particular use for but hope someone will buy, and while you wait for the buyer, sticking a website consisting entirely of ads in the space, usually with “This domain is for sale!” at the top in big friendly letters. Like many ad-driven online business models, domain parking can be mostly harmless or it can be a lead-in to outright scamming people and infesting their computers with malware, and the research question in this paper is, how bad does it get?

In order to answer that question, the authors carried out a modest-sized survey of 3000 parked domains, identified by trawling the DNS for name servers associated with 15 known parking services. (Also like many other online businesses, domain parking runs on an affiliate marketing basis: lots of small fry register the domains and then hand the keys over to big services that do the actual work of maintaining the websites with the ads.) All of these services engage in all of the abusive behavior one would expect: typosquatting, aggressive behavioral ad targeting, drive-by malware infection, and feeding visitors to scam artists and phishers. I do not come away with a clear sense of how common any of these attacks are relative to the default parking page of advertisements and links. They have numbers, but they’re not very well organized, and each section (each discussing a different type of abuse) used a different set of parking pages, which makes it hard to compare across sections.

I’m most interested in the last section of the paper, in which they develop a machine classifier that can distinguish parking pages from normal webpages, based on things like the amount of text that is and isn’t a hyperlink, number of external links, total volume of resources drawn from third-party sources, and so on. The bulk of this section is devoted to enumerating all of the features that they tested, but it doesn’t do a terribly good job of explaining which features wound up being predictive. The algorithmic choices also seem a little arbitrary. They got 97.9% true positive rate and 0.5% false positive rate out of it, though, which says to me that this isn’t a terribly challenging classification problem and probably most anything would have worked. (This is consistent with the intuitive observation that you, a human, can tell at a glance when you’ve hit a parking page.)
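To make the kind of features mentioned above concrete, here is a minimal sketch in Python of how three of them might be computed from a page's HTML. This is my own illustration, not the authors' code: the class and function names, the choice of tags, and the definition of "third-party" as "hosted on a different hostname" are all assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageFeatures(HTMLParser):
    """Collects a few link- and resource-based counts from one HTML page.

    Hypothetical helper: the feature definitions are illustrative guesses,
    not the ones used in the paper.
    """

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.link_depth = 0             # > 0 while inside an <a> element
        self.hidden_depth = 0           # > 0 while inside <script>/<style>
        self.link_chars = 0             # visible text inside hyperlinks
        self.plain_chars = 0            # visible text outside hyperlinks
        self.external_links = 0         # <a href> pointing at another host
        self.third_party_resources = 0  # script/img/iframe src on another host

    def _is_external(self, url):
        # Relative URLs have no hostname and count as first-party.
        host = urlparse(url).hostname
        return host is not None and host != self.page_host

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self.hidden_depth += 1
        if tag == "a":
            self.link_depth += 1
            if self._is_external(attrs.get("href") or ""):
                self.external_links += 1
        elif tag in ("script", "img", "iframe"):
            if self._is_external(attrs.get("src") or ""):
                self.third_party_resources += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.hidden_depth > 0:
            self.hidden_depth -= 1
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth > 0:
            return  # script and style bodies are not visible text
        n = len(data.strip())
        if self.link_depth > 0:
            self.link_chars += n
        else:
            self.plain_chars += n


def extract_features(html, page_host):
    """Return a small feature dictionary for one page."""
    parser = PageFeatures(page_host)
    parser.feed(html)
    parser.close()
    total = parser.link_chars + parser.plain_chars
    return {
        "link_text_ratio": parser.link_chars / total if total else 0.0,
        "external_links": parser.external_links,
        "third_party_resources": parser.third_party_resources,
    }
```

A dictionary like this, computed for a labeled set of parked and ordinary pages, could be fed to more or less any off-the-shelf classifier; which one the authors would prefer, and which features carry the weight, is exactly what the paper leaves unclear.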

Automated Experiments on Ad Privacy Settings

You’ve probably noticed the creepy effect where you consider buying something online, or maybe just look at a page for something that happens to be for sale, and for weeks afterward you get ads on totally unrelated websites for that thing or similar things. The more reputable online ad brokerages offer a degree of control over this effect (e.g. Google, Microsoft). This study investigates exactly what effect those settings have on the ads observed by automated browsing agents. The basic idea is to set some of the knobs, visit websites that will tell the ad provider something about the simulated customer’s preferences, possibly adjust the knobs again, and finally record what is being advertised on a general-interest website.

A great deal of the paper is devoted to statistical methodology. Because the ad provider is a stateful black box, and one whose behavior may depend on uncontrollable external factors (e.g. whether an advertiser has exhausted their budget for the month), it’s vital to avoid as many statistical assumptions as possible. They use permutation tests and supervised classification (logistic regression), both of which make minimal assumptions. They’re also very careful about drawing conclusions from their results. I’m not much of a statistician, but it all sounds carefully thought out and plausible, with one exception: heavy reliance on significance testing, which has come in for serious criticism [1] to the point where some journals no longer accept its use at all [2]. This is exactly the sort of research where p-values can mislead; if I were reviewing this prior to publication I would have encouraged the authors to think about how they could present the same conclusions without using significance testing.
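For readers who haven't met permutation tests, here is a minimal sketch of the general technique in Python. It is not the authors' procedure: I am using the absolute difference of group means as the test statistic purely for illustration, and the function name and parameters are mine.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Illustrative only: the test statistic here is an assumption, chosen
    because it is simple, not because the paper uses it.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so reshuffling them should produce differences as large as the
        # observed one reasonably often.
        rng.shuffle(pooled)
        a, b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

The appeal for a black-box experiment is that nothing here assumes a particular distribution for the measurements; the only assumption is that, if the setting under test had no effect, the labels on the simulated browsers could be swapped freely. The output is still a p-value, though, so my reservation about significance testing applies to it just the same.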

Now, the actual conclusions. Only Google’s ads were tested. (Expanding the tests to additional ad brokers is listed as future work.) They confirm that turning a particular topic (dating) off in the preferences does cause those ads to go away. They observe that two highly sensitive topics (substance abuse, disability) that do trigger targeted ads are not controllable via the preferences; in fact, they are completely invisible on that screen. And the most interesting case is when they set the ad preferences to explicitly reveal a gender (man or woman) and then browsed a bunch of sites related to job searching. Simulated customers who claimed to be men got ads for a career coaching service which promised better odds of being hired into $200K+ executive positions; those who claimed to be women did not see these ads.

This last example clearly reflects the well-known glass ceiling which affects business in general, but (as the authors point out) it’s impossible to tell, from outside the black box, why it shows up in this case. The career-coaching service could have chosen to advertise only to men. Google’s ad-targeting algorithm could have been coded with (likely unconscious) bias in its category structure, or (this is the most probable explanation) its feedback mechanism could have detected that men are more likely to click on ads with those keywords, and then reinforced that difference by showing those ads preferentially to men. There’s a telling comment at the very end of the paper:

… We consider it more likely that Google has lost control over its massive, automated advertising system. Even without advertisers placing inappropriate bids, large-scale machine learning can behave in unexpected ways.

There’s a lesson here for all the big data companies: the best an unbiased machine learning system can hope to do is produce an accurate reflection of the training set—including whatever biases are in there. If you want to avoid reduplicating all the systemic biases of the offline world, you will have to write code to that effect.
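The feedback explanation is easy to illustrate with a toy model. The sketch below is entirely hypothetical: the allocation rule (next round's share of impressions is proportional to this round's clicks) and the click rates in the usage note are made up for illustration and say nothing about how Google's system actually works.

```python
def simulate_feedback(rate_a, rate_b, rounds, share_a=0.5):
    """Return group A's share of impressions after each round.

    Toy rule (an assumption, not any real ad system): each round, a
    group's next share of impressions is proportional to the clicks
    that group produced in the current round.
    """
    history = [share_a]
    for _ in range(rounds):
        clicks_a = share_a * rate_a          # expected clicks from group A
        clicks_b = (1.0 - share_a) * rate_b  # expected clicks from group B
        share_a = clicks_a / (clicks_a + clicks_b)
        history.append(share_a)
    return history
```

With made-up click rates of 5% for one group and 4% for the other, `simulate_feedback(0.05, 0.04, 20)` starts at an even split and ends with the first group receiving nearly all of the impressions. Nobody coded a preference for either group; the rule just amplifies whatever small difference it finds in the data.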

Ad Injection at Scale: Assessing Deceptive Advertisement Modifications

Today we have a study of ad injection software, which runs on your computer and inserts ads into websites that didn’t already have them, or replaces the website’s ads with its own. (The authors concentrate on browser extensions, but there are several other places where such programs could be installed with the same effect.) Such software is, in 98 out of 100 cases (figure taken from paper), not intentionally installed; instead it is a side-load, packaged together with something else that the user intended to install, or else it is loaded onto the computer by malware.

The injected ads cannot easily be distinguished from ads that a website intended to run, by the person viewing the ads or by the advertisers. A website subjected to ad injection, however, can figure it out, because it knows what its HTML page structure is supposed to look like. This is how the authors detected injected ads on a variety of Google sites; they say that they developed software that can be reused by anyone, but I haven’t been able to find it. They say that Content-Security-Policy should also work, but that doesn’t seem right to me, because page modifications made by a browser extension should, in general, be exempt from CSP.
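Since I couldn’t find the authors’ software, here is a rough sketch of the underlying idea in Python, under my own assumptions: the site receives a serialized copy of the DOM as the browser actually rendered it, and flags any script, iframe, or image loaded from a host that the site does not itself use. The names and the allowlist approach are mine; the authors’ tool presumably runs in the page and is more sophisticated.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class UnexpectedElementFinder(HTMLParser):
    """Flags resource-loading elements whose host is not on an allowlist.

    Hypothetical helper for illustration; not the detector from the paper.
    """

    def __init__(self, allowed_hosts):
        super().__init__()
        self.allowed_hosts = set(allowed_hosts)
        self.unexpected = []  # (tag, host) pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in ("script", "iframe", "img"):
            return
        src = dict(attrs).get("src") or ""
        host = urlparse(src).hostname
        # Relative URLs have no hostname and are treated as first-party.
        if host is not None and host not in self.allowed_hosts:
            self.unexpected.append((tag, host))


def find_injected(serialized_dom, allowed_hosts):
    """Return the elements that the site did not expect to be on its page."""
    finder = UnexpectedElementFinder(allowed_hosts)
    finder.feed(serialized_dom)
    finder.close()
    return finder.unexpected
```

This only works from the website’s side of the connection, which is the point made above: the site knows which hosts belong on its pages, and neither the viewer nor the advertiser does.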

The bulk of the paper is devoted to characterizing the ecosystem of ad-injection software: who makes it, how does it get onto people’s computers, what does it do? Like the malware ecosystem [1] [2], the core structure of this ecosystem is a layered affiliate network, in which a small number of vendors market ad-injection modules which are added to a wide variety of extensions, and broker ad delivery and clicks from established advertising exchanges. Browser extensions are in an ideal position to surveil the browser user and build up an ad-targeting profile, and indeed, all of the injectors do just that. Ad injection is often observed in conjunction with other malicious behaviors, such as search engine hijacking, affiliate link hijacking, social network spamming, and preventing uninstallation, but it’s not clear whether the ad injectors themselves are responsible for that (it could equally be that the extension developer is trying to monetize by every possible means).

There are some odd gaps. There is no mention of click-fraud; it is easy for an extension to forge clicks, so I’m a little surprised the authors did not discuss the possibility. There is also no discussion of parasitic repackaging. This is a well-known problem with desktop software, with entire companies whose business model is to take software that someone else wrote and gives away for free, package it together with ad injectors and worse, and arrange to be what people find when they try to download that software. [3] [4] It wouldn’t surprise me if these were also responsible for an awful lot of the problematic extensions discussed in the paper.

An interesting tidbit, not followed up on, is that ad injection is much more common in South America, parts of Africa, South Asia, and Southeast Asia than in Europe, North America, Japan, or South Korea. (They don’t have data for China, North Korea, or all of Africa.) This could be because Internet users in the latter countries are more likely to know how to avoid deceptive software installers and malicious extensions, or, perhaps, just less likely to click on ads in general.

The solutions presented in this paper are rather weak: more aggressive weeding of malicious extensions from the Chrome Web Store and similar repositories, and reaching out to ad exchanges to encourage them to refuse service to injectors (if they can detect them, anyway). A more compelling solution would probably start with a look at who profits from bundling ad injectors with their extensions, and what alternative sources of revenue might be viable for them. Relatedly, I would have liked to see some analysis of what the problematic extensions’ overt functions were. There are legitimate reasons for an extension to add content to all webpages, e.g. [5] [6], but extension repositories could reasonably require more careful scrutiny of an extension that asks for that privilege.

It would also help if the authors acknowledged that the only difference between an ad injector and a legitimate ad provider is that the latter only runs ads on sites with the site’s consent. All of the negative impact on end users (behavioral tracking, pushing organic content below the fold or under interstitials, slowing down page loads, and so on) is present with site-solicited advertising. And the same financial catch-22 is present for website proprietors as for extension developers: advertising is one of the only proven ways to earn revenue for a website, but it doesn’t work all that well, and it harms your relationship with your end users. In the end I think the industry has to find some other way to make money.