Papers tagged ‘Machine learning’

Improving Cloaking Detection Using Search Query Popularity and Monetizability

Here’s another paper about detecting search engine spam, contemporary with Detecting Semantic Cloaking on the Web and, in fact, presented as a refinement to that paper. They don’t make any changes to the is this page spamming the search engine algorithm itself; rather, they optimize the scan for pages that are spamming the search engine, by looking at the spammers’ economic incentives.

It obviously does no good to make your site a prominent search hit for keywords that nobody ever searches for. Less obviously, since search engine spammers are trying to make money by sucking people into linkfarms full of ads, they should be focusing on keywords that lots of advertisers are interested in (monetizability in the paper’s jargon). Therefore, the search engine operator should focus its spam-detection efforts on the most-searched and most-advertised keywords. Once you identify a linkfarm, you can weed it out of all your keyword indexes, so this improves search results in the long tail as well as the popular case.

They performed a quick verification of this hypothesis with the help of logs of the 5000 most popular and 5000 most advertised search keywords, from the MSN search engine (now Bing). Indeed, the spam was strongly skewed toward both high ends—somewhat more toward monetizability than popularity.

That’s really all there is to say about this paper. They had a good hypothesis, they tested it, and the hypothesis was confirmed. I’ll add that this is an additional reason (besides just making money) for search engines to run their own ad brokerages, and a great example of the value of applying economic reasoning to security research.

Parking Sensors: Analyzing and Detecting Parked Domains

In the same vein as the ecological study of ad injectors I reviewed back in June, this paper does an ecological analysis of domain parking. Domain parking is the industry term of art for the practice of registering a whole bunch of domain names you don’t have any particular use for but hope someone will buy, and while you wait for the buyer, sticking a website consisting entirely of ads in the space, usually with This domain is for sale! at the top in big friendly letters. Like many ad-driven online business models, domain parking can be mostly harmless or it can be a lead-in to outright scamming people and infesting their computers with malware, and the research question in this paper is, how bad does it get?

In order to answer that question, the authors carried out a modest-sized survey of 3000 parked domains, identified by trawling the DNS for name servers associated with 15 known parking services. (Also like many other online businesses, domain parking runs on an affiliate marketing basis: lots of small fry register the domains and then hand the keys over to big services that do the actual work of maintaining the websites with the ads.) All of these services engage in all of the abusive behavior one would expect: typosquatting, aggressive behavioral ad targeting, drive-by malware infection, and feeding visitors to scam artists and phishers. I do not come away with a clear sense of how common any of these attacks are relative to the default parking page of advertisements and links—they have numbers, but they’re not very well organized, and different sets of parking pages were used in each section (discussing a different type of abuse) which makes it hard to compare across sections.

I’m most interested in the last section of the paper, in which they develop a machine classifier that can distinguish parking pages from normal webpages, based on things like the amount of text that is and isn’t a hyperlink, number of external links, total volume of resources drawn from third-party sources, and so on. The bulk of this section is devoted to enumerating all of the features that they tested, but doesn’t do a terribly good job of explaining which features wound up being predictive. Algorithmic choices also seem a little arbitrary. They got 97.9% true positive rate and 0.5% false positive rate out of it, though, which says to me that this isn’t a terribly challenging classification problem and probably most anything would have worked. (This is consistent with the intuitive observation that you, a human, can tell at a glance when you’ve hit a parking page.)

Detecting Internet Filtering from Geographic Time Series

We’re picking back up with a paper that’s brand new—so new that it exists only as an arXiv preprint and I don’t know if it is planned to be published anywhere. It probably hasn’t gone through formal peer review yet.

Wright and colleagues observe that because Tor is commonly used to evade censorship, changes in the number of people using Tor from any given country are a signal of a change in the censorship régime in that country. This isn’t a new idea: the Tor project itself has been doing something similar since 2011. What this paper does is present an improved algorithm for detecting such changes. It uses PCA to compare the time series of Tor active users across countries. The idea is that if there’s a change in Tor usage worldwide, that probably doesn’t indicate censorship, but a change in just a few countries is suspicious. To model this using PCA, they tune the number of principal components so that the projected data matrix is well-divided into what they call normal and anomalous subspaces; large components in the anomalous subspace for any data vector indicate that that country at that time is not well-predicted by all the other countries, i.e. something fishy is going on.

They show that their algorithm can pick out previously known cases where a change in Tor usage is correlated with a change in censorship, and that its top ten most anomalous countries are mostly the countries one would expect to be suspicious by this metric—but also a couple that nobody had previously suspected, which they highlight as a matter needing further attention.

PCA used as an anomaly detector is a new idea on me. It seems like they could be extracting more information from it than they are. The graphs in this paper show what’s probably a global jump in Tor usage in mid-2013; this has a clear explanation, and they show that their detector ignores it (as it’s supposed to), but can they make their detector call it out separately from country-specific events? PCA should be able to do that. Similarly, it seems quite probable that the ongoing revolutions and wars in the Levant and North Africa are causing correlated changes to degree of censorship region-wide; PCA should be able to pull that out as a separate explanatory variable. These would both involve taking a closer look at the normal subspace and what each of its dimensions mean.

It also seems to me that a bit of preprocessing, using standard time series decomposition techniques, would clean up the analysis and make its results easier to interpret. There’s not one word about that possibility in the paper, which seems like a major omission; decomposition is the first thing that anyone who knows anything about time series analysis would think of. In this case, I think seasonal variation should definitely be factored out, and removing linear per-country trends might also helpful.

Sparse Overcomplete Word Vector Representations

It’s time for another of my occasional looks at a paper that doesn’t have anything to do with security. This one is about an attempt to bridge a gap between two modes of analysis in computational linguistics.

In a word vector analysis (also referred to as a distributed word representation in this paper) one maps words onto vectors in an abstract high-dimensional space, defined by the co-occurrence probabilities of each pair of words within some collection of documents (a corpus). This can be done entirely automatically with any source of documents; no manual preprocessing is required. Stock machine-learning techniques, applied to these vectors, perform surprisingly well on a variety of downstream tasks—classification of the words, for instance. However, the vectors are meaningless to humans, so it’s hard to use them in theoretical work that requires interpretability, or to combine them with other information (e.g. sentence structure). Lexical semantics, by contrast, relies on manual, word-by-word tagging with human-meaningful categories (part of speech, sense of word, role in sentence, etc) which is slow and expensive, but the results are much easier to use as a basis for a wide variety of further studies.

The paper proposes a technique for transforming a set of word vectors to make them more interpretable. It’s essentially the opposite of PCA. They project the vectors into a higher-dimensional space, one in which they are all sparse (concretely, more than 90% of the components of each vector are zero). Words that are semantically related will (they claim) share nonzero components of these vectors, so each component has more meaning than in the original space. The projection matrix can be generated by standard mathematical optimization techniques (specifically, gradient descent with some tweaks to ensure convergence).

To back up the claim that the sparse vectors are more interpretable, they first show that five stock downstream classification tasks achieve an average of 5% higher accuracy when fed sparse vectors than the dense vectors from which they were derived, and then that humans achieve 10% higher accuracy on a word intrusion task (one of these five words does not belong in the list, which is it?) when the words are selected by their rank along one dimension of the sparse vectors, than when they are selected the same way from the dense vectors. An example would probably help here: Table 6 of the paper quotes the top-ranked words from five dimensions of the sparse and the dense vectors:

Dense Sparse
combat, guard, honor, bow, trim, naval fracture, breathing, wound, tissue, relief
’ll, could, faced, lacking, seriously, scored relationships, connections, identity, relations
see, n’t, recommended, depending, part files, bills, titles, collections, poems, songs
due, positive, equal, focus, respect, better naval, industrial, technological, marine
sergeant, comments, critics, she, videos stadium, belt, championship, toll, ride, coach

Neither type of list is what you might call an ideal Platonic category,1 but it should be clear that the sparse lists have more meaning in them.

Because I’m venturing pretty far out of my field, it’s hard for me to tell how significant this paper is; I don’t have any basis for comparison. This is, in fact, the big thing missing from this paper: a comparison to other techniques for doing the same thing, if any. Perhaps the point is that there aren’t any, but I didn’t see them say so. I am also unclear on how you would apply this technique to anything other than an analysis of words. For instance, my own research right now involves (attempting to) mechanically assign topics to entire documents. Right now we’re reducing each document to a bag of words, carrying out LDA on the bags, and then manually labeling each LDA cluster with a topic. Could we use bags of sparse word vectors instead? Would that help LDA do its job better? Or is LDA already doing what this does? I don’t know the answers to these questions.

1 If you are under the impression that categories in natural language are even vaguely Platonic, go at once to your friendly local public library and request a copy of Women, Fire, and Dangerous Things.

Detecting Semantic Cloaking on the Web

Now for something a little different: today’s paper is about detecting search engine spam. Specifically, it’s about detecting when a Web site presents different content to a search engine’s crawler than it does to human visitors. As the article points out, this can happen for benign or even virtuous reasons: a college’s front page might rotate through a selection of faculty profiles, or a site might strip out advertising and other material that is only relevant to human visitors when it knows it’s talking to a crawler. However, it also happens when someone wants to fool the search engine into ranking their site highly for searches where they don’t actually have relevant material.

To detect such misbehavior, obviously one should record each webpage as presented to the crawler, and then again as presented to a human visitor, and compare them. The paper is about two of the technical challenges which arise when you try to execute this plan. (They do not claim to have solved all of the technical challenges in this area.) The first of these is, of course, how do you program a computer to tell when a detected difference is spam, versus when it is benign? and here they have done something straightforward: supervised machine classification. You could read this paper just as a case study in semi-automated feature selection for a machine classifier, and you would learn something. (Feature selection is somewhat of a black art—features that appear to be highly predictive may be accidents of your sample, and features that logically should be predictive might not work out at all. In this case, the positive features they list seem plausibly motivated, but several of the negative features (features which are anticorrelated with spamming) seem likely to be accidental. I would have liked to see more analysis of why each feature is predictive.)

The second technical challenge is less obvious: sites are updated frequently. You don’t want to mistake an update for any kind of variation between the crawl result and the human-visitor result. But it’s not practical to visit a site simultaneously as the crawler and as the human, just because of how many sites a crawl has to touch (and if you did, the spammers might be able to figure out that your human visit was an audit). Instead, you could visit the site repeatedly as each and see if the changes match, but this is expensive. The paper proposes to weed out sites that don’t change at all between the crawler visit and the human visit, and do the more expensive check only to the sites that do. A refinement is to use a heuristic to pick out changes that are more likely to be spam: presence of additional keywords or links in the crawler version, relative to the human version. In their tests, this cuts the number of sites that have to be investigated in detail by a factor of 10 (and could do better by refining the heuristic further). These kinds of manual filter heuristics are immensely valuable in classification problems when one of the categories (no cloaking) is much larger than the other(s), both because it reduces the cost of running the classifier (and, in this case, the cost of data collection), and because machine-learning classifiers often do better when the categories all have roughly the same number of examples.

This paper shouldn’t be taken as the last word in this area: it’s ten years old, its data set is respectable for an experiment but tiny compared to the global ’net, and false positive and negative rates of 7% and 15% (respectively) are much too large for production use. The false positive paradox is your nemesis when you are trying to weed spammers out of an index of 109 websites. We know from what little they’ve said about it in public (e.g. [1] [2]) that Google does something much more sophisticated. But it is still valuable as a starting point if you want to learn how to do this kind of research yourself.

Automated Detection and Fingerprinting of Censorship Block Pages

This short paper, from IMC last year, presents a re-analysis of data collected by the OpenNet Initiative on overt censorship of the Web by a wide variety of countries. Overt means that when a webpage is censored, the user sees an error message which unambiguously informs them that it’s censored. (A censor can also act deniably, giving the user no proof that censorship is going on—the webpage just appears to be broken.) The goal of this reanalysis is to identify block pages (the error messages) automatically, distinguish them from normal pages, and distinguish them from each other—a new, unfamiliar format of block page may indicate a new piece of software is in use to do the censoring.

The chief finding is that block pages can be reliably distinguished from normal pages just by looking at their length: block pages are typically much shorter than normal. This is to be expected, seeing that they are just an error message. What’s interesting, though, is that this technique works better than techniques that look in more detail at the contents of the page. I’d have liked to see some discussion of what kinds of misidentification appear for each technique, but there probably wasn’t room for that. Length is not an effective tactic for distinguishing block pages from each other, but term frequency is (they don’t go into much detail about that).

One thing that’s really not clear is how they distinguish block pages from ordinary HTTP error pages. They mention that ordinary errors introduce significant noise in term-frequency clustering, but they don’t explain how they weeded them out. It might have been done manually; if so, that’s a major hole in the overall automated-ness of this process.

Automated Experiments on Ad Privacy Settings

You’ve probably noticed the creepy effect where you consider buying something online, or maybe just look at a page for something that happens to be for sale, and for weeks afterward you get ads on totally unrelated websites for that thing or similar things. The more reputable online ad brokerages offer a degree of control over this effect (e.g.: Google, Microsoft). This study investigates exactly what effect those settings have on the ads observed by automated browsing agents. The basic idea is to set some of the knobs, visit websites that will tell the ad provider something about the simulated customer’s preferences, possibly adjust the knobs again, and finally record what is being advertised on a general-interest website.

A great deal of the paper is devoted to statistical methodology. Because the ad provider is a stateful black box, and one whose behavior may depend on uncontrollable external factors (e.g. that advertiser has exhausted their budget for the month), it’s vital to avoid as many statistical assumptions as possible. They use permutation tests and supervised classification (logistic regression), both of which make minimal assumptions. They’re also very careful about drawing conclusions from their results. I’m not much of a statistician, but it all sounds carefully thought out and plausible, with one exception: heavy reliance on significance testing, which has come in for serious criticism [1] to the point where some journals no longer accept its use at all [2]. This is exactly the sort of research where p-values can mislead; if I were reviewing this prior to publication I would have encouraged the authors to think about how they could present the same conclusions without using significance testing.

Now, the actual conclusions. Only Google’s ads were tested. (Expanding the tests to additional ad brokers is listed as future work.) They confirm that turning a particular topic (dating) off in the preferences does cause those ads to go away. They observe that two highly sensitive topics (substance abuse, disability) that do trigger targeted ads are not controllable via the preferences; in fact, they are completely invisible on that screen. And the most interesting case is when they set the ad preferences to explicitly reveal a gender (man or woman) then browsed a bunch of sites related to job searching. Simulated customers who claimed to be men got ads for a career coaching service which promised better odds of being hired into $200K+ executive positions; those who claimed to be women did not see these ads.

This last example clearly reflects the well-known glass ceiling which affects business in general, but (as the authors point out) it’s impossible to tell, from outside the black box, why it shows up in this case. The career-coaching service could have chosen to advertise only to men. Google’s ad-targeting algorithm could have been coded with (likely unconscious) bias in its category structure, or—this is the most probable explanation—its feedback mechanism could have detected that men are more likely to click on ads with those keywords, so it makes that even more likely by showing them preferentially to men. There’s a telling comment at the very end of the paper:

… We consider it more likely that Google has lost control over its massive, automated advertising system. Even without advertisers placing inappropriate bids, large-scale machine learning can behave in unexpected ways.

There’s a lesson here for all the big data companies: the best an unbiased machine learning system can hope to do is produce an accurate reflection of the training set—including whatever biases are in there. If you want to avoid reduplicating all the systemic biases of the offline world, you will have to write code to that effect.

An Automated Approach for Complementing Ad Blockers’ Blacklists

Last week’s long PETS paper was very abstract; this week’s paper is very concrete. The authors are concerned that manually-curated blacklists, as currently used by most ad-blocking software, cannot hope to keep up with the churn in the online ad industry. (I saw a very similar talk at WPES back in 2012 [1] which quoted the statistic that the default AdBlock Plus filter list contains 18,000 unique URLS, with new ones added at a rate of five to 15 every week.) They propose to train a machine classifier on network-level characteristics that differ between ad services and normal web sites, to automate detection of new ad providers and/or third-party privacy-invasive analytics services. (This is the key difference from the paper at WPES 2012: that project used static analysis of JavaScript delivered by third-party services to extract features for their classifier.)

They find that a set of five features provides a reasonably effective classification: proportion of requests that are third-party (for transclusion into another website), number of unique referrers, ratio of received to sent bytes, and proportion of requests including cookies. For the training set they used, they achieve 83% precision and 85% recall, which is reasonable for a system that will be used to identify candidates for manual inspection and addition to blacklists.

There are several methodological bits in here which I liked. They use entropy-based discretization and information gain to identify valuable features and discard unhelpful ones. They compare a classifier trained on manually-labeled data (from a large HTTP traffic trace) with a classifier trained on the default AdBlock Plus filter list; both find similar features, but the ABP filter list has better coverage of infrequently used ads or analytics services, whereas the manually labeled training set catches a bunch of common ads and analytics services that ABP missed.

One fairly significant gap is that the training set is limited to cleartext HTTP. There’s a strong trend nowadays toward HTTPS for everything, including ads, but established ad providers are finding it difficult to cut all their services over efficiently, which might provide an opportunity for new providers—and thus cause a shift toward providers that have been missed by the blacklists.

There’s almost no discussion of false positives. Toward the end there is a brief mention of third-party services like Gravatar and Flattr, that share a usage pattern with ads/analytics and showed up as false positives. But it’s possible to enumerate common types of third-party services (other than ads and analytics) a priori: outsourced commenting (Disqus,, social media share buttons (Facebook, Twitter), shared hosting of resources (jQuery, Google Fonts), static-content CDNs, etc. Probably, most of these are weeded out by the ratio of received to sent bytes check, but I would still have liked to see an explicit check of at least a few of these.

And finally, nobody seems to have bothered talking to the people who actually maintain the ABP filter lists to find out how they do it. (I suspect it relies strongly on manual, informal reporting to a forum or something.) If this is to turn into anything more than an experiment, people need to be thinking about integration and operationalization.