Papers tagged ‘Large sample’

The Most Controversial Topics in Wikipedia: A multilingual and geographical analysis

One of my more social-science-y interests lately has been in reverse-engineering the rationale for nation-state censorship policies from the data available. Any published rationale is almost always quite vague (harmful to public order, blasphemous, that sort of thing). Hard data in this area consists of big lists of URLs, domain names, and/or keywords that are known or alleged to be blocked. Keywords are great, when you can get them, but URLs are less helpful, and domain names even less so. I have a pretty good idea why the Federal Republic of Germany might have a problem with a site that sells sheet music of traditional European folk songs (actual example from #BPjMleak), but I don’t know which songs are at stake, because they blocked the entire site. I could find out, but I’d probably want a dose of brain bleach afterward. More to the point, no matter how strong my stomach is, I don’t have the time for the amount of manual research required to figure out the actual issue with all 3000 of the sites on that list—and that’s just one country, whose politics and history are relatively well known to me.

So, today’s paper is about mechanically identifying controversial Wikipedia articles. Specifically, they look through the revision history of each article for what they call mutual reverts, where two editors have each rolled back the other’s work. This is a conservative measure; edit warring on Wikipedia can be much more subtle. However, it’s easy to pick out mechanically. Controversial articles are defined as those where there are many mutual reverts, with extra weight given to mutual reverts by pairs of senior editors (people with many contributions to the entire encyclopedia). They ran this analysis for ten different language editions, and the bulk of the article is devoted to discussing how each language has interesting peculiarities in what is controversial. Overall, there’s strong correlation across languages, strong correlation with external measures of political or social controversy, and strong correlation with the geographic locations where each language is spoken. An interesting exception to that last is that the Middle East is controversial in all languages, even those that are mostly spoken very far from there; this probably reflects the ongoing wars in that area, which have affected everyone’s politics at least somewhat.
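To make the measure concrete, here is a minimal sketch of how a mutual-revert score might be computed, assuming we already have each article’s revert events as (reverting editor, reverted editor) pairs and a table of per-editor edit counts; the paper’s actual weighting differs in detail.

```python
from collections import defaultdict

def controversy_score(revert_pairs, edit_counts):
    """Rough controversy measure for one article.

    revert_pairs: iterable of (reverting_editor, reverted_editor) tuples
                  extracted from the article's revision history.
    edit_counts:  dict mapping editor -> total number of contributions
                  to the whole encyclopedia (a proxy for seniority).
    """
    reverted_by = defaultdict(set)
    for reverter, reverted in revert_pairs:
        reverted_by[reverter].add(reverted)

    score = 0
    seen = set()
    for a, targets in reverted_by.items():
        for b in targets:
            # A mutual revert: a reverted b AND b reverted a.
            if a in reverted_by.get(b, ()) and (b, a) not in seen:
                seen.add((a, b))
                # Weight the pair by the seniority of its junior member,
                # so wars between established editors count for more.
                score += min(edit_counts.get(a, 1), edit_counts.get(b, 1))
    return score
```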

What does this have to do with nation-state censorship? Well, politically controversial topics in language X ought to be correlated with topics censored by nation-states where X is commonly spoken. There won’t be perfect alignment; there will be topics that get censored that nobody bothers to argue about on Wikipedia (pornography, for instance) and there will be topics of deep and abiding Wikipedia controversy that nobody bothers to censor (Spanish football teams, for instance). But if an outbound link from a controversial Wikipedia article gets censored, it is reasonably likely that the censorship rationale has something to do with the topic of the article. The same is true of censored pages that share a significant number of keywords with a controversial article. It should be good enough for hypothesis generation and rough classification of censored pages, at least.
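As a toy illustration of that rough classification (my own sketch, not anything from the paper), one could score a censored page against each controversial article by keyword overlap; every name below is hypothetical.

```python
def keyword_overlap(censored_keywords, article_keywords):
    """Jaccard similarity between two keyword sets."""
    a, b = set(censored_keywords), set(article_keywords)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def likely_topics(censored_keywords, controversial_articles, threshold=0.2):
    """Return controversial articles whose keywords overlap the censored
    page's keywords enough to suggest a shared topic.

    controversial_articles: dict mapping article title -> keyword set.
    """
    scores = {title: keyword_overlap(censored_keywords, kws)
              for title, kws in controversial_articles.items()}
    return sorted((t for t, s in scores.items() if s >= threshold),
                  key=scores.get, reverse=True)
```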

Taster’s Choice: A Comparative Analysis of Spam Feeds

Here’s another paper about spam; this time it’s email spam, and they are interested not so much in the spam itself as in the differences between collections of spam (feeds) as used in research. They have ten different feeds, and they compare them to each other, looking only at the domain names that appear in each. The goal is to figure out whether each feed is an unbiased sample of all the spam being sent at any given time, and whether some types of feed are better at detecting particular sorts of spam. (Given this goal, looking only at the domain names is probably the most serious limitation of the paper, despite being brushed off with a footnote. It means they can’t say anything about spam that doesn’t contain any domain names, which may be rare, but is interesting because it’s rare and different from all the rest. They should have at least analyzed the proportion of it that appeared in each feed.)
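The comparison itself is conceptually simple; here is a sketch of the sort of pairwise overlap computation involved, assuming each feed has already been reduced to the set of registered domains appearing in it.

```python
from itertools import combinations

def pairwise_overlap(feeds):
    """Pairwise Jaccard overlap between spam feeds.

    feeds: dict mapping feed name -> set of registered domains seen in it.
    Returns a dict mapping (feed_a, feed_b) -> overlap in [0, 1].
    """
    overlap = {}
    for a, b in combinations(sorted(feeds), 2):
        da, db = feeds[a], feeds[b]
        union = da | db
        overlap[(a, b)] = len(da & db) / len(union) if union else 0.0
    return overlap
```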

The spam feeds differ primarily in how they collect their samples. There’s one source consisting exclusively of manually labeled spam (from a major email provider); two DNS blacklists (these provide only domain names, and are somehow derived from other types of feed); three MX honeypots (registered domains that accept email to any address, but are never used for legitimate mail); two seeded honey accounts (like honeypots, but a few addresses are made visible to attract more spam); one botnet-monitoring system; and one hybrid. They don’t have full details on exactly how they all work, which is probably the second most serious limitation.

The actual results of the analysis are basically what you would expect: manually labeled spam is lower-volume but has more unique examples in it, botnet spam is very high volume but has lots of duplication, and everything else is somewhere in between. They made an attempt to associate spam domains with affiliate networks (the business of spamming nowadays is structured as a multi-level marketing scheme) but they didn’t otherwise try to categorize the spam itself. I can think of plenty of additional things to do with the data set—which is the point: it says right in the abstract that “most studies [of email spam] use a single spam feed and there has been little examination of how such feeds may differ in content.” They’re not trying so much to produce a comprehensive analysis themselves as to alert people working in this subfield that they might be missing stuff by looking at only one data source.

Scandinista! Analyzing TLS Handshake Scans and HTTPS Browsing by Website Category

Today’s paper is a pilot study, looking into differences in adoption rate of HTTPS between various broad categories of websites (as defined by Alexa). They looked at raw availability of TLS service on port 443, and they also checked whether an HTTP request for the front page of each Alexa domain would redirect to HTTPS or vice versa. This test was conducted only once, and supplemented with historical data from the University of Michigan’s HTTPS Ecosystem project.
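For concreteness, here is roughly what those two checks look like, as a sketch rather than the authors’ actual tooling: a bare TLS handshake probe on port 443 using the standard library, and a redirect check using the third-party requests package. Error handling is minimal and the domain-to-host mapping is naive.

```python
import socket
import ssl

import requests  # third-party: pip install requests

def has_tls(domain, timeout=5):
    """True if the host completes a TLS handshake on port 443.

    Certificate validity is deliberately not checked: the question here
    is raw availability of TLS service, not whether the cert is good.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((domain, 443), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=domain):
                return True
    except (OSError, ssl.SSLError):
        return False

def redirects_to_https(domain, timeout=5):
    """Fetch the plain-HTTP front page, follow redirects, and report
    whether the final URL is HTTPS. Returns None on network failure."""
    try:
        r = requests.get("http://" + domain + "/", timeout=timeout,
                         allow_redirects=True)
        return r.url.startswith("https://")
    except requests.RequestException:
        return None
```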

As you would expect, there is a significant difference in the current level of HTTPS availability from one category to another. They only show this information for a few categories, but the pattern is not terribly surprising: shopping 82%, business 70%, advertisers 45%, adult 36%, news 30%, arts 26%. (They say “The relatively low score for Adult sites is surprising given that the industry has a large amount of paid content,” but I suspect this is explained by that industry’s habit of outsourcing payment processing, plus the ubiquitous misapprehension (not just in the adult category) that only payment processing is worth bothering to encrypt.) It is also not surprising to find that more popular sites are more likely to support HTTPS. And the enormous number of sites that redirect their front page away from HTTPS is depressing, but again, not surprising.

What’s more interesting to me is the trendlines, which show a steady, slow, category-independent, linear uptake rate. There’s a little bit of a bump in adult and news around 2013, but I suspect it’s just noise. (The “response growth over time” figure (number 2), which appears to show a category dependence, is improperly normalized and therefore misleading. You want to look only at figure 1.) The paper looks for a correlation with the Snowden revelations; I’d personally expect that the dominant causal factor here is the difficulty of setting up TLS, and I’d like to see them look for correlations with major changes in that: for instance, Cloudflare’s offering no-extra-cost HTTPS support [1], Mozilla publishing a server configuration guide [2], or the upcoming Let’s Encrypt no-hassle CA [3]. It might also be interesting to look at uptake rate as a function of ranking, rather than category; it seems like the big names are flocking to HTTPS lately, and it would be nice to know for sure.

The study has a number of methodological problems, which is OK for a pilot, but they need to be fixed before drawing serious conclusions. I already mentioned the normalization problem in figure 2: I think they took percentages of percentages, which doesn’t make sense. The right thing would’ve been to just subtract the initial level seen in figure 1 from each line, which (eyeballing figure 1) would have demonstrated an increase of about 5% in each category over the time period shown, with no evidence for a difference between categories.

But before we even get that far, there’s a question of the difference between an IP address (the unit of the UMich scans), a website (the unit of TLS certificates), and a domain (the unit of Alexa ranking). To take some obvious examples: there are hundreds, if not thousands, of IP addresses that will all answer to the name of www.google.com. Conversely, Server Name Indication permits one IP address to answer for dozens or even hundreds of encrypted websites, and that practice is even more common over unencrypted HTTP. And hovering around #150 in the Alexa rankings is amazonaws.com, which is the backing store for at least tens of thousands of different websites, each of which has its own subdomain and may or may not have configured TLS. The correct primary data sources for this experiment are not Alexa and IPv4 scanning, but DNS scanning and certificate transparency logs. (A major search engine’s crawl logs would also be useful, if you could get your hands on them.)

Finally, they should pick one set of 10–20 mutually exclusive, exhaustive categories (one of which would have to be Other) and use them consistently throughout the paper.
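To make the normalization complaint concrete, here is what I mean by subtracting the initial level; the numbers are invented purely for illustration.

```python
def baseline_adjusted(series):
    """Subtract the first observation from each point in a trend line,
    so categories that started at different levels can be compared on
    absolute growth rather than growth relative to their starting point."""
    base = series[0]
    return [x - base for x in series]

# Invented numbers, purely to illustrate the point:
shopping = [77, 79, 80, 82]   # percent of sites with HTTPS, over time
arts     = [21, 23, 24, 26]

print(baseline_adjusted(shopping))  # [0, 2, 3, 5]
print(baseline_adjusted(arts))      # [0, 2, 3, 5]  -- same absolute growth
# Dividing by the starting level instead would make arts look like it grew
# far faster, which is the misleading comparison figure 2 appears to make.
```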

Ad Injection at Scale: Assessing Deceptive Advertisement Modifications

Today we have a study of ad injection software, which runs on your computer and inserts ads into websites that didn’t already have them, or replaces the website’s ads with their own. (The authors concentrate on browser extensions, but there are several other places where such programs could be installed with the same effect.) Such software is, in 98 out of 100 cases (figure taken from paper), not intentionally installed; instead it is a side-load, packaged together with something else that the user intended to install, or else it is loaded onto the computer by malware.

The injected ads cannot easily be distinguished from ads that a website intended to run, by the person viewing the ads or by the advertisers. A website subjected to ad injection, however, can figure it out, because it knows what its HTML page structure is supposed to look like. This is how the authors detected injected ads on a variety of Google sites; they say that they developed software that can be reused by anyone, but I haven’t been able to find it. They say that Content-Security-Policy should also work, but that doesn’t seem right to me, because page modifications made by a browser extension should, in general, be exempt from CSP.
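Here is a much-simplified sketch of the site-side detection idea: the site knows which resource domains its page is supposed to load, so anything else appearing in the DOM as rendered on the client (reported back by a beacon, say) is suspect. This is my own illustration, not the authors’ tool, and it parses static HTML rather than a live DOM; the expected-domain list is hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical whitelist: the domains this site intends to load ads/scripts from.
EXPECTED_DOMAINS = {"ads.example.com", "cdn.example.com"}

class ResourceCollector(HTMLParser):
    """Collect the domains of all script/iframe/img resources in a page."""
    def __init__(self):
        super().__init__()
        self.domains = set()

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "iframe", "img"):
            for name, value in attrs:
                if name == "src" and value:
                    host = urlparse(value).netloc
                    if host:
                        self.domains.add(host.lower())

def unexpected_resources(rendered_html):
    """Resource domains found in the page that the site did not intend to load."""
    collector = ResourceCollector()
    collector.feed(rendered_html)
    return collector.domains - EXPECTED_DOMAINS
```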

The bulk of the paper is devoted to characterizing the ecosystem of ad-injection software: who makes it, how does it get onto people’s computers, what does it do? Like the malware ecosystem [1] [2], the core structure of this ecosystem is a layered affiliate network, in which a small number of vendors market ad-injection modules which are added to a wide variety of extensions, and broker ad delivery and clicks from established advertising exchanges. Browser extensions are in an ideal position to surveil the browser user and build up an ad-targeting profile, and indeed, all of the injectors do just that. Ad injection is often observed in conjunction with other malicious behaviors, such as search engine hijacking, affiliate link hijacking, social network spamming, and preventing uninstallation, but it’s not clear whether the ad injectors themselves are responsible for that (it could equally be that the extension developer is trying to monetize by every possible means).

There are some odd gaps. There is no mention of click fraud; it is easy for an extension to forge clicks, so I’m a little surprised the authors did not discuss the possibility. There is also no discussion of parasitic repackaging. This is a well-known problem with desktop software, with entire companies whose business model is to take software that someone else wrote and gives away for free, package it together with ad injectors and worse, and arrange to be what people find when they try to download that software. [3] [4] It wouldn’t surprise me if these were also responsible for an awful lot of the problematic extensions discussed in the paper.

An interesting tidbit, not followed up on, is that ad injection is much more common in South America, parts of Africa, South Asia, and Southeast Asia than in Europe, North America, Japan, or South Korea. (They don’t have data for China, North Korea, or all of Africa.) This could be because Internet users in the latter countries are more likely to know how to avoid deceptive software installers and malicious extensions, or, perhaps, just less likely to click on ads in general.

The solutions presented in this paper are rather weak: more aggressive weeding of malicious extensions from the Chrome Web Store and similar repositories, and reaching out to ad exchanges to encourage them to refuse service to injectors (if they can detect them, anyway). A more compelling solution would probably start with a look at who profits from bundling ad injectors with their extensions, and what alternative sources of revenue might be viable for them. Relatedly, I would have liked to see some analysis of what the problematic extensions’ overt functions were. There are legitimate reasons for an extension to add content to all webpages, e.g. [5] [6], but extension repositories could reasonably require more careful scrutiny of an extension that asks for that privilege.

It would also help if the authors acknowledged that the only difference between an ad injector and a legitimate ad provider is that the latter only runs ads on sites with the site’s consent. All of the negative impact to end users—behavioral tracking, pushing organic content below the fold or under interstitials, slowing down page loads, and so on—is present with site-solicited advertising. And the same financial catch-22 is present for website proprietors as extension developers: advertising is one of the only proven ways to earn revenue for a website, but it doesn’t work all that well, and it harms your relationship with your end users. In the end I think the industry has to find some other way to make money.

Links that speak: The global language network and its association with global fame

The paper we’re looking at today isn’t about security, but it’s relevant to anyone who’s doing field studies of online culture, which can easily become part of security research. My own research right now, for instance, touches on how the Internet is used for online activism and how that may or may not be safe for the activists; if you don’t have a handle on online culture—and how it varies worldwide—you’re going to have a bad time trying to study that.

What we have here is an analysis of language pairings as found on Twitter, Wikipedia, and a database of book translations. Either the same person uses both languages in a pair, or text has been translated from one to the other. Using these pairs, they identify hub languages that are very likely to be useful to connect people in distant cultures. These are mostly, but not entirely, the languages with the greatest number of speakers. Relative to their number of speakers, regional second languages like Malay and Russian show increased importance, and languages that are poorly coupled to English (which is unsurprisingly right in the middle of the connectivity graph), like Chinese, Arabic, and the languages of India, show reduced importance.
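The flavor of the analysis can be sketched in a few lines: build a weighted graph over languages and rank them by a centrality measure. This is an illustration only; the paper’s actual data handling, centrality measure, and normalization differ, and the edge weights below are invented.

```python
import networkx as nx  # third-party: pip install networkx

def hub_languages(pair_counts, top=5):
    """Rank languages by centrality in the language co-occurrence network.

    pair_counts: dict mapping (lang_a, lang_b) -> number of users or
                 translated works connecting the two languages.
    """
    g = nx.Graph()
    for (a, b), n in pair_counts.items():
        g.add_edge(a, b, weight=n)
    centrality = nx.eigenvector_centrality(g, weight="weight")
    return sorted(centrality, key=centrality.get, reverse=True)[:top]

# Invented edge weights, purely illustrative:
example = {("en", "es"): 900, ("en", "fr"): 800, ("en", "ru"): 400,
           ("ru", "kk"): 120, ("es", "pt"): 300, ("en", "zh"): 150}
print(hub_languages(example, top=3))
```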

There is then a bunch of hypothesizing about how the prominence of a language as a translation hub might influence how famous someone is and/or how easy it is for something from a particular region to reach a global audience. That’s probably what the authors of the paper thought was most important, but it’s not what I’m here for. What I’m here for is what the deviation between translation hub and widely spoken language tells us about how to do global field studies. It is obviously going to be more difficult to study an online subculture that conducts itself in a language you don’t speak yourself, and the fewer people do speak that language, the harder it will be for you to get some help. But if a language is widely spoken but not a translation hub, it may be difficult for you to get the right kind of help.

For instance, machine translations between the various modern vernaculars of Arabic and English are not very good at present. I could find someone who speaks any given vernacular Arabic without too much difficulty, but I’d probably have to pay them a lot of money to get them to translate 45,000 Arabic documents into English, or even just to tell me which of those documents were in their vernacular. (That number happens to be how many documents in my research database were identified as some kind of Arabic by a machine classifier—and even that doesn’t work as well as it could; for instance, it is quite likely to be confused by various Central Asian languages that can be written with an Arabic-derived script and have a number of Arabic loanwords but are otherwise unrelated.)

What can we (Western researchers, communicating primarily in English) do about it? First is just to be aware that global field studies conducted by Anglophones are going to be weak when it comes to languages poorly coupled to English, even when they are widely spoken. Indeed, the poor coupling itself makes me skeptical of this paper’s results for those languages. They only looked at three datasets, all of which are quite English-centric. Would it not have made sense to supplement that with polylingual resources centered on, say, Mandarin Chinese, Modern Standard Arabic, Hindi, and Swahili? These might be difficult to find, but not being able to find them would tend to confirm the original result, and if you could find them, you could both improve the lower bounds for coupling to English and get a finer-grained look at the languages that are well-translated within those clusters.

Down the road, it seems to me that whenever you see a language cluster that’s widely spoken but not very much in communication with any other languages, you’ve identified a gap in translation resources and cultural cross-pollination, and possibly an underserved market. Scanlations make a good example: the lack of officially-licensed translations of comic books (mostly of Japanese-language manga into English) spawned an entire subculture of unofficial translators, and that subculture is partially responsible for increasing mainstream interest in the material to the point where official translations are more likely to happen.

Analysis of the HTTPS Certificate Ecosystem

The Internet Measurement Conference brings us an attempt to figure out just how X.509 server certificates are being used in the wild, specifically for HTTPS servers. Yet more specifically, they are looking for endemic operational problems that harm security. The basic idea is to scan the entire IPv4 number space for servers responding on port 443, make note of the certificate(s) presented, and then analyze them.
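Here is a minimal, single-host version of the harvesting step, just to show what is being collected; the real study did this across the whole IPv4 space with purpose-built scanning infrastructure, which this sketch emphatically is not.

```python
import socket
import ssl

def grab_certificate(host, port=443, timeout=5):
    """Return the leaf certificate presented by host:port, as DER bytes.

    Verification is disabled on purpose: the point is to record whatever
    is presented, including expired, self-signed, or otherwise invalid
    certificates. Parse the result with a library such as cryptography.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)
```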

This research question is nothing new; the EFF famously ran a similar study back in 2010, the SSL Observatory. And operational concerns about the way certificates are used in the wild go back decades; see Peter Gutmann’s slide deck Everything you Never Wanted to Know about PKI but were Forced to Find Out (PDF). What makes this study interesting is, first, it’s three years later; things can change very fast in Internet land (although, in this case, they have not). Second, the scale: the authors claim to have successfully contacted 178% more TLS hosts (per scan) and harvested 736% more certificates (in total, over the course of 110 scans spanning a little more than a year) than any previous such study.

What do we learn? Mostly that yeah, the TLS PKI is a big mess, and it hasn’t gotten any better since 2010. There are too many root CAs. There are far too many unconstrained intermediate certificates, and yet, at the same time, there are too few intermediates! (The point of intermediates is that they’re easy to replace, so if they get compromised you don’t have a catastrophe on your hands. Well, according to this paper, some 26% of all currently valid HTTPS server certificates are signed by one intermediate. No way is that going to be easy to replace if it gets compromised.) Lots of CAs ignore the baseline policies for certificate issuance and get away with it. (Unfortunately, the paper doesn’t say whether there are similar problems with the EV policies.)
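To make the concentration point concrete, the tally behind a statistic like that 26% figure is just a frequency count over issuers, assuming the harvested certificates have already been parsed into records with an issuer field (a hypothetical structure, for illustration).

```python
from collections import Counter

def issuer_concentration(certs):
    """Fraction of leaf certificates signed by the single most common issuer.

    certs: iterable of parsed certificate records, each with an `issuer`
           attribute naming the signing (intermediate) certificate.
    """
    counts = Counter(cert.issuer for cert in certs)
    if not counts:
        return 0.0
    top_issuer, top_count = counts.most_common(1)[0]
    return top_count / sum(counts.values())
```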

Zoom out: when you have a piece of critical infrastructure with chronic operational issues, it’s a safe bet that they’re symptoms and the real problem is with operator incentives. The paper doesn’t discuss this at all, unfortunately, so I’ll throw in some speculation here. The browser vendors are notionally in the best position to Do Something about this mess, but they don’t, because the only real option they have is to delete root certificates from the Official List. Not only does this tend to put the offending CA out of business, it also causes some uncertain-but-large number of websites (most or all of which didn’t do anything wrong) to stop working. Such a drastic sanction is almost never seen to be appropriate. Browsers have hardly any positive incentives to offer CAs for doing things right; note that EV certificates, which get special treatment in the browser UI and can therefore be sold at a premium, do come with a tighter set of CA requirements (stronger crypto, reliable OCSP, that sort of thing) which are, as far as I’m aware, actually followed.

Zoom out again: there’s no shortage of technical suggestions that could turn into less drastic sanctions and incentives for the CAs, but they never get implemented. Why? Well, if you ask me, it’s because both OpenSSL and NSS are such terrible code that nobody wants to hack on them, and the brave souls who do it anyway are busy chipping away at the mountain of technical debt and/or at features that are even more overdue. This, though, we know how to fix. It only takes money.