Papers published in 2014

The Most Controversial Topics in Wikipedia: A multilingual and geographical analysis

One of my more social-science-y interests lately has been in reverse-engineering the rationale for nation-state censorship policies from the data available. Any published rationale is almost always quite vague (harmful to public order, blasphemous, that sort of thing). Hard data in this area consists of big lists of URLs, domain names, and/or keywords that are known or alleged to be blocked. Keywords are great, when you can get them, but URLs are less helpful, and domain names even less so. I have a pretty good idea why the Federal Republic of Germany might have a problem with a site that sells sheet music of traditional European folk songs (actual example from #BPjMleak), but I don’t know which songs are at stake, because they blocked the entire site. I could find out, but I’d probably want a dose of brain bleach afterward. More to the point, no matter how strong my stomach is, I don’t have the time for the amount of manual research required to figure out the actual issue with all 3000 of the sites on that list—and that’s just one country, whose politics and history are relatively well known to me.

So, today’s paper is about mechanically identifying controversial Wikipedia articles. Specifically, they look through the revision history of each article for what they call mutual reverts, where two editors have each rolled back the other’s work. This is a conservative measure; edit warring on Wikipedia can be much more subtle. However, it’s easy to pick out mechanically. Controversial articles are defined as those where there are many mutual reverts, with extra weight given to mutual reverts by pairs of senior editors (people with many contributions to the entire encyclopedia). They ran this analysis for ten different language editions, and the bulk of the article is devoted to discussing how each language has interesting peculiarities in what is controversial. Overall, there’s strong correlation across languages, strong correlation with external measures of political or social controversy, and strong correlation with the geographic locations where each language is spoken. An interesting exception to that last is that the Middle East is controversial in all languages, even those that are mostly spoken very far from there; this probably reflects the ongoing wars in that area, which have affected everyone’s politics at least somewhat.
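
To make the measure concrete, here is a minimal sketch of a mutual-revert controversy score, weighting each mutually reverting pair of editors by the less experienced editor’s total contribution count. The data structures and the exact weighting are illustrative assumptions, not the paper’s precise formula.

```python
def controversy_score(reverts, edit_counts):
    """Sketch of a mutual-revert controversy measure.

    reverts:     iterable of (reverter, reverted) editor-name pairs,
                 one entry per revert in the article's history.
    edit_counts: dict mapping an editor to their total number of edits
                 across the whole encyclopedia (a proxy for seniority).
    """
    revert_pairs = set(reverts)
    mutual = {frozenset(p) for p in revert_pairs
              if p[0] != p[1] and (p[1], p[0]) in revert_pairs}
    # Weight each mutually reverting pair by the *less* experienced
    # editor, so a war between two senior editors counts for more than
    # a senior editor repeatedly squashing a drive-by vandal.
    return sum(min(edit_counts.get(a, 0), edit_counts.get(b, 0))
               for a, b in (tuple(pair) for pair in mutual))
```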

What does this have to do with nation-state censorship? Well, politically controversial topics in language X ought to be correlated with topics censored by nation-states where X is commonly spoken. There won’t be perfect alignment; there will be topics that get censored that nobody bothers to argue about on Wikipedia (pornography, for instance) and there will be topics of deep and abiding Wikipedia controversy that nobody bothers to censor (Spanish football teams, for instance). But if an outbound link from a controversial Wikipedia article gets censored, it is reasonably likely that the censorship rationale has something to do with the topic of the article. The same is true of censored pages that share a significant number of keywords with a controversial article. It should be good enough for hypothesis generation and rough classification of censored pages, at least.

I Know What You’re Buying: Privacy Breaches on eBay

eBay intends that no one else should be able to figure out what you’re in the habit of buying on the site. Because of that, lots of people consider eBay the obvious place to buy things you’d rather your neighbors not know you bought (a survey in this very paper confirms as much). However, this paper demonstrates that a determined adversary can figure out what you bought.

(Caveat: This paper is from 2014. I do not know whether eBay has made any changes since it was published.)

eBay encourages both buyers and sellers to leave feedback on each other, the idea being to encourage fair dealing by attaching a persistent reputation to everyone. Feedback is associated with specific transactions, and anyone (whether logged into eBay or not) can see each user’s complete feedback history. Items sold are visible, items bought are not, and buyers’ identities are obscured. The catch is, you can match up buyer feedback with seller feedback by the timestamps, using the obscured buyer identities as a disambiguator, and thus learn what was bought. It involves crawling a lot of user pages, but it’s possible to do this in a couple of days without coming to eBay’s notice.
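
The linking step is essentially a join on public data. Here is a sketch of the idea, assuming hypothetical record formats for what a crawler could scrape from feedback pages; the real attack also has to cope with pagination, rate limits, and timestamp skew.

```python
from collections import defaultdict

def link_purchases(seller_feedback, buyer_feedback):
    """Sketch of matching buyer and seller feedback.

    seller_feedback: (obscured_buyer_id, timestamp, item_title) tuples
                     scraped from sellers' public feedback pages.
    buyer_feedback:  (buyer_handle, obscured_buyer_id, timestamp) tuples
                     scraped from buyers' public feedback pages.
    Field names are hypothetical, not eBay's actual schema.
    """
    items_by_key = defaultdict(list)
    for obscured, ts, item in seller_feedback:
        items_by_key[(obscured, ts)].append(item)

    purchases = defaultdict(list)
    for handle, obscured, ts in buyer_feedback:
        # Same obscured identity + same feedback timestamp strongly
        # suggests both entries describe the same transaction.
        purchases[handle].extend(items_by_key.get((obscured, ts), []))
    return purchases
```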

They demonstrate that this is a serious problem by identifying users who purchased gun holsters (eBay does not permit the sale of actual firearms), pregnancy tests, and HIV tests. As an additional fillip they show that people commonly use the same handle on eBay as on Facebook, and therefore purchase histories can be correlated with all the personal information one can typically extract from a Facebook profile.

Solutions are pretty straightforward and obvious—obscured buyer identities shouldn’t be correlated with their real handles; feedback timestamps should be quantized to weeks or even months; feedback on buyers might not be necessary anymore; eBay shouldn’t discourage use of various enhanced-privacy modes, or should maybe even promote them to the default. (Again, I don’t know whether any of these solutions has been implemented.) The value of the paper is perhaps more in reminding website developers in general that cross-user correlations are always a privacy risk.

RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response

Lots of software developers would love to know in detail how people use their work, and to that end, there’s an increasing number of programs that phone home with crash reports, heat maps of which UI elements get used, and even detailed logs of every interaction. The problem, of course, is how to collect as much useful information as possible without infringing on user privacy. Usually, what the developers want are aggregate statistics—how often does this widget get used by everyone, that sort of thing—so the logical method to apply is differential privacy. However, stock differential privacy algorithms assume a central, trusted database, and in this case, the only people who could run the database are the developers—the very same people against whom individual responses should be protected. Also, differential privacy’s mathematical guarantees are eroded if you repeatedly ask the same survey questions of the same population—which is exactly what you want to do to understand how user behavior changes over time.

This paper solves both problems with a combination of randomized response and Bloom filters. Randomized response is an old, ingenious idea for asking sensitive yes-no survey questions such that any one person has plausible deniability, but the true aggregate number of yes responses is still known: each survey participant secretly flips a coin before answering, and if it comes up heads they answer yes whether or not that’s true. Otherwise they answer honestly. After the fact, everyone can claim to have answered yes because of the coin flip, but to recover the true population statistic one simply doubles the number of no answers. Bloom filters (which I’m not going to try to explain) are used to extend this from yes-no to reporting sets of strings. Finally, there are two stages of randomization. Given a set of survey questions, the system computes a Permanent randomized response which is, as the name implies, reused until either the answers or the questions change. This prevents privacy erosion, because each user is always either answering honestly or falsely to any given question; a nosy server cannot average out the coin tosses. Additional instantaneous randomness is added each time a report is submitted, to prevent the report being a fingerprint for that user. The system is said to be in use for telemetry from the Chrome browser, and they give several examples of data collected in practice.
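
The two stages of randomization are compact enough to sketch. This follows the paper’s high-level description; the Bloom filter size, hash count, and the probabilities f, p, q below are illustrative placeholders rather than Chrome’s deployed parameters.

```python
import hashlib
import random

K = 32               # Bloom filter size in bits (illustrative)
H = 2                # number of hash functions (illustrative)
F = 0.5              # permanent randomization probability
P, Q = 0.25, 0.75    # instantaneous probabilities for 0-bits and 1-bits

def bloom_bits(value):
    """Hash a reported string into a K-bit Bloom filter."""
    bits = [0] * K
    for i in range(H):
        digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
        bits[int.from_bytes(digest[:4], "big") % K] = 1
    return bits

def permanent_response(bits, rng):
    # Computed once per (client, value) and memoized: each bit is kept
    # with probability 1-F, otherwise replaced by a fair coin.  Reusing
    # the same noisy bits forever is what stops a nosy server from
    # averaging the randomness away over many reports.
    return [b if rng.random() > F else rng.randint(0, 1) for b in bits]

def instantaneous_response(perm, rng):
    # Fresh randomness on every report, so successive reports from the
    # same client do not form a stable fingerprint.
    return [1 if rng.random() < (Q if b else P) else 0 for b in perm]

rng = random.Random()
perm = permanent_response(bloom_bits("example.com"), rng)  # cache this
report = instantaneous_response(perm, rng)                 # send this one
```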

The Permanent randomized responses illustrate a basic tension in security design: you can often get better security outcomes versus a remote attacker if you keep local state, but that local state is probably high-value information for a local attacker. For example, any system involving trust on first use will store a list of frequently contacted remote peers, with credentials, on the local machine; tampering with that can destroy the system’s security guarantees, and simply reading it tells you, at a minimum, which remote peers you regularly connect to. The TAILS live-CD is intended to leave no trace on the system it’s run on, which means it changes its entry guards on every reboot, increasing the chance of picking a malicious guard. In this case, a local adversary who can read a Chrome browser profile has access to much higher-value data than the Permanent responses, but a cautious user who erases their browser profile on every shutdown, to protect against that local adversary, is harming their own security against the Chrome developers. (This might be the right tradeoff, of course.)

I also wonder what a network eavesdropper learns from the telemetry reports. Presumably they are conveyed over a secure channel, but at a minimum, that still reveals that the client software is installed, and the size of the upload might reveal things. A crash report, for instance, is probably much bulkier than a statistics ping, and is likely to be submitted immediately rather than on a schedule.

Automated Detection and Fingerprinting of Censorship Block Pages

This short paper, from IMC last year, presents a re-analysis of data collected by the OpenNet Initiative on overt censorship of the Web by a wide variety of countries. Overt means that when a webpage is censored, the user sees an error message which unambiguously informs them that it’s censored. (A censor can also act deniably, giving the user no proof that censorship is going on—the webpage just appears to be broken.) The goal of this reanalysis is to identify block pages (the error messages) automatically, distinguish them from normal pages, and distinguish them from each other—a new, unfamiliar format of block page may indicate a new piece of software is in use to do the censoring.

The chief finding is that block pages can be reliably distinguished from normal pages just by looking at their length: block pages are typically much shorter than normal. This is to be expected, seeing as they are just error messages. What’s interesting, though, is that this technique works better than techniques that look in more detail at the contents of the page. I’d have liked to see some discussion of what kinds of misidentification appear for each technique, but there probably wasn’t room for that. Length is not effective for distinguishing block pages from each other, but term frequency is (they don’t go into much detail about that).
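
The length heuristic is about as simple as a classifier gets. A minimal sketch, assuming you have both the fetched page and an uncensored reference copy to compare against; the threshold is an invented placeholder, not the paper’s fitted value.

```python
def looks_like_block_page(fetched_html, reference_html, threshold=0.30):
    """Flag a fetch as a probable block page when its length differs
    from an uncensored reference copy by more than `threshold`
    (expressed as a fraction of the longer of the two)."""
    a, b = len(fetched_html), len(reference_html)
    if max(a, b) == 0:
        return False
    return abs(a - b) / max(a, b) > threshold
```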

One thing that’s really not clear is how they distinguish block pages from ordinary HTTP error pages. They mention that ordinary errors introduce significant noise in term-frequency clustering, but they don’t explain how they weeded them out. It might have been done manually; if so, that’s a major hole in the overall automated-ness of this process.

Censorship in the Wild: Analyzing Internet Filtering in Syria

Last week we looked at a case study of Internet filtering in Pakistan; this week we have a case study of Syria. (I think this will be the last such case study I review, unless I come across a really compelling one; there’s not much new I have to say about them.)

This study is chiefly interesting for its data source: a set of log files from the Blue Coat brand DPI routers that are allegedly used [1] [2] to implement Syria’s censorship policy, covering a 9-day period in July and August of 2011, leaked by the Telecomix hacktivist group. Assuming that these log files are genuine, this gives the researchers what we call ground truth: they can be certain that sites appearing in the logs are, or are not, censored. (This doesn’t mean they know the complete policy, though. The routers’ blacklists could include sites or keywords that nobody tried to visit during the time period covered by the logs.)

With ground truth it is possible to make more precise deductions from the phenomena. For instance, when the researchers see URLs of the form http://a1b2.cdn.example/adproxy/cyber/widget blocked by the filter, they know (because the logs say so) that the block is due to a keyword match on the string “proxy”, rather than the domain name, the IP address, or any other string in the HTTP request. This, in turn, enables them to describe the censorship policy quite pithily: Syrian dissident political organizations, anything and everything to do with Israel, instant messaging tools, and circumvention tools are all blocked. This was not possible in the Pakistani case—for instance, they had to guess at the exact scope of the porn filter.
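
The inferred rule is simple enough to state in code. Here is a sketch of substring keyword matching over the whole request line; the keyword list is illustrative, standing in for whatever the routers were actually configured with.

```python
KEYWORDS = ["proxy", "israel"]  # illustrative; not the routers' real list

def matching_keyword(http_request_line):
    """Return the first keyword found anywhere in the request, or None.
    Matching the whole request line is why an innocuous CDN URL that
    merely contains 'adproxy' gets censored as collateral damage."""
    line = http_request_line.lower()
    return next((kw for kw in KEYWORDS if kw in line), None)

matching_keyword("GET http://a1b2.cdn.example/adproxy/cyber/widget HTTP/1.1")
# -> "proxy"
```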

Because the leaked logs cover only a very short time window, it’s not possible to say anything about the time evolution of Syrian censorship, which is unfortunate, considering the tumultuous past few years that the country has had.

The leak is several years old, and it shows heavy reliance on keyword filtering; it would be interesting to know whether this has changed since, given that the increasing use of HTTPS makes keyword filtering less useful. For instance, since 2013 Facebook has defaulted to HTTPS for all users. This would have made it much harder for Syria to block access to specific Facebook pages, as it was doing during the period covered by these logs.

A Look at the Consequences of Internet Censorship Through an ISP Lens

When a national government decides to block access to an entire category of online content, naturally people who wanted to see that content—whatever it is—will try to find workarounds. Today’s paper is a case study of just such behavior. The authors were given access to a collection of bulk packet logs taken by an ISP in Pakistan. The ISP had captured a day’s worth of traffic on six days ranging from October 2011 through August 2013, a period that included two significant changes to the national censorship policy. In late 2011, blocking access to pornography became a legal mandate (implemented as a blacklist of several thousand sites, maintained by the government and disseminated to ISPs in confidence—the authors were not allowed to see this blacklist). In mid-2012, access to Youtube was also blocked, in retaliation for hosting anti-Islamic videos [1]. The paper analyzes the traffic in aggregate to understand broad trends in user behavior and how these changed in response to the censorship.

The Youtube block triggered an immediate and obvious increase in encrypted traffic, which the authors attribute to an increased use of circumvention tools—the packet traces did not record enough information to identify exactly what tool, or to discriminate circumvention from other encrypted traffic, but it seems a reasonable assumption. Over the next several months, alternative video sharing/streaming services rose in popularity; as of the last trace in the study, they had taken over roughly 80% of the market share formerly held by Youtube.

Users responded quite differently to the porn block: roughly half of the inbound traffic formerly attributable to porn just disappeared, but the other half was redirected to different porn sites that didn’t happen to be on the official blacklist. The censorship authority did not react by adding the newly popular sites to the blacklist. Perhaps a 50% reduction in overall consumption of porn was good enough for the politicians who wanted the blacklist in the first place.

The paper also contains some discussion of the mechanism used to block access to censored domains. This confirms prior literature [2] so I’m not going to go into it in great detail; we’ll get to those papers eventually. One interesting tidbit (also previously reported) is that Pakistan has two independent filters: one implemented by local ISPs, which falsifies DNS responses, and another operating in the national backbone, which forges TCP RSTs and/or HTTP redirections.

The authors don’t talk much about why user response to the Youtube block was so different from the response to the porn block, but it’s evident from their discussion of what people do right after they hit a block in each case. This is very often a search engine query (unencrypted, so visible in the packet trace). For Youtube, people either search for proxy/circumvention services, or they enter keywords for the specific video they wanted to watch, hoping to find it elsewhere, or at least a transcript. For porn, people enter keywords corresponding to a general type of material (sex act, race and gender of performers, that sort of thing), which suggests that they don’t care about finding a specific video, and will be content with whatever they find on a site that isn’t blocked. This is consistent with analysis of viewing patterns on a broad-spectrum porn hub site [3]. It’s also consistent with the way Youtube is integrated into online discourse—people very often link to or even embed a specific video on their own website, in order to talk about it; if you can’t watch that video you can’t participate in the conversation. I think this is really the key finding of the paper, since it gets at when people will go to the trouble of using a circumvention tool.

What the authors do talk about is the consequences of these blocks on the local Internet economy. In particular, Youtube had donated a caching server to the ISP in the case study, so that popular videos would be available locally rather than clogging up international data channels. With the block and the move to proxied, encrypted traffic, the cache became useless and the ISP had to invest in more upstream bandwidth. On the other hand, some of the video services that came to substitute for Youtube were Pakistani businesses, so that was a net win for the local economy. This probably wasn’t intended by the Pakistani government, but in similar developments in China [4] and Russia [5], import substitution is clearly one of the motivating factors. From the international-relations perspective, that’s also highly relevant: censorship only for ideology’s sake probably won’t motivate a bureaucracy as much as censorship that’s seen to be in the economic interest of the country.

Why Doesn’t Jane Protect Her Privacy?

Today’s paper is very similar to What Deters Jane from Preventing Identification and Tracking on the Web? and shares an author. The main difference is that it’s about email rather than the Web. The research question is the same: why don’t people use existing tools for enhancing the security and privacy of their online communications? (In this case, specifically tools for end-to-end encryption of email.) The answers are also closely related. As before, many people think no one would bother snooping on them because they aren’t important people. They may understand that their webmail provider reads their email in order to select ads to display next to it, but find this acceptable, and believe that the provider can be trusted not to do anything else with its knowledge. They may believe that the only people in a position to read their email are nation-state espionage agencies, and that trying to stop them is futile. All of these are broadly consistent with the results for the Web.

A key difference, though, is that users’ reported understanding of email-related security risks is often about a different category of threat that end-to-end encryption doesn’t help with: spam, viruses, and phishing. In fact, it may hurt: one of Gmail’s (former) engineers went on record with a detailed argument for why their ability to read all their users’ mail was essential to their ability to filter spam. [1] I’m not sure that isn’t just a case of not being able to see out of their local optimum, but it certainly does make the job simpler. Regardless, it seems to me that spam, viruses, and phishing are a much more visible and direct threat to the average email user’s personal security than any sort of surveillance. Choosing to use a service that’s very good at filtering, even at some cost in privacy, therefore strikes me as a savvy choice rather than an ignorant one. Put another way, I think a provider of end-to-end encrypted email needs to demonstrate that it can filter junk just as effectively if it wants to attract users away from existing services.

(In the current world, encryption is a signal of not being spam, but in a world where most messages were encrypted, spammers would start using encryption, and so would your PHB who keeps sending you virus-infected spreadsheets that you have to look at for your job.)

Another key difference is, you can unilaterally start using Tor, anti-tracking browser extensions, and so on, but you can’t unilaterally start encrypting your email. You can only send encrypted email to people who can receive encrypted email. Right now, that means there is a strong network effect against the use of encrypted email. There’s not a single word about this in the paper, and I find that a serious omission. It does specifically say that they avoided asking people about their experiences (if any) with PGP and similar software because they didn’t want to steer their thinking that way, but I think that was a mistake. It means they can’t distinguish what people think about email privacy in general from what they think about end-to-end encryption tools that they may have tried, or at least heard of. There may be a substantial population of people who looked into PGP just far enough to discover that it’s only useful if the recipient also uses it, and don’t think of it anymore unless specifically prompted about it.

Regional Variation in Chinese Internet Filtering

This is one of the earlier papers that looked specifically for regional variation in China’s internet censorship; as I mentioned when reviewing Large-scale Spatiotemporal Characterization of Inconsistencies in the World’s Largest Firewall, assuming that censorship is monolithic is unwise in general and especially so for a country as large, diverse, and technically sophisticated as China. This paper concentrates on variation in DNS-based blockade: they probed 187 DNS servers in 29 Chinese cities (concentrated, like the population, toward the east of the country) for a relatively small number of sites, both highly likely and highly unlikely to be censored within China.

The results reported are maybe better described as inconsistencies among DNS servers than regional variation. For instance, there are no sites called out as accessible from one province but not another. Rather, roughly the same set of sites is blocked in all locales, but all of the blocking is somewhat leaky, and some DNS servers are more likely to leak—regardless of the site—than others. The type of DNS response when a site is blocked also varies from server to server and site to site; observed behaviors include no response at all, an error response, or (most frequently) a success response with an incorrect IP address. Newer papers (e.g. [1] [2]) have attempted to explain some of this in terms of the large-scale network topology within China, plus periodic outages when nothing is filtered at all, but I’m not aware of any wholly compelling analysis (and short of a major leak of internal policy documents, I doubt we can ever have one).
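
For concreteness, here is a sketch of the probing step using the dnspython library; the resolver address and domain are placeholders, and in practice deciding whether an answer is “wrong” means comparing it against answers obtained from uncensored vantage points.

```python
import dns.exception
import dns.resolver   # pip install dnspython

def probe(resolver_ip, domain, honest_ips):
    """Classify one resolver's behavior for one domain.
    honest_ips: set of addresses the domain legitimately resolves to,
    gathered beforehand from an uncensored vantage point."""
    res = dns.resolver.Resolver(configure=False)
    res.nameservers = [resolver_ip]
    res.lifetime = 5
    try:
        answer = {rr.address for rr in res.resolve(domain, "A")}
    except dns.exception.Timeout:
        return "no response"
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return "error response"
    return "ok" if answer & honest_ips else "wrong address"

# e.g. probe("203.0.113.53", "example.com", {"93.184.216.34"})
```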

There’s also an excellent discussion of the practical and ethical problems with this class of research. I suspect this was largely included to justify the author’s choice to only look at DNS filtering, despite its being well-known that China also uses several other techniques for online censorship. It nonetheless provides valuable background for anyone wondering about methodological choices in this kind of paper. To summarize:

  • Many DNS servers accept queries from the whole world, so they can be probed directly from a researcher’s computer; however, they might vary their response depending on the apparent location of the querent, their use of UDP means it’s hard to tell censorship by the server itself from censorship by an intermediate DPI router, and there’s no way to know the geographic distribution of their intended clientele.

  • Studying most other forms of filtering requires measurement clients within the country of interest. These can be dedicated proxy servers of various types, or computers volunteered for the purpose. Regardless, the researcher risks inflicting legal penalties (or worse) on the operators of the measurement clients; even if the censorship authority normally takes no direct action against people who merely try to access blocked material, they might respond to a sufficiently high volume of such attempts.

  • Dedicated proxy servers are often blacklisted by sites seeking to reduce their exposure to spammers, scrapers, trolls, and DDoS attacks; a study relying exclusively on such servers will therefore tend to overestimate censorship.

  • Even in countries with a strong political commitment to free expression, there are some things that are illegal to download or store; researchers must take care not to do so, and the simplest way to do that is to avoid retrieving anything other than text.

Censorship Resistance: Let a Thousand Flowers Bloom?

This short paper presents a simple game-theoretic analysis of a late stage of the arms race between a censorious national government and the developers of tools for circumventing that censorship. Keyword blocking, IP-address blocking, and protocol blocking for known circumvention protocols have all been instituted and then evaded. The circumvention tool is now steganographically masking its traffic so it is indistinguishable from some commonly-used, innocuous cover protocol or protocols; the censor, having no way to unmask this traffic, must either block all use of the cover protocol, or give up.

The game-theoretic question is, how many cover protocols should the circumvention tool implement? Obviously, if there are several protocols, then the tool is resilient as long as not all of them are blocked. On the other hand, implementing more cover protocols requires more development effort, and increases the probability that some of them will be imperfectly mimicked, making the tool detectable. [1] This might seem like an intractable question, but the lovely thing about game theory is it lets you demonstrate that nearly all the fine details of each player’s utility function are irrelevant. The answer: if there’s good reason to believe that protocol X will never be blocked, then the tool should only implement protocol X. Otherwise, it should implement several protocols, based on some assessment of how likely each protocol is to be blocked.
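
A toy expected-utility calculation shows the shape of the result. Everything here (blocking probabilities, costs, the independence assumption) is invented for illustration; the paper’s actual model is more careful.

```python
from itertools import combinations

# Hypothetical probability that the censor ends up blocking each cover
# protocol outright, and a per-protocol development cost.
BLOCK_PROB = {"https": 0.05, "voip": 0.40, "video-chat": 0.30}
DEV_COST = 1.0         # cost of implementing one cover protocol
SURVIVAL_VALUE = 10.0  # value of the tool remaining usable

def expected_utility(subset):
    # The tool survives unless *every* implemented protocol is blocked
    # (blocking treated as independent events, purely for illustration).
    p_all_blocked = 1.0
    for proto in subset:
        p_all_blocked *= BLOCK_PROB[proto]
    return SURVIVAL_VALUE * (1 - p_all_blocked) - DEV_COST * len(subset)

candidates = [s for r in range(1, len(BLOCK_PROB) + 1)
              for s in combinations(BLOCK_PROB, r)]
best = max(candidates, key=expected_utility)
print(best, round(expected_utility(best), 2))
# With a near-unblockable protocol available ("https" at 5%), implementing
# only that one wins; raise its blocking probability and the optimum
# shifts to implementing several protocols.
```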

In real life there probably won’t be a clear answer to “will protocol X ever be blocked?” As the authors themselves point out, the censors can change their minds about that quite abruptly, in response to political conditions. So, in real life several protocols will be needed, and that part of the analysis in this paper is not complete enough to give concrete advice. Specifically, it offers a stable strategy for the Nash equilibrium (that is, neither party can improve their outcome by changing the strategy) but, again, the censors might abruptly change their utility function in response to political conditions, disrupting the equilibrium. (The circumvention tool’s designers are probably philosophically committed to free expression, so their utility function can be assumed to be stable.) This requires an adaptive strategy. The obvious adaptive strategy is for the tool to use only one or two protocols at any given time (using more than one protocol may also improve the verisimilitude of the overall traffic as seen by the censors) but implement several others, and be able to activate them if one of the others stops working. The catch here is that the change in behavior may itself reveal the tool to the censor. Also, it requires all the engineering effort of implementing multiple protocols, but some fraction of that may go to waste.

The paper also doesn’t consider what happens if the censor is capable of disrupting a protocol in a way that only mildly inconveniences normal users of that protocol, but renders the circumvention tool unusable. (For instance, the censor could be able to remove the steganography without necessarily knowing that it is there. [2]) I think this winds up being equivalent to the censor being able to block that protocol without downside, but I’m not sure.

Links that speak: The global language network and its association with global fame

The paper we’re looking at today isn’t about security, but it’s relevant to anyone who’s doing field studies of online culture, which can easily become part of security research. My own research right now, for instance, touches on how the Internet is used for online activism and how that may or may not be safe for the activists; if you don’t have a handle on online culture—and how it varies worldwide—you’re going to have a bad time trying to study that.

What we have here is an analysis of language pairings as found on Twitter, Wikipedia, and a database of book translations. Either the same person uses both languages in a pair, or text has been translated from one to the other. Using these pairs, they identify hub languages that are very likely to be useful to connect people in distant cultures. These are mostly, but not entirely, the languages with the greatest number of speakers. Relative to their number of speakers, regional second languages like Malay and Russian show increased importance, and languages that are poorly coupled to English (which is unsurprisingly right in the middle of the connectivity graph), like Chinese, Arabic, and the languages of India, show reduced importance.
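
A toy version of the hub computation, assuming the networkx package; betweenness centrality stands in for the paper’s centrality measure, and the edge counts are invented, just to show the general approach of turning co-expression counts into a weighted language graph.

```python
import networkx as nx  # assumes the networkx package is installed

# Invented co-expression counts: how often two languages are paired by
# the same user or by a translated document.  Illustrative only.
pairs = [("english", "spanish", 120), ("english", "russian", 60),
         ("english", "malay", 40),    ("russian", "uzbek", 25),
         ("malay", "javanese", 30),   ("spanish", "portuguese", 50)]

G = nx.Graph()
for a, b, count in pairs:
    # More frequent pairing means a shorter "distance" between languages.
    G.add_edge(a, b, distance=1.0 / count)

# Languages lying on many shortest paths between other languages are the
# hubs through which content is most likely to travel.
hubs = nx.betweenness_centrality(G, weight="distance")
for lang, score in sorted(hubs.items(), key=lambda kv: -kv[1]):
    print(f"{lang:11s} {score:.3f}")
```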

There is then a bunch of hypothesizing about how the prominence of a language as a translation hub might influence how famous someone is and/or how easy it is for something from a particular region to reach a global audience. That’s probably what the authors of the paper thought was most important, but it’s not what I’m here for. What I’m here for is what the deviation between translation hub and widely spoken language tells us about how to do global field studies. It is obviously going to be more difficult to study an online subculture that conducts itself in a language you don’t speak yourself, and the fewer people do speak that language, the harder it will be for you to get some help. But if a language is widely spoken but not a translation hub, it may be difficult for you to get the right kind of help.

For instance, machine translations between the various modern vernaculars of Arabic and English are not very good at present. I could find someone who speaks any given vernacular Arabic without too much difficulty, but I’d probably have to pay them a lot of money to get them to translate 45,000 Arabic documents into English, or even just to tell me which of those documents were in their vernacular. (That number happens to be how many documents in my research database were identified as some kind of Arabic by a machine classifier—and even that doesn’t work as well as it could; for instance, it is quite likely to be confused by various Central Asian languages that can be written with an Arabic-derived script and have a number of Arabic loanwords but are otherwise unrelated.)

What can we (Western researchers, communicating primarily in English) do about it? First is just to be aware that global field studies conducted by Anglophones are going to be weak when it comes to languages poorly coupled to English, even when they are widely spoken. In fact, the very fact of the poor coupling makes me skeptical of the results in this paper when it comes to those languages. They only looked at three datasets, all of which are quite English-centric. Would it not have made sense to supplement that with polylingual resources centered on, say, Mandarin Chinese, Modern Standard Arabic, Hindi, and Swahili? These might be difficult to find, but not being able to find them would tend to confirm the original result, and if you could find them, you could both improve the lower bounds for coupling to English, and get a finer-grained look at the languages that are well-translated within those clusters.

Down the road, it seems to me that whenever you see a language cluster that’s widely spoken but not very much in communication with any other languages, you’ve identified a gap in translation resources and cultural cross-pollination, and possibly an underserved market. Scanlations make a good example: the lack of officially-licensed translations of comic books (mostly of Japanese-language manga into English) spawned an entire subculture of unofficial translators, and that subculture is partially responsible for increasing mainstream interest in the material to the point where official translations are more likely to happen.