Joss Wright | Readings in Information Security

We’re picking back up with a paper that’s brand new—so new that it exists only as an arXiv preprint and I don’t know if it is planned to be published anywhere. It probably hasn’t gone through formal peer review yet.

Wright and colleagues observe that because Tor is commonly used to evade censorship, changes in the number of people using Tor from any given country are a signal of a change in the censorship régime in that country. This isn’t a new idea: the Tor project itself has been doing something similar since 2011. What this paper does is present an improved algorithm for detecting such changes. It uses PCA to compare the time series of Tor active users across countries. The idea is that if there’s a change in Tor usage worldwide, that probably doesn’t indicate censorship, but a change in just a few countries is suspicious. To model this using PCA, they tune the number of principal components so that the projected data matrix is well-divided into what they call normal and anomalous subspaces; large components in the anomalous subspace for any data vector indicate that that country at that time is not well-predicted by all the other countries, i.e. something fishy is going on.

They show that their algorithm can pick out previously known cases where a change in Tor usage is correlated with a change in censorship, and that its top ten most anomalous countries are mostly the countries one would expect to be suspicious by this metric—but also a couple that nobody had previously suspected, which they highlight as a matter needing further attention.

PCA used as an anomaly detector is a new idea on me. It seems like they could be extracting more information from it than they are. The graphs in this paper show what’s probably a global jump in Tor usage in mid-2013; this has a clear explanation, and they show that their detector ignores it (as it’s supposed to), but can they make their detector call it out separately from country-specific events? PCA should be able to do that. Similarly, it seems quite probable that the ongoing revolutions and wars in the Levant and North Africa are causing correlated changes to degree of censorship region-wide; PCA should be able to pull that out as a separate explanatory variable. These would both involve taking a closer look at the normal subspace and what each of its dimensions mean.

It also seems to me that a bit of preprocessing, using standard time series decomposition techniques, would clean up the analysis and make its results easier to interpret. There’s not one word about that possibility in the paper, which seems like a major omission; decomposition is the first thing that anyone who knows anything about time series analysis would think of. In this case, I think seasonal variation should definitely be factored out, and removing linear per-country trends might also helpful.

This is one of the earlier papers that looked specifically for regional variation in China’s internet censorship; as I mentioned when reviewing Large-scale Spatiotemporal Characterization of Inconsistencies in the World’s Largest Firewall, assuming that censorship is monolithic is unwise in general and especially so for a country as large, diverse, and technically sophisticated as China. This paper concentrates on variation in DNS-based blockade: they probed 187 DNS servers in 29 Chinese cities (concentrated, like the population, toward the east of the country) for a relatively small number of sites, both highly likely and highly unlikely to be censored within China.

The results reported are maybe better described as inconsistencies among DNS servers than regional variation. For instance, there are no sites called out as accessible from one province but not another. Rather, roughly the same set of sites is blocked in all locales, but all of the blocking is somewhat leaky, and some DNS servers are more likely to leak—regardless of the site—than others. The type of DNS response when a site is blocked also varies from server to server and site to site; observed behaviors include no response at all, an error response, or (most frequently) a success response with an incorrect IP address. Newer papers (e.g. [1] [2]) have attempted to explain some of this in terms of the large-scale network topology within China, plus periodic outages when nothing is filtered at all, but I’m not aware of any wholly compelling analysis (and short of a major leak of internal policy documents, I doubt we can ever have one).

There’s also an excellent discussion of the practical and ethical problems with this class of research. I suspect this was largely included to justify the author’s choice to only look at DNS filtering, despite its being well-known that China also uses several other techniques for online censorship. It nonetheless provides valuable background for anyone wondering about methodological choices in this kind of paper. To summarize:

Many DNS servers accept queries from the whole world, so they can be probed directly from a researcher’s computer; however, they might vary their response depending on the apparent location of the querent, their use of UDP means it’s hard to tell censorship by the server itself from censorship by an intermediate DPI router, and there’s no way to know the geographic distribution of their intended clientele.
Studying most other forms of filtering requires measurement clients within the country of interest. These can be dedicated proxy servers of various types, or computers volunteered for the purpose. Regardless, the researcher risks inflicting legal penalties (or worse) on the operators of the measurement clients; even if the censorship authority normally takes no direct action against people who merely try to access blocked material, they might respond to a sufficiently high volume of such attempts.
Dedicated proxy servers are often blacklisted by sites seeking to reduce their exposure to spammers, scrapers, trolls, and DDoS attacks; a study relying exclusively on such servers will therefore tend to overestimate censorship.
Even in countries with a strong political commitment to free expression, there are some things that are illegal to download or store; researchers must take care not to do so, and the simplest way to do that is to avoid retrieving anything other than text.

Readings in Information Security

Papers by Joss Wright

Detecting Internet Filtering from Geographic Time Series

Regional Variation in Chinese Internet Filtering