Encore: Lightweight Measurement of Web Censorship with Cross-Origin Requests

As I’ve mentioned a few times here before, one of the biggest problems in measurement studies of Web censorship is taking the measurement from the right place. The easiest thing (and this may still be difficult) is to get access to a commercial VPN exit or university server inside each country of interest. But commercial data centers and universities have ISPs that are often somewhat less aggressive about censorship than residential and mobile ISPs in the same country—we think. [1] And, if the country is big enough, it probably has more than one residential ISP, and there’s no reason to think they behave exactly the same. [2] [3] What we’d really like is to enlist spare CPU cycles on a horde of residential computers across all of the countries we’re interested in.

This paper proposes a way to do just that. The authors propose adding to globally popular websites a script that, when the browser is idle, runs censorship tests; anyone who visits such a site is thereby enlisted. The first half of the paper is a technical demonstration that this is possible, and that you get enough information out of it to be useful. Browsers put a bunch of restrictions on what network requests a script can make: you can load an arbitrary webpage in an invisible <iframe>, but you don’t get notified of errors and the script can’t see the content of the page; conversely, <img> can only load images, but a script can ask to be notified of errors. Everything else is somewhere in between. Nonetheless, the authors make a compelling case for being able to detect censorship of entire websites with high accuracy and minimal overhead, and a somewhat less convincing case for detecting censorship of individual pages (with lower accuracy and higher overhead). You only get a yes-or-no answer for each thing probed, but that is enough for many research questions we can’t answer right now. Deployment is easy: a simple matter of adding one more third-party script to websites that want to participate.

The second half of the paper is devoted to ethical and practical considerations. Doing this at all is controversial—in a box on the first page, above the title of the paper, there’s a statement from the SIGCOMM 2015 program committee, saying the paper almost got rejected because some reviewers felt it was unethical to do anything of the kind without informed consent by the people whose computers are enlisted to make measurements. SIGCOMM also published a page-length review by John Byers, saying much the same thing. Against this, the authors argue that informed consent in this case is of dubious benefit, since it does not reduce the risk to the enlistees, and may actually be harmful by removing any traces of plausible deniability. They also point out that many people would need a preliminary course in how Internet censorship works and how Encore measures it before they could make an informed choice about whether to participate in this research. Limiting the pool of enlistees to those who already have the necessary technical background would dramatically reduce the scale and scope of measurements. Finally they observe that the benefits of collecting this data are clear, whereas the risks are nebulous. In a similar vein, George Danezis wrote a rebuttal of the public review, arguing that the reviewers’ concerns are based on a superficial understanding of what ethical research in this area looks like.

Let’s be concrete about the risks involved. Encore modifies a webpage such that web browsers accessing it will, automatically and invisibly to the user, also access a number of unrelated webpages (or resources). By design, those unrelated webpages contain material which is considered unacceptable, perhaps to the point of illegality, in at least some countries. Moreover, it is known that these countries mount active MITM attacks on much of the network traffic exiting the country, precisely to detect and block access to unacceptable material. Indeed, the whole point of the exercise is to provoke an observable response from the MITM, in order to discover what it will and won’t respond to.

The MITM has the power to do more than just block access. It almost certainly records the client IP address of each browser that accesses undesirable material, and since it’s operated by a state, those logs could be used to arrest and indict people for accessing illegal material. Or perhaps the state would just cut off their Internet access, which would be a lesser harm but still a punishment. It could also send back malware instead of the expected content (we don’t know if that has ever happened in real life, but very similar things have [4]), or turn around and mount an attack on the site hosting the material (this definitely has happened [5]). It could also figure out that certain accesses to undesirable material are caused by Encore and ignore them, causing the data collected to be junk, or it could use Encore itself as an attack vector (i.e. replacing the Encore program with malware).

In addition to the state MITM, we might also want to worry about other adversaries in a position to monitor user behavior online, such as employers, compromised coffee shop WiFi routers, and user-tracking software. Employers may have their own list of material that people aren’t supposed to access using corporate resources. Whoever has compromised the coffee shop WiFi is probably looking for a way to turn your laptop into a botnet zombie; any unencrypted network access is a chance to inject some malware. User-tracking software might become very confused about what someone’s demographic is, and start hitting them with ads that relate to whatever controversial topic Encore is looking for censorship of. (This last might actually be a Good Thing, considering the enormous harms behavioral targeting can do. [6])

All of these are harms to someone. It’s important to keep in mind that, except for poisoning the data collected by Encore (harm to the research itself), all of them can happen in the absence of Encore. Malware, ad networks, embedded videos, embedded like buttons, third-party resources of any kind: all of these can and do cause a client computer to access material without its human operator’s knowledge or consent, including accesses to material that some countries consider undesirable. Many of them also offer an active MITM the opportunity to inject malware.

The ethical debate over this paper has largely focused on increased risk of legal, or quasilegal, sanctions taken against people whose browsers were enlisted to run Encore tests. I endorse the authors’ observation that informed consent would actually make that risk worse. Because there are so many reasons a computer might contact a network server without its owner’s knowledge, people already have plausible deniability regarding accesses to controversial material (i.e. I never did that, it must have been a virus or something). If Encore told its enlistees what it was doing and gave them a chance to opt out, it would take that away.

Nobody involved in the debate knows how serious this risk really is. We do know that many countries are not nearly as aggressive about filtering the Internet as they could be, [7] so it’s reasonable to think they can’t be bothered to prosecute people just for an occasional attempt to access stuff that is blocked. It could still be that they do prosecute people for bulk attempts to access stuff that is blocked, but Encore’s approach—many people doing a few tests—would tend to avoid that. But there’s enough uncertainty that I think the authors should be talking to people in a position to know for certain: lawyers and activists from the actual countries of interest. There is not one word either in the papers or the reviews to suggest that anyone has done this. The organizations that the authors are talking to (Citizen Lab, Oxford Internet Institute, the Berkman Center) should have appropriate contacts already or be able to find them reasonably quickly.

Meanwhile, all the worry over legal risks has distracted from worrying about the non-legal risks. The Encore authors are fairly dismissive of the possibility that the MITM might subvert Encore’s own code or poison the results; I think that’s a mistake. They consider the extra bandwidth costs Encore incurs, but they don’t consider that loading an entire page could expose the enlistee to any malware hosted on that page. More thorough monitoring and reportage on Internet censorship might cause the censor to change its behavior, and not necessarily for the better: for instance, if it’s known that some ISPs are less careful about their filtering, that might trigger sanctions against them. These are just the things I can think of off the top of my head.

In closing, I think the controversy over this paper is more about the community not having come to an agreement about its own research ethics than it is about the paper itself. If you read the paper carefully, you’ll notice that the IRB at each author’s institution did not review this research; the boards declined to engage with it. This was probably a correct decision from their point of view, because an IRB’s core competency is medical and psychological research. (They’ve come in for criticism in the past for reviewing sociological studies as if they were clinical trials.) They do not, in general, have the background or expertise to review this kind of research. There are efforts underway to change that: for instance, there was a Workshop on Ethics in Networked Systems Research at the very same conference that presented this paper. (I wish I could have attended.) Development of a community consensus here will, hopefully, lead to better handling of future, similar papers.

Parking Sensors: Analyzing and Detecting Parked Domains

In the same vein as the ecological study of ad injectors I reviewed back in June, this paper does an ecological analysis of domain parking. Domain parking is the industry term of art for registering a whole bunch of domain names you don’t have any particular use for but hope someone will buy; while you wait for the buyer, you stick a website consisting entirely of ads in the space, usually with “This domain is for sale!” at the top in big friendly letters. Like many ad-driven online business models, domain parking can be mostly harmless or it can be a lead-in to outright scamming people and infesting their computers with malware, and the research question in this paper is: how bad does it get?

In order to answer that question, the authors carried out a modest-sized survey of 3000 parked domains, identified by trawling the DNS for name servers associated with 15 known parking services. (Also like many other online businesses, domain parking runs on an affiliate marketing basis: lots of small fry register the domains and then hand the keys over to big services that do the actual work of maintaining the websites with the ads.) All of these services engage in all of the abusive behavior one would expect: typosquatting, aggressive behavioral ad targeting, drive-by malware infection, and feeding visitors to scam artists and phishers. I do not come away with a clear sense of how common any of these attacks are relative to the default parking page of advertisements and links: they have numbers, but they’re not very well organized, and a different set of parking pages was used in each section (each covering a different type of abuse), which makes it hard to compare across sections.

I’m most interested in the last section of the paper, in which they develop a machine classifier that can distinguish parking pages from normal webpages, based on things like the amount of text that is and isn’t a hyperlink, the number of external links, the total volume of resources drawn from third-party sources, and so on. The bulk of this section is devoted to enumerating all of the features that they tested, but it doesn’t do a terribly good job of explaining which features wound up being predictive. Algorithmic choices also seem a little arbitrary. They got a 97.9% true positive rate and a 0.5% false positive rate out of it, though, which says to me that this isn’t a terribly challenging classification problem and probably almost anything would have worked. (This is consistent with the intuitive observation that you, a human, can tell at a glance when you’ve hit a parking page.)
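To make the setup concrete, here is a minimal sketch of the kind of feature-based classifier the paper describes. The feature names and the choice of a random forest are my own illustrative assumptions, not the paper’s actual feature set or algorithm.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative feature names only; the paper enumerates a much longer list
# and does not use exactly these.
FEATURES = ["frac_text_in_links", "n_external_links",
            "third_party_resource_bytes", "n_redirects", "has_for_sale_text"]

def train_parking_classifier(X: np.ndarray, y: np.ndarray) -> RandomForestClassifier:
    """X: one row of FEATURES per crawled page; y: 1 = parked, 0 = normal."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, y)
    return clf
```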

Robust and Efficient Elimination of Cache and Timing Side Channels

We’re closing out Unpublished Results Week here with a paper that was submitted to CCS 2015 last May, but doesn’t appear to have made it.

I’ve talked about side channels a little before, but let me repeat why time and cache channels keep coming up and why they’re hard to get rid of: they’re side effects of hardware-design choices that, most of the time, are exactly what you want, if you want your computer to go fast. Think about long versus short division, for instance. It takes less mental effort, and less scratch paper, to manually divide 123456789 by 3 than by 29; the same is true (sort of) for your computer. So if it’s your job to design the CPU, why shouldn’t the integer divide instruction finish quicker in the former case? Especially if you have profiling information that says that most of the time integer divide gets used with small divisors. Memory caches are even more vital: the other day I saw a graph suggesting that it takes on the order of 17,000 clock cycles for the CPU to come all the way back up to full number-crunching speed after a context switch; most of that is probably due to stuff falling out of the cache and having to be pulled back in.

So: side channels are hard to get rid of in hardware because that would involve making design tradeoffs that would hurt performance for the supermajority of code that isn’t crunching sensitive data, and they’re hard to get rid of in software because the hardware, and the compiler, are actively fighting you. (The paper says several times that side-channel-free cryptographic primitives pretty much have to be done in hand-written assembly language, and that getting them right is extremely difficult even by the standards of people who write assembly language by hand.) Wouldn’t it be nice to find a big hammer that would make the problem just … go away? That’s this paper.

A principal design goal for the big hammer in this paper is ease of use. The only thing the application programmer has to do is label protected functions that need to have their side channels eliminated. And it’s not a compiler paper; they haven’t come up with a way to make a compiler do the extremely difficult assembly tuning for you. All the magic happens at runtime. Given all that, it works pretty much the only way it could work: it measures how much the timing varies and then it pads out the fast cases to the same length as the slow cases. It also flushes state from the protected function out of as many CPU caches as possible (memory, TLB, branch target buffer, …) after the function is done, and takes steps (page coloring, CPU and memory pinning, realtime scheduling priority) to ensure that the protected function is not perturbed by what malicious code has done before it starts executing. The authors demonstrate reliability by testing several different crypto primitives that definitely do have side-channel information leaks, and showing that the leak is rendered unavailable in all cases. They also claim the overhead is acceptable (8.5%).
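To illustrate just the time-padding half of the idea, here is a toy sketch of my own, not the paper’s system: wrap a protected function so that its observable duration is always padded out to an assumed worst case. The real system also flushes caches, TLBs, and branch predictors and pins resources, none of which can be done from user-space Python.

```python
import time

def pad_to_worst_case(worst_case_seconds: float):
    """Decorator: run the wrapped function, then busy-wait until a fixed
    worst-case duration has elapsed, so callers always observe the same
    wall-clock time.  (Toy illustration of the padding idea only.)"""
    def wrap(fn):
        def padded(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            # Busy-wait rather than sleep(); coarse sleep granularity could
            # itself leak information about how long the real work took.
            while time.perf_counter() - start < worst_case_seconds:
                pass
            return result
        return padded
    return wrap

@pad_to_worst_case(0.01)  # 10 ms assumed worst case, chosen for illustration
def compare_secret(secret: bytes, guess: bytes) -> bool:
    return secret == guess  # early-exit comparison: normally a timing leak
```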

I really like the concept and hope to see more in this vein in the future. I can’t speculate on why this particular paper didn’t get accepted at CCS, but I do feel that the overhead measurements were inadequate: they don’t consider indirect costs of flushing caches and pinning stuff to CPUs all the time. I’d like to see whole-system benchmarking of, say, a shared-hosting HTTPS server. And at least one round of adversarial testing would be nice. It is one thing to demonstrate that your measurements show that the side channels you already knew about have been closed; it is another thing if someone who spends all their time thinking of side-channel attacks cannot find any in your system.

Generating Pseudo-random Floating-Point Values

Like Monday’s paper, today’s has not been officially published anywhere, and in fact appears never to have been finished. It’s on a fairly arcane topic, and one that’s not directly related to security—as I said a while back, floating-point numbers usually aren’t used in security-critical algorithms. However, it deserves more attention, being both a case of everyone everywhere is Doing It Wrong, and a nice example of the danger of forgetting that an abstraction is hiding something from you.

For numerical and statistical computation, one frequently needs random floating-point numbers in the open interval (0, 1). However, all off-the-shelf pseudorandom number generators produce either random integers, or a stream of random bits (which is functionally the same thing—bits are trivially put together into integers and vice versa). The easy, obvious way to bridge the gap is to divide (in floating point) the integers by the largest value the PRNG can produce. This is how it’s done basically everywhere, including industrial-grade statistical languages like R…but what the paper is saying is, it’s wrong.
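For concreteness, the recipe the paper is criticizing looks like this (a minimal sketch; real libraries differ in details such as whether they divide by the maximum value or the maximum plus one):

```python
import random

def naive_unit_float() -> float:
    """Divide a 32-bit random integer by the largest value the generator
    can produce.  This is the ubiquitous (and, per Downey, flawed) recipe.
    Note that it can even return the endpoints 0.0 and 1.0 exactly."""
    return random.getrandbits(32) / (2**32 - 1)
```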

To understand why it’s wrong, you need to understand in detail how floating-point numbers are not mathematical real numbers. On one level this is obvious: an IEEE single occupies 32 bits and a double 64, so at most they could represent 2^32 or 2^64 different numbers, respectively; that isn’t even countably infinite. (A large chunk of the encoding is reserved for not-a-number values, so the true count is somewhat smaller.) But high-level programming languages encourage us to use floating-point numbers as if they were the reals, and most of the time we’re working with values that are well-approximated by floats, and we get away with it.

The numbers that a binary floating-point format can represent are all of the form

±m.mmm…m × 2^(±eee…e)

where ±m.mmm…m (the mantissa) and ±eee…e (the exponent) are binary numbers of fixed length. For IEEE singles, the mantissa has 25 bits and the exponent 8, counting their respective signs in both cases, and there’s a clever trick (not relevant here) to fit that all into 32 bits total. This is just the scientific notation you probably learned in high school, in base 2 instead of base 10, and if the mantissa and exponent were infinitely wide it could represent all real numbers. However, they are finite, which means not only that it can’t, but also that the values it can represent are not uniformly distributed over the real line. Every time you pass a power of two, the exponent ticks up or down by one, but the mantissa stays the same length, so it loses or gains one bit of precision. Therefore, there are the same number of representable values between 1 and 2 as there are between 0.5 and 1, between 0.25 and 0.5, and so on, and each range’s representable numbers are twice as far apart as those in the next smaller range.
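You can see the spacing double at each power of two directly from Python (whose floats are IEEE doubles rather than singles, but the structure is the same); math.ulp (Python 3.9+) gives the gap between a value and the next representable one:

```python
import math

# The gap between adjacent representable doubles doubles at each power of two.
for x in (0.25, 0.5, 1.0, 2.0):
    print(x, math.ulp(x))
# 0.25 5.551115123125783e-17
# 0.5 1.1102230246251565e-16
# 1.0 2.220446049250313e-16
# 2.0 4.440892098500626e-16
```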

[Figure: some of the numbers representable by an IEEE single-precision float (top row) and some of the numbers producible by dividing a 32-bit unsigned integer by 2^32 - 1 (bottom row).]

Now, when you generate a random integer and divide it by the largest integer the RNG can produce, the result is uniformly distributed over (0, 1). If the largest integer the RNG can produce is 2^32 - 1, as is often the case (even on systems with 64-bit integers), the gap between values producible by this method will be about 2 × 10^-10. The figure above compares the spacing of these values along the real line with the spacing of the values representable by an IEEE single-precision float. Only from 0.001953125 (2^-9) to 0.003906250 (2^-8) is there a one-to-one correspondence. Above 2^-8, the representable numbers are more widely spaced than the values that the RNG will produce, which means clusters of RNG-produced values will get rounded to the same representable number, which introduces non-uniformity. Below 2^-9, it’s the other way around: all of the RNG-produced values are exactly representable, but there are many more that can’t be produced. Downey estimates that over the entire range, only 7% of the numbers are producible.

The most significant gap is right at zero. The smallest number IEEE single can represent is on the order of 10^-38 or 10^-45 (depending on whether you allow denormals), and IEEE double goes all the way down to 2 × 10^-308 (or 5 × 10^-324 with denormals). But the smallest number producible by the simple RNG method is 2 × 10^-10, dozens or hundreds of decimal orders of magnitude bigger. Downey suggests that this could cause serious problems for statistical algorithms, such as inverse transform sampling of the exponential distribution.
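This, too, is easy to check from Python (using doubles):

```python
import sys

smallest_producible = 1 / (2**32 - 1)   # naive method with a 32-bit generator
print(smallest_producible)               # about 2.3e-10
print(sys.float_info.min)                # smallest normal double, about 2.2e-308
print(5e-324)                            # smallest denormal double
```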

Downey’s proposed fix rests on the observation that a random number uniformly distributed over (0, 1) will be greater than 0.5 exactly half of the time; when it isn’t, it will be greater than 0.25 exactly half of the time; and so on. Therefore, his algorithm first selects an exponent by flipping coins in a loop—that is, drawing one bit at a time from the output of the RNG. The exponent starts out as −1, corresponding to a number between 0.5 and 1; each time the bit is 0, the exponent is reduced by 1 and another bit is drawn, until either a 1 is drawn or the exponent reaches its minimum allowable value. Then the algorithm fills in the mantissa with random bits. Finally, there’s a small correction: if the mantissa happens to come out zero, flip another bit and if it’s 1, increase the exponent by 1 again. This accounts for values that are exactly 1 over a power of two, which straddle a boundary between exponents and therefore can be selected from either side; without the correction they would be selected slightly more often than is appropriate.
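Here is a minimal Python sketch of the algorithm as described above (my own transcription, not Downey’s code; it stops at the minimum normal exponent rather than descending into denormals):

```python
import math
import random

def downey_unit_double(rng=random) -> float:
    """Random double built the way described above: pick the exponent by
    coin flips, then fill the mantissa with random bits."""
    # Exponent selection: start at -1, i.e. the range [0.5, 1); each 0 bit
    # halves the range, stopping at the minimum normal exponent.
    exponent = -1
    while rng.getrandbits(1) == 0 and exponent > -1022:
        exponent -= 1
    # 52 random mantissa bits.
    mantissa = rng.getrandbits(52)
    # Correction for values exactly equal to a power of two, which straddle
    # two exponent ranges and would otherwise be drawn slightly too often.
    if mantissa == 0 and rng.getrandbits(1) == 1:
        exponent += 1
    # Assemble 1.mantissa * 2**exponent.
    return math.ldexp(1 + mantissa / (1 << 52), exponent)
```

Passing random.SystemRandom() as rng would draw the bits from the OS entropy source instead of the default Mersenne Twister.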

It’s too bad Downey never finished this paper. The biggest missing piece is a clear and convincing demonstration that the naïve algorithm does introduce significant errors into common calculations. For exponential sampling by inverse transform, there is an inarguable error (the distribution is truncated on the right) but one could argue that it doesn’t rise to the level of significance because exponential deviates larger than 9.3 should only get drawn one time in 10^10 or so. There are no other examples of potentially problematic tasks, and I am not enough of a statistician to think of any myself. IEEE double has enough mantissa, even in the range from 0.5 to 1, that it can represent every multiple of 2 × 10^-10, so nonuniformity does not occur if you’re generating doubles by the simple method, only missing values.

There are also a couple of practical problems with the algorithm. A potentially lengthy run of coin tosses, with the requirement that every bit be independent and identically distributed, is poorly handled by many ordinary RNGs; I would only use this algorithm with a cryptographic RNG. Relatedly, the coin-tossing phase terminates quickly on average, but a long run of zeroes would make it observably slower. I don’t see a good way to implement the algorithm so that it uses a fixed number of random bits per float generated, though, short of generating all the coin tosses in advance, which would consume 1078 bits of randomness for every IEEE double; that would probably be unacceptably slow overall.

Detecting Internet Filtering from Geographic Time Series

We’re picking back up with a paper that’s brand new—so new that it exists only as an arXiv preprint and I don’t know if it is planned to be published anywhere. It probably hasn’t gone through formal peer review yet.

Wright and colleagues observe that because Tor is commonly used to evade censorship, changes in the number of people using Tor from any given country are a signal of a change in the censorship régime in that country. This isn’t a new idea: the Tor project itself has been doing something similar since 2011. What this paper does is present an improved algorithm for detecting such changes. It uses PCA to compare the time series of Tor active users across countries. The idea is that if there’s a change in Tor usage worldwide, that probably doesn’t indicate censorship, but a change in just a few countries is suspicious. To model this using PCA, they tune the number of principal components so that the projected data matrix is well-divided into what they call normal and anomalous subspaces; large components in the anomalous subspace for any data vector indicate that that country at that time is not well-predicted by all the other countries, i.e. something fishy is going on.
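As a rough illustration of the normal/anomalous-subspace idea (a sketch, not the paper’s exact algorithm; the choice of k, the normalization, and the thresholding are all glossed over here), one might do something like:

```python
import numpy as np

def anomaly_scores(X: np.ndarray, k: int) -> np.ndarray:
    """X: a (time x country) matrix of Tor user counts.  Returns a matrix of
    the same shape; large entries mark country/time points poorly explained
    by the globally shared structure."""
    Xc = X - X.mean(axis=0)                  # center each country's series
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_normal = Vt[:k].T                      # top-k directions: the "normal" subspace
    projector = V_normal @ V_normal.T
    residual = Xc - Xc @ projector           # component in the "anomalous" subspace
    return np.abs(residual)
```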

They show that their algorithm can pick out previously known cases where a change in Tor usage is correlated with a change in censorship, and that its top ten most anomalous countries are mostly the countries one would expect to be suspicious by this metric—but also a couple that nobody had previously suspected, which they highlight as a matter needing further attention.

PCA used as an anomaly detector is new to me. It seems like they could be extracting more information from it than they are. The graphs in this paper show what’s probably a global jump in Tor usage in mid-2013; this has a clear explanation, and they show that their detector ignores it (as it’s supposed to), but can they make their detector call it out separately from country-specific events? PCA should be able to do that. Similarly, it seems quite probable that the ongoing revolutions and wars in the Levant and North Africa are causing correlated changes to the degree of censorship region-wide; PCA should be able to pull that out as a separate explanatory variable. Both would involve taking a closer look at the normal subspace and what each of its dimensions means.

It also seems to me that a bit of preprocessing, using standard time series decomposition techniques, would clean up the analysis and make its results easier to interpret. There’s not one word about that possibility in the paper, which seems like a major omission; decomposition is the first thing anyone who knows anything about time series analysis would think of. In this case, I think seasonal variation should definitely be factored out, and removing linear per-country trends might also be helpful.
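For instance (a sketch, assuming daily per-country counts with weekly seasonality; a real analysis would need to choose the period and the additive/multiplicative model more carefully):

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def deseasonalize(daily_counts: pd.Series) -> pd.Series:
    """daily_counts: hypothetical daily Tor-user counts for one country,
    indexed by date.  Returns the series with weekly seasonality and the
    long-term trend removed, ready to feed to the anomaly detector."""
    parts = seasonal_decompose(daily_counts, model="additive", period=7)
    return (daily_counts - parts.seasonal - parts.trend).dropna()
```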

Sparse Overcomplete Word Vector Representations

It’s time for another of my occasional looks at a paper that doesn’t have anything to do with security. This one is about an attempt to bridge a gap between two modes of analysis in computational linguistics.

In a word vector analysis (also referred to as a distributed word representation in this paper) one maps words onto vectors in an abstract high-dimensional space, defined by the co-occurrence probabilities of each pair of words within some collection of documents (a corpus). This can be done entirely automatically with any source of documents; no manual preprocessing is required. Stock machine-learning techniques, applied to these vectors, perform surprisingly well on a variety of downstream tasks—classification of the words, for instance. However, the vectors are meaningless to humans, so it’s hard to use them in theoretical work that requires interpretability, or to combine them with other information (e.g. sentence structure). Lexical semantics, by contrast, relies on manual, word-by-word tagging with human-meaningful categories (part of speech, sense of word, role in sentence, etc) which is slow and expensive, but the results are much easier to use as a basis for a wide variety of further studies.

The paper proposes a technique for transforming a set of word vectors to make them more interpretable. It’s essentially the opposite of PCA. They project the vectors into a higher-dimensional space, one in which they are all sparse (concretely, more than 90% of the components of each vector are zero). Words that are semantically related will (they claim) share nonzero components of these vectors, so each component has more meaning than in the original space. The projection matrix can be generated by standard mathematical optimization techniques (specifically, gradient descent with some tweaks to ensure convergence).
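To give a sense of the shape of the transformation, here is a generic sparse-coding sketch using scikit-learn’s dictionary learner. This is a stand-in, not the paper’s optimizer (the paper has its own gradient-descent formulation, plus binarized and nonnegative variants): each word’s dense vector is re-expressed as a much longer, mostly-zero vector of coefficients over a learned set of basis directions.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def sparsify(dense_vectors: np.ndarray, overcompleteness: int = 10,
             sparsity: float = 1.0) -> np.ndarray:
    """dense_vectors: (n_words x d) matrix of dense word embeddings.
    Returns an (n_words x d*overcompleteness) matrix of sparse codes.
    (Slow for large vocabularies; this is only an illustration.)"""
    learner = DictionaryLearning(
        n_components=dense_vectors.shape[1] * overcompleteness,
        alpha=sparsity,                      # larger alpha => sparser codes
        transform_algorithm="lasso_lars",
    )
    return learner.fit_transform(dense_vectors)
```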

To back up the claim that the sparse vectors are more interpretable, they first show that five stock downstream classification tasks achieve an average of 5% higher accuracy when fed the sparse vectors rather than the dense vectors from which they were derived, and then that humans achieve 10% higher accuracy on a word intrusion task (“one of these five words does not belong in the list; which is it?”) when the words are selected by their rank along one dimension of the sparse vectors than when they are selected the same way from the dense vectors. An example would probably help here: Table 6 of the paper quotes the top-ranked words from five dimensions of the sparse and the dense vectors:

Dense | Sparse
combat, guard, honor, bow, trim, naval | fracture, breathing, wound, tissue, relief
’ll, could, faced, lacking, seriously, scored | relationships, connections, identity, relations
see, n’t, recommended, depending, part | files, bills, titles, collections, poems, songs
due, positive, equal, focus, respect, better | naval, industrial, technological, marine
sergeant, comments, critics, she, videos | stadium, belt, championship, toll, ride, coach

Neither type of list is what you might call an ideal Platonic category,¹ but it should be clear that the sparse lists have more meaning in them.

Because I’m venturing pretty far out of my field, it’s hard for me to tell how significant this paper is; I don’t have any basis for comparison. This is, in fact, the big thing missing from this paper: a comparison to other techniques for doing the same thing, if any. Perhaps the point is that there aren’t any, but I didn’t see them say so. I am also unclear on how you would apply this technique to anything other than an analysis of words. For instance, my own research right now involves (attempting to) mechanically assign topics to entire documents. Right now we’re reducing each document to a bag of words, carrying out LDA on the bags, and then manually labeling each LDA cluster with a topic. Could we use bags of sparse word vectors instead? Would that help LDA do its job better? Or is LDA already doing what this does? I don’t know the answers to these questions.

¹ If you are under the impression that categories in natural language are even vaguely Platonic, go at once to your friendly local public library and request a copy of Women, Fire, and Dangerous Things.

Whiskey, Weed, and Wukan on the World Wide Web

The subtitle of today’s paper is On Measuring Censors’ Resources and Motivations. It’s a position paper, whose goal is to get other researchers to start considering how economic constraints might affect the much-hypothesized arms race or tit-for-tat behavior of censors and people reacting to censorship: as they say,

[…] the censor and censored have some level of motivation to accomplish various goals, some limited amount of resources to expend, and real-time deadlines that are due to the timeliness of the information that is being spread.

They back up their position by presenting a few pilot studies, of which the most compelling is the investigation of keyword censorship on Weibo (a Chinese microblogging service). They observe that searches are much more aggressively keyword-censored than posts—that is, for many examples of known-censored keywords, one is permitted to make a post on Weibo containing that keyword, but searches for that keyword will produce either no results or very few results. (They don’t say whether unrelated searches will turn up posts containing censored keywords.) They also observe that, for some keywords that are not permitted to be posted, the server only bothers checking for variations on the keyword if the user making the post has previously tried to post the literal keyword. (Again, the exact scope of the phenomenon is unclear—does an attempt to post any blocked keyword make the server check more aggressively for variations on all blocked keywords, or just that one? How long does this escalation last?) And finally, whoever is maintaining the keyword blacklists at Weibo seems to care most about controlling the news cycle: terms associated with breaking news that the government does not like are immediately added to the blacklist, and removed again just as quickly when the event falls out of the news cycle or is resolved positively. They give detailed information about this phenomenon for one news item, the Wukan incident, and cite several other keywords that seem to have been treated the same.

They compare Weibo’s behavior to similar keyword censorship by chat programs popular in China, where the same patterns appear, but whoever is maintaining the lists is sloppier and slower about it. This is clear evidence that the lists are not maintained centrally (by some government agency) and they suggest that many companies are not trying very hard:

At times, we often suspected that a keyword blacklist was being typed up by an over-worked college intern who was given vague instructions to filter out anything that might be against the law.

Sadly, I haven’t seen much in the way of people stepping up to the challenge presented, designing experiments to probe the economics of censorship. You can see similar data points in other studies of China [1] [2] [3] (it is still the case, as far as I know, that ignoring spurious TCP RST packets is sufficient to evade several aspects of the Great Firewall), and in reports from other countries. It is telling, for instance, that Pakistani censors did not bother to update their blacklist of porn sites to keep up with a shift in viewing habits. [4] George Danezis has been talking about the economics of anonymity and surveillance for quite some time now [5] [6] but that’s not quite the same thing. I mentioned above some obvious follow-on research just for Weibo, and I don’t think anyone’s done that. Please tell me if I’ve missed something.

Limitations of End-to-End Encryption in Secure Computer Networks

Today we’re going to go back in time, all the way to the dawn of computer networks. When this technical report was filed, the largest operational internetwork was still called ARPAnet and it still ran on NCP; people were still talking about hosts and communications subnetwork processors as if they were two different physical devices, and security levels as if that was the only possible way to conceptualize access control; and I was five months old.

(I apologize for the poor quality of the linked PDF—to the best of my knowledge, this is the only version to be found online. Also, if anyone knows D.W. Snow’s first name, please tell me.)

To the best of my knowledge, this is the very first published article to discuss the things that end-to-end encryption (what we would now call a secure channel protocol) does not protect from an eavesdropper. Everyone doing computer security in 1978 was thinking mostly about protecting classified government secrets, so the authors frame the problem in terms of a Trojan Horse program with access to such secrets, but forbidden by the OS from sending messages to anyone who isn’t cleared to access the same secrets: if all outgoing network traffic is encrypted end-to-end to its (legitimate) recipient, can this Trojan Horse still exfiltrate information to someone who isn’t a legitimate recipient?

They point out that, of necessity, a packet-switched network has to reveal the destination address, transmission time, and length of every packet in cleartext. They model each of these as Shannonian communication channels, and determine sending rates on the order of 100 bits per second for each—more than adequate to leak a text document. (They observe, by way of comparison, that the standard military teletype runs at 75 bps.)
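The arithmetic behind numbers like that is straightforward. A back-of-the-envelope version (with my own illustrative figures, not the report’s exact parameters):

```python
import math

# If the eavesdropper can distinguish `levels` different packet lengths and
# the Trojan Horse can emit `pkts_per_sec` packets per second, the noiseless
# capacity of the length channel is pkts_per_sec * log2(levels) bits/second.
pkts_per_sec = 10
levels = 1024          # distinguishable packet lengths (assumed)
print(pkts_per_sec * math.log2(levels), "bits/second")   # 100.0 bits/second
```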

Nowadays, this threat model might seem quaint, even silly—we put a lot more effort into preventing untrusted code from seeing secret information in the first place. The information leak, however, is real, still exists, and can be used for other purposes. The most terrifying example I know is Hookt on fon-iks, in which a completely passive eavesdropper can reconstruct the words spoken in an encrypted VoIP phone conversation, just from the length and timing of each packet. Different syllables compress to different length packets, and every natural language has rules about which syllables can follow which; the rules can be modeled with a Markov chain, and there you are.

The countermeasures and conclusions sections of this paper are much more embarrassing in retrospect than the dated threat model. They say, there’s nothing one can practically do about this end-to-end, but we can close the hole if we make every single intermediate relay a trusted arbiter of the (one true) security policy, at which point we don’t need end-to-end encryption… I feel quite confident in saying, even in 1978 it ought to have been obvious that that was never going to happen. What’s sad is, if people hadn’t given up on end-to-end countermeasures back then, perhaps we would actually have some by now. (It’s easy for VoIP! All you have to do is use a constant-bitrate compression algorithm. Shame none of the widely deployed VoIP programs bother.)

Folk Models of Home Computer Security

It is well known that people who aren’t computer security experts tend to ignore expert advice on computer security, and (to some extent as a consequence) get exploited. This paper is not the first, or the last, to investigate why; see also the What Deters Jane papers [1] [2], So Long and No Thanks for the Externalities, and Users Are Not the Enemy. However, this paper provides a much more compelling explanation than anything earlier (that I have read), and a lens through which to view everything since. It’s plainly written and requires almost no specialized background knowledge; you should just go ahead and read the whole thing.

For those not in the mood to read the whole thing right now, I will summarize. Wash conducted semi-structured, qualitative interviews of 33 home computer users, who were selected to maximize sample diversity, and specifically to exclude security experts. From these, he extracted a number of what he calls folk models—qualitative, brief descriptions of how these people understand various threats to their computer security. The term folk is used to describe a model which accurately reflects users’ understanding of a system, and is broadly shared among a user population, but might not accurately reflect the true behavior of that system. [3] In this case, that means the models reflect what the interviewees think are the threats to their home computers, even if those aren’t accurate. Indeed, it is precisely where the model is inaccurate to the real threats that it provides an explanation of the phenomenon (i.e. users not bothering to follow well-publicized security advice).

A key aspect of all the models presented is a division of security threats into viruses and hackers. Virus is used by the interviewees as an umbrella term, corresponding most closely to what experts call malware—any piece of software which is unwanted and has harmful effects. (One model expands this category even further, to include programs which are unintentionally harmful, i.e. they have bugs.) The models differ primarily in users’ understanding of how viruses get into the computer, and what they are programmed to do once there. This can be very vague (e.g. viruses are bad software you don’t want on your computer) or quite specific (e.g. viruses are deliberately programmed by hackers as an act of vandalism; they cause annoying problems with the computer; you get them passively by visiting sketchy websites—an expert will acknowledge this as true-but-incomplete).

Hackers on the other hand are people who are actively seeking to exploit computers; most interviewees share the understanding that this involves taking control of a computer remotely, thus allowing it to be manipulated as if the hacker were physically present at its console. (Many of them hedge that they do not know how this is done, but they are sure that it is possible.) The models here differ primarily in the motives ascribed to the hackers, which are: vandalism, identity theft, or targeted identity theft and financial fraud. This last is one of the most telling observations in the entire paper: a significant number of people believe that they are safe from hackers because they have nothing worth stealing or exposing. (Again, an expert would recognize this as true-but-incomplete: there really is a subpopulation of black-hat actors who specialize in going after the big fish. The catch, of course, is that the data exfiltrated from a big fish might include millions of people’s personal credit card numbers…)

Having presented these models, Wash runs down a list of standard items of home computer security advice (drawn from Microsoft, CERT, and US-CERT’s guides on the topic) and points out how many of them are either useless or unimportant according to these models: for instance, if you think you can’t get viruses without actively downloading software, then antivirus software is pointless, and patching is only valuable if it eliminates bugs you trip over yourself; if you think hackers rarely, if ever, vandalize a computer, then backups are not necessary to protect against that risk. He closes by comparing the novel-at-the-time threat of botnets to all the models, observing that none of them account for the possibility that an attacker might subvert computers indiscriminately and automatically, then use them only for their Internet connection. In particular, all of the hacker models assume that computers are attacked in order to do something to that computer, rather than as a means to an unrelated goal (sending spam, enlarging the botnet, executing DDoS attacks, …) and that the hacker must be doing something manually at the time of the attack.

The landscape of security threats has changed quite a bit since this paper was published. I would be curious to know whether ransomware, RATs, third-party data breaches, and so on have penetrated the public consciousness enough to change any of the models. I’d also like to know whether and how much people’s understanding of the threats to a mobile phone is different. And, although Wash did make an effort to cover a broad variety of non-expert home computer users, they are all from the general population near his Midwestern university, hence mostly WEIRDos. I’m not aware of any studies of this type conducted anywhere but North America and Europe, but I bet it’s not quite the same elsewhere…

Taster’s Choice: A Comparative Analysis of Spam Feeds

Here’s another paper about spam; this time it’s email spam, and they are interested not so much in the spam itself, but in the differences between collections of spam (feeds) as used in research. They have ten different feeds, and they compare them to each other looking only at the domain names that appear in each. The goal is to figure out whether or not each feed is an unbiased sample of all the spam being sent at any given time, and whether some types of feed are better at detecting particular sorts of spam. (Given this goal, looking only at the domain names is probably the most serious limitation of the paper, despite being brushed off with a footnote. It means they can’t say anything about spam that doesn’t contain any domain names, which may be rare, but is interesting because it’s rare and different from all the rest. They should have at least analyzed the proportion of it that appeared in each feed.)

The spam feeds differ primarily in how they collect their samples. There’s one source consisting exclusively of manually labeled spam (from a major email provider); two DNS blacklists (these provide only domain names, and are somehow derived from other types of feed); three MX honeypots (registered domains that accept email to any address, but are never used for legitimate mail); two seeded honey accounts (like honeypots, but a few addresses are made visible to attract more spam); one botnet-monitoring system; and one hybrid. They don’t have full details on exactly how they all work, which is probably the second most serious limitation.

The actual results of the analysis are basically what you would expect: manually labeled spam is lower-volume but has more unique examples in it, botnet spam is very high volume but has lots of duplication, and everything else is somewhere in between. They made an attempt to associate spam domains with affiliate networks (the business of spamming nowadays is structured as a multi-level marketing scheme) but they didn’t otherwise try to categorize the spam itself. I can think of plenty of additional things to do with the data set—which is the point: it says right in the abstract that “most studies [of email spam] use a single spam feed and there has been little examination of how such feeds may differ in content.” They’re not trying so much to produce a comprehensive analysis themselves as to alert people working in this subfield that they might be missing stuff by looking at only one data source.