Over the years there have been several attempts to build anonymous publication or distributed anonymous storage systems—usually they start with a peer-to-peer file sharing protocol not entirely unlike BitTorrent, and then build some combination of indexing, replication, encryption, and anonymity on top. All have at least one clever idea. None has achieved world domination. (We’ll know someone’s finally gotten it right when the web browsers start shipping native support for their protocol.)

Tangler is a relatively old example, and its one clever idea is what they call document entanglement. To understand document entanglement you have to know about something called $k$-of-$n$ secret sharing. This is a mathematical technique that converts a secret into $n$ shares. Each share is the same size as the original secret, possibly plus a little overhead. Anyone who has a copy of $k$ of those $n$ shares can reconstruct the original secret, but if they have even just one fewer, they can’t. $k$ and $n$ can be chosen arbitrarily. Secret sharing is normally not used for large secrets (like an entire document) because each share is the same size as the original, so you’ve just increased your overall storage requirement $n$ times—but in a distributed document store like Tangler, you were going to do that anyway, because the document should remain retrievable even if some of the peers holding shares drop out of the network.

Document entanglement, then, is secret sharing with a clever twist: you arrange to have some of the $n$ shares of your document be the same bitstring as existing shares for other documents. This is always mathematically possible, as long as fewer than $k$ existing shares are used. This reduces the amount of data added to the system by each new document, but more importantly, it makes the correspondence between shares and documents many-to-many instead of many-to-one. Thus, operators can honestly say they do not know which documents are backed by which shares, and they have an incentive not to cooperate with deletion requests, since deleting one document may render many other documents inaccessible.

I am not convinced entanglement actually provides the security benefit claimed; deleting all $n$ of the shares belonging to one document should cause other documents to lose no more than one share and thus not be permanently damaged. (The originators of those documents would of course want to generate new shares to preserve redundancy.) It is still probably worth doing just because it reduces the cost of adding new documents to the system, but security-wise it’s solving the wrong problem. What you really want here is: server operators should be unable to determine which documents they hold shares for, even if they know the metadata for those documents. (And yet, somehow, they must be able to hand out the right shares on request!) Similar things are possible, under the name private information retrieval, and people are trying to apply that to anonymous publication, but what I said one really wants here is even stronger than the usual definition of PIR, and I’m not sure it’s theoretically possible.

This is a report on a pilot usability study. The authors ran five journalists (there aren’t any more details than that) through the process of installing, activating, and using the Tor Browser for a small number of canned tasks, identifying a number of problems:

… people did have difficulty with installing Tor Browser (principally because of the Gatekeeper code-signing feature on OS X), did not understand what many of the many options meant, and were confused about why certain things were happening.

They are going to do a much larger study, and were soliciting feedback on experimental design. I have only two things to say. First, the proposal is to do a large test of 200 users and then, presumably, start making changes to the software to improve usability. The problem with this is, it is very likely that subtle (yet serious) UX issues are being masked out by the more blatant ones: no matter how many people you experiment on, you won’t detect the subtle problems until the blatant ones are fixed. Therefore, it would be far more valuable to do a series of smaller user studies, improving the software based on the results of each study before doing the next one. This strategy also ensures that the research results do get incorporated into the product, rather than being lost in the shuffle once the paper is published.

The other point is more of a hypothesis about what would be good to aim for. To use Tor in a way that genuinely improves your security outcomes, you need to understand what it is doing and why, and to do that you have to wrap your head around some concepts that may be unfamiliar—especially if you haven’t previously needed to understand the Internet itself in any kind of detail. (For instance, the fact that every IP packet is labeled with its source and destination is obvious once you think about it, but I never thought about it to a lot of people.) There probably needs to be a training manual, and this manual needs to take the attitude that yeah, this is a little tricky, and you have to think about it some, but don’t panic, you can understand it. Shoot for the we understand tone said to characterize Rust compiler errors (warning: Reddit). The place I’ve seen this done best, personally, was the tutorial and concepts guide for GnuCash, which took just this tone with regard to double-entry bookkeeping—also somewhat notorious for its inscrutability. (Note: I read this a long time ago, and I don’t know whether its current edition is still like that.)

Last week we looked at a case study of Internet filtering in Pakistan; this week we have a case study of Syria. (I think this will be the last such case study I review, unless I come across a really compelling one; there’s not much new I have to say about them.)

This study is chiefly interesting for its data source: a set of log files from the Blue Coat brand DPI routers that are allegedly used [1] [2] to implement Syria’s censorship policy, covering a 9-day period in July and August of 2011. leaked by the Telecomix hacktivist group. Assuming that these log files are genuine, this gives the researchers what we call ground truth: they can be certain that sites appearing in the logs are, or are not, censored. (This doesn’t mean they know the complete policy, though. The routers’ blacklists could include sites or keywords that nobody tried to visit during the time period covered by the logs.)

With ground truth it is possible to make more precise deductions from the phenomena. For instance, when the researchers see URLs of the form `http://a1b2.cdn.example/adproxy/cyber/widget` blocked by the filter, they know (because the logs say so) that the block is due to a keyword match on the string `proxy`, rather than the domain name, the IP address, or any other string in the HTTP request. This, in turn, enables them to describe the censorship policy quite pithily: Syrian dissident political organizations, anything and everything to do with Israel, instant messaging tools, and circumvention tools are all blocked. This was not possible in the Pakistani case—for instance, they had to guess at the exact scope of the porn filter.

Because the leaked logs cover only a very short time window, it’s not possible to say anything about the time evolution of Syrian censorship, which is unfortunate, considering the tumultuous past few years that the country has had.

The leak is from several years ago. There is heavy reliance on keyword filtering; it would be interesting to know if this has changed since, what with the increasing use of HTTPS making keyword filtering less useful. For instance, since 2013 Facebook has defaulted to HTTPS for all users. This would have made it much harder for Syria to block access to specific Facebook pages, as they were doing in this study.

The Youtube block triggered an immediate and obvious increase in encrypted traffic, which the authors attribute to an increased use of circumvention tools—the packet traces did not record enough information to identify exactly what tool, or to discriminate circumvention from other encrypted traffic, but it seems a reasonable assumption. Over the next several months, alternative video sharing/streaming services rose in popularity; as of the last trace in the study, they had taken over roughly 80% of the market share formerly held by Youtube.

Users responded quite differently to the porn block: roughly half of the inbound traffic formerly attributable to porn just disappeared, but the other half was redirected to different porn sites that didn’t happen to be on the official blacklist. The censorship authority did not react by adding the newly popular sites to the blacklist. Perhaps a 50% reduction in overall consumption of porn was good enough for the politicians who wanted the blacklist in the first place.

The paper also contains also some discussion of the mechanism used to block access to censored domains. This confirms prior literature [2] so I’m not going to go into it in great detail; we’ll get to those papers eventually. One interesting tidbit (also previously reported) is that Pakistan has two independent filters, one implemented by local ISPs which falsifies DNS responses, and another operating in the national backbone which forges TCP RSTs and/or HTTP redirections.

The authors don’t talk much about why user response to the Youtube block was so different from the response to the porn block, but it’s evident from their discussion of what people do right after they hit a block in each case. This is very often a search engine query (unencrypted, so visible in the packet trace). For Youtube, people either search for proxy/circumvention services, or they enter keywords for the specific video they wanted to watch, hoping to find it elsewhere, or at least a transcript. For porn, people enter keywords corresponding to a general type of material (sex act, race and gender of performers, that sort of thing), which suggests that they don’t care about finding a specific video, and will be content with whatever they find on a site that isn’t blocked. This is consistent with analysis of viewing patterns on a broad-spectrum porn hub site [3]. It’s also consistent with the way Youtube is integrated into online discourse—people very often link to or even embed a specific video on their own website, in order to talk about it; if you can’t watch that video you can’t participate in the conversation. I think this is really the key finding of the paper, since it gets at when people will go to the trouble of using a circumvention tool.

What the authors do talk about is the consequences of these blocks on the local Internet economy. In particular, Youtube had donated a caching server to the ISP in the case study, so that popular videos would be available locally rather than clogging up international data channels. With the block and the move to proxied, encrypted traffic, the cache became useless and the ISP had to invest in more upstream bandwidth. On the other hand, some of the video services that came to substitute for Youtube were Pakistani businesses, so that was a net win for the local economy. This probably wasn’t intended by the Pakistani government, but in similar developments in China [4] and Russia [5], import substitution is clearly one of the motivating factors. From the international-relations perspective, that’s also highly relevant: censorship only for ideology’s sake probably won’t motivate a bureaucracy as much as censorship that’s seen to be in the economic interest of the country.

This short paper presents a simple game-theoretic analysis of a late stage of the arms race between a censorious national government and the developers of tools for circumventing that censorship. Keyword blocking, IP-address blocking, and protocol blocking for known circumvention protocols have all been insitituted and then evaded. The circumvention tool is now steganographically masking its traffic so it is indistinguishable from some commonly-used, innocuous cover protocol or protocols; the censor, having no way to unmask this traffic, must either block all use of the cover protocol, or give up.

The game-theoretic question is, how many cover protocols should the circumvention tool implement? Obviously, if there are several protocols, then the tool is resilient as long as not all of them are blocked. On the other hand, implementing more cover protocols requires more development effort, and increases the probability that some of them will be imperfectly mimicked, making the tool detectable. [1] This might seem like an intractable question, but the lovely thing about game theory is it lets you demonstrate that nearly all the fine details of each player’s utility function are irrelevant. The answer: if there’s good reason to believe that protocol X will never be blocked, then the tool should only implement protocol X. Otherwise, it should implement several protocols, based on some assessment of how likely each protocol is to be blocked.

In real life there probably won’t be a clear answer to will protocol X ever be blocked? As the authors themselves point out, the censors can change their minds about that quite abruptly, in response to political conditions. So, in real life several protocols will be needed, and that part of the analysis in this paper is not complete enough to give concrete advice. Specifically, it offers a stable strategy for the Nash equilibrium (that is, neither party can improve their outcome by changing the strategy) but, again, the censors might abruptly change their utility function in response to political conditions, disrupting the equilibrium. (The circumvention tool’s designers are probably philosophically committed to free expression, so their utility function can be assumed to be stable.) This requires an adaptive strategy. The obvious adaptive strategy is for the tool to use only one or two protocols at any given time (using more than one protocol may also improve verisimilitude of the overall traffic being surveilled by the censors) but implement several others, and be able to activate them if one of the others stops working. The catch here is that the change in behavior may itself reveal the tool to the censor. Also, it requires all the engineering effort of implementing multiple protocols, but some fraction of that may go to waste.

The paper also doesn’t consider what happens if the censor is capable of disrupting a protocol in a way that only mildly inconveniences normal users of that protocol, but renders the circumvention tool unusable. (For instance, the censor could be able to remove the steganography without necessarily knowing that it is there. [2]) I think this winds up being equivalent to the censor being able to block that protocol without downside, but I’m not sure.