Over the years there have been several attempts to build anonymous publication or distributed anonymous storage systems—usually they start with a peer-to-peer file sharing protocol not entirely unlike BitTorrent, and then build some combination of indexing, replication, encryption, and anonymity on top. All have at least one clever idea. None has achieved world domination. (We’ll know someone’s finally gotten it right when the web browsers start shipping native support for their protocol.)

Tangler is a relatively old example, and its one clever idea is what they call document entanglement. To understand document entanglement you have to know about something called $k$-of-$n$ secret sharing. This is a mathematical technique that converts a secret into $n$ shares. Each share is the same size as the original secret, possibly plus a little overhead. Anyone who has a copy of $k$ of those $n$ shares can reconstruct the original secret, but if they have even just one fewer, they can’t. $k$ and $n$ can be chosen arbitrarily. Secret sharing is normally not used for large secrets (like an entire document) because each share is the same size as the original, so you’ve just increased your overall storage requirement $n$ times—but in a distributed document store like Tangler, you were going to do that anyway, because the document should remain retrievable even if some of the peers holding shares drop out of the network.

Document entanglement, then, is secret sharing with a clever twist: you arrange to have some of the $n$ shares of your document be the same bitstring as existing shares for other documents. This is always mathematically possible, as long as fewer than $k$ existing shares are used. This reduces the amount of data added to the system by each new document, but more importantly, it makes the correspondence between shares and documents many-to-many instead of many-to-one. Thus, operators can honestly say they do not know which documents are backed by which shares, and they have an incentive not to cooperate with deletion requests, since deleting one document may render many other documents inaccessible.

I am not convinced entanglement actually provides the security benefit claimed; deleting all $n$ of the shares belonging to one document should cause other documents to lose no more than one share and thus not be permanently damaged. (The originators of those documents would of course want to generate new shares to preserve redundancy.) It is still probably worth doing just because it reduces the cost of adding new documents to the system, but security-wise it’s solving the wrong problem. What you really want here is: server operators should be unable to determine which documents they hold shares for, even if they know the metadata for those documents. (And yet, somehow, they must be able to hand out the right shares on request!) Similar things are possible, under the name private information retrieval, and people are trying to apply that to anonymous publication, but what I said one really wants here is even stronger than the usual definition of PIR, and I’m not sure it’s theoretically possible.

Nowadays we are pretty darn sure we know how to encrypt network packets in transit such that there’s no realistic way to decrypt them. There are a whole host of second-order problems that are not as well solved, but basic message confidentiality is done—except for one thing: the length and timing of each packet are still visible to anyone who can monitor the wire. It turns out that this enables eavesdroppers to extract a frightening amount of information from those encrypted packets, such as which page of a medical-advice website you are reading [1], the language of your VoIP phone call [2], or even a transcript of your VoIP phone call [3].

In principle, we know how to close this information leak. It is simply necessary for Alice to send Bob a continuous stream of fixed-size packets, at a fixed rate, forever, whether or not she has anything useful to say. (When she doesn’t have anything useful to say, she can encrypt an endless stream of binary zeroes.) This is obviously a non-starter in any context where Alice cares about power consumption, shared channel capacity, or communicating with more than one Bob. Even when none of these is a concern—for instance, high-volume VPN links between data centers, which are running at significant utilization 24x7x365 anyway—forcing all traffic to fit the constant packet length and transmission schedule adds significant overhead. Thus, there’s a whole line of research trying to minimize that overhead, and/or see how far we can back down from the ideal-in-principle case without revealing too much.

Today’s paper offers a theoretical model and test framework that should make it easier to experiment with the high-volume-VPN-link case. I like it for its concreteness—often theoretical modeling papers feel so divorced from practice that it’s hard to see how to do anything constructive with them. It is also easy to see how the model could be extended to more sophisticated traffic-shaping techniques, which would be the obvious next step. The big downside, though, is that it only considers a fixed network topology: Alice talking to Bob and no one else, ever. Even for inter-data-center links, topologies can change on quite short notice, and a topology-change event might be exactly what Eve is listening for (perhaps it gives her advance notice of some corporate organizational change). To be fair, the continuous stream of fixed size packets scheme does completely solve the length-and-timing issue in principle; we do not have nearly as good schemes for concealing who is talking to whom, even in principle. Which is unfortunate, because you can do even more terrifying things with knowledge only of who is talking to whom. [4]