Papers tagged ‘Side channels’

k-Indistinguishable Traffic Padding in Web Applications

A Web application is a piece of software whose execution is split between the client and the server, and it sends data between the two sides every time you do something significant—even, perhaps, on every single keystroke. Inevitably, the volume of data has some correlation with the action performed. Since most Web applications are offered to the general public at low or no cost, an adversary can map out much of a target application’s internal state machine and develop a model that lets them deduce what someone’s doing from an observed sequence of packets. [1] [2] This has been successfully applied (in the lab) to Web applications for highly sensitive tasks, such as tax preparation and medical advice.

The only known solution to this problem is to pad the amount of data passed over the network in each direction so that the adversary cannot learn anything about the state of the web application from an observed sequence of packets. Unfortunately, the obvious way to do this involves unacceptably high levels of overhead—this paper quotes [1] as observing 21074% overhead for a well-known online tax system. Today’s paper proposes that techniques from the field of privacy-preserving data publishing (PPDP, [3]) be brought to bear on the problem.
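
For concreteness, here is the obvious countermeasure in toy form: round every message up to the next multiple of a fixed block size and count how many extra bytes that costs. The message sizes below are invented for illustration; only the general shape of the calculation matches what [1] measured.

```python
# Toy sketch of the "obvious" countermeasure: round every message up to a
# multiple of a fixed block size. The message sizes are made up; the 21074%
# figure in the text comes from [1], not from this calculation.

def pad_to_multiple(size: int, block: int) -> int:
    """Padded size of a message, rounded up to the next multiple of `block`."""
    return ((size + block - 1) // block) * block

def overhead(sizes, block):
    """Extra bytes sent, as a percentage of the original traffic."""
    original = sum(sizes)
    padded = sum(pad_to_multiple(s, block) for s in sizes)
    return 100.0 * (padded - original) / original

if __name__ == "__main__":
    # Hypothetical observed response sizes (bytes) for four application states.
    sizes = [212, 389, 391, 1045]
    for block in (128, 512):
        print(block, [pad_to_multiple(s, block) for s in sizes],
              f"{overhead(sizes, block):.1f}% overhead")
```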

The paper is mathematically heavy going and I’m not going to go into details of their exact proposal, because it’s really just a starting point. Instead, I’m going to pull out some interesting observations:

  • More padding is not always better. They show an example case where padding messages to a multiple of 128 bytes is exactly as good as padding them to a multiple of 512 bytes, and padding them to a multiple of 520 bytes is worse.

  • Computing the optimal amount of padding for a web application (as they define it) is NP-hard even with complete, ideal knowledge of its behavior. However, there are reasonably efficient and speedy approximations—if I understood them correctly, it might even be feasible to apply one of their algorithms on the fly with each client.

  • Because optimality is defined over an entire sequence of user interactions with an application, knowing how much padding to send in response to any given request may require the server to predict the future. Again, however, there are approximations that avoid this problem.

  • PPDP offers a whole bunch of different ways to model the degree to which one has mitigated the information leak: k-anonymity is the model they spend the most time on, but there’s also l-diversity, (α, k)-anonymity, t-closeness, etc. etc. I don’t follow that literature, but I have the impression people are still actively making up new models and haven’t spent a whole lot of time figuring out which is best in which situations. (A toy sketch of the k-anonymity flavor, applied to padding, follows this list.)
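
Here is a toy sketch of what the k-anonymity flavor looks like when applied to padding: group the observable response sizes so that at least k of them get padded to the same value, so an eavesdropper who sees a padded size can only narrow the application state down to a set of at least k candidates. This is a greedy illustration of mine, not the paper’s (NP-hard-to-optimize) formulation, and all the sizes are made up.

```python
# Toy sketch of k-indistinguishable padding (not the paper's algorithm):
# sort the observable response sizes, greedily group them into runs of at
# least k, and pad every member of a group up to the group's maximum.
# An eavesdropper who sees a padded size then learns only "one of >= k sizes".

def k_indistinguishable_padding(sizes, k):
    """Map each original size to a padded size shared by at least k sizes."""
    ordered = sorted(set(sizes))
    groups, current = [], []
    for s in ordered:
        current.append(s)
        if len(current) >= k:
            groups.append(current)
            current = []
    if current:                      # fold leftovers into the last group
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    padding = {}
    for g in groups:
        target = max(g)
        for s in g:
            padding[s] = target
    return padding

if __name__ == "__main__":
    # Hypothetical sizes of Google-suggestion-style responses, in bytes.
    sizes = [212, 389, 391, 1045, 1046, 2310]
    print(k_indistinguishable_padding(sizes, k=2))
```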

The great failing of this and all the other papers I’ve ever read in this area is operationalization. This paper’s algorithms need a complete, or nearly so, map of all possible sequences of states that a Web application can enter, and the authors shrug off the question of how you create this map in the first place. (I very strongly suspect that task can’t be mechanized in the general case—it will turn out to be the Halting Problem in another of its many disguises.) They also don’t seem to realize that things like Google search suggestions (the running example in the paper) are constantly changing. (Of course, that makes life harder for the adversary too.) This doesn’t mean there is no hope; it only means you need more approximations. I think it ought to be possible to build good-enough automatic padding into CMSes, and maybe even into web application frameworks. Trying that should be a priority for future research.

Robust and Efficient Elimination of Cache and Timing Side Channels

We’re closing out Unpublished Results Week here with a paper that was submitted to CCS 2015 last May, but doesn’t appear to have made it.

I’ve talked about side channels a little before, but let me repeat why time and cache channels keep coming up and why they’re hard to get rid of: they’re side effects of hardware-design choices that, most of the time, are exactly what you want, if you want your computer to go fast. Think about long versus short division, for instance. It takes less mental effort, and less scratch paper, to manually divide 123456789 by 3 than by 29; the same is true (sort of) for your computer. So if it’s your job to design the CPU, why shouldn’t the integer divide instruction finish quicker in the former case? Especially if you have profiling information that says that most of the time integer divide gets used with small divisors. Memory caches are even more vital—the other day I saw a graph suggesting that it takes on the order of 17,000 clock cycles for the CPU to come all the way back up to full number-crunching speed after a context switch; most of that is probably due to stuff falling out of the cache and having to be pulled back in.
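
If you want to poke at this yourself, the measurement methodology is simple, even though Python’s interpreter overhead will probably drown out the hardware effect; the same loop written in C or assembly shows divisor-dependent latency on many CPUs. Treat this as a sketch of how you would look for the effect, not a demonstration of it.

```python
# Rough probe for data-dependent division latency. Interpreter overhead will
# likely swamp the hardware effect, so this illustrates the methodology
# rather than reliably demonstrating the timing difference itself.

import timeit

def time_division(divisor, reps=200_000):
    return timeit.timeit(lambda: 123456789 // divisor, number=reps)

if __name__ == "__main__":
    for d in (3, 29, 987654321):
        print(f"divisor {d:>9}: {time_division(d):.4f} s")
```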

So: side channels are hard to get rid of in hardware because that would involve making design tradeoffs that would hurt performance for the supermajority of code that isn’t crunching sensitive data, and they’re hard to get rid of in software because the hardware, and the compiler, are actively fighting you. (The paper says several times that side-channel-free cryptographic primitives pretty much have to be done in hand-written assembly language, and that getting them right is extremely difficult even by the standards of people who write assembly language by hand.) Wouldn’t it be nice to find a big hammer that would make the problem just … go away? That’s this paper.

A principal design goal for the big hammer in this paper is ease of use. The only thing the application programmer has to do is label protected functions that need to have their side channels eliminated. And it’s not a compiler paper; they haven’t come up with a way to make a compiler do the extremely difficult assembly tuning for you. All the magic happens at runtime. Given all that, it works pretty much the only way it could work: it measures how much the timing varies and then it pads out the fast cases so they take as long as the slow cases. It also flushes state from the protected function out of as many CPU caches as possible (memory, TLB, branch target buffer, …) after the function is done, and takes steps (page coloring, CPU and memory pinning, realtime scheduling priority) to ensure that the protected function is not perturbed by what malicious code has done before it starts executing. The authors demonstrate effectiveness by testing several different crypto primitives that definitely do have side-channel information leaks, and showing that the leak is rendered unavailable in all cases. They also claim the overhead is acceptable (8.5%).
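
As I understand the description, the time-padding half of the idea reduces to something like the sketch below. The cache flushing, page coloring, and pinning all need OS and hardware support that a few lines of Python cannot show, and every name here is mine, not the paper’s; in the real system the worst-case time is measured, whereas here it is just a number you supply.

```python
# Minimal sketch of the time-padding idea only (cache flushing, page coloring,
# and CPU pinning are omitted). `worst_case_seconds` stands in for a measured
# worst-case execution time. All names are hypothetical, not the paper's API.

import time
import functools

def pad_to_worst_case(worst_case_seconds):
    """Decorator: make the wrapped call always appear to take the same time."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.perf_counter() + worst_case_seconds
            result = func(*args, **kwargs)
            while time.perf_counter() < deadline:   # busy-wait out the slack
                pass
            return result
        return wrapper
    return decorate

@pad_to_worst_case(0.010)        # assumed worst case: 10 ms (made-up number)
def protected_compare(a: bytes, b: bytes) -> bool:
    return a == b                # deliberately variable-time on its own
```

Even this toy has obvious flaws: the busy-wait’s own CPU and power footprint are observable, which is part of why the real system needs so much more machinery.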

I really like the concept and hope to see more in this vein in the future. I can’t speculate on why this particular paper didn’t get accepted at CCS, but I do feel that the overhead measurements were inadequate: they don’t consider indirect costs of flushing caches and pinning stuff to CPUs all the time. I’d like to see whole-system benchmarking of, say, a shared-hosting HTTPS server. And at least one round of adversarial testing would be nice. It is one thing to demonstrate that your measurements show that the side channels you already knew about have been closed; it is another thing if someone who spends all their time thinking of side-channel attacks cannot find any in your system.

Limitations of End-to-End Encryption in Secure Computer Networks

Today we’re going to go back in time, all the way to the dawn of computer networks. When this technical report was filed, the largest operational internetwork was still called ARPAnet and it still ran on NCP; people were still talking about hosts and communications subnetwork processors as if they were two different physical devices, and about security levels as if they were the only possible way to conceptualize access control; and I was five months old.

(I apologize for the poor quality of the linked PDF—to the best of my knowledge, this is the only version to be found online. Also, if anyone knows D.W. Snow’s first name, please tell me.)

To the best of my knowledge, this is the very first published article to discuss the things that end-to-end encryption (what we would now call a secure channel protocol) does not protect from an eavesdropper. Everyone doing computer security in 1978 was thinking mostly about protecting classified government secrets, so the authors frame the problem in terms of a Trojan Horse program with access to such secrets, but forbidden by the OS from sending messages to anyone who isn’t cleared to access the same secrets: if all outgoing network traffic is encrypted end-to-end to its (legitimate) recipient, can this Trojan Horse still exfiltrate information to someone who isn’t a legitimate recipient?

They point out that, of necessity, a packet-switched network has to reveal the destination address, transmission time, and length of every packet in cleartext. They model each of these as Shannonian communication channels, and determine sending rates on the order of 100 bits per second for each—more than adequate to leak a text document. (They observe, by way of comparison, that the standard military teletype runs at 75 bps.)
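
The arithmetic behind that kind of figure is worth spelling out: if the Trojan Horse can choose among N distinguishable packet lengths and emit R packets per second, the length channel alone carries R·log2(N) bits per second. The numbers below are illustrative choices of mine, not the report’s.

```python
# Back-of-the-envelope capacity of the packet-length covert channel: with
# n_lengths distinguishable packet lengths and pkts_per_sec packets per
# second, the channel carries pkts_per_sec * log2(n_lengths) bits/s.
# The parameters here are illustrative, not taken from the report.

import math

def length_channel_bps(n_lengths: int, pkts_per_sec: float) -> float:
    return pkts_per_sec * math.log2(n_lengths)

if __name__ == "__main__":
    print(f"{length_channel_bps(n_lengths=128, pkts_per_sec=16):.0f} bits/s")
    # ~112 bits/s -- comfortably above the 75 bps military teletype the
    # report mentions for comparison.
```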

Nowadays, this threat model might seem quaint, even silly—we put a lot more effort into preventing untrusted code from seeing secret information in the first place. The information leak, however, is real, still exists, and can be used for other purposes. The most terrifying example I know is Hookt on fon-iks, in which a completely passive eavesdropper can reconstruct the words spoken in an encrypted VoIP phone conversation, just from the length and timing of each packet. Different syllables compress to different length packets, and every natural language has rules about which syllables can follow which; the rules can be modeled with a Markov chain, and there you are.
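
To make the Markov-chain step concrete, here is a toy version of the idea: each packet length narrows the syllable down to a few candidates, and a bigram model over syllables picks the most plausible sequence. Everything in the tables below (lengths, syllables, probabilities) is invented; the real attack uses phonetic models and proper HMM decoding, not brute force.

```python
# Toy version of the Hookt-on-fon-iks idea: packet lengths narrow each
# syllable down to a few candidates, and a bigram (Markov) model over
# syllables picks the most plausible sequence. All numbers and tables here
# are invented for illustration.

from itertools import product

CANDIDATES = {            # observed packet length -> possible syllables
    52: ["ba", "da"],
    61: ["na", "ta"],
}
BIGRAM = {                # P(next | previous), invented probabilities
    ("ba", "na"): 0.6, ("ba", "ta"): 0.1,
    ("da", "na"): 0.2, ("da", "ta"): 0.3,
}

def best_transcript(lengths):
    """Brute-force the highest-scoring syllable sequence (fine at toy sizes)."""
    best, best_score = None, 0.0
    for seq in product(*(CANDIDATES[l] for l in lengths)):
        score = 1.0
        for prev, nxt in zip(seq, seq[1:]):
            score *= BIGRAM.get((prev, nxt), 0.01)
        if score > best_score:
            best, best_score = seq, score
    return best, best_score

if __name__ == "__main__":
    print(best_transcript([52, 61]))    # -> (('ba', 'na'), 0.6)
```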

The countermeasures and conclusions sections of this paper are much more embarrassing in retrospect than the dated threat model. They say, there’s nothing one can practically do about this end-to-end, but we can close the hole if we make every single intermediate relay a trusted arbiter of the (one true) security policy, at which point we don’t need end-to-end encryption… I feel quite confident in saying, even in 1978 it ought to have been obvious that that was never going to happen. What’s sad is, if people hadn’t given up on end-to-end countermeasures back then, perhaps we would actually have some by now. (It’s easy for VoIP! All you have to do is use a constant-bitrate compression algorithm. Shame none of the widely deployed VoIP programs bother.)
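
The countermeasure really is that simple, at least in sketch form: pad every compressed voice frame to one fixed size before encryption, so ciphertext lengths carry no information about the syllables spoken. The frame size and framing below are made up for illustration; a real deployment would simply run the codec in a genuinely constant-bitrate mode.

```python
# Toy sketch of the constant-bitrate countermeasure for VoIP: pad every
# compressed voice frame to one fixed size before it is encrypted.
# Assumes frames never exceed FRAME_SIZE - 1 bytes; the value is made up.

import os

FRAME_SIZE = 160          # bytes per frame on the wire (illustrative)

def pad_frame(frame: bytes) -> bytes:
    if len(frame) >= FRAME_SIZE:
        raise ValueError("frame too large for the fixed wire size")
    # One length byte, the payload, then random filler up to FRAME_SIZE.
    filler = os.urandom(FRAME_SIZE - len(frame) - 1)
    return bytes([len(frame)]) + frame + filler

def unpad_frame(wire: bytes) -> bytes:
    return wire[1:1 + wire[0]]
```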

On Subnormal Floating Point and Abnormal Timing

Most of the time, people think the floating-point unit has no security significance, simply because you don’t use the floating-point unit for anything with security significance. Cryptography, authentication, access control, process isolation, virtualization, trusted paths, it’s all done with integers. Usually not even negative integers.

Today’s paper presents some situations where that is not the case: where the low-level timing behavior of floating-point arithmetic is, in fact, security-critical. The most elemental of these situations involves displaying stuff on the screen—nowadays, everything on the screen gets run through a 3D rendering pipeline even if it looks completely flat, because you have the hardware just sitting there, and that process intrinsically involves floating point. And there’s an API to tell you how long it took to render the previous frame of animation, because if you were actually animating something you would genuinely need to know that. So if something is being displayed on the screen that you, a malicious program, are not allowed to see, but you can influence how it is displayed, you might be able to make the secret affect the rendering time, and thus extract it—slowly, but not too slowly to be practical. There is also a scenario involving differentially private databases, where you’re allowed to know an approximation to an aggregate value but not see individual table rows; the aggregate is computed with floating point, and, again, computation time can reveal the secret values.

In both cases, the floating-point computations are uniform, with no data-dependent branches or anything like that, so how does timing variation sneak in? It turns out that on all tested CPUs and GPUs, primitive floating-point arithmetic operations—add, multiply, divide—don’t always take the same amount of time to execute. Certain combinations of input values are slower than others, and predictably so. As the authors point out, this is a well-known problem for numerical programmers. It has to do with a feature of IEEE floating point known as subnormal numbers. These allow IEEE floating point to represent numbers that are very close to zero, so close that they don’t fit into the bit representation without bending the rules. This is mathematically desirable because it means that if two floating-point values are unequal, subtracting one from the other will never produce zero. However, subnormals are awkward to implement in hardware; so awkward that CPUs in the 1990s were notorious for suffering a slowdown of 100x or more for basic arithmetic on subnormals. Nowadays it’s not so bad; if I’m reading the (poorly designed) charts in this paper correctly, it’s only a 2–4x slowdown on modern hardware. But that’s still enough to detect and build a timing channel out of.
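
You can see a hint of the slowdown even from Python, since ordinary float multiplication does eventually reach the hardware FPU. Interpreter overhead dilutes the ratio, and whether any gap shows up at all depends on the CPU and on flush-to-zero settings, so take this as a sketch rather than a benchmark.

```python
# Quick check of the subnormal slowdown from Python. CPython float multiply
# does reach the hardware FPU, but interpreter overhead dilutes the effect;
# expect a smaller ratio than a C microbenchmark would show, and possibly
# none at all on hardware configured to flush subnormals to zero.

import timeit

NORMAL    = 1.5e-300        # ordinary double
SUBNORMAL = 1.5e-320        # below ~2.2e-308, so stored as a subnormal

def time_mul(x, reps=1_000_000):
    return timeit.timeit(lambda: x * 0.5, number=reps)

if __name__ == "__main__":
    n, s = time_mul(NORMAL), time_mul(SUBNORMAL)
    print(f"normal: {n:.3f} s   subnormal: {s:.3f} s   ratio: {s / n:.2f}")
```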

Timing channels are a perennial problem because they tend to be side effects of something desirable. Algorithms are often easier to understand if you dispose of special cases up-front—this paper also talks about how division by zero might be much faster than division by a normal value, presumably because the CPU doesn’t bother running through the divide circuit in that case. Running fast most of the time, slow in unusual cases, is often an excellent algorithmic choice for overall performance: hash tables, quicksort, etc. The papers that defined the concept of covert channels [1] [2] discuss timing channels introduced by data caching; without caches, Moore’s Law would have been hamstrung by fundamental physics decades ago.

However, I don’t think there’s any excuse for variable-time arithmetic or logic operations, even in floating point. (I might be persuaded to allow it for division, which is genuinely algorithmically hard.) In particular I have never understood why subnormal numbers are a problem for hardware. Now I don’t know beans about hardware design, but I have written floating-point emulation code, in assembly language, and here’s the thing: in software, subnormals are straightforward to implement. You need larger internal exponent precision, an additional test in both decode and encode, a count-leading-zeroes operation on the way in and an extra bit-shift on the way out. I didn’t try to make my emulator constant-time, but I can imagine doing it without too much trouble, at least for addition and multiplication. In hardware, that seems like it ought to translate to a wider internal data bus, a couple of muxes, a count-leading-zeroes widget, and a barrel shifter that you probably need anyway, and it should be easy to make it constant-time across both subnormals and normals, without sacrificing speed at all.
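
To illustrate what “straightforward in software” means, here is the decode step for IEEE-754 binary64: the only special treatment subnormals need is a count-leading-zeroes and a shift to bring them into the same normalized form as every other number. This is my own illustration in Python, not the emulator mentioned above.

```python
# Sketch of the decode step for IEEE-754 binary64: subnormals only need a
# count-leading-zeroes and a shift to reach the same normalized
# (sign, exponent, significand) form as normal numbers. Illustrative only.

import struct

def decode_binary64(x: float):
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    biased_exp = (bits >> 52) & 0x7FF
    frac = bits & ((1 << 52) - 1)
    if biased_exp == 0:                       # zero or subnormal
        if frac == 0:
            return sign, 0, 0                 # +/- zero
        shift = 53 - frac.bit_length()        # count leading zeroes...
        frac <<= shift                        # ...and renormalize
        exp = 1 - 1023 - shift                # widen the exponent to match
    else:                                     # normal number
        frac |= 1 << 52                       # restore the implicit leading 1
        exp = biased_exp - 1023
    return sign, exp, frac     # value = (-1)**sign * frac * 2**(exp - 52)

if __name__ == "__main__":
    for v in (1.5, 1.5e-320):                 # a normal and a subnormal
        print(v, decode_binary64(v))
```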

Constant-time floating-point operations for all inputs and outputs do strike me as harder, but only because IEEE floating point is specified to generate hardware faults and/or out-of-band exception bits under some circumstances. This was a mistake. A floating-point standard that has neither control nor status registers, and generates exclusively in-band results, is decades overdue. It would require sacrificing a bit of mantissa for a sticky “this number is inexact” flag, but I think that should be acceptable.

As a final headache, how long do you suppose it’ll be before someone comes up with an exploit that can only be fixed by making all of the transcendental function library (sin, cos, log, exp, …) constant-time? Never mind performance, I doubt this can be done without severely compromising accuracy.

Large-scale Spatiotemporal Characterization of Inconsistencies in the World’s Largest Firewall

Lots of academic research on Internet censorship treats the countries doing the censorship as monoliths: that is, measurements will typically only be conducted from one client in one fixed location (often a commercial VPS or colocation provider), and the results are assumed to reflect the situation countrywide. When you’re talking about a country as large as China, that assumption seems poorly justified, and there have been several studies aiming to collect more fine-grained information. [1] [2] [3] This paper is in that line of research, with a systematic survey of censorship of one application (Tor) in roughly 150 different locations across China, repeating the measurement at hourly intervals for 27 days. The measurement clients are diverse in terms of both geographic location and network topology.

The results largely confirm what was already suspected. This particular application is indeed blocked consistently across China, with the possible exception of CERNET (China Education and Research Network), whose filtering is less aggressive. The filtering occurs at major China-wide IXPs, as suspected from previous studies. The firewall appears to operate primarily by dropping inbound traffic to China; the authors don’t try to explain this, but earlier related research [4] points out that the firewall must wait to see a TCP SYN/ACK packet before it can successfully forge RST packets in both directions. Finally, there is concrete evidence for failures, lasting hours at a time, uncorrelated with geographic location, where traffic passes uncensored. This was anecdotally known to happen but not previously studied in any kind of detail, to my knowledge. This paper doesn’t speculate at all on why the failures happen or how we could figure that out, which I think is unfortunate.

The techniques used to collect the data are more interesting, at least to me. The principal method is called hybrid idle scanning, first presented by some of the same authors in a different paper [5]. It allows a measurement host to determine whether a client can complete a TCP handshake with a server, without itself being either the client or the server; if the handshake does not complete successfully, it reveals whether client-to-server or server-to-client packets are being lost. It does rely on an information leak in older client TCP stacks (predictable IP-ID sequences, [6]), but millions of hosts worldwide still run operating systems with these bugs—the authors cite an estimate that such hosts make up 1% of the global IPv4 address space. Thus, it’s possible to find a measurement client in any geographic location with reasonably common Internet usage. Data from this technique is backed up with more detailed information from traceroutes and SYN probes from a smaller number of locations. They also describe a previously unreported server-side information leak in Linux’s handling of half-open TCP connections, which can be used to study what IP-based blacklisting of a server looks like to that server, without access to that server.
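
For readers who haven’t met idle scanning before, the core inference works roughly like the sketch below, as I understand it: read the client’s globally incrementing IP-ID, send the server a SYN spoofed to appear to come from the client, then read the IP-ID again and see whether the counter advanced by one extra step (meaning the server’s SYN/ACK reached the client and was answered with a RST). The paper’s hybrid technique layers statistics on top of this to cope with clients that are not actually idle; the function name and interpretation strings here are mine.

```python
# Sketch of the inference at the heart of idle-scan-style measurement (not
# the paper's full hybrid technique). The measurement machine (1) reads the
# client's global IP-ID counter, (2) sends the server a SYN spoofed to look
# like it came from the client, (3) reads the client's IP-ID again. If the
# server's SYN/ACK got through, the client answered it with a RST and the
# counter advanced by one extra step.

def classify(ipid_before: int, ipid_after: int, own_probes: int = 1) -> str:
    """Interpret the IP-ID delta between two probes of the client.

    own_probes: how many replies our own probing elicited from the client
    (each reply bumps the counter by one).
    """
    delta = (ipid_after - ipid_before) % 65536     # 16-bit counter wraps
    if delta == own_probes:
        return "no SYN/ACK reached the client (server-to-client path blocked)"
    if delta == own_probes + 1:
        return "SYN/ACK reached the client (handshake path open)"
    return "inconclusive (client not idle, or IP-ID not globally incrementing)"

if __name__ == "__main__":
    print(classify(31337, 31339))   # delta of 2 => the SYN/ACK got through
```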

I’m also impressed with the authors’ systematic presentation of the hypotheses they wanted to test and how they chose to test each of them. Anyone interested in network measurements could probably learn something about how to structure an experiment from this paper.