Like Monday’s paper, today’s has not been officially published anywhere, and in fact appears never to have been finished. It’s on a fairly arcane topic, and one that’s not directly related to security—as I said a while back, floating-point numbers usually aren’t used in security-critical algorithms. However, it deserves more attention, being both a case of everyone everywhere is Doing It Wrong, and a nice example of the danger of forgetting that an abstraction is hiding something from you.

For numerical and statistical computation, one frequently needs random floating-point numbers in the open interval $(0, 1)$. However, all off-the-shelf pseudorandom number generators produce either random integers or a stream of random bits (which is functionally the same thing, since bits are trivially put together into integers and vice versa). The easy, obvious way to bridge the gap is to divide (in floating point) the integers by the largest value the PRNG can produce. This is how it’s done basically everywhere, including industrial-grade statistical languages like R…but what the paper is saying is, it’s wrong.

To understand why it’s wrong, you need to understand in detail how floating-point numbers are not mathematical real numbers. On one level this is obvious: an IEEE single occupies 32 bits and a double 64, so at most they could represent $2^{32}$ or $2^{64}$ different numbers, respectively; that isn’t even countably infinite. (A large chunk of the encoding is reserved for “not a number” values, so the true count is somewhat smaller.) But high-level programming languages encourage us to use floating-point numbers as if they were the reals, and most of the time we’re working with values that are well-approximated by floats, and we get away with it.

The numbers that a binary floating-point format can represent are all of the form

$\pm m.mmm\ldots m \times 2^{\pm e.eee\ldots e}$

where $\pm m.mmm\ldots m$ (the mantissa) and $\pm e.eee\ldots e$ (the exponent) are binary numbers of fixed length. For IEEE singles, the mantissa has 25 bits and the exponent 8, counting their respective signs in both cases, and there’s a clever trick (not relevant here) to fit that all into 32 bits total. This is just the scientific notation you probably learned in high school, in base 2 instead of base 10, and if the mantissa and exponent were infinitely wide it could represent all real numbers. However, they are finite, and this means not only that it can’t, but that the values it can represent are not uniformly distributed over the real line. Every time you pass a power of two, the exponent ticks up or down by one, but the mantissa stays the same length, so it loses or gains one bit of precision. Therefore, there are the same number of representable values between 1 and 2 as there are between 0.5 and 1, between 0.25 and 0.5, and so on, and each range’s representable numbers are twice as far apart as those in the next smaller range.
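The equal-count, doubling-gap claim is easy to check from the bit patterns. The following is a minimal sketch, assuming only Python's standard `struct` module; the helper names `f32_bits`, `f32_from_bits`, `count_singles`, and `gap_above` are made up for this illustration. It relies on the fact that, for positive finite floats, consecutive bit patterns are consecutive representable values.

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x as an IEEE single (x must be exactly representable)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def f32_from_bits(b: int) -> float:
    """IEEE single with bit pattern b, returned as a Python float."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def count_singles(lo: float, hi: float) -> int:
    """Number of representable singles in [lo, hi), for 0 < lo < hi."""
    return f32_bits(hi) - f32_bits(lo)

def gap_above(x: float) -> float:
    """Distance from x to the next larger representable single."""
    return f32_from_bits(f32_bits(x) + 1) - x

for lo, hi in [(0.25, 0.5), (0.5, 1.0), (1.0, 2.0)]:
    print(f"[{lo}, {hi}): {count_singles(lo, hi)} values, gap {gap_above(lo)}")
```

Each range should report the same number of values, with the gap doubling from one range to the next.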

Now, when you generate a random integer and divide it by the largest integer the RNG can produce, the result is uniformly distributed over $(0, 1)$. If the largest integer the RNG can produce is $2^{32}-1$, as is often the case (even on systems with 64-bit integers), the gap between values producible by this method will be about $2 \times 10^{-10}$. The figure above compares the spacing of these values along the real line with the spacing of the values representable by an IEEE single-precision float. Only from 0.001953125 ($2^{-9}$) to 0.003906250 ($2^{-8}$) is there a one-to-one correspondence. Above $2^{-8}$, the representable numbers are more widely spaced than the values that the RNG will produce, which means clusters of RNG-produced values will get rounded to the same representable number, which introduces non-uniformity. Below $2^{-9}$, it’s the other way around: all of the RNG-produced values are exactly representable, but there are many more that can’t be produced. Downey estimates that over the entire range, only 7% of the numbers are producible.
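To see both effects concretely, here is a rough sketch that counts how many distinct singles come out of a run of consecutive generator outputs. It assumes a generator whose largest output is $2^{32}-1$ (the constant `RNG_MAX` is an assumption, not any particular library's value) and uses `struct` to round Python's doubles to single precision; `to_f32` and `distinct_singles` are names invented for the example.

```python
import struct

RNG_MAX = 2**32 - 1  # assumed largest value the integer generator returns

def to_f32(x: float) -> float:
    """Round a Python float (an IEEE double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def distinct_singles(start: int, count: int) -> int:
    """Distinct singles produced by `count` consecutive generator outputs."""
    return len({to_f32(k / RNG_MAX) for k in range(start, start + count)})

# Near zero every output keeps its own float; near 0.5 many outputs collapse.
near_zero = distinct_singles(1, 1024)
near_half = distinct_singles(2**31, 1024)
print("distinct singles near 0:  ", near_zero)
print("distinct singles near 0.5:", near_half)
```

Near zero, all 1024 outputs should survive as distinct floats (while the many floats lying between them are never produced); near 0.5, the same number of outputs should collapse onto only a handful of floats.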

The most significant gap is right at zero. The smallest number IEEE single can represent is on the order of $10^{-38}$ or $10^{-45}$ (depending on whether you allow denormals), and IEEE double goes all the way down to $2 \times 10^{-308}$ (or $5 \times 10^{-324}$ with denormals). But the smallest number producible by the simple RNG method is $2 \times 10^{-10}$, dozens or hundreds of decimal orders of magnitude bigger. Downey suggests that this could cause serious problems for statistical algorithms, such as inverse transform sampling of the exponential distribution.
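For the exponential case the size of the effect can be worked out directly. The sketch below assumes a unit-rate exponential sampled as $-\ln u$, and again assumes a generator topping out at $2^{32}-1$; `RNG_MAX` and the variable names are illustrative only.

```python
import math

RNG_MAX = 2**32 - 1  # assumed largest value the integer generator returns

# Smallest nonzero u the divide-by-max method can produce.
smallest_u = 1 / RNG_MAX

# Inverse transform for a unit-rate exponential: x = -ln(u).
# The smallest u therefore caps the largest deviate that can ever appear.
largest_naive = -math.log(smallest_u)

# For comparison: the cap if every positive double were reachable,
# down to the smallest denormal, 2**-1074.
largest_possible = -math.log(math.ldexp(1.0, -1074))

print(f"cap with divide-by-max: {largest_naive:.2f}")
print(f"cap with all doubles:   {largest_possible:.2f}")
```

Under these assumptions the simple method truncates the distribution at roughly 22.2, whereas a generator that could reach every positive double would truncate it only at roughly 744.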

Downey’s proposed fix rests on the observation that a random number uniformly distributed over $(0, 1)$ will be greater than 0.5 exactly half of the time; when it isn’t, it will be greater than 0.25 exactly half of the time; and so on. Therefore, his algorithm first selects an exponent by flipping coins in a loop (that is, drawing one bit at a time from the output of the RNG). The exponent starts out as −1, corresponding to a number between 0.5 and 1; each time the bit is 0, the exponent is reduced by 1 and another bit is drawn, until either a 1 is drawn or the exponent reaches its minimum allowable value. Then the algorithm fills in the mantissa with random bits. Finally, there’s a small correction: if the mantissa happens to come out zero, flip another bit and if it’s 1, increase the exponent by 1 again. This accounts for values that are exactly 1 over a power of two, which straddle a boundary between exponents and therefore can be selected from either side; without the correction they would be selected slightly more often than is appropriate.
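The procedure just described can be sketched in a few lines. This is my own reading of the description, adapted to IEEE doubles, not the code from the paper; the function name `random_unit_double`, the `bits` parameter, and the two constants are assumptions made for the example. Note that, as written, it can return exactly 0.0 or 1.0 with very small probability, so a caller who needs the strictly open interval would have to reject those.

```python
import math
import random
from typing import Callable

MANT_BITS = 52          # explicit mantissa bits in an IEEE double
MIN_NORMAL_EXP = -1022  # smallest exponent of a normal double

def random_unit_double(bits: Callable[[int], int] = random.getrandbits) -> float:
    """Sketch: choose the exponent by coin flips, then fill in the mantissa.

    `bits(k)` must return a uniformly random k-bit non-negative integer.
    """
    # Start in [0.5, 1); each 0 bit moves down to the next binade.
    # Falling below MIN_NORMAL_EXP means "the denormal range".
    exponent = -1
    while exponent >= MIN_NORMAL_EXP and bits(1) == 0:
        exponent -= 1

    mantissa = bits(MANT_BITS)

    # Boundary correction: a zero mantissa sits on the edge between two
    # binades, so half the time it is moved to the upper one.
    if mantissa == 0 and bits(1) == 1:
        exponent += 1

    if exponent < MIN_NORMAL_EXP:
        # Denormal: no implicit leading 1 bit.
        return math.ldexp(mantissa, MIN_NORMAL_EXP - MANT_BITS)
    # Normal: put the implicit leading 1 in front of the mantissa.
    return math.ldexp((1 << MANT_BITS) | mantissa, exponent - MANT_BITS)

print(random_unit_double())
```

Passing the bit source as a parameter is a design choice for the sketch: it makes it easy to substitute a cryptographic generator (for instance `random.SystemRandom().getrandbits`) or a scripted sequence of bits for testing.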

It’s too bad Downey never finished this paper. The biggest missing piece is a clear and convincing demonstration that the naïve algorithm does introduce significant errors into common calculations. For exponential sampling by inverse transform, there is an inarguable error (the distribution is truncated on the right) but one could argue that it doesn’t rise to the level of significance because, for a unit-rate distribution, exponential deviates larger than about 22 (that is, $\ln 2^{32}$) should only get drawn one time in $10^{10}$ or so. There are no other examples of potentially problematic tasks, and I am not enough of a statistician to think of any myself. IEEE double has enough mantissa, even in the range from 0.5 to 1, that it can represent every multiple of $2 \times 10^{-10}$, so nonuniformity does not occur if you’re generating doubles by the simple method, only missing values.

There are also a couple of practical problems with the algorithm. A potentially lengthy run of coin tosses, with the requirement that every bit is independent and identically distributed, is poorly handled by many ordinary RNGs; I would only use this algorithm with a cryptographic RNG. Relatedly, on average the coin-tossing phase will terminate quickly, but if you do hit a long run of zeroes it will be observably slower. I don’t see a good way to implement the algorithm so that it uses a fixed number of random bits per float generated, though, short of generating all the coin tosses in advance, which would consume 1078 bits of randomness for every IEEE double; this would probably be unacceptably slow overall.

Most of the time, people think the floating-point unit has no security significance, simply because you don’t use the floating-point unit for anything with security significance. Cryptography, authentication, access control, process isolation, virtualization, trusted paths, it’s all done with integers. Usually not even negative integers.

Today’s paper presents some situations where that is not the case: where the low-level timing behavior of floating-point arithmetic is, in fact, security-critical. The most elemental of these situations involves displaying stuff on the screen. Nowadays, everything on the screen gets run through a 3D rendering pipeline even if it looks completely flat, because you have the hardware just sitting there, and that process intrinsically involves floating point. And there’s an API to tell you how long it took to render the previous frame of animation, because if you were actually animating something you would genuinely need to know that. So if something is being displayed on the screen whose contents you, a malicious program, are not allowed to know, but you can influence how it is displayed, you might be able to make the secret information affect the rendering time, and thus extract it: slowly, but not too slowly to be practical. There is also a scenario involving differentially private databases, where you’re allowed to know an approximation to an aggregate value but not see individual table rows; the aggregate is computed with floating point, and, again, computation time can reveal the secret values.

In both cases, the floating-point computations are uniform, with no data-dependent branches or anything like that, so how does timing variation sneak in? It turns out that on all tested CPUs and GPUs, primitive floating-point arithmetic operations—add, multiply, divide—don’t always take the same amount of time to execute. Certain combinations of input values are slower than others, and predictably so. As the authors point out, this is a well-known problem for numerical programmers. It has to do with a feature of IEEE floating point known as subnormal numbers. These allow IEEE floating point to represent numbers that are very close to zero, so close that they don’t fit into the bit representation without bending the rules. This is mathematically desirable because it means that if two floating-point values are unequal, subtracting one from the other will never produce zero. However, subnormals are awkward to implement in hardware; so awkward that CPUs in the 1990s were notorious for suffering a slowdown of 100x or more for basic arithmetic on subnormals. Nowadays it’s not so bad; if I’m reading the (poorly designed) charts in this paper correctly, it’s only a 2–4x slowdown on modern hardware. But that’s still enough to detect and build a timing channel out of.
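The "unequal values never subtract to zero" property is simple to demonstrate with doubles. The sketch below uses only the standard library; `is_subnormal` is a helper invented for the example. It does not attempt to measure the slowdown itself, since interpreter overhead in Python would swamp a 2–4x difference in a single arithmetic instruction.

```python
import math
import sys

smallest_normal = sys.float_info.min         # 2**-1022
smallest_subnormal = math.ldexp(1.0, -1074)  # 2**-1074, about 5e-324

def is_subnormal(x: float) -> bool:
    """True if x is nonzero but too small to be a normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min

# Two distinct normal numbers whose difference is too small to be normal.
a = 1.5 * smallest_normal
b = smallest_normal
diff = a - b  # exactly 2**-1023, representable only as a subnormal

print("a != b:           ", a != b)
print("a - b:            ", diff)
print("a - b is subnormal:", is_subnormal(diff))
```

Without subnormals, `a - b` would have to be flushed to zero even though `a != b`; with them, the difference is held exactly.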

Timing channels are a perennial problem because they tend to be side-effects of something desirable. Algorithms are often easier to understand if you dispose of special cases up-front—this paper also talks about how division by zero might be much faster than division by a normal value, presumably because the CPU doesn’t bother running through the divide circuit in that case. Running fast most of the time, slow in unusual cases, is often an excellent algorithmic choice for overall performance: hash tables, quicksort, etc. The papers that defined the concept of covert channels [1] [2] discuss timing channels introduced by data caching, without which Moore’s Law would have been hamstrung by fundamental physics decades ago.

However, I don’t think there’s any excuse for variable-time arithmetic or logic operations, even in floating point. (I might be persuaded to allow it for division, which is genuinely algorithmically hard.) In particular I have never understood why subnormal numbers are a problem for hardware. Now I don’t know beans about hardware design, but I have written floating-point emulation code, in assembly language, and here’s the thing: in software, subnormals are straightforward to implement. You need larger internal exponent precision, an additional test in both decode and encode, a count-leading-zeroes operation on the way in and an extra bit-shift on the way out. I didn’t try to make my emulator constant-time, but I can imagine doing it without too much trouble, at least for addition and multiplication. In hardware, that seems like it ought to translate to a wider internal data bus, a couple of muxes, a count-leading-zeroes widget, and a barrel shifter that you probably need anyway, and it should be easy to make it constant-time across both subnormals and normals, without sacrificing speed at all.
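To make the software side concrete, here is a minimal sketch of the decode step for IEEE singles, showing the extra test and the count-leading-zeroes normalization. It is an illustration written for this post, not the emulator mentioned above; the name `decode_f32` and the choice to reject infinities and NaNs are assumptions to keep the example short.

```python
import math
import struct

def decode_f32(bits: int) -> tuple[int, int, int]:
    """Unpack a single's bit pattern into (sign, exponent, significand).

    The value is (-1)**sign * significand * 2**exponent. For nonzero inputs
    the significand is normalized to 24 bits with the top bit set, whether
    the input was normal or subnormal.
    """
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if biased == 0xFF:
        raise ValueError("infinities and NaNs are not handled in this sketch")
    if biased == 0:
        if frac == 0:
            return sign, 0, 0  # signed zero
        # Subnormal: no implicit leading 1, value is frac * 2**-149.
        # Count leading zeroes in the 24-bit field and shift them out; the
        # exponent then drops below the normal range, which is why the
        # internal exponent needs more precision than the encoded one.
        shift = 24 - frac.bit_length()
        return sign, -149 - shift, frac << shift
    # Normal: restore the implicit leading 1.
    return sign, biased - 150, frac | 0x800000

sign, exponent, significand = decode_f32(0x00000001)  # smallest subnormal
print(sign, exponent, hex(significand))
print(math.ldexp(significand, exponent))
```

After this step the arithmetic core can treat both kinds of input identically; the encode step would do the reverse shift when the result's exponent falls below the normal range.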

Constant-time floating-point operation for all inputs and outputs does strike me as harder, but only because IEEE floating point is specified to generate hardware faults and/or out-of-band exception bits under some circumstances. This was a mistake. A floating-point standard that has neither control nor status registers, and generates exclusively in-band results, is decades overdue. It would require sacrificing a bit of mantissa for a sticky “this number is inexact” flag, but I think that should be acceptable.

As a final headache, how long do you suppose it’ll be before someone comes up with an exploit that can only be fixed by making all of the transcendental function library (sin, cos, log, exp, …) constant-time? Never mind performance, I doubt this can be done without severely compromising accuracy.