Papers tagged ‘Hardware’

Stealthy Dopant-Level Hardware Trojans

Most of the time we treat silicon chips—CPUs, for instance—as black boxes. The manufacturer publishes specifications for integrating the chip into a larger widget and making use of it; we code to those specifications, and the chip does its job. But inside the package is a fiendishly complicated machine, and there have been plenty of incidents where it didn’t work quite right, e.g. the infamous F00F and FDIV bugs in the Pentium. This is not to pick on Intel; every CPU manufacturer has had similar troubles, but only the x86 is sufficiently famous that its troubles make Wikipedia.

Anything that can happen by accident can happen on purpose. Most chip designers have to contract manufacture out to a fab plant run by a separate company, and the manufacturing process is opaque to them. It might occur to you to wonder whether the manufacturer can tamper with the design. It’s possible to disassemble a chip under an electron microscope and make sure all the wires are where they’re supposed to be, but this is expensive and destroys the chip, and it can only detect some of the possible changes to the design.

This paper presents a technique a fab plant could use to sabotage a chip design, one that can’t be detected by any current technique for inspecting a chip after the fact. Basically, by changing the dopant mask, the fab can turn individual transistors into shorts to ground or Vcc. This is invisible to a microscope; all the wires are where they’re supposed to be, it’s just that a block of semiconductor has the wrong electronic properties. The authors demonstrate that this can be used to introduce subtle bugs that will not be caught by functional testing, such as making the output of a hardware RNG predictable, or introducing a power-consumption side channel into a cryptographic accelerator.
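
To get a feel for what this means for the RNG example, here is a toy software simulation of a stuck-at sabotage; the mask, the constant, and the 32-bit width are all invented for illustration (the real attack operates on the chip’s internal circuitry, not on code).

```c
/* Toy model of the dopant-trojan effect on a hardware RNG: the sabotaged
 * transistors hold most of the output bits at a fixed level, so the
 * "random" value can only take on a handful of distinct values.  The mask
 * and constant below are illustrative, not from the paper. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define STUCK_MASK  0xFFFFFF00u   /* bits shorted to a fixed level by the trojan */
#define STUCK_VALUE 0xA5A5A500u   /* the level they are stuck at */

static uint32_t honest_rng(void) { return (uint32_t)rand(); }

static uint32_t trojaned_rng(void)
{
    uint32_t x = honest_rng();
    /* the physical entropy source still runs, but the altered dopant
     * regions override 24 of the 32 output bits */
    return (x & ~STUCK_MASK) | STUCK_VALUE;
}

int main(void)
{
    /* only the low 8 bits still vary: 256 possible outputs instead of 2^32,
     * trivially brute-forceable, while simple statistical self-tests on the
     * surviving bits can still pass */
    for (int i = 0; i < 4; i++)
        printf("%08x\n", trojaned_rng());
    return 0;
}
```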

No solutions are presented; I’m not much of a hardware person and I have no idea what solutions might be possible. This is the hardware equivalent of a malicious compiler, and people are working on solving that problem—but they rely on the fact that you can inspect the output of a compiler in detail, because it’s just another string of bits. How do you detect that a block of silicon has been doped with boron instead of phosphorus? X-ray crystallography, maybe?

Using Memory Errors to Attack a Virtual Machine

All modern operating systems include some form of process isolation—a mechanism for ensuring that two programs can run simultaneously on the same machine without interfering with each other. Standard techniques include virtual memory and static verification. Whatever the technique, though, it relies on the axiom that (as the authors of today’s paper put it) the computer faithfully executes its specified instruction set. In other words, a hardware fault can potentially ruin everything. But how practical is it to exploit a hardware fault? Hardware faults are unpredictable, and the most common syndrome is for a single bit in RAM to flip. How much damage could one bit-flip do, when you have no control over where or when it will happen?

Today’s paper demonstrates that, if you are able to run at least one program yourself on a target computer, a single bit-flip can give you total control. Their exploit has two pieces. First, they figured out a way to escalate a single bit-flip to total control of a Java virtual machine; basically, once the bit flips, they have two pointers to the same datum but with different types (integer, pointer) and that allows them to read and write arbitrary memory locations, which in turn enables them to overwrite the VM’s security policy with one that lets the program do whatever it wants. That’s the hard part. The easy part is, their attack program fills up the computer’s memory with lots of copies of the data structure that will give them total control if a bit flips inside it. That way, the odds of the bit flip being useful are greatly increased. (This trick is known as heap spraying and it turns out to be useful for all sorts of exploits, not just the ones where you mess with the hardware.)
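
A back-of-the-envelope version of why the spraying matters, with numbers invented for illustration rather than taken from the paper:

```c
/* Why heap spraying helps: the probability that a single random bit flip
 * lands somewhere exploitable is roughly the fraction of memory covered by
 * the sprayed structures, times the fraction of each structure whose bits
 * are actually useful to flip.  All numbers here are made up. */
#include <stdio.h>

int main(void)
{
    double sprayed_fraction = 0.60;  /* fraction of RAM filled with copies of the structure */
    double useful_fraction  = 0.90;  /* fraction of each copy that is a pointer worth corrupting */

    printf("P(one random flip is exploitable) ~= %.2f\n",
           sprayed_fraction * useful_fraction);
    return 0;
}
```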

This is all well and good, but you’re still waiting for a cosmic ray to hit the computer and flip a bit for you, aren’t you? Well, maybe there are other ways to induce memory errors. Cosmic rays are ionizing radiation, which you can also get from radioactivity. But readily available radioactive material (e.g. in smoke detectors) isn’t powerful enough, and powerful enough radioactive material is hard to get (there’s a joke where the paper suggests that the adversary could purchase a small oil drilling company in order to get their hands on a neutron source). But there’s also heat, and that turns out to work quite well, at least on the DRAM that was commonly available in 2003. Just heat the chips to 80–100°C and they start flipping bits. The trick here is inducing enough bit-flips to exploit, but not so many that the entire computer crashes; they calculate an ideal error rate at which the exploit program has a 71% chance of succeeding before the computer crashes.

The attack in this paper probably isn’t worth worrying about in real life. But attacks only ever get nastier. For instance, about a decade later some people figured out how to control where the memory errors happen, and how to induce them without pointing a heat gun at the hardware: that was “An Experimental Study of DRAM Disturbance Errors”, which I reviewed back in April. That attack is probably practical, if the hardware isn’t taking any steps to prevent it.

On Subnormal Floating Point and Abnormal Timing

Most of the time, people think the floating-point unit has no security significance, simply because you don’t use the floating-point unit for anything with security significance. Cryptography, authentication, access control, process isolation, virtualization, trusted paths, it’s all done with integers. Usually not even negative integers.

Today’s paper presents some situations where that is not the case: where the low-level timing behavior of floating-point arithmetic is, in fact, security-critical. The most basic of these situations involves displaying stuff on the screen—nowadays, everything on the screen gets run through a 3D rendering pipeline even if it looks completely flat, because the hardware is just sitting there, and that process intrinsically involves floating point. And there’s an API to tell you how long it took to render the previous frame of animation, because if you were actually animating something you would genuinely need to know that. So suppose something is being displayed on the screen that you, a malicious program, are not allowed to read, but you can influence how it is displayed: you might be able to make the secret affect the rendering time, and thus extract it—slowly, but not too slowly to be practical. There is also a scenario involving differentially private databases, where you’re allowed to know an approximation to an aggregate value but not see individual table rows; the aggregate is computed with floating point, and, again, computation time can reveal the secret values.
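
The attacker’s side of such a channel has a very simple shape; in this sketch the victim operation and its timings are stand-ins I made up, not anything from the paper:

```c
/* Shape of a timing-channel measurement: the attacker cannot read the
 * secret bit directly, but can trigger an operation whose duration depends
 * on it and time that operation.  render_with_secret() is a hypothetical
 * stand-in for a frame render or a database query. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static int secret_bit = 1;   /* what the attacker wants to learn */

static void render_with_secret(void)
{
    /* pretend the secret-dependent case is slower, e.g. because it drags
     * subnormal values through the arithmetic */
    usleep(secret_bit ? 2000 : 1000);
}

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    render_with_secret();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long us = (t1.tv_sec - t0.tv_sec) * 1000000L
            + (t1.tv_nsec - t0.tv_nsec) / 1000;
    printf("guessed secret bit = %d\n", us > 1500 ? 1 : 0);
    return 0;
}
```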

In both cases, the floating-point computations are uniform, with no data-dependent branches or anything like that, so how does timing variation sneak in? It turns out that on all tested CPUs and GPUs, primitive floating-point arithmetic operations—add, multiply, divide—don’t always take the same amount of time to execute. Certain combinations of input values are slower than others, and predictably so. As the authors point out, this is a well-known problem for numerical programmers. It has to do with a feature of IEEE floating point known as subnormal numbers. These allow IEEE floating point to represent numbers that are very close to zero, so close that they don’t fit into the bit representation without bending the rules. This is mathematically desirable because it means that if two floating-point values are unequal, subtracting one from the other will never produce zero. However, subnormals are awkward to implement in hardware; so awkward that CPUs in the 1990s were notorious for suffering a slowdown of 100x or more for basic arithmetic on subnormals. Nowadays it’s not so bad; if I’m reading the (poorly designed) charts in this paper correctly, it’s only a 2–4x slowdown on modern hardware. But that’s still enough to detect and build a timing channel out of.
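
The slowdown is easy to reproduce; a minimal C demonstration might look like the one below. The exact ratio depends on the CPU, and aggressive compiler options such as -ffast-math usually enable flush-to-zero, which hides the effect entirely.

```c
/* Time the same loop with a normal and a subnormal operand.  x converges to
 * 2*d, so if d is subnormal every add and multiply stays in subnormal range. */
#include <stdio.h>
#include <time.h>

static double time_loop(double d)
{
    volatile double x = d;   /* volatile keeps the loop from being folded away */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < 10 * 1000 * 1000; i++)
        x = x * 0.5 + d;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    printf("normal    operands: %.3f s\n", time_loop(1.0));     /* stays around 2.0 */
    printf("subnormal operands: %.3f s\n", time_loop(1e-310));  /* stays around 2e-310 */
    return 0;
}
```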

Timing channels are a perennial problem because they tend to be side effects of something desirable. Algorithms are often easier to understand if you dispose of special cases up front—this paper also talks about how division by zero might be much faster than division by a normal value, presumably because the CPU doesn’t bother running through the divide circuit in that case. Running fast most of the time, slow in unusual cases, is often an excellent algorithmic choice for overall performance: hash tables, quicksort, etc. The papers that defined the concept of covert channels [1] [2] discuss timing channels introduced by data caching; without caches, Moore’s Law would have been hamstrung by fundamental physics decades ago.

However, I don’t think there’s any excuse for variable-time arithmetic or logic operations, even in floating point. (I might be persuaded to allow it for division, which is genuinely algorithmically hard.) In particular I have never understood why subnormal numbers are a problem for hardware. Now I don’t know beans about hardware design, but I have written floating-point emulation code, in assembly language, and here’s the thing: in software, subnormals are straightforward to implement. You need larger internal exponent precision, an additional test in both decode and encode, a count-leading-zeroes operation on the way in and an extra bit-shift on the way out. I didn’t try to make my emulator constant-time, but I can imagine doing it without too much trouble, at least for addition and multiplication. In hardware, that seems like it ought to translate to a wider internal data bus, a couple of muxes, a count-leading-zeroes widget, and a barrel shifter that you probably need anyway, and it should be easy to make it constant-time across both subnormals and normals, without sacrificing speed at all.
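
To make that concrete, here is a sketch of what the decode step can look like for IEEE double precision, written as portable C rather than the assembly emulator I mentioned; __builtin_clzll is GCC/Clang’s count-leading-zeroes intrinsic.

```c
/* Unpack an IEEE double into sign / exponent / mantissa, handling
 * subnormals.  The only extra work for a subnormal is a count-leading-
 * zeroes, a shift, and a wider exponent; this is an illustration of the
 * idea, not production float-emulation code. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct unpacked {
    int      sign;       /* 0 or 1 */
    int32_t  exponent;   /* unbiased, wider than the 11-bit field */
    uint64_t mantissa;   /* normalized so bit 52 is the leading 1 (when nonzero) */
};

static struct unpacked decode(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);

    struct unpacked u;
    u.sign = (int)(bits >> 63);
    int32_t  rawexp = (int32_t)((bits >> 52) & 0x7FF);
    uint64_t frac   = bits & ((1ULL << 52) - 1);

    if (rawexp == 0) {                            /* zero or subnormal */
        if (frac == 0) { u.exponent = 0; u.mantissa = 0; return u; }
        int shift  = __builtin_clzll(frac) - 11;  /* move the leading 1 up to bit 52 */
        u.mantissa = frac << shift;
        u.exponent = 1 - 1023 - shift;            /* value = mantissa * 2^(exponent-52) */
        return u;
    }
    u.mantissa = frac | (1ULL << 52);             /* normal: restore the implicit 1 */
    u.exponent = rawexp - 1023;
    return u;
}

int main(void)
{
    struct unpacked a = decode(1.0), b = decode(5e-324);  /* 5e-324 = smallest subnormal */
    printf("1.0    -> exponent %d\n", a.exponent);        /* prints 0 */
    printf("5e-324 -> exponent %d\n", b.exponent);        /* prints -1074 */
    return 0;
}
```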

Constant-time floating-point operations for all inputs and outputs do strike me as harder, but only because IEEE floating point is specified to generate hardware faults and/or out-of-band exception bits under some circumstances. This was a mistake. A floating-point standard that has neither control nor status registers, and generates exclusively in-band results, is decades overdue. It would require sacrificing a bit of mantissa for a sticky “this number is inexact” flag, but I think that should be acceptable.

As a final headache, how long do you suppose it’ll be before someone comes up with an exploit that can only be fixed by making all of the transcendental function library (sin, cos, log, exp, …) constant-time? Never mind performance, I doubt this can be done without severely compromising accuracy.

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors

A few weeks ago, a new exploit technique called rowhammering was making the rounds of all the infosec news outlets. Turns out you can flip bits in RAM by reading, over and over and over again, from other bits that are physically adjacent on the RAM chip; turns out that with sufficient cleverness you can leverage that to escape security restrictions being enforced in software. Abstractly, it’s as simple as toggling the “this process is allowed to do whatever it wants” bit, which is a real bit in RAM on many operating systems—most of the cleverness is in grooming the layout of physical RAM so that that bit is physically adjacent to a group of bits that you’re allowed to read from.

This paper describes and analyzes the underlying physical flaw that makes this exploit possible. It is (as far as I know) the first suggestion in the literature that this might be a security vulnerability; chip manufacturers have been aware of it as a reliability problem for much longer. [1] [2] If you’re not familiar with the details of how DRAM works, there are two important things to understand. First, DRAM deliberately trades stability for density. You can (presently) pack more bits into a given amount of chip surface area with DRAM than with any other design, but the bits don’t retain their values indefinitely; each bit has an electrical charge that slowly leaks away. DRAM is therefore refreshed continuously; every few tens of milliseconds, each bit will be read and written back, restoring the full charge.

Second, software people are accustomed to thinking of memory at the lowest level of abstraction as a one-dimensional array of bytes, an address space. But DRAM is physically organized as a two-dimensional array of bits, laid out as a grid on the two-dimensional surface of the chip. To read data from memory, the CPU first instructs the memory bank to load an entire row into a buffer (unimaginatively called the row buffer), and then to transmit columns (which are the width of entire cache lines) from that row. Rows are huge—32 to 256 kilobits, typically; larger than the page granularity of memory access isolation. This means that bits which are far apart in the CPU’s address spaces (even the so-called physical address space!) may be right next to each other on a RAM chip. And bits which are right next to each other on a RAM chip may belong to different memory-isolation regions, so it’s potentially a security problem if they can affect each other.
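
As a toy illustration of that last point, suppose (purely for the sake of the example) a straight linear mapping from physical addresses to 8 KiB rows; real controllers interleave channels, ranks, and banks, but the conclusion is the same:

```c
/* Two addresses in different 4 KiB pages (potentially owned by different
 * processes) can still land in adjacent DRAM rows.  The 8 KiB row size and
 * the linear mapping are assumptions made for illustration only. */
#include <stdint.h>
#include <stdio.h>

#define ROW_BYTES (8 * 1024)

static uint64_t row_of(uint64_t paddr) { return paddr / ROW_BYTES; }

int main(void)
{
    uint64_t victim   = 0x1000000;            /* a page belonging to process A */
    uint64_t attacker = victim + ROW_BYTES;   /* a different page, belonging to B */
    printf("victim row %llu, attacker row %llu (adjacent)\n",
           (unsigned long long)row_of(victim),
           (unsigned long long)row_of(attacker));
    return 0;
}
```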

Now, the physical flaw is simply that loading one row of a DRAM chip into the row buffer can cause adjacent rows to leak their charges faster, due to unavoidable electromagnetic phenomena, e.g. capacitive coupling between rows. Therefore, if you force the memory bank to load a particular row over and over again, enough charge might leak out of an adjacent row to make the bits change their values. (Reading a row implicitly refreshes it, so the attack row itself won’t be affected.) It turns out to be possible to do this using only ordinary CPU instructions.
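
The access loop itself is tiny; a sketch of its well-known form is below. The genuinely fiddly parts (choosing two addresses that map to different rows of the same bank, and scanning a third row for flipped bits afterwards) are omitted, and the addresses here are placeholders.

```c
/* Hammer two rows: read both addresses, then flush them from the cache so
 * that the next iteration's reads go all the way to DRAM and re-open the
 * rows.  Requires x86 with SSE2 for _mm_clflush / _mm_mfence. */
#include <emmintrin.h>
#include <stdint.h>
#include <stdlib.h>

static void hammer(volatile uint8_t *a, volatile uint8_t *b, long iterations)
{
    for (long i = 0; i < iterations; i++) {
        (void)*a;                        /* activates the row containing a */
        (void)*b;                        /* activates the row containing b */
        _mm_clflush((const void *)a);    /* evict, so the next read misses the cache */
        _mm_clflush((const void *)b);
        _mm_mfence();                    /* keep the accesses from being reordered */
    }
}

int main(void)
{
    /* placeholder addresses: in a real attempt these must map to different
     * rows in the same DRAM bank, which takes extra work to arrange */
    uint8_t *buf = malloc(1 << 20);
    if (!buf) return 1;
    hammer(buf, buf + (1 << 19), 1000000);
    free(buf);
    return 0;
}
```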

The bulk of this paper is devoted to studying exactly when and how this can happen on a wide variety of different DRAM modules from different vendors. Important takeaways include: All vendors have affected products. Higher-capacity chips are more vulnerable, and newer chips are more vulnerable—both of these are because the bit-cells in those chips are smaller and packed closer together. The most common error-correcting code used for DRAM is inadequate to protect against or even detect the attack, because it can only correct one-bit errors and detect two-bit errors within any 64-bit word, but under the attack, several nearby bits often flip at the same time. (You should still make a point of buying error-correcting RAM, plus a motherboard and CPU that actually take advantage of it. One-bit errors are the normal case and they can happen as frequently as once a day. See “Updates on the need to use error-correcting memory” for more information.)

Fortunately, the authors also present a simple fix: the memory controller should automatically refresh nearby rows with some low, fixed probability whenever a row is loaded. This will have no visible effect in normal operation, but if a row is being hammered, nearby rows will get refreshed sufficiently more often that bit flips will not happen. It’s not clear to anyone whether this has actually been implemented in the latest motherboards, vendors being infamously closemouthed about how their low-level hardware actually works. However, several of the authors here work for Intel, and I’d expect them to take the problem seriously, at least.
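
The policy is simple enough to sketch; here is my reading of it, with the probability value and the function names as placeholders rather than anything from a real memory controller:

```c
/* Probabilistic adjacent-row refresh: every time a row is activated, refresh
 * each of its neighbors with some small probability p.  A hammered row is
 * activated so many times that its neighbors are all but guaranteed to be
 * refreshed long before they can lose enough charge to flip. */
#include <stdio.h>
#include <stdlib.h>

#define REFRESH_PROBABILITY 0.001   /* example value; the real choice trades overhead against margin */

static void refresh_row(long row) { (void)row; /* stand-in for the DRAM refresh command */ }

static void on_row_activate(long row)
{
    if ((double)rand() / RAND_MAX < REFRESH_PROBABILITY) refresh_row(row - 1);
    if ((double)rand() / RAND_MAX < REFRESH_PROBABILITY) refresh_row(row + 1);
}

int main(void)
{
    /* hammering row 1000 a million times triggers roughly a thousand extra
     * refreshes of each neighbor, overwhelmingly more than enough */
    for (long i = 0; i < 1000000; i++) on_row_activate(1000);
    puts("done");
    return 0;
}
```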