Diffie-Hellman key exchange is a cryptographic primitive used in nearly all modern security protocols. Like many cryptographic primitives, it is difficult to break because it is difficult to solve a particular mathematical problem; in this case, the discrete logarithm problem. Generally, when people try to break a security protocol, they either look at the pure math of it—searching for an easier way to solve discrete logarithms—or they look for mistakes in how the protocol is implemented—something that will mean you don’t have to solve a discrete logarithm to break the protocol.

This paper does a bit of both. It observes something about the way Diffie-Hellman is used in Internet security protocols that makes it unusually easy to solve the discrete logarithms that will break the protocol. Concretely: to break an instance of Diffie-Hellman as studied in this paper, you have to solve for $x$ in the equation

$g^x \equiv y \;(\text{mod}\,p)$

where all numbers are positive integers, $g$ and $p$ are fixed in advance, and $y$ is visible on the wire. The trouble is with the fixed in advance part. It turns out that the most efficient known way to solve this kind of equation, the general number field sieve, can be broken into two stages. The first stage is much more expensive than the second, and it depends only on $g$ and $p$. So if the same $g$ and $p$ were reused for many communications, an eavesdropper could do the first stage in advance, and then breaking individual communications would be much easier—perhaps easy enough to do on the fly, as needed.

At least three common Internet security protocols (TLS, IPsec, and SSH) do reuse $g$ and $p$, if they are not specifically configured otherwise. As the paper puts it, if the attacker can precompute for one 1024-bit group, they can compromise 37% of HTTPS servers in the Alexa Top 1 million, 66% of all probeable IPsec servers, and 26% of all probeable SSH servers. A group is a specific pair of values for $g$ and $p$, and the number of bits essentially refers to how large $p$ is. 1024 bits is the smallest size that was previously considered secure; this paper demonstrates that the precomputation for one such group would cost only a little less than a billion dollars, most of which goes to constructing the necessary supercomputer—the incremental cost for more groups is much smaller. As such we have to move 1024 bits onto the not secure anymore pile. (There’s an entire section devoted to the possibility that the NSA might already have done this.)

(Another section of the paper demonstrates that 512-bit groups can be precomputed by a small compute cluster, and 768-bit groups by a substantial (but not enormous) cluster: 110 and 36,500 core-years of computation, respectively. The former took one week of wall-clock time with the equipment available to the authors. We already knew those groups were insecure; unfortunately, they are still accepted by ~10% of servers.)

What do we do about it? If you’re running a server, the thing you should do right now is jump to a 2048-bit group; the authors have instructions for common TLS servers and SSH, and generic security configuration guidelines for HTTPS servers and SSH also cover this topic. (If you know where to find instructions for IPsec, please let me know.) 2048 bits is big enough that you probably don’t need to worry about using the same group as anyone else, but generating your own groups is also not difficult. It is also important to make sure that you have completely disabled support for export ciphersuites. This eliminates the 512- and 768-bit groups and also several other primitives that we know are insecure.

At the same time, it would be a good idea turn on support for TLSv1.2 and modern ciphersuites, including elliptic curve Diffie-Hellman, which requires an attacker to solve a more complicated equation and is therefore much harder to break. It’s still a discrete logarithm problem, but in a different finite field that is harder to work with. I don’t understand the details myself, but an elliptic curve group only needs 256 bits to provide the same security as a 2048-bit group for ordinary Diffie-Hellman. There is a catch: the general number field sieve works for elliptic curves, too. I’m not aware of any reason why the precomputation attack in this paper wouldn’t apply, and I wish the authors had estimated how big your curve group needs to be for it to be infeasible. 256 bits is almost certainly big enough; but how about 128 bits, which is usually considered equivalent to 1024 bits for ordinary DH?

In the longer timeframe, where we think about swapping out algorithms entirely, I’m going to say this paper means cryptographers should take precomputation should not help the attacker as a design principle. We already do this for passwords, where the whole point of salt is to make precomputation not help; it’s time to start thinking about that in a larger context.

Also in the longer timeframe, this is yet another paper demonstrating that old clients and servers, that only support old primitives that are now insecure, hurt everyone’s security. Just about every time one peer continues to support a broken primitive for backward compatibility, it has turned out that a man in the middle can downgrade it—trick it into communicating insecurely even with counterparties that do support good primitives. (There’s a downgrade attack to 512-bit Diffie-Hellman groups in this paper.) This is one of those thorny operational/incentive alignment problems that hasn’t been solved to anyone’s satisfaction. Windows XP makes an instructive example: it would have been technically possible for Microsoft to backport basically all of the security improvements (and other user-invisible improvements) in later versions of Windows, and there are an enormous number of people who would have benefited from that approach. But it wouldn’t have been in Microsoft’s own best interest, so instead they did things geared to force people onto newer versions of the OS even at the expense of security for people who didn’t want to budge, and people who needed to continue interoperating with people who didn’t want to budge (schannel.dll in XP still doesn’t support TLS 1.2). I could spin a similar story for basically every major player in the industry.

Consider the humble traffic light. The basic design dates to the 1890s, the signals it displays can be explained to a toddler, they’re everywhere and no one thinks anything of them, but if they malfunction an entire city could be paralyzed. Also, for longer than you might realize, they have not been controlled by simple, independent timing circuits; coordinated light patterns, intended to ensure that traffic flows smoothly along the entire length of a boulevard, date to the 1940s. Nowadays each light’s control box tends to have a computer inside, and they talk to each other on radio links, because running wires between all the intersections is expensive.

Today’s paper is from WOOT 2014 and is a case study of the security of such networks of traffic lights. If you have any experience with embedded device security you can probably guess how it turned out: The radio links are in the clear. The protocols are proprietary, but easy to reverse-engineer (or acquire software for). The computers run an outdated proprietary embedded OS and there’s no mechanism for ensuring it gets patched when necessary. Operating-system-level debugging interfaces with complete control over the device may have been left active and exposed to the network. Management software fails to work correctly if passwords are changed from the vendor’s default. And once you’re connected, you can completely reconfigure the traffic lights—which is by design, so adjustments don’t require sending a maintenance crew to every intersection.

The saving grace is the malfunction management unit. This is a separate circuit board, built with discrete logic and configured with physical jumpers, which prevents the light at any single intersection from displaying an unsafe combination of signals (e.g. green in all directions). If the computer tries to put the lights in a state that isn’t on the MMU’s whitelist, or change them too rapidly, it will be overridden and the lights will enter a failsafe mode (such as blinking red in all directions) until someone physically resets the control box. This safety mechanism limits the damage an attacker can do, but it’s still perfectly possible to bring all traffic to a halt, malcoordinate the lights to hinder rather than help traffic flow (which might go completely unnoticed) or just give yourself green lights all the time at everyone else’s expense.

This paper is probably more valuable for its broader lessons (section 6) than for this one case study. Why is it that security experts are not surprised when yet another embedded, safety-critical system turns out to be wide open to remote explotation? The authors call this phenomenon the security phase change. You have an industry with decades of experience designing, manufacturing, and selling widgets which do not communicate with the outside world. The engineers have a solid understanding of how the widget might fail due to accidents, wear and tear, manufacturing defects, etc, and what needs to be done to make sure it fails safe—that malfunction management unit is not in the traffic light controller by chance. But no one thinks about failures due to malice, because why would anyone go around to each and every traffic light and monkeywrench it? It would take hours. Someone would probably call the police.

To such a widget, features are incrementally added, each desirable in its own terms. Microcontrollers are cheaper and more flexible than fixed-function logic boards. Radio links are cheaper and more flexible than running wires down the street. If the fire trucks can override the lights, they can get to the fire faster. If all the lights report diagnostics to a central monitoring station, we can dispatch repair crews more efficiently and not need to make as many inspections.

The security phase change happens when a series of these changes culminates in a device that has a general-purpose computer inside, connected directly to a network that anyone can tap into with relatively little effort. Suddenly the adversary doesn’t need to drive around the entire city to monkeywrench all the lights. Or crack open the hood of your car to get at the engine control computer’s maintenance interface. Or be physically onsite at your chemical plant to mess with the process controller. I understand jet engines are a service now. I would like to think that those engineers have thought about channel security, at the very least.

The Internet Measurement Conference brings us an attempt to figure out just how X.509 server certificates are being used in the wild, specifically for HTTPS servers. Yet more specifically, they are looking for endemic operational problems that harm security. The basic idea is to scan the entire IPv4 number space for servers responding on port 443, make note of the certificate(s) presented, and then analyze them.

This research question is nothing new; the EFF famously ran a similar study back in 2010, the SSL Observatory. And operational concerns about the way certificates are used in the wild go back decades; see Peter Gutmann’s slide deck Everything you Never Wanted to Know about PKI but were Forced to Find Out (PDF). What makes this study interesting is, first, it’s three years later; things can change very fast in Internet land (although, in this case, they have not). Second, the scale: the authors claim to have successfully contacted 178% more TLS hosts (per scan) and harvested 736% more certificates (in total, over the course of 110 scans spanning a little more than a year) than any previous such study.

What do we learn? Mostly that yeah, the TLS PKI is a big mess, and it hasn’t gotten any better since 2010. There are too many root CAs. There are far too many unconstrained intermediate certificates, and yet, at the same time, there are too few intermediates! (The point of intermediates is that they’re easy to replace, so if they get compromised you don’t have a catastrophe on your hands. Well, according to this paper, some 26% of all currently valid HTTPS server certificates are signed by one intermediate. No way is that going to be easy to replace if it gets compromised.) Lots of CAs ignore the baseline policies for certificate issuance and get away with it. (Unfortunately, the paper doesn’t say whether there are similar problems with the EV policies.)

Zoom out: when you have a piece of critical infrastructure with chronic operational issues, it’s a safe bet that they’re symptoms and the real problem is with operator incentives. The paper doesn’t discuss this at all, unfortunately, so I’ll throw in some speculation here. The browser vendors are notionally in the best position to Do Something about this mess, but they don’t: because the only real option they have is to delete root certificates from the Official List. Not only does this tend to put the offending CA out of business, it also causes some uncertain-but-large number of websites (most or all of which didn’t do anything wrong) to stop working. Such a drastic sanction is almost never seen to be appropriate. Browsers have hardly any good positive incentives to offer the CAs to do things right; note that EV certificates, which get special treatment in the browser UI and can therefore be sold at a premium, do come with a tighter set of CA requirements (stronger crypto, reliable OCSP, that sort of thing) which are, as far as I’m aware, followed.

Zoom out again: there’s no shortage of technical suggestions that could turn into less drastic sanctions and incentives for the CAs, but they never get implemented: why? Well, you ask me, I say it’s because both OpenSSL and NSS are such terrible code that nobody wants to hack on them, and the brave souls who do it anyway are busy chipping away at the mountain of technical debt and/or at features that are even more overdue. This, though, we know how to fix. It only takes money.