Administrative Note

Readings will be on hiatus for the next two weeks. Posts will resume on September 7.

Sparse Overcomplete Word Vector Representations

It’s time for another of my occasional looks at a paper that doesn’t have anything to do with security. This one is about an attempt to bridge a gap between two modes of analysis in computational linguistics.

In a word vector analysis (also referred to as a distributed word representation in this paper) one maps words onto vectors in an abstract high-dimensional space, defined by the co-occurrence probabilities of each pair of words within some collection of documents (a corpus). This can be done entirely automatically with any source of documents; no manual preprocessing is required. Stock machine-learning techniques, applied to these vectors, perform surprisingly well on a variety of downstream tasks—classification of the words, for instance. However, the vectors are meaningless to humans, so it’s hard to use them in theoretical work that requires interpretability, or to combine them with other information (e.g. sentence structure). Lexical semantics, by contrast, relies on manual, word-by-word tagging with human-meaningful categories (part of speech, sense of word, role in sentence, etc) which is slow and expensive, but the results are much easier to use as a basis for a wide variety of further studies.
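
If you have never seen a distributional representation before, here is a toy sketch in Python of the underlying idea: counting co-occurrences within a small window over a made-up corpus. It is only an illustration; the vectors the paper actually works with come from much more sophisticated training on much larger corpora.

```python
# Toy illustration (not the paper's pipeline): build distributional word
# vectors from co-occurrence counts within a small sliding window.
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played",
]

window = 2
cooc = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                cooc[w][words[j]] += 1

vocab = sorted({w for s in corpus for w in s.split()})
# Each word's vector is its row of co-occurrence counts over the vocabulary.
vectors = {w: [cooc[w][v] for v in vocab] for w in vocab}
print(vocab)
print("cat", vectors["cat"])
print("dog", vectors["dog"])  # similar contexts -> similar vectors
```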

The paper proposes a technique for transforming a set of word vectors to make them more interpretable. It’s essentially the opposite of PCA. They project the vectors into a higher-dimensional space, one in which they are all sparse (concretely, more than 90% of the components of each vector are zero). Words that are semantically related will (they claim) share nonzero components of these vectors, so each component has more meaning than in the original space. The projection matrix can be generated by standard mathematical optimization techniques (specifically, gradient descent with some tweaks to ensure convergence).
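
For a feel of what that optimization looks like, here is a rough sketch using scikit-learn’s generic dictionary learner rather than the authors’ own solver; it optimizes the same basic sparse-coding objective (reconstruct each dense vector as a sparse combination of an overcomplete set of basis vectors). The input matrix is a random stand-in for a set of dense word vectors, and the hyperparameters are arbitrary.

```python
# A minimal sketch of the sparse-overcomplete idea: X ~ A @ D with A sparse
# and D overcomplete.  Random stand-in data; not the authors' own solver.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # stand-in for 200 dense 50-d word vectors

coder = DictionaryLearning(
    n_components=100,   # project 50 dimensions up to 100 (overcomplete)
    alpha=1.0,          # L1 penalty controlling sparsity
    max_iter=10,
    random_state=0,
)
A = coder.fit_transform(X)          # sparse codes, shape (200, 100)

print(f"fraction of zero components: {np.mean(A == 0):.2%}")
```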

To back up the claim that the sparse vectors are more interpretable, they offer two demonstrations. First, five stock downstream classification tasks achieve an average of 5% higher accuracy when fed the sparse vectors than when fed the dense vectors from which they were derived. Second, humans achieve 10% higher accuracy on a word intrusion task (which one of these five words does not belong in the list?) when the candidate words are selected by their rank along one dimension of the sparse vectors than when they are selected the same way from the dense vectors. An example would probably help here: Table 6 of the paper quotes the top-ranked words from five dimensions of the sparse and the dense vectors:

Dense                                          | Sparse
combat, guard, honor, bow, trim, naval         | fracture, breathing, wound, tissue, relief
’ll, could, faced, lacking, seriously, scored  | relationships, connections, identity, relations
see, n’t, recommended, depending, part         | files, bills, titles, collections, poems, songs
due, positive, equal, focus, respect, better   | naval, industrial, technological, marine
sergeant, comments, critics, she, videos       | stadium, belt, championship, toll, ride, coach

Neither type of list is what you might call an ideal Platonic category,¹ but it should be clear that the sparse lists have more meaning in them.

Because I’m venturing pretty far out of my field, it’s hard for me to tell how significant this paper is; I don’t have any basis for comparison. This is, in fact, the big thing missing from this paper: a comparison to other techniques for doing the same thing, if any. Perhaps the point is that there aren’t any, but I didn’t see them say so. I am also unclear on how you would apply this technique to anything other than an analysis of words. For instance, my own research right now involves (attempting to) mechanically assign topics to entire documents. Right now we’re reducing each document to a bag of words, carrying out LDA on the bags, and then manually labeling each LDA cluster with a topic. Could we use bags of sparse word vectors instead? Would that help LDA do its job better? Or is LDA already doing what this does? I don’t know the answers to these questions.

¹ If you are under the impression that categories in natural language are even vaguely Platonic, go at once to your friendly local public library and request a copy of Women, Fire, and Dangerous Things.

Whiskey, Weed, and Wukan on the World Wide Web

The subtitle of today’s paper is On Measuring Censors’ Resources and Motivations. It’s a position paper, whose goal is to get other researchers to start considering how economic constraints might affect the much-hypothesized arms race or tit-for-tat behavior of censors and people reacting to censorship: as they say,

[…] the censor and censored have some level of motivation to accomplish various goals, some limited amount of resources to expend, and real-time deadlines that are due to the timeliness of the information that is being spread.

They back up their position by presenting a few pilot studies, of which the most compelling is the investigation of keyword censorship on Weibo (a Chinese microblogging service). They observe that searches are much more aggressively keyword-censored than posts—that is, for many examples of known-censored keywords, one is permitted to make a post on Weibo containing that keyword, but searches for that keyword will produce either no results or very few results. (They don’t say whether unrelated searches will turn up posts containing censored keywords.) They also observe that, for some keywords that are not permitted to be posted, the server only bothers checking for variations on the keyword if the user making the post has previously tried to post the literal keyword. (Again, the exact scope of the phenomenon is unclear—does an attempt to post any blocked keyword make the server check more aggressively for variations on all blocked keywords, or just that one? How long does this escalation last?) And finally, whoever is maintaining the keyword blacklists at Weibo seems to care most about controlling the news cycle: terms associated with breaking news that the government does not like are immediately added to the blacklist, and removed again just as quickly when the event falls out of the news cycle or is resolved positively. They give detailed information about this phenomenon for one news item, the Wukan incident, and cite several other keywords that seem to have been treated the same.

They compare Weibo’s behavior to similar keyword censorship by chat programs popular in China, where the same patterns appear, but whoever is maintaining the lists is sloppier and slower about it. This is clear evidence that the lists are not maintained centrally (by some government agency) and they suggest that many companies are not trying very hard:

At times, we often suspected that a keyword blacklist was being typed up by an over-worked college intern who was given vague instructions to filter out anything that might be against the law.

Sadly, I haven’t seen much in the way of people stepping up to this challenge and designing experiments to probe the economics of censorship. You can see similar data points in other studies of China [1] [2] [3] (it is still the case, as far as I know, that ignoring spurious TCP RST packets is sufficient to evade several aspects of the Great Firewall), and in reports from other countries. It is telling, for instance, that Pakistani censors did not bother to update their blacklist of porn sites to keep up with a shift in viewing habits. [4] George Danezis has been talking about the economics of anonymity and surveillance for quite some time now [5] [6] but that’s not quite the same thing. I mentioned above some obvious follow-on research just for Weibo, and I don’t think anyone’s done that. Please tell me if I’ve missed something.

Limitations of End-to-End Encryption in Secure Computer Networks

Today we’re going to go back in time, all the way to the dawn of computer networks. When this technical report was filed, the largest operational internetwork was still called ARPAnet and it still ran on NCP; people were still talking about hosts and communications subnetwork processors as if they were two different physical devices, and about security levels as if those were the only possible way to conceptualize access control; and I was five months old.

(I apologize for the poor quality of the linked PDF—to the best of my knowledge, this is the only version to be found online. Also, if anyone knows D.W. Snow’s first name, please tell me.)

To the best of my knowledge, this is the very first published article to discuss the things that end-to-end encryption (what we would now call a secure channel protocol) does not protect from an eavesdropper. Everyone doing computer security in 1978 was thinking mostly about protecting classified government secrets, so the authors frame the problem in terms of a Trojan Horse program with access to such secrets, but forbidden by the OS from sending messages to anyone who isn’t cleared to access the same secrets: if all outgoing network traffic is encrypted end-to-end to its (legitimate) recipient, can this Trojan Horse still exfiltrate information to someone who isn’t a legitimate recipient?

They point out that, of necessity, a packet-switched network has to reveal the destination address, transmission time, and length of every packet in cleartext. They model each of these as Shannonian communication channels, and determine sending rates on the order of 100 bits per second for each—more than adequate to leak a text document. (They observe, by way of comparison, that the standard military teletype runs at 75 bps.)
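
A back-of-the-envelope version of that arithmetic, with entirely hypothetical numbers and no attempt to model noise (the report does the channel-capacity math properly):

```python
# Back-of-the-envelope covert-channel rates.  All numbers are hypothetical;
# the report models these as noisy Shannon channels, this assumes no noise.
import math

packets_per_second = 25          # rate at which the Trojan Horse can emit packets

# Length channel: if the sender can choose among N distinguishable packet
# lengths, each packet carries log2(N) bits.
distinguishable_lengths = 16
length_bps = packets_per_second * math.log2(distinguishable_lengths)

# Destination channel: choosing one of M legitimate destinations per packet.
legitimate_destinations = 8
dest_bps = packets_per_second * math.log2(legitimate_destinations)

print(f"length channel:      {length_bps:.0f} bits/s")
print(f"destination channel: {dest_bps:.0f} bits/s")
# Both land in the ~100 bits/s ballpark -- faster than a 75 bps teletype.
```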

Nowadays, this threat model might seem quaint, even silly—we put a lot more effort into preventing untrusted code from seeing secret information in the first place. The information leak, however, is real, still exists, and can be used for other purposes. The most terrifying example I know is Hookt on fon-iks, in which a completely passive eavesdropper can reconstruct the words spoken in an encrypted VoIP phone conversation, just from the length and timing of each packet. Different syllables compress to different length packets, and every natural language has rules about which syllables can follow which; the rules can be modeled with a Markov chain, and there you are.
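
To see how little machinery the core idea needs, here is a toy sketch: a tiny hidden Markov model whose states are syllables and whose observations are packet-length buckets, decoded with the textbook Viterbi algorithm. The syllables and all the probabilities are invented; the real attack uses far richer language and length models.

```python
# Toy sketch of the idea behind the VoIP attack: hidden states are syllables,
# observations are packet-length buckets, decoding is plain Viterbi.
syllables = ["ka", "to", "ne"]

# All probabilities below are made up for illustration.
start = {"ka": 0.5, "to": 0.3, "ne": 0.2}
trans = {
    "ka": {"ka": 0.1, "to": 0.6, "ne": 0.3},
    "to": {"ka": 0.4, "to": 0.2, "ne": 0.4},
    "ne": {"ka": 0.5, "to": 0.4, "ne": 0.1},
}
emit = {                               # P(packet-length bucket | syllable)
    "ka": {0: 0.7, 1: 0.2, 2: 0.1},    # buckets: 0 small, 1 medium, 2 large
    "to": {0: 0.1, 1: 0.7, 2: 0.2},
    "ne": {0: 0.2, 1: 0.2, 2: 0.6},
}

def viterbi(observed):
    """Most likely syllable sequence given observed packet-length buckets."""
    V = [{s: start[s] * emit[s][observed[0]] for s in syllables}]
    back = []
    for obs in observed[1:]:
        scores, ptr = {}, {}
        for s in syllables:
            prev, p = max(
                ((r, V[-1][r] * trans[r][s]) for r in syllables),
                key=lambda x: x[1],
            )
            scores[s] = p * emit[s][obs]
            ptr[s] = prev
        V.append(scores)
        back.append(ptr)
    state = max(V[-1], key=V[-1].get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

print(viterbi([0, 1, 2, 0]))           # prints the most likely syllable path
```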

The countermeasures and conclusions sections of this paper are much more embarrassing in retrospect than the dated threat model. They say, there’s nothing one can practically do about this end-to-end, but we can close the hole if we make every single intermediate relay a trusted arbiter of the (one true) security policy, at which point we don’t need end-to-end encryption… I feel quite confident in saying, even in 1978 it ought to have been obvious that that was never going to happen. What’s sad is, if people hadn’t given up on end-to-end countermeasures back then, perhaps we would actually have some by now. (It’s easy for VoIP! All you have to do is use a constant-bitrate compression algorithm. Shame none of the widely deployed VoIP programs bother.)

Folk Models of Home Computer Security

It is well known that people who aren’t computer security experts tend to ignore expert advice on computer security, and (to some extent as a consequence) get exploited. This paper is not the first, or the last, to investigate why; see also the What Deters Jane papers [1] [2], So Long and No Thanks for the Externalities, and “Users Are Not the Enemy”. However, this paper provides a much more compelling explanation than anything earlier (that I have read), and a lens through which to view everything since. It’s plainly written and requires almost no specialized background knowledge; you should just go ahead and read the whole thing.

For those not in the mood to read the whole thing right now, I will summarize. Wash conducted semi-structured, qualitative interviews of 33 home computer users, who were selected to maximize sample diversity, and specifically to exclude security experts. From these, he extracted a number of what he calls folk models—qualitative, brief descriptions of how these people understand various threats to their computer security. The term folk is used to describe a model which accurately reflects users’ understanding of a system, and is broadly shared among a user population, but might not accurately reflect the true behavior of that system. [3] In this case, that means the models reflect what the interviewees think are the threats to their home computers, even if those aren’t accurate. Indeed, it is precisely where the model is inaccurate to the real threats that it provides an explanation of the phenomenon (i.e. users not bothering to follow well-publicized security advice).

A key aspect of all the models presented is a division of security threats into viruses and hackers. Virus is used by the interviewees as an umbrella term, corresponding most closely to what experts call malware—any piece of software which is unwanted and has harmful effects. (One model expands this category even further, to include programs which are unintentionally harmful, i.e. they have bugs.) The models differ primarily in users’ understanding of how viruses get into the computer, and what they are programmed to do once there. This can be very vague (e.g. viruses are bad software you don’t want on your computer) or quite specific (e.g. viruses are deliberately programmed by hackers as an act of vandalism; they cause annoying problems with the computer; you get them passively by visiting sketchy websites—an expert will acknowledge this as true-but-incomplete).

Hackers on the other hand are people who are actively seeking to exploit computers; most interviewees share the understanding that this involves taking control of a computer remotely, thus allowing it to be manipulated as if the hacker were physically present at its console. (Many of them hedge that they do not know how this is done, but they are sure that it is possible.) The models here differ primarily in the motives ascribed to the hackers, which are: vandalism, identity theft, or targeted identity theft and financial fraud. This last is one of the most telling observations in the entire paper: a significant number of people believe that they are safe from hackers because they have nothing worth stealing or exposing. (Again, an expert would recognize this as true-but-incomplete: there really is a subpopulation of black-hat actors who specialize in going after the big fish. The catch, of course, is that the data exfiltrated from a big fish might include millions of people’s personal credit card numbers…)

Having presented these models, Wash runs down a list of standard items of home computer security advice (drawn from Microsoft, CERT, and US-CERT’s guides on the topic) and points out how many of them are either useless or unimportant according to these models: for instance, if you think you can’t get viruses without actively downloading software, then antivirus software is pointless, and patching only valuable if it eliminates bugs you trip over yourself; if you think hackers rarely, if ever, vandalize a computer, then backups are not necessary to protect against that risk. He closes by comparing the novel-at-the-time threat of botnets to all the models, observing that none of them account for the possibility that an attacker might subvert computers indiscriminately and automatically, then use them only for their Internet connection. In particular, all of the hacker models assume that computers are attacked in order to do something to that computer, rather than as a means to a goal that doesn’t involve it (sending spam, enlarging the botnet, executing DDoS attacks, …) and that the hacker must be doing something manually at the time of the attack.

The landscape of security threats has changed quite a bit since this paper was published. I would be curious to know whether ransomware, RATs, third-party data breaches, and so on have penetrated the public consciousness enough to change any of the models. I’d also like to know whether and how much people’s understanding of the threats to a mobile phone is different. And, although Wash did make an effort to cover a broad variety of non-expert home computer users, they are all from the general population near his Midwestern university, hence mostly WEIRDos. I’m not aware of any studies of this type conducted anywhere but North America and Europe, but I bet it’s not quite the same elsewhere…

Taster’s Choice: A Comparative Analysis of Spam Feeds

Here’s another paper about spam; this time it’s email spam, and they are interested not so much in the spam itself, but in the differences between collections of spam (feeds) as used in research. They have ten different feeds, and they compare them to each other looking only at the domain names that appear in each. The goal is to figure out whether or not each feed is an unbiased sample of all the spam being sent at any given time, and whether some types of feed are better at detecting particular sorts of spam. (Given this goal, looking only at the domain names is probably the most serious limitation of the paper, despite being brushed off with a footnote. It means they can’t say anything about spam that doesn’t contain any domain names, which may be rare, but is interesting because it’s rare and different from all the rest. They should have at least analyzed the proportion of it that appeared in each feed.)
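
As a flavor of what comparing feeds by domain names even means, here is a minimal sketch computing pairwise Jaccard overlap between the domain sets of a few feeds. The feed names and contents are made up, and the paper’s actual analyses go well beyond this.

```python
# Minimal sketch of one building block of a feed comparison:
# pairwise overlap between the sets of domains seen in each feed.
from itertools import combinations

feeds = {
    "honeypot_A": {"cheap-pills.example", "luxury-watches.example", "deals.example"},
    "honeypot_B": {"cheap-pills.example", "deals.example"},
    "botnet":     {"cheap-pills.example", "luxury-watches.example",
                   "casino.example", "deals.example"},
}

def jaccard(a, b):
    """Fraction of domains the two feeds have in common."""
    return len(a & b) / len(a | b)

for (name1, d1), (name2, d2) in combinations(feeds.items(), 2):
    print(f"{name1} vs {name2}: {jaccard(d1, d2):.2f}")
```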

The spam feeds differ primarily in how they collect their samples. There’s one source consisting exclusively of manually labeled spam (from a major email provider); two DNS blacklists (these provide only domain names, and are somehow derived from other types of feed); three MX honeypots (registered domains that accept email to any address, but are never used for legitimate mail); two seeded honey accounts (like honeypots, but a few addresses are made visible to attract more spam); one botnet-monitoring system; and one hybrid. They don’t have full details on exactly how they all work, which is probably the second most serious limitation.

The actual results of the analysis are basically what you would expect: manually labeled spam is lower-volume but has more unique examples in it, botnet spam is very high volume but has lots of duplication, everything else is somewhere in between. They made an attempt to associate spam domains with affiliate networks (the business of spamming nowadays is structured as a multi-level marketing scheme) but they didn’t otherwise try to categorize the spam itself. I can think of plenty of additional things to do with the data set—which is the point: it says right in the abstract most studies [of email spam] use a single spam feed and there has been little examination of how such feeds may differ in content. They’re not trying so much to produce a comprehensive analysis themselves as to alert people working in this subfield that they might be missing stuff by looking at only one data source.

All Your Biases Belong To Us: Breaking RC4 in WPA-TKIP and TLS

RC4 is a very widely used stream cipher, developed in the late 1980s. As wikipedia puts it, RC4 is remarkable for its speed and simplicity in software, but it has weaknesses that argue against its use in new systems. Today’s paper demonstrates that it is even weaker than previously suspected.

To understand the paper you need to know how stream ciphers work. The core of a stream cipher is a random number generator. The encryption key is the starting seed for the random number generator, which produces a keystream—a sequence of uniformly random bytes. You then combine the keystream with your message using the mathematical XOR operation, and that’s your ciphertext. On the other end, the recipient knows the same key, so they can generate the same keystream, and they XOR the ciphertext with the keystream and get the message back.
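
Here is what that looks like in code: a straightforward transcription of RC4’s key schedule and output loop, plus the XOR step. For illustration only, of course; the whole point of this paper is that you should not use RC4 for anything.

```python
# RC4 keystream generation (the standard KSA + PRGA), to make the
# "keystream XORed with the message" structure concrete.
def rc4_keystream(key: bytes, nbytes: int) -> bytes:
    # Key-scheduling algorithm: permute S under control of the key.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) & 0xFF
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation algorithm: emit keystream bytes.
    out = bytearray()
    i = j = 0
    for _ in range(nbytes):
        i = (i + 1) & 0xFF
        j = (j + S[i]) & 0xFF
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) & 0xFF])
    return bytes(out)

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

key = b"an example key"
message = b"attack at dawn"
keystream = rc4_keystream(key, len(message))
ciphertext = xor(message, keystream)
assert xor(ciphertext, keystream) == message      # decryption works
assert xor(ciphertext, message) == keystream      # known plaintext leaks the keystream
```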

If you XOR a ciphertext with the corresponding plaintext, you learn the keystream. This wouldn’t be a big deal normally, but, the basic problem with RC4 is that its keystream isn’t uniformly random. Some bytes of the keystream are more likely to take specific values than they ought to be. Some bytes are more likely to not take specific values. And some bytes are more likely to take the same value as another byte in the keystream. (The ABSAB bias, which is mentioned often in this paper, is an example of that last case: you have slightly elevated odds of encountering value A, value B, a middle sequence S, and then A and B again.) All of these are referred to as biases in the RC4 keystream.

To use RC4’s biases to decrypt a message, you need to get your attack target to send someone (not you) the same message many times, encrypted with many different keys. You then guess a keystream which exhibits as many of the biases as possible, and you XOR this keystream with all of the messages. Your guess won’t always be right, but it will be right slightly more often than chance, so the correct decryption will also appear slightly more often than chance. It helps that it doesn’t have to be exactly the same message. If there is a chunk that is always the same, and it always appears in the same position, that’s enough. It also helps if you already know some of the message; that lets you weed out bad guesses faster, and exploit more of the biases.
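
The statistics of that guessing game are easy to simulate without any real cryptography. The sketch below uses a fake keystream with a single bias, the value 0 appearing at one position about twice as often as it should (roughly the size of the classic second-byte bias), and shows that a simple majority vote over many encryptions of the same plaintext recovers the byte at that position. It is a toy, not the paper’s attack.

```python
# Toy simulation of the statistical idea (not real RC4, not the paper's attack).
import random
from collections import Counter

random.seed(1)
plaintext = b"Cookie: secret"
BIASED_POS = 1                      # stand-in for RC4's biased second output byte

def biased_keystream_byte() -> int:
    # Value 0 appears with probability 2/256 instead of the uniform 1/256.
    if random.random() < 2 / 256:
        return 0
    return random.randrange(1, 256)

votes = Counter()
for _ in range(1_000_000):          # the same plaintext encrypted under many "keys"
    c = plaintext[BIASED_POS] ^ biased_keystream_byte()
    votes[c] += 1                   # guess the keystream byte was 0, so c == plaintext byte

guess = votes.most_common(1)[0][0]
print(chr(guess), chr(plaintext[BIASED_POS]))   # both print 'o'
```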

Asking the attack target to send the same message many times might seem ridiculous. But remember that many Internet protocols involve headers that are fixed or nearly so. The plaintext of a TKIP-encrypted message, for instance, will almost always be the WiFi encapsulation of an IPv4 packet. If you know the IP addresses of your target and of the remote host it’s communicating with, that means you know eight bytes of the message already, and they’ll always be the same and in the same position. The paper goes into some detail about how to get a Web browser to make lots and lots of HTTPS requests with a session cookie (always the same, but unknown—the secret you want to steal) in a predictable position, with known plaintext on either side of it.

All this was well-known already. What’s new in this paper is: first, some newly discovered biases in the RC4 keystream, and a statistical technique for finding even more; second, improved attacks on TKIP and TLS. Improved means that they’re easier to execute without anyone noticing, and that they take less time. Concretely, the best known cookie-stealing attack before this paper needed to see 13·2^30 HTTPS messages (that’s about 14 billion), and this paper cuts it to 9·2^27 messages (about 1.2 billion), which takes only 75 hours to run. That’s entering the realm of practical in real life, if only for a client computer that is left on all the time, with the web browser running.

It’s a truism in computer security (actually, security in general) that attacks only ever get better. What’s important, when looking at papers like this, is not so much the feasibility of any particular paper’s attack, but the trend. The first-known biases in RC4 keystream were discovered only a year after the algorithm was published, and since then there’s been a steady drumbeat of researchers finding more biases and more effective ways to exploit them. That means RC4 is no good, and everyone needs to stop using it before someone finds an attack that only takes a few minutes. Contrast the situation with AES, where there are no known biases, and fifteen years of people looking for some kind of attack has produced only a tiny improvement over brute force.

Advice to the general public: Does this affect you? Yes. What can you do about it? As always, make sure your Web browsers are fully patched up—current versions of Firefox, Chrome, and IE avoid using RC4, and upcoming versions will probably take it out altogether. Beyond that, the single most important thing for you to do is make sure everything you’ve got that communicates over WiFi—router, computers, tablets, smartphones, etc.—is set to use WPA2 and CCMP only. (The configuration interface may refer to CCMP as AES-CCMP or just AES; in this context, those are three names for the same thing.) The alternatives WEP, WPA, and TKIP all unavoidably involve RC4 and are known to be broken to some extent. Any WiFi-capable widget manufactured after 2006 has no excuse for not supporting WPA2 with CCMP. It should definitely be possible to set your home router up this way; unfortunately, many client devices don’t expose the necessary knobs.

Detecting Semantic Cloaking on the Web

Now for something a little different: today’s paper is about detecting search engine spam. Specifically, it’s about detecting when a Web site presents different content to a search engine’s crawler than it does to human visitors. As the article points out, this can happen for benign or even virtuous reasons: a college’s front page might rotate through a selection of faculty profiles, or a site might strip out advertising and other material that is only relevant to human visitors when it knows it’s talking to a crawler. However, it also happens when someone wants to fool the search engine into ranking their site highly for searches where they don’t actually have relevant material.

To detect such misbehavior, obviously one should record each webpage as presented to the crawler, and then again as presented to a human visitor, and compare them. The paper is about two of the technical challenges which arise when you try to execute this plan. (They do not claim to have solved all of the technical challenges in this area.) The first of these is, of course, how do you program a computer to tell when a detected difference is spam, versus when it is benign? and here they have done something straightforward: supervised machine classification. You could read this paper just as a case study in semi-automated feature selection for a machine classifier, and you would learn something. (Feature selection is somewhat of a black art—features that appear to be highly predictive may be accidents of your sample, and features that logically should be predictive might not work out at all. In this case, the positive features they list seem plausibly motivated, but several of the negative features (features which are anticorrelated with spamming) seem likely to be accidental. I would have liked to see more analysis of why each feature is predictive.)
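
In outline, that part of the pipeline could look something like the sketch below: compute numeric features from each (crawler copy, browser copy) pair and feed them to an off-the-shelf classifier. The feature names, the data, and the choice of logistic regression are all mine, not the paper’s.

```python
# A rough sketch of the overall approach: hand-crafted difference features
# per (crawler copy, browser copy) pair, fed to an off-the-shelf classifier.
# Features and data are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Each row: [extra terms in crawler copy, extra links in crawler copy,
#            fraction of shared terms, page length ratio]
X = np.array([
    [40, 25, 0.20, 0.5],   # cloaked-looking examples
    [55, 30, 0.10, 0.4],
    [35, 18, 0.25, 0.6],
    [ 2,  0, 0.95, 1.0],   # benign examples
    [ 5,  1, 0.90, 0.9],
    [ 0,  0, 1.00, 1.0],
])
y = np.array([1, 1, 1, 0, 0, 0])       # 1 = semantic cloaking, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)
clf = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```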

The second technical challenge is less obvious: sites are updated frequently, and you don’t want to mistake an ordinary update for cloaking. But it’s not practical to visit a site simultaneously as the crawler and as the human, just because of how many sites a crawl has to touch (and if you did, the spammers might be able to figure out that your human visit was an audit). Instead, you could visit the site repeatedly as each and see whether the changes match, but this is expensive. The paper proposes to weed out sites that don’t change at all between the crawler visit and the human visit, and do the more expensive check only on the sites that do. A refinement is to use a heuristic to pick out changes that are more likely to be spam: the presence of additional keywords or links in the crawler version, relative to the human version. In their tests, this cuts the number of sites that have to be investigated in detail by a factor of 10 (and could do better with further refinement of the heuristic). These kinds of manual filter heuristics are immensely valuable in classification problems where one of the categories (here, no cloaking) is much larger than the others, both because they reduce the cost of running the classifier (and, in this case, the cost of data collection), and because machine-learning classifiers often do better when the categories all have roughly the same number of examples.
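
The cheap pre-filter is even simpler to sketch. Everything here (the regex-level “parsing”, the zero thresholds) is a placeholder; a real implementation would need proper HTML parsing, normalization, and tuned thresholds.

```python
# Minimal sketch of the cheap pre-filter: drop identical pairs, then flag
# pages that show extra terms or links to the crawler.  Placeholder details.
import re

def terms(html: str) -> set:
    return set(re.findall(r"[a-z0-9]+", html.lower()))

def links(html: str) -> set:
    return set(re.findall(r'href="([^"]+)"', html, re.I))

def needs_detailed_check(crawler_copy: str, browser_copy: str) -> bool:
    if crawler_copy == browser_copy:
        return False                 # identical copies: certainly not cloaked
    extra_terms = terms(crawler_copy) - terms(browser_copy)
    extra_links = links(crawler_copy) - links(browser_copy)
    # Heuristic: only escalate pages that add keywords or links for the crawler.
    return len(extra_terms) > 0 or len(extra_links) > 0

crawler = '<p>cheap pills discount pharmacy <a href="http://spam.example">buy</a></p>'
browser = '<p>Welcome to our homepage.</p>'
print(needs_detailed_check(crawler, browser))   # True -> send to the classifier
```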

This paper shouldn’t be taken as the last word in this area: it’s ten years old, its data set is respectable for an experiment but tiny compared to the global ’net, and false positive and negative rates of 7% and 15% (respectively) are much too large for production use. The false positive paradox is your nemesis when you are trying to weed spammers out of an index of 10⁹ websites. We know from what little they’ve said about it in public (e.g. [1] [2]) that Google does something much more sophisticated. But it is still valuable as a starting point if you want to learn how to do this kind of research yourself.
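
To put numbers on that, here is a worked example with a made-up base rate of cloaking; only the 7% and 15% error rates come from the paper.

```python
# Why a 7% false-positive rate is fatal at web scale: worked example with a
# hypothetical base rate of cloaking (1 page in 1,000).
total_pages  = 10**9
base_rate    = 1e-3            # hypothetical fraction of pages that cloak
fp_rate      = 0.07            # paper's false-positive rate
fn_rate      = 0.15            # paper's false-negative rate

cloaked      = total_pages * base_rate
benign       = total_pages - cloaked
true_alarms  = cloaked * (1 - fn_rate)
false_alarms = benign * fp_rate

print(f"true alarms:  {true_alarms:,.0f}")      # ~850 thousand
print(f"false alarms: {false_alarms:,.0f}")     # ~70 million
print(f"fraction of alarms that are real: "
      f"{true_alarms / (true_alarms + false_alarms):.1%}")
```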

Automated Detection and Fingerprinting of Censorship Block Pages

This short paper, from IMC last year, presents a re-analysis of data collected by the OpenNet Initiative on overt censorship of the Web by a wide variety of countries. Overt means that when a webpage is censored, the user sees an error message which unambiguously informs them that it’s censored. (A censor can also act deniably, giving the user no proof that censorship is going on—the webpage just appears to be broken.) The goal of this reanalysis is to identify block pages (the error messages) automatically, distinguish them from normal pages, and distinguish them from each other—a new, unfamiliar format of block page may indicate a new piece of software is in use to do the censoring.

The chief finding is that block pages can be reliably distinguished from normal pages just by looking at their length: block pages are typically much shorter than normal. This is to be expected, seeing that they are just error messages. What’s interesting, though, is that this technique works better than techniques that look in more detail at the contents of the page. I’d have liked to see some discussion of what kinds of misidentification appear for each technique, but there probably wasn’t room for that. Length is not an effective tactic for distinguishing block pages from each other, but term frequency is (they don’t go into much detail about that).
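
The length heuristic is almost trivial to express; here is a toy version with invented sizes and an arbitrary threshold, just to make the idea concrete. The paper’s actual classifier and thresholds are more careful.

```python
# Toy sketch of the length heuristic: flag responses that are drastically
# shorter than the size normally seen for that URL.  All numbers invented.
normal_sizes = {                    # typical body size (bytes) per URL
    "http://news.example/":      48_200,
    "http://forum.example/":     91_500,
    "http://blog.example/post":  22_900,
}

observed = {
    "http://news.example/":      47_950,   # ordinary variation
    "http://forum.example/":      1_150,   # suspiciously short: likely block page
    "http://blog.example/post":  22_400,
}

THRESHOLD = 0.3                     # flag if under 30% of the usual size

for url, size in observed.items():
    ratio = size / normal_sizes[url]
    if ratio < THRESHOLD:
        print(f"possible block page: {url} ({ratio:.1%} of normal size)")
```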

One thing that’s really not clear is how they distinguish block pages from ordinary HTTP error pages. They mention that ordinary errors introduce significant noise in term-frequency clustering, but they don’t explain how they weeded them out. It might have been done manually; if so, that’s a major hole in the overall automated-ness of this process.

20,000 In League Under The Sea: Anonymous Communication, Trust, MLATs, and Undersea Cables

Today’s paper takes another shot at modeling how the physical topology of the Internet affects the security of Tor against passive adversaries with the ability to snoop on a lot of traffic. It’s by some of the same people who wrote Defending Tor from Network Adversaries and is quite closely related.

Most of the work of this paper goes into building a flexible, formal threat model, which Tor client software could (in principle) use to inform its routing decisions. Acknowledging that there’s always going to be a good deal of uncertainty about what adversaries are out there and what they are capable of, they make two key design decisions. The model is probabilistic (based on a Bayesian belief network), and it takes user input. For instance, if you have reason to think the government of Transbelvia has it in for you, you can instruct Tor to avoid paths that Transbelvia might be able to snoop on, and the model will expand that out to all the ways they might do that. Conversely, if you trust a particular organization you might like to preferentially use its guards or exit nodes, and it can do that too.

The model is very thorough about different ways a government might be able to snoop on network traffic—not just relays physically hosted in the country, but ASes and IXPs (Transbelvia hosts a major IXP for Eastern Europe), submarine cable landing sites (not relevant for a landlocked country), mutual legal assistance treaties (MLATs) which might be used to have another country do some snooping on Transbelvia’s behalf, and even hacking into and subverting routers at interesting points in the connectivity graph. (The pun in the title refers to their analysis of how MLATs could allow several of the usual suspects to snoop on 90+% of all submarine cable traffic, even though they host hardly any cable landings themselves.) Equally important, it can be expanded at need when new techniques for spying are revealed.
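
To make the flavor of the model concrete, here is a toy calculation of the kind it enables: given a user’s beliefs about which entities might be snooping, and which entities can observe each candidate circuit, estimate how likely each circuit is to be observed and prefer the safest. The entities and probabilities are invented, and a real belief network does not assume independence the way this sketch does.

```python
# Toy sketch of spy-aware circuit selection.  Entities, probabilities, and
# the independence assumption are all invented; the paper uses a proper
# Bayesian belief network instead.
import math

# User's belief that each entity can and will snoop.
belief = {
    "AS-Transbelvia": 0.9,
    "IXP-East":       0.6,
    "Cable-Atlantis": 0.3,
    "AS-Elsewhere":   0.05,
}

# Which entities can observe each candidate circuit (guard..exit path).
circuits = {
    "circuit-1": ["AS-Transbelvia", "IXP-East"],
    "circuit-2": ["Cable-Atlantis", "AS-Elsewhere"],
    "circuit-3": ["AS-Elsewhere"],
}

def p_observed(entities):
    # P(at least one observer) = 1 - prod(1 - p_i), assuming independence.
    return 1 - math.prod(1 - belief[e] for e in entities)

for name, ents in sorted(circuits.items(), key=lambda kv: p_observed(kv[1])):
    print(f"{name}: P(observed) = {p_observed(ents):.2f}")
# Prints circuit-3 first: the safest choice under these beliefs.
```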

I think something like this is going to be an essential building block if we want to add any spy-aware routing algorithm to Tor, but I have two serious reservations. First, and simpler but less important: right now all Tor clients make routing decisions more-or-less the same way (there have been small changes to the algorithm over time, but everyone is strongly encouraged to stay close to the latest client release anyway, just because of bugs). If clients don’t all make routing decisions the same way, then that by itself might be usable to fingerprint them, and thus cut down the number of people who might’ve taken some action, from all Tor users to all Tor users who make routing decisions like THIS. If highly personalized threat models are allowed, the latter group might be just one person.

Second, and rather more serious, the user-input aspect of this system is going to require major user experience research and design to have any hope of not being worse than the problem it’s trying to solve. It’s not just a matter of putting a friendly face on the belief language (although that does need to happen)—the system will need to educate its users in the meaning of what it is telling them, and it will need to walk them through the consequences of their choices. And it might need to provide nudges if there’s a good reason to think the user’s assessment of their threat model is flat-out wrong (even just making that judgement automatically is fraught with peril—but so is not making that judgement).