Improving Cloaking Detection Using Search Query Popularity and Monetizability

Here’s another paper about detecting search engine spam, contemporary with Detecting Semantic Cloaking on the Web and, in fact, presented as a refinement to that paper. They don’t make any changes to the “is this page spamming the search engine?” algorithm itself; rather, they optimize the scan for pages that are spamming the search engine, by looking at the spammers’ economic incentives.

It obviously does no good to make your site a prominent search hit for keywords that nobody ever searches for. Less obviously, since search engine spammers are trying to make money by sucking people into linkfarms full of ads, they should be focusing on keywords that lots of advertisers are interested in (monetizability in the paper’s jargon). Therefore, the search engine operator should focus its spam-detection efforts on the most-searched and most-advertised keywords. Once you identify a linkfarm, you can weed it out of all your keyword indexes, so this improves search results in the long tail as well as the popular case.
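The prioritization itself is trivial to sketch. Here is a hypothetical illustration (the function and data names are mine, not the paper’s): rank candidate pages by the product of their target keyword’s search popularity and advertiser value.

```python
def scan_order(pages, searches_per_day, advertiser_bid):
    """Order candidate pages for cloaking analysis, most suspicious first.
    `pages` maps a URL to the keyword it targets; the other two dicts give
    each keyword's popularity and monetizability."""
    def score(url):
        kw = pages[url]
        return searches_per_day.get(kw, 0) * advertiser_bid.get(kw, 0.0)
    return sorted(pages, key=score, reverse=True)
```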

They performed a quick verification of this hypothesis with the help of logs of the 5000 most popular and 5000 most advertised search keywords, from the MSN search engine (now Bing). Indeed, the spam was strongly skewed toward both high ends—somewhat more toward monetizability than popularity.

That’s really all there is to say about this paper. They had a good hypothesis, they tested it, and the hypothesis was confirmed. I’ll add that this is an additional reason (besides just making money) for search engines to run their own ad brokerages, and a great example of the value of applying economic reasoning to security research.

k-Indistinguishable Traffic Padding in Web Applications

A Web application is a piece of software whose execution is split between the client and the server, and it sends data between the two sides every time you do something significant—even, perhaps, on every single keystroke. Inevitably, the volume of data has some correlation with the action performed. Since most Web applications are offered to the general public at low or no cost, an adversary can map out much of an application’s internal state machine and develop a model that lets them deduce what someone’s doing from an observed sequence of packets. [1] [2] This has been successfully applied (in the lab) to Web applications for highly sensitive tasks, such as tax preparation and medical advice.
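To make this concrete, here is a toy version of the adversary’s model (state names and sizes are invented by me; real attacks match sequences of packet sizes in both directions, not a single number):

```python
# Observed response sizes (bytes) for each state of a hypothetical web app,
# built by the adversary simply by using the free application themselves.
PROFILE = {
    "idle_ping": 340,
    "search_suggestion": 1120,
    "symptom_form_submit": 5210,
}

def infer_state(observed_size, profile=PROFILE, tolerance=64):
    """Guess which application state produced a response of this size."""
    best = min(profile, key=lambda s: abs(profile[s] - observed_size))
    return best if abs(profile[best] - observed_size) <= tolerance else None
```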

The only known solution to this problem is to pad the amount of data passed over the network in each direction so that the adversary cannot learn anything about the state of the web application from an observed sequence of packets. Unfortunately, the obvious way to do this involves unacceptably high levels of overhead—this paper quotes [1] as observing 21074% overhead for a well-known online tax system. Today’s paper proposes that techniques from the field of privacy-preserving data publishing (PPDP, [3]) be brought to bear on the problem.

The paper is mathematically heavy going and I’m not going to go into details of their exact proposal, because it’s really just a starting point. Instead, I’m going to pull out some interesting observations:

  • More padding is not always better. They show an example case where padding messages to a multiple of 128 bytes is exactly as good as padding them to a multiple of 512 bytes, and padding them to a multiple of 520 bytes is worse.

  • Computing the optimal amount of padding for a web application (as they define it) is NP-hard even with complete, ideal knowledge of its behavior. However, there are reasonably efficient approximations—if I understood them correctly, it might even be feasible to apply one of their algorithms on the fly with each client.

  • Because optimality is defined over an entire sequence of user interactions with an application, knowing how much padding to send in response to any given request may require the server to predict the future. Again, however, there are approximations that avoid this problem.

  • PPDP offers a whole bunch of different ways to model the degree to which one has mitigated the information leak: k-anonymity is the model they spend the most time on, but there’s also l-diversity, (α, k)-anonymity, t-closeness, etc. etc. I don’t follow that literature but I have the impression people are still actively making up new models and haven’t spent a whole lot of time on figuring out which is best in which situations.
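The first bullet is easy to reproduce with toy numbers (mine, not the paper’s). Take two messages of 515 and 525 bytes: padding to a multiple of 128 or of 512 lands both in the same bucket, but padding to a multiple of 520 separates them again.

```python
from collections import Counter

def pad(size, c):
    """Round a message size up to the next multiple of c bytes."""
    return -(-size // c) * c

def min_anonymity(sizes, c):
    """Size of the smallest group of messages left indistinguishable
    after padding every message to a multiple of c."""
    return min(Counter(pad(s, c) for s in sizes).values())
```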

The great failing of this and all the other papers I’ve ever read in this area is operationalization. This paper’s algorithms need a complete, or nearly so, map of all possible sequences of states that a Web application can enter, and the authors shrug off the question of how you create this map in the first place. (I very strongly suspect that task can’t be mechanized in the general case—it will turn out to be the Halting Problem in another of its many disguises.) They also don’t seem to realize that things like Google search suggestions (the running example in the paper) are constantly changing. (Of course, that makes life harder for the adversary too.) This doesn’t mean there is no hope; it only means you need more approximations. I think it ought to be possible to build good-enough automatic padding into CMSes, and maybe even into web application frameworks. Trying that should be a priority for future research.

Inferring Mechanics of Web Censorship Around the World

I’ve talked a bunch about papers that investigate what is being censored online in various countries, but you might also want to know how it’s done. There are only a few ways it could be done, and this paper does a good job of laying them out:

  • By DNS name: intercept DNS queries either at the router or the local DNS relay, return either “no such host” or a server that will hand out errors for everything.

  • By IP address: in an intermediate router, discard packets intended for particular servers, and/or respond with TCP RST packets (which make the client disconnect) or forged responses. (In principle, an intermediate router could pretend to be the remote host for an entire TCP session, but it doesn’t seem that anyone does.)

  • By data transferred in cleartext: again in an intermediate router, allow the initial connection to go through, but if blacklisted keywords are detected then forge a TCP RST.

There are a few elaborations and variations, but those are the basic options if you are implementing censorship in the backbone of the network. The paper demonstrates that all are used. It could also, of course, be done at either endpoint, but that is much less common (though not unheard of) and the authors of this paper ruled it out of scope. It’s important to understand that the usual modes of encryption used on the ’net today (e.g. HTTPS) do not conceal either the DNS name or the IP address of the remote host, but do conceal the remainder of an HTTP request. Pages of an HTTPS-only website cannot be censored individually, but the entire site can be censored by its DNS name or server IP address. This is why Github was being DDoSed a few months ago to try to get them to delete repositories being used to host circumvention tools [1]: Chinese censors cannot afford to block the entire site, as it is too valuable to their software industry, but they have no way to block access to the specific repositories they don’t like.

Now, if you want to find out which of these scenarios is being carried out by any given censorious country, you need to do detailed network traffic logging, because at the application level, several of them are indistinguishable from the site being down or the network unreliable. This also means that the censor could choose to be stealthy: if Internet users in a particular country expect to see an explicit message when they try to load a blocked page, they might assume that a page that always times out is just broken. [2] The research contribution of this paper is in demonstrating how you do that, through a combination of packet logging and carefully tailored probes from hosts in-country. They could have explained themselves a bit better: I’m not sure they bothered to try to discriminate “packets are being dropped at the border router” from “packets are being dropped by a misconfigured firewall on the site itself,” for instance. Also, I’m not sure whether it’s worth going to the trouble of packet logging, frankly. You should be able to get the same high-level information by comparing the results you get from country A with those you get from country B.
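As a toy illustration of the kind of discrimination involved (entirely my own simplification; the field names are invented, and the paper’s actual probes are far more careful):

```python
def classify_blocking(probe):
    """Classify a censorship mechanism from a dict of observations made by
    a client inside the country, plus one control DNS lookup from outside."""
    if probe["dns_answer"] != probe["dns_answer_outside"]:
        return "DNS tampering"
    if not probe["handshake_ok"]:
        return "IP blocking"
    if probe["rst_after_request"]:
        return "keyword filtering"
    return "no blocking observed"
```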

Another common headache in this context is knowing whether the results you got from your measurement host truly reflect what a normal Internet user in the country would see. After all, you are probably using a commercial data center or academic network that may be under fewer restrictions. This problem is one of the major rationales for Encore, which I discussed a couple weeks ago [3]. This paper nods at that problem but doesn’t really dig into it. To be fair, they did use personal contacts to make some of their measurements, so those may have involved residential ISPs, but they are (understandably) vague about the details.

Tangler: A Censorship-Resistant Publishing System Based On Document Entanglements

Over the years there have been several attempts to build anonymous publication or distributed anonymous storage systems—usually they start with a peer-to-peer file sharing protocol not entirely unlike BitTorrent, and then build some combination of indexing, replication, encryption, and anonymity on top. All have at least one clever idea. None has achieved world domination. (We’ll know someone’s finally gotten it right when the web browsers start shipping native support for their protocol.)

Tangler is a relatively old example, and its one clever idea is what they call document entanglement. To understand document entanglement you have to know about something called k-of-n secret sharing. This is a mathematical technique that converts a secret into n shares. Each share is the same size as the original secret, possibly plus a little overhead. Anyone who has a copy of k of those n shares can reconstruct the original secret, but if they have even just one fewer, they can’t. k and n can be chosen arbitrarily. Secret sharing is normally not used for large secrets (like an entire document) because each share is the same size as the original, so you’ve just increased your overall storage requirement n times—but in a distributed document store like Tangler, you were going to do that anyway, because the document should remain retrievable even if some of the peers holding shares drop out of the network.
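Here is a minimal Shamir-style k-of-n sharing sketch over a prime field, as one concrete way to do this (my own toy code, not Tangler’s implementation):

```python
import random

P = 2**127 - 1  # prime modulus; the secret and all shares are numbers mod P

def split(secret, k, n):
    """Produce n shares, any k of which reconstruct the secret."""
    # random polynomial of degree k-1 with the secret as constant term
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def poly(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange-interpolate the sharing polynomial at x = 0."""
    total = 0
    for xi, yi in shares:
        num = den = 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total
```

With fewer than k shares, every candidate secret remains equally consistent with what you hold, which is where the “even just one fewer” guarantee comes from.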

Document entanglement, then, is secret sharing with a clever twist: you arrange to have some of the n shares of your document be the same bitstring as existing shares for other documents. This is always mathematically possible, as long as fewer than k existing shares are used. This reduces the amount of data added to the system by each new document, but more importantly, it makes the correspondence between shares and documents many-to-many instead of many-to-one. Thus, operators can honestly say they do not know which documents are backed by which shares, and they have an incentive not to cooperate with deletion requests, since deleting one document may render many other documents inaccessible.
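In the polynomial-sharing setting, the entanglement trick can be sketched like this (again my own toy construction, not Tangler’s actual code): pin the new document’s polynomial to pass through the secret at x = 0 and through up to k−1 shares that already exist in the system, then emit fresh points on that polynomial as the remaining shares.

```python
import random

P = 2**127 - 1  # prime modulus; all share values are numbers mod P

def interp_eval(points, x):
    """Evaluate the unique polynomial through `points` at x, mod P."""
    total = 0
    for xi, yi in points:
        num = den = 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def entangle(secret, existing, k, n):
    """Share `secret` k-of-n, reusing the `existing` shares (fewer than k
    of them) as shares of this document too."""
    assert len(existing) < k
    # pin the degree-(k-1) polynomial: the secret at x=0, the reused shares,
    # and random points for any remaining degrees of freedom
    pinned = [(0, secret)] + list(existing)
    while len(pinned) < k:
        pinned.append((random.randrange(1, P), random.randrange(P)))
    used = {x for x, _ in pinned}
    fresh, x = [], 1
    while len(fresh) + len(existing) < n:
        if x not in used:
            fresh.append((x, interp_eval(pinned, x)))
        x += 1
    return fresh  # the document's full share set is existing + fresh
```

Any k of the resulting shares recover the secret, and the reused shares now serve two documents at once.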

I am not convinced entanglement actually provides the security benefit claimed; deleting all n of the shares belonging to one document should cause other documents to lose no more than one share and thus not be permanently damaged. (The originators of those documents would of course want to generate new shares to preserve redundancy.) It is still probably worth doing just because it reduces the cost of adding new documents to the system, but security-wise it’s solving the wrong problem. What you really want here is: server operators should be unable to determine which documents they hold shares for, even if they know the metadata for those documents. (And yet, somehow, they must be able to hand out the right shares on request!) Similar things are possible, under the name private information retrieval, and people are trying to apply that to anonymous publication, but what I said one really wants here is even stronger than the usual definition of PIR, and I’m not sure it’s theoretically possible.

The Most Controversial Topics in Wikipedia: A multilingual and geographical analysis

One of my more social-science-y interests lately has been in reverse-engineering the rationale for nation-state censorship policies from the data available. Any published rationale is almost always quite vague (“harmful to public order,” “blasphemous,” that sort of thing). Hard data in this area consists of big lists of URLs, domain names, and/or keywords that are known or alleged to be blocked. Keywords are great, when you can get them, but URLs are less helpful, and domain names even less so. I have a pretty good idea why the Federal Republic of Germany might have a problem with a site that sells sheet music of traditional European folk songs (actual example from #BPjMleak), but I don’t know which songs are at stake, because they blocked the entire site. I could find out, but I’d probably want a dose of brain bleach afterward. More to the point, no matter how strong my stomach is, I don’t have the time for the amount of manual research required to figure out the actual issue with all 3000 of the sites on that list—and that’s just one country, whose politics and history are relatively well-known to me.

So, today’s paper is about mechanically identifying controversial Wikipedia articles. Specifically, they look through the revision history of each article for what they call mutual reverts, where two editors have each rolled back the other’s work. This is a conservative measure; edit warring on Wikipedia can be much more subtle. However, it’s easy to pick out mechanically. Controversial articles are defined as those where there are many mutual reverts, with extra weight given to mutual reverts by pairs of senior editors (people with many contributions to the entire encyclopedia). They ran this analysis for ten different language editions, and the bulk of the article is devoted to discussing how each language has interesting peculiarities in what is controversial. Overall, there’s strong correlation across languages, strong correlation with external measures of political or social controversy, and strong correlation with the geographic locations where each language is spoken. An interesting exception to that last is that the Middle East is controversial in all languages, even those that are mostly spoken very far from there; this probably reflects the ongoing wars in that area, which have affected everyone’s politics at least somewhat.
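The core of the detector fits in a few lines. This sketch weights each mutually reverting pair by the smaller of the two editors’ total edit counts, which I believe approximates their seniority weighting, though the paper’s exact formula may differ:

```python
def controversy(reverts, edit_counts):
    """reverts: (reverter, reverted_editor) pairs from one article's history.
    edit_counts: each editor's total contributions to the encyclopedia."""
    seen = set(reverts)
    mutual = {frozenset((a, b)) for a, b in seen if a != b and (b, a) in seen}
    return sum(min(edit_counts[a], edit_counts[b])
               for a, b in map(tuple, mutual))
```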

What does this have to do with nation-state censorship? Well, politically controversial topics in language X ought to be correlated with topics censored by nation-states where X is commonly spoken. There won’t be perfect alignment; there will be topics that get censored that nobody bothers to argue about on Wikipedia (pornography, for instance) and there will be topics of deep and abiding Wikipedia controversy that nobody bothers to censor (Spanish football teams, for instance). But if an outbound link from a controversial Wikipedia article gets censored, it is reasonably likely that the censorship rationale has something to do with the topic of the article. The same is true of censored pages that share a significant number of keywords with a controversial article. It should be good enough for hypothesis generation and rough classification of censored pages, at least.

I Know What You’re Buying: Privacy Breaches on eBay

eBay intends not to let anyone else figure out what you’re in the habit of buying on the site. Because of that, lots of people consider eBay the obvious place to buy things you’d rather your neighbors not know you bought (there is a survey in this very paper confirming this fact). However, this paper demonstrates that a determined adversary can figure out what you bought.

(Caveat: This paper is from 2014. I do not know whether eBay has made any changes since it was published.)

eBay encourages both buyers and sellers to leave feedback on each other, the idea being to encourage fair dealing by attaching a persistent reputation to everyone. Feedback is associated with specific transactions, and anyone (whether logged into eBay or not) can see each user’s complete feedback history. Items sold are visible, items bought are not, and buyers’ identities are obscured. The catch is, you can match up buyer feedback with seller feedback by the timestamps, using obscured buyer identities as a disambiguator, and thus learn what was bought. It involves crawling a lot of user pages, but it’s possible to do this in a couple days without coming to eBay’s notice.
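A simplified sketch of the matching step (field names and the exact time window are my inventions; the real attack also has to handle clock skew and multiple candidate matches):

```python
def match_purchases(buyer_feedback, seller_feedback, window=60):
    """buyer_feedback: (obscured_id, timestamp) pairs crawled from a buyer's
    feedback page; seller_feedback: (obscured_id, timestamp, item) triples
    crawled from sellers' pages. Returns items the buyer likely bought."""
    bought = []
    for buyer_id, bt in buyer_feedback:
        for seller_side_id, st, item in seller_feedback:
            if buyer_id == seller_side_id and abs(bt - st) <= window:
                bought.append(item)
    return bought
```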

They demonstrate that this is a serious problem by identifying users who purchased gun holsters (eBay does not permit the sale of actual firearms), pregnancy tests, and HIV tests. As an additional fillip they show that people commonly use the same handle on eBay as on Facebook, and therefore purchase histories can be correlated with all the personal information one can typically extract from a Facebook profile.

Solutions are pretty straightforward and obvious—obscured buyer identities shouldn’t be correlated with their real handles; feedback timestamps should be quantized to weeks or even months; feedback on buyers might not be necessary anymore; eBay shouldn’t discourage use of various enhanced-privacy modes, or should maybe even promote them to the default. (Again, I don’t know whether any of these solutions has been implemented.) The value of the paper is perhaps more in reminding website developers in general that cross-user correlations are always a privacy risk.

Stealthy Dopant-Level Hardware Trojans

Most of the time we treat silicon chips—CPUs, for instance—as black boxes. The manufacturer publishes specifications for integrating it into a larger widget and making use of it, we code to those specifications, and the chip does its job. But inside the package is a fiendishly complicated machine, and there have been plenty of incidents where it didn’t work quite right, e.g. the infamous F00F and FDIV bugs in the Pentium. This is not to pick on Intel; every CPU manufacturer has had similar troubles, but only the x86 is sufficiently famous that its troubles make Wikipedia.

Anything that can happen by accident can happen on purpose. Most chip designers have to contract manufacture out to a fab plant run by a separate company, and the manufacturing process is opaque to them. It might occur to you to wonder whether the manufacturer can tamper with the design. It’s possible to disassemble a chip under an electron microscope and make sure all the wires are where they’re supposed to be, but this is expensive and destroys the chip, and it can only detect some of the possible changes to the design.

This paper presents a technique a fab plant could use to sabotage a chip design, that can’t be detected by any current technique for inspecting a chip after the fact. Basically, by changing the dopant mask, the fab can turn individual transistors into shorts to ground or Vcc. This is invisible to a microscope; all the wires are where they’re supposed to be, it’s just that a block of semiconductor has the wrong electronic properties. They demonstrate that this can be used to introduce subtle bugs that will not be caught by functional testing, such as making the output of a hardware RNG predictable, or introducing a power-consumption side channel into a cryptographic accelerator.

No solutions are presented; I’m not much of a hardware person and I have no idea what solutions might be possible. This is the hardware equivalent of a malicious compiler, and people are working on solving that problem—but they rely on the fact that you can inspect the output of a compiler in detail, because it’s just another string of bits. How do you detect that a block of silicon has been doped with boron instead of phosphorus? X-ray crystallography, maybe?

RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response

Lots of software developers would love to know in detail how people use their work, and to that end, there’s an increasing number of programs that phone home with crash reports, heat maps of which UI elements get used, and even detailed logs of every interaction. The problem, of course, is how to collect as much useful information as possible without infringing on user privacy. Usually, what the developers want are aggregate statistics—how often does this widget get used by everyone, that sort of thing—so the logical method to apply is differential privacy. However, stock differential privacy algorithms assume a central, trusted database, and in this case, the only people who could run the database are the developers—the very same people against whom individual responses should be protected. Also, differential privacy’s mathematical guarantees are eroded if you repeatedly ask the same survey questions of the same population—which is exactly what you want to do to understand how user behavior changes over time.

This paper solves both problems with a combination of randomized response and Bloom filters. Randomized response is an old, ingenious idea for asking sensitive yes-no survey questions such that any one person has plausible deniability, but the true aggregate number of yes responses is still known: each survey participant secretly flips a coin before answering, and if it comes up heads they answer yes whether or not that’s true. Otherwise they answer honestly. After the fact, everyone can claim to have answered yes because of the coin flip, but to recover the true population statistic one simply doubles the number of no answers. Bloom filters (which I’m not going to try to explain) are used to extend this from yes-no to reporting sets of strings. Finally, there are two stages of randomization. Given a set of survey questions, the system computes a Permanent randomized response which is, as the name implies, reused until either the answers or the questions change. This prevents privacy erosion, because each user is always either answering honestly or falsely to any given question; a nosy server cannot average out the coin tosses. Additional instantaneous randomness is added each time a report is submitted, to prevent the report being a fingerprint for that user. The system is said to be in use for telemetry from the Chrome browser, and they give several examples of data collected in practice.
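The randomized-response arithmetic is easy to check in simulation. (RAPPOR’s actual reports are more elaborate, with Bloom filters and tunable coin biases; this toy shows only the basic yes-no mechanism.)

```python
import random

def survey(truth, rng):
    """Flip a coin: heads, report yes regardless; tails, answer honestly."""
    return True if rng.random() < 0.5 else truth

def estimate_yes_fraction(reports):
    # Every "no" must come from an honest tails-flip, and tails happens
    # half the time, so the true number of "no"s is about twice the
    # observed number.
    no = sum(1 for r in reports if not r)
    return 1 - 2 * no / len(reports)

rng = random.Random(0)
reports = [survey(rng.random() < 0.3, rng) for _ in range(10000)]
```

With a true yes-rate of 30%, the estimate comes out close to 0.3, while any individual’s report proves nothing about them.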

The Permanent randomized responses illustrate a basic tension in security design: you can often get better security outcomes versus a remote attacker if you keep local state, but that local state is probably high-value information for a local attacker. For example, any system involving trust on first use will store a list of frequently contacted remote peers, with credentials, on the local machine; tampering with that can destroy the system’s security guarantees, and simply reading it tells you, at a minimum, which remote peers you regularly connect to. The TAILS live-CD is intended to leave no trace on the system it’s run on, which means it changes its entry guards on every reboot, increasing the chance of picking a malicious guard. In this case, a local adversary who can read a Chrome browser profile has access to much higher-value data than the Permanent responses, but a cautious user who erases their browser profile on every shutdown, to protect against that local adversary, is harming their own security against the Chrome developers. (This might be the right tradeoff, of course.)

I also wonder what a network eavesdropper learns from the telemetry reports. Presumably they are conveyed over a secure channel, but at a minimum, that still reveals that the client software is installed, and the size of the upload might reveal things. A crash report, for instance, is probably much bulkier than a statistics ping, and is likely to be submitted immediately rather than on a schedule.

The Declining Half-Life of Secrets and the Future of Signals Intelligence

Today we’re going to look at a position paper from the New America think tank’s Cybersecurity Initiative. If you’re someone like me, that descriptor probably raises at least four red flags: regarding anything to do with Internet security, there’s a tremendous gulf in groupthink between people associated with the US government, and people associated with commercial or pro bono development of computer stuff. Which is precisely why it’s useful to read papers from the other side of the gulf, like this one. (This particular think tank appears to be more on the left as that term is used in US politics. I haven’t dug into their other position papers.)

Swire starts by pointing out that the government’s understanding of secrecy was developed during the Cold War, when it was, frankly, much easier to keep secrets. Paper documents in an archive, which readers must physically visit, and demonstrate their need-to-know to the man with a gun at the entrance, are inherently difficult to duplicate. But that entire archive probably fits on a $50 thumbdrive today. In a similar vein, regular readers will recall the standard military teletype with its data transfer rate of 75 bits per second, from Limitations of End-to-End Encryption (1978).

Also, once data has been exfiltrated, it’s much easier to broadcast it, because there are lots more news organizations who might take an interest—or you can just post it online yourself and rely on the tremendously accelerated speed of gossip. These things together are what Swire means by the declining half-life of secrets: secrets have always been expected to get out eventually, but the time scale is no longer decades. The metaphor of a reduced half-life seems spot on to me: leakage of secrets is inherently probabilistic, so exponential decay is the simplest model, and should give people the right first-order intuition.
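In this model, the chance a secret is still a secret after t years is just exponential decay:

```python
def p_still_secret(t_years, half_life_years):
    """Exponential-decay model of secret leakage."""
    return 0.5 ** (t_years / half_life_years)
```

A secret with a five-year half-life, for instance, has only about a 6% chance of holding for twenty years.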

Swire then moves on to discussing the effects of that groupthink gulf I mentioned. This bit is weaker, because it’s plain that he doesn’t understand why people might prefer the world-view of EFF. But it’s accurate as far as it goes. People associated with the government are starting from the premise that revealing a secret, regardless of its contents, is the worst possible thing anyone can do. (I confess to not understanding how one comes to think this, myself. It probably has to do with one’s default idea of a secret being something that really could get someone killed if it were revealed, never mind that only a tiny fraction of all classified information is that dangerous.) In contrast, the world-view of EFF begins with the premise that most information should be published, and that an organization doing something in secret from the general public probably means it knows, institutionally, that the general public would not approve. And, therefore, that it shouldn’t be doing it in the first place. Since most of the technology community takes this position, the government has an increasingly large problem trying to persuade that community to cooperate with its own attitude, and (Swire says) this will only get worse.

The paper concludes with some fairly weaksauce recommendations: plan for the possibility of disclosure; take the impact on public opinion (should the secret be revealed) into account when planning secret operations; put more effort into arguing for surveillance. Basically, business as usual but with more media savvy. This may be the best one can hope for in the short term, but I have some policy suggestions of my own:

  • Apply Kerckhoffs’ Principle to all surveillance programs. The overall design of the system, its budget, the nature of the data collected, all relevant policies and procedures, everything except the collected data should be public knowledge, subject to normal public oversight (e.g. any Congressional hearings on the topic should be conducted in public and on the record), and debated in public prior to implementation—just like any other government program. If that would render the surveillance useless, the logic of Kerckhoffs’ principle says it was already useless. (I’ve made this point before, on my main blog.)

  • Abandon the desire for exceptional access. The technology community has spent 20+ years explaining over and over and over again why exceptional access is impractical and makes security worse for everyone. Government agencies refusing to accept that message is probably the single strongest reason why the groupthink gulf is as wide as it is.

  • More generally, whenever there is a tradeoff between offense and defense in computer security, choose defense. Design cryptographic standards that are secure for everyone, even if they happen to be enemies of the USA right now (or might be at some time in the future). Disclose all security-critical bugs to the vendors, so they get fixed, even if this means not being able to pull off another Stuxnet. Think of this as the Internet analogue of the SALT and START treaties.

  • Split the NSA in half. Merge the offensive signals intelligence mission into the CIA, or scrap it, I don’t care. Merge the defensive cryptographic and cryptanalytic mission into NIST, declassify and publish everything, and do all future work in public (Kerckhoffs’ Principle again). Make it a bedrock policy that this organization only does offensive research in support of defensive programs (e.g. to demonstrate the (un)soundness of a cipher).

I’m willing to listen to reasons not to do these things, as long as they do not boil down to “we’re scared of hypothetical enemy X.”

Students Who Don’t Understand Information Flow Should Be Eaten: An Experience Paper

On a completely different and much lighter note from Wednesday, today we’re going to look at a paper about teaching students information-flow theory with Werewolf.

Werewolf (also known as Mafia) is a game in which the players are divided into two groups, the werewolves (or mafiosi) and the villagers, and each is trying to wipe the other out at a rate of one murder per turn. There are a whole bunch of variants. Normally this is played around a table, as a game of strategic deception: only the werewolves know who the werewolves are, and they participate as villagers on the villagers’ turn. In this paper, it becomes a game played online, of surveillance and countersurveillance: the villagers are actively encouraged to exploit information leaks in the game server and discover who the werewolves are. (In a normal game this would be cheating.)

The authors don’t report how this teaching method compares to traditional lectures on any quantitative basis (e.g. final exam scores, class grades). However, they do say that the students loved the exercise, met up outside of official class hours to discuss strategies and plan, and that over the term the exploits and countermeasures grew steadily more sophisticated, in some cases requiring adjustments to the game server to ensure that both sides could still win. It’s hard to imagine this level of student engagement not leading to better grades, better retention, and deeper interest in the material.

I think this is a brilliant idea and not just for teaching information flow. One of the hardest things to teach in a security course is what Bruce Schneier calls the security mindset: intentionally thinking about how a system can be caused to fail, to do something it’s not intended to do. In particular, it is in tension with the usual engineering mindset, which focuses on verifying that something works when used as designed. (Safety engineering and failure analysis have more of the security mindset about them, but usually focus on failures due to accident rather than malice.) But it is exactly how you need to think to successfully cheat at a game, or to notice when someone else is cheating—and in that context, it’s going to be familiar to most college students. Using a game of strategic deception as the backdrop will also encourage students to think along the right lines. I’d like to see this idea adapted to teach other challenging notions in security—penetration testing is the obvious one, but maybe also code exploitation and key management?