Papers tagged ‘Natural language’

The Most Controversial Topics in Wikipedia: A multilingual and geographical analysis

One of my more social-science-y interests lately has been in reverse-engineering the rationale for nation-state censorship policies from the data available. Any published rationale is almost always quite vague (harmful to public order, blasphemous, that sort of thing). Hard data in this area consists of big lists of URLs, domain names, and/or keywords that are known or alleged to be blocked. Keywords are great, when you can get them, but URLs are less helpful, and domain names even less so. I have a pretty good idea why the Federal Republic of Germany might have a problem with a site that sells sheet music of traditional European folk songs (actual example from #BPjMleak), but I don’t know which songs are at stake, because they blocked the entire site. I could find out, but I’d probably want a dose of brain bleach afterward. More to the point, no matter how strong my stomach is, I don’t have the time for the amount of manual research required to figure out the actual issue with all 3000 of the sites on that list—and that’s just one country, whose politics and history are relatively well-known to me.

So, today’s paper is about mechanically identifying controversial Wikipedia articles. Specifically, they look through the revision history of each article for what they call mutual reverts, where two editors have each rolled back the other’s work. This is a conservative measure; edit warring on Wikipedia can be much more subtle. However, it’s easy to pick out mechanically. Controversial articles are defined as those where there are many mutual reverts, with extra weight given to mutual reverts by pairs of senior editors (people with many contributions to the entire encyclopedia). They ran this analysis for ten different language editions, and the bulk of the paper is devoted to discussing how each language has interesting peculiarities in what is controversial. Overall, there’s strong correlation across languages, strong correlation with external measures of political or social controversy, and strong correlation with the geographic locations where each language is spoken. An interesting exception to that last is that the Middle East is controversial in all languages, even those that are mostly spoken very far from there; this probably reflects the ongoing wars in that area, which have affected everyone’s politics at least somewhat.
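To make the detection step concrete, here is a minimal sketch, assuming each revision carries its editor’s name and a hash of the resulting article text (MediaWiki does expose per-revision SHA-1s). Weighting each warring pair by the less senior editor’s total edit count is one plausible reading of the paper’s measure, not its exact formula.

```python
def find_reverts(revisions):
    """revisions: list of (editor, text_hash), oldest first.
    A revision that restores a byte-identical earlier state is
    treated as reverting every editor in between; subtler forms
    of edit warring are invisible to this test."""
    seen = {}      # text_hash -> index of first revision with that text
    reverts = []   # (reverting_editor, reverted_editor)
    for i, (editor, text_hash) in enumerate(revisions):
        if text_hash in seen:
            for j in range(seen[text_hash] + 1, i):
                reverted = revisions[j][0]
                if reverted != editor:
                    reverts.append((editor, reverted))
        else:
            seen[text_hash] = i
    return reverts

def controversy_score(revisions, edit_counts):
    """edit_counts: editor -> total contributions to the whole
    encyclopedia, so that fights between veterans count for more."""
    reverts = set(find_reverts(revisions))
    mutual = {frozenset(pair) for pair in reverts
              if pair[::-1] in reverts}
    return sum(min(edit_counts[a], edit_counts[b])
               for a, b in map(tuple, mutual))
```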

What does this have to do with nation-state censorship? Well, politically controversial topics in language X ought to be correlated with topics censored by nation-states where X is commonly spoken. There won’t be perfect alignment; there will be topics that get censored that nobody bothers to argue about on Wikipedia (pornography, for instance) and there will be topics of deep and abiding Wikipedia controversy that nobody bothers to censor (Spanish football teams, for instance). But if an outbound link from a controversial Wikipedia article gets censored, it is reasonably likely that the censorship rationale has something to do with the topic of the article. The same is true of censored pages that share a significant number of keywords with a controversial article. It should be good enough for hypothesis generation and rough classification of censored pages, at least.
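A first pass at that matching could be as crude as joining censored URLs against controversial articles’ outbound links by hostname. Here is a sketch under invented input structures (a list of censored URLs, plus a title-to-outlinks mapping scraped from the relevant language edition); the keyword-overlap variant would need actual text processing on top of this.

```python
from urllib.parse import urlsplit

def hostname(url):
    return urlsplit(url).hostname or ""

def candidate_rationales(censored_urls, outlinks_by_article):
    """outlinks_by_article: {article_title: iterable of URLs}.
    Returns {censored_url: sorted list of controversial articles
    linking to the same host} -- hypothesis generation, nothing more."""
    host_to_articles = {}
    for title, links in outlinks_by_article.items():
        for link in links:
            host_to_articles.setdefault(hostname(link), set()).add(title)
    return {url: sorted(host_to_articles.get(hostname(url), ()))
            for url in censored_urls}
```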

Sparse Overcomplete Word Vector Representations

It’s time for another of my occasional looks at a paper that doesn’t have anything to do with security. This one is about an attempt to bridge a gap between two modes of analysis in computational linguistics.

In a word vector analysis (also referred to as a distributed word representation in this paper), one maps words onto vectors in an abstract high-dimensional space, defined by the co-occurrence probabilities of each pair of words within some collection of documents (a corpus). This can be done entirely automatically with any source of documents; no manual preprocessing is required. Stock machine-learning techniques, applied to these vectors, perform surprisingly well on a variety of downstream tasks—classification of the words, for instance. However, the vectors are meaningless to humans, so it’s hard to use them in theoretical work that requires interpretability, or to combine them with other information (e.g. sentence structure). Lexical semantics, by contrast, relies on manual, word-by-word tagging with human-meaningful categories (part of speech, word sense, role in sentence, etc.), which is slow and expensive, but the results are much easier to use as a basis for a wide variety of further studies.
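For the unfamiliar, here is a toy version of the construction: count co-occurrences within a sliding window, reweight by positive pointwise mutual information, and compress with an SVD. This is a classic recipe, not necessarily the one behind the paper’s input vectors (they start from existing pretrained vectors).

```python
import numpy as np

def cooccurrence_vectors(tokenized_docs, window=5, dim=100):
    """Toy pipeline: window co-occurrence counts -> PPMI -> truncated
    SVD. Production systems (word2vec, GloVe, ...) are far more
    refined, but exploit the same pairwise co-occurrence signal."""
    vocab = {w: i for i, w in
             enumerate(sorted({w for d in tokenized_docs for w in d}))}
    n = len(vocab)
    M = np.zeros((n, n))
    for doc in tokenized_docs:
        for i, w in enumerate(doc):
            for c in doc[max(0, i - window):i + window + 1]:
                if c != w:
                    M[vocab[w], vocab[c]] += 1
    total = M.sum()
    pw = M.sum(axis=1, keepdims=True) / total   # row marginals
    pc = M.sum(axis=0, keepdims=True) / total   # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((M / total) / (pw * pc))
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)
    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
    return vocab, U[:, :dim] * S[:dim]          # one dense row per word
```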

The paper proposes a technique for transforming a set of word vectors to make them more interpretable. It’s essentially the opposite of PCA. They project the vectors into a higher-dimensional space, one in which they are all sparse (concretely, more than 90% of the components of each vector are zero). Words that are semantically related will (they claim) share nonzero components of these vectors, so each component has more meaning than in the original space. The projection matrix can be generated by standard mathematical optimization techniques (specifically, gradient descent with some tweaks to ensure convergence).
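In other words, this is a form of sparse dictionary learning: find a dictionary D and sparse codes A minimizing reconstruction error plus an L1 penalty on A. The authors use their own solver; as a rough off-the-shelf analogue (with made-up sizes, and a similar but not identical objective), scikit-learn’s DictionaryLearning will do:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Made-up data, kept small so this runs quickly; the paper's setting
# is a real vocabulary with, e.g., a few hundred dense dimensions
# mapped to a few thousand sparse ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))     # 300 "words", 20 dense dimensions

learner = DictionaryLearning(
    n_components=80,               # overcomplete: 80 > 20
    alpha=1.0,                     # L1 weight; larger means sparser codes
    fit_algorithm="lars",
    transform_algorithm="lasso_lars",
    random_state=0,
)
A = learner.fit_transform(X)       # sparse codes, shape (300, 80)
print(f"{np.mean(A == 0):.0%} of components are exactly zero")
```

The alpha knob controls how sparse the codes come out; to match the paper’s regime you would tune it until upward of 90% of each row is zero.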

To back up the claim that the sparse vectors are more interpretable, they offer two experiments. First, five stock downstream classification tasks achieve an average of 5% higher accuracy when fed sparse vectors than when fed the dense vectors from which they were derived. Second, humans achieve 10% higher accuracy on a word intrusion task (one of these five words does not belong in the list; which is it?) when the words are selected by their rank along one dimension of the sparse vectors than when they are selected the same way from the dense vectors. An example would probably help here: Table 6 of the paper quotes the top-ranked words from five dimensions of the sparse and the dense vectors (a sketch of how such an intrusion question might be assembled follows the table):

| Dense | Sparse |
| --- | --- |
| combat, guard, honor, bow, trim, naval | fracture, breathing, wound, tissue, relief |
| ’ll, could, faced, lacking, seriously, scored | relationships, connections, identity, relations |
| see, n’t, recommended, depending, part | files, bills, titles, collections, poems, songs |
| due, positive, equal, focus, respect, better | naval, industrial, technological, marine |
| sergeant, comments, critics, she, videos | stadium, belt, championship, toll, ride, coach |

Neither type of list is what you might call an ideal Platonic category,[1] but it should be clear that the sparse lists have more meaning in them.
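Concretely, an intrusion question might be assembled along these lines; the paper’s actual protocol for choosing the intruder and presenting the options is more careful, so treat the details here as guesses.

```python
import numpy as np

def intrusion_question(vectors, words, dim, k=5, rng=None):
    """Top-k words along one dimension of `vectors`, plus an intruder
    drawn from the bottom half of that ranking. (The paper's protocol
    additionally requires the intruder to rank high on some *other*
    dimension, which this simplification skips.)"""
    rng = rng or np.random.default_rng()
    order = np.argsort(vectors[:, dim])[::-1]   # best-ranked first
    top = [words[i] for i in order[:k]]
    intruder = words[rng.choice(order[len(order) // 2:])]
    options = top + [intruder]
    rng.shuffle(options)
    return options, intruder
```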

Because I’m venturing pretty far out of my field, it’s hard for me to tell how significant this paper is; I don’t have any basis for comparison. This is, in fact, the big thing missing from this paper: a comparison to other techniques for doing the same thing, if any. Perhaps the point is that there aren’t any, but I didn’t see them say so. I am also unclear on how you would apply this technique to anything other than an analysis of words. For instance, my own research right now involves (attempting to) mechanically assign topics to entire documents. We reduce each document to a bag of words, carry out LDA on the bags, and then manually label each LDA cluster with a topic. Could we use bags of sparse word vectors instead? Would that help LDA do its job better? Or is LDA already doing what this does? I don’t know the answers to these questions.
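For reference, that pipeline has roughly this shape (a scikit-learn sketch, not our actual code):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_topics(documents, n_topics=20, top_n=10):
    """Bag-of-words -> LDA -> top words per topic, ready for a
    human to attach labels to. `documents` is a list of strings."""
    vectorizer = CountVectorizer(stop_words="english", min_df=2)
    bags = vectorizer.fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(bags)       # document-topic weights
    vocab = vectorizer.get_feature_names_out()
    topics = [[vocab[i] for i in comp.argsort()[::-1][:top_n]]
              for comp in lda.components_]     # words to label by hand
    return doc_topics, topics
```

The open question in the preceding paragraph amounts to replacing `CountVectorizer` with something built on sparse word vectors and seeing whether the topics improve.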

[1] If you are under the impression that categories in natural language are even vaguely Platonic, go at once to your friendly local public library and request a copy of Women, Fire, and Dangerous Things.

Links that speak: The global language network and its association with global fame

The paper we’re looking at today isn’t about security, but it’s relevant to anyone who’s doing field studies of online culture, which can easily become part of security research. My own research right now, for instance, touches on how the Internet is used for online activism and how that may or may not be safe for the activists; if you don’t have a handle on online culture—and how it varies worldwide—you’re going to have a bad time trying to study that.

What we have here is an analysis of language pairings as found on Twitter, Wikipedia, and a database of book translations. Either the same person uses both languages in a pair, or text has been translated from one to the other. Using these pairs, they identify hub languages that are very likely to be useful to connect people in distant cultures. These are mostly, but not entirely, the languages with the greatest number of speakers. Relative to their number of speakers, regional second languages like Malay and Russian show increased importance, and languages that are poorly coupled to English (which is unsurprisingly right in the middle of the connectivity graph), like Chinese, Arabic, and the languages of India, show reduced importance.
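The hub computation itself is straightforward once you have the pair counts: build a weighted graph whose nodes are languages and rank them by a centrality measure. The sketch below uses networkx and eigenvector centrality as a stand-in for the paper’s exact measure; pair_counts is a hypothetical input.

```python
import networkx as nx

def hub_languages(pair_counts, top=10):
    """pair_counts: {('en', 'ms'): shared_users_or_translations, ...}.
    Returns the `top` languages by weighted eigenvector centrality,
    which rewards being connected to other well-connected languages."""
    g = nx.Graph()
    for (a, b), w in pair_counts.items():
        g.add_edge(a, b, weight=w)
    centrality = nx.eigenvector_centrality(g, weight="weight", max_iter=1000)
    return sorted(centrality, key=centrality.get, reverse=True)[:top]
```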

There is then a bunch of hypothesizing about how the prominence of a language as a translation hub might influence how famous someone is and/or how easy it is for something from a particular region to reach a global audience. That’s probably what the authors of the paper thought was most important, but it’s not what I’m here for. What I’m here for is what the deviation between translation hub and widely spoken language tells us about how to do global field studies. It is obviously going to be more difficult to study an online subculture that conducts itself in a language you don’t speak yourself, and the fewer people do speak that language, the harder it will be for you to get some help. But if a language is widely spoken but not a translation hub, it may be difficult for you to get the right kind of help.

For instance, machine translations between the various modern vernaculars of Arabic and English are not very good at present. I could find someone who speaks any given vernacular Arabic without too much difficulty, but I’d probably have to pay them a lot of money to get them to translate 45,000 Arabic documents into English, or even just to tell me which of those documents were in their vernacular. (That number happens to be how many documents in my research database were identified as some kind of Arabic by a machine classifier—and even that doesn’t work as well as it could; for instance, it is quite likely to be confused by various Central Asian languages that can be written with an Arabic-derived script and have a number of Arabic loanwords but are otherwise unrelated.)
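For what it’s worth, that triage step is easy to reproduce with an off-the-shelf classifier such as langid.py (not necessarily what my pipeline used), and this is exactly where the script-confusion caveat bites:

```python
import langid

# Languages commonly written in an Arabic-derived script but unrelated,
# or only distantly related, to Arabic. Illustrative, not exhaustive.
CONFUSABLE = {"fa", "ur", "ps", "ug"}   # Persian, Urdu, Pashto, Uyghur

def triage(documents):
    """Split documents into 'labeled Arabic' and 'labeled as a
    script-confusable language'; both piles still need human review."""
    arabic, suspect = [], []
    for doc in documents:
        lang, _score = langid.classify(doc)
        if lang == "ar":
            arabic.append(doc)
        elif lang in CONFUSABLE:
            suspect.append(doc)
    return arabic, suspect
```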

What can we (Western researchers, communicating primarily in English) do about it? The first step is just to be aware that global field studies conducted by Anglophones are going to be weak when it comes to languages poorly coupled to English, even when they are widely spoken. Indeed, the poor coupling itself makes me skeptical of the results in this paper when it comes to those languages. They only looked at three datasets, all of which are quite English-centric. Would it not have made sense to supplement them with polylingual resources centered on, say, Mandarin Chinese, Modern Standard Arabic, Hindi, and Swahili? These might be difficult to find, but not being able to find them would tend to confirm the original result; and if you could find them, you could both improve the lower bounds for coupling to English and get a finer-grained look at the languages that are well-translated within those clusters.

Down the road, it seems to me that whenever you see a language cluster that’s widely spoken but not very much in communication with any other languages, you’ve identified a gap in translation resources and cultural cross-pollination, and possibly an underserved market. Scanlations make a good example: the lack of officially licensed translations of comic books (mostly of Japanese-language manga into English) spawned an entire subculture of unofficial translators, and that subculture is partially responsible for increasing mainstream interest in the material to the point where official translations are more likely to happen.