Papers tagged ‘Culture’

The Most Controversial Topics in Wikipedia: A multilingual and geographical analysis

One of my more social-science-y interests lately has been in reverse-engineering the rationale for nation-state censorship policies from the data available. Any published rationale is almost always quite vague (“harmful to public order,” “blasphemous,” that sort of thing). Hard data in this area consists of big lists of URLs, domain names, and/or keywords that are known or alleged to be blocked. Keywords are great, when you can get them, but URLs are less helpful, and domain names even less so. I have a pretty good idea why the Federal Republic of Germany might have a problem with a site that sells sheet music of traditional European folk songs (actual example from #BPjMleak), but I don’t know which songs are at stake, because they blocked the entire site. I could find out, but I’d probably want a dose of brain bleach afterward. More to the point, no matter how strong my stomach is, I don’t have the time for the amount of manual research required to figure out the actual issue with all 3000 of the sites on that list—and that’s just one country, whose politics and history are relatively well-known to me.

So, today’s paper is about mechanically identifying controversial Wikipedia articles. Specifically, they look through the revision history of each article for what they call mutual reverts, where two editors have each rolled back the other’s work. This is a conservative measure; edit warring on Wikipedia can be much more subtle. However, it’s easy to pick out mechanically. Controversial articles are defined as those where there are many mutual reverts, with extra weight given to mutual reverts by pairs of senior editors (people with many contributions to the entire encyclopedia). They ran this analysis for ten different language editions, and the bulk of the article is devoted to discussing how each language has interesting peculiarities in what is controversial. Overall, there’s strong correlation across languages, strong correlation with external measures of political or social controversy, and strong correlation with the geographic locations where each language is spoken. An interesting exception to that last is that the Middle East is controversial in all languages, even those that are mostly spoken very far from there; this probably reflects the ongoing wars in that area, which have affected everyone’s politics at least somewhat.
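
To make the measure concrete, here is a minimal sketch of how one might compute it. The input format and hash-based revert detection are my paraphrase of the paper’s method (it detects reverts by comparing hashes of revision texts), and the paper’s exact normalization may differ:

```python
def mutual_revert_controversy(revisions, edit_counts):
    """Controversy score for one article, roughly following the paper.

    `revisions` is the article's history, oldest first, as
    (editor, text_hash) pairs; a revert is detected when a revision's
    hash exactly matches an earlier revision's. `edit_counts` maps
    each editor to their total edits across the whole encyclopedia
    (the seniority weight).
    """
    seen_hashes = set()   # hashes of every revision so far
    reverts = set()       # directed (reverter, reverted) editor pairs
    prev_editor = None
    for editor, text_hash in revisions:
        if text_hash in seen_hashes and prev_editor not in (None, editor):
            # Restoring an old version reverts the previous editor.
            reverts.add((editor, prev_editor))
        seen_hashes.add(text_hash)
        prev_editor = editor

    # Mutual reverts: unordered pairs where each reverted the other.
    mutual = {frozenset(p) for p in reverts if (p[1], p[0]) in reverts}
    involved = {e for pair in mutual for e in pair}

    # Each pair counts as much as its *less* prolific member, so one
    # vandal reverting everyone cannot dominate; multiplying by the
    # number of editors involved rewards broad conflicts.
    weight = sum(min(edit_counts.get(a, 0), edit_counts.get(b, 0))
                 for a, b in map(tuple, mutual))
    return len(involved) * weight
```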

What does this have to do with nation-state censorship? Well, politically controversial topics in language X ought to be correlated with topics censored by nation-states where X is commonly spoken. There won’t be perfect alignment; there will be topics that get censored that nobody bothers to argue about on Wikipedia (pornography, for instance) and there will be topics of deep and abiding Wikipedia controversy that nobody bothers to censor (Spanish football teams, for instance). But if an outbound link from a controversial Wikipedia article gets censored, it is reasonably likely that the censorship rationale has something to do with the topic of the article. The same is true of censored pages that share a significant number of keywords with a controversial article. It should be good enough for hypothesis generation and rough classification of censored pages, at least.
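
As a very rough illustration of the kind of classification I have in mind (my own sketch, not anything from the paper), one could rank controversial articles by keyword overlap with a censored page:

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def likely_rationales(censored_keywords, controversial, threshold=0.2):
    """Rank controversial Wikipedia articles by keyword overlap with a
    censored page, as a first guess at the censorship rationale.

    `controversial` maps article titles to keyword sets; the threshold
    is arbitrary and would need tuning against labeled examples.
    """
    scores = ((jaccard(censored_keywords, kw), title)
              for title, kw in controversial.items())
    return sorted((s, t) for s, t in scores if s >= threshold)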

Whiskey, Weed, and Wukan on the World Wide Web

The subtitle of today’s paper is On Measuring Censors’ Resources and Motivations. It’s a position paper whose goal is to get other researchers to start considering how economic constraints might affect the much-hypothesized arms-race or tit-for-tat behavior of censors and those reacting to censorship. As the authors put it:

[…] the censor and censored have some level of motivation to accomplish various goals, some limited amount of resources to expend, and real-time deadlines that are due to the timeliness of the information that is being spread.

They back up their position by presenting a few pilot studies, of which the most compelling is the investigation of keyword censorship on Weibo (a Chinese microblogging service). They observe that searches are much more aggressively keyword-censored than posts—that is, for many examples of known-censored keywords, one is permitted to make a post on Weibo containing that keyword, but searches for that keyword will produce either no results or very few results. (They don’t say whether unrelated searches will turn up posts containing censored keywords.) They also observe that, for some keywords that are not permitted to be posted, the server only bothers checking for variations on the keyword if the user making the post has previously tried to post the literal keyword. (Again, the exact scope of the phenomenon is unclear—does an attempt to post any blocked keyword make the server check more aggressively for variations on all blocked keywords, or just that one? How long does this escalation last?) And finally, whoever is maintaining the keyword blacklists at Weibo seems to care most about controlling the news cycle: terms associated with breaking news that the government does not like are immediately added to the blacklist, and removed again just as quickly when the event falls out of the news cycle or is resolved positively. They give detailed information about this phenomenon for one news item, the Wukan incident, and cite several other keywords that seem to have been treated the same.
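
This post-versus-search asymmetry suggests a straightforward probe protocol. The sketch below is my own reconstruction of the measurement logic, not anything from the paper; `try_post` and `try_search` are hypothetical stand-ins for whatever instrumented client and test accounts you would actually use.

```python
from enum import Enum

class Status(Enum):
    UNFILTERED = "unfiltered"
    SEARCH_ONLY = "postable, but hidden from search"
    POST_BLOCKED = "cannot even be posted"

def classify_keyword(keyword, try_post, try_search):
    """Classify one keyword's censorship status on a microblog.

    `try_post(kw)` and `try_search(kw)` are caller-supplied probes
    (hypothetical -- wire them to your own instrumented accounts),
    each returning True on success. Probing variations of a keyword
    both before and after attempting the literal keyword from the
    same account would let you look for the escalation effect
    described above, where a blocked post triggers stricter checks.
    """
    if not try_post(keyword):
        return Status.POST_BLOCKED
    if not try_search(keyword):
        return Status.SEARCH_ONLY
    return Status.UNFILTERED
```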

They compare Weibo’s behavior to similar keyword censorship by chat programs popular in China, where the same patterns appear, but whoever maintains those lists is sloppier and slower about it. This is clear evidence that the lists are not maintained centrally (by some government agency), and the authors suggest that many companies are not trying very hard:

At times, we often suspected that a keyword blacklist was being typed up by an over-worked college intern who was given vague instructions to filter out anything that might be against the law.

Sadly, I haven’t seen much in the way of people stepping up to the challenge presented here: designing experiments to probe the economics of censorship. You can see similar data points in other studies of China [1] [2] [3] (it is still the case, as far as I know, that ignoring spurious TCP RST packets is sufficient to evade several aspects of the Great Firewall), and in reports from other countries. It is telling, for instance, that Pakistani censors did not bother to update their blacklist of porn sites to keep up with a shift in viewing habits. [4] George Danezis has been talking about the economics of anonymity and surveillance for quite some time now [5] [6], but that’s not quite the same thing. I mentioned some obvious follow-on research for Weibo above, and I don’t think anyone has done it. Please tell me if I’ve missed something.

Links that speak: The global language network and its association with global fame

The paper we’re looking at today isn’t about security, but it’s relevant to anyone who’s doing field studies of online culture, which can easily become part of security research. My own research right now, for instance, touches on how the Internet is used for online activism and how that may or may not be safe for the activists; if you don’t have a handle on online culture—and how it varies worldwide—you’re going to have a bad time trying to study that.

What we have here is an analysis of language pairings as found on Twitter, Wikipedia, and a database of book translations. In each pair, either the same person uses both languages, or text has been translated from one to the other. Using these pairs, the authors identify hub languages that are likely to be useful for connecting people from distant cultures. These are mostly, but not entirely, the languages with the greatest numbers of speakers. Relative to their numbers of speakers, regional second languages like Malay and Russian gain importance, while languages that are poorly coupled to English (which, unsurprisingly, sits right in the middle of the connectivity graph), like Chinese, Arabic, and the languages of India, lose importance.
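
To give a concrete feel for the style of analysis, here is a toy version with made-up pair counts; as I understand it, the paper uses eigenvector centrality over far larger networks of this shape:

```python
import networkx as nx

# Hypothetical co-occurrence counts: (language, language, weight),
# where weight is how many users or translations link the pair.
pairs = [
    ("en", "es", 1200), ("en", "fr", 950), ("en", "ru", 400),
    ("ru", "uk", 300), ("ms", "en", 250), ("zh", "en", 180),
]

g = nx.Graph()
g.add_weighted_edges_from(pairs)

# Eigenvector centrality rewards languages connected to other
# well-connected languages -- the "hub" property in question.
hubs = nx.eigenvector_centrality(g, weight="weight")
for lang, score in sorted(hubs.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {score:.3f}")
```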

There is then a bunch of hypothesizing about how a language’s prominence as a translation hub might influence how famous someone is, and how easy it is for something from a particular region to reach a global audience. That’s probably what the authors thought was most important, but it’s not what I’m here for. What I’m here for is what the deviation between translation hubs and widely spoken languages tells us about how to do global field studies. It is obviously going to be more difficult to study an online subculture that conducts itself in a language you don’t speak, and the fewer people speak that language, the harder it will be to find help. But if a language is widely spoken yet not a translation hub, it may be difficult to find the right kind of help.

For instance, machine translations between the various modern vernaculars of Arabic and English are not very good at present. I could find someone who speaks any given vernacular Arabic without too much difficulty, but I’d probably have to pay them a lot of money to get them to translate 45,000 Arabic documents into English, or even just to tell me which of those documents were in their vernacular. (That number happens to be how many documents in my research database were identified as some kind of Arabic by a machine classifier—and even that doesn’t work as well as it could; for instance, it is quite likely to be confused by various Central Asian languages that can be written with an Arabic-derived script and have a number of Arabic loanwords but are otherwise unrelated.)
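
One cheap partial mitigation is to keep the classifier’s confidence scores rather than just its top label. Here is a sketch using fastText’s off-the-shelf lid.176 language-ID model; the model choice, the set of confusable language codes, and the threshold are my assumptions for illustration, not a description of my actual pipeline:

```python
import fasttext

# fastText's pre-trained language-ID model; download lid.176.bin
# from the fastText site first.
model = fasttext.load_model("lid.176.bin")

def shaky_arabic(text, min_confidence=0.8):
    """True if `text` is labeled Arabic but with low confidence, or
    with another Arabic-script language as a close runner-up -- the
    kind of confusion described above. Threshold needs tuning."""
    labels, probs = model.predict(text.replace("\n", " "), k=3)
    top = labels[0].removeprefix("__label__")
    if top != "ar":
        return False
    runners_up = {l.removeprefix("__label__") for l in labels[1:]}
    # Persian, Urdu, Pashto, Uyghur, Sorani Kurdish: all written in
    # Arabic-derived scripts, none closely related to Arabic.
    confusable = {"fa", "ur", "ps", "ug", "ckb"}
    return probs[0] < min_confidence or bool(runners_up & confusable)
```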

What can we (Western researchers, communicating primarily in English) do about it? The first step is simply to be aware that global field studies conducted by Anglophones are going to be weak when it comes to languages poorly coupled to English, even when those languages are widely spoken. Indeed, that poor coupling makes me skeptical of this paper’s results for those very languages. The authors looked at only three datasets, all of them quite English-centric. Would it not have made sense to supplement them with polylingual resources centered on, say, Mandarin Chinese, Modern Standard Arabic, Hindi, and Swahili? Such resources might be difficult to find, but not being able to find them would tend to confirm the original result; and if you could find them, you could both improve the lower bounds for coupling to English and get a finer-grained look at the languages that are well-translated within those clusters.

Down the road, it seems to me that whenever you see a language cluster that’s widely spoken but not much in communication with other languages, you’ve identified a gap in translation resources and cultural cross-pollination, and possibly an underserved market. Scanlation makes a good example: the lack of officially licensed translations of comic books (mostly of Japanese-language manga into English) spawned an entire subculture of unofficial translators, and that subculture is partially responsible for increasing mainstream interest in the material to the point where official translations have become more likely to happen.