Sparse Overcomplete Word Vector Representations | Readings in Information Security

It’s time for another of my occasional looks at a paper that doesn’t have anything to do with security. This one is about an attempt to bridge a gap between two modes of analysis in computational linguistics.

In a word vector analysis (also referred to as a distributed word representation in this paper) one maps words onto vectors in an abstract high-dimensional space, defined by the co-occurrence probabilities of each pair of words within some collection of documents (a corpus). This can be done entirely automatically with any source of documents; no manual preprocessing is required. Stock machine-learning techniques, applied to these vectors, perform surprisingly well on a variety of downstream tasks—classification of the words, for instance. However, the vectors are meaningless to humans, so it’s hard to use them in theoretical work that requires interpretability, or to combine them with other information (e.g. sentence structure). Lexical semantics, by contrast, relies on manual, word-by-word tagging with human-meaningful categories (part of speech, sense of word, role in sentence, etc) which is slow and expensive, but the results are much easier to use as a basis for a wide variety of further studies.

The paper proposes a technique for transforming a set of word vectors to make them more interpretable. It’s essentially the opposite of PCA. They project the vectors into a higher-dimensional space, one in which they are all sparse (concretely, more than 90% of the components of each vector are zero). Words that are semantically related will (they claim) share nonzero components of these vectors, so each component has more meaning than in the original space. The projection matrix can be generated by standard mathematical optimization techniques (specifically, gradient descent with some tweaks to ensure convergence).

To back up the claim that the sparse vectors are more interpretable, they first show that five stock downstream classification tasks achieve an average of 5% higher accuracy when fed sparse vectors than the dense vectors from which they were derived, and then that humans achieve 10% higher accuracy on a word intrusion task (one of these five words does not belong in the list, which is it?) when the words are selected by their rank along one dimension of the sparse vectors, than when they are selected the same way from the dense vectors. An example would probably help here: Table 6 of the paper quotes the top-ranked words from five dimensions of the sparse and the dense vectors:

Dense Sparse

combat, guard, honor, bow, trim, naval fracture, breathing, wound, tissue, relief

’ll, could, faced, lacking, seriously, scored relationships, connections, identity, relations

see, n’t, recommended, depending, part files, bills, titles, collections, poems, songs

due, positive, equal, focus, respect, better naval, industrial, technological, marine

sergeant, comments, critics, she, videos stadium, belt, championship, toll, ride, coach

Dense	Sparse
combat, guard, honor, bow, trim, naval	fracture, breathing, wound, tissue, relief
’ll, could, faced, lacking, seriously, scored	relationships, connections, identity, relations
see, n’t, recommended, depending, part	files, bills, titles, collections, poems, songs
due, positive, equal, focus, respect, better	naval, industrial, technological, marine
sergeant, comments, critics, she, videos	stadium, belt, championship, toll, ride, coach

Neither type of list is what you might call an ideal Platonic category,¹ but it should be clear that the sparse lists have more meaning in them.

Because I’m venturing pretty far out of my field, it’s hard for me to tell how significant this paper is; I don’t have any basis for comparison. This is, in fact, the big thing missing from this paper: a comparison to other techniques for doing the same thing, if any. Perhaps the point is that there aren’t any, but I didn’t see them say so. I am also unclear on how you would apply this technique to anything other than an analysis of words. For instance, my own research right now involves (attempting to) mechanically assign topics to entire documents. Right now we’re reducing each document to a bag of words, carrying out LDA on the bags, and then manually labeling each LDA cluster with a topic. Could we use bags of sparse word vectors instead? Would that help LDA do its job better? Or is LDA already doing what this does? I don’t know the answers to these questions.

¹ If you are under the impression that categories in natural language are even vaguely Platonic, go at once to your friendly local public library and request a copy of Women, Fire, and Dangerous Things.