The Open House Project from The Sunlight Foundation

Language Processing Brainstorm

March 9th, 2008 by John Wonderlich

Could one first determine how often each word occurs in different contexts, based on historical usage? One baseline might be the English lexicon (every word treated as equally likely); another might be the text of NYT front page stories (where “Bush” or “Russia” would have a much higher likelihood of mention than “domino” or “recalibration”). Specific social contexts could each be assigned a profile of lexical probability, a sort of tag cloud of word usage.
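To make the idea of a lexical probability profile concrete, here is a minimal sketch in Python. The function name word_profile, the crude tokenizing rule (lowercase runs of letters and apostrophes), and the sample sentence are all my own assumptions for illustration, not an established method.

```python
import re
from collections import Counter

def word_profile(text):
    """Map each lowercase word in text to its share of all words in text."""
    # Assumption: a "word" is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}

# Hypothetical sample text, standing in for a real corpus.
sample = "Bush met Russia. Russia met Bush. Bush spoke."
profile = word_profile(sample)

# List the words from most to least likely in this sample.
for word, share in sorted(profile.items(), key=lambda item: -item[1]):
    print(f"{word}: {share:.3f}")
```

A profile like this, built from a large enough body of text, is what each context (a newspaper, a blog, a politician's speeches) would be reduced to before any comparison.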

Seeing which words come up more often than expected would quickly give a very nuanced view of a data set’s verbal identity, and comparing it to pre-defined norms would filter out the natural noise expected from prepositions, articles, and other frequently used words. In other words, any baseline definition of the likelihood of certain words coming up would probably be sufficient to identify the working identity of a different body of text.

The closest popular proxy for this kind of data-level lexical analysis is the analysis of search terms, or of things tagged online, but I don’t see any reason this couldn’t be done with text more broadly. A few samples of word-use likelihood would allow for a quick and dirty glimpse of the terms a Web site, blog, politician, Title of the US Code, or newspaper deals with, in comparison to some well-known baselines. (The results would look like this: www.theopenhouseproject.com is more than 25% more likely than the NYT to use the following words: (word list), and far less likely to use ____.)
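The "more than 25% more likely" comparison could be sketched roughly as follows, again in Python. Everything here is an assumption for illustration: the function names, the toy texts, and especially the add-one smoothing, which I use only so that words missing from one text do not cause division by zero.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase words (runs of letters or apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def likelihood_ratios(target_text, baseline_text):
    """For each word, how many times more likely it is in target than in baseline."""
    target = Counter(tokenize(target_text))
    baseline = Counter(tokenize(baseline_text))
    vocab = set(target) | set(baseline)
    # Add-one smoothing (an assumption): every word gets one extra count in both texts.
    target_total = sum(target.values()) + len(vocab)
    baseline_total = sum(baseline.values()) + len(vocab)
    return {
        word: ((target[word] + 1) / target_total)
        / ((baseline[word] + 1) / baseline_total)
        for word in vocab
    }

def more_likely_words(target_text, baseline_text, threshold=1.25):
    """Words whose likelihood ratio exceeds threshold, highest ratio first."""
    ratios = likelihood_ratios(target_text, baseline_text)
    hits = [word for word, ratio in ratios.items() if ratio > threshold]
    return sorted(hits, key=lambda word: -ratios[word])

# Toy texts standing in for a site and a baseline such as a newspaper.
site_text = "the house votes on the open data bill the house"
baseline_text = "the weather in the city was the same as the day before"
distinctive = more_likely_words(site_text, baseline_text)
print(distinctive)
```

Because “the” is common in both toy texts, its ratio stays below the threshold and it drops out of the list, which is the noise filtering a baseline is meant to provide. With real corpora, the smoothing and the threshold would both need tuning.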

Has word-use frequency been used in this way to infer the identity and focus of bodies of work? How difficult would it be to write a program that counts the number of times words are used in specific contexts, and then compares them?

Tags: openhouseproject
