The Open House Project from The Sunlight Foundation

More on Word Likeliness Comparisons

March 9th, 2008 by John Wonderlich

I wrote a post earlier about brainstorming models for language processing, and I’d like to clarify the content.

All words or entities usable online constitute a very large set. Any given body of words (a speech, blog post, news article, or book) contains a subset of that set, with each word occurring a specific number of times. One should easily be able to generate an alphabetical list of all the words used in a given article, along with the number of times each word appears in it. Merely computing this information wouldn’t be particularly interesting. To use word-usage counts to say something interesting about a given text, you’d need some sort of baseline to compare the usage against, to filter out the noise and see which words are used more often than would be expected.
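As a rough sketch of that first step, the alphabetical list of word counts might be computed like this in Python. The function names and the sample sentence are invented for illustration, and the definition of a "word" (a run of letters, optionally with internal apostrophes, compared case-insensitively) is an assumption rather than anything the post settles.

```python
import re
from collections import Counter


def word_counts(text):
    """Count how many times each word occurs in text, ignoring case."""
    # Assumption: a word is a run of letters, optionally containing
    # internal apostrophes (so "didn't" stays one word).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)


def alphabetical_listing(counts):
    """Return (word, count) pairs sorted alphabetically by word."""
    return sorted(counts.items())


# Hypothetical sample text, only for demonstration.
sample = "The cat saw the dog. The dog didn't see the cat."

for word, n in alphabetical_listing(word_counts(sample)):
    print(word, n)
```

A real version would need to decide how to treat numbers, hyphenated words, curly apostrophes, and proper names, all of which this sketch glosses over.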

There is no ultimate standard that sets how often words are expected to be used, so an artificial comparison is necessary in order to identify anything interesting about how often a text (or blog, or politician, or whatever) uses particular words. A baseline of normalized usage would have to be established first, perhaps by computing a usage-likelihood score for each word across some chosen body of text. (I understand similar things exist for phonemes and letters.)

For example, one could take all fifth grade reading books in the US, or all New York Times front page articles, and make a database of the text. Every unique word could get its own entry (row?) in the database, with the number of times that it appears associated with it. With just this information, one could calculate the usage likelihood for every word within this defined context. You could define the usage-likelihood as perhaps “the number of times a word appears divided by the total number of words in the corpus”, and that would give a quantitative linguistic definition for a social context. Another way of expressing word-usage-likeliness would be, for example: In The New York Times, one can expect one out of every 138 words to be “Bush”, one out of 27 to be “the”, and one out of 655,987 words to be “Wonderlich”. (I made those up.)
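To make the arithmetic concrete, here is a minimal sketch of the usage-likelihood calculation, taking likelihood to mean a word's count divided by the total word count of the corpus. The counts are made up (in the same spirit as the numbers above), and `usage_likelihood` and `one_in_n` are hypothetical helper names.

```python
from collections import Counter


def usage_likelihood(counts):
    """Map each word to its count divided by the total words in the corpus."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def one_in_n(likelihood):
    """Express a likelihood as 'one out of every N words', rounded."""
    return round(1 / likelihood)


# Made-up counts for a tiny pretend corpus of 100 words.
corpus_counts = Counter({"the": 60, "bush": 25, "senate": 14, "wonderlich": 1})

likelihood = usage_likelihood(corpus_counts)
for word, p in sorted(likelihood.items()):
    print(f"{word}: {p:.2f} (about one out of every {one_in_n(p)} words)")
```

Because every count is divided by the same total, the likelihoods for a corpus sum to 1, which is what makes corpora of very different sizes comparable.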

After defining (and normalizing) the likelihood that words appear in text, you could start making comparisons between bodies of work, and creating interesting tag-cloudish visualizations of what distinguishes some text you’d like to analyze. You could build a widget for your blog that says “the following are the words that are more than 25% more likely to be used on this blog than they are to be used in New York Times cover stories”, or, “here are recent news stories that also use similarly unlikely words.”
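The "more than 25% more likely" widget could be sketched roughly as follows, assuming usage likelihoods have already been computed for both the blog and the baseline. All the numbers are invented, `overused_words` is a hypothetical name, and skipping words that never appear in the baseline is a simplifying assumption.

```python
def overused_words(text_likelihood, baseline_likelihood, threshold=0.25):
    """Return words whose likelihood in the text exceeds the baseline
    likelihood by more than `threshold` (0.25 means 25% more likely),
    mapped to the ratio of the two likelihoods."""
    result = {}
    for word, p in text_likelihood.items():
        base = baseline_likelihood.get(word)
        if base is None:
            # Assumption: words absent from the baseline are skipped,
            # since a ratio against zero is undefined.
            continue
        if p > base * (1 + threshold):
            result[word] = p / base
    return result


# Invented likelihoods, for illustration only.
blog = {"the": 0.05, "congress": 0.010, "earmark": 0.004}
baseline = {"the": 0.05, "congress": 0.009, "earmark": 0.001}

for word, ratio in overused_words(blog, baseline).items():
    print(f"{word}: about {ratio:.1f}x as likely here as in the baseline")
```

Here "congress" is about 11% more likely on the blog than in the baseline, which falls short of the 25% cutoff, so only "earmark" is reported.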

For any given block of text, you could output a list of the words in it that are most unexpected (again, compared to an artificial standard). This should enable automated calculation of linguistic deviance, which ought to be highly idiosyncratic to each author or source, and could lead to all sorts of other interesting comparisons.
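One possible way to rank the most unexpected words is sketched below. It scores each word by the base-2 logarithm of its likelihood in the text over its likelihood in the baseline, and uses add-one (Laplace) smoothing so that words missing from the baseline still get a finite score. Both choices are assumptions made for this sketch, not something the post prescribes, and the counts and function names are invented.

```python
import math
from collections import Counter


def surprise_scores(text_counts, baseline_counts):
    """Score each word in the text by log2(text likelihood / baseline
    likelihood), with add-one smoothing over the combined vocabulary."""
    vocab = set(text_counts) | set(baseline_counts)
    # Add-one smoothing: every vocabulary word gets one extra count,
    # so each total grows by the vocabulary size.
    text_total = sum(text_counts.values()) + len(vocab)
    base_total = sum(baseline_counts.values()) + len(vocab)
    scores = {}
    for word in text_counts:
        p_text = (text_counts[word] + 1) / text_total
        p_base = (baseline_counts.get(word, 0) + 1) / base_total
        scores[word] = math.log2(p_text / p_base)
    return scores


def most_unexpected(text_counts, baseline_counts, n=3):
    """Return the n words with the highest surprise scores."""
    scores = surprise_scores(text_counts, baseline_counts)
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Invented counts, for illustration only.
text = Counter({"the": 5, "earmark": 3, "bill": 2})
baseline = Counter({"the": 500, "bill": 40, "senate": 60})

print(most_unexpected(text, baseline))
```

A positive score means the word is more likely in the text than in the baseline, and a negative score means it is less likely. With so little text the scores are noisy, so a real tool would probably want a minimum count before reporting a word.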

Tags: openhouseproject
