The Open House Project from The Sunlight Foundation

Mash-ups for government transparency

January 25th, 2007 by Joshua Tauberer · 6 Comments

A few years ago I launched GovTrack.us. I didn’t think of it this way at the time, but these days you might call it a mash-up of data about the U.S. Congress. Back then, all I had in mind was collecting information about Congress from various sources (THOMAS, the Senate website, and the House website) and cross-referencing and hyperlinking the data in a way that no one had done yet. In fact, it was the huge amount of public data on the status of legislation made available through THOMAS (as I understand it, thanks to the Republican takeover in 1994) that inspired me to try to put the data to new uses. It started with daily email updates on what your congressmen were up to, generated automatically by grabbing data from THOMAS and, effectively, transforming it into a customized update for anyone who wanted one.

The trouble with building GovTrack is that one has to do a bit of friendly reverse-engineering. The information is all “out there”, meant for public consumption, but not in a form that is easy to transform into other formats for other uses, like the email updates, RSS feeds, and cross-referenced pages. The problem is this: while people have no trouble browsing and searching THOMAS (for instance) for the information they need, getting a computer to do the same thing automatically is quite difficult. To take an example, if I want my computer to automatically fetch a list of all bills that were acted on the previous day (and in fact this is something GovTrack does), I would write a program that fetches the Daily Digest in the Congressional Record from THOMAS, which has bullets like this:

“Eleven bills and one resolution were introduced, as follows: S. 360-370 and S. Res. 37.”

I have no trouble understanding that. But, well, let me say as someone studying linguistics and natural language processing, computers are a long way from being able to understand English prose as well as people, nay as well as three-year-olds. Was the bill S. 365 introduced yesterday? Yes, of course — even though it was not mentioned explicitly (it’s merely in the range 360-370), and that’s just the first problem for a computer trying to make heads or tails of this information. So what’s a programmer to do?
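
To make the difficulty concrete, here is a minimal Python sketch of what a program has to do just to recover the individual bills from that one bullet. The regular expression, the short list of prefixes, and the function name expand_designations are assumptions made up for illustration; this is not GovTrack's actual code, and real Daily Digest text varies far more than this pattern allows.

```python
import re

# Hypothetical pattern: a bill-type prefix, a number, and an optional "-end"
# range. Longer prefixes come first so "S. Res." is not mistaken for "S.".
DESIGNATION = re.compile(r"\b(S\. Res\.|H\. Res\.|H\.R\.|S\.)\s*(\d+)(?:-(\d+))?")


def expand_designations(sentence):
    """Return one (prefix, number) pair per bill, expanding ranges like 360-370."""
    bills = []
    for prefix, start, end in DESIGNATION.findall(sentence):
        first = int(start)
        last = int(end) if end else first  # no range given: a single bill
        for number in range(first, last + 1):
            bills.append((prefix, number))
    return bills


digest_line = ("Eleven bills and one resolution were introduced, "
               "as follows: S. 360-370 and S. Res. 37.")
bills = expand_designations(digest_line)
print(("S.", 365) in bills)  # S. 365 is only implied by the range
```

Even this small sketch is brittle: it knows nothing about other bill types, about lists such as "S. 360, 362", or about a sentence that merely mentions a bill without introducing it.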

Let’s go back to the goal of this. I certainly don’t think it’s necessarily the government’s job to provide email updates, RSS feeds, Google Calendar integration of events, and whatever the latest technology hit happens to be. There are a million and one things that one can do with information about the status of legislation, and someone will want each of them. So the question is this: How can the government, and Congress in particular, publish information about what it is doing in a way that makes it easy for others to put the information to new uses?

To be concrete again, because it’s always good to be concrete: How can THOMAS publish a list of bills that were acted on in a purpose-neutral way, a way that makes it easy for programmers to go and write applications to take the information and do anything with it that someone might want?

This is a question that I’ll probably blog about more than once on this site in the next few months. The answer is what’s called structured (or “machine-readable”) data, and it comes down to publishing information twice: once for humans clicking away at links, and once in boring, explicit tables meant for computer applications to transform into different formats. But more on that later.
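
As a rough illustration of "publishing twice", the Python sketch below renders the same records once as a sentence for people and once as a plain CSV table for programs. The field names (chamber, type, number, action) and both function names are hypothetical, chosen only for this example; they are not a format that THOMAS or Congress actually publishes.

```python
import csv
import io

FIELDS = ["chamber", "type", "number", "action"]

# Hypothetical records for the Daily Digest bullet quoted earlier:
# S. 360 through S. 370, plus S. Res. 37.
INTRODUCED = [
    {"chamber": "senate", "type": "bill", "number": n, "action": "introduced"}
    for n in range(360, 371)
] + [
    {"chamber": "senate", "type": "resolution", "number": 37, "action": "introduced"}
]


def render_for_people(records):
    """Summarize the records in a sentence, as a web page might."""
    bills = sum(1 for r in records if r["type"] == "bill")
    resolutions = sum(1 for r in records if r["type"] == "resolution")
    return f"{bills} bills and {resolutions} resolution(s) were introduced."


def render_for_programs(records):
    """Write one explicit row per item, with nothing left implied by a range."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


print(render_for_people(INTRODUCED))
print(render_for_programs(INTRODUCED))
```

The table is the boring, explicit half: S. 365 gets its own row, so a program reading it never has to interpret "360-370".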

Tags: OpenHouse · Structured Data

6 responses so far ↓

  • Joshua Tauberer’s Blog » Blog Archive » The Open House Project // Feb 8, 2007 at 10:01 pm

    [...] Mash-ups for government transparency [...]

  • bolson // Feb 12, 2007 at 7:51 pm

    I made my own little reprocessing of the Senate and House data at http://bolson.org/gov/us/senate/ and http://bolson.org/gov/us/house/

    Mostly I wanted to reorganize it into voting record by member, rather than by vote, to get a quick view of what my rep and senators had done.

    XML.house.gov has done some good stuff, but seems to me to suffer from a common malady of moving to XML: defining lots of schema without a clear idea of what data to store and why, what questions it should answer, and what specific uses it should serve. I guess on the plus side, if they just store and publish all the common data they generate, we’ll be able to digest it into something useful. On the down side, it’ll become baroque and hard to manage – like the Census data – and require a lot of setup time to write code to parse the data and get to the part you care about.

  • Joshua Tauberer // Feb 14, 2007 at 6:37 am

    Afaik, the definitions at xml.house.gov were designed strictly for a representation of the text of legislation in XML, and for that purpose what they did seemed pretty well done to me. It’s in active use by the House (http://thomas.loc.gov/home/gpoxmlc110/) and is apparently starting to be used by the Senate (but I don’t think any XML files from the Senate have been posted publicly anywhere yet).

    (As for the Census data, I plan to release an RDF version of the 2000 Census in the coming weeks, which should in principle lower the barrier to using the data, provided semantic web/RDF tools continue to develop.)

  • Bill Versioning: Unintended consequences of data openness | The Open House Project // Feb 14, 2007 at 7:35 am

    [...] This really goes to the same point that I blogged about last time, which is that what it means to be on the Web in 2007 is to publish things in parallel: once for people to view within their web browser as they surf the web (the PDFs, GPO formatted text, and THOMAS HTML formatted text), and again (the THOMAS HTML text as a proxy for the original plain-text files) for computers to mix and match to provide people with useful new ways to appreciate the same information. [...]

  • States are leading the way with downloadable legislative databases | The Open House Project // Feb 22, 2007 at 6:32 pm

    [...] I’ve blogged here before (1, 2) about how publishing raw, structured data that can be processed by computers can have unpredictable benefits, and I feel strongly that Congress should provide a raw database download of the status of all legislation. (They have the database already; it’s what powers THOMAS.) I didn’t realize, though, that a number of state legislatures are already leading the way in this regard. [...]

  • Legislative Databases recommendation makes it to House Leg Branch Appropriations markup | The Open House Project // Jul 14, 2008 at 1:43 pm

    [...] (Other links: last year’s leg branch appropriations blog post, my first or one of my first posts here about structured data) [...]
