Why do we need a datatrust? Part II
In my first post on available public data sets, I described some of the limitations of Data.gov and the U.S. Census website. There's not as much as you'd like on Data.gov, and the Census site is shockingly tiresome to navigate.
Other government agencies, though, do things a little differently, albeit with varying degrees of success.
3. The Internal Revenue Service: They take so much, yet give so little.
The IRS website, compared to the Census site, is very well organized and easy to follow. Where the Census site feels like people have kept adding bits and pieces over the years, the IRS site feels like a cohesive whole. A small link for "Tax Stats" on the home page takes you here, where the data is neatly categorized by type of taxpayer and tax form. The IRS is statutorily required to provide statistics, but only the Office of Tax Analysis in the Secretary of the Treasury's Office and the Congressional Joint Committee on Taxation are allowed to receive detailed tax return files. Other agencies and individuals may receive information only in aggregate, a privacy protection that is also statutorily required. The information the IRS does provide is crunched by the Statistics of Income Program (SOI), which calculates statistics from a sample of 500,000 out of 200 million tax returns.
The IRS obviously has access to a wealth of information, and it's published some interesting numbers. One that I found particularly interesting was this table on adjusted gross income for the top 400 returns.
As you can see, the cut-off AGI for the top 400 returns has gone from $24,421,000 in 1992 to $86,380,000 in 2000. Capital gains as a percentage of AGI has gone from 33% in 1992 to 64% in 2000. The average tax rate has gone from 26.4% to 22.3%. All very interesting, useful data from which one can draw a range of conclusions or start new research.
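To put those three changes on a common footing, here is a minimal sketch in Python that works only from the figures quoted above. The variable and function names are my own, and the dollar amounts are used exactly as quoted, with no inflation adjustment, since the post doesn't say whether the table's figures are in constant dollars.

```python
# Figures for the top 400 returns, as quoted in this post (1992 vs. 2000).
# Assumption: dollar amounts are used as quoted, with no inflation adjustment.
cutoff_agi = {1992: 24_421_000, 2000: 86_380_000}  # cut-off AGI in dollars
cap_gains_share = {1992: 0.33, 2000: 0.64}         # capital gains as a share of AGI
avg_tax_rate = {1992: 0.264, 2000: 0.223}          # average tax rate


def growth_multiple(series, start, end):
    """Return how many times larger the end value is than the start value."""
    return series[end] / series[start]


def point_change(series, start, end):
    """Return the change between two shares, in percentage points."""
    return (series[end] - series[start]) * 100


agi_multiple = growth_multiple(cutoff_agi, 1992, 2000)
cap_gains_points = point_change(cap_gains_share, 1992, 2000)
tax_rate_points = point_change(avg_tax_rate, 1992, 2000)

print(f"Cut-off AGI multiple, 1992 to 2000: {agi_multiple:.2f}x")
print(f"Capital gains share change: {cap_gains_points:+.1f} points")
print(f"Average tax rate change: {tax_rate_points:+.1f} points")
```

In other words, the cut-off AGI grew to roughly three and a half times its 1992 level while the average tax rate fell by about four percentage points. That is about as far as the published aggregates alone will take you.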
But there's a lot we don't know.
- How has data changed from 2000 to now?
- How might the returns correlate with specific changes in legislation?
- How do the trends in the top 400 returns compare to the bottom 400?
Not to mention any other questions we might have about the underlying microdata. The SOI program is clearly doing a great deal of work calculating and packaging data to be "anonymous" for the public, but no one else gets to play with that data themselves, and data on something like the top 400 returns ends up being almost ten years old. Tax policy is one of the most significant ways in which the U.S. government seeks to shape American society -- it's why we have tax credits and deductions for mortgage interest payments for homeowners but nothing equivalent for renters. Yet we, the public, don't have access to data that would help us determine if the way we are being taxed is actually shaping our society in the ways we want.
4. Agency for Healthcare Research & Quality, Medical Expenditure Panel Survey (MEPS): Fascinating Data in a More Flexible Format
The Agency for Healthcare Research & Quality (AHRQ) collects precisely the kind of data we're all struggling to understand as Congress proposes healthcare reform. The Medical Expenditure Panel Survey collects data on the specific health services that Americans use, how frequently they use them, the cost of these services, and how they are paid for, as well as data on the cost, scope, and breadth of health insurance held by and available to U.S. workers. The data AHRQ provides is much more flexible than IRS data, as you can use MEPSnet to create your own tables and statistics.
But that doesn't mean you can ask a question like, “How much are single people aged 25-45 paying for health insurance in Miami?” or “How much is reasonable to pay for XYZ procedure in Minneapolis?” I assume MEPSnet is useful for researchers who are skilled at working with data, but it's not a real option for ordinary, interested individuals who are looking for some quantitative, data-driven answers to important questions.
MEPS also includes data that isn't publicly released for reasons of confidentiality. To access that data, you must be a qualified researcher and travel to a data center.
5. EPA: Great Tools for Personalized Queries if You Don't Need Personal Information
The EPA's site, in many ways, is what I imagine a truly transparent, user-focused agency site could be like. It has much more microdata available, and much more consumer-oriented search is possible. For example, MyEnvironment allows you to type in your ZIP code and get a cross-section of many of the EPA's datasets all at once.
There are also some mechanisms for inputting data, such as reporting violations, which makes the EPA one of the few agencies I've seen where data doesn't only flow in one direction.
But the reason so much data can be made available, in such searchable ways, is that the vast majority of the EPA's microdata is not "personal." They're measures of things like air quality and locations of regulated facilities. They don't have to worry about revealing personal tax information or personal medical expenditure information. We'd love to see if similar data tools could be created for more sensitive data if better guarantees could be made around privacy than exist today.
Our dreams for data
So what would we love to see?
- More “queryable” data: we’ll be able to ask the questions we want to ask, rather than accept the aggregates and statistics as presented.
- More microdata available more quickly: we’ll get to analyze actual responses to surveys and not wait for the microdata to be “scrubbed” for privacy reasons.
- More longitudinal data available: we’ll be able to do more studies of the same subjects over time and make more of it easily available to the public, rather than only in locked-up data centers.
- More centralized, accessible data: we’ll be able to go to one place and immediately see and access a lot of data.
- More user-friendly data: we, as ordinary citizens, will be able to get data-specific answers to important, personal questions.
As I've stated previously, I don't mean to pooh-pooh the data that these agencies and others have made available. It takes a great deal of time, effort, and resources to make this kind of data available, especially if you have to clean it up (i.e., make it "private" for public consumption), which is why it's such a big deal when a government, whether federal, state, or local, makes a real commitment to making data available. We at the Common Data Project are working on a datatrust because we think certain technologies could reduce the costs of making data available by making privacy something more measurable and guaranteeable.
We may not be able to make all our data dreams come true immediately, but we definitely don't want to let up on the push for better data.