Would PINQ solve the problems with the Census data?
Frank McSherry, the researcher behind PINQ, has responded to our earlier blog post about the problems found in certain Census datasets and how PINQ might deal with those problems.

Would PINQ solve the problems with the Census data?

No. But it might help in the future.

The immediate problem facing the Census Bureau is that they want to release a small sample of raw data, a Public Use Microdata Sample or PUMS, about 1/20 of the larger dataset they use for their own aggregates, that is supposed to be a statistical sample of the general population. To release that data, the Bureau has to protect the confidentiality of people in the PUMS, and they do so, in part, by manipulating the data. Some of their efforts, though, seem to have altered the data so seriously that it no longer accurately reflects the general population.

PINQ would not solve the immediate problem of allowing the Census Bureau to release a 1/20 sample of their data. PINQ only allows researchers to query for aggregates.

However, if Census data were released behind PINQ, the Bureau would not have to swap or synthesize data to protect privacy; PINQ would do that. Presumably, if the danger of violating confidentiality were removed, the Census could release more than a 1/20 sample of the data. Furthermore, unlike the Bureau’s disclosure avoidance procedures, PINQ is transparent in describing the range of noise that is being added. Currently, the Bureau can’t even tell you what it did to protect privacy without potentially violating it.

The mechanism for accessing data through PINQ, of course, would be very different from what researchers are used to today. Now, with raw data, researchers like to “look at the data” and “fit a line to the data.” A lot of these things can be approximated with PINQ, but most researchers reflexively pull back when asked to rethink how they approach data. There are almost certainly research objectives that cannot be met with PINQ alone.
But the objectives that can be met should not be held back by the unavailability of high-quality statistical information. Researchers able to express how and why their analyses respect privacy should be rewarded with good data, incentivizing creative rethinking of research processes.

With this research published, it may be easier to argue that the choice between PUMS (and other microdata) and PINQ is not a choice between raw data and noisy aggregates, but rather between bad data and noisy aggregates. If and when it becomes a choice between these two, any serious scientist would reject bad data and accept noisy aggregates.
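
For readers who want a concrete picture of what a "noisy aggregate" is, here is a minimal sketch in Python. It is not PINQ's actual interface (PINQ is built on LINQ, and nothing below is taken from its API); it only illustrates the general idea of answering a counting query with a known, published amount of noise, using the standard Laplace mechanism from the differential privacy literature. The function names, the `epsilon` parameter, and the toy records are all assumptions made up for this illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # is Laplace-distributed with location 0 and that same scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng=None):
    """Return the number of records matching `predicate`, plus Laplace noise.

    Hypothetical helper, not part of PINQ. Adding or removing one person
    changes a count by at most 1, so the noise scale is 1 / epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Toy records invented for this example; not Census data.
    people = [{"age": a} for a in (23, 67, 45, 71, 80, 34)]
    answer = noisy_count(people, lambda p: p["age"] >= 65, epsilon=0.5)
    print(f"Noisy count of people 65 and over: {answer:.2f}")
```

The analyst never sees the underlying records, only the perturbed answer, and the noise distribution (here Laplace with scale 1/epsilon) can be stated openly without weakening the protection. That openness is the kind of transparency described above.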