On the question of "Discernibility"
[Image: Where's Waldo?]

In my last post about PINQ and meaningful privacy guarantees, we defined "privacy guarantee" as a guarantee that the presence or absence of a single record will not be discernible.

Sounds reasonable enough, until you ask yourself: what exactly do we mean by "discernible"? And by "exactly," I mean "quantitatively": what do we mean by "discernible"? After all, differential privacy's central value proposition is that it's going to bring quantifiable, accountable math to bear on privacy, an area of policy that has heretofore been largely preoccupied with placing limitations on collecting and storing data, fine-print legalese, and bald-faced marketing.

However, PINQ (a Microsoft Research implementation of differential privacy we've been working with) doesn't have a built-in mathematical definition of "discernible" either. A human being (a.k.a. one of us) has to supply one.
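For reference, the one quantitative promise differential privacy itself does make is the standard epsilon guarantee: for any two data sets D and D' that differ in a single record, and any set S of possible outputs, a mechanism M must satisfy

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]$$

Notice that this bounds how much any released answer can shift the odds; it says nothing about how much shifting counts as "discernible." That part is on us.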
A human endeavors to come up with a machine definition of discernibility.
At our symposium last fall, we talked about using a legal-ish framework for addressing this very issue of discernibility: Reasonable Suspicion, Probable Cause, Preponderance of the Evidence, Clear and Convincing Evidence, Beyond a Reasonable Doubt.

Even if we decided to use such a framework, we would still need to figure out how these legal concepts translate into something quantifiable that PINQ can work with.
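Just to make that gap concrete, here's a hypothetical translation of those standards into posterior-probability thresholds. The numbers, the LEGAL_STANDARDS table, and the discernible helper are all illustrative assumptions of mine, not settled law and not anything PINQ provides:

```python
# Hypothetical thresholds: the probabilities below are illustrative
# assumptions, not settled law and not anything built into PINQ.
LEGAL_STANDARDS = {
    "reasonable suspicion": 0.25,
    "probable cause": 0.50,
    "preponderance of the evidence": 0.51,
    "clear and convincing evidence": 0.75,
    "beyond a reasonable doubt": 0.95,
}

def discernible(posterior: float, standard: str) -> bool:
    """True if an attacker's posterior belief that a target record
    is in the data set meets the chosen standard of proof."""
    return posterior >= LEGAL_STANDARDS[standard]

print(discernible(0.52, "preponderance of the evidence"))  # True
print(discernible(0.52, "beyond a reasonable doubt"))      # False
```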
"Not Discernible" means seeing 50/50.
My initial reaction when I first started thinking about this problem was that, clearly, discernibility or the lack thereof needed to revolve around some concept of 50/50, as in "odds of," "chances are." Whatever answer you got out of PINQ, you should never get even a hint that any one number was more likely to be the real answer than the numbers to either side of it. (In other words, x and x±1 should be equally likely candidates for "real answerhood.")
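To see how close differential privacy can get to that standard, here's a minimal sketch in Python (PINQ itself is C#; I'm assuming its usual noisy count, which adds Laplace noise with scale 1/epsilon to the true answer). Given a released noisy answer, we can ask how strongly it favors a true count of 1 over a true count of 0:

```python
import math

def laplace_pdf(x: float, mu: float, scale: float) -> float:
    """Density of the Laplace distribution centered at mu."""
    return math.exp(-abs(x - mu) / scale) / (2 * scale)

epsilon = 0.1
scale = 1 / epsilon   # Laplace noise with scale 1/epsilon, as in PINQ's noisy counts
noisy_answer = 0.7    # an example released value

# How much more likely is this observation if the true count is 1 vs. 0?
ratio = laplace_pdf(noisy_answer, 1, scale) / laplace_pdf(noisy_answer, 0, scale)
print(f"likelihood ratio, true=1 vs. true=0: {ratio:.3f}")  # ~1.041

# Differential privacy caps this ratio at e^epsilon (~1.105 here), so no
# single answer can tilt the odds far from 50/50. But the tilt is never
# exactly zero, either.
print(f"worst-case cap: {math.exp(epsilon):.3f}")
```

In other words, a strict 50/50 is unattainable; what epsilon buys you is a hard cap on how far from 50/50 any one answer can move you.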
Testing discernibility with a "Worst-Case Scenario"
I ask a rather "pointed" question about my neighbor, one that essentially amounts to "Is so-and-so in this data set? Yes or no?" without actually naming names (or social security numbers, email addresses, cell phone numbers, or any other unique identifiers). For example: "How many people in this data set of 'people with skeletons in their closet' wear an eye-patch and live in my building?" Ideally, I should walk away with an answer that says,
"You know what, your guess is as good as mine, it is just as likely that the answer is 0, as it is that the answer is 1."
In such a situation, I would be comfortable saying that I have received ZERO ADDITIONAL INFORMATION on the question of a certain eye-patched individual in my building and whether or not he has skeletons in his closet. I may as well have tossed a coin. My pirate neighbor is truly invisible in the data set, if indeed he's in there at all.

Armed with this idea, I set out to understand how this might be implemented with differential privacy...
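As a first step in that exploration, here's a small simulation of the worst-case question above, again in Python rather than PINQ's C#, assuming Laplace noise with scale 1/epsilon and 50/50 prior odds on whether the true count is 0 or 1 (the function names are mine, not PINQ's):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def posterior_present(noisy: float, epsilon: float) -> float:
    """Attacker's posterior that the true count is 1 rather than 0,
    starting from 50/50 prior odds."""
    scale = 1 / epsilon
    like1 = math.exp(-abs(noisy - 1) / scale)
    like0 = math.exp(-abs(noisy) / scale)
    return like1 / (like0 + like1)

epsilon = 0.1
true_count = 1  # suppose the pirate neighbor really is in the data set
noisy = true_count + laplace_noise(1 / epsilon)

print(f"released answer: {noisy:+.2f}")
print(f"posterior that he's in there: {posterior_present(noisy, epsilon):.3f}")
# With epsilon = 0.1, the posterior stays pinned between ~0.475 and ~0.525:
# nearly a coin toss, no matter what answer comes back.
```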