CDP @ Open Knowledge Conference 2010: A Recap
Going into the Open Knowledge Conference, I didn't know what to expect. Grace had read about them earlier, and we hoped we'd find like-minded people and organizations at the conference, but we didn't have any personal contacts or references. As it turned out, the talks and the attendees' interests overlapped significantly with our work. Here are a few highlights:
- Rufus Pollock started things off with some background on the Open Knowledge Foundation, which works to make knowledge in the broadest sense publicly available. That mission extends well beyond our interest in sensitive data. For example, it turns out that a lack of bibliographic information often prevents copyright from expiring in practice: no one can apply the statutes, which often involve calculations based on the author's birth date, death date, and other lesser-known facts, along with lots of rules that vary across already inconsistent jurisdictions.
- Chris Taggart is championing an effort to bring more local government data on-line in the UK.
- Peter Murray-Rust from Cambridge University made a case for sharing data, and for scientists to state clearly to their publishers that they want their data to be openly available (which is apparently another copyright issue). He was involved in the creation of the Panton Principles for Open Data in Science (named after a pub in Cambridge).
- Sören Auer gave a couple of talks on DBpedia.org, which extracts structured data from Wikipedia. Apparently in Germany, for historical reasons, open government data (and open data in general) does not have the public support that it has in the UK and US.
- After chatting with Sören, I got a chance to chat with Hugh Williams of OpenLink Software, another attendee, and learn how DBpedia's 300 million RDF triples are hosted on a single instance of their Virtuoso server, an RDFDB variant. That is possible thanks to 64-bit architectures, something that was not feasible in the early days of RDFDB when I was working at Epinions. I'm curious to learn more about how a MapReduce-type mechanism might sit on top of an RDF store (for a taste of what querying such a store looks like, see the first sketch after this list).
- Jordan Hatcher gave a really interesting talk (a shorter version of his talk from the OSSAT). The takeaway: the way in which we're proposing to "release" sensitive data to the public is more akin to the way online companies use data to drive their services, and less like open government efforts where the data is literally given away. We're never going to actually hand over any data; we're only ever going to provide "noisy" descriptions of the data in response to queries (see the second sketch after this list). This topic deserves its own post, and we'll definitely want to chat with him once we have our thoughts better organized.
- Jeni Tennison gave an interesting talk on the technical/practical challenges of scaling Open Data, which made me think (in relation to Jordan Hatcher's talk) that we should consider a scenario where we allow for distributed storage of data behind the datatrust API, as this may simplify some of the legal constraints that we will run into.
- Thomas Schandl gave a neat demo of Pool Party, which is a nice thesaurus system for managing linked data, and could be useful for managing a datastore with distinct and diverse data sources.
- Stuart Harrison gave a talk on the data that his local UK government (Lichfield District) is releasing to try to engage with the community. They have been able to release a fair bit of data, although the privacy and sensitivity of the data seem to be becoming part of the challenge they face in doing so. It would be interesting to follow up with him as well.
- Victor Henning & Jan Reichelt gave an interesting presentation about Mendeley, a self-proclaimed Last.fm for research papers. It seems to me that they are already running, or will soon run, into interesting questions about who owns the data they collect from their users, as well as expectations around user privacy. Their site says "academic software for research papers", but they seemed to be saying that they would be selling their data in some form.
- Karin Christiansen gave an interesting talk about transparency in international aid. There are real challenges in identifying corruption, spotting redundant aid, and measuring impact, because there's no centralized view of where everyone's aid goes. For example, there are apparently 27 different departments, commissions, and other bodies within the US government disbursing international development aid, and a major donation will change hands 6 times before reaching its intended destination, so tracking the money can be very hard. She is the director of http://publishwhatyoufund.org/, which hopes to address this. An interesting talk and an interesting problem, though I didn't see an immediate relevance to CDP.
- Helen Turvy from the Shuttleworth Foundation made an announcement that to my ears said "if you are involved in making data available to the public somewhere on this planet, we want to help you". Unfortunately I didn't get a chance to chat with her at the conference, but we definitely need to follow up with her. Her characterization of the kinds of projects the Shuttleworth Foundation funds contrasted sharply with other foundations we've looked at: they are happy to support general-purpose "the more data the better" solutions, as opposed to projects that address a specific problem (e.g. homelessness, pollution). As an all-purpose solution for making sensitive data safe for public access, we've been hard-pressed to find funders like Shuttleworth.
- Another idea that came up during the day, possibly more than once (though I'm not sure from where), was that organizations and parts of government are increasingly thinking about making data "open by default" - in order to save money dealing with Freedom of Information Act requests!! (The UK has a similar concept to the US one, by the sound of things.) This is exciting because if the datatrust can provide a cheap way for organizations to meet disclosure obligations, cost might actually help drive adoption.
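A quick aside on Hugh's triple-store numbers: an RDF store holds everything as (subject, predicate, object) triples and answers SPARQL queries over them. Here is a minimal sketch of querying DBpedia's public SPARQL endpoint; it assumes Python and the third-party SPARQLWrapper library, neither of which came up at the conference, and the query itself is purely illustrative:

```python
# Minimal sketch: pull a handful of triples about one DBpedia resource.
# Assumes the third-party SPARQLWrapper library (pip install sparqlwrapper).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
# Each result row is one (predicate, object) pair from a stored triple
# whose subject is the Berlin resource.
sparql.setQuery("""
    SELECT ?predicate ?object
    WHERE { <http://dbpedia.org/resource/Berlin> ?predicate ?object }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["predicate"]["value"], row["object"]["value"])
```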
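And to make the "noisy descriptions" idea from Jordan's talk concrete: the datatrust would never return raw records, only aggregate answers perturbed with calibrated random noise. Below is a minimal sketch of a differentially private count query using the standard Laplace mechanism; the function name, epsilon value, and sample data are mine, purely for illustration, and not our actual design:

```python
import random

def noisy_count(records, predicate, epsilon=0.1):
    """Answer "how many records match predicate?" without exposing records.

    A count query has sensitivity 1: adding or removing a single record
    changes the true answer by at most 1, so adding Laplace(1/epsilon)
    noise gives epsilon-differential privacy for the answer.
    """
    true_count = sum(1 for record in records if predicate(record))
    # The difference of two independent Exponential(epsilon) draws is
    # a Laplace(0, 1/epsilon) sample.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Hypothetical usage: how many people in a sensitive dataset are over 65?
people = [{"age": 34}, {"age": 71}, {"age": 29}, {"age": 80}]
print(noisy_count(people, lambda person: person["age"] > 65))
```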
Finally, my talk went well (many thanks to Mimi and Grace) and the new demo looked great (many thanks there to Tony) - we'll have a post up on the new demo shortly. The fact that we were talking about releasing sensitive data made us fairly unique at the conference, and, to many, very interesting for future stages of open data initiatives.

I got a chance to chat with several different people running into sensitive data disclosure challenges, most of whom today face an all-or-nothing decision point: some governing body ends up deciding whether the data in question can be disclosed or not. Allowing a differential-privacy-style analysis of the data, with no actual records being disclosed, is not part of the discussion. As a result, valuable data is not being opened up for reasons that we hope to soon show are no longer technically valid.

To fellow OKCon folks: we look forward to being a more active part of the community, and to bringing more attention to sensitive data scenarios! As I said during my short talk, if you have an interesting sensitive data sharing scenario, please contact us so we can see if our work can be of use to you.