QUESTIONS WE ASKED

2. How do they define “personal information”?

Most privacy certification programs, like TRUSTe, require that the privacy policy identify what kinds of personally identifiable information (PII) are being collected. As a result, nearly every privacy policy we looked at included a long list of the types of information being collected.

Many of the companies we surveyed then categorize the information they collect into 1) “personal information” that you provide, such as name and email address, often when you sign up for an account; and 2) cookie and log data, including IP address, browser type, browser language, web request, and page views.

When the first category is called “personal” information, the second category implicitly becomes “not-personal” information. But the queries we put into search engines are obviously intensely personal.

So are our purchase histories on Amazon, as well as an IP address that can link a certain set of activities to a specific computer.

Yahoo! and Amazon go the extra step of labeling cookie and log data “automatic information,” giving it a ring of inevitability. Ask Network calls this information “limited information that your browser makes available whenever you visit any website.” Wikipedia similarly states, “When a visitor requests or reads a page, or sends email to a Wikimedia server, no more information is collected than is typically collected by web sites.”

There are companies that do define “personal information” much more broadly. EBay’s definition includes “computer and connection information, statistics on page views, traffic to and from the sites, ad data, IP address and standard web log information” and “information from other companies, such as demographic and navigation information.” AOL states that its AOL Network Information may include “personally identifiable information” that includes “information about your visits to AOL Network Web sites and pages, and your responses to the offerings and advertisements presented on these Web sites and pages” and “information about the searches you perform through the AOL Network and how you use the results of those searches.”

And there are websites that don’t collect information at all: Ixquick and Cuil, the search engines that have been trying to build a brand around privacy. These companies have decided to define “personal” in a rather different way, and in order to protect what is personal, they have chosen not to record any IP addresses. Ixquick deletes log data after 48 hours.

We don’t support deleting IP addresses and log data as quickly as possible as a way to protect privacy. We seek privacy solutions that preserve the value of data, because we believe that more information is always better than less. But we as a society can’t have a thoughtful discussion about what it takes to balance privacy rights against the value of data if companies aren’t honest about how “personal” cookie and log data can be.

Some companies do acknowledge that information that they don’t consider “personal” could become personally identifying if it were to be combined with other data. Microsoft therefore promises to “store page views, clicks and search terms…separately from your contact information or other data that directly identifies you (such as your name, email address, etc.). Further we have built in technological and process safeguards to prevent the unauthorized correlation of this data.” Similarly, WebMD makes this promise: “we do not link non-personal information from Cookies to personally identifiable information without your permission and do not use Cookies to collect or store Personal Health Information about you.” WebMD further states that data warehouses it contracts with are required to agree that they “not attempt to make this information personally identifiable, such as by combining it with other databases.”

The other companies, however, provide very little explanation of what data combination implies for privacy.

When data is combined, many data sets that initially appear to be anonymous or “non-personally identifiable” can become de-anonymized.

Researchers at the University of Texas in recent years have demonstrated that it is possible to de-anonymize data through combination, as when Netflix data is combined with IMDb ratings, or when Twitter is combined with Flickr. So when companies offhandedly note that they are combining information they collect from different sources, they are learning a great deal more about individual people than the average user would imagine. And as you might imagine, large companies like Microsoft, Google, and Yahoo! have a wealth of databases at their disposal, but none of this is being made explicit in the policies.
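
To make the idea of de-anonymization through combination concrete, here is a minimal sketch of a linkage between two data sets. It is a toy illustration, not the method the researchers used: the records, the names, the `link_records` function, and the `min_matches` threshold are all hypothetical, and the matching rule (exact title and date) is far cruder than a real attack would need.

```python
from collections import defaultdict

# Hypothetical "anonymized" release: pseudonymous IDs, no names.
anonymous_ratings = [
    ("user_17", "Movie A", "2008-03-01"),
    ("user_17", "Movie B", "2008-03-04"),
    ("user_17", "Movie C", "2008-03-09"),
    ("user_42", "Movie A", "2008-05-11"),
    ("user_42", "Movie D", "2008-05-12"),
]

# Hypothetical public reviews posted under real names on another site.
public_reviews = [
    ("Alice Example", "Movie A", "2008-03-01"),
    ("Alice Example", "Movie C", "2008-03-09"),
    ("Bob Example", "Movie D", "2008-05-12"),
]


def link_records(anonymous, public, min_matches=2):
    """Map each pseudonym to the public name it overlaps with most,
    keeping only links backed by at least min_matches shared
    (title, date) pairs."""
    # Index the public records by (title, date).
    public_by_key = defaultdict(set)
    for name, title, date in public:
        public_by_key[(title, date)].add(name)

    # Count how many records each pseudonym shares with each name.
    overlap = defaultdict(lambda: defaultdict(int))
    for pseudonym, title, date in anonymous:
        for name in public_by_key.get((title, date), ()):
            overlap[pseudonym][name] += 1

    # Keep the best-scoring name per pseudonym if it clears the threshold.
    links = {}
    for pseudonym, counts in overlap.items():
        name, score = max(counts.items(), key=lambda item: item[1])
        if score >= min_matches:
            links[pseudonym] = name
    return links


links = link_records(anonymous_ratings, public_reviews)
print(links)
```

Neither data set contains a name next to a pseudonym, yet two coinciding records are enough here to tie `user_17` to a named person, which also exposes the third rating that person never posted publicly. Lowering `min_matches` produces more links at the cost of more false matches.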

Questions we asked of each company.

  1. What data collection is happening that is not covered by the privacy policy?
  2. How do they define “personal information”?
  3. What promises are being made about sharing information with third parties?
  4. What is their data retention policy and what does it say about their commitment to privacy?
  5. What privacy choices do they offer to the user?
  6. What input do users have into changes to the policy’s terms?
  7. To what extent do they share the data they collect with users and the public?
