Is this Personally Identifiable Information?

(Gary Weingarden, Privacy Officer & Director IT Security | Published September 2023)

Is this PII? I get this question a lot. Or the similar claim “I removed all the PII.” I regret to inform you, that’s not quite how it works. Don’t get me wrong, I know what you mean, but we’re really not talking about Personally Identifiable Information. In fact, the term “PII” doesn’t appear in many laws. PII is shorthand for what lawyers call covered data–any data that’s subject to a law or contract. PII casts a wider net than you’d expect and includes more than things like name, address, and SSN. 

Bonus Question: Is this origami unicorn from the movie Blade Runner PII?

[Photo: a hand holding an origami unicorn]

Send answers to: gary.weingarden@tufts.edu

Here’s why:

  • Some of our contracts (Data Use Agreements and data licenses) apply to “all customer data,” “the data provided,” or something similar to that effect, without a “PII”-like definition. They don’t distinguish between particular data types, or if they do, they tend to copy statutory definitions.
  • Statutes often define personal information to include information that’s linkable to an individual or household. That’s a big chunk of information, as we’ll discuss. So again, it’s all PII:
    • For example, here’s part of the FERPA definition (added to the FERPA regulations in 2008): “information that, alone or in combination, is linked or linkable to a specific student that would allow a reasonable person in the school community, who does not have personal knowledge of the relevant circumstances, to identify the student with reasonable certainty.”
    • And the EU’s General Data Protection Regulation (adopted eight years later): “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”

For simplicity, I’ll call the concept of “information that’s linked or linkable or related to a person or household,” “PII” for the remainder of this post.

And here are a few other things to think about: 

  • PII is about datasets–not data elements. A lot of fields seem anonymous in isolation–my zip code, maybe. But once a field is included in a database that relates to me, it’s linkable to me, and once it’s linked to other data, the collection can start to identify me. This is sometimes called the “mosaic effect.”
  • Metadata counts. A sandwich isn’t data, but if it’s on my desk, you can infer that it’s my sandwich, and possibly also that I like that kind of sandwich, and that I haven’t eaten it yet–all data about me (and my sandwich). Now what if instead of a sandwich on my desk, it’s my cell phone at a crime scene or my email address in the Ashley Madison data breach?
  • Other data counts. If I’m sharing data with a vendor, they may have other data that they could use to “re-identify” the people in my dataset. Similarly, there have been some embarrassing incidents where published “anonymous” data could be combined with public information to identify the supposedly anonymous. (We see you, AOL and Netflix.) There’s a sketch of how that kind of linkage works after this list.
  • PII is Infectious. Once it’s linked with other data, all the data is PII, and it’s harder than it seems to disinfect either dataset.
  • PII doesn’t require us to put a name to a face. If I can link several attributes or actions to a single person–think of Woodward and Bernstein’s Watergate informant, Deep Throat–that data set is PII about that unnamed person.
  • Definitions count and sometimes conflict. Definitions vary a lot and so do exclusions. Not all exclude public information, for example. 
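
Here’s that linkage sketch: a minimal, hypothetical re-identification in Python with pandas. The column names and the tiny datasets are invented for illustration; the point is that a join on a few quasi-identifiers (zip code, birthdate, gender) can re-attach names to an “anonymous” dataset, which is the shape of the AOL and Netflix incidents.

```python
import pandas as pd

# "Anonymous" study data: names removed, quasi-identifiers kept (hypothetical).
study = pd.DataFrame({
    "zip": ["02155", "02155", "90210"],
    "birthdate": ["1980-03-07", "1991-11-21", "1980-03-07"],
    "gender": ["F", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "hypertension"],
})

# Public records (say, a voter roll) with the same quasi-identifiers plus names.
public = pd.DataFrame({
    "name": ["Alice Adams", "Bob Brown"],
    "zip": ["02155", "02155"],
    "birthdate": ["1980-03-07", "1991-11-21"],
    "gender": ["F", "M"],
})

# The mosaic effect in one line: an inner join re-identifies two of the three rows.
reidentified = study.merge(public, on=["zip", "birthdate", "gender"])
print(reidentified[["name", "diagnosis"]])
```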

In other words, PII doesn’t exist, as such, but the idea of personal information does. 

What’s the opposite of PII?

I said PII wasn’t a thing. How do you think this is going to go? Okay, so maybe there isn’t an “opposite,” but can we figure out what isn’t PII?

The usual exceptions to PII are: 

  • Public information–but don’t get too excited.
  • Aggregate information–again, don’t get too excited.
  • De-identified or anonymous (see previous snark).

Public information is often excluded from PII definitions. But public doesn’t mean what most people might think. It means the information is legally available in government records. So Facebook posts don’t count; nor do news articles in many cases. 

Aggregate information

Okay … so why is aggregation not as cool as it sounds? Here’s a typical definition (from the California Consumer Privacy Act):

“‘Aggregate consumer information’ means information that relates to a group or category of consumers, from which individual consumer identities have been removed, that is not linked or reasonably linkable to any consumer or household, including via a device.”

See the part about the result not being “linked or reasonably linkable”? Aggregation is a way to try to avoid “PII”–or linkability–which I said was important!

But aggregation requires more than removing names–you have to remove “identities” and you have to break linkability. That’s what I meant when I said don’t get too excited: aggregation doesn’t give us a solution–it gives us a different puzzle.

Aggregation doesn’t automatically anonymize data. Think of a table with roles and time at a company:

Role     | Number of Employees | Average time at company (years)
---------|---------------------|--------------------------------
Lawyer   | 7                   | 5
Engineer | 70                  | 2
CEO      | 1                   | 7

The CEO’s row doesn’t remove any information: if we know there’s only one CEO, it reveals the CEO’s exact tenure, and any additional columns would expose more data that’s linked to the CEO and nobody else. That’s called microdata leakage–accidentally revealing information about an individual from an aggregated data set.

Aggregate data can also violate what’s called group privacy. For example, aggregated location data from a fitness app (Strava’s published heat map) exposed the locations of military bases. In any case, aggregation can work, but it’s not easy.
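
One common mitigation for microdata leakage is small-cell suppression: don’t publish any aggregate row built from fewer than k people. Here’s a minimal sketch in Python with pandas, using the table above; the threshold and column names are assumptions for illustration.

```python
import pandas as pd

K = 5  # minimum group size before a row gets published (a policy choice)

employees = pd.DataFrame({
    "role": ["Lawyer"] * 7 + ["Engineer"] * 70 + ["CEO"],
    "years": [5] * 7 + [2] * 70 + [7],
})

agg = employees.groupby("role").agg(
    n=("years", "size"),
    avg_years=("years", "mean"),
).reset_index()

# Suppress rows about fewer than K people: the CEO row would leak
# one person's exact tenure, so it doesn't get published.
safe = agg[agg["n"] >= K]
print(safe)
```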

De-identification

But you’ve heard about de-identified or anonymous data–isn’t that the solution? Kind of. Let’s take a look at how FERPA defines it:

“the removal of all personally identifiable information provided that the educational agency or institution or other party has made a reasonable determination that a student's identity is not personally identifiable, whether through single or multiple releases, and taking into account other reasonably available information.”

More modern definitions, including guidance from the Department of Education (which administers FERPA) and the National Institute of Standards and Technology (NIST), recognize that identifiability isn’t a “yes or no” question but another math problem: the risk or probability of re-identification.

And there’s that word again: “linked.” How do I break the link, and how can I be sure information can’t be “reasonably linked”?

That’s the problem. It’s said that it takes only about 33 bits of information entropy to uniquely identify a person, and many of those bits are available on the internet or in public databases. Studies have found that over 60 percent of the US population can be uniquely identified by their birthdate, zip code, and gender. That makes it tricky to produce a truly anonymous data set. First, it doesn’t take a lot of information to identify a person. Second, biometrics, artificial intelligence, machine learning, big data, and even quantum computing are likely to make it even easier over time.
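
To see where numbers like that come from, here’s a rough back-of-the-envelope calculation in Python. The value counts are assumptions for illustration (about 80 plausible birth years, roughly 42,000 US zip codes):

```python
import math

birthdates = 366 * 80  # days in a year x ~80 plausible birth years (assumption)
zip_codes = 42_000     # approximate number of US zip codes (assumption)
genders = 2

combinations = birthdates * zip_codes * genders
print(f"{math.log2(combinations):.1f} bits")  # ~31.2 bits

# 33 bits is the magic number because 2**33 exceeds the world population:
print(2 ** 33)  # 8,589,934,592
```

Three mundane fields get you within a couple of bits of a unique identifier, which is why the birthdate/zip/gender result shouldn’t surprise us.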

What’s the difference between de-identified, anonymous, and pseudonymous?

I’ve included some definitions at the end of this post for those looking to learn more.

What to do?

One approach, available under HIPAA, is a safe harbor–a list of fields that, if eliminated from a data set, makes the data set “de-identified” by definition, as long as the custodian has no actual knowledge that the remaining information could be used to identify an individual. Here’s the list:

Names, geographic subdivisions smaller than a state, dates (other than year) that relate to an individual and ages over 89, telephone numbers, fax numbers, email addresses, SSNs, medical record numbers, health plan numbers, account numbers, certificate or license numbers, VINs, device IDs, URLs, IP addresses, biometrics, full-face photos, and any other unique identifiers, characteristics, or codes.

Blessing or curse, the safe harbor is only for Protected Health Information (HIPAA’s version of PII). It also leaves little usable information in the dataset for many purposes. 
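
As a sketch of what applying the safe harbor might look like in code (the record schema and field names are hypothetical, and a real implementation would also have to handle partial zip codes, dates, and identifiers hiding in free text):

```python
# Hypothetical record; the field names are invented for illustration.
record = {
    "name": "Alice Adams",
    "zip": "02155",
    "birthdate": "1980-03-07",
    "email": "alice@example.com",
    "age": 43,
    "diagnosis": "asthma",
}

# Fields our hypothetical schema maps to HIPAA safe-harbor identifiers.
SAFE_HARBOR_FIELDS = {"name", "zip", "birthdate", "email"}

deidentified = {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}

# Ages over 89 must be collapsed into a single category.
if isinstance(deidentified.get("age"), int) and deidentified["age"] > 89:
    deidentified["age"] = "90+"

print(deidentified)  # {'age': 43, 'diagnosis': 'asthma'}
```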

Another approach is to use anonymization statistics, such as k-anonymity, l-diversity, and t-closeness. These techniques help guide data scientists to apply the right amount of generalization (e.g., age ranges instead of specific ages), suppression (e.g., changing the final two or three digits of a zip code to asterisks), and similar techniques to prevent re-identification. Here is a good explanation of how these algorithms work.
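
For instance, k-anonymity requires that every combination of quasi-identifiers in the released data match at least k records. A minimal check in Python with pandas (the column names and choice of k are assumptions):

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    """True if every quasi-identifier combination appears at least k times."""
    return df.groupby(quasi_identifiers).size().min() >= k

released = pd.DataFrame({
    "age_band": ["40-49", "40-49", "40-49", "50-59", "50-59"],
    "zip3": ["021**", "021**", "021**", "902**", "902**"],
    "diagnosis": ["asthma", "flu", "asthma", "flu", "asthma"],
})

print(is_k_anonymous(released, ["age_band", "zip3"], k=3))  # False: the 50-59 group has only 2 rows
```

l-diversity and t-closeness then add requirements on the sensitive column (here, diagnosis) within each group, because a group whose members all share the same diagnosis still leaks it.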

One more approach to avoiding linkability is differential privacy. That’s how data from the last census (the 2020 US Census) was released. Finally, you could just create a synthetic data set–or acquire one, depending on what you’re trying to do.
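
The core trick of differential privacy is adding noise calibrated to how much any one person can change a result. Here’s a minimal sketch of the classic Laplace mechanism for a count query; epsilon is a privacy-budget parameter you’d have to choose, and this ignores the budget bookkeeping a real deployment needs.

```python
import numpy as np

def dp_count(true_count: int, epsilon: float) -> float:
    """Laplace mechanism: adding or removing one person changes a count
    by at most 1 (sensitivity = 1), so the noise scales with 1/epsilon."""
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

print([round(dp_count(1234, epsilon=0.5), 1) for _ in range(3)])
# e.g., [1233.1, 1236.8, 1231.4] - still useful in aggregate, noisy per query
```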

If you can’t get all of the PII out of the dataset, you can instead determine what the compliance requirements are (disclosure, consent from the individual, consent from the customer, opt-out, etc.) and follow them.

To recap: 

  1. PII isn’t what you think it is; and
  2. Neither is anonymity.


Glossary of Terms

Anonymous: De-identified. The GDPR says anonymous data is excluded from its scope and defines it similarly to de-identification. Not many other statutes currently use the term. If it meets the relevant definition, it’s not PII.

De-identified: See discussion in post. If it meets the relevant definition, it’s not PII. 

Differentially Private: A dataset or query result produced for specific purposes, in which some data has been modified, swapped, or noised so that no single individual’s presence or absence meaningfully changes the output, but in a way that won’t impact the planned calculations.

Encrypted: A message or file has been transformed into ciphertext that is indistinguishable from random characters. Encrypted data is still PII.

Generalization: Data is grouped at a higher level of abstraction or order of magnitude. For example, instead of giving a specific age, we might have a ten-year band (40-49).

Hashed: Transformed with a one-way function, so it can’t be reversed (“decrypted”). But the same input always produces the same hash, so hashed data still links records to a person–hashed data is still PII.

Pseudonymous:  Data has been altered by swapping certain fields for other data that is linked to the original information in a separate database or table. Pseudonymous data is still PII.

Redaction/Suppression: Data elements, or parts of them, are replaced with meaningless characters, usually asterisks.

Synthetic data: A dataset that has been created to mimic a dataset about real people. 

Tokenization: A form of pseudonymization where certain data elements are replaced with codes (tokens) that stand in for the information.
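
To make the last few entries concrete, here’s a small Python sketch contrasting hashing and tokenization. The email address and token format are invented for illustration; note that both outputs still act as stable identifiers, which is why both are still PII.

```python
import hashlib
import secrets

email = "alice@example.com"  # hypothetical data subject

# Hashing: one-way, but deterministic - the same email always yields the
# same hash, so records can still be linked to the same person.
hashed = hashlib.sha256(email.encode()).hexdigest()

# Tokenization: replace the value with a random token and keep the
# mapping in a separate, protected lookup table.
token_table = {}
token = "tok_" + secrets.token_hex(8)
token_table[token] = email  # re-identification is one lookup away

print(hashed[:16], token)
```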