Introduction
With devastating data breaches constantly in the headlines, you need to take proactive steps to safeguard the personally identifiable information (PII) that your organization stores and processes.
Along with techniques such as PII masking, PII pseudonymization is one of the most popular and practical ways to protect sensitive data. But what is PII pseudonymization, exactly, and how can you pseudonymize PII? We’ll answer these questions and more in this article.
What is PII Pseudonymization?
Personally identifiable information (PII) is any data that could reveal or help you infer the identity of a unique individual. There are many types of PII, some more revealing than others. Below is a list of information that can be considered PII:
- Identifiers: first, middle and last name; home address; phone number and other contact information; age; date of birth and place of birth; mother's maiden name; gender; race or ethnicity; nationality; ID numbers (e.g., social security number or passport ID)
- Work and education: employee or student ID; workplace or school address; years of work or study
- Biometric data: biometric templates (i.e., digital representations) of an individual's fingerprints, retinal scans, facial recognition data
- Internet data: browsing history, search history, IP addresses, mobile app activity, geolocation data
- Financial information: credit card numbers, SSNs, bank account numbers
- Healthcare data: medical conditions or illnesses; dates of treatment or consultation; medications or other treatments
When you hear the word “pseudonym,” you likely first think of fake names like “John Smith” or “Jane Doe.” However, data pseudonymization can apply not only to an individual’s name but also to other PII such as addresses or dates of birth.
Pseudonymization is a technique for de-identifying PII by replacing identifiable information with false substitute values (i.e., placeholders or pseudonyms). More formally, the European Union’s General Data Protection Regulation (GDPR) defines pseudonymization as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately”.
Both pseudonymization and anonymization are methods of de-identifying PII, but the two techniques are distinct:
- Pseudonymization replaces the original PII with unique placeholders, pseudonyms, or reference tokens.
- Anonymization hides sensitive information through data masking (e.g., replacing it with blanks or Xs, scrambling the characters, or substituting randomly generated information).
This difference leads to an important distinction: pseudonymization is a reversible process, but anonymization is often not.
- The pseudonymization process can be reversed by keeping a lookup table that links each piece of true, original information with its associated unique, pseudonymized value. (The GDPR definition given above alludes to such lookup tables when it refers to “the use of additional information.”)
- Anonymization techniques such as scrambling or blanking values destroy the underlying data, and thus cannot be reversed.
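The contrast between the two bullets above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `mask` function and the `Pseudonymizer` class are hypothetical names invented for this example, and the random tokens come from Python's standard `secrets` module.

```python
import secrets


def mask(value: str) -> str:
    """Anonymize by masking: every character becomes 'X', so the original is lost."""
    return "X" * len(value)


class Pseudonymizer:
    """Hypothetical helper: swaps values for random tokens and remembers the mapping."""

    def __init__(self) -> None:
        self._forward: dict[str, str] = {}  # original value -> pseudonym
        self._reverse: dict[str, str] = {}  # pseudonym -> original (the lookup table)

    def pseudonymize(self, value: str) -> str:
        if value not in self._forward:
            token = secrets.token_hex(8)
            while token in self._reverse:  # regenerate on the (unlikely) collision
                token = secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def reidentify(self, token: str) -> str:
        """Reverse the process; only possible for whoever holds the lookup table."""
        return self._reverse[token]


p = Pseudonymizer()
token = p.pseudonymize("Smith")
assert p.reidentify(token) == "Smith"  # reversible with the lookup table
assert mask("Smith") == mask("Jones")  # masking collapses distinct values
```

Because two different surnames produce the same mask, there is nothing to invert; the pseudonym, by contrast, maps back to exactly one original value as long as the table survives.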
Because pseudonymization is reversible and anonymization is irreversible, the two techniques usually have different applications. In software testing, for example, organizations can use anonymized data because there is no need to restore the original data after the tests.
On the other hand, pseudonymization is well suited to data that will be processed by a third party: the organization hands off a pseudonymized version of its data, receives the processed pseudonymized data, and can then revert to the original dataset with the help of the lookup table.
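As a rough sketch of that hand-off, the Python below pseudonymizes a few records, lets a stand-in "third party" aggregate them, and then re-identifies the results. The record layout, the sample names, and all three function names are assumptions made up for this example.

```python
import secrets

records = [
    {"name": "Ada Example", "amount": 120.0},
    {"name": "Bob Sample", "amount": 80.0},
    {"name": "Ada Example", "amount": 30.0},
]


def pseudonymize_records(records):
    """Return (pseudonymized copies, lookup table mapping pseudonym -> name)."""
    name_to_token = {}
    lookup = {}
    shared = []
    for record in records:
        name = record["name"]
        if name not in name_to_token:
            token = secrets.token_hex(8)
            while token in lookup:  # keep pseudonyms unique
                token = secrets.token_hex(8)
            name_to_token[name] = token
            lookup[token] = name
        shared.append({**record, "name": name_to_token[name]})
    return shared, lookup


def third_party_totals(shared):
    """Stand-in for the external processor: sums amounts per pseudonym."""
    totals = {}
    for record in shared:
        totals[record["name"]] = totals.get(record["name"], 0.0) + record["amount"]
    return totals


def reidentify(totals, lookup):
    """Map processed results back to real names using the private lookup table."""
    return {lookup[token]: total for token, total in totals.items()}


shared, lookup = pseudonymize_records(records)  # only `shared` leaves the organization
results = reidentify(third_party_totals(shared), lookup)
```

The external processor only ever sees `shared`; the `lookup` dictionary never leaves the organization, which is what makes the final re-identification step possible only on the owner's side.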
How to Pseudonymize PII
We’ve discussed the definition of PII pseudonymization, but what are the best practices for pseudonymizing PII?
The first step is to decide which of your data requires pseudonymization. The right answer here will depend on the types of information you process and the sensitivity level of each one.
For example, the U.S. Health Insurance Portability and Accountability Act (HIPAA) regulates how healthcare organizations handle individuals’ health information, including their electronic health records. Under HIPAA’s Safe Harbor method, data is considered de-identified only once 18 specific types of identifiers have been removed:
- Names
- Addresses
- Phone numbers
- Dates related to an individual, such as date of birth or death
- Vehicle identifications
- Fax numbers
- Device identifiers
- Email addresses
- URLs
- Social security numbers
- IP addresses
- Medical record numbers
- Biometric identifiers
- Health plan beneficiary numbers
- Identifiable photographs
- Account numbers
- Other unique identifying numbers
- Certificate or license numbers
Depending on your industry and your jurisdiction, you may have other requirements on what falls under PII — so conducting a thorough analysis of your enterprise data is essential.
While there are many methods of PII pseudonymization, there’s a single rule of thumb: the pseudonyms should not derive from the underlying data itself. For example, you can’t simply scramble the letters in a person’s last name or street address and consider that pseudonymized data.
Instead, data security experts have come up with a few secure ways to pseudonymize PII:
- A monotonic counter that starts at 0 and increments with each new entry in the dataset.
- Random number generation that assigns a randomly generated value to each entry in the dataset. (In this case, you must take care to ensure that the same value is not generated for two different entries.)
- A cryptographic hash function that maps input strings to fixed-length outputs. Hash functions are one-way: computing the output from an input is easy, but recovering the input from an output is not feasible directly. However, because PII values such as names and phone numbers are easy to guess, an attacker can hash candidate values and compare the results. This vulnerability to brute-force and dictionary attacks makes plain hashing less secure as a pseudonymization technique.
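The three approaches above might look roughly like this in Python. This is an illustrative sketch only: the function names are hypothetical, SHA-256 is just one possible choice of hash function, and the last function shows why an unsalted hash of a guessable value offers limited protection.

```python
import hashlib
import itertools
import secrets


def counter_pseudonyms(values):
    """Monotonic counter: the first distinct value gets 0, the next 1, and so on."""
    counter = itertools.count(0)
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = next(counter)
    return mapping


def random_pseudonyms(values, nbytes=8):
    """Random tokens, regenerated on collision so no two values share a pseudonym."""
    mapping = {}
    used = set()
    for value in values:
        if value in mapping:
            continue
        token = secrets.token_hex(nbytes)
        while token in used:
            token = secrets.token_hex(nbytes)
        used.add(token)
        mapping[value] = token
    return mapping


def plain_hash(value):
    """Unsalted SHA-256 digest of the value (the weakest of the three options)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def dictionary_attack(hashed, candidates):
    """Recover the input of an unsalted hash by hashing each guess in turn."""
    for guess in candidates:
        if plain_hash(guess) == hashed:
            return guess
    return None


names = ["Smith", "Jones", "Smith", "Garcia"]
by_counter = counter_pseudonyms(names)  # {'Smith': 0, 'Jones': 1, 'Garcia': 2}
by_random = random_pseudonyms(names)
leaked = plain_hash("Smith")
recovered = dictionary_attack(leaked, ["Jones", "Garcia", "Smith"])  # 'Smith'
```

Note that the counter and random approaches both need a stored mapping to be reversible, whereas the hash needs none, which is exactly why anyone with a list of plausible inputs can rebuild the mapping for themselves.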
In addition to PII pseudonymization methods, you also need to consider your pseudonymization policy. For example, if the same individual or identifier appears multiple times within a dataset, or within multiple datasets, how will you pseudonymize this repeated data? Will you use the same pseudonym for any occurrence or only for occurrences within the same dataset? Or will you generate a new pseudonym each time?
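One way to make these policy choices explicit is to encode them as a setting. The sketch below is a hypothetical design, not a standard API: the `Policy` enum, the `PolicyPseudonymizer` class, and the dataset labels are all invented for illustration.

```python
import secrets
from enum import Enum


class Policy(Enum):
    GLOBAL = "same pseudonym everywhere"
    PER_DATASET = "same pseudonym within one dataset"
    PER_OCCURRENCE = "new pseudonym every time"


class PolicyPseudonymizer:
    """Hypothetical helper that applies one of the three policies above."""

    def __init__(self, policy):
        self.policy = policy
        self.lookup = {}     # pseudonym -> original value (keep this private)
        self._assigned = {}  # scope key -> pseudonym already issued

    def pseudonymize(self, value, dataset):
        # The scope key decides when an existing pseudonym is reused.
        if self.policy is Policy.GLOBAL:
            key = (value,)
        elif self.policy is Policy.PER_DATASET:
            key = (dataset, value)
        else:
            key = None  # PER_OCCURRENCE never reuses a pseudonym
        if key is not None and key in self._assigned:
            return self._assigned[key]
        token = secrets.token_hex(8)
        while token in self.lookup:  # keep pseudonyms unique
            token = secrets.token_hex(8)
        self.lookup[token] = value
        if key is not None:
            self._assigned[key] = token
        return token


per_dataset = PolicyPseudonymizer(Policy.PER_DATASET)
a1 = per_dataset.pseudonymize("jane@example.com", dataset="orders")
a2 = per_dataset.pseudonymize("jane@example.com", dataset="orders")
b1 = per_dataset.pseudonymize("jane@example.com", dataset="support")
assert a1 == a2 and a1 != b1  # stable within a dataset, different across datasets
```

The trade-off is linkability: reusing a pseudonym everywhere keeps records joinable across datasets but also lets anyone who sees several datasets correlate them, while a fresh pseudonym per occurrence prevents correlation at the cost of a larger lookup table.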
Finally, it’s important to note that you should never share your lookup table with a third party. Lookup tables that link original and pseudonymized data are themselves sensitive information that needs to be kept safely under lock and key.
How Integrate.io Can Help with PII Pseudonymization
Integrate.io is a powerful, feature-rich ETL data integration platform that makes it simple to build pipelines from your data sources to your cloud data warehouse or data lake. With more than 100 pre-built integrations and a simple drag-and-drop interface, Integrate.io offers a robust set of tools that allow anyone — regardless of background or technical knowledge — to transform and integrate their enterprise data.
Using Integrate.io’s extensive list of data transformations, you can easily define PII pseudonymization rules for your pipelines and workflows, disguising your sensitive and confidential information. In just a few clicks, Integrate.io lets users pseudonymize their data and guard against malicious actors and data breaches.
Want to learn more about how Integrate.io can help with PII pseudonymization? Get in touch with our team of data experts today for a chat about your business needs or to start your 14-day pilot of the Integrate.io platform.