What is Pseudonymization?

Pseudonymization is the process of substituting identifiable personal data with a reference or pseudonym. This process allows organizations to share data while protecting the privacy of the clients, employees, and other individuals that the data describes.

The pseudonymization process is reversible. Pseudonyms refer back to the original data set, which means that someone with access to the reference table can match each record to the named individual. For this reason, businesses must store pseudonym tables in a safe environment.

How Does Pseudonymization Work?

To understand pseudonymization, you first have to understand what constitutes personal data.

Some examples are obvious: name, address, and date of birth all relate to a specific person. Others are less obvious. For example, if a building manager tracks entry and exit times for each person, could the patterns in that data reveal the subject's identity?

The EU's General Data Protection Regulation (GDPR) defines personal data as follows:

"'personal data' means any information relating to an identified or identifiable natural person ('data subject'); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person".

Essentially, personal data is anything that can help to identify a specific individual.

Consider a company that has a sales database with records like this:

NAME    | ADDRESS        | ORDER TOTAL | DELIVERY DATE
--------|----------------|-------------|---------------
A Aaron | 1381 Big St    | $8,371.98   | 08/01/2020
B Barr  | 56 Smalltown   | $13,938.22  | 08/02/2020
C Cole  | Hillside Manor | $9,454.99   | 07/29/2020

The first two fields are clearly personal data as they identify a specific data subject. The organization will need to consider whether someone could use the remaining fields to identify the subject. If not, the organization can process that part of the data without worrying about data protection issues.

Doing this means separating the personal data from the non-personal data. The organization does this by creating two new tables. First, a lookup table that assigns a pseudonym ID (PID) to each individual:

PID  | NAME    | ADDRESS
-----|---------|---------------
1001 | A Aaron | 1381 Big St
1002 | B Barr  | 56 Smalltown
1003 | C Cole  | Hillside Manor

These IDs allow the organization to pseudonymize their data like so:

PID  | ORDER TOTAL | DELIVERY DATE
-----|-------------|---------------
1001 | $8,371.98   | 08/01/2020
1002 | $13,938.22  | 08/02/2020
1003 | $9,454.99   | 07/29/2020
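
The split into a lookup table and a pseudonymized table can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function name `pseudonymize`, the field names, and the sequential IDs starting at 1001 are assumptions chosen to mirror the example tables.

```python
# Hypothetical sketch: split sales records into a lookup table and a
# pseudonymized table. Names and ID scheme are illustrative assumptions.
from itertools import count

sales = [
    {"name": "A Aaron", "address": "1381 Big St",
     "order_total": "$8,371.98", "delivery_date": "08/01/2020"},
    {"name": "B Barr", "address": "56 Smalltown",
     "order_total": "$13,938.22", "delivery_date": "08/02/2020"},
    {"name": "C Cole", "address": "Hillside Manor",
     "order_total": "$9,454.99", "delivery_date": "07/29/2020"},
]

# Assumed to be the only fields that identify a person in this data set.
PERSONAL_FIELDS = ("name", "address")


def pseudonymize(records, personal_fields=PERSONAL_FIELDS, first_pid=1001):
    """Return (lookup_table, pseudonymized_table) for the given records."""
    next_pid = count(first_pid)
    pid_by_identity = {}  # one PID per distinct individual
    lookup, pseudonymized = [], []
    for record in records:
        identity = tuple(record[f] for f in personal_fields)
        if identity not in pid_by_identity:
            pid = next(next_pid)
            pid_by_identity[identity] = pid
            lookup.append({"pid": pid, **{f: record[f] for f in personal_fields}})
        pid = pid_by_identity[identity]
        # Keep only the non-personal fields, keyed by the pseudonym.
        rest = {k: v for k, v in record.items() if k not in personal_fields}
        pseudonymized.append({"pid": pid, **rest})
    return lookup, pseudonymized


lookup_table, pseudonymized_table = pseudonymize(sales)
```

Only `pseudonymized_table` would leave the organization in this sketch; `lookup_table` stays in a protected environment.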

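The pseudonymized table no longer contains direct identifiers, which makes it much safer to share with third parties or transfer across jurisdictions. Note, however, that under GDPR, pseudonymized data that can be re-attributed to a person using additional information still counts as personal data, so the organization remains subject to data protection rules when it shares that data.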

If the organization later needs to reconstruct the original data, it can combine the reference table with the pseudonymized data. Therefore, unlike other forms of data masking, the pseudonymization process is reversible.
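
Because the lookup table maps each PID back to a name and address, re-identification amounts to a join on PID. The sketch below assumes both tables are lists of dictionaries with a shared `pid` key; the function name `reidentify` is hypothetical.

```python
# Hypothetical sketch: rebuild the original records by joining the
# pseudonymized table to the lookup table on PID.
lookup_table = [
    {"pid": 1001, "name": "A Aaron", "address": "1381 Big St"},
    {"pid": 1002, "name": "B Barr", "address": "56 Smalltown"},
    {"pid": 1003, "name": "C Cole", "address": "Hillside Manor"},
]

pseudonymized_table = [
    {"pid": 1001, "order_total": "$8,371.98", "delivery_date": "08/01/2020"},
    {"pid": 1002, "order_total": "$13,938.22", "delivery_date": "08/02/2020"},
    {"pid": 1003, "order_total": "$9,454.99", "delivery_date": "07/29/2020"},
]


def reidentify(pseudonymized, lookup):
    """Return the original records with the pseudonym column removed."""
    by_pid = {row["pid"]: row for row in lookup}
    restored = []
    for row in pseudonymized:
        identity = by_pid[row["pid"]]  # raises KeyError for an unknown PID
        merged = {**identity, **row}
        del merged["pid"]
        restored.append(merged)
    return restored


original = reidentify(pseudonymized_table, lookup_table)
```

Anyone holding only `pseudonymized_table` cannot run this join, which is why access to the lookup table has to be restricted.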

When is Pseudonymization Used?

Pseudonymization is one of two approaches to de-identifying personal data:

Pseudonymization: A reversible process where identifiers or tokens replace sensitive values.
Anonymization: A non-reversible process that hides sensitive values permanently. This can involve several data masking techniques such as blanking fields, scrambling values, or replacing fields with randomly generated values.
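
To make the contrast concrete, here is a minimal sketch of the three masking techniques named above. The function names are hypothetical, "scrambling" is interpreted here as shuffling the characters of a value (one of several possible readings), and these toy functions illustrate the idea rather than guarantee anonymity. The key difference from pseudonymization is that no lookup table is kept, so the original values cannot be restored from the output.

```python
# Hypothetical sketch of simple masking techniques. No mapping back to the
# original values is stored, so the result is not meant to be reversible.
import random
import string


def blank(value):
    """Blanking: discard the value entirely."""
    return ""


def scramble(value, rng):
    """Scrambling (assumed meaning): shuffle the characters of the value."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


def random_replacement(value, rng):
    """Replace the value with random letters of the same length."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in value)


rng = random.Random()  # unseeded, so output differs on every run

record = {"name": "A Aaron", "address": "1381 Big St", "order_total": "$8,371.98"}

anonymized = {
    "name": random_replacement(record["name"], rng),
    "address": blank(record["address"]),
    "order_total": record["order_total"],  # non-personal field kept as-is
}
```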

Anonymization is best in situations where there will never be a need to re-identify the data. For example, software testing needs a representative version of production data. Organizations can anonymize their live data and provide this to the testers.

Pseudonymization suits scenarios where the data owner expects that they will need to re-identify the data in the future. An example of this is when the company needs to pass data to a third party for additional processing. They send a pseudonymized version to the service provider, and this third party performs the necessary operations. When the data owner gets their processed data back, they can use the reference tables to reconstitute the full data set.

Most companies find that they have to use pseudonymization to comply with specific data privacy laws. Two pieces of legislation that codify pseudonymization are GDPR and HIPAA.

General Data Protection Regulation (GDPR)

GDPR is an EU law with worldwide consequences. Companies that handle the personal data of people in the EU must comply with GDPR or else face sanctions, even if they are not based in the EU.

GDPR permits the use of pseudonymization, which it defines as, "the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information, as long as such additional information is kept separately and subject to technical and organizational measures to ensure non-attribution to an identified or identifiable individual."

Organizations are free to use pseudonymization, as long as they ensure that they don't share reference tables outside of the organization.

Health Insurance Portability and Accountability Act (HIPAA)

HIPAA applies to American healthcare organizations that handle patient data. This law ensures certain standards in patient privacy, including rules on pseudonymization.

The text of HIPAA refers to this kind of data obfuscation as de-identification. There are two approaches to de-identification: the Expert Determination method, which involves using statistical methods, and the Safe Harbor method, which requires the removal of 18 specific identifiers. These identifiers are:

Names
Addresses
Phone numbers
Dates related to an individual, such as date of birth or death
Vehicle identifications
Fax numbers
Device identifiers
Email addresses
URLs
Social security numbers
IP addresses
Medical record numbers
Biometric identifiers
Health plan beneficiary numbers
Identifiable photographs
Account numbers
Other unique identifying numbers
Certificate or license numbers
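
As a rough illustration of the Safe Harbor idea, the sketch below drops identifier fields from a record by column name. The field names are hypothetical, and a real schema would need its own mapping to the identifier categories listed above. The sketch also does not address identifiers embedded in free-text fields.

```python
# Hypothetical sketch: remove identifier columns from a patient record.
# Field names are assumptions, not part of any standard schema.
SAFE_HARBOR_FIELDS = {
    "name", "address", "phone", "date_of_birth", "vehicle_id", "fax",
    "device_id", "email", "url", "ssn", "ip_address",
    "medical_record_number", "biometric_id", "health_plan_number",
    "photo", "account_number", "other_unique_id", "license_number",
}


def remove_identifiers(record, identifier_fields=SAFE_HARBOR_FIELDS):
    """Return a copy of the record without the listed identifier fields."""
    return {k: v for k, v in record.items() if k not in identifier_fields}


patient = {
    "name": "A Aaron",
    "email": "a.aaron@example.com",
    "ssn": "000-00-0000",
    "diagnosis_code": "J45",
    "visit_count": 3,
}

deidentified = remove_identifiers(patient)
```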

Under HIPAA, a covered entity may replace these values with a code or identifier. There are two rules for creating pseudonyms:

The code or identifier can't derive from the underlying data. For example, you can't use an anagram of someone's name as their pseudonym.
The entity may not share their re-identification methods with any other party. This means that the pseudonym lookup tables themselves count as personal information.
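
One way to satisfy the first rule is to generate each code randomly instead of computing it from the record. The sketch below uses Python's standard `secrets` module for this; the function names and the 8-byte code length are assumptions. By contrast, an anagram or a hash of the name would be derived from the underlying data.

```python
# Hypothetical sketch: assign random codes that are not derived from the
# underlying data. The resulting mapping is the re-identification key and
# must be kept private.
import secrets


def new_code(existing_codes, nbytes=8):
    """Return a random hex code that is not already in use."""
    while True:
        code = secrets.token_hex(nbytes)
        if code not in existing_codes:
            return code


def assign_codes(names):
    """Map each distinct name to its own random code."""
    code_by_name = {}
    for name in names:
        if name not in code_by_name:
            code_by_name[name] = new_code(set(code_by_name.values()))
    return code_by_name


codes = assign_codes(["A Aaron", "B Barr", "C Cole"])
```

Because the codes are random, the same name receives a different code on every run, so the mapping has to be stored if re-identification will ever be needed.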
