The value of big data is as great as the challenge of harnessing it: how do we make all the human potential that lies in data work for progress, without harming the people and stories behind it? Anonymization and pseudonymization of data are two privacy-engineering techniques that help answer this question.
Because data is the new gold, yes, but this gold is not obtained from an inert object.
Data is attached to people: names, surnames, location, tax information, medical history… and a vast amount of other revealing content that can easily jeopardize the integrity of those people.
But in data live the solutions to unsolved problems and stories yet to be uncovered. Compared to data, the value of gold is much more limited.
But so is its damage potential.
So how do we harness the data from people, for people, but without the people?
Addressing the data privacy challenge
This is exactly the question that privacy engineering seeks to answer, developing solutions that allow data to be leveraged while complying with data protection legislation and protecting the individuals associated with it.
And just as there are different techniques for painting a canvas, there are different techniques for protecting the privacy of individuals in databases.
Anonymization and pseudonymization of data are two of these techniques. Their aim is to avoid unwanted re-identification of the person behind a piece of information, and thus prevent possible threats to the privacy of individuals whose personal information is present in large data sets.
But the challenge is to do so while providing useful and massive access to the data.
European legislation on personal data protection differentiates between these two techniques and recommends their use for different applications, depending on the use case, the degree of risk, the way data is processed within each company, etc.
Let’s see how each one works and what they are used for.
What is anonymization?
Anonymization is a technique that transforms the information in a database in a non-reversible way.
After the anonymization process, people and data are disassociated completely and can no longer be linked, directly or indirectly.
In other words, anonymization turns personal data into impersonal data, forever.
To understand this in an everyday way, think of an anonymous work of art: you can see it, read it, know it, enjoy it and use it… But you will never know who made it.
What is pseudonymization?
Taking the same reference, what is a pseudonym in art? Someone fictitious to whom that work can be attributed. That is, several works can be associated with the same pseudonym, but that pseudonym cannot be linked to a real name.
In this way, a relationship between the works can be traced, but not a relationship between the works and the real person.
The same thing happens with the pseudonymization of data. Pseudonymization replaces an identifying piece of data with a pseudonym or alias.
The relationship between the original person, the pseudonym, and the data is maintained, but it can only be decoded with a key.
And, if that key is destroyed, then the pseudonym would lose its link to the real person. The data would be associated with a pseudonym, and the pseudonym would be associated with nobody.
Differences between anonymization and pseudonymization
As you can guess, the main difference is that pseudonymization is usually implemented when you want to keep the relationship between the person and the data, but protected behind lock and key.
Whereas anonymization aims to erase the trace to the source of the data permanently.
With pseudonymization, data can be made available to people who need to use it without knowing the personal information behind it.
But it keeps that personal relationship accessible to those who do need to access it (and have the appropriate permissions).
Pseudonymization vs anonymization: do they have to comply with the GDPR?
Legally, a pseudonym is an identifier.
It represents personal data associated with someone and, since it is reversible, that person could be identified and associated with the data.
Because the process is reversible, and anyone holding the appropriate key can identify the person, pseudonymized data is still treated as personal data and remains subject to the European GDPR.
In anonymization, the data is no longer personal data, and therefore it is understood that it would not fall under the data protection law.
Benefits of anonymizing and pseudonymizing data
- Comply with the law (GDPR in Europe, for example).
- Share or trade data both internally and with external companies or technology service providers, without compromising data confidentiality or compliance with the law.
- Obtain data from third-party documents and databases for statistical treatment or for training algorithms with machine learning, without losing the value of the data.
- In the public sector, share anonymized information in order to comply with the principle of transparency.
- Add an extra layer of cybersecurity. More and more companies work in the cloud, where they generate and store data, and can therefore be the target of cyber-attacks. Pseudonymizing data adds an extra layer of security in the cloud and can lead an attacker to pass over this company in favor of another that makes the job easier.
- Change the ownership of the data and continue to use it, even if the customer unsubscribes, or use it for marketing without being able to identify the person from whom the data originated. Once the data has been pseudonymized and the link to the original eliminated, it is understood that the pseudonym is no longer subject to personal data protection legislation and could therefore continue to be used for profiling or marketing with information relevant to companies.
Nymiz, the best data anonymization tool & pseudonymization service
Nymiz is one of the best tools for pseudonymizing and anonymizing company data, starting with compliance with the requirements of data protection legislation.
In addition, the tool makes it possible to share and use data with business value automatically, while avoiding possible cybersecurity threats.
Why do we like Nymiz for data anonymization?
This anonymization service has some features that make it a top performer:
- It uses machine learning algorithms and natural language processing to continuously improve the scope of the service and its reliability in the pseudonymization of data.
- It is available in cloud and local versions, so that data is protected in any environment the company uses, internally or with third parties.
- It can pseudonymize structured data in databases and unstructured data such as Word, PowerPoint, PDF or e-mail documents.
- It recognizes English and Spanish, with new languages being prepared for future updates.
The 4 most commonly used data pseudonymization techniques
Pseudonymization technique #1: secret key or key deletion
It is like placing a barrier between the pseudonym and the original data, which can only be opened with a key. The pseudonym can be reconnected to the original data to re-identify it, but only by someone who possesses the key.
If the key is destroyed, the pseudonymized data will be disconnected from the original personal data forever, but not deleted.
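As a rough illustration of this idea, the sketch below treats the "key" as a lookup table that is kept apart from the data. The function name `pseudonymize`, the `P-` prefix and the sample records are assumptions made for the example, not something the article or any specific tool prescribes.

```python
import secrets


def pseudonymize(records, id_field):
    """Return (pseudonymized_records, key), where key maps pseudonym -> original."""
    key = {}    # pseudonym -> original identifier (the "secret key")
    seen = {}   # original identifier -> pseudonym, so repeats get the same alias
    output = []
    for record in records:
        original = record[id_field]
        if original not in seen:
            pseudonym = "P-" + secrets.token_hex(8)  # random, carries no meaning
            seen[original] = pseudonym
            key[pseudonym] = original
        # Copy the record, swapping the identifier for its pseudonym.
        output.append({**record, id_field: seen[original]})
    return output, key


# Made-up sample data for the example.
patients = [
    {"name": "Ada Example", "diagnosis": "asthma"},
    {"name": "Ada Example", "diagnosis": "allergy"},
    {"name": "Bob Example", "diagnosis": "asthma"},
]
data, key = pseudonymize(patients, "name")

# While the key exists, a pseudonym can be traced back to the person.
print(key[data[0]["name"]])  # Ada Example

# Destroying the key breaks the link; the pseudonymized records remain.
key.clear()
print(len(data))  # 3
```

In a real system the key would live in a separate, access-controlled store, and destroying it would also mean erasing every backup copy.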
Pseudonymization technique #2: Hash function
A hash function is a mathematical algorithm that transforms a series of data into a new series of characters.
For example, we can use a hash function to convert our OpenSistemas brand name into a series of characters that has nothing to do with it: 1L/GXW+Ep1wKdzdtw7rModHkTrvJIppM7wli70HZ60A=
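As you can see, once it has been replaced, that bit of data means nothing on its own. A hash function is one-way, so the result cannot simply be decoded: to link it back to the original you need an additional piece of information, such as the original value itself or a table that maps each hash to its source. If you do not have that additional information, you cannot reveal the original data.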
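A minimal sketch of hashing a value might look like this. It assumes SHA-256 with Base64 encoding; the article does not say which function or encoding produced the sample string above, so the output here is not expected to match it. The helper name `hash_pseudonym` is made up for the example.

```python
import base64
import hashlib


def hash_pseudonym(value: str) -> str:
    """Hash a value with SHA-256 and return the digest as Base64 text."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


pseudonym = hash_pseudonym("OpenSistemas")
print(pseudonym)

# The same input always produces the same output, so records can still be linked.
print(pseudonym == hash_pseudonym("OpenSistemas"))  # True
```

One caveat worth keeping in mind: because the function is public and deterministic, anyone who can guess candidate values (a list of company names, for instance) can hash them and compare the results.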
Pseudonymization technique #3: Hash function with stored key
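This is a type of hash function that also takes a secret key as input. Only someone who holds the stored key can recompute the pseudonym for a given value and so link it back to the original data.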
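One common way to build a hash that depends on a key is HMAC. The article does not name a specific construction, so using HMAC-SHA256 here is an assumption, as are the helper name `keyed_pseudonym` and the sample e-mail address.

```python
import hashlib
import hmac
import secrets


def keyed_pseudonym(value: str, key: bytes) -> str:
    """Derive a pseudonym from a value and a secret key using HMAC-SHA256."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


secret_key = secrets.token_bytes(32)  # would be stored separately from the data
other_key = secrets.token_bytes(32)

a = keyed_pseudonym("jane.doe@example.com", secret_key)
b = keyed_pseudonym("jane.doe@example.com", secret_key)
c = keyed_pseudonym("jane.doe@example.com", other_key)

print(a == b)  # True: same value and same key give the same pseudonym
print(a == c)  # False: a different key gives a different pseudonym
```

Without the key, guessing candidate values and hashing them no longer works, which is the main advantage over an unkeyed hash.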
Pseudonymization technique #4: Tokenization
Each piece of data is exchanged for a randomly generated token that does not follow a replicable mathematical sequence or logic.
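Tokenization could be sketched as follows. The `TokenVault` class is hypothetical, and the choice to generate tokens with the same number of digits as the original value is an assumption made so that the example resembles a card-number use case.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault that swaps values for random tokens."""

    def __init__(self):
        self._value_by_token = {}
        self._token_by_value = {}

    def tokenize(self, value: str) -> str:
        if not value:
            raise ValueError("value must not be empty")
        # Reuse the existing token so one value always maps to one token.
        if value in self._token_by_value:
            return self._token_by_value[value]
        # Draw random digits of the same length; retry on the rare collision.
        while True:
            token = "".join(secrets.choice("0123456789") for _ in value)
            if token != value and token not in self._value_by_token:
                break
        self._value_by_token[token] = value
        self._token_by_value[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only the vault can reverse a token; there is no formula to invert.
        return self._value_by_token[token]


vault = TokenVault()
card = "4111111111111111"  # made-up sample value
token = vault.tokenize(card)

print(len(token) == len(card))          # True
print(vault.detokenize(token) == card)  # True
```

Unlike a hash, the token is not computed from the value at all, so the only way back to the original is through the vault.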
Examples of data anonymization and pseudonymization
For example, in many clinical studies information is collected from the “subjects”, such as their demographic group, gender, health characteristics… But in the study, all of this is pooled and only general conclusions are drawn.
The percentages of people to whom this or that conclusion applies do not identify the study subjects in any way. Therefore, the data relating to the findings is considered anonymous.
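The clinical-study case can be sketched as a simple aggregation that keeps only group percentages and discards the individual rows. The function `aggregate_share`, the sample subjects and the `min_group_size` threshold are all assumptions for illustration; very small groups are dropped here because a group of one can still point to a single person.

```python
from collections import Counter


def aggregate_share(subjects, field, min_group_size=2):
    """Return the percentage of subjects per value of `field`.

    Groups smaller than `min_group_size` are left out of the result.
    """
    counts = Counter(subject[field] for subject in subjects)
    total = len(subjects)
    return {
        value: round(100 * n / total, 1)
        for value, n in counts.items()
        if n >= min_group_size
    }


# Made-up study data for the example.
subjects = [
    {"subject_id": "S-001", "age_group": "30-39", "responded": True},
    {"subject_id": "S-002", "age_group": "30-39", "responded": False},
    {"subject_id": "S-003", "age_group": "40-49", "responded": True},
    {"subject_id": "S-004", "age_group": "40-49", "responded": True},
    {"subject_id": "S-005", "age_group": "50-59", "responded": True},
]

print(aggregate_share(subjects, "age_group"))  # {'30-39': 40.0, '40-49': 40.0}
```

Whether an aggregated result is truly anonymous depends on the data set and on what else could be combined with it, so this is a starting point rather than a guarantee.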
Now let’s think about an example of pseudonymized data.
A very common case is in the fintech world when you need to extract very specific groups of customers based on their behavior (clusters), in order to offer them hyper-personalized products and encourage more purchasing.
A marketing data analyst or data engineer may have to identify these audience clusters among customers, without accessing the names associated with their accounts, transactions, etc.
By replacing customer names with a code or numerical series, these common behavior patterns could still be identified without compromising the identity of the people behind them.
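The fintech scenario could be sketched like this: names are swapped for codes, and the analyst then groups customers by behavior using only the codes. The sample transactions, the `C0001`-style codes and the grouping rule (top spending category) are assumptions for the example; a real project would likely use a proper clustering method.

```python
from collections import defaultdict
from itertools import count


def replace_names(rows):
    """Swap each customer name for a code; return (coded_rows, code_table)."""
    codes = {}
    counter = count(1)
    coded = []
    for row in rows:
        name = row["customer"]
        if name not in codes:
            codes[name] = f"C{next(counter):04d}"
        coded.append({**row, "customer": codes[name]})
    return coded, codes


def group_by_top_category(rows):
    """Group customer codes by the category where each one spends the most."""
    spend = defaultdict(lambda: defaultdict(float))
    for row in rows:
        spend[row["customer"]][row["category"]] += row["amount"]
    clusters = defaultdict(list)
    for customer, by_category in spend.items():
        clusters[max(by_category, key=by_category.get)].append(customer)
    return dict(clusters)


# Made-up transactions for the example.
transactions = [
    {"customer": "Ana Example", "category": "travel", "amount": 420.0},
    {"customer": "Luis Example", "category": "groceries", "amount": 150.0},
    {"customer": "Ana Example", "category": "groceries", "amount": 80.0},
    {"customer": "Marta Example", "category": "travel", "amount": 300.0},
    {"customer": "Luis Example", "category": "travel", "amount": 30.0},
]

coded_rows, code_table = replace_names(transactions)  # code_table stays with the custodian
print(group_by_top_category(coded_rows))
# {'travel': ['C0001', 'C0003'], 'groceries': ['C0002']}
```

The analyst only ever sees `coded_rows`; the `code_table` plays the role of the key and would be held by whoever is authorized to re-identify customers.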
5 common misconceptions about pseudonymization and anonymization of data
The GDPR recognizes some common mistakes in the concepts of anonymization and pseudonymization techniques that you will need to be aware of if you need to use them or explain them in a work environment.
Mistake #1. Thinking that a pseudonymized dataset is anonymous.
In pseudonymization, the original data and the pseudonymized data are still linked, but a “key” is placed between them.
Whoever possesses that key can then link the pseudonymized data back to the original data. In the case of anonymization, the relationship between the original data and the anonymized data is destroyed and cannot be restored.
Mistake #5. Anonymization and pseudonymization make data useless
While it is true that there are uses for which the data must be personal, we can also obtain many benefits from pseudonymized or anonymized data.
In the previous example, we saw how we can obtain behavioral data for hyper-segmented audiences (clusters) and make an impact with highly personalized products, without needing to know their names and surnames.
It is the data custodian who decides whether the purpose for which the data is required can be achieved with anonymized data or whether, on the contrary, it is necessary to resort to a pseudonymization technique.