The inclusion in the General Data Protection Regulation (GDPR)¹ of “pseudonymisation” as a data protection measure is new when compared with the previous European directive². In the GDPR, the word (or concept) “pseudonymisation” occurs some 15 times, while in the previous directive it did not appear once.
The rapid technological evolution of the last 10 years has increased the need for interaction and data transfer between cloud applications and systems (including to countries outside the European Union). Personal data covering characteristics, behaviours, and personal preferences are used to create profiles of natural persons. These personal data are processed by large-scale computing using artificial intelligence and deep-learning algorithms, which enables new business models, with benefits for organizations, but which may, on the other hand, impact the rights and freedoms of data subjects, in particular their privacy.
Anonymisation and pseudonymisation are measures that organizations have at their disposal (to comply with the principles of data protection by design and by default³) to protect the personal data of their users, customers, employees, or other data subjects. When correctly applied, they mitigate (or eliminate, in the case of anonymisation) the risk of exposing personal data to third parties without a “need to know”, or of the data being used for other processing purposes. These measures should be applied when needed and with the least possible impact on the organization’s operation and business.
What do data anonymisation and data pseudonymisation consist of? What are the differences?
The main difference between anonymised and pseudonymised data is that the former (anonymised) cease to be personal data, as they no longer allow the re-identification of natural persons, while the latter (pseudonymised) remain personal data under the terms of the GDPR.
As provided for in the GDPR, pseudonymised data may be re-identified using “additional information”, kept separately from the pseudonymised data and subject to technical and organizational measures that ensure re-identification is possible only in the situations foreseen.
A deficient anonymisation is one that still allows the re-identification of natural persons. One commonly exploited vulnerability is re-identification through indirect identifiers that remain in the records after anonymisation, e.g. date of birth, postal code, gender⁴.
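As a minimal illustration of this kind of linking attack, consider two “anonymised” datasets that share the indirect identifiers date of birth, postal code, and gender. All data and field names below are fictitious, invented purely for the sketch:

    # A minimal sketch of a linking attack on "anonymised" data.
    # Both datasets and all field names are fictitious assumptions.

    health_records = [  # hospital data with direct identifiers removed
        {"dob": "1961-07-31", "zip": "02138", "sex": "M", "diagnosis": "hypertension"},
        {"dob": "1975-03-02", "zip": "02139", "sex": "F", "diagnosis": "asthma"},
    ]

    voter_list = [      # a public register carrying the same indirect identifiers
        {"name": "John Doe", "dob": "1961-07-31", "zip": "02138", "sex": "M"},
        {"name": "Jane Roe", "dob": "1975-03-02", "zip": "02139", "sex": "F"},
    ]

    # Join the two sources on the shared indirect identifiers (dob, zip, sex).
    key = lambda r: (r["dob"], r["zip"], r["sex"])
    voters_by_key = {key(v): v["name"] for v in voter_list}

    for record in health_records:
        name = voters_by_key.get(key(record))
        if name:  # the "anonymised" record is re-identified
            print(f"{name} -> {record['diagnosis']}")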
Another recurring vulnerability in anonymisation is the use of data encryption algorithms to “supposedly” anonymise personal data. Once the key of the algorithm used has been identified, the data can be re-identified. Even when the decryption key is not known, there is no guarantee that re-identification will not be possible in the future. It will depend on the robustness of the algorithm, on whether a “salt” (a set of random characters) is used in the hash-generation functions, and on the computing capacity available to reverse the encryption with brute-force, trial-and-error attacks (it is expected that quantum computing will exponentially reduce the time required to break cryptographic codes).
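A minimal sketch of why an unsalted hash is not anonymisation: when the input space is small and predictable (here, a four-digit code invented purely for illustration), the “anonymised” value can be recovered by exhaustive trial and error:

    import hashlib

    # Pseudo-"anonymise" a four-digit customer code with an unsalted SHA-256 hash.
    original = "4812"
    stored = hashlib.sha256(original.encode()).hexdigest()

    # An attacker who knows the data format simply hashes every candidate value.
    for candidate in (f"{i:04d}" for i in range(10_000)):
        if hashlib.sha256(candidate.encode()).hexdigest() == stored:
            print("re-identified:", candidate)  # prints: re-identified: 4812
            break

A salt makes this particular precomputation harder, but as long as any algorithmic link to the original value exists, the data remain potentially re-identifiable.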
A good practice is to permanently erase the data that could allow re-identification, or to replace them with characters that have no algorithmic connection with the original.
Thus, encryption is not an anonymisation technique, but it can be a powerful tool for pseudonymisation. It protects data confidentiality by applying an encryption algorithm (symmetric or asymmetric) whose key(s) are known. These keys should be considered “additional information” that allows the re-identification of natural persons.
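A minimal sketch of encryption used as pseudonymisation, assuming the third-party Python cryptography package; the key is the “additional information” and must be stored separately from the pseudonymised data:

    from cryptography.fernet import Fernet  # pip install cryptography

    key = Fernet.generate_key()   # "additional information": store separately, secured
    cipher = Fernet(key)

    pseudonym = cipher.encrypt(b"Maria Silva, patient 12345")
    print(pseudonym)              # this token is what the working dataset stores

    # Re-identification is possible only where the key is available.
    print(cipher.decrypt(pseudonym))  # b'Maria Silva, patient 12345'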
What is the interest / advantage of applying these tools / techniques to personal data?
As previously mentioned, these are two very useful techniques for the protection of personal data and should be used by design and by default.
Anonymisation is a very useful alternative to deleting records containing personal data. It can be used, for example, when executing a data subject’s “right to erasure” or after the personal data retention period has expired. Anonymisation can also avoid referential-integrity failures in existing databases: deleting records containing personal data leaves null or dangling references in other tables, causing errors or failures in the normal functioning of the system, whereas anonymised records remain in place (see the sketch below). It is also an adequate measure when preparing data to be shared for operations where no identifiable data is needed, e.g. statistical treatment or the analysis of the data and behaviour of a population for decision-making.
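A minimal sketch, using SQLite and an invented two-table schema, of anonymising a record in place instead of deleting it, so that references from other tables remain valid:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
        CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER
                                REFERENCES customers(id), total REAL);
        INSERT INTO customers VALUES (1, 'Maria Silva', 'maria@example.com');
        INSERT INTO orders    VALUES (10, 1, 99.90);
    """)

    # Instead of DELETE (which would leave order 10 referencing a missing row),
    # overwrite the personal data with values that have no link to the original.
    db.execute("UPDATE customers SET name = 'ANONYMISED', email = NULL WHERE id = ?", (1,))

    # The order history (non-personal data) keeps working, with valid references.
    print(db.execute("""SELECT o.id, c.name, o.total FROM orders o
                        JOIN customers c ON c.id = o.customer_id""").fetchall())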
Pseudonymisation is a powerful measure for personal data protection that can be applied “by design” to data repositories. In this case, the data structures are designed to separate the different types of personal data and to control their aggregation (“by default”), restricting it to authorized user profiles or autonomous agents. For example, when designing a database of clinical patient records, identification data, contact data, and health data could be stored in three different tables or data structures, linked by pseudonymised identifiers (which connect the data to the identity of the patients). This guarantees that patients are identified only to the extent of the “need to know”: exclusively by users authorized to process clinical data or to process personal contact data. In this case, keeping chronological records of accesses for auditing purposes (as foreseen in the “accountability”⁵ principle) is also facilitated.
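A minimal sketch of the clinical-records example, with an invented schema: identification, contact, and health data live in separate tables, linked only by a random pseudonym, so that a profile authorized for clinical data never touches the identification table:

    import secrets
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- Identification data: restricted to users who may identify patients.
        CREATE TABLE patient_identity (pseudonym TEXT PRIMARY KEY, full_name TEXT,
                                       national_id TEXT);
        -- Contact and health data reference the patient only via the pseudonym.
        CREATE TABLE patient_contact  (pseudonym TEXT, phone TEXT, address TEXT);
        CREATE TABLE patient_health   (pseudonym TEXT, observation TEXT);
    """)

    pseudonym = secrets.token_hex(16)  # random link value, no algorithmic tie to identity
    db.execute("INSERT INTO patient_identity VALUES (?, ?, ?)",
               (pseudonym, "Maria Silva", "12345678"))
    db.execute("INSERT INTO patient_health VALUES (?, ?)",
               (pseudonym, "blood pressure OK"))

    # A clinician profile queries health data only; identification requires a
    # separate, authorized (and auditable) lookup in patient_identity.
    print(db.execute("SELECT pseudonym, observation FROM patient_health").fetchall())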
The use of pseudonymisation allows organizations to mitigate the risk of a personal data breach in activities that involve, for example, application interaction, profiling, direct marketing, or sharing data with external entities within the context of defined personal data processing activities.
Which people / departments / entities should apply them?
These measures should be applied both by the controllers and by their processors (in this case following the instructions of the controller).
The application of anonymisation or pseudonymisation in personal data processing activities must be described in the inventory of processing activities. The procedure to be applied, including the algorithm/tool, must be known by the parties involved and should be validated by the Data Protection Officer (DPO).
In practice, these measures may be carried out by any authorized intervener in the personal data processing. They may also be applied automatically by an algorithm (without human intervention) to record sets that meet defined eligibility rules, e.g. anonymisation of data after the retention period, or pseudonymisation of data records before an international transfer to a processor.
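A minimal sketch of such an automatic eligibility rule, continuing the earlier SQLite style with invented field names and an illustrative retention period: a scheduled job anonymises every record whose retention period has expired, without human intervention:

    import sqlite3
    from datetime import date, timedelta

    RETENTION_DAYS = 365  # illustrative retention period

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE subjects (id INTEGER PRIMARY KEY, name TEXT, created TEXT)")
    db.execute("INSERT INTO subjects VALUES (1, 'Maria Silva', '2020-01-15')")
    db.execute("INSERT INTO subjects VALUES (2, 'John Doe', ?)", (date.today().isoformat(),))

    cutoff = (date.today() - timedelta(days=RETENTION_DAYS)).isoformat()
    # Eligibility rule: anonymise only records older than the retention period.
    db.execute("UPDATE subjects SET name = 'ANONYMISED' WHERE created < ?", (cutoff,))

    print(db.execute("SELECT id, name FROM subjects").fetchall())
    # [(1, 'ANONYMISED'), (2, 'John Doe')]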
Last question: What are the risks and limitations that may exist with the application of anonymisation and pseudonymisation?
The incorrect application of anonymisation may lead to the subsequent re-identification of the data, with an impact on the privacy of the natural persons. It could also lead to the irreparable loss of data: erroneous or excessive anonymisation makes the affected data unavailable when they are needed.
Anonymisation algorithms must be adequately tested in order to guarantee their effectiveness, i.e. that only the records eligible for anonymisation are anonymised, that the integrity of the data not subject to anonymisation is safeguarded, and that the natural persons cannot be re-identified.
In pseudonymisation, it is fundamental that the encryption keys, allocation (lookup) tables, or other “additional information” that allow reversing the identification of natural persons are kept separate and stored securely, to avoid the unauthorized or illegitimate re-identification of the records, which would constitute a personal data breach under the GDPR.
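A minimal sketch of allocation-table pseudonymisation, with invented data: the working dataset carries only random tokens, and the table mapping tokens back to identities is the “additional information” that must be stored and protected separately:

    import secrets

    # Working dataset: personal identifiers replaced by random tokens.
    dataset = []
    # "Additional information": token -> identity mapping, to be stored separately
    # and protected (e.g. encrypted, access-controlled, audited).
    allocation_table = {}

    for name, salary in [("Maria Silva", 3200), ("John Doe", 2900)]:
        token = secrets.token_hex(8)  # no algorithmic link to the name
        allocation_table[token] = name
        dataset.append({"subject": token, "salary": salary})

    print(dataset)  # usable for processing that needs no identification

    # Authorized re-identification uses only the separately stored mapping.
    first = dataset[0]["subject"]
    print(allocation_table[first])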
It is also important to emphasize that a disproportionate use of encryption as a means of pseudonymisation can degrade performance or exhaust the resources of the supporting systems. Encryption algorithms (and the operations of encrypting and decrypting large numbers of records) consume far more resources (processor, memory, and storage) than the normal use of the data in its original, unencrypted format.
From my experience as an external DPO and consultant in information security and data protection, I note that applying anonymisation or pseudonymisation to existing IT systems is a big challenge for organizations and may not even be technically feasible. Some examples of common constraints are: technical limitations and little flexibility to evolve data structures; software packages and existing APIs that are “black boxes” with no possibility of evolution; and the proliferation/duplication of personal data across different repositories (including physical paper records and electronic backups), which not only hinders the pseudonymisation/anonymisation process but could also be used as a source for re-identifying the data in the future.
___________________________________________
¹Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC
²Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data
³As provided for in Article 25(1) of the GDPR: “(…) the controller shall, both at the time of the determination of the means for processing and at the time of the processing itself, implement appropriate technical and organisational measures, such as pseudonymisation, which are designed to implement data-protection principles, such as data minimisation, in an effective manner and to integrate the necessary safeguards into the processing in order to meet the requirements of this Regulation and protect the rights of data subjects. (…)”
⁴In 2002, Latanya Sweeney showed in a research paper that 87% of the US population could be uniquely identified by combining just three indirect identifiers: date of birth, gender, and 5-digit postal (ZIP) code. She demonstrated the attack by linking two supposedly “anonymised” data sources: hospital visit records and public voter registration lists, both of which included those three fields.
⁵Article 5(2) of the GDPR (“Principles relating to processing of personal data”), which establishes the accountability principle
Vitorino Gouveia, Chief Operating Officer (COO), Project Manager, Lead Consultant, University Teacher, Information Security & GDPR Specialist, ISO 27001, GDPR Auditor and Trainer.