Engineering identity from anonymity: Our work on risks of re-identification
As you’ve no doubt read, whether in Algorithm or the national news, data is everywhere. We create and consume data continuously. This data is specific to each of us, but when consolidated it can be of immense value – and not only for population-level insights.
For example, data held by the Australian Government has been identified as a strategic national resource: one that can unlock economic impact, improve the efficiency of service delivery, and fundamentally shape policy outcomes and direction. By extension, data is of considerable value to industry globally. In essence, data matters for everyone.
Conversely, if data has been created by us or is about us, then someone seeing that data might sometimes be able to connect an identity to it. Doing this is called ‘re-identification’, and it’s a process through which very personal and sensitive insights can be drawn about individuals.
This is obviously easy when the data contains personal information such as your name and address. So when datasets with personal information are released publicly, the goal is generally to remove the “personal” from the “information” through a process often called “de-identification”.
De-identification employs statistical techniques such as aggregation, masking or perturbation to group or hide individual details, while still allowing data to be released for the public good or for information sharing. In fact, Data61 has been instrumental in developing de-identification decision-making frameworks in partnership with the Office of the Australian Information Commissioner (OAIC).
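To make those three techniques concrete, here is a minimal sketch in Python using an entirely made-up set of records (the names, postcodes and trip counts are illustrative, not from any real dataset):

```python
import random

# Toy records: (name, age, postcode, trip count) -- all values hypothetical
records = [
    ("Alice", 34, "3000", 12),
    ("Bob",   37, "3000", 8),
    ("Carol", 52, "3121", 15),
    ("Dave",  49, "3121", 3),
]

# Masking: drop the direct identifier (the name field)
masked = [(age, postcode, trips) for _, age, postcode, trips in records]

# Aggregation: generalise exact ages into 10-year bands so individuals blend in
aggregated = [(f"{age // 10 * 10}-{age // 10 * 10 + 9}", postcode, trips)
              for age, postcode, trips in masked]

# Perturbation: add small random noise to the numeric trip count
random.seed(0)
perturbed = [(band, postcode, trips + random.randint(-2, 2))
             for band, postcode, trips in aggregated]

print(perturbed)
```

Real de-identification pipelines combine such steps under a formal framework; the point here is only that each step trades some data utility for reduced detail about individuals.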
But wait a sec, you might be wondering: if your data has been de-identified, doesn’t that mean the information is safe? Does de-identification prevent re-identification?
Currently, data breaches attract more attention than re-identification. However, even after de-identification techniques have been applied, re-identification poses a very real risk. One study showed that around two thirds of the US population can be individually identified from basic data such as gender, date of birth and zip code, all readily available from census data (and, for many, from Facebook). This has been described as the ‘arithmetic of uniqueness’, though there are nuances in these calculations, such as the census roll used in that study not being 100 per cent accurate.
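The arithmetic of uniqueness is easy to demonstrate on synthetic data. The sketch below builds a made-up population (the zip codes and size are arbitrary; the real study used US census records) and measures how many people have a one-of-a-kind (gender, date of birth, zip) combination:

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical synthetic population -- not real census data
population = [
    (random.choice("MF"),
     (random.randrange(1940, 2000), random.randrange(1, 13), random.randrange(1, 29)),
     random.choice(["90210", "10001", "60601", "73301"]))
    for _ in range(10_000)
]

# A person is potentially re-identifiable from these attributes alone when
# their (gender, date of birth, zip) combination occurs exactly once.
counts = Counter(population)
unique_fraction = sum(c for c in counts.values() if c == 1) / len(population)
print(f"{unique_fraction:.0%} unique on (gender, DOB, zip)")
```

The intuition: there are far more possible attribute combinations than people sharing any one zip code, so most individuals end up alone in their combination. Real-world fractions differ because populations are not uniformly distributed, but the effect is the same.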
Re-identification risk is real; in fact, the government has gone as far as to propose jail time for individuals found to be deliberately attempting to re-identify personal information.
So it turns out that “de-identification” might be a misnomer: de-identification, at least as it is often carried out, may not always prevent re-identification.
A case of missed re-identification risk in “de-identified” data was highlighted recently (August 2019), when Public Transport Victoria was found to have breached the Privacy and Data Protection Act after a dataset containing the records of roughly 1.5 billion myki trips was exposed. The release made it possible to re-identify individuals’ travel activity over the preceding three years.
Data experts at CSIRO’s Data61 were consulted on technical aspects of the investigation, with findings revealing that personal information could be obtained from the PTV dataset without expert skills or resources.
“Our research found that when two myki card scans are known by time and stop location, more than three in five of those pairs of scans are unique, and therefore more likely to be personally identifiable,” said Dr Paul Tyler, Data Privacy Team Leader at CSIRO’s Data61. “So-called ‘de-identified’ data can still carry re-identification risk, especially in linked transactional data.”
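That kind of pair-uniqueness measurement can be sketched on synthetic data. The example below invents 5,000 travel cards with six (hour, stop) scans each (the card counts, hour range and stop ids are illustrative assumptions, not drawn from the myki dataset) and counts how many unordered pairs of scans occur on exactly one card:

```python
import random
from collections import Counter
from itertools import combinations

random.seed(2)

# Hypothetical synthetic travel data: 5,000 cards, six scans per card.
# Each scan is an (hour-of-day, stop id) tuple; ranges are illustrative only.
cards = {
    card: [(random.randrange(5, 24), random.randrange(200)) for _ in range(6)]
    for card in range(5_000)
}

# Count how often each unordered pair of scans occurs across all cards.
pair_counts = Counter()
for scans in cards.values():
    pair_counts.update(combinations(sorted(scans), 2))

total = sum(pair_counts.values())
unique = sum(c for c in pair_counts.values() if c == 1)
print(f"{unique / total:.0%} of scan pairs belong to exactly one card")
```

Because the space of possible scan pairs is enormous compared with the number of cards, most pairs land on a single card: knowing just two of someone’s trips can be enough to single out their whole travel history.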
Data61 is looking at re-identification from a new angle. As part of our privacy investment, we have built the Re-identification Risk Ready Reckoner (R4), based on ground-breaking research from Data61’s Information Security and Privacy Group. We’ve translated theoretical research into a working dashboard that allows data custodians to understand the re-identification risk of a dataset, and then offers options for mitigating that risk. R4 was used to analyse the PTV dataset.
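R4’s methodology is its own; but as a generic illustration of the quantify-then-mitigate loop, one simple proxy for re-identification risk is k-anonymity: the size of the smallest group of records sharing the same quasi-identifier values. The sketch below (hypothetical ages and postcodes, not the R4 algorithm) measures k, applies a generalisation option, and measures again:

```python
from collections import Counter

# k-anonymity proxy: size of the smallest group of records that share the
# same quasi-identifier values. k = 1 means someone stands alone.
def smallest_group(rows):
    return min(Counter(rows).values())

# Hypothetical quasi-identifiers: (age, postcode)
raw = [(34, "3000"), (37, "3000"), (52, "3121"), (49, "3121"), (34, "3053")]
k_before = smallest_group(raw)  # every record is alone in its group

# One mitigation option: generalise age into 20-year bands and
# postcode to its first two digits, then re-measure the risk.
generalised = [(age // 20 * 20, pc[:2]) for age, pc in raw]
k_after = smallest_group(generalised)

print(k_before, k_after)
```

The trade-off is visible directly: generalising raises k (lower risk) at the cost of coarser, less useful data, which is exactly the balance a data custodian has to weigh.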
There’s no silver bullet for privacy. Each technique involves its own nuanced trade-offs between the utility of the data and the privacy of the people it relates to. Our work on re-identification aims to both quantify and mitigate the risks involved.