Engineering identity from anonymity: Our work on risks of re-identification

By Jane Scowcroft, May 25th, 2018

As you’ve no doubt read, either through Algorithm or the national news, data is everywhere. We create and consume data continuously. This data is specific to us, but when consolidated it can be of immense value, and not only for population-level insights.

An example: data held by the Australian Government has been identified as a strategic national resource, one that can unlock economic impact, drive service delivery efficacy and fundamentally shape policy outcomes and direction. By extension, we know that data is of considerable value for industry globally. In essence, data matters for everyone.

Conversely, if data has been created by us, or is about us, then someone seeing that data might be able to draw our identity from it. Doing this is called ‘re-identification’, and it’s a process through which very personal and sensitive insights can be drawn about individuals.

This is obviously easy when the data has some personal information such as your name and address. So generally, when datasets with personal information are released publicly, there is a goal of removing the “personal” from the “information” through de-identification.

It’s likely that you’ve heard about de-identification of data, whereby various statistical techniques such as aggregation, masking or perturbation are used to group or hide individual details, while still allowing data to be released for the public good or for information sharing. In fact, Data61 has been instrumental in the development of de-identification decision-making frameworks in partnership with the Office of the Australian Information Commissioner (OAIC).
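To make those three techniques concrete, here is a minimal sketch in Python. The record fields, values and noise range are illustrative assumptions, not taken from any real dataset or from Data61's frameworks:

```python
import random

# Toy records; field names and values are hypothetical.
records = [
    {"name": "Alice", "age": 34, "postcode": "2601", "income": 72000},
    {"name": "Bob",   "age": 36, "postcode": "2612", "income": 55000},
]

def deidentify(record, rng):
    return {
        "name": "***",                               # masking: hide the direct identifier
        "age_band": f"{record['age'] // 10 * 10}s",  # aggregation: exact age -> decade band
        "region": record["postcode"][:2] + "xx",     # aggregation: postcode -> coarse region
        # perturbation: add random noise to a sensitive numeric value
        "income": record["income"] + rng.randint(-5000, 5000),
    }

rng = random.Random(0)
released = [deidentify(r, rng) for r in records]
print(released)
```

Each transformation trades away some utility (you can no longer recover exact ages or incomes) in exchange for making individual records harder to single out.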

But wait a sec, you might be wondering: if your data has been de-identified, doesn’t that mean the information is safe? Does de-identification prevent re-identification?

Currently, data breaches are getting more attention than re-identification. However, even after de-identification techniques have been applied, re-identification poses a very real risk. One study showed that two thirds of the US population can be individually identified from basic data such as gender, date of birth and zip code, all readily available from census data, and for many, from Facebook. This has been described as the ‘arithmetic of uniqueness’, though there are some nuances with these calculations, such as the census roll in the aforementioned study not being 100 per cent accurate.
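The arithmetic itself is easy to demonstrate: count how many records are unique on the quasi-identifier combination. The sketch below uses a tiny made-up sample, not census data, purely to show the measurement:

```python
from collections import Counter

# Hypothetical released rows: (gender, date of birth, postcode).
# No names appear, yet most rows are still unique on this combination.
rows = [
    ("F", "1984-03-02", "2601"),
    ("M", "1984-03-02", "2601"),
    ("F", "1990-07-15", "2612"),
    ("F", "1990-07-15", "2612"),
    ("M", "1971-11-30", "2600"),
]

# Count occurrences of each quasi-identifier combination,
# then measure the fraction of rows that occur exactly once.
counts = Counter(rows)
unique_fraction = sum(1 for r in rows if counts[r] == 1) / len(rows)
print(f"{unique_fraction:.0%} of rows are unique on (gender, DOB, postcode)")
```

A row that is unique in the released data can be matched against any outside source that records the same attributes, which is exactly how the studies above link “anonymous” records back to named individuals.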

Re-identification risk is real; in fact, the government has gone as far as to suggest jail time for individuals found to be deliberately attempting to re-identify personal information.

Data61 is looking at re-identification from a new angle. As part of our privacy investment, we have built the Re-identification Risk Ready Reckoner (R4), based on ground-breaking research from Data61’s Information Security and Privacy Group. We’ve translated theoretical research into a working dashboard that allows data custodians to understand the re-identification risk of a data set, and then provides options on how to mitigate that risk.

An example of the R4 re-identification risk dashboard, via Data61’s Bill Simpson-Young © Data61

There’s no silver bullet for privacy. Each technique involves its own nuanced trade-offs between the utility of the data and the privacy of the people it relates to. Our work on re-identification aims to both quantify and mitigate the risks involved.