Transform, perturb, pollute
There’s a fragile balance between the usability of data and the privacy of the people to whom it refers. At CSIRO’s Data61, we’re focused on finding ways to ensure data can remain private without sacrificing too much of its usefulness.
One particularly clever way to do this is to ‘perturb’ data. This means transforming it enough that an individual’s data is polluted with noise, but not so much that analytics on the whole data set return a different result. We’re also working on ensuring this is ‘provable’ – that we can quantify risks and back up our claims with mathematical testing.
The Algorithm editorial team spoke with Data61’s Dali Kaafar, privacy expert and Group Leader of the Information Security & Privacy Group, and Hassan Asghar, Research Scientist in the same Group, about this privacy-preserving technique.
Q: What exactly does privacy protection mean?
A: The fundamental question is: ‘how do we allow data analysis while keeping people’s data private?’ The answer lies in how we define privacy. One way to define it is to consider that any data that is specific to an individual or a small group of individuals is private. The rest is a shared characteristic of the dataset or population or survey.
Another important consideration is the notion of uncertainty. If you look it up in the dictionary, you will see that privacy involves being free from disturbance. We could then define privacy as the level of uncertainty that remains for anyone trying to guess additional information about you.
For instance, if there is a very low likelihood that statistics about movements around a city would reveal individual routes of travel, then you may consider these statistics to be private. However, if these statistics, combined with some other public knowledge (say, knowing that you leave your house every day at 7 am sharp), would isolate your route of travel with high probability, then your privacy is breached.
Once we define privacy in this way, we can see how to ensure the privacy of one’s data while still allowing useful analysis. All we need to do is ensure that the result of any analysis obfuscates the data of a single individual. Differential privacy, a theory of privacy, allows us to construct algorithms that achieve just that. One common technique is perturbation of the statistics, or of the process used to extract them.
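For readers who want the formal statement, the standard definition of ε-differential privacy can be written as below. The symbols are not from the interview and are introduced here only for illustration: M is a randomised algorithm, D and D' are any two datasets that differ in one individual's data, S is any set of possible outputs, and ε is the privacy parameter (smaller values mean stronger privacy).

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
```

In words: any result of the analysis is almost as likely whether or not one person's data is included, so the result reveals very little about that person.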
Q: What’s ‘perturbation’?
A: Perturbation is the process of obfuscating the data. To understand perturbation, let’s look at the analysis questions that are asked about data. For example, a question on a hospital database could be: how many patients hospitalised today were emergency admissions? To provide privacy, all we need to know is how much the answer can change if we change the status of one admission.
The answer changes by at most one. Thus, we perturb the answer by adding controlled noise to the true answer, which hides the inclusion or exclusion of one emergency admission, and hence one patient. We don’t need to investigate changes in analysis to prove privacy; we just need to know we’ve masked the presence of that one admission.
Privacy is guaranteed by perturbation scaled to the question, independent of data. This is the key feature of differential privacy: the ability to separate the privacy process from actual data.
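The hospital example can be sketched in a few lines of Python. This is a minimal illustration, not Data61's implementation: it assumes the noise is drawn from a Laplace distribution (a common choice for counting queries, but not one named in the interview), and the record layout, the function names and the value of epsilon are all hypothetical. Note that the noise scale is computed from the question (a count changes by at most one) and the privacy parameter, never from the data itself.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two independent exponential draws, each with mean
    # `scale`, follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(records, predicate, epsilon, rng):
    """Answer a counting query with noise scaled to the question, not the data."""
    true_count = sum(1 for record in records if predicate(record))
    # Changing one admission changes a count by at most one.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical data: 200 admissions today, 50 of them emergencies.
admissions = [{"emergency": i % 4 == 0} for i in range(200)]

rng = random.Random(0)  # seeded only so the sketch is repeatable
answer = noisy_count(admissions, lambda r: r["emergency"], epsilon=0.5, rng=rng)
print(round(answer, 1))
```

A smaller epsilon gives a larger noise scale and so stronger masking of any one admission, at the cost of a less accurate count.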
Q: Doesn’t changing data change any analysis you would do on a data set?
A: It depends why you’re doing the analysis. If the purpose is to learn statistical properties of the dataset, then the result of the analysis does not change (by much). On the other hand, specific questions on a small number of individuals will be overwhelmed by noise. If your intentions are broad, you’re fine. If your intentions are to figure out information about individuals, you won’t get far.
Suppose we want to know the result of a referendum within an organisation, which is a yes or no vote. What’s the percentage of people who voted yes? Suppose the true answer is 85%.
Using perturbation (changing the vote of one employee), we might get an answer which is 85.1%. In a dataset of 1,000 employees, this change is small. We still learn the majority voted yes.
What if we ask, “How many employees with a given first name voted ‘yes’?” The true answer is likely to be low, say 0.1%. Perturbation changes that to 0.2%, overwhelming the true answer. Here’s the thing: the true answer here is of no statistical value. Hence, privacy overwhelms utility. As a side note, the perturbation parameters can be public (with no loss in privacy), meaning that an analyst knows how much the data is likely to be perturbed and can factor this into the analysis.
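The side note about public parameters can be made concrete. The sketch below assumes, purely for illustration, that the noise is Laplace with a publicly known scale of 1 (the interview does not specify the distribution or its parameters), and it reuses the referendum numbers: 1,000 employees, 850 'yes' votes, and a single 'yes' voter with the given first name. Knowing the scale, an analyst can work out how far the noise is likely to move an answer and compare that with the size of the answer itself.

```python
import math

def noise_bound(scale, confidence):
    """Half-width t such that |Laplace(0, scale)| <= t with the given probability."""
    # For Laplace noise, P(|X| > t) = exp(-t / scale), so t = -scale * ln(1 - confidence).
    return -scale * math.log(1.0 - confidence)

yes_votes = 850      # 85% of 1,000 employees
name_matches = 1     # 0.1% of 1,000 employees (hypothetical single match)
scale = 1.0          # assumed public noise scale, not taken from the interview

bound = noise_bound(scale, 0.95)
for label, count in [("all yes votes", yes_votes), ("yes votes for one name", name_matches)]:
    relative = bound / count
    print(f"{label}: true count {count}, 95% noise bound +/-{bound:.1f}, "
          f"relative error up to {relative:.1%}")
```

Under these assumptions the noise stays within about three votes 95% of the time. That is a fraction of a percent of the broad count, but several times larger than the narrow one, which is the sense in which privacy overwhelms utility for questions about individuals.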
Q: Can we see an example of this in practice?
A: Yes. The federal government’s Priority Investment Approach (PIA) to welfare is an effort to find ways to lower welfare dependency through the analysis of long-term data. Obviously, this is deeply sensitive information, so having provable privacy preservation is important. We released a synthetic version of this dataset, which maintains lower order statistics of the dataset. That includes correlations and longitudinal patterns.
Q: Interesting – what’s another example?
A: In 2017, we worked with Transport for NSW to release a data set of Opal trips, so that researchers, analysts and developers could learn patterns in transport use without reducing the privacy protections of Opal card users. From the dataset we can learn the tap-on and tap-off times and locations, giving us a privacy-preserving spatio-temporal output. We learn how heavily different stations, bus stops and so on are used at different times of the day. On the other hand, our privacy treatment (via differential privacy) ensures that specific information about any given individual’s movements is not leaked.
It’s close to impossible to opt out of being part of some data set in contemporary society, whether it’s held by a business, a government or a mobile phone contact list on your friend’s device. The nuances of risk and probability, and the many difficult angles from which one can approach the concept of privacy, mean that a rigorous, technical and mathematically supported approach to these solutions is beneficial.
The method of using noise and obfuscation is a valuable capability in a large toolbox of approaches available for businesses and governments grappling with the challenges of big data in 2018.