How to measure privacy in data science.

When collecting, publishing, or working with user-related data, the notion of privacy is crucial, and data scientists work with users’ data all the time. Quantifiable measures of privacy enable us to assess, compare, improve, and optimize privacy-enhancing mechanisms.

In this domain, there is no agreed-upon set of conditions that a privacy metric must satisfy, so we can consider any quantity that describes a level of privacy.

In this story, we will describe uncertainty-based privacy metrics (anonymity set size, Shannon’s entropy, normalized Shannon’s entropy), time-based privacy metrics, and data-similarity-based privacy metrics (k-anonymity, l-diversity, t-closeness), and briefly mention indistinguishability-based privacy metrics.

Uncertainty-based metrics assume that low uncertainty in the adversary’s estimate correlates with low privacy: the less certain the adversary is about a piece of information, the more private that information remains. Following this idea, we can introduce multiple metrics.

Anonymity set size is defined as follows: in a set containing many items, it is the size of the subset of items that an adversary cannot distinguish from a given item. For example, if the system contains 100 items and the adversary can narrow them down to at most 10 indistinguishable candidates, then the anonymity set size is 10. This metric is simple and easy to track, but its main downside is that it depends on the number of items/members in the system.
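As a concrete illustration (the numbers below are hypothetical), the metric is simply the cardinality of the candidate set the adversary is left with:

```python
# All items in the system.
items = set(range(100))

# After the adversary applies side information, only these remain
# indistinguishable from the target (hypothetical example).
candidates = {3, 17, 42, 56, 61, 70, 78, 81, 90, 99}

anonymity_set_size = len(candidates)
print(anonymity_set_size)  # 10
```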

Shannon’s entropy measures the uncertainty associated with predicting the outcome of a random variable. It is more mathematical and is used in many domains. In an anonymity set, the adversary tries to learn which member performed a certain action. If we consider {x1, …, xn} as an anonymity set and p(xi) the probability, as estimated by the adversary, that xi is the item that performed the action, the attacker’s goal can be modelled as predicting the outcome of a random variable X distributed according to p. The Shannon entropy is then defined as H(X) = −∑ p(xi) log2 p(xi).
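The formula, along with the normalized variant mentioned earlier (entropy divided by its maximum, log2 n, so the result lies in [0, 1]), can be sketched in a few lines of Python:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H(X) = -sum(p * log2(p)) of the adversary's
    probability estimates over the anonymity set."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def normalized_entropy(probs):
    """Entropy divided by its maximum log2(n), so the result is in [0, 1];
    1 means the adversary is maximally uncertain (uniform distribution)."""
    n = len(probs)
    if n <= 1:
        return 0.0
    return shannon_entropy(probs) / math.log2(n)

# A uniform distribution over 8 members gives maximum uncertainty:
uniform = [1 / 8] * 8
print(shannon_entropy(uniform))    # 3.0 bits
print(normalized_entropy(uniform)) # 1.0
```

If the adversary is certain (one probability equal to 1), the entropy drops to 0, matching the intuition that certainty means no privacy.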


With time-based metrics, we measure privacy by the amount of time an adversary would have to spend to compromise user privacy. For example, an adversary may need 100 years to compromise a CIA database but only a few hours for a simple website; the CIA database therefore protects user privacy better than that random website. Two variants are common here:

- Time until adversary’s success: the maximum time required by an adversary to break the anonymity of a system with high probability.
- Maximum tracking time: the longest period during which an adversary can continuously track a user.

Data-similarity-based metrics are mainly used in the context of databases and measure the privacy of observable (for example, available through an API) or published data.

k-anonymity ensures that, in a dataset, any item is indistinguishable from at least k−1 other items. It can be used, for example, to anonymise data before publication.

**Source**: Parra-Arnau, Arias-Cabarcos, Strufe: Privacy-Enhancing Technologies – Metrics

In this dataset, for example, the identifying attribute directly identifies an individual, so it must be removed to protect privacy. Other attributes, called quasi-identifiers, could be used to reconstruct the original data, so they need to be modified as well in order to protect privacy.

For that purpose, we may choose a number such as 3 and make each group of 3 rows indistinguishable from one another; that gives 3-anonymity. To achieve it, you may need to remove some data, as in this image for example:

Or completely remove a column.

The final data may look like this:

But we still have a problem. The sensitive attribute is identical within the first anonymised group, so we can infer that every record matching that group has cancer, and thus learn Alice’s condition from her quasi-identifiers alone.
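Checking whether a table satisfies k-anonymity amounts to counting how many rows share each combination of quasi-identifier values. A minimal sketch (the column names and records are hypothetical):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k rows of the dataset."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

# Hypothetical anonymised records: age bracket and ZIP prefix are the
# quasi-identifiers, "condition" is the sensitive attribute.
rows = [
    {"age": "20-30", "zip": "130**", "condition": "Cancer"},
    {"age": "20-30", "zip": "130**", "condition": "Cancer"},
    {"age": "20-30", "zip": "130**", "condition": "Cancer"},
    {"age": "30-40", "zip": "148**", "condition": "Flu"},
    {"age": "30-40", "zip": "148**", "condition": "Healthy"},
    {"age": "30-40", "zip": "148**", "condition": "Cancer"},
]
print(is_k_anonymous(rows, ["age", "zip"], 3))  # True
print(is_k_anonymous(rows, ["age", "zip"], 4))  # False
```

Note that these records are 3-anonymous, yet the first group still leaks the sensitive value, which is exactly the weakness described above.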

l-diversity comes to solve the problem introduced earlier with k-anonymity. The idea is that each k-anonymous class must contain at least l well-represented sensitive values, so the records must be regrouped so that no class reveals a single sensitive value, as in this example:
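This condition can also be checked mechanically by collecting the distinct sensitive values per equivalence class; a minimal sketch with hypothetical records (the first class fails because it holds a single sensitive value):

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    """Return True if each equivalence class (rows sharing the same
    quasi-identifier values) contains at least l distinct values of
    the sensitive attribute."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].add(row[sensitive])
    return all(len(values) >= l for values in groups.values())

# Hypothetical 3-anonymous records: the first class exposes everyone's
# condition because it contains only one sensitive value.
rows = [
    {"age": "20-30", "zip": "130**", "condition": "Cancer"},
    {"age": "20-30", "zip": "130**", "condition": "Cancer"},
    {"age": "20-30", "zip": "130**", "condition": "Cancer"},
    {"age": "30-40", "zip": "148**", "condition": "Flu"},
    {"age": "30-40", "zip": "148**", "condition": "Healthy"},
    {"age": "30-40", "zip": "148**", "condition": "Cancer"},
]
print(is_l_diverse(rows, ["age", "zip"], "condition", 2))  # False
```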

We also have 𝒕-closeness, which requires the distribution of the sensitive attribute within each small group to be close to its distribution over the entire dataset.
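t-closeness is usually defined with the Earth Mover’s Distance; as a simpler proxy for categorical attributes, the sketch below uses total variation distance between each class’s distribution and the overall one (records and column names are hypothetical):

```python
from collections import Counter, defaultdict

def is_t_close(rows, quasi_identifiers, sensitive, t):
    """Return True if, in every equivalence class, the distribution of
    the sensitive attribute is within total variation distance t of its
    distribution over the whole table. (The canonical definition uses
    the Earth Mover's Distance; total variation is a simpler proxy.)"""
    n = len(rows)
    overall = Counter(row[sensitive] for row in rows)
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row[sensitive])
    for values in groups.values():
        local = Counter(values)
        dist = 0.5 * sum(abs(local[v] / len(values) - overall[v] / n)
                         for v in overall)
        if dist > t:
            return False
    return True

# Hypothetical 3-anonymous records from before:
rows = [
    {"age": "20-30", "zip": "130**", "condition": "Cancer"},
    {"age": "20-30", "zip": "130**", "condition": "Cancer"},
    {"age": "20-30", "zip": "130**", "condition": "Cancer"},
    {"age": "30-40", "zip": "148**", "condition": "Flu"},
    {"age": "30-40", "zip": "148**", "condition": "Healthy"},
    {"age": "30-40", "zip": "148**", "condition": "Cancer"},
]
print(is_t_close(rows, ["age", "zip"], "condition", 0.4))  # True
print(is_t_close(rows, ["age", "zip"], "condition", 0.3))  # False
```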

These methods still have problems: background knowledge can help re-identify someone within a group, choosing k in k-anonymity is hard, and the more we change the data, the more information we lose.

With that said, there are still more metrics, and the domain is constantly evolving with ongoing research. We can also list indistinguishability-based privacy metrics, which we haven’t covered here. Much more detailed treatments of these concepts can be found online and in the respective research papers discussing them.

We have the knowledge and the infrastructure to build, deploy, and monitor AI solutions for any of your needs.

Contact us