Introduction to Privacy in Machine Learning


Wherever there is data, privacy is never far away (no joke). Every data scientist has heard the word “data” at least 10,000 times, so privacy is a concept they have to be aware of, whether they work with publicly collected or available data or with private, undisclosed data.


In this story, I will introduce some privacy concepts related to data manipulation and dataset publication, and what you as a data scientist should know about privacy.

Let’s start by defining what privacy is. According to Wikipedia, privacy is the ability of an individual or group to seclude themselves or information about themselves and thereby express themselves selectively.

But in the context of technology, this term is a bit more challenging to define. Let’s go back to 1890, with:

Samuel Warren, Louis Brandeis: “The Right to Privacy”, Harvard Law Review, Vol. IV, No. 5, 15th December 1890

Back then, snapshot photography allowed newspapers to publish people’s pictures without their consent, and individuals were harmed as a result. So there was a need for a law protecting individuals who wanted to stay private and alone. The conclusion of the article was this right to be let alone. Publishing pictures without consent was, in a way, the first breach of privacy.

Then in 1967, Alan Westin came up with this definition of privacy:

“the claim of individuals … to determine for themselves when, how, and to what extent information about them is communicated to others.”

— Alan Westin (1967)

Finally, in 2004, Helen Nissenbaum’s Privacy as Contextual Integrity gave us this view of privacy: data is shared with a specific mindset in a particular context.


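With this concept, an information flow is described by five parameters (sender, subject, information type, recipient and transmission principle), and a violation of privacy is a change in one of these parameters. For example, a normal flow could be: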

  • Sender: Google Analytics
  • Subject: the user visiting the website
  • Information type: browsing data
  • Recipient: the website owner
  • Transmission principle: consent (this data must be used to show website traffic to the website owner)

A possible violation is Google changing the recipient, for example by selling the data to third parties for advertising. This change may be considered a threat to user privacy.

When working with data, or writing laws related to data manipulation, you should keep this notion in mind, along with many other privacy-related concepts.

Another example is this one from Wikipedia:

  • Sender: a US resident
  • Subject: the same US resident
  • Information type: tax information
  • Recipient: the US Internal Revenue Service
  • Transmission principle: the recipient will hold the information in strict confidentiality.

A possible violation would be the US Internal Revenue Service sharing this information with a bank, for example.
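To make the idea concrete, here is a minimal Python sketch of a flow and of a check for changed parameters. The class name `InformationFlow` and the helper `changed_parameters` are hypothetical names chosen for illustration, and the observed flow is a simplification in which only the recipient differs from what was agreed.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class InformationFlow:
    """One information flow, described by the five parameters (hypothetical class)."""
    sender: str
    subject: str
    information_type: str
    recipient: str
    transmission_principle: str


def changed_parameters(expected: InformationFlow, observed: InformationFlow) -> list[str]:
    """Return the names of the parameters that differ between two flows."""
    return [
        f.name
        for f in fields(InformationFlow)
        if getattr(expected, f.name) != getattr(observed, f.name)
    ]


# The flow the US resident expects (the tax example).
expected_flow = InformationFlow(
    sender="US resident",
    subject="US resident",
    information_type="tax information",
    recipient="US Internal Revenue Service",
    transmission_principle="strict confidentiality",
)

# Assumed observed flow: the same data ends up at a bank instead.
observed_flow = InformationFlow(
    sender="US resident",
    subject="US resident",
    information_type="tax information",
    recipient="bank",
    transmission_principle="strict confidentiality",
)

violations = changed_parameters(expected_flow, observed_flow)
print(violations)  # ['recipient']
```

Any non-empty result means the observed flow no longer matches the context in which the data was shared. A real system would need a richer notion of "same recipient" than string equality.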

Informational Self-Determination

Next comes the notion of informational self-determination, built on the important idea that:

the sovereign (self-determined) citizen controls collection, use, and can effectively retract even previously openly published data, upon change of mind.

So individuals have the right to decide independently and freely what happens to their personal data, and when and for what purposes it may be used. They have the right to have all their personal information deleted by any third party, and the right to choose freely what they want to share and what they don’t.

The fundamental principles of data processing are laid down in Article 5 of the GDPR:

  • Personal data must be collected and processed fairly and lawfully.
  • Data must be collected only for one or more specified, explicit and lawful purposes, and must not be used for purposes other than those stated at collection time and in the consent given by the end user.
  • Only the data strictly necessary for the specified purpose may be kept, and for no longer than necessary.
  • The sender (user) must be informed about who collects which data for which purposes, and how the data is processed, stored, forwarded, etc.
  • The user has the right to access the data, change it, or delete it.
  • The recipient must keep the shared data safe and secure.

With all those principles, designing a data-related application can be challenging, but it is important for both the data scientist and the end user.
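Two of these principles, using data only for the stated purpose and keeping it no longer than necessary, can be sketched as a simple check before any processing step. This is only an illustration under stated assumptions: the `ConsentRecord` schema, the `may_process` helper and the purpose names are hypothetical, and real GDPR compliance involves far more than this.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ConsentRecord:
    """What the user agreed to at collection time (hypothetical schema)."""
    purposes: frozenset[str]
    collected_on: date
    retention: timedelta


def may_process(consent: ConsentRecord, purpose: str, today: date) -> bool:
    """Allow processing only for a consented purpose and within the retention period."""
    within_purpose = purpose in consent.purposes
    within_retention = today <= consent.collected_on + consent.retention
    return within_purpose and within_retention


# Assumed example: consent for traffic statistics only, kept for one year.
consent = ConsentRecord(
    purposes=frozenset({"traffic_statistics"}),
    collected_on=date(2023, 1, 1),
    retention=timedelta(days=365),
)

print(may_process(consent, "traffic_statistics", date(2023, 6, 1)))  # True
print(may_process(consent, "advertising", date(2023, 6, 1)))         # False
print(may_process(consent, "traffic_statistics", date(2025, 1, 1)))  # False
```

The second call fails because advertising was never a consented purpose, and the third because the retention period has expired.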

Data Publication and Privacy Threats

It is common for companies to publish datasets online for public use, or to open part of their system to a third party with some protection. But this practice can lead to many privacy problems. For example, Netflix shared a dataset of users’ movie ratings on its platform for a competition to improve its recommendation system. But researchers showed that this data, linked with public IMDb data, can help identify individuals and their movie preferences. (No one wants to publish every movie they have watched, mainly because it can say a lot about their personality, sexual preferences, …)

That leads us to two types of disclosure:

  • Disclosure of identity: being able to identify an individual in a dataset and get some private information out of it.
  • Disclosure of attributes: being able to reconstruct a (hidden) attribute or link additional information to an identity.

One example is re-identification, where an adversary is able to merge different public datasets and identify individuals from the result, which is a threat to privacy. Example: a public anonymized location dataset combined with geo-located Twitter posts [identity disclosure].
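A toy version of such a linkage can be sketched in a few lines of Python. Everything here is made up for illustration: the records, the field names, the `link_records` helper and the `min_matches` threshold are assumptions, and real attacks work on much larger and noisier data.

```python
# "Anonymized" location dataset: names were replaced by pseudonyms (made-up data).
anonymized_locations = [
    {"pseudonym": "u1", "place": "cafe_A", "hour": "2023-05-01T09"},
    {"pseudonym": "u1", "place": "gym_B", "hour": "2023-05-01T18"},
    {"pseudonym": "u2", "place": "cafe_A", "hour": "2023-05-01T12"},
]

# Public geo-located posts with visible account names (made-up data).
public_posts = [
    {"account": "@alice", "place": "cafe_A", "hour": "2023-05-01T09"},
    {"account": "@alice", "place": "gym_B", "hour": "2023-05-01T18"},
    {"account": "@bob", "place": "park_C", "hour": "2023-05-01T12"},
]


def link_records(locations, posts, min_matches=2):
    """Map pseudonyms to accounts sharing at least `min_matches` (place, hour) points."""
    # Index the public posts by (place, hour) for fast lookup.
    posts_by_point = {}
    for post in posts:
        point = (post["place"], post["hour"])
        posts_by_point.setdefault(point, set()).add(post["account"])

    # Count how often each (pseudonym, account) pair was at the same point.
    match_counts = {}
    for row in locations:
        for account in posts_by_point.get((row["place"], row["hour"]), ()):
            key = (row["pseudonym"], account)
            match_counts[key] = match_counts.get(key, 0) + 1

    # Keep only pairs with enough co-occurrences to be convincing.
    return {
        pseudonym: account
        for (pseudonym, account), count in match_counts.items()
        if count >= min_matches
    }


reidentified = link_records(anonymized_locations, public_posts)
print(reidentified)  # {'u1': '@alice'}
```

Removing names was not enough here: two shared (place, hour) points were sufficient to tie pseudonym `u1` to a public account. For simplicity, the sketch keeps a single account per pseudonym.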

It can also be identity theft: people could impersonate you or log in to some system with your information because they know your personal details.


So privacy that protects individuals without sacrificing a system’s functionality is really important for day-to-day apps, particularly data science systems, as they depend 100% on data.

With that said, it is still hard to measure privacy, as it is a really abstract notion. A way to measure it would be valuable, mainly for systems that process data and for companies willing to publish datasets, because it can help determine to what extent collected or published data protects users’ privacy. We will come back to this notion in upcoming stories.
