Whenever we hear about data, privacy is never far away (no joke). And every data scientist has heard the word "data" at least 10,000 times. So privacy is a concept they have to be aware of when working with publicly collected or available data, as well as with private, undisclosed data.
In this story, I will introduce some privacy concepts related to data manipulation and dataset publication, and what you as a data scientist should know about privacy.
Let’s start by defining what privacy is. According to Wikipedia, privacy is the ability of an individual or group to seclude themselves or information about themselves and thereby express themselves selectively.
But in the scope of technology, this term is a bit more challenging to define. Let’s go back to 1890 with
Samuel Warren, Louis Brandeis: “The Right to Privacy”, Harvard Law Review, Vol. IV, No. 5, 15th December 1890
Back then, snapshot photography allowed newspapers to publish people’s pictures without their consent, and that harmed the individuals concerned. So there was a need for legal protection for individuals who wanted to stay private and alone. The conclusion the article reached was this right to be let alone, in response to what was arguably one of the first breaches of privacy caused by technology.
Then in 1967, Alan Westin came up with this definition of privacy:
“the claim of individuals … to determine for themselves when, how, and to what extent information about them is communicated to others.”
— Alan Westin (1967)
Finally, in 2004, Helen Nissenbaum’s “Privacy as Contextual Integrity” gave us this view of privacy: data is shared with a specific mindset in a particular context.
Pic by https://twitter.com/ynotez/status/1250578500588879873/photo/1
With this concept, a violation of privacy is a change in one of the parameters of an information flow (sender, recipient, subject, information type, transmission principle). For example, a normal flow could be:
A possible violation is Google changing the recipient, for instance by selling the data to other parties for advertising. This change may be considered a threat to user privacy.
When working with data, or writing laws related to data manipulation, you should keep this notion in mind, along with many other privacy-related concepts.
Another example is this one from Wikipedia:
A possible violation here would be, for example, the US Internal Revenue Service sharing this information with a bank.
Then there is the notion of informational self-determination, with the important idea that:
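To make the idea of "a change in one of the parameters" more concrete, here is a minimal sketch in Python. It assumes the five parameters usually associated with contextual integrity (sender, recipient, subject, information type, transmission principle). The `InformationFlow` class, the `changed_parameters` helper and the example values are all hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class InformationFlow:
    """One information flow, described by five contextual parameters."""
    sender: str
    recipient: str
    subject: str
    information_type: str
    transmission_principle: str


def changed_parameters(norm, actual):
    """Return the names of the parameters where `actual` departs from `norm`."""
    return [
        f.name
        for f in fields(norm)
        if getattr(norm, f.name) != getattr(actual, f.name)
    ]


# Hypothetical expected flow (illustrative values, not taken from a real policy).
expected = InformationFlow(
    sender="user",
    recipient="search engine",
    subject="user",
    information_type="search query",
    transmission_principle="used only to return search results",
)

# Hypothetical observed flow: same data, but it ends up with someone else.
observed = replace(expected, recipient="advertiser")

violations = changed_parameters(expected, observed)
print(violations)  # ['recipient']
```

An empty list means the observed flow matches the expected context. Any non-empty list points to the parameter that changed, which is where a possible privacy violation should be examined.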
the sovereign (self-determined) citizen controls collection, use, and can effectively retract even previously openly published data, upon change of mind.
So individuals have the right to decide independently and freely what happens to their personal data, and when and for what purpose it may be used. They have the right to have all their personal information deleted by any third party. They also have the right to choose freely what they want to share and what they don’t.
Article 5 of the GDPR lays down the fundamental principles of processing: lawfulness, fairness and transparency; purpose limitation; data minimisation; accuracy; storage limitation; integrity and confidentiality; and accountability.
With all those principles, designing a data-related application can be challenging, but getting it right matters to both the data scientist and the end user.
It is common for companies to publish datasets online for public use, or to open part of their system to a third party with some protection. But this practice can lead to many privacy problems. For example, Netflix shared a dataset of its users’ movie ratings for a competition to improve its recommendation system. But researchers showed that this data, linked with public IMDb data, can help identify individuals and their movie preferences (no one wants the full list of movies they watch to be published, mainly because it can say a lot about your personality, sexual preferences, …).
That leads us to two types of disclosure:
It can be, for example, a re-identification case, where an adversary is able to merge different public datasets and identify individuals from them, which is a threat to privacy. Ex: a public anonymized location dataset combined with geo-located Twitter posts [identity disclosure].
It can also be identity theft. People could impersonate you or log in to some system as you because they know your personal information.
So privacy that protects individuals without sacrificing the functionality of a system is really important for day-to-day apps, and particularly for data science systems, as they depend 100% on data.
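The re-identification case described above can be sketched with a few lines of Python. This is only a toy illustration under stated assumptions: the two datasets, the field names and the `link` function are made up for the example, and a match on at least two (movie, date) pairs is an arbitrary threshold. It is not the method used in the actual Netflix study.

```python
from collections import defaultdict

# Toy "anonymized" release: names are removed, but ratings and dates remain.
anonymized_ratings = [
    {"user_id": "u1", "movie": "Movie A", "date": "2005-03-01"},
    {"user_id": "u1", "movie": "Movie B", "date": "2005-03-04"},
    {"user_id": "u1", "movie": "Movie C", "date": "2005-04-10"},  # never posted publicly
    {"user_id": "u2", "movie": "Movie A", "date": "2005-03-02"},
    {"user_id": "u2", "movie": "Movie D", "date": "2005-05-01"},
]

# Toy public data: reviews posted under real names on another site.
public_reviews = [
    {"name": "Alice", "movie": "Movie A", "date": "2005-03-01"},
    {"name": "Alice", "movie": "Movie B", "date": "2005-03-04"},
    {"name": "Bob", "movie": "Movie A", "date": "2005-03-02"},
]


def link(anonymized, public, min_matches=2):
    """Map a public name to an anonymized id when exactly one id shares
    at least `min_matches` (movie, date) pairs with that name."""
    fingerprints = defaultdict(set)
    for row in anonymized:
        fingerprints[row["user_id"]].add((row["movie"], row["date"]))

    known = defaultdict(set)
    for row in public:
        known[row["name"]].add((row["movie"], row["date"]))

    matches = {}
    for name, pairs in known.items():
        candidates = [
            uid for uid, fp in fingerprints.items()
            if len(pairs & fp) >= min_matches
        ]
        if len(candidates) == 1:  # a unique candidate means re-identification
            matches[name] = candidates[0]
    return matches


reidentified = link(anonymized_ratings, public_reviews)
print(reidentified)  # {'Alice': 'u1'}
```

Once "Alice" is linked to `u1`, every other row of `u1` in the released dataset (here "Movie C") is disclosed too, even though she never shared it publicly. "Bob" is not linked because a single matching pair stays below the chosen threshold.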
That said, privacy is still hard to measure because it is a really abstract notion. A way to measure it would matter mainly for systems that process data and for companies that want to publish datasets, because it would help determine to what extent collected or published data protects users’ privacy. We will also talk about this notion in upcoming stories.
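As a small preview of what such a measure could look like, here is a sketch of one commonly cited metric, k-anonymity, which this story does not cover in detail. The idea is that each record should be indistinguishable from at least k-1 others on a set of quasi-identifiers. The `k_anonymity` function, the column names and the records below are assumptions made for this example.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    values on the given quasi-identifier columns (rows must be non-empty)."""
    groups = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return min(groups.values())


# Hypothetical, already generalized records (illustrative values only).
records = [
    {"zip": "750**", "age_range": "20-29", "diagnosis": "flu"},
    {"zip": "750**", "age_range": "20-29", "diagnosis": "asthma"},
    {"zip": "750**", "age_range": "30-39", "diagnosis": "flu"},
    {"zip": "750**", "age_range": "30-39", "diagnosis": "diabetes"},
    {"zip": "690**", "age_range": "30-39", "diagnosis": "flu"},
]

k = k_anonymity(records, ["zip", "age_range"])
print(k)  # 1
```

A result of 1 means at least one person is unique on those columns and therefore easier to re-identify. A publisher would typically want a larger k before releasing the dataset.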