The impact of Machine Learning in data privacy


Originally posted by Fabiana Clemente at https://www.linkedin.com/pulse/impact-machine-learning-dataprivacy-fabiana-clemente/ on March 02, 2020



As the world moves towards digitalization, more and more personal and private information is being gathered every day. Organizations must process and explore this data in order to innovate.

In a new information era, where data is the new oil, data privacy is emerging as one of the biggest concerns for governments and society. In this landscape, new regulations and privacy laws are appearing worldwide, reshaping data projects and creating challenging new realities.

It’s time to rethink Data Privacy!

Masking … a solution?

One of the most commonly used solutions by those who need to manipulate sensitive data for research or development is data masking. Data masking hides the data elements that are considered sensitive and that cannot be shown to the individuals working with the data. Typically, it replaces those elements with similar-looking fake data, so that the vital parts of personally identifiable information are never exposed. Unlike encryption, data masking is designed not to be reversible at all, making it useless to attackers who might try to reverse it. But is it really enough to prevent the re-identification of individuals in a database? The answer is simple: it is not.
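To make the idea concrete, here is a minimal sketch of what masking can look like in practice, assuming a toy customer record; the field names and replacement strategies are illustrative only, not a real masking library’s API.

```python
# A minimal sketch of irreversible data masking on a toy customer record.
# The field names and replacement strategies are illustrative, not a real library API.
import hashlib
import random
import string

def mask_record(record: dict) -> dict:
    """Replace sensitive fields with similar-looking but fake values."""
    masked = dict(record)
    # Replace the name with a random string of the same length, so the value still "looks" like a name.
    masked["name"] = "".join(random.choices(string.ascii_lowercase, k=len(record["name"]))).title()
    # Replace the email local part with a truncated hash, keeping the domain so the format is preserved.
    local, _, domain = record["email"].partition("@")
    masked["email"] = hashlib.sha256(local.encode()).hexdigest()[:8] + "@" + domain
    # Keep only the birth year, dropping month and day.
    masked["birth_date"] = record["birth_date"][:4] + "-XX-XX"
    return masked

print(mask_record({"name": "Jane Doe",
                   "email": "jane.doe@example.com",
                   "birth_date": "1987-04-12"}))
```

Even a record masked this way still carries attributes (ZIP code, year of birth, and so on) that can be linked back to a person, which is exactly the problem discussed next.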

The risk of re-identification is real, regardless of whether masking or encryption has been applied. These techniques make an attacker’s work much harder, but re-identification is still possible. This topic is well studied in the privacy community:

  • In 2016, Australia’s Federal Department of Health published the medical billing records of about 2.9 million Australians online. These records came from the Medicare Benefits Scheme (MBS) and the Pharmaceutical Benefits Scheme (PBS) and covered around 10 percent of the population. After the release of this potentially sensitive data, researchers tested its security against re-identification attacks. Using only publicly available information, they were able to decrypt the information within the MBS dataset (link).

  • In the US, it was found that 87% of the population can be uniquely identified based on 5-digit ZIP code, gender and date of birth; 53% are likely to be uniquely identified with only place (city, town or municipality), gender and date of birth. Even at the county level, 18% of the US population can still be re-identified (link). (A sketch of this kind of linkage attack follows this list.)

  • The Netflix Prize data is another example of how masking and encryption can easily be “reversed”. Using only data from 2005, researchers from MIT were able to re-identify Netflix users by combining the dataset with Amazon’s open product database. Based on these matched user profiles, it is possible to uncover the shopping habits, full names, and even political beliefs of supposedly anonymous individuals (link).
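The ZIP-code, gender and date-of-birth finding above is essentially a linkage attack: an attacker joins the “anonymous” dataset with a public dataset on the quasi-identifiers they share. Below is a minimal sketch of the idea; the two toy datasets and their field names are purely hypothetical.

```python
# A minimal sketch of a linkage (re-identification) attack on a "masked" dataset.
# Both toy datasets and their field names are hypothetical and only illustrate the idea.

masked_health = [
    {"zip": "02139", "gender": "F", "birth_date": "1987-04-12", "diagnosis": "asthma"},
    {"zip": "90210", "gender": "M", "birth_date": "1990-11-03", "diagnosis": "diabetes"},
]

public_register = [
    {"name": "Jane Doe", "zip": "02139", "gender": "F", "birth_date": "1987-04-12"},
    {"name": "John Roe", "zip": "90210", "gender": "M", "birth_date": "1990-11-03"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "birth_date")

def link(records, register):
    """Join the two datasets on the quasi-identifiers to re-attach names."""
    index = {tuple(p[k] for k in QUASI_IDENTIFIERS): p["name"] for p in register}
    for r in records:
        key = tuple(r[k] for k in QUASI_IDENTIFIERS)
        if key in index:
            # The "anonymous" health record is now linked back to a named person.
            yield {**r, "re_identified_name": index[key]}

for match in link(masked_health, public_register):
    print(match)
```

No decryption is needed: the attack works purely by matching attributes that were never considered identifying on their own.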

Does Machine Learning amplify data privacy issues?

As explained in our previous post, Machine Learning (ML) is a subset of AI that requires large datasets so it can “learn” patterns with high levels of accuracy. But how does it affect data privacy?

The same data privacy problems that were raised with the rise of Big Data are also relevant for ML:

  • The possibility of re-identifying personal information from large datasets

  • The availability of high-dimensional data due to the reduction of storage costs

  • The mining of unstructured data using Deep Learning techniques, and the possibility of incorporating high-dimensional data into a single model.

This leads to a whole new level of data availability and new possibilities for re-identifying private information, even when only minimal personal characteristics were made available. Consider a simple example:

Suppose a company is performing market analysis based on customer feedback. Due to privacy concerns, all personal information such as name, age and gender was removed from the dataset to be analysed. It seems legitimate to think that it is now impossible, for example, to know the age or even the gender of the customer behind a given piece of feedback, correct? Wrong! There are ways to re-identify the gender based solely on subtle differences in word choice (see Gender classification in Twitter).
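As a rough illustration of how word choice alone can leak an attribute, here is a minimal sketch of a text classifier in the spirit of that work, using scikit-learn; the tiny training set and its labels are invented placeholders, and a real study would need thousands of labelled posts.

```python
# A minimal sketch of inferring gender from word choice, in the spirit of the
# Twitter gender-classification work cited above. The texts and labels below are
# invented placeholders, purely for illustration. Requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "loved the packaging, so cute and the colours are gorgeous",
    "solid build quality, the battery specs are exactly as advertised",
    "my husband says the fit is perfect, super happy with it",
    "great torque on this drill, handled the deck screws fine",
]
labels = ["F", "M", "F", "M"]  # invented labels, not real data

# TF-IDF features over word choice feed a simple linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Predict the hidden attribute for a new, supposedly anonymous piece of feedback.
print(model.predict(["the colour options are gorgeous and delivery was quick"]))
```

The point is not the accuracy of this toy model but that an attribute which was deliberately removed from the dataset can be statistically recovered from the text that remains.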

In summary, pseudonymisation, meaning the masking or removal of personally identifiable data, is no longer sufficient on its own to comply with the new legal frameworks for data privacy.


The new definition of privacy

Due to the increasing risk of re-identification (higher volumes of personal data being shared, growing computational power, more available data, etc.), new regulations regarding data privacy have been published: in 2016, the Federal Attorney-General introduced the Privacy Amendment (Re-identification Offence) Bill in Australia; the European General Data Protection Regulation (GDPR), adopted in 2016, and the California Consumer Privacy Act, signed in 2018, are some of the measures taken to guarantee data privacy.

With these new regulations, a new definition of truly private data has been created. As stated in Recital 26 of the GDPR:

“The principles of data protection should apply to any information concerning an identified or identifiable natural person.

Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person.

The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.”


Conclusion

Overall, the concepts of privacy and security have changed since the arrival of Big Data and Machine Learning, and organizations need to adapt in order to ensure the best protection of their customers’ data.

New privacy regulations are striving for new levels of data privacy, regulating how data can be used and ensuring that collected data is handled in a more transparent, fair and secure way.

Organizations need to review their data policies for both internal and external processes, such as the use of anonymization and privacy methods, in order to stay innovative and leverage the latest advances in technology.
