GDPR & You: My Talk at Cloudera Sessions München
Posted on Wed 11 October 2017 in conferences
Unless you have been avoiding all news, you have likely heard of the coming changes in European privacy regulation that take effect in May 2018. The changes are covered under the General Data Protection Regulation (GDPR), whose final text was made available in May 2016.
I presented a talk at Cloudera Sessions Munich covering a few topics on data privacy and security that I find interesting overall (not just for GDPR). Although inspired by some of the GDPR provisions, my talk focused on how a few areas might be impacted by the regulation and explored how companies can take GDPR as a prompt to start taking ethical data science more seriously.
The main takeaways I wanted to share are:
1. GDPR doesn't require ethical or even interpretable machine learning. But you should be doing this anyway, right?
There are a lot of scary articles out there, usually by someone with half a clue, talking about how GDPR is going to kill artificial intelligence in Europe as we know it. They cite a paragraph in a recital that calls for the ability to explain automated decisions and processing to the data subject (aka client / user / you & me).
However, if you take the time to read the text of the GDPR and consult some of the legal papers on the topic, it is fairly clear that this right doesn't exist in the way it's being spread in the headlines. A great paper on this topic is Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation (Wachter et al., 2017), which delves into the potential legal implications of this section of the regulation and explains that it is highly likely to be interpreted as a right to be informed.

That said, if you cannot explain your model at all, doesn't that concern you? As a data scientist and machine learning practitioner, it bothers me! In fact, I think that if we were required to explain our models more often, it might lead to a better understanding of our problem space, new ways to measure or classify our results, and more ethical models. Why? Because if I take the time to create an interpretable model, not only can I better explain why it behaves the way it does, I can also see whether there has been some "data leakage", meaning my model has learned something I wanted to avoid (e.g. how to be racist or sexist).
So how do we promote more interpretability within the community? Interpretability in machine learning has already been a topic for several years, with workshops, great papers, open-source libraries and in-depth blog writeups. What saddens me is how often the Kaggle-verse somehow values every last half-percentage point of accuracy over anything interpretable. Don't be that person! Instead, spend time finding a model that you can explain, reason about and defend.
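To make that concrete, here is a minimal sketch (not from the talk) of one way to sanity-check what a model actually relies on, using scikit-learn. The built-in dataset is just a stand-in for your own data, and permutation importance is only one of many interpretability tools:

```python
# A minimal sketch: an interpretable baseline plus a quick check of which
# features the model actually leans on. The dataset is a stand-in only.
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An interpretable baseline: scaled logistic regression, whose coefficients
# can be read as the direction and strength of each feature's effect.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Permutation importance on held-out data shows which features drive the
# predictions -- a quick way to spot surprising or "leaky" features.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranked = sorted(zip(X.columns, result.importances_mean),
                key=lambda pair: -pair[1])
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```

If a feature you would never want to justify to a customer shows up at the top of that list, that is exactly the kind of conversation interpretability forces you to have.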
2. Data privacy is a myth. However, you can do your best at REAL anonymization to protect your customers.
Think your data is private? If you have used a service that relies on third-party data processing, had your data released as part of a competition or study, or simply leave the default settings on most of your applications and sites, then it probably is not. Why? In a "big data" world, de-anonymization (especially targeted de-anonymization) is trivial. Research in de-anonymization made a leap in 2008 when Arvind Narayanan and Vitaly Shmatikov published their paper Robust De-anonymization of Large Sparse Datasets, in which they successfully de-anonymized users in the dataset released for the Netflix Prize. This data was released knowingly by Netflix and, according to Netflix, had been properly anonymized. The paper was well received, and Narayanan went on to do further research on de-anonymization. It is also worth reading just for the fantastic burns.
Peak joy: I have a *real* reason to read the Netflix de-anon paper in entirety. And let me tell you, it is full of 🔥 https://t.co/KU5HPaDLoE pic.twitter.com/dmgHGqvg04
— katharine jarmul (@kjam) September 30, 2017
Andreas Dewes and several reporters from NDR and ARD recently researched this same topic, presenting their findings in the re:publica talk #NacktimNetz (note: it is in German, but they also presented at DefCon and that video should be available soon). They were able to very easily get hold of clickstream data for German politicians, police officers and public servants via a third-party company selling complete URL streams for individual persons. Without great difficulty, they could find personally identifiable information in the data and de-anonymize a person's complete browsing history.

So what can you do as someone handling potentially sensitive user data? Mainly, don't be evil (but no, really this time...). Don't sell your customer data to third parties. Don't release it as a competition because it will be fun. Don't give it to anyone. Don't keep it connected to the public internet with default passwords. Just be smart about it. And if you do choose to give it away, sell it or release it, know that you need to really think about what that might mean WHEN someone de-anonymizes it.
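As a rough illustration of why "we removed the names" is not anonymization, here is a tiny, hypothetical k-anonymity check over a few made-up quasi-identifiers. The column names and rows are invented; the point is how easily a handful of innocuous-looking attributes single out individual records:

```python
# A hypothetical sketch: how unique do a few quasi-identifiers make each
# record? Groups of size 1 are people who can be re-identified from these
# columns alone, even with names and IDs stripped out.
import pandas as pd

df = pd.DataFrame({
    "zip_code":   ["80331", "80331", "10115", "10115", "50667"],
    "birth_year": [1984, 1990, 1984, 1984, 1975],
    "gender":     ["f", "m", "f", "m", "f"],
})

quasi_identifiers = ["zip_code", "birth_year", "gender"]
group_sizes = df.groupby(quasi_identifiers).size()

# k-anonymity is the smallest group size; k == 1 means at least one person
# is uniquely identifiable from these columns.
k = group_sizes.min()
print(f"k-anonymity over {quasi_identifiers}: {k}")
print(group_sizes[group_sizes < 2])  # the records that stand alone
```

Running a check like this before any data leaves your hands is a small step, but it makes the "it was anonymized" claim testable instead of hopeful.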
3. Data portability will hopefully inspire and encourage more competition.
A ray of hope in this slightly grim blog post comes from the GDPR articles related to data portability. To me, this is perhaps the most exciting part of GDPR, and it holds quite a lot of power if implemented properly. Of course, there is plenty of debate surrounding how it will actually be enforced by the courts.
The Article 29 Working Party document is fairly clear about its interpretation, stating that:
In this regard, WP29 considers that the right to data portability covers data provided knowingly and actively by the data subject as well as the personal data generated by his or her activity. This new right cannot be undermined and limited to the personal information directly communicated by the data subject, for example, on an online form.
To me, this sounded like the competition spurred by mobile phone number portability. I decided to read a bit about how that was implemented in Europe and found several interesting papers on the topic, including Mobile number portability in Europe (Buehler et al., 2005), which explored pricing and its relation to the number of people switching carriers. Via some networking folks, I also found anecdotal evidence that in areas where start-ups and smaller network carriers were competing with the larger companies on features, a high proportion of mobile users ported their numbers.
For me, data portability opens up this same door. What if I could get all of my location data and port it to a new company? What if I could choose who I use for my language learning apps and port data easily between them?
Real competition over who is the better data guardian, who has better features, better security and better privacy could emerge. This makes me both happy and hopeful.
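Just to make the idea tangible, here is a tiny, entirely hypothetical sketch of what a portable export could look like. The schema and field names are invented; the point is a documented, machine-readable bundle that another service could actually ingest, rather than a proprietary dump:

```python
# A hypothetical sketch of a portable user-data export in JSON.
# The format name, version and fields are made up for illustration.
import json
from datetime import datetime, timezone


def export_user_data(profile, location_history):
    """Bundle a user's data into a self-describing JSON document."""
    return json.dumps({
        "format": "example-portability-export",
        "version": "1.0",
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "profile": profile,
        "location_history": location_history,
    }, indent=2)


print(export_user_data(
    profile={"id": "user-123", "language": "de"},
    location_history=[{"lat": 48.137, "lon": 11.575,
                       "timestamp": "2017-10-11T09:00:00Z"}],
))
```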
In case you want to look through them, you can find my slides here:
If there is a video recording, I will post it as well.
Slide References (in the order they were presented)
- Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation (Wachter et al., 2017)
- O'Reilly Post: Ideas on Interpreting Machine Learning
- Semantics derived automatically from language corpora contain human-like biases (Caliskan et al., 2017)
- Robust De-anonymization of Large Sparse Datasets (Narayanan et al., 2008)
- Andreas Dewes: re:publica talk #NacktimNetz
- Article 29 Working Party document
- Mobile number portability in Europe (Buehler et al., 2005)
- HBR: The Barriers Big Companies Face When They Try to Act Like Lean Startups