Abstract
The General Data Protection Regulation (GDPR) is an EU law that came into effect on 25 May 2018. The law has sent shockwaves through the data profession. Although most of its provisions resemble those of earlier data protection laws, the compliance standards have been raised significantly, and the risks and penalties for non-compliance are far higher. One area that will bear the weight of the new regulation is data science. The practice of data science is affected in three main ways: limits on consumer profiling and data processing, accountability for bias and discrimination in automated decisions, and a right to an explanation of automated decision-making.
Introduction
The GDPR is an EU regulation that addresses privacy and data protection for individuals within the EU. It was adopted on April 27, 2016, and came into effect on May 25, 2018 [2]. Being a regulation, the GDPR is binding on all EU member states and does not require any enabling national legislation. It also addresses the export of personal data outside the EU and the European Economic Area (EEA). It gives citizens and residents control over their personal data while simplifying and unifying the data regulatory framework within the EU. Any organization handling personal data about individuals in the EU must therefore comply with its provisions.
The GDPR defines and strengthens data protection rights for consumers. It replaces the Data Protection Directive 95/46/EC in order to harmonize EU data protection rules. While the law aims to reinforce privacy rights within the EU, it is also a key milestone in the EU Digital Single Market initiative. The regulation seeks to reconcile the two competing values of innovation and privacy. It largely retains the terminology and principles of the 1995 Directive, and adds some new principles whose consequences are still uncertain. These include a stricter concept of consent, a data portability requirement, and a legal entitlement to be forgotten [3].
The law offers hope of greater uniformity across Europe. Research has a special place in it, as the law carves out exemptions for historical, scientific, and health research. Research organizations that collect personal data can therefore implement safeguards that exempt them from some restrictions on processing sensitive data. The law also permits organizations to process personal data without the subject's consent, provided the processing is for research purposes only. In some cases, this data can be transferred to third parties without requiring any other transfer mechanism. The law has changed how data is handled and how data science is practiced. This paper focuses on the relationship between data science and the GDPR.
GDPR Highlights
The GDPR retains the main data protection principles of the Data Protection Directive 95/46/EC (DPD). It has also been simplified to offer EU citizens more control and transparency [4]. Its key highlights include:
1) Personal Data and Roles
Personal data is any information that relates to an individual and can be used to identify that individual, directly or indirectly. It can be a name, location data, an identification number, or an online identifier such as an IP address. It also covers factors specific to the person's physical, physiological, mental, genetic, cultural, economic, or social identity. Such data remains personal whether it is used in a private, work, or public context. Three main roles matter for GDPR compliance:
- a) Data controller – the person or organization that determines the purposes and means of processing personal data.
- b) Data processor – the person or organization that processes personal data on behalf of the controller.
- c) Data subject – the identifiable person to whom the data relates.
Processing in this context refers to any operation performed on the data, including collection, storage, and erasure. As long as you are dealing with the personal data of individuals in the EU, the GDPR applies.
2) Consent and rights
The GDPR defines consent as ‘any freely given, informed, specific, and unambiguous indication of the wishes of the data subject for his/her personal data to be processed’. This indication can be given through a statement or a clear affirmative action, and it signifies agreement to the use of data relating to the subject. The GDPR requires that data subjects be fully aware of the purposes and means of processing their data and have no doubt about them. This includes situations where the data undergoes multiple processing activities.
Consent must be positive and cannot be presumed from the data subject's inaction. Under the rights of access and erasure, the subject can ask the data controller to confirm who has accessed the data, where it is being processed, the purpose of the processing, and how it is accessed. Subjects can also ask the controller to send them a digital copy of the data, erase their data, or halt any third-party access to it.
3) Breaches and penalties
A personal data breach is a security breach that leads to the accidental or unlawful destruction, loss, or alteration of stored data. It also includes unauthorized access to, disclosure of, or transmission of personal data. Such a breach must be reported to the supervisory authority (in the UK, the Information Commissioner's Office) within 72 hours of the controller becoming aware of it, unless the breach is unlikely to result in a risk to the subjects' rights and freedoms. Breaching the GDPR can lead to a maximum fine of 20 million euros or 4 percent of annual global turnover, whichever is higher. A lower tier of up to 10 million euros or 2 percent applies to failures such as not keeping records in order or not notifying the relevant authority of a data breach.
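The fine ceiling and the notification window described above reduce to simple arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, the "whichever is higher" rule and the 10 million euro fixed amount for the lower tier are taken from the regulation's text rather than from this article, and actual fines are set case by case by the supervisory authority.

```python
from datetime import datetime, timedelta

# (fixed amount in euros, share of annual global turnover)
UPPER_TIER = (20_000_000, 0.04)
LOWER_TIER = (10_000_000, 0.02)  # assumption: fixed amount taken from the regulation


def max_fine(annual_global_turnover_eur, tier=UPPER_TIER):
    """Return the fine ceiling: the higher of the fixed amount and the turnover share."""
    fixed_amount, turnover_share = tier
    return max(fixed_amount, turnover_share * annual_global_turnover_eur)


def notification_deadline(became_aware_at):
    """Return the latest moment to notify the supervisory authority (72 hours later)."""
    return became_aware_at + timedelta(hours=72)
```

For a company with a turnover of 100 million euros the fixed amount dominates, while for a turnover of one billion euros the percentage does. This is why the percentage matters most to large multinationals.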
4) Demonstrating Compliance
Data controllers are required to demonstrate the organization's compliance with the GDPR. To this end, they need to maintain records about their location, data processing activities, data storage locations, and the categories of data stored. The organization may also be required to appoint a Data Protection Officer as part of its compliance measures. Other measures include privacy impact assessments, audits, pseudonymization, data minimization, and the implementation of data policies.
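Pseudonymization, one of the measures listed above, can be illustrated with a keyed hash. This is a minimal sketch, not a compliance recipe: the function names, the field names, and the way the key is passed in are assumptions made for illustration, and HMAC-SHA256 is only one of several possible techniques. Pseudonymized data generally still counts as personal data, because whoever holds the key can re-link it to the individual.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Replace a direct identifier with a keyed hash (HMAC-SHA256, hex-encoded)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_record(record, id_fields, secret_key):
    """Return a copy of the record with the listed identifier fields pseudonymized."""
    return {
        field: pseudonymize(str(value), secret_key) if field in id_fields else value
        for field, value in record.items()
    }
```

Because the same key always maps the same identifier to the same token, records can still be joined for analysis without exposing the original value. The key should be stored separately from the data.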
How GDPR will affect Data Science
When the GDPR came into effect, it dramatically changed how data science can be used [4]. Its effects cut across all corporations that use the data of individuals in the EU, regardless of where those corporations are located. Figures from the drafting stage, such as an accountability threshold of 5,000 data subjects per year and fines of up to one million euros (for smaller companies) or 5 percent of global annual turnover, did not survive into the adopted text, which caps fines at 20 million euros or 4 percent of global annual turnover. The EU legislature is keen on enforcing these rules, and most companies have already put plans in place to comply.
The legal background to the regulation was a patchwork of national laws in the individual member states, enforced by several independent supervisors. The absence of a single privacy framework complicated compliance with data transfer rules. Many multinational corporations could get away with violations because the EU supervisors found it hard to address them in a unified manner.
The data science industry has seen assertive data-driven business models in the recent past. This has stimulated the need for a stepwise regulatory framework that ensures effective lobbying strategies and proper investment in data protection. The GDPR is therefore geared towards establishing a standard law across all EU states, replacing most of the existing national regulations. The law is enforced through a One Stop Shop supervision approach.
The GDPR places greater focus on the transparency of data and prioritizes the rights of data subjects. This may be a cause for concern for data scientists working in Europe. For several years, data science has shown that it can provide valuable insight into how companies operate [5] and can steer strategic decisions by analyzing results and options. The GDPR introduces new obstacles that will affect how data science products are built. Profiling, for instance, is a particular cause for concern.
The GDPR gives individuals a right to an explanation of a decision based on automated processing. This means that using an automated algorithm to decide that someone is ineligible for a loan may have legal implications. These restrictions may not apply only to loans. The use of machine learning approaches such as artificial neural networks may be called into question. The GDPR does not accept the level of bias that algorithmic approaches using hidden layers can introduce into a decision.
The GDPR aims to protect people from bias, but this comes at a cost to data science. Many processes designed for efficient data processing may find themselves on the wrong side of the law. Data companies may therefore opt for less complex models and approaches to decision making, with more focus on interpretable analyses that offer accountable explanations. Some of the precision offered by more complex automated approaches may be lost as a result.
On the flip side, the GDPR will provide new opportunities for data scientists. The new regulatory framework calls for creativity and ingenuity. The transparency and integrity standards set for data will also boost the reputation and profile of compliant data scientists. The good news is that much of data science does not have to deal with personal data. The majority of data science projects use anonymous data and so will not be affected much by the GDPR.
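To make the idea of an interpretable analysis concrete, the sketch below scores a loan applicant with a small logistic model and reports how much each input pushed the decision up or down. Everything in it is hypothetical: the feature names, weights, and bias are invented for illustration and do not come from any real credit model, and a per-feature breakdown of this kind is only one possible way to support an explanation.

```python
import math

# Hypothetical weights for an illustrative loan model (not from any real system)
WEIGHTS = {"income_k_eur": 0.04, "years_employed": 0.3, "missed_payments": -1.2}
BIAS = -2.0


def approval_probability(applicant):
    """Logistic score: probability-like value between 0 and 1."""
    z = BIAS + sum(WEIGHTS[name] * applicant[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def explain(applicant):
    """List each feature's contribution to the score, largest effect first."""
    contributions = {name: WEIGHTS[name] * applicant[name] for name in WEIGHTS}
    return sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
```

With a linear model the contributions add up exactly to the score before the logistic function is applied, so the explanation is faithful to the decision. A network with hidden layers offers no such direct decomposition, which is the trade-off the paragraph above describes.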
Data Science and GDPR relevance
The privacy rules in the GDPR are a real challenge for most data scientists. They push the use of data in the opposite direction to the one data scientists are used to. Both privacy advocates and data scientists have the best interests of the individual at heart. There is, however, a potential conflict of goals, and the two groups tend to apply different methodologies in pursuing them. The goal of data science is to maximize data collection by obtaining new data and discovering new uses for existing data. Privacy advocates, on the other hand, aim to minimize data collection and reduce the use of existing data. To comply with the GDPR, data scientists and privacy advocates need to realign and coordinate their goals for the benefit of the individual. Both the economic perspective and the protection aspect need to be carefully considered.
The number of data science techniques capable of de-anonymizing private data has increased in the recent past. Privacy advocates are therefore aware of the analytic techniques that data scientists can employ to generate data. Data scientists have been able to link sensitive anonymous data to specific individuals and, in the process, reveal their gender and ethnicity. This can be based on Facebook likes, snapshots, and cell-tower check-ins. There have been cases where individual records were released anonymously but were later de-anonymized. Personal data can be observed, volunteered, or inferred, and data science is behind the inference of personal data. This is why regulators are training their sights on data scientists in order to protect the privacy of data.
The increasing use of analytic technologies and cutting-edge data storage continues to put the integrity and privacy of data at risk. Modern data technologies such as on-demand cloud storage, NoSQL databases, and in-memory processing allow data scientists to store massive amounts of raw data. These data lakes raise a number of privacy concerns, including:
- a) Data awareness – companies tend to lose oversight of the data they store, how it is replicated, and the associated risks and privacy implications.
- b) Governance – the many cloud storage systems raise concerns over security measures. Raw data may, for instance, flow into a pilot system that has no mature governance model.
- c) Control – the retrieval, storage, and distribution of raw data come with unknown dangers. One such danger is that companies lose control over their data, which may also mean that they can no longer guarantee the right to erasure and the right to be forgotten.
Direct Impacts of GDPR on Data Scientists
The GDPR emphasizes the right of individuals to understand and oversee how their data is used. This affects the ability of data scientists to collect data. Going forward, tighter rules on privacy principles will reduce the baseline collection of data through processes and systems. Collecting data through default browser settings will, for instance, require the consent of the persons concerned. Individuals will have to be informed of the purpose of the data collection and given the opportunity to consent. The EU Parliament is still debating some of these modalities, and some form of compromise may follow.
The GDPR will also affect the ability to use data. Data scientists will require express permission from individuals to use their personal data. The practicality of this is still widely debated, and some exceptions may be introduced. Without such exceptions, it would be quite difficult for data scientists to find new uses for existing data, because every new application of the data would require express consent from the individual. The regulation will also affect the ability to transfer data. The stiff fines imposed by the law will discourage corporations from buying, selling, or even sharing personal data.
The GDPR will significantly affect customer profiling. The regulation requires that customers be informed of when their data will be collected and how it will be used. This means that customers have to be told how the data will affect them in areas such as credit scoring and fraud detection. Customers also have the right to opt out of automated profiling. Companies, for their part, have to use algorithms that are sufficiently robust that they do not breach the regulation.
Data storage is another area affected by the regulation. The GDPR requires both compliance and accountability, which means that corporations will need to demonstrate their compliance to the supervisors. This involves extensive preparation, an up-front data audit, and documentation. Data scientists need to be able to show what types of customer data they hold and who can access it, covering observed, volunteered, and inferred data. Individuals have to be guaranteed their right to erasure, so the company must know where all copies of a person's data are stored in order to guarantee the right to be forgotten. Data scientists will also have to familiarize themselves with the implications of storing people's data, including passing it through service providers such as cloud storage, analytics tools, and cloud-based BI services.
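Knowing where every copy of a person's data lives is what makes erasure possible in practice. The sketch below shows one way to keep such an index. It is an assumption-laden illustration: the `DataInventory` class, the system names, and the use of in-memory dictionaries as stand-ins for real storage systems are all hypothetical, and a production system would also have to cover backups, logs, and third-party processors.

```python
from collections import defaultdict


class DataInventory:
    """Tracks which systems hold copies of each data subject's records."""

    def __init__(self):
        # system name -> {record_key: value}; plain dicts stand in for real stores
        self._stores = defaultdict(dict)
        # subject_id -> set of (system, record_key) pairs
        self._locations = defaultdict(set)

    def put(self, system, record_key, subject_id, value):
        """Store a record and remember that it belongs to this subject."""
        self._stores[system][record_key] = value
        self._locations[subject_id].add((system, record_key))

    def locate(self, subject_id):
        """Return every known (system, record_key) copy for the subject."""
        return sorted(self._locations.get(subject_id, set()))

    def erase(self, subject_id):
        """Delete all known copies for the subject and return how many were removed."""
        removed = 0
        for system, record_key in self._locations.pop(subject_id, set()):
            if record_key in self._stores[system]:
                del self._stores[system][record_key]
                removed += 1
        return removed
```

The important design choice is that the index is updated at write time. An inventory that is reconstructed only when an erasure request arrives is likely to miss replicas.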
Data Protection Officer
The GDPR also places much more emphasis on the privacy of data within your company. This means that larger companies will be required to have a Data Protection Officer. During drafting, the EU Parliament proposed a fine of 5 percent of global annual turnover for violators, while the adopted regulation sets the cap at 4 percent. The bottom line is that companies have to comply with this regulation. The law is meant to impose accountability and not just compliance. The supervisor will therefore want to see comprehensive documentation that demonstrates the company's compliance with the regulation.
Data Science and GDPR Compliance
The GDPR is here to stay, and it is in the best interest of data scientists to move towards compliance. Compliance cannot be achieved overnight, but it can be reached through a series of steps. The accountability clause, for instance, requires documenting data assets and systems and carrying out a risk assessment. The following steps offer a good start for any data scientist working towards GDPR compliance.
Step 1: Audit the entire data ecosystem
This will help you identify privacy gaps in the data that may expose you to GDPR violations. Begin with the structured data in the BI systems and look out for dark data in the operational systems. Then look at the Big Data, which includes sensor data and weblog data. Document all the data found in those locations, where it is replicated, the controls in place, and the authorized users. Identify any personal data, as well as other data that could easily become personal if processed with data science techniques. This will help you introduce the changes needed to put you on the path to compliance.
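Part of such an audit can be automated by scanning data for values that look like personal identifiers. The following is a rough sketch under stated assumptions: the two regular expressions are deliberately simple and hypothetical, they cover only email addresses and IPv4 addresses (the article lists IP addresses as an online identifier), and a match is a prompt for human review rather than proof that a column holds personal data.

```python
import re

# Simplified, illustrative patterns; real audits need broader and stricter rules
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scan_rows(rows):
    """Map each column name to the set of identifier types detected in its values."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for label, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    findings.setdefault(column, set()).add(label)
    return findings
```

Running `scan_rows` over a sample of weblog rows would flag the columns that need to be documented and reviewed first. Columns with no match still need a manual check, since names, free text, and inferred attributes will not be caught by patterns like these.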
Step 2: Ensure proper implementation of the user consent
The current phase of GDPR implementation allows existing user consent to be grandfathered into the system. It is important to make sure that user consent has been properly implemented, because the use of data within the GDPR framework may be severely hampered if the subjects have not given proper consent.
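Since consent has to be specific to a purpose and can be withdrawn, it helps to record it explicitly and check it before each use of the data. The sketch below is one possible shape for such a record. The `ConsentRecord` class, its fields, and the `may_process` helper are hypothetical names chosen for illustration, and consent is only one of the legal grounds on which processing can rest.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ConsentRecord:
    subject_id: str
    purpose: str  # consent is recorded per purpose, not once for everything
    granted_at: datetime
    withdrawn_at: Optional[datetime] = None

    def is_active(self, at):
        """True if consent had been given and not yet withdrawn at the given time."""
        if at < self.granted_at:
            return False
        return self.withdrawn_at is None or at < self.withdrawn_at


def may_process(records, subject_id, purpose, at):
    """Allow processing only if an active consent exists for this exact purpose."""
    return any(
        r.subject_id == subject_id and r.purpose == purpose and r.is_active(at)
        for r in records
    )
```

The absence of a record means no consent, which mirrors the rule that consent cannot be presumed from inaction. A new use of existing data would need its own record with its own purpose.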
Step 3: Ensure that the product roadmaps comply with Privacy by Design principles
Familiarize yourself with the notions of privacy by default and privacy by design. Product owners then need to be made aware of data collection restrictions so that future products are developed with them in mind. It is important to maintain full functionality while still complying with the GDPR restrictions. Products need to be designed in a way that enables a data-driven business model while honoring existing privacy laws.
Step 4: Engage in a dialogue with an expert
Initiate a dialogue with someone who understands how to operate within the privacy framework. This can be an external expert or a corporate privacy officer. Privacy is a complex subject with high stakes. Strong communication between a technical expert and yourself or your legal representative will help to break it down.
Finally, you will need to adhere to Article 5 of the GDPR when processing data. The article prescribes six principles that a data scientist must follow when processing personal data [7]. They are:
- a) Lawful processing, transparency, and fairness
- b) Keeping to the original purpose
- c) Minimization of the data size
- d) Upholding accuracy
- e) Removing unused data
- f) Ensuring integrity and confidentiality of the data
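Two of the principles above, minimization and removing unused data, translate fairly directly into code. The sketch below is a simplified illustration: the function names, the `collected_at` field, and the idea of a single retention period are assumptions, and the appropriate fields and retention periods depend on the purpose for which the data was collected.

```python
from datetime import datetime, timedelta


def minimize(record, allowed_fields):
    """Keep only the fields needed for the stated purpose."""
    return {name: value for name, value in record.items() if name in allowed_fields}


def drop_expired(records, now, retention):
    """Keep only records collected within the retention period."""
    return [r for r in records if now - r["collected_at"] <= retention]
```

Applying `minimize` at the point of collection, rather than just before analysis, means the unneeded fields are never stored in the first place.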
PhD Researcher / Aegean University
References
- Dataconomy: [http://dataconomy.com/2018/04/how-gdpr-will-affect-data-science/]
- Cloudera VISION Blog: [https://vision.cloudera.com/general-data-protection-regulation-gdpr-and-data-science/]
- Blackmer, W.S: [https://www.infolawgroup.com/2016/05/articles/gdpr/gdpr-getting-ready-for-the-new-eu-general-data-protection-regulation/]
- University of Nottingham: [http://blogs.nottingham.ac.uk/digitalresearch/2017/10/06/introduction-general-data-protection-regulation-research/]
- DSI Analytics: [https://dsianalytics.com/data-science-and-pending-eu-privacy-regulation-a-storm-on-the-horizon/]
- Darwin: [https://www.darwinrecruitment.com/blog/2018/03/impact-gdpr-dsgvo-on-data-science-2018]
- Dr. Scott Summers: [http://www.insidegovernment.co.uk/uploads/2018/02/Presentation-Scott-Summers-Final.pdf]