High-Risk AI and GDPR
Josh Tyrangiel, in his opinion piece (or homage to Dragos Tudorache?), “Let’s just copy Europe’s homework on AI”, notes that Europe retains the first-mover advantage on digital good governance. Having led the way with GDPR in 2016, the European Parliament has now passed the EU AI Act, whereby “Europe is poised to remain a gallant protector of individual rights while moving into a far more seductive era with the world’s AI tech companies.” The EU AI Act has been closely tracked by private social media corporations, which have already been subject to several fines under GDPR (e.g. Meta and Instagram) and are dealing with further limits on collecting personalized user data to fuel mobile and online advertising platforms since Apple’s iOS 14+ changes. The Act’s depth of competence and its articulation of real-world risks are to be applauded for setting the expectation that fairness be integrated into standard operational processes for companies using AI. Ensuring fairness means measuring fairness: how should we balance data collection in the name of ethics?
At the federal regulatory level, the United States has gone as far as publishing an AI Risk Management Framework through NIST, including a compendium of resources and excellent playbooks. The White House has also published A Blueprint for an AI Bill of Rights. However, we have not yet established federal AI legislation. Several states, such as California (AB 302, 2023), Connecticut (SB 1103, 2023), Louisiana (SCR 49, 2023), and Vermont (HB 410), are leading the way with AI regulation. The guidance we do have with respect to AI and algorithms consists primarily of “risk-based” recommendations, meaning it is up to the company or individual to assess the risks to reputation, brand, or civil liability.
By contrast, the EU AI Act defines high-risk data and AI use cases and is explicit about the requirements for documentation, training, and transparency. Unlike the NIST framework, the EU AI Act defines high-risk and prohibited AI use cases (see Title III, Chapter 1, Article 6). Article 9’s layout of levels of risk management is neither controversial nor unreasonable. If a high-risk AI system may compromise the health, safety, or fundamental rights of individuals, it is the company’s responsibility to implement a continuous, iterative process that includes estimation and evaluation of the risks when the AI system is used, as well as evaluation of other possibly arising risks based on the analysis of data gathered from the production model (Article 9, Section 2(c)). This raises the question of the availability of labeled datasets, specifically data labeled with sensitive, protected user/consumer attributes, to ensure fairness can be evaluated and analyzed.
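To make that question concrete, here is a minimal sketch of what such a fairness evaluation might look like once labeled sensitive attributes exist. It assumes a pandas DataFrame of model outputs on an evaluation set with a consented, access-controlled group column; the column names and the choice of metrics are illustrative assumptions, not anything the Act prescribes.

```python
import pandas as pd


def demographic_parity_gap(df: pd.DataFrame, group_col: str, pred_col: str) -> float:
    """Largest difference in positive-prediction rates between any two groups."""
    rates = df.groupby(group_col)[pred_col].mean()
    return float(rates.max() - rates.min())


def equal_opportunity_gap(df: pd.DataFrame, group_col: str, pred_col: str, label_col: str) -> float:
    """Largest difference in true-positive rates between any two groups."""
    positives = df[df[label_col] == 1]
    tpr = positives.groupby(group_col)[pred_col].mean()
    return float(tpr.max() - tpr.min())


# Hypothetical usage on an evaluation set scored by the production model:
# eval_df = pd.DataFrame({"group": [...], "label": [...], "prediction": [...]})
# print(demographic_parity_gap(eval_df, "group", "prediction"))
# print(equal_opportunity_gap(eval_df, "group", "prediction", "label"))
```

Every line of that sketch presumes that a labeled, protected attribute is available at all, and that presumption is exactly the sticking point.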
This has been keeping me up at night. While the intent is good, there seems to be a tension between GDPR and the application of the AI Act that leaves several open questions for real-world practitioners. Since GDPR, tech companies have been vigilantly establishing compliant data collection and governance procedures with respect to personal data. The prevailing ethos for those operating in the EU has been “privacy by design”: that is, we maintain de minimis principles of data collection, establish high levels of control over access to that data, and allow for deletion and anonymization, to name a few. Some policy experts and privacy lawyers contend that nothing should dramatically change from a data perspective, since GDPR still applies; the AI Act is simply a “sprinkle” on top of existing data policy. As data scientists and machine learning experts, we are left to wonder whether we have sufficient data to stick to the letter of the law.
In its prescriptive articulation of how to handle high-risk data (Title III, Chapter 2, Article 10, Section 5), the Act opens the possibility of collecting and processing special categories of sensitive personal data in order to ensure that bias detection and correction can be achieved for high-risk systems. The Act is careful to limit the scope to training data and goes on to recommend that one pursue this route only if bias detection and correction cannot be effectively fulfilled by processing other data, including synthetic or anonymized data. Anonymized data still requires collection, which requires consent and clear articulation to end users. In Europe, we also implement specific protocols for deletion and retention, and ideally keep this data isolated from generally accessible training data. The potential to leverage synthetic data is still in its R&D stages. Fairness algorithms also generally come at the cost of other notions of fairness, which itself creates a level of risk.
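For illustration, a bias check built on the anonymized-data route might look something like the sketch below: draw a small, rotating audit sample, generalize the sensitive attribute (exact age into age bands), strip identifiers, and only then compute per-group selection rates. The column names, bin edges, and pandas-based approach are assumptions made for the sake of the example.

```python
import pandas as pd

# Illustrative generalization of an exact age into coarse bands.
AGE_BINS = [0, 25, 40, 60, 120]
AGE_LABELS = ["<25", "25-39", "40-59", "60+"]


def build_audit_sample(raw: pd.DataFrame, frac: float = 0.05, seed: int = 7) -> pd.DataFrame:
    """Sample, generalize, and minimize before the data leaves the secure store."""
    sample = raw.sample(frac=frac, random_state=seed)
    sample = sample.assign(age_band=pd.cut(sample["age"], bins=AGE_BINS, labels=AGE_LABELS))
    # Keep only what the bias check needs; drop identifiers and the raw attribute.
    return sample[["age_band", "prediction", "label"]].reset_index(drop=True)


def selection_rates(audit: pd.DataFrame) -> pd.Series:
    """Positive-prediction rate per generalized group: the basic bias-detection signal."""
    return audit.groupby("age_band", observed=True)["prediction"].mean()
```

Rotating the sample over time limits how much any single snapshot reveals, but the sample still has to be collected, consented to, and retained under the same controls as any other personal data.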
Of course, GDPR provisions such as strict controls and documentation still apply. However, I am concerned we are opening the door to thinking about data collection differently. Whereas previously we were strictly limited to collecting personalized data in furtherance of a product experience, pending user consent, we are now asking companies to solicit more data and perform a balancing act over how much and what data to collect to ensure fairness while maintaining user privacy. Is ignorance an option? From a privacy-by-design perspective, if we don’t collect the data in the first place, why start? If the AI product is in the high-risk category, then ignorance is not an option. From a risk management perspective, we are assuming the risk that an unsupervised AI system will learn sensitive user attributes. The strategy of asking for forgiveness later requires a solid understanding of the risks and costs if the system is shown to be biased.
Recommendations:
- Existing AI companies operating in the EU already comply with GDPR. Continued access controls and strong anonymization protocols should allow engineering teams to integrate fairness and bias testing without violating privacy rules.
- Teams need to integrate fairness and bias testing into their testing and evaluation framework (e.g. experimental launch reviews). This means, for example, understanding the scope of risk your AI product presents, establishing reasonable thresholds for fairness across key categories of interest (see the gate sketched after this list), and reviewing the evaluation criteria with accountability programs dedicated to reviewing your AI model’s performance. Accountability is part of the human oversight necessary for AI governance.
- For new AI companies, we recommend thinking about Responsible AI by Design. Develop the product with privacy in mind and establish ML training and development processes with fairness and bias as part of the operational evaluation process. Establishing a set of principles from the outset will provide the governance framework. Actively train with third-party, compliant test data or synthetic datasets before considering data collection. Limit the data points collected to only what is required by establishing statistically valid, rotating, anonymized samples designed to test for bias without compromising privacy.
- When collecting additional data, check how the system performs when fed with new personal data; segregate training data from live data; and document the setup of datasets and their characteristics (a lightweight record is sketched below).
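On the second recommendation, a launch-review gate is one way to make fairness thresholds operational. The metric names and threshold values below are placeholders to be set by your own risk assessment; nothing here is prescribed by the Act or the NIST framework.

```python
# Minimal sketch of a fairness gate in an experimental launch review.
# Metric names and thresholds are hypothetical placeholders.
FAIRNESS_THRESHOLDS = {
    "demographic_parity_gap": 0.05,
    "equal_opportunity_gap": 0.05,
}


def failed_fairness_checks(measured: dict[str, float]) -> list[str]:
    """Return the checks whose measured gap exceeds the agreed threshold (missing metrics fail closed)."""
    return [
        name for name, limit in FAIRNESS_THRESHOLDS.items()
        if measured.get(name, float("inf")) > limit
    ]


# Hypothetical usage inside a launch review:
# failures = failed_fairness_checks({"demographic_parity_gap": 0.08, "equal_opportunity_gap": 0.03})
# if failures:
#     raise RuntimeError(f"Launch blocked pending accountability review: {failures}")
```

Failing closed when a metric has not been measured pushes the decision back to the accountability program rather than letting an unevaluated launch slip through.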
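For the last recommendation, documenting the setup of datasets can be as simple as a structured record that travels with every training or audit dataset. The fields below are an assumed, minimal selection for illustration, not a compliance checklist.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Illustrative documentation record for a training or bias-audit dataset."""
    name: str
    purpose: str                       # e.g. "training" or "bias audit"
    source: str                        # where the data came from and its legal basis
    collection_period: str
    contains_special_categories: bool  # sensitive attributes collected under the Article 10(5) route
    anonymization: str                 # e.g. "age generalized to bands, direct identifiers dropped"
    segregated_from_live_data: bool    # training data kept apart from production traffic
    retention_until: str
    known_gaps_and_biases: list[str] = field(default_factory=list)


# Hypothetical usage:
# record = DatasetRecord(
#     name="bias_audit_2024_q1",
#     purpose="bias audit",
#     source="consented EU users, rotating 5% sample",
#     collection_period="2024-01 to 2024-03",
#     contains_special_categories=True,
#     anonymization="age generalized to bands, direct identifiers dropped",
#     segregated_from_live_data=True,
#     retention_until="2024-09-30",
#     known_gaps_and_biases=["under-representation of the 60+ age band"],
# )
```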
References:
- Bias on Demand: A Modelling Framework That Generates Synthetic Data With Bias (an excellent paper on synthetic data, with rich academic references on understanding, accounting for, and mitigating bias)
- The role of explainable AI in the context of the AI Act
- Let’s just copy Europe’s homework on AI
- https://www.theverge.com/2023/12/14/24001919/eu-ai-act-foundation-models-regulation-data
- https://www.dataguidance.com/opinion/france-cnil-guidance-ai-preview-eu-ai-act