How do you document the AI processing for GDPR?

by Sypher - February 08, 2024


In the highly dynamic context of rapidly evolving technology, the increasing use of artificial intelligence (AI) and the first EU-wide agreement to regulate AI, we were pleased to host the #SypherPrivacyTalks webinar How do you document AI processing for GDPR (held in Romanian), with guest speaker Tudor Galoș, consultant specialising in digital transformation and privacy. Below we present the main points of the discussion.

A note of clarification: we are referring here to generative AI, especially text-related AI, i.e. LLMs (Large Language Models), which are essentially models that interpret, generate and write text, translate, and can even hold a conversation. We do not necessarily cover other types of AI, for example those that play chess.

When discussing AI, even at a generic level, we need to understand that AI does what it is taught to do. If you start feeding fake content into an LLM, for example telling it "The sky is red", it will answer, "No, the sky is blue"; but if you keep insisting, and more users come along and say "The sky is red", then when the AI is next asked what the sky usually looks like, it will answer "Red".


Risks of using AI

One of the first risks to look at is the accuracy of the information. ChatGPT and, in general, any generative AI produces, alongside correct content, a lot of false content that is merely statistically plausible. There have been instances, for example, of lawyers in the United States going to court citing cases fabricated by ChatGPT, cases that never actually existed.

Artificial intelligence algorithms are based primarily on statistics and some very complex equations. LLMs belong to a family called deep learning, which is part of machine learning, itself part of the broader family of artificial intelligence algorithms. Deep learning models are trained on huge amounts of data and validated on separate datasets, a multi-step learning process with minimal human involvement.

The importance of input validation is therefore paramount. Hence the difference in impact between a public LLM and a private one that does the learning on your servers. We'll talk more about this later in the webinar. 

Given that an AI learns continuously, if you give it personal data, for example by copy-pasting from a CV, it will "ingest" that data and use it later, which is personal data processing. The result is that we lose control over that data, because it is practically impossible to find it and remove it from the LLM.


What kind of personal and non-personal data do AI algorithms process?

First of all, we have to start from what personal data means, because the definition is very broad. Personal data (simplified definition) = Any information that directly or indirectly leads to the identification of a person. So even a simple description can be considered personal data.


"The guy in the orange T-shirt in Sypher's AI webinar" is an example of an indirect description that can lead to the identification of Tudor Galos 

 

Apart from this, ChatGPT can remember a lot more personal data, including the time you made a request. It can measure the speed at which you type, or detect whether you are speaking through a voice recognition system, and so on.

So all the data is fed into these decision-making processes, and that's where the problem comes in. It's all statistical and it's very important how the algorithm is set up at that moment to give relevant answers. 

Many HR departments are tempted to use artificial intelligence algorithms in the recruitment process. But this is where the tricky part comes in, because that algorithm can end up discriminating due to the data sets on which the learning has been done. 

“The biggest problem with artificial intelligence algorithms today is discrimination.” 

And there's the additional problem that you don't always understand how the algorithm makes its decisions, so it's hard to tell whether the recommendation is correct or not. It's a kind of black box: you know what you're inputting, you see what comes out, but you don't know exactly what's going on in the middle. And when something bad happens and you have to investigate, it's very hard to trace what actually went on.


What measures can we take to ensure GDPR compliance when using generative AI with personal data?

Can procedures or policies be established to help with GDPR compliance if you decide to use generative algorithms in your business process? 

 

First of all, GDPR always starts from the risks to the individual, not to the company.

So you have to ask yourself, how can you potentially harm a person if you process their data through an AI algorithm?  "I risk missing out on very good candidates, discriminating against people of other races, people who were born in a certain region and so on." 

The first thing you need to do is validate the input data. You need to make sure it is relevant, contextual and does not discriminate. There are specific procedures for this, but essentially you have to make sure the data is of good quality, relevant and diverse, that it has been analysed in the context in which it is going to be used, and that it is cleaned of errors, inconsistencies and missing values.
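To make this concrete, here is a minimal sketch in Python (pandas) of what such basic input checks might look like; the dataset and the column names (years_experience, gender, region) are invented for illustration and were not part of the webinar.

    import pandas as pd

    # Hypothetical recruitment dataset used to feed or train the algorithm
    df = pd.read_csv("candidates.csv")

    # 1. Missing values and duplicate rows
    print(df.isna().sum())
    print(df.duplicated().sum(), "duplicate rows")

    # 2. Obvious inconsistencies, e.g. implausible years of experience
    invalid = df[(df["years_experience"] < 0) | (df["years_experience"] > 60)]
    print(len(invalid), "rows with implausible experience values")

    # 3. Diversity of the data: a heavily skewed distribution on a protected
    #    attribute is an early warning sign of discrimination risk
    print(df["gender"].value_counts(normalize=True))
    print(df["region"].value_counts(normalize=True))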

Then follows the pre-processing of the data through various techniques, such as Principal Component Analysis or Linear Discriminant Analysis, to reduce dimensionality and increase relevance. It's not easy! Because, let's not forget, algorithms have no ethics, and that is the big challenge.
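As an illustration, here is a minimal sketch of the Principal Component Analysis step using scikit-learn; the data is random placeholder data, the point is only to show dimensionality reduction in practice.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(500, 40)                    # placeholder: 500 samples, 40 features

    X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
    pca = PCA(n_components=0.95)                   # keep 95% of the variance
    X_reduced = pca.fit_transform(X_scaled)

    print("original features:", X.shape[1])
    print("components kept:", pca.n_components_)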

You have to ask yourself: OK, what can go wrong? So you test, you see how the algorithm behaves on certain data, you create some "synthetic" data (artificially generated data) to test the algorithm and find out where it needs to be corrected.
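One simple way to do this, sketched below under the assumption of a recruitment scenario, is to generate synthetic pairs of profiles that are identical except for a protected attribute and check that the scores do not drift apart; score_candidate() is a hypothetical placeholder for the algorithm under test.

    import random

    def score_candidate(profile):
        # Placeholder: in practice this would call the model or API being tested
        return 50 + profile["years_experience"] * 2 + random.uniform(-1, 1)

    random.seed(0)
    gaps = []
    for _ in range(1000):
        base = {"years_experience": random.randint(0, 20), "gender": "F"}
        twin = dict(base, gender="M")   # identical profile, different protected attribute
        gaps.append(score_candidate(twin) - score_candidate(base))

    # A systematic gap would indicate that the two groups are treated differently
    print("average score gap:", sum(gaps) / len(gaps))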

“The use of AI should not be done on an ad-hoc basis but as part of a rigorous process.” 

In this process we determine exactly how we need to process the data, make sure each data set has the same components, set the "prompt" (what we are asking the model to do) and test to see how well the results match what a human would do, only much more efficiently.
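A very small sketch of that last step, comparing the algorithm's decisions against human decisions on a labelled test set; the labels below are invented purely for illustration.

    human = ["reject", "invite", "invite", "reject", "invite", "reject"]
    ai    = ["reject", "invite", "reject", "reject", "invite", "invite"]

    # Overall agreement between the algorithm and the human reviewers
    agreement = sum(h == a for h, a in zip(human, ai)) / len(human)
    print(f"agreement with human reviewers: {agreement:.0%}")

    # The disagreements are the cases to review before trusting the prompt
    for i, (h, a) in enumerate(zip(human, ai)):
        if h != a:
            print(f"case {i}: human={h}, AI={a}")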

“If we use generative AI to process personal data, it is mandatory to carry out a DPIA.”

Any implementation of AI technology that processes personal data must be subject to a DPIA. Because it's high risk. In addition to the discrimination that we've already talked about, we're talking about risks of incorrect processing, risks of unwanted disclosure of personal data, risks of decisions that have a legal impact on a person and so on. 

So we also have to document the controls that are in place so that these things don't happen: what measures we have actually taken, and how much they reduce the likelihood of these risks materialising.
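How you record this is up to each organisation; as a rough sketch (not a legal template), the risks and controls could be kept in a simple structured register like the one below, with invented entries.

    # Illustrative DPIA risk register: each risk is listed together with the
    # controls applied and the residual likelihood after those controls
    dpia_risks = [
        {
            "risk": "Discrimination in candidate screening",
            "impact_on_individual": "high",
            "controls": ["balanced training data", "synthetic-pair testing",
                         "human review of every rejection"],
            "residual_likelihood": "low",
        },
        {
            "risk": "Unwanted disclosure of personal data to the LLM provider",
            "impact_on_individual": "high",
            "controls": ["input filtering / pseudonymisation",
                         "data processing agreement with the provider"],
            "residual_likelihood": "medium",
        },
    ]

    for r in dpia_risks:
        print(r["risk"], "->", ", ".join(r["controls"]))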

The GDPR itself specifies that a DPIA is mandatory for such new technologies.


How do we identify the parties involved in processing: controllers, joint controllers, processors? 

How do we identify these parties in the context of an LLM offered by OpenAI in the US? How do we establish the roles and map this whole process, which has quite a lot of components and unknowns? 

There are indeed several "players" here.  It's not just the algorithm and the company that implements the algorithm. There are several key roles:  

  • First of all, there is the role of the producer, the one who actually produces the algorithm. 
  • Then there is an intermediary role of the importer, the one who takes the algorithm and brings it from the United States to Europe and starts selling it.  
  • Then there are the distributors or implementers, which are the consulting or development firms that adapt the algorithm to the needs of the organisation.  
  • Finally, there are the users - the organisations or customers that consume the technology. 

Everyone has a role in this equation, and it is very difficult to say that any one of them is merely a processor, for the simple reason that everyone has autonomy. The algorithm evolves, it learns, it changes. What it learns from one client benefits the other clients. So there is mutual influence, and at this point everyone acts as a controller.

The hard part of defining the role of the controller is where you set the boundary of everyone's responsibility. Because they are, in a sense, joint controllers, and you have to come up with a very clear delimitation. 

“The main question is: who is accountable in this chain?” 

If the algorithm makes decisions with negative consequences, who is responsible and accountable in this whole chain? Legislation such as European data laws, including the GDPR, already regulates this, and new legislation like the AI Act will regulate these issues even further.
 

Is a vendor such as Microsoft, in the case of Azure OpenAI, a processor or a joint controller? Listen to the webinar (in Romanian) with Tudor Galoș to learn more practical examples and further details.


How do you document the personal data flows that AI algorithms use?

You must understand that this is an iterative learning process. For example, if you are using a ChatGPT wrapper, you need to know that this data is also going to OpenAI in the US, and the question is, what are they doing with it? Because at that point you need to inform the data subject that this data is also transferred there and what will happen to it.
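As a rough sketch of what documenting such a flow might look like in practice, the record below is illustrative only and is not a statement about any specific provider.

    # Illustrative record of one personal data flow in an LLM integration
    data_flow = {
        "source": "CV upload form (EU)",
        "categories_of_data": ["name", "contact details", "work history"],
        "recipient": "LLM API provider",
        "recipient_location": "United States",
        "transfer_mechanism": "Standard Contractual Clauses / adequacy decision",
        "retention_by_recipient": "to be confirmed with the provider",
        "data_subject_informed": True,
    }

    for field, value in data_flow.items():
        print(f"{field}: {value}")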

If an AI can recognise my writing style, for example, can't an LLM-type algorithm eventually replicate me? And this is where things get really complex.

“We need to be very clear about how we document this information, where it goes and what steps we take to protect it. How we filter the input, for example, and what recommendations we make to users.”
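On the input-filtering point, here is a minimal sketch of stripping obvious identifiers out of a prompt before it leaves your systems; real deployments would need far more robust detection (for example a dedicated PII-detection tool), this only illustrates the idea, and the example text is invented.

    import re

    # Very rough patterns for common identifiers, assumed for illustration only
    PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "phone": re.compile(r"\+?\d[\d .-]{7,}\d"),
    }

    def redact(text):
        for label, pattern in PATTERNS.items():
            text = pattern.sub(f"[{label} removed]", text)
        return text

    prompt = "Summarise this CV: Ana Pop, ana.pop@example.com, +40 721 000 000, ..."
    print(redact(prompt))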

When you plan to use this kind of technology, once you have documented the data flow and identified the parties, and still before you actually use it, you have to include these things in the privacy notice, which is in any case the result of the DPIA done beforehand.

The DPIA comes absolutely first, because first you must make sure that you can actually use this algorithm at all.

Once you've done the analysis and managed all the risks, you then inform people about the risks you have considered and how you have addressed them. So you have to address and mitigate the risks before you inform. Yes, the privacy notice is mandatory and, in addition, if you carry out such processing, you must give the individual the right to object, especially if you rely on legitimate interest.
 

If you are using an AI to evaluate CVs, do you have to inform the candidate that you are using an AI so that they can exercise their right to object to automated processing? Do you also have to tell them what algorithm you are using and get their consent to proceed, with a separate checkbox for automated processing? Is this consent freely given? Listen to the webinar (in Romanian) with Tudor Galoș to learn more practical examples and further details. 


Our own servers versus a service provided by a third party - advantages, disadvantages

From a data protection point of view, it's much better to use your own servers, in the sense that the operation is under your control, you do the validations, you make sure there's no discrimination and so on. You make sure that the answers are correct, that the GDPR principles are respected. 

The downside is the cost. Let's not forget that OpenAI is not an ordinary company, it is a funded foundation. The Pro subscription does not cover the real cost, which is somewhere in the thousands of euros per month.

Just as an example, for every 50 queries, half a litre of water is used just to cool the OpenAI servers, which is not very environmentally friendly. So, if you take on something like this and you have many users, you are going to consume a lot of energy and you are going to need very powerful computing resources. 

This kind of deployment, on your own servers, is more feasible today if you limit the application to a specific technology, for example a knowledge base that can be queried for learning purposes, i.e. talking to users and giving them advice related to that technology.

But, of course, if we look at the other side of the coin, with a public LLM there are some risks that we have already identified: discrimination, false content, etc. So you have to have verification mechanisms in place and that is not easy. 

 

What is new about the AI Act? 

It's good that there are rules and regulations. The EU Data Act and the GDPR already contain many of these elements and are a good foundation, but what the AI Act really brings that is new is a risk-based approach. We're talking about a risk categorisation system, whereby AI systems are regulated based on the level of risk they pose to the health, safety and fundamental rights of a person. There are four categories of risk: 

  • Minimal/no risk, where you don't have any legal obligations, you don't have to do anything.  
  • Limited risk, where you have transparency obligations, you have to inform the user that there are certain risks.  
  • High risk, where systems are regulated and you literally have to test them in a sandbox, a controlled testing environment provided by a government organisation, which will be available in every country. It is not yet known which organisation will provide this. In practice, you as a company will be obliged to test your algorithm in a secure environment, with synthetic data, and validate it with representatives of the authority. 
    Examples include safety applications for the aviation industry, self-driving cars, etc. 
  • Unacceptable risk – where certain systems are banned outright. Those systems at the moment are artificial intelligence systems that use subliminal techniques to manipulate or deceive individuals or groups of people or to alter their behaviour.  

Other examples of banned systems are those that exploit the vulnerabilities of an individual or group of people, for example by manipulating children, the elderly or the mentally ill. The ban also covers systems for biometric categorisation or social scoring (as used in China), which will also affect the FICO score used by various banks, since many now use it in an automated way. AI systems that make risk assessments of people, as well as real-time biometric identification systems in public spaces, are not exempt either. Incidentally, biometric identification is also heavily restricted under the GDPR.

The European Parliament is now trying to add some clarifications on the biometric categorisation of individuals, for example to allow it in therapeutic scenarios.


Conclusion

Let's not forget that AI is the future and we have to learn to use it, but we have to be careful, because using it is not so simple and there are many steps to go through: from how we teach it, what data sets we use and how we clean them, to understanding and interpreting the output, correcting the results and providing feedback. And this is a cyclical, continuous process. 

For more details on how to document personal data processing with AI, more practical examples, and live questions from the audience, watch the full webinar recording (in Romanian).