Attacks on Machine Unlearning: How Unlearned Models Leak Information

Posted on Mon 13 October 2025 in ml-memorization

In the previous articles, you've been exploring the field of machine unlearning, investigating whether you can surgically remove memorized or learned data from models without retraining them from scratch or from an earlier checkpoint.

Unlearning is one proposed solution to the AI/ML memorization problem explored in this multi-article series.

In this article, you'll investigate whether the currently proposed unlearning methods are safe against the original attack definitions, as well as against any interesting new attacks that unlearning itself might introduce.

Evaluating Unlearned Models with MIAs

As you already learned in the unlearning definition article, unlearning isn't yet well defined. This means that most research uses subpar or mixed evaluation criteria to determine if something is unlearned. The lack of clear, easy-to-implement and consistent evaluation criteria means that it's almost impossible to compare the many approaches against one another in any meaningful way.

Suppose the AI/ML industry settled on a consistent and useful metric, like MIA, together with a consistent approach to MIA testing, like holding false positives at a fixed low rate (say 3%). Researchers and practitioners alike could then more easily evaluate use cases and determine their risk appetite. New unlearning approaches could be assessed quickly and progress could be made, because the measurements would be directly comparable.
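
As an illustration of what such a fixed-false-positive-rate report could look like, here is a minimal sketch, assuming you already have attack scores for members and non-members; the 3% rate and the synthetic scores below are placeholders:

```python
import numpy as np

def tpr_at_fixed_fpr(member_scores, nonmember_scores, target_fpr=0.03):
    """True-positive rate of a threshold attack, holding the FPR at target_fpr."""
    member_scores = np.asarray(member_scores)
    nonmember_scores = np.asarray(nonmember_scores)
    # Choose the threshold so that only `target_fpr` of non-members exceed it.
    threshold = np.quantile(nonmember_scores, 1.0 - target_fpr)
    return float(np.mean(member_scores > threshold))

# Hypothetical attack scores: higher means "more likely a training member".
rng = np.random.default_rng(0)
members = rng.normal(1.0, 1.0, 5000)      # e.g. forget-set examples
nonmembers = rng.normal(0.0, 1.0, 5000)   # e.g. held-out examples
print(f"TPR at 3% FPR: {tpr_at_fixed_fpr(members, nonmembers):.3f}")
```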

Hayes and colleagues from Google DeepMind (2024) call out trends in unlearning research where suboptimal MIAs are used to boost the perceived performance of unlearning methods. By weakening attacks and then demonstrating that the unlearning method "works", much unlearning research gives a false sense of privacy without real gains.

One reason behind this performance disparity is that researchers usually only perform MIAs on the forget-set points, sometimes adding a small random subsample of the training data. But Hayes and team found that targeted attacks on a wider selection of training data, particularly points that might be overexposed after unlearning, show that "state of the art" unlearning actually makes new groups of people vulnerable.

In addition, the choice of the forget set introduces problems. The forget set should ideally be a diverse representation of the training data (from common to uncommon cases) in order to truly evaluate whether the method can work. Hayes and team found that some forget sets are cherry-picked -- causing unrealistic outcomes compared to forget sets chosen via representative sampling.

Since unlearning, like learning, proceeds at different rates depending on example difficulty and class diversity, the authors call for explicit conversations about unlearning's privacy tradeoffs. This also means offering practical advice, like which unlearning hyperparameters to choose and what metrics to use as stopping criteria (i.e. when the model has unlearned enough and is ready for use).

In their own testing, they found the LiRA attack by far the most effective at exposing privacy risk and at providing a repeatable way to test and compare unlearning methods. In their experiments, they compared per-example LiRAs against "population" LiRAs and found the former qualitatively and quantitatively better at modeling privacy risk.
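
To make the per-example idea concrete, here is a minimal sketch of a per-example LiRA score, assuming you have already collected confidences from shadow models that did ("in") and did not ("out") see the target example; it follows the general LiRA recipe rather than Hayes et al.'s exact implementation:

```python
import numpy as np
from scipy.stats import norm

def logit_scale(confidence, eps=1e-6):
    """Map softmax confidences into a roughly Gaussian score, as LiRA does."""
    p = np.clip(np.asarray(confidence, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def per_example_lira(target_confidence, in_confidences, out_confidences):
    """Log-likelihood ratio that the target model trained on this example.

    A "population" LiRA would instead pool the "out" statistics across all
    examples and apply one global threshold, which is what makes it weaker.
    """
    x = logit_scale(target_confidence)
    scaled_in, scaled_out = logit_scale(in_confidences), logit_scale(out_confidences)
    mu_in, sd_in = scaled_in.mean(), scaled_in.std() + 1e-6
    mu_out, sd_out = scaled_out.mean(), scaled_out.std() + 1e-6
    return norm.logpdf(x, mu_in, sd_in) - norm.logpdf(x, mu_out, sd_out)
```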

This, of course, involves significant dedication to privacy risk testing as a normal part of training and operational infrastructure, as it requires the ability to:

  • test a variety of sampling methods for forget and retain sets
  • run fine-tuning for unlearning, ideally creating several unlearned models
  • train several (smaller) reference models that haven't seen the forget sets
  • perform example-by-example LiRAs comparing the unseen models against the unlearned models
  • run a tradeoff evaluation to determine which unlearned model to use
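
The loop below is a minimal sketch of what that infrastructure has to do, where `train_model`, `unlearn` and `confidence_on` are placeholders for your own training, unlearning and inference code rather than a real API:

```python
import numpy as np

def evaluate_unlearning(train_model, unlearn, confidence_on,
                        retain_set, forget_set, n_models=8):
    """Collect per-example confidences from unlearned vs. never-trained models."""
    unlearned_confs, unseen_confs = [], []
    for seed in range(n_models):
        # A model that saw the forget set and then had it unlearned.
        full_model = train_model(retain_set + forget_set, seed=seed)
        unlearned_model = unlearn(full_model, forget_set)
        # A reference model that never saw the forget set at all.
        unseen_model = train_model(retain_set, seed=seed)
        unlearned_confs.append([confidence_on(unlearned_model, x) for x in forget_set])
        unseen_confs.append([confidence_on(unseen_model, x) for x in forget_set])
    # If unlearning worked, the two per-example distributions should be hard to
    # tell apart; feed them to a per-example LiRA like the one sketched above.
    return np.array(unlearned_confs), np.array(unseen_confs)
```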

Doing this at a non-big-tech company is probably infeasible. Doing it even at a big-tech company requires significant investment: planning, people, expertise and compute time devoted to privacy metrics, which is probably not the case at those companies today. If the field wants better and more consistent metrics, there need to be easier ways to opt into regular privacy testing. These processes should be streamlined into normal training and evaluation languages, frameworks and ML/AI pipelines.

Aside from the additional resources required for appropriate unlearning evaluation, unlearning introduces new attacks. Let's investigate emergent attacks against unlearned models.

New Unlearning Attacks

Since unlearning has a before state and an after state, the difference between the two can be exploited to reveal exactly what was unlearned. Chen et al. (2021) introduced a novel membership inference attack that reveals whether a target sample was part of the original model's training data and was subsequently unlearned.

To do so, they:

  1. Train an original model and one or more unlearned models. In a real attack, the attacker has either downloaded the previous open-weight model or saved inputs/outputs from earlier model versions.
  2. Process a chosen example through each model and collect the prediction output. The model returns confidences across several classes or potential next steps in a sequence (like predicting the next word). Save these outputs; if possible, save the raw values at the logit layer directly, as in some of the LiRA tests.
  3. Train a discriminator to separate the outputs of the unlearned model from those of the original (a minimal sketch follows below). You can train this discriminator using local shadow models and then test it on the actual outputs of the models you are trying to mimic/attack. If you are using logits, as in LiRA, you may also be able to infer a threshold and use it to estimate how likely it is that an output came from the original or the unlearned model.
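
Here is a minimal sketch of a discriminator in the spirit of step 3, assuming shadow models give you posterior pairs with known "was unlearned" labels; the synthetic data and the logistic-regression classifier are illustrative placeholders, not Chen et al.'s exact setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_features(original_posteriors, unlearned_posteriors):
    """Concatenate both posterior vectors; the gap between them is what leaks."""
    return np.concatenate([original_posteriors, unlearned_posteriors], axis=1)

# Hypothetical shadow data: one row per example, one column per class confidence.
rng = np.random.default_rng(0)
n_examples, n_classes = 2000, 10
orig = rng.dirichlet(np.ones(n_classes), size=n_examples)
unl = rng.dirichlet(np.ones(n_classes), size=n_examples)
was_unlearned = rng.integers(0, 2, size=n_examples)   # labels from the shadow setup

attack = LogisticRegression(max_iter=1000)
attack.fit(build_features(orig, unl), was_unlearned)
# At attack time: feed a target example's two posteriors to attack.predict_proba.
```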

The most useful targets for this attack are points that have been unlearned, or points close to them in proximity or other attributes, since you expect the unlearning process to change those outputs the most.

To mitigate these attacks, the authors recommend suppressing information in the outputs, like returning only the most probable class or next token without any confidence scores. Of course, if you are releasing an open-weight model this isn't possible. They also reference more robust and holistic approaches, like applying differential privacy, which you'll explore further in the next article.
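
As a tiny sketch of that suppression idea, assuming you control the serving layer and can drop everything except the top label:

```python
import numpy as np

def suppressed_prediction(posterior, labels):
    """Return only the most probable label, with no confidence scores attached."""
    return labels[int(np.argmax(posterior))]

# Example: the caller sees "cat", not the full confidence vector.
print(suppressed_prediction([0.1, 0.7, 0.2], ["dog", "cat", "bird"]))
```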

This attack has since been updated and enhanced by Bertran et al. (2024). The authors use similar methods to compare the original and unlearned models and reconstruct the unlearned data. How does that work?

The authors investigated the case where the trained and unlearned models differ by a single example. They were able to essentially calculate the exact difference between the models attributable to that one example. This gave them enough information to make a rough guess at the sample itself by approximating the input that would account for that change in the model weights (i.e. an approximation of the embedding given the change in the gradients). This is similar to model inversion attacks, where you can reveal input and class information from model gradients and activations, and it belongs to the family of gradient reconstruction attacks, which has an extensive literature.
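
The toy sketch below shows the flavor of a gradient reconstruction attack on a single linear layer, under the strong simplifying assumption that "unlearning" amounts to one gradient-ascent step on the forgotten example; it illustrates the attack family rather than Bertran et al.'s actual method:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, n_classes, lr = 32, 5, 0.1

# Original model: a single linear layer standing in for the target model.
w_orig = torch.randn(n_classes, dim, requires_grad=True)

# The forgotten example (unknown to the attacker).
x_secret = torch.randn(dim)
y_secret = torch.tensor([2])

# Simulate naive "unlearning" as one gradient-ascent step on the forgotten example.
loss = F.cross_entropy((x_secret @ w_orig.t()).unsqueeze(0), y_secret)
grad_secret, = torch.autograd.grad(loss, w_orig)
w_unlearned = (w_orig + lr * grad_secret).detach()

# Attacker: observes both weight matrices, recovers the implied gradient,
# then optimizes a dummy input/label pair to reproduce that gradient.
observed_grad = (w_unlearned - w_orig.detach()) / lr
w = w_orig.detach().clone().requires_grad_(True)
x_dummy = torch.randn(dim, requires_grad=True)
y_dummy = torch.randn(n_classes, requires_grad=True)   # soft label, optimized too
opt = torch.optim.Adam([x_dummy, y_dummy], lr=0.05)

for _ in range(2000):
    opt.zero_grad()
    dummy_loss = F.cross_entropy((x_dummy @ w.t()).unsqueeze(0),
                                 F.softmax(y_dummy, dim=0).unsqueeze(0))
    dummy_grad, = torch.autograd.grad(dummy_loss, w, create_graph=True)
    ((dummy_grad - observed_grad) ** 2).sum().backward()
    opt.step()

# A value near 1 means the forgotten input's direction was recovered (up to sign/scale).
print(abs(F.cosine_similarity(x_dummy.detach(), x_secret, dim=0).item()))
```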

The authors found that the loss on unlearned examples behaves differently from the loss on other examples. These oddities in the confidence outputs are artifacts of the deep learning-based unlearning methods, and they reveal that something in that class, or near that training embedding, was unlearned.

In many ways, unlearning methods create model artifacts that leave clues as to what was unlearned. Even when done at scale, this could quickly expose "missing information", especially when comparing model responses over time. Because unlearning methods don't take this into account, they introduce a new security and privacy problem that should be addressed.

Additionally, new attacks target how information is stored in the embeddings themselves. By investigating embeddings you can find personal information like names, screen names, addresses, and other training data contents. This means embedding model updates can also expose who requested their data removal, as well as new persons and sources in the training data.2

So far, you know that unlearning creates changes that can be observed in the forget sets and the retain sets. Some of these changes enable new attacks, like these new reconstruction avenues. But does unlearning create any other privacy risks?

The Privacy Onion Effect

Carlini et al. published a paper called "The Privacy Onion Effect" in 2022, which outlined new privacy risks when unlearning targets memorized examples. The authors discovered that removing memorized data exposes new, different data points that were previously sheltered by those memorized points.

They define the effect as:

Removing the “layer” of outlier points that are most vulnerable to a privacy attack exposes a new layer of previously-safe points to the same attack

They use LiRA and measure attack success across many points in a given dataset (similar to Hayes et al.). They measure the attack "advantage" (i.e. increase in exposure) for a particular training data example. As you already learned, some data points are more prone to memorization and to attack, particularly those that might be considered rare, novel or complex.

Removing these points with unlearning exposes new points, which become the comparatively rare, novel and complex ones once the memorized data points are gone.
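
Here is a minimal sketch of that "peeling" experiment, where `train_model` and `attack_score` are placeholders for your own training code and a per-example MIA such as LiRA:

```python
import numpy as np

def peel_one_layer(train_model, attack_score, dataset, remove_fraction=0.05):
    """Remove the most attack-vulnerable points and see who becomes exposed next."""
    model = train_model(dataset)
    scores_before = np.array([attack_score(model, x, dataset) for x in dataset])
    # "Peel" the most vulnerable layer of the onion.
    cutoff = np.quantile(scores_before, 1.0 - remove_fraction)
    keep_idx = [i for i, s in enumerate(scores_before) if s <= cutoff]
    peeled_dataset = [dataset[i] for i in keep_idx]
    # Retrain without the vulnerable layer and re-attack the remaining points.
    peeled_model = train_model(peeled_dataset)
    scores_after = np.array([attack_score(peeled_model, x, peeled_dataset)
                             for x in peeled_dataset])
    # The onion effect predicts that many previously-safe points now score higher.
    return scores_before[keep_idx], scores_after
```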

Going back to margin theory, you can think of these points like support vectors, holding up the decision boundaries. When you remove one layer of these supporting points, the next layer of supporting points is exposed. This can keep going, like an onion.1

The privacy onion effect isn't global and isn't reproducible in every model and every dataset. This once again shows how important it is to address these risks as you develop models, so that the unique privacy risks of each dataset, model architecture and model task combination are better understood. In fact, the authors show that privacy auditing is unstable, producing different privacy risk estimates even with small dataset changes.

The authors' advice is clear: if you're doing privacy audits via membership inference, you need to use your actual training dataset because changes in the dataset will significantly affect the privacy risk of individuals (both those removed and others) in the model.

Because unlearning presents new and different risks and attacks, it's important to step back and review the original goal.

What is unlearning trying to achieve?

In many ways unlearning is poorly defined and implemented because it's built on a shaky understanding of privacy risk in deep learning. From where I sit, unlearning research feels like a back-and-forth conversation between privacy lawyers and technologists where neither side really understands what the other is trying to say.

In my opinion, it'd be helpful to evaluate: What are we trying to achieve when implementing unlearning?

If you want to build privacy-respecting deep learning systems, you have to acknowledge and embrace how and why problems like memorization happen. If you take a holistic approach, you'll see that almost none of the unlearning research focuses on that part of the problem: why and how memorization occurs. Instead, it focuses on byproducts of this phenomenon, reducing performance on a particular forget example without addressing how the information in that example affects model outputs, or how that example relates to other data.

Defining unlearning is not just an activity for lawyers, policy makers and technologists. Privacy and privacy risk are a lived human experience, and some people carry an undue amount of risk just because of who they are (i.e. outliers, underrepresented persons, people who "stick out"). Defining how to address the memorization problem means having conversations as a society about the actual risks, not hand-waving that a solution will present itself automatically.

If AI systems are going to affect people's lives, work and communities, defining AI privacy and redress must take into account the impact of these systems on people's lives and the unique effectiveness of these systems at using memorization to expose some people more than others.

In the next three articles, you'll investigate differential privacy and privacy auditing as a solution to the memorization problem.


  1. The authors tested many possible explanations, such as training regularization noise and the presence of outliers or duplicates, and discovered that this phenomenon isn't global, it's local. It doesn't act uniformly, although it does generalize: it affects not just a few examples but whole groups of them. Targeted attacks show that some points are easier to attack than others. In one experiment, humans inspected the images to find and remove the 10 most similar examples to a target, to see whether this made the target image vulnerable.

  2. See Privacy Side Channels in Machine Learning Systems, Debenedetti et al., 2024