Blocking AI/ML Memorization with Software Guardrails

Posted on Fr 11 Juli 2025 in ml-memorization

One common way to control memorization in today's deep learning systems is to fix the problem by building software around it. This software can also be used to deal with other undesired behavior, like producing hate speech or mentioning criminal activities.

In this article, you'll learn about how software around AI/deep learning models can be used and explore why these interventions provide more of a good feeling than an actual practical solution to the problem.

How an AI product is designed

AI and deep learning models are just a tiny part of an overall system. Most of the system is deterministic software around the non-deterministic machine learning model. At an extremely high-level, this is how a Chat Assistant system might look:

An example high-level architecture, where you see that the data first comes in via software with an API call, then the data goes through some sort of input processing. Then to the LLM itself (this usually now includes the tokenization as part of the LLM). Then to some output processing and then back to a piece of software with an API call to the user.

In the above figure, the chat messages come in from a user via an API call to software that processes the input. As you learned in exploring the design of a machine learning system, this text will be prepared for the machine learning input. This could vary depending on design, from potential removal or correction of typos, grammatical errors, to appending meta information from the user account or other data source, and then eventually this text and any additional inputs are tokenized and sent to the AI model (via another API call).

The AI model will process the tokenized input and calculate some predicted set of tokens as a response. More often than not, there is now software around this step that requests multiple possible responses. Depending on the design, the model might return the beginning of a response while the system continues calculating the next part of the response. Remember: the model will use its own response as part of the input to continue calculating the next word(s).

If you have heard about topics like temperature, top-k and top-p sampling, these are implemented in software around the model outputs, resulting in multiple queries before the final response is constructed.

You don't need to learn the deep details of these sampling choices and settings, just know these are different parameters that the chat provider and/or the user can set to determine the deterministic or explorative qualities of the response. This creates several ways to sample longer answers and compare or explore response possibilities before determining a final response. For large models, there are other optimizations used, like potentially splitting the prediction task between a small and large model (see: speculative decoding) to improve speed.

Sometimes the response is fully formed, but sometimes the response can start before the final text is formulated. Either way, this response usually goes through another batch of software filters on its way back to the original user.

There is a tradeoff between how much post processing you can do and the response latency, so usually these are light-touch filters and interfaces before the text reaches the user. Depending on the system this might be performed many times before the answer is fully formulated.

This process starts all over again the next time the user sends a message.

Filtering inputs and outputs

As you can probably tell from the diagram, if you want to use software to build protections against memorization you need to either:

catch potentially harmful input before it reaches the AI model (i.e. in the input, text cleaning and tokenization step)
or attempt to remove it as it is produced by the system (either as part of the testing and generation) or before it reaches the user.

Let's explore and compare both options.

Prompt rewriting

In search engines, there's been significant research on rewriting queries to improve user experience, by correcting typos or expanding search terms for better results. This approach inspires the idea of prompt rewriting, where the user's interactions with the model might be modified before it hits the machine learning model.

There are several motivations for rewriting prompts for better alignment with whatever the organization wants the model to do or not do. This is usually provided in a meta prompt (also called system prompt) which describes in natural language how the model should behave and what it should or shouldn't do. You might have seen easy ways around this if the model wasn't trained to distinguish the meta prompt from user input with the "ignore all previous instructions and ...".

But since models don't have any "concept" of what information is learned that can be used and which cannot, this type of intervention doesn't work as easily for memorization problems. Even if a company wanted to list every possible copyright character to not reproduce their likeness (i.e. "Don't show Batman"), there are easy ways to indirectly and even unintentionally anchor copyright or otherwise memorized images/words.

The same research around copyright in generative images experimented with additional approaches, where prompts are tested for similarity to "forbidden" prompts and rewritten to avoid potential problems. This was explored in related research that attempted to identify the forbidden "concepts" (for example: Batman) and then fine tune the model to remove the potentially problematic concept.

For example, a prompt like "Gotham superhero" should align closer with "superhero" and end up further from "Batman". As you might guess, if implemented at scale this could be extremely expensive because you would need to find every possible term, test for memorization and then implement learning interventions. It might also not always work for the task you want it to do (i.e. which well-known superheroes aren't copyrighted?).

In-context Unlearning

In-context learning (sometimes also called few-shot learning) is a common prompt engineering strategy where you type extra instructions and examples into the prompt to demonstrate the task or how you'd like it to respond. In-context or few-shot learning allows users to on-the-fly introduce a new concept or pattern to a general purpose LLM by showing a few examples and then asking for the model to complete the next in the sequence.

For example, you could give a list of sentences and then follow each item with what language it was written in and then upload a document and ask that the model return each sentence with the language it was written next to it.

In-context learning has been used alongside prompt rewriting as a way to "unlearn" concepts. In-context "unlearning" modifies the user prompt to replace data points that should be forgotten with "dummy labels". This only scales if the forget-set is quite small and the concept is easily defined and filtered. Also it won't work as well for things that don't easily mold into an in-context prompt setting (i.e. freeform conversations). In other research on data removal from models, this type of in-context or input rewriting was proven ineffective at reducing training data exfiltration.

Doing in-context unlearning at scale successfully would mean being able to accurately determine that the user is performing an attack or that the prompt would unintentionally release memorized information. But because model developers aren't currently testing for memorization, current architectures and training and evaluation would still need to be modified to cover this input- or output-testing.

How could this type of rewriting or filtering work on the outputs instead of the inputs?

Research and applications in output filtering

Because filtering inputs is fairly difficult, in today's largest AI systems memorization testing is done via unsophisticated output filters. These filters only exist for certain systems and generally test if the model response directly matches training data that should not be output.

For example, GitHub's Copilot can test if the generated code directly matches publicly accessible code. To avoid unnecessary latency, this is usually done via an advanced hashing memory structure, so exact matches are found quickly and the false positive rate remains low.

From Copilot documentation, this is the description of the intervention.¹

Copilot code referencing searches for matches by taking the code suggestion, plus some of the code that will surround the suggestion if it is accepted, and comparing it against an index of all public repositories on GitHub.com. Code in private GitHub repositories, or code outside of GitHub, is not included in the search process. The search index is refreshed every few months. As a result, newly committed code, and code from public repositories deleted before the index was created, may not be included in the search. For the same reason, the search may return matches to code that has been deleted or moved since the index was created.

This explains the recent problems where private repository code was exposed since that code has already been memorized and yet is no longer being tested by the output filters. Depending on the index updates, this could also apply to code you might have deleted -- for example, if you found that you accidentally exposed a secret (like a key or password) or other potentially sensitive details (i.e. exposure to libraries or systems with known vulnerabilities or environment settings).

Additional interventions can test visual output, such as asking a different machine learning model "is Batman in this image?" and block outputs that find undesirable memorized content in the output. As you might imagine, this is very difficult to scale, but might work for smaller models and a small subset of data or tasks.

It is likely that larger LLMs including ChatGPT use some of these output filters to block certain undesired responses (i.e. Terms of Service violations) or to comply with right to be forgotten requests. For example, in recent news, ChatGPT wouldn't respond when the response contained specific person's names which seems like a clear sign of an output-filter rather than concept-unlearning intervention.

You can only catch what you definitely know

The problem is that you can only really do this efficiently if you know what you are looking for and if it scales appropriately. Since very few companies test for memorization as a part of their model evaluation, it's also unknown internally how much memorization happens. If users can adjust settings like temperature or other parameters to shift model behavior at will, this would also change the produced content, making the problem even more non-deterministic than it already is.

For software teams trying to develop these interventions, it's like you're building a box to fit an object in, but nobody has told you what the object is. You're building based on vibes, not based on facts and knowledge.

If rigorous testing for privacy violations and memorization happens as part of the model training and evaluation, then you start from this basic understanding and likely build both better protections and also train models with fewer issues.

Unsurprisingly, software-based filters are easy to bypass. Any motivated attacker can easily sidestep things like prompt rewriting with their own prompt engineering (more on this in the next article).

Chiyuan Zhang presented several easy methods for bypassing the GitHub Copilot output filters (originally published in research Ippolito et al.). By changing the variable names to French or adding comment markers to start the line, previously undesired code output was output because the hashing memory architecture didn't catch the similarities.² This image shows how Ippolito et al.'s attack to produce a previously blocked function description by changing the variable names to French.

An example block of code that shows variable names in French and then many lines of previously blocked code.

This same research group found that models would at times output "style transfer" on memorized text by changing spacing, language or writing style even if not prompted by the user to do so, again showing that near-memorization testing (or paraphrase testing) might be necessary to catch these types of responses.

Determining that someone or something is in the training data and has been memorized is easy to perform when these output filters are on, as they are a direct indicator. Just like the ChatGPT example that (likely) exposed that a person had requested their data be deleted, these blocked answers leak information.

Debenedetti et al. named these types of information leaks "side channels" -- borrowing the term from cybersecurity where an attacker can extract sensitive information by observing changes in outputs or related side channels (often by observing attributes like latency, response content or other signals).

In this case, the side channel is as simple as producing prompts that generate a generic response (like, "I can't help you with that.") or generate a specifically different type of response (i.e. empty response, shortened response or fundamentally divergent response).

In information security, this falls under the concept of non-interference. This concept is easy to see with forgotten passwords. If a password reset form says "We emailed you your password" if an email is found but it says "This user doesn't exist, please create an account" if the email isn't found, then this response leaks potentially sensitive information about whether the person has an account or not.

In conclusion, the output and input filter examples you've read in this article leak particular information about what prompts are allowed and what outputs are allowed (and which not). Via a variety of clever prompts, these rudimentary safeguards are easy to evade. For this reason, software-based filters are not an appropriate intervention for problems like memorization.

In the next article, you'll investigate fine-tuned guardrails and other training-based alignment methods to determine if they are a valid solution to this problem.

As always, I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)

Accessed on 5 March, 2025. ↩
You can now just turn these filters off in GitHub in your settings, and this option is likely to surface in other systems where these settings are not public-facing. ↩