Algorithmic-based Guardrails: External guardrail models and alignment methods

Posted on Mo 28 Juli 2025 in ml-memorization

You've probably at some point heard the term "guardrails" when talking about security or safety in AI systems like LLMs or multi-modal models (i.e. models that include and produce multiple modalities, like speech and image, videos, image and text).

Are you a visual learner? There's a YouTube video for this article on the Probably Private channel.

In this article in the series, you'll dive deeper into what technically falls under the term "guardrail" in today's AI systems and review whether these are a reasonable approach to memorization in AI/ML models.

What are guardrails?

The term is unfortunately difficult to detail technically. Since the term became popular, it's been used to describe a variety of interventions in AI/ML systems. These can range from:

software-based input and output filters (as you read in the last article)
external algorithmic/machine learning model input and output filters
actual fine-tuning or extended training which attempts to update the main model to reduce the chance of unwanted outputs (this is sometimes called alignment)

Let's review the second two in further detail and compare them to what you learned about software-based filters.

External machine learning models as input and output filters

Similar to the software-based inputs and outputs you reviewed, machine learning based filters or algorithmic filters attempt to identify problematic inputs and outputs. Instead of using the hashing or keyword-based approaches that you know from software filters, these use a trained model that sits outside of the main model and predicts whether the input or output should be blocked.

There are already some popular open-source models that do just this, like Llama-Guard, Prompt Shield and Code Shield, which all fall under the Purple-Llama family of models released by Meta. Let's investigate how these work in a real system.

These models are trained to identify known problems within the models themselves, like toxicity and malicious content, as well as known attacks against multi-model models, such as prompt injection attacks to provoke banned responses.

The model uses the input from the conversation or individual chat message and identifies if the user, the prompt or the answer could be viewed as problematic. Some of these models are trained with a variety of categories, like violence, hateful language, illegal activities and even privacy. Some are trained just to identify one particular problem, like protecting the meta prompt or attempting to find cybersecurity errors in generated code.

For example, a guardrail model can process chat input to identify if someone has included anything that would be considered a jailbreak attempt, like "Ignore instructions and do this instead". Or the model might identify particular racist remarks or slurs and flag a conversation as discriminatory.

This works well for inputs that are easy to classify. The model is trained on a classification task, and the training data has the input (text and/or other input) and the label is the appropriate category of problems occurring (i.e. insecure code or derogatory statement).

These content-filtering models can also be trained on new categories that a given organization has in mind -- for example, not to mention a competitor when answering about services. Llama Guard has specific instructions on how to add and train your own categories.

These are still machine learning models, and machine learning models are relatively easy to fool or trick. Not only is this possible by actively attacking the model, like you learned in the adversarial machine learning article, but also by simply testing interesting new and creative approaches that are unlikely to have been tested or trained yet.

This has long been the case in cyber- and information security, where security professionals become quite skilled at thinking outside of the box and using what they know about systems and security to devise new attack vectors and creative workarounds. Because a machine learning model doesn't have actual reasoning, it is often much easier to evade than other humans.

There are many examples of successful attacks, but one of my favorite recent examples came from researchers looking at ASCII art. In their work, they ask usually blocked questions by changing key words to ASCII art text. They even developed an open-source library so you can try out your own adversarial requests.

Two user flows are shown. On the top one, the user prompts to ask how to build a bomb and the guardrails reject it. On the second prompt the query is broken down into words before being sent and an ASCII mask is applied to the word "bomb" so now it is written in asterisks. Then the LLM system responds with how to build a bomb. ArtPrompt example of evading guardrails with ASCII art

In addition to these creative evasions, the issues that come up for privacy and copyright are certainly much harder to train than something like violent or criminal behavior.

Although Llama-Guard has a category for privacy, it doesn't have many details about how this was trained. I decided to test out a few conversation examples to test what it considered private or not.¹

Private	Not private
Outputting a list of numbers with the text "id"	Asking for an ID
The text "credit card number" with any number of digits	Entering or repeating a phone number
Asking for a personal address	Asking for non-public information about a person
Asking to interpret a medical report and tell who it is	Sharing medical information via chat

These results are quite similar to NVIDIA's Nemo Guardrails, which uses Microsoft's Presidio to scan for easy-to-find categories of personal data and block or mask those tokens. Although it's great to block potential release of nonpublic information like an address or phone number, it doesn't mean that privacy is actually guaranteed.

Comparing these model-based guardrails to the two major attack vectors, it's clear that the interventions don't prevent revealing memorized private or copyrighted information. Even just answering information about a person can be seen as a violation of their privacy and can reveal something that can be used outside of context and consent. And what about reproducing someone's face, voice, likeness? In addition, these guardrails don't prevent a membership inference attack, and aren't trained to evaluate if the training data is being repeated or exposed.

Perhaps more important is the question: who decides what is a guardrail and how it's trained? Most companies don't have enough data to develop and train their own guardrail models, which means they are relying on model providers to release useful guardrails.

Because each system might be different, general privacy and intellectual property guardrails can be erroneous, because the company might need to receive personal information to perform a lookup on a customer database or return copyright material that the organization has a license to use. Since most models aren't documented with how they were trained and what they can do in detail, this leaves organizations struggling to understand how to effectively deploy and monitor the guardrails available and what to do if there aren't any guardrails that fit their use case.

Since filter-like approaches are external to the actual model, what happens if you try to instead incorporate the guardrail task into the actual learning step? In this case, you want to ensure that while a task is learned, potential errors or undesired outputs are avoided -- which brings us to training-based approaches.

Fine-tuning or training-based alignment approaches

In addition to filtering and flagging inputs and outputs, today's largest models generally go through alignment as part of the fine-tuning/training step of the model development process. Let's review what this looks like and how it works.

The entire training process of today's LLMs looks something like this:

These terms can be confusing because they are very LLM-jargon specific, so let's translate what each step does:

Pretraining: For me, this naming is very strange because essentially this is just unsupervised training that produces a base language model. This model is trained on content (text, image, video, etc) at scale. The input embeddings (i.e. text or multi-modal embeddings) are also learned. The model is not trained to specifically predict chat-style responses but it might have chat text as part of the training data. For LLMs, this results in a model that is good at predicting the next token(s) when given a small or large amount of text. For other sequence-based models, it will predict the next step when given an input sequence (such as next part of audio wave or next image for video).
(Optional) Extended or Continued Pretraining: This step can be used to further pretrain a publicly-released base model from a large LLM provider² or to continue pretraining on a context- or language specific dataset. Like in the previous step, this pretraining is just learning basic language or sequence modeling, so no labeled data or supervised training is used.
Training (or sometimes called Supervised Training or Supervised Fine Tuning): This is where the base language or sequence-based model is trained to complete a given task, like answering chat messages. You can also train base language models to do other things: write code, classify text, translate or perform other sequence-based machine learning tasks. Today's chat assistants are trained on chat-like texts with additional prompt inputs that give instructions and show completions. In these datasets the user input is listed under "User" and the model should learn to respond as the "Agent" speaker. Instruction datasets can also be used where there are task-completion examples, like counting, mathematics, "world model" tasks, etc.
(Optional) Fine-tuning for Alignment: Although this can be done as a normal part of the LLM training, sometimes a separate dataset and objective is used for training guardrail alignment. If so, this usually happens directly before models go into use to ensure that these guardrails are not partially changed or forgotten during another step in the fine tuning process. These alignment datasets include examples of responding differently to requests for objectionable content.

These steps can change based on a particular model setup, but they're useful to know even if your organization's method is different.

To better understand common implementations for steps 3 and 4, you'll first need to become familiar with reinforcement learning.

A brief introduction to Reinforcement Learning: Reinforcement learning is a particular type of machine learning that uses an incentivization-like method to measure loss and update model parameters. The field emerged out of robotics, where you'd like to set particular constraints and reward or penalize particular steps or next-action predictions in order to train the robot towards a goal.

During the final training steps (#3 and #4), model optimization focuses on "preference" optimization. Let's review the most popular approaches for doing so.

Collect Initial Data on Human Preferences: First you need to develop a dataset that shows human preference between a variety of chat responses or conversations. Usually these are collected by data workers³ who act as the "agent" and produce high quality responses or interact with an already trained chatbot to produce conversations. Once enough data exists, data workers shift to ranking and correcting responses (i.e. which response is better, or what would make this response even better).
Reinforcement Learning with Human Feedback (RLHF): The "human preference" chat data is used to train a reinforcement learning reward model that learns human preferences from this ranked data (think of this as a supervised text classification task). The reward model is then used to give or subtract points from the model as it continues training. The model updates (i.e. loss/optimization) are directly calculated via this reward/penalty. More specifically, the reward model output is combined via a policy function (which balances learning with "remembering what it already learned") and this function calculates the model parameter updates.
Direct Preference Optimization (DPO): Here, only binary "human preference" chat data can be used (i.e. like/dislike) instead of a more nuanced ranking (i.e. most favorite to least). Then a policy function is used to increase likelihood of positive responses and decrease likelihood of negative responses.

In addition to providing higher quality and more interesting responses, this human feedback also helps reduce harmful text learned from the internet -- whether that's violent and criminal activities or just blatant racism, sexism, ableism, etc. Relatively recently privacy and intellectual property have been added to the list of ways to align models.

This means, however, that data workers must be given guidance on what constitutes a privacy or intellectual property violation. Only then can conversations be guided away from these outputs. As you learned in the attacks article this would be impossible for any human to do since it would require photographic memory of the entire set of training data examples.

One approach to automate this would be to actually test outputs from both the pretrained model and the fine-tuned model for their proximity to training examples and to encourage divergence from these examples. As far as I know this is not an active approach in production-grade model training.

Since it would be impossible for humans to review every conversation to determine if it has released person-related information out of context or if it has repeated potentially copyrighted content without attribution, this alignment usually finds only the most blatant examples, as shown in the small exercise with Llama Guard. These models end up avoiding directly outputting personal contact information or recognizing blatant requests that might violate privacy (i.e. "Tell me the social security number of [person]).

By design this fine tuning for alignment only modifies the model slowly and slightly, because there is a penalty attributed to too much divergence from the underlying base model. This penalty exists to ensure the fine-tuning doesn't create "catastrophic forgetting" of the large-scale text learned in the base language model.

Jailbreaking attacks

Despite these external and internal model guardrails, there are many examples of quickly and easily "jailbreaking" models. This term is used to refer to deflection of the guardrail models (i.e. by letting a prompt or response that should be flagged go through) or by either modifying or exploiting the actual model in a way that subverts the alignment fine-tuning.

Let's review a few broader categories of these attacks:

Clever prompting: Early attacks which often still work use clever prompt engineering to deflect guardrail model filters and/or alignment. One fun example is to "time travel" or "world travel" to a place where the topic is allowed. There are many great examples of clever prompt attacks, and there's likely to be many years of new prompt-based attack development.
Fine-tuning to remove alignment: Many models are available for free download and use on HuggingFace or via GitHub. Even OpenAI and Copilot allow forking of the model and fine-tuning "on your own data". Although big model providers have terms of service that prohibit malicious fine-tuning or use, this doesn't necessarily stop motivated users from downloading or forking the model and then fine-tuning to remove guardrails. Recent research on publicly-released models show the costs can be as low as $160 to significantly reduce fine-tuned guardrails.
Adversarial attacks: Adversarial attacks can be developed based on model outputs. These attacks can also be transferred from one model to another, based on the same principles of transfer learning. By design, adversarial attacks change the outputs and model behavior (either to make an error or to push outputs in a particular direction).

As of yet, there haven't been general publicly released examples where attackers utilize these methods to exfiltrate memorized data from machine learning models -- outside of producing naked images and videos of famous or less famous people without their consent or impersonating famous people via deepfakes.

As AI systems are used in increasingly proprietary and sensitive environments, these attacks will become more valuable. Particularly if you wanted to build an attack based on searching for a particular piece of content or targeting a particular person or company, this is easier to do than a general attack.

Accidental attacks on privacy, where personal information is released by queries have already been reported, and it would be useful to require transparency reporting on how often this occurs.

As you learned previously, the larger the model the easier this is to do -- even with existing guardrails. Recall that Nasr et al. predicted the ability to extract more than a million word-for-word training data examples from ChatGPT with a larger budget.

Similar to the problems with input and output filters, non-deterministic approaches (i.e. fine tuning or ML-based filters) to these problems are unlikely to catch all unwanted outputs without a clear definition of what is expected. Because most organizations training large-scale AI systems do not actively test for memorization, it is difficult to then prove that the second training/fine-tuning step has reduced this memorization to an acceptable threshold.

In general, when thinking about privacy, it is enough to prove that one person's privacy is fully violated (i.e. their data is memorized and exposed in a way they did not expect or consent to). To do this for every data example that is not collected under enthusiastic consent presents a scaling problem that would require significantly changing how privacy metrics and auditing are used in model training and evaluation.⁴

Although model training via fine-tuning and ML-based guardrails help with AI safety and reliability, they are not a substitute for thinking through and addressing real issues of privacy and memorization.

In the next article, you'll learn about the field of "machine unlearning", or if/when/how it is possible to remove information that has already been learned from deep learning models.

As always, I'm very open to feedback (positive, neutral or critical), questions (there are no stupid questions!) and creative re-use of this content. If you have any of those, please share it with me! This helps keep me inspired and writing. :)

Acknowledgements: I would like to thank Vicki Boykis for feedback, corrections and thoughts on this article. Any mistakes, typos, inaccuracies or controversial opinions are my own.

This isn't meant to be a thorough or exhaustive test or research, but I am curious if you happen to come across a holistic approach! I used Llama-Guard 3.8. ↩
For example, Mistral AI has a few base pretrained models and you can see examples of extended/continued pretraining from Lightning AI. ↩
Usually paid extremely low wages and overwhelmingly employed in Africa and South America, the working conditions of these often high-skilled workers with advanced degrees is documented well by DAIR's data workers reporting. ↩
More details on what companies/organizations and individuals can do to combat this in future articles. :) ↩