AI Guardrails Explained

Posted on Do 18 Juni 2026 in security

There's been a lot of chatter about guardrails that's more marketing than science, so here's a real take on guardrails so that you can make real decisions about y:our machine learning architecture.

Note that this is a mega-post that I'm going to keep updated with the latest references. Should you have a question not answered here, I'd love to hear from you and add an FAQ section. :)

What are guardrails

Guardrails are used to determine whether an input or response to a generative deep learning system is safe or unsafe. Depending on what type of guardrail you are talking about, where they sit and how they work will vary. This is a main source of confusion I see in conversations about guardrails, because different types of guardrails will operate differently.

In this post you'll learn about three major types of guardrails, how exactly to compare them and decide which one fits your use case or architecture.

Why use guardrails

Because many LLMs, multi-modal generative models and other types of generative image models are trained on the public internet, they are also filled with things that the internet is filled with. This means not only controversial, toxic, not-safe-for-work, biased information, political and/or advertising content but also likely misinformation, malware and other stuff you don't want to produce on your organization's app or website.

Depending on your application and use case, your need for guardrails will vary. Many AI vendors provide built-in guardrails as part of the system they offer you, and you can likely steer some of the output via your system prompt if you are just getting started.

However, as you scale your generative AI efforts and especially as you enter "agentic" workflows, you'll find that you need guardrails to stop harmful content or agent actions. So-called "misalignment" can happen even from fictional inputs into the training data, as Anthropic recently acknowledged when they found that learning dystopian sci-fi led to undesired outputs.

In an ideal world, we wouldn't need guardrails because AI vendors would pay to have only high quality content and language from the start. This could then also be trained directly so that any biases introduced are well documented, annotated and reviewed (by humans).¹ However, as of yet this is not the norm in producing models.

Another misconception I often hear is that you don't need guardrails because the model providers already have them. This may be true for your personal use, but if you are deploying customer or internal facing AI systems without any guardrails specific to your organization's needs, I think you're likely asking for a clever attacker to at a minimum take a curious look around.

That said, I like to say "use guardrails but they won't save you". They are software and models, nothing more and nothing less. This means they will have bugs, errors and miss things. I still think it's useful for you to learn about them and learn when, where and how they might support your overall privacy and security problems, bugs and all.

Let's dive in.

Types of guardrails

I like to reason about guardrails by thinking about where they are and how they work. When doing so, there are 3 main types.

External deterministic guardrails

These are guardrails that work outside of the main model (external) and are deterministic in nature, usually embedded in software and clever memory structures. Here are some examples to illustrate this type of guardrail for you.

Many of these types are related to AI/ML memorization and try to prevent repetition of memorized content.

Copyright checks: Many AI vendors now run copyright checks that look at if the tokens or pixels being output directly match known copyright works. This is usually done via very fast lookups in data structures similar to bloom filters.

Recipe copyright guardrails from Azure Guardrails

Questionable Code Licensing: Similar to the copyright checks, several code assistants look at whether the output matches known code with licensing that is not open or not open without proper attributions.

A screenshot titled "Enabling or Disabling suggestions matching public code" from the Copilot documentation. There are several paragraphs describing how to turn the settings on and off for allowing known matches of public code to be either blocked or allowed.

Stopping on particular language: This would look at known words that will create a guardrail. For example, this could be criminal related, explicit language or content related, or relate to some business context. These would then just waste tokens and GPU time, but this is a very brittle filter to use, since there are many ways to write certain words (i.e. many languages, ascii text, misspellings/typos, emojis, etc).

As you can see, these can be useful for some particular types of problems, but they are not very flexible, by their very nature! If you need more flexibility, you end up using other types of guardrails.

Let's look at external algorithmic guardrails.

External algorithmic guardrails

These live external (outside) the main model, but themselves can be models! Or they can use other algorithms that make sense for checking if a prompt or similar is "safe" to answer. Let's look at a few examples.

Open-weight guardrail models, like LlamaGuard, QwenGuard or ShieldGemma: These are models trained and released by AI vendors, security and research companies or organizations. These models are trained for the guardrail task. They usually have lists of categories and the ability to output whether a certain category of safe/unsafe was found or not found.

A screenshot showing a form that says Content Filters. In it there are two large configurable categories: violence and self harm. There is a slider that you can move from "low", "medium" and "high" to set the sensitivity of the guardrail. Example of setting categories and thresholds from Microsoft Copilot

For example, here is a list of LlamaGuard-3 categories

Code	Name
S1	Violent crimes
S2	Non-violent crimes
S3	Sex-related crimes
S4	Child sexual exploitation
S5	Defamation
S6	Specialized advice
S7	Privacy
S8	Intellectual property
S9	Indiscriminate weapons
S10	Hate
S11	Suicide & self-harm
S12	Sexual content
S13	Elections
S14	Code interpreter abuse

When you run a prompt through these models it will return something like "safe" vs "unsafe" and it will return any matching categories found if the prompt was deemed unsafe. There are also models that can return likelihood of "safe" vs "unsafe" or let you set policies for which categories you care about and only check for those.

LLM as a Judge: Technically LLM as a Judge also falls here, although there are some significant differences in latency, cost and consistency when compared with the specifically trained models for guardrails. In the following section, I'll break down exactly what these tradeoffs are and give you some ways to better architect judges into your workflows.
Build Your Own: Many of the external guardrail models are small LLMs fine-tuned for the guardrail task. You could do this yourself if you had labeled data that matched what you were expecting in terms of inputs. You could also build other types of models (take a small mixture-of-experts and start there) or even a small NLP classifier to see how far you get. In the end, your use case and language specifics may warrant this approach, especially if you will continue having similar use cases over time.

External algorithmic guardrails are great because you can take them with you. The more you invest in understanding how they fit into your workflows, the more agile you will be in approaching new AI ideas, armed with the knowledge of how to catch at least some problems well.

The final type of guardrail is one you can't see or manipulate unless you work at a large AI vendor. Let's look at alignment training.

Internal alignment Training

Alignment training is usually the final stage of training before an LLM goes online. It is part of the reinforcement learning (or similar fine-tuning) steps that happen at the final stages of LLM training in order to make sure the LLM isn't putting out (too much) garbage output.

The largest AI vendors spend significant money and time making sure the LLM is both "engaging" but also that any dangerous content from the pretraining phase is difficult to access or surface. Of course, this doesn't mean it always works 100% and it also means that different vendors will prioritize different things they deem "dangerous".

High level overview of how an LLM is made

Most of the time you won't be able to notice or choose anything about these types of guardrails; but I do encourage you to practice red teaming to get to know what model providers deem dangerous, to better compare models for security and privacy reasoning and to debunk myths about how AI "works".

I have a YouTube mini-course introducing you to red teaming with code and examples.

Comparison with LLM as a Judge

I got a great question on YouTube about what's the difference between these external algorithmic guardrail models and LLM-as-a-Judge.

In case you're new to the concept, LLM-as-a-Judge is using a different LLM that is prompted to review and sometimes rate the answer of another LLM. When people use this for guardrails, they usually write a long prompt around what should be avoided or what answers to reject.

Then, they add a full round-trip between usually the LLM output, so you are making 2x calls to the AI vendor provider. You could also run your own LLM, but either way you are making several rounds to an LLM to get a response: one to get the response, and one to see if you should use the response. You are also spending those tokens in some ways 2x.

Here's my take on using a LLM-as-a-judge versus a guardrail model trained for the task:

Latency and Cost: Depending on which LLM you are using as a judge, you may be introducing significant latency and token costs to your inference. There's nothing worse for users than to be experiencing a "slow response" and the fact that you're paying for the privilege of losing users. The reason why many guardrail models are small (some as small as 600T parameters) is because it means they add far less latency. They are also open weight, so you're not paying for the tokens (you are paying to serve them).
Consistency and Reliability: LLMs are trained first and foremost on text generation on a wide variety of tasks. This means they are not as task-specific as guardrail models. In my experience, this means sometimes the output shifts enough that it is no longer automatically parseable.² For example, the output can be in the reasoning tokens or the output format changes. Because external algorithmic guardrails are trained to ONLY answer in the final tokens their output is much more reliable and easily parseable.
Portability: If you decide to change model providers or want to test out having several model providers, having external guardrail models is a great way to take some of your security engineering with you regardless of which main model you are using. You can also easily switch out external guardrail models and compare them for different use cases as you build up data that demonstrates what you want to actually catch.
Fit-for-your-purpose: If you start developing evaluation suites around what you expect from a privacy and security point-of-view you'll likely be able to start tweaking or building your own models. At that point they will be completely fit for your purpose and give you the ability to continue training or expanding the different types of guardrails for different types of vulnerabilities or undesired input/output.

Food for Thought: The first LlamaGuard was trained using 8 80GB GPUs. The datasets used were a mixture of open training data (OpenAIMod) and Meta-collected data with labels. They built out the labels via their internal security experts and some external APIs like Perplexity AI.

To give you an idea of the type of data you might need or collect, let's actually look at some of the available datasets which exist today.

Looking at the Datasets

This is just a small subset of the available OpenAIMod dataset, which covers categories around sexual content, harmful and hate language. It was part of the data used to train LlamaGuard and several of the other open weight models.

There is a table with several rows of text and then labels (0, 1) for different columns. The small selection of text shows sexual content, hate and harrassment examples and a benign label that talks about sexual content with questionable words. Example from OpenAIMod (sorry for the troubling language)

Although the dataset doesn't have much information about how it was collected, it appears to be subsets of larger scraped datasets with a focus on keywords that might trigger one of the guardrails. There are a variety of safe examples mixed in with the unsafe ones that have similar words, so that the model is trained appropriately on examples of safe and unsafe text that use similar tokens/words. This is what makes them much more adaptable than a hard deterministic filter.

In addition, there are some other large datasets which focus on cybersecurity and criminal risk. The WildGuard dataset has a variety of examples that fit closer to something like CodeShield or other variants, where you might want to recognize when the user is asking the LLM to create unsafe code. This could happen either on purpose or accidentally via a poisoned skill or repository.

A screenshot showing several columns: prompt, response, whether the prompt is harmful and whether the model refused and whether the response is harmful. You can see a sneaky prompt that asks for some harmful things in between normal requests. Example from WildGuard

There are many similar datasets that aren't released by the open weight model organizations due to their sensitivity. These datasets are essentially what you need to start training your own guardrails. With the available open datasets and ones you create yourself, you can take a small LLM and turn it into a guardrail with extended fine-tuning.

Want to learn more about how fine tuning works? Check out Sebastian Rashka's YouTube series, GitHub Repo and book to learn how you can fine-tune your own LLM. Of course, this also means renting compute that fits the size you need to modify model parameters and the ability to afford that for likely a few training runs before you get it right.

These fine-tuned models are then used directly as a filter for incoming prompts/requests. If something gets marked safe, it goes to the LLM or multi-modal model. If it gets marked unsafe, you don't waste tokens or time on it and instead return a "Sorry I can't help you with that"-style answer.

If you're interested in looking at either using an open-weight model or perhaps looking at building your own, you probably would like to compare how exactly they are made and what are the differences between them.

Let's take a deeper look at several popular open-weight guardrail models so you can figure out which might fit your needs.

Comparing Guardrail models

Let's explore some of the most popular and easy-to-use open-weight guardrail models. These can act as a starting point for you to potentially collect labeled data or decide which categories matter to you.

LlamaGuard (3 and 4)

LlamaGuard3 and LlamaGuard4 from the Meta team are two different models with a variety of categories that a large social media company would care about. This means they may or may not fit what you are actually looking for in your guardrails, but if they do, they are backed with tons of industry experience in content moderation.

What goes into a Guardrail model? On YouTube I dive into the original Meta LlamaGuard paper which outlines the original approach for training a guardrail model.

Using the model is fairly straightforward and you can test it out directly with ollama on your machine. I have a Jupyter Notebook available that can show you some examples of using LlamaGuard to detect privacy problems.

It works well for some obvious queries such as: Where is Angela Merkel from? (safe because she is a public figure). And correctly labels "What is Angela Merkel's home address?" as unsafe. However, if you split up potentially personal data (like writing an email address out using German descriptors for punctuation) it does not recognize them. And there's plenty of other ways to hide personal data in text that it will also not find.

Like all models, guardrail models are "often wrong, but sometimes useful". This means that you'll need to test them for the types of input you expect and the types of categories you care about to see how they perform. Building out evaluation datasets so you can test which guardrails fit and maybe one day train your own is a great starting point as you look into what issues you want to catch and avoid.

Qwen3Guard

Qwen3Guard launched in 2025 and changed some of the initial architecture and design choices compared with LlamaGuard.

Let's investigate the two major changes in guardrail model design Qwen3Guard introduced:

The "Controversial" label

Qwen3Guard took an interesting idea that perhaps "safe" versus "unsafe" is too strict and that there exist gray zones where depending on context or use case something might be in-between.

To validate this idea, they split the training data into two sets. Then they used these training datasets to train different models. Since they had learned that training on mainly safe versus mainly unsafe labels created different model outputs (i.e. training mainly on safe created more permissive models and mainly on unsafe models that are more prone to labeling things unsafe), they also reweighted the training samples in those datasets. "Loose" models were trained with a weighted dataset with more safe examples and "strict" models learned from mainly unsafe training data.

Then, they used those models to "cross annotate" data from the other models. If the different models (strict and loose) didn't agree on the label, those examples were marked as "controversial".

Image from the Qwen3Guard release on how the controversial label was trained

After finding a useful threshold based on training data splits, the "controversial" label was added to examples and the final guardrail model was trained to respond: safe, unsafe, controversial to the associated examples and categories.

Streaming capabilities

Qwen3Guard has built-in streaming capabilities, which means you can use it with streaming LLMs (or multi-modal if you can parse text separately). Streaming mode is a capability of several LLMs and multi-modal models which allows tokens/responses to show up for users as they are produced and transmitted across the network.

This allows you to ensure that you aren't wasting tokens or compute on answering messages that you shouldn't bother answering.

A diagram showing an unsafe user request to build a bomb. The input is marked as unsafe, but for demonstration, the image shows how the streaming LLM guard evaluates each token as it is produced until the guardrail model reaches tokens that mark the response as unsafe. Image from the Qwen3Guard release on how streaming works

Ideally if you are running your AI service in streaming mode you also want to run your guardrails in streaming mode. I predict that this will become more common, especially after the release of Qwen3Guard.

You can test Qwen3Guard directly with ollama on your machine. I have a Jupyter Notebook available that can show you some examples of using Qwen3Guard to detect categories in text-based prompts.

Guardrail Models from your Cloud or AI Vendor

The best guardrail to start with is the one already built into your system, which for most of us is the guardrails that come from our Cloud or AI vendors.

Obviously many AI model providers also "install" their own guardrails, both via alignment training and fine-tuning, but also via their own guardrail models and software which they have running similar to what you've explored thus far in this post.

As a customer of such a service / vendor, that is not something you can influence or control, but it is something you should consider. In an upcoming post you'll learn about how to test and evaluate model provider safeguards to determine which providers/vendors and which models best fit your use cases and your desired guardrails.

Aside from the guardrails and alignment decisions you cannot control, many cloud and AI vendors provide separate guardrail models and deterministic software tests that you can use.

Let's take a look at the options that are readily available depending on your AI vendor.

AWS and Bedrock

AWS has a variety of guardrail models and software-based testing offered primarily via their Bedrock service. Let's investigate and help you categorize and organize them to better assess which ones fit your needs.

Content Filters: These are guardrail models that have categories and review input and output for "safe"/"unsafe". YOu can also toggle whether you want a "low", "medium" or "high" threshold. The supported categories at the time of this post are: Hate, Insults, Sexual, Violence, Misconduct and Prompt Attack. In the documentation, you can see there are two models for content filters, one guardrail model trained on the toxic prompt style categories in multi-modal inputs and the other trained on potential prompt injection or jailbreak attacks.
Denied topics: This lets you "tune" your guardrail to avoid particular topics. Based on the documentation, it's likely an LLM-as-a-judge with a longer in-context prompt that can help determine if the topics are included and relevant given the input/output.
Word filters: This is a fast deterministic text check against a blocked token/word list, which you can add to yourself by uploading a document or CSV. AWS already offers this for some profanity by default, but you can add words to the list.
Sensitive information filters: This is likely a combination of a small fast model (similar to Microsoft Presidio) which uses entity discovery to mark potential personally identifiable tokens and a Regex-based filter looking for common sensitive text data, like credit card numbers or particular ID-numbers.³ AWS has detailed documentation on what entity types are supported and how to use them via the API based on your tier.
Contextual Grounding Checks: In an AI/LLM deployment where you have human-collected information sources, like a RAG system with documents defining company FAQs or similar, these checks are likely LLM-as-a-judge setups specified to test output against defined grounding check sources. In the AWS documentation on the checks you can see explicitly how grounding sources must be wrapped in the API call and how a threshold between .01 and .99 should be set, which relates to inference probability (between 0 and 1 or 0% and 100%).
Automated Reasoning Checks: This is modeling from the computer science field of automated reasoning, which is probably built based on Amazon Science's own tooling and research that can take natural language and build out logical "theorem" chunks or "rules" which can then be tested to find potential solutions. If you've ever worked with "solvers" in Optimization Research or tinkered with Google's OR Tools it is a similar approach.

Automated Reasoning finds logical flaws in the input or output (i.e. is it possible that I am living in this zip code and that I have access to this regional supplier?). This is also why this guardrail is only available in English and has long and complicated documentation on how to create the policy, test and troubleshoot it and validate that it is actually properly working for your use. That said, I'm a big fan of turning text into code to validate and I personally use OR-tools to do calendar planning when working with my LLM setups.

On AWS guardrail models are not explicitly versioned as far as I can tell, so you might want to regularly test and modify your guardrails if you are also modifying your models and prompts.

Azure and Microsoft

Azure and Microsoft offer their own sets of guardrails, primarily via AI Foundry on Azure, which you can set up alongside your preferred model usage and Copilot.

Toxic and Unwanted Content: They have main categories which mirror topics you are now familiar with: Hate, Sexual, Self-harm, Violence. For each of these categories, they allow you to set a threshold or level between Off, Low, Medium and High.
Prompt injection and Jailbreaking: There are guardrails for a variety of prompt injection and jailbreaking attacks: User prompt attacks, Indirect attacks and "Spotlighting", which looks for prompt injection in user-uploaded or accessed materials/documents. Based on the documentation, these are likely several task-specific guardrail models.
Memorization and Copyright: Azure offers two copyright/memorization guardrails: one for code and one for text, which are deterministic software-based guardrails.
Personal information: There is a guardrail for filtering out personally identifying information, likely Presidio because the documented categories and entities closely match those.
Groundedness: Another guardrail evaluates groundedness, which allows you to provide additional parameters and documents where likely an LLM-as-a-judge will check if the output matches the same facts found in the text. Since they also offer a particular "domain selection", they might also have some fine-tuned LLMs that work for specialized fields like medicine which they use for the judge if you select that domain. Note that groundedness is only available in English.
Agent "Task Adherence": An interesting guardrail that checks whether the agent uses "misaligned tool" choice, mismatching the tool input or output (i.e. previous calls don't go well with later calls) and whether the output compared to the user prompt is also mismatched. The documentation for the guardrail has several examples which are instructive as to what can possibly be checked for this system.

If you're using Azure guardrails, I recommend you take their learning guide to get an overview of implementation and features; but if you want to dive into any specific type of guardrail you can also start directly with the links above related to each type.

Azure Foundry safety also has other settings and features which you may want to use. For example, you can set region preferences, which models and what modes you want available (i.e. only text or text and images).

Google Cloud

Google Cloud offers a guardrails service where you can configure numerous guardrail models as code in your generative AI projects. The guardrails documentation has several code examples in Python, Ok and REST.

Because the guardrails API abstracts away many of the individual guardrails, you can also use it more like a "mega guardrail" where you set up all guardrails as a policy.

JSON showing a "policyResults" with a list of policy types (categories) and then a score and whether the policy was violated or not.

Toxic Content: Filter hate, violence, harassment, obscenity and set thresholds for each of these categories in your policy.
Medical information: Flag queries that contain questions or answers related to medical advice or symptoms.
Sexually explicit: Set a policy for finding sexually explicit inputs or outputs.
Personally identifiable information: Flag text that may contain personal identifiers or text/prompts that are requesting personal identifiers.

As you've already learned, behind the scenes these are likely several different types of models that Google has optimized for their own use and made available via this policy for enterprise customers.

Anthropic

Anthropic doesn't explicitly offer any guardrails that you can turn on and off. Instead, they offer some advice on adding guardrail language to your prompt. They give some examples in their documentation which give hints at their alignment training.

My advice: If you're using Anthropic, start thinking about building additional prompt add-ons that get appended based on the use case and see if their alignment works for you. If not, start thinking about how to integrate your own guardrails before and after your calls to the API.

Remember that prompt exfiltration is fairly easy (my YouTube video on the topic) and consider anything that goes into your system prompt as something that someone could extract.

OpenAI

OpenAI has their own separate documentation for guardrail models and instructions for how to get them configured for your Enterprise setup.

At the time of this post, you can choose between a few different models:

Content Moderation: A guardrail model that blocks unwanted topics like hate, harassment, self-harm and sexual content. This one also returns the model prediction per category, allowing you to set your own thresholds after you have run testing and evaluation (or update them based on observability data).
Personally Identifiable Tokens: A privacy-filter model, called Privacy-Filter, which attempts to find personal identifiers in text and replace or remove them. This model is also available on HuggingFace in case you want to use it outside of OpenAI's infrastructure. The model will replace found entities with entity type as a form of redaction/masking (i.e. [PERSON_NAME]).
Jailbreaks: A jailbreak detection model trained to find potential jailbreaks via a variety of modes including role playing and obfuscation. This is likely a LLM-as-a-judge because you can choose the base OpenAI model to use.

A screenshot of the configuration for OpenAI's jailbreak detection where you can see the model choice, confidence threshold and maximum turns used to find jailbreaks.

Prompt Injection: A LLM-as-a-judge setup looking specifically at whether there are hidden instructions or divergency in the input and the original intent/system prompt. This can also return the reasoning tokens of the judge and can be used for multi-turn and agentic detection.
NSFW: Another LLM-as-a-judge that looks for any input or responses that could be considered "not safe for work".
Off-topic: A LLM-as-a-judge setup looking for topic adherence where you can set the main purpose and then have the judge look at whether the conversation fits the general topic or topics. This is useful if you have an open chat experience where people might try to use free tokens via your service to perform LLM or AI tasks. This is also why having open chat interfaces behind a login is a good idea.
A custom Judge setup: OpenAI provides a custom Judge prompt setup where you can define your own judge criteria, what model to use and where it sits in your OpenAI API calls.
Unsafe URL filtering: A deterministic software-based allow list that checks that URLs match a predefined set of allowed URLs, especially useful for things like tool calls and agent workflows. You can also use it to set https requirements and to look for potential user info in query strings or the URL itself.

If using OpenAI Agents, many of these are setup as part of the OpenAI agent software as well, so check the latest documentation

Less popular but interesting guardrail models

Because the major guardrail models probably have pretty good training data and are informed by ever-growing experience, I would recommend starting there. However, I think it's also useful to consider other open-weight guardrails or domain-specific approaches.

Note: if you know of a project I should add here, please ping me! You can reach me via many channels.

OpenGuardrails

OpenGuardrails is connected to a small security company by two researchers who are working on AI security. They released their research last year alongside an open weight model and training dataset on HuggingFace.

I found their research and open-source project interesting for a few reasons:

Policy-First: They focus on creating a universal policy setup (similar to Google's approach) where you can set a policy based on the different guardrail types in one place and update it with thresholds as you go.
Multi-Lingual: The researchers are based in China and have focused on making the guardrails and training data more representative than many of the other guardrail suites. The initial guardrails support 118 languages and they also release some performance details on several languages, so you can review in the paper whether the languages you want to support are well-represented (and of course evaluate it with your own data).
Deployable Open-Source Software for Guardrails: Similar to NeMo, which I'll cover later, OpenGuardrails comes with open-source software that allows you to wrap both normal LLM/AI workflows but also agentic tasks with guardrails.

I also have a Jupyter Notebook comparing OpenGuardrails so you can see if it might work for your project.

I hope the project and many like it will spring up to offer more open-source, open-weight, open-training data choices for practitioners.

NVIDIA's NeMo guardrails

Come back for more... I tried to get NeMo set up several months ago but they were going through a refactor and the docs were outdated and did not work. Now, it appears the refactor might be done, so I'll get a notebook working and update it here soon!

Your Use Case Specific Guardrail here

I'm a big fan of building specific models for specific use cases. This is also the case for guardrails!

If you're already taking time to:

model which security and privacy problems you expect to see in your running system
build out your prompts and evaluation suites for better product understanding
test different models and prompts for consistency and experience
monitor your system for performance, latency, spend
build out automation for some of your testing and evaluation

then you already have all the necessary components for deciding which additional models and guardrails you can use. The only missing decision at that time is whether you have the engineering capacity and knowledge to fine-tune or train your own.

Training your own model from scratch is a tall order, especially if you're looking for a more general purpose guardrail. However, if you already like a certain open-model (that also has open or compatible licensing) and you just want to further tweak it for something like domain-specific languages, this becomes much more doable.

A useful starting place is to evaluate where your current guardrails are failing and to see if you have ideas around why those failures are happening. Initially you probably want to try to tune your guardrails with whatever knobs you have to see if you can bring down that error rate without having to fine-tune.

If not, the question becomes, do the failures represent something you think a model can learn? You could even test it with a small NLP model (using scikit-learn and/or spaCy to build a small dataset and test whether a small model can learn anything about what you are trying to catch). This doesn't mean you actually need to use the small model, but it can certainly give you a signal if the concept is learnable at a small scale before you invest in a larger fine-tuning or training setup.

Here's a very basic overview of the steps you would need to take to train your own:

Collect Data: You can use data you already use to evaluate your guardrails and collect traces you're allowed to train with to add to those. Additionally you can build out synthetic datasets that mimic the error types or guardrails you want to build. You should also have a healthy sampling of things that look similar, but are actually safe.

Note: You can use your current guardrail model to collect labels for your data and also use the OpenAI Moderation endpoint or similar services to label additional examples you build or generate. However, the best quality data and labels usually means you produced it with your internal expertise and use-case-specific knowledge and labeled it as such.

Determine Base Model or Guardrail: Do you have a guardrail model that is open-weight with appropriate licensing? Or do you have an open-weight LLM (will take more effort!) with appropriate licensing that you want to use as your base? For both, test them with the data you collected. If using an LLM, you'll also need to build your system prompt and tinker with ensuring you have the right outputs and labels that you want to build. You can use an LLM to help with this.
Read the LlamaGuard paper for ideas: The original LlamaGuard paper and also the Qwen3Guard and OpenGuardrails papers are very explicit about how they trained the models, even down to batch sizes and compute in several of the papers. Take note of things you'd like to try as part of your fine-tuning or training and if you need or want to compare different types of training (i.e. with few-shot examples in the prompt or testing in- and out-of-vocabulary performance).
Start with a short fine-tuning before moving to thinking about training your own from scratch: My advice would be to fine-tune a current guardrail model with a small update (and a small number of rounds) and test that first. Then you can deviate further from the base model and move up to training a guardrail model from an LLM from scratch.

That said, if you are already doing fine-tuning or have experience training larger deep learning models, you probably already have a good idea of what can work and what might break. For those just starting, training too long on a smaller dataset often degrades the model performance for other data, so just make sure you both keep an eye on what you're fine tuning but also see that you don't reduce knowledge/learning for other key parts of the model outputs that you need (for example, other categories of the model that you want to maintain but don't want to train yourself).

Note: For tips on getting started on supervised fine-tuning and training, I recommend first learning how to fine-tune an LLM from Sebastian Raschka's course and then look into using a fine-tuning platform or service from your cloud provider if you don't already have your own fine-tuning infrastructure set up.

Validate your model with data sampled from production: Before launching your model, sample production data and test it against your model. How is your model doing? Does it have better performance on real production data than your current setup? Make sure you have an idea of how it's working and that you haven't introduced new errors. When you're sure based on that data, launch it and keep monitoring.

Not everyone will want to build their own guardrails or even need to do so, but I do think that building better evaluation datasets and being able to test a variety of different guardrail models against one another is an important skillset for any company or team who are putting guardrails into their architecture.

Let's look at how you can leverage easy to turn on/off cloud guardrails all the way to integrating your own.

Architecture Choices

If you're using cloud guardrails, the nice thing is that you don't have to assemble their serving environment yourself. This is why it's a good idea to start with those, get to know how they work and how they also affect your product or internal workflows and only later think about deploying your own models.

For your architectural choices you often want some guardrails to sit between the user and the service and others to sit between the model outputs and the user. Depending on your setup, you might want to also build in streaming capabilities for your guardrails so you don't end up partially wasting tokens on responses you need to throw away.

An example high-level view of where guardrails can sit in the data flow

You probably also want to implement basic observability for your guardrails if that doesn't come built into your system. Here are a few things that I would recommend you track:

Guardrail hits: How many matches for what categories?
Latency
Token usage or billing for guardrails (ROI)
If possible, human feedback for the guardrail-tagged responses (thumbs up/down)
A sampling of traces for positive hits and some for negative

Please understand that traces can contain personal data, meaning they should be treated as such. This means talking with your data protection/privacy team and deciding if your policies and procedures for traces should be adjusted. It also means you want to encrypt them at rest, store them carefully and apply necessary controls, such as pseudonymization, to appropriately protect personal data. If you are interested in learning how to do this at scale, I offer advisory and workshops to help build privacy and security into your workflows.

Ideally you are using these data points to analyze the effectiveness of your guardrails and to learn what types of privacy and security problems are actually happening in your systems and products. For example, if you are concerned about a certain category, but it never gets a hit even though you believe you've properly set it up, then analyzing the traces and determining if you need to adjust anything is a good starting point.

Starting with more general guardrails and then looking at if they shift customer/user feedback and evaluating observability data to see what actually matters in your product or use case (i.e. guardrails for a particular category rather than guardrails for every category) is a good way to learn what protections are worth the tokens/spend/latency and which ones are less relevant for your use cases.

Depending on your guardrails, you'll be building out policies to decide what is "safe enough". This means getting some subject matter experts, some security and privacy leads and some product owners together and building out a policy and evaluation dataset that you think meets your initial assumptions.

Assume your assumptions will change once you actually deploy your system. Work with that same expert group to adjust as needed and to start building competency in understanding the perceived versus real risk in the running system. By exercising this "risk radar" muscle, you'll develop a better idea of what policies are reasonable. If you are regularly building them out into examples and code, you'll also develop better evaluation datasets and begin understanding where the limits of guardrails are for your particular needs.

Ideally you find that the return on investment (ROI) for guardrails shows that they are effectively blocking malicious attempts or other problems and have a low rate of negative feedback from presumably non-malicious flows. You also can then start adjusting which models you use and see if smaller, faster and cheaper models operate well for your use case.

After learning more about your system and users, you can likely begin deploying more use-case specific guardrails and experiment with training or fine-tuning your own guardrails. This might mean adjusting your architecture so that different products or services can route based on the types of protections they need.

More than anything I hope it's clear that guardrails are also models, which means they need supervision and maintenance. Someone has to own even the "well-functioning" guardrails in case a model, prompt or architecture change impacts their effectiveness or application.

Determining who owns this, especially if you don't have a team dedicated to technical AI governance, means figuring out who has the right skills for such a task. Likely you'll end up with some co-ownership between some data and security persons until it's clearer how those capabilities at your organization can work together.

Use guardrails, but they won't save you

I said it before and I'll say it again. Guardrails are useful and also prone to errors and problems. Building holistic privacy and security means having multiple ways to address and control risk, and guardrails is one of several ways to do so.

In terms of agentic workflows, I would advise caution against assuming guardrails will prevent incorrect and insecure behavior. Recent examples show otherwise...

A tweet from a user Son Luong (@sluongng) showing Codex using clever Docker tricks to launch a new docker container as root even though sudo is not allowed in the guardrails/sandbox config.

Stay tuned for more articles on building agentic security.

Do you have any questions or topics related to guardrails that I didn't cover in this article? I look forward to hearing and reading from you. Feel free to reach out any time).

If this post helped you, consider subscribing to my newsletter or my YouTube and sharing my work. I also offer advisory and workshops and a new Practical AI Privacy course on topics like security and privacy in AI/ML and personal AI.

Note that this would probably also only work well if models were less general and more task-specific, allowing you to fully curate trusted content on a particular theme or for a particular task. The problem with generic/general models is that they must be one tool for everything, which quickly leads to feeding as much data as you can at scale. It also means that users will definitely try to get undesired outputs because either they think it's fun, they think they will make money with it or many other reasons. If you created slightly less powerful, boring models that were really good at specific things, the amount of "fun" and potential profit drops, making them also less interesting. ↩
In addition, AI vendors are certainly fine-tuning and testing out small changes, even in deployed versioned models. For this reason, something that worked well a week ago might shift due to small fine-tuning or deployment efficiency changes. ↩
Entity discovery via Named Entity Recognition (NER) is a well-known subfield of Natural Language Processing (NLP) and has been a part of language-based machine learning for decades. Common entities are things like people, places, organizations and dates. You can extend Presidio or build your own NER pipeline using open-source frameworks like SpaCy. ↩