The headlines are full of stories of AI agents going rogue: language glitches, talking like a pirate, offering inappropriate discounts, and more. As AI has become more widespread in customer service, the number of people trying to break it has increased. Bad actors phrase things in strange ways, try to confuse the system, and probe to see whether they can trick an AI agent into doing something outside its scope.
Just last week, a client’s engineering team asked the question that’s on everyone’s mind right now: “How do you make sure someone can’t trick your AI into doing something it shouldn’t?” This is the most important question of the moment as applied AI moves from experimental to indispensable. The good news is that there is a real answer, one that cuts through the AI hype and centers on a simple truth: applied AI wins on skilled execution.
When our clients hire human customer service agents, those agents are trained on clear rules: Here's what you can say. Here's what you can't say. Here's when you escalate to your manager. Nobody wants new agents to suddenly answer questions about payroll or invent policies on the fly. At Pypestream, we see little difference between the expectations we place on humans and those we place on automated agents. Our AI practitioners build AI-enabled solutions that perform specific, valuable tasks while following the established business rules that maintain compliance and mitigate risk.
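To make the parallel concrete, here is a minimal sketch of what explicit, rule-based scoping might look like. The task names, rule fields, and escalation target are illustrative assumptions for this post, not Pypestream's actual configuration:

```python
# Hypothetical scope definition for an automated agent. Field names and
# values are illustrative only, not Pypestream's actual format.
AGENT_SCOPE = {
    "allowed_tasks": {"order_status", "shipping_update", "initiate_return"},
    "forbidden_topics": {"payroll", "legal_advice", "pricing_exceptions"},
    "escalation_target": "human_agent",
}

def handle(task: str) -> str:
    """Perform only the tasks the agent was trained for; escalate the rest."""
    if task in AGENT_SCOPE["allowed_tasks"]:
        return f"executing: {task}"
    # No improvising: anything outside the trained scope goes to a person.
    return f"escalating to: {AGENT_SCOPE['escalation_target']}"

print(handle("order_status"))   # executing: order_status
print(handle("payroll_query"))  # escalating to: human_agent
```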
In a call center, managers provide the supervision that keeps agents within their training boundaries. At Pypestream, we achieve this for our systems of AI agents with Supervisor Agent, our intelligent orchestration layer. Think of it as the operator at the front desk who routes calls to the right team. When a user starts a conversation, Supervisor Agent analyzes the inquiry against the specific tools and tasks it is able to execute and chooses a route accordingly. If what the user wants isn't on that list, Supervisor Agent doesn't try to figure it out or get creative. It admits its limitations and escalates to a human. It only routes to what it was built for.
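As a rough sketch of that routing decision, assuming a hypothetical workflow registry and a trivial keyword matcher standing in for real intent analysis (this is not Pypestream's implementation):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Workflow:
    name: str
    keywords: Tuple[str, ...]  # stand-in for a real intent model

# Hypothetical registry: the only routes the supervisor may choose from.
WORKFLOWS = (
    Workflow("order_status", ("order", "tracking", "shipped")),
    Workflow("initiate_return", ("return", "refund", "exchange")),
)

def route(inquiry: str) -> str:
    """Match the inquiry against known workflows; never invent a route."""
    text = inquiry.lower()
    match: Optional[Workflow] = next(
        (w for w in WORKFLOWS if any(k in text for k in w.keywords)), None
    )
    # If the request isn't on the list, admit limitations and escalate.
    return match.name if match else "escalate_to_human"

print(route("Where is my order?"))           # order_status
print(route("Give me everyone's salaries"))  # escalate_to_human
```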
Question answered and problem solved, right? Not quite. If security were just about having one really good checkpoint, this would not be the question of the moment. Unfortunately, that's not how real systems work, especially when you're dealing with AI that needs to be flexible to be helpful. We need multiple layers, and they all have to work together.
As a next layer of security, all workflows in a Pypestream AI-enabled solution are not just routed by Supervisor Agent; they're also processed through an action observability layer. With this step, the Pypestream platform isn't just logging what happened, it's tracking why actions happened. So when something unusual or suspicious starts showing up (like a burst of strange requests coming from the same place, or someone systematically probing for data they shouldn't have access to), we see it the moment it happens, along with what resulted. Complete observability ensures that we can understand, respond, and improve as needed, in real time. Through our observability layer, our clients have comprehensive logs of what their AI is doing. Our teams can run A/B tests on different approaches and make real-time updates based on what we’re seeing.
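A toy version of that idea, assuming a made-up log schema and an arbitrary rate threshold (real anomaly detection would be far richer than counting requests):

```python
import time
from collections import defaultdict, deque

# Each log entry records why an action happened, not just that it happened.
ACTION_LOG: list = []
RECENT_BY_SOURCE = defaultdict(lambda: deque(maxlen=100))

WINDOW_SECONDS = 60   # illustrative values, not real thresholds
MAX_PER_WINDOW = 10

def record_action(source: str, action: str, reason: str) -> None:
    """Log the action and its rationale, then check for suspicious patterns."""
    now = time.time()
    ACTION_LOG.append({"ts": now, "source": source, "action": action, "reason": reason})
    RECENT_BY_SOURCE[source].append(now)
    if looks_suspicious(source, now):
        alert(source)  # surface it the moment it happens

def looks_suspicious(source: str, now: float) -> bool:
    """Flag a burst of requests from the same place inside the window."""
    recent = [t for t in RECENT_BY_SOURCE[source] if now - t < WINDOW_SECONDS]
    return len(recent) > MAX_PER_WINDOW

def alert(source: str) -> None:
    print(f"ALERT: unusual request pattern from {source}")
```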
When we explain this to clients, some of them expect it to be more complicated than a few layers of security assurance. And sure, the implementation has deep complexity. But the philosophy? It's pretty straightforward. We set explicit, clear boundaries. We enforce them at multiple levels. We have visibility into all of it. That's what actually works.
One of our engineers put it best when they remarked, "We want to make sure the platform is in the middle, and that we can control what we're giving out to whom and when, so that even if something does slip through the first line of defense, there’s a second line to bounce off as well."
Let's say, hypothetically, a bad actor finds a way to mess with one of our AI agents. Maybe they're trying to get a full refund on a recent purchase. Even if they manage to get past Supervisor Agent and get placed into a refunds workflow, when the AI agent tries to actually fulfill the request, it hits the second layer of protection. The platform checks: Is this action part of the expected workflow? Am I allowed to give this level of refund to this customer? Do they meet all the business rules we agreed on? If the answer is no, the request gets rejected and the agent routes back to what it's allowed to handle.
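In code terms, that second layer amounts to re-validating the action against the agreed business rules before anything executes. A minimal sketch, with made-up limits and a stubbed eligibility check:

```python
from dataclasses import dataclass

@dataclass
class RefundRequest:
    customer_id: str
    amount: float
    workflow: str

MAX_REFUND = 100.00           # illustrative limit, not a real policy
EXPECTED_WORKFLOW = "refunds"

def customer_is_eligible(customer_id: str) -> bool:
    """Stub for a real lookup (purchase history, return window, etc.)."""
    return customer_id.startswith("CUST-")

def authorize_refund(req: RefundRequest) -> bool:
    """Platform-side gate: a request that reached the workflow must still
    pass every business rule before it executes."""
    if req.workflow != EXPECTED_WORKFLOW:
        return False  # action isn't part of the expected workflow
    if req.amount > MAX_REFUND:
        return False  # exceeds the agreed refund level
    if not customer_is_eligible(req.customer_id):
        return False  # fails the business rules
    return True

print(authorize_refund(RefundRequest("CUST-42", 500.00, "refunds")))  # False
print(authorize_refund(RefundRequest("CUST-42", 40.00, "refunds")))   # True
```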
Our clients expect to know exactly which data is going where, and when. The Pypestream platform, with features like Supervisor Agent and the observability layer, in the hands of our AI practitioners, gives them that. The platform's architecture ensures that a user can neither bypass Supervisor Agent to access sensitive operations nor avoid observability. Our clients are not just trusting that security is handled; we can show them how it works.
In more than a decade in this business, we have learned that the best defense against jailbreaks isn't just technical sophistication. It's a platform built on the fundamental design principle that it does nothing except what it has been explicitly assigned to do. No surprises. Just systems of AI agents that do exactly what they’re supposed to do, and nothing more.