Secure AI Agents with a Safety Kernel

Prevent catastrophic AI failures. Learn how a Safety Kernel secures agentic AI systems and enables safe automation.

12 min read
February 2026
Illustration of a safety kernel protecting AI agents in a cloud environment

This week, a stark reality check rippled through the AI community when a Meta AI alignment director detailed her experience with OpenClaw, an open-source AI agent. The task was mundane: clean up an overstuffed email inbox. The result was catastrophic. The agent didn’t just organize; it ran amok, deleting critical emails and disrupting workflows because it interpreted “clean up” as a mandate for aggressive deletion.

Plavno’s Take: What Most Teams Miss

At Plavno, we see a fundamental architectural mistake in how most teams are approaching agentic AI: they are treating agents as trusted power users rather than untrusted, potentially hallucinating interns. The industry is obsessed with capability—can the agent book a flight?—while completely ignoring controllability—can we stop the agent from booking the wrong flight?

The core issue is the lack of a “Permission Layer” or “Safety Kernel” sitting between the Large Language Model (LLM) and your critical APIs. Most implementations use frameworks like LangChain or LlamaIndex to wrap APIs in a function call. If the LLM emits the right JSON, the action executes. This is dangerous. When an agent “hallucinates” an action, it isn’t just generating wrong text; it is triggering a destructive API call.

Teams get stuck because they assume that prompt engineering is enough to ensure safety. It is not. You cannot prompt‑engineer your way out of a race condition or a logic error that causes an agent to format a hard drive. The failure mode in the news wasn’t that the AI was “evil”; it was that the system architecture allowed a probabilistic model to execute irreversible state changes without a deterministic gatekeeper.

What This Means in Real Systems

In a production environment, an agentic system is not just a chat interface; it is a complex orchestration of moving parts. To prevent the “OpenClaw scenario,” we must rethink the stack. The architecture cannot be LLM → API. It must be LLM → Orchestrator → Safety Kernel → Policy Engine → API.

The Safety Kernel

This is a dedicated service (often written in a rigid language like Go or Rust) that sits between the agent and your infrastructure. Its job is to validate every single tool call. It does not trust the LLM’s output. Instead, it parses the intended action, checks it against a pre‑defined schema, and validates it against the user’s permission context.
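To make the contract concrete, here is a minimal Python sketch of that validation step. The tool names, parameter sets, and permission scopes are invented for illustration; a production kernel would run as its own service (in Go or Rust, as noted) and validate against formal JSON schemas.

```python
# Hypothetical whitelist: tool name -> (required parameters, scope needed).
# These names are illustrative, not a real API.
ALLOWED_TOOLS = {
    "trash_email": ({"email_id"}, "email:write"),
    "list_emails": ({"folder"}, "email:read"),
}

def validate_tool_call(tool: str, params: dict, user_scopes: set) -> bool:
    """Fail closed: reject unknown tools, schema mismatches, and calls
    the user's permission context does not cover."""
    if tool not in ALLOWED_TOOLS:
        return False  # the LLM invented a tool; never execute it
    required_params, required_scope = ALLOWED_TOOLS[tool]
    if set(params) != required_params:
        return False  # malformed arguments; never execute
    return required_scope in user_scopes
```

Note that the kernel never asks "did the LLM mean well?" It asks only "is this exact call permitted for this user?"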

Reversible Operations

For high‑risk actions (deletion, money transfer, data overwriting), the system must enforce reversibility by design. For example, an agent should never have access to a DELETE /email/{id} endpoint. It should only have access to POST /email/{id}/trash. This creates a “time buffer” where human operators or automated rollback scripts can intervene.
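A toy version of that pattern might look like the following. The function names and in-memory store are hypothetical; the key point is that only the trash operation is ever exposed to the agent, while restore and purge belong to humans and scheduled jobs.

```python
import datetime

TRASH = {}  # email_id -> timestamp when trashed (illustrative store)

def trash_email(email_id: str) -> None:
    """The only deletion-adjacent tool exposed to the agent."""
    TRASH[email_id] = datetime.datetime.now(datetime.timezone.utc)

def restore_email(email_id: str) -> bool:
    """Human or rollback-script path: undo within the buffer window."""
    return TRASH.pop(email_id, None) is not None

def purge_older_than(days: int) -> list:
    """Run by a scheduled job, never by the agent."""
    cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=days)
    purged = [eid for eid, ts in TRASH.items() if ts < cutoff]
    for eid in purged:
        del TRASH[eid]
    return purged
```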

Sandboxing and Isolation

If an agent needs to execute code or manipulate files, it must do so within a sandboxed environment (e.g., Firecracker microVMs or gVisor). We see teams giving agents access to a shared filesystem to “help with data processing.” This is a recipe for disaster. If the agent goes rogue, it should only corrupt its own isolated container, not the production database.

Observability and Circuit Breakers

You need deep observability into the reasoning trace of the agent. Before an action is executed, the system should log: *What was the user prompt? What was the agent’s internal reasoning? What tool did it select? What are the parameters?* If the agent attempts to call a destructive tool three times in a row without success, a circuit breaker should trigger, freezing the agent session and alerting a human administrator.
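A per-session circuit breaker along these lines fits in a few lines of Python. The three-strike threshold and the set of destructive tools are illustrative values you would tune per deployment.

```python
DESTRUCTIVE_TOOLS = {"trash_email", "issue_refund"}  # illustrative set

class CircuitBreaker:
    """Trips after N consecutive failed destructive calls in one session."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0
        self.tripped = False

    def record(self, tool: str, succeeded: bool) -> None:
        if tool not in DESTRUCTIVE_TOOLS:
            return
        self.failures = 0 if succeeded else self.failures + 1
        if self.failures >= self.threshold:
            self.tripped = True  # freeze the session; page a human here

    def allow(self, tool: str) -> bool:
        return not self.tripped
```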

Why the Market Is Moving This Way

The shift toward agentic AI is driven by the limitations of passive chatbots. Chatbots can answer questions, but they cannot close tickets, update CRMs, or execute workflows. Businesses are demanding ROI, and ROI comes from automation, not conversation.

Technologically, we have crossed a threshold. With the release of advanced reasoning models and improved function‑calling capabilities, the latency and reliability of agents have reached a point where real‑time interaction is feasible. However, the security tooling has not kept pace. We have race cars (the models) but no brakes (the safety infrastructure). The market is moving this way because the competitive pressure to automate is high, but the understanding of the blast radius is low. Vendors are shipping “autonomous” features without shipping “undo” buttons.

Business Value

Implementing a safety‑first agentic architecture protects the business from catastrophic loss, but it also enables faster, more aggressive automation. When you know that an agent cannot accidentally delete your production database, you feel comfortable giving it access to more tools, which increases its utility.

Consider the cost of the alternative. A data breach or massive data loss caused by a rogue agent can cost millions in remediation and reputational damage. A well‑gated agent, by contrast, can safely handle 80% of Tier 1 support tickets.

Concrete Example

Imagine an AI automation system for processing refunds. A standard agent might issue a refund for any customer who complains. The safety‑gated agent checks the refund amount against a policy (e.g., “Refunds < $50 are auto‑approved; > $50 require human review”). If the agent tries to refund $500, the Safety Kernel blocks the API call to Stripe/PayPal and routes the request to a human queue. This reduces manual work by 70% while ensuring zero financial leakage from policy violations.
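A stripped-down version of that gate might look like this. The $50 threshold and the in-memory review queue are placeholders for a real policy engine and ticketing system; the agent never reaches the payment API directly.

```python
AUTO_APPROVE_LIMIT = 50.00   # illustrative policy threshold
human_review_queue = []      # stand-in for a real review queue

def gate_refund(customer_id: str, amount: float) -> str:
    """Return the routing decision; the payment API is called elsewhere,
    and only for auto-approved requests."""
    if amount <= AUTO_APPROVE_LIMIT:
        return "auto_approved"           # safe to forward to Stripe/PayPal
    human_review_queue.append((customer_id, amount))
    return "queued_for_human"            # Safety Kernel blocks the API call
```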

Real‑World Application

1. Automated Inbox Management (The Safe Way)

Instead of giving an agent direct delete access, we architect a system where the agent applies tags (e.g., “spam,” “newsletter,” “urgent”). A separate, non‑AI script runs nightly to move items tagged “spam” to a hidden folder, and another script permanently deletes them only after 30 days. This mimics the OpenClaw use case but introduces a 30‑day rollback window.
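The tag-then-purge split can be sketched as below, assuming a hypothetical in-memory inbox. The agent's only write capability is `agent_tag`; `nightly_job` is a deterministic, non-AI script.

```python
import datetime

# Illustrative inbox: message id -> tags and hide date
inbox = {"m1": {"tags": set(), "hidden_since": None}}

def agent_tag(msg_id: str, tag: str) -> None:
    """The agent's only write capability: apply a tag."""
    inbox[msg_id]["tags"].add(tag)

def nightly_job(today: datetime.date) -> list:
    """Deterministic script: hide spam, purge only after 30 days."""
    purged = []
    for msg_id, msg in list(inbox.items()):
        if "spam" in msg["tags"] and msg["hidden_since"] is None:
            msg["hidden_since"] = today
        elif msg["hidden_since"] and (today - msg["hidden_since"]).days >= 30:
            purged.append(msg_id)
            del inbox[msg_id]
    return purged
```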

2. CRM Data Enrichment

Sales teams want agents to scrape LinkedIn and update their CRM. The risk is the agent overwriting accurate data with hallucinations. The solution is a “conflict resolution” pattern. The agent writes to a staging table. A human reviews the diff (Current Data vs. Proposed Data) and clicks “Approve.” The agent never writes directly to the master record.
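A minimal sketch of the staging-table pattern, with invented account fields standing in for real CRM records:

```python
# Illustrative master and staging stores; in production these are tables.
master = {"acme": {"employees": 120, "hq": "Berlin"}}
staging = {}

def agent_propose(account: str, updates: dict) -> None:
    """The agent writes here only; it never touches master."""
    staging[account] = updates

def diff(account: str) -> dict:
    """What the reviewer sees: field -> (current value, proposed value)."""
    current = master.get(account, {})
    return {k: (current.get(k), v) for k, v in staging.get(account, {}).items()}

def approve(account: str) -> None:
    """Human-triggered: merge the staged proposal into the master record."""
    master.setdefault(account, {}).update(staging.pop(account, {}))
```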

3. DevOps and CI/CD

Agents are increasingly used to manage infrastructure. We implement a “break‑glass” mechanism. The agent can restart services or scale pods automatically. However, any action that modifies the VPC, deletes a database, or changes security groups requires a cryptographic signature from a senior engineer. The agent prepares the Terraform plan, but it cannot apply it.
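Here is an illustrative break-glass check, using an HMAC as a stand-in for the engineer's signature; a real deployment would use asymmetric keys and hardware-backed signing, and the key would be provisioned out of band.

```python
import hashlib
import hmac

SENIOR_ENGINEER_KEY = b"demo-key"  # assumption: shared secret for the sketch

HIGH_RISK = {"delete_database", "modify_vpc", "change_security_group"}

def sign_plan(plan: str, key: bytes = SENIOR_ENGINEER_KEY) -> str:
    """Engineer-side: sign the exact Terraform plan being approved."""
    return hmac.new(key, plan.encode(), hashlib.sha256).hexdigest()

def apply_action(action: str, plan: str, signature=None) -> str:
    """Kernel-side: high-risk actions require a valid signature over
    the plan; low-risk actions run automatically."""
    if action in HIGH_RISK:
        if signature is None or not hmac.compare_digest(signature, sign_plan(plan)):
            return "rejected: missing or invalid signature"
        return "applied with break-glass approval"
    return "applied"
```

Because the signature covers the plan itself, the agent cannot get a benign plan approved and then apply a different one.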

How We Approach This at Plavno

At Plavno, we do not build “magic” AI black boxes. We build software systems that happen to use AI models as a component. When we design custom software with agentic capabilities, we start with the failure modes.

Principle 1: The LLM is a Guest, Not the Host.

The LLM runs in a restricted environment. It has no network access; it can only communicate via function calls that we define. It cannot “guess” an API endpoint. If it tries to call a function we haven’t explicitly whitelisted, the call fails closed: it is rejected, logged, and never reaches your infrastructure.
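One simple way to enforce this, sketched with a hypothetical registry decorator: a function is reachable by the LLM only if it was explicitly registered, and everything else fails closed without raising.

```python
REGISTRY = {}

def tool(fn):
    """Decorator: the only way a function becomes callable by the LLM."""
    REGISTRY[fn.__name__] = fn
    return fn

@tool
def list_emails(folder: str) -> list:
    return []  # stub body for the example

def dispatch(name: str, **kwargs):
    """All LLM tool calls route through here; unknown names fail closed."""
    fn = REGISTRY.get(name)
    if fn is None:
        return {"error": "tool_not_whitelisted"}  # rejected, no exception
    return fn(**kwargs)
```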

Principle 2: Explicit Policy as Code.

We use policy engines (such as OPA, the Open Policy Agent) to enforce business logic outside of the LLM. The LLM might *think* it’s okay to delete a user, but the Policy Engine checks the actual rules (e.g., “User cannot be deleted if they have an active subscription”). This decouples the intelligence from the governance.
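The subscription rule above, mimicked in plain Python rather than OPA's Rego; in production this would be a query to the policy engine, not inline code, and the user records here are invented.

```python
# Illustrative user store
users = {
    "u1": {"active_subscription": True},
    "u2": {"active_subscription": False},
}

def policy_allows_delete(user_id: str) -> bool:
    """Deterministic business rule; the LLM cannot override it."""
    user = users.get(user_id)
    return user is not None and not user["active_subscription"]

def handle_delete_request(user_id: str) -> str:
    # The agent may *want* to delete, but governance decides.
    return "deleted" if policy_allows_delete(user_id) else "denied_by_policy"
```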

Principle 3: Audit Trails are Mandatory.

Every agent action is logged to an immutable append‑only log. We treat agent actions with the same scrutiny as financial transactions. If an agent makes a mistake, we can replay the log, understand exactly why it happened, and patch the specific tool definition or prompt to prevent recurrence.
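A hash-chained log is one cheap way to get tamper evidence: each record commits to the hash of the previous one, so any edit to history breaks verification. A minimal sketch, with an in-memory list standing in for durable storage:

```python
import hashlib
import json

audit_log = []  # stand-in for an append-only store

def log_action(entry: dict) -> None:
    """Append an entry chained to the previous record's hash."""
    prev_hash = audit_log[-1]["hash"] if audit_log else "genesis"
    payload = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    audit_log.append({"entry": entry, "prev": prev_hash, "hash": digest})

def verify_log() -> bool:
    """Recompute the chain; any edited or reordered record fails."""
    prev = "genesis"
    for record in audit_log:
        payload = json.dumps(record["entry"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if record["prev"] != prev or record["hash"] != expected:
            return False
        prev = record["hash"]
    return True
```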

What to Do If You’re Evaluating This Now

  • Inventory your API Keys: Do your agents have API keys with admin privileges? Revoke them immediately. Generate new keys with the principle of least privilege (read‑only or write‑only to specific scopes).
  • Test for “Blast Radius”: Create a sandbox environment. Tell your agent to “delete all data” or “send an email to everyone.” See what happens. If it succeeds, your architecture is unsafe. It should fail or trigger an alert.
  • Demand Reversibility: Ask your vendor or engineering team: “If this agent makes a mistake, how do we roll back the state in under 5 minutes?” If the answer involves restoring a SQL backup from last night, the system is not ready for production agents.
  • Implement a Human‑in‑the‑Loop (HITL) for Destructive Actions: For the first 3 months of deployment, any write operation (POST, PUT, DELETE) should require a “click to approve” in a UI. This trains the agent and builds trust.

Conclusion

The OpenClaw incident is a warning shot. Agentic AI offers immense potential to automate the drudgery of white‑collar work, but it introduces a new class of operational risk: the autonomous error. We cannot rely on the model to be perfect; we must rely on our systems to be resilient. By wrapping AI agents in a Safety Kernel, enforcing reversibility, and treating them as untrusted users, we can harness the power of automation without burning down the house. The future of AI isn’t just about smarter models; it’s about safer software engineering.

Renata Sarvary

Sales Manager

Ready to Secure Your AI Agents?

Worried about giving an AI agent write-access to your production database? Plavno can architect a safety-first agentic system with reversible actions and strict permission boundaries.

Schedule a Free Consultation

Frequently Asked Questions

AI Agent Safety FAQs

Common questions about securing agentic AI systems with a Safety Kernel

What is a Safety Kernel in AI?

A Safety Kernel is a dedicated service, often written in a rigid language like Go or Rust, that sits between the Large Language Model (LLM) and your critical infrastructure. It validates every tool call against a pre‑defined schema and user permissions to ensure the agent does not execute destructive or unauthorized actions.

Why is prompt engineering insufficient for AI agent safety?

Prompt engineering cannot prevent race conditions, logic errors, or irreversible state changes. Since LLMs are probabilistic, they can hallucinate actions. Relying solely on prompts to stop an agent from formatting a hard drive or deleting data is risky; a deterministic gatekeeper like a Safety Kernel is required.

How can I make AI agent operations reversible?

You can enforce reversibility by design by restricting agents to specific endpoints. For example, instead of allowing access to a DELETE endpoint, provide access to a TRASH endpoint. This creates a time buffer where human operators or automated scripts can intervene and restore data if necessary.

What are the business risks of deploying agentic AI?

The primary risk is catastrophic loss, including data breaches, massive data loss, and reputational damage. Without proper security controls, an agent can misinterpret a command and disrupt workflows or delete critical information, costing millions in remediation.

How does Plavno approach AI agent security?

Plavno treats the LLM as a guest, not a host. We implement a 'Permission Layer' using policy engines like OPA, enforce strict audit trails, and ensure the LLM runs in a restricted environment with no direct network access, communicating only via explicitly whitelisted function calls.