Deciding whether your agent is ready for production starts with confirming that it achieves the goal it's been given, and after a few iterations this should be fairly easy to verify. But because agents are dynamic by nature, you also need to consider the risks introduced by the way you've designed your agent.
Sure, it might be able to update a customer record when needed, but what are the consequences of it making an update when it shouldn't? You need to identify the parts of your agent where poor decision-making could have permanent repercussions, and add safeguards there. Not only that, but it's also important to be able to catch those poor decisions and learn from them.
We've already talked about how you can add certain restrictions or procedures as part of configuring the LLM or the tools, but that is not the only way to make execution safer. Using BPMN patterns is a reliable way to prevent a lapse in judgment by the agent from causing issues within the systems it has access to.
Tool gatekeeping
While everything you place inside the ad hoc subprocess is a tool that the agent can execute, it's important to remember that it can only start tools from the beginning of a flow. This means that if you connect two or more tasks together, it's still one single tool, but it can only be initiated from the first task.
You can take advantage of this when it comes to protecting important systems from weird LLM behavior while still technically giving the LLM the ability to use such tools. In most cases, this is where you'd put a human user in front of the action task — ensuring that all attempts to trigger the action have some kind of oversight. But you could just as easily add a programmatic check on the input or even a rules table to see if the agent should be allowed to take the action based on the context.
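To make the idea of a programmatic check plus a rules table concrete, here is a minimal Python sketch of the decision such a gate could make before the action task runs. It is only an illustration of the logic, not Camunda's API: the `UpdateRequest` type, the `RULES` table, and the function names are all hypothetical, and in a real process the gate would be modeled as BPMN tasks placed in front of the action.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class UpdateRequest:
    """Hypothetical input the agent passes when it starts the tool."""
    customer_id: str
    field: str
    new_value: str


# Hypothetical rules table: what happens when the agent asks to change a field.
RULES = {
    "email": "allow",
    "phone": "allow",
    "credit_limit": "human_review",
}


def gate(request: UpdateRequest) -> str:
    """Return 'allow', 'human_review', or 'reject' for a requested update."""
    # Programmatic check on the input: a blank customer ID is never acceptable.
    if not request.customer_id.strip():
        return "reject"
    # Rules-table lookup: fields that are not listed are rejected by default.
    return RULES.get(request.field, "reject")


def run_tool(request: UpdateRequest, action: Callable[[UpdateRequest], None]) -> str:
    """The gate always runs first; the action runs only when the gate allows it."""
    decision = gate(request)
    if decision == "allow":
        action(request)
    return decision
```

Because the agent can only start the flow at the gate, it never reaches the action directly, which is the same property the connected-task pattern gives you in the ad hoc subprocess.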

Judging agent success
We plan to publish papers that go into detail about measuring agent success by aggregating results into KPI metrics. For now, let's start with how to assess the performance of a single instance and, importantly, what you do with that individual assessment.
A common pattern in the design of AI agents is to use a judge to work out how well the agent performed. The judge is usually triggered after an agent has finished its work. It looks at what the agent was trying to achieve and how it achieved it, then produces some kind of judgment (hence the name). But then what?
A nice pattern is to have the judge return one of three results: "optimal," "acceptable," or "suboptimal."
Acceptable executions can just end—there's nothing to learn from them and nothing to fix.
Suboptimal performances should be given over to a human user to see if they need to fix anything the agent has done.
Optimal performances can be stored in long-term memory so that the next time a similar problem needs to be solved, the agent can access it and hopefully repeat it.
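The three outcomes above amount to a small routing decision, which can be sketched in Python as follows. The names here (`Verdict`, `parse_verdict`, `route`, and the route labels) are hypothetical, and treating an unrecognized judge answer as "suboptimal" is an assumption of this sketch, chosen so that anything unclear ends up in front of a human.

```python
from enum import Enum


class Verdict(Enum):
    OPTIMAL = "optimal"
    ACCEPTABLE = "acceptable"
    SUBOPTIMAL = "suboptimal"


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's answer; unrecognized text is treated as suboptimal (assumption)."""
    try:
        return Verdict(raw.strip().lower())
    except ValueError:
        return Verdict.SUBOPTIMAL


def route(verdict: Verdict) -> str:
    """Map a verdict to the next step in the process (hypothetical step names)."""
    if verdict is Verdict.OPTIMAL:
        return "store_in_long_term_memory"
    if verdict is Verdict.SUBOPTIMAL:
        return "human_review"
    return "end"  # acceptable: nothing to learn and nothing to fix
```

In a BPMN model, the same decision would typically sit on a gateway after the judge task, with one outgoing path per verdict.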

Timing and event restrictions
Time spent on the problem
An agent has no real understanding of time. A problem that takes 20 iterations by the LLM might only last 10 minutes, but another problem that is only on its second iteration could have been running for days. The LLM itself has no concept of this, but the process engine does. Because of this, you can use BPMN timer events either to send out notifications that an agent is taking too long or to stop the agent completely once a certain amount of time has elapsed.
This is particularly useful for scenarios where you're doing a lot of optional communication with humans. Maybe they will respond… maybe they won't. This way you can give up after a certain amount of time.
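As a rough illustration of the two timer behaviors, the Python sketch below decides what should happen based on wall-clock time rather than iteration count. The thresholds and names are assumptions made for the example. In an actual model, the process engine evaluates timer events for you, and durations are usually written in ISO 8601 form (for example `PT1H` or `P2D`).

```python
from datetime import datetime, timedelta

# Assumed thresholds for this sketch; pick values that suit your process.
NOTIFY_AFTER = timedelta(hours=1)  # like a non-interrupting timer: notify, keep going
CANCEL_AFTER = timedelta(days=2)   # like an interrupting timer: stop the agent


def timer_action(started_at: datetime, now: datetime) -> str:
    """Return 'continue', 'notify', or 'cancel' based on elapsed time only."""
    elapsed = now - started_at
    if elapsed >= CANCEL_AFTER:
        return "cancel"
    if elapsed >= NOTIFY_AFTER:
        return "notify"
    return "continue"
```

Note that the number of LLM iterations does not appear anywhere in the function: an agent on its second iteration is cancelled just the same if it has been waiting two days for a human to reply.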
Events have changed the context
Agents built with Camunda can be long-running, which means that things affecting what the agent is working on could happen while a request is in flight. Using non-interrupting events, you can trigger an agent to update its context if it's dealing with stale data. Or, if the job it's working on is no longer relevant, an interrupting signal event could be used to cancel all agents working on a related request. Either way, these problems are often solved by BPMN events of one kind or another. Interestingly, you can also have the agent itself trigger these events.
For instance, as we learned in the system prompt section, it's really important to explain to an agent how it should properly fail. One great way of doing that is by letting it trigger an event that cancels itself, like an escalation event.
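The difference between these event types can be sketched in a few lines of Python. This is a toy model of the semantics only, with hypothetical names throughout (`AgentRun`, `on_context_changed`, `on_request_cancelled`, `escalate`). In practice the process engine delivers the events, and the handlers would be modeled as BPMN event subprocesses or boundary events.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical stand-in for one running agent instance."""
    request_id: str
    context: dict = field(default_factory=dict)
    status: str = "running"


def on_context_changed(run: AgentRun, updates: dict) -> None:
    """Non-interrupting: refresh stale data and let the agent keep working."""
    if run.status == "running":
        run.context.update(updates)


def on_request_cancelled(runs: list, request_id: str) -> int:
    """Interrupting: stop every running agent tied to the request; return how many."""
    cancelled = 0
    for run in runs:
        if run.request_id == request_id and run.status == "running":
            run.status = "cancelled"
            cancelled += 1
    return cancelled


def escalate(run: AgentRun, reason: str) -> None:
    """Agent-triggered: the agent ends its own run and records why it gave up."""
    run.status = "escalated"
    run.context["escalation_reason"] = reason
```

The escalation path is what makes "failing properly" possible: the agent has an explicit, modeled way to stop, rather than looping or guessing.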




