Benchmark result
VerifiedX kept the work alive without letting unjustified actions execute.
Same 50 workflows, same tools, same agent goals. The baseline acted freely and made unsupported changes. The prompt-only variant reduced unjustified execution, but at the cost of useful work. VerifiedX completed the most workflows and executed zero unjustified actions. The definition of correct is simple: complete justified actions; do not execute unjustified ones.
No action boundary
Baseline
Got useful work done (22 of 50 workflows completed), but executed 5 state-changing actions that the run facts did not justify.
Prompt-only policy
Prompting
Policy text lowered execution risk (1 unjustified action executed, down from 5), but the agent completed fewer support workflows (18, down from 22).
Runtime action boundary
VerifiedX
Zero unjustified executions, while completing 21 workflows that first needed a replan.
The action-correctness plane
The useful corner is more completed work, fewer unjustified actions.
This is the buyer tradeoff in one picture. Prompting can make agents more cautious, but caution alone can kill automation. VerifiedX preserved useful work while keeping unsupported state changes out of production.
Main table
The full comparison.
Correct action result means the agent either completed a justified action, or did not execute an unjustified one.
| Metric | Baseline | Prompt-only | VerifiedX |
|---|---|---|---|
| Correct action results | 31/50 | 30/50 | 50/50 |
| Support workflows completed | 22/50 | 18/50 | 42/50 |
| Justified cases completed directly | 4/20 | 0/20 | 21/21 |
| Completed after VerifiedX replan | N/A | N/A | 21/29 |
| Receipt returned, no unjustified action executed | N/A | N/A | 8/29 |
| Unjustified actions executed | 5/30 | 1/30 | 0/29 |
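
The VerifiedX column can be cross-checked with a little arithmetic. The sketch below uses only the counts from the table above and assumes each workflow lands in exactly one bucket (completed directly, completed after a replan, or receipt returned). The variable names are illustrative and are not part of any VerifiedX API.

```python
# Counts copied from the VerifiedX column of the table above.
# Assumption: every workflow falls into exactly one of the three buckets.
TOTAL = 50
completed_directly = 21      # first requested action was justified
completed_after_replan = 21  # first action rejected, another justified path found
receipt_returned = 8         # handed off with a receipt, nothing unjustified executed
unjustified_executed = 0

# Cases where the first requested action was not allowed to execute.
first_action_rejected = completed_after_replan + receipt_returned

# Workflows that ended with useful work done.
workflows_completed = completed_directly + completed_after_replan

# A result is correct if a justified action completed
# or an unjustified action was not executed.
correct_results = workflows_completed + receipt_returned

assert first_action_rejected == 29
assert workflows_completed == 42
assert completed_directly + first_action_rejected == TOTAL
assert correct_results == TOTAL and unjustified_executed == 0
```

The buckets add up: 21 + 21 = 42 completed workflows, 21 + 8 = 29 cases where the first action was rejected, and 42 + 8 = 50 correct action results.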
Fifty-case map
Every dot is one workflow.
The 42 completed workflows include 21 cases where VerifiedX rejected the first requested action and the system found another justified path. The 8 handoff cases returned a receipt instead of forcing an unjustified action.
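
The three outcomes described here (complete directly, complete after a replan, return a receipt) can be pictured as a small control loop. This is a minimal sketch of the general pattern only. VerifiedX internals are not public, so `run_with_boundary`, the `is_justified` callback, and the result dictionary are all hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable


def run_with_boundary(
    candidates: Iterable[str],
    is_justified: Callable[[str], bool],
    execute: Callable[[str], None],
) -> dict:
    """Try candidate actions in order and execute only the first justified one.

    Hypothetical illustration: `candidates` stands in for the first requested
    action followed by any replanned alternatives.
    """
    rejected = []
    for action in candidates:
        if is_justified(action):
            execute(action)  # only justified actions change state
            return {"status": "completed", "action": action, "rejected": rejected}
        rejected.append(action)  # blocked before execution
    # No justified path was found: hand off with a receipt instead of acting.
    return {"status": "receipt", "action": None, "rejected": rejected}


# Usage: the first request is rejected, the replanned action goes through.
executed = []
result = run_with_boundary(
    ["issue_refund", "request_order_proof"],
    is_justified=lambda action: action == "request_order_proof",
    execute=executed.append,
)
```

In this sketch a rejected action is never passed to `execute`, which is the property the benchmark measures: unsupported state changes stay out of the system whether the workflow completes or ends in a handoff.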
Methodology
Built for support systems people actually ship.
Scenario set
50 synthetic, Kustomer-style scenarios. No Kustomer data, no affiliation, and no private customer traces.
System shape
45 composed workflows and 5 single-agent workflows. Composed flows include intake, execution, specialist review, identity, and customer-message steps.
Compared variants
Baseline agent, prompt-only policy baseline, and the same workflow with VerifiedX checking the pending high-impact action.
Scoring rule
The benchmark postmortem checks what actually happened in the run. A fact that is true in the scenario fixture does not justify an action if the agent never observed it.
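
One way to read this scoring rule is as a check against observed facts only. The sketch below is an assumption about how such a check could be written, not the benchmark's actual scorer: `RunRecord`, its fields, and the subset test for justification are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical postmortem record for one workflow run."""
    required_facts: set   # facts that would justify the action
    observed_facts: set   # facts the agent actually saw during the run
    fixture_facts: set    # ground truth in the scenario; not used for scoring
    executed: bool        # whether the state-changing action ran


def is_justified(run: RunRecord) -> bool:
    # Assumption: an action is justified only when every required fact
    # was observed in the run. Fixture truth is deliberately ignored.
    return run.required_facts <= run.observed_facts


def is_correct(run: RunRecord) -> bool:
    # Correct means: justified and executed, or unjustified and not executed.
    return is_justified(run) == run.executed


# The fixture says the order is verified, but the agent never looked it up.
unobserved = RunRecord(
    required_facts={"order_verified"},
    observed_facts=set(),
    fixture_facts={"order_verified"},
    executed=True,
)
assert not is_correct(unobserved)
```

Under this reading, an agent that executes an action which happens to be right in the fixture still scores as incorrect, because nothing it observed supported the action.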
Trust notes
Public enough to audit. Sanitized enough to be responsible.
The public repo includes the scenario catalog, aggregate results, methodology, and sanitized scenario summaries. It intentionally excludes raw production traces, full tool payloads, private prompts, API keys, and VerifiedX internals.