Support eval

Support Agent Action Benchmark

Fifty Kustomer-style support workflows for agents that issue refunds, update subscriptions, change reservations, write account state, tag cases, and send customer-facing messages.

Benchmark result

VerifiedX kept the work alive without letting unjustified actions execute.

Same 50 workflows, same tools, same agent goals. The baseline acted freely and made unsupported changes. The prompt-only policy reduced unjustified execution, but at the cost of useful work. VerifiedX completed the most workflows and executed zero unjustified actions. The definition of correct is simple: complete justified actions; do not execute unjustified ones.

No action boundary

Baseline

31/50 correct action results
22/50 support workflows completed
5/30 unjustified actions executed

Got useful work done, but some state-changing actions were not justified by the run facts.

Prompt-only policy

Prompting

30/50 correct action results
18/50 support workflows completed
1/30 unjustified actions executed

Policy text helped on execution risk, but the agent completed less useful support work.

VerifiedX result

Runtime action boundary

VerifiedX

50/50 correct action results
42/50 support workflows completed
0/29 unjustified actions executed

Zero unjustified executions, while completing 21 workflows that first needed a replan.

21 justified actions completed directly
21 workflows completed after VerifiedX replan
8 receipt handoffs; no unjustified action executed
0 unjustified actions executed with VerifiedX

The action-correctness plane

The useful corner is more completed work, fewer unjustified actions.

This is the buyer tradeoff in one picture. Prompting can make agents more cautious, but caution alone can kill automation. VerifiedX preserved useful work while keeping unsupported state changes out of production.

Main table

The full comparison.

Correct action result means the agent either completed a justified action, or did not execute an unjustified one.

| Metric | Baseline | Prompt-only | VerifiedX |
| --- | --- | --- | --- |
| Correct action results | 31/50 | 30/50 | 50/50 |
| Support workflows completed | 22/50 | 18/50 | 42/50 |
| Justified cases completed directly | 4/20 | 0/20 | 21/21 |
| Completed after VerifiedX replan | N/A | N/A | 21/29 |
| Receipt returned, no unjustified action executed | N/A | N/A | 8/29 |
| Unjustified actions executed | 5/30 | 1/30 | 0/29 |
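The VerifiedX column can be cross-checked with a little arithmetic. The sketch below hard-codes the counts from the table; the variable names are illustrative and not part of any VerifiedX tooling. It confirms that direct completions plus replan completions give the 42 completed workflows, and that replan completions plus receipt handoffs account for the 29 cases in the denominator.

```python
# Counts copied from the VerifiedX column of the main table.
TOTAL_WORKFLOWS = 50
completed_directly = 21      # justified cases completed directly (21/21)
completed_after_replan = 21  # completed after VerifiedX replan (21/29)
receipt_handoffs = 8         # receipt returned, no unjustified action executed (8/29)
unjustified_executed = 0     # unjustified actions executed (0/29)

# Workflows completed = direct completions + completions after a replan.
workflows_completed = completed_directly + completed_after_replan

# The 29-case denominator = replan completions + receipt handoffs.
boundary_cases = completed_after_replan + receipt_handoffs

assert workflows_completed == 42
assert boundary_cases == 29
assert completed_directly + boundary_cases == TOTAL_WORKFLOWS
assert unjustified_executed == 0

completion_rate = workflows_completed / TOTAL_WORKFLOWS
print(f"completed: {workflows_completed}/{TOTAL_WORKFLOWS} ({completion_rate:.0%})")
```

The three outcome counts sum to 50, which is consistent with the 50/50 correct action results reported for VerifiedX.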

Fifty-case map

Every dot is one workflow.

The 42 completed workflows include 21 cases where VerifiedX rejected the first requested action and the system found another justified path. The 8 handoff cases returned a receipt instead of forcing an unjustified action.

Legend: justified action completed directly; completed after VerifiedX replan; receipt returned, no unjustified action executed.

Methodology

Built for support systems people actually ship.

Scenario set

50 synthetic, Kustomer-style scenarios. No Kustomer data, no affiliation, and no private customer traces.

System shape

45 composed workflows and 5 single-agent workflows. Composed flows include intake, execution, specialist review, identity, and customer-message steps.

Compared variants

Baseline agent, prompt-only policy baseline, and the same workflow with VerifiedX checking the pending high-impact action.
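To make the third variant concrete, here is a minimal sketch of what checking a pending action before execution could look like. It is not the VerifiedX implementation or API: `Action`, `is_justified`, `run_with_boundary`, the fact names, and the tool names are all hypothetical, and justification is assumed to mean "every required fact was observed in the run". The three outcomes mirror the categories reported in the results: completed directly, completed after a replan, or a receipt returned with nothing executed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Action:
    """A pending state-changing action (hypothetical shape)."""
    name: str                  # e.g. "issue_refund"
    required_facts: frozenset  # facts that must have been observed in the run


def is_justified(action: Action, observed: set) -> bool:
    # Assumption: an action is justified when all required facts were observed.
    return action.required_facts <= observed


def run_with_boundary(
    requested: Action,
    fallbacks: Iterable[Action],
    observed: set,
    execute: Callable[[Action], None],
) -> Tuple[str, Optional[str]]:
    """Execute only justified actions; return (outcome, executed action name)."""
    if is_justified(requested, observed):
        execute(requested)
        return ("completed_directly", requested.name)
    # Replan: try alternative actions in order and run the first justified one.
    for alternative in fallbacks:
        if is_justified(alternative, observed):
            execute(alternative)
            return ("completed_after_replan", alternative.name)
    # Nothing justified: hand off with a receipt instead of forcing an action.
    return ("receipt_returned", None)


executed = []
observed = {"identity_verified", "order_found"}
refund = Action("issue_refund", frozenset({"identity_verified", "refund_eligible"}))
review = Action("tag_for_review", frozenset({"order_found"}))

result = run_with_boundary(refund, [review], observed, lambda a: executed.append(a.name))
print(result)  # ('completed_after_replan', 'tag_for_review')
```

In this example the refund is rejected because refund eligibility was never observed, so the sketch falls back to the justified tagging action rather than executing the refund.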

Scoring rule

The benchmark postmortem checks what actually happened in the run. A fact that is true in the fixture does not justify an action if the agent never observed it.
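The scoring rule can be sketched as a small predicate. This is an illustration under stated assumptions, not the benchmark's actual scorer: the function name, its parameters, and the representation of facts as sets of strings are all hypothetical. The key point it encodes is that only facts both true in the fixture and observed during the run can justify an action.

```python
def is_correct_result(
    fixture_facts: set,
    observed_facts: set,
    required_facts: set,
    executed: bool,
) -> bool:
    """Score one action result (hypothetical sketch of the scoring rule)."""
    # Fixture truth alone is not enough: the agent must have observed the fact.
    usable_facts = fixture_facts & observed_facts
    justified = required_facts <= usable_facts
    # Correct means: justified actions are completed, unjustified ones are not executed.
    return executed == justified


# True in the fixture but never observed: executing the refund is scored incorrect.
print(is_correct_result({"refund_eligible"}, set(), {"refund_eligible"}, executed=True))

# Same situation, but the agent held back: scored correct.
print(is_correct_result({"refund_eligible"}, set(), {"refund_eligible"}, executed=False))
```

Under this sketch, an agent that skips a justified action is also scored incorrect, which matches the report's framing that caution alone loses useful work.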

Trust notes

Public enough to audit. Sanitized enough to be responsible.

The public repo includes the scenario catalog, aggregate results, methodology, and sanitized scenario summaries. It intentionally excludes raw production traces, full tool payloads, private prompts, API keys, and VerifiedX internals.

Install in 2 steps

  1. Get an API key
  2. Run the installer

    Codex, Cursor, Claude Code.

    VERIFIEDX_API_KEY=vxpk_... npx -y @verifiedx-core/sdk@latest install --cwd .