You roll out an agent for routine payment repair cases. It can clear low-dollar wires that got stuck because a reference field is missing or a payee name has to be matched against a prior payee.
You give it read access to the case system, the sanctions screen, the core ledger, and a policy service that blocks anything above $2,500.
The guardrails look right:
- no new payees
- no release if the name match score is below 98%
- stop if a customer note mentions fraud, court order, or chargeback
- hand off if the agent fails twice
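To make those guardrails concrete, here is a minimal sketch of how they might be written as one ordered check. The thresholds come from the list above; the `RepairCase` fields, the `evaluate` function, the reason codes, and the order in which rules fire are all assumptions made for illustration, not a description of any real policy service.

```python
from dataclasses import dataclass

# Thresholds taken from the guardrails above; every name here is illustrative.
AMOUNT_LIMIT_USD = 2_500
MIN_NAME_MATCH = 0.98
STOP_TERMS = ("fraud", "court order", "chargeback")
MAX_FAILED_ATTEMPTS = 2


@dataclass
class RepairCase:
    amount_usd: float
    payee_is_new: bool
    name_match_score: float  # 0.0 to 1.0
    customer_note: str
    failed_attempts: int


def evaluate(case: RepairCase) -> tuple[str, str]:
    """Return (decision, reason); the first rule that fires wins."""
    note = case.customer_note.lower()
    if any(term in note for term in STOP_TERMS):
        return "stop", "stop_term_in_note"
    if case.failed_attempts >= MAX_FAILED_ATTEMPTS:
        return "hand_off", "agent_failed_twice"
    if case.amount_usd > AMOUNT_LIMIT_USD:
        return "block", "amount_over_limit"
    if case.payee_is_new:
        return "block", "new_payee"
    if case.name_match_score < MIN_NAME_MATCH:
        return "block", "name_match_below_threshold"
    return "release", "all_guardrails_passed"


case = RepairCase(1200.0, False, 0.99, "reference field missing", 0)
decision, reason = evaluate(case)
```

Returning a reason code next to the decision is a design choice: it gives every outcome a label that someone else can check later without rereading the case.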
The dashboard is green. 92% of cases close in under 40 seconds. Escalations stay low. No money moves outside the low-dollar bucket.
So you think the hard part is over.
The case note looked perfect
Then one case lands in audit. The agent writes:
Released payment repair. Sanctions clear. Prior payee matched. Amount under low-dollar limit. No fraud terms found.
What the case file shows:
```json
{
  "case_id": "PR-41872",
  "decision": "release",
  "checks": ["sanctions_clear", "payee_match", "amount_under_2500"],
  "signed_receipt": null
}
```
It looks clean. That’s the trap. What the agent didn’t do:
- It did not tie the decision to a signed receipt that says which rule set ran.
- It did not record the exact ledger snapshot and sanctions list used at that minute.
- It did not mark which customer text was masked for privacy and which text was kept.
- It did not state the stop rule that an examiner can test without replaying the whole case.
- It did not give audit anything it could rely on instead of pulling fresh samples.
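A receipt that closes those gaps could look something like the sketch below. It is an assumption-heavy illustration: the field names, the version strings, the snapshot identifiers, and the choice of HMAC-SHA256 with a shared key are all hypothetical. A real deployment would more likely sign with an asymmetric key held in an HSM or key management service.

```python
import hashlib
import hmac
import json

# Assumption: a demo shared secret. Do not use a hard-coded key in practice.
SIGNING_KEY = b"demo-key-not-for-production"


def sign_receipt(receipt: dict, key: bytes = SIGNING_KEY) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the receipt."""
    payload = json.dumps(receipt, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_receipt(receipt: dict, signature: str, key: bytes = SIGNING_KEY) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_receipt(receipt, key), signature)


# Every value except case_id and decision is a hypothetical placeholder.
receipt = {
    "case_id": "PR-41872",
    "decision": "release",
    "rule_set": "payment-repair-rules@v14",
    "ledger_snapshot": "ledger-snap-0001",
    "sanctions_list": "sanctions-list-0001",
    "masked_fields": ["customer_note.account_number"],
    "kept_fields": ["customer_note.free_text"],
    "stop_rule": "stop if note mentions fraud, court order, or chargeback",
}
signature = sign_receipt(receipt)
tampered = dict(receipt, decision="hold")
```

Here `verify_receipt(receipt, signature)` succeeds and `verify_receipt(tampered, signature)` fails, which is the property audit cares about: the receipt cannot be edited after the fact without the change showing.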
Now multiply it by a quarter
Now multiply that single clean-looking case by 10,000 payment repairs in a quarter.
Ops loves the speed. Audit still pulls samples. Compliance still asks for screenshots. Legal still asks how much raw trace data you plan to keep, and for how long.
| Metric | One case | 10,000 cases/quarter |
|---|---|---|
| Missing signed action receipts | 1 | 10,000 |
| Audit sample rechecks | 1 | 600 |
| Ops prep for review packets | 8 minutes | 80 hours |
| Stored raw traces legal may need to hold | 1 | 10,000 |
That is why teams get stuck. The agent is fast, but the proof is not something audit can reuse.
So the logs become extra work, not less work. And if the logs are rich enough to answer every question later, legal sees a bigger file cabinet for court.
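The table does not say how the 80 hours is derived. One reading that reproduces it, offered here as an assumption, is that ops only preps packets for the 600 sampled cases at 8 minutes each:

```python
CASES_PER_QUARTER = 10_000
SAMPLED_CASES = 600        # audit sample rechecks, from the table
PREP_MINUTES_PER_CASE = 8  # ops prep per review packet, from the table

sample_rate = SAMPLED_CASES / CASES_PER_QUARTER
prep_hours = SAMPLED_CASES * PREP_MINUTES_PER_CASE / 60

print(f"sample rate: {sample_rate:.0%}")    # sample rate: 6%
print(f"ops prep: {prep_hours:.0f} hours")  # ops prep: 80 hours
```

Under that reading, the prep cost scales with the sample size, so any receipt that lets audit sample less cuts ops time in direct proportion.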
Why the proof matters more than the prompt
The bottleneck is not what the agent can do; it’s what other people can trust without rechecking.
Most teams react by shrinking tool access and polishing prompts. That helps a little. But in low-harm work, the bigger shift comes when every action carries a short proof packet that follows the same shape every time: what tool was called, which rule set ran, what data was masked, and why the agent stopped or acted.
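"The same shape every time" is something you can check mechanically. The sketch below assumes a hypothetical packet with five fields covering the four questions in the paragraph above; the field names and the `packet_problems` helper are invented for illustration.

```python
# Hypothetical field names: tool called, rule set run, data masked,
# and why the agent acted or stopped (split into outcome and reason).
REQUIRED_FIELDS = {
    "tool_called": str,
    "rule_set": str,
    "masked_fields": list,
    "outcome": str,
    "outcome_reason": str,
}
ALLOWED_OUTCOMES = ("acted", "stopped")


def packet_problems(packet: dict) -> list[str]:
    """Return shape problems; an empty list means the packet conforms."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in packet:
            problems.append(f"missing: {name}")
        elif not isinstance(packet[name], expected_type):
            problems.append(f"wrong type: {name}")
    if "outcome" in packet and packet["outcome"] not in ALLOWED_OUTCOMES:
        problems.append("outcome must be 'acted' or 'stopped'")
    return problems


# The same check applies to packets from two different task types.
payment_repair = {
    "tool_called": "ledger.release_payment",
    "rule_set": "payment-repair-rules@v14",
    "masked_fields": ["customer_note.account_number"],
    "outcome": "acted",
    "outcome_reason": "all_guardrails_passed",
}
matching_break = {
    "tool_called": "recon.match_break",
    "rule_set": "matching-rules@v3",
    "outcome": "paused",
}
```

`packet_problems(payment_repair)` comes back empty, while `packet_problems(matching_break)` lists the two missing fields and the unknown outcome. The point of the sketch is that the check does not care which task produced the packet.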
This is the same lesson teams in regulated finance learned from boilerplate interface contracts.
Once the contract is accepted, you stop arguing about every field on every trade. Here, the contract is the AI action receipt. If regulators and auditors accept that receipt, and if it does not just hand legal more exposure, you can reuse the same control shell across payment repair, matching breaks, or research memo prep.
If that acceptance never comes, firms keep the tight tool lists and the manual checks, no matter how pretty the dashboard looks. And even if the proof gets perfect, big wires, trading, and account closures still stay behind hard gates. What changes is where you let agents act alone: small, bounded tasks where a live rule can still stop them in time.
If your team needs engineers who build signed action receipts and live policy hooks around bounded agent tasks, that’s what we do at InTheValley.
- The agent looks fine. Why won’t audit trust it? - March 29, 2026
