You roll out an AI support system that can finish a billing step on its own. When a customer action requires a charge, it sends the charge call, waits for a reply, and, if the call times out, retries up to 5 times.
What the agent can do:
- charge a customer
- wait for a reply or retry
- ask the billing provider what happened
- send the charge through a separate route
The nasty part of payments is simple: once the charge call leaves your system, a timeout does not mean the provider did nothing. It can mean the charge went through, and the answer never came back.
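A minimal sketch of that distinction, assuming a simplified reply model. The names `ChargeOutcome` and `classify`, and the reply strings `"ok"` and `"declined"`, are hypothetical and do not come from any real billing API.

```python
from enum import Enum
from typing import Optional


class ChargeOutcome(Enum):
    CONFIRMED = "confirmed"      # provider replied that the charge went through
    NOT_CHARGED = "not_charged"  # call never left, or provider declined it
    UNKNOWN = "unknown"          # call left our system and no reply came back


def classify(request_sent: bool, reply: Optional[str]) -> ChargeOutcome:
    """Map what we observed to what we actually know about the charge."""
    if not request_sent:
        return ChargeOutcome.NOT_CHARGED
    if reply == "ok":
        return ChargeOutcome.CONFIRMED
    if reply == "declined":
        return ChargeOutcome.NOT_CHARGED
    # Timeout after send: the charge may or may not have been committed.
    return ChargeOutcome.UNKNOWN
```

A timeout after send lands in `UNKNOWN`, not `NOT_CHARGED`. A blind retry treats those two as the same thing.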
The dashboard goes green
By lunch, the run view looks clean. The case closes because the last try comes back okay.
Charge step recovered after attempt 3. Customer case can close.
{
  "action": "charge_customer",
  "attempts": 3,
  "final_status": "recovered",
  "error_family": "timeout_after_send",
  "status_check_run": false,
  "route": "primary"
}
That looks safe because final_status says recovered. But the log only says the third call worked. It does not say the first two calls failed, and status_check_run is false, so nobody asked.
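One way to catch this in the run view is to flag any "recovered" run that retried after a timeout without a status check. This is a sketch only: the field names come from the log record in this post, while the function name `needs_review` is hypothetical.

```python
import json


def needs_review(run: dict) -> bool:
    """True when a 'recovered' run may hide an earlier committed charge."""
    retried = run.get("attempts", 1) > 1
    ambiguous = run.get("error_family") == "timeout_after_send"
    unchecked = not run.get("status_check_run", False)
    return retried and ambiguous and unchecked


record = json.loads("""
{
  "action": "charge_customer",
  "attempts": 3,
  "final_status": "recovered",
  "error_family": "timeout_after_send",
  "status_check_run": false,
  "route": "primary"
}
""")

print(needs_review(record))  # True
```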
What the agent didn’t do:
- It did not ask the billing provider whether attempt 1 had already been committed.
- It did not check whether the same timeout kept showing up on the same route.
- It did not ask whether anything about the next try would be different: worker, route, timing window, or status read.
- It did not stop when its own business record remained unchanged and no new proof appeared.
That is the whole idea. This is a proof problem, not a stamina problem. A new try gets a turn only if you can say what new proof it might bring before it runs.
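That rule can be sketched as a gate that runs before each retry. The type `AttemptConditions` and the functions `new_proof` and `may_retry` are hypothetical names; the four fields are the ones listed above (worker, route, timing window, status read).

```python
from dataclasses import dataclass
from typing import List, Optional

FIELDS = ("worker", "route", "timing_window", "status_read")


@dataclass(frozen=True)
class AttemptConditions:
    worker: str
    route: str
    timing_window: str
    status_read: Optional[str]  # latest provider status we have read, if any


def new_proof(prev: AttemptConditions, nxt: AttemptConditions) -> List[str]:
    """Name what differs between two attempts; an empty list means same bet."""
    return [f for f in FIELDS if getattr(prev, f) != getattr(nxt, f)]


def may_retry(prev: AttemptConditions, nxt: AttemptConditions) -> bool:
    """Allow the retry only if something other than the attempt number changed."""
    return bool(new_proof(prev, nxt))
```

The attempt number is deliberately not a field, so a retry that changes nothing else is refused.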
Three refunds later, the pattern is obvious
One charge can look like bad luck. Three customers on the same Monday is a category. Your support team sees refunds. Your dashboard sees green recoveries.
| Customer | Retry rule | New proof before next try? | What happened |
|---|---|---|---|
| 1 | Up to 5 tries | No | Charged twice, then refunded |
| 2 | Up to 5 tries | No | Charged twice, then refunded |
| 3 | Up to 5 tries | No | Charged twice, then refunded |
At that point, the cost is not just the duplicate charge. It is all the extra run history and wait time spent repeating the same customer mutation under the same bad conditions.
If the only thing that changes is the attempt number, you are not learning. You are just rolling the same die again.
What the agent should have done instead: after the same failure repeats unchanged, it stops same-charge retries. It asks the billing provider what happened, or it uses a separate route only if that route still shares the same duplicate check or has a clean way to reverse a double charge.
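As a rough sketch, that decision could look like the function below. The function name, parameter names, and returned labels are hypothetical. Preferring the status check over the separate route, and escalating when neither is available, are assumptions added for illustration.

```python
def next_step(
    failure_repeated_unchanged: bool,
    status_check_available: bool,
    alt_route_shares_dedupe: bool,
    alt_route_can_reverse: bool,
) -> str:
    """Pick the next action after a charge attempt times out."""
    if not failure_repeated_unchanged:
        # Something changed since the last try, so a retry can still teach us.
        return "retry_same_charge"
    if status_check_available:
        # Assumption: asking the provider is preferred over sending again.
        return "ask_provider"
    if alt_route_shares_dedupe or alt_route_can_reverse:
        # The separate route is only safe with one of these guarantees.
        return "use_separate_route"
    # Assumption: with no safe option left, hand the case to a human.
    return "stop_and_escalate"
```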
The next retry has to earn its turn
A retry counts only when you can name the new proof it will create.
That cuts against the usual mix of idempotency keys, backoff, and fixed retry counts.
Those are still good tools. They just do not answer the ugly case where the charge may already have been committed, and another retry might repeat it under the same bad conditions.
Even a fresh run only wipes the history unless it truly changes the worker, code version, timing window, or billing route, or gives you a newer status read.
Particle filtering has a loosely similar failure, known as sample impoverishment and closely tied to weight degeneracy: once nearly all particles are copies of the same few points, more of them barely change your estimate. Each retry under the same bad conditions is another copy of the same point.
In durable distributed AI tool orchestration, that means the hard part is not getting better at retrying. It is knowing when the next retry is still the same bet.
If your team needs engineers who can tell when a retry adds proof and when it only repeats the same charge path, that’s what we do at InTheValley.
