
Tests passed. Why is checkout broken again?

Your team keeps reopening checkout bugs even after the visible tests go green.

After one coupon bug comes back two days after release, you give the repair system the issue text, the failing checkout trace, and the files near that trace. It writes patch candidates, compiles them, runs the repo’s visible tests, and ranks the fixes by what turns the dashboard green.

That setup feels sane because the patch stays close to the failing code path, and the review screen is clean. But the system can’t see the customer session that support saw, and it can’t see the stronger checks you wish you had. It only sees the tests already in the repo.

The review screen

This is the kind of thing you end up reviewing:

“Rank 1: patch-17. Failing test fixed. All visible checks pass. Ready for review.”

{
  "issue": "Coupon applies in cart but drops at final payment",
  "candidate": "patch-17",
  "files": ["checkout/discounts.py", "tests/test_coupon_submit.py"],
  "compile": "pass",
  "visible_tests": "128/128",
  "rank": 1
}

If you’re busy, that looks like proof. What the system didn’t do:

  1. It didn’t lock the exact broken path from the ticket and failing trace before scoring patches. The customer bug lives on the last payment handoff, not just in cart math.
  2. It didn’t run nearby checks on that same path, like the same coupon with a card retry, a page refresh, or a different payment method.
  3. It didn’t try fake fixes on that same path: tiny changes you already know should not fix the coupon handoff, just to see whether wrong edits also keep the tests green.
  4. It didn’t check the payment boundary with other probes against a known-good build, so a pass could still mean weak checks or a bad test setup.
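The four missing steps above can be sketched as one evaluation routine. Everything here is a hypothetical sketch: `run_visible_tests`, the probe callables, and the placebo patches stand in for whatever your repo’s harness actually provides.

```python
def evaluate_patch(patch, run_visible_tests, same_path_probes, placebo_patches):
    """Score one candidate with more than a single green/red bit.

    patch:             the candidate edit under review
    run_visible_tests: callable(patch) -> bool, the repo's existing suite
    same_path_probes:  dict of name -> callable(patch) -> bool, each probe
                       replaying the ticket's exact path (coupon plus card
                       retry, page refresh, alternate payment method)
    placebo_patches:   edits known NOT to fix the payment handoff
    """
    verdict = {
        "visible_pass": run_visible_tests(patch),
        "probe_failures": [name for name, probe in same_path_probes.items()
                           if not probe(patch)],
        # If known non-fixes also go green, the suite never pressed on the
        # broken path, so a green dashboard proves little.
        "placebos_passing": sum(run_visible_tests(p) for p in placebo_patches),
    }
    verdict["trustworthy_green"] = (
        verdict["visible_pass"]
        and not verdict["probe_failures"]
        and verdict["placebos_passing"] == 0
    )
    return verdict
```

With a weak suite, a placebo also passes and `trustworthy_green` comes back False even though every visible check is green.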

One shopper enters SAVE20, clicks pay, and still gets charged full price.

The patch pile

That one miss turns into a ranking problem fast.

A green build starts to look like one big score, but it is really two things mixed together: whether the patch is right, and whether your visible checks ever pressed on the exact broken path.

When teams track two local warnings on that path (how often known non-fixes still pass, and how often same-path probes fail), they stop treating all green patches as equal.

Visible-pass patch group | Wrong-change pass rate | Same-path probe miss rate | True fix rate
Local warning stays low  | < 0.10                 | < 0.10                    | > 80%
Local warning stays high | > 0.50                 | > 0.30                    | < 25%
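A minimal way to compute those two warning rates and bucket a patch group, assuming the thresholds above; the function names and cutoffs are illustrative, not calibrated to your repo.

```python
def local_warning_rates(placebo_outcomes, probe_outcomes):
    """placebo_outcomes: bools, True = a known non-fix still passed visible tests
    probe_outcomes:      bools, True = a same-path probe failed on the candidate
    """
    wrong_change_pass_rate = sum(placebo_outcomes) / len(placebo_outcomes)
    same_path_probe_miss_rate = sum(probe_outcomes) / len(probe_outcomes)
    return wrong_change_pass_rate, same_path_probe_miss_rate


def warning_group(wrong_change_pass_rate, same_path_probe_miss_rate):
    # Cutoffs mirror the table: low-warning groups saw true fixes above
    # 80% of the time, high-warning groups below 25%.
    if wrong_change_pass_rate < 0.10 and same_path_probe_miss_rate < 0.10:
        return "low-warning"
    if wrong_change_pass_rate > 0.50 and same_path_probe_miss_rate > 0.30:
        return "high-warning"
    return "inconclusive"
```

Anything that lands in "inconclusive" is a signal to add more placebos or probes before trusting the rank.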

On repos the system has not seen before, ranking green patches with those local checks beats raw visible-pass ranking by at least 10 points at picking the truly right fix first. That holds only when the checks were chosen from the ticket and failing trace before any patch was scored.
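That re-ranking reduces to a sort key: local warnings first, raw visible pass rate only as a tiebreaker. The field names are assumptions about what a harness might record, not a real schema.

```python
def rank_green_patches(candidates):
    """candidates: dicts with 'name', 'visible_pass_rate',
    'wrong_change_pass_rate', and 'same_path_probe_miss_rate'.

    The two warning rates must come from checks chosen from the ticket
    and failing trace BEFORE any patch was scored, or the ranking just
    rediscovers the dashboard.
    """
    return sorted(
        candidates,
        key=lambda c: (
            c["wrong_change_pass_rate"],     # lower = placebos fail as they should
            c["same_path_probe_miss_rate"],  # lower = customer path holds up
            -c["visible_pass_rate"],         # higher only breaks ties
        ),
    )
```

Note that two patches with identical visible pass rates can land far apart once the warning rates differ.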

The pressure test

A green patch matters only when wrong patches fail on the exact customer path.

That pushes against the habit of accepting repo fixes mainly because they compile, pass visible tests, or look like a believable diff.

Without local wrong-change checks, same-path probes, and a sense of how noisy those signals are, visible-pass repo repair is still guessing with better manners.
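That noise can be estimated cheaply by rerunning the same check on the same unchanged build. A sketch assuming a simple Bernoulli model; the 0.15 cutoff is illustrative, not a standard.

```python
def signal_noise(repeated_outcomes):
    """repeated_outcomes: bools from rerunning one check on one unchanged
    build. Returns (pass rate, standard error of that rate)."""
    n = len(repeated_outcomes)
    rate = sum(repeated_outcomes) / n
    stderr = (rate * (1 - rate) / n) ** 0.5  # Bernoulli standard error
    return rate, stderr


def too_noisy_to_rank_on(repeated_outcomes, max_stderr=0.15):
    # If a check flips this often on an unchanged build, don't let it
    # decide which green patch ships.
    _, stderr = signal_noise(repeated_outcomes)
    return stderr > max_stderr
```

A check that passes ten out of ten reruns is stable enough to rank on; one that flips every other run is not.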

In measurement work, this is called a confounding bridge proxy: a middle score looks useful because one local check is weak, not because that score means truth everywhere.

In repository repair, the green dashboard can point to the right patch for the wrong reason, which means your ranking story is only as good as the pressure you put on the exact broken path.

If your team needs engineers who can test AI patches on the exact customer path instead of stopping at a green build, that’s what we do at InTheValley.

