We cross-referenced 15 academic papers (NAACL 2025, AAAI 2026, ICSE 2026, Microsoft Research, Stanford TACL 2024), 6 reproducible GitHub benchmarks, official pricing pages from Anthropic, Cursor, and GitHub (verified May 12, 2026), cost telemetry from engineering teams running AI coding tools in production, and every controlled developer productivity study published through early 2026. This is what we found.
If you maintain a codebase with 10,000+ files and you’re trying to figure out which AI tools to use – stop reading marketing pages.
The research answers most of your questions, and the answers are not what the tool vendors are telling you.
Here’s what the evidence says.
Finding 1: The model matters roughly 3× more than the tool
This is the single most important finding, and it contradicts the way most people think about AI coding tools.
Oracle-SWE (arXiv 2604.07789, Microsoft Research + Georgia Tech, April 2026) measured what types of context actually improve coding agent performance.
They isolated five context signals and measured each one’s contribution.
The result that rewrites the conversation:
| Condition | GPT-4o success rate | GPT-5 success rate |
|---|---|---|
| Base (no context at all) | 23% | 73% |
| + All oracle context signals | 97% | 100% |
The model upgrade from GPT-4o to GPT-5 added +50 percentage points before any context was added.
The maximum gain from all five oracle signals combined was +74pp for GPT-4o (23% → 97%) – but the starting point was so much higher for GPT-5 that it needed only +27pp of those same signals to reach perfect performance.
The implication: if you’re on an older model, upgrading the model does roughly 3× more for your output quality than adding semantic search.
The Oracle-SWE data gives us the math directly: semantic indexing provides primarily Edit Location (+10pp) and API Usage (+5pp) – a maximum of +15pp.
The model upgrade from GPT-4o to GPT-5 provided +50pp. 50 ÷ 15 = 3.3×.
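The 3.3× figure can be reproduced from the numbers quoted above. The snippet below is a minimal sketch: the variable names are ours, and it assumes the per-signal gains can simply be added together, which is a simplification.

```python
# Success rates from the Oracle-SWE table above (percent).
base = {"GPT-4o": 23, "GPT-5": 73}
with_oracle = {"GPT-4o": 97, "GPT-5": 100}

# Signals that semantic indexing primarily provides (GPT-4o, in pp).
indexing_signals = {"Edit Location": 10, "API Usage": 5}

model_upgrade_gain = base["GPT-5"] - base["GPT-4o"]
indexing_gain = sum(indexing_signals.values())
ratio = model_upgrade_gain / indexing_gain

# Headroom that all five oracle signals together can add, per model.
oracle_gain = {m: with_oracle[m] - base[m] for m in base}

print(f"model upgrade: +{model_upgrade_gain}pp")
print(f"semantic indexing ceiling: +{indexing_gain}pp")
print(f"ratio: {ratio:.1f}x")
print(f"oracle headroom: {oracle_gain}")
```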
Both Cursor and Claude give you access to the same frontier Claude models (Opus 4.7, Sonnet 4.6), and Cursor also offers GPT-5.4. So the model is not a differentiator between tools – it’s a differentiator between plans within the same tool, based on which model you default to.
Finding 2: Semantic indexing produces real but conditional gains
The marketing says “semantic search makes your AI understand your codebase.”
The research says “yes, but only under specific conditions, and it can hurt if done wrong.”
When semantic indexing helps
CodeRAG-Bench (NAACL 2025, 9,000 coding tasks, 10 LLMs) measured +27.4% code correctness when GPT-4o received gold-standard retrieved documents on SWE-bench tasks.
Cursor’s own internal A/B test (cursor.com/blog/semsearch) measured +12.5% average accuracy with semantic search enabled vs grep-only.
On codebases with 1,000+ files, code retention improved by +2.6% – a measurable and reproducible gain.
When semantic indexing hurts
SRACG (AAAI 2026) showed that naive always-on retrieval (what they call “standard RACG”) actually degrades performance by 2.6–3.6pp compared to no retrieval at all.
Selective retrieval (retrieving only when the model’s confidence is low) improved results by +2.4 to +7.1pp across 7 LLMs.
An empirical study on retrieved information types (arXiv 2503.20589) found that retrieving similar-looking code (surface-level semantic matches) rather than task-relevant code drops code correctness by up to 15%.
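The selective-retrieval idea can be sketched as a simple gate in front of the retriever. This is an illustration only, not SRACG’s actual method: the `confidence` score, the 0.5 threshold, and the `retrieve` callable are hypothetical stand-ins.

```python
from typing import Callable, List


def build_prompt(
    task: str,
    confidence: float,
    retrieve: Callable[[str], List[str]],
    threshold: float = 0.5,  # hypothetical cut-off, tune per model
) -> str:
    """Prepend retrieved snippets only when the model's confidence is low."""
    if confidence >= threshold:
        # Confident enough: skip retrieval so no distractors enter the prompt.
        return task
    snippets = retrieve(task)
    if not snippets:
        return task
    return "\n\n".join(snippets) + "\n\n" + task


def fake_retriever(task: str) -> List[str]:
    # Stand-in for a real index lookup.
    return ["def parse_config(path): ..."]


print(build_prompt("Fix the config loader", 0.9, fake_retriever))
print(build_prompt("Fix the config loader", 0.2, fake_retriever))
```

The point of the gate is that retrieval becomes an opt-in cost: a confident model gets a clean prompt, and only uncertain calls pay for (and risk being distracted by) extra context.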
The threshold where indexing pays off
The evidence consistently shows that the benefit concentrates on large codebases:
- Under 100 files: negligible improvement – the model’s parametric knowledge and simple grep are sufficient
- Over 1,000 files: measurable gains – Cursor’s A/B test showed +2.6% retention, CodeRAG-Bench showed +6–27% depending on retrieval quality
- Over 10,000 files: speed advantage dominates – Cursor’s trigram inverted index (cursor.com/blog/fast-regex-search) eliminates most file reads, while ripgrep (which Claude Code uses) can take 15+ seconds on repos of this scale
If your codebase has fewer than 1,000 files, semantic indexing is a nice-to-have, not a necessity.
If it’s over 10,000 files, it’s a requirement for a usable workflow.
Finding 3: The real cost explosion is model drift, not platform fees
This is where the enterprise data gets uncomfortable.
Across the engineering teams we work with, one pattern recurs consistently: AI costs running $9,400/month against a $1,100/month subscription base.
The subscription was 12% of actual spend. The other 88% was usage – driven entirely by model selection and behavior.
Two root causes:
Root cause 1: Model drift. The team started on Claude Sonnet (~$0.33/call). Over three months, the majority of usage had shifted to Claude Opus (~$0.99/call).
Despite making fewer total requests, costs went up – because Opus ran roughly 3× the per-call cost in this pattern.
For context: current published pricing puts Opus 4.5/4.7 at $5/$25 per MTok vs Sonnet 4.6 at $3/$15 – a 1.67× per-token ratio.
The higher observed call-cost ratio reflects the specific model versions, session sizes, and extended context usage in this enterprise’s data.
No one decided to switch. It happened organically because developers self-select for the “best” model.
Root cause 2: No visibility. The cost spike was discovered on the invoice, not in real time. No budget alerts. No per-user dashboards. No model governance policy.
Claude Opus represented over 70% of all token spend – despite Sonnet achieving a comparable SWE-bench score (Sonnet 4.6: 79.6% vs Opus 4.5: 80.9%, per the Steel.dev leaderboard) at roughly 60% of the cost per token.
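The arithmetic behind the drift pattern is easy to check. The sketch below uses the figures quoted above; the monthly call volumes and the 70% Opus call mix passed to `blended_cost` are hypothetical, chosen only to show how spend can rise while request counts fall.

```python
SONNET_PER_CALL = 0.33  # USD, observed average from the text
OPUS_PER_CALL = 0.99    # USD, observed average from the text


def blended_cost(calls: int, opus_fraction: float) -> float:
    """Monthly cost for a call volume and the share of calls sent to Opus."""
    opus_calls = calls * opus_fraction
    sonnet_calls = calls - opus_calls
    return opus_calls * OPUS_PER_CALL + sonnet_calls * SONNET_PER_CALL


subscription_share = 1_100 / 9_400            # subscription vs total spend
call_ratio = OPUS_PER_CALL / SONNET_PER_CALL  # observed per-call ratio
token_ratio = 5.0 / 3.0                       # published per-token ratio

# Hypothetical volumes: fewer calls in month 3, but mostly on Opus.
month_1 = blended_cost(10_000, opus_fraction=0.0)
month_3 = blended_cost(8_000, opus_fraction=0.7)

print(f"subscription share: {subscription_share:.0%}")
print(f"per-call ratio: {call_ratio:.2f}x, per-token ratio: {token_ratio:.2f}x")
print(f"month 1: ${month_1:,.0f}, month 3: ${month_3:,.0f}")
```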
The platform fee is negligible
On Cursor Teams plans, there’s an explicit $0.25 per million token platform surcharge on non-Auto requests (cursor.com/help/models-and-usage/token-rate).
On a typical 120K-token session, that’s ~$0.03.
The model cost for the same session on Opus 4.7 ($5/$25 per MTok): ~$1.00.
The model is 97% of the cost. The platform fee is 3%.
The semantic search infrastructure is not what makes your bill expensive – your model choice is.
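The 97/3 split follows from the prices above. A minimal sketch, assuming the 120K-token session splits into 100K input and 20K output tokens (the split is our assumption; the text here gives only the total):

```python
MTOK = 1_000_000

input_tokens, output_tokens = 100_000, 20_000  # assumed split of 120K tokens
opus_in, opus_out = 5.00, 25.00                # USD per MTok (Opus 4.7)
platform_rate = 0.25                           # USD per MTok (Cursor Teams surcharge)

model_cost = input_tokens / MTOK * opus_in + output_tokens / MTOK * opus_out
platform_fee = (input_tokens + output_tokens) / MTOK * platform_rate
fee_share = platform_fee / (model_cost + platform_fee)

print(f"model: ${model_cost:.2f}, platform fee: ${platform_fee:.2f}")
print(f"platform fee share of session cost: {fee_share:.0%}")
```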
Finding 4: Sonnet covers 90% of implementation at ~40% lower cost
This finding directly contradicts the instinct to always use the “best” model.
The cost analysis across teams identified the optimization opportunity: switching the default model from Opus to Sonnet recovers $3,500–5,000/month in avoidable spend – without any tooling changes.
The quality trade-off is small.
On the Steel.dev SWE-bench Verified leaderboard:
- Claude Sonnet 4.6: 79.6%
- Claude Opus 4.5: 80.9%
- GPT-5.4: 78.2% (per Vals.ai, updated April 30, 2026 – not yet tracked on Steel.dev at time of writing)
For feature development, API integrations, test writing, code reviews, and non-critical bug fixes (which is 90% of daily coding work), the 1.3pp quality gap between Sonnet 4.6 and Opus 4.5 is not observable in practice.
The cost data illustrates the asymmetry: the highest-spending developers were often the most cost-efficient – running deep agentic sessions at roughly $0.01 per accepted line of output.
The real waste came from mid-tier users spending nearly 3× that rate by triggering expensive Opus requests on low-complexity tasks with low acceptance rates – work Sonnet would have handled identically.
Important caveat: the SWE-bench contamination study (Prathifkumar, Mathews, Nagappan; University of Waterloo, December 2025) found that models perform roughly 3× better on SWE-Bench-Verified overall, and 6× better at finding edited files, than on fresh comparable benchmarks – suggesting training data overlap.
This means the 79.6% vs 80.9% gap may be larger on genuinely novel tasks. But it also means both scores are inflated – neither Sonnet nor Opus performs at 80% on work your team hasn’t done before.
Finding 5: Flat-rate billing is the only safe way to use Opus at volume
This is the structural insight that determines the stack.
Anthropic’s Claude Max plans (claude.com/pricing):
- Max 5x: $100/month – 5× more usage per session than Pro, flat rate, no per-token billing
- Max 20x: $200/month – 20× more usage per session than Pro, flat rate
Both plans have two weekly usage limits: one that applies across all models and a separate one for Sonnet models only. Exact numeric thresholds are not publicly disclosed by Anthropic.
Heavy autonomous pipeline use can hit these caps independently of the per-session limit – plan accordingly if you’re running long, unattended agentic sessions.
At API rates, $100 worth of Opus 4.7 covers roughly 100 moderate sessions (100K input + 20K output each, at $5/$25 per MTok = ~$1.00/session).
Power users running heavier sessions (150K input + 30K output, ~$1.50 each) at 5 sessions/day over 20 working days (100 sessions) consume ~$150/month in raw API value. On Max 5x at $100/month, Anthropic effectively subsidizes that heavy use by ~33% ($50 of the $150).
On Cursor Pro+ ($60/month, including $70 of API usage per cursor.com/docs/models-and-pricing), the same 100 moderate Opus sessions would cost $100 in API usage – $30 over the included budget, which spills into on-demand billing.
Copilot Pro+ costs $39/month and includes 7,000 AI Credits per month (3,900 base + 3,100 flex, both included in the subscription; see docs.github.com/copilot/reference/copilot-billing/models-and-pricing). Opus sessions drain those credits at frontier model rates.
Starting June 1, 2026, every Chat and agent interaction is billed in credits. Heavy agent use generates significant overage on top of the $39 base.
The bottom line: if you need Opus-level reasoning regularly, the only billing model that doesn’t generate surprise invoices is flat-rate.
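The plan comparison reduces to one cost function. The sketch below reuses the session sizes and prices from this section; the function and variable names are illustrative, and real sessions will vary with caching and context length.

```python
def session_cost(input_tokens: int, output_tokens: int,
                 in_price: float = 5.0, out_price: float = 25.0) -> float:
    """API cost in USD for one session at per-MTok prices (Opus 4.7 defaults)."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price


sessions_per_month = 5 * 20  # 5 sessions/day over 20 working days

moderate = session_cost(100_000, 20_000)
heavy = session_cost(150_000, 30_000)

heavy_api_value = heavy * sessions_per_month
max_5x_price = 100.0
subsidy = (heavy_api_value - max_5x_price) / heavy_api_value

cursor_included = 70.0  # API usage included in Cursor Pro+
cursor_overage = max(0.0, moderate * sessions_per_month - cursor_included)

print(f"moderate: ${moderate:.2f}/session, heavy: ${heavy:.2f}/session")
print(f"heavy month at API rates: ${heavy_api_value:.0f} (subsidy {subsidy:.0%})")
print(f"Cursor Pro+ overage for 100 moderate sessions: ${cursor_overage:.0f}")
```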
Finding 6: Execution context matters more than retrieval for code quality
This finding nuances the semantic search story in an important way.
Oracle-SWE ranked five context signals by their contribution to coding agent success:
| Rank | Signal | Contribution (GPT-4o, SWE-bench Verified) |
|---|---|---|
| #1 | Reproduction Test (a failing test) | +33pp |
| #2 | Execution Context (error traces, stack traces) | +22pp |
| #3 | Edit Location (exact file and lines to modify) | +10pp |
| #4 | API Usage (what APIs the fix should use) | +5pp |
| #5 | Regression Test (tests that should still pass) | +4pp |
Note: pp values are from Figure 4 of the paper (a chart). The paper’s text states that on SWE-bench-Verified specifically, “Edit Location contributes more than API Usage and Execution Context” – meaning Edit Location ranks above Execution Context on this benchmark, reversing positions #2 and #3 above. The Execution Context > Edit Location ordering holds on SWE-bench-Live and Pro. Contribution order varies by benchmark; the values above reflect a single model-benchmark combination.
Semantic indexing primarily provides signals #3 and #4 – Edit Location and API Usage.
The dominant signals (#1 and #2) come from running the code and observing what fails, not from retrieving code snippets.
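Grouping the table’s values by where each signal comes from makes the imbalance explicit. A minimal sketch, assuming the per-signal contributions are additive (the five values do sum to the +74pp total for GPT-4o) and using our own grouping labels:

```python
contributions_pp = {
    "Reproduction Test": 33,
    "Execution Context": 22,
    "Edit Location": 10,
    "API Usage": 5,
    "Regression Test": 4,
}

# Grouping is ours: signals obtained by running code vs by retrieving it.
execution_signals = ["Reproduction Test", "Execution Context"]
retrieval_signals = ["Edit Location", "API Usage"]

total = sum(contributions_pp.values())
execution = sum(contributions_pp[s] for s in execution_signals)
retrieval = sum(contributions_pp[s] for s in retrieval_signals)

print(f"total: +{total}pp")
print(f"execution-derived: +{execution}pp ({execution / total:.0%} of total)")
print(f"retrieval-derived: +{retrieval}pp ({retrieval / total:.0%} of total)")
```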
Both Cursor and Claude Code provide execution context through their agent harnesses (running tests, reading errors, and iterating on failures).
This is why Claude Code achieves comparable SWE-bench scores without a pre-built semantic index – the top two signals come from execution, not retrieval.
The most common agent failure mode is “incorrect implementation” (39.9% of failures), followed by “overly specific implementation” (23.4%).
“Failed to find edit location” (the problem semantic indexing directly fixes) is only the #3 failure mode at 12.9%. (Failure mode breakdown from Yang et al. 2024 / SWE-Agent, cited in Oracle-SWE.)
Finding 7: Context rot is real and accelerates with retrieval volume
The Chroma 2025 study (Hong, Troynikov, Huber; code at github.com/chroma-core/context-rot) tested 18 frontier models and found that all 18 performed worse as input length increased – no exceptions.
Semantically related content (code from the same domain) caused worse degradation than unrelated filler, because it creates plausible distractors for the model’s attention.
The Stanford “Lost in the Middle” study (TACL 2024, 2,000+ citations) confirmed a U-shaped performance curve: models perform best with relevant information at the beginning or end of context, worst in the middle.
GPT-3.5-Turbo’s accuracy dropped below its closed-book baseline (56.1%) when the relevant document was in the middle of a 20-document context – meaning the extra context actively hurt.
This matters for tool choice because:
- Claude Code’s grep+read approach appends each tool response (grep output, file reads) directly to the conversation context. Over a long session, accumulated tool outputs grow the context window continuously – triggering context rot at around 20% window usage (documented in GitHub issue #34685, cited in Hivetrail April 2026).
- Cursor’s Dynamic Context Discovery (cursor.com/blog/semsearch) converts long tool responses to files, lazy-loads context, and returns summaries from its Explore subagent – keeping the context window lean. SWE-ContextBench (February 2026) found that “accurate summarization and retrieval of experience trajectories significantly improves agent performance” over verbose trajectory dumps. (Note: SWE-ContextBench specifically measures cross-task context reuse, that is, whether agents can learn from summarized solutions to prior related tasks, not within-session context compression. The directional principle applies: summarized context consistently outperforms raw verbose context.)
The tool that manages context accumulation better sustains quality over longer sessions.
The research says Cursor’s architecture handles this structurally; Claude Code relies on the developer’s discipline to start fresh sessions.
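For tools that append raw tool output to the conversation, that discipline can be approximated with a simple budget tracker. This is a hypothetical sketch, not a feature of Claude Code or Cursor: the 4-characters-per-token estimate is a rough heuristic, the 200K window is an example value, and the 20% warning level is taken from the issue report cited above.

```python
class ContextBudget:
    """Tracks estimated context usage and flags when a fresh session is due."""

    def __init__(self, window_tokens: int, warn_fraction: float = 0.20):
        self.window_tokens = window_tokens
        self.warn_fraction = warn_fraction
        self.used_tokens = 0

    @staticmethod
    def estimate_tokens(text: str) -> int:
        # Rough heuristic (assumption): about 4 characters per token.
        return max(1, len(text) // 4)

    def add(self, tool_output: str) -> bool:
        """Record appended output; return True once usage reaches the warning level."""
        self.used_tokens += self.estimate_tokens(tool_output)
        return self.used_tokens >= self.warn_fraction * self.window_tokens


budget = ContextBudget(window_tokens=200_000)  # hypothetical window size
for output in ["x" * 80_000, "y" * 100_000]:   # two large file reads
    if budget.add(output):
        print("Context above 20% of window: consider starting a fresh session")
```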
Finding 8: The productivity gains are real but smaller and more conditional than advertised
Four studies, read together (three randomized controlled trials and one longitudinal analysis):
| Study | Method | Finding |
|---|---|---|
| GitHub RCT 2022 | 95 developers, HTTP server task | 55% faster with Copilot |
| Google Enterprise RCT 2024 | 96 Google engineers, real production tasks | 21% faster – not 55% |
| METR Open-Source RCT 2025 | 16 experienced OSS developers, 246 tasks, Cursor Pro + Claude 3.5/3.7 Sonnet | 19% slower with AI |
| ICSE 2026 Longitudinal | 800 developers, 2 years, 151M IDE events | AI users delete more code over time |
The METR finding is the critical counterweight.
On genuinely novel, complex tasks (real open-source contributions to projects the developers maintain), AI tools produced a 19% slowdown.
The cause: verification overhead – developers spent significant time evaluating, correcting, and re-prompting AI suggestions that didn’t match their project’s specific patterns.
The key variable is task novelty.
On familiar patterns (CRUD, API integrations, test boilerplate), AI tools accelerate.
On genuinely novel work, they can slow you down. This is directly relevant to large codebases with custom patterns – the more bespoke your architecture, the more verification overhead you’ll encounter.
Additionally, Sahoo et al. (ASE 2024), in “Insights from the Usage of the Ansible Lightspeed Code Completion Service”, found that 18.16% of initially accepted AI suggestions were later deleted (7,436 of 40,945 accepted) and 6.62% were substantially modified with 50%+ edits – meaning roughly 25% of “accepted” suggestions carried hidden downstream cost. (Domain: Ansible/IT automation; direction likely generalizes, exact percentages may vary by domain.)
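The “roughly 25%” figure is the sum of the two reported rates. A quick check using the counts quoted above (variable names are ours):

```python
accepted = 40_945
later_deleted = 7_436
heavily_modified_rate = 0.0662  # reported share with 50%+ edits

deleted_rate = later_deleted / accepted
hidden_cost_rate = deleted_rate + heavily_modified_rate

print(f"deleted: {deleted_rate:.2%}")
print(f"deleted or heavily modified: {hidden_cost_rate:.1%}")
```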
The recommended stack (evidence-based)
Based on the full body of research, here’s the stack we recommend for teams working on large codebases (10,000+ files).
For daily coding: an IDE with semantic indexing
Use Cursor (cursor.com/pricing). Set Sonnet 4.6 or GPT-5.4 as the default model.
Why?
- Semantic indexing produces +6–27% correctness gains on codebases above 1,000 files (CodeRAG-Bench NAACL 2025: +6.9% on ODEX, +27.4% on SWE-bench, both with high-quality retrieved context)
- AST-aware chunking via tree-sitter produces semantically coherent code units – never splits mid-function (Copilot’s completion-time retrieval uses 60-line sliding windows that can; its agent-mode chunking is not publicly documented)
- Agent-trained embeddings retrieve what the agent needs to complete tasks, not just similar-looking code – SRACG (AAAI 2026) shows this distinction produces +7pp gain where generic retrieval loses 3.6pp
- Trigram inverted index returns regex results in milliseconds vs 15+ seconds for ripgrep on large repos
- Dynamic Context Discovery manages context rot through compression and lazy-loading (validated by SWE-ContextBench)
- Composer 2 at $0.50/$2.50 per MTok handles routine edits at 10× lower cost than Opus 4.7 ($5/$25 per MTok) and 6× lower than Sonnet 4.6 ($3/$15 per MTok)
- Sonnet 4.6 at $3/$15 per MTok covers 90% of implementation tasks at the 79.6% SWE-bench quality level – well within the Pro+ plan’s $70 API budget
For architecture, research, and deep reasoning: flat-rate Opus
Use Claude Max 5x (claude.com/pricing). Use it for all Opus-level work, and only Opus-level work.
Why?
- Opus 4.7 at flat rate eliminates the model drift cost explosion – Opus consistently drives over 70% of costs in teams without model governance
- 1M token context window at standard rates (no long-context surcharge) handles multi-document research synthesis, full-repo reads, and long intent verification
- Oracle-SWE confirms model quality is the dominant factor (+50pp gap between model generations) – for tasks that genuinely need frontier reasoning, you want Opus without token anxiety
- Flat rate means autonomous pipeline runs, long research sessions, and complex debugging have no per-token cost ceiling
For routine completions (optional, if budget allows): Copilot
Use Copilot Pro ($10/month, github.com/features/copilot) only if your repos are on GitHub and you value its PR integration.
Why it’s optional: Cursor already includes unlimited tab completions. Copilot adds unlimited completions for $10/month – but you’re paying for a feature Cursor already provides.
The value is in Copilot’s GitHub-native code review, PR workflow, and cross-repository search if your infrastructure is on GitHub.
The discipline that makes the stack work
Never use Opus inside Cursor: this is the non-negotiable rule from the enterprise data. Opus at per-token API rates creates uncapped cost. Opus on Claude Max is flat-rate. All Opus-level work goes to Claude Max. All implementation work stays on Cursor with Sonnet/GPT-5.4/Composer 2.
Route non-coding work to Claude, not Cursor: the less visible failure mode is tool drift, not model drift.
Power users running research, architecture decisions, planning, and synthesis through Cursor pay on-demand per-token rates for work that has nothing to do with semantic indexing or codebase context – the two reasons Cursor exists.
The same work on Claude Max costs $0 marginal, runs on a better model, and doesn’t count against your Cursor API budget.
A pattern we see repeatedly: a single power user routinely accounting for the majority of total API costs while running all task types through Cursor at on-demand rates.
Splitting by tool type (Cursor for coding, Claude for reasoning) maintains identical output quality at roughly half the cost. The rule: use Cursor when your codebase is part of the answer; use Claude when it isn’t.
Lock the default model: the enterprise data showed model drift (Sonnet → Opus) happened organically because developers self-select for the “best” model without seeing the cost. Model governance is a policy decision, not a personal preference.
Use budget alerts: the enterprise cost spike was discovered on the invoice.
Set alerts at meaningful thresholds ($50, $100, $200/month for individuals; scaled for teams) and review weekly.
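A minimal sketch of threshold alerting, using the individual thresholds suggested above. The spend figures and user names are hypothetical, and how you fetch current spend depends on your billing export; no vendor API is assumed here.

```python
from typing import Iterable, List


def new_alerts(previous_spend: float, current_spend: float,
               thresholds: Iterable[float] = (50, 100, 200)) -> List[float]:
    """Return thresholds crossed since the last check, so each fires once."""
    return [t for t in sorted(thresholds)
            if previous_spend < t <= current_spend]


# Hypothetical month-to-date spend per developer: (last check, now).
spend = {"dev_a": (40.0, 120.0), "dev_b": (120.0, 150.0)}

for user, (before, now) in spend.items():
    for threshold in new_alerts(before, now):
        print(f"{user}: crossed ${threshold}/month (now ${now:.0f})")
```

Comparing against the previous reading, rather than only the current one, keeps a developer who sits at $120 from being re-alerted about the $50 and $100 thresholds on every check.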
What this stack costs
| Tool | Plan | Monthly cost |
|---|---|---|
| Cursor | Pro+ | $60 |
| Claude | Max 5x | $100 |
| Total | | $160 |
For comparison: teams running unmanaged Cursor with Opus drift routinely land at $400–700/person/month in AI costs.
A managed Cursor + Claude stack at $160/person/month is a 60–77% cost reduction (1 − 160/400 = 60%; 1 − 160/700 ≈ 77%) using the same models at the same quality level – the difference is governance, not technology.
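The reduction range can be recomputed from the table and the unmanaged spend figures above (a quick check; variable names are ours):

```python
stack = {"Cursor Pro+": 60, "Claude Max 5x": 100}  # USD/person/month
unmanaged_low, unmanaged_high = 400, 700            # USD/person/month

managed = sum(stack.values())
reduction_low = 1 - managed / unmanaged_low
reduction_high = 1 - managed / unmanaged_high

print(f"managed stack: ${managed}/person/month")
print(f"reduction vs ${unmanaged_low}: {reduction_low:.0%}")
print(f"reduction vs ${unmanaged_high}: {reduction_high:.0%}")
```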
What we’d watch for next
Copilot’s June 1 billing transition. Token-based AI Credits will change the cost profile for teams that rely on Copilot Chat and cloud agents. Completions stay free. But agentic usage (the fast-growing use case) becomes variable-cost for the first time. Teams currently on flat-rate Business plans will see their first usage-based bills in July.
Cursor’s Composer 2 quality ceiling. At $0.50/$2.50 per MTok it’s the cheapest capable model in the stack. But on complex architectural decisions, it hits a ceiling that Sonnet doesn’t.
Watch for Composer 3 (or 2.5!) – if Cursor can close the gap to Sonnet-level quality at Composer 2 pricing, the cost equation shifts dramatically.
Claude Code’s context rot improvements. Anthropic is aware of the context management problem (they document it themselves at platform.claude.com/docs/en/build-with-claude/context-windows).
If they ship Dynamic Context Discovery-style summarization for Claude Code, the “Claude for research only” limitation weakens – and a single-tool Claude stack becomes viable even for large codebases.
The METR slowdown replication. The 19% slowdown finding (arXiv 2507.09089) was from 16 developers on 246 tasks. It’s rigorous but small.
If a larger study replicates the slowdown on novel tasks, the productivity narrative around AI coding tools needs fundamental revision. If it doesn’t replicate, the finding was most likely a sample-size artifact and the 21–55% speedup range holds.
Sources
Academic Papers
- Oracle-SWE — “Quantifying the Contribution of Oracle Information Signals on SWE Agents” — Microsoft Research + Georgia Tech — arXiv 2604.07789 (April 2026) — GPT-4o: 23% base → 97% with all oracle signals; signal contribution ranking verified
- CodeRAG-Bench — “Can Retrieval Augment Code Generation?” — NAACL 2025 Findings — ACL Anthology — 9,000 tasks, 10 retrievers, 10 LLMs; +27.4% with gold documents
- SRACG — Selective Retrieval-Augmented Code Generation — AAAI 2026 — ojs.aaai.org — naive RAG −2.6 to −3.6pp; selective RAG +2.4 to +7.1pp
- SWE-bench Contamination — “Does SWE-Bench-Verified Test Agent Ability or Model Memory?” — Prathifkumar, Mathews, Nagappan (University of Waterloo) — arXiv 2512.10218 (December 2025) — ~3× better on SWE-Bench-Verified overall, ~6× better at finding edited files vs fresh comparable benchmarks
- “What Truly Matters?” — Empirical study on retrieved information — arXiv 2503.20589 — similar-code retrieval −15%; API-relevant retrieval +20%
- SWE-ContextBench — “Context Learning in Coding” — arXiv 2602.08316 — summarized context outperforms verbose trajectories
- Context Rot — Chroma Research — “How Increasing Input Tokens Impacts LLM Performance” — research.trychroma.com/context-rot (July 2025) — 18 frontier models, all degrade; GitHub
- Lost in the Middle — Stanford/Samaya AI — TACL 2024 — arXiv 2307.03172 — U-shaped attention curve; GPT-3.5 drops below closed-book baseline
- METR RCT — “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity” — arXiv 2507.09089 — 16 developers, 246 tasks, 19% slowdown
- ICSE 2026 — “Evolving with AI: A Longitudinal Analysis of Developer Logs” — Sergeyuk, Huang, Karaeva, Serova, Golubev, Ahmed (JetBrains Research + academic collaborators; accepted to ICSE’26 Research track) — arXiv 2601.10258 — 800 developers, 2-year telemetry + 62-professional survey; AI users delete significantly more code
- Google Enterprise RCT — “How much does AI impact development speed?” — arXiv 2410.12944 — 96 Google engineers, 21% improvement
Official Documentation (Pricing verified May 12, 2026)
- Anthropic Pricing — docs.anthropic.com/en/docs/about-claude/pricing
- Anthropic Models — docs.anthropic.com/en/docs/about-claude/models
- Claude Plans — claude.com/pricing
- Cursor Models & Pricing — cursor.com/docs/models-and-pricing
- Cursor Token Rate — cursor.com/help/models-and-usage/token-rate
- GitHub Copilot Billing Transition — github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/
- GitHub Copilot Models and Pricing — docs.github.com/copilot/reference/copilot-billing/models-and-pricing
- GitHub Copilot Usage-Based Billing — docs.github.com/copilot/concepts/billing/usage-based-billing-for-individuals
Engineering Sources
- Cursor Semantic Search — cursor.com/blog/semsearch
- Cursor Fast Regex Search — cursor.com/blog/fast-regex-search
- Cursor Secure Codebase Indexing — cursor.com/blog/secure-codebase-indexing
- SWE-bench Verified Leaderboard — leaderboard.steel.dev
Reproducible Benchmarks
- Semble — github.com/MinishLab/semble — ~98% token reduction vs grep+read
- Lumen — github.com/ory/lumen — up to −39% cost, up to −66% output tokens, up to −53% time on real bug fixes (averages across 9 benchmark runs: −26% cost, −37% output tokens, −28% time)
- Sverklo — github.com/sverklo/sverklo — 60-task code retrieval benchmark
Co-authored with Claude Opus 4.7. All claims are independently verified against primary sources.
