AI-generated code may improve short-term velocity while quietly increasing long-term maintenance complexity, readability problems, and hidden operational fragility.
The conversation we keep having — and the one we’re avoiding
Almost every discussion about AI coding assistants is a discussion about speed. How many lines per hour. How many tickets per sprint. How many engineers you no longer need to hire. The pitch is velocity, and on velocity the tools largely deliver: most developers using them report real, felt productivity gains.
But velocity is the easy question. The harder question — the one that determines whether AI-assisted development is a durable advantage or a deferred bill — is sustainability. What happens to a codebase, eighteen months in, when a meaningful share of it was generated by a model optimized to produce plausible code right now rather than maintainable code over time?
The early large-scale evidence is not reassuring. When researchers stop measuring how fast code is written and start measuring what happens to it afterward, a consistent pattern emerges: the same tools accelerating output are also accelerating duplication, churn, and downstream instability. The productivity is real. So, increasingly, is the debt.
What the data actually shows
This isn’t a vibes argument. Several independent, large-scale analyses now point the same direction.
Code is being copied, not consolidated
GitClear’s 2025 AI Copilot Code Quality research analyzed roughly 211 million changed lines of code across five years (2020–2024), drawn from repositories owned by Google, Microsoft, Meta, and enterprise C-corps. The findings describe a structural shift in how code gets written:
- Copy/pasted code rose from 8.3% to 12.3% of all changes between 2021 and 2024 — and for the first time in the dataset’s history, “copy/paste” surpassed “moved” code.
- The prevalence of duplicated code blocks grew roughly fourfold.
- Refactoring collapsed: changed lines associated with refactoring fell from 25% in 2021 to under 10% in 2024.
Read together, these tell a clear story. AI assistants are extraordinarily good at generating new code on demand and conspicuously bad at suggesting you consolidate what already exists. They often lack visibility into your codebase’s existing abstractions, so they reproduce logic instead of reusing it. The result is a slow erosion of the DRY principle: more cloned blocks, fewer shared abstractions, and a steadily rising surface area for bugs to hide in.
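To make "cloned blocks" concrete, here is a minimal sketch of how duplication can be surfaced mechanically. It is not GitClear's method; it assumes a simple approach of hashing every run of `window` consecutive non-blank lines after stripping indentation. The function name `find_duplicate_blocks`, the window size, and the two sample files are all hypothetical.

```python
from collections import defaultdict


def find_duplicate_blocks(files, window=4):
    """Return groups of (file, line) locations whose `window`-line blocks
    are identical after whitespace normalization.

    `files` maps a file name to its source text. The name and the window
    size are illustrative choices, not taken from any real tool.
    """
    seen = defaultdict(list)
    for name, text in files.items():
        # Drop blank lines and strip indentation so formatting differences
        # do not hide a clone. Keep the original 1-based line number.
        lines = [
            (number, line.strip())
            for number, line in enumerate(text.splitlines(), start=1)
            if line.strip()
        ]
        for start in range(len(lines) - window + 1):
            block = tuple(content for _, content in lines[start:start + window])
            seen[block].append((name, lines[start][0]))
    # Keep only blocks that occur in more than one place.
    return [locations for locations in seen.values() if len(locations) > 1]


# Hypothetical sample: the same summing loop written twice under two names.
files = {
    "orders.py": (
        "def total(items):\n"
        "    s = 0\n"
        "    for i in items:\n"
        "        s += i.price\n"
        "    return s\n"
    ),
    "invoices.py": (
        "def invoice_sum(items):\n"
        "    s = 0\n"
        "    for i in items:\n"
        "        s += i.price\n"
        "    return s\n"
    ),
}
clones = find_duplicate_blocks(files, window=4)
```

Exact-text matching like this only catches verbatim clones. A copy with renamed variables slips through, which is one reason dedicated clone-detection tooling works on tokens or syntax trees instead.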
The code is also less stable
Code churn, the share of lines revised or reverted within two weeks of being written, is a useful proxy for “code that wasn’t right the first time.” GitClear’s earlier “Coding on Copilot” analysis projected that churn in 2024 would be roughly double its pre-AI 2021 baseline. High churn isn’t inherently bad, but a sustained rise signals more rework, more half-finished changes, and more instability baked into the default workflow.
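The churn definition above translates directly into a small calculation. The sketch below assumes a hypothetical record shape, one `(authored_on, revised_on)` pair per line, and treats a revision on day 14 as inside the window. Neither choice comes from GitClear's data model.

```python
from datetime import date, timedelta


def churn_rate(lines, window_days=14):
    """Share of authored lines revised or reverted within `window_days`.

    `lines` is a list of (authored_on, revised_on) pairs; revised_on is
    None for lines never touched again. This record shape is assumed for
    illustration only.
    """
    if not lines:
        return 0.0
    window = timedelta(days=window_days)
    churned = sum(
        1
        for authored, revised in lines
        if revised is not None and revised - authored <= window
    )
    return churned / len(lines)


history = [
    (date(2024, 3, 1), date(2024, 3, 5)),   # revised after 4 days: churn
    (date(2024, 3, 1), date(2024, 4, 20)),  # revised after 50 days: not churn
    (date(2024, 3, 2), None),               # never revised
    (date(2024, 3, 3), date(2024, 3, 17)),  # revised on day 14: churn
]
rate = churn_rate(history)  # 2 of 4 lines churned
```

Tracked per week or per sprint, a number like this makes a sustained rise visible long before it shows up as missed deadlines.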
Stability degrades even when quality looks fine
The most counterintuitive finding comes from Google’s DORA research. In the 2024 State of DevOps report, a 25% increase in AI adoption was associated with an estimated 7.2% reduction in delivery stability (and an estimated 1.5% dip in throughput), even as most respondents reported productivity gains. The 2025 report reinforced the stability finding, adding a new Rework Rate metric and confirming that higher AI use is associated with greater delivery instability: more unplanned deployments and failed changes.
Crucially, DORA’s interpretation is not “AI writes bad code.” It’s that AI tends to increase batch size — bigger, faster changesets — and bigger changes have always been riskier to ship. DORA’s framing is the one worth internalizing: AI is a multiplier of existing conditions. It strengthens teams with strong review, testing, and architectural discipline, and it amplifies the weaknesses of teams without them.
Why this debt is hidden
Classic technical debt is loud. It announces itself: the function nobody will touch, the TODO from 2019, the test suite everyone skips. You can see it.
AI-generated debt is quiet, and that’s what makes it dangerous. The code:
- Works. It compiles and runs.
- Passes tests. Often tests the AI itself wrote.
- Reads fine at the line level — clean syntax, reasonable names.
What it lacks is architectural judgment: awareness of the abstraction that already exists three files over, the invariant the rest of the system depends on, the reason the “obvious” approach was rejected two years ago. Each individual suggestion is locally reasonable. The damage is systemic — it shows up only when you zoom out to the level of the whole codebase, where you find five near-identical implementations of the same logic, none of which knows about the others.
This is why standard review misses it. A reviewer scanning a diff sees a tidy, working block and approves it. The duplication, the missing reuse, the creeping complexity — those are invisible at diff altitude. The debt accumulates below the line of sight, and surfaces later as the thing nobody wants to own.
The operational fragility nobody put on the roadmap
Maintenance complexity eventually becomes an operations problem. When the same logic is cloned across a system with no shared source of truth, the predictable failure modes follow:
- A fix applied in one place silently leaves the other copies broken. The bug you “fixed” is still live in four other call sites.
- Mean time to resolution climbs, because nobody can hold the now-sprawling system in their head, and the AI that generated it can’t explain why it made the choices it did.
- Change failure rate rises as larger, faster-moving changesets ship with thinner human understanding behind them.
- Onboarding slows — new engineers face a codebase with the volume of a mature system and the coherence of a first draft.
None of these appear on a velocity dashboard. They appear in your incident channel, your on-call burnout, and your “why does every change take longer than it used to” retrospectives.
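The first failure mode in the list above, a fix that lands in one copy and misses the others, is easy to reproduce. The example below is a made-up illustration: the function names, the bulk-discount rule, and the threshold are all hypothetical. Prices are integer cents to keep the arithmetic exact.

```python
# Two call sites that each carry their own copy of a bulk-discount rule.
def checkout_total(price_cents, quantity):
    subtotal = price_cents * quantity
    # Bug fixed here: the 10% discount now applies at 10 or more items.
    return subtotal * 90 // 100 if quantity >= 10 else subtotal


def quote_total(price_cents, quantity):
    subtotal = price_cents * quantity
    # Forgotten clone: still uses the old off-by-one threshold.
    return subtotal * 90 // 100 if quantity > 10 else subtotal


# Consolidated version: one rule, one place to fix it.
def bulk_discounted(price_cents, quantity, threshold=10, percent_kept=90):
    subtotal = price_cents * quantity
    if quantity >= threshold:
        return subtotal * percent_kept // 100
    return subtotal


checkout = checkout_total(500, 10)  # discounted
quote = quote_total(500, 10)        # full price: the stale copy disagrees
shared = bulk_discounted(500, 10)   # discounted, from the single shared rule
```

Both original functions pass their own tests in isolation. The disagreement only appears when a customer receives a quote that does not match the checkout price, which is exactly the kind of defect that surfaces in an incident channel rather than in review.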
This is not an argument against AI coding assistants
It would be a serious misread to take this as “stop using AI.” The productivity gains are real and, used well, the tools are genuinely transformative. The argument is narrower and more actionable:
Treat AI-assisted development as an accelerant, not an autopilot — and invest in the discipline that keeps acceleration from becoming fragility.
Concretely, for teams shipping AI-generated code at scale:
- Measure the right things. Velocity metrics alone will hide this problem. Track duplication, code churn, change failure rate, and rework rate alongside throughput. If your dashboards only show speed, you’ve instrumented yourself to miss the debt.
- Make refactoring a first-class, scheduled activity. AI assistants rarely suggest consolidation unprompted, so the human loop has to own it. The collapse from 25% to under 10% of changed lines going to refactoring is a gap you have to close deliberately.
- Review for systems, not just diffs. Ask the question the diff can’t answer: “Does this already exist somewhere? Does it fit our abstractions?” Pair AI generation with tooling that surfaces duplication and architectural drift.
- Keep batch sizes small. Since AI’s stability cost runs largely through batch size, the oldest DORA lesson still applies — ship smaller, more frequent, more reviewable changes.
- Keep humans accountable for architecture. Generation can be delegated. Judgment about how the system fits together cannot.
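As a rough sketch of what "measure the right things" and "keep batch sizes small" could look like in practice, the snippet below reports stability signals next to raw throughput. The `Deployment` fields, the `stability_summary` name, and the 400-line batch limit are hypothetical placeholders. Rework rate is approximated here as the share of unplanned deployments, which simplifies DORA's definition.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Deployment:
    lines_changed: int  # size of the changeset that shipped
    failed: bool        # did it need a rollback or hotfix?
    unplanned: bool     # was it shipped only to repair an earlier change?


def stability_summary(deployments, max_batch=400):
    """Summarize stability signals alongside throughput.

    Field names and the default batch limit are assumptions for this
    sketch; choose thresholds from your own delivery history.
    """
    count = len(deployments)
    if count == 0:
        return {"deployments": 0}
    return {
        "deployments": count,  # throughput: what velocity dashboards show
        "change_failure_rate": sum(d.failed for d in deployments) / count,
        "rework_rate": sum(d.unplanned for d in deployments) / count,
        "median_batch_size": median(d.lines_changed for d in deployments),
        "oversized_batches": sum(d.lines_changed > max_batch for d in deployments),
    }


summary = stability_summary([
    Deployment(lines_changed=120, failed=False, unplanned=False),
    Deployment(lines_changed=900, failed=True, unplanned=False),
    Deployment(lines_changed=60, failed=False, unplanned=True),
    Deployment(lines_changed=300, failed=False, unplanned=False),
])
```

The point of putting these numbers in one structure is that throughput can no longer be read alone: four deployments look healthy until the same report shows one failure, one repair deployment, and one changeset well over the batch limit.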
The bottom line
The first technical debt crisis was built one rushed deadline at a time, by humans who knew they were cutting corners. The next one may be built far faster, far more quietly — by tools that produce code that looks finished, passes the tests, and reads cleanly, while steadily eroding the structure underneath.
AI coding assistants didn’t create technical debt; software teams have always done that on their own. What’s new is the scale and the speed, and the fact that the debt now arrives disguised as progress. The velocity is real. Whether it’s sustainable depends entirely on whether we choose to measure — and pay down — what it costs.
Sources
- GitClear — AI Copilot Code Quality: 2025 Research
- GitClear — “Coding on Copilot” code churn analysis (153M+ lines analyzed)
- Google Cloud — Announcing the 2024 DORA Report
- Google Cloud — Announcing the 2025 DORA Report