Velocity was never a perfect metric. But it was a useful one — until AI coding tools made the assumption it rests on completely false.
For two decades, velocity gave Agile teams something valuable: a shared language for capacity.
Not a perfect language. Practitioners always knew that story points were relative, that velocity varied by team, that comparing velocity across teams was meaningless. The Agile community built entire curricula around the correct and incorrect uses of the metric.
But underneath all of that nuance, velocity rested on one assumption that nobody really questioned — because it was so obviously true that it didn’t need stating.
The assumption: story points track human effort.
A five-point story required roughly five points' worth of human thinking, deciding, writing, and building. A team's sustainable velocity reflected the amount of focused human attention it could sustain over a two-week sprint. When velocity went up, it meant the team was getting better, getting faster, or both.
That assumption is now false.
Not weakened. Not in need of recalibration. False.
And the organizations that haven’t noticed are making capacity decisions, roadmap commitments, performance evaluations, and hiring plans based on a number that stopped meaning what they think it means the moment they added AI coding tools to the team.
What Velocity Was Actually Measuring
To understand why velocity broke, it helps to understand what it was measuring in the first place — and why that worked.
Story points were never meant to measure time. They were meant to measure relative complexity, using the team’s own past performance as the reference. The insight behind Agile estimation is that humans are terrible at predicting absolute duration but reasonably consistent at judging relative size.
“Is this bigger or smaller than the login feature we built last sprint?” produces more reliable answers than “how many hours will this take?”
Velocity emerged from this as a planning tool. If a team consistently closes 40 points per sprint, you can reasonably plan 40 points for the next one. The number is team-specific, sprint-specific, and only useful in context — but within those constraints, it works.
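The planning arithmetic behind this is simple enough to sketch. As a toy illustration (the function name and the numbers are hypothetical, not from any particular tool), forecasting next sprint's capacity as a rolling average of recent sprints looks like this:

```python
# Toy sketch: velocity-based sprint planning with hypothetical numbers.
# Next sprint's capacity is forecast as the rolling average of the
# last few sprints' completed story points.

def forecast_capacity(completed_points: list[int], window: int = 3) -> float:
    """Average the last `window` sprints' completed story points."""
    recent = completed_points[-window:]
    return sum(recent) / len(recent)

history = [38, 42, 40, 41]           # points closed in past sprints
plan = forecast_capacity(history)    # average of the last three: 42, 40, 41
print(round(plan, 1))                # -> 41.0
```

The window size is a judgment call: a short window reacts faster to team changes, a long one smooths out one-off interruptions. Either way, the forecast is only as good as the assumption that points track a stable quantity of human effort.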
The reason it works is that human output is relatively stable. A developer who writes 200 lines of thoughtful, production-ready code in a day is working at roughly the same cognitive intensity in this sprint as in the next. The variation in velocity across sprints reflects team dynamics, dependencies, and interruptions — not random noise in human cognitive output.
AI coding tools introduced exactly that random noise. Or rather, they introduced something much more destabilizing: a dramatic, asymmetric increase in generation speed with no corresponding change in the human judgment required to evaluate that generation.
The Broken Assumption in Practice
Here’s a concrete example of how it plays out.
A product team is planning a sprint. Based on historical velocity, they plan for 45 story points. They have a senior developer, two mid-level developers, and a product owner running point on requirements.
Six months ago, those 45 points would have taken roughly the shape of ten to twelve tickets, each requiring meaningful human thought at every stage: design, implementation, testing, review.
Today, with AI coding tools in the workflow, the generation phase of those tickets is dramatically compressed. A ticket that used to take a day now takes two hours. The scaffold is generated, the edge cases are suggested, the test cases are drafted. The sprint closes 45 points — on schedule, on budget, green across the board.
What the velocity chart doesn’t show:
The senior developer spent the sprint evaluating, correcting, and approving AI-generated code at roughly three times their previous review load. The product owner’s acceptance criteria were translated by AI into implementation so quickly that ambiguities that used to surface during development never surfaced at all — they’re sitting in production now, waiting to be discovered. Two tickets were closed with code that passes all tests but contains architectural assumptions that don’t hold for the edge cases that will show up in three months.
Velocity: 45 points. Same as last sprint. Dashboard: green.
Three Planning Decisions Velocity Can No Longer Support
Capacity Planning
When a Scrum team plans a sprint, the foundational question is: how much can we commit to? Velocity answers this by saying: roughly what you’ve done before, adjusted for team changes and known interruptions.
In an AI-assisted team, this calculation is broken because the relationship between story points and human cognitive load has become non-linear and unpredictable.
A sprint where the team generates twice as much code as usual is not a sprint where the team exerted twice as much judgment. It may be a sprint where one senior developer made four times as many review decisions, accumulated significant decision fatigue by Wednesday, and approved several things on Thursday afternoon that they would have caught on Monday morning.
The capacity constraint in an AI-assisted team is not generation hours. It’s judgment hours. And velocity measures neither.
Performance Evaluation
This one has serious organizational consequences.
When engineering managers, CTOs, and product leaders evaluate team performance, velocity is almost always in the picture. A team that consistently closes 50 points is performing better than one that closes 35, all else being equal — or so the reasoning goes.
In an AI-assisted context, a team closing 50 points might be performing better. Or it might be a team where one exhausted senior developer is approving everything quickly to keep the sprint board green, accumulating technical debt and personal depletion simultaneously.
Amazon discovered this dynamic in March 2026, when a six-hour outage on their main ecommerce site was traced to AI-assisted code deployments that had been approved without adequate review. Their velocity metrics had shown strong performance. Their deployment frequency was up. By every traditional Agile measure, the team was firing on all cylinders.
Until it wasn’t.
Their fix — mandating senior sign-off on all AI-assisted deployments — is essentially an acknowledgment that velocity had been measuring the wrong thing. The metric showed generation was happening fast. It didn’t show that judgment was degrading.
Roadmap Commitments
Perhaps the most consequential place velocity shows up is in roadmap commitments made to stakeholders.
“Based on our current velocity, we can ship the new onboarding flow by Q2.” This kind of statement is made in every product organization, in every industry, every week. It depends entirely on the belief that past velocity predicts future capacity.
In an AI-assisted team, this prediction is less reliable than it looks — not because the team will slow down, but because the hidden costs of AI-assisted velocity are not evenly distributed over time.
Code that was generated quickly in Q1 generates maintenance burden in Q2. Technical debt that accumulated invisibly during high-velocity sprints surfaces as unplanned work during the sprints you committed to deliver the onboarding flow. The velocity number was real. The capacity it implied was not.
The Agile Principle Velocity Was Supposed to Protect
There’s an irony here worth naming.
One of the core commitments of Agile development is sustainable pace. The Agile Manifesto’s twelfth principle states that “Agile processes promote sustainable development. The sponsors, developers, and users should be able to maintain a constant pace indefinitely.”
Velocity was partly a tool for enforcing this principle. By tracking output over time and using that track record as the basis for planning, teams could resist the pressure to over-commit. “We can’t commit to 60 points — our velocity is 40” is a statement of sustainable capacity, not laziness.
AI coding tools have created a situation where velocity appears to increase sustainably — the chart goes up and stays up — while the actual human cognitive load is increasing toward unsustainability. The metric that was supposed to protect sustainable pace is now actively obscuring the violation of it.
I’ve written separately about AI Fatigue — the specific form of cognitive depletion that comes from operating as a human judgment layer over machine-speed generation. The symptoms are subtle and slow-building: slightly less pushback in code reviews, slightly faster approvals, slightly less documentation of reasoning. None of these show up in a velocity chart. All of them compound over time.
A team running at unsustainable AI-assisted pace looks, on the velocity chart, like a high-performing team. Right up until the point where it doesn’t.
What Agile Teams Actually Need to Measure
Let me be direct about something: I don’t have a clean replacement for velocity. Neither does the Agile community. The tooling, the frameworks, and the mental models for measuring team health in an AI-assisted context are still being built.
What I can offer is a set of questions that are more honest about what matters — and some leading indicators that capture what velocity misses.
Review quality over review completion. The difference between a sprint where a senior developer reviewed twenty pull requests and pushed back on eight versus a sprint where they reviewed twenty and pushed back on zero is enormous. The velocity chart shows the same number. Track the ratio of substantive feedback to silent approvals. Watch what happens to that ratio in the second half of a sprint, and on Fridays.
Defect origin tracking. When bugs surface in production, trace them back. Was it AI-generated code that was approved quickly? Was it a module where the review was thin? This isn’t about blame — it’s about understanding the actual failure modes of your specific team’s AI-assisted workflow.
Unplanned work ratio. The clearest signal of accumulated technical debt is the percentage of sprint capacity consumed by unplanned work — bugs, incidents, refactoring that wasn’t on the roadmap. A team whose velocity is climbing but whose unplanned work ratio is also climbing is accumulating debt faster than they’re delivering value.
Comprehension sampling. Once a month, pick five tickets that closed in the last sprint and ask the developer to walk you through the implementation in a thirty-minute technical review. Not to test them — to calibrate your understanding of how well the code is understood by the people who shipped it. If the answers are thin and hesitant, you have what engineers are starting to call “confidence debt”: code that works today but can’t be maintained tomorrow because no human on the team fully understands it.
Sustainable pace check-ins. Ask your team directly, weekly: on a scale of one to ten, how would you rate your decision quality today compared to Monday morning? Most experienced developers track this intuitively. The question is whether your organization creates space for the answer to surface before it becomes an attrition problem.
The Uncomfortable Implication for Agile Practice
Velocity is not the only Agile metric that AI disrupts. It’s just the most prominent.
Sprint planning assumes that past performance predicts future capacity. AI changes the relationship between effort and output in ways that make this assumption unreliable.
Definition of Done assumes that a completed ticket represents a unit of understood, production-ready work. AI-generated code can satisfy every criterion in a Definition of Done while still being code that nobody on the team could reconstruct from memory.
Retrospectives are supposed to surface what slowed the team down. But AI Fatigue doesn’t feel like slowdown — it feels like acceleration. You don’t retrospect on being too fast.
None of this means Agile is broken. The values — collaboration, working software, responding to change, sustainable pace — are as relevant as ever. The practices need to evolve to account for a world where the bottleneck is no longer human execution speed.
The teams that navigate this well won’t be the ones that abandon Agile. They’ll be the ones that update their measurement practices to reflect where the new constraints actually live: not in how fast code is generated, but in how well it is understood, reviewed, and maintained by the humans responsible for it.
A Final Thought
Every major shift in how software is built has eventually required a shift in how software teams measure their health.
Waterfall gave way to Agile when the industry recognized that big-batch planning didn’t match the reality of how software development actually unfolds. Velocity and story points emerged as better tools for a more iterative, human-centered process.
AI-assisted development is another shift of comparable magnitude. The tools are different. The bottlenecks are different. The failure modes are different.
Velocity served us well. For twenty years, it helped teams plan honestly, resist over-commitment, and build at a sustainable pace. It was a good solution to the problem it was designed to solve.
That problem has changed.
The teams that recognize this early — and start building measurement practices that capture judgment quality, cognitive sustainability, and architectural coherence — will have a significant advantage over the teams that keep celebrating green dashboards until something breaks.
The dashboard will be green. The question is what it isn’t showing you.
Diego Fiorentin is the founder of NextTo.ai, an AI consulting firm that helps companies implement AI operationally — business-first, not tool-first. If your organization is rethinking how to measure engineering health in an AI-assisted context, book a conversation →
