We are skeptical whenever an engineering AI program starts with a promised percentage uplift.
That is not because improvement is impossible; improvement is often real. It is because early claims are usually much more precise than the operating model underneath them.
That is a problem.
One of the most useful counterweights we have seen came from METR's July 10, 2025 study of experienced open-source developers. In that study, developers using early-2025 AI tools took about 19% longer on average to complete their tasks, even though they estimated afterward that the tools had made them about 20% faster. That gap matters. It tells us that felt productivity and measured productivity can diverge, especially in mature codebases and high-context work.
If you are an engineering leader, that result should not make you anti-AI. It should make you more serious about measurement.
The mistake we see most often
Teams ask "What is our productivity gain?" before they have answered "What exactly did we roll out?"
That creates three bad habits:
- people measure tool usage instead of workflow adoption
- people count output volume instead of review quality
- people force business ROI language before the engineering operating model is stable
The result is a dashboard that looks busy but does not help anyone decide whether the rollout is working.
Start with adoption evidence, not end-state ROI
Anthropic's February 10, 2025 Economic Index found AI use leaning more toward augmentation than full automation. That matches what we see in engineering organizations: most of the value comes from humans working differently, not from humans disappearing from the workflow.
That means your first measurement model should track whether the organization is using the workflow as intended.
We usually recommend four layers.
| Layer | Question | Good early signal |
|---|---|---|
| Access | Do the right people have the right tools? | Approved tool path in place |
| Adoption | Are teams actually using the named workflow? | Repeated usage in the right repositories or tasks |
| Control | Are review and guardrail rules being followed? | Shared reviewer behavior and low exception noise |
| Outcome | Is the workflow changing delivery behavior? | Better throughput, less rework, or less ambiguity |
The first three layers come before you try to sell a CFO-grade ROI story.
What we would track in the first 90 days
If we were helping a CTO run a flagship AI enablement program, we would start with a small scorecard like this:
| Signal | Why it matters | What to avoid |
|---|---|---|
| Named workflows in scope | Prevents fuzzy rollout language | Scoping the rollout as "AI across engineering" |
| Teams using approved tool path | Shows whether standardization is real | Counting every tool equally |
| Review exceptions or escalations | Surfaces control gaps quickly | Treating silence as success |
| Manager reinforcement cadence | Tests whether ownership exists | Making adoption a one-off training event |
| Follow-up adoption review | Forces evidence, not stories | Waiting six months to inspect anything |
None of those signals are glamorous. All of them are more useful than early vanity math.
Why raw usage numbers are not enough
High usage can still hide poor rollout quality.
For example:
- one team may use the tool heavily in low-risk tasks
- another may use it rarely but correctly in the highest-value workflow
- a third may use it constantly while bypassing the intended review model
A flat usage count treats all three as equal. Operationally, they are not equal at all.
This is why we prefer adoption reviews over raw activity dashboards.
An adoption review asks: did the team use the approved workflow in the approved way?
That is the question that matters.
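To make the contrast concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration: the team names, the `target_workflow` and `reviewed` flags, and the event counts do not come from any real rollout. It compares a flat usage count with the share of usage that happened inside the named workflow with the review model followed, using the three teams described above.

```python
from dataclasses import dataclass


@dataclass
class UsageEvent:
    team: str
    target_workflow: bool  # hypothetical flag: event falls inside the named workflow
    reviewed: bool         # hypothetical flag: the intended review model was followed


def raw_usage(events, team):
    """Flat activity count: every event looks the same."""
    return sum(1 for e in events if e.team == team)


def approved_share(events, team):
    """Fraction of a team's events that used the approved workflow in the approved way."""
    team_events = [e for e in events if e.team == team]
    if not team_events:
        return 0.0
    ok = sum(1 for e in team_events if e.target_workflow and e.reviewed)
    return ok / len(team_events)


# Illustrative data only, mirroring the three teams described above.
events = (
    # heavy use, but only on low-risk tasks outside the target workflow
    [UsageEvent("team_a", target_workflow=False, reviewed=True)] * 40
    # rare use, but correct and in the highest-value workflow
    + [UsageEvent("team_b", target_workflow=True, reviewed=True)] * 5
    # constant use with the review model bypassed
    + [UsageEvent("team_c", target_workflow=True, reviewed=False)] * 40
)

for team in ("team_a", "team_b", "team_c"):
    print(team, raw_usage(events, team), f"{approved_share(events, team):.0%}")
```

In this made-up data, the two teams with the highest raw counts have the lowest approved share, which is the pattern a flat activity dashboard would hide.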
Measure where time is being reinvested
GitHub's 2024 enterprise survey also pointed to something leaders often miss: respondents reported using time saved with AI coding tools for collaboration, system design, and learning.
That matters because some of the earliest value will not show up as more tickets shipped.
It may show up as:
- fewer stalled design decisions
- better reviewer focus
- faster onboarding into an existing codebase
- less friction in repetitive drafting work
Those are real operating effects. They are just harder to see if your only lens is feature output per sprint.
A scoreboard we trust more than "AI productivity"
Here is a simpler model we would rather defend in a leadership meeting:
1. Workflow adoption
- Which named workflows are live?
- Which teams use them?
- How often are they used inside the intended boundary?
2. Review integrity
- Are reviewers aligned on validation expectations?
- Are exception paths clear?
- Are there recurring classes of failure?
3. Manager reinforcement
- Are engineering managers coaching with the workflow as the reference point?
- Do retrospectives or one-on-ones surface usage quality?
- Is adoption drifting by team?
4. Business-facing implications
- Has decision speed improved?
- Has documentation or test support become more reliable?
- Has leadership confidence in the rollout increased?
The first three are operational. The fourth is where ROI begins to become discussable.
Do not confuse speed with value
There is a deeper reason to avoid fake productivity math early: speed alone can be the wrong variable.
If AI causes a team to produce more code but increases review uncertainty, policy drift, or secret exposure, the organization may move faster into a weaker operating state.
GitHub wrote on April 1, 2025 that more than 39 million secrets were leaked across GitHub in 2024. That number is not a statement about your organization specifically, but it is a useful reminder that developer convenience and security discipline do not automatically move together.
So when a team tells us "we are faster," we want to ask:
- faster at what?
- under which review rules?
- with which security controls?
- and with what evidence that the new speed is repeatable rather than noisy?
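One way to probe the last question is a permutation test on task cycle times. The sketch below is an illustration under assumptions, not a method taken from the sources cited here: the cycle-time samples are made up, the function name `permutation_p_value` is hypothetical, and a real analysis would also need comparable tasks and stable review rules across both periods.

```python
import random
from statistics import mean


def permutation_p_value(before, after, rounds=5_000, seed=0):
    """Estimate how often a speedup at least this large appears by chance.

    Shuffles the pooled samples, re-splits them into two groups of the
    original sizes, and counts how often the shuffled gap in means is at
    least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(before) - mean(after)  # positive means "after" is faster
    pooled = list(before) + list(after)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        fake_before = pooled[:len(before)]
        fake_after = pooled[len(before):]
        if mean(fake_before) - mean(fake_after) >= observed:
            hits += 1
    return hits / rounds


# Hypothetical cycle times in hours for comparable tasks.
before = [10.0, 12.5, 9.0, 14.0, 11.0, 13.5]
after_noisy = [9.5, 13.0, 8.5, 14.5, 10.0, 12.0]  # small gap, large spread
after_clear = [6.0, 7.5, 5.5, 8.0, 6.5, 7.0]      # consistent gap

print("noisy:", permutation_p_value(before, after_noisy))
print("clear:", permutation_p_value(before, after_clear))
```

A small value suggests the gap is unlikely to be shuffle noise. It says nothing about review rules or security controls, which remain separate questions.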
Our preferred sequence
We would measure AI rollout in this order:
1. workflow clarity
2. approved usage
3. reviewer consistency
4. manager reinforcement
5. downstream delivery effects
Only after those five are visible would we try to frame a stronger ROI case.
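The ordering can be expressed as a simple gate. This is a minimal Python sketch under assumptions: the stage names follow the sequence above, while the function names and the `status` values are hypothetical. How a team decides that a stage counts as "visible" is left as a plain boolean here.

```python
# Stage names follow the sequence listed above, in order.
STAGES = [
    "workflow clarity",
    "approved usage",
    "reviewer consistency",
    "manager reinforcement",
    "downstream delivery effects",
]


def first_unmet_stage(evidence):
    """Return the earliest stage without evidence, or None if all five are visible."""
    for stage in STAGES:
        if not evidence.get(stage, False):
            return stage
    return None


def ready_for_roi_case(evidence):
    """A stronger ROI case is only framed once every stage is visible."""
    return first_unmet_stage(evidence) is None


# Hypothetical status: a later stage is marked visible while an earlier one is not.
status = {
    "workflow clarity": True,
    "approved usage": True,
    "reviewer consistency": False,
    "manager reinforcement": True,
    "downstream delivery effects": False,
}

print(first_unmet_stage(status))
print(ready_for_roi_case(status))
```

Returning the earliest gap, rather than a count of completed stages, keeps attention on the next thing to fix instead of on a score.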
That sequencing is less exciting than a hero claim. It is also much harder to embarrass yourself with later.
A note on credibility
If you sell the return before you can describe the operating model, you are borrowing credibility from a future state that does not exist yet.
We would rather hand a buyer a smaller, defensible story:
"We approved these workflows. These teams are using them. These review rules are holding. These managers are reinforcing them. Here is the evidence so far."
That is not weaker than an early uplift promise.
It is what makes a later value story believable.
Sources
- METR, Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, July 10, 2025
- Anthropic, The Anthropic Economic Index, February 10, 2025
- GitHub, Survey: The AI wave continues to grow on software development teams, August 20, 2024, updated April 15, 2025
- GitHub, GitHub found 39M secret leaks in 2024. Here's what we're doing to help, April 1, 2025

