We are skeptical whenever an engineering AI program starts with a promised percentage uplift.

Not because improvement is impossible. Improvement is often real. But because early claims are usually much more precise than the operating model underneath them.

That is a problem.

One of the most useful counterweights we have seen came from METR's July 10, 2025 study of experienced open-source developers. In that study, developers using early-2025 AI tools took about 19% longer to complete tasks on average, even though they estimated afterward that the tools had made them about 20% faster. That gap matters. It tells us that felt productivity and measured productivity can diverge, especially in mature codebases and high-context work.

If you are an engineering leader, that result should not make you anti-AI. It should make you more serious about measurement.

The mistake we see most often

Teams ask "What is our productivity gain?" before they have answered "What exactly did we roll out?"

That creates three bad habits:

- people measure tool usage instead of workflow adoption
- people count output volume instead of review quality
- people force business ROI language before the engineering operating model is stable

The result is a dashboard that looks busy but does not help anyone decide whether the rollout is working.

Start with adoption evidence, not end-state ROI

Anthropic's February 10, 2025 Economic Index found AI use leaning more toward augmentation than full automation. That matches what we see in engineering organizations: most of the value comes from humans working differently, not from humans disappearing from the workflow.

That means your first measurement model should track whether the organization is using the workflow as intended.

We usually recommend four layers.

| Layer | Question | Good early signal |
| --- | --- | --- |
| Access | Do the right people have the right tools? | Approved tool path in place |
| Adoption | Are teams actually using the named workflow? | Repeated usage in the right repositories or tasks |
| Control | Are review and guardrail rules being followed? | Shared reviewer behavior and low exception noise |
| Outcome | Is the workflow changing delivery behavior? | Better throughput, less rework, or less ambiguity |

The first three layers come before you try to sell a CFO-grade ROI story.

What we would track in the first 90 days

If we were helping a CTO run a flagship AI enablement program, we would start with a small scorecard like this:

| Signal | Why it matters | What to avoid |
| --- | --- | --- |
| Named workflows in scope | Prevents fuzzy rollout language | `AI across engineering` |
| Teams using approved tool path | Shows whether standardization is real | Counting every tool equally |
| Review exceptions or escalations | Surfaces control gaps quickly | Treating silence as success |
| Manager reinforcement cadence | Tests whether ownership exists | Making adoption a one-off training event |
| Follow-up adoption review | Forces evidence, not stories | Waiting six months to inspect anything |

None of those signals are glamorous. All of them are more useful than early vanity math.
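One way to keep a scorecard like this honest is to record the evidence behind each signal and list the signals that have none. The sketch below is a minimal illustration, not a prescribed tool: the `Signal` class, the `missing_evidence` helper, and the sample workflow and team names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """One scorecard row plus whatever evidence has been collected for it."""
    name: str
    evidence: list[str]  # hypothetical field: notes or links gathered so far

    @property
    def has_evidence(self) -> bool:
        return len(self.evidence) > 0


def missing_evidence(scorecard: list[Signal]) -> list[str]:
    """Return the names of signals that have no evidence recorded yet."""
    return [s.name for s in scorecard if not s.has_evidence]


# Sample data is invented for illustration only.
scorecard = [
    Signal("Named workflows in scope", ["PR review assist", "test scaffolding"]),
    Signal("Teams using approved tool path", ["payments", "platform"]),
    Signal("Review exceptions or escalations", []),
    Signal("Manager reinforcement cadence", ["biweekly check-in"]),
    Signal("Follow-up adoption review", []),
]

print(missing_evidence(scorecard))
# ['Review exceptions or escalations', 'Follow-up adoption review']
```

An empty evidence list is reported as a gap rather than a pass, which is one way to avoid treating silence as success.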

Why raw usage numbers are not enough

High usage can still hide poor rollout quality.

For example:

- one team may use the tool heavily in low-risk tasks
- another may use it rarely but correctly in the highest-value workflow
- a third may use it constantly while bypassing the intended review model

A flat usage count treats all three as equal. Operationally, they are not equal at all.

This is why we prefer adoption reviews over raw activity dashboards.

An adoption review asks: did the team use the approved workflow in the approved way?

That is the question that matters.
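The difference between a flat usage count and an adoption review can be sketched with a few lines of Python. Everything here is an assumption made for illustration: the `UsageEvent` fields, the team names, and the event counts are invented, and a real review would draw on whatever evidence your tooling actually records.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    team: str
    in_approved_workflow: bool  # used inside a named, approved workflow?
    reviewed: bool              # went through the intended review path?


def raw_usage(events):
    """Flat activity count per team: the dashboard view."""
    return Counter(e.team for e in events)


def approved_share(events):
    """Share of each team's usage that was both in-workflow and reviewed."""
    totals = raw_usage(events)
    approved = Counter(
        e.team for e in events if e.in_approved_workflow and e.reviewed
    )
    return {team: approved[team] / totals[team] for team in totals}


# Invented sample data mirroring the three teams described above.
events = (
    [UsageEvent("team_a", True, True)] * 10     # some in-workflow use...
    + [UsageEvent("team_a", False, True)] * 30  # ...mostly ad hoc, low-risk tasks
    + [UsageEvent("team_b", True, True)] * 5    # rare but correct
    + [UsageEvent("team_c", True, False)] * 40  # constant use, review bypassed
)

print(raw_usage(events))       # team_a and team_c look identical at 40 events each
print(approved_share(events))  # team_a 0.25, team_b 1.0, team_c 0.0
```

On raw counts, the first and third teams look the same and the second looks like a laggard. The approved-share view reverses that picture, which is the point of asking the adoption-review question instead.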

Measure where time is being reinvested

GitHub's 2024 enterprise survey also pointed to something leaders often miss: respondents reported using time saved with AI coding tools for collaboration, system design, and learning.

That matters because some of the earliest value will not show up as more tickets shipped.

It may show up as:

- fewer stalled design decisions
- better reviewer focus
- faster onboarding into an existing codebase
- less friction in repetitive drafting work

Those are real operating effects. They are just harder to see if your only lens is feature output per sprint.

A scoreboard we trust more than `AI productivity`

Here is a simpler model we would rather defend in a leadership meeting:

1. Workflow adoption

- Which named workflows are live?
- Which teams use them?
- How often are they used inside the intended boundary?

2. Review integrity

- Are reviewers aligned on validation expectations?
- Are exception paths clear?
- Are there recurring classes of failure?

3. Manager reinforcement

- Are engineering managers coaching their teams on the workflow?
- Do retrospectives or one-on-ones surface usage quality?
- Is adoption drifting by team?

4. Business-facing implications

- Has decision speed improved?
- Has documentation or test support become more reliable?
- Has leadership confidence in the rollout increased?

The first three are operational. The fourth is where ROI begins to become discussable.

Do not confuse speed with value

There is a deeper reason to avoid fake productivity math early: speed alone can be the wrong variable.

If AI causes a team to produce more code but increases review uncertainty, policy drift, or secret exposure, the organization may move faster into a weaker operating state.

GitHub wrote on April 1, 2025 that more than 39 million secrets were leaked across GitHub in 2024. That number is not a statement about your organization specifically, but it is a useful reminder that developer convenience and security discipline do not automatically move together.

So when a team tells us "we are faster," we want to ask:

- faster at what?
- under which review rules?
- with which security controls?
- and with what evidence that the new speed is repeatable rather than noisy?

Our preferred sequence

We would measure AI rollout in this order:

1. workflow clarity
2. approved usage
3. reviewer consistency
4. manager reinforcement
5. downstream delivery effects

Only after those five are visible would we try to frame a stronger ROI case.

That sequencing is less exciting than a hero claim. It is also much harder to embarrass yourself with later.
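The ordering can be treated as a simple gate: find the earliest stage that is not yet visible and work on that before anything later. The sketch below assumes "visible" is tracked as a plain set of stage names. The function names are hypothetical, and deciding what counts as visible remains a judgment call that this code does not make for you.

```python
from typing import Optional

# The five stages, in the order given above.
SEQUENCE = [
    "workflow clarity",
    "approved usage",
    "reviewer consistency",
    "manager reinforcement",
    "downstream delivery effects",
]


def next_stage_to_evidence(visible: set[str]) -> Optional[str]:
    """Return the earliest stage in SEQUENCE that is not yet visible."""
    for stage in SEQUENCE:
        if stage not in visible:
            return stage
    return None


def ready_for_roi_case(visible: set[str]) -> bool:
    """True only when all five stages are visible."""
    return next_stage_to_evidence(visible) is None


# Invented example: manager reinforcement is visible, but an earlier stage is not.
visible = {"workflow clarity", "approved usage", "manager reinforcement"}

print(next_stage_to_evidence(visible))  # reviewer consistency
print(ready_for_roi_case(visible))      # False
```

In this example the gap that matters is reviewer consistency, even though a later stage already has evidence. Progress on later stages does not substitute for a missing earlier one.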

A note on credibility

If you sell the return before you can describe the operating model, you are borrowing credibility from a future state that does not exist yet.

We would rather hand a buyer a smaller, defensible story:

"We approved these workflows. These teams are using them. These review rules are holding. These managers are reinforcing them. Here is the evidence so far."

That is not weaker than an early uplift promise.

It is what makes a later value story believable.

Sources

METR, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," July 10, 2025
Anthropic, "The Anthropic Economic Index," February 10, 2025
GitHub, "Survey: The AI wave continues to grow on software development teams," August 20, 2024, updated April 15, 2025
GitHub, "GitHub found 39M secret leaks in 2024. Here's what we're doing to help," April 1, 2025

How to measure AI adoption without fake ROI
