How to measure AI adoption without fake ROI

A practical scorecard for proving adoption before making larger productivity claims.


We are skeptical whenever an engineering AI program starts with a promised percentage uplift.

Not because improvement is impossible. Improvement is often real. But because early claims are usually much more precise than the operating model underneath them.

That is a problem.

One of the most useful counterweights we have seen came from METR's July 10, 2025 study on experienced open-source developers. In that study, developers using early-2025 AI tools took about 19% longer on average to complete their tasks, even though they estimated afterward that the tools had made them about 20% faster. That gap matters. It tells us that felt productivity and measured productivity can diverge, especially in mature codebases and high-context work.

If you are an engineering leader, that result should not make you anti-AI. It should make you more serious about measurement.

The mistake we see most often

Teams ask "What is our productivity gain?" before they have answered "What exactly did we roll out?"

That creates three bad habits:

  • people measure tool usage instead of workflow adoption
  • people count output volume instead of review quality
  • people force business ROI language before the engineering operating model is stable

The result is a dashboard that looks busy but does not help anyone decide whether the rollout is working.

Start with adoption evidence, not end-state ROI

Anthropic's February 10, 2025 Economic Index found AI use leaning more toward augmentation than full automation. That matches what we see in engineering organizations: most of the value comes from humans working differently, not from humans disappearing from the workflow.

That means your first measurement model should track whether the organization is using the workflow as intended.

We usually recommend four layers.

| Layer | Question | Good early signal |
| --- | --- | --- |
| Access | Do the right people have the right tools? | Approved tool path in place |
| Adoption | Are teams actually using the named workflow? | Repeated usage in the right repositories or tasks |
| Control | Are review and guardrail rules being followed? | Shared reviewer behavior and low exception noise |
| Outcome | Is the workflow changing delivery behavior? | Better throughput, less rework, or less ambiguity |

The first three layers come before you try to sell a CFO-grade ROI story.

What we would track in the first 90 days

If we were helping a CTO run a flagship AI enablement program, we would start with a small scorecard like this:

| Signal | Why it matters | What to avoid |
| --- | --- | --- |
| Named workflows in scope | Prevents fuzzy rollout language | "AI across engineering" |
| Teams using approved tool path | Shows whether standardization is real | Counting every tool equally |
| Review exceptions or escalations | Surfaces control gaps quickly | Treating silence as success |
| Manager reinforcement cadence | Tests whether ownership exists | Making adoption a one-off training event |
| Follow-up adoption review | Forces evidence, not stories | Waiting six months to inspect anything |

None of those signals are glamorous. All of them are more useful than early vanity math.
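To make the "treating silence as success" trap concrete, here is a minimal sketch of how a review-exception signal could be recorded so that "nobody looked" is never confused with "we looked and it was clean". The record shape, field names, and team names are our own illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WeeklyReviewSignal:
    """One team's review-exception report for a week (hypothetical shape)."""
    team: str
    reviews_sampled: int               # AI-assisted reviews someone actually inspected
    exceptions_logged: Optional[int]   # None means nothing was reported at all


def classify(signal: WeeklyReviewSignal) -> str:
    """Separate 'checked and clean' from 'nobody looked'."""
    if signal.exceptions_logged is None or signal.reviews_sampled == 0:
        return "no evidence"           # silence, not success
    if signal.exceptions_logged == 0:
        return "clean"
    return "needs follow-up"


# Illustrative data only; these teams and counts are made up.
signals = [
    WeeklyReviewSignal("payments", reviews_sampled=12, exceptions_logged=0),
    WeeklyReviewSignal("platform", reviews_sampled=0, exceptions_logged=None),
    WeeklyReviewSignal("mobile", reviews_sampled=9, exceptions_logged=2),
]
status = {s.team: classify(s) for s in signals}
print(status)
```

The design choice is simply that a missing report is its own state. A scorecard that stores zero for "no report" cannot tell a quiet, well-run team from a team nobody is inspecting.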

Why raw usage numbers are not enough

High usage can still hide poor rollout quality.

For example:

  • one team may use the tool heavily in low-risk tasks
  • another may use it rarely but correctly in the highest-value workflow
  • a third may use it constantly while bypassing the intended review model

A flat usage count treats all three as equal. Operationally, they are not equal at all.

This is why we prefer adoption reviews over raw activity dashboards.

An adoption review asks: did the team use the approved workflow in the approved way?

That is the question that matters.
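The three-team example can be sketched in a few lines. This is an illustration under stated assumptions: the session log, its two flags (whether the session was inside a named workflow, and whether the review rules were followed), and the counts are all hypothetical, and we assume the first team's low-risk tasks fall outside the named workflow.

```python
from collections import defaultdict

# Hypothetical session log: (team, used_named_workflow, review_rules_followed)
sessions = (
    [("team_a", False, True)] * 40     # heavy use, but only in low-risk side tasks
    + [("team_b", True, True)] * 5     # rare use, correctly, in the target workflow
    + [("team_c", True, False)] * 40   # constant use, review model bypassed
)


def adoption_review(log):
    """Return {team: (raw_sessions, share of sessions used as intended)}."""
    raw = defaultdict(int)
    as_intended = defaultdict(int)
    for team, in_workflow, reviewed in log:
        raw[team] += 1
        if in_workflow and reviewed:
            as_intended[team] += 1
    return {team: (raw[team], as_intended[team] / raw[team]) for team in raw}


report = adoption_review(sessions)
for team, (count, share) in report.items():
    print(f"{team}: {count} sessions, {share:.0%} used as intended")
```

On raw counts, the first and third teams look identical and the second looks like a laggard. On the share used as intended, the second team is the only one with evidence of real adoption.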

Measure where time is being reinvested

GitHub's 2024 enterprise survey pointed to something leaders often miss: respondents reported using the time they saved with AI coding tools for collaboration, system design, and learning.

That matters because some of the earliest value will not show up as more tickets shipped.

It may show up as:

  • fewer stalled design decisions
  • better reviewer focus
  • faster onboarding into an existing codebase
  • less friction in repetitive drafting work

Those are real operating effects. They are just harder to see if your only lens is feature output per sprint.

A scorecard we trust more than "AI productivity"

Here is a simpler model we would rather defend in a leadership meeting:

1. Workflow adoption

  • Which named workflows are live?
  • Which teams use them?
  • How often are they used inside the intended boundary?

2. Review integrity

  • Are reviewers aligned on validation expectations?
  • Are exception paths clear?
  • Are there recurring classes of failure?

3. Manager reinforcement

  • Are engineering managers coaching to the workflow?
  • Do retrospectives or one-on-ones surface usage quality?
  • Is adoption drifting by team?

4. Business-facing implications

  • Has decision speed improved?
  • Has documentation or test support become more reliable?
  • Has leadership confidence in the rollout increased?

The first three are operational. The fourth is where ROI begins to become discussable.

Do not confuse speed with value

There is a deeper reason to avoid fake productivity math early: speed alone can be the wrong variable.

If AI causes a team to produce more code but increases review uncertainty, policy drift, or secret exposure, the organization may move faster into a weaker operating state.

GitHub wrote on April 1, 2025 that more than 39 million secrets were leaked across GitHub in 2024. That number is not a statement about your organization specifically, but it is a useful reminder that developer convenience and security discipline do not automatically move together.

So when a team tells us "we are faster," we want to ask:

  • faster at what?
  • under which review rules?
  • with which security controls?
  • and with what evidence that the new speed is repeatable rather than noisy?

Our preferred sequence

We would measure AI rollout in this order:

  1. workflow clarity
  2. approved usage
  3. reviewer consistency
  4. manager reinforcement
  5. downstream delivery effects

Only after those five are visible would we try to frame a stronger ROI case.
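One way to keep that ordering honest is to treat it as a gate: the ROI conversation stays closed until every earlier step has evidence behind it. The sketch below is a minimal illustration; the function names and the idea of a simple true/false evidence flag per step are our assumptions, and a real review would attach actual artifacts to each step.

```python
from typing import Dict, Optional

# The five steps, in the order they should become visible.
SEQUENCE = [
    "workflow clarity",
    "approved usage",
    "reviewer consistency",
    "manager reinforcement",
    "downstream delivery effects",
]


def next_gap(evidence: Dict[str, bool]) -> Optional[str]:
    """Return the first step without evidence, or None if all five are visible."""
    for step in SEQUENCE:
        if not evidence.get(step, False):
            return step
    return None


def ready_for_roi_case(evidence: Dict[str, bool]) -> bool:
    """Only frame a stronger ROI case once no step is missing evidence."""
    return next_gap(evidence) is None


# Illustrative state: the first two steps have evidence, the rest do not yet.
current = {"workflow clarity": True, "approved usage": True}
print(next_gap(current))            # the step to work on next
print(ready_for_roi_case(current))  # whether the ROI conversation is open yet
```

Because the check walks the list in order, evidence for a later step never compensates for a gap in an earlier one.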

That sequencing is less exciting than a hero claim. It is also much less likely to embarrass you later.

A note on credibility

If you sell the return before you can describe the operating model, you are borrowing credibility from a future state that does not exist yet.

We would rather hand a buyer a smaller, defensible story:

"We approved these workflows. These teams are using them. These review rules are holding. These managers are reinforcing them. Here is the evidence so far."

That is not weaker than an early uplift promise.

It is what makes a later value story believable.

Sources

  • METR, Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, July 10, 2025
  • Anthropic, The Anthropic Economic Index, February 10, 2025
  • GitHub, Survey: The AI wave continues to grow on software development teams, August 20, 2024, updated April 15, 2025
  • GitHub, GitHub found 39M secret leaks in 2024. Here's what we're doing to help, April 1, 2025
