Measurement methodology
We define baseline metrics, instrument impact, and align claims with evidence, so you can prove change rather than assume it.
Baseline metrics
Before rolling out AI-assisted workflows, we establish a baseline so changes can be attributed honestly. We use metrics that are traceable and avoid single-number ROI or perception-only claims.
- Throughput: e.g. cycle time and deployment frequency, with caveats for batch size and scope. We don't attribute change to AI without controls.
- Stability: e.g. change failure rate, mean time to restore (MTTR). DORA 2024 suggests monitoring stability when rolling out AI; we align with that.
- Trust and process: survey or qualitative checks such as "Do we review AI output?" and "Do we have guardrails?", mapped to NIST/OWASP where relevant.
Evidence: DORA 2024 (throughput/stability) and the METR RCT (context matters); see docs/evidence/sources.md in the main repo (IDs 5, 7).
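As a minimal sketch of how baseline throughput metrics can be computed from delivery events before rollout (the PR field names and event shape here are illustrative assumptions, not any specific tool's schema):

```python
from datetime import datetime
from statistics import median

def cycle_time_p50_days(prs):
    """Median days from first commit to merge, per PR."""
    durations = [
        (datetime.fromisoformat(pr["merged_at"]) -
         datetime.fromisoformat(pr["first_commit_at"])).total_seconds() / 86400
        for pr in prs
    ]
    return median(durations)

def deploys_per_week(deploy_timestamps, weeks):
    """Average deployments per week over the observation window."""
    return len(deploy_timestamps) / weeks

# Illustrative data, not real measurements.
prs = [
    {"first_commit_at": "2024-05-01T09:00:00", "merged_at": "2024-05-05T09:00:00"},
    {"first_commit_at": "2024-05-02T09:00:00", "merged_at": "2024-05-04T09:00:00"},
    {"first_commit_at": "2024-05-03T09:00:00", "merged_at": "2024-05-08T09:00:00"},
]
print(cycle_time_p50_days(prs))  # median of 4, 2, 5 days -> 4.0
```

Whatever definitions you pick here, lock them in before rollout; the point of the baseline is reusing the exact same computation afterwards.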
Instrumentation
How we capture and report impact in pilots and engagements:
- Delivery metrics: Cycle time and deployment frequency from your CI/CD or value-stream tooling; same definitions before and after rollout.
- Stability signals: Change failure rate and MTTR from production/incident data.
- Adoption quality: AI usage rate and trust split by role (developer, reviewer, manager).
- Review discipline: Percent of AI-assisted PRs with human review, test evidence, and security scan pass.
- Security quality: Vulnerability density in AI-assisted vs non-AI PRs, before and after review.
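A review-discipline check like the one above can be sketched as a simple per-PR rollup (the boolean flags are hypothetical field names; real data would come from your code-review and scanning tools):

```python
def review_discipline(prs):
    """Share of AI-assisted PRs that have human review, test evidence,
    and a passing security scan. Returns None if there are no AI-assisted PRs."""
    ai_prs = [p for p in prs if p["ai_assisted"]]
    if not ai_prs:
        return None
    compliant = [
        p for p in ai_prs
        if p["human_reviewed"] and p["has_tests"] and p["security_scan_passed"]
    ]
    return len(compliant) / len(ai_prs)

# Illustrative data, not real measurements.
prs = [
    {"ai_assisted": True,  "human_reviewed": True,  "has_tests": True,  "security_scan_passed": True},
    {"ai_assisted": True,  "human_reviewed": False, "has_tests": True,  "security_scan_passed": True},
    {"ai_assisted": False, "human_reviewed": True,  "has_tests": True,  "security_scan_passed": True},
]
print(review_discipline(prs))  # 1 of 2 AI-assisted PRs compliant -> 0.5
```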
Sample dashboard (mock)
A minimal view of what "prove it" can look like—baseline vs post-rollout, with the same definitions. This is illustrative; real dashboards plug into your pipeline and ticketing data.
- Cycle time (p50): 3.8 d (baseline: 4.2 d)
- Deploy frequency: 14/wk (baseline: 12/wk)
- Change fail rate: 7% (baseline: 8%)
Mock data for illustration. A real implementation pulls from your pipeline data (e.g. DORA-style metrics) and does not attribute deltas to AI without a controlled comparison.
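The stability side of such a dashboard reduces to two small calculations applied identically to both windows; a sketch, using the mock numbers above and illustrative record shapes:

```python
def change_failure_rate(deploys):
    """Failed deployments divided by total deployments; same definition
    must be applied to the baseline and post-rollout windows."""
    return sum(1 for d in deploys if d["failed"]) / len(deploys)

def mttr_hours(incidents):
    """Mean hours from incident start to service restored."""
    return sum(i["restored_h"] - i["started_h"] for i in incidents) / len(incidents)

# Illustrative data mirroring the mock dashboard: 1 failure in 12 deploys
# at baseline (~8%), 1 failure in 14 deploys post-rollout (~7%).
baseline = [{"failed": i == 0} for i in range(12)]
post     = [{"failed": i == 0} for i in range(14)]
print(round(change_failure_rate(baseline), 2))  # 0.08
print(round(change_failure_rate(post), 2))      # 0.07
```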
DORA alignment
DORA (DevOps Research and Assessment) defines four key metrics: deployment frequency, lead time for changes, change failure rate, and MTTR. They are the industry standard for software delivery performance, and we use them where your context allows:
- Same definitions before and after AI rollout so comparisons are fair.
- We do not claim "X% productivity gain" without objective metrics; perception-only claims are insufficient (see NAD guidance on Copilot claims).
- DORA 2024 and later work highlight that AI adoption can affect throughput and stability differently; we monitor both and emphasize guardrails and review.
References: DORA 2024 (sources.md ID 5), DORA AI Capabilities Model (system-level practices). See dora.dev and main repo docs/evidence/impact-metrics.md.
Sources
We do not invent metrics or ROI. All cited stats and methodology tie to our evidence pack and research-sources index in the main repo.
- DORA 2024 — AI adoption, throughput/stability (evidence/sources.md ID 5).
- METR RCT (2025) — context-dependent outcomes (evidence/sources.md ID 7).
- Impact metrics and what to track — docs/evidence/impact-metrics.md.
- NAD/BBB Microsoft Copilot — objective metrics over perception-only claims (evidence/sources.md ID 14).