How to Measure Flaky Test Rate in CI Before It Becomes a Release Risk

Flaky tests are not just an annoyance in CI; they are a reliability signal. If a test fails sometimes and passes when rerun without code changes, the pipeline stops being a trustworthy gate. That breeds a bad habit: engineers start treating failures as background noise, and real defects slip through the same channel.

The practical question is not whether flakes exist; most mature test suites have them. The real question is how to measure the flaky test rate in CI early enough to keep it from eroding release confidence. A simple, consistent metric lets QA managers, DevOps engineers, and release owners see whether instability is increasing, which test groups are causing it, and when the cost of noise is starting to outweigh the value of the gate.

For a useful baseline, it helps to remember what CI is supposed to do. Continuous integration is the practice of merging code frequently and validating it through automated checks, ideally before defects spread across branches and environments. Test automation provides the repeatable execution layer that makes that possible. If either layer becomes noisy, the signal quality drops. For background, see continuous integration, test automation, and software testing.

What flaky test rate actually measures

A flaky test rate is the share of test outcomes that appear unstable, meaning the same test sometimes passes and sometimes fails under the same or nearly the same conditions. In CI, the useful question is not only, “How many tests are flaky?” but also, “How much pipeline decision-making is contaminated by flakes?”

There are several ways to define the metric, and the right one depends on how your pipeline is organized.

Three practical definitions

1. Flaky test count divided by total tests in a window
Useful as a quick health metric, but it hides frequency. Ten flaky tests that each fail once are not the same as one test that fails every run.
2. Flaky failure events divided by all failure events
Better for separating noise from true quality regressions. If half your red builds are caused by unstable tests, your CI is losing credibility fast.
3. Flaky runs divided by total runs
Best when you want to understand the amount of wasted execution and rerun overhead.

The most operationally useful metric for release teams is usually the second one, because it shows how much of the failure stream is noise.

A test suite can have a small number of flaky tests and still create a large operational problem if those tests fail often enough to block merges or trigger reruns.

Start with a definition your team can enforce

Before you can measure flakiness, you need to define it in a way your tooling can recognize. If the definition is vague, reports become argument generators instead of decision tools.

A good working definition for CI is:

A test is flaky if it fails in one run and passes in a subsequent retry without a relevant code change, environment change, or test change that explains the difference.

That definition matters because it separates flakiness from genuine product instability. A real failure is still a failure, even if it is intermittent due to timing, network, or shared environment pressure. But a flaky test is a measurement problem first, and a defect signal second.

You should also decide whether your organization treats known flakes as:

still blocking,
allowed to pass with a warning,
automatically quarantined,
or excluded from release gates.

That decision changes how you compute the metric and what it means for release confidence.

The simplest calculation that is still useful

If you want a metric that a release owner can read in one glance, use this formula:

flaky test rate in CI = flaky failure events / total failure events

Where:

flaky failure event means a failure that was followed by a pass on rerun under the same pipeline context,
total failure event means any test failure that caused or could have caused a pipeline red state.

Example:

40 total failure events in a week
14 of those were identified as flaky after rerun
Flaky test rate = 14 / 40 = 35%

This does not mean 35% of your tests are flaky. It means 35% of your failure signal is noise. That is much more actionable.

If you also want a suite-level stability view, track a second metric:

flaky test ratio = number of tests with at least one flaky event / total tests executed

That helps identify the breadth of the problem, while the failure-event rate shows how much pain the problem creates.
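As a minimal sketch of the two formulas above, the functions below take plain counts and return a fraction. The function and parameter names are illustrative, and returning 0.0 when the denominator is zero is an assumption you may want to handle differently (for example, by reporting "no data").

```python
def flaky_failure_rate(flaky_failure_events: int, total_failure_events: int) -> float:
    """Share of failure events that turned out to be flaky (0.0 to 1.0)."""
    if total_failure_events == 0:
        return 0.0  # assumption: no failures means no noise to report
    return flaky_failure_events / total_failure_events


def flaky_test_ratio(tests_with_flaky_event: int, total_tests_executed: int) -> float:
    """Share of executed tests that produced at least one flaky event."""
    if total_tests_executed == 0:
        return 0.0
    return tests_with_flaky_event / total_tests_executed


if __name__ == "__main__":
    # The worked example from the text: 14 flaky events out of 40 failures.
    print(f"{flaky_failure_rate(14, 40):.0%}")  # 35%
```

Keeping the two functions separate mirrors the distinction in the text: one measures how noisy the failure stream is, the other how widespread the instability is.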

Instrumentation you need in CI

You cannot measure flakiness well if your CI output only says pass or fail. You need enough metadata to reconstruct what happened.

At minimum, capture:

test name or stable ID,
build ID or pipeline run ID,
commit SHA,
branch name,
environment name,
attempt number or retry count,
timestamp,
final result and retry results,
failure reason or exception text where available.

The important part is stable identity. If test names change frequently or are generated dynamically, the history becomes hard to aggregate.

Store raw results, not just summaries

Summaries are useful for dashboards, but raw execution records are essential for diagnosis. A proper data model usually keeps one row per test attempt, not just one row per test case per build.

That lets you answer questions like:

Did the test fail only on first attempt?
Does it fail only on one branch or one environment?
Are failures clustered around long durations, timeouts, or specific error signatures?
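One possible shape for an attempt-level store is sketched below with Python's built-in sqlite3 module. The table name, column names, and sample rows are all assumptions made for illustration; a real pipeline would more likely write to a shared database or metrics system. The query answers the first question above: which tests failed on the first attempt and passed on a later one.

```python
import sqlite3

# One row per test attempt (hypothetical schema).
SCHEMA = """
CREATE TABLE test_attempts (
    test_id     TEXT NOT NULL,
    run_id      TEXT NOT NULL,
    commit_sha  TEXT NOT NULL,
    branch      TEXT NOT NULL,
    environment TEXT NOT NULL,
    attempt     INTEGER NOT NULL,  -- 1 for the first try, 2+ for retries
    result      TEXT NOT NULL,     -- 'pass' or 'fail'
    error_text  TEXT,
    PRIMARY KEY (test_id, run_id, attempt)
)
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up attempt rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [
        ("checkout::applies_coupon", "run-101", "abc123", "main", "ci", 1, "fail", "TimeoutError"),
        ("checkout::applies_coupon", "run-101", "abc123", "main", "ci", 2, "pass", None),
        ("cart::adds_item", "run-101", "abc123", "main", "ci", 1, "fail", "AssertionError"),
        ("cart::adds_item", "run-101", "abc123", "main", "ci", 2, "fail", "AssertionError"),
    ]
    conn.executemany("INSERT INTO test_attempts VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def failed_first_then_passed(conn: sqlite3.Connection):
    """Return (test_id, run_id) pairs that failed on attempt 1 and passed on a retry."""
    query = """
        SELECT f.test_id, f.run_id
        FROM test_attempts AS f
        WHERE f.attempt = 1
          AND f.result = 'fail'
          AND EXISTS (
              SELECT 1 FROM test_attempts AS r
              WHERE r.test_id = f.test_id
                AND r.run_id = f.run_id
                AND r.attempt > 1
                AND r.result = 'pass'
          )
        ORDER BY f.test_id, f.run_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(failed_first_then_passed(build_demo_db()))
```

Because each attempt is its own row, the other questions (by branch, by environment, by error signature) become additional WHERE or GROUP BY clauses rather than schema changes.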

How to distinguish real failures from noise

One of the biggest mistakes teams make is treating every rerun pass as proof that the original failure was fake. Sometimes the rerun passes because the environment recovered, which still points to instability. Sometimes the first failure was a real bug that the rerun accidentally masked.

Use a classification approach instead of a binary label.

Suggested categories

Confirmed flaky: fails, then passes on retry with no code change
Confirmed product failure: fails consistently across retries or environments
Environmental failure: infrastructure, dependency, or testbed issue
Indeterminate: insufficient evidence to classify

This matters because flakiness can come from many sources:

race conditions in the application,
brittle selectors or waits in UI automation,
test data collisions,
shared staging environment contention,
network or service instability,
clock skew, caching, or async eventual consistency,
order dependence between tests.

A rerun policy should not hide these categories. It should expose them.

The best flake measurement systems do not ask, “Did the rerun pass?” They ask, “What kind of instability did the rerun reveal?”

A practical way to compute flake rate from retries

Most CI systems already have retries enabled for some jobs. You can use them as a signal, provided you record the attempts correctly.

A simple classification rule:

if attempt 1 fails and a later attempt passes, count one flaky failure event,
if all attempts fail, count one true failure event,
if attempt 1 passes, count no failure event,
if the run was interrupted or infrastructure errored, track separately.

Here is a compact pseudocode example:

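python

def classify_attempts(attempts):
    """Classify one test's ordered attempt results within a single run."""
    # attempts: list like ["fail", "pass"] or ["fail", "fail"]
    if not attempts:
        return "indeterminate"
    if attempts[0] == "pass":
        return "pass"
    # Any pass after an initial failure marks the failure as flaky.
    if "pass" in attempts[1:]:
        return "flaky"
    return "true_failure"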

That logic is intentionally simple. You can make it stricter by requiring the same environment and unchanged commit, or looser by accepting a pass after any retry inside a short time window.
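A stricter variant might look like the sketch below. It assumes each attempt carries its commit SHA and environment name, counts a retry as evidence only when both match the first attempt, and routes infrastructure errors to a separate bucket. The `Attempt` type, the `"infra_error"` result value, and the category strings are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Attempt:
    result: str       # "pass", "fail", or "infra_error" (assumed values)
    commit_sha: str
    environment: str


def classify_strict(attempts: List[Attempt]) -> str:
    """Classify ordered attempts, trusting only like-for-like retries."""
    if not attempts:
        return "indeterminate"
    # Infrastructure errors are tracked separately from test outcomes.
    if any(a.result == "infra_error" for a in attempts):
        return "environmental"
    first = attempts[0]
    if first.result == "pass":
        return "pass"
    # Only retries on the same commit and environment count as evidence.
    comparable = [
        a for a in attempts[1:]
        if a.commit_sha == first.commit_sha and a.environment == first.environment
    ]
    if not comparable:
        return "indeterminate"
    if any(a.result == "pass" for a in comparable):
        return "flaky"
    return "true_failure"
```

Note one deliberate difference from the simple rule: a single failure with no comparable retry is reported as indeterminate rather than as a true failure, because there is no rerun evidence either way.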

Why rerun counts matter as much as failure counts

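A test that fails once in 200 runs may not look serious on its own. But the effect compounds: as an illustration, if 50 tests each flake at that rate and the suite runs 200 times a week, you should expect roughly 50 flaky failures a week, each one triggering a rerun. Those reruns consume compute, slow down developers, and reduce trust in the pipeline.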

Track these related metrics together:

flaky failure rate: flaky failures / total failures
rerun rate: rerun attempts / total test attempts
pipeline waste ratio: rerun time / total test execution time
blocked merge rate due to instability: merge attempts delayed by flaky outcomes
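The rerun rate and pipeline waste ratio in the list above can be computed directly from attempt-level records. The sketch below assumes each record has an attempt number and a duration in seconds; the `AttemptRecord` type and its field names are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AttemptRecord:
    attempt: int             # 1 = first try, 2 or higher = rerun
    duration_seconds: float  # wall-clock time of this attempt


def rerun_rate(records: Sequence[AttemptRecord]) -> float:
    """Rerun attempts divided by all test attempts."""
    if not records:
        return 0.0
    reruns = sum(1 for r in records if r.attempt > 1)
    return reruns / len(records)


def pipeline_waste_ratio(records: Sequence[AttemptRecord]) -> float:
    """Time spent in reruns divided by total test execution time."""
    total = sum(r.duration_seconds for r in records)
    if total == 0:
        return 0.0
    rerun_time = sum(r.duration_seconds for r in records if r.attempt > 1)
    return rerun_time / total
```

Treating all rerun time as waste is a simplification: a rerun that confirms a real failure still produced useful information, so you may prefer to count only reruns of tests later classified as flaky.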

A release owner usually cares most about whether CI is still a dependable gate. A DevOps engineer cares about wasted execution and queue pressure. A QA manager cares about whether the suite is masking product risk. These are related but not identical concerns.

Segment the metric, or it will mislead you

A single suite-wide flake rate often hides the source of the problem. Break the metric down by category.

Useful dimensions

test layer: unit, integration, API, UI, end-to-end
repository or service
branch type: mainline vs feature branches
environment: local, ephemeral CI, shared staging
test owner or team
file, folder, or tag
browser, device, or platform

UI tests often have a higher false failure rate than unit tests, because they depend on rendering, timing, and selectors. API tests may be more stable, but can still be flaky if dependencies are asynchronous or test data is shared. Segmenting prevents bad comparisons, like judging all automation by the noisiest layer.

Example of a segment report

Segment	Failure events	Flaky events	Flaky failure rate
Unit tests	80	6	7.5%
API tests	40	8	20.0%
UI tests	30	15	50.0%

This kind of table is more useful than a single number because it points to where remediation effort should go first.
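A report like this can be built from a flat list of failure events, each tagged with its segment and whether it was classified as flaky. The sketch below assumes that input shape, which is an illustration rather than a required format; the sample data simply reproduces the counts in the table.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def segment_report(
    failure_events: Iterable[Tuple[str, bool]],
) -> Dict[str, Tuple[int, int, float]]:
    """Map segment -> (failure events, flaky events, flaky failure rate).

    failure_events holds one (segment, is_flaky) tuple per failure event.
    """
    totals: Dict[str, int] = defaultdict(int)
    flaky: Dict[str, int] = defaultdict(int)
    for segment, is_flaky in failure_events:
        totals[segment] += 1
        if is_flaky:
            flaky[segment] += 1
    return {
        segment: (totals[segment], flaky[segment], flaky[segment] / totals[segment])
        for segment in totals
    }


if __name__ == "__main__":
    # Made-up events matching the table: 6/80, 8/40, and 15/30 flaky.
    events = (
        [("Unit tests", True)] * 6 + [("Unit tests", False)] * 74
        + [("API tests", True)] * 8 + [("API tests", False)] * 32
        + [("UI tests", True)] * 15 + [("UI tests", False)] * 15
    )
    for name, (failures, flakes, rate) in segment_report(events).items():
        print(f"{name}\t{failures}\t{flakes}\t{rate:.1%}")
```

The same function works for any of the dimensions listed earlier; only the value used as the segment key changes.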

Set thresholds that reflect delivery risk

There is no universal acceptable flake rate. What matters is whether the instability is high enough to hurt delivery.

A sensible threshold policy should answer three questions:

When is the trend bad enough to require action?
When should a test be quarantined?
When is CI no longer reliable as a release gate?

A pragmatic policy

Investigate when a test flakes more than once in a day, or in more than a small percentage of its runs over a week
Quarantine when a test is repeatedly blocking merges but cannot be fixed immediately
Escalate when flaky failures make up a material portion of all failures, especially on the mainline branch
Freeze releases or tighten gates when instability starts to affect the team’s ability to distinguish code regressions from noise

The exact threshold should be based on your own workflow, but the policy must be explicit. Without that, teams normalize the pain and stop responding to deteriorating signals.
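One way to make such a policy explicit is to encode it as data plus a small decision function, as in the sketch below. Every threshold value here is a placeholder chosen for illustration, not a recommendation, and the field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlakePolicy:
    # All defaults are placeholders; tune them to your own workflow.
    max_flakes_per_day: int = 1
    max_weekly_flake_share: float = 0.02
    max_blocked_merges_per_week: int = 3
    escalate_flaky_failure_rate: float = 0.25


def action_for_test(
    policy: FlakePolicy,
    flakes_today: int,
    weekly_flakes: int,
    weekly_runs: int,
    blocked_merges: int,
    fix_ready: bool,
) -> str:
    """Return "quarantine", "investigate", or "ok" for a single test."""
    # Quarantine: repeatedly blocking merges and no immediate fix.
    if blocked_merges > policy.max_blocked_merges_per_week and not fix_ready:
        return "quarantine"
    weekly_share = weekly_flakes / weekly_runs if weekly_runs else 0.0
    if (
        flakes_today > policy.max_flakes_per_day
        or weekly_share > policy.max_weekly_flake_share
    ):
        return "investigate"
    return "ok"


def should_escalate(
    policy: FlakePolicy, mainline_flaky_failures: int, mainline_total_failures: int
) -> bool:
    """True when flaky failures are a material share of mainline failures."""
    if mainline_total_failures == 0:
        return False
    rate = mainline_flaky_failures / mainline_total_failures
    return rate >= policy.escalate_flaky_failure_rate
```

Keeping the thresholds in one reviewed object means a change to the policy shows up as a visible diff instead of a quiet shift in team habits.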

What to do when the rate rises

Once your metric shows growing CI instability, the goal is not just to report it. The goal is to reduce noise without creating blind spots.

1. Separate test defects from product defects

If a failure is nondeterministic, confirm it with retry history, logs, screenshots, traces, or network captures. If the failure is reproducible, treat it as a product or environment issue, not a flake.

2. Reduce shared-state coupling

Many flakes come from tests competing for the same users, records, queues, ports, or caches. Fix this by:

generating unique test data,
isolating databases,
resetting state between tests,
namespacing resources by run ID,
avoiding hidden dependencies on ordering.
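As a small illustration of the first and fourth items, the sketch below derives a per-run namespace and uses it to build unique test data. It reads `GITHUB_RUN_ID`, which GitHub Actions sets by default, and falls back to `"local"` elsewhere; the helper names and the email format are assumptions for this example.

```python
import os
import uuid


def run_namespace() -> str:
    """Build a namespace that is unique per CI run and per call."""
    # GITHUB_RUN_ID is set by GitHub Actions; other CI systems use other names.
    run_id = os.environ.get("GITHUB_RUN_ID", "local")
    # A short random suffix keeps parallel workers in the same run apart.
    return f"{run_id}-{uuid.uuid4().hex[:8]}"


def unique_email(namespace: str, label: str) -> str:
    """Create a test user email that cannot collide with another run's user."""
    return f"{label}+{namespace}@example.test"


if __name__ == "__main__":
    ns = run_namespace()
    print(unique_email(ns, "buyer"))
```

The same namespace can prefix queue names, database schemas, or cache keys, so that cleanup only needs to delete resources carrying that prefix.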

3. Make waits and synchronization explicit

UI and integration tests often fail because they assume immediate readiness. Prefer explicit waits for observable conditions instead of arbitrary sleep statements.

typescript

await page.getByRole('button', { name: 'Save' }).click();
await expect(page.getByText('Saved')).toBeVisible({ timeout: 5000 });

That pattern does not eliminate every flake, but it replaces timing guesses with a specific condition.
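Outside a browser framework, the same idea can be expressed as a small polling helper. The sketch below is a generic illustration in Python; `wait_until` is a hypothetical name, not part of any test library, and the default timeout and interval are arbitrary.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.1
) -> bool:
    """Poll condition() until it returns True or the timeout elapses.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test would call it as `assert wait_until(lambda: job.status == "done")`, where `job` stands in for whatever object the test observes. Unlike a fixed sleep, the wait ends as soon as the condition holds and fails with a clear timeout when it never does.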

4. Inspect environment variance

If failures cluster on one runner type, browser version, region, or time of day, the issue may be infrastructural. Compare failure density across:

executor images,
browser containers,
CPU limits,
network latency,
service versions,
data refresh cycles.

5. Stop letting retries become a hiding place

Retries are helpful when used as a measurement tool, but dangerous when they become the only response. If a test needs three retries to be “green,” it is not stable enough to be part of a release gate.

A dashboard that release owners can trust

A good flake dashboard should answer operational questions, not just technical ones.

Include these widgets:

flaky failure rate over time,
top flaky tests by failure count,
top flaky suites by share of noisy failures,
rerun volume by day,
mainline vs feature branch comparison,
open flaky tests by owner,
time since last flaky fix.

Add annotations for changes such as:

new browser or runner image,
major dependency upgrades,
test framework migration,
parallelism changes,
service virtualization changes.

Without change markers, it is hard to know whether flakiness is improving because of remediation or merely shifting due to pipeline changes.
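Most of these widgets reduce to simple aggregations over classified events. As one example, the sketch below ranks tests by flaky failure count, assuming the input is a list of test IDs with one entry per flaky failure event; the function name is illustrative.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_flaky_tests(
    flaky_event_test_ids: Iterable[str], limit: int = 10
) -> List[Tuple[str, int]]:
    """Rank tests by how many flaky failure events they produced."""
    return Counter(flaky_event_test_ids).most_common(limit)


if __name__ == "__main__":
    events = ["login", "search", "login", "checkout", "login", "search"]
    print(top_flaky_tests(events, limit=2))  # [('login', 3), ('search', 2)]
```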

A sample GitHub Actions pattern for capturing retries

If your CI allows retries, record the outcome of every attempt so you can compute flake rates later.

name: test
on: [push, pull_request]
jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
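      # The retry flag depends on the test runner: Vitest accepts --retry,
      # while Playwright and Mocha use --retries. Adjust to match yours.
      - run: npm test -- --retry=2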

The important part is not the exact runner syntax; it is that you collect attempt-level outcomes somewhere outside the ephemeral job log. If the only record is a single green or red badge, you cannot trend flakiness accurately.

Common mistakes when measuring flake rate

Counting every rerun pass as a success

A rerun pass means the test is unstable, not that the original signal was useless.

Mixing infrastructure outages with test flakiness

If your runners fail because of unavailable dependencies or expired credentials, that is CI reliability noise, but not necessarily flaky tests. Keep a separate category.

Using only aggregate numbers

A 5% flake rate might be acceptable in one service and disastrous in another. Context matters.

Ignoring age and ownership

Old flaky tests often survive because nobody feels responsible for them. Ownership is part of the fix.

Quarantining forever

Quarantine is a temporary pressure valve, not a maintenance strategy. Every quarantined test should have an owner and a due date.
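A lightweight check can enforce that rule. The sketch below assumes the quarantine list is available as records with a test ID, an owner, and a due date; the `QuarantineEntry` type and the message wording are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class QuarantineEntry:
    test_id: str
    owner: Optional[str]
    due: Optional[date]


def quarantine_violations(entries: List[QuarantineEntry], today: date) -> List[str]:
    """List quarantined tests that lack an owner or due date, or are overdue."""
    problems: List[str] = []
    for entry in entries:
        if not entry.owner:
            problems.append(f"{entry.test_id}: no owner")
        if entry.due is None:
            problems.append(f"{entry.test_id}: no due date")
        elif entry.due < today:
            problems.append(f"{entry.test_id}: overdue since {entry.due.isoformat()}")
    return problems
```

Running a check like this in CI, and failing or warning when the list is non-empty, keeps quarantine from quietly becoming permanent.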

When flaky test rate becomes a release risk

The metric becomes meaningful when it starts changing behavior. Warning signs include:

engineers rerun pipelines before reading failures,
releases wait on manual judgment instead of automated gates,
red builds are investigated less aggressively because they are “probably flakes,”
teams avoid adding more coverage because they distrust the suite,
merge throughput slows even when product quality is stable.

That is the point where flakiness is no longer a test maintenance issue; it is a delivery risk. A release process depends on fast, credible signals. If the signal is noisy, the organization compensates by moving slower, adding manual review, or ignoring failures. None of those are free.

A minimal rollout plan

If you want to start measuring next week, keep the implementation simple.

Enable retries only where necessary, and record every attempt.
Define flaky vs true failure, in writing.
Store attempt-level results in a table or metrics pipeline.
Compute flaky failure rate weekly, plus per-suite and per-environment breakdowns.
Set one escalation threshold for mainline CI.
Assign owners to the top noisy tests and remove them from permanent quarantine.

This approach is enough to tell whether CI is getting more trustworthy or less trustworthy. You do not need a perfect taxonomy on day one; you need a stable measurement loop.

Final thoughts

Measuring flaky test rate in CI is less about a fancy statistic and more about preserving trust. The key is to distinguish noise from signal, track retry outcomes at the attempt level, and segment the data so the source of instability is visible. Once you know which tests are noisy, how often they fail, and how much pipeline time they consume, you can decide whether the problem is a few isolated tests or a broader release risk.

A clean metric will not fix the suite by itself, but it will stop teams from guessing. That is the first step toward a CI system that tells the truth often enough to support real release confidence.