Performance Regressions: How I Detect Them Early
A practical framework for catching performance regressions before they reach production, covering CI benchmarks, statistical analysis, alerting strategies, and automated bisection.
Context
Performance regressions are the silent tax on velocity. A team ships a feature that adds 50ms to startup. Another team adds a new Dagger module that increases DI initialization by 30ms. A library update introduces an extra allocation per frame. Each change is small and passes code review. After three months, the app is 400ms slower and nobody can identify a single cause.
Problem
Manual performance testing does not scale. Developers do not notice 20ms regressions on flagship devices. QA cannot run benchmark suites on every PR. By the time the regression surfaces in production metrics, it has been merged for weeks, buried under hundreds of commits, and is nearly impossible to attribute.
The solution is automated performance regression detection in CI, with statistical rigor sufficient to catch small regressions before merge.
Constraints
- Must detect regressions as small as 5% on targeted metrics
- Must run on every PR without adding more than 10 minutes to CI time
- Must distinguish real regressions from measurement noise
- Must not produce excessive false positives (alert fatigue kills adoption)
- Must work with Gradle-based Android builds on CI infrastructure
Design
Layer 1: Macrobenchmark in CI
Android Macrobenchmark measures app-level performance (startup, scroll, animation) on real or virtual devices.
```kotlin
import androidx.benchmark.macro.FrameTimingMetric
import androidx.benchmark.macro.StartupMode
import androidx.benchmark.macro.StartupTimingMetric
import androidx.benchmark.macro.junit4.MacrobenchmarkRule
import androidx.test.ext.junit.runners.AndroidJUnit4
import androidx.test.uiautomator.By
import androidx.test.uiautomator.Direction
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class StartupBenchmark {
    @get:Rule
    val benchmarkRule = MacrobenchmarkRule()

    @Test
    fun coldStartup() {
        benchmarkRule.measureRepeated(
            packageName = "com.example.app",
            metrics = listOf(StartupTimingMetric()),
            iterations = 10,
            startupMode = StartupMode.COLD,
            setupBlock = {
                pressHome()
                killProcess()
            }
        ) {
            startActivityAndWait()
        }
    }

    @Test
    fun feedScroll() {
        benchmarkRule.measureRepeated(
            packageName = "com.example.app",
            metrics = listOf(FrameTimingMetric()),
            iterations = 5,
            startupMode = StartupMode.WARM
        ) {
            startActivityAndWait()
            val feed = device.findObject(By.res("feed_list"))
            feed.setGestureMargin(device.displayWidth / 5)
            repeat(3) {
                feed.fling(Direction.DOWN)
                device.waitForIdle()
            }
        }
    }
}
```

Layer 2: Statistical Regression Detection
Raw benchmark numbers are noisy. A single run showing 820ms vs. a baseline of 800ms means nothing. You need statistical confidence.
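How many iterations buy that confidence? A rough two-sample power calculation (normal approximation; the function name is mine, the formula is the standard one) suggests the 10 iterations used in the startup benchmark above are in the right ballpark for catching a 5% shift under a few percent of run-to-run noise:

```python
import math

def iterations_needed(noise_std: float, min_effect: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for a two-sample comparison (normal approximation).

    noise_std:  run-to-run measurement noise (same units as min_effect)
    min_effect: smallest mean shift worth detecting
    """
    # Hard-coded normal quantiles for the common alpha/power choices,
    # to keep the sketch dependency-free.
    z_alpha = {0.05: 1.96, 0.01: 2.576}[alpha]   # two-sided
    z_beta = {0.80: 0.84, 0.90: 1.28}[power]
    n = 2 * ((z_alpha + z_beta) * noise_std / min_effect) ** 2
    return math.ceil(n)

# 800 ms startup, ~3% noise (24 ms), target: detect a 5% (40 ms) regression
print(iterations_needed(noise_std=24.0, min_effect=40.0))  # → 6
```

Halve the detectable effect and the required iterations roughly quadruple, which is why sub-5% targets get expensive fast.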
```kotlin
data class BenchmarkResult(
    val metric: String,
    val values: List<Double>
)

class RegressionDetector(
    private val confidenceLevel: Double = 0.95,
    private val regressionThreshold: Double = 0.05 // 5%
) {
    fun detect(baseline: BenchmarkResult, current: BenchmarkResult): RegressionVerdict {
        val baselineMean = baseline.values.average()
        val currentMean = current.values.average()
        val percentChange = (currentMean - baselineMean) / baselineMean

        // Mann-Whitney U test for non-parametric comparison
        val pValue = mannWhitneyUTest(baseline.values, current.values)
        val isStatisticallySignificant = pValue < (1 - confidenceLevel)
        val exceedsThreshold = percentChange > regressionThreshold

        return when {
            isStatisticallySignificant && exceedsThreshold -> RegressionVerdict.REGRESSION(
                metric = current.metric,
                baselineMean = baselineMean,
                currentMean = currentMean,
                percentChange = percentChange,
                pValue = pValue
            )
            isStatisticallySignificant && percentChange < -regressionThreshold ->
                RegressionVerdict.IMPROVEMENT(
                    metric = current.metric,
                    percentChange = percentChange
                )
            else -> RegressionVerdict.NO_CHANGE
        }
    }

    private fun mannWhitneyUTest(a: List<Double>, b: List<Double>): Double {
        // Simplified (no tie correction): in practice, use Apache Commons Math
        // or equivalent
        val combined = (a.map { it to "a" } + b.map { it to "b" })
            .sortedBy { it.first }
        val rankSumA = combined.mapIndexedNotNull { index, (_, group) ->
            if (group == "a") index + 1.0 else null
        }.sum()
        val u = rankSumA - a.size * (a.size + 1) / 2.0
        val n = a.size * b.size
        // Normal approximation for large samples
        val mean = n / 2.0
        val std = kotlin.math.sqrt(n * (a.size + b.size + 1) / 12.0)
        val z = (u - mean) / std
        // Two-tailed p-value approximation
        return 2 * (1 - normalCdf(kotlin.math.abs(z)))
    }

    private fun normalCdf(z: Double): Double {
        return 0.5 * (1 + erf(z / kotlin.math.sqrt(2.0)))
    }

    private fun erf(x: Double): Double {
        // Abramowitz and Stegun approximation
        val t = 1.0 / (1.0 + 0.3275911 * kotlin.math.abs(x))
        val poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
            t * (-1.453152027 + t * 1.061405429))))
        val result = 1.0 - poly * kotlin.math.exp(-x * x)
        return if (x >= 0) result else -result
    }
}

sealed class RegressionVerdict {
    data class REGRESSION(
        val metric: String,
        val baselineMean: Double,
        val currentMean: Double,
        val percentChange: Double,
        val pValue: Double
    ) : RegressionVerdict()

    data class IMPROVEMENT(
        val metric: String,
        val percentChange: Double
    ) : RegressionVerdict()

    object NO_CHANGE : RegressionVerdict()
}
```

Layer 3: Baseline Management
Baselines must be updated deliberately, not automatically.
| Strategy | Pros | Cons |
|---|---|---|
| Git-committed baselines | Versioned, reviewable | Requires manual updates |
| CI-generated rolling baselines | Always current | Gradual regressions slip through |
| Release-pinned baselines | Clear reference point | Stale between releases |
The recommended approach: pin baselines to the last release, update explicitly when performance changes are intentional.
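On the CI side, the gate against a pinned baseline can be a few lines of Python. A sketch (helper name is mine; the JSON layout matches the `baseline.json` example that follows) that fails metrics drifting past threshold and never rewrites the baseline itself:

```python
import json

def check_against_baseline(baseline_json: str, current: dict,
                           threshold: float = 0.05) -> list[str]:
    """Return the metric names that regressed past threshold.

    Baselines change only by editing the committed JSON, never from here.
    """
    baseline = json.loads(baseline_json)
    failures = []
    for metric, entry in baseline.items():
        if metric in ("last_updated", "release"):
            continue  # metadata, not a measured metric
        for percentile, pinned in entry.items():
            measured = current.get(metric, {}).get(percentile)
            if measured is None:
                continue  # metric not produced by this benchmark run
            if (measured - pinned) / pinned > threshold:
                failures.append(f"{metric}.{percentile}")
    return failures

baseline = '{"cold_startup_ms": {"p50": 780, "p90": 920}, "release": "v4.2.0"}'
print(check_against_baseline(baseline, {"cold_startup_ms": {"p50": 850, "p90": 930}}))
# → ['cold_startup_ms.p50']
```

Keeping the check read-only is the point: an intentional performance change shows up in review as a diff to the committed baseline file.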
```json
// baseline.json stored in repo
{
  "cold_startup_ms": {
    "p50": 780,
    "p90": 920,
    "p99": 1150
  },
  "feed_scroll_frame_p95_ms": {
    "value": 12.4
  },
  "last_updated": "2026-01-15",
  "release": "v4.2.0"
}
```

Layer 4: CI Pipeline Integration
```yaml
# .github/workflows/performance.yml
name: Performance Regression Check

on:
  pull_request:
    branches: [main]

jobs:
  benchmark:
    runs-on: macos-latest  # for hardware acceleration
    steps:
      - uses: actions/checkout@v4
      - name: Set up Android SDK
        uses: android-actions/setup-android@v3
      - name: Run benchmarks
        uses: reactivecircus/android-emulator-runner@v2
        with:
          api-level: 34
          arch: x86_64
          script: ./gradlew :benchmark:connectedCheck
      - name: Compare with baseline
        run: |
          python3 scripts/compare_benchmarks.py \
            --baseline baselines/performance.json \
            --current benchmark/build/outputs/connected_android_test_additional_output \
            --threshold 0.05 \
            --confidence 0.95
      - name: Post results to PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./benchmark-comparison.json')
            // Post formatted table as PR comment
```

Layer 5: Automated Bisection
When a regression is detected on the main branch, automatically identify the causing commit.
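The halving strategy is easy to sanity-check in plain Python before wiring it to git and a device farm. A sketch with hypothetical per-commit measurements (a real bisector benchmarks each midpoint lazily; the Kotlin pseudocode below adds that plumbing), assuming a single regression point so results are "good" then "bad":

```python
def first_regressed(measurements: list[float], baseline: float,
                    threshold: float = 0.05) -> int:
    """Binary-search the oldest index whose metric regressed past threshold.

    measurements[i] is the benchmark result at commit i, oldest first;
    the last entry is assumed to be regressed (that is what triggered us).
    """
    lo, hi = 0, len(measurements) - 1  # hi is known-bad
    while lo < hi:
        mid = (lo + hi) // 2
        if (measurements[mid] - baseline) / baseline > threshold:
            hi = mid       # regression at or before mid
        else:
            lo = mid + 1   # regression strictly after mid
    return lo

# startup times per commit: the regression lands at index 3
times = [800, 805, 798, 870, 872, 869]
print(first_regressed(times, baseline=800.0))  # → 3
```

With one benchmark run per halving step, a 64-commit window costs 6 runs instead of 64.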
```kotlin
// Script pseudocode for automated bisection
class PerformanceBisector(
    private val benchmarkRunner: BenchmarkRunner,
    private val baselineMetric: Double,
    private val regressionThreshold: Double
) {
    suspend fun bisect(goodCommit: String, badCommit: String): String {
        // Commits in (goodCommit, badCommit], ordered oldest first
        val commits = git.log(goodCommit, badCommit)
        if (commits.size <= 1) return commits.first()

        // (size - 1) / 2, not size / 2: with two candidates left,
        // size / 2 would re-test badCommit and never narrow the range
        val midpoint = commits[(commits.size - 1) / 2]
        git.checkout(midpoint)
        val result = benchmarkRunner.run()
        val isRegressed = (result.mean - baselineMetric) / baselineMetric > regressionThreshold
        return if (isRegressed) {
            bisect(goodCommit, midpoint)  // culprit at or before midpoint
        } else {
            bisect(midpoint, badCommit)   // culprit after midpoint
        }
    }
}
```

Build-Time Regression Detection
Performance is not only a runtime concern. Build-time regressions tax developer productivity on every compile.
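CI can scrape the `TASK_TIMING|path|Nms` lines emitted by the Gradle snippet below out of the build log and aggregate them per module. A sketch (log contents are hypothetical; the pipe-delimited format matches what the snippet prints):

```python
import re
from collections import defaultdict

TIMING = re.compile(r"^TASK_TIMING\|(?P<path>[^|]+)\|(?P<ms>\d+)ms$")

def module_times(log_lines: list[str]) -> dict[str, int]:
    """Sum task durations per top-level module (':app:compileDebugKotlin' -> ':app')."""
    totals: dict[str, int] = defaultdict(int)
    for line in log_lines:
        m = TIMING.match(line.strip())
        if not m:
            continue  # ordinary build output, not a timing line
        module = ":" + m.group("path").lstrip(":").split(":")[0]
        totals[module] += int(m.group("ms"))
    return dict(totals)

log = [
    "TASK_TIMING|:app:compileDebugKotlin|41200ms",
    "TASK_TIMING|:app:kaptDebugKotlin|18300ms",
    "TASK_TIMING|:feed:compileDebugKotlin|9800ms",
    "> Task :app:assembleDebug",  # ignored
]
print(module_times(log))  # → {':app': 59500, ':feed': 9800}
```

The per-module totals then feed the same baseline-comparison machinery as the runtime metrics.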
```kotlin
// settings.gradle.kts — track Gradle build times per task.
// Note: buildFinished and task listeners are deprecated in recent Gradle;
// a shared BuildService implementing OperationCompletionListener is the
// configuration-cache-compatible replacement.
val taskStarts = mutableMapOf<String, Long>()
val taskDurations = mutableMapOf<String, Long>()

gradle.taskGraph.beforeTask { taskStarts[path] = System.currentTimeMillis() }
gradle.taskGraph.afterTask {
    taskStarts[path]?.let { start ->
        taskDurations[path] = System.currentTimeMillis() - start
    }
}
gradle.buildFinished {
    taskDurations.entries
        .sortedByDescending { it.value }
        .take(20)
        .forEach { (taskPath, ms) -> println("TASK_TIMING|$taskPath|${ms}ms") }
}
```

Trade-offs
| Approach | Detection Sensitivity | CI Time Cost | Maintenance Burden |
|---|---|---|---|
| Macrobenchmark on every PR | High | 5-10 min | Medium |
| Nightly benchmark suite | Medium | 0 on PR | Low |
| Production metrics only | Low (reactive) | None | Low |
| Microbenchmark per module | Very high (per function) | 2-3 min | High |
Failure Modes
Related: Failure Modes I Actively Design For.
- Flaky benchmarks: measurement noise causes false positives. Mitigation: run enough iterations (10+) and use statistical tests, not raw comparisons.
- Emulator vs. device divergence: CI emulators have different performance characteristics than real devices. Calibrate thresholds per environment.
- Gradual regression: 1% per week is undetectable per PR but devastating over a quarter. Use release-pinned baselines to catch cumulative drift.
- Threshold too strict: teams ignore performance gates that constantly fail. Start with 10% thresholds, tighten as infrastructure matures.
- Missing coverage: benchmark only covers happy path. A regression in error handling or edge case flows goes undetected.
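The flaky-benchmark point is easy to demonstrate. A small simulation (assumed 2% run-to-run noise, 10 iterations, and no real regression anywhere) compares a naive "current mean is higher" rule against the 5%-threshold rule, even before a significance test is layered on top:

```python
import random
import statistics

random.seed(42)

def noisy_run(mean: float = 800.0, noise: float = 0.02, n: int = 10) -> list[float]:
    """Simulate one benchmark run: n iterations around a stable mean."""
    return [random.gauss(mean, mean * noise) for _ in range(n)]

raw_flags = stat_flags = 0
for _ in range(200):  # 200 PRs, none of which actually regresses anything
    baseline, current = noisy_run(), noisy_run()
    delta = (statistics.mean(current) - statistics.mean(baseline)) / statistics.mean(baseline)
    if delta > 0:
        raw_flags += 1      # naive rule: any mean increase at all
    if delta > 0.05:
        stat_flags += 1     # threshold rule: flag only a 5%+ shift

print(raw_flags, stat_flags)
```

The naive rule fires on roughly half of all clean PRs, which is exactly the alert fatigue the constraints warn about; the thresholded rule stays quiet, and the Mann-Whitney test from Layer 2 adds a second guard for cases where the threshold is crossed by chance.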
Scaling Considerations
- Shard benchmarks across multiple CI machines for parallel execution
- Maintain per-module benchmark ownership so teams run only their benchmarks on their PRs
- Build a performance dashboard that shows trends over weeks and months
- Invest in dedicated benchmark hardware for consistent, reproducible results
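The sharding point above can be as simple as a stable hash of the benchmark class name, so each CI machine always runs the same subset and results stay comparable run to run. A minimal sketch (class names are illustrative):

```python
import hashlib

def shard_for(benchmark_class: str, shard_count: int) -> int:
    """Stable shard assignment: the same class always lands on the same machine."""
    digest = hashlib.sha256(benchmark_class.encode()).hexdigest()
    return int(digest, 16) % shard_count

for name in ["StartupBenchmark", "FeedScrollBenchmark", "SearchBenchmark"]:
    print(name, shard_for(name, shard_count=2))
```

Hash-based assignment beats round-robin here because adding or removing one benchmark does not reshuffle every other benchmark onto different hardware.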
Observability
- Dashboard showing P50/P90/P99 for all tracked metrics across releases
- Trend lines with regression annotations (which commit, which PR)
- Alert on statistical regressions that exceed threshold
- Weekly performance digest sent to engineering leadership
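The P50/P90/P99 values on that dashboard can be computed from raw samples without extra dependencies. There are several percentile conventions; nearest-rank is one reasonable choice (an assumption, not a mandate) as long as the same one is used everywhere so trends stay comparable:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

startups = [780, 795, 801, 812, 790, 905, 1150, 799, 808, 785]
print(percentile(startups, 50), percentile(startups, 90))  # → 799 905
```

Tail percentiles (P99 here resolves to the single 1150 ms outlier) need far more samples than the median to be stable, which is another argument for trend lines over single-release snapshots.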
Key Takeaways
- Automate benchmark execution on every PR. Manual testing does not catch 5% regressions.
- Use statistical tests (Mann-Whitney U or similar) to distinguish real regressions from noise.
- Pin baselines to releases and update them explicitly. Rolling baselines hide gradual regressions.
- Start with generous thresholds (10%) and tighten over time. False positive fatigue kills adoption faster than missed regressions.
- Track both runtime performance and build time as first-class metrics.
- Automate bisection for regressions detected on the main branch.
See also: Building With Intent, Not Just Tools.
Further Reading
- Debugging Performance Issues in Large Android Apps: A systematic approach to identifying, isolating, and fixing performance bottlenecks in large Android codebases, covering profiling strate...
- How Garbage Collection Impacts Android Performance: A detailed look at ART's garbage collection mechanisms, how GC pauses affect frame rates, and practical strategies to minimize GC impact ...
- How I Profile Android Apps in Production: Techniques for collecting meaningful performance data from production Android apps without degrading user experience, covering sampling s...
Final Thoughts
Performance regression detection is a CI infrastructure problem, not a testing problem. The tools (Macrobenchmark, statistical analysis, automated bisection) are mature. The real challenge is operational: maintaining baselines, tuning thresholds, and building team habits around performance gates. Invest in this infrastructure early. The cost of catching a regression before merge is orders of magnitude lower than fixing it after users notice.