Performance Regressions: How I Detect Them Early
A practical framework for catching performance regressions before they reach production, covering CI benchmarks, statistical analysis, alerting strategies, and automated bisection.
Context
Performance regressions are the silent tax on velocity. A team ships a feature that adds 50ms to startup. Another team adds a new Dagger module that increases DI initialization by 30ms. A library update introduces an extra allocation per frame. Each change is small and passes code review. After three months, the app is 400ms slower and nobody can identify a single cause.
Problem
Manual performance testing does not scale. Developers do not notice 20ms regressions on flagship devices. QA cannot run benchmark suites on every PR. By the time the regression surfaces in production metrics, it has been merged for weeks, buried under hundreds of commits, and is nearly impossible to attribute.
The solution is automated performance regression detection in CI, with statistical rigor sufficient to catch small regressions before merge.
Constraints
- Must detect regressions as small as 5% on targeted metrics
- Must run on every PR without adding more than 10 minutes to CI time
- Must distinguish real regressions from measurement noise
- Must not produce excessive false positives (alert fatigue kills adoption)
- Must work with Gradle-based Android builds on CI infrastructure
Design
Layer 1: Macrobenchmark in CI
Android Macrobenchmark measures app-level performance (startup, scroll, animation) on real or virtual devices.
```kotlin
import androidx.benchmark.macro.FrameTimingMetric
import androidx.benchmark.macro.StartupMode
import androidx.benchmark.macro.StartupTimingMetric
import androidx.benchmark.macro.junit4.MacrobenchmarkRule
import androidx.test.ext.junit.runners.AndroidJUnit4
import androidx.test.uiautomator.By
import androidx.test.uiautomator.Direction
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class StartupBenchmark {
    @get:Rule
    val benchmarkRule = MacrobenchmarkRule()

    @Test
    fun coldStartup() {
        benchmarkRule.measureRepeated(
            packageName = "com.example.app",
            metrics = listOf(StartupTimingMetric()),
            iterations = 10,
            startupMode = StartupMode.COLD,
            setupBlock = {
                pressHome()
                killProcess()
            }
        ) {
            startActivityAndWait()
        }
    }

    @Test
    fun feedScroll() {
        benchmarkRule.measureRepeated(
            packageName = "com.example.app",
            metrics = listOf(FrameTimingMetric()),
            iterations = 5,
            startupMode = StartupMode.WARM
        ) {
            startActivityAndWait()
            val feed = device.findObject(By.res("feed_list"))
            feed.setGestureMargin(device.displayWidth / 5)
            repeat(3) {
                feed.fling(Direction.DOWN)
                device.waitForIdle()
            }
        }
    }
}
```

Layer 2: Statistical Regression Detection
Raw benchmark numbers are noisy. A single run showing 820ms vs. a baseline of 800ms means nothing. You need statistical confidence.
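How many iterations buy that confidence? A rough two-sample power calculation (normal approximation; the function name is mine, the formula is the standard one) suggests the 10 iterations used in the startup benchmark above are in the right ballpark for catching a 5% shift under a few percent of run-to-run noise:

```python
import math

def iterations_needed(noise_std: float, min_effect: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for a two-sample comparison (normal approximation).

    noise_std:  run-to-run measurement noise (same units as min_effect)
    min_effect: smallest mean shift worth detecting
    """
    # Hard-coded normal quantiles for the common alpha/power choices,
    # to keep the sketch dependency-free.
    z_alpha = {0.05: 1.96, 0.01: 2.576}[alpha]   # two-sided
    z_beta = {0.80: 0.84, 0.90: 1.28}[power]
    n = 2 * ((z_alpha + z_beta) * noise_std / min_effect) ** 2
    return math.ceil(n)

# 800 ms startup, ~3% noise (24 ms), target: detect a 5% (40 ms) regression
print(iterations_needed(noise_std=24.0, min_effect=40.0))  # → 6
```

Halve the detectable effect and the required iterations roughly quadruple, which is why sub-5% targets get expensive fast.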
```kotlin
data class BenchmarkResult(
    val metric: String,
    val values: List<Double>
)

class RegressionDetector(
    private val confidenceLevel: Double = 0.95,
    private val regressionThreshold: Double = 0.05 // 5%
) {
    fun detect(baseline: BenchmarkResult, current: BenchmarkResult): RegressionVerdict {
        val baselineMean = baseline.values.average()
        val currentMean = current.values.average()
        val percentChange = (currentMean - baselineMean) / baselineMean

        // Mann-Whitney U test for non-parametric comparison
        val pValue = mannWhitneyUTest(baseline.values, current.values)
        val isStatisticallySignificant = pValue < (1 - confidenceLevel)
        val exceedsThreshold = percentChange > regressionThreshold

        return when {
            isStatisticallySignificant && exceedsThreshold -> RegressionVerdict.REGRESSION(
                metric = current.metric,
                baselineMean = baselineMean,
                currentMean = currentMean,
                percentChange = percentChange,
                pValue = pValue
            )
            isStatisticallySignificant && percentChange < -regressionThreshold ->
                RegressionVerdict.IMPROVEMENT(
                    metric = current.metric,
                    percentChange = percentChange
                )
            else -> RegressionVerdict.NO_CHANGE
        }
    }

    private fun mannWhitneyUTest(a: List<Double>, b: List<Double>): Double {
        // Simplified (no tie correction): in practice, use Apache Commons Math
        // or equivalent
        val combined = (a.map { it to "a" } + b.map { it to "b" })
            .sortedBy { it.first }
        val rankSumA = combined.mapIndexedNotNull { index, (_, group) ->
            if (group == "a") index + 1.0 else null
        }.sum()
        val u = rankSumA - a.size * (a.size + 1) / 2.0
        val n = a.size * b.size
        // Normal approximation for large samples
        val mean = n / 2.0
        val std = kotlin.math.sqrt(n * (a.size + b.size + 1) / 12.0)
        val z = (u - mean) / std
        // Two-tailed p-value approximation
        return 2 * (1 - normalCdf(kotlin.math.abs(z)))
    }

    private fun normalCdf(z: Double): Double {
        return 0.5 * (1 + erf(z / kotlin.math.sqrt(2.0)))
    }

    private fun erf(x: Double): Double {
        // Abramowitz and Stegun approximation
        val t = 1.0 / (1.0 + 0.3275911 * kotlin.math.abs(x))
        val poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
            t * (-1.453152027 + t * 1.061405429))))
        val result = 1.0 - poly * kotlin.math.exp(-x * x)
        return if (x >= 0) result else -result
    }
}

sealed class RegressionVerdict {
    data class REGRESSION(
        val metric: String,
        val baselineMean: Double,
        val currentMean: Double,
        val percentChange: Double,
        val pValue: Double
    ) : RegressionVerdict()

    data class IMPROVEMENT(
        val metric: String,
        val percentChange: Double
    ) : RegressionVerdict()

    object NO_CHANGE : RegressionVerdict()
}
```

Layer 3: Baseline Management
Baselines must be updated deliberately, not automatically.
| Strategy | Pros | Cons |
|---|---|---|
| Git-committed baselines | Versioned, reviewable | Requires manual updates |
| CI-generated rolling baselines | Always current | Gradual regressions slip through |
| Release-pinned baselines | Clear reference point | Stale between releases |
The recommended approach: pin baselines to the last release, update explicitly when performance changes are intentional.
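On the CI side, the gate against a pinned baseline can be a few lines of Python. A sketch (helper name is mine; the JSON layout matches the `baseline.json` example that follows) that fails metrics drifting past threshold and never rewrites the baseline itself:

```python
import json

def check_against_baseline(baseline_json: str, current: dict,
                           threshold: float = 0.05) -> list[str]:
    """Return the metric names that regressed past threshold.

    Baselines change only by editing the committed JSON, never from here.
    """
    baseline = json.loads(baseline_json)
    failures = []
    for metric, entry in baseline.items():
        if metric in ("last_updated", "release"):
            continue  # metadata, not a measured metric
        for percentile, pinned in entry.items():
            measured = current.get(metric, {}).get(percentile)
            if measured is None:
                continue  # metric not produced by this benchmark run
            if (measured - pinned) / pinned > threshold:
                failures.append(f"{metric}.{percentile}")
    return failures

baseline = '{"cold_startup_ms": {"p50": 780, "p90": 920}, "release": "v4.2.0"}'
print(check_against_baseline(baseline, {"cold_startup_ms": {"p50": 850, "p90": 930}}))
# → ['cold_startup_ms.p50']
```

Keeping the check read-only is the point: an intentional performance change shows up in review as a diff to the committed baseline file.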
```json
// baseline.json stored in repo
{
  "cold_startup_ms": {
    "p50": 780,
    "p90": 920,
    "p99": 1150
  },
  "feed_scroll_frame_p95_ms": {
    "value": 12.4
  },
  "last_updated": "2026-01-15",
  "release": "v4.2.0"
}
```

Layer 4: CI Pipeline Integration
```yaml
# .github/workflows/performance.yml
name: Performance Regression Check

on:
  pull_request:
    branches: [main]

jobs:
  benchmark:
    runs-on: macos-latest  # for hardware acceleration
    steps:
      - uses: actions/checkout@v4
      - name: Set up Android SDK
        uses: android-actions/setup-android@v3
      - name: Run benchmarks
        uses: reactivecircus/android-emulator-runner@v2
        with:
          api-level: 34
          arch: x86_64
          script: ./gradlew :benchmark:connectedCheck
      - name: Compare with baseline
        run: |
          python3 scripts/compare_benchmarks.py \
            --baseline baselines/performance.json \
            --current benchmark/build/outputs/connected_android_test_additional_output \
            --threshold 0.05 \
            --confidence 0.95
      - name: Post results to PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./benchmark-comparison.json')
            // Post formatted table as PR comment
```

Layer 5: Automated Bisection
When a regression is detected on the main branch, automatically identify the causing commit.
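The halving strategy is easy to sanity-check in plain Python before wiring it to git and a device farm. A sketch with hypothetical per-commit measurements (a real bisector benchmarks each midpoint lazily; the Kotlin pseudocode below adds that plumbing), assuming a single regression point so results are "good" then "bad":

```python
def first_regressed(measurements: list[float], baseline: float,
                    threshold: float = 0.05) -> int:
    """Binary-search the oldest index whose metric regressed past threshold.

    measurements[i] is the benchmark result at commit i, oldest first;
    the last entry is assumed to be regressed (that is what triggered us).
    """
    lo, hi = 0, len(measurements) - 1  # hi is known-bad
    while lo < hi:
        mid = (lo + hi) // 2
        if (measurements[mid] - baseline) / baseline > threshold:
            hi = mid       # regression at or before mid
        else:
            lo = mid + 1   # regression strictly after mid
    return lo

# startup times per commit: the regression lands at index 3
times = [800, 805, 798, 870, 872, 869]
print(first_regressed(times, baseline=800.0))  # → 3
```

With one benchmark run per halving step, a 64-commit window costs 6 runs instead of 64.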
```kotlin
// Script pseudocode for automated bisection
class PerformanceBisector(
    private val benchmarkRunner: BenchmarkRunner,
    private val baselineMetric: Double,
    private val regressionThreshold: Double
) {
    suspend fun bisect(goodCommit: String, badCommit: String): String {
        // Commits in (goodCommit, badCommit], ordered oldest first
        val commits = git.log(goodCommit, badCommit)
        if (commits.size <= 1) return commits.first()

        // (size - 1) / 2, not size / 2: with two candidates left,
        // size / 2 would re-test badCommit and never narrow the range
        val midpoint = commits[(commits.size - 1) / 2]
        git.checkout(midpoint)
        val result = benchmarkRunner.run()
        val isRegressed = (result.mean - baselineMetric) / baselineMetric > regressionThreshold
        return if (isRegressed) {
            bisect(goodCommit, midpoint)  // culprit at or before midpoint
        } else {
            bisect(midpoint, badCommit)   // culprit after midpoint
        }
    }
}
```

Build-Time Regression Detection
Performance is not only a runtime concern. Build-time regressions tax developer productivity on every compile.
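CI can scrape the `TASK_TIMING|path|Nms` lines emitted by the Gradle snippet below out of the build log and aggregate them per module. A sketch (log contents are hypothetical; the pipe-delimited format matches what the snippet prints):

```python
import re
from collections import defaultdict

TIMING = re.compile(r"^TASK_TIMING\|(?P<path>[^|]+)\|(?P<ms>\d+)ms$")

def module_times(log_lines: list[str]) -> dict[str, int]:
    """Sum task durations per top-level module (':app:compileDebugKotlin' -> ':app')."""
    totals: dict[str, int] = defaultdict(int)
    for line in log_lines:
        m = TIMING.match(line.strip())
        if not m:
            continue  # ordinary build output, not a timing line
        module = ":" + m.group("path").lstrip(":").split(":")[0]
        totals[module] += int(m.group("ms"))
    return dict(totals)

log = [
    "TASK_TIMING|:app:compileDebugKotlin|41200ms",
    "TASK_TIMING|:app:kaptDebugKotlin|18300ms",
    "TASK_TIMING|:feed:compileDebugKotlin|9800ms",
    "> Task :app:assembleDebug",  # ignored
]
print(module_times(log))  # → {':app': 59500, ':feed': 9800}
```

The per-module totals then feed the same baseline-comparison machinery as the runtime metrics.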
```kotlin
// settings.gradle.kts — track Gradle build times per task.
// Note: buildFinished and task listeners are deprecated in recent Gradle;
// a shared BuildService implementing OperationCompletionListener is the
// configuration-cache-compatible replacement.
val taskStarts = mutableMapOf<String, Long>()
val taskDurations = mutableMapOf<String, Long>()

gradle.taskGraph.beforeTask { taskStarts[path] = System.currentTimeMillis() }
gradle.taskGraph.afterTask {
    taskStarts[path]?.let { start ->
        taskDurations[path] = System.currentTimeMillis() - start
    }
}
gradle.buildFinished {
    taskDurations.entries
        .sortedByDescending { it.value }
        .take(20)
        .forEach { (taskPath, ms) -> println("TASK_TIMING|$taskPath|${ms}ms") }
}
```

Trade-offs
| Approach | Detection Sensitivity | CI Time Cost | Maintenance Burden |
|---|---|---|---|
| Macrobenchmark on every PR | High | 5-10 min | Medium |
| Nightly benchmark suite | Medium | 0 on PR | Low |
| Production metrics only | Low (reactive) | None | Low |
| Microbenchmark per module | Very high (per function) | 2-3 min | High |
Failure Modes
Related: Failure Modes I Actively Design For.
- Flaky benchmarks: measurement noise causes false positives. Mitigation: run enough iterations (10+) and use statistical tests, not raw comparisons.
- Emulator vs. device divergence: CI emulators have different performance characteristics than real devices. Calibrate thresholds per environment.
- Gradual regression: 1% per week is undetectable per PR but devastating over a quarter. Use release-pinned baselines to catch cumulative drift.
- Threshold too strict: teams ignore performance gates that constantly fail. Start with 10% thresholds, tighten as infrastructure matures.
- Missing coverage: benchmark only covers happy path. A regression in error handling or edge case flows goes undetected.
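The flaky-benchmark point is easy to demonstrate. A small simulation (assumed 2% run-to-run noise, 10 iterations, and no real regression anywhere) compares a naive "current mean is higher" rule against the 5%-threshold rule, even before a significance test is layered on top:

```python
import random
import statistics

random.seed(42)

def noisy_run(mean: float = 800.0, noise: float = 0.02, n: int = 10) -> list[float]:
    """Simulate one benchmark run: n iterations around a stable mean."""
    return [random.gauss(mean, mean * noise) for _ in range(n)]

raw_flags = stat_flags = 0
for _ in range(200):  # 200 PRs, none of which actually regresses anything
    baseline, current = noisy_run(), noisy_run()
    delta = (statistics.mean(current) - statistics.mean(baseline)) / statistics.mean(baseline)
    if delta > 0:
        raw_flags += 1      # naive rule: any mean increase at all
    if delta > 0.05:
        stat_flags += 1     # threshold rule: flag only a 5%+ shift

print(raw_flags, stat_flags)
```

The naive rule fires on roughly half of all clean PRs, which is exactly the alert fatigue the constraints warn about; the thresholded rule stays quiet, and the Mann-Whitney test from Layer 2 adds a second guard for cases where the threshold is crossed by chance.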
Scaling Considerations
- Shard benchmarks across multiple CI machines for parallel execution
- Maintain per-module benchmark ownership so teams run only their benchmarks on their PRs
- Build a performance dashboard that shows trends over weeks and months
- Invest in dedicated benchmark hardware for consistent, reproducible results
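The sharding point above can be as simple as a stable hash of the benchmark class name, so each CI machine always runs the same subset and results stay comparable run to run. A minimal sketch (class names are illustrative):

```python
import hashlib

def shard_for(benchmark_class: str, shard_count: int) -> int:
    """Stable shard assignment: the same class always lands on the same machine."""
    digest = hashlib.sha256(benchmark_class.encode()).hexdigest()
    return int(digest, 16) % shard_count

for name in ["StartupBenchmark", "FeedScrollBenchmark", "SearchBenchmark"]:
    print(name, shard_for(name, shard_count=2))
```

Hash-based assignment beats round-robin here because adding or removing one benchmark does not reshuffle every other benchmark onto different hardware.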
Observability
- Dashboard showing P50/P90/P99 for all tracked metrics across releases
- Trend lines with regression annotations (which commit, which PR)
- Alert on statistical regressions that exceed threshold
- Weekly performance digest sent to engineering leadership
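The P50/P90/P99 values on that dashboard can be computed from raw samples without extra dependencies. There are several percentile conventions; nearest-rank is one reasonable choice (an assumption, not a mandate) as long as the same one is used everywhere so trends stay comparable:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

startups = [780, 795, 801, 812, 790, 905, 1150, 799, 808, 785]
print(percentile(startups, 50), percentile(startups, 90))  # → 799 905
```

Tail percentiles (P99 here resolves to the single 1150 ms outlier) need far more samples than the median to be stable, which is another argument for trend lines over single-release snapshots.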
Key Takeaways
- Automate benchmark execution on every PR. Manual testing does not catch 5% regressions.
- Use statistical tests (Mann-Whitney U or similar) to distinguish real regressions from noise.
- Pin baselines to releases and update them explicitly. Rolling baselines hide gradual regressions.
- Start with generous thresholds (10%) and tighten over time. False positive fatigue kills adoption faster than missed regressions.
- Track both runtime performance and build time as first-class metrics.
- Automate bisection for regressions detected on the main branch.
See also: Building With Intent, Not Just Tools.
Further Reading
- Debugging Performance Issues in Large Android Apps: A systematic approach to identifying, isolating, and fixing performance bottlenecks in large Android codebases, covering profiling strate...
- How Garbage Collection Impacts Android Performance: A detailed look at ART's garbage collection mechanisms, how GC pauses affect frame rates, and practical strategies to minimize GC impact ...
- How I Profile Android Apps in Production: Techniques for collecting meaningful performance data from production Android apps without degrading user experience, covering sampling s...
Final Thoughts
Performance regression detection is a CI infrastructure problem, not a testing problem. The tools (Macrobenchmark, statistical analysis, automated bisection) are mature. The real challenge is operational: maintaining baselines, tuning thresholds, and building team habits around performance gates. Invest in this infrastructure early. The cost of catching a regression before merge is orders of magnitude lower than fixing it after users notice.