Designing a Simple Metrics Collection Service
Architecture for a lightweight metrics ingestion pipeline using a buffer, batch writes, and pre-aggregated rollups on Postgres.
Context
I needed to collect application metrics (request counts, latency percentiles, error rates) without depending on Datadog or similar services. The system had to ingest metrics from 20 application instances, store them efficiently, and support queries for dashboards and alerting. Total budget for the metrics infrastructure: $0 beyond existing Postgres.
Problem
Writing individual metric data points to a relational database at high frequency is a known anti-pattern. The write volume overwhelms the database, and querying raw data points for time-series visualization is slow. The solution requires buffering, batching, and pre-aggregation.
Constraints
- Ingestion rate: ~5,000 metric data points per second across all application instances
- Retention: raw data for 24 hours, 1-minute rollups for 30 days, 1-hour rollups for 1 year
- Query latency for dashboard: under 500ms for a 24-hour window
- Storage: Postgres (no TimescaleDB extension available on Neon)
- No additional services (no Kafka, no Redis, no InfluxDB)
- Metric types: counters, gauges, histograms (approximated via percentiles)
Design
Data Model
CREATE TABLE metrics_raw (
id BIGSERIAL,
metric_name TEXT NOT NULL,
tags JSONB NOT NULL DEFAULT '{}',
value DOUBLE PRECISION NOT NULL,
recorded_at TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY RANGE (recorded_at);
CREATE TABLE metrics_1m (
metric_name TEXT NOT NULL,
tags JSONB NOT NULL DEFAULT '{}',
bucket TIMESTAMPTZ NOT NULL,
count BIGINT NOT NULL,
sum DOUBLE PRECISION NOT NULL,
min DOUBLE PRECISION NOT NULL,
max DOUBLE PRECISION NOT NULL,
p50 DOUBLE PRECISION,
p95 DOUBLE PRECISION,
p99 DOUBLE PRECISION,
PRIMARY KEY (metric_name, tags, bucket)
);
CREATE TABLE metrics_1h (
metric_name TEXT NOT NULL,
tags JSONB NOT NULL DEFAULT '{}',
bucket TIMESTAMPTZ NOT NULL,
count BIGINT NOT NULL,
sum DOUBLE PRECISION NOT NULL,
min DOUBLE PRECISION NOT NULL,
max DOUBLE PRECISION NOT NULL,
p50 DOUBLE PRECISION,
p95 DOUBLE PRECISION,
p99 DOUBLE PRECISION,
PRIMARY KEY (metric_name, tags, bucket)
);

Partitioning metrics_raw by day allows efficient pruning (drop the oldest partition daily).
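The partition lifecycle can be sketched as a small DDL generator. This is an illustrative sketch: the `metrics_raw_YYYYMMDD` naming scheme and the `dailyPartitionDDL` helper are assumptions, not part of the schema above, and it pre-creates partitions ahead of time (matching the mitigation discussed under Failure Modes below).

```typescript
// Generate CREATE TABLE ... PARTITION OF statements for today plus the
// next `daysAhead` days, and a DROP for the partition that just aged out
// of the 24-hour retention window. Naming scheme is an assumption.
function dailyPartitionDDL(today: Date, daysAhead: number): string[] {
  const DAY_MS = 86_400_000;
  const fmt = (d: Date) => d.toISOString().slice(0, 10); // YYYY-MM-DD
  const name = (d: Date) => `metrics_raw_${fmt(d).replace(/-/g, '')}`;
  const ddl: string[] = [];
  for (let i = 0; i <= daysAhead; i++) {
    const from = new Date(today.getTime() + i * DAY_MS);
    const to = new Date(from.getTime() + DAY_MS);
    ddl.push(
      `CREATE TABLE IF NOT EXISTS ${name(from)} PARTITION OF metrics_raw ` +
      `FOR VALUES FROM ('${fmt(from)}') TO ('${fmt(to)}');`
    );
  }
  // Two days back: yesterday's partition still covers the 24-hour window.
  const expired = new Date(today.getTime() - 2 * DAY_MS);
  ddl.push(`DROP TABLE IF EXISTS ${name(expired)};`);
  return ddl;
}
```

A cron job (or the rollup worker itself) can execute these statements once a day; `IF NOT EXISTS` / `IF EXISTS` make the job safe to re-run.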
Ingestion Pipeline
Application Instance
-> In-memory buffer (flush every 5 seconds or 500 points)
-> HTTP POST to metrics API
-> Batch INSERT into metrics_raw
-> Rollup worker (every 60 seconds)
-> INSERT INTO metrics_1m ... ON CONFLICT UPDATE
Client-Side Buffering
Each application instance buffers metrics in memory:
interface MetricPoint {
name: string;
value: number;
tags: Record<string, string>;
timestamp: number;
}
class MetricsBuffer {
private buffer: MetricPoint[] = [];
private flushInterval: ReturnType<typeof setInterval>;
constructor(private endpoint: string) {
this.flushInterval = setInterval(() => this.flush(), 5000);
}
record(name: string, value: number, tags: Record<string, string> = {}) {
this.buffer.push({ name, value, tags, timestamp: Date.now() });
if (this.buffer.length >= 500) this.flush();
}
private async flush() {
if (this.buffer.length === 0) return;
const batch = this.buffer.splice(0);
try {
await fetch(this.endpoint, {
method: 'POST',
body: JSON.stringify(batch),
headers: { 'Content-Type': 'application/json' },
});
} catch {
// Re-queue the failed batch, but cap the total buffer size so a
// prolonged metrics API outage cannot grow memory without bound.
if (this.buffer.length + batch.length <= 10000) {
this.buffer.unshift(...batch);
}
}
}
}
Batch Insert
The metrics API receives batches and inserts them in a single statement:
INSERT INTO metrics_raw (metric_name, tags, value, recorded_at)
VALUES
($1, $2, $3, $4),
($5, $6, $7, $8),
...

A batch of 500 points inserts in 5-15ms.
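Building the multi-row VALUES clause is mechanical but easy to get wrong by one index. A minimal sketch, assuming pg-style `$n` placeholders; the `buildBatchInsert` helper is illustrative, and the `MetricPoint` shape is repeated here for self-containment.

```typescript
interface MetricPoint {
  name: string;
  value: number;
  tags: Record<string, string>;
  timestamp: number;
}

// Build one parameterized INSERT for the whole batch: a ($n, $n+1, ...)
// group per point, with values flattened into a matching parameter array.
function buildBatchInsert(batch: MetricPoint[]): { text: string; values: unknown[] } {
  const groups: string[] = [];
  const values: unknown[] = [];
  batch.forEach((p, i) => {
    const base = i * 4; // 4 columns per row
    groups.push(`($${base + 1}, $${base + 2}, $${base + 3}, $${base + 4})`);
    values.push(p.name, JSON.stringify(p.tags), p.value, new Date(p.timestamp));
  });
  return {
    text:
      'INSERT INTO metrics_raw (metric_name, tags, value, recorded_at) VALUES ' +
      groups.join(', '),
    values,
  };
}
```

The resulting `{ text, values }` pair can be handed directly to a driver's parameterized query call, keeping the whole batch in one round trip.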
Rollup Worker
Every 60 seconds, a background worker aggregates raw data into 1-minute buckets:
INSERT INTO metrics_1m (metric_name, tags, bucket, count, sum, min, max, p50, p95, p99)
SELECT
metric_name,
tags,
date_trunc('minute', recorded_at) AS bucket,
COUNT(*) AS count,
SUM(value) AS sum,
MIN(value) AS min,
MAX(value) AS max,
percentile_cont(0.50) WITHIN GROUP (ORDER BY value) AS p50,
percentile_cont(0.95) WITHIN GROUP (ORDER BY value) AS p95,
percentile_cont(0.99) WITHIN GROUP (ORDER BY value) AS p99
FROM metrics_raw
WHERE recorded_at >= $last_rollup_time
AND recorded_at < date_trunc('minute', now())
GROUP BY metric_name, tags, bucket
ON CONFLICT (metric_name, tags, bucket)
DO UPDATE SET
count = EXCLUDED.count,
sum = EXCLUDED.sum,
min = EXCLUDED.min,
max = EXCLUDED.max,
p50 = EXCLUDED.p50,
p95 = EXCLUDED.p95,
p99 = EXCLUDED.p99;

Trade-offs
| Approach | Write Throughput | Query Speed | Storage Efficiency | Accuracy |
|---|---|---|---|---|
| Raw points only | High | Slow (scan) | Low | Exact |
| Pre-aggregated only | N/A | Fast | High | Approximate |
| Raw + rollups (this design) | High | Fast (rollups) | Medium | Exact raw, approx rollups |
| TimescaleDB | Very high | Very fast | High | Exact |
The main trade-off: percentile rollups are approximations. When you roll up 1-minute p95 values into 1-hour p95 values, the result is an approximation (you cannot compute exact percentiles from percentile summaries). For operational dashboards, this is acceptable. For SLA calculations, query the raw data within its 24-hour retention window.
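The 1-minute-to-1-hour rollup illustrates this concretely: count, sum, min, and max combine exactly, while percentiles can only be approximated. One common approximation, sketched below with an illustrative `mergeRollups` helper, is a count-weighted average of the per-minute percentile values.

```typescript
interface Rollup {
  count: number;
  sum: number;
  min: number;
  max: number;
  p95: number;
}

// Merge 1-minute summaries into one hourly summary. count/sum/min/max
// combine exactly; p95 is approximated as a count-weighted average of the
// minute-level p95s, which can be off when minutes have very different
// value distributions.
function mergeRollups(minutes: Rollup[]): Rollup {
  const count = minutes.reduce((acc, m) => acc + m.count, 0);
  return {
    count,
    sum: minutes.reduce((acc, m) => acc + m.sum, 0),
    min: Math.min(...minutes.map(m => m.min)),
    max: Math.max(...minutes.map(m => m.max)),
    p95: minutes.reduce((acc, m) => acc + m.p95 * m.count, 0) / count,
  };
}
```

This is exactly why SLA calculations should go against the raw table while it is still retained: the weighted average is a plausible estimate, not a true p95.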
Storage Estimates
| Table | Rows/Day | Size/Day | Retention | Total Size |
|---|---|---|---|---|
| metrics_raw | 432M | ~20GB | 24 hours | ~20GB |
| metrics_1m | 1.44M | ~150MB | 30 days | ~4.5GB |
| metrics_1h | 24K | ~3MB | 1 year | ~1.1GB |
The raw table is by far the largest component, but daily partition drops keep it bounded at roughly one day's worth of data.
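The raw-table row count follows directly from the ingestion rate; a quick arithmetic check:

```typescript
// 5,000 points/sec sustained for a full day (86,400 seconds).
const pointsPerSecond = 5_000;
const secondsPerDay = 86_400;
const rowsPerDay = pointsPerSecond * secondsPerDay;
// 432,000,000 rows/day, matching the 432M figure in the table above
```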
Failure Modes
Buffer overflow on application crash: If an application instance crashes, its in-memory buffer is lost. At 5-second flush intervals, maximum data loss is 5 seconds of metrics per instance. For a crash that affects all instances simultaneously, up to 5 seconds of all metrics are lost. This is acceptable for operational metrics.
Rollup worker falls behind: If the rollup query takes longer than 60 seconds (due to high cardinality or database load), it will overlap with the next rollup window. Mitigation: acquire a lock before running the rollup, skip the run if the lock is held, and alert if consecutive skips exceed 3.
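The skip-and-alert logic can be sketched in-process as below. This is a simplification: a real deployment running multiple workers would take a Postgres advisory lock (e.g. `pg_try_advisory_lock`) rather than a local flag, and `runRollup` / `alert` here are placeholders for the actual rollup query and alerting hook.

```typescript
// Guard the rollup so overlapping runs are skipped rather than stacked,
// and alert once more than 3 consecutive ticks have been skipped.
class RollupGuard {
  private running = false;
  private consecutiveSkips = 0;

  constructor(
    private runRollup: () => Promise<void>,
    private alert: (msg: string) => void,
  ) {}

  // Called every 60 seconds. Returns true if the rollup ran, false if skipped.
  async tick(): Promise<boolean> {
    if (this.running) {
      this.consecutiveSkips++;
      if (this.consecutiveSkips > 3) {
        this.alert(`rollup skipped ${this.consecutiveSkips} times in a row`);
      }
      return false;
    }
    this.running = true;
    try {
      await this.runRollup();
      this.consecutiveSkips = 0;
      return true;
    } finally {
      this.running = false;
    }
  }
}
```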
High cardinality tags: If tags have high cardinality (e.g., user_id as a tag), the metrics_1m table grows explosively. With 50,000 unique user IDs and 100 metric names, that is 5M rows per minute. Mitigation: validate tag cardinality at ingestion. Reject or drop tags with cardinality above a threshold (e.g., 1,000 unique values).
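Ingestion-time validation could look like the following sketch. The `CardinalityGuard` name and the drop policy (strip only the offending tag, keep the point) are assumptions; another valid policy is rejecting the whole point.

```typescript
// Track unique values seen per tag key; once a key exceeds the limit,
// strip that tag from incoming points instead of writing it through.
class CardinalityGuard {
  private seen = new Map<string, Set<string>>();

  constructor(private limit = 1000) {}

  sanitize(tags: Record<string, string>): Record<string, string> {
    const out: Record<string, string> = {};
    for (const [key, value] of Object.entries(tags)) {
      let values = this.seen.get(key);
      if (!values) {
        values = new Set();
        this.seen.set(key, values);
      }
      // Keep the tag if we are under budget or the value is already known.
      if (values.size < this.limit || values.has(value)) {
        values.add(value);
        out[key] = value;
      }
      // Otherwise drop the tag to cap series growth in metrics_1m.
    }
    return out;
  }
}
```

A production version would also need to bound the guard's own memory (the per-key sets) and ideally surface a meta-metric when tags start being dropped.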
Partition management failure: If the daily partition creation job fails, inserts into metrics_raw fail with a routing error. Mitigation: pre-create partitions 7 days ahead and alert if the creation job fails.
Scaling Considerations
- At 50,000 data points per second, Postgres batch inserts become the bottleneck. At that scale, consider a write-ahead buffer (append-only file) that a separate process drains into Postgres.
- For multi-region deployments, each region should have its own metrics pipeline. Cross-region aggregation happens at the dashboard query layer.
- If query latency for dashboards exceeds 500ms, add materialized views for the most common dashboard queries and refresh them on a schedule.
- The migration path to TimescaleDB is straightforward: replace partition management with hypertables and rollup queries with continuous aggregates.
Observability
- Meta-metrics: track the metrics system's own performance (ingestion rate, rollup duration, buffer depth)
- Alert on rollup lag (time since last successful rollup exceeding 5 minutes)
- Monitor metrics_raw partition sizes to detect cardinality explosions
- Dashboard query latency (the dashboard itself should be fast; slow dashboard queries indicate missing indexes or excessive data)
Key Takeaways
- Buffer on the client, batch on the server, and roll up in the background. This three-stage pipeline keeps write throughput high and read latency low.
- Pre-aggregated rollups are essential for time-series queries. Never scan raw data for dashboard rendering.
- Percentile rollups are inherently approximate. Accept this or pay the cost of storing all raw data.
- High cardinality tags are the most common operational issue. Validate cardinality at ingestion, not after the database is overwhelmed.
- Partition by time for efficient data lifecycle management. Drop partitions instead of running DELETE queries.
Further Reading
- Designing a Simple Authentication Service: Architecture for a session-based authentication service with JWT access tokens, refresh token rotation, and measured security trade-offs.
- Designing a Simple CMS From Scratch: Architecture decisions behind building a file-based CMS with MDX, Git-backed versioning, and incremental builds for a content-heavy site.
- Building a Simple Search Index: Designing an inverted index from scratch with tokenization, ranking, and query parsing, then comparing it against Postgres full-text search.
Final Thoughts
This system handles 5,000 metrics per second on a single Postgres instance. The total infrastructure cost is zero beyond existing database capacity. It lacks the polish of Datadog or Grafana Cloud, but it provides the core capability: collecting, storing, and querying application metrics with bounded storage and predictable query performance. For teams that cannot justify $500+/month for a metrics SaaS, this is a viable alternative.