Designing a Simple Metrics Collection Service
Architecture for a lightweight metrics ingestion pipeline using a buffer, batch writes, and pre-aggregated rollups on Postgres.
Context
I needed to collect application metrics (request counts, latency percentiles, error rates) without depending on Datadog or similar services. The system had to ingest metrics from 20 application instances, store them efficiently, and support queries for dashboards and alerting. Total budget for the metrics infrastructure: $0 beyond existing Postgres.
Problem
Writing individual metric data points to a relational database at high frequency is a known anti-pattern. The write volume overwhelms the database, and querying raw data points for time-series visualization is slow. The solution requires buffering, batching, and pre-aggregation.
Constraints
- Ingestion rate: ~5,000 metric data points per second across all application instances
- Retention: raw data for 24 hours, 1-minute rollups for 30 days, 1-hour rollups for 1 year
- Query latency for dashboard: under 500ms for a 24-hour window
- Storage: Postgres (no TimescaleDB extension available on Neon)
- No additional services (no Kafka, no Redis, no InfluxDB)
- Metric types: counters, gauges, histograms (approximated via percentiles)
Design
Data Model
CREATE TABLE metrics_raw (
id BIGSERIAL,
metric_name TEXT NOT NULL,
tags JSONB NOT NULL DEFAULT '{}',
value DOUBLE PRECISION NOT NULL,
recorded_at TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY RANGE (recorded_at);
CREATE TABLE metrics_1m (
metric_name TEXT NOT NULL,
tags JSONB NOT NULL DEFAULT '{}',
bucket TIMESTAMPTZ NOT NULL,
count BIGINT NOT NULL,
sum DOUBLE PRECISION NOT NULL,
min DOUBLE PRECISION NOT NULL,
max DOUBLE PRECISION NOT NULL,
p50 DOUBLE PRECISION,
p95 DOUBLE PRECISION,
p99 DOUBLE PRECISION,
PRIMARY KEY (metric_name, tags, bucket)
);
CREATE TABLE metrics_1h (
metric_name TEXT NOT NULL,
tags JSONB NOT NULL DEFAULT '{}',
bucket TIMESTAMPTZ NOT NULL,
count BIGINT NOT NULL,
sum DOUBLE PRECISION NOT NULL,
min DOUBLE PRECISION NOT NULL,
max DOUBLE PRECISION NOT NULL,
p50 DOUBLE PRECISION,
p95 DOUBLE PRECISION,
p99 DOUBLE PRECISION,
PRIMARY KEY (metric_name, tags, bucket)
);

Partitioning metrics_raw by day allows efficient pruning (drop the oldest partition daily).
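The partition lifecycle can be sketched as a small DDL generator. This is an illustrative sketch: the `metrics_raw_YYYYMMDD` naming scheme and the `dailyPartitionDDL` helper are assumptions, not part of the schema above, and it pre-creates partitions ahead of time (matching the mitigation discussed under Failure Modes below).

```typescript
// Generate CREATE TABLE ... PARTITION OF statements for today plus the
// next `daysAhead` days, and a DROP for the partition that just aged out
// of the 24-hour retention window. Naming scheme is an assumption.
function dailyPartitionDDL(today: Date, daysAhead: number): string[] {
  const DAY_MS = 86_400_000;
  const fmt = (d: Date) => d.toISOString().slice(0, 10); // YYYY-MM-DD
  const name = (d: Date) => `metrics_raw_${fmt(d).replace(/-/g, '')}`;
  const ddl: string[] = [];
  for (let i = 0; i <= daysAhead; i++) {
    const from = new Date(today.getTime() + i * DAY_MS);
    const to = new Date(from.getTime() + DAY_MS);
    ddl.push(
      `CREATE TABLE IF NOT EXISTS ${name(from)} PARTITION OF metrics_raw ` +
      `FOR VALUES FROM ('${fmt(from)}') TO ('${fmt(to)}');`
    );
  }
  // Two days back: yesterday's partition still covers the 24-hour window.
  const expired = new Date(today.getTime() - 2 * DAY_MS);
  ddl.push(`DROP TABLE IF EXISTS ${name(expired)};`);
  return ddl;
}
```

A cron job (or the rollup worker itself) can execute these statements once a day; `IF NOT EXISTS` / `IF EXISTS` make the job safe to re-run.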
Ingestion Pipeline
Application Instance
-> In-memory buffer (flush every 5 seconds or 500 points)
-> HTTP POST to metrics API
-> Batch INSERT into metrics_raw
-> Rollup worker (every 60 seconds)
-> INSERT INTO metrics_1m ... ON CONFLICT UPDATE
Client-Side Buffering
Each application instance buffers metrics in memory:
interface MetricPoint {
name: string;
value: number;
tags: Record<string, string>;
timestamp: number;
}
class MetricsBuffer {
private buffer: MetricPoint[] = [];
private flushInterval: ReturnType<typeof setInterval>;
constructor(private endpoint: string) {
this.flushInterval = setInterval(() => this.flush(), 5000);
}
record(name: string, value: number, tags: Record<string, string> = {}) {
this.buffer.push({ name, value, tags, timestamp: Date.now() });
if (this.buffer.length >= 500) this.flush();
}
private async flush() {
if (this.buffer.length === 0) return;
const batch = this.buffer.splice(0);
try {
await fetch(this.endpoint, {
method: 'POST',
body: JSON.stringify(batch),
headers: { 'Content-Type': 'application/json' },
});
} catch {
// Re-queue the failed batch, but cap the total buffer size so a
// prolonged metrics API outage cannot grow memory without bound.
if (this.buffer.length + batch.length <= 10000) {
this.buffer.unshift(...batch);
}
}
}
}
Batch Insert
The metrics API receives batches and inserts them in a single statement:
INSERT INTO metrics_raw (metric_name, tags, value, recorded_at)
VALUES
($1, $2, $3, $4),
($5, $6, $7, $8),
...

A batch of 500 points inserts in 5-15ms.
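Building the multi-row VALUES clause is mechanical but easy to get wrong by one index. A minimal sketch, assuming pg-style `$n` placeholders; the `buildBatchInsert` helper is illustrative, and the `MetricPoint` shape is repeated here for self-containment.

```typescript
interface MetricPoint {
  name: string;
  value: number;
  tags: Record<string, string>;
  timestamp: number;
}

// Build one parameterized INSERT for the whole batch: a ($n, $n+1, ...)
// group per point, with values flattened into a matching parameter array.
function buildBatchInsert(batch: MetricPoint[]): { text: string; values: unknown[] } {
  const groups: string[] = [];
  const values: unknown[] = [];
  batch.forEach((p, i) => {
    const base = i * 4; // 4 columns per row
    groups.push(`($${base + 1}, $${base + 2}, $${base + 3}, $${base + 4})`);
    values.push(p.name, JSON.stringify(p.tags), p.value, new Date(p.timestamp));
  });
  return {
    text:
      'INSERT INTO metrics_raw (metric_name, tags, value, recorded_at) VALUES ' +
      groups.join(', '),
    values,
  };
}
```

The resulting `{ text, values }` pair can be handed directly to a driver's parameterized query call, keeping the whole batch in one round trip.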
Rollup Worker
Every 60 seconds, a background worker aggregates raw data into 1-minute buckets:
INSERT INTO metrics_1m (metric_name, tags, bucket, count, sum, min, max, p50, p95, p99)
SELECT
metric_name,
tags,
date_trunc('minute', recorded_at) AS bucket,
COUNT(*) AS count,
SUM(value) AS sum,
MIN(value) AS min,
MAX(value) AS max,
percentile_cont(0.50) WITHIN GROUP (ORDER BY value) AS p50,
percentile_cont(0.95) WITHIN GROUP (ORDER BY value) AS p95,
percentile_cont(0.99) WITHIN GROUP (ORDER BY value) AS p99
FROM metrics_raw
WHERE recorded_at >= $last_rollup_time
AND recorded_at < date_trunc('minute', now())
GROUP BY metric_name, tags, bucket
ON CONFLICT (metric_name, tags, bucket)
DO UPDATE SET
count = EXCLUDED.count,
sum = EXCLUDED.sum,
min = EXCLUDED.min,
max = EXCLUDED.max,
p50 = EXCLUDED.p50,
p95 = EXCLUDED.p95,
p99 = EXCLUDED.p99;

Trade-offs
| Approach | Write Throughput | Query Speed | Storage Efficiency | Accuracy |
|---|---|---|---|---|
| Raw points only | High | Slow (scan) | Low | Exact |
| Pre-aggregated only | N/A | Fast | High | Approximate |
| Raw + rollups (this design) | High | Fast (rollups) | Medium | Exact raw, approx rollups |
| TimescaleDB | Very high | Very fast | High | Exact |
The main trade-off: percentile rollups are approximations. When you roll up 1-minute p95 values into 1-hour p95 values, the result is an approximation (you cannot compute exact percentiles from percentile summaries). For operational dashboards, this is acceptable. For SLA calculations, query the raw data within its 24-hour retention window.
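The 1-minute-to-1-hour rollup illustrates this concretely: count, sum, min, and max combine exactly, while percentiles can only be approximated. One common approximation, sketched below with an illustrative `mergeRollups` helper, is a count-weighted average of the per-minute percentile values.

```typescript
interface Rollup {
  count: number;
  sum: number;
  min: number;
  max: number;
  p95: number;
}

// Merge 1-minute summaries into one hourly summary. count/sum/min/max
// combine exactly; p95 is approximated as a count-weighted average of the
// minute-level p95s, which can be off when minutes have very different
// value distributions.
function mergeRollups(minutes: Rollup[]): Rollup {
  const count = minutes.reduce((acc, m) => acc + m.count, 0);
  return {
    count,
    sum: minutes.reduce((acc, m) => acc + m.sum, 0),
    min: Math.min(...minutes.map(m => m.min)),
    max: Math.max(...minutes.map(m => m.max)),
    p95: minutes.reduce((acc, m) => acc + m.p95 * m.count, 0) / count,
  };
}
```

This is exactly why SLA calculations should go against the raw table while it is still retained: the weighted average is a plausible estimate, not a true p95.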
Storage Estimates
| Table | Rows/Day | Size/Day | Retention | Total Size |
|---|---|---|---|---|
| metrics_raw | 432M | ~20GB | 24 hours | ~20GB |
| metrics_1m | 1.44M | ~150MB | 30 days | ~4.5GB |
| metrics_1h | 24K | ~3MB | 1 year | ~1.1GB |
The raw table is by far the largest component, but daily partition drops keep it bounded at roughly one day's worth of data.
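The raw-table row count follows directly from the ingestion rate; a quick arithmetic check:

```typescript
// 5,000 points/sec sustained for a full day (86,400 seconds).
const pointsPerSecond = 5_000;
const secondsPerDay = 86_400;
const rowsPerDay = pointsPerSecond * secondsPerDay;
// 432,000,000 rows/day, matching the 432M figure in the table above
```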
Failure Modes
Buffer overflow on application crash: If an application instance crashes, its in-memory buffer is lost. At 5-second flush intervals, maximum data loss is 5 seconds of metrics per instance. For a crash that affects all instances simultaneously, up to 5 seconds of all metrics are lost. This is acceptable for operational metrics.
Rollup worker falls behind: If the rollup query takes longer than 60 seconds (due to high cardinality or database load), it will overlap with the next rollup window. Mitigation: acquire a lock before running the rollup, skip the run if the lock is held, and alert if consecutive skips exceed 3.
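The skip-and-alert logic can be sketched in-process as below. This is a simplification: a real deployment running multiple workers would take a Postgres advisory lock (e.g. `pg_try_advisory_lock`) rather than a local flag, and `runRollup` / `alert` here are placeholders for the actual rollup query and alerting hook.

```typescript
// Guard the rollup so overlapping runs are skipped rather than stacked,
// and alert once more than 3 consecutive ticks have been skipped.
class RollupGuard {
  private running = false;
  private consecutiveSkips = 0;

  constructor(
    private runRollup: () => Promise<void>,
    private alert: (msg: string) => void,
  ) {}

  // Called every 60 seconds. Returns true if the rollup ran, false if skipped.
  async tick(): Promise<boolean> {
    if (this.running) {
      this.consecutiveSkips++;
      if (this.consecutiveSkips > 3) {
        this.alert(`rollup skipped ${this.consecutiveSkips} times in a row`);
      }
      return false;
    }
    this.running = true;
    try {
      await this.runRollup();
      this.consecutiveSkips = 0;
      return true;
    } finally {
      this.running = false;
    }
  }
}
```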
High cardinality tags: If tags have high cardinality (e.g., user_id as a tag), the metrics_1m table grows explosively. With 50,000 unique user IDs and 100 metric names, that is 5M rows per minute. Mitigation: validate tag cardinality at ingestion. Reject or drop tags with cardinality above a threshold (e.g., 1,000 unique values).
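Ingestion-time validation could look like the following sketch. The `CardinalityGuard` name and the drop policy (strip only the offending tag, keep the point) are assumptions; another valid policy is rejecting the whole point.

```typescript
// Track unique values seen per tag key; once a key exceeds the limit,
// strip that tag from incoming points instead of writing it through.
class CardinalityGuard {
  private seen = new Map<string, Set<string>>();

  constructor(private limit = 1000) {}

  sanitize(tags: Record<string, string>): Record<string, string> {
    const out: Record<string, string> = {};
    for (const [key, value] of Object.entries(tags)) {
      let values = this.seen.get(key);
      if (!values) {
        values = new Set();
        this.seen.set(key, values);
      }
      // Keep the tag if we are under budget or the value is already known.
      if (values.size < this.limit || values.has(value)) {
        values.add(value);
        out[key] = value;
      }
      // Otherwise drop the tag to cap series growth in metrics_1m.
    }
    return out;
  }
}
```

A production version would also need to bound the guard's own memory (the per-key sets) and ideally surface a meta-metric when tags start being dropped.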
Partition management failure: If the daily partition creation job fails, inserts into metrics_raw fail with a routing error. Mitigation: pre-create partitions 7 days ahead and alert if the creation job fails.
Scaling Considerations
- At 50,000 data points per second, Postgres batch inserts become the bottleneck. At that scale, consider a write-ahead buffer (append-only file) that a separate process drains into Postgres.
- For multi-region deployments, each region should have its own metrics pipeline. Cross-region aggregation happens at the dashboard query layer.
- If query latency for dashboards exceeds 500ms, add materialized views for the most common dashboard queries and refresh them on a schedule.
- The migration path to TimescaleDB is straightforward: replace partition management with hypertables and rollup queries with continuous aggregates.
Observability
- Meta-metrics: track the metrics system's own performance (ingestion rate, rollup duration, buffer depth)
- Alert on rollup lag (time since last successful rollup exceeding 5 minutes)
- Monitor metrics_raw partition sizes to detect cardinality explosions
- Dashboard query latency (the dashboard itself should be fast; slow dashboard queries indicate missing indexes or excessive data)
Key Takeaways
- Buffer on the client, batch on the server, and roll up in the background. This three-stage pipeline keeps write throughput high and read latency low.
- Pre-aggregated rollups are essential for time-series queries. Never scan raw data for dashboard rendering.
- Percentile rollups are inherently approximate. Accept this or pay the cost of storing all raw data.
- High cardinality tags are the most common operational issue. Validate cardinality at ingestion, not after the database is overwhelmed.
- Partition by time for efficient data lifecycle management. Drop partitions instead of running DELETE queries.
Further Reading
- Designing a Simple Authentication Service: Architecture for a session-based authentication service with JWT access tokens, refresh token rotation, and measured security trade-offs.
- Designing a Simple CMS From Scratch: Architecture decisions behind building a file-based CMS with MDX, Git-backed versioning, and incremental builds for a content-heavy site.
- Building a Simple Search Index: Designing an inverted index from scratch with tokenization, ranking, and query parsing, then comparing it against Postgres full-text search.
Final Thoughts
This system handles 5,000 metrics per second on a single Postgres instance. The total infrastructure cost is zero beyond existing database capacity. It lacks the polish of Datadog or Grafana Cloud, but it provides the core capability: collecting, storing, and querying application metrics with bounded storage and predictable query performance. For teams that cannot justify $500+/month for a metrics SaaS, this is a viable alternative.