Designing a Simple Metrics Collection Service

Dhruval Dhameliya·October 6, 2025·8 min read

Architecture for a lightweight metrics ingestion pipeline using a buffer, batch writes, and pre-aggregated rollups on Postgres.

Context

I needed to collect application metrics (request counts, latency percentiles, error rates) without depending on Datadog or similar services. The system had to ingest metrics from 20 application instances, store them efficiently, and support queries for dashboards and alerting. Total budget for the metrics infrastructure: $0 beyond existing Postgres.

Problem

Writing individual metric data points to a relational database at high frequency is a known anti-pattern. The write volume overwhelms the database, and querying raw data points for time-series visualization is slow. The solution requires buffering, batching, and pre-aggregation.

Constraints

  • Ingestion rate: ~5,000 metric data points per second across all application instances
  • Retention: raw data for 24 hours, 1-minute rollups for 30 days, 1-hour rollups for 1 year
  • Query latency for dashboard: under 500ms for a 24-hour window
  • Storage: Postgres (no TimescaleDB extension available on Neon)
  • No additional services (no Kafka, no Redis, no InfluxDB)
  • Metric types: counters, gauges, histograms (approximated via percentiles)

See also: Building a View Counter System With Postgres.

Design

Data Model

CREATE TABLE metrics_raw (
  id BIGSERIAL,
  metric_name TEXT NOT NULL,
  tags JSONB NOT NULL DEFAULT '{}',
  value DOUBLE PRECISION NOT NULL,
  recorded_at TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY RANGE (recorded_at);
 
CREATE TABLE metrics_1m (
  metric_name TEXT NOT NULL,
  tags JSONB NOT NULL DEFAULT '{}',
  bucket TIMESTAMPTZ NOT NULL,
  count BIGINT NOT NULL,
  sum DOUBLE PRECISION NOT NULL,
  min DOUBLE PRECISION NOT NULL,
  max DOUBLE PRECISION NOT NULL,
  p50 DOUBLE PRECISION,
  p95 DOUBLE PRECISION,
  p99 DOUBLE PRECISION,
  PRIMARY KEY (metric_name, tags, bucket)
);
 
CREATE TABLE metrics_1h (
  metric_name TEXT NOT NULL,
  tags JSONB NOT NULL DEFAULT '{}',
  bucket TIMESTAMPTZ NOT NULL,
  count BIGINT NOT NULL,
  sum DOUBLE PRECISION NOT NULL,
  min DOUBLE PRECISION NOT NULL,
  max DOUBLE PRECISION NOT NULL,
  p50 DOUBLE PRECISION,
  p95 DOUBLE PRECISION,
  p99 DOUBLE PRECISION,
  PRIMARY KEY (metric_name, tags, bucket)
);

Partitioning metrics_raw by day allows efficient pruning (drop the oldest partition daily).
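As a reference point, the daily partition maintenance might look like the following (partition names and dates are illustrative, and the job should run several days ahead so a single missed run does not break inserts):

```sql
-- Pre-create an upcoming daily partition. Running this ahead of time
-- means a missed job does not cause insert routing errors.
CREATE TABLE IF NOT EXISTS metrics_raw_2025_10_07
  PARTITION OF metrics_raw
  FOR VALUES FROM ('2025-10-07') TO ('2025-10-08');

-- Retire expired raw data by dropping the oldest partition: a fast
-- metadata operation, far cheaper than DELETE on 432M rows.
DROP TABLE IF EXISTS metrics_raw_2025_10_05;
```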

Ingestion Pipeline

Application Instance
  -> In-memory buffer (flush every 5 seconds or 500 points)
    -> HTTP POST to metrics API
      -> Batch INSERT into metrics_raw
        -> Rollup worker (every 60 seconds)
          -> INSERT INTO metrics_1m ... ON CONFLICT UPDATE

Client-Side Buffering

Each application instance buffers metrics in memory:

interface MetricPoint {
  name: string;
  value: number;
  tags: Record<string, string>;
  timestamp: number;
}

class MetricsBuffer {
  private buffer: MetricPoint[] = [];
  private flushInterval: ReturnType<typeof setInterval>;

  constructor(private endpoint: string) {
    this.flushInterval = setInterval(() => this.flush(), 5000);
  }

  record(name: string, value: number, tags: Record<string, string> = {}) {
    this.buffer.push({ name, value, tags, timestamp: Date.now() });
    // Size-based flush: don't wait for the timer once the batch is full.
    if (this.buffer.length >= 500) this.flush();
  }

  // Call on shutdown to stop the timer and push out remaining points.
  async stop() {
    clearInterval(this.flushInterval);
    await this.flush();
  }

  private async flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer.splice(0);
    try {
      await fetch(this.endpoint, {
        method: 'POST',
        body: JSON.stringify(batch),
        headers: { 'Content-Type': 'application/json' },
      });
    } catch {
      // Re-queue the failed batch, capped so a prolonged outage
      // cannot grow the buffer without bound.
      if (this.buffer.length < 10000) {
        this.buffer.unshift(...batch);
      }
    }
  }
}

Batch Insert

Related: Event Tracking System Design for Android Applications.

The metrics API receives batches and inserts them in a single statement:

INSERT INTO metrics_raw (metric_name, tags, value, recorded_at)
VALUES
  ($1, $2, $3, $4),
  ($5, $6, $7, $8),
  ...

A batch of 500 points inserts in 5-15ms.
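The multi-row statement can be generated mechanically from the batch: one placeholder group per point plus a flat parameter array. A sketch (the `MetricPoint` shape mirrors the client buffer, and the resulting `{ text, values }` pair is what you would hand to a Postgres client such as node-postgres):

```typescript
interface MetricPoint {
  name: string;
  value: number;
  tags: Record<string, string>;
  timestamp: number;
}

// Build one parameterized INSERT for the whole batch:
// ($1, $2, $3, $4), ($5, $6, $7, $8), ... with matching values.
function buildBatchInsert(batch: MetricPoint[]): { text: string; values: unknown[] } {
  const groups: string[] = [];
  const values: unknown[] = [];
  batch.forEach((p, i) => {
    const o = i * 4; // placeholder offset for this point
    groups.push(`($${o + 1}, $${o + 2}, $${o + 3}, $${o + 4})`);
    values.push(p.name, JSON.stringify(p.tags), p.value, new Date(p.timestamp));
  });
  return {
    text:
      'INSERT INTO metrics_raw (metric_name, tags, value, recorded_at) VALUES ' +
      groups.join(', '),
    values,
  };
}
```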

Rollup Worker

Every 60 seconds, a background worker aggregates raw data into 1-minute buckets:

INSERT INTO metrics_1m (metric_name, tags, bucket, count, sum, min, max, p50, p95, p99)
SELECT
  metric_name,
  tags,
  date_trunc('minute', recorded_at) AS bucket,
  COUNT(*) AS count,
  SUM(value) AS sum,
  MIN(value) AS min,
  MAX(value) AS max,
  percentile_cont(0.50) WITHIN GROUP (ORDER BY value) AS p50,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY value) AS p95,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY value) AS p99
FROM metrics_raw
WHERE recorded_at >= $last_rollup_time
  AND recorded_at < date_trunc('minute', now())
GROUP BY metric_name, tags, bucket
ON CONFLICT (metric_name, tags, bucket)
DO UPDATE SET
  count = EXCLUDED.count,
  sum = EXCLUDED.sum,
  min = EXCLUDED.min,
  max = EXCLUDED.max,
  p50 = EXCLUDED.p50,
  p95 = EXCLUDED.p95,
  p99 = EXCLUDED.p99;

Trade-offs

| Approach                    | Write Throughput | Query Speed    | Storage Efficiency | Accuracy                  |
|-----------------------------|------------------|----------------|--------------------|---------------------------|
| Raw points only             | High             | Slow (scan)    | Low                | Exact                     |
| Pre-aggregated only         | N/A              | Fast           | High               | Approximate               |
| Raw + rollups (this design) | High             | Fast (rollups) | Medium             | Exact raw, approx rollups |
| TimescaleDB                 | Very high        | Very fast      | High               | Exact                     |

The main trade-off: percentile rollups are approximations. When you roll up 1-minute p95 values into 1-hour p95 values, the result is an approximation (you cannot compute exact percentiles from percentile summaries). For operational dashboards, this is acceptable. For SLA calculations, query the raw data within its 24-hour retention window.
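The article shows only the 1-minute rollup; the 1-hour rollup follows the same shape over metrics_1m. A sketch, with the approximation made explicit: count, sum, min, and max compose exactly, while the percentile columns here carry forward a count-weighted average of the minute-level values, which is one possible heuristic, not an exact percentile:

```sql
INSERT INTO metrics_1h (metric_name, tags, bucket, count, sum, min, max, p50, p95, p99)
SELECT
  metric_name,
  tags,
  date_trunc('hour', bucket) AS hour_bucket,
  SUM(count),
  SUM(sum),
  MIN(min),
  MAX(max),
  -- Approximation: exact percentiles cannot be recombined from summaries.
  SUM(p50 * count) / SUM(count),
  SUM(p95 * count) / SUM(count),
  SUM(p99 * count) / SUM(count)
FROM metrics_1m
WHERE bucket >= date_trunc('hour', now()) - interval '1 hour'
  AND bucket < date_trunc('hour', now())
GROUP BY metric_name, tags, date_trunc('hour', bucket)
ON CONFLICT (metric_name, tags, bucket) DO UPDATE SET
  count = EXCLUDED.count, sum = EXCLUDED.sum,
  min = EXCLUDED.min, max = EXCLUDED.max,
  p50 = EXCLUDED.p50, p95 = EXCLUDED.p95, p99 = EXCLUDED.p99;
```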

Storage Estimates

| Table       | Rows/Day | Size/Day | Retention | Total Size |
|-------------|----------|----------|-----------|------------|
| metrics_raw | 432M     | ~20GB    | 24 hours  | ~20GB      |
| metrics_1m  | 1.44M    | ~150MB   | 30 days   | ~4.5GB     |
| metrics_1h  | 24K      | ~3MB     | 1 year    | ~1.1GB     |

The raw table is large. Daily partition drops keep it bounded.

Failure Modes

Buffer overflow on application crash: If an application instance crashes, its in-memory buffer is lost. At 5-second flush intervals, maximum data loss is 5 seconds of metrics per instance. For a crash that affects all instances simultaneously, up to 5 seconds of all metrics are lost. This is acceptable for operational metrics.

Rollup worker falls behind: If the rollup query takes longer than 60 seconds (due to high cardinality or database load), it will overlap with the next rollup window. Mitigation: acquire a lock before running the rollup, skip the run if the lock is held, and alert if consecutive skips exceed 3.
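The skip-if-held behavior maps naturally onto a Postgres advisory lock, so no extra infrastructure is needed. A sketch (the lock key 421 is an arbitrary constant reserved for this job):

```sql
-- Returns true if the lock was acquired. If false, a previous rollup
-- is still running: skip this cycle instead of piling up.
SELECT pg_try_advisory_lock(421) AS acquired;

-- ...run the rollup query only when acquired is true...

-- Release so the next scheduled run can proceed.
SELECT pg_advisory_unlock(421);
```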

High cardinality tags: If tags have high cardinality (e.g., user_id as a tag), the metrics_1m table grows explosively. With 50,000 unique user IDs and 100 metric names, that is 5M rows per minute. Mitigation: validate tag cardinality at ingestion. Reject or drop tags with cardinality above a threshold (e.g., 1,000 unique values).
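Cardinality validation at the ingestion endpoint can be a simple in-memory guard. A sketch, assuming the 1,000-value threshold above; since the per-key sets live in process memory and reset on restart, this is a soft limit rather than an exact one:

```typescript
const MAX_TAG_CARDINALITY = 1000;

// Distinct values observed per tag key in this process.
const seenValues = new Map<string, Set<string>>();

// Returns the point's tags with any over-limit tag keys dropped.
function enforceCardinality(tags: Record<string, string>): Record<string, string> {
  const safe: Record<string, string> = {};
  for (const [key, value] of Object.entries(tags)) {
    let values = seenValues.get(key);
    if (!values) {
      values = new Set();
      seenValues.set(key, values);
    }
    // Keep the tag if we are under the limit, or if this value was
    // already seen (known values stay queryable after the cap is hit).
    if (values.size < MAX_TAG_CARDINALITY || values.has(value)) {
      values.add(value);
      safe[key] = value;
    }
    // Otherwise drop the tag: its cardinality exceeded the threshold.
  }
  return safe;
}
```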

Partition management failure: If the daily partition creation job fails, inserts into metrics_raw fail with a routing error. Mitigation: pre-create partitions 7 days ahead and alert if the creation job fails.

Scaling Considerations

  • At 50,000 data points per second, Postgres batch inserts become the bottleneck. At that scale, consider a write-ahead buffer (append-only file) that a separate process drains into Postgres.
  • For multi-region deployments, each region should have its own metrics pipeline. Cross-region aggregation happens at the dashboard query layer.
  • If query latency for dashboards exceeds 500ms, add materialized views for the most common dashboard queries and refresh them on a schedule.
  • The migration path to TimescaleDB is straightforward: replace partition management with hypertables and rollup queries with continuous aggregates.

Observability

  • Meta-metrics: track the metrics system's own performance (ingestion rate, rollup duration, buffer depth)
  • Alert on rollup lag (time since last successful rollup exceeding 5 minutes)
  • Monitor metrics_raw partition sizes to detect cardinality explosions
  • Dashboard query latency (the dashboard itself should be fast; slow dashboard queries indicate missing indexes or excessive data)
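Rollup lag, for instance, can be read straight from the rollup table. A sketch of the check behind the 5-minute alert:

```sql
-- Minutes since the newest completed 1-minute rollup bucket.
-- Alert when this exceeds 5.
SELECT EXTRACT(EPOCH FROM (now() - max(bucket))) / 60 AS rollup_lag_minutes
FROM metrics_1m;
```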

Key Takeaways

  • Buffer on the client, batch on the server, and roll up in the background. This three-stage pipeline keeps write throughput high and read latency low.
  • Pre-aggregated rollups are essential for time-series queries. Never scan raw data for dashboard rendering.
  • Percentile rollups are inherently approximate. Accept this or pay the cost of storing all raw data.
  • High cardinality tags are the most common operational issue. Validate cardinality at ingestion, not after the database is overwhelmed.
  • Partition by time for efficient data lifecycle management. Drop partitions instead of running DELETE queries.

Further Reading

  • Designing a Simple Authentication Service: Architecture for a session-based authentication service with JWT access tokens, refresh token rotation, and measured security trade-offs.
  • Designing a Simple CMS From Scratch: Architecture decisions behind building a file-based CMS with MDX, Git-backed versioning, and incremental builds for a content-heavy site.
  • Building a Simple Search Index: Designing an inverted index from scratch with tokenization, ranking, and query parsing, then comparing it against Postgres full-text search.

Final Thoughts

This system handles 5,000 metrics per second on a single Postgres instance. The total infrastructure cost is zero beyond existing database capacity. It lacks the polish of Datadog or Grafana Cloud, but it provides the core capability: collecting, storing, and querying application metrics with bounded storage and predictable query performance. For teams that cannot justify $500+/month for a metrics SaaS, this is a viable alternative.
