What Actually Runs in a Monitoring Stack

A production monitoring stack built on Prometheus has three parts, and it helps to understand the split before you touch a config file. Prometheus itself scrapes metrics from your targets on a schedule, stores them in a local time-series database (TSDB), and evaluates your alerting rules against that data. It is a pull model -- Prometheus reaches out to each target's HTTP endpoint and pulls the current numbers, rather than targets pushing data to a central collector. Grafana is a separate process that queries Prometheus and draws the dashboards. Alertmanager is a third process that receives fired alerts from Prometheus, then deduplicates, groups, and routes them to email, Slack, PagerDuty, or wherever your on-call actually looks. Three processes, each with a narrow responsibility. You can run all three on a single small VM to start.

The pull model matters more than it sounds. It means Prometheus has a live view of which targets are up: if a scrape fails, that is itself a signal you can alert on. You do not need every service to know where your monitoring server lives -- Prometheus keeps the list.

Exporters: You Probably Already Emit Metrics

Prometheus scrapes anything that exposes metrics in its text format over HTTP, conventionally at /metrics. The gap between "I have no monitoring" and "I have host metrics" is one binary: node_exporter. Install it on every machine and you get CPU, memory, disk, filesystem, and network metrics out of the box on port 9100.

Beyond hosts, most infrastructure you care about either exposes /metrics natively or has a maintained exporter. Postgres, Redis, Nginx, Blackbox (for HTTP/TCP probes), and cAdvisor for containers are all a package install away. The rule of thumb: before you write a custom collector, check whether an exporter already exists. It almost always does.

Wiring targets in is a matter of listing scrape jobs in prometheus.yml. Here is a minimal but production-shaped config -- Prometheus scraping itself plus two node_exporter hosts, with alerting rules and Alertmanager pulled in:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: node_exporter
    static_configs:
      - targets:
          - "web01:9100"
          - "db01:9100"
        labels:
          env: production

The scrape_interval of 15 seconds is a sane default. Do not chase 5-second intervals early -- you triple your sample volume, and with it your storage and query cost, for resolution you will not look at. The labels block attaches env=production to every metric from those targets, which lets you slice dashboards and alerts by environment later.
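Exporters that probe other systems are wired in as scrape jobs too, with one extra step: the address being probed is passed to the exporter as a parameter. As a rough sketch, assuming a Blackbox exporter running on the Prometheus host at its default port 9115 with its stock http_2xx module, an HTTP check could look like the following. The job name and the probed URL are placeholders, and the job belongs in the existing scrape_configs list rather than under a second scrape_configs key.

```yaml
scrape_configs:
  - job_name: blackbox_http            # illustrative name
    metrics_path: /probe               # Blackbox serves probe results here, not at /metrics
    params:
      module: [http_2xx]               # expect an HTTP 2xx response
    static_configs:
      - targets:
          - "https://example.com"      # placeholder: the URL to probe
    relabel_configs:
      # Pass the listed target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label on the resulting series
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the exporter itself (assumed address)
      - target_label: __address__
        replacement: "localhost:9115"
```

The relabeling is what distinguishes this from the node_exporter job: Prometheus scrapes the exporter, and the exporter probes the URL on its behalf.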

Storage Sizing Without Guessing

Prometheus stores everything locally by default with a 15-day retention window (--storage.tsdb.retention.time=15d). Before you assume that is not enough, do the math, because the local TSDB is remarkably efficient. After compression, each sample costs roughly 1 to 2 bytes on disk. The formula that matters:

bytes = retention_seconds x samples_per_second x bytes_per_sample

samples_per_second = active_series / scrape_interval

Work an example. Say you have 100,000 active time series scraped every 15 seconds. That is about 6,667 samples/second. Over 15 days (1,296,000 seconds) at 2 bytes/sample, you land around 17 GB. Round up generously for the write-ahead log and index overhead and you are still well under 40 GB. A first stack monitoring a few dozen hosts will not come close to that -- 10,000 series is a more realistic starting point, which is a couple of gigabytes.
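Once Prometheus is running, you can replace the assumed series count with measured numbers, because Prometheus exports metrics about its own TSDB. The sketch below assumes Prometheus scrapes itself, as in the config above, and that the file sits where the rules/*.yml pattern will pick it up. It records the real ingestion rate and applies the same formula with 15 days of retention and a pessimistic 2 bytes per sample. The file, group, and record names are made up for illustration.

```yaml
# rules/tsdb-sizing.yml (hypothetical file name)
groups:
  - name: tsdb-sizing
    rules:
      # samples_per_second, measured rather than estimated
      - record: instance:prometheus_tsdb_head_samples_appended:rate5m
        expr: sum by (instance) (rate(prometheus_tsdb_head_samples_appended_total[5m]))

      # bytes = retention_seconds x samples_per_second x bytes_per_sample
      - record: instance:prometheus_tsdb_retention_bytes:estimate
        expr: instance:prometheus_tsdb_head_samples_appended:rate5m * (15 * 86400) * 2
```

Graph prometheus_tsdb_head_series next to it to watch the active series count, and compare the estimate with prometheus_tsdb_storage_blocks_bytes, which reports what the persisted blocks actually occupy.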

Two flags control the ceiling: --storage.tsdb.retention.time for age and --storage.tsdb.retention.size for a hard disk cap (whichever hits first wins). Set the size cap so a metrics explosion cannot fill the disk and take the box down.
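How you pass these flags depends on how Prometheus is launched. As one illustration, assuming you run the official prom/prometheus image under Docker Compose (nothing in this article requires that), the flags go in the container command. The 30GB cap, host paths, and volume name are placeholders.

```yaml
# docker-compose.yml (sketch; assumes Docker Compose and the prom/prometheus image)
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./rules:/etc/prometheus/rules:ro
      - prometheus-data:/prometheus
    command:
      # Setting command replaces the image's default arguments,
      # so the config and storage paths are restated here.
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=15d"    # age limit
      - "--storage.tsdb.retention.size=30GB"   # size cap (placeholder value)

volumes:
  prometheus-data:
```

Keep the size cap comfortably below the real capacity of the volume. Also note that inside a container, localhost means the container itself, so targets such as localhost:9093 would need to become service names.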

Local TSDB is plenty for a single Prometheus watching one environment for two to four weeks. You reach for remote_write to a long-term store -- Thanos, Mimir, or Cortex -- only when you need multi-year retention, a global view across many Prometheus servers, or high availability that survives losing the local disk. That is a second project, not a day-one requirement. The remote_write block itself is small:

remote_write:
  - url: "https://mimir.internal/api/v1/push"
    queue_config:
      capacity: 10000
      max_shards: 10
      max_samples_per_send: 2000

If you are standing up your first stack, skip this. Local retention will carry you further than you expect, and long-term storage adds real operational weight.

Dashboards: Import First, Build Later

The instinct to build a dashboard from a blank canvas is a trap on day one. The Grafana community has already published thousands of dashboards. For host metrics, import "Node Exporter Full" (dashboard ID 1860) and you get CPU, memory, disk I/O, and network panels that took someone else weeks to refine. Import it, point it at your Prometheus data source, and you have real visibility in five minutes.

Where teams go wrong is treating dashboards as click-ops -- built by hand in the UI, undocumented, and lost when the VM dies. Provision them as code instead. Grafana reads dashboard JSON and data source definitions from files on disk at startup:

# /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: 'infra'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    options:
      path: /var/lib/grafana/dashboards

# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true

Drop your dashboard JSON into /var/lib/grafana/dashboards, commit the whole provisioning directory to git, and your monitoring is now reproducible. Customize the imported dashboards over time -- but do it by editing the JSON in version control, not by clicking around a UI you will forget the state of.
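If git is meant to be the only source of truth, the dashboard provider can state that explicitly. Here is a sketch using two of the provider's optional fields, reusing the paths from the example above. The 30-second rescan interval is an arbitrary choice, and allowUiUpdates is already false by default, so setting it mainly documents intent.

```yaml
# /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: 'infra'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    allowUiUpdates: false        # edits made in the UI cannot be saved over the provisioned copy
    updateIntervalSeconds: 30    # how often Grafana rescans the path for changed JSON
    options:
      path: /var/lib/grafana/dashboards
```

With this in place, a change to a dashboard's JSON on disk is picked up on the next rescan, and the UI stays a viewer rather than an editor.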

Alerts That People Will Actually Read

This is where most first stacks fail. The temptation is to alert on everything: high CPU, a full-ish disk, elevated memory. Within a week the channel is noise, on-call mutes it, and a real outage slips through. The discipline is to alert on symptoms, not causes. Nobody needs to be paged because CPU hit 90% for a minute. They need to be paged when the service is actually down, when errors are reaching users, or when a disk will fill within hours.

Three rules cover most of the ground. Note the for: clause on each -- it requires the condition to hold continuously before the alert fires, which is what stops brief spikes from flapping your pager:

groups:
  - name: symptom-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"

      - alert: DiskWillFillIn4Hours
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[1h], 4*3600) < 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} will fill within 4h"

      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service }} 5xx rate above 5%"

Look at what these do. up == 0 is true whenever a scrape fails, and the alert fires once that has lasted two minutes -- the pull model gives you liveness for free. predict_linear extrapolates the current disk trend four hours forward and warns before you run out, not after, which is the difference between a scheduled cleanup and a 3 a.m. page. The error-rate alert watches the fraction of 5xx responses, a symptom users actually feel. Each has a for: window sized to the metric: two minutes for a down host, ten for a sustained error rate, fifteen for a slow-moving disk trend.
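Alert rules can be checked before they ever reach a pager. promtool, which ships with Prometheus, can replay synthetic series against a rules file. The sketch below assumes the three rules are saved as symptom-alerts.yml next to the test file (a hypothetical layout) and exercises only InstanceDown. It checks that the for: window holds the alert back at first and then lets it fire.

```yaml
# symptom-alerts.test.yml (hypothetical file name)
rule_files:
  - symptom-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: up at 0m and 1m, failing from 2m onward
      - series: 'up{job="node_exporter", instance="web01:9100"}'
        values: '1 1 0 0 0 0 0'

    alert_rule_test:
      # One minute into the failure: still inside for: 2m, so nothing fires
      - eval_time: 3m
        alertname: InstanceDown
        exp_alerts: []

      # Three minutes into the failure: the alert should be firing
      - eval_time: 5m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node_exporter
              instance: web01:9100
            exp_annotations:
              summary: "web01:9100 is down"
```

Run it with promtool test rules symptom-alerts.test.yml. The same pattern extends to the other two rules by supplying input series for the metrics they read.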

Start with these three, wire them to one Alertmanager receiver, and resist adding more until something breaks that these did not catch. An alert you would not wake up for should not page you -- make it a dashboard panel instead.
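A single-receiver Alertmanager config is short. The following is a minimal sketch assuming Slack as the destination. The receiver name, channel, webhook URL, and timing values are placeholders to adjust, not recommendations taken from the rules above.

```yaml
# alertmanager.yml (sketch; one route, one receiver)
route:
  receiver: oncall-slack                  # everything goes to the single receiver
  group_by: ["alertname", "instance"]     # alerts sharing these labels become one notification
  group_wait: 30s                         # wait briefly so related alerts arrive together
  group_interval: 5m                      # minimum gap between updates for the same group
  repeat_interval: 4h                     # re-notify if the alert is still firing

receivers:
  - name: oncall-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/ME"   # placeholder webhook URL
        channel: "#oncall"                                        # placeholder channel
        send_resolved: true                                       # also post when the alert clears
```

Grouping is what turns ten hosts failing at once into a handful of messages instead of a flood. Routing by severity can be added under route later, once there is a second receiver to route to.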

The Honest Takeaway

A useful first stack is smaller than you think: one VM running Prometheus, Grafana, and Alertmanager; node_exporter on every host; an imported dashboard; and three symptom-based alerts. That is a genuinely production-grade starting point, and you can stand it up in an afternoon. The mistakes that hurt later are mostly premature optimizations -- reaching for Thanos before you need long-term retention, shortening scrape intervals, or drowning the team in cause-based alerts nobody reads. Add complexity when a concrete problem demands it, keep your config and dashboards in git, and let the pull model do the boring liveness work for you. Config specifics here were checked against current Prometheus and Grafana documentation.
