Why Monitoring Matters
Blind spots kill uptime. Every production outage that catches you off guard is a monitoring failure. If a disk fills up at 3 AM and nobody knows until customers start complaining, the problem is not the disk. The problem is that nobody was watching.
You cannot fix what you cannot see. Monitoring gives you visibility into what your servers are actually doing: how much CPU they are burning, how fast memory is disappearing, whether disk I/O is bottlenecked, and how close you are to running out of space. Good monitoring catches problems while they are still small. It turns 3 AM emergencies into Tuesday afternoon maintenance tasks.
The goal is simple: know about problems before your users do.
Choosing a Monitoring Stack
There are dozens of monitoring tools out there, but one stack has become the industry standard for good reason: Prometheus for metrics collection and Grafana for visualization.
Prometheus is an open-source time-series database built for monitoring. It works on a pull model -- Prometheus scrapes metrics from your servers at regular intervals and stores them locally. This means your application servers do not need to know anything about your monitoring infrastructure. They just expose metrics on an HTTP endpoint, and Prometheus handles the rest.
Grafana connects to Prometheus and turns raw metric data into dashboards you can actually read. Graphs, gauges, heatmaps, alert panels -- Grafana makes it visual. Together, they form a stack that is free, battle-tested, and runs everywhere from small startups to companies serving millions of requests per second.
The third piece is Node Exporter, a lightweight agent that runs on each server and exposes system-level metrics. CPU, memory, disk, network, filesystem -- Node Exporter collects it all automatically and makes it available for Prometheus to scrape.
Installing Node Exporter
Node Exporter is a single binary. Download the latest release from the Prometheus GitHub releases page, extract it, and run it as a system service. By default it listens on port 9100.
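A manual install can be sketched as below. The version number is an assumption -- check the releases page for the current one before downloading.

```shell
# Hypothetical version -- substitute the latest release.
VERSION="1.8.2"
TARBALL="node_exporter-${VERSION}.linux-amd64.tar.gz"
URL="https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/${TARBALL}"

echo "Download URL: ${URL}"

# On a real host you would then run:
#   curl -LO "$URL"
#   tar xzf "$TARBALL"
#   sudo mv "node_exporter-${VERSION}.linux-amd64/node_exporter" /usr/local/bin/
#   sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter
```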
Systemd Unit File
Create /etc/systemd/system/node_exporter.service:
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
Wants=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
Then enable and start it:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Once running, you can verify it by hitting http://your-server:9100/metrics in a browser. You will see a wall of text -- that is every metric Node Exporter collects. CPU usage per core, memory breakdowns, disk space per mount, network traffic per interface, and more. All of it is exposed automatically with zero configuration.
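You can also spot-check individual metrics from the command line. This sketch filters a small made-up sample of the /metrics payload; on a live host you would pipe curl output through grep the same way.

```shell
# Abridged, made-up sample of Node Exporter output, for illustration only.
metrics='node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
node_memory_MemAvailable_bytes 8.1e+09
node_filesystem_avail_bytes{mountpoint="/"} 4.2e+10'

# On a real host: curl -s http://your-server:9100/metrics | grep '^node_memory'
echo "$metrics" | grep '^node_memory'
```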
Prometheus Configuration
Prometheus needs a configuration file that tells it where to find your exporters. The default config file is prometheus.yml. Here is a minimal configuration that scrapes Node Exporter on three servers:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'server-01:9100'
          - 'server-02:9100'
          - 'server-03:9100'
The scrape_interval controls how often Prometheus pulls metrics. Fifteen seconds is a good default. Going lower gives you more granularity but increases storage and CPU usage on the Prometheus server.
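One useful extension to the static_configs above is attaching labels to targets, so queries and alerts can filter by environment. The label names here are examples -- pick ones that fit your fleet:

```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['server-01:9100', 'server-02:9100']
        labels:
          env: 'production'   # example label, not required by Prometheus
      - targets: ['server-03:9100']
        labels:
          env: 'staging'
```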
For storage retention, launch Prometheus with the --storage.tsdb.retention.time=30d flag to keep 30 days of metric history. Adjust based on your disk capacity. A single server being scraped every 15 seconds generates roughly 1-2 GB of storage per month.
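Putting the retention flag together with the config file, the launch command in a Prometheus systemd unit might look like this excerpt (paths assumed, mirroring the node_exporter unit shown earlier):

```ini
[Service]
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d
```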
Prometheus runs on port 9090 by default. Once it is running, you can query metrics directly in its built-in expression browser at http://prometheus-server:9090.
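A good first query to try in the expression browser is the built-in up metric, which Prometheus records for every scrape target:

```promql
# Returns 1 for each target Prometheus can scrape, 0 for each it cannot.
up{job="node"}
```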
Key Metrics to Track
Not every metric matters equally. Focus on the ones that tell you whether your server is healthy and whether it is trending toward trouble. Here are the essential metrics with their Prometheus names:
CPU Usage
The raw metric is node_cpu_seconds_total. To get the actual utilization percentage, use the rate() function:
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
This gives you the percentage of CPU time spent in any non-idle state, averaged across all cores over a 5-minute window.
Memory
Use node_memory_MemAvailable_bytes to track available memory. This is more accurate than calculating free memory manually because it accounts for buffers and cache that the kernel can reclaim under pressure.
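To express that as a percentage of total memory (the same ratio the alerting rules later in this article use):

```promql
# Available memory as a percentage of total, per instance
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
```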
Disk Space and I/O
Track available disk space with node_filesystem_avail_bytes per mount point. For I/O performance, watch node_disk_io_time_seconds_total, which tells you how much time the disk spent processing requests. High I/O time means your disk is the bottleneck.
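In PromQL, these two look like the queries below. The fstype filter is a common refinement (not required) that keeps pseudo-filesystems like tmpfs out of the results:

```promql
# Free space as a percentage per mount, excluding pseudo-filesystems
(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
  / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100

# Fraction of each second the disk spent busy, over a 5-minute window
rate(node_disk_io_time_seconds_total[5m])
```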
Network Traffic
Monitor bandwidth with node_network_receive_bytes_total and node_network_transmit_bytes_total. Wrap them in rate() to get bytes per second. Sudden traffic spikes can indicate a DDoS, a misbehaving application, or a backup job that is saturating your network link.
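Wrapped in rate(), the two queries look like this. Excluding the loopback interface is optional but usually what you want:

```promql
# Inbound and outbound bytes per second, excluding loopback
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])
```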
System Load
The classic load averages: node_load1, node_load5, and node_load15. Compare these against your CPU core count. A load of 4.0 on a 4-core server means you are at capacity. A load of 8.0 on that same server means processes are queuing up and things are getting slow.
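One way to make that comparison directly in PromQL is to divide load by core count, counting cores from the per-CPU idle series:

```promql
# 5-minute load average divided by CPU core count; values near or
# above 1.0 mean the server is at or over capacity
node_load5 / count by (instance) (node_cpu_seconds_total{mode="idle"})
```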
Setting Up Grafana
Grafana runs on port 3000 by default. After installation, log in with the default credentials (admin/admin) and immediately change the password.
The first step is adding Prometheus as a data source. Go to Configuration, then Data Sources, then Add Data Source. Select Prometheus and enter your Prometheus server URL (for example, http://localhost:9090 if running on the same host). Click Save and Test to verify the connection.
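If you prefer to manage Grafana as code, the same data source can be provisioned from a file instead of through the UI. A minimal sketch (the file path is the usual default; adjust for your install):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```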
Instead of building dashboards from scratch, import dashboard #1860 (Node Exporter Full). This is a community-maintained dashboard that covers every metric Node Exporter provides. Go to Dashboards, then Import, enter 1860 as the dashboard ID, select your Prometheus data source, and click Import. You will instantly have panels for CPU, memory, disk, network, and more.
Use this as your starting point. Over time, customize it. Remove panels you never look at, add panels for metrics specific to your workload, and adjust thresholds to match your environment.
Alerting Rules
Dashboards are useless at 3 AM if nobody is watching them. Alerting is what turns monitoring from passive observation into active protection. Prometheus supports alerting rules that fire when conditions are met.
Create an alerting rules file and reference it in your prometheus.yml:
rule_files:
  - 'alert_rules.yml'
Here is a practical alert_rules.yml with rules for the most critical conditions:
groups:
  - name: server_alerts
    rules:
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 15% on {{ $labels.instance }}"

      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 5% on {{ $labels.instance }}"

      - alert: HighCPU
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 90% for 5 min on {{ $labels.instance }}"

      - alert: MemoryLow
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Available memory below 10% on {{ $labels.instance }}"

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is unreachable"
To actually receive notifications, you need Alertmanager. Prometheus sends firing alerts to Alertmanager, which handles deduplication, grouping, and routing to receivers like email, Slack, PagerDuty, or webhooks. A basic Alertmanager config for email and Slack looks like this:
# Note: delivering email also requires SMTP settings (smtp_smarthost,
# smtp_from, and credentials) under a top-level `global:` block,
# omitted here for brevity.
route:
  receiver: 'team-notifications'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'team-notifications'
    email_configs:
      - to: 'ops@example.com'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
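For Prometheus to deliver alerts to Alertmanager at all, prometheus.yml also needs an alerting section pointing at it. Alertmanager listens on port 9093 by default:

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'
```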
Common Pitfalls
Alert fatigue. This is the number one monitoring killer. If your team gets 50 alerts a day and most of them are noise, people stop reading them. Every alert should require human action. If it does not, it should be a dashboard panel, not an alert. Start with fewer alerts at higher thresholds and tighten them as you learn what is normal for your environment.
Missing baselines. You need to know what "normal" looks like before you can detect "abnormal." Run your monitoring for at least two weeks before setting alert thresholds. CPU usage that looks alarming on a quiet Tuesday might be perfectly normal during a Friday deployment window. Context matters.
Ignoring disk I/O. Teams obsess over CPU and memory but ignore disk performance. A server can have 90% idle CPU and plenty of free memory and still be painfully slow because every request is waiting on disk I/O. Monitor node_disk_io_time_seconds_total and watch for sustained high values.
Not monitoring the monitor. If your Prometheus server goes down, you have no monitoring. Run Prometheus on a separate host from your application servers. Set up a basic external health check (even a simple cron job that curls the Prometheus API) so you know when monitoring itself is broken.
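The cron check mentioned above can be a single crontab line. The /-/healthy endpoint is Prometheus's built-in health check; the mail command here is a placeholder for whatever notification path you actually use:

```
*/5 * * * * curl -sf --max-time 10 http://prometheus-server:9090/-/healthy || echo "Prometheus health check failed" | mail -s "Prometheus down" ops@example.com
```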
Next Steps
Once your infrastructure monitoring is solid, expand into application-level metrics. Most languages have Prometheus client libraries that let you instrument your own code -- track request latency, error rates, queue depths, and business metrics that matter to your specific application.
For log aggregation, look at Grafana Loki. It integrates natively with Grafana and lets you correlate logs with metrics in the same dashboard. When an alert fires for high CPU, you can immediately check the logs from that time window without switching tools.
Finally, add uptime monitoring from external vantage points. Your internal monitoring will not help if the network path between your users and your servers is broken. Services like Blackbox Exporter (another Prometheus component) can probe your endpoints from the outside and alert on HTTP failures, SSL certificate expiration, and DNS resolution problems.
Monitoring is not a one-time setup. It is a practice. Every outage should lead to a new check. Every performance issue should produce a new dashboard panel. The best monitoring systems are built incrementally by teams that learn from their incidents.
Need 24/7 monitoring for your infrastructure? Learn about our monitoring and NOC services or schedule a consultation.