What You'll Learn
How Prometheus works, its pull-based architecture, PromQL basics, Node Exporter for infrastructure metrics, and Alertmanager for notifications.
What is Prometheus?
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud and now maintained as a Cloud Native Computing Foundation (CNCF) project. It is the de facto standard for monitoring Kubernetes and cloud-native environments.
📥 Pull-Based Model
Unlike Datadog or New Relic where agents push data, Prometheus actively reaches out to HTTP endpoints (e.g. `/metrics`) on your applications to "scrape" data.
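To make the pull model concrete, here is a minimal sketch of what an application exposes for Prometheus to scrape. It uses only the Python standard library; in practice you would more likely use an official client library. The names `render_metrics`, `MetricsHandler`, `serve`, the port `8080`, and the sample counter are all assumptions made for illustration, and the sketch skips label-value escaping.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter: label pairs -> current value.
REQUESTS = {(("method", "GET"), ("status", "200")): 0}


def render_metrics(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to a number.
    Label values are assumed to need no escaping (a simplification).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter values whenever Prometheus scrapes /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(
            "http_requests_total", "Total HTTP requests.", REQUESTS
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. The application never contacts Prometheus; it only answers when Prometheus asks, which is the point of the pull model.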
⏱ Time-Series DB
Prometheus stores data as time series: streams of timestamped values that share the same metric name and the same set of key-value labels.
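The data model can be pictured with a small in-memory sketch. This is not how Prometheus implements storage on disk; it only illustrates that a series is identified by its metric name plus its label set. The class name `TinyTSDB` and its methods are hypothetical.

```python
from collections import defaultdict


class TinyTSDB:
    """Toy store: one list of (timestamp, value) samples per unique series."""

    def __init__(self):
        self._series = defaultdict(list)

    @staticmethod
    def _key(metric, labels):
        # A series is identified by metric name + label set; label order is irrelevant.
        return (metric, frozenset(labels.items()))

    def append(self, metric, labels, timestamp, value):
        self._series[self._key(metric, labels)].append((timestamp, value))

    def samples(self, metric, labels):
        return list(self._series.get(self._key(metric, labels), []))

    def series_count(self):
        return len(self._series)


db = TinyTSDB()
db.append("http_requests_total", {"method": "GET", "status": "200"}, 1000, 41)
db.append("http_requests_total", {"status": "200", "method": "GET"}, 1015, 45)
db.append("http_requests_total", {"method": "GET", "status": "500"}, 1015, 2)
```

The first two samples land in the same series because their labels match; changing any label value (here `status="500"`) creates a new series.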
Architecture Overview
[Architecture diagram: Prometheus scrapes CPU/RAM metrics and application /metrics endpoints, then feeds dashboards and Slack/PagerDuty notifications.]
Prometheus Configuration (prometheus.yml)
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate alert rules

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Load rules once and periodically evaluate them
rule_files:
  - "alert_rules.yml"

# Scrape configurations: one job per group of targets
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any time series scraped from this config
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'my_app'
    metrics_path: '/api/metrics'  # Overrides the default path, /metrics
    static_configs:
      - targets: ['app-server-1:8080', 'app-server-2:8080']
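It can help to see which URLs a configuration like the one above turns into. The sketch below derives them from a job definition written as a Python dict, using the documented defaults of `http` for the scheme and `/metrics` for the path. The function name `scrape_urls` is hypothetical, and the sketch ignores relabeling, query parameters, and service discovery.

```python
def scrape_urls(job):
    """Return the URLs Prometheus would request for one scrape job (simplified)."""
    scheme = job.get("scheme", "http")            # default scheme
    path = job.get("metrics_path", "/metrics")    # default metrics path
    urls = []
    for group in job.get("static_configs", []):
        for target in group.get("targets", []):
            urls.append(f"{scheme}://{target}{path}")
    return urls


node_job = {
    "job_name": "node_exporter",
    "static_configs": [{"targets": ["node-exporter:9100"]}],
}
app_job = {
    "job_name": "my_app",
    "metrics_path": "/api/metrics",
    "static_configs": [{"targets": ["app-server-1:8080", "app-server-2:8080"]}],
}

for job in (node_job, app_job):
    print(job["job_name"], scrape_urls(job))
```

Each target in a job is scraped separately, so the `my_app` job produces two requests per scrape interval.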
PromQL Basics (Query Language)
PromQL is the language you use to query this data, whether for Grafana dashboards or for alert rules.
# 1. Simple Selection (Instant Vector)
http_requests_total{status="200", method="GET"}
# 2. Rate (Per-second average over a 5 minute window)
# Used for counters that continuously increase
rate(http_requests_total[5m])
# 3. CPU Usage Percentage (Node Exporter)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# 4. Memory Usage Percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# 5. 95th Percentile Response Time (from a Histogram)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
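The `rate()` query above is easier to reason about with a rough model of what it computes. The sketch below is a simplification, not the actual Prometheus implementation: it sums the increases between consecutive samples in the window, treats any drop in value as a counter reset, and divides by the time between the first and last sample. The real function also extrapolates toward the window boundaries, so its results can differ slightly. The name `simple_rate` is hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    `samples` is a time-ordered list of (timestamp_seconds, counter_value)
    pairs that fall inside the query window.
    """
    if len(samples) < 2:
        return None  # Not enough data to compute a rate.

    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # Counter reset (e.g. process restart): it restarted from zero,
            # so the whole new value counts as increase.
            increase += value
        else:
            increase += value - prev
        prev = value

    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


steady = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
with_reset = [(0, 100), (15, 130), (30, 10), (45, 40)]
print(simple_rate(steady))
print(simple_rate(with_reset))
```

This is why `rate()` suits counters: a restart that drops the raw value to zero does not produce a negative rate.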
Defining Alert Rules
groups:
  - name: infrastructure_alerts
    rules:
      - alert: HighCpuLoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU load is > 85% for 5 minutes (current value: {{ $value }}%)"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
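The `for:` clause in the rules above means the expression must stay true for that long before the alert fires; until then the alert is pending. The sketch below models that state machine for a single alert. It is a simplified illustration that assumes one evaluation per tuple; the function name `alert_states` is hypothetical.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    `evaluations` is a time-ordered list of (timestamp_seconds, condition_is_true).
    """
    states = []
    active_since = None  # When the condition most recently became true.
    for ts, is_true in evaluations:
        if not is_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_long_enough = ts - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states


# InstanceDown with `for: 1m`, evaluated every 15s; the target recovers briefly at t=30.
evals = [(0, True), (15, True), (30, False), (45, True),
         (60, True), (75, True), (90, True), (105, True)]
print(alert_states(evals, for_seconds=60))
```

The brief recovery at t=30 restarts the clock, so the alert fires at t=105 rather than t=60. This is how `for:` filters out short-lived blips.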