
Prometheus Monitoring: Complete Setup with Node Exporter & Alertmanager

Ingrid Björk
Observability Engineer
Aug 15, 2025
25 min read

What You'll Learn

How Prometheus works, its pull-based architecture, PromQL basics, Node Exporter for infrastructure metrics, and Alertmanager for notifications.

What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud and now a graduated project of the Cloud Native Computing Foundation (CNCF). It is the de facto standard for monitoring Kubernetes and cloud-native environments.

📥 Pull-Based Model

Unlike Datadog or New Relic, where agents push data, Prometheus pulls it: on every scrape interval it sends an HTTP request to an endpoint (e.g. `/metrics`) on each of your applications and "scrapes" the current metric values.
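To make the pull model concrete, here is a minimal sketch of what an application serves on `/metrics`, using only the Python standard library. The names `render_metrics`, `MetricsHandler`, `serve`, the port `8000`, and the sample counter value are assumptions for illustration; in a real service you would normally use an official Prometheus client library rather than formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render counters in the Prometheus text exposition format.

    `samples` maps a metric name to a list of (labels_dict, value) pairs.
    """
    lines = []
    for name, series in samples.items():
        lines.append(f"# TYPE {name} counter")
        for labels, value in series:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            selector = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical in-memory counter state for the example.
REQUESTS = {"http_requests_total": [({"method": "GET", "status": "200"}, 1027)]}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus sends a plain HTTP GET; we answer with the current values.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at that port would let Prometheus collect `http_requests_total`. The application never needs to know where Prometheus lives, which is the main operational difference from a push agent.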

⏱ Time-Series DB

Prometheus stores data as time series: streams of timestamped values that share the same metric name and the same set of labels. Each distinct combination of label values is a separate series.
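The data model can be sketched in a few lines of Python. This is a toy illustration, not how the Prometheus TSDB is implemented; `TinyTSDB`, `series_key`, and the sample values are hypothetical names and data chosen for the example.

```python
from collections import defaultdict


def series_key(name, labels):
    """A series is identified by its metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))


class TinyTSDB:
    def __init__(self):
        # series identity -> list of (timestamp, value) samples
        self._series = defaultdict(list)

    def append(self, name, labels, timestamp, value):
        self._series[series_key(name, labels)].append((timestamp, value))

    def select(self, name, **matchers):
        """Return every series of `name` whose labels equal all matchers."""
        result = {}
        for (metric, labels), samples in self._series.items():
            label_map = dict(labels)
            if metric == name and all(label_map.get(k) == v for k, v in matchers.items()):
                result[(metric, labels)] = samples
        return result


db = TinyTSDB()
db.append("http_requests_total", {"method": "GET", "status": "200"}, 1000, 10)
db.append("http_requests_total", {"method": "GET", "status": "200"}, 1015, 14)
db.append("http_requests_total", {"method": "POST", "status": "500"}, 1000, 1)
```

Here `db.select("http_requests_total", method="GET")` returns one series with two samples, while `db.select("http_requests_total")` returns both series. This is also why high-cardinality labels (user IDs, request IDs) are expensive: every new value creates a new series.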

Architecture Overview

Metric sources: Node Exporter (CPU/RAM metrics) and Your App (`/metrics` endpoint)
Prometheus Server: scrapes the endpoints and stores the data in its TSDB
Consumers: Grafana (dashboards) and Alertmanager (Slack/PagerDuty notifications)

Prometheus Configuration (prometheus.yml)

prometheus.yml
global:
  scrape_interval: 15s     # How often to scrape targets
  evaluation_interval: 15s # How often to evaluate recording and alerting rules

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Load rules once and periodically evaluate them
rule_files:
  - "alert_rules.yml"

# Scrape configurations, one job per group of targets:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any time series scraped from this config
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'my_app'
    metrics_path: '/api/metrics'
    static_configs:
      - targets: ['app-server-1:8080', 'app-server-2:8080']
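It can help to see which URLs a job actually resolves to. The sketch below assumes the job has already been parsed into a Python dict (the helper name `scrape_urls` is hypothetical) and applies the two defaults Prometheus uses when the keys are omitted: `scheme: http` and `metrics_path: /metrics`.

```python
def scrape_urls(job):
    """List the URLs Prometheus would request for one static scrape job."""
    scheme = job.get("scheme", "http")            # default when not configured
    path = job.get("metrics_path", "/metrics")    # default when not configured
    urls = []
    for group in job.get("static_configs", []):
        for target in group.get("targets", []):
            urls.append(f"{scheme}://{target}{path}")
    return urls


# The `my_app` and `node_exporter` jobs from the configuration above, as dicts.
my_app = {
    "job_name": "my_app",
    "metrics_path": "/api/metrics",
    "static_configs": [{"targets": ["app-server-1:8080", "app-server-2:8080"]}],
}
node_exporter = {
    "job_name": "node_exporter",
    "static_configs": [{"targets": ["node-exporter:9100"]}],
}
```

With these inputs, `scrape_urls(my_app)` yields `http://app-server-1:8080/api/metrics` and `http://app-server-2:8080/api/metrics`, and `scrape_urls(node_exporter)` yields `http://node-exporter:9100/metrics`. Requesting those URLs with `curl` from the Prometheus host is a quick way to debug a target that shows as down.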

PromQL Basics (Query Language)

PromQL is the query language you use to read data out of Prometheus, both for Grafana dashboards and for alert rules.

PromQL Queries
# 1. Simple Selection (Instant Vector)
http_requests_total{status="200", method="GET"}

# 2. Rate (Per-second average over a 5 minute window)
# Used for counters that continuously increase
rate(http_requests_total[5m])

# 3. CPU Usage Percentage (Node Exporter)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# 4. Memory Usage Percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# 5. 95th Percentile Response Time (from a Histogram)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
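The arithmetic behind `rate()` and `histogram_quantile()` is easier to trust once you have seen it on plain numbers. The Python sketch below is a simplification under stated assumptions: the real `rate()` also extrapolates to the edges of the window, and the function names `simple_rate` and `simple_histogram_quantile` are hypothetical, not part of any Prometheus API.

```python
import math


def simple_rate(samples):
    """Per-second increase of a counter over the samples in one window.

    `samples` is a time-ordered list of (timestamp_seconds, value).
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter was reset (e.g. a restart): count from zero.
        increase += value if value < prev else value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


def simple_histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The last bucket is expected to have upper bound +Inf. Values are assumed
    to be spread evenly inside each bucket (linear interpolation).
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # fall back to the highest finite bound
            if count == prev_count:
                return le  # empty bucket, nothing to interpolate
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


window = [(0, 100), (60, 160), (120, 220)]
latency = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
```

For `window`, the counter grows by 120 over 120 seconds, so `simple_rate(window)` is 1 request per second. For `latency`, the 95th of 100 observations falls halfway through the 0.5 to 1.0 second bucket, giving an estimate of 0.75 seconds. The estimate can never be more precise than your bucket boundaries, which is why bucket layout matters.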

Defining Alert Rules

alert_rules.yml
groups:
- name: infrastructure_alerts
  rules:
  - alert: HighCpuLoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU load on {{ $labels.instance }}"
      description: "CPU usage has been above 85% for 5 minutes (current value: {{ $value | humanize }}%)"

  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
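The `for:` clause is what keeps a brief spike from paging anyone: an alert whose expression is true first becomes pending, and only fires once the expression has stayed true for the whole duration. A minimal Python sketch of that state machine follows; `alert_states` and its parameters are hypothetical names, and it assumes one evaluation every `eval_interval` seconds.

```python
def alert_states(condition_by_eval, for_seconds, eval_interval=15):
    """Return the alert state after each rule evaluation.

    `condition_by_eval` holds one boolean per evaluation: whether the alert
    expression returned a result at that point in time.
    """
    states = []
    active_since = None
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval
        if not is_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


timeline = alert_states([True, True, True, False, True], for_seconds=30)
```

With a 15-second evaluation interval and `for_seconds=30`, `timeline` is `pending, pending, firing, inactive, pending`: the single false evaluation resets the clock. The same logic explains why `InstanceDown` with `for: 1m` ignores a target that misses only one or two scrapes.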
