The Modern Monitoring Stack

Prometheus and Grafana form the industry-standard open-source monitoring stack. Prometheus collects and stores time-series metrics, while Grafana visualizes them through customizable dashboards. For QA engineers, this stack provides real-time insights into application performance, infrastructure health, and user experience metrics.

Effective monitoring is a core component of continuous testing in DevOps, enabling teams to catch issues early. When combined with API performance testing, you can correlate load test results with production metrics. For teams running containerized environments, our guide on containerization for testing shows how to integrate monitoring with Docker workflows.

Why Prometheus + Grafana?

  • Pull-Based Model - Prometheus scrapes metrics from targets, no client-side push needed
  • Powerful Query Language (PromQL) - Flexible querying and aggregation of metrics
  • Service Discovery - Automatic target discovery in dynamic environments (Kubernetes, AWS, etc.)
  • Alerting - Built-in alert manager with routing and silencing
  • Open Source - No vendor lock-in, large community support
  • Grafana Visualization - Rich dashboards with multiple data source support

Prometheus Architecture

Core Components

# Prometheus architecture overview
components:
  prometheus_server:
    - scrapes_metrics: true
    - stores_timeseries: true
    - evaluates_rules: true

  exporters:
    - node_exporter: "System metrics (CPU, memory, disk)"
    - blackbox_exporter: "Probe endpoints (HTTP, DNS, TCP)"
    - custom_exporters: "Application-specific metrics"

  pushgateway:
    - for_batch_jobs: true
    - short_lived_processes: true

  alertmanager:
    - handles_alerts: true
    - routes_notifications: true
    - silences_alerts: true

  service_discovery:
    - kubernetes: true
    - consul: true
    - ec2: true
    - dns: true

Installing Prometheus

# Docker installation
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

# Kubernetes installation (using Helm)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus

# Configuration file: prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/metrics'
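
Once Prometheus is running with this configuration, it helps to confirm that the targets are actually being scraped. The sketch below is a minimal example, assuming Prometheus is reachable on localhost:9090; it uses the Prometheus HTTP API to evaluate the built-in up metric and print each target's health.

# check_targets.py - verify scrape targets via the Prometheus HTTP API (assumes localhost:9090)
import requests

PROMETHEUS_URL = "http://localhost:9090"

def check_targets():
    # Instant query for `up`: 1 = last scrape succeeded, 0 = scrape failed
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": "up"}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]

    for series in result:
        labels = series["metric"]
        value = series["value"][1]  # value is returned as [timestamp, "value"]
        status = "UP" if value == "1" else "DOWN"
        print(f"{labels.get('job')} ({labels.get('instance')}): {status}")

if __name__ == "__main__":
    check_targets()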

Instrumenting Applications

Node.js Application

// app.js
const express = require('express');
const promClient = require('prom-client');

const app = express();
const register = promClient.register;

// Enable default metrics (CPU, memory, event loop lag)
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

// Middleware to track metrics
app.use((req, res, next) => {
  const start = Date.now();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);

    httpRequestTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();

    activeConnections.dec();
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Business logic
app.get('/api/users', async (req, res) => {
  // Your logic here
  res.json({ users: [] });
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
  console.log('Metrics available at http://localhost:3000/metrics');
});

Python Application (Flask)

# app.py
from flask import Flask, Response, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST, REGISTRY
import time

app = Flask(__name__)

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.1, 0.3, 0.5, 1.0, 3.0, 5.0, 10.0]
)

active_requests = Gauge(
    'http_requests_active',
    'Active HTTP requests'
)

# Middleware
@app.before_request
def before_request():
    request.start_time = time.time()
    active_requests.inc()

@app.after_request
def after_request(response):
    duration = time.time() - request.start_time

    request_count.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()

    request_duration.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(duration)

    active_requests.dec()
    return response

# Metrics endpoint
@app.route('/metrics')
def metrics():
    return Response(generate_latest(REGISTRY), mimetype=CONTENT_TYPE_LATEST)

# Application routes
@app.route('/api/data')
def get_data():
    return {'data': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
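
A quick way to validate the instrumentation itself is to drive the app with Flask's test client and read the counters back. This is a minimal sketch, assuming the file above is saved as app.py and run with pytest; the endpoint label value 'get_data' comes from the Flask view function name.

# test_metrics.py - sanity check for the instrumentation above (assumes it is importable as app.py)
from prometheus_client import REGISTRY
from app import app

def test_request_counter_increments():
    client = app.test_client()

    assert client.get('/api/data').status_code == 200

    # The counter should now have a sample for this method/endpoint/status combination
    count = REGISTRY.get_sample_value(
        'http_requests_total',
        {'method': 'GET', 'endpoint': 'get_data', 'status': '200'}
    )
    assert count is not None and count >= 1

    # And the /metrics endpoint should expose it in Prometheus text format
    metrics_text = client.get('/metrics').get_data(as_text=True)
    assert 'http_requests_total' in metrics_text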

PromQL: The Query Language

Basic Queries

# Instant vector - current value
http_requests_total

# Filter by labels
http_requests_total{method="GET", status_code="200"}

# Range vector - values over time
http_requests_total[5m]

# Rate of increase (per second)
rate(http_requests_total[5m])

# Increase over time period
increase(http_requests_total[1h])

Advanced PromQL

# Request rate per endpoint
sum(rate(http_requests_total[5m])) by (route)

# Error rate percentage
(
  sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
) * 100

# P95 latency
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)

# Availability percentage
(
  sum(up{job="application"})
  / count(up{job="application"})
) * 100

# Trend of available memory (bytes per second; a negative slope means usage is growing)
deriv(node_memory_MemAvailable_bytes[1h])

Aggregation Operators

# Sum across all instances
sum(http_requests_total)

# Average response time by route (from the histogram's sum and count series)
sum(rate(http_request_duration_seconds_sum[5m])) by (route)
  / sum(rate(http_request_duration_seconds_count[5m])) by (route)

# Highest CPU usage across instances (% non-idle)
max(100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Count of services up
count(up == 1)

# Top 5 endpoints by traffic
topk(5, sum(rate(http_requests_total[5m])) by (endpoint))

# 3 instances with the lowest average latency
bottomk(3,
  sum(rate(http_request_duration_seconds_sum[5m])) by (instance)
  / sum(rate(http_request_duration_seconds_count[5m])) by (instance)
)
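
These queries become even more useful when they gate a test run. The sketch below is a hypothetical helper, assuming Prometheus at localhost:9090; it evaluates the error-rate expression from above through the HTTP API and fails if the result exceeds a threshold, which works well as a post-deployment or post-load-test check.

# promql_gate.py - hypothetical quality gate built on the Prometheus HTTP API
import requests

PROMETHEUS_URL = "http://localhost:9090"

ERROR_RATE_QUERY = """
(sum(rate(http_requests_total{status_code=~"5.."}[5m]))
 / sum(rate(http_requests_total[5m]))) * 100
"""

def query_scalar(expr):
    # Instant query; returns the first sample's value as a float, or None if no data
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None

def assert_error_rate_below(threshold_percent=5.0):
    error_rate = query_scalar(ERROR_RATE_QUERY)
    if error_rate is None:
        print("No traffic in the window - nothing to check")
        return
    print(f"Error rate: {error_rate:.2f}% (threshold {threshold_percent}%)")
    assert error_rate < threshold_percent, "Error rate exceeds threshold"

if __name__ == "__main__":
    assert_error_rate_below()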

Grafana Dashboards

Installing Grafana

# Docker installation
docker run -d \
  --name=grafana \
  -p 3000:3000 \
  grafana/grafana

# Add Prometheus data source
# Navigate to: http://localhost:3000 (admin/admin)
# Configuration → Data Sources → Add Prometheus
# URL: http://prometheus:9090
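
Instead of clicking through the UI, the data source can also be registered through Grafana's HTTP API, which is convenient in CI. A minimal sketch, assuming a fresh Grafana instance with the default admin/admin credentials and Prometheus reachable at http://prometheus:9090:

# add_datasource.py - register Prometheus as a Grafana data source via the HTTP API
import requests

GRAFANA_URL = "http://localhost:3000"

payload = {
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://prometheus:9090",
    "access": "proxy",        # Grafana's backend proxies queries to Prometheus
    "isDefault": True,
}

resp = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    json=payload,
    auth=("admin", "admin"),  # default credentials; change these in real deployments
    timeout=10,
)
resp.raise_for_status()
print("Data source created:", resp.json().get("name", "Prometheus"))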

Creating Performance Dashboard

{
  "dashboard": {
    "title": "Application Performance",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (route)",
            "legendFormat": "{{route}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error %"
          }
        ],
        "type": "graph",
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [5],
                "type": "gt"
              },
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "reducer": {"type": "avg"}
            }
          ]
        }
      },
      {
        "title": "P95 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))",
            "legendFormat": "{{route}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Active Connections",
        "targets": [
          {
            "expr": "active_connections",
            "legendFormat": "Connections"
          }
        ],
        "type": "stat"
      }
    ]
  }
}

Pre-Built Dashboards

# Import popular dashboards from Grafana.com

# Node Exporter Full (ID: 1860)
# - System metrics: CPU, memory, disk, network

# Kubernetes Cluster Monitoring (ID: 7249)
# - Pod metrics, deployments, resource usage

# Application Performance (Custom)
# - Request rates, error rates, latencies
# - Database query performance
# - Cache hit rates

Alerting with Prometheus & Grafana

Prometheus Alert Rules

# alerts.yml
groups:
  - name: performance_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          (sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.route }}"
          description: "P95 latency is {{ $value }}s"

      - alert: ServiceDown
        expr: up{job="application"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes /
          node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90%"
          description: "{{ $labels.instance }} memory: {{ $value }}%"

AlertManager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.gmail.com:587'

  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
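
To verify the routing and notification channels end to end without waiting for a real incident, you can push a synthetic alert straight to Alertmanager. A hedged sketch, assuming Alertmanager listens on localhost:9093 with its v2 API enabled (the default):

# send_test_alert.py - fire a synthetic alert at Alertmanager to exercise routing and receivers
from datetime import datetime, timedelta, timezone
import requests

ALERTMANAGER_URL = "http://localhost:9093"

now = datetime.now(timezone.utc)
test_alert = [{
    "labels": {
        "alertname": "TestAlert",
        "severity": "warning",   # should be routed to the 'slack' receiver in the config above
        "service": "qa-smoke-test",
    },
    "annotations": {
        "summary": "Synthetic alert from the QA pipeline",
        "description": "Safe to ignore - verifying notification routing",
    },
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(minutes=5)).isoformat(),
}]

resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/alerts", json=test_alert, timeout=5)
resp.raise_for_status()
print("Test alert accepted by Alertmanager")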

Monitoring Best Practices

The Four Golden Signals

# 1. LATENCY - Request duration
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# 2. TRAFFIC - Request rate
sum(rate(http_requests_total[5m]))

# 3. ERRORS - Error rate
sum(rate(http_requests_total{status_code=~"5.."}[5m]))

# 4. SATURATION - Resource utilization (CPU busy %)
avg(sum by (instance, cpu) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))) * 100

RED Method (Rate, Errors, Duration)

# Request Rate
sum(rate(http_requests_total[5m])) by (service)

# Error Rate
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)

# Duration (P50, P95, P99)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

USE Method (Utilization, Saturation, Errors)

# CPU Utilization
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# CPU Saturation (load average)
node_load1 / count(node_cpu_seconds_total{mode="idle"}) without (cpu, mode)

# Memory Utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk Saturation (average I/O queue depth)
rate(node_disk_io_time_weighted_seconds_total[5m])

Performance Testing with Prometheus

Load Test Metrics Collection

# load_test_with_metrics.py
import requests
import time
from prometheus_client import CollectorRegistry, Counter, Histogram, push_to_gateway

# Metrics (registered to a dedicated registry so only the load test metrics are pushed)
registry = CollectorRegistry()
test_requests_total = Counter('load_test_requests_total', 'Total load test requests', ['status'], registry=registry)
test_duration = Histogram('load_test_duration_seconds', 'Load test request duration', registry=registry)

def run_load_test(url, duration_seconds, rps):
    """Run load test and push metrics to Pushgateway"""
    end_time = time.time() + duration_seconds

    while time.time() < end_time:
        start = time.time()

        try:
            response = requests.get(url, timeout=10)
            test_requests_total.labels(status=response.status_code).inc()
        except Exception as e:
            test_requests_total.labels(status='error').inc()

        request_time = time.time() - start
        test_duration.observe(request_time)

        # Maintain target RPS
        sleep_time = (1.0 / rps) - request_time
        if sleep_time > 0:
            time.sleep(sleep_time)

    # Push to Prometheus Pushgateway
    push_to_gateway(
        'localhost:9091',
        job='load_test',
        registry=registry
    )

# Run test
run_load_test('http://api.example.com/health', duration_seconds=300, rps=100)

Conclusion

Prometheus and Grafana provide a powerful, flexible monitoring stack for QA engineers. From instrumenting applications to creating insightful dashboards and setting up intelligent alerts, this stack enables proactive performance monitoring and rapid issue detection, complementing the techniques covered in API Performance Testing: Metrics and Tools.

Key Takeaways:

  • Instrument applications with custom metrics
  • Master PromQL for powerful queries
  • Create actionable dashboards with Grafana
  • Set up alerts based on SLOs/SLIs
  • Follow established monitoring methodologies: RED, USE, and the Four Golden Signals
  • Integrate monitoring into load testing workflows

Effective monitoring isn’t just about collecting metrics — it’s about turning data into actionable insights.
