The Modern Monitoring Stack

Prometheus and Grafana form the industry-standard open-source monitoring stack. Prometheus collects and stores time-series metrics, while Grafana visualizes them through customizable dashboards. For QA engineers, this stack provides real-time insights into application performance, infrastructure health, and user experience metrics.

Why Prometheus + Grafana?

  • Pull-Based Model - Prometheus scrapes metrics over HTTP from each target's /metrics endpoint; no client-side push needed (see the sample scrape output after this list)
  • Powerful Query Language (PromQL) - Flexible querying and aggregation of metrics
  • Service Discovery - Automatic target discovery in dynamic environments (Kubernetes, AWS, etc.)
  • Alerting - Built-in alert manager with routing and silencing
  • Open Source - No vendor lock-in, large community support
  • Grafana Visualization - Rich dashboards with multiple data source support
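
When Prometheus scrapes a target, it receives the plain-text exposition format: a HELP/TYPE header per metric followed by labeled samples. As a rough illustration (the metric names mirror the instrumentation examples later in this section; the values are hypothetical), a scrape of an application's /metrics endpoint returns something like:

# Example exposition-format payload from a target's /metrics endpoint
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/api/users",status_code="200"} 1027
http_requests_total{method="POST",route="/api/users",status_code="500"} 3
# HELP active_connections Number of active connections
# TYPE active_connections gauge
active_connections 17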

Prometheus Architecture

Core Components

# Prometheus architecture overview
components:
  prometheus_server:
    - scrapes_metrics: true
    - stores_timeseries: true
    - evaluates_rules: true

  exporters:
    - node_exporter: "System metrics (CPU, memory, disk)"
    - blackbox_exporter: "Probe endpoints (HTTP, DNS, TCP)"
    - custom_exporters: "Application-specific metrics"

  pushgateway:
    - for_batch_jobs: true
    - short_lived_processes: true

  alertmanager:
    - handles_alerts: true
    - routes_notifications: true
    - silences_alerts: true

  service_discovery:
    - kubernetes: true
    - consul: true
    - ec2: true
    - dns: true
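
Exporters run as separate processes alongside the systems they expose. As a minimal sketch (assuming Docker and the default ports; production setups typically need extra volume mounts and flags), the two most common exporters can be started like this:

# Run node_exporter for host-level metrics (default port 9100)
docker run -d --name node-exporter -p 9100:9100 prom/node-exporter

# Run blackbox_exporter for HTTP/DNS/TCP probing (default port 9115)
docker run -d --name blackbox-exporter -p 9115:9115 prom/blackbox-exporter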

Installing Prometheus

# Docker installation
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

# Kubernetes installation (using Helm)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus

# Configuration file: prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/metrics'
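
Before (re)starting Prometheus, it is worth validating the configuration and then confirming that targets are actually being scraped. A quick sketch, assuming the default ports used above:

# Validate the configuration file with promtool
promtool check config prometheus.yml

# List scrape targets and their health via the HTTP API
curl http://localhost:9090/api/v1/targets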

Instrumenting Applications

Node.js Application

// app.js
const express = require('express');
const promClient = require('prom-client');

const app = express();
const register = promClient.register;

// Enable default metrics (CPU, memory, event loop lag)
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

// Middleware to track metrics
app.use((req, res, next) => {
  const start = Date.now();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);

    httpRequestTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();

    activeConnections.dec();
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Business logic
app.get('/api/users', async (req, res) => {
  // Your logic here
  res.json({ users: [] });
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
  console.log('Metrics available at http://localhost:3000/metrics');
});
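
To try this locally (assuming Node.js is installed; the package names match the require calls above):

# Install dependencies and start the app
npm install express prom-client
node app.js

# Verify the exposed metrics
curl http://localhost:3000/metrics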

Python Application (Flask)

# app.py
from flask import Flask, Response, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, REGISTRY
import time

app = Flask(__name__)

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.1, 0.3, 0.5, 1.0, 3.0, 5.0, 10.0]
)

active_requests = Gauge(
    'http_requests_active',
    'Active HTTP requests'
)

# Middleware
@app.before_request
def before_request():
    request.start_time = time.time()
    active_requests.inc()

@app.after_request
def after_request(response):
    duration = time.time() - request.start_time

    request_count.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()

    request_duration.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(duration)

    active_requests.dec()
    return response

# Metrics endpoint
@app.route('/metrics')
def metrics():
    return Response(generate_latest(REGISTRY), mimetype='text/plain')

# Application routes
@app.route('/api/data')
def get_data():
    return {'data': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
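
To try this locally (assuming Python 3 and pip are available):

# Install dependencies and start the app
pip install flask prometheus-client
python app.py

# Verify the exposed metrics
curl http://localhost:5000/metrics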

PromQL: The Query Language

Basic Queries

# Instant vector - current value
http_requests_total

# Filter by labels
http_requests_total{method="GET", status_code="200"}

# Range vector - values over time
http_requests_total[5m]

# Rate of increase (per second)
rate(http_requests_total[5m])

# Increase over time period
increase(http_requests_total[1h])

Advanced PromQL

# Request rate per endpoint
sum(rate(http_requests_total[5m])) by (route)

# Error rate percentage
(
  sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
) * 100

# P95 latency
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)

# Availability percentage
(
  sum(up{job="application"})
  / count(up{job="application"})
) * 100

# Trend in available memory (a negative slope means memory usage is growing)
deriv(node_memory_MemAvailable_bytes[1h])

Aggregation Operators

# Sum across all instances
sum(http_requests_total)

# Average response time by route (histogram: divide the rate of _sum by the rate of _count)
sum(rate(http_request_duration_seconds_sum[5m])) by (route)
  / sum(rate(http_request_duration_seconds_count[5m])) by (route)

# Maximum per-core CPU usage (%) on each instance (node_cpu_seconds_total is a counter, so use rate)
100 * max by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Count of services up
count(up == 1)

# Top 5 endpoints by traffic
topk(5, sum(rate(http_requests_total[5m])) by (endpoint))

# 3 fastest instances by average request duration (use topk for the slowest)
bottomk(3,
  sum(rate(http_request_duration_seconds_sum[5m])) by (instance)
  / sum(rate(http_request_duration_seconds_count[5m])) by (instance)
)

Grafana Dashboards

Installing Grafana

# Docker installation
docker run -d \
  --name=grafana \
  -p 3000:3000 \
  grafana/grafana

# Add Prometheus data source
# Navigate to: http://localhost:3000 (admin/admin)
# Configuration → Data Sources → Add Prometheus
# URL: http://prometheus:9090
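
For repeatable setups, the data source can also be provisioned from a file instead of being clicked through the UI. A minimal sketch, assuming the standard provisioning directory inside the Grafana container:

# provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

# Start (or restart) the container with the provisioning directory mounted
docker run -d \
  --name=grafana \
  -p 3000:3000 \
  -v $(pwd)/provisioning/datasources:/etc/grafana/provisioning/datasources \
  grafana/grafana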

Creating Performance Dashboard

{
  "dashboard": {
    "title": "Application Performance",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (route)",
            "legendFormat": "{{route}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error %"
          }
        ],
        "type": "graph",
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [5],
                "type": "gt"
              },
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "reducer": {"type": "avg"}
            }
          ]
        }
      },
      {
        "title": "P95 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))",
            "legendFormat": "{{route}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Active Connections",
        "targets": [
          {
            "expr": "active_connections",
            "legendFormat": "Connections"
          }
        ],
        "type": "stat"
      }
    ]
  }
}

Pre-Built Dashboards

# Import popular dashboards from Grafana.com

# Node Exporter Full (ID: 1860)
# - System metrics: CPU, memory, disk, network

# Kubernetes Cluster Monitoring (ID: 7249)
# - Pod metrics, deployments, resource usage

# Application Performance (Custom)
# - Request rates, error rates, latencies
# - Database query performance
# - Cache hit rates
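
Community dashboards can be imported through the UI (Dashboards → Import, then enter the ID), or fetched as JSON for file-based provisioning. A hedged sketch; the download URL pattern below is an assumption based on grafana.com's public dashboard API:

# Download the Node Exporter Full dashboard (ID 1860) as JSON for later import/provisioning
curl -sL https://grafana.com/api/dashboards/1860/revisions/latest/download -o node-exporter-full.json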

Alerting with Prometheus & Grafana

Prometheus Alert Rules

# alerts.yml
groups:
  - name: performance_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          (sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.route }}"
          description: "P95 latency is {{ $value }}s"

      - alert: ServiceDown
        expr: up{job="application"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes /
          node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90%"
          description: "{{ $labels.instance }} memory: {{ $value }}%"
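
The rule file only takes effect once Prometheus loads it and knows where to send firing alerts. A minimal sketch of the additions to prometheus.yml, assuming Alertmanager is reachable on its default port 9093:

# prometheus.yml - load alert rules and point at Alertmanager
rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']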

AlertManager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.gmail.com:587'

  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
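
Both files can be validated before reloading the services; a quick sketch using the bundled CLI tools:

# Validate alert rules and Alertmanager configuration
promtool check rules alerts.yml
amtool check-config alertmanager.yml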

Monitoring Best Practices

The Four Golden Signals

# 1. LATENCY - Request duration
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# 2. TRAFFIC - Request rate
sum(rate(http_requests_total[5m]))

# 3. ERRORS - Error rate
sum(rate(http_requests_total{status_code=~"5.."}[5m]))

# 4. SATURATION - Resource utilization
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100

RED Method (Rate, Errors, Duration)

# Request Rate
sum(rate(http_requests_total[5m])) by (service)

# Error Rate
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)

# Duration (P50, P95, P99)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

USE Method (Utilization, Saturation, Errors)

# CPU Utilization
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# CPU Saturation (load average)
node_load1 / count(node_cpu_seconds_total{mode="idle"}) without (cpu, mode)

# Memory Utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk Saturation proxy (% of time the device was busy with I/O)
rate(node_disk_io_time_seconds_total[5m]) * 100

Performance Testing with Prometheus

Load Test Metrics Collection

# load_test_with_metrics.py
import requests
import time
from prometheus_client import CollectorRegistry, Counter, Histogram, push_to_gateway

# Use a dedicated registry so exactly these metrics are pushed to the Pushgateway
registry = CollectorRegistry()
test_requests_total = Counter('load_test_requests_total', 'Total load test requests', ['status'], registry=registry)
test_duration = Histogram('load_test_duration_seconds', 'Load test request duration', registry=registry)

def run_load_test(url, duration_seconds, rps):
    """Run load test and push metrics to Pushgateway"""
    end_time = time.time() + duration_seconds

    while time.time() < end_time:
        start = time.time()

        try:
            response = requests.get(url, timeout=10)
            test_requests_total.labels(status=response.status_code).inc()
        except Exception as e:
            test_requests_total.labels(status='error').inc()

        request_time = time.time() - start
        test_duration.observe(request_time)

        # Maintain target RPS
        sleep_time = (1.0 / rps) - request_time
        if sleep_time > 0:
            time.sleep(sleep_time)

    # Push the collected metrics to the Prometheus Pushgateway
    push_to_gateway(
        'localhost:9091',
        job='load_test',
        registry=registry
    )

# Run test
run_load_test('http://api.example.com/health', duration_seconds=300, rps=100)
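
The script assumes a Pushgateway is reachable on localhost:9091 and that Prometheus scrapes it. A minimal sketch of that wiring:

# Run the Pushgateway (default port 9091)
docker run -d --name pushgateway -p 9091:9091 prom/pushgateway

# prometheus.yml - add a scrape job for the Pushgateway
# honor_labels: true keeps the job/instance labels that were pushed
  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['localhost:9091']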

Conclusion

Prometheus and Grafana provide a powerful, flexible monitoring stack for QA engineers. From instrumenting applications to creating insightful dashboards and setting up intelligent alerts, this stack enables proactive performance monitoring and rapid issue detection.

Key Takeaways:

  • Instrument applications with custom metrics
  • Master PromQL for powerful queries
  • Create actionable dashboards with Grafana
  • Set up alerts based on SLOs/SLIs
  • Follow monitoring methodologies: RED, USE, Four Golden Signals
  • Integrate monitoring into load testing workflows

Effective monitoring isn’t just about collecting metrics — it’s about turning data into actionable insights.