The Modern Monitoring Stack
Prometheus and Grafana form the industry-standard open-source monitoring stack. Prometheus collects and stores time-series metrics, while Grafana visualizes them through customizable dashboards. For QA engineers, this stack provides real-time insights into application performance, infrastructure health, and user experience metrics.
Why Prometheus + Grafana?
- Pull-Based Model - Prometheus scrapes metrics from targets, no client-side push needed
- Powerful Query Language (PromQL) - Flexible querying and aggregation of metrics
- Service Discovery - Automatic target discovery in dynamic environments (Kubernetes, AWS, etc.)
- Alerting - Built-in alert manager with routing and silencing
- Open Source - No vendor lock-in, large community support
- Grafana Visualization - Rich dashboards with multiple data source support
Prometheus Architecture
Core Components
# Prometheus architecture overview
components:
  prometheus_server:
    - scrapes_metrics: true
    - stores_timeseries: true
    - evaluates_rules: true
  exporters:
    - node_exporter: "System metrics (CPU, memory, disk)"
    - blackbox_exporter: "Probe endpoints (HTTP, DNS, TCP)"
    - custom_exporters: "Application-specific metrics"
  pushgateway:
    - for_batch_jobs: true
    - short_lived_processes: true
  alertmanager:
    - handles_alerts: true
    - routes_notifications: true
    - silences_alerts: true
  service_discovery:
    - kubernetes: true
    - consul: true
    - ec2: true
    - dns: true
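If the standard exporters don't cover a metric you need, the custom-exporter path is straightforward. Below is a minimal sketch using the official prometheus_client Python library; the port (8000) and the demo_queue_depth metric are illustrative assumptions, not part of any standard exporter.
# minimal_exporter.py - sketch of a custom exporter (port and metric name are illustrative)
import random
import time

from prometheus_client import Gauge, start_http_server

# Gauge that Prometheus will scrape from this process
queue_depth = Gauge('demo_queue_depth', 'Current depth of the demo work queue')

if __name__ == '__main__':
    # Expose /metrics on port 8000; add this target to scrape_configs in prometheus.yml
    start_http_server(8000)
    while True:
        # In a real exporter this would read from the system under test
        queue_depth.set(random.randint(0, 100))
        time.sleep(5)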
Installing Prometheus
# Docker installation
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
# Kubernetes installation (using Helm)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus
# Configuration file: prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/metrics'
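Once Prometheus is running with this configuration, it's worth confirming that each job is actually being scraped. A small sketch against the /api/v1/targets endpoint of the Prometheus HTTP API, assuming Prometheus is reachable at localhost:9090:
# check_targets.py - quick check that Prometheus is scraping its configured targets
import requests

resp = requests.get('http://localhost:9090/api/v1/targets', timeout=5)
resp.raise_for_status()

# Print job name, scrape URL and health ("up"/"down") for every active target
for target in resp.json()['data']['activeTargets']:
    job = target['labels'].get('job', 'unknown')
    print(f"{job}: {target['scrapeUrl']} -> {target['health']}")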
Instrumenting Applications
Node.js Application
// app.js
const express = require('express');
const promClient = require('prom-client');

const app = express();
const register = promClient.register;

// Enable default metrics (CPU, memory, event loop lag)
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

// Middleware to track metrics
app.use((req, res, next) => {
  const start = Date.now();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
    httpRequestTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();
    activeConnections.dec();
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Business logic
app.get('/api/users', async (req, res) => {
  // Your logic here
  res.json({ users: [] });
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
  console.log('Metrics available at http://localhost:3000/metrics');
});
Python Application (Flask)
# app.py
from flask import Flask, Response, request
from prometheus_client import (
    Counter, Histogram, Gauge, generate_latest, REGISTRY, CONTENT_TYPE_LATEST
)
import time

app = Flask(__name__)

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.1, 0.3, 0.5, 1.0, 3.0, 5.0, 10.0]
)

active_requests = Gauge(
    'http_requests_active',
    'Active HTTP requests'
)

# Middleware
@app.before_request
def before_request():
    request.start_time = time.time()
    active_requests.inc()

@app.after_request
def after_request(response):
    duration = time.time() - request.start_time
    request_count.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    request_duration.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(duration)
    active_requests.dec()
    return response

# Metrics endpoint
@app.route('/metrics')
def metrics():
    return Response(generate_latest(REGISTRY), mimetype=CONTENT_TYPE_LATEST)

# Application routes
@app.route('/api/data')
def get_data():
    return {'data': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
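From a QA perspective, the instrumentation itself deserves a smoke test so dashboards don't silently go blank after a refactor. A minimal sketch, assuming the Flask app above is running locally on port 5000 and pytest is installed:
# test_metrics_endpoint.py - smoke test that instrumentation is exposed
import requests

def test_metrics_endpoint_exposes_custom_metrics():
    body = requests.get('http://localhost:5000/metrics', timeout=5).text
    # The metric families defined in app.py should appear in the exposition output
    assert 'http_requests_total' in body
    assert 'http_request_duration_seconds' in body
    assert 'http_requests_active' in body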
PromQL: The Query Language
Basic Queries
# Instant vector - current value
http_requests_total
# Filter by labels
http_requests_total{method="GET", status_code="200"}
# Range vector - values over time
http_requests_total[5m]
# Rate of increase (per second)
rate(http_requests_total[5m])
# Increase over time period
increase(http_requests_total[1h])
Advanced PromQL
# Request rate per endpoint
sum(rate(http_requests_total[5m])) by (route)
# Error rate percentage
(
  sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
) * 100
# P95 latency
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
# Availability percentage
(
  sum(up{job="application"})
  / count(up{job="application"})
) * 100
# Trend in available memory (a negative slope means usage is growing)
deriv(node_memory_MemAvailable_bytes[1h])
Aggregation Operators
# Sum across all instances
sum(http_requests_total)
# Average response time by route (from the histogram's _sum and _count series)
sum(rate(http_request_duration_seconds_sum[5m])) by (route)
  / sum(rate(http_request_duration_seconds_count[5m])) by (route)
# Maximum per-core CPU utilization by instance
max(1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
# Count of services up
count(up == 1)
# Top 5 endpoints by traffic
topk(5, sum(rate(http_requests_total[5m])) by (endpoint))
# 3 fastest instances by average latency
bottomk(3,
  sum(rate(http_request_duration_seconds_sum[5m])) by (instance)
    / sum(rate(http_request_duration_seconds_count[5m])) by (instance)
)
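Any of these expressions can also be evaluated from test or CI code, which lets you gate a pipeline on live metrics. A sketch against Prometheus's /api/v1/query endpoint; the localhost address and the 5% error budget are assumptions for illustration:
# promql_check.py - evaluate a PromQL expression from a test or CI job
import requests

ERROR_RATE_QUERY = (
    '(sum(rate(http_requests_total{status_code=~"5.."}[5m])) '
    '/ sum(rate(http_requests_total[5m]))) * 100'
)

def query_prometheus(expr):
    resp = requests.get(
        'http://localhost:9090/api/v1/query',
        params={'query': expr},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()['data']['result']

result = query_prometheus(ERROR_RATE_QUERY)
# An empty result means no 5xx traffic was observed in the window
error_rate = float(result[0]['value'][1]) if result else 0.0
assert error_rate < 5, f"Error rate {error_rate:.2f}% exceeds the 5% budget"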
Grafana Dashboards
Installing Grafana
# Docker installation
docker run -d \
  --name=grafana \
  -p 3000:3000 \
  grafana/grafana
# Add Prometheus data source
# Navigate to: http://localhost:3000 (admin/admin)
# Configuration → Data Sources → Add Prometheus
# URL: http://prometheus:9090
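The data source can also be provisioned programmatically, which is useful for repeatable test environments. A sketch using Grafana's /api/datasources endpoint, assuming Grafana runs on localhost:3000 with the default admin/admin credentials:
# add_datasource.py - sketch of provisioning the Prometheus data source via Grafana's HTTP API
import requests

payload = {
    'name': 'Prometheus',
    'type': 'prometheus',
    'url': 'http://prometheus:9090',
    'access': 'proxy',
    'isDefault': True,
}

resp = requests.post(
    'http://localhost:3000/api/datasources',
    json=payload,
    auth=('admin', 'admin'),  # default credentials; change them in real environments
    timeout=5,
)
print(resp.status_code, resp.json())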
Creating Performance Dashboard
{
  "dashboard": {
    "title": "Application Performance",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (route)",
            "legendFormat": "{{route}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error %"
          }
        ],
        "type": "graph",
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [5], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "reducer": { "type": "avg" }
            }
          ]
        }
      },
      {
        "title": "P95 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))",
            "legendFormat": "{{route}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Active Connections",
        "targets": [
          {
            "expr": "active_connections",
            "legendFormat": "Connections"
          }
        ],
        "type": "stat"
      }
    ]
  }
}
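Rather than assembling the dashboard by hand, the JSON above can be pushed through Grafana's HTTP API. A sketch assuming the JSON is saved as dashboard.json and Grafana still runs on localhost:3000 with default credentials:
# upload_dashboard.py - sketch of pushing the dashboard JSON above to Grafana
import json

import requests

# dashboard.json is assumed to contain the {"dashboard": {...}} document shown above
with open('dashboard.json') as f:
    dashboard = json.load(f)['dashboard']

resp = requests.post(
    'http://localhost:3000/api/dashboards/db',
    json={'dashboard': dashboard, 'overwrite': True},
    auth=('admin', 'admin'),
    timeout=5,
)
resp.raise_for_status()
print('Dashboard imported:', resp.json().get('url'))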
Pre-Built Dashboards
# Import popular dashboards from Grafana.com
# Node Exporter Full (ID: 1860)
# - System metrics: CPU, memory, disk, network
# Kubernetes Cluster Monitoring (ID: 7249)
# - Pod metrics, deployments, resource usage
# Application Performance (Custom)
# - Request rates, error rates, latencies
# - Database query performance
# - Cache hit rates
Alerting with Prometheus & Grafana
Prometheus Alert Rules
# alerts.yml
groups:
  - name: performance_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          (sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.route }}"
          description: "P95 latency is {{ $value }}s"

      - alert: ServiceDown
        expr: up{job="application"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes /
            node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90%"
          description: "{{ $labels.instance }} memory: {{ $value }}%"
AlertManager Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.gmail.com:587'
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
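Routing rules are easy to get subtly wrong, so it helps to push a synthetic alert and watch where it lands. A sketch against Alertmanager's /api/v2/alerts endpoint, assuming the default port 9093; the alert name and labels are made up for the test:
# fire_test_alert.py - send a synthetic alert to verify routing and receivers
import requests

test_alert = [{
    'labels': {
        'alertname': 'RoutingSmokeTest',   # hypothetical alert name for this check
        'severity': 'warning',             # should be routed to the 'slack' receiver above
    },
    'annotations': {
        'description': 'Synthetic alert sent to verify Alertmanager routing',
    },
}]

resp = requests.post('http://localhost:9093/api/v2/alerts', json=test_alert, timeout=5)
resp.raise_for_status()
print('Test alert accepted by Alertmanager')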
Monitoring Best Practices
The Four Golden Signals
# 1. LATENCY - Request duration
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# 2. TRAFFIC - Request rate
sum(rate(http_requests_total[5m]))
# 3. ERRORS - Error rate
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
# 4. SATURATION - Resource utilization
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100
RED Method (Rate, Errors, Duration)
# Request Rate
sum(rate(http_requests_total[5m])) by (service)
# Error Rate
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
# Duration (P50, P95, P99)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
USE Method (Utilization, Saturation, Errors)
# CPU Utilization
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# CPU Saturation (load average)
node_load1 / count(node_cpu_seconds_total{mode="idle"}) without (cpu, mode)
# Memory Utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk saturation proxy (% of time the device was busy with I/O)
rate(node_disk_io_time_seconds_total[5m]) * 100
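These methodologies are straightforward to turn into a repeatable report. A sketch that evaluates one query per golden signal through the same query API used earlier; the endpoint address is an assumption and the queries mirror the examples above:
# golden_signals_report.py - evaluate one query per golden signal and print a summary
import requests

SIGNALS = {
    'latency_p99_seconds': 'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
    'traffic_rps': 'sum(rate(http_requests_total[5m]))',
    'error_rps': 'sum(rate(http_requests_total{status_code=~"5.."}[5m]))',
    'cpu_saturation_pct': 'avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100',
}

for name, expr in SIGNALS.items():
    data = requests.get(
        'http://localhost:9090/api/v1/query',
        params={'query': expr},
        timeout=5,
    ).json()['data']['result']
    # NaN signals that no data was returned for the query window
    value = float(data[0]['value'][1]) if data else float('nan')
    print(f"{name}: {value:.3f}")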
Performance Testing with Prometheus
Load Test Metrics Collection
# load_test_with_metrics.py
import time

import requests
from prometheus_client import CollectorRegistry, Counter, Histogram, push_to_gateway

# Use a dedicated registry so only load-test metrics are pushed
registry = CollectorRegistry()

test_requests_total = Counter(
    'load_test_requests_total', 'Total load test requests',
    ['status'], registry=registry
)
test_duration = Histogram(
    'load_test_duration_seconds', 'Load test request duration',
    registry=registry
)

def run_load_test(url, duration_seconds, rps):
    """Run a load test and push metrics to the Pushgateway."""
    end_time = time.time() + duration_seconds
    while time.time() < end_time:
        start = time.time()
        try:
            response = requests.get(url, timeout=10)
            test_requests_total.labels(status=response.status_code).inc()
        except requests.RequestException:
            test_requests_total.labels(status='error').inc()
        request_time = time.time() - start
        test_duration.observe(request_time)

        # Maintain the target request rate
        sleep_time = (1.0 / rps) - request_time
        if sleep_time > 0:
            time.sleep(sleep_time)

    # Push collected metrics to the Prometheus Pushgateway
    push_to_gateway('localhost:9091', job='load_test', registry=registry)

# Run test
run_load_test('http://api.example.com/health', duration_seconds=300, rps=100)
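The Pushgateway keeps the last pushed value until it is explicitly deleted, so stale load-test metrics can linger on dashboards long after a run. A short cleanup sketch using the same gateway address and job name as above:
# cleanup_pushgateway.py - remove load-test metrics after the run
from prometheus_client import delete_from_gateway

# Deletes all metrics previously pushed under the 'load_test' job
delete_from_gateway('localhost:9091', job='load_test')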
Conclusion
Prometheus and Grafana provide a powerful, flexible monitoring stack for QA engineers. From instrumenting applications to creating insightful dashboards and setting up intelligent alerts, this stack enables proactive performance monitoring (as discussed in API Performance Testing: Metrics and Tools) and rapid issue detection.
Key Takeaways:
- Instrument applications with custom metrics
- Master PromQL for powerful queries
- Create actionable dashboards with Grafana
- Set up alerts based on SLOs/SLIs
- Follow established monitoring methodologies: RED, USE, and the Four Golden Signals
- Integrate monitoring into load testing workflows
Effective monitoring isn’t just about collecting metrics — it’s about turning data into actionable insights.