The Modern Monitoring Stack
Prometheus and Grafana form the industry-standard open-source monitoring stack. Prometheus collects and stores time-series metrics, while Grafana visualizes them through customizable dashboards. For QA engineers, this stack provides real-time insights into application performance, infrastructure health, and user experience metrics.
Effective monitoring is a core component of continuous testing in DevOps, enabling teams to catch issues early. When combined with API performance testing, you can correlate load test results with production metrics. For teams running containerized environments, our guide on containerization for testing shows how to integrate monitoring with Docker workflows.
Why Prometheus + Grafana?
- Pull-Based Model - Prometheus scrapes metrics from targets, no client-side push needed
- Powerful Query Language (PromQL) - Flexible querying and aggregation of metrics
- Service Discovery - Automatic target discovery in dynamic environments (Kubernetes, AWS, etc.)
- Alerting - Built-in alert manager with routing and silencing
- Open Source - No vendor lock-in, large community support
- Grafana Visualization - Rich dashboards with multiple data source support
Prometheus Architecture
Core Components
# Prometheus architecture overview
components:
  prometheus_server:
    - scrapes_metrics: true
    - stores_timeseries: true
    - evaluates_rules: true
  exporters:
    - node_exporter: "System metrics (CPU, memory, disk)"
    - blackbox_exporter: "Probe endpoints (HTTP, DNS, TCP)"
    - custom_exporters: "Application-specific metrics"
  pushgateway:
    - for_batch_jobs: true
    - short_lived_processes: true
  alertmanager:
    - handles_alerts: true
    - routes_notifications: true
    - silences_alerts: true
  service_discovery:
    - kubernetes: true
    - consul: true
    - ec2: true
    - dns: true
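The custom exporter listed above is often the first piece a QA team writes itself. Below is a minimal sketch using the official prometheus_client library; the metric name, the queue-depth source, and port 9200 are illustrative assumptions, not part of the architecture above.
# custom_exporter.py - hypothetical queue-depth exporter (names and port are illustrative)
import random
import time

from prometheus_client import Gauge, start_http_server

# Gauge exposed on /metrics by the HTTP server started below
queue_depth = Gauge('app_queue_depth', 'Number of jobs waiting in the queue')

def read_queue_depth():
    # Replace with a real lookup (database, message broker, etc.)
    return random.randint(0, 50)

if __name__ == '__main__':
    start_http_server(9200)  # assumed free port; add it to scrape_configs
    while True:
        queue_depth.set(read_queue_depth())
        time.sleep(15)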
Installing Prometheus
# Docker installation
docker run -d \
--name prometheus \
-p 9090:9090 \
-v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
# Kubernetes installation (using Helm)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus
# Configuration file: prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/metrics'
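Once Prometheus is up, it is worth confirming that every job in scrape_configs is actually being scraped. A small sketch against Prometheus's HTTP API; the localhost:9090 address assumes the Docker setup above.
# check_targets.py - list scrape targets and their health via the Prometheus HTTP API
import requests

PROMETHEUS_URL = 'http://localhost:9090'  # assumes the Docker install above

resp = requests.get(f'{PROMETHEUS_URL}/api/v1/targets', timeout=5)
resp.raise_for_status()

for target in resp.json()['data']['activeTargets']:
    job = target['labels'].get('job', 'unknown')
    print(f"{job:20} {target['scrapeUrl']:40} health={target['health']}")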
Instrumenting Applications
Node.js Application
// app.js
const express = require('express');
const promClient = require('prom-client');

const app = express();
const register = promClient.register;

// Enable default metrics (CPU, memory, event loop lag)
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

// Middleware to track metrics
app.use((req, res, next) => {
  const start = Date.now();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
    httpRequestTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();
    activeConnections.dec();
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Business logic
app.get('/api/users', async (req, res) => {
  // Your logic here
  res.json({ users: [] });
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
  console.log('Metrics available at http://localhost:3000/metrics');
});
Python Application (Flask)
# app.py
from flask import Flask, Response, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, REGISTRY
import time

app = Flask(__name__)

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.1, 0.3, 0.5, 1.0, 3.0, 5.0, 10.0]
)

active_requests = Gauge(
    'http_requests_active',
    'Active HTTP requests'
)

# Middleware
@app.before_request
def before_request():
    request.start_time = time.time()
    active_requests.inc()

@app.after_request
def after_request(response):
    duration = time.time() - request.start_time
    request_count.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    request_duration.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(duration)
    active_requests.dec()
    return response

# Metrics endpoint
@app.route('/metrics')
def metrics():
    return Response(generate_latest(REGISTRY), mimetype='text/plain')

# Application routes
@app.route('/api/data')
def get_data():
    return {'data': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
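Before pointing Prometheus at the app, a quick test can confirm the instrumentation actually shows up on /metrics. A minimal pytest sketch against the Flask example above, assuming it is saved as app.py in the same directory:
# test_metrics.py - sanity-check the /metrics endpoint of the Flask app above
from app import app  # assumes the Flask example is saved as app.py

def test_metrics_endpoint_exposes_custom_metrics():
    client = app.test_client()

    # Generate one request so the counters have something to report
    assert client.get('/api/data').status_code == 200

    body = client.get('/metrics').get_data(as_text=True)
    assert 'http_requests_total' in body
    assert 'http_request_duration_seconds' in body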
PromQL: The Query Language
Basic Queries
# Instant vector - current value
http_requests_total
# Filter by labels
http_requests_total{method="GET", status_code="200"}
# Range vector - values over time
http_requests_total[5m]
# Rate of increase (per second)
rate(http_requests_total[5m])
# Increase over time period
increase(http_requests_total[1h])
Advanced PromQL
# Request rate per endpoint
sum(rate(http_requests_total[5m])) by (route)
# Error rate percentage
(
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
) * 100
# P95 latency
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
)
# Availability percentage
(
sum(up{job="application"})
/ count(up{job="application"})
) * 100
# Rate of change of available memory (negative values mean usage is growing)
deriv(node_memory_MemAvailable_bytes[1h])
Aggregation Operators
# Sum across all instances
sum(http_requests_total)
# Average response time by route (histograms expose _sum and _count series)
sum(rate(http_request_duration_seconds_sum[5m])) by (route)
  / sum(rate(http_request_duration_seconds_count[5m])) by (route)
# Maximum per-instance CPU usage (non-idle time)
max(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
# Count of services up
count(up == 1)
# Top 5 endpoints by traffic
topk(5, sum(rate(http_requests_total[5m])) by (endpoint))
# 3 instances with the lowest average latency
bottomk(3,
  sum(rate(http_request_duration_seconds_sum[5m])) by (instance)
    / sum(rate(http_request_duration_seconds_count[5m])) by (instance)
)
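These queries are not only for dashboards: a QA pipeline can run them through Prometheus's query API and fail a build when a threshold is breached. A hedged sketch; the server address and the 1% error-rate threshold are assumptions, not values from this article.
# slo_check.py - run a PromQL query via the HTTP API and assert on the result
import requests

PROMETHEUS_URL = 'http://localhost:9090'  # assumed address
ERROR_RATE_QUERY = (
    '(sum(rate(http_requests_total{status_code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total[5m]))) * 100'
)

def query_scalar(promql):
    resp = requests.get(
        f'{PROMETHEUS_URL}/api/v1/query', params={'query': promql}, timeout=5
    )
    resp.raise_for_status()
    result = resp.json()['data']['result']
    return float(result[0]['value'][1]) if result else 0.0

error_rate = query_scalar(ERROR_RATE_QUERY)
print(f'Current 5xx error rate: {error_rate:.2f}%')
assert error_rate < 1.0, 'Error rate SLO breached'  # threshold is an example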
Grafana Dashboards
Installing Grafana
# Docker installation
docker run -d \
--name=grafana \
-p 3000:3000 \
grafana/grafana
# Add Prometheus data source
# Navigate to: http://localhost:3000 (admin/admin)
# Configuration → Data Sources → Add Prometheus
# URL: http://prometheus:9090
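The data source can also be provisioned programmatically, which is handy for disposable test environments. A sketch against Grafana's HTTP API; the default admin/admin credentials and the prometheus hostname are assumptions carried over from the setup above.
# add_datasource.py - register Prometheus as a Grafana data source via the HTTP API
import requests

GRAFANA_URL = 'http://localhost:3000'
AUTH = ('admin', 'admin')  # default credentials from the Docker image; use a token in real setups

payload = {
    'name': 'Prometheus',
    'type': 'prometheus',
    'url': 'http://prometheus:9090',  # assumes both containers share a Docker network
    'access': 'proxy',
    'isDefault': True,
}

resp = requests.post(f'{GRAFANA_URL}/api/datasources', json=payload, auth=AUTH, timeout=5)
print(resp.status_code, resp.json())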
Creating Performance Dashboard
{
  "dashboard": {
    "title": "Application Performance",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (route)",
            "legendFormat": "{{route}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error %"
          }
        ],
        "type": "graph",
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [5],
                "type": "gt"
              },
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "reducer": {"type": "avg"}
            }
          ]
        }
      },
      {
        "title": "P95 Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))",
            "legendFormat": "{{route}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Active Connections",
        "targets": [
          {
            "expr": "active_connections",
            "legendFormat": "Connections"
          }
        ],
        "type": "stat"
      }
    ]
  }
}
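A dashboard defined as JSON like this can live in version control and be pushed through Grafana's API rather than clicked together by hand. A hedged sketch; it assumes the JSON above is saved as dashboard.json and that the default credentials are still in place.
# import_dashboard.py - push a JSON dashboard definition to Grafana
import json
import requests

GRAFANA_URL = 'http://localhost:3000'
AUTH = ('admin', 'admin')  # default credentials; use an API token in real setups

with open('dashboard.json') as f:
    dashboard = json.load(f)['dashboard']  # the "dashboard" object from the example above

payload = {'dashboard': dashboard, 'overwrite': True}
resp = requests.post(f'{GRAFANA_URL}/api/dashboards/db', json=payload, auth=AUTH, timeout=5)
resp.raise_for_status()
print('Imported dashboard:', resp.json().get('url'))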
Pre-Built Dashboards
# Import popular dashboards from Grafana.com
# Node Exporter Full (ID: 1860)
# - System metrics: CPU, memory, disk, network
# Kubernetes Cluster Monitoring (ID: 7249)
# - Pod metrics, deployments, resource usage
# Application Performance (Custom)
# - Request rates, error rates, latencies
# - Database query performance
# - Cache hit rates
Alerting with Prometheus & Grafana
Prometheus Alert Rules
# alerts.yml
groups:
  - name: performance_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          (sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.route }}"
          description: "P95 latency is {{ $value }}s"

      - alert: ServiceDown
        expr: up{job="application"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes /
            node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90%"
          description: "{{ $labels.instance }} memory: {{ $value }}%"
AlertManager Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.gmail.com:587'
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
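During a test run it is often useful to confirm which alerts are actually firing before sign-off. A sketch against Alertmanager's v2 API; the localhost:9093 address is Alertmanager's default port and is assumed here.
# check_alerts.py - list currently firing alerts from Alertmanager's v2 API
import requests

ALERTMANAGER_URL = 'http://localhost:9093'  # default Alertmanager port, assumed

resp = requests.get(f'{ALERTMANAGER_URL}/api/v2/alerts', params={'active': 'true'}, timeout=5)
resp.raise_for_status()

for alert in resp.json():
    name = alert['labels'].get('alertname', 'unknown')
    severity = alert['labels'].get('severity', 'none')
    print(f"{name:25} severity={severity:10} since {alert['startsAt']}")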
Monitoring Best Practices
The Four Golden Signals
# 1. LATENCY - Request duration
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# 2. TRAFFIC - Request rate
sum(rate(http_requests_total[5m]))
# 3. ERRORS - Error rate
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
# 4. SATURATION - Resource utilization
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100
RED Method (Rate, Errors, Duration)
# Request Rate
sum(rate(http_requests_total[5m])) by (service)
# Error Rate
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
# Duration (P50, P95, P99)
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
USE Method (Utilization, Saturation, Errors)
# CPU Utilization
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# CPU Saturation (load average)
node_load1 / count(node_cpu_seconds_total{mode="idle"}) without (cpu, mode)
# Memory Utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk Saturation (IO wait)
rate(node_disk_io_time_seconds_total[5m]) * 100
Performance Testing with Prometheus
Load Test Metrics Collection
# load_test_with_metrics.py
import time

import requests
from prometheus_client import REGISTRY, Counter, Histogram, push_to_gateway

# Metrics (registered in the default REGISTRY so they can be pushed together)
test_requests_total = Counter('load_test_requests_total', 'Total load test requests', ['status'])
test_duration = Histogram('load_test_duration_seconds', 'Load test request duration')

def run_load_test(url, duration_seconds, rps):
    """Run load test and push metrics to Pushgateway"""
    end_time = time.time() + duration_seconds

    while time.time() < end_time:
        start = time.time()
        try:
            response = requests.get(url, timeout=10)
            test_requests_total.labels(status=response.status_code).inc()
        except requests.RequestException:
            test_requests_total.labels(status='error').inc()

        request_time = time.time() - start
        test_duration.observe(request_time)

        # Maintain target RPS
        sleep_time = (1.0 / rps) - request_time
        if sleep_time > 0:
            time.sleep(sleep_time)

    # Push to Prometheus Pushgateway (push the registry the metrics were created in)
    push_to_gateway('localhost:9091', job='load_test', registry=REGISTRY)

# Run test
run_load_test('http://api.example.com/health', duration_seconds=300, rps=100)
Conclusion
Prometheus and Grafana provide a powerful, flexible monitoring stack for QA engineers. From instrumenting applications to creating insightful dashboards and setting up intelligent alerts, this stack enables proactive performance monitoring and rapid issue detection (see also API Performance Testing: Metrics and Tools).
Key Takeaways:
- Instrument applications with custom metrics
- Master PromQL for powerful queries
- Create actionable dashboards with Grafana
- Set up alerts based on SLOs/SLIs
- Follow established monitoring methodologies: RED, USE, and the Four Golden Signals (also discussed in Chaos Engineering: Breaking Systems the Right Way)
- Integrate monitoring into load testing workflows
Effective monitoring isn’t just about collecting metrics — it’s about turning data into actionable insights.
See Also
- Continuous Testing in DevOps - Integrate monitoring with your testing pipeline
- API Performance Testing - Load testing strategies for APIs
- CI/CD Pipeline Optimization for QA Teams - Optimize delivery with monitoring insights
- Containerization for Testing - Monitor containerized test environments
- Test Automation Strategy - Build monitoring into your automation framework