In modern software systems, monitoring (as discussed in Risk-Based Testing: Prioritizing Test Efforts for Maximum Impact) and observability have evolved from operational concerns into critical quality assurance practices. QA professionals must not only verify that features work correctly during testing, but also understand how systems behave in production, identify performance bottlenecks, and detect issues before users experience them.
This comprehensive guide explores how QA teams can leverage monitoring and observability (as discussed in Service Mesh Testing: Istio and Linkerd Testing Guide) tools—including the ELK Stack for logs, Prometheus and Grafana for metrics, distributed tracing, and synthetic monitoring—to enhance testing strategies, improve system reliability, and enable proactive quality assurance.
Understanding Monitoring vs. Observability
While often used interchangeably, monitoring and observability serve different purposes:
Monitoring
Definition: Collecting, aggregating, and analyzing predefined metrics to detect known problems.
Characteristics:
- Answers known questions: “Is the system up?” “Is CPU usage above 80%?”
- Reactive approach: Alerts trigger when thresholds are exceeded
- Focuses on system health and availability
- Works with predefined dashboards and alerts
Example: Alert when API response time exceeds 500ms for 5 consecutive minutes.
Observability
Definition: Understanding internal system state based on external outputs (logs, metrics, traces) to answer arbitrary questions.
Characteristics:
- Answers unknown questions: “Why is checkout failing for iOS users in Europe?”
- Proactive approach: Enables exploration and debugging
- Focuses on understanding system behavior
- Works with flexible querying and correlation
Example: Investigating why a specific user’s transaction failed by correlating logs, metrics, and traces across multiple services.
The Three Pillars of Observability
- Logs: Discrete events with timestamps describing what happened
- Metrics: Numerical measurements over time showing system performance
- Traces: End-to-end journey of requests through distributed systems
ELK Stack for Log Management
The ELK Stack (Elasticsearch, Logstash, Kibana) provides powerful log aggregation, search, and visualization capabilities.
ELK Stack Architecture
- Elasticsearch: Distributed search and analytics engine for storing and querying logs
- Logstash: Server-side data processing pipeline for ingesting, transforming, and sending logs
- Kibana: Visualization and exploration tool for Elasticsearch data
- Beats (often added): Lightweight data shippers for forwarding logs from applications
Setting Up ELK Stack
Docker Compose setup (docker-compose.yml):
version: '3.8'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
container_name: elasticsearch
environment:
- discovery.type=single-node
- "ES_JAVA_OPTS=-Xms512m -Xmx512m"
- xpack.security.enabled=false
ports:
- "9200:9200"
- "9300:9300"
volumes:
- elasticsearch-data:/usr/share/elasticsearch/data
networks:
- elk
logstash:
image: docker.elastic.co/logstash/logstash:8.10.0
container_name: logstash
volumes:
- ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
- ./logstash/pipeline:/usr/share/logstash/pipeline
ports:
- "5044:5044"
- "9600:9600"
environment:
- "LS_JAVA_OPTS=-Xms256m -Xmx256m"
networks:
- elk
depends_on:
- elasticsearch
kibana:
image: docker.elastic.co/kibana/kibana:8.10.0
container_name: kibana
ports:
- "5601:5601"
environment:
- ELASTICSEARCH_HOSTS=http://elasticsearch:9200
networks:
- elk
depends_on:
- elasticsearch
filebeat:
image: docker.elastic.co/beats/filebeat:8.10.0
container_name: filebeat
user: root
volumes:
- ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
networks:
- elk
depends_on:
- elasticsearch
- logstash
volumes:
elasticsearch-data:
networks:
elk:
driver: bridge
Logstash pipeline configuration (logstash/pipeline/logstash.conf):
input {
beats {
port => 5044
}
tcp {
port => 5000
codec => json
}
}
filter {
# Parse JSON logs
if [message] =~ /^\{.*\}$/ {
json {
source => "message"
}
}
# Extract log level
grok {
match => {
"message" => "%{LOGLEVEL:log_level}"
}
}
# Parse timestamp
date {
match => [ "timestamp", "ISO8601", "yyyy-MM-dd HH:mm:ss,SSS" ]
target => "@timestamp"
}
# Add GeoIP data for IP addresses
if [client_ip] {
geoip {
source => "client_ip"
target => "geoip"
}
}
# Extract user agent information
if [user_agent] {
useragent {
source => "user_agent"
target => "user_agent_parsed"
}
}
# Categorize by service
mutate {
add_field => {
"service_category" => "%{[service][name]}"
}
}
# Filter out healthcheck logs
if [path] == "/health" or [url] == "/healthz" {
drop { }
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{[service][name]}-%{+YYYY.MM.dd}"
}
# Debug output (comment out in production)
stdout {
codec => rubydebug
}
}
Filebeat configuration (filebeat/filebeat.yml):
filebeat.inputs:
- type: container
enabled: true
paths:
- /var/lib/docker/containers/*/*.log
processors:
- add_docker_metadata:
host: "unix:///var/run/docker.sock"
- decode_json_fields:
fields: ["message"]
target: ""
overwrite_keys: true
- type: log
enabled: true
paths:
- /var/log/application/*.log
fields:
service: application
environment: production
multiline:
pattern: '^\['
negate: true
match: after
output.logstash:
hosts: ["logstash:5044"]
logging.level: info
logging.to_files: true
logging.files:
path: /var/log/filebeat
name: filebeat
keepfiles: 7
permissions: 0644
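For this pipeline to be useful, applications and test harnesses should emit structured JSON logs containing the fields the Logstash filter and index naming expect. A minimal sketch using winston, writing to the path tailed by the Filebeat log input above; the field names (service.name, environment) are assumptions chosen to match the configuration shown, not requirements of the stack:
const winston = require('winston');

// Minimal sketch: JSON logs written to the directory tailed by the Filebeat
// "log" input above. The service/environment metadata matches the index
// naming (logs-%{[service][name]}-*) used in the Logstash output.
const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  defaultMeta: { service: { name: 'checkout' }, environment: 'staging' },
  transports: [
    new winston.transports.File({ filename: '/var/log/application/checkout.log' })
  ]
});

logger.info('Order created', { orderId: 'o-123', totalAmount: 49.99 });
logger.error('Payment gateway timeout', { orderId: 'o-123', gateway: 'example-pay' });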
Using Kibana for QA
Creating Index Patterns:
- Navigate to Management → Stack Management → Index Patterns
- Create pattern: logs-*
- Select timestamp field: @timestamp
Building QA-Focused Dashboards:
Test Execution Monitoring Dashboard:
{
"title": "Test Execution Monitoring",
"panels": [
{
"title": "Test Pass Rate",
"type": "metric",
"query": "service.name:test-runner AND test.status:*"
},
{
"title": "Failed Tests Over Time",
"type": "line",
"query": "test.status:failed"
},
{
"title": "Test Duration Distribution",
"type": "histogram",
"field": "test.duration"
},
{
"title": "Error Messages",
"type": "table",
"query": "log_level:ERROR",
"columns": ["@timestamp", "service.name", "message", "error.stack_trace"]
}
]
}
Application Error Tracking Dashboard:
{
"title": "Application Errors",
"panels": [
{
"title": "Error Rate",
"type": "metric",
"query": "log_level:ERROR OR http.status_code:[500 TO 599]"
},
{
"title": "Top Error Types",
"type": "pie",
"field": "error.type"
},
{
"title": "Errors by Service",
"type": "bar",
"field": "service.name",
"query": "log_level:ERROR"
},
{
"title": "Recent Critical Errors",
"type": "table",
"query": "log_level:CRITICAL OR log_level:FATAL",
"columns": ["@timestamp", "service.name", "message", "error.message"]
}
]
}
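These panels assume the test runner ships structured logs with fields such as test.status and test.duration. One way to produce them is a Mocha root hook that logs one event per test; this is a sketch, and the field names simply mirror the dashboard queries above:
const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.combine(winston.format.timestamp(), winston.format.json()),
  defaultMeta: { service: { name: 'test-runner' } },
  transports: [new winston.transports.File({ filename: '/var/log/application/tests.log' })]
});

// Mocha root hook plugin (loaded with --require): emit one structured log
// event per test so Kibana can aggregate pass rate, duration, and failures.
exports.mochaHooks = {
  afterEach() {
    const test = this.currentTest;
    logger.info('test finished', {
      test: {
        name: test.fullTitle(),
        status: test.state,      // 'passed' or 'failed'
        duration: test.duration  // milliseconds
      },
      error: test.err ? { message: test.err.message } : undefined
    });
  }
};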
Useful Kibana Query Language (KQL) Examples:
# Find all errors in checkout service
service.name:"checkout" AND log_level:ERROR
# Find slow API responses (>1 second)
http.response.time_ms > 1000
# Find failed authentication attempts
event.action:"login" AND event.outcome:"failure"
# Find errors affecting specific user
user.id:"12345" AND log_level:ERROR
# Find database connection errors
message:"connection refused" OR message:"timeout"
# Find errors in last 15 minutes with specific error code
log_level:ERROR AND error.code:"500" AND @timestamp >= now-15m
# Find logs with specific transaction ID
transaction.id:"abc-123-xyz"
# Exclude healthcheck and monitoring logs
NOT (url:"/health" OR url:"/metrics" OR url:"/healthz")
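The same queries can be automated. For example, a post-deployment check can fail a pipeline stage when new errors appear for a service. A sketch using the official @elastic/elasticsearch client (v8-style API); the index pattern, field names, and threshold are assumptions matching the setup above:
const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });

// Fail loudly if any ERROR-level logs appeared for the service in the last
// 15 minutes. Field names match the Logstash pipeline shown earlier;
// depending on the mapping, keyword sub-fields (service.name.keyword) may be needed.
async function assertNoRecentErrors(serviceName) {
  const result = await client.search({
    index: 'logs-*',
    size: 10,
    query: {
      bool: {
        filter: [
          { term: { 'service.name': serviceName } },
          { term: { log_level: 'ERROR' } },
          { range: { '@timestamp': { gte: 'now-15m' } } }
        ]
      }
    }
  });

  const total = result.hits.total.value;
  if (total > 0) {
    console.error(`Found ${total} recent errors for ${serviceName}`);
    result.hits.hits.forEach((hit) => console.error(hit._source.message));
    process.exit(1);
  }
}

assertNoRecentErrors('checkout');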
Log Correlation for QA
Correlate logs across services using trace IDs:
Application logging with trace context (Node.js example):
const winston = require('winston');
const { v4: uuidv4 } = require('uuid');
// Create logger with trace context
const logger = winston.createLogger({
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
transports: [
new winston.transports.Console(),
new winston.transports.File({ filename: 'app.log' })
]
});
// Middleware to add trace ID
function traceMiddleware(req, res, next) {
req.traceId = req.headers['x-trace-id'] || uuidv4();
res.setHeader('X-Trace-ID', req.traceId);
// Attach logger with trace context
req.logger = logger.child({
traceId: req.traceId,
service: 'api-gateway',
environment: process.env.NODE_ENV
});
next();
}
// Use in requests
app.use(traceMiddleware);
app.post('/checkout', async (req, res) => {
req.logger.info('Checkout initiated', {
userId: req.user.id,
cartItems: req.body.items.length,
totalAmount: req.body.total
});
try {
const result = await processCheckout(req.body, req.traceId);
req.logger.info('Checkout completed', { orderId: result.orderId });
res.json(result);
} catch (error) {
req.logger.error('Checkout failed', {
error: error.message,
stack: error.stack,
userId: req.user.id
});
res.status(500).json({ error: 'Checkout failed' });
}
});
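Correlation only works end to end if the trace ID is forwarded on outbound calls so downstream services log the same identifier. A sketch of that propagation; the payment-service URL and axios usage are illustrative assumptions:
const axios = require('axios');

// Forward the incoming trace ID to downstream services so their logs share
// the same traceId and can be correlated in Kibana. URL and payload are
// illustrative.
async function processCheckout(order, traceId) {
  const payment = await axios.post(
    'http://payment-service:8082/payments',
    { orderId: order.id, amount: order.total },
    { headers: { 'X-Trace-ID': traceId } }
  );
  return { orderId: order.id, paymentId: payment.data.id };
}
On the receiving side, the same traceMiddleware shown above reads the X-Trace-ID header instead of generating a new one, so both services emit logs carrying an identical traceId.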
Prometheus and Grafana for Metrics
Prometheus collects and stores metrics as time series data, while Grafana provides visualization and alerting.
Prometheus Architecture
Components:
- Prometheus Server: Scrapes and stores metrics
- Exporters: Expose metrics from applications and infrastructure
- Pushgateway: Allows short-lived jobs to push metrics
- Alertmanager: Handles alerts and notifications
Setting Up Prometheus
Docker Compose addition:
prometheus:
image: prom/prometheus:v2.47.0
container_name: prometheus
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
ports:
- "9090:9090"
networks:
- monitoring
grafana:
image: grafana/grafana:10.1.0
container_name: grafana
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
ports:
- "3000:3000"
networks:
- monitoring
depends_on:
- prometheus
node-exporter:
image: prom/node-exporter:v1.6.1
container_name: node-exporter
ports:
- "9100:9100"
networks:
- monitoring
Prometheus configuration (prometheus/prometheus.yml):
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'production'
environment: 'prod'
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
# Load rules
rule_files:
- 'alerts/*.yml'
scrape_configs:
# Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node exporter (system metrics)
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
# Application metrics
- job_name: 'api-gateway'
static_configs:
- targets: ['api-gateway:8080']
metrics_path: '/metrics'
- job_name: 'checkout-service'
static_configs:
- targets: ['checkout:8081']
- job_name: 'payment-service'
static_configs:
- targets: ['payment:8082']
# Kubernetes service discovery
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
Instrumenting Applications for Prometheus
Node.js application metrics (Express + prom-client):
const express = require('express');
const promClient = require('prom-client');
const app = express();
// Create a Registry
const register = new promClient.Registry();
// Add default metrics (CPU, memory, etc.)
promClient.collectDefaultMetrics({ register });
// Custom metrics
const httpRequestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.1, 0.5, 1, 2, 5]
});
const httpRequestTotal = new promClient.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
const activeConnections = new promClient.Gauge({
name: 'http_active_connections',
help: 'Number of active HTTP connections'
});
const checkoutTotal = new promClient.Counter({
name: 'checkout_total',
help: 'Total number of checkout attempts',
labelNames: ['status', 'payment_method']
});
const checkoutDuration = new promClient.Histogram({
name: 'checkout_duration_seconds',
help: 'Duration of checkout process',
labelNames: ['status'],
buckets: [0.5, 1, 2, 5, 10, 30]
});
// Register metrics
register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(activeConnections);
register.registerMetric(checkoutTotal);
register.registerMetric(checkoutDuration);
// Middleware to track requests
app.use((req, res, next) => {
const start = Date.now();
activeConnections.inc();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
const route = req.route ? req.route.path : req.path;
httpRequestDuration
.labels(req.method, route, res.statusCode)
.observe(duration);
httpRequestTotal
.labels(req.method, route, res.statusCode)
.inc();
activeConnections.dec();
});
next();
});
// Business logic with metrics
app.post('/checkout', async (req, res) => {
const start = Date.now();
try {
const result = await processCheckout(req.body);
const duration = (Date.now() - start) / 1000;
checkoutDuration.labels('success').observe(duration);
checkoutTotal.labels('success', req.body.paymentMethod).inc();
res.json(result);
} catch (error) {
const duration = (Date.now() - start) / 1000;
checkoutDuration.labels('failure').observe(duration);
checkoutTotal.labels('failure', req.body.paymentMethod).inc();
res.status(500).json({ error: error.message });
}
});
// Metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
app.listen(8080);
Grafana Dashboards for QA
QA-Focused Dashboard JSON:
{
"dashboard": {
"title": "QA Metrics Dashboard",
"panels": [
{
"title": "API Request Rate",
"targets": [
{
"expr": "rate(http_requests_total[5m])",
"legendFormat": "{{method}} {{route}}"
}
],
"type": "graph"
},
{
"title": "95th Percentile Response Time",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "{{route}}"
}
],
"type": "graph"
},
{
"title": "Error Rate",
"targets": [
{
"expr": "rate(http_requests_total{status_code=~\"5..\"}[5m]) / rate(http_requests_total[5m]) * 100",
"legendFormat": "Error %"
}
],
"type": "graph"
},
{
"title": "Checkout Success Rate",
"targets": [
{
"expr": "rate(checkout_total{status=\"success\"}[5m]) / rate(checkout_total[5m]) * 100",
"legendFormat": "Success %"
}
],
"type": "gauge"
}
]
}
}
Useful PromQL Queries for QA:
# Request rate per second
rate(http_requests_total[5m])
# Error rate percentage
(rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m])) * 100
# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Requests per minute by endpoint
sum(rate(http_requests_total[1m])) by (route) * 60
# Failed checkouts in last hour
increase(checkout_total{status="failure"}[1h])
# Average checkout duration
rate(checkout_duration_seconds_sum[5m]) / rate(checkout_duration_seconds_count[5m])
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
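These queries also work well as automated quality gates, for example after a load test in CI. A sketch using Prometheus's HTTP query API (GET /api/v1/query); the Prometheus URL and the 2-second budget are assumptions:
// Query Prometheus's HTTP API and fail the run if p95 latency exceeds a
// budget. Uses Node 18+ global fetch; URL and threshold are illustrative.
const PROMETHEUS_URL = 'http://localhost:9090';

async function queryPrometheus(promql) {
  const res = await fetch(
    `${PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(promql)}`
  );
  const body = await res.json();
  if (body.status !== 'success') {
    throw new Error(`Prometheus query failed: ${promql}`);
  }
  return body.data.result; // [{ metric: {...}, value: [timestamp, "value"] }]
}

async function checkLatencyBudget() {
  const result = await queryPrometheus(
    'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
  );
  for (const series of result) {
    const p95 = parseFloat(series.value[1]);
    if (p95 > 2) {
      throw new Error(
        `p95 latency ${p95}s exceeds the 2s budget for ${JSON.stringify(series.metric)}`
      );
    }
  }
  console.log('Latency budget respected');
}

checkLatencyBudget().catch((err) => {
  console.error(err.message);
  process.exit(1);
});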
Distributed Tracing
Distributed tracing tracks requests as they flow through microservices, providing end-to-end visibility.
Jaeger Setup
Docker Compose addition:
jaeger:
image: jaegertracing/all-in-one:1.50
container_name: jaeger
environment:
- COLLECTOR_ZIPKIN_HOST_PORT=:9411
- COLLECTOR_OTLP_ENABLED=true
ports:
- "5775:5775/udp"
- "6831:6831/udp"
- "6832:6832/udp"
- "5778:5778"
- "16686:16686" # UI
- "14250:14250"
- "14268:14268"
- "14269:14269"
- "9411:9411"
networks:
- tracing
Instrumenting Node.js with OpenTelemetry:
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');
// Create provider
const provider = new NodeTracerProvider({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'checkout-service',
[SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
}),
});
// Configure exporter
const exporter = new JaegerExporter({
endpoint: 'http://jaeger:14268/api/traces',
});
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();
// Register instrumentations
registerInstrumentations({
instrumentations: [
new HttpInstrumentation(),
new ExpressInstrumentation(),
],
});
// Use in application
const tracer = provider.getTracer('checkout-service');
async function processCheckout(order) {
const span = tracer.startSpan('process_checkout');
span.setAttributes({
'order.id': order.id,
'order.total': order.total,
'user.id': order.userId,
});
try {
// Validate order
const validateSpan = tracer.startSpan('validate_order', undefined, trace.setSpan(context.active(), span));
await validateOrder(order);
validateSpan.end();
// Process payment
const paymentSpan = tracer.startSpan('process_payment', undefined, trace.setSpan(context.active(), span));
const payment = await processPayment(order);
paymentSpan.setAttributes({
'payment.method': payment.method,
'payment.status': payment.status,
});
paymentSpan.end();
// Create order
const orderSpan = tracer.startSpan('create_order', undefined, trace.setSpan(context.active(), span));
const result = await createOrder(order, payment);
orderSpan.end();
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
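The same tracing setup is useful inside end-to-end tests: a test can start its own spans so test steps show up in Jaeger under a dedicated service name and carry test metadata. A sketch, assuming a tracer provider like the one above is also registered in the test process; the service name, URL, and attributes are illustrative:
const axios = require('axios');
const { trace, SpanStatusCode } = require('@opentelemetry/api');

// Wrap an end-to-end scenario in its own span. With the HTTP instrumentation
// registered in the test process, the outbound call below is traced as well.
const tracer = trace.getTracer('e2e-tests');

async function runCheckoutScenario() {
  return tracer.startActiveSpan('e2e.checkout_scenario', async (span) => {
    span.setAttribute('test.suite', 'checkout');
    try {
      const res = await axios.post('http://api-gateway:8080/checkout', {
        items: [{ id: '123', quantity: 1 }]
      });
      span.setAttribute('http.status_code', res.status);
      span.setStatus({ code: SpanStatusCode.OK });
      return res.data;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}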
Synthetic Monitoring
Synthetic monitoring proactively tests system availability and performance from the user's perspective by running scripted probes on a schedule, rather than waiting for real traffic to expose problems.
Using Prometheus Blackbox Exporter
Configuration (blackbox.yml):
modules:
http_2xx:
prober: http
timeout: 5s
http:
method: GET
valid_status_codes: [200]
fail_if_not_ssl: true
preferred_ip_protocol: ip4
http_post_checkout:
prober: http
timeout: 10s
http:
method: POST
headers:
Content-Type: application/json
body: '{"userId": "test", "items": [{"id": "123", "quantity": 1}]}'
valid_status_codes: [200, 201]
tcp_connect:
prober: tcp
timeout: 5s
icmp_ping:
prober: icmp
timeout: 5s
Prometheus scrape config:
scrape_configs:
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://api.example.com/health
- https://www.example.com
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
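Blackbox probes cover availability; scripted synthetic checks can exercise full user flows. A sketch that times a checkout call and pushes the outcome to the Pushgateway mentioned earlier via prom-client; the URLs and payload are assumptions:
const promClient = require('prom-client');

// Scripted synthetic check: run a checkout against a real endpoint, record
// the outcome, and push metrics to the Pushgateway for Prometheus to scrape.
const register = new promClient.Registry();

const probeSuccess = new promClient.Gauge({
  name: 'synthetic_checkout_success',
  help: '1 if the synthetic checkout succeeded, 0 otherwise',
  registers: [register]
});

const probeDuration = new promClient.Gauge({
  name: 'synthetic_checkout_duration_seconds',
  help: 'Duration of the synthetic checkout in seconds',
  registers: [register]
});

const gateway = new promClient.Pushgateway('http://pushgateway:9091', {}, register);

async function runSyntheticCheckout() {
  const start = Date.now();
  try {
    const res = await fetch('https://api.example.com/checkout', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ userId: 'synthetic-user', items: [{ id: '123', quantity: 1 }] })
    });
    probeSuccess.set(res.ok ? 1 : 0);
  } catch (err) {
    probeSuccess.set(0);
  } finally {
    probeDuration.set((Date.now() - start) / 1000);
    await gateway.pushAdd({ jobName: 'synthetic_checkout' });
  }
}

runSyntheticCheckout();
Scheduled from CI or cron, this produces time series that Grafana can chart and Alertmanager can alert on just like the blackbox probes.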
Alerting for QA
Prometheus alert rules (alerts/qa-alerts.yml):
groups:
- name: qa_alerts
interval: 30s
rules:
- alert: HighErrorRate
expr: |
(rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m])) * 100 > 5
for: 5m
labels:
severity: critical
team: qa
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value }}% for {{ $labels.route }}"
- alert: SlowAPIResponse
expr: |
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
for: 10m
labels:
severity: warning
team: qa
annotations:
summary: "API response time degraded"
description: "95th percentile response time is {{ $value }}s"
- alert: CheckoutFailureSpike
expr: |
rate(checkout_total{status="failure"}[5m]) > 0.1
for: 5m
labels:
severity: critical
team: qa
annotations:
summary: "Checkout failures spiking"
description: "Checkout failure rate: {{ $value }} per second"
- alert: ServiceDown
expr: up{job="api-gateway"} == 0
for: 1m
labels:
severity: critical
team: qa
annotations:
summary: "Service {{ $labels.job }} is down"
description: "{{ $labels.instance }} has been down for more than 1 minute"
Conclusion
Monitoring and observability are essential components of modern QA practices. By leveraging tools like ELK Stack for logs, Prometheus and Grafana for metrics, distributed tracing with Jaeger, and synthetic monitoring, QA teams can shift from reactive bug discovery to proactive quality assurance.
These tools enable QA professionals to understand system behavior in production, identify performance bottlenecks, correlate issues across services, and detect problems before they impact users. The key is integrating observability into testing workflows, using production data to inform test strategies, and collaborating with DevOps teams to maintain high-quality, reliable systems.
Key Takeaways:
- Observability extends QA beyond traditional testing
- Logs, metrics, and traces provide comprehensive system visibility
- ELK Stack enables powerful log search and analysis
- Prometheus and Grafana track performance metrics over time
- Distributed tracing reveals service interactions and bottlenecks
- Synthetic monitoring proactively validates system availability
- Alerting enables rapid response to quality issues
- Integration with CI/CD provides continuous quality insights