In modern software systems, monitoring and observability have evolved from operational concerns into critical quality assurance practices. QA professionals must understand not only whether features work correctly during testing, but also how systems behave in production, and they must be able to identify performance bottlenecks and detect issues before users experience them.

This comprehensive guide explores how QA teams can leverage monitoring and observability tools—including the ELK Stack for logs, Prometheus and Grafana for metrics, distributed tracing, and synthetic monitoring—to enhance testing strategies, improve system reliability, and enable proactive quality assurance.

Understanding Monitoring vs. Observability

While often used interchangeably, monitoring and observability serve different purposes:

Monitoring

Definition: Collecting, aggregating, and analyzing predefined metrics to detect known problems.

Characteristics:

  • Answers known questions: “Is the system up?” “Is CPU usage above 80%?”
  • Reactive approach: Alerts trigger when thresholds are exceeded
  • Focuses on system health and availability
  • Works with predefined dashboards and alerts

Example: Alert when API response time exceeds 500ms for 5 consecutive minutes.

Observability

Definition: Understanding internal system state based on external outputs (logs, metrics, traces) to answer arbitrary questions.

Characteristics:

  • Answers unknown questions: “Why is checkout failing for iOS users in Europe?”
  • Proactive approach: Enables exploration and debugging
  • Focuses on understanding system behavior
  • Works with flexible querying and correlation

Example: Investigating why a specific user’s transaction failed by correlating logs, metrics, and traces across multiple services.

The Three Pillars of Observability

  1. Logs: Discrete events with timestamps describing what happened
  2. Metrics: Numerical measurements over time showing system performance
  3. Traces: End-to-end journey of requests through distributed systems
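
As a quick illustration, here is how a single slow checkout request might appear in each pillar (field names and values are purely illustrative, not a required schema):

# Log: a discrete event emitted by the checkout service
{"@timestamp":"2024-01-15T10:32:41Z","level":"ERROR","service":"checkout","traceId":"abc-123","message":"Payment gateway timeout"}

# Metric: one sample in a time series of request durations
http_request_duration_seconds_sum{route="/checkout",status_code="504"} 2.7

# Trace: the same request as a tree of timed spans across services
POST /checkout (2.7s)
  └─ payment-service: charge_card (2.5s, timed out)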

ELK Stack for Log Management

The ELK Stack (Elasticsearch, Logstash, Kibana) provides powerful log aggregation, search, and visualization capabilities.

ELK Stack Architecture

  • Elasticsearch: Distributed search and analytics engine for storing and querying logs
  • Logstash: Server-side data processing pipeline for ingesting, transforming, and sending logs
  • Kibana: Visualization and exploration tool for Elasticsearch data
  • Beats (often added): Lightweight data shippers for forwarding logs from applications

Setting Up ELK Stack

Docker Compose setup (docker-compose.yml):

version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
      - "9300:9300"
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    networks:
      - elk

  logstash:
    image: docker.elastic.co/logstash/logstash:8.10.0
    container_name: logstash
    volumes:
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"
      - "9600:9600"
    environment:
      - "LS_JAVA_OPTS=-Xms256m -Xmx256m"
    networks:
      - elk
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.10.0
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    networks:
      - elk
    depends_on:
      - elasticsearch

  filebeat:
    image: docker.elastic.co/beats/filebeat:8.10.0
    container_name: filebeat
    user: root
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - elk
    depends_on:
      - elasticsearch
      - logstash

volumes:
  elasticsearch-data:

networks:
  elk:
    driver: bridge
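
Once the stack is up, a quick smoke check confirms the core services are reachable (using the host ports mapped above; security is disabled in this setup, so no credentials are needed):

docker compose up -d

# Elasticsearch cluster health (expect "green" or "yellow" for a single node)
curl http://localhost:9200/_cluster/health?pretty

# Kibana status endpoint
curl http://localhost:5601/api/status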

Logstash pipeline configuration (logstash/pipeline/logstash.conf):

input {
  beats {
    port => 5044
  }

  tcp {
    port => 5000
    codec => json
  }
}

filter {
  # Parse JSON logs
  if [message] =~ /^\{.*\}$/ {
    json {
      source => "message"
    }
  }

  # Extract log level
  grok {
    match => {
      "message" => "%{LOGLEVEL:log_level}"
    }
  }

  # Parse timestamp
  date {
    match => [ "timestamp", "ISO8601", "yyyy-MM-dd HH:mm:ss,SSS" ]
    target => "@timestamp"
  }

  # Add GeoIP data for IP addresses
  if [client_ip] {
    geoip {
      source => "client_ip"
      target => "geoip"
    }
  }

  # Extract user agent information
  if [user_agent] {
    useragent {
      source => "user_agent"
      target => "user_agent_parsed"
    }
  }

  # Categorize by service
  mutate {
    add_field => {
      "service_category" => "%{[service][name]}"
    }
  }

  # Filter out healthcheck logs
  if [path] == "/health" or [url] == "/healthz" {
    drop { }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{[service][name]}-%{+YYYY.MM.dd}"
  }

  # Debug output (comment out in production)
  stdout {
    codec => rubydebug
  }
}

Filebeat configuration (filebeat/filebeat.yml):

filebeat.inputs:
  - type: container
    enabled: true
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - add_docker_metadata:
          host: "unix:///var/run/docker.sock"
      - decode_json_fields:
          fields: ["message"]
          target: ""
          overwrite_keys: true

  - type: log
    enabled: true
    paths:
      - /var/log/application/*.log
    fields:
      service: application
      environment: production
    multiline:
      pattern: '^\['
      negate: true
      match: after

output.logstash:
  hosts: ["logstash:5044"]

logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644

Using Kibana for QA

Creating Index Patterns:

  1. Navigate to Management → Stack Management → Index Patterns
  2. Create pattern: logs-*
  3. Select timestamp field: @timestamp
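
On Kibana 8.x, the same index pattern (now called a data view) can also be created through the Data Views API instead of the UI steps above, which is handy when provisioning test environments; this example assumes the Kibana instance from the compose file:

curl -X POST "http://localhost:5601/api/data_views/data_view" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -d '{"data_view": {"title": "logs-*", "timeFieldName": "@timestamp"}}'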

Building QA-Focused Dashboards:

Test Execution Monitoring Dashboard:

{
  "title": "Test Execution Monitoring",
  "panels": [
    {
      "title": "Test Pass Rate",
      "type": "metric",
      "query": "service.name:test-runner AND test.status:*"
    },
    {
      "title": "Failed Tests Over Time",
      "type": "line",
      "query": "test.status:failed"
    },
    {
      "title": "Test Duration Distribution",
      "type": "histogram",
      "field": "test.duration"
    },
    {
      "title": "Error Messages",
      "type": "table",
      "query": "log_level:ERROR",
      "columns": ["@timestamp", "service.name", "message", "error.stack_trace"]
    }
  ]
}

Application Error Tracking Dashboard:

{
  "title": "Application Errors",
  "panels": [
    {
      "title": "Error Rate",
      "type": "metric",
      "query": "log_level:ERROR OR http.status_code:[500 TO 599]"
    },
    {
      "title": "Top Error Types",
      "type": "pie",
      "field": "error.type"
    },
    {
      "title": "Errors by Service",
      "type": "bar",
      "field": "service.name",
      "query": "log_level:ERROR"
    },
    {
      "title": "Recent Critical Errors",
      "type": "table",
      "query": "log_level:CRITICAL OR log_level:FATAL",
      "columns": ["@timestamp", "service.name", "message", "error.message"]
    }
  ]
}

Useful Kibana Query Language (KQL) Examples:

# Find all errors in checkout service
service.name:"checkout" AND log_level:ERROR

# Find slow API responses (>1 second)
http.response.time_ms > 1000

# Find failed authentication attempts
event.action:"login" AND event.outcome:"failure"

# Find errors affecting specific user
user.id:"12345" AND log_level:ERROR

# Find database connection errors
message:"connection refused" OR message:"timeout"

# Find errors in last 15 minutes with specific error code
log_level:ERROR AND error.code:"500" AND @timestamp >= "now-15m"

# Find logs with specific transaction ID
transaction.id:"abc-123-xyz"

# Exclude healthcheck and monitoring logs
NOT (url:"/health" OR url:"/metrics" OR url:"/healthz")

Log Correlation for QA

Correlate logs across services using trace IDs:

Application logging with trace context (Node.js example):

const winston = require('winston');
const { v4: uuidv4 } = require('uuid');

// Create logger with trace context
const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'app.log' })
  ]
});

// Middleware to add trace ID
function traceMiddleware(req, res, next) {
  req.traceId = req.headers['x-trace-id'] || uuidv4();
  res.setHeader('X-Trace-ID', req.traceId);

  // Attach logger with trace context
  req.logger = logger.child({
    traceId: req.traceId,
    service: 'api-gateway',
    environment: process.env.NODE_ENV
  });

  next();
}

// Use in requests
app.use(traceMiddleware);

app.post('/checkout', async (req, res) => {
  req.logger.info('Checkout initiated', {
    userId: req.user.id,
    cartItems: req.body.items.length,
    totalAmount: req.body.total
  });

  try {
    const result = await processCheckout(req.body, req.traceId);
    req.logger.info('Checkout completed', { orderId: result.orderId });
    res.json(result);
  } catch (error) {
    req.logger.error('Checkout failed', {
      error: error.message,
      stack: error.stack,
      userId: req.user.id
    });
    res.status(500).json({ error: 'Checkout failed' });
  }
});
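
For correlation to work end to end, each service must forward the incoming trace ID on its outbound calls so every hop logs the same value. A minimal sketch of processCheckout doing so (the checkout-service URL is an assumption, and the built-in fetch requires Node 18+):

// Forward the trace ID to downstream services so their logs share it
async function processCheckout(order, traceId) {
  const response = await fetch('http://checkout-service:8081/orders', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-Trace-ID': traceId  // picked up by the downstream service's traceMiddleware
    },
    body: JSON.stringify(order)
  });

  if (!response.ok) {
    throw new Error(`Checkout service responded with ${response.status}`);
  }
  return response.json();
}

With this in place, searching traceId:"abc-123-xyz" in Kibana returns every log line for that request across all participating services.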

Prometheus and Grafana for Metrics

Prometheus collects and stores metrics as time series data, while Grafana provides visualization and alerting.

Prometheus Architecture

Components:

  • Prometheus Server: Scrapes and stores metrics
  • Exporters: Expose metrics from applications and infrastructure
  • Pushgateway: Allows short-lived jobs to push metrics
  • Alertmanager: Handles alerts and notifications
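
Because test runs are short-lived jobs, the Pushgateway is often the first of these components QA teams use directly: the test runner pushes its results before exiting, and Prometheus scrapes them later. A minimal sketch with prom-client (assumes prom-client v14+ and a Pushgateway reachable at pushgateway:9091; the metric name and the Pushgateway service are assumptions and are not part of the compose setup below):

const promClient = require('prom-client');

// Registry holding only the metrics for this test run
const register = new promClient.Registry();

const testsFailed = new promClient.Gauge({
  name: 'qa_tests_failed',
  help: 'Number of failed tests in the last run',
  registers: [register]
});

async function reportRun(failedCount) {
  testsFailed.set(failedCount);

  // Push results so Prometheus can scrape them after the job has exited
  const gateway = new promClient.Pushgateway('http://pushgateway:9091', {}, register);
  await gateway.pushAdd({ jobName: 'nightly-regression' });
}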

Setting Up Prometheus

Docker Compose addition:

  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.1.0
    container_name: grafana
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    networks:
      - monitoring
    depends_on:
      - prometheus

  node-exporter:
    image: prom/node-exporter:v1.6.1
    container_name: node-exporter
    ports:
      - "9100:9100"
    networks:
      - monitoring
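
The grafana service above mounts ./grafana/provisioning; dropping a minimal datasource file into that tree (for example grafana/provisioning/datasources/prometheus.yml, a path assumed to match Grafana's provisioning layout) lets Grafana connect to Prometheus automatically at startup:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true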

Prometheus configuration (prometheus/prometheus.yml):

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    environment: 'prod'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Load rules
rule_files:
  - 'alerts/*.yml'

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Application metrics
  - job_name: 'api-gateway'
    static_configs:
      - targets: ['api-gateway:8080']
    metrics_path: '/metrics'

  - job_name: 'checkout-service'
    static_configs:
      - targets: ['checkout:8081']

  - job_name: 'payment-service'
    static_configs:
      - targets: ['payment:8082']

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

Instrumenting Applications for Prometheus

Node.js application metrics (Express + prom-client):

const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create a Registry
const register = new promClient.Registry();

// Add default metrics (CPU, memory, etc.)
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new promClient.Gauge({
  name: 'http_active_connections',
  help: 'Number of active HTTP connections'
});

const checkoutTotal = new promClient.Counter({
  name: 'checkout_total',
  help: 'Total number of checkout attempts',
  labelNames: ['status', 'payment_method']
});

const checkoutDuration = new promClient.Histogram({
  name: 'checkout_duration_seconds',
  help: 'Duration of checkout process',
  labelNames: ['status'],
  buckets: [0.5, 1, 2, 5, 10, 30]
});

// Register metrics
register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(activeConnections);
register.registerMetric(checkoutTotal);
register.registerMetric(checkoutDuration);

// Middleware to track requests
app.use((req, res, next) => {
  const start = Date.now();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route ? req.route.path : req.path;

    httpRequestDuration
      .labels(req.method, route, res.statusCode)
      .observe(duration);

    httpRequestTotal
      .labels(req.method, route, res.statusCode)
      .inc();

    activeConnections.dec();
  });

  next();
});

// Business logic with metrics
app.post('/checkout', async (req, res) => {
  const start = Date.now();

  try {
    const result = await processCheckout(req.body);

    const duration = (Date.now() - start) / 1000;
    checkoutDuration.labels('success').observe(duration);
    checkoutTotal.labels('success', req.body.paymentMethod).inc();

    res.json(result);
  } catch (error) {
    const duration = (Date.now() - start) / 1000;
    checkoutDuration.labels('failure').observe(duration);
    checkoutTotal.labels('failure', req.body.paymentMethod).inc();

    res.status(500).json({ error: error.message });
  }
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8080);
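
A quick way to verify the instrumentation before wiring up dashboards is to hit the metrics endpoint directly; the response uses the Prometheus text exposition format and will look roughly like this (values are illustrative):

curl http://localhost:8080/metrics

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="POST",route="/checkout",status_code="200"} 42

# HELP http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.5",method="POST",route="/checkout",status_code="200"} 38
http_request_duration_seconds_sum{method="POST",route="/checkout",status_code="200"} 12.3
http_request_duration_seconds_count{method="POST",route="/checkout",status_code="200"} 42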

Grafana Dashboards for QA

QA-Focused Dashboard JSON:

{
  "dashboard": {
    "title": "QA Metrics Dashboard",
    "panels": [
      {
        "title": "API Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "95th Percentile Response Time",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "{{route}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{status_code=~\"5..\"}[5m]) / rate(http_requests_total[5m]) * 100",
            "legendFormat": "Error %"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Checkout Success Rate",
        "targets": [
          {
            "expr": "rate(checkout_total{status=\"success\"}[5m]) / rate(checkout_total[5m]) * 100",
            "legendFormat": "Success %"
          }
        ],
        "type": "gauge"
      }
    ]
  }
}

Useful PromQL Queries for QA:

# Request rate per second
rate(http_requests_total[5m])

# Error rate percentage
(sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100

# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Requests per minute by endpoint
sum(rate(http_requests_total[1m])) by (route) * 60

# Failed checkouts in last hour
increase(checkout_total{status="failure"}[1h])

# Average checkout duration
rate(checkout_duration_seconds_sum[5m]) / rate(checkout_duration_seconds_count[5m])

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Distributed Tracing

Distributed tracing tracks requests as they flow through microservices, providing end-to-end visibility.

Jaeger Setup

Docker Compose addition:

  jaeger:
    image: jaegertracing/all-in-one:1.50
    container_name: jaeger
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"  # UI
      - "14250:14250"
      - "14268:14268"
      - "14269:14269"
      - "9411:9411"
    networks:
      - tracing

Instrumenting Node.js with OpenTelemetry:

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');

// Create provider
const provider = new NodeTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'checkout-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
});

// Configure exporter
const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});

provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();

// Register instrumentations
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

// Use in application
const tracer = provider.getTracer('checkout-service');

async function processCheckout(order) {
  const span = tracer.startSpan('process_checkout');
  // Use this span as the parent context for the child spans created below
  const ctx = trace.setSpan(context.active(), span);

  span.setAttributes({
    'order.id': order.id,
    'order.total': order.total,
    'user.id': order.userId,
  });

  try {
    // Validate order
    const validateSpan = tracer.startSpan('validate_order', undefined, ctx);
    await validateOrder(order);
    validateSpan.end();

    // Process payment
    const paymentSpan = tracer.startSpan('process_payment', undefined, ctx);
    const payment = await processPayment(order);
    paymentSpan.setAttributes({
      'payment.method': payment.method,
      'payment.status': payment.status,
    });
    paymentSpan.end();

    // Create order
    const orderSpan = tracer.startSpan('create_order', undefined, ctx);
    const result = await createOrder(order, payment);
    orderSpan.end();

    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    span.end();
  }
}

Synthetic Monitoring

Synthetic monitoring proactively tests system availability and performance from the user's perspective.

Using Prometheus Blackbox Exporter

Configuration (blackbox.yml):

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      valid_status_codes: [200]
      fail_if_not_ssl: true
      preferred_ip_protocol: ip4

  http_post_checkout:
    prober: http
    timeout: 10s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"userId": "test", "items": [{"id": "123", "quantity": 1}]}'
      valid_status_codes: [200, 201]

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp_ping:
    prober: icmp
    timeout: 5s

Prometheus scrape config:

scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
          - https://www.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
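
The Blackbox Exporter exposes a standard set of probe metrics for each target, which can be queried and alerted on like any application metric:

# Whether the most recent probe succeeded (1) or failed (0)
probe_success{job="blackbox"}

# End-to-end probe duration per target
probe_duration_seconds{job="blackbox"}

# HTTP status code returned by the target
probe_http_status_code

# Days until the target's TLS certificate expires
(probe_ssl_earliest_cert_expiry - time()) / 86400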

Alerting for QA

Prometheus alert rules (alerts/qa-alerts.yml):

groups:
  - name: qa_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          (sum by (route) (rate(http_requests_total{status_code=~"5.."}[5m])) / sum by (route) (rate(http_requests_total[5m]))) * 100 > 5
        for: 5m
        labels:
          severity: critical
          team: qa
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}% for {{ $labels.route }}"

      - alert: SlowAPIResponse
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
          team: qa
        annotations:
          summary: "API response time degraded"
          description: "95th percentile response time is {{ $value }}s"

      - alert: CheckoutFailureSpike
        expr: |
          rate(checkout_total{status="failure"}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
          team: qa
        annotations:
          summary: "Checkout failures spiking"
          description: "Checkout failure rate: {{ $value }} per second"

      - alert: ServiceDown
        expr: up{job="api-gateway"} == 0
        for: 1m
        labels:
          severity: critical
          team: qa
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"

Conclusion

Monitoring and observability are essential components of modern QA practices. By leveraging tools like the ELK Stack for logs, Prometheus and Grafana for metrics, distributed tracing with Jaeger, and synthetic monitoring, QA teams can shift from reactive bug discovery to proactive quality assurance.

These tools enable QA professionals to understand system behavior in production, identify performance bottlenecks, correlate issues across services, and detect problems before they impact users. The key is integrating observability into testing workflows, using production data to inform test strategies, and collaborating with DevOps teams to maintain high-quality, reliable systems.

Key Takeaways:

  • Observability extends QA beyond traditional testing
  • Logs, metrics, and traces provide comprehensive system visibility
  • ELK Stack enables powerful log search and analysis
  • Prometheus and Grafana track performance metrics over time
  • Distributed tracing reveals service interactions and bottlenecks
  • Synthetic monitoring proactively validates system availability
  • Alerting enables rapid response to quality issues
  • Integration with CI/CD provides continuous quality insights