The ELK Stack for QA
ELK stands for Elasticsearch, Logstash, and Kibana — three open-source tools that together form a powerful log management platform. For QA engineers, ELK provides the ability to search, analyze, and visualize application logs at scale.
Components
Elasticsearch: The search engine that stores and indexes log data. It allows lightning-fast full-text searches across billions of log entries.
Logstash: The data processing pipeline that ingests logs from various sources, transforms them, and sends them to Elasticsearch. It can parse different log formats, enrich data with metadata, and filter out noise.
Kibana: The visualization layer. It provides a web interface for searching logs, building dashboards, and creating alerts. This is where QA engineers spend most of their time.
Filebeat (one of the Beats family of lightweight shippers, which extends ELK into what Elastic now calls the Elastic Stack): A log shipper installed on application servers that sends logs to Logstash or directly to Elasticsearch.
Kibana for Log Investigation
Searching Logs
The Kibana Discover view is your primary tool for log investigation:
# Find all errors in the payment service in the last hour
service: "payment-service" AND level: "ERROR"
# Find timeout errors
message: "timeout" OR message: "timed out"
# Find errors for a specific user
userId: "usr_12345" AND level: "ERROR"
# Find errors during a specific test run
timestamp: [2024-01-15T10:00:00 TO 2024-01-15T10:30:00] AND level: "ERROR"
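The same searches can be run programmatically against Elasticsearch's `_search` endpoint, which is useful for scripting post-test checks. A minimal sketch of the query DSL equivalent of the first and last searches above; the index pattern `logs-*` and the field names (`service`, `level`, `@timestamp`) are assumptions — match them to your own index mapping:

```python
# Sketch: the Kibana searches above expressed as an Elasticsearch query DSL body.
# Field names and index pattern are assumptions; adjust to your mapping.

def error_query(service: str, start: str, end: str) -> dict:
    """Build a query for ERROR-level logs from one service in a time window."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"level": "ERROR"}},
                    {"range": {"@timestamp": {"gte": start, "lte": end}}},
                ]
            }
        },
        "sort": [{"@timestamp": "asc"}],
    }

body = error_query("payment-service",
                   "2024-01-15T10:00:00", "2024-01-15T10:30:00")
# Send with, e.g.: requests.post(f"{ES_URL}/logs-*/_search", json=body)
```

The `filter` context (rather than `must`) skips relevance scoring, which is the idiomatic choice for exact-match log filtering.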
Correlation Workflow
When a test fails, follow this workflow in Kibana:
- Note the exact time of the test failure
- Search for errors in that time window (±5 minutes)
- Filter by service to narrow down the source
- Expand the log entry to see full details (stack trace, request ID)
- Search by request ID to trace the request across services
- Check related services for cascading failures
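The first and fifth steps of this workflow can be scripted so a test runner pivots straight from a failure timestamp to a cross-service trace. A sketch, assuming a `requestId` field and a ±5-minute window; both are conventions you would adapt to your own log schema:

```python
from datetime import datetime, timedelta

# Sketch of the correlation workflow above. The requestId/@timestamp field
# names and the 5-minute window are assumptions, not a fixed standard.

def failure_window(failure_time: datetime, minutes: int = 5):
    """Step 2: the time window to search around the test failure."""
    return (failure_time - timedelta(minutes=minutes),
            failure_time + timedelta(minutes=minutes))

def trace_query(request_id: str) -> dict:
    """Step 5: every log entry for one request, across all services."""
    return {"query": {"term": {"requestId": request_id}},
            "sort": [{"@timestamp": "asc"}]}

start, end = failure_window(datetime(2024, 1, 15, 10, 15))
# Search [start, end] for errors first, then pivot to trace_query()
# with the request ID found in the expanded log entry.
```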
Building Visualizations
Kibana lets you create visualizations from log data:
- Line chart: Error count over time (spot trends after deployments)
- Pie chart: Error distribution by service (which service has the most issues)
- Data table: Top error messages (most common failures)
- Metric: Total error count in the last hour
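Under the hood, each of these visualizations is an Elasticsearch aggregation. A sketch of the aggregation behind the first one, the error-count-over-time line chart; the one-minute bucket size and field names are assumptions:

```python
# Sketch: the "error count over time" line chart as an Elasticsearch
# date_histogram aggregation. Bucket interval and field names are assumptions.
errors_over_time = {
    "size": 0,  # we only want the buckets, not the matching documents
    "query": {"term": {"level": "ERROR"}},
    "aggs": {
        "errors_per_minute": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"}
        }
    },
}
```

Kibana builds this request for you, but knowing the shape helps when a saved visualization needs to become an automated check.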
Grafana for Metrics Dashboards
Grafana excels at visualizing time-series metrics from Prometheus, InfluxDB, Elasticsearch, and many other data sources.
Building a QA Dashboard
Panel 1: Test Execution Trend
Query: count of test runs by status (pass/fail) over time
Type: Time series (stacked)
Panel 2: Flaky Test Rate
Query: (tests that changed result in consecutive runs) / total tests
Type: Gauge with threshold colors
Panel 3: Pipeline Duration
Query: average pipeline execution time by stage
Type: Bar chart
Panel 4: Deployment Success Rate
Query: successful deployments / total deployments
Type: Stat with sparkline
Panel 5: Application Error Rate (post-deployment)
Query: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
Type: Time series with alert threshold line
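The Panel 2 formula — tests whose result flipped between consecutive runs, divided by total tests — is simple enough to compute in a pre-ingestion script if your test results are not already in a queryable store. A sketch, assuming results arrive as a mapping from test name to a chronological list of outcomes:

```python
# Sketch of the Panel 2 "flaky test rate" calculation. The input shape
# (test name -> chronological list of "pass"/"fail" results) is an assumption.

def flaky_rate(runs: dict[str, list[str]]) -> float:
    """A test is flaky if its result changed between any two consecutive runs."""
    if not runs:
        return 0.0
    flaky = sum(
        1 for results in runs.values()
        if any(a != b for a, b in zip(results, results[1:]))
    )
    return flaky / len(runs)

rate = flaky_rate({
    "test_login":    ["pass", "pass", "pass"],
    "test_checkout": ["pass", "fail", "pass"],  # flipped -> counted as flaky
})
# rate == 0.5
```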
Annotations
Grafana annotations mark events on time-series graphs. Mark deployments on your metrics graphs to correlate metrics changes with releases:
{
"time": 1705315200000,
"text": "Deployed v2.3.1 to production",
"tags": ["deployment", "production"]
}
This lets you see at a glance that, for example, the error rate increased five minutes after the v2.3.1 deployment.
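Annotations like this are typically created automatically from the CI pipeline via Grafana's annotation HTTP API (`POST /api/annotations`). A sketch that builds the payload shown above; the Grafana URL and API token are placeholders, and the actual network call is left commented out:

```python
import time

# Sketch: creating a deployment annotation from a CI pipeline. The URL and
# token below are placeholders; the POST itself is commented out so this
# stays side-effect free.

def deployment_annotation(version: str, env: str) -> dict:
    return {
        "time": int(time.time() * 1000),  # Grafana expects epoch milliseconds
        "text": f"Deployed {version} to {env}",
        "tags": ["deployment", env],
    }

payload = deployment_annotation("v2.3.1", "production")
# requests.post(f"{GRAFANA_URL}/api/annotations", json=payload,
#               headers={"Authorization": f"Bearer {API_TOKEN}"})
```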
Practical Log Analysis Patterns
Pattern: Post-Deployment Validation
After each deployment, automatically query logs for:
- New error types not seen before the deployment
- Error rate changes compared to pre-deployment baseline
- Slow query warnings or timeout errors
- Configuration-related errors
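The first check — new error types — is a straightforward set difference between error signatures extracted from the pre- and post-deployment log windows. A sketch, assuming the signatures (e.g. exception class names) have already been pulled out of the two windows:

```python
# Sketch of the "new error types" check above. Inputs are assumed to be
# error-message signatures extracted from the two log windows.

def new_error_types(baseline: set[str], post_deploy: set[str]) -> set[str]:
    """Error types seen after the deployment but never in the baseline."""
    return post_deploy - baseline

found = new_error_types(
    {"DBTimeout", "RateLimited"},
    {"DBTimeout", "RateLimited", "NullPointerException"},
)
# found == {"NullPointerException"} -> flag for investigation
```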
Pattern: Test Failure Root Cause
When an E2E test fails with a generic error (e.g., “element not found”), the real cause is often in the backend:
- Get the timestamp of the test failure
- Search backend logs for errors at that time
- Common findings: database timeout, null pointer exception, failed third-party API call
Pattern: Performance Regression Detection
Compare log-based latency metrics before and after deployment:
- Average response time by endpoint
- P99 response times
- Database query durations
- External API call durations
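The P99 comparison can be automated with a simple threshold check. A sketch using the standard library; the 20% regression tolerance is an assumption — tune it to your SLOs, and note that `statistics.quantiles` interpolates, so it needs a reasonably large sample:

```python
from statistics import quantiles

# Sketch of a before/after P99 latency comparison. The 20% tolerance is an
# assumption; pick a threshold that matches your SLOs.

def p99(samples: list[float]) -> float:
    """99th-percentile latency of a list of response times."""
    return quantiles(samples, n=100)[98]

def regressed(before: list[float], after: list[float],
              tolerance: float = 1.2) -> bool:
    """True if post-deployment P99 exceeds the baseline by > 20%."""
    return p99(after) > p99(before) * tolerance
```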
Exercise: Investigate a Production Incident Using Logs
Scenario: After a deployment at 14:00, users report that the checkout page is slow. Error rate has increased from 0.5% to 3%. Use log analysis to find the root cause.
Solution
Step 1: Kibana search for errors after deployment
timestamp: [2024-01-15T14:00:00 TO 2024-01-15T14:30:00] AND level: "ERROR"
Result: 247 errors found, mostly from “order-service”
Step 2: Filter to order-service errors
service: "order-service" AND level: "ERROR" AND timestamp: [2024-01-15T14:00:00 TO 2024-01-15T14:30:00]
Result: “Connection timeout to inventory-service” (180 occurrences)
Step 3: Check inventory-service logs
service: "inventory-service" AND level: ("ERROR" OR "WARN") AND timestamp: [2024-01-15T14:00:00 TO 2024-01-15T14:30:00]
Result: “Database connection pool exhausted” (repeated warnings)
Step 4: Check inventory-service database metrics in Grafana
- Connection pool: 50/50 (maxed out)
- Active queries: 48 (vs. normal 10-15)
- Slow queries: 35 queries taking >5s (vs. normal 0)
Root cause: The new deployment introduced a database query without an index. Under production traffic, this query takes 5+ seconds instead of 50ms, exhausting the connection pool. The inventory-service stops responding, causing checkout timeouts.
Fix: Add the missing database index. Roll back the deployment until the fix is ready.
Key Takeaways
- ELK is essential for QA log investigation — search and filter logs to find root causes
- Grafana visualizes the big picture — dashboards show trends that individual logs cannot
- Correlate timestamps — match test failure times with log entries for fast debugging
- Mark deployments on dashboards — annotations connect metrics changes to releases
- Build reusable queries — save common investigation queries in Kibana for quick access