What Is ETL Testing?

ETL (Extract, Transform, Load) is the process of moving data from source systems to target systems, typically data warehouses or analytics databases. ETL testing verifies that this process is correct, complete, and performant.

Source Systems → Extract → Transform → Load → Target System
(databases,        (pull     (clean,      (insert   (data warehouse,
 APIs, files)       data)     convert,     into      analytics DB)
                              aggregate)   target)

ETL bugs are expensive because they corrupt analytical data. If a report shows incorrect revenue numbers because the ETL pipeline miscalculated currency conversions, business decisions based on that data are flawed.

The Three Phases of ETL Testing

Phase 1: Extract Testing

Verify that data is pulled correctly from source systems.

Test cases:

  • All expected records are extracted (count comparison).
  • Filters work correctly (date ranges, status filters).
  • Incremental extraction picks up only new or changed records.
  • Extraction handles source system downtime gracefully.

-- Source: Count records for extraction window
SELECT COUNT(*) FROM source_orders
WHERE updated_at >= '2025-01-01' AND updated_at < '2025-01-02';
-- Result: 15,234

-- Staging: Verify same count after extraction
SELECT COUNT(*) FROM staging_orders
WHERE source_date >= '2025-01-01' AND source_date < '2025-01-02';
-- Expected: 15,234
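The two counts above can be compared automatically instead of by eye. A minimal sketch using Python's sqlite3, assuming both tables are reachable from one connection (in practice the source and staging databases usually live in separate systems, so you would hold two connections):

```python
import sqlite3

def count_rows(conn, table, date_col, start, end):
    """Count rows in `table` whose date column falls in [start, end)."""
    # Table/column names are interpolated for brevity in this sketch;
    # in real code, validate them against an allowlist first.
    sql = f"SELECT COUNT(*) FROM {table} WHERE {date_col} >= ? AND {date_col} < ?"
    return conn.execute(sql, (start, end)).fetchone()[0]

def check_extraction(conn, start, end):
    """Fail loudly if staging does not contain every extracted source row."""
    src = count_rows(conn, "source_orders", "updated_at", start, end)
    stg = count_rows(conn, "staging_orders", "source_date", start, end)
    assert src == stg, f"count mismatch: source={src}, staging={stg}"
    return src
```

A check like this belongs in the pipeline itself, not just in a test suite: run it after every extraction window and fail the run on a mismatch.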

Phase 2: Transform Testing

Verify that data transformations produce correct results.

Common transformations to test:

Transformation        Example                    Test
Type conversion       String date → DATE         Verify format handling, time zones
Currency conversion   USD → EUR                  Verify exchange rates, rounding
Aggregation           Daily → Monthly totals     Verify sums match detail records
Deduplication         Remove duplicate records   Verify correct record retained
Default values        NULL → “Unknown”           Verify all NULLs handled
Business rules        Status mapping             Verify rule application

-- Verify currency conversion
-- Source: amount_usd = 100.00, exchange_rate = 0.85
-- Target: amount_eur should be 85.00
SELECT order_id, amount_usd, amount_eur
FROM target_orders
WHERE order_id = 12345;
-- Verify: amount_eur = amount_usd * exchange_rate
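A conversion check like this is sensitive to rounding, so the expected value should be computed with an explicit rounding rule rather than floating-point arithmetic. A sketch using Python's decimal module (half-up rounding to two places is an assumption here; use whatever rule the pipeline actually specifies):

```python
from decimal import Decimal, ROUND_HALF_UP

def expected_eur(amount_usd: str, rate: str) -> Decimal:
    """Compute the expected converted amount, rounded to 2 decimal places."""
    raw = Decimal(amount_usd) * Decimal(rate)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# 100.00 USD at a 0.85 rate should land on exactly 85.00 EUR.
assert expected_eur("100.00", "0.85") == Decimal("85.00")

# A case where the rounding rule matters: 33.33 * 0.85 = 28.3305 → 28.33.
assert expected_eur("33.33", "0.85") == Decimal("28.33")
```

Constructing the Decimals from strings matters: `Decimal(0.85)` would inherit the binary float's representation error, which is exactly the class of bug this check is meant to catch.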

Phase 3: Load Testing

Verify that transformed data is written correctly to the target system.

Test cases:

  • Record counts match between transform and load stages.
  • Primary keys and constraints are respected.
  • Incremental loads do not create duplicates.
  • Full loads replace data correctly.
  • Slowly Changing Dimensions (SCDs) are handled properly.
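For SCD Type 2 tables, one testable invariant is that every business key has exactly one current row. A sketch of that check, assuming a `dim_customer` table with a `customer_id` business key and an `is_current` flag (these names are illustrative, not from the pipeline above):

```python
import sqlite3

def scd2_current_row_violations(conn, table="dim_customer"):
    """Return business keys that do not have exactly one current row."""
    # Zero current rows means the key's history was "closed" without a
    # replacement; two or more means an update inserted without expiring
    # the previous version. Both are load bugs.
    sql = f"""
        SELECT customer_id, SUM(is_current) AS current_rows
        FROM {table}
        GROUP BY customer_id
        HAVING SUM(is_current) <> 1
    """
    return conn.execute(sql).fetchall()
```

Related invariants worth checking in the same pass: validity date ranges per key must not overlap, and the current row's end date (if used) must be open or sentinel-valued.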

Data Quality Checks

Completeness

-- No NULL values in required fields
SELECT COUNT(*) FROM target_orders WHERE customer_id IS NULL;
-- Expected: 0

-- All source records present in target
SELECT COUNT(*) FROM source_orders s
LEFT JOIN target_orders t ON s.order_id = t.source_order_id
WHERE t.source_order_id IS NULL;
-- Expected: 0 (no orphaned source records)

Accuracy

-- Sum comparison
SELECT SUM(total) FROM source_orders WHERE order_date = '2025-01-15';
-- Compare with:
SELECT SUM(total_converted) FROM target_orders WHERE order_date = '2025-01-15';
-- Values should match (accounting for currency conversion)

Consistency

-- Referential integrity in target
SELECT COUNT(*) FROM target_order_items oi
LEFT JOIN target_orders o ON oi.order_id = o.order_id
WHERE o.order_id IS NULL;
-- Expected: 0 (no orphaned order items)

Timeliness

Measure how long the ETL pipeline takes. Set SLAs and alert when exceeded.
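At its simplest, a timeliness check times the run and compares it against the SLA. A minimal sketch (the 30-minute figure is illustrative; a production version would page an on-call channel rather than print):

```python
import time

SLA_SECONDS = 30 * 60  # illustrative 30-minute SLA

def run_with_sla(pipeline, sla_seconds=SLA_SECONDS):
    """Run the pipeline, return (result, elapsed_seconds), flag SLA breaches."""
    start = time.monotonic()          # monotonic: immune to clock adjustments
    result = pipeline()
    elapsed = time.monotonic() - start
    if elapsed > sla_seconds:
        print(f"SLA breach: run took {elapsed:.1f}s (limit {sla_seconds}s)")
    return result, elapsed
```

Recording `elapsed` per run also gives you a trend line, which catches the slow degradation that a single pass/fail threshold misses.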

Error Handling Testing

ETL pipelines must handle errors gracefully:

  1. Bad records: Source data with invalid formats, missing fields, or constraint violations.
  2. Network failures: Source system unreachable during extraction.
  3. Schema changes: Source adds or removes columns.
  4. Volume spikes: 10x the normal data volume.

Test approach:

  • Inject bad records into the source and verify they are routed to an error table (not silently dropped or loaded).
  • Verify error records include the original data, error reason, and timestamp.
  • Verify the pipeline continues processing good records when bad records are encountered.

-- Check error/reject table
SELECT error_reason, COUNT(*) as count
FROM etl_error_log
WHERE pipeline_run = 'run-2025-01-15'
GROUP BY error_reason;
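The routing behaviour itself can be exercised end-to-end with a toy pipeline. A sketch assuming the simplest possible validation rule (non-positive quantity is "bad"); a real test would drive the actual pipeline with its full rule set:

```python
import sqlite3

def load_with_error_routing(conn, records):
    """Load good records into target; route bad ones to an error table."""
    for rec_id, qty in records:
        if qty <= 0:
            # Bad record: captured with a reason, never silently dropped.
            conn.execute(
                "INSERT INTO etl_error_log (source_id, error_reason) VALUES (?, ?)",
                (rec_id, "non-positive quantity"),
            )
        else:
            conn.execute("INSERT INTO target (id, qty) VALUES (?, ?)", (rec_id, qty))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, qty INTEGER)")
conn.execute("CREATE TABLE etl_error_log (source_id INTEGER, error_reason TEXT)")
load_with_error_routing(conn, [(1, 5), (2, -3), (3, 7)])

# Good records keep flowing past the bad one; the bad one is captured.
assert conn.execute("SELECT COUNT(*) FROM target").fetchone()[0] == 2
assert conn.execute("SELECT COUNT(*) FROM etl_error_log").fetchone()[0] == 1
```

The key assertion is the pairing: total in + total rejected must equal total out, so nothing can vanish between source and target.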

Incremental vs Full Load Testing

Full load: Drops and recreates all target data. Test that the target matches the source completely.

Incremental load: Only processes new or changed records. More complex to test:

  • New records are inserted.
  • Changed records are updated (not duplicated).
  • Deleted source records are handled (soft delete, hard delete, or ignored).
  • The “watermark” (last processed timestamp or ID) advances correctly.
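The watermark rule in the last bullet reduces to a testable invariant: after a run, the stored watermark equals the largest timestamp the run processed, and re-extracting from that watermark returns nothing. A sketch with illustrative table and column names:

```python
import sqlite3

def incremental_extract(conn, watermark):
    """Pull rows changed since the watermark; return (rows, new_watermark)."""
    rows = conn.execute(
        "SELECT order_id, updated_at FROM source_orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Advance the watermark to the latest processed timestamp, or keep
    # the old one if nothing changed.
    new_watermark = rows[-1][1] if rows else watermark
    return rows, new_watermark
```

Note the strict `>` comparison: with `>=`, rows sharing the watermark timestamp would be reprocessed every run, which is exactly the duplicate-creating bug this test exists to catch (conversely, `>` can miss late arrivals that share the watermark timestamp, so pipelines often pair it with a small overlap window plus dedup).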

Exercise: ETL Pipeline Testing

Setup

Create source and target tables:

-- Source table
CREATE TABLE source_sales (
    sale_id INTEGER PRIMARY KEY,
    product_name VARCHAR(200),
    quantity INTEGER,
    unit_price DECIMAL(10,2),
    sale_date DATE,
    currency VARCHAR(3),
    customer_email VARCHAR(255),
    region VARCHAR(50)
);

-- Target table (transformed)
CREATE TABLE target_sales_summary (
    summary_id SERIAL PRIMARY KEY,
    product_name VARCHAR(200),
    total_quantity INTEGER,
    total_revenue_usd DECIMAL(12,2),
    sale_month DATE,
    region VARCHAR(50),
    avg_unit_price DECIMAL(10,2),
    etl_loaded_at TIMESTAMP DEFAULT NOW()
);

-- Error table
CREATE TABLE etl_errors (
    error_id SERIAL PRIMARY KEY,
    source_sale_id INTEGER,
    error_type VARCHAR(100),
    error_detail TEXT,
    raw_data JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);

Insert test data with intentional issues:

INSERT INTO source_sales VALUES
  (1, 'Widget A', 10, 25.00, '2025-01-15', 'USD', 'alice@test.com', 'North'),
  (2, 'Widget B', 5, 50.00, '2025-01-15', 'EUR', 'bob@test.com', 'South'),
  (3, 'Widget A', -3, 25.00, '2025-01-15', 'USD', 'charlie@test.com', 'North'), -- negative qty
  (4, NULL, 7, 30.00, '2025-01-16', 'USD', 'dave@test.com', 'East'), -- null product
  (5, 'Widget C', 20, 15.00, '2025-01-16', 'GBP', 'eve@test.com', 'West'),
  (6, 'Widget A', 10, 25.00, '2025-01-16', 'XXX', 'frank@test.com', 'North'); -- invalid currency

Task 1: Extract Validation

Write queries to verify extraction completeness:

  1. Count source records by date.
  2. Count source records by region.
  3. Identify records with data quality issues before transformation.

Task 2: Transform Validation

Simulate the transformation (the ETL should aggregate by product, month, and region, converting all amounts to USD):

  1. Verify currency conversion: EUR at 1.08 rate, GBP at 1.27 rate.
  2. Verify aggregation: after error routing, Widget A in the North region for January should total 10 units, $250.00 USD (only sale 1 is valid; sales 3 and 6 are errors).
  3. Verify error routing: Records with negative quantity, NULL product, or invalid currency should go to etl_errors table.
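One way to get trustworthy expected values for these checks is to recompute the aggregates independently from the setup data, applying the same validity rules. A sketch (exchange rates are the ones given in step 1; the validity rules are the ones in step 3):

```python
RATES = {"USD": 1.00, "EUR": 1.08, "GBP": 1.27}  # rates from Task 2

# (sale_id, product, qty, unit_price, currency, region) from the setup data
SALES = [
    (1, "Widget A", 10, 25.00, "USD", "North"),
    (2, "Widget B", 5, 50.00, "EUR", "South"),
    (3, "Widget A", -3, 25.00, "USD", "North"),  # bad: negative quantity
    (4, None, 7, 30.00, "USD", "East"),          # bad: NULL product
    (5, "Widget C", 20, 15.00, "GBP", "West"),
    (6, "Widget A", 10, 25.00, "XXX", "North"),  # bad: invalid currency
]

def is_valid(row):
    _, product, qty, _, currency, _ = row
    return product is not None and qty > 0 and currency in RATES

def expected_summary(rows):
    """Aggregate valid rows by (product, region) → (total_qty, revenue_usd)."""
    summary = {}
    for _, product, qty, price, currency, region in filter(is_valid, rows):
        q, rev = summary.get((product, region), (0, 0.0))
        summary[(product, region)] = (q + qty, rev + qty * price * RATES[currency])
    return summary

summary = expected_summary(SALES)
# Sales 3, 4, and 6 are filtered out, so only sale 1 remains for
# Widget A / North: 10 units at $25.00 with a 1.00 rate.
assert summary[("Widget A", "North")] == (10, 250.0)
```

Comparing this independently computed dictionary against the loaded `target_sales_summary` rows turns Task 2 into a mechanical diff rather than a spot check.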

Task 3: Load Validation (Reconciliation)

After the ETL runs, verify:

  1. Total revenue in source (after conversion) equals total revenue in target.
  2. Total quantity in source (excluding errors) equals total quantity in target.
  3. Number of error records matches expected count (3 bad records).
  4. No duplicate summary rows exist.

Task 4: Idempotency Test

  1. Run the ETL pipeline once. Record the target state.
  2. Run the ETL pipeline again with the same source data.
  3. Verify the target state is identical (no duplicates, same counts, same sums).
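These steps can be automated by snapshotting the target before and after the second run. A minimal sketch in which the ETL is a function that upserts by key; the toy `run_etl` here stands in for whatever the real pipeline does:

```python
import sqlite3

def run_etl(conn, source_rows):
    """Toy idempotent load: upsert by primary key instead of blind insert."""
    conn.executemany(
        "INSERT INTO target (id, total) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET total = excluded.total",
        source_rows,
    )

def snapshot(conn):
    """Capture the target state: row count plus a checksum-style sum."""
    return conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(total), 0) FROM target"
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, total REAL)")
rows = [(1, 100.0), (2, 250.0)]

run_etl(conn, rows)
first = snapshot(conn)
run_etl(conn, rows)               # second run, identical source data
assert snapshot(conn) == first    # idempotent: no duplicates, same sums
```

The upsert (rather than plain INSERT) is what makes the pipeline idempotent; with a blind insert the second run would either duplicate rows or fail on the primary key, and the snapshot comparison would expose it.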

Task 5: Incremental Load Test

  1. Run the ETL for January 15 data only.
  2. Confirm the January 16 source rows (already inserted during setup) are not yet in the target.
  3. Run the ETL for January 16 (incremental).
  4. Verify January 15 data is unchanged and January 16 data is added.

Deliverables

  1. Extraction validation queries with results.
  2. Transformation verification queries showing currency conversion and aggregation accuracy.
  3. Reconciliation report comparing source and target totals.
  4. Idempotency test results.
  5. Incremental load verification.