Selenium Grid 4: Distributed Test Execution Architecture

Introduction to Selenium Grid 4

Selenium Grid 4 represents a complete architectural redesign of the distributed test execution framework that has been the backbone of parallel browser testing for over a decade. Unlike its predecessors, Grid 4 introduces modern observability features, a fully asynchronous communication layer, and native support for containerized deployments.

The fourth major version addresses critical pain points from Grid 3: difficult troubleshooting, limited scalability, and complex setup procedures. With built-in support for Docker, Kubernetes, GraphQL APIs, and OpenTelemetry tracing, Grid 4 brings enterprise-grade reliability to open-source test infrastructure.

This guide covers Grid 4’s architecture, deployment strategies, advanced configuration options, and implementation patterns for teams scaling browser automation from dozens to thousands of concurrent sessions.

Architectural Evolution

From Hub-Node to Distributed Components

Selenium Grid 4 breaks the monolithic Hub-Node model into six specialized components that can be deployed independently or combined:

Router: Entry point for all WebDriver commands, distributes requests to appropriate services

Distributor: Manages node registration and assigns new session requests to available nodes based on capabilities

Session Map: Maintains the mapping between session IDs and the nodes executing those sessions

New Session Queue: Buffers incoming session requests when all nodes are busy, implementing FIFO queueing

Event Bus: Asynchronous message channel for inter-component communication; the bundled implementation is built on ZeroMQ and by default uses port 4442 for publishing and 4443 for subscribing

Node: Executes WebDriver commands on actual browser instances

This microservices-inspired architecture enables:

Horizontal scaling of individual bottleneck components
Zero-downtime rolling updates
Cloud-native deployment patterns
Better fault isolation
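
To make the division of labour between components concrete, here is a toy model of how a FIFO New Session Queue, a Distributor, and a Session Map interact. It is a sketch only: the class and field names are invented for illustration, and Grid's real capability matching is considerably richer than a browser-name comparison.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Node:
    """A node advertising one browser with a fixed number of free slots."""
    name: str
    browser: str
    free_slots: int


@dataclass
class ToyGrid:
    """Toy stand-in for Queue + Distributor + Session Map (illustrative only)."""
    nodes: list
    queue: deque = field(default_factory=deque)       # New Session Queue (FIFO)
    session_map: dict = field(default_factory=dict)   # session id -> node name
    _ids: count = field(default_factory=lambda: count(1))

    def request_session(self, browser: str) -> None:
        """Router hands a new-session request to the queue."""
        self.queue.append(browser)

    def distribute(self) -> None:
        """Distributor: walk the queue in arrival order and place what fits."""
        waiting = deque()
        while self.queue:
            browser = self.queue.popleft()
            node = next((n for n in self.nodes
                         if n.browser == browser and n.free_slots > 0), None)
            if node is None:
                waiting.append(browser)   # stays queued until a slot frees up
                continue
            node.free_slots -= 1
            self.session_map[f"session-{next(self._ids)}"] = node.name
        self.queue = waiting


grid = ToyGrid(nodes=[Node("node-a", "chrome", 1), Node("node-b", "firefox", 1)])
for wanted in ["chrome", "chrome", "firefox"]:
    grid.request_session(wanted)
grid.distribute()
print(grid.session_map)   # which node owns which session
print(list(grid.queue))   # requests still waiting for a free slot
```

With one Chrome slot and one Firefox slot, the second Chrome request stays in the queue while the Firefox request behind it is still served.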

Standalone, Hub-Node, and Fully Distributed Modes

Grid 4 supports three deployment topologies:

Mode	Use Case	Components	Scalability
Standalone	Local development, CI pipelines	All-in-one process	Single machine
Hub	Small-to-medium teams	Hub + Nodes	Add Nodes; single Hub process
Distributed	Enterprise deployments	Independent components	Horizontal scaling

Standalone mode combines all components in a single JVM process, ideal for Docker Compose setups or GitHub Actions workflows. Hub mode groups the Router, Distributor, Session Map, New Session Queue, and Event Bus into a single Hub process while Nodes run separately. Fully distributed mode deploys each component independently for maximum flexibility.

GraphQL API for Grid Introspection

One of Grid 4’s most powerful features is its GraphQL endpoint that exposes real-time grid state, session information, and node capabilities.

Querying Grid Status

The GraphQL interface at /graphql provides rich metadata about grid health:

{
  grid {
    totalSlots
    nodeCount
    sessionCount
    maxSession
    sessionQueueSize
  }
  nodesInfo {
    nodes {
      id
      status
      uri
      stereotypes
      sessions {
        id
        capabilities
        startTime
      }
    }
  }
}

This enables building custom dashboards, capacity planning tools, and integration with monitoring systems like Grafana. Unlike Grid 3’s limited JSON status endpoint, the GraphQL API allows clients to request exactly the data they need.
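
As a sketch of how a dashboard or capacity script might consume this endpoint, the snippet below posts a GraphQL query with the Python standard library and reduces the response to a utilization figure. The Grid URL is a placeholder, and the helper names (build_request, summarize, fetch_summary) are made up for this example; only the fields totalSlots, sessionCount, and maxSession from the query above are used.

```python
import json
import urllib.request

# Placeholder address; point this at your own Grid.
GRID_GRAPHQL_URL = "http://localhost:4444/graphql"
STATUS_QUERY = "{ grid { totalSlots sessionCount maxSession } }"


def build_request(url: str, query: str) -> urllib.request.Request:
    """Wrap a GraphQL query in a JSON POST request."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def summarize(payload: dict) -> dict:
    """Reduce a GraphQL response to the numbers a dashboard would plot."""
    grid = payload["data"]["grid"]
    total = grid["totalSlots"]
    used = grid["sessionCount"]
    return {
        "total_slots": total,
        "active_sessions": used,
        "utilization": used / total if total else 0.0,
    }


def fetch_summary(url: str = GRID_GRAPHQL_URL) -> dict:
    """Query a live Grid (requires network access to it)."""
    with urllib.request.urlopen(build_request(url, STATUS_QUERY), timeout=10) as resp:
        return summarize(json.load(resp))


# Offline demonstration with a hand-written response of the same shape.
sample = {"data": {"grid": {"totalSlots": 16, "sessionCount": 12, "maxSession": 16}}}
print(summarize(sample))
```

Keeping the parsing in a pure function (summarize) makes it testable without a running Grid; fetch_summary is the only part that touches the network.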

Dynamic Capability Discovery

Teams can query available browser versions and platform combinations programmatically:

{
  nodesInfo {
    nodes {
      osInfo {
        name
        version
        arch
      }
      stereotypes
    }
  }
}

This is invaluable for test frameworks that need to discover capabilities dynamically or validate test matrix configurations against actual grid capacity.
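
One possible use is a pre-flight check that compares a test matrix with what the grid actually offers. The sketch below assumes the stereotype data has already been decoded into capability dictionaries with browserName and platformName keys; the function name and the sample values are hypothetical, and the exact string comparison is a simplification of Grid's own platform matching.

```python
import json


def missing_combinations(matrix, stereotypes):
    """Return the (browser, platform) pairs in `matrix` that no node offers."""
    offered = {
        (s.get("browserName", "").lower(), s.get("platformName", "").lower())
        for s in stereotypes
    }
    return [pair for pair in matrix
            if (pair[0].lower(), pair[1].lower()) not in offered]


# Stereotypes as they might look after decoding the GraphQL response;
# the values are made up for the example.
stereotypes = [
    json.loads('{"browserName": "chrome", "platformName": "linux"}'),
    json.loads('{"browserName": "firefox", "platformName": "linux"}'),
]
matrix = [("chrome", "linux"), ("firefox", "linux"), ("safari", "mac")]
print(missing_combinations(matrix, stereotypes))
```

A non-empty result means the matrix asks for a combination that would sit in the queue until it times out, which is cheaper to detect before the run than during it.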

Docker and Kubernetes Deployment

Official Docker Images

The Selenium project maintains regularly updated Docker images for all Grid components:

# docker-compose.yml for Hub-Node topology
version: "3"
services:
  selenium-hub:
    image: selenium/hub:4.15.0
    ports:
      - "4444:4444"
    environment:
      # Seconds a new-session request may wait in the queue
      - SE_SESSION_REQUEST_TIMEOUT=300

  chrome:
    image: selenium/node-chrome:4.15.0
    shm_size: 2gb
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
      - SE_NODE_MAX_SESSIONS=3
      # Session timeout is a Node setting, so it is set here rather than on the Hub
      - SE_NODE_SESSION_TIMEOUT=300

  firefox:
    image: selenium/node-firefox:4.15.0
    shm_size: 2gb
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
      - SE_NODE_MAX_SESSIONS=3
      - SE_NODE_SESSION_TIMEOUT=300

Each Grid 4 browser node image includes a VNC server (port 5900) and a browser-based noVNC client (port 7900) for live debugging, so the separate -debug image variants from Grid 3 are no longer needed. Video recording is available through selenium/video sidecar containers attached to each node.
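
Containers that have started are not necessarily ready to accept sessions, so test runners often poll the Grid's /status endpoint first. The following is a minimal sketch of such a wait loop, assuming the status payload carries a boolean at value.ready; the function names and default timings are choices made for this example.

```python
import json
import time
import urllib.request


def is_ready(status: dict) -> bool:
    """True when the /status payload reports the Grid can accept sessions."""
    return bool(status.get("value", {}).get("ready", False))


def fetch_status(url: str) -> dict:
    """GET <grid>/status and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(url, attempts=30, delay=2.0, fetch=fetch_status, sleep=time.sleep):
    """Poll until the Grid is ready or attempts run out; returns True on success."""
    for _ in range(attempts):
        try:
            if is_ready(fetch(url)):
                return True
        except (OSError, ValueError):
            pass  # Grid not listening yet, or the body was not valid JSON
        sleep(delay)
    return False
```

A test bootstrap could call wait_until_ready("http://localhost:4444/status") and abort the run if it returns False. The fetch and sleep parameters exist only so the loop can be exercised without a network or real delays.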

Kubernetes with Helm Charts

For production-scale deployments, the official Helm chart provides declarative configuration:

helm repo add selenium https://www.selenium.dev/docker-selenium
helm install selenium-grid selenium/selenium-grid \
  --set isolateComponents=true \
  --set chromeNode.replicas=5 \
  --set firefoxNode.replicas=3 \
  --set edgeNode.replicas=2

The chart supports:

Autoscaling with KEDA (Kubernetes Event-Driven Autoscaling)
Persistent session recording storage
Ingress configuration for external access
Sidecar containers for observability agents

Observability with OpenTelemetry

Grid 4’s integration with OpenTelemetry provides distributed tracing across all components, enabling visibility into request flows from client to browser execution.

Tracing Configuration

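Enable tracing by setting environment variables on each Grid container. The names below are those used by recent docker-selenium images; older tags configured the exporter through JAVA_OPTS system properties, so check the README for the tag in use:

SE_ENABLE_TRACING=true
SE_OTEL_TRACES_EXPORTER=otlp
SE_OTEL_EXPORTER_ENDPOINT=http://jaeger:4317
SE_OTEL_SERVICE_NAME=selenium-grid

This exports traces over OTLP to backends such as Jaeger or commercial APM tools; backends without native OTLP ingestion, such as Zipkin, can be fed through an OpenTelemetry Collector. Each WebDriver command generates spans showing: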

Session creation time
Node selection latency
Command execution duration
Network round-trip times

Integration with Monitoring Stacks

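The core Grid 4 server does not publish a Prometheus /metrics endpoint of its own. Prometheus metrics are usually produced by a small exporter that polls the GraphQL API or the /status endpoint and re-exposes the counts. The metric names below are illustrative, not a fixed Grid schema:

# EXAMPLE (exporter output)
selenium_grid_sessions_active 12
selenium_grid_sessions_queued 3
selenium_grid_node_count 8
selenium_grid_slot_utilization 0.75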

Combined with Grafana dashboards, teams gain real-time visibility into grid performance, capacity planning data, and failure rate analysis.
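
For teams that want series like the sample above, one approach is to derive them from Grid state and render them in the Prometheus text format. The sketch below assumes a dictionary with totalSlots, sessionCount, sessionQueueSize, and nodeCount fields (as returned by the GraphQL grid object); the function name and metric prefix are hypothetical.

```python
def to_prometheus(grid: dict, prefix: str = "selenium_grid") -> str:
    """Render Grid counters as Prometheus text-format gauges."""
    total = grid.get("totalSlots", 0)
    active = grid.get("sessionCount", 0)
    samples = {
        "sessions_active": active,
        "sessions_queued": grid.get("sessionQueueSize", 0),
        "node_count": grid.get("nodeCount", 0),
        # Guard against division by zero on an empty grid
        "slot_utilization": round(active / total, 4) if total else 0.0,
    }
    lines = []
    for name, value in samples.items():
        lines.append(f"# TYPE {prefix}_{name} gauge")
        lines.append(f"{prefix}_{name} {value}")
    return "\n".join(lines) + "\n"


print(to_prometheus({"totalSlots": 16, "sessionCount": 12,
                     "sessionQueueSize": 3, "nodeCount": 8}))
```

Serving this string from a tiny HTTP handler, refreshed on each scrape, is enough for Prometheus to collect it; an existing community exporter may be a better choice for production use.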

Advanced Configuration

Session Request Timeout and Retry Policies

Grid 4 introduces fine-grained timeout controls:

# Hub configuration (New Session Queue and Distributor settings)
--session-request-timeout 300  # Wait up to 5min in the queue for an available slot
--session-retry-interval 5     # Check for slots every 5sec
--healthcheck-interval 60      # Distributor checks Node health every 60sec

# Node configuration
--heartbeat-period 30          # Send a heartbeat every 30sec
--register-period 60           # Keep trying to register for 60sec at startup
--drain-after-session-count 100 # Drain and shut down after 100 sessions

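The --drain-after-session-count setting is particularly useful for containing memory leaks in long-running nodes: after the configured number of sessions the node stops accepting new work, lets running sessions finish, and shuts down, so an orchestrator such as Docker or Kubernetes can start a fresh replacement.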
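
The drain behaviour can be pictured with a small state machine. This is a toy model written for this article, not Grid's implementation; the class and attribute names are invented.

```python
class DrainingNode:
    """Toy model of a node started with --drain-after-session-count."""

    def __init__(self, drain_after: int):
        self.drain_after = drain_after
        self.started = 0       # sessions started over the node's lifetime
        self.active = 0        # sessions currently running
        self.draining = False

    def start_session(self) -> bool:
        """Accept a session unless the node is already draining."""
        if self.draining:
            return False
        self.started += 1
        self.active += 1
        if self.started >= self.drain_after:
            self.draining = True   # limit reached: take no further sessions
        return True

    def end_session(self) -> None:
        """Mark one running session as finished."""
        self.active -= 1

    @property
    def can_shut_down(self) -> bool:
        """A draining node exits once its last session has finished."""
        return self.draining and self.active == 0


node = DrainingNode(drain_after=2)
print(node.start_session(), node.start_session(), node.start_session())
node.end_session()
node.end_session()
print(node.can_shut_down)
```

The third request is refused because the limit was reached on the second, and shutdown becomes possible only after both running sessions end. Replacement capacity has to come from whatever restarts the container or pod.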

Custom Capability Matchers

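For complex browser configurations (specific Chrome flags, custom profiles), Grid 4 allows custom matching logic through the SlotMatcher interface (the Grid 4 successor to Grid 3’s CapabilityMatcher), selected on the Distributor with --slot-matcher. The capability names below are illustrative; W3C WebDriver requires custom capabilities to carry a vendor prefix:

import org.openqa.selenium.Capabilities;
import org.openqa.selenium.grid.data.DefaultSlotMatcher;
import org.openqa.selenium.grid.data.SlotMatcher;

public class CustomSlotMatcher implements SlotMatcher {
  private final SlotMatcher defaults = new DefaultSlotMatcher();

  @Override
  public boolean matches(Capabilities stereotype, Capabilities requested) {
    // Apply the standard browser and platform checks first
    if (!defaults.matches(stereotype, requested)) {
      return false;
    }
    Object required = requested.getCapability("custom:extension");
    if (required == null) {
      return true; // no specific extension requested
    }
    Object available = stereotype.getCapability("custom:availableExtensions");
    return available != null && available.toString().contains(required.toString());
  }
}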

This enables routing tests requiring specific browser extensions, locale settings, or performance profiles to appropriately configured nodes.
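
Before committing a routing rule to a Grid extension, it can help to prototype and unit-test the predicate in isolation. The sketch below expresses an extension-based rule over plain dictionaries; the capability names custom:extension and custom:availableExtensions are hypothetical, and the browser-name check stands in for Grid's much fuller default matching.

```python
def matches(stereotype: dict, requested: dict,
            want_key: str = "custom:extension",
            have_key: str = "custom:availableExtensions") -> bool:
    """Return True if a node with `stereotype` can serve `requested`."""
    # Browser name must agree before any custom rule is considered
    if stereotype.get("browserName") != requested.get("browserName"):
        return False
    wanted = requested.get(want_key)
    if wanted is None:
        return True   # nothing special requested
    return wanted in stereotype.get(have_key, [])


chrome_adblock = {"browserName": "chrome",
                  "custom:availableExtensions": ["adblock"]}
print(matches(chrome_adblock, {"browserName": "chrome", "custom:extension": "adblock"}))
print(matches(chrome_adblock, {"browserName": "chrome", "custom:extension": "vpn"}))
```

Treating an absent custom capability as a match keeps ordinary tests runnable on specialized nodes; a stricter policy would reserve those nodes for requests that name the extension.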

Relay Configuration for Existing Nodes

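Grid 4 can integrate with external WebDriver endpoints (Sauce Labs, BrowserStack, legacy Grid 3 nodes) through a relay. The relay is a Node feature rather than a separate subcommand, so it is started with the node command; the capabilities it advertises are declared in the [relay] section of the TOML file, which should also disable local driver detection:

java -jar selenium-server.jar node \
  --service-url "https://ondemand.us-west-1.saucelabs.com:443/wd/hub" \
  --config relay-sauce.toml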

This allows hybrid deployments where some browsers run locally while others use cloud providers, all accessible through a single Grid endpoint.

Comparison with Alternatives

Feature	Selenium Grid 4	Selenoid	Moon (Aerokube)	Zalenium
Protocol Support	WebDriver, CDP	WebDriver, CDP	WebDriver, CDP, Playwright	WebDriver
Browser Video	Via Docker sidecar	Built-in	Built-in	Built-in
Kubernetes Native	Helm chart	Yes	Yes	Deprecated
GraphQL API	✅ Yes	❌ No	❌ No	❌ No
OpenTelemetry	✅ Native	❌ Manual	✅ Native	❌ No
Active Development	✅ Official Selenium	✅ Active	✅ Active	❌ Archived
License	Apache 2.0	Apache 2.0	Commercial	Apache 2.0

Selenoid offers faster startup times and lower resource usage through direct container management but lacks Grid 4’s observability features.

Moon is Aerokube’s commercial Kubernetes-native solution with advanced features like browser caching and built-in VNC/video but requires a paid license.

Zalenium (now archived) pioneered Docker-based grid deployment but has been superseded by official Selenium Docker images.

Pricing and Licensing

Selenium Grid 4: Completely free and open-source (Apache 2.0 license). No feature limitations, commercial use allowed.

Infrastructure costs depend on deployment model:

Cloud VMs: $50-500/month for small-medium grids (AWS EC2, GCP Compute)
Kubernetes Clusters: $100-2000/month depending on scale (EKS, GKE, AKS)
Managed Selenium Services: $150-1500/month (Grid 4-compatible providers)

Commercial support is available through:

Sauce Labs: Grid-compatible cloud execution starting at $149/month
BrowserStack: Grid-compatible infrastructure starting at $99/month
Consulting firms: Implementation and optimization services ($150-250/hour)

Integration Examples

CI/CD Pipeline Integration

GitHub Actions workflow running tests against Grid:

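name: E2E Tests
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      selenium-hub:
        image: selenium/hub:4.15.0
        ports:
          - 4444:4444
      chrome:
        image: selenium/node-chrome:4.15.0
        # Browsers need more shared memory than Docker's 64 MB default
        options: --shm-size=2gb
        env:
          SE_EVENT_BUS_HOST: selenium-hub
          SE_EVENT_BUS_PUBLISH_PORT: 4442
          SE_EVENT_BUS_SUBSCRIBE_PORT: 4443
    steps:
      - uses: actions/checkout@v4
      - name: Wait for Grid to report ready
        run: |
          timeout 120 bash -c 'until curl -sf http://localhost:4444/status | jq -e ".value.ready" > /dev/null; do sleep 2; done'
      - name: Run tests
        run: mvn test -Dselenium.grid.url=http://localhost:4444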

Test Framework Configuration

Configure test frameworks to use Grid:

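Java (Selenium 4):

import java.net.URL;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;

// new URL(...) throws MalformedURLException; declare or handle it in the caller
RemoteWebDriver driver = new RemoteWebDriver(
  new URL("http://grid:4444"),
  new ChromeOptions()
);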

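Python (pytest):

import pytest
from selenium import webdriver

@pytest.fixture
def driver():
    # Grid 4 serves WebDriver at the root URL; /wd/hub is kept only for compatibility
    drv = webdriver.Remote(command_executor="http://grid:4444", options=webdriver.ChromeOptions())
    yield drv
    drv.quit()  # always release the Grid slot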

JavaScript (WebdriverIO):

exports.config = {
  hostname: 'grid',
  port: 4444,
  path: '/wd/hub',
  capabilities: [{
    browserName: 'chrome'
  }]
}

Best Practices

Capacity Planning

Calculate required Grid capacity using:

Required Nodes = (Total Tests × Avg Test Duration) / (Target Completion Time × Sessions per Node)

For 1000 tests averaging 3 minutes each, targeting 30-minute completion with 5 sessions per node:

Required Nodes = (1000 × 3) / (30 × 5) = 20 nodes

Add 20% buffer for failures and queue spikes.
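
The same arithmetic as a small helper, so the calculation can live next to a CI configuration. The function and parameter names are chosen for this example; rounding up reflects that nodes come in whole units.

```python
import math


def required_nodes(total_tests, avg_minutes, target_minutes,
                   sessions_per_node, buffer=0.0):
    """Nodes needed to finish the suite in time, rounded up to whole nodes."""
    base = (total_tests * avg_minutes) / (target_minutes * sessions_per_node)
    # round() absorbs floating-point noise before taking the ceiling
    return math.ceil(round(base * (1 + buffer), 9))


print(required_nodes(1000, 3, 30, 5))               # worked example: 20
print(required_nodes(1000, 3, 30, 5, buffer=0.2))   # with the 20% buffer: 24
```

The formula assumes tests are evenly distributed and every slot stays busy for the whole run; long-tail tests or uneven suites push the real requirement higher.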

Node Stability

Implement node lifecycle management:

Set --drain-after-session-count to prevent memory leaks
Configure health checks with reasonable timeouts
Use custom vendor-prefixed capabilities in node stereotypes to route tests to specialized configurations
Monitor disk space for logs and video recordings

Security Considerations

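Grid 4 ships with only basic protections: optional HTTP basic authentication on the Router (--username and --password) and a shared --registration-secret for Node registration. Production deployments should: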

Deploy behind a reverse proxy with authentication (Nginx, Traefik)
Use network segmentation to isolate grid components
Implement rate limiting to prevent resource exhaustion
Scan browser images for vulnerabilities regularly
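
Once a Grid sits behind an authenticating proxy, every client, including monitoring scripts, has to present credentials. As a minimal sketch, assuming HTTP Basic authentication, a standard-library request can carry the header as follows; the hostname and credentials are placeholders, and real secrets should come from the CI system's secret store rather than source code.

```python
import base64
import urllib.request


def authed_request(url: str, user: str, password: str) -> urllib.request.Request:
    """Build a request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


# Placeholder host and credentials for illustration only
req = authed_request("https://grid.example.com/status", "ci-bot", "s3cret")
print(req.get_header("Authorization"))
```

Basic authentication only protects credentials when the connection is encrypted, so the proxy should terminate TLS.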

Conclusion

Selenium Grid 4 modernizes distributed test execution with cloud-native architecture, comprehensive observability, and production-ready deployment patterns. The GraphQL API, OpenTelemetry integration, and microservices design make it suitable for organizations running thousands of daily test executions.

While alternatives like Selenoid offer performance advantages in specific scenarios, Grid 4’s official status, active development, and rich ecosystem make it the default choice for teams already invested in Selenium WebDriver. For greenfield projects, evaluate Grid 4 alongside cloud execution platforms and newer protocols like Playwright’s built-in parallelization to determine the best fit for your architecture and scale requirements.