Error 279: ALL_CONNECTION_TRIES_FAILED
This error occurs when ClickHouse cannot establish a connection to any of the available replicas or shards after exhausting all connection attempts. It indicates a complete failure to connect to remote nodes needed for distributed query execution, parallel replicas, or cluster operations.
Most common causes
- **All replicas unavailable or unreachable**
  - All remote servers down or restarting
  - Network partition isolating all replicas
  - All connection attempts timing out
  - DNS resolution failing for all hosts
- **Parallel replicas with stale connections**
  - First query after an idle period using a stale connection pool
  - Connection pool contains dead connections to replicas
  - Network configuration causing connections to time out after inactivity (typically 1+ hour)
  - Known issue in versions before 24.5.1.22937 and 24.7.1.5426
- **Pod restarts during rolling updates**
  - Load balancer routing new connections to terminating pods
  - Replicas marked `ready: true, terminating: true` still receiving traffic
  - Delay between pod termination and load balancer deregistration (can be 15-20 seconds)
  - Multiple replicas restarting simultaneously
- **Distributed queries to offline cluster nodes**
  - Remote shard servers not running
  - Network connectivity issues to cluster nodes
  - Firewall blocking inter-node communication
  - Wrong hostnames in cluster configuration
- **Connection refused errors**
  - ClickHouse server not listening on the port
  - Server crashed or killed
  - Port not open in the firewall
  - Service not started yet after deployment
- **`clusterAllReplicas()` queries during disruption**
  - Queries using the `clusterAllReplicas()` function
  - Some replicas unavailable during query execution
  - Not using the `skip_unavailable_shards` setting
Common solutions
1. For parallel replicas stale connection issue
Workaround (until you can upgrade to a fixed version):
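A minimal sketch of the dummy-query approach: run a cheap query with parallel replicas enabled so every replica is contacted and the stale pool entries are replaced. `my_table`, the cluster name `default`, and the replica count `3` are placeholders for your own setup; make sure `max_parallel_replicas` is at least the number of replicas in the cluster.

```sql
-- Cheap query that touches every replica, refreshing the connection pool
-- before the real workload runs.
SELECT count()
FROM my_table
SETTINGS allow_experimental_parallel_reading_from_replicas = 1,
         cluster_for_parallel_replicas = 'default',
         max_parallel_replicas = 3;
```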
Permanent fix: Upgrade to ClickHouse 24.5.1.22937, 24.7.1.5426, or later.
2. Skip unavailable shards/replicas
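For example (a sketch; `my_distributed_table` is a placeholder for your own Distributed table):

```sql
-- Session-wide: tolerate unreachable shards/replicas for all queries.
SET skip_unavailable_shards = 1;

-- Or scope the setting to a single query:
SELECT count()
FROM my_distributed_table
SETTINGS skip_unavailable_shards = 1;
```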
3. Verify cluster connectivity
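One way to do this from SQL is to inspect `system.clusters` (replace `default` with your cluster name); a non-zero `errors_count` means recent connection failures to that replica:

```sql
SELECT cluster, shard_num, replica_num, host_name, port,
       errors_count, estimated_recovery_time
FROM system.clusters
WHERE cluster = 'default';
```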
4. Check replica status
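For replicated tables, `system.replicas` shows replicas that are read-only or lagging; the 60-second delay threshold below is an arbitrary example:

```sql
SELECT database, table, is_readonly, absolute_delay,
       active_replicas, total_replicas
FROM system.replicas
WHERE is_readonly OR absolute_delay > 60;
```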
5. Verify servers are running
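A quick liveness check from SQL, assuming a cluster named `default`; replicas that are down are skipped and simply missing from the result:

```sql
SELECT hostName() AS replica, version() AS version, uptime() AS uptime_seconds
FROM clusterAllReplicas('default', system.one)
SETTINGS skip_unavailable_shards = 1;
```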
6. Configure connection retry settings
7. Implement client-side retry logic
Common scenarios
Scenario 1: Parallel replicas stale connections
Cause: First query after idle period; connection pool has stale connections (bug in versions < 24.5.1.22937).
Solution:
- Upgrade to 24.5.1.22937 / 24.7.1.5426 or later (permanent fix)
- Execute a dummy query with `max_parallel_replicas >= cluster_size` to refresh the pool (see the sketch under Common solutions, item 1)
- Implement retry logic that refreshes the connection pool
Scenario 2: All replicas down
Cause: All replicas in cluster are down or not accepting connections.
Solution:
- Check if ClickHouse servers are running
- Verify services are accessible on port 9000
- Check for pod/server restarts
- Review cluster configuration
Scenario 3: Rolling restart with load balancer delay
Cause: Load balancer still routing to pods marked `ready: true, terminating: true` (15-20 second delay before they are marked `ready: false`).
Solution:
- Implement retry logic with exponential backoff
- Use connection pooling that handles connection failures
- Wait for the pre-stop hook fix (ongoing work)
- Design applications to tolerate temporary connection failures
Scenario 4: clusterAllReplicas() with unavailable replicas
Cause: Using clusterAllReplicas() when one or more replicas unavailable.
Solution:
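A sketch, assuming the cluster is named `default`: add `skip_unavailable_shards` so the query returns results from whichever replicas are reachable.

```sql
SELECT hostName() AS replica, count()
FROM clusterAllReplicas('default', system.one)
GROUP BY replica
SETTINGS skip_unavailable_shards = 1;
```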
Scenario 5: Distributed table with dead shards
Cause: Distributed table references shard that is down.
Solution:
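A sketch, assuming the cluster is named `default` and the table is `my_distributed_table`: first identify the failing shard, then fix or remove it, or tolerate it temporarily.

```sql
-- Which replicas are accumulating connection errors?
SELECT shard_num, replica_num, host_name, errors_count, estimated_recovery_time
FROM system.clusters
WHERE cluster = 'default' AND errors_count > 0;

-- Temporary mitigation: accept partial results while the shard is down.
SELECT count()
FROM my_distributed_table
SETTINGS skip_unavailable_shards = 1;
```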
Prevention tips
- Keep ClickHouse updated: Upgrade to 24.5+ for parallel replicas fix
- Use skip_unavailable_shards: Allow queries to proceed with partial data
- Monitor cluster health: Track replica availability and connectivity
- Implement retry logic: Handle transient connection failures gracefully
- Test failover: Regularly verify cluster failover mechanisms work
- Configure appropriate timeouts: Match connection timeouts to network conditions
- Plan for rolling updates: Design applications to handle temporary unavailability
Debugging steps
- **Identify which replicas failed:**
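One option is `system.query_log`; the full exception text lists every replica that was tried and the individual error for each attempt:

```sql
SELECT event_time, exception_code, exception, query
FROM system.query_log
WHERE exception_code = 279
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC
LIMIT 10;
```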
- **Check cluster connectivity:**
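For example, confirm that every configured host resolves to the address you expect and check the server-side error counters (the cluster name `default` is a placeholder):

```sql
SELECT host_name, host_address, port, is_local, errors_count
FROM system.clusters
WHERE cluster = 'default';
```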
- **Check for parallel replicas settings:**
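A quick way to see the current values of the parallel-replicas related settings for the session:

```sql
SELECT name, value, changed
FROM system.settings
WHERE name IN ('allow_experimental_parallel_reading_from_replicas',
               'max_parallel_replicas',
               'cluster_for_parallel_replicas',
               'use_hedged_requests',
               'skip_unavailable_shards');
```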
- **Test individual replica connections:**
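The `remote()` table function connects to a single server directly, bypassing the cluster definition; the host and port below are placeholders:

```sql
SELECT hostName() AS replica, version() AS version
FROM remote('replica-1.example.com:9000', system.one);
```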
- **Check for pod restarts (Kubernetes)**
- **Review error_log for connection details:**
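From SQL, the `system.errors` table is one place to look: it keeps per-error counters and the last message, including errors raised on remote replicas (`remote = 1`):

```sql
SELECT name, code, value, last_error_time, last_error_message, remote
FROM system.errors
WHERE code IN (210, 279)   -- NETWORK_ERROR, ALL_CONNECTION_TRIES_FAILED
ORDER BY last_error_time DESC;
```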
Special considerations
For parallel replicas (experimental feature):
- Known bug in versions before 24.5.1.22937 / 24.7.1.5426
- Stale connections in pool after inactivity
- First query after idle period likely to fail
- Subsequent queries succeed after pool refresh
- Settings `skip_unavailable_shards` and `use_hedged_requests` are not needed anymore
For distributed queries:
- Error means ALL configured replicas failed
- Each replica has multiple connection attempts
- Full error message shows individual NETWORK_ERROR (210) attempts
- Check both network and server availability
For clusterAllReplicas():
- Queries all replicas in cluster
- Failure expected if any replica unavailable
- Use `skip_unavailable_shards = 1` to proceed with available replicas
- Common during rolling updates or maintenance
For ClickHouse Cloud rolling updates:
- Pods marked as terminating can still show `ready: true` for 15-20 seconds
- Load balancer may route new connections to terminating pods during this window
- Graceful shutdown waits up to 1 hour for running queries
- Design clients to retry connection failures
Load balancer behavior:
- Connection established to load balancer, not directly to replica
- Each query may route to different replica
- Terminating pods remain in load balancer briefly after shutdown starts
- Client retry may succeed if routed to healthy replica
Parallel replicas specific fix
Problem: Stale connections in cluster connection pools cause first query after inactivity to fail.
Affected versions: Before 24.5.1.22937 and 24.7.1.5426
Fix: PR 67389
Workaround until upgraded: execute a dummy query to refresh the connection pool (see the sketch under Common solutions, item 1), or retry the failed query.
Connection retry settings
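A sketch of the relevant session settings; the values shown are illustrative, not recommendations:

```sql
SET connect_timeout_with_failover_ms = 1000;   -- connect timeout per attempt for distributed queries
SET connections_with_failover_max_tries = 5;   -- connection attempts per replica before giving up
SET skip_unavailable_shards = 1;               -- optional: tolerate unreachable shards
```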
Cluster configuration best practices
- Remove dead nodes from the cluster configuration
- Use `internal_replication`
- Configure failover properly:
  - Ensure the cluster has multiple replicas per shard
  - Use an appropriate `load_balancing` strategy (see the sketch after this list)
  - Test failover by stopping one replica
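For example, the `load_balancing` setting controls how replicas are chosen for distributed queries; `nearest_hostname` below is just one of the valid values:

```sql
-- Other valid values include 'random' (the default), 'in_order',
-- 'first_or_random', and 'round_robin'.
SET load_balancing = 'nearest_hostname';
```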
Client implementation recommendations
For JDBC clients:
For distributed queries:
- Expect temporary failures during rolling updates
- Implement exponential backoff retry
- Use `skip_unavailable_shards` for non-critical queries
- Monitor cluster health before sending queries
Distinguishing scenarios
Parallel replicas issue:
- First query after idle period
- Subsequent queries succeed
- Versions before 24.5.1 / 24.7.1
- Error mentions "replica chosen for query execution"
Actual connectivity issue:
- Consistent failures, not just first query
- Network or server problems
- Individual 210 errors show "Connection refused" or "Timeout"
Rolling restart:
- Errors during known maintenance window
- Transient, resolves after restarts complete
- Correlation with pod restart events
Cluster misconfiguration:
- Persistent errors
- Same replicas always failing
- Wrong hostnames or dead nodes in config
When using clusterAllReplicas()
Monitoring and alerting
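As a starting point, a query like the following can back an alert on the rate of error 279 (the 15-minute window and the alert threshold are up to you):

```sql
-- Number of queries that failed with ALL_CONNECTION_TRIES_FAILED recently.
SELECT count() AS failed_queries
FROM system.query_log
WHERE exception_code = 279
  AND event_time > now() - INTERVAL 15 MINUTE;
```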
Known issues and fixes
Issue 1: Parallel replicas stale connections
- Affected: Versions before 24.5.1.22937 / 24.7.1.5426
- Fix: PR 67389
- Workaround: Execute dummy query to refresh pool or retry
Issue 2: Load balancer routing to terminating pods
- Affected: ClickHouse Cloud during rolling updates
- Symptom: 15-20 second window where terminating pods receive new connections
- Status: Ongoing work on pre-stop hooks
- Workaround: Implement client retry logic
Issue 3: Round-robin replica selection
- Affected: Parallel replicas queries
- Symptom: Replica selection forcibly uses ROUND_ROBIN even when some replicas are unavailable
- Impact: If 1 of 60 replicas is dead, roughly 1 in 60 requests fails consistently
If you're experiencing this error:
- Check ClickHouse version - upgrade if using parallel replicas on version < 24.5.1 / 24.7.1
- Verify all cluster nodes are running and accessible
- Test connectivity to each replica manually
- For parallel replicas: try executing dummy query to refresh connection pool
- Use `skip_unavailable_shards = 1` for queries that can tolerate partial data
- Check for correlation with pod restarts or maintenance windows
- Implement exponential backoff retry logic in client
- Review cluster configuration for dead or incorrect nodes
- Check individual connection errors in full exception message (usually 210 errors)
- For persistent issues, check network connectivity between nodes
Related documentation: