Error 279: ALL_CONNECTION_TRIES_FAILED
This error occurs when ClickHouse cannot establish a connection to any of the available replicas or shards after exhausting all connection attempts. It indicates a complete failure to connect to remote nodes needed for distributed query execution, parallel replicas, or cluster operations.
Most common causes
- **All replicas unavailable or unreachable**
  - All remote servers down or restarting
  - Network partition isolating all replicas
  - All connection attempts timing out
  - DNS resolution failing for all hosts
- **Parallel replicas with stale connections**
  - First query after an idle period using a stale connection pool
  - Connection pool contains dead connections to replicas
  - Network configuration causing connections to time out after inactivity (typically 1+ hour)
  - Known issue in versions before 24.5.1.22937 and 24.7.1.5426
- **Pod restarts during rolling updates**
  - Load balancer routing new connections to terminating pods
  - Replicas marked `ready: true, terminating: true` still receiving traffic
  - Delay between pod termination and load balancer deregistration (can be 15-20 seconds)
  - Multiple replicas restarting simultaneously
- **Distributed queries to offline cluster nodes**
  - Remote shard servers not running
  - Network connectivity issues to cluster nodes
  - Firewall blocking inter-node communication
  - Wrong hostnames in cluster configuration
- **Connection refused errors**
  - ClickHouse server not listening on the port
  - Server crashed or killed
  - Port not open in the firewall
  - Service not started yet after deployment
- **`clusterAllReplicas()` queries during disruption**
  - Queries using the `clusterAllReplicas()` function
  - Some replicas unavailable during query execution
  - Not using the `skip_unavailable_shards` setting
Common solutions
1. For parallel replicas stale connection issue
Workaround (until you can upgrade to a fixed version):
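A minimal sketch of the dummy-query approach: run a cheap query with parallel replicas enabled so every replica is contacted and the stale pool entries are replaced. `my_table`, the cluster name `default`, and the replica count `3` are placeholders for your own setup; make sure `max_parallel_replicas` is at least the number of replicas in the cluster.

```sql
-- Cheap query that touches every replica, refreshing the connection pool
-- before the real workload runs.
SELECT count()
FROM my_table
SETTINGS allow_experimental_parallel_reading_from_replicas = 1,
         cluster_for_parallel_replicas = 'default',
         max_parallel_replicas = 3;
```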
Permanent fix: Upgrade to ClickHouse 24.5.1.22937, 24.7.1.5426, or later.
2. Skip unavailable shards/replicas
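For example (a sketch; `my_distributed_table` is a placeholder for your own Distributed table):

```sql
-- Session-wide: tolerate unreachable shards/replicas for all queries.
SET skip_unavailable_shards = 1;

-- Or scope the setting to a single query:
SELECT count()
FROM my_distributed_table
SETTINGS skip_unavailable_shards = 1;
```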
3. Verify cluster connectivity
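One way to do this from SQL is to inspect `system.clusters` (replace `default` with your cluster name); a non-zero `errors_count` means recent connection failures to that replica:

```sql
SELECT cluster, shard_num, replica_num, host_name, port,
       errors_count, estimated_recovery_time
FROM system.clusters
WHERE cluster = 'default';
```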
4. Check replica status
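For replicated tables, `system.replicas` shows replicas that are read-only or lagging; the 60-second delay threshold below is an arbitrary example:

```sql
SELECT database, table, is_readonly, absolute_delay,
       active_replicas, total_replicas
FROM system.replicas
WHERE is_readonly OR absolute_delay > 60;
```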
5. Verify servers are running
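A quick liveness check from SQL, assuming a cluster named `default`; replicas that are down are skipped and simply missing from the result:

```sql
SELECT hostName() AS replica, version() AS version, uptime() AS uptime_seconds
FROM clusterAllReplicas('default', system.one)
SETTINGS skip_unavailable_shards = 1;
```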
6. Configure connection retry settings
7. Implement client-side retry logic
Common scenarios
Scenario 1: Parallel replicas stale connections
Cause: First query after idle period; connection pool has stale connections (bug in versions < 24.5.1.22937).
Solution:
- Upgrade to 24.5.1.22937 / 24.7.1.5426 or later (permanent fix)
- Execute a dummy query with `max_parallel_replicas >= cluster_size` to refresh the pool (see the sketch under Common solutions, item 1)
- Implement retry logic that refreshes the connection pool
Scenario 2: All replicas down
Cause: All replicas in cluster are down or not accepting connections.
Solution:
- Check if ClickHouse servers are running
- Verify services are accessible on port 9000
- Check for pod/server restarts
- Review cluster configuration
Scenario 3: Rolling restart with load balancer delay
Cause: Load balancer still routing to pods marked `ready: true, terminating: true` (15-20 second delay before they are marked `ready: false`).
Solution:
- Implement retry logic with exponential backoff
- Use connection pooling that handles connection failures
- Wait for the pre-stop hook fix (ongoing work)
- Design applications to tolerate temporary connection failures
Scenario 4: clusterAllReplicas() with unavailable replicas
Cause: Using clusterAllReplicas() when one or more replicas unavailable.
Solution:
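A sketch, assuming the cluster is named `default`: add `skip_unavailable_shards` so the query returns results from whichever replicas are reachable.

```sql
SELECT hostName() AS replica, count()
FROM clusterAllReplicas('default', system.one)
GROUP BY replica
SETTINGS skip_unavailable_shards = 1;
```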
Scenario 5: Distributed table with dead shards
Cause: Distributed table references shard that is down.
Solution:
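A sketch, assuming the cluster is named `default` and the table is `my_distributed_table`: first identify the failing shard, then fix or remove it, or tolerate it temporarily.

```sql
-- Which replicas are accumulating connection errors?
SELECT shard_num, replica_num, host_name, errors_count, estimated_recovery_time
FROM system.clusters
WHERE cluster = 'default' AND errors_count > 0;

-- Temporary mitigation: accept partial results while the shard is down.
SELECT count()
FROM my_distributed_table
SETTINGS skip_unavailable_shards = 1;
```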
Prevention tips
- Keep ClickHouse updated: Upgrade to 24.5+ for parallel replicas fix
- Use skip_unavailable_shards: Allow queries to proceed with partial data
- Monitor cluster health: Track replica availability and connectivity
- Implement retry logic: Handle transient connection failures gracefully
- Test failover: Regularly verify cluster failover mechanisms work
- Configure appropriate timeouts: Match connection timeouts to network conditions
- Plan for rolling updates: Design applications to handle temporary unavailability
Debugging steps
- **Identify which replicas failed:**
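One option is `system.query_log`; the full exception text lists every replica that was tried and the individual error for each attempt:

```sql
SELECT event_time, exception_code, exception, query
FROM system.query_log
WHERE exception_code = 279
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC
LIMIT 10;
```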
- **Check cluster connectivity:**
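For example, confirm that every configured host resolves to the address you expect and check the server-side error counters (the cluster name `default` is a placeholder):

```sql
SELECT host_name, host_address, port, is_local, errors_count
FROM system.clusters
WHERE cluster = 'default';
```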
- **Check for parallel replicas settings:**
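A quick way to see the current values of the parallel-replicas related settings for the session:

```sql
SELECT name, value, changed
FROM system.settings
WHERE name IN ('allow_experimental_parallel_reading_from_replicas',
               'max_parallel_replicas',
               'cluster_for_parallel_replicas',
               'use_hedged_requests',
               'skip_unavailable_shards');
```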
- **Test individual replica connections:**
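The `remote()` table function connects to a single server directly, bypassing the cluster definition; the host and port below are placeholders:

```sql
SELECT hostName() AS replica, version() AS version
FROM remote('replica-1.example.com:9000', system.one);
```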
- **Check for pod restarts (Kubernetes)**
- **Review error_log for connection details:**
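From SQL, the `system.errors` table is one place to look: it keeps per-error counters and the last message, including errors raised on remote replicas (`remote = 1`):

```sql
SELECT name, code, value, last_error_time, last_error_message, remote
FROM system.errors
WHERE code IN (210, 279)   -- NETWORK_ERROR, ALL_CONNECTION_TRIES_FAILED
ORDER BY last_error_time DESC;
```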
Special considerations
For parallel replicas (experimental feature):
- Known bug in versions before 24.5.1.22937 / 24.7.1.5426
- Stale connections in pool after inactivity
- First query after idle period likely to fail
- Subsequent queries succeed after pool refresh
- Settings `skip_unavailable_shards` and `use_hedged_requests` are not needed anymore
For distributed queries:
- Error means ALL configured replicas failed
- Each replica has multiple connection attempts
- Full error message shows individual NETWORK_ERROR (210) attempts
- Check both network and server availability
For clusterAllReplicas():
- Queries all replicas in cluster
- Failure expected if any replica unavailable
- Use `skip_unavailable_shards = 1` to proceed with available replicas
- Common during rolling updates or maintenance
For ClickHouse Cloud rolling updates:
- Pods marked as terminating can still show `ready: true` for 15-20 seconds
- Load balancer may route new connections to terminating pods during this window
- Graceful shutdown waits up to 1 hour for running queries
- Design clients to retry connection failures
Load balancer behavior:
- Connection established to load balancer, not directly to replica
- Each query may route to different replica
- Terminating pods remain in load balancer briefly after shutdown starts
- Client retry may succeed if routed to healthy replica
Parallel replicas specific fix
Problem: Stale connections in cluster connection pools cause first query after inactivity to fail.
Affected versions: Before 24.5.1.22937 and 24.7.1.5426
Fix: PR 67389
Workaround until upgraded: execute a dummy query to refresh the connection pool (see the sketch under Common solutions, item 1), or retry the failed query.
Connection retry settings
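A sketch of the relevant session settings; the values shown are illustrative, not recommendations:

```sql
SET connect_timeout_with_failover_ms = 1000;   -- connect timeout per attempt for distributed queries
SET connections_with_failover_max_tries = 5;   -- connection attempts per replica before giving up
SET skip_unavailable_shards = 1;               -- optional: tolerate unreachable shards
```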
Cluster configuration best practices
- Remove dead nodes from the cluster configuration
- Use `internal_replication`
- Configure failover properly:
  - Ensure the cluster has multiple replicas per shard
  - Use an appropriate `load_balancing` strategy (see the sketch after this list)
  - Test failover by stopping one replica
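For example, the `load_balancing` setting controls how replicas are chosen for distributed queries; `nearest_hostname` below is just one of the valid values:

```sql
-- Other valid values include 'random' (the default), 'in_order',
-- 'first_or_random', and 'round_robin'.
SET load_balancing = 'nearest_hostname';
```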
Client implementation recommendations
For JDBC clients:
For distributed queries:
- Expect temporary failures during rolling updates
- Implement exponential backoff retry
- Use `skip_unavailable_shards` for non-critical queries
- Monitor cluster health before sending queries
Distinguishing scenarios
Parallel replicas issue:
- First query after idle period
- Subsequent queries succeed
- Versions before 24.5.1 / 24.7.1
- Error mentions "replica chosen for query execution"
Actual connectivity issue:
- Consistent failures, not just first query
- Network or server problems
- Individual 210 errors show "Connection refused" or "Timeout"
Rolling restart:
- Errors during known maintenance window
- Transient, resolves after restarts complete
- Correlation with pod restart events
Cluster misconfiguration:
- Persistent errors
- Same replicas always failing
- Wrong hostnames or dead nodes in config
When using clusterAllReplicas()
Monitoring and alerting
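As a starting point, a query like the following can back an alert on the rate of error 279 (the 15-minute window and the alert threshold are up to you):

```sql
-- Number of queries that failed with ALL_CONNECTION_TRIES_FAILED recently.
SELECT count() AS failed_queries
FROM system.query_log
WHERE exception_code = 279
  AND event_time > now() - INTERVAL 15 MINUTE;
```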
Known issues and fixes
Issue 1: Parallel replicas stale connections
- Affected: Versions before 24.5.1.22937 / 24.7.1.5426
- Fix: PR 67389
- Workaround: Execute dummy query to refresh pool or retry
Issue 2: Load balancer routing to terminating pods
- Affected: ClickHouse Cloud during rolling updates
- Symptom: 15-20 second window where terminating pods receive new connections
- Status: Ongoing work on pre-stop hooks
- Workaround: Implement client retry logic
Issue 3: Round-robin replica selection
- Affected: Parallel replicas queries
- Symptom: Replica selection forcibly uses ROUND_ROBIN even when some replicas are unavailable
- Impact: If 1 of 60 replicas is dead, roughly 1 in 60 requests fails consistently
If you're experiencing this error:
- Check ClickHouse version - upgrade if using parallel replicas on version < 24.5.1 / 24.7.1
- Verify all cluster nodes are running and accessible
- Test connectivity to each replica manually
- For parallel replicas: try executing dummy query to refresh connection pool
- Use `skip_unavailable_shards = 1` for queries that can tolerate partial data
- Check for correlation with pod restarts or maintenance windows
- Implement exponential backoff retry logic in client
- Review cluster configuration for dead or incorrect nodes
- Check individual connection errors in full exception message (usually 210 errors)
- For persistent issues, check network connectivity between nodes
Related documentation: