
HA Cluster

Build an active-passive high-availability cluster using Raft consensus.

Overview

Nefia's HA cluster provides an active-passive configuration based on the Raft consensus algorithm. The leader node handles VPN tunnels and state mutations, and standby nodes are automatically promoted if the leader fails.

```plaintext
Node 1 (Leader)     Node 2 (Follower)    Node 3 (Follower)
  │ VPN tunnel        │ standby             │ standby
  │ state mutations   │ Raft replication    │ Raft replication
  │                   │                     │
  └─ Raft TCP ────────┴─────────────────────┘
```

Replicated State

| State | Description |
|---|---|
| Sessions | SSH session state |
| Queue | Task queue entries |

Only the leader node can Apply() state changes. Followers receive updates through the Raft log.

Setup

1. Prepare Configuration

Add cluster configuration to each node's nefia.yaml:

Node 1 (Initial Leader):

```yaml
cluster:
  enabled: true
  node_id: "node-1"
  bind_addr: "0.0.0.0:9700"
  advertise_addr: "10.99.0.1:9700"
```

Node 2:

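```yaml
cluster:
  enabled: true
  node_id: "node-2"
  bind_addr: "0.0.0.0:9700"
  advertise_addr: "10.99.0.2:9700"
  peers:
    - id: "node-1"
      address: "10.99.0.1:9700"
```

Node 3 uses the same layout as Node 2, with node_id "node-3" and advertise_addr "10.99.0.3:9700".

2. Initialize the Cluster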

Bootstrap the cluster on the first node:

```bash
nefia cluster init
```

This creates a single-node cluster and makes this node the leader.

3. Start the Daemon

```bash
nefia daemon
```

4. Add Peers

Add other nodes from the leader node:

```bash
nefia cluster add-peer --id node-2 --addr 10.99.0.2:9700
nefia cluster add-peer --id node-3 --addr 10.99.0.3:9700
```
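Raft elects a leader and commits entries only with the votes of a strict majority (quorum) of the voting members, so the cluster size determines how many node failures the cluster survives. The Go helpers below compute that standard Raft arithmetic; the function names are illustrative and not part of Nefia.

```go
package main

import "fmt"

// quorum returns the number of votes a Raft cluster of n voters needs
// to elect a leader or commit a log entry: a strict majority.
func quorum(n int) int {
	return n/2 + 1
}

// tolerableFailures returns how many voters can fail while the
// remaining nodes still form a quorum.
func tolerableFailures(n int) int {
	return n - quorum(n)
}

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("nodes=%d quorum=%d tolerates=%d\n", n, quorum(n), tolerableFailures(n))
	}
}
```

For 1 to 5 nodes this prints tolerances of 0, 0, 1, 1, and 2. A two-node cluster cannot lose either node, and a fourth node raises the quorum without adding fault tolerance, which is why the three-node layout above is the usual minimum and odd cluster sizes are generally preferred.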

Management Commands

Check Status

```bash
nefia cluster status
```

```plaintext
Node ID:    node-1
State:      Leader
Is Leader:  true
Leader ID:  node-1
```

List Members

```bash
nefia cluster members
```

```plaintext
ID       ADDRESS           STATE      LOCAL
node-1   10.99.0.1:9700    Leader     *
node-2   10.99.0.2:9700    Follower
node-3   10.99.0.3:9700    Follower
```

Remove a Peer

Run this from the leader node:

```bash
nefia cluster remove-peer --id node-3
```

Configuration Reference

```yaml
cluster:
  enabled: true
  node_id: "node-1"                    # Unique node ID
  bind_addr: "0.0.0.0:9700"            # Local TCP bind address
  advertise_addr: "10.99.0.1:9700"     # Address advertised to other nodes
  data_dir: ""                         # Data directory (defaults to <state-dir>/cluster if empty)
  peers:
    - id: "node-2"
      address: "10.99.0.2:9700"
```
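Before starting a node it can help to sanity-check these fields. The Go sketch below is one way to do that. The ClusterConfig and Peer types only mirror the YAML keys, and the rules themselves (for example, rejecting a wildcard advertise_addr or a peer that reuses this node's ID) are assumptions about sensible validation, not Nefia's documented behavior.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// Peer and ClusterConfig mirror the YAML keys above; these Go types are
// hypothetical and not Nefia's actual config structs.
type Peer struct {
	ID      string
	Address string
}

type ClusterConfig struct {
	Enabled       bool
	NodeID        string
	BindAddr      string
	AdvertiseAddr string
	DataDir       string
	Peers         []Peer
}

// checkHostPort requires a host:port pair with a non-empty port.
func checkHostPort(field, addr string) error {
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return fmt.Errorf("%s: %w", field, err)
	}
	if port == "" {
		return fmt.Errorf("%s: missing port", field)
	}
	return nil
}

// Validate returns the first problem it finds, or nil. The rules are assumptions.
func (c ClusterConfig) Validate() error {
	if !c.Enabled {
		return nil // nothing to check when clustering is off
	}
	if c.NodeID == "" {
		return errors.New("node_id is required")
	}
	if err := checkHostPort("bind_addr", c.BindAddr); err != nil {
		return err
	}
	if err := checkHostPort("advertise_addr", c.AdvertiseAddr); err != nil {
		return err
	}
	// Peers dial advertise_addr, so a wildcard such as 0.0.0.0 cannot work there.
	host, _, _ := net.SplitHostPort(c.AdvertiseAddr)
	if ip := net.ParseIP(host); host == "" || (ip != nil && ip.IsUnspecified()) {
		return errors.New("advertise_addr must be an address peers can reach, not a wildcard")
	}
	// Peer IDs must be unique and must not repeat this node's own ID.
	seen := map[string]bool{c.NodeID: true}
	for _, p := range c.Peers {
		if seen[p.ID] {
			return fmt.Errorf("peer id %q duplicates another peer or this node", p.ID)
		}
		seen[p.ID] = true
		if err := checkHostPort("peer "+p.ID+" address", p.Address); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	cfg := ClusterConfig{
		Enabled:       true,
		NodeID:        "node-2",
		BindAddr:      "0.0.0.0:9700",
		AdvertiseAddr: "10.99.0.2:9700",
		Peers:         []Peer{{ID: "node-1", Address: "10.99.0.1:9700"}},
	}
	fmt.Println(cfg.Validate()) // prints: <nil>
}
```

Binding to 0.0.0.0 is fine because it only controls which local interfaces accept connections; the advertised address, by contrast, is what other nodes actually dial.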

Leadership Transitions

When the leader fails, a follower is automatically promoted to leader after the Raft election timeout (1 second), provided that a majority (quorum) of the cluster's nodes can still reach each other.

Callbacks on leadership change:

  • Promoted to leader: Start VPN tunnels, start Reactor listeners
  • Demoted to follower: Stop VPN tunnels, sync state from other nodes

Raft Parameters

| Parameter | Value | Description |
|---|---|---|
| Apply Timeout | 10s | Timeout for log application |
| Leader Lease | 500ms | Leader lease duration |
| Heartbeat Timeout | 1s | Heartbeat interval |
| Election Timeout | 1s | Election timeout |
| Snapshot Retain | 2 | Number of snapshots to retain |
| Connection Pool | 3 | Raft TCP connection pool size |
| Transport Timeout | 10s | Communication timeout |

Data Storage

| Data | Store | Location |
|---|---|---|
| Raft log | BoltDB | <data-dir>/raft.db |
| Snapshots | File-based | <data-dir>/ |
| Admin socket | Unix socket | <state-dir>/admin.sock |

Admin API

Cluster management is performed via the daemon process's Unix socket. CLI commands communicate with the daemon through this API.

Supported actions:

  • status — Get node status
  • members — Get member list
  • add_peer — Add a peer (leader only)
  • remove_peer — Remove a peer (leader only)
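The Admin API's wire format is not documented here, so the Go sketch below assumes a hypothetical JSON request with action, id, and addr fields. It models only the dispatch rules listed above: the four supported actions, and add_peer and remove_peer being refused on a follower. A real client would send such a request over the admin socket (for example via net.Dial("unix", path)); that transport is left out so the sketch runs on its own.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request and Response are hypothetical wire formats; the real Admin API
// schema is not documented here.
type Request struct {
	Action string `json:"action"`
	ID     string `json:"id,omitempty"`
	Addr   string `json:"addr,omitempty"`
}

type Response struct {
	OK    bool   `json:"ok"`
	Error string `json:"error,omitempty"`
}

// known lists the supported actions; leaderOnly marks membership changes.
var known = map[string]bool{"status": true, "members": true, "add_peer": true, "remove_peer": true}
var leaderOnly = map[string]bool{"add_peer": true, "remove_peer": true}

// handle checks a raw request against the supported actions and the
// leader-only rule. Executing the action itself is out of scope.
func handle(raw []byte, isLeader bool) Response {
	var req Request
	if err := json.Unmarshal(raw, &req); err != nil {
		return Response{Error: "bad request: " + err.Error()}
	}
	switch {
	case !known[req.Action]:
		return Response{Error: fmt.Sprintf("unknown action %q", req.Action)}
	case leaderOnly[req.Action] && !isLeader:
		return Response{Error: req.Action + " must be sent to the leader"}
	case leaderOnly[req.Action] && req.ID == "":
		return Response{Error: "id is required"}
	case req.Action == "add_peer" && req.Addr == "":
		return Response{Error: "addr is required"}
	}
	return Response{OK: true}
}

func main() {
	resp := handle([]byte(`{"action":"add_peer","id":"node-3","addr":"10.99.0.3:9700"}`), false)
	out, _ := json.Marshal(resp)
	fmt.Println(string(out)) // prints: {"ok":false,"error":"add_peer must be sent to the leader"}
}
```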

mTLS Transport Security

Raft cluster transport supports optional mutual TLS (mTLS) for authenticating peer-to-peer communication. When enabled, all Raft traffic (log replication, heartbeats, snapshots) is encrypted and both peers must present valid certificates signed by a shared CA.

Configuration

```yaml
cluster:
  enabled: true
  node_id: "node-1"
  bind_addr: "0.0.0.0:9700"
  advertise_addr: "10.99.0.1:9700"
  tls_enabled: true
  tls_cert_file: "/etc/nefia/cluster-cert.pem"
  tls_key_file: "/etc/nefia/cluster-key.pem"
  tls_ca_file: "/etc/nefia/cluster-ca.pem"
```

| Field | Type | Description |
|---|---|---|
| tls_enabled | bool | Enable mTLS for Raft inter-node transport |
| tls_cert_file | string | Path to TLS certificate (PEM) |
| tls_key_file | string | Path to TLS private key (PEM) |
| tls_ca_file | string | Path to CA certificate for mutual TLS verification (PEM) |

All three file paths (tls_cert_file, tls_key_file, tls_ca_file) are required when tls_enabled is true.

Security Properties

  • TLS 1.3 minimum: Only TLS 1.3 is accepted, ensuring modern cipher suites and key exchange.
  • Mutual authentication: Both sides of the connection must present a certificate (RequireAndVerifyClientCert). A node cannot join the cluster without a valid certificate signed by the shared CA.
  • EKU validation: Peer certificates must include both serverAuth and clientAuth Extended Key Usage (EKU) values. Certificates with ExtKeyUsageAny are also accepted.
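  • Forward secrecy: TLS session tickets are disabled, so every connection performs a full handshake with a fresh key exchange instead of resuming an earlier session. This removes session resumption, and the ticket keys it depends on, as an attack surface.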

MCP Tools

AI agents can manage the Raft cluster programmatically via MCP:

| Tool | Description |
|---|---|
| nefia.cluster.init | Initialize a new Raft cluster on this node. |
| nefia.cluster.status | Show cluster status including leader, term, and commit index. |
| nefia.cluster.members | List all cluster members and their state (leader, follower, candidate). |
| nefia.cluster.add_peer | Add a new peer node to the cluster by ID and address. |
| nefia.cluster.remove_peer | Remove a peer node from the cluster. |