VPN Setup Guide

Configure WireGuard VPN tunnels, NAT traversal, and key rotation for Nefia.

Nefia uses WireGuard VPN tunnels for all remote connections. This guide covers advanced VPN configuration including NAT traversal, diagnostics, and key rotation.

Architecture

Nefia uses a star (hub-and-spoke) topology:

plaintext

                  ┌──────────────┐
                  │  Operator PC │
                  │  (Hub)       │
                  │  10.99.0.1   │
                  └──────┬───────┘
                         │
            ┌────────────┼────────────┐
            │            │            │
     ┌──────┴───┐ ┌──────┴───┐ ┌─────┴────┐
     │ Target 1 │ │ Target 2 │ │ Target 3 │
     │ 10.99.0.2│ │ 10.99.0.3│ │ 10.99.0.4│
     └──────────┘ └──────────┘ └──────────┘

The operator PC serves as the VPN hub on 10.99.0.1/24. Each target PC gets a unique IP in the 10.99.0.0/24 subnet.

NAT Traversal

When target PCs are behind NAT, they need a reachable public endpoint for the operator. You have several options for providing one:

STUN Discovery

Use STUN to automatically discover your public IP and port:

bash

nefia vpn invite --name my-server --os linux --stun

This queries public STUN servers to determine the operator's externally reachable address.

Direct Endpoint

If you know your public IP and have port forwarding configured:

bash

nefia vpn invite --name my-server --os linux --endpoint 203.0.113.10:51820

Invite Flags Reference

Flag	Default	Description
`--name`	(required)	Host ID for the new peer
`--os`	(required)	Target OS: `macos`, `linux`, or `windows`
`--stun`	`false`	Use STUN to discover the operator's public endpoint
`--endpoint`		Operator's public endpoint (`ip:port`)
`--enroll-port`	`19820`	TCP port for the enrollment listener
`--listen`	`true`	Start enrollment listener after generating the invite
`--listen-timeout`	`60m`	How long to wait for the agent to enroll
`--ttl`	`24h`	Token time-to-live
`--token-out`		Write raw invite token to a file
`--copy`	`false`	Copy the invite token to the system clipboard
`--no-print-token`	`false`	Suppress human-readable token output (hidden flag for TUI/internal automation)

Port Forwarding

To enroll a remote target PC, the following ports must be forwarded on the operator's router:

Port	Protocol	Purpose
19820	TCP	Enrollment (initial registration only)
51820	UDP	WireGuard VPN tunnel (always)

Environments Without Port Forwarding

In coworking spaces, corporate networks, or other environments where you do not control the router, port forwarding is not possible. In these cases, Nefia can complete enrollment through a cloud relay at nefia.ai.

How the cloud relay works:

plaintext

Agent  →  HTTPS  →  nefia.ai  ←  HTTPS  ←  Operator
                    (relay)

When the operator runs nefia vpn invite, a local listener starts and a cloud relay session is created simultaneously
The agent first attempts a direct connection (TCP 19820) and falls back to the cloud relay if it fails
Only WireGuard public keys and metadata pass through the cloud relay (private keys are never transmitted)
In environments where direct connections are possible, they are used as usual (completes within 10 seconds)

Manual workarounds (if the cloud relay is unavailable):

Use mobile tethering — Switch the operator PC to a tethered connection, then re-issue the token and enroll:
bash
```
nefia vpn reinvite --name <host-id> --stun
```
Use a network with port forwarding — Configure TCP 19820 / UDP 51820 port forwarding on the router, then enroll
Use a VPS as a relay — Run the operator CLI on a VPS with a public IP to complete enrollment, then migrate the configuration to your local PC

For environments where direct UDP connectivity is unreliable (symmetric NAT, CGNAT, strict firewalls), Nefia supports DERP (Designated Encrypted Relay for Packets) relay servers. DERP provides WebSocket-based relay as a fallback transport path alongside direct WireGuard and TURN.

Relay-First Architecture

Nefia uses a relay-first connection strategy:

plaintext

1. DialTCP via DERP relay  →  Immediate connectivity (< 1s)
2. Background: probe direct UDP path
3. If direct path succeeds  →  Automatic upgrade (transparent)

This ensures connections succeed immediately even in restrictive networks, while automatically upgrading to the fastest available path in the background.

Three-Path Transport

The relayAwareBind manages three transport paths simultaneously:

Path	Protocol	Use Case
DERP	WebSocket (wss://)	Immediate relay fallback, works through HTTP proxies
TURN	UDP/TCP relay	Traditional NAT traversal relay
Direct	UDP	Peer-to-peer, lowest latency

Configuring DERP Servers

Add DERP relay servers to your nefia.yaml:

yaml

vpn:
  derp_servers:
    - url: "wss://relay.nefia.ai/derp"
      region: "ap-northeast-1"
    - url: "wss://relay-us.nefia.ai/derp"
      region: "us-east-1"

DERP servers configured on the operator are automatically propagated to agents during enrollment.

Self-Hosted DERP Deployment

For private deployments, run your own DERP relay using the nefia-derp binary:

bash

nefia-derp \
  --addr :8443 \
  --allowed-keys-file /etc/nefia/allowed-keys.txt \
  --metrics-token "$METRICS_TOKEN"

Flag	Default	Description
`--addr`	`:8443`	Listen address for WebSocket connections
`--max-clients`	`10000`	Maximum number of concurrent clients
`--ping-interval`	`30s`	Keepalive ping interval
`--allowed-keys-file`	—	Path to a file containing allowed WireGuard public keys (one per line). If omitted, all keys are accepted.
`--metrics-token`	--	Bearer token required to retrieve full metrics from `/healthz`. Without it, `/healthz` returns only a minimal status. Can also be set via the `NEFIA_DERP_METRICS_TOKEN` environment variable.
`--trust-proxy`	`false`	Trust `Fly-Client-IP` / `X-Forwarded-For` headers for rate limiting. Enable when running behind a reverse proxy.
`--version`	—	Print version information and exit

The DERP server includes:

Per-IP rate limiting: 5 requests/second with burst of 10. Exceeding this returns HTTP 429.
ReadHeaderTimeout: 10 seconds to prevent slowloris attacks.
Graceful shutdown: Active clients receive a StatusGoingAway frame before the server exits. Clients automatically reconnect to the next available DERP server.

NAT Classification

Nefia automatically classifies the NAT type of the network to determine the best connectivity strategy:

NAT Type	Description	Direct Connectivity
EIM (Endpoint-Independent Mapping)	Standard NAT, consistent port mapping	Yes (with STUN)
EDM (Endpoint-Dependent Mapping)	Symmetric NAT, different port per destination	Difficult, relay recommended
CGNAT	Carrier-grade NAT, shared public IP	Relay required

NAT classification is performed automatically during nefia vpn diagnose and is used internally to select the optimal transport path.

Multipath Routing

Nefia supports active-backup multipath routing that automatically selects the best available network path.

When multiple paths are available (e.g., direct UDP, DERP relay, TCP fallback), the system continuously monitors path quality and fails over automatically, while avoiding rapid flapping between paths.

Configure multipath in nefia.yaml:

yaml

vpn:
  multipath:
    mode: "active-backup"
    probe_interval_sec: 5
    failover_threshold_ms: 0

Field	Description
`mode`	Multipath mode. `"active-backup"` enables automatic failover. Use `"off"` to disable.
`probe_interval_sec`	Integer seconds between health probes.
`failover_threshold_ms`	Latency threshold in milliseconds for failover. `0` for automatic.

The active path is visible in nefia vpn status output. Path switches are logged in the audit trail.

Network Monitoring

The agent includes a network monitor that watches for IP address changes on the local network interfaces. When a change is detected (e.g., Wi-Fi reconnection, VPN toggle, network switch):

The monitor detects added/removed IP addresses within 5 seconds
A signal is sent to the watchdog for immediate tunnel rebuild
The tunnel reconnects using the new network path

This eliminates the need to wait for the regular watchdog interval (up to 60 seconds) after a network change, providing near-instant recovery.
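The guide does not say whether the monitor uses OS change notifications or polling. The Go sketch below assumes polling. It reads `net.InterfaceAddrs` every interval, diffs the result against the previous snapshot, and sends a non-blocking signal that stands in for the watchdog rebuild. `watch` and `diffAddrs` are hypothetical names.

```go
package main

import (
	"fmt"
	"net"
	"sort"
	"time"
)

// diffAddrs returns addresses only in next (added) and only in prev (removed), sorted.
func diffAddrs(prev, next []string) (added, removed []string) {
	toSet := func(list []string) map[string]bool {
		m := make(map[string]bool, len(list))
		for _, a := range list {
			m[a] = true
		}
		return m
	}
	p, n := toSet(prev), toSet(next)
	for a := range n {
		if !p[a] {
			added = append(added, a)
		}
	}
	for a := range p {
		if !n[a] {
			removed = append(removed, a)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

// currentAddrs lists the addresses on all local interfaces.
func currentAddrs() ([]string, error) {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, len(addrs))
	for _, a := range addrs {
		out = append(out, a.String())
	}
	return out, nil
}

// watch polls every interval and signals rebuild when the address set changes.
func watch(interval time.Duration, rebuild chan<- struct{}, stop <-chan struct{}) {
	prev, _ := currentAddrs()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			next, err := currentAddrs()
			if err != nil {
				continue
			}
			if added, removed := diffAddrs(prev, next); len(added)+len(removed) > 0 {
				select {
				case rebuild <- struct{}{}: // ask the watchdog for an immediate rebuild
				default: // a rebuild is already pending
				}
				prev = next
			}
		}
	}
}

func main() {
	rebuild := make(chan struct{}, 1)
	stop := make(chan struct{})
	go watch(5*time.Second, rebuild, stop)
	select {
	case <-rebuild:
		fmt.Println("network change detected: rebuild tunnel")
	case <-time.After(6 * time.Second):
		fmt.Println("no network change observed")
	}
	close(stop)
}
```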

Captive Portal Detection

Before attempting NAT traversal, the agent checks for captive portals (hotel Wi-Fi, airport networks, etc.) that intercept HTTP traffic. If a captive portal is detected, a warning is logged with instructions to authenticate with the portal before the VPN can connect.

Enhanced Diagnostics

nefia vpn diagnose includes additional checks beyond basic VPN health:

Latency Measurement

Each active peer is probed for round-trip latency:

nefia vpn diagnose (latency)

[PASS] peer-my-server-latency: 42ms
[WARN] peer-staging-latency: 620ms (>500ms)
[FAIL] peer-backup-latency: 2.3s (>2s)

Threshold	Result
< 500ms	PASS
500ms -- 2s	WARN
> 2s	FAIL

Route Conflict Detection

Diagnose checks whether the VPN subnet (10.99.0.0/24 by default) overlaps with any existing routes on the system. Overlapping routes can cause traffic to be misrouted.

nefia vpn diagnose (route conflict)

[FAIL] route-conflict: VPN subnet 10.99.0.0/24 overlaps with existing route 10.99.0.0/24 via en0

VPN Address Collision Prevention

When multiple nefia processes run simultaneously (e.g., concurrent vpn invite and vpn reinvite), file-based locking prevents VPN address collisions. Without this protection, two concurrent invites could assign the same VPN address to different hosts, causing routing conflicts.

The locking mechanism works as follows:

Before modifying nefia.yaml, the process acquires an exclusive file lock (nefia.yaml.lock).

The process reads the current config, selects the next available VPN address, adds the new host, and writes the config back.

The lock is released after the config file is saved.

On Unix (macOS/Linux): Non-blocking flock is used with up to 30 retry attempts at 1-second intervals (max 30-second wait). The OS automatically releases the lock when the owning process exits.
On Windows: LockFileEx provides equivalent exclusive locking with the same retry strategy. The lock file stores the owning process PID. If the owning process has crashed, the stale lock is automatically detected and cleaned up by checking process liveness via OpenProcess / GetExitCodeProcess.

Troubleshooting: Port Already in Use (E1004)

If you see this error when running nefia vpn invite, nefia vpn reinvite, or nefia vpn listen, another process is using the VPN or enrollment port:

Terminal

Error: [E1004] VPN setup failed Try: Check that no other VPN or WireGuard instance is using the same listen port (default: 51820).

Common causes:

A previous nefia vpn invite or nefia vpn listen was interrupted with Ctrl+C but the process did not fully exit
An enrollment listener is running in another terminal
Another WireGuard instance is using the same port

Resolution:

bash

# 1. Check which process is using the port
lsof -i :51820    # WireGuard VPN port
lsof -i :19820    # Enrollment listener port
 
# 2. Terminate the process
kill <PID>
 
# 3. Retry
nefia vpn reinvite --name <host-id> --stun

Reinviting Hosts

If an invite token has expired or you need to re-enroll an existing host, use vpn reinvite instead of removing and recreating the host:

bash

# Regenerate an expired invite (VPN address is preserved)
nefia vpn reinvite --name my-server --stun
 
# Switch from STUN to a direct endpoint
nefia vpn reinvite --name my-server --endpoint 192.168.1.100:19820

This resets the host to pending status, generates a new token, and starts the enrollment listener. The host's VPN address is kept, so firewall rules and DNS records remain valid.

vpn reinvite accepts the same flags as vpn invite, including --token-out, --copy, and --no-print-token.

Writing Tokens to File

Use --token-out to write the raw invite token to a file instead of only displaying it in the terminal. This is useful for automation scripts and CI/CD pipelines:

bash

nefia vpn invite --name my-server --os linux --stun --token-out /tmp/invite-token.txt

The file is created with 0600 permissions (owner-only read/write).

Multi-Host Enrollment

When enrolling multiple hosts, use nefia vpn listen with the --count flag to accept multiple enrollments without restarting the listener:

bash

# Accept exactly 3 enrollments
nefia vpn listen --count 3
 
# Accept all pending hosts (auto-stops when none remain)
nefia vpn listen --count 0

The listener automatically stops when:

The --count limit is reached, or
No pending hosts remain in the configuration (when --count 0), or
The timeout expires (default: 60 minutes)

During enrollment listening, a progress heartbeat is printed every 30 seconds to confirm the listener is still active.

Batch Approval

To approve all pending-approval hosts at once (e.g., after fleet enrollment), use nefia vpn approve --all. This requires explicit risk acknowledgment to prevent accidental mass approval:

bash

# Interactive: shows a confirmation prompt
nefia vpn approve --all
 
# Non-interactive / automation: must pass --accept-risk
nefia vpn approve --all --accept-risk=approve-all

Enrollment Status

Check whether a specific host has completed enrollment:

bash

nefia vpn enroll-status --name my-server

nefia vpn enroll-status --name my-server

Host: my-server
Status: active
VPN Addr: 10.99.0.2
Public Key: abc123...

If the host is still pending and the invite has expired, a warning is displayed with instructions to run nefia vpn reinvite.

Two-Phase Enrollment (Cloud Relay)

When using cloud relay enrollment (via nefia.ai), the enrollment process uses a two-phase token exchange for enhanced security:

Phase 1: Agent completes enrollment

The agent connects to the cloud relay, presents the session token embedded in the invite, and submits its WireGuard public key. The enrollment session must have been created with an expected_host_id parameter, and the agent's reported host ID must match. Legacy enrollment sessions (without expectedHostID or tokenHash) are rejected.

Phase 2: Bootstrap code exchange

When the operator polls for the enrollment result, a bootstrap code (a short-lived JWT with a 5-minute TTL) is returned instead of the agent token directly. The operator then exchanges this bootstrap code for the actual agent token via a separate authenticated endpoint.

This two-phase design ensures that even if the enrollment completion is intercepted, the attacker cannot obtain a valid agent token without the operator's authenticated session.

Diagnostics

For a comprehensive system-wide health check covering config, auth, audit, VPN, and connectivity, use:

bash

nefia doctor

This runs all VPN diagnostics plus config validation, auth status, audit checks, and end-to-end TCP connectivity tests to each active host. See the doctor command reference for full details.

To run VPN-specific diagnostics only:

bash

nefia vpn diagnose

nefia vpn diagnose

VPN Diagnostics:
  [PASS] vpn-enabled: VPN is enabled
  [PASS] operator-keypair: operator keypair is valid
  [PASS] port-available: UDP port 51820 is available
  [PASS] enrollment-port: TCP port 19820 (enrollment) is available
  [PASS] vpn-addr-unique: all 3 VPN addresses are unique
  [PASS] operator-addr-collision: operator VPN address does not overlap with any peer
  [PASS] ssh-identity: SSH public key found in ssh.identities
  [PASS] stun-reachability: STUN reachable (public IP: 203.0.113.10)
  [PASS] peer-my-server-pubkey: public key for my-server is valid
  [PASS] peer-my-server-vpnaddr: peer my-server VPN address: 10.99.0.2

Result: All checks passed.

Diagnose a specific host with --host:

bash

nefia vpn diagnose --host my-server

The diagnostic checks include:

VPN enabled and operator keypair validity
UDP port 51820 and TCP enrollment port 19820 availability
VPN address uniqueness across all peers
Operator VPN address collision detection (ensures operator address doesn't overlap with peer addresses)
SSH identity file presence
STUN server reachability
Per-peer public key validation and invite expiry
Connectivity testing (when tunnel is active)

Key Rotation

Rotate the operator's WireGuard key with a grace period to avoid disrupting active connections:

bash

nefia vpn rotate-key --grace-period 72h

The command displays rotation details including the new public key, grace period, and number of active hosts. If the rotation produces any warnings (e.g., agents that could not be notified), a Warning field is shown in the output.

The operator generates a new WireGuard keypair and stores it in the config.

Run nefia vpn push-key to distribute the new public key to all active hosts.

Both old and new keys are accepted during the grace period (set with --grace-period, e.g. 72h).

After the grace period expires, the old key is revoked and agents using it must re-enroll.

Push the key to all active hosts (or a specific host):

bash

# Push to all active VPN hosts
nefia vpn push-key
 
# Push to a specific host only
nefia vpn push-key --host my-server

MagicDNS

Nefia includes a built-in DNS resolver that maps host IDs to their VPN addresses. MagicDNS is enabled by default with the .nefia domain:

plaintext

my-server.nefia  →  10.99.0.2
dev-box.nefia    →  10.99.0.3
operator.nefia   →  10.99.0.1

Pending Hosts

Pending hosts (those that have been invited but have not yet completed enrollment) are registered in MagicDNS under the .pending subdomain. For example, a pending host mac-dev is resolvable as:

plaintext

mac-dev.pending.nefia  →  10.99.0.5

Once enrollment completes and the host becomes active, the record changes to the standard domain:

plaintext

mac-dev.nefia  →  10.99.0.5

Configure MagicDNS in your nefia.yaml:

yaml

vpn:
  magic_dns:
    enabled: true    # default: true
    domain: "nefia"  # default: "nefia"

View current DNS records with nefia vpn status:

nefia vpn status (MagicDNS excerpt)

MagicDNS: active (.nefia, 4 records)
  operator.nefia -> 10.99.0.1
  my-server.nefia -> 10.99.0.2
  dev-box.nefia -> 10.99.0.3
  mac-dev.pending.nefia -> 10.99.0.5

VPN Status

nefia vpn status provides a quick overview of VPN health. Even without the --live flag, the command shows peer handshake staleness indicators:

Peers with last handshake more than 5 minutes ago show a [WARN] indicator.
Peers with last handshake more than 30 minutes ago show a [STALE] indicator.

nefia vpn status

VPN: enabled (not probed)

Peers:
  my-server  10.99.0.2  last handshake: 2m ago
  dev-box    10.99.0.3  last handshake: 12m ago [WARN]
  staging    10.99.0.4  last handshake: 45m ago [STALE]

Troubleshooting: 1 stale peer detected. Run 'nefia vpn diagnose' to investigate.

When stale peers are detected, a troubleshooting hint is displayed suggesting nefia vpn diagnose for further investigation.

Use the --live flag to start the tunnel and see real-time statistics including endpoint addresses, last handshake age, and transfer counters. Use the --ping flag to test TCP connectivity to each active peer (3-second timeout per host):

bash

# Real-time peer statistics
nefia vpn status --live
 
# Connectivity test
nefia vpn status --ping
 
# Both
nefia vpn status --live --ping

Peer Management

Endpoint Validation

When updating peer endpoints via SetPeerEndpoint, the operator validates the endpoint value:

Empty endpoints are rejected — every peer must have a reachable address.
The endpoint format (IP:port) is verified before applying the change.

When adding a new peer with AddPeerWithLocal, a warning is logged if both the public endpoint and the local endpoint are empty, since the peer will be unreachable until an endpoint is configured.
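Here is a Go sketch of the two endpoint checks described above, using `netip.ParseAddrPort` for the `IP:port` format. `validateEndpoint` and `warnIfUnreachable` are hypothetical stand-ins for the internal `SetPeerEndpoint` and `AddPeerWithLocal` logic. Rejecting port 0 is an extra assumption.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net/netip"
	"strings"
)

// validateEndpoint rejects empty values and anything that is not a literal IP:port.
func validateEndpoint(ep string) (netip.AddrPort, error) {
	if strings.TrimSpace(ep) == "" {
		return netip.AddrPort{}, errors.New("endpoint must not be empty")
	}
	ap, err := netip.ParseAddrPort(ep)
	if err != nil {
		return netip.AddrPort{}, fmt.Errorf("endpoint %q is not IP:port: %w", ep, err)
	}
	if ap.Port() == 0 {
		return netip.AddrPort{}, fmt.Errorf("endpoint %q has port 0", ep)
	}
	return ap, nil
}

// warnIfUnreachable logs a warning when a new peer has neither a public nor a local endpoint.
func warnIfUnreachable(id, public, local string) bool {
	if public == "" && local == "" {
		log.Printf("warning: peer %s has no endpoint and is unreachable until one is configured", id)
		return true
	}
	return false
}

func main() {
	for _, ep := range []string{"203.0.113.10:51820", "", "203.0.113.10", "[2001:db8::1]:51820"} {
		ap, err := validateEndpoint(ep)
		fmt.Printf("%q -> %v, err=%v\n", ep, ap, err)
	}
	warnIfUnreachable("mac-dev", "", "")
}
```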

Rebuild and Endpoint Restoration

Peer additions and tunnel Rebuild operations are logged with timing information and the current peer count, which helps diagnose performance issues in large deployments.

Diagnostic Logging

VPN operations — including peer additions, endpoint changes, and tunnel rebuilds — emit structured diagnostic logs with timing data. These logs are useful for troubleshooting connectivity issues and can be viewed through the standard operator log output. Run nefia doctor for a comprehensive health report that checks config, auth, VPN, and end-to-end connectivity, or nefia vpn diagnose for VPN-specific diagnostics only.

Token Inspection

When debugging enrollment issues, use nefia vpn token-info to inspect the contents of an invite token without verifying the signature:

bash

nefia vpn token-info --token 'eyJo...'

This displays the host ID, VPN address, operator public key, endpoint, nonce, and expiry time. Expired tokens are clearly marked with an [EXPIRED] label. See the CLI reference for full details.

Next Steps

Host Management

Organize and manage target PCs with groups and tags.

Security Overview

Understand Nefia's defense-in-depth security architecture.

Configuration Reference

Complete reference for all VPN configuration options.