VPN Setup Guide
Configure WireGuard VPN tunnels, NAT traversal, and key rotation for Nefia.
Nefia uses WireGuard VPN tunnels for all remote connections. This guide covers advanced VPN configuration including NAT traversal, diagnostics, and key rotation.
Architecture
Nefia uses a star (hub-and-spoke) topology:
             ┌──────────────┐
             │ Operator PC  │
             │    (Hub)     │
             │  10.99.0.1   │
             └──────┬───────┘
                    │
       ┌────────────┼────────────┐
       │            │            │
┌──────┴───┐ ┌──────┴───┐  ┌─────┴────┐
│ Target 1 │ │ Target 2 │  │ Target 3 │
│ 10.99.0.2│ │ 10.99.0.3│  │ 10.99.0.4│
└──────────┘ └──────────┘  └──────────┘
The operator PC serves as the VPN hub on 10.99.0.1/24. Each target PC gets a unique IP in the 10.99.0.0/24 subnet.
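To make the addressing scheme concrete, the sketch below picks the lowest unused host address in the VPN subnet with Python's standard ipaddress module. The function name next_vpn_address and the lowest-free-address policy are assumptions for illustration; this guide does not state how Nefia actually chooses addresses.

```python
import ipaddress


def next_vpn_address(subnet: str, used: set[str]) -> str:
    """Return the lowest host address in `subnet` that is not in `used`.

    Assumption: lowest-free-address allocation; Nefia's real policy may differ.
    """
    network = ipaddress.ip_network(subnet)
    taken = {ipaddress.ip_address(a) for a in used}
    for host in network.hosts():  # hosts() skips the network and broadcast addresses
        if host not in taken:
            return str(host)
    raise RuntimeError(f"no free addresses left in {subnet}")
```

With the hub on 10.99.0.1 and one target on 10.99.0.2, the next target in this sketch would receive 10.99.0.3.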
NAT Traversal
When target PCs are behind NAT, you have several options:
STUN Discovery
Use STUN to automatically discover your public IP and port:
nefia vpn invite --name my-server --os linux --stun
This queries public STUN servers to determine the operator's externally reachable address.
Direct Endpoint
If you know your public IP and have port forwarding configured:
nefia vpn invite --name my-server --os linux --endpoint 203.0.113.10:51820
Invite Flags Reference
| Flag | Default | Description |
|---|---|---|
| --name | (required) | Host ID for the new peer |
| --os | (required) | Target OS: macos, linux, or windows |
| --stun | false | Use STUN to discover the operator's public endpoint |
| --endpoint | | Operator's public endpoint (ip:port) |
| --enroll-port | 19820 | TCP port for the enrollment listener |
| --listen | true | Start enrollment listener after generating the invite |
| --listen-timeout | 60m | How long to wait for the agent to enroll |
| --ttl | 24h | Token time-to-live |
| --token-out | | Write raw invite token to a file |
| --copy | false | Copy the invite token to the system clipboard |
| --no-print-token | false | Suppress human-readable token output (hidden flag for TUI/internal automation) |
Port Forwarding
To enroll a remote target PC, the following ports must be forwarded on the operator's router:
| Port | Protocol | Purpose |
|---|---|---|
| 19820 | TCP | Enrollment (initial registration only) |
| 51820 | UDP | WireGuard VPN tunnel (always) |
Environments Without Port Forwarding
In coworking spaces, corporate networks, or other environments where you cannot manage the router, port forwarding is not possible.
How the cloud relay works:
Agent → HTTPS → nefia.ai ← HTTPS ← Operator
                (relay)
- When the operator runs nefia vpn invite, a local listener starts and a cloud relay session is created simultaneously
- The agent first attempts a direct connection (TCP 19820) and falls back to the cloud relay if it fails
- Only WireGuard public keys and metadata pass through the cloud relay (private keys are never transmitted)
- In environments where direct connections are possible, they are used as usual (completes within 10 seconds)
Manual workarounds (if the cloud relay is unavailable):
- Use mobile tethering: switch the operator PC to a tethered connection, then re-issue the token and enroll:
  nefia vpn reinvite --name <host-id> --stun
- Use a network with port forwarding: configure TCP 19820 / UDP 51820 port forwarding on the router, then enroll
- Use a VPS as a relay: run the operator CLI on a VPS with a public IP to complete enrollment, then migrate the configuration to your local PC
DERP Relay
For environments where direct UDP connectivity is unreliable (symmetric NAT, CGNAT, strict firewalls), Nefia supports DERP (Designated Encrypted Relay for Packets) relay servers. DERP provides WebSocket-based relay as a fallback transport path alongside direct WireGuard and TURN.
Relay-First Architecture
Nefia uses a relay-first connection strategy:
1. DialTCP via DERP relay → Immediate connectivity (< 1s)
2. Background: probe direct UDP path
3. If direct path succeeds → Automatic upgrade (transparent)
This ensures connections succeed immediately even in restrictive networks, while automatically upgrading to the fastest available path in the background.
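The three steps above can be modeled as a small state machine. This is a simplified sketch, not Nefia's implementation: the class name RelayFirstConnection and the probe_direct callback are hypothetical, and a real client would run the probe in the background rather than on demand.

```python
from typing import Callable


class RelayFirstConnection:
    """Starts on the relay path and upgrades to direct UDP when a probe succeeds."""

    def __init__(self, probe_direct: Callable[[], bool]):
        # Hypothetical callback: returns True when the direct UDP path works.
        self.probe_direct = probe_direct
        # Step 1: connect through the DERP relay immediately.
        self.path = "derp"

    def background_probe(self) -> str:
        """Steps 2 and 3: probe the direct path and upgrade on success."""
        if self.path == "derp" and self.probe_direct():
            self.path = "direct"
        return self.path
```

The connection is usable as soon as the object exists; a failed probe leaves it on the relay instead of breaking it.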
Three-Path Transport
The relayAwareBind manages three transport paths simultaneously:
| Path | Protocol | Use Case |
|---|---|---|
| DERP | WebSocket (wss://) | Immediate relay fallback, works through HTTP proxies |
| TURN | UDP/TCP relay | Traditional NAT traversal relay |
| Direct | UDP | Peer-to-peer, lowest latency |
Configuring DERP Servers
Add DERP relay servers to your nefia.yaml:
vpn:
  derp_servers:
    - url: "wss://relay.nefia.ai/derp"
      region: "ap-northeast-1"
    - url: "wss://relay-us.nefia.ai/derp"
      region: "us-east-1"
DERP servers configured on the operator are automatically propagated to agents during enrollment.
Self-Hosted DERP Deployment
For private deployments, run your own DERP relay using the nefia-derp binary:
nefia-derp \
  --addr :8443 \
  --allowed-keys-file /etc/nefia/allowed-keys.txt \
  --metrics-token "$METRICS_TOKEN"
| Flag | Default | Description |
|---|---|---|
| --addr | :8443 | Listen address for WebSocket connections |
| --max-clients | 10000 | Maximum number of concurrent clients |
| --ping-interval | 30s | Keepalive ping interval |
| --allowed-keys-file | (none) | Path to a file containing allowed WireGuard public keys (one per line). If omitted, all keys are accepted. |
| --metrics-token | (none) | Bearer token required for /healthz full metrics. Without this, /healthz returns minimal status only. Can also be set via NEFIA_DERP_METRICS_TOKEN environment variable. |
| --trust-proxy | false | Trust Fly-Client-IP / X-Forwarded-For headers for rate limiting. Enable when running behind a reverse proxy. |
| --version | (none) | Print version information and exit |
The DERP server includes:
- Per-IP rate limiting: 5 requests/second with burst of 10. Exceeding this returns HTTP 429.
- ReadHeaderTimeout: 10 seconds to prevent slowloris attacks.
- Graceful shutdown: Active clients receive a StatusGoingAway frame before the server exits. Clients automatically reconnect to the next available DERP server.
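A limit of "5 requests/second with burst of 10" is commonly implemented as a token bucket, so the sketch below uses one to show how the two numbers interact. Whether nefia-derp uses this algorithm is an assumption; TokenBucket and allow_request are hypothetical names, and the clock is passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, holds at most `burst`."""

    def __init__(self, rate: float = 5.0, burst: int = 10):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # a new client starts with a full burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the server would answer with HTTP 429 here


# One bucket per client IP, created on first use.
buckets: dict[str, TokenBucket] = defaultdict(TokenBucket)


def allow_request(ip: str, now: float) -> bool:
    return buckets[ip].allow(now)
```

In this model a client can send 10 requests at once, is rejected on the 11th, and regains 5 requests of budget for every second it waits.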
NAT Classification
Nefia automatically classifies the NAT type of the network to determine the best connectivity strategy:
| NAT Type | Description | Direct Connectivity |
|---|---|---|
| EIM (Endpoint-Independent Mapping) | Standard NAT, consistent port mapping | Yes (with STUN) |
| EDM (Endpoint-Dependent Mapping) | Symmetric NAT, different port per destination | Difficult, relay recommended |
| CGNAT | Carrier-grade NAT, shared public IP | Relay required |
NAT classification is performed automatically during nefia vpn diagnose and is used internally to select the optimal transport path.
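This guide does not describe how the classification is computed. A common approach, shown here as an assumption rather than Nefia's method, is to send STUN requests from one local socket to several servers and compare the public mappings that come back. The CGNAT check relies on the shared address space 100.64.0.0/10 (RFC 6598) and is only a heuristic; classify_nat and its wan_ip parameter are hypothetical.

```python
import ipaddress
from typing import Optional

CGNAT_RANGE = ipaddress.ip_network("100.64.0.0/10")  # shared address space, RFC 6598


def classify_nat(mappings: list[tuple[str, int]], wan_ip: Optional[str] = None) -> str:
    """Classify NAT behavior from STUN results.

    mappings: (public_ip, public_port) reported by each STUN server,
              all queried from the same local socket.
    wan_ip:   address on the router's WAN interface, if known (hypothetical input).
    """
    if wan_ip is not None and ipaddress.ip_address(wan_ip) in CGNAT_RANGE:
        return "CGNAT"
    if len(set(mappings)) <= 1:
        return "EIM"  # same mapping for every destination
    return "EDM"  # mapping changes per destination (symmetric NAT)
```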
Multipath Routing
Nefia supports active-backup multipath routing that automatically selects the best available network path.
When multiple paths are available (e.g., direct UDP, DERP relay, TCP fallback), the system continuously monitors path quality, fails over automatically when the active path degrades, and avoids flapping between paths.
Configure multipath in nefia.yaml:
vpn:
  multipath:
    mode: "active-backup"
    probe_interval_sec: 5
    failover_threshold_ms: 0
| Field | Description |
|---|---|
| mode | Multipath mode. "active-backup" enables automatic failover. Use "off" to disable. |
| probe_interval_sec | Integer seconds between health probes. |
| failover_threshold_ms | Latency threshold in milliseconds for failover. 0 for automatic. |
The active path is visible in nefia vpn status output. Path switches are logged in the audit trail.
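As a rough model of active-backup selection, the sketch below keeps the current path while it is healthy and switches only when it is down or slower than an explicit latency threshold. The function name, the path names, and the priority order are assumptions; the automatic mode (failover_threshold_ms: 0) is not modeled because this guide does not describe how it works.

```python
from typing import Optional


def select_active_path(current: str, latency_ms: dict[str, Optional[float]],
                       priority: list[str], threshold_ms: float) -> str:
    """Keep `current` unless it is down or slower than `threshold_ms`.

    latency_ms maps a path name to its last probe latency, or None if the probe failed.
    """
    cur = latency_ms.get(current)
    if cur is not None and cur <= threshold_ms:
        return current  # healthy: staying put is what prevents flapping
    for path in priority:  # fail over to the first healthy path in priority order
        lat = latency_ms.get(path)
        if lat is not None and lat <= threshold_ms:
            return path
    for path in priority:  # nothing meets the threshold: settle for any path that is up
        if latency_ms.get(path) is not None:
            return path
    return current
```

Because a healthy active path is never replaced, a backup that becomes slightly faster does not trigger a switch in this sketch.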
Network Monitoring
The agent includes a network monitor that watches for IP address changes on the local network interfaces. When a change is detected (e.g., Wi-Fi reconnection, VPN toggle, network switch):
- The monitor detects added/removed IP addresses within 5 seconds
- A signal is sent to the watchdog for immediate tunnel rebuild
- The tunnel reconnects using the new network path
This eliminates the need to wait for the regular watchdog interval (up to 60 seconds) after a network change, providing near-instant recovery.
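The change detection described above amounts to comparing two snapshots of the interface addresses. A minimal sketch, assuming the monitor polls and keeps the previous snapshot (diff_addresses and needs_rebuild are hypothetical names):

```python
def diff_addresses(previous: set[str], current: set[str]) -> tuple[set[str], set[str]]:
    """Return (added, removed) IP addresses between two snapshots."""
    return current - previous, previous - current


def needs_rebuild(previous: set[str], current: set[str]) -> bool:
    """True when any address appeared or disappeared, which would signal the watchdog."""
    added, removed = diff_addresses(previous, current)
    return bool(added or removed)
```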
Captive Portal Detection
Before attempting NAT traversal, the agent checks for captive portals (hotel Wi-Fi, airport networks, etc.) that intercept HTTP traffic. If a captive portal is detected, a warning is logged with instructions to authenticate with the portal before the VPN can connect.
Enhanced Diagnostics
nefia vpn diagnose includes additional checks beyond basic VPN health:
Latency Measurement
Each active peer is probed for round-trip latency:
[PASS] peer-my-server-latency: 42ms
[WARN] peer-staging-latency: 620ms (>500ms)
[FAIL] peer-backup-latency: 2.3s (>2s)
| Threshold | Result |
|---|---|
| < 500ms | PASS |
| 500ms -- 2s | WARN |
| > 2s | FAIL |
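The thresholds in the table translate directly into a small classifier. The function name classify_latency is hypothetical; the boundaries follow the table, so exactly 500ms and exactly 2s both count as WARN.

```python
def classify_latency(rtt_ms: float) -> str:
    """Map a round-trip time in milliseconds to PASS, WARN, or FAIL."""
    if rtt_ms < 500:
        return "PASS"
    if rtt_ms <= 2000:
        return "WARN"
    return "FAIL"
```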
Route Conflict Detection
Diagnose checks whether the VPN subnet (10.99.0.0/24 by default) overlaps with any existing routes on the system. Overlapping routes can cause traffic to be misrouted.
[FAIL] route-conflict: VPN subnet 10.99.0.0/24 overlaps with existing route 10.99.0.0/24 via en0
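An overlap check of this kind can be sketched with Python's standard ipaddress module. The function name find_route_conflicts is hypothetical, and treating every overlapping prefix except the default route as a conflict is an assumption; Nefia's exact rules are not documented here.

```python
import ipaddress


def find_route_conflicts(vpn_subnet: str, routes: list[str]) -> list[str]:
    """Return the routes whose prefix overlaps the VPN subnet."""
    vpn = ipaddress.ip_network(vpn_subnet)
    conflicts = []
    for route in routes:
        net = ipaddress.ip_network(route)
        if net.prefixlen == 0:
            continue  # the default route overlaps everything, so it is ignored
        if net.overlaps(vpn):
            conflicts.append(route)
    return conflicts
```

Note that a broader route such as 10.0.0.0/8 also overlaps 10.99.0.0/24 and is reported by this sketch.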
VPN Address Collision Prevention
When multiple nefia processes run simultaneously (e.g., concurrent vpn invite and vpn reinvite), file-based locking prevents VPN address collisions. Without this protection, two concurrent invites could assign the same VPN address to different hosts, causing routing conflicts.
The locking mechanism works as follows:
Before modifying nefia.yaml, the process acquires an exclusive file lock (nefia.yaml.lock).
The process reads the current config, selects the next available VPN address, adds the new host, and writes the config back.
The lock is released after the config file is saved.
- On Unix (macOS/Linux): Non-blocking flock is used with up to 30 retry attempts at 1-second intervals (max 30-second wait). The OS automatically releases the lock when the owning process exits.
- On Windows: LockFileEx provides equivalent exclusive locking with the same retry strategy. The lock file stores the owning process PID. If the owning process has crashed, the stale lock is automatically detected and cleaned up by checking process liveness via OpenProcess/GetExitCodeProcess.
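The Unix behavior can be approximated with Python's fcntl module. This is a Unix-only sketch of the retry pattern described above, not Nefia's code: config_lock is a hypothetical helper, and the PID bookkeeping used on Windows is left out.

```python
import fcntl
import time
from contextlib import contextmanager


@contextmanager
def config_lock(lock_path: str, retries: int = 30, interval: float = 1.0):
    """Hold an exclusive advisory lock on `lock_path` while the body runs (Unix only)."""
    with open(lock_path, "w") as lock_file:
        for _ in range(retries):
            try:
                # LOCK_NB makes flock fail immediately instead of blocking.
                fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                time.sleep(interval)
        else:
            raise TimeoutError(f"could not lock {lock_path} after {retries} attempts")
        try:
            yield  # read config, pick the next address, write config back
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

The read-modify-write of nefia.yaml would go inside the with config_lock("nefia.yaml.lock"): block, so two concurrent invites cannot pick the same address.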
Troubleshooting: Port Already in Use (E1004)
If you see this error when running nefia vpn invite, nefia vpn reinvite, or nefia vpn listen, another process is using the VPN or enrollment port:
Error: [E1004] VPN setup failed
Try: Check that no other VPN or WireGuard instance is using the same listen port (default: 51820).
Common causes:
- A previous nefia vpn invite or nefia vpn listen was interrupted with Ctrl+C but the process did not fully exit
- An enrollment listener is running in another terminal
- Another WireGuard instance is using the same port
Resolution:
# 1. Check which process is using the port
lsof -i :51820 # WireGuard VPN port
lsof -i :19820 # Enrollment listener port
# 2. Terminate the process
kill <PID>
# 3. Retry
nefia vpn reinvite --name <host-id> --stun
Reinviting Hosts
If an invite token has expired or you need to re-enroll an existing host, use vpn reinvite instead of removing and recreating the host:
# Regenerate an expired invite (VPN address is preserved)
nefia vpn reinvite --name my-server --stun
# Switch from STUN to a direct endpoint
nefia vpn reinvite --name my-server --endpoint 192.168.1.100:19820
This resets the host to pending status, generates a new token, and starts the enrollment listener. The host's VPN address is kept, so firewall rules and DNS records remain valid.
vpn reinvite accepts the same flags as vpn invite, including --token-out, --copy, and --no-print-token.
Writing Tokens to File
Use --token-out to write the raw invite token to a file instead of only displaying it in the terminal. This is useful for automation scripts and CI/CD pipelines:
nefia vpn invite --name my-server --os linux --stun --token-out /tmp/invite-token.txt
The file is created with 0600 permissions (owner-only read/write).
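For scripts that handle the token themselves, the same owner-only guarantee can be reproduced by creating the file with an explicit mode. This is a Unix-oriented sketch; write_token_file is a hypothetical helper and not part of the Nefia CLI.

```python
import os


def write_token_file(path: str, token: str) -> None:
    """Write `token` to `path` so that only the owner can read or write it."""
    # Passing the mode to os.open avoids a window where the file is world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(token)
    # If the file already existed, os.open kept its old mode, so tighten it explicitly.
    os.chmod(path, 0o600)
```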
Multi-Host Enrollment
When enrolling multiple hosts, use nefia vpn listen with the --count flag to accept multiple enrollments without restarting the listener:
# Accept exactly 3 enrollments
nefia vpn listen --count 3
# Accept all pending hosts (auto-stops when none remain)
nefia vpn listen --count 0
The listener automatically stops when:
- The --count limit is reached, or
- No pending hosts remain in the configuration (when --count 0), or
- The timeout expires (default: 60 minutes)
During enrollment listening, a progress heartbeat is printed every 30 seconds to confirm the listener is still active.
Batch Approval
To approve all pending-approval hosts at once (e.g., after fleet enrollment), use nefia vpn approve --all. This requires explicit risk acknowledgment to prevent accidental mass approval:
# Interactive: shows a confirmation prompt
nefia vpn approve --all
# Non-interactive / automation: must pass --accept-risk
nefia vpn approve --all --accept-risk=approve-all
Enrollment Status
Check whether a specific host has completed enrollment:
nefia vpn enroll-status --name my-server

Host: my-server
Status: active
VPN Addr: 10.99.0.2
Public Key: abc123...
If the host is still pending and the invite has expired, a warning is displayed with instructions to run nefia vpn reinvite.
Two-Phase Enrollment (Cloud Relay)
When using cloud relay enrollment (via nefia.ai), the enrollment process uses a two-phase token exchange for enhanced security:
Phase 1: Agent completes enrollment
The agent connects to the cloud relay, presents the session token embedded in the invite, and submits its WireGuard public key. The enrollment session must have been created with an expected_host_id parameter, and the agent's reported host ID must match. Legacy enrollment sessions (without expectedHostID or tokenHash) are rejected.
Phase 2: Bootstrap code exchange
When the operator polls for the enrollment result, a bootstrap code (a short-lived JWT with a 5-minute TTL) is returned instead of the agent token directly. The operator then exchanges this bootstrap code for the actual agent token via a separate authenticated endpoint.
This two-phase design ensures that even if the enrollment completion is intercepted, the attacker cannot obtain a valid agent token without the operator's authenticated session.
Diagnostics
For a comprehensive system-wide health check covering config, auth, audit, VPN, and connectivity, use:
nefia doctor
This runs all VPN diagnostics plus config validation, auth status, audit checks, and end-to-end TCP connectivity tests to each active host. See the doctor command reference for full details.
To run VPN-specific diagnostics only:
nefia vpn diagnose

VPN Diagnostics:
[PASS] vpn-enabled: VPN is enabled
[PASS] operator-keypair: operator keypair is valid
[PASS] port-available: UDP port 51820 is available
[PASS] enrollment-port: TCP port 19820 (enrollment) is available
[PASS] vpn-addr-unique: all 3 VPN addresses are unique
[PASS] operator-addr-collision: operator VPN address does not overlap with any peer
[PASS] ssh-identity: SSH public key found in ssh.identities
[PASS] stun-reachability: STUN reachable (public IP: 203.0.113.10)
[PASS] peer-my-server-pubkey: public key for my-server is valid
[PASS] peer-my-server-vpnaddr: peer my-server VPN address: 10.99.0.2
Result: All checks passed.
Diagnose a specific host with --host:
nefia vpn diagnose --host my-server
The diagnostic checks include:
- VPN enabled and operator keypair validity
- UDP port 51820 and TCP enrollment port 19820 availability
- VPN address uniqueness across all peers
- Operator VPN address collision detection (ensures operator address doesn't overlap with peer addresses)
- SSH identity file presence
- STUN server reachability
- Per-peer public key validation and invite expiry
- Connectivity testing (when tunnel is active)
Key Rotation
Rotate the operator's WireGuard key with a grace period to avoid disrupting active connections:
nefia vpn rotate-key --grace-period 72h
The command displays rotation details including the new public key, grace period, and number of active hosts. If the rotation produces any warnings (e.g., agents that could not be notified), a Warning field is shown in the output.
The operator generates a new WireGuard keypair and stores it in the config.
Run nefia vpn push-key to distribute the new public key to all active hosts.
Both old and new keys are accepted during the grace period (set with --grace-period, e.g. 72h).
After the grace period expires, the old key is revoked and agents using it must re-enroll.
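The grace-period rule in the steps above can be expressed as a single comparison. The sketch below is an illustration only: accepted_keys is a hypothetical name, and treating the exact expiry instant as already expired is an assumption.

```python
from datetime import datetime, timedelta


def accepted_keys(old_key: str, new_key: str, rotated_at: datetime,
                  grace: timedelta, now: datetime) -> set[str]:
    """Return the operator public keys that are valid at time `now`."""
    if now < rotated_at + grace:
        return {old_key, new_key}  # inside the grace period: both keys work
    return {new_key}  # grace period over: the old key is revoked
```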
Push the key to all active hosts (or a specific host):
# Push to all active VPN hosts
nefia vpn push-key
# Push to a specific host only
nefia vpn push-key --host my-server
MagicDNS
Nefia includes a built-in DNS resolver that maps host IDs to their VPN addresses. MagicDNS is enabled by default with the .nefia domain:
my-server.nefia → 10.99.0.2
dev-box.nefia → 10.99.0.3
operator.nefia → 10.99.0.1
Pending Hosts
Pending hosts (those that have been invited but have not yet completed enrollment) are registered in MagicDNS under the .pending subdomain. For example, a pending host mac-dev is resolvable as:
mac-dev.pending.nefia → 10.99.0.5
Once enrollment completes and the host becomes active, the record changes to the standard domain:
mac-dev.nefia → 10.99.0.5
Configure MagicDNS in your nefia.yaml:
vpn:
  magic_dns:
    enabled: true # default: true
    domain: "nefia" # default: "nefia"
View current DNS records with nefia vpn status:
MagicDNS: active (.nefia, 4 records)
  operator.nefia -> 10.99.0.1
  my-server.nefia -> 10.99.0.2
  dev-box.nefia -> 10.99.0.3
  mac-dev.pending.nefia -> 10.99.0.5
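The naming rule for active and pending hosts can be summarized in a few lines. This sketch is not the resolver itself: the dns_records function and the host dictionary keys (id, status, vpn_addr) are hypothetical stand-ins for whatever Nefia stores in its configuration.

```python
def dns_records(hosts: list[dict], domain: str = "nefia") -> dict[str, str]:
    """Build name -> VPN address records; pending hosts go under pending.<domain>."""
    records = {}
    for host in hosts:
        suffix = f"pending.{domain}" if host["status"] == "pending" else domain
        records[f"{host['id']}.{suffix}"] = host["vpn_addr"]
    return records
```

Changing a host's status from pending to active moves its record from mac-dev.pending.nefia to mac-dev.nefia while the address stays the same.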
VPN Status
nefia vpn status provides a quick overview of VPN health. Even without the --live flag, the command shows peer handshake staleness indicators:
- Peers whose last handshake was more than 5 minutes ago show a [WARN] indicator.
- Peers whose last handshake was more than 30 minutes ago show a [STALE] indicator.
VPN: enabled (not probed)
Peers:
  my-server 10.99.0.2 last handshake: 2m ago
  dev-box 10.99.0.3 last handshake: 12m ago [WARN]
  staging 10.99.0.4 last handshake: 45m ago [STALE]
Troubleshooting: 1 stale peer detected. Run 'nefia vpn diagnose' to investigate.
When stale peers are detected, a troubleshooting hint is displayed suggesting nefia vpn diagnose for further investigation.
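The two staleness thresholds map onto a simple function of the handshake age. The name handshake_indicator is hypothetical; the cut-offs of 5 and 30 minutes come from the list above.

```python
def handshake_indicator(age_seconds: float) -> str:
    """Return the status marker for a peer given its last-handshake age in seconds."""
    if age_seconds > 30 * 60:
        return "[STALE]"
    if age_seconds > 5 * 60:
        return "[WARN]"
    return ""  # recent handshake: no marker
```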
Use the --live flag to start the tunnel and see real-time statistics including endpoint addresses, last handshake age, and transfer counters. Use the --ping flag to test TCP connectivity to each active peer (3-second timeout per host):
# Real-time peer statistics
nefia vpn status --live
# Connectivity test
nefia vpn status --ping
# Both
nefia vpn status --live --ping
Peer Management
Endpoint Validation
When updating peer endpoints via SetPeerEndpoint, the operator validates the endpoint value:
- Empty endpoints are rejected — every peer must have a reachable address.
- The endpoint format (IP:port) is verified before applying the change.
When adding a new peer with AddPeerWithLocal, a warning is logged if both the public endpoint and the local endpoint are empty, since the peer will be unreachable until an endpoint is configured.
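The two validation rules (non-empty, IP:port format) can be illustrated with Python's standard ipaddress module. This sketch is not the SetPeerEndpoint implementation: validate_endpoint is a hypothetical name, and accepting bracketed IPv6 addresses is an assumption.

```python
import ipaddress


def validate_endpoint(endpoint: str) -> tuple[str, int]:
    """Parse an "IP:port" endpoint, raising ValueError if it is empty or malformed."""
    if not endpoint:
        raise ValueError("endpoint must not be empty")
    host, sep, port_text = endpoint.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected IP:port, got {endpoint!r}")
    # Allow the bracketed IPv6 form, e.g. [2001:db8::1]:51820.
    ip = ipaddress.ip_address(host.strip("[]"))  # raises ValueError for non-IP hosts
    if not port_text.isdigit() or not 1 <= int(port_text) <= 65535:
        raise ValueError(f"invalid port in {endpoint!r}")
    return str(ip), int(port_text)
```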
Rebuild and Endpoint Restoration
Peer additions and tunnel Rebuild operations are logged with timing information and the current peer count, which helps diagnose performance issues in large deployments.
Diagnostic Logging
VPN operations — including peer additions, endpoint changes, and tunnel rebuilds — emit structured diagnostic logs with timing data. These logs are useful for troubleshooting connectivity issues and can be viewed through the standard operator log output. Run nefia doctor for a comprehensive health report that checks config, auth, VPN, and end-to-end connectivity, or nefia vpn diagnose for VPN-specific diagnostics only.
Token Inspection
When debugging enrollment issues, use nefia vpn token-info to inspect the contents of an invite token without verifying the signature:
nefia vpn token-info --token 'eyJo...'
This displays the host ID, VPN address, operator public key, endpoint, nonce, and expiry time. Expired tokens are clearly marked with an [EXPIRED] label. See the CLI reference for full details.
Next Steps
Organize and manage target PCs with groups and tags.
Understand Nefia's defense-in-depth security architecture.
Complete reference for all VPN configuration options.