Agent
Reference for the nefia-agent binary, configuration, and lifecycle management.
The nefia-agent binary runs on target PCs and establishes a WireGuard VPN tunnel back to the operator. Once enrolled, it maintains a persistent connection, forwards SSH traffic, and monitors its own health automatically.
Supported Platforms
| OS | Architecture | Service Manager |
|---|---|---|
| macOS 15+ | amd64, arm64 | launchd |
| Linux (Ubuntu 22.04+, Debian 12+, Fedora 38+) | amd64, arm64 | systemd |
| Windows 11+ | amd64, arm64 | Scheduled task via schtasks |
Installation
curl -fsSL https://www.nefia.ai/install-agent.sh | sh -s -- --token '<INVITE_TOKEN>'
The script downloads the latest binary, verifies its SHA-256 checksum, installs it to /usr/local/bin/nefia-agent, enrolls the agent, and registers a system service.
Agent Binary Commands
The nefia-agent binary provides the following subcommands:
| Command | Description |
|---|---|
| nefia-agent enroll --token <TOKEN> | Enroll with an operator hub using an invite token |
| nefia-agent run --config <PATH> | Run the agent in the foreground (for debugging or service invocation) |
| nefia-agent install-service --config <PATH> | Install the agent as a system service |
| nefia-agent uninstall-service | Remove the system service |
| nefia-agent status | Show agent health status from the last health report |
| nefia-agent wol --mac <MAC> | Send a Wake-on-LAN magic packet from the local network |
| nefia-agent --version | Print version, commit, and build date, then exit |
Global Flags
| Flag | Default | Description |
|---|---|---|
| --config | -- | Path to agent config file |
| --log-level | info | Log level: debug, info, warn, error |
Enrollment
Enrollment pairs the agent with an operator using a single-use, HMAC-SHA256 signed invite token. For cloud relay enrollment, agent tokens are issued via a 2-phase bootstrap code exchange to prevent token replay if the enrollment response is intercepted.
Enrollment Flow
Generate an invite token on the operator PC:
nefia vpn invite --name staging-server --os macos --stun
When logged in (nefia login), this also creates a cloud enrollment session with an expected_host_id parameter that binds the session to a specific agent.
Run the installer with the token on the target PC:
curl -fsSL https://www.nefia.ai/install-agent.sh | sh -s -- --token '<INVITE_TOKEN>'
Example output:
Connecting to operator at 203.0.113.10:19820...
Direct enrollment succeeded.
Validating invite token... OK
Key exchange... OK
VPN tunnel established 10.99.0.5
SSH host key registered SHA256:bE4f...kQ9w
Enrollment successful!
Config written to /Users/admin/.nefia/agent.yaml
Service installed.
2-Phase Bootstrap Code Exchange (Cloud Relay)
When enrollment uses the cloud relay, agent tokens are not issued directly. Instead, the operator receives a short-lived bootstrap code and exchanges it for the actual agent token in a separate API call. This happens automatically within the operator CLI.
Phase 1 -- Enrollment completes. The agent submits its WireGuard public key and endpoint to the cloud relay. When the operator polls the enrollment session result (GET /api/enrollment/sessions/[nonce]/result), the server returns a bootstrap code (HS256-signed JWT, 5-minute TTL) instead of the agent token. The bootstrap code embeds the session nonce, team ID, and host ID.
Phase 2 -- Bootstrap exchange. The operator calls POST /api/agent-tokens/bootstrap/exchange with the bootstrap code and nonce. The server verifies the JWT signature, checks that the host_id in the token matches the enrollment session's agentHostID, marks the session as consumed, and returns the actual agent token. This exchange can only succeed once -- replaying the same bootstrap code returns an error.
Security properties:
- Short-lived: The bootstrap code expires after 5 minutes, limiting the replay window.
- Single-use: The enrollment session transitions from completed to consumed atomically. A second exchange attempt is rejected.
- Host-bound: The expected_host_id set during session creation must match the agent that actually completed enrollment, preventing a different agent from claiming the token.
- Transparent to users: nefia-agent enroll and the operator CLI handle both phases automatically. The user experience is unchanged.
Enroll Flags
| Flag | Description |
|---|---|
| --token | Invite token string (mutually exclusive with --token-file) |
| --token-file | Path to a file containing the invite token |
| --out-dir | Directory to write the generated agent config |
| --ssh-addr | Local SSH address to forward to (default: 127.0.0.1:22) |
| --force | Overwrite existing agent config |
| --yes | Skip interactive confirmation |
| --install | Automatically install the system service after enrollment |
Cloud Relay Fallback
When the agent cannot reach the operator directly (NAT, CGNAT, firewall), enrollment automatically falls back to the cloud relay:
Agent → HTTPS → nefia.ai (relay) ← HTTPS ← Operator
The fallback is transparent -- nefia-agent enroll first tries a direct TCP connection (10-second timeout) and, if that fails, uses the cloud relay via nefia.ai. Only WireGuard public keys and metadata are exchanged through the relay; private keys never leave the device.
Connecting to operator at 203.0.113.10:19820...
Direct connection failed. Trying cloud relay...
Cloud relay enrollment succeeded.
VPN tunnel established 10.99.0.5
Enrollment complete.
Enrollment Error Messages
When enrollment fails, the agent receives a categorized error message to help diagnose the issue:
| Error | Cause | Resolution |
|---|---|---|
| token has expired | The invite token's TTL has elapsed | Generate a new token with nefia vpn reinvite |
| token has already been used (nonce reused) | The token was already used for enrollment | Generate a fresh token -- each token is single-use |
| invalid token signature | The token was corrupted or tampered with | Copy the full token string again from the operator |
| validation failed | Other validation error | Check the operator logs for details |
| enrollment nonce must not be empty | The token's nonce field is missing or empty | Generate a new token -- this indicates a corrupted or hand-crafted token |
| too many enrollment requests | IP-based rate limiting triggered | Wait 30 seconds and retry |
| Invalid bootstrap code | The bootstrap code JWT is expired, tampered with, or the host ID does not match | Re-run enrollment -- the bootstrap code has a 5-minute TTL |
| Bootstrap code has already been exchanged | The bootstrap code was already used to obtain the agent token | Generate a new invite token and re-enroll |
| Enrollment session is missing agent_host_id | The agent did not report its host ID during enrollment | Check agent logs for enrollment errors and retry |
Enrollment Response Size Limit
Enrollment API responses are limited to 1 MB (MaxEnrollResponseBytes). Responses exceeding this limit are rejected to prevent memory exhaustion from malicious or malformed relay responses.
Rate Limiter Behavior
The per-IP rate limiter uses deterministic eviction when the limiter table is full. It evicts the oldest entry by lastSeen timestamp, preferring entries that still have available tokens over those that are currently rate-limited. This ensures that active, well-behaved clients are not evicted in favor of rate-limited ones.
Configuration
The agent stores its YAML configuration at:
- All platforms: ~/.nefia/agent.yaml (i.e. $HOME/.nefia/agent.yaml on macOS/Linux, %USERPROFILE%\.nefia\agent.yaml on Windows)
| Field | Type | Default | Description |
|---|---|---|---|
| private_key | string | -- | WireGuard private key (base64-encoded, or keyring: for OS keyring storage) |
| listen_port | int | 51820 | UDP listen port for WireGuard |
| address | string | -- | VPN IP with CIDR notation (e.g. 10.99.0.2/24) |
| dns | string[] | -- | DNS servers reachable inside VPN (e.g. ["10.99.0.1"]) |
| ssh_addr | string | 127.0.0.1:22 | Local SSH address to forward tunnel connections to |
| peers | PeerEntry[] | -- | Array of operator peer entries (see below) |
| update_server_url | string | -- | Remote update server URL (empty = no remote updates) |
| agent_token | string | -- | API token for cloud communication (from enrollment). Supports keyring:agent-token prefix for OS keyring storage. |
| agent_token_expires_at | string | -- | RFC 3339 expiry time for the agent token (e.g., 2026-06-01T00:00:00Z). |
| host_id | string | -- | Host ID from enrollment, used for endpoint reporting. |
| stun_servers | string[] | -- | Custom STUN servers for endpoint discovery (overrides defaults). Format: host:port. |
| endpoint_refresh | duration | -- | STUN refresh interval (e.g., 5m). Empty or 0 disables periodic endpoint refresh. |
| cloud_api_base | string | -- | Cloud API base URL for endpoint reporting and token rotation. Must be https://. |
| derp_servers | DERPServer[] | -- | DERP relay servers for relay-first connectivity. Propagated from operator config during enrollment. |
| derp_auto_select | boolean | true | Enable RTT-based automatic DERP server selection when multiple relays are configured. |
| derp_probe_interval | duration | 5m | DERP RTT probe interval. Must be at least 30s. |
| beacons | object | -- | Agent-side event monitoring configuration. See below. |
Beacon Configuration
The beacons section configures agent-side event monitoring. Events are sent to the operator's reactor via TCP.
beacons:
  disk_usage:
    enabled: true
    interval: "5m"
    warning_percent: 80
    critical_percent: 90
    paths:
      - "/"
      - "/home"
  service_check:
    enabled: true
    interval: "30s"
    services:
      - nginx
      - postgresql
  scripts:
    - name: "custom-check"
      command: "/opt/scripts/check.sh"
      interval: "10m"
| Beacon | Fields | Description |
|---|---|---|
| disk_usage | enabled, interval, warning_percent, critical_percent, paths | Monitor disk usage on specified mount points. |
| service_check | enabled, interval, services | Monitor systemd service status (Linux only). |
| scripts | name, command, interval | Run custom scripts. Exit 0 = OK, 1 = warning, 2+ = critical. |
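The script exit-code convention and the disk thresholds can be expressed as small mapping functions. In this Go sketch the severity names, the use of `>=` for threshold comparisons, and treating signal-terminated scripts (`ExitCode()` of -1) as critical are assumptions; `runScript` is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

type severity string

const (
	sevOK       severity = "ok"
	sevWarning  severity = "warning"
	sevCritical severity = "critical"
)

// scriptSeverity maps an exit code: 0 = ok, 1 = warning, anything else = critical.
func scriptSeverity(exitCode int) severity {
	switch exitCode {
	case 0:
		return sevOK
	case 1:
		return sevWarning
	default:
		return sevCritical
	}
}

// runScript executes a beacon script and classifies its exit status.
func runScript(command string) (severity, error) {
	err := exec.Command(command).Run()
	if err == nil {
		return sevOK, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return scriptSeverity(exitErr.ExitCode()), nil
	}
	return sevCritical, err // the script could not be started at all
}

// diskSeverity compares a usage percentage with warning_percent and critical_percent.
func diskSeverity(usedPercent, warningPercent, criticalPercent float64) severity {
	switch {
	case usedPercent >= criticalPercent:
		return sevCritical
	case usedPercent >= warningPercent:
		return sevWarning
	default:
		return sevOK
	}
}

func main() {
	fmt.Println(scriptSeverity(2), diskSeverity(84.5, 80, 90)) // critical warning
}
```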
Peer Entry Fields
Each entry in the peers array configures a WireGuard peer (operator):
| Field | Description |
|---|---|
| public_key | Base64-encoded WireGuard public key of the operator |
| endpoint | UDP endpoint (ip:port); empty for incoming-only peers |
| vpn_addr | Peer's VPN IP address (e.g. 10.99.0.1) |
Config Validation
On startup, the agent runs validateAgentConfig() to check all required fields:
- private_key: Must be present (plaintext base64 or a keyring: reference).
- address: Must be valid CIDR notation (e.g., 10.99.0.2/24).
- peers: At least one peer must be configured with a valid public key and VPN address.
- listen_port: Must be a valid port number (0--65535).
- derp_probe_interval: When set, must parse as a Go duration and be at least 30 seconds.
Each validation failure produces a clear error message. The agent exits immediately on validation failure.
Agent Token
Each enrolled agent receives a unique API token for cloud communication (endpoint reporting, token rotation). This token is separate from the operator's OAuth tokens and the WireGuard key pair.
Token Generation
Agent tokens are 256-bit values generated with crypto.randomBytes(32). Only the SHA-256 hash is stored server-side as an ApiToken record -- the plaintext token never persists on the server.
For direct enrollment, the operator generates the token locally and delivers it to the agent over the authenticated WireGuard tunnel.
For cloud relay enrollment, the token is issued via a 2-phase bootstrap code exchange (see 2-Phase Bootstrap Code Exchange). The server issues a short-lived bootstrap code (JWT, 5-minute TTL) when enrollment completes. The operator exchanges this bootstrap code for the actual agent token via POST /api/agent-tokens/bootstrap/exchange. The token is then forwarded to the agent over the VPN tunnel.
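The server-side generation is described in terms of Node's crypto.randomBytes(32); the same idea in Go uses `crypto/rand` and keeps only `sha256.Sum256` of the token. Encoding the token as hex and hashing the encoded string are assumptions of this sketch, and `newAgentToken` and `matches` are hypothetical names.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newAgentToken returns a 256-bit random token and the SHA-256 hash a server would store.
func newAgentToken() (token string, storedHash [32]byte, err error) {
	buf := make([]byte, 32) // 256 bits of randomness
	if _, err = rand.Read(buf); err != nil {
		return "", storedHash, err
	}
	token = hex.EncodeToString(buf)
	return token, sha256.Sum256([]byte(token)), nil
}

// matches checks a presented token against the stored hash in constant time.
func matches(presented string, storedHash [32]byte) bool {
	h := sha256.Sum256([]byte(presented))
	return subtle.ConstantTimeCompare(h[:], storedHash[:]) == 1
}

func main() {
	token, hash, err := newAgentToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("token length:", len(token))
	fmt.Println("stored hash:", hex.EncodeToString(hash[:]))
	fmt.Println("matches:", matches(token, hash))
}
```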
Token Storage
The agent stores its token in the OS keyring:
| Platform | Keyring Backend |
|---|---|
| macOS | Keychain (nefia-agent-token) |
| Linux | Secret Service API (GNOME Keyring, KWallet) |
| Windows | Credential Manager (nefia-agent-token) |
When no keyring is available (headless servers, containers), the token is encrypted with AES-256-GCM and saved to a fallback file alongside the agent config.
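This page does not say how the fallback file's encryption key is derived or how the file is laid out. The Go sketch below shows only AES-256-GCM itself, assuming a caller-supplied 32-byte key and a hypothetical layout of random nonce followed by ciphertext and tag; `seal` and `open` are illustrative names.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-256-GCM AEAD from a 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("AES-256 requires a 32-byte key")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and prepends a fresh random nonce.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal; it fails if the nonce, ciphertext, or tag was modified.
func open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	ns := gcm.NonceSize()
	if len(sealed) < ns {
		return nil, errors.New("ciphertext too short")
	}
	return gcm.Open(nil, sealed[:ns], sealed[ns:], nil)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := seal(key, []byte("example-agent-token"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, sealed)
	fmt.Println(string(plain), err)
}
```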
Automatic Rotation
Agent tokens have a 90-day TTL. The agent checks the remaining TTL every 24 hours. When the TTL drops below 7 days, the agent initiates rotation:
- Calls POST /api/agent-tokens/rotate with its current token.
- The server verifies the current token, generates a new token, and revokes the old one atomically.
- The agent stores the new token in the keyring, replacing the previous value.
If rotation fails (network error, server unavailable), the agent retries on the next 24-hour cycle. The existing token remains valid until its TTL expires.
Endpoint Reporting
The agent periodically reports its public endpoint to the cloud API so operators can reach it even when IP addresses change:
- API: POST /api/hosts/[hostId]/endpoint (rate limited to 10 requests/minute)
- Trigger: Endpoint is reported after STUN discovery during startup and whenever the network monitor detects an IP change.
- Fields: The agent sends its current public endpoint, local endpoint (for hairpin NAT), and NAT type classification.
The host_id field in the agent config (set during enrollment) identifies which host record to update.
Log Rotation
The agent supports built-in log rotation via the following flags on nefia-agent run:
| Flag | Type | Default | Description |
|---|---|---|---|
| --log-file | string | platform default | Path to the log file |
| --log-max-size | int | 50 | Maximum log file size in MB before rotation |
| --log-max-backups | int | 5 | Number of rotated log files to retain |
| --log-max-age | int | 30 | Maximum days to retain rotated log files |
Default log file locations:
| Platform | Path |
|---|---|
| macOS | ~/Library/Logs/nefia/agent.log |
| Linux | $XDG_STATE_HOME/nefia/agent.log (default: ~/.local/state/nefia/agent.log) |
| Windows | %LOCALAPPDATA%\nefia\logs\agent.log |
When the log file exceeds --log-max-size, it is rotated and compressed. Old files are pruned based on --log-max-backups and --log-max-age.
Run Command
Start the agent in the foreground:
nefia-agent run --config <PATH> [flags]
| Flag | Description |
|---|---|
| --strict-permissions | Refuse to start if the config file grants any permission to group or other users (mode & 0o077 != 0) and contains a plaintext private key. Default: true. Keep this enabled in production to enforce secure key storage. |
| --auto-update-interval | Interval for checking binary changes and auto-restarting (0 = disabled, e.g. 5m). |
| --log-file | Path to the log file (see Log Rotation). |
| --log-max-size | Maximum log file size in MB before rotation (default: 50). |
| --log-max-backups | Number of rotated log files to retain (default: 5). |
| --log-max-age | Maximum days to retain rotated log files (default: 30). |
When registered as a system service, the service manager invokes nefia-agent run automatically.
On Windows, the agent runs an icacls check at startup and warns if Everyone or BUILTIN\Users has read access to the config file. On Unix, it checks file permissions (mode & 0o077) and warns if the config is readable by other users.
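Read together, the --strict-permissions flag and the Unix startup check suggest a decision like the one below. This is an interpretation, not the agent's code: `permissionProblem` is hypothetical, and `hasPlaintextKey` stands for "private_key is not a keyring reference". In practice the mode would come from `os.Stat(path).Mode()`.

```go
package main

import (
	"fmt"
	"io/fs"
)

// permissionProblem applies the Unix rule mode & 0o077 != 0 (any group/other bit set).
// strict corresponds to --strict-permissions.
func permissionProblem(mode fs.FileMode, hasPlaintextKey, strict bool) (refuse, warn bool) {
	if mode.Perm()&0o077 == 0 {
		return false, false // owner-only access: nothing to report
	}
	if strict && hasPlaintextKey {
		return true, false // exposed plaintext key: refuse to start
	}
	return false, true // exposed, but no plaintext key or not strict: warn only
}

func main() {
	for _, m := range []fs.FileMode{0o600, 0o640, 0o644} {
		refuse, warn := permissionProblem(m, true, true)
		fmt.Printf("%v refuse=%v warn=%v\n", m, refuse, warn)
	}
}
```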
Service Management
Installing the Service
Register the agent as a system service so it starts on boot:
nefia-agent install-service --config <PATH>
Uninstalling the Service
Remove the system service:
nefia-agent uninstall-service [--yes]
The --yes flag skips the interactive confirmation prompt.
Platform Details
On macOS, install-service registers a LaunchAgent plist at ~/Library/LaunchAgents/com.nefia.agent.plist. The agent runs as a user-level service (no root required). Logs are written to ~/Library/Logs/nefia/nefia-agent.{stdout,stderr}.log. To inspect the loaded service:
launchctl print gui/$(id -u)/com.nefia.agent
Resource Limits
| Constant | Value | Description |
|---|---|---|
| maxAgentConns | 100 | Maximum concurrent forwarded connections. New connections are rejected when the limit is reached. |
| drainTimeout | 30s | Maximum time to wait for active connections to finish during graceful shutdown. |
| defaultMTU | 1420 | MTU used for the WireGuard netstack TUN device. |
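A limit like maxAgentConns is often implemented in Go as a buffered channel used as a counting semaphore, with a non-blocking send so that excess connections are closed immediately instead of queued. `connLimiter` and `serve` are hypothetical names for this sketch.

```go
package main

import (
	"fmt"
	"net"
)

const maxAgentConns = 100

// connLimiter is a counting semaphore backed by a buffered channel.
type connLimiter chan struct{}

// tryAcquire takes a slot without blocking; false means the limit is reached.
func (l connLimiter) tryAcquire() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false
	}
}

func (l connLimiter) release() { <-l }

// serve accepts connections and rejects new ones while all slots are in use.
func serve(ln net.Listener, l connLimiter, handle func(net.Conn)) error {
	for {
		c, err := ln.Accept()
		if err != nil {
			return err
		}
		if !l.tryAcquire() {
			c.Close() // reject: limit reached
			continue
		}
		go func() {
			defer l.release()
			defer c.Close()
			handle(c)
		}()
	}
}

func main() {
	l := make(connLimiter, maxAgentConns)
	fmt.Println("connection slots:", cap(l))
}
```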
SSH Host Key Verification (TOFU)
The agent's SSH host key is registered automatically during enrollment. The operator uses Trust On First Use (TOFU) semantics:
- First connection: The host key is accepted and persisted to the operator's known_hosts file.
- Subsequent connections: The key must match. A changed key causes connection rejection with a MITM warning.
- Concurrent access: An RWMutex protects the known_hosts file, allowing concurrent reads while serializing writes.
This is safe because VPN connections are already authenticated via WireGuard key pairs.
Hairpin NAT Fallback
During enrollment, the agent calls DiscoverLocalEndpoint() to detect its LAN IP address. This local endpoint is reported to the operator as a fallback for hairpin NAT scenarios where the public endpoint is unreachable from the same LAN.
The agent validates the discovered local IP against the public endpoint:
- If no suitable local IP is found, the field is left empty.
- If the local IP matches the public endpoint IP, the field is left empty (no fallback needed).
- Otherwise, the LAN address is stored in host.vpn.local_endpoint in the operator's config.
When establishing a VPN connection, the operator tries the public endpoint first. If unreachable, it falls back to the local endpoint for same-LAN connectivity.
Lifecycle
Service start -- the system service manager launches nefia-agent run.
VPN connect -- establishes a WireGuard tunnel using the stored keypair and endpoint.
SSH forward -- begins forwarding VPN connections to the local SSH server (capped at 100 concurrent connections). The tunnel forwarding logic detects short writes (io.ErrShortWrite) during data relay, preventing silent data corruption when the destination connection cannot accept a full write.
Health monitor loop -- continuously checks tunnel health. On failure, rebuilds with exponential backoff (5s initial, 5m cap, 2x multiplier).
Health Monitoring and Auto-Update
The agent runs two background loops after connecting:
Health checks (watchdog) run every 60 seconds when healthy and every 15 seconds when unhealthy (adaptive interval). If the VPN tunnel degrades, it is torn down and rebuilt with exponential backoff (5s initial, 5m cap, 2x multiplier). If all peers are stale for 2 consecutive monitoring cycles, a full tunnel rebuild is triggered. If the peer status check itself fails (for example, an IPC error), all peers are treated as stale, so the same rebuild logic applies. After 2 consecutive rebuild failures, the watchdog enters a 30-minute cooldown before trying again, rather than giving up permanently. Results are reported to the operator via nefia vpn status, which shows a STALE indicator for peers with no recent handshake.
Network Monitor Integration
The agent runs a network monitor that watches for IP address changes on local interfaces (checked every 5 seconds). When a network change is detected:
- Added and removed IP addresses are logged
- The watchdog receives an immediate rebuild signal
- The tunnel is rebuilt without waiting for the next health check cycle
This provides near-instant recovery after Wi-Fi reconnection, network switches, or VPN toggles on the host system.
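A polling monitor of this kind can be built on `net.InterfaceAddrs`. In this sketch, `snapshot`, `diff`, and `monitor` are hypothetical helpers, and the non-blocking send on `rebuild` stands in for the watchdog's immediate rebuild signal.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"sort"
	"time"
)

// snapshot returns the current interface addresses (e.g. "192.168.1.20/24") as a set.
func snapshot() (map[string]bool, error) {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return nil, err
	}
	set := make(map[string]bool, len(addrs))
	for _, a := range addrs {
		set[a.String()] = true
	}
	return set, nil
}

// diff reports addresses that appeared and disappeared between two snapshots.
func diff(prev, cur map[string]bool) (added, removed []string) {
	for a := range cur {
		if !prev[a] {
			added = append(added, a)
		}
	}
	for a := range prev {
		if !cur[a] {
			removed = append(removed, a)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

// monitor polls every interval and signals rebuild when the address set changes.
func monitor(interval time.Duration, stop <-chan struct{}, rebuild chan<- struct{}) {
	prev, _ := snapshot()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			cur, err := snapshot()
			if err != nil {
				continue
			}
			if added, removed := diff(prev, cur); len(added)+len(removed) > 0 {
				log.Printf("network change: added=%v removed=%v", added, removed)
				select {
				case rebuild <- struct{}{}: // wake the watchdog immediately
				default: // a rebuild signal is already pending
				}
			}
			prev = cur
		}
	}
}

func main() {
	rebuild := make(chan struct{}, 1)
	stop := make(chan struct{})
	go monitor(5*time.Second, stop, rebuild)
	if cur, err := snapshot(); err == nil {
		fmt.Println("watching", len(cur), "addresses")
	}
	close(stop)
}
```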
Auto-updates are disabled by default (--auto-update-interval 0). To enable them, pass a non-zero duration such as --auto-update-interval 6h to nefia-agent run. When enabled, the agent periodically checks for new releases, downloads the binary, verifies its Ed25519 signature, replaces it atomically, and restarts the service.
Live Config Reload (Unix)
On Unix systems (macOS and Linux), sending SIGHUP to the agent process triggers a config reload:
kill -HUP $(pgrep nefia-agent)
This is useful for applying key rotation or endpoint changes without service downtime. If the reload fails (e.g., invalid YAML or missing fields), the error is recorded in the health report's last_config_error field, making it easy to diagnose issues remotely via nefia-agent status or by reading agent-health.json.
SSH Log Suppression
SSH connection state changes are logged only on state transitions (reachable to unreachable and vice versa) to prevent log flooding.
Startup Summary
On startup, the agent logs a summary including peer count, auto-update status, config path, and keyring status.
Health Status
Check the agent's health from the command line:
nefia-agent status
This reads agent-health.json from the config directory and displays the current health report. Example output:
Health Status: ok
VPN Connected: true
Uptime: 72h22m0s
Version: v1.3.0
Rebuild Count: 2
Last Rebuild: 2026-02-28T08:12:00Z
Goroutines: 18
If a config reload error has occurred, a Config Error line is shown as well.
Health Report Fields
The agent-health.json file contains the following fields:
| Field | Description |
|---|---|
| status | Health status string (ok, degraded, unhealthy). |
| vpn_connected | Whether the VPN tunnel is currently connected. |
| uptime_seconds | Seconds since the agent process started. |
| version | Running agent version string. |
| rebuild_count | Total number of tunnel rebuilds since startup. |
| last_rebuild | Timestamp of the most recent tunnel rebuild. |
| goroutines | Current number of Go goroutines (useful for leak detection). |
| last_config_error | Last SIGHUP config reload error message. Empty when the last reload succeeded. |
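A client reading agent-health.json could decode it into a struct like the one below. The JSON keys come from the field table; the Go types (for example, `last_rebuild` as a string and `uptime_seconds` as a float) and the file location ~/.nefia/agent-health.json are assumptions of this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// healthReport mirrors the documented agent-health.json fields; Go types are assumptions.
type healthReport struct {
	Status          string  `json:"status"`
	VPNConnected    bool    `json:"vpn_connected"`
	UptimeSeconds   float64 `json:"uptime_seconds"`
	Version         string  `json:"version"`
	RebuildCount    int     `json:"rebuild_count"`
	LastRebuild     string  `json:"last_rebuild"`
	Goroutines      int     `json:"goroutines"`
	LastConfigError string  `json:"last_config_error"`
}

func parseHealth(data []byte) (healthReport, error) {
	var h healthReport
	err := json.Unmarshal(data, &h)
	return h, err
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println(err)
		return
	}
	data, err := os.ReadFile(filepath.Join(home, ".nefia", "agent-health.json"))
	if err != nil {
		fmt.Println("no health report:", err)
		return
	}
	h, err := parseHealth(data)
	if err != nil {
		fmt.Println("malformed health report:", err)
		return
	}
	uptime := time.Duration(h.UptimeSeconds * float64(time.Second)).Round(time.Second)
	fmt.Printf("Health Status: %s\nVPN Connected: %v\nUptime: %v\n", h.Status, h.VPNConnected, uptime)
	if h.LastConfigError != "" {
		fmt.Println("Config Error:", h.LastConfigError)
	}
}
```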
Remote Management (Operator CLI)
The operator CLI provides two subcommands for managing agents on remote hosts:
nefia agent version
Show the agent version on remote hosts:
nefia agent version [--target <selector>]
Defaults to --target all if no target is specified.
HOST            VERSION  COMPATIBLE  MESSAGE
staging-server  v1.3.0   yes         up to date
nefia agent upgrade
Upgrade nefia-agent on remote hosts:
nefia agent upgrade --target <selector> [--agent-dir <path>]
| Flag | Description |
|---|---|
| --target | Required. Host or group selector for the upgrade target. |
| --agent-dir | Directory containing agent binaries to upload. |
Agent upgrade results:
  staging-server: v1.2.0 -> v1.3.0
Succeeded: 1, Failed: 0
Related
Install the operator CLI and nefia-agent on all supported platforms.
Configure NAT traversal, diagnostics, and key rotation.