Configuration
Complete reference for the nefia.yaml configuration file format and all available options.
Nefia stores its configuration in a single YAML file. This page documents every section and field available in nefia.yaml.
File Location
~/Library/Application Support/nefia/nefia.yaml
Override the default path with the --config flag on any command:
nefia --config /path/to/custom.yaml hosts list
Creating the Config File
The configuration file is created automatically when you run nefia setup (or its alias nefia init) or nefia vpn invite. You can also create it manually.
nefia setup
Configuration created at /Users/admin/Library/Application Support/nefia/nefia.yaml
VPN: enabled (keypair generated)
Address: 10.99.0.1/24
Audit: enabled (/Users/admin/Library/Application Support/nefia/audit/)
Schema Version
Every configuration file begins with a version field. The current schema version is 1.
version: 1
Top-Level Settings
In addition to the version field, the following top-level settings control session lifecycle and host synchronization:
| Field | Type | Default | Description |
|---|---|---|---|
session_ttl_minutes | int | 1440 (24h) | Maximum session lifetime in minutes. Sessions older than this are automatically closed. 0 uses the default (24 hours). |
session_idle_timeout_minutes | int | 30 | Session idle timeout in minutes. Sessions with no activity for this duration are automatically closed based on LastUsedAt. 0 uses the default (30 minutes). |
session_gc_interval_minutes | int | 5 | Interval in minutes for the session garbage collector to scan and remove expired sessions. 0 uses the default (5 minutes). |
host_sync_interval | duration | 5m | How often to synchronize host state from the web dashboard. Uses Go duration format (e.g., 5m, 1h). |
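
The interaction between the lifetime and idle limits can be illustrated with a minimal Python sketch. The function and constant names are hypothetical, and the rule that a session closes as soon as either limit is exceeded is an assumption inferred from the field descriptions, not taken from Nefia's source.

```python
from datetime import datetime, timedelta

# Documented defaults; a configured value of 0 falls back to these.
DEFAULT_TTL_MINUTES = 1440   # 24 hours
DEFAULT_IDLE_MINUTES = 30


def effective_minutes(configured: int, default: int) -> int:
    """Return the default when the configured value is 0."""
    return default if configured == 0 else configured


def is_expired(created_at: datetime, last_used_at: datetime, now: datetime,
               ttl_minutes: int = 0, idle_minutes: int = 0) -> bool:
    """Assumed rule: expire when either the TTL or the idle timeout is exceeded."""
    ttl = timedelta(minutes=effective_minutes(ttl_minutes, DEFAULT_TTL_MINUTES))
    idle = timedelta(minutes=effective_minutes(idle_minutes, DEFAULT_IDLE_MINUTES))
    return (now - created_at) > ttl or (now - last_used_at) > idle
```

Under this reading, the garbage collector would evaluate a check like `is_expired` once per `session_gc_interval_minutes`.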
version: 1
session_ttl_minutes: 1440
session_idle_timeout_minutes: 30
session_gc_interval_minutes: 5
host_sync_interval: "5m"
Section Reference
defaults — Global execution defaults
Controls default behavior for command execution across all hosts.
| Field | Type | Default | Description |
|---|---|---|---|
concurrency | int | 50 | Maximum number of hosts to operate on in parallel. |
timeout | duration | 30m | Default timeout for remote operations. Supports units: s, m, h. |
output | string | human | Output format: human, json, jsonl, yaml, or compact. |
max_output_bytes | int | 1048576 | Maximum output captured per host (1 MB). Truncated beyond this limit with an [output truncated] warning in stderr. The in-memory snapshot buffer is capped at the larger of 50 MB or max_output_bytes * 2. |
shell | map | — | Default shell per OS. Keys are OS names (macos, linux, windows), values are shell paths (e.g., {"linux": "/bin/bash", "macos": "/bin/zsh"}). Per-host shell overrides this. |
artifact_retention_days | int | 7 | Number of days to retain command output artifacts before automatic cleanup. |
progress.enabled | boolean | true | Enable progress reporting during long-running operations. |
progress.interval | duration | 30s | Interval for progress reporting. |
retry.vpn_recovery_enabled | boolean | true | Enable automatic retry while waiting for transient VPN recovery. The global --retry-timeout flag can enable the behavior for a single command. |
retry.vpn_recovery_timeout | duration | 30s | Default maximum wait time for VPN recovery. Overridden by the global --retry-timeout flag. |
recording.enabled | boolean | false | Enable session recording by default when sessions are opened. |
recording.retention_days | int | 90 | Number of days to retain recorded sessions. |
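
As a worked example of the snapshot buffer rule described for max_output_bytes, the following Python sketch computes the cap. The function name is hypothetical, and it assumes "50 MB" means 50 × 1,048,576 bytes, consistent with 1048576 being described as 1 MB.

```python
MIB = 1024 * 1024  # assumption: "MB" in this page means 1,048,576 bytes


def snapshot_buffer_cap(max_output_bytes: int = 1_048_576) -> int:
    """Cap is the larger of 50 MB or twice max_output_bytes."""
    return max(50 * MIB, max_output_bytes * 2)
```

With the default of 1 MB the 50 MB floor applies. The doubling only takes effect once max_output_bytes exceeds 25 MB.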
defaults:
  concurrency: 50
  timeout: 30m
  output: human
  max_output_bytes: 1048576
  shell:
    macos: /bin/zsh
    linux: /bin/bash
    windows: powershell.exe
  artifact_retention_days: 7
  progress:
    enabled: true
    interval: "30s"
  retry:
    vpn_recovery_enabled: false
    vpn_recovery_timeout: "30s"
  recording:
    enabled: false
    retention_days: 90
defaults.retry — Automatic VPN recovery retry
Controls retry behavior for transient VPN recovery during CLI operations.
| Field | Type | Default | Description |
|---|---|---|---|
vpn_recovery_enabled | boolean | true | Enable retry for retriable VPN failures. |
vpn_recovery_timeout | duration | 30s | Maximum time to wait for VPN recovery before failing the command. |
defaults:
  retry:
    vpn_recovery_enabled: false
    vpn_recovery_timeout: "30s"
defaults.recording — Session recording defaults
Controls whether newly opened sessions are recorded by default.
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable session recording by default when a session is opened. |
retention_days | int | 90 | Number of days to retain recording files. |
defaults:
  recording:
    enabled: false
    retention_days: 90
defaults.notifications — Desktop notification settings
Controls desktop notifications for completed operations.
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable automatic desktop notifications for long-running operations. |
min_duration | duration | 30s | Minimum operation duration before a notification is sent. Only applies when enabled is true. |
defaults:
  notifications:
    enabled: true
    min_duration: "30s"
When enabled, operations exceeding min_duration automatically trigger a desktop notification. You can also use the --notify flag on any command to send a notification regardless of this setting.
ssh — SSH transport settings
Configures SSH connections established inside VPN tunnels.
| Field | Type | Default | Description |
|---|---|---|---|
connect_timeout_sec | int | 5 | Seconds to wait for an SSH connection to establish. |
known_hosts | string | ~/.ssh/known_hosts | Path to the SSH known_hosts file for host key verification (on Windows: %USERPROFILE%\.ssh\known_hosts). |
identities | string[] | — | List of SSH private key file paths to offer during authentication. |
agent | boolean | false | Use the system SSH agent for key authentication. |
max_file_size_bytes | int | 3221225472 | Maximum file size for SFTP transfers (3 GB). |
max_concurrent_fs | int | 20 | Maximum concurrent file system operations per host. |
sftp_timeout_sec | int | 300 | Timeout in seconds for individual SFTP operations (read, write, patch, list, stat). The transport layer also applies a 30-minute hard safety timeout as a last resort. |
max_pool_size | int | 100 | Maximum number of pooled SSH connections across all hosts. |
max_concurrent_dial | int | 20 | Maximum number of concurrent SSH dial attempts. |
max_retries | int | 3 | Maximum number of retry attempts for failed SSH connections. |
initial_backoff | duration | 1s | Initial backoff duration between retry attempts. Uses Go duration format (e.g. 1s, 500ms). |
max_backoff | duration | 60s | Maximum backoff duration between retry attempts. Uses Go duration format. |
connection_ttl | duration | 1h | Maximum lifetime of a pooled SSH connection before it is closed and re-established. Uses Go duration format. |
idle_timeout | duration | 10m | Time after which an idle pooled connection is closed. Uses Go duration format. |
circuit_breaker_threshold | int | 5 | Number of consecutive connection failures before the circuit breaker opens for a host. |
circuit_breaker_reset_timeout | duration | 60s | Time to wait before attempting to reconnect to a host after the circuit breaker opens. Uses Go duration format. |
keepalive_timeout | duration | 5s | Timeout for SSH keepalive health-check probes on pooled connections. Uses Go duration format. |
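
The circuit breaker fields can be read as the standard pattern sketched below in Python. This is an illustration of the general technique under the documented defaults, not Nefia's implementation. The class and method names are hypothetical, and the behavior after the reset timeout elapses (allowing a new attempt) is an assumption.

```python
class CircuitBreaker:
    """Per-host breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 5, reset_timeout: float = 60.0):
        self.threshold = threshold          # circuit_breaker_threshold
        self.reset_timeout = reset_timeout  # circuit_breaker_reset_timeout, in seconds
        self.failures = 0
        self.opened_at = None               # time the breaker opened, or None if closed

    def allow(self, now: float) -> bool:
        """Dialing is allowed while closed, or once the reset timeout has elapsed."""
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        """A successful connection clears the failure streak and closes the breaker."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        """Count a consecutive failure and open the breaker at the threshold."""
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = now
```

Timestamps are passed in explicitly so the behavior can be exercised without waiting for real time to pass.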
ssh:
  connect_timeout_sec: 5
  known_hosts: ~/.ssh/known_hosts
  identities:
    - ~/.ssh/id_ed25519 # Windows: %USERPROFILE%\.ssh\id_ed25519
    - ~/.ssh/id_rsa
  agent: false
  max_file_size_bytes: 3221225472
  max_concurrent_fs: 20
  # Connection pool tuning
  max_pool_size: 100
  connection_ttl: "1h"
  idle_timeout: "10m"
  # Retry / backoff
  max_retries: 3
  initial_backoff: "1s"
  max_backoff: "60s"
  # Circuit breaker
  circuit_breaker_threshold: 5
  circuit_breaker_reset_timeout: "60s"
  keepalive_timeout: "5s"
vpn — WireGuard VPN configuration
Controls the WireGuard VPN hub on the operator PC.
| Field | Type | Default | Description |
|---|---|---|---|
enabled * | boolean | true | Enable the WireGuard VPN. Must be true for any remote operations. |
private_key | string | — | WireGuard private key. Supports inline base64, $ENV_VAR reference, or keyring: prefix for OS keyring. |
listen_port | int | 51820 | UDP port for WireGuard to listen on. |
address | string | 10.99.0.1/24 | VPN address and subnet for the operator (hub). |
dns | string[] | — | DNS servers pushed to the VPN interface. |
magic_dns | object | — | MagicDNS configuration for resolving host names via the VPN. |
The magic_dns sub-object:
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | true | Enable MagicDNS name resolution. |
domain | string | nefia | Domain suffix appended to host names (e.g., pc-office.nefia). |
upstream | string[] | ["1.1.1.1:53","8.8.8.8:53"] | Upstream DNS servers for non-VPN queries. |
Additional VPN fields:
| Field | Type | Default | Description |
|---|---|---|---|
derp_servers | object[] | — | DERP relay servers for relay-first connectivity. Each entry has url (wss:// or ws://) and region. Propagated to agents during enrollment. |
turn_servers | object[] | — | TURN relay servers for NAT traversal fallback. Each entry has url, username, and password (supports $ENV_VAR references). See turn_servers sub-object below. |
stun_servers | string[] | — | STUN servers for endpoint discovery. Each entry must be host:port format. |
stun_timeout | duration | 5s | Timeout for STUN endpoint discovery requests. |
key_rotation_grace_period | duration | 72h | How long to accept the previous key after rotation. |
auto_rotate_interval | duration | — | Automatic key rotation interval (e.g., 720h for 30 days). Empty disables auto-rotation. |
monitor_interval | duration | 30s | Interval for VPN peer health monitoring. |
local_dial_timeout | duration | 3s | Timeout for local endpoint probe in hairpin NAT fallback. Uses Go duration format. |
derp_probe_interval | duration | 5m | Interval between DERP RTT quality probes. Minimum 30s. Used to measure relay latency for auto-selection. |
derp_auto_select | boolean | true | Enable automatic DERP relay selection based on RTT probes. When enabled, the client periodically probes configured DERP servers and routes traffic through the lowest-latency relay. |
multipath | object | — | Multipath active-backup failover configuration. See multipath sub-object below. |
The derp_servers sub-object:
| Field | Type | Description |
|---|---|---|
url | string | WebSocket endpoint (must start with wss:// or ws://), e.g., wss://relay.nefia.ai/derp. |
region | string | Cloud region identifier (e.g., ap-northeast-1, us-east-1). Informational only. |
The turn_servers sub-object:
| Field | Type | Description |
|---|---|---|
url | string | TURN server URL (e.g., turn:turn.example.com:3478). |
username | string | TURN username. Supports $ENV_VAR references. |
password | string | TURN password. Supports $ENV_VAR references. Hidden from JSON output. |
The multipath sub-object:
| Field | Type | Default | Description |
|---|---|---|---|
mode | string | off | Multipath behavior: off (disabled) or active-backup (failover between paths). |
probe_interval_sec | int | 5 | Interval between quality probes in seconds. |
failover_threshold_ms | int | 0 | RTT threshold in milliseconds above which a path is considered degraded and failover is evaluated more aggressively. 0 disables RTT-based failover. |
vpn:
  enabled: true
  private_key: $WG_PRIVATE_KEY
  listen_port: 51820
  address: 10.99.0.1/24
  dns:
    - 1.1.1.1
    - 8.8.8.8
  magic_dns:
    enabled: true
    domain: nefia
    upstream:
      - "1.1.1.1:53"
      - "8.8.8.8:53"
  derp_servers:
    - url: "wss://relay.nefia.ai/derp"
      region: "ap-northeast-1"
  turn_servers:
    - url: "turn:turn.example.com:3478"
      username: $TURN_USER
      password: $TURN_PASS
  stun_servers:
    - "stun.l.google.com:19302"
  stun_timeout: "5s"
  key_rotation_grace_period: "72h"
  auto_rotate_interval: ""
  monitor_interval: "30s"
  local_dial_timeout: "3s"
  derp_probe_interval: "5m"
  derp_auto_select: true
  multipath:
    mode: "off"
    probe_interval_sec: 5
    failover_threshold_ms: 0
auth — Authentication and API settings
Configures connection to the Nefia web dashboard and API.
| Field | Type | Default | Description |
|---|---|---|---|
api_base_url | string | https://www.nefia.ai | Base URL for the Nefia API endpoint. |
web_base_url | string | https://www.nefia.ai | URL of the Nefia web dashboard. |
auth:
  api_base_url: https://www.nefia.ai
  web_base_url: https://www.nefia.ai
policy — Command and path guardrails
Defines the policy engine rules that restrict what commands can be executed and what paths can be accessed.
| Field | Type | Default | Description |
|---|---|---|---|
mode | string | enforce | Policy enforcement mode: off, warn, or enforce. |
deny_commands | string[] | — | Regex patterns for denied commands. Checked before allow rules. Must include ^ or $ anchor in enforce/warn mode. |
allow_commands | string[] | — | Regex patterns for allowed commands. If set, only matching commands are permitted. |
deny_paths | string[] | — | Regex patterns for denied file paths. |
allowed_roots | string[] | — | Allowed root directories for file operations. Paths outside these roots are rejected. |
deny_operations | string[] | — | MCP operation types to deny globally (e.g., exec.sudo, fs.remove). |
sudo_mode | string | — | Override policy mode for sudo specifically: off, warn, or enforce. |
sudo_allow_commands | string[] | — | Regex patterns for allowed sudo commands. |
sudo_deny_commands | string[] | — | Regex patterns for denied sudo commands. |
roles | object[] | — | RBAC role definitions with per-role command, path, and operation restrictions. |
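
The evaluation order for deny_commands and allow_commands can be sketched as follows. This is a simplified Python illustration of enforce mode only. The function name is hypothetical, Python's `re` stands in for Go's regex engine, and unanchored matching is an assumption suggested by the anchor requirement on deny patterns.

```python
import re


def evaluate_command(command: str, deny_commands=(), allow_commands=()) -> str:
    """Return "deny" or "allow" for a command under enforce mode."""
    # Deny rules are checked first, so a deny match wins over any allow rule.
    for pattern in deny_commands:
        if re.search(pattern, command):
            return "deny"
    # If an allow list is set, only matching commands are permitted.
    if allow_commands:
        if any(re.search(p, command) for p in allow_commands):
            return "allow"
        return "deny"
    # No allow list configured: everything not denied is permitted.
    return "allow"
```

Because matching is assumed to be unanchored here, an allow pattern such as `cat|head|tail` would also match longer commands that merely contain one of those words. Anchoring allow patterns with `^` narrows that.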
policy:
  mode: enforce
  deny_commands:
    - "^rm\\s+-rf\\s+/"
    - "^mkfs\\."
    - "^dd\\s+if="
  allow_commands:
    - "systemctl\\s+(status|restart|reload)"
    - "docker\\s+(ps|logs|inspect)"
    - "cat|head|tail|less|grep"
  deny_paths:
    - "/etc/shadow"
    - "/root/\\.ssh"
  allowed_roots:
    - /var/www
    - /etc/nginx
    - /home/deploy
  roles:
    - name: viewer
      hosts: ["^log-", "^monitor-"]
      allow_commands: ["cat|head|tail|ls"]
      deny_paths: ["^/etc/"]
      deny_commands: []
      record_sessions: true
      allowed_roots: ["/var/log"]
    - name: deployer
      hosts: ["^web-", "^worker-"]
      allow_commands: [".*"]
      deny_commands: ["^rm\\s+-rf"]
      allowed_roots: ["/var/www", "/etc/nginx"]
sudo — Sudo privilege escalation
Controls how nefia.exec.sudo and playbook sudo steps execute privileged commands on target hosts.
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable sudo execution. When disabled, exec.sudo calls are rejected. |
method | string | "nopasswd" | Sudo method. Currently only "nopasswd" is supported (passwordless sudo). |
user | string | "root" | Target user for sudo execution. |
allowed_commands | string[] | [] | Regex patterns for allowed sudo commands. Empty means all commands allowed (subject to policy). Each entry must be a valid Go regex. |
deny_commands | string[] | [] | Regex patterns for denied sudo commands. Evaluated before allowed_commands. Each entry must be a valid Go regex. |
require_approval | boolean | false | Require human approval before executing sudo commands. Not yet supported — enabling this will produce a validation error at startup. |
# Example
sudo:
  enabled: true
  method: "nopasswd"
  user: "root"
  allowed_commands:
    - "^apt (update|upgrade)"
    - "^systemctl (restart|reload)"
  deny_commands:
    - "^rm -rf /"
  require_approval: false
audit — Audit logging
Controls the append-only audit log that records every operation.
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | true | Enable audit logging. |
required | boolean | false | When true, audit write failures are treated as fatal errors — operations are blocked if audit logging fails. Default behavior is warn-and-continue. |
dir | string | <config-dir>/audit/ | Directory for audit log files (JSONL format). |
retention_days | int | 90 | Number of days to retain audit logs before automatic cleanup. |
syslog_addr | string | — | Remote syslog server address (e.g., localhost:514). Empty disables syslog forwarding. |
syslog_proto | string | udp | Syslog protocol: udp or tcp. |
audit:
  enabled: true
  required: false
  # macOS: ~/Library/Application Support/nefia/audit/
  # Linux: ~/.config/nefia/audit/
  # Windows: %AppData%\nefia\audit\
  dir: "" # defaults to <config-dir>/audit/
  retention_days: 90
  syslog_addr: "localhost:514"
  syslog_proto: "udp"
SIEM Forwarding
The audit.siem sub-object configures real-time audit event forwarding to external SIEM platforms.
| Field | Type | Default | Description |
|---|---|---|---|
siem.type | string | — | SIEM type: splunk, datadog, or webhook. |
siem.endpoint | string | — | SIEM endpoint URL. |
siem.token_env | string | — | Environment variable name containing the authentication token. |
siem.webhook_secret_env | string | — | Environment variable name containing the webhook HMAC secret (for webhook type). |
siem.batch_size | int | 100 | Number of events to batch before flushing. |
siem.flush_interval | duration | 10s | Maximum time between flushes. |
siem.source | string | nefia | Event source identifier (Splunk). |
siem.source_type | string | nefia:audit | Splunk source type. |
siem.index | string | — | Splunk index name. |
siem.service | string | nefia | Datadog service name. |
siem.tags | string[] | — | Datadog tags to attach to events. |
audit:
  siem:
    type: splunk
    endpoint: https://splunk.example.com:8088/services/collector/event
    token_env: SPLUNK_HEC_TOKEN
    batch_size: 100
    flush_interval: "10s"
    source: nefia
    source_type: "nefia:audit"
    index: security
schedule — Scheduled execution
Configures the built-in scheduler for recurring and deferred operations.
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | true | Enable the scheduling subsystem. |
sync_interval_sec | int | 300 | Reserved internal tuning field. The current scheduler loop does not use it directly. |
retention_days | int | 90 | Number of days to retain schedule execution history. |
max_concurrent_schedules | int | 5 | Maximum number of scheduled playbooks running simultaneously. |
schedule:
  enabled: true
  sync_interval_sec: 300
  retention_days: 90
  max_concurrent_schedules: 5
device_lock — Cryptographic device verification
Controls device lock verification (Tailnet Lock style). When enabled, only hosts whose WireGuard public key has been signed by the device-lock authority are allowed to connect.
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable cryptographic device verification. |
mode | string | log | Enforcement mode: log (log unsigned devices but allow connections) or enforce (block unsigned devices). |
device_lock:
  enabled: false
  mode: log
posture — Device posture verification
Controls device posture checks on target hosts. When enabled, hosts must meet the specified security requirements before connections are allowed.
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable device posture verification. |
mode | string | off | Posture enforcement mode: off (disabled), warn (log non-compliant devices but allow connections), or enforce (block non-compliant devices). |
require_firewall | boolean | false | Require that the host firewall is enabled. |
require_disk_encryption | boolean | false | Require that disk encryption (FileVault, BitLocker, LUKS) is enabled. |
posture:
  enabled: true
  mode: warn
  require_firewall: true
  require_disk_encryption: true
ssh_ca — SSH Certificate Authority
Configures the built-in SSH Certificate Authority for issuing short-lived SSH certificates instead of using static keys.
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable the SSH CA workflow. When enabled, Nefia can issue user and host certificates. |
auto_sign | boolean | true | Automatically issue or renew user certificates on connect when a local identity file is available. |
cert_duration | duration | 24h | Validity period for user certificates. Uses Go duration format (e.g., 1h, 24h). |
host_cert_duration | duration | 2160h | Validity period for host certificates. Uses Go duration format. Default is 90 days. |
allowed_user_principals | string[] | — | Allowlist of principals that can be used in user certificates. Required when enabled is true. Must be non-empty and contain no duplicates. |
allow_privileged_principals | boolean | false | When true, allows privileged principals (e.g., root) in user certificates. By default, privileged principals are rejected. |
ssh_ca:
  enabled: true
  auto_sign: true
  cert_duration: "24h"
  host_cert_duration: "2160h"
  allowed_user_principals:
    - deploy
    - admin
  allow_privileged_principals: false
jit — Just-in-Time access
Configures the Just-in-Time (JIT) temporary access request system. When enabled, operators must request time-limited access to hosts instead of having persistent access.
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable JIT access requests. |
default_duration | duration | 1h | Default grant duration for access requests. Uses Go duration format. |
max_duration | duration | 8h | Maximum allowed grant duration. Requests exceeding this are rejected. Uses Go duration format. |
require_reason | boolean | false | Require a reason string when requesting access. |
webhook_name | string | — | Name of the alerting webhook (defined in alerts.webhooks) to notify when access is requested. |
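
To make the duration limits concrete, here is a small Python sketch that parses a subset of Go duration strings and applies default_duration and max_duration. Both function names are hypothetical. The parser handles only unsigned values with the units ns, us, ms, s, m and h, and the fallback to default_duration when no duration is requested is an assumption.

```python
import re

_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|ms|s|m|h)")


def parse_duration(text: str) -> float:
    """Parse strings such as "30s", "1h30m" or "72h" into seconds."""
    if not text:
        raise ValueError("empty duration")
    pos, total = 0, 0.0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    return total


def resolve_grant(requested, default_duration="1h", max_duration="8h") -> float:
    """Return the grant length in seconds, rejecting requests above max_duration."""
    seconds = parse_duration(requested or default_duration)
    if seconds > parse_duration(max_duration):
        raise ValueError("requested duration exceeds max_duration")
    return seconds
```

A request for exactly max_duration is accepted in this sketch, since the table only says that requests exceeding the maximum are rejected.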
jit:
  enabled: true
  default_duration: "1h"
  max_duration: "8h"
  require_reason: true
  webhook_name: slack-ops
secrets — Dynamic credential injection
Configures dynamic credential injection from external secret backends. Secrets can be resolved at runtime and injected as environment variables into remote commands.
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable secret resolution and injection. |
providers | object[] | — | List of secret backend providers. See providers sub-object below. |
cache_ttl | duration | 5m | How long resolved secrets are cached in memory. Uses Go duration format. |
inject | map | — | Environment variable mappings. Keys are env var names, values are secret references (e.g., ${vault:secret/data/db#password}). |
The providers sub-object:
| Field | Type | Description |
|---|---|---|
name | string | Unique identifier for this provider (e.g., vault-prod). |
type | string | Provider type: vault, aws-sm, op (1Password CLI), env, or file. |
config | map | Provider-specific configuration (e.g., {"addr": "https://vault.example.com"} for Vault). |
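
The secret references used in inject appear to follow a `${provider:path#key}` shape. The Python sketch below splits such a reference into its parts. The grammar is inferred from the examples on this page rather than from a specification, and the function name and the treatment of `#key` as optional are assumptions.

```python
import re

# Assumed grammar: ${<provider>:<path>[#<key>]}
_REF = re.compile(r"^\$\{(?P<provider>[^:}]+):(?P<path>[^#}]+)(?:#(?P<key>[^}]+))?\}$")


def parse_secret_ref(ref: str) -> dict:
    """Split a secret reference into provider, path and optional key."""
    match = _REF.match(ref)
    if match is None:
        raise ValueError(f"not a secret reference: {ref!r}")
    return match.groupdict()
```

For `${vault-prod:secret/data/db#password}` this yields the provider `vault-prod`, the path `secret/data/db` and the key `password`.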
secrets:
  enabled: true
  cache_ttl: "5m"
  providers:
    - name: vault-prod
      type: vault
      config:
        addr: "https://vault.example.com"
    - name: env-local
      type: env
  inject:
    DB_PASSWORD: "${vault-prod:secret/data/db#password}"
    API_KEY: "${vault-prod:secret/data/api#key}"
reactor — Event-driven automation
Configures the operator-side event reactor. The reactor listens for events from agents and triggers automated actions based on pattern-matched rules.
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable the event reactor. |
listen_port | int | 9500 | TCP port for receiving events from agents. |
rules | object[] | — | List of automation rules. See rules sub-object below. |
The rules sub-object:
| Field | Type | Description |
|---|---|---|
name | string | Human-readable name for this rule. |
event_pattern | string | Regex pattern matched against event types. |
host_pattern | string | Regex pattern matched against host IDs. Empty matches all hosts. |
severity | string | Minimum event severity to trigger the rule (e.g., warning, critical). |
action.type | string | Action type: exec (run command), playbook (run playbook), or alert (send webhook). |
action.command | string | Command to execute (for type: exec). |
action.playbook_path | string | Path to playbook file (for type: playbook). |
action.webhook_name | string | Name of alerting webhook (for type: alert). |
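
Rule matching as described in the table can be sketched in Python as follows. The function name is hypothetical, the severity ordering (info, then warning, then critical) is an assumption because only warning and critical are named on this page, and Python's `re` stands in for the reactor's regex engine.

```python
import re

SEVERITY_ORDER = ["info", "warning", "critical"]  # assumed ordering, lowest first


def rule_matches(rule: dict, event_type: str, host_id: str, severity: str) -> bool:
    """Check an incoming event against one reactor rule."""
    if not re.search(rule["event_pattern"], event_type):
        return False
    # An empty or missing host_pattern matches all hosts.
    host_pattern = rule.get("host_pattern", "")
    if host_pattern and not re.search(host_pattern, host_id):
        return False
    # severity is a minimum: events below it do not trigger the rule.
    minimum = rule.get("severity")
    if minimum and SEVERITY_ORDER.index(severity) < SEVERITY_ORDER.index(minimum):
        return False
    return True
```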
reactor:
  enabled: true
  listen_port: 9500
  rules:
    - name: restart-on-crash
      event_pattern: "service_crashed"
      host_pattern: "^web-"
      action:
        type: exec
        command: "systemctl restart nginx"
    - name: alert-disk-full
      event_pattern: "disk_usage_critical"
      severity: critical
      action:
        type: alert
        webhook_name: slack-ops
cluster — High availability (Raft)
Configures active-passive high availability via Raft consensus. When enabled, multiple Nefia operator instances form a cluster with automatic leader election and state replication.
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable Raft-based clustering. |
node_id | string | — | Unique identifier for this node in the cluster. |
bind_addr | string | — | TCP address for Raft communication (e.g., 0.0.0.0:9700). |
advertise_addr | string | — | Address peers use to reach this node. Defaults to bind_addr if omitted. |
peers | object[] | — | Initial cluster members for joining. See peers sub-object below. |
data_dir | string | <state_dir>/cluster | Directory for Raft data (logs, snapshots, stable store). |
tls_enabled | boolean | false | Enable TLS for inter-node Raft communication. |
tls_cert_file | string | — | Path to the TLS certificate file for Raft transport. |
tls_key_file | string | — | Path to the TLS private key file for Raft transport. |
tls_ca_file | string | — | Path to the CA certificate file for verifying peer certificates. |
The peers sub-object:
| Field | Type | Description |
|---|---|---|
id | string | Raft server ID of the peer. |
address | string | Raft TCP address of the peer (e.g., 10.99.0.2:9700). |
cluster:
  enabled: true
  node_id: node-1
  bind_addr: "0.0.0.0:9700"
  advertise_addr: "10.99.0.1:9700"
  peers:
    - id: node-2
      address: "10.99.0.2:9700"
    - id: node-3
      address: "10.99.0.3:9700"
  data_dir: "" # defaults to <state_dir>/cluster
alerts — Webhook-based alerting
Configures webhook notifications for operational events. Alerts are dispatched asynchronously and do not block the triggering operation.
| Field | Type | Default | Description |
|---|---|---|---|
webhooks[].url * | string | — | Webhook endpoint URL (must be http:// or https://). |
webhooks[].name | string | — | Optional name for referencing this webhook. |
webhooks[].type | string | generic | Webhook format: slack, discord, teams, pagerduty, or generic (plain JSON). |
webhooks[].events | string[] | — | Event types to subscribe to. Empty subscribes to all events. |
webhooks[].cooldown_sec | int | 300 | Minimum seconds between alerts of the same event type to prevent flooding. |
webhooks[].template.body | string | — | Go text/template for custom webhook payload. |
webhooks[].pagerduty.routing_key | string | — | PagerDuty Events API v2 routing key (required for pagerduty type). |
webhooks[].pagerduty.default_severity | string | warning | Default PagerDuty severity: info, warning, error, or critical. |
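
The effect of cooldown_sec can be pictured with a small Python sketch that tracks the last send time for each webhook and event type. The class and method names are hypothetical, and keying on the webhook name is an illustrative choice.

```python
class CooldownTracker:
    """Suppress repeat alerts of the same event type on the same webhook."""

    def __init__(self):
        self._last_sent = {}  # (webhook, event_type) -> time of last delivered alert

    def should_send(self, webhook: str, event_type: str, now: float,
                    cooldown_sec: int = 300) -> bool:
        key = (webhook, event_type)
        last = self._last_sent.get(key)
        if last is not None and now - last < cooldown_sec:
            return False  # still inside the cooldown window: drop the alert
        self._last_sent[key] = now
        return True
```

Different event types, and the same event type on a different webhook, are tracked independently in this sketch.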
alerts:
  webhooks:
    - url: https://hooks.slack.com/services/T.../B.../xxx
      type: slack
      events: [exec_failure, vpn_peer_unhealthy]
      cooldown_sec: 300
    - url: https://monitoring.example.com/nefia
      type: generic
Supported Event Types
| Event Type | Trigger |
|---|---|
exec_failure | One or more hosts failed during a command execution. |
vpn_peer_unhealthy | A VPN peer's handshake is stale (detected by the health monitor). |
circuit_breaker_open | The SSH circuit breaker opened for a host after consecutive connection failures. |
policy_rebuild_failed | The policy engine failed to rebuild after a config hot-reload. |
enrollment_complete | A host finished enrollment successfully. |
host_online | A previously offline host became reachable again. |
host_offline | A host transitioned to offline or unhealthy state. |
key_rotation | WireGuard key rotation completed. |
config_change | Configuration was saved or reloaded with changes. |
playbook_complete | A playbook run completed successfully. |
playbook_failed | A playbook run failed. |
queue_executed | A queued offline command was delivered successfully. |
queue_failed | A queued offline command failed. |
host_revoked | One host's VPN access was revoked. |
host_revoke_all | Emergency revocation removed all hosts. |
Delivery and Retry Behavior
- Failed deliveries (HTTP 5xx or network errors) are retried up to 3 times with exponential backoff (1s, 2s, 4s).
- Each attempt has a 10-second timeout.
- Non-2xx responses below 500 (e.g., 4xx) are logged as warnings but not retried.
- The cooldown timer is per event type per webhook. Duplicate alerts of the same type within the cooldown window are silently dropped.
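
The retry rules above can be sketched in Python as follows. The `send` callable is hypothetical: it stands for one HTTP POST with a 10-second timeout that returns the status code or raises `OSError` on a network error. Reading "retried up to 3 times" as one initial attempt plus three retries is an assumption.

```python
import time

BACKOFF_SECONDS = (1, 2, 4)  # wait before retry 1, 2 and 3


def deliver(send, sleep=time.sleep) -> bool:
    """Return True on a 2xx response, False when delivery is abandoned."""
    for attempt in range(len(BACKOFF_SECONDS) + 1):
        try:
            status = send()
        except OSError:
            status = None  # network error: treated like a retriable failure
        if status is not None and 200 <= status < 300:
            return True
        if status is not None and status < 500:
            return False  # non-2xx below 500 (e.g. 4xx): not retried
        if attempt < len(BACKOFF_SECONDS):
            sleep(BACKOFF_SECONDS[attempt])
    return False
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.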
Payload Formats
Generic (type: generic): A JSON object with event, message, details (optional key-value map), and timestamp (RFC 3339).
{
  "event": "exec_failure",
  "message": "Exec failed on 2 host(s)",
  "details": { "failed_count": 2 },
  "timestamp": "2026-03-06T12:00:00Z"
}
Slack (type: slack): A Slack Block Kit payload with a header block (event type), a section block (message), and an optional section block (details formatted as Markdown list).
mcp — MCP server settings
Settings for the Model Context Protocol server used by AI agents.
| Field | Type | Default | Description |
|---|---|---|---|
rate_limit.rate | float | server default | Maximum requests per second. When unset, the running server falls back to its built-in default. |
rate_limit.burst | int | server default | Maximum burst size before rate limiting kicks in. |
approval.enabled | boolean | false | Enable the approval workflow. When enabled, 2 additional approval tools (nefia.approval.list, nefia.approval.respond) are available, and matching rules require human approval before execution. |
approval.timeout_sec | int | 120 | Seconds to wait for user approval before timing out. |
approval.default_action | string | deny | Action when approval times out: deny or allow. |
approval.rules | object[] | — | Pattern-based approval rules (see below). |
mtls.enabled | boolean | false | Enable the mTLS gateway for secure MCP connections. |
mtls.listen_addr | string | 127.0.0.1:19821 | TCP address for the mTLS gateway to listen on. |
mtls.ca_cert_file | string | — | Path to the CA certificate file for client verification. |
mtls.cert_file | string | — | Path to the server certificate file. |
mtls.key_file | string | — | Path to the server private key file. |
Client certificate revocation is not configured through nefia.yaml. The mTLS gateway always uses the state-directory revocation store managed by nefia mtls revoke, and newly revoked certificates are rejected on subsequent handshakes without restarting the gateway.
Each approval rule can specify tools (exact match), commands and paths (prefix match), and hosts (exact match). The first matching rule wins, and require_approval: false can be used as an explicit exemption rule. When approval is disabled, the approval tools are still advertised with (not configured) descriptions for discoverability.
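First-match evaluation of approval rules can be sketched in Python as follows. Several points are assumptions rather than documented behavior: criteria within a rule combine with AND, an omitted criterion matches anything, require_approval defaults to true, and a call that matches no rule needs no approval. The function names are hypothetical.

```python
def _matches(rule: dict, tool: str, command: str = "", path: str = "", host: str = "") -> bool:
    """tools and hosts match exactly; commands and paths match by prefix."""
    if "tools" in rule and tool not in rule["tools"]:
        return False
    if "commands" in rule and not any(command.startswith(p) for p in rule["commands"]):
        return False
    if "paths" in rule and not any(path.startswith(p) for p in rule["paths"]):
        return False
    if "hosts" in rule and host not in rule["hosts"]:
        return False
    return True


def needs_approval(rules: list, tool: str, command: str = "", path: str = "", host: str = "") -> bool:
    """The first matching rule decides; later rules are not consulted."""
    for rule in rules:
        if _matches(rule, tool, command, path, host):
            return rule.get("require_approval", True)
    return False
```

Placing an exemption rule with `require_approval: false` ahead of a broader rule is what lets it take effect, because evaluation stops at the first match.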
mcp:
  rate_limit:
    rate: 1.0
    burst: 10
  approval:
    enabled: false
    timeout_sec: 120
    default_action: deny
    rules: []
  mtls:
    enabled: false
    listen_addr: "127.0.0.1:19821"
    ca_cert_file: ""
    cert_file: ""
    key_file: ""
team — Team context
Configures the active team for multi-tenant operations.
| Field | Type | Default | Description |
|---|---|---|---|
active_team_id | string | — | ID of the currently active team. Set by nefia team use. |
active_team_slug | string | — | Human-readable slug of the active team (e.g., my-team). |
team:
  active_team_id: "tm_abc123"
  active_team_slug: "my-team"
telemetry — Tracing and metrics
Configures OpenTelemetry tracing and Prometheus metrics export. Both subsystems are disabled by default and add zero overhead when disabled (a noop TracerProvider is installed).
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable OpenTelemetry trace export. |
| endpoint | string | localhost:4318 | OTLP HTTP endpoint for trace collection. |
| service_name | string | nefia | Service name reported in traces. |
| metrics_enabled | boolean | false | Enable a Prometheus-compatible metrics endpoint. |
| metrics_port | int | 9090 | TCP port for the Prometheus metrics endpoint. |
| insecure_http | boolean | false | Allow plaintext (non-TLS) OTLP HTTP connections for non-loopback collectors. Only needed when the collector is on a different host and does not support TLS. |
telemetry:
  enabled: false
  endpoint: "localhost:4318"
  service_name: nefia
  insecure_http: false
  metrics_enabled: false
  metrics_port: 9090
OpenTelemetry Tracing
When enabled is true, Nefia exports distributed traces via OTLP HTTP to the configured endpoint. Traces are batched asynchronously and flushed on shutdown. The TraceHandler automatically injects trace_id and span_id into every slog log record when a span context is active, enabling log-trace correlation.
If the OTLP exporter fails to initialise (e.g., the endpoint is unreachable), Nefia silently falls back to a noop tracer instead of failing startup.
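The log-trace correlation idea can be illustrated with a small Python analogue. Nefia itself is described as using a TraceHandler on slog records; the sketch below is not that code. It uses Python's standard `logging` module, and the `current_span` context variable and `TraceFilter` class are hypothetical stand-ins for an active span context.

```python
import contextvars
import logging

# Hypothetical holder for the active span: a (trace_id, span_id) tuple or None.
current_span = contextvars.ContextVar("current_span", default=None)


class TraceFilter(logging.Filter):
    """Copy trace_id and span_id onto a log record when a span is active."""

    def filter(self, record: logging.LogRecord) -> bool:
        span = current_span.get()
        if span is not None:
            record.trace_id, record.span_id = span
        return True  # never drop the record, only annotate it
```

Records emitted outside a span are left untouched, so log lines only carry the IDs when there is a trace to correlate them with.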
Prometheus Metrics
When metrics_enabled is true, Nefia starts an HTTP server on the configured metrics_port with the following endpoints:
| Endpoint | Description |
|---|---|
| /metrics | Prometheus-compatible metrics scrape endpoint. |
| /healthz | Liveness probe. Returns {"status":"healthy"} (200) or {"status":"unhealthy"} (503). |
| /readyz | Readiness probe. Returns {"status":"ready"} (200) or {"status":"not_ready"} (503). |
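A monitoring script might interpret these probe responses as sketched below. The helper names `probe_passed` and `check` are hypothetical, and the sketch assumes the metrics server is reachable at a base URL you supply; the bind address is not specified on this page.

```python
import json
import urllib.error
import urllib.request

# Expected "status" value for a passing probe, taken from the endpoint table.
PASSING = {"/healthz": "healthy", "/readyz": "ready"}


def probe_passed(path: str, status_code: int, body: str) -> bool:
    """Return True when a probe response signals success."""
    expected = PASSING.get(path)
    if expected is None or status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == expected


def check(base_url: str, path: str, timeout: float = 2.0) -> bool:
    """Fetch a probe endpoint and evaluate it; network errors count as failure."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return probe_passed(path, resp.status, resp.read().decode())
    except urllib.error.HTTPError as err:  # e.g. 503 from an unhealthy instance
        return probe_passed(path, err.code, err.read().decode())
    except OSError:  # connection refused, timeout, DNS failure
        return False
```

For example, `check("http://localhost:9090", "/readyz")` would target the default metrics_port, assuming the server listens on localhost.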
Exported Metrics
Nefia exports both expvar counters (always available at Go's debug endpoint) and OTel instruments (exported to Prometheus when enabled):
| expvar Counter | OTel Instrument | Description |
|---|---|---|
| nefia_exec_total | nefia.exec.total | Total command execution operations. |
| nefia_exec_success | — | Successful executions. |
| nefia_exec_fail | — | Failed executions. |
| nefia_conn_dial_total | nefia.conn.dial.total | Total SSH dial attempts. |
| nefia_conn_dial_fail | — | Failed SSH dial attempts. |
| nefia_conn_healthcheck_fail | — | Failed connection health checks. |
| nefia_conn_pool_size | — | Current SSH connection pool size. |
| nefia_session_open_total | — | Total session open operations. |
| nefia_session_gc_removed | — | Sessions removed by garbage collection. |
| nefia_session_gc_runs | — | Number of GC cycles executed. |
| nefia_fs_read_total | nefia.fs.total (op=read) | File read operations. |
| nefia_fs_write_total | nefia.fs.total (op=write) | File write operations. |
| nefia_fs_patch_total | nefia.fs.total (op=patch) | File patch operations. |
| nefia_fs_list_total | nefia.fs.total (op=list) | Directory list operations. |
| nefia_fs_stat_total | nefia.fs.total (op=stat) | File stat operations. |
| nefia_vpn_peer_unhealthy | — | Unhealthy VPN peer detections. |
| nefia_playbook_run_total | nefia.playbook.total | Total playbook run operations. |
OTel histograms record operation duration in seconds with bucket boundaries from 1ms to 10s:
| Histogram | Description |
|---|---|
| nefia.exec.duration | Command execution duration. |
| nefia.conn.dial.duration | SSH dial duration. |
| nefia.fs.duration | Filesystem operation duration (labeled by op). |
| nefia.playbook.duration | Playbook run duration. |
All OTel instruments include an ok boolean attribute for success/failure breakdowns.
hosts — Target PC definitions
Each host represents a target PC enrolled via nefia vpn invite. Hosts are defined as an array of objects.
| Field | Type | Default | Description |
|---|---|---|---|
| id * | string | — | Unique host identifier (name). Must match ^[a-zA-Z0-9][a-zA-Z0-9._-]*$ and be at most 128 characters. |
| address * | string | — | VPN IP address of the host. |
| os * | string | — | Operating system: macos, linux, or windows. |
| user * | string | — | SSH username for connecting to this host. Falls back to $USER (or $USERNAME on Windows). Connection fails with an actionable error if all fallbacks are empty. |
| root | string | / | Default root directory for file operations. |
| shell | string | — | Shell override for this host (e.g., /bin/bash). Overrides the OS-level default from defaults.shell. |
| role | string | — | RBAC role name assigned to this host. Must match a role defined in policy.roles. |
| tags | map | — | Key-value tags for targeting and group membership. |
| vpn.public_key * | string | — | WireGuard public key for this peer. |
| vpn.endpoint | string | — | WireGuard endpoint (ip:port) if the host has a public address. |
| vpn.local_endpoint | string | — | LAN endpoint (ip:port) for hairpin NAT fallback. Discovered automatically during enrollment. |
| vpn.vpn_addr | string | — | Peer's VPN IP address (e.g., 10.99.0.2). |
| vpn.status | string | — | Current VPN status: active, pending, or empty. |
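If you generate host entries with a script, the id and os constraints from the table can be checked up front. The following Python sketch is an illustration, not Nefia's validator: the function name and error strings are hypothetical. Only the pattern, the 128-character limit, and the three OS values come from the table above.

```python
import re
from typing import Optional

# Pattern from the table; fullmatch() is used so a trailing newline is rejected.
HOST_ID_RE = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9._-]*")
MAX_HOST_ID_LEN = 128
VALID_OS = {"macos", "linux", "windows"}


def host_error(host_id: str, os_name: str) -> Optional[str]:
    """Return a (hypothetical) error message, or None if id and os look valid."""
    if len(host_id) > MAX_HOST_ID_LEN:
        return "id must be at most 128 characters"
    if not HOST_ID_RE.fullmatch(host_id):
        return "id must match ^[a-zA-Z0-9][a-zA-Z0-9._-]*$"
    if os_name not in VALID_OS:
        return "os must be one of macos, linux, windows"
    return None
```

`host_error("prod-web-1", "linux")` returns `None`, while an id such as `-web` fails because the first character must be alphanumeric.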
hosts:
  - id: prod-web-1
    address: 10.99.0.2
    os: linux
    user: deploy
    root: /var/www
    tags:
      env: production
      role: web
    vpn:
      public_key: xYz1...aBcD=
      endpoint: 203.0.113.10:51820
      local_endpoint: 192.168.1.50:51820
      vpn_addr: 10.99.0.2
      status: active
groups — Host group definitions
Groups provide named selectors based on tag matching.
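Tag-based selection of this kind can be sketched in a few lines of Python. The sketch assumes a host belongs to a group when every tag under the group's match.tags equals the host's tag of the same name. That AND semantics is an assumption, and the helper names are hypothetical. The dictionaries mirror the YAML shape of hosts and groups.

```python
from typing import Dict, List


def matches_group(host: Dict, group: Dict) -> bool:
    """True when every tag the group asks for is present on the host."""
    wanted = group.get("match", {}).get("tags", {})
    tags = host.get("tags", {})
    return all(tags.get(key) == value for key, value in wanted.items())


def resolve_group(name: str, groups: List[Dict], hosts: List[Dict]) -> List[str]:
    """Return the ids of all hosts selected by the named group."""
    for group in groups:
        if group["name"] == name:
            return [h["id"] for h in hosts if matches_group(h, group)]
    raise KeyError(f"unknown group: {name}")
```

Under this reading, a host tagged `env: production` and `role: web` is selected by both a `webservers` group (role: web) and a `production` group (env: production).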
groups:
  - name: webservers
    match:
      tags:
        role: web
  - name: production
    match:
      tags:
        env: production
  - name: staging
    match:
      tags:
        env: staging
  - name: all-linux
    match:
      tags:
        os: linux
Use groups in target selectors:
nefia exec --target group:webservers -- systemctl status nginx
Complete Example
A fully populated nefia.yaml combining all sections:
version: 1
session_ttl_minutes: 1440
session_idle_timeout_minutes: 30
session_gc_interval_minutes: 5
host_sync_interval: "5m"
defaults:
concurrency: 50
timeout: 30m
output: human
max_output_bytes: 1048576
shell:
macos: /bin/zsh
linux: /bin/bash
windows: powershell.exe
artifact_retention_days: 7
progress:
enabled: true
interval: "30s"
notifications:
enabled: false
min_duration: "30s"
ssh:
connect_timeout_sec: 5
known_hosts: ~/.ssh/known_hosts
identities:
- ~/.ssh/id_ed25519
agent: false
max_file_size_bytes: 3221225472
max_concurrent_fs: 20
max_pool_size: 100
connection_ttl: "1h"
idle_timeout: "10m"
max_retries: 3
initial_backoff: "1s"
max_backoff: "60s"
circuit_breaker_threshold: 5
circuit_breaker_reset_timeout: "60s"
keepalive_timeout: "5s"
vpn:
enabled: true
private_key: $WG_PRIVATE_KEY
listen_port: 51820
address: 10.99.0.1/24
dns: [1.1.1.1]
magic_dns:
enabled: true
domain: nefia
upstream: ["1.1.1.1:53", "8.8.8.8:53"]
derp_servers:
- url: "wss://relay.nefia.ai/derp"
region: "ap-northeast-1"
turn_servers:
- url: "turn:turn.example.com:3478"
username: $TURN_USER
password: $TURN_PASS
stun_servers:
- "stun.l.google.com:19302"
stun_timeout: "5s"
key_rotation_grace_period: "72h"
monitor_interval: "30s"
local_dial_timeout: "3s"
derp_probe_interval: "5m"
derp_auto_select: true
multipath:
mode: "off"
probe_interval_sec: 5
failover_threshold_ms: 0
auth:
api_base_url: https://www.nefia.ai
web_base_url: https://www.nefia.ai
policy:
mode: enforce
deny_commands:
- "^rm\\s+-rf\\s+/"
allow_commands:
- "systemctl\\s+(status|restart|reload)"
deny_paths:
- "/etc/shadow"
allowed_roots:
- /var/www
- /etc/nginx
audit:
enabled: true
required: false
retention_days: 90
syslog_addr: "localhost:514"
syslog_proto: "udp"
sudo:
enabled: false
method: "nopasswd"
user: "root"
allowed_commands: []
deny_commands: []
require_approval: false
schedule:
enabled: true
sync_interval_sec: 300
retention_days: 90
max_concurrent_schedules: 5
device_lock:
enabled: false
mode: log
posture:
enabled: true
mode: warn
require_firewall: true
require_disk_encryption: true
ssh_ca:
enabled: true
auto_sign: true
cert_duration: "24h"
host_cert_duration: "2160h"
allowed_user_principals:
- deploy
- admin
allow_privileged_principals: false
jit:
enabled: false
default_duration: "1h"
max_duration: "8h"
require_reason: true
webhook_name: ""
secrets:
enabled: false
cache_ttl: "5m"
providers:
- name: vault-prod
type: vault
config:
addr: "https://vault.example.com"
inject:
DB_PASSWORD: "${vault-prod:secret/data/db#password}"
reactor:
enabled: false
listen_port: 9500
rules: []
cluster:
enabled: false
node_id: node-1
bind_addr: "0.0.0.0:9700"
advertise_addr: ""
peers: []
data_dir: ""
mcp:
rate_limit:
rate: 1.0
burst: 10
approval:
enabled: false
timeout_sec: 120
default_action: deny
mtls:
enabled: false
listen_addr: "127.0.0.1:19821"
ca_cert_file: ""
cert_file: ""
key_file: ""
alerts:
webhooks: []
team:
active_team_id: ""
active_team_slug: ""
telemetry:
enabled: false
endpoint: "localhost:4318"
service_name: nefia
insecure_http: false
metrics_enabled: false
metrics_port: 9090
hosts:
- id: prod-web-1
address: 10.99.0.2
os: linux
user: deploy
root: /var/www
shell: /bin/bash
role: deployer
tags:
env: production
role: web
vpn:
public_key: xYz1...aBcD=
endpoint: 203.0.113.10:51820
vpn_addr: 10.99.0.2
status: active
groups:
- name: webservers
match:
tags:
role: web
- name: production
match:
tags:
env: production
Validating Your Configuration
Run the built-in validator to catch errors:
nefia validate
Configuration: /Users/admin/.config/nefia/nefia.yaml
Hosts: 5 defined, 4 reachable
Groups: 4 defined, all valid
Policy: 12 rules loaded, no conflicts
VPN: enabled, key present
Result: OK
Config Validation Rules
The following validation rules are applied at startup and when running nefia validate:
| Field | Rule | Error |
|---|---|---|
| vpn.derp_servers[].url | Must not be empty | vpn.derp_servers[N].url must not be empty |
| vpn.derp_servers[].url | Must start with wss:// or ws:// | vpn.derp_servers[N].url must start with wss:// or ws:// |
| vpn.derp_servers[].url | Must be a valid URL | vpn.derp_servers[N].url is not a valid URL |
| vpn.stun_servers[] | Must not be empty | vpn.stun_servers[N] must not be empty |
| vpn.stun_servers[] | Must be host:port format | vpn.stun_servers[N] must be host:port |
| hosts[].vpn.vpn_addr | Must be unique across all hosts | Duplicate VPN address error |
| hosts[].vpn.vpn_addr | Must not overlap with operator address | Operator address collision error |
| ssh_ca.allowed_user_principals | Must be non-empty when ssh_ca.enabled is true | ssh_ca.allowed_user_principals must not be empty when ssh_ca is enabled |
| ssh_ca.allowed_user_principals | Must contain no duplicates | ssh_ca.allowed_user_principals contains duplicate entry |
| sudo.allowed_commands[] | Must be valid regex patterns | sudo.allowed_commands[N] is not a valid regex |
| sudo.deny_commands[] | Must be valid regex patterns | sudo.deny_commands[N] is not a valid regex |
| sudo.require_approval | Must be false (not yet supported) | sudo.require_approval is not yet supported |
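To pre-check generated configs before running nefia validate, a few of these rules can be approximated in Python. This is a rough sketch, not the validator Nefia ships. The function names are hypothetical, the numeric port check is an assumption about what "host:port" means, and Python's `re` syntax is not guaranteed to accept exactly the same patterns as Nefia's regex engine. The error strings follow the table above.

```python
import re
from typing import List, Optional
from urllib.parse import urlsplit


def validate_derp_url(url: str, n: int) -> Optional[str]:
    name = f"vpn.derp_servers[{n}].url"
    if not url:
        return f"{name} must not be empty"
    if not url.startswith(("wss://", "ws://")):
        return f"{name} must start with wss:// or ws://"
    try:
        parts = urlsplit(url)  # raises ValueError on malformed input
    except ValueError:
        return f"{name} is not a valid URL"
    if not parts.hostname:
        return f"{name} is not a valid URL"
    return None


def validate_stun_server(addr: str, n: int) -> Optional[str]:
    name = f"vpn.stun_servers[{n}]"
    if not addr:
        return f"{name} must not be empty"
    host, sep, port = addr.rpartition(":")
    # Assumption: the port must be a decimal number in 1..65535.
    if not (sep and host and port.isascii() and port.isdigit()
            and 0 < int(port) <= 65535):
        return f"{name} must be host:port"
    return None


def validate_regex_list(name: str, patterns: List[str]) -> List[str]:
    """Check patterns such as sudo.allowed_commands or sudo.deny_commands."""
    errors = []
    for n, pattern in enumerate(patterns):
        try:
            re.compile(pattern)
        except re.error:
            errors.append(f"{name}[{n}] is not a valid regex")
    return errors
```

Each function returns the first problem it finds for one entry, mirroring the one-rule-per-row layout of the table.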
File-Based Config Locking
When multiple nefia processes modify the configuration concurrently (e.g., parallel vpn invite and vpn reinvite commands), a file-based lock prevents race conditions:
- An exclusive lock file (nefia.yaml.lock) is acquired before any config modification
- On Unix (macOS/Linux): non-blocking flock with retry (up to 30 attempts, 1 second apart, max 30-second wait)
- On Windows: LockFileEx with exclusive non-blocking lock and the same retry strategy
- The lock is released after the config file is saved
This prevents VPN address collisions and config corruption when enrollment operations run in parallel.
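The Unix retry strategy can be sketched in Python with `fcntl.flock`. This is a minimal illustration of a non-blocking lock with bounded retries, not Nefia's code: the function names are hypothetical, and the defaults simply mirror the 30 attempts at one-second intervals described above. It runs on Unix only.

```python
import fcntl
import os
import time


def acquire_config_lock(lock_path: str, attempts: int = 30, delay: float = 1.0) -> int:
    """Return a file descriptor holding an exclusive flock, or raise TimeoutError."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    for attempt in range(attempts):
        try:
            # LOCK_NB makes flock fail immediately instead of blocking.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            if attempt < attempts - 1:
                time.sleep(delay)
    os.close(fd)
    raise TimeoutError(f"could not lock {lock_path} after {attempts} attempts")


def release_config_lock(fd: int) -> None:
    """Release the lock; closing the descriptor would also release it."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A caller would wrap the read-modify-save cycle between `acquire_config_lock(...)` and `release_config_lock(fd)`, ideally in a `try`/`finally` so the lock is released even when saving fails.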
Stale Lock Recovery
On Unix, the operating system automatically releases flock locks when the owning process exits (even on crash), so stale locks are not possible under normal circumstances.
On Windows, the lock file stores the PID of the owning process. When a lock acquisition fails, Nefia reads the PID from the lock file and checks whether that process is still alive using OpenProcess and GetExitCodeProcess. If the owning process has exited (crash or ungraceful termination), the stale lock file is automatically removed and the lock is re-acquired. This prevents a crashed process from holding the lock indefinitely.
Agent-Side Configuration
The agent (nefia-agent) uses a separate agent.yaml configuration file. See the Agent Reference for the full schema. Key extensions for NAT traversal and cloud communication:
| Field | Type | Description |
|---|---|---|
| agent_token | string | API token for cloud communication. Supports keyring:agent-token for OS keyring storage. |
| agent_token_expires_at | string | RFC 3339 expiry time for the agent token. |
| host_id | string | Host ID for endpoint reporting. |
| stun_servers | string[] | Custom STUN servers (overrides defaults). Format: host:port. |
| endpoint_refresh | duration | STUN refresh interval (e.g., 5m). Empty disables. |
| cloud_api_base | string | Cloud API base URL. Must be https://. |
| derp_servers | DERPServer[] | DERP relay servers (propagated from operator during enrollment). |