Configuration

Complete reference for the nefia.yaml configuration file format and all available options.

Nefia stores its configuration in a single YAML file. This page documents every section and field available in nefia.yaml.

File Location

plaintext

~/Library/Application Support/nefia/nefia.yaml

This is the macOS location. The audit directory defaults listed under audit suggest that the config directory is ~/.config/nefia/ on Linux and %AppData%\nefia\ on Windows, so nefia.yaml is expected in the corresponding directory on those platforms.

Override the default path with the --config flag on any command:

bash

nefia --config /path/to/custom.yaml hosts list

Creating the Config File

The configuration file is created automatically when you run nefia setup (alias: nefia init) or nefia vpn invite. You can also create it by hand.

bash

nefia setup

Configuration created at /Users/admin/Library/Application Support/nefia/nefia.yaml

VPN: enabled (keypair generated)
Address: 10.99.0.1/24
Audit: enabled (/Users/admin/Library/Application Support/nefia/audit/)

Schema Version

Every configuration file begins with a version field. The current schema version is 1.

yaml

version: 1

Top-Level Settings

In addition to the version field, the following top-level settings control session lifecycle and host synchronization:

Field	Type	Default	Description
`session_ttl_minutes`	int	`1440` (24h)	Maximum session lifetime in minutes. Sessions older than this are automatically closed. `0` uses the default (24 hours).
`session_idle_timeout_minutes`	int	`30`	Session idle timeout in minutes. Sessions with no activity for this duration are automatically closed based on `LastUsedAt`. `0` uses the default (30 minutes).
`session_gc_interval_minutes`	int	`5`	Interval in minutes for the session garbage collector to scan and remove expired sessions. `0` uses the default (5 minutes).
`host_sync_interval`	duration	`5m`	How often to synchronize host state from the web dashboard. Uses Go duration format (e.g., `5m`, `1h`).

yaml

version: 1
session_ttl_minutes: 1440
session_idle_timeout_minutes: 30
session_gc_interval_minutes: 5
host_sync_interval: "5m"
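
How the two session limits combine can be sketched in Python. This is only an illustration of the rules in the table above, not Nefia's implementation: the function name, the `created_at` argument, and the strict greater-than comparison are assumptions.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TTL_MINUTES = 1440    # 24 hours
DEFAULT_IDLE_MINUTES = 30


def session_expired(created_at, last_used_at, now,
                    ttl_minutes=0, idle_timeout_minutes=0):
    """Return True if a session should be closed under either limit."""
    # A configured value of 0 means "use the documented default".
    ttl = timedelta(minutes=ttl_minutes or DEFAULT_TTL_MINUTES)
    idle = timedelta(minutes=idle_timeout_minutes or DEFAULT_IDLE_MINUTES)
    too_old = now - created_at > ttl        # session_ttl_minutes
    too_idle = now - last_used_at > idle    # session_idle_timeout_minutes
    return too_old or too_idle
```

Under these assumptions, a session opened two hours ago and last used 45 minutes ago is closed by the idle timeout even though it is well within the 24-hour lifetime.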

Section Reference

defaults — Global execution defaults

Controls default behavior for command execution across all hosts.

Field	Type	Default	Description
`concurrency`	int	`50`	Maximum number of hosts to operate on in parallel.
`timeout`	duration	`30m`	Default timeout for remote operations. Supports units: s, m, h.
`output`	string	`human`	Output format: `human`, `json`, `jsonl`, `yaml`, or `compact`.
`max_output_bytes`	int	`1048576`	Maximum output captured per host, in bytes (1 MB). Output beyond this limit is truncated, and an `[output truncated]` warning is written to stderr. The in-memory snapshot buffer is capped at the larger of 50 MB or `max_output_bytes * 2`.
`shell`	map	—	Default shell per OS. Keys are OS names (`macos`, `linux`, `windows`), values are shell paths (e.g., `{"linux": "/bin/bash", "macos": "/bin/zsh"}`). Per-host `shell` overrides this.
`artifact_retention_days`	int	`7`	Number of days to retain command output artifacts before automatic cleanup.
`progress.enabled`	boolean	`true`	Enable progress reporting during long-running operations.
`progress.interval`	duration	`30s`	Interval for progress reporting.
`retry.vpn_recovery_enabled`	boolean	`true`	Enable automatic retry while waiting for transient VPN recovery. The global `--retry-timeout` flag can enable the behavior for a single command.
`retry.vpn_recovery_timeout`	duration	`30s`	Default maximum wait time for VPN recovery. Overridden by the global `--retry-timeout` flag.
`recording.enabled`	boolean	`false`	Enable session recording by default when sessions are opened.
`recording.retention_days`	int	`90`	Number of days to retain recorded sessions.

yaml

defaults:
  concurrency: 50
  timeout: 30m
  output: human
  max_output_bytes: 1048576
  shell:
    macos: /bin/zsh
    linux: /bin/bash
    windows: powershell.exe
  artifact_retention_days: 7
  progress:
    enabled: true
    interval: "30s"
  retry:
    vpn_recovery_enabled: false  # the documented default is true; false disables automatic retry
    vpn_recovery_timeout: "30s"
  recording:
    enabled: false
    retention_days: 90
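
Duration fields such as `timeout` and `progress.interval` use Go's duration syntax. The Python sketch below shows roughly how such strings decompose into number and unit pairs. It is a simplified stand-in for Go's `time.ParseDuration` (it ignores signs and does not check for overflow), and `parse_duration` is a name chosen for this example.

```python
import re
from datetime import timedelta

_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0,
}
# Two-letter units come first so "ms" is not read as "m" followed by "s".
_TERM = re.compile(r"(\d+(?:\.\d*)?|\.\d+)(ns|us|µs|ms|s|m|h)")


def parse_duration(text):
    """Parse a Go-style duration such as "30m", "1h30m" or "500ms"."""
    if text == "0":
        return timedelta(0)
    pos, total = 0, 0.0
    while pos < len(text):
        match = _TERM.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0:
        raise ValueError("empty duration")
    return timedelta(seconds=total)
```

With this sketch, `parse_duration("30m")` yields a 30-minute `timedelta`, while a value such as `"5 minutes"` raises `ValueError`.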

defaults.retry — Automatic VPN recovery retry

Controls retry behavior for transient VPN recovery during CLI operations.

Field	Type	Default	Description
`vpn_recovery_enabled`	boolean	`true`	Enable retry for retriable VPN failures.
`vpn_recovery_timeout`	duration	`30s`	Maximum time to wait for VPN recovery before failing the command.

yaml

defaults:
  retry:
    vpn_recovery_enabled: false  # the documented default is true; false disables automatic retry
    vpn_recovery_timeout: "30s"

defaults.recording — Session recording defaults

Controls whether newly opened sessions are recorded by default.

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable session recording by default when a session is opened.
`retention_days`	int	`90`	Number of days to retain recording files.

yaml

defaults:
  recording:
    enabled: false
    retention_days: 90

defaults.notifications — Desktop notification settings

Controls desktop notifications for completed operations.

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable automatic desktop notifications for long-running operations.
`min_duration`	duration	`30s`	Minimum operation duration before a notification is sent. Only applies when `enabled` is true.

yaml

defaults:
  notifications:
    enabled: true
    min_duration: "30s"

When enabled, operations exceeding min_duration automatically trigger a desktop notification. You can also use the --notify flag on any command to send a notification regardless of this setting.

ssh — SSH transport settings

Configures SSH connections established inside VPN tunnels.

Field	Type	Default	Description
`connect_timeout_sec`	int	`5`	Seconds to wait for an SSH connection to establish.
`known_hosts`	string	`~/.ssh/known_hosts`	Path to the SSH known_hosts file for host key verification (on Windows: `%USERPROFILE%\.ssh\known_hosts`).
`identities`	string[]	—	List of SSH private key file paths to offer during authentication.
`agent`	boolean	`false`	Use the system SSH agent for key authentication.
`max_file_size_bytes`	int	`3221225472`	Maximum file size for SFTP transfers (3 GB).
`max_concurrent_fs`	int	`20`	Maximum concurrent file system operations per host.
`sftp_timeout_sec`	int	`300`	Timeout in seconds for individual SFTP operations (read, write, patch, list, stat). The transport layer also applies a 30-minute hard safety timeout as a last resort.
`max_pool_size`	int	`100`	Maximum number of pooled SSH connections across all hosts.
`max_concurrent_dial`	int	`20`	Maximum number of concurrent SSH dial attempts.
`max_retries`	int	`3`	Maximum number of retry attempts for failed SSH connections.
`initial_backoff`	duration	`1s`	Initial backoff duration between retry attempts. Uses Go duration format (e.g. `1s`, `500ms`).
`max_backoff`	duration	`60s`	Maximum backoff duration between retry attempts. Uses Go duration format.
`connection_ttl`	duration	`1h`	Maximum lifetime of a pooled SSH connection before it is closed and re-established. Uses Go duration format.
`idle_timeout`	duration	`10m`	Time after which an idle pooled connection is closed. Uses Go duration format.
`circuit_breaker_threshold`	int	`5`	Number of consecutive connection failures before the circuit breaker opens for a host.
`circuit_breaker_reset_timeout`	duration	`60s`	Time to wait before attempting to reconnect to a host after the circuit breaker opens. Uses Go duration format.
`keepalive_timeout`	duration	`5s`	Timeout for SSH keepalive health-check probes on pooled connections. Uses Go duration format.

yaml

ssh:
  connect_timeout_sec: 5
  known_hosts: ~/.ssh/known_hosts
  identities:
    - ~/.ssh/id_ed25519    # Windows: %USERPROFILE%\.ssh\id_ed25519
    - ~/.ssh/id_rsa
  agent: false
  max_file_size_bytes: 3221225472
  max_concurrent_fs: 20
  # Connection pool tuning
  max_pool_size: 100
  connection_ttl: "1h"
  idle_timeout: "10m"
  # Retry / backoff
  max_retries: 3
  initial_backoff: "1s"
  max_backoff: "60s"
  # Circuit breaker
  circuit_breaker_threshold: 5
  circuit_breaker_reset_timeout: "60s"
  keepalive_timeout: "5s"
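
The circuit breaker fields can be read as a small state machine. The sketch below is one plausible interpretation in Python, assuming the breaker counts consecutive failures per host and permits dialing again once `circuit_breaker_reset_timeout` has elapsed. The class and method names are hypothetical, and Nefia's actual behavior may differ in detail.

```python
import time


class CircuitBreaker:
    """Per-host breaker that opens after N consecutive failures."""

    def __init__(self, threshold=5, reset_timeout=60.0, clock=time.monotonic):
        self.threshold = threshold          # circuit_breaker_threshold
        self.reset_timeout = reset_timeout  # circuit_breaker_reset_timeout, seconds
        self.clock = clock                  # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """Return True if a dial attempt may proceed."""
        if self.opened_at is None:
            return True
        # Once the reset timeout has passed, attempts are allowed again;
        # another failure re-opens the breaker immediately.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

Keeping one such object per host means a single unreachable machine stops consuming dial slots without affecting the rest of the fleet.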

vpn — WireGuard VPN configuration

Controls the WireGuard VPN hub on the operator PC.

Field	Type	Default	Description
`enabled` *	boolean	`true`	Enable the WireGuard VPN. Must be true for any remote operations.
`private_key`	string	—	WireGuard private key. Supports inline base64, `$ENV_VAR` reference, or `keyring:` prefix for OS keyring.
`listen_port`	int	`51820`	UDP port for WireGuard to listen on.
`address`	string	`10.99.0.1/24`	VPN address and subnet for the operator (hub).
`dns`	string[]	—	DNS servers pushed to the VPN interface.
`magic_dns`	object	—	MagicDNS configuration for resolving host names via the VPN.

The magic_dns sub-object:

Field	Type	Default	Description
`enabled`	boolean	`true`	Enable MagicDNS name resolution.
`domain`	string	`nefia`	Domain suffix appended to host names (e.g., `pc-office.nefia`).
`upstream`	string[]	`["1.1.1.1:53","8.8.8.8:53"]`	Upstream DNS servers for non-VPN queries.

Additional VPN fields:

Field	Type	Default	Description
`derp_servers`	object[]	—	DERP relay servers for relay-first connectivity. Each entry has `url` (wss:// or ws://) and `region`. Propagated to agents during enrollment.
`turn_servers`	object[]	—	TURN relay servers for NAT traversal fallback. Each entry has `url`, `username`, and `password` (supports `$ENV_VAR` references). See `turn_servers` sub-object below.
`stun_servers`	string[]	—	STUN servers for endpoint discovery. Each entry must be `host:port` format.
`stun_timeout`	duration	`5s`	Timeout for STUN endpoint discovery requests.
`key_rotation_grace_period`	duration	`72h`	How long to accept the previous key after rotation.
`auto_rotate_interval`	duration	—	Automatic key rotation interval (e.g., `720h` for 30 days). Empty disables auto-rotation.
`monitor_interval`	duration	`30s`	Interval for VPN peer health monitoring.
`local_dial_timeout`	duration	`3s`	Timeout for local endpoint probe in hairpin NAT fallback. Uses Go duration format.
`derp_probe_interval`	duration	`5m`	Interval between DERP RTT quality probes. Minimum `30s`. Used to measure relay latency for auto-selection.
`derp_auto_select`	boolean	`true`	Enable automatic DERP relay selection based on RTT probes. When enabled, the client periodically probes configured DERP servers and routes traffic through the lowest-latency relay.
`multipath`	object	—	Multipath active-backup failover configuration. See `multipath` sub-object below.

The derp_servers sub-object:

Field	Type	Description
`url`	string	WebSocket endpoint (must start with `wss://` or `ws://`), e.g., `wss://relay.nefia.ai/derp`.
`region`	string	Cloud region identifier (e.g., `ap-northeast-1`, `us-east-1`). Informational only.

The turn_servers sub-object:

Field	Type	Description
`url`	string	TURN server URL (e.g., `turn:turn.example.com:3478`).
`username`	string	TURN username. Supports `$ENV_VAR` references.
`password`	string	TURN password. Supports `$ENV_VAR` references. Hidden from JSON output.

The multipath sub-object:

Field	Type	Default	Description
`mode`	string	`off`	Multipath behavior: `off` (disabled) or `active-backup` (failover between paths).
`probe_interval_sec`	int	`5`	Interval between quality probes in seconds.
`failover_threshold_ms`	int	`0`	RTT threshold in milliseconds above which a path is considered degraded and failover is evaluated more aggressively. `0` disables RTT-based failover.

yaml

vpn:
  enabled: true
  private_key: $WG_PRIVATE_KEY
  listen_port: 51820
  address: 10.99.0.1/24
  dns:
    - 1.1.1.1
    - 8.8.8.8
  magic_dns:
    enabled: true
    domain: nefia
    upstream:
      - "1.1.1.1:53"
      - "8.8.8.8:53"
  derp_servers:
    - url: "wss://relay.nefia.ai/derp"
      region: "ap-northeast-1"
  turn_servers:
    - url: "turn:turn.example.com:3478"
      username: $TURN_USER
      password: $TURN_PASS
  stun_servers:
    - "stun.l.google.com:19302"
  stun_timeout: "5s"
  key_rotation_grace_period: "72h"
  auto_rotate_interval: ""
  monitor_interval: "30s"
  local_dial_timeout: "3s"
  derp_probe_interval: "5m"
  derp_auto_select: true
  multipath:
    mode: "off"
    probe_interval_sec: 5
    failover_threshold_ms: 0
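
The three accepted forms of `private_key` can be told apart by prefix. A minimal Python sketch follows, assuming the checks run in this order and that the text after `keyring:` is an opaque entry name. `classify_private_key` is a hypothetical helper, not part of Nefia. The 32-byte length check reflects the standard WireGuard key size.

```python
import base64
import binascii


def classify_private_key(value):
    """Return which documented form a private_key value uses."""
    if value.startswith("keyring:"):
        return "keyring"    # looked up in the OS keyring
    if value.startswith("$"):
        return "env"        # resolved from an environment variable
    try:
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        raise ValueError("private_key is not valid base64") from None
    if len(raw) != 32:
        raise ValueError("a WireGuard key must decode to exactly 32 bytes")
    return "inline"
```

Validating the inline form early gives a clear error at load time instead of a failed handshake later.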

auth — Authentication and API settings

Configures connection to the Nefia web dashboard and API.

Field	Type	Default	Description
`api_base_url`	string	`https://www.nefia.ai`	Base URL for the Nefia API endpoint.
`web_base_url`	string	`https://www.nefia.ai`	URL of the Nefia web dashboard.

yaml

auth:
  api_base_url: https://www.nefia.ai
  web_base_url: https://www.nefia.ai

policy — Command and path guardrails

Defines the policy engine rules that restrict what commands can be executed and what paths can be accessed.

Field	Type	Default	Description
`mode`	string	`enforce`	Policy enforcement mode: `off`, `warn`, or `enforce`.
`deny_commands`	string[]	—	Regex patterns for denied commands. Checked before allow rules. Must include `^` or `$` anchor in enforce/warn mode.
`allow_commands`	string[]	—	Regex patterns for allowed commands. If set, only matching commands are permitted.
`deny_paths`	string[]	—	Regex patterns for denied file paths.
`allowed_roots`	string[]	—	Allowed root directories for file operations. Paths outside these roots are rejected.
`deny_operations`	string[]	—	MCP operation types to deny globally (e.g., `exec.sudo`, `fs.remove`).
`sudo_mode`	string	—	Override policy mode for sudo specifically: `off`, `warn`, or `enforce`.
`sudo_allow_commands`	string[]	—	Regex patterns for allowed sudo commands.
`sudo_deny_commands`	string[]	—	Regex patterns for denied sudo commands.
`roles`	object[]	—	RBAC role definitions with per-role command, path, and operation restrictions.

yaml

policy:
  mode: enforce
  deny_commands:
    - "^rm\\s+-rf\\s+/"
    - "^mkfs\\."
    - "^dd\\s+if="
  allow_commands:
    - "^systemctl\\s+(status|restart|reload)\\b"
    - "^docker\\s+(ps|logs|inspect)\\b"
    - "^(cat|head|tail|less|grep)\\b"
  deny_paths:
    - "/etc/shadow"
    - "/root/\\.ssh"
  allowed_roots:
    - /var/www
    - /etc/nginx
    - /home/deploy
  roles:
    - name: viewer
      hosts: ["^log-", "^monitor-"]
      allow_commands: ["^(cat|head|tail|ls)\\b"]
      deny_paths: ["^/etc/"]
      deny_commands: []
      record_sessions: true
      allowed_roots: ["/var/log"]
    - name: deployer
      hosts: ["^web-", "^worker-"]
      allow_commands: [".*"]
      deny_commands: ["^rm\\s+-rf"]
      allowed_roots: ["/var/www", "/etc/nginx"]
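
The rule order described above (deny first, then the allow list, with the mode deciding whether a violation blocks) can be sketched as follows. Python's `re` module stands in for Go's regex engine here, which is adequate for simple patterns like these. Unanchored search semantics and the warn-mode behavior are assumptions, and `check_command` is a hypothetical name.

```python
import re


def check_command(command, mode="enforce", deny_commands=(), allow_commands=()):
    """Return (allowed, reason) for a command under the documented rule order."""
    if mode == "off":
        return True, "policy disabled"
    violation = None
    # Deny rules are checked before allow rules.
    for pattern in deny_commands:
        if re.search(pattern, command):
            violation = f"matched deny rule {pattern!r}"
            break
    # If an allow list is set, only matching commands are permitted.
    if violation is None and allow_commands:
        if not any(re.search(p, command) for p in allow_commands):
            violation = "no allow rule matched"
    if violation is None:
        return True, "ok"
    # Assumption: warn mode reports the violation but lets the command run.
    if mode == "warn":
        return True, "warning: " + violation
    return False, violation
```

Because matching is a substring search in this sketch, an allow pattern without a leading `^` would accept any command that merely contains the pattern, which is why anchoring allow rules is worth considering as well.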

sudo — Sudo privilege escalation

Controls how nefia.exec.sudo and playbook sudo steps execute privileged commands on target hosts.

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable sudo execution. When disabled, `exec.sudo` calls are rejected.
`method`	string	`"nopasswd"`	Sudo method. Currently only `"nopasswd"` is supported (passwordless sudo).
`user`	string	`"root"`	Target user for sudo execution.
`allowed_commands`	string[]	`[]`	Regex patterns for allowed sudo commands. Empty means all commands allowed (subject to policy). Each entry must be a valid Go regex.
`deny_commands`	string[]	`[]`	Regex patterns for denied sudo commands. Evaluated before `allowed_commands`. Each entry must be a valid Go regex.
`require_approval`	boolean	`false`	Require human approval before executing sudo commands. Not yet supported — enabling this will produce a validation error at startup.

yaml

# Example
sudo:
  enabled: true
  method: "nopasswd"
  user: "root"
  allowed_commands:
    - "^apt (update|upgrade)"
    - "^systemctl (restart|reload)"
  deny_commands:
    - "^rm -rf /"
  require_approval: false

audit — Audit logging

Controls the append-only audit log that records every operation.

Field	Type	Default	Description
`enabled`	boolean	`true`	Enable audit logging.
`required`	boolean	`false`	When true, audit write failures are treated as fatal errors — operations are blocked if audit logging fails. Default behavior is warn-and-continue.
`dir`	string	`<config-dir>/audit/`	Directory for audit log files (JSONL format).
`retention_days`	int	`90`	Number of days to retain audit logs before automatic cleanup.
`syslog_addr`	string	—	Remote syslog server address (e.g., `localhost:514`). Empty disables syslog forwarding.
`syslog_proto`	string	`udp`	Syslog protocol: `udp` or `tcp`.

yaml

audit:
  enabled: true
  required: false
  # macOS: ~/Library/Application Support/nefia/audit/
  # Linux: ~/.config/nefia/audit/
  # Windows: %AppData%\nefia\audit\
  dir: ""  # defaults to <config-dir>/audit/
  retention_days: 90
  syslog_addr: "localhost:514"
  syslog_proto: "udp"

SIEM Forwarding

The audit.siem sub-object configures real-time audit event forwarding to external SIEM platforms.

Field	Type	Default	Description
`siem.type`	string	—	SIEM type: `splunk`, `datadog`, or `webhook`.
`siem.endpoint`	string	—	SIEM endpoint URL.
`siem.token_env`	string	—	Environment variable name containing the authentication token.
`siem.webhook_secret_env`	string	—	Environment variable name containing the webhook HMAC secret (for `webhook` type).
`siem.batch_size`	int	`100`	Number of events to batch before flushing.
`siem.flush_interval`	duration	`10s`	Maximum time between flushes.
`siem.source`	string	`nefia`	Event source identifier (Splunk).
`siem.source_type`	string	`nefia:audit`	Splunk source type.
`siem.index`	string	—	Splunk index name.
`siem.service`	string	`nefia`	Datadog service name.
`siem.tags`	string[]	—	Datadog tags to attach to events.

yaml

audit:
  siem:
    type: splunk
    endpoint: https://splunk.example.com:8088/services/collector/event
    token_env: SPLUNK_HEC_TOKEN
    batch_size: 100
    flush_interval: "10s"
    source: nefia
    source_type: "nefia:audit"
    index: security
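
The `batch_size` and `flush_interval` fields describe a flush-on-size-or-age policy. A rough Python sketch of that policy follows, assuming a periodic `tick()` call drives the time-based flush. `EventBatcher` and its `send` callback are hypothetical and say nothing about how Nefia talks to a particular SIEM.

```python
import time


class EventBatcher:
    """Buffers audit events and flushes on size or age, whichever comes first."""

    def __init__(self, send, batch_size=100, flush_interval=10.0,
                 clock=time.monotonic):
        self.send = send                      # callable taking a list of events
        self.batch_size = batch_size          # siem.batch_size
        self.flush_interval = flush_interval  # siem.flush_interval, seconds
        self.clock = clock
        self.buffer = []
        self.last_flush = clock()

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def tick(self):
        """Call periodically; flushes a partial batch once it is old enough."""
        if self.buffer and self.clock() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []
        self.last_flush = self.clock()
```

A larger `batch_size` reduces request volume during bursts, while `flush_interval` bounds how long a lone event can sit in the buffer.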

schedule — Scheduled execution

Configures the built-in scheduler for recurring and deferred operations.

Field	Type	Default	Description
`enabled`	boolean	`true`	Enable the scheduling subsystem.
`sync_interval_sec`	int	`300`	Reserved internal tuning field. The current scheduler loop does not use it directly.
`retention_days`	int	`90`	Number of days to retain schedule execution history.
`max_concurrent_schedules`	int	`5`	Maximum number of scheduled playbooks running simultaneously.

yaml

schedule:
  enabled: true
  sync_interval_sec: 300
  retention_days: 90
  max_concurrent_schedules: 5

device_lock — Cryptographic device verification

Controls device lock verification (Tailnet Lock style). When enabled, only hosts whose WireGuard public key has been signed by the device-lock authority are allowed to connect.

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable cryptographic device verification.
`mode`	string	`log`	Enforcement mode: `log` (log unsigned devices but allow connections) or `enforce` (block unsigned devices).

yaml

device_lock:
  enabled: false
  mode: log

posture — Device posture verification

Controls device posture checks on target hosts. When enabled, hosts must meet the specified security requirements before connections are allowed.

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable device posture verification.
`mode`	string	`off`	Posture enforcement mode: `off` (disabled), `warn` (log non-compliant devices but allow connections), or `enforce` (block non-compliant devices).
`require_firewall`	boolean	`false`	Require that the host firewall is enabled.
`require_disk_encryption`	boolean	`false`	Require that disk encryption (FileVault, BitLocker, LUKS) is enabled.

yaml

posture:
  enabled: true
  mode: warn
  require_firewall: true
  require_disk_encryption: true

ssh_ca — SSH Certificate Authority

Configures the built-in SSH Certificate Authority for issuing short-lived SSH certificates instead of using static keys.

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable the SSH CA workflow. When enabled, Nefia can issue user and host certificates.
`auto_sign`	boolean	`true`	Automatically issue or renew user certificates on connect when a local identity file is available.
`cert_duration`	duration	`24h`	Validity period for user certificates. Uses Go duration format (e.g., `1h`, `24h`).
`host_cert_duration`	duration	`2160h`	Validity period for host certificates. Uses Go duration format. Default is 90 days.
`allowed_user_principals`	string[]	—	Allowlist of principals that can be used in user certificates. Required when `enabled` is `true`. Must be non-empty and contain no duplicates.
`allow_privileged_principals`	boolean	`false`	When true, allows privileged principals (e.g., `root`) in user certificates. By default, privileged principals are rejected.

yaml

ssh_ca:
  enabled: true
  auto_sign: true
  cert_duration: "24h"
  host_cert_duration: "2160h"
  allowed_user_principals:
    - deploy
    - admin
  allow_privileged_principals: false
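
The constraints on `allowed_user_principals` can be expressed as a small validation routine. This Python sketch assumes that `root` is the only privileged principal (the text names it only as an example) and that the check runs at config load. Nefia may instead reject privileged principals at signing time. `validate_ssh_ca` is a hypothetical name.

```python
# Assumption: the set of "privileged" principals; the text only names root.
PRIVILEGED_PRINCIPALS = {"root"}


def validate_ssh_ca(enabled, allowed_user_principals,
                    allow_privileged_principals=False):
    """Return a list of validation errors (empty when the section is valid)."""
    errors = []
    if not enabled:
        return errors    # nothing is required while the CA is disabled
    if not allowed_user_principals:
        errors.append("allowed_user_principals must be non-empty")
    seen = set()
    for principal in allowed_user_principals:
        if principal in seen:
            errors.append(f"duplicate principal: {principal}")
        seen.add(principal)
        if principal in PRIVILEGED_PRINCIPALS and not allow_privileged_principals:
            errors.append(f"privileged principal not allowed: {principal}")
    return errors
```

For the example configuration above, `validate_ssh_ca(True, ["deploy", "admin"])` returns an empty list.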

jit — Just-in-Time access

Configures the Just-in-Time (JIT) temporary access request system. When enabled, operators must request time-limited access to hosts instead of having persistent access.

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable JIT access requests.
`default_duration`	duration	`1h`	Default grant duration for access requests. Uses Go duration format.
`max_duration`	duration	`8h`	Maximum allowed grant duration. Requests exceeding this are rejected. Uses Go duration format.
`require_reason`	boolean	`false`	Require a reason string when requesting access.
`webhook_name`	string	—	Name of the alerting webhook (defined in `alerts.webhooks`) to notify when access is requested.

yaml

jit:
  enabled: true
  default_duration: "1h"
  max_duration: "8h"
  require_reason: true
  webhook_name: slack-ops
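
How the JIT fields interact when a request arrives can be sketched in Python. The table states that requests above `max_duration` are rejected, so the sketch raises an error instead of clamping. The function name and the rejection of non-positive durations are assumptions.

```python
from datetime import timedelta


def resolve_grant(requested=None, reason="", *,
                  default_duration=timedelta(hours=1),
                  max_duration=timedelta(hours=8),
                  require_reason=False):
    """Return the grant duration for a request, or raise ValueError."""
    if require_reason and not reason.strip():
        raise ValueError("a reason is required")
    # Fall back to default_duration when the request names no duration.
    duration = requested if requested is not None else default_duration
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    # Requests above the maximum are rejected, not silently shortened.
    if duration > max_duration:
        raise ValueError("requested duration exceeds max_duration")
    return duration
```

Rejecting oversized requests makes the operator restate what they need, which keeps the audit trail aligned with what was actually granted.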

secrets — Dynamic credential injection

Configures dynamic credential injection from external secret backends. Secrets can be resolved at runtime and injected as environment variables into remote commands.

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable secret resolution and injection.
`providers`	object[]	—	List of secret backend providers. See `providers` sub-object below.
`cache_ttl`	duration	`5m`	How long resolved secrets are cached in memory. Uses Go duration format.
`inject`	map	—	Environment variable mappings. Keys are env var names, values are secret references (e.g., `${vault-prod:secret/data/db#password}`).

The providers sub-object:

Field	Type	Description
`name`	string	Unique identifier for this provider (e.g., `vault-prod`).
`type`	string	Provider type: `vault`, `aws-sm`, `op` (1Password CLI), `env`, or `file`.
`config`	map	Provider-specific configuration (e.g., `{"addr": "https://vault.example.com"}` for Vault).

yaml

secrets:
  enabled: true
  cache_ttl: "5m"
  providers:
    - name: vault-prod
      type: vault
      config:
        addr: "https://vault.example.com"
    - name: env-local
      type: env
  inject:
    DB_PASSWORD: "${vault-prod:secret/data/db#password}"
    API_KEY: "${vault-prod:secret/data/api#key}"
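
Secret references in `inject` follow the shape `${provider:path#key}` in the examples above. A small Python sketch of splitting such a reference follows, assuming all three parts are always present (the source does not say whether `#key` is optional for some provider types). `SecretRef` and `parse_secret_ref` are hypothetical names.

```python
import re
from typing import NamedTuple


class SecretRef(NamedTuple):
    provider: str   # matches a providers[].name entry in the example
    path: str       # backend-specific path to the secret
    key: str        # field within the secret


# Shape inferred from the examples: ${provider:path#key}
_REF = re.compile(r"^\$\{([^:{}]+):([^#{}]+)#([^{}]+)\}$")


def parse_secret_ref(value):
    """Split a secret reference into provider, path and key."""
    match = _REF.match(value)
    if match is None:
        raise ValueError(f"not a secret reference: {value!r}")
    return SecretRef(*match.groups())
```

For instance, `${vault-prod:secret/data/db#password}` splits into the provider `vault-prod`, the path `secret/data/db`, and the key `password`.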

reactor — Event-driven automation

Configures the operator-side event reactor. The reactor listens for events from agents and triggers automated actions based on pattern-matched rules.

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable the event reactor.
`listen_port`	int	`9500`	TCP port for receiving events from agents.
`rules`	object[]	—	List of automation rules. See `rules` sub-object below.

The rules sub-object:

Field	Type	Description
`name`	string	Human-readable name for this rule.
`event_pattern`	string	Regex pattern matched against event types.
`host_pattern`	string	Regex pattern matched against host IDs. Empty matches all hosts.
`severity`	string	Minimum event severity to trigger the rule (e.g., `warning`, `critical`).
`action.type`	string	Action type: `exec` (run command), `playbook` (run playbook), or `alert` (send webhook).
`action.command`	string	Command to execute (for `type: exec`).
`action.playbook_path`	string	Path to playbook file (for `type: playbook`).
`action.webhook_name`	string	Name of alerting webhook (for `type: alert`).

yaml

reactor:
  enabled: true
  listen_port: 9500
  rules:
    - name: restart-on-crash
      event_pattern: "service_crashed"
      host_pattern: "^web-"
      action:
        type: exec
        command: "systemctl restart nginx"
    - name: alert-disk-full
      event_pattern: "disk_usage_critical"
      severity: critical
      action:
        type: alert
        webhook_name: slack-ops
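
Rule matching as described in the rules table can be sketched in Python. The severity ordering is an assumption (only `warning` and `critical` are named above), and the source does not state whether one event may trigger several rules, so `matching_rules` simply returns every rule that matches. Python's `re` stands in for Go's regex engine.

```python
import re

# Assumption: relative ordering of severities, lowest first.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def rule_matches(rule, event_type, host_id, severity):
    """Return True if an event satisfies every condition the rule sets."""
    if not re.search(rule["event_pattern"], event_type):
        return False
    # An empty or missing host_pattern matches all hosts.
    host_pattern = rule.get("host_pattern", "")
    if host_pattern and not re.search(host_pattern, host_id):
        return False
    # severity is a minimum: the event must be at least this severe.
    minimum = rule.get("severity")
    if minimum and SEVERITY_RANK[severity] < SEVERITY_RANK[minimum]:
        return False
    return True


def matching_rules(rules, event_type, host_id, severity):
    """Return the names of all rules that match the event."""
    return [r["name"] for r in rules
            if rule_matches(r, event_type, host_id, severity)]
```

With the two rules from the example above, a `service_crashed` event from `db-01` matches nothing because of the `^web-` host pattern.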

cluster — High availability (Raft)

Configures active-passive high availability via Raft consensus. When enabled, multiple Nefia operator instances form a cluster with automatic leader election and state replication.

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable Raft-based clustering.
`node_id`	string	—	Unique identifier for this node in the cluster.
`bind_addr`	string	—	TCP address for Raft communication (e.g., `0.0.0.0:9700`).
`advertise_addr`	string	—	Address peers use to reach this node. Defaults to `bind_addr` if omitted.
`peers`	object[]	—	Initial cluster members for joining. See `peers` sub-object below.
`data_dir`	string	`<state_dir>/cluster`	Directory for Raft data (logs, snapshots, stable store).
`tls_enabled`	boolean	`false`	Enable TLS for inter-node Raft communication.
`tls_cert_file`	string	—	Path to the TLS certificate file for Raft transport.
`tls_key_file`	string	—	Path to the TLS private key file for Raft transport.
`tls_ca_file`	string	—	Path to the CA certificate file for verifying peer certificates.

The peers sub-object:

Field	Type	Description
`id`	string	Raft server ID of the peer.
`address`	string	Raft TCP address of the peer (e.g., `10.99.0.2:9700`).

yaml

cluster:
  enabled: true
  node_id: node-1
  bind_addr: "0.0.0.0:9700"
  advertise_addr: "10.99.0.1:9700"
  peers:
    - id: node-2
      address: "10.99.0.2:9700"
    - id: node-3
      address: "10.99.0.3:9700"
  data_dir: ""  # defaults to <state_dir>/cluster

alerts — Webhook-based alerting

Configures webhook notifications for operational events. Alerts are dispatched asynchronously and do not block the triggering operation.

Field	Type	Default	Description
`webhooks[].url` *	string	—	Webhook endpoint URL (must be `http://` or `https://`).
`webhooks[].name`	string	—	Optional name for referencing this webhook.
`webhooks[].type`	string	`generic`	Webhook format: `slack`, `discord`, `teams`, `pagerduty`, or `generic` (plain JSON).
`webhooks[].events`	string[]	—	Event types to subscribe to. Empty subscribes to all events.
`webhooks[].cooldown_sec`	int	`300`	Minimum seconds between alerts of the same event type to prevent flooding.
`webhooks[].template.body`	string	—	Go `text/template` for custom webhook payload.
`webhooks[].pagerduty.routing_key`	string	—	PagerDuty Events API v2 routing key (required for `pagerduty` type).
`webhooks[].pagerduty.default_severity`	string	`warning`	Default PagerDuty severity: `info`, `warning`, `error`, or `critical`.

yaml

alerts:
  webhooks:
    - url: https://hooks.slack.com/services/T.../B.../xxx
      type: slack
      events: [exec_failure, vpn_peer_unhealthy]
      cooldown_sec: 300
    - url: https://monitoring.example.com/nefia
      type: generic

Supported Event Types

Event Type	Trigger
`exec_failure`	One or more hosts failed during a command execution.
`vpn_peer_unhealthy`	A VPN peer's handshake is stale (detected by the health monitor).
`circuit_breaker_open`	The SSH circuit breaker opened for a host after consecutive connection failures.
`policy_rebuild_failed`	The policy engine failed to rebuild after a config hot-reload.
`enrollment_complete`	A host finished enrollment successfully.
`host_online`	A previously offline host became reachable again.
`host_offline`	A host transitioned to offline or unhealthy state.
`key_rotation`	WireGuard key rotation completed.
`config_change`	Configuration was saved or reloaded with changes.
`playbook_complete`	A playbook run completed successfully.
`playbook_failed`	A playbook run failed.
`queue_executed`	A queued offline command was delivered successfully.
`queue_failed`	A queued offline command failed.
`host_revoked`	One host's VPN access was revoked.
`host_revoke_all`	Emergency revocation removed all hosts.

Delivery and Retry Behavior

Failed deliveries (HTTP 5xx or network errors) are retried up to 3 times with exponential backoff (1s, 2s, 4s).
Each attempt has a 10-second timeout.
Non-2xx responses below 500 (e.g., 4xx) are logged as warnings but not retried.
The cooldown timer is per event type per webhook. Duplicate alerts of the same type within the cooldown window are silently dropped.
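
The delivery rules above can be sketched in Python. The sketch reads "retried up to 3 times" as one initial attempt plus three retries, which is an interpretation. `send` is a hypothetical callable that returns an HTTP status code or raises `OSError` on a network failure, and the per-attempt 10-second timeout is left to that callable.

```python
import time


def deliver(send, payload, retries=3, base_delay=1.0, sleep=time.sleep):
    """Try to deliver a payload; return True on a 2xx response."""
    for attempt in range(retries + 1):
        try:
            status = send(payload)
        except OSError:
            status = None                     # network error: retriable
        if status is not None and 200 <= status < 300:
            return True
        if status is not None and status < 500:
            return False                      # e.g. 4xx: log a warning, no retry
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s
    return False


class Cooldown:
    """Drops repeat alerts of one event type on one webhook within a window."""

    def __init__(self, cooldown_sec=300, clock=time.monotonic):
        self.cooldown_sec = cooldown_sec
        self.clock = clock
        self.last_sent = {}                   # (webhook, event) -> timestamp

    def should_send(self, webhook, event):
        now = self.clock()
        key = (webhook, event)
        last = self.last_sent.get(key)
        if last is not None and now - last < self.cooldown_sec:
            return False                      # silently dropped
        self.last_sent[key] = now
        return True
```

Keying the cooldown on the webhook and event type together means a burst of `exec_failure` alerts does not suppress an unrelated `host_offline` alert on the same webhook.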

Payload Formats

Generic (type: generic): A JSON object with event, message, details (optional key-value map), and timestamp (RFC 3339).

json

{
  "event": "exec_failure",
  "message": "Exec failed on 2 host(s)",
  "details": { "failed_count": 2 },
  "timestamp": "2026-03-06T12:00:00Z"
}

Slack (type: slack): A Slack Block Kit payload with a header block (event type), a section block (message), and an optional section block (details formatted as Markdown list).

mcp — MCP server settings

Settings for the Model Context Protocol server used by AI agents.

Field	Type	Default	Description
`rate_limit.rate`	float	server default	Maximum requests per second. When unset, the running server falls back to its built-in default.
`rate_limit.burst`	int	server default	Maximum burst size before rate limiting kicks in.
`approval.enabled`	boolean	`false`	Enable the approval workflow. When enabled, the two approval tools (`nefia.approval.list`, `nefia.approval.respond`) are active, and operations that match an approval rule require human approval before execution.
`approval.timeout_sec`	int	`120`	Seconds to wait for user approval before timing out.
`approval.default_action`	string	`deny`	Action when approval times out: `deny` or `allow`.
`approval.rules`	object[]	—	Pattern-based approval rules (see below).
`mtls.enabled`	boolean	`false`	Enable the mTLS gateway for secure MCP connections.
`mtls.listen_addr`	string	`127.0.0.1:19821`	TCP address for the mTLS gateway to listen on.
`mtls.ca_cert_file`	string	—	Path to the CA certificate file for client verification.
`mtls.cert_file`	string	—	Path to the server certificate file.
`mtls.key_file`	string	—	Path to the server private key file.

Client certificate revocation is not configured through nefia.yaml. The mTLS gateway always uses the state-directory revocation store managed by nefia mtls revoke, and newly revoked certificates are rejected on subsequent handshakes without restarting the gateway.

Each approval rule can specify tools (exact match), commands and paths (prefix match), and hosts (exact match). The first matching rule wins, and require_approval: false can be used as an explicit exemption rule. When approval is disabled, the approval tools are still advertised with (not configured) descriptions for discoverability.

yaml

mcp:
  rate_limit:
    rate: 1.0
    burst: 10
  approval:
    enabled: false
    timeout_sec: 120
    default_action: deny
    rules: []
  mtls:
    enabled: false
    listen_addr: "127.0.0.1:19821"
    ca_cert_file: ""
    cert_file: ""
    key_file: ""
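
The approval rule semantics (exact match on tools and hosts, prefix match on commands and paths, first match wins) can be sketched in Python. The sketch assumes that a rule matches only when every field it sets matches, that omitted fields match anything, and that `require_approval` defaults to true. The dictionary layout and function names are hypothetical.

```python
def _rule_matches(rule, tool, command="", path="", host=""):
    # Assumption: all fields a rule sets must match; omitted fields match anything.
    if "tools" in rule and tool not in rule["tools"]:
        return False                                    # exact match
    if "commands" in rule and not any(
            command.startswith(p) for p in rule["commands"]):
        return False                                    # prefix match
    if "paths" in rule and not any(
            path.startswith(p) for p in rule["paths"]):
        return False                                    # prefix match
    if "hosts" in rule and host not in rule["hosts"]:
        return False                                    # exact match
    return True


def needs_approval(rules, tool, command="", path="", host=""):
    """First matching rule wins; no match means no approval is required."""
    for rule in rules:
        if _rule_matches(rule, tool, command, path, host):
            return rule.get("require_approval", True)
    return False
```

Because the first match wins, an exemption rule with `require_approval: false` has to appear before the broader rule it carves an exception from.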

team — Team context

Configures the active team for multi-tenant operations.

Field	Type	Default	Description
`active_team_id`	string	—	ID of the currently active team. Set by `nefia team use`.
`active_team_slug`	string	—	Human-readable slug of the active team (e.g., `my-team`).

yaml

team:
  active_team_id: "tm_abc123"
  active_team_slug: "my-team"

telemetry — Tracing and metrics

Configures OpenTelemetry tracing and Prometheus metrics export. Both subsystems are disabled by default and add zero overhead when disabled (a noop TracerProvider is installed).

Field	Type	Default	Description
`enabled`	boolean	`false`	Enable OpenTelemetry trace export.
`endpoint`	string	`localhost:4318`	OTLP HTTP endpoint for trace collection.
`service_name`	string	`nefia`	Service name reported in traces.
`metrics_enabled`	boolean	`false`	Enable a Prometheus-compatible metrics endpoint.
`metrics_port`	int	`9090`	TCP port for the Prometheus metrics endpoint.
`insecure_http`	boolean	`false`	Allow plaintext (non-TLS) OTLP HTTP connections for non-loopback collectors. Only needed when the collector is on a different host and does not support TLS.

yaml

telemetry:
  enabled: false
  endpoint: "localhost:4318"
  service_name: nefia
  insecure_http: false
  metrics_enabled: false
  metrics_port: 9090

OpenTelemetry Tracing

When enabled is true, Nefia exports distributed traces via OTLP HTTP to the configured endpoint. Traces are batched asynchronously and flushed on shutdown. The TraceHandler automatically injects trace_id and span_id into every slog log record when a span context is active, enabling log-trace correlation.

If the OTLP exporter fails to initialise (e.g., the endpoint is unreachable), Nefia silently falls back to a noop tracer instead of failing startup.

Prometheus Metrics

When metrics_enabled is true, Nefia starts an HTTP server on the configured metrics_port with the following endpoints:

Endpoint	Description
`/metrics`	Prometheus-compatible metrics scrape endpoint.
`/healthz`	Liveness probe. Returns `{"status":"healthy"}` (200) or `{"status":"unhealthy"}` (503).
`/readyz`	Readiness probe. Returns `{"status":"ready"}` (200) or `{"status":"not_ready"}` (503).
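
A monitoring script could interpret the two probe endpoints roughly as follows. This is a minimal sketch: the helper names `interpret_probe` and `check` are hypothetical, and the base URL passed to `check` (for example `http://localhost:9090`) assumes the metrics server is reachable on the local machine, which the reference does not state.

```python
import json
import urllib.error
import urllib.request


def interpret_probe(status_code: int, body: str, healthy_value: str) -> bool:
    """Return True only for HTTP 200 with the expected "status" value."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return (
        status_code == 200
        and isinstance(payload, dict)
        and payload.get("status") == healthy_value
    )


def check(base_url: str, path: str, healthy_value: str, timeout: float = 2.0) -> bool:
    """Fetch a probe endpoint and interpret the response."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return interpret_probe(resp.status, resp.read().decode(), healthy_value)
    except urllib.error.HTTPError as err:
        # A 503 is raised as HTTPError; its body still carries the status JSON.
        return interpret_probe(err.code, err.read().decode(), healthy_value)
    except OSError:
        return False  # connection refused, timeout, or DNS failure
```

For example, `check("http://localhost:9090", "/healthz", "healthy")` and `check("http://localhost:9090", "/readyz", "ready")` would cover the liveness and readiness probes respectively.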

Exported Metrics

Nefia exports both expvar counters (always available at Go's debug endpoint) and OTel instruments (exported to Prometheus when enabled):

expvar Counter	OTel Instrument	Description
`nefia_exec_total`	`nefia.exec.total`	Total command execution operations.
`nefia_exec_success`	—	Successful executions.
`nefia_exec_fail`	—	Failed executions.
`nefia_conn_dial_total`	`nefia.conn.dial.total`	Total SSH dial attempts.
`nefia_conn_dial_fail`	—	Failed SSH dial attempts.
`nefia_conn_healthcheck_fail`	—	Failed connection health checks.
`nefia_conn_pool_size`	—	Current SSH connection pool size.
`nefia_session_open_total`	—	Total session open operations.
`nefia_session_gc_removed`	—	Sessions removed by garbage collection.
`nefia_session_gc_runs`	—	Number of GC cycles executed.
`nefia_fs_read_total`	`nefia.fs.total` (op=read)	File read operations.
`nefia_fs_write_total`	`nefia.fs.total` (op=write)	File write operations.
`nefia_fs_patch_total`	`nefia.fs.total` (op=patch)	File patch operations.
`nefia_fs_list_total`	`nefia.fs.total` (op=list)	Directory list operations.
`nefia_fs_stat_total`	`nefia.fs.total` (op=stat)	File stat operations.
`nefia_vpn_peer_unhealthy`	—	Unhealthy VPN peer detections.
`nefia_playbook_run_total`	`nefia.playbook.total`	Total playbook run operations.

OTel histograms record operation duration in seconds with bucket boundaries from 1ms to 10s:

Histogram	Description
`nefia.exec.duration`	Command execution duration.
`nefia.conn.dial.duration`	SSH dial duration.
`nefia.fs.duration`	Filesystem operation duration (labeled by `op`).
`nefia.playbook.duration`	Playbook run duration.

All OTel instruments include an ok boolean attribute for success/failure breakdowns.

hosts — Target PC definitions

Each host represents a target PC enrolled via nefia vpn invite. Hosts are defined as an array of objects.

Field	Type	Default	Description
`id` *	string	—	Unique host identifier (name). Must match `^[a-zA-Z0-9][a-zA-Z0-9._-]*$` and be at most 128 characters.
`address` *	string	—	VPN IP address of the host.
`os` *	string	—	Operating system: `macos`, `linux`, or `windows`.
`user` *	string	—	SSH username for connecting to this host. Falls back to `$USER` (or `$USERNAME` on Windows). Connection fails with an actionable error if all fallbacks are empty.
`root`	string	`/`	Default root directory for file operations.
`shell`	string	—	Shell override for this host (e.g., `/bin/bash`). Overrides the OS-level default from `defaults.shell`.
`role`	string	—	RBAC role name assigned to this host. Must match a role defined in `policy.roles`.
`tags`	map	—	Key-value tags for targeting and group membership.
`vpn.public_key` *	string	—	WireGuard public key for this peer.
`vpn.endpoint`	string	—	WireGuard endpoint (ip:port) if the host has a public address.
`vpn.local_endpoint`	string	—	LAN endpoint (ip:port) for hairpin NAT fallback. Discovered automatically during enrollment.
`vpn.vpn_addr`	string	—	Peer's VPN IP address (e.g., `10.99.0.2`).
`vpn.status`	string	—	Current VPN status: `active`, `pending`, or empty.

yaml

hosts:
  - id: prod-web-1
    address: 10.99.0.2
    os: linux
    user: deploy
    root: /var/www
    tags:
      env: production
      role: web
    vpn:
      public_key: xYz1...aBcD=
      endpoint: 203.0.113.10:51820
      local_endpoint: 192.168.1.50:51820
      vpn_addr: 10.99.0.2
      status: active
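
The `id` constraint from the table above (pattern plus a 128-character limit) can be checked before a host is added. The sketch below reuses the documented pattern verbatim; the function name `is_valid_host_id` is illustrative and not part of Nefia.

```python
import re

# Pattern and length limit as documented for hosts[].id.
HOST_ID_RE = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9._-]*$")
MAX_HOST_ID_LEN = 128


def is_valid_host_id(host_id: str) -> bool:
    """Return True if host_id satisfies the documented pattern and length limit."""
    # fullmatch avoids Python's "$" also matching just before a trailing newline.
    return len(host_id) <= MAX_HOST_ID_LEN and HOST_ID_RE.fullmatch(host_id) is not None


print(is_valid_host_id("prod-web-1"))  # True
print(is_valid_host_id("-bad"))        # False: must start with a letter or digit
print(is_valid_host_id("a" * 129))     # False: longer than 128 characters
```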

groups — Host group definitions

Groups provide named selectors based on tag matching.

yaml

groups:
  - name: webservers
    match:
      tags:
        role: web
  - name: production
    match:
      tags:
        env: production
  - name: staging
    match:
      tags:
        env: staging
  - name: all-linux
    match:
      tags:
        os: linux

Use groups in target selectors:

bash

nefia exec --target group:webservers -- systemctl status nginx
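
As a rough sketch of how tag-based selection behaves, the following Python function resolves a group name to host IDs. It assumes that every tag listed under match.tags must equal the host's tag of the same key (AND semantics), which this reference does not state explicitly. `resolve_group` and the hosts `stage-web-1` and `prod-db-1` are illustrative names only.

```python
def resolve_group(hosts, groups, name):
    """Return the IDs of hosts whose tags satisfy the named group's selector."""
    for group in groups:
        if group["name"] == name:
            wanted = group.get("match", {}).get("tags", {})
            break
    else:
        raise KeyError(f"unknown group: {name}")
    # Assumption: a host matches only if it carries every wanted tag with the same value.
    return [
        host["id"]
        for host in hosts
        if all(host.get("tags", {}).get(key) == value for key, value in wanted.items())
    ]


hosts = [
    {"id": "prod-web-1", "tags": {"env": "production", "role": "web"}},
    {"id": "stage-web-1", "tags": {"env": "staging", "role": "web"}},
    {"id": "prod-db-1", "tags": {"env": "production", "role": "db"}},
]
groups = [
    {"name": "webservers", "match": {"tags": {"role": "web"}}},
    {"name": "production", "match": {"tags": {"env": "production"}}},
]

print(resolve_group(hosts, groups, "webservers"))  # ['prod-web-1', 'stage-web-1']
print(resolve_group(hosts, groups, "production"))  # ['prod-web-1', 'prod-db-1']
```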

Complete Example

A fully populated nefia.yaml combining all sections:

yaml

version: 1
session_ttl_minutes: 1440
session_idle_timeout_minutes: 30
session_gc_interval_minutes: 5
host_sync_interval: "5m"
 
defaults:
  concurrency: 50
  timeout: 30m
  output: human
  max_output_bytes: 1048576
  shell:
    macos: /bin/zsh
    linux: /bin/bash
    windows: powershell.exe
  artifact_retention_days: 7
  progress:
    enabled: true
    interval: "30s"
  notifications:
    enabled: false
    min_duration: "30s"
 
ssh:
  connect_timeout_sec: 5
  known_hosts: ~/.ssh/known_hosts
  identities:
    - ~/.ssh/id_ed25519
  agent: false
  max_file_size_bytes: 3221225472
  max_concurrent_fs: 20
  max_pool_size: 100
  connection_ttl: "1h"
  idle_timeout: "10m"
  max_retries: 3
  initial_backoff: "1s"
  max_backoff: "60s"
  circuit_breaker_threshold: 5
  circuit_breaker_reset_timeout: "60s"
  keepalive_timeout: "5s"
 
vpn:
  enabled: true
  private_key: $WG_PRIVATE_KEY
  listen_port: 51820
  address: 10.99.0.1/24
  dns: [1.1.1.1]
  magic_dns:
    enabled: true
    domain: nefia
    upstream: ["1.1.1.1:53", "8.8.8.8:53"]
  derp_servers:
    - url: "wss://relay.nefia.ai/derp"
      region: "ap-northeast-1"
  turn_servers:
    - url: "turn:turn.example.com:3478"
      username: $TURN_USER
      password: $TURN_PASS
  stun_servers:
    - "stun.l.google.com:19302"
  stun_timeout: "5s"
  key_rotation_grace_period: "72h"
  monitor_interval: "30s"
  local_dial_timeout: "3s"
  derp_probe_interval: "5m"
  derp_auto_select: true
  multipath:
    mode: "off"
    probe_interval_sec: 5
    failover_threshold_ms: 0
 
auth:
  api_base_url: https://www.nefia.ai
  web_base_url: https://www.nefia.ai
 
policy:
  mode: enforce
  deny_commands:
    - "^rm\\s+-rf\\s+/"
  allow_commands:
    - "systemctl\\s+(status|restart|reload)"
  deny_paths:
    - "/etc/shadow"
  allowed_roots:
    - /var/www
    - /etc/nginx
 
audit:
  enabled: true
  required: false
  retention_days: 90
  syslog_addr: "localhost:514"
  syslog_proto: "udp"
 
sudo:
  enabled: false
  method: "nopasswd"
  user: "root"
  allowed_commands: []
  deny_commands: []
  require_approval: false
 
schedule:
  enabled: true
  sync_interval_sec: 300
  retention_days: 90
  max_concurrent_schedules: 5
 
device_lock:
  enabled: false
  mode: log
 
posture:
  enabled: true
  mode: warn
  require_firewall: true
  require_disk_encryption: true
 
ssh_ca:
  enabled: true
  auto_sign: true
  cert_duration: "24h"
  host_cert_duration: "2160h"
  allowed_user_principals:
    - deploy
    - admin
  allow_privileged_principals: false
 
jit:
  enabled: false
  default_duration: "1h"
  max_duration: "8h"
  require_reason: true
  webhook_name: ""
 
secrets:
  enabled: false
  cache_ttl: "5m"
  providers:
    - name: vault-prod
      type: vault
      config:
        addr: "https://vault.example.com"
  inject:
    DB_PASSWORD: "${vault-prod:secret/data/db#password}"
 
reactor:
  enabled: false
  listen_port: 9500
  rules: []
 
cluster:
  enabled: false
  node_id: node-1
  bind_addr: "0.0.0.0:9700"
  advertise_addr: ""
  peers: []
  data_dir: ""
 
mcp:
  rate_limit:
    rate: 1.0
    burst: 10
  approval:
    enabled: false
    timeout_sec: 120
    default_action: deny
  mtls:
    enabled: false
    listen_addr: "127.0.0.1:19821"
    ca_cert_file: ""
    cert_file: ""
    key_file: ""
 
alerts:
  webhooks: []
 
team:
  active_team_id: ""
  active_team_slug: ""
 
telemetry:
  enabled: false
  endpoint: "localhost:4318"
  service_name: nefia
  insecure_http: false
  metrics_enabled: false
  metrics_port: 9090
 
hosts:
  - id: prod-web-1
    address: 10.99.0.2
    os: linux
    user: deploy
    root: /var/www
    shell: /bin/bash
    role: deployer
    tags:
      env: production
      role: web
    vpn:
      public_key: xYz1...aBcD=
      endpoint: 203.0.113.10:51820
      vpn_addr: 10.99.0.2
      status: active
 
groups:
  - name: webservers
    match:
      tags:
        role: web
  - name: production
    match:
      tags:
        env: production

Validating Your Configuration

Run the built-in validator to catch errors:

bash

nefia validate

nefia validate

Configuration: /Users/admin/Library/Application Support/nefia/nefia.yaml
Hosts: 5 defined, 4 reachable
Groups: 4 defined, all valid
Policy: 12 rules loaded, no conflicts
VPN: enabled, key present
Result: OK

Config Validation Rules

The following validation rules are applied at startup and when running nefia validate:

Field	Rule	Error
`vpn.derp_servers[].url`	Must not be empty	`vpn.derp_servers[N].url must not be empty`
`vpn.derp_servers[].url`	Must start with `wss://` or `ws://`	`vpn.derp_servers[N].url must start with wss:// or ws://`
`vpn.derp_servers[].url`	Must be a valid URL	`vpn.derp_servers[N].url is not a valid URL`
`vpn.stun_servers[]`	Must not be empty	`vpn.stun_servers[N] must not be empty`
`vpn.stun_servers[]`	Must be `host:port` format	`vpn.stun_servers[N] must be host:port`
`hosts[].vpn.vpn_addr`	Must be unique across all hosts	Duplicate VPN address error
`hosts[].vpn.vpn_addr`	Must not overlap with operator address	Operator address collision error
`ssh_ca.allowed_user_principals`	Must be non-empty when `ssh_ca.enabled` is `true`	`ssh_ca.allowed_user_principals must not be empty when ssh_ca is enabled`
`ssh_ca.allowed_user_principals`	Must contain no duplicates	`ssh_ca.allowed_user_principals contains duplicate entry`
`sudo.allowed_commands[]`	Must be valid regex patterns	`sudo.allowed_commands[N] is not a valid regex`
`sudo.deny_commands[]`	Must be valid regex patterns	`sudo.deny_commands[N] is not a valid regex`
`sudo.require_approval`	Must be `false` (not yet supported)	`sudo.require_approval is not yet supported`
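
To make the DERP and STUN rules concrete, here is a minimal Python sketch that mirrors them and returns the documented error strings. It is not Nefia's validator: the function names are hypothetical, substituting the list index for `N` is an assumption, and the "valid URL" and port-range checks are simplified stand-ins for whatever the real implementation does.

```python
from urllib.parse import urlparse


def validate_derp_url(index, url):
    """Return an error message for vpn.derp_servers[index].url, or None if it passes."""
    prefix = f"vpn.derp_servers[{index}].url"
    if not url:
        return f"{prefix} must not be empty"
    if not url.startswith(("wss://", "ws://")):
        return f"{prefix} must start with wss:// or ws://"
    try:
        has_host = bool(urlparse(url).hostname)
    except ValueError:  # e.g. an unbalanced IPv6 bracket
        has_host = False
    if not has_host:
        return f"{prefix} is not a valid URL"
    return None


def validate_stun_server(index, server):
    """Return an error message for vpn.stun_servers[index], or None if it passes."""
    prefix = f"vpn.stun_servers[{index}]"
    if not server:
        return f"{prefix} must not be empty"
    host, sep, port = server.rpartition(":")
    port_ok = port.isascii() and port.isdigit() and 0 < int(port) <= 65535
    if not sep or not host or not port_ok:
        return f"{prefix} must be host:port"
    return None


print(validate_derp_url(0, "wss://relay.nefia.ai/derp"))      # None
print(validate_derp_url(1, "https://relay.example.com"))      # vpn.derp_servers[1].url must start with wss:// or ws://
print(validate_stun_server(0, "stun.l.google.com:19302"))     # None
print(validate_stun_server(2, "stun.example.com"))            # vpn.stun_servers[2] must be host:port
```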

File-Based Config Locking

When multiple nefia processes modify the configuration concurrently (e.g., parallel vpn invite and vpn reinvite commands), a file-based lock prevents race conditions:

An exclusive lock file (nefia.yaml.lock) is acquired before any config modification
On Unix (macOS/Linux): non-blocking flock with retry (up to 30 attempts, 1 second apart, max 30-second wait)
On Windows: LockFileEx with exclusive non-blocking lock and the same retry strategy
The lock is released after the config file is saved

This prevents VPN address collisions and config corruption when enrollment operations run in parallel.
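
The Unix locking strategy above (non-blocking flock, retried up to 30 times, 1 second apart) can be sketched in Python with the standard `fcntl` module. This is an illustration under assumptions, not Nefia's code: the `config_lock` helper is a hypothetical name, the lock file is taken to sit next to the config file as `<config path>.lock`, and the Windows `LockFileEx` path is not covered.

```python
import fcntl
import os
import time
from contextlib import contextmanager


@contextmanager
def config_lock(config_path, attempts=30, delay=1.0):
    """Hold an exclusive advisory lock on <config_path>.lock (Unix only)."""
    fd = os.open(config_path + ".lock", os.O_CREAT | os.O_RDWR, 0o600)
    try:
        for _ in range(attempts):
            try:
                # LOCK_NB makes flock fail immediately instead of blocking.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                time.sleep(delay)  # another process holds the lock; retry later
        else:
            raise TimeoutError(f"could not lock {config_path} after {attempts} attempts")
        yield
    finally:
        # Closing the descriptor releases the flock, including on error paths.
        os.close(fd)
```

A caller would wrap the whole read-modify-write cycle, for example `with config_lock(path): ...`, so that two enrollment commands cannot both pick the same free VPN address.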

Stale Lock Recovery

On Unix, the operating system automatically releases flock locks when the owning process exits (even on crash), so stale locks are not possible under normal circumstances.

On Windows, the lock file stores the PID of the owning process. When a lock acquisition fails, Nefia reads the PID from the lock file and checks whether that process is still alive using OpenProcess and GetExitCodeProcess. If the owning process has exited (crash or ungraceful termination), Nefia automatically removes the stale lock file and re-acquires the lock. This prevents a crashed process from holding the lock indefinitely.

Agent-Side Configuration

The agent (nefia-agent) uses a separate agent.yaml configuration file. See the Agent Reference for the full schema. Key extensions for NAT traversal and cloud communication:

Field	Type	Description
`agent_token`	string	API token for cloud communication. Supports `keyring:agent-token` for OS keyring storage.
`agent_token_expires_at`	string	RFC 3339 expiry time for the agent token.
`host_id`	string	Host ID for endpoint reporting.
`stun_servers`	string[]	Custom STUN servers (overrides defaults). Format: `host:port`.
`endpoint_refresh`	duration	STUN refresh interval (e.g., `5m`). An empty value disables the refresh.
`cloud_api_base`	string	Cloud API base URL. Must be `https://`.
`derp_servers`	DERPServer[]	DERP relay servers (propagated from operator during enrollment).
