Configuration

Complete reference for the nefia.yaml configuration file format and all available options.

Nefia stores its configuration in a single YAML file. This page documents every section and field available in nefia.yaml.

File Location

plaintext
# macOS
~/Library/Application Support/nefia/nefia.yaml
# Linux
~/.config/nefia/nefia.yaml
# Windows
%AppData%\nefia\nefia.yaml

Override the default path with the --config flag on any command:

bash
nefia --config /path/to/custom.yaml hosts list

Creating the Config File

The configuration file is created automatically when you run nefia setup (or its alias nefia init) or nefia vpn invite. You can also create it manually.

bash
nefia setup

Configuration created at /Users/admin/Library/Application Support/nefia/nefia.yaml

VPN: enabled (keypair generated)
Address: 10.99.0.1/24
Audit: enabled (/Users/admin/Library/Application Support/nefia/audit/)

Schema Version

Every configuration file begins with a version field. The current schema version is 1.

yaml
version: 1

Top-Level Settings

In addition to the version field, the following top-level settings control session lifecycle and host synchronization:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| session_ttl_minutes | int | 1440 (24h) | Maximum session lifetime in minutes. Sessions older than this are automatically closed. 0 uses the default (24 hours). |
| session_idle_timeout_minutes | int | 30 | Session idle timeout in minutes. Sessions with no activity for this duration are automatically closed based on LastUsedAt. 0 uses the default (30 minutes). |
| session_gc_interval_minutes | int | 5 | Interval in minutes for the session garbage collector to scan and remove expired sessions. 0 uses the default (5 minutes). |
| host_sync_interval | duration | 5m | How often to synchronize host state from the web dashboard. Uses Go duration format (e.g., 5m, 1h). |

yaml
version: 1
session_ttl_minutes: 1440
session_idle_timeout_minutes: 30
session_gc_interval_minutes: 5
host_sync_interval: "5m"

Section Reference

defaults — Global execution defaults

Controls default behavior for command execution across all hosts.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| concurrency | int | 50 | Maximum number of hosts to operate on in parallel. |
| timeout | duration | 30m | Default timeout for remote operations. Supports units: s, m, h. |
| output | string | human | Output format: human, json, jsonl, yaml, or compact. |
| max_output_bytes | int | 1048576 | Maximum output captured per host (1 MB). Output beyond this limit is truncated with an [output truncated] warning on stderr. The in-memory snapshot buffer is capped at the larger of 50 MB or max_output_bytes * 2. |
| shell | map | | Default shell per OS. Keys are OS names (macos, linux, windows), values are shell paths (e.g., {"linux": "/bin/bash", "macos": "/bin/zsh"}). Per-host shell overrides this. |
| artifact_retention_days | int | 7 | Number of days to retain command output artifacts before automatic cleanup. |
| progress.enabled | boolean | true | Enable progress reporting during long-running operations. |
| progress.interval | duration | 30s | Interval for progress reporting. |
| retry.vpn_recovery_enabled | boolean | true | Enable automatic retry while waiting for transient VPN recovery. The global --retry-timeout flag can enable the behavior for a single command. |
| retry.vpn_recovery_timeout | duration | 30s | Default maximum wait time for VPN recovery. Overridden by the global --retry-timeout flag. |
| recording.enabled | boolean | false | Enable session recording by default when sessions are opened. |
| recording.retention_days | int | 90 | Number of days to retain recorded sessions. |

yaml
defaults:
  concurrency: 50
  timeout: 30m
  output: human
  max_output_bytes: 1048576
  shell:
    macos: /bin/zsh
    linux: /bin/bash
    windows: powershell.exe
  artifact_retention_days: 7
  progress:
    enabled: true
    interval: "30s"
  retry:
    vpn_recovery_enabled: false
    vpn_recovery_timeout: "30s"
  recording:
    enabled: false
    retention_days: 90

defaults.retry — Automatic VPN recovery retry

Controls retry behavior for transient VPN recovery during CLI operations.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| vpn_recovery_enabled | boolean | true | Enable retry for retriable VPN failures. |
| vpn_recovery_timeout | duration | 30s | Maximum time to wait for VPN recovery before failing the command. |

yaml
defaults:
  retry:
    vpn_recovery_enabled: false
    vpn_recovery_timeout: "30s"

defaults.recording — Session recording defaults

Controls whether newly opened sessions are recorded by default.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable session recording by default when a session is opened. |
| retention_days | int | 90 | Number of days to retain recording files. |

yaml
defaults:
  recording:
    enabled: false
    retention_days: 90

defaults.notifications — Desktop notification settings

Controls desktop notifications for completed operations.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable automatic desktop notifications for long-running operations. |
| min_duration | duration | 30s | Minimum operation duration before a notification is sent. Only applies when enabled is true. |

yaml
defaults:
  notifications:
    enabled: true
    min_duration: "30s"

When enabled, operations exceeding min_duration automatically trigger a desktop notification. You can also use the --notify flag on any command to send a notification regardless of this setting.

ssh — SSH transport settings

Configures SSH connections established inside VPN tunnels.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| connect_timeout_sec | int | 5 | Seconds to wait for an SSH connection to establish. |
| known_hosts | string | ~/.ssh/known_hosts | Path to the SSH known_hosts file for host key verification (on Windows: %USERPROFILE%\.ssh\known_hosts). |
| identities | string[] | | List of SSH private key file paths to offer during authentication. |
| agent | boolean | false | Use the system SSH agent for key authentication. |
| max_file_size_bytes | int | 3221225472 | Maximum file size for SFTP transfers (3 GB). |
| max_concurrent_fs | int | 20 | Maximum concurrent file system operations per host. |
| sftp_timeout_sec | int | 300 | Timeout in seconds for individual SFTP operations (read, write, patch, list, stat). The transport layer also applies a 30-minute hard safety timeout as a last resort. |
| max_pool_size | int | 100 | Maximum number of pooled SSH connections across all hosts. |
| max_concurrent_dial | int | 20 | Maximum number of concurrent SSH dial attempts. |
| max_retries | int | 3 | Maximum number of retry attempts for failed SSH connections. |
| initial_backoff | duration | 1s | Initial backoff duration between retry attempts. Uses Go duration format (e.g., 1s, 500ms). |
| max_backoff | duration | 60s | Maximum backoff duration between retry attempts. Uses Go duration format. |
| connection_ttl | duration | 1h | Maximum lifetime of a pooled SSH connection before it is closed and re-established. Uses Go duration format. |
| idle_timeout | duration | 10m | Time after which an idle pooled connection is closed. Uses Go duration format. |
| circuit_breaker_threshold | int | 5 | Number of consecutive connection failures before the circuit breaker opens for a host. |
| circuit_breaker_reset_timeout | duration | 60s | Time to wait before attempting to reconnect to a host after the circuit breaker opens. Uses Go duration format. |
| keepalive_timeout | duration | 5s | Timeout for SSH keepalive health-check probes on pooled connections. Uses Go duration format. |

yaml
ssh:
  connect_timeout_sec: 5
  known_hosts: ~/.ssh/known_hosts
  identities:
    - ~/.ssh/id_ed25519    # Windows: %USERPROFILE%\.ssh\id_ed25519
    - ~/.ssh/id_rsa
  agent: false
  max_file_size_bytes: 3221225472
  max_concurrent_fs: 20
  # Connection pool tuning
  max_pool_size: 100
  connection_ttl: "1h"
  idle_timeout: "10m"
  # Retry / backoff
  max_retries: 3
  initial_backoff: "1s"
  max_backoff: "60s"
  # Circuit breaker
  circuit_breaker_threshold: 5
  circuit_breaker_reset_timeout: "60s"
  keepalive_timeout: "5s"

vpn — WireGuard VPN configuration

Controls the WireGuard VPN hub on the operator PC.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled * | boolean | true | Enable the WireGuard VPN. Must be true for any remote operations. |
| private_key | string | | WireGuard private key. Supports inline base64, $ENV_VAR reference, or keyring: prefix for OS keyring. |
| listen_port | int | 51820 | UDP port for WireGuard to listen on. |
| address | string | 10.99.0.1/24 | VPN address and subnet for the operator (hub). |
| dns | string[] | | DNS servers pushed to the VPN interface. |
| magic_dns | object | | MagicDNS configuration for resolving host names via the VPN. |

The magic_dns sub-object:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | true | Enable MagicDNS name resolution. |
| domain | string | nefia | Domain suffix appended to host names (e.g., pc-office.nefia). |
| upstream | string[] | ["1.1.1.1:53","8.8.8.8:53"] | Upstream DNS servers for non-VPN queries. |

Additional VPN fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| derp_servers | object[] | | DERP relay servers for relay-first connectivity. Each entry has url (wss:// or ws://) and region. Propagated to agents during enrollment. |
| turn_servers | object[] | | TURN relay servers for NAT traversal fallback. Each entry has url, username, and password (supports $ENV_VAR references). See turn_servers sub-object below. |
| stun_servers | string[] | | STUN servers for endpoint discovery. Each entry must be in host:port format. |
| stun_timeout | duration | 5s | Timeout for STUN endpoint discovery requests. |
| key_rotation_grace_period | duration | 72h | How long to accept the previous key after rotation. |
| auto_rotate_interval | duration | | Automatic key rotation interval (e.g., 720h for 30 days). Empty disables auto-rotation. |
| monitor_interval | duration | 30s | Interval for VPN peer health monitoring. |
| local_dial_timeout | duration | 3s | Timeout for local endpoint probe in hairpin NAT fallback. Uses Go duration format. |
| derp_probe_interval | duration | 5m | Interval between DERP RTT quality probes. Minimum 30s. Used to measure relay latency for auto-selection. |
| derp_auto_select | boolean | true | Enable automatic DERP relay selection based on RTT probes. When enabled, the client periodically probes configured DERP servers and routes traffic through the lowest-latency relay. |
| multipath | object | | Multipath active-backup failover configuration. See multipath sub-object below. |

The derp_servers sub-object:

| Field | Type | Description |
| --- | --- | --- |
| url | string | WebSocket endpoint (must start with wss:// or ws://), e.g., wss://relay.nefia.ai/derp. |
| region | string | Cloud region identifier (e.g., ap-northeast-1, us-east-1). Informational only. |

The turn_servers sub-object:

| Field | Type | Description |
| --- | --- | --- |
| url | string | TURN server URL (e.g., turn:turn.example.com:3478). |
| username | string | TURN username. Supports $ENV_VAR references. |
| password | string | TURN password. Supports $ENV_VAR references. Hidden from JSON output. |

The multipath sub-object:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| mode | string | off | Multipath behavior: off (disabled) or active-backup (failover between paths). |
| probe_interval_sec | int | 5 | Interval between quality probes in seconds. |
| failover_threshold_ms | int | 0 | RTT threshold in milliseconds above which a path is considered degraded and failover is evaluated more aggressively. 0 disables RTT-based failover. |

yaml
vpn:
  enabled: true
  private_key: $WG_PRIVATE_KEY
  listen_port: 51820
  address: 10.99.0.1/24
  dns:
    - 1.1.1.1
    - 8.8.8.8
  magic_dns:
    enabled: true
    domain: nefia
    upstream:
      - "1.1.1.1:53"
      - "8.8.8.8:53"
  derp_servers:
    - url: "wss://relay.nefia.ai/derp"
      region: "ap-northeast-1"
  turn_servers:
    - url: "turn:turn.example.com:3478"
      username: $TURN_USER
      password: $TURN_PASS
  stun_servers:
    - "stun.l.google.com:19302"
  stun_timeout: "5s"
  key_rotation_grace_period: "72h"
  auto_rotate_interval: ""
  monitor_interval: "30s"
  local_dial_timeout: "3s"
  derp_probe_interval: "5m"
  derp_auto_select: true
  multipath:
    mode: "off"
    probe_interval_sec: 5
    failover_threshold_ms: 0
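
The example above keeps key rotation and multipath at their defaults. As a hedged variation, the sketch below turns on automatic key rotation and active-backup failover using only fields from the tables in this section. The 150 ms threshold is an assumed, illustrative value rather than a recommendation.

```yaml
# Sketch only: values are illustrative assumptions, not recommended settings.
vpn:
  enabled: true
  auto_rotate_interval: "720h"       # rotate keys every 30 days; empty disables rotation
  key_rotation_grace_period: "72h"   # previous key is still accepted during this window
  multipath:
    mode: "active-backup"            # fail over between paths
    probe_interval_sec: 5
    failover_threshold_ms: 150       # assumed value; 0 disables RTT-based failover
```

Leaving failover_threshold_ms at 0 keeps active-backup mode but disables the RTT-based trigger.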

auth — Authentication and API settings

Configures connection to the Nefia web dashboard and API.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| api_base_url | string | https://www.nefia.ai | Base URL for the Nefia API endpoint. |
| web_base_url | string | https://www.nefia.ai | URL of the Nefia web dashboard. |

yaml
auth:
  api_base_url: https://www.nefia.ai
  web_base_url: https://www.nefia.ai

policy — Command and path guardrails

Defines the policy engine rules that restrict what commands can be executed and what paths can be accessed.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| mode | string | enforce | Policy enforcement mode: off, warn, or enforce. |
| deny_commands | string[] | | Regex patterns for denied commands. Checked before allow rules. Must include a ^ or $ anchor in enforce/warn mode. |
| allow_commands | string[] | | Regex patterns for allowed commands. If set, only matching commands are permitted. |
| deny_paths | string[] | | Regex patterns for denied file paths. |
| allowed_roots | string[] | | Allowed root directories for file operations. Paths outside these roots are rejected. |
| deny_operations | string[] | | MCP operation types to deny globally (e.g., exec.sudo, fs.remove). |
| sudo_mode | string | | Override policy mode for sudo specifically: off, warn, or enforce. |
| sudo_allow_commands | string[] | | Regex patterns for allowed sudo commands. |
| sudo_deny_commands | string[] | | Regex patterns for denied sudo commands. |
| roles | object[] | | RBAC role definitions with per-role command, path, and operation restrictions. |

yaml
policy:
  mode: enforce
  deny_commands:
    - "^rm\\s+-rf\\s+/"
    - "^mkfs\\."
    - "^dd\\s+if="
  allow_commands:
    - "systemctl\\s+(status|restart|reload)"
    - "docker\\s+(ps|logs|inspect)"
    - "cat|head|tail|less|grep"
  deny_paths:
    - "/etc/shadow"
    - "/root/\\.ssh"
  allowed_roots:
    - /var/www
    - /etc/nginx
    - /home/deploy
  roles:
    - name: viewer
      hosts: ["^log-", "^monitor-"]
      allow_commands: ["cat|head|tail|ls"]
      deny_paths: ["^/etc/"]
      deny_commands: []
      record_sessions: true
      allowed_roots: ["/var/log"]
    - name: deployer
      hosts: ["^web-", "^worker-"]
      allow_commands: [".*"]
      deny_commands: ["^rm\\s+-rf"]
      allowed_roots: ["/var/www", "/etc/nginx"]
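
The example above does not exercise deny_operations or the sudo-specific fields from the table. A minimal sketch of how they might be combined follows; the patterns are illustrative assumptions, and the operation name is the one the table itself cites.

```yaml
# Sketch only: patterns are illustrative; adapt them to your own commands.
policy:
  mode: warn                          # regular commands are only warned about
  sudo_mode: enforce                  # sudo is held to the stricter mode
  deny_operations:
    - fs.remove                       # operation type named in the table above
  sudo_allow_commands:
    - "^systemctl\\s+(restart|reload)\\s+nginx$"
  sudo_deny_commands:
    - "^rm\\s"
```

All patterns here are anchored with ^ or $, which matches the anchoring requirement the table states for deny_commands.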

sudo — Sudo privilege escalation

Controls how nefia.exec.sudo and playbook sudo steps execute privileged commands on target hosts.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable sudo execution. When disabled, exec.sudo calls are rejected. |
| method | string | "nopasswd" | Sudo method. Currently only "nopasswd" is supported (passwordless sudo). |
| user | string | "root" | Target user for sudo execution. |
| allowed_commands | string[] | [] | Regex patterns for allowed sudo commands. Empty means all commands are allowed (subject to policy). Each entry must be a valid Go regex. |
| deny_commands | string[] | [] | Regex patterns for denied sudo commands. Evaluated before allowed_commands. Each entry must be a valid Go regex. |
| require_approval | boolean | false | Require human approval before executing sudo commands. Not yet supported: enabling this produces a validation error at startup. |

yaml
# Example
sudo:
  enabled: true
  method: "nopasswd"
  user: "root"
  allowed_commands:
    - "^apt (update|upgrade)"
    - "^systemctl (restart|reload)"
  deny_commands:
    - "^rm -rf /"
  require_approval: false

audit — Audit logging

Controls the append-only audit log that records every operation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | true | Enable audit logging. |
| required | boolean | false | When true, audit write failures are treated as fatal errors and the operation is blocked. The default behavior is warn-and-continue. |
| dir | string | <config-dir>/audit/ | Directory for audit log files (JSONL format). |
| retention_days | int | 90 | Number of days to retain audit logs before automatic cleanup. |
| syslog_addr | string | | Remote syslog server address (e.g., localhost:514). Empty disables syslog forwarding. |
| syslog_proto | string | udp | Syslog protocol: udp or tcp. |

yaml
audit:
  enabled: true
  required: false
  # macOS: ~/Library/Application Support/nefia/audit/
  # Linux: ~/.config/nefia/audit/
  # Windows: %AppData%\nefia\audit\
  dir: ""  # defaults to <config-dir>/audit/
  retention_days: 90
  syslog_addr: "localhost:514"
  syslog_proto: "udp"
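
For environments where an operation must not proceed without an audit record, the fields above can be combined as in the following sketch. The collector address and the retention period are hypothetical values chosen for illustration.

```yaml
# Sketch only: the collector address and retention period are assumptions.
audit:
  enabled: true
  required: true                          # block the operation if the audit write fails
  retention_days: 365
  syslog_addr: "syslog.example.com:514"   # hypothetical remote collector
  syslog_proto: "tcp"
```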

SIEM Forwarding

The audit.siem sub-object configures real-time audit event forwarding to external SIEM platforms.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| siem.type | string | | SIEM type: splunk, datadog, or webhook. |
| siem.endpoint | string | | SIEM endpoint URL. |
| siem.token_env | string | | Environment variable name containing the authentication token. |
| siem.webhook_secret_env | string | | Environment variable name containing the webhook HMAC secret (for webhook type). |
| siem.batch_size | int | 100 | Number of events to batch before flushing. |
| siem.flush_interval | duration | 10s | Maximum time between flushes. |
| siem.source | string | nefia | Event source identifier (Splunk). |
| siem.source_type | string | nefia:audit | Splunk source type. |
| siem.index | string | | Splunk index name. |
| siem.service | string | nefia | Datadog service name. |
| siem.tags | string[] | | Datadog tags to attach to events. |

yaml
audit:
  siem:
    type: splunk
    endpoint: https://splunk.example.com:8088/services/collector/event
    token_env: SPLUNK_HEC_TOKEN
    batch_size: 100
    flush_interval: "10s"
    source: nefia
    source_type: "nefia:audit"
    index: security
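
The example above covers Splunk. For the other two types, a configuration might look like the following sketch, which uses only fields from the table. The endpoint URLs and environment variable names are placeholders, not values Nefia is known to require.

```yaml
# Datadog variant (sketch). Endpoint and env var name are placeholders.
audit:
  siem:
    type: datadog
    endpoint: https://datadog-intake.example.com/logs
    token_env: DD_API_KEY
    service: nefia
    tags:
      - "env:production"
---
# Generic webhook variant (sketch). Endpoint and env var name are placeholders.
audit:
  siem:
    type: webhook
    endpoint: https://siem.example.com/ingest
    webhook_secret_env: SIEM_WEBHOOK_SECRET
```

The two variants are separate YAML documents for comparison; a real nefia.yaml would contain only one audit.siem block.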

schedule — Scheduled execution

Configures the built-in scheduler for recurring and deferred operations.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | true | Enable the scheduling subsystem. |
| sync_interval_sec | int | 300 | Reserved internal tuning field. The current scheduler loop does not use it directly. |
| retention_days | int | 90 | Number of days to retain schedule execution history. |
| max_concurrent_schedules | int | 5 | Maximum number of scheduled playbooks running simultaneously. |

yaml
schedule:
  enabled: true
  sync_interval_sec: 300
  retention_days: 90
  max_concurrent_schedules: 5

device_lock — Cryptographic device verification

Controls device lock verification (Tailnet Lock style). When enabled, only hosts whose WireGuard public key has been signed by the device-lock authority are allowed to connect.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable cryptographic device verification. |
| mode | string | log | Enforcement mode: log (log unsigned devices but allow connections) or enforce (block unsigned devices). |

yaml
device_lock:
  enabled: false
  mode: log

posture — Device posture verification

Controls device posture checks on target hosts. When enabled, hosts must meet the specified security requirements before connections are allowed.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable device posture verification. |
| mode | string | off | Posture enforcement mode: off (disabled), warn (log non-compliant devices but allow connections), or enforce (block non-compliant devices). |
| require_firewall | boolean | false | Require that the host firewall is enabled. |
| require_disk_encryption | boolean | false | Require that disk encryption (FileVault, BitLocker, LUKS) is enabled. |

yaml
posture:
  enabled: true
  mode: warn
  require_firewall: true
  require_disk_encryption: true

ssh_ca — SSH Certificate Authority

Configures the built-in SSH Certificate Authority for issuing short-lived SSH certificates instead of using static keys.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable the SSH CA workflow. When enabled, Nefia can issue user and host certificates. |
| auto_sign | boolean | true | Automatically issue or renew user certificates on connect when a local identity file is available. |
| cert_duration | duration | 24h | Validity period for user certificates. Uses Go duration format (e.g., 1h, 24h). |
| host_cert_duration | duration | 2160h | Validity period for host certificates. Uses Go duration format. Default is 90 days. |
| allowed_user_principals | string[] | | Allowlist of principals that can be used in user certificates. Required when enabled is true. Must be non-empty and contain no duplicates. |
| allow_privileged_principals | boolean | false | When true, allows privileged principals (e.g., root) in user certificates. By default, privileged principals are rejected. |

yaml
ssh_ca:
  enabled: true
  auto_sign: true
  cert_duration: "24h"
  host_cert_duration: "2160h"
  allowed_user_principals:
    - deploy
    - admin
  allow_privileged_principals: false

jit — Just-in-Time access

Configures the Just-in-Time (JIT) temporary access request system. When enabled, operators must request time-limited access to hosts instead of having persistent access.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable JIT access requests. |
| default_duration | duration | 1h | Default grant duration for access requests. Uses Go duration format. |
| max_duration | duration | 8h | Maximum allowed grant duration. Requests exceeding this are rejected. Uses Go duration format. |
| require_reason | boolean | false | Require a reason string when requesting access. |
| webhook_name | string | | Name of the alerting webhook (defined in alerts.webhooks) to notify when access is requested. |

yaml
jit:
  enabled: true
  default_duration: "1h"
  max_duration: "8h"
  require_reason: true
  webhook_name: slack-ops

secrets — Dynamic credential injection

Configures dynamic credential injection from external secret backends. Secrets can be resolved at runtime and injected as environment variables into remote commands.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable secret resolution and injection. |
| providers | object[] | | List of secret backend providers. See providers sub-object below. |
| cache_ttl | duration | 5m | How long resolved secrets are cached in memory. Uses Go duration format. |
| inject | map | | Environment variable mappings. Keys are env var names, values are secret references (e.g., ${vault:secret/data/db#password}). |

The providers sub-object:

| Field | Type | Description |
| --- | --- | --- |
| name | string | Unique identifier for this provider (e.g., vault-prod). |
| type | string | Provider type: vault, aws-sm, op (1Password CLI), env, or file. |
| config | map | Provider-specific configuration (e.g., {"addr": "https://vault.example.com"} for Vault). |

yaml
secrets:
  enabled: true
  cache_ttl: "5m"
  providers:
    - name: vault-prod
      type: vault
      config:
        addr: "https://vault.example.com"
    - name: env-local
      type: env
  inject:
    DB_PASSWORD: "${vault-prod:secret/data/db#password}"
    API_KEY: "${vault-prod:secret/data/api#key}"

reactor — Event-driven automation

Configures the operator-side event reactor. The reactor listens for events from agents and triggers automated actions based on pattern-matched rules.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable the event reactor. |
| listen_port | int | 9500 | TCP port for receiving events from agents. |
| rules | object[] | | List of automation rules. See rules sub-object below. |

The rules sub-object:

| Field | Type | Description |
| --- | --- | --- |
| name | string | Human-readable name for this rule. |
| event_pattern | string | Regex pattern matched against event types. |
| host_pattern | string | Regex pattern matched against host IDs. Empty matches all hosts. |
| severity | string | Minimum event severity to trigger the rule (e.g., warning, critical). |
| action.type | string | Action type: exec (run command), playbook (run playbook), or alert (send webhook). |
| action.command | string | Command to execute (for type: exec). |
| action.playbook_path | string | Path to playbook file (for type: playbook). |
| action.webhook_name | string | Name of alerting webhook (for type: alert). |

yaml
reactor:
  enabled: true
  listen_port: 9500
  rules:
    - name: restart-on-crash
      event_pattern: "service_crashed"
      host_pattern: "^web-"
      action:
        type: exec
        command: "systemctl restart nginx"
    - name: alert-disk-full
      event_pattern: "disk_usage_critical"
      severity: critical
      action:
        type: alert
        webhook_name: slack-ops
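
The example above shows the exec and alert action types. For completeness, a rule using the playbook action might look like the sketch below. The rule name and playbook path are hypothetical, and the event pattern simply reuses the disk_usage_critical event type from the example above.

```yaml
# Sketch only: rule name and playbook path are hypothetical.
reactor:
  enabled: true
  listen_port: 9500
  rules:
    - name: cleanup-on-disk-pressure
      event_pattern: "^disk_usage_"       # regex; matches disk_usage_critical
      host_pattern: ""                    # empty matches all hosts
      severity: warning                   # minimum severity that triggers the rule
      action:
        type: playbook
        playbook_path: /etc/nefia/playbooks/disk-cleanup.yaml
```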

cluster — High availability (Raft)

Configures active-passive high availability via Raft consensus. When enabled, multiple Nefia operator instances form a cluster with automatic leader election and state replication.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable Raft-based clustering. |
| node_id | string | | Unique identifier for this node in the cluster. |
| bind_addr | string | | TCP address for Raft communication (e.g., 0.0.0.0:9700). |
| advertise_addr | string | | Address peers use to reach this node. Defaults to bind_addr if omitted. |
| peers | object[] | | Initial cluster members for joining. See peers sub-object below. |
| data_dir | string | <state_dir>/cluster | Directory for Raft data (logs, snapshots, stable store). |
| tls_enabled | boolean | false | Enable TLS for inter-node Raft communication. |
| tls_cert_file | string | | Path to the TLS certificate file for Raft transport. |
| tls_key_file | string | | Path to the TLS private key file for Raft transport. |
| tls_ca_file | string | | Path to the CA certificate file for verifying peer certificates. |

The peers sub-object:

| Field | Type | Description |
| --- | --- | --- |
| id | string | Raft server ID of the peer. |
| address | string | Raft TCP address of the peer (e.g., 10.99.0.2:9700). |

yaml
cluster:
  enabled: true
  node_id: node-1
  bind_addr: "0.0.0.0:9700"
  advertise_addr: "10.99.0.1:9700"
  peers:
    - id: node-2
      address: "10.99.0.2:9700"
    - id: node-3
      address: "10.99.0.3:9700"
  data_dir: ""  # defaults to <state_dir>/cluster
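
The example above runs Raft traffic without TLS. A hedged sketch of the same node with TLS enabled follows; it uses only the tls_* fields from the table, and the certificate paths are hypothetical.

```yaml
# Sketch only: certificate paths are hypothetical.
cluster:
  enabled: true
  node_id: node-1
  bind_addr: "0.0.0.0:9700"
  advertise_addr: "10.99.0.1:9700"
  tls_enabled: true
  tls_cert_file: /etc/nefia/tls/node-1.crt
  tls_key_file: /etc/nefia/tls/node-1.key
  tls_ca_file: /etc/nefia/tls/ca.crt      # used to verify peer certificates
  peers:
    - id: node-2
      address: "10.99.0.2:9700"
    - id: node-3
      address: "10.99.0.3:9700"
```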

alerts — Webhook-based alerting

Configures webhook notifications for operational events. Alerts are dispatched asynchronously and do not block the triggering operation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| webhooks[].url * | string | | Webhook endpoint URL (must be http:// or https://). |
| webhooks[].name | string | | Optional name for referencing this webhook. |
| webhooks[].type | string | generic | Webhook format: slack, discord, teams, pagerduty, or generic (plain JSON). |
| webhooks[].events | string[] | | Event types to subscribe to. Empty subscribes to all events. |
| webhooks[].cooldown_sec | int | 300 | Minimum seconds between alerts of the same event type to prevent flooding. |
| webhooks[].template.body | string | | Go text/template for custom webhook payload. |
| webhooks[].pagerduty.routing_key | string | | PagerDuty Events API v2 routing key (required for pagerduty type). |
| webhooks[].pagerduty.default_severity | string | warning | Default PagerDuty severity: info, warning, error, or critical. |

yaml
alerts:
  webhooks:
    - url: https://hooks.slack.com/services/T.../B.../xxx
      type: slack
      events: [exec_failure, vpn_peer_unhealthy]
      cooldown_sec: 300
    - url: https://monitoring.example.com/nefia
      type: generic
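
Other sections refer to webhooks by name (for example jit.webhook_name and a reactor rule's action.webhook_name), and the pagerduty type needs a routing key. The sketch below shows both, using only fields from the table; the webhook names, event selection, and routing key are placeholders.

```yaml
# Sketch only: webhook names and the routing key are placeholders.
alerts:
  webhooks:
    - name: slack-ops                   # name that jit.webhook_name can reference
      url: https://hooks.slack.com/services/T.../B.../xxx
      type: slack
      events: [host_offline, circuit_breaker_open]
      cooldown_sec: 600
    - name: pagerduty-oncall
      url: https://events.pagerduty.com/v2/enqueue
      type: pagerduty
      events: [exec_failure, playbook_failed]
      pagerduty:
        routing_key: "REPLACE_WITH_ROUTING_KEY"
        default_severity: error
```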

Supported Event Types

| Event Type | Trigger |
| --- | --- |
| exec_failure | One or more hosts failed during a command execution. |
| vpn_peer_unhealthy | A VPN peer's handshake is stale (detected by the health monitor). |
| circuit_breaker_open | The SSH circuit breaker opened for a host after consecutive connection failures. |
| policy_rebuild_failed | The policy engine failed to rebuild after a config hot-reload. |
| enrollment_complete | A host finished enrollment successfully. |
| host_online | A previously offline host became reachable again. |
| host_offline | A host transitioned to offline or unhealthy state. |
| key_rotation | WireGuard key rotation completed. |
| config_change | Configuration was saved or reloaded with changes. |
| playbook_complete | A playbook run completed successfully. |
| playbook_failed | A playbook run failed. |
| queue_executed | A queued offline command was delivered successfully. |
| queue_failed | A queued offline command failed. |
| host_revoked | One host's VPN access was revoked. |
| host_revoke_all | Emergency revocation removed all hosts. |

Delivery and Retry Behavior

  • Failed deliveries (HTTP 5xx or network errors) are retried up to 3 times with exponential backoff (1s, 2s, 4s).
  • Each attempt has a 10-second timeout.
  • Non-2xx responses below 500 (e.g., 4xx) are logged as warnings but not retried.
  • The cooldown timer is per event type per webhook. Duplicate alerts of the same type within the cooldown window are silently dropped.

Payload Formats

Generic (type: generic): A JSON object with event, message, details (optional key-value map), and timestamp (RFC 3339).

json
{
  "event": "exec_failure",
  "message": "Exec failed on 2 host(s)",
  "details": { "failed_count": 2 },
  "timestamp": "2026-03-06T12:00:00Z"
}

Slack (type: slack): A Slack Block Kit payload with a header block (event type), a section block (message), and an optional section block (details formatted as Markdown list).

mcp — MCP server settings

Settings for the Model Context Protocol server used by AI agents.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| rate_limit.rate | float | server default | Maximum requests per second. When unset, the running server falls back to its built-in default. |
| rate_limit.burst | int | server default | Maximum burst size before rate limiting kicks in. |
| approval.enabled | boolean | false | Enable the approval workflow. When enabled, 2 additional approval tools (nefia.approval.list, nefia.approval.respond) are available, and matching rules require human approval before execution. |
| approval.timeout_sec | int | 120 | Seconds to wait for user approval before timing out. |
| approval.default_action | string | deny | Action when approval times out: deny or allow. |
| approval.rules | object[] | | Pattern-based approval rules (see below). |
| mtls.enabled | boolean | false | Enable the mTLS gateway for secure MCP connections. |
| mtls.listen_addr | string | 127.0.0.1:19821 | TCP address for the mTLS gateway to listen on. |
| mtls.ca_cert_file | string | | Path to the CA certificate file for client verification. |
| mtls.cert_file | string | | Path to the server certificate file. |
| mtls.key_file | string | | Path to the server private key file. |

Client certificate revocation is not configured through nefia.yaml. The mTLS gateway always uses the state-directory revocation store managed by nefia mtls revoke, and newly revoked certificates are rejected on subsequent handshakes without restarting the gateway.

Each approval rule can specify tools (exact match), commands and paths (prefix match), and hosts (exact match). The first matching rule wins, and require_approval: false can be used as an explicit exemption rule. When approval is disabled, the approval tools are still advertised with (not configured) descriptions for discoverability.

yaml
mcp:
  rate_limit:
    rate: 1.0
    burst: 10
  approval:
    enabled: false
    timeout_sec: 120
    default_action: deny
    rules: []
  mtls:
    enabled: false
    listen_addr: "127.0.0.1:19821"
    ca_cert_file: ""
    cert_file: ""
    key_file: ""
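
The example above leaves rules empty. As a rough sketch of the first-match-wins behavior described earlier, the rules below put an exemption ahead of a broader rule. The field names (tools, paths, hosts, require_approval) come from the prose above, but writing them as YAML lists is an assumption, and the host ID build-01 is hypothetical.

```yaml
# Sketch only: list-valued rule fields and the host ID are assumptions.
mcp:
  approval:
    enabled: true
    timeout_sec: 120
    default_action: deny                # deny when nobody responds in time
    rules:
      - tools: ["nefia.exec.sudo"]      # exact tool match
        hosts: ["build-01"]             # exact host match
        require_approval: false         # explicit exemption, listed first
      - tools: ["nefia.exec.sudo"]
        require_approval: true          # every other host needs approval for sudo
      - paths: ["/etc/"]                # prefix match
        require_approval: true
```

Because the first matching rule wins, placing the exemption after the broader sudo rule would make it unreachable.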

team — Team context

Configures the active team for multi-tenant operations.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| active_team_id | string | | ID of the currently active team. Set by nefia team use. |
| active_team_slug | string | | Human-readable slug of the active team (e.g., my-team). |

yaml
team:
  active_team_id: "tm_abc123"
  active_team_slug: "my-team"

telemetry — Tracing and metrics

Configures OpenTelemetry tracing and Prometheus metrics export. Both subsystems are disabled by default and add zero overhead when disabled (a noop TracerProvider is installed).

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable OpenTelemetry trace export. |
| endpoint | string | localhost:4318 | OTLP HTTP endpoint for trace collection. |
| service_name | string | nefia | Service name reported in traces. |
| metrics_enabled | boolean | false | Enable a Prometheus-compatible metrics endpoint. |
| metrics_port | int | 9090 | TCP port for the Prometheus metrics endpoint. |
| insecure_http | boolean | false | Allow plaintext (non-TLS) OTLP HTTP connections for non-loopback collectors. Only needed when the collector is on a different host and does not support TLS. |

yaml
telemetry:
  enabled: false
  endpoint: "localhost:4318"
  service_name: nefia
  insecure_http: false
  metrics_enabled: false
  metrics_port: 9090

OpenTelemetry Tracing

When enabled is true, Nefia exports distributed traces via OTLP HTTP to the configured endpoint. Traces are batched asynchronously and flushed on shutdown. The TraceHandler automatically injects trace_id and span_id into every slog log record when a span context is active, enabling log-trace correlation.

If the OTLP exporter fails to initialise (e.g., endpoint unreachable), Nefia falls back to a noop tracer silently rather than failing startup.

Prometheus Metrics

When metrics_enabled is true, Nefia starts an HTTP server on the configured metrics_port with the following endpoints:

| Endpoint | Description |
| --- | --- |
| /metrics | Prometheus-compatible metrics scrape endpoint. |
| /healthz | Liveness probe. Returns {"status":"healthy"} (200) or {"status":"unhealthy"} (503). |
| /readyz | Readiness probe. Returns {"status":"ready"} (200) or {"status":"not_ready"} (503). |
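
To collect these metrics, a Prometheus server needs a scrape job pointing at the configured port. The fragment below is a minimal sketch of such a job for Prometheus's own configuration file (not nefia.yaml). It assumes Nefia runs on the same machine with the default metrics_port; the job name is arbitrary.

```yaml
# prometheus.yml fragment (sketch). Assumes Nefia is local and metrics_port is 9090.
scrape_configs:
  - job_name: nefia
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:9090"]
```

Prometheus itself listens on port 9090 by default, so when both run on the same host one of the two ports has to change.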

Exported Metrics

Nefia exports both expvar counters (always available at Go's debug endpoint) and OTel instruments (exported to Prometheus when enabled):

| expvar Counter | OTel Instrument | Description |
| --- | --- | --- |
| nefia_exec_total | nefia.exec.total | Total command execution operations. |
| nefia_exec_success | | Successful executions. |
| nefia_exec_fail | | Failed executions. |
| nefia_conn_dial_total | nefia.conn.dial.total | Total SSH dial attempts. |
| nefia_conn_dial_fail | | Failed SSH dial attempts. |
| nefia_conn_healthcheck_fail | | Failed connection health checks. |
| nefia_conn_pool_size | | Current SSH connection pool size. |
| nefia_session_open_total | | Total session open operations. |
| nefia_session_gc_removed | | Sessions removed by garbage collection. |
| nefia_session_gc_runs | | Number of GC cycles executed. |
| nefia_fs_read_total | nefia.fs.total (op=read) | File read operations. |
| nefia_fs_write_total | nefia.fs.total (op=write) | File write operations. |
| nefia_fs_patch_total | nefia.fs.total (op=patch) | File patch operations. |
| nefia_fs_list_total | nefia.fs.total (op=list) | Directory list operations. |
| nefia_fs_stat_total | nefia.fs.total (op=stat) | File stat operations. |
| nefia_vpn_peer_unhealthy | | Unhealthy VPN peer detections. |
| nefia_playbook_run_total | nefia.playbook.total | Total playbook run operations. |

OTel histograms record operation duration in seconds with bucket boundaries from 1ms to 10s:

| Histogram | Description |
| --- | --- |
| nefia.exec.duration | Command execution duration. |
| nefia.conn.dial.duration | SSH dial duration. |
| nefia.fs.duration | Filesystem operation duration (labeled by op). |
| nefia.playbook.duration | Playbook run duration. |

All OTel instruments include an ok boolean attribute for success/failure breakdowns.

hosts — Target PC definitions

Each host represents a target PC enrolled via nefia vpn invite. Hosts are defined as an array of objects.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| id * | string |  | Unique host identifier (name). Must match ^[a-zA-Z0-9][a-zA-Z0-9._-]*$ and be at most 128 characters. |
| address * | string |  | VPN IP address of the host. |
| os * | string |  | Operating system: macos, linux, or windows. |
| user * | string |  | SSH username for connecting to this host. Falls back to $USER (or $USERNAME on Windows). Connection fails with an actionable error if all fallbacks are empty. |
| root | string | / | Default root directory for file operations. |
| shell | string |  | Shell override for this host (e.g., /bin/bash). Overrides the OS-level default from defaults.shell. |
| role | string |  | RBAC role name assigned to this host. Must match a role defined in policy.roles. |
| tags | map |  | Key-value tags for targeting and group membership. |
| vpn.public_key * | string |  | WireGuard public key for this peer. |
| vpn.endpoint | string |  | WireGuard endpoint (ip:port) if the host has a public address. |
| vpn.local_endpoint | string |  | LAN endpoint (ip:port) for hairpin NAT fallback. Discovered automatically during enrollment. |
| vpn.vpn_addr | string |  | Peer's VPN IP address (e.g., 10.99.0.2). |
| vpn.status | string |  | Current VPN status: active, pending, or empty. |
yaml
hosts:
  - id: prod-web-1
    address: 10.99.0.2
    os: linux
    user: deploy
    root: /var/www
    tags:
      env: production
      role: web
    vpn:
      public_key: xYz1...aBcD=
      endpoint: 203.0.113.10:51820
      local_endpoint: 192.168.1.50:51820
      vpn_addr: 10.99.0.2
      status: active
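
For comparison, a minimal entry might set only the fields marked * in the table, leaving root at its default of / and the shell at the OS-level default. This is a sketch: the id, address, and user values are illustrative, reading * as "required" is an assumption, and the public key stays an elided placeholder.

```yaml
hosts:
  - id: staging-db-1             # illustrative; must match ^[a-zA-Z0-9][a-zA-Z0-9._-]*$
    address: 10.99.0.3           # VPN IP address of the host (illustrative)
    os: linux                    # one of: macos, linux, windows
    user: deploy                 # SSH username (illustrative)
    vpn:
      public_key: xYz1...aBcD=   # placeholder, not a real WireGuard key
```
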

groups — Host group definitions

Groups provide named selectors based on tag matching.

yaml
groups:
  - name: webservers
    match:
      tags:
        role: web
  - name: production
    match:
      tags:
        env: production
  - name: staging
    match:
      tags:
        env: staging
  - name: all-linux
    match:
      tags:
        os: linux

Use groups in target selectors:

bash
nefia exec --target group:webservers -- systemctl status nginx
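
A group can also list more than one tag. The sketch below assumes that every entry under match.tags must be present on the host with the same value (AND semantics). This page does not state the matching rule explicitly, so verify it before relying on it. The group name is illustrative.

```yaml
groups:
  - name: production-web     # illustrative group name
    match:
      tags:
        env: production      # assumed: both tags must match
        role: web
```

Under that assumption, the prod-web-1 host from the hosts example carries both tags and would be selected by --target group:production-web.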

Complete Example

A fully populated nefia.yaml combining all sections:

yaml
version: 1
session_ttl_minutes: 1440
session_idle_timeout_minutes: 30
session_gc_interval_minutes: 5
host_sync_interval: "5m"
 
defaults:
  concurrency: 50
  timeout: 30m
  output: human
  max_output_bytes: 1048576
  shell:
    macos: /bin/zsh
    linux: /bin/bash
    windows: powershell.exe
  artifact_retention_days: 7
  progress:
    enabled: true
    interval: "30s"
  notifications:
    enabled: false
    min_duration: "30s"
 
ssh:
  connect_timeout_sec: 5
  known_hosts: ~/.ssh/known_hosts
  identities:
    - ~/.ssh/id_ed25519
  agent: false
  max_file_size_bytes: 3221225472
  max_concurrent_fs: 20
  max_pool_size: 100
  connection_ttl: "1h"
  idle_timeout: "10m"
  max_retries: 3
  initial_backoff: "1s"
  max_backoff: "60s"
  circuit_breaker_threshold: 5
  circuit_breaker_reset_timeout: "60s"
  keepalive_timeout: "5s"
 
vpn:
  enabled: true
  private_key: $WG_PRIVATE_KEY
  listen_port: 51820
  address: 10.99.0.1/24
  dns: [1.1.1.1]
  magic_dns:
    enabled: true
    domain: nefia
    upstream: ["1.1.1.1:53", "8.8.8.8:53"]
  derp_servers:
    - url: "wss://relay.nefia.ai/derp"
      region: "ap-northeast-1"
  turn_servers:
    - url: "turn:turn.example.com:3478"
      username: $TURN_USER
      password: $TURN_PASS
  stun_servers:
    - "stun.l.google.com:19302"
  stun_timeout: "5s"
  key_rotation_grace_period: "72h"
  monitor_interval: "30s"
  local_dial_timeout: "3s"
  derp_probe_interval: "5m"
  derp_auto_select: true
  multipath:
    mode: "off"
    probe_interval_sec: 5
    failover_threshold_ms: 0
 
auth:
  api_base_url: https://www.nefia.ai
  web_base_url: https://www.nefia.ai
 
policy:
  mode: enforce
  deny_commands:
    - "^rm\\s+-rf\\s+/"
  allow_commands:
    - "systemctl\\s+(status|restart|reload)"
  deny_paths:
    - "/etc/shadow"
  allowed_roots:
    - /var/www
    - /etc/nginx
 
audit:
  enabled: true
  required: false
  retention_days: 90
  syslog_addr: "localhost:514"
  syslog_proto: "udp"
 
sudo:
  enabled: false
  method: "nopasswd"
  user: "root"
  allowed_commands: []
  deny_commands: []
  require_approval: false
 
schedule:
  enabled: true
  sync_interval_sec: 300
  retention_days: 90
  max_concurrent_schedules: 5
 
device_lock:
  enabled: false
  mode: log
 
posture:
  enabled: true
  mode: warn
  require_firewall: true
  require_disk_encryption: true
 
ssh_ca:
  enabled: true
  auto_sign: true
  cert_duration: "24h"
  host_cert_duration: "2160h"
  allowed_user_principals:
    - deploy
    - admin
  allow_privileged_principals: false
 
jit:
  enabled: false
  default_duration: "1h"
  max_duration: "8h"
  require_reason: true
  webhook_name: ""
 
secrets:
  enabled: false
  cache_ttl: "5m"
  providers:
    - name: vault-prod
      type: vault
      config:
        addr: "https://vault.example.com"
  inject:
    DB_PASSWORD: "${vault-prod:secret/data/db#password}"
 
reactor:
  enabled: false
  listen_port: 9500
  rules: []
 
cluster:
  enabled: false
  node_id: node-1
  bind_addr: "0.0.0.0:9700"
  advertise_addr: ""
  peers: []
  data_dir: ""
 
mcp:
  rate_limit:
    rate: 1.0
    burst: 10
  approval:
    enabled: false
    timeout_sec: 120
    default_action: deny
  mtls:
    enabled: false
    listen_addr: "127.0.0.1:19821"
    ca_cert_file: ""
    cert_file: ""
    key_file: ""
 
alerts:
  webhooks: []
 
team:
  active_team_id: ""
  active_team_slug: ""
 
telemetry:
  enabled: false
  endpoint: "localhost:4318"
  service_name: nefia
  insecure_http: false
  metrics_enabled: false
  metrics_port: 9090
 
hosts:
  - id: prod-web-1
    address: 10.99.0.2
    os: linux
    user: deploy
    root: /var/www
    shell: /bin/bash
    role: deployer
    tags:
      env: production
      role: web
    vpn:
      public_key: xYz1...aBcD=
      endpoint: 203.0.113.10:51820
      vpn_addr: 10.99.0.2
      status: active
 
groups:
  - name: webservers
    match:
      tags:
        role: web
  - name: production
    match:
      tags:
        env: production

Validating Your Configuration

Run the built-in validator to catch errors:

bash
nefia validate

Configuration: /Users/admin/.config/nefia/nefia.yaml
Hosts: 5 defined, 4 reachable
Groups: 4 defined, all valid
Policy: 12 rules loaded, no conflicts
VPN: enabled, key present
Result: OK

Config Validation Rules

The following validation rules are applied at startup and when running nefia validate:

| Field | Rule | Error |
| --- | --- | --- |
| vpn.derp_servers[].url | Must not be empty | vpn.derp_servers[N].url must not be empty |
| vpn.derp_servers[].url | Must start with wss:// or ws:// | vpn.derp_servers[N].url must start with wss:// or ws:// |
| vpn.derp_servers[].url | Must be a valid URL | vpn.derp_servers[N].url is not a valid URL |
| vpn.stun_servers[] | Must not be empty | vpn.stun_servers[N] must not be empty |
| vpn.stun_servers[] | Must be host:port format | vpn.stun_servers[N] must be host:port |
| hosts[].vpn.vpn_addr | Must be unique across all hosts | Duplicate VPN address error |
| hosts[].vpn.vpn_addr | Must not overlap with operator address | Operator address collision error |
| ssh_ca.allowed_user_principals | Must be non-empty when ssh_ca.enabled is true | ssh_ca.allowed_user_principals must not be empty when ssh_ca is enabled |
| ssh_ca.allowed_user_principals | Must contain no duplicates | ssh_ca.allowed_user_principals contains duplicate entry |
| sudo.allowed_commands[] | Must be valid regex patterns | sudo.allowed_commands[N] is not a valid regex |
| sudo.deny_commands[] | Must be valid regex patterns | sudo.deny_commands[N] is not a valid regex |
| sudo.require_approval | Must be false (not yet supported) | sudo.require_approval is not yet supported |
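
As a rough illustration, the fragment below is written to satisfy the vpn, ssh_ca, and sudo rules listed above. Most values are taken from the Complete Example. The sudo.allowed_commands pattern is illustrative, and the comments paraphrase the rules rather than quoting validator output.

```yaml
vpn:
  derp_servers:
    - url: "wss://relay.nefia.ai/derp"    # non-empty, valid URL, wss:// or ws:// scheme
      region: "ap-northeast-1"
  stun_servers:
    - "stun.l.google.com:19302"           # host:port; an empty entry is rejected

ssh_ca:
  enabled: true
  allowed_user_principals:                # must be non-empty while ssh_ca is enabled
    - deploy
    - admin                               # each principal may appear only once

sudo:
  enabled: true
  allowed_commands:
    - "^systemctl\\s+restart\\s+nginx$"   # illustrative; must compile as a regex
  deny_commands:
    - "^rm\\s+-rf\\s+/"
  require_approval: false                 # true is rejected (not yet supported)
```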

File-Based Config Locking

When multiple nefia processes modify the configuration concurrently (e.g., parallel vpn invite and vpn reinvite commands), a file-based lock prevents race conditions:

  • An exclusive lock file (nefia.yaml.lock) is acquired before any config modification.
  • On Unix (macOS/Linux): non-blocking flock with retry (up to 30 attempts, 1 second apart, for a maximum wait of 30 seconds).
  • On Windows: LockFileEx with an exclusive non-blocking lock and the same retry strategy.
  • The lock is released after the config file is saved.

This prevents VPN address collisions and config corruption when enrollment operations run in parallel.

Stale Lock Recovery

On Unix, the operating system automatically releases flock locks when the owning process exits (even on crash), so stale locks are not possible under normal circumstances.

On Windows, the lock file stores the PID of the owning process. When a lock acquisition fails, Nefia reads the PID from the lock file and checks whether that process is still alive using OpenProcess and GetExitCodeProcess. If the owning process has exited (crash or ungraceful termination), the stale lock file is automatically removed and the lock is re-acquired. This prevents a crashed process from holding the lock indefinitely.

Agent-Side Configuration

The agent (nefia-agent) uses a separate agent.yaml configuration file. See the Agent Reference for the full schema. Key extensions for NAT traversal and cloud communication:

| Field | Type | Description |
| --- | --- | --- |
| agent_token | string | API token for cloud communication. Supports keyring:agent-token for OS keyring storage. |
| agent_token_expires_at | string | RFC 3339 expiry time for the agent token. |
| host_id | string | Host ID for endpoint reporting. |
| stun_servers | string[] | Custom STUN servers (overrides defaults). Format: host:port. |
| endpoint_refresh | duration | STUN refresh interval (e.g., 5m). Empty disables. |
| cloud_api_base | string | Cloud API base URL. Must be https://. |
| derp_servers | DERPServer[] | DERP relay servers (propagated from operator during enrollment). |
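
To show how these fields might fit together, here is a hedged agent.yaml sketch. It assumes the fields sit at the top level of the file and that a DERPServer entry has the same url and region shape as vpn.derp_servers in nefia.yaml. Neither is confirmed on this page, so treat the Agent Reference as authoritative. All values are illustrative.

```yaml
# agent.yaml (sketch only; field placement is an assumption)
host_id: prod-web-1
agent_token: keyring:agent-token                 # read the token from the OS keyring
agent_token_expires_at: "2030-01-01T00:00:00Z"   # RFC 3339, illustrative value
cloud_api_base: https://www.nefia.ai             # must be https://; illustrative URL
stun_servers:
  - "stun.l.google.com:19302"                    # host:port
endpoint_refresh: "5m"                           # empty disables STUN refresh
derp_servers:                                    # entry shape assumed, see lead-in
  - url: "wss://relay.nefia.ai/derp"
    region: "ap-northeast-1"
```
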