
Incident Response

What to do when things break.

If everything is down, start with the triage checklist below. The most common root causes are: Caddy down, JuiceFS mount dropped, or disk full.

Triage Checklist

Check Nomad

make nomad-status

Are all jobs running? If platform jobs (chelar-api, chelar-dashboard) are dead, that's the root cause.

Check Caddy

make server-logs-caddy

Look for 502/504 errors or TLS failures. If Caddy is down, all traffic stops.

Check JuiceFS

make server-logs-juicefs

If the mount dropped, all tenants lose access to their data directories.

Check Disk

make server-disk

Docker images and layers can fill the NVMe drive. If disk usage exceeds 90%, clean up with docker system prune.
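The threshold check can be scripted; a minimal sketch (the 90% cutoff and the root-volume path are assumptions, adjust to the server's layout):

```shell
# Decide whether cleanup is needed from a used-percentage figure.
# The 90% threshold is an assumption; tune it to your alerting policy.
disk_action() {
  pct=$1   # integer percent used, e.g. 87
  if [ "$pct" -ge 90 ]; then
    echo "prune"   # time to run: docker system prune
  else
    echo "ok"
  fi
}

# Feed it the live figure for the root volume (GNU df):
disk_action "$(df / --output=pcent | tail -n 1 | tr -dc '0-9')"
```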

Check Netdata

Open netdata.chelar.ai for real-time resource metrics. Look for CPU saturation, memory exhaustion, or network anomalies.

Common Incidents

All Tenants Unreachable

Most likely cause: Caddy is down or JuiceFS mount dropped.

make ssh
systemctl status caddy
# If Caddy is down:
systemctl restart caddy

# Check JuiceFS:
df -h /data
# If mount dropped:
systemctl restart juicefs
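After restarting, a quick way to confirm tenants are reachable again is to probe them over HTTP and classify the status code. A sketch, assuming curl is available and you substitute real tenant hostnames:

```shell
# Map an HTTP status code to a triage verdict. 502/504 point back at
# Caddy's upstreams (the Nomad jobs) rather than at Caddy itself.
classify() {
  case "$1" in
    2*|3*)   echo "up" ;;
    502|504) echo "gateway error - check upstream (Nomad job) health" ;;
    *)       echo "down" ;;
  esac
}

# Usage (the hostname is a placeholder, not a real tenant URL):
# classify "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 https://tenant.example)"
```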

Single Tenant Unreachable

Check the specific tenant's Nomad allocation:

make ssh
nomad job status tenant-{id}
nomad alloc logs {alloc-id}

See Tenant Debugging for the full debugging flow.

API Returning 500s

make nomad-logs-api
# Check for panic/error messages
# If stuck, restart:
nomad job restart chelar-api
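Filtering the log stream for failure markers first narrows things down. A sketch, assuming `make nomad-logs-api` streams the allocation logs to stdout and that failures look like Go panics or structured `level=error` lines:

```shell
# Keep only lines that look like failures before eyeballing the stream.
scan_errors() {
  grep -Ei 'panic|fatal|level=error'
}

# Usage:
# make nomad-logs-api 2>&1 | scan_errors | tail -n 20
```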

Server Rebooted

After a reboot, everything should self-heal automatically:

  1. Nomad starts via systemd and reconciles all registered jobs
  2. JuiceFS remounts automatically
  3. Caddy reloads its config and provisions TLS if needed
  4. Tenant containers resume within ~30 seconds

If it doesn't, check systemctl status nomad caddy juicefs.
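The three unit checks can be folded into one pass; a sketch, assuming those are the systemd unit names on the host:

```shell
# Summarize a unit's state; anything not "active" gets a pointer to its
# journal for further digging.
report() {
  unit=$1; state=$2   # state as returned by: systemctl is-active <unit>
  if [ "$state" = "active" ]; then
    echo "$unit ok"
  else
    echo "$unit NOT active ($state) - inspect: journalctl -u $unit"
  fi
}

# Live usage on the server:
# for u in nomad caddy juicefs; do report "$u" "$(systemctl is-active "$u")"; done
```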
