Incident Response
What to do when things break.
If everything is down, start with the triage checklist below. The most common root causes are: Caddy down, JuiceFS mount dropped, or disk full.
Triage Checklist
Check Nomad
```shell
make nomad-status
```
Are all jobs running? If the platform jobs (chelar-api, chelar-dashboard) are dead, that's the root cause.
Check Caddy
```shell
make server-logs-caddy
```
Look for 502/504 errors or TLS failures. If Caddy is down, all traffic stops.
Check JuiceFS
```shell
make server-logs-juicefs
```
If the mount dropped, all tenants lose access to their data directories.
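From a shell on the server, the mount can also be confirmed directly. A minimal sketch, assuming `/data` is the JuiceFS mount point (the same path used in the incident steps further down):

```shell
# Check /proc/mounts directly; grep -s stays quiet if the file is missing.
# (/data is the JuiceFS mount point assumed from this runbook.)
grep -qs ' /data ' /proc/mounts || echo "/data not mounted - JuiceFS dropped"
```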
Check Disk
```shell
make server-disk
```
Docker images and layers can fill up the NVMe drive. If disk usage is above 90%, clean up with `docker system prune`.
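The 90% check can be scripted so pruning only happens when it's actually needed. This is a sketch, not a Makefile target; `disk_pct` and `maybe_prune` are hypothetical helpers, and the mount point to check should be adjusted to your data volume:

```shell
# disk_pct: print the use% (digits only) of the filesystem mounted at $1.
disk_pct() {
  df --output=pcent "$1" | tail -1 | tr -dc '0-9'
}

# maybe_prune: prune unused Docker data only past the 90% threshold
# from the checklist above.
maybe_prune() {
  pct=$(disk_pct "${1:-/}")
  if [ "$pct" -gt 90 ]; then
    echo "disk at ${pct}%: pruning unused Docker data"
    docker system prune -f
  else
    echo "disk at ${pct}%: no action needed"
  fi
}
```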
Check Netdata
Open netdata.chelar.ai for real-time resource metrics. Look for CPU saturation, memory exhaustion, or network anomalies.
Common Incidents
All Tenants Unreachable
Most likely cause: Caddy is down or JuiceFS mount dropped.
```shell
make ssh
systemctl status caddy
# If Caddy is down:
systemctl restart caddy
# Check JuiceFS:
df -h /data
# If mount dropped:
systemctl restart juicefs
```
Single Tenant Unreachable
Check the specific tenant's Nomad allocation:
```shell
make ssh
nomad job status tenant-{id}
nomad alloc logs {alloc-id}
```
See Tenant Debugging for the full debugging flow.
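When you don't know which tenant is affected, the plain-text output of `nomad job status` (with no arguments it lists every job) can be filtered instead of checking IDs one by one. A sketch, assuming the default column layout of ID, Type, Priority, Status, Submit Date and the `tenant-` naming from this runbook; `filter_unhealthy` is a hypothetical helper:

```shell
# filter_unhealthy: read `nomad job status` output on stdin and print
# tenant jobs whose Status column (4th) is anything but "running".
filter_unhealthy() {
  awk '$1 ~ /^tenant-/ && $4 != "running" {print $1, $4}'
}

# On the server:
#   nomad job status | filter_unhealthy
```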
API Returning 500s
```shell
make nomad-logs-api
# Check for panic/error messages
# If stuck, restart:
nomad job restart chelar-api
```
Server Rebooted
After a reboot, everything should self-heal automatically:
- Nomad starts via systemd and reconciles all registered jobs
- JuiceFS remounts automatically
- Caddy reloads its config and provisions TLS if needed
- Tenant containers resume within ~30 seconds
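The self-heal verification can be scripted. This sketch takes the probe command as a parameter so the loop can be exercised without systemd; `check_units` is a hypothetical helper and the `juicefs` unit name is an assumption:

```shell
# check_units: report whether each expected unit is active. The probe
# command defaults to systemctl; pass another command (e.g. `true`) to test.
check_units() {
  probe="${1:-systemctl is-active --quiet}"
  for unit in nomad caddy juicefs; do
    if $probe "$unit"; then
      echo "$unit: active"
    else
      echo "$unit: NOT active"
    fi
  done
}

# On the server: check_units
```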
If things don't self-heal, check `systemctl status nomad caddy juicefs`.