
SecureBytes NOC Stack

The observability layer for the homelab. Every host runs Node Exporter, scraped by Prometheus on a sixty-second interval. Grafana fronts it with dashboards spanning host health, memory, disk, and network throughput across the entire cluster.
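
A minimal sketch of what that scrape config looks like, assuming static targets and the default Node Exporter port; the hostnames below are placeholders, not the real inventory:

    # prometheus.yml (sketch): global 60s scrape, one job for every Node Exporter
    global:
      scrape_interval: 60s
      evaluation_interval: 60s

    scrape_configs:
      - job_name: "node"
        static_configs:
          - targets:            # placeholder hosts
              - "pve1:9100"
              - "lxc-media:9100"
              - "vm-build:9100"
      - job_name: "prometheus"  # Prometheus scraping itself
        static_configs:
          - targets: ["localhost:9090"]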

Host status, uptime, CPU usage, and CPU temperature panels across the cluster

What it covers

Three panel groups, one consolidated dashboard:

Host layer — uptime, CPU, memory, disk, and temperature for every node and container. Status rows turn green when a host responds to a scrape, red when it doesn’t. Uptime panels are color-graded so I can spot which hosts have rebooted recently without reading the numbers.
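
Roughly the queries behind those panels, sketched with the standard Node Exporter metric names; the exact expressions and thresholds in the dashboard JSON may differ:

    # Host status: 1 when the last scrape succeeded, 0 when it didn't
    up{job="node"}

    # Uptime in seconds, per host
    time() - node_boot_time_seconds

    # CPU usage %, averaged across cores
    100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

    # CPU temperature, where hwmon sensors are exposed
    node_hwmon_temp_celsius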

Resource layer — memory used and memory percentage, disk usage at root, all stacked side by side. Comparing lightweight LXCs against the heavy compute boxes is a glance, not a calculation.
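
Same idea for these panels, using the usual memory and filesystem metrics (again a sketch; the dashboard may phrase them differently):

    # Memory used and memory %
    node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
    100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

    # Disk usage % at root
    100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})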

Memory used and percentage, disk usage at root, traffic panels with per-host breakdown

Network layer — inbound and outbound throughput per host, plus separate panels for LAN traffic (internal hopping between containers) and WAN traffic (egress through the firewall). Useful for catching a runaway service hammering the network before it shows up as a complaint elsewhere.
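
The throughput panels are rates over the interface counters. A sketch, with the device filter as an assumption since the LAN/WAN split depends on which interfaces carry which traffic:

    # Inbound / outbound bytes per second, per host, excluding loopback
    rate(node_network_receive_bytes_total{device!="lo"}[5m])
    rate(node_network_transmit_bytes_total{device!="lo"}[5m])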

Disk usage trends over time, per-host inbound and outbound traffic, plus LAN and WAN traffic separated

Stack

  • Prometheus scraping every host on 60s intervals
  • Node Exporter on every Proxmox node, every LXC, every VM
  • Grafana with custom dashboard JSON (one consolidated overview, edit-friendly variables)
  • ntfy.sh push notifications wired through Alertmanager rules
  • Docker Compose for the Prometheus + Grafana services themselves (see the sketch after this list)
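
A rough shape of that Compose file, assuming stock images and default ports; the volume names and file paths are placeholders rather than the actual repo layout:

    # docker-compose.yml (sketch)
    services:
      prometheus:
        image: prom/prometheus:latest
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
          - prom-data:/prometheus
        ports:
          - "9090:9090"
        restart: unless-stopped

      grafana:
        image: grafana/grafana:latest
        volumes:
          - grafana-data:/var/lib/grafana
        ports:
          - "3000:3000"
        restart: unless-stopped

      alertmanager:
        image: prom/alertmanager:latest
        volumes:
          - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
        ports:
          - "9093:9093"
        restart: unless-stopped

    volumes:
      prom-data:
      grafana-data: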

Design decisions worth naming

Node Exporter on everything, no exceptions. Cost is ~10MB of RAM per host. The value is that nothing is invisible when something behaves weirdly — the data is already there. Curating “what’s worth monitoring” up-front is premature optimization for a homelab.

60-second scrape interval. Faster intervals (15s, 5s) sound better, but they roughly quadruple the storage cost at 15s and multiply it twelvefold at 5s without changing how I actually use the data. 60s is fine for everything except acute incidents, and during acute incidents I’m on the host directly with top.

One consolidated dashboard, not twelve specialty ones. I tried separating by category early on. Switching context between dashboards broke the muscle memory — every glance became a navigation problem. One dashboard with everything, scroll to find what you need. Boring, but it works.

What’s next

  • Loki for log aggregation so metrics and logs share one query surface in Grafana
  • Alertmanager rules tuned by severity — currently every alert is the same level, which trains me to ignore the low-priority ones (see the routing sketch after this list)
  • Sanitized dashboard JSON pushed to a clean public repo
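
A sketch of what that severity tuning could look like. It assumes alerts carry a severity label and that ntfy sits behind a small webhook bridge (Alertmanager's webhook payload isn't something ntfy parses natively), so the receiver URLs are placeholders:

    # alertmanager.yml (sketch): route by severity instead of one flat level
    route:
      receiver: ntfy-default
      group_by: [alertname, instance]
      routes:
        - matchers: ['severity="critical"']
          receiver: ntfy-critical
          repeat_interval: 1h      # keeps nagging until it's fixed
        - matchers: ['severity="warning"']
          receiver: ntfy-warning
          repeat_interval: 12h     # quiet enough to stay worth reading

    receivers:
      - name: ntfy-critical
        webhook_configs:
          - url: "http://ntfy-bridge:8080/critical"   # placeholder endpoint
      - name: ntfy-warning
        webhook_configs:
          - url: "http://ntfy-bridge:8080/warning"
      - name: ntfy-default
        webhook_configs:
          - url: "http://ntfy-bridge:8080/default"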

Field notes

The original public repo for this stack was deleted during a May 2026 security audit. Dashboard JSON exports were embedding the data source URL field — which contained internal IPs — into commit history. The work itself wasn’t the problem; the publishing workflow was. Tightened the dual-repo discipline (private internal source → sanitized public mirror) and the next push will run through that gate. The screenshots above are the visible artifact in the meantime.
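
The gate itself is small. A sketch in Python of the kind of scrubber that sits between the private export and the public mirror; the directory names and the set of scrubbed fields are assumptions, not the actual tooling:

    #!/usr/bin/env python3
    """Blank datasource details in Grafana dashboard JSON before publishing.

    Assumptions (not the real repo layout): private exports live in
    dashboards/, sanitized copies land in public/, and anything under a
    "url" or "uid" key, or a string "datasource" field, gets redacted."""

    import json
    import pathlib

    SRC = pathlib.Path("dashboards")   # private exports (assumed path)
    DST = pathlib.Path("public")       # sanitized mirror input (assumed path)
    SCRUB_KEYS = {"url", "uid"}        # fields that can leak internal addresses


    def scrub(node):
        """Recursively blank scrub keys and datasource references in place."""
        if isinstance(node, dict):
            for key, value in node.items():
                if key in SCRUB_KEYS and isinstance(value, str):
                    node[key] = "REDACTED"
                elif key == "datasource" and isinstance(value, str):
                    node[key] = "REDACTED"
                else:
                    scrub(value)
        elif isinstance(node, list):
            for item in node:
                scrub(item)


    def main():
        DST.mkdir(exist_ok=True)
        for path in sorted(SRC.glob("*.json")):
            dashboard = json.loads(path.read_text())
            scrub(dashboard)
            (DST / path.name).write_text(json.dumps(dashboard, indent=2) + "\n")
            print(f"sanitized {path.name}")


    if __name__ == "__main__":
        main()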

