Putting the platform under Ansible
What broke
Both Cloudflare tunnels went down. The status page with them.
The Proxmox host running my edge services (ava-dc-02) had crashed. I SSHed in to recover, hit a wall: the LXC running cloudflared wouldn’t accept my login. My docs said the user was sbx. The container only had root, with a password I’d forgotten.
The platform reference had been drifting from reality. Nothing forced it to match.
What I built
One Ansible repo. One role. Eight hosts under management:
- 6 LXCs (noc-board, uptime-kuma, pi-hole, nginx, vaultwarden, gitea)
- 2 Proxmox hosts (ava-dc-01, ava-dc-02)
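The hostnames above are real; the group names below are my own guess at the layout. A minimal inventory for those eight hosts might look like:

```ini
; inventory.ini — illustrative sketch; group names and the
; ansible_user variable are assumptions, not the actual repo
[lxc]
noc-board
uptime-kuma
pi-hole
nginx
vaultwarden
gitea

[proxmox]
ava-dc-01
ava-dc-02

[all:vars]
ansible_user=ansible
```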
The baseline role does four things on every host:
- Creates an ansible service account
- Authorizes my SSH key
- Grants passwordless sudo
- Hardens sshd (passwords off, root key-only)
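The four steps above can be sketched as role tasks. The module names are standard Ansible; the key path, sudoers layout, and exact sshd settings are assumptions, not the repo's actual contents:

```yaml
# roles/baseline/tasks/main.yml — illustrative sketch only
- name: Create the ansible service account
  ansible.builtin.user:
    name: ansible
    shell: /bin/bash

- name: Authorize the management SSH key
  ansible.posix.authorized_key:
    user: ansible
    key: "{{ lookup('file', '~/.ssh/id_ed25519.pub') }}"  # assumed key path

- name: Grant passwordless sudo
  ansible.builtin.copy:
    dest: /etc/sudoers.d/ansible
    content: "ansible ALL=(ALL) NOPASSWD:ALL\n"
    mode: "0440"
    validate: visudo -cf %s

- name: Harden sshd (passwords off, root key-only)
  ansible.builtin.lineinfile:
    path: /etc/ssh/sshd_config
    regexp: "^#?{{ item.key }}"
    line: "{{ item.key }} {{ item.value }}"
  loop:
    - { key: PasswordAuthentication, value: "no" }
    - { key: PermitRootLogin, value: prohibit-password }
  notify: Restart sshd
```

Every module here is idempotent, which is what makes "run it again: nothing changes" hold.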
Run it once: hosts change to match. Run it again: nothing changes. The drift problem is structurally impossible now.
The safety net
Hardening sshd can lock you out. So between authorizing the key and disabling password auth, the playbook SSHes from my Mac to the target as the new user. If that fails, the play aborts before touching sshd. The step that could lock me out cannot run unless I’ve already verified I’m in.
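That gate can be expressed as a task delegated back to the control machine. This is a sketch of the pattern, not the playbook's actual task:

```yaml
# Illustrative: prove the new user can log in before sshd is touched.
# Runs the ssh client on my Mac (the control node), not on the target.
- name: Verify SSH access as the new user from the control machine
  ansible.builtin.command: >
    ssh -o BatchMode=yes -o ConnectTimeout=5
    ansible@{{ inventory_hostname }} true
  delegate_to: localhost
  changed_when: false
  # A non-zero exit fails the play here, so the hardening
  # tasks that follow never run on a host I can't reach.
```

BatchMode=yes is the important flag: it forbids password prompts, so the check only passes if key auth for the new user actually works.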
The bootstrap trick
Chicken-and-egg: how do you Ansible-manage hosts that don’t have your user yet?
The Proxmox hosts already had my SSH key. So the first playbook targets Proxmox and uses pct exec to inject the ansible user into every LXC from the outside. One run. Every container. No per-host password fumbling.
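A sketch of that bootstrap, run against the Proxmox group. The lxc_ids variable and the public-key variable are assumptions; pct exec is Proxmox's real tool for running commands inside a container from the host:

```yaml
# Illustrative bootstrap tasks — run on the Proxmox hosts,
# which already trusted my key. lxc_ids is an assumed per-host
# variable listing each host's container IDs.
- name: Create the ansible user inside each LXC from the outside
  ansible.builtin.command: >
    pct exec {{ item }} -- useradd -m -s /bin/bash ansible
  loop: "{{ lxc_ids }}"

- name: Install the authorized key inside each LXC
  ansible.builtin.command:
    argv:
      - pct
      - exec
      - "{{ item }}"
      - --
      - su
      - ansible
      - -c
      - "mkdir -p -m 700 ~/.ssh && echo '{{ mgmt_pubkey }}' > ~/.ssh/authorized_keys"
  loop: "{{ lxc_ids }}"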
What this actually changes
Before: docs claimed things that weren’t true. Recovery meant manual SSH and remembering passwords.
After: the playbooks are the docs. They can’t drift — they either apply cleanly (state matches) or they fail (I notice). A destroyed LXC is a five-minute rebuild from the repo, not a memory exercise.
Honest scope
This isn’t impressive Ansible. One role, five tasks. It doesn’t deploy services yet. The cloudflared restart that the original (untested) bash recovery script was supposed to automate still isn’t automated.
But the spine is in. Every future role hangs off the same inventory and connection model. Cloudflared, nginx rebuild, Vaultwarden config — each one is a small role in the same repo, not a new project.
Next
- GitHub Actions CI — lint and syntax-check every push
- Cloudflared role — actually automate today’s incident
- Self-hosted runner — push to main, lab reconfigures itself
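For the CI item, a minimal workflow would be a few lines. This is a hypothetical sketch; the playbook name site.yml and the workflow layout are assumptions:

```yaml
# .github/workflows/ci.yml — hypothetical sketch of the planned lint step
name: ci
on: push
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install ansible ansible-lint
      - run: ansible-lint
      - run: ansible-playbook --syntax-check site.yml  # assumed entry playbook
```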