When the firewall isn't the edge

What broke

This happened over a few weeks last fall; I’m writing it up now because the lesson stuck harder than the fix did.

Nothing broke cleanly. That was the problem.

Remote access worked, then didn’t. Port forwards I’d set up correctly refused to behave. VPN handshakes that should have been trivial timed out for no reason I could see from the firewall. A few services on the lab network were reachable internally but invisible from outside. And the lab scenarios I run to mimic real edge routing weren’t matching real edge behavior. Traceroutes looked wrong in a way I couldn’t immediately explain.

Six unrelated symptoms. The instinct is to chase six fixes. The discipline is to ask what single thing, if it were wrong, could produce all of them at once.

The cause I didn’t own

The firewall’s WAN interface was pulling a private RFC1918 address.

That one fact explains every symptom above. A WAN interface is supposed to receive a routable public address. If it’s getting a private one, something upstream is handing out DHCP leases and doing NAT before traffic ever reaches the firewall. The ISP-supplied gateway was never actually in pass-through mode; it was still a router, still translating, and my firewall was a second NAT layer stacked behind it.
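The check itself is small enough to script. What follows is a minimal sketch, not anything taken from the firewall: it assumes you have already read the WAN address off the interface by hand, and the function names and sample addresses are hypothetical. It tests against the three RFC1918 blocks explicitly rather than relying on a broader "is private" flag, so it answers exactly the question asked above.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the address falls inside any RFC 1918 block."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


def wan_is_behind_upstream_nat(wan_address: str) -> bool:
    """Hypothetical helper: a private WAN address implies something
    upstream is translating before traffic reaches this device."""
    return is_rfc1918(wan_address)


if __name__ == "__main__":
    # Sample addresses are illustrative only (203.0.113.0/24 is a
    # documentation range, standing in for a real public address).
    for candidate in ("192.168.1.10", "203.0.113.5"):
        verdict = "behind upstream NAT" if wan_is_behind_upstream_nat(candidate) else "looks like a true edge"
        print(f"{candidate}: {verdict}")
```

A private result is not proof of a misconfigured gateway on its own, but on a device that is supposed to be the edge it is the first thing worth ruling out.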

Double NAT doesn’t fail loudly. It fails selectively, which is worse for diagnosis. Outbound browsing works fine, so nothing looks obviously dead. But anything that depends on the firewall genuinely being the edge (inbound port forwarding, VPN endpoints, predictable outbound translation, anything that assumes the WAN address is reachable from the outside) degrades in ways that look like six different bugs. They aren’t. They’re one architectural fault wearing six masks.

The thing that turned a quick fix into a project: I couldn’t get usable administrative control of the upstream gateway to disable its routing. When you can’t change the device causing the problem, “just put it in bridge mode” stops being a config change and becomes a hardware decision.

The fix

Make the firewall the actual edge, not a tenant behind someone else’s NAT.

I replaced the ISP gateway’s role with a DOCSIS 3.1 modem that supports true bridge mode: pure layer-2, no routing, no translation, no DHCP of its own. The topology collapsed to what it should have been from the start:

Edge topology after the fix: ISP → bridged modem → pfSense firewall (the single NAT boundary) → managed switch → lab network

One NAT boundary. One device making routing decisions. The firewall’s WAN interface came up with a real public address, and the six symptoms resolved together: not because I fixed six things, but because there was only ever one thing.

What that bought, concretely: reliable inbound VPN, predictable outbound NAT, port forwarding that does what the rule says, and (the part I care about for lab work) edge behavior that actually matches what I’m trying to simulate. You can’t model realistic ISP routing from inside a double-NAT sandwich.

The second bite

Fixing the edge surfaced a problem the edge had been masking.

Parts of my real LAN and parts of a simulated ISP environment in the lab were addressed out of the same RFC1918 range. While everything was buried behind double NAT, the overlap was hidden; traffic never got far enough to collide. Once the firewall was the true edge and routing actually worked end to end, the overlap turned into blackholed routes: two networks claiming the same space, and a router with no good way to tell which one a packet was meant for.

That wasn’t caused by the NAT fix. It was revealed by it. Worth saying plainly, because it’s the part that’s easy to get wrong in your own head: solving a foundational problem often doesn’t reduce your problem count. It uncovers the ones the foundational problem was hiding. The address-overlap conflict was its own separate piece of work, and it deserved to be treated as one rather than blamed on the migration.
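Address overlap of this kind can be caught on paper before it shows up as dropped traffic. The sketch below assumes you keep a simple inventory of subnets by name; the network names and CIDR ranges are hypothetical, since the actual ranges involved aren't given here. It compares every pair and reports the ones that claim the same space.

```python
import ipaddress
from itertools import combinations


def find_overlaps(named_networks: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of network names whose address ranges overlap."""
    parsed = {
        name: ipaddress.ip_network(cidr)
        for name, cidr in named_networks.items()
    }
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(parsed.items(), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    # Hypothetical inventory: a home LAN, a simulated ISP range, a lab core.
    inventory = {
        "home_lan": "192.168.1.0/24",
        "lab_isp_sim": "192.168.0.0/16",
        "lab_core": "10.10.0.0/24",
    }
    for name_a, name_b in find_overlaps(inventory):
        print(f"overlap: {name_a} and {name_b}")
```

Running a check like this against the address plan before and after a topology change makes it easier to tell a conflict the change introduced from one it merely exposed.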

What it actually taught me

Two things I keep.

First: when unrelated symptoms cluster in time, distrust the symptom list and look for the shared assumption underneath it. Every one of those six failures assumed the firewall was the edge. It wasn’t. Find the assumption, not the symptoms.

Second: “the edge” is an architectural position, not a device. A firewall behind another router isn’t an edge firewall. It’s an expensive internal hop with opinions. The fix wasn’t a setting. It was deciding, structurally, which device gets to be the boundary, and then making the topology actually say that.

Most of the operational pain I’ve debugged since traces back to the same root: a box doing a job that something upstream was quietly also doing. This was just the first time the lesson cost a hardware purchase to learn properly.