03//lab

The rack under the desk,
run like a cloud provider.

Three Talos Kubernetes clusters today: core, dev, and prod. A fourth is coming once the second site is live. Etcd has three votes everywhere. DNS rides a VIP. The edge splits into internal and public gateways. The whole thing is declared in git, delivered by ArgoCD, watched by a self-hosted LGTM stack. This site is served from it.

The secondary AZ is offline. Hardware is in the middle of moving between sites, so prod is running on one AZ until it's back. Everything below shows the current state next to the multi-AZ plan.

  • 3 Talos clusters today — core · dev · prod (4th planned for second AZ)
  • 9 Kubernetes nodes today — 12 once the second AZ is up
  • 3 bare-metal EliteDesks running prod (+3 more planned for the second AZ)
  • 3 Technitium DNS instances behind a VIP
  • 13 Ansible-managed Linux hosts
  • 6 VLANs — mgmt · trusted · DMZ · IoT · guest · clients

01/How it's designed

Architecture directions.

Four things the lab is designed around

Four design principles run through the lab. The rest of the page shows how each one is wired up.

Hardware redundancy

Every Talos cluster runs a three-node etcd quorum. Three Technitium DNS instances sit behind a keepalived VIP with AXFR replication. Storage replication and LGTM HA aren’t there yet — the HA diagram below shows where each one stands.

High availability

Services self-heal through Kubernetes. The edge is split: internal and public traffic land on separate Envoy Gateways, each with its own IPs and policies. Per-cluster Cloudflare tunnels run two replicas. Private PKI and observability are still single-instance, and both are queued for HA work.
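
The split shows up directly in the Gateway API objects. A minimal sketch, assuming Envoy Gateway's default "eg" GatewayClass; the namespace, addresses, and certificate secret names are illustrative:

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: eg-internal
      namespace: edge                     # assumed namespace
    spec:
      gatewayClassName: eg                # assumed GatewayClass name
      addresses:
        - type: IPAddress
          value: 10.0.30.10               # internal MetalLB IP (illustrative)
      listeners:
        - name: https
          protocol: HTTPS
          port: 443
          tls:
            mode: Terminate
            certificateRefs:
              - name: internal-wildcard-tls   # assumed cert-manager secret
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: eg-public
      namespace: edge
    spec:
      gatewayClassName: eg
      addresses:
        - type: IPAddress
          value: 10.0.30.11               # public-path IP, reached via the tunnel
      listeners:
        - name: https
          protocol: HTTPS
          port: 443
          tls:
            mode: Terminate
            certificateRefs:
              - name: public-wildcard-tls     # assumed cert-manager secret

Two Gateway objects mean two sets of listeners, IPs, and route-attachment policies; public traffic never shares a data path with internal traffic.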

GitOps all the way down

Every change is a commit. ArgoCD reconciles workloads, Talos holds cluster state, Ansible holds host state. Rollbacks are a git revert and a webhook. Production promotions go through an auto-generated PR that a human still has to merge.
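
Roughly what one reconciled app looks like as an ArgoCD Application; the repo URL, path, and namespaces are placeholders, not the real layout:

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: kian-coffee
      namespace: argocd
    spec:
      project: prod
      source:
        repoURL: https://github.com/example/homelab   # assumed GitOps repo
        targetRevision: main
        path: apps/kian-coffee/overlays/prod          # assumed layout
      destination:
        server: https://kubernetes.default.svc
        namespace: kian-coffee
      syncPolicy:
        automated:
          prune: true      # deletions in git delete in the cluster
          selfHeal: true   # drift snaps back, so a git revert is a rollback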

Multi-AZ by design

A second AZ is wired up over a UniFi site-to-site VPN. The hardware is mid-move between sites right now, so production runs on one AZ until it lands. DNS zones, cluster naming, and storage already assume a second site, so bringing it back is an addition rather than a rewrite.

02/Diagram 01

Five layers, one focal plane.

Infrastructure → workloads

Five layers between bare metal and a running pod. Every one of them is boring, which is the point. Adding a new app on top barely touches the stack below.

Homelab platform stack: five abstraction layers from bare-metal infrastructure up to application workloads. Platform services is highlighted as the focal layer — the cloud-native plane that ties everything together.

  • L1 Infrastructure (physical · hypervisor): 3× Proxmox · 3× bare-metal EliteDesks + 2× RPi edge · 2× NAS
  • L2 Operating system (immutable · minimal): Talos Linux — API-driven, no SSH
  • L3 Kubernetes (clusters · MetalLB L2): core · dev · prod — 9 control/worker nodes
  • L4 Platform services (the cloud-native plane): ArgoCD · Envoy GW · cert-manager · CNPG · external-dns · Dragonfly · LGTM observability
  • L5 Workloads (apps · services · game servers): kian.coffee · techgarden.gg · Hausparty · Pelican

Source of truth: declarative inventory + cluster configs

03/Diagram 02

How a commit becomes a pod.

GitOps end-to-end

Every app ships the same way. Push to main, CI builds and publishes an image, a dispatch event tells the homelab repo to bump the tag, ArgoCD reconciles, Talos rolls. Prod promotions go through an auto-generated PR that a human still has to merge.

GitOps deployment flow: a code push triggers GitHub Actions, which builds and publishes a container image to ghcr.io and dispatches an image-tag update to the homelab repo. ArgoCD notices the commit and syncs the Talos prod cluster, which pulls the new image.

  Code push (main · any app repo) → GitHub Actions (deployment.yml) → ghcr.io (image registry) → image-tag dispatch (homelab repo) → ArgoCD (watches · reconciles) → Talos prod (3× bare-metal · rolling)

Source of truth: app CI + homelab GitOps repo
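
A sketch of the app-side deployment.yml under stated assumptions: the repository-dispatch action, the HOMELAB_PAT secret, and the example/homelab repo name are stand-ins, not the exact pipeline:

    name: deploy
    on:
      push:
        branches: [main]
    jobs:
      build:
        runs-on: ubuntu-latest
        permissions:
          contents: read
          packages: write
        steps:
          - uses: actions/checkout@v4
          - uses: docker/login-action@v3
            with:
              registry: ghcr.io
              username: ${{ github.actor }}
              password: ${{ secrets.GITHUB_TOKEN }}
          - uses: docker/build-push-action@v6
            with:
              push: true
              tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          # Dispatch tells the homelab repo which image tag to bump
          - uses: peter-evans/repository-dispatch@v3
            with:
              token: ${{ secrets.HOMELAB_PAT }}       # assumed PAT secret
              repository: example/homelab             # assumed GitOps repo
              event-type: image-tag-update
              client-payload: '{"image": "ghcr.io/${{ github.repository }}", "tag": "${{ github.sha }}"}'
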
04/Diagram 03

What runs where, physically.

Three tiers · one production line today

Three tiers of physical compute. Production runs straight on bare-metal EliteDesks, so there's no hypervisor in the critical path. The Proxmox hosts carry the core and dev clusters as VMs. Edge services like DNS and load balancing, plus storage, sit on Raspberry Pis and TrueNAS boxes. The second AZ's bare-metal tier will mirror this one once the move finishes.

Hardware topology: three tiers of physical hosts — bare-metal prod (focal), Proxmox hypervisors, and edge plus storage. Each tier is independently sized for its role; bare-metal prod carries the production Talos control plane.

  • T1 Bare-metal prod (talos-prod · control plane): ed-n1, ed-n2, ed-n3 (HP EliteDesk)
  • T2 Hypervisors (talos-core + talos-dev VMs): hx90 (Minisforum HX90) · bd-n1, bd-n2 (Minisforum BD795i)
  • T3 Edge + storage (DNS primary · edge LB · NAS ×2): rpi-n1 (RPi 5 · DNS) · rpi-n2 (RPi 5 · edge LB) · cm-nas, jb-nas (TrueNAS)

Hardware: redundancy by tier, not by host
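
For flavor, the kind of Talos machine-config patch a bare-metal prod node gets; the disk, hostname, and VIP are illustrative:

    machine:
      install:
        disk: /dev/nvme0n1        # assumed install target
      network:
        hostname: ed-n1
        interfaces:
          - interface: eth0
            dhcp: true
            vip:
              ip: 10.0.30.5       # shared control-plane VIP (assumed)

No SSH and no package manager: talosctl apply-config is the only way in, which is also why these nodes sit outside the Ansible roster further down.
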
05/Diagram 04

Six VLANs, one firewall.

Segmentation by trust tier

Web traffic enters only through a Cloudflare Tunnel that dials out from inside the network. No HTTP service has an inbound port forward. Six VLANs segment the network by trust tier. The firewall denies cross-tier traffic by default and allows only what's needed (Trusted Clients → Servers, plus a few IoT exceptions).

Network topology and VLAN posture: six VLANs segmented by trust tier. Public traffic arrives only via an outbound-only Cloudflare Tunnel — no inbound port forwards. Servers (Trusted) is the focal tier; Trusted Clients can reach it, Untrusted Servers and IoT cannot.

  Internet (web: tunnel only) → Cloudflare Tunnel (cloudflared · 2 replicas per cluster · outbound only) → edge firewall (deny-by-default between tiers)

  • Management: hypervisors · switches · admin only (isolated)
  • Trusted Clients: personal devices · laptops · phones (allow → Servers)
  • Servers (Trusted): talos core · dev · prod · NAS · observability (focal)
  • Servers (Untrusted): game servers · direct port-forward path that bypasses K8s (deny → Servers)
  • Guest: visitors · internet-only (deny → LAN)
  • IoT: smart-home devices · explicit exceptions only (deny, with exceptions)

Policy: posture, not addresses
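
The tunnel half of that posture, sketched as a cloudflared config file; the tunnel name, credentials path, and in-cluster Service address are assumptions:

    tunnel: homelab-prod                            # assumed tunnel name
    credentials-file: /etc/cloudflared/creds.json   # assumed path
    ingress:
      - hostname: kian.coffee
        service: https://eg-public.edge.svc.cluster.local:443   # assumed Service for the public gateway
        originRequest:
          originServerName: kian.coffee
      # No catch-all forwarding: anything unmatched is refused
      - service: http_status:404

cloudflared dials out to Cloudflare and holds the connection open, so the firewall never needs an inbound allow rule for web traffic.
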
06/Diagram 05

What's replicated, what isn't (yet).

Redundant today vs. single-instance today

Reliability is never finished. The left column is what's already redundant. The right column is what still runs as a single instance, with the next step lined up for each one.

HA redundancy map: the left column lists services that are already redundant — DNS, Kubernetes control planes, Cloudflare tunnels, the dual-gateway edge. The right column is the work ahead — observability, storage, private PKI, and GitOps control plane — each with a mitigation path in flight.

Redundant today:
  • Technitium DNS ×3: keepalived VIP + AXFR replication
  • talos-core control plane ×3: etcd quorum
  • talos-dev control plane ×3: etcd quorum
  • talos-prod ×3: etcd quorum on bare metal
  • cloudflared ×2 per cluster: outbound tunnels
  • Split-gateway edge: eg-internal + eg-public
  • cert-manager + step-issuer: K8s-managed, self-healing

Single point of failure, mitigation in flight:
  • LGTM observability ×1 → migrate to Kubernetes
  • Primary NAS ×1 → async replication to second NAS
  • step-ca (private PKI) ×1 → HA pair planned
  • ArgoCD control plane ×1 → run across multiple clusters

Roadmap: no SPoF without a mitigation queued up. Multi-AZ replication is the long-term backstop (secondary AZ · site-to-site VPN · cross-AZ DNS).

07/The roster

The hosts, dynamically generated from the homelab's inventory.

13 Ansible-managed hosts

Bare-metal Talos nodes are configured through talosctl, so they show up in the cluster topology above but not in this roster. The roster below is generated from the homelab's inventory file, so it reflects whatever is actually deployed.
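
A minimal sketch of the inventory shape the roster is rendered from; the group and host names mirror the roster below, while the file layout and the dns_role variable are assumptions:

    all:
      children:
        arr:
          hosts:
            arr-vm:
        dns:
          hosts:
            rpi-n1: { dns_role: primary }
            dns-n2: { dns_role: secondary }
            dns-n3: { dns_role: secondary }
        lgtm:
          hosts:
            lgtm-vm:
        nas:
          hosts:
            cm-nas:
            jb-nas:
        proxmox:
          hosts:
            hx90:
            bd-n1:
            bd-n2:
        raspbian:
          hosts:
            rpi-n1:
            rpi-n2:
        wings:
          hosts:
            wings-n1:
            wings-n2: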

Media VM

1× arr

Media library automation on a dedicated VM.

  • arr-vm

DNS node

3× dns

Technitium DNS — 3 instances behind a keepalived VIP. rpi-n1 is the primary; dns-n2 + dns-n3 are secondaries.

  • dns-n2
  • dns-n3
  • rpi-n1

Observability VM

1× lgtm

Self-hosted Loki + Grafana + Tempo + Mimir stack.

  • lgtm-vm

NAS

2× nas

ZFS storage with Cloud Sync + RSync for 3-2-1 backup.

  • cm-nas
  • jb-nas

Proxmox host

3× proxmox

Hypervisor hosts running most VM-backed clusters.

  • bd-n1
  • bd-n2
  • hx90

Raspberry Pi

2× raspbian

Low-power utility nodes — rpi-n1 runs the DNS primary; rpi-n2 is the edge load balancer.

  • rpi-n1
  • rpi-n2

Game server node

2× wings

Pelican game-server control plane on the untrusted VLAN.

  • wings-n1
  • wings-n2

08/Named tools

The stack, in words.

Full named-service inventory

Infrastructure

Physical hosts, hypervisor, networking

HP EliteDesk (×3, bare-metal prod) · Minisforum HX90 / BD795i (×3, Proxmox) · Raspberry Pi 5 (×2) · TrueNAS (×2) · VLAN segmentation

Operating system + Kubernetes

Immutable OS, 3 Talos clusters, MetalLB L2

Talos Linux · Kubernetes · MetalLB · Envoy Gateway · Gateway API
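
MetalLB's L2 mode is what pins cluster services (the gateways included) to real LAN IPs. A sketch with an illustrative pool:

    apiVersion: metallb.io/v1beta1
    kind: IPAddressPool
    metadata:
      name: edge-pool
      namespace: metallb-system
    spec:
      addresses:
        - 10.0.30.10-10.0.30.20   # assumed range on the servers VLAN
    ---
    apiVersion: metallb.io/v1beta1
    kind: L2Advertisement
    metadata:
      name: edge-l2
      namespace: metallb-system
    spec:
      ipAddressPools:
        - edge-pool               # answer ARP for this pool on the local L2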

Platform services

The cloud-native plane that ties it all together

ArgoCD · Argo Workflows · cert-manager · external-dns · Technitium DNS · CloudNativePG · Dragonfly · External Secrets Operator · Bitwarden Secrets Manager · Reloader
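
One way those pieces compose: a cert-manager Certificate signed by the private PKI through step-issuer. The issuer name, namespace, and internal zone are assumptions:

    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: argocd-internal
      namespace: argocd
    spec:
      secretName: argocd-internal-tls
      dnsNames:
        - argocd.lab.example         # assumed internal zone
      issuerRef:
        group: certmanager.step.sm   # step-issuer's API group
        kind: StepClusterIssuer
        name: step-ca                # assumed issuer name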

Observability

LGTM stack, self-hosted

Grafana · Loki · Tempo · Mimir · Grafana Alloy · Prometheus Operator

Delivery pipeline

GitOps end-to-end, PRs promote dev → prod

GitHub Actions · ghcr.io · auto-promoted image tags · Renovate

Apps running here

Workloads shipped from personal repos

kian.coffee · techgarden.gg · Hausparty · Pelican + Wings (game servers)