By flo in kubernetes — 03 Jun 2026

Migrating to Talos Linux

How I migrated my clusters to Talos Linux — the mindset shift, the step-by-step, and a deep storage chapter on ditching Longhorn for Rook Ceph to fix flaky RWX.

I've written about why I fell for Talos and how it stacks up against RKE2. This one is the practical follow-up: how I actually migrated my clusters to Talos Linux — and the one chapter everybody underestimates, storage, where I eventually ripped out Longhorn and replaced it with Rook Ceph.

If you're coming from a "normal" distro on a "normal" OS, the migration will bend your brain in one specific way first. Let's start there.

The mindset shift: you don't migrate a node, you replace it

There is no apt install talos. You can't convert your Ubuntu box into a Talos box in place. Talos is an immutable appliance OS — the node is defined entirely by a machine config and booted from an image. So "migrating to Talos" really means standing up new Talos nodes and moving workloads onto them, then retiring the old ones.

In practice you pick one of two paths:

Parallel cluster — build a fresh Talos cluster alongside the old one and migrate workloads across. Cleanest, safest, needs spare capacity.
Rolling replace — drain a node, reinstall it as Talos, rejoin, repeat. Cheaper on hardware, more nerve-wracking, and your storage layer has to tolerate nodes leaving and returning.

I'll be honest: the parallel-cluster route is the one I'd recommend if you can afford the hardware, precisely because of the storage migration we'll get to.

The migration, step by step

1. Inventory what's actually stateful. Stateless workloads are trivial: they're just YAML you reapply. The entire risk of the migration lives in your PersistentVolumes. Write down every PVC, its size, its access mode (RWO vs RWX; remember that one), and where its data needs to end up.
2. Decide your extensions and kernel modules up front. This is the Talos-specific gotcha. System extensions are baked into the boot image (via the Image Factory schematic), so you choose them before you install, not after. Storage drivers, in particular, depend on this. More below.
3. Generate machine configs. talosctl gen config <cluster> https://<vip>:6443 gives you controlplane.yaml, worker.yaml, and a talosconfig. Patch them for your install disk, network, VIP, and CNI.
4. Bootstrap the control plane. Apply the config to your first control-plane node, then run talosctl bootstrap exactly once, against that node. Join the rest.
5. Bring your own CNI. I run Cilium; if you're going that route, disable Talos's default CNI (Flannel) and, if Cilium is replacing it, kube-proxy in the machine config, then install Cilium as the first thing on the cluster.
6. Reapply workloads via GitOps. If your manifests live in Git (they should), this is mostly pointing Argo CD or Flux at the new cluster and watching it reconcile.
7. Migrate the data. The scary one. Restore from volume backups (Velero, or your CSI's snapshots) into the new storage class. CSI snapshots are generally tied to the driver that took them, so if the new storage class sits on a different backend, plan on a file-level backup instead. Do not cut over production until you've test-restored at least one real volume end to end.
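The patching and CNI steps above can be sketched as a single machine config patch. This is a rough illustration, not my exact config: the disk path, interface name, and VIP address are placeholders I made up, and the field names follow the v1alpha1 machine config as I understand it, so check them against the reference for your Talos version.

```yaml
# patch.yaml: hypothetical control-plane patch; all values are placeholders
machine:
  install:
    disk: /dev/sda            # assumption: the disk Talos installs to
  network:
    interfaces:
      - interface: eth0       # assumption: your NIC name
        dhcp: true
        vip:
          ip: 192.168.1.100   # assumption: the shared control-plane VIP
cluster:
  network:
    cni:
      name: none              # skip the default Flannel so Cilium can be installed
  proxy:
    disabled: true            # only if Cilium is replacing kube-proxy
```

A patch like this can be passed to talosctl gen config with --config-patch-control-plane @patch.yaml. Workers would need a similar patch without the vip block, since the VIP belongs to the control plane only.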

Everything above is the easy 80%. The storage layer is the 20% that will actually decide whether your migration is pleasant — so it gets its own chapter.

The storage chapter: from Longhorn to Rook Ceph

Storage is the hardest part of any Talos migration for one structural reason: the rootfs is read-only, there's no SSH to "just fix it," and storage drivers reach deep into the kernel. On a normal distro you'd modprobe something or apt install open-iscsi and move on. On Talos, every one of those needs is expressed declaratively — as an extension or a kernel module — and it has to be right before the node boots.

Longhorn on Talos

I started with Longhorn, and for good reasons: it's lightweight, has a genuinely lovely UI, and offers replicated block storage that Just Works for RWO volumes. Getting it running on Talos takes a few deliberate steps:

Two system extensions baked into the image: siderolabs/iscsi-tools (Longhorn talks iSCSI under the hood) and siderolabs/util-linux-tools.
A bind mount so Longhorn's data survives, since the rootfs is immutable:

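machine:
  kubelet:
    extraMounts:
      # Expose the host's Longhorn data directory to the kubelet container
      - destination: /var/lib/longhorn
        type: bind
        source: /var/lib/longhorn
        options:
          - bind
          - rshared   # propagate mounts in both directions between host and kubelet
          - rw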

And the Longhorn namespace labelled pod-security.kubernetes.io/enforce: privileged.
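For the other two items in that list, here is roughly what the pieces look like. The first block is an Image Factory schematic carrying the two extensions named above; the second is the namespace with the pod-security label. The namespace name longhorn-system is Longhorn's usual default, but treat it as an assumption if your install differs.

```yaml
# schematic.yaml: Image Factory schematic with the two Longhorn extensions
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools
      - siderolabs/util-linux-tools
```

```yaml
# Namespace manifest; the name is an assumption based on Longhorn's default
apiVersion: v1
kind: Namespace
metadata:
  name: longhorn-system
  labels:
    pod-security.kubernetes.io/enforce: privileged
```

Submitting the schematic to the Image Factory yields a schematic ID, and the install image for your nodes is then built from that ID plus the Talos version.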

For RWO workloads — databases, single-writer apps — Longhorn was great, and if that's all you need it's still a fine choice on Talos.

Where it hurt: RWX volumes. Longhorn implements ReadWriteMany by spinning up a per-volume share-manager pod that re-exports the volume over NFS (Ganesha). It works… until a node reboots or the share-manager pod gets rescheduled, and then the NFS client on consumers can hang on a stale mount, or the volume takes its sweet time recovering. On an immutable, reboot-happy platform like Talos — where node replacement is a feature — that fragility shows up more often, not less. I lost more evenings than I'd like to RWX volumes that were "Healthy" in the UI but wedged on the consuming pod.

Switching to Rook Ceph

So I moved to Rook-managed Ceph, and the difference has been night and day. So far it has handled every workload I've put on it, RWX included, without the stale mounts that plagued me on Longhorn.

The reason is architectural. Ceph isn't bolting RWX on as an afterthought — it gives you three real storage types from one cluster:

RBD (RADOS Block Device) → fast RWO block volumes.
CephFS → a genuine POSIX distributed filesystem, which means RWX done properly, not an NFS shim taped onto a block volume.
RGW → an S3-compatible object store, if you want it, for free.

That CephFS distinction is the whole story for me. The RWX workloads that used to wedge on Longhorn just… work now, including across node reboots and reschedules. Ceph was built as a distributed storage system from day one, and it shows.
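From the workload's side, the switch is almost boring. A minimal sketch of an RWX claim against CephFS, assuming a StorageClass named rook-cephfs exists (that name, the claim name, and the size are all hypothetical):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data              # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany              # several pods on different nodes can mount it
  storageClassName: rook-cephfs  # assumption: a StorageClass backed by CephFS
  resources:
    requests:
      storage: 10Gi
```

The claim itself looks the same as it would on Longhorn. What changes is the backend: each consumer mounts the distributed filesystem directly rather than going through a per-volume NFS export.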

On Talos, Rook needs less than you'd fear, because Rook itself runs in containers and orchestrates Ceph via CRDs. The two things to get right:

Kernel modules for RBD (and CephFS) loaded via the machine config:

machine:
  kernel:
    modules:
      - name: rbd    # RBD block devices
      - name: nbd    # network block device, used by the rbd-nbd mounter
      - name: ceph   # CephFS kernel client

Raw block devices for the OSDs. Give Ceph whole, empty disks — not a partition on your boot drive — and let Rook consume them.

After that, a CephCluster CR, a couple of CephBlockPool / CephFilesystem resources with their storage classes, and Rook does the rest.
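To make that concrete, here is a stripped-down sketch of those resources. It is not my production config: the node names, disk names, pool and filesystem names, and the Ceph image tag are all assumptions for illustration, and the storage classes are left out for brevity.

```yaml
# Hypothetical minimal Rook resources; names, image tag, and disks are assumptions
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v19   # assumption: use the tag your Rook release recommends
  dataDirHostPath: /var/lib/rook   # under /var, which is writable on Talos
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
  storage:
    useAllNodes: false
    useAllDevices: false
    nodes:
      - name: worker-1             # hypothetical node name
        devices:
          - name: sdb              # hypothetical whole, empty disk
      - name: worker-2
        devices:
          - name: sdb
      - name: worker-3
        devices:
          - name: sdb
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host              # spread replicas across nodes
  replicated:
    size: 3
---
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: sharedfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - name: replicated
      replicated:
        size: 3
  metadataServer:
    activeCount: 1
    activeStandby: true
```

Listing nodes and devices explicitly is a deliberate choice in this sketch: it keeps Rook away from any disk you did not mean to hand over. As with Longhorn, the rook-ceph namespace will most likely need the privileged pod-security label on Talos.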

The honest tradeoff

I'm not going to pretend Ceph is a free lunch — that wouldn't be in the spirit of this blog:

It's heavier. Ceph wants real RAM and CPU for its MONs, MGRs, and OSDs. On a tiny single-node homelab, it's overkill and you'll feel it.
It really wants at least three nodes and dedicated disks. Ceph's default three-way replication and MON quorum both assume three or more machines. Below that, stick with Longhorn or a simpler provisioner.
The learning curve is steeper. When Ceph is unhappy, you're now learning placement groups and CRUSH maps, not clicking a button in a friendly UI.

So the decision guide I'd give my past self:

Longhorn if you're small, mostly need RWO, and value simplicity and that UI.
Rook Ceph if you need rock-solid RWX, want object storage too, or are running enough nodes that a real distributed storage system pays for itself.

For my clusters — multi-node, RWX-hungry, and run by someone who'd rather debug once and never again — Ceph was clearly the right call.

Talos-specific gotchas I wish I'd known sooner

Extensions are chosen at image-build time. Forgot the iSCSI extension? You can't install it on the running node; you generate a new schematic and upgrade or re-image the node onto the resulting image. Plan extensions with your storage decision, not after.
Read-only rootfs means extraMounts. Anything that needs a persistent path on the host has to live somewhere writable, such as under /var, and be exposed to the kubelet with an explicit extraMounts entry in the machine config.
No SSH is a feature, but respect it. Your debugging is talosctl, not a shell. Learn talosctl dmesg, talosctl logs, and talosctl get early.
Back up before you cut over. Immutability protects you from drift, not from a botched data migration. Snapshot or Velero everything stateful first.
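If the backups come from Velero and the data has to land in a different storage class on the new cluster, Velero can remap classes at restore time through a labelled ConfigMap. A minimal sketch, assuming Velero runs in the velero namespace; the storage class names on both sides of the mapping are hypothetical:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config   # any name works; Velero finds it by label
  namespace: velero                   # assumption: Velero's install namespace
  labels:
    velero.io/plugin-config: ""
    velero.io/change-storage-class: RestoreItemAction
data:
  longhorn: rook-ceph-block           # hypothetical mapping, old class: new class
```

This only rewrites the storage class on the restored PV and PVC objects. Whether the volume contents can follow depends on how they were backed up, so test-restore one real volume before trusting it.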

Was it worth it?

Completely. The migration was more work than a normal distro upgrade — there's no pretending otherwise — but what I got on the other side is a cluster with no configuration drift, reproducible nodes defined in Git, and a storage layer that finally stopped surprising me. Trading Longhorn's friendly simplicity for Ceph's robustness was the single best decision of the whole move.

If you're earlier in the journey, start with why Talos won me over or the Talos vs RKE2 comparison. And if you've fought the same Longhorn RWX battles I have, I'd love to know whether Ceph fixed it for you too.