Archive of a panicked post by me on the LevelOneTechs and Proxmox forums in August 2024. I hope this helps some poor sod who goes through the same stress! Please reach out if your cluster reaches the same sticky end and I will do my best to help you. Needless to say, I learnt my lesson and now keep plenty of backups.
Original thread on LevelOne:
The Headline:
I have managed to kick all 3 of my nodes from the cluster and wipe all configuration for both PVE and CEPH. This is bad. I have configuration backups; I just don’t know how to use them.
The longer story:
Prior to this mishap, I had Proxmox installed on mirrored ZFS HDDs. I planned to move the install to a single smaller SSD (to free drive bays for more CEPH OSDs). I wanted to keep the configuration to avoid having to set everything up again and minimise downtime, so I planned to do each node one by one to keep the cluster in quorum. Starting with node 1, I installed a fresh PVE on the new SSD and attempted to copy over the relevant config files. This ALMOST worked, but I was getting host key errors and the node wouldn’t re-join the cluster. I managed to get node 1 back in the cluster eventually, but CEPH was still down and something was still amiss with the fingerprints/keys from node 2/3’s perspective. As a Hail Mary I copied the entire /etc folder from the old HDD install to the new SSD install of node 1.
For reasons unbeknownst to me, this wiped node 1 – but worst of all it propagated across the entire cluster and wiped nodes 2 & 3 too. Every node was kicked from the cluster; VM configs, PVE configs and CEPH configs were all reset to zero. Going to the CEPH tab in the web UI prompted me to install CEPH for the first time. Completely. Wiped.
Suitably panicked at this point, I pulled the CEPH OSD drives from their bays to stop any further loss in its tracks. As far as I know, the OSDs (and all my data) are untouched – it’s just the configs.
What do I Need?
I need advice, dear reader. How do I proceed? What are my options to restore my cluster? I am mainly concerned with recovering Ceph. It has not only all my VMs on it, but all my persistent container data on CephFS too. I’m happy to reconfigure the PVE cluster from scratch… but I need Ceph back exactly how it was. That being said, it would be a bonus to retain the VM configs so I don’t have to re-create each VM manually and figure out which VHD is which. I did that a lot with XCP-NG and am sick of it. I fucked around and found out, and now I want to find out before I get fucked.
What do I Have?
Luckily, I took a backup of (what I thought were) the relevant configuration files before embarking on this shitshow of an upgrade. I have the following folders (recursive) from EACH NODE prior to poop → fan:
- /etc
- /var/lib/ceph
- /var/lib/pve-cluster
I ALSO have the HDD mirror from node 1 – this is in the untouched state before anything changed (I unplugged them before I did the new SSD install). I can dip into that if there are any files I need – hopefully.
What should I do? Any advice would be greatly appreciated, and I will happily buy coffees! Apologies for the coarse language… I am sleep deprived and stressed.
Many thanks,
Max
When using PVE in a cluster setup, /etc is no longer purely a single filesystem. PVE uses FUSE/corosync to share /etc/pve across the cluster, and you can’t just copy it like any other directory.
I can only provide limited advice since it’s been a while since I last touched Ceph, but if the OSDs haven’t been overwritten then you should be able to recreate the cluster. CephFS, I’m not quite so sure about. The recovery steps in this post might be helpful, but you’ll probably get better advice later on your own thread on the PVE forums.
Thanks for your input lae.
I think I must have been looking at information about standalone nodes rather than clusters when planning my strategy… I can’t seem to find a straight answer to help me understand how the /etc/pve folder works. I tried overwriting/removing just that folder at first to restore the pve configs, since that’s what’s suggested in the literature I could find, but couldn’t as I kept getting permission errors.
I’d read that /etc/pve is a FUSE mount of the configuration database in /var/lib/, so I also copied the /var/lib/pve-cluster folder. This seemed to restore the cluster configuration (but not the VMs). I then had to copy the /etc/corosync folder to get it to sync and join the cluster properly. This presumably re-synced the /etc/pve folder with the rest of the cluster and restored the VM templates. That’s when everything was looking almost OK.
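For anyone else trying to get their head around it, you can see that relationship on a live node. A quick sketch, assuming the sqlite3 CLI is installed and hedging that I’m recalling the table name from memory:

```bash
# /etc/pve is served by pmxcfs as a FUSE mount, not an ordinary directory:
findmnt /etc/pve
#   TARGET    SOURCE     FSTYPE   OPTIONS
#   /etc/pve  /dev/fuse  fuse     rw,nosuid,nodev,...

# The real data lives in an SQLite database; the 'tree' table (if I
# remember the schema right) holds one row per virtual file under
# /etc/pve. Peek read-only, ideally on a copy or with pve-cluster stopped:
sqlite3 /var/lib/pve-cluster/config.db 'SELECT name FROM tree ORDER BY name;'
```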
Obviously there are other things in /etc/ that I am unaware of… I’m just not sure why overwriting with relevant backups would wipe the whole cluster.
Oh well.
I managed to restore everything after a very stressful 24hrs!
For those reading in the future, don’t bother backing up /etc/pve. Don’t listen to what anyone says: it’s useful to have, no doubt, since you can cherry-pick any files you might need, but it’s an ineffective disaster recovery strategy on its own. It’s simply the FUSE mount of the SQLite database, and you can’t even write to it properly while it’s mounted – and can’t access it at all when it isn’t. Instead, back up the db file /var/lib/pve-cluster/config.db and use that to restore the config.
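A minimal sketch of grabbing a consistent copy of that file on each node (either stop pmxcfs briefly, or use SQLite’s online backup if you’d rather not; the .bak path is just an example):

```bash
# Option 1: quiesce pmxcfs so config.db isn't mid-write, copy it out, restart.
systemctl stop pve-cluster
cp /var/lib/pve-cluster/config.db /root/pve-config.db.bak
systemctl start pve-cluster

# Option 2: use SQLite's online backup and leave the service running.
sqlite3 /var/lib/pve-cluster/config.db ".backup /root/pve-config.db.bak"
```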
TLDR: To completely restore all nodes, I made the following backups and copied them to a fresh install: /etc/hosts, /etc/hostname, /etc/resolv.conf, /etc/ceph, /etc/corosync, /etc/ssh (particularly the host keys), /etc/network, /var/lib/ceph and /var/lib/pve-cluster. I stopped PVE first to avoid conflicts with “systemctl stop pve-cluster pvedaemon pveproxy pvestatd”.
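In shell terms, the restore on each fresh install boiled down to roughly the following (a sketch, assuming rsync is available; /root/backup stands in for wherever you stashed the copies):

```bash
# Stop the PVE services that keep /etc/pve and the config database open.
systemctl stop pve-cluster pvedaemon pveproxy pvestatd

# Copy the saved trees back over the fresh install; rsync merges each
# directory into place and preserves permissions. BACKUP is illustrative.
BACKUP=/root/backup
rsync -a $BACKUP/etc/hosts $BACKUP/etc/hostname $BACKUP/etc/resolv.conf /etc/
rsync -a $BACKUP/etc/ceph $BACKUP/etc/corosync $BACKUP/etc/ssh $BACKUP/etc/network /etc/
rsync -a $BACKUP/var/lib/ceph $BACKUP/var/lib/pve-cluster /var/lib/

# Reboot so pmxcfs, corosync and networking all come back up from the
# restored configuration.
reboot
```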
After restoring these files and rebooting, the PVE, VM, CT, storage etc. configs are restored. Ceph required some extra work in my case (the commands are collected into one block after the list):
- Enable the no-subscription repository and use “pveceph install --repository no-subscription” to install Ceph (or use the web UI)
- Manually start and enable the manager and monitor on each node using systemctl start/enable ceph-mgr@/ceph-mon@
- Check your OSDs are detected by running “ceph-volume lvm list”
- Rejoin the OSDs to the cluster using “ceph-volume lvm activate --all”
- Profit
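Collected into one block, the Ceph side was roughly this (the <node> after the @ is a placeholder for each node’s mon/mgr ID, usually the hostname):

```bash
# Install Ceph from the no-subscription repository (or via the web UI).
pveceph install --repository no-subscription

# Bring the monitor and manager back up on each node; replace <node>
# with that node's mon/mgr ID.
systemctl enable --now ceph-mon@<node> ceph-mgr@<node>

# Confirm the OSD volumes on the re-inserted drives are detected, then
# activate them all so they rejoin the cluster.
ceph-volume lvm list
ceph-volume lvm activate --all
```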
Good to hear! (and I see you’ve posted the update to the PVE thread too)
You technically can—it’s just not fully POSIX compliant. We write to it directly for some things in our Ansible role (where I’ve written at length on how it’s particular about how you write to it). Agreed though that what does need backing up is the config.db file.
I’m actually impressed at the Ceph OSD recovery, I guess having a backup of /etc/ceph made it that much simpler?