Recovering CEPH/Proxmox After a Major Oopsie

Archive of a panicked post by me on the LevelOneTechs and Proxmox forums in August 2024. I hope this helps some poor sod going through the same stress! Please reach out if your cluster meets the same sticky end and I will do my best to help you. Needless to say, I learnt my lesson and now keep plenty of backups.

Original thread on LevelOne:

(also posted on the Proxmox forums, but I like you guys more)

The Headline:

I have managed to kick all 3 of my nodes from the cluster and wipe all configuration for both PVE and CEPH. This is bad. I have configuration backups; I just don’t know how to use them.

The longer story:

Prior to this mishap, I had Proxmox installed on mirrored ZFS HDDs. I planned to move the install to a single smaller SSD (to free drive bays for more CEPH OSDs). I wanted to keep the configuration to avoid setting everything up again and to minimise downtime, so I planned to do the nodes one by one to keep the cluster in quorum. Starting with node 1, I installed a fresh PVE on the new SSD and attempted to copy over the relevant config files. This ALMOST worked, but I was getting host key errors and the node wouldn’t re-join the cluster. I eventually got node 1 back in the cluster, but CEPH was still down and something was still amiss with the fingerprints/keys from node 2/3’s perspective. As a Hail Mary I copied the entire /etc folder from the old HDD install to the new SSD install of node 1.
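With hindsight, those host key errors almost certainly came from the cluster-wide SSH and TLS material under /etc/pve/priv disagreeing with the reinstalled node. A minimal sketch of the usual fix, assuming the node can already reach the cluster (node names are placeholders):

```bash
# Refresh node certificates and the cluster-wide known_hosts kept under
# /etc/pve/priv. Standard first move when a reinstalled node throws key errors.
pvecm updatecerts

# If SSH between nodes still complains, drop the stale entry by hand.
# "node1" is a placeholder for whichever node changed identity.
ssh-keygen -R node1 -f /etc/pve/priv/known_hosts
```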

For reasons unbeknownst to me, this wiped node 1 – but worst of all it propagated across the entire cluster and wiped nodes 2 and 3 too. Every node was kicked from the cluster; VM configs, PVE configs and CEPH configs were all reset to nothing. Going to the CEPH tab in the web UI prompted me to install CEPH for the first time. Completely. Wiped.

Suitably panicked at this point, I pulled the CEPH OSD drives from their bays to stop any further loss in its tracks. As far as I know, the OSDs (and all my data) are untouched – it’s just the configs.
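For anyone in the same boat wanting to check the same thing: once the drives go back in, there is a read-only way to confirm the OSD metadata survived. A quick sketch; nothing here should write to the disks:

```bash
# List the OSDs ceph-volume can see on this host. Both commands only read
# LVM/bluestore labels and don't need a running monitor.
ceph-volume lvm list   # OSDs created as LVM volumes (the PVE default)
ceph-volume raw list   # OSDs created directly on raw devices, if any
```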

What do I Need?

I need advice, dear reader. How do I proceed? What are my options to restore my cluster? I am mainly concerned with recovering CEPH. It has not only all my VMs on it, but all my persistent container data on CephFS. I’m happy to reconfigure the PVE cluster from scratch… but I need CEPH back exactly how it was. That being said, it would be a bonus to retain the VM configs so I don’t have to re-create each VM manually and figure out which VHD is which. I did that a lot with XCP-NG and am sick of it. I fucked around and found out, and now I want to find out before I get fucked.

What do I Have?

Luckily, I took a backup of (what I thought were) the relevant configuration files before embarking on this shitshow of an upgrade. I have the following folders (recursive) from EACH NODE prior to poop → fan (a quick sanity check of their contents is sketched after the list):

  • /etc
  • /var/lib/ceph
  • /var/lib/pve-cluster
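Those folders should hold everything that matters. A rough way to check, assuming one node’s backup tree lives at a path like the placeholder below:

```bash
# $BK is a placeholder for wherever one node's backup tree lives.
BK=/backup/node1

ls -l "$BK/var/lib/pve-cluster/config.db"   # pmxcfs DB: all of /etc/pve, incl. VM configs
ls -l "$BK/etc/corosync/corosync.conf"      # cluster membership and ring config
ls -l "$BK/etc/ceph/ceph.conf"              # CEPH cluster config, including the fsid
ls -l "$BK/var/lib/ceph/mon"                # monitor store, if this node ran a mon
ls -l "$BK/etc/pve/priv"                    # keyrings; only present if /etc/pve was
                                            # mounted when the copy was taken
```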

I ALSO have the HDD mirror from node 1 – this is in the untouched state before anything changed (I unplugged them before I did the new SSD install). I can dip into that if there are any files I need – hopefully.
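That mirror is a ZFS root pool (named rpool by default on PVE installs), so it should be importable read-only on a live system to fish files out, something like:

```bash
# Import the old root pool read-only under an altroot so its mountpoints
# don't shadow the live system's.
zpool import -o readonly=on -R /mnt/oldroot rpool

# If the running system already has a pool called rpool, import the old one
# by its numeric ID and give it a new name instead:
#   zpool import -o readonly=on -R /mnt/oldroot <pool-id> oldrpool

# e.g. the untouched pmxcfs database:
ls -l /mnt/oldroot/var/lib/pve-cluster/config.db

# Detach cleanly when done.
zpool export rpool
```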

What should I do? Any advice would be greatly appreciated, and I will happily buy coffees! Apologies for the coarse language… I am sleep deprived and stressed.

Many thanks,
Max

When using PVE in a cluster setup, /etc is no longer purely a normal directory tree. PVE mounts /etc/pve as a FUSE filesystem (pmxcfs) that corosync replicates across the whole cluster, so you can’t just copy it like any other directory.
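To make that concrete: the state behind /etc/pve actually lives in an sqlite database, and that database is the thing to restore. Roughly, with paths assuming backups laid out like the ones described above:

```bash
# Stop the cluster stack so /etc/pve (pmxcfs) is unmounted while we work.
systemctl stop pve-cluster corosync

# The whole of /etc/pve, VM configs included, is generated from this DB.
cp /backup/node1/var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db

# pmxcfs can be started standalone in local mode to inspect the restored
# /etc/pve without touching corosync or the other nodes.
pmxcfs -l
```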

I can only provide limited advice since it’s been a while since I last touched Ceph, but if the OSDs haven’t been overwritten then you should be able to recreate the cluster. CephFS I’m not so sure about. The recovery steps in this post might be helpful, but you’ll probably get better advice later on your own thread on the PVE forums.
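For anyone landing here from a search: the route those recovery steps point at is rebuilding the monitor store from the surviving OSDs. A rough single-host sketch of the upstream procedure, assuming default paths and an admin keyring recovered from backup; read the Ceph disaster-recovery docs before running any of it:

```bash
# Scratch area for the monitor store we are about to reconstruct.
ms=/tmp/mon-store
mkdir -p "$ms"

# With the OSDs activated but their daemons stopped, scrape the cluster maps
# out of every OSD on this host.
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path "$osd" --no-mon-config \
      --op update-mon-db --mon-store-path "$ms"
done

# Rebuild a monitor store from the collected maps, seeding auth from the
# admin keyring (path assumes it was restored from the backups above).
ceph-monstore-tool "$ms" rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring

# Swap the rebuilt store into the (stopped) monitor's data dir. The mon ID
# here is assumed to match the hostname, which is what PVE uses by default.
mv "$ms/store.db" "/var/lib/ceph/mon/ceph-$(hostname)/store.db"
chown -R ceph:ceph "/var/lib/ceph/mon/ceph-$(hostname)"
```

Note the upstream docs warn that the rebuilt store loses some metadata, MDS maps included, so CephFS needs extra care afterwards – which is exactly the part the reply above hedges on.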

Send me an email at

All enquiries welcome!