Rebuilding the Kubernetes Control Plane from ETCD Snapshots

Introduction

When we run a self-managed Kubernetes cluster, the control plane is the single biggest operational risk. Nodes can be replaced, workloads can be redeployed, and manifests can be re-applied, but if the control plane goes down. We don’t have a proven recovery path, we are effectively blind, with no scheduling, no scaling, no API access, and no safe way to manage the cluster.

In self-managed setups (kubeadm/on-prem/EC2 VMs), etcd is the source of truth. It stores almost everything that makes up a cluster, including namespaces, deployments, services, RBAC, CRDs, secrets, and configuration. That’s why a real Kubernetes disaster recovery plan starts with one thing taking consistent etcd snapshots and knowing how to restore them.

Pioneers in Cloud Consulting & Migration Services

Reduced infrastructural costs
Accelerated application deployment

Get Started

Solution Overview

We take a consistent snapshot of the etcd datastore and securely store it. During disaster recovery, we will restore the etcd snapshot on the control-plane nodes, bring up the Kubernetes control-plane components (etcd, kube-apiserver, controller-manager, scheduler), and validate that the cluster state and workloads are recovered.

Prerequisite

Self-managed Kubernetes control plane (e.g., kubeadm / on-prem / EC2 VMs) where you can access the control-plane node(s).Necessary permissions to access the cluster and take the backup.
Access to the control-plane node(s) (SSH or console) with sudo/root access, because etcd data and certs live on the node.
Permissions to run cluster validation commands after restore (access to admin.conf or kubeconfig)
Etcdctl API needs to be installed on the cluster that matches the etcd major version.

Step-by-Step Guide

Verify ETCD Version

Run the below command to verify the etcd version.

ETCDCTL_API=3 etcdctl version

1	ETCDCTL_API=3 etcdctl version

Static Pod manifests backup

Move the static pod folder otherwise, the kubelet keeps restarting api server and other static pods.

mkdir -p /root/k8s-manifests-backup
mv /etc/kubernetes/manifests/*.yaml /root/k8s-manifests-backup/

1 2	mkdir -p /root/k8s-manifests-backup mv /etc/kubernetes/manifests/*.yaml /root/k8s-manifests-backup/

Check the Data directory path

Check the etcd data directory. In kubeadm, it uses /var/lib/etcd. Verify once from YAML or if etcd is running as a service, check –data-dir path.

cat /root/k8s-manifests-backup/etcd.yaml | grep -E 'data-dir|/var/lib/etcd'

1	cat /root/k8s-manifests-backup/etcd.yaml \| grep -E 'data-dir\|/var/lib/etcd'

Backup of existing etcd data

Take a backup of existing data. Run the command to take the backup.

ETCDCTL_API=3 etcdctl snapshot save /opt/snapshot-pre-boot.db --cacert=/etc/kubernetes/pki/etcd/ca.crt --endpoints=https://127.0.0.1:2379 --key=/etc/kubernetes/pki/etcd/server.key --cert=/etc/kubernetes/pki/etcd/server.crt

1	ETCDCTL_API=3 etcdctl snapshot save /opt/snapshot-pre-boot.db --cacert=/etc/kubernetes/pki/etcd/ca.crt --endpoints=https://127.0.0.1:2379 --key=/etc/kubernetes/pki/etcd/server.key --cert=/etc/kubernetes/pki/etcd/server.crt

The flags shown above are required to take a backup and restore it. We can get the details of these flags from the manifests file of the etcd server.

etc2

Cleaning the older etcd database

Wipe the old data dir to avoid mixing old state, or else restore the backup to a new directory and update the volume path in the yaml.

sudo systemctl stop kubelet || true
sudo rm -rf /var/lib/etcd/*

1 2	sudo systemctl stop kubelet \|\| true sudo rm -rf /var/lib/etcd/*

Stop the kubelet, then remove the data from the etcd data directory

Restoring the snapshot

Restore snapshots in /var/lib/etcd. This creates the etcd data dir from the snapshot.

ETCDCTL_API=3 etcdctl snapshot restore /opt/snapshot-pre-boot.db --data-dir /var/lib/etcd --cacert=/etc/kubernetes/pki/etcd/ca.crt --endpoints=https://127.0.0.1:2379 --key=/etc/kubernetes/pki/etcd/server.key --cert=/etc/kubernetes/pki/etcd/server.crt

1	ETCDCTL_API=3 etcdctl snapshot restore /opt/snapshot-pre-boot.db --data-dir /var/lib/etcd --cacert=/etc/kubernetes/pki/etcd/ca.crt --endpoints=https://127.0.0.1:2379 --key=/etc/kubernetes/pki/etcd/server.key --cert=/etc/kubernetes/pki/etcd/server.crt

etc3

Note: If, for some reason, restoring in /var/lib/etcd is not working, restore the backup to a different directory and update the etcd yaml volume path to point to the backup location.

Restoring the static manifests file

Copy api server and other manifests from the backup to the manifests location

mv /root/k8s-manifests-backup/kube-apiserver.yaml /etc/kubernetes/manifests/
mv /root/k8s-manifests-backup/kube-controller-manager.yaml /etc/kubernetes/manifests/
mv /root/k8s-manifests-backup/kube-scheduler.yaml /etc/kubernetes/manifests/

mv /root/k8s-manifests-backup/kube-apiserver.yaml /etc/kubernetes/manifests/

mv /root/k8s-manifests-backup/kube-controller-manager.yaml /etc/kubernetes/manifests/

mv /root/k8s-manifests-backup/kube-scheduler.yaml /etc/kubernetes/manifests/

Verification of services

Verify that the deployments and services are running again after restoring the backup.

etc4

Key Benefits

Provides a step-by-step, incident-ready procedure to rebuild the Kubernetes control plane from an etcd snapshot.
Helps us restore API access and cluster operations faster during control-plane failures.
Shows how to validate snapshots and verify recovery.
Helps you define and meet RPO/RTO targets with repeatable processes and checks.

Conclusion

In a self-managed Kubernetes setup, losing the control plane doesn’t have to mean losing the cluster. Since etcd is the source of truth, a reliable snapshot and a proven restore process are the fastest way to recover our Kubernetes state, including namespaces, workloads, RBAC, CRDs, and configuration, after a major failure. Make etcd snapshotting automatic, store backups off the control-plane node, and run periodic restore game days to measure your RPO/RTO. With that in place, control-plane recovery becomes a repeatable operational task.

Drop a query if you have any questions regarding Kubernetes and we will get back to you quickly.

Empowering organizations to become ‘data driven’ enterprises with our Cloud experts.

Reduced infrastructure costs
Timely data-driven decisions

Get Started

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications winning global recognition for its training excellence including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

FAQs

1. Can I do an etcd restore on managed Kubernetes, such as EKS/GKE/AKS?

ANS: – No. Managed services handle etcd/control plane internally and don’t expose restore workflows. This blog applies to self-managed control planes (kubeadm/on-prem/VM-based clusters).

2. Does restoring an etcd snapshot recover my application data (databases/files)?

ANS: – Not usually. etcd restores Kubernetes objects (manifests/state), not persistent data stored in PVs or external databases

WRITTEN BY Suryansh Srivastava

Suryansh is an experienced DevOps Consultant with a strong background in DevOps, Linux, Ansible, and AWS. He is passionate about optimizing software development processes, ensuring continuous improvement, and enhancing the scalability and security of cloud-based production systems. With a proven ability to bridge the gap between IT and development teams, Surayansh specializes in creating efficient CI/CD pipelines that drive process automation and enable seamless, reliable software delivery.