Whole Cluster Restore

This feature requires an enterprise license. To get a trial license key or extend your trial period, generate a new trial license key. To purchase a license, contact Redpanda Sales.

If Redpanda has enterprise features enabled and it cannot find a valid license, restrictions apply.

With Tiered Storage enabled, you can use Whole Cluster Restore to restore data from a failed cluster (source cluster you are restoring from), including its metadata, onto a new cluster (target cluster you are restoring to). This is a simpler and cheaper alternative to active-active replication, for example with MirrorMaker 2. Use this recovery method to restore your application to the latest functional state as quickly as possible.

Whole Cluster Restore is not a fully-functional disaster recovery solution. It does not provide snapshot-style consistency. Some partitions in some topics will be more up-to-date than others. Committed transactions are not guaranteed to be atomic.

If you need to restore only a subset of topic data, consider using topic recovery instead of a Whole Cluster Restore.

The following metadata is included in a Whole Cluster Restore:

  • Topic definitions. If you have enabled Tiered Storage only for specific topics, topics without Tiered Storage enabled will be restored empty.

  • Users and access control lists (ACLs).

  • Schemas. To ensure that your schemas are also archived and restored, you must also enable Tiered Storage for the _schemas topic (see the example after this list).

  • The consumer offsets topic. Some restored committed consumer offsets may be truncated to a lower value than in the original cluster, to keep offsets at or below the highest restored offset in the partition.

  • Transaction metadata, up to the highest committed transaction. In-flight transactions are treated as aborted and will not be included in the restore.

  • Cluster configurations, including your Redpanda license key, with the exception of the following properties:

    • cloud_storage_cache_size

    • cluster_id

    • cloud_storage_access_key

    • cloud_storage_secret_key

    • cloud_storage_region

    • cloud_storage_bucket

    • cloud_storage_api_endpoint

    • cloud_storage_credentials_source

    • cloud_storage_trust_file

    • cloud_storage_backend

    • cloud_storage_credentials_host

    • cloud_storage_azure_storage_account

    • cloud_storage_azure_container

    • cloud_storage_azure_shared_key

    • cloud_storage_azure_adls_endpoint

    • cloud_storage_azure_adls_port
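
For example, to archive the _schemas topic, you can enable Tiered Storage on it with the topic-level remote write and read properties. This is a minimal sketch, assuming Tiered Storage is already configured for the cluster:

rpk topic alter-config _schemas \
  --set redpanda.remote.write=true \
  --set redpanda.remote.read=true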

Manage source metadata uploads

By default, Redpanda uploads cluster metadata to object storage periodically. You can manage metadata uploads for your source cluster, or disable them entirely, with the following cluster configuration properties (a configuration sketch follows the list):

  • enable_cluster_metadata_upload_loop: Enable metadata uploads. This property is enabled by default and is required for Whole Cluster Restore.

  • cloud_storage_cluster_metadata_upload_interval_ms: Set the time interval to wait between metadata uploads.

  • controller_snapshot_max_age_sec: Maximum amount of time that can pass before Redpanda attempts to take a controller snapshot after a new controller command appears. This property affects how current the uploaded metadata can be.

  • cloud_storage_cluster_name: Specify a custom name for the cluster’s metadata in object storage, for use when multiple clusters share the same storage bucket (for example, for Whole Cluster Restore). This is an internal-only configuration and should be set only after consulting with Redpanda support.
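
For example, you might tune the upload cadence with rpk cluster config set. The values shown here are illustrative only, not recommendations:

# Metadata uploads are on by default; this command only makes the setting explicit.
rpk cluster config set enable_cluster_metadata_upload_loop true
# Upload cluster metadata every 5 minutes (the value is in milliseconds).
rpk cluster config set cloud_storage_cluster_metadata_upload_interval_ms 300000
# Take a controller snapshot within 60 seconds of new controller commands.
rpk cluster config set controller_snapshot_max_age_sec 60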

You can monitor the redpanda_cluster_latest_cluster_metadata_manifest_age metric to track the age of the most recent metadata upload.
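
For example, assuming the default Admin API port of 9644 and that the metric is exported on the public metrics endpoint, you can read it directly from a broker:

# Age of the most recently uploaded cluster metadata manifest.
curl -s http://localhost:9644/public_metrics | \
  grep redpanda_cluster_latest_cluster_metadata_manifest_age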

Restore data from a source cluster

To restore data from a source cluster:

Prerequisites

You must have the following:

  • Redpanda v23.3 or later on both source and target clusters.

  • Tiered Storage enabled on the source cluster (see the verification sketch after this list).

  • Physical or virtual machines on which to deploy the target cluster.
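
As a quick sanity check before you begin, you can confirm that Tiered Storage is enabled on the source cluster. This sketch assumes rpk is configured to talk to the source cluster:

# Should return true on the source cluster.
rpk cluster config get cloud_storage_enabled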

Limitations

  • You cannot use Whole Cluster Restore if the target cluster is in recovery mode.

  • Whole Cluster Restore supports only one source cluster. It is not possible to consolidate multiple clusters onto the target cluster.

  • If a duplicate cluster configuration is found in the target cluster, it will be overwritten by the restore.

  • The target cluster should not contain user-managed or application-managed topic data, schemas, users, ACLs, or ongoing transactions.

Start a target cluster

Follow the steps to deploy a new cluster.

Make sure to configure the target cluster with the same Tiered Storage settings as the source cluster.
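
For example, on a cluster backed by AWS S3, the target configuration might look like the following sketch. The bucket, region, and credentials source shown here are placeholders; use the same values as the source cluster and adjust the property set for your object storage provider:

rpk cluster config set cloud_storage_enabled true
rpk cluster config set cloud_storage_bucket <source-cluster-bucket>
rpk cluster config set cloud_storage_region <source-cluster-region>
rpk cluster config set cloud_storage_credentials_source aws_instance_metadata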

Restore to target cluster

You can restore data from a source cluster to a target cluster using the rpk cluster storage restore command.

  1. Restore data from the source cluster:

    rpk cluster storage restore start -w

    The wait flag (-w) tells the command to poll the status of the restore process and then exit when completed.

  2. Check if a rolling restart is required:

    rpk cluster config status

    Example output when a restart is required:

    NODE  CONFIG-VERSION  NEEDS-RESTART  INVALID  UNKNOWN
    1     4               true           []       []
  3. If a restart is required, perform a rolling restart.
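
    As a rough sketch for a single broker in a systemd-managed deployment, a rolling restart places the broker into maintenance mode, restarts the service, and then takes the broker out of maintenance mode. Repeat this for each broker, one at a time:

    # Drain leadership and traffic from the broker before restarting it.
    rpk cluster maintenance enable <node-id>
    # Restart the Redpanda service on that broker.
    sudo systemctl restart redpanda
    # Return the broker to normal operation once it reports healthy.
    rpk cluster maintenance disable <node-id>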

When the cluster restore is successfully completed, you can redirect your application workload to the new cluster. Make sure to update your application code to use the new addresses of your brokers.
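
Before redirecting traffic, it can be worth a quick check that the restored cluster is healthy and that the expected topics are present, for example:

# Overall cluster health, including leaderless partitions and down brokers.
rpk cluster health
# Confirm that the restored topics are listed.
rpk topic list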

Restore data from multiple clusters sharing the same bucket

This is an advanced use case that should be performed only after consulting with Redpanda support.

Typically, you have a one-to-one mapping between a Redpanda cluster and its object storage bucket. However, it’s possible to run multiple clusters that share the same storage bucket. Sharing a bucket allows you to move tenants between clusters without moving data: for example, you can unmount a topic on cluster A and mount it on cluster B while both clusters use the same bucket, without copying the underlying topic data.

Running multiple clusters that share the same storage bucket presents unique challenges during a Whole Cluster Restore, because the shared bucket makes it harder to identify which cluster’s metadata to restore. To manage these challenges, you must understand how Redpanda uses UUIDs (universally unique identifiers) to identify clusters during a Whole Cluster Restore.

The role of cluster UUIDs in Whole Cluster Restore

Each Redpanda cluster (whether a single node or many) receives a unique UUID when it is created. From that moment forward, all entities created by the cluster are identifiable by this cluster UUID. These entities include:

  • Topic data

  • Topic metadata

  • Whole Cluster Restore manifests

  • Controller log snapshots for Whole Cluster Restore

  • Consumer offsets for Whole Cluster Restore

However, the cluster UUID alone is not enough to decide what to restore. Each time a cluster uploads its metadata, the name of the object has two parts: the cluster UUID, which is unique each time you create a cluster (even after a restore, the new cluster has a new UUID), and a metadata (sequence) ID. When performing a restore, Redpanda scans the bucket for the metadata with the highest sequence ID. If the highest sequence ID was uploaded by a different cluster, it is ambiguous which metadata to restore, which can result in a split-brain scenario, where two independent clusters both believe they are the “rightful owner” of the same logical data.

Configure cluster names for multiple source clusters

To disambiguate cluster metadata from multiple clusters, use the cloud_storage_cluster_name property (unset by default), which allows you to assign a unique name to each cluster sharing the same object storage bucket. Redpanda uses this name to organize the cluster metadata within the shared bucket, so each cluster’s data remains distinct and conflicts are prevented during recovery operations. The name must be unique within the bucket, 1-64 characters long, and use only letters, numbers, underscores, and hyphens. Do not change this value after you set it. After you set the name (an example of setting the property follows the layout below), your object storage bucket organization may look like the following:

/
+- cluster_metadata/
|  + <uuid-a>/manifests/
|  | +- 0/cluster_manifest.json
|  | +- 1/cluster_manifest.json
|  | +- 2/cluster_manifest.json
|  + <uuid-b>/manifests/
|    +- 0/cluster_manifest.json
|    +- 1/cluster_manifest.json # lost cluster
+- cluster_name/
   +- rp-foo/uuid/<uuid-a>
   +- rp-qux/uuid/<uuid-b>
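
For example, to produce the rp-qux entry shown above, you might set the property on that cluster. Remember that this is an internal-only setting, to be used only after consulting with Redpanda support, and that the name must not change once set:

rpk cluster config set cloud_storage_cluster_name rp-qux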

During a Whole Cluster Restore, Redpanda looks for the cluster name specified in cloud_storage_cluster_name and considers only manifests associated with that name. Because the cluster name specified here is rp-qux, Redpanda considers only manifests for the clusters <uuid-b> and <uuid-c> (another new cluster sharing the bucket), ignoring cluster <uuid-a> entirely. In this case, your object storage bucket may look like the following:

+- cluster_metadata/
|  + <uuid-a>/manifests/
|  | +- 0/cluster_manifest.json
|  | +- 1/cluster_manifest.json
|  | +- 2/cluster_manifest.json
|  + <uuid-b>/manifests/
|  | +- 0/cluster_manifest.json
|  | +- 1/cluster_manifest.json # lost cluster
|  + <uuid-c>/manifests/
|    +- 3/cluster_manifest.json # new cluster
|     # ^- next highest sequence number globally
+- cluster_name/
   +- rp-foo/uuid/<uuid-a>
   +- rp-qux/uuid/
      +- <uuid-b>
      +- <uuid-c> # reference to new cluster

Resolve repeated recovery failures

If you experience repeated failures when a cluster is lost and recreated, the automated recovery algorithm may have selected the manifest with the highest sequence number, which might be the most recent manifest containing no data, rather than the original manifest that contains the data. In such a scenario, your object storage bucket might be organized like the following:

/
+- cluster_metadata/
   + <uuid-a>/manifests/
   | +- 0/cluster_manifest.json
   | +- 1/cluster_manifest.json # lost cluster
   + <uuid-b>/manifests/
   | +- 3/cluster_manifest.json # lost again (not recovered)
   + <uuid-d>/manifests/
     +- 7/cluster_manifest.json # new attempt to recover uuid-b;
                                # it does not have the data

In such cases, you can explicitly specify the cluster UUID, using either rpk or the Admin API.

rpk (the --cluster-uuid-override option is available in v25.3.3 and later):

rpk cluster storage restore start --cluster-uuid-override <uuid-a>

Admin API:

curl -X POST \
     --data '{"cluster_uuid_override": "<uuid-a>"}' \
     http://localhost:9644/v1/cloud_storage/automated_recovery

For details, see the Admin API reference.