Failovers

When a Namespace with High Availability is disrupted by an outage, Temporal Cloud can fail over the Namespace from the primary to the replica. This lets in-flight Workflow Executions continue, new Workflow Executions start, and closed Workflow Executions be inspected, all with minimal interruptions or data loss.

Returning control from the replica to the primary is called a failback. After an automatic failover, Temporal automatically fails back to the original region once it is healthy, unless you opt out. See Failbacks for details.

Automatic failover

Temporal Cloud offers managed outage detection and failover to all Namespaces that use High Availability. These automatic failovers keep your Namespace available without manual intervention. Temporal aims to detect the outage and complete the failover within minutes of the outage beginning, in line with the stated Recovery Time Objective (RTO).

After an automatic failover, the Namespace will have a replica in its original region. Once the original region is healthy again, Temporal Cloud automatically performs a failback, moving the Namespace back to its original region.

On failover, the replica becomes active and the Namespace endpoint directs access to it.

To opt out of automatic failovers and their RTO, you can disable automatic failovers.

Conditions that trigger an automatic failover

While the failover operation itself usually completes in seconds, the bulk of the Recovery Time in an outage is spent detecting the disruption and deciding to trigger a failover. See The failover process for a detailed breakdown.

Temporal Cloud runs automated Workflows that detect outages and trigger failovers. These Workflows continuously monitor the health of Temporal Cloud in every region and every cell.

If any monitored condition fails for too long, Temporal Cloud automatically triggers a failover of every Namespace with High Availability that has a healthy replica.

Temporal's on-call engineers may also trigger a failover at their discretion, for example, if they see early signs of a regional outage.

The following list gives a general idea of the conditions that trigger an automatic failover. This is not an exhaustive list, and it may change over time.

Whether Temporal Cloud's services in the cell are reachable from the Control Plane.
The average latency of inbound RPC calls (excluding long-polling APIs) to Temporal services in the cell.
The percentage of inbound RPC calls that returned errors related to server health.
The average latency of calls from Temporal Cloud's services in the cell to its persistence layer.
The percentage of calls to the persistence layer that returned errors related to persistence health.
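As a rough illustration of how signals like these could feed a failover decision, the following sketch treats a cell as unhealthy when any signal crosses a threshold, and recommends a failover only after several consecutive unhealthy samples and only if a healthy replica exists. The thresholds, field names, and sample count are assumptions made for this example; Temporal Cloud's real detection logic is not published in this detail.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds, chosen only for illustration.
MAX_AVG_RPC_LATENCY_MS = 500.0
MAX_RPC_ERROR_RATE = 0.05
MAX_AVG_PERSISTENCE_LATENCY_MS = 200.0
MAX_PERSISTENCE_ERROR_RATE = 0.05
# Stand-in for "failing for too long": consecutive unhealthy samples required.
UNHEALTHY_SAMPLES_REQUIRED = 3


@dataclass
class CellSample:
    """One health observation for a cell (field names are hypothetical)."""
    reachable: bool                    # reachable from the Control Plane
    avg_rpc_latency_ms: float          # inbound RPCs, excluding long-polling APIs
    rpc_error_rate: float              # fraction of RPCs with server-health errors
    avg_persistence_latency_ms: float  # calls from services to the persistence layer
    persistence_error_rate: float      # fraction of persistence calls with errors


def is_unhealthy(sample: CellSample) -> bool:
    """A sample is unhealthy if any single signal crosses its threshold."""
    return (
        not sample.reachable
        or sample.avg_rpc_latency_ms > MAX_AVG_RPC_LATENCY_MS
        or sample.rpc_error_rate > MAX_RPC_ERROR_RATE
        or sample.avg_persistence_latency_ms > MAX_AVG_PERSISTENCE_LATENCY_MS
        or sample.persistence_error_rate > MAX_PERSISTENCE_ERROR_RATE
    )


def should_fail_over(samples: List[CellSample], replica_healthy: bool) -> bool:
    """Recommend a failover only when the latest samples are all unhealthy
    and there is a healthy replica to fail over to."""
    if not replica_healthy or len(samples) < UNHEALTHY_SAMPLES_REQUIRED:
        return False
    return all(is_unhealthy(s) for s in samples[-UNHEALTHY_SAMPLES_REQUIRED:])
```

Requiring consecutive unhealthy samples is one simple way to avoid failing over on a momentary blip, which is why detection, not the switch itself, dominates the recovery time.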

Manual failover

You can also manually trigger a failover based on your own monitoring or for failover testing.

Most Namespaces with High Availability are well-served by automatic failovers. The cases where a manual failover (i.e., a failover triggered by a user) is warranted are:

Testing failover or migrating to a new region. A manual failover is the standard way to exercise your failover process with your Clients and Workers, or to move a Namespace to a different region.
An outage that affects only your systems. If an outage is contained to your application, Workers, or other infrastructure, and Temporal Cloud is not affected, Temporal will not initiate a failover on your behalf. Detect the outage with your own monitoring and trigger a failover yourself.
Failing over more aggressively during a regional outage. Even with automatic failovers enabled, you can trigger a failover yourself if you detect a regional outage before Temporal does. A manual failover does not conflict with Temporal's automatic failover: whichever failover happens first takes effect, and the later one is a no-op.
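The "first one wins, later one is a no-op" behavior can be pictured as an idempotent operation on the Namespace's active region. The sketch below is a toy model, not Temporal Cloud's API: `NamespaceState`, `request_failover`, and the region names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class NamespaceState:
    """Toy model of a Namespace with one active region and one replica."""
    active_region: str
    replica_region: str
    log: List[Tuple[str, str]] = field(default_factory=list)


def request_failover(ns: NamespaceState, target_region: str, trigger: str) -> bool:
    """Make target_region active. Returns True if a switch happened,
    False if the target was already active (the request is a no-op)."""
    if target_region == ns.active_region:
        ns.log.append((trigger, "no-op"))
        return False
    if target_region != ns.replica_region:
        raise ValueError(f"{target_region} is not a replica of this Namespace")
    # Swap roles: the replica becomes active, the old active becomes the replica.
    ns.active_region, ns.replica_region = ns.replica_region, ns.active_region
    ns.log.append((trigger, "failed over"))
    return True


ns = NamespaceState(active_region="us-east-1", replica_region="us-west-2")
request_failover(ns, "us-west-2", trigger="manual")     # takes effect
request_failover(ns, "us-west-2", trigger="automatic")  # already active: no-op
```

Because the request names the region that should become active rather than "toggle", a late second request cannot accidentally undo the first.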

Same-region Replication

Manual failovers apply only to Multi-region and Multi-cloud Replication. A Same-region Replication Namespace fails over automatically between cells and cannot be failed over manually or have its automatic failovers disabled.

The failover process

The failover process is the same whether it is triggered automatically by Temporal or manually by a user.

1. During normal operation, the primary asynchronously replicates data to the replica, keeping them in sync.
2. A failover is triggered. For automatic failovers, the majority of time is spent on outage detection. Temporal's automated health checks must confirm the disruption before initiating a failover. For the overall timing target, see the Recovery Time Objective (RTO).
3. The Namespace becomes active in the replica's region.
   1. Temporal Cloud first attempts a graceful failover: it pauses traffic, drains in-flight replication, and switches to the replica with no data conflicts.
   2. If the graceful attempt does not complete within 10 seconds, Temporal Cloud falls back to a forced failover, which immediately activates the replica. In a forced failover, any events not yet replicated undergo conflict resolution once the original region comes back.

   This hybrid strategy balances consistency and availability. During the switch, Workflow operations are briefly paused, and Temporal Cloud returns a retryable "Service unavailable" error to SDKs.
4. The Namespace Endpoint re-routes to the active region. This DNS change can take a few minutes to fully propagate to all Clients and Workers. If your application has an extremely demanding Recovery Time, you can eliminate this stage by connecting through a Regional Endpoint instead of the Namespace Endpoint.
5. Failback. If the failover was triggered by Temporal, Temporal automatically triggers a failback to the original region once the region is healthy. If the failover was triggered by a user, the Namespace continues as-is until a user triggers another failover. See failback options for details.
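The graceful-then-forced strategy in the activation step is a bounded wait with a fallback. The following sketch shows that pattern in isolation, assuming a hypothetical `drain_step` callable that does some replication work and returns True once the backlog is empty. Only the 10-second limit comes from the text above; the function names and polling approach are illustrative.

```python
import time
from typing import Callable

GRACEFUL_TIMEOUT_SECONDS = 10.0  # limit stated in the process description


def fail_over(
    drain_step: Callable[[], bool],
    timeout: float = GRACEFUL_TIMEOUT_SECONDS,
    clock: Callable[[], float] = time.monotonic,
    poll_interval: float = 0.1,
) -> str:
    """Try to drain replication before switching; fall back to a forced switch.

    Returns "graceful" if the backlog drained within the timeout (no conflicts
    expected), or "forced" if the deadline passed first (unreplicated events
    would need conflict resolution later).
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if drain_step():
            return "graceful"
        time.sleep(poll_interval)
    return "forced"
```

The clock is injectable so the timeout path can be exercised in a test without actually waiting 10 seconds.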

Post-failover events

After any failover, whether triggered by you or by Temporal, an event appears in both the Temporal Cloud Web UI (on the Namespace detail page) and in your audit logs. The audit log entry records the event as "operation": "FailoverNamespace". Temporal Cloud also notifies you by email whenever a failover occurs.
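If you export audit logs for your own alerting, you can pick out failovers by the operation name. The sketch below assumes the log has been exported as one JSON object per line. That export format, the `namespace` field, and the `SomeOtherOperation` value are assumptions for the example; only the `"operation": "FailoverNamespace"` pair comes from the text above.

```python
import json
from typing import Iterable, List


def failover_events(lines: Iterable[str]) -> List[dict]:
    """Return the audit log entries whose operation is FailoverNamespace."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        entry = json.loads(line)
        if entry.get("operation") == "FailoverNamespace":
            events.append(entry)
    return events


# Hypothetical sample entries; real audit log entries carry more fields.
sample_log = [
    '{"operation": "FailoverNamespace", "namespace": "example-ns"}',
    '{"operation": "SomeOtherOperation", "namespace": "example-ns"}',
    '',
]
found = failover_events(sample_log)
```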

After an automatic failover, Temporal automatically fails back to the original region once the region is healthy, unless you opt out. After a user-triggered failover, the Namespace stays in the replica region until a user triggers another failover. See failback options for details.

Split-brain scenario

At any time, only one of the primary and the replica should be active. A network partition between the two regions prevents them from communicating with each other, so if you promote the replica to active during a partition, both regions are active simultaneously and accept writes independently. This is known as a split-brain scenario.

When the network partition resolves and the regions can communicate again, Temporal's conflict resolution process reconciles the divergent histories and determines which region remains active.

Conflict resolution

Namespaces with replicas rely on asynchronous event replication. Updates made to the primary may not immediately be reflected in the replica due to replication lag, particularly during failovers. In a non-graceful (forced) failover, replication lag can cause a temporary setback in Workflow progress. At the moment of a non-graceful failover:

Operations that had already replicated remain durable in the replica.
Operations that had not yet replicated (i.e., that are still in the replication backlog) are reconciled according to the Conflict Resolution process when the region recovers. Note that Conflict Resolution can only recover data from a functioning Temporal Server. If the active server never recovers, the Workflow API calls that fall within the 1-minute RPO may be permanently lost. Such a case would require the permanent loss of multiple cloud Availability Zones and has never happened in the history of Temporal Cloud.

In a graceful failover, Temporal Cloud drains the replication backlog to zero and pauses traffic before switching regions, so the replica holds every acknowledged operation and the Namespace achieves a recovery point of zero.
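The difference between the two failover modes comes down to what is still in the replication backlog at the moment of the switch. The sketch below is a simplified model with hypothetical function names. It treats acknowledged operations as an ordered list and the replica as holding a prefix of it.

```python
from typing import List, Tuple


def forced_failover(acked_ops: List[str], replicated_count: int) -> Tuple[List[str], List[str]]:
    """Switch immediately. Returns (ops on the replica, ops left in the backlog).

    Backlog ops are not lost while the original region can recover; they are
    reconciled later by conflict resolution.
    """
    on_replica = acked_ops[:replicated_count]
    backlog = acked_ops[replicated_count:]
    return on_replica, backlog


def graceful_failover(acked_ops: List[str]) -> Tuple[List[str], List[str]]:
    """Pause traffic and drain the backlog to zero before switching, so the
    replica holds every acknowledged operation."""
    return list(acked_ops), []


ops = ["start-wf-1", "signal-wf-1", "start-wf-2"]
forced_replica, forced_backlog = forced_failover(ops, replicated_count=2)
graceful_replica, graceful_backlog = graceful_failover(ops)
```

In this model the recovery point is the size of the backlog at switch time: empty for a graceful failover, and bounded by replication lag for a forced one.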

Namespaces that are not replicated can be configured to provide at-most-once semantics for Activity execution when a retry policy's maximum attempts is set to 1, which disables retries (a value of 0 means unlimited attempts). High Availability Namespaces provide at-least-once semantics for execution of Activities. Completed Activities may be re-dispatched in a newly active Namespace, leading to repeated executions.
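Because an Activity may run more than once after a failover, a common way to keep side effects safe is to make them idempotent. The sketch below illustrates the idea with a hypothetical in-memory `PaymentLedger` that deduplicates by an idempotency key. In practice the key might be derived from stable identifiers such as the Workflow Id and Activity Id, which is an assumption of this example and not something this page prescribes.

```python
from typing import Dict


class PaymentLedger:
    """Hypothetical downstream system that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self._charges: Dict[str, int] = {}

    def charge(self, idempotency_key: str, amount_cents: int) -> int:
        """Record the charge once per key; repeated calls return the
        originally recorded amount without charging again."""
        if idempotency_key not in self._charges:
            self._charges[idempotency_key] = amount_cents
        return self._charges[idempotency_key]

    def total_cents(self) -> int:
        return sum(self._charges.values())


ledger = PaymentLedger()
key = "order-workflow-42/charge-activity-1"  # hypothetical stable key
ledger.charge(key, 1500)  # first execution
ledger.charge(key, 1500)  # re-dispatched after failover: no double charge
```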

The same durability boundary applies to Workflow starts and Signals: a StartWorkflowExecution or SignalWorkflowExecution call that returns success is durably committed in the active region, and replicated asynchronously to the replica.

How Workflow Id uniqueness is preserved after a forced failover

The Workflow Id uniqueness guarantee — at most one Open Workflow Execution per Workflow Id — is always enforced within the active Namespace, and conflict resolution preserves it across a failover. The guarantee limits how many Executions are Open at the same time, not how many Run Ids a Workflow Id accumulates over its lifetime. That distinction is what lets conflict resolution reconcile a divergence without ever running the same Workflow Id twice concurrently.

Steady state. The active region enforces uniqueness on every write and asynchronously replicates the Event History to the replica.
Failover with divergence. In a forced failover when replication lag is present, both regions can independently append events to the same Workflow Execution. When the regions reconnect, their Event Histories have diverged for that Workflow Id.
Fork instead of merge. Temporal Cloud does not interleave the divergent histories. Events from the previously active Namespace that arrive after the failover cannot be directly applied, so Temporal Cloud forks the Event History into separate branches, each identified by its own Run Id. Its conflict resolution process keeps one branch as the Open Execution and supersedes the other, leaving exactly one Open Workflow Execution per Workflow Id. The Temporal Service ensures the resulting Event Histories remain valid and replayable by SDKs.
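The three stages above can be pictured with a small model. This is a conceptual sketch only: the `Branch` type, the event names, and especially the rule that the newly active region's branch stays open are assumptions for illustration, since this page does not specify how the surviving branch is chosen.

```python
import uuid
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Branch:
    """One branch of a forked Event History (hypothetical model)."""
    run_id: str
    events: List[str]
    is_open: bool = True


def fork_and_resolve(
    shared_prefix: List[str],
    old_active_tail: List[str],
    new_active_tail: List[str],
) -> Tuple[Branch, Branch]:
    """Fork diverged histories into two branches and keep exactly one open.

    Each branch is the shared prefix plus one region's own events; the two
    tails are never interleaved. Returns (kept branch, superseded branch).
    """
    old_branch = Branch(str(uuid.uuid4()), shared_prefix + old_active_tail)
    new_branch = Branch(str(uuid.uuid4()), shared_prefix + new_active_tail)
    # Assumption for this sketch: the newly active region's branch stays open.
    old_branch.is_open = False
    return new_branch, old_branch


kept, superseded = fork_and_resolve(
    shared_prefix=["WorkflowExecutionStarted", "WorkflowTaskScheduled"],
    old_active_tail=["SignalReceived(A)"],
    new_active_tail=["SignalReceived(B)"],
)
```

Whatever rule selects the surviving branch, the invariant to notice is that the count of open branches for the Workflow Id ends at exactly one.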

This guarantee is separate from durability: a forced failover never runs the same Workflow Id twice concurrently, but a start that had not yet replicated can still be lost if the original region is permanently lost, as described above.
