[BEE-19017] Lease-Based Coordination
INFO
A lease is a time-bounded grant of authority: the holder may act without contacting the grantor for the duration of the lease, and the authority expires automatically if not renewed — making leases more fault-tolerant than indefinite locks because holder crashes resolve themselves without requiring failure detection.
Context
Cary Gray and David Cheriton introduced leases in "Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency" (SOSP 1989). The problem they were solving: a distributed file system cache needs to know whether it can serve a cached read without checking the server for updates. A traditional locking approach grants the cache a read lock; when the cached data changes, the server revokes the lock. But revocation requires the server to find and contact every cache holder — expensive — and if a cache crashes while holding a lock, the server must detect the crash and reclaim the lock before allowing writes. Leases solve both problems: grant the cache authority to serve cached data for time T; to keep that authority, the cache renews the lease; when T expires without renewal, the server is free to grant a conflicting lease to someone else. No revocation protocol needed. No crash detection needed.
The invariant leases provide: at any given time, no two conflicting leases on the same resource are simultaneously valid. The grantor enforces this by waiting at least T_lease + T_clock_skew before granting a conflicting lease. If the previous holder's clock and the grantor's clock agree within T_clock_skew, the previous holder will have stopped acting on its lease before the new one is granted. This requires that clocks are bounded in their divergence — an assumption that NTP with bounded skew, GPS time, or Google's TrueTime (see BEE-19008) can provide.
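A minimal sketch of the grantor-side enforcement in Go (hypothetical names; a single in-memory grantor, not any particular system's API):
import (
	"errors"
	"sync"
	"time"
)

// Grantor illustrates the invariant: a conflicting lease is granted only
// after the previous lease's TTL plus the maximum clock skew has elapsed.
type Grantor struct {
	mu      sync.Mutex
	maxSkew time.Duration
	holder  map[string]string    // resource -> current holder
	expiry  map[string]time.Time // resource -> expiry on the grantor's clock
}

func NewGrantor(maxSkew time.Duration) *Grantor {
	return &Grantor{maxSkew: maxSkew, holder: map[string]string{}, expiry: map[string]time.Time{}}
}

func (g *Grantor) Acquire(resource, client string, ttl time.Duration) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	// Wait out expiry + maxSkew before re-granting: even a holder whose clock
	// runs slow by maxSkew has stopped acting on its lease by then.
	if until, ok := g.expiry[resource]; ok &&
		g.holder[resource] != client && time.Now().Before(until.Add(g.maxSkew)) {
		return errors.New("resource leased; retry after TTL + max clock skew")
	}
	g.holder[resource] = client
	g.expiry[resource] = time.Now().Add(ttl)
	return nil
}

func (g *Grantor) Renew(resource, client string, ttl time.Duration) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	// Renewal is valid only while the caller still holds an unexpired lease.
	if g.holder[resource] != client || time.Now().After(g.expiry[resource]) {
		return errors.New("lease lost")
	}
	g.expiry[resource] = time.Now().Add(ttl)
	return nil
}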
Google's Chubby lock service (Burrows, OSDI 2006) operationalized this pattern at scale. Chubby is a Paxos-replicated service providing distributed locks and a small amount of storage; it is the coordination substrate for GFS leader election, Bigtable metadata management, and MapReduce. Clients hold session leases against the Chubby master. If a client cannot renew its session lease within the session timeout, it enters a "jeopardy" period in which it must stop acting on any locks or cached data it holds — the distributed equivalent of a process stopping itself rather than risking stale authority. When the session is restored, the client learns whether its locks are still valid. ZooKeeper emerged as an open-source alternative using the same core idea: ephemeral nodes that vanish when their creating session expires.
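ZooKeeper's ephemeral nodes express the session-lease idea compactly; a sketch using the community go-zookeeper client (endpoint, path, and value are illustrative):
import (
	"time"

	"github.com/go-zookeeper/zk"
)

// Connecting establishes a session; the session timeout is the lease, and
// the client library renews it with heartbeats automatically.
conn, _, err := zk.Connect([]string{"localhost:2181"}, 10*time.Second)
if err != nil {
	panic(err)
}
defer conn.Close()

// An ephemeral node lives exactly as long as the session that created it.
// If this process dies, the session lease expires and ZooKeeper deletes the node.
_, err = conn.Create("/services/api/node-1", []byte("10.0.0.5:8080"),
	zk.FlagEphemeral, zk.WorldACL(zk.PermAll))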
Leader leases are a specific application of this pattern to consensus-based replication. In Raft, the leader normally serves all reads by appending a read to the log and waiting for quorum acknowledgment — this guarantees the read reflects the latest committed state. But this adds one log round-trip to every read. An optimization: the leader can serve reads without the log if it can prove it is still the leader. A leader lease works by having the leader record the time of its last heartbeat acknowledged by a quorum. For the next L seconds (the lease duration), no other leader can have been elected — because a new election requires a majority quorum, and a quorum requires a majority of nodes to have timed out on the current leader, which takes longer than L. The leader can therefore serve reads directly during the lease window, reducing read latency (see the TiKV-style pseudocode in the Example section below). TiKV uses this for "lease reads"; CockroachDB defaults to leader leases for its range leaseholder mechanism.
Design Thinking
Leases trade some availability for fault tolerance. With a lock, if the holder crashes while holding it, the lock is stuck until the holder is detected as failed and the lock is forcibly released. With a lease, if the holder crashes, the lease expires automatically within T. No detection is needed — the grantor simply waits. The downside: the lease holder must renew before expiry. If the holder is alive but temporarily partitioned from the grantor, it must stop acting on its lease authority at expiry, even though it is still running. This causes brief availability gaps that locks would not cause. The right choice depends on whether crashes or partitions are more operationally painful.
Lease duration is a safety-availability tradeoff. Short leases (seconds) expire quickly after holder failure, minimizing the time other clients are blocked waiting for authority. Long leases (minutes) reduce renewal overhead and tolerate longer network interruptions before the holder must stop. Kubernetes uses 15 seconds for node lease duration (the kubelet must renew every 10 seconds to maintain a 5-second safety margin); etcd suggests 5–30 seconds for service leases. Setting lease duration shorter than the maximum observed network RTT is dangerous: the holder renews, the renewal message is delayed past the TTL, the grantor expires the lease and grants it to another node, while the original holder still thinks it holds authority — both act simultaneously.
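That last constraint reduces to a simple inequality: renewal interval plus worst-case round trip plus clock skew must fit inside the TTL. A hypothetical validation helper in Go, using the Kubernetes numbers quoted above (the RTT and skew figures are illustrative assumptions):
import (
	"fmt"
	"time"
)

// validateLeaseConfig rejects configurations where a live holder could be
// expired by the grantor while still believing it holds authority.
func validateLeaseConfig(ttl, renewEvery, maxRTT, maxSkew time.Duration) error {
	if renewEvery+maxRTT+maxSkew >= ttl {
		return fmt.Errorf("unsafe lease config: renew(%v) + maxRTT(%v) + skew(%v) >= ttl(%v)",
			renewEvery, maxRTT, maxSkew, ttl)
	}
	return nil
}

// Kubernetes-style defaults: 15s TTL, renew every 10s. The 5s margin must
// absorb both the network round trip and the clock skew.
err := validateLeaseConfig(15*time.Second, 10*time.Second, 2*time.Second, 500*time.Millisecond)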
Clock skew is a correctness constraint, not just a performance concern. The grantor-side invariant — wait T_lease + T_clock_skew before granting a conflicting lease — requires that both parties agree on when the lease expires. If clocks drift arbitrarily, the holder may believe the lease is valid while the grantor has already granted it to another. Production systems bound this in one of three ways: use NTP with a known maximum skew (100ms is typical in a data center; etcd documentation recommends 250ms per server), use hardware GPS time, or use TrueTime's bounded uncertainty interval and commit-wait.
Leader leases require clock synchronization for safety. A leader lease that lasts L seconds is safe only if: (a) the clock skew between the leader and any potential new leader is less than L, and (b) the leader will stop serving reads when its lease expires even if it cannot contact followers. If a leader with a fast clock serves a read after its lease has expired according to the slow follower's clock, and a new leader has been elected in that gap, the fast-clock leader serves a stale read. CockroachDB documents a maximum clock skew of 500ms and terminates if the skew exceeds 400ms; its leader leases are sized accordingly.
Visual
Example
etcd lease for service registration (Go client):
// Service registers itself with a lease; if the process dies, the key auto-expires.
// Error handling elided for brevity.
import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

cli, _ := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
defer cli.Close()

// Create a 30-second lease
lease, _ := cli.Grant(context.Background(), 30)

// Attach key to lease — key disappears when lease expires
cli.Put(context.Background(), "/services/api/node-1", "10.0.0.5:8080",
	clientv3.WithLease(lease.ID))

// Keep the lease alive in a background goroutine.
// The etcd client renews at intervals of roughly TTL/3 (about every 10s here).
keepAlive, _ := cli.KeepAlive(context.Background(), lease.ID)
go func() {
	for range keepAlive {
		// drain channel; KeepAlive sends responses to confirm renewal
	}
}()

// If this process dies, keepalives stop, etcd expires the lease after 30s,
// and /services/api/node-1 is automatically deleted — no explicit cleanup needed.

Kubernetes leader election via Lease object:
# Lease object created/renewed by the current leader
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  holderIdentity: "controller-manager-pod-abc123"
  leaseDurationSeconds: 15            # must renew within 15s
  renewTime: "2026-04-14T10:30:00Z"
  acquireTime: "2026-04-14T09:00:00Z"
  leaseTransitions: 3                 # how many times leadership changed

Leader election in a controller using client-go:
import "k8s.io/client-go/tools/leaderelection"
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
Lock: resourceLock,
LeaseDuration: 15 * time.Second, // how long a lease is valid
RenewDeadline: 10 * time.Second, // max time to renew before giving up leadership
RetryPeriod: 2 * time.Second, // how often non-leader retries to acquire
Callbacks: leaderelection.LeaderCallbacks{
OnStartedLeading: func(ctx context.Context) { runControllerLoop(ctx) },
OnStoppedLeading: func() { os.Exit(1) }, // safe: process dies, lease expires
OnNewLeader: func(identity string) { log.Printf("leader: %s", identity) },
},
})
// If leader process dies, its lease expires after LeaseDuration (15s)
// A new leader is elected within LeaseDuration + RetryPeriod (17s max)Leader lease for fast reads (TiKV-style pseudocode):
# Raft leader serves reads without a log round-trip while its lease is valid
import time
now = time.monotonic   # monotonic clock; immune to wall-clock jumps

LEASE_DURATION = 9.0   # seconds; must be < election_timeout (10s)
MAX_CLOCK_SKEW = 0.5   # seconds; bounded by NTP configuration

class RaftLeader:
    def __init__(self):
        self.lease_start = None  # time of last heartbeat acknowledged by a quorum
        self.local_state = {}    # leader's applied key-value state

    def on_heartbeat_quorum(self):
        self.lease_start = now()  # majority acknowledged → we are still the leader

    def serve_read(self, key):
        lease_remaining = self.lease_start + LEASE_DURATION - now()
        if lease_remaining > MAX_CLOCK_SKEW:
            # Safe: no other leader could have been elected within this window
            return self.local_state[key]  # no Raft log round-trip
        else:
            # Lease expired or about to expire — fall back to linearizable read
            return self.raft_read(key)    # append to the log, wait for quorum

Related BEEs
- BEE-19002 -- Consensus Algorithms: Raft leader leases extend the Raft leader's authority to serve reads without log round-trips; the safety of the lease depends on Raft's election guarantee — a quorum cannot elect a new leader faster than one election timeout, which bounds the lease duration
- BEE-19005 -- Distributed Locking: locks and leases solve the same mutual exclusion problem by different means; locks require explicit release or failure detection, leases self-expire — combine both (timed lock = lease) for the resilience of leases and the semantics of locks
- BEE-19008 -- Clock Synchronization and Physical Time: lease safety depends on bounded clock skew; the grantor must wait TTL + max_clock_skew before re-granting, which requires knowing max_clock_skew — TrueTime provides this with a hardware-backed uncertainty interval
- BEE-19015 -- Failure Detection: leases eliminate the need for active failure detection on the grantor side — instead of detecting the holder's crash (hard), the grantor waits for the lease to expire (easy); the tradeoff is that the grant gap (time no one holds authority) equals the remaining lease TTL at crash time
References
- Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency -- Gray and Cheriton, SOSP 1989
- The Chubby Lock Service for Loosely-Coupled Distributed Systems -- Burrows, OSDI 2006
- Chubby Paper PDF -- Google Research
- Lease API -- etcd Documentation
- How to Create a Lease -- etcd Documentation
- Leases -- Kubernetes Documentation
- Lease Read -- TiKV Blog
- Replication Layer -- CockroachDB Architecture