
[BEE-19028] Fencing Tokens

INFO

A fencing token is a monotonically increasing number issued by a lock service on each lock acquisition; the protected resource rejects any write whose token is lower than the highest it has already accepted, preventing a stale lock holder from corrupting shared state even after its lease has silently expired.

Context

Distributed locks and leases appear to prevent concurrent mutation of shared resources, but they carry a subtle and dangerous assumption: the lock holder is continuously aware of whether it still holds the lock. In practice, a process holding a lease can be suspended for an arbitrary duration — a stop-the-world garbage collection pause, OS scheduler preemption, a VM being live-migrated, or a network partition that temporarily isolates the node. During this pause, the lease may expire and be re-issued to another client; when the paused process resumes, it does not know this happened. It proceeds to write to the shared resource, racing with the new legitimate lock holder.

Martin Kleppmann articulated this failure mode precisely in his 2016 blog post "How to do distributed locking" and proposed the fencing token as the remedy. The mechanism is simple: every time a client acquires a lock, the lock service issues an integer token that is strictly greater than all previously issued tokens. Every write to the protected resource includes the current token. The resource server records the highest token it has processed; it rejects any write whose token is lower than this high-water mark. A paused client that resumes with a stale token 33 is rejected because the resource has already processed writes with token 34 from the new lock holder.
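
A minimal sketch of the resource-side check described above (in-memory only for illustration; the class name and token values are not from any particular system):

python
class FencedResource:
    """Toy resource server that enforces fencing tokens."""

    def __init__(self):
        self.high_water_mark = 0   # highest token accepted so far
        self.data = None

    def write(self, data, fencing_token: int) -> bool:
        if fencing_token < self.high_water_mark:
            return False           # stale lock holder: write rejected
        self.high_water_mark = fencing_token
        self.data = data
        return True

resource = FencedResource()
resource.write("write from new holder", fencing_token=34)     # accepted
resource.write("write from paused holder", fencing_token=33)  # rejected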

The same word — "fencing" — appears in high-availability cluster management with a related but distinct meaning. In that context, fencing refers to forcibly evicting a suspected-failed node from access to shared resources, either by cutting its power (STONITH: "Shoot The Other Node In The Head"), revoking its SAN zoning, or placing a SCSI-3 persistent reservation that blocks the node's disk access. Both uses share the same invariant: a resource (storage, cluster service) actively enforces exclusion rather than trusting the lock holder to self-regulate.

Design Thinking

A lock alone is insufficient — the resource must participate. Distributed locks implemented in ZooKeeper, etcd, or Redis provide a rendezvous point for clients to coordinate, but they cannot physically prevent a client from writing to a database or file system after its lease expires. The resource being protected must be aware of the current generation of the lock and must actively gate writes against it. This is the crux of the fencing token pattern: safety is provided by the resource, not by the lock service.

Process pauses are unbounded and invisible to the paused process. A GC pause in the JVM can last hundreds of milliseconds to seconds during major collection; a container under memory pressure can be CPU-throttled for seconds; a VM snapshot can freeze execution for tens of seconds. The process experiencing the pause has no way to know how long it was suspended — System.currentTimeMillis() or wall clock time cannot be used to detect expiry from within the paused process. A fencing token externalizes the "generation" of the lock to the resource server, which is not paused and can evaluate the token objectively.
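
As a contrast, a minimal sketch of the unsafe self-check pattern this paragraph warns against (lease_expires_at and write_to_storage are illustrative placeholders):

python
import time

def unsafe_write(data, lease_expires_at: float):
    # UNSAFE: the check and the write are not atomic. The process can be
    # suspended (GC pause, VM freeze, CPU throttling) between the two lines;
    # by the time the write executes, the lease may already belong to another
    # client, and nothing on the resource side rejects the stale write.
    if time.time() < lease_expires_at:
        write_to_storage(data)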

The token source must be linearizable. If the lock service issues tokens from a replicated state machine (ZooKeeper, etcd/Raft), the token is linearizable: token 34 is provably issued after token 33 was granted and released. If the token source is a non-linearizable system (Redis in a non-WAIT configuration, multiple Redis primaries in Redlock), tokens may be issued out of order or duplicated across leader failovers — violating the monotonicity invariant that fencing depends on. Kleppmann's critique of Redlock centers on this point: Redlock produces no monotonically increasing token and cannot support fencing.

Best Practices

MUST include the fencing token in every write to the protected resource. A single write that omits the token check is enough to corrupt state. The token MUST flow from the lock acquisition through every operation that touches shared state — passed as a request header, a database column, or a message attribute. If it cannot be included (e.g., the downstream system has no notion of conditional writes), document the gap explicitly and treat the system as providing only best-effort exclusion.
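
A hedged sketch of carrying the token on an HTTP write (the endpoint path, header name, and 409 response convention are assumptions for illustration, not a standard):

python
import requests

def update_resource(base_url: str, resource_id: str, payload: dict, fencing_token: int):
    """Send a write that carries the fencing token on every request."""
    resp = requests.put(
        f"{base_url}/resources/{resource_id}",
        json=payload,
        headers={"X-Fencing-Token": str(fencing_token)},  # token on every write
        timeout=5,
    )
    if resp.status_code == 409:  # resource rejected the token as stale
        raise RuntimeError(f"stale fencing token {fencing_token}")
    resp.raise_for_status()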

The resource server MUST maintain a monotonic high-water mark and reject stale tokens. The check at the resource is: if incoming_token < high_water_mark: reject. The high-water mark MUST itself be stored durably (or in a linearizable store) so that it survives resource server restarts. A resource server that checks the token only in memory loses the high-water mark on restart, allowing a stale client to succeed immediately after a resource server reboot.
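
A minimal sketch of a durable high-water mark using SQLite, so the mark lives in the same durable store as the data and survives restarts (table name and schema are illustrative):

python
import sqlite3

conn = sqlite3.connect("resource.db")
conn.execute("""CREATE TABLE IF NOT EXISTS shared_resources (
    resource_id   TEXT PRIMARY KEY,
    data          TEXT,
    fencing_token INTEGER NOT NULL DEFAULT 0)""")

def apply_write(resource_id: str, data: str, fencing_token: int) -> bool:
    """Token check and write in one transaction; the mark persists across restarts."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO shared_resources (resource_id) VALUES (?)",
            (resource_id,),
        )
        cur = conn.execute(
            "UPDATE shared_resources SET data = ?, fencing_token = ? "
            "WHERE resource_id = ? AND fencing_token < ?",
            (data, fencing_token, resource_id, fencing_token),
        )
    return cur.rowcount == 1   # 0 rows updated means the token was stale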

Choose a fencing token source with proven linearizability. For ZooKeeper-backed locks, use the zxid of the lock acquisition or the cversion of the lock's parent znode: the zxid is a global transaction counter maintained by the ZooKeeper quorum, and the cversion increments monotonically each time a lock child is created under that znode. For etcd-backed locks, use the creation revision of the lock key (available as the header.revision in the lock response); etcd's Raft-replicated log guarantees this revision is globally ordered. MUST NOT use a wall clock timestamp, a random UUID, or any value from a non-linearizable source as a fencing token.

MUST NOT rely on lease TTL as the sole protection. A lease may expire on the lock service's clock while the holder is paused; neither the holder nor the resource knows this at the moment of expiry. Fencing tokens decouple exclusion enforcement from the timeliness of the lock service — even if the lock service itself is partitioned or slow, the resource server can reject stale writes based solely on token ordering.

For HA cluster node fencing, configure fencing before enabling automatic failover. A cluster without a working fencing device — power controller, IPMI, SAN zoning, or SBD (Storage-Based Death) — MUST NOT be configured for automatic failover. Without fencing, a cluster manager that assumes a failed node is down may promote a secondary while the primary is still writing, producing split-brain data corruption. With its default settings (stonith-enabled=true), the Pacemaker cluster manager refuses to start resources until a fencing device is configured.

Implement fencing token validation as a conditional write. Many storage systems natively support conditional writes that can serve as the fencing enforcement point: PostgreSQL UPDATE ... WHERE fencing_token < $incoming_token, DynamoDB ConditionExpression: "fencing_token < :token", etcd txn(compare, success, failure). These atomic compare-and-set operations combine the token check and the write in a single transaction, preventing TOCTOU (time-of-check-to-time-of-use) races between the check and the write.

Deep Dive

ZooKeeper as a fencing token source. ZooKeeper maintains a global transaction ID (zxid) that increments with every write to any znode in the ensemble. When a client creates an ephemeral sequential znode to acquire a lock, the creation zxid (czxid) is the lock's generation token. The czxid is exposed via the Stat structure returned by read operations such as exists and getData. Because zxid is assigned by the Zab consensus protocol (ZooKeeper Atomic Broadcast), it is linearizable: if the lock service issues zxid=1042, no future leader election or failover can reissue zxid=1042 or lower.
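
A sketch of extracting this token with the kazoo client (the full lock election step and error handling are omitted; host and paths are illustrative):

python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Create an ephemeral sequential znode under the lock's parent path.
# (A complete lock recipe also verifies that this znode has the lowest
# sequence number; that election step is omitted here.)
path = zk.create("/locks/resource-1/lock-",
                 ephemeral=True, sequence=True, makepath=True)

# The creation zxid (czxid) in the znode's Stat is the fencing token:
# it is assigned by the Zab leader and is globally monotonic.
_, stat = zk.get(path)
fencing_token = stat.czxid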

etcd as a fencing token source. etcd lock operations (implemented via the concurrency package's Mutex) create a key attached to a TTL-bearing lease. The header.revision field in the Lock (or LeaseGrant) response is the etcd key-value store revision at the time of the grant: a globally monotonically increasing counter that advances with every write committed through Raft. This revision is the fencing token. On the resource side, use an etcd transaction whose compare clause checks the stored token before applying the update, so that validation and write happen in a single Raft-committed operation.

STONITH and infrastructure-level fencing. In HA database clusters (PostgreSQL with Patroni, MySQL Group Replication), a failed primary must be fenced before a secondary can be promoted — otherwise both may write to the same storage. Fencing methods in order of reliability: (1) Power fencing: IPMI/BMC sends a power-off command — the node is definitively dead, but the command may fail if the BMC is also unreachable. (2) SAN/disk fencing: SCSI-3 persistent reservations (PR) allow a surviving node to register a reservation that prevents I/O from the fenced node's HBA at the storage level — this works even if the node's OS is still running. (3) SBD (Storage-Based Death): a dedicated small block device carries fence messages; each node watches its slot on that device and self-fences (reboots) when it finds a poison-pill message written to its slot.

Why Redlock cannot provide fencing tokens. Redlock acquires a lock by writing a random UUID to N Redis primaries with a TTL, then verifying that a majority responded in time. The UUID is opaque — it carries no ordering information. If Client 1 acquires the lock, gets delayed, and the lock expires, Client 2 acquires the lock with a different UUID. When Client 1 resumes and presents its UUID to the resource, the resource cannot determine whether UUID-A or UUID-B is "newer" because they have no ordering relationship. Kleppmann's conclusion is that Redlock should only be used for efficiency (reducing duplicate work) rather than correctness (preventing data corruption), because it cannot support fencing.

Example

etcd-backed lock with fencing token, resource-side validation (Python pseudocode):

python
import etcd3

etcd = etcd3.client(host='localhost', port=2379)

def acquire_lock_with_token(resource_name: str, ttl: int = 10):
    """Acquire a distributed lock and return the fencing token (etcd revision)."""
    lease = etcd.lease(ttl)
    # Lock key: only one client can hold this key at a time
    lock_key = f"/locks/{resource_name}"
    # Try to put the key with the lease; succeeds only if key doesn't exist
    success, responses = etcd.transaction(
        compare=[etcd.transactions.version(lock_key) == 0],
        success=[etcd.transactions.put(lock_key, "locked", lease=lease)],
        failure=[],
    )
    if not success:
        raise RuntimeError("Lock already held")
    # The response header revision is the fencing token
    fencing_token = responses[0].header.revision
    return lease, fencing_token

class StaleTokenError(Exception):
    """Raised when the resource rejects a write that carries a stale fencing token."""

def write_to_resource(resource_id: str, data: dict, fencing_token: int):
    """Write to the protected resource, including the fencing token."""
    # resource_client is a placeholder for the protected resource's API client.
    # The resource server rejects any token lower than the highest it has accepted.
    response = resource_client.update(
        resource_id=resource_id,
        data=data,
        fencing_token=fencing_token,  # included in every write
    )
    if response.status == "TOKEN_TOO_OLD":
        raise StaleTokenError(f"Token {fencing_token} rejected; another client holds a higher token")

Resource server side (PostgreSQL): the token check and the write execute atomically in a single statement:

sql
UPDATE shared_resources
SET    data          = $1,
       fencing_token = $2
WHERE  resource_id   = $3
  AND  fencing_token < $2;   -- reject if the incoming token is not newer
-- 0 rows updated → stale token → application raises StaleTokenError

DynamoDB conditional write (fencing token stored as an item attribute):

python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('shared_resources')

def write_with_fencing(resource_id: str, data: dict, fencing_token: int):
    """Write only if the stored fencing_token is less than the incoming one."""
    try:
        table.update_item(
            Key={'resource_id': resource_id},
            UpdateExpression='SET #d = :data, fencing_token = :token',
            ConditionExpression='attribute_not_exists(fencing_token) OR fencing_token < :token',
            ExpressionAttributeNames={'#d': 'data'},
            ExpressionAttributeValues={
                ':data': data,
                ':token': fencing_token,
            },
        )
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            raise StaleTokenError(f"Write rejected: fencing token {fencing_token} is stale")
        raise

Related Topics

  • BEE-19005 -- Distributed Locking: fencing tokens address the fundamental safety gap in distributed locks — a lock provides coordination, but only a fencing token enforced by the resource provides safety when the lock holder is paused
  • BEE-19017 -- Lease-Based Coordination: leases are the mechanism by which distributed locks expire; fencing tokens are what make lease expiry safe — without tokens, a lease holder that outlasts its lease can still corrupt the resource
  • BEE-11006 -- Optimistic vs Pessimistic Concurrency Control: fencing token enforcement at the resource is a form of optimistic concurrency control — the resource accepts writes optimistically but rejects them if the token reveals a stale view, analogous to version-number OCC
  • BEE-19002 -- Consensus Algorithms: Paxos and Raft: the reliability of fencing tokens as a safety mechanism depends entirely on the linearizability of the token source; consensus-based systems (ZooKeeper/Zab, etcd/Raft) provide this guarantee, while non-consensus systems (Redis) cannot

References