SIMD-0046: optimistic cluster restart automation

simd: '0046'
title: Optimistic cluster restart automation
authors:
  - Wen Xu (Anza)
category: Standard
type: Core
status: Implemented
created: 2023-04-07
feature: N/A (gated by command line flag instead)
development:
  - Anza - implemented
  - Firedancer - implemented

Summary

During a cluster restart following an outage, make validators enter a separate recovery protocol that uses Gossip to exchange local status and automatically reach consensus on the block to restart from. Proceed to restart if validators in the restart can reach agreement, or print debug information and halt otherwise. To distinguish the new restart process from other operations, we call the new process "Wen restart".

New Terminology

  • cluster restart: When there is an outage such that the whole cluster stalls, humans may need to restart most of the validators with a sane state so that the cluster can continue to function. This is different from a sporadic single-validator restart, which does not impact the cluster.

  • cluster restart slot: In the current cluster restart scheme, humans normally decide on one block for all validators to restart from. This is very often the highest optimistically confirmed block, because an optimistically confirmed block should never be rolled back. But it is also okay to start from a child of the highest optimistically confirmed block as long as consensus can be reached.

  • optimistically confirmed block: a block which has received votes from validators holding more than 2/3 of the stake in the cluster. Our algorithm tries to guarantee that an optimistically confirmed block will never be rolled back.

  • wen restart phase: During the proposed optimistic cluster restart automation process, the validators in restart first spend some time exchanging information, repairing missing blocks, and finally reaching consensus. They only resume normal block production and voting after consensus is reached. We call this preparation phase, in which block production and voting are paused, the wen restart phase.

  • wen restart shred version: Today we update shred_version during a cluster restart; it is used to verify received shreds and to filter Gossip peers. In the proposed optimistic cluster restart plan, we introduce a new temporary shred version for the wen restart phase so validators in restart do not interfere with those not in restart. Currently this wen restart shred version is calculated as (current_shred_version + 1) % 0xffff (a small sketch follows this list).

  • RESTART_STAKE_THRESHOLD: We need enough validators to participate in a restart so they can make decisions for the whole cluster. If everything worked perfectly, we would only need 2/3 of the total stake. However, validators could go down or misbehave, so we currently set RESTART_STAKE_THRESHOLD at 80%, which is the same value we use today for --wait_for_supermajority.
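
As a rough illustration of the shred version calculation above, here is a minimal sketch in Rust (the function name is hypothetical and not part of the proposal):

```rust
/// Minimal sketch: derive the temporary shred version used during the wen
/// restart phase from the cluster's current shred version.
fn wen_restart_shred_version(current_shred_version: u16) -> u16 {
    // (current_shred_version + 1) % 0xffff, computed in u32 to avoid overflow.
    ((current_shred_version as u32 + 1) % 0xffff) as u16
}
```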

Motivation

Currently during a cluster restart, validator operators need to decide the highest optimistically confirmed slot, then restart the validators with new command-line arguments.

The current process involves a lot of human intervention; if people make a mistake in deciding the highest optimistically confirmed slot, it is detrimental to the viability of the ecosystem.

We aim to automate the negotiation of the highest optimistically confirmed slot and the distribution of all blocks on that fork, so that we can lower the possibility of human mistakes in the cluster restart process. This also reduces the burden on validator operators: they do not have to stay around while the validators automatically try to reach consensus, because a validator will halt and print debug information if anything goes wrong, and operators can set up their own monitoring accordingly.

However, there are many ways an automatic restart can go wrong, mostly due to unforeseen situations or software bugs. To make things really safe, we apply multiple checks during the restart; if any check fails, the automatic restart is halted, debugging info is printed, and the validator waits for human intervention. Therefore we say this is an optimistic cluster restart procedure.

Alternatives Considered

Automatically detect outage and perform cluster restart

The reaction time of a human in case of emergency is measured in minutes, while a cluster restart in which humans initiate validator restarts takes hours. We considered various approaches to automatically detect an outage and perform a cluster restart, which could reduce recovery time to minutes or even seconds.

However, automatically restarting the whole cluster seems risky: if the recovery process itself does not work, it might be some time before the problem gets a human's attention, and it does not solve the cases where a new binary is needed. So for now we still plan to keep a human in the loop.

After we gain more experience with the restart approach in this proposal, we may slowly try to make the process more automatic to improve reliability.

Use Gossip and consensus to figure out restart slot before the restart

The main difference between this alternative and the current proposal is that this alternative tries to make the cluster enter the restart preparation phase automatically, without human intervention.

While getting humans out of the loop improves recovery speed, there are concerns about recovery Gossip messages interfering with normal Gossip messages, and automatically starting a new kind of message in Gossip seems risky.

Automatically reduce block production in an outage

Right now we have a vote-only mode: a validator will only pack vote transactions into new blocks if the tower distance (last_vote - local_root) is greater than 400 slots.

Unfortunately, in previous outages vote-only mode was not enough to save the cluster. There are proposals for more aggressive block production reduction to save the cluster, for example having a leader produce only one block in the four consecutive slots allocated to it.

However, this only solves the problem for specific types of outages, and it seems risky to aggressively reduce block production, so we are not proceeding with this proposal for now.

Detailed Design

The new protocol tries to make all restarting validators get the same data blocks and the same set of last votes, so that they will with high probability converge on the same canonical fork and proceed.

When the cluster is in need of a restart, we assume validators holding at least RESTART_STAKE_THRESHOLD of the total stake will enter restart mode. Then the following steps will happen:

  1. The operator restarts the validator into the wen restart phase at boot, where it will not make new blocks or vote. The validator propagates its local voted fork information to all other validators in restart.

  2. While aggregating local vote information from all others in restart, the validator repairs all blocks which could potentially have been optimistically confirmed.

  3. After enough validators are in restart and repair is complete, the validator counts votes on each fork and computes local heaviest fork.

  4. A coordinator, which is configured on everyone's command line, sends out its heaviest fork to everyone.

  5. Each validator verifies that the coordinator's choice is reasonable:

    1. If yes, proceed and restart

    2. If no, print out what it thinks is wrong, halt and wait for human

Each step is explained in detail below.

We assume that at most 5% of the validators in restart can be malicious or buggy; this number is consistent with other algorithms in the consensus protocol. We call these non-conforming validators.

Wen restart phase

  1. Gossip last vote and ancestors on that fork

    The main goal of this step is to propagate the most recent ancestors on the last voted fork to all others in restart.

    We use a new Gossip message, RestartLastVotedForkSlots, whose fields are:

    • last_voted_slot: u64, the slot of the sender's last vote; this also serves as last_slot for the bit vector.
    • last_voted_hash: Hash, the bank hash of the last voted slot.
    • ancestors: a run-length encoded bit vector representing the slots on the sender's last voted fork; the least significant bit is always last_voted_slot, and the most significant bit is last_voted_slot - 65535.

    The max distance between the oldest ancestor slot and the last voted slot is hard coded at 65535 slots, because 400ms * 65535 ≈ 7.3 hours; we assume that most validator administrators would notice an outage within 7 hours, and that optimistic confirmation must have halted within 64k slots of the last confirmed block. Also, an offset of up to 65535 slots fits into a u16, which makes the encoding more compact. If a validator restarts more than 7 hours after the outage, it cannot join the restart this way. If too many validators fail to restart within 7 hours, we fall back to the manual, interactive cluster restart method.

    When a validator enters restart, it uses wen restart shred version to avoid interfering with those outside the restart. To be extra cautious, we will also filter out RestartLastVotedForkSlots and RestartHeaviestFork (described later) in Gossip if a validator is not in wen restart phase. There is a slight chance that the wen restart shred version would collide with the shred version after the wen restart phase, but with the filtering described above it should not be a problem.

    When a validator receives RestartLastVotedForkSlots from someone else, it discards all slots smaller than its local root. Because the local root should be an optimistically confirmed slot, it does not need to keep any slot older than the local root. A sketch of the message shape and this filtering follows.
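
    As a rough sketch of the message shape and the receive-side filtering described above (the types here are simplified stand-ins, not the actual Gossip/CRDS definitions):

    ```rust
    type Slot = u64;
    type Hash = [u8; 32];

    /// Simplified run-length encoded bit vector; bit i (counting from the least
    /// significant end) corresponds to slot last_voted_slot - i.
    struct RunLengthEncodedSlots(Vec<u16>);

    struct RestartLastVotedForkSlots {
        /// Slot of the sender's last vote; also the last_slot of the bit vector.
        last_voted_slot: Slot,
        /// Bank hash of the last voted slot.
        last_voted_hash: Hash,
        /// Slots on the sender's last voted fork, covering at most 65,535 slots
        /// ending at last_voted_slot.
        ancestors: RunLengthEncodedSlots,
    }

    /// On receipt, drop every reported slot smaller than the local root.
    fn filter_received_slots(received: Vec<Slot>, local_root: Slot) -> Vec<Slot> {
        received.into_iter().filter(|slot| *slot >= local_root).collect()
    }
    ```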

  2. Repair ledgers up to the restart slot

    The main goal of this step is to repair all blocks which could potentially be optimistically confirmed.

    We need to prevent false negatives at all costs, because we cannot roll back an optimistically confirmed block. False positives are okay: when we select the heaviest fork in the next step, we will see all the potential candidates for optimistically confirmed slots, count the votes there, and remove the false positives.

    However, it is also overkill to repair every block presented by others. While RestartLastVotedForkSlots messages are being received and aggregated, a validator can categorize blocks missing locally into two categories: must-have and ignored.

    We repair all blocks with no less than 42% stake. The number is 67% - 5% - stake_on_validators_not_in_restart. Since we require that at least 80% of the stake joins the restart, any block with less than 67% - (100 - 80)% - 5% = 42% stake could never have been optimistically confirmed before the restart (see the sketch at the end of this step).

    It is possible that different validators see a different 80%, so their must-have sets might differ, but there will be another repair round in the final step, so this is fine. Whenever some block reaches 42%, repair can be started, because as more validators join the restart this number only goes up and never goes down.

    When a validator has received RestartLastVotedForkSlots from 80% of the stake and all the "must-have" blocks are repaired, it can proceed to the next step.
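
    As a small illustration of the must-have threshold (a sketch; the constant names are illustrative, and stake is expressed as a fraction of the total epoch stake):

    ```rust
    const OPTIMISTIC_CONFIRMATION_THRESHOLD: f64 = 0.67;
    const NON_CONFORMING_STAKE_ALLOWANCE: f64 = 0.05;

    /// A locally missing block becomes "must-have" (and is repaired) once the
    /// stake listing it in RestartLastVotedForkSlots reaches this fraction.
    fn must_have_threshold(stake_fraction_not_in_restart: f64) -> f64 {
        OPTIMISTIC_CONFIRMATION_THRESHOLD
            - NON_CONFORMING_STAKE_ALLOWANCE
            - stake_fraction_not_in_restart
    }

    fn main() {
        // With exactly 80% of the stake in restart: 0.67 - 0.05 - 0.20 = 0.42.
        assert!((must_have_threshold(0.20) - 0.42).abs() < 1e-9);
    }
    ```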

  3. Calculate heaviest fork

    After receiving RestartLastVotedForkSlots from validators holding more than RESTART_STAKE_THRESHOLD of the stake and repairing all slots in the "must-have" category, pick the heaviest fork as follows:

    1. Calculate the threshold for a block to be on the heaviest fork; the heaviest fork should contain every block that could possibly have been optimistically confirmed. The number is 67% - 5% - stake_on_validators_not_in_restart.

    For example, if 80% of the validators are in restart, the number would be 67% - 5% - (100-80)% = 42%. If 90% of the validators are in restart, the number would be 67% - 5% - (100-90)% = 52%.

    2. Sort all blocks over the threshold by slot number and verify that they form a single chain. The first block in the list should be the local root.

    If any block does not satisfy the above constraints, print the first offending block and exit.

    The list should not be empty; it should contain at least the local root.

    To see why the above algorithm is safe, we will prove that:

    1. Any block optimistically confirmed before the restart will always be on the list:

    Assume block A is one such block; it would have at least 67% stake. Discounting 5% non-conforming stake and the stake not participating in wen restart, at least 67% - 5% - stake_on_validators_not_in_restart stake should still report it, so it passes the threshold and is on the list.

    2. Any block in the list can have at most one child in the list:

    Let's use X to denote stake_on_validators_not_in_restart for brevity. Assume a block has children A and B both on the list; their combined stake would be at least 2 * (67% - 5% - X) = 124% - 2 * X. Because we only allow one RestartLastVotedForkSlots per pubkey, every validator selects either A or B, and it is easy to find and filter out violators who selected both. So the children's total stake can be at most 100% - X. Requiring 124% - 2 * X ≤ 100% - X gives X ≥ 24%, which is not possible when at least 80% of the stake is in the restart. So, by contradiction, any block in the list has at most one child in the list.

    3. If a block that was not optimistically confirmed before the restart is on the list, it can only be at the end of the list, and none of its siblings are on the list:

    Let's say block D is the first non optimistically confirmed block on the list, and its parent E is confirmed and on the list. We know from the previous point that E can have only one child on the list, therefore D must be at the end of the list and none of its siblings are on the list.

    Even if the last block D on the list is not optimistically confirmed, it already has at least 42% - 5% = 37% stake from conforming validators. Say F is its sibling with the most stake; F has less than 42% stake because it is not on the list. So picking D over F is comparable to a case where 5% of the stake switched from fork F to fork D, and the 80% of the cluster in the restart can switch to fork D if that turns out to be the heaviest fork.

    After picking the appropriate slot, replay the block and all its ancestors to obtain the bank hash of the picked slot. A simplified sketch of the selection logic follows.
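
    A simplified sketch of the selection and verification above (the BlockInfo type and error handling are assumptions; a real implementation would read stake totals from the aggregated Gossip data and parent links from the local blockstore):

    ```rust
    struct BlockInfo {
        slot: u64,
        parent_slot: u64,
        /// Fraction of total stake whose RestartLastVotedForkSlots contained this slot.
        stake_fraction: f64,
    }

    fn pick_heaviest_fork(
        mut blocks: Vec<BlockInfo>,
        local_root: u64,
        threshold: f64, // e.g. 0.42 when 80% of the stake is in restart
    ) -> Result<Vec<u64>, String> {
        // Keep only blocks that could possibly have been optimistically confirmed.
        blocks.retain(|b| b.stake_fraction >= threshold);
        // Sort by slot and check that the survivors form a single chain
        // starting at the local root.
        blocks.sort_by_key(|b| b.slot);
        match blocks.first() {
            Some(first) if first.slot == local_root => {}
            _ => return Err(format!("list does not start at local root {local_root}")),
        }
        for pair in blocks.windows(2) {
            if pair[1].parent_slot != pair[0].slot {
                return Err(format!(
                    "block {} does not chain to block {}",
                    pair[1].slot, pair[0].slot
                ));
            }
        }
        // The last slot of the single chain is the local heaviest fork choice.
        Ok(blocks.iter().map(|b| b.slot).collect())
    }
    ```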

  4. Verify the heaviest fork of the coordinator

    There will be one coordinator, specified on everyone's command line. Even though every validator calculates its own heaviest fork in the previous step, only the coordinator's heaviest fork will be checked and optionally accepted by others.

    We use a new Gossip message, RestartHeaviestFork, whose fields are:

    • slot: u64, the slot of the picked block.
    • hash: Hash, the bank hash of the picked block.

    After deciding the heaviest block, the coordinator sends RestartHeaviestFork(X.slot, X.hash) out via Gossip, where X is the block it picked locally in the previous step. The coordinator will stay up until manually restarted by its operator.

    Every non-coordinator validator performs the following actions on the heaviest fork sent by the coordinator:

    1. If the selected bank is missing locally, repair this slot and all slots with higher stake.

    2. Check that the bank hash of the selected slot matches the local data.

    3. Verify that the selected fork contains the local root, and that the local heaviest fork slot is on the same fork as the coordinator's choice.

    If any of the above repairs or checks fails, exit with an error message; the coordinator may have made a mistake and manual intervention is needed.

    When exiting this step, no matter what a non-coordinator validator chooses, it sends a RestartHeaviestFork back to the coordinator to report its status. This reporting just makes it easier to aggregate the cluster's status at the coordinator; it has no other effect. A sketch of the message and the checks above follows.
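
    A minimal sketch of the message and the non-coordinator checks above (the helper closure, parameter names, and error strings are assumptions for illustration):

    ```rust
    type Slot = u64;
    type Hash = [u8; 32];

    struct RestartHeaviestFork {
        /// Slot of the picked block.
        slot: Slot,
        /// Bank hash of the picked block.
        hash: Hash,
    }

    fn verify_coordinator_choice(
        coordinator: &RestartHeaviestFork,
        local_bank_hash: Option<Hash>, // None if the slot is still missing locally
        local_root: Slot,
        local_heaviest_slot: Slot,
        // (ancestor, descendant): true if `descendant` is on the fork growing from `ancestor`.
        is_ancestor_of: impl Fn(Slot, Slot) -> bool,
    ) -> Result<(), String> {
        // Check 2: the bank hash of the selected slot must match local data (after repair).
        match local_bank_hash {
            Some(hash) if hash == coordinator.hash => {}
            Some(_) => return Err("bank hash mismatch with the coordinator".into()),
            None => return Err("coordinator slot still missing locally".into()),
        }
        // Check 3: the selected fork must contain the local root, and the local
        // heaviest fork slot must be on the same fork as the coordinator's choice.
        if !is_ancestor_of(local_root, coordinator.slot) {
            return Err("coordinator fork does not contain the local root".into());
        }
        if !(is_ancestor_of(local_heaviest_slot, coordinator.slot)
            || is_ancestor_of(coordinator.slot, local_heaviest_slot))
        {
            return Err("local heaviest fork is not on the coordinator's fork".into());
        }
        Ok(())
    }
    ```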

  5. Generate incremental snapshot and exit

If the previous step succeeds, the validator immediately adds a hard fork at the designated slot and performs set_root. Then it generates an incremental snapshot at the agreed upon cluster restart slot. After snapshot generation completes, the --wait_for_supermajority args with the correct shred version, restart slot, and expected bank hash are printed to the logs.

After snapshot generation is complete, a non-coordinator exits with exit code 200 to indicate its work is complete.

A coordinator will stay up until restarted by the operator to make sure any latecomers get the RestartHeaviestFork message. It also aggregates the RestartHeaviestFork messages sent by the non-coordinators to report on the status of the cluster.
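
As a rough sketch of the exit behavior above (exit code 200 is from this proposal; the function shape and the sleep loop are illustrative assumptions):

```rust
fn finish_wen_restart(is_coordinator: bool) -> ! {
    // At this point the hard fork has been added, the root set, the incremental
    // snapshot generated, and the --wait_for_supermajority arguments logged.
    if is_coordinator {
        // The coordinator keeps gossiping RestartHeaviestFork for latecomers and
        // aggregating replies until the operator restarts it manually.
        loop {
            std::thread::sleep(std::time::Duration::from_secs(60));
        }
    } else {
        // Exit code 200 signals that wen restart completed successfully.
        std::process::exit(200)
    }
}
```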

Impact

This proposal adds a new wen restart mode to validators; in this mode the validators do not participate in normal cluster activities. Compared to today's cluster restart, the new mode may require more network bandwidth and memory on the restarting validators, but it guarantees the safety of optimistically confirmed user transactions, and validator operators do not need to manually generate and download snapshots during a cluster restart.

Security Considerations

The two added Gossip messages, RestartLastVotedForkSlots and RestartHeaviestFork, will only be sent and processed when the validator is restarted in wen restart mode. So a random validator restarting in the new mode will not clutter the Gossip CRDS table of a normally operating cluster.

Non-conforming validators could send out wrong RestartLastVotedForkSlots messages to mess with cluster restarts; such behavior should be covered by slashing rules in the future.

Handling oscillating votes

Non-conforming validators could change their last votes back and forth, which could lead to instability in the system. We therefore forbid any change of slot or hash in RestartLastVotedForkSlots or RestartHeaviestFork: everyone sticks with the first value received, and discrepancies are recorded in the proto file for later slashing. A sketch of this rule follows.
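
A minimal sketch of this first-value-wins rule (the types and the conflict log are simplified stand-ins for the real Gossip/CRDS handling and the proto file):

```rust
use std::collections::HashMap;

type Pubkey = [u8; 32];

#[derive(Clone, PartialEq)]
struct ReportedFork {
    slot: u64,
    hash: [u8; 32],
}

#[derive(Default)]
struct RestartAggregate {
    first_seen: HashMap<Pubkey, ReportedFork>,
    /// Conflicting (first, later) values per pubkey, kept as evidence for later slashing.
    conflicts: Vec<(Pubkey, ReportedFork, ReportedFork)>,
}

impl RestartAggregate {
    /// Apply only the first (slot, hash) seen from each pubkey; record any
    /// later message that disagrees instead of applying it.
    fn insert(&mut self, from: Pubkey, fork: ReportedFork) {
        match self.first_seen.get(&from) {
            None => {
                self.first_seen.insert(from, fork);
            }
            Some(existing) if *existing != fork => {
                self.conflicts.push((from, existing.clone(), fork));
            }
            Some(_) => {} // identical to the first value received, nothing to do
        }
    }
}
```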

Handling multiple epochs

Even though it is not very common for an outage to happen across an epoch boundary, we do need to prepare for this rare case. Because the main purpose of wen restart is to make everyone reach agreement, the following choices are made:

  • Every validator only handles 2 epochs; any validator will discard slots belonging to an epoch that is more than 1 epoch away from its root. If a validator's root is so old that it cannot proceed, it exits and reports an error. Since we assume an outage will be discovered within 7 hours and one epoch is roughly two days, handling 2 epochs should be enough.

  • The stake weight of each slot is calculated using the epoch the slot is in. Because right now epoch stakes are calculated 1 epoch ahead of time, and we only handle 2 epochs, the local root bank should have the epoch stakes for all epochs we need.

  • When aggregating RestartLastVotedForkSlots, for any epoch where the validators voting for some slot in that epoch hold at least 33% of the epoch's stake, calculate the stake of active (in-restart) validators in that epoch. Only exit this stage if every epoch reaching the above bar has > 80% active stake (see the sketch after this list). This is a bit restrictive, but it guarantees that whichever slot we select for the heaviest fork, we have enough validators in the restart. Note that the epoch containing the local root is always considered, because the root should have > 33% stake.
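
A minimal sketch of the exit condition in the last rule above (names are illustrative; both maps are assumed to hold stake fractions relative to each epoch's total stake):

```rust
use std::collections::HashMap;

type Epoch = u64;

fn can_exit_last_voted_fork_slots_aggregation(
    voted_stake_by_epoch: &HashMap<Epoch, f64>,  // stake voting for any slot in the epoch
    active_stake_by_epoch: &HashMap<Epoch, f64>, // stake of validators seen in the restart
) -> bool {
    voted_stake_by_epoch.iter().all(|(epoch, &voted_stake)| {
        // Epochs below the 33% bar do not gate the exit; every epoch at or
        // above the bar must have more than 80% of its stake in the restart.
        voted_stake < 0.33
            || active_stake_by_epoch.get(epoch).copied().unwrap_or(0.0) > 0.80
    })
}
```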

Now we prove this is safe: whenever a slot is optimistically confirmed in the new epoch, we only exit the RestartLastVotedForkSlots aggregation stage if > 80% of the stake in the new epoch has joined:

  1. Assume slot X is optimistically confirmed in the new epoch; it has > 67% of the new epoch's stake.

  2. Our stake warmup/cooldown limit is currently 9%, so at least 67% - 9% = 58% of that stake was also staked in the old epoch.

  3. We always have > 80% of the old epoch's stake, so at least 58% - 20% = 38% of that stake is in the restart. Excluding non-conforming stake, at least 38% - 5% = 33% should be in the restart and should report that they voted for X, which is in the new epoch.

  4. According to the above rule we will require >80% stake in the new epoch as well.

Backwards Compatibility

This change is backwards compatible with previous versions, because validators only enter the new mode when restarted with a dedicated command line argument. All current restart arguments like --wait-for-supermajority and --expected-bank-hash will be kept as is.