SIMD-0046: optimistic cluster restart automation
simd: '0046'
title: Optimistic cluster restart automation
authors:
  - Wen Xu (Anza)
category: Standard
type: Core
status: Implemented
created: 2023-04-07
feature: N/A (gated by command line flag instead)
development:
  - Anza - implemented
  - Firedancer - implemented
Summary
During a cluster restart following an outage, make validators enter a separate recovery protocol that uses Gossip to exchange local status and automatically reach consensus on the block to restart from. Proceed to restart if validators in the restart can reach agreement, or print debug information and halt otherwise. To distinguish the new restart process from other operations, we call the new process "Wen restart".
New Terminology
- cluster restart: When there is an outage such that the whole cluster stalls, humans may need to restart most of the validators with a sane state so that the cluster can continue to function. This is different from a sporadic single-validator restart, which does not impact the cluster. See the existing cluster restart documentation for details.
- cluster restart slot: In the current cluster restart scheme, humans normally decide on one block for all validators to restart from. This is very often the highest optimistically confirmed block, because an optimistically confirmed block should never be rolled back. But it is also okay to start from a child of the highest optimistically confirmed block as long as consensus can be reached.
- optimistically confirmed block: A block which gets votes from the majority of the validators in a cluster (> 2/3 stake). Our algorithm tries to guarantee that an optimistically confirmed block will never be rolled back.
- wen restart phase: During the proposed optimistic cluster restart automation process, the validators in restart will first spend some time exchanging information, repairing missing blocks, and finally reaching consensus. The validators only continue normal block production and voting after consensus is reached. We call this preparation phase, where block production and voting are paused, the wen restart phase.
- wen restart shred version: Right now we update shred_version during a cluster restart; it is used to verify received shreds and filter Gossip peers. In the proposed optimistic cluster restart plan, we introduce a new temporary shred version in the wen restart phase so that validators in restart don't interfere with those not in restart. Currently this wen restart shred version is calculated as (current_shred_version + 1) % 0xffff.
- RESTART_STAKE_THRESHOLD: We need enough validators to participate in a restart so they can make a decision for the whole cluster. If everything works perfectly, we would only need 2/3 of the total stake. However, validators could die or perform abnormally, so we currently set the RESTART_STAKE_THRESHOLD at 80%, which is the same as what we use now for --wait_for_supermajority.
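For illustration, here is a minimal Rust sketch of the shred version formula and the participation threshold described above; the constant and function names are hypothetical, not the exact identifiers used in any implementation.

```rust
/// Minimum percentage of total stake that must join the restart
/// (see RESTART_STAKE_THRESHOLD above); hypothetical constant name.
const RESTART_STAKE_THRESHOLD_PERCENT: u64 = 80;

/// Derive the temporary shred version used during the wen restart phase,
/// following the (current_shred_version + 1) % 0xffff formula above.
fn wen_restart_shred_version(current_shred_version: u16) -> u16 {
    ((current_shred_version as u32 + 1) % 0xffff) as u16
}
```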
Motivation
Currently during a cluster restart, validator operators need to decide the highest optimistically confirmed slot, then restart the validators with new command-line arguments.

The current process involves a lot of human intervention; if people make a mistake in deciding the highest optimistically confirmed slot, it is detrimental to the viability of the ecosystem.

We aim to automate the negotiation of the highest optimistically confirmed slot and the distribution of all blocks on that fork, so that we can lower the possibility of human mistakes in the cluster restart process. This also reduces the burden on validator operators: they don't have to stay around while the validators automatically try to reach consensus, because a validator will halt and print debug information if anything goes wrong, and operators can set up their own monitoring accordingly.

However, there are many ways an automatic restart can go wrong, mostly due to unforeseen situations or software bugs. To make things really safe, we apply multiple checks during the restart; if any check fails, the automatic restart is halted and debugging info is printed, waiting for human intervention. That is why we call this an optimistic cluster restart procedure.
Alternatives Considered
Automatically detect outage and perform cluster restart
The reaction time of a human in case of emergency is measured in minutes, while a cluster restart where humans initiate validator restarts takes hours.

We considered various approaches to automatically detect an outage and perform a cluster restart, which could reduce recovery time to minutes or even seconds.

However, automatically restarting the whole cluster seems risky: if the recovery process itself doesn't work, it might be some time before we can get a human's attention, and it doesn't solve the cases where a new binary is needed. So for now we still plan to have a human in the loop.
After we gain more experience with the restart approach in this proposal, we may slowly try to make the process more automatic to improve reliability.
Use Gossip and consensus to figure out restart slot before the restart
The main difference between this and the current restart proposal is that this alternative tries to make the cluster automatically enter the restart preparation phase without human intervention.

While getting humans out of the loop improves recovery speed, there are concerns about recovery Gossip messages interfering with normal Gossip messages, and automatically starting a new type of message in Gossip seems risky.
Automatically reduce block production in an outage
Right now we have a vote-only mode: a validator will only pack vote transactions into new blocks if the tower distance (last_vote - local_root) is greater than 400 slots.

Unfortunately, in previous outages vote-only mode wasn't enough to save the cluster. There are proposals for more aggressive block production reduction to save the cluster. For example, a leader could produce only one block in four consecutive slots allocated to it.

However, this only solves the problem in specific types of outages, and it seems risky to aggressively reduce block production, so we are not proceeding with this proposal for now.
Detailed Design
The new protocol tries to make all restarting validators get the same data blocks and the same set of last votes, so that they will with high probability converge on the same canonical fork and proceed.
When the cluster is in need of a restart, we assume validators holding at least RESTART_STAKE_THRESHOLD percent of the stake will enter restart mode. Then the following steps will happen:
- The operator restarts the validator into the wen restart phase at boot, where it will not make new blocks or vote. The validator propagates its local voted fork information to all other validators in restart.
- While aggregating local vote information from all others in restart, the validator repairs all blocks which could potentially have been optimistically confirmed.
- After enough validators are in restart and repair is complete, the validator counts votes on each fork and computes its local heaviest fork.
- A coordinator, which is configured on everyone's command line, sends out its heaviest fork to everyone.
- Each validator verifies that the coordinator's choice is reasonable:
  - If yes, proceed and restart.
  - If no, print out what it thinks is wrong, halt, and wait for a human.

Each step is explained in detail below.
We assume that at most 5% of the validators in restart can be malicious or contain bugs; this number is consistent with other algorithms in the consensus protocol. We call these non-conforming validators.
Wen restart phase
- Gossip last vote and ancestors on that fork

The main goal of this step is to propagate the most recent ancestors on the last voted fork to all others in restart.

We use a new Gossip message RestartLastVotedForkSlots, its fields are:

- last_voted_slot: u64, the slot last voted; this also serves as last_slot for the bit vector.
- last_voted_hash: Hash, the bank hash of the last voted slot.
- ancestors: a run-length encoded bit vector representing the slots on the sender's last voted fork. The least significant bit is always last_voted_slot, the most significant bit is last_voted_slot - 65535.

The max distance between the oldest ancestor slot and the last voted slot is hard coded at 65535, because that's 400ms * 65535 = 7.3 hours. We assume that most validator administrators would have noticed an outage within 7 hours, and that optimistic confirmation must have halted within 64k slots of the last confirmed block. Also, 65535 fits nicely into a u16, which makes the encoding more compact. If a validator restarts more than 7 hours past the outage, it cannot join the restart this way. If enough validators fail to restart within 7 hours, we fall back to the manual, interactive cluster restart method.

When a validator enters restart, it uses the wen restart shred version to avoid interfering with those outside the restart. To be extra cautious, we will also filter out RestartLastVotedForkSlots and RestartHeaviestFork (described later) in Gossip if a validator is not in the wen restart phase. There is a slight chance that the wen restart shred version would collide with the shred version after the wen restart phase, but with the filtering described above it should not be a problem.

When a validator receives RestartLastVotedForkSlots from someone else, it will discard all slots smaller than the local root. Because the local root should be an optimistically confirmed slot, it does not need to keep any slot older than the local root.
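For illustration, a minimal Rust sketch of how this message could be laid out; the field names follow the description above, while the RunLengthEncoding and Hash stand-ins are hypothetical, and the real CRDS value also carries sender identity, wallclock, and shred version.

```rust
/// Stand-in for the bank hash type.
pub type Hash = [u8; 32];

/// Hypothetical run-length encoding of the ancestor bit vector: alternating
/// run lengths of 1s and 0s, starting from last_voted_slot and walking
/// toward older slots.
pub struct RunLengthEncoding(pub Vec<u16>);

/// Sketch of the RestartLastVotedForkSlots Gossip message.
pub struct RestartLastVotedForkSlots {
    /// Slot last voted; also the last_slot (bit 0) of the bit vector.
    pub last_voted_slot: u64,
    /// Bank hash of the last voted slot.
    pub last_voted_hash: Hash,
    /// Compressed bit vector of slots on the sender's last voted fork;
    /// bit 0 is last_voted_slot, bit 65535 is last_voted_slot - 65535.
    pub ancestors: RunLengthEncoding,
}
```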
- Repair ledgers up to the restart slot

The main goal of this step is to repair all blocks which could potentially have been optimistically confirmed.

We need to prevent false negatives at all costs, because we can't roll back an optimistically confirmed block. However, false positives are okay: when we select the heaviest fork in the next step, we will see all the potential candidates for optimistically confirmed slots, and there we can count the votes and remove some false positive cases.

However, it is also overkill to repair every block presented by others. While RestartLastVotedForkSlots messages are being received and aggregated, a validator can categorize blocks missing locally into two categories: must-have and ignored.

We repair all blocks with no less than 42% stake. The number is 67% - 5% - stake_on_validators_not_in_restart. Since we require that at least 80% of the stake joins the restart, any block with less than 67% - (100 - 80)% - 5% = 42% stake could never have been optimistically confirmed before the restart.

It's possible that different validators see a different 80%, so their must-have blocks might differ, but there will be another repair round in the final step, so this is fine. Whenever some block reaches 42%, repair can be started, because as more validators join the restart this number will only go up, never down.

When a validator has RestartLastVotedForkSlots from 80% of the stake, and all the "must-have" blocks are repaired, it can proceed to the next step.
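As a worked example of the threshold above, a small Rust sketch (hypothetical function name, stake expressed as whole percentages):

```rust
/// Stake percentage below which a block cannot have been optimistically
/// confirmed before the restart: 67% - 5% (non-conforming) minus the stake
/// of validators not participating in the restart.
fn must_have_threshold_percent(stake_not_in_restart_percent: u64) -> u64 {
    67u64.saturating_sub(5).saturating_sub(stake_not_in_restart_percent)
}

fn main() {
    // With the minimum 80% participation, at most 20% of stake is outside
    // the restart, so the threshold is 67 - 5 - 20 = 42%.
    assert_eq!(must_have_threshold_percent(20), 42);
    // With 90% participation the threshold rises to 52%.
    assert_eq!(must_have_threshold_percent(10), 52);
}
```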
- Calculate heaviest fork

After receiving RestartLastVotedForkSlots from validators holding more than RESTART_STAKE_THRESHOLD of the stake and repairing all slots in the "must-have" category, pick the heaviest fork like this:

- Calculate the threshold for a block to be on the heaviest fork; the heaviest fork should contain every block that could possibly have been optimistically confirmed. The number is 67% - 5% - stake_on_validators_not_in_restart. For example, if 80% of the validators are in restart, the number is 67% - 5% - (100-80)% = 42%. If 90% of the validators are in restart, the number is 67% - 5% - (100-90)% = 52%.
- Sort all blocks over the threshold by slot number and verify that they form a single chain. The first block in the list should be the local root. If any block does not satisfy the above constraint, print the first offending block and exit. The list should not be empty; it contains at least the local root.

To see why the above algorithm is safe, we prove that:

- Any block optimistically confirmed before the restart will always be on the list: assume block A is one such block, so it has at least 67% stake. Discounting 5% non-conforming stake and validators not participating in wen restart, it still has at least 67% - 5% - stake_on_validators_not_in_restart stake, so it passes the threshold and is on the list.
- Any block in the list has at most one child in the list: let's use X to denote stake_on_validators_not_in_restart for brevity. Assume a block has children A and B both on the list; their combined stake would be at least 2 * (67% - 5% - X). Because we only accept one RestartLastVotedForkSlots per pubkey, and a single last voted fork can contain either A or B but not both, it's easy to find and filter out violators who claimed both. So the children's total stake must be less than 100% - X. If 124% - 2 * X < 100% - X, then X > 24%, which is not possible when at least 80% of the validators are in restart. So, by contradiction, any block in the list can have at most one child in the list.
- If a block that was not optimistically confirmed before the restart is on the list, it can only be at the end of the list and none of its siblings are on the list: say block D is the first not optimistically confirmed block on the list, and its parent E is confirmed and on the list. We know from the point above that E can only have one child on the list, therefore D must be at the end of the list while its siblings are not on the list.

Even if the last block D on the list may not be optimistically confirmed, it already has at least 42% - 5% = 37% stake. Say F is D's sibling with the most stake; F can only have less than 42% stake, because it's not on the list. So picking D over F is equivalent to the case where 5% of stake switched from fork F to fork D, and 80% of the cluster can switch to fork D if that turns out to be the heaviest fork.

After picking the appropriate slot, replay that block and all its ancestors to get the bank hash for the picked slot.
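A short Rust sketch of the chain check above, with a hypothetical CandidateBlock type for blocks that crossed the threshold; it only demonstrates the "sorted blocks must form a single chain starting at the local root" rule.

```rust
/// A block that crossed the heaviest-fork stake threshold (hypothetical shape).
struct CandidateBlock {
    slot: u64,
    parent_slot: u64,
}

/// Verify that the candidate blocks, sorted by slot, form a single chain whose
/// first element is the local root. Returns the heaviest fork slot on success,
/// or the first offending slot so it can be printed before exiting.
fn verify_single_chain(root_slot: u64, mut blocks: Vec<CandidateBlock>) -> Result<u64, u64> {
    blocks.sort_by_key(|b| b.slot);
    // The list is never empty because the local root always crosses the threshold.
    let first = blocks.first().ok_or(root_slot)?;
    if first.slot != root_slot {
        return Err(first.slot);
    }
    // Each candidate must chain directly off the previous candidate.
    for pair in blocks.windows(2) {
        if pair[1].parent_slot != pair[0].slot {
            return Err(pair[1].slot);
        }
    }
    Ok(blocks.last().unwrap().slot)
}
```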
- Verify the heaviest fork of the coordinator

There will be one coordinator specified on everyone's command line. Even though every validator calculates its own heaviest fork in the previous step, only the coordinator's heaviest fork will be checked and optionally accepted by others.

We use a new Gossip message RestartHeaviestFork, its fields are:

- slot: u64, slot of the picked block.
- hash: Hash, bank hash of the picked block.

After deciding its heaviest block, the coordinator Gossips RestartHeaviestFork(X.slot, X.hash), where X is the block the coordinator picked locally in the previous step. The coordinator will stay up until manually restarted by its operator.

Every non-coordinator validator performs the following actions on the heaviest fork sent by the coordinator:

- If the selected bank is missing locally, repair this slot and all slots with higher stake.
- Check that the bank hash of the selected slot matches the data locally.
- Verify that the selected fork contains the local root, and that the local heaviest fork slot is on the same fork as the coordinator's choice.

If any of the above repairs or checks fails, exit with an error message; the coordinator may have made a mistake and this needs manual intervention.

When exiting this step, no matter what a non-coordinator validator chooses, it sends a RestartHeaviestFork back to the coordinator to report its status. This reporting just makes it easy to aggregate the cluster's status at the coordinator; it has no other effect.
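A minimal Rust sketch of this second message, mirroring the two fields listed above (the Hash stand-in is hypothetical, and the real CRDS value also carries sender identity and shred version):

```rust
/// Stand-in for the bank hash type.
pub type Hash = [u8; 32];

/// Sketch of the RestartHeaviestFork Gossip message: the coordinator sends its
/// pick, and non-coordinators send the same message back to report status.
pub struct RestartHeaviestFork {
    /// Slot of the picked block.
    pub slot: u64,
    /// Bank hash of the picked block.
    pub hash: Hash,
}
```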
- Generate incremental snapshot and exit

If the previous step succeeds, the validator immediately inserts a hard fork at the designated slot and performs set_root. Then it starts generating an incremental snapshot at the agreed-upon cluster restart slot. After snapshot generation completes, the --wait_for_supermajority args with the correct shred version, restart slot, and expected bank hash are printed to the logs.

After the snapshot generation is complete, a non-coordinator then exits with exit code 200 to indicate its work is complete.

A coordinator will stay up until restarted by the operator, to make sure any latecomers get the RestartHeaviestFork message. It also aggregates the RestartHeaviestFork messages sent by the non-coordinators to report on the status of the cluster.
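Purely as an illustration of what those printed arguments might look like, a sketch in Rust; the exact flag spellings (in particular --expected-shred-version) and the formatting are assumptions, not dictated by this proposal, so operators should copy whatever their validator actually prints.

```rust
/// Hypothetical helper that formats the restart arguments printed to the logs
/// after the incremental snapshot has been generated.
fn restart_args(restart_slot: u64, expected_bank_hash: &str, shred_version: u16) -> String {
    format!(
        "--wait-for-supermajority {restart_slot} \
         --expected-bank-hash {expected_bank_hash} \
         --expected-shred-version {shred_version}"
    )
}
```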
Impact
This proposal adds a new wen restart mode to validators; in this mode the validators will not participate in normal cluster activities. Compared to today's cluster restart, the new mode may mean more network bandwidth and memory usage on the restarting validators, but it guarantees the safety of optimistically confirmed user transactions, and validator operators don't need to manually generate and download snapshots during a cluster restart.
Security Considerations
The two added Gossip messages, RestartLastVotedForkSlots and RestartHeaviestFork, will only be sent and processed when the validator is restarted in wen restart mode, so a random validator restarting in the new mode will not clutter the Gossip CRDS table of a normal system.

Non-conforming validators could send out wrong RestartLastVotedForkSlots messages to mess with cluster restarts; these should be included in the slashing rules in the future.
Handling oscillating votes
Non-conforming validators could change their last votes back and forth, which could lead to instability in the system. We therefore forbid any change of slot or hash in RestartLastVotedForkSlots or RestartHeaviestFork: everyone sticks with the first value received, and discrepancies are recorded in the proto file for later slashing.
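A small Rust sketch of this first-value-wins rule, assuming a hypothetical in-memory map keyed by sender pubkey (recording discrepancies to the proto file is left out):

```rust
use std::collections::HashMap;

/// Hypothetical pubkey representation.
type Pubkey = [u8; 32];

/// Hypothetical record of a peer's reported last vote.
#[derive(Clone, Copy, PartialEq, Eq)]
struct LastVote {
    slot: u64,
    hash: [u8; 32],
}

/// Keep only the first value received per pubkey. A later, different value is
/// ignored but returned so it can be recorded for potential slashing.
fn record_last_vote(
    seen: &mut HashMap<Pubkey, LastVote>,
    from: Pubkey,
    vote: LastVote,
) -> Option<LastVote> {
    match seen.get(&from) {
        None => {
            seen.insert(from, vote);
            None
        }
        Some(first) if *first == vote => None,
        Some(_) => Some(vote), // discrepancy: stick with the first value
    }
}
```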
Handling multiple epochs
Even though it is not very common for an outage to happen across an epoch boundary, we do need to prepare for this rare case. Because the main purpose of wen restart is to make everyone reach agreement, the following choices are made:
- Every validator only handles 2 epochs; any validator will discard slots which belong to an epoch that is more than 1 epoch away from its root. If a validator's root is so old that it can't proceed, it will exit and report an error. Since we assume an outage will be discovered within 7 hours and one epoch is roughly two days, handling 2 epochs should be enough.
- The stake weight of each slot is calculated using the epoch the slot is in. Because epoch stakes are currently calculated 1 epoch ahead of time, and we only handle 2 epochs, the local root bank should have the epoch stakes for all epochs we need.
- When aggregating RestartLastVotedForkSlots, for any epoch where the validators voting for some slot in that epoch hold at least 33% of the epoch's stake, calculate the stake of active (in-restart) validators in that epoch. Only exit this stage if every epoch reaching the above bar has more than 80% of its stake in the restart. This is a bit restrictive, but it guarantees that whichever slot we select for the heaviest fork, we have enough validators in the restart. Note that the epoch containing the local root is always considered, because the root should have more than 33% stake.
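A short Rust sketch of this per-epoch exit condition, assuming a hypothetical per-epoch summary (stakes expressed in percent of that epoch's total stake) computed while aggregating RestartLastVotedForkSlots:

```rust
/// Hypothetical per-epoch aggregation, in percent of that epoch's total stake.
struct EpochStakeSummary {
    /// Stake of validators whose last voted fork contains a slot in this epoch.
    voted_stake_percent: u64,
    /// Stake of validators of this epoch that have joined the restart.
    in_restart_stake_percent: u64,
}

/// The aggregation stage may only finish when every epoch that reaches the 33%
/// voting bar also has more than 80% of its stake participating in the restart.
fn can_exit_aggregation(epochs: &[EpochStakeSummary]) -> bool {
    epochs
        .iter()
        .filter(|e| e.voted_stake_percent >= 33)
        .all(|e| e.in_restart_stake_percent > 80)
}
```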
Now we prove this is safe: whenever there is a slot optimistically confirmed in the new epoch, we will only exit the RestartLastVotedForkSlots aggregation stage if more than 80% of the new epoch's stake has joined:

- Assume slot X is optimistically confirmed in the new epoch, so it has more than 67% of the stake in the new epoch.
- Our stake warmup/cooldown limit is currently 9%, so at least 67% - 9% = 58% of that stake was also staked in the old epoch.
- We always have more than 80% of the old epoch's stake in the restart, so at least 58% - 20% = 38% of that stake is in the restart. Excluding non-conforming stake, at least 38% - 5% = 33% should be in the restart and should report that they voted for X, which is in the new epoch.
- By the rule above, we will therefore also require more than 80% stake in the new epoch.
Backwards Compatibility
This change is backward compatible with previous versions, because validators only enter the new mode when explicitly restarted with the new command-line argument. All current restart arguments like --wait-for-supermajority and --expected-bank-hash will be kept as is.