This post is a long-form explainer on the historic, current, and near-term scheduling algorithms and design used in the Solana Labs validator client’s block-production.

Background - Block-ProductionLink to this heading

At the beginning of each epoch, the cluster creates a leader schedule for the entire epoch. This leader schedule is determined by the stake-weight of each validator, and allocates groups of 4 slots to validators in proportion to their stake-weight. During the slot(s) allocated to a validator, that validator is the leader and is responsible for creating a block and broadcasting it to the rest of the network. This process of creating the block is referred to as “block-production”, and within the Solana Labs’ client as BankingStage.

Each block is created from sets of transactions that are received from users and other validators in the network. However, there are limits on the amount of work a block is allowed to contain, because each block is intended to executed within 400ms. Often the leader will receive many more transactions that can actually be included within a block, and so the leader must decide which transactions to include.

Entry ConstraintsLink to this heading

Historic designs were limited by certain constaints and it is important to mention these constraints. While there as plans to remove some of these constraints, they are still relevant to explanation of previous and current designs.

The Solana transaction history is composed of a series of entries. Each entry is a list of transactions. However, an entry cannot contain conflicting transactions; if it does, then the entire block will be marked as invalid by the network.

Solana Labs’ Client ArchitectureLink to this heading

At a very high level, the Solana Labs’ block-production code sits between signature verification (SigVerify), and broadcasting (BroadcastStage).

graph LR; PacketIngress --> Sigverify --> BankingStage <-->|recording| PoH; BankingStage -->|successfully recorded and committed| BroadcastStage --> Turbine; BankingStage -->|forwarding| TpuForwarding;

Historic DesignLink to this heading

As this post will discuss the current and future block-production algorithms and design, it is useful to first understand previous designs.

Block-production within the Labs’ client has gone through many changes, but has histocially involved multiple nearly independent threads, each with their own queue of transactions, and each running their own scheduling. Here, scheduling refers to the process of deciding the order that transactions are processed in, and thus included into the block. While each thread is nearly independent, there is some coordination between the threads to ensure that conflicting transactions are not executed in parallel.

graph LR; subgraph BankingStage Receiver BankingStage1 BankingStage2 BankingStage3 BankingStage4 end Sigverify -->|channel| Receiver; Receiver --> BankingStage1 & BankingStage2 & BankingStage3 & BankingStage4;

The historic and current designs have threads which access a shared channel to pull transactions from into their local queues. When pulled into the local queue, the transactions are sorted by priority. During block-production, each thread would take the top 128 transactions from their local queue, attempt to grab locks, then execute, record and commit the transactions. If the lock grab failed, the transaction would be retried at a later time.

Historic Design - DrawbacksLink to this heading

The historic design had several drawbacks, mainly related to failed lock grabs. During competitive events, such as NFT mints, the top of the queue is often filled with conflicting transactions. Given the entry constraints, this means each thread would attempt to lock 128 conflicting transactions, but fail to lock 127 of them. This resulted in massive inefficiency as the majority of transactions would be skipped over to be retried later. Additionally, because the other threads were also likely to be grabbing the same locks, only one thread may have actually been doing useful work.

At the point these drawbacks became issues, there were already plans for a more advanced scheduling design, but there was urgency to address the issues in the short-term. This urgency led to the introduction of the multi-iterator approach.

The multi-iterator approach is still used in the current algorithm, and leaves the architecture unchanged from the historic design. Since each thread maintains its’ own local queue, the current block-production method has been named “thread-local-multi-iterator”.

Thread-Local Multi-IteratorLink to this heading

What is the Multi-Iterator?Link to this heading

The multi-iterator is a class which takes a slice of items, and will march multiple iterators through the slice according to a decision-function that is passed in. The decision-function is responsible for determining whether an item should be processed immediately, on the next march of the iterators, or skipped entirely.

In the context of BankingStage, the multi-iterator is used to create batches of non-conficting transactions. Let’s take a look at iteration through an example, with a batch size of 2. Let shared node color indicate conflicting transactions. If we have several conflicting transactions at the beginning of our buffer, the previous method would have attempted to lock both of these transactions within the same batch. Grabbing locks on the second one would have failed, and the second transaction would then be retried later.

However, with the multi-iterator, we check locks against the in-progress batch before selection. The initial step will select the first transaction, then the second transaction will be skipped for this batch because it conflicts with the first. On the next multi-iterator iteration, the second transaction is selected by the first iterator marching forward.

graph LR subgraph 0 [initial buffer] Tx1_0((Tx1 - 5)) Tx2_0((Tx2 - 4)) Tx3_0((Tx3 - 3)) Tx4_0((Tx4 - 2)) Tx5_0((Tx5 - 1)) Tx6_0((Tx6 - 1)) Tx7_0((Tx7 - 1)) style Tx1_0 fill:#CF0000 style Tx2_0 fill:#CF0000 style Tx3_0 fill:#00CF00 style Tx4_0 fill:#00CF00 style Tx5_0 fill:#00CF00 style Tx6_0 fill:#0000CF style Tx7_0 fill:#00CF00 end subgraph 1 [iteration 1] Tx1_1((Tx1 - 5)) Tx2_1((Tx2 - 4)) Tx3_1((Tx3 - 3)) Tx4_1((Tx4 - 2)) Tx5_1((Tx5 - 1)) Tx6_1((Tx6 - 1)) Tx7_1((Tx7 - 1)) style Tx1_1 fill:#CF0000,stroke:#000,stroke-width:8px style Tx2_1 fill:#CF0000 style Tx3_1 fill:#00CF00,stroke:#000,stroke-width:8px style Tx4_1 fill:#00CF00 style Tx5_1 fill:#00CF00 style Tx6_1 fill:#0000CF style Tx7_1 fill:#00CF00 end subgraph 2 [iteration 2] Tx1_2((Tx1 - 5)) Tx2_2((Tx2 - 4)) Tx3_2((Tx3 - 3)) Tx4_2((Tx4 - 2)) Tx5_2((Tx5 - 1)) Tx6_2((Tx6 - 1)) Tx7_2((Tx7 - 1)) style Tx1_2 fill:#666666 style Tx2_2 fill:#CF0000,stroke:#000,stroke-width:8px style Tx3_2 fill:#666666 style Tx4_2 fill:#00CF00,stroke:#000,stroke-width:8px style Tx5_2 fill:#00CF00 style Tx6_2 fill:#0000CF style Tx7_2 fill:#00CF00 end subgraph 3 [iteration 3] Tx1_3((Tx1 - 5)) Tx2_3((Tx2 - 4)) Tx3_3((Tx3 - 3)) Tx4_3((Tx4 - 2)) Tx5_3((Tx5 - 1)) Tx6_3((Tx6 - 1)) Tx7_3((Tx7 - 1)) style Tx1_3 fill:#666666 style Tx2_3 fill:#666666 style Tx3_3 fill:#666666 style Tx4_3 fill:#666666 style Tx5_3 fill:#00CF00,stroke:#000,stroke-width:8px style Tx6_3 fill:#0000CF,stroke:#000,stroke-width:8px style Tx7_3 fill:#00CF00 end subgraph 4 [iteration 4] Tx1_4((Tx1 - 5)) Tx2_4((Tx2 - 4)) Tx3_4((Tx3 - 3)) Tx4_4((Tx4 - 2)) Tx5_4((Tx5 - 1)) Tx6_4((Tx6 - 1)) Tx7_4((Tx7 - 1)) style Tx1_4 fill:#666666 style Tx2_4 fill:#666666 style Tx3_4 fill:#666666 style Tx4_4 fill:#666666 style Tx5_4 fill:#666666 style Tx6_4 fill:#666666 style Tx7_4 fill:#00CF00,stroke:#000,stroke-width:8px end

How does the Multi-Iterator Help?Link to this heading

This leads to each of our BankingStage threads being able to create non-conflicting batches, which eliminates most of the retryable transactions that were previously seen during competitive events such as NFT mints. This additionally allows lower-priority non-conflicting transactions to be included into the block.

Thread-Local Multi-Iterator - DrawbacksLink to this heading

However, using the multi-iterator in this way does not eliminate all locking conflicts, because each BankingStage thread has an independent buffer it is iterating through. Since packets are essentially pulled randomly by each BankingStage thread from the shared channel from SigVerify, each BankingStage thread will have a random set of all the transactions. During competitive events, it is thus likely that the many high-priority transactions will be in multiple BankingStage threads’ buffers, and this will cause inter-thread locking conflicts. There are also several edge cases in which transactions will be executed out of priority order, which is not desirable.

This is a problem that can be addressed by changing BankingStage’s architecture to use a single scheduling thread that sends transactions to worker threads.

Central Scheduling ThreadLink to this heading

The next step in the evolution of BankingStage is to move to a single scheduling thread that sends work to worker threads. This will allow the scheduling algorithm to determine which transactions can be executed in parallel and send them to the appropriate thread.

There are corner-cases in the multi-iterator approach that lead to out of priority order execution for conflicting transactions. These issues were not easily resolvable while maintaining performance, and so a different approach has been taken for the central scheduling design.

Central Scheduling ArchitectureLink to this heading

An overview of the central scheduling BankingStage architecture is shown below.

graph LR; subgraph BankingStage Receiver Scheduler Worker1 Worker2 Worker3 Worker4 end Sigverify -->|channel| Receiver; Receiver --> Scheduler; Scheduler <--> Worker1 & Worker2 & Worker3 & Worker4;

In this architecture, the scheduler is the only thread which accesses the channel from SigVerify. The scheduler is responsible for pulling transactions from the receiver channel, and sending them to the appropriate worker thread.

A decision was made to use separate channels between the scheduler and worker threads, rather than a single shared channel. This allows conflicting work to be queued onto the thread it conflicts with, which allows for overall better throughput. This is due to the overhead of communication between the worker thread and the scheduler, but is a decision that may be reversed in the future if overhead can be brought down.

The scheduler maintains a view of which account locks are in-use by which threads, and is able to determine which threads a transaction can be queued on. Each worker thread will process batches of transactions, in the received order, and send a message back to the scheduler upon completion of each batch. These messages back to the scheduler allow the scheduler to update its’ view of the locks, and thus determine which future transactions can be scheduled.

Scheduling Algorithm - Prio-GraphLink to this heading

The actual scheduling algorithm can be described in a relatively straight-forward way. During scheduling the highest-priority transactions are popped from the queue, and inserted into a lazily populated dependency graph, which we will refer to as a “prio-graph”. In this graph, edges are only present between a transaction and the next-highest priority transaction that conflicts with it per account. Initially, a “look-ahead” window of some reasonable length is popped from the priority queue and inserted into the graph. As the scheduler pops from the graph for actual scheduling, another transaction is popped from the priority queue and inserted into the graph. The graph structure allows us to look ahead and determine if currently non-conflicting transactions will eventually have depedent transactions that conflict, i.e. a graph join. If a graph join is detected, these transactions can be queued onto the same thread to avoid creating an unschedulable transaction in the near future. There is still work being done on tuning the size of the look-ahead window. A window that is too small will lead to many unschedulable transactions and delay execution more than necessary. A window that is too large will lead to many transactions being queued onto the same thread, and the advantages of parallelism are lost.

Visualization of Prio-GraphLink to this heading

The great thing about graphs is that they are easy to visualize. Thanks to banking-trace data collected by leader nodes, we can visualize the prio-graph for all received transactions within a slot.

An example visualization of the full prio-graph1 for slot 229666043 is shown in the following images.

Slot 229666043 Overview

Here there are 89 entirely distinct collections of transactions, which could be scheduled in parallel. Many of these groups are very small and only have a few transactions, or very simple sequential structure. However, the large groups are more interesting, and show a root-like structure.

Slot 229666043 - Group 1

If we zoom into the largest group, we can see it actually has three main branches, which ultimately join together with a few low-priority transactions. This is a good example of where the look-ahead window can be tuned, since if we had too large of a window, these three large branches may have been scheduled sequentially.

Slot 229666043 - Group 1 - High Priority

However, if we zoom further into some of the high-priority transactions on one of the main branches, we can see there are many transactions that are one removed from the branch. The look-ahead window is used to schedule these transactions onto the same thread as the main branch, so that we can avoid making some lower-priority, but still relatively high-priority, transactions unschedulable. By unschedulable, we mean that the transaction cannot be scheduled currently because it has outstanding conflicts with multiple threads.

  1. The actual prio-graph is simplified here, with redundant edges removed to improve readability. ↩︎