On transposition tables
This page collects some suggestions for engine developers for dealing with transposition tables.
Contents
Initialization
For many chess engines, transposition table hash initialization is by far the most expensive operation in engine startup. This is especially true with big hash sizes.
Deferred initialization
The first problem with some engines, including Stockfish at the time of writing, is that they initialize the hash table multiple times on start-up. Let's take a look what SF does here: (commit 353e20674b2019094059caaa3567e9a44abe9cd1)
% ./stockfish Resize 16 MB <-- default hash at initialization Clear 1 threads Clear 1 threads uci ... setoption name Hash value 32768 Resize 32768 MB <-- resize and clear with 1 thread Clear 1 threads setoption name Threads value 12 Resize 32768 MB <-- resize and clear with 12 threads Clear 12 threads isready <-- this is a no-op, other than printing out readyok readyok ucinewgame Clear 12 threads isready <-- this is a no-op, other than printing out readyok readyok go depth 1 info depth 1 seldepth 1 multipv 1 score cp 110 nodes 251 nps 83666 tbhits 0 time 3 pv e2e3 bestmove e2e3 ponder b7b6
To avoid redundant resizes and clears, the obvious solution here would be to defer initialization until isready is sent. However, isready may be sometimes omitted, and it wouldn't be good if the search is started before initialization. So, initialization should also be triggered by any command that could potentially start a search.
The other issue is that there are often multiple UCI commands triggering a hash clear in the engine launch sequence. To avoid this, state tracking is needed for the hash table.
Ideally:
% ./stockfish <-- default hash is 16 MB, but we don't initialize yet uci ... setoption name Hash value 32768 setoption name Threads value 12 isready <-- this needs to trigger the initialization Resize 32768 MB <-- allocate the memory Clear 12 threads <-- clear with 12 threads; hash state is clean readyok ucinewgame <-- clear request on clean hash is redundant, so no need to actually clear isready <-- nothing to do here readyok go depth 1 <-- mark the hash state dirty here info depth 1 seldepth 1 multipv 1 score cp 110 nodes 251 nps 83666 tbhits 0 time 3 pv e2e3 bestmove e2e3 ponder b7b6 ucinewgame <-- clear request on dirty hash is triggers actual clear, mark the hash to be cleared isready Clear 12 threads <-- clear with 12 threads; hash state is clean again readyok
Optimizing the initialization sequence
A single CPU core on many high-end machines cannot saturate the memory bus. Thus, multi-threaded clear is usually much faster than a single-threaded clear. This is arguably the most important optimization for clearing. However, the number of clear threads is a bit more trickier question. Generally, for fastest clear, the number of threads should be just high enough that they can saturate the memory bus. If there are too many threads, the speed will go down again, since DRAM is fastest in sequential accesses. But with too many threads, the sequential clears from the threads will interleave each other, and the sequential bursts in DRAM become shorter.
However, there is another consideration. Even if malloc() and similar do not guarantee zero-initialized memory, the underlying OS page allocation functions do (e.g., VirtualAlloc(), mmap()). This may be a consideration if mmap() or VirtualAlloc() is already being used for large pages. In this case, all that is needed to do is to make the pages resident. This is relatively easily done if hash state tracking has been implemented.
To make the pages resident, every page must be touched. This can be done by writing a single zero per every page. Consider the following on Linux:
void *ptr = mmap(NULL, memSize, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); // allocates virtual memory, but does not make the pages resident reinterpret_cast<uint8_t *>(ptr)[0] = 0; // make the first page resident reinterpret_cast<uint8_t *>(ptr)[4096] = 0; // make the second page resident, but use sysconf(PAGESIZE) to actually retrieve the page size reinterpret_cast<uint8_t *>(ptr)[4096 *2] = 0; // make the third page resident
When the Linux kernel makes a newly allocated page resident for the first time, it clears the page using the accesser's thread. Thus, multi-threaded clear still makes sense. However, the performance difference between touching compared to full-page zeroing is not that big. This is because we are essentially clearing every page twice before moving to the next page. The second clear generally only touches CPU caches, which does not add to the traffic in the memory bus.
Accessing the transposition table
Memory access explained
TODO TLBs, caches, page sizes
Prefetch
TODO
Atomic access
TODO
Putting it all together
TODO numbers explaining measured speedups with all these different techniques