On transposition tables

This page collects some suggestions for engine developers for dealing with transposition tables.

Initialization

For many chess engines, transposition table hash initialization is by far the most expensive operation in engine startup. This is especially true with big hash sizes.

Deferred initialization

The first problem with some engines, including Stockfish at the time of writing, is that they initialize the hash table multiple times on start-up. Let's take a look at what SF does here: (commit 353e20674b2019094059caaa3567e9a44abe9cd1)

% ./stockfish
Resize 16 MB     <-- default hash at initialization
Clear 1 threads
Clear 1 threads
uci
...
setoption name Hash value 32768
Resize 32768 MB  <-- resize and clear with 1 thread
Clear 1 threads
setoption name Threads value 12
Resize 32768 MB  <-- resize and clear with 12 threads
Clear 12 threads
isready          <-- this is a no-op, other than printing out readyok
readyok
ucinewgame
Clear 12 threads
isready          <-- this is a no-op, other than printing out readyok
readyok
go depth 1
info depth 1 seldepth 1 multipv 1 score cp 110 nodes 251 nps 83666 tbhits 0 time 3 pv e2e3
bestmove e2e3 ponder b7b6

To avoid the redundant resizes and clears, the obvious solution would be to defer initialization until isready is sent. However, isready may sometimes be omitted, and it would not be good if a search started before initialization. Therefore, initialization should also be triggered by any command that could potentially start a search.

The other issue is that the engine launch sequence often contains multiple UCI commands that trigger a hash clear. To avoid redundant clears, state tracking is needed for the hash table (a sketch of such state tracking follows the transcript below).

Ideally:

% ./stockfish    <-- default hash is 16 MB, but we don't initialize yet
uci
...
setoption name Hash value 32768
setoption name Threads value 12
isready          <-- this needs to trigger the initialization
Resize 32768 MB  <-- allocate the memory
Clear 12 threads <-- clear with 12 threads; hash state is clean
readyok
ucinewgame       <-- clear request on clean hash is redundant, so no need to actually clear
isready          <-- nothing to do here
readyok
go depth 1       <-- mark the hash state dirty here
info depth 1 seldepth 1 multipv 1 score cp 110 nodes 251 nps 83666 tbhits 0 time 3 pv e2e3
bestmove e2e3 ponder b7b6
ucinewgame       <-- clear request on a dirty hash triggers an actual clear: mark the hash to be cleared
isready
Clear 12 threads <-- clear with 12 threads; hash state is clean again
readyok
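
As a concrete illustration, here is a minimal sketch of deferred initialization with hash state tracking. The state names and helpers (ensureHashReady(), resizeHash(), clearHash(), requestedHashMB, requestedThreads) are hypothetical stand-ins, not Stockfish's actual code:

// Hypothetical helpers standing in for the engine's own routines.
void resizeHash(int mb);
void clearHash(int threads);
extern int requestedHashMB, requestedThreads;

enum class HashState { Unallocated, Clean, Dirty, ClearPending };
HashState hashState = HashState::Unallocated;

// Run before answering isready and before any command that may start a search.
void ensureHashReady() {
    if (hashState == HashState::Unallocated) { // deferred: allocate with the latest options
        resizeHash(requestedHashMB);
        clearHash(requestedThreads);
        hashState = HashState::Clean;
    } else if (hashState == HashState::ClearPending) {
        clearHash(requestedThreads);
        hashState = HashState::Clean;
    }
}

void onUcinewgame() {
    if (hashState == HashState::Dirty)         // a clear request on a clean hash is redundant
        hashState = HashState::ClearPending;
}

void onGo() {
    ensureHashReady();                         // guard against an omitted isready
    hashState = HashState::Dirty;              // the search writes to the hash
    // ... start the search ...
}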


Optimizing the initialization sequence

A single CPU core on many high-end machines cannot saturate the memory bus. Thus, a multi-threaded clear is usually much faster than a single-threaded clear. This is arguably the most important optimization for clearing. However, the number of clear threads is a trickier question. Generally, for the fastest clear, the number of threads should be just high enough to saturate the memory bus. With too many threads, the speed goes down again, since DRAM is fastest with sequential accesses: the sequential clears from the threads interleave with each other, and the sequential bursts in DRAM become shorter.
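
As an illustration, a multi-threaded clear can look roughly like the following sketch. It assumes the table is one contiguous allocation; the function and parameter names are hypothetical:

#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Zero the table with numThreads threads, each clearing one contiguous
// chunk so that the DRAM accesses within a chunk stay sequential.
void clearHash(char *tableMem, size_t tableBytes, size_t numThreads) {
    std::vector<std::thread> threads;
    for (size_t i = 0; i < numThreads; i++)
        threads.emplace_back([=] {
            size_t begin = tableBytes * i / numThreads;
            size_t end   = tableBytes * (i + 1) / numThreads;
            std::memset(tableMem + begin, 0, end - begin);
        });
    for (auto &t : threads)
        t.join();
}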

However, there is another consideration. Even if malloc() and similar do not guarantee zero-initialized memory, the underlying OS page allocation functions do (e.g., VirtualAlloc(), mmap()). This matters if mmap() or VirtualAlloc() is already being used, for example for large pages. In this case, all that needs to be done is to make the pages resident. This is relatively easy if hash state tracking has been implemented.

To make the pages resident, every page must be touched. This can be done by writing a single zero to every page. Consider the following on Linux:

#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

void *ptr = mmap(NULL, memSize, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); // allocates virtual memory, but does not make the pages resident
const long pageSize = sysconf(_SC_PAGESIZE); // retrieve the actual page size instead of assuming 4096
reinterpret_cast<uint8_t *>(ptr)[0] = 0;            // make the first page resident
reinterpret_cast<uint8_t *>(ptr)[pageSize] = 0;     // make the second page resident
reinterpret_cast<uint8_t *>(ptr)[pageSize * 2] = 0; // make the third page resident

When the Linux kernel makes a newly allocated page resident for the first time, it clears the page using the accessing thread. Thus, a multi-threaded clear still makes sense. However, the performance difference between merely touching each page and zeroing it in full is not that big: with full zeroing, we are essentially clearing every page twice before moving on to the next one, and the second clear generally only touches the CPU caches, adding no traffic on the memory bus.
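
A rough sketch of such a multi-threaded page touch, under the same assumptions as the clear above (one contiguous mmap()'ed allocation, hypothetical names):

#include <unistd.h>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Touch every page once so the kernel makes it resident (and zeroes it)
// in the context of the touching thread.
void touchPages(uint8_t *mem, size_t bytes, size_t numThreads) {
    const size_t pageSize = sysconf(_SC_PAGESIZE);
    const size_t numPages = (bytes + pageSize - 1) / pageSize;
    std::vector<std::thread> threads;
    for (size_t i = 0; i < numThreads; i++)
        threads.emplace_back([=] {
            size_t beginPage = numPages * i / numThreads;
            size_t endPage   = numPages * (i + 1) / numThreads;
            for (size_t p = beginPage; p < endPage; p++)
                mem[p * pageSize] = 0; // one write per page is enough
        });
    for (auto &t : threads)
        t.join();
}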

Accessing the transposition table

Memory access explained

TODO TLBs, caches, page sizes

Prefetch

TODO

Atomic access

TODO

Putting it all together

TODO numbers explaining measured speedups with all these different techniques