BadFish

mirror of https://github.com/sockspls/badfish synced 2025-04-30 08:43:09 +00:00

Author	SHA1	Message	Date
Tomasz Sobczyk	a169c78b6d	Improve performance on NUMA systems Allow for NUMA memory replication for NNUE weights. Bind threads to ensure execution on a specific NUMA node. This patch introduces NUMA memory replication, currently only utilized for the NNUE weights. Along with it comes all machinery required to identify NUMA nodes and bind threads to specific processors/nodes. It also comes with small changes to Thread and ThreadPool to allow easier execution of custom functions on the designated thread. Old thread binding (WinProcGroup) machinery is removed because it's incompatible with this patch. Small changes to unrelated parts of the code were made to ensure correctness, like some classes being made unmovable, raw pointers replaced with unique_ptr, etc. Windows 7 and Windows 10 are partially supported. Windows 11 is fully supported. Linux is fully supported, with explicit exclusion of Android. No additional dependencies. ----------------- A new UCI option `NumaPolicy` is introduced. It can take the following values: ``` system - gathers NUMA node information from the system (lscpu or Windows API) and binds each thread to a single NUMA node none - assumes there is 1 NUMA node, never binds threads auto - this is the default value, depends on the number of set threads and NUMA nodes, will only enable binding on multinode systems and when the number of threads reaches a threshold (dependent on node size and count) [[custom]] - // ':'-separated numa nodes // ','-separated cpu indices // supports "first-last" range syntax for cpu indices, for example '0-15,32-47:16-31,48-63' ``` Setting `NumaPolicy` forces recreation of the threads in the ThreadPool, which in turn forces the recreation of the TT. The threads are distributed among NUMA nodes in a round-robin fashion based on fill percentage (i.e. it will strive to fill all NUMA nodes evenly).
Threads are bound to NUMA nodes, not specific processors, because that's our only requirement and the OS can schedule them better. Special care is taken that maximum memory usage on systems that do not require memory replication stays the same as before, that is, unnecessary copies are avoided. On Linux, the process's processor affinity is respected. This means that if you for example use taskset to restrict Stockfish to a single NUMA node then the `system` and `auto` settings will only see a single NUMA node (more precisely, the processors included in the current affinity mask) and act accordingly. ----------------- We can't ensure that a memory allocation takes place on a given NUMA node without using libnuma on Linux, or using appropriate custom allocators on Windows (https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node), so to avoid complications the current implementation relies on first-touch policy. Due to this we also rely on the memory allocator to give us a new chunk of untouched memory from the system. This appears to work reliably on Linux, but results may vary. macOS is not supported, because AFAIK it's not affected, and implementation would be problematic anyway. Windows is supported since Windows 7 (https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity). Until Windows 11/Server 2022 NUMA nodes are split such that they cannot span processor groups. This is because before Windows 11/Server 2022 it's not possible to set thread affinity spanning processor groups. The splitting is done manually in some cases (required after Windows 10 Build 20348). Since Windows 11/Server 2022 we can set affinities spanning processor groups, so this splitting is not done and the behaviour is pretty much like on Linux. Linux is supported, without libnuma requirement. `lscpu` is expected.
----------------- Passed 60+1 @ 256t 16000MB hash: https://tests.stockfishchess.org/tests/view/6654e443a86388d5e27db0d8 ``` LLR: 2.95 (-2.94,2.94) <0.00,10.00> Total: 278 W: 110 L: 29 D: 139 Ptnml(0-2): 0, 1, 56, 82, 0 ``` Passed SMP STC: https://tests.stockfishchess.org/tests/view/6654fc74a86388d5e27db1cd ``` LLR: 2.95 (-2.94,2.94) <-1.75,0.25> Total: 67152 W: 17354 L: 17177 D: 32621 Ptnml(0-2): 64, 7428, 18408, 7619, 57 ``` Passed STC: https://tests.stockfishchess.org/tests/view/6654fb27a86388d5e27db15c ``` LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 131648 W: 34155 L: 34045 D: 63448 Ptnml(0-2): 426, 13878, 37096, 14008, 416 ``` fixes #5253 closes https://github.com/official-stockfish/Stockfish/pull/5285 No functional change	2024-05-28 18:34:15 +02:00
Disservin	be026bdcb2	Clear Workers after changing the network ensures internal state (e.g. accumulator cache) is consistent with network closes https://github.com/official-stockfish/Stockfish/pull/5204 No functional change	2024-05-05 12:30:28 +02:00
xoto10	886ed90ec3	Use less time on recaptures Credit for the idea goes to peregrine on discord. Passed STC 10+0.1: https://tests.stockfishchess.org/tests/view/662652623fe04ce4cefc48cf LLR: 2.95 (-2.94,2.94) <0.00,2.00> Total: 75712 W: 19793 L: 19423 D: 36496 Ptnml(0-2): 258, 8487, 20023, 8803, 285 Passed LTC 60+0.6: https://tests.stockfishchess.org/tests/view/6627495e3fe04ce4cefc59b6 LLR: 2.94 (-2.94,2.94) <0.50,2.50> Total: 49788 W: 12743 L: 12404 D: 24641 Ptnml(0-2): 29, 5141, 14215, 5480, 29 The code was updated slightly and tested for non-regression against the original code at STC: LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 41952 W: 10912 L: 10698 D: 20342 Ptnml(0-2): 133, 4825, 10835, 5061, 122 https://tests.stockfishchess.org/tests/view/662d84f56115ff6764c7e438 closes https://github.com/official-stockfish/Stockfish/pull/5189 Bench: 1836777	2024-04-28 21:26:25 +02:00
Disservin	ddd250b9d6	Restore NPS output for Perft Previously it was possible to also get the node counter after running a bench with perft, i.e. `./stockfish bench 1 1 5 current perft`; this output was lost due to a small regression from the UCI refactoring. ``` Nodes searched: 4865609 =========================== Total time (ms) : 18 Nodes searched : 4865609 Nodes/second : 270311611 ``` closes https://github.com/official-stockfish/Stockfish/pull/5188 No functional change	2024-04-24 18:20:55 +02:00
Disservin	4912f5b0b5	Remove duplicated Position object in UCIEngine Also fixes searchmoves. Drop the need for a Position object in uci.cpp. As a side note, it is still required for the static functions, but these should be moved to a different namespace/class later on, since sf kinda relies on them. closes https://github.com/official-stockfish/Stockfish/pull/5169 No functional change	2024-04-12 19:37:39 +02:00
Disservin	9032c6cbe7	Transform search output to engine callbacks Part 2 of the Split UCI into UCIEngine and Engine refactor. This creates function callbacks for search to use when an update should occur. The benching in uci.cpp for example does this to extract the total nodes searched. No functional change	2024-04-05 21:03:58 +02:00
Disservin	299707d2c2	Split UCI into UCIEngine and Engine This is another refactor which aims to decouple uci from stockfish. A new engine class manages all engine related logic and uci is a "small" wrapper around it. In the future we should also try to remove the need for the Position object in the uci and replace the options with an actual options struct instead of using a map. Also convert the std::string's in the Info structs to string_view. closes #5147 No functional change	2024-04-04 00:15:17 +02:00

7 commits
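
The custom `NumaPolicy` syntax described in commit a169c78b6d (':'-separated NUMA nodes, ','-separated CPU indices, with "first-last" ranges) can be illustrated with a small parser. This is only a sketch under stated assumptions, not the code from the patch: the names `CpuSet`, `parse_index` and `parse_numa_policy` are hypothetical, and the choice to reject empty nodes and reversed ranges is an assumption made here for illustration.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types and helpers for illustration; not Stockfish's actual API.
using CpuSet = std::set<std::size_t>;

// Converts a decimal string to an index, rejecting empty or non-numeric input.
inline std::size_t parse_index(const std::string& s) {
    if (s.empty() || s.find_first_not_of("0123456789") != std::string::npos)
        throw std::invalid_argument("invalid cpu index: '" + s + "'");
    return static_cast<std::size_t>(std::stoull(s));
}

// Parses e.g. "0-15,32-47:16-31,48-63" into one set of CPU indices per node.
inline std::vector<CpuSet> parse_numa_policy(const std::string& policy) {
    std::vector<CpuSet> nodes;
    std::istringstream  nodeStream(policy);
    std::string         nodeStr;

    // Each ':'-separated chunk describes one NUMA node.
    while (std::getline(nodeStream, nodeStr, ':')) {
        CpuSet             cpus;
        std::istringstream cpuStream(nodeStr);
        std::string        item;

        // Each ','-separated item is a single index or a "first-last" range.
        while (std::getline(cpuStream, item, ',')) {
            const auto dash = item.find('-');
            if (dash == std::string::npos) {
                cpus.insert(parse_index(item));
            } else {
                const std::size_t first = parse_index(item.substr(0, dash));
                const std::size_t last  = parse_index(item.substr(dash + 1));
                if (first > last)
                    throw std::invalid_argument("reversed range: '" + item + "'");
                for (std::size_t i = first; i <= last; ++i)
                    cpus.insert(i);
            }
        }

        // Assumption made for this sketch: a node without CPUs is an error.
        if (cpus.empty())
            throw std::invalid_argument("empty NUMA node in policy string");
        nodes.push_back(std::move(cpus));
    }
    return nodes;
}
```

Calling `parse_numa_policy("0-15,32-47:16-31,48-63")` yields two nodes of 32 CPU indices each, matching the example string from the commit message. How the engine itself validates or applies such a string is not stated in the log above.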