BadFish

mirror of https://github.com/sockspls/badfish synced 2025-04-30 00:33:09 +00:00

Author	SHA1	Message	Date
Tomasz Sobczyk	3ac75cd27d	Add a standardized benchmark command `speedtest`. `speedtest [threads] [hash_MiB] [time_s]`. `threads` default to system concurrency. `hash_MiB` defaults to `threads*128`. `time_s` defaults to 150. Intended to be used with default parameters, as a stable hardware benchmark. Example: ``` C:\dev\stockfish-master\src>stockfish.exe speedtest Stockfish dev-20240928-nogit by the Stockfish developers (see AUTHORS file) info string Using 16 threads Warmup position 3/3 Position 258/258 =========================== Version : Stockfish dev-20240928-nogit Compiled by : g++ (GNUC) 13.2.0 on MinGW64 Compilation architecture : x86-64-vnni256 Compilation settings : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT Compiler __VERSION__ macro : 13.2.0 Large pages : yes User invocation : speedtest Filled invocation : speedtest 16 2048 150 Available processors : 0-15 Thread count : 16 Thread binding : none TT size [MiB] : 2048 Hash max, avg [per mille] : single search : 40, 21 single game : 631, 428 Total nodes searched : 2099917842 Total search time [s] : 153.937 Nodes/second : 13641410 ``` ------------------------------- Small unrelated tweaks: - Network verification output is now handled as a callback. - TT hashfull queries allow specifying maximum entry age. closes https://github.com/official-stockfish/Stockfish/pull/5354 No functional change	2024-09-28 18:01:26 +02:00
Dubslow	86b564055d	Remove delta, adjusted, complexity from nnue code ...rather they're the consumer's concern whether to tweak the result or not. Passed STC: https://tests.stockfishchess.org/tests/view/665cea9ffd45fb0f907c53bd LLR: 2.93 (-2.94,2.94) <-1.75,0.25> Total: 69696 W: 18101 L: 17918 D: 33677 Ptnml(0-2): 195, 8171, 17929, 8362, 191 Passed LTC: https://tests.stockfishchess.org/tests/view/665cf761fd45fb0f907c5406 LLR: 2.96 (-2.94,2.94) <-1.75,0.25> Total: 63720 W: 16344 L: 16165 D: 31211 Ptnml(0-2): 32, 6990, 17625, 7193, 20 Non functional except for rounding issues of OutputScale changing bench. closes https://github.com/official-stockfish/Stockfish/pull/5344 Bench: 1378596	2024-06-03 23:27:58 +02:00
Disservin	00a28ae325	Add helpers for managing aligned memory Previously, we had two type aliases, LargePagePtr and AlignedPtr, which required manually initializing the aligned memory for the pointer. The new helpers: - make_unique_aligned - make_unique_large_page are now available for allocating aligned memory (with large pages). They behave similarly to std::make_unique, ensuring objects allocated with these functions follow RAII. The old approach had issues with initializing non-trivial types or arrays of objects. The evaluation function of the network is now a unique pointer to an array instead of an array of unique pointers. Memory related functions have been moved into memory.h Passed High Hash Pressure Test Non-Regression STC: https://tests.stockfishchess.org/tests/view/665b2b36586058766677cfd2 LLR: 2.93 (-2.94,2.94) <-1.75,0.25> Total: 476992 W: 122426 L: 122677 D: 231889 Ptnml(0-2): 1145, 51027, 134419, 50744, 1161 Failed Normal Non-Regression STC: https://tests.stockfishchess.org/tests/view/665b2997586058766677cfc8 LLR: -2.94 (-2.94,2.94) <-1.75,0.25> Total: 877312 W: 225233 L: 226395 D: 425684 Ptnml(0-2): 2110, 94642, 246239, 93630, 2035 Probably a fluke since there shouldn't be a real slowndown and it has also passed the high hash pressure test. closes https://github.com/official-stockfish/Stockfish/pull/5332 No functional change	2024-06-03 23:11:59 +02:00
Tomasz Sobczyk	a169c78b6d	Improve performance on NUMA systems Allow for NUMA memory replication for NNUE weights. Bind threads to ensure execution on a specific NUMA node. This patch introduces NUMA memory replication, currently only utilized for the NNUE weights. Along with it comes all machinery required to identify NUMA nodes and bind threads to specific processors/nodes. It also comes with small changes to Thread and ThreadPool to allow easier execution of custom functions on the designated thread. Old thread binding (WinProcGroup) machinery is removed because it's incompatible with this patch. Small changes to unrelated parts of the code were made to ensure correctness, like some classes being made unmovable, raw pointers replaced with unique_ptr. etc. Windows 7 and Windows 10 is partially supported. Windows 11 is fully supported. Linux is fully supported, with explicit exclusion of Android. No additional dependencies. ----------------- A new UCI option `NumaPolicy` is introduced. It can take the following values: ``` system - gathers NUMA node information from the system (lscpu or windows api), for each threads binds it to a single NUMA node none - assumes there is 1 NUMA node, never binds threads auto - this is the default value, depends on the number of set threads and NUMA nodes, will only enable binding on multinode systems and when the number of threads reaches a threshold (dependent on node size and count) [[custom]] - // ':'-separated numa nodes // ','-separated cpu indices // supports "first-last" range syntax for cpu indices, for example '0-15,32-47:16-31,48-63' ``` Setting `NumaPolicy` forces recreation of the threads in the ThreadPool, which in turn forces the recreation of the TT. The threads are distributed among NUMA nodes in a round-robin fashion based on fill percentage (i.e. it will strive to fill all NUMA nodes evenly). Threads are bound to NUMA nodes, not specific processors, because that's our only requirement and the OS can schedule them better. Special care is made that maximum memory usage on systems that do not require memory replication stays as previously, that is, unnecessary copies are avoided. On linux the process' processor affinity is respected. This means that if you for example use taskset to restrict Stockfish to a single NUMA node then the `system` and `auto` settings will only see a single NUMA node (more precisely, the processors included in the current affinity mask) and act accordingly. ----------------- We can't ensure that a memory allocation takes place on a given NUMA node without using libnuma on linux, or using appropriate custom allocators on windows (https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node), so to avoid complications the current implementation relies on first-touch policy. Due to this we also rely on the memory allocator to give us a new chunk of untouched memory from the system. This appears to work reliably on linux, but results may vary. MacOS is not supported, because AFAIK it's not affected, and implementation would be problematic anyway. Windows is supported since Windows 7 (https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity). Until Windows 11/Server 2022 NUMA nodes are split such that they cannot span processor groups. This is because before Windows 11/Server 2022 it's not possible to set thread affinity spanning processor groups. The splitting is done manually in some cases (required after Windows 10 Build 20348). Since Windows 11/Server 2022 we can set affinites spanning processor group so this splitting is not done, so the behaviour is pretty much like on linux. Linux is supported, without libnuma requirement. `lscpu` is expected. ----------------- Passed 60+1 @ 256t 16000MB hash: https://tests.stockfishchess.org/tests/view/6654e443a86388d5e27db0d8 ``` LLR: 2.95 (-2.94,2.94) <0.00,10.00> Total: 278 W: 110 L: 29 D: 139 Ptnml(0-2): 0, 1, 56, 82, 0 ``` Passed SMP STC: https://tests.stockfishchess.org/tests/view/6654fc74a86388d5e27db1cd ``` LLR: 2.95 (-2.94,2.94) <-1.75,0.25> Total: 67152 W: 17354 L: 17177 D: 32621 Ptnml(0-2): 64, 7428, 18408, 7619, 57 ``` Passed STC: https://tests.stockfishchess.org/tests/view/6654fb27a86388d5e27db15c ``` LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 131648 W: 34155 L: 34045 D: 63448 Ptnml(0-2): 426, 13878, 37096, 14008, 416 ``` fixes #5253 closes https://github.com/official-stockfish/Stockfish/pull/5285 No functional change	2024-05-28 18:34:15 +02:00
cj5716	8ee9905d8b	Remove PSQT-only mode Passed STC: LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 94208 W: 24270 L: 24112 D: 45826 Ptnml(0-2): 286, 11186, 24009, 11330, 293 https://tests.stockfishchess.org/tests/view/6635ddd773559a8aa8582826 Passed LTC: LLR: 2.95 (-2.94,2.94) <-1.75,0.25> Total: 114960 W: 29107 L: 28982 D: 56871 Ptnml(0-2): 37, 12683, 31924, 12790, 46 https://tests.stockfishchess.org/tests/view/663604a973559a8aa85881ed closes #5214 Bench 1653939	2024-05-05 12:36:20 +02:00
mstembera	940a3a7383	Cache small net w/ psqtOnly support Caching the small net in the same way as the big net allows them to share the same code path and completely removes update_accumulator_refresh(). STC: https://tests.stockfishchess.org/tests/view/662bfb5ed46f72253dcfed85 LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 151712 W: 39252 L: 39158 D: 73302 Ptnml(0-2): 565, 17474, 39683, 17570, 564 closes https://github.com/official-stockfish/Stockfish/pull/5194 Bench: 1836777	2024-04-28 21:30:19 +02:00
Joost VandeVondele	bc45cbc820	Output some basic info about the used networks Adds size in memory as well as layer sizes as in info string NNUE evaluation using nn-ae6a388e4a1a.nnue (132MiB, (22528, 3072, 15, 32, 1)) info string NNUE evaluation using nn-baff1ede1f90.nnue (6MiB, (22528, 128, 15, 32, 1)) For example, the size in MiB is useful to keep the fishtest memory sizes up-to-date, the L1-L3 sizes give a useful hint about the architecture used. closes https://github.com/official-stockfish/Stockfish/pull/5193 No functional change	2024-04-28 21:27:28 +02:00
gab8192	49ef4c935a	Implement accumulator refresh table For each thread persist an accumulator cache for the network, where each cache contains multiple entries for each of the possible king squares. When the accumulator needs to be refreshed, the cached entry is used to more efficiently update the accumulator, instead of rebuilding it from scratch. This idea, was first described by Luecx (author of Koivisto) and is commonly referred to as "Finny Tables". When the accumulator needs to be refreshed, instead of filling it with biases and adding every piece from scratch, we... 1. Take the `AccumulatorRefreshEntry` associated with the new king bucket 2. Calculate the features to activate and deactivate (from differences between bitboards in the entry and bitboards of the actual position) 3. Apply the updates on the refresh entry 4. Copy the content of the refresh entry accumulator to the accumulator we were refreshing 5. Copy the bitboards from the position to the refresh entry, to match the newly updated accumulator Results at STC: https://tests.stockfishchess.org/tests/view/662301573fe04ce4cefc1386 (first version) https://tests.stockfishchess.org/tests/view/6627fa063fe04ce4cefc6560 (final) Non-Regression between first and final: https://tests.stockfishchess.org/tests/view/662801e33fe04ce4cefc660a STC SMP: https://tests.stockfishchess.org/tests/view/662808133fe04ce4cefc667c closes https://github.com/official-stockfish/Stockfish/pull/5183 No functional change	2024-04-24 18:38:20 +02:00
Disservin	55df0ee009	Fix Raspberry Pi Compilation Reported by @Torom over discord. > dev build fails on Raspberry Pi 5 with clang ``` clang++ -o stockfish benchmark.o bitboard.o evaluate.o main.o misc.o movegen.o movepick.o position.o search.o thread.o timeman.o tt.o uci.o ucioption.o tune.o tbprobe.o nnue_misc.o half_ka_v2_hm.o network.o -fprofile-instr-generate -latomic -lpthread -Wall -Wcast-qual -fno-exceptions -std=c++17 -fprofile-instr-generate -pedantic -Wextra -Wshadow -Wmissing-prototypes -Wconditional-uninitialized -DUSE_PTHREADS -DNDEBUG -O3 -funroll-loops -DIS_64BIT -DUSE_POPCNT -DUSE_NEON=8 -march=armv8.2-a+dotprod -DUSE_NEON_DOTPROD -DGIT_SHA=627974c9 -DGIT_DATE=20240312 -DARCH=armv8-dotprod -flto=full /tmp/lto-llvm-e9300e.o: in function `_GLOBAL__sub_I_network.cpp': ld-temp.o:(.text.startup+0x704c): relocation truncated to fit: R_AARCH64_LDST64_ABS_LO12_NC against symbol `gEmbeddedNNUEBigEnd' defined in .rodata section in /tmp/lto-llvm-e9300e.o /usr/bin/ld: ld-temp.o:(.text.startup+0x704c): warning: one possible cause of this error is that the symbol is being referenced in the indicated code as if it had a larger alignment than was declared where it was defined ld-temp.o:(.text.startup+0x7068): relocation truncated to fit: R_AARCH64_LDST64_ABS_LO12_NC against symbol `gEmbeddedNNUESmallEnd' defined in .rodata section in /tmp/lto-llvm-e9300e.o /usr/bin/ld: ld-temp.o:(.text.startup+0x7068): warning: one possible cause of this error is that the symbol is being referenced in the indicated code as if it had a larger alignment than was declared where it was defined clang: error: linker command failed with exit code 1 (use -v to see invocation) make[2]: * [Makefile:1051: stockfish] Error 1 make[2]: Leaving directory '/home/torsten/chess/Stockfish_master/src' make[1]: * [Makefile:1058: clang-profile-make] Error 2 make[1]: Leaving directory '/home/torsten/chess/Stockfish_master/src' make: *** [Makefile:886: profile-build] Error 2 ``` closes https://github.com/official-stockfish/Stockfish/pull/5106 No functional change	2024-03-12 19:09:50 +01:00
Disservin	1a26d698de	Refactor Network Usage Continuing from PR #4968, this update improves how Stockfish handles network usage, making it easier to manage and modify networks in the future. With the introduction of a dedicated Network class, creating networks has become straightforward. See uci.cpp: ```cpp NN::NetworkBig({EvalFileDefaultNameBig, "None", ""}, NN::embeddedNNUEBig) ``` The new `Network` encapsulates all network-related logic, significantly reducing the complexity previously required to support multiple network types, such as the distinction between small and big networks #4915. Non-Regression STC: https://tests.stockfishchess.org/tests/view/65edd26c0ec64f0526c43584 LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 33760 W: 8887 L: 8661 D: 16212 Ptnml(0-2): 143, 3795, 8808, 3961, 173 Non-Regression SMP STC: https://tests.stockfishchess.org/tests/view/65ed71970ec64f0526c42fdd LLR: 2.96 (-2.94,2.94) <-1.75,0.25> Total: 59088 W: 15121 L: 14931 D: 29036 Ptnml(0-2): 110, 6640, 15829, 6880, 85 Compiled with `make -j profile-build` ``` bash ./bench_parallel.sh ./stockfish ./stockfish-nnue 13 50 sf_base = 1568540 +/- 7637 (95%) sf_test = 1573129 +/- 7301 (95%) diff = 4589 +/- 8720 (95%) speedup = 0.29260% +/- 0.556% (95%) ``` Compiled with `make -j build` ``` bash ./bench_parallel.sh ./stockfish ./stockfish-nnue 13 50 sf_base = 1472653 +/- 7293 (95%) sf_test = 1491928 +/- 7661 (95%) diff = 19275 +/- 7154 (95%) speedup = 1.30886% +/- 0.486% (95%) ``` closes https://github.com/official-stockfish/Stockfish/pull/5100 No functional change	2024-03-12 16:41:08 +01:00

10 commits