Previously, we had two type aliases, LargePagePtr and AlignedPtr, which
required manually initializing the aligned memory for the pointer.
The new helpers:
- make_unique_aligned
- make_unique_large_page
are now available for allocating aligned memory (with large pages). They
behave similarly to std::make_unique, ensuring objects allocated with
these functions follow RAII.
The old approach had issues with initializing non-trivial types or
arrays of objects. The evaluation function of the network is now a
unique pointer to an array instead of an array of unique pointers.
Memory related functions have been moved into memory.h
Passed High Hash Pressure Test Non-Regression STC:
https://tests.stockfishchess.org/tests/view/665b2b36586058766677cfd2
LLR: 2.93 (-2.94,2.94) <-1.75,0.25>
Total: 476992 W: 122426 L: 122677 D: 231889
Ptnml(0-2): 1145, 51027, 134419, 50744, 1161
Failed Normal Non-Regression STC:
https://tests.stockfishchess.org/tests/view/665b2997586058766677cfc8
LLR: -2.94 (-2.94,2.94) <-1.75,0.25>
Total: 877312 W: 225233 L: 226395 D: 425684
Ptnml(0-2): 2110, 94642, 246239, 93630, 2035
Probably a fluke since there shouldn't be a real slowndown and it has also
passed the high hash pressure test.
closes https://github.com/official-stockfish/Stockfish/pull/5332
No functional change
Allow for NUMA memory replication for NNUE weights. Bind threads to ensure execution on a specific NUMA node.
This patch introduces NUMA memory replication, currently only utilized for the NNUE weights. Along with it comes all machinery required to identify NUMA nodes and bind threads to specific processors/nodes. It also comes with small changes to Thread and ThreadPool to allow easier execution of custom functions on the designated thread. Old thread binding (WinProcGroup) machinery is removed because it's incompatible with this patch. Small changes to unrelated parts of the code were made to ensure correctness, like some classes being made unmovable, raw pointers replaced with unique_ptr. etc.
Windows 7 and Windows 10 is partially supported. Windows 11 is fully supported. Linux is fully supported, with explicit exclusion of Android. No additional dependencies.
-----------------
A new UCI option `NumaPolicy` is introduced. It can take the following values:
```
system - gathers NUMA node information from the system (lscpu or windows api), for each threads binds it to a single NUMA node
none - assumes there is 1 NUMA node, never binds threads
auto - this is the default value, depends on the number of set threads and NUMA nodes, will only enable binding on multinode systems and when the number of threads reaches a threshold (dependent on node size and count)
[[custom]] -
// ':'-separated numa nodes
// ','-separated cpu indices
// supports "first-last" range syntax for cpu indices,
for example '0-15,32-47:16-31,48-63'
```
Setting `NumaPolicy` forces recreation of the threads in the ThreadPool, which in turn forces the recreation of the TT.
The threads are distributed among NUMA nodes in a round-robin fashion based on fill percentage (i.e. it will strive to fill all NUMA nodes evenly). Threads are bound to NUMA nodes, not specific processors, because that's our only requirement and the OS can schedule them better.
Special care is made that maximum memory usage on systems that do not require memory replication stays as previously, that is, unnecessary copies are avoided.
On linux the process' processor affinity is respected. This means that if you for example use taskset to restrict Stockfish to a single NUMA node then the `system` and `auto` settings will only see a single NUMA node (more precisely, the processors included in the current affinity mask) and act accordingly.
-----------------
We can't ensure that a memory allocation takes place on a given NUMA node without using libnuma on linux, or using appropriate custom allocators on windows (https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node), so to avoid complications the current implementation relies on first-touch policy. Due to this we also rely on the memory allocator to give us a new chunk of untouched memory from the system. This appears to work reliably on linux, but results may vary.
MacOS is not supported, because AFAIK it's not affected, and implementation would be problematic anyway.
Windows is supported since Windows 7 (https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity). Until Windows 11/Server 2022 NUMA nodes are split such that they cannot span processor groups. This is because before Windows 11/Server 2022 it's not possible to set thread affinity spanning processor groups. The splitting is done manually in some cases (required after Windows 10 Build 20348). Since Windows 11/Server 2022 we can set affinites spanning processor group so this splitting is not done, so the behaviour is pretty much like on linux.
Linux is supported, **without** libnuma requirement. `lscpu` is expected.
-----------------
Passed 60+1 @ 256t 16000MB hash: https://tests.stockfishchess.org/tests/view/6654e443a86388d5e27db0d8
```
LLR: 2.95 (-2.94,2.94) <0.00,10.00>
Total: 278 W: 110 L: 29 D: 139
Ptnml(0-2): 0, 1, 56, 82, 0
```
Passed SMP STC: https://tests.stockfishchess.org/tests/view/6654fc74a86388d5e27db1cd
```
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 67152 W: 17354 L: 17177 D: 32621
Ptnml(0-2): 64, 7428, 18408, 7619, 57
```
Passed STC: https://tests.stockfishchess.org/tests/view/6654fb27a86388d5e27db15c
```
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 131648 W: 34155 L: 34045 D: 63448
Ptnml(0-2): 426, 13878, 37096, 14008, 416
```
fixes#5253
closes https://github.com/official-stockfish/Stockfish/pull/5285
No functional change
Also "fix" movepicker to allow depths between CHECKS and NO_CHECKS,
which makes them easier to tweak (not that they get tweaked hardly ever)
(This was more beneficial when there was a third stage to DEPTH_QS, but
it's still an improvement now)
closes https://github.com/official-stockfish/Stockfish/pull/5205
No functional change
We change the definition of "age" from "age of this position" to "age of this TT entry".
In this way, despite being on the same position, when we save into TT, we always prefer the new entry as compared to the old one.
Passed STC:
LLR: 2.94 (-2.94,2.94) <0.00,2.00>
Total: 152256 W: 39597 L: 39110 D: 73549
Ptnml(0-2): 556, 17562, 39398, 18063, 549
https://tests.stockfishchess.org/tests/view/6620faee3fe04ce4cefbf215
Passed LTC:
LLR: 2.95 (-2.94,2.94) <0.50,2.50>
Total: 51564 W: 13242 L: 12895 D: 25427
Ptnml(0-2): 24, 5464, 14463, 5803, 28
https://tests.stockfishchess.org/tests/view/66231ab53fe04ce4cefc153ecloses#5184
Bench 1479416
This aims to remove some of the annoying global structure which Stockfish has.
Overall there is no major elo regression to be expected.
Non regression SMP STC (paused, early version):
https://tests.stockfishchess.org/tests/view/65983d7979aa8af82b9608f1
LLR: 0.23 (-2.94,2.94) <-1.75,0.25>
Total: 76232 W: 19035 L: 19096 D: 38101
Ptnml(0-2): 92, 8735, 20515, 8690, 84
Non regression STC (early version):
https://tests.stockfishchess.org/tests/view/6595b3a479aa8af82b95da7f
LLR: 2.93 (-2.94,2.94) <-1.75,0.25>
Total: 185344 W: 47027 L: 46972 D: 91345
Ptnml(0-2): 571, 21285, 48943, 21264, 609
Non regression SMP STC:
https://tests.stockfishchess.org/tests/view/65a0715c79aa8af82b96b7e4
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 142936 W: 35761 L: 35662 D: 71513
Ptnml(0-2): 209, 16400, 38135, 16531, 193
These global structures/variables add hidden dependencies and allow data
to be mutable from where it shouldn't it be (i.e. options). They also
prevent Stockfish from internal selfplay, which would be a nice thing to
be able to do, i.e. instantiate two Stockfish instances and let them
play against each other. It will also allow us to make Stockfish a
library, which can be easier used on other platforms.
For consistency with the old search code, `thisThread` has been kept,
even though it is not strictly necessary anymore. This the first major
refactor of this kind (in recent time), and future changes are required,
to achieve the previously described goals. This includes cleaning up the
dependencies, transforming the network to be self contained and coming
up with a plan to deal with proper tablebase memory management (see
comments for more information on this).
The removal of these global structures has been discussed in parts with
Vondele and Sopel.
closes https://github.com/official-stockfish/Stockfish/pull/4968
No functional change
- remove the blank line between the declaration of the function and it's
comment, leads to better IDE support when hovering over a function to see it's
description
- remove the unnecessary duplication of the function name in the functions
description
- slightly refactored code for lsb, msb in bitboard.h There are still a few
things we can be improved later on, move the description of a function where
it was declared (instead of implemented) and add descriptions to functions
which are behind macros ifdefs
closes https://github.com/official-stockfish/Stockfish/pull/4840
No functional change
This introduces clang-format to enforce a consistent code style for Stockfish.
Having a documented and consistent style across the code will make contributing easier
for new developers, and will make larger changes to the codebase easier to make.
To facilitate formatting, this PR includes a Makefile target (`make format`) to format the code,
this requires clang-format (version 17 currently) to be installed locally.
Installing clang-format is straightforward on most OS and distros
(e.g. with https://apt.llvm.org/, brew install clang-format, etc), as this is part of quite commonly
used suite of tools and compilers (llvm / clang).
Additionally, a CI action is present that will verify if the code requires formatting,
and comment on the PR as needed. Initially, correct formatting is not required, it will be
done by maintainers as part of the merge or in later commits, but obviously this is encouraged.
fixes https://github.com/official-stockfish/Stockfish/issues/3608
closes https://github.com/official-stockfish/Stockfish/pull/4790
Co-Authored-By: Joost VandeVondele <Joost.VandeVondele@gmail.com>
The current implementation generates warnings on MSVC. However, we have
no real use cases for double-typed UCI option values now. Also parameter
tuning only accepts following three types:
int, Value, Score
closes https://github.com/official-stockfish/Stockfish/pull/4505
No functional change
This patch removes some magic numbers in TT bit management and introduce proper
constants in the code, to improve documentation and ease further modifications.
No function change
Use TT memory functions to allocate memory for the NNUE weights. This
should provide a small speed-up on systems where large pages are not
automatically used, including Windows and some Linux distributions.
Further, since we now have a wrapper for std::aligned_alloc(), we can
simplify the TT memory management a bit:
- We no longer need to store separate pointers to the hash table and
its underlying memory allocation.
- We also get to merge the Linux-specific and default implementations
of aligned_ttmem_alloc().
Finally, we'll enable the VirtualAlloc code path with large page
support also for Win32.
STC: https://tests.stockfishchess.org/tests/view/5f66595823a84a47b9036fba
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 14896 W: 1854 L: 1686 D: 11356
Ptnml(0-2): 65, 1224, 4742, 1312, 105
closes https://github.com/official-stockfish/Stockfish/pull/3081
No functional change.
Fix the issue where a TT entry with key16==0 would always be reported
as a miss. Instead, we'll use depth8 to detect whether the TT entry is
occupied. In order to do that, we'll change DEPTH_OFFSET to -7
(depth8==0) to distinguish between an unoccupied entry and the
otherwise lowest possible depth, i.e., DEPTH_NONE (depth8==1).
To prevent a performance regression, we'll reorder the TT entry fields
by the access order of TranspositionTable::probe(). Memory in general
works fastest when accessed in sequential order. We'll also match the
store order in TTEntry::save() with the entry field order, and
re-order the 'if-or' expressions in TTEntry::save() from the cheapest
to the most expensive.
Finally, as we now have a proper TT entry occupancy test, we'll fix a
minor corner case with hashfull reporting. To reproduce:
- Use a big hash
- Either:
a. Start 31 very quick searches (this wraparounds generation to 0); or
b. Force generation of the first search to 0.
- go depth infinite
Before the fix, hashfull would incorrectly report nearly full hash
immediately after the search start, since
TranspositionTable::hashfull() used to consider only the entry
generation and not whether the entry was actually occupied.
STC:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 36848 W: 4091 L: 3898 D: 28859
Ptnml(0-2): 158, 2996, 11972, 3091, 207
https://tests.stockfishchess.org/tests/view/5f3f98d5dc02a01a0c2881f7
LTC:
LLR: 2.95 (-2.94,2.94) {0.25,1.25}
Total: 32280 W: 1828 L: 1653 D: 28799
Ptnml(0-2): 34, 1428, 13051, 1583, 44
https://tests.stockfishchess.org/tests/view/5f3fe77a87a5c3c63d8f5332
closes https://github.com/official-stockfish/Stockfish/pull/3048
Bench: 3760677
This patch ports the efficiently updatable neural network (NNUE) evaluation to Stockfish.
Both the NNUE and the classical evaluations are available, and can be used to
assign a value to a position that is later used in alpha-beta (PVS) search to find the
best move. The classical evaluation computes this value as a function of various chess
concepts, handcrafted by experts, tested and tuned using fishtest. The NNUE evaluation
computes this value with a neural network based on basic inputs. The network is optimized
and trained on the evalutions of millions of positions at moderate search depth.
The NNUE evaluation was first introduced in shogi, and ported to Stockfish afterward.
It can be evaluated efficiently on CPUs, and exploits the fact that only parts
of the neural network need to be updated after a typical chess move.
[The nodchip repository](https://github.com/nodchip/Stockfish) provides additional
tools to train and develop the NNUE networks.
This patch is the result of contributions of various authors, from various communities,
including: nodchip, ynasu87, yaneurao (initial port and NNUE authors), domschl, FireFather,
rqs, xXH4CKST3RXx, tttak, zz4032, joergoster, mstembera, nguyenpham, erbsenzaehler,
dorzechowski, and vondele.
This new evaluation needed various changes to fishtest and the corresponding infrastructure,
for which tomtor, ppigazzini, noobpwnftw, daylen, and vondele are gratefully acknowledged.
The first networks have been provided by gekkehenker and sergiovieri, with the latter
net (nn-97f742aaefcd.nnue) being the current default.
The evaluation function can be selected at run time with the `Use NNUE` (true/false) UCI option,
provided the `EvalFile` option points the the network file (depending on the GUI, with full path).
The performance of the NNUE evaluation relative to the classical evaluation depends somewhat on
the hardware, and is expected to improve quickly, but is currently on > 80 Elo on fishtest:
60000 @ 10+0.1 th 1
https://tests.stockfishchess.org/tests/view/5f28fe6ea5abc164f05e4c4c
ELO: 92.77 +-2.1 (95%) LOS: 100.0%
Total: 60000 W: 24193 L: 8543 D: 27264
Ptnml(0-2): 609, 3850, 9708, 10948, 4885
40000 @ 20+0.2 th 8
https://tests.stockfishchess.org/tests/view/5f290229a5abc164f05e4c58
ELO: 89.47 +-2.0 (95%) LOS: 100.0%
Total: 40000 W: 12756 L: 2677 D: 24567
Ptnml(0-2): 74, 1583, 8550, 7776, 2017
At the same time, the impact on the classical evaluation remains minimal, causing no significant
regression:
sprt @ 10+0.1 th 1
https://tests.stockfishchess.org/tests/view/5f2906a2a5abc164f05e4c5b
LLR: 2.94 (-2.94,2.94) {-6.00,-4.00}
Total: 34936 W: 6502 L: 6825 D: 21609
Ptnml(0-2): 571, 4082, 8434, 3861, 520
sprt @ 60+0.6 th 1
https://tests.stockfishchess.org/tests/view/5f2906cfa5abc164f05e4c5d
LLR: 2.93 (-2.94,2.94) {-6.00,-4.00}
Total: 10088 W: 1232 L: 1265 D: 7591
Ptnml(0-2): 49, 914, 3170, 843, 68
The needed networks can be found at https://tests.stockfishchess.org/nns
It is recommended to use the default one as indicated by the `EvalFile` UCI option.
Guidelines for testing new nets can be found at
https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#nnue-net-tests
Integration has been discussed in various issues:
https://github.com/official-stockfish/Stockfish/issues/2823https://github.com/official-stockfish/Stockfish/issues/2728
The integration branch will be closed after the merge:
https://github.com/official-stockfish/Stockfish/pull/2825https://github.com/official-stockfish/Stockfish/tree/nnue-player-wip
closes https://github.com/official-stockfish/Stockfish/pull/2912
This will be an exciting time for computer chess, looking forward to seeing the evolution of
this approach.
Bench: 4746616
Conceptually group hash clusters into super clusters of 256 clusters.
This scheme allows us to use hash sizes up to 32 TB
(= 2^32 super clusters = 2^40 clusters).
Use 48 bits of the Zobrist key to choose the cluster index. We use 8
extra bits to mitigate the quantization error for very large hashes when
scaling the hash key to cluster index.
The hash index computation is organized to be compatible with the existing
scheme for power-of-two hash sizes up to 128 GB.
Fixes https://github.com/official-stockfish/Stockfish/issues/1349
closes https://github.com/official-stockfish/Stockfish/pull/2722
Passed non-regression STC:
LLR: 2.93 (-2.94,2.94) {-1.50,0.50}
Total: 37976 W: 7336 L: 7211 D: 23429
Ptnml(0-2): 578, 4295, 9149, 4356, 610
https://tests.stockfishchess.org/tests/view/5edcbaaef29b40b0fc95abc5
No functional change.
for users that set the needed privilige "Lock Pages in Memory"
large pages will be automatically enabled (see Readme.md).
This expert setting might improve speed, 5% - 30%, depending
on the hardware, the number of threads and hash size. More for
large hashes, large number of threads and NUMA. If the operating
system can not allocate large pages (easier after a reboot), default
allocation is used automatically. The engine log provides details.
closes https://github.com/official-stockfish/Stockfish/pull/2656
fixes https://github.com/official-stockfish/Stockfish/issues/2619
No functional change
Align the TT allocation by 2M to make it huge page friendly and advise the
kernel to use huge pages.
Benchmarks on my i7-8700K (6C/12T) box: (3 runs per bench per config)
vanilla (nps) hugepages (nps) avg
==================================================================================
bench | 3012490 3024364 3036331 3071052 3067544 3071052 +1.5%
bench 16 12 20 | 19237932 19050166 19085315 19266346 19207025 19548758 +1.1%
bench 16384 12 20 | 18182313 18371581 18336838 19381275 19738012 19620225 +7.0%
On my box, huge pages have a significant perf impact when using a big
hash size. They also speed up TT initialization big time:
vanilla (s) huge pages (s) speed-up
=======================================================================
time stockfish bench 16384 1 1 | 5.37 1.48 3.6x
In practice, huge pages with auto-defrag may always be enabled in the
system, in which case this patch has no effect. This
depends on the values in /sys/kernel/mm/transparent_hugepage/enabled
and /sys/kernel/mm/transparent_hugepage/defrag.
closes https://github.com/official-stockfish/Stockfish/pull/2463
No functional change
Simplification that eliminates ONE_PLY, based on a suggestion in the forum that
support for fractional plies has never been used, and @mcostalba's openness to
the idea of eliminating it. We lose a little bit of type safety by making Depth
an integer, but in return we simplify the code in search.cpp quite significantly.
No functional change
------------------------------------------
The argument favoring eliminating ONE_PLY:
* The term “ONE_PLY” comes up in a lot of forum posts (474 to date)
https://groups.google.com/forum/?fromgroups=#!searchin/fishcooking/ONE_PLY%7Csort:relevance
* There is occasionally a commit that breaks invariance of the code
with respect to ONE_PLY
https://groups.google.com/forum/?fromgroups=#!searchin/fishcooking/ONE_PLY%7Csort:date/fishcooking/ZIPdYj6k0fk/KdNGcPWeBgAJ
* To prevent such commits, there is a Travis CI hack that doubles ONE_PLY
and rechecks bench
* Sustaining ONE_PLY has, alas, not resulted in any improvements to the
engine, despite many individuals testing many experiments over 5 years.
The strongest argument in favor of preserving ONE_PLY comes from @locutus:
“If we use par example ONE_PLY=256 the parameter space is increases by the
factor 256. So it seems very unlikely that the optimal setting is in the
subspace of ONE_PLY=1.”
There is a strong theoretical impediment to fractional depth systems: the
transposition table uses depth to determine when a stored result is good
enough to supply an answer for a current search. If you have fractional
depths, then different pathways to the position can be at fractionally
different depths.
In the end, there are three separate times when a proposal to remove ONE_PLY
was defeated by the suggestion to “give it a few more months.” So… it seems
like time to remove this distraction from the community.
See the pull request here:
https://github.com/official-stockfish/Stockfish/pull/2289
High rootDepths, selDepths and generally searches are increasingly
common with long time control games, analysis, and improving hardware.
In this case, depths of MAX_DEPTH/MAX_PLY (128) can be reached,
and the search tree is truncated.
In principle MAX_PLY can be easily increased, except for a technicality
of storing depths in a signed 8 bit int in the TT. This patch increases
MAX_PLY by storing the depth in an unsigned 8 bit, after shifting by the
most negative depth stored in TT (DEPTH_NONE).
No regression at STC:
LLR: 2.96 (-2.94,2.94) [-3.00,1.00]
Total: 42235 W: 9565 L: 9484 D: 23186
http://tests.stockfishchess.org/tests/view/5cdb35360ebc5925cf0595e1
Verified to reach high depths on
k1b5/1p1p4/pP1Pp3/K2pPp2/1P1p1P2/3P1P2/5P2/8 w - -
info depth 142 seldepth 154 multipv 1 score cp 537 nodes 26740713110 ...
No bench change.
Introducing new concept, saving principal lines into the transposition table
to generate a "critical search tree" which we can reuse later for intelligent
pruning/extension decisions.
For instance in this patch we just reduce reduction for these lines. But a lot
of other ideas are possible.
To go further : tune some parameters, how to add or remove lines from the
critical search tree, how to use these lines in search choices, etc.
STC :
LLR: 2.94 (-2.94,2.94) [0.50,4.50]
Total: 59761 W: 13321 L: 12863 D: 33577 +2.23 ELO
http://tests.stockfishchess.org/tests/view/5c34da5d0ebc596a450c53d3
LTC :
LLR: 2.96 (-2.94,2.94) [0.00,3.50]
Total: 26826 W: 4439 L: 4191 D: 18196 +2.9 ELO
http://tests.stockfishchess.org/tests/view/5c35ceb00ebc596a450c65b2
Special thanks to Miguel Lahoz for his help in transposition table in/out.
Bench: 3399866
* Use a bit less code to calculate hashfull(). Change post increment to preincrement as is more standard
in the rest of stockfish. Scale result so we report 100% instead of 99.9% when completely full.
No functional change.