As stockfish nets and search evolve, the existing time control appears
to give too little time at STC, roughly correct at LTC, and too little
at VLTC+.
This change adds an adjustment to the optExtra calculation. This
adjustment is easy to retune and refine, so it should be easier to keep
up-to-date than the more complex calculations used for optConstant and
optScale.
Passed STC 10+0.1:
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 169568 W: 43803 L: 43295 D: 82470
Ptnml(0-2): 485, 19679, 44055, 19973, 592
https://tests.stockfishchess.org/tests/view/66531865a86388d5e27da9fa
Yellow LTC 60+0.6:
LLR: -2.94 (-2.94,2.94) <0.50,2.50>
Total: 209970 W: 53087 L: 52914 D: 103969
Ptnml(0-2): 91, 19652, 65314, 19849, 79
https://tests.stockfishchess.org/tests/view/6653e38ba86388d5e27daaa0
Passed VLTC 180+1.8 :
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 85618 W: 21735 L: 21342 D: 42541
Ptnml(0-2): 15, 8267, 25848, 8668, 11
https://tests.stockfishchess.org/tests/view/6655131da86388d5e27db95f
closes https://github.com/official-stockfish/Stockfish/pull/5297
Bench: 1212167
Allow for NUMA memory replication for NNUE weights. Bind threads to ensure execution on a specific NUMA node.
This patch introduces NUMA memory replication, currently only utilized for the NNUE weights. Along with it comes all machinery required to identify NUMA nodes and bind threads to specific processors/nodes. It also comes with small changes to Thread and ThreadPool to allow easier execution of custom functions on the designated thread. Old thread binding (WinProcGroup) machinery is removed because it's incompatible with this patch. Small changes to unrelated parts of the code were made to ensure correctness, like some classes being made unmovable, raw pointers replaced with unique_ptr. etc.
Windows 7 and Windows 10 is partially supported. Windows 11 is fully supported. Linux is fully supported, with explicit exclusion of Android. No additional dependencies.
-----------------
A new UCI option `NumaPolicy` is introduced. It can take the following values:
```
system - gathers NUMA node information from the system (lscpu or windows api), for each threads binds it to a single NUMA node
none - assumes there is 1 NUMA node, never binds threads
auto - this is the default value, depends on the number of set threads and NUMA nodes, will only enable binding on multinode systems and when the number of threads reaches a threshold (dependent on node size and count)
[[custom]] -
// ':'-separated numa nodes
// ','-separated cpu indices
// supports "first-last" range syntax for cpu indices,
for example '0-15,32-47:16-31,48-63'
```
Setting `NumaPolicy` forces recreation of the threads in the ThreadPool, which in turn forces the recreation of the TT.
The threads are distributed among NUMA nodes in a round-robin fashion based on fill percentage (i.e. it will strive to fill all NUMA nodes evenly). Threads are bound to NUMA nodes, not specific processors, because that's our only requirement and the OS can schedule them better.
Special care is made that maximum memory usage on systems that do not require memory replication stays as previously, that is, unnecessary copies are avoided.
On linux the process' processor affinity is respected. This means that if you for example use taskset to restrict Stockfish to a single NUMA node then the `system` and `auto` settings will only see a single NUMA node (more precisely, the processors included in the current affinity mask) and act accordingly.
-----------------
We can't ensure that a memory allocation takes place on a given NUMA node without using libnuma on linux, or using appropriate custom allocators on windows (https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node), so to avoid complications the current implementation relies on first-touch policy. Due to this we also rely on the memory allocator to give us a new chunk of untouched memory from the system. This appears to work reliably on linux, but results may vary.
MacOS is not supported, because AFAIK it's not affected, and implementation would be problematic anyway.
Windows is supported since Windows 7 (https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity). Until Windows 11/Server 2022 NUMA nodes are split such that they cannot span processor groups. This is because before Windows 11/Server 2022 it's not possible to set thread affinity spanning processor groups. The splitting is done manually in some cases (required after Windows 10 Build 20348). Since Windows 11/Server 2022 we can set affinites spanning processor group so this splitting is not done, so the behaviour is pretty much like on linux.
Linux is supported, **without** libnuma requirement. `lscpu` is expected.
-----------------
Passed 60+1 @ 256t 16000MB hash: https://tests.stockfishchess.org/tests/view/6654e443a86388d5e27db0d8
```
LLR: 2.95 (-2.94,2.94) <0.00,10.00>
Total: 278 W: 110 L: 29 D: 139
Ptnml(0-2): 0, 1, 56, 82, 0
```
Passed SMP STC: https://tests.stockfishchess.org/tests/view/6654fc74a86388d5e27db1cd
```
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 67152 W: 17354 L: 17177 D: 32621
Ptnml(0-2): 64, 7428, 18408, 7619, 57
```
Passed STC: https://tests.stockfishchess.org/tests/view/6654fb27a86388d5e27db15c
```
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 131648 W: 34155 L: 34045 D: 63448
Ptnml(0-2): 426, 13878, 37096, 14008, 416
```
fixes#5253
closes https://github.com/official-stockfish/Stockfish/pull/5285
No functional change
This speedup was first inspired by a comment by @AndyGrant on my recent
PR "If mullo_epi16 would preserve the signedness, then this could be
used to remove 50% of the max operations during the halfkp-pairwise
mat-mul relu deal."
That got me thinking, because although mullo_epi16 did not preserve the
signedness, mulhi_epi16 did, and so we could shift left and then use
mulhi_epi16, instead of shifting right after the mullo.
However, due to some issues with shifting into the sign bit, the FT
weights and biases had to be multiplied by 2 for the optimisation to
work.
Speedup on "Arch=x86-64-bmi2 COMP=clang", courtesy of @Torom
Result of 50 runs
base (...es/stockfish) = 962946 +/- 1202
test (...ise-max-less) = 979696 +/- 1084
diff = +16750 +/- 1794
speedup = +0.0174
P(speedup > 0) = 1.0000
CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on
Also a speedup on "COMP=gcc", courtesy of Torom once again
Result of 50 runs
base (...tockfish_gcc) = 966033 +/- 1574
test (...max-less_gcc) = 983319 +/- 1513
diff = +17286 +/- 2515
speedup = +0.0179
P(speedup > 0) = 1.0000
CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on
Passed STC:
LLR: 2.96 (-2.94,2.94) <0.00,2.00>
Total: 67712 W: 17715 L: 17358 D: 32639
Ptnml(0-2): 225, 7472, 18140, 7759, 260
https://tests.stockfishchess.org/tests/view/664c1d75830eb9f886616906
closes https://github.com/official-stockfish/Stockfish/pull/5282
No functional change
Also "fix" movepicker to allow depths between CHECKS and NO_CHECKS,
which makes them easier to tweak (not that they get tweaked hardly ever)
(This was more beneficial when there was a third stage to DEPTH_QS, but
it's still an improvement now)
closes https://github.com/official-stockfish/Stockfish/pull/5205
No functional change
Removes some max calls
Some speedup stats, courtesy of @AndyGrant (albeit measured in an alternate implementation)
Dev 749240 nps
Base 748495 nps
Gain 0.100%
289936 games
STC:
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 203040 W: 52213 L: 52179 D: 98648
Ptnml(0-2): 480, 20722, 59139, 20642, 537
https://tests.stockfishchess.org/tests/view/664805fe6dcff0d1d6b05f2ccloses#5261
No functional change
This reverts commit 4edd1a3
The unusual result of (combined) +12.0 +- 3.7 in the 2 VVLTC simplification SPRTs ran was the result of base having only 64MB of hash instead of 512MB (Asymmetric hash).
Vizvezdenec was the one to notice this.
closes https://github.com/official-stockfish/Stockfish/pull/5265
bench 1404295
Co-Authored-By: Michael Chaly <26898827+Vizvezdenec@users.noreply.github.com>
Created by first retraining the spsa-tuned main net `nn-ae6a388e4a1a.nnue` with:
- using v6-dd data without bestmove captures removed
- addition of T80 mar2024 data
- increasing loss by 20% when Q is too high
- torch.compile changes for marginal training speed gains
And then SPSA tuning weights of epoch 899 following methods described in:
https://github.com/official-stockfish/Stockfish/pull/5149
This net was reached at 92k out of 120k steps in this 70+0.7 th 7 SPSA tuning run:
https://tests.stockfishchess.org/tests/view/66413b7df9f4e8fc783c9bbb
Thanks to @Viren6 for suggesting usage of:
- c value 4 for the weights
- c value 128 for the biases
Scripts for automating applying fishtest spsa params to exporting tuned .nnue are in:
https://github.com/linrock/nnue-tools/tree/master/spsa
Before spsa tuning, epoch 899 was nn-f85738aefa84.nnue
https://tests.stockfishchess.org/tests/view/663e5c893a2f9702074bc167
After initially training with max-epoch 800, training was resumed with max-epoch 1000.
```
experiment-name: 3072--S11--more-data-v6-dd-t80-mar2024--see-ge0-20p-more-loss-high-q-sk28-l8
nnue-pytorch-branch: linrock/nnue-pytorch/3072-r21-skip-more-wdl-see-ge0-20p-more-loss-high-q-torch-compile-more
start-from-engine-test-net: False
start-from-model: /data/config/apr2024-3072/nn-ae6a388e4a1a.nnue
early-fen-skipping: 28
training-dataset:
/data/S11-mar2024/:
- leela96.v2.min.binpack
- test60-2021-11-12-novdec-12tb7p.v6-dd.min.binpack
- test78-2022-01-to-05-jantomay-16tb7p.v6-dd.min.binpack
- test80-2022-06-jun-16tb7p.v6-dd.min.binpack
- test80-2022-08-aug-16tb7p.v6-dd.min.binpack
- test80-2022-09-sep-16tb7p.v6-dd.min.binpack
- test80-2023-01-jan-16tb7p.v6-sk20.min.binpack
- test80-2023-02-feb-16tb7p.v6-sk20.min.binpack
- test80-2023-03-mar-2tb7p.v6-sk16.min.binpack
- test80-2023-04-apr-2tb7p.v6-sk16.min.binpack
- test80-2023-05-may-2tb7p.v6.min.binpack
# https://github.com/official-stockfish/Stockfish/pull/4782
- test80-2023-06-jun-2tb7p.binpack
- test80-2023-07-jul-2tb7p.binpack
# https://github.com/official-stockfish/Stockfish/pull/4972
- test80-2023-08-aug-2tb7p.v6.min.binpack
- test80-2023-09-sep-2tb7p.binpack
- test80-2023-10-oct-2tb7p.binpack
# S9 new data: https://github.com/official-stockfish/Stockfish/pull/5056
- test80-2023-11-nov-2tb7p.binpack
- test80-2023-12-dec-2tb7p.binpack
# S10 new data: https://github.com/official-stockfish/Stockfish/pull/5149
- test80-2024-01-jan-2tb7p.binpack
- test80-2024-02-feb-2tb7p.binpack
# S11 new data
- test80-2024-03-mar-2tb7p.binpack
/data/filt-v6-dd/:
- test77-dec2021-16tb7p-filter-v6-dd.binpack
- test78-juntosep2022-16tb7p-filter-v6-dd.binpack
- test79-apr2022-16tb7p-filter-v6-dd.binpack
- test79-may2022-16tb7p-filter-v6-dd.binpack
- test80-jul2022-16tb7p-filter-v6-dd.binpack
- test80-oct2022-16tb7p-filter-v6-dd.binpack
- test80-nov2022-16tb7p-filter-v6-dd.binpack
num-epochs: 1000
lr: 4.375e-4
gamma: 0.995
start-lambda: 0.8
end-lambda: 0.7
```
Training data can be found at:
https://robotmoon.com/nnue-training-data/
Local elo at 25k nodes per move:
nn-epoch899.nnue : 4.6 +/- 1.4
Passed STC:
https://tests.stockfishchess.org/tests/view/6645454893ce6da3e93b31ae
LLR: 2.95 (-2.94,2.94) <0.00,2.00>
Total: 95232 W: 24598 L: 24194 D: 46440
Ptnml(0-2): 294, 11215, 24180, 11647, 280
Passed LTC:
https://tests.stockfishchess.org/tests/view/6645522d93ce6da3e93b31df
LLR: 2.95 (-2.94,2.94) <0.50,2.50>
Total: 320544 W: 81432 L: 80524 D: 158588
Ptnml(0-2): 164, 35659, 87696, 36611, 142
closes https://github.com/official-stockfish/Stockfish/pull/5254
bench 1995552