in the case of avx512 and vnni512 archs.
Up to 17% speedup, depending on the compiler, e.g.
```
AMD pro 7840u (zen4 phoenix apu 4nm)
bash bench_parallel.sh ./stockfish_avx512_gcc13 ./stockfish_avx512_pr_gcc13 20 10
sf_base = 1077737 +/- 8446 (95%)
sf_test = 1264268 +/- 8543 (95%)
diff = 186531 +/- 4280 (95%)
speedup = 17.308% +/- 0.397% (95%)
```
Prior to this patch, it appears gcc spills registers.
closes https://github.com/official-stockfish/Stockfish/pull/4796
No functional change
deal with the general case
About a 8.6% speedup (for general arch)
Results for 200 tests for each version:
Base Test Diff
Mean 141741 153998 -12257
StDev 2990 3042 3742
p-value: 0.999
speedup: 0.086
closes https://github.com/official-stockfish/Stockfish/pull/4786
No functional change
Add more static checks regarding the SIMD width match.
STC: https://tests.stockfishchess.org/tests/view/64f5c568a9bc5a78c669e70e
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 125216 W: 31756 L: 31636 D: 61824
Ptnml(0-2): 327, 13993, 33848, 14113, 327
Fixes a bug introduced in 2f2f45f, where with AVX-512 the weights and input to
the last layer were being read out of bounds. Now AVX-512 is only used for the
layers it can be used for. Additional static assertions have been added to
prevent more errors like this in the future.
closes https://github.com/official-stockfish/Stockfish/pull/4773
No functional change
Squared numbers are never negative, so barring any wraparound there
is no need to clamp to 0. From reading the code, there's no obvious
way to get wraparound, so the entire operation can be simplified
away. Updated original truncated code comments to be sensible.
Verified by running ./stockfish bench 128 1 24 and by the following test:
STC: https://tests.stockfishchess.org/tests/view/64da4db95b17f7c21c0eabe7
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 60224 W: 15425 L: 15236 D: 29563
Ptnml(0-2): 195, 6576, 16382, 6763, 196
closes https://github.com/official-stockfish/Stockfish/pull/4751
No functional change
Current master fails to compile for ARMv8 on Raspi cause gcc (version 10.2.1)
does not like to cast between signed and unsigned vector types. This patch
fixes it by using unsigned vector pointer for ARM to avoid implicite cast.
closes https://github.com/official-stockfish/Stockfish/pull/4752
No functional change
a) Add further tests to CI to cover most features. This uncovered a potential race
in case setoption was sent between two searches. As the UCI protocol requires
this sent to be went the engine is not searching, setoption now ensures that
this is the case.
b) Remove some unused code
closes https://github.com/official-stockfish/Stockfish/pull/4730
No functional change
Also make two get_weight_index() static methods constexpr, for
consistency with the other static get_hash_value() method right above.
Tested for speed by user Torom (thanks).
closes https://github.com/official-stockfish/Stockfish/pull/4708
No functional change
since the introduction of NNUE (first released with Stockfish 12), we
have maintained the classical evaluation as part of SF in frozen form.
The idea that this code could lead to further inputs to the NN or
search did not materialize. Now, after five releases, this PR removes
the classical evaluation from SF. Even though this evaluation is
probably the best of its class, it has become unimportant for the
engine's strength, and there is little need to maintain this
code (roughly 25% of SF) going forward, or to expend resources on
trying to improve its integration in the NNUE eval.
Indeed, it had still a very limited use in the current SF, namely
for the evaluation of positions that are nearly decided based on
material difference, where the speed of the classical evaluation
outweights its inaccuracies. This impact on strength is small,
roughly 2Elo, and probably decreasing in importance as the TC grows.
Potentially, removal of this code could lead to the development of
techniques to have faster, but less accurate NN evaluation,
for certain positions.
STC
https://tests.stockfishchess.org/tests/view/64a320173ee09aa549c52157
Elo: -2.35 ± 1.1 (95%) LOS: 0.0%
Total: 100000 W: 24916 L: 25592 D: 49492
Ptnml(0-2): 287, 12123, 25841, 11477, 272
nElo: -4.62 ± 2.2 (95%) PairsRatio: 0.95
LTC
https://tests.stockfishchess.org/tests/view/64a320293ee09aa549c5215b
Elo: -1.74 ± 1.0 (95%) LOS: 0.0%
Total: 100000 W: 25010 L: 25512 D: 49478
Ptnml(0-2): 44, 11069, 28270, 10579, 38
nElo: -3.72 ± 2.2 (95%) PairsRatio: 0.96
VLTC SMP
https://tests.stockfishchess.org/tests/view/64a3207c3ee09aa549c52168
Elo: -1.70 ± 0.9 (95%) LOS: 0.0%
Total: 100000 W: 25673 L: 26162 D: 48165
Ptnml(0-2): 8, 9455, 31569, 8954, 14
nElo: -3.95 ± 2.2 (95%) PairsRatio: 0.95
closes https://github.com/official-stockfish/Stockfish/pull/4674
Bench: 1444646
Implemented LEB128 (de)compression for the feature transformer.
Reduces embedded network size from 70 MiB to 39 Mib.
The new nn-78bacfcee510.nnue corresponds to the master net compressed.
closes https://github.com/official-stockfish/Stockfish/pull/4617
No functional change
Use block sparse input for the first fully connected layer on architectures with at least SSSE3.
Depending on the CPU architecture, this yields a speedup of up to 10%, e.g.
```
Result of 100 runs of 'bench 16 1 13 default depth NNUE'
base (...ockfish-base) = 959345 +/- 7477
test (...ckfish-patch) = 1054340 +/- 9640
diff = +94995 +/- 3999
speedup = +0.0990
P(speedup > 0) = 1.0000
CPU: 8 x AMD Ryzen 7 5700U with Radeon Graphics
Hyperthreading: on
```
Passed STC:
https://tests.stockfishchess.org/tests/view/6485aa0965ffe077ca12409c
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 8864 W: 2479 L: 2223 D: 4162
Ptnml(0-2): 13, 829, 2504, 1061, 25
This commit includes a net with reordered weights, to increase the likelihood of block sparse inputs,
but otherwise equivalent to the previous master net (nn-ea57bea57e32.nnue).
Activation data collected with https://github.com/AndrovT/Stockfish/tree/log-activations, running bench 16 1 13 varied_1000.epd depth NNUE on this data. Net parameters permuted with https://gist.github.com/AndrovT/9e3fbaebb7082734dc84d27e02094cb3.
closes https://github.com/official-stockfish/Stockfish/pull/4612
No functional change
Created by training a new net from scratch with L1 size increased from 1024 to 1536.
Thanks to Vizvezdenec for the idea of exploring larger net sizes after recent
training data improvements.
A new net was first trained with lambda 1.0 and constant LR 8.75e-4. Then a strong net
from a later epoch in the training run was chosen for retraining with start-lambda 1.0
and initial LR 4.375e-4 decaying with gamma 0.995. Retraining was performed a total of
3 times, for this 4-step process:
1. 400 epochs, lambda 1.0 on filtered T77+T79 v6 deduplicated data
2. 800 epochs, end-lambda 0.75 on T60T70wIsRightFarseerT60T74T75T76.binpack
3. 800 epochs, end-lambda 0.75 and early-fen-skipping 28 on the master dataset
4. 800 epochs, end-lambda 0.7 and early-fen-skipping 28 on the master dataset
In the training sequence that reached the new nn-8d69132723e2.nnue net,
the epochs used for the 3x retraining runs were:
1. epoch 379 trained on T77T79-filter-v6-dd.min.binpack
2. epoch 679 trained on T60T70wIsRightFarseerT60T74T75T76.binpack
3. epoch 799 trained on the master dataset
For training from scratch:
python3 easy_train.py \
--experiment-name new-L1-1536-T77T79-filter-v6dd \
--training-dataset /data/T77T79-filter-v6-dd.min.binpack \
--max_epoch 400 \
--lambda 1.0 \
--start-from-engine-test-net False \
--engine-test-branch linrock/Stockfish/L1-1536 \
--nnue-pytorch-branch linrock/Stockfish/misc-fixes-L1-1536 \
--tui False \
--gpus "0," \
--seed $RANDOM
Retraining commands were similar to each other. For the 3rd retraining run:
python3 easy_train.py \
--experiment-name L1-1536-T77T79-v6dd-Re1-LeelaFarseer-Re2-masterDataset-Re3-sameData \
--training-dataset /data/leela96-dfrc99-v2-T60novdecT80juntonovjanfebT79aprmayT78jantosepT77dec-v6dd.binpack \
--early-fen-skipping 28 \
--max_epoch 800 \
--start-lambda 1.0 \
--end-lambda 0.7 \
--lr 4.375e-4 \
--gamma 0.995 \
--start-from-engine-test-net False \
--start-from-model /data/L1-1536-T77T79-v6dd-Re1-LeelaFarseer-Re2-masterDataset-nn-epoch799.nnue \
--engine-test-branch linrock/Stockfish/L1-1536 \
--nnue-pytorch-branch linrock/nnue-pytorch/misc-fixes-L1-1536 \
--tui False \
--gpus "0," \
--seed $RANDOM
The T77+T79 data used is a subset of the master dataset available at:
https://robotmoon.com/nnue-training-data/
T60T70wIsRightFarseerT60T74T75T76.binpack is available at:
https://drive.google.com/drive/folders/1S9-ZiQa_3ApmjBtl2e8SyHxj4zG4V8gG
Local elo at 25k nodes per move vs. nn-e1fb1ade4432.nnue (L1 size 1024):
nn-epoch759.nnue : 26.9 +/- 1.6
Failed STC
https://tests.stockfishchess.org/tests/view/64742485d29264e4cfa75f97
LLR: -2.94 (-2.94,2.94) <0.00,2.00>
Total: 13728 W: 3588 L: 3829 D: 6311
Ptnml(0-2): 71, 1661, 3610, 1482, 40
Failing LTC
https://tests.stockfishchess.org/tests/view/64752d7c4a36543c4c9f3618
LLR: -1.91 (-2.94,2.94) <0.50,2.50>
Total: 35424 W: 9522 L: 9603 D: 16299
Ptnml(0-2): 24, 3579, 10585, 3502, 22
Passed VLTC 180+1.8
https://tests.stockfishchess.org/tests/view/64752df04a36543c4c9f3638
LLR: 2.95 (-2.94,2.94) <0.50,2.50>
Total: 47616 W: 13174 L: 12863 D: 21579
Ptnml(0-2): 13, 4261, 14952, 4566, 16
Passed VLTC SMP 60+0.6 th 8
https://tests.stockfishchess.org/tests/view/647446ced29264e4cfa761e5
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 19942 W: 5694 L: 5451 D: 8797
Ptnml(0-2): 6, 1504, 6707, 1749, 5
closes https://github.com/official-stockfish/Stockfish/pull/4593
bench 2222567
This patch introduces `hint_common_parent_position()` to signal that potentially several child nodes will require an NNUE eval. By populating explicitly the accumulator, these subsequent evaluations can be performed more efficiently.
This was based on the observation that calculating the evaluation in an excluded move position yielded a significant Elo gain, even though the evaluation itself was already available (work by pb00067).
Sopel wrote the code to perform just the accumulator update. This PR is based on cleaned up code that
passed STC:
https://tests.stockfishchess.org/tests/view/63f62f9be74a12625bcd4aa0
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 110368 W: 29607 L: 29167 D: 51594
Ptnml(0-2): 41, 10551, 33572, 10967, 53
and in an the earlier (equivalent) version
passed STC:
https://tests.stockfishchess.org/tests/view/63f3c3fee74a12625bcce2a6
LLR: 2.95 (-2.94,2.94) <0.00,2.00>
Total: 47552 W: 12786 L: 12467 D: 22299
Ptnml(0-2): 120, 5107, 12997, 5438, 114
passed LTC:
https://tests.stockfishchess.org/tests/view/63f45cc2e74a12625bccfa63
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 110368 W: 29607 L: 29167 D: 51594
Ptnml(0-2): 41, 10551, 33572, 10967, 53
closes https://github.com/official-stockfish/Stockfish/pull/4402
Bench: 3726250
The sdot instruction computes (and accumulates) a signed dot product,
which is quite handy for Stockfish's NNUE code. The instruction is
optional for Armv8.2 and Armv8.3, and mandatory for Armv8.4 and above.
The commit adds a new 'arm-dotprod' architecture with enabled dot
product support. It also enables dot product support for the existing
'apple-silicon' architecture, which is at least Armv8.5.
The following local speed test was performed on an Apple M1 with
ARCH=apple-silicon. I had to remove CPU pinning from the benchmark
script. However, the results were still consistent: Checking both
binaries against themselves reported a speedup of +0.0000 and +0.0005,
respectively.
```
Result of 100 runs
==================
base (...ish.037ef3e1) = 1917997 +/- 7152
test (...fish.dotprod) = 2159682 +/- 9066
diff = +241684 +/- 2923
speedup = +0.1260
P(speedup > 0) = 1.0000
CPU: 10 x arm
Hyperthreading: off
```
Fixes#4193
closes https://github.com/official-stockfish/Stockfish/pull/4400
No functional change
The accumulator should be an earlyclobber because it is written before
all input operands are read. Otherwise, the asm code computes a wrong
result if the accumulator shares a register with one of the other input
operands (which happens if we pass in the same expression for the
accumulator and the operand).
Closes https://github.com/official-stockfish/Stockfish/pull/4339
No functional change
Removed sprintf() which generated a warning, because of security reasons.
Replace NULL with nullptr
Replace typedef with using
Do not inherit from std::vector. Use composition instead.
optimize mutex-unlocking
closes https://github.com/official-stockfish/Stockfish/pull/4327
No functional change
If a global function has no previous declaration, either the declaration
is missing in the corresponding header file or the function should be
declared static. Static functions are local to the translation unit,
which allows the compiler to apply some optimizations earlier (when
compiling the translation unit rather than during link-time
optimization).
The commit enables the warning for gcc, clang, and mingw. It also fixes
the reported warnings by declaring the functions static or by adding a
header file (benchmark.h).
closes https://github.com/official-stockfish/Stockfish/pull/4325
No functional change
Normalizes the internal value as reported by evaluate or search
to the UCI centipawn result used in output. This value is derived from
the win_rate_model() such that Stockfish outputs an advantage of
"100 centipawns" for a position if the engine has a 50% probability to win
from this position in selfplay at fishtest LTC time control.
The reason to introduce this normalization is that our evaluation is, since NNUE,
no longer related to the classical parameter PawnValueEg (=208). This leads to
the current evaluation changing quite a bit from release to release, for example,
the eval needed to have 50% win probability at fishtest LTC (in cp and internal Value):
June 2020 : 113cp (237)
June 2021 : 115cp (240)
April 2022 : 134cp (279)
July 2022 : 167cp (348)
With this patch, a 100cp advantage will have a fixed interpretation,
i.e. a 50% win chance. To keep this value steady, it will be needed to update the win_rate_model()
from time to time, based on fishtest data. This analysis can be performed with
a set of scripts currently available at https://github.com/vondele/WLD_model
fixes https://github.com/official-stockfish/Stockfish/issues/4155
closes https://github.com/official-stockfish/Stockfish/pull/4216
No functional change