- remove the blank line between the declaration of the function and it's
comment, leads to better IDE support when hovering over a function to see it's
description
- remove the unnecessary duplication of the function name in the functions
description
- slightly refactored code for lsb, msb in bitboard.h There are still a few
things we can be improved later on, move the description of a function where
it was declared (instead of implemented) and add descriptions to functions
which are behind macros ifdefs
closes https://github.com/official-stockfish/Stockfish/pull/4840
No functional change
This introduces clang-format to enforce a consistent code style for Stockfish.
Having a documented and consistent style across the code will make contributing easier
for new developers, and will make larger changes to the codebase easier to make.
To facilitate formatting, this PR includes a Makefile target (`make format`) to format the code,
this requires clang-format (version 17 currently) to be installed locally.
Installing clang-format is straightforward on most OS and distros
(e.g. with https://apt.llvm.org/, brew install clang-format, etc), as this is part of quite commonly
used suite of tools and compilers (llvm / clang).
Additionally, a CI action is present that will verify if the code requires formatting,
and comment on the PR as needed. Initially, correct formatting is not required, it will be
done by maintainers as part of the merge or in later commits, but obviously this is encouraged.
fixes https://github.com/official-stockfish/Stockfish/issues/3608
closes https://github.com/official-stockfish/Stockfish/pull/4790
Co-Authored-By: Joost VandeVondele <Joost.VandeVondele@gmail.com>
in the case of avx512 and vnni512 archs.
Up to 17% speedup, depending on the compiler, e.g.
```
AMD pro 7840u (zen4 phoenix apu 4nm)
bash bench_parallel.sh ./stockfish_avx512_gcc13 ./stockfish_avx512_pr_gcc13 20 10
sf_base = 1077737 +/- 8446 (95%)
sf_test = 1264268 +/- 8543 (95%)
diff = 186531 +/- 4280 (95%)
speedup = 17.308% +/- 0.397% (95%)
```
Prior to this patch, it appears gcc spills registers.
closes https://github.com/official-stockfish/Stockfish/pull/4796
No functional change
deal with the general case
About a 8.6% speedup (for general arch)
Results for 200 tests for each version:
Base Test Diff
Mean 141741 153998 -12257
StDev 2990 3042 3742
p-value: 0.999
speedup: 0.086
closes https://github.com/official-stockfish/Stockfish/pull/4786
No functional change
Add more static checks regarding the SIMD width match.
STC: https://tests.stockfishchess.org/tests/view/64f5c568a9bc5a78c669e70e
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 125216 W: 31756 L: 31636 D: 61824
Ptnml(0-2): 327, 13993, 33848, 14113, 327
Fixes a bug introduced in 2f2f45f, where with AVX-512 the weights and input to
the last layer were being read out of bounds. Now AVX-512 is only used for the
layers it can be used for. Additional static assertions have been added to
prevent more errors like this in the future.
closes https://github.com/official-stockfish/Stockfish/pull/4773
No functional change
Squared numbers are never negative, so barring any wraparound there
is no need to clamp to 0. From reading the code, there's no obvious
way to get wraparound, so the entire operation can be simplified
away. Updated original truncated code comments to be sensible.
Verified by running ./stockfish bench 128 1 24 and by the following test:
STC: https://tests.stockfishchess.org/tests/view/64da4db95b17f7c21c0eabe7
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 60224 W: 15425 L: 15236 D: 29563
Ptnml(0-2): 195, 6576, 16382, 6763, 196
closes https://github.com/official-stockfish/Stockfish/pull/4751
No functional change
Current master fails to compile for ARMv8 on Raspi cause gcc (version 10.2.1)
does not like to cast between signed and unsigned vector types. This patch
fixes it by using unsigned vector pointer for ARM to avoid implicite cast.
closes https://github.com/official-stockfish/Stockfish/pull/4752
No functional change
a) Add further tests to CI to cover most features. This uncovered a potential race
in case setoption was sent between two searches. As the UCI protocol requires
this sent to be went the engine is not searching, setoption now ensures that
this is the case.
b) Remove some unused code
closes https://github.com/official-stockfish/Stockfish/pull/4730
No functional change
Also make two get_weight_index() static methods constexpr, for
consistency with the other static get_hash_value() method right above.
Tested for speed by user Torom (thanks).
closes https://github.com/official-stockfish/Stockfish/pull/4708
No functional change
since the introduction of NNUE (first released with Stockfish 12), we
have maintained the classical evaluation as part of SF in frozen form.
The idea that this code could lead to further inputs to the NN or
search did not materialize. Now, after five releases, this PR removes
the classical evaluation from SF. Even though this evaluation is
probably the best of its class, it has become unimportant for the
engine's strength, and there is little need to maintain this
code (roughly 25% of SF) going forward, or to expend resources on
trying to improve its integration in the NNUE eval.
Indeed, it had still a very limited use in the current SF, namely
for the evaluation of positions that are nearly decided based on
material difference, where the speed of the classical evaluation
outweights its inaccuracies. This impact on strength is small,
roughly 2Elo, and probably decreasing in importance as the TC grows.
Potentially, removal of this code could lead to the development of
techniques to have faster, but less accurate NN evaluation,
for certain positions.
STC
https://tests.stockfishchess.org/tests/view/64a320173ee09aa549c52157
Elo: -2.35 ± 1.1 (95%) LOS: 0.0%
Total: 100000 W: 24916 L: 25592 D: 49492
Ptnml(0-2): 287, 12123, 25841, 11477, 272
nElo: -4.62 ± 2.2 (95%) PairsRatio: 0.95
LTC
https://tests.stockfishchess.org/tests/view/64a320293ee09aa549c5215b
Elo: -1.74 ± 1.0 (95%) LOS: 0.0%
Total: 100000 W: 25010 L: 25512 D: 49478
Ptnml(0-2): 44, 11069, 28270, 10579, 38
nElo: -3.72 ± 2.2 (95%) PairsRatio: 0.96
VLTC SMP
https://tests.stockfishchess.org/tests/view/64a3207c3ee09aa549c52168
Elo: -1.70 ± 0.9 (95%) LOS: 0.0%
Total: 100000 W: 25673 L: 26162 D: 48165
Ptnml(0-2): 8, 9455, 31569, 8954, 14
nElo: -3.95 ± 2.2 (95%) PairsRatio: 0.95
closes https://github.com/official-stockfish/Stockfish/pull/4674
Bench: 1444646
Implemented LEB128 (de)compression for the feature transformer.
Reduces embedded network size from 70 MiB to 39 Mib.
The new nn-78bacfcee510.nnue corresponds to the master net compressed.
closes https://github.com/official-stockfish/Stockfish/pull/4617
No functional change
Use block sparse input for the first fully connected layer on architectures with at least SSSE3.
Depending on the CPU architecture, this yields a speedup of up to 10%, e.g.
```
Result of 100 runs of 'bench 16 1 13 default depth NNUE'
base (...ockfish-base) = 959345 +/- 7477
test (...ckfish-patch) = 1054340 +/- 9640
diff = +94995 +/- 3999
speedup = +0.0990
P(speedup > 0) = 1.0000
CPU: 8 x AMD Ryzen 7 5700U with Radeon Graphics
Hyperthreading: on
```
Passed STC:
https://tests.stockfishchess.org/tests/view/6485aa0965ffe077ca12409c
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 8864 W: 2479 L: 2223 D: 4162
Ptnml(0-2): 13, 829, 2504, 1061, 25
This commit includes a net with reordered weights, to increase the likelihood of block sparse inputs,
but otherwise equivalent to the previous master net (nn-ea57bea57e32.nnue).
Activation data collected with https://github.com/AndrovT/Stockfish/tree/log-activations, running bench 16 1 13 varied_1000.epd depth NNUE on this data. Net parameters permuted with https://gist.github.com/AndrovT/9e3fbaebb7082734dc84d27e02094cb3.
closes https://github.com/official-stockfish/Stockfish/pull/4612
No functional change
Created by training a new net from scratch with L1 size increased from 1024 to 1536.
Thanks to Vizvezdenec for the idea of exploring larger net sizes after recent
training data improvements.
A new net was first trained with lambda 1.0 and constant LR 8.75e-4. Then a strong net
from a later epoch in the training run was chosen for retraining with start-lambda 1.0
and initial LR 4.375e-4 decaying with gamma 0.995. Retraining was performed a total of
3 times, for this 4-step process:
1. 400 epochs, lambda 1.0 on filtered T77+T79 v6 deduplicated data
2. 800 epochs, end-lambda 0.75 on T60T70wIsRightFarseerT60T74T75T76.binpack
3. 800 epochs, end-lambda 0.75 and early-fen-skipping 28 on the master dataset
4. 800 epochs, end-lambda 0.7 and early-fen-skipping 28 on the master dataset
In the training sequence that reached the new nn-8d69132723e2.nnue net,
the epochs used for the 3x retraining runs were:
1. epoch 379 trained on T77T79-filter-v6-dd.min.binpack
2. epoch 679 trained on T60T70wIsRightFarseerT60T74T75T76.binpack
3. epoch 799 trained on the master dataset
For training from scratch:
python3 easy_train.py \
--experiment-name new-L1-1536-T77T79-filter-v6dd \
--training-dataset /data/T77T79-filter-v6-dd.min.binpack \
--max_epoch 400 \
--lambda 1.0 \
--start-from-engine-test-net False \
--engine-test-branch linrock/Stockfish/L1-1536 \
--nnue-pytorch-branch linrock/Stockfish/misc-fixes-L1-1536 \
--tui False \
--gpus "0," \
--seed $RANDOM
Retraining commands were similar to each other. For the 3rd retraining run:
python3 easy_train.py \
--experiment-name L1-1536-T77T79-v6dd-Re1-LeelaFarseer-Re2-masterDataset-Re3-sameData \
--training-dataset /data/leela96-dfrc99-v2-T60novdecT80juntonovjanfebT79aprmayT78jantosepT77dec-v6dd.binpack \
--early-fen-skipping 28 \
--max_epoch 800 \
--start-lambda 1.0 \
--end-lambda 0.7 \
--lr 4.375e-4 \
--gamma 0.995 \
--start-from-engine-test-net False \
--start-from-model /data/L1-1536-T77T79-v6dd-Re1-LeelaFarseer-Re2-masterDataset-nn-epoch799.nnue \
--engine-test-branch linrock/Stockfish/L1-1536 \
--nnue-pytorch-branch linrock/nnue-pytorch/misc-fixes-L1-1536 \
--tui False \
--gpus "0," \
--seed $RANDOM
The T77+T79 data used is a subset of the master dataset available at:
https://robotmoon.com/nnue-training-data/
T60T70wIsRightFarseerT60T74T75T76.binpack is available at:
https://drive.google.com/drive/folders/1S9-ZiQa_3ApmjBtl2e8SyHxj4zG4V8gG
Local elo at 25k nodes per move vs. nn-e1fb1ade4432.nnue (L1 size 1024):
nn-epoch759.nnue : 26.9 +/- 1.6
Failed STC
https://tests.stockfishchess.org/tests/view/64742485d29264e4cfa75f97
LLR: -2.94 (-2.94,2.94) <0.00,2.00>
Total: 13728 W: 3588 L: 3829 D: 6311
Ptnml(0-2): 71, 1661, 3610, 1482, 40
Failing LTC
https://tests.stockfishchess.org/tests/view/64752d7c4a36543c4c9f3618
LLR: -1.91 (-2.94,2.94) <0.50,2.50>
Total: 35424 W: 9522 L: 9603 D: 16299
Ptnml(0-2): 24, 3579, 10585, 3502, 22
Passed VLTC 180+1.8
https://tests.stockfishchess.org/tests/view/64752df04a36543c4c9f3638
LLR: 2.95 (-2.94,2.94) <0.50,2.50>
Total: 47616 W: 13174 L: 12863 D: 21579
Ptnml(0-2): 13, 4261, 14952, 4566, 16
Passed VLTC SMP 60+0.6 th 8
https://tests.stockfishchess.org/tests/view/647446ced29264e4cfa761e5
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 19942 W: 5694 L: 5451 D: 8797
Ptnml(0-2): 6, 1504, 6707, 1749, 5
closes https://github.com/official-stockfish/Stockfish/pull/4593
bench 2222567
This patch introduces `hint_common_parent_position()` to signal that potentially several child nodes will require an NNUE eval. By populating explicitly the accumulator, these subsequent evaluations can be performed more efficiently.
This was based on the observation that calculating the evaluation in an excluded move position yielded a significant Elo gain, even though the evaluation itself was already available (work by pb00067).
Sopel wrote the code to perform just the accumulator update. This PR is based on cleaned up code that
passed STC:
https://tests.stockfishchess.org/tests/view/63f62f9be74a12625bcd4aa0
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 110368 W: 29607 L: 29167 D: 51594
Ptnml(0-2): 41, 10551, 33572, 10967, 53
and in an the earlier (equivalent) version
passed STC:
https://tests.stockfishchess.org/tests/view/63f3c3fee74a12625bcce2a6
LLR: 2.95 (-2.94,2.94) <0.00,2.00>
Total: 47552 W: 12786 L: 12467 D: 22299
Ptnml(0-2): 120, 5107, 12997, 5438, 114
passed LTC:
https://tests.stockfishchess.org/tests/view/63f45cc2e74a12625bccfa63
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 110368 W: 29607 L: 29167 D: 51594
Ptnml(0-2): 41, 10551, 33572, 10967, 53
closes https://github.com/official-stockfish/Stockfish/pull/4402
Bench: 3726250
The sdot instruction computes (and accumulates) a signed dot product,
which is quite handy for Stockfish's NNUE code. The instruction is
optional for Armv8.2 and Armv8.3, and mandatory for Armv8.4 and above.
The commit adds a new 'arm-dotprod' architecture with enabled dot
product support. It also enables dot product support for the existing
'apple-silicon' architecture, which is at least Armv8.5.
The following local speed test was performed on an Apple M1 with
ARCH=apple-silicon. I had to remove CPU pinning from the benchmark
script. However, the results were still consistent: Checking both
binaries against themselves reported a speedup of +0.0000 and +0.0005,
respectively.
```
Result of 100 runs
==================
base (...ish.037ef3e1) = 1917997 +/- 7152
test (...fish.dotprod) = 2159682 +/- 9066
diff = +241684 +/- 2923
speedup = +0.1260
P(speedup > 0) = 1.0000
CPU: 10 x arm
Hyperthreading: off
```
Fixes#4193
closes https://github.com/official-stockfish/Stockfish/pull/4400
No functional change
The accumulator should be an earlyclobber because it is written before
all input operands are read. Otherwise, the asm code computes a wrong
result if the accumulator shares a register with one of the other input
operands (which happens if we pass in the same expression for the
accumulator and the operand).
Closes https://github.com/official-stockfish/Stockfish/pull/4339
No functional change
Removed sprintf() which generated a warning, because of security reasons.
Replace NULL with nullptr
Replace typedef with using
Do not inherit from std::vector. Use composition instead.
optimize mutex-unlocking
closes https://github.com/official-stockfish/Stockfish/pull/4327
No functional change