If an incorrect network file is present at the start of the compilation stage, the
Makefile script now correctly removes it before trying to download a clean version.
closes https://github.com/official-stockfish/Stockfish/pull/4726
No functional change
Based on vondele's deletepsqt branch:
https://github.com/vondele/Stockfish/commit/369f5b051
This huge simplification uses weighted material differences instead of
the positional piece square tables (psqt) in the semi-classical complexity
calculation. Tuned weights using spsa at 45+0.45 with:
int pawnMult = 100;
int knightMult = 325;
int bishopMult = 350;
int rookMult = 500;
int queenMult = 900;
TUNE(SetRange(0, 200), pawnMult);
TUNE(SetRange(0, 650), knightMult);
TUNE(SetRange(0, 700), bishopMult);
TUNE(SetRange(200, 800), rookMult);
TUNE(SetRange(600, 1200), queenMult);
The values obtained via this tuning session were for a model where
the psqt replacement formula was always from the point of view of White,
even if the side to move was Black. We re-used the same values for an
implementation with a psqt replacement from the point of view of the side
to move, testing the result both on our standard book on positions with
a strong White bias, and an alternate book with positions with a strong
Black bias.
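As a rough illustration of the idea (not the code from the patch), a weighted
material difference seen from the side to move could be sketched as below.
The names PieceCounts and material_diff are hypothetical, the weights are the
starting values of the tuning session quoted above rather than the tuned
result, and how the value enters the complexity formula is not shown.

```cpp
#include <array>
#include <cassert>

// Piece kinds in the order pawn, knight, bishop, rook, queen.
constexpr int PieceKinds = 5;

// Assumption: starting values of the SPSA session, not the tuned weights.
constexpr std::array<int, PieceKinds> Mult = {100, 325, 350, 500, 900};

// Hypothetical container: number of pieces of each kind per colour.
struct PieceCounts {
    std::array<int, PieceKinds> white{};
    std::array<int, PieceKinds> black{};
};

// Weighted material difference, positive when the side to move is ahead.
int material_diff(const PieceCounts& pc, bool whiteToMove) {
    int diff = 0;
    for (int i = 0; i < PieceKinds; ++i) {
        assert(pc.white[i] >= 0 && pc.black[i] >= 0);  // counts cannot be negative
        diff += Mult[i] * (pc.white[i] - pc.black[i]);
    }
    // Computed from White's point of view, then flipped for Black to move.
    return whiteToMove ? diff : -diff;
}
```

With White a bishop up, the sketch returns +350 when White is to move and
-350 when Black is to move, which is the side-to-move convention described
above.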
We note that with the patch the last use of the venerable "Score" type
disappears from the Stockfish codebase (the Score type was used in classical
evaluation to get a tapered eval interpolating values smoothly from the
early midgame stage to the endgame stage). We leave it to another commit
to clean all occurrences of Score in the code and the comments.
-------
Passed non-regression LTC:
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 142542 W: 36264 L: 36168 D: 70110
Ptnml(0-2): 76, 15578, 39856, 15696, 65
https://tests.stockfishchess.org/tests/view/64c8cb495b17f7c21c0cf9f8
Passed non-regression LTC (with a book with Black bias):
https://tests.stockfishchess.org/tests/view/64c8f9295b17f7c21c0cfdaf
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 494814 W: 125565 L: 125827 D: 243422
Ptnml(0-2): 244, 53926, 139346, 53630, 261
------
closes https://github.com/official-stockfish/Stockfish/pull/4713
Bench: 1655985
Explicitly describe the x86-64-modern architecture as deprecated;
it remains available as an alias of x86-64-sse41-popcnt.
CPUs that support only this instruction set are now many years old;
any Intel or AMD CPU from the last few years supports x86-64-avx2. However,
naming things 'modern' doesn't age well, so use explicit names instead.
Adjust CI accordingly. Wiki, fishtest, downloader done as well.
closes https://github.com/official-stockfish/Stockfish/pull/4691
No functional change.
use a fixed compiler on Linux and Windows (right now gcc 11).
build avxvnni on Windows (Linux needs updated core utils)
build x86-32 on Linux (Windows needs other mingw)
fix a Makefile issue where a failed PGOBENCH would not stop the build
reuse the WINE_PATH for SDE as we do for QEMU
use WINE_PATH variable also for the signature
verify the bench for each of the binaries
do not build x86-64-avx2 on macos
closes https://github.com/official-stockfish/Stockfish/pull/4682
No functional change
Since the introduction of NNUE (first released with Stockfish 12), we
have maintained the classical evaluation as part of SF in frozen form.
The idea that this code could lead to further inputs to the NN or
search did not materialize. Now, after five releases, this PR removes
the classical evaluation from SF. Even though this evaluation is
probably the best of its class, it has become unimportant for the
engine's strength, and there is little need to maintain this
code (roughly 25% of SF) going forward, or to expend resources on
trying to improve its integration in the NNUE eval.
Indeed, it still had a very limited use in the current SF, namely
for the evaluation of positions that are nearly decided based on
material difference, where the speed of the classical evaluation
outweighs its inaccuracies. This impact on strength is small,
roughly 2 Elo, and probably decreasing in importance as the TC grows.
Potentially, removal of this code could lead to the development of
techniques to have faster, but less accurate NN evaluation,
for certain positions.
STC
https://tests.stockfishchess.org/tests/view/64a320173ee09aa549c52157
Elo: -2.35 ± 1.1 (95%) LOS: 0.0%
Total: 100000 W: 24916 L: 25592 D: 49492
Ptnml(0-2): 287, 12123, 25841, 11477, 272
nElo: -4.62 ± 2.2 (95%) PairsRatio: 0.95
LTC
https://tests.stockfishchess.org/tests/view/64a320293ee09aa549c5215b
Elo: -1.74 ± 1.0 (95%) LOS: 0.0%
Total: 100000 W: 25010 L: 25512 D: 49478
Ptnml(0-2): 44, 11069, 28270, 10579, 38
nElo: -3.72 ± 2.2 (95%) PairsRatio: 0.96
VLTC SMP
https://tests.stockfishchess.org/tests/view/64a3207c3ee09aa549c52168
Elo: -1.70 ± 0.9 (95%) LOS: 0.0%
Total: 100000 W: 25673 L: 26162 D: 48165
Ptnml(0-2): 8, 9455, 31569, 8954, 14
nElo: -3.95 ± 2.2 (95%) PairsRatio: 0.95
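For readers who want to check the figures above, the Elo and PairsRatio
values can be reproduced from the raw counts. The sketch below assumes the
usual logistic Elo model and takes PairsRatio to be pairs won divided by
pairs lost; the function names are hypothetical, and the error bars and nElo
are not reproduced.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Logistic Elo difference implied by a win/loss/draw record.
double elo_from_wld(long wins, long losses, long draws) {
    const double games = static_cast<double>(wins + losses + draws);
    assert(games > 0);
    const double score = (wins + 0.5 * draws) / games;  // mean points per game
    return -400.0 * std::log10(1.0 / score - 1.0);
}

// Ratio of game pairs won to game pairs lost, from the Ptnml(0-2) counts.
// Index i holds the number of pairs that scored i/2 points.
double pairs_ratio(const std::array<long, 5>& ptnml) {
    const double won  = static_cast<double>(ptnml[3] + ptnml[4]);
    const double lost = static_cast<double>(ptnml[0] + ptnml[1]);
    return won / lost;
}
```

Feeding in the STC line (W: 24916 L: 25592 D: 49492) gives roughly -2.35 Elo,
and its Ptnml counts give a ratio of about 0.95, in line with the reported
values.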
closes https://github.com/official-stockfish/Stockfish/pull/4674
Bench: 1444646
Replace the deprecated Intel compiler icc with its newer icx variant.
This newer compiler is based on clang, and yields good performance.
As before, currently only Linux is supported.
closes https://github.com/official-stockfish/Stockfish/pull/4478
No functional change
The sdot instruction computes (and accumulates) a signed dot product,
which is quite handy for Stockfish's NNUE code. The instruction is
optional for Armv8.2 and Armv8.3, and mandatory for Armv8.4 and above.
The commit adds a new 'arm-dotprod' architecture with enabled dot
product support. It also enables dot product support for the existing
'apple-silicon' architecture, which is at least Armv8.5.
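To make the operation concrete, here is a portable scalar model of a single
128-bit signed dot-product-accumulate step. It only illustrates the
arithmetic and is not the NEON code from the commit; the type and function
names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes16 = std::array<std::int8_t, 16>;   // sixteen signed 8-bit inputs
using Lanes4  = std::array<std::int32_t, 4>;   // four 32-bit accumulators

// Each 32-bit lane gains the sum of four adjacent int8 x int8 products.
Lanes4 sdot_model(Lanes4 acc, const Bytes16& a, const Bytes16& b) {
    for (int lane = 0; lane < 4; ++lane)
        for (int k = 0; k < 4; ++k)
            acc[lane] += std::int32_t(a[4 * lane + k]) * std::int32_t(b[4 * lane + k]);
    return acc;
}
```

A single instruction thus replaces sixteen widening multiplies and the
additions that combine them, which is why it suits the int8 affine layers of
the network.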
The following local speed test was performed on an Apple M1 with
ARCH=apple-silicon. I had to remove CPU pinning from the benchmark
script. However, the results were still consistent: Checking both
binaries against themselves reported a speedup of +0.0000 and +0.0005,
respectively.
```
Result of 100 runs
==================
base (...ish.037ef3e1) = 1917997 +/- 7152
test (...fish.dotprod) = 2159682 +/- 9066
diff = +241684 +/- 2923
speedup = +0.1260
P(speedup > 0) = 1.0000
CPU: 10 x arm
Hyperthreading: off
```
Fixes #4193
closes https://github.com/official-stockfish/Stockfish/pull/4400
No functional change
If a global function has no previous declaration, either the declaration
is missing in the corresponding header file or the function should be
declared static. Static functions are local to the translation unit,
which allows the compiler to apply some optimizations earlier (when
compiling the translation unit rather than during link-time
optimization).
The commit enables the warning for gcc, clang, and mingw. It also fixes
the reported warnings by declaring the functions static or by adding a
header file (benchmark.h).
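The two fixes can be illustrated with a small, made-up translation unit (all
function names are hypothetical). With a warning such as GCC's
-Wmissing-declarations enabled, a global function needs a previous
declaration, while a static helper does not.

```cpp
#include <cassert>

// Option 1: declare the function first. In a real project this
// declaration would live in a header included by every user.
int scale_nodes(int nodes);

int scale_nodes(int nodes) {
    assert(nodes >= 0);
    return nodes * 2;
}

// Option 2: give the helper internal linkage. It needs no declaration,
// and the compiler knows no other translation unit can call it.
static int clamp_depth(int depth) { return depth < 0 ? 0 : depth; }

// A declared entry point that uses the local helper.
int search_budget(int nodes, int depth);

int search_budget(int nodes, int depth) {
    return scale_nodes(nodes) + clamp_depth(depth);
}
```

Removing either declaration line would leave a global function without a
previous declaration, which is the situation the warning reports.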
closes https://github.com/official-stockfish/Stockfish/pull/4325
No functional change
Instead of allowing .depend only for specific build-related targets, filter
out non-build-related targets (i.e. help, clean) so that all other targets
can execute the .depend target as normal.
closes https://github.com/official-stockfish/Stockfish/pull/4293
No functional change
Add a constraint so that the dependency build only occurs when users
actually run build tasks.
This fixes a bug on some systems where gcc/g++ is not available.
closes https://github.com/official-stockfish/Stockfish/pull/4255
No functional change
For development versions of Stockfish, the version will now look like
dev-20221107-dca9a0533
indicating a development version, the date of the last commit,
and the git SHA of that commit. If git is not available,
the fallback is the date of compilation. Releases will continue to be
versioned as before.
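A minimal sketch of how such a version string could be assembled is shown
below. The function name and parameters are hypothetical, and the exact form
of the fallback string is an assumption, since the text above only says that
the compilation date is used when git is unavailable.

```cpp
#include <cassert>
#include <string>

// Builds "dev-<commit date>-<sha>" when git information is present.
// Assumption: without git, the string is just "dev-<build date>".
std::string dev_version(const std::string& commitDate,
                        const std::string& gitSha,
                        const std::string& buildDate) {
    assert(!buildDate.empty());  // the compile date is always known
    if (commitDate.empty() || gitSha.empty())
        return "dev-" + buildDate;
    return "dev-" + commitDate + "-" + gitSha;
}
```

Called with "20221107" and "dca9a0533", the sketch returns the example string
quoted above.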
Additionally, this PR extends the CI to create binary artifacts,
i.e. pushes to master will automatically build Stockfish and upload
the binaries to github.
closes https://github.com/official-stockfish/Stockfish/pull/4220
No functional change
have maximal compatibility on legacy target arch, now supporting AMD Athlon
The old behavior can still be selected by the user if needed, for example:
make -j profile-build ARCH=x86-32 sse=yes
fixes #3904
closes https://github.com/official-stockfish/Stockfish/pull/3918
No functional change
For cross-compiling to Android on Windows, the Makefile needs some tweaks.
Tested with Android NDK 23.1.7779620 and 21.4.7075529, using
Windows 10 with clean MSYS2 environment (i.e. no MINGW/GCC/Clang
toolchain in PATH) and Fedora 35, with build target:
build ARCH=armv8 COMP=ndk
The resulting binary runs fine inside Droidfish on my Samsung
Galaxy Note20 Ultra and Samsung Galaxy Tab S7+
Other builds tested to exclude regressions: MINGW64/Clang64 build
on Windows; MINGW64 cross build, native Clang and GCC builds on Fedora.
wiki docs https://github.com/glinscott/fishtest/wiki/Cross-compiling-Stockfish-for-Android-on-Windows-and-Linux
closes https://github.com/official-stockfish/Stockfish/pull/3901
No functional change
A Windows Native Build (WNB) can be done:
- on Windows, using a recent mingw-w64 g++/clang compiler
distributed by msys2, cygwin and others
- on Linux, using mingw-w64 g++ to cross compile
Improvements:
- check for a WNB in a proper way and set a variable to simplify the code
- set the proper EXE for a WNB
- use the proper name for the mingw-w64 clang compiler
- use the static linking for a WNB
- use wine to make a PGO cross compile on Linux (also with Intel SDE)
- enable the LTO build for mingw-w64 g++ compiler
- set `lto=auto` to use the make's job server, if available, or otherwise
to fall back to autodetection of the number of CPU threads
- clean up all the temporary LTO files saved in the local directory
Tested on:
- msys2 MINGW64 (g++), UCRT64 (g++), MINGW32 (g++), CLANG64 (clang)
environments
- cygwin mingw-w64 g++
- Ubuntu 18.04 & 21.10 mingw-w64 PGO cross compile (also with Intel SDE)
closes #3891
No functional change
This patch optimizes the NEON implementation in two ways.
The activation layer after the feature transformer is rewritten to make it easier for the compiler to see through dependencies and unroll. This in itself is a minimal but positive improvement. Other architectures could benefit from this too in the future. This is not an algorithmic change.
The affine transform for large matrices (first layer after FT) on NEON now utilizes the same optimized code path as >=SSSE3, which makes the memory accesses more sequential and makes better use of the available registers, which allows for code that has longer dependency chains.
Benchmarks from Redshift#161, profile-build with apple clang
george@Georges-MacBook-Air nets % ./stockfish-b82d93 bench 2>&1 | tail -4 (current master)
===========================
Total time (ms) : 2167
Nodes searched : 4667742
Nodes/second : 2154011
george@Georges-MacBook-Air nets % ./stockfish-7377b8 bench 2>&1 | tail -4 (this patch)
===========================
Total time (ms) : 1842
Nodes searched : 4667742
Nodes/second : 2534061
This is a solid 18% improvement overall, larger in a bench with NNUE-only, not mixed.
Improvement is also observed on armv7-neon (Raspberry Pi, and older phones), around 5% speedup.
No changes for architectures other than NEON.
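The quoted improvement follows directly from the two Nodes/second figures in
the bench output above. A small sketch of the arithmetic, with hypothetical
function names:

```cpp
#include <cassert>
#include <cmath>

// Relative speedup of `testNps` over `baseNps`; 0.10 means 10% faster.
double relative_speedup(double baseNps, double testNps) {
    assert(baseNps > 0.0);
    return (testNps - baseNps) / baseNps;
}

// The same value as a percentage rounded to one decimal place.
double speedup_percent(double baseNps, double testNps) {
    return std::round(1000.0 * relative_speedup(baseNps, testNps)) / 10.0;
}
```

For 2154011 and 2534061 nodes per second this gives about 17.6%, which the
text rounds to 18%.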
closes https://github.com/official-stockfish/Stockfish/pull/3837
No functional changes.
In their infinite wisdom, Intel axed AVX512 from Alder Lake
chips (well, not entirely, but we kind of want to use the Gracemont
cores for chess!) but still added VNNI support.
Confusingly enough, this is not the same as VNNI256 support.
This adds a specific AVX-VNNI target that will use this AVX-VNNI
mode, by prefixing the VNNI instructions with the appropriate VEX
prefix, and avoiding AVX512 usage.
This is about 1% faster on P cores:
Result of 20 runs
==================
base (./clang-bmi2 ) = 3306337 +/- 7519
test (./clang-vnni ) = 3344226 +/- 7388
diff = +37889 +/- 4153
speedup = +0.0115
P(speedup > 0) = 1.0000
But a nice 3% faster on E cores:
Result of 20 runs
==================
base (./clang-bmi2 ) = 1938054 +/- 28257
test (./clang-vnni ) = 1994606 +/- 31756
diff = +56552 +/- 3735
speedup = +0.0292
P(speedup > 0) = 1.0000
This was measured on Clang 13. GCC 11.2 appears to generate
worse code for Alder Lake, though the speedup on the E cores
is similar.
It is possible to run the engine specifically on the P or E cores using CPU binding;
for example, on Linux it is possible to use (for an 8 P + 8 E setup like the i9-12900K):
taskset -c 0-15 ./stockfish
taskset -c 16-23 ./stockfish
where the first call binds to the P-cores and the second to the E-cores.
closes https://github.com/official-stockfish/Stockfish/pull/3824
No functional change
To help with debugging, the worker sends the output of
stderr (suitably truncated) to the action log on the
server in case a build fails. For this to work, it is
important that there is no spurious output to stderr.
closes https://github.com/official-stockfish/Stockfish/pull/3773
No functional change
Introduces a new NNUE network architecture and associated network parameters.
The summary of the changes:
* Position for each perspective mirrored such that the king is on e..h files. Cuts the feature transformer size in half, while preserving enough knowledge to be good. See https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit#heading=h.b40q4rb1w7on.
* The number of neurons after the feature transformer increased two-fold, to 1024x2. This is possibly mostly due to the now very optimized feature transformer update code.
* The number of neurons after the second layer is reduced from 16 to 8, to reduce the speed impact. This, perhaps surprisingly, doesn't harm the strength much. See https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit#heading=h.6qkocr97fezq
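The mirroring in the first bullet can be sketched as follows. This is an
illustration under stated assumptions rather than the feature-indexing code
of the patch: squares are numbered 0..63 with a1 = 0 and h8 = 63, the helper
names are hypothetical, and the vertical flip for Black's perspective is left
out.

```cpp
#include <cassert>

// Assumption: a1 = 0, b1 = 1, ..., h8 = 63, so the file is the low 3 bits.
constexpr int file_of(int sq) { return sq & 7; }

// Flip a square left-to-right: a <-> h, b <-> g, and so on.
constexpr int mirror_files(int sq) { return sq ^ 7; }

// Orient a square for one perspective: if that perspective's king stands
// on files a..d, mirror the board so the king lands on files e..h.
constexpr int orient(int kingSq, int sq) {
    return file_of(kingSq) < 4 ? mirror_files(sq) : sq;
}

// After orientation the king occupies one of 4 files x 8 ranks = 32
// squares, half of the 64 that would otherwise be needed.
constexpr int king_bucket(int kingSq) {
    const int k = orient(kingSq, kingSq);
    return (k >> 3) * 4 + (file_of(k) - 4);
}

static_assert(king_bucket(63) == 31, "h8 maps to the last of 32 buckets");
```

A king on a1 and a king on h1 fall into the same bucket, and every other
piece is mirrored along with the king, so both cases share one set of
feature transformer weights.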
The AffineTransform code did not work out of the box with the smaller number of neurons after the second layer, so some temporary changes have been made to add a special case for InputDimensions == 8. Also additional 0 padding is added to the output for some archs that cannot process inputs by <=8 (SSE2, NEON). VNNI uses an implementation that can keep all outputs in the registers while reducing the number of loads by 3 for each 16 inputs, thanks to the reduced number of output neurons. However, GCC is particularly bad at optimization here (which is perhaps why the current way the affine transform is done even passed SPRT) (see https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit# for details) and more work will be done on this in the following days. I expect the current VNNI implementation to be improved and extended to other architectures.
The network was trained with a slightly modified version of the pytorch trainer (https://github.com/glinscott/nnue-pytorch); the changes are in https://github.com/glinscott/nnue-pytorch/pull/143
The training utilized 2 datasets.
dataset A - https://drive.google.com/file/d/1VlhnHL8f-20AXhGkILujnNXHwy9T-MQw/view?usp=sharing
dataset B - as described in ba01f4b954
The training process was as follows:
train on dataset A for 350 epochs, take the best net in terms of Elo at 20k nodes per move (it's fine to take anything from later stages of training).
convert the .ckpt to .pt
--resume-from-model from the .pt file, train on dataset B for <600 epochs, take the best net. Lambda=0.8, applied before the loss function.
The first training command:
python3 train.py \
../nnue-pytorch-training/data/large_gensfen_multipvdiff_100_d9.binpack \
../nnue-pytorch-training/data/large_gensfen_multipvdiff_100_d9.binpack \
--gpus "$3," \
--threads 1 \
--num-workers 1 \
--batch-size 16384 \
--progress_bar_refresh_rate 20 \
--smart-fen-skipping \
--random-fen-skipping 3 \
--features=HalfKAv2_hm^ \
--lambda=1.0 \
--max_epochs=600 \
--default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2
The second training command:
python3 serialize.py \
--features=HalfKAv2_hm^ \
../nnue-pytorch-training/experiment_131/run_6/default/version_0/checkpoints/epoch-499.ckpt \
../nnue-pytorch-training/experiment_$1/base/base.pt
python3 train.py \
../nnue-pytorch-training/data/michael_commit_b94a65.binpack \
../nnue-pytorch-training/data/michael_commit_b94a65.binpack \
--gpus "$3," \
--threads 1 \
--num-workers 1 \
--batch-size 16384 \
--progress_bar_refresh_rate 20 \
--smart-fen-skipping \
--random-fen-skipping 3 \
--features=HalfKAv2_hm^ \
--lambda=0.8 \
--max_epochs=600 \
--resume-from-model ../nnue-pytorch-training/experiment_$1/base/base.pt \
--default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2
STC: https://tests.stockfishchess.org/tests/view/611120b32a8a49ac5be798c4
LLR: 2.97 (-2.94,2.94) <-0.50,2.50>
Total: 22480 W: 2434 L: 2251 D: 17795
Ptnml(0-2): 101, 1736, 7410, 1865, 128
LTC: https://tests.stockfishchess.org/tests/view/611152b32a8a49ac5be798ea
LLR: 2.93 (-2.94,2.94) <0.50,3.50>
Total: 9776 W: 442 L: 333 D: 9001
Ptnml(0-2): 5, 295, 4180, 402, 6
closes https://github.com/official-stockfish/Stockfish/pull/3646
bench: 5189338
Not all Linux users will have libatomic installed.
When using clang as the system compiler with compiler-rt as the default
runtime library instead of libgcc, atomic builtins may be provided by compiler-rt.
This change allows such users to pass RTLIB=compiler-rt to make sure
the build doesn't error out on the missing (unnecessary) libatomic.
closes https://github.com/official-stockfish/Stockfish/pull/3597
No functional change
The Cygwin environment has two g++ compilers, each with a different problem
for compiling Stockfish at the moment:
(a) g++.exe : full posix build compiler, linked to cygwin dll.
=> This one has a problem embedding the net.
(b) x86_64-w64-mingw32-g++.exe : native Windows build compiler.
=> This one manages to embed the net, but has a problem related to libgcov
when we use the profile-build target of Stockfish.
This patch solves the problem for compiler (b), so that our recommended command line
if you want to build an optimized version of Stockfish on Cygwin becomes something
like the following (you can change the ARCH value to whatever you want, but note
the COMP and CXX variables pointing at the right compiler):
```
make -j profile-build ARCH=x86-64-modern COMP=mingw CXX=x86_64-w64-mingw32-c++.exe
```
closes https://github.com/official-stockfish/Stockfish/pull/3569
No functional change
This reverts commit "Fix for Cygwin's environment build-profile", as it was
giving errors for "make clean" on some Windows environments. See comments in
68bf362ea2
Possibly somebody can propose a solution that would fix Cygwin builds and
not break on other systems too, stay tuned! :-)
No functional change
The Cygwin environment has two g++ compilers, each with a different problem
for compiling Stockfish at the moment:
(a) g++.exe : full posix build compiler, linked to cygwin dll.
=> This one has a problem embedding the net.
(b) x86_64-w64-mingw32-g++.exe : native Windows build compiler.
=> This one manages to embed the net, but has a problem related to libgcov
when we use the profile-build target of Stockfish.
This patch solves the problem for compiler (b), so that our recommended command line
if you want to build an optimized version of Stockfish on Cygwin becomes something
like the following (you can change the ARCH value to whatever you want, but note
the COMP and CXX variables pointing at the right compiler):
```
make -j profile-build ARCH=x86-64-modern COMP=mingw CXX=x86_64-w64-mingw32-c++.exe
```
closes https://github.com/official-stockfish/Stockfish/pull/3463
No functional change