This commit generalizes the feature transform to use architecture-defined vec_t macros
instead of a separate code path for each architecture.
It should make some older architectures (MMX, including improvements by Fanael) faster
and make further such improvements easier in the future.
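For illustration, the pattern looks roughly like the following (a simplified sketch, not the full macro set, which also covers MMX, NEON, AVX-512, etc.): each supported architecture defines the same vec_t type and a small family of vec_* macros once, and the feature-transform code is then written a single time in terms of those names.

// Simplified sketch of the vec_t macro pattern.
#if defined(USE_AVX2)
    #include <immintrin.h>
    typedef __m256i vec_t;
    #define vec_load(a)      _mm256_load_si256(a)
    #define vec_store(a, b)  _mm256_store_si256(a, b)
    #define vec_add_16(a, b) _mm256_add_epi16(a, b)
    #define vec_sub_16(a, b) _mm256_sub_epi16(a, b)
#elif defined(USE_SSE2)
    #include <emmintrin.h>
    typedef __m128i vec_t;
    #define vec_load(a)      (*(a))
    #define vec_store(a, b)  *(a) = (b)
    #define vec_add_16(a, b) _mm_add_epi16(a, b)
    #define vec_sub_16(a, b) _mm_sub_epi16(a, b)
#endif

// The accumulator update loop is then architecture independent, e.g.:
// for (std::size_t i = 0; i < NumChunks; ++i)
//     acc[i] = vec_add_16(acc[i], column[i]);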
Includes some corrections to CI for mingw.
closes https://github.com/official-stockfish/Stockfish/pull/3955
closes https://github.com/official-stockfish/Stockfish/pull/3928
No functional change
- set the variable only for the required tests to keep the yml file simple
- use NDK 21.x until the Stockfish static build problem with NDK 23.x is fixed
- set up the tests for the armv7, armv7-neon and armv8 builds:
  - use the armv7a-linux-androideabi21-clang++ compiler for armv7 and armv7-neon
  - enforce a static build
  - silence the warning for the unused compilation flag "-pie" with the static build,
    otherwise the GitHub workflow stops
  - use qemu to bench the build and get the signature
Many thanks to @pschneider1968 who did all the hard work with the NDK :)
closes https://github.com/official-stockfish/Stockfish/pull/3924
No functional change
This patch is a result of tuning done by user @candirufish after 150k games.
Since the tuned values were really interesting and touched heuristics
that are known for their non-linear scaling, I decided to run a limited-games
LTC match, even though the STC test was really bad (which was expected).
After seeing the results of the LTC match, I also ran a VLTC (very long
time control) SPRT test, which passed.
The main difference is in extensions: this patch allows much more
singular/double extensions, both in terms of allowing them at lower
depths and with lesser margins.
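For orientation, the relaxed conditions have roughly this shape (an illustrative sketch; the names, constants and exact form are assumptions, not the tuned values from this patch):

// Illustrative sketch of the shape of the singular-extension gate.
struct SingularParams {
    int minDepth;        // minimum depth at which the extension is tried
    int marginPerDepth;  // margin subtracted from the TT value per unit of depth
};

// Lowering minDepth and marginPerDepth both make singular (and double)
// extensions fire more often, which is the main effect of this tune.
bool singular_candidate(int depth, int ttDepth, const SingularParams& p)
{
    return depth >= p.minDepth && ttDepth >= depth - 3;  // "depth - 3" is illustrative
}

int singular_beta(int ttValue, int depth, const SingularParams& p)
{
    return ttValue - p.marginPerDepth * depth;
}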
Failed STC:
https://tests.stockfishchess.org/tests/view/620d66643ec80158c0cd3b46
LLR: -2.94 (-2.94,2.94) <0.00,2.50>
Total: 4968 W: 1194 L: 1398 D: 2376
Ptnml(0-2): 47, 633, 1294, 497, 13
Performed well at LTC in a fixed-length match:
https://tests.stockfishchess.org/tests/view/620d66823ec80158c0cd3b4a
ELO: 3.36 +-1.8 (95%) LOS: 100.0%
Total: 30000 W: 7966 L: 7676 D: 14358
Ptnml(0-2): 36, 2936, 8755, 3248, 25
Passed VLTC SPRT test:
https://tests.stockfishchess.org/tests/view/620da11a26f5b17ec884f939
LLR: 2.96 (-2.94,2.94) <0.50,3.00>
Total: 4400 W: 1326 L: 1127 D: 1947
Ptnml(0-2): 13, 309, 1348, 526, 4
closes https://github.com/official-stockfish/Stockfish/pull/3937
Bench: 6318903
Architecture:
The diagram of the "SFNNv4" architecture:
https://user-images.githubusercontent.com/8037982/153455685-cbe3a038-e158-4481-844d-9d5fccf5c33a.png
The most important architectural changes are the following:
* 1024x2 [activated] neurons are pairwise, elementwise multiplied (not quite pairwise due to implementation details, see diagram), which introduces a non-linearity that exhibits similar benefits to the previously tested sigmoid activation (quantmoid4), while being slightly faster (a minimal sketch follows this list).
* The following layer therefore has 2x fewer inputs, which we compensate for by having 2x more outputs. It is possible that reducing the number of outputs might be beneficial (as we had it as low as 8 before). The layer is now 1024->16.
* The 16 outputs are split into 15 and 1. The 1-wide output is added to the network output (after some necessary scaling due to quantization differences). The 15-wide part is activated and follows the usual path through a set of linear layers. The additional 1-wide output is at least neutral, but has shown a slightly positive trend in training compared to networks without it (all 16 outputs through the usual path), and possibly allows an additional stage of lazy evaluation to be introduced in the future.
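As a rough illustration of the first point, the new non-linearity amounts to the following (a scalar C++ sketch under simplifying assumptions: quantized clipped-ReLU values in [0, 127], and the two halves of each perspective paired elementwise; the exact pairing and quantization follow the diagram, not this sketch).

#include <algorithm>
#include <cstdint>

// Scalar sketch of the pairwise, elementwise multiplication.
// Each perspective contributes 1024 accumulator values; its two 512-wide halves
// are clipped and multiplied, giving the 1024 inputs of the 1024->16 layer.
void pairwise_multiply(const std::int16_t own[1024], const std::int16_t opp[1024],
                       std::uint8_t out[1024]) {
    const std::int16_t* acc[2] = { own, opp };
    for (int p = 0; p < 2; ++p)
        for (int i = 0; i < 512; ++i) {
            int a = std::clamp(int(acc[p][i]),       0, 127); // clipped ReLU
            int b = std::clamp(int(acc[p][i + 512]), 0, 127); // clipped ReLU
            out[p * 512 + i] = std::uint8_t(a * b / 128);     // product, rescaled to 8 bits
        }
}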
Additionally, the inference code was rewritten and no longer uses a recursive implementation. This was necessitated by the splitting of the 16-wide intermediate result into two, which was impossible to do with the old implementation without resorting to ugly hacks. This is hopefully overall for the better.
First session:
The first session was training a network from scratch (random initialization). The exact trainer used was slightly different (older) from the one used in the second session, but it should not have a measurable effect. The purpose of this session is to establish a strong network base for the second session. Small deviations in strength do not harm the learnability in the second session.
The training was done using the following command:
python3 train.py \
/home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \
/home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \
--gpus "$3," \
--threads 4 \
--num-workers 4 \
--batch-size 16384 \
--progress_bar_refresh_rate 20 \
--random-fen-skipping 3 \
--features=HalfKAv2_hm^ \
--lambda=1.0 \
--gamma=0.992 \
--lr=8.75e-4 \
--max_epochs=400 \
--default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2
Every 20th net was saved and its playing strength measured against some baseline at 25k nodes per move with pure NNUE evaluation (modified binary). The exact setup is not important as long as it's consistent. The purpose is to sift good candidates from bad ones.
The dataset can be found at https://drive.google.com/file/d/1UQdZN_LWQ265spwTBwDKo0t1WjSJKvWY/view
Second session:
The second training session was done starting from the best network (as determined by strength testing) from the first session. It is important that it's resumed from a .pt model and NOT a .ckpt model. The conversion can be performed directly using serialize.py.
The LR schedule was modified to use gamma=0.995 instead of gamma=0.992 and LR=4.375e-4 instead of LR=8.75e-4 to flatten the LR curve and allow for longer training. The training was then running for 800 epochs instead of 400 (though it's possibly mostly noise after around epoch 600).
The training was done using the following command:
python3 train.py \
/data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \
/data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \
--gpus "$3," \
--threads 4 \
--num-workers 4 \
--batch-size 16384 \
--progress_bar_refresh_rate 20 \
--random-fen-skipping 3 \
--features=HalfKAv2_hm^ \
--lambda=1.0 \
--gamma=0.995 \
--lr=4.375e-4 \
--max_epochs=800 \
--resume-from-model /data/sopel/nnue/nnue-pytorch-training/data/exp295/nn-epoch399.pt \
--default_root_dir ../nnue-pytorch-training/experiment_$1/run_$run_id
In particular, note that we now use lambda=1.0 instead of lambda=0.8 (previous nets), because tests show that the WDL-skipping introduced by vondele performs better with lambda=1.0. Nets were saved every 20th epoch. In total, 16 runs were made with these settings, and the best nets were chosen according to playing strength at 25k nodes per move with pure NNUE evaluation - these are the 4 nets that have been put on fishtest.
The dataset can be found either at ftp://ftp.chessdb.cn/pub/sopel/data_sf/T60T70wIsRightFarseerT60T74T75T76.binpack in its entirety (the download might be painfully slow because it is hosted in China) or can be assembled in the following way:
Get the 5640ad48ae/script/interleave_binpacks.py script.
Download T60T70wIsRightFarseer.binpack https://drive.google.com/file/d/1_sQoWBl31WAxNXma2v45004CIVltytP8/view
Download farseerT74.binpack http://trainingdata.farseer.org/T74-May13-End.7z
Download farseerT75.binpack http://trainingdata.farseer.org/T75-June3rd-End.7z
Download farseerT76.binpack http://trainingdata.farseer.org/T76-Nov10th-End.7z
Run python3 interleave_binpacks.py T60T70wIsRightFarseer.binpack farseerT74.binpack farseerT75.binpack farseerT76.binpack T60T70wIsRightFarseerT60T74T75T76.binpack
Tests:
STC: https://tests.stockfishchess.org/tests/view/6203fb85d71106ed12a407b7
LLR: 2.94 (-2.94,2.94) <0.00,2.50>
Total: 16952 W: 4775 L: 4521 D: 7656
Ptnml(0-2): 133, 1818, 4318, 2076, 131
LTC: https://tests.stockfishchess.org/tests/view/62041e68d71106ed12a40e85
LLR: 2.94 (-2.94,2.94) <0.50,3.00>
Total: 14944 W: 4138 L: 3907 D: 6899
Ptnml(0-2): 21, 1499, 4202, 1728, 22
closes https://github.com/official-stockfish/Stockfish/pull/3927
Bench: 4919707
Most credit for this patch should go to @candirufish.
Based on his big search tuning (1M games at 20+0.1s)
https://tests.stockfishchess.org/tests/view/61fc7a6ed508ec6a1c9f4b7d
with some hand polishing on top of it, which includes:
a) correcting the trend sigmoid - for some reason the original tuning resulted in it being negative. This heuristic has been proven to be worth some Elo for years, so the reversed sign is probably a random artefact;
b) removing changes to continuation-history-based pruning - historically this heuristic was really good at producing green STCs and then failing miserably at LTC whenever we tried to make it stricter. The original tuning was done at a short time control and thus made it stricter, which doesn't scale to longer time controls;
c) removing changes to improvement - not really intended :).
passed STC
https://tests.stockfishchess.org/tests/view/6203526e88ae2c84271c2ee2
LLR: 2.94 (-2.94,2.94) <0.00,2.50>
Total: 16840 W: 4604 L: 4363 D: 7873
Ptnml(0-2): 82, 1780, 4449, 2033, 76
passed LTC
https://tests.stockfishchess.org/tests/view/620376e888ae2c84271c35d4
LLR: 2.96 (-2.94,2.94) <0.50,3.00>
Total: 17232 W: 4771 L: 4542 D: 7919
Ptnml(0-2): 14, 1655, 5048, 1886, 13
closes https://github.com/official-stockfish/Stockfish/pull/3926
bench 5030992
Have maximal compatibility on the legacy target arch, now supporting AMD Athlon.
The old behavior can still be selected by the user if needed, for example:
make -j profile-build ARCH=x86-32 sse=yes
fixes #3904
closes https://github.com/official-stockfish/Stockfish/pull/3918
No functional change
For cross-compiling to Android on Windows, the Makefile needs some tweaks.
Tested with Android NDK 23.1.7779620 and 21.4.7075529, using
Windows 10 with a clean MSYS2 environment (i.e. no MINGW/GCC/Clang
toolchain in PATH) and Fedora 35, with build target:
build ARCH=armv8 COMP=ndk
The resulting binary runs fine inside Droidfish on my Samsung
Galaxy Note20 Ultra and Samsung Galaxy Tab S7+
Other builds tested to exclude regressions: MINGW64/Clang64 build
on Windows; MINGW64 cross build, native Clang and GCC builds on Fedora.
wiki docs https://github.com/glinscott/fishtest/wiki/Cross-compiling-Stockfish-for-Android-on-Windows-and-Linux
closes https://github.com/official-stockfish/Stockfish/pull/3901
No functional change
A Windows Native Build (WNB) can be done:
- on Windows, using a recent mingw-w64 g++/clang compiler
distributed by msys2, cygwin and others
- on Linux, using mingw-w64 g++ to cross compile
Improvements:
- check for a WNB in a proper way and set a variable to simplify the code
- set the proper EXE for a WNB
- use the proper name for the mingw-w64 clang compiler
- use static linking for a WNB
- use wine to make a PGO cross compile on Linux (also with Intel SDE)
- enable the LTO build for mingw-w64 g++ compiler
- set `lto=auto` to use make's job server, if available, or otherwise fall back
  to autodetection of the number of CPU threads
- clean up all the temporary LTO files saved in the local directory
Tested on:
- msys2 MINGW64 (g++), UCRT64 (g++), MINGW32 (g++), CLANG64 (clang)
environments
- cygwin mingw-w64 g++
- Ubuntu 18.04 & 21.10 mingw-w64 PGO cross compile (also with Intel SDE)
closes #3891
No functional change
This updates the estimates from 2 years ago (#2401) and adds missing terms.
All tests were run at 10+0.1 (STC), 20000 games, error bars +- 1.8 Elo, book 8moves_v3.pgn.
A table of Elo values with the links to the corresponding tests can be found at the PR
closes https://github.com/official-stockfish/Stockfish/pull/3868
Non-functional Change
This is a reintroduction of an idea that was simplified away approximately 1 year ago.
There are some tweaks to it:
a) exclude promotions;
b) exclude PV nodes from it - the PV-node logic for captures is really different from non-PV nodes, so this makes a lot of sense;
c) use a big grain of capture history - the idea is taken from my recent futility pruning patches (a rough sketch follows below).
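A rough sketch of what the pruning condition looks like with these tweaks (names, thresholds and the grain size here are illustrative assumptions, not the committed values):

// Illustrative sketch only, not the actual patch. The shape of the idea:
// in non-PV nodes, skip shallow captures (excluding promotions) whose capture
// history is clearly bad, where "clearly" is enforced by a coarse grain so that
// small history fluctuations have no effect.
bool prune_capture(bool pvNode, bool isPromotion, int lmrDepth, int captureHistoryScore)
{
    if (pvNode || isPromotion)                 // tweaks a) and b): skip promotions and PV nodes
        return false;

    constexpr int grain     = 1024;            // tweak c): big grain of capture history
    constexpr int threshold = -4 * grain;      // illustrative cutoff
    return lmrDepth < 7                        // illustrative depth limit
        && (captureHistoryScore / grain) * grain < threshold;
}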
passed STC
https://tests.stockfishchess.org/tests/view/61bd90f857a0d0f327c373b7
LLR: 2.96 (-2.94,2.94) <0.00,2.50>
Total: 86640 W: 22474 L: 22110 D: 42056
Ptnml(0-2): 268, 9732, 22963, 10082, 275
passed LTC
https://tests.stockfishchess.org/tests/view/61be094457a0d0f327c38aa3
LLR: 2.95 (-2.94,2.94) <0.50,3.00>
Total: 23240 W: 6079 L: 5838 D: 11323
Ptnml(0-2): 14, 2261, 6824, 2512, 9
https://github.com/official-stockfish/Stockfish/pull/3864
bench 4493723
This patch is a follow-up to the previous 2 patches that introduced more reductions for PV nodes with low delta and more pruning for nodes with low delta. Instead of writing separate heuristics, it now adjusts reductions based on delta / rootDelta - this allows removing 3 separate adjustments of pruning/LMR in different places and also makes the dependence of reductions on delta and rootDelta smoother. It also now applies to all pruning heuristics and not just 2.
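Roughly, the adjustment has the following shape (a sketch with illustrative constants, not necessarily the committed ones):

// Sketch of a delta/rootDelta-based reduction adjustment.
// delta is beta - alpha at the current node, rootDelta is beta - alpha at the root:
// the narrower the current window is relative to the root window, the more we reduce,
// which replaces the previous per-heuristic delta adjustments.
int adjusted_reduction(int baseReduction1024, int delta, int rootDelta)
{
    // baseReduction1024 is the usual table-based reduction, scaled by 1024.
    return (baseReduction1024 + 1358 - 1024 * delta / rootDelta) / 1024;
}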
Passed STC
https://tests.stockfishchess.org/tests/view/61ba9b6c57a0d0f327c2d48b
LLR: 2.94 (-2.94,2.94) <0.00,2.50>
Total: 79192 W: 20513 L: 20163 D: 38516
Ptnml(0-2): 238, 8900, 21024, 9142, 292
passed LTC
https://tests.stockfishchess.org/tests/view/61baf77557a0d0f327c2eb8e
LLR: 2.96 (-2.94,2.94) <0.50,3.00>
Total: 158400 W: 41134 L: 40572 D: 76694
Ptnml(0-2): 101, 16372, 45745, 16828, 154
closes https://github.com/official-stockfish/Stockfish/pull/3862
bench 4651538