This patch improves the codegen in the AffineTransform::forward function for architectures >= SSSE3.

The current code works directly on memory, and the compiler cannot see that the stores through outptr do not alias the loads through weights and input32. The solution implemented here is to perform the affine transform with local variables as accumulators and only store the result to memory at the end. The number of accumulators required is OutputDimensions / OutputSimdWidth, which for the 1024->16 affine transform means 4 registers with SSSE3, 2 with AVX2, and 1 with AVX512. It also cuts the number of stores required by NumRegs * 256 for each node evaluated.

The local accumulators are expected to be assigned to registers, but even if this is not possible in some cases due to register pressure, it still lets the compiler see that there is no aliasing between the loads and stores and may result in better codegen.
See https://godbolt.org/z/59aTKbbYc for codegen comparison.
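Below is a minimal sketch of the accumulator pattern, not the actual Stockfish kernel: the arithmetic is simplified to int32 weights and a plain multiply, the weight layout and template parameters are illustrative, and what matters is only that the partial sums live in local variables and memory is written once at the end.

```cpp
#include <immintrin.h>
#include <cstdint>

// Sketch only: OutputDimensions assumed to be a multiple of 8, weights stored
// so that all outputs for one input are contiguous. The point is the
// accumulation pattern, not the arithmetic.
template <int OutputDimensions, int InputDimensions>
void affine_propagate(std::int32_t* output, const std::int32_t* weights,
                      const std::int32_t* input) {
    constexpr int OutputSimdWidth = 8;                           // 8 x int32 per __m256i
    constexpr int NumRegs = OutputDimensions / OutputSimdWidth;  // local accumulators

    __m256i acc[NumRegs];
    for (int k = 0; k < NumRegs; ++k)
        acc[k] = _mm256_setzero_si256();

    for (int i = 0; i < InputDimensions; ++i) {
        const __m256i in = _mm256_set1_epi32(input[i]);          // broadcast one input
        const __m256i* col =
            reinterpret_cast<const __m256i*>(&weights[i * OutputDimensions]);
        for (int k = 0; k < NumRegs; ++k)                        // sums stay in registers
            acc[k] = _mm256_add_epi32(
                acc[k], _mm256_mullo_epi32(in, _mm256_loadu_si256(&col[k])));
    }

    __m256i* out = reinterpret_cast<__m256i*>(output);
    for (int k = 0; k < NumRegs; ++k)                            // one store per accumulator
        _mm256_storeu_si256(&out[k], acc[k]);
}
```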
passed STC:
LLR: 2.94 (-2.94,2.94) <-0.50,2.50>
Total: 140328 W: 10635 L: 10358 D: 119335
Ptnml(0-2): 302, 8339, 52636, 8554, 333
closes https://github.com/official-stockfish/Stockfish/pull/3634
No functional change
This PR adds an ability to export any currently loaded network.
The export_net command now takes an optional filename parameter.
If the loaded net is not the embedded net the filename parameter is required.
Two changes were required to support this:
* the "architecture" string, which is really just a some kind of description in the net, is now saved into netDescription on load and correctly saved on export.
* the AffineTransform scrambles and sparsifies the weights for some architectures, which makes recovering the original index hard. This is solved by keeping a temporary scrambled<->unscrambled index lookup table while loading the network, and by saving the actual index for each individual weight that makes it into canSaturate16. This increases the size of the canSaturate16 entries by 6 bytes (see the sketch below).
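A hypothetical illustration of such a lookup table; get_weight_index() here stands in for whatever per-architecture permutation the layer applies and is not the real Stockfish function.

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: record, for every scrambled storage slot, which canonical
// (unscrambled) index it came from, so the exporter and the canSaturate16
// bookkeeping can refer to canonical indices.
std::size_t get_weight_index(std::size_t canonical_index);  // assumed permutation

std::vector<std::uint32_t> make_inverse_lookup(std::size_t num_weights) {
    std::vector<std::uint32_t> scrambled_to_canonical(num_weights);
    for (std::size_t i = 0; i < num_weights; ++i)
        scrambled_to_canonical[get_weight_index(i)] = static_cast<std::uint32_t>(i);
    return scrambled_to_canonical;
}
```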
closes https://github.com/official-stockfish/Stockfish/pull/3456
No functional change
The size of the weights in the last layer is less than 512 bits, which leads to out-of-bounds data access with AVX512. No error occurs because the current implementation guarantees an array of zeros after the weights, so the extra elements multiply to zero and the sum stays correct; it is nevertheless a mistake that could lead to unexpected bugs in the future. AVX2 instructions are now used for the smaller input size.
No measurable slowdown on avx512.
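A minimal sketch of the narrow-row case (not the actual kernel; inputs assumed clipped to [0, 127] as in NNUE, so maddubs cannot saturate): a 32-byte weight row fits in one 256-bit register, so the AVX2 maddubs/madd pair covers it, whereas a 512-bit load would read past the end of the row.

```cpp
#include <immintrin.h>
#include <cstdint>

// Dot product of 32 clipped uint8 inputs with 32 int8 weights using AVX2 only.
int dot32(const std::uint8_t* input, const std::int8_t* weights) {
    const __m256i in = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(input));
    const __m256i w  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(weights));
    __m256i prod = _mm256_maddubs_epi16(in, w);                  // u8*s8 -> int16 pairs
    prod = _mm256_madd_epi16(prod, _mm256_set1_epi16(1));        // int16 pairs -> int32
    // horizontal sum of the 8 int32 lanes
    __m128i lo = _mm256_castsi256_si128(prod);
    __m128i hi = _mm256_extracti128_si256(prod, 1);
    __m128i s  = _mm_add_epi32(lo, hi);
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}
```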
closes https://github.com/official-stockfish/Stockfish/pull/3298
No functional change.
Reordered the weights in such a way that the accumulated sums land directly in the output.
Weights are grouped in blocks of four elements because four
int8 (weight type) correspond to one int32 (output type).
No horizontal additions are needed.
The AVX512, AVX2 and SSSE3 implementations were grouped
and the repeated code was removed.
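A hypothetical sketch of such a reordering (an illustrative formula, not the exact permutation used): the four int8 weights feeding one int32 output lane are stored contiguously and in output order, so a maddubs/madd pair produces partial sums already laid out like the output vector.

```cpp
#include <cstdint>
#include <vector>

// Illustrative permutation only; assumes `inputs` is a multiple of 4.
std::vector<std::int8_t> reorder_weights(const std::vector<std::int8_t>& w,
                                         int outputs, int inputs) {
    std::vector<std::int8_t> scrambled(w.size());
    for (int o = 0; o < outputs; ++o)
        for (int i = 0; i < inputs; ++i) {
            const int block = i / 4;   // group of 4 inputs feeding one int32 lane
            const int rest  = i % 4;   // position inside that group
            scrambled[(block * outputs + o) * 4 + rest] = w[o * inputs + i];
        }
    return scrambled;
}
```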
An earlier version passed STC:
LLR: 2.97 (-2.94,2.94) {-0.25,1.25}
Total: 15336 W: 1495 L: 1355 D: 12486
Ptnml(0-2): 44, 1054, 5350, 1158, 62
https://tests.stockfishchess.org/tests/view/5ff60e106019e097de3eefd5
The speedup depends on the architecture; up to 4% was measured on an NNUE-only bench.
closes https://github.com/official-stockfish/Stockfish/pull/3287
No functional change
Improves throughput by summing two intermediate dot products with 16-bit addition before widening to 32 bit.
Potential saturation is detected, and that code path is avoided in such cases.
Saturation cannot happen with the current nets,
but nets can be constructed that trigger this check.
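An AVX2-flavoured sketch of the trick with a hypothetical helper name; the loader has to verify beforehand that the two combined int16 partial sums cannot saturate for the loaded net.

```cpp
#include <immintrin.h>

// Combine two maddubs partial results with one 16-bit add and widen to 32 bit
// only once, instead of doing a madd/add_epi32 pair per dot product.
inline __m256i dot_product_2x(__m256i acc, __m256i a0, __m256i b0,
                              __m256i a1, __m256i b1) {
    __m256i p0 = _mm256_maddubs_epi16(a0, b0);       // u8*s8 pairs -> int16
    __m256i p1 = _mm256_maddubs_epi16(a1, b1);
    __m256i s  = _mm256_add_epi16(p0, p1);           // 16-bit add of the two partials
    s = _mm256_madd_epi16(s, _mm256_set1_epi16(1));  // widen int16 pairs -> int32
    return _mm256_add_epi32(acc, s);
}
```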
STC https://tests.stockfishchess.org/tests/view/5fd40a861ac1691201888479
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 25544 W: 2451 L: 2296 D: 20797
Ptnml(0-2): 92, 1761, 8925, 1888, 106
about 5% speedup
closes https://github.com/official-stockfish/Stockfish/pull/3261
No functional change
This appears to be slightly faster than using a comparison against zero
to compute the high bits, on both old (like Pentium III) and new (like
Zen 2) hardware.
closes https://github.com/official-stockfish/Stockfish/pull/3254
No functional change.
In the affine transform for AVX512/AVX2/SSSE3,
the idea is to initialize the sum with the first element instead of zero.
This saves one add_epi32 and one set_zero SIMD instruction per output dimension.
before: sum = 0;    for i = 1 to n: sum += a[i]
after:  sum = a[1]; for i = 2 to n: sum += a[i]
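An AVX2-flavoured sketch of the same idea with a hypothetical helper and simplified accumulation; the first chunk's product seeds the sum, so no setzero and no extra add are needed for it.

```cpp
#include <immintrin.h>

// Sum one weight row against the input, one 256-bit chunk at a time.
inline __m256i row_sum(const __m256i* in, const __m256i* row, int chunks) {
    __m256i sum = _mm256_madd_epi16(_mm256_maddubs_epi16(in[0], row[0]),
                                    _mm256_set1_epi16(1));      // first chunk seeds the sum
    for (int j = 1; j < chunks; ++j) {
        __m256i p = _mm256_madd_epi16(_mm256_maddubs_epi16(in[j], row[j]),
                                      _mm256_set1_epi16(1));
        sum = _mm256_add_epi32(sum, p);
    }
    return sum;
}
```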
STC:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 69048 W: 7024 L: 6799 D: 55225
Ptnml(0-2): 260, 5175, 23458, 5342, 289
https://tests.stockfishchess.org/tests/view/5faf2cf467cbf42301d6aa06
closes https://github.com/official-stockfish/Stockfish/pull/3227
No functional change.
For the feature transformer the code is analogous to AVX2, since the wider SIMD registers were easy to adapt.
For the smaller affine transforms with a 32-byte stride we keep 2 columns in one zmm register. We also unroll more aggressively, so that in the end we perform 16 parallel horizontal additions on ymm slices of 4 32-bit integers each; the slices are embedded in 8 zmm registers.
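A sketch of the packing idea only (hypothetical helper, AVX512F/BW intrinsics), not the full kernel; the per-column horizontal reductions described above still follow.

```cpp
#include <immintrin.h>
#include <cstdint>

// With a 32-byte row stride, one 64-byte load covers two adjacent weight rows,
// so a single maddubs/madd pair yields int32 partial sums for two output
// columns at once (row k in the low 256 bits, row k+1 in the high 256 bits).
inline __m512i two_columns_partial(const std::uint8_t* input,      // 32 clipped inputs
                                   const std::int8_t*  two_rows) { // 2 x 32 int8 weights
    const __m256i in256 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(input));
    const __m512i in    = _mm512_inserti64x4(_mm512_castsi256_si512(in256), in256, 1);
    const __m512i w     = _mm512_loadu_si512(two_rows);
    const __m512i p     = _mm512_maddubs_epi16(in, w);    // u8*s8 pairs -> int16
    return _mm512_madd_epi16(p, _mm512_set1_epi16(1));    // int16 pairs -> int32
}
```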
These changes provide about 1.5% speedup for AVX-512 builds.
Closes https://github.com/official-stockfish/Stockfish/pull/3218
No functional change.
A non-functional speedup. Unroll the loops going over
the output dimensions in the affine transform layers by
a factor of 4 and perform 4 horizontal additions at a time.
Instead of doing naive horizontal additions on each vector
separately, use hadd and shuffling between vectors to reduce
the number of instructions, using all lanes in all stages
of the horizontal adds.
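A minimal sketch of the 4-at-a-time reduction with a hypothetical helper (bias handling omitted): three hadds plus one cross-lane add turn four 8-lane int32 vectors into their four total sums, keeping every lane busy at every stage.

```cpp
#include <immintrin.h>

// Reduce four __m256i vectors of 8 int32 each to their four total sums,
// packed into one __m128i as [sum(a), sum(b), sum(c), sum(d)].
inline __m128i hadd4_epi32(__m256i a, __m256i b, __m256i c, __m256i d) {
    const __m256i ab   = _mm256_hadd_epi32(a, b);    // pairwise sums of a and b, per lane
    const __m256i cd   = _mm256_hadd_epi32(c, d);
    const __m256i abcd = _mm256_hadd_epi32(ab, cd);  // per 128-bit lane: half-sums of a,b,c,d
    const __m128i lo   = _mm256_castsi256_si128(abcd);
    const __m128i hi   = _mm256_extracti128_si256(abcd, 1);
    return _mm_add_epi32(lo, hi);                    // combine the two lanes
}
```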
passed STC of the initial version:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 17808 W: 1914 L: 1756 D: 14138
Ptnml(0-2): 76, 1330, 5948, 1460, 90
https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3
passed STC of the final version after cleanup:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 16296 W: 1750 L: 1595 D: 12951
Ptnml(0-2): 72, 1192, 5479, 1319, 86
https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3
closes https://github.com/official-stockfish/Stockfish/pull/3203
No functional change
Due to downclocking on current chips supporting avx512 and vnni512 (tested up to Cascade Lake),
it is better to use avx2 or vnni256
for multithreaded (in particular hyperthreaded) engine use.
In single-threaded use, the picture is different.
Compiling for vnni256 with gcc requires a toolchain with gcc >= 9.
closes https://github.com/official-stockfish/Stockfish/pull/3038
No functional change
This patch fixes the byte order when reading 16- and 32-bit values from the network file on a big-endian machine.
Bytes are ordered in read_le() using unsigned arithmetic, which doesn't need tricks to determine the endianness of the machine. Unfortunately the compiler doesn't seem to be able to optimise the ordering operation, but reading in the weights is not a time-critical operation and the extra time it takes should not be noticeable.
Big endian systems are still untested with NNUE.
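A sketch of the idea, assuming this signature for read_le(): the value is rebuilt from its bytes with shifts on an unsigned type, so the host byte order never has to be detected.

```cpp
#include <cstdint>
#include <cstring>
#include <istream>
#include <type_traits>

// Read a little-endian integer from a stream in an endian-agnostic way.
template <typename IntType>
IntType read_le(std::istream& stream) {
    std::uint8_t bytes[sizeof(IntType)];
    stream.read(reinterpret_cast<char*>(bytes), sizeof(IntType));

    typename std::make_unsigned<IntType>::type value = 0;
    for (std::size_t i = 0; i < sizeof(IntType); ++i)
        value |= static_cast<decltype(value)>(bytes[i]) << (8 * i);

    IntType result;
    std::memcpy(&result, &value, sizeof(IntType));   // well-defined conversion to the signed type
    return result;
}
```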
fixes #3007
closes https://github.com/official-stockfish/Stockfish/pull/3009
No functional change.
Adds support for Vector Neural Network Instructions (avx512), as available on Intel Cascade Lake
The _mm512_dpbusd_epi32() intrinsic (vpdpbusd instruction) is tailor-made for NNUE.
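A minimal sketch of why the intrinsic fits NNUE so well (USE_VNNI used here as an assumed build flag): one vpdpbusd replaces the usual maddubs/madd/add sequence for a u8*s8 dot product accumulated into int32.

```cpp
#include <immintrin.h>

// Accumulate a u8*s8 dot product into 32-bit lanes, with and without VNNI.
inline __m512i dot_accumulate(__m512i acc, __m512i a, __m512i b) {
#if defined(USE_VNNI)
    return _mm512_dpbusd_epi32(acc, a, b);                        // fused multiply-accumulate
#else
    __m512i product = _mm512_maddubs_epi16(a, b);                 // u8*s8 pairs -> int16
    product = _mm512_madd_epi16(product, _mm512_set1_epi16(1));   // int16 pairs -> int32
    return _mm512_add_epi32(acc, product);
#endif
}
```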
On a Cascade Lake CPU (AWS c5.24xlarge, gcc 10), NNUE eval reaches roughly 78% of the nps of classical eval
(single-core test).
bench 1024 1 24 default depth:

target    classical    NNUE       ratio
vnni      2207232      1725987    78.20
avx512    2216789      1671734    75.41
avx2      2194006      1611263    73.44
modern    2185001      1352469    61.90
closes https://github.com/official-stockfish/Stockfish/pull/2987
No functional change
This patch allows old x86 CPUs, from the AMD K8 (which the x86-64 baseline
targets) all the way down to the Pentium MMX, to benefit from NNUE with a
performance hit versus the hand-written eval comparable to that seen on more
modern processors.
NPS of the bench with NNUE enabled on a Pentium III 1.13 GHz (using the
MMX code):
master: 38951
this patch: 80586
NPS of the bench with NNUE enabled using baseline x86-64 arch, which is
how linux distros are likely to package stockfish, on a modern CPU
(using the SSE2 code):
master: 882584
this patch: 1203945
closes https://github.com/official-stockfish/Stockfish/pull/2956
No functional change.
Despite the use of alignas, the generated avx2/avx512 code needs to use
unaligned loads with older gcc (e.g. confirmed crash with gcc 7.3/mingw on abrok).
Best performance thus requires gcc >= 9 on hardware supporting avx2/avx512.
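A sketch of the kind of guard this implies, with illustrative macro names and an assumed gcc-version check; older gcc gets unaligned loads/stores, newer compilers keep the aligned ones.

```cpp
#include <immintrin.h>

// Illustrative only: select unaligned memory accesses on older gcc, which may
// not generate correct aligned moves for the alignas-qualified buffers.
#if defined(USE_AVX2) && defined(__GNUC__) && !defined(__clang__) && (__GNUC__ < 9)
  #define vec_load(p)      _mm256_loadu_si256(p)
  #define vec_store(p, v)  _mm256_storeu_si256(p, v)
#else
  #define vec_load(p)      _mm256_load_si256(p)
  #define vec_store(p, v)  _mm256_store_si256(p, v)
#endif
```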
closes https://github.com/official-stockfish/Stockfish/pull/2969
No functional change
This patch ports the efficiently updatable neural network (NNUE) evaluation to Stockfish.
Both the NNUE and the classical evaluations are available, and can be used to
assign a value to a position that is later used in alpha-beta (PVS) search to find the
best move. The classical evaluation computes this value as a function of various chess
concepts, handcrafted by experts, tested and tuned using fishtest. The NNUE evaluation
computes this value with a neural network based on basic inputs. The network is optimized
and trained on the evaluations of millions of positions at moderate search depth.
The NNUE evaluation was first introduced in shogi, and ported to Stockfish afterward.
It can be evaluated efficiently on CPUs, and exploits the fact that only parts
of the neural network need to be updated after a typical chess move.
[The nodchip repository](https://github.com/nodchip/Stockfish) provides additional
tools to train and develop the NNUE networks.
This patch is the result of contributions of various authors, from various communities,
including: nodchip, ynasu87, yaneurao (initial port and NNUE authors), domschl, FireFather,
rqs, xXH4CKST3RXx, tttak, zz4032, joergoster, mstembera, nguyenpham, erbsenzaehler,
dorzechowski, and vondele.
This new evaluation needed various changes to fishtest and the corresponding infrastructure,
for which tomtor, ppigazzini, noobpwnftw, daylen, and vondele are gratefully acknowledged.
The first networks have been provided by gekkehenker and sergiovieri, with the latter
net (nn-97f742aaefcd.nnue) being the current default.
The evaluation function can be selected at run time with the `Use NNUE` (true/false) UCI option,
provided the `EvalFile` option points to the network file (depending on the GUI, with full path).
The performance of the NNUE evaluation relative to the classical evaluation depends somewhat on
the hardware, and is expected to improve quickly, but currently stands at more than 80 Elo on fishtest:
60000 @ 10+0.1 th 1
https://tests.stockfishchess.org/tests/view/5f28fe6ea5abc164f05e4c4c
ELO: 92.77 +-2.1 (95%) LOS: 100.0%
Total: 60000 W: 24193 L: 8543 D: 27264
Ptnml(0-2): 609, 3850, 9708, 10948, 4885
40000 @ 20+0.2 th 8
https://tests.stockfishchess.org/tests/view/5f290229a5abc164f05e4c58
ELO: 89.47 +-2.0 (95%) LOS: 100.0%
Total: 40000 W: 12756 L: 2677 D: 24567
Ptnml(0-2): 74, 1583, 8550, 7776, 2017
At the same time, the impact on the classical evaluation remains minimal, causing no significant
regression:
sprt @ 10+0.1 th 1
https://tests.stockfishchess.org/tests/view/5f2906a2a5abc164f05e4c5b
LLR: 2.94 (-2.94,2.94) {-6.00,-4.00}
Total: 34936 W: 6502 L: 6825 D: 21609
Ptnml(0-2): 571, 4082, 8434, 3861, 520
sprt @ 60+0.6 th 1
https://tests.stockfishchess.org/tests/view/5f2906cfa5abc164f05e4c5d
LLR: 2.93 (-2.94,2.94) {-6.00,-4.00}
Total: 10088 W: 1232 L: 1265 D: 7591
Ptnml(0-2): 49, 914, 3170, 843, 68
The needed networks can be found at https://tests.stockfishchess.org/nns
It is recommended to use the default one as indicated by the `EvalFile` UCI option.
Guidelines for testing new nets can be found at
https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#nnue-net-tests
Integration has been discussed in various issues:
https://github.com/official-stockfish/Stockfish/issues/2823
https://github.com/official-stockfish/Stockfish/issues/2728
The integration branch will be closed after the merge:
https://github.com/official-stockfish/Stockfish/pull/2825
https://github.com/official-stockfish/Stockfish/tree/nnue-player-wip
closes https://github.com/official-stockfish/Stockfish/pull/2912
This will be an exciting time for computer chess, looking forward to seeing the evolution of
this approach.
Bench: 4746616