1
0
Fork 0
mirror of https://github.com/sockspls/badfish synced 2025-05-03 18:19:35 +00:00
Commit graph

2 commits

Author SHA1 Message Date
Gian-Carlo Pascutto
c9977aa0a8 Add AVX-VNNI support for Alder Lake and later.
In their infinite wisdom, Intel axed AVX512 from Alder Lake
chips (well, not entirely, but we kind of want to use the Gracemont
cores for chess!) but still added VNNI support.
Confusingly enough, this is not the same as VNNI256 support.

This adds a specific AVX-VNNI target that will use this AVX-VNNI
mode, by prefixing the VNNI instructions with the appropriate VEX
prefix, and avoiding AVX512 usage.

This is about 1% faster on P cores:

Result of  20 runs
==================
base (./clang-bmi2   ) =    3306337  +/- 7519
test (./clang-vnni   ) =    3344226  +/- 7388
diff                   =     +37889  +/- 4153

speedup        = +0.0115
P(speedup > 0) =  1.0000

But a nice 3% faster on E cores:

Result of  20 runs
==================
base (./clang-bmi2   ) =    1938054  +/- 28257
test (./clang-vnni   ) =    1994606  +/- 31756
diff                   =     +56552  +/- 3735

speedup        = +0.0292
P(speedup > 0) =  1.0000

This was measured on Clang 13. GCC 11.2 appears to generate
worse code for Alder Lake, though the speedup on the E cores
is similar.

It is possible to run the engine specifically on the P or E using binding,
for example in linux it is possible to use (for an 8 P + 8 E setup like i9-12900K):
taskset -c 0-15 ./stockfish
taskset -c 16-23 ./stockfish
where the first call binds to the P-cores and the second to the E-cores.

closes https://github.com/official-stockfish/Stockfish/pull/3824

No functional change
2021-12-03 08:51:06 +01:00
Tomasz Sobczyk
18dcf1f097 Optimize and tidy up affine transform code.
The new network caused some issues initially due to the very narrow neuron set between the first two FC layers. Necessary changes were hacked together to make it work. This patch is a mature approach to make the affine transform code faster, more readable, and easier to maintain should the layer sizes change again.

The following changes were made:

* ClippedReLU always produces a multiple of 32 outputs. This is about as good of a solution for AffineTransform's SIMD requirements as it can get without a bigger rewrite.

* All self-contained simd helpers are moved to a separate file (simd.h). Inline asm is utilized to work around GCC's issues with code generation and register assignment. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101693, https://godbolt.org/z/da76fY1n7

* AffineTransform has 2 specializations. While it's more lines of code due to the boilerplate, the logic in both is significantly reduced, as these two are impossible to nicely combine into one.
 1) The first specialization is for cases when there's >=128 inputs. It uses a different approach to perform the affine transform and can make full use of AVX512 without any edge cases. Furthermore, it has higher theoretical throughput because less loads are needed in the hot path, requiring only a fixed amount of instructions for horizontal additions at the end, which are amortized by the large number of inputs.
 2) The second specialization is made to handle smaller layers where performance is still necessary but edge cases need to be handled. AVX512 implementation for this was ommited by mistake, a remnant from the temporary implementation for the new... This could be easily reintroduced if needed. A slightly more detailed description of both implementations is in the code.

Overall it should be a minor speedup, as shown on fishtest:

passed STC:
LLR: 2.96 (-2.94,2.94) <-0.50,2.50>
Total: 51520 W: 4074 L: 3888 D: 43558
Ptnml(0-2): 111, 3136, 19097, 3288, 128

and various tests shown in the pull request

closes https://github.com/official-stockfish/Stockfish/pull/3663

No functional change
2021-08-20 08:50:25 +02:00