This patch optimizes the NEON implementation in two ways.
The activation layer after the feature transformer is rewritten to make it easier for the compiler to see through dependencies and unroll. In itself this is a minimal but positive improvement, and other architectures could benefit from it too in the future. This is not an algorithmic change.
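As a rough illustration of the kind of restructuring meant here (a minimal sketch, not the actual Stockfish code; the scaling shift is omitted and the function name is made up), each iteration works on independent chunks so the compiler sees no cross-iteration dependency chain and is free to unroll:

    #include <arm_neon.h>
    #include <cstdint>

    // Sketch: clamp 16-bit pre-activations to [0, 127] and narrow to 8 bits.
    // The two loads per iteration are independent of each other and of the
    // previous iteration, which makes unrolling and scheduling easy.
    void clipped_relu_neon(const std::int16_t* in, std::uint8_t* out, int count) {
        for (int i = 0; i < count; i += 16) {
            int16x8_t a = vld1q_s16(in + i);      // values 0..7
            int16x8_t b = vld1q_s16(in + i + 8);  // values 8..15
            // Saturating narrow to int8, then clamp negatives to zero.
            int8x16_t packed  = vcombine_s8(vqmovn_s16(a), vqmovn_s16(b));
            int8x16_t clipped = vmaxq_s8(packed, vdupq_n_s8(0));
            vst1q_u8(out + i, vreinterpretq_u8_s8(clipped));
        }
    }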
The affine transform for large matrices (the first layer after the FT) on NEON now uses the same optimized code path as >=SSSE3. This makes the memory accesses more sequential and makes better use of the available registers, allowing for code with longer dependency chains.
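The idea behind that code path, as a hedged sketch (assumed function name and plain row-major weight layout, unlike the real code, which permutes the weights): load each input chunk once, reuse it for several output rows, and keep several independent accumulators live so the long multiply-accumulate chains can overlap.

    #include <arm_neon.h>
    #include <cstdint>

    // Sketch: four outputs of out[r] = bias[r] + dot(in, weights[r]) computed
    // together. The input chunk is loaded once and reused for all four rows,
    // and the four accumulators are independent, so the latency of one row's
    // chain is hidden behind the others.
    void affine_4rows_neon(const std::int8_t* in, const std::int8_t* weights,
                           const std::int32_t* bias, std::int32_t* out,
                           int num_inputs) {
        int32x4_t acc[4] = {vdupq_n_s32(0), vdupq_n_s32(0),
                            vdupq_n_s32(0), vdupq_n_s32(0)};
        for (int i = 0; i < num_inputs; i += 8) {
            const int8x8_t v = vld1_s8(in + i);
            for (int r = 0; r < 4; ++r) {
                const int8x8_t  w = vld1_s8(weights + r * num_inputs + i);
                const int16x8_t p = vmull_s8(v, w);  // 8 products, 16-bit
                acc[r] = vpadalq_s16(acc[r], p);     // pairwise add into 32-bit lanes
            }
        }
        for (int r = 0; r < 4; ++r)
            out[r] = bias[r] + vaddvq_s32(acc[r]);   // horizontal sum (AArch64 only)
    }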
Benchmarks from Redshift#161, profile-build with Apple Clang:
george@Georges-MacBook-Air nets % ./stockfish-b82d93 bench 2>&1 | tail -4 (current master)
===========================
Total time (ms) : 2167
Nodes searched : 4667742
Nodes/second : 2154011
george@Georges-MacBook-Air nets % ./stockfish-7377b8 bench 2>&1 | tail -4 (this patch)
===========================
Total time (ms) : 1842
Nodes searched : 4667742
Nodes/second : 2534061
This is a solid 18% improvement overall; the gain is even larger in an NNUE-only bench, as opposed to a mixed one.
An improvement is also observed on armv7-neon (Raspberry Pi and older phones), around a 5% speedup.
No changes for architectures other than NEON.
closes https://github.com/official-stockfish/Stockfish/pull/3837
No functional changes.
In their infinite wisdom, Intel axed AVX512 from Alder Lake
chips (well, not entirely, but we kind of want to use the Gracemont
cores for chess!) but still added VNNI support.
Confusingly enough, this is not the same as VNNI256 support.
This adds a specific AVX-VNNI target that uses this AVX-VNNI
mode by encoding the VNNI instructions with the appropriate VEX
prefix, while avoiding AVX512 usage.
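For illustration only (the intrinsic name and the __AVXVNNI__ macro below come from GCC/Clang's avxvnniintrin.h, enabled with -mavxvnni; this is an assumption about the toolchain, not a quote of the Stockfish code): the accumulation primitive is the same uint8 x int8 dot product, only the encoding differs.

    #include <immintrin.h>

    // Sketch: multiply unsigned 8-bit activations by signed 8-bit weights and
    // accumulate into 32-bit lanes. With AVX512-VNNI this is the EVEX-encoded
    // _mm256_dpbusd_epi32; the AVX-VNNI variant keeps a VEX encoding, so it
    // also runs on the Gracemont (E) cores, which have no AVX512.
    inline __m256i dot_accum(__m256i acc, __m256i a, __m256i b) {
    #if defined(__AVXVNNI__)
        return _mm256_dpbusd_avx_epi32(acc, a, b);  // VEX-encoded vpdpbusd
    #else
        return _mm256_dpbusd_epi32(acc, a, b);      // AVX512VL + VNNI form
    #endif
    }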
This is about 1% faster on P cores:
Result of 20 runs
==================
base (./clang-bmi2 ) = 3306337 +/- 7519
test (./clang-vnni ) = 3344226 +/- 7388
diff = +37889 +/- 4153
speedup = +0.0115
P(speedup > 0) = 1.0000
But a nice 3% faster on E cores:
Result of 20 runs
==================
base (./clang-bmi2 ) = 1938054 +/- 28257
test (./clang-vnni ) = 1994606 +/- 31756
diff = +56552 +/- 3735
speedup = +0.0292
P(speedup > 0) = 1.0000
This was measured on Clang 13. GCC 11.2 appears to generate
worse code for Alder Lake, though the speedup on the E cores
is similar.
It is possible to run the engine specifically on the P or E cores using binding,
for example on Linux it is possible to use (for an 8 P + 8 E setup like the i9-12900K):
taskset -c 0-15 ./stockfish
taskset -c 16-23 ./stockfish
where the first call binds to the P-cores and the second to the E-cores.
closes https://github.com/official-stockfish/Stockfish/pull/3824
No functional change
The new network caused some issues initially due to the very narrow neuron set between the first two FC layers. Necessary changes were hacked together to make it work. This patch is a more mature approach that makes the affine transform code faster, more readable, and easier to maintain should the layer sizes change again.
The following changes were made:
* ClippedReLU always produces a multiple of 32 outputs. This is about as good a solution for AffineTransform's SIMD requirements as it can get without a bigger rewrite.
* All self-contained simd helpers are moved to a separate file (simd.h). Inline asm is utilized to work around GCC's issues with code generation and register assignment. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101693, https://godbolt.org/z/da76fY1n7
* AffineTransform has 2 specializations. While it's more lines of code due to the boilerplate, the logic in each is significantly simpler, as the two cases are impossible to combine nicely into one.
1) The first specialization is for cases when there are >=128 inputs. It uses a different approach to perform the affine transform and can make full use of AVX512 without any edge cases. Furthermore, it has higher theoretical throughput because fewer loads are needed in the hot path, and only a fixed number of instructions is required for the horizontal additions at the end, which is amortized by the large number of inputs (see the sketch after this list).
2) The second specialization is made to handle smaller layers where performance is still necessary but edge cases need to be handled. The AVX512 implementation for this was omitted by mistake, a remnant from the temporary implementation for the new... It could easily be reintroduced if needed. A slightly more detailed description of both implementations is in the code.
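A hedged sketch of the large-input idea from point 1 above (AVX2 for brevity, made-up helper names, plain row-major weights rather than the permuted layout the real code uses): keep one independent accumulator per output in the hot loop and pay for the horizontal reduction exactly once at the end.

    #include <immintrin.h>
    #include <cstdint>

    // Reduce four 8-lane accumulators to four 32-bit sums and add the biases.
    // This fixed-cost tail is what gets amortized over the >=128 inputs.
    static __m128i hadd4(__m256i a, __m256i b, __m256i c, __m256i d, __m128i bias) {
        const __m256i ab   = _mm256_hadd_epi32(a, b);
        const __m256i cd   = _mm256_hadd_epi32(c, d);
        const __m256i abcd = _mm256_hadd_epi32(ab, cd);
        const __m128i sum  = _mm_add_epi32(_mm256_castsi256_si128(abcd),
                                           _mm256_extracti128_si256(abcd, 1));
        return _mm_add_epi32(sum, bias);
    }

    // Four outputs at a time: one input load per chunk, four independent
    // multiply-accumulate chains, no horizontal work inside the loop.
    void affine_large_avx2(const std::uint8_t* in, const std::int8_t* weights,
                           const std::int32_t* biases, std::int32_t* out,
                           int num_inputs) {
        const __m256i ones = _mm256_set1_epi16(1);
        __m256i acc0 = _mm256_setzero_si256(), acc1 = acc0, acc2 = acc0, acc3 = acc0;
        for (int i = 0; i < num_inputs; i += 32) {
            const __m256i v = _mm256_loadu_si256((const __m256i*)(in + i));
            const auto madd = [&](const std::int8_t* w) {
                const __m256i p =
                    _mm256_maddubs_epi16(v, _mm256_loadu_si256((const __m256i*)w));
                return _mm256_madd_epi16(p, ones);  // widen pair sums to 32-bit
            };
            acc0 = _mm256_add_epi32(acc0, madd(weights + 0 * num_inputs + i));
            acc1 = _mm256_add_epi32(acc1, madd(weights + 1 * num_inputs + i));
            acc2 = _mm256_add_epi32(acc2, madd(weights + 2 * num_inputs + i));
            acc3 = _mm256_add_epi32(acc3, madd(weights + 3 * num_inputs + i));
        }
        const __m128i bias = _mm_loadu_si128((const __m128i*)biases);
        _mm_storeu_si128((__m128i*)out, hadd4(acc0, acc1, acc2, acc3, bias));
    }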
Overall it should be a minor speedup, as shown on fishtest:
passed STC:
LLR: 2.96 (-2.94,2.94) <-0.50,2.50>
Total: 51520 W: 4074 L: 3888 D: 43558
Ptnml(0-2): 111, 3136, 19097, 3288, 128
and various tests shown in the pull request
closes https://github.com/official-stockfish/Stockfish/pull/3663
No functional change