BadFish

mirror of https://github.com/sockspls/badfish synced 2025-05-01 17:19:36 +00:00

Author	SHA1	Message	Date
Sebastian Buchwald	77dfcbedce	Remove unused macros closes https://github.com/official-stockfish/Stockfish/pull/4397 No functional change	2023-02-23 13:24:37 +01:00
Sebastian Buchwald	b4ad3a3c4b	Add support for ARM dot product instructions The sdot instruction computes (and accumulates) a signed dot product, which is quite handy for Stockfish's NNUE code. The instruction is optional for Armv8.2 and Armv8.3, and mandatory for Armv8.4 and above. The commit adds a new 'arm-dotprod' architecture with enabled dot product support. It also enables dot product support for the existing 'apple-silicon' architecture, which is at least Armv8.5. The following local speed test was performed on an Apple M1 with ARCH=apple-silicon. I had to remove CPU pinning from the benchmark script. However, the results were still consistent: Checking both binaries against themselves reported a speedup of +0.0000 and +0.0005, respectively. ``` Result of 100 runs ================== base (...ish.037ef3e1) = 1917997 +/- 7152 test (...fish.dotprod) = 2159682 +/- 9066 diff = +241684 +/- 2923 speedup = +0.1260 P(speedup > 0) = 1.0000 CPU: 10 x arm Hyperthreading: off ``` Fixes #4193 closes https://github.com/official-stockfish/Stockfish/pull/4400 No functional change	2023-02-23 13:22:03 +01:00
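The dot-product-and-accumulate behaviour described in the commit above can be modelled in portable scalar C++. This is a minimal sketch of the arithmetic only, assuming a 128-bit vector of sixteen signed 8-bit values grouped into four 32-bit lanes. The type aliases and the function name `dot_accumulate` are hypothetical and do not come from the Stockfish sources.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-ins for 128-bit vector registers.
using Int8x16 = std::array<std::int8_t, 16>;
using Int32x4 = std::array<std::int32_t, 4>;

// Scalar model of a signed dot product with accumulation: each 32-bit lane
// of `acc` receives the sum of four int8 x int8 products taken from the
// corresponding four bytes of `a` and `b`.
inline Int32x4 dot_accumulate(Int32x4 acc, const Int8x16& a, const Int8x16& b) {
    for (int lane = 0; lane < 4; ++lane)
        for (int k = 0; k < 4; ++k)
            acc[lane] += std::int32_t(a[4 * lane + k]) * std::int32_t(b[4 * lane + k]);
    return acc;
}
```

On hardware with the dot product extension, the sixteen multiplications and four additions per call collapse into one instruction, which is presumably where the measured speedup comes from.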
MinetaS	2c36d1e7e7	Fix overflow in add_dpbusd_epi32x2 This patch fixes 16bit overflow in _add_dpbusd_epi32x2 functions, that can be triggered in rare cases depending on the NNUE weights. While the code leads to some slowdown on affected architectures (most notably avx2), the fix is simpler than some of the other options discussed in https://github.com/official-stockfish/Stockfish/pull/4394 Code suggested by Sopel97. Result of "bench 4096 1 30 default depth nnue": \| Architecture \| master \| patch (gcc) \| patch (clang) \| \|---------------------\|-----------\|-------------\|---------------\| \| x86-64-vnni512 \| 762122798 \| 762122798 \| 762122798 \| \| x86-64-avx512 \| 769723503 \| 762122798 \| 762122798 \| \| x86-64-bmi2 \| 769723503 \| 762122798 \| 762122798 \| \| x86-64-ssse3 \| 769723503 \| 762122798 \| 762122798 \| \| x86-64 \| 762122798 \| 762122798 \| 762122798 \| Following architectures will experience ~4% slowdown due to an additional instruction in the middle of hot path: x86-64-avx512 * x86-64-bmi2 * x86-64-avx2 * x86-64-sse41-popcnt (x86-64-modern) * x86-64-ssse3 * x86-32-sse41-popcnt This patch clearly loses Elo against master with both STC and LTC. Failed non-regression STC (256bit fix only): LLR: -2.95 (-2.94,2.94) <-1.75,0.25> Total: 33528 W: 8769 L: 9049 D: 15710 Ptnml(0-2): 96, 3616, 9600, 3376, 76 https://tests.stockfishchess.org/tests/view/63e6a5b44299542b1e26a485 60+0.6 @ 30000 games: Elo: -1.67 +-1.7 (95%) LOS: 2.8% Total: 30000 W: 7848 L: 7992 D: 14160 Ptnml(0-2): 12, 2847, 9436, 2683, 22 nElo: -3.84 +-3.9 (95%) PairsRatio: 0.95 https://tests.stockfishchess.org/tests/view/63e7ac716d0e1db55f35a660 However, a test against nn-a3dc078bafc7.nnue, which is the latest "safe" network not causing the bug, passed with regular bounds. 
Passed STC: LLR: 2.94 (-2.94,2.94) <0.00,2.00> Total: 160456 W: 42658 L: 42175 D: 75623 Ptnml(0-2): 487, 17638, 43469, 18173, 461 https://tests.stockfishchess.org/tests/view/63e89836d62a5d02b0fa82c8 closes https://github.com/official-stockfish/Stockfish/pull/4391 closes https://github.com/official-stockfish/Stockfish/pull/4394 No functional change	2023-02-18 13:23:18 +01:00
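The kind of 16-bit overflow this commit refers to can be illustrated with a scalar model. The sketch below assumes that the problem comes from combining two 16-bit partial sums with saturation before widening to 32 bits; the commit message does not spell this out, so treat it as an assumption. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Clamp a 32-bit value into the signed 16-bit range.
inline std::int16_t saturate16(std::int32_t v) {
    return std::int16_t(std::clamp<std::int32_t>(v, INT16_MIN, INT16_MAX));
}

// Models one 16-bit lane of an unsigned-by-signed byte multiply that adds
// two adjacent products with signed saturation.
inline std::int16_t pair_product(std::uint8_t a0, std::int8_t b0,
                                 std::uint8_t a1, std::int8_t b1) {
    return saturate16(std::int32_t(a0) * b0 + std::int32_t(a1) * b1);
}

// Assumed problematic variant: the two partial sums are added in 16 bits.
inline std::int32_t accumulate_narrow(std::int32_t acc, std::int16_t p0, std::int16_t p1) {
    return acc + saturate16(std::int32_t(p0) + std::int32_t(p1));
}

// Widening variant: each partial sum is extended to 32 bits before the add.
inline std::int32_t accumulate_wide(std::int32_t acc, std::int16_t p0, std::int16_t p1) {
    return acc + std::int32_t(p0) + std::int32_t(p1);
}
```

With inputs of 127 and weights of 127, a single pair sum is 32258 and still fits in 16 bits, but two of them do not. The narrow variant then clips at 32767 while the wide variant yields 64516, at the cost of an extra widening step on the hot path.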
Sebastian Buchwald	2f67409506	Remove redundant const qualifiers The const qualifiers are already implied by the constexpr qualifiers. closes https://github.com/official-stockfish/Stockfish/pull/4359 No functional change	2023-01-28 16:49:27 +01:00
Sebastian Buchwald	2167942b6e	Simplify functions to read/write network parameters closes https://github.com/official-stockfish/Stockfish/pull/4358 No functional change	2023-01-28 16:47:52 +01:00
Sebastian Buchwald	da5bcec481	Fix asm modifiers in add_dpbusd_epi32x2 implementations The accumulator should be an earlyclobber because it is written before all input operands are read. Otherwise, the asm code computes a wrong result if the accumulator shares a register with one of the other input operands (which happens if we pass in the same expression for the accumulator and the operand). Closes https://github.com/official-stockfish/Stockfish/pull/4339 No functional change	2023-01-22 10:51:02 +01:00
Sebastian Buchwald	4f4e652eca	Avoid unnecessary string copies closes https://github.com/official-stockfish/Stockfish/pull/4326 also fixes typo, closes https://github.com/official-stockfish/Stockfish/pull/4332 No functional change	2023-01-09 20:32:58 +01:00
Sebastian Buchwald	e9e7a7b83f	Replace some std::string occurrences with std::string_view std::string_view is more lightweight than std::string. Furthermore, std::string_view variables can be declared constexpr. closes https://github.com/official-stockfish/Stockfish/pull/4328 No functional change	2023-01-09 20:28:24 +01:00
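As a small illustration of the `constexpr` point made in the commit above, the following sketch declares a compile-time string constant and checks it at compile time. The constant `OptionPrefix` and the helper `has_prefix` are hypothetical names chosen for this example, and C++17 is assumed.

```cpp
#include <string_view>

// Hypothetical compile-time constant; a std::string could not be constexpr here.
constexpr std::string_view OptionPrefix = "setoption name ";

// Returns true if `text` begins with `prefix`; no allocation or copy is made.
constexpr bool has_prefix(std::string_view text, std::string_view prefix) {
    return text.substr(0, prefix.size()) == prefix;
}

// Evaluated entirely by the compiler.
static_assert(has_prefix("setoption name Hash value 64", OptionPrefix));
```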
Stefano Di Martino	5a88c5bb9b	Modernize code base a little bit Removed sprintf() which generated a warning, because of security reasons. Replace NULL with nullptr Replace typedef with using Do not inherit from std::vector. Use composition instead. optimize mutex-unlocking closes https://github.com/official-stockfish/Stockfish/pull/4327 No functional change	2023-01-09 20:25:13 +01:00
Sebastian Buchwald	31acd6bab7	Warn if a global function has no previous declaration If a global function has no previous declaration, either the declaration is missing in the corresponding header file or the function should be declared static. Static functions are local to the translation unit, which allows the compiler to apply some optimizations earlier (when compiling the translation unit rather than during link-time optimization). The commit enables the warning for gcc, clang, and mingw. It also fixes the reported warnings by declaring the functions static or by adding a header file (benchmark.h). closes https://github.com/official-stockfish/Stockfish/pull/4325 No functional change	2023-01-09 20:18:39 +01:00
Sebastian Buchwald	b60f9cc451	Update copyright years Happy New Year! closes https://github.com/official-stockfish/Stockfish/pull/4315 No functional change	2023-01-02 19:07:38 +01:00
Joost VandeVondele	ad2aa8c06f	Normalize evaluation Normalizes the internal value as reported by evaluate or search to the UCI centipawn result used in output. This value is derived from the win_rate_model() such that Stockfish outputs an advantage of "100 centipawns" for a position if the engine has a 50% probability to win from this position in selfplay at fishtest LTC time control. The reason to introduce this normalization is that our evaluation is, since NNUE, no longer related to the classical parameter PawnValueEg (=208). This leads to the current evaluation changing quite a bit from release to release, for example, the eval needed to have 50% win probability at fishtest LTC (in cp and internal Value): June 2020 : 113cp (237) June 2021 : 115cp (240) April 2022 : 134cp (279) July 2022 : 167cp (348) With this patch, a 100cp advantage will have a fixed interpretation, i.e. a 50% win chance. To keep this value steady, it will be needed to update the win_rate_model() from time to time, based on fishtest data. This analysis can be performed with a set of scripts currently available at https://github.com/vondele/WLD_model fixes https://github.com/official-stockfish/Stockfish/issues/4155 closes https://github.com/official-stockfish/Stockfish/pull/4216 No functional change	2022-11-05 09:15:53 +01:00
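The arithmetic behind the figures quoted in the commit above can be checked with a short sketch. It assumes that the old centipawn output was the internal value scaled by `PawnValueEg`, which is consistent with the four (cp, internal) pairs listed. The function names are hypothetical, and the normalization constant is passed in as a parameter because the value used by the patch is not stated in the message.

```cpp
// Classical endgame pawn value quoted in the commit message.
constexpr int PawnValueEg = 208;

// Old-style conversion: internal value expressed in units of PawnValueEg.
constexpr int legacy_cp(int internalValue) {
    return internalValue * 100 / PawnValueEg;
}

// Normalized conversion: `fiftyPercentValue` is the internal value at which
// the win rate model reports a 50% win probability (assumed input).
constexpr int normalized_cp(int internalValue, int fiftyPercentValue) {
    return internalValue * 100 / fiftyPercentValue;
}

// July 2022 figures from the message: internal 348 was shown as 167cp.
static_assert(legacy_cp(348) == 167);
static_assert(normalized_cp(348, 348) == 100);
```

Under this scheme only the normalization constant has to be refitted when the evaluation scale drifts, while "100cp" keeps its meaning.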
Clausable	8333b2a94c	Fix README typos, update AUTHORS closes https://github.com/official-stockfish/Stockfish/pull/4208 No functional change	2022-10-27 08:15:46 +02:00
mstembera	93f71ecfe1	Optimize make_index() using templates and lookup tables. https://tests.stockfishchess.org/tests/view/634517e54bc7650f07542f99 LLR: 2.94 (-2.94,2.94) <0.00,2.00> Total: 642672 W: 171819 L: 170658 D: 300195 Ptnml(0-2): 2278, 68077, 179416, 69336, 2229 this also introduces `-flto-partition=one` as suggested by MinetaS (Syine Mineta) to avoid linking errors due to LTO on 32 bit mingw. This change was tested in isolation as well https://tests.stockfishchess.org/tests/view/634aacf84bc7650f0755188b LLR: 2.94 (-2.94,2.94) <-1.75,0.25> Total: 119352 W: 31986 L: 31862 D: 55504 Ptnml(0-2): 439, 12624, 33400, 12800, 413 closes https://github.com/official-stockfish/Stockfish/pull/4199 No functional change	2022-10-16 11:42:19 +02:00
mstembera	82bb21dc7a	Optimize AVX2 path in NNUE evaluation by always selecting the AffineTransform specialization for small inputs. A related patch was initially tested as a simplification STC https://tests.stockfishchess.org/tests/view/6317c3f437f41b13973d6dff LLR: 2.95 (-2.94,2.94) <-1.75,0.25> Total: 58072 W: 15619 L: 15425 D: 27028 Ptnml(0-2): 241, 6191, 15992, 6357, 255 Elo gain speedup test STC https://tests.stockfishchess.org/tests/view/63181c1b37f41b13973d79dc LLR: 2.94 (-2.94,2.94) <0.00,2.00> Total: 184496 W: 49922 L: 49401 D: 85173 Ptnml(0-2): 851, 19397, 51208, 19964, 828 and this patch gained in testing speedup = +0.0071 P(speedup > 0) = 1.0000 on CPU: 16 x AMD Ryzen 9 3950X closes https://github.com/official-stockfish/Stockfish/pull/4158 No functional change	2022-09-11 14:19:57 +02:00
Dubslow	442c40b43d	Use NNUE complexity in search, retune related parameters This builds on ideas of xoto10 and mstembera to use more output from NNUE in the search algorithm. passed STC: https://tests.stockfishchess.org/tests/view/62ae454fe7ee5525ef88a957 LLR: 2.95 (-2.94,2.94) <0.00,2.50> Total: 89208 W: 24127 L: 23753 D: 41328 Ptnml(0-2): 400, 9886, 23642, 10292, 384 passed LTC: https://tests.stockfishchess.org/tests/view/62acc6ddd89eb6cf1e0750a1 LLR: 2.93 (-2.94,2.94) <0.50,3.00> Total: 56352 W: 15430 L: 15115 D: 25807 Ptnml(0-2): 44, 5501, 16782, 5794, 55 closes https://github.com/official-stockfish/Stockfish/pull/4088 bench 5332964	2022-06-20 08:30:57 +02:00
xoto10	7f1333ccf8	Blend nnue complexity with classical. Following mstembera's test of the complexity value derived from nnue values, this change blends that idea with the old complexity calculation. STC 10+0.1: LLR: 2.95 (-2.94,2.94) <0.00,2.50> Total: 42320 W: 11436 L: 11148 D: 19736 Ptnml(0-2): 209, 4585, 11263, 4915, 188 https://tests.stockfishchess.org/tests/live_elo/6295c9239c8c2fcb2bad7fd9 LTC 60+0.6: LLR: 2.98 (-2.94,2.94) <0.50,3.00> Total: 34600 W: 9393 L: 9125 D: 16082 Ptnml(0-2): 32, 3323, 10319, 3597, 29 https://tests.stockfishchess.org/tests/view/6295fd5d9c8c2fcb2bad88cf closes https://github.com/official-stockfish/Stockfish/pull/4046 Bench 6078140	2022-06-02 07:47:23 +02:00
Giacomo Lorenzetti	f7d1491b3d	Assorted small cleanups closes https://github.com/official-stockfish/Stockfish/pull/3973 No functional change	2022-05-29 18:42:48 +02:00
Tomasz Sobczyk	c079acc26f	Update NNUE architecture to SFNNv5. Update network to nn-3c0aa92af1da.nnue. Architecture changes: Duplicated activation after the 1024->15 layer with squared crelu (so 15->15*2). As proposed by vondele. Trainer changes: Added bias to L1 factorization, which was previously missing (no measurable improvement but at least neutral in principle) For retraining linearly reduce lambda parameter from 1.0 at epoch 0 to 0.75 at epoch 800. reduce max_skipping_rate from 15 to 10 (compared to vondele's outstanding PR) Note: This network was trained with a ~0.8% error in quantization regarding the newly added activation function. This will be fixed in the released trainer version. Expect a trainer PR tomorrow. Note: The inference implementation cuts a corner to merge results from two activation functions. This could possibly be resolved nicer in the future. AVX2 implementation likely not necessary, but NEON is missing. First training session invocation: python3 train.py \ ../nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \ ../nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \ --gpus "$3," \ --threads 4 \ --num-workers 8 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=1.0 \ --max_epochs=400 \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 Second training session invocation: python3 train.py \ ../nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \ ../nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \ --gpus "$3," \ --threads 4 \ --num-workers 8 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --start-lambda=1.0 \ --end-lambda=0.75 \ --gamma=0.995 \ --lr=4.375e-4 \ --max_epochs=800 \ --resume-from-model /data/sopel/nnue/nnue-pytorch-training/data/exp367/nn-exp367-run3-epoch399.pt \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 Passed STC: LLR: 2.95 
(-2.94,2.94) <0.00,2.50> Total: 27288 W: 7445 L: 7178 D: 12665 Ptnml(0-2): 159, 3002, 7054, 3271, 158 https://tests.stockfishchess.org/tests/view/627e8c001919125939623644 Passed LTC: LLR: 2.95 (-2.94,2.94) <0.50,3.00> Total: 21792 W: 5969 L: 5727 D: 10096 Ptnml(0-2): 25, 2152, 6294, 2406, 19 https://tests.stockfishchess.org/tests/view/627f2a855734b18b2e2ece47 closes https://github.com/official-stockfish/Stockfish/pull/4020 Bench: 6481017	2022-05-14 12:47:22 +02:00
Topologist	471d93063a	Play more positional in endgames This patch chooses the delta value (which skews the nnue evaluation between positional and materialistic) depending on the material: If the material is low, delta will be higher and the evaluation is shifted to the positional value. If the material is high, the evaluation will be shifted to the psqt value. I don't think slightly negative values of delta should be a concern. Passed STC: https://tests.stockfishchess.org/tests/view/62418513b3b383e86185766f LLR: 2.94 (-2.94,2.94) <0.00,2.50> Total: 28808 W: 7832 L: 7564 D: 13412 Ptnml(0-2): 147, 3186, 7505, 3384, 182 Passed LTC: https://tests.stockfishchess.org/tests/view/62419137b3b383e861857842 LLR: 2.96 (-2.94,2.94) <0.50,3.00> Total: 58632 W: 15776 L: 15450 D: 27406 Ptnml(0-2): 42, 5889, 17149, 6173, 63 closes https://github.com/official-stockfish/Stockfish/pull/3971 Bench: 7588855	2022-03-28 22:43:52 +02:00
Ben Chaney	270a0e737f	Generalize the feature transform to use vec_t macros This commit generalizes the feature transform to use vec_t macros that are architecture defined instead of using a separate code path for each one. It should make some old architectures (MMX, including improvements by Fanael) faster and make further such improvements easier in the future. Includes some corrections to CI for mingw. closes https://github.com/official-stockfish/Stockfish/pull/3955 closes https://github.com/official-stockfish/Stockfish/pull/3928 No functional change	2022-03-02 23:39:08 +01:00
Tomasz Sobczyk	174b038bf3	Use dynamic allocation for evaluation scratch TLS buffer. fixes #3946 an issue related with the toolchain as found in xcode 12 on macOS, related to previous commit `5f781d36`. closes https://github.com/official-stockfish/Stockfish/pull/3950 No functional change	2022-03-01 17:51:02 +01:00
mstembera	5f781d366e	Clean up and simplify some nnue code. Remove some unnecessary code and its execution during inference. Also the change on line 49 in nnue_architecture.h results in a more efficient SIMD code path through ClippedReLU::propagate(). passed STC: https://tests.stockfishchess.org/tests/view/6217d3bfda649bba32ef25d5 LLR: 2.94 (-2.94,2.94) <-2.25,0.25> Total: 12056 W: 3281 L: 3092 D: 5683 Ptnml(0-2): 55, 1213, 3312, 1384, 64 passed STC SMP: https://tests.stockfishchess.org/tests/view/6217f344da649bba32ef295e LLR: 2.94 (-2.94,2.94) <-2.25,0.25> Total: 27376 W: 7295 L: 7137 D: 12944 Ptnml(0-2): 52, 2859, 7715, 3003, 59 closes https://github.com/official-stockfish/Stockfish/pull/3944 No functional change bench: 6820724	2022-02-25 08:37:57 +01:00
Tomasz Sobczyk	cb9c2594fc	Update architecture to "SFNNv4". Update network to nn-6877cd24400e.nnue. Architecture: The diagram of the "SFNNv4" architecture: https://user-images.githubusercontent.com/8037982/153455685-cbe3a038-e158-4481-844d-9d5fccf5c33a.png The most important architectural changes are the following: * 1024x2 [activated] neurons are pairwise, elementwise multiplied (not quite pairwise due to implementation details, see diagram), which introduces a non-linearity that exhibits similar benefits to previously tested sigmoid activation (quantmoid4), while being slightly faster. * The following layer has therefore 2x less inputs, which we compensate by having 2 more outputs. It is possible that reducing the number of outputs might be beneficial (as we had it as low as 8 before). The layer is now 1024->16. * The 16 outputs are split into 15 and 1. The 1-wide output is added to the network output (after some necessary scaling due to quantization differences). The 15-wide is activated and follows the usual path through a set of linear layers. The additional 1-wide output is at least neutral, but has shown a slightly positive trend in training compared to networks without it (all 16 outputs through the usual path), and allows possibly an additional stage of lazy evaluation to be introduced in the future. Additionally, the inference code was rewritten and no longer uses a recursive implementation. This was necessitated by the splitting of the 16-wide intermediate result into two, which was impossible to do with the old implementation with ugly hacks. This is hopefully overall for the better. First session: The first session was training a network from scratch (random initialization). The exact trainer used was slightly different (older) from the one used in the second session, but it should not have a measurable effect. The purpose of this session is to establish a strong network base for the second session. 
Small deviations in strength do not harm the learnability in the second session. The training was done using the following command: python3 train.py \ /home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \ /home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \ --gpus "$3," \ --threads 4 \ --num-workers 4 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=1.0 \ --gamma=0.992 \ --lr=8.75e-4 \ --max_epochs=400 \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 Every 20th net was saved and its playing strength measured against some baseline at 25k nodes per move with pure NNUE evaluation (modified binary). The exact setup is not important as long as it's consistent. The purpose is to sift good candidates from bad ones. The dataset can be found https://drive.google.com/file/d/1UQdZN_LWQ265spwTBwDKo0t1WjSJKvWY/view Second session: The second training session was done starting from the best network (as determined by strength testing) from the first session. It is important that it's resumed from a .pt model and NOT a .ckpt model. The conversion can be performed directly using serialize.py The LR schedule was modified to use gamma=0.995 instead of gamma=0.992 and LR=4.375e-4 instead of LR=8.75e-4 to flatten the LR curve and allow for longer training. The training was then running for 800 epochs instead of 400 (though it's possibly mostly noise after around epoch 600). 
The training was done using the following command: python3 train.py \ /data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \ /data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \ --gpus "$3," \ --threads 4 \ --num-workers 4 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=1.0 \ --gamma=0.995 \ --lr=4.375e-4 \ --max_epochs=800 \ --resume-from-model /data/sopel/nnue/nnue-pytorch-training/data/exp295/nn-epoch399.pt \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$run_id In particular note that we now use lambda=1.0 instead of lambda=0.8 (previous nets), because tests show that WDL-skipping introduced by vondele performs better with lambda=1.0. Nets were being saved every 20th epoch. In total 16 runs were made with these settings and the best nets chosen according to playing strength at 25k nodes per move with pure NNUE evaluation - these are the 4 nets that have been put on fishtest. The dataset can be found either at ftp://ftp.chessdb.cn/pub/sopel/data_sf/T60T70wIsRightFarseerT60T74T75T76.binpack in its entirety (download might be painfully slow because hosted in China) or can be assembled in the following way: Get the `5640ad48ae/script/interleave_binpacks.py` script. 
Download T60T70wIsRightFarseer.binpack https://drive.google.com/file/d/1_sQoWBl31WAxNXma2v45004CIVltytP8/view Download farseerT74.binpack http://trainingdata.farseer.org/T74-May13-End.7z Download farseerT75.binpack http://trainingdata.farseer.org/T75-June3rd-End.7z Download farseerT76.binpack http://trainingdata.farseer.org/T76-Nov10th-End.7z Run python3 interleave_binpacks.py T60T70wIsRightFarseer.binpack farseerT74.binpack farseerT75.binpack farseerT76.binpack T60T70wIsRightFarseerT60T74T75T76.binpack Tests: STC: https://tests.stockfishchess.org/tests/view/6203fb85d71106ed12a407b7 LLR: 2.94 (-2.94,2.94) <0.00,2.50> Total: 16952 W: 4775 L: 4521 D: 7656 Ptnml(0-2): 133, 1818, 4318, 2076, 131 LTC: https://tests.stockfishchess.org/tests/view/62041e68d71106ed12a40e85 LLR: 2.94 (-2.94,2.94) <0.50,3.00> Total: 14944 W: 4138 L: 3907 D: 6899 Ptnml(0-2): 21, 1499, 4202, 1728, 22 closes https://github.com/official-stockfish/Stockfish/pull/3927 Bench: 4919707	2022-02-10 19:54:31 +01:00
Brad Knox	ad926d34c0	Update copyright years Happy New Year! closes https://github.com/official-stockfish/Stockfish/pull/3881 No functional change	2022-01-06 15:45:45 +01:00
Stéphane Nicolet	74776dbcd5	Simplification in evaluate_nnue.cpp Removes the test on non-pawn-material before applying the positional/materialistic bonus. Passed STC: LLR: 2.94 (-2.94,2.94) <-2.25,0.25> Total: 46904 W: 12197 L: 12059 D: 22648 Ptnml(0-2): 170, 5243, 12479, 5399, 161 https://tests.stockfishchess.org/tests/view/61be57cf57a0d0f327c3999d Passed LTC: LLR: 2.95 (-2.94,2.94) <-2.25,0.25> Total: 18760 W: 4958 L: 4790 D: 9012 Ptnml(0-2): 14, 1942, 5301, 2108, 15 https://tests.stockfishchess.org/tests/view/61bed1fb57a0d0f327c3afa9 closes https://github.com/official-stockfish/Stockfish/pull/3866 Bench: 4826206	2021-12-19 15:44:01 +01:00
Tomasz Sobczyk	4766dfc395	Optimize FT activation and affine transform for NEON. This patch optimizes the NEON implementation in two ways. The activation layer after the feature transformer is rewritten to make it easier for the compiler to see through dependencies and unroll. This in itself is a minimal, but a positive improvement. Other architectures could benefit from this too in the future. This is not an algorithmic change. The affine transform for large matrices (first layer after FT) on NEON now utilizes the same optimized code path as >=SSSE3, which makes the memory accesses more sequential and makes better use of the available registers, which allows for code that has longer dependency chains. Benchmarks from Redshift#161, profile-build with apple clang george@Georges-MacBook-Air nets % ./stockfish-b82d93 bench 2>&1 \| tail -4 (current master) =========================== Total time (ms) : 2167 Nodes searched : 4667742 Nodes/second : 2154011 george@Georges-MacBook-Air nets % ./stockfish-7377b8 bench 2>&1 \| tail -4 (this patch) =========================== Total time (ms) : 1842 Nodes searched : 4667742 Nodes/second : 2534061 This is a solid 18% improvement overall, larger in a bench with NNUE-only, not mixed. Improvement is also observed on armv7-neon (Raspberry Pi, and older phones), around 5% speedup. No changes for architectures other than NEON. closes https://github.com/official-stockfish/Stockfish/pull/3837 No functional changes.	2021-12-07 18:08:54 +01:00
Michael Ortmann	4b86ef8c4f	Fix typos in comments, adjust readme closes https://github.com/official-stockfish/Stockfish/pull/3822 also adjusts readme as requested in https://github.com/official-stockfish/Stockfish/pull/3816 No functional change	2021-12-01 18:07:30 +01:00
hengyu	64f21ecdae	Small clean-up remove unneeded calculation. closes https://github.com/official-stockfish/Stockfish/pull/3807 No functional change.	2021-12-01 17:59:20 +01:00
Stefano Cardanobile	2214fcecf7	Rewrite NNUE evaluation adjustments Make the eval code in the evaluate_nnue.cpp more similar to the rest of the codebase: * remove multiple variable assignment * make if conditions explicit and indent on multiple lines passed STC LLR: 2.93 (-2.94,2.94) <-2.50,0.50> Total: 59032 W: 14834 L: 14751 D: 29447 Ptnml(0-2): 176, 6310, 16459, 6397, 174 https://tests.stockfishchess.org/tests/view/616f250540f619782fd4f76d closes https://github.com/official-stockfish/Stockfish/pull/3753 No functional change	2021-10-23 12:22:02 +02:00
mstembera	644f6d4790	Simplify away ValueListInserter plus minor cleanups STC: https://tests.stockfishchess.org/tests/view/616f059b40f619782fd4f73f LLR: 2.94 (-2.94,2.94) <-2.50,0.50> Total: 84992 W: 21244 L: 21197 D: 42551 Ptnml(0-2): 279, 9005, 23868, 9078, 266 closes https://github.com/official-stockfish/Stockfish/pull/3749 No functional change	2021-10-23 12:21:17 +02:00
xoto10	f21a66f70d	Small clean-up, Sept 2021 Closes https://github.com/official-stockfish/Stockfish/pull/3485 No functional change	2021-10-07 09:41:57 +02:00
Michael Chaly	e8788d1b32	Combo of various parameter tweaks Combination of parameter tweaks in search, evaluation and time management. Original patches by snicolet xoto10 lonfom169 and Vizvezdenec. Includes: * Use bigger grain of positional evaluation more frequently (up to 1 exchange difference in non-pawn-material); * More extra time according to increment; * Increase margin for singular extensions; * Do more aggressive parent node futility pruning. Passed STC https://tests.stockfishchess.org/tests/view/6147deab3733d0e0dd9f313d LLR: 2.94 (-2.94,2.94) <-0.50,2.50> Total: 45488 W: 11691 L: 11450 D: 22347 Ptnml(0-2): 145, 5208, 11824, 5395, 172 Passed LTC https://tests.stockfishchess.org/tests/view/6147f1d53733d0e0dd9f3141 LLR: 2.94 (-2.94,2.94) <0.50,3.50> Total: 62520 W: 15808 L: 15482 D: 31230 Ptnml(0-2): 43, 6439, 17960, 6785, 33 closes https://github.com/official-stockfish/Stockfish/pull/3710 bench 5575265	2021-09-21 19:48:40 +02:00
Tomasz Sobczyk	18dcf1f097	Optimize and tidy up affine transform code. The new network caused some issues initially due to the very narrow neuron set between the first two FC layers. Necessary changes were hacked together to make it work. This patch is a mature approach to make the affine transform code faster, more readable, and easier to maintain should the layer sizes change again. The following changes were made: * ClippedReLU always produces a multiple of 32 outputs. This is about as good of a solution for AffineTransform's SIMD requirements as it can get without a bigger rewrite. * All self-contained simd helpers are moved to a separate file (simd.h). Inline asm is utilized to work around GCC's issues with code generation and register assignment. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101693, https://godbolt.org/z/da76fY1n7 * AffineTransform has 2 specializations. While it's more lines of code due to the boilerplate, the logic in both is significantly reduced, as these two are impossible to nicely combine into one. 1) The first specialization is for cases when there's >=128 inputs. It uses a different approach to perform the affine transform and can make full use of AVX512 without any edge cases. Furthermore, it has higher theoretical throughput because fewer loads are needed in the hot path, requiring only a fixed number of instructions for horizontal additions at the end, which are amortized by the large number of inputs. 2) The second specialization is made to handle smaller layers where performance is still necessary but edge cases need to be handled. AVX512 implementation for this was omitted by mistake, a remnant from the temporary implementation for the new... This could be easily reintroduced if needed. A slightly more detailed description of both implementations is in the code. 
Overall it should be a minor speedup, as shown on fishtest: passed STC: LLR: 2.96 (-2.94,2.94) <-0.50,2.50> Total: 51520 W: 4074 L: 3888 D: 43558 Ptnml(0-2): 111, 3136, 19097, 3288, 128 and various tests shown in the pull request closes https://github.com/official-stockfish/Stockfish/pull/3663 No functional change	2021-08-20 08:50:25 +02:00
Tomasz Sobczyk	d61d38586e	New NNUE architecture and net Introduces a new NNUE network architecture and associated network parameters The summary of the changes: * Position for each perspective mirrored such that the king is on e..h files. Cuts the feature transformer size in half, while preserving enough knowledge to be good. See https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit#heading=h.b40q4rb1w7on. * The number of neurons after the feature transformer increased two-fold, to 1024x2. This is possibly mostly due to the now very optimized feature transformer update code. * The number of neurons after the second layer is reduced from 16 to 8, to reduce the speed impact. This, perhaps surprisingly, doesn't harm the strength much. See https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit#heading=h.6qkocr97fezq The AffineTransform code did not work out of the box with the smaller number of neurons after the second layer, so some temporary changes have been made to add a special case for InputDimensions == 8. Also additional 0 padding is added to the output for some archs that cannot process inputs by <=8 (SSE2, NEON). VNNI uses an implementation that can keep all outputs in the registers while reducing the number of loads by 3 for each 16 inputs, thanks to the reduced number of output neurons. However GCC is particularly bad at optimization here (and perhaps why the current way the affine transform is done even passed sprt) (see https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit# for details) and more work will be done on this in the following days. I expect the current VNNI implementation to be improved and extended to other architectures. The network was trained with a slightly modified version of the pytorch trainer (https://github.com/glinscott/nnue-pytorch); the changes are in https://github.com/glinscott/nnue-pytorch/pull/143 The training utilized 2 datasets. 
dataset A - https://drive.google.com/file/d/1VlhnHL8f-20AXhGkILujnNXHwy9T-MQw/view?usp=sharing dataset B - as described in `ba01f4b954` The training process was as follows: train on dataset A for 350 epochs, take the best net in terms of Elo at 20k nodes per move (it's fine to take anything from later stages of training). convert the .ckpt to .pt --resume-from-model from the .pt file, train on dataset B for <600 epochs, take the best net. Lambda=0.8, applied before the loss function. The first training command: python3 train.py \ ../nnue-pytorch-training/data/large_gensfen_multipvdiff_100_d9.binpack \ ../nnue-pytorch-training/data/large_gensfen_multipvdiff_100_d9.binpack \ --gpus "$3," \ --threads 1 \ --num-workers 1 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --smart-fen-skipping \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=1.0 \ --max_epochs=600 \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 The second training command: python3 serialize.py \ --features=HalfKAv2_hm^ \ ../nnue-pytorch-training/experiment_131/run_6/default/version_0/checkpoints/epoch-499.ckpt \ ../nnue-pytorch-training/experiment_$1/base/base.pt python3 train.py \ ../nnue-pytorch-training/data/michael_commit_b94a65.binpack \ ../nnue-pytorch-training/data/michael_commit_b94a65.binpack \ --gpus "$3," \ --threads 1 \ --num-workers 1 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --smart-fen-skipping \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=0.8 \ --max_epochs=600 \ --resume-from-model ../nnue-pytorch-training/experiment_$1/base/base.pt \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 STC: https://tests.stockfishchess.org/tests/view/611120b32a8a49ac5be798c4 LLR: 2.97 (-2.94,2.94) <-0.50,2.50> Total: 22480 W: 2434 L: 2251 D: 17795 Ptnml(0-2): 101, 1736, 7410, 1865, 128 LTC: https://tests.stockfishchess.org/tests/view/611152b32a8a49ac5be798ea LLR: 2.93 (-2.94,2.94) <0.50,3.50> Total: 9776 W: 442 L: 333 D: 9001 Ptnml(0-2): 5, 295, 4180, 402, 6 closes https://github.com/official-stockfish/Stockfish/pull/3646 bench: 5189338	2021-08-15 12:05:43 +02:00
Tomasz Sobczyk	26edf9534a	Avoid unnecessary stores in the affine transform This patch improves the codegen in the AffineTransform::forward function for architectures >=SSSE3. Current code works directly on memory and the compiler cannot see that the stores through outptr do not alias the loads through weights and input32. The solution implemented is to perform the affine transform with local variables as accumulators and only store the result to memory at the end. The number of accumulators required is OutputDimensions / OutputSimdWidth, which means that for the 1024->16 affine transform it requires 4 registers with SSSE3, 2 with AVX2, 1 with AVX512. It also cuts the number of stores required by NumRegs * 256 for each node evaluated. The local accumulators are expected to be assigned to registers, but even if this cannot be done in some case due to register pressure it will help the compiler to see that there is no aliasing between the loads and stores and may still result in better codegen. See https://godbolt.org/z/59aTKbbYc for codegen comparison. passed STC: LLR: 2.94 (-2.94,2.94) <-0.50,2.50> Total: 140328 W: 10635 L: 10358 D: 119335 Ptnml(0-2): 302, 8339, 52636, 8554, 333 closes https://github.com/official-stockfish/Stockfish/pull/3634 No functional change	2021-07-30 17:15:52 +02:00
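The accumulator idea in the commit above can be illustrated with a minimal scalar sketch. This is not the engine's SIMD code: the function name `affine_forward`, the array layout, and the element types are assumptions chosen for illustration. It only shows the pattern of summing into local variables and writing to the output once at the end, so the compiler does not have to assume that stores alias the weight and input loads.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for an affine transform: out = biases + W * in.
// All partial sums live in the local array `acc`; `output` is written exactly
// once per element, after the accumulation loop has finished.
template <std::size_t In, std::size_t Out>
void affine_forward(const std::int8_t (&weights)[Out][In],
                    const std::int32_t (&biases)[Out],
                    const std::uint8_t (&input)[In],
                    std::int32_t (&output)[Out]) {
    std::int32_t acc[Out];  // local accumulators, candidates for registers

    for (std::size_t o = 0; o < Out; ++o)
        acc[o] = biases[o];

    for (std::size_t i = 0; i < In; ++i) {
        const std::int32_t in = input[i];
        for (std::size_t o = 0; o < Out; ++o)
            acc[o] += static_cast<std::int32_t>(weights[o][i]) * in;
    }

    // Single store pass at the end.
    for (std::size_t o = 0; o < Out; ++o)
        output[o] = acc[o];
}
```

In the real code the accumulators would be SIMD vectors, and their count would be OutputDimensions / OutputSimdWidth as the commit message states.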
Stéphane Nicolet	b51b094419	Simplify format_cp_aligned_dot() closes https://github.com/official-stockfish/Stockfish/pull/3583 No functional change	2021-07-03 09:25:16 +02:00
Joost VandeVondele	2e2865d34b	Fix build error on OSX directly use integer version for cp calculation. fixes https://github.com/official-stockfish/Stockfish/issues/3573 closes https://github.com/official-stockfish/Stockfish/pull/3574 No functional change	2021-06-21 23:14:58 +02:00
Tomasz Sobczyk	2e745956c0	Change trace with NNUE eval support This patch adds some more output to the `eval` command. It adds a board display with estimated piece values (method is remove-piece, evaluate, put-piece), and splits the NNUE evaluation with (psqt,layers) for each bucket for the NNUE net. Example: ``` ./stockfish position fen 3Qb1k1/1r2ppb1/pN1n2q1/Pp1Pp1Pr/4P2p/4BP2/4B1R1/1R5K b - - 11 40 eval Contributing terms for the classical eval: +------------+-------------+-------------+-------------+ \| Term \| White \| Black \| Total \| \| \| MG EG \| MG EG \| MG EG \| +------------+-------------+-------------+-------------+ \| Material \| ---- ---- \| ---- ---- \| -0.73 -1.55 \| \| Imbalance \| ---- ---- \| ---- ---- \| -0.21 -0.17 \| \| Pawns \| 0.35 -0.00 \| 0.19 -0.26 \| 0.16 0.25 \| \| Knights \| 0.04 -0.08 \| 0.12 -0.01 \| -0.08 -0.07 \| \| Bishops \| -0.34 -0.87 \| -0.17 -0.61 \| -0.17 -0.26 \| \| Rooks \| 0.12 0.00 \| 0.08 0.00 \| 0.04 0.00 \| \| Queens \| 0.00 0.00 \| -0.27 -0.07 \| 0.27 0.07 \| \| Mobility \| 0.84 1.76 \| 0.01 0.66 \| 0.83 1.10 \| \|King safety \| -0.99 -0.17 \| -0.72 -0.10 \| -0.27 -0.07 \| \| Threats \| 0.27 0.27 \| 0.73 0.86 \| -0.46 -0.59 \| \| Passed \| 0.00 0.00 \| 0.79 0.82 \| -0.79 -0.82 \| \| Space \| 0.61 0.00 \| 0.24 0.00 \| 0.37 0.00 \| \| Winnable \| ---- ---- \| ---- ---- \| 0.00 -0.03 \| +------------+-------------+-------------+-------------+ \| Total \| ---- ---- \| ---- ---- \| -1.03 -2.14 \| +------------+-------------+-------------+-------------+ NNUE derived piece values: +-------+-------+-------+-------+-------+-------+-------+-------+ \| \| \| \| Q \| b \| \| k \| \| \| \| \| \| +12.4 \| -1.62 \| \| \| \| +-------+-------+-------+-------+-------+-------+-------+-------+ \| \| r \| \| \| p \| p \| b \| \| \| \| -3.89 \| \| \| -0.84 \| -1.19 \| -3.32 \| \| +-------+-------+-------+-------+-------+-------+-------+-------+ \| p \| N \| \| n \| \| \| q \| \| \| -1.81 \| +3.71 \| \| -4.82 \| \| \| -5.04 \| \| 
+-------+-------+-------+-------+-------+-------+-------+-------+ \| P \| p \| \| P \| p \| \| P \| r \| \| +1.16 \| -0.91 \| \| +0.55 \| +0.12 \| \| +0.50 \| -4.02 \| +-------+-------+-------+-------+-------+-------+-------+-------+ \| \| \| \| \| P \| \| \| p \| \| \| \| \| \| +2.33 \| \| \| +1.17 \| +-------+-------+-------+-------+-------+-------+-------+-------+ \| \| \| \| \| B \| P \| \| \| \| \| \| \| \| +4.79 \| +1.54 \| \| \| +-------+-------+-------+-------+-------+-------+-------+-------+ \| \| \| \| \| B \| \| R \| \| \| \| \| \| \| +4.54 \| \| +6.03 \| \| +-------+-------+-------+-------+-------+-------+-------+-------+ \| \| R \| \| \| \| \| \| K \| \| \| +4.81 \| \| \| \| \| \| \| +-------+-------+-------+-------+-------+-------+-------+-------+ NNUE network contributions (Black to move) +------------+------------+------------+------------+ \| Bucket \| Material \| Positional \| Total \| \| \| (PSQT) \| (Layers) \| \| +------------+------------+------------+------------+ \| 0 \| + 0.32 \| - 1.46 \| - 1.13 \| \| 1 \| + 0.25 \| - 0.68 \| - 0.43 \| \| 2 \| + 0.46 \| - 1.72 \| - 1.25 \| \| 3 \| + 0.55 \| - 1.80 \| - 1.25 \| \| 4 \| + 0.48 \| - 1.77 \| - 1.29 \| \| 5 \| + 0.40 \| - 2.00 \| - 1.60 \| \| 6 \| + 0.57 \| - 2.12 \| - 1.54 \| <-- this bucket is used \| 7 \| + 3.38 \| - 2.00 \| + 1.37 \| +------------+------------+------------+------------+ Classical evaluation -1.00 (white side) NNUE evaluation +1.54 (white side) Final evaluation +2.38 (white side) [with scaled NNUE, hybrid, ...] ``` Also renames the export_net() function to save_eval() while there. closes https://github.com/official-stockfish/Stockfish/pull/3562 No functional change	2021-06-19 11:57:01 +02:00
Tomasz Sobczyk	900f249f59	Reduce the number of accumulator states Reduce from 3 to 2. Make the intent of the states clearer. STC: https://tests.stockfishchess.org/tests/view/60c50111457376eb8bcaad03 LLR: 2.95 (-2.94,2.94) <-2.50,0.50> Total: 61888 W: 5007 L: 4944 D: 51937 Ptnml(0-2): 164, 3947, 22649, 4030, 154 LTC: https://tests.stockfishchess.org/tests/view/60c52b1c457376eb8bcaad2c LLR: 2.94 (-2.94,2.94) <-2.50,0.50> Total: 20248 W: 688 L: 618 D: 18942 Ptnml(0-2): 7, 551, 8946, 605, 15 closes https://github.com/official-stockfish/Stockfish/pull/3548 No functional change.	2021-06-14 11:22:08 +02:00
Tomasz Sobczyk	ce4c523ad3	Register count for feature transformer Compute optimal register count for feature transformer accumulation dynamically. This also introduces a change where AVX512 would only use 8 registers instead of 16 (now possible due to a 2x increase in feature transformer size). closes https://github.com/official-stockfish/Stockfish/pull/3543 No functional change	2021-06-13 13:10:56 +02:00
Stéphane Nicolet	7819412002	Clarify use of UCI options Update README.md to clarify use of UCI options closes https://github.com/official-stockfish/Stockfish/pull/3540 No functional change	2021-06-13 10:02:43 +02:00
Tomasz Sobczyk	b84fa04db6	Read NNUE net faster Load feature transformer weights in bulk on little-endian machines. This is in particular useful to test new nets with c-chess-cli, see https://github.com/lucasart/c-chess-cli/issues/44 ``` $ time ./stockfish.exe uci Before : 0m0.914s After : 0m0.483s ``` No functional change	2021-06-13 09:39:03 +02:00
Stéphane Nicolet	8f081c86f7	Clean SIMD code a bit Cleaner vector code structure in feature transformer. This patch just regroups the parts of the inner loop for each SIMD instruction set. Tested for non-regression: LLR: 2.96 (-2.94,2.94) <-2.50,0.50> Total: 115760 W: 9835 L: 9831 D: 96094 Ptnml(0-2): 326, 7776, 41715, 7694, 369 https://tests.stockfishchess.org/tests/view/60b96b39457376eb8bcaa26e It would be nice if a future patch could use some of the macros at the top of the file to unify the code between the distinct SIMD instruction sets (of course, unifying the Relu will be the challenge). closes https://github.com/official-stockfish/Stockfish/pull/3506 No functional change	2021-06-04 14:07:46 +02:00
Tomasz Sobczyk	5448cad29e	Fix export of the feature transformer. PSQT export was missing. fixes #3507 closes https://github.com/official-stockfish/Stockfish/pull/3508 No functional change	2021-05-30 21:31:58 +02:00
Stéphane Nicolet	f193778446	Do not use lazy evaluation inside NNUE This simplification patch implements two changes: 1. it simplifies away the so-called "lazy" path in the NNUE evaluation internals, where we trusted the psqt head alone to avoid the costly "positional" head in some cases; 2. it raises a little bit the NNUEThreshold1 in evaluate.cpp (from 682 to 800), which increases the limit where we switched from NNUE eval to Classical eval. Both effects increase the number of positional evaluations done by our new net architecture, but the results of our tests below seem to indicate that the loss of speed will be compensated by the gain of eval quality. STC: LLR: 2.95 (-2.94,2.94) <-2.50,0.50> Total: 26280 W: 2244 L: 2137 D: 21899 Ptnml(0-2): 72, 1755, 9405, 1810, 98 https://tests.stockfishchess.org/tests/view/60ae73f112066fd299795a51 LTC: LLR: 2.95 (-2.94,2.94) <-2.50,0.50> Total: 20592 W: 750 L: 677 D: 19165 Ptnml(0-2): 9, 614, 8980, 681, 12 https://tests.stockfishchess.org/tests/view/60ae88e812066fd299795a82 closes https://github.com/official-stockfish/Stockfish/pull/3503 Bench: 3817907	2021-05-27 01:21:56 +02:00
Tomasz Sobczyk	9d53129075	Expose the lazy threshold for the feature transformer PSQT as a parameter. Definition of the lazy threshold moved to evaluate.cpp where all others are. Lazy threshold only used for real searches, not used for the "eval" call. This preserves the purity of NNUE evaluation, which is useful to verify consistency between the engine and the NNUE trainer. closes https://github.com/official-stockfish/Stockfish/pull/3499 No functional change	2021-05-25 21:40:51 +02:00
Stéphane Nicolet	a2f01c07eb	Sometimes change the (materialist, positional) balance Our new nets output two values for the side to move in the last layer. We can interpret the first value as a material evaluation of the position, and the second one as the dynamic, positional value of the location of pieces. This patch changes the balance for the (materialist, positional) parts of the score from (128, 128) to (121, 135) when the piece material is equal between the two players, but keeps the standard (128, 128) balance when one player is at least an exchange up. Passed STC: LLR: 2.93 (-2.94,2.94) <-0.50,2.50> Total: 15936 W: 1421 L: 1266 D: 13249 Ptnml(0-2): 37, 1037, 5694, 1134, 66 https://tests.stockfishchess.org/tests/view/60a82df9ce8ea25a3ef0408f Passed LTC: LLR: 2.94 (-2.94,2.94) <0.50,3.50> Total: 13904 W: 516 L: 410 D: 12978 Ptnml(0-2): 4, 374, 6088, 484, 2 https://tests.stockfishchess.org/tests/view/60a8bbf9ce8ea25a3ef04101 closes https://github.com/official-stockfish/Stockfish/pull/3492 Bench: 3856635	2021-05-22 21:09:22 +02:00
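The weighting described in the commit above can be sketched as follows. The weights (128, 128) and (121, 135) come from the commit message. The function name `blend_eval`, the boolean flag `material_is_balanced`, and the division by 128 are assumptions made for this sketch, not a quotation of the engine source.

```cpp
// Hypothetical blend of the two network outputs.
// material_is_balanced stands in for the commit's condition that piece
// material is equal between the players (neither side an exchange up).
int blend_eval(int materialist, int positional, bool material_is_balanced) {
    const int a = material_is_balanced ? 121 : 128;  // weight of the material part
    const int b = material_is_balanced ? 135 : 128;  // weight of the positional part
    return (a * materialist + b * positional) / 128; // assumed normalisation
}
```

Both weight pairs sum to 256, so the shift moves emphasis from the material term to the positional term without changing the overall scale.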
Fanael Linithien	038487f954	Use packed 32-bit MMX operations for updating the PSQT accumulator This improves the speed of NNUE by a bit on old hardware that code path is intended for, like a Pentium III 1.13 GHz: 10 repeats of "./stockfish bench 16 1 13 default depth NNUE": Before: 54 642 504 897 cycles (± 0.12%) 62 301 937 829 instructions (± 0.03%) After: 54 320 821 928 cycles (± 0.13%) 62 084 742 699 instructions (± 0.02%) Speed of go depth 20 from startpos: Before: 53103 nps After: 53856 nps closes https://github.com/official-stockfish/Stockfish/pull/3476 No functional change.	2021-05-19 19:34:44 +02:00
Tomasz Sobczyk	e8d64af123	New NNUE architecture and net Introduces a new NNUE network architecture and associated network parameters, as obtained by a new pytorch trainer. The network is already very strong at short TC, without regression at longer TC, and has potential for further improvements. https://tests.stockfishchess.org/tests/view/60a159c65085663412d0921d TC: 10s+0.1s, 1 thread ELO: 21.74 +-3.4 (95%) LOS: 100.0% Total: 10000 W: 1559 L: 934 D: 7507 Ptnml(0-2): 38, 701, 2972, 1176, 113 https://tests.stockfishchess.org/tests/view/60a187005085663412d0925b TC: 60s+0.6s, 1 thread ELO: 5.85 +-1.7 (95%) LOS: 100.0% Total: 20000 W: 1381 L: 1044 D: 17575 Ptnml(0-2): 27, 885, 7864, 1172, 52 https://tests.stockfishchess.org/tests/view/60a2beede229097940a03806 TC: 20s+0.2s, 8 threads LLR: 2.93 (-2.94,2.94) <0.50,3.50> Total: 34272 W: 1610 L: 1452 D: 31210 Ptnml(0-2): 30, 1285, 14350, 1439, 32 https://tests.stockfishchess.org/tests/view/60a2d687e229097940a03c72 TC: 60s+0.6s, 8 threads LLR: 2.94 (-2.94,2.94) <-2.50,0.50> Total: 45544 W: 1262 L: 1214 D: 43068 Ptnml(0-2): 12, 1129, 20442, 1177, 12 The network has been trained (by vondele) using the https://github.com/glinscott/nnue-pytorch/ trainer (started by glinscott), specifically the branch https://github.com/Sopel97/nnue-pytorch/tree/experiment_56. 
The data used are in 64 billion positions (193GB total) generated and scored with the current master net d8: https://drive.google.com/file/d/1hOOYSDKgOOp38ZmD0N4DV82TOLHzjUiF/view?usp=sharing d9: https://drive.google.com/file/d/1VlhnHL8f-20AXhGkILujnNXHwy9T-MQw/view?usp=sharing d10: https://drive.google.com/file/d/1ZC5upzBYMmMj1gMYCkt6rCxQG0GnO3Kk/view?usp=sharing fishtest_d9: https://drive.google.com/file/d/1GQHt0oNgKaHazwJFTRbXhlCN3FbUedFq/view?usp=sharing This network also contains a few architectural changes with respect to the current master: Size changed from 256x2-32-32-1 to 512x2-16-32-1 ~15-20% slower ~2x larger adds a special path for 16 valued ClippedReLU fixes affine transform code for 16 inputs/outputs, by using InputDimensions instead of PaddedInputDimensions this is safe now because the inputs are processed in groups of 4 in the current affine transform code The feature set changed from HalfKP to HalfKAv2 Includes information about the kings like HalfKA Packs king features better, resulting in 8% size reduction compared to HalfKA The board is flipped for the black's perspective, instead of rotated like in the current master PSQT values for each feature the feature transformer now outputs a part that is forwarded directly to the output and allows learning piece values more directly than the previous network architecture. The effect is visible for high imbalance positions, where the current master network outputs evaluations skewed towards zero. 
8 PSQT values per feature, chosen based on (popcount(pos.pieces()) - 1) / 4 initialized to classical material values on the start of the training 8 subnetworks (512x2->16->32->1), chosen based on (popcount(pos.pieces()) - 1) / 4 only one subnetwork is evaluated for any position, no or marginal speed loss A diagram of the network is available: https://user-images.githubusercontent.com/8037982/118656988-553a1700-b7eb-11eb-82ef-56a11cbebbf2.png A more complete description: https://github.com/glinscott/nnue-pytorch/blob/master/docs/nnue.md closes https://github.com/official-stockfish/Stockfish/pull/3474 Bench: 3806488	2021-05-18 18:06:23 +02:00
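The bucket selection rule quoted in the commit above, (popcount(pos.pieces()) - 1) / 4, can be sketched as a standalone function. The names `count_bits` and `bucket_index` are hypothetical, and the occupancy bitboard is passed in as a plain 64-bit integer rather than taken from a `Position` object.

```cpp
#include <cstdint>

// Portable population count (number of set bits), one bit cleared per step.
int count_bits(std::uint64_t bb) {
    int n = 0;
    while (bb) {
        bb &= bb - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// Index of the PSQT bucket / subnetwork for a given occupancy bitboard.
// A legal position has 2 to 32 pieces, so the result lies in the range 0..7.
int bucket_index(std::uint64_t occupied) {
    return (count_bits(occupied) - 1) / 4;
}
```

With all 32 pieces on the board the index is 7, and with only the two kings left it is 0, so each of the eight subnetworks covers a band of four piece counts.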


91 commits