BadFish

mirror of https://github.com/sockspls/badfish synced 2025-05-03 01:59:36 +00:00

Author	SHA1	Message	Date
Tomasz Sobczyk	c079acc26f	Update NNUE architecture to SFNNv5. Update network to nn-3c0aa92af1da.nnue. Architecture changes: Duplicated activation after the 1024->15 layer with squared crelu (so 15->15*2). As proposed by vondele. Trainer changes: Added bias to L1 factorization, which was previously missing (no measurable improvement but at least neutral in principle) For retraining linearly reduce lambda parameter from 1.0 at epoch 0 to 0.75 at epoch 800. reduce max_skipping_rate from 15 to 10 (compared to vondele's outstanding PR) Note: This network was trained with a ~0.8% error in quantization regarding the newly added activation function. This will be fixed in the released trainer version. Expect a trainer PR tomorrow. Note: The inference implementation cuts a corner to merge results from two activation functions. This could possibly be resolved nicer in the future. AVX2 implementation likely not necessary, but NEON is missing. First training session invocation: python3 train.py \ ../nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \ ../nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \ --gpus "$3," \ --threads 4 \ --num-workers 8 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=1.0 \ --max_epochs=400 \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 Second training session invocation: python3 train.py \ ../nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \ ../nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \ --gpus "$3," \ --threads 4 \ --num-workers 8 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --start-lambda=1.0 \ --end-lambda=0.75 \ --gamma=0.995 \ --lr=4.375e-4 \ --max_epochs=800 \ --resume-from-model /data/sopel/nnue/nnue-pytorch-training/data/exp367/nn-exp367-run3-epoch399.pt \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 Passed STC: LLR: 2.95 
(-2.94,2.94) <0.00,2.50> Total: 27288 W: 7445 L: 7178 D: 12665 Ptnml(0-2): 159, 3002, 7054, 3271, 158 https://tests.stockfishchess.org/tests/view/627e8c001919125939623644 Passed LTC: LLR: 2.95 (-2.94,2.94) <0.50,3.00> Total: 21792 W: 5969 L: 5727 D: 10096 Ptnml(0-2): 25, 2152, 6294, 2406, 19 https://tests.stockfishchess.org/tests/view/627f2a855734b18b2e2ece47 closes https://github.com/official-stockfish/Stockfish/pull/4020 Bench: 6481017	2022-05-14 12:47:22 +02:00
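The retraining note in the SFNNv5 commit above (lambda reduced linearly from 1.0 at epoch 0 to 0.75 at epoch 800) can be sketched as a plain interpolation. This is only an illustration: the helper name `lambda_at` and its parameters are hypothetical and are not taken from the trainer, whose actual implementation may differ.

```python
def lambda_at(epoch, start_lambda=1.0, end_lambda=0.75, max_epochs=800):
    """Linearly interpolate lambda between start_lambda and end_lambda.

    Hypothetical helper for illustration; not the trainer's own code.
    """
    # Clamp the epoch so values outside [0, max_epochs] stay at the endpoints.
    t = min(max(epoch, 0), max_epochs) / max_epochs
    return start_lambda + (end_lambda - start_lambda) * t


if __name__ == "__main__":
    for epoch in (0, 200, 400, 600, 800):
        print(epoch, lambda_at(epoch))
```

The defaults mirror `--start-lambda=1.0`, `--end-lambda=0.75` and `--max_epochs=800` from the second training session invocation.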
mstembera	5f781d366e	Clean up and simplify some nnue code. Remove some unnecessary code and its execution during inference. Also the change on line 49 in nnue_architecture.h results in a more efficient SIMD code path through ClippedReLU::propagate(). passed STC: https://tests.stockfishchess.org/tests/view/6217d3bfda649bba32ef25d5 LLR: 2.94 (-2.94,2.94) <-2.25,0.25> Total: 12056 W: 3281 L: 3092 D: 5683 Ptnml(0-2): 55, 1213, 3312, 1384, 64 passed STC SMP: https://tests.stockfishchess.org/tests/view/6217f344da649bba32ef295e LLR: 2.94 (-2.94,2.94) <-2.25,0.25> Total: 27376 W: 7295 L: 7137 D: 12944 Ptnml(0-2): 52, 2859, 7715, 3003, 59 closes https://github.com/official-stockfish/Stockfish/pull/3944 No functional change bench: 6820724	2022-02-25 08:37:57 +01:00
Tomasz Sobczyk	cb9c2594fc	Update architecture to "SFNNv4". Update network to nn-6877cd24400e.nnue. Architecture: The diagram of the "SFNNv4" architecture: https://user-images.githubusercontent.com/8037982/153455685-cbe3a038-e158-4481-844d-9d5fccf5c33a.png The most important architectural changes are the following: * 1024x2 [activated] neurons are pairwise, elementwise multiplied (not quite pairwise due to implementation details, see diagram), which introduces a non-linearity that exhibits similar benefits to previously tested sigmoid activation (quantmoid4), while being slightly faster. * The following layer has therefore 2x less inputs, which we compensate by having 2 more outputs. It is possible that reducing the number of outputs might be beneficial (as we had it as low as 8 before). The layer is now 1024->16. * The 16 outputs are split into 15 and 1. The 1-wide output is added to the network output (after some necessary scaling due to quantization differences). The 15-wide is activated and follows the usual path through a set of linear layers. The additional 1-wide output is at least neutral, but has shown a slightly positive trend in training compared to networks without it (all 16 outputs through the usual path), and allows possibly an additional stage of lazy evaluation to be introduced in the future. Additionally, the inference code was rewritten and no longer uses a recursive implementation. This was necessitated by the splitting of the 16-wide intermediate result into two, which was impossible to do with the old implementation with ugly hacks. This is hopefully overall for the better. First session: The first session was training a network from scratch (random initialization). The exact trainer used was slightly different (older) from the one used in the second session, but it should not have a measurable effect. The purpose of this session is to establish a strong network base for the second session. 
Small deviations in strength do not harm the learnability in the second session. The training was done using the following command: python3 train.py \ /home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \ /home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \ --gpus "$3," \ --threads 4 \ --num-workers 4 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=1.0 \ --gamma=0.992 \ --lr=8.75e-4 \ --max_epochs=400 \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 Every 20th net was saved and its playing strength measured against some baseline at 25k nodes per move with pure NNUE evaluation (modified binary). The exact setup is not important as long as it's consistent. The purpose is to sift good candidates from bad ones. The dataset can be found https://drive.google.com/file/d/1UQdZN_LWQ265spwTBwDKo0t1WjSJKvWY/view Second session: The second training session was done starting from the best network (as determined by strength testing) from the first session. It is important that it's resumed from a .pt model and NOT a .ckpt model. The conversion can be performed directly using serialize.py The LR schedule was modified to use gamma=0.995 instead of gamma=0.992 and LR=4.375e-4 instead of LR=8.75e-4 to flatten the LR curve and allow for longer training. The training was then running for 800 epochs instead of 400 (though it's possibly mostly noise after around epoch 600). 
The training was done using the following command: python3 train.py \ /data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \ /data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \ --gpus "$3," \ --threads 4 \ --num-workers 4 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=1.0 \ --gamma=0.995 \ --lr=4.375e-4 \ --max_epochs=800 \ --resume-from-model /data/sopel/nnue/nnue-pytorch-training/data/exp295/nn-epoch399.pt \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$run_id In particular note that we now use lambda=1.0 instead of lambda=0.8 (previous nets), because tests show that WDL-skipping introduced by vondele performs better with lambda=1.0. Nets were being saved every 20th epoch. In total 16 runs were made with these settings and the best nets chosen according to playing strength at 25k nodes per move with pure NNUE evaluation - these are the 4 nets that have been put on fishtest. The dataset can be found either at ftp://ftp.chessdb.cn/pub/sopel/data_sf/T60T70wIsRightFarseerT60T74T75T76.binpack in its entirety (download might be painfully slow because hosted in China) or can be assembled in the following way: Get the `5640ad48ae/script/interleave_binpacks.py` script. 
Download T60T70wIsRightFarseer.binpack https://drive.google.com/file/d/1_sQoWBl31WAxNXma2v45004CIVltytP8/view Download farseerT74.binpack http://trainingdata.farseer.org/T74-May13-End.7z Download farseerT75.binpack http://trainingdata.farseer.org/T75-June3rd-End.7z Download farseerT76.binpack http://trainingdata.farseer.org/T76-Nov10th-End.7z Run python3 interleave_binpacks.py T60T70wIsRightFarseer.binpack farseerT74.binpack farseerT75.binpack farseerT76.binpack T60T70wIsRightFarseerT60T74T75T76.binpack Tests: STC: https://tests.stockfishchess.org/tests/view/6203fb85d71106ed12a407b7 LLR: 2.94 (-2.94,2.94) <0.00,2.50> Total: 16952 W: 4775 L: 4521 D: 7656 Ptnml(0-2): 133, 1818, 4318, 2076, 131 LTC: https://tests.stockfishchess.org/tests/view/62041e68d71106ed12a40e85 LLR: 2.94 (-2.94,2.94) <0.50,3.00> Total: 14944 W: 4138 L: 3907 D: 6899 Ptnml(0-2): 21, 1499, 4202, 1728, 22 closes https://github.com/official-stockfish/Stockfish/pull/3927 Bench: 4919707	2022-02-10 19:54:31 +01:00
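The SFNNv4 commit above says the second session uses gamma=0.995 and LR=4.375e-4 instead of gamma=0.992 and LR=8.75e-4 "to flatten the LR curve". A minimal sketch of that effect follows, assuming the learning rate is multiplied by gamma once per epoch. That decay rule is an assumption made for illustration and is not stated in the commit message; `lr_at` is a hypothetical helper.

```python
def lr_at(epoch, base_lr, gamma):
    """Learning rate after `epoch` epochs.

    Assumption: the rate decays by a factor of gamma once per epoch.
    """
    return base_lr * gamma ** epoch


if __name__ == "__main__":
    for epoch in (0, 100, 200, 400, 800):
        first = lr_at(epoch, 8.75e-4, 0.992)    # first-session settings
        second = lr_at(epoch, 4.375e-4, 0.995)  # second-session settings
        print(f"epoch {epoch:3d}: first={first:.3e} second={second:.3e}")
```

Under this assumption the second schedule starts at half the first session's rate but decays more slowly, so by epoch 400 it sits above the first schedule, which is consistent with the stated goal of allowing longer training.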
Brad Knox	ad926d34c0	Update copyright years Happy New Year! closes https://github.com/official-stockfish/Stockfish/pull/3881 No functional change	2022-01-06 15:45:45 +01:00
Tomasz Sobczyk	4766dfc395	Optimize FT activation and affine transform for NEON. This patch optimizes the NEON implementation in two ways. The activation layer after the feature transformer is rewritten to make it easier for the compiler to see through dependencies and unroll. This in itself is a minimal, but a positive improvement. Other architectures could benefit from this too in the future. This is not an algorithmic change. The affine transform for large matrices (first layer after FT) on NEON now utilizes the same optimized code path as >=SSSE3, which makes the memory accesses more sequential and makes better use of the available registers, which allows for code that has longer dependency chains. Benchmarks from Redshift#161, profile-build with apple clang george@Georges-MacBook-Air nets % ./stockfish-b82d93 bench 2>&1 \| tail -4 (current master) =========================== Total time (ms) : 2167 Nodes searched : 4667742 Nodes/second : 2154011 george@Georges-MacBook-Air nets % ./stockfish-7377b8 bench 2>&1 \| tail -4 (this patch) =========================== Total time (ms) : 1842 Nodes searched : 4667742 Nodes/second : 2534061 This is a solid 18% improvement overall, larger in a bench with NNUE-only, not mixed. Improvement is also observed on armv7-neon (Raspberry Pi, and older phones), around 5% speedup. No changes for architectures other than NEON. closes https://github.com/official-stockfish/Stockfish/pull/3837 No functional changes.	2021-12-07 18:08:54 +01:00
Tomasz Sobczyk	18dcf1f097	Optimize and tidy up affine transform code. The new network caused some issues initially due to the very narrow neuron set between the first two FC layers. Necessary changes were hacked together to make it work. This patch is a mature approach to make the affine transform code faster, more readable, and easier to maintain should the layer sizes change again. The following changes were made: * ClippedReLU always produces a multiple of 32 outputs. This is about as good a solution for AffineTransform's SIMD requirements as it can get without a bigger rewrite. * All self-contained simd helpers are moved to a separate file (simd.h). Inline asm is utilized to work around GCC's issues with code generation and register assignment. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101693, https://godbolt.org/z/da76fY1n7 * AffineTransform has 2 specializations. While it's more lines of code due to the boilerplate, the logic in both is significantly reduced, as these two are impossible to nicely combine into one. 1) The first specialization is for cases when there's >=128 inputs. It uses a different approach to perform the affine transform and can make full use of AVX512 without any edge cases. Furthermore, it has higher theoretical throughput because fewer loads are needed in the hot path, requiring only a fixed amount of instructions for horizontal additions at the end, which are amortized by the large number of inputs. 2) The second specialization is made to handle smaller layers where performance is still necessary but edge cases need to be handled. AVX512 implementation for this was omitted by mistake, a remnant from the temporary implementation for the new... This could be easily reintroduced if needed. A slightly more detailed description of both implementations is in the code. 
Overall it should be a minor speedup, as shown on fishtest: passed STC: LLR: 2.96 (-2.94,2.94) <-0.50,2.50> Total: 51520 W: 4074 L: 3888 D: 43558 Ptnml(0-2): 111, 3136, 19097, 3288, 128 and various tests shown in the pull request closes https://github.com/official-stockfish/Stockfish/pull/3663 No functional change	2021-08-20 08:50:25 +02:00
Tomasz Sobczyk	d61d38586e	New NNUE architecture and net Introduces a new NNUE network architecture and associated network parameters The summary of the changes: * Position for each perspective mirrored such that the king is on e..h files. Cuts the feature transformer size in half, while preserving enough knowledge to be good. See https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit#heading=h.b40q4rb1w7on. * The number of neurons after the feature transformer increased two-fold, to 1024x2. This is possibly mostly due to the now very optimized feature transformer update code. * The number of neurons after the second layer is reduced from 16 to 8, to reduce the speed impact. This, perhaps surprisingly, doesn't harm the strength much. See https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit#heading=h.6qkocr97fezq The AffineTransform code did not work out of the box with the smaller number of neurons after the second layer, so some temporary changes have been made to add a special case for InputDimensions == 8. Also additional 0 padding is added to the output for some archs that cannot process inputs by <=8 (SSE2, NEON). VNNI uses an implementation that can keep all outputs in the registers while reducing the number of loads by 3 for each 16 inputs, thanks to the reduced number of output neurons. However GCC is particularly bad at optimization here (and perhaps why the current way the affine transform is done even passed sprt) (see https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit# for details) and more work will be done on this in the following days. I expect the current VNNI implementation to be improved and extended to other architectures. The network was trained with a slightly modified version of the pytorch trainer (https://github.com/glinscott/nnue-pytorch); the changes are in https://github.com/glinscott/nnue-pytorch/pull/143 The training utilized 2 datasets. 
dataset A - https://drive.google.com/file/d/1VlhnHL8f-20AXhGkILujnNXHwy9T-MQw/view?usp=sharing dataset B - as described in `ba01f4b954` The training process was as follows: train on dataset A for 350 epochs, take the best net in terms of elo at 20k nodes per move (it's fine to take anything from later stages of training). convert the .ckpt to .pt --resume-from-model from the .pt file, train on dataset B for <600 epochs, take the best net. Lambda=0.8, applied before the loss function. The first training command: python3 train.py \ ../nnue-pytorch-training/data/large_gensfen_multipvdiff_100_d9.binpack \ ../nnue-pytorch-training/data/large_gensfen_multipvdiff_100_d9.binpack \ --gpus "$3," \ --threads 1 \ --num-workers 1 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --smart-fen-skipping \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=1.0 \ --max_epochs=600 \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 The second training command: python3 serialize.py \ --features=HalfKAv2_hm^ \ ../nnue-pytorch-training/experiment_131/run_6/default/version_0/checkpoints/epoch-499.ckpt \ ../nnue-pytorch-training/experiment_$1/base/base.pt python3 train.py \ ../nnue-pytorch-training/data/michael_commit_b94a65.binpack \ ../nnue-pytorch-training/data/michael_commit_b94a65.binpack \ --gpus "$3," \ --threads 1 \ --num-workers 1 \ --batch-size 16384 \ --progress_bar_refresh_rate 20 \ --smart-fen-skipping \ --random-fen-skipping 3 \ --features=HalfKAv2_hm^ \ --lambda=0.8 \ --max_epochs=600 \ --resume-from-model ../nnue-pytorch-training/experiment_$1/base/base.pt \ --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2 STC: https://tests.stockfishchess.org/tests/view/611120b32a8a49ac5be798c4 LLR: 2.97 (-2.94,2.94) <-0.50,2.50> Total: 22480 W: 2434 L: 2251 D: 17795 Ptnml(0-2): 101, 1736, 7410, 1865, 128 LTC: https://tests.stockfishchess.org/tests/view/611152b32a8a49ac5be798ea LLR: 2.93 (-2.94,2.94) <0.50,3.50> Total: 9776 W: 442 L: 333 D: 
9001 Ptnml(0-2): 5, 295, 4180, 402, 6 closes https://github.com/official-stockfish/Stockfish/pull/3646 bench: 5189338	2021-08-15 12:05:43 +02:00
Tomasz Sobczyk	26edf9534a	Avoid unnecessary stores in the affine transform This patch improves the codegen in the AffineTransform::forward function for architectures >=SSSE3. Current code works directly on memory and the compiler cannot see that the stores through outptr do not alias the loads through weights and input32. The solution implemented is to perform the affine transform with local variables as accumulators and only store the result to memory at the end. The number of accumulators required is OutputDimensions / OutputSimdWidth, which means that for the 1024->16 affine transform it requires 4 registers with SSSE3, 2 with AVX2, 1 with AVX512. It also cuts the number of stores required by NumRegs * 256 for each node evaluated. The local accumulators are expected to be assigned to registers, but even if this cannot be done in some case due to register pressure it will help the compiler to see that there is no aliasing between the loads and stores and may still result in better codegen. See https://godbolt.org/z/59aTKbbYc for codegen comparison. passed STC: LLR: 2.94 (-2.94,2.94) <-0.50,2.50> Total: 140328 W: 10635 L: 10358 D: 119335 Ptnml(0-2): 302, 8339, 52636, 8554, 333 closes https://github.com/official-stockfish/Stockfish/pull/3634 No functional change	2021-07-30 17:15:52 +02:00
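The accumulator count quoted in the commit above (OutputDimensions / OutputSimdWidth, giving 4 registers with SSSE3, 2 with AVX2 and 1 with AVX512 for the 1024->16 transform) can be checked with a short calculation. The sketch assumes 32-bit outputs and register widths of 128, 256 and 512 bits; the function name is hypothetical.

```python
def num_accumulators(output_dimensions, simd_width_bits, output_bits=32):
    """Number of SIMD registers needed to hold all output accumulators.

    Assumes each output is a 32-bit integer lane (illustrative default).
    """
    # OutputSimdWidth: how many outputs fit in one register.
    output_simd_width = simd_width_bits // output_bits
    return output_dimensions // output_simd_width


if __name__ == "__main__":
    for name, width in (("SSSE3", 128), ("AVX2", 256), ("AVX512", 512)):
        print(name, num_accumulators(16, width))
```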
Tomasz Sobczyk	e8d64af123	New NNUE architecture and net Introduces a new NNUE network architecture and associated network parameters, as obtained by a new pytorch trainer. The network is already very strong at short TC, without regression at longer TC, and has potential for further improvements. https://tests.stockfishchess.org/tests/view/60a159c65085663412d0921d TC: 10s+0.1s, 1 thread ELO: 21.74 +-3.4 (95%) LOS: 100.0% Total: 10000 W: 1559 L: 934 D: 7507 Ptnml(0-2): 38, 701, 2972, 1176, 113 https://tests.stockfishchess.org/tests/view/60a187005085663412d0925b TC: 60s+0.6s, 1 thread ELO: 5.85 +-1.7 (95%) LOS: 100.0% Total: 20000 W: 1381 L: 1044 D: 17575 Ptnml(0-2): 27, 885, 7864, 1172, 52 https://tests.stockfishchess.org/tests/view/60a2beede229097940a03806 TC: 20s+0.2s, 8 threads LLR: 2.93 (-2.94,2.94) <0.50,3.50> Total: 34272 W: 1610 L: 1452 D: 31210 Ptnml(0-2): 30, 1285, 14350, 1439, 32 https://tests.stockfishchess.org/tests/view/60a2d687e229097940a03c72 TC: 60s+0.6s, 8 threads LLR: 2.94 (-2.94,2.94) <-2.50,0.50> Total: 45544 W: 1262 L: 1214 D: 43068 Ptnml(0-2): 12, 1129, 20442, 1177, 12 The network has been trained (by vondele) using the https://github.com/glinscott/nnue-pytorch/ trainer (started by glinscott), specifically the branch https://github.com/Sopel97/nnue-pytorch/tree/experiment_56. 
The data used are in 64 billion positions (193GB total) generated and scored with the current master net d8: https://drive.google.com/file/d/1hOOYSDKgOOp38ZmD0N4DV82TOLHzjUiF/view?usp=sharing d9: https://drive.google.com/file/d/1VlhnHL8f-20AXhGkILujnNXHwy9T-MQw/view?usp=sharing d10: https://drive.google.com/file/d/1ZC5upzBYMmMj1gMYCkt6rCxQG0GnO3Kk/view?usp=sharing fishtest_d9: https://drive.google.com/file/d/1GQHt0oNgKaHazwJFTRbXhlCN3FbUedFq/view?usp=sharing This network also contains a few architectural changes with respect to the current master: Size changed from 256x2-32-32-1 to 512x2-16-32-1 ~15-20% slower ~2x larger adds a special path for 16 valued ClippedReLU fixes affine transform code for 16 inputs/outputs, by using InputDimensions instead of PaddedInputDimensions this is safe now because the inputs are processed in groups of 4 in the current affine transform code The feature set changed from HalfKP to HalfKAv2 Includes information about the kings like HalfKA Packs king features better, resulting in 8% size reduction compared to HalfKA The board is flipped for the black's perspective, instead of rotated like in the current master PSQT values for each feature the feature transformer now outputs a part that is forwarded directly to the output and allows learning piece values more directly than the previous network architecture. The effect is visible for high imbalance positions, where the current master network outputs evaluations skewed towards zero. 
8 PSQT values per feature, chosen based on (popcount(pos.pieces()) - 1) / 4 initialized to classical material values on the start of the training 8 subnetworks (512x2->16->32->1), chosen based on (popcount(pos.pieces()) - 1) / 4 only one subnetwork is evaluated for any position, no or marginal speed loss A diagram of the network is available: https://user-images.githubusercontent.com/8037982/118656988-553a1700-b7eb-11eb-82ef-56a11cbebbf2.png A more complete description: https://github.com/glinscott/nnue-pytorch/blob/master/docs/nnue.md closes https://github.com/official-stockfish/Stockfish/pull/3474 Bench: 3806488	2021-05-18 18:06:23 +02:00
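The bucket selection rule quoted in the commit above, (popcount(pos.pieces()) - 1) / 4, maps the number of pieces on the board to one of 8 PSQT value sets and subnetworks. A minimal sketch follows, assuming the occupancy is given as a 64-bit integer bitboard; the function name and the bitboard constants are illustrative, not taken from the engine.

```python
def bucket_index(occupied):
    """Select one of 8 buckets from a 64-bit occupancy bitboard.

    Mirrors (popcount(pos.pieces()) - 1) / 4 with integer division.
    """
    piece_count = bin(occupied).count("1")  # popcount
    return (piece_count - 1) // 4


if __name__ == "__main__":
    start_position = 0xFFFF00000000FFFF  # ranks 1, 2, 7 and 8 fully occupied
    two_kings = (1 << 4) | (1 << 60)     # only two pieces left on the board
    print(bucket_index(start_position), bucket_index(two_kings))
```

With 2 to 32 pieces on the board the index ranges over 0 to 7, which matches the 8 buckets described in the commit.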
Stéphane Nicolet	f90274d8ce	Small clean-ups - Comment for Countemove pruning -> Continuation history - Fix comment in input_slice.h - Shorter lines in Makefile - Comment for scale factor - Fix comment for pinners in see_ge() - Change Thread.id() signature to size_t - Trailing space in reprosearch.sh - Add Douglas Matos Gomes to the AUTHORS file - Introduce comment for undo_null_move() - Use Stockfish coding style for export_net() - Change date in AUTHORS file closes https://github.com/official-stockfish/Stockfish/pull/3416 No functional change	2021-05-17 10:47:14 +02:00
Tomasz Sobczyk	58054fd0fa	Exporting the currently loaded network file This PR adds an ability to export any currently loaded network. The export_net command now takes an optional filename parameter. If the loaded net is not the embedded net the filename parameter is required. Two changes were required to support this: * the "architecture" string, which is really just some kind of description in the net, is now saved into netDescription on load and correctly saved on export. * the AffineTransform scrambles weights for some architectures and sparsifies them, such that retrieving the index is hard. This is solved by having a temporary scrambled<->unscrambled index lookup table when loading the network, and the actual index is saved for each individual weight that makes it to canSaturate16. This increases the size of the canSaturate16 entries by 6 bytes. closes https://github.com/official-stockfish/Stockfish/pull/3456 No functional change	2021-05-11 19:36:11 +02:00
Tomasz Sobczyk	fbbd4adc3c	Unify naming convention of the NNUE code matches the rest of the stockfish code base closes https://github.com/official-stockfish/Stockfish/pull/3437 No functional change	2021-04-24 12:49:29 +02:00
Stéphane Nicolet	83eac08e75	Small cleanups (march 2021) With help of @BM123499, @mstembera, @gvreuls, @noobpwnftw and @Fanael Thanks! Closes https://github.com/official-stockfish/Stockfish/pull/3405 No functional change	2021-03-24 17:11:06 +01:00
Dieter Dobbelaere	7ffae17f85	Add Stockfish namespace. fixes #3350 and is a small cleanup that might make it easier to use SF in separate projects, like a NNUE trainer or similar. closes https://github.com/official-stockfish/Stockfish/pull/3370 No functional change.	2021-03-07 14:26:54 +01:00
MaximMolchanov	303713b560	Affine transform robust implementation Size of the weights in the last layer is less than 512 bits. It leads to wrong data access for AVX512. There is no error because in current implementation it is guaranteed that there is an array of zeros after weights so zero multiplied by something is returned and sum is correct. It is a mistake that can lead to unexpected bugs in the future. Used AVX2 instructions for smaller input size. No measurable slowdown on avx512. closes https://github.com/official-stockfish/Stockfish/pull/3298 No functional change.	2021-01-11 18:54:18 +01:00
Joost VandeVondele	c4d67d77c9	Update copyright years No functional change	2021-01-08 17:04:23 +01:00
MaximMolchanov	23c385ec36	Affine transform refactoring. Reordered weights in such a way that accumulated sum fits to output. Weights are grouped in blocks of four elements because four int8 (weight type) corresponds to one int32 (output type). No horizontal additions. Grouped AVX512, AVX2 and SSSE3 implementations. Repeated code was removed. An earlier version passed STC: LLR: 2.97 (-2.94,2.94) {-0.25,1.25} Total: 15336 W: 1495 L: 1355 D: 12486 Ptnml(0-2): 44, 1054, 5350, 1158, 62 https://tests.stockfishchess.org/tests/view/5ff60e106019e097de3eefd5 Speedup depends on the architecture, up to 4% measured on a NNUE only bench. closes https://github.com/official-stockfish/Stockfish/pull/3287 No functional change	2021-01-08 16:35:44 +01:00
mstembera	d862ba4069	AVX512, AVX2 and SSSE3 speedups Improves throughput by summing 2 intermediate dot products using 16 bit addition before upconverting to 32 bit. Potential saturation is detected and the code-path is avoided in this case. The saturation can't happen with the current nets, but nets can be constructed that trigger this check. STC https://tests.stockfishchess.org/tests/view/5fd40a861ac1691201888479 LLR: 2.94 (-2.94,2.94) {-0.25,1.25} Total: 25544 W: 2451 L: 2296 D: 20797 Ptnml(0-2): 92, 1761, 8925, 1888, 106 about 5% speedup closes https://github.com/official-stockfish/Stockfish/pull/3261 No functional change	2020-12-14 07:46:15 +01:00
Fanael Linithien	c7f0a768cb	Use arithmetic right shift for sign extension in MMX and SSE2 paths This appears to be slightly faster than using a comparison against zero to compute the high bits, on both old (like Pentium III) and new (like Zen 2) hardware. closes https://github.com/official-stockfish/Stockfish/pull/3254 No functional change.	2020-12-12 09:20:15 +01:00
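The commit above replaces a comparison against zero with an arithmetic right shift when computing the high bits for sign extension. The sketch below simulates both on a single 16-bit lane to show they yield the same mask. The 16-bit lane width and all function names are assumptions chosen for illustration; the commit message does not state which element width the MMX and SSE2 paths extend.

```python
def to_signed16(bits):
    """Interpret a 16-bit pattern as a two's complement signed value."""
    return bits - 0x10000 if bits & 0x8000 else bits


def high_bits_by_compare(bits):
    """High half via comparison: all ones if the value is negative."""
    return 0xFFFF if to_signed16(bits) < 0 else 0x0000


def high_bits_by_shift(bits):
    """High half via arithmetic right shift by 15 (copies the sign bit)."""
    # Python's >> on negative ints is an arithmetic shift.
    return (to_signed16(bits) >> 15) & 0xFFFF


def sign_extend_32(bits):
    """Combine the low 16 bits with the computed high 16 bits."""
    return (high_bits_by_shift(bits) << 16) | bits


if __name__ == "__main__":
    same = all(high_bits_by_compare(b) == high_bits_by_shift(b)
               for b in range(0x10000))
    print("masks agree for all 16-bit patterns:", same)
```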
MaximMolchanov	7615e3485e	Calculate sum from first elements in affine transform for AVX512/AVX2/SSSE3 The idea is to initialize sum with the first element instead of zero. Reduce one add_epi32 and one set_zero SIMD instructions for each output dimension. sum = 0; for i = 1 to n sum += a[i] -> sum = a[1]; for i = 2 to n sum += a[i] STC: LLR: 2.95 (-2.94,2.94) {-0.25,1.25} Total: 69048 W: 7024 L: 6799 D: 55225 Ptnml(0-2): 260, 5175, 23458, 5342, 289 https://tests.stockfishchess.org/tests/view/5faf2cf467cbf42301d6aa06 closes https://github.com/official-stockfish/Stockfish/pull/3227 No functional change.	2020-11-25 21:10:13 +01:00
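The idea in the commit above, starting the sum from the first element instead of zero, can be written out in scalar form. This is a plain-Python sketch of the pseudocode in the commit message, not the SIMD implementation; the function names are hypothetical and the input is assumed to be non-empty.

```python
def sum_from_zero(values):
    """Baseline: initialize to zero, then add every element."""
    total = 0
    for v in values:
        total += v
    return total


def sum_from_first(values):
    """Variant: initialize with the first element, skipping one addition
    and the zero initialization. Assumes `values` is non-empty."""
    total = values[0]
    for v in values[1:]:
        total += v
    return total


if __name__ == "__main__":
    products = [3, -7, 12, 5]
    print(sum_from_zero(products), sum_from_first(products))
```

Both variants return the same result; the second performs one addition fewer per output, which is the saving the commit describes per output dimension.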
Tomasz Sobczyk	ba35c88ab8	AVX-512 for smaller affine and feature transforms. For the feature transformer the code is analogous to AVX2 since there was room for easy adaptation of wider simd registers. For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers. These changes provide about 1.5% speedup for AVX-512 builds. Closes https://github.com/official-stockfish/Stockfish/pull/3218 No functional change.	2020-11-07 16:49:49 +01:00
Tomasz Sobczyk	3f6451eff7	Manually align arrays on the stack as a workaround to issues with overaligned alignas() on stack variables in gcc < 9.3 on windows. closes https://github.com/official-stockfish/Stockfish/pull/3217 fixes #3216 No functional change	2020-11-04 19:52:42 +01:00
Tomasz Sobczyk	75e06a1c89	Optimize affine transform for SSSE3 and higher targets. A non-functional speedup. Unroll the loops going over the output dimensions in the affine transform layers by a factor of 4 and perform 4 horizontal additions at a time. Instead of doing naive horizontal additions on each vector separately use hadd and shuffling between vectors to reduce the number of instructions by using all lanes for all stages of the horizontal adds. passed STC of the initial version: LLR: 2.95 (-2.94,2.94) {-0.25,1.25} Total: 17808 W: 1914 L: 1756 D: 14138 Ptnml(0-2): 76, 1330, 5948, 1460, 90 https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3 passed STC of the final version after cleanup: LLR: 2.95 (-2.94,2.94) {-0.25,1.25} Total: 16296 W: 1750 L: 1595 D: 12951 Ptnml(0-2): 72, 1192, 5479, 1319, 86 https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3 closes https://github.com/official-stockfish/Stockfish/pull/3203 No functional change	2020-11-02 19:41:17 +01:00
mstembera	701b2427bd	Support VNNI on 256bit vectors due to downclocking on current chips (tested up to cascade lake) supporting avx512 and vnni512, it is better to use avx2 or vnni256 in multithreaded (in particular hyperthreaded) engine use. In single threaded use, the picture is different. gcc compilation for vnni256 requires a toolchain for gcc >= 9. closes https://github.com/official-stockfish/Stockfish/pull/3038 No functional change	2020-08-24 12:03:04 +02:00
Stéphane Nicolet	81d716f5cc	Reformat code in little-endian patch Reformat code and rename the function to "read_little_endian()" in the recent commit by Ronald de Man for support of big endian systems. closes https://github.com/official-stockfish/Stockfish/pull/3016 No functional change ----- Recommended net: https://tests.stockfishchess.org/api/nn/nn-82215d0fd0df.nnue	2020-08-17 12:15:57 +02:00
syzygy1	72dc7a5c54	Assume network file is in little-endian byte order This patch fixes the byte order when reading 16- and 32-bit values from the network file on a big-endian machine. Bytes are ordered in read_le() using unsigned arithmetic, which doesn't need tricks to determine the endianness of the machine. Unfortunately the compiler doesn't seem to be able to optimise the ordering operation, but reading in the weights is not a time-critical operation and the extra time it takes should not be noticeable. Big endian systems are still untested with NNUE. fixes #3007 closes https://github.com/official-stockfish/Stockfish/pull/3009 No functional change.	2020-08-16 21:10:26 +02:00
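The byte-ordering approach described in the commit above (assembling the value from individual bytes with unsigned arithmetic, so the host's endianness never matters) can be sketched as follows. This is an illustration in Python rather than the engine's code; the signature is hypothetical.

```python
def read_little_endian(data, offset, size):
    """Assemble an unsigned integer from `size` little-endian bytes.

    Works identically on any host because it only shifts and ORs
    individual byte values; no native integer layout is involved.
    """
    value = 0
    for i in range(size):
        # Byte i is the i-th least significant byte.
        value |= data[offset + i] << (8 * i)
    return value


if __name__ == "__main__":
    raw = bytes([0x78, 0x56, 0x34, 0x12, 0xCD, 0xAB])
    print(hex(read_little_endian(raw, 0, 4)))  # 32-bit value
    print(hex(read_little_endian(raw, 4, 2)))  # 16-bit value
```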
mstembera	dd63b98fb0	Add support for VNNI Adds support for Vector Neural Network Instructions (avx512), as available on Intel Cascade Lake The _mm512_dpbusd_epi32() intrinsic (vpdpbusd instruction) is tailor-made for NNUE. on a cascade lake CPU (AWS C5.24x.large, gcc 10) NNUE eval is at roughly 78% nps of classical (single core test) bench 1024 1 24 default depth: target classical NNUE ratio vnni 2207232 1725987 78.20 avx512 2216789 1671734 75.41 avx2 2194006 1611263 73.44 modern 2185001 1352469 61.90 closes https://github.com/official-stockfish/Stockfish/pull/2987 No functional change	2020-08-13 07:39:52 +02:00
mstembera	f46c73040c	Fix AVX512 build with older compilers avoids an intrinsic that is missing in gcc < 10. For this target, might trigger another gcc bug on windows that requires up-to-date gcc 8, 9, or 10, or usage of clang. Fixes https://github.com/official-stockfish/Stockfish/issues/2975 closes https://github.com/official-stockfish/Stockfish/pull/2976 No functional change	2020-08-11 08:17:03 +02:00
Fanael Linithien	21df37d7fd	Provide vectorized NNUE code for SSE2 and MMX targets This patch allows old x86 CPUs, from AMD K8 (which the x86-64 baseline targets) all the way down to the Pentium MMX, to benefit from NNUE with comparable performance hit versus hand-written eval as on more modern processors. NPS of the bench with NNUE enabled on a Pentium III 1.13 GHz (using the MMX code): master: 38951 this patch: 80586 NPS of the bench with NNUE enabled using baseline x86-64 arch, which is how linux distros are likely to package stockfish, on a modern CPU (using the SSE2 code): master: 882584 this patch: 1203945 closes https://github.com/official-stockfish/Stockfish/pull/2956 No functional change.	2020-08-10 19:17:57 +02:00
mstembera	f948cd008d	Cleanup and optimize SSE/AVX code AVX512 +4% faster, AVX2 +1% faster, SSSE3 +5% faster. Passed non-regression STC: https://tests.stockfishchess.org/tests/view/5f31249f90816720665374f6 LLR: 2.96 (-2.94,2.94) {-1.50,0.50} Total: 17576 W: 2344 L: 2245 D: 12987 Ptnml(0-2): 127, 1570, 5292, 1675, 124 closes https://github.com/official-stockfish/Stockfish/pull/2962 No functional change	2020-08-10 14:38:17 +02:00
mstembera	875183b310	Workaround using unaligned loads for gcc < 9 Despite the use of alignas, the (avx2/avx512) code generated by older gcc needs to use unaligned loads (e.g. confirmed crash with gcc 7.3/mingw on abrok). Better performance thus requires gcc >= 9 on hardware supporting avx2/avx512. closes https://github.com/official-stockfish/Stockfish/pull/2969 No functional change	2020-08-10 11:12:35 +02:00
Joost VandeVondele	651ec3b31e	Revert "Avoid special casing for MinGW" This reverts commit `a6e89293df`. The offending setup has been found as gcc/mingw 7.3 (on Ubuntu 18.04). fixes https://github.com/official-stockfish/Stockfish/issues/2963 closes https://github.com/official-stockfish/Stockfish/issues/2968 No functional change.	2020-08-10 07:28:19 +02:00
Dariusz Orzechowski	a6e89293df	Avoid special casing for MinGW after some testing, no version of MinGW/gcc has been found where this code is still necessary. Probably older code (pre-c++17?) closes https://github.com/official-stockfish/Stockfish/pull/2891 No functional change	2020-08-09 23:49:14 +02:00
nodchip	84f3e86790	Add NNUE evaluation This patch ports the efficiently updatable neural network (NNUE) evaluation to Stockfish. Both the NNUE and the classical evaluations are available, and can be used to assign a value to a position that is later used in alpha-beta (PVS) search to find the best move. The classical evaluation computes this value as a function of various chess concepts, handcrafted by experts, tested and tuned using fishtest. The NNUE evaluation computes this value with a neural network based on basic inputs. The network is optimized and trained on the evaluations of millions of positions at moderate search depth. The NNUE evaluation was first introduced in shogi, and ported to Stockfish afterward. It can be evaluated efficiently on CPUs, and exploits the fact that only parts of the neural network need to be updated after a typical chess move. [The nodchip repository](https://github.com/nodchip/Stockfish) provides additional tools to train and develop the NNUE networks. This patch is the result of contributions of various authors, from various communities, including: nodchip, ynasu87, yaneurao (initial port and NNUE authors), domschl, FireFather, rqs, xXH4CKST3RXx, tttak, zz4032, joergoster, mstembera, nguyenpham, erbsenzaehler, dorzechowski, and vondele. This new evaluation needed various changes to fishtest and the corresponding infrastructure, for which tomtor, ppigazzini, noobpwnftw, daylen, and vondele are gratefully acknowledged. The first networks have been provided by gekkehenker and sergiovieri, with the latter net (nn-97f742aaefcd.nnue) being the current default. The evaluation function can be selected at run time with the `Use NNUE` (true/false) UCI option, provided the `EvalFile` option points to the network file (depending on the GUI, with full path). 
The performance of the NNUE evaluation relative to the classical evaluation depends somewhat on the hardware, and is expected to improve quickly, but is currently > 80 Elo on fishtest: 60000 @ 10+0.1 th 1 https://tests.stockfishchess.org/tests/view/5f28fe6ea5abc164f05e4c4c ELO: 92.77 +-2.1 (95%) LOS: 100.0% Total: 60000 W: 24193 L: 8543 D: 27264 Ptnml(0-2): 609, 3850, 9708, 10948, 4885 40000 @ 20+0.2 th 8 https://tests.stockfishchess.org/tests/view/5f290229a5abc164f05e4c58 ELO: 89.47 +-2.0 (95%) LOS: 100.0% Total: 40000 W: 12756 L: 2677 D: 24567 Ptnml(0-2): 74, 1583, 8550, 7776, 2017 At the same time, the impact on the classical evaluation remains minimal, causing no significant regression: sprt @ 10+0.1 th 1 https://tests.stockfishchess.org/tests/view/5f2906a2a5abc164f05e4c5b LLR: 2.94 (-2.94,2.94) {-6.00,-4.00} Total: 34936 W: 6502 L: 6825 D: 21609 Ptnml(0-2): 571, 4082, 8434, 3861, 520 sprt @ 60+0.6 th 1 https://tests.stockfishchess.org/tests/view/5f2906cfa5abc164f05e4c5d LLR: 2.93 (-2.94,2.94) {-6.00,-4.00} Total: 10088 W: 1232 L: 1265 D: 7591 Ptnml(0-2): 49, 914, 3170, 843, 68 The needed networks can be found at https://tests.stockfishchess.org/nns It is recommended to use the default one as indicated by the `EvalFile` UCI option. Guidelines for testing new nets can be found at https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#nnue-net-tests Integration has been discussed in various issues: https://github.com/official-stockfish/Stockfish/issues/2823 https://github.com/official-stockfish/Stockfish/issues/2728 The integration branch will be closed after the merge: https://github.com/official-stockfish/Stockfish/pull/2825 https://github.com/official-stockfish/Stockfish/tree/nnue-player-wip closes https://github.com/official-stockfish/Stockfish/pull/2912 This will be an exciting time for computer chess, looking forward to seeing the evolution of this approach. Bench: 4746616	2020-08-06 16:37:45 +02:00
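
The Elo figures quoted in the NNUE commit (84f3e86790) can be approximately reproduced from its win/loss/draw counts with the standard logistic Elo formula. The following is a minimal sketch: the helper name `elo_from_wld` is illustrative and not part of fishtest, and fishtest's own calculation (in particular the error bars) may differ in detail.

```python
import math


def elo_from_wld(wins: int, losses: int, draws: int) -> float:
    """Logistic Elo difference implied by a match result (hypothetical helper)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games  # fraction of points scored, in (0, 1)
    return -400.0 * math.log10(1.0 / score - 1.0)


# Counts copied from the two fixed-game tests in the commit message.
print(round(elo_from_wld(24193, 8543, 27264), 2))  # 60000 games @ 10+0.1, 1 thread
print(round(elo_from_wld(12756, 2677, 24567), 2))  # 40000 games @ 20+0.2, 8 threads
```

The two printed values should agree with the 92.77 and 89.47 Elo reported in the message to within rounding.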

34 commits
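
The byte-ordering approach described in the little-endian commit (72dc7a5c54) can be illustrated as follows. This is a Python sketch of the idea only, not the C++ `read_le()` from the patch; `read_le_unsigned` and its parameters are hypothetical names. Because the value is assembled with shifts on an unsigned accumulator, the result is the same on little- and big-endian hosts, so the host's endianness never has to be detected.

```python
import io


def read_le_unsigned(stream, size: int) -> int:
    """Read `size` bytes and combine them as a little-endian unsigned integer."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of stream")
    value = 0
    for i, byte in enumerate(data):
        # Byte i of the file contributes bits 8*i .. 8*i+7 of the result.
        value |= byte << (8 * i)
    return value


# A 16-bit value followed by a 32-bit value, both stored little-endian.
buf = io.BytesIO(bytes([0x34, 0x12, 0x78, 0x56, 0x34, 0x12]))
print(hex(read_le_unsigned(buf, 2)))  # 0x1234
print(hex(read_le_unsigned(buf, 4)))  # 0x12345678
```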