BadFish

mirror of https://github.com/sockspls/badfish synced 2025-07-12 03:59:15 +00:00

Author	SHA1	Message	Date
Stéphane Nicolet	74776dbcd5	Simplification in evaluate_nnue.cpp Removes the test on non-pawn-material before applying the positional/materialistic bonus. Passed STC: LLR: 2.94 (-2.94,2.94) <-2.25,0.25> Total: 46904 W: 12197 L: 12059 D: 22648 Ptnml(0-2): 170, 5243, 12479, 5399, 161 https://tests.stockfishchess.org/tests/view/61be57cf57a0d0f327c3999d Passed LTC: LLR: 2.95 (-2.94,2.94) <-2.25,0.25> Total: 18760 W: 4958 L: 4790 D: 9012 Ptnml(0-2): 14, 1942, 5301, 2108, 15 https://tests.stockfishchess.org/tests/view/61bed1fb57a0d0f327c3afa9 closes https://github.com/official-stockfish/Stockfish/pull/3866 Bench: 4826206	2021-12-19 15:44:01 +01:00
Tomasz Sobczyk	4766dfc395	Optimize FT activation and affine transform for NEON. This patch optimizes the NEON implementation in two ways. The activation layer after the feature transformer is rewritten to make it easier for the compiler to see through dependencies and unroll. This in itself is a minimal but positive improvement. Other architectures could benefit from this too in the future. This is not an algorithmic change. The affine transform for large matrices (first layer after FT) on NEON now utilizes the same optimized code path as >=SSSE3, which makes the memory accesses more sequential and makes better use of the available registers, which allows for code that has longer dependency chains. Benchmarks from Redshift#161, profile-build with apple clang george@Georges-MacBook-Air nets % ./stockfish-b82d93 bench 2>&1 | tail -4 (current master) =========================== Total time (ms) : 2167 Nodes searched : 4667742 Nodes/second : 2154011 george@Georges-MacBook-Air nets % ./stockfish-7377b8 bench 2>&1 | tail -4 (this patch) =========================== Total time (ms) : 1842 Nodes searched : 4667742 Nodes/second : 2534061 This is a solid 18% improvement overall, larger in a bench with NNUE-only, not mixed. Improvement is also observed on armv7-neon (Raspberry Pi, and older phones), around 5% speedup. No changes for architectures other than NEON. closes https://github.com/official-stockfish/Stockfish/pull/3837 No functional changes.	2021-12-07 18:08:54 +01:00
Michael Ortmann	4b86ef8c4f	Fix typos in comments, adjust readme closes https://github.com/official-stockfish/Stockfish/pull/3822 also adjusts readme as requested in https://github.com/official-stockfish/Stockfish/pull/3816 No functional change	2021-12-01 18:07:30 +01:00
hengyu	64f21ecdae	Small clean-up remove unneeded calculation. closes https://github.com/official-stockfish/Stockfish/pull/3807 No functional change.	2021-12-01 17:59:20 +01:00
Stefano Cardanobile	2214fcecf7	Rewrite NNUE evaluation adjustments Make the eval code in the evaluate_nnue.cpp more similar to the rest of the codebase: * remove multiple variable assignment * make if conditions explicit and indent on multiple lines passed STC LLR: 2.93 (-2.94,2.94) <-2.50,0.50> Total: 59032 W: 14834 L: 14751 D: 29447 Ptnml(0-2): 176, 6310, 16459, 6397, 174 https://tests.stockfishchess.org/tests/view/616f250540f619782fd4f76d closes https://github.com/official-stockfish/Stockfish/pull/3753 No functional change	2021-10-23 12:22:02 +02:00
mstembera	644f6d4790	Simplify away ValueListInserter plus minor cleanups STC: https://tests.stockfishchess.org/tests/view/616f059b40f619782fd4f73f LLR: 2.94 (-2.94,2.94) <-2.50,0.50> Total: 84992 W: 21244 L: 21197 D: 42551 Ptnml(0-2): 279, 9005, 23868, 9078, 266 closes https://github.com/official-stockfish/Stockfish/pull/3749 No functional change	2021-10-23 12:21:17 +02:00
xoto10	f21a66f70d	Small clean-up, Sept 2021 Closes https://github.com/official-stockfish/Stockfish/pull/3485 No functional change	2021-10-07 09:41:57 +02:00
Michael Chaly	e8788d1b32	Combo of various parameter tweaks Combination of parameter tweaks in search, evaluation and time management. Original patches by snicolet xoto10 lonfom169 and Vizvezdenec. Includes: * Use bigger grain of positional evaluation more frequently (up to 1 exchange difference in non-pawn-material); * More extra time according to increment; * Increase margin for singular extensions; * Do more aggressive parent node futility pruning. Passed STC https://tests.stockfishchess.org/tests/view/6147deab3733d0e0dd9f313d LLR: 2.94 (-2.94,2.94) <-0.50,2.50> Total: 45488 W: 11691 L: 11450 D: 22347 Ptnml(0-2): 145, 5208, 11824, 5395, 172 Passed LTC https://tests.stockfishchess.org/tests/view/6147f1d53733d0e0dd9f3141 LLR: 2.94 (-2.94,2.94) <0.50,3.50> Total: 62520 W: 15808 L: 15482 D: 31230 Ptnml(0-2): 43, 6439, 17960, 6785, 33 closes https://github.com/official-stockfish/Stockfish/pull/3710 bench 5575265	2021-09-21 19:48:40 +02:00
Tomasz Sobczyk	18dcf1f097	Optimize and tidy up affine transform code. The new network caused some issues initially due to the very narrow neuron set between the first two FC layers. Necessary changes were hacked together to make it work. This patch is a mature approach to make the affine transform code faster, more readable, and easier to maintain should the layer sizes change again. The following changes were made: * ClippedReLU always produces a multiple of 32 outputs. This is about as good of a solution for AffineTransform's SIMD requirements as it can get without a bigger rewrite. * All self-contained simd helpers are moved to a separate file (simd.h). Inline asm is utilized to work around GCC's issues with code generation and register assignment. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101693, https://godbolt.org/z/da76fY1n7 * AffineTransform has 2 specializations. While it's more lines of code due to the boilerplate, the logic in both is significantly reduced, as these two are impossible to nicely combine into one. 1) The first specialization is for cases when there's >=128 inputs. It uses a different approach to perform the affine transform and can make full use of AVX512 without any edge cases. Furthermore, it has higher theoretical throughput because fewer loads are needed in the hot path, requiring only a fixed amount of instructions for horizontal additions at the end, which are amortized by the large number of inputs. 2) The second specialization is made to handle smaller layers where performance is still necessary but edge cases need to be handled. AVX512 implementation for this was omitted by mistake, a remnant from the temporary implementation for the new... This could be easily reintroduced if needed. A slightly more detailed description of both implementations is in the code.
Overall it should be a minor speedup, as shown on fishtest: passed STC: LLR: 2.96 (-2.94,2.94) <-0.50,2.50> Total: 51520 W: 4074 L: 3888 D: 43558 Ptnml(0-2): 111, 3136, 19097, 3288, 128 and various tests shown in the pull request closes https://github.com/official-stockfish/Stockfish/pull/3663 No functional change	2021-08-20 08:50:25 +02:00
Tomasz Sobczyk	d61d38586e	New NNUE architecture and net Introduces a new NNUE network architecture and associated network parameters. The summary of the changes: * Position for each perspective mirrored such that the king is on e..h files. Cuts the feature transformer size in half, while preserving enough knowledge to be good. See https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit#heading=h.b40q4rb1w7on. * The number of neurons after the feature transformer increased two-fold, to 1024x2. This is possibly mostly due to the now very optimized feature transformer update code. * The number of neurons after the second layer is reduced from 16 to 8, to reduce the speed impact. This, perhaps surprisingly, doesn't harm the strength much. See https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit#heading=h.6qkocr97fezq The AffineTransform code did not work out of the box with the smaller number of neurons after the second layer, so some temporary changes have been made to add a special case for InputDimensions == 8. Also additional 0 padding is added to the output for some archs that cannot process inputs by <=8 (SSE2, NEON). VNNI uses an implementation that can keep all outputs in the registers while reducing the number of loads by 3 for each 16 inputs, thanks to the reduced number of output neurons. However GCC is particularly bad at optimization here (and perhaps why the current way the affine transform is done even passed sprt) (see https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit# for details) and more work will be done on this in the following days. I expect the current VNNI implementation to be improved and extended to other architectures. The network was trained with a slightly modified version of the pytorch trainer (https://github.com/glinscott/nnue-pytorch); the changes are in https://github.com/glinscott/nnue-pytorch/pull/143 The training utilized 2 datasets.
dataset A - https://drive.google.com/file/d/1VlhnHL8f-20AXhGkILujnNXHwy9T-MQw/view?usp=sharing
dataset B - as described in `ba01f4b954`
The training process was as follows:
* train on dataset A for 350 epochs, take the best net in terms of elo at 20k nodes per move (it's fine to take anything from later stages of training).
* convert the .ckpt to .pt
* --resume-from-model from the .pt file, train on dataset B for <600 epochs, take the best net. Lambda=0.8, applied before the loss function.
The first training command:
    python3 train.py \
        ../nnue-pytorch-training/data/large_gensfen_multipvdiff_100_d9.binpack \
        ../nnue-pytorch-training/data/large_gensfen_multipvdiff_100_d9.binpack \
        --gpus "$3," \
        --threads 1 \
        --num-workers 1 \
        --batch-size 16384 \
        --progress_bar_refresh_rate 20 \
        --smart-fen-skipping \
        --random-fen-skipping 3 \
        --features=HalfKAv2_hm^ \
        --lambda=1.0 \
        --max_epochs=600 \
        --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2
The second training command:
    python3 serialize.py \
        --features=HalfKAv2_hm^ \
        ../nnue-pytorch-training/experiment_131/run_6/default/version_0/checkpoints/epoch-499.ckpt \
        ../nnue-pytorch-training/experiment_$1/base/base.pt
    python3 train.py \
        ../nnue-pytorch-training/data/michael_commit_b94a65.binpack \
        ../nnue-pytorch-training/data/michael_commit_b94a65.binpack \
        --gpus "$3," \
        --threads 1 \
        --num-workers 1 \
        --batch-size 16384 \
        --progress_bar_refresh_rate 20 \
        --smart-fen-skipping \
        --random-fen-skipping 3 \
        --features=HalfKAv2_hm^ \
        --lambda=0.8 \
        --max_epochs=600 \
        --resume-from-model ../nnue-pytorch-training/experiment_$1/base/base.pt \
        --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2
STC: https://tests.stockfishchess.org/tests/view/611120b32a8a49ac5be798c4 LLR: 2.97 (-2.94,2.94) <-0.50,2.50> Total: 22480 W: 2434 L: 2251 D: 17795 Ptnml(0-2): 101, 1736, 7410, 1865, 128
LTC: https://tests.stockfishchess.org/tests/view/611152b32a8a49ac5be798ea LLR: 2.93 (-2.94,2.94) <0.50,3.50> Total: 9776 W: 442 L: 333 D: 9001 Ptnml(0-2): 5, 295, 4180, 402, 6
closes https://github.com/official-stockfish/Stockfish/pull/3646 bench: 5189338	2021-08-15 12:05:43 +02:00
Tomasz Sobczyk	26edf9534a	Avoid unnecessary stores in the affine transform This patch improves the codegen in the AffineTransform::forward function for architectures >=SSSE3. Current code works directly on memory and the compiler cannot see that the stores through outptr do not alias the loads through weights and input32. The solution implemented is to perform the affine transform with local variables as accumulators and only store the result to memory at the end. The number of accumulators required is OutputDimensions / OutputSimdWidth, which means that for the 1024->16 affine transform it requires 4 registers with SSSE3, 2 with AVX2, 1 with AVX512. It also cuts the number of stores required by NumRegs * 256 for each node evaluated. The local accumulators are expected to be assigned to registers, but even if this cannot be done in some case due to register pressure it will help the compiler to see that there is no aliasing between the loads and stores and may still result in better codegen. See https://godbolt.org/z/59aTKbbYc for codegen comparison. passed STC: LLR: 2.94 (-2.94,2.94) <-0.50,2.50> Total: 140328 W: 10635 L: 10358 D: 119335 Ptnml(0-2): 302, 8339, 52636, 8554, 333 closes https://github.com/official-stockfish/Stockfish/pull/3634 No functional change	2021-07-30 17:15:52 +02:00
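The local-accumulator idea in the commit above can be illustrated with a scalar sketch. This is not the SIMD code from the patch: the function names, the tiny layer sizes, and the row-major weight layout are assumptions chosen for illustration. The first version accumulates through the output pointer, while the second keeps the partial sums in local variables and stores them once at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical tiny layer sizes, chosen only to keep the sketch short.
constexpr std::size_t kInputs  = 8;
constexpr std::size_t kOutputs = 4;

// Accumulates directly into `out`. The compiler generally has to assume that
// stores through `out` may alias the loads through `weights` and `in`, so
// partial sums may be written to memory and reloaded in the inner loop.
inline void affine_via_memory(const std::int8_t* weights, const std::int32_t* biases,
                              const std::uint8_t* in, std::int32_t* out) {
    for (std::size_t o = 0; o < kOutputs; ++o)
        out[o] = biases[o];
    for (std::size_t i = 0; i < kInputs; ++i)
        for (std::size_t o = 0; o < kOutputs; ++o)
            out[o] += weights[o * kInputs + i] * in[i];
}

// Accumulates into a local array, which cannot alias the inputs, and writes
// the result to memory only once at the end.
inline void affine_via_locals(const std::int8_t* weights, const std::int32_t* biases,
                              const std::uint8_t* in, std::int32_t* out) {
    assert(out != nullptr);
    std::int32_t acc[kOutputs];
    for (std::size_t o = 0; o < kOutputs; ++o)
        acc[o] = biases[o];
    for (std::size_t i = 0; i < kInputs; ++i)
        for (std::size_t o = 0; o < kOutputs; ++o)
            acc[o] += weights[o * kInputs + i] * in[i];
    for (std::size_t o = 0; o < kOutputs; ++o)
        out[o] = acc[o];
}
```

Both functions compute the same result; only the place where the partial sums live differs, which is what the commit describes as helping the compiler's code generation.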
Stéphane Nicolet	b51b094419	Simplify format_cp_aligned_dot() closes https://github.com/official-stockfish/Stockfish/pull/3583 No functional change	2021-07-03 09:25:16 +02:00
Joost VandeVondele	2e2865d34b	Fix build error on OSX directly use integer version for cp calculation. fixes https://github.com/official-stockfish/Stockfish/issues/3573 closes https://github.com/official-stockfish/Stockfish/pull/3574 No functional change	2021-06-21 23:14:58 +02:00
Tomasz Sobczyk	2e745956c0	Change trace with NNUE eval support This patch adds some more output to the `eval` command. It adds a board display with estimated piece values (method is remove-piece, evaluate, put-piece), and splits the NNUE evaluation with (psqt,layers) for each bucket for the NNUE net. Example:
```
./stockfish
position fen 3Qb1k1/1r2ppb1/pN1n2q1/Pp1Pp1Pr/4P2p/4BP2/4B1R1/1R5K b - - 11 40
eval

 Contributing terms for the classical eval:
+------------+-------------+-------------+-------------+
|    Term    |    White    |    Black    |    Total    |
|            |   MG    EG  |   MG    EG  |   MG    EG  |
+------------+-------------+-------------+-------------+
|   Material |  ----  ---- |  ----  ---- | -0.73 -1.55 |
|  Imbalance |  ----  ---- |  ----  ---- | -0.21 -0.17 |
|      Pawns |  0.35 -0.00 |  0.19 -0.26 |  0.16  0.25 |
|    Knights |  0.04 -0.08 |  0.12 -0.01 | -0.08 -0.07 |
|    Bishops | -0.34 -0.87 | -0.17 -0.61 | -0.17 -0.26 |
|      Rooks |  0.12  0.00 |  0.08  0.00 |  0.04  0.00 |
|     Queens |  0.00  0.00 | -0.27 -0.07 |  0.27  0.07 |
|   Mobility |  0.84  1.76 |  0.01  0.66 |  0.83  1.10 |
|King safety | -0.99 -0.17 | -0.72 -0.10 | -0.27 -0.07 |
|    Threats |  0.27  0.27 |  0.73  0.86 | -0.46 -0.59 |
|     Passed |  0.00  0.00 |  0.79  0.82 | -0.79 -0.82 |
|      Space |  0.61  0.00 |  0.24  0.00 |  0.37  0.00 |
|   Winnable |  ----  ---- |  ----  ---- |  0.00 -0.03 |
+------------+-------------+-------------+-------------+
|      Total |  ----  ---- |  ----  ---- | -1.03 -2.14 |
+------------+-------------+-------------+-------------+

 NNUE derived piece values:
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |   Q   |   b   |       |   k   |       |
|       |       |       | +12.4 | -1.62 |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |   r   |       |       |   p   |   p   |   b   |       |
|       | -3.89 |       |       | -0.84 | -1.19 | -3.32 |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   p   |   N   |       |   n   |       |       |   q   |       |
| -1.81 | +3.71 |       | -4.82 |       |       | -5.04 |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   P   |   p   |       |   P   |   p   |       |   P   |   r   |
| +1.16 | -0.91 |       | +0.55 | +0.12 |       | +0.50 | -4.02 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |   P   |       |       |   p   |
|       |       |       |       | +2.33 |       |       | +1.17 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |   B   |   P   |       |       |
|       |       |       |       | +4.79 | +1.54 |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |   B   |       |   R   |       |
|       |       |       |       | +4.54 |       | +6.03 |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |   R   |       |       |       |       |       |   K   |
|       | +4.81 |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+

 NNUE network contributions (Black to move)
+------------+------------+------------+------------+
|   Bucket   |  Material  | Positional |   Total    |
|            |   (PSQT)   |  (Layers)  |            |
+------------+------------+------------+------------+
|  0         |  +  0.32   |  -  1.46   |  -  1.13   |
|  1         |  +  0.25   |  -  0.68   |  -  0.43   |
|  2         |  +  0.46   |  -  1.72   |  -  1.25   |
|  3         |  +  0.55   |  -  1.80   |  -  1.25   |
|  4         |  +  0.48   |  -  1.77   |  -  1.29   |
|  5         |  +  0.40   |  -  2.00   |  -  1.60   |
|  6         |  +  0.57   |  -  2.12   |  -  1.54   | <-- this bucket is used
|  7         |  +  3.38   |  -  2.00   |  +  1.37   |
+------------+------------+------------+------------+

Classical evaluation   -1.00 (white side)
NNUE evaluation        +1.54 (white side)
Final evaluation       +2.38 (white side) [with scaled NNUE, hybrid, ...]
```
Also renames the export_net() function to save_eval() while there. closes https://github.com/official-stockfish/Stockfish/pull/3562 No functional change	2021-06-19 11:57:01 +02:00
Tomasz Sobczyk	900f249f59	Reduce the number of accumulator states Reduce from 3 to 2. Make the intent of the states clearer. STC: https://tests.stockfishchess.org/tests/view/60c50111457376eb8bcaad03 LLR: 2.95 (-2.94,2.94) <-2.50,0.50> Total: 61888 W: 5007 L: 4944 D: 51937 Ptnml(0-2): 164, 3947, 22649, 4030, 154 LTC: https://tests.stockfishchess.org/tests/view/60c52b1c457376eb8bcaad2c LLR: 2.94 (-2.94,2.94) <-2.50,0.50> Total: 20248 W: 688 L: 618 D: 18942 Ptnml(0-2): 7, 551, 8946, 605, 15 closes https://github.com/official-stockfish/Stockfish/pull/3548 No functional change.	2021-06-14 11:22:08 +02:00
Tomasz Sobczyk	ce4c523ad3	Register count for feature transformer Compute optimal register count for feature transformer accumulation dynamically. This also introduces a change where AVX512 would only use 8 registers instead of 16 (now possible due to a 2x increase in feature transformer size). closes https://github.com/official-stockfish/Stockfish/pull/3543 No functional change	2021-06-13 13:10:56 +02:00
Stéphane Nicolet	7819412002	Clarify use of UCI options Update README.md to clarify use of UCI options closes https://github.com/official-stockfish/Stockfish/pull/3540 No functional change	2021-06-13 10:02:43 +02:00
Tomasz Sobczyk	b84fa04db6	Read NNUE net faster Load feature transformer weights in bulk on little-endian machines. This is in particular useful to test new nets with c-chess-cli, see https://github.com/lucasart/c-chess-cli/issues/44 ``` $ time ./stockfish.exe uci Before : 0m0.914s After : 0m0.483s ``` No functional change	2021-06-13 09:39:03 +02:00
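A minimal sketch of the bulk-load idea from the commit above, assuming the weights are stored as little-endian 16-bit integers. All function names here are hypothetical and do not come from the Stockfish sources. On a little-endian host the on-disk byte order already matches the in-memory layout, so one `read` call can fill the whole array; otherwise each value is assembled byte by byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Runtime probe for the host byte order.
inline bool is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Portable path: assemble one little-endian 16-bit value from two bytes.
inline std::int16_t read_le16(std::istream& stream) {
    unsigned char bytes[2] = {0, 0};
    stream.read(reinterpret_cast<char*>(bytes), 2);
    const std::uint16_t raw = static_cast<std::uint16_t>(bytes[0] | (bytes[1] << 8));
    return static_cast<std::int16_t>(raw);  // two's complement conversion
}

// Reads `count` values, in one bulk read when the host is little-endian.
inline void read_weights(std::istream& stream, std::int16_t* out, std::size_t count) {
    assert(out != nullptr || count == 0);
    if (is_little_endian())
        stream.read(reinterpret_cast<char*>(out),
                    static_cast<std::streamsize>(count * sizeof(std::int16_t)));
    else
        for (std::size_t i = 0; i < count; ++i)
            out[i] = read_le16(stream);
}

// Convenience wrapper for trying the sketch on an in-memory byte string.
inline std::vector<std::int16_t> read_weights_from_bytes(const std::string& bytes) {
    std::istringstream stream(bytes);
    std::vector<std::int16_t> out(bytes.size() / 2);
    read_weights(stream, out.data(), out.size());
    return out;
}
```

The per-element path stays as the fallback, so the file format is unchanged; only the number of stream calls drops on little-endian machines.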
Stéphane Nicolet	8f081c86f7	Clean SIMD code a bit Cleaner vector code structure in feature transformer. This patch just regroups the parts of the inner loop for each SIMD instruction set. Tested for non-regression: LLR: 2.96 (-2.94,2.94) <-2.50,0.50> Total: 115760 W: 9835 L: 9831 D: 96094 Ptnml(0-2): 326, 7776, 41715, 7694, 369 https://tests.stockfishchess.org/tests/view/60b96b39457376eb8bcaa26e It would be nice if a future patch could use some of the macros at the top of the file to unify the code between the distinct SIMD instruction sets (of course, unifying the Relu will be the challenge). closes https://github.com/official-stockfish/Stockfish/pull/3506 No functional change	2021-06-04 14:07:46 +02:00
Tomasz Sobczyk	5448cad29e	Fix export of the feature transformer. PSQT export was missing. fixes #3507 closes https://github.com/official-stockfish/Stockfish/pull/3508 No functional change	2021-05-30 21:31:58 +02:00
Stéphane Nicolet	f193778446	Do not use lazy evaluation inside NNUE This simplification patch implements two changes: 1. it simplifies away the so-called "lazy" path in the NNUE evaluation internals, where we trusted the psqt head alone to avoid the costly "positional" head in some cases; 2. it raises a little bit the NNUEThreshold1 in evaluate.cpp (from 682 to 800), which increases the limit where we switched from NNUE eval to Classical eval. Both effects increase the number of positional evaluations done by our new net architecture, but the results of our tests below seem to indicate that the loss of speed will be compensated by the gain of eval quality. STC: LLR: 2.95 (-2.94,2.94) <-2.50,0.50> Total: 26280 W: 2244 L: 2137 D: 21899 Ptnml(0-2): 72, 1755, 9405, 1810, 98 https://tests.stockfishchess.org/tests/view/60ae73f112066fd299795a51 LTC: LLR: 2.95 (-2.94,2.94) <-2.50,0.50> Total: 20592 W: 750 L: 677 D: 19165 Ptnml(0-2): 9, 614, 8980, 681, 12 https://tests.stockfishchess.org/tests/view/60ae88e812066fd299795a82 closes https://github.com/official-stockfish/Stockfish/pull/3503 Bench: 3817907	2021-05-27 01:21:56 +02:00
Tomasz Sobczyk	9d53129075	Expose the lazy threshold for the feature transformer PSQT as a parameter. Definition of the lazy threshold moved to evaluate.cpp where all others are. Lazy threshold only used for real searches, not used for the "eval" call. This preserves the purity of NNUE evaluation, which is useful to verify consistency between the engine and the NNUE trainer. closes https://github.com/official-stockfish/Stockfish/pull/3499 No functional change	2021-05-25 21:40:51 +02:00
Stéphane Nicolet	a2f01c07eb	Sometimes change the (materialist, positional) balance Our new nets output two values for the side to move in the last layer. We can interpret the first value as a material evaluation of the position, and the second one as the dynamic, positional value of the location of pieces. This patch changes the balance for the (materialist, positional) parts of the score from (128, 128) to (121, 135) when the piece material is equal between the two players, but keeps the standard (128, 128) balance when one player is at least an exchange up. Passed STC: LLR: 2.93 (-2.94,2.94) <-0.50,2.50> Total: 15936 W: 1421 L: 1266 D: 13249 Ptnml(0-2): 37, 1037, 5694, 1134, 66 https://tests.stockfishchess.org/tests/view/60a82df9ce8ea25a3ef0408f Passed LTC: LLR: 2.94 (-2.94,2.94) <0.50,3.50> Total: 13904 W: 516 L: 410 D: 12978 Ptnml(0-2): 4, 374, 6088, 484, 2 https://tests.stockfishchess.org/tests/view/60a8bbf9ce8ea25a3ef04101 closes https://github.com/official-stockfish/Stockfish/pull/3492 Bench: 3856635	2021-05-22 21:09:22 +02:00
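The rebalancing described in the commit above can be sketched as a weighted sum of the two network outputs. The weight pairs (128, 128) and (121, 135) come from the commit message; the function name, the boolean parameter, and the division by 128 (so that the standard balance reduces to a plain sum) are assumptions made for illustration.

```cpp
#include <cassert>

// Blends the two last-layer outputs. `materialEqual` is a hypothetical flag
// standing for "piece material is equal between the two players".
inline int blend_eval(int materialist, int positional, bool materialEqual) {
    const int materialWeight   = materialEqual ? 121 : 128;
    const int positionalWeight = materialEqual ? 135 : 128;
    assert(materialWeight + positionalWeight == 256);  // weights always sum to 256
    // Assumed normalization: dividing by 128 makes (128, 128) a plain sum.
    return (materialWeight * materialist + positionalWeight * positional) / 128;
}
```

With the standard balance, `blend_eval(100, 50, false)` is the plain sum 150; with the shifted balance the positional term counts slightly more and the material term slightly less.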
Fanael Linithien	038487f954	Use packed 32-bit MMX operations for updating the PSQT accumulator This improves the speed of NNUE by a bit on old hardware that code path is intended for, like a Pentium III 1.13 GHz: 10 repeats of "./stockfish bench 16 1 13 default depth NNUE": Before: 54 642 504 897 cycles (± 0.12%) 62 301 937 829 instructions (± 0.03%) After: 54 320 821 928 cycles (± 0.13%) 62 084 742 699 instructions (± 0.02%) Speed of go depth 20 from startpos: Before: 53103 nps After: 53856 nps closes https://github.com/official-stockfish/Stockfish/pull/3476 No functional change.	2021-05-19 19:34:44 +02:00
Tomasz Sobczyk	e8d64af123	New NNUE architecture and net Introduces a new NNUE network architecture and associated network parameters, as obtained by a new pytorch trainer. The network is already very strong at short TC, without regression at longer TC, and has potential for further improvements. https://tests.stockfishchess.org/tests/view/60a159c65085663412d0921d TC: 10s+0.1s, 1 thread ELO: 21.74 +-3.4 (95%) LOS: 100.0% Total: 10000 W: 1559 L: 934 D: 7507 Ptnml(0-2): 38, 701, 2972, 1176, 113 https://tests.stockfishchess.org/tests/view/60a187005085663412d0925b TC: 60s+0.6s, 1 thread ELO: 5.85 +-1.7 (95%) LOS: 100.0% Total: 20000 W: 1381 L: 1044 D: 17575 Ptnml(0-2): 27, 885, 7864, 1172, 52 https://tests.stockfishchess.org/tests/view/60a2beede229097940a03806 TC: 20s+0.2s, 8 threads LLR: 2.93 (-2.94,2.94) <0.50,3.50> Total: 34272 W: 1610 L: 1452 D: 31210 Ptnml(0-2): 30, 1285, 14350, 1439, 32 https://tests.stockfishchess.org/tests/view/60a2d687e229097940a03c72 TC: 60s+0.6s, 8 threads LLR: 2.94 (-2.94,2.94) <-2.50,0.50> Total: 45544 W: 1262 L: 1214 D: 43068 Ptnml(0-2): 12, 1129, 20442, 1177, 12 The network has been trained (by vondele) using the https://github.com/glinscott/nnue-pytorch/ trainer (started by glinscott), specifically the branch https://github.com/Sopel97/nnue-pytorch/tree/experiment_56. 
The data used are 64 billion positions (193GB total) generated and scored with the current master net:
d8: https://drive.google.com/file/d/1hOOYSDKgOOp38ZmD0N4DV82TOLHzjUiF/view?usp=sharing
d9: https://drive.google.com/file/d/1VlhnHL8f-20AXhGkILujnNXHwy9T-MQw/view?usp=sharing
d10: https://drive.google.com/file/d/1ZC5upzBYMmMj1gMYCkt6rCxQG0GnO3Kk/view?usp=sharing
fishtest_d9: https://drive.google.com/file/d/1GQHt0oNgKaHazwJFTRbXhlCN3FbUedFq/view?usp=sharing
This network also contains a few architectural changes with respect to the current master:
* Size changed from 256x2-32-32-1 to 512x2-16-32-1
  * ~15-20% slower
  * ~2x larger
  * adds a special path for 16 valued ClippedReLU
  * fixes affine transform code for 16 inputs/outputs, by using InputDimensions instead of PaddedInputDimensions
    * this is safe now because the inputs are processed in groups of 4 in the current affine transform code
* The feature set changed from HalfKP to HalfKAv2
  * Includes information about the kings like HalfKA
  * Packs king features better, resulting in 8% size reduction compared to HalfKA
* The board is flipped for the black's perspective, instead of rotated like in the current master
* PSQT values for each feature
  * the feature transformer now outputs a part that is forwarded directly to the output and allows learning piece values more directly than the previous network architecture. The effect is visible for high imbalance positions, where the current master network outputs evaluations skewed towards zero.
  * 8 PSQT values per feature, chosen based on (popcount(pos.pieces()) - 1) / 4
  * initialized to classical material values on the start of the training
* 8 subnetworks (512x2->16->32->1), chosen based on (popcount(pos.pieces()) - 1) / 4
  * only one subnetwork is evaluated for any position, no or marginal speed loss
A diagram of the network is available: https://user-images.githubusercontent.com/8037982/118656988-553a1700-b7eb-11eb-82ef-56a11cbebbf2.png A more complete description: https://github.com/glinscott/nnue-pytorch/blob/master/docs/nnue.md closes https://github.com/official-stockfish/Stockfish/pull/3474 Bench: 3806488	2021-05-18 18:06:23 +02:00
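The bucket rule quoted in the commit above, (popcount(pos.pieces()) - 1) / 4, can be sketched on a plain 64-bit occupancy mask. The function name and the use of `std::bitset` for the population count are assumptions for illustration; the real code works on the engine's own position type.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// `occupied` stands in for pos.pieces(): one bit per occupied square.
inline int bucket_index(std::uint64_t occupied) {
    const int pieceCount = static_cast<int>(std::bitset<64>(occupied).count());
    assert(pieceCount >= 2 && pieceCount <= 32);  // both kings are always on the board
    return (pieceCount - 1) / 4;                  // integer division gives 0..7
}
```

Under this rule the full starting position (32 pieces) maps to bucket 7, a 25-piece middlegame maps to bucket 6, and bare kings map to bucket 0, so each of the 8 subnetworks and PSQT sets covers a band of four piece counts.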
Stéphane Nicolet	f90274d8ce	Small clean-ups - Comment for Countemove pruning -> Continuation history - Fix comment in input_slice.h - Shorter lines in Makefile - Comment for scale factor - Fix comment for pinners in see_ge() - Change Thread.id() signature to size_t - Trailing space in reprosearch.sh - Add Douglas Matos Gomes to the AUTHORS file - Introduce comment for undo_null_move() - Use Stockfish coding style for export_net() - Change date in AUTHORS file closes https://github.com/official-stockfish/Stockfish/pull/3416 No functional change	2021-05-17 10:47:14 +02:00
Tomasz Sobczyk	58054fd0fa	Exporting the currently loaded network file This PR adds an ability to export any currently loaded network. The export_net command now takes an optional filename parameter. If the loaded net is not the embedded net the filename parameter is required. Two changes were required to support this: * the "architecture" string, which is really just some kind of description in the net, is now saved into netDescription on load and correctly saved on export. * the AffineTransform scrambles weights for some architectures and sparsifies them, such that retrieving the index is hard. This is solved by having a temporary scrambled<->unscrambled index lookup table when loading the network, and the actual index is saved for each individual weight that makes it to canSaturate16. This increases the size of the canSaturate16 entries by 6 bytes. closes https://github.com/official-stockfish/Stockfish/pull/3456 No functional change	2021-05-11 19:36:11 +02:00
Tomasz Sobczyk	b748b46714	Cleanup and simplify NNUE code. A lot of optimizations happened since the NNUE was introduced and since then some parts of the code were left unused. This got to the point where asserts had to be made just to let people know that modifying something will not have any effects or may even break everything due to the assumptions being made. Removing these parts removes those nonexistent "false dependencies". Additionally: * append_changed_indices now takes the king pos and stateinfo explicitly, no more misleading pos parameter * IndexList is removed in favor of a generic ValueList. Feature transformer just instantiates the type it needs. * The update cost and refresh requirement is deferred to the feature set once again, but now doesn't go through the whole FeatureSet machinery and just calls HalfKP directly. * accumulator no longer has a singular dimension. * The PS constants and the PieceSquareIndex array are made local to the HalfKP feature set because they are specific to it and DO differ for other feature sets. * A few names are changed to more descriptive ones Passed STC non-regression: https://tests.stockfishchess.org/tests/view/608421dd95e7f1852abd2790 LLR: 2.95 (-2.94,2.94) <-2.50,0.50> Total: 180008 W: 16186 L: 16258 D: 147564 Ptnml(0-2): 587, 12593, 63725, 12503, 596 closes https://github.com/official-stockfish/Stockfish/pull/3441 No functional change	2021-04-25 13:16:30 +02:00
Tomasz Sobczyk	fbbd4adc3c	Unify naming convention of the NNUE code matches the rest of the stockfish code base closes https://github.com/official-stockfish/Stockfish/pull/3437 No functional change	2021-04-24 12:49:29 +02:00
Tomasz Sobczyk	255514fb29	Documentation patch: AppendChangedIndices Clarify the assumptions on the position passed to the AppendChangedIndices(). Closes https://github.com/official-stockfish/Stockfish/pull/3428 No functional change	2021-04-15 12:21:30 +02:00
Stéphane Nicolet	83eac08e75	Small cleanups (march 2021) With help of @BM123499, @mstembera, @gvreuls, @noobpwnftw and @Fanael Thanks! Closes https://github.com/official-stockfish/Stockfish/pull/3405 No functional change	2021-03-24 17:11:06 +01:00
Guy Vreuls	ec42154ef2	Use reference instead of pointer for pop_lsb() signature This patch changes the pop_lsb() signature from Square pop_lsb(Bitboard*) to Square pop_lsb(Bitboard&). This is more idiomatic for C++ style signatures. Passed a non-regression STC test: LLR: 2.93 (-2.94,2.94) {-1.25,0.25} Total: 21280 W: 1928 L: 1847 D: 17505 Ptnml(0-2): 71, 1427, 7558, 1518, 66 https://tests.stockfishchess.org/tests/view/6053a1e22433018de7a38e2f We have verified that the generated binary is identical on gcc-10. Closes https://github.com/official-stockfish/Stockfish/pull/3404 No functional change.	2021-03-19 20:28:57 +01:00
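The signature change in the commit above, from Square pop_lsb(Bitboard*) to Square pop_lsb(Bitboard&), can be shown side by side. This is a simplified sketch: `Bitboard` and `Square` are stand-in aliases, the `_ptr`/`_ref` suffixes exist only so both versions can coexist here, and the bit scan is a portable loop rather than the engine's intrinsic-based one.

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = std::uint64_t;  // stand-in for the engine's Bitboard
using Square   = int;            // simplified: the engine uses an enum

// Portable index of the least significant set bit (b must be non-zero).
inline Square lsb(Bitboard b) {
    assert(b != 0);
    Square s = 0;
    while ((b & 1) == 0) {
        b >>= 1;
        ++s;
    }
    return s;
}

// Old style: the caller passes the address of the bitboard.
inline Square pop_lsb_ptr(Bitboard* b) {
    assert(b != nullptr);
    const Square s = lsb(*b);
    *b &= *b - 1;  // clear the least significant set bit
    return s;
}

// New style: the bitboard is taken by reference, so no null check is
// needed and call sites drop the address-of operator.
inline Square pop_lsb_ref(Bitboard& b) {
    const Square s = lsb(b);
    b &= b - 1;
    return s;
}
```

A call site changes from `pop_lsb_ptr(&bb)` to `pop_lsb_ref(bb)`; both return the lowest occupied square and remove it from the bitboard.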
Dieter Dobbelaere	7ffae17f85	Add Stockfish namespace. fixes #3350 and is a small cleanup that might make it easier to use SF in separate projects, like a NNUE trainer or similar. closes https://github.com/official-stockfish/Stockfish/pull/3370 No functional change.	2021-03-07 14:26:54 +01:00
MaximMolchanov	303713b560	Affine transform robust implementation Size of the weights in the last layer is less than 512 bits. It leads to wrong data access for AVX512. There is no error because in current implementation it is guaranteed that there is an array of zeros after weights so zero multiplied by something is returned and sum is correct. It is a mistake that can lead to unexpected bugs in the future. Used AVX2 instructions for smaller input size. No measurable slowdown on avx512. closes https://github.com/official-stockfish/Stockfish/pull/3298 No functional change.	2021-01-11 18:54:18 +01:00
Joost VandeVondele	c4d67d77c9	Update copyright years No functional change	2021-01-08 17:04:23 +01:00
MaximMolchanov	23c385ec36	Affine transform refactoring. Reordered weights in such a way that accumulated sum fits to output. Weights are grouped in blocks of four elements because four int8 (weight type) corresponds to one int32 (output type). No horizontal additions. Grouped AVX512, AVX2 and SSSE3 implementations. Repeated code was removed. An earlier version passed STC: LLR: 2.97 (-2.94,2.94) {-0.25,1.25} Total: 15336 W: 1495 L: 1355 D: 12486 Ptnml(0-2): 44, 1054, 5350, 1158, 62 https://tests.stockfishchess.org/tests/view/5ff60e106019e097de3eefd5 Speedup depends on the architecture, up to 4% measured on a NNUE only bench. closes https://github.com/official-stockfish/Stockfish/pull/3287 No functional change	2021-01-08 16:35:44 +01:00
mstembera	d862ba4069	AVX512, AVX2 and SSSE3 speedups Improves throughput by summing 2 intermediate dot products using 16 bit addition before upconverting to 32 bit. Potential saturation is detected and the code-path is avoided in this case. The saturation can't happen with the current nets, but nets can be constructed that trigger this check. STC https://tests.stockfishchess.org/tests/view/5fd40a861ac1691201888479 LLR: 2.94 (-2.94,2.94) {-0.25,1.25} Total: 25544 W: 2451 L: 2296 D: 20797 Ptnml(0-2): 92, 1761, 8925, 1888, 106 about 5% speedup closes https://github.com/official-stockfish/Stockfish/pull/3261 No functional change	2020-12-14 07:46:15 +01:00
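A scalar sketch of the saturation idea in the commit above, under stated assumptions: inputs are clipped to [0, 127], weights are int8, and four products are summed per group before widening. The names `can_saturate16` and `dot4` are hypothetical, and the worst-case test shown here is one plausible way to detect potential saturation, not necessarily the check the patch uses.

```cpp
#include <cassert>
#include <cstdint>

// Worst-case bound: with inputs in [0, 127], the most positive sum uses only
// the positive weights and the most negative sum only the negative ones.
inline bool can_saturate16(const std::int8_t* w) {
    int positive = 0;
    int negative = 0;
    for (int i = 0; i < 4; ++i) {
        if (w[i] > 0) positive += w[i];
        else          negative += w[i];
    }
    return positive * 127 > INT16_MAX || negative * 127 < INT16_MIN;
}

// Dot product of four inputs and weights. The 16-bit accumulation is used
// only when the weights cannot overflow it; otherwise 32 bits are used.
inline std::int32_t dot4(const std::uint8_t* in, const std::int8_t* w) {
    for (int i = 0; i < 4; ++i)
        assert(in[i] <= 127);  // assumed clipped activation range
    if (!can_saturate16(w)) {
        std::int16_t sum16 = 0;
        for (int i = 0; i < 4; ++i)
            sum16 = static_cast<std::int16_t>(sum16 + in[i] * w[i]);
        return sum16;  // widened to 32 bits only after the additions
    }
    std::int32_t sum32 = 0;
    for (int i = 0; i < 4; ++i)
        sum32 += in[i] * w[i];
    return sum32;
}
```

Because the check depends only on the weights, it can be done once when a net is loaded rather than on every evaluation.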
Fanael Linithien	c7f0a768cb	Use arithmetic right shift for sign extension in MMX and SSE2 paths This appears to be slightly faster than using a comparison against zero to compute the high bits, on both old (like Pentium III) and new (like Zen 2) hardware. closes https://github.com/official-stockfish/Stockfish/pull/3254 No functional change.	2020-12-12 09:20:15 +01:00
mstembera	9b7983a452	Cleaned up MakeIndex() The index order in kpp_board_index[][] is reversed to be more optimal for the access pattern STC https://tests.stockfishchess.org/tests/view/5fbd74f967cbf42301d6b24f LLR: 2.93 (-2.94,2.94) {-1.25,0.25} Total: 27504 W: 2686 L: 2607 D: 22211 Ptnml(0-2): 84, 2001, 9526, 2034, 107 closes https://github.com/official-stockfish/Stockfish/pull/3233 No functional change	2020-11-29 16:36:49 +01:00
MaximMolchanov	7615e3485e	Calculate sum from first elements in affine transform for AVX512/AVX2/SSSE3 The idea is to initialize sum with the first element instead of zero. Reduce one add_epi32 and one set_zero SIMD instructions for each output dimension. sum = 0; for i = 1 to n sum += a[i] -> sum = a[1]; for i = 2 to n sum += a[i] STC: LLR: 2.95 (-2.94,2.94) {-0.25,1.25} Total: 69048 W: 7024 L: 6799 D: 55225 Ptnml(0-2): 260, 5175, 23458, 5342, 289 https://tests.stockfishchess.org/tests/view/5faf2cf467cbf42301d6aa06 closes https://github.com/official-stockfish/Stockfish/pull/3227 No functional change.	2020-11-25 21:10:13 +01:00
Stéphane Nicolet	027626db1e	Small cleanups 13 No functional change	2020-11-23 22:20:32 +01:00
Tomasz Sobczyk	ba35c88ab8	AVX-512 for smaller affine and feature transforms. For the feature transformer the code is analogical to AVX2 since there was room for easy adaptation of wider simd registers. For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers. These changes provide about 1.5% speedup for AVX-512 builds. Closes https://github.com/official-stockfish/Stockfish/pull/3218 No functional change.	2020-11-07 16:49:49 +01:00
Tomasz Sobczyk	3f6451eff7	Manually align arrays on the stack as a workaround to issues with overaligned alignas() on stack variables in gcc < 9.3 on windows. closes https://github.com/official-stockfish/Stockfish/pull/3217 fixes #3216 No functional change	2020-11-04 19:52:42 +01:00
Tomasz Sobczyk	75e06a1c89	Optimize affine transform for SSSE3 and higher targets. A non-functional speedup. Unroll the loops going over the output dimensions in the affine transform layers by a factor of 4 and perform 4 horizontal additions at a time. Instead of doing naive horizontal additions on each vector separately use hadd and shuffling between vectors to reduce the number of instructions by using all lanes for all stages of the horizontal adds. passed STC of the initial version: LLR: 2.95 (-2.94,2.94) {-0.25,1.25} Total: 17808 W: 1914 L: 1756 D: 14138 Ptnml(0-2): 76, 1330, 5948, 1460, 90 https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3 passed STC of the final version after cleanup: LLR: 2.95 (-2.94,2.94) {-0.25,1.25} Total: 16296 W: 1750 L: 1595 D: 12951 Ptnml(0-2): 72, 1192, 5479, 1319, 86 https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3 closes https://github.com/official-stockfish/Stockfish/pull/3203 No functional change	2020-11-02 19:41:17 +01:00
syzygy1	2046d5da30	More incremental accumulator updates This patch was inspired by `c065abd` which updates the accumulator, if possible, based on the accumulator of two plies back if the accumulator of the preceding ply is not available. With this patch we look back even further in the position history in an attempt to reduce the number of complete recomputations. When we find a usable accumulator for the position N plies back, we also update the accumulator of the position N-1 plies back because that accumulator is most likely to be helpful later when evaluating positions in sibling branches. By not updating all intermediate accumulators immediately, we avoid doing too much work that is not certain to be useful. Overall, roughly 2-3% speedup. This patch makes the code more specific to the net architecture, changing input features of the net will require additional changes to the incremental update code as discussed in the PR #3193 and #3191. Passed STC: https://tests.stockfishchess.org/tests/view/5f9056712c92c7fe3a8c60d0 LLR: 2.94 (-2.94,2.94) {-0.25,1.25} Total: 10040 W: 1116 L: 968 D: 7956 Ptnml(0-2): 42, 722, 3365, 828, 63 closes https://github.com/official-stockfish/Stockfish/pull/3193 No functional change.	2020-10-22 20:50:16 +02:00
noobpwnftw	c065abdcaf	Use incremental updates more often Use incremental updates for accumulators for up to 2 plies. Do not copy accumulator. About 2% speedup. Passed STC: LLR: 2.95 (-2.94,2.94) {-0.25,1.25} Total: 21752 W: 2583 L: 2403 D: 16766 Ptnml(0-2): 128, 1761, 6923, 1931, 133 https://tests.stockfishchess.org/tests/view/5f7150cf3b22d6afa5069412 closes https://github.com/official-stockfish/Stockfish/pull/3157 No functional change	2020-09-28 16:54:35 +02:00
Stéphane Nicolet	9a64e737cf	Small cleanups 12 - Clean signature of functions in namespace NNUE - Add comment for countermove based pruning - Remove bestMoveCount variable - Add const qualifier to kpp_board_index array - Fix spaces in get_best_thread() - Fix indention in capture LMR code in search.cpp - Rename TtmemDeleter to LargePageDeleter Closes https://github.com/official-stockfish/Stockfish/pull/3063 No functional change	2020-09-21 10:41:10 +02:00
Sami Kiminki	485d517c68	Add large page support for NNUE weights and simplify TT mem management Use TT memory functions to allocate memory for the NNUE weights. This should provide a small speed-up on systems where large pages are not automatically used, including Windows and some Linux distributions. Further, since we now have a wrapper for std::aligned_alloc(), we can simplify the TT memory management a bit: - We no longer need to store separate pointers to the hash table and its underlying memory allocation. - We also get to merge the Linux-specific and default implementations of aligned_ttmem_alloc(). Finally, we'll enable the VirtualAlloc code path with large page support also for Win32. STC: https://tests.stockfishchess.org/tests/view/5f66595823a84a47b9036fba LLR: 2.94 (-2.94,2.94) {-0.25,1.25} Total: 14896 W: 1854 L: 1686 D: 11356 Ptnml(0-2): 65, 1224, 4742, 1312, 105 closes https://github.com/official-stockfish/Stockfish/pull/3081 No functional change.	2020-09-21 08:43:48 +02:00
syzygy1	8b8a510fd6	Use tiling to speed up accumulator refreshes and updates Perform the update and refresh operations tile by tile in a local array of vectors. By selecting the array size carefully, we achieve that the compiler keeps the whole array in vector registers. Idea and original implementation by @sf-x. STC: https://tests.stockfishchess.org/tests/view/5f623eec912c15f19854b855 LLR: 2.94 (-2.94,2.94) {-0.25,1.25} Total: 4872 W: 623 L: 477 D: 3772 Ptnml(0-2): 14, 350, 1585, 450, 37 LTC: https://tests.stockfishchess.org/tests/view/5f62434e912c15f19854b860 LLR: 2.94 (-2.94,2.94) {0.25,1.25} Total: 25808 W: 1565 L: 1401 D: 22842 Ptnml(0-2): 23, 1186, 10332, 1330, 33 closes https://github.com/official-stockfish/Stockfish/pull/3130 No functional change	2020-09-17 17:24:52 +02:00
syzygy1	fc27d158c0	Bug fix in do_null_move() and NNUE simplification. This fixes #3108 and removes some NNUE code that is currently not used. At the moment, do_null_move() copies the accumulator from the previous state into the new state, which is correct. It then clears the "computed_score" flag because the side to move has changed, and with the other side to move NNUE will return a completely different evaluation (normally with changed sign but also with different NNUE-internal tempo bonus). The problem is that do_null_move() clears the wrong flag. It clears the computed_score flag of the old state, not of the new state. It turns out that this almost never affects the search. For example, fixing it does not change the current bench (but it does change the previous bench). This is because the search code usually avoids calling evaluate() after a null move. This PR corrects do_null_move() by removing the computed_score flag altogether. The flag is not needed because nnue_evaluate() is never called twice on a position. This PR also removes some unnecessary {}s and inserts a few blank lines in the modified NNUE files in line with SF coding style. Result of STC non-regression test: LLR: 2.95 (-2.94,2.94) {-1.25,0.25} Total: 26328 W: 3118 L: 3012 D: 20198 Ptnml(0-2): 126, 2208, 8397, 2300, 133 https://tests.stockfishchess.org/tests/view/5f553ccc2d02727c56b36db1 closes https://github.com/official-stockfish/Stockfish/pull/3109 bench: 4109324	2020-09-08 22:53:17 +02:00
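
The message of commit 7615e3485e above describes seeding the running sum with the first term instead of zero. As a rough illustration only, the scalar sketch below shows the two loop shapes side by side. The function names `sum_from_zero` and `sum_from_first` are hypothetical and are not taken from the Stockfish sources, and plain `int32_t` values stand in for the SIMD registers the real code operates on.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar stand-ins for the SIMD loops; not Stockfish code.

// Baseline shape: zero the accumulator, then add every term.
inline std::int32_t sum_from_zero(const std::vector<std::int32_t>& terms) {
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < terms.size(); ++i)
        sum += terms[i];
    return sum;
}

// Shape described in the commit message: seed the accumulator with the
// first term, then add the remaining ones. This drops one zeroing and one
// addition per output. An empty input returns 0 (an assumption made here
// so the sketch stays well defined).
inline std::int32_t sum_from_first(const std::vector<std::int32_t>& terms) {
    if (terms.empty())
        return 0;
    std::int32_t sum = terms[0];
    for (std::size_t i = 1; i < terms.size(); ++i)
        sum += terms[i];
    return sum;
}
```

Both functions return the same value for any input; the second only skips the initial zeroing and the first addition, which matches the "No functional change" note on that commit.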

66 commits