mirror of https://github.com/sockspls/badfish synced 2025-05-01 17:19:36 +00:00
Commit graph

60 commits

gab8192
49ef4c935a Implement accumulator refresh table
For each thread persist an accumulator cache for the network, where each
cache contains multiple entries for each of the possible king squares.
When the accumulator needs to be refreshed, the cached entry is used to more
efficiently update the accumulator, instead of rebuilding it from scratch.
This idea was first described by Luecx (author of Koivisto) and
is commonly referred to as "Finny Tables".

When the accumulator needs to be refreshed, instead of filling it with
biases and adding every piece from scratch, we...

1. Take the `AccumulatorRefreshEntry` associated with the new king bucket
2. Calculate the features to activate and deactivate (from differences
   between bitboards in the entry and bitboards of the actual position)
3. Apply the updates on the refresh entry
4. Copy the content of the refresh entry accumulator to the accumulator
   we were refreshing
5. Copy the bitboards from the position to the refresh entry, to match
   the newly updated accumulator
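
As a rough illustration, steps 2-5 might look like the following C++ sketch
(hypothetical names and a simplified entry layout; the real implementation
keeps one cache per thread, has entries per king bucket, and is vectorized):

```cpp
#include <cstdint>
#include <cstring>

constexpr int kHalfDimensions = 1024;  // hypothetical layer size

struct AccumulatorRefreshEntry {
    std::int16_t  accumulation[kHalfDimensions];
    std::uint64_t byColorBB[2];  // cached piece bitboards
    std::uint64_t byTypeBB[8];
};

// Sketch only: 'entry' is the cached entry for the new king bucket,
// 'pos*BB' are the bitboards of the actual position.
void refresh_via_cache(AccumulatorRefreshEntry& entry,
                       std::int16_t (&accumulator)[kHalfDimensions],
                       const std::uint64_t (&posColorBB)[2],
                       const std::uint64_t (&posTypeBB)[8]) {
    // 2. diff the cached bitboards against the position's bitboards to
    //    find features to activate/deactivate, then
    // 3. apply those few updates to entry.accumulation (omitted here)

    // 4. copy the refreshed entry into the accumulator being refreshed
    std::memcpy(accumulator, entry.accumulation, sizeof(accumulator));

    // 5. store the position's bitboards so the entry matches its accumulator
    std::memcpy(entry.byColorBB, posColorBB, sizeof(entry.byColorBB));
    std::memcpy(entry.byTypeBB, posTypeBB, sizeof(entry.byTypeBB));
}
```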

Results at STC:
https://tests.stockfishchess.org/tests/view/662301573fe04ce4cefc1386
(first version)
https://tests.stockfishchess.org/tests/view/6627fa063fe04ce4cefc6560
(final)

Non-Regression between first and final:
https://tests.stockfishchess.org/tests/view/662801e33fe04ce4cefc660a

STC SMP:
https://tests.stockfishchess.org/tests/view/662808133fe04ce4cefc667c

closes https://github.com/official-stockfish/Stockfish/pull/5183

No functional change
2024-04-24 18:38:20 +02:00
Gahtan Nahdi
d0e72c19fa fix clang compiler warning for avx512 build
Initialize a variable in a constexpr function to silence a clang compiler warning for the avx512 build.

closes https://github.com/official-stockfish/Stockfish/pull/5176

Non-functional change
2024-04-21 14:38:16 +02:00
mstembera
94484db6e8 Avoid permuting inputs during transform()
Avoid permuting inputs during transform() and instead do it once at load time.
Affects AVX2 and newer Intel architectures only.
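
The idea, as a hedged sketch (hypothetical names; the real permutation matches
the AVX2 packing order of the feature-transformer weights, and this sketch
assumes the weight count is a multiple of the permutation block size):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: apply a fixed lane permutation to the weights once at load time,
// so transform() can read inputs in natural order with no per-call shuffles.
void permute_weights_at_load(std::vector<std::int16_t>& weights,
                             const std::vector<std::size_t>& laneOrder) {
    const std::size_t block = laneOrder.size();
    std::vector<std::int16_t> permuted(weights.size());
    for (std::size_t base = 0; base < weights.size(); base += block)
        for (std::size_t lane = 0; lane < block; ++lane)
            permuted[base + lane] = weights[base + laneOrder[lane]];
    weights.swap(permuted);
}
```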

https://tests.stockfishchess.org/tests/view/661306613eb00c8ccc0033c7
LLR: 2.94 (-2.94,2.94) <0.00,2.00>
Total: 108480 W: 28319 L: 27898 D: 52263
Ptnml(0-2): 436, 12259, 28438, 12662, 445

Speedups were measured, e.g.:

```
Result of 100 runs
==================
base (./stockfish.master       ) =    1241128  +/- 3757
test (./stockfish.patch        ) =    1247713  +/- 3689
diff                             =      +6585  +/- 2583

speedup        = +0.0053
P(speedup > 0) =  1.0000
```

closes https://github.com/official-stockfish/Stockfish/pull/5160

No functional change
2024-04-11 22:38:38 +02:00
mstembera
5001d49f42 Update nnue_feature_transformer.h
Unroll update_accumulator_refresh to process two
active indices simultaneously.

The compiler might not unroll effectively because
the number of active indices isn't known at
compile time.
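
Schematically, the manual two-way unroll looks like this scalar sketch
(hypothetical names; the real code accumulates into SIMD registers):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: process two active feature indices per iteration so their weight
// rows can be accumulated together, instead of one row per pass.
void add_active_features(std::int16_t* acc, const std::int16_t* weights,
                         const int* active, std::size_t count,
                         std::size_t rowSize) {
    std::size_t i = 0;
    for (; i + 1 < count; i += 2) {
        const std::int16_t* row0 = weights + active[i] * rowSize;
        const std::int16_t* row1 = weights + active[i + 1] * rowSize;
        for (std::size_t j = 0; j < rowSize; ++j)
            acc[j] += row0[j] + row1[j];
    }
    if (i < count) {  // odd tail: one remaining index
        const std::int16_t* row = weights + active[i] * rowSize;
        for (std::size_t j = 0; j < rowSize; ++j)
            acc[j] += row[j];
    }
}
```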

STC https://tests.stockfishchess.org/tests/view/65faa8850ec64f0526c4fca9
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 130464 W: 33882 L: 33431 D: 63151
Ptnml(0-2): 539, 14591, 34501, 15082, 519

closes https://github.com/official-stockfish/Stockfish/pull/5125

No functional change
2024-03-26 18:06:49 +01:00
mstembera
7831131591 Only evaluate the PSQT part of the small net for large evals.
Thanks to Viren6 for the suggestion to set complexity to 0.

STC https://tests.stockfishchess.org/tests/view/65d7d6709b2da0226a5a203f
LLR: 2.92 (-2.94,2.94) <0.00,2.00>
Total: 328384 W: 85316 L: 84554 D: 158514
Ptnml(0-2): 1414, 39076, 82486, 39766, 1450

LTC https://tests.stockfishchess.org/tests/view/65dce6d290f639b028a54d2e
LLR: 2.95 (-2.94,2.94) <0.50,2.50>
Total: 165162 W: 41918 L: 41330 D: 81914
Ptnml(0-2): 102, 18332, 45124, 18922, 101

closes https://github.com/official-stockfish/Stockfish/pull/5083

bench: 1504003
2024-03-03 15:29:58 +01:00
FauziAkram
59691d46a1 Assorted trivial cleanups
Renaming the doubleExtensions variable to multiExtensions, since we now also have triple extensions.

Some extra cleanups.

Recent tests used to measure the elo worth:
https://tests.stockfishchess.org/tests/view/659fd0c379aa8af82b96abc3
https://tests.stockfishchess.org/tests/view/65a8f3da79aa8af82b9751e3
https://tests.stockfishchess.org/tests/view/65b51824c865510db0272740
https://tests.stockfishchess.org/tests/view/65b58fbfc865510db0272f5b

closes https://github.com/official-stockfish/Stockfish/pull/5032

No functional change
2024-02-09 19:06:24 +01:00
Linmiao Xu
584d9efedc Dual NNUE with L1-128 smallnet
Credit goes to @mstembera for:
- writing the code enabling dual NNUE:
  https://github.com/official-stockfish/Stockfish/pull/4898
- the idea of trying L1-128 trained exclusively on high simple eval
  positions

The L1-128 smallnet is:
- epoch 399 of a single-stage training from scratch
- trained only on positions from filtered data with high material
  difference
  - defined by abs(simple_eval) > 1000

```yaml
experiment-name: 128--S1-only-hse-v2

training-dataset:
  - /data/hse/S3/dfrc99-16tb7p-eval-filt-v2.min.high-simple-eval-1k.binpack
  - /data/hse/S3/leela96-filt-v2.min.high-simple-eval-1k.binpack
  - /data/hse/S3/test80-apr2022-16tb7p.min.high-simple-eval-1k.binpack

  - /data/hse/S7/test60-2020-2tb7p.v6-3072.high-simple-eval-1k.binpack
  - /data/hse/S7/test60-novdec2021-12tb7p-filter-v6-dd.min-mar2023.unmin.high-simple-eval-1k.binpack

  - /data/hse/S7/test77-nov2021-2tb7p.v6-3072.min.high-simple-eval-1k.binpack
  - /data/hse/S7/test77-dec2021-16tb7p-filter-v6-dd.min-mar2023.unmin.high-simple-eval-1k.binpack
  - /data/hse/S7/test77-jan2022-2tb7p.high-simple-eval-1k.binpack

  - /data/hse/S7/test78-jantomay2022-16tb7p-filter-v6-dd.min-mar2023.unmin.high-simple-eval-1k.binpack
  - /data/hse/S7/test78-juntosep2022-16tb7p-filter-v6-dd.min-mar2023.unmin.high-simple-eval-1k.binpack

  - /data/hse/S7/test79-apr2022-16tb7p-filter-v6-dd.min-mar2023.unmin.high-simple-eval-1k.binpack
  - /data/hse/S7/test79-may2022-16tb7p-filter-v6-dd.min-mar2023.unmin.high-simple-eval-1k.binpack

  # T80 2022
  - /data/hse/S7/test80-may2022-16tb7p.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-jun2022-16tb7p-filter-v6-dd.min-mar2023.unmin.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-jul2022-16tb7p.v6-dd.min.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-aug2022-16tb7p-filter-v6-dd.min-mar2023.unmin.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-sep2022-16tb7p-filter-v6-dd.min-mar2023.unmin.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-oct2022-16tb7p.v6-dd.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-nov2022-16tb7p-v6-dd.min.high-simple-eval-1k.binpack

  # T80 2023
  - /data/hse/S7/test80-jan2023-3of3-16tb7p-filter-v6-dd.min-mar2023.unmin.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-feb2023-16tb7p-filter-v6-dd.min-mar2023.unmin.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-mar2023-2tb7p.v6-sk16.min.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-apr2023-2tb7p-filter-v6-sk16.min.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-may2023-2tb7p.v6.min.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-jun2023-2tb7p.v6-3072.min.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-jul2023-2tb7p.v6-3072.min.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-aug2023-2tb7p.v6.min.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-sep2023-2tb7p.high-simple-eval-1k.binpack
  - /data/hse/S7/test80-oct2023-2tb7p.high-simple-eval-1k.binpack

start-from-engine-test-net: False

nnue-pytorch-branch: linrock/nnue-pytorch/L1-128
engine-test-branch: linrock/Stockfish/L1-128-nolazy
engine-base-branch: linrock/Stockfish/L1-128

num-epochs: 500
lambda: 1.0
```

Experiment yaml configs converted to easy_train.sh commands with:
https://github.com/linrock/nnue-tools/blob/4339954/yaml_easy_train.py

Binpacks interleaved at training time with:
https://github.com/official-stockfish/nnue-pytorch/pull/259

Data filtered for high simple eval positions with:
https://github.com/linrock/nnue-data/blob/32d6a68/filter_high_simple_eval_plain.py
https://github.com/linrock/Stockfish/blob/61dbfe/src/tools/transform.cpp#L626-L655

Training data can be found at:
https://robotmoon.com/nnue-training-data/
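
On the engine side, the dual-net idea is to route positions with a large
material imbalance to the smallnet; a minimal sketch of the selection logic
under that assumption (all names hypothetical):

```cpp
#include <cstdlib>

struct Position;  // engine position type (opaque here)

int simple_eval(const Position&);        // material-only estimate (assumed)
int evaluate_smallnet(const Position&);  // L1-128 net (assumed)
int evaluate_bignet(const Position&);    // full-size net (assumed)

// Sketch: pick the network by the position's material-only evaluation,
// mirroring the abs(simple_eval) > 1000 training filter described above.
int evaluate_dual(const Position& pos) {
    int  psq          = simple_eval(pos);
    bool useSmallnet  = std::abs(psq) > 1000;  // high material difference
    return useSmallnet ? evaluate_smallnet(pos) : evaluate_bignet(pos);
}
```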

Local elo at 25k nodes per move of
L1-128 smallnet (nnue-only eval) vs. L1-128 trained on standard S1 data:
nn-epoch399.nnue : -318.1 +/- 2.1

Passed STC:
https://tests.stockfishchess.org/tests/view/6574cb9d95ea6ba1fcd49e3b
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 62432 W: 15875 L: 15521 D: 31036
Ptnml(0-2): 177, 7331, 15872, 7633, 203

Passed LTC:
https://tests.stockfishchess.org/tests/view/6575da2d4d789acf40aaac6e
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 64830 W: 16118 L: 15738 D: 32974
Ptnml(0-2): 43, 7129, 17697, 7497, 49

closes https://github.com/official-stockfish/Stockfish/pulls

Bench: 1330050

Co-Authored-By: mstembera <5421953+mstembera@users.noreply.github.com>
2024-01-07 21:15:52 +01:00
Disservin
444f03ee95 Update copyright year
closes https://github.com/official-stockfish/Stockfish/pull/4954

No functional change
2024-01-04 15:47:10 +01:00
FauziAkram
833a2e2bc0 Cleanup comments
Tests used to derive some Elo worth comments:
https://tests.stockfishchess.org/tests/view/656a7f4e136acbc573555a31
https://tests.stockfishchess.org/tests/view/6585fb455457644dc984620f

closes https://github.com/official-stockfish/Stockfish/pull/4945

No functional change
2023-12-31 19:54:27 +01:00
Joost VandeVondele
ec02714b62 Cleanup comments and some code reorg.
passed STC:
https://tests.stockfishchess.org/tests/view/6536dc7dcc309ae83955b04d
LLR: 2.93 (-2.94,2.94) <-1.75,0.25>
Total: 58048 W: 14693 L: 14501 D: 28854
Ptnml(0-2): 200, 6399, 15595, 6669, 161

closes https://github.com/official-stockfish/Stockfish/pull/4846

No functional change
2023-10-24 17:43:05 +02:00
Disservin
2d0237db3f add clang-format
This introduces clang-format to enforce a consistent code style for Stockfish.

Having a documented and consistent style across the code will make contributing easier
for new developers, and will make larger changes to the codebase easier to make.

To facilitate formatting, this PR includes a Makefile target (`make format`) to format the code,
this requires clang-format (version 17 currently) to be installed locally.

Installing clang-format is straightforward on most OSes and distros
(e.g. with https://apt.llvm.org/, brew install clang-format, etc.), as it is part of a quite
commonly used suite of tools and compilers (llvm / clang).

Additionally, a CI action is present that will verify if the code requires formatting,
and comment on the PR as needed. Initially, correct formatting is not required; it will be
done by maintainers as part of the merge or in later commits, but it is obviously encouraged.

fixes https://github.com/official-stockfish/Stockfish/issues/3608
closes https://github.com/official-stockfish/Stockfish/pull/4790

Co-Authored-By: Joost VandeVondele <Joost.VandeVondele@gmail.com>
2023-10-22 16:06:27 +02:00
mstembera
d3d0c69dc1 Remove outdated Tile naming.
cleanup variable naming after #4816

closes #4833

No functional change
2023-10-21 10:28:55 +02:00
mstembera
c17a657b04 Optimize the most common update accumulator cases w/o tiling
In the most common case where we only update a single state
it's faster to not use temporary accumulation registers and tiling.
(Also includes a couple of small cleanups.)

passed STC
https://tests.stockfishchess.org/tests/view/651918e3cff46e538ee0023b
LLR: 2.95 (-2.94,2.94) <0.00,2.00>
Total: 34944 W: 8989 L: 8687 D: 17268
Ptnml(0-2): 88, 3743, 9512, 4037, 92

A simpler version
https://tests.stockfishchess.org/tests/view/65190dfacff46e538ee00155
also passed, but this version is stronger still:
https://tests.stockfishchess.org/tests/view/6519b95fcff46e538ee00fa2

closes https://github.com/official-stockfish/Stockfish/pull/4816

No functional change
2023-10-08 07:42:39 +02:00
mstembera
8a912951de Remove handcrafted MMX code
too small a benefit to maintain this old target

closes https://github.com/official-stockfish/Stockfish/pull/4804

No functional change
2023-10-08 07:37:01 +02:00
mstembera
95fe2b9a9d Reduce SIMD register count from 32 to 16
in the case of avx512 and vnni512 archs.

Up to 17% speedup, depending on the compiler, e.g.

```
AMD pro 7840u (zen4 phoenix apu 4nm)
bash bench_parallel.sh ./stockfish_avx512_gcc13 ./stockfish_avx512_pr_gcc13 20 10
sf_base =  1077737 +/-   8446 (95%)
sf_test =  1264268 +/-   8543 (95%)
diff    =   186531 +/-   4280 (95%)
speedup =  17.308% +/- 0.397% (95%)
```

Prior to this patch, it appears gcc spills registers.

closes https://github.com/official-stockfish/Stockfish/pull/4796

No functional change
2023-09-22 19:15:34 +02:00
mstembera
97f706ecc1 Sparse impl of affine_transform_non_ssse3()
deal with the general case

About an 8.6% speedup (for general arch)

Results for 200 tests for each version:

            Base      Test      Diff
    Mean    141741    153998    -12257
    StDev   2990      3042      3742

p-value: 0.999
speedup: 0.086

closes https://github.com/official-stockfish/Stockfish/pull/4786

No functional change
2023-09-22 19:03:47 +02:00
Disservin
3c0e86a91e Cleanup includes
Reorder a few includes, include "position.h" where it was previously missing
and apply include-what-you-use suggestions. Also make the order of the includes
consistent, in the following way:

1. Related header (for .cpp files)
2. A blank line
3. C/C++ headers
4. A blank line
5. All other header files
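
For instance, a hypothetical evaluate.cpp would begin like this:

```cpp
// 1. Related header
#include "evaluate.h"

// 3. C/C++ headers
#include <algorithm>
#include <cstdint>

// 5. All other header files
#include "position.h"
#include "types.h"
```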

closes https://github.com/official-stockfish/Stockfish/pull/4763
fixes https://github.com/official-stockfish/Stockfish/issues/4707

No functional change
2023-09-03 08:24:51 +02:00
maxim
a46087ee30 Compressed network parameters
Implemented LEB128 (de)compression for the feature transformer.
Reduces the embedded network size from 70 MiB to 39 MiB.

The new nn-78bacfcee510.nnue corresponds to the master net compressed.
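
For reference, signed LEB128 decoding works roughly as below (a generic
sketch with a hypothetical helper name, not the actual Stockfish loader):

```cpp
#include <cstdint>
#include <istream>

// Sketch of signed LEB128 decoding: each byte carries 7 payload bits,
// the high bit marks continuation, and bit 6 of the last byte is the sign.
std::int32_t read_signed_leb128(std::istream& in) {
    std::int32_t result = 0;
    int          shift  = 0;
    std::uint8_t byte   = 0;
    do {
        byte = static_cast<std::uint8_t>(in.get());
        result |= std::int32_t(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);
    if (shift < 32 && (byte & 0x40))           // sign bit set: sign-extend
        result |= -(std::int32_t(1) << shift);
    return result;
}
```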

closes https://github.com/official-stockfish/Stockfish/pull/4617

No functional change
2023-06-19 21:37:23 +02:00
pb00067
f0556dcbe3 Small cleanups
remove some unneeded assignments, fix typos and incorrect comments, add an AUTHORS entry.

closes https://github.com/official-stockfish/Stockfish/pull/4417

no functional change
2023-03-14 08:38:02 +01:00
Sebastian Buchwald
564456a6a8 Unify type alias declarations
The commit unifies the declaration of type aliases by replacing all
typedefs with corresponding using statements.

closing https://github.com/official-stockfish/Stockfish/pull/4412

No functional change
2023-02-27 08:29:47 +01:00
Sebastian Buchwald
29b5ad5dea Fix typo in method name
closes https://github.com/official-stockfish/Stockfish/pull/4404

No functional change
2023-02-24 20:12:53 +01:00
Joost VandeVondele
08385527dd Introduce a function to compute NNUE accumulator
This patch introduces `hint_common_parent_position()` to signal that potentially several child nodes will require an NNUE eval. By populating explicitly the accumulator, these subsequent evaluations can be performed more efficiently.

This was based on the observation that calculating the evaluation in an excluded move position yielded a significant Elo gain, even though the evaluation itself was already available (work by pb00067).

Sopel wrote the code to perform just the accumulator update. This PR is based on cleaned up code that

passed STC:
https://tests.stockfishchess.org/tests/view/63f62f9be74a12625bcd4aa0
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 110368 W: 29607 L: 29167 D: 51594
Ptnml(0-2): 41, 10551, 33572, 10967, 53

and in the earlier (equivalent) version

passed STC:
https://tests.stockfishchess.org/tests/view/63f3c3fee74a12625bcce2a6
LLR: 2.95 (-2.94,2.94) <0.00,2.00>
Total: 47552 W: 12786 L: 12467 D: 22299
Ptnml(0-2): 120, 5107, 12997, 5438, 114

passed LTC:
https://tests.stockfishchess.org/tests/view/63f45cc2e74a12625bccfa63
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 110368 W: 29607 L: 29167 D: 51594
Ptnml(0-2): 41, 10551, 33572, 10967, 53

closes https://github.com/official-stockfish/Stockfish/pull/4402

Bench: 3726250
2023-02-23 13:25:35 +01:00
Sebastian Buchwald
b60f9cc451 Update copyright years
Happy New Year!

closes https://github.com/official-stockfish/Stockfish/pull/4315

No functional change
2023-01-02 19:07:38 +01:00
mstembera
93f71ecfe1 Optimize make_index() using templates and lookup tables.
https://tests.stockfishchess.org/tests/view/634517e54bc7650f07542f99
LLR: 2.94 (-2.94,2.94) <0.00,2.00>
Total: 642672 W: 171819 L: 170658 D: 300195
Ptnml(0-2): 2278, 68077, 179416, 69336, 2229

this also introduces `-flto-partition=one` as suggested by MinetaS (Syine Mineta)
to avoid linking errors due to LTO on 32 bit mingw. This change was tested in isolation as well

https://tests.stockfishchess.org/tests/view/634aacf84bc7650f0755188b
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 119352 W: 31986 L: 31862 D: 55504
Ptnml(0-2): 439, 12624, 33400, 12800, 413

closes https://github.com/official-stockfish/Stockfish/pull/4199

No functional change
2022-10-16 11:42:19 +02:00
Giacomo Lorenzetti
f7d1491b3d Assorted small cleanups
closes https://github.com/official-stockfish/Stockfish/pull/3973

No functional change
2022-05-29 18:42:48 +02:00
Ben Chaney
270a0e737f Generalize the feature transform to use vec_t macros
This commit generalizes the feature transform to use vec_t macros
that are architecture-defined instead of using a separate code path for each one.

It should make some old architectures (MMX, including improvements by Fanael) faster
and make further such improvements easier in the future.

Includes some corrections to CI for mingw.

closes https://github.com/official-stockfish/Stockfish/pull/3955
closes https://github.com/official-stockfish/Stockfish/pull/3928

No functional change
2022-03-02 23:39:08 +01:00
mstembera
5f781d366e Clean up and simplify some nnue code.
Remove some unnecessary code and its execution during inference. Also, the change on line 49 in nnue_architecture.h results in a more efficient SIMD code path through ClippedReLU::propagate().

passed STC:
https://tests.stockfishchess.org/tests/view/6217d3bfda649bba32ef25d5
LLR: 2.94 (-2.94,2.94) <-2.25,0.25>
Total: 12056 W: 3281 L: 3092 D: 5683
Ptnml(0-2): 55, 1213, 3312, 1384, 64

passed STC SMP:
https://tests.stockfishchess.org/tests/view/6217f344da649bba32ef295e
LLR: 2.94 (-2.94,2.94) <-2.25,0.25>
Total: 27376 W: 7295 L: 7137 D: 12944
Ptnml(0-2): 52, 2859, 7715, 3003, 59

closes https://github.com/official-stockfish/Stockfish/pull/3944

No functional change

bench: 6820724
2022-02-25 08:37:57 +01:00
Tomasz Sobczyk
cb9c2594fc Update architecture to "SFNNv4". Update network to nn-6877cd24400e.nnue.
Architecture:

The diagram of the "SFNNv4" architecture:
https://user-images.githubusercontent.com/8037982/153455685-cbe3a038-e158-4481-844d-9d5fccf5c33a.png

The most important architectural changes are the following:

* 1024x2 [activated] neurons are pairwise, elementwise multiplied (not quite pairwise due to implementation details, see diagram), which introduces a non-linearity that exhibits similar benefits to previously tested sigmoid activation (quantmoid4), while being slightly faster.
* The following layer therefore has 2x fewer inputs, which we compensate for by having 2x more outputs. It is possible that reducing the number of outputs might be beneficial (as we had it as low as 8 before). The layer is now 1024->16.
* The 16 outputs are split into 15 and 1. The 1-wide output is added to the network output (after some necessary scaling due to quantization differences). The 15-wide is activated and follows the usual path through a set of linear layers. The additional 1-wide output is at least neutral, but has shown a slightly positive trend in training compared to networks without it (all 16 outputs through the usual path), and allows possibly an additional stage of lazy evaluation to be introduced in the future.

Additionally, the inference code was rewritten and no longer uses a recursive implementation. This was necessitated by the splitting of the 16-wide intermediate result into two, which was impossible to do with the old implementation with ugly hacks. This is hopefully overall for the better.
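
The pairwise multiplication can be pictured with this scalar sketch
(hypothetical names and clipping range; the real code is quantized and
vectorized, and differs in pairing details as noted above):

```cpp
#include <algorithm>
#include <cstdint>

constexpr int kHalf = 1024;  // per-perspective transformer outputs

// Sketch: clip both halves to [0, 127], multiply pairs elementwise,
// and rescale the product back into an 8-bit activation.
void pairwise_multiply(const std::int16_t (&in)[2 * kHalf],
                       std::uint8_t (&out)[kHalf]) {
    for (int i = 0; i < kHalf; ++i) {
        int a  = std::clamp<int>(in[i], 0, 127);
        int b  = std::clamp<int>(in[i + kHalf], 0, 127);
        out[i] = static_cast<std::uint8_t>((a * b) >> 7);  // back into 0..127
    }
}
```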

First session:

The first session was training a network from scratch (random initialization). The exact trainer used was slightly different (older) from the one used in the second session, but it should not have a measurable effect. The purpose of this session is to establish a strong network base for the second session. Small deviations in strength do not harm the learnability in the second session.

The training was done using the following command:

```
python3 train.py \
    /home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \
    /home/sopel/nnue/nnue-pytorch-training/data/nodes5000pv2_UHO.binpack \
    --gpus "$3," \
    --threads 4 \
    --num-workers 4 \
    --batch-size 16384 \
    --progress_bar_refresh_rate 20 \
    --random-fen-skipping 3 \
    --features=HalfKAv2_hm^ \
    --lambda=1.0 \
    --gamma=0.992 \
    --lr=8.75e-4 \
    --max_epochs=400 \
    --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2
```

Every 20th net was saved and its playing strength measured against some baseline at 25k nodes per move with pure NNUE evaluation (modified binary). The exact setup is not important as long as it's consistent. The purpose is to sift good candidates from bad ones.

The dataset can be found https://drive.google.com/file/d/1UQdZN_LWQ265spwTBwDKo0t1WjSJKvWY/view

Second session:

The second training session was done starting from the best network (as determined by strength testing) from the first session. It is important that it's resumed from a .pt model and NOT a .ckpt model. The conversion can be performed directly using serialize.py

The LR schedule was modified to use gamma=0.995 instead of gamma=0.992 and LR=4.375e-4 instead of LR=8.75e-4 to flatten the LR curve and allow for longer training. The training was then running for 800 epochs instead of 400 (though it's possibly mostly noise after around epoch 600).

The training was done using the following command:

```
python3 train.py \
        /data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \
        /data/sopel/nnue/nnue-pytorch-training/data/T60T70wIsRightFarseerT60T74T75T76.binpack \
        --gpus "$3," \
        --threads 4 \
        --num-workers 4 \
        --batch-size 16384 \
        --progress_bar_refresh_rate 20 \
        --random-fen-skipping 3 \
        --features=HalfKAv2_hm^ \
        --lambda=1.0 \
        --gamma=0.995 \
        --lr=4.375e-4 \
        --max_epochs=800 \
        --resume-from-model /data/sopel/nnue/nnue-pytorch-training/data/exp295/nn-epoch399.pt \
        --default_root_dir ../nnue-pytorch-training/experiment_$1/run_$run_id
```

In particular note that we now use lambda=1.0 instead of lambda=0.8 (previous nets), because tests show that WDL-skipping introduced by vondele performs better with lambda=1.0. Nets were being saved every 20th epoch. In total 16 runs were made with these settings and the best nets chosen according to playing strength at 25k nodes per move with pure NNUE evaluation - these are the 4 nets that have been put on fishtest.

The dataset can be found either at ftp://ftp.chessdb.cn/pub/sopel/data_sf/T60T70wIsRightFarseerT60T74T75T76.binpack in its entirety (download might be painfully slow because hosted in China) or can be assembled in the following way:

1. Get the 5640ad48ae/script/interleave_binpacks.py script.
2. Download T60T70wIsRightFarseer.binpack https://drive.google.com/file/d/1_sQoWBl31WAxNXma2v45004CIVltytP8/view
3. Download farseerT74.binpack http://trainingdata.farseer.org/T74-May13-End.7z
4. Download farseerT75.binpack http://trainingdata.farseer.org/T75-June3rd-End.7z
5. Download farseerT76.binpack http://trainingdata.farseer.org/T76-Nov10th-End.7z
6. Run python3 interleave_binpacks.py T60T70wIsRightFarseer.binpack farseerT74.binpack farseerT75.binpack farseerT76.binpack T60T70wIsRightFarseerT60T74T75T76.binpack

Tests:

STC: https://tests.stockfishchess.org/tests/view/6203fb85d71106ed12a407b7
LLR: 2.94 (-2.94,2.94) <0.00,2.50>
Total: 16952 W: 4775 L: 4521 D: 7656
Ptnml(0-2): 133, 1818, 4318, 2076, 131

LTC: https://tests.stockfishchess.org/tests/view/62041e68d71106ed12a40e85
LLR: 2.94 (-2.94,2.94) <0.50,3.00>
Total: 14944 W: 4138 L: 3907 D: 6899
Ptnml(0-2): 21, 1499, 4202, 1728, 22

closes https://github.com/official-stockfish/Stockfish/pull/3927

Bench: 4919707
2022-02-10 19:54:31 +01:00
Brad Knox
ad926d34c0 Update copyright years
Happy New Year!

closes https://github.com/official-stockfish/Stockfish/pull/3881

No functional change
2022-01-06 15:45:45 +01:00
Tomasz Sobczyk
4766dfc395 Optimize FT activation and affine transform for NEON.
This patch optimizes the NEON implementation in two ways.

* The activation layer after the feature transformer is rewritten to make it easier for the compiler to see through dependencies and unroll. This in itself is a minimal but positive improvement. Other architectures could benefit from this too in the future. This is not an algorithmic change.
* The affine transform for large matrices (first layer after FT) on NEON now utilizes the same optimized code path as >=SSSE3, which makes the memory accesses more sequential and makes better use of the available registers, allowing for code with longer dependency chains.

Benchmarks from Redshift#161, profile-build with apple clang

```
george@Georges-MacBook-Air nets % ./stockfish-b82d93 bench 2>&1 | tail -4 (current master)
===========================
Total time (ms) : 2167
Nodes searched  : 4667742
Nodes/second    : 2154011
george@Georges-MacBook-Air nets % ./stockfish-7377b8 bench 2>&1 | tail -4 (this patch)
===========================
Total time (ms) : 1842
Nodes searched  : 4667742
Nodes/second    : 2534061
```

This is a solid 18% improvement overall; the improvement is larger in an NNUE-only bench than in a mixed one.

Improvement is also observed on armv7-neon (Raspberry Pi, and older phones), around 5% speedup.

No changes for architectures other than NEON.

closes https://github.com/official-stockfish/Stockfish/pull/3837

No functional changes.
2021-12-07 18:08:54 +01:00
mstembera
644f6d4790 Simplify away ValueListInserter
plus minor cleanups

STC: https://tests.stockfishchess.org/tests/view/616f059b40f619782fd4f73f
LLR: 2.94 (-2.94,2.94) <-2.50,0.50>
Total: 84992 W: 21244 L: 21197 D: 42551
Ptnml(0-2): 279, 9005, 23868, 9078, 266

closes https://github.com/official-stockfish/Stockfish/pull/3749

No functional change
2021-10-23 12:21:17 +02:00
Tomasz Sobczyk
900f249f59 Reduce the number of accumulator states
Reduce from 3 to 2. Make the intent of the states clearer.

STC: https://tests.stockfishchess.org/tests/view/60c50111457376eb8bcaad03
LLR: 2.95 (-2.94,2.94) <-2.50,0.50>
Total: 61888 W: 5007 L: 4944 D: 51937
Ptnml(0-2): 164, 3947, 22649, 4030, 154

LTC: https://tests.stockfishchess.org/tests/view/60c52b1c457376eb8bcaad2c
LLR: 2.94 (-2.94,2.94) <-2.50,0.50>
Total: 20248 W: 688 L: 618 D: 18942
Ptnml(0-2): 7, 551, 8946, 605, 15

closes https://github.com/official-stockfish/Stockfish/pull/3548

No functional change.
2021-06-14 11:22:08 +02:00
Tomasz Sobczyk
ce4c523ad3 Register count for feature transformer
Compute optimal register count for feature transformer accumulation dynamically.
This also introduces a change where AVX512 would only use 8 registers instead of 16
(now possible due to a 2x increase in feature transformer size).
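
The underlying idea, as a sketch with a hypothetical signature:

```cpp
// Sketch: pick the largest register count that tiles the accumulator row
// evenly, so a whole tile stays resident in registers during accumulation.
constexpr int best_register_count(int lanesPerRegister, int halfDimensions,
                                  int maxRegisters) {
    for (int n = maxRegisters; n > 1; --n)
        if (halfDimensions % (n * lanesPerRegister) == 0)
            return n;
    return 1;
}

// e.g. 16 int16 lanes per xmm register, 256-wide half, 16 registers available
static_assert(best_register_count(16, 256, 16) == 16, "");
```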

closes https://github.com/official-stockfish/Stockfish/pull/3543

No functional change
2021-06-13 13:10:56 +02:00
Tomasz Sobczyk
b84fa04db6 Read NNUE net faster
Load feature transformer weights in bulk on little-endian machines.
This is in particular useful to test new nets with c-chess-cli,
see https://github.com/lucasart/c-chess-cli/issues/44

```
$ time ./stockfish.exe uci

Before : 0m0.914s
After  : 0m0.483s
```
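
The gist of the bulk load, as a hedged sketch (hypothetical helper; the real
loader still reads value by value on big-endian machines):

```cpp
#include <cstdint>
#include <istream>
#include <vector>

// Sketch: on little-endian machines the on-disk int16 layout matches memory,
// so the whole weight block can be read in one call instead of per value.
bool read_weights_bulk(std::istream& in, std::vector<std::int16_t>& weights) {
    return static_cast<bool>(
        in.read(reinterpret_cast<char*>(weights.data()),
                static_cast<std::streamsize>(weights.size()
                                             * sizeof(std::int16_t))));
}
```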

No functional change
2021-06-13 09:39:03 +02:00
Stéphane Nicolet
8f081c86f7 Clean SIMD code a bit
Cleaner vector code structure in feature transformer. This patch just
regroups the parts of the inner loop for each SIMD instruction set.

Tested for non-regression:
LLR: 2.96 (-2.94,2.94) <-2.50,0.50>
Total: 115760 W: 9835 L: 9831 D: 96094
Ptnml(0-2): 326, 7776, 41715, 7694, 369
https://tests.stockfishchess.org/tests/view/60b96b39457376eb8bcaa26e

It would be nice if a future patch could use some of the macros at
the top of the file to unify the code between the distinct SIMD
instruction sets (of course, unifying the ReLU will be the challenge).

closes https://github.com/official-stockfish/Stockfish/pull/3506

No functional change
2021-06-04 14:07:46 +02:00
Tomasz Sobczyk
5448cad29e Fix export of the feature transformer.
PSQT export was missing.

fixes #3507

closes https://github.com/official-stockfish/Stockfish/pull/3508

No functional change
2021-05-30 21:31:58 +02:00
Stéphane Nicolet
f193778446 Do not use lazy evaluation inside NNUE
This simplification patch implements two changes:

1. it simplifies away the so-called "lazy" path in the NNUE evaluation internals,
   where we trusted the psqt head alone to avoid the costly "positional" head in
   some cases;
2. it raises the NNUEThreshold1 in evaluate.cpp a little (from 682 to 800),
   which increases the limit where we switch from NNUE eval to Classical eval.

Both effects increase the number of positional evaluations done by our new net
architecture, but the results of our tests below seem to indicate that the loss
of speed will be compensated by the gain in eval quality.

STC:
LLR: 2.95 (-2.94,2.94) <-2.50,0.50>
Total: 26280 W: 2244 L: 2137 D: 21899
Ptnml(0-2): 72, 1755, 9405, 1810, 98
https://tests.stockfishchess.org/tests/view/60ae73f112066fd299795a51

LTC:
LLR: 2.95 (-2.94,2.94) <-2.50,0.50>
Total: 20592 W: 750 L: 677 D: 19165
Ptnml(0-2): 9, 614, 8980, 681, 12
https://tests.stockfishchess.org/tests/view/60ae88e812066fd299795a82

closes https://github.com/official-stockfish/Stockfish/pull/3503

Bench: 3817907
2021-05-27 01:21:56 +02:00
Tomasz Sobczyk
9d53129075 Expose the lazy threshold for the feature transformer PSQT as a parameter.
Definition of the lazy threshold moved to evaluate.cpp where all others are.
Lazy threshold only used for real searches, not used for the "eval" call.
This preserves the purity of NNUE evaluation, which is useful to verify
consistency between the engine and the NNUE trainer.

closes https://github.com/official-stockfish/Stockfish/pull/3499

No functional change
2021-05-25 21:40:51 +02:00
Fanael Linithien
038487f954 Use packed 32-bit MMX operations for updating the PSQT accumulator
This improves the speed of NNUE by a bit on the old hardware this code path
is intended for, like a Pentium III 1.13 GHz:

10 repeats of "./stockfish bench 16 1 13 default depth NNUE":

Before:
54 642 504 897 cycles (± 0.12%)
62 301 937 829 instructions (± 0.03%)

After:
54 320 821 928 cycles (± 0.13%)
62 084 742 699 instructions (± 0.02%)

Speed of go depth 20 from startpos:

Before: 53103 nps
After: 53856 nps

closes https://github.com/official-stockfish/Stockfish/pull/3476

No functional change.
2021-05-19 19:34:44 +02:00
Tomasz Sobczyk
e8d64af123 New NNUE architecture and net
Introduces a new NNUE network architecture and associated network parameters,
as obtained by a new pytorch trainer.

The network is already very strong at short TC, without regression at longer TC,
and has potential for further improvements.

https://tests.stockfishchess.org/tests/view/60a159c65085663412d0921d
TC: 10s+0.1s, 1 thread
ELO: 21.74 +-3.4 (95%) LOS: 100.0%
Total: 10000 W: 1559 L: 934 D: 7507
Ptnml(0-2): 38, 701, 2972, 1176, 113

https://tests.stockfishchess.org/tests/view/60a187005085663412d0925b
TC: 60s+0.6s, 1 thread
ELO: 5.85 +-1.7 (95%) LOS: 100.0%
Total: 20000 W: 1381 L: 1044 D: 17575
Ptnml(0-2): 27, 885, 7864, 1172, 52

https://tests.stockfishchess.org/tests/view/60a2beede229097940a03806
TC: 20s+0.2s, 8 threads
LLR: 2.93 (-2.94,2.94) <0.50,3.50>
Total: 34272 W: 1610 L: 1452 D: 31210
Ptnml(0-2): 30, 1285, 14350, 1439, 32

https://tests.stockfishchess.org/tests/view/60a2d687e229097940a03c72
TC: 60s+0.6s, 8 threads
LLR: 2.94 (-2.94,2.94) <-2.50,0.50>
Total: 45544 W: 1262 L: 1214 D: 43068
Ptnml(0-2): 12, 1129, 20442, 1177, 12

The network has been trained (by vondele) using the https://github.com/glinscott/nnue-pytorch/ trainer (started by glinscott),
specifically the branch https://github.com/Sopel97/nnue-pytorch/tree/experiment_56.
The data used consist of 64 billion positions (193GB total) generated and scored with the current master net
d8: https://drive.google.com/file/d/1hOOYSDKgOOp38ZmD0N4DV82TOLHzjUiF/view?usp=sharing
d9: https://drive.google.com/file/d/1VlhnHL8f-20AXhGkILujnNXHwy9T-MQw/view?usp=sharing
d10: https://drive.google.com/file/d/1ZC5upzBYMmMj1gMYCkt6rCxQG0GnO3Kk/view?usp=sharing
fishtest_d9: https://drive.google.com/file/d/1GQHt0oNgKaHazwJFTRbXhlCN3FbUedFq/view?usp=sharing

This network also contains a few architectural changes with respect to the current master:

* Size changed from 256x2-32-32-1 to 512x2-16-32-1
  - ~15-20% slower
  - ~2x larger
  - adds a special path for 16-valued ClippedReLU
  - fixes affine transform code for 16 inputs/outputs by using InputDimensions
    instead of PaddedInputDimensions; this is safe now because the inputs are
    processed in groups of 4 in the current affine transform code
* The feature set changed from HalfKP to HalfKAv2
  - includes information about the kings like HalfKA
  - packs king features better, resulting in an 8% size reduction compared to HalfKA
* The board is flipped for black's perspective, instead of rotated like in the current master
* PSQT values for each feature
  - the feature transformer now outputs a part that is forwarded directly to the
    output, which allows learning piece values more directly than the previous
    network architecture. The effect is visible in high-imbalance positions,
    where the current master network outputs evaluations skewed towards zero.
  - 8 PSQT values per feature, chosen based on (popcount(pos.pieces()) - 1) / 4
  - initialized to classical material values at the start of training
* 8 subnetworks (512x2->16->32->1), chosen based on (popcount(pos.pieces()) - 1) / 4
  - only one subnetwork is evaluated for any position, with no or marginal speed loss
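
The bucket index is cheap to compute; a sketch (hypothetical helper name,
C++20 for std::popcount):

```cpp
#include <bit>
#include <cstdint>

// Sketch: both the PSQT value and the subnetwork are chosen by a bucket
// index derived from the number of pieces on the board.
int layer_stack_bucket(std::uint64_t occupied) {
    return (std::popcount(occupied) - 1) / 4;  // 8 buckets for 1..32 pieces
}
```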

A diagram of the network is available: https://user-images.githubusercontent.com/8037982/118656988-553a1700-b7eb-11eb-82ef-56a11cbebbf2.png
A more complete description: https://github.com/glinscott/nnue-pytorch/blob/master/docs/nnue.md

closes https://github.com/official-stockfish/Stockfish/pull/3474

Bench: 3806488
2021-05-18 18:06:23 +02:00
Tomasz Sobczyk
58054fd0fa Exporting the currently loaded network file
This PR adds an ability to export any currently loaded network.
The export_net command now takes an optional filename parameter.
If the loaded net is not the embedded net the filename parameter is required.

Two changes were required to support this:

* the "architecture" string, which is really just a some kind of description in the net, is now saved into netDescription on load and correctly saved on export.
* the AffineTransform scrambles weights for some architectures and sparsifies them, such that retrieving the index is hard. This is solved by having a temporary scrambled<->unscrambled index lookup table when loading the network, and the actual index is saved for each individual weight that makes it to canSaturate16. This increases the size of the canSaturate16 entries by 6 bytes.

closes https://github.com/official-stockfish/Stockfish/pull/3456

No functional change
2021-05-11 19:36:11 +02:00
Tomasz Sobczyk
b748b46714 Cleanup and simplify NNUE code.
A lot of optimizations happened since NNUE was introduced
and since then some parts of the code were left unused. This
got to the point where asserts had to be added just to
let people know that modifying something will not have any
effect or may even break everything due to the assumptions
being made. Removing these parts removes those nonexistent
"false dependencies". Additionally:

 * append_changed_indices now takes the king pos and stateinfo
   explicitly, no more misleading pos parameter
 * IndexList is removed in favor of a generic ValueList.
   Feature transformer just instantiates the type it needs.
 * The update cost and refresh requirement is deferred to the
   feature set once again, but now doesn't go through the whole
   FeatureSet machinery and just calls HalfKP directly.
 * accumulator no longer has a singular dimension.
 * The PS constants and the PieceSquareIndex array are made local
   to the HalfKP feature set because they are specific to it and
   DO differ for other feature sets.
 * A few names are changed to be more descriptive

Passed STC non-regression:
https://tests.stockfishchess.org/tests/view/608421dd95e7f1852abd2790
LLR: 2.95 (-2.94,2.94) <-2.50,0.50>
Total: 180008 W: 16186 L: 16258 D: 147564
Ptnml(0-2): 587, 12593, 63725, 12503, 596

closes https://github.com/official-stockfish/Stockfish/pull/3441

No functional change
2021-04-25 13:16:30 +02:00
Tomasz Sobczyk
fbbd4adc3c Unify naming convention of the NNUE code
matches the rest of the stockfish code base

closes https://github.com/official-stockfish/Stockfish/pull/3437

No functional change
2021-04-24 12:49:29 +02:00
Dieter Dobbelaere
7ffae17f85 Add Stockfish namespace.
fixes #3350 and is a small cleanup that might make it easier to use SF
in separate projects, like a NNUE trainer or similar.

closes https://github.com/official-stockfish/Stockfish/pull/3370

No functional change.
2021-03-07 14:26:54 +01:00
Joost VandeVondele
c4d67d77c9 Update copyright years
No functional change
2021-01-08 17:04:23 +01:00
Stéphane Nicolet
027626db1e Small cleanups 13
No functional change
2020-11-23 22:20:32 +01:00
Tomasz Sobczyk
ba35c88ab8 AVX-512 for smaller affine and feature transforms.
For the feature transformer the code is analogous to AVX2 since there was room for easy adaptation of wider simd registers.

For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers.

These changes provide about a 1.5% speedup for AVX-512 builds.

Closes https://github.com/official-stockfish/Stockfish/pull/3218

No functional change.
2020-11-07 16:49:49 +01:00
Tomasz Sobczyk
3f6451eff7 Manually align arrays on the stack
as a workaround to issues with overaligned alignas() on stack variables in gcc < 9.3 on windows.

closes https://github.com/official-stockfish/Stockfish/pull/3217

fixes #3216

No functional change
2020-11-04 19:52:42 +01:00
syzygy1
2046d5da30 More incremental accumulator updates
This patch was inspired by c065abd which updates the accumulator,
if possible, based on the accumulator of two plies back if
the accumulator of the preceding ply is not available.

With this patch we look back even further in the position history
in an attempt to reduce the number of complete recomputations.
When we find a usable accumulator for the position N plies back,
we also update the accumulator of the position N-1 plies back
because that accumulator is most likely to be helpful later
when evaluating positions in sibling branches.
By not updating all intermediate accumulators immediately,
we avoid doing too much work that is not certain to be useful.
Overall, roughly 2-3% speedup.
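
In outline, with hypothetical types (the real logic lives in the feature
transformer's update code):

```cpp
// Sketch: walk back through the state history until a usable accumulator
// is found; the caller then updates forward, also materializing the entry
// one ply closer so sibling branches can reuse it.
struct StateInfo {
    StateInfo* previous;
    bool       accumulatorComputed;
};

StateInfo* find_usable_accumulator(StateInfo* st, int maxLookBack) {
    for (int plies = 0; st && plies < maxLookBack; ++plies) {
        if (st->accumulatorComputed)
            return st;       // update forward from here (updates omitted)
        st = st->previous;
    }
    return nullptr;          // nothing usable: rebuild from scratch
}
```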

This patch makes the code more specific to the net architecture,
changing input features of the net will require additional changes
to the incremental update code as discussed in the PR #3193 and #3191.

Passed STC:
https://tests.stockfishchess.org/tests/view/5f9056712c92c7fe3a8c60d0
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 10040 W: 1116 L: 968 D: 7956
Ptnml(0-2): 42, 722, 3365, 828, 63

closes https://github.com/official-stockfish/Stockfish/pull/3193

No functional change.
2020-10-22 20:50:16 +02:00
noobpwnftw
c065abdcaf Use incremental updates more often
Use incremental updates for accumulators for up to 2 plies.
Do not copy accumulator. About 2% speedup.

Passed STC:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 21752 W: 2583 L: 2403 D: 16766
Ptnml(0-2): 128, 1761, 6923, 1931, 133
https://tests.stockfishchess.org/tests/view/5f7150cf3b22d6afa5069412

closes https://github.com/official-stockfish/Stockfish/pull/3157

No functional change
2020-09-28 16:54:35 +02:00