1
0
Fork 0
mirror of https://github.com/sockspls/badfish synced 2025-05-02 01:29:36 +00:00
Commit graph

153 commits

Author SHA1 Message Date
Tomasz Sobczyk
2e745956c0 Change trace with NNUE eval support
This patch adds some more output to the `eval` command. It adds a board display
with estimated piece values (method is remove-piece, evaluate, put-piece), and
splits the NNUE evaluation with (psqt,layers) for each bucket for the NNUE net.

Example:

```

./stockfish
position fen 3Qb1k1/1r2ppb1/pN1n2q1/Pp1Pp1Pr/4P2p/4BP2/4B1R1/1R5K b - - 11 40
eval

 Contributing terms for the classical eval:
+------------+-------------+-------------+-------------+
|    Term    |    White    |    Black    |    Total    |
|            |   MG    EG  |   MG    EG  |   MG    EG  |
+------------+-------------+-------------+-------------+
|   Material |  ----  ---- |  ----  ---- | -0.73 -1.55 |
|  Imbalance |  ----  ---- |  ----  ---- | -0.21 -0.17 |
|      Pawns |  0.35 -0.00 |  0.19 -0.26 |  0.16  0.25 |
|    Knights |  0.04 -0.08 |  0.12 -0.01 | -0.08 -0.07 |
|    Bishops | -0.34 -0.87 | -0.17 -0.61 | -0.17 -0.26 |
|      Rooks |  0.12  0.00 |  0.08  0.00 |  0.04  0.00 |
|     Queens |  0.00  0.00 | -0.27 -0.07 |  0.27  0.07 |
|   Mobility |  0.84  1.76 |  0.01  0.66 |  0.83  1.10 |
|King safety | -0.99 -0.17 | -0.72 -0.10 | -0.27 -0.07 |
|    Threats |  0.27  0.27 |  0.73  0.86 | -0.46 -0.59 |
|     Passed |  0.00  0.00 |  0.79  0.82 | -0.79 -0.82 |
|      Space |  0.61  0.00 |  0.24  0.00 |  0.37  0.00 |
|   Winnable |  ----  ---- |  ----  ---- |  0.00 -0.03 |
+------------+-------------+-------------+-------------+
|      Total |  ----  ---- |  ----  ---- | -1.03 -2.14 |
+------------+-------------+-------------+-------------+

 NNUE derived piece values:
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |   Q   |   b   |       |   k   |       |
|       |       |       | +12.4 | -1.62 |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |   r   |       |       |   p   |   p   |   b   |       |
|       | -3.89 |       |       | -0.84 | -1.19 | -3.32 |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   p   |   N   |       |   n   |       |       |   q   |       |
| -1.81 | +3.71 |       | -4.82 |       |       | -5.04 |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   P   |   p   |       |   P   |   p   |       |   P   |   r   |
| +1.16 | -0.91 |       | +0.55 | +0.12 |       | +0.50 | -4.02 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |   P   |       |       |   p   |
|       |       |       |       | +2.33 |       |       | +1.17 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |   B   |   P   |       |       |
|       |       |       |       | +4.79 | +1.54 |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |   B   |       |   R   |       |
|       |       |       |       | +4.54 |       | +6.03 |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |   R   |       |       |       |       |       |   K   |
|       | +4.81 |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+

 NNUE network contributions (Black to move)
+------------+------------+------------+------------+
|   Bucket   |  Material  | Positional |   Total    |
|            |   (PSQT)   |  (Layers)  |            |
+------------+------------+------------+------------+
|  0         |  +  0.32   |  -  1.46   |  -  1.13   |
|  1         |  +  0.25   |  -  0.68   |  -  0.43   |
|  2         |  +  0.46   |  -  1.72   |  -  1.25   |
|  3         |  +  0.55   |  -  1.80   |  -  1.25   |
|  4         |  +  0.48   |  -  1.77   |  -  1.29   |
|  5         |  +  0.40   |  -  2.00   |  -  1.60   |
|  6         |  +  0.57   |  -  2.12   |  -  1.54   | <-- this bucket is used
|  7         |  +  3.38   |  -  2.00   |  +  1.37   |
+------------+------------+------------+------------+

Classical evaluation   -1.00 (white side)
NNUE evaluation        +1.54 (white side)
Final evaluation       +2.38 (white side) [with scaled NNUE, hybrid, ...]

```

Also renames the export_net() function to save_eval() while there.

closes https://github.com/official-stockfish/Stockfish/pull/3562

No functional change
2021-06-19 11:57:01 +02:00
Tomasz Sobczyk
900f249f59 Reduce the number of accumulator states
Reduce from 3 to 2. Make the intent of the states clearer.

STC: https://tests.stockfishchess.org/tests/view/60c50111457376eb8bcaad03
LLR: 2.95 (-2.94,2.94) <-2.50,0.50>
Total: 61888 W: 5007 L: 4944 D: 51937
Ptnml(0-2): 164, 3947, 22649, 4030, 154

LTC: https://tests.stockfishchess.org/tests/view/60c52b1c457376eb8bcaad2c
LLR: 2.94 (-2.94,2.94) <-2.50,0.50>
Total: 20248 W: 688 L: 618 D: 18942
Ptnml(0-2): 7, 551, 8946, 605, 15

closes https://github.com/official-stockfish/Stockfish/pull/3548

No functional change.
2021-06-14 11:22:08 +02:00
Tomasz Sobczyk
ce4c523ad3 Register count for feature transformer
Compute optimal register count for feature transformer accumulation dynamically.
This also introduces a change where AVX512 would only use 8 registers instead of 16
(now possible due to a 2x increase in feature transformer size).

closes https://github.com/official-stockfish/Stockfish/pull/3543

No functional change
2021-06-13 13:10:56 +02:00
Stéphane Nicolet
7819412002 Clarify use of UCI options
Update README.md to clarify use of UCI options

closes https://github.com/official-stockfish/Stockfish/pull/3540

No functional change
2021-06-13 10:02:43 +02:00
Tomasz Sobczyk
b84fa04db6 Read NNUE net faster
Load feature transformer weights in bulk on little-endian machines.
This is in particular useful to test new nets with c-chess-cli,
see https://github.com/lucasart/c-chess-cli/issues/44

```
$ time ./stockfish.exe uci

Before : 0m0.914s
After  : 0m0.483s
```

No functional change
2021-06-13 09:39:03 +02:00
Stéphane Nicolet
8f081c86f7 Clean SIMD code a bit
Cleaner vector code structure in feature transformer. This patch just
regroups the parts of the inner loop for each SIMD instruction set.

Tested for non-regression:
LLR: 2.96 (-2.94,2.94) <-2.50,0.50>
Total: 115760 W: 9835 L: 9831 D: 96094
Ptnml(0-2): 326, 7776, 41715, 7694, 369
https://tests.stockfishchess.org/tests/view/60b96b39457376eb8bcaa26e

It would be nice if a future patch could use some of the macros at
the top of the file to unify the code between the distincts SIMD
instruction sets (of course, unifying the Relu will be the challenge).

closes https://github.com/official-stockfish/Stockfish/pull/3506

No functional change
2021-06-04 14:07:46 +02:00
Tomasz Sobczyk
5448cad29e Fix export of the feature transformer.
PSQT export was missing.

fixes #3507

closes https://github.com/official-stockfish/Stockfish/pull/3508

No functional change
2021-05-30 21:31:58 +02:00
Stéphane Nicolet
f193778446 Do not use lazy evaluation inside NNUE
This simplification patch implements two changes:

1. it simplifies away the so-called "lazy" path in the NNUE evaluation internals,
   where we trusted the psqt head alone to avoid the costly "positional" head in
   some cases;
2. it raises a little bit the NNUEThreshold1 in evaluate.cpp (from 682 to 800),
   which increases the limit where we switched from NNUE eval to Classical eval.

Both effects increase the number of positional evaluations done by our new net
architecture, but the results of our tests below seem to indicate that the loss
of speed will be compensated by the gain of eval quality.

STC:
LLR: 2.95 (-2.94,2.94) <-2.50,0.50>
Total: 26280 W: 2244 L: 2137 D: 21899
Ptnml(0-2): 72, 1755, 9405, 1810, 98
https://tests.stockfishchess.org/tests/view/60ae73f112066fd299795a51

LTC:
LLR: 2.95 (-2.94,2.94) <-2.50,0.50>
Total: 20592 W: 750 L: 677 D: 19165
Ptnml(0-2): 9, 614, 8980, 681, 12
https://tests.stockfishchess.org/tests/view/60ae88e812066fd299795a82

closes https://github.com/official-stockfish/Stockfish/pull/3503

Bench: 3817907
2021-05-27 01:21:56 +02:00
Tomasz Sobczyk
9d53129075 Expose the lazy threshold for the feature transformer PSQT as a parameter.
Definition of the lazy threshold moved to evaluate.cpp where all others are.
Lazy threshold only used for real searches, not used for the "eval" call.
This preserves the purity of NNUE evaluation, which is useful to verify
consistency between the engine and the NNUE trainer.

closes https://github.com/official-stockfish/Stockfish/pull/3499

No functional change
2021-05-25 21:40:51 +02:00
Stéphane Nicolet
a2f01c07eb Sometimes change the (materialist, positional) balance
Our new nets output two values for the side to move in the last layer.
We can interpret the first value as a material evaluation of the
position, and the second one as the dynamic, positional value of the
location of pieces.

This patch changes the balance for the (materialist, positional) parts
of the score from (128, 128) to (121, 135) when the piece material is
equal between the two players, but keeps the standard (128, 128) balance
when one player is at least an exchange up.

Passed STC:
LLR: 2.93 (-2.94,2.94) <-0.50,2.50>
Total: 15936 W: 1421 L: 1266 D: 13249
Ptnml(0-2): 37, 1037, 5694, 1134, 66
https://tests.stockfishchess.org/tests/view/60a82df9ce8ea25a3ef0408f

Passed LTC:
LLR: 2.94 (-2.94,2.94) <0.50,3.50>
Total: 13904 W: 516 L: 410 D: 12978
Ptnml(0-2): 4, 374, 6088, 484, 2
https://tests.stockfishchess.org/tests/view/60a8bbf9ce8ea25a3ef04101

closes https://github.com/official-stockfish/Stockfish/pull/3492

Bench: 3856635
2021-05-22 21:09:22 +02:00
Fanael Linithien
038487f954 Use packed 32-bit MMX operations for updating the PSQT accumulator
This improves the speed of NNUE by a bit on old hardware that code path
is intended for, like a Pentium III 1.13 GHz:

10 repeats of "./stockfish bench 16 1 13 default depth NNUE":

Before:
54 642 504 897 cycles (± 0.12%)
62 301 937 829 instructions (± 0.03%)

After:
54 320 821 928 cycles (± 0.13%)
62 084 742 699 instructions (± 0.02%)

Speed of go depth 20 from startpos:

Before: 53103 nps
After: 53856 nps

closes https://github.com/official-stockfish/Stockfish/pull/3476

No functional change.
2021-05-19 19:34:44 +02:00
Tomasz Sobczyk
e8d64af123 New NNUE architecture and net
Introduces a new NNUE network architecture and associated network parameters,
as obtained by a new pytorch trainer.

The network is already very strong at short TC, without regression at longer TC,
and has potential for further improvements.

https://tests.stockfishchess.org/tests/view/60a159c65085663412d0921d
TC: 10s+0.1s, 1 thread
ELO: 21.74 +-3.4 (95%) LOS: 100.0%
Total: 10000 W: 1559 L: 934 D: 7507
Ptnml(0-2): 38, 701, 2972, 1176, 113

https://tests.stockfishchess.org/tests/view/60a187005085663412d0925b
TC: 60s+0.6s, 1 thread
ELO: 5.85 +-1.7 (95%) LOS: 100.0%
Total: 20000 W: 1381 L: 1044 D: 17575
Ptnml(0-2): 27, 885, 7864, 1172, 52

https://tests.stockfishchess.org/tests/view/60a2beede229097940a03806
TC: 20s+0.2s, 8 threads
LLR: 2.93 (-2.94,2.94) <0.50,3.50>
Total: 34272 W: 1610 L: 1452 D: 31210
Ptnml(0-2): 30, 1285, 14350, 1439, 32

https://tests.stockfishchess.org/tests/view/60a2d687e229097940a03c72
TC: 60s+0.6s, 8 threads
LLR: 2.94 (-2.94,2.94) <-2.50,0.50>
Total: 45544 W: 1262 L: 1214 D: 43068
Ptnml(0-2): 12, 1129, 20442, 1177, 12

The network has been trained (by vondele) using the https://github.com/glinscott/nnue-pytorch/ trainer (started by glinscott),
specifically the branch https://github.com/Sopel97/nnue-pytorch/tree/experiment_56.
The data used are in 64 billion positions (193GB total) generated and scored with the current master net
d8: https://drive.google.com/file/d/1hOOYSDKgOOp38ZmD0N4DV82TOLHzjUiF/view?usp=sharing
d9: https://drive.google.com/file/d/1VlhnHL8f-20AXhGkILujnNXHwy9T-MQw/view?usp=sharing
d10: https://drive.google.com/file/d/1ZC5upzBYMmMj1gMYCkt6rCxQG0GnO3Kk/view?usp=sharing
fishtest_d9: https://drive.google.com/file/d/1GQHt0oNgKaHazwJFTRbXhlCN3FbUedFq/view?usp=sharing

This network also contains a few architectural changes with respect to the current master:

    Size changed from 256x2-32-32-1 to 512x2-16-32-1
        ~15-20% slower
        ~2x larger
        adds a special path for 16 valued ClippedReLU
        fixes affine transform code for 16 inputs/outputs, buy using InputDimensions instead of PaddedInputDimensions
            this is safe now because the inputs are processed in groups of 4 in the current affine transform code
    The feature set changed from HalfKP to HalfKAv2
        Includes information about the kings like HalfKA
        Packs king features better, resulting in 8% size reduction compared to HalfKA
    The board is flipped for the black's perspective, instead of rotated like in the current master
    PSQT values for each feature
        the feature transformer now outputs a part that is fowarded directly to the output and allows learning piece values more directly than the previous network architecture. The effect is visible for high imbalance positions, where the current master network outputs evaluations skewed towards zero.
        8 PSQT values per feature, chosen based on (popcount(pos.pieces()) - 1) / 4
        initialized to classical material values on the start of the training
    8 subnetworks (512x2->16->32->1), chosen based on (popcount(pos.pieces()) - 1) / 4
        only one subnetwork is evaluated for any position, no or marginal speed loss

A diagram of the network is available: https://user-images.githubusercontent.com/8037982/118656988-553a1700-b7eb-11eb-82ef-56a11cbebbf2.png
A more complete description: https://github.com/glinscott/nnue-pytorch/blob/master/docs/nnue.md

closes https://github.com/official-stockfish/Stockfish/pull/3474

Bench: 3806488
2021-05-18 18:06:23 +02:00
Stéphane Nicolet
f90274d8ce Small clean-ups
- Comment for Countemove pruning -> Continuation history
- Fix comment in input_slice.h
- Shorter lines in Makefile
- Comment for scale factor
- Fix comment for pinners in see_ge()
- Change Thread.id() signature to size_t
- Trailing space in reprosearch.sh
- Add Douglas Matos Gomes to the AUTHORS file
- Introduce comment for undo_null_move()
- Use Stockfish coding style for export_net()
- Change date in AUTHORS file

closes https://github.com/official-stockfish/Stockfish/pull/3416

No functional change
2021-05-17 10:47:14 +02:00
Tomasz Sobczyk
58054fd0fa Exporting the currently loaded network file
This PR adds an ability to export any currently loaded network.
The export_net command now takes an optional filename parameter.
If the loaded net is not the embedded net the filename parameter is required.

Two changes were required to support this:

* the "architecture" string, which is really just a some kind of description in the net, is now saved into netDescription on load and correctly saved on export.
* the AffineTransform scrambles weights for some architectures and sparsifies them, such that retrieving the index is hard. This is solved by having a temporary scrambled<->unscrambled index lookup table when loading the network, and the actual index is saved for each individual weight that makes it to canSaturate16. This increases the size of the canSaturate16 entries by 6 bytes.

closes https://github.com/official-stockfish/Stockfish/pull/3456

No functional change
2021-05-11 19:36:11 +02:00
Tomasz Sobczyk
b748b46714 Cleanup and simplify NNUE code.
A lot of optimizations happend since the NNUE was introduced
and since then some parts of the code were left unused. This
got to the point where asserts were have to be made just to
let people know that modifying something will not have any
effects or may even break everything due to the assumptions
being made. Removing these parts removes those inexisting
"false dependencies". Additionally:

 * append_changed_indices now takes the king pos and stateinfo
   explicitly, no more misleading pos parameter
 * IndexList is removed in favor of a generic ValueList.
   Feature transformer just instantiates the type it needs.
 * The update cost and refresh requirement is deferred to the
   feature set once again, but now doesn't go through the whole
   FeatureSet machinery and just calls HalfKP directly.
 * accumulator no longer has a singular dimension.
 * The PS constants and the PieceSquareIndex array are made local
   to the HalfKP feature set because they are specific to it and
   DO differ for other feature sets.
 * A few names are changed to more descriptive

Passed STC non-regression:
https://tests.stockfishchess.org/tests/view/608421dd95e7f1852abd2790
LLR: 2.95 (-2.94,2.94) <-2.50,0.50>
Total: 180008 W: 16186 L: 16258 D: 147564
Ptnml(0-2): 587, 12593, 63725, 12503, 596

closes https://github.com/official-stockfish/Stockfish/pull/3441

No functional change
2021-04-25 13:16:30 +02:00
Tomasz Sobczyk
fbbd4adc3c Unify naming convention of the NNUE code
matches the rest of the stockfish code base

closes https://github.com/official-stockfish/Stockfish/pull/3437

No functional change
2021-04-24 12:49:29 +02:00
Tomasz Sobczyk
255514fb29 Documentation patch: AppendChangedIndices
Clarify the assumptions on the position passed to the AppendChangedIndices().

Closes https://github.com/official-stockfish/Stockfish/pull/3428

No functional change
2021-04-15 12:21:30 +02:00
Stéphane Nicolet
83eac08e75 Small cleanups (march 2021)
With help of @BM123499, @mstembera, @gvreuls, @noobpwnftw and @Fanael
Thanks!

Closes https://github.com/official-stockfish/Stockfish/pull/3405

No functional change
2021-03-24 17:11:06 +01:00
Guy Vreuls
ec42154ef2 Use reference instead of pointer for pop_lsb() signature
This patch changes the pop_lsb() signature from Square pop_lsb(Bitboard*) to
Square pop_lsb(Bitboard&). This is more idomatic for C++ style signatures.

Passed a non-regression STC test:
LLR: 2.93 (-2.94,2.94) {-1.25,0.25}
Total: 21280 W: 1928 L: 1847 D: 17505
Ptnml(0-2): 71, 1427, 7558, 1518, 66
https://tests.stockfishchess.org/tests/view/6053a1e22433018de7a38e2f

We have verified that the generated binary is identical on gcc-10.

Closes https://github.com/official-stockfish/Stockfish/pull/3404

No functional change.
2021-03-19 20:28:57 +01:00
Dieter Dobbelaere
7ffae17f85 Add Stockfish namespace.
fixes #3350 and is a small cleanup that might make it easier to use SF
in separate projects, like a NNUE trainer or similar.

closes https://github.com/official-stockfish/Stockfish/pull/3370

No functional change.
2021-03-07 14:26:54 +01:00
MaximMolchanov
303713b560 Affine transform robust implementation
Size of the weights in the last layer is less than 512 bits. It leads to wrong data access for AVX512. There is no error because in current implementation it is guaranteed that there is an array of zeros after weights so zero multiplied by something is returned and sum is correct. It is a mistake that can lead to unexpected bugs in the future. Used AVX2 instructions for smaller input size.

No measurable slowdown on avx512.

closes https://github.com/official-stockfish/Stockfish/pull/3298

No functional change.
2021-01-11 18:54:18 +01:00
Joost VandeVondele
c4d67d77c9 Update copyright years
No functional change
2021-01-08 17:04:23 +01:00
MaximMolchanov
23c385ec36 Affine transform refactoring.
Reordered weights in such a way that accumulated sum fits to output.
Weights are grouped in blocks of four elements because four
int8 (weight type) corresponds to one int32 (output type).
No horizontal additions.
Grouped AVX512, AVX2 and SSSE3 implementations.
Repeated code was removed.

An earlier version passed STC:

LLR: 2.97 (-2.94,2.94) {-0.25,1.25}
Total: 15336 W: 1495 L: 1355 D: 12486
Ptnml(0-2): 44, 1054, 5350, 1158, 62
https://tests.stockfishchess.org/tests/view/5ff60e106019e097de3eefd5

Speedup depends on the architecture, up to 4% measured on a NNUE only bench.

closes https://github.com/official-stockfish/Stockfish/pull/3287

No functional change
2021-01-08 16:35:44 +01:00
mstembera
d862ba4069 AVX512, AVX2 and SSSE3 speedups
Improves throughput by summing 2 intermediate dot products using 16 bit addition before upconverting to 32 bit.

Potential saturation is detected and the code-path is avoided in this case.
The saturation can't happen with the current nets,
but nets can be constructed that trigger this check.

STC https://tests.stockfishchess.org/tests/view/5fd40a861ac1691201888479
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 25544 W: 2451 L: 2296 D: 20797
Ptnml(0-2): 92, 1761, 8925, 1888, 106

about 5% speedup

closes https://github.com/official-stockfish/Stockfish/pull/3261

No functional change
2020-12-14 07:46:15 +01:00
Fanael Linithien
c7f0a768cb Use arithmetic right shift for sign extension in MMX and SSE2 paths
This appears to be slightly faster than using a comparison against zero
to compute the high bits, on both old (like Pentium III) and new (like
Zen 2) hardware.

closes https://github.com/official-stockfish/Stockfish/pull/3254

No functional change.
2020-12-12 09:20:15 +01:00
mstembera
9b7983a452 Cleaned up MakeIndex()
The index order in kpp_board_index[][] is reversed to be more optimal for the access pattern

STC https://tests.stockfishchess.org/tests/view/5fbd74f967cbf42301d6b24f
LLR: 2.93 (-2.94,2.94) {-1.25,0.25}
Total: 27504 W: 2686 L: 2607 D: 22211
Ptnml(0-2): 84, 2001, 9526, 2034, 107

closes https://github.com/official-stockfish/Stockfish/pull/3233

No functional change
2020-11-29 16:36:49 +01:00
MaximMolchanov
7615e3485e Calculate sum from first elements
in affine transform for AVX512/AVX2/SSSE3

The idea is to initialize sum with the first element instead of zero.
Reduce one add_epi32 and one set_zero SIMD instructions for each output dimension.

sum = 0; for i = 1 to n sum += a[i] ->
sum = a[1]; for i = 2 to n sum += a[i]

STC:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 69048 W: 7024 L: 6799 D: 55225
Ptnml(0-2): 260, 5175, 23458, 5342, 289
https://tests.stockfishchess.org/tests/view/5faf2cf467cbf42301d6aa06

closes https://github.com/official-stockfish/Stockfish/pull/3227

No functional change.
2020-11-25 21:10:13 +01:00
Stéphane Nicolet
027626db1e Small cleanups 13
No functional change
2020-11-23 22:20:32 +01:00
Tomasz Sobczyk
ba35c88ab8 AVX-512 for smaller affine and feature transforms.
For the feature transformer the code is analogical to AVX2 since there was room for easy adaptation of wider simd registers.

For the smaller affine transforms that have 32 byte stride we keep 2 columns in one zmm register. We also unroll more aggressively so that in the end we have to do 16 parallel horizontal additions on ymm slices each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers.

These changes provide about 1.5% speedup for AVX-512 builds.

Closes https://github.com/official-stockfish/Stockfish/pull/3218

No functional change.
2020-11-07 16:49:49 +01:00
Tomasz Sobczyk
3f6451eff7 Manually align arrays on the stack
as a workaround to issues with overaligned alignas() on stack variables in gcc < 9.3 on windows.

closes https://github.com/official-stockfish/Stockfish/pull/3217

fixes #3216

No functional change
2020-11-04 19:52:42 +01:00
Tomasz Sobczyk
75e06a1c89 Optimize affine transform for SSSE3 and higher targets.
A non-functional speedup. Unroll the loops going over
the output dimensions in the affine transform layers by
a factor of 4 and perform 4 horizontal additions at a time.
Instead of doing naive horizontal additions on each vector
separately use hadd and shuffling between vectors to reduce
the number of instructions by using all lanes for all stages
of the horizontal adds.

passed STC of the initial version:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 17808 W: 1914 L: 1756 D: 14138
Ptnml(0-2): 76, 1330, 5948, 1460, 90
https://tests.stockfishchess.org/tests/view/5f9d516f6a2c112b60691da3

passed STC of the final version after cleanup:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 16296 W: 1750 L: 1595 D: 12951
Ptnml(0-2): 72, 1192, 5479, 1319, 86
https://tests.stockfishchess.org/tests/view/5f9df5776a2c112b60691de3

closes https://github.com/official-stockfish/Stockfish/pull/3203

No functional change
2020-11-02 19:41:17 +01:00
syzygy1
2046d5da30 More incremental accumulator updates
This patch was inspired by c065abd which updates the accumulator,
if possible, based on the accumulator of two plies back if
the accumulator of the preceding ply is not available.

With this patch we look back even further in the position history
in an attempt to reduce the number of complete recomputations.
When we find a usable accumulator for the position N plies back,
we also update the accumulator of the position N-1 plies back
because that accumulator is most likely to be helpful later
when evaluating positions in sibling branches.
By not updating all intermediate accumulators immediately,
we avoid doing too much work that is not certain to be useful.
Overall, roughly 2-3% speedup.

This patch makes the code more specific to the net architecture,
changing input features of the net will require additional changes
to the incremental update code as discussed in the PR #3193 and #3191.

Passed STC:
https://tests.stockfishchess.org/tests/view/5f9056712c92c7fe3a8c60d0
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 10040 W: 1116 L: 968 D: 7956
Ptnml(0-2): 42, 722, 3365, 828, 63

closes https://github.com/official-stockfish/Stockfish/pull/3193

No functional change.
2020-10-22 20:50:16 +02:00
noobpwnftw
c065abdcaf Use incremental updates more often
Use incremental updates for accumulators for up to 2 plies.
Do not copy accumulator. About 2% speedup.

Passed STC:
LLR: 2.95 (-2.94,2.94) {-0.25,1.25}
Total: 21752 W: 2583 L: 2403 D: 16766
Ptnml(0-2): 128, 1761, 6923, 1931, 133
https://tests.stockfishchess.org/tests/view/5f7150cf3b22d6afa5069412

closes https://github.com/official-stockfish/Stockfish/pull/3157

No functional change
2020-09-28 16:54:35 +02:00
Stéphane Nicolet
9a64e737cf Small cleanups 12
- Clean signature of functions in namespace NNUE
- Add comment for countermove based pruning
- Remove bestMoveCount variable
- Add const qualifier to kpp_board_index array
- Fix spaces in get_best_thread()
- Fix indention in capture LMR code in search.cpp
- Rename TtmemDeleter to LargePageDeleter

Closes https://github.com/official-stockfish/Stockfish/pull/3063

No functional change
2020-09-21 10:41:10 +02:00
Sami Kiminki
485d517c68 Add large page support for NNUE weights and simplify TT mem management
Use TT memory functions to allocate memory for the NNUE weights. This
should provide a small speed-up on systems where large pages are not
automatically used, including Windows and some Linux distributions.

Further, since we now have a wrapper for std::aligned_alloc(), we can
simplify the TT memory management a bit:

- We no longer need to store separate pointers to the hash table and
  its underlying memory allocation.
- We also get to merge the Linux-specific and default implementations
  of aligned_ttmem_alloc().

Finally, we'll enable the VirtualAlloc code path with large page
support also for Win32.

STC: https://tests.stockfishchess.org/tests/view/5f66595823a84a47b9036fba
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 14896 W: 1854 L: 1686 D: 11356
Ptnml(0-2): 65, 1224, 4742, 1312, 105

closes https://github.com/official-stockfish/Stockfish/pull/3081

No functional change.
2020-09-21 08:43:48 +02:00
syzygy1
8b8a510fd6 Use tiling to speed up accumulator refreshes and updates
Perform the update and refresh operations tile by tile in a local
array of vectors. By selecting the array size carefully, we
achieve that the compiler keeps the whole array in vector registers.

Idea and original implementation by @sf-x.

STC: https://tests.stockfishchess.org/tests/view/5f623eec912c15f19854b855
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 4872 W: 623 L: 477 D: 3772
Ptnml(0-2): 14, 350, 1585, 450, 37

LTC: https://tests.stockfishchess.org/tests/view/5f62434e912c15f19854b860
LLR: 2.94 (-2.94,2.94) {0.25,1.25}
Total: 25808 W: 1565 L: 1401 D: 22842
Ptnml(0-2): 23, 1186, 10332, 1330, 33

closes https://github.com/official-stockfish/Stockfish/pull/3130

No functional change
2020-09-17 17:24:52 +02:00
syzygy1
fc27d158c0 Bug fix in do_null_move() and NNUE simplification.
This fixes #3108 and removes some NNUE code that is currently not used.

At the moment, do_null_move() copies the accumulator from the previous
state into the new state, which is correct. It then clears the "computed_score"
flag because the side to move has changed, and with the other side to move
NNUE will return a completely different evaluation (normally with changed
sign but also with different NNUE-internal tempo bonus).

The problem is that do_null_move() clears the wrong flag. It clears the
computed_score flag of the old state, not of the new state. It turns out
that this almost never affects the search. For example, fixing it does not
change the current bench (but it does change the previous bench). This is
because the search code usually avoids calling evaluate() after a null move.

This PR corrects do_null_move() by removing the computed_score flag altogether.
The flag is not needed because nnue_evaluate() is never called twice on a position.

This PR also removes some unnecessary {}s and inserts a few blank lines
in the modified NNUE files in line with SF coding style.

Resulf ot STC non-regression test:
LLR: 2.95 (-2.94,2.94) {-1.25,0.25}
Total: 26328 W: 3118 L: 3012 D: 20198
Ptnml(0-2): 126, 2208, 8397, 2300, 133
https://tests.stockfishchess.org/tests/view/5f553ccc2d02727c56b36db1

closes https://github.com/official-stockfish/Stockfish/pull/3109

bench: 4109324
2020-09-08 22:53:17 +02:00
Stéphane Nicolet
406979ea12 Embed default net, and simplify using non-default nets
covers the most important cases from the user perspective:

It embeds the default net in the binary, so a download of that binary will result
in a working engine with the default net. The engine will be functional in the default mode
without any additional user action.

It allows non-default nets to be used, which will be looked for in up to
three directories (working directory, location of the binary, and optionally a specific default directory).
This mechanism is also kept for those developers that use MSVC,
the one compiler that doesn't have an easy mechanism for embedding data.

It is possible to disable embedding, and instead specify a specific directory, e.g. linux distros might want to use
CXXFLAGS="-DNNUE_EMBEDDING_OFF -DDEFAULT_NNUE_DIRECTORY=/usr/share/games/stockfish/" make -j ARCH=x86-64 profile-build

passed STC non-regression:
https://tests.stockfishchess.org/tests/view/5f4a581c150f0aef5f8ae03a
LLR: 2.95 (-2.94,2.94) {-1.25,-0.25}
Total: 66928 W: 7202 L: 7147 D: 52579
Ptnml(0-2): 291, 5309, 22211, 5360, 293

closes https://github.com/official-stockfish/Stockfish/pull/3070

fixes https://github.com/official-stockfish/Stockfish/issues/3030

No functional change.
2020-08-29 21:56:00 +02:00
syzygy1
9b4967071e Remove EvalList
This patch removes the EvalList structure from the Position object and generally simplifies the interface between do_move() and the NNUE code.

The NNUE evaluation function first calculates the "accumulator". The accumulator consists of two halves: one for white's perspective, one for black's perspective.

If the "friendly king" has moved or the accumulator for the parent position is not available, the accumulator for this half has to be calculated from scratch. To do this, the NNUE node needs to know the positions and types of all non-king pieces and the position of the friendly king. This information can easily be obtained from the Position object.

If the "friendly king" has not moved, its half of the accumulator can be calculated by incrementally updating the accumulator for the previous position. For this, the NNUE code needs to know which pieces have been added to which squares and which pieces have been removed from which squares. In principle this information can be derived from the Position object and StateInfo struct (in the same way as undo_move() does this). However, it is probably a bit faster to prepare this information in do_move(), so I have kept the DirtyPiece struct. Since the DirtyPiece struct now stores the squares rather than "PieceSquare" indices, there are now at most three "dirty pieces" (previously two). A promotion move that captures a piece removes the capturing pawn and the captured piece from the board (to SQ_NONE) and moves the promoted piece to the promotion square (from SQ_NONE).

An STC test has confirmed a small speedup:

https://tests.stockfishchess.org/tests/view/5f43f06b5089a564a10d850a
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 87704 W: 9763 L: 9500 D: 68441
Ptnml(0-2): 426, 6950, 28845, 7197, 434

closes https://github.com/official-stockfish/Stockfish/pull/3068

No functional change
2020-08-26 07:11:26 +02:00
mstembera
701b2427bd Support VNNI on 256bit vectors
due to downclocking on current chips (tested up to cascade lake)
supporting avx512 and vnni512, it is better to use avx2 or vnni256
in multithreaded (in particular hyperthreaded) engine use.
In single threaded use, the picture is different.

gcc compilation for vnni256 requires a toolchain for gcc >= 9.

closes https://github.com/official-stockfish/Stockfish/pull/3038

No functional change
2020-08-24 12:03:04 +02:00
syzygy1
cc9d503dde Skip the alignment bug workaround for Clang
Clang-10.0.0 poses as gcc-4.2:

$ clang++ -E -dM - </dev/null | grep GNUC

This means that Clang is using the workaround for the alignment bug of gcc-8
even though it does not have the bug (as far as I know).

This patch should speed up AVX2 and AVX512 compiles on Windows (when using Clang),
because it disables (for Clang) the gcc workaround we had introduced in this commit:
875183b310

closes https://github.com/official-stockfish/Stockfish/pull/3050

No functional change.
2020-08-23 23:09:31 +02:00
Stéphane Nicolet
81d716f5cc Reformat code in little-endian patch
Reformat code and rename the function to "read_little_endian()" in the recent
commit by Ronald de Man for support of big endian systems.

closes https://github.com/official-stockfish/Stockfish/pull/3016

No functional change
-----

Recommended net: https://tests.stockfishchess.org/api/nn/nn-82215d0fd0df.nnue
2020-08-17 12:15:57 +02:00
syzygy1
72dc7a5c54 Assume network file is in little-endian byte order
This patch fixes the byte order when reading 16- and 32-bit values from the network file on a big-endian machine.

Bytes are ordered in read_le() using unsigned arithmetic, which doesn't need tricks to determine the endianness of the machine. Unfortunately the compiler doesn't seem to be able to optimise the ordering operation, but reading in the weights is not a time-critical operation and the extra time it takes should not be noticeable.

Big endian systems are still untested with NNUE.

fixes #3007

closes https://github.com/official-stockfish/Stockfish/pull/3009

No functional change.
2020-08-16 21:10:26 +02:00
mstembera
6eb186c97e Try to match relative magnitude of NNUE eval to classical
The idea is that since we are mixing NNUE and classical evals matching their magnitudes closer allows for better comparisons.

STC https://tests.stockfishchess.org/tests/view/5f35a65411a9b1a1dbf18e2b
LLR: 2.94 (-2.94,2.94) {-0.50,1.50}
Total: 9840 W: 1150 L: 1027 D: 7663
Ptnml(0-2): 49, 772, 3175, 855, 69

LTC https://tests.stockfishchess.org/tests/view/5f35bcbe11a9b1a1dbf18e47
LLR: 2.93 (-2.94,2.94) {0.25,1.75}
Total: 44424 W: 2492 L: 2294 D: 39638
Ptnml(0-2): 42, 2015, 17915, 2183, 57

also corrects the location to clamp the evaluation (non-function on bench).

closes https://github.com/official-stockfish/Stockfish/pull/3003

bench: 3905447
2020-08-14 16:39:52 +02:00
mstembera
dd63b98fb0 Add support for VNNI
Adds support for Vector Neural Network Instructions (avx512), as available on Intel Cascade Lake

The _mm512_dpbusd_epi32() intrinsic (vpdpbusd instruction) is taylor made for NNUE.

on a cascade lake CPU (AWS C5.24x.large, gcc 10) NNUE eval is at roughly 78% nps of classical
(single core test)

bench 1024 1 24 default depth:
target 	classical 	NNUE 	ratio
vnni 	2207232 	1725987 	78.20
avx512 	2216789 	1671734 	75.41
avx2 	2194006 	1611263 	73.44
modern 	2185001 	1352469 	61.90

closes https://github.com/official-stockfish/Stockfish/pull/2987

No functional change
2020-08-13 07:39:52 +02:00
Joost VandeVondele
992f549ae7 Restrict avx2 hack to windows target
this workaround is possibly rather a windows & gcc specific problem. See e.g.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412#c25

on Linux with gcc 8 this patch brings roughly a 8% speedup.
However, probably needs some testing in the wild.

includes a workaround for an old msys make (3.81) installation (fixes #2984)

No functional change
2020-08-11 23:35:02 +02:00
mstembera
f46c73040c Fix AVX512 build with older compilers
avoids an intrinsic that is missing in gcc < 10.

For this target, might trigger another gcc bug on windows that
requires up-to-date gcc 8, 9, or 10, or usage of clang.

Fixes https://github.com/official-stockfish/Stockfish/issues/2975

closes https://github.com/official-stockfish/Stockfish/pull/2976

No functional change
2020-08-11 08:17:03 +02:00
Fanael Linithien
21df37d7fd Provide vectorized NNUE code for SSE2 and MMX targets
This patch allows old x86 CPUs, from AMD K8 (which the x86-64 baseline
targets) all the way down to the Pentium MMX, to benefit from NNUE with
comparable performance hit versus hand-written eval as on more modern
processors.

NPS of the bench with NNUE enabled on a Pentium III 1.13 GHz (using the
MMX code):
  master: 38951
  this patch: 80586

NPS of the bench with NNUE enabled using baseline x86-64 arch, which is
how linux distros are likely to package stockfish, on a modern CPU
(using the SSE2 code):
  master: 882584
  this patch: 1203945

closes https://github.com/official-stockfish/Stockfish/pull/2956

No functional change.
2020-08-10 19:17:57 +02:00
mstembera
f948cd008d Cleanup and optimize SSE/AVX code
AVX512 +4% faster
AVX2 +1% faster
SSSE3 +5% faster

passed non-regression STC:
STC https://tests.stockfishchess.org/tests/view/5f31249f90816720665374f6
LLR: 2.96 (-2.94,2.94) {-1.50,0.50}
Total: 17576 W: 2344 L: 2245 D: 12987
Ptnml(0-2): 127, 1570, 5292, 1675, 124

closes https://github.com/official-stockfish/Stockfish/pull/2962

No functional change
2020-08-10 14:38:17 +02:00
mstembera
875183b310 Workaround using unaligned loads for gcc < 9
despite usage of alignas, the generated (avx2/avx512) code with older compilers needs to use
unaligned loads with older gcc (e.g. confirmed crash with gcc 7.3/mingw on abrok).

Better performance thus requires gcc >= 9 on hardware supporting avx2/avx512

closes https://github.com/official-stockfish/Stockfish/pull/2969

No functional change
2020-08-10 11:12:35 +02:00