This changes 2 parts with regards to static exchange evaluation.
Currently, we do not allow pinned pieces to recapture if *all* opponent
pinners are still in their starting squares. This changes that to having
a less strict requirement, checking if *any* pinners are still in their
starting square. This makes our SEE give more respect to the pinning
side with regards to exchanges, which makes sense because it helps our
search explore more tactical options.
Furthermore, we change the logic for saving pinners into our state
variable when computing slider_blockers. We will include double pinners,
where two sliders may be looking at the same blocker, a similar concept
to our mobility calculation for sliders in our evaluation section.
Interestingly, I think SEE is the only place where the pinners bitboard
is actually used, so as far as I know there are no other side effects
to this change.
An example and some insights:
White Bf2, Kg1
Black Qe3, Bc5
The move Qg3 will be given the correct value of 0. (Previously < 0)
The move Qd4 will be incorrectly given a value of 0. (Previously < 0)
It seems the tradeoff in search is worth it. Qd4 will likely be pruned
soon by something like probcut anyway, while Qg3 could help us spot
tactics at an earlier depth.
STC:
LLR: 2.96 (-2.94,2.94) [0.50,4.50]
Total: 62162 W: 13879 L: 13408 D: 34875
http://tests.stockfishchess.org/tests/view/5c4ba1a70ebc593af5d49c55
LTC: (Thanks to @alayant)
LLR: 3.40 (-2.94,2.94) [0.00,3.50]
Total: 140285 W: 23416 L: 22825 D: 94044
http://tests.stockfishchess.org/tests/view/5c4bcfba0ebc593af5d49ea8
Bench: 3937213
This patch saves (4-1) * 64 * 64 = 12KiB of cache.
STC
LLR: 2.95 (-2.94,2.94) [0.00,4.00]
Total: 176120 W: 38944 L: 38087 D: 99089
http://tests.stockfishchess.org/tests/view/5c4c9f840ebc593af5d4a7ce
LTC
As a pure speed up, I've been informed it should not require LTC.
No functional change
Using point to point instead of a collective improves performance, and might be more flexible for future improvements.
Also corrects the condition for the number elements required to fill the send buffer.
The actual Elo gains depends a bit on the setup used for testing.
8mpi x 32t yields 141 - 102 - 957 ~ 11 Elo
8mpi x 1t yields 70 +- 9 Elo.
stopOnPonderhit is used to stop search quickly on a ponderhit. It is set by mainThread as part of its time management. However, master employs it as a signal between mainThread and the UCI thread. This is not necessary, it is sufficient for the UCI thread to signal that pondering finished, and mainThread should do its usual time-keeping job, and in this case stop immediately.
This patch implements this, removing stopOnPonderHit as an atomic variable from the ThreadPool,
and moving it as a normal variable to mainThread, reducing its scope. In MainThread::check_time() the search is stopped immediately if ponder switches to false, and the variable stopOnPonderHit is set.
Furthermore, ponder has been moved to mainThread, as the variable is only used to exchange signals between the UCI thread and mainThread.
The version has been tested locally (as fishtest doesn't support ponder):
Score of ponderSimp vs master: 2616 - 2528 - 8630 [0.503] 13774
Elo difference: 2.22 +/- 3.54
which indicates no regression.
No functional change.
Small changes in initiative(). For Pawn PSQT, endgame values for d6-e6 and d7-e7 are now symmetric. The MG value of d2 is now smaller than e2 (d2=13, e2=21 now compared to d2=19, e2=16 before). The MG values of h5-h6-h7 also increased so this might encourage stockfish for more h-pawn pushes.
STC
LLR: -2.96 (-2.94,2.94) [0.00,4.00]
Total: 81141 W: 17933 L: 17777 D: 45431
http://tests.stockfishchess.org/tests/view/5c4017350ebc5902bb5cf237
LTC
LLR: 2.96 (-2.94,2.94) [0.00,4.00]
Total: 83078 W: 13883 L: 13466 D: 55729
http://tests.stockfishchess.org/tests/view/5c40763f0ebc5902bb5cff09
Bench: 3266398
This is a non-functional simplification that removes the AdjacentFiles array.
This array is simple enough to calculate that the pre-calculated array provides
no benefit. Reduces the memory footprint.
STC
LLR: 2.96 (-2.94,2.94) [-3.00,1.00]
Total: 74839 W: 16390 L: 16373 D: 42076
http://tests.stockfishchess.org/tests/view/5c3d75920ebc596a450cfb67
No functionnal change
If we define dcCandidates with & pawnsNotOn7,
we don't have to & it both times.
This seems more clear to me as well.
Tested for no regression.
STC
LLR: 2.96 (-2.94,2.94) [-3.00,1.00]
Total: 44042 W: 9663 L: 9585 D: 24794
http://tests.stockfishchess.org/tests/view/5c21d9120ebc5902ba12e84d
No functional change.
The new form is likely to trigger a bit more at LTC. Given that LTC
appears to be an improvement, I think that is fine.
The change is not very invasive: it does the same as before, use
potentially less time for moves that are very stable. Most of the
time, the full bonus was given if the bonus was given, so the gradual
part {3, 4, 5} didn't matter much. Whereas previously 'stable' was
expressed as the last 80% of iterations are the same, now I use a
fixed depth (10 iterations). For TCEC style TC, it will presumably
imply some more moves that are played quicker (and thus more time
on the clock when it potentially matters). Note that 10 iterations
of stability means we've been proposing that move for 99.9% of search
time.
passed STC
http://tests.stockfishchess.org/tests/view/5c30d2290ebc596a450c055b
LLR: 2.95 (-2.94,2.94) [-3.00,1.00]
Total: 70921 W: 15403 L: 15378 D: 40140
passed LTC
http://tests.stockfishchess.org/tests/view/5c31ae240ebc596a450c1881
LLR: 2.95 (-2.94,2.94) [-3.00,1.00]
Total: 17422 W: 2968 L: 2842 D: 11612
No functional change.
- add cluster info line
- provides basic info on positions received/stored in a cluster run,
useful to judge performance.
- document most cluster functionality in the readme.md
No functional change
Introducing new concept, saving principal lines into the transposition table
to generate a "critical search tree" which we can reuse later for intelligent
pruning/extension decisions.
For instance in this patch we just reduce reduction for these lines. But a lot
of other ideas are possible.
To go further : tune some parameters, how to add or remove lines from the
critical search tree, how to use these lines in search choices, etc.
STC :
LLR: 2.94 (-2.94,2.94) [0.50,4.50]
Total: 59761 W: 13321 L: 12863 D: 33577 +2.23 ELO
http://tests.stockfishchess.org/tests/view/5c34da5d0ebc596a450c53d3
LTC :
LLR: 2.96 (-2.94,2.94) [0.00,3.50]
Total: 26826 W: 4439 L: 4191 D: 18196 +2.9 ELO
http://tests.stockfishchess.org/tests/view/5c35ceb00ebc596a450c65b2
Special thanks to Miguel Lahoz for his help in transposition table in/out.
Bench: 3399866
This was inspired after reading about
[Multi-Cut](https://www.chessprogramming.org/Multi-Cut).
We now do non-singular cut node pruning. The idea is to prune when we
have a "backup plan" in case our expected fail high node does not fail
high on the ttMove.
For singular extensions, we do a search on all other moves but the
ttMove. If this fails high on our original beta, this means that both
the ttMove, as well as at least one other move was proven to fail high
on a lower depth search. We then assume that one of these moves will
work on a higher depth and prune.
STC:
LLR: 2.96 (-2.94,2.94) [0.50,4.50]
Total: 72952 W: 16104 L: 15583 D: 41265
http://tests.stockfishchess.org/tests/view/5c3119640ebc596a450c0be5
LTC:
LLR: 2.95 (-2.94,2.94) [0.00,3.50]
Total: 27103 W: 4564 L: 4314 D: 18225
http://tests.stockfishchess.org/tests/view/5c3184c00ebc596a450c1662
Bench: 3145487
Small tweak of parameters, yielding some Elo.
The cluster branch can now be considered to be in good shape. In local testing, it runs stable for >30k games. Performance benefits from an MPI implementation that is able to make asynchronous progress. The code should be run with 1 MPI rank per node, and threaded on the node.
Performance against master has now been measured. Master has been given 1 node with 32 cores/threads in standard SMP, the cluster branch has been given N=2..20 of those nodes, running the corresponding number of MPI processes, each with 32 threads. Time control has been 10s+0.1s, Hash 8MB/core, the book 8moves_v3.pgn, the number of games 400.
```
Score of cluster-2mpix32t vs master-32t: 96 - 27 - 277 [0.586] 400
Elo difference: 60.54 +/- 18.49
Score of cluster-3mpix32t vs master-32t: 101 - 18 - 281 [0.604] 400
Elo difference: 73.16 +/- 17.94
Score of cluster-4mpix32t vs master-32t: 126 - 18 - 256 [0.635] 400
Elo difference: 96.19 +/- 19.68
Score of cluster-5mpix32t vs master-32t: 110 - 5 - 285 [0.631] 400
Elo difference: 93.39 +/- 17.09
Score of cluster-6mpix32t vs master-32t: 117 - 9 - 274 [0.635] 400
Elo difference: 96.19 +/- 18.06
Score of cluster-7mpix32t vs master-32t: 142 - 10 - 248 [0.665] 400
Elo difference: 119.11 +/- 19.89
Score of cluster-8mpix32t vs master-32t: 125 - 14 - 261 [0.639] 400
Elo difference: 99.01 +/- 19.18
Score of cluster-9mpix32t vs master-32t: 137 - 7 - 256 [0.662] 400
Elo difference: 117.16 +/- 19.20
Score of cluster-10mpix32t vs master-32t: 145 - 8 - 247 [0.671] 400
Elo difference: 124.01 +/- 19.86
Score of cluster-16mpix32t vs master-32t: 153 - 6 - 241 [0.684] 400
Elo difference: 133.95 +/- 20.17
Score of cluster-20mpix32t vs master-32t: 134 - 8 - 258 [0.657] 400
Elo difference: 113.29 +/- 19.11
```
As the cluster parallelism is essentially lazyMPI, the nodes per second has been verified to scale perfectly to large node counts. Unfortunately, that is not necessarily indicative of playing strength. In the following 2min search from startPos, we reach about 4.8Gnps (128 nodes).
```
info depth 38 seldepth 51 multipv 1 score cp 53 nodes 576165794092 nps 4801341606 hashfull 1000 tbhits 0 time 120001 pv e2e4 c7c5 g1f3 d7d6 f1b5 c8d7 b5d7 d8d7 c2c4 b8c6 b1c3 g8f6 d2d4 d7g4 d4d5 c6d4 f3d4 g4d1 e1d1 c5d4 c3b5 a8c8 b2b3 a7a6 b5d4 f6e4 d1e2 g7g6 c1e3 f8g7 a1c1 e4c5 f2f3 f7f5 h1d1 e8g8 d4c2 c5d7 a2a4 a6a5 e3d4 f5f4 d4f2 f8f7 h2h3 d7c5
```
This addresses partially issue #1911 in that it documents in our
Readme the command that users can use to verifying the md5sum of
their downloaded tablebase files.
Additionally, a quick check of the file size (the size of each
tablebase file modulo 64 is 16 as pointed out by @syzygy1) has been
implemented at launch time in Stockfish.
Closes https://github.com/official-stockfish/Stockfish/pull/1927
and https://github.com/official-stockfish/Stockfish/issues/1911
No functional change.
Fixes one TODO, by moving the IO related to bestmove to the root, even if this move is found by a different rank.
This is needed to make sure IO from different ranks is ordered properly. If this is not done it is possible that e.g. a bestmove arrives before all info lines have been received, leading to output that confuses tools and humans alike (see e.g. https://github.com/cutechess/cutechess/issues/472)
Delay legality check of castling moves at search time,
just before making the move, as is the standard with all
the other move types.
This should avoid an useless and not trivial legality check
when the castling is then not tried later. For instance due
to a previous cut-off.
The patch is also a big simplification and allows to entirely
remove generate_castling()
Bench changes due to a different move sequence out of MovePicker.
STC:
LLR: 2.95 (-2.94,2.94) [-3.00,1.00]
Total: 45073 W: 9918 L: 9843 D: 25312
http://tests.stockfishchess.org/tests/view/5c2f176f0ebc596a450bdfb3
LTC:
LLR: 3.15 (-2.94,2.94) [-3.00,1.00]
Total: 10156 W: 1707 L: 1560 D: 6889
http://tests.stockfishchess.org/tests/view/5c2e7dfd0ebc596a450bcdf4
Verified with perft both in standard and Chess960 cases.
Closes https://github.com/official-stockfish/Stockfish/pull/1929
Bench: 3559104
This rewrites in part the message passing part, using in place gather, and collecting, rather than merging, the data of all threads.
neutral with a single thread per rank:
Score of new-2mpi-1t vs old-2mpi-1t: 789 - 787 - 2615 [0.500] 4191
Elo difference: 0.17 +/- 6.44
likely progress with multiple threads per rank:
Score of new-2mpi-36t vs old-2mpi-36t: 76 - 53 - 471 [0.519] 600
Elo difference: 13.32 +/- 12.85
A single popcount in evaluate.cpp replaces all openFiles stuff in pawns. It doesn't seem to affect performance at all.
STC
LLR: 2.96 (-2.94,2.94) [-3.00,1.00]
Total: 28103 W: 6134 L: 6025 D: 15944
http://tests.stockfishchess.org/tests/view/5b7d70a20ebc5902bdbb1999
No functional change.
This custom predicate filter creates an unnecessary abstraction layer, but doesn't make the code any more readable. The code is clear enough without it.
No functional change.
The extra condition is used as a shortcut to skip the following 3 assignments:
```C++
Bitboard b1 = shift<UpRight>(pawnsOn7) & enemies;
Bitboard b2 = shift<UpLeft >(pawnsOn7) & enemies;
Bitboard b3 = shift<Up >(pawnsOn7) & emptySquares;
```
In case of EVASION with no target on 8th rank (the common case), we end up performing the 3 statements for nothing because b1 = b2 = b3 = 0.
But this is just a small micro-optimization and the condition is quite confusing, so just remove it and prefer a readable code instead.
STC
LLR: 2.95 (-2.94,2.94) [-3.00,1.00]
Total: 78020 W: 16978 L: 16967 D: 44075
http://tests.stockfishchess.org/tests/view/5c27b4fe0ebc5902ba135bb0
No functional change.
This generalizes exchange of signals between the ranks using a non-blocking all-reduce. It is now used for the stop signal and the node count, but should be easily generalizable (TB hits, and ponder still missing). It avoids having long-lived outstanding non-blocking collectives (removes an early posted Ibarrier). A bit too short a test, but not worse than before:
Score of new-r4-1t vs old-r4-1t: 459 - 401 - 1505 [0.512] 2365
Elo difference: 8.52 +/- 8.43
use a counter to track available elements.
Some elo gain, on 4 ranks:
Score of old-r4-1t vs new-r4-1t: 422 - 508 - 1694 [0.484] 2624
Elo difference: -11.39 +/- 7.90
avoid saving to TT the part of the receive buffer that actually originates from the same rank.
Now, on 1 mpi rank, we have the same bench as the non-mpi code on 1 thread.
since the logic for saving moves in the sendbuffer and the associated rehashing is expensive, only do it for TT stores of sufficient depth.
quite some gain in local testing with 4 ranks against the previous version.
Elo difference: 288.84 +/- 21.98
This starts to make the branch useful, but for on-node runs, difference remains to the standard threading.
In the original code, the position key stored in the TT is used to probe&store TT entries after message passing. Since we only store part of the bits in the TT, this leads to incorrect rehashing. This is fixed in this patch storing also the full key in the send buffers, and using that for hashing after message arrival.
Short testing with 4 ranks (old vs new) shows this is effective:
Score of mpiold vs mpinew: 84 - 275 - 265 [0.347] 624
Elo difference: -109.87 +/- 20.88
This reverts my earlier change that only the root node gets to output best move after fixing problem with MPI_Allreduce by our custom operator(BestMoveOp). This function is not commutable and we must ensure that its output is consistent among all nodes.
Previous behavior was to wait on all nodes to finish their search on their own TM and aggregate to root node via a blocking MPI_Allreduce call. This seems to be problematic.
In this commit a proper non-blocking signalling barrier was implemented to use TM from root node to control the cluster search, and disable TM on all non-root nodes.
Also includes some cosmetic fix to the nodes/NPS display.