BadFish

mirror of https://github.com/sockspls/badfish synced 2025-05-03 10:09:35 +00:00

Author	SHA1	Message	Date
Joost VandeVondele	8c4338ae49	[Cluster] Param tweak. Small tweak of parameters, yielding some Elo. The cluster branch can now be considered to be in good shape. In local testing, it runs stable for >30k games. Performance benefits from an MPI implementation that is able to make asynchronous progress. The code should be run with 1 MPI rank per node, and threaded on the node. Performance against master has now been measured. Master has been given 1 node with 32 cores/threads in standard SMP, the cluster branch has been given N=2..20 of those nodes, running the corresponding number of MPI processes, each with 32 threads. Time control has been 10s+0.1s, Hash 8MB/core, the book 8moves_v3.pgn, the number of games 400.
```
Score of cluster-2mpix32t vs master-32t: 96 - 27 - 277 [0.586] 400
Elo difference: 60.54 +/- 18.49
Score of cluster-3mpix32t vs master-32t: 101 - 18 - 281 [0.604] 400
Elo difference: 73.16 +/- 17.94
Score of cluster-4mpix32t vs master-32t: 126 - 18 - 256 [0.635] 400
Elo difference: 96.19 +/- 19.68
Score of cluster-5mpix32t vs master-32t: 110 - 5 - 285 [0.631] 400
Elo difference: 93.39 +/- 17.09
Score of cluster-6mpix32t vs master-32t: 117 - 9 - 274 [0.635] 400
Elo difference: 96.19 +/- 18.06
Score of cluster-7mpix32t vs master-32t: 142 - 10 - 248 [0.665] 400
Elo difference: 119.11 +/- 19.89
Score of cluster-8mpix32t vs master-32t: 125 - 14 - 261 [0.639] 400
Elo difference: 99.01 +/- 19.18
Score of cluster-9mpix32t vs master-32t: 137 - 7 - 256 [0.662] 400
Elo difference: 117.16 +/- 19.20
Score of cluster-10mpix32t vs master-32t: 145 - 8 - 247 [0.671] 400
Elo difference: 124.01 +/- 19.86
Score of cluster-16mpix32t vs master-32t: 153 - 6 - 241 [0.684] 400
Elo difference: 133.95 +/- 20.17
Score of cluster-20mpix32t vs master-32t: 134 - 8 - 258 [0.657] 400
Elo difference: 113.29 +/- 19.11
```
As the cluster parallelism is essentially lazyMPI, the nodes per second has been verified to scale perfectly to large node counts. Unfortunately, that is not necessarily indicative of playing strength. In the following 2min search from startPos, we reach about 4.8Gnps (128 nodes).
```
info depth 38 seldepth 51 multipv 1 score cp 53 nodes 576165794092 nps 4801341606 hashfull 1000 tbhits 0 time 120001 pv e2e4 c7c5 g1f3 d7d6 f1b5 c8d7 b5d7 d8d7 c2c4 b8c6 b1c3 g8f6 d2d4 d7g4 d4d5 c6d4 f3d4 g4d1 e1d1 c5d4 c3b5 a8c8 b2b3 a7a6 b5d4 f6e4 d1e2 g7g6 c1e3 f8g7 a1c1 e4c5 f2f3 f7f5 h1d1 e8g8 d4c2 c5d7 a2a4 a6a5 e3d4 f5f4 d4f2 f8f7 h2h3 d7c5
```
	2019-01-06 15:38:31 +01:00
Joost VandeVondele	8a3f8e21ae	[Cluster] Move IO to the root. Fixes one TODO, by moving the IO related to bestmove to the root, even if this move is found by a different rank. This is needed to make sure IO from different ranks is ordered properly. If this is not done it is possible that e.g. a bestmove arrives before all info lines have been received, leading to output that confuses tools and humans alike (see e.g. https://github.com/cutechess/cutechess/issues/472)	2019-01-04 14:56:04 +01:00
Joost VandeVondele	ac43bef5c5	[Cluster] Improve message passing part. This partly rewrites the message passing, using an in-place gather and collecting, rather than merging, the data of all threads. Neutral with a single thread per rank: Score of new-2mpi-1t vs old-2mpi-1t: 789 - 787 - 2615 [0.500] 4191 Elo difference: 0.17 +/- 6.44 Likely progress with multiple threads per rank: Score of new-2mpi-36t vs old-2mpi-36t: 76 - 53 - 471 [0.519] 600 Elo difference: 13.32 +/- 12.85	2019-01-02 11:16:24 +01:00
Joost VandeVondele	7a32d26d5f	[cluster] keep track of TB hits cluster-wide.	2018-12-29 15:34:57 +01:00
Joost VandeVondele	87f0fa55a0	[cluster] keep track of node counts cluster-wide. This generalizes exchange of signals between the ranks using a non-blocking all-reduce. It is now used for the stop signal and the node count, but should be easily generalizable (TB hits, and ponder still missing). It avoids having long-lived outstanding non-blocking collectives (removes an early posted Ibarrier). A bit too short a test, but not worse than before: Score of new-r4-1t vs old-r4-1t: 459 - 401 - 1505 [0.512] 2365 Elo difference: 8.52 +/- 8.43	2018-12-29 15:34:57 +01:00
Joost VandeVondele	2f882309d5	fixup	2018-12-29 15:34:57 +01:00
Joost VandeVondele	86953b9392	[cluster] Fix non-mpi compile. Fix compile of the cluster branch in the non-mpi case. Add a TODO as a reminder for the new voting scheme. No functional changes.	2018-12-29 15:34:56 +01:00
Joost VandeVondele	ba1c639836	[cluster] fill sendbuffer better. Use a counter to track available elements. Some elo gain, on 4 ranks: Score of old-r4-1t vs new-r4-1t: 422 - 508 - 1694 [0.484] 2624 Elo difference: -11.39 +/- 7.90	2018-12-29 15:34:56 +01:00
Joost VandeVondele	e526c5aa52	[cluster] Make bench compatible. Fix one TODO. Takes care of output from bench. Sum nodes over ranks.	2018-12-29 15:34:56 +01:00
Joost VandeVondele	54a0a228f6	[cluster] Some formatting cleanup. Standardize whitespace a bit. Also adds two TODOs for follow-up work. No functional change.	2018-12-29 15:34:56 +01:00
noobpwnftw	66b2c6b9f1	Implement best move voting system for cluster. This implements the cluster version of `d96c1c32a2`	2018-12-29 15:34:56 +01:00
Joost VandeVondele	2559c20c6e	[cluster] Fix oversight in TT key reuse. In the original code, the position key stored in the TT is used to probe and store TT entries after message passing. Since we only store part of the bits in the TT, this leads to incorrect rehashing. This patch fixes it by also storing the full key in the send buffers and using that for hashing after message arrival. Short testing with 4 ranks (old vs new) shows this is effective: Score of mpiold vs mpinew: 84 - 275 - 265 [0.347] 624 Elo difference: -109.87 +/- 20.88	2018-12-29 15:34:55 +01:00
noobpwnftw	8a95d269eb	Implement proper stop signalling from root node. Previous behavior was to wait on all nodes to finish their search on their own TM and aggregate to root node via a blocking MPI_Allreduce call. This seems to be problematic. In this commit a proper non-blocking signalling barrier was implemented to use TM from root node to control the cluster search, and disable TM on all non-root nodes. Also includes a cosmetic fix to the nodes/NPS display.	2018-12-29 15:34:55 +01:00
Omri Mor	29c166a072	MPI/Cluster implementation for Stockfish. Based on Peter Österlund's "Lazy Cluster" algorithm, but with some simplifications. To compile, point COMPCXX to the MPI C++ compiler wrapper (mpicxx).	2018-12-29 15:34:55 +01:00

14 commits
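
The score fractions and Elo differences quoted in the match results of the commit messages can be checked from the win/loss/draw counts. The following is a minimal sketch, assuming the standard logistic Elo model; the function name `elo_difference` is illustrative and not part of the repository, and the `+/-` error margins are not reproduced here.

```python
import math


def elo_difference(wins, losses, draws):
    """Return (score fraction, Elo difference) for a match result.

    Assumes the logistic Elo model: a draw counts as half a point and
    elo = -400 * log10(1 / score - 1). Requires 0 < score < 1.
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    return score, elo


# First result from commit 8c4338ae49: 96 wins, 27 losses, 277 draws.
score, elo = elo_difference(96, 27, 277)
print(f"score {score:.3f}, Elo difference {elo:.2f}")
```

A result listed as `old vs new` with a negative Elo difference, as in commits ba1c639836 and 2559c20c6e, therefore means the new code scored better.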