mirror of https://github.com/sockspls/badfish synced 2025-05-03 10:09:35 +00:00
Commit graph

14 commits

Author SHA1 Message Date
Joost VandeVondele
8c4338ae49 [Cluster] Param tweak.
Small tweak of parameters, yielding some Elo.

The cluster branch can now be considered to be in good shape. In local testing, it has run stably for more than 30k games. Performance benefits from an MPI implementation that can make asynchronous progress. The code should be run with one MPI rank per node, with threads used within each node.

Performance against master has now been measured. Master was given 1 node with 32 cores/threads in standard SMP; the cluster branch was given N=2..20 of those nodes, running the corresponding number of MPI processes, each with 32 threads. Time control was 10s+0.1s, with Hash 8MB/core, the book 8moves_v3.pgn, and 400 games per match.

```
Score of cluster-2mpix32t vs master-32t: 96 - 27 - 277  [0.586] 400
Elo difference: 60.54 +/- 18.49

Score of cluster-3mpix32t vs master-32t: 101 - 18 - 281  [0.604] 400
Elo difference: 73.16 +/- 17.94

Score of cluster-4mpix32t vs master-32t: 126 - 18 - 256  [0.635] 400
Elo difference: 96.19 +/- 19.68

Score of cluster-5mpix32t vs master-32t: 110 - 5 - 285  [0.631] 400
Elo difference: 93.39 +/- 17.09

Score of cluster-6mpix32t vs master-32t: 117 - 9 - 274  [0.635] 400
Elo difference: 96.19 +/- 18.06

Score of cluster-7mpix32t vs master-32t: 142 - 10 - 248  [0.665] 400
Elo difference: 119.11 +/- 19.89

Score of cluster-8mpix32t vs master-32t: 125 - 14 - 261  [0.639] 400
Elo difference: 99.01 +/- 19.18

Score of cluster-9mpix32t vs master-32t: 137 - 7 - 256  [0.662] 400
Elo difference: 117.16 +/- 19.20

Score of cluster-10mpix32t vs master-32t: 145 - 8 - 247  [0.671] 400
Elo difference: 124.01 +/- 19.86

Score of cluster-16mpix32t vs master-32t: 153 - 6 - 241  [0.684] 400
Elo difference: 133.95 +/- 20.17

Score of cluster-20mpix32t vs master-32t: 134 - 8 - 258  [0.657] 400
Elo difference: 113.29 +/- 19.11
```
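
For readers who want to check the numbers, the Elo differences above are consistent with the usual logistic conversion from match score, Elo = -400 * log10(1/score - 1), where score = (wins + draws/2) / games. The following is a minimal sketch of that conversion; the function name `elo_difference` is illustrative only, and the error margins are not reproduced here.

```cpp
#include <cmath>

// Logistic Elo difference implied by a match result.
// score = (wins + draws / 2) / games
// Elo   = -400 * log10(1 / score - 1)
// Assumes 0 < score < 1; a 100% or 0% score has no finite Elo.
double elo_difference(int wins, int losses, int draws) {
    const double games = static_cast<double>(wins + losses + draws);
    const double score = (wins + 0.5 * draws) / games;
    return -400.0 * std::log10(1.0 / score - 1.0);
}
```

For example, `elo_difference(96, 27, 277)` gives roughly 60.54, matching the cluster-2mpix32t line.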

As the cluster parallelism is essentially lazyMPI, the nodes per second (nps) have been verified to scale perfectly to large node counts. Unfortunately, that is not necessarily indicative of playing strength. In the following 2-minute search from startPos, we reach about 4.8 Gnps on 128 nodes.

```
info depth 38 seldepth 51 multipv 1 score cp 53 nodes 576165794092 nps 4801341606 hashfull 1000 tbhits 0 time 120001 pv e2e4 c7c5 g1f3 d7d6 f1b5 c8d7 b5d7 d8d7 c2c4 b8c6 b1c3 g8f6 d2d4 d7g4 d4d5 c6d4 f3d4 g4d1 e1d1 c5d4 c3b5 a8c8 b2b3 a7a6 b5d4 f6e4 d1e2 g7g6 c1e3 f8g7 a1c1 e4c5 f2f3 f7f5 h1d1 e8g8 d4c2 c5d7 a2a4 a6a5 e3d4 f5f4 d4f2 f8f7 h2h3 d7c5
```
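
The nps figure in the info line is consistent with the cluster-wide node count divided by the elapsed time in milliseconds. A small sketch of that arithmetic, with a hypothetical helper name:

```cpp
#include <cstdint>

// Nodes per second from a total node count and elapsed time in milliseconds.
// Integer arithmetic; assumes elapsed_ms > 0 and that nodes * 1000 fits in 64 bits.
std::uint64_t nodes_per_second(std::uint64_t nodes, std::uint64_t elapsed_ms) {
    return nodes * 1000 / elapsed_ms;
}
```

With nodes 576165794092 and time 120001 this yields 4801341606, which over 128 nodes is about 37.5 Mnps per node.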
2019-01-06 15:38:31 +01:00
Joost VandeVondele
8a3f8e21ae [Cluster] Move IO to the root.
Fixes one TODO, by moving the IO related to bestmove to the root, even if this move is found by a different rank.

This is needed to make sure IO from different ranks is ordered properly. If this is not done, it is possible that e.g. a bestmove arrives before all info lines have been received, leading to output that confuses tools and humans alike (see e.g. https://github.com/cutechess/cutechess/issues/472).
2019-01-04 14:56:04 +01:00
Joost VandeVondele
ac43bef5c5 [Cluster] Improve message passing part.
This partly rewrites the message passing code, using an in-place gather and collecting, rather than merging, the data of all threads.

neutral with a single thread per rank:
Score of new-2mpi-1t vs old-2mpi-1t: 789 - 787 - 2615  [0.500] 4191
Elo difference: 0.17 +/- 6.44

likely progress with multiple threads per rank:
Score of new-2mpi-36t vs old-2mpi-36t: 76 - 53 - 471  [0.519] 600
Elo difference: 13.32 +/- 12.85
2019-01-02 11:16:24 +01:00
Joost VandeVondele
7a32d26d5f [cluster] keep track of TB hits cluster-wide.
2018-12-29 15:34:57 +01:00
Joost VandeVondele
87f0fa55a0 [cluster] keep track of node counts cluster-wide.
This generalizes exchange of signals between the ranks using a non-blocking all-reduce. It is now used for the stop signal and the node count, but should be easily generalizable (TB hits, and ponder still missing). It avoids having long-lived outstanding non-blocking collectives (removes an early posted Ibarrier). A bit too short a test, but not worse than before:

Score of new-r4-1t vs old-r4-1t: 459 - 401 - 1505  [0.512] 2365
Elo difference: 8.52 +/- 8.43
2018-12-29 15:34:57 +01:00
Joost VandeVondele
2f882309d5 fixup
2018-12-29 15:34:57 +01:00
Joost VandeVondele
86953b9392 [cluster] Fix non-mpi compile
fix compile of the cluster branch in the non-mpi case.

Add a TODO as a reminder for the new voting scheme.

No functional changes
2018-12-29 15:34:56 +01:00
Joost VandeVondele
ba1c639836 [cluster] fill sendbuffer better
use a counter to track available elements.

Some elo gain, on 4 ranks:

Score of old-r4-1t vs new-r4-1t: 422 - 508 - 1694  [0.484] 2624
Elo difference: -11.39 +/- 7.90
2018-12-29 15:34:56 +01:00
Joost VandeVondele
e526c5aa52 [cluster] Make bench compatible
Fix one TODO.

Takes care of output from bench.
Sum nodes over ranks.
2018-12-29 15:34:56 +01:00
Joost VandeVondele
54a0a228f6 [cluster] Some formatting cleanup
Standardize whitespace a bit.
Also adds two TODOs for follow up work.

No functional change.
2018-12-29 15:34:56 +01:00
noobpwnftw
66b2c6b9f1 Implement best move voting system for cluster
This implements the cluster version of d96c1c32a2
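
The commit message does not spell out the voting scheme, so the following is only a rough sketch of one way such a vote could work. It assumes each rank reports its best move together with a score and a completed depth, and that a move's votes grow with both. The `RankResult` type, the integer move encoding, and the exact weighting are hypothetical and not taken from the patch.

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

// Hypothetical per-rank report: best move (encoded as an int),
// its score, and the depth that rank completed.
struct RankResult {
    int move;
    int score;
    int depth;
};

// Returns the move with the highest vote total.
// Precondition: results is non-empty.
int vote_best_move(const std::vector<RankResult>& results) {
    // Scores are taken relative to the worst reported score,
    // so every rank contributes a non-negative weight.
    int minScore = results.front().score;
    for (const RankResult& r : results)
        minScore = std::min(minScore, r.score);

    // Assumed weighting: score above the minimum plus completed depth.
    std::map<int, int> votes;
    for (const RankResult& r : results)
        votes[r.move] += (r.score - minScore) + r.depth;

    // Pick the entry with the most votes (ties go to the smaller move id).
    auto best = std::max_element(
        votes.begin(), votes.end(),
        [](const std::pair<const int, int>& a, const std::pair<const int, int>& b) {
            return a.second < b.second;
        });
    return best->first;
}
```

In this sketch, two ranks agreeing on a move can outvote a single rank that reports a somewhat higher score for a different move.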
2018-12-29 15:34:56 +01:00
Joost VandeVondele
2559c20c6e [cluster] Fix oversight in TT key reuse
In the original code, the position key stored in the TT is used to probe and store TT entries after message passing. Since only part of the key bits are stored in the TT, this leads to incorrect rehashing. This patch fixes that by also storing the full key in the send buffers and using it for hashing after message arrival.

Short testing with 4 ranks (old vs new) shows this is effective:
Score of mpiold vs mpinew: 84 - 275 - 265  [0.347] 624
Elo difference: -109.87 +/- 20.88
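
To illustrate the problem, here is a simplified sketch, not the actual Stockfish data layout. It assumes an entry keeps only the upper 16 bits of the 64-bit position key while the bucket index is derived from the low bits of the full key; the names `Entry`, `Message`, `bucket_of`, and the table size are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified TT entry: only 16 bits of the key are kept.
struct Entry {
    std::uint16_t key16;
    std::int16_t  value;
};

constexpr std::size_t kBuckets = std::size_t(1) << 20;  // assumed table size

// Assumed indexing: the bucket comes from the low bits of the FULL key.
std::size_t bucket_of(std::uint64_t key) {
    return static_cast<std::size_t>(key) & (kBuckets - 1);
}

// All a receiver can reconstruct from the entry alone: the low 48 bits are lost.
std::uint64_t key_from_entry(const Entry& e) {
    return static_cast<std::uint64_t>(e.key16) << 48;
}

// The fix described above: ship the full key next to the entry.
struct Message {
    std::uint64_t fullKey;
    Entry         entry;
};

Message pack(std::uint64_t key, std::int16_t value) {
    return Message{key, Entry{static_cast<std::uint16_t>(key >> 48), value}};
}
```

On the receiving rank, `bucket_of(msg.fullKey)` selects the same bucket the sender used, whereas `bucket_of(key_from_entry(msg.entry))` generally does not.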
2018-12-29 15:34:55 +01:00
noobpwnftw
8a95d269eb Implement proper stop signalling from root node
The previous behavior was to wait for all nodes to finish their search under their own time management (TM) and then aggregate to the root node via a blocking MPI_Allreduce call. This seems to be problematic.

This commit implements a proper non-blocking signalling barrier, so that TM on the root node controls the cluster search, and disables TM on all non-root nodes.

Also includes some cosmetic fixes to the nodes/NPS display.
2018-12-29 15:34:55 +01:00
Omri Mor
29c166a072 MPI/Cluster implementation for Stockfish
Based on Peter Österlund's "Lazy Cluster" algorithm,
but with some simplifications.
To compile, point COMPCXX to the MPI C++ compiler wrapper (mpicxx).
2018-12-29 15:34:55 +01:00