This gives SF accurate PVs, such that the evaluation of the leaf node in
the PV matches the score backed up to the root (99% of the time.
q-search will use the value stored in the hash table instead of the eval
value sometimes).
One drawback is that fail-high/low only get a minimal 2 move PV.
It doesn't add any additional overhead to the non-PV codepath except an
extra eight bytes to the SearchStack structure in multi-threaded
searches.
A core part of this is not pruning based on TT score in PV nodes. This
was measured as not being a regression at multiple TCs, except for one
exception, fast TC with huge hash, which is not realistic for longer
searches.
STC - 1 thread, 128 mb hash
ELO: 1.42 +-3.1 (95%) LOS: 81.9%
Total: 20000 W: 4078 L: 3996 D: 11926
STC - 3 thread, 128 mb hash
ELO: -3.60 +-2.9 (95%) LOS: 0.8%
Total: 20000 W: 3575 L: 3782 D: 12643
STC - 3 thread, 8 mb hash
ELO: 0.12 +-2.9 (95%) LOS: 53.3%
Total: 20000 W: 3654 L: 3647 D: 12699
LTC - 3 thread, 32mb hash
ELO: 2.29 +-2.0 (95%) LOS: 98.8%
Total: 35740 W: 5618 L: 5382 D: 24740
Bench: 6984058
Resolves#102
- Currently broken
- Never been really useful
- Does not work well with new splitting model
Verified for no regression at STC with 3 threads:
LLR: 2.96 (-2.94,2.94) [-6.00,0.00]
Total: 81905 W: 12122 L: 12381 D: 57402
No functional change
After last Joona's patch there is no measurable
difference between the option set or unset.
Tested by Andreas Strangmüller with 16 threads
on his Dual Opteron 6376.
After 5000 games at 15+0.05 the result is:
1 Stockfish_14050822_T16_on : 3003 5000 (+849,=3396,-755), 50.9 %
2 Stockfish_14050822_T16_off : 2997 5000 (+755,=3396,-849), 49.1 %
bench: 880215
Instead of waiting to be allocated, actively search
for another split point to join when finishes its
search. Also modify split conditions.
This patch has been tested with 7 threads SMP and
passed both STC:
LLR: 2.97 (-2.94,2.94) [-1.50,4.50]
Total: 2885 W: 519 L: 410 D: 1956
And a reduced-LTC at 25+0.05
LLR: 2.95 (-2.94,2.94) [0.00,6.00]
Total: 4401 W: 684 L: 566 D: 3151
Was then retested against regression in 3 thread case
at standard LTC of 60+0.05:
LLR: 2.96 (-2.94,2.94) [-4.00,0.00]
Total: 40809 W: 5446 L: 5406 D: 29957
bench: 8802105
Thanks to std::bitset we can easily increase
the limit of active threads above 64.
Thanks to Lucas Braesch for pointing at the
correct solution of using std::bitset.
No functional change.
A great simplification that shows no regression
and it seems even a bit scalable.
Tested with fixed number of games:
Short TC
ELO: 0.60 +-2.1 (95%) LOS: 71.1%
Total: 39554 W: 7477 L: 7409 D: 24668
Long TC
ELO: 2.97 +-2.0 (95%) LOS: 99.8%
Total: 36424 W: 5894 L: 5583 D: 24947
bench: 8184352
And remove a complex (and broken) formula.
Indeed previous code was broken in case of TC with big
time increments where available_time() was too similar
to total time yielding to many time losses, so for instance:
go wtime 2600 winc 2600
info nodes 4432770 time 2601 <-- time forfeit!
maximum search time = 2530 ms
available_time = 2300 ms
For a reference and further details see:
https://groups.google.com/forum/?fromgroups=#!topic/fishcooking/dCPAvQDcm2E
Speed tested with bench disabling timer alltogheter vs timer set at
max resolution, showed we have no speed regressions both in single
core and when using all physical cores.
No functional change.
This reverts commit 800410eef1 and instead increases
stack size.
I went through the old emails with Daylen that reported the
crash issue on Mac OS X and was fixed by 0049d3f337.
It was reported default stack size for a thread in Mac OS X is 8
megabytes while the patch that we are reverting allows to reduce
stack size at max of about 217KB, so the reason for the crash was
only marginal in MAX_MOVES value. On those emails Daylen also
hinted how to increase stack size for Mac OS X to 16MB.
So prefer to increase stack size to 16MB instad of re-inventing
the wheel and do our home grown stack as we did with the patch
that we are now reverting (it will remain anyhow in git history
for documentation purposes).
No functional change.
This greately reduces stack usage and is a
prerequisite for next patch.
Verified with 40K games both in single and SMP
case that there are no regressions.
No functional change.
Introduce ThreadBase struct that is search
agnostic and just handles low level stuff,
and derive all the other specialized classes
form here.
In particular TimerThread does not hinerits
anymore all the search related stuff from Thread.
Also some renaming while there.
Suggested by Steven Edwards
No functional change.
At thread creation start_routine() is called
and from there the virtual function idle_loop()
because we do this inside Thread c'tor, where the
virtual mechanism is disabled, it could happen that
the base class idle_loop() is called instead.
The issue happens with TimerThread and MainThread
where, at launch, start_routine calls
Thread::idle_loop instead of the derived ones.
Normally this bug is hidden because c'tor finishes
before start_routine() is actually called in the
just created execution thread, but on some platforms
and in some cases this is not guaranteed and the
engine hangs.
Reported by Ted Wong on talkchess
No functional change.
And #ifdef instead of #if defined
This is more standard form (see for example iostream file).
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Follow Don Dailey definition of cut/all node:
"If the previous node was a cut node, we consider this an ALL node.
The only exception is for PV nodes which are a special case of ALL nodes.
In the PVS framework, the first zero width window searched from a PV
node is by our definition a CUT node and if you have to do a re-search
then it is suddenly promoted to a PV nodes (as per PVS search) and only
then can the cut and all nodes swap positions. In other words, these
internal search failures can force the status of every node in the subtree
to swap if it propagates back to the last PV nodes."
http://talkchess.com/forum/viewtopic.php?topic_view=threads&p=519741&t=47577
With this definition we have an hit rate higher than 90% on:
if (!PvNode && depth > 4 * ONE_PLY)
dbg_hit_on_c(cutNode, (bestValue >= beta));
And an hit rate of just 28% on:
if (!PvNode && depth > 4 * ONE_PLY)
dbg_hit_on_c(!cutNode, (bestValue >= beta));
No functional change.
This reverts commit 0d68b523a3.
After easy move semplification this machinery is not
needed anymore (because of we don't need to know if a
root move is a recapture)
No functional change.
Save the current active position in each Thread
instead of keeping a centralized array in struct
SplitPoint.
This allow to skip a memset() call at each split.
No functional change.
To return a pointer to the available
thread instead of a bool. This allows
to simplify the core loop in split().
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
This function "returns" two values: bestValue and bestMove
Instead of returning one and passing as pointer the other
be consistent and pass as pointers both.
No functional change.
When a thread is allocated a bit is set in slavesMask.
This bit corresponds to the thread's index field that,
because it happens to be the position in the threads
array, eventually it is equal to the loop index 'i'.
But instead of relying on this 'coincidence', explicitly
use the 'idx' field so to clarify slavesMask usage.
Backported from c++11 branch.
No functional change.
Subclass MainThread and TimerThread and declare
idle_loop() virtual. This allow us to cleanly
remove a good bunch of hacks, relying on C++
polymorphism to do the job.
No functional change.
Rename it is_finished and use it only in main
thread to signal search is finished. This allows
us to simplify the complex SMP logic.
Ultra tricky patch: deep test is required under
wide conditions like pondering on and option
"Use Sleeping Threads" set to false.
No functional change.
And return on using TT as backing store for position
evaluations.
Tests (even on single thread) show eval cache was a regression.
In multi thread result should be even worst because eval cache
is a per-thread struct, while TT is shared.
After 4957 games at 15"+0.05 (single thread)
eval cache vs master 969 - 1093 - 2895 -9 ELO
So previous reported result of +18 ELO was probably due to an
issue in the testing framework (a bug in cutechess-cli) that
has been fixed in the meanwhile.
bench: 5386711
Use an eval cache instead of TT to store node
position evaluations.
It is already an improvment and, because it frees
two TT entry slots, paves the way to extend TT to
store both upper and lower bounds.
After 4855 games, single thread, 15"+0.05
Mod vs Orig 1165 -920 - 2770 ELO +18
bench: 5149248
And document why this is an hard limit. It
seems for some (lucky) people 32 threads
are not enough.
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
With this patch series we want to introduce a per-thread
evaluation cache to store node evaluation and do not
rely anymore on the TT table for this.
This patch just introduces the infrastructure.
No functional change.
Handle also the SMP case. This has been quite tricky, not
trivial to enforce the node limit in SMP case becuase
with "helpful master" concept we can have recursive split
points and we cannot lock them all at once so there is the
risk of counting the same nodes more than once.
Anyhow this patch should be race free and counted nodes are
correct.
No functional change.
It is very difficult and risky to assure
that a running thread doesn't access a global
variable. This is currently true, but could
change in the future and we don't want to rely
on code that works 'by accident'. The threads
are still running when ThreadPool destructor is
called (after main() returns) and this could
lead to crashes if a thread accesses a global
that has been already freed. The solution is to
use an exit() function and call it while we are
still in main(), ensuring global variables are
still alive at threads termination time.
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>