It is used mainly in a bunch of inline one-liners
just below its definition, so substitute it with its
explicit definition and avoid the information hiding.
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
It is clearer that only in that case is the move number
correct; otherwise it is only a partial quantity: the
number of moves of that phase.
In the case of PH_EVASIONS, instead, we have only one phase.
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
The arrays index[] and pieceList[] are not guaranteed to be
invariant across a do_move() + undo_move() sequence when a
capture move is involved.
The reason is that in do_move() the captured piece is removed
from the list and its slot is filled with the last one, while
in undo_move() it is added back, but at the end of the list.
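A minimal sketch of the two operations, with hypothetical names
(the real code keeps one list per piece type):

    // Hypothetical, simplified state: squares are plain ints here.
    int pieceList[16]; // squares of the pieces of one type
    int index[64];     // index[sq] = slot in pieceList[] of piece on sq
    int pieceCount;

    // do_move(): the captured piece's slot is refilled with the LAST entry
    void remove_piece(int sq) {
        int slot = index[sq];
        int lastSq = pieceList[--pieceCount];
        pieceList[slot] = lastSq;
        index[lastSq] = slot;
    }

    // undo_move(): the captured piece is appended at the END of the list,
    // so after remove_piece() + restore_piece() the ordering differs.
    void restore_piece(int sq) {
        pieceList[pieceCount] = sq;
        index[sq] = pieceCount++;
    }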
Because index[] and pieceList[] are used in move generation
to scan the pieces, moves will be generated in a different
order before and after a do_move() + undo_move() sequence,
such as, for instance, the one in Position::has_mate_threat().
After the latest patches, move generation can now be invoked
also by the MovePicker c'tor, and this explains why the order of
picked moves differs depending on whether the MovePicker object
is instantiated before or after a Position::has_mate_threat() call.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
It seems that pos.has_mate_threat() changes the position!
So calling the MovePicker c'tor before or after the
has_mate_threat() call changes things!
The bug was unhidden by the previous patch, which makes the
MovePicker c'tor generate, score and sort good captures under
some circumstances.
Because scoring the captures is position dependent, it seems that
the moves returned by MovePicker are different when the c'tor is
called before has_mate_threat().
Of course this is only a workaround, because the real bug is still
hidden :-(
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
If we don't have TT moves to search, skip the useless
loop associated with the TT_MOVES phase.
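Something like the following sketch, with illustrative member names
(assuming the picker keeps two TT move slots):

    // Hypothetical shortcut: when no TT move was supplied, jump straight
    // past the TT_MOVES phase instead of entering its pick loop.
    if (ttMoves[0] == MOVE_NONE && ttMoves[1] == MOVE_NONE)
        go_next_phase();   // nothing to pick here, advance immediately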
Another 1% speed boost that brings this series
to +6.2% against the original revision 595a90df.
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
This not only cleans up the code but gives another
speed boost of 1.8%.
Since revision 595a90dfd0 we have increased the pgo-compiled
binary speed by a whopping +5.2% without any functional change!!
This is really awesome considering that we have also
cut the line count by 25 lines.
Sometimes we spend days squeezing an extra 1% out of move
generation, while the biggest optimizations instead come
from anonymous and apparently dull parts of the code.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Does not seem to improve on the standard; the latest results
from Joona after 2040 games are negative:
Orig - Mod: 454 - 424 - 1162
And this is more or less the same as I got a few days ago.
So revert for now.
Verified same functionality as 595a90dfd.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Loop along the move list in the same way as for the
other move kinds, so as to be consistent in get_next_move().
And a bit of the usual clean-up too, but just a bit.
It is even a bit (+0.3%) faster now. ;-)
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Always after TT move but before captures.
This seems a better setup than the version before this
patch.
After 999 games at 1+0
Mod - Orig +252 =527 -220 +11 ELO
Unfortunately it does not seem to improve on the standard
version, with null move outside of the MovePicker (595a90df) and
the latest speed-up patches added in.
After 999 games at 1+0
Mod - Standard +244 =506 -249 -2 ELO
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
This avoids calculating the array entry position
at each access and gives another boost of almost 1%.
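The shape of the change, as a generic sketch with illustrative names:

    // Before: table[piece][square] was re-indexed at every access.
    // After: compute the entry address once and reuse it.
    void update(int table[16][64], int piece, int square,
                int bonus, int limit) {
        int& entry = table[piece][square]; // position calculated once
        entry += bonus;
        if (entry > limit)
            entry = limit;
    }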
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
In MovePicker we get the next move with pick_move_from_list(),
then check whether the return value is MOVE_NONE, and in
that case we update the state to the new phase.
This patch reorders the flow so that pick_move_from_list(),
renamed get_next_move(), directly calls go_next_phase() to
generate and sort the next bunch of moves when there are no more
moves to try. This avoids always having to check the value
returned by pick_move_from_list(), and the flow is more linear
and natural.
Also use a local variable instead of a pointer dereference in the
time-critical switch statement in get_next_move().
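A simplified sketch of the new flow (member names are illustrative):

    Move MovePicker::get_next_move() {

        while (true)
        {
            while (movesPicked < numOfMoves)
            {
                Move m = pick_from_current_list(); // phase-specific pick
                if (m != MOVE_NONE)
                    return m;
            }
            if (phase == PH_STOP)
                return MOVE_NONE;  // no more phases for this node

            go_next_phase();       // generate and sort the next bunch
        }
    }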
With this patch alone we get an incredible speed-up of 3.2%!!!
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Try good captures before null move when depth < 3 * OnePly.
Use this kind of null move also when depth == OnePly.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Original patch from Joona with optimizations added
by me.
Great cleanup of MovePicker, with a speed improvement of 1%.
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Explicitly write the conditions for a pawn push to the 7th
rank and for a passed pawn instead of wrapping them in
redundant helpers.
Also retire the now unused move_is_pawn_push_to_7th()
and the never used move_was_passed_pawn_push() and
move_is_deep_pawn_push().
Function extension() is so time critical that this
simple patch speeds up the pgo compile by 0.5%, and
it is also clearer what actually happens there.
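As a hedged sketch, the inlined pawn-to-7th test looks something
like this (helper and table names are illustrative):

    // Instead of: if (move_is_pawn_push_to_7th(m)) ...
    // spell the condition out directly in extension():
    if (   pos.type_of_piece_on(move_from(m)) == PAWN
        && relative_rank(pos.side_to_move(), move_to(m)) == RANK_7)
        result += PawnPushTo7thExtension[pvNode]; // hypothetical bonus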
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Remove the 'b' uint32_t local variable.
The optimized assembly is more or less the same
(one 'mov' instruction less), but the code is now
written in a way more similar to the final assembly
flow, so it should be easier for the compiler to optimize.
Also guarantee that BitTable[] is always aligned.
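A hedged sketch of the resulting shape (the folding constant is the
usual 32-bit one; treat the exact code as illustrative, not as the
patch itself):

    #include <cstdint>

    extern const uint8_t BitTable[64]; // defined elsewhere, aligned

    int pop_1st_bit(uint64_t* bb) {
        uint64_t x = *bb ^ (*bb - 1); // ones up to the lowest set bit
        *bb &= *bb - 1;               // clear the lowest set bit
        // Previously: uint32_t b = uint32_t(x ^ (x >> 32)); then index.
        // Folding and indexing in one expression mirrors the assembly:
        return BitTable[(uint32_t(x ^ (x >> 32)) * 0x783A9B23u) >> 26];
    }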
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Verified correct against tuning branch.
After 999 games at 1+0
Mod vs Orig +257 =510 -232 51.20% +9 ELO
A very small increase, but an increase anyway!
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
The following new targets were added:
* osx-icc32: 32-bit x86 compiled with icpc.
* osx-icc64: 64-bit x86 compiled with icpc.
* osx-icc32-profile: 32-bit x86 compiled with icpc and pgo.
* osx-icc64-profile: 64-bit x86 compiled with icpc and pgo.
There were two asserts.
The first was raised because is_ok() was called at the
beginning of do_castle_move(), and this is wrong after
the last code reformatting because at that point the state
has already been modified by the caller do_move().
The second, raised by debugIncrementalEval, was due to a
rounding error in compute_value() that occurs because
TempoValueEndgame was updated to an odd value by the patch
"Merge Joona Kiiski evaluation tweaks" (3ed603cd) of 13/3/2009.
This line in compute_value() is the guilty one:
result += (side_to_move() == WHITE)? TempoValue / 2 : -TempoValue / 2;
The fix is to increment TempoValueEndgame so that it is even.
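To see the rounding, take a hypothetical odd value such as 19:
integer division gives 19 / 2 = 9, so the white and black cases
differ by 9 - (-9) = 18 instead of 19, and compute_value() drifts
one unit away from the incrementally updated score. With an even
value like 20 the two halves of 10 cancel exactly.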
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
This patch seems bigger than it actually is.
It just moves some code around and adds a few coding-style fixes
to do_move() and undo_move(), so as to have uniform naming in both
functions.
The diffstat for the whole patch series is
239 insertions(+), 426 deletions(-)
And the final MSVC pgo build is even a bit faster:
Before 448.051 nodes/sec
After 453.810 nodes/sec (+1.3%)
No functional change (tested on more than 100M nodes)
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Integrate undo_ep_move in undo_move(); this reduces the line
count and improves code readability.
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Integrate do_ep_move in undo_move(); this reduces the line
count and improves code readability.
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Integrate do_promotion_move() in do_move(); this reduces the line
count and improves code readability.
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Integrate do_ep_move in do_move(); this reduces the line
count and improves code readability.
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
In the MovePicker c'tor we access, during initialization, one of
the MainSearchPhaseIndex..QsearchWithoutChecksPhaseIndex globals.
Move the definition of PhaseTable[] to just after them, so that
when PhaseTable[] is accessed later in get_next_move()
it is already present in L1/L2.
It works like an implicit prefetch of PhaseTable[].
Also shrink PhaseTable[] to 16 bytes, so that it fits in a single
L1 cache line, using uint8_t instead of int.
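A hedged sketch of the layout idea (the neighbouring globals and
the table values are illustrative):

    #include <cstdint>

    // Reading one of these globals in the c'tor drags the adjacent
    // PhaseTable[] into the same cache neighbourhood.
    int MainSearchPhaseIndex;
    int EvasionsPhaseIndex;
    int QsearchWithChecksPhaseIndex;
    int QsearchWithoutChecksPhaseIndex;

    // uint8_t entries keep the whole table within 16 bytes.
    static const uint8_t PhaseTable[16] = { 0, 1, 2, 3, 4, 5, 6, 7,
                                            8, 0, 0, 0, 0, 0, 0, 0 };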
This apparently innocuous patch gives an astonishing speed-up
of 1.6% under MSVC 2010 beta, pgo optimized!
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
It was due to a missing -msse compiler option!
Without this option the CPU silently discards
prefetcht2 instructions during execution.
Also added a (gcc-documented) hack to prevent the Intel
compiler from optimizing away the prefetches.
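For reference, a sketch of what such a workaround looks like
(illustrative, not necessarily the exact patch):

    #include <xmmintrin.h> // _mm_prefetch; with gcc this needs -msse

    void prefetch(char* addr) {
    #if defined(__INTEL_COMPILER)
        // An empty inline-asm statement acts as an optimization
        // barrier, stopping icc from dropping the otherwise
        // side-effect-free prefetch.
        __asm__ ("");
    #endif
        _mm_prefetch(addr, _MM_HINT_T2); // emits prefetcht2
    }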
Special thanks to Heinz for testing and suggesting
improvements, and to Jim for testing icc on Windows.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
But this time with the guarantee of an always-aligned
access, so that prefetching is not adversely impacted.
On Joona's PC,
1+0, 64MB hash:
Orig - Mod: 174 - 237 - 359
Instead, after 1000 games at 1+0 with 128MB hash size,
we are at +1 ELO (just 4 games of difference).
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
After fixing the CPU frequency with the RightMark tool I was
able to speed-test all the different prefetch combinations.
Here are the results:
OS: Windows Vista 32bit, MSVC compile
CPU: Intel Core 2 Duo T5220 1.55 GHz
bench on depth 12, 1 thread, 26552844 nodes searched
results in nodes/sec
no-prefetch
402486, 402005, 402767, 401439, 403060
single prefetch (aligned 64)
410145, 409159, 408078, 410443, 409652
double prefetch (aligned 64) 0+32
414739, 411238, 413937, 414641, 413834
double prefetch (aligned 64) 0+64
413537, 414337, 413537, 414842, 414240
And now also some crazy stuff:
single prefetch (aligned 128)
410145, 407395, 406230, 410050, 409949
double prefetch (aligned 64) 0+0
409753, 410044, 409456
single prefetch (aligned 64) +32
408379, 408272, 406809
single prefetch (aligned 64) +64
408279, 409059, 407395
So it seems the best is a double prefetch at the address +32 or
+64; I will choose the second one because it seems more natural
to me.
It is still a mystery why it doesn't work under Linux :-(
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Always prefetch from a cache line boundary. It seems
that if the prefetch address is not cache line aligned then
performance is adversely impacted.
Hopefully we will reuse those 32 bits of padding for something
useful in the future.
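A hedged sketch of the padded layout (field names are illustrative):

    #include <cstdint>

    struct TTEntry {
        uint32_t key32;       // high half of the position key
        uint32_t data;        // move, depth, bound type, ...
        int16_t  value;
        int16_t  staticValue;
        uint32_t padding;     // the 32 bits we hope to reuse later
    };                        // back to 128 bits per entry

    struct TTCluster {
        TTEntry entry[4];     // 64 bytes, starts on a cache line
    };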
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
This fixes a compile error under Linux with gcc when
the Intel dev libraries are not present.
Also simplify the previous patch by moving the TT definition
from search.cpp to tt.cpp, so as to avoid passing a
pointer to TT to the current position.
Finally, simplify do_move(): we now miss a prefetch in the
rare case of setting an en-passant square, but the code is
much cleaner and the performance penalty is almost zero.
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Move the prefetching code inside do_move() so as to allow
very early prefetching and to put as many instructions
as possible between the prefetch and the following retrieve().
With this patch retrieve() times are cut by another 25%.
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
TT.retrieve() is the most time-consuming function
because it almost always involves a very slow RAM access.
The TT table is so big that it is never cached. This patch
prefetches TT data just after a move is done, so that the
subsequent TT.retrieve() will be very fast.
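A hypothetical sketch of the hook (names illustrative, body elided):

    void Position::do_move(Move m /* , ... */) {
        // ... update the board state and compute the new hash key ...
        TT.prefetch(key); // start pulling the TT cluster into cache
        // ... remaining bookkeeping runs while the load is in
        //     flight, so the later TT.retrieve(key) finds warm data.
    }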
Profiling with VTune shows that TT.retrieve() times are
almost cut in half!
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Shrink the key to 32 bits instead of 64. To still avoid
collisions, use the high 32 bits of the position key as the
TT key and the low 32 bits to retrieve the correct
cluster index in the table.
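A sketch of the split (illustrative; assumes a power-of-two number
of clusters):

    #include <cstdint>

    // High half is stored and compared, low half picks the cluster.
    uint32_t tt_key(uint64_t posKey) {
        return uint32_t(posKey >> 32);
    }
    size_t tt_cluster(uint64_t posKey, size_t numClusters) {
        return uint32_t(posKey) & (numClusters - 1);
    }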
With this patch the size of TTEntry shrinks to 96 bits instead
of 128, and a cluster of 4 TTEntries sums to 48 bytes instead
of 64.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Binaries are always built with the symbol table in, to ease
debugging and profiling.
It is now possible to run:
make strip
to remove the symbol table from the compiled binary. This
could be useful to prepare the release version.
Patch by Heinz van Saanen.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
The current formula enables LMR when
i + MultiPV >= LMRPVMoves
It means that, for instance, if MultiPV == 1 then LMR
starts to be considered at move i = LMRPVMoves - 1,
while if MultiPV == 3 then it starts earlier,
at move i = LMRPVMoves - 3.
With this patch the formula becomes
i >= MultiPV + LMRPVMoves - 2
so that LMR always starts after LMRPVMoves - 1 moves
from the last PV move.
No functional change when MultiPV == 1 (both formulas
reduce to i >= LMRPVMoves - 1).
Signed-off-by: Marco Costalba <mcostalba@gmail.com>
Access pos.piece_count() only once and avoid some
branches in the inner loop.
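A minimal sketch of the hoisting (hypothetical names, not the
actual get_material_info() code):

    int count_material(const Position& pos, Color us) {
        int sum = 0;
        for (int pt = PAWN; pt <= QUEEN; ++pt) {
            const int count = pos.piece_count(us, PieceType(pt));
            sum += count * MidgameValue[pt]; // no per-iteration call
        }                                    // or branch on count
        return sum;
    }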
Profiling with VTune shows a 20% speed improvement in
get_material_info(), and the code is also a bit cleaner
this way ;-)
No functional change.
Signed-off-by: Marco Costalba <mcostalba@gmail.com>