I could not find any documentation stating that prepending -mbmi to -mbmi2 is necessary or gives any benefit.
Instead, the GCC documentation at
https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions
states:
The following built-in functions are available when -mbmi is used. All of them generate the machine instruction that is part of the name.
unsigned int __builtin_ia32_bextr_u32(unsigned int, unsigned int);
unsigned long long __builtin_ia32_bextr_u64 (unsigned long long, unsigned long long);
The following built-in functions are available when -mbmi2 is used. All of them generate the machine instruction that is part of the name.
unsigned int _bzhi_u32 (unsigned int, unsigned int)
unsigned int _pdep_u32 (unsigned int, unsigned int)
unsigned int _pext_u32 (unsigned int, unsigned int)
unsigned long long _bzhi_u64 (unsigned long long, unsigned long long)
unsigned long long _pdep_u64 (unsigned long long, unsigned long long)
unsigned long long _pext_u64 (unsigned long long, unsigned long long)
and at
https://gcc.gnu.org/ml/gcc/2014-02/msg00204.html
( "... The real optimization comes from being able to use pext
(parallel bit extract), which can implement several bextr expressions in
parallel.")
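To make the quoted remark concrete, here is a portable sketch of what a parallel bit extract computes. The name software_pext is hypothetical and used only for illustration; real code would call _pext_u64 from <immintrin.h> when built with -mbmi2.

```cpp
#include <cstdint>

// Portable reference for "parallel bit extract": gather the bits of src
// selected by mask and pack them into the low bits of the result.
// software_pext is a hypothetical helper name, not a GCC builtin.
inline std::uint64_t software_pext(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        std::uint64_t lowest = mask & (~mask + 1); // isolate lowest set mask bit
        if (src & lowest)
            result |= bit;                         // copy it to the next output bit
        mask &= mask - 1;                          // clear that mask bit
    }
    return result;
}
```

A single call with a mask covering several disjoint fields extracts all of them at once, which is the sense in which one pext can stand in for several bextr expressions.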
Apart from that, we do not pass all of -msse -msse2 -msse3 -msse4.2 etc., but only -msse3 (or -msse4.2).
As for the speedup being within noise level: this pull request is actually a reversal of mcostalba#198, in which prepending -mbmi to -mbmi2 was claimed to be 0.3% faster, whereas here removing -mbmi gives a 0.4% speed gain.
Same initialization logic for both
pawns and pieces.
The advantage of this patch is that we reduce
redundancy and get a single (source) code path
for both cases. This is easier to understand
and to maintain.
Note: This patch makes use of some advanced template
techniques like SFINAE, decltype and the new function
declaration syntax (with trailing return type). This
is not just a show-off, but it is really needed in
this case.
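As a rough illustration of the techniques named above (not the actual patch), the sketch below uses expression SFINAE with decltype in a trailing return type so that one init loop serves two entry kinds. PieceEntry, PawnEntry, slot() and init() are hypothetical names invented for this example.

```cpp
// Hypothetical stand-ins for the two table kinds; not the real structs.
struct PieceEntry { int data; };      // a single slot
struct PawnEntry  { int data[4]; };   // one slot per file

// Chosen when e.data[0] is a valid expression (data is an array).
// The int dummy parameter makes this overload the better match.
template<typename E>
auto slot(E& e, int f, int) -> decltype(e.data[0]) {
    return e.data[f];
}

// Fallback when data is not subscriptable: substitution of the overload
// above fails silently (SFINAE) and this one is picked instead.
template<typename E>
auto slot(E& e, int, long) -> decltype((e.data)) {
    return e.data;
}

// Single source code path for both cases.
template<typename E>
void init(E& e, int files) {
    for (int f = 0; f < files; ++f)
        slot(e, f, 0) = f + 1;
}
```

Calling init(pawnEntry, 4) fills all four slots, while init(pieceEntry, 1) writes the single member, with the overload chosen at compile time.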
Remove the involved size[] array to get the
sizes from setup_pairs().
Now the code is more self-documenting because each
table is clearly associated with its size.
Currently we malloc a single memory
chunk in which we shuffle all kinds of
different stuff in a very tricky way,
for instance see PairsData::base[1]
that is a hack used as a pointer to
data instead of an actual array (no
wonder the C++ compiler complains!).
This patch rewrites all this in a way
to avoid hacky allocations and instead
to rely on the standard containers to
do their job.
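A minimal sketch of the idea, assuming a much simplified layout: PairsSketch, its members and make_pairs() are hypothetical and only show each table owning its storage instead of pointing into a shared malloc'd chunk.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for PairsData: each table owns its
// storage, so no manual offsets into one big allocation are needed.
struct PairsSketch {
    std::vector<std::uint64_t> base;    // sized at setup time
    std::vector<std::uint8_t>  symlen;  // sized at setup time
};

inline PairsSketch make_pairs(std::size_t baseCount, std::size_t symCount) {
    PairsSketch d;
    d.base.resize(baseCount);    // value-initialized to zero
    d.symlen.resize(symCount);   // released automatically by the destructor
    return d;
}
```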
This is the base for future work.
Now, Binomial[k][n] = Bin(k, n), instead of Binomial[k-1][n] = Bin(k, n).
Better document the Pascal triangle:
* Each entry is the sum of the one above and the one to the left of it.
* Values outside the triangle are zero. This was not checked for k=n previously,
and the code implicitly relied on zero initialization of Binomial[]. That
reliance was made more confusing by the initial assignment before the loop.
No functional change.
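The rule described above can be sketched as follows. The array bounds (6 and 64) and the name init_binomial() are assumptions made for this example; the point is the explicit zero for terms outside the triangle, including the k = n case.

```cpp
// Binomial[k][n] = number of ways to choose k elements out of n.
// Bounds are assumed for illustration only.
int Binomial[6][64];

void init_binomial() {
    Binomial[0][0] = 1;
    for (int n = 1; n < 64; ++n)
        for (int k = 0; k < 6 && k <= n; ++k)
            // Pascal's rule; terms outside the triangle count as zero.
            Binomial[k][n] = (k > 0 ? Binomial[k - 1][n - 1] : 0)
                           + (k < n ? Binomial[k][n - 1] : 0);
}
```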
This is a first step to clean up that part of
the initialization code.
Apparently init functions are harder to read now,
but this is only temporary: this is a prerequisite
for future work.
As a side benefit we can now get rid of the ancillary
struct and define them directly in the main ones, even
using anonymous structs!
Pointer members of WDLEntry and DTZEntry must be null, so they can be freed.
Whether unmap() behaves like free() and tolerates a NULL pointer (treated as
no-op) is unclear. Better safe than sorry, so test data before calling unmap().
Simplify hasUniquePieces calculation while at it.
No functional change.
Avoid explicitly freeing the objects.
Because the d'tor involves file unmapping, some
care must be taken to avoid accidentally destroying
the object (even temporarily), for instance when
reordering the list.
As a side effect, we can now restore the original
main.cpp, fully in sync with master branch.
It is just an intermediate struct, use DTZEntry directly.
This allows us to remove a malloc and simplify freeing.
Confirmed with Valgrind there are no memory leaks.
Super big patch that completely rewrites the
data layout to avoid casting pointers
back and forth between different structs.
Unfortunately it is not possible to write
the patch in small steps because all the
data structs were deeply mixed and once
you touch one part you also need to change
the others.
Functionality is unchanged and this is already
a big success: now we have a proper base on
which to do further cleanup work.
Verified with Valgrind there are no memory leaks.
Use std::deque instead because it preserves references
to its elements when resizing (std::vector does not).
DTZ_table is still an array because it seems its size
is fixed and does not depend on the existing TB files.
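The property relied upon here can be checked with a small sketch. The function name is hypothetical; the guarantee itself comes from the standard, which says that inserting at either end of a std::deque invalidates iterators but leaves references to existing elements valid.

```cpp
#include <deque>

// Returns true if a reference taken before growth still designates the
// same element afterwards. This holds for push_back/emplace_back on
// std::deque, whereas std::vector may reallocate and move its elements.
inline bool deque_reference_survives_growth() {
    std::deque<int> entries;
    entries.push_back(42);
    int& first = entries.front();
    const int* addr = &first;
    for (int i = 0; i < 10000; ++i)
        entries.push_back(i);           // grows the container many times
    return addr == &entries.front() && first == 42;
}
```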
Given a position, probe_ab() does a kind of qsearch,
but instead of evaluating the position at the beginning
through a table lookup, it performs a depth-first
search and only at the end checks the score of the
current position.
Also replace platform specific byte swap with
a software version. Amazingly it seems it is
even faster now!
Also, removing the templatized form does not slow things down.
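For reference, a portable byte swap can be written with plain shifts and masks. This is a sketch of the general technique, not necessarily the code in the patch; swap32 and swap64 are hypothetical names, and no speed claim is implied.

```cpp
#include <cstdint>

// Reverse the byte order of a 32-bit value using shifts and masks only.
inline std::uint32_t swap32(std::uint32_t x) {
    return  (x >> 24)
         | ((x >>  8) & 0x0000FF00u)
         | ((x <<  8) & 0x00FF0000u)
         |  (x << 24);
}

// Reverse a 64-bit value by swapping each half and exchanging the halves.
inline std::uint64_t swap64(std::uint64_t x) {
    std::uint64_t lo = swap32(static_cast<std::uint32_t>(x));
    std::uint64_t hi = swap32(static_cast<std::uint32_t>(x >> 32));
    return (lo << 32) | hi;
}
```

Compilers commonly recognize this pattern and may emit a single byte-swap instruction for it, though that depends on the compiler and flags.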
Seems to give around 1% speed-up for CPUs with popcnt support.
Seems to give a very minor speed-up for CPUs without popcnt.
No functional change
Resolves #646