
Double prefetch on Windows

After fixing the CPU frequency with the RightMark tool I was
able to test the speed of all the different prefetch combinations.

Here are the results:

OS Windows Vista 32-bit, MSVC compile
CPU Intel Core 2 Duo T5220 1.55 GHz
bench on depth 12, 1 thread, 26552844 nodes searched
results in nodes/sec

no-prefetch
402486, 402005, 402767, 401439, 403060

single prefetch (aligned 64)
410145, 409159, 408078, 410443, 409652

double prefetch (aligned 64) 0+32
414739, 411238, 413937, 414641, 413834

double prefetch (aligned 64) 0+64
413537, 414337, 413537, 414842, 414240

And now also some crazy stuff:

single prefetch (aligned 128)
410145, 407395, 406230, 410050, 409949

double prefetch (aligned 64) 0+0
409753, 410044, 409456

single prefetch (aligned 64) +32
408379, 408272, 406809

single prefetch (aligned 64) +64
408279, 409059, 407395
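
To make the labels concrete, here is a rough sketch of what each variant
amounts to in code (illustrative only, not the exact benchmark harness;
'addr' stands for the 64-byte aligned address returned by first_entry()):

#include <xmmintrin.h>  // _mm_prefetch / _MM_HINT_T0

// single prefetch (aligned 64)
inline void prefetch_single(char* addr) {
  _mm_prefetch(addr, _MM_HINT_T0);
}

// double prefetch (aligned 64) 0+32: addr+32 still falls in the same
// 64-byte cache line when addr is 64-byte aligned
inline void prefetch_double_0_32(char* addr) {
  _mm_prefetch(addr,      _MM_HINT_T0);
  _mm_prefetch(addr + 32, _MM_HINT_T0);
}

// double prefetch (aligned 64) 0+64: current cache line plus the next one
inline void prefetch_double_0_64(char* addr) {
  _mm_prefetch(addr,      _MM_HINT_T0);
  _mm_prefetch(addr + 64, _MM_HINT_T0);
}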

So it seems the best is a double prefetch at the address +32 or +64;
I will choose the second one because it seems more natural to me.

It is still a mystery why it doesn't work under Linux :-(
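
For reference, on the non-MSVC side the same double prefetch could in
principle be written with the GCC builtin; a minimal sketch, assuming
__builtin_prefetch() is not optimized away (the real code uses an asm
volatile instead, as the diff below notes):

// Hypothetical GCC-side equivalent of the 0+64 double prefetch.
// rw = 0 (read), locality = 3 (keep in all cache levels, like T0).
inline void prefetch_double_gcc(char* addr) {
  __builtin_prefetch(addr,      0, 3);
  __builtin_prefetch(addr + 64, 0, 3);
}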

Signed-off-by: Marco Costalba <mcostalba@gmail.com>
commit 8d369600ec
parent f4140ecc0c
Author: Marco Costalba <mcostalba@gmail.com>
Date:   2009-08-10 14:23:19 +01:00

@@ -174,12 +174,14 @@ TTEntry* TranspositionTable::retrieve(const Key posKey) const {
 /// blocking function and do not stalls the CPU waiting for data
 /// to be loaded from RAM, that can be very slow. When we will
 /// subsequently call retrieve() the TT data will be already
-/// quickly accessible in L1/l2 CPU cache.
+/// quickly accessible in L1/L2 CPU cache.
 void TranspositionTable::prefetch(const Key posKey) const {
 #if defined(_MSC_VER)
-  _mm_prefetch((char*)first_entry(posKey), _MM_HINT_T0);
+  char* addr = (char*)first_entry(posKey);
+  _mm_prefetch(addr, _MM_HINT_T0);
+  _mm_prefetch(addr+64, _MM_HINT_T0);
 #else
   // We need to force an asm volatile here because gcc builtin
   // is optimized away by Intel compiler.