1
0
Fork 0
mirror of https://github.com/sockspls/badfish synced 2025-05-01 09:13:08 +00:00
Commit graph

13 commits

Author SHA1 Message Date
Tomasz Sobczyk
8e560c4fd3 Replicate network weights only to used NUMA nodes
On a system with multiple NUMA nodes, this patch avoids unneeded replicated
(e.g. 8x for a single threaded run), reducting memory use in that case.

Lazy initialization forced before search.

Passed STC:
https://tests.stockfishchess.org/tests/view/66a28c524ff211be9d4ecdd4
LLR: 2.96 (-2.94,2.94) <-1.75,0.25>
Total: 691776 W: 179429 L: 179927 D: 332420
Ptnml(0-2): 2573, 79370, 182547, 78778, 2620

closes https://github.com/official-stockfish/Stockfish/pull/5515

No functional change
2024-08-03 09:41:37 +02:00
Stéphane Nicolet
7e72b37e4c Clean up comments in code
- Capitalize comments
- Reformat multi-lines comments to equalize the widths of the lines
- Try to keep the width of comments around 85 characters
- Remove periods at the end of single-line comments

closes https://github.com/official-stockfish/Stockfish/pull/5469

No functional change
2024-07-11 07:29:33 +02:00
Tomasz Sobczyk
f55239b2f3 NumaPolicy fixes and robustness improvements
1. Fix GetProcessGroupAffinity still not getting properly aligned memory
   sometimes.
2. Fix a very theoretically possible heap corruption if
   GetActiveProcessorGroupCount changes between calls.
3. Fully determine affinity on Windows 11 and Windows Server 2022. It
   should only ever be indeterminate in case of an error.
4. Separate isDeterminate for old and new API, as they are &'d together
   we still can end up with a subset of processors even if one API is
   indeterminate.
5. likely_used_old_api() that is based on actual affinity that's been
   detected
6. IMPORTANT: Gather affinities at startup, so that we only later use
   the affinites set at startup. Not only does this prevent us from our
   own calls interfering with detection but it also means subsequent
   setoption NumaPolicy calls should behave as expected.
7. Fix ERROR_INSUFFICIENT_BUFFER from GetThreadSelectedCpuSetMasks being
   treated like an error.

Should resolve
02ff76630b (commitcomment-142790025)

closes https://github.com/official-stockfish/Stockfish/pull/5372

Bench: 1231853
2024-06-08 23:32:27 +02:00
Tomasz Sobczyk
02ff76630b Add NumaPolicy "hardware" option that bypasses current processor affinity.
Can be used in case a GUI (e.g. ChessBase 17 see #5307) sets affinity to a
single processor group, but the user would like to use the full capabilities of
the hardware.  Improves affinity handling on Windows in case of multiple
available APIs and existing affinities.

closes https://github.com/official-stockfish/Stockfish/pull/5353

No functional change
2024-06-05 21:01:45 +02:00
Disservin
00a28ae325 Add helpers for managing aligned memory
Previously, we had two type aliases, LargePagePtr and AlignedPtr, which
required manually initializing the aligned memory for the pointer.

The new helpers:

- make_unique_aligned
- make_unique_large_page

are now available for allocating aligned memory (with large pages). They
behave similarly to std::make_unique, ensuring objects allocated with
these functions follow RAII.

The old approach had issues with initializing non-trivial types or
arrays of objects. The evaluation function of the network is now a
unique pointer to an array instead of an array of unique pointers.

Memory related functions have been moved into memory.h

Passed High Hash Pressure Test Non-Regression STC:
https://tests.stockfishchess.org/tests/view/665b2b36586058766677cfd2
LLR: 2.93 (-2.94,2.94) <-1.75,0.25>
Total: 476992 W: 122426 L: 122677 D: 231889
Ptnml(0-2): 1145, 51027, 134419, 50744, 1161

Failed Normal Non-Regression STC:
https://tests.stockfishchess.org/tests/view/665b2997586058766677cfc8
LLR: -2.94 (-2.94,2.94) <-1.75,0.25>
Total: 877312 W: 225233 L: 226395 D: 425684
Ptnml(0-2): 2110, 94642, 246239, 93630, 2035

Probably a fluke since there shouldn't be a real slowndown and it has also
passed the high hash pressure test.

closes https://github.com/official-stockfish/Stockfish/pull/5332

No functional change
2024-06-03 23:11:59 +02:00
Tomasz Sobczyk
a2a7edf4c8 Fix GetProcessGroupAffinity call
`GetProcessGroupAffinity` appears to require 4 byte alignment for `GroupArray` memory.

See https://stackoverflow.com/q/78567676 for further information

closes https://github.com/official-stockfish/Stockfish/pull/5340

No functional change
2024-06-03 08:54:24 +02:00
Linmiao Xu
c17d73c554 Simplify statScore divisor into a constant
Passed non-regression STC:
https://tests.stockfishchess.org/tests/view/665b392ff4a1fd0c208ea864
LLR: 2.93 (-2.94,2.94) <-1.75,0.25>
Total: 114752 W: 29628 L: 29495 D: 55629
Ptnml(0-2): 293, 13694, 29269, 13827, 293

Passed non-regression LTC:
https://tests.stockfishchess.org/tests/view/665b588c11645bd3d3fac467
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 65322 W: 16549 L: 16373 D: 32400
Ptnml(0-2): 30, 7146, 18133, 7322, 30

closes https://github.com/official-stockfish/Stockfish/pull/5337

bench 1241443
2024-06-01 20:17:38 +02:00
Joost VandeVondele
54e74919d4 Fix cross from Linux to Windows
specifies Windows 7 required

https://learn.microsoft.com/en-us/cpp/porting/modifying-winver-and-win32-winnt?view=msvc-170

closes https://github.com/official-stockfish/Stockfish/pull/5319

No functional change
2024-05-30 23:07:25 +02:00
Tomasz Sobczyk
c8375c2fbd On linux use sysfs instead of lscpu
Use sysfs (https://www.kernel.org/doc/Documentation/ABI/stable/sysfs-devices-node)
to determine processor to NUMA node mapping.

Avoids problems on some machines with high core count where lscpu was showing high cpu utilization.

closes https://github.com/official-stockfish/Stockfish/pull/5315

No functional change
2024-05-30 23:05:25 +02:00
Tomasz Sobczyk
f1bb4164bf Fix process' processor affinity determination on Windows.
Specialize and privatize NumaConfig::get_process_affinity.
Only enable NUMA capability for 64-bit Windows.

Following #5307 and some more testing it was determined that the way affinity
was being determined on Windows was incorrect, based on incorrect assumptions
about GetNumaProcessorNodeEx.

This patch fixes the issue by attempting to retrieve the actual process'
processor affinity using Windows API. However one issue persists that is not
addressable due to limitations of Windows, and will have to be considered a
limitation. If affinities were set using SetThreadAffinityMask instead of
SetThreadSelectedCpuSetMasks and GetProcessGroupAffinity returns more than 1
group it is NOT POSSIBLE to determine the affinity programmatically on Windows.
In such case the implementation assumes no affinites are set and will consider
all processors available for execution.

closes https://github.com/official-stockfish/Stockfish/pull/5312

No functional change
2024-05-30 23:05:16 +02:00
Disservin
596fb4842b NUMA: Fix concurrency counting for windows systems
If there is more than 1 processor group, std:🧵:hardware_concurrency should not be used.

fixes #5307

closes https://github.com/official-stockfish/Stockfish/pull/5311

No functional change
2024-05-30 23:05:01 +02:00
mstembera
a2f4e988aa Fix MSVC NUMA compile issues
closes https://github.com/official-stockfish/Stockfish/pull/5298

No functional change
2024-05-29 19:00:37 +02:00
Tomasz Sobczyk
a169c78b6d Improve performance on NUMA systems
Allow for NUMA memory replication for NNUE weights.  Bind threads to ensure execution on a specific NUMA node.

This patch introduces NUMA memory replication, currently only utilized for the NNUE weights. Along with it comes all machinery required to identify NUMA nodes and bind threads to specific processors/nodes. It also comes with small changes to Thread and ThreadPool to allow easier execution of custom functions on the designated thread. Old thread binding (WinProcGroup) machinery is removed because it's incompatible with this patch. Small changes to unrelated parts of the code were made to ensure correctness, like some classes being made unmovable, raw pointers replaced with unique_ptr. etc.

Windows 7 and Windows 10 is partially supported. Windows 11 is fully supported. Linux is fully supported, with explicit exclusion of Android. No additional dependencies.

-----------------

A new UCI option `NumaPolicy` is introduced. It can take the following values:
```
system - gathers NUMA node information from the system (lscpu or windows api), for each threads binds it to a single NUMA node
none - assumes there is 1 NUMA node, never binds threads
auto - this is the default value, depends on the number of set threads and NUMA nodes, will only enable binding on multinode systems and when the number of threads reaches a threshold (dependent on node size and count)
[[custom]] -
  // ':'-separated numa nodes
  // ','-separated cpu indices
  // supports "first-last" range syntax for cpu indices,
  for example '0-15,32-47:16-31,48-63'
```

Setting `NumaPolicy` forces recreation of the threads in the ThreadPool, which in turn forces the recreation of the TT.

The threads are distributed among NUMA nodes in a round-robin fashion based on fill percentage (i.e. it will strive to fill all NUMA nodes evenly). Threads are bound to NUMA nodes, not specific processors, because that's our only requirement and the OS can schedule them better.

Special care is made that maximum memory usage on systems that do not require memory replication stays as previously, that is, unnecessary copies are avoided.

On linux the process' processor affinity is respected. This means that if you for example use taskset to restrict Stockfish to a single NUMA node then the `system` and `auto` settings will only see a single NUMA node (more precisely, the processors included in the current affinity mask) and act accordingly.

-----------------

We can't ensure that a memory allocation takes place on a given NUMA node without using libnuma on linux, or using appropriate custom allocators on windows (https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node), so to avoid complications the current implementation relies on first-touch policy. Due to this we also rely on the memory allocator to give us a new chunk of untouched memory from the system. This appears to work reliably on linux, but results may vary.

MacOS is not supported, because AFAIK it's not affected, and implementation would be problematic anyway.

Windows is supported since Windows 7 (https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity). Until Windows 11/Server 2022 NUMA nodes are split such that they cannot span processor groups. This is because before Windows 11/Server 2022 it's not possible to set thread affinity spanning processor groups. The splitting is done manually in some cases (required after Windows 10 Build 20348). Since Windows 11/Server 2022 we can set affinites spanning processor group so this splitting is not done, so the behaviour is pretty much like on linux.

Linux is supported, **without** libnuma requirement. `lscpu` is expected.

-----------------

Passed 60+1 @ 256t 16000MB hash: https://tests.stockfishchess.org/tests/view/6654e443a86388d5e27db0d8
```
LLR: 2.95 (-2.94,2.94) <0.00,10.00>
Total: 278 W: 110 L: 29 D: 139
Ptnml(0-2): 0, 1, 56, 82, 0
```

Passed SMP STC: https://tests.stockfishchess.org/tests/view/6654fc74a86388d5e27db1cd
```
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 67152 W: 17354 L: 17177 D: 32621
Ptnml(0-2): 64, 7428, 18408, 7619, 57
```

Passed STC: https://tests.stockfishchess.org/tests/view/6654fb27a86388d5e27db15c
```
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 131648 W: 34155 L: 34045 D: 63448
Ptnml(0-2): 426, 13878, 37096, 14008, 416
```

fixes #5253
closes https://github.com/official-stockfish/Stockfish/pull/5285

No functional change
2024-05-28 18:34:15 +02:00