
I have a faulty Ryzen 5900X desktop CPU. Previously, I somewhat tamed its faulty cores via the isolcpus=2,3,14,15 kernel parameter in GRUB2 (see https://blog.cbugk.com/post/ryzen-5850x/).
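For context, this is roughly how that parameter gets set; the paths are the standard Debian/Proxmox GRUB2 locations, so take it as a sketch rather than an exact transcript of my config:

    # /etc/default/grub -- append isolcpus to the kernel command line
    GRUB_CMDLINE_LINUX_DEFAULT="quiet isolcpus=2,3,14,15"

    # regenerate the GRUB config and reboot
    update-grub
    reboot

    # afterwards the isolated set is reported here
    cat /sys/devices/system/cpu/isolated    # expected: 2-3,14-15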

However, on Proxmox 8.2 I have set up a Ceph cluster. It had crippling performance of around 2 MB/s. After redoing the cluster I got 20 MB/s while cloning a template VM. I suspected my second-hand enterprise SSDs, but even fresh ones did it (with or without an NVMe DB cache).

But when I checked my faulty cores (2, 3, 14, 15), they were being used. The moment I shut down the computer with the 5900X, the transfer speed jumped to around 100 MB/s on the remaining two nodes. Networking is 10G between each node, and iperf had previously shown 6G throughput, ~~so it cannot be the bottleneck.~~ It was the damn cabling.

Some DuckDuckGo-ing later, I found out that isolcpus= works for user space but not for kernel space.

This watch command (source) gives:

    watch -n1 -- "ps -axo psr,pcpu,uid,user,pid,tid,args --sort=psr | grep -e '^ 2 ' -e '^ 3 ' -e '^ 14 ' -e '^ 15'"

  2  0.0     0 root          27      27 [cpuhp/2]
  2  0.0     0 root          28      28 [idle_inject/2]
  2  0.3     0 root          29      29 [migration/2]
  2  0.0     0 root          30      30 [ksoftirqd/2]
  2  0.0     0 root          31      31 [kworker/2:0-events]
  2  0.0     0 root         192     192 [irq/26-AMD-Vi]
  2  0.0     0 root         202     202 [kworker/2:1-events]
  3  0.0     0 root          33      33 [cpuhp/3]
  3  0.0     0 root          34      34 [idle_inject/3]
  3  0.3     0 root          35      35 [migration/3]
  3  0.0     0 root          36      36 [ksoftirqd/3]
  3  0.0     0 root          37      37 [kworker/3:0-events]
  3  0.0     0 root         203     203 [kworker/3:1-events]
 14  0.0     0 root          99      99 [cpuhp/14]
 14  0.0     0 root         100     100 [idle_inject/14]
 14  0.3     0 root         101     101 [migration/14]
 14  0.0     0 root         102     102 [ksoftirqd/14]
 14  0.0     0 root         103     103 [kworker/14:0-events]
 14  0.0     0 root         210     210 [kworker/14:1-events]
 15  0.0     0 root         105     105 [cpuhp/15]
 15  0.0     0 root         106     106 [idle_inject/15]
 15  0.3     0 root         107     107 [migration/15]
 15  0.0     0 root         108     108 [ksoftirqd/15]
 15  0.0     0 root         109     109 [kworker/15:0-events]
 15  0.0     0 root         211     211 [kworker/15:1-events]

Since Ceph uses a kernel driver, I need a way to isolate these cores from the whole system. Running PID 1 onwards in a taskset is okay. I cannot use cset due to cgroup v2. numactl is also okay.
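To make the user-space side concrete, here is a minimal sketch, assuming a 5900X exposing logical CPUs 0-23 with 2, 3, 14 and 15 being the faulty ones; note it only moves user-space tasks, while the kernel threads listed above stay put:

    # Re-pin PID 1 and all of its threads away from the faulty CPUs;
    # children spawned afterwards inherit the mask, but already-running
    # tasks and kernel threads (kworker/2:*, ksoftirqd/2, ...) do not move.
    taskset -a -c -p 0-1,4-13,16-23 1

    # Roughly the same effect at boot via systemd (again, user space only):
    # in /etc/systemd/system.conf
    #   CPUAffinity=0-1 4-13 16-23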

With isolcpus I do not have apparent system stability issues; without it I would face secure-connection errors in Firefox and OS installs would fail. But even that is not enough when using Ceph. And now I conclude that it could corrupt data unnoticed if this were not my homelab machine.

Can anyone suggest a way to effectively ban these faulty threads as soon as the system allows it, permanently? (I had better use the phrase CPU affinity in this post.)


I was wrong. I redid the Cat6 cables at just the right length, and having routed them clear of the power cables earlier, I can state that interference should be considerably lower than before. The same error was there when I disabled half the cores in the BIOS, including the faulty ones. I get instant VM clones on the Ceph pool now, thanks to the NVMe DB cache I suppose.

Also, the kernel threads on those cores are the per-CPU ones used for scheduling; their PIDs and the set of threads on those cores stay constant in the above watch command, even during a VM clone on the Ceph pool. So if no tasks are being scheduled there, it might be working as intended.
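For what it's worth, a small ad-hoc loop along these lines is how one could double-check that claim (a sketch, not from the original post; the core numbers assume my faulty 2, 3, 14, 15):

    # Sample the faulty cores once a second during a VM clone; a stable,
    # small set of per-CPU kthreads (cpuhp, migration, ksoftirqd, kworker)
    # means nothing new is being scheduled onto them.
    for i in $(seq 60); do
        ps -axo psr,pid,comm --no-headers | \
            awk '$1==2 || $1==3 || $1==14 || $1==15 {print $2, $3}'
        sleep 1
    done | sort -u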

Found these tangentially relevant readings interesting: migration - reddit, nohz - lwn.net

  • I've never tested for kernel threads, but could you offline the cores by writing a zero to /sys/devices/system/cpu/cpu#/online? Note that in a hyperthreaded setup each thread shows as a separate cpu and you can see them by looking at cpu#/topology/thread_siblings_list. Commented May 14, 2024 at 2:02
  • I tried your suggestion but it did not help; things got even worse (for another reason). The processes might have been dead-locked, as they did not vanish. Upon inspection, I found out it was the eth cables. They worked right after plugging them in, but could not withstand a reboot, dropping to mere hundreds of Kbps at best. I edited the question; this was a non-issue as no fresh PID was added to the isolated threads. I can't answer for sure, but it seems like it works as I intended. @StephenHarris thank you for your time, and sorry to bother you with my lack of effort to verify. Well, better late than never. Commented May 14, 2024 at 23:00
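For reference, the offlining suggested in the first comment would look roughly like this (a sketch assuming my faulty logical CPUs 2, 3, 14, 15; run as root, and check thread_siblings_list to see which logical CPUs share a physical core):

    # Take the faulty logical CPUs offline entirely (kernel threads included).
    for c in 2 3 14 15; do
        echo 0 > /sys/devices/system/cpu/cpu$c/online
    done

    # SMT sibling of e.g. CPU 2:
    cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list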
