I have a 24-node cluster that handles 400k to 600k ops/s with a 99th-percentile latency of approximately 15-20 ms.
After restarting the nodes, writes increase to 700k-800k ops/s and write latency drops to approximately 5-15 ms for about 24 hours, after which performance degrades again. I haven't found the reason for this.
I would like to have this performance consistently.
I have looked at several metrics, such as hints, compaction tasks, and garbage collection (GC), comparing the day of the restart with the same day of the previous week. I haven't seen anything significant that would indicate a change.
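
To make the comparison concrete, here is a minimal sketch of the kind of check I mean, not my actual tooling. It assumes each metric has been exported as a list of samples for two windows (a degraded baseline and the period right after the restart) and ranks metrics by the relative change in their median. The metric names and sample values are hypothetical placeholders, not real measurements from my cluster.

```python
from statistics import median


def rank_metric_shifts(baseline, after_restart, min_baseline=1e-9):
    """Rank metrics by relative change in median between two time windows.

    Both arguments map a metric name to a list of numeric samples.
    Returns (name, baseline_median, post_median, relative_change) tuples,
    sorted so the largest absolute relative change comes first.
    """
    shifts = []
    # Only metrics present in both windows can be compared.
    for name in baseline.keys() & after_restart.keys():
        base = median(baseline[name])
        post = median(after_restart[name])
        if abs(base) < min_baseline:
            continue  # relative change is meaningless for a near-zero baseline
        change = (post - base) / abs(base)
        shifts.append((name, base, post, change))
    shifts.sort(key=lambda row: abs(row[3]), reverse=True)
    return shifts


# Hypothetical samples; names and values are illustrative only.
baseline = {
    "write_latency_p99_ms": [18.0, 19.5, 17.2, 20.0],
    "pending_compactions": [12, 14, 13, 12],
    "gc_pause_ms": [40, 42, 41, 39],
}
after_restart = {
    "write_latency_p99_ms": [8.0, 9.5, 7.0, 10.0],
    "pending_compactions": [12, 13, 13, 12],
    "gc_pause_ms": [40, 41, 42, 40],
}

ranked = rank_metric_shifts(baseline, after_restart)
for name, base, post, change in ranked:
    print(f"{name}: {base:.1f} -> {post:.1f} ({change:+.0%})")
```

Medians are used instead of means so that a single spike in either window does not dominate the ranking. With the metrics I have checked so far, only latency and throughput stand out in this kind of comparison.
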
What metrics should I focus on?