I'm seeing high response times on Elasticsearch searches (took is around 5000 ms), but when I enable profile, the reported query time is low (~15 ms). I think this only happens when the request rate is high, yet CPU is far from saturated (~50%).
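For reference, here is a minimal sketch of how the two numbers can be compared from a single search response that was sent with profile enabled. It assumes the standard response layout (took in milliseconds, and profile.shards[].searches[].query[].time_in_nanos). The helper names profiled_query_ms and unaccounted_ms are made up for illustration. As far as I understand, the profile output in this version covers only the query phase, so the remainder may include fetch-phase work, queueing and coordination.

```python
def profiled_query_ms(response: dict) -> float:
    """Return the slowest shard's profiled query time in milliseconds.

    Top-level query entries already include the time of their children,
    so only the top level is summed. Shards run in parallel, hence max().
    """
    shard_totals = []
    for shard in response.get("profile", {}).get("shards", []):
        nanos = sum(
            query["time_in_nanos"]
            for search in shard["searches"]
            for query in search["query"]
        )
        shard_totals.append(nanos)
    return max(shard_totals, default=0) / 1e6


def unaccounted_ms(response: dict) -> float:
    """Time reported in `took` that the query profile does not explain."""
    return response["took"] - profiled_query_ms(response)


# Hand-written response with the shape described above (not real output).
example_response = {
    "took": 5000,
    "profile": {
        "shards": [
            {"searches": [{"query": [{"time_in_nanos": 15_000_000}]}]},
            {"searches": [{"query": [{"time_in_nanos": 10_000_000}]}]},
        ]
    },
}

print(profiled_query_ms(example_response))
print(unaccounted_ms(example_response))
```

A large value from unaccounted_ms would suggest the time is being spent somewhere other than the part of the search that profile measures.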
I'm requesting many items (size=4096), but I set _source=false to exclude document data. If I lower size to 10, responses are very fast (took=35). With from=4096 and size=10, responses are still very fast. I'm using track_total_hits=true, but removing it doesn't seem to make any difference. The index is quite small (<2GB) and should definitely fit in cache (96GB RAM). Heap usage is around 75% most of the time. CPU usage varies between 20% and 80%.
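The three request shapes compared above can be written out as search bodies like the following. This is only a sketch: the match_all query is a placeholder for the actual query, which is not shown here, and build_search_body and VARIANTS are illustrative names.

```python
import json


def build_search_body(size: int, from_: int = 0, track_total_hits: bool = True) -> dict:
    """Build a search request body matching the settings described above."""
    return {
        "query": {"match_all": {}},  # placeholder, the real query is not shown
        "from": from_,
        "size": size,
        "_source": False,  # exclude document data from the hits
        "track_total_hits": track_total_hits,
        "profile": True,  # include per-shard query timings in the response
    }


VARIANTS = {
    "slow: size=4096": build_search_body(size=4096),
    "fast: size=10": build_search_body(size=10),
    "fast: from=4096, size=10": build_search_body(size=10, from_=4096),
}

for name, body in VARIANTS.items():
    print(name)
    print(json.dumps(body, indent=2))
```

Each body would be sent to the index's _search endpoint. Only the first variant asks Elasticsearch to actually return 4096 hits, while the third still has to collect the same number of top documents but returns only 10 of them.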
I've looked at perf_events and I'm seeing a lot of CPU time (40 cores at ~30% in osq_lock()) coming from ZFS read locks, so I suspect this might be a problem with the filesystem.
[Image: perf top during slow response times]
I would not expect filesystem reads to be the bottleneck. I'm a bit surprised it's reading from disk at all, since I assumed that would not be necessary with _source=false. I'm really just interested in the document ids. Could it be that ZFS is just not suited for Elasticsearch?
Software versions:
- Elasticsearch 7.10.1 (official docker image)
- ZFS: 0.7.8
