13

I'm currently analyzing a process that is considered too slow.

In summary, it's a task that loads a lot of data from Microsoft SQL Server, performs some basic stuff on it, and creates a report. It currently takes from five to six hours, whereas the stakeholders expect it to take three to four hours.

Basic analysis showed that:

  1. The application server that runs the actual task is underused. The CPU usage stays very low (less than 5%), there is plenty of memory, and the SSD is not used much either.
  2. The database server seems to do most of the work. Its CPU usage occasionally averages around 85% (although most of the time it stays around 20%). The memory is used at 99%, but this is expected: the default behavior of SQL Server is to fill all available memory. The SSD is used at roughly 80%.
  3. The network usage, measured at the database server, varies a lot—sometimes it remains at nearly zero, and sometimes it peaks at 2.5 Gbps.

Now, what do I do next to understand where the bottleneck is, given that none of the resources seem to be used at 100%?

I guess there are three possible scenarios:

  1. The network is the issue, and the machines are simply waiting for the data to be sent/received. As both the database server and the application server are on premises, and there are lots of other things going on between all other machines in the data center, if the router is busy with other traffic, it may only allow, say, 500 Mbps between the database server and the application server, whereas a few minutes later, it would make it possible to reach 2.5 Gbps.

  2. There is not one, but multiple factors that cause slowness. For instance, at a given moment, the data may be ready to be sent by the database, but the network is slow; a few minutes later, the network is fast, but the new data is not ready and the CPU or the SSD become the bottleneck.

  3. The issue is somewhere else—maybe something is just idling, while waiting for a lock.

How do I figure out which one of those scenarios is correct—and eventually find how to optimize the task by improving either the hardware or the actual task?

23
  • 7
    Have you tried running an execution plan against your queries? Commented Oct 24 at 14:24
  • 11
    "The CPU usage stays very low (less than 5%)" - how many cores does the app server have? It might be a case of a single-threaded application using only one of them. Report generation sounds like something that should be trivial to parallelise. Commented Oct 24 at 22:47
  • 3
    You're talking about Microsoft, where 100% CPU utilization means all cores active. (Linux load average and other monitoring tools normally report 100% as one core busy, 800% as 8 cores busy, etc. Much more sensible especially for workloads that can't always use all cores.) When you say SSD 80% busy, are you talking about IOPS, bandwidth, or what? Or like 80% of the time there's at least one outstanding request? I assume this number is from Windows resource monitor or something; IDK what metric it uses. Commented Oct 25 at 3:24
  • 3
    After you optimized your tasks, I guess it would be an interesting read how you finally managed it. Commented Oct 25 at 5:07
  • 1
    @PeterCordes: relevant questions, indeed. While I have RDP access to the app server, the performance metrics for the database server are collected by a special app, and it is unclear what exactly it displays. This applies both to the CPU (i.e. whether 100% would mean one core or all cores) and to the SSD (i.e. what exactly is being measured and what busy means in this context). Commented Oct 25 at 7:38

5 Answers

22

When dealing with a process that takes too much time, the tool you need from your toolbox is profiling: find out how much time is spent in each portion of the process.

As we are talking about long time scales (on the order of hours), I would use a logging system to support me in collecting the profiling information. In concrete steps:

  1. Configure your logging system to include timestamps, if that is not already the case.
  2. Break down the slow process into concrete steps, at a granularity that you think will give you good information without having to wade through tons of logging. Preferably, each step performs either a set of calculations, an interaction over the network, or an interaction with the storage device, but not several of those at once. If that makes your steps too small for your comfort, you can also do those smaller steps in a second or third iteration.
  3. Add logging to each step to record when it starts and when it finishes (a minimal sketch follows this list).
  4. Run the process and collect the logs.
  5. Identify which steps take the most time and check for each of them if the time taken is reasonable and to be expected or if it is too long and needs to be optimized. If a step is repeated multiple times, take all repetitions into account for determining how long that step takes within the overall process.
  6. If you identified steps that take too long, but they are still too granular to suggest optimization opportunities, then identify sub-steps within that step and repeat the profiling process.
  7. If the target time is not a hard requirement, but based on a gut feeling of "it should take about that long", then also consider the possibility that the requirement is wrong. In this case, you can use the measurements you collected and your analysis of them as supporting evidence for why the process takes as long as it does.
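
For illustration, here is a minimal sketch of steps 1-4, assuming the task runs on the JVM and using java.util.logging; the step names and the work inside them are placeholders, not taken from the question:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.logging.Logger;

// Minimal per-step timing sketch. Step names are hypothetical.
public class StepTimer {
    private static final Logger LOG = Logger.getLogger(StepTimer.class.getName());

    // Runs one named step and logs when it starts, when it ends, and how long it took.
    static void timed(String stepName, Runnable step) {
        Instant start = Instant.now();
        LOG.info(stepName + " started at " + start);
        try {
            step.run();
        } finally {
            Instant end = Instant.now();
            LOG.info(stepName + " finished at " + end
                    + " (took " + Duration.between(start, end).getSeconds() + " s)");
        }
    }

    public static void main(String[] args) {
        timed("load data from SQL Server", () -> { /* existing extraction code */ });
        timed("compute aggregates",        () -> { /* existing processing code */ });
        timed("render report",             () -> { /* existing report code */ });
    }
}
```

Comparing the per-step durations against the five-to-six-hour total tells you immediately whether the time is dominated by the database reads, the processing, or the report generation.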
9
  • 2
    The OP is probably expecting their workload to take advantage of multiple cores at some points. Profiling to check this would be a good idea. You suggest only logging/recording time at the start/end of each step, but you can also profile CPU utilization on a per-step basis, especially if the steps are non-overlapping. Finding out that some slow step is only using a single CPU core, or is I/O bound (on disk latency instead of throughput perhaps), would suggest an avenue of attack for where to spend some effort parallelizing it or removing serial dependencies. Commented Oct 25 at 3:22
  • 1
    @PeterCordes per-step profiling is generally bad advice due to synchronization issues. Multi-threaded executions can't be evaluated in parts. Commented Oct 25 at 8:11
  • 2
    @Basilevs: You can still profile a part, in terms of how many CPU seconds did it take as well as wall-clock start/stop time even if there is overlap with other parts. Also how long did its threads spend sleeping waiting for disk or waiting for a free CPU. (Those threads are potentially competing with other work on the system, specifically other parts, but can at least rule out e.g. CPU contention if threads spend no time sleeping when they're ready to run, not in disk-sleep.) But yeah, calling them "steps" sounds optimistic and/or simplistic. Commented Oct 25 at 10:43
  • 1
    @Basilevs: By CPU time, I mean what Linux perf stat calls task-clock. CPU time / wall-clock time = number of CPU cores you're keeping busy, on average, for the time interval of that part of your workload. If that's a lot lower than you expected, that tells you something. Memory latency is a reason why CPU time might be higher than you expected to run the same number of instructions, once you drill down into IPC (instructions per cycle) to see if the code has low or high throughput for the time it is running on CPU cores. Commented Oct 25 at 21:07
  • 1
    +1, saw this on the HNQ and "profiling" immediately leapt to mind. If you don't profile, the danger (actually almost a certainty) is that you'll end up wasting days or weeks trying optimize something that isn't even the thing taking all the time. Commented Oct 27 at 18:37
7

The database server has significant CPU and disk load. It may not be at 100%, but that is explained by inefficiencies in pipeline scheduling: the CPU occasionally waits for IO, and occasionally fails to schedule a disk read in advance.

Your DB IS a bottleneck. Profile and optimize queries and their scheduling.

Ensure the DB is always executing at least some queries. The application may have an inefficient approach where results are processed in batches and the request for the next batch is not sent until the previous one has been processed. If possible, schedule the next batch/request before processing the last one. Note that if a query has results ready but the client is not consuming them, the database is doing no work for that query, which may mean your DB is underutilized.
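
A minimal sketch of that overlap, with fetchBatch and processBatch standing in for the application's real query and processing code (both are assumptions, not known from the question):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: keep the database busy by requesting batch n+1 while batch n is processed.
public class PipelinedBatches {
    static List<Object[]> fetchBatch(int offset, int size) { /* run the SQL query */ return List.of(); }
    static void processBatch(List<Object[]> rows) { /* existing processing */ }

    public static void main(String[] args) throws Exception {
        final int batchSize = 50_000;                       // hypothetical batch size
        ExecutorService fetcher = Executors.newSingleThreadExecutor();
        try {
            Future<List<Object[]>> next = fetcher.submit(() -> fetchBatch(0, batchSize));
            for (int offset = 0; ; offset += batchSize) {
                List<Object[]> current = next.get();        // wait for the batch already in flight
                if (current.isEmpty()) break;
                final int nextOffset = offset + batchSize;
                next = fetcher.submit(() -> fetchBatch(nextOffset, batchSize)); // start the next fetch now
                processBatch(current);                      // process while the DB works on the next batch
            }
        } finally {
            fetcher.shutdown();
        }
    }
}
```

The point is only that the next request is already running on the database server while the application processes the current batch, so neither side sits idle waiting for the other.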

If possible, try accessing the DB from multiple threads. This may slow the task down or speed it up, depending on the nature of the task and the index layout.
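
If the data can be partitioned, a rough way to test this is to run the same extraction with a configurable number of worker threads and compare wall-clock times. The connection string, table, and the modulo-based partitioning below are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: partition the extraction across N worker threads and time it,
// so the same job can be compared with 1, 2, 4, ... threads.
public class ParallelExtract {
    private static final String URL =
            "jdbc:sqlserver://dbhost;databaseName=ReportDb;user=app;password=secret;encrypt=true";

    public static void main(String[] args) throws Exception {
        int threads = Integer.parseInt(args[0]);
        long start = System.nanoTime();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> results = new ArrayList<>();
        for (int p = 0; p < threads; p++) {
            final int partition = p;
            results.add(pool.submit(() -> {
                long rows = 0;
                try (Connection c = DriverManager.getConnection(URL);
                     PreparedStatement ps = c.prepareStatement(
                             "SELECT Id, Amount FROM dbo.Sales WHERE Id % ? = ?")) {
                    ps.setInt(1, threads);
                    ps.setInt(2, partition);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            rows++; // real code would feed the row into the report
                        }
                    }
                }
                return rows;
            }));
        }
        long total = 0;
        for (Future<Long> f : results) total += f.get();
        pool.shutdown();

        System.out.printf("%d threads: %d rows in %d s%n",
                threads, total, (System.nanoTime() - start) / 1_000_000_000L);
    }
}
```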

Measure and write down execution times before and after every change - otherwise you may get confused exploring bad options.

5
  • How do I know that it's the inefficient queries that are the problem, and not the network bandwidth or, say, the fact that the database waits for a lock to be released? Commented Oct 25 at 7:41
  • 2
    @ArseniMourzenko the network is not used effectively per the OP. A DB lock is indeed a potential origin of a bottleneck (hence the suggestion to vary the thread count: performance will be better with a lower thread count if a lock is the problem). Commented Oct 25 at 8:09
  • I did not realize the symptoms are described for a production system. The current answers are only applicable to a dedicated test setup. Commented Oct 27 at 11:21
  • @Basilevs, often you need to deploy a patch to production with extra logging because you cannot effectively replicate the problem anywhere else. Commented Oct 27 at 23:30
  • @GregBurghardt I know. The advice would be different in that situation, though. For example, investigating threading issues would be impossible due to the noise of unrelated requests, and profiling and measuring would be very hard in general. I have no advice for production beyond the obvious: inspect the slow request's execution plan and review the application code. Commented Oct 28 at 3:48
4

You have quite a lot of angles of attack for your problem, especially as you don't give us a lot of details on the specifics of your system, its intent, and the way the various components interact with each other.

Notably, you can look at:

  • infrastructure aspects,
  • programming aspects,
  • configuration aspects,
  • data modelling aspects.

Infrastructure

You didn't give us details on your overall architecture and infrastructure, so at this stage we can only conjecture. While your network speed may be fast, your latency may be bad.

In any case, try to measure the latency of a few queries between the querying system and the database server to confirm that it is low and stable. Under 6 ms is good on a local system; under 15 ms is good on a local network. Anything above that is going to be a pain.
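
A quick way to get that number is to time a trivial query in a loop from the application server; this isolates the connection/network round trip from the cost of the real queries (the connection string below is a placeholder):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch: measure round-trip latency from the application server to SQL Server
// with a trivial query.
public class DbPing {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://dbhost;databaseName=ReportDb;user=app;password=secret;encrypt=true";
        try (Connection c = DriverManager.getConnection(url);
             PreparedStatement ps = c.prepareStatement("SELECT 1")) {
            for (int i = 0; i < 20; i++) {
                long t0 = System.nanoTime();
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                }
                System.out.printf("round trip %2d: %.2f ms%n", i, (System.nanoTime() - t0) / 1e6);
            }
        }
    }
}
```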

Programming / Behavior

We don't know how your program interacts with the SQL server: do you fetch a lot of content with many small queries, or with a few large ones? Do you use connection pooling? What are the different configurations?

Configuration

Several aspects of your system's configuration can affect performance (things like MTU, etc.); however, it's not generally what I'd look at first for major gains.

On the DB side, though, you may also have several configuration options to optimize performance.

Data Modelling

The way your data is modelled may have a dramatic impact on your query performance. Have you tried examining the execution plan of your queries to check whether they can be optimized?
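
On SQL Server, one programmatic way to get a plan is SET SHOWPLAN_XML, which makes subsequent batches return the estimated plan as XML instead of executing them (it requires SHOWPLAN permission; the same plan can also be viewed graphically in SSMS). A rough sketch, with a placeholder connection string and query:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: fetch the estimated execution plan of one of the report queries as XML.
public class ShowPlan {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://dbhost;databaseName=ReportDb;user=app;password=secret;encrypt=true";
        String reportQuery = "SELECT Id, Amount FROM dbo.Sales WHERE SaleDate >= '2025-01-01'";
        try (Connection c = DriverManager.getConnection(url);
             Statement s = c.createStatement()) {
            s.execute("SET SHOWPLAN_XML ON");            // following batches return plans, not results
            try (ResultSet rs = s.executeQuery(reportQuery)) {
                while (rs.next()) {
                    System.out.println(rs.getString(1)); // plan XML: look for scans, spools, big row counts
                }
            }
            s.execute("SET SHOWPLAN_XML OFF");
        }
    }
}
```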


Troubleshooting

You give us a few details about OS metrics, but not many details on what your system intends to do and what it actually does (where the time is spent).

I would recommend that you look at performing a sampling analysis: run a program that generates thread dumps for the programming stack you use, and do a statistical analysis of the time spent in the different layers of each stack trace. That gives you a view of where your program (and the OS) is spending time, and this way you can identify which type of bottleneck you're dealing with.
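
If the report job happens to run on the JVM, a crude in-process sampler can be built on ThreadMXBean; in practice the loop below would run as a daemon thread inside the report process (or be replaced by periodic jstack dumps of that process). It simply counts which top stack frames the threads spend their time in:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Crude stack sampler: once per second, record the state and top frame of every thread.
public class StackSampler {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<String, Integer> counts = new HashMap<>();

        for (int sample = 0; sample < 600; sample++) {            // ~10 minutes at 1 Hz
            for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
                StackTraceElement[] stack = info.getStackTrace();
                if (stack.length == 0) continue;
                counts.merge(info.getThreadState() + " " + stack[0], 1, Integer::sum);
            }
            Thread.sleep(1_000);
        }
        counts.entrySet().stream()                                // most frequent frames first
              .sorted((a, b) -> Integer.compare(b.getValue(), a.getValue()))
              .limit(20)
              .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

Frames that dominate the samples show where the wall-clock time goes; threads that are mostly WAITING or BLOCKED point at locks (note that a thread stuck on a socket read usually shows as RUNNABLE with the read call on top of its stack).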

Others have suggested profiling. If you're comfortable with that, that's fine, but for performance analysis that's reproducible across the board, it wouldn't be my favorite approach. And as you have a process that needs to run in a specific time frame on data sets you can reproduce, sampling seems a better approach to me.


Finally, considering what you're reporting (but again, we are severely lacking details here and shooting in the dark), there may be several aspects at play. However, I'd be hinting towards something stalling on IO_WAIT, which gives you the impression that everything runs peachy while the process is in fact blocked somewhere, sitting and waiting on I/O.

2

Now, what do I do next to understand where the bottleneck is, given that none of the resources seem to be used at 100%?

A fundamental problem is latency. If the CPU makes many small requests to the disk or to a database, the total time may be dominated by various kinds of fixed costs, like the round-trip time to actually transmit the information. This time may not show up in performance metrics, since both sides are mostly waiting for each other and would have capacity for other work if such work were available.

The solution is usually to make fewer but larger requests or queries to reduce the effect of latency. Simply starting to process database results as they are received can also help a fair bit. But concurrency and granularity can be complicated: while they can greatly help with performance, they can also be difficult to get right.
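
As a concrete (if simplified) illustration of "fewer but larger requests", compare a per-key loop against a single set-based query; the table, column names, and connection string are made up for the sketch:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.List;

// Sketch: one set-based query instead of one query (and one network round trip) per key.
public class FewerLargerQueries {
    private static final String URL =
            "jdbc:sqlserver://dbhost;databaseName=ReportDb;user=app;password=secret;encrypt=true";

    // Chatty version: pays the round-trip latency once per customer.
    static void perCustomer(Connection c, List<Integer> customerIds) throws Exception {
        try (PreparedStatement ps = c.prepareStatement(
                "SELECT Amount FROM dbo.Sales WHERE CustomerId = ?")) {
            for (int id : customerIds) {
                ps.setInt(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) { /* accumulate */ }
                }
            }
        }
    }

    // Set-based version: one round trip, the database does the grouping.
    static void allCustomers(Connection c) throws Exception {
        try (PreparedStatement ps = c.prepareStatement(
                "SELECT CustomerId, SUM(Amount) AS Total FROM dbo.Sales GROUP BY CustomerId");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) { /* accumulate rs.getInt(1), rs.getBigDecimal(2) */ }
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(URL)) {
            allCustomers(c);   // compare wall-clock time against perCustomer(c, ids)
        }
    }
}
```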

How do I figure out which one of those scenarios is correct—and eventually find how to optimize the task by improving either the hardware or the actual task?

Use the tools available to see if you can confirm or disprove each hypothesis:

  • There are tools to simulate bandwidth restrictions and latency. If these have little effect, the network is probably not the problem.
  • 99% memory and 80% disk load are high enough that I would be concerned. Upgrading hardware can be relatively cheap and may be enough. Simulating added memory/disk/CPU load could also reveal potential bottlenecks.
  • If a database is involved it should be one of the primary suspects. There are specialized tools available to check for common problems, like swapping, lack of indexes, lots of small queries, lock contention and so on. Check the documentation for your database for the specifics.
  • There are CPU profilers to check what an application is doing and what takes most time, even if CPU time is less likely to be a problem in your case.
  • Just reading the source code can be illuminating, even if just to get a rough understanding of the overall code quality and what kind of problems you can expect.

Chances are that the main limitation is the software design and architecture. Computers are ridiculously fast when used well, but most software development stops once something works well enough. Ensuring the solution scales well is often not considered or tested enough.

It is not that uncommon to improve performance by orders of magnitude with some fairly simple fixes. But that might require a good understanding of the application to make sure you understand the problems and ensure it still works correctly after any fixes. Gaining that understanding can be expensive if the project lacks documentation, automated tests, and the original developers are gone.

1
  • Given the nature of DB requests (they are transferred over a relatively slow network), large requests have to be managed carefully to really defeat latency. In particular, they should be processed in small batches, so that the network is never idle while the previous batch or request is being processed. Commented Oct 29 at 11:30
2

I don't think it's possible to know which issue is the exact cause without a much greater level of detail, and I don't think it's possible or desirable to provide that level of information here. There are a number of other possible causes you should consider, though. Here are a few that I would look into based on your description:

  1. This is my main suspect: the client is executing queries and only beginning to process the results after all of them are fully retrieved. This is a very common mistake, and I have had many experiences with developers who are committed to the (very wrong) idea that this improves performance. You may already know this, but if you need me to elaborate, I will happily do so.
  2. There's some sort of contention within the system, most likely on your client but potentially on the database side as well. For example, all your threads are synchronized on the same lock so only one executes at a time. I've seen issues in databases where things like a missing foreign key can cause unexpected lock contention.
  3. This might sound silly but things like printing a lot to a terminal session can have really surprising negative performance impacts. Simply redirecting to a file can resolve this.

This is by no means comprehensive. These are a few of the kinds of common issues I have found when resolving performance problems; there are many more. For example, I was reminded the other day that JDBC, by default, uses a tiny 10-record fetch size when retrieving results. For a lot of needs this is fine, but if you are retrieving a large number of results, it creates a lot of chattiness and latency.
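
For completeness, the knob in question is Statement.setFetchSize. Defaults differ between drivers (the 10-row default is the Oracle driver's; Microsoft's SQL Server driver uses adaptive response buffering), but raising it is a cheap thing to try when reading large result sets. A sketch with placeholder connection details:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch: raise the JDBC fetch size so large result sets are streamed in bigger chunks.
public class FetchSizeExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://dbhost;databaseName=ReportDb;user=app;password=secret;encrypt=true";
        try (Connection c = DriverManager.getConnection(url);
             PreparedStatement ps = c.prepareStatement(
                     "SELECT Id, Amount FROM dbo.Sales WHERE SaleDate >= '2025-01-01'")) {
            ps.setFetchSize(10_000);                    // hint: fetch rows in batches of 10k
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process each row as it arrives instead of materializing everything first
                }
            }
        }
    }
}
```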

I had another thought of maybe low usefulness: I've been really surprised at how much async IO (or non-blocking IO) can improve performance for IO-bound systems. I know that's the point, but my intuition tends to underestimate the impact of introducing non-blocking IO calls. I'm not sure how stable non-blocking database drivers are for SQL Server, but on a cursory search it seems there may be some available.
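
Even without a dedicated non-blocking driver, some of that overlap can be had by pushing blocking calls onto their own executor. A minimal sketch, with the query and report methods as stand-ins for the application's real code:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: overlap I/O-bound work by wrapping blocking calls in CompletableFuture.
public class OverlappedIo {
    static List<Object[]> runQueryA() { /* blocking JDBC call */ return List.of(); }
    static List<Object[]> runQueryB() { /* blocking JDBC call */ return List.of(); }
    static void writeReport(List<Object[]> a, List<Object[]> b) { /* blocking file I/O */ }

    public static void main(String[] args) {
        ExecutorService io = Executors.newFixedThreadPool(4);   // pool sized for I/O waits
        try {
            CompletableFuture<List<Object[]>> a = CompletableFuture.supplyAsync(OverlappedIo::runQueryA, io);
            CompletableFuture<List<Object[]>> b = CompletableFuture.supplyAsync(OverlappedIo::runQueryB, io);
            // Both queries are now in flight at the same time; combine once both finish.
            a.thenAcceptBoth(b, OverlappedIo::writeReport).join();
        } finally {
            io.shutdown();
        }
    }
}
```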
