
Sometimes a PostgreSQL database I support is overloaded. I need to pinpoint the exact queries that are causing the overload. I have generated a pgFouine report for the relevant time frame: http://bit.ly/f2aikx

In that report there is this row:
Query peak: 4 queries/s at 2011-03-15 10:41:06, 2011-03-15 10:44:18, 2011-03-15 10:48:25

It is not clear whether the mentioned timestamps are for the start or end of the peak.

  • The first step in debugging such problems is pinpointing the bottleneck. Is it overloading the CPU? Is it waiting for disk I/O? Reads or writes? Random accesses or sequential accesses? Generally the atop tool is very good in this respect, as it summarizes lots of resources at a glance. Commented Mar 17, 2011 at 16:45
  • The VM's top shows that all 4 CPUs are at 100% at that moment. But the VMware management utility (VMware Infrastructure Client) shows that this particular VM does not use more than 30% of its available CPU resources. I checked esxtop when the overload occurs and it's reading a lot (~300 IOPS). I'm not sure how to interpret the atop output. Commented Mar 19, 2011 at 6:49
  • If you get conflicting readings like these from the VM host and guest, it sounds like a virtualization problem. If the guest reports consuming 100% of available resources, your CPU usage is probably being throttled by the VM host. Commented Mar 19, 2011 at 16:16

1 Answer


I'm not sure what the question is, but I'll take a stab at a few things:

The "Query peak" metric is referring to three separate seconds where you saw a peak throughput of 4 queries per second.

Here's how I would approach pinpointing your problematic queries:

  1. Define "overloaded" for this instance; that will help you determine what is actually causing the problem. Let's assume overloaded means "slow queries".
  2. Examine slow queries in the pgFouine output. It helpfully groups them in the "Queries that took the most time (N)" section; there you can also click "Show Examples" to see a few of the queries that are giving you grief.
  3. Take a sample of those queries and run EXPLAIN ANALYZE on them to get actual execution plans (a sketch follows this list).
  4. Look at the other plans running at the same time; these may be causing I/O contention.
  5. Analyze the plans yourself, or use http://explain.depesz.com/ to get an analysis of your execution plans. Watch for things like table spools.
  6. Tune queries or adjust PostgreSQL settings accordingly.
  7. Rinse and repeat.
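For step 3, here's a minimal sketch; the table and filter are hypothetical, only the shape of the command matters:

    -- EXPLAIN ANALYZE actually executes the statement and reports real row
    -- counts and timings, so run it on a copy or during a quiet period.
    EXPLAIN ANALYZE
    SELECT order_id, total
    FROM orders                  -- hypothetical table
    WHERE customer_id = 42
    ORDER BY created_at DESC;

The text output can be pasted straight into explain.depesz.com for step 5.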

In the long run, I would log only queries that execute for over 100 ms, so pgFouine has less noise to chew through. You can do that with the log_min_duration_statement setting in the postgresql.conf file.
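In postgresql.conf that looks like:

    # Log only statements that run for more than 100 ms.
    # -1 disables duration-based logging, 0 logs every statement.
    log_min_duration_statement = 100

A config reload (e.g. pg_ctl reload, or SELECT pg_reload_conf(); as a superuser) picks the change up without a restart.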


Comments

Overloaded: the web app takes ages to load the next page. To generate the page, the web app on a neighbouring machine performs database queries against the DB VM, but the latter is not responding because all 4 processors are already busy processing queries (or, more likely, waiting on I/O, since VMware Infrastructure Client shows the VM is not actually consuming much CPU).
I used the following command line to watch PostgreSQL in real time: watch -n 0.1 "psql --host=/tmp mydb -c \"SELECT procpid, query_start, current_query FROM pg_stat_activity WHERE datname = 'mydb' AND current_query <> '<IDLE>' ORDER BY query_start;\""
@jeremiah-peschka, here's the EXPLAIN ANALYZE: explain.depesz.com/s/4bi What should I look for here? What are table spools?
A table spool occurs when a database is unable to keep a "work table" in memory, so it writes a temporary table to disk and then uses that for sorting and other operations. You could try increasing work_mem for that query with SET work_mem = '10MB'; see this post for more. Edit: you don't need to worry about work_mem here; it looks like your query stays in memory.
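A minimal sketch of that per-session experiment, using a hypothetical orders table since the real query isn't shown:

    -- Raise work_mem for this session only, then re-check the plan.
    SET work_mem = '10MB';
    EXPLAIN ANALYZE
    SELECT customer_id, sum(total)
    FROM orders                 -- hypothetical table standing in for the real one
    GROUP BY customer_id
    ORDER BY sum(total) DESC;
    -- In the plan output, "Sort Method: external merge  Disk: ..." means the
    -- sort spilled to disk; "Sort Method: quicksort  Memory: ..." means it fit in RAM.
    RESET work_mem;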
Looking at the explain, there are a few problem places. Toward the bottom there's an aggregate that returns 1 row but is executed nearly 700,000 times, which means whatever subquery performs that aggregate runs 700,000 times. I would try to rewrite that subquery as a set-based join (a sketch of the rewrite follows).
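Since the actual query isn't shown, here is the general shape of that rewrite, with made-up customers/orders tables:

    -- Before: correlated subquery; the aggregate runs once per outer row.
    SELECT c.id,
           (SELECT count(*) FROM orders o WHERE o.customer_id = c.id) AS n_orders
    FROM customers c;

    -- After: aggregate once over the whole table, then join the result in.
    SELECT c.id, coalesce(n.n_orders, 0) AS n_orders
    FROM customers c
    LEFT JOIN (SELECT customer_id, count(*) AS n_orders
               FROM orders
               GROUP BY customer_id) n ON n.customer_id = c.id;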