I'm running GeoDjango / PostgreSQL 9.1 / PostGIS, and I'm trying to get the following query (and others like it) to run faster.

[query snipped for brevity]

SELECT "crowdbreaks_incomingkeyword"."keyword_id"
       , COUNT("crowdbreaks_incomingkeyword"."keyword_id") AS "cnt" 
  FROM "crowdbreaks_incomingkeyword"
 INNER JOIN "crowdbreaks_tweet"
       ON ("crowdbreaks_incomingkeyword"."tweet_id"
          = "crowdbreaks_tweet"."tweet_id")
  LEFT OUTER JOIN "crowdbreaks_place"
    ON ("crowdbreaks_tweet"."place_id"
       = "crowdbreaks_place"."place_id") 
 WHERE (("crowdbreaks_tweet"."coordinates"
        @ ST_GeomFromEWKB(E'\\001 ... \\000\\000\\000\\0008@'::bytea)
       OR ST_Overlaps("crowdbreaks_place"."bounding_box"
                     , ST_GeomFromEWKB(E'\\001...00\\000\\0008@'::bytea)
       )) 
   AND "crowdbreaks_tweet"."created_at" > E'2012-04-17 15:46:12.109893'
   AND "crowdbreaks_tweet"."created_at" < E'2012-04-18 15:46:12.109899' ) 
 GROUP BY "crowdbreaks_incomingkeyword"."keyword_id"
         , "crowdbreaks_incomingkeyword"."keyword_id"
    ;

Here is what the crowdbreaks_tweet table looks like:

\d+ crowdbreaks_tweet;
                       Table "public.crowdbreaks_tweet"
    Column     |           Type           | Modifiers | Storage  | Description 
---------------+--------------------------+-----------+----------+-------------
 tweet_id      | bigint                   | not null  | plain    | 
 tweeter       | bigint                   | not null  | plain    | 
 text          | text                     | not null  | extended | 
 created_at    | timestamp with time zone | not null  | plain    | 
 country_code  | character varying(3)     |           | extended | 
 place_id      | character varying(32)    |           | extended | 
 coordinates   | geometry                 |           | main     | 
Indexes:
    "crowdbreaks_tweet_pkey" PRIMARY KEY, btree (tweet_id)
    "crowdbreaks_tweet_coordinates_id" gist (coordinates)
    "crowdbreaks_tweet_created_at" btree (created_at)
    "crowdbreaks_tweet_place_id" btree (place_id)
    "crowdbreaks_tweet_place_id_like" btree (place_id varchar_pattern_ops)
Check constraints:
    "enforce_dims_coordinates" CHECK (st_ndims(coordinates) = 2)
    "enforce_geotype_coordinates" CHECK (geometrytype(coordinates) = 'POINT'::text OR coordinates IS NULL)
    "enforce_srid_coordinates" CHECK (st_srid(coordinates) = 4326)
Foreign-key constraints:
    "crowdbreaks_tweet_place_id_fkey" FOREIGN KEY (place_id) REFERENCES crowdbreaks_place(place_id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "crowdbreaks_incomingkeyword" CONSTRAINT "crowdbreaks_incomingkeyword_tweet_id_fkey" FOREIGN KEY (tweet_id) REFERENCES crowdbreaks_tweet(tweet_id) DEFERRABLE INITIALLY DEFERRED
    TABLE "crowdbreaks_tweetanswer" CONSTRAINT "crowdbreaks_tweetanswer_tweet_id_id_fkey" FOREIGN KEY (tweet_id_id) REFERENCES crowdbreaks_tweet(tweet_id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no

And here is the explain analyze for the query:

 HashAggregate  (cost=184022.03..184023.18 rows=115 width=4) (actual time=6381.707..6381.769 rows=62 loops=1)
   ->  Hash Join  (cost=103857.48..183600.24 rows=84357 width=4) (actual time=1745.449..6377.505 rows=3453 loops=1)
         Hash Cond: (crowdbreaks_incomingkeyword.tweet_id = crowdbreaks_tweet.tweet_id)
         ->  Seq Scan on crowdbreaks_incomingkeyword  (cost=0.00..36873.97 rows=2252597 width=12) (actual time=0.008..2136.839 rows=2252597 loops=1)
         ->  Hash  (cost=102535.68..102535.68 rows=80544 width=8) (actual time=1744.815..1744.815 rows=3091 loops=1)
               Buckets: 4096  Batches: 4  Memory Usage: 32kB
               ->  Hash Left Join  (cost=16574.93..102535.68 rows=80544 width=8) (actual time=112.551..1740.651 rows=3091 loops=1)
                     Hash Cond: ((crowdbreaks_tweet.place_id)::text = (crowdbreaks_place.place_id)::text)
                     Filter: ((crowdbreaks_tweet.coordinates @ '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry) OR ((crowdbreaks_place.bounding_box && '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry) AND _st_overlaps(crowdbreaks_place.bounding_box, '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry)))
                     ->  Bitmap Heap Scan on crowdbreaks_tweet  (cost=15874.18..67060.28 rows=747873 width=125) (actual time=96.012..940.462 rows=736784 loops=1)
                           Recheck Cond: ((created_at > '2012-04-17 15:46:12.109893+00'::timestamp with time zone) AND (created_at < '2012-04-18 15:46:12.109899+00'::timestamp with time zone))
                           ->  Bitmap Index Scan on crowdbreaks_tweet_created_at  (cost=0.00..15687.22 rows=747873 width=0) (actual time=94.259..94.259 rows=736784 loops=1)
                                 Index Cond: ((created_at > '2012-04-17 15:46:12.109893+00'::timestamp with time zone) AND (created_at < '2012-04-18 15:46:12.109899+00'::timestamp with time zone))
                     ->  Hash  (cost=217.11..217.11 rows=6611 width=469) (actual time=15.926..15.926 rows=6611 loops=1)
                           Buckets: 1024  Batches: 4  Memory Usage: 259kB
                           ->  Seq Scan on crowdbreaks_place  (cost=0.00..217.11 rows=6611 width=469) (actual time=0.005..6.908 rows=6611 loops=1)
 Total runtime: 6381.903 ms
(17 rows)

That's a pretty bad runtime for the query. Ideally, I'd like to get results back in a second or two.

I've increased shared_buffers on Postgres to 2GB (I have 8GB of RAM), but other than that I'm not quite sure what to do. What are my options? Should I do fewer joins? Are there any other indexes I can throw on there? The sequential scan on crowdbreaks_incomingkeyword doesn't make sense to me. It's a table of foreign keys to other tables, and thus has indexes on it.

  • If I 'SET enable_seqscan TO off;' before running the EXPLAIN ANALYZE, the query time drops to 1.8 seconds. However, everything I read says I shouldn't really do this (a session-scoped sketch follows these comments). Commented Apr 19, 2012 at 16:23
  • I pasted your EXPLAIN output here: explain.depesz.com/s/EfH to get a better view. Commented Apr 19, 2012 at 16:43
  • BTW: what is the impact of duplicating the term "crowdbreaks_incomingkeyword"."keyword_id" in the GROUP BY clause? (It could be that the optimiser forgets to remove this redundancy and computes the specificity of groups as 1/(number of groups)².) Commented Apr 19, 2012 at 22:10
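
The planner toggle mentioned in the first comment can be scoped to the current session, which is a safer way to compare plans than changing it globally. A minimal sketch, assuming the full query above is substituted for the placeholder SELECT:

    SET enable_seqscan = off;     -- discourage sequential scans in this session only
    EXPLAIN ANALYZE SELECT ...;   -- placeholder: re-run the full query above
    RESET enable_seqscan;         -- restore the default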

1 Answer


Judging from your comment, I would try two things:

  • Raise the statistics target for the involved columns (and run ANALYZE).

    ALTER TABLE tbl ALTER COLUMN col SET STATISTICS 1000;
    

The data distribution may be uneven. A bigger sample may provide the query planner with more accurate estimates.
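
Applied to this schema, that could look like the sketch below; the choice of columns (the ones the query filters and joins on) is an assumption, and 1000 is simply the value used above.

    -- Assumed targets: the filter/join columns from the query above.
    ALTER TABLE crowdbreaks_tweet
        ALTER COLUMN created_at SET STATISTICS 1000,
        ALTER COLUMN place_id   SET STATISTICS 1000;
    ALTER TABLE crowdbreaks_incomingkeyword
        ALTER COLUMN tweet_id SET STATISTICS 1000;
    ANALYZE crowdbreaks_tweet;
    ANALYZE crowdbreaks_incomingkeyword;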

  • Play with the cost settings in postgresql.conf. Your sequential scans might need to be more expensive relative to your index scans for the planner to choose good plans.

Try lowering cpu_index_tuple_cost and setting effective_cache_size as high as three quarters of your total RAM for a dedicated DB server.
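
Before editing postgresql.conf, the cost settings can also be overridden for the current session to see how the plan reacts; the concrete values below are illustrative assumptions, not recommendations.

    -- Session-level experiment; persist whatever helps in postgresql.conf afterwards.
    SET effective_cache_size = '6GB';   -- roughly three quarters of 8GB RAM
    SET cpu_index_tuple_cost = 0.001;   -- default is 0.005
    -- then re-run EXPLAIN ANALYZE on the query above and compare the plans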


7 Comments

Do I raise statistics for the columns I use to filter, the columns I select, or both?
@Khandelwal: The ones you filter and join on are relevant for planning. Just selecting is irrelevant.
@Khandelwal: If most of the relevant parts of your database fit into RAM, you should set random_page_cost even lower - only slightly above seq_page_cost. I have 1 for seq_page_cost and 1.1 for random_page_cost for db clusters that are mostly cached. In RAM, random access is basically just as fast as sequential access; only on disk is it much slower.
Another setting I have often found helpful is raising cpu_tuple_cost to somewhere in the range of 0.03 to 0.05. This may be somewhat redundant with lowering cpu_index_tuple_cost, since cost factors are relative. BTW, the reason raising effective_cache_size would help is that it tells the planner how much of an index is likely to stay in cache when repeatedly accessed within a single query; higher values tend to make indexed access look less expensive.
Normally, I set effective_cache_size as high as possible (available memory minus memory used by all other processes; in your case about 8 - 2 = 6GB). I tend to keep shared_buffers low. I've heard rumours about shared_buffers being a pain on VMs (probably because they impose some kind of core lock). Trusting the OS's LRU disk cache seems to be a good choice. @Khandelwal: could you please add the resulting query plan to your question?