
I have to query once every 10 minutes for the number of users that have been active during the last 1, 24, 7×24 (one week) and 30×24 (one month) hours, from a data pool where we store one row per user action.

When a user does something, we store the hashed userId, the hashed action, the timestamp and the group the user belongs to in a table. This table is used for a lot of statistical purposes (e.g. deciding which features are used most, which features lead to user loss, and so on).

However, the query that runs most often against this table is the one that gets the number of unique users in a given period of time.

SELECT
    count("user") as "1m",
    count(*) FILTER (WHERE "timestamp" >= (now() - interval '7 days')::timestamp) as "1w",
    count(*) FILTER (WHERE "timestamp" >= (now() - interval '1 day')::timestamp) as "1d",
    count(*) FILTER (WHERE "timestamp" >= (now() - interval '1 hour')::timestamp) as "1h"
FROM (
    SELECT
        "user" as "user",
        (max(timestamp) + interval '1 hour')::timestamp as "timestamp"
    FROM public.user_activity
    WHERE
        public.user_activity."timestamp" >= (now() - interval '1 month')::timestamp
        AND "system" = 'enterprise'
    GROUP BY  "user"
) as a

so in the subquery

  • we select entries whose timestamp falls within the last month and that belong to a given system
  • we group these entries by user
  • for each grouped user we select the userId and that user's last timestamp

this subquery usually returns between 10k and 100k rows (but should work for more, too)

then we do another query on this subquery:

  • we count the number of entries, which gives us the users active in the last month
  • we count the filtered number of entries whose timestamp is newer than a specific point in time

This query runs on a few million entries (growing rapidly).

How can I improve the query to run faster? What indexes would be beneficial? (We're on AWS RDS and hitting the IOPS limit of our 100 GB SSD.)

  • You might want to take a look at TimescaleDB (a plugin to Postgres): docs.timescale.com/latest/using-timescaledb/reading-data - I think that your activity log is kind of what TSDBs are made for... But I don't know if those are supported by AWS... Commented Feb 8, 2021 at 14:07
  • count(*) as "1m" would be slightly faster than count(user) as "1m" Commented Feb 8, 2021 at 14:12
  • Could you show us the result from EXPLAIN (ANALYZE, BUFFERS) ? Then we can see what part is slow and start guessing how to improve. Commented Feb 8, 2021 at 14:46
  • it's kind of hard to execute that at the moment, as the database is inside our VPC and the VPN only reaches our main office, while all our developers are working from home right now. I would need to tunnel into some machine and run it from there, or change the application logic to log that output somewhere Commented Feb 8, 2021 at 16:04
  • @Tobi you really need a test database unless you enjoy beating your head against a wall. Commented Feb 8, 2021 at 21:03

1 Answer


I would recommend an index on user_activity(system, timestamp, user).
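A minimal sketch of that index as DDL (the index name is my own choice; `CONCURRENTLY` avoids blocking writes while the index is built on a live table):

```sql
-- Matches the subquery's filters: equality on "system" first,
-- then the range condition on "timestamp", with "user" included
-- so the subquery can potentially use an index-only scan.
CREATE INDEX CONCURRENTLY idx_user_activity_system_ts_user
    ON public.user_activity ("system", "timestamp", "user");
```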

If all the table's rows are visible to all transactions (so Postgres can use an index-only scan), then this index covers the subquery.

Not much can be done for the outer query.

However, I wonder whether phrasing the query like this:

SELECT . . .
FROM (SELECT DISTINCT ON ("user") "user",
             ("timestamp" + interval '1 hour')::timestamp as "timestamp"
      FROM public.user_activity ua
      WHERE "system" = 'enterprise'
      ORDER BY "user", "timestamp" DESC
     ) ua
WHERE ua."timestamp" >= now() - interval '1 month';

(The quotes around "user" matter: unquoted, user is a reserved word in Postgres that evaluates to the current database user rather than your column.)

(Note: The filtering may be off by an hour. It is a bit hard to follow your exact date filtering logic.)

combined with an index on user_activity(system, user, timestamp desc), might provide better performance.
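A sketch of that second index (again, the name is arbitrary):

```sql
-- Supports DISTINCT ON ("user") ... ORDER BY "user", "timestamp" DESC
-- within a single "system": the equality column comes first,
-- followed by the sort keys in the order the query needs them.
CREATE INDEX CONCURRENTLY idx_user_activity_system_user_ts
    ON public.user_activity ("system", "user", "timestamp" DESC);
```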


