
I use Postgres on my web server to record incoming queries in a table called calls2, basically writing a single row each time with lots of repeating information, such as a date field ("when") recording when the row was written, plus lots of other statistical information (IP address, query parameters etc.). One field also identifies each event's caller (uid).

Now the table has gotten way too large, and I'd like to thin it out by removing all but the oldest and the newest entry (based on the date field, which I called "when"), because that's all I really need.

So, assuming I group the rows by their uid (integer), how do I select only the oldest and the newest row (which could be the same) for each uid? And how do I select all the others, so that I can delete them all at once, for all uids?

I've created a db fiddle with an example:

id  uid  when        other
 1  11   2010-01-01  a1
 2  11   2010-01-02  a2
 3  11   2010-01-03  a3
 4  11   2010-01-04  a4
 5  22   2010-01-01  b1
 6  33   2010-01-01  c1
 7  33   2010-01-02  c2
 8  44   2010-01-01  d1
 9  44   2010-01-02  d2
10  44   2010-01-03  d3

The goal would be to identify ids 2, 3 and 9 for deletion.

It would also be helpful to see how I can select the other IDs in a way that lists a single row for each uid, showing the uid plus the oldest and latest values of "when" and other. Or at least just list every row that has the lowest or highest entry for each uid (i.e. the inverse of what I want to delete).

I need help with this because anything involving grouping or clustering in Postgres breaks my head.

5 Answers


You can detect whether a row is the first or the last one for its uid by ranking with row_number() and counting the rows per uid.

The first row has rn=1; the last row has rn=uidCnt.
If only one row exists for a uid, then rn=1 and uidCnt=1, so that row is both the first and the last.

It is possible that there are several rows in the table for the same uid and date. To make the ordering deterministic, we include the row id in the ORDER BY.

In the example rows, the candidates for deletion are tagged with the flag 'D'.

with rangedData as (
  select *
    ,row_number() over (partition by uid order by "when" asc, id asc) as rn
    ,count(*) over (partition by uid) as uidCnt
  from calls2
)
select *
  ,case when rn = 1 or rn = uidCnt then 'A' else 'D' end as fl
from rangedData
id  uid  when        other  rn  uidcnt  fl
 1  11   2010-01-01  a1     1   4       A
 2  11   2010-01-02  a2     2   4       D
 3  11   2010-01-03  a3     3   4       D
 4  11   2010-01-04  a4     4   4       A
 5  22   2010-01-01  b1     1   1       A
 6  33   2010-01-01  c1     1   2       A
 7  33   2010-01-02  c2     2   2       A
 8  44   2010-01-01  d1     1   3       A
 9  44   2010-01-02  d2     2   3       D
10  44   2010-01-03  d3     3   3       A

Rows to delete

with rangedData as (
  select *
    ,row_number() over (partition by uid order by "when" asc, id asc) as rn
    ,count(*) over (partition by uid) as uidCnt
  from calls2
)
select *
  ,case when rn = 1 or rn = uidCnt then 'A' else 'D' end as fl
from rangedData
where rn > 1 and rn < uidCnt
order by uid, "when", id
id  uid  when        other  rn  uidcnt  fl
 2  11   2010-01-02  a2     2   4       D
 3  11   2010-01-03  a3     3   4       D
 9  44   2010-01-02  d2     2   3       D

Rows that remain

with rangedData as (
  select *
    ,row_number() over (partition by uid order by "when" asc, id asc) as rn
    ,count(*) over (partition by uid) as uidCnt
  from calls2
)
select *
  ,case when rn = 1 or rn = uidCnt then 'A' else 'D' end as fl
from rangedData
where rn = 1 or rn = uidCnt
order by uid, "when", id
id  uid  when        other  rn  uidcnt  fl
 1  11   2010-01-01  a1     1   4       A
 4  11   2010-01-04  a4     4   4       A
 5  22   2010-01-01  b1     1   1       A
 6  33   2010-01-01  c1     1   2       A
 7  33   2010-01-02  c2     2   2       A
 8  44   2010-01-01  d1     1   3       A
10  44   2010-01-03  d3     3   3       A

DELETE action

delete from calls2 dest
using (
  select *
    ,row_number() over (partition by uid order by "when" asc, id asc) as rn
    ,count(*) over (partition by uid) as uidCnt
  from calls2
) rangedData
where rangedData.id = dest.id
  and rangedData.rn <> 1
  and rangedData.rn < rangedData.uidCnt
returning *
id  uid  when        other  id  uid  when        other  rn  uidcnt
 2  11   2010-01-02  a2      2  11   2010-01-02  a2     2   4
 3  11   2010-01-03  a3      3  11   2010-01-03  a3     3   4
 9  44   2010-01-02  d2      9  44   2010-01-02  d2     2   3

If the table is quite large, it is better to replace "SELECT * .." in the query with "SELECT id, uid, "when"" and to add an index like CREATE INDEX idx_calls2_uid_when_id ON calls2 (uid, "when", id). I assume that you already have such an index.
Then the table scan may be replaced by a faster index scan, and some of the sorts inside the query can be skipped.
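
As a sketch of that suggestion (the index name is illustrative; verify the plan with EXPLAIN on your own data):

```sql
-- Covering index on (uid, "when", id); the name is just an example.
create index if not exists idx_calls2_uid_when_id
    on calls2 (uid, "when", id);

-- The same DELETE, but the subquery reads only the indexed columns,
-- so the planner can use an index-only scan and skip extra sorts.
delete from calls2 dest
using (
  select id
    ,row_number() over (partition by uid order by "when", id) as rn
    ,count(*) over (partition by uid) as uidcnt
  from calls2
) rangeddata
where rangeddata.id = dest.id
  and rangeddata.rn > 1
  and rangeddata.rn < rangeddata.uidcnt;
```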

fiddle

Update 1
See this part of the query plan:

Delete on public.calls2 dest  (cost=142.62..241.61 rows=2119 width=150) (actual time=0.138..0.147 rows=3 loops=1)
  Output: dest.id, dest.uid, dest."when", dest.other, rangeddata.id, rangeddata.uid, rangeddata."when", rangeddata.other, rangeddata.rn, rangeddata.uidcnt
  ->  Hash Join  (cost=142.62..241.61 rows=2119 width=150) (actual time=0.171..0.179 rows=3 loops=1)
        Output: dest.ctid, rangeddata.*, rangeddata.id, rangeddata.uid, rangeddata."when", rangeddata.other, rangeddata.rn, rangeddata.uidcnt
        Hash Cond: (dest.id = rangeddata.id)
        ->  Seq Scan on public.calls2 dest  (cost=0.00..21.30 rows=1130 width=10) (actual time=0.005..0.007 rows=10 loops=1)
              Output: dest.ctid, dest.id
        ->  Hash  (cost=137.93..137.93 rows=375 width=144) (actual time=0.111..0.112 rows=3 loops=1)

We can see a Hash Join with two sources: the target table "public.calls2 dest" and the subquery "rangeddata", joined on the condition "(dest.id = rangeddata.id)".

We can assume that the DBMS first executes a query like the one below and then deletes from the main table the rows included in its result. This is the usual way a DELETE involving two sources is executed.

select *
from calls2 dest
,(
  select *
    ,row_number() over (partition by uid order by "when" asc, id asc) as rn
    ,count(*) over (partition by uid) as uidCnt
  from calls2
) rangedData
where rangedData.id = dest.id
  and rangedData.rn <> 1
  and rangedData.rn < rangedData.uidCnt
id  uid  when        other  id  uid  when        other  rn  uidcnt
 2  11   2010-01-02  a2      2  11   2010-01-02  a2     2   4
 3  11   2010-01-03  a3      3  11   2010-01-03  a3     3   4
 9  44   2010-01-02  d2      9  44   2010-01-02  d2     2   3

See the full query plan in the fiddle.

Update 2
Your question in the comment prompted me to formulate a solution in a different way. Here is an example of a query that fulfils the conditions you formulated, verbatim.

select * 
-- delete 
from calls2 dest
where 
   exists( -- row before
    select 1 from calls2 t1 
    where t1.uid=dest.uid and t1."when"<=dest."when" and t1.id<dest.id)
  and exists( -- row after
    select 1 from calls2 t1 
    where t1.uid=dest.uid and t1."when">=dest."when" and t1.id>dest.id)

(The "select 1" there means "select anything": EXISTS only checks whether a row exists, not which values it returns.)
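
To make that concrete, here is a minimal illustration (the constant in the subquery's select list is irrelevant to EXISTS):

```sql
-- Counts rows that have at least one earlier row for the same uid.
select count(*)
from calls2 dest
where exists (select 1 from calls2 t
              where t.uid = dest.uid and t.id < dest.id);
-- "select 42" or "select null" in the subquery would give the same
-- result: EXISTS only tests whether at least one row comes back.
```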

Fiddle


2 Comments

I like this. However, why the rangedData.id=dest.id check in the delete statement?
To delete from two tables, or from a table and a subquery, the DBMS first creates a JOIN of the two sources, and that JOIN needs join conditions. I've added the query plan and comments to my answer for explanation.

You can use subqueries to find whether the current row's time is the min or max within its group. You have to use quotes around the column name, since "when" is a reserved SQL keyword. The query below results in the 3 IDs to delete (2, 3, 9):

select id from calls2 where id not in (
    select calls2.id from calls2
    join (
        select uid, min("when") as minmax from calls2 group by uid
          union all
        select uid, max("when") as minmax from calls2 group by uid
    ) mm
    on calls2.uid = mm.uid
    where calls2."when" = mm.minmax
)

2 Comments

I actually got somewhat ahead with ChatGPT, see my fiddle. I managed to get the "good" values using two "partition" expressions, but I am still trying to figure out how to get the inverse. Ahhh! Your "select .. where not in (select ...)" is the answer to that! Yay.
Wonderful! I am glad to hear it's the answer you needed. You can accept the answer to confirm.

You can use a CTE to identify the candidates to delete and then use those results to actually perform the delete. Running the CTE on its own is necessary only if you need to record the rows being deleted. That CTE follows:

with uid_mm( uid      
           , uid_min 
           , uid_max 
           ) as
     ( select uid 
           , min("when") 
           , max("when")
        from calls2
       group by uid 
    )
select * 
  from uid_mm;

Notice, however, that this identifies potential deletions for each uid even if none actually qualify. You can adjust the CTE to eliminate the non-qualifying rows if needed, but it is not necessary: they are eliminated by the main query following the CTE.

with uid_mm( uid      
           , uid_min 
           , uid_max 
           ) as
     ( select uid 
           , min("when") 
           , max("when")
        from calls2
       group by uid 
    )
delete from calls2  c
  where exists ( select null 
                   from uid_mm  u 
                  where c.uid   = u.uid 
                    and c."when" > u.uid_min 
                    and c."when" < u.uid_max
               ); 

See demo here.

Comments


You can also use LEAD and LAG over the partition by uid ordered both the same way, the rows to delete are the ones where both results are NOT NULL:

delete from calls2
where id in (
    select 
    case when 
    lag(id) over w is not null 
    and lead(id) over w is not null
    then id
    end as id
    from calls2 c
    window w as(partition by uid order by "when", id)
)

1 Comment

It won't be quicker than what's already shown but I like this because it nicely combines with my curiosity example. It can improve by 1) comparing to the much shorter list of rows that need to stay, instead of the longer one of those to go 2) removing case and not null checks in favour of lag=lead yielding null if either is null 3) pushing that down to order by..nulls first fetch first 1 rows with ties, making the set even shorter by only including the id's and none of the nulls previously resulting from the implicit case..else null. dbfiddle.uk/EviKPtPs

Now the table has gotten way too large, and I like to thin it out

A delete won't actually be enough. It only reclaims space for the same table to re-use over time. If you want to free it fully and make it available for other things, inside or outside the db, you also need a VACUUM FULL. Here's 40k rows holding random dates, belonging to 3k uids, about 4.36MiB before cleanup:
demo at db<>fiddle

select size,pg_size_pretty(size)
from(values(pg_total_relation_size('calls2')))as v(size);
size     pg_size_pretty
4579328  4472 kB

If you don't allow duplicate "when" for the same uid, scan it once to establish earliest/latest for each uid, then delete all else:

with cte as(select uid,min("when"),max("when")
            from calls2
            group by uid)
delete from calls2
where (uid,"when")not in(select uid,min from cte
                         union all
                         select uid,max from cte);

The (x1,y1)=(x2,y2)-style expression uses row constructor comparison syntax. This query also applies if you do allow duplicate dates per uid but you also wish to keep all of the entries that tied for earliest and latest spots. Otherwise:

delete from calls2
where id not in(select distinct on(uid) id
                from calls2
                order by uid,"when" asc,id)
  and id not in(select distinct on(uid) id
                from calls2
                order by uid,"when" desc,id);

And even though these remove 90% of records, the table still takes up 4.36MiB:

select size,pg_size_pretty(size)
from(values(pg_total_relation_size('calls2')))as v(size);
size     pg_size_pretty
4579328  4472 kB

Only once you run the VACUUM FULL do you get that space back:

vacuum(full,analyze,verbose) calls2;
 
select size,pg_size_pretty(size)
from(values(pg_total_relation_size('calls2')))as v(size);
size    pg_size_pretty
376832  368 kB

The demo shows plans and exec times for these as well as other queries proposed so far. Note that select benchmarks don't always directly translate to equivalent delete performance because that, by nature, has to persist something on the heap, so the strategies and optimization techniques used by the planner are slightly different.


how I select the other IDs in a way that they list a single row for each uid, showing the uid plus the values of oldest and latest pairs for "when" and "other".

That's the distinct on I used above. You can fetch all the values that coincide with the top/max value in the same row, without aggregation, CTEs, joins, subqueries:

select distinct on(uid) *
from calls2
order by uid, "when", id;

Other RDBMS can emulate it with a row_number() window function in a subquery:

select id, "when", uid
from ( select row_number() over (partition by uid order by "when", id) as rn, *
       from calls2 ) as subquery
where rn = 1;

plus lots of other statistical information (ip address, query parameters etc)

  1. If you aren't already, you can save some space and gain some speed and flexibility by making sure IP addresses use their dedicated inet data type.
  2. If some of that additional, undisclosed statistical information is saved using variable-length data types (text, json(b), etc.), it might get compressed in place, moved out of line, then compressed out of line. You can save some time and space by switching default_toast_compression from the default pglz to lz4. You can also change that for individual columns using alter table..alter column..set compression, as well as control the compression/TOASTing behaviour with alter table..alter column..set storage;
  3. It's possible to save a bit more by rearranging the order of your columns to optimise for alignment padding.
  4. You can look into pg_repack.
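
A sketch of points 1 and 2; the ip and params column names are hypothetical, and the compression settings require PostgreSQL 14 or later:

```sql
-- 1. Store IP addresses in their dedicated type
--    ("ip" is a hypothetical text column holding addresses):
alter table calls2 alter column ip type inet using ip::inet;

-- 2. Prefer lz4 for TOAST compression (PostgreSQL 14+);
--    session-wide default for newly stored values:
set default_toast_compression = 'lz4';
-- ...or per column ("params" is a hypothetical jsonb column):
alter table calls2 alter column params set compression lz4;
```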

As a curiosity:

delete from calls2
where id not in (select id
                 from calls2
                 window w1 as (partition by uid)
                 order by     1<>row_number()over(w1 order by "when",ctid)
                          and 1<>row_number()over(w1 order by "when" desc,ctid desc)
                 fetch first 1 rows with ties);
  • Window functions aren't allowed in a where clause because they execute after the where is already evaluated. This achieves a similar effect by instead using them to establish an order, placing everything you're interested in at the same, first position; then fetch first 1 rows with ties limits the output to all rows tied for that first position.
  • It's looking for rows that don't rank first from either the top or the bottom (1<>row_number()), chronologically. The default order by direction is ascending and false sorts before true, so the first position goes to those rows that do rank first.
  • There are actually three window definitions. The obvious one is the named w1 but then the window function calls use it to establish new ones, with specific ordering.

The three windows and two 1<>row_number() comparisons can be reduced to just one, because lag=lead yields null if either is null, and order by offers a nulls first option:

delete from calls2
where id not in (
    select id 
    from calls2 c
    window w as(partition by uid order by "when", id)
    order by (lag(id)over w) = (lead(id)over w) nulls first
    fetch first 1 rows with ties );

Comments
