Postgres: Distinct but only for one column

Question

I have a PostgreSQL table of names with more than 1 million rows, many of which are duplicates. I select 3 fields: id, name, metadata.

I want to select them randomly with ORDER BY RANDOM() and LIMIT 1000, so I do this in many steps to save some memory in my PHP script.

But how can I do that so that the returned list has no duplicate names?

For example, [1,"Michael Fox","2003-03-03,34,M,4545"] will be returned but not [2,"Michael Fox","1989-02-23,M,5633"]. The name field is the most important: it must be unique in the list every time I run the select, and the selection must be random.

I tried GROUP BY name, but then it expects id and metadata to be in the GROUP BY as well, or in an aggregate function, and I don't want them to be filtered or aggregated in any way.

Does anyone know how to fetch many columns but apply DISTINCT to only one of them?

Community · Accepted Answer · 2020-06-20 09:12:55Z

408

To do a distinct on only one (or n) column(s):

select distinct on (name)
    name, col1, col2
from names

This will return any of the rows containing the name. If you want to control which of the rows will be returned you need to order:

select distinct on (name)
    name, col1, col2
from names
order by name, col1

For each name, this returns the first row when the rows for that name are ordered by col1.

distinct on:

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.

The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
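
As a concrete sketch of the ordered form above, the following uses a throwaway table built from the two rows quoted in the question. The temp table definition, the column types, and the extra 'Jane Roe' row are assumptions made for illustration only; they are not from the original post.

```sql
-- Hypothetical sample data: the two 'Michael Fox' rows come from the question,
-- the 'Jane Roe' row is invented so there is more than one name.
CREATE TEMP TABLE names (id int, name text, metadata text);
INSERT INTO names VALUES
    (1, 'Michael Fox', '2003-03-03,34,M,4545'),
    (2, 'Michael Fox', '1989-02-23,M,5633'),
    (3, 'Jane Roe',    '1975-07-01,F,1001');

-- One row per name. name must lead the ORDER BY because it is the
-- DISTINCT ON expression; id then decides which row wins inside each name.
SELECT DISTINCT ON (name)
    id, name, metadata
FROM names
ORDER BY name, id;
```

With this data, 'Michael Fox' is returned once, with id 1 and its own metadata, because id 1 sorts first within that name. Changing the second sort key to id DESC would keep id 2 instead.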

edited Jun 20, 2020 at 9:12

CommunityBot

answered Jun 4, 2013 at 12:36

Clodoaldo Neto

7 Comments

Craig Ringer Over a year ago

Good catch on ordering. I didn't include it because they mentioned wanting a random ordering, but it's important to mention anyway.

Clodoaldo Neto Over a year ago

@elliot yes, name is necessary. Check DISTINCT ON in the manual.

JTW Over a year ago

I wish the TSQL team could provide such a sensible way of doing this.

Ogaga Uzoh Over a year ago

Please add the appropriate postgresql reference

Kevin Parker Over a year ago

Ugh me too, this has been plaguing me for weeks now. I want distinct on one column but order by something else that isn't the distinct column. Why is it so hard in Postgres? A subquery is way too slow as it will evaluate the entire thing before returning the outer order by. Frustrating beyond belief!

iainn · Accepted Answer · 2017-11-28 10:29:49Z

32

Anyone knows how to fetch many columns but do only a distinct on one column?

You want the DISTINCT ON clause.

You didn't provide sample data or a complete query so I don't have anything to show you. You want to write something like:

SELECT DISTINCT ON (name) id, name, metadata FROM the_table;

This will return an unpredictable (but not "random") set of rows. If you want to make it predictable, add an ORDER BY per Clodoaldo's answer. If you want to make it truly random, you'll want to ORDER BY random().
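
One way the random variant could look, given that DISTINCT ON (name) requires name to be the leftmost ORDER BY expression, is sketched below. It assumes the_table has the id, name and metadata columns from the question; the alias one_per_name is made up for this example.

```sql
-- Sketch only: assumes the_table(id, name, metadata) as described in the question.
SELECT id, name, metadata
FROM (
    -- Inner query: keep one row per name.
    -- name has to come first in ORDER BY; random() then picks
    -- an arbitrary row among the duplicates of each name.
    SELECT DISTINCT ON (name) id, name, metadata
    FROM the_table
    ORDER BY name, random()
) AS one_per_name
-- Outer query: shuffle the de-duplicated rows and take a batch of 1000.
ORDER BY random()
LIMIT 1000;
```

Two caveats apply to this sketch. The inner query still has to read and sort every row of the table, so it is not cheap on a table with more than a million rows. Also, each execution draws an independent sample, so a name can show up again in a later batch unless the caller excludes names it has already seen.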

edited Nov 28, 2017 at 10:29

iainn

answered Jun 4, 2013 at 12:35

Craig Ringer

3 Comments

Kevin Parker Over a year ago

Just note with this DISTINCT ON clause, you can only ORDER BY the same thing + more. So if you say DISTINCT ON (name) you must ORDER BY name then whatever else you want. Hardly ideal.

Craig Ringer Over a year ago

Kevin, you can just use a CTE or subquery-in-FROM and ORDER BY in the outer query.

Kevin Parker Over a year ago

Yes, and watch the performance go... The entire possible results from the index space will be searched. It turns what could be a 10-20ms query with the right index into a 900ms one just because Postgres can't handle a different distinct / order by. Doesn't even matter what the outer query order is, it's going to use the index from the inner subquery to find matches first, then re-sort. Happy to pay a consulting fee for real solutions to our problems at dba.stackexchange.com/questions/260852/…

Sunil Kumar · Accepted Answer · 2020-11-30 13:47:04Z

14

To do a distinct on n columns:

select distinct on (col1, col2) col1, col2, col3, col4 from names

answered Nov 30, 2020 at 13:47

Sunil Kumar

David Jashi · Accepted Answer · 2013-06-04 09:17:35Z

4

SELECT NAME,MAX(ID) as ID,MAX(METADATA) as METADATA 
from SOMETABLE
GROUP BY NAME
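
To see what this query does with the two rows quoted in the question, here is a small sketch. The temp table and its column types are assumptions for illustration; only the two data rows come from the original post.

```sql
-- Hypothetical table holding just the two rows quoted in the question.
CREATE TEMP TABLE sometable (id int, name text, metadata text);
INSERT INTO sometable VALUES
    (1, 'Michael Fox', '2003-03-03,34,M,4545'),
    (2, 'Michael Fox', '1989-02-23,M,5633');

-- Each MAX() is evaluated independently over the group.
SELECT name, MAX(id) AS id, MAX(metadata) AS metadata
FROM sometable
GROUP BY name;
-- Result: one row with id = 2 and metadata = '2003-03-03,34,M,4545'.
-- The id comes from the second row, the metadata from the first,
-- because '2003...' sorts after '1989...' as text.
```

So the query does return one row per name, but the id and metadata in that row are not guaranteed to come from the same source row.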

answered Jun 4, 2013 at 9:17

David Jashi

4 Comments

user330315 Over a year ago

Just a word of caution: that might not return the ID value or the metadata value that belong "together"

Clodoaldo Neto Over a year ago

@Novum No. It means it can take an id value from one of Michael's rows and the metadata from another, since what was asked for is the maximum of each column across Michael's rows.

David Jashi Over a year ago

Well yes, it greatly depends on the real data the OP uses, which I'm absolutely ignorant of. You may need to use MIN or whatever. I just demonstrated how you can include fields that are not in the GROUP BY clause.

Elliot Chance Over a year ago

This is not a good solution because different values from different rows will get mixed up.
