Creating a Bin column in Postgres to check an integer and return a string

Question

I have a large data set in a Postgres db and need to generate a field that groups rows into a respective bin for "0-100", "101-200", "201-300", etc. all the way up to nearly 5000. I am aware that I could manually update each row and produce a line of code for each bin like this:

update test
   set testgroup = '0-100' where testint >= 1 and distance < 100;

I really would like to figure out a more efficient way to do this, open to anything and everything! The main goal is to look at the integer in this 'testint' column and then if it is in between 1-100 return in the testgroup column "0-100".

In your example code, shouldn't the comparison variables be the same? i.e. testint >= 1 and testint < 100
– bfris, Commented May 25, 2018 at 5:32
Sorry, that was a typo. Not really wanting to write that line and manually update it 50 times as well. The rows do have unique identifiers.
– Holt, Commented May 25, 2018 at 15:20

bfris · Accepted Answer · 2018-05-25 05:57:34Z

Use the width_bucket function. See the docs, but here is a short version of the syntax:

width_bucket(a, LBound, UBound, num_bins)

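To get it to work properly for your bins, I have to add 1 to UBound. Each bucket includes its lower edge but not its upper edge, so with UBound = 5000 a value of exactly 100 would land in bucket 2 instead of bucket 1. Some examples: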

select width_bucket( 1, 0, 5001, 50) gives 1
select width_bucket(100, 0, 5001, 50) gives 1
select width_bucket(101, 0, 5001, 50) gives 2
select width_bucket(4900, 0, 5001, 50) gives 49
select width_bucket(4901, 0, 5001, 50) gives 50
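To see why the extra 1 matters, here is a small sketch (not part of the original answer) that compares an upper bound of 5000 with 5001 on a few boundary values. The inline VALUES list and the column names are made up for illustration.

```sql
-- Compare bucket numbers for UBound = 5000 vs. UBound = 5001.
-- The sample values and output column names are illustrative only.
SELECT v AS testint,
       width_bucket(v, 0, 5000, 50) AS bucket_ubound_5000,
       width_bucket(v, 0, 5001, 50) AS bucket_ubound_5001
FROM (VALUES (0), (100), (101), (200), (201), (5000)) AS t(v);
```

With 5000 as the upper bound, 100 and 200 fall into buckets 2 and 3, and 5000 itself lands in the overflow bucket 51. With 5001 they fall into buckets 1, 2 and 50.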

So that works as expected. Next we need to generate the proper string. Pseudo format is

(width_bucket - 1)*100 || '-' || (width_bucket)*100

Where || is the SQL concatenation operator. Using the first example from before:

select (width_bucket(1, 0, 5001, 50)-1)*100 || ' - ' || width_bucket(1, 0, 5001, 50)*100

gives '0 - 100'
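Note that this format labels the second bin '100 - 200', while the question asked for "101-200". The sketch below computes the bucket once with a lateral subquery and shows both label styles side by side. The second style is only a guess at the desired output, and the aliases t, w, v and b are placeholders.

```sql
-- Build bin labels from the bucket number b.
-- label_as_above follows the formula in this answer;
-- label_as_asked is an assumed variant matching "0-100", "101-200", ...
SELECT v AS testint,
       (b - 1) * 100 || ' - ' || b * 100 AS label_as_above,
       CASE WHEN b = 1 THEN 0 ELSE (b - 1) * 100 + 1 END
           || '-' || b * 100 AS label_as_asked
FROM (VALUES (1), (100), (101), (4901)) AS t(v)
CROSS JOIN LATERAL (SELECT width_bucket(v, 0, 5001, 50)) AS w(b);
```

For testint = 101 the two columns read '100 - 200' and '101-200'.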

Sweet. Now putting it all together. First make a sandbox table you can use for testing. This will be a copy or partial copy of your data:

CREATE TABLE test
AS
SELECT *
FROM original_table;

Then add the new column to the table:

ALTER TABLE test
  ADD COLUMN testgroup text;

Now the UPDATE statement:

UPDATE test
   SET testgroup = (width_bucket(testint, 0, 5001, 50)-1)*100 || ' - ' || 
                   width_bucket(testint, 0, 5001, 50)*100;
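If you want to try the whole thing without touching real data, here is a minimal self-contained sketch. The temporary table bin_demo and its four rows are invented for the demo and are not part of the original question.

```sql
-- Throwaway table for experimenting; name and rows are hypothetical.
CREATE TEMP TABLE bin_demo (
    id        int PRIMARY KEY,
    testint   int,
    testgroup text
);

INSERT INTO bin_demo (id, testint)
VALUES (1, 1), (2, 100), (3, 101), (4, 4901);

-- Fill the label column from the bucket number.
UPDATE bin_demo
   SET testgroup = (width_bucket(testint, 0, 5001, 50) - 1) * 100 || ' - ' ||
                   width_bucket(testint, 0, 5001, 50) * 100;

SELECT id, testint, testgroup
FROM bin_demo
ORDER BY id;
```

Keep in mind that width_bucket returns 0 for values below LBound and num_bins + 1 for values at or above UBound, so out-of-range rows would get labels such as '-100 - 0' or '5000 - 5100'.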

JGH · Accepted Answer · 2018-05-24 22:08:11Z

You can use generate_series to generate the numbers 0 to 50, then match each row to the generated value x for which testint is greater than x * 100 and at most (x + 1) * 100. The same values are used to build the bin name.

UPDATE test
SET testgroup = (x*100)+1 || '-' || (x+1)*100
FROM generate_series(0,50) f(x)
WHERE testint > (x*100) 
  AND testint <= ((x+1)*100);

http://rextester.com/FXIS37706
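To preview which bins the UPDATE will assign, the same expressions can be run as a plain SELECT. This is only an illustrative sketch: the series is cut down to three values and the output column names are made up.

```sql
-- Preview the bin boundaries and labels produced by the series.
-- generate_series(0, 2) is shortened from (0, 50) to keep the output small.
SELECT x,
       (x * 100) + 1 AS lower_bound,
       (x + 1) * 100 AS upper_bound,
       (x * 100) + 1 || '-' || (x + 1) * 100 AS label
FROM generate_series(0, 2) AS f(x);
```

This yields the labels '1-100', '101-200' and '201-300'. Because the WHERE clause uses testint > x * 100, a row with testint = 0 matches no bin and is left unchanged by the UPDATE.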

3 Comments

KanduriR Over a year ago

Hi, I am using a similar binning method and found your code useful, although I don't understand it well enough, in particular the part "FROM generate_series(0,50) f(x)". I understand the logic but not the syntax. What does f(x) mean here: a function, or another dataset like generate_series(0,50)?

JGH Over a year ago

@KanduriR the f is the table alias for the output, the (x) is used to rename the columns. Here since the output of generate_series is a single column, it doesn't really add much. Let's say you have a table mytable with 3 columns x, y, z, you can do select * from mytable a(a,b,c) and the columns will be shown as a,b,c, not x,y,z

KanduriR Over a year ago

OK, I understand now. Thank you @JGH. Is the notation mytable a(a,b,c) specific to Postgres, or is it generic SQL?

Eric M. · Accepted Answer · 2024-10-23 07:55:53Z

The question says "all the way up to nearly 5000".

Sometimes the upper bound is unknown, in which case width_bucket may not be ideal, since it requires an upper bound.

But an old school modulo may be enough:

-- Explicit
SELECT  testint - testint % 100 || '-' || testint - testint % 100 + 100
FROM    (
    VALUES
    (256),
    (543),
    (33),
    (5611)
) AS q (testint);

-- Less duplicate operations
SELECT  left_end || '-' || left_end + 100
FROM  (
    VALUES
    (256),
    (543),
    (33),
    (5611)
) AS q (testint)
JOIN LATERAL (SELECT testint - testint % 100) l(left_end) ON TRUE;

Both return:

200-300
500-600
0-100
5600-5700
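Separately from the two queries above: with plain modulo, a value of exactly 100 gets the label '100-200'. If the bins should instead run 1-100, 101-200 and so on, as in the question, one possible variant is to shift by 1 before the integer division. This is an assumption about the desired output, and the column names are placeholders.

```sql
-- modulo_label uses the expression from this answer;
-- shifted_label is an assumed variant for bins 1-100, 101-200, ...
-- It relies on integer division truncating (testint must be an integer type).
SELECT testint,
       testint - testint % 100 || '-' || testint - testint % 100 + 100 AS modulo_label,
       (testint - 1) / 100 * 100 + 1 || '-' || ((testint - 1) / 100 + 1) * 100 AS shifted_label
FROM (VALUES (1), (100), (101), (5611)) AS q (testint);
```

For testint = 100 this gives '100-200' and '1-100'. For testint = 5611 it gives '5600-5700' and '5601-5700'.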
