1

I have a large data set in a Postgres db and need to generate a field that groups rows into a respective bin for "0-100", "101-200", "201-300", etc. all the way up to nearly 5000. I am aware that I could manually update each row and produce a line of code for each bin like this:

update test
   set testgroup = '0-100' where testint >= 1 and distance < 100;

I really would like to figure out a more efficient way to do this, open to anything and everything! The main goal is to look at the integer in this 'testint' column and then if it is in between 1-100 return in the testgroup column "0-100".

3
  • Do you have a unique identifier on each row? Commented May 24, 2018 at 21:45
  • in your example code, shouldn't the comparison variables be the same? i.e. testint >=1 and testint < 100 Commented May 25, 2018 at 5:32
  • Sorry that was a typo. Not really wanting to write that line and manually update it 50 times as well. The rows do have unique identifiers. Commented May 25, 2018 at 15:20

3 Answers 3

4

Use the width_bucket function. See the the docs, but here is a short version of the syntax:

width_bucket(a, LBound, UBound, num_bins)

To get it to work properly for your bins, I have to add 1 to UBound. Some examples:

select width_bucket( 1, 0, 5001, 50) gives 1
select width_bucket(100, 0, 5001, 0) gives 1
select width_bucket(101, 0, 5001, 50) gives 2
select width_bucket(4900, 0, 5001, 50) gives 49
select width_bucket(4901, 0, 5001, 50) gives 50

So that works as expected. Next we need to generate the proper string. Pseudo format is

(width_bucket - 1)*100 || '-' || (width_bucket)*100

Where || is the SQL concatenation operator. Using the first example from before:

select (width_bucket(1, 0, 5001, 50)-1)*100 || ' - ' || width_bucket(1, 0, 5001, 50)*100

gives '0 - 100'

Sweet. Now putting it all together. First make a sandbox table you can use for testing. This will be a copy or partial copy of your data:

CREATE TABLE test
AS
SELECT * 
FROM original_table 

Then add the new column to the table:

ALTER TABLE test
  ADD COLUMN testgroup text

Now the UPDATE statement:

UPDATE test
   SET testgroup = width_bucket(testint, 0, 5001, 50)-1)*100 || ' - ' || 
                   width_bucket(testint, 0, 5001, 50)*100
Sign up to request clarification or add additional context in comments.

Comments

0

You can make use of generate_series to generate numbers from 0 to 50, and then to select the data between the generated values * 100 and the next generated value * 100. The same principle is used to build the bin name.

UPDATE test
SET testgroup = (x*100)+1 || '-' || (x+1)*100
FROM generate_series(0,50) f(x)
WHERE testint > (x*100) 
  AND testint <= ((x+1)*100);

http://rextester.com/FXIS37706

3 Comments

Hi, I am using a similar binning method and found your code useful. Although I don't seem to understand it well enough. The part about "FROM generate_series(0,50) f(x)". While I understand the logic I am not able to get the syntax. What does f(x) mean here - a function ? or another dataset like generate_series(0,50)
@KanduriR the f is the table alias for the output, the (x) is used to rename the columns. Here since the output of generate_series is a single column, it doesn't really add much. Let's say you have a table mytable with 3 columns x, y, z, you can do select * from mytable a(a,b,c) and the columns will be shown as a,b,c, not x,y,z
Ok understand now. Thank you @JGH. Is this notation mytable a(a,b,c) is specific to postgres or is this a generic notation in SQL language ?
0

all the way up to nearly 5000

Sometimes the issue is that the upper bound is unknown so width_bucket may not be ideal in such case as the upper bound is required.

But an old school modulo may be enough:

-- Explicit
SELECT  testint - testint % 100 || '-' || testint - testint % 100 + 100
FROM    (
    VALUES
    (256),
    (543),
    (33),
    (5611)
) AS q (testint)

-- Less duplicate operations
SELECT  left_end || '-' || left_end + 100
FROM  (
    VALUES
    (256),
    (543),
    (33),
    (5611)
) AS q (testint)
JOIN LATERAL (SELECT testint - testint % 100) l(left_end) ON TRUE;

Both return:

200-300
500-600
0-100
5600-5700

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.