
I have a table with >1M rows of data and 20+ columns.

Within my table (tableX) I have identified duplicate records (~80k) in one particular column (troubleColumn).

If possible I would like to retain the original table name and remove the duplicate records from my problematic column; otherwise, I could create a new table (tableXfinal) with the same schema but without the duplicates.

I am not proficient in SQL or any other programming language so please excuse my ignorance.

This is the query I have tried:

DELETE FROM Accidents.CleanedFilledCombined
WHERE Fixed_Accident_Index IN (
  SELECT Fixed_Accident_Index
  FROM Accidents.CleanedFilledCombined
  GROUP BY Fixed_Accident_Index
  HAVING COUNT(Fixed_Accident_Index) > 1
);
  • I've just read that BigQuery tables are append-only, so I guess I'll need to make a copy of my table! Commented Apr 17, 2016 at 11:12
  • To de-duplicate rows on a single partition, see: stackoverflow.com/a/57900778/132438 Commented Sep 12, 2019 at 6:22

12 Answers


You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).

Here is a query that should work:

SELECT *
FROM (
  SELECT
      *,
      ROW_NUMBER()
          OVER (PARTITION BY Fixed_Accident_Index)
          row_number
  FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
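
The rewrite itself can also be expressed as a single Standard SQL statement. A sketch wrapping the query above in CREATE OR REPLACE TABLE, with the helper row_number column dropped via EXCEPT:

CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined AS
SELECT * EXCEPT(row_number)
FROM (
  SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS row_number
  FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1;

Back up the table first if you go this route, since it overwrites the source in place.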

6 Comments

see my answer below for a more scalable alternative with #standardSQL
Is there a way to do this via the API?
One problem with overwriting is that the fields of the new table's schema all become NULLABLE.
This is as solid of an answer as you can get on S/O. Thanks Jordan.
In general it's bad practice to overwrite an existing table, as you may find you made a mistake somewhere in your query. It's better to write it as a separate table and once you're sure it's good, delete the old one and rename the new one.

UPDATE 2019: To de-duplicate rows on a single partition with a MERGE, see stackoverflow.com/a/57900778/132438


An alternative to Jordan's answer, one that scales better when there are too many duplicates:

SELECT event.* FROM (
  SELECT ARRAY_AGG(
    t ORDER BY t.created_at DESC LIMIT 1
  )[OFFSET(0)]  event
  FROM `githubarchive.month.201706` t 
  # GROUP BY the id you are de-duplicating by
  GROUP BY actor.id
)

Or a shorter version (takes any row, instead of the newest one):

SELECT k.*
FROM (
  SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k 
  FROM `fh-bigquery.reddit_comments.2017_01` x 
  GROUP BY id
)

To de-duplicate rows on an existing table:

CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
# SELECT id FROM UNNEST([1,1,1,2,2]) id
SELECT k.*
FROM (
  SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k 
  FROM `deleting.deduplicating_table` row
  GROUP BY id
)
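
If you want the existing-table form to keep the newest row rather than an arbitrary one, the two snippets above combine naturally. A sketch, assuming the table has a created_at timestamp column:

CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
SELECT k.*
FROM (
  SELECT ARRAY_AGG(
    row ORDER BY row.created_at DESC LIMIT 1  -- assumes a created_at column exists
  )[OFFSET(0)] k
  FROM `deleting.deduplicating_table` row
  GROUP BY id
)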

7 Comments

Hi Felipe, very cool! As a matter of curiosity, how would you construct a #standardSQL query (only) that used DELETE DML on the source table, instead of rewriting it, in order to remove duplicates?
Answer updated with a one-step de-duplication for an existing table
when I ran the shorter version, my query took too long to respond.
@intotecho weird - longer version takes less time to execute? try posting your job ids on the bigquery issue tracker
Ah, I forgot to include the first line CREATE OR REPLACE TABLE deleting.deduplicating_table. That's why it didn't finish.

Not sure why nobody mentioned a DISTINCT query.

Here is the way to clean duplicate rows:

CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT DISTINCT * FROM project.dataset.table
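
If SELECT DISTINCT * fails because the table has STRUCT columns (as a commenter notes below), one workaround sketch, in the spirit of the ARRAY_AGG answer above, is to group on the row's JSON form:

CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT k.*
FROM (
  SELECT ARRAY_AGG(t LIMIT 1)[OFFSET(0)] k
  FROM project.dataset.table t
  -- the whole row serialized to JSON acts as the de-duplication key
  GROUP BY TO_JSON_STRING(t)
)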

8 Comments

This doesn't work if you have more than one column in your table (or perhaps I'm doing something wrong?)
Definitely the easiest way to do what I was trying to do - thanks! Doesn't directly answer OP's question, but it answers why I landed here :) @OriolNieto - it works with all your columns. You can swap * for a list of specific columns if you want to verify how it works
This doesn't work if the existing table is partitioned.
I think if you have a column that's a struct it won't work with *. That might be what @OriolNieto was seeing.
Or if we want to dedup rows that have the same id but different values in other columns, e.g. updated_at

If your schema doesn't have any RECORD (nested) fields, the variation of Jordan's answer below will work well enough, whether writing over the same table or to a new one:

SELECT <list of original fields>
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos
  FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1
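
As a commenter notes below, with Standard SQL you can avoid listing the original fields by using * EXCEPT:

SELECT * EXCEPT(pos)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos
  FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1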

In the more generic case, with a complex schema containing records/nested fields, etc., the above approach can be a challenge.

I would propose trying the Tabledata: insertAll API with rows[].insertId set to the respective Fixed_Accident_Index for each row. In this case duplicate rows will be eliminated by BigQuery.

Of course, this will involve some client-side coding, so it might not be relevant for this particular question. I haven't tried this approach myself either, but I feel it might be interesting to try :o)

5 Comments

Thanks Mikhail, you've saved my bacon on a few occasions now!
If you have nested / repeated fields, the query I mentioned should work, as long as you set the query option to allow large results and to prevent flattening.
Instead of listing the original fields, if you are using Standard SQL you can use something like: SELECT * except(pos) FROM (...) WHERE pos = 1;
Hi guys, just on this deduping topic: let's say we pick one SQL above that works, and we want to periodically call it (saved query) to execute and then write the deduped dataset back to the same table (effectively overwriting). Assume in this scenario it's scheduled using something like Airflow, but there is another process that loads new events regularly. Is there a chance of missing data here if, say, for a large table the SQL is running and new data arrives at the same time, so you write back results that don't have the new data in them? Is this possible? How best to avoid it if so? thx
@AntsaR - great! glad it helped :o)

If you have a large partitioned table and only have duplicates in a certain partition range, you don't want to scan or process the whole table. Use the MERGE SQL below with predicates on the partition range:

-- WARNING: back up the table before this operation
-- For a large timestamp-partitioned table
-- -------------------------------------------
-- To de-duplicate rows in a given range of a partitioned table, using surrogate_key as the unique id
-- -------------------------------------------

DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles");
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");

MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
  SELECT k.*
  FROM (
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k 
    FROM `gcp_project`.`data_set`.`the_table` AS original_data
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
  )

) AS INTERNAL_SOURCE
ON FALSE

WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
    THEN DELETE

WHEN NOT MATCHED THEN INSERT ROW

credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a



I like this approach: no messing with DDL, resetting PKs/FKs, etc.

QUALIFY is a really neat thing!

create or replace table mytable_temp
 as
 select * from mytable t
  qualify row_number() over (partition by t.pk_docid) = 1;

truncate table mytable;

insert into mytable
 select * from mytable_temp t;

drop table mytable_temp;

1 Comment

Rather than moving/deleting/replacing all rows, you can move just the duplicates:

create or replace table mytable_temp as
select * from mytable t
qualify row_number() over (partition by t.pk_docid) = 2;

delete from mytable where pk_docid in (select pk_docid from mytable_temp);

insert into mytable select * from mytable_temp t;

drop table mytable_temp;

Easier answer, without a subselect

  SELECT
      *,
      ROW_NUMBER()
          OVER (PARTITION BY Fixed_Accident_Index)
          row_number
  FROM Accidents.CleanedFilledCombined
  WHERE TRUE
  QUALIFY row_number = 1

The WHERE TRUE is necessary because QUALIFY needs a WHERE, GROUP BY, or HAVING clause.
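
A sketch of a variation: putting the window function directly in the QUALIFY clause means the output no longer carries the helper row_number column at all:

SELECT *
FROM Accidents.CleanedFilledCombined
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) = 1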



Felipe's answer is the best approach for most cases. Here is a more elegant way to accomplish the same:

CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined
AS
SELECT 
  Fixed_Accident_Index, 
  ARRAY_AGG(x LIMIT 1)[SAFE_OFFSET(0)].* EXCEPT(Fixed_Accident_Index)
FROM Accidents.CleanedFilledCombined AS x
GROUP BY Fixed_Accident_Index;

To be safe, make sure you back up the original table before you run this ^^

I don't recommend using the ROW_NUMBER() OVER() approach if possible, since you may run into BigQuery memory limits and get unexpected errors.



Create two identical CTEs and use UNION DISTINCT.

with tbl as (select * from my_table),
     tbl_1 as (select * from my_table)

select * from tbl

union distinct

select * from tbl_1


  1. Update the BigQuery schema with a new table column bq_uuid, NULLABLE and type STRING.

  2. Create duplicate rows, for example by running the same command 5 times:

INSERT INTO `beginner-290513.917834811114.messages` (id, type, flow, updated_at)
VALUES (19999, "hello", "inbound", '2021-06-08T12:09:03.693646')

  3. Check that duplicate entries exist:

SELECT * FROM `beginner-290513.917834811114.messages` WHERE id = 19999

  4. Use the GENERATE_UUID function to generate a uuid corresponding to each message:

UPDATE `beginner-290513.917834811114.messages`
SET bq_uuid = GENERATE_UUID()
WHERE id > 0

  5. Clean duplicate entries:

DELETE FROM `beginner-290513.917834811114.messages`
WHERE bq_uuid IN (
  SELECT bq_uuid
  FROM (
    SELECT bq_uuid,
           ROW_NUMBER() OVER (PARTITION BY updated_at ORDER BY bq_uuid) AS row_num
    FROM `beginner-290513.917834811114.messages`
  ) t
  WHERE t.row_num > 1
);



When it comes to large deduplication jobs, the QUALIFY clause appears to be the most effective and efficient option, as explained here
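
A minimal sketch of that approach, assuming an id column to de-duplicate on (WHERE TRUE is included for the reason given in the QUALIFY answer above):

CREATE OR REPLACE TABLE project.dataset.table AS
SELECT *
FROM project.dataset.table
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY id) = 1;  -- id is a placeholder for your unique key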



There are a few ways that can work for you:

CREATE OR REPLACE TABLE test_vehicleTemperatureMeasurementsAtSpokes
-- PARTITION BY DATE(`_airbyte_extracted_at`)
-- CLUSTER BY airbyte_pk, _airbyte_extracted_at
AS
SELECT * EXCEPT(row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY airbyte_pk
      ORDER BY `partition` DESC
    ) AS row_num
  FROM
    test_vehicleTemperatureMeasurementsAtSpokes
)
WHERE row_num = 1;

or

-- Step 1: Create a new, temporary deduplicated table with the correct partitioning.
CREATE TABLE `rte0-data-dwh-ods`.ods.deduplicated__vehicleTemperatureMeasurementsAtSpokes
PARTITION BY DATE(`_airbyte_extracted_at`)
OPTIONS (
   description="Temp table for deduplication"
) AS
SELECT * EXCEPT(row_num) FROM (
 SELECT
   *,
   ROW_NUMBER() OVER (PARTITION BY airbyte_pk ORDER BY `partition` DESC) AS row_num
 FROM `rte0-data-dwh-ods`.ods.test_vehicleTemperatureMeasurementsAtSpokes
) WHERE row_num = 1;


-- Step 2: Drop the original table
DROP TABLE `rte0-data-dwh-ods`.ods.test_vehicleTemperatureMeasurementsAtSpokes;


-- Step 3: Rename the temporary table to the original name
ALTER TABLE `rte0-data-dwh-ods`.ods.deduplicated__vehicleTemperatureMeasurementsAtSpokes
RENAME TO test_vehicleTemperatureMeasurementsAtSpokes;

