
I have a table with >1M rows of data and 20+ columns.

Within my table (tableX) I have identified duplicate records (~80k) in one particular column (troubleColumn).

If possible I would like to retain the original table name and remove the duplicate records from my problematic column; otherwise, I could create a new table (tableXfinal) with the same schema but without the duplicates.

I am not proficient in SQL or any other programming language so please excuse my ignorance.

This is the query I have tried:

DELETE FROM Accidents.CleanedFilledCombined
WHERE Fixed_Accident_Index IN (
  SELECT Fixed_Accident_Index
  FROM Accidents.CleanedFilledCombined
  GROUP BY Fixed_Accident_Index
  HAVING COUNT(Fixed_Accident_Index) > 1
);
  • I've just read that BigQuery tables are append-only, so I guess I'll need to make a copy of my table! Commented Apr 17, 2016 at 11:12
  • To de-duplicate rows on a single partition, see: stackoverflow.com/a/57900778/132438 Commented Sep 12, 2019 at 6:22

12 Answers


You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).

Here is a query that should work:

SELECT *
FROM (
  SELECT
      *,
      ROW_NUMBER()
          OVER (PARTITION BY Fixed_Accident_Index)
          row_number
  FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
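
The rewrite itself can also be expressed as a single Standard SQL statement. A sketch wrapping the query above in CREATE OR REPLACE TABLE, with the helper row_number column dropped via EXCEPT:

CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined AS
SELECT * EXCEPT(row_number)
FROM (
  SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS row_number
  FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1;

Back up the table first if you go this route, since it overwrites the source in place.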

6 Comments

see my answer below for a more scalable alternative with #standardSQL
Is there a way to do this via the API?
One problem with overwriting is that the fields of the new table's schema all become NULLABLE.
This is as solid of an answer as you can get on S/O. Thanks Jordan.
In general it's bad practice to overwrite an existing table, as you may find you made a mistake somewhere in your query. It's better to write it as a separate table and once you're sure it's good, delete the old one and rename the new one.

UPDATE 2019: To de-duplicate rows on a single partition with a MERGE, see stackoverflow.com/a/57900778/132438


An alternative to Jordan's answer, one that scales better when there are too many duplicates:

SELECT event.* FROM (
  SELECT ARRAY_AGG(
    t ORDER BY t.created_at DESC LIMIT 1
  )[OFFSET(0)]  event
  FROM `githubarchive.month.201706` t 
  # GROUP BY the id you are de-duplicating by
  GROUP BY actor.id
)

Or a shorter version (takes any row, instead of the newest one):

SELECT k.*
FROM (
  SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k 
  FROM `fh-bigquery.reddit_comments.2017_01` x 
  GROUP BY id
)

To de-duplicate rows on an existing table:

CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
# SELECT id FROM UNNEST([1,1,1,2,2]) id
SELECT k.*
FROM (
  SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k 
  FROM `deleting.deduplicating_table` row
  GROUP BY id
)
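
If you want the existing-table form to keep the newest row rather than an arbitrary one, the two snippets above combine naturally. A sketch, assuming the table has a created_at timestamp column:

CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
SELECT k.*
FROM (
  SELECT ARRAY_AGG(
    row ORDER BY row.created_at DESC LIMIT 1  -- assumes a created_at column exists
  )[OFFSET(0)] k
  FROM `deleting.deduplicating_table` row
  GROUP BY id
)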

7 Comments

Hi Felipe, very cool! As a matter of curiosity, how would you construct a #standardSQL query (only) that used DELETE DML on the source table, instead of rewriting it, in order to remove duplicates?
Answer updated with a one-step de-duplication for an existing table
when I ran the shorter version, my query took too long to respond.
@intotecho weird - longer version takes less time to execute? try posting your job ids on the bigquery issue tracker
Ah, I forgot to include the first line CREATE OR REPLACE TABLE deleting.deduplicating_table. That's why it didn't finish.

Not sure why nobody mentioned a DISTINCT query.

Here is the way to clean duplicate rows:

CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT DISTINCT * FROM project.dataset.table
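
If SELECT DISTINCT * fails because the table has STRUCT columns (as a commenter notes below), one workaround sketch, in the spirit of the ARRAY_AGG answer above, is to group on the row's JSON form:

CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT k.*
FROM (
  SELECT ARRAY_AGG(t LIMIT 1)[OFFSET(0)] k
  FROM project.dataset.table t
  -- the whole row serialized to JSON acts as the de-duplication key
  GROUP BY TO_JSON_STRING(t)
)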

8 Comments

This doesn't work if you have more than one column in your table (or perhaps I'm doing something wrong?)
Definitely the easiest way to do what I was trying to do - thanks! Doesn't directly answer OP's question, but it answers why I landed here :) @OriolNieto - it works with all your columns. You can swap * for a list of specific columns if you want to verify how it works
This doesn't work if the existing table is partitioned.
I think if you have a column that's a struct it won't work with *. That might be what @OriolNieto was seeing.
Or if we want to dedup rows that have the same id but different values in other columns, e.g. updated_at

If your schema doesn't have any RECORD (nested) fields, the variation of Jordan's answer below will work well enough, whether writing over the same table or to a new one:

SELECT <list of original fields>
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos
  FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1
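
As a commenter notes below, with Standard SQL you can avoid listing the original fields by using * EXCEPT:

SELECT * EXCEPT(pos)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos
  FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1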

In the more generic case, with a complex schema containing records/nested fields, etc., the above approach can be a challenge.

I would propose trying the Tabledata: insertAll API with rows[].insertId set to the respective Fixed_Accident_Index for each row. In this case duplicate rows will be eliminated by BigQuery.

Of course, this will involve some client-side coding, so it might not be relevant for this particular question. I haven't tried this approach myself either, but I feel it might be interesting to try :o)

5 Comments

Thanks Mikhail, you've saved my bacon on a few occasions now!
If you have nested / repeated fields, the query I mentioned should work, as long as you set the query option to allow large results and to prevent flattening.
Instead of listing the original fields, if you are using Standard SQL you can use something like: SELECT * except(pos) FROM (...) WHERE pos = 1;
Hi guys, just on this deduping topic: let's say we pick one SQL above that works, and we want to periodically call it (saved query) to execute and then write the deduped dataset back to the same table (effectively overwriting). Assume in this scenario it's scheduled using something like Airflow, but there is another process that loads new events regularly. Is there a chance of missing data here if, say, for a large table the SQL is running and new data arrives at the same time, so you write back results that don't have the new data in them? Is this possible? How best to avoid it if so? thx
@AntsaR - great! glad it helped :o)

If you have a large partitioned table and only have duplicates in a certain partition range, you don't want to scan or process the whole table. Use the MERGE SQL below with predicates on the partition range:

-- WARNING: back up the table before this operation
-- For a large timestamp-partitioned table
-- -------------------------------------------
-- To de-duplicate rows in a given range of a partitioned table, using surrogate_key as the unique id
-- -------------------------------------------

DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles");
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");

MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
  SELECT k.*
  FROM (
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k 
    FROM `gcp_project`.`data_set`.`the_table` AS original_data
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
  )

) AS INTERNAL_SOURCE
ON FALSE

WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
    THEN DELETE

WHEN NOT MATCHED THEN INSERT ROW

credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a



I like this approach: no messing with DDL, resetting PKs/FKs, etc.

QUALIFY is a really neat thing!

create or replace table mytable_temp
 as
 select * from mytable t
  qualify row_number() over (partition by t.pk_docid) = 1;

truncate table mytable;

insert into mytable
 select * from mytable_temp t;

drop table mytable_temp;

1 Comment

Rather than moving/deleting/replacing all rows, you can move just the duplicates:

create or replace table mytable_temp as
select * from mytable t
qualify row_number() over (partition by t.pk_docid) = 2;

delete from mytable where pk_docid in (select pk_docid from mytable_temp);

insert into mytable select * from mytable_temp t;

drop table mytable_temp;

Easier answer, without a subselect

  SELECT
      *,
      ROW_NUMBER()
          OVER (PARTITION BY Fixed_Accident_Index)
          row_number
  FROM Accidents.CleanedFilledCombined
  WHERE TRUE
  QUALIFY row_number = 1

The WHERE TRUE is necessary because QUALIFY needs a WHERE, GROUP BY, or HAVING clause.
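
A sketch of a variation: putting the window function directly in the QUALIFY clause means the output no longer carries the helper row_number column at all:

SELECT *
FROM Accidents.CleanedFilledCombined
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) = 1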



Felipe's answer is the best approach for most cases. Here is a more elegant way to accomplish the same:

CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined
AS
SELECT 
  Fixed_Accident_Index, 
  ARRAY_AGG(x LIMIT 1)[SAFE_OFFSET(0)].* EXCEPT(Fixed_Accident_Index)
FROM Accidents.CleanedFilledCombined AS x
GROUP BY Fixed_Accident_Index;

To be safe, make sure you back up the original table before you run this ^^

I don't recommend using the ROW_NUMBER() OVER() approach if possible, since you may run into BigQuery memory limits and get unexpected errors.



Create two identical CTEs and use UNION DISTINCT.

with tbl as (select * from my_table),
     tbl_1 as (select * from my_table)

select * from tbl

union distinct

select * from tbl_1


  1. Update the BigQuery schema with a new table column bq_uuid, NULLABLE and type STRING.

  2. Create duplicate rows, for example by running the same command 5 times:

INSERT INTO `beginner-290513.917834811114.messages` (id, type, flow, updated_at)
VALUES (19999, "hello", "inbound", '2021-06-08T12:09:03.693646')

  3. Check that duplicate entries exist:

SELECT * FROM `beginner-290513.917834811114.messages` WHERE id = 19999

  4. Use the GENERATE_UUID function to generate a uuid corresponding to each message:

UPDATE `beginner-290513.917834811114.messages`
SET bq_uuid = GENERATE_UUID()
WHERE id > 0

  5. Clean duplicate entries:

DELETE FROM `beginner-290513.917834811114.messages`
WHERE bq_uuid IN (
  SELECT bq_uuid
  FROM (
    SELECT bq_uuid,
           ROW_NUMBER() OVER (PARTITION BY updated_at ORDER BY bq_uuid) AS row_num
    FROM `beginner-290513.917834811114.messages`
  ) t
  WHERE t.row_num > 1
);



When it comes to large deduplication jobs, the QUALIFY clause appears to be the most effective and efficient option, as explained here
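
A minimal sketch of that approach, assuming an id column to de-duplicate on (WHERE TRUE is included for the reason given in the QUALIFY answer above):

CREATE OR REPLACE TABLE project.dataset.table AS
SELECT *
FROM project.dataset.table
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY id) = 1;  -- id is a placeholder for your unique key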



There are a few ways that can work for you:

CREATE OR REPLACE TABLE test_vehicleTemperatureMeasurementsAtSpokes
-- PARTITION BY DATE(`_airbyte_extracted_at`)
-- CLUSTER BY airbyte_pk, _airbyte_extracted_at
AS
SELECT * EXCEPT(row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY airbyte_pk
      ORDER BY `partition` DESC
    ) AS row_num
  FROM
    test_vehicleTemperatureMeasurementsAtSpokes
)
WHERE row_num = 1;

or

-- Step 1: Create a new, temporary deduplicated table with the correct partitioning.
CREATE TABLE `rte0-data-dwh-ods`.ods.deduplicated__vehicleTemperatureMeasurementsAtSpokes
PARTITION BY DATE(`_airbyte_extracted_at`)
OPTIONS (
   description="Temp table for deduplication"
) AS
SELECT * EXCEPT(row_num) FROM (
 SELECT
   *,
   ROW_NUMBER() OVER (PARTITION BY airbyte_pk ORDER BY `partition` DESC) AS row_num
 FROM `rte0-data-dwh-ods`.ods.test_vehicleTemperatureMeasurementsAtSpokes
) WHERE row_num = 1;


-- Step 2: Drop the original table
DROP TABLE `rte0-data-dwh-ods`.ods.test_vehicleTemperatureMeasurementsAtSpokes;


-- Step 3: Rename the temporary table to the original name
ALTER TABLE `rte0-data-dwh-ods`.ods.deduplicated__vehicleTemperatureMeasurementsAtSpokes
RENAME TO test_vehicleTemperatureMeasurementsAtSpokes;

