I have some code statistics in a Greenplum table A:

| id  | file   | repo | lang | line |
-------------------------------------
| a   | /a.txt | r1   | txt  | 3    |
| a   | /b.c   | r1   | c    | 5    |
| b   | /x.java| r1   | java | 33   |
| c   | /f.cpp | r2   | c++  | 23   |
| a   | /a.txt | r3   | txt  | 3    |
| a   | /b.c   | r3   | c    | 5    |

The last two rows, however, describe code that actually comes from repo r1, because their commit id is the same as in the first two rows. I want to remove these duplicate rows and insert the result into table B:

| id  | file   | repo | lang | line |
-------------------------------------
| a   | /a.txt | r1   | txt  | 3    |
| a   | /b.c   | r1   | c    | 5    |
| b   | /x.java| r1   | java | 33   |
| c   | /f.cpp | r2   | c++  | 23   |

A row is uniquely identified by: id + file + repo

Thanks in advance.

2 Answers

You can use NOT EXISTS to check that a duplicate does not exist:

SELECT *
FROM t
WHERE NOT EXISTS (
    SELECT 1
    FROM t AS x
    WHERE x.id   = t.id
    AND   x.file = t.file
    AND   x.repo < t.repo
)
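
Since the question asks for the result to end up in table B, the query above can be wrapped in an INSERT ... SELECT. The following is only a sketch: the question does not give column types, so the `text` and `integer` types here are assumptions, and the table names `a` and `b` stand in for tables A and B.

```sql
-- Assumed schema: column types are not stated in the question.
CREATE TABLE a (id text, file text, repo text, lang text, line integer);
CREATE TABLE b (id text, file text, repo text, lang text, line integer);

-- Sample rows copied from table A in the question.
INSERT INTO a (id, file, repo, lang, line) VALUES
    ('a', '/a.txt',  'r1', 'txt',  3),
    ('a', '/b.c',    'r1', 'c',    5),
    ('b', '/x.java', 'r1', 'java', 33),
    ('c', '/f.cpp',  'r2', 'c++',  23),
    ('a', '/a.txt',  'r3', 'txt',  3),
    ('a', '/b.c',    'r3', 'c',    5);

-- Copy a row into b only when no other row with the same id and file
-- has a smaller repo (compared as strings), i.e. keep the smallest repo.
INSERT INTO b (id, file, repo, lang, line)
SELECT t.id, t.file, t.repo, t.lang, t.line
FROM a AS t
WHERE NOT EXISTS (
    SELECT 1
    FROM a AS x
    WHERE x.id   = t.id
      AND x.file = t.file
      AND x.repo < t.repo
);
```

With the sample rows, the two r3 rows are filtered out because an r1 row with the same id and file exists. Rows that are identical in id, file and repo would all be kept, since none of them has a strictly smaller repo than the others.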

4 Comments

What does x.repo < t.repo mean?
You seem to have different repos for a, /a.txt: r1 and r3. This condition keeps r1 and discards r3, which matches the expected output.
There may be more than two repos. Can this condition keep just one and discard the others?
That is what it does. Use < to keep the smallest repo or > to keep the greatest (compared as strings); all others are discarded. It assumes that repo differs whenever two rows have the same id and file.

Aggregation would seem to do what you want:

select id, file, min(repo) as repo, lang, line
from t
group by id, file, lang, line;
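
Note that the GROUP BY above only collapses copies whose lang and line are identical; if those values could differ between repos, the copies would fall into separate groups. As a hedged alternative, here is a sketch using the ROW_NUMBER() window function that keeps exactly one row per id and file and writes it to table B. The table names `a` and `b`, the subquery alias `ranked` and the column `rn` are illustrative names, not taken from the question.

```sql
-- Number the rows within each (id, file) group, smallest repo first,
-- then keep only the first row of every group.
INSERT INTO b (id, file, repo, lang, line)
SELECT id, file, repo, lang, line
FROM (
    SELECT id, file, repo, lang, line,
           ROW_NUMBER() OVER (PARTITION BY id, file ORDER BY repo) AS rn
    FROM a
) AS ranked
WHERE rn = 1;
```

Unlike the NOT EXISTS approach, this also collapses rows that are identical in id, file and repo, because only the row numbered 1 survives in each group.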
