3

I'm relatively new to sql and have been trying to get a pretty complex query (for me) to work for a while, but I've kept on getting duplicate values within each column using node postgres. Through this query I am trying to track user info, plan info and email info on a dashboard. Before we get to the query, here are the tables -

USER TABLE (u) - keep track of user info

+----+-------+---------+-------------+----------+
| id | first |  last   |    email    | password | 
+----+-------+---------+-------------+----------+
|  1 | joe   | smith   | [email protected] |     1234 | 
|  2 | mary  | johnson | [email protected] |     3445 | 
|  3 | harry | gold    | [email protected] |     4345 | 
+----+-------+---------+-------------+----------+

PLANS TABLE (p) - plans that users can make with their friends

+----+--------------+-----------+------------+------+--------+-----------+---------+------+
| id | experienceid | hostid(u) | guestid(u) | date |  time  | paidid(u) | groupid | newp |
+----+--------------+-----------+------------+------+--------+-----------+---------+------+
| 33 |            1 |         1 | [1,2,3]    | 4/20 | 8:00pm | [1,2]     |       1 | true |
+----+--------------+-----------+------------+------+--------+-----------+---------+------+

EMAIL TABLE (e) - keep track of messages I am sending to the users based on the plan they are a part of

+-------------+-----------+---------+----------+
|  email(u)   | planid(p) | confirm | reminder |
+-------------+-----------+---------+----------+
| [email protected] |        33 | null    | null     |
| [email protected] |        33 | true    | false    |
| [email protected] |        33 | true    | false    |
+-------------+-----------+---------+----------+

Now for the query what I am trying to do is combine all three tables to get this output -

+-------+---------------+---------------------------+---------+---------+------------+---------------+---------------+
| id(p) | hostname(u+p) |      paidguests(u+p)      | time(p) | newp(p) | groupid(p) |  reminder(e)  |  confirm(e)   |
+-------+---------------+---------------------------+---------+---------+------------+---------------+---------------+
|    33 | joe smith     | [joe smith, mary johnson] | 8:00pm  | true    |          1 | [true, false] | [true, false] |
+-------+---------------+---------------------------+---------+---------+------------+---------------+---------------+

Now where I left off in the query, I almost got it to work, but I kept getting duplicate values where it would look like this -

+-------+---------------+----------------------------------------------------+---------+---------+------------+----------------------------+---------------------------+
| id(p) | hostname(u+p) |                  paidguests(u+p)                   | time(p) | newp(p) | groupid(p) |        reminder(e)         |        confirm(e)         |
+-------+---------------+----------------------------------------------------+---------+---------+------------+----------------------------+---------------------------+
|    33 | joe smith     | [joe smith, mary johnson, joe smith, mary johnson] | 8:00pm  | true    |          1 | [true, false, true, false] | [true, false, true false] |
+-------+---------------+----------------------------------------------------+---------+---------+------------+----------------------------+---------------------------+ 

Now I don't really care about order for the confirm and reminder columns relative to the paidguests(u+p) column, as long as the right data is in there and not duplicated. Here is the query I had as it stands -

SELECT p.id, 
       Concat(u.first, ' ', u.last)              AS hostname, 
       Array_agg(Concat(us.first, ' ', us.last)) AS paidguests, 
       p.time, 
       p.groupid, 
       p.newp, 
       Array_agg(e.confirm)                      AS confirm, 
       Array_agg(e.reminder)                     AS reminder 
FROM   plans p 
       CROSS JOIN Unnest(p.paidid) AS allguests 
       LEFT JOIN users us 
              ON allguests = us.id 
       LEFT JOIN emails e 
              ON p.id = e.planid 
       LEFT JOIN users u 
              ON p.hostid = u.id 
WHERE  p.experienceid = $1 
       AND p.date = $2 
GROUP  BY p.id, 
          u.first, 
          u.last, 
          p.paidid, 
          p.time, 
          p.groupid, 
          p.newp, 
          confirm, 
          reminder 
ORDER  BY Array_length(p.paidid, 1) DESC 

So essentially just looking to get the table right without the duplicates. It was working before I added the join to the email table, but not entirely sure why its duplicating.

Hope I was thorough in the explanation. If not, please let me know what I can clarify! Thanks so much :)

3
  • CREATE TABLE statements showing data types and constraints would be more instructive. And always your version of Postgres. Commented Apr 21, 2018 at 15:02
  • I will keep that in mind for next time - thanks for the heads up! Commented Apr 21, 2018 at 16:03
  • Had a similar issue, upvoted for nicely describing it. Commented Aug 11, 2021 at 6:09

2 Answers 2

3

try adding this condition in where clause:

AND us.email = e.email

Culprit here is -> all emails are joined with all the users irrespective of their email id since all users and all emails have same plan_id. Hence the duplication.

Sign up to request clarification or add additional context in comments.

2 Comments

Thank you that did it! Appreciate the help :)
This helped me fix a problem with a similar issue.
1

Rahul spotted the missing join condition. But the rabbit hole goes deeper. I suggest this query:

SELECT p.id
     , concat_ws(' ', u.first, u.last) AS hostname  -- concat_ws!
     , p.time
     , p.groupid
     , p.newp
     , paid.paidguests
     , paid.confirm
     , paid.reminder
FROM   plans      p
LEFT   JOIN users u  ON u.id = p.hostid
LEFT   JOIN LATERAL (         -- LATERAL join
   SELECT array_agg(sub.paidguest) AS paidguests
        , array_agg(sub.confirm)   AS confirm
        , array_agg(sub.reminder)  AS reminder
   FROM  (
      SELECT concat_ws(' ', us.first, us.last) AS paidguest, e.confirm, e.reminder
      FROM   unnest(p.paidid) WITH ORDINALITY AS paid(id, ord)
      JOIN   users       us ON us.id = paid.id
      LEFT   JOIN emails e  ON e.email = us.email
                           AND e.planid = p.planid
      ORDER  BY paid.ord
      ) sub
   ) paid ON true
WHERE  p.experienceid = $1
AND    p.date = $2
-- no GROUP  BY needed
ORDER  BY cardinality(p.paidid) DESC, p.id;

Assuming that (planid, email) is the PRIMARY KEY of the email table and there is a FOREIGN KEY constraint from email.email to users.email.

Aggregate first, then join, so you don't need to GROUP BY on all the columns that don't need aggregation. While retrieving all or most rows, other query techniques are typically faster, for a small selection like in your example, I suggest a LATERAL join. Related:

In this particular case JOIN LATERAL would do the same as LEFT JOIN LATERAL since a subuery with aggregates and no GROUP BY always returns exactly 1 row.

The alias in your original Unnest(p.paidid) AS allguests is actively confusing, since those seem to be the IDs of guests who paid, not of all guests.

Use concat_ws() if first or last can be null. See:

When unnesting arrays, the order of elements is typically preserved in the simple case. But you have additional joins, so use WITH ORDINALITY and ORDER BY to avoid surprises. Your query may seem to work alright, even for a prolonged period of time - and then "suddenly" break (wrong order of elements) if you are not explicit about this. See:

The whole DB design is debatable. Typically, these arrays should rather be implemented as related tables - for many reasons, beyond the scope of this question.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.