5

I have the following table definitions:

Table public.messages:

Column | Type    | Collation | Nullable | Default
-------+---------+-----------+----------+--------
ip     | text    |           |          |
msg    | text    |           |          |
ignore | boolean |           |          |

Table public.host:

Column | Type | Collation | Nullable | Default
-------+------+-----------+----------+--------
ip     | text |           |          |
name   | text |           |          |

Yes, I could use a dedicated IP type (inet) for the ip column, but for this example it doesn't matter.

I run this query:

SELECT * 
FROM messages 
LEFT JOIN host ON messages.ip = host.ip;

And get this result:

ip      | msg   | ignore | ip      | name
--------+-------+--------+---------+------
1.1.1.1 | Test1 |        | 1.1.1.1 | host1
1.1.1.2 | Test2 |        |         |

Sample data can be found here: https://dbfiddle.uk/3LTo9shO

I want to update all rows where there is no host name to set ignore to true.

So basically, just the ignore column for this result:

SELECT * 
FROM messages 
LEFT JOIN host ON messages.ip = host.ip 
WHERE host.name IS NULL;

ip      | msg   | ignore | ip | name
--------+-------+--------+----+-----
1.1.1.2 | Test2 |        |    |

I can do an update for the row that does have a host name. When I set the value, everything works:

UPDATE messages 
SET ignore = FALSE 
FROM host 
WHERE messages.ip = host.ip AND host.name IS NOT NULL;

UPDATE 1

SELECT * 
FROM messages 
LEFT JOIN host ON messages.ip = host.ip;

ip      | msg   | ignore | ip      | name
--------+-------+--------+---------+------
1.1.1.1 | Test1 | f      | 1.1.1.1 | host1
1.1.1.2 | Test2 |        |         |

However, setting true for those that don't have a host name doesn't work.

UPDATE messages 
SET ignore = TRUE 
FROM host 
WHERE messages.ip = host.ip AND host.name IS NULL;

UPDATE 0

I tried with a LEFT JOIN, but then everything was updated:

UPDATE messages 
SET ignore = TRUE 
FROM messages m 
LEFT JOIN host h ON m.ip = h.ip 
WHERE h.name IS NULL;

UPDATE 2

SELECT * 
FROM messages 
LEFT JOIN host ON messages.ip = host.ip;

ip      | msg   | ignore | ip      | name
--------+-------+--------+---------+------
1.1.1.1 | Test1 | t      | 1.1.1.1 | host1
1.1.1.2 | Test2 | t      |         |

https://dbfiddle.uk/fOJ55JBM

How can I update just rows without a host name?

4
  • Try joining the target table (messages as s) to the source query (messages as m): UPDATE messages s SET ignore=TRUE FROM messages m LEFT JOIN host h ON m.ip = h.ip WHERE m.ip = s.ip and h.name IS NULL; (dbfiddle.uk/N2wNIUIG) Commented Nov 7 at 22:58
  • @ValNik This will work, but it adds a full extra join and a second instance of the table. Commented Nov 7 at 23:07
  • Yes, this is an additional join. I gave the example closest to the OP's attempt. Better is: UPDATE messages s SET ignore=TRUE WHERE not exists(select ip from host h where h.ip=s.ip); Commented Nov 7 at 23:14
  • I tried running the queries from the answers and comments. The execution plans differ significantly and may depend heavily on the number of missing host addresses, so for me the choice of method remains unclear. Model (dbfiddle.uk/EMngx8Vi) Commented Nov 8 at 9:00

4 Answers

6

I'm currently much more of a SQL Server guy, especially at work, but I'm also a PostgreSQL fan in many ways and would probably look there first for a personal project. I love it most of the time. And yet... the UPDATE+JOIN syntax in PostgreSQL is actually kind of bad, especially relative to SQL Server or MySQL, both of which (imo) use a more elegant strategy for this.

Specifically, to use multiple tables in an update, PostgreSQL treats the table named next to the UPDATE keyword as if it were the first item in the FROM section. If you then include additional tables in the FROM section, you have to use conditional expressions in the WHERE clause to relate them back to the original table, as we did with the old-school pre-ANSI-92 joins. ~SHUDDER~

So when we have this query:

UPDATE messages 
SET ignore=TRUE 
FROM messages m 
LEFT JOIN host h ON m.ip = h.ip 
WHERE h.name IS NULL;

what's actually going on is there are two references for the messages table, with no relation between them, and no relation from host back to the original messages table.
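
To see why every row was updated, it can help to write the same join as a SELECT. This is only a sketch, assuming the two-row sample from the question; the temp tables and the "target" alias are illustrative and not part of the original query:

```sql
BEGIN;
-- Assumed sample data mirroring the question (temp tables, rolled back at the end).
CREATE TEMP TABLE messages (ip text, msg text, ignore boolean);
CREATE TEMP TABLE host (ip text, name text);
INSERT INTO messages (ip, msg) VALUES ('1.1.1.1', 'Test1'), ('1.1.1.2', 'Test2');
INSERT INTO host (ip, name) VALUES ('1.1.1.1', 'host1');

-- "target" stands in for the table named after UPDATE. Nothing ties it to m or h,
-- so each target row is paired with every row that survives the WHERE clause.
SELECT target.ip AS target_ip, m.ip AS m_ip, h.name
FROM messages AS target
CROSS JOIN messages AS m
LEFT JOIN host AS h ON m.ip = h.ip
WHERE h.name IS NULL;
ROLLBACK;
```

With this sample, both target rows come back, each paired with the 1.1.1.2 row of m, which matches the UPDATE 2 seen in the question.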

In addition to needing the old-style join predicates, the PostgreSQL syntax also assumes an INNER JOIN for this next table. This is a problem. How can we get a LEFT JOIN instead?

The old way to do this for pre-ANSI-92 joins was the (+) operator (Oracle's outer-join notation), but I haven't found a way to include this operator that actually works. I don't think PostgreSQL supports it at all, but I thought it worth a quick try. Maybe someone can correct me and show the right method.

The next option, and the most common technique I've seen on Google, is to create an inner join to a second reference for the same table on the primary keys, so we get a one-to-one match. Then you can do the exclusion join from this second table instance. This is what's suggested in the comments, but I don't love it because it adds unnecessary extra work to the query. Even if the optimizer is smart enough to remove this, it's still extra code and cognitive load.

Instead, I think the best option for your query is to convert the exclusion join to a NOT EXISTS subquery:

UPDATE messages m
SET ignore=TRUE 
WHERE NOT EXISTS (
    SELECT 1 FROM host h WHERE h.ip = m.ip
)

See it here:

https://dbfiddle.uk/fPsyYcuu

Note this doesn't cover every possibility, as it doesn't account for cases where the host row exists but its name value is null. But I believe this aligns with what you expect to see in the data and how you expect the query to work, and it is easy enough to adjust if I'm wrong.
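
If host rows with a null name should also count as "no host name", one possible adjustment is to add the name check inside the subquery. A minimal sketch, assuming a hypothetical third message (1.1.1.3) whose host row exists but has no name:

```sql
BEGIN;
-- Assumed sample data; the 1.1.1.3 rows are hypothetical additions.
CREATE TEMP TABLE messages (ip text, msg text, ignore boolean);
CREATE TEMP TABLE host (ip text, name text);
INSERT INTO messages (ip, msg)
VALUES ('1.1.1.1', 'Test1'), ('1.1.1.2', 'Test2'), ('1.1.1.3', 'Test3');
INSERT INTO host (ip, name) VALUES ('1.1.1.1', 'host1'), ('1.1.1.3', NULL);

UPDATE messages m
SET ignore = TRUE
WHERE NOT EXISTS (
    SELECT 1
    FROM host h
    WHERE h.ip = m.ip
      AND h.name IS NOT NULL  -- a host row without a name no longer counts as a match
);

SELECT ip, msg, ignore FROM messages ORDER BY ip;
ROLLBACK;
```

With this data the update flags both 1.1.1.2 (no host row) and 1.1.1.3 (host row with a null name) and leaves 1.1.1.1 untouched.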

In fact, if there are only two tables, I believe one can always rewrite the exclusion join as a NOT EXISTS(). If there are more than two tables, depending on the entire query, one can sometimes rewrite it to place an INNER JOIN first and do an exclusion join from there, but you might also need to put additional JOINs in the NOT EXISTS(), or even nest joins, which is never fun.

4

Upvoted @JoelCoehoorn's answer, but here's an alternative for what it's worth:

UPDATE messages
SET ignore=TRUE
FROM (
  SELECT ip FROM messages
  EXCEPT 
  SELECT ip FROM host
) AS "ignore"
WHERE messages.ip = "ignore".ip;

This at least eliminates the correlated subquery, replacing it with a table-subquery that only needs to be executed once.

Using EXCEPT in this way means the table-subquery includes only the messages.ip values that are not in host.ip.
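
One of the comments below suggests that the same EXCEPT set can be used as an IN filter instead of a joined subquery. A sketch of that form, assuming the question's sample data recreated in temp tables:

```sql
BEGIN;
-- Assumed sample data mirroring the question (temp tables, rolled back at the end).
CREATE TEMP TABLE messages (ip text, msg text, ignore boolean);
CREATE TEMP TABLE host (ip text, name text);
INSERT INTO messages (ip, msg) VALUES ('1.1.1.1', 'Test1'), ('1.1.1.2', 'Test2');
INSERT INTO host (ip, name) VALUES ('1.1.1.1', 'host1');

-- Same set of "message ips with no host row", applied as a filter on the target table.
UPDATE messages
SET ignore = TRUE
WHERE ip IN (
    SELECT ip FROM messages
    EXCEPT
    SELECT ip FROM host
);

SELECT ip, msg, ignore FROM messages ORDER BY ip;
ROLLBACK;
```

Like the joined version, this only looks at whether a host row exists for the ip; it does not check host.name.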


EDIT: I inserted 10^5 rows into the messages table and 10^5 rows into the host table.

I also added an index on the host.ip column.

create index hi on host(ip);

insert into messages (ip, msg, ignore)
select text('0.0.0.0'::inet + random(0,(2^31-1)::int)), 'Test 2', NULL
from generate_series(1,100000);

insert into host (ip, name)
select text('0.0.0.0'::inet + random(0,(2^31-1)::int)), 'host '||g.n
from generate_series(1,100000) as g(n);

analyze host;
analyze messages;

Here is the EXPLAIN for my solution.

EXPLAIN ANALYZE
UPDATE messages
SET ignore=TRUE
FROM (
  SELECT ip FROM messages
  EXCEPT
  SELECT ip FROM host
) AS "ignore"
WHERE messages.ip = "ignore".ip;

 Update on messages  (cost=6822.00..9447.00 rows=0 width=0) (actual time=302.638..302.639 rows=0.00 loops=1)
   Buffers: shared hit=303423 dirtied=702 written=702
   ->  Hash Join  (cost=6822.00..9447.00 rows=100000 width=89) (actual time=33.526..81.041 rows=99991.00 loops=1)
         Hash Cond: ((ignore.ip)::text = (messages.ip)::text)
         Buffers: shared hit=2072
         ->  Subquery Scan on ignore  (cost=3904.00..5154.00 rows=100000 width=140) (actual time=21.451..37.250 rows=99988.00 loops=1)
               Buffers: shared hit=1404
               ->  HashSetOp Except  (cost=3904.00..4154.00 rows=100000 width=58) (actual time=21.444..28.168 rows=99988.00 loops=1)
                     Buffers: shared hit=1404
                     ->  Seq Scan on messages messages_1  (cost=0.00..1668.00 rows=100000 width=16) (actual time=0.006..3.730 rows=100000.00 loops=1)
                           Buffers: shared hit=668
                     ->  Seq Scan on host  (cost=0.00..1736.00 rows=100000 width=16) (actual time=0.007..3.778 rows=100000.00 loops=1)
                           Buffers: shared hit=736
         ->  Hash  (cost=1668.00..1668.00 rows=100000 width=22) (actual time=12.066..12.067 rows=100000.00 loops=1)
               Buckets: 131072  Batches: 1  Memory Usage: 6381kB
               Buffers: shared hit=668
               ->  Seq Scan on messages  (cost=0.00..1668.00 rows=100000 width=22) (actual time=0.004..4.458 rows=100000.00 loops=1)
                     Buffers: shared hit=668
 Planning:
   Buffers: shared hit=21
 Planning Time: 0.150 ms
 Execution Time: 302.704 ms

Here's the EXPLAIN for Joel Coehoorn's solution:

EXPLAIN ANALYZE
UPDATE messages m
SET ignore=TRUE
WHERE NOT EXISTS (
    SELECT 1 FROM host h WHERE h.ip = m.ip
);

 Update on messages m  (cost=2986.00..8108.66 rows=0 width=0) (actual time=295.080..295.080 rows=0.00 loops=1)
   Buffers: shared hit=303369 dirtied=701 written=701
   ->  Hash Anti Join  (cost=2986.00..8108.66 rows=105090 width=13) (actual time=42.393..78.341 rows=99991.00 loops=1)
         Hash Cond: ((m.ip)::text = (h.ip)::text)
         Buffers: shared hit=2106
         ->  Seq Scan on messages m  (cost=0.00..3420.90 rows=205090 width=22) (actual time=0.123..12.703 rows=100000.00 loops=1)
               Buffers: shared hit=1370
         ->  Hash  (cost=1736.00..1736.00 rows=100000 width=22) (actual time=42.098..42.099 rows=100000.00 loops=1)
               Buckets: 131072  Batches: 1  Memory Usage: 6381kB
               Buffers: shared hit=736
               ->  Seq Scan on host h  (cost=0.00..1736.00 rows=100000 width=22) (actual time=0.008..14.771 rows=100000.00 loops=1)
                     Buffers: shared hit=736
 Planning:
   Buffers: shared hit=10
 Planning Time: 0.390 ms
 Execution Time: 295.232 ms

Here's the EXPLAIN for Zegarek's solution:

EXPLAIN ANALYZE
UPDATE messages AS m
SET ignore=NOT EXISTS(SELECT FROM host AS h
                      WHERE h.ip=m.ip
                        AND h.name IS NOT NULL);

 Update on messages m  (cost=0.00..2620274.35 rows=0 width=0) (actual time=277.147..277.147 rows=0.00 loops=1)
   Buffers: shared hit=303985 dirtied=701 written=701
   ->  Seq Scan on messages m  (cost=0.00..2620274.35 rows=310030 width=7) (actual time=51.317..71.654 rows=100000.00 loops=1)
         Buffers: shared hit=2807
         SubPlan 2
           ->  Seq Scan on host h  (cost=0.00..1736.00 rows=100000 width=32) (actual time=0.020..17.453 rows=100000.00 loops=1)
                 Filter: (name IS NOT NULL)
                 Buffers: shared hit=736
 Planning:
   Buffers: shared hit=3
 Planning Time: 0.254 ms
 Execution Time: 277.203 ms

EXPLAIN for ValNik's solution:

EXPLAIN ANALYZE       
UPDATE messages s  
SET ignore=TRUE   
FROM messages m   
LEFT JOIN host h ON m.ip = h.ip   
WHERE m.ip = s.ip and  h.name IS NULL;

Update on messages s  (cost=7432.01..10878.02 rows=0 width=0) (actual time=304.568..304.569 rows=0.00 loops=1)
   Buffers: shared hit=306541 dirtied=33 written=33
   ->  Hash Join  (cost=7432.01..10878.02 rows=1 width=19) (actual time=85.078..112.464 rows=100001.00 loops=1)
         Hash Cond: ((s.ip)::text = (m.ip)::text)
         Buffers: shared hit=4878
         ->  Seq Scan on messages s  (cost=0.00..3071.00 rows=100000 width=22) (actual time=0.022..6.152 rows=100000.00 loops=1)
               Buffers: shared hit=2071
         ->  Hash  (cost=7432.00..7432.00 rows=1 width=28) (actual time=85.044..85.045 rows=99995.00 loops=1)
               Buckets: 131072 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 6381kB
               Buffers: shared hit=2807
               ->  Hash Left Join  (cost=2986.00..7432.00 rows=1 width=28) (actual time=37.050..69.844 rows=99995.00 loops=1)
                     Hash Cond: ((m.ip)::text = (h.ip)::text)
                     Filter: (h.name IS NULL)
                     Rows Removed by Filter: 5
                     Buffers: shared hit=2807
                     ->  Seq Scan on messages m  (cost=0.00..3071.00 rows=100000 width=22) (actual time=0.005..12.510 rows=100000.00 loops=1)
                           Buffers: shared hit=2071
                     ->  Hash  (cost=1736.00..1736.00 rows=100000 width=32) (actual time=36.841..36.841 rows=100000.00 loops=1)
                           Buckets: 131072  Batches: 1  Memory Usage: 7445kB
                           Buffers: shared hit=736
                           ->  Seq Scan on host h  (cost=0.00..1736.00 rows=100000 width=32) (actual time=0.005..12.083 rows=100000.00 loops=1)
                                 Buffers: shared hit=736
 Planning:
   Buffers: shared hit=27
 Planning Time: 0.918 ms
 Execution Time: 304.807 ms

It looks like all the solutions involve some Seq Scan operations. Zegarek's solution had the lowest execution time (about 277 ms, against roughly 295 to 305 ms for the others), though the spread is small.

7 Comments

  • Did you have a look at the execution plans? I think Postgres will very likely optimize the NOT EXISTS query pretty well and execute it much faster. But it might depend on indexes.
  • That's a good point. EXCEPT tends to perform worse than equivalent queries using EXISTS and regular (anti-)joins. I think everyone would prefer it if the planner understood that equivalence better, because this syntax tends to be way clearer than the alternatives. This example could be even shorter and simpler with a WHERE ip IN (SELECT ip..EXCEPT..)
  • Thanks for adding the plans. I'm afraid that's way too few rows to call that a win, the dozen microseconds could easily be just noise. These might all go different ways with reasonable scale and indexing. Even with some indexes in place, in the lower thousands PG will tend to ignore them and seq scan the table directly. With the host.name nullability unspecified, I'm also potentially doing a redundant check on that null, making these variants inequivalent.
  • I repeated the test after loading both tables with 10^5 (100k) rows.
  • @BillKarwinl, please add a plan for the query with an additional JOIN to your response. A query like UPDATE messages s SET ignore=TRUE FROM messages m LEFT JOIN host h ON m.ip = h.ip WHERE m.ip = s.ip and h.name IS NULL;
  • Done. It's a good try, but it doesn't change the winner position.
  • Thanks! I prefer the winning option. However, this option with an additional JOIN is also good.
3
update messages as m
set ignore=not exists(select from host as h 
                      where h.ip=m.ip
                        and h.name is not null);
  1. I added a third test case with a host entry matching a message but with a null in host.name. If that isn't allowed, the "and h.name is not null" condition above can be removed.
  2. I'm updating all rows; messages with a named host get an ignore=false, those without a host or with an unnamed one get a true. Depending on cardinalities it might be more efficient to only target one group or the other, then apply a default opposite value wherever it remained null, indicating it belongs to the opposite group.
  3. (Addressed in the question, so this one's just for posterity) Don't use type varchar for IP addresses. Postgres offers quick and lightweight built-in inet with proper validation, indexes, functions and operators. If you ever need MAC addresses, there's also macaddr.
  4. Don't default to varchar in general, especially a limited one. Plain text type is a better default.
  5. Postgres supports empty select lists, and exists ignores whatever is in the select list anyway. You can select/*nothing*/from tables if you're just checking counts or presence/existence.
  6. If the join column name matches, there's a convenient messages join host using(ip) syntax - no aliasing or repeated identifiers, no dots, no operators. Especially helpful if you otherwise need to type out a long list of a.x=b.x and a.y=b.y and.. comparisons that reduce to using(x,y,..).
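
A few of the points above can be sketched in one place. This is an illustration only, assuming the question's sample data recreated in temp tables; the has_unnamed_host alias is made up for the example:

```sql
BEGIN;
-- Assumed sample data mirroring the question (temp tables, rolled back at the end).
CREATE TEMP TABLE messages (ip text, msg text, ignore boolean);
CREATE TEMP TABLE host (ip text, name text);
INSERT INTO messages (ip, msg) VALUES ('1.1.1.1', 'Test1'), ('1.1.1.2', 'Test2');
INSERT INTO host (ip, name) VALUES ('1.1.1.1', 'host1');

-- Point 6: USING (ip) replaces ON messages.ip = host.ip and yields a single ip column.
SELECT ip, msg, ignore, name
FROM messages
LEFT JOIN host USING (ip);

-- Point 5: EXISTS only checks for rows, so the select list can be left empty.
SELECT EXISTS (SELECT FROM host WHERE name IS NULL) AS has_unnamed_host;

-- Point 3: converting the column to inet validates the stored addresses.
ALTER TABLE messages ALTER COLUMN ip TYPE inet USING ip::inet;
ROLLBACK;
```

The ALTER TABLE fails if any stored value is not a valid address, which is the validation the inet type brings.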

Posting mainly as an opportunity to add these points. This demo also shows how you do that with a CTE, which I don't necessarily enjoy either, but it does let you expand it to a multi-table update Postgres doesn't otherwise support.

Performance-wise, the optimal approach depends heavily on how many records you have on either side, how many match this comparison, what indexes you already have in place, and how flexible you are when it comes to adding more. The demo shows an example involving 150k messages spread over about 262144 possible IPs, of which 30k are known hosts; among those, 10% don't have a name and about 5k share an IP. Exec times of the solutions proposed here are comparable in this case.

1 Comment

The varchar in dbfiddle was me, just for the sample. The original question shows the text type.
2

One option is to use MERGE INTO, driven by your first query (just alias the host columns so the host ip can be told apart from the message ip):

SELECT m.*, h.ip as host_ip, h.name as host_name
FROM messages m 
LEFT JOIN host h ON h.ip = m.ip

ip      | msg    | ignore | host_ip | host_name
--------+--------+--------+---------+----------
1.1.1.1 | Test 1 | null   | 1.1.1.1 | host1
1.1.1.2 | Test 2 | null   | null    | null

... filtering the above query on host_ip IS NULL (or host_name IS NULL) and matching it against the target table messages sets the ignore column to true for every message whose host does not exist ...

MERGE INTO messages m
USING ( SELECT m.*, h.ip as host_ip, h.name as host_name
        FROM messages m
        LEFT JOIN host h ON h.ip = m.ip
        WHERE h.ip IS NULL          -- added filter for nonexistent hosts
     ) x ON ( m.ip = x.ip  )
WHEN MATCHED THEN UPDATE SET ignore = TRUE;

... if you run your query again - the result is ...

SELECT m.*, h.ip as host_ip, h.name as host_name
FROM messages m 
LEFT JOIN host h ON h.ip = m.ip

ip      | msg    | ignore | host_ip | host_name
--------+--------+--------+---------+----------
1.1.1.1 | Test 1 | null   | 1.1.1.1 | host1
1.1.1.2 | Test 2 | t      | null    | null

Note:
Make sure that your USING query returns each ip only once. If the messages table contains several rows for the same nonexistent host, use DISTINCT or GROUP BY to get unique ip rows.
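
A sketch of the deduplicated form, assuming PostgreSQL 15 or later (where MERGE is available) and a hypothetical second message from the unknown ip 1.1.1.2:

```sql
BEGIN;
-- Assumed sample data; the 'Test2 again' row is a hypothetical duplicate ip.
CREATE TEMP TABLE messages (ip text, msg text, ignore boolean);
CREATE TEMP TABLE host (ip text, name text);
INSERT INTO messages (ip, msg)
VALUES ('1.1.1.1', 'Test1'), ('1.1.1.2', 'Test2'), ('1.1.1.2', 'Test2 again');
INSERT INTO host (ip, name) VALUES ('1.1.1.1', 'host1');

MERGE INTO messages m
USING (
    -- DISTINCT keeps one source row per ip, so no target row is matched twice.
    SELECT DISTINCT m2.ip
    FROM messages m2
    LEFT JOIN host h ON h.ip = m2.ip
    WHERE h.ip IS NULL
) x ON m.ip = x.ip
WHEN MATCHED THEN UPDATE SET ignore = TRUE;

SELECT ip, msg, ignore FROM messages ORDER BY ip, msg;
ROLLBACK;
```

Without the DISTINCT, each 1.1.1.2 target row would be matched by two source rows, and PostgreSQL rejects a MERGE that tries to modify the same row a second time.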
