NOT IN vs NOT EXISTS in PostgreSQL
We have two tables left, right. We need to delete the data from left which is not present in the right table based on a particular column(s). Below are the details of tables.
feeds=> \dt+ rightList of relationsSchema | Name | Type | Owner | Size | Description——–+——+——-+——-+——–+————-public | right | table | oguri | 175 MB |(1 row)feeds=> \dt+ leftList of relationsSchema | Name | Type | Owner | Size | Description————-+————-+——-+———-+——–+————-vendor_data | left | table | oguri | 641 MB |(1 row)
When I execute the below query, it is taking lot of time(after 30 mins also the query was still running)
so I cancelled the query.
delete from left where (identifier, id_bb_global) NOT IN (SELECT identifier, id_bb_global FROM right)AND region=’asia’;
After that I executed below command to check the estimated execution plan for the query. ( I changed from delete to select with same condition).
feeds=> explain select * from left where (identifier, id_bb_global) NOT IN(SELECT identifier, id_bb_global FROM right) AND region=’asia’;QUERY PLAN—————————————————————————-Seq Scan on left (cost=0.00..3345120516.58 rows=87786 width=944)Filter: (((region)::text = ‘asia’::text) AND (NOT (SubPlan 1)))SubPlan 1-> Materialize (cost=0.00..26237.23 rows=173882 width=31)-> Seq Scan on right (cost=0.00..24178.82 rows=173882 width=31)(5 rows)
As you can see in the execution plan, it is materializing the right table. what is Materialize in execution plan ?
A materialize node means the output of whatever is below it in the tree (which can be a scan, or a
full set of joins or something like that or a contiguous cache of rows) is materalized into memory before the upper node is executed. This is usually done when the outer node needs a source that it can re-scan for some reason or other.
So the right table will be scanned and materalized into memory (work_mem). Since PostgreSQL cannot flush a hashed subplan onto the disk, it will estimate the subquery size, and if it decides that it will not fit into work_mem, it will resort to using a mere subplan which will be either executed for each row from left, or materialized and the rows will be searched for in a loop.
The query will take lot of time if the work_mem is not enough to store the rows from subquery. To over come this we have two solutions.
1) Increase the work_mem value.
2) Use not exists instead of not in.
1) Increase the work_mem.
By seeing the above estimated plan it needs rows*width (173882*31) bytes of memory to store the output of left table in work_mem. I have set the memory to 10MB and executed the query and it was executed successfully.
feeds=> set work_mem to ’10MB’;SETfeeds=> explain analyze select * from vendor_data.left where (identifier, id_bb_global) NOT IN (SELECT identifier, id_bb_global FROM right) AND region=’asia’;QUERY PLAN————————————————————————————————————————Seq Scan on left (cost=24613.53..111080.82 rows=87197 width=941) (actual time=679.007..679.007 rows=0 loops=1)Filter: (((region)::text = ‘asia’::text) AND (NOT (hashed SubPlan 1)))Rows Removed by Filter: 250352SubPlan 1-> Seq Scan on right (cost=0.00..24178.82 rows=173882 width=31) (actual time=0.004..214.924 rows=173882 loops=1)Planning time: 0.941 msExecution time: 680.191 ms(7 rows)
2) Use not exists instead of not in.
When I removed the “NOT IN” clause and used the “NOT EXISTS” it worked perfectly, below is the execution plan for the same. Also it is using a better execution plan than first one.
feeds=> explain analyze delete from vendor_data.left t1 where not exists (SELECT 1 FROM right where identifier=t1.identifier and id_bb_global=t1.id_bb_global) AND region=’asia’;QUERY PLAN————————————————————————————————————————————-Delete on left t1 (cost=28146.05..119478.96 rows=91197 width=12) (actual time=777.052..777.052 rows=0 loops=1)-> Hash Anti Join (cost=28146.05..119478.96 rows=91197 width=12) (actual time=304.233..776.662 rows=71 loops=1)Hash Cond: (((t1.identifier)::text = (right.identifier)::text) AND ((t1.id_bb_global)::text = (right.id_bb_global)::text))-> Seq Scan on left t1 (cost=0.00..85206.35 rows=175571 width=37) (actual time=0.078..345.248 rows=173953 loops=1)Filter: ((region)::text = ‘asia’::text)Rows Removed by Filter: 76489-> Hash (cost=24178.82..24178.82 rows=173882 width=37) (actual time=290.545..290.545 rows=173882 loops=1)Buckets: 65536 Batches: 4 Memory Usage: 3427kB-> Seq Scan on right (cost=0.00..24178.82 rows=173882 width=37) (actual time=0.003..210.958 rows=173882 loops=1)Planning time: 0.612 msTrigger _td_bl_left_trigger: time=15.105 calls=71Execution time: 792.229 ms(12 rows)
Reference :