We have two tables left, right. We need to delete the data from left which is not present in the right table based on a particular column(s). Below are the details of tables.
feeds=> \dt+ right
                  List of relations
 Schema | Name | Type  | Owner |  Size  | Description
 public | right | table | oguri | 175 MB |
(1 row)
feeds=> \dt+ left
                          List of relations
   Schema    |    Name     | Type  |  Owner   |  Size  | Description
 vendor_data | left    | table | oguri   | 641 MB |
(1 row)
When I execute the below query, it is taking lot of time(after 30 mins also the query was still running)
so I cancelled the query.
delete from left where (identifier, id_bb_global) NOT IN (SELECT identifier, id_bb_global FROM right) 
AND region=’asia’;
After that I executed below command to check the estimated execution plan for the query. ( I changed from delete to select with same condition).
feeds=> explain select * from left where (identifier, id_bb_global) NOT IN 
(SELECT identifier, id_bb_global FROM right) AND region=’asia’;
                                 QUERY PLAN                               
 Seq Scan on left  (cost=0.00..3345120516.58 rows=87786 width=944)
   Filter: (((region)::text = ‘asia’::text) AND (NOT (SubPlan 1)))
   SubPlan 1
     ->  Materialize  (cost=0.00..26237.23 rows=173882 width=31)
           ->  Seq Scan on right  (cost=0.00..24178.82 rows=173882 width=31)
(5 rows)
As you can see in the execution plan, it is materializing the right table. what is Materialize in execution plan ?
A materialize node means the output of whatever is below it in the tree (which can be a scan, or a
full set of joins or something like that or a contiguous cache of rows) is materalized into memory before the upper node is executed. This is usually done when the outer node needs a source that it can re-scan for some reason or other.
So the right table will be scanned and materalized into memory (work_mem). Since PostgreSQL cannot flush a hashed subplan onto the disk, it will estimate the subquery size, and if it decides that it will not fit into work_mem, it will resort to using a mere subplan which will be either executed for each row from left, or materialized and the rows will be searched for in a loop.
The query will take lot of time if the work_mem is not enough to store the rows from subquery. To over come this we have two solutions.
1) Increase the work_mem value.
2) Use not exists instead of not in.
1) Increase the work_mem.
By seeing the above estimated plan it needs  rows*width (173882*31) bytes of memory to store the output of left table in work_mem. I have set the memory to 10MB and executed the query and it was executed successfully.
feeds=> set work_mem to ’10MB’;
feeds=> explain analyze select * from vendor_data.left where (identifier, id_bb_global) NOT IN (SELECT identifier, id_bb_global FROM right) AND region=’asia’;
                                                       QUERY PLAN                                                     
 Seq Scan on left  (cost=24613.53..111080.82 rows=87197 width=941) (actual time=679.007..679.007 rows=0 loops=1)
   Filter: (((region)::text = ‘asia’::text) AND (NOT (hashed SubPlan 1)))
   Rows Removed by Filter: 250352
   SubPlan 1
     ->  Seq Scan on right  (cost=0.00..24178.82 rows=173882 width=31) (actual time=0.004..214.924 rows=173882 loops=1)
 Planning time: 0.941 ms
 Execution time: 680.191 ms
(7 rows)
2) Use not exists instead of not in.
When I removed the “NOT IN” clause and used the “NOT EXISTS” it worked perfectly, below is the execution plan for the same. Also it is using a better execution plan than first one.
feeds=> explain analyze delete from vendor_data.left t1 where not exists (SELECT 1 FROM right where identifier=t1.identifier and id_bb_global=t1.id_bb_global) AND region=’asia’;
                                                             QUERY PLAN                                                             
 Delete on left t1  (cost=28146.05..119478.96 rows=91197 width=12) (actual time=777.052..777.052 rows=0 loops=1)
   ->  Hash Anti Join  (cost=28146.05..119478.96 rows=91197 width=12) (actual time=304.233..776.662 rows=71 loops=1)
         Hash Cond: (((t1.identifier)::text = (right.identifier)::text) AND ((t1.id_bb_global)::text = (right.id_bb_global)::text))
         ->  Seq Scan on left t1  (cost=0.00..85206.35 rows=175571 width=37) (actual time=0.078..345.248 rows=173953 loops=1)
               Filter: ((region)::text = ‘asia’::text)
               Rows Removed by Filter: 76489
         ->  Hash  (cost=24178.82..24178.82 rows=173882 width=37) (actual time=290.545..290.545 rows=173882 loops=1)
               Buckets: 65536  Batches: 4  Memory Usage: 3427kB
               ->  Seq Scan on right  (cost=0.00..24178.82 rows=173882 width=37) (actual time=0.003..210.958 rows=173882 loops=1)
 Planning time: 0.612 ms
 Trigger _td_bl_left_trigger: time=15.105 calls=71
 Execution time: 792.229 ms
(12 rows)
Reference :

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s