Category Archives: Performance Tuning

Visibility map in PostgreSQL

What is a Visibility Map ?

The visibility map is a bitmap file that is created for every table, named with the suffix _vm. It contains an entry for each page of the table, with two bits per page (all_visible, all_frozen). The first bit means “all rows in the page are visible to all transactions”; the second means “all rows in the page are frozen”. The all_frozen bit can only be set when all_visible is set (only when every row is visible can every row be frozen).
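Because the map stores only two bits per heap page, it stays tiny compared to the table itself. A rough back-of-the-envelope sketch (Python; assumes the default 8 kB block size and ignores the map's own page headers, so this is only an approximation):

```python
BLCKSZ = 8192  # default PostgreSQL block size in bytes

def vm_pages(table_bytes: int) -> int:
    """Approximate number of 8 kB pages the visibility map needs."""
    heap_pages = -(-table_bytes // BLCKSZ)   # ceil: pages in the heap
    vm_bits = heap_pages * 2                 # all_visible + all_frozen per page
    vm_bytes = -(-vm_bits // 8)              # bits -> bytes
    return max(1, -(-vm_bytes // BLCKSZ))    # bytes -> vm pages (at least one)

# The ~6.8 MB table from the example below fits in a single vm page,
# which matches the 8192-byte _vm file shown later.
print(vm_pages(6832128))  # -> 1
```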

Why Visibility Map ?

Improve vacuum performance – Vacuum checks the page’s bits; if the first bit (“all rows are visible”) is set, vacuum can skip the page entirely.
Improve freezing performance – While freezing the whole table (to prevent transaction-ID wraparound), if the second bit (“all rows are frozen”) is set, vacuum can skip the page.
Improve query performance – During an index-only scan (when every selected column is part of the index, the data comes from the index instead of the table page), if all_visible is set the executor does not need to visit the table page to check row visibility, which reduces I/O.
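You can see the third point in an EXPLAIN plan via the “Heap Fetches” counter. A sketch (hypothetical table t and index t_id_idx; the plan text is indicative, not exact output):

```sql
-- Hypothetical example: after VACUUM sets the all_visible bits,
-- an index-only scan can answer visibility from the map alone.
CREATE INDEX t_id_idx ON t (id);
VACUUM t;                               -- sets all_visible where possible
EXPLAIN (ANALYZE) SELECT id FROM t WHERE id < 100;
-- Index Only Scan using t_id_idx on t ...
--   Heap Fetches: 0   <- no table pages visited for visibility checks
```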

Who creates/modifies this file ?

The file is created by the vacuum process during the first vacuum of the table, and it is modified afterwards by user processes that run DML (UPDATE, DELETE) on the table. There are actually two flags that say “all rows are visible”: one in the visibility-map file (all_visible) and one in the table page itself (PD_ALL_VISIBLE). Vacuum sets both, and any DML on the page must clear both.

All modifications to this file are WAL-logged, so PostgreSQL can recover it after a crash.

Below is an example that shows how the file is created and modified.

postgres=# create table test_vm as select generate_series(1,100000) as id, md5(generate_series(1,100000)::text) as name;
postgres=# select pg_relation_filepath('test_vm');
 pg_relation_filepath
----------------------
 base/13323/65584
(1 row)

postgres=# \! ls -l $PGDATA/base/13323/65584*
-rw-------. 1 postgres postgres 6832128 May  2 00:08 /var/lib/pgsql/9.6/data/base/13323/65584
postgres=# vacuum test_vm;
postgres=# \! ls -l $PGDATA/base/13323/65584*
-rw-------. 1 postgres postgres 6832128 May  2 00:08 /var/lib/pgsql/9.6/data/base/13323/65584
-rw-------. 1 postgres postgres   24576 May  2 00:09 /var/lib/pgsql/9.6/data/base/13323/65584_fsm
-rw-------. 1 postgres postgres    8192 May  2 00:09 /var/lib/pgsql/9.6/data/base/13323/65584_vm
postgres=# create extension pg_visibility ;
ERROR:  extension "pg_visibility" already exists

postgres=# SELECT * FROM pg_visibility('test_vm'::regclass) limit 10;
 blkno | all_visible | all_frozen | pd_all_visible
-------+-------------+------------+----------------
     0 | t           | f          | t ====> All visible
     1 | t           | f          | t
     2 | t           | f          | t
(10 rows)

postgres=# begin;
postgres=# delete from test_vm where id < 5;
postgres=# SELECT * FROM pg_visibility('test_vm'::regclass) limit 10;
 blkno | all_visible | all_frozen | pd_all_visible
-------+-------------+------------+----------------
     0 | f           | f          | f ====> delete command modified the bits
     1 | t           | f          | t
     2 | t           | f          | t
     3 | t           | f          | t
(10 rows)

pg_visibility is a contrib extension used to inspect the visibility map of a table.


Troubleshooting a PostgreSQL performance issue

In this post we will discuss a PostgreSQL performance issue on my Linux box: CPU utilization reached 100% and there was a good amount of I/O on the system.

This is how I troubleshot the issue. The first thing I checked was the load average on the server.

What is load average and how to check it ?

Load average is the average number of processes/threads that are running on a CPU, waiting for a CPU, or in uninterruptible sleep (typically waiting on I/O).

It conventionally appears in the form of three numbers which represent the system load during the last one-, five-, and fifteen-minute periods.

You can check this value by different methods.

  1. The value is stored in the file “/proc/loadavg”.
  2. Using the “top” command.
  3. Using the “uptime” command.
  4. Using the “w” command.
  5. Using the “sar -q” command.
[root@localhost ~]# uptime
 02:19:47 up  5:11,  6 users,  load average: 0.00, 0.02, 0.05

Interpretation of load average

  • If the averages are 0.0, then your system is idle.
  • If the 1 minute average is higher than the 5 or 15 minute averages, then load is increasing.
  • If the 1 minute average is lower than the 5 or 15 minute averages, then load is decreasing.
  • If they are higher than your CPU count, then you might have a performance problem (it depends).

During the incident the averages were increasing (1-minute > 5-minute > 15-minute), so load on the system was growing. (The uptime output above merely illustrates the command on an idle system.)

I executed the vmstat command.

-bash-4.2$ vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0     96 429824      0 2445100    0    0   170   307  160  193  2  2 96  0  0
 2  0     96 429736      0 2445220    0    0     0  2864 4116 4110 15 52 33  0  0
 4  0     96 431460      0 2445256    0    0     0  3008 4187 4221 16 51 33  0  0
 4  1     96 431404      0 2445280    0    0     0  3288 4107 3996 15 54 31  0  0
 4  1     96 431296      0 2445348    0    0     0  2984 4078 4126 16 52 32  0  0
 1  0     96 431576      0 2445348    0    0     0  3305 4175 4103 15 52 33  0  0
 4  1     96 431420      0 2445372    0    0     0  2952 4101 4157 14 51 34  1  0
 5  1     96 431032      0 2445328    0    0     0 19544 4189 4092 13 53 34  0  0

In the output I can see

  1. The run queue (“r” column) is larger than my CPU count, so there are more runnable processes than CPUs.
  2. There is heavy block-out (“bo”) activity – disk writes are high.
  3. Context switches (“cs”) and interrupts (“in”) are high.

I started with finding the cause of I/O.

You can check per-disk activity using the “iostat -mx” command; it confirmed that disk write activity was high.

Next I executed the “iotop” command to pinpoint which processes were doing the writes.

It showed the WAL writer, checkpointer and background writer processes doing heavy writes on my system. That means a high rate of transactions (DML activity) was putting pressure on the background writer and checkpointer.

I opened my PostgreSQL error logs and found that checkpoints were occurring too frequently, caused by the heavy UPDATE commands running on the machine.

2020-04-12 05:37:19.809 PDT,,,1071,,5e90791e.42f,4,,2020-04-10 06:48:14 PDT,,0,LOG,00000,"checkpoints are occurring too frequently (11 seconds apart)",,"Consider increasing the configuration parameter ""max_wal_size"".",,,,,,,""
2020-04-12 05:37:28.649 PDT,,,1071,,5e90791e.42f,5,,2020-04-10 06:48:14 PDT,,0,LOG,00000,"checkpoints are occurring too frequently (9 seconds apart)",,"Consider increasing the configuration parameter ""max_wal_size"".",,,,,,,""

I changed the checkpoint-related parameters (these take effect after a configuration reload):

postgres=# alter system set max_wal_size = '10GB';
postgres=# alter system set checkpoint_timeout = '30min';

I could still see the WAL writer and background writer processes doing a lot of writes.

The background writer writes at the following times:

  1. When the checkpointer invokes it at a checkpoint (checkpoint_timeout reached or max_wal_size exceeded).
  2. When shared_buffers is filling up too frequently.
  3. It also wakes up every bgwriter_delay and writes dirty data from shared buffers to disk.
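One way to confirm which of these paths is responsible is the pg_stat_bgwriter view (column names as of PostgreSQL 9.6):

```sql
-- buffers_checkpoint: pages written by the checkpointer
-- buffers_clean:      pages written by the background writer
-- buffers_backend:    pages ordinary backends had to write themselves
-- checkpoints_req much higher than checkpoints_timed suggests checkpoints
-- are being forced by WAL volume (max_wal_size) rather than the timeout.
SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend
  FROM pg_stat_bgwriter;
```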

I verified the shared_buffers setting, found it was not enough, and increased it. I also changed the WAL buffer setting (both require a server restart):

alter system set wal_buffers to '16MB';
alter system set shared_buffers to '2GB';

Now the I/O from the background writer and checkpointer is less frequent, but CPU utilization is still high.

After that I started checking the context switching, which was high on the system. On average it was around 590, but once the activity started it reached 2000+, and system interrupts followed the same pattern.

What is a context switch?

A context switch is the process of storing the state of a process or thread so that it can be restored and resume execution at a later point. This allows multiple processes to share a single CPU, and is an essential feature of a multitasking operating system.

Also, from top and pg_stat_activity I observed a lot of short-lived transactions: many backend processes were being created and closed, which itself adds load to the system. So I implemented connection pooling (pgbouncer) to reduce the connection-creation overhead.

Finally, CPU utilization came down from 100% to ~70%. But the load average did not drop as much as expected, and the vmstat output still showed more processes in the run queue than CPUs.

In the end, we needed to increase the CPU count to meet the load on the system.


Deleting rows not present in another table: NOT IN vs NOT EXISTS

We have two tables, left and right. We need to delete the rows from left that are not present in right, matched on particular column(s). Below are the details of the tables.
feeds=> \dt+ right
                 List of relations
 Schema | Name  | Type  | Owner |  Size  | Description
--------+-------+-------+-------+--------+-------------
 public | right | table | oguri | 175 MB |
(1 row)

feeds=> \dt+ left
                   List of relations
   Schema    | Name | Type  | Owner |  Size  | Description
-------------+------+-------+-------+--------+-------------
 vendor_data | left | table | oguri | 641 MB |
(1 row)
When I executed the query below, it took a lot of time (after 30 minutes the query was still running),
so I cancelled it.
delete from left where (identifier, id_bb_global) NOT IN (SELECT identifier, id_bb_global FROM right)
AND region='asia';
After that I checked the estimated execution plan for the query (changing the DELETE to a SELECT with the same conditions).
feeds=> explain select * from left where (identifier, id_bb_global) NOT IN
(SELECT identifier, id_bb_global FROM right) AND region='asia';
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Seq Scan on left  (cost=0.00..3345120516.58 rows=87786 width=944)
   Filter: (((region)::text = 'asia'::text) AND (NOT (SubPlan 1)))
   SubPlan 1
     ->  Materialize  (cost=0.00..26237.23 rows=173882 width=31)
           ->  Seq Scan on right  (cost=0.00..24178.82 rows=173882 width=31)
(5 rows)
As you can see in the execution plan, the right table is being materialized. What is Materialize in an execution plan?

A Materialize node means the output of whatever is below it in the plan tree (a scan, a full set of joins, or similar) is materialized into memory – a contiguous cache of rows – before the upper node is executed. This is usually done when the outer node needs a source that it can re-scan for some reason.

So the right table will be scanned and materialized into memory (work_mem). Since PostgreSQL cannot spill a hashed subplan to disk, it estimates the subquery size: if it decides the result will not fit into work_mem, it falls back to a plain subplan, which is either executed once for each row from left, or materialized and searched in a loop.

The query takes a long time when work_mem is not large enough to hold the rows from the subquery. To overcome this we have two solutions.
1) Increase the work_mem value.
2) Use NOT EXISTS instead of NOT IN.

1) Increase the work_mem.
From the estimated plan, the hashed subplan needs roughly rows × width (173882 × 31 ≈ 5.1 MB) of memory to hold the output of the right table in work_mem. I set work_mem to 10MB, executed the query, and it completed successfully.
feeds=> set work_mem to '10MB';
feeds=> explain analyze select * from vendor_data.left where (identifier, id_bb_global) NOT IN (SELECT identifier, id_bb_global FROM right) AND region='asia';
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Seq Scan on left  (cost=24613.53..111080.82 rows=87197 width=941) (actual time=679.007..679.007 rows=0 loops=1)
   Filter: (((region)::text = 'asia'::text) AND (NOT (hashed SubPlan 1)))
   Rows Removed by Filter: 250352
   SubPlan 1
     ->  Seq Scan on right  (cost=0.00..24178.82 rows=173882 width=31) (actual time=0.004..214.924 rows=173882 loops=1)
 Planning time: 0.941 ms
 Execution time: 680.191 ms
(7 rows)
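The 10MB choice can be sanity-checked against the planner's own numbers: the hashed subplan needs at least rows × width bytes (plus hash-table overhead, which this rough Python check ignores):

```python
# Planner estimates from the "Seq Scan on right" line in the plan above.
rows, width = 173882, 31

approx_mb = rows * width / (1024 * 1024)
print(round(approx_mb, 1))  # -> 5.1, comfortably under 10 MB of work_mem
```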
2) Use NOT EXISTS instead of NOT IN.

When I replaced the NOT IN clause with NOT EXISTS it worked perfectly; the execution plan is below, and it is a better plan than the first one (a Hash Anti Join instead of a per-row subplan). One caveat: NOT IN and NOT EXISTS are not always interchangeable – if the subquery columns can contain NULLs, NOT IN may return no rows at all, while NOT EXISTS behaves as you would expect.
feeds=> explain analyze delete from vendor_data.left t1 where not exists (SELECT 1 FROM right where identifier=t1.identifier and id_bb_global=t1.id_bb_global) AND region='asia';
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Delete on left t1  (cost=28146.05..119478.96 rows=91197 width=12) (actual time=777.052..777.052 rows=0 loops=1)
   ->  Hash Anti Join  (cost=28146.05..119478.96 rows=91197 width=12) (actual time=304.233..776.662 rows=71 loops=1)
         Hash Cond: (((t1.identifier)::text = (right.identifier)::text) AND ((t1.id_bb_global)::text = (right.id_bb_global)::text))
         ->  Seq Scan on left t1  (cost=0.00..85206.35 rows=175571 width=37) (actual time=0.078..345.248 rows=173953 loops=1)
               Filter: ((region)::text = 'asia'::text)
               Rows Removed by Filter: 76489
         ->  Hash  (cost=24178.82..24178.82 rows=173882 width=37) (actual time=290.545..290.545 rows=173882 loops=1)
               Buckets: 65536  Batches: 4  Memory Usage: 3427kB
               ->  Seq Scan on right  (cost=0.00..24178.82 rows=173882 width=37) (actual time=0.003..210.958 rows=173882 loops=1)
 Planning time: 0.612 ms
 Trigger _td_bl_left_trigger: time=15.105 calls=71
 Execution time: 792.229 ms
(12 rows)