Troubleshooting a PostgreSQL Performance Issue

In this post we will discuss a PostgreSQL performance issue on my Linux box: CPU utilization reached 100% and there was a significant amount of I/O on the system.

This is how I started troubleshooting the issue: by checking the load average on the server.

What is load average and how do you check it?

Load average is the number of processes/threads that are either running on a CPU, waiting for a CPU, or (on Linux) in uninterruptible sleep, which includes processes waiting on I/O.

It conventionally appears in the form of three numbers which represent the system load during the last one-, five-, and fifteen-minute periods.

You can check this value in several different ways:

  1. Read the file "/proc/loadavg" directly.
  2. Use the "top" command.
  3. Use the "uptime" command.
  4. Use the "w" command.
  5. Use the "sar -q" command.
[root@localhost ~]# uptime
 02:19:47 up  5:11,  6 users,  load average: 0.00, 0.02, 0.05
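All of these commands report the same three values, which the kernel exposes as the first three fields of /proc/loadavg; a minimal Linux-specific sketch of reading them directly:

```shell
# /proc/loadavg holds the 1-, 5-, and 15-minute load averages as its
# first three fields; uptime, top, w, and sar all report these values.
awk '{printf "1m=%s 5m=%s 15m=%s\n", $1, $2, $3}' /proc/loadavg
```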

Interpretation of load average

  • If the averages are 0.0, then your system is idle.
  • If the 1 minute average is higher than the 5 or 15 minute averages, then load is increasing.
  • If the 1 minute average is lower than the 5 or 15 minute averages, then load is decreasing.
  • If they are higher than your CPU count, then you might have a performance problem (it depends).
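The "higher than your CPU count" rule of thumb from the list above can be checked mechanically; a small sketch (the threshold is a heuristic, not a hard rule):

```shell
# Compare the 1-minute load average against the number of CPUs.
cpus=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
# awk handles the float comparison, since shell arithmetic is integer-only.
awk -v l="$load1" -v c="$cpus" 'BEGIN {
  if (l > c) print "load exceeds CPU count: possible saturation"
  else       print "load within CPU capacity"
}'
```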

In my case the 1-minute average was higher than the 5-minute average, which in turn was higher than the 15-minute average, so load on the system was increasing.

Next, I executed the vmstat command to get a per-second view of CPU, memory, and I/O activity.

-bash-4.2$ vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0     96 429824      0 2445100    0    0   170   307  160  193  2  2 96  0  0
 2  0     96 429736      0 2445220    0    0     0  2864 4116 4110 15 52 33  0  0
 4  0     96 431460      0 2445256    0    0     0  3008 4187 4221 16 51 33  0  0
 4  1     96 431404      0 2445280    0    0     0  3288 4107 3996 15 54 31  0  0
 4  1     96 431296      0 2445348    0    0     0  2984 4078 4126 16 52 32  0  0
 1  0     96 431576      0 2445348    0    0     0  3305 4175 4103 15 52 33  0  0
 4  1     96 431420      0 2445372    0    0     0  2952 4101 4157 14 51 34  1  0
 5  1     96 431032      0 2445328    0    0     0 19544 4189 4092 13 53 34  0  0

In the output I can see:

  1. The first column, the run queue ("r"), is higher than my CPU count, so there are more processes ready to run than CPUs to run them.
  2. There is high block-out ("bo") activity, meaning heavy disk writing.
  3. Context switches ("cs") and interrupts ("in") are high.
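For reference, vmstat's "r" and "b" columns come from the procs_running and procs_blocked counters in /proc/stat, so the same comparison can be scripted (Linux-specific sketch):

```shell
# procs_running maps to vmstat's "r" column, procs_blocked to "b".
r=$(awk '/^procs_running/ {print $2}' /proc/stat)
b=$(awk '/^procs_blocked/ {print $2}' /proc/stat)
echo "runnable=$r blocked=$b cpus=$(nproc)"
```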

I started with finding the cause of I/O.

You can check per-disk activity using the "iostat -mx" command; in my case it showed that disk write activity was high.
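iostat itself derives these figures from the counters in /proc/diskstats, where field 10 of each line is the cumulative number of sectors written, so the raw data can also be inspected directly (the device-name pattern below is illustrative):

```shell
# Field 3 is the device name, field 10 the cumulative sectors written
# (multiply by 512 to get bytes).
awk '$3 ~ /^(sd|vd|nvme|xvd)/ {
  printf "%s: %.1f MB written since boot\n", $3, $10 * 512 / 1048576
}' /proc/diskstats
```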

Next, I executed the "iotop" command to pinpoint which processes were doing the most writes.

I could see the WAL writer, checkpointer, and background writer processes doing heavy writes on my system. This meant there was a high volume of transactions (DML activity), which was putting pressure on the background writer and checkpointer processes.

I opened the PostgreSQL error logs and found that checkpoints were happening too frequently, caused by the heavy update commands running on the machine.

2020-04-12 05:37:19.809 PDT,,,1071,,5e90791e.42f,4,,2020-04-10 06:48:14 PDT,,0,LOG,00000,"checkpoints are occurring too frequently (11 seconds apart)",,"Consider increasing the configuration parameter ""max_wal_size"".",,,,,,,""
2020-04-12 05:37:28.649 PDT,,,1071,,5e90791e.42f,5,,2020-04-10 06:48:14 PDT,,0,LOG,00000,"checkpoints are occurring too frequently (9 seconds apart)",,"Consider increasing the configuration parameter ""max_wal_size"".",,,,,,,""
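A back-of-the-envelope calculation (my own, not from the log) shows why raising max_wal_size spreads checkpoints out: when checkpoints are triggered by WAL volume, the interval is roughly max_wal_size divided by the WAL generation rate. Assuming, illustratively, ~90 MB/s of WAL:

```shell
# Rough model: WAL-triggered checkpoint interval ~= max_wal_size / WAL rate.
# (A simplification: the real trigger also accounts for
# checkpoint_completion_target and WAL segment recycling.)
awk 'BEGIN {
  rate = 90                                    # assumed WAL rate, MB/s
  printf "max_wal_size=1GB : ~%ds apart\n", 1024  / rate
  printf "max_wal_size=10GB: ~%ds apart\n", 10240 / rate
}'
```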

I changed the checkpoint-related parameters:

postgres=# alter system set max_wal_size = '10GB';
postgres=# alter system set checkpoint_timeout = '30min';

Both of these take effect on a configuration reload (SELECT pg_reload_conf();); no restart is needed.

I could still see the WAL writer and background writer processes doing a lot of writes.

Dirty buffers get written out at the following times:

  1. The checkpointer writes out all dirty buffers when a checkpoint is triggered, either because checkpoint_timeout expired or because max_wal_size worth of WAL was generated.
  2. The background writer wakes up every bgwriter_delay (200ms by default) and writes some dirty buffers to disk so that backends can find clean buffers.
  3. If shared_buffers fills up too frequently, backends themselves have to write out dirty buffers to make room.

I verified the shared_buffers setting, found it was too small, and increased it. I also increased wal_buffers:

alter system set wal_buffers to '16MB';
alter system set shared_buffers to '2GB';

Note that both of these parameters require a server restart to take effect.
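For reference, a widely cited starting point for shared_buffers on a dedicated database server is roughly 25% of RAM (a heuristic of my own mention, not from the original post); it can be computed from /proc/meminfo:

```shell
# MemTotal is reported in kB; take 25% of it, expressed in MB.
ram_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
echo "suggested shared_buffers: ~$((ram_kb / 4 / 1024)) MB"
```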

Now the I/O from the background writer and checkpointer is less frequent, but CPU utilization is still high.

After that I started checking the context switching, which was high on the system. On average it was around 590, but after the activity started it reached 2000+, and system interrupts showed the same pattern.

What is a context switch?

A context switch is the process of storing the state of a process or thread so that it can be restored and resume execution at a later point. This allows multiple processes to share a single CPU and is an essential feature of a multitasking operating system.
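The "cs" column in vmstat is simply the per-second delta of the cumulative ctxt counter in /proc/stat; the same rate can be computed by hand (Linux-specific sketch):

```shell
# Sample the cumulative context-switch counter one second apart.
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/s: $((c2 - c1))"
```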

Also, in top and pg_stat_activity I observed a lot of short-lived transactions, meaning backend processes were constantly being created and torn down, which adds load to the system. So I implemented connection pooling with pgbouncer to reduce the connection-creation overhead.
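For context, here is a minimal pgbouncer configuration sketch (the database name, paths, and pool size are illustrative assumptions, not from the original setup); transaction pooling lets many short-lived client connections share a small set of server backends:

```ini
[databases]
; hypothetical database entry
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling: one server connection is handed out per transaction
pool_mode = transaction
default_pool_size = 20
```

Clients then connect to port 6432 instead of 5432, so PostgreSQL no longer forks a new backend for every short-lived application connection.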

With these changes, CPU utilization was reduced from 100% to around 70%. But the load average still had not dropped as much as expected, and the vmstat output continued to show more processes in the run queue than CPUs.

The remaining step was to increase the CPU count to match the load on the system.
