To implement one of the ACID properties called “Isolation”, PostgreSQL provides MVCC (Multi-Version Concurrency Control). In this way, the maximum possible concurrency among current transactions is achievable. PostgreSQL does this by creating versions of each tuple when the tuple receives any modifications. For instance, say that a tuple received n concurrent modifications. The n versions of the same tuple will still be kept and will make the last committed modified tuple visible to other transactions. Now, this will lead to more disk space usage because you are having both visible and non-visible tuples. The good news is that PostgreSQL offers a few ways to reuse the non-visible tuples to make way for further write operations. Read More
PostgreSQL databases require periodic maintenance known as vacuuming.
PostgreSQL’s VACUUM command has to process each table on a regular basis for several reasons:
- To recover or reuse disk space occupied by updated or deleted rows
- To update data statistics used by the PostgreSQL query planner
- To update the visibility map, which speeds up index-only scans
- To protect against loss of very old data due to transaction ID wraparound or multixact ID wraparound
Let’s perform a test with the usage of VACUUM. We will create a sample table and run a few SQL statements that will generate non-visible tuples or dead tuples.
Connect to your database as a super use then run the command below:
$ psql -h localhost -U postgres postgres=# CREATE EXTENSION pg_freespacemap; CREATE
Let’s create a test table:
postgres=# CREATE TABLE test(t INT);
Let’s turn off autovacuum for a moment for demo purposes:
postgres=# ALTER TABLE test SET(autovacuum_enabled=off); ALTER TABLE
Let’s observe how many pages this table will occupy in the disk when we insert 1,000 rows:
postgres=> INSERT INTO test VALUES(generate_series(1, 7000)); INSERT 0 7000 postgres=> ANALYZE test; ANALYZE postgres=> SELECT relpages FROM pg_class WHERE relname='test'; relpages ---------- 31 (1 row)
PostgreSQL used thirty-one relpages to store 7000 integer records.
Using the pg_freespace() function, let’s verify the free space available in the relation test:
postgres=> SELECT sum(avail) freespace FROM pg_freespace('test'); freespace ----------- 0 (1 row)
Let’s update all the 1,000 records and gather the relpages and freespace map information again:
postgres=> UPDATE test SET t=7000; UPDATE 7000 postgres=> ANALYZE test; ANALYZE postgres=> SELECT relpages FROM pg_class WHERE relname='test'; relpages ---------- 62 (1 row) postgres=> SELECT sum(avail) freespace FROM pg_freespace('test'); freespace ----------- 0 (1 row)
We’ve gained 31 additional pages by running the update on 7,000 tuples. In other words, the table size expanded by up to 50% in the disk storage upon updating the entire table.
Now, let’s gather the number of non-visible or dead tuples of the table. We can do this by using the pg_stat_user_tables view.
postgres=> SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname='test'; n_dead_tup ------------ 7000 (1 row)
Let’s run the VACUUM then observe the amount of free space available for the relation ‘test’:
postgres=> VACUUM test; VACUUM postgres=> SELECT sum(avail) freespace FROM pg_freespace('test'); freespace ----------- 252288 (1 row)
After executing VACUUM command, it logically cleans up the dead tuples resulting in free space being available. Now, we are gonna insert additional 7,000 records into the table. Then let’s look at the relpages count and free information.
postgres=> INSERT INTO test VALUES(generate_series(1, 7000)); INSERT 0 7000 postgres=> ANALYZE test; ANALYZE postgres=> SELECT relpages FROM pg_class WHERE relname='test'; relpages ---------- 62 (1 row) postgres=> SELECT sum(avail) freespace FROM pg_freespace('test'); freespace ----------- 7488 (1 row)
From the previous SQL statement output, since we ran the VACUUM command before inserting records, we see that even after inserting additional 7,000 records into the table, the relpages did not increase. We can also see that the free space that was logically deleted by the previous VACUUM command has been properly used.
How about doing the same steps above without using the VACUUM command before inserting records? Let’s do that, then observe the disk usage.
postgres=> TRUNCATE test; TRUNCATE TABLE postgres=> ANALYZE test; ANALYZE postgres=> INSERT INTO test VALUES(generate_series(1, 7000)); INSERT 0 7000 postgres=> UPDATE test SET t=700; UPDATE 7000 postgres=> INSERT INTO test VALUES(generate_series(1, 7000)); INSERT 0 7000 postgres=> ANALYZE test; ANALYZE postgres=> SELECT relpages FROM pg_class WHERE relname='test'; relpages ---------- 93 postgres=> SELECT sum(avail) freespace FROM pg_freespace('test'); freespace ----------- 0 (1 row)
From the previous example, we can see that the table size increased by 50% when we perform an additional insert operation after updating the table. The reason for this is that the table has no available free space map, and it increased the table storage. In the prior example, after the insert operation has been completed, the total table size takes nine pages, and in this scenario, it takes 93 pages.
Now, we can validate that a table will reuse its dead tuple storage for any new incoming transactions when VACUUM operation is used. However, it is not recommended to run VACUUM explicitly on every insert/update/delete operation. It will cause I/O utilization to increase because VACUUM is a heavy process that scans all the tuples in a table. It is also not suggested to entirely stop VACUUM on any table; otherwise, you will have more dead tuples on your disk, which can result in ineffective disk utilization. The best practice is to run VACUUM process periodically. For example, we can run it once a day or once a week. We can either schedule it using crontab or pgBucket.
What Happens When We Execute VACUUM Command
When we run =VACUUM command on any table, it will ask the Postgres instance to transmit all the currently running SQL queries. Once the vacuum process sees that list, it will proceed to spot the dead tuples that are not in view from the running SQL queries. Afterwards, it will utilize the maintenance_work_mem to load the relation into memory where it can then identify dead tuples. Once all the dead tuples have been detected by the vacuum process, it will initiate an FSM(Free Space Map) for that relation.
Another duty of the vacuum process is to update each table row transaction status by confirming pg_clog and pg_subtrans. Each row header in contains xmin and xmax values. The xmin and xmax values relate to the transaction statuses: In progress, Committed, Aborted by citing the status bits from pg_clog.
xmin
The column that records the transaction id that created the row.
xmax
The column records the transaction id that expired the row, either through an UPDATE or DELETE.
Cheers!