The Magic of VACUUM: Keeping Your PostgreSQL Database in Top Shape

Zack
Level Up Coding
Published in
6 min readFeb 27, 2023

--

Introducing what VACUUM is in PostgreSQL, what it does, and also the difference between regular VACUUM and VACUUM FULL.

What is vacuuming

PostgreSQL implements concurrency control through MVCC (Multi-Version Concurrency Control) (If you would like to learn more about MVCC and how it is implemented in PostgreSQL, go here), which means different transactions are able to work with the same data concurrently but see different views of the data items. When a transaction deletes a tuple, it needs to be able to see the original version of the tuple in case it needs to be rolled back. If the original tuple was immediately removed, the transaction could not roll back correctly.

PostgreSQL retains deleted tuples in the table instead of immediately removing them, which results in space being occupied by those tuples. This approach works fine for smaller tables, but for tables where frequent updates and deletions are performed, the table size could significantly increase, leading to performance degradation. To address this, the VACUUM process is used to reclaim the occupied space and optimize the performance of the database. By using VACUUM, we can optimize the performance of the PostgreSQL database and maintain the health of our application.

Prerequisite

Dead tuple

In PostgreSQL, a “dead tuple” refers to a row or record in a table that has been marked for deletion but has not yet been physically removed from the table. When a row is deleted from a table, it is not immediately removed from the table’s underlying storage. Instead, the database marks the row as “dead” and leaves it in place until a process called VACUUM reclaims the space.

A dead tuple in PostgreSQL is still visible to queries, but its contents are no longer relevant or accessible. In other words, the data stored in a dead tuple is effectively deleted, but the space it occupies in the table is not yet freed up.

In PostgreSQL, you cannot directly query or view dead tuples in a table. Dead tuples are not visible to regular queries, and they do not appear in the results of a SELECT statement.

However, you can use the PostgreSQL command line utility psql and run a VACUUM VERBOSE command to get a report on the number of dead tuples that were removed by the vacuum operation. The verbose option will display information about the table, including the number of dead tuples that were removed during the VACUUM operation.

Concurrent VACUUM

VACUUM reclaims storage occupied by dead tuples. It is used to remove dead rows and free up space for future updates.

The VACUUM command scans the table and removes any dead rows, as well as reclaims the space they occupied. This helps to compact the table and reduce fragmentation. The VACUUM command can also analyze the table to update statistics used by the query optimizer.

We can have a look at how this command does through this example:

First, we can create a table in our database:

CREATE TABLE table1(col text) WITH (autovacuum_enabled = off);

Then insert one record into it and update that data once;

INSERT INTO table1(col) VALUES ('a');
UPDATE table1 SET col = 'b';

Now we should have one tuple in table1 and one dead tuple

Then we can create a function to print out all tuples in this table:


CREATE FUNCTION heap_page(relname text, pageno integer)
RETURNS TABLE(ctid tid, state text, xmin text, xmax text, t_ctid tid)
AS $$
SELECT (pageno,lp)::text::tid AS ctid,
CASE lp_flags
WHEN 0 THEN 'unused'
WHEN 1 THEN 'normal'
WHEN 2 THEN 'redirect to '||lp_off
WHEN 3 THEN 'dead'
END AS state,
t_xmin || CASE
WHEN (t_infomask & 256) > 0 THEN ' (c)'
WHEN (t_infomask & 512) > 0 THEN ' (a)'
ELSE ''
END AS xmin,
t_xmax || CASE
WHEN (t_infomask & 1024) > 0 THEN ' (c)'
WHEN (t_infomask & 2048) > 0 THEN ' (a)'
ELSE ''
END AS xmax,
t_ctid
FROM heap_page_items(get_raw_page(relname,pageno))
ORDER BY lp;
$$ LANGUAGE SQL;

After running this function, we can get:

SELECT * FROM heap_page('table1',0);
 ctid  | state  |  xmin   | xmax  | t_ctid 
-------+--------+---------+-------+--------
(0,1) | normal | 809 (c) | 810 | (0,2)
(0,2) | normal | 810 | 0 (a) | (0,2)
(2 rows)

The second record is the current tuple, while the first one is the dead tuple.

Now we execute the VACUUM command:

VACUUM table1;
SELECT * FROM heap_page('table1', 0);

The result is:

 ctid  |     state     |  xmin   | xmax  | t_ctid 
-------+---------------+---------+-------+--------
(0,1) | redirect to 2 | | |
(0,2) | normal | 810 (c) | 0 (a) | (0,2)
(2 rows)

Now only one tuple remains in the table, the dead tuple gets vacuumed away.

FULL VACUUM

In PostgreSQL, there is another way to do vacuum, called full vacuum. It refers to a command VACCUM FULL which will perform a complete vacuuming of a table or index.

VACUUM scans through the table or index and marks dead rows for deletion. It then frees up the space occupied by the dead rows for future use. VACUUM also updates statistics on the table, which can be used by the query optimizer to generate more efficient query plans.

VACUUM FULL is more aggressive and does a more thorough job of cleaning up the table or index. It physically reorganizes the table or index, moves data around to eliminate fragmentation, and compacts the table or index as much as possible. This can result in a significant reduction in the size of the table or index, but it can also take much longer to run than a regular VACUUM.

It’s important to note that VACUUM FULL can cause significant downtime and may require exclusive access to the table, whereas VACUUM can be run concurrently with other operations. As such, VACUUM FULL should only be used when absolutely necessary, such as when there is a significant amount of bloat or fragmentation in the table or index.

Now let me explain to you with an example, suppose we still using the table that we created before ‘table1’, and then insert 100000 rows of “a” and 100000 “b” by running this code:

DO $$   
BEGIN

FOR i IN 1..100000 LOOP
INSERT INTO table1(col) VALUES ('a');
END loop;

FOR i IN 1..100000 LOOP
INSERT INTO table1(col) VALUES ('b');
END loop;

END $$;

After that we can run this code to get the current size of the table1:

SELECT pg_size_pretty(pg_table_size('table1')) table_size;

Now, the result should be:

 table_size 
------------
7112 kB
(1 row)

Then, we need to delete some rows to create several dead tuples:

DELETE FROM table1 WHERE col = 'a';    

After that, we can run the pg_table_size again to get the size of the table1 again:

 table_size 
------------
7112 kB
(1 row)

Apparently, the size of this table will not change before we do any vacuum operations.

Now let's do a concurrent vacuum first, and measure the size again:

VACUUM table1;
SELECT pg_size_pretty(pg_table_size('table1')) table_size;

The result is:

 table_size 
------------
7120 kB
(1 row)

As we can see, the size doesn’t change much, that’s because when you run the regular VACUUM command on table1, it will scan through the table and reclaim the dead rows’ space but it will not physically reorganize the table, thus the result will be a table with some free space than can be reused by new rows.

Then let’s do a Full vacuum:

VACUUM FULL table1;     
SELECT pg_size_pretty(pg_table_size('table1')) table_size;

The result changes to:

 table_size 
------------
3552 kB
(1 row)

The space decreased to half of the previous size, that’s because the “VACUUM FULL” physically reorganized the table to eliminate fragmentation and compact it as much as possible.

Thanks for reading!

Level Up Coding

Thanks for being a part of our community! Before you go:

🚀👉 Join the Level Up talent collective and find an amazing job

--

--