PostgreSQL and heap-only-tuples updates - part 1
Here is a series of articles that will focus on a new feature in version 11.
During the development of this version, a feature caught my attention. It can be found in releases notes : https://www.postgresql.org/docs/11/static/release-11.html
Allow heap-only-tuple (HOT) updates for expression indexes when the values of the expressions are unchanged (Konstantin Knizhnik)
I admit that this is not very explicit and this feature requires some knowledge about how postgres works, that I will try to explain through several articles:
- How MVCC works and heap-only-tuples updates
- When postgres do not use heap-only-tuple updates and introduction to the new feature in v11
- Impact on performances
This feature was disabled in 11.1 because it could lead to instance crashes1. I chose to publish these articles because they help to understand the mechanism of HOT updates and the benefits that this feature could bring.
I thank Guillaume Lelarge for his review of this article ;).
How MVCC works
Due to MVCC, postgres does not update a line directly: it duplicates the line and provides visibility information.
Why does it work like this?
There is a critical component to consider when working with a RDBMS: concurrency.
The line you are editing may be used by a previous transaction. An ongoing backup, for example :)
To this end, RDBMS have adopted different techniques:
- Modify the line and store previous versions on another location. This is what oracle does with undo logs, for example.
- Duplicate the line and store visibility information to know which line is visible by which transaction. This requires to have a cleaning mechanism for lines that are no longer visible to no one. This is the implementation in Postgres and the vacuum is responsible to perform this cleaning.
Let’s take a very simple table and watch its content evolve using the pageinspect extension:
1CREATE TABLE t2(c1 int);
2INSERT INTO t2 VALUES (1);
3SELECT lp,t_data FROM heap_page_items(get_raw_page('t2',0));
4 lp | t_data
5----+------------
6 1 | \x01000000
7(1 row)
8
9UPDATE t2 SET c1 = 2 WHERE c1 = 1;
10SELECT lp,t_data FROM heap_page_items(get_raw_page('t2',0));
11 lp | t_data
12----+------------
13 1 | \x01000000
14 2 | \x02000000
15(2 rows)
16
17VACUUM t2;
18SELECT lp,t_data FROM heap_page_items(get_raw_page('t2',0));
19 lp | t_data
20----+------------
21 1 |
22 2 | \x02000000
23(2 rows)
We can see that the engine duplicated the line and the vacuum cleaned location for future use.
heap-only-tuple mechanism
Let’s take another case, a little more complicated, a table with two columns and an index on one of the two columns:
1CREATE TABLE t3(c1 int,c2 int);
2CREATE INDEX ON t3(c1);
3INSERT INTO t3(c1,c2) VALUES (1,1);
4INSERT INTO t3(c1,c2) VALUES (2,2);
5SELECT ctid,* FROM t3;
6 ctid | c1 | c2
7-------+----+----
8 (0,1) | 1 | 1
9 (0,2) | 2 | 2
10(2 rows)
In the same way it is possible to read table blocks, it is possible to read blocks with pageinspect:
1SELECT * FROM bt_page_items(get_raw_page('t3_c1_idx',1));
2 itemoffset | ctid | itemlen | nulls | vars | data
3------------+-------+---------+-------+------+-------------------------
4 1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
5 2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
6(2 rows)
So far, it’s pretty simple, my table contains two records and the index also contains two records pointing to the corresponding blocks in the table (ctid column).
If I update column c1, with the new value 3 for example, the index will have to be updated.
Now, if I update the column c2. Will the index on c1 be updated?
At first glance, we might say no because c1 is unchanged.
But because of the MVCC model presented above, in theory, the answer will be yes: we have just seen that postgres will duplicate the line, so its physical location will be different (the next ctid will be (0.3)).
Let’s check it out:
1UPDATE t3 SET c2 = 3 WHERE c1=1;
2SELECT lp,t_data,t_ctid FROM heap_page_items(get_raw_page('t3',0));
3 lp | t_data | t_ctid
4----+--------------------+--------
5 1 | \x0100000001000000 | (0,3)
6 2 | \x0200000002000000 | (0,2)
7 3 | \x0100000003000000 | (0,3)
8(3 rows)
9
10
11SELECT * FROM bt_page_items(get_raw_page('t3_c1_idx',1));
12 itemoffset | ctid | itemlen | nulls | vars | data
13------------+-------+---------+-------+------+-------------------------
14 1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
15 2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
16(2 rows)
Reading the table block confirms that the line has been duplicated.
By looking carefully at the field t_data, we can distinguish the 1 from column
c1 and the 3 from column c2.
If you read index block, you can see that its content has not changed! If I search
for line WHERE c1 = 1, the index points me towards the record (0,1) which
corresponds to the old line!
What happened?
In fact, we have just revealed a rather special mechanism called heap-only-tuple alias HOT. When a column is updated, no index points to that column and the record can be inserted in the same block, Postgres will only make a pointer between the old and the new record.
This allows postgres to avoid having to update the index. With all that implies:
- Read/write operations avoided
- Reduce index fragmentation and therefore of its size (it is difficult to reuse old index locations)
If you look at table block, the column t_ctid of the first line points to (0.3).
If the line was updated again, the first line of the table would point to the
line (0.3) and the line (0.3) would point to (0.4), forming what is called a chain.
A vacuum would clean up the free spaces but would always keep the first line
that would point to the last recording.
A line is changed and the index still does not change:
1UPDATE t3 SET c2 = 4 WHERE c1=1;
2SELECT * FROM bt_page_items(get_raw_page('t3_c1_idx',1));
3 itemoffset | ctid | itemlen | nulls | vars | data
4------------+-------+---------+-------+------+-------------------------
5 1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
6 2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
7(2 rows)
8
9SELECT lp,t_data,t_ctid FROM heap_page_items(get_raw_page('t3',0));
10 lp | t_data | t_ctid
11----+--------------------+--------
12 1 | \x0100000001000000 | (0,3)
13 2 | \x0200000002000000 | (0,2)
14 3 | \x0100000003000000 | (0,4)
15 4 | \x0100000004000000 | (0,4)
16(4 rows)
Vacuum cleans available spaces:
1VACUUM t3;
2SELECT lp,t_data,t_ctid FROM heap_page_items(get_raw_page('t3',0));
3 lp | t_data | t_ctid
4----+--------------------+--------
5 1 | |
6 2 | \x0200000002000000 | (0,2)
7 3 | |
8 4 | \x0100000004000000 | (0,4)
9(4 rows)
An update will reuse the second location and the index remains unchanged.
Look at the value of the column t_ctid to reconstruct the chain:
1UPDATE t3 SET c2 = 5 WHERE c1=1;
2SELECT lp,t_data,t_ctid FROM heap_page_items(get_raw_page('t3',0));
3 lp | t_data | t_ctid
4----+--------------------+--------
5 1 | |
6 2 | \x0200000002000000 | (0,2)
7 3 | \x0100000005000000 | (0,3)
8 4 | \x0100000004000000 | (0,3)
9(4 rows)
10
11
12SELECT * FROM bt_page_items(get_raw_page('t3_c1_idx',1));
13 itemoffset | ctid | itemlen | nulls | vars | data
14------------+-------+---------+-------+------+-------------------------
15 1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
16 2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
17(2 rows)
Uh, the first line is empty and postgres reused the third location?
In fact, an information does not appear in pageinspect. Let’s go read the block directly with pg_filedump :
Note: You must first request a CHECKPOINT otherwise the block may not yet be
written to disk.
1pg_filedump 11/main/base/16606/8890510
2
3*******************************************************************
4* PostgreSQL File/Block Formatted Dump Utility - Version 10.1
5*
6* File: 11/main/base/16606/8890510
7* Options used: None
8*
9* Dump created on: Sun Sep 2 13:09:53 2018
10*******************************************************************
11
12Block 0 ********************************************************
13<Header> -----
14 Block Offset: 0x00000000 Offsets: Lower 40 (0x0028)
15 Block: Size 8192 Version 4 Upper 8096 (0x1fa0)
16 LSN: logid 52 recoff 0xc39ea148 Special 8192 (0x2000)
17 Items: 4 Free Space: 8056
18 Checksum: 0x0000 Prune XID: 0x0000168b Flags: 0x0001 (HAS_FREE_LINES)
19 Length (including item array): 40
20
21<Data> ------
22 Item 1 -- Length: 0 Offset: 4 (0x0004) Flags: REDIRECT
23 Item 2 -- Length: 32 Offset: 8160 (0x1fe0) Flags: NORMAL
24 Item 3 -- Length: 32 Offset: 8096 (0x1fa0) Flags: NORMAL
25 Item 4 -- Length: 32 Offset: 8128 (0x1fc0) Flags: NORMAL
The first line contains Flags: REDIRECT. This indicates that this line corresponds
to a HOT redirection. This is documented in src/include/storage/itemid.h :
1/*
2 * lp_flags has these possible states. An UNUSED line pointer is available
3 * for immediate re-use, the other states are not.
4 */
5#define LP_UNUSED 0 /* unused (should always have lp_len=0) */
6#define LP_NORMAL 1 /* used (should always have lp_len>0) */
7#define LP_REDIRECT 2 /* HOT redirect (should have lp_len=0) */
8#define LP_DEAD 3 /* dead, may or may not have storage */
In fact, it is possible to see it with pageinspect by displaying the column lp_flags:
1SELECT lp,lp_flags,t_data,t_ctid FROM heap_page_items(get_raw_page('t3',0));
2 lp | lp_flags | t_data | t_ctid
3----+----------+--------------------+--------
4 1 | 2 | |
5 2 | 1 | \x0200000002000000 | (0,2)
6 3 | 1 | \x0100000005000000 | (0,3)
7 4 | 1 | \x0100000004000000 | (0,3)
8(4 rows)
If we do an update again, then a vacuum, followed by a CHECKPOINT to write the block to disk:
1SELECT lp,lp_flags,t_data,t_ctid FROM heap_page_items(get_raw_page('t3',0));
2 lp | lp_flags | t_data | t_ctid
3----+----------+--------------------+--------
4 1 | 2 | |
5 2 | 1 | \x0200000002000000 | (0,2)
6 3 | 0 | |
7 4 | 0 | |
8 5 | 1 | \x0100000006000000 | (0,5)
9(5 rows)
10
11CHECKPOINT;
12
13pg_filedump 11/main/base/16606/8890510
14
15*******************************************************************
16* PostgreSQL File/Block Formatted Dump Utility - Version 10.1
17*
18* File: 11/main/base/16606/8890510
19* Options used: None
20*
21* Dump created on: Sun Sep 2 13:16:12 2018
22*******************************************************************
23
24Block 0 ********************************************************
25<Header> -----
26 Block Offset: 0x00000000 Offsets: Lower 44 (0x002c)
27 Block: Size 8192 Version 4 Upper 8128 (0x1fc0)
28 LSN: logid 52 recoff 0xc39ea308 Special 8192 (0x2000)
29 Items: 5 Free Space: 8084
30 Checksum: 0x0000 Prune XID: 0x00000000 Flags: 0x0005 (HAS_FREE_LINES|ALL_VISIBLE)
31 Length (including item array): 44
32
33<Data> ------
34 Item 1 -- Length: 0 Offset: 5 (0x0005) Flags: REDIRECT
35 Item 2 -- Length: 32 Offset: 8160 (0x1fe0) Flags: NORMAL
36 Item 3 -- Length: 0 Offset: 0 (0x0000) Flags: UNUSED
37 Item 4 -- Length: 0 Offset: 0 (0x0000) Flags: UNUSED
38 Item 5 -- Length: 32 Offset: 8128 (0x1fc0) Flags: NORMAL
39
40
41*** End of File Encountered. Last Block Read: 0 ***
Postgres has kept the first line (flag REDIRECT) and writes a new line at location 5.
However, there are a few cases where postgres cannot use this mechanism:
- When there is no more space in the block and he has to write another block. We can deduce that the fragmentation of the table is useful here to benefit of HOT.
- When an index exists on the updated column. In this case postgres must update the index. Postgres can detect if there has been a change by performing a binary comparison between the new value and the previous one 2.
In the next article we will see an example where postgres cannot use HOT mechanism. Then, the new feature of version 11 where postgres can use this mechanism.
For seven years, I held various positions as a systems and network engineer. And it was in 2013 that I first started working with PostgreSQL. I held positions as a consultant, trainer, and also production DBA (Doctolib, Peopledoc).
This blog is a place where I share my findings, knowledge and talks.