Partitioning use cases with PostgreSQL
After a short break, I’m back to writing technical articles about Postgres. This is also an opportunity to announce a change in my activity: since 2021, I have been working as a freelancer, giving companies the opportunity to benefit from my experience with Postgres.
History of partitioning in PostgreSQL
For a long time, PostgreSQL has made it possible to partition tables using table inheritance. However, this method was rather difficult to implement: it required setting up triggers to redirect writes (less efficient than native partitioning), and planning time could increase significantly beyond a hundred or so partitions…
Native partitioning arrived with version 10. Since that version, Postgres is able (among other things) to route writes to the right partition, perform partition pruning to read only the relevant partitions, use algorithms that exploit the partitioning, etc. It thus offers better performance and easier maintenance.
Since Postgres 10 (and the versions that followed), we can:
- Partition:
- By list
- By hash
- By intervals
- Do multi-level partitioning
- Partition on multiple columns
- Use primary and foreign keys
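As a quick illustration, here is a minimal sketch of declarative range partitioning; the `orders` table, its columns and the partition bounds are hypothetical:

```sql
-- Hypothetical orders table, range-partitioned by month.
-- Note: a primary key on a partitioned table must include the
-- partition key (supported since version 11).
CREATE TABLE orders (
    order_id   bigserial,
    order_date date NOT NULL,
    status     text NOT NULL,
    PRIMARY KEY (order_id, order_date)
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2021_01 PARTITION OF orders
    FOR VALUES FROM ('2021-01-01') TO ('2021-02-01');
CREATE TABLE orders_2021_02 PARTITION OF orders
    FOR VALUES FROM ('2021-02-01') TO ('2021-03-01');
```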
All these features are interesting, but we can ask ourselves a very simple question: when should we implement partitioning?
I will introduce you to several use cases that I have encountered. But first, here are some common mistakes about partitioning.
Common mistakes
“Partitioning is necessary as soon as the size is important”
First, what is a “large” size?
Some will say that it is more than several hundred GB, others more than a terabyte, others still more than a petabyte…
There is no real answer to this question; it generally depends on the type of workload: the INSERT/UPDATE/DELETE ratio, the type of SELECT (OLTP, OLAP…). It also depends on the hardware. Ten years ago, when servers only had a few GB of RAM and mechanical disks, a database of a few hundred GB was likely perceived as large. Nowadays it is not uncommon to see servers with over a terabyte of RAM and NVMe drives.

Thus, a database of a few hundred GB is no longer considered a large database, but rather a modest one.
A little story: to reassure himself, a customer once asked me whether Postgres was already used for “large” volumes. We were talking about a 40 GB database on a server with 64 GB of RAM, so all reads were served from the cache… :) I was able to reassure him: his database was relatively modest.
It may be superfluous to partition a database of a few TB, just as it may be necessary to partition one of a few hundred GB. For example, if the activity consists only of adding rows to tables and the queries are simple `WHERE column = 4` predicates that return a few rows, a simple B-tree index will do the job. And if the query returns many rows, it is possible to use BRIN indexes or bloom filters.
As Wikipedia puts it: “BRIN indexes provide similar benefits to horizontal partitioning or sharding but without needing to explicitly declare partitions.”1
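For example, on an append-only table where rows arrive roughly in insertion order, a tiny BRIN index can do the work of a much larger B-tree; a minimal sketch, with a hypothetical `logs` table:

```sql
-- Hypothetical append-only log table: rows arrive in timestamp
-- order, so block ranges correlate well with log_time.
CREATE TABLE logs (
    log_time timestamptz NOT NULL,
    message  text
);

-- A BRIN index stores one summary per range of blocks and stays
-- tiny even on very large tables.
CREATE INDEX logs_log_time_brin ON logs USING brin (log_time);
```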
“Partitioning is necessary to spread data over several disks”
The idea would be to create partitions and tablespaces on different disks in order to spread the input/output operations.
For PostgreSQL, a tablespace is nothing more or less than a path to a directory. It is quite possible to manage the storage at the operating system level and aggregate several disks (in RAID 10, for example). Then it is just a matter of storing the table on the resulting volume. I/O is thus spread over a set of disks.
In this case, it is not necessary to implement partitioning. However, we will see a case where it might make sense.
Now we will look at “legitimate” use cases of partitioning.
Partitioning use cases
Partitioning to manage retention
Due to the MVCC model, massive deletes lead to bloat in the tables.

A possible use case is to partition by date. Deleting the old data then amounts to dropping a whole partition: the operation is fast and the tables do not bloat.
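A minimal sketch, reusing the hypothetical monthly partitions from earlier:

```sql
-- Dropping the oldest partition is fast and leaves no bloat behind,
-- unlike a massive DELETE.
DROP TABLE orders_2021_01;

-- Alternatively, detach the partition to keep its data around as a
-- plain table (to archive it, for example).
ALTER TABLE orders DETACH PARTITION orders_2021_02;
```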
Partitioning to control index bloat
Adding and modifying data bloats indexes over time. To put it simply, the free space in an index page cannot be reclaimed until the page is completely empty. Over time, page splits create “empty” space in the index, and the only way to recover it is to rebuild the index.

This is called “bloat”. There have been many improvements in recent versions of Postgres:
- In version 12, we can read in the release notes:
Improve performance and space utilization of btree indexes with many duplicates (Peter Geoghegan, Heikki Linnakangas)
Previously, duplicate index entries were stored unordered within their duplicate groups. This caused overhead during index inserts, wasted space due to excessive page splits, and it reduced VACUUM’s ability to recycle entire pages. Duplicate index entries are now sorted in heap-storage order.
- In version 13, we can read in the release notes:
More efficiently store duplicates in B-tree indexes (Anastasia Lubennikova, Peter Geoghegan)
This allows efficient B-tree indexing of low-cardinality columns by storing duplicate keys only once. Users upgrading with pg_upgrade will need to use REINDEX to make an existing index use this feature.
- In version 14, we can read in the release notes:
Allow btree index additions to remove expired index entries to prevent page splits (Peter Geoghegan)
This is particularly helpful for reducing index bloat on tables whose indexed columns are frequently updated.
To control the bloat, we could rebuild the index regularly (thanks to `REINDEX CONCURRENTLY`, which arrived in version 12). However, this solution would be cumbersome, because the whole index would have to be rebuilt regularly.
If most modifications are made on recent data (log tables, customer orders, appointments…), we could imagine partitioning by month. Thus, at the beginning of each month we start with a “new” table, and we can reindex the previous one to remove the bloat.
We can also take advantage of this to run a `CLUSTER` on the old partition, to get a good correlation between the data and its storage.
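A minimal sketch of such a monthly cleanup, assuming the partition and index names from the earlier example:

```sql
-- Rebuild the indexes of last month's partition to remove the bloat.
-- REINDEX ... CONCURRENTLY requires version 12 or later.
REINDEX TABLE CONCURRENTLY orders_2021_01;

-- Optionally, rewrite the partition in index order to improve the
-- correlation between data and storage (takes an exclusive lock).
CLUSTER orders_2021_01 USING orders_2021_01_pkey;
```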
Partitioning for low cardinality
Gradually we will see more complicated use cases :)
Let’s take an example: an order table with a delivery state. After a few years, 99% of the orders are delivered (we hope!) and very few are in the process of payment or delivery.

Let’s imagine that we want to retrieve 100 orders in progress. We will create an index on the state and use it to retrieve the records. If we are a bit clever, we can create a partial index on this particular state. The problem is that this index will bloat quite quickly as the orders are delivered.

In this case, we could partition on the state. Retrieving 100 orders in the process of being delivered then amounts to reading 100 records from the corresponding partition.
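A minimal sketch, assuming the state is a simple text column (all names are hypothetical):

```sql
-- Orders partitioned by state: the small "active" partitions can be
-- read and reindexed independently of the huge "delivered" one.
CREATE TABLE orders_by_state (
    order_id   bigint NOT NULL,
    order_date date   NOT NULL,
    state      text   NOT NULL
) PARTITION BY LIST (state);

CREATE TABLE orders_delivered PARTITION OF orders_by_state
    FOR VALUES IN ('delivered');
CREATE TABLE orders_active PARTITION OF orders_by_state
    FOR VALUES IN ('payment pending', 'shipping');

-- Fetching 100 in-progress orders only reads the small partition.
SELECT * FROM orders_by_state WHERE state = 'shipping' LIMIT 100;
```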
Partitioning to get more accurate statistics
To determine the best execution plan, Postgres makes decisions based on statistics. They are obtained from a sample of the table (controlled by `default_statistics_target`, which is 100 by default).

By default, Postgres collects 300 × `default_statistics_target` rows, that is 30,000 rows. With a table of several hundred million rows, this sample is sometimes too small.
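For reference, the sample size can be checked globally and raised per column; a sketch with the hypothetical `orders` table:

```sql
-- ANALYZE samples 300 × default_statistics_target rows.
SHOW default_statistics_target;  -- 100 by default, i.e. 30,000 rows

-- Per-column override, then refresh the statistics.
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
```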
We can drastically increase the sample size, but this approach has some drawbacks:
- It increases the planning time
- It makes ANALYZE heavier.
- Sometimes it is not enough if the data is not well distributed. For example, if you take a few hundred thousand rows from a table with several hundred million rows, you may miss the rows that are in delivery status.
With partitioning, we get the same sample size, but per partition, which improves the accuracy of the statistics.
This is also useful when data is correlated between columns. Take the example of orders: we have a whole year of orders; all orders more than one month old are delivered, and those of the last month are 90% delivered (10% are in progress).

Intuitively, if I look for an in-progress order from more than 6 months ago, I should not get any result. On the other hand, if I search for in-progress orders from the last month, I should get 10% of the table. But Postgres does not know that: as far as it is concerned, in-progress orders are spread over the whole table.

With partitioning by date, it can estimate that there are no in-progress orders among deliveries older than one month. This approach is mainly used to reduce estimation errors in an execution plan.
Here is an example with this order table: `orders_p` is the month-partitioned version of the `orders` table. The data is identical in both tables.
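As a sketch (the `status` and `order_date` columns are assumed), the query compared on both tables looks like this:

```sql
-- In-progress orders older than six months: in reality there are
-- almost none. With table-wide statistics only, the planner assumes
-- in-progress rows are spread over the whole table and overestimates.
EXPLAIN SELECT *
FROM orders
WHERE status = 'in progress'
  AND order_date < now() - interval '6 months';

-- Same query on the partitioned table: per-partition statistics show
-- that old partitions contain no in-progress orders.
EXPLAIN SELECT *
FROM orders_p
WHERE status = 'in progress'
  AND order_date < now() - interval '6 months';
```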
We can notice that the estimate is much better when the table is partitioned, since Postgres has per-partition statistics.
Now let’s take the same query over the last month:
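Again as a sketch, with the same assumed columns:

```sql
-- Last month's partition holds roughly 10% in-progress orders, and
-- its per-partition statistics reflect that proportion.
EXPLAIN SELECT *
FROM orders_p
WHERE status = 'in progress'
  AND order_date >= now() - interval '1 month';
```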
Here again, we can see that the estimate is better.
partitionwise join & partitionwise aggregate
Another benefit of partitioning is access to better algorithms for joins and aggregation.
Partitionwise aggregation allows an aggregate or a grouping to be computed partition by partition. An example is better than a long speech:
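A sketch of the comparison, reusing the hypothetical `orders_p` table; `enable_partitionwise_aggregate` is off by default:

```sql
-- Default behaviour: rows from all partitions are appended first,
-- then aggregated in a single pass.
SET enable_partitionwise_aggregate = off;
EXPLAIN SELECT order_date, count(*)
FROM orders_p
GROUP BY order_date;

-- Partitionwise: each partition is aggregated separately (the GROUP
-- BY key must match the partition key) and the results are appended.
SET enable_partitionwise_aggregate = on;
EXPLAIN SELECT order_date, count(*)
FROM orders_p
GROUP BY order_date;
```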
In the first case, the aggregation is done in a single pass over all the partitions, while in the second, it is done partition by partition. We can also notice that the total cost is lower in the plan with per-partition aggregation.
`partitionwise join` works on the same principle: the join is performed partition by partition. It is useful for joining two tables that are partitioned the same way.
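A sketch, assuming a hypothetical `order_lines_p` table partitioned by month on the same keys as `orders_p`:

```sql
-- Both tables must be partitioned the same way on the join key for
-- the planner to join them partition by partition.
SET enable_partitionwise_join = on;  -- off by default

EXPLAIN SELECT count(*)
FROM orders_p o
JOIN order_lines_p l USING (order_id, order_date);
```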
Storage tiering
Finally, another use case would be to store part of the table on different storage:
We can store a partitioned table across several tablespaces: for example, recent data in a fast tablespace on NVMe SSDs, and more rarely accessed data in another tablespace on less expensive mechanical disks.
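A minimal sketch, with hypothetical tablespace names and paths:

```sql
-- Hypothetical tablespaces: fast NVMe storage for hot data, cheap
-- mechanical disks for history (paths are placeholders).
CREATE TABLESPACE fast_nvme LOCATION '/mnt/nvme/pgdata';
CREATE TABLESPACE cold_hdd  LOCATION '/mnt/hdd/pgdata';

-- Recent partition created on fast storage...
CREATE TABLE orders_2021_03 PARTITION OF orders
    FOR VALUES FROM ('2021-03-01') TO ('2021-04-01')
    TABLESPACE fast_nvme;

-- ...and an old partition moved to cheap storage (moves the files,
-- takes an exclusive lock during the move).
ALTER TABLE orders_2021_01 SET TABLESPACE cold_hdd;
```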
This approach can also make sense in the cloud era where storage is very expensive.
Conclusion
I think I’ve covered the main use cases that came to my mind.
Obviously, implementing partitioning adds complexity (partition management…) and comes with limitations that should be studied beforehand.
1. “BRIN indexes provide similar benefits to horizontal partitioning or sharding but without needing to explicitly declare partitions.” - https://en.wikipedia.org/wiki/Block_Range_Index ↩︎