Almost everyone has heard of partitioning and logical replication in PostgreSQL 10. Have you heard about the support of ICU collations (International Components for Unicode)?

This article will present what this new feature is but also the possible gains by exploiting abbreviated keys.

Table of Contents

Short reminder about collation

In database, collation corresponds to character classification rules.

Here are some examples:

1create table t1 (c1 text);
2insert into t1 values ('cote'),('coté'),('côte'),('côté');

Two examples of sorts with two differents locales: French and German

 1select * from t1 order by c1 collate "de_DE";
 2  c1
 3------
 4 cote
 5 coté
 6 côte
 7 côté
 8
 9select * from t1 order by c1 collate "fr_FR";
10  c1
11------
12 cote
13 côte
14 coté
15 côté

Look a sort order, in French language sorting is done on the last accent which is not the case of German.

By default, Postgres uses environment collation. It is possible to specify the collation at the creation of the instance, creation of a base, a table … Even at column level.

There is also a particular collation “C” where postgres performs sorting according to the encoding order of the characters.

The pg_collation table in the system catalog lists collations. The collprovider column indicates the source:

c: libc
i: ICU

For example:

 1-[ RECORD 1 ]-+-----------------------
 2collname      | en-US-x-icu
 3collnamespace | 11
 4collowner     | 10
 5collprovider  | i
 6collencoding  | -1
 7collcollate   | en-US
 8collctype     | en-US
 9collversion   | 153.64
10-[ RECORD 4 ]-+-----------------------
11collname      | en_US
12collnamespace | 11
13collowner     | 10
14collprovider  | c
15collencoding  | 6
16collcollate   | en_US.utf8
17collctype     | en_US.utf8
18collversion   |

ICU collations

Postgres relies on operating system libraries to perform sort operations. Under linux, it is based on the famous libc.

The major benefit of relying on this library is not having to rewrite and maintain a whole set of sorting rules.

However, this approach may have a disadvantage:

We must “trust” this library. Within the same distribution (and even release), the libc will always return the same results. Things can get complicated if you have to compare results from different libs. For example, using a different distribution or a different release. This is why it is not recommended to put servers in streaming replication with different distributions: indexes may not be consistent. Cf: The dangers of streaming across versions of glibc: A cautionary tale

ICU collations are richer. It is possible to change sort ordering. For example, uppercase before the lowercase: Setting Options. Peter Geoghegan gave some examples on the mailing list postgresql-hackers: What users can do with custom ICU collations in Postgres 10

History of abbreviated keys in PostgreSQL

Back to the past :) Version 9.5 improved sorting algorithm with abbreviated keys. The announced gains were really important, here between 40 and 70%: Use abbreviated keys for faster sorting of text datums

In summary, the abbreviated keys allow to better exploit processor cache. Peter Geoghegan, who is the author of the patch, provides an explanation on his blog: Abbreviated keys: exploiting locality to improve PostgreSQL’s text sort performance

Unfortunately this feature has been disabled in version 9.5.2: Disable abbreviated keys for string-sorting in non-C locales

The reason? A bug in some version of the libc! This bug caused a corruption of the index: Abbreviated keys glibc issue

What is the link with ICU collations? It’s quite simple, with a ICU collation postgres no longer relies on the libc. It is therefore possible to use Abbreviated keys.

We check it in the code:

 1src/backend/utils/adt/varlena.c
 21876     /*
 31877      * Unfortunately, it seems that abbreviation for non-C collations is
 41878      * broken on many common platforms; testing of multiple versions of glibc
 51879      * reveals that, for many locales, strcoll() and strxfrm() do not return
 61880      * consistent results, which is fatal to this optimization.  While no
 71881      * other libc other than Cygwin has so far been shown to have a problem,
 81882      * we take the conservative course of action for right now and disable
 91883      * this categorically.  (Users who are certain this isn't a problem on
101884      * their system can define TRUST_STRXFRM.)
111885      *
121886      * Even apart from the risk of broken locales, it's possible that there
131887      * are platforms where the use of abbreviated keys should be disabled at
141888      * compile time.  Having only 4 byte datums could make worst-case
151889      * performance drastically more likely, for example.  Moreover, macOS's
161890      * strxfrm() implementation is known to not effectively concentrate a
171891      * significant amount of entropy from the original string in earlier
181892      * transformed blobs.  It's possible that other supported platforms are
191893      * similarly encumbered.  So, if we ever get past disabling this
201894      * categorically, we may still want or need to disable it for particular
211895      * platforms.
221896      */
231897 #ifndef TRUST_STRXFRM
241898     if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU))
251899         abbreviate = false;
261900 #endif

Some tests

When I prepared my presentation on the Full Text Search in Postgres, I wanted to work on a “real” data set: the stackoverflow database

Here are the results of creating 3 indexes on the “title” column of the posts table (38GB):

Ignore collation: collate "C"
One using the en_US collation of the libc: collate "en_US"
Another one using the en_US collation of ICU library collate "en-US-x-icu"

I activate some options to have the execution time of the query and information about sorting:

1\timing
2set client_min_messages TO log;
3set trace_sort to on;

 1create index idx1 on posts (title collate  "C");
 2LOG:  begin index sort: unique = f, workMem = 6291456, randomAccess = f
 3LOG:  varstr_abbrev: abbrev_distinct after 160: 56.532166 (key_distinct: 59.707363, norm_abbrev_card: 0.353326, prop_card: 0.200000)
 4LOG:  varstr_abbrev: abbrev_distinct after 321: 110.782140 (key_distinct: 121.985752, norm_abbrev_card: 0.345116, prop_card: 0.200000)
 5[...]
 6LOG:  varstr_abbrev: abbrev_distinct after 10485760: 523091.461475 (key_distinct: 4215367.096361, norm_abbrev_card: 0.049886, prop_card: 0.002693)
 7LOG:  varstr_abbrev: abbrev_distinct after 20971522: 852125.989455 (key_distinct: 8800364.815018, norm_abbrev_card: 0.040633, prop_card: 0.001750)
 8LOG:  performsort starting: CPU: user: 21.88 s, system: 17.27 s, elapsed: 104.98 s
 9LOG:  performsort done: CPU: user: 43.55 s, system: 17.27 s, elapsed: 126.65 s
10LOG:  internal sort ended, 3519559 KB used: CPU: user: 48.14 s, system: 18.40 s, elapsed: 142.19 s
11CREATE INDEX
12Time: 142380.670 ms (02:22.381)

Here, postgres exploits abbreviated keys because there are no collation rules. It just has to sort according to the character encoding. Sorting does not take into account collation rules.

1create index idx2 on posts (title collate  "en_US");
2LOG:  begin index sort: unique = f, workMem = 6291456, randomAccess = f
3LOG:  performsort starting: CPU: user: 20.10 s, system: 17.32 s, elapsed: 104.80 s
4LOG:  performsort done: CPU: user: 137.52 s, system: 17.32 s, elapsed: 222.25 s
5LOG:  internal sort ended, 3519559 KB used: CPU: user: 142.41 s, system: 18.10 s, elapsed: 237.97 s
6CREATE INDEX
7Time: 238159.675 ms (03:58.160)

Postgres uses libc, so it can not exploit abbreviated keys.

 1create index idx3 on posts (title collate  "en-US-x-icu");
 2LOG:  begin index sort: unique = f, workMem = 6291456, randomAccess = f
 3LOG:  varstr_abbrev: abbrev_distinct after 160: 55.475952 (key_distinct: 59.707363, norm_abbrev_card: 0.346725, prop_card: 0.200000)
 4LOG:  varstr_abbrev: abbrev_distinct after 321: 110.782140 (key_distinct: 121.985752, norm_abbrev_card: 0.345116, prop_card: 0.200000)
 5[...]
 6LOG:  varstr_abbrev: abbrev_distinct after 10485760: 337228.120654 (key_distinct: 4215367.096361, norm_abbrev_card: 0.032161, prop_card: 0.002693)
 7LOG:  varstr_abbrev: abbrev_distinct after 20971522: 521498.943210 (key_distinct: 8800364.815018, norm_abbrev_card: 0.024867, prop_card: 0.001750)
 8LOG:  performsort starting: CPU: user: 30.22 s, system: 16.78 s, elapsed: 105.65 s
 9LOG:  performsort done: CPU: user: 66.86 s, system: 16.78 s, elapsed: 142.31 s
10LOG:  internal sort ended, 3519559 KB used: CPU: user: 71.23 s, system: 18.04 s, elapsed: 157.79 s
11CREATE INDEX
12Time: 157979.957 ms (02:37.980)

Postgres uses the ICU library, it can operate abbreviated keys. The gain is of the order of 34%. Unlike the first example, sorting takes into account the en_US collation rules.

Note: If you did not have system collations (during the initdb), you can import the new collations with the pg_import_system_collations function. For example:

1select pg_import_system_collations('pg_catalog');
2 pg_import_system_collations
3-----------------------------
4                           6

Postgres Sorts Collation ICU

Authors

Adrien Nayrat (he/him)

PostgreSQL DBA Freelance

For seven years, I held various positions as a systems and network engineer. And it was in 2013 that I first started working with PostgreSQL. I held positions as a consultant, trainer, and also production DBA (Doctolib, Peopledoc).

This blog is a place where I share my findings, knowledge and talks.

← PostgreSQL - JSONB and Statistics Nov 26, 2017

PostgreSQL 10 : Performances improvements Oct 4, 2017 →

No results found

PostgreSQL 10 : ICU & Abbreviated Keys

Short reminder about collation

ICU collations

History of abbreviated keys in PostgreSQL

Some tests