postgresql

Peter Geoghegan 0d861bbb70 Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at the
point where a leaf page split would otherwise be required.  New posting
list tuples are formed by merging together existing duplicate tuples.
The physical representation of the items on an nbtree leaf page is made
more space efficient by deduplication, but the logical contents of the
page are not changed.  Even unique indexes make use of deduplication as a
way of controlling bloat from duplicates whose TIDs point to different
versions of the same logical table row.

The lazy approach taken by nbtree has significant advantages over a GIN
style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit `dd299df8` (the commit that made heap TID a tiebreaker
column).

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.  This can significantly improve
transaction throughput, and significantly reduce the cost of vacuuming
indexes.

A new index storage parameter (deduplicate_items) controls the use of
deduplication.  The default setting is 'on', so all new B-Tree indexes
automatically use deduplication where possible.  This decision will be
reviewed at the end of the Postgres 13 beta period.

There is a regression of approximately 2% of transaction throughput with
synthetic workloads that consist of append-only inserts into a table
with several non-unique indexes, where all indexes have few or no
repeated values.  The underlying issue is that cycles are wasted on
unsuccessful attempts at deduplicating items in non-unique indexes.
There doesn't seem to be a way around it short of disabling
deduplication entirely.  Note that deduplication of items in unique
indexes is fairly well targeted in general, which avoids the problem
there (we can use a special heuristic to trigger deduplication passes in
unique indexes, since we're specifically targeting "version bloat").

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since the representation of posting list
tuples works in a way that's backwards compatible with version 4 indexes
(i.e. indexes built on PostgreSQL 12).  However, users must still
REINDEX a pg_upgrade'd index to use deduplication, regardless of the
Postgres version they've upgraded from.  This is the only way to set the
new nbtree metapage flag indicating that deduplication is generally
safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru

2020-02-26 13:05:30 -08:00
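
The commit message says that posting list tuples are formed by merging existing duplicates, and that the logical contents of the page do not change. The Python sketch below illustrates only that idea. It is not the nbtree implementation: the names `deduplicate` and `expand`, and the modelling of a leaf item as a `(key, tid)` pair, are assumptions made for illustration. It ignores page space limits, posting list size limits, LP_DEAD handling, and WAL logging.

```python
from typing import List, Tuple

# Hypothetical simplified heap TID: (block number, offset number).
Tid = Tuple[int, int]


def deduplicate(leaf_items: List[Tuple[int, Tid]]) -> List[Tuple[int, List[Tid]]]:
    """Merge adjacent items with equal keys into (key, posting list) tuples.

    Assumes leaf_items is already sorted by (key, tid), which mirrors a
    leaf page where heap TID acts as a tiebreaker column.
    """
    merged: List[Tuple[int, List[Tid]]] = []
    for key, tid in leaf_items:
        if merged and merged[-1][0] == key:
            # Same key as the previous tuple: extend its posting list.
            merged[-1][1].append(tid)
        else:
            # New key: start a new tuple with a one-element posting list.
            merged.append((key, [tid]))
    return merged


def expand(merged: List[Tuple[int, List[Tid]]]) -> List[Tuple[int, Tid]]:
    """Recover the logical (key, tid) items from posting list tuples."""
    return [(key, tid) for key, tids in merged for tid in tids]


# Four logical items, three of which share the key 7.
page = [(7, (1, 1)), (7, (1, 2)), (7, (3, 5)), (9, (2, 4))]
merged = deduplicate(page)
```

Here `merged` holds two physical tuples instead of four, and `expand(merged)` returns the original items, which is the sense in which the physical representation shrinks while the logical contents stay the same.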
config	Assume that we have signed integral types and flexible array members.	2020-02-21 14:30:48 -05:00
contrib	Add deduplication to nbtree.	2020-02-26 13:05:30 -08:00
doc	Add deduplication to nbtree.	2020-02-26 13:05:30 -08:00
src	Add deduplication to nbtree.	2020-02-26 13:05:30 -08:00
.dir-locals.el	Make Emacs perl-mode indent more like perltidy.	2019-01-13 11:32:31 -08:00
.editorconfig	Add .editorconfig	2019-12-18 09:13:13 +01:00
.gitattributes	gitattributes: Add new file	2019-11-12 08:13:55 +01:00
.gitignore	Support for optimizing and emitting code in LLVM JIT provider.	2018-03-22 11:05:22 -07:00
COPYRIGHT	Update copyrights for 2020	2020-01-01 12:21:45 -05:00
GNUmakefile.in	Add support for automatically updating Unicode derived files	2020-01-09 10:08:14 +01:00
HISTORY	Canonicalize some URLs	2020-02-10 20:47:50 +01:00
Makefile	Don't unset MAKEFLAGS in non-GNU Makefile.	2019-06-25 09:36:21 +12:00
README	Canonicalize some URLs	2020-02-10 20:47:50 +01:00
README.git	Canonicalize some URLs	2020-02-10 20:47:50 +01:00
aclocal.m4	Fix configure's AC_CHECK_DECLS tests to work correctly with clang.	2018-11-19 12:01:47 -05:00
configure	Assume that we have signed integral types and flexible array members.	2020-02-21 14:30:48 -05:00
configure.in	Assume that we have signed integral types and flexible array members.	2020-02-21 14:30:48 -05:00

README

PostgreSQL Database Management System
=====================================

This directory contains the source code distribution of the PostgreSQL
database management system.

PostgreSQL is an advanced object-relational database management system
that supports an extended subset of the SQL standard, including
transactions, foreign keys, subqueries, triggers, user-defined types
and functions.  This distribution also contains C language bindings.

PostgreSQL has many language interfaces, many of which are listed here:

	https://www.postgresql.org/download/

See the file INSTALL for instructions on how to build and install
PostgreSQL.  That file also lists supported operating systems and
hardware platforms and contains information regarding any other
software packages that are required to build or run the PostgreSQL
system.  Copyright and license information can be found in the
file COPYRIGHT.  A comprehensive documentation set is included in this
distribution; it can be read as described in the installation
instructions.

The latest version of this software may be obtained at
https://www.postgresql.org/download/.  For more information, see our
web site at https://www.postgresql.org/.