765 lines
31 KiB
Plaintext
765 lines
31 KiB
Plaintext
<!-- doc/src/sgml/btree.sgml -->
|
|
|
|
<chapter id="btree">
|
|
<title>B-Tree Indexes</title>
|
|
|
|
<indexterm>
|
|
<primary>index</primary>
|
|
<secondary>B-Tree</secondary>
|
|
</indexterm>
|
|
|
|
<sect1 id="btree-intro">
|
|
<title>Introduction</title>
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname> includes an implementation of the
|
|
standard <acronym>btree</acronym> (multi-way balanced tree) index data
|
|
structure. Any data type that can be sorted into a well-defined linear
|
|
order can be indexed by a btree index. The only limitation is that an
|
|
index entry cannot exceed approximately one-third of a page (after TOAST
|
|
compression, if applicable).
|
|
</para>
|
|
|
|
<para>
|
|
Because each btree operator class imposes a sort order on its data type,
|
|
btree operator classes (or, really, operator families) have come to be
|
|
used as <productname>PostgreSQL</productname>'s general representation
|
|
and understanding of sorting semantics. Therefore, they've acquired
|
|
some features that go beyond what would be needed just to support btree
|
|
indexes, and parts of the system that are quite distant from the
|
|
btree AM make use of them.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="btree-behavior">
|
|
<title>Behavior of B-Tree Operator Classes</title>
|
|
|
|
<para>
|
|
As shown in <xref linkend="xindex-btree-strat-table"/>, a btree operator
|
|
class must provide five comparison operators,
|
|
<literal><</literal>,
|
|
<literal><=</literal>,
|
|
<literal>=</literal>,
|
|
<literal>>=</literal> and
|
|
<literal>></literal>.
|
|
One might expect that <literal><></literal> should also be part of
|
|
the operator class, but it is not, because it would almost never be
|
|
useful to use a <literal><></literal> WHERE clause in an index
|
|
search. (For some purposes, the planner treats <literal><></literal>
|
|
as associated with a btree operator class; but it finds that operator via
|
|
the <literal>=</literal> operator's negator link, rather than
|
|
from <structname>pg_amop</structname>.)
|
|
</para>
|
|
|
|
<para>
|
|
When several data types share near-identical sorting semantics, their
|
|
operator classes can be grouped into an operator family. Doing so is
|
|
advantageous because it allows the planner to make deductions about
|
|
cross-type comparisons. Each operator class within the family should
|
|
contain the single-type operators (and associated support functions)
|
|
for its input data type, while cross-type comparison operators and
|
|
support functions are <quote>loose</quote> in the family. It is
|
|
recommendable that a complete set of cross-type operators be included
|
|
in the family, thus ensuring that the planner can represent any
|
|
comparison conditions that it deduces from transitivity.
|
|
</para>
|
|
|
|
<para>
|
|
There are some basic assumptions that a btree operator family must
|
|
satisfy:
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
An <literal>=</literal> operator must be an equivalence relation; that
|
|
is, for all non-null values <replaceable>A</replaceable>,
|
|
<replaceable>B</replaceable>, <replaceable>C</replaceable> of the
|
|
data type:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
<replaceable>A</replaceable> <literal>=</literal>
|
|
<replaceable>A</replaceable> is true
|
|
(<firstterm>reflexive law</firstterm>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
if <replaceable>A</replaceable> <literal>=</literal>
|
|
<replaceable>B</replaceable>,
|
|
then <replaceable>B</replaceable> <literal>=</literal>
|
|
<replaceable>A</replaceable>
|
|
(<firstterm>symmetric law</firstterm>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
if <replaceable>A</replaceable> <literal>=</literal>
|
|
<replaceable>B</replaceable> and <replaceable>B</replaceable>
|
|
<literal>=</literal> <replaceable>C</replaceable>,
|
|
then <replaceable>A</replaceable> <literal>=</literal>
|
|
<replaceable>C</replaceable>
|
|
(<firstterm>transitive law</firstterm>)
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
A <literal><</literal> operator must be a strong ordering relation;
|
|
that is, for all non-null values <replaceable>A</replaceable>,
|
|
<replaceable>B</replaceable>, <replaceable>C</replaceable>:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
<replaceable>A</replaceable> <literal><</literal>
|
|
<replaceable>A</replaceable> is false
|
|
(<firstterm>irreflexive law</firstterm>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
if <replaceable>A</replaceable> <literal><</literal>
|
|
<replaceable>B</replaceable>
|
|
and <replaceable>B</replaceable> <literal><</literal>
|
|
<replaceable>C</replaceable>,
|
|
then <replaceable>A</replaceable> <literal><</literal>
|
|
<replaceable>C</replaceable>
|
|
(<firstterm>transitive law</firstterm>)
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Furthermore, the ordering is total; that is, for all non-null
|
|
values <replaceable>A</replaceable>, <replaceable>B</replaceable>:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
exactly one of <replaceable>A</replaceable> <literal><</literal>
|
|
<replaceable>B</replaceable>, <replaceable>A</replaceable>
|
|
<literal>=</literal> <replaceable>B</replaceable>, and
|
|
<replaceable>B</replaceable> <literal><</literal>
|
|
<replaceable>A</replaceable> is true
|
|
(<firstterm>trichotomy law</firstterm>)
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
(The trichotomy law justifies the definition of the comparison support
|
|
function, of course.)
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>
|
|
The other three operators are defined in terms of <literal>=</literal>
|
|
and <literal><</literal> in the obvious way, and must act consistently
|
|
with them.
|
|
</para>
|
|
|
|
<para>
|
|
For an operator family supporting multiple data types, the above laws must
|
|
hold when <replaceable>A</replaceable>, <replaceable>B</replaceable>,
|
|
<replaceable>C</replaceable> are taken from any data types in the family.
|
|
The transitive laws are the trickiest to ensure, as in cross-type
|
|
situations they represent statements that the behaviors of two or three
|
|
different operators are consistent.
|
|
As an example, it would not work to put <type>float8</type>
|
|
and <type>numeric</type> into the same operator family, at least not with
|
|
the current semantics that <type>numeric</type> values are converted
|
|
to <type>float8</type> for comparison to a <type>float8</type>. Because
|
|
of the limited accuracy of <type>float8</type>, this means there are
|
|
distinct <type>numeric</type> values that will compare equal to the
|
|
same <type>float8</type> value, and thus the transitive law would fail.
|
|
</para>
|
|
|
|
<para>
|
|
Another requirement for a multiple-data-type family is that any implicit
|
|
or binary-coercion casts that are defined between data types included in
|
|
the operator family must not change the associated sort ordering.
|
|
</para>
|
|
|
|
<para>
|
|
It should be fairly clear why a btree index requires these laws to hold
|
|
within a single data type: without them there is no ordering to arrange
|
|
the keys with. Also, index searches using a comparison key of a
|
|
different data type require comparisons to behave sanely across two
|
|
data types. The extensions to three or more data types within a family
|
|
are not strictly required by the btree index mechanism itself, but the
|
|
planner relies on them for optimization purposes.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="btree-support-funcs">
|
|
<title>B-Tree Support Functions</title>
|
|
|
|
<para>
|
|
As shown in <xref linkend="xindex-btree-support-table"/>, btree defines
|
|
one required and three optional support functions. The four
|
|
user-defined methods are:
|
|
</para>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><function>order</function></term>
|
|
<listitem>
|
|
<para>
|
|
For each combination of data types that a btree operator family
|
|
provides comparison operators for, it must provide a comparison
|
|
support function, registered in
|
|
<structname>pg_amproc</structname> with support function number 1
|
|
and
|
|
<structfield>amproclefttype</structfield>/<structfield>amprocrighttype</structfield>
|
|
equal to the left and right data types for the comparison (i.e.,
|
|
the same data types that the matching operators are registered
|
|
with in <structname>pg_amop</structname>). The comparison
|
|
function must take two non-null values
|
|
<replaceable>A</replaceable> and <replaceable>B</replaceable> and
|
|
return an <type>int32</type> value that is
|
|
<literal><</literal> <literal>0</literal>,
|
|
<literal>0</literal>, or <literal>></literal>
|
|
<literal>0</literal> when <replaceable>A</replaceable>
|
|
<literal><</literal> <replaceable>B</replaceable>,
|
|
<replaceable>A</replaceable> <literal>=</literal>
|
|
<replaceable>B</replaceable>, or <replaceable>A</replaceable>
|
|
<literal>></literal> <replaceable>B</replaceable>,
|
|
respectively. A null result is disallowed: all values of the
|
|
data type must be comparable. See
|
|
<filename>src/backend/access/nbtree/nbtcompare.c</filename> for
|
|
examples.
|
|
</para>
|
|
|
|
<para>
|
|
If the compared values are of a collatable data type, the
|
|
appropriate collation OID will be passed to the comparison
|
|
support function, using the standard
|
|
<function>PG_GET_COLLATION()</function> mechanism.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><function>sortsupport</function></term>
|
|
<listitem>
|
|
<para>
|
|
Optionally, a btree operator family may provide <firstterm>sort
|
|
support</firstterm> function(s), registered under support
|
|
function number 2. These functions allow implementing
|
|
comparisons for sorting purposes in a more efficient way than
|
|
naively calling the comparison support function. The APIs
|
|
involved in this are defined in
|
|
<filename>src/include/utils/sortsupport.h</filename>.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><function>inrange</function></term>
|
|
<listitem>
|
|
<indexterm>
|
|
<primary>in_range support functions</primary>
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>support functions</primary>
|
|
<secondary>in_range</secondary>
|
|
</indexterm>
|
|
<para>
|
|
Optionally, a btree operator family may provide
|
|
<firstterm>in_range</firstterm> support function(s), registered
|
|
under support function number 3. These are not used during btree
|
|
index operations; rather, they extend the semantics of the
|
|
operator family so that it can support window clauses containing
|
|
the <literal>RANGE</literal> <replaceable>offset</replaceable>
|
|
<literal>PRECEDING</literal> and <literal>RANGE</literal>
|
|
<replaceable>offset</replaceable> <literal>FOLLOWING</literal>
|
|
frame bound types (see <xref
|
|
linkend="syntax-window-functions"/>). Fundamentally, the extra
|
|
information provided is how to add or subtract an
|
|
<replaceable>offset</replaceable> value in a way that is
|
|
compatible with the family's data ordering.
|
|
</para>
|
|
|
|
<para>
|
|
An <function>in_range</function> function must have the signature
|
|
<synopsis>
|
|
in_range(<replaceable>val</replaceable> type1, <replaceable>base</replaceable> type1, <replaceable>offset</replaceable> type2, <replaceable>sub</replaceable> bool, <replaceable>less</replaceable> bool)
|
|
returns bool
|
|
</synopsis>
|
|
<replaceable>val</replaceable> and
|
|
<replaceable>base</replaceable> must be of the same type, which
|
|
is one of the types supported by the operator family (i.e., a
|
|
type for which it provides an ordering). However,
|
|
<replaceable>offset</replaceable> could be of a different type,
|
|
which might be one otherwise unsupported by the family. An
|
|
example is that the built-in <literal>time_ops</literal> family
|
|
provides an <function>in_range</function> function that has
|
|
<replaceable>offset</replaceable> of type <type>interval</type>.
|
|
A family can provide <function>in_range</function> functions for
|
|
any of its supported types and one or more
|
|
<replaceable>offset</replaceable> types. Each
|
|
<function>in_range</function> function should be entered in
|
|
<structname>pg_amproc</structname> with
|
|
<structfield>amproclefttype</structfield> equal to
|
|
<type>type1</type> and <structfield>amprocrighttype</structfield>
|
|
equal to <type>type2</type>.
|
|
</para>
|
|
|
|
<para>
|
|
The essential semantics of an <function>in_range</function>
|
|
function depend on the two Boolean flag parameters. It should
|
|
add or subtract <replaceable>base</replaceable> and
|
|
<replaceable>offset</replaceable>, then compare
|
|
<replaceable>val</replaceable> to the result, as follows:
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
if <literal>!</literal><replaceable>sub</replaceable> and
|
|
<literal>!</literal><replaceable>less</replaceable>, return
|
|
<replaceable>val</replaceable> <literal>>=</literal>
|
|
(<replaceable>base</replaceable> <literal>+</literal>
|
|
<replaceable>offset</replaceable>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
if <literal>!</literal><replaceable>sub</replaceable> and
|
|
<replaceable>less</replaceable>, return
|
|
<replaceable>val</replaceable> <literal><=</literal>
|
|
(<replaceable>base</replaceable> <literal>+</literal>
|
|
<replaceable>offset</replaceable>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
if <replaceable>sub</replaceable> and
|
|
<literal>!</literal><replaceable>less</replaceable>, return
|
|
<replaceable>val</replaceable> <literal>>=</literal>
|
|
(<replaceable>base</replaceable> <literal>-</literal>
|
|
<replaceable>offset</replaceable>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
if <replaceable>sub</replaceable> and
|
|
<replaceable>less</replaceable>, return
|
|
<replaceable>val</replaceable> <literal><=</literal>
|
|
(<replaceable>base</replaceable> <literal>-</literal>
|
|
<replaceable>offset</replaceable>)
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
Before doing so, the function should check the sign of
|
|
<replaceable>offset</replaceable>: if it is less than zero, raise
|
|
error
|
|
<literal>ERRCODE_INVALID_PRECEDING_OR_FOLLOWING_SIZE</literal>
|
|
(22013) with error text like <quote>invalid preceding or
|
|
following size in window function</quote>. (This is required by
|
|
the SQL standard, although nonstandard operator families might
|
|
perhaps choose to ignore this restriction, since there seems to
|
|
be little semantic necessity for it.) This requirement is
|
|
delegated to the <function>in_range</function> function so that
|
|
the core code needn't understand what <quote>less than
|
|
zero</quote> means for a particular data type.
|
|
</para>
|
|
|
|
<para>
|
|
An additional expectation is that <function>in_range</function>
|
|
functions should, if practical, avoid throwing an error if
|
|
<replaceable>base</replaceable> <literal>+</literal>
|
|
<replaceable>offset</replaceable> or
|
|
<replaceable>base</replaceable> <literal>-</literal>
|
|
<replaceable>offset</replaceable> would overflow. The correct
|
|
comparison result can be determined even if that value would be
|
|
out of the data type's range. Note that if the data type
|
|
includes concepts such as <quote>infinity</quote> or
|
|
<quote>NaN</quote>, extra care may be needed to ensure that
|
|
<function>in_range</function>'s results agree with the normal
|
|
sort order of the operator family.
|
|
</para>
|
|
|
|
<para>
|
|
The results of the <function>in_range</function> function must be
|
|
consistent with the sort ordering imposed by the operator family.
|
|
To be precise, given any fixed values of
|
|
<replaceable>offset</replaceable> and
|
|
<replaceable>sub</replaceable>, then:
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
If <function>in_range</function> with
|
|
<replaceable>less</replaceable> = true is true for some
|
|
<replaceable>val1</replaceable> and
|
|
<replaceable>base</replaceable>, it must be true for every
|
|
<replaceable>val2</replaceable> <literal><=</literal>
|
|
<replaceable>val1</replaceable> with the same
|
|
<replaceable>base</replaceable>.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
If <function>in_range</function> with
|
|
<replaceable>less</replaceable> = true is false for some
|
|
<replaceable>val1</replaceable> and
|
|
<replaceable>base</replaceable>, it must be false for every
|
|
<replaceable>val2</replaceable> <literal>>=</literal>
|
|
<replaceable>val1</replaceable> with the same
|
|
<replaceable>base</replaceable>.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
If <function>in_range</function> with
|
|
<replaceable>less</replaceable> = true is true for some
|
|
<replaceable>val</replaceable> and
|
|
<replaceable>base1</replaceable>, it must be true for every
|
|
<replaceable>base2</replaceable> <literal>>=</literal>
|
|
<replaceable>base1</replaceable> with the same
|
|
<replaceable>val</replaceable>.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
If <function>in_range</function> with
|
|
<replaceable>less</replaceable> = true is false for some
|
|
<replaceable>val</replaceable> and
|
|
<replaceable>base1</replaceable>, it must be false for every
|
|
<replaceable>base2</replaceable> <literal><=</literal>
|
|
<replaceable>base1</replaceable> with the same
|
|
<replaceable>val</replaceable>.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
Analogous statements with inverted conditions hold when
|
|
<replaceable>less</replaceable> = false.
|
|
</para>
|
|
|
|
<para>
|
|
If the type being ordered (<type>type1</type>) is collatable, the
|
|
appropriate collation OID will be passed to the
|
|
<function>in_range</function> function, using the standard
|
|
PG_GET_COLLATION() mechanism.
|
|
</para>
|
|
|
|
<para>
|
|
<function>in_range</function> functions need not handle NULL
|
|
inputs, and typically will be marked strict.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><function>equalimage</function></term>
|
|
<listitem>
|
|
<para>
|
|
Optionally, a btree operator family may provide
|
|
<function>equalimage</function> (<quote>equality implies image
|
|
equality</quote>) support functions, registered under support
|
|
function number 4. These functions allow the core code to
|
|
determine when it is safe to apply the btree deduplication
|
|
optimization. Currently, <function>equalimage</function>
|
|
functions are only called when building or rebuilding an index.
|
|
</para>
|
|
<para>
|
|
An <function>equalimage</function> function must have the
|
|
signature
|
|
<synopsis>
|
|
equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool
|
|
</synopsis>
|
|
The return value is static information about an operator class
|
|
and collation. Returning <literal>true</literal> indicates that
|
|
the <function>order</function> function for the operator class is
|
|
guaranteed to only return <literal>0</literal> (<quote>arguments
|
|
are equal</quote>) when its <replaceable>A</replaceable> and
|
|
<replaceable>B</replaceable> arguments are also interchangeable
|
|
without any loss of semantic information. Not registering an
|
|
<function>equalimage</function> function or returning
|
|
<literal>false</literal> indicates that this condition cannot be
|
|
assumed to hold.
|
|
</para>
|
|
<para>
|
|
The <replaceable>opcintype</replaceable> argument is the
|
|
<literal><structname>pg_type</structname>.oid</literal> of the
|
|
data type that the operator class indexes. This is a convenience
|
|
that allows reuse of the same underlying
|
|
<function>equalimage</function> function across operator classes.
|
|
If <replaceable>opcintype</replaceable> is a collatable data
|
|
type, the appropriate collation OID will be passed to the
|
|
<function>equalimage</function> function, using the standard
|
|
<function>PG_GET_COLLATION()</function> mechanism.
|
|
</para>
|
|
<para>
|
|
As far as the operator class is concerned, returning
|
|
<literal>true</literal> indicates that deduplication is safe (or
|
|
safe for the collation whose OID was passed to its
|
|
<function>equalimage</function> function). However, the core
|
|
code will only deem deduplication safe for an index when
|
|
<emphasis>every</emphasis> indexed column uses an operator class
|
|
that registers an <function>equalimage</function> function, and
|
|
each function actually returns <literal>true</literal> when
|
|
called.
|
|
</para>
|
|
<para>
|
|
Image equality is <emphasis>almost</emphasis> the same condition
|
|
as simple bitwise equality. There is one subtle difference: When
|
|
indexing a varlena data type, the on-disk representation of two
|
|
image equal datums may not be bitwise equal due to inconsistent
|
|
application of <acronym>TOAST</acronym> compression on input.
|
|
Formally, when an operator class's
|
|
<function>equalimage</function> function returns
|
|
<literal>true</literal>, it is safe to assume that the
|
|
<literal>datum_image_eq()</literal> C function will always agree
|
|
with the operator class's <function>order</function> function
|
|
(provided that the same collation OID is passed to both the
|
|
<function>equalimage</function> and <function>order</function>
|
|
functions).
|
|
</para>
|
|
<para>
|
|
The core code is fundamentally unable to deduce anything about
|
|
the <quote>equality implies image equality</quote> status of an
|
|
operator class within a multiple-data-type family based on
|
|
details from other operator classes in the same family. Also, it
|
|
is not sensible for an operator family to register a cross-type
|
|
<function>equalimage</function> function, and attempting to do so
|
|
will result in an error. This is because <quote>equality implies
|
|
image equality</quote> status does not just depend on
|
|
sorting/equality semantics, which are more or less defined at the
|
|
operator family level. In general, the semantics that one
|
|
particular data type implements must be considered separately.
|
|
</para>
|
|
<para>
|
|
The convention followed by the operator classes included with the
|
|
core <productname>PostgreSQL</productname> distribution is to
|
|
register a stock, generic <function>equalimage</function>
|
|
function. Most operator classes register
|
|
<function>btequalimage()</function>, which indicates that
|
|
deduplication is safe unconditionally. Operator classes for
|
|
collatable data types such as <type>text</type> register
|
|
<function>btvarstrequalimage()</function>, which indicates that
|
|
deduplication is safe with deterministic collations. Best
|
|
practice for third-party extensions is to register their own
|
|
custom function to retain control.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="btree-implementation">
|
|
<title>Implementation</title>
|
|
|
|
<para>
|
|
This section covers B-Tree index implementation details that may be
|
|
of use to advanced users. See
|
|
<filename>src/backend/access/nbtree/README</filename> in the source
|
|
distribution for a much more detailed, internals-focused description
|
|
of the B-Tree implementation.
|
|
</para>
|
|
<sect2 id="btree-structure">
|
|
<title>B-Tree Structure</title>
|
|
<para>
|
|
<productname>PostgreSQL</productname> B-Tree indexes are
|
|
multi-level tree structures, where each level of the tree can be
|
|
used as a doubly-linked list of pages. A single metapage is stored
|
|
in a fixed position at the start of the first segment file of the
|
|
index. All other pages are either leaf pages or internal pages.
|
|
Leaf pages are the pages on the lowest level of the tree. All
|
|
other levels consist of internal pages. Each leaf page contains
|
|
tuples that point to table rows. Each internal page contains
|
|
tuples that point to the next level down in the tree. Typically,
|
|
over 99% of all pages are leaf pages. Both internal pages and leaf
|
|
pages use the standard page format described in <xref
|
|
linkend="storage-page-layout"/>.
|
|
</para>
|
|
<para>
|
|
New leaf pages are added to a B-Tree index when an existing leaf
|
|
page cannot fit an incoming tuple. A <firstterm>page
|
|
split</firstterm> operation makes room for items that originally
|
|
belonged on the overflowing page by moving a portion of the items
|
|
to a new page. Page splits must also insert a new
|
|
<firstterm>downlink</firstterm> to the new page in the parent page,
|
|
which may cause the parent to split in turn. Page splits
|
|
<quote>cascade upwards</quote> in a recursive fashion. When the
|
|
root page finally cannot fit a new downlink, a <firstterm>root page
|
|
split</firstterm> operation takes place. This adds a new level to
|
|
the tree structure by creating a new root page that is one level
|
|
above the original root page.
|
|
</para>
|
|
</sect2>
|
|
|
|
<sect2 id="btree-deduplication">
|
|
<title>Deduplication</title>
|
|
<para>
|
|
A duplicate is a leaf page tuple (a tuple that points to a table
|
|
row) where <emphasis>all</emphasis> indexed key columns have values
|
|
that match corresponding column values from at least one other leaf
|
|
page tuple that's close by in the same index. Duplicate tuples are
|
|
quite common in practice. B-Tree indexes can use a special,
|
|
space-efficient representation for duplicates when an optional
|
|
technique is enabled: <firstterm>deduplication</firstterm>.
|
|
</para>
|
|
<para>
|
|
Deduplication works by periodically merging groups of duplicate
|
|
tuples together, forming a single posting list tuple for each
|
|
group. The column key value(s) only appear once in this
|
|
representation. This is followed by a sorted array of
|
|
<acronym>TID</acronym>s that point to rows in the table. This
|
|
significantly reduces the storage size of indexes where each value
|
|
(or each distinct combination of column values) appears several
|
|
times on average. The latency of queries can be reduced
|
|
significantly. Overall query throughput may increase
|
|
significantly. The overhead of routine index vacuuming may also be
|
|
reduced significantly.
|
|
</para>
|
|
<note>
|
|
<para>
|
|
While NULL is generally not considered to be equal to any other
|
|
value, including NULL, NULL is nevertheless treated as just
|
|
another value from the domain of indexed values by the B-Tree
|
|
implementation (except when enforcing uniqueness in a unique
|
|
index). B-Tree deduplication is therefore just as effective with
|
|
<quote>duplicates</quote> that contain a NULL value.
|
|
</para>
|
|
</note>
|
|
<para>
|
|
The deduplication process occurs lazily, when a new item is
|
|
inserted that cannot fit on an existing leaf page. This prevents
|
|
(or at least delays) leaf page splits. Unlike GIN posting list
|
|
tuples, B-Tree posting list tuples do not need to expand every time
|
|
a new duplicate is inserted; they are merely an alternative
|
|
physical representation of the original logical contents of the
|
|
leaf page. This design prioritizes consistent performance with
|
|
mixed read-write workloads. Most client applications will at least
|
|
see a moderate performance benefit from using deduplication.
|
|
Deduplication is enabled by default.
|
|
</para>
|
|
<para>
|
|
Write-heavy workloads that don't benefit from deduplication due to
|
|
having few or no duplicate values in indexes will incur a small,
|
|
fixed performance penalty (unless deduplication is explicitly
|
|
disabled). The <literal>deduplicate_items</literal> storage
|
|
parameter can be used to disable deduplication within individual
|
|
indexes. There is never any performance penalty with read-only
|
|
workloads, since reading posting list tuples is at least as
|
|
efficient as reading the standard tuple representation. Disabling
|
|
deduplication isn't usually helpful.
|
|
</para>
|
|
<para>
|
|
B-Tree indexes are not directly aware that under MVCC, there might
|
|
be multiple extant versions of the same logical table row; to an
|
|
index, each tuple is an independent object that needs its own index
|
|
entry. Thus, an update of a row always creates all-new index
|
|
entries for the row, even if the key values did not change. Some
|
|
workloads suffer from index bloat caused by these
|
|
implementation-level version duplicates (this is typically a
|
|
problem for <command>UPDATE</command>-heavy workloads that cannot
|
|
apply the <acronym>HOT</acronym> optimization due to modifying at
|
|
least one indexed column). B-Tree deduplication does not
|
|
distinguish between these implementation-level version duplicates
|
|
and conventional duplicates. Deduplication can nevertheless help
|
|
with controlling index bloat caused by implementation-level version
|
|
churn.
|
|
</para>
|
|
<tip>
|
|
<para>
|
|
A special heuristic is applied to determine whether a
|
|
deduplication pass in a unique index should take place. It can
|
|
often skip straight to splitting a leaf page, avoiding a
|
|
performance penalty from wasting cycles on unhelpful deduplication
|
|
passes. If you're concerned about the overhead of deduplication,
|
|
consider setting <literal>deduplicate_items = off</literal>
|
|
selectively. Leaving deduplication enabled in unique indexes has
|
|
little downside.
|
|
</para>
|
|
</tip>
|
|
<para>
|
|
Deduplication cannot be used in all cases due to
|
|
implementation-level restrictions. Deduplication safety is
|
|
determined when <command>CREATE INDEX</command> or
|
|
<command>REINDEX</command> run.
|
|
</para>
|
|
<para>
|
|
Note that deduplication is deemed unsafe and cannot be used in the
|
|
following cases involving semantically significant differences
|
|
among equal datums:
|
|
</para>
|
|
<para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
<type>text</type>, <type>varchar</type>, and <type>char</type>
|
|
cannot use deduplication when a
|
|
<emphasis>nondeterministic</emphasis> collation is used. Case
|
|
and accent differences must be preserved among equal datums.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<type>numeric</type> cannot use deduplication. Numeric display
|
|
scale must be preserved among equal datums.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<type>jsonb</type> cannot use deduplication, since the
|
|
<type>jsonb</type> B-Tree operator class uses
|
|
<type>numeric</type> internally.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<type>float4</type> and <type>float8</type> cannot use
|
|
deduplication. These types have distinct representations for
|
|
<literal>-0</literal> and <literal>0</literal>, which are
|
|
nevertheless considered equal. This difference must be
|
|
preserved.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
There is one further implementation-level restriction that may be
|
|
lifted in a future version of
|
|
<productname>PostgreSQL</productname>:
|
|
</para>
|
|
<para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Container types (such as composite types, arrays, or range
|
|
types) cannot use deduplication.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
There is one further implementation-level restriction that applies
|
|
regardless of the operator class or collation used:
|
|
</para>
|
|
<para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
<literal>INCLUDE</literal> indexes can never use deduplication.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
</sect2>
|
|
</sect1>
|
|
|
|
</chapter>
|