src/backend/access/nbtree/README

Btree Indexing
==============

This directory contains a correct implementation of Lehman and Yao's
high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
on Database Systems, Vol 6, No. 4, December 1981, pp 650-670). We also
use a simplified version of the deletion logic described in Lanin and
Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm,
Proceedings of 1986 Fall Joint Computer Conference, pp 380-389).

The basic Lehman & Yao Algorithm
--------------------------------

Compared to a classic B-tree, L&Y adds a right-link pointer to each page,
to the page's right sibling. It also adds a "high key" to each page, which
is an upper bound on the keys that are allowed on that page. These two
additions make it possible to detect a concurrent page split, which allows the
tree to be searched without holding any read locks (except to keep a single
page from being modified while reading it).
When a search follows a downlink to a child page, it compares the page's
high key with the search key. If the search key is greater than the high
key, the page must've been split concurrently, and you must follow the
right-link to find the new page containing the key range you're looking
for. This might need to be repeated, if the page has been split more than
once.
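
To make this concrete, here is a minimal sketch of the move-right step in
C.  The page type and the locking/comparison helpers are hypothetical
stand-ins for illustration, not the actual nbtree code:

    #include <stdbool.h>

    typedef struct BTPage BTPage;
    struct BTPage
    {
        bool    is_rightmost;  /* no right sibling; implicit +inf high key */
        BTPage *right_link;    /* L&Y right-sibling pointer */
        /* ... high key, data items, etc ... */
    };

    /* hypothetical locking and comparison helpers */
    extern void read_lock(BTPage *page);
    extern void unlock(BTPage *page);
    extern const void *high_key(const BTPage *page);
    extern int key_cmp(const void *key, const void *hikey);

    /* Return a read-locked page whose key range covers search_key. */
    static BTPage *
    move_right(BTPage *page, const void *search_key)
    {
        read_lock(page);
        while (!page->is_rightmost &&
               key_cmp(search_key, high_key(page)) > 0)
        {
            /* A concurrent split moved our key range; chase the right-link. */
            BTPage *next = page->right_link;

            unlock(page);
            page = next;
            read_lock(page);
        }
        return page;
    }
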
Lehman and Yao talk about alternating "separator" keys and downlinks in
internal pages rather than tuples or records. We use the term "pivot"
tuple to refer to tuples that don't point to heap tuples and are used
only for tree navigation. All tuples on non-leaf pages and high keys on
leaf pages are pivot tuples. Since pivot tuples are only used to represent
which part of the key space belongs on each page, they can have attribute
values copied from non-pivot tuples that were deleted and killed by VACUUM
some time ago. A pivot tuple may contain a "separator" key and downlink,
just a separator key (i.e. the downlink value is implicitly undefined), or
just a downlink (i.e. all attributes are truncated away).
The requirement that all btree keys be unique is satisfied by treating heap
TID as a tiebreaker attribute. Logical duplicates are sorted in heap TID
order. This is necessary because Lehman and Yao also require that the key
range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
the adjacent keys in the parent page (Ki must be _strictly_ less than v,
which is assured by having reliably unique keys). Keys are always unique
on their level, with the exception of a leaf page's high key, which can be
fully equal to the last item on the page.
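
A minimal sketch of the resulting comparison rule, using hypothetical
fixed-size types (the real code compares a variable number of attributes
of arbitrary types):

    #include <stdint.h>

    typedef struct
    {
        int32_t  key;       /* the user-visible key attribute(s) */
        uint64_t heap_tid;  /* tiebreaker: heap block number + offset */
    } BTKey;

    /* 3-way comparison: logical duplicates sort in heap TID order,
     * so every key is reliably unique on its level. */
    static int
    btkey_cmp(const BTKey *a, const BTKey *b)
    {
        if (a->key != b->key)
            return (a->key < b->key) ? -1 : 1;
        if (a->heap_tid != b->heap_tid)
            return (a->heap_tid < b->heap_tid) ? -1 : 1;
        return 0;
    }
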
The Postgres implementation of suffix truncation must make sure that the
Lehman and Yao invariants hold, and treats absent/truncated attributes in
pivot tuples as having the sentinel value "minus infinity".  The
later section on suffix truncation will be helpful if it's unclear how the
Lehman & Yao invariants work with a real world example.

Differences to the Lehman & Yao algorithm
-----------------------------------------

We have made the following changes in order to incorporate the L&Y algorithm
into Postgres:
Lehman and Yao don't require read locks, but assume that in-memory
copies of tree pages are unshared. Postgres shares in-memory buffers
among backends. As a result, we do page-level read locking on btree
pages in order to guarantee that no record is modified while we are
examining it. This reduces concurrency but guarantees correct
behavior.
We support the notion of an ordered "scan" of an index as well as
insertions, deletions, and simple lookups. A scan in the forward
direction is no problem: we just use the right-sibling pointers that
L&Y require anyway. (Thus, once we have descended the tree to the
correct start point for the scan, the scan looks only at leaf pages
and never at higher tree levels.) To support scans in the backward
direction, we also store a "left sibling" link much like the "right
sibling". (This adds an extra step to the L&Y split algorithm: while
holding the write lock on the page being split, we also lock its former
right sibling to update that page's left-link. This is safe since no
writer of that page can be interested in acquiring a write lock on our
page.) A backwards scan has one additional bit of complexity: after
following the left-link we must account for the possibility that the
left sibling page got split before we could read it. So, we have to
move right until we find a page whose right-link matches the page we
came from. (Actually, it's even harder than that; see deletion discussion
below.)
Page read locks are held only for as long as a scan is examining a page.
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage. The heap tuple IDs are then processed while
not holding any page lock within the index. We do continue to hold a pin
on the leaf page in some circumstances, to protect against concurrent
deletions (see below). In this state the scan is effectively stopped
"between" pages, either before or after the page it has pinned. This is
safe in the presence of concurrent insertions and even page splits, because
items are never moved across pre-existing page boundaries --- so the scan
cannot miss any items it should have seen, nor accidentally return the same
item twice. The scan must remember the page's right-link at the time it
was scanned, since that is the page to move right to; if we move right to
the current right-link then we'd re-scan any items moved by a page split.
We don't similarly remember the left-link, since it's best to use the most
up-to-date left-link when trying to move left (see detailed move-left
algorithm below).
In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
it is necessary to lock the next page before releasing the current one.
This is safe when moving right or up, but not when moving left or down
(else we'd create the possibility of deadlocks).
Lehman and Yao fail to discuss what must happen when the root page
becomes full and must be split. Our implementation is to split the
root in the same way that any other page would be split, then construct
a new root page holding pointers to both of the resulting pages (which
now become siblings on the next level of the tree). The new root page
is then installed by altering the root pointer in the meta-data page (see
below). This works because the root is not treated specially in any
other way --- in particular, searches will move right using its link
pointer if the link is set. Therefore, searches will find the data
that's been moved into the right sibling even if they read the meta-data
page before it got updated. This is the same reasoning that makes a
split of a non-root page safe. The locking considerations are similar too.
When an inserter recurses up the tree, splitting internal pages to insert
links to pages inserted on the level below, it is possible that it will
need to access a page above the level that was the root when it began its
descent (or more accurately, the level that was the root when it read the
meta-data page). In this case the stack it made while descending does not
help for finding the correct page. When this happens, we find the correct
place by re-descending the tree until we reach the level one above the
level we need to insert a link to, and then moving right as necessary.
(Typically this will take only two fetches, the meta-data page and the new
root, but in principle there could have been more than one root split
since we saw the root. We can identify the correct tree level by means of
the level numbers stored in each page. The situation is rare enough that
we do not need a more efficient solution.)
Lehman and Yao must couple/chain locks as part of moving right when
relocating a child page's downlink during an ascent of the tree. This is
the only point where Lehman and Yao have to simultaneously hold three
locks (a lock on the child, the original parent, and the original parent's
right sibling). We don't need to couple internal page locks for pages on
the same level, though. We match a child's block number to a downlink
from a pivot tuple one level up, whereas Lehman and Yao match on the
separator key associated with the downlink that was followed during the
initial descent. We can release the lock on the original parent page
before acquiring a lock on its right sibling, since there is never any
need to deal with the case where the separator key that we must relocate
becomes the original parent's high key. Lanin and Shasha don't couple
locks here either, though they also don't couple locks between levels
during ascents. They are willing to "wait and try again" to avoid races.
Their algorithm is optimistic, which means that "an insertion holds no
more than one write lock at a time during its ascent". We more or less
stick with Lehman and Yao's approach of conservatively coupling parent and
child locks when ascending the tree, since it's far simpler.
Lehman and Yao assume fixed-size keys, but we must deal with
variable-size keys. Therefore there is not a fixed maximum number of
keys per page; we just stuff in as many as will fit. When we split a
page, we try to equalize the number of bytes, not items, assigned to
pages (though suffix truncation is also considered). Note we must include
the incoming item in this calculation, otherwise it is possible to find
that the incoming item doesn't fit on the split page where it needs to go!
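
The byte-balancing idea can be sketched as follows; this hypothetical
helper ignores the suffix-truncation and fillfactor considerations that
the real split-point logic also weighs:

    #include <stddef.h>
    #include <stdint.h>

    /*
     * item_sizes[] lists the byte size of every item, with the incoming
     * tuple already counted at its would-be position (nitems includes it).
     * Returns how many items to leave on the left half.
     */
    static int
    choose_split_offset(const size_t *item_sizes, int nitems)
    {
        size_t total = 0, left = 0, best_delta = SIZE_MAX;
        int    best_off = 1;

        for (int i = 0; i < nitems; i++)
            total += item_sizes[i];

        for (int off = 1; off < nitems; off++)  /* >= 1 item per side */
        {
            size_t right, delta;

            left += item_sizes[off - 1];
            right = total - left;
            delta = (left > right) ? left - right : right - left;
            if (delta < best_delta)
            {
                best_delta = delta;
                best_off = off;
            }
        }
        return best_off;
    }
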

The Deletion Algorithm
----------------------

Before deleting a leaf item, we get a super-exclusive lock on the target
page, so that no other backend has a pin on the page when the deletion
starts. This is not necessary for correctness in terms of the btree index
operations themselves; as explained above, index scans logically stop
"between" pages and so can't lose their place. The reason we do it is to
provide an interlock between non-full VACUUM and indexscans. Since VACUUM
deletes index entries before reclaiming heap tuple line pointers, the
super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
line pointer that an indexscanning process might be about to visit. This
guarantee works only for simple indexscans that visit the heap in sync
with the index scan, not for bitmap scans. We only need the guarantee
when using non-MVCC snapshot rules; when using an MVCC snapshot, it
doesn't matter if the heap tuple is replaced with an unrelated tuple at
the same TID, because the new tuple won't be visible to our scan anyway.
Therefore, a scan using an MVCC snapshot which has no other confounding
factors will not hold the pin after the page contents are read. The
current reasons for exceptions, where a pin is still needed, are if the
index is not WAL-logged or if the scan is an index-only scan. If later
work allows the pin to be dropped for all cases we will be able to
simplify the vacuum code, since the concept of a super-exclusive lock
for btree indexes will no longer be needed.
Because a pin is not always held, and a page can be split even while
someone does hold a pin on it, it is possible that an indexscan will
return items that are no longer stored on the page it has a pin on, but
rather somewhere to the right of that page. To ensure that VACUUM can't
prematurely remove such heap tuples, we require btbulkdelete to obtain a
super-exclusive lock on every leaf page in the index, even pages that
don't contain any deletable tuples. Any scan which could yield incorrect
results if the tuple at a TID matching the scan's range and filter
conditions were replaced by a different tuple while the scan is in
progress must hold the pin on each index page until all index entries read
from the page have been processed. This guarantees that the btbulkdelete
call cannot return while any indexscan is still holding a copy of a
deleted index tuple if the scan could be confused by that. Note that this
requirement does not say that btbulkdelete must visit the pages in any
particular order. (See also on-the-fly deletion, below.)
There is no such interlocking for deletion of items in internal pages,
since backends keep no lock nor pin on a page they have descended past.
Hence, when a backend is ascending the tree using its stack, it must
be prepared for the possibility that the item it wants is to the left of
the recorded position (but it can't have moved left out of the recorded
page). Since we hold a lock on the lower page (per L&Y) until we have
re-found the parent item that links to it, we can be assured that the
parent item does still exist and can't have been deleted.

Page Deletion
-------------

We consider deleting an entire page from the btree only when it's become
completely empty of items. (Merging partly-full pages would allow better
space reuse, but it seems impractical to move existing data items left or
right to make this happen --- a scan moving in the opposite direction
might miss the items if so.) Also, we *never* delete the rightmost page
on a tree level (this restriction simplifies the traversal algorithms, as
explained below). Page deletion always begins from an empty leaf page. An
internal page can only be deleted as part of a branch leading to a leaf
page, where each internal page has only one child and that child is also to
be deleted.
Deleting a leaf page is a two-stage process. In the first stage, the page
is unlinked from its parent, and marked as half-dead. The parent page must
be found using the same type of search as used to find the parent during an
insertion split. We lock the target and the parent pages, change the
target's downlink to point to the right sibling, and remove its old
downlink. This causes the target page's key space to effectively belong to
its right sibling. (Neither the left nor right sibling pages need to
change their "high key" if any; so there is no problem with possibly not
having enough space to replace a high key.) At the same time, we mark the
target page as half-dead, which causes any subsequent searches to ignore it
and move right (or left, in a backwards scan). This leaves the tree in a
similar state as during a page split: the page has no downlink pointing to
it, but it's still linked to its siblings.
(Note: Lanin and Shasha prefer to make the key space move left, but their
argument for doing so hinges on not having left-links, which we have
anyway. So we simplify the algorithm by moving the key space right.)
To preserve consistency on the parent level, we cannot merge the key space
of a page into its right sibling unless the right sibling is a child of
the same parent --- otherwise, the parent's key space assignment changes
too, meaning we'd have to make bounding-key updates in its parent, and
perhaps all the way up the tree. Since we can't possibly do that
atomically, we forbid this case. That means that the rightmost child of a
parent node can't be deleted unless it's the only remaining child, in which
case we will delete the parent too (see below).
In the second stage, the half-dead leaf page is unlinked from its siblings.
We first lock the left sibling (if any) of the target, the target page
itself, and its right sibling (there must be one) in that order. Then we
update the side-links in the siblings, and mark the target page deleted.
When we're about to delete the last remaining child of a parent page, things
are slightly more complicated. In the first stage, we leave the immediate
parent of the leaf page alone, and remove the downlink to the parent page
instead, from the grandparent. If it's the last child of the grandparent
too, we recurse up until we find a parent with more than one child, and
remove the downlink of that page. The leaf page is marked as half-dead, and
the block number of the page whose downlink was removed is stashed in the
half-dead leaf page. This leaves us with a chain of internal pages, with
one downlink each, leading to the half-dead leaf page, and no downlink
pointing to the topmost page in the chain.
While we recurse up to find the topmost parent in the chain, we keep the
leaf page locked, but don't need to hold locks on the intermediate pages
between the leaf and the topmost parent -- insertions into upper tree levels
happen only as a result of splits of child pages, and that can't happen as
long as we're keeping the leaf locked. The internal pages in the chain
cannot acquire new children afterwards either, because the leaf page is
marked as half-dead and won't be split.
Removing the downlink to the top of the to-be-deleted subtree/chain
effectively transfers the key space to the right sibling for all the
intermediate levels too, in one atomic operation. A concurrent search might
still visit the intermediate pages, but it will move right when it reaches
the half-dead page at the leaf level. In particular, the search will move to
the subtree to the right of the half-dead leaf page/to-be-deleted subtree,
since the half-dead leaf page's right sibling must be a "cousin" page, not a
"true" sibling page (or a second cousin page when the to-be-deleted chain
starts at the leaf page's grandparent page, and so on).
In the second stage, the topmost page in the chain is unlinked from its
siblings, and the half-dead leaf page is updated to point to the next page
down in the chain. This is repeated until there are no internal pages left
in the chain. Finally, the half-dead leaf page itself is unlinked from its
siblings.
A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the
parent, or scans moving right or left from one of the siblings). These
processes must observe that the page is marked dead and recover
accordingly. Searches and forward scans simply follow the right-link
until they find a non-dead page --- this will be where the deleted page's
key-space moved to.
Moving left in a backward scan is complicated because we must consider
the possibility that the left sibling was just split (meaning we must find
the rightmost page derived from the left sibling), plus the possibility
that the page we were just on has now been deleted and hence isn't in the
sibling chain at all anymore. So the move-left algorithm becomes:
0. Remember the page we are on as the "original page".
1. Follow the original page's left-link (we're done if this is zero).
2. If the current page is live and its right-link matches the "original
page", we are done.
3. Otherwise, move right one or more times looking for a live page whose
right-link matches the "original page". If found, we are done. (In
principle we could scan all the way to the right end of the index, but
in practice it seems better to give up after a small number of tries.
It's unlikely the original page's sibling split more than a few times
while we were in flight to it; if we do not find a matching link in a
few tries, then most likely the original page is deleted.)
4. Return to the "original page". If it is still live, return to step 1
(we guessed wrong about it being deleted, and should restart with its
current left-link). If it is dead, move right until a non-dead page
is found (there must be one, since rightmost pages are never deleted),
mark that as the new "original page", and return to step 1.
This algorithm is correct because the live page found by step 4 will have
the same left keyspace boundary as the page we started from. Therefore,
when we ultimately exit, it must be on a page whose right keyspace
boundary matches the left boundary of where we started --- which is what
we need to be sure we don't miss or re-scan any items.
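
The loop below sketches the move-left algorithm in C; the page structure
and read_page() helper are hypothetical, and locking and pinning are
elided:

    #include <stdbool.h>
    #include <stddef.h>

    typedef unsigned int BlockNumber;

    typedef struct BTPage
    {
        bool        is_dead;
        BlockNumber blkno;
        BlockNumber left_link;   /* zero plays "no sibling" here */
        BlockNumber right_link;
    } BTPage;

    extern BTPage *read_page(BlockNumber blkno);  /* hypothetical */

    #define MAX_TRIES 4

    /* Return the left neighbor of "orig", or NULL at the left end. */
    static BTPage *
    step_left(BTPage *orig)
    {
        for (;;)
        {
            /* Step 1: follow the original page's current left-link. */
            if (orig->left_link == 0)
                return NULL;

            BTPage *page = read_page(orig->left_link);

            /* Steps 2-3: move right a bounded number of times, looking
             * for a live page whose right-link points back at us. */
            for (int tries = 0; tries < MAX_TRIES; tries++)
            {
                if (!page->is_dead && page->right_link == orig->blkno)
                    return page;
                page = read_page(page->right_link);
            }

            /* Step 4: return to the original page and recheck it. */
            orig = read_page(orig->blkno);
            if (!orig->is_dead)
                continue;       /* guessed wrong; retry with its
                                 * current left-link */
            while (orig->is_dead)
                orig = read_page(orig->right_link);  /* new original page */
        }
    }
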
A deleted page can only be reclaimed once there is no scan or search that
has a reference to it; until then, it must stay in place with its
right-link undisturbed. We implement this by waiting until all active
snapshots and registered snapshots as of the deletion are gone, which is
overly strong, but is simple to implement within Postgres. When marked
dead, a deleted page is labeled with the next-transaction counter value.
VACUUM can reclaim the page for re-use when this transaction number is
older than RecentGlobalXmin. As collateral damage, this implementation
also waits for running XIDs with no snapshots and for snapshots taken
until the next transaction to allocate an XID commits.
Reclaiming a page doesn't actually change its state on disk --- we simply
record it in the shared-memory free space map, from which it will be
handed out the next time a new page is needed for a page split. The
deleted page's contents will be overwritten by the split operation.
(Note: if we find a deleted page with an extremely old transaction
number, it'd be worthwhile to re-mark it with FrozenTransactionId so that
a later xid wraparound can't cause us to think the page is unreclaimable.
But in more normal situations this would be a waste of a disk write.)
Because we never delete the rightmost page of any level (and in particular
never delete the root), it's impossible for the height of the tree to
decrease. After massive deletions we might have a scenario in which the
tree is "skinny", with several single-page levels below the root.
Operations will still be correct in this case, but we'd waste cycles
descending through the single-page levels. To handle this we use an idea
from Lanin and Shasha: we keep track of the "fast root" level, which is
the lowest single-page level. The meta-data page keeps a pointer to this
level as well as the true root. All ordinary operations initiate their
searches at the fast root not the true root. When we split a page that is
alone on its level or delete the next-to-last page on a level (both cases
are easily detected), we have to make sure that the fast root pointer is
adjusted appropriately. In the split case, we do this work as part of the
atomic update for the insertion into the parent level; in the delete case
as part of the atomic update for the delete (either way, the metapage has
to be the last page locked in the update to avoid deadlock risks). This
avoids race conditions if two such operations are executing concurrently.
VACUUM needs to do a linear scan of an index to search for deleted pages
that can be reclaimed because they are older than all open transactions.
For efficiency's sake, we'd like to use the same linear scan to search for
deletable tuples. Before Postgres 8.2, btbulkdelete scanned the leaf pages
in index order, but it is possible to visit them in physical order instead.
The tricky part of this is to avoid missing any deletable tuples in the
presence of concurrent page splits: a page split could easily move some
tuples from a page not yet passed over by the sequential scan to a
lower-numbered page already passed over. (This wasn't a concern for the
index-order scan, because splits always split right.) To implement this,
we provide a "vacuum cycle ID" mechanism that makes it possible to
determine whether a page has been split since the current btbulkdelete
cycle started. If btbulkdelete finds a page that has been split since
it started, and has a right-link pointing to a lower page number, then
it temporarily suspends its sequential scan and visits that page instead.
It must continue to follow right-links and vacuum dead tuples until
reaching a page that either hasn't been split since btbulkdelete started,
or is above the location of the outer sequential scan. Then it can resume
the sequential scan. This ensures that all tuples are visited. It may be
that some tuples are visited twice, but that has no worse effect than an
inaccurate index tuple count (and we can't guarantee an accurate count
anyway in the face of concurrent activity). Note that this still works
if the has-been-recently-split test has a small probability of false
positives, so long as it never gives a false negative. This makes it
possible to implement the test with a small counter value stored on each
index page.
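
A simplified sketch of that test; the field names are modeled on the
btpo_next and btpo_cycleid fields kept in nbtree's page opaque data, but
the surrounding types are stand-ins:

    #include <stdbool.h>
    #include <stdint.h>

    typedef unsigned int BlockNumber;

    typedef struct
    {
        BlockNumber btpo_next;     /* right-link */
        uint16_t    btpo_cycleid;  /* vacuum cycle ID of latest split */
    } BTOpaque;

    /* Must the sequential VACUUM scan detour to follow this right-link? */
    static bool
    must_follow_right(const BTOpaque *opaque,
                      uint16_t this_cycleid,   /* current btbulkdelete */
                      BlockNumber this_blkno)  /* page being scanned */
    {
        return opaque->btpo_cycleid == this_cycleid && /* split this cycle */
               opaque->btpo_next < this_blkno; /* right half landed behind */
    }
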

Fastpath For Index Insertion
----------------------------

We optimize for the common case of insertion of increasing index key
values by caching the page into which this backend last inserted a value,
if that page was the rightmost leaf page.  For the next insert, we
can then quickly check if the cached page is still the rightmost leaf
page and also the correct place to hold the current value. We can avoid
the cost of walking down the tree in such common cases.
The optimization works on the assumption that there can only be one
non-ignorable leaf rightmost page, and so even a RecentGlobalXmin style
interlock isn't required. We cannot fail to detect that our hint was
invalidated, because there can only be one such page in the B-Tree at
any time. It's possible that the page will be deleted and recycled
without a backend's cached page also being detected as invalidated, but
only when we happen to recycle a block that once again gets used as the
rightmost leaf page.
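
A sketch of the fast path check, with hypothetical page helpers standing
in for the real code (which also has to handle pins, buffer locks, and
concurrent page deletion):

    #include <stdbool.h>

    typedef unsigned int BlockNumber;
    #define INVALID_BLOCK 0

    typedef struct BTPage BTPage;   /* opaque for this sketch */

    /* hypothetical helpers */
    extern BTPage *write_lock_page(BlockNumber blkno);
    extern void    unlock_page(BTPage *page);
    extern bool    is_rightmost_leaf(const BTPage *page);
    extern bool    key_fits_as_last_item(const BTPage *page, const void *key);
    extern void    insert_last(BTPage *page, const void *key);

    static BlockNumber cached_rightmost = INVALID_BLOCK;  /* per backend */

    static bool
    try_fastpath_insert(const void *key)
    {
        if (cached_rightmost == INVALID_BLOCK)
            return false;

        BTPage *page = write_lock_page(cached_rightmost);

        /* Re-verify the hint while holding the lock: the page must still
         * be the rightmost leaf, and the key must sort after everything
         * on it (and fit).  Otherwise fall back to a full descent. */
        if (is_rightmost_leaf(page) && key_fits_as_last_item(page, key))
        {
            insert_last(page, key);
            unlock_page(page);
            return true;
        }
        unlock_page(page);
        cached_rightmost = INVALID_BLOCK;   /* hint no longer valid */
        return false;
    }
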

On-the-Fly Deletion Of Index Tuples
-----------------------------------

If a process visits a heap tuple and finds that it's dead and removable
(ie, dead to all open transactions, not only that process), then we can
return to the index and mark the corresponding index entry "known dead",
allowing subsequent index scans to skip visiting the heap tuple. The
"known dead" marking works by setting the index item's lp_flags state
to LP_DEAD. This is currently only done in plain indexscans, not bitmap
scans, because only plain scans visit the heap and index "in sync" and so
there's not a convenient way to do it for bitmap scans.
Once an index tuple has been marked LP_DEAD it can actually be removed
from the index immediately; since index scans only stop "between" pages,
no scan can lose its place from such a deletion. We separate the steps
because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
hence have exclusive lock on it already). Deduplication can also prevent
a page split, but removing LP_DEAD tuples is the preferred approach.
(Note that posting list tuples can only have their LP_DEAD bit set when
every table TID within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
implementation of VACUUM, but it could be a problem for anything that
explicitly tries to find index entries for dead tuples. (However, the
same situation is created by REINDEX, since it doesn't enter dead
tuples into the index.)
It's sufficient to have an exclusive lock on the index page, not a
super-exclusive lock, to do deletion of LP_DEAD items. It might seem
that this breaks the interlock between VACUUM and indexscans, but that is
not so: as long as an indexscanning process has a pin on the page where
the index item used to be, VACUUM cannot complete its btbulkdelete scan
and so cannot remove the heap tuple. This is another reason why
btbulkdelete has to get a super-exclusive lock on every leaf page, not
only the ones where it actually sees items to delete. So that we can
handle the cases where we attempt LP_DEAD flagging for a page after we
have released its pin, we remember the LSN of the index page when we read
the index tuples from it; we do not attempt to flag index tuples as dead
if we didn't hold the pin the entire time and the LSN has changed.
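
In outline (hypothetical bookkeeping types; the LSN comparison is the
rule described above):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* a page LSN */

    typedef struct
    {
        bool       pin_held_throughout; /* pin never dropped since read */
        XLogRecPtr lsn_at_read;         /* page LSN when tuples were read */
    } ScanPos;

    /* Is it safe to set LP_DEAD bits on this page now? */
    static bool
    safe_to_mark_dead(const ScanPos *pos, XLogRecPtr current_page_lsn)
    {
        /* A continuous pin blocks VACUUM from recycling the TIDs; an
         * unchanged LSN proves the page wasn't modified in between. */
        return pos->pin_held_throughout ||
               current_page_lsn == pos->lsn_at_read;
    }
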

WAL Considerations
------------------

The insertion and deletion algorithms in themselves don't guarantee btree
consistency after a crash. To provide robustness, we depend on WAL
replay. A single WAL entry is effectively an atomic action --- we can
redo it from the log if it fails to complete.
Ordinary item insertions (that don't force a page split) are of course
single WAL entries, since they only affect one page. The same for
leaf-item deletions (if the deletion brings the leaf page to zero items,
it is now a candidate to be deleted, but that is a separate action).
An insertion that causes a page split is logged as a single WAL entry for
the changes occurring on the insertion's level --- including update of the
right sibling's left-link --- followed by a second WAL entry for the
insertion on the parent level (which might itself be a page split, requiring
an additional insertion above that, etc).
For a root split, the follow-on WAL entry is a "new root" entry rather than
an "insertion" entry, but details are otherwise much the same.
Because splitting involves multiple atomic actions, it's possible that the
system crashes between splitting a page and inserting the downlink for the
new half to the parent. After recovery, the downlink for the new page will
be missing. The search algorithm works correctly, as the page will be found
by following the right-link from its left sibling, although if a lot of
downlinks in the tree are missing, performance will suffer. A more serious
consequence is that if the page without a downlink gets split again, the
insertion algorithm will fail to find the location in the parent level to
insert the downlink.
Our approach is to create any missing downlinks on-the-fly, when searching
the tree for a new insertion. It could be done during searches, too, but
it seems best not to put any extra updates in what would otherwise be a
read-only operation (updating is not possible in hot standby mode anyway).
It would seem natural to add the missing downlinks in VACUUM, but since
inserting a downlink might require splitting a page, it might fail if you
run out of disk space. That would be bad during VACUUM - the reason for
running VACUUM in the first place might be that you ran out of disk space,
and now VACUUM won't finish because you're out of disk space. In contrast,
an insertion can require enlarging the physical file anyway. There is one
minor exception: VACUUM finishes interrupted splits of internal pages when
deleting their children. This allows the code for re-finding parent items
to be used by both page splits and page deletion.
To identify missing downlinks, when a page is split, the left page is
flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
When the downlink is inserted to the parent, the flag is cleared atomically
with the insertion. The child page is kept locked until the insertion in
the parent is finished and the flag in the child cleared, but can be
released immediately after that, before recursing up the tree if the parent
also needs to be split. This ensures that incompletely split pages are
not seen under normal circumstances; one can be seen only if insertion to
the parent has failed for some reason. (It's also possible for a reader
to observe
a page with the incomplete split flag set during recovery; see later
section on "Scans during Recovery" for details.)
We flag the left page, even though it's the right page that's missing the
downlink, because it's more convenient to know already when following the
right-link from the left page to the right page that it will need to have
its downlink inserted to the parent.
When splitting a non-root page that is alone on its level, the required
metapage update (of the "fast root" link) is performed and logged as part
of the insertion into the parent level. When splitting the root page, the
metapage update is handled as part of the "new root" action.
Each step in page deletion is logged as a separate WAL entry: marking the
leaf as half-dead and removing the downlink is one record, and unlinking a
page is a second record. If vacuum is interrupted for some reason, or the
system crashes, the tree is consistent for searches and insertions. The
next VACUUM will find the half-dead leaf page and continue the deletion.
Before 9.4, we used to keep track of incomplete splits and page deletions
during recovery and finish them immediately at end of recovery, instead of
doing it lazily at the next insertion or vacuum. However, that made the
recovery much more complicated, and only fixed the problem when crash
recovery was performed. An incomplete split can also occur if an otherwise
recoverable error, like out-of-memory or out-of-disk-space, happens while
inserting the downlink to the parent.

Scans during Recovery
---------------------

nbtree indexes support read queries in Hot Standby mode. Every atomic
action/WAL record makes isolated changes that leave the tree in a
consistent state for readers. Readers lock pages according to the same
rules that readers follow on the primary. (Readers may have to move
right to recover from a "concurrent" page split or page deletion, just
like on the primary.)
However, there are a couple of differences in how pages are locked by
replay/the startup process as compared to the original write operation
on the primary. The exceptions involve page splits and page deletions.
The first phase and second phase of a page split are processed
independently during replay, since they are independent atomic actions.
We do not attempt to recreate the coupling of parent and child page
write locks that took place on the primary. This is safe because readers
never care about the incomplete split flag anyway. Holding on to an
extra write lock on the primary is only necessary so that a second
writer cannot observe the incomplete split flag before the first writer
finishes the split. If we let concurrent writers on the primary observe
an incomplete split flag on the same page, each writer would attempt to
complete the unfinished split, corrupting the parent page. (Similarly,
replay of page deletion records does not hold a write lock on the target
leaf page throughout; only the primary needs to block out concurrent
writers that insert on to the page being deleted.)
During recovery all index scans start with ignore_killed_tuples = false
and we never set kill_prior_tuple. We do this because the oldest xmin
on the standby server can be older than the oldest xmin on the master
server, which means tuples can be marked LP_DEAD even when they are
still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
they can still appear in the standby because of full page writes. So
we must always ignore them in standby, and that means it's not worth
setting them either. (When LP_DEAD-marked tuples are eventually deleted
on the primary, the deletion is WAL-logged. Queries that run on a
standby therefore get much of the benefit of any LP_DEAD setting that
takes place on the primary.)
Note that we talk about scans that are started during recovery. We go to
a little trouble to allow a scan to start during recovery and end during
normal running after recovery has completed. This is a key capability
because it allows running applications to continue while the standby
changes state into a normally running server.
The interlocking required to avoid returning incorrect results from
non-MVCC scans is not required on standby nodes. We still get a
super-exclusive lock ("cleanup lock") when replaying VACUUM records
during recovery, but recovery does not need to lock every leaf page
(only those leaf pages that have items to delete). That is safe because
HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only ever
used during write transactions, which cannot exist on the standby. MVCC
scans are already protected by definition, so HeapTupleSatisfiesMVCC()
is not a problem. The optimizer looks at the boundaries of value ranges
using HeapTupleSatisfiesNonVacuumable() with an index-only scan, which
is also safe. That leaves concern only for HeapTupleSatisfiesToast().
HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
because it doesn't need to - if the main heap row is visible then the
toast rows will also be visible. So as long as we follow a toast
pointer from a visible (live) tuple the corresponding toast rows
will also be visible, so we do not need to recheck MVCC on them.

Other Things That Are Handy to Know
-----------------------------------

Page zero of every btree is a meta-data page. This page stores the
location of the root page --- both the true root and the current effective
root ("fast" root). To avoid fetching the metapage for every single index
search, we cache a copy of the meta-data information in the index's
relcache entry (rd_amcache). This is a bit ticklish since using the cache
implies following a root page pointer that could be stale. However, a
backend following a cached pointer can sufficiently verify whether it
reached the intended page, either by checking the is-root flag when it
is going to the true root, or by checking that the page has no siblings
when going to the fast root. At worst, this could result in descending
some extra tree levels if we have a cached pointer to a fast root that is
now above the real fast root. Such cases shouldn't arise often enough to
be worth optimizing; and in any case we can expect a relcache flush will
discard the cached metapage before long, since a VACUUM that's moved the
fast root pointer can be expected to issue a statistics update for the
index.
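
The verification amounts to something like the following (hypothetical
page header fields):

    #include <stdbool.h>

    typedef unsigned int BlockNumber;

    typedef struct
    {
        bool        is_root;      /* true-root flag bit */
        BlockNumber left_link;    /* zero means "no sibling" here */
        BlockNumber right_link;
    } BTPageHdr;

    /* Did our cached (possibly stale) pointer land where we intended? */
    static bool
    cached_root_ok(const BTPageHdr *page, bool want_true_root)
    {
        if (want_true_root)
            return page->is_root;
        /* A fast root must (still) be alone on its level. */
        return page->left_link == 0 && page->right_link == 0;
    }
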
The algorithm assumes we can fit at least three items per page
(a "high key" and two real data items). Therefore it's unsafe
to accept items larger than 1/3rd page size. Larger items would
work sometimes, but could cause failures later on depending on
what else gets put on their page.
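
For a feel of the numbers (a deliberately crude calculation; the actual
limit also subtracts page header and item overhead):

    #include <stdio.h>

    #define BLCKSZ 8192     /* default PostgreSQL block size */

    int
    main(void)
    {
        /* Three items (high key + two data items) must always fit,
         * so an index item can't exceed roughly a third of a page. */
        printf("rough per-item ceiling: %d bytes\n", BLCKSZ / 3);
        return 0;
    }
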
"ScanKey" data structures are used in two fundamentally different ways
in this code, which we describe as "search" scankeys and "insertion"
scankeys. A search scankey is the kind passed to btbeginscan() or
btrescan() from outside the btree code. The sk_func pointers in a search
scankey point to comparison functions that return boolean, such as int4lt.
There might be more than one scankey entry for a given index column, or
none at all. (We require the keys to appear in index column order, but
the order of multiple keys for a given column is unspecified.) An
insertion scankey ("BTScanInsert" data structure) uses a similar
array-of-ScanKey data structure, but the sk_func pointers point to btree
comparison support functions (ie, 3-way comparators that return int4 values
interpreted as <0, =0, >0). In an insertion scankey there is at most one
entry per index column. There is also other data about the rules used to
locate where to begin the scan, such as whether or not the scan is a
"nextkey" scan. Insertion scankeys are built within the btree code (eg, by
_bt_mkscankey()) and are used to locate the starting point of a scan, as
well as for locating the place to insert a new index tuple. (Note: in the
case of an insertion scankey built from a search scankey or built from a
truncated pivot tuple, there might be fewer keys than index columns,
indicating that we have no constraints for the remaining index columns.)
After we have located the starting point of a scan, the original search
scankey is consulted as each index entry is sequentially scanned to decide
whether to return the entry and whether the scan can stop (see
_bt_checkkeys()).
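
As a hypothetical single-column illustration of the two styles (real
scankeys operate on Datums through per-column function pointers):

    #include <stdbool.h>

    /* A search scankey's sk_func returns boolean, like int4lt: */
    static bool
    int4lt_like(int a, int b)
    {
        return a < b;
    }

    /* An insertion scankey's sk_func is a 3-way comparison support
     * function instead, like btint4cmp: */
    static int
    btint4cmp_like(int a, int b)
    {
        return (a > b) - (a < b);   /* <0, =0, >0 */
    }

The descent binary-searches each page using the 3-way comparator; once the
scan is positioned, the boolean operators decide whether each tuple is
returned and when the scan can stop.
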

Notes about suffix truncation
-----------------------------

We truncate away suffix key attributes that are not needed for a page high
key during a leaf page split. The remaining attributes must distinguish
the last index tuple on the post-split left page as belonging on the left
page, and the first index tuple on the post-split right page as belonging
on the right page. Tuples logically retain truncated key attributes,
though they implicitly have "negative infinity" as their value, and have no
storage overhead. Since the high key is subsequently reused as the
downlink in the parent page for the new right page, suffix truncation makes
pivot tuples short. INCLUDE indexes are guaranteed to have non-key
attributes truncated at the time of a leaf page split, but may also have
some key attributes truncated away, based on the usual criteria for key
attributes. They are not a special case, since non-key attributes are
merely payload to B-Tree searches.
The goal of suffix truncation of key attributes is to improve index
fan-out. The technique was first described by Bayer and Unterauer (R.Bayer
and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
2, No. 1, March 1977, pp 11-26). The Postgres implementation is loosely
based on their paper. Note that Postgres only implements what the paper
refers to as simple prefix B-Trees. Note also that the paper assumes that
the tree has keys that consist of single strings that maintain the "prefix
property", much like strings that are stored in a suffix tree (comparisons
of earlier bytes must always be more significant than comparisons of later
bytes, and, in general, the strings must compare in a way that doesn't
break transitive consistency as they're split into pieces). Suffix
truncation in Postgres currently only works at the whole-attribute
granularity, but it would be straightforward to invent opclass
infrastructure that manufactures a smaller attribute value in the case of
variable-length types, such as text. An opclass support function could
manufacture the shortest possible key value that still correctly separates
each half of a leaf page split.
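
As a worked example, with hypothetical tuples in a two-attribute index:
if the last item on the post-split left page is ("Bell", 24) and the
first item on the right page is ("Cole", 11), the first attribute alone
distinguishes the halves, so the new high key becomes ("Cole", -inf).
It sorts after every item that stays on the left page and no later than
any item on the right page, satisfying the invariants, and the truncated
attribute takes no storage.
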
There are sophisticated criteria for choosing a leaf page split point. The
general idea is to make suffix truncation effective without unduly
influencing the balance of space for each half of the page split. The
choice of leaf split point can be thought of as a choice among points
*between* items on the page to be split, at least if you pretend that the
incoming tuple was placed on the page already (you have to pretend because
there won't actually be enough space for it on the page). Choosing the
split point between two index tuples where the first non-equal attribute
appears as early as possible results in truncating away as many suffix
attributes as possible. Evenly balancing space among each half of the
split is usually the first concern, but even small adjustments in the
precise split point can allow truncation to be far more effective.
Suffix truncation is primarily valuable because it makes pivot tuples
smaller, which delays splits of internal pages, but that isn't the only
reason why it's effective. Even truncation that doesn't make pivot tuples
smaller due to alignment still prevents pivot tuples from being more
restrictive than truly necessary in how they describe which values belong
on which pages.
While it's not possible to correctly perform suffix truncation during
internal page splits, it's still useful to be discriminating when splitting
an internal page. The split point that implies a downlink be inserted in
the parent that's the smallest one available within an acceptable range of
the fillfactor-wise optimal split point is chosen. This idea also comes
from the Prefix B-Tree paper. This process has much in common with what
happens at the leaf level to make suffix truncation effective. The overall
effect is that suffix truncation tends to produce smaller, more
discriminating pivot tuples, especially early in the lifetime of the index,
while biasing internal page splits makes the earlier, smaller pivot tuples
end up in the root page, delaying root page splits.
Logical duplicates are given special consideration. The logic for
selecting a split point goes to great lengths to avoid having duplicates
span more than one page, and almost always manages to pick a split point
between two user-key-distinct tuples, accepting a completely lopsided split
if it must. When a page that's already full of duplicates must be split,
the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
The overall effect is that leaf page splits gracefully adapt to inserts of
large groups of duplicates, maximizing space utilization. Note also that
"trapping" large groups of duplicates on the same leaf page like this makes
deduplication more efficient. Deduplication can be performed infrequently,
without merging together existing posting list tuples too often.

Notes about deduplication
-------------------------

We deduplicate non-pivot tuples in non-unique indexes to reduce storage
overhead, and to avoid (or at least delay) page splits. Note that the
goals for deduplication in unique indexes are rather different; see later
section for details. Deduplication alters the physical representation of
tuples without changing the logical contents of the index, and without
adding overhead to read queries. Non-pivot tuples are merged together
into a single physical tuple with a posting list (a simple array of heap
TIDs with the standard item pointer format). Deduplication is always
applied lazily, at the point where it would otherwise be necessary to
perform a page split. It occurs only when LP_DEAD items have been
removed, as our last line of defense against splitting a leaf page. We
can set the LP_DEAD bit with posting list tuples, though only when all
TIDs are known dead.
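
The physical shape of a posting list tuple is roughly the following
(illustrative struct; the real tuple reuses the ordinary index tuple
header, and heap TIDs use the standard 6-byte item pointer layout):

    #include <stdint.h>

    typedef struct
    {
        uint16_t block_hi;   /* high half of heap block number */
        uint16_t block_lo;   /* low half of heap block number */
        uint16_t offset;     /* heap line pointer number */
    } HeapTidExample;

    typedef struct
    {
        /* ... ordinary index tuple header and one key value ... */
        uint16_t       ntids;
        HeapTidExample tids[];  /* sorted ascending, no duplicates */
    } PostingTupleExample;
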
Our lazy approach to deduplication allows the page space accounting used
during page splits to have absolutely minimal special case logic for
posting lists. Posting lists can be thought of as extra payload that
suffix truncation will reliably truncate away as needed during page
splits, just like non-key columns from an INCLUDE index tuple.
Incoming/new tuples can generally be treated as non-overlapping plain
items (though see section on posting list splits for information about how
overlapping new/incoming items are really handled).
The representation of posting lists is almost identical to the posting
lists used by GIN, so it would be straightforward to apply GIN's varbyte
encoding compression scheme to individual posting lists. Posting list
compression would break the assumptions made by posting list splits about
page space accounting (see later section), so it's not clear how
compression could be integrated with nbtree. Besides, posting list
compression does not offer a compelling trade-off for nbtree, since in
general nbtree is optimized for consistent performance with many
concurrent readers and writers.
A major goal of our lazy approach to deduplication is to limit the
performance impact of deduplication with random updates. Even concurrent
append-only inserts of the same key value will tend to have inserts of
individual index tuples in an order that doesn't quite match heap TID
order. Delaying deduplication minimizes page level fragmentation.

Deduplication in unique indexes
-------------------------------

Very often, the range of values that can be placed on a given leaf page in
a unique index is fixed and permanent. For example, a primary key on an
identity column will usually only have page splits caused by the insertion
of new logical rows within the rightmost leaf page. If there is a split
of a non-rightmost leaf page, then the split must have been triggered by
inserts associated with an UPDATE of an existing logical row. Splitting a
leaf page purely to store multiple versions should be considered
pathological, since it permanently degrades the index structure in order
to absorb a temporary burst of duplicates. Deduplication in unique
indexes helps to prevent these pathological page splits. Storing
duplicates in a space efficient manner is not the goal, since in the long
run there won't be any duplicates anyway. Rather, we're buying time for
standard garbage collection mechanisms to run before a page split is
needed.
Unique index leaf pages only get a deduplication pass when an insertion
(that might have to split the page) observed an existing duplicate on the
page in passing. This is based on the assumption that deduplication will
only work out when _all_ new insertions are duplicates from UPDATEs. This
may mean that we miss an opportunity to delay a page split, but that's
okay because our ultimate goal is to delay leaf page splits _indefinitely_
(i.e. to prevent them altogether). There is little point in trying to
delay a split that is probably inevitable anyway. This allows us to avoid
the overhead of attempting to deduplicate with unique indexes that always
have few or no duplicates.

Posting list splits
-------------------

When the incoming tuple happens to overlap with an existing posting list,
a posting list split is performed. Like a page split, a posting list
split resolves a situation where a new/incoming item "won't fit", while
inserting the incoming item in passing (i.e. as part of the same atomic
action). It's possible (though not particularly likely) that an insert of
a new item on to an almost-full page will overlap with a posting list,
resulting in both a posting list split and a page split. Even then, the
atomic action that splits the posting list also inserts the new item
(since page splits always insert the new item in passing). Including the
posting list split in the same atomic action as the insert avoids problems
caused by concurrent inserts into the same posting list -- the exact
details of how we change the posting list depend upon the new item, and
vice-versa. A single atomic action also minimizes the volume of extra
WAL required for a posting list split, since we don't have to explicitly
WAL-log the original posting list tuple.
Despite piggy-backing on the same atomic action that inserts a new tuple,
posting list splits can be thought of as a separate, extra action to the
insert itself (or to the page split itself). Posting list splits
conceptually "rewrite" an insert that overlaps with an existing posting
list into an insert that adds its final new item just to the right of the
posting list instead. The size of the posting list won't change, and so
page space accounting code does not need to care about posting list splits
at all. This is an important upside of our design; the page split point
choice logic is very subtle even without it needing to deal with posting
list splits.
Only a few isolated extra steps are required to preserve the illusion that
the new item never overlapped with an existing posting list in the first
place: the heap TID of the incoming tuple is swapped with the rightmost/max
heap TID from the existing/originally overlapping posting list. Also, the
posting-split-with-page-split case must generate a new high key based on
an imaginary version of the original page that has both the final new item
and the after-list-split posting tuple (page splits usually just operate
against an imaginary version that contains the new item/item that won't
fit).
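
The swap itself is small; a self-contained sketch with heap TIDs
simplified to comparable integers (the real code manipulates item
pointers inside a tuple):

    #include <stdint.h>

    typedef uint64_t HeapTid;   /* block + offset, kept comparable */

    /*
     * Caller found that *new_tid overlaps this posting list, i.e. it
     * sorts before the list's current maximum.  On return the list
     * contains the incoming TID, and *new_tid holds the displaced
     * maximum, which the caller inserts just to the right of the list.
     */
    static void
    posting_list_split(HeapTid *tids, int ntids, HeapTid *new_tid)
    {
        HeapTid old_max = tids[ntids - 1];
        int     i = ntids - 1;

        while (i > 0 && tids[i - 1] > *new_tid)
        {
            tids[i] = tids[i - 1];  /* shift to keep sorted order */
            i--;
        }
        tids[i] = *new_tid;         /* incoming TID joins the list */
        *new_tid = old_max;         /* hand back the displaced maximum */
    }
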
This approach avoids inventing an "eager" atomic posting split operation
that splits the posting list without simultaneously finishing the insert
of the incoming item. This alternative design might seem cleaner, but it
creates subtle problems for page space accounting. In general, there
might not be enough free space on the page to split a posting list such
that the incoming/new item no longer overlaps with either posting list
half --- the operation could fail before the actual retail insert of the
new item even begins. We'd end up having to handle posting list splits
that need a page split anyway. Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.

Notes About Data Representation
-------------------------------

The right-sibling link required by L&Y is kept in the page "opaque
data" area, as is the left-sibling link, the page level, and some flags.
The page level counts upwards from zero at the leaf level, to the tree
depth minus 1 at the root. (Counting up from the leaves ensures that we
don't need to renumber any existing pages when splitting the root.)
The Postgres disk block data format (an array of items) doesn't fit
Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
so we have to play some games. (The alternating-keys-and-pointers
notion is important for internal page splits, which conceptually split
at the middle of an existing pivot tuple -- the tuple's "separator" key
goes on the left side of the split as the left side's new high key,
while the tuple's pointer/downlink goes on the right side as the
first/minus infinity downlink.)
On a page that is not rightmost in its tree level, the "high key" is
kept in the page's first item, and real data items start at item 2.
The link portion of the "high key" item goes unused. A page that is
rightmost has no "high key" (it's implicitly positive infinity), so
data items start with the first item. Putting the high key at the
left, rather than the right, may seem odd, but it avoids moving the
high key as we add data items.
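
Schematically, a non-rightmost page lays out as:

    item 1:  high key (upper bound for the page; link field unused)
    item 2:  first real data item
    item 3:  second real data item
    ...

while a rightmost page has no high key and starts its data at item 1.
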
On a leaf page, the data items are simply links to (TIDs of) tuples
in the relation being indexed, with the associated key values.
On a non-leaf page, the data items are down-links to child pages with
bounding keys. The key in each data item is a strict lower bound for
keys on that child page, so logically the key is to the left of that
downlink. The high key (if present) is the upper bound for the last
downlink. The first data item on each such page has no lower bound
--- or lower bound of minus infinity, if you prefer. The comparison
routines must treat it accordingly. The actual key stored in the
item is irrelevant, and need not be stored at all. This arrangement
corresponds to the fact that an L&Y non-leaf page has one more pointer
than key. Suffix truncation's negative infinity attributes behave in
the same way.