src/backend/utils/mmgr/README

Memory Context System Design Overview
=====================================

Background
----------

We do most of our memory allocation in "memory contexts", which are usually
AllocSets as implemented by src/backend/utils/mmgr/aset.c. The key to
successful memory management without lots of overhead is to define a useful
set of contexts with appropriate lifespans.

The basic operations on a memory context are:

* create a context

* allocate a chunk of memory within a context (equivalent of standard
  C library's malloc())

* delete a context (including freeing all the memory allocated therein)

* reset a context (free all memory allocated in the context, but not the
  context object itself)

* inquire about the total amount of memory allocated to the context
  (the raw memory from which the context allocates chunks; not the
  chunks themselves)

Given a chunk of memory previously allocated from a context, one can
free it or reallocate it larger or smaller (corresponding to standard C
library's free() and realloc() routines). These operations return memory
to or get more memory from the same context the chunk was originally
allocated in.

At all times there is a "current" context denoted by the
CurrentMemoryContext global variable. palloc() implicitly allocates space
in that context. The MemoryContextSwitchTo() operation selects a new current
context (and returns the previous context, so that the caller can restore the
previous context before exiting).

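The operations listed above can be sketched with a toy model. This is
not aset.c: every chunk is simply malloc'd and chained to its context, and
all toy_* names are hypothetical. It only illustrates create, allocate,
reset, delete, and the save-and-restore idiom around switching the current
context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy chunk header; the union keeps the payload suitably aligned. */
typedef union ToyChunk
{
    union ToyChunk *next;
    max_align_t     align;
} ToyChunk;

typedef struct ToyContext
{
    const char *name;
    ToyChunk   *chunks;     /* every chunk handed out since the last reset */
    size_t      nchunks;
} ToyContext;

/* Hypothetical stand-in for the CurrentMemoryContext global. */
static ToyContext *toy_current = NULL;

static ToyContext *
toy_context_create(const char *name)
{
    ToyContext *cxt = calloc(1, sizeof(ToyContext));

    if (cxt == NULL)
        abort();
    cxt->name = name;
    return cxt;
}

/* Install a new current context and return the previous one. */
static ToyContext *
toy_switch_to(ToyContext *cxt)
{
    ToyContext *old = toy_current;

    toy_current = cxt;
    return old;
}

/* Allocate in whatever context is current; never returns NULL. */
static void *
toy_alloc(size_t size)
{
    ToyChunk   *chunk = malloc(sizeof(ToyChunk) + size);

    if (chunk == NULL)
        abort();
    chunk->next = toy_current->chunks;
    toy_current->chunks = chunk;
    toy_current->nchunks++;
    return (void *) (chunk + 1);
}

/* Free everything allocated in the context, but keep the context. */
static void
toy_context_reset(ToyContext *cxt)
{
    while (cxt->chunks != NULL)
    {
        ToyChunk   *next = cxt->chunks->next;

        free(cxt->chunks);
        cxt->chunks = next;
    }
    cxt->nchunks = 0;
}

/* Free the contents and the context object itself. */
static void
toy_context_delete(ToyContext *cxt)
{
    toy_context_reset(cxt);
    free(cxt);
}
```

A caller would typically write old = toy_switch_to(short_lived), do its
work, and then call toy_switch_to(old) before returning, so that its own
caller's current context is undisturbed.
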
The main advantage of memory contexts over plain use of malloc/free is
that the entire contents of a memory context can be freed easily, without
having to request freeing of each individual chunk within it. This is
both faster and more reliable than per-chunk bookkeeping. We use this
fact to clean up at transaction end: by resetting all the active contexts
of transaction or shorter lifespan, we can reclaim all transient memory.
Similarly, we can clean up at the end of each query, or after each tuple
is processed during a query.


Some Notes About the palloc API Versus Standard C Library
---------------------------------------------------------

The behavior of palloc and friends is similar to the standard C library's
malloc and friends, but there are some deliberate differences too. Here
are some notes to clarify the behavior.

* If out of memory, palloc and repalloc exit via elog(ERROR). They
  never return NULL, and it is not necessary or useful to test for such
  a result. With palloc_extended() that behavior can be overridden
  using the MCXT_ALLOC_NO_OOM flag.

* palloc(0) is explicitly a valid operation. It does not return a NULL
  pointer, but a valid chunk of which no bytes may be used. However, the
  chunk might later be repalloc'd larger; it can also be pfree'd without
  error. Similarly, repalloc allows realloc'ing to zero size.

* pfree and repalloc do not accept a NULL pointer. This is intentional.


The Current Memory Context
--------------------------

Because it would be too much notational overhead to always pass an
appropriate memory context to called routines, there always exists the
notion of the current memory context, CurrentMemoryContext. Without it,
for example, the copyObject routines would need to be passed a context, as
would function execution routines that return a pass-by-reference
datatype. The same goes for routines that temporarily allocate space
internally but don't return it to their caller. We certainly don't
want to clutter every call in the system with "here is a context to
use for any temporary memory allocation you might want to do".

The upshot of that reasoning, though, is that CurrentMemoryContext should
generally point at a short-lifespan context if at all possible. During
query execution it usually points to a context that gets reset after each
tuple. Only in *very* circumscribed code should it ever point at a
context having greater than transaction lifespan, since doing so risks
permanent memory leaks.


pfree/repalloc Do Not Depend On CurrentMemoryContext
----------------------------------------------------

pfree() and repalloc() can be applied to any chunk whether it belongs
to CurrentMemoryContext or not --- the chunk's owning context will be
invoked to handle the operation, regardless.


"Parent" and "Child" Contexts
-----------------------------

If all contexts were independent, it'd be hard to keep track of them,
especially in error cases. That is solved by creating a tree of
"parent" and "child" contexts. When creating a memory context, the
new context can be specified to be a child of some existing context.
A context can have many children, but only one parent. In this way
the contexts form a forest (not necessarily a single tree, since there
could be more than one top-level context; although in current practice
there is only one top context, TopMemoryContext).

Deleting a context deletes all its direct and indirect children as
well. When resetting a context it's almost always more useful to
delete child contexts, thus MemoryContextReset() means that, and if
you really do want a tree of empty contexts you need to call
MemoryContextResetOnly() plus MemoryContextResetChildren().

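The delete and reset rules above can be illustrated with a small toy tree.
This is a sketch under simplifying assumptions, not the mcxt.c code: the
tree_* names are hypothetical, and a plain counter stands in for the memory
a context owns.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct TreeContext
{
    struct TreeContext *parent;
    struct TreeContext *firstchild;
    struct TreeContext *nextsibling;
    int         nallocs;    /* stand-in for the memory this context owns */
} TreeContext;

/* Number of contexts currently in existence, for demonstration only. */
static int  tree_live_contexts = 0;

static TreeContext *
tree_create(TreeContext *parent)
{
    TreeContext *cxt = calloc(1, sizeof(TreeContext));

    if (cxt == NULL)
        abort();
    cxt->parent = parent;
    if (parent != NULL)
    {
        cxt->nextsibling = parent->firstchild;
        parent->firstchild = cxt;
    }
    tree_live_contexts++;
    return cxt;
}

static void tree_delete(TreeContext *cxt);

/* Delete every descendant of cxt, leaving cxt itself in place. */
static void
tree_delete_children(TreeContext *cxt)
{
    /* tree_delete() unlinks the child, so firstchild advances each time */
    while (cxt->firstchild != NULL)
        tree_delete(cxt->firstchild);
}

/* Delete cxt together with all its direct and indirect children. */
static void
tree_delete(TreeContext *cxt)
{
    tree_delete_children(cxt);
    if (cxt->parent != NULL)
    {
        TreeContext **link = &cxt->parent->firstchild;

        while (*link != cxt)
            link = &(*link)->nextsibling;
        *link = cxt->nextsibling;
    }
    free(cxt);
    tree_live_contexts--;
}

/* Reset: release the context's own memory and delete its children. */
static void
tree_reset(TreeContext *cxt)
{
    tree_delete_children(cxt);
    cxt->nallocs = 0;
}
```

With a chain such as top -> transaction -> statement -> scan, resetting the
transaction-level context is enough to dispose of everything below it; no
code needs to remember the statement or scan contexts individually.
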
These features allow us to manage a lot of contexts without fear that
some will be leaked; we only need to keep track of one top-level
context that we are going to delete at transaction end, and make sure
that any shorter-lived contexts we create are descendants of that
context. Since the tree can have multiple levels, we can deal easily
with nested lifetimes of storage, such as per-transaction,
per-statement, per-scan, per-tuple. Storage lifetimes that only
partially overlap can be handled by allocating from different trees of
the context forest (there are some examples in the next section).

For convenience we also provide operations like "reset/delete all children
of a given context, but don't reset or delete that context itself".


Memory Context Reset/Delete Callbacks
-------------------------------------

A feature introduced in Postgres 9.5 allows memory contexts to be used
for managing more resources than just plain palloc'd memory. This is
done by registering a "reset callback function" for a memory context.
Such a function will be called, once, just before the context is next
reset or deleted. It can be used to give up resources that are in some
sense associated with an object allocated within the context. Possible
use-cases include
* closing open files associated with a tuplesort object;
* releasing reference counts on long-lived cache objects that are held
  by some object within the context being reset;
* freeing malloc-managed memory associated with some palloc'd object.
That last case would just represent bad programming practice for pure
Postgres code; better to have made all the allocations using palloc,
in the target context or some child context. However, it could well
come in handy for code that interfaces to non-Postgres libraries.

Any number of reset callbacks can be established for a memory context;
they are called in reverse order of registration. Also, callbacks
attached to child contexts are called before callbacks attached to
parent contexts, if a tree of contexts is being reset or deleted.

The API for this requires the caller to provide a MemoryContextCallback
memory chunk to hold the state for a callback. Typically this should be
allocated in the same context it is logically attached to, so that it
will be released automatically after use. The reason for asking the
caller to provide this memory is that in most usage scenarios, the caller
will be creating some larger struct within the target context, and the
MemoryContextCallback struct can be made "for free" without a separate
palloc() call by including it in this larger struct.


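The callback ordering rules and the "embed the callback struct in a larger
struct" idea can be sketched as follows. This is a toy model, not the
PostgreSQL API: CbEntry, CbContext, ToySort and the cb_* functions are
hypothetical names, the toy never frees anything, and the callbacks merely
append a tag to a log so the call order is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Caller-provided callback state (toy counterpart of the callback struct). */
typedef struct CbEntry
{
    void        (*func) (void *arg);
    void       *arg;
    struct CbEntry *next;
} CbEntry;

typedef struct CbContext
{
    struct CbContext *firstchild;
    struct CbContext *nextsibling;
    CbEntry    *callbacks;
} CbContext;

/* Push onto the list head; walking the list gives reverse registration order. */
static void
cb_register(CbContext *cxt, CbEntry *entry)
{
    entry->next = cxt->callbacks;
    cxt->callbacks = entry;
}

/* Fire callbacks of children first, then those of cxt; each fires once. */
static void
cb_reset(CbContext *cxt)
{
    for (CbContext *child = cxt->firstchild; child != NULL;
         child = child->nextsibling)
        cb_reset(child);

    while (cxt->callbacks != NULL)
    {
        CbEntry    *entry = cxt->callbacks;

        cxt->callbacks = entry->next;   /* unlink before calling */
        entry->func(entry->arg);
    }
}

/* A larger object that carries its callback entry, so none is allocated. */
typedef struct ToySort
{
    char        tag;        /* identifies this object in the log */
    CbEntry     cleanup;
} ToySort;

static char cb_log[16];     /* order in which cleanups ran */

static void
toy_sort_cleanup(void *arg)
{
    ToySort    *sort = arg;
    size_t      len = strlen(cb_log);

    if (len + 1 < sizeof(cb_log))
    {
        cb_log[len] = sort->tag;
        cb_log[len + 1] = '\0';
    }
}

static void
toy_sort_init(ToySort *sort, CbContext *cxt, char tag)
{
    sort->tag = tag;
    sort->cleanup.func = toy_sort_cleanup;
    sort->cleanup.arg = sort;
    cb_register(cxt, &sort->cleanup);
}
```

Registering objects 'a' and then 'b' on a parent and 'c' on its child, and
then resetting the parent, runs the cleanups in the order c, b, a. A second
reset runs nothing, because each callback is unlinked before it is called.

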
Memory Contexts in Practice
===========================

Globally Known Contexts
-----------------------

There are a few widely-known contexts that are typically referenced
through global variables. At any instant the system may contain many
additional contexts, but all other contexts should be direct or indirect
children of one of these contexts to ensure they are not leaked in event
of an error.

TopMemoryContext --- this is the actual top level of the context tree;
every other context is a direct or indirect child of this one. Allocating
here is essentially the same as "malloc", because this context will never
be reset or deleted. This is for stuff that should live forever, or for
stuff that the controlling module will take care of deleting at the
appropriate time. An example is fd.c's tables of open files. Avoid
allocating stuff here unless really necessary, and especially avoid
running with CurrentMemoryContext pointing here.

PostmasterContext --- this is the postmaster's normal working context.
After a backend is spawned, it can delete PostmasterContext to free its
copy of memory the postmaster was using that it doesn't need.
Note that in non-EXEC_BACKEND builds, the postmaster's copy of pg_hba.conf
and pg_ident.conf data is used directly during authentication in backend
processes; so backends can't delete PostmasterContext until that's done.
(The postmaster has only TopMemoryContext, PostmasterContext, and
ErrorContext --- the remaining top-level contexts are set up in each
backend during startup.)

CacheMemoryContext --- permanent storage for relcache, catcache, and
related modules. This will never be reset or deleted, either, so it's
not truly necessary to distinguish it from TopMemoryContext. But it
seems worthwhile to maintain the distinction for debugging purposes.
(Note: CacheMemoryContext has child contexts with shorter lifespans.
For example, a child context is the best place to keep the subsidiary
storage associated with a relcache entry; that way we can free rule
parsetrees and so forth easily, without having to depend on constructing
a reliable version of freeObject().)

MessageContext --- this context holds the current command message from the
frontend, as well as any derived storage that need only live as long as
the current message (for example, in simple-Query mode the parse and plan
trees can live here). This context will be reset, and any children
deleted, at the top of each cycle of the outer loop of PostgresMain. This
is kept separate from per-transaction and per-portal contexts because a
query string might need to live either a longer or shorter time than any
single transaction or portal.

TopTransactionContext --- this holds everything that lives until end of the
top-level transaction. This context will be reset, and all its children
deleted, at conclusion of each top-level transaction cycle. In most cases
you don't want to allocate stuff directly here, but in CurTransactionContext;
what does belong here is control information that exists explicitly to manage
status across multiple subtransactions. Note: this context is NOT cleared
immediately upon error; its contents will survive until the transaction block
is exited by COMMIT/ROLLBACK.

CurTransactionContext --- this holds data that has to survive until the end
of the current transaction, and in particular will be needed at top-level
transaction commit. When we are in a top-level transaction this is the same
as TopTransactionContext, but in subtransactions it points to a child context.
It is important to understand that if a subtransaction aborts, its
CurTransactionContext is thrown away after finishing the abort processing;
but a committed subtransaction's CurTransactionContext is kept until top-level
commit (unless of course one of the intermediate levels of subtransaction
aborts). This ensures that we do not keep data from a failed subtransaction
longer than necessary. Because of this behavior, you must be careful to clean
up properly during subtransaction abort --- the subtransaction's state must be
delinked from any pointers or lists kept in upper transactions, or you will
have dangling pointers leading to a crash at top-level commit. An example of
data kept here is pending NOTIFY messages, which are sent at top-level commit,
but only if the generating subtransaction did not abort.

PortalContext --- this is not actually a separate context, but a
global variable pointing to the per-portal context of the currently active
execution portal. This can be used if it's necessary to allocate storage
that will live just as long as the execution of the current portal requires.

ErrorContext --- this permanent context is switched into for error
recovery processing, and then reset on completion of recovery. We arrange
to have a few KB of memory available in it at all times. In this way, we
can ensure that some memory is available for error recovery even if the
backend has run out of memory otherwise. This allows out-of-memory to be
treated as a normal ERROR condition, not a FATAL error.


Contexts For Prepared Statements And Portals
--------------------------------------------

A prepared-statement object has an associated private context, in which
the parse and plan trees for its query are stored. Because these trees
are read-only to the executor, the prepared statement can be re-used many
times without further copying of these trees.

An execution-portal object has a private context that is referenced by
PortalContext when the portal is active. In the case of a portal created
by DECLARE CURSOR, this private context contains the query parse and plan
trees (there being no other object that can hold them). Portals created
from prepared statements simply reference the prepared statements' trees,
and don't actually need any storage allocated in their private contexts.


Logical Replication Worker Contexts
-----------------------------------

ApplyContext --- permanent during the whole lifetime of the apply worker.
It would be possible to use TopMemoryContext here as well, but to simplify
memory usage analysis we use a separate context.

ApplyMessageContext --- short-lived context that is reset after each
logical replication protocol message is processed.


Transient Contexts During Execution
-----------------------------------

When creating a prepared statement, the parse and plan trees will be built
in a temporary context that's a child of MessageContext (so that it will
go away automatically upon error). On success, the finished plan is
copied to the prepared statement's private context, and the temp context
is released; this allows planner temporary space to be recovered before
execution begins. (In simple-Query mode we don't bother with the extra
copy step, so the planner temp space stays around till end of query.)

The top-level executor routines, as well as most of the "plan node"
execution code, will normally run in a context that is created by
ExecutorStart and destroyed by ExecutorEnd; this context also holds the
"plan state" tree built during ExecutorStart. Most of the memory
allocated in these routines is intended to live until end of query,
so this is appropriate for those purposes. The executor's top context
is a child of PortalContext, that is, the per-portal context of the
portal that represents the query's execution.

The main memory-management consideration in the executor is that
expression evaluation --- both for qual testing and for computation of
targetlist entries --- needs to not leak memory. To do this, each
ExprContext (expression-eval context) created in the executor has a
private memory context associated with it, and we switch into that context
when evaluating expressions in that ExprContext. The plan node that owns
the ExprContext is responsible for resetting the private context to empty
when it no longer needs the results of expression evaluations. Typically
the reset is done at the start of each tuple-fetch cycle in the plan node.

Note that this design gives each plan node its own expression-eval memory
context. This appears necessary to handle nested joins properly, since
an outer plan node might need to retain expression results it has computed
while obtaining the next tuple from an inner node --- but the inner node
might execute many tuple cycles and many expressions before returning a
tuple. The inner node must be able to reset its own expression context
more often than once per outer tuple cycle. Fortunately, memory contexts
are cheap enough that giving one to each plan node doesn't seem like a
problem.

A problem with running index accesses and sorts in a query-lifespan context
is that these operations invoke datatype-specific comparison functions,
and if the comparators leak any memory then that memory won't be recovered
till end of query. The comparator functions all return bool or int32,
so there's no problem with their result data, but there can be a problem
with leakage of internal temporary data. In particular, comparator
functions that operate on TOAST-able data types need to be careful
not to leak detoasted versions of their inputs. This is annoying, but
it appeared a lot easier to make the comparators conform than to fix the
index and sort routines, so that's what was done for 7.1. This remains
the state of affairs in btree and hash indexes, so btree and hash support
functions still need to not leak memory. Most of the other index AMs
have been modified to run opclass support functions in short-lived
contexts, so that leakage is not a problem; this is necessary in view
of the fact that their support functions tend to be far more complex.

There are some special cases, such as aggregate functions. nodeAgg.c
needs to remember the results of evaluation of aggregate transition
functions from one tuple cycle to the next, so it can't just discard
all per-tuple state in each cycle. The easiest way to handle this seems
to be to have two per-tuple contexts in an aggregate node, and to
ping-pong between them, so that at each tuple one is the active allocation
context and the other holds any results allocated by the prior cycle's
transition function.

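The ping-pong idea can be sketched in isolation. The following is a toy
model of the technique as described above, not the nodeAgg.c code: PingCxt
is a hypothetical fixed-size bump allocator standing in for a per-tuple
context, and the "transition function" just maintains a running sum while
deliberately leaving a scratch allocation behind.

```c
#include <assert.h>
#include <stddef.h>

/* Toy bump allocator standing in for one per-tuple memory context. */
typedef struct PingCxt
{
    long        slots[8];
    size_t      used;
    int         resets;
} PingCxt;

static long *
ping_alloc_long(PingCxt *cxt)
{
    assert(cxt->used < 8);
    return &cxt->slots[cxt->used++];
}

static void
ping_reset(PingCxt *cxt)
{
    cxt->used = 0;
    cxt->resets++;
}

/* Transition step: reads the old state, allocates the new one in cxt. */
static long *
sum_transition(PingCxt *cxt, const long *oldstate, long input)
{
    long       *scratch = ping_alloc_long(cxt);     /* temporary garbage */
    long       *newstate = ping_alloc_long(cxt);

    *scratch = input;
    *newstate = (oldstate != NULL ? *oldstate : 0) + *scratch;
    return newstate;
}

static long
ping_pong_sum(const long *inputs, size_t n, PingCxt cxts[2])
{
    long       *state = NULL;
    int         active = 0;

    for (size_t i = 0; i < n; i++)
    {
        /*
         * The inactive context still holds the previous state, so only
         * the active one may be cleared before this cycle's allocations.
         */
        ping_reset(&cxts[active]);
        state = sum_transition(&cxts[active], state, inputs[i]);
        active = 1 - active;
    }
    return state != NULL ? *state : 0;
}
```

However many tuples are processed, neither toy context ever holds more than
one cycle's worth of allocations, yet the previous state is always readable
while the next one is being computed.
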
Executor routines that switch the active CurrentMemoryContext may need
to copy data into their caller's current memory context before returning.
However, we have minimized the need for that, because of the convention
of resetting the per-tuple context at the *start* of an execution cycle
rather than at its end. With that rule, an execution node can return a
tuple that is palloc'd in its per-tuple context, and the tuple will remain
good until the node is called for another tuple or told to end execution.
This parallels the situation with pass-by-reference values at the table
scan level, since a scan node can return a direct pointer to a tuple in a
disk buffer that is only guaranteed to remain good that long.

A more common reason for copying data is to transfer a result from
per-tuple context to per-query context; for example, a Unique node will
save the last distinct tuple value in its per-query context, requiring a
copy step.


Mechanisms to Allow Multiple Types of Contexts
----------------------------------------------

To efficiently allow for different allocation patterns, and for
experimentation, we allow for different types of memory contexts with
different allocation policies but similar external behavior. To
handle this, memory allocation functions are accessed via function
pointers, and we require all context types to obey the conventions
given here.

A memory context is represented by struct MemoryContextData (see
memnodes.h). This struct identifies the exact type of the context, and
contains information common between the different types of
MemoryContext like the parent and child contexts, and the name of the
context.

This is essentially an abstract superclass, and its behavior is
determined by the "methods" pointer, which points to its virtual function
table (struct MemoryContextMethods). Specific memory context types will
use derived structs having these fields as their first fields. All the
contexts of a specific type will have methods pointers that point to
the same static table of function pointers.

While operations like allocating from and resetting a context take the
relevant MemoryContext as a parameter, operations like free and
realloc are trickier. To make those work, we require all memory
context types to produce allocated chunks that are immediately,
without any padding, preceded by a pointer to the corresponding
MemoryContext.

If a type of allocator needs additional information about its chunks,
like e.g. the size of the allocation, that information can in turn
precede the MemoryContext. This means the only overhead implied by
the memory context mechanism is a pointer to its context, so we're not
constraining context-type designers very much.

Given this, routines like pfree determine their corresponding context
with an operation like (although that is usually encapsulated in
GetMemoryChunkContext())

    MemoryContext context = *(MemoryContext*) (((char *) pointer) - sizeof(void *));

and then invoke the corresponding method for the context

    context->methods->free_p(pointer);


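The header-pointer convention and the dispatch through a methods table can
be tried out in a standalone sketch. Everything named Demo* or Counting*
below is hypothetical; the toy uses malloc for each chunk and supports only
allocate and free. It follows the layout described in this section, with
the owning-context pointer placed directly before the user data.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct DemoContext DemoContext;

/* Toy virtual function table shared by all contexts of one type. */
typedef struct DemoMethods
{
    void       *(*alloc) (DemoContext *cxt, size_t size);
    void        (*free_p) (void *pointer);
} DemoMethods;

/* Common part; a concrete context type embeds this as its first field. */
struct DemoContext
{
    const DemoMethods *methods;
    const char *name;
};

typedef struct CountingContext
{
    DemoContext header;     /* must be first */
    int         live_chunks;
} CountingContext;

/* Recover the owning context from the pointer stored before the chunk. */
static DemoContext *
demo_chunk_context(void *pointer)
{
    return *(DemoContext **) (((char *) pointer) - sizeof(DemoContext *));
}

/* Chunk layout: [ DemoContext * ][ user data ... ] (pointer-aligned). */
static void *
counting_alloc(DemoContext *cxt, size_t size)
{
    DemoContext **raw = malloc(sizeof(DemoContext *) + size);

    if (raw == NULL)
        abort();
    raw[0] = cxt;
    ((CountingContext *) cxt)->live_chunks++;
    return (void *) (raw + 1);
}

static void
counting_free(void *pointer)
{
    CountingContext *cxt = (CountingContext *) demo_chunk_context(pointer);

    cxt->live_chunks--;
    free(((char *) pointer) - sizeof(DemoContext *));
}

static const DemoMethods counting_methods = {counting_alloc, counting_free};

/* Takes no context argument: the owner is found from the chunk itself. */
static void
demo_free(void *pointer)
{
    DemoContext *cxt = demo_chunk_context(pointer);

    cxt->methods->free_p(pointer);
}
```

Because demo_free() looks the owner up from the chunk, it behaves the same
no matter which context happens to be current, which is the property that
lets pfree and repalloc work on any chunk.

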
More Control Over aset.c Behavior
---------------------------------

By default aset.c always allocates an 8K block upon the first
allocation in a context, and doubles that size for each successive
block request. That's good behavior for a context that might hold
*lots* of data. But if there are dozens if not hundreds of smaller
contexts in the system, we need to be able to fine-tune things a
little better.

The creator of a context is able to specify an initial block size and
a maximum block size. Selecting smaller values can prevent wastage of
space in contexts that aren't expected to hold very much (an example
is the relcache's per-relation contexts).

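The effect of the two parameters on block sizes can be worked through with
a few lines of arithmetic. The sketch below is a simplified model, not
aset.c: BlockSizer and its functions are hypothetical, it assumes the
doubling simply stops at the maximum block size, and it ignores any special
treatment of unusually large requests.

```c
#include <assert.h>
#include <stddef.h>

typedef struct BlockSizer
{
    size_t      next_size;  /* size of the next block to request */
    size_t      max_size;   /* doubling stops once this is reached */
} BlockSizer;

static BlockSizer
block_sizer_init(size_t init_size, size_t max_size)
{
    BlockSizer  sizer = {init_size, max_size};

    return sizer;
}

/* Return the size of the next block, then double it for the one after. */
static size_t
block_sizer_next(BlockSizer *sizer)
{
    size_t      size = sizer->next_size;

    if (sizer->next_size < sizer->max_size)
    {
        sizer->next_size *= 2;
        if (sizer->next_size > sizer->max_size)
            sizer->next_size = sizer->max_size;
    }
    return size;
}

/* Total raw memory requested after nblocks block allocations. */
static size_t
block_sizer_total(size_t init_size, size_t max_size, int nblocks)
{
    BlockSizer  sizer = block_sizer_init(init_size, max_size);
    size_t      total = 0;

    for (int i = 0; i < nblocks; i++)
        total += block_sizer_next(&sizer);
    return total;
}
```

Under these assumptions, an 8K initial size with an illustrative 64K
maximum yields blocks of 8K, 16K, 32K, 64K, 64K, while a 1K initial size
with an 8K maximum yields 1K, 2K, 4K, 8K, 8K. A context that only ever
needs a few hundred bytes therefore ties up 1K rather than 8K.
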
Also, it is possible to specify a minimum context size, in case for some
reason that should be different from the initial size for additional
blocks. An aset.c context will always contain at least one block,
of size minContextSize if that is specified, otherwise initBlockSize.

We expect that per-tuple contexts will be reset frequently and typically
will not allocate very much space per tuple cycle. To make this usage
pattern cheap, the first block allocated in a context is not given
back to malloc() during reset, but just cleared. This avoids malloc
thrashing.


Alternative Memory Context Implementations
------------------------------------------

aset.c is our default general-purpose implementation, working fine
in most situations. We also have two implementations optimized for
special use cases, providing either better performance or lower memory
usage compared to aset.c (or both).

* slab.c (SlabContext) is designed for allocations of fixed-length
  chunks, and does not allow allocations of chunks of any other size.

* generation.c (GenerationContext) is designed for cases when chunks
  are allocated in groups with similar lifespan (generations), or
  roughly in FIFO order.

Both memory contexts aim to free memory back to the operating system
(unlike aset.c, which keeps the freed chunks in a freelist, and only
returns the memory when reset/deleted).

These memory contexts were initially developed for ReorderBuffer, but
may be useful elsewhere as long as the allocation patterns match.


Memory Accounting
-----------------

One of the basic memory context operations is determining the amount of
memory used in the context (and its children). We have multiple places
that implement their own ad hoc memory accounting, and this is meant to
provide a unified approach. Ad hoc accounting solutions work for places
with tight control over the allocations or when it's easy to determine
sizes of allocated chunks (e.g. places that only work with tuples).

The accounting built into the memory contexts is transparent and works
for all allocations as long as they end up in the right memory context
subtree.

Consider for example aggregate functions - the aggregate state is often
represented by an arbitrary structure, allocated from the transition
function, so the ad hoc accounting is unlikely to work. The built-in
accounting will however handle such cases just fine.

To minimize overhead, the accounting is done at the block level, not for
individual allocation chunks.

The accounting is lazy - after a block is allocated (or freed), only the
context owning that block is updated. This means that when inquiring
about the memory usage in a given context, we have to walk all child
contexts recursively. Consequently, the memory accounting is not intended
for cases with too many memory contexts (in the relevant subtree).

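The lazy, block-level scheme can be sketched as follows. This is a toy
model under stated assumptions, not the backend code: AcctContext and the
acct_* functions are hypothetical, and contexts are plain structs linked
into a tree by hand.

```c
#include <assert.h>
#include <stddef.h>

typedef struct AcctContext
{
    struct AcctContext *firstchild;
    struct AcctContext *nextsibling;
    size_t      mem_allocated;  /* bytes in blocks owned by this context only */
} AcctContext;

static void
acct_add_child(AcctContext *parent, AcctContext *child)
{
    child->nextsibling = parent->firstchild;
    parent->firstchild = child;
}

/* Cheap bookkeeping: only the owning context's counter is touched. */
static void
acct_block_allocated(AcctContext *cxt, size_t blksize)
{
    cxt->mem_allocated += blksize;
}

static void
acct_block_freed(AcctContext *cxt, size_t blksize)
{
    assert(cxt->mem_allocated >= blksize);
    cxt->mem_allocated -= blksize;
}

/* The cost is paid at inquiry time: walk the subtree and add it all up. */
static size_t
acct_total(const AcctContext *cxt, int recurse)
{
    size_t      total = cxt->mem_allocated;

    if (recurse)
    {
        for (const AcctContext *child = cxt->firstchild; child != NULL;
             child = child->nextsibling)
            total += acct_total(child, 1);
    }
    return total;
}
```

Allocating or freeing a block costs one addition or subtraction, while
acct_total() with recursion visits every context in the subtree once, which
is why a subtree with a very large number of contexts makes the inquiry
expensive.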