Allow read only connections during recovery, known as Hot Standby.

Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.

New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.

This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.

Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit. Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 01:32:45 +00:00 · 2009-12-19 01:32:45 +00:00 · efc16ea520
parent 78a09145e0
commit efc16ea520
87 changed files with 6165 additions and 428 deletions
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/backup.sgml,v 2.130 2009/08/07 20:54:31 alvherre Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/backup.sgml,v 2.131 2009/12/19 01:32:30 sriggs Exp $ -->

 <chapter id="backup">
 <title>Backup and Restore</title>
@@ -1429,8 +1429,12 @@ archive_command = 'local_backup_script.sh'
   <listitem>
    <para>
     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  The recommended workaround
-     is to manually <xref linkend="sql-reindex" endterm="sql-reindex-title">
+     replay will not update these indexes.  This will mean that any new inserts
+	 will be ignored by the index, updated rows will apparently disappear and
+	 deleted rows will still retain pointers. In other words, if you modify a
+	 table with a hash index on it then you will get incorrect query results
+	 on a standby server.  It is recommended that you
+     manually <xref linkend="sql-reindex" endterm="sql-reindex-title">
     each such index after completing a recovery operation.
    </para>
   </listitem>
@@ -1883,6 +1887,772 @@ if (!triggered)
  </sect2>
 </sect1>

+ <sect1 id="hot-standby">
+  <title>Hot Standby</title>
+
+  <indexterm zone="backup">
+   <primary>Hot Standby</primary>
+  </indexterm>
+
+   <para>
+	Hot Standby is the term used to describe the ability to connect to
+	the server and run queries while the server is in archive recovery. This
+	is useful for both log shipping replication and for restoring a backup
+	to an exact state with great precision.
+	The term Hot Standby also refers to the ability of the server to move
+	from recovery through to normal running while users continue running
+	queries and/or continue their connections.
+   </para>
+
+   <para>
+	Running queries in recovery is in many ways the same as normal running
+	though there are a large number of usage and administrative points
+	to note.
+   </para>
+
+  <sect2 id="hot-standby-users">
+   <title>User's Overview</title>
+
+   <para>
+	Users can connect to the database while the server is in recovery
+	and perform read-only queries. Read-only access to catalogs and views
+	will also occur as normal.
+   </para>
+
+   <para>
+	The data on the standby takes some time to arrive from the primary server
+	so there will be a measurable delay between primary and standby. Running the
+	same query nearly simultaneously on both primary and standby might therefore
+	return differing results. We say that data on the standby is eventually
+	consistent with the primary.
+	Queries executed on the standby will be correct with regard to the transactions
+	that had been recovered at the start of the query, or start of first statement,
+	in the case of serializable transactions. In comparison with the primary,
+	the standby returns query results that could have been obtained on the primary
+	at some exact moment in the past.
+   </para>
+
+   <para>
+	When a transaction is started in recovery, the parameter
+	<varname>transaction_read_only</> will be forced to be true, regardless of the
+	<varname>default_transaction_read_only</> setting in <filename>postgresql.conf</>.
+	It can't be manually set to false either. As a result, all transactions
+	started during recovery will be limited to read-only actions. In all
+	other ways, connected sessions will appear identical to sessions
+	initiated during normal processing mode. There are no special commands
+	required to initiate a connection at this time, so all interfaces
+	work normally without change. After recovery finishes, the session
+	will allow normal read-write transactions at the start of the next
+	transaction, if these are requested.
+   </para>
+
+   <para>
+	Read-only here means "no writes to the permanent database tables".
+	There are no problems with queries that make use of transient sort and
+	work files.
+   </para>
+
+   <para>
+	The following actions are allowed
+
+	<itemizedlist>
+	 <listitem>
+	  <para>
+       Query access - SELECT, COPY TO including views and SELECT RULEs
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+       Cursor commands - DECLARE, FETCH, CLOSE
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+       Parameters - SHOW, SET, RESET
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+       Transaction management commands
+		<itemizedlist>
+		 <listitem>
+		  <para>
+		   BEGIN, END, ABORT, START TRANSACTION
+	      </para>
+	     </listitem>
+		 <listitem>
+		  <para>
+	       SAVEPOINT, RELEASE, ROLLBACK TO SAVEPOINT
+	      </para>
+	     </listitem>
+		 <listitem>
+		  <para>
+	       EXCEPTION blocks and other internal subtransactions
+	      </para>
+	     </listitem>
+		</itemizedlist>
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+       LOCK TABLE, though only when explicitly in one of these modes:
+	   ACCESS SHARE, ROW SHARE or ROW EXCLUSIVE.
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+       Plans and resources - PREPARE, EXECUTE, DEALLOCATE, DISCARD
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+       Plugins and extensions - LOAD
+      </para>
+     </listitem>
+    </itemizedlist>
+   </para>
+
+   <para>
+	These actions produce error messages
+
+	<itemizedlist>
+	 <listitem>
+	  <para>
+	   Data Manipulation Language (DML) - INSERT, UPDATE, DELETE, COPY FROM, TRUNCATE.
+	   Note that there are no allowed actions that result in a trigger
+	   being executed during recovery.
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+	   Data Definition Language (DDL) - CREATE, DROP, ALTER, COMMENT.
+	   This also applies to temporary tables, because currently their
+	   definition causes writes to catalog tables.
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+       SELECT ... FOR SHARE | UPDATE which cause row locks to be written
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+       RULEs on SELECT statements that generate DML commands.
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+       LOCK TABLE, in short default form, since it requests ACCESS EXCLUSIVE MODE.
+       LOCK TABLE that explicitly requests a mode higher than ROW EXCLUSIVE MODE.
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+       Transaction management commands that explicitly set non-read only state
+		<itemizedlist>
+		 <listitem>
+		  <para>
+			BEGIN READ WRITE,
+			START TRANSACTION READ WRITE
+	      </para>
+	     </listitem>
+		 <listitem>
+		  <para>
+			SET TRANSACTION READ WRITE,
+			SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE
+	      </para>
+	     </listitem>
+		 <listitem>
+		  <para>
+	       SET transaction_read_only = off
+	      </para>
+	     </listitem>
+		</itemizedlist>
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+       Two-phase commit commands - PREPARE TRANSACTION, COMMIT PREPARED,
+	   ROLLBACK PREPARED because even read-only transactions need to write
+	   WAL in the prepare phase (the first phase of two phase commit).
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+       sequence update - nextval()
+      </para>
+     </listitem>
+	 <listitem>
+	  <para>
+	   LISTEN, UNLISTEN, NOTIFY since they currently write to system tables
+      </para>
+     </listitem>
+    </itemizedlist>
+   </para>
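+
+   <para>
+	As a brief sketch of the two lists above, a session on the standby might
+	behave as follows. The table <literal>accounts</> and its
+	<literal>balance</> column are hypothetical names used only for
+	illustration, and the exact error text is deliberately not shown.
+
+<programlisting>
+-- Accepted during recovery
+BEGIN;
+SELECT count(*) FROM accounts;
+LOCK TABLE accounts IN ACCESS SHARE MODE;
+SAVEPOINT s1;
+ROLLBACK TO SAVEPOINT s1;
+END;
+
+-- Rejected with an error during recovery
+INSERT INTO accounts (balance) VALUES (0);
+SELECT * FROM accounts FOR UPDATE;
+SET transaction_read_only = off;
+
+BEGIN;
+LOCK TABLE accounts;    -- short form requests ACCESS EXCLUSIVE MODE
+ROLLBACK;
+</programlisting>
+   </para>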
+
+   <para>
+	Note that current behaviour of read only transactions when not in
+	recovery is to allow the last two actions, so there are small and
+	subtle differences in behaviour between read-only transactions
+	run on standby and during normal running.
+	It is possible that the restrictions on LISTEN, UNLISTEN, NOTIFY and
+	temporary tables may be lifted in a future release, if their internal
+	implementation is altered to make this possible.
+   </para>
+
+   <para>
+	If failover or switchover occurs the database will switch to normal
+	processing mode. Sessions will remain connected while the server
+	changes mode. Current transactions will continue, though will remain
+	read-only. After recovery is complete, it will be possible to initiate
+	read-write transactions.
+   </para>
+
+   <para>
+	Users will be able to tell whether their session is read-only by
+	issuing SHOW transaction_read_only.  In addition a set of
+	functions <xref linkend="functions-recovery-info-table"> allow users to
+	access information about Hot Standby. These allow you to write
+	functions that are aware of the current state of the database. These
+	can be used to monitor the progress of recovery, or to allow you to
+	write complex programs that restore the database to particular states.
+   </para>
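+
+   <para>
+	For example, a session could check its own state as sketched below.
+	This assumes that <function>pg_is_in_recovery()</> is one of the
+	functions listed in <xref linkend="functions-recovery-info-table">.
+	While the server is in recovery the first command is expected to
+	report that the transaction is read-only.
+
+<programlisting>
+SHOW transaction_read_only;
+SELECT pg_is_in_recovery();
+</programlisting>
+   </para>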
+
+   <para>
+	In recovery, transactions will not be permitted to take any table lock
+	higher than RowExclusiveLock. In addition, transactions may never assign
+	a TransactionId and may never write WAL.
+	Any <command>LOCK TABLE</> command that runs on the standby and requests
+	a specific lock mode higher than ROW EXCLUSIVE MODE will be rejected.
+   </para>
+
+   <para>
+	In general queries will not experience lock conflicts with the database
+	changes made by recovery. This is because recovery follows normal
+	concurrency control mechanisms, known as <acronym>MVCC</>. There are
+	some types of change that will cause conflicts, covered in the following
+	section.
+   </para>
+  </sect2>
+
+  <sect2 id="hot-standby-conflict">
+   <title>Handling query conflicts</title>
+
+   <para>
+	The primary and standby nodes are in many ways loosely connected. Actions
+	on the primary will have an effect on the standby. As a result, there is
+	potential for negative interactions or conflicts between them. The easiest
+	conflict to understand is performance: if a huge data load is taking place
+	on the primary then this will generate a similar stream of WAL records on the
+	standby, so standby queries may contend for system resources, such as I/O.
+   </para>
+
+   <para>
+	There are also additional types of conflict that can occur with Hot Standby.
+	These conflicts are <emphasis>hard conflicts</> in the sense that we may
+	need to cancel queries and in some cases disconnect sessions to resolve them.
+	The user is provided with a number of optional ways to handle these
+	conflicts, though we must first understand the possible reasons behind a conflict.
+
+	  <itemizedlist>
+	   <listitem>
+	    <para>
+		 Access Exclusive Locks from primary node, including both explicit
+		 LOCK commands and various kinds of DDL action
+	    </para>
+	   </listitem>
+	   <listitem>
+	    <para>
+		 Dropping tablespaces on the primary while standby queries are using
+		 those tablespaces for temporary work files (work_mem overflow)
+	    </para>
+	   </listitem>
+	   <listitem>
+	    <para>
+		 Dropping databases on the primary while users are connected to that database on standby.
+	    </para>
+	   </listitem>
+	   <listitem>
+	    <para>
+		 Waiting to acquire buffer cleanup locks (for which there is no time out)
+	    </para>
+	   </listitem>
+	   <listitem>
+	    <para>
+		 Early cleanup of data still visible to the current query's snapshot
+	    </para>
+	   </listitem>
+	  </itemizedlist>
+   </para>
+
+   <para>
+	Some WAL redo actions will be for DDL actions. These DDL actions are
+	repeating actions that have already committed on the primary node, so
+	they must not fail on the standby node. These DDL locks take priority
+	and will automatically <emphasis>cancel</> any read-only transactions that get in
+	their way, after a grace period. This is similar to the possibility of
+	being canceled by the deadlock detector, but in this case the standby
+	process always wins, since the replayed actions must not fail. This
+	also ensures that replication doesn't fall behind while we wait for a
+	query to complete. Again, we assume that the standby is there for high
+	availability purposes primarily.
+   </para>
+
+   <para>
+	An example of the above would be an administrator on the primary server
+	running a <command>DROP TABLE</> on a table that is currently being queried
+	on the standby server.
+	Clearly the query cannot continue if we let the <command>DROP TABLE</>
+	proceed. If this situation occurred on the primary, the <command>DROP TABLE</>
+	would wait until the query has finished. When the query is on the standby
+	and the <command>DROP TABLE</> is on the primary, the primary doesn't have
+	information about which queries are running on the standby and so the
+	<command>DROP TABLE</> does not wait on the primary. The WAL change records come through to the
+	standby while the standby query is still running, causing a conflict.
+   </para>
+
+   <para>
+	The most common reason for conflict between standby queries and WAL redo is
+	"early cleanup". Normally, <productname>PostgreSQL</> allows cleanup of old
+	row versions when there are no users who may need to see them to ensure correct
+	visibility of data (the heart of MVCC). If there is a standby query that has
+	been running for longer than any query on the primary then it is possible
+	for old row versions to be removed by either a vacuum or HOT. This will
+	then generate WAL records that, if applied, would remove data on the
+	standby that might <emphasis>potentially</> be required by the standby query.
+	In more technical language, the primary's xmin horizon is later than
+	the standby's xmin horizon, allowing dead rows to be removed.
+   </para>
+
+   <para>
+	Experienced users should note that both row version cleanup and row version
+	freezing will potentially conflict with recovery queries. Running a
+	manual <command>VACUUM FREEZE</> is likely to cause conflicts even on tables
+	with no updated or deleted rows.
+   </para>
+
+   <para>
+	We have a number of choices for resolving query conflicts.  The default
+	is that we wait and hope the query completes. The server will wait
+	automatically until the lag between primary and standby is at most
+	<varname>max_standby_delay</> seconds. Once that grace period expires,
+	we take one of the following actions:
+
+	  <itemizedlist>
+	   <listitem>
+	    <para>
+		 If the conflict is caused by a lock, we cancel the conflicting standby
+		 transaction immediately. If the transaction is idle-in-transaction
+		 then currently we abort the session instead, though this may change
+		 in the future.
+	    </para>
+	   </listitem>
+
+	   <listitem>
+	    <para>
+		 If the conflict is caused by cleanup records we tell the standby query
+		 that a conflict has occurred and that it must cancel itself to avoid the
+		 risk that it silently fails to read relevant data because
+		 that data has been removed. (This is regrettably very similar to the
+		 much feared and iconic error message "snapshot too old"). Some cleanup
+		 records only cause conflict with older queries, though some types of
+		 cleanup record affect all queries.
+	    </para>
+
+	    <para>
+		 If cancellation does occur, the query and/or transaction can always
+		 be re-executed. The error is dynamic and will not necessarily occur
+		 the same way if the query is executed again.
+	    </para>
+	   </listitem>
+	  </itemizedlist>
+   </para>
+
+   <para>
+	<varname>max_standby_delay</> is set in <filename>postgresql.conf</>.
+	The parameter applies to the server as a whole so if the delay is all used
+	up by a single query then there may be little or no waiting for queries that
+	follow immediately, though they will have benefited equally from the initial
+	waiting period. The server may take time to catch up again before the grace
+	period is available again, though if there is a heavy and constant stream
+	of conflicts it may seldom catch up fully.
+   </para>
+
+   <para>
+	Users should be clear that tables that are regularly and heavily updated on
+	primary server will quickly cause cancellation of longer running queries on
+	the standby. In those cases <varname>max_standby_delay</> can be
+	considered similar, though not identical, to setting
+	<varname>statement_timeout</>.
+    </para>
+
+   <para>
+	Other remedial actions exist if the number of cancellations is unacceptable.
+	The first option is to connect to primary server and keep a query active
+	for as long as we need to run queries on the standby. This guarantees that
+	a WAL cleanup record is never generated and we don't ever get query
+	conflicts as described above. This could be done using contrib/dblink
+	and pg_sleep(), or via other mechanisms. If you do this, you should note
+	that this will delay cleanup of dead rows by vacuum or HOT and many
+	people may find this undesirable. However, we should remember that
+	primary and standby nodes are linked via the WAL, so this situation is no
+	different to the case where we ran the query on the primary node itself
+	except we have the benefit of off-loading the execution onto the standby.
+   </para>
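+
+   <para>
+	A minimal sketch of this approach is shown below. It assumes that
+	contrib/dblink is installed in the database from which it is run; the
+	connection name <literal>keepalive</>, the connection string and the
+	one hour sleep are illustrative values only.
+
+<programlisting>
+-- Open a connection to the primary and start a long-running query there
+SELECT dblink_connect('keepalive', 'host=primary.example.com dbname=mydb');
+SELECT dblink_send_query('keepalive', 'SELECT pg_sleep(3600)');
+
+-- ... run the long queries on the standby, then release the primary ...
+
+SELECT dblink_cancel_query('keepalive');
+SELECT dblink_disconnect('keepalive');
+</programlisting>
+
+	The query is sent asynchronously so that the calling session is not
+	blocked for the duration of the sleep.
+   </para>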
+
+   <para>
+	It is also possible to set <varname>vacuum_defer_cleanup_age</> on the primary
+	to defer the cleanup of records by autovacuum, vacuum and HOT. This may allow
+	more time for queries to execute before they are cancelled on the standby,
+	without the need for setting a high <varname>max_standby_delay</>.
+   </para>
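+
+   <para>
+	As a sketch, <filename>postgresql.conf</> on the primary might contain
+	the line below. The value is purely illustrative and is counted in
+	transactions, so a suitable number depends on the primary's write rate.
+
+<programlisting>
+# postgresql.conf on the primary
+vacuum_defer_cleanup_age = 10000
+</programlisting>
+   </para>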
+
+   <para>
+	Three-way deadlocks are possible between AccessExclusiveLocks arriving from
+	the primary, cleanup WAL records that require buffer cleanup locks and
+	user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks
+	are currently resolved by the cancellation of user processes that would
+	need to wait on a lock. This is heavy-handed and generates more query
+	cancellations than we need to, though does remove the possibility of deadlock.
+	This behaviour is expected to improve substantially for the main release
+	version of 8.5.
+   </para>
+
+   <para>
+	Dropping tablespaces or databases is discussed in the administrator's
+	section since they are not typical user situations.
+   </para>
+  </sect2>
+
+  <sect2 id="hot-standby-admin">
+   <title>Administrator's Overview</title>
+
+   <para>
+	If there is a <filename>recovery.conf</> file present the server will start
+	in Hot Standby mode by default, though <varname>recovery_connections</> can
+	be disabled via <filename>postgresql.conf</>, if required. The server may take
+	some time to enable recovery connections since the server must first complete
+	sufficient recovery to provide a consistent state against which queries
+	can run before enabling read only connections. Look for these messages
+	in the server logs
+
+<programlisting>
+LOG:  initializing recovery connections
+
+... then some time later ...
+
+LOG:  consistent recovery state reached
+LOG:  database system is ready to accept read only connections
+</programlisting>
+
+	Consistency information is recorded once per checkpoint on the primary, as long
+	as <varname>recovery_connections</> is enabled (on the primary). If this parameter
+	is disabled, it will not be possible to enable recovery connections on the standby.
+	The consistent state can also be delayed in the presence of both of these conditions
+
+	  <itemizedlist>
+	   <listitem>
+	    <para>
+		 a write transaction has more than 64 subtransactions
+	    </para>
+	   </listitem>
+	   <listitem>
+	    <para>
+		 very long-lived write transactions
+	    </para>
+	   </listitem>
+	  </itemizedlist>
+
+	If you are running file-based log shipping ("warm standby"), you may need
+	to wait until the next WAL file arrives, which could be as long as the
+	<varname>archive_timeout</> setting on the primary.
+   </para>
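+
+   <para>
+	A minimal configuration sketch for a standby follows. The archive
+	directory used in <varname>restore_command</> is an assumed example
+	location and must be replaced with the real one.
+
+<programlisting>
+# recovery.conf on the standby
+restore_command = 'cp /mnt/server/archivedir/%f %p'
+
+# postgresql.conf on both the primary and the standby
+recovery_connections = on
+</programlisting>
+   </para>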
+
+   <para>
+	The setting of some parameters on the standby will need reconfiguration
+	if they have been changed on the primary. The value on the standby must
+	be equal to or greater than the value on the primary. If these parameters
+	are not set high enough then the standby will not be able to track work
+	correctly from recovering transactions. If these values are set too low
+	the server will halt. Higher values can then be supplied and the server
+	restarted to begin recovery again.
+
+	  <itemizedlist>
+	   <listitem>
+	    <para>
+		 <varname>max_connections</>
+	    </para>
+	   </listitem>
+	   <listitem>
+	    <para>
+		 <varname>max_prepared_transactions</>
+	    </para>
+	   </listitem>
+	   <listitem>
+	    <para>
+		 <varname>max_locks_per_transaction</>
+	    </para>
+	   </listitem>
+	  </itemizedlist>
+   </para>
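+
+   <para>
+	For instance, if the primary runs with the first set of values below,
+	the standby needs values at least as large. All numbers are illustrative.
+
+<programlisting>
+# postgresql.conf on the primary
+max_connections = 100
+max_prepared_transactions = 0
+max_locks_per_transaction = 64
+
+# postgresql.conf on the standby: equal or higher in every case
+max_connections = 100
+max_prepared_transactions = 0
+max_locks_per_transaction = 128
+</programlisting>
+   </para>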
+
+   <para>
+	It is important that the administrator consider the appropriate setting
+	of <varname>max_standby_delay</>, set in <filename>postgresql.conf</>.
+	There is no optimal setting; it should be set according to business
+	priorities. For example if the server is primarily tasked as a High
+	Availability server, then you may wish to lower
+	<varname>max_standby_delay</> or even set it to zero, though that is a
+	very aggressive setting. If the standby server is tasked as an additional
+	server for decision support queries then it may be acceptable to set this
+	to a value of many hours (in seconds).  It is also possible to set
+	<varname>max_standby_delay</> to -1 which means wait forever for queries
+	to complete, if there are conflicts; this will be useful when performing
+	an archive recovery from a backup.
+   </para>
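+
+   <para>
+	The three cases described above could be written as follows in the
+	standby's <filename>postgresql.conf</>. Only one line would be active
+	at a time, and the four hour figure is an illustrative choice.
+
+<programlisting>
+# High availability standby: do not wait for conflicting queries
+max_standby_delay = 0
+
+# Decision support standby: tolerate up to four hours (value in seconds)
+#max_standby_delay = 14400
+
+# Archive recovery from a backup: wait forever for queries to complete
+#max_standby_delay = -1
+</programlisting>
+   </para>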
+
+   <para>
+	Transaction status "hint bits" written on primary are not WAL-logged,
+	so data on standby will likely re-write the hints again on the standby.
+	Thus the main database blocks will produce write I/Os even though
+	all users are read-only; no changes have occurred to the data values
+	themselves.  Users will be able to write large sort temp files and
+	re-generate relcache info files, so there is no part of the database
+	that is truly read-only during hot standby mode. There is no restriction
+	on the use of set returning functions, or other users of tuplestore/tuplesort
+	code. Note also that writes to remote databases will still be possible,
+	even though the transaction is read-only locally.
+   </para>
+
+   <para>
+	The following types of administrator command are not accepted
+	during recovery mode
+
+	  <itemizedlist>
+	   <listitem>
+	    <para>
+	     Data Definition Language (DDL) - e.g. CREATE INDEX
+	    </para>
+	   </listitem>
+	   <listitem>
+	    <para>
+	     Privilege and Ownership - GRANT, REVOKE, REASSIGN
+	    </para>
+	   </listitem>
+	   <listitem>
+	    <para>
+	     Maintenance commands - ANALYZE, VACUUM, CLUSTER, REINDEX
+	    </para>
+	   </listitem>
+	  </itemizedlist>
+   </para>
+
+   <para>
+	Note again that some of these commands are actually allowed during
+	"read only" mode transactions on the primary.
+   </para>
+
+   <para>
+	As a result, you cannot create additional indexes that exist solely
+	on the standby, nor can you create statistics that exist solely on the standby.
+	If these administrator commands are needed they should be executed
+	on the primary so that the changes will propagate through to the
+	standby.
+   </para>
+
+   <para>
+	<function>pg_cancel_backend()</> will work on user backends, but not the
+	Startup process, which performs recovery. pg_stat_activity does not
+	show an entry for the Startup process, nor do recovering transactions
+	show as active. As a result, pg_prepared_xacts is always empty during
+	recovery. If you wish to resolve in-doubt prepared transactions
+	then look at pg_prepared_xacts on the primary and issue commands to
+	resolve those transactions there.
+   </para>
+
+   <para>
+	pg_locks will show locks held by backends as normal. pg_locks also shows
+	a virtual transaction managed by the Startup process that owns all
+	AccessExclusiveLocks held by transactions being replayed by recovery.
+	Note that Startup process does not acquire locks to
+	make database changes and thus locks other than AccessExclusiveLocks
+	do not show in pg_locks for the Startup process, they are just presumed
+	to exist.
+   </para>
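+
+   <para>
+	As a sketch, the locks held on behalf of replayed transactions could be
+	listed with a query such as the one below. The cast to
+	<type>regclass</> is only meaningful for relations in the database the
+	session is connected to.
+
+<programlisting>
+SELECT locktype, relation::regclass, virtualtransaction, pid, mode, granted
+  FROM pg_locks
+ WHERE mode = 'AccessExclusiveLock';
+</programlisting>
+   </para>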
+
+   <para>
+	<productname>check_pgsql</> will work, but it is very simple.
+	<productname>check_postgres</> will also work, though some actions
+	could give different or confusing results.
+	For example, last vacuum time will not be maintained, since no
+	vacuum occurs on the standby (though vacuums running on the primary do
+	send their changes to the standby).
+   </para>
+
+   <para>
+	WAL file control commands will not work during recovery
+	e.g. <function>pg_start_backup</>, <function>pg_switch_xlog</> etc.
+   </para>
+
+   <para>
+	Dynamically loadable modules work, including pg_stat_statements.
+   </para>
+
+   <para>
+	Advisory locks work normally in recovery, including deadlock detection.
+	Note that advisory locks are never WAL logged, so it is not possible for
+	an advisory lock on either the primary or the standby to conflict with WAL
+	replay. Nor is it possible to acquire an advisory lock on the primary
+	and have it initiate a similar advisory lock on the standby. Advisory
+	locks relate only to a single server on which they are acquired.
+   </para>
+
+   <para>
+	Trigger-based replication systems such as <productname>Slony</>,
+	<productname>Londiste</> and <productname>Bucardo</> won't run on the
+	standby at all, though they will run happily on the primary server as
+	long as the changes are not sent to standby servers to be applied.
+	WAL replay is not trigger-based so you cannot relay from the
+	standby to any system that requires additional database writes or
+	relies on the use of triggers.
+   </para>
+
+   <para>
+	New oids cannot be assigned, though some <acronym>UUID</> generators may still
+	work as long as they do not rely on writing new status to the database.
+   </para>
+
+   <para>
+	Currently, temp table creation is not allowed during read only
+	transactions, so in some cases existing scripts will not run correctly.
+	It is possible we may relax that restriction in a later release. This is
+	both a SQL Standard compliance issue and a technical issue.
+   </para>
+
+   <para>
+	<command>DROP TABLESPACE</> can only succeed if the tablespace is empty.
+	Some standby users may be actively using the tablespace via their
+	<varname>temp_tablespaces</> parameter. If there are temp files in the
+	tablespace we currently cancel all active queries to ensure that temp
+	files are removed, so that we can remove the tablespace and continue with
+	WAL replay.
+   </para>
+
+   <para>
+	Running <command>DROP DATABASE</>, <command>ALTER DATABASE ... SET TABLESPACE</>,
+	or <command>ALTER DATABASE ... RENAME</> on primary will generate a log message
+	that will cause all users connected to that database on the standby to be
+	forcibly disconnected, once <varname>max_standby_delay</> has been reached.
+   </para>
+
+   <para>
+	In normal running, if you issue <command>DROP USER</> or <command>DROP ROLE</>
+	for a role with login capability while that user is still connected then
+	nothing happens to the connected user - they remain connected. The user cannot
+	reconnect however. This behaviour applies in recovery also, so a
+	<command>DROP USER</> on the primary does not disconnect that user on the standby.
+   </para>
+
+   <para>
+	Stats collector is active during recovery. All scans, reads, blocks,
+	index usage etc will all be recorded normally on the standby. Replayed
+	actions will not duplicate their effects on primary, so replaying an
+	insert will not increment the Inserts column of pg_stat_user_tables.
+	The stats file is deleted at start of recovery, so stats from primary
+	and standby will differ; this is considered a feature not a bug.
+   </para>
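+
+   <para>
+	For example, a query along the following lines could be run on the
+	standby to inspect locally recorded activity. Here
+	<structfield>n_tup_ins</> is taken to be the inserts counter referred
+	to above, which replayed inserts do not increment.
+
+<programlisting>
+SELECT relname, seq_scan, idx_scan, n_tup_ins
+  FROM pg_stat_user_tables
+ ORDER BY relname;
+</programlisting>
+   </para>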
+
+   <para>
+	Autovacuum is not active during recovery, though will start normally
+	at the end of recovery.
+   </para>
+
+   <para>
+	Background writer is active during recovery and will perform
+	restartpoints (similar to checkpoints on primary) and normal block
+	cleaning activities. The <command>CHECKPOINT</> command is accepted during recovery,
+	though performs a restartpoint rather than a new checkpoint.
+   </para>
+  </sect2>
+
+  <sect2 id="hot-standby-parameters">
+   <title>Hot Standby Parameter Reference</title>
+
+   <para>
+	Various parameters have been mentioned above in the <xref linkend="hot-standby-admin">
+	and <xref linkend="hot-standby-conflict"> sections.
+   </para>
+
+   <para>
+	On the primary, parameters <varname>recovery_connections</> and
+	<varname>vacuum_defer_cleanup_age</> can be used to enable and control the
+	primary server to assist the successful configuration of Hot Standby servers.
+	<varname>max_standby_delay</> has no effect if set on the primary.
+   </para>
+
+   <para>
+	On the standby, parameters <varname>recovery_connections</> and
+	<varname>max_standby_delay</> can be used to enable and control Hot Standby.
+	<varname>vacuum_defer_cleanup_age</> has no effect during recovery.
+   </para>
+  </sect2>
+
+  <sect2 id="hot-standby-caveats">
+   <title>Caveats</title>
+
+   <para>
+    At this writing, there are several limitations of Hot Standby.
+    These can and probably will be fixed in future releases:
+
+  <itemizedlist>
+   <listitem>
+    <para>
+     Operations on hash indexes are not presently WAL-logged, so
+     replay will not update these indexes.  Hash indexes will not be
+	 used for query plans during recovery.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Full knowledge of running transactions is required before snapshots
+	 may be taken. Transactions that use large numbers of subtransactions
+	 (currently greater than 64) will delay the start of read only
+	 connections until the completion of the longest running write transaction.
+	 If this situation occurs explanatory messages will be sent to server log.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Valid starting points for recovery connections are generated at each
+	 checkpoint on the master. If the standby is shut down while the master
+	 is in a shutdown state it may not be possible to re-enter Hot Standby
+	 until the primary is started up so that it generates further starting
+	 points in the WAL logs. This is not considered a serious issue
+	 because the standby is usually switched into the primary role while
+	 the first node is taken down.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     At the end of recovery, AccessExclusiveLocks held by prepared transactions
+	 will require twice the normal number of lock table entries. If you plan
+	 on running either a large number of concurrent prepared transactions
+	 that normally take AccessExclusiveLocks, or one large transaction that
+	 takes many AccessExclusiveLocks, you are advised to select a larger
+	 value of <varname>max_locks_per_transaction</>. In rare extremes this
+	 may need to be up to, but never more than, twice the value of the
+	 parameter setting on the primary server. You need not consider this at all if
+	 your setting of <varname>max_prepared_transactions</> is <literal>0</>.
+    </para>
+   </listitem>
+  </itemizedlist>
+
+   </para>
+  </sect2>
+
+ </sect1>
+
 <sect1 id="migration">
  <title>Migration Between Releases</title>

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/config.sgml,v 1.238 2009/12/17 14:36:16 rhaas Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/config.sgml,v 1.239 2009/12/19 01:32:31 sriggs Exp $ -->

 <chapter Id="runtime-config">
  <title>Server Configuration</title>
@ -376,6 +376,12 @@ SET ENABLE_SEQSCAN TO OFF;
        allows. See <xref linkend="sysvipc"> for information on how to
        adjust those parameters, if necessary.
       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
      </listitem>
     </varlistentry>

@ -826,6 +832,12 @@ SET ENABLE_SEQSCAN TO OFF;
        allows. See <xref linkend="sysvipc"> for information on how to
        adjust those parameters, if necessary.
       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
      </listitem>
     </varlistentry>

@ -1733,6 +1745,51 @@ archive_command = 'copy "%p" "C:\\server\\archivedir\\%f"'  # Windows
     
     </variablelist>
    </sect2>
+
+    <sect2 id="runtime-config-standby">
+    <title>Standby Servers</title>
+
+    <variablelist>
+
+     <varlistentry id="recovery-connections" xreflabel="recovery_connections">
+      <term><varname>recovery_connections</varname> (<type>boolean</type>)</term>
+      <listitem>
+       <para>
+		This parameter has two roles. During recovery, it specifies whether or
+		not you can connect and run queries, enabling <xref linkend="hot-standby">.
+		During normal running, it specifies whether additional information is written
+		to WAL to allow recovery connections on a standby server that reads
+		WAL data generated by this server. The default value is
+        <literal>on</literal>.  It is thought that there is little
+		measurable difference in performance from using this feature, so
+		feedback is welcome if any production impacts are noticeable.
+		It is likely that this parameter will be removed in later releases.
+        This parameter can only be set at server start.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="max-standby-delay" xreflabel="max_standby_delay">
+      <term><varname>max_standby_delay</varname> (<type>string</type>)</term>
+      <listitem>
+       <para>
+		When the server acts as a standby, this parameter specifies a wait policy
+		for queries that conflict with incoming data changes. Valid settings
+		are -1, meaning wait forever, or a wait time of 0 or more seconds.
+		If a conflict should occur the server will delay up to this
+		amount before it begins trying to resolve things less amicably, as
+		described in <xref linkend="hot-standby-conflict">. Typically,
+		this parameter makes sense only during replication, so when
+		performing an archive recovery to recover from data loss a
+		parameter setting of 0 is recommended.  The default is 30 seconds.
+        This parameter can only be set in the <filename>postgresql.conf</>
+        file or on the server command line.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     </variablelist>
+    </sect2>
   </sect1>

   <sect1 id="runtime-config-query">
@ -4161,6 +4218,29 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
      </listitem>
     </varlistentry>

+     <varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
+      <term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)</term>
+      <indexterm>
+       <primary><varname>vacuum_defer_cleanup_age</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Specifies the number of transactions by which <command>VACUUM</> and
+		<acronym>HOT</> updates will defer cleanup of dead row versions. The
+		default is 0 transactions, meaning that dead row versions will be
+		removed as soon as possible. You may wish to set this to a non-zero
+		value when planning or maintaining a <xref linkend="hot-standby">
+		configuration. The recommended value is <literal>0</> unless you have
+		clear reason to increase it. The purpose of the parameter is to
+		allow the user to specify an approximate time delay before cleanup
+		occurs. However, it should be noted that there is no direct link with
+		any specific time delay and so the results will be application and
+		installation specific, as well as variable over time, depending upon
+		the transaction rate (of writes only).
+       </para>
+      </listitem>
+     </varlistentry>
+
     <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
      <term><varname>bytea_output</varname> (<type>enum</type>)</term>
      <indexterm>
@ -4689,6 +4769,12 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
        allows. See <xref linkend="sysvipc"> for information on how to
        adjust those parameters, if necessary.
       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
      </listitem>
     </varlistentry>

@ -5546,6 +5632,32 @@ plruby.use_strict = true        # generates error: unknown class name
      </listitem>
     </varlistentry>

+     <varlistentry id="guc-trace-recovery-messages" xreflabel="trace_recovery_messages">
+      <term><varname>trace_recovery_messages</varname> (<type>string</type>)</term>
+      <indexterm>
+       <primary><varname>trace_recovery_messages</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Controls which message levels are written to the server log
+        for system modules needed for recovery processing. This allows
+        the user to override the normal setting of log_min_messages,
+        but only for specific messages. This is intended for use in
+        debugging Hot Standby.
+        Valid values are <literal>DEBUG5</>, <literal>DEBUG4</>,
+        <literal>DEBUG3</>, <literal>DEBUG2</>, <literal>DEBUG1</>,
+        <literal>INFO</>, <literal>NOTICE</>, <literal>WARNING</>,
+        <literal>ERROR</>, <literal>LOG</>, <literal>FATAL</>, and
+        <literal>PANIC</>.  Each level includes all the levels that
+        follow it.  The later the level, the fewer messages are sent
+        to the log.  The default is <literal>WARNING</>.  Note that
+        <literal>LOG</> has a different rank here than in
+        <varname>client_min_messages</>.
+        This parameter should be set in the <filename>postgresql.conf</> file only.
+       </para>
+      </listitem>
+     </varlistentry>
+
    <varlistentry id="guc-zero-damaged-pages" xreflabel="zero_damaged_pages">
      <term><varname>zero_damaged_pages</varname> (<type>boolean</type>)</term>
      <indexterm>
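Note on the max_standby_delay entry in the config.sgml changes above: the
wait policy it describes can be sketched in a few lines of C. This is only
an illustration of the documented settings (-1 waits forever, 0 or more is
a wait time in seconds), not the server's implementation. The enum, the
function name and the waited_secs argument are hypothetical.

```c
#include <assert.h>

/* Hypothetical outcome names; not taken from the PostgreSQL source. */
typedef enum
{
	STANDBY_KEEP_WAITING,
	STANDBY_CANCEL_QUERIES
} StandbyConflictAction;

/*
 * Decide what replay does about a conflicting query, given the
 * max_standby_delay setting (in seconds) and how long replay has
 * already waited for that query.
 */
static StandbyConflictAction
standby_conflict_action(int max_standby_delay, int waited_secs)
{
	assert(max_standby_delay >= -1 && waited_secs >= 0);

	if (max_standby_delay == -1)
		return STANDBY_KEEP_WAITING;	/* -1 means wait forever */
	if (waited_secs >= max_standby_delay)
		return STANDBY_CANCEL_QUERIES;	/* a setting of 0 cancels at once */
	return STANDBY_KEEP_WAITING;
}
```

With the default of 30 seconds, a conflicting query is left alone for the
first 30 seconds of waiting and becomes a cancellation candidate after that.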
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/func.sgml,v 1.493 2009/12/15 17:57:46 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/func.sgml,v 1.494 2009/12/19 01:32:31 sriggs Exp $ -->

 <chapter id="functions">
  <title>Functions and Operators</title>
@ -13132,6 +13132,38 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
    <xref linkend="continuous-archiving">.
   </para>

+   <indexterm>
+    <primary>pg_is_in_recovery</primary>
+   </indexterm>
+
+   <para>
+    The functions shown in <xref
+    linkend="functions-recovery-info-table"> provide information
+	about the current status of Hot Standby.
+    These functions may be executed both during recovery and in normal running.
+   </para>
+
+   <table id="functions-recovery-info-table">
+    <title>Recovery Information Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_is_in_recovery</function>()</literal>
+        </entry>
+       <entry><type>bool</type></entry>
+       <entry>True if recovery is still in progress.
+	   </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
   <para>
    The functions shown in <xref linkend="functions-admin-dbsize"> calculate
    the disk space usage of database objects.
--- a/doc/src/sgml/ref/checkpoint.sgml
+++ b/doc/src/sgml/ref/checkpoint.sgml
@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/ref/checkpoint.sgml,v 1.16 2008/11/14 10:22:45 petere Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/ref/checkpoint.sgml,v 1.17 2009/12/19 01:32:31 sriggs Exp $ -->

 <refentry id="sql-checkpoint">
 <refmeta>
@ -42,6 +42,11 @@ CHECKPOINT
   <xref linkend="wal"> for more information about the WAL system.
  </para>

+  <para>
+   If executed during recovery, the <command>CHECKPOINT</command> command
+   will force a restartpoint rather than writing a new checkpoint.
+  </para>
+
  <para>
   Only superusers can call <command>CHECKPOINT</command>.  The command is
   not intended for use during normal operation.
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@ -8,7 +8,7 @@
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
- *			 $PostgreSQL: pgsql/src/backend/access/gin/ginxlog.c,v 1.19 2009/06/11 14:48:53 momjian Exp $
+ *			 $PostgreSQL: pgsql/src/backend/access/gin/ginxlog.c,v 1.20 2009/12/19 01:32:31 sriggs Exp $
 *-------------------------------------------------------------------------
 */
 #include "postgres.h"
@ -621,6 +621,10 @@ gin_redo(XLogRecPtr lsn, XLogRecord *record)
 {
 	uint8		info = record->xl_info & ~XLR_INFO_MASK;

+	/*
+	 * GIN indexes do not require any conflict processing.
+	 */
+
 	RestoreBkpBlocks(lsn, record, false);

 	topCtx = MemoryContextSwitchTo(opCtx);
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@ -8,7 +8,7 @@
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
- *			 $PostgreSQL: pgsql/src/backend/access/gist/gistxlog.c,v 1.32 2009/01/20 18:59:36 heikki Exp $
+ *			 $PostgreSQL: pgsql/src/backend/access/gist/gistxlog.c,v 1.33 2009/12/19 01:32:32 sriggs Exp $
 *-------------------------------------------------------------------------
 */
 #include "postgres.h"
@ -396,6 +396,12 @@ gist_redo(XLogRecPtr lsn, XLogRecord *record)
 	uint8		info = record->xl_info & ~XLR_INFO_MASK;
 	MemoryContext oldCxt;

+	/*
+	 * GIST indexes do not require any conflict processing. NB: If we ever
+	 * implement an optimization similar to the one we have in b-tree, and
+	 * remove killed tuples outside VACUUM, we'll need to handle that here.
+	 */
+
 	RestoreBkpBlocks(lsn, record, false);

 	oldCxt = MemoryContextSwitchTo(opCtx);
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/heap/heapam.c,v 1.278 2009/08/24 02:18:31 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/heap/heapam.c,v 1.279 2009/12/19 01:32:32 sriggs Exp $
 *
 *
 * INTERFACE ROUTINES
@ -59,6 +59,7 @@
 #include "storage/lmgr.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
@ -248,8 +249,11 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
 	/*
 	 * If the all-visible flag indicates that all tuples on the page are
 	 * visible to everyone, we can skip the per-tuple visibility tests.
+	 * But not in hot standby mode. A tuple that's already visible to all
+	 * transactions in the master might still be invisible to a read-only
+	 * transaction in the standby.
 	 */
-	all_visible = PageIsAllVisible(dp);
+	all_visible = PageIsAllVisible(dp) && !snapshot->takenDuringRecovery;

 	for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
 		 lineoff <= lines;
@ -3769,6 +3773,60 @@ heap_restrpos(HeapScanDesc scan)
 	}
 }

+/*
+ * If 'tuple' contains any XID greater than latestRemovedXid, update
+ * latestRemovedXid to the greatest one found.
+ */
+void
+HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
+									   TransactionId *latestRemovedXid)
+{
+	TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
+	TransactionId xmax = HeapTupleHeaderGetXmax(tuple);
+	TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
+
+	if (tuple->t_infomask & HEAP_MOVED_OFF ||
+		tuple->t_infomask & HEAP_MOVED_IN)
+	{
+		if (TransactionIdPrecedes(*latestRemovedXid, xvac))
+			*latestRemovedXid = xvac;
+	}
+
+	if (TransactionIdPrecedes(*latestRemovedXid, xmax))
+		*latestRemovedXid = xmax;
+
+	if (TransactionIdPrecedes(*latestRemovedXid, xmin))
+		*latestRemovedXid = xmin;
+
+	Assert(TransactionIdIsValid(*latestRemovedXid));
+}
+
+/*
+ * Perform XLogInsert to register a heap cleanup info message. These
+ * messages are sent once per VACUUM and are required because
+ * of the phasing of removal operations during a lazy VACUUM.
+ * see comments for vacuum_log_cleanup_info().
+ */
+XLogRecPtr
+log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
+{
+	xl_heap_cleanup_info xlrec;
+	XLogRecPtr	recptr;
+	XLogRecData rdata;
+
+	xlrec.node = rnode;
+	xlrec.latestRemovedXid = latestRemovedXid;
+
+	rdata.data = (char *) &xlrec;
+	rdata.len = SizeOfHeapCleanupInfo;
+	rdata.buffer = InvalidBuffer;
+	rdata.next = NULL;
+
+	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO, &rdata);
+
+	return recptr;
+}
+
 /*
 * Perform XLogInsert for a heap-clean operation.  Caller must already
 * have modified the buffer and marked it dirty.
@ -3776,13 +3834,17 @@ heap_restrpos(HeapScanDesc scan)
 * Note: prior to Postgres 8.3, the entries in the nowunused[] array were
 * zero-based tuple indexes.  Now they are one-based like other uses
 * of OffsetNumber.
+ *
+ * We also include latestRemovedXid, which is the greatest XID present in
+ * the removed tuples. That allows recovery processing to cancel or wait
+ * for long standby queries that can still see these tuples.
 */
 XLogRecPtr
 log_heap_clean(Relation reln, Buffer buffer,
 			   OffsetNumber *redirected, int nredirected,
 			   OffsetNumber *nowdead, int ndead,
 			   OffsetNumber *nowunused, int nunused,
-			   bool redirect_move)
+			   TransactionId latestRemovedXid, bool redirect_move)
 {
 	xl_heap_clean xlrec;
 	uint8		info;
@ -3794,6 +3856,7 @@ log_heap_clean(Relation reln, Buffer buffer,

 	xlrec.node = reln->rd_node;
 	xlrec.block = BufferGetBlockNumber(buffer);
+	xlrec.latestRemovedXid = latestRemovedXid;
 	xlrec.nredirected = nredirected;
 	xlrec.ndead = ndead;

@ -4067,6 +4130,33 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
 	return recptr;
 }

+/*
+ * Handles CLEANUP_INFO
+ */
+static void
+heap_xlog_cleanup_info(XLogRecPtr lsn, XLogRecord *record)
+{
+	xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
+
+	if (InHotStandby)
+	{
+		VirtualTransactionId *backends;
+
+		backends = GetConflictingVirtualXIDs(xlrec->latestRemovedXid,
+											 InvalidOid,
+											 true);
+		ResolveRecoveryConflictWithVirtualXIDs(backends,
+											   "VACUUM index cleanup",
+											   CONFLICT_MODE_ERROR);
+	}
+
+	/*
+	 * Actual operation is a no-op. Record type exists to provide a means
+	 * for conflict processing to occur before we begin index vacuum actions.
+	 * see vacuumlazy.c and also comments in btvacuumpage()
+	 */
+}
+
 /*
 * Handles CLEAN and CLEAN_MOVE record types
 */
@ -4085,12 +4175,31 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
 	int			nunused;
 	Size		freespace;

+	/*
+	 * We're about to remove tuples. In Hot Standby mode, ensure that there are
+	 * no queries running for which the removed tuples are still visible.
+	 */
+	if (InHotStandby)
+	{
+		VirtualTransactionId *backends;
+
+		backends = GetConflictingVirtualXIDs(xlrec->latestRemovedXid,
+											 InvalidOid,
+											 true);
+		ResolveRecoveryConflictWithVirtualXIDs(backends,
+											   "VACUUM heap cleanup",
+											   CONFLICT_MODE_ERROR);
+	}
+
+	RestoreBkpBlocks(lsn, record, true);
+
 	if (record->xl_info & XLR_BKP_BLOCK_1)
 		return;

-	buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
 	if (!BufferIsValid(buffer))
 		return;
+	LockBufferForCleanup(buffer);
 	page = (Page) BufferGetPage(buffer);

 	if (XLByteLE(lsn, PageGetLSN(page)))
@ -4145,12 +4254,40 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
 	Buffer		buffer;
 	Page		page;

+	/*
+	 * In Hot Standby mode, ensure that there are no queries running which still
+	 * consider the frozen xids as running.
+	 */
+	if (InHotStandby)
+	{
+		VirtualTransactionId *backends;
+
+		/*
+		 * XXX: Using cutoff_xid is overly conservative. Even if cutoff_xid
+		 * is recent enough to conflict with a backend, the actual values
+		 * being frozen might not be. With a typical vacuum_freeze_min_age
+		 * setting in the ballpark of millions of transactions, it won't make
+		 * a difference, but it might if you run a manual VACUUM FREEZE.
+		 * Typically the cutoff is much earlier than any recently deceased
+		 * tuple versions removed by this vacuum, so don't worry too much.
+		 */
+		backends = GetConflictingVirtualXIDs(cutoff_xid,
+											 InvalidOid,
+											 true);
+		ResolveRecoveryConflictWithVirtualXIDs(backends,
+											   "VACUUM heap freeze",
+											   CONFLICT_MODE_ERROR);
+	}
+
+	RestoreBkpBlocks(lsn, record, false);
+
 	if (record->xl_info & XLR_BKP_BLOCK_1)
 		return;

-	buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
 	if (!BufferIsValid(buffer))
 		return;
+	LockBufferForCleanup(buffer);
 	page = (Page) BufferGetPage(buffer);

 	if (XLByteLE(lsn, PageGetLSN(page)))
@ -4740,6 +4877,11 @@ heap_redo(XLogRecPtr lsn, XLogRecord *record)
 {
 	uint8		info = record->xl_info & ~XLR_INFO_MASK;

+	/*
+	 * These operations don't overwrite MVCC data so no conflict
+	 * processing is required. The ones in heap2 rmgr do.
+	 */
+
 	RestoreBkpBlocks(lsn, record, false);

 	switch (info & XLOG_HEAP_OPMASK)
@ -4778,20 +4920,25 @@ heap2_redo(XLogRecPtr lsn, XLogRecord *record)
 {
 	uint8		info = record->xl_info & ~XLR_INFO_MASK;

+	/*
+	 * Note that RestoreBkpBlocks() is called after conflict processing
+	 * within each record type handling function.
+	 */
+
 	switch (info & XLOG_HEAP_OPMASK)
 	{
 		case XLOG_HEAP2_FREEZE:
-			RestoreBkpBlocks(lsn, record, false);
 			heap_xlog_freeze(lsn, record);
 			break;
 		case XLOG_HEAP2_CLEAN:
-			RestoreBkpBlocks(lsn, record, true);
 			heap_xlog_clean(lsn, record, false);
 			break;
 		case XLOG_HEAP2_CLEAN_MOVE:
-			RestoreBkpBlocks(lsn, record, true);
 			heap_xlog_clean(lsn, record, true);
 			break;
+		case XLOG_HEAP2_CLEANUP_INFO:
+			heap_xlog_cleanup_info(lsn, record);
+			break;
 		default:
 			elog(PANIC, "heap2_redo: unknown op code %u", info);
 	}
@ -4921,17 +5068,26 @@ heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
 	{
 		xl_heap_clean *xlrec = (xl_heap_clean *) rec;

-		appendStringInfo(buf, "clean: rel %u/%u/%u; blk %u",
+		appendStringInfo(buf, "clean: rel %u/%u/%u; blk %u remxid %u",
 						 xlrec->node.spcNode, xlrec->node.dbNode,
-						 xlrec->node.relNode, xlrec->block);
+						 xlrec->node.relNode, xlrec->block,
+						 xlrec->latestRemovedXid);
 	}
 	else if (info == XLOG_HEAP2_CLEAN_MOVE)
 	{
 		xl_heap_clean *xlrec = (xl_heap_clean *) rec;

-		appendStringInfo(buf, "clean_move: rel %u/%u/%u; blk %u",
+		appendStringInfo(buf, "clean_move: rel %u/%u/%u; blk %u remxid %u",
 						 xlrec->node.spcNode, xlrec->node.dbNode,
-						 xlrec->node.relNode, xlrec->block);
+						 xlrec->node.relNode, xlrec->block,
+						 xlrec->latestRemovedXid);
+	}
+	else if (info == XLOG_HEAP2_CLEANUP_INFO)
+	{
+		xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
+
+		appendStringInfo(buf, "cleanup info: remxid %u",
+						 xlrec->latestRemovedXid);
 	}
 	else
 		appendStringInfo(buf, "UNKNOWN");
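Note on HeapTupleHeaderAdvanceLatestRemovedXid() in the heapam.c changes
above: the following standalone sketch shows how latestRemovedXid
accumulates across removed tuples. SketchTuple and its moved flag are
simplified stand-ins for the real tuple header and the HEAP_MOVED_OFF /
HEAP_MOVED_IN infomask bits, and xid_precedes() is a reimplementation of
the wraparound-aware comparison that TransactionIdPrecedes() performs. None
of these sketch names exist in the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)

/* Simplified stand-in for a heap tuple header (hypothetical layout). */
typedef struct
{
	TransactionId xmin;
	TransactionId xmax;
	TransactionId xvac;
	bool		moved;		/* stands in for HEAP_MOVED_OFF / HEAP_MOVED_IN */
} SketchTuple;

static bool
xid_is_normal(TransactionId xid)
{
	return xid >= FirstNormalTransactionId;
}

/* Wraparound-aware "a is older than b"; special XIDs compare as plain integers. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	if (!xid_is_normal(a) || !xid_is_normal(b))
		return a < b;
	return (int32_t) (a - b) < 0;
}

/* Raise *latest to the newest XID found in the tuple, if that XID is newer. */
static void
advance_latest_removed_xid(const SketchTuple *tup, TransactionId *latest)
{
	if (tup->moved && xid_precedes(*latest, tup->xvac))
		*latest = tup->xvac;
	if (xid_precedes(*latest, tup->xmax))
		*latest = tup->xmax;
	if (xid_precedes(*latest, tup->xmin))
		*latest = tup->xmin;
}
```

Starting from InvalidTransactionId and calling the function once per removed
tuple leaves the newest XID seen in *latest, which is the value the clean
record carries to the standby.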
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/heap/pruneheap.c,v 1.18 2009/06/11 14:48:53 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/heap/pruneheap.c,v 1.19 2009/12/19 01:32:32 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -30,7 +30,8 @@
 typedef struct
 {
 	TransactionId new_prune_xid;	/* new prune hint value for page */
-	int			nredirected;	/* numbers of entries in arrays below */
+	TransactionId latestRemovedXid; /* latest xid to be removed by this prune */
+	int			nredirected;		/* numbers of entries in arrays below */
 	int			ndead;
 	int			nunused;
 	/* arrays that accumulate indexes of items to be changed */
@ -84,6 +85,14 @@ heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin)
 	if (!PageIsPrunable(page, OldestXmin))
 		return;

+	/*
+	 * We can't write WAL in recovery mode, so there's no point trying to
+	 * clean the page. The master will likely issue a cleaning WAL record
+	 * soon anyway, so this is no particular loss.
+	 */
+	if (RecoveryInProgress())
+		return;
+
 	/*
 	 * We prune when a previous UPDATE failed to find enough space on the page
 	 * for a new tuple version, or when free space falls below the relation's
@ -176,6 +185,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 	 * of our working state.
 	 */
 	prstate.new_prune_xid = InvalidTransactionId;
+	prstate.latestRemovedXid = InvalidTransactionId;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));

@ -257,7 +267,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 									prstate.redirected, prstate.nredirected,
 									prstate.nowdead, prstate.ndead,
 									prstate.nowunused, prstate.nunused,
-									redirect_move);
+									prstate.latestRemovedXid, redirect_move);

 			PageSetLSN(BufferGetPage(buffer), recptr);
 			PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
@ -395,6 +405,8 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 				== HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
 			{
 				heap_prune_record_unused(prstate, rootoffnum);
+				HeapTupleHeaderAdvanceLatestRemovedXid(htup,
+													   &prstate->latestRemovedXid);
 				ndeleted++;
 			}

@ -520,7 +532,11 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 		 * find another DEAD tuple is a fairly unusual corner case.)
 		 */
 		if (tupdead)
+		{
 			latestdead = offnum;
+			HeapTupleHeaderAdvanceLatestRemovedXid(htup,
+												   &prstate->latestRemovedXid);
+		}
 		else if (!recent_dead)
 			break;

--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/index/genam.c,v 1.77 2009/12/07 05:22:21 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/index/genam.c,v 1.78 2009/12/19 01:32:32 sriggs Exp $
 *
 * NOTES
 *	  many of the old access method routines have been turned into
@ -91,8 +91,19 @@ RelationGetIndexScan(Relation indexRelation,
 	else
 		scan->keyData = NULL;

+	/*
+	 * During recovery we ignore killed tuples and don't bother to kill them
+	 * either. We do this because the xmin on the primary node could easily
+	 * be later than the xmin on the standby node, so that what the primary
+	 * thinks is killed is supposed to be visible on standby. So for correct
+	 * MVCC for queries during recovery we must ignore these hints and check
+	 * all tuples. Do *not* set ignore_killed_tuples to true when running
+	 * in a transaction that was started during recovery.
+	 * xactStartedInRecovery should not be altered by index AMs.
+	 */
 	scan->kill_prior_tuple = false;
-	scan->ignore_killed_tuples = true;	/* default setting */
+	scan->xactStartedInRecovery = TransactionStartedDuringRecovery();
+	scan->ignore_killed_tuples = !scan->xactStartedInRecovery;

 	scan->opaque = NULL;

--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/index/indexam.c,v 1.115 2009/07/29 20:56:18 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/index/indexam.c,v 1.116 2009/12/19 01:32:32 sriggs Exp $
 *
 * INTERFACE ROUTINES
 *		index_open		- open an index relation by relation OID
@ -455,9 +455,12 @@ index_getnext(IndexScanDesc scan, ScanDirection direction)

 			/*
 			 * If we scanned a whole HOT chain and found only dead tuples,
-			 * tell index AM to kill its entry for that TID.
+			 * tell index AM to kill its entry for that TID. We do not do
+			 * this when in recovery because it may violate MVCC to do so.
+			 * see comments in RelationGetIndexScan().
 			 */
-			scan->kill_prior_tuple = scan->xs_hot_dead;
+			if (!scan->xactStartedInRecovery)
+				scan->kill_prior_tuple = scan->xs_hot_dead;

 			/*
 			 * The AM's gettuple proc finds the next index entry matching the
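Note on the genam.c and indexam.c changes above: the two hunks work as a
pair, and the combined rule is easier to see in isolation. The sketch below
models only the three flags involved. SketchScan and both function names
are hypothetical, and started_in_recovery stands in for the result of
TransactionStartedDuringRecovery().

```c
#include <assert.h>
#include <stdbool.h>

/* Only the scan fields that the two hunks touch (hypothetical struct). */
typedef struct
{
	bool		xactStartedInRecovery;
	bool		ignore_killed_tuples;
	bool		kill_prior_tuple;
} SketchScan;

/* Mirrors the flag setup in RelationGetIndexScan(). */
static void
sketch_scan_begin(SketchScan *scan, bool started_in_recovery)
{
	scan->kill_prior_tuple = false;
	scan->xactStartedInRecovery = started_in_recovery;
	/* On a standby the killed hints cannot be trusted, so do not skip them. */
	scan->ignore_killed_tuples = !started_in_recovery;
}

/* Mirrors the guarded assignment in index_getnext(). */
static void
sketch_scan_after_hot_chain(SketchScan *scan, bool chain_all_dead)
{
	if (!scan->xactStartedInRecovery)
		scan->kill_prior_tuple = chain_all_dead;
}
```

A scan begun during recovery therefore never sets kill_prior_tuple, even if
the server finishes recovery before the scan ends.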
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.20 2008/03/21 13:23:27 momjian Exp $
+$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.21 2009/12/19 01:32:32 sriggs Exp $

 Btree Indexing
 ==============
@ -401,6 +401,33 @@ of the WAL entry.)  If the parent page becomes half-dead but is not
 immediately deleted due to a subsequent crash, there is no loss of
 consistency, and the empty page will be picked up by the next VACUUM.

+Scans during Recovery
+---------------------
+
+The btree index type can be safely used during recovery. During recovery
+we have at most one writer and potentially many readers. In that
+situation the locking requirements can be relaxed and we do not need
+double locking during block splits. Each WAL record makes changes to a
+single level of the btree using the correct locking sequence and so
+is safe for concurrent readers. Some readers may observe a block split
+in progress as they descend the tree, but they will simply move right
+onto the correct page.
+
+During recovery all index scans start with ignore_killed_tuples = false
+and we never set kill_prior_tuple. We do this because the oldest xmin
+on the standby server can be older than the oldest xmin on the master
+server, which means tuples can be marked as killed even when they are
+still visible on the standby. We don't WAL log tuple killed bits, but
+they can still appear in the standby because of full page writes. So
+we must always ignore them in standby, and that means it's not worth
+setting them either.
+
+Note that we talk about scans that are started during recovery. We go to
+a little trouble to allow a scan to start during recovery and end during
+normal running after recovery has completed. This is a key capability
+because it allows running applications to continue while the standby
+changes state into a normally running server.
+
 Other Things That Are Handy to Know
 -----------------------------------

--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.174 2009/10/02 21:14:04 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.175 2009/12/19 01:32:32 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -2025,7 +2025,7 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer)
 	}

 	if (ndeletable > 0)
-		_bt_delitems(rel, buffer, deletable, ndeletable);
+		_bt_delitems(rel, buffer, deletable, ndeletable, false, 0);

 	/*
 	 * Note: if we didn't find any LP_DEAD items, then the page's
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@ -9,7 +9,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.113 2009/05/05 19:02:22 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.114 2009/12/19 01:32:33 sriggs Exp $
 *
 *	NOTES
 *	   Postgres btree pages look like ordinary relation pages.	The opaque
@ -653,19 +653,33 @@ _bt_page_recyclable(Page page)
 *
 * This routine assumes that the caller has pinned and locked the buffer.
 * Also, the given itemnos *must* appear in increasing order in the array.
+ *
+ * We record VACUUMs and b-tree deletes differently in WAL. In Hot Standby
+ * we need to be able to pin all of the blocks in the btree in physical
+ * order when replaying the effects of a VACUUM, just as we do for the
+ * original VACUUM itself. lastBlockVacuumed allows us to tell whether an
+ * intermediate range of blocks has had no changes at all by VACUUM,
+ * and so must be scanned anyway during replay. We always write a WAL record
+ * for the last block in the index, whether or not it contained any items
+ * to be removed. This allows us to scan right up to end of index to
+ * ensure correct locking.
 */
 void
 _bt_delitems(Relation rel, Buffer buf,
-			 OffsetNumber *itemnos, int nitems)
+			 OffsetNumber *itemnos, int nitems, bool isVacuum,
+			 BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;

+	Assert(isVacuum || lastBlockVacuumed == 0);
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();

 	/* Fix the page */
-	PageIndexMultiDelete(page, itemnos, nitems);
+	if (nitems > 0)
+		PageIndexMultiDelete(page, itemnos, nitems);

 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@ -688,15 +702,36 @@ _bt_delitems(Relation rel, Buffer buf,
 	/* XLOG stuff */
 	if (!rel->rd_istemp)
 	{
-		xl_btree_delete xlrec;
 		XLogRecPtr	recptr;
 		XLogRecData rdata[2];

-		xlrec.node = rel->rd_node;
-		xlrec.block = BufferGetBlockNumber(buf);
+		if (isVacuum)
+		{
+			xl_btree_vacuum xlrec_vacuum;
+			xlrec_vacuum.node = rel->rd_node;
+			xlrec_vacuum.block = BufferGetBlockNumber(buf);
+
+			xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+			rdata[0].data = (char *) &xlrec_vacuum;
+			rdata[0].len = SizeOfBtreeVacuum;
+		}
+		else
+		{
+			xl_btree_delete xlrec_delete;
+			xlrec_delete.node = rel->rd_node;
+			xlrec_delete.block = BufferGetBlockNumber(buf);
+
+			/*
+			 * XXX: We would like to set an accurate latestRemovedXid, but
+			 * there is no easy way of obtaining a useful value. So we punt
+			 * and store InvalidTransactionId, which forces the standby to
+			 * wait for/cancel all currently running transactions.
+			 */
+			xlrec_delete.latestRemovedXid = InvalidTransactionId;
+			rdata[0].data = (char *) &xlrec_delete;
+			rdata[0].len = SizeOfBtreeDelete;
+		}

-		rdata[0].data = (char *) &xlrec;
-		rdata[0].len = SizeOfBtreeDelete;
 		rdata[0].buffer = InvalidBuffer;
 		rdata[0].next = &(rdata[1]);

@ -719,7 +754,10 @@ _bt_delitems(Relation rel, Buffer buf,
 		rdata[1].buffer_std = true;
 		rdata[1].next = NULL;

-		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata);
+		if (isVacuum)
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM, rdata);
+		else
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata);

 		PageSetLSN(page, recptr);
 		PageSetTLI(page, ThisTimeLineID);
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@ -12,7 +12,7 @@
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.172 2009/07/29 20:56:18 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.173 2009/12/19 01:32:33 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -57,7 +57,8 @@ typedef struct
 	IndexBulkDeleteCallback callback;
 	void	   *callback_state;
 	BTCycleId	cycleid;
-	BlockNumber lastUsedPage;
+	BlockNumber lastBlockVacuumed; 	/* last blkno reached by Vacuum scan */
+	BlockNumber lastUsedPage;		/* blkno of last non-recyclable page */
 	BlockNumber totFreePages;	/* true total # of free pages */
 	MemoryContext pagedelcontext;
 } BTVacState;
@ -629,6 +630,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	vstate.callback = callback;
 	vstate.callback_state = callback_state;
 	vstate.cycleid = cycleid;
+	vstate.lastBlockVacuumed = BTREE_METAPAGE; /* Initialise at first block */
 	vstate.lastUsedPage = BTREE_METAPAGE;
 	vstate.totFreePages = 0;

@ -705,6 +707,32 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		num_pages = new_pages;
 	}

+	/*
+	 * For Hot Standby we need to scan right up to the end of the index for
+	 * correct locking, so we may need to write a WAL record for the final
+	 * block in the index if it was not vacuumed. It's possible that VACUUMing
+	 * has actually removed zeroed pages at the end of the index so we need to
+	 * take care to issue the record for the last actual block and not for the
+	 * last block that was scanned. Ignore empty indexes.
+	 */
+	if (XLogStandbyInfoActive() &&
+		num_pages > 1 && vstate.lastBlockVacuumed < (num_pages - 1))
+	{
+		Buffer		buf;
+
+		/*
+		 * We can't use _bt_getbuf() here because it always applies
+		 * _bt_checkpage(), which will barf on an all-zero page. We want to
+		 * recycle all-zero pages, not fail.  Also, we want to use a nondefault
+		 * buffer access strategy.
+		 */
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, num_pages - 1, RBM_NORMAL,
+								 info->strategy);
+		LockBufferForCleanup(buf);
+		_bt_delitems(rel, buf, NULL, 0, true, vstate.lastBlockVacuumed);
+		_bt_relbuf(rel, buf);
+	}
+
 	MemoryContextDelete(vstate.pagedelcontext);

 	/* update statistics */
@ -847,6 +875,26 @@ restart:
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
 				htup = &(itup->t_tid);
+
+				/*
+				 * During Hot Standby we currently assume that XLOG_BTREE_VACUUM
+				 * records do not produce conflicts. That is only true as long
+				 * as the callback function depends only upon whether the index
+				 * tuple refers to heap tuples removed in the initial heap scan.
+				 * When vacuum starts it derives a value of OldestXmin. Backends
+				 * taking later snapshots could have a RecentGlobalXmin with a
+				 * later xid than the vacuum's OldestXmin, so it is possible that
+				 * row versions deleted after OldestXmin could be marked as killed
+				 * by other backends. The callback function *could* look at the
+				 * index tuple state in isolation and decide to delete the index
+				 * tuple, though currently it does not. If it ever did, we would
+				 * need to reconsider whether XLOG_BTREE_VACUUM records should
+				 * cause conflicts. If they did cause conflicts they would be
+				 * fairly harsh conflicts, since we haven't yet worked out a way
+				 * to pass a useful value for latestRemovedXid on the
+				 * XLOG_BTREE_VACUUM records. This applies to *any* type of index
+				 * that marks index tuples as killed.
+				 */
 				if (callback(htup, callback_state))
 					deletable[ndeletable++] = offnum;
 			}
@ -858,7 +906,19 @@ restart:
 		 */
 		if (ndeletable > 0)
 		{
-			_bt_delitems(rel, buf, deletable, ndeletable);
+			BlockNumber	lastBlockVacuumed = BufferGetBlockNumber(buf);
+
+			_bt_delitems(rel, buf, deletable, ndeletable, true, vstate->lastBlockVacuumed);
+
+			/*
+			 * Keep track of the block number of the lastBlockVacuumed, so
+			 * we can scan those blocks as well during WAL replay. This then
+			 * provides concurrency protection and allows btrees to be used
+			 * while in recovery.
+			 */
+			if (lastBlockVacuumed > vstate->lastBlockVacuumed)
+				vstate->lastBlockVacuumed = lastBlockVacuumed;
+
 			stats->tuples_removed += ndeletable;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
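The nbtree.c changes above make each XLOG_BTREE_VACUUM record carry the block being cleaned plus the last block that an earlier record covered, and add one final record for the last block of the index when nothing was deleted there. The following is a minimal sketch of that bookkeeping, not the PostgreSQL code itself: `VacuumRecord`, `build_vacuum_records` and the `standby_info_active` flag (standing in for `XLogStandbyInfoActive()`) are hypothetical names, and the real code emits WAL records rather than filling an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Hypothetical stand-in for the two block fields of xl_btree_vacuum. */
typedef struct
{
	BlockNumber block;				/* block cleaned by this record */
	BlockNumber lastBlockVacuumed;	/* last block covered by earlier records */
} VacuumRecord;

/*
 * Build the records a vacuum scan would emit, given the blocks that had
 * deletable items (in ascending order) and the index size in pages.
 * Block 0 is the metapage, so tracking starts there.
 */
size_t
build_vacuum_records(const BlockNumber *dirty, size_t ndirty,
					 BlockNumber num_pages, bool standby_info_active,
					 VacuumRecord *out, size_t cap)
{
	BlockNumber last = 0;
	size_t		n = 0;

	for (size_t i = 0; i < ndirty; i++)
	{
		assert(n < cap);
		out[n].block = dirty[i];
		out[n].lastBlockVacuumed = last;
		n++;
		if (dirty[i] > last)
			last = dirty[i];
	}

	/* Extra record for the final block, so replay can scan to the end. */
	if (standby_info_active && num_pages > 1 && last < num_pages - 1)
	{
		assert(n < cap);
		out[n].block = num_pages - 1;
		out[n].lastBlockVacuumed = last;
		n++;
	}

	return n;
}
```

For an 8-page index where only blocks 2 and 5 had deletable items, the sketch yields three records: (2, 0), (5, 2) and a final item-less record (7, 5).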
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@ -8,7 +8,7 @@
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.55 2009/06/11 14:48:54 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.56 2009/12/19 01:32:33 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -16,7 +16,11 @@

 #include "access/nbtree.h"
 #include "access/transam.h"
+#include "access/xact.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
+#include "storage/standby.h"
+#include "miscadmin.h"

 /*
 * We must keep track of expected insertions due to page splits, and apply
@ -458,6 +462,97 @@ btree_xlog_split(bool onleft, bool isroot,
 						 xlrec->leftsib, xlrec->rightsib, isroot);
 }

+static void
+btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
+{
+	xl_btree_vacuum *xlrec;
+	Buffer		buffer;
+	Page		page;
+	BTPageOpaque opaque;
+
+	xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+
+	/*
+	 * If queries might be active then we need to ensure every block is unpinned
+	 * between the lastBlockVacuumed and the current block, if there are any.
+	 * This ensures that every block in the index is touched during VACUUM as
+	 * required to ensure scans work correctly.
+	 */
+	if (standbyState == STANDBY_SNAPSHOT_READY &&
+		(xlrec->lastBlockVacuumed + 1) != xlrec->block)
+	{
+		BlockNumber blkno = xlrec->lastBlockVacuumed + 1;
+
+		for (; blkno < xlrec->block; blkno++)
+		{
+			/*
+			 * XXX we don't actually need to read the block, we
+			 * just need to confirm it is unpinned. If we had a special call
+			 * into the buffer manager we could optimise this so that
+			 * if the block is not in shared_buffers we confirm it as unpinned.
+			 *
+			 * Another simple optimization would be to check if there are any
+			 * backends running; if not, we could just skip this.
+			 */
+			buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
+			if (BufferIsValid(buffer))
+			{
+				LockBufferForCleanup(buffer);
+				UnlockReleaseBuffer(buffer);
+			}
+		}
+	}
+
+	/*
+	 * If the block was restored from a full page image, nothing more to do.
+	 * The RestoreBkpBlocks() call already pinned and took cleanup lock on
+	 * it. XXX: Perhaps we should call RestoreBkpBlocks() *after* the loop
+	 * above, to make the disk access more sequential.
+	 */
+	if (record->xl_info & XLR_BKP_BLOCK_1)
+		return;
+
+	/*
+	 * Like in btvacuumpage(), we need to take a cleanup lock on every leaf
+	 * page. See nbtree/README for details.
+	 */
+	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
+	if (!BufferIsValid(buffer))
+		return;
+	LockBufferForCleanup(buffer);
+	page = (Page) BufferGetPage(buffer);
+
+	if (XLByteLE(lsn, PageGetLSN(page)))
+	{
+		UnlockReleaseBuffer(buffer);
+		return;
+	}
+
+	if (record->xl_len > SizeOfBtreeVacuum)
+	{
+		OffsetNumber *unused;
+		OffsetNumber *unend;
+
+		unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeVacuum);
+		unend = (OffsetNumber *) ((char *) xlrec + record->xl_len);
+
+		if ((unend - unused) > 0)
+			PageIndexMultiDelete(page, unused, unend - unused);
+	}
+
+	/*
+	 * Mark the page as not containing any LP_DEAD items --- see comments in
+	 * _bt_delitems().
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	PageSetLSN(page, lsn);
+	PageSetTLI(page, ThisTimeLineID);
+	MarkBufferDirty(buffer);
+	UnlockReleaseBuffer(buffer);
+}
+
 static void
 btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
 {
@ -470,6 +565,11 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
 		return;

 	xlrec = (xl_btree_delete *) XLogRecGetData(record);
+
+	/*
+	 * We don't need to take a cleanup lock to apply these changes.
+	 * See nbtree/README for details.
+	 */
 	buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
 	if (!BufferIsValid(buffer))
 		return;
@ -714,7 +814,43 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
 {
 	uint8		info = record->xl_info & ~XLR_INFO_MASK;

-	RestoreBkpBlocks(lsn, record, false);
+	/*
+	 * Btree delete records can conflict with standby queries. You might
+	 * think that vacuum records would conflict as well, but we've handled
+	 * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
+	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
+	 * just once when that arrives. After that we know that no conflicts
+	 * exist from individual btree vacuum records on that index.
+	 */
+	if (InHotStandby)
+	{
+		if (info == XLOG_BTREE_DELETE)
+		{
+			xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
+			VirtualTransactionId *backends;
+
+			/*
+			 * XXX Currently we put everybody on death row, because
+			 * currently _bt_delitems() supplies InvalidTransactionId.
+			 * This can be fairly painful, so providing a better value
+			 * here is worth some thought and possibly some effort to
+			 * improve.
+			 */
+			backends = GetConflictingVirtualXIDs(xlrec->latestRemovedXid,
+												 InvalidOid,
+												 true);
+
+			ResolveRecoveryConflictWithVirtualXIDs(backends,
+												   "b-tree delete",
+												   CONFLICT_MODE_ERROR);
+		}
+	}
+
+	/*
+	 * Vacuum needs to pin and take cleanup lock on every leaf page,
+	 * a regular exclusive lock is enough for all other purposes.
+	 */
+	RestoreBkpBlocks(lsn, record, (info == XLOG_BTREE_VACUUM));

 	switch (info)
 	{
@ -739,6 +875,9 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
 		case XLOG_BTREE_SPLIT_R_ROOT:
 			btree_xlog_split(false, true, lsn, record);
 			break;
+		case XLOG_BTREE_VACUUM:
+			btree_xlog_vacuum(lsn, record);
+			break;
 		case XLOG_BTREE_DELETE:
 			btree_xlog_delete(lsn, record);
 			break;
@ -843,13 +982,24 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
 								 xlrec->level, xlrec->firstright);
 				break;
 			}
+		case XLOG_BTREE_VACUUM:
+			{
+				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
+
+				appendStringInfo(buf, "vacuum: rel %u/%u/%u; blk %u, lastBlockVacuumed %u",
+								 xlrec->node.spcNode, xlrec->node.dbNode,
+								 xlrec->node.relNode, xlrec->block,
+								 xlrec->lastBlockVacuumed);
+				break;
+			}
 		case XLOG_BTREE_DELETE:
 			{
 				xl_btree_delete *xlrec = (xl_btree_delete *) rec;

-				appendStringInfo(buf, "delete: rel %u/%u/%u; blk %u",
+				appendStringInfo(buf, "delete: rel %u/%u/%u; blk %u, latestRemovedXid %u",
 								 xlrec->node.spcNode, xlrec->node.dbNode,
-								 xlrec->node.relNode, xlrec->block);
+								 xlrec->node.relNode, xlrec->block,
+								 xlrec->latestRemovedXid);
 				break;
 			}
 		case XLOG_BTREE_DELETE_PAGE:
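On the standby side, btree_xlog_vacuum() above takes a cleanup lock on every block strictly between lastBlockVacuumed and the record's own block before applying the record. The sketch below illustrates why a sequence of such records touches every non-metapage block exactly as the original VACUUM did. It is only an illustration: `VacuumRecord` and `replay_vacuum_record` are hypothetical names, setting a flag stands in for `LockBufferForCleanup()` followed by release, and the real code performs the gap scan only once `standbyState == STANDBY_SNAPSHOT_READY`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Hypothetical stand-in for the two block fields of xl_btree_vacuum. */
typedef struct
{
	BlockNumber block;
	BlockNumber lastBlockVacuumed;
} VacuumRecord;

/*
 * Mark every block that replay of one record would cleanup-lock:
 * first the untouched gap after lastBlockVacuumed, then the block itself.
 */
void
replay_vacuum_record(const VacuumRecord *rec, bool *touched, size_t nblocks)
{
	assert(rec->block < nblocks);

	for (BlockNumber blkno = rec->lastBlockVacuumed + 1;
		 blkno < rec->block; blkno++)
		touched[blkno] = true;	/* gap block: lock and release only */

	touched[rec->block] = true;	/* target block: lock and apply deletions */
}
```

Replaying the records (2, 0), (5, 2) and (7, 5) against an 8-page index marks blocks 1 through 7 and leaves only the metapage, block 0, unmarked.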
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.12 2008/10/20 19:18:18 alvherre Exp $
+$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.13 2009/12/19 01:32:33 sriggs Exp $

 The Transaction System
 ======================
@ -649,3 +649,34 @@ fsync it down to disk without any sort of interlock, as soon as it finishes
 the bulk update.  However, all these paths are designed to write data that
 no other transaction can see until after T1 commits.  The situation is thus
 not different from ordinary WAL-logged updates.
+
+Transaction Emulation during Recovery
+-------------------------------------
+
+During Recovery we replay transaction changes in the order they occurred.
+As part of this replay we emulate some transactional behaviour, so that
+read only backends can take MVCC snapshots. We do this by maintaining a
+list of XIDs belonging to transactions that are being replayed, so that
+each transaction that has recorded WAL records for database writes exists
+in the array until it commits. Further details are given in comments in
+procarray.c.
+
+Many actions write no WAL records at all, for example read only transactions.
+These have no effect on MVCC in recovery and we can pretend they never
+occurred at all. Subtransaction commit does not write a WAL record either
+and has very little effect, since lock waiters need to wait for the
+parent transaction to complete.
+
+Not all transactional behaviour is emulated; for example, we do not insert
+a transaction entry into the lock table, nor do we maintain the transaction
+stack in memory. Clog entries are made normally. Multitrans is not maintained
+because its purpose is to record tuple level locks that an application has
+requested to prevent write locks. Since write locks cannot be obtained at all,
+there is never any conflict and so there is no reason to update multitrans.
+Subtrans is maintained during recovery but the details of the transaction
+tree are ignored and all subtransactions reference the top-level TransactionId
+directly. Since commit is atomic this provides correct lock wait behaviour
+yet simplifies emulation of subtransactions considerably.
+
+Further details on locking mechanics in recovery are given in comments
+with the Lock rmgr code.
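The README text above describes keeping a list of XIDs for transactions that are being replayed, so that read only backends can take MVCC snapshots. A minimal sketch of such a list follows. It is not the structure used in procarray.c: `ReplayXidList`, its fixed capacity and all three functions are hypothetical, and treating an abort record the same way as a commit record is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define MAX_TRACKED_XIDS 64		/* arbitrary capacity for this sketch */

typedef struct
{
	TransactionId xids[MAX_TRACKED_XIDS];
	size_t		count;
} ReplayXidList;

/* Would a snapshot taken now have to treat xid as still in progress? */
bool
replay_xid_is_running(const ReplayXidList *list, TransactionId xid)
{
	for (size_t i = 0; i < list->count; i++)
		if (list->xids[i] == xid)
			return true;
	return false;
}

/* Called when a WAL record written by xid is replayed. */
void
replay_xid_seen(ReplayXidList *list, TransactionId xid)
{
	if (replay_xid_is_running(list, xid))
		return;
	assert(list->count < MAX_TRACKED_XIDS);
	list->xids[list->count++] = xid;
}

/* Called when the completion record for xid is replayed. */
void
replay_xid_finished(ReplayXidList *list, TransactionId xid)
{
	for (size_t i = 0; i < list->count; i++)
	{
		if (list->xids[i] == xid)
		{
			/* Order does not matter here, so fill the hole with the tail. */
			list->xids[i] = list->xids[--list->count];
			return;
		}
	}
}
```

A transaction that writes no WAL never reaches `replay_xid_seen()`, which matches the README's point that read only transactions can be treated as if they never occurred.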
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@ -26,7 +26,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/backend/access/transam/clog.c,v 1.53 2009/06/11 14:48:54 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/access/transam/clog.c,v 1.54 2009/12/19 01:32:33 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -574,7 +574,7 @@ ExtendCLOG(TransactionId newestXact)
 	LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);

 	/* Zero the page and make an XLOG entry about it */
-	ZeroCLOGPage(pageno, true);
+	ZeroCLOGPage(pageno, !InRecovery);

 	LWLockRelease(CLogControlLock);
 }
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@ -42,7 +42,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/backend/access/transam/multixact.c,v 1.32 2009/11/23 09:58:36 heikki Exp $
+ * $PostgreSQL: pgsql/src/backend/access/transam/multixact.c,v 1.33 2009/12/19 01:32:33 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -59,6 +59,7 @@
 #include "storage/backendid.h"
 #include "storage/lmgr.h"
 #include "storage/procarray.h"
+#include "utils/builtins.h"
 #include "utils/memutils.h"


@ -220,7 +221,6 @@ static MultiXactId GetNewMultiXactId(int nxids, MultiXactOffset *offset);
 static MultiXactId mXactCacheGetBySet(int nxids, TransactionId *xids);
 static int	mXactCacheGetById(MultiXactId multi, TransactionId **xids);
 static void mXactCachePut(MultiXactId multi, int nxids, TransactionId *xids);
-static int	xidComparator(const void *arg1, const void *arg2);

 #ifdef MULTIXACT_DEBUG
 static char *mxid_to_string(MultiXactId multi, int nxids, TransactionId *xids);
@ -1221,27 +1221,6 @@ mXactCachePut(MultiXactId multi, int nxids, TransactionId *xids)
 	MXactCache = entry;
 }

-/*
- * xidComparator
- *		qsort comparison function for XIDs
- *
- * We don't need to use wraparound comparison for XIDs, and indeed must
- * not do so since that does not respect the triangle inequality!  Any
- * old sort order will do.
- */
-static int
-xidComparator(const void *arg1, const void *arg2)
-{
-	TransactionId xid1 = *(const TransactionId *) arg1;
-	TransactionId xid2 = *(const TransactionId *) arg2;
-
-	if (xid1 > xid2)
-		return 1;
-	if (xid1 < xid2)
-		return -1;
-	return 0;
-}
-
 #ifdef MULTIXACT_DEBUG
 static char *
 mxid_to_string(MultiXactId multi, int nxids, TransactionId *xids)
@ -2051,11 +2030,18 @@ multixact_redo(XLogRecPtr lsn, XLogRecord *record)
 			if (TransactionIdPrecedes(max_xid, xids[i]))
 				max_xid = xids[i];
 		}
+
+		/* We don't expect anyone else to modify nextXid, hence the startup process
+		 * doesn't need to hold a lock while checking this. We still acquire
+		 * the lock to modify it, though.
+		 */
 		if (TransactionIdFollowsOrEquals(max_xid,
 										 ShmemVariableCache->nextXid))
 		{
+			LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 			ShmemVariableCache->nextXid = max_xid;
 			TransactionIdAdvance(ShmemVariableCache->nextXid);
+			LWLockRelease(XidGenLock);
 		}
 	}
 	else
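The comment on the xidComparator removed from multixact.c above says that sorting must not use wraparound comparison because it does not respect the triangle inequality. The sketch below shows the problem concretely. `xid_precedes_wraparound` is a simplified modulo-2^32 test written for this example (it ignores the special handling of permanent XIDs in the real `TransactionIdPrecedes()`), and `sort_xids` is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Plain numeric ordering: a valid total order, so safe for qsort(). */
int
xid_cmp_plain(const void *arg1, const void *arg2)
{
	TransactionId xid1 = *(const TransactionId *) arg1;
	TransactionId xid2 = *(const TransactionId *) arg2;

	if (xid1 > xid2)
		return 1;
	if (xid1 < xid2)
		return -1;
	return 0;
}

/*
 * Simplified circular comparison: a precedes b when the modulo-2^32
 * difference (a - b) falls in the "negative" half of the range.
 */
bool
xid_precedes_wraparound(TransactionId a, TransactionId b)
{
	return (uint32_t) (a - b) >= UINT32_C(0x80000000);
}

/* Hypothetical helper: sort an XID array with the plain comparator. */
void
sort_xids(TransactionId *xids, size_t n)
{
	assert(xids != NULL || n == 0);
	qsort(xids, n, sizeof(TransactionId), xid_cmp_plain);
}
```

With a = 100, b = a + 0x60000000 and c = a + 0xC0000000, the circular test reports that a precedes b, b precedes c, and c precedes a. No sort order can satisfy that cycle, which is why any consistent linear order is used for sorting instead.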
--- a/src/backend/access/transam/recovery.conf.sample
+++ b/src/backend/access/transam/recovery.conf.sample
@ -79,3 +79,10 @@
 #
 #
 #---------------------------------------------------------------------------
+# HOT STANDBY PARAMETERS
+#---------------------------------------------------------------------------
+#
+# If you want to enable read-only connections during recovery, enable
+# recovery_connections in postgresql.conf
+#
+#---------------------------------------------------------------------------
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@ -3,7 +3,7 @@
 *
 * Resource managers definition
 *
- * $PostgreSQL: pgsql/src/backend/access/transam/rmgr.c,v 1.27 2008/11/19 10:34:50 heikki Exp $
+ * $PostgreSQL: pgsql/src/backend/access/transam/rmgr.c,v 1.28 2009/12/19 01:32:33 sriggs Exp $
 */
 #include "postgres.h"

@ -21,6 +21,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "storage/freespace.h"
+#include "storage/standby.h"


 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
@ -32,7 +33,7 @@ const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 	{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},
 	{"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},
 	{"Reserved 7", NULL, NULL, NULL, NULL, NULL},
-	{"Reserved 8", NULL, NULL, NULL, NULL, NULL},
+	{"Standby", standby_redo, standby_desc, NULL, NULL, NULL},
 	{"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL},
 	{"Heap", heap_redo, heap_desc, NULL, NULL, NULL},
 	{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint},
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@ -22,7 +22,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/backend/access/transam/subtrans.c,v 1.24 2009/01/01 17:23:36 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/access/transam/subtrans.c,v 1.25 2009/12/19 01:32:33 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -68,15 +68,19 @@ static bool SubTransPagePrecedes(int page1, int page2);

 /*
 * Record the parent of a subtransaction in the subtrans log.
+ *
+ * In some cases we may need to overwrite an existing value.
 */
 void
-SubTransSetParent(TransactionId xid, TransactionId parent)
+SubTransSetParent(TransactionId xid, TransactionId parent, bool overwriteOK)
 {
 	int			pageno = TransactionIdToPage(xid);
 	int			entryno = TransactionIdToEntry(xid);
 	int			slotno;
 	TransactionId *ptr;

+	Assert(TransactionIdIsValid(parent));
+
 	LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);

 	slotno = SimpleLruReadPage(SubTransCtl, pageno, true, xid);
@ -84,7 +88,8 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
 	ptr += entryno;

 	/* Current state should be 0 */
-	Assert(*ptr == InvalidTransactionId);
+	Assert(*ptr == InvalidTransactionId ||
+			(*ptr == parent && overwriteOK));

 	*ptr = parent;

--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
- *		$PostgreSQL: pgsql/src/backend/access/transam/twophase.c,v 1.56 2009/11/23 09:58:36 heikki Exp $
+ *		$PostgreSQL: pgsql/src/backend/access/transam/twophase.c,v 1.57 2009/12/19 01:32:33 sriggs Exp $
 *
 * NOTES
 *		Each global transaction is associated with a global transaction
@ -57,6 +57,7 @@
 #include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
+#include "storage/sinvaladt.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@ -144,7 +145,10 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
 								int nchildren,
 								TransactionId *children,
 								int nrels,
-								RelFileNode *rels);
+								RelFileNode *rels,
+								int ninvalmsgs,
+								SharedInvalidationMessage *invalmsgs,
+								bool initfileinval);
 static void RecordTransactionAbortPrepared(TransactionId xid,
 							   int nchildren,
 							   TransactionId *children,
@ -736,10 +740,11 @@ TwoPhaseGetDummyProc(TransactionId xid)
 *	2. TransactionId[] (subtransactions)
 *	3. RelFileNode[] (files to be deleted at commit)
 *	4. RelFileNode[] (files to be deleted at abort)
- *	5. TwoPhaseRecordOnDisk
- *	6. ...
- *	7. TwoPhaseRecordOnDisk (end sentinel, rmid == TWOPHASE_RM_END_ID)
- *	8. CRC32
+ *	5. SharedInvalidationMessage[] (inval messages to be sent at commit)
+ *	6. TwoPhaseRecordOnDisk
+ *	7. ...
+ *	8. TwoPhaseRecordOnDisk (end sentinel, rmid == TWOPHASE_RM_END_ID)
+ *	9. CRC32
 *
 * Each segment except the final CRC32 is MAXALIGN'd.
 */
@ -760,6 +765,8 @@ typedef struct TwoPhaseFileHeader
 	int32		nsubxacts;		/* number of following subxact XIDs */
 	int32		ncommitrels;	/* number of delete-on-commit rels */
 	int32		nabortrels;		/* number of delete-on-abort rels */
+	int32		ninvalmsgs;		/* number of cache invalidation messages */
+	bool		initfileinval;	/* does relcache init file need invalidation? */
 	char		gid[GIDSIZE];	/* GID for transaction */
 } TwoPhaseFileHeader;

@ -835,6 +842,7 @@ StartPrepare(GlobalTransaction gxact)
 	TransactionId *children;
 	RelFileNode *commitrels;
 	RelFileNode *abortrels;
+	SharedInvalidationMessage *invalmsgs;

 	/* Initialize linked list */
 	records.head = palloc0(sizeof(XLogRecData));
@ -859,11 +867,16 @@ StartPrepare(GlobalTransaction gxact)
 	hdr.nsubxacts = xactGetCommittedChildren(&children);
 	hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels, NULL);
 	hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels, NULL);
+	hdr.ninvalmsgs = xactGetCommittedInvalidationMessages(&invalmsgs,
+														  &hdr.initfileinval);
 	StrNCpy(hdr.gid, gxact->gid, GIDSIZE);

 	save_state_data(&hdr, sizeof(TwoPhaseFileHeader));

-	/* Add the additional info about subxacts and deletable files */
+	/*
+	 * Add the additional info about subxacts, deletable files and
+	 * cache invalidation messages.
+	 */
 	if (hdr.nsubxacts > 0)
 	{
 		save_state_data(children, hdr.nsubxacts * sizeof(TransactionId));
@ -880,6 +893,12 @@ StartPrepare(GlobalTransaction gxact)
 		save_state_data(abortrels, hdr.nabortrels * sizeof(RelFileNode));
 		pfree(abortrels);
 	}
+	if (hdr.ninvalmsgs > 0)
+	{
+		save_state_data(invalmsgs,
+						hdr.ninvalmsgs * sizeof(SharedInvalidationMessage));
+		pfree(invalmsgs);
+	}
 }

 /*
@ -1071,7 +1090,7 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
 * contents of the file.  Otherwise return NULL.
 */
 static char *
-ReadTwoPhaseFile(TransactionId xid)
+ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 {
 	char		path[MAXPGPATH];
 	char	   *buf;
@ -1087,10 +1106,11 @@ ReadTwoPhaseFile(TransactionId xid)
 	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
 	if (fd < 0)
 	{
-		ereport(WARNING,
-				(errcode_for_file_access(),
-				 errmsg("could not open two-phase state file \"%s\": %m",
-						path)));
+		if (give_warnings)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not open two-phase state file \"%s\": %m",
+							path)));
 		return NULL;
 	}

@ -1103,10 +1123,11 @@ ReadTwoPhaseFile(TransactionId xid)
 	if (fstat(fd, &stat))
 	{
 		close(fd);
-		ereport(WARNING,
-				(errcode_for_file_access(),
-				 errmsg("could not stat two-phase state file \"%s\": %m",
-						path)));
+		if (give_warnings)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat two-phase state file \"%s\": %m",
+							path)));
 		return NULL;
 	}

@ -1134,10 +1155,11 @@ ReadTwoPhaseFile(TransactionId xid)
 	if (read(fd, buf, stat.st_size) != stat.st_size)
 	{
 		close(fd);
-		ereport(WARNING,
-				(errcode_for_file_access(),
-				 errmsg("could not read two-phase state file \"%s\": %m",
-						path)));
+		if (give_warnings)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read two-phase state file \"%s\": %m",
+							path)));
 		pfree(buf);
 		return NULL;
 	}
@ -1166,6 +1188,30 @@ ReadTwoPhaseFile(TransactionId xid)
 	return buf;
 }

+/*
+ * Confirms an xid is prepared, during recovery
+ */
+bool
+StandbyTransactionIdIsPrepared(TransactionId xid)
+{
+	char	   *buf;
+	TwoPhaseFileHeader *hdr;
+	bool		result;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* Read and validate file */
+	buf = ReadTwoPhaseFile(xid, false);
+	if (buf == NULL)
+		return false;
+
+	/* Check header also */
+	hdr = (TwoPhaseFileHeader *) buf;
+	result = TransactionIdEquals(hdr->xid, xid);
+	pfree(buf);
+
+	return result;
+}

 /*
 * FinishPreparedTransaction: execute COMMIT PREPARED or ROLLBACK PREPARED
@ -1184,6 +1230,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	RelFileNode *abortrels;
 	RelFileNode *delrels;
 	int			ndelrels;
+	SharedInvalidationMessage *invalmsgs;
 	int			i;

 	/*
@ -1196,7 +1243,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	/*
 	 * Read and validate the state file
 	 */
-	buf = ReadTwoPhaseFile(xid);
+	buf = ReadTwoPhaseFile(xid, true);
 	if (buf == NULL)
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
@ -1215,6 +1262,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
 	abortrels = (RelFileNode *) bufptr;
 	bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+	invalmsgs = (SharedInvalidationMessage *) bufptr;
+	bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));

 	/* compute latestXid among all children */
 	latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children);
@ -1230,7 +1279,9 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	if (isCommit)
 		RecordTransactionCommitPrepared(xid,
 										hdr->nsubxacts, children,
-										hdr->ncommitrels, commitrels);
+										hdr->ncommitrels, commitrels,
+										hdr->ninvalmsgs, invalmsgs,
+										hdr->initfileinval);
 	else
 		RecordTransactionAbortPrepared(xid,
 									   hdr->nsubxacts, children,
@ -1277,6 +1328,18 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 		smgrclose(srel);
 	}

+	/*
+	 * Handle cache invalidation messages.
+	 *
+	 * Relcache init file invalidation requires processing both
+	 * before and after we send the SI messages. See AtEOXact_Inval()
+	 */
+	if (hdr->initfileinval)
+		RelationCacheInitFileInvalidate(true);
+	SendSharedInvalidMessages(invalmsgs, hdr->ninvalmsgs);
+	if (hdr->initfileinval)
+		RelationCacheInitFileInvalidate(false);
+
 	/* And now do the callbacks */
 	if (isCommit)
 		ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
@ -1528,14 +1591,21 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 * Our other responsibility is to determine and return the oldest valid XID
 * among the prepared xacts (if none, return ShmemVariableCache->nextXid).
 * This is needed to synchronize pg_subtrans startup properly.
+ *
+ * If xids_p and nxids_p are not NULL, a pointer to a palloc'd array of all
+ * top-level xids is stored in *xids_p. The number of entries in the array
+ * is returned in *nxids_p.
 */
 TransactionId
-PrescanPreparedTransactions(void)
+PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 {
 	TransactionId origNextXid = ShmemVariableCache->nextXid;
 	TransactionId result = origNextXid;
 	DIR		   *cldir;
 	struct dirent *clde;
+	TransactionId *xids = NULL;
+	int			nxids = 0;
+	int			allocsize = 0;

 	cldir = AllocateDir(TWOPHASE_DIR);
 	while ((clde = ReadDir(cldir, TWOPHASE_DIR)) != NULL)
@ -1567,7 +1637,7 @@ PrescanPreparedTransactions(void)
 			 */

 			/* Read and validate file */
-			buf = ReadTwoPhaseFile(xid);
+			buf = ReadTwoPhaseFile(xid, true);
 			if (buf == NULL)
 			{
 				ereport(WARNING,
@ -1615,11 +1685,36 @@ PrescanPreparedTransactions(void)
 				}
 			}

+
+			if (xids_p)
+			{
+				if (nxids == allocsize)
+				{
+					if (nxids == 0)
+					{
+						allocsize = 10;
+						xids = palloc(allocsize * sizeof(TransactionId));
+					}
+					else
+					{
+						allocsize = allocsize * 2;
+						xids = repalloc(xids, allocsize * sizeof(TransactionId));
+					}
+				}
+				xids[nxids++] = xid;
+			}
+
 			pfree(buf);
 		}
 	}
 	FreeDir(cldir);

+	if (xids_p)
+	{
+		*xids_p = xids;
+		*nxids_p = nxids;
+	}
+
 	return result;
 }

@ -1636,6 +1731,7 @@ RecoverPreparedTransactions(void)
 	char		dir[MAXPGPATH];
 	DIR		   *cldir;
 	struct dirent *clde;
+	bool		overwriteOK = false;

 	snprintf(dir, MAXPGPATH, "%s", TWOPHASE_DIR);

@ -1666,7 +1762,7 @@ RecoverPreparedTransactions(void)
 			}

 			/* Read and validate file */
-			buf = ReadTwoPhaseFile(xid);
+			buf = ReadTwoPhaseFile(xid, true);
 			if (buf == NULL)
 			{
 				ereport(WARNING,
@ -1687,6 +1783,15 @@ RecoverPreparedTransactions(void)
 			bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
 			bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
 			bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
+			bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+
+			/*
+			 * It's possible that the subtrans parent has already been set, if the
+			 * prepared transaction generated xid assignment records. The test
+			 * here must match the one used in AssignTransactionId().
+			 */
+			if (InHotStandby && hdr->nsubxacts >= PGPROC_MAX_CACHED_SUBXIDS)
+				overwriteOK = true;

 			/*
 			 * Reconstruct subtrans state for the transaction --- needed
@ -1696,7 +1801,7 @@ RecoverPreparedTransactions(void)
 			 * hierarchy, but there's no need to restore that exactly.
 			 */
 			for (i = 0; i < hdr->nsubxacts; i++)
-				SubTransSetParent(subxids[i], xid);
+				SubTransSetParent(subxids[i], xid, overwriteOK);

 			/*
 			 * Recreate its GXACT and dummy PGPROC
@ -1719,6 +1824,14 @@ RecoverPreparedTransactions(void)
 			 */
 			ProcessRecords(bufptr, xid, twophase_recover_callbacks);

+			/*
+			 * Release locks held by the standby process after we process each
+			 * prepared transaction. As a result, we don't need too many
+			 * additional locks at any one time.
+			 */
+			if (InHotStandby)
+				StandbyReleaseLockTree(xid, hdr->nsubxacts, subxids);
+
 			pfree(buf);
 		}
 	}
@ -1739,9 +1852,12 @@ RecordTransactionCommitPrepared(TransactionId xid,
 								int nchildren,
 								TransactionId *children,
 								int nrels,
-								RelFileNode *rels)
+								RelFileNode *rels,
+								int ninvalmsgs,
+								SharedInvalidationMessage *invalmsgs,
+								bool initfileinval)
 {
-	XLogRecData rdata[3];
+	XLogRecData rdata[4];
 	int			lastrdata = 0;
 	xl_xact_commit_prepared xlrec;
 	XLogRecPtr	recptr;
@ -1754,8 +1870,12 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	/* Emit the XLOG commit record */
 	xlrec.xid = xid;
 	xlrec.crec.xact_time = GetCurrentTimestamp();
+	xlrec.crec.xinfo = initfileinval ? XACT_COMPLETION_UPDATE_RELCACHE_FILE : 0;
+	xlrec.crec.nmsgs = 0;
 	xlrec.crec.nrels = nrels;
 	xlrec.crec.nsubxacts = nchildren;
+	xlrec.crec.nmsgs = ninvalmsgs;
+
 	rdata[0].data = (char *) (&xlrec);
 	rdata[0].len = MinSizeOfXactCommitPrepared;
 	rdata[0].buffer = InvalidBuffer;
@ -1777,6 +1897,15 @@ RecordTransactionCommitPrepared(TransactionId xid,
 		rdata[2].buffer = InvalidBuffer;
 		lastrdata = 2;
 	}
+	/* dump cache invalidation messages */
+	if (ninvalmsgs > 0)
+	{
+		rdata[lastrdata].next = &(rdata[3]);
+		rdata[3].data = (char *) invalmsgs;
+		rdata[3].len = ninvalmsgs * sizeof(SharedInvalidationMessage);
+		rdata[3].buffer = InvalidBuffer;
+		lastrdata = 3;
+	}
 	rdata[lastrdata].next = NULL;

 	recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_PREPARED, rdata);
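Note on the record layout: RecordTransactionCommitPrepared() and RecordTransactionCommit() both append optional arrays to a fixed header by chaining rdata entries and tracking the last entry used in lastrdata. The following is a minimal, self-contained sketch of that idiom, not the PostgreSQL code itself. Segment, build_chain and chain_length are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for XLogRecData: one segment of a record. */
typedef struct Segment
{
	const char *data;
	size_t		len;
	struct Segment *next;
} Segment;

/*
 * Link the fixed header (slot 0) to whichever optional arrays are non-empty.
 * Slots 1..3 are reserved for rels, subxids and invalidation messages.
 * Returns the index of the last slot used, like lastrdata in the patch.
 */
static int
build_chain(Segment seg[4],
			const void *hdr, size_t hdrlen,
			const void *rels, size_t relslen,
			const void *subxids, size_t subxidslen,
			const void *msgs, size_t msgslen)
{
	const void *data[3];
	size_t		len[3];
	int			last = 0;
	int			i;

	data[0] = rels;
	len[0] = relslen;
	data[1] = subxids;
	len[1] = subxidslen;
	data[2] = msgs;
	len[2] = msgslen;

	seg[0].data = hdr;
	seg[0].len = hdrlen;

	for (i = 0; i < 3; i++)
	{
		if (len[i] == 0)
			continue;			/* empty array: leave its slot unlinked */
		seg[last].next = &seg[i + 1];
		seg[i + 1].data = data[i];
		seg[i + 1].len = len[i];
		last = i + 1;
	}
	seg[last].next = NULL;		/* terminate the chain */
	return last;
}

/* Total payload length, found by walking the chain. */
static size_t
chain_length(const Segment *seg)
{
	size_t		total = 0;

	for (; seg != NULL; seg = seg->next)
		total += seg->len;
	return total;
}
```

Because skipped slots are never linked, the reader of the record sees the arrays packed back to back, which is why the redo code can locate each array from the counts stored in the header.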
--- a/src/backend/access/transam/twophase_rmgr.c
+++ b/src/backend/access/transam/twophase_rmgr.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/transam/twophase_rmgr.c,v 1.10 2009/11/23 09:58:36 heikki Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/transam/twophase_rmgr.c,v 1.11 2009/12/19 01:32:33 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -19,14 +19,12 @@
 #include "commands/async.h"
 #include "pgstat.h"
 #include "storage/lock.h"
-#include "utils/inval.h"


 const TwoPhaseCallback twophase_recover_callbacks[TWOPHASE_RM_MAX_ID + 1] =
 {
 	NULL,						/* END ID */
 	lock_twophase_recover,		/* Lock */
-	NULL,						/* Inval */
 	NULL,						/* notify/listen */
 	NULL,						/* pgstat */
 	multixact_twophase_recover	/* MultiXact */
@ -36,7 +34,6 @@ const TwoPhaseCallback twophase_postcommit_callbacks[TWOPHASE_RM_MAX_ID + 1] =
 {
 	NULL,						/* END ID */
 	lock_twophase_postcommit,	/* Lock */
-	inval_twophase_postcommit,	/* Inval */
 	notify_twophase_postcommit, /* notify/listen */
 	pgstat_twophase_postcommit,	/* pgstat */
 	multixact_twophase_postcommit /* MultiXact */
@ -46,8 +43,16 @@ const TwoPhaseCallback twophase_postabort_callbacks[TWOPHASE_RM_MAX_ID + 1] =
 {
 	NULL,						/* END ID */
 	lock_twophase_postabort,	/* Lock */
-	NULL,						/* Inval */
 	NULL,						/* notify/listen */
 	pgstat_twophase_postabort,	/* pgstat */
 	multixact_twophase_postabort /* MultiXact */
 };
+
+const TwoPhaseCallback twophase_standby_recover_callbacks[TWOPHASE_RM_MAX_ID + 1] =
+{
+	NULL,						/* END ID */
+	lock_twophase_standby_recover,		/* Lock */
+	NULL,						/* notify/listen */
+	NULL,						/* pgstat */
+	NULL						/* MultiXact */
+};
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@ -10,7 +10,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/transam/xact.c,v 1.277 2009/12/09 21:57:50 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/transam/xact.c,v 1.278 2009/12/19 01:32:33 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -42,6 +42,7 @@
 #include "storage/procarray.h"
 #include "storage/sinvaladt.h"
 #include "storage/smgr.h"
+#include "storage/standby.h"
 #include "utils/combocid.h"
 #include "utils/guc.h"
 #include "utils/inval.h"
@ -139,6 +140,7 @@ typedef struct TransactionStateData
 	Oid			prevUser;		/* previous CurrentUserId setting */
 	int			prevSecContext;	/* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;		/* entry-time xact r/o state */
+	bool		startedInRecovery;	/* did we start in recovery? */
 	struct TransactionStateData *parent;		/* back link to parent */
 } TransactionStateData;

@ -167,9 +169,17 @@ static TransactionStateData TopTransactionStateData = {
 	InvalidOid,					/* previous CurrentUserId setting */
 	0,							/* previous SecurityRestrictionContext */
 	false,						/* entry-time xact r/o state */
+	false,						/* startedInRecovery */
 	NULL						/* link to parent state block */
 };

+/*
+ * unreportedXids holds XIDs of all subtransactions that have not yet been
+ * reported in a XLOG_XACT_ASSIGNMENT record.
+ */
+static int nUnreportedXids;
+static TransactionId unreportedXids[PGPROC_MAX_CACHED_SUBXIDS];
+
 static TransactionState CurrentTransactionState = &TopTransactionStateData;

 /*
@ -392,6 +402,9 @@ AssignTransactionId(TransactionState s)
 	bool		isSubXact = (s->parent != NULL);
 	ResourceOwner currentOwner;

+	if (RecoveryInProgress())
+		elog(ERROR, "cannot assign TransactionIds during recovery");
+
 	/* Assert that caller didn't screw up */
 	Assert(!TransactionIdIsValid(s->transactionId));
 	Assert(s->state == TRANS_INPROGRESS);
@ -414,7 +427,7 @@ AssignTransactionId(TransactionState s)
 	s->transactionId = GetNewTransactionId(isSubXact);

 	if (isSubXact)
-		SubTransSetParent(s->transactionId, s->parent->transactionId);
+		SubTransSetParent(s->transactionId, s->parent->transactionId, false);

 	/*
 	 * Acquire lock on the transaction XID.  (We assume this cannot block.) We
@ -435,8 +448,57 @@ AssignTransactionId(TransactionState s)
 	}
 	PG_END_TRY();
 	CurrentResourceOwner = currentOwner;
-}

+	/*
+	 * For every PGPROC_MAX_CACHED_SUBXIDS subtransaction ids assigned within
+	 * a top-level transaction we issue a WAL record for the assignment. We
+	 * include the top-level xid and all the subxids that have not yet been
+	 * reported using XLOG_XACT_ASSIGNMENT records.
+	 *
+	 * This is required to limit the amount of shared memory required in a
+	 * hot standby server to keep track of in-progress XIDs. See notes for
+	 * RecordKnownAssignedTransactionIds().
+	 *
+	 * We don't keep track of the immediate parent of each subxid,
+	 * only the top-level transaction that each subxact belongs to. This
+	 * is correct in recovery only because aborted subtransactions are
+	 * separately WAL logged.
+	 */
+	if (isSubXact && XLogStandbyInfoActive())
+	{
+		unreportedXids[nUnreportedXids] = s->transactionId;
+		nUnreportedXids++;
+
+		/* ensure this test matches similar one in RecoverPreparedTransactions() */
+		if (nUnreportedXids >= PGPROC_MAX_CACHED_SUBXIDS)
+		{
+			XLogRecData rdata[2];
+			xl_xact_assignment	xlrec;
+
+			/*
+			 * xtop is always set by now because we recurse up transaction
+			 * stack to the highest unassigned xid and then come back down
+			 */
+			xlrec.xtop = GetTopTransactionId();
+			Assert(TransactionIdIsValid(xlrec.xtop));
+			xlrec.nsubxacts = nUnreportedXids;
+
+			rdata[0].data = (char *) &xlrec;
+			rdata[0].len = MinSizeOfXactAssignment;
+			rdata[0].buffer = InvalidBuffer;
+			rdata[0].next = &rdata[1];
+
+			rdata[1].data = (char *) unreportedXids;
+			rdata[1].len = PGPROC_MAX_CACHED_SUBXIDS * sizeof(TransactionId);
+			rdata[1].buffer = InvalidBuffer;
+			rdata[1].next = NULL;
+
+			(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT, rdata);
+
+			nUnreportedXids = 0;
+		}
+	}
+}

 /*
 *	GetCurrentSubTransactionId
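Note on the hunk above: the batching added to AssignTransactionId() can be pictured as a small accumulator that is flushed whenever it fills up. The sketch below is a simplified model under stated assumptions. XidBatcher and its functions are hypothetical, the batch size of 64 is an assumed stand-in for PGPROC_MAX_CACHED_SUBXIDS, and a counter replaces the real XLogInsert() call.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed batch size, standing in for PGPROC_MAX_CACHED_SUBXIDS. */
#define MAX_CACHED_SUBXIDS 64

typedef uint32_t Xid;

/* Hypothetical accumulator for subxids not yet reported to the standby. */
typedef struct XidBatcher
{
	Xid			unreported[MAX_CACHED_SUBXIDS];
	int			nunreported;
	int			records_emitted;	/* simulated assignment records written */
	int			xids_emitted;		/* subxids carried by those records */
} XidBatcher;

/* Called at transaction start, like the reset in StartTransaction(). */
static void
batcher_reset(XidBatcher *b)
{
	b->nunreported = 0;
	b->records_emitted = 0;
	b->xids_emitted = 0;
}

/*
 * Remember a newly assigned subxid. Once the buffer is full, "emit" one
 * record covering every buffered subxid and start over.
 * Returns 1 if a record was emitted by this call, else 0.
 */
static int
batcher_add(XidBatcher *b, Xid subxid)
{
	b->unreported[b->nunreported++] = subxid;

	if (b->nunreported >= MAX_CACHED_SUBXIDS)
	{
		/* the real code builds and inserts a WAL record here */
		b->records_emitted++;
		b->xids_emitted += b->nunreported;
		b->nunreported = 0;
		return 1;
	}
	return 0;
}
```

With this scheme a receiver never has to hold more than one batch of unassigned subxids per top-level transaction, which appears to be the shared-memory bound the comment in the hunk refers to.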
@ -596,6 +658,18 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
 	return false;
 }

+/*
+ *	TransactionStartedDuringRecovery
+ *
+ * Returns true if the current transaction started while recovery was still
+ * in progress. Recovery might have ended since so RecoveryInProgress() might
+ * return false already.
+ */
+bool
+TransactionStartedDuringRecovery(void)
+{
+	return CurrentTransactionState->startedInRecovery;
+}

 /*
 *	CommandCounterIncrement
@ -811,7 +885,7 @@ AtSubStart_ResourceOwner(void)
 * This is exported only to support an ugly hack in VACUUM FULL.
 */
 TransactionId
-RecordTransactionCommit(void)
+RecordTransactionCommit(bool isVacuumFull)
 {
 	TransactionId xid = GetTopTransactionIdIfAny();
 	bool		markXidCommitted = TransactionIdIsValid(xid);
@ -821,11 +895,15 @@ RecordTransactionCommit(void)
 	bool		haveNonTemp;
 	int			nchildren;
 	TransactionId *children;
+	int			nmsgs;
+	SharedInvalidationMessage *invalMessages = NULL;
+	bool		RelcacheInitFileInval;

 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels, &haveNonTemp);
 	nchildren = xactGetCommittedChildren(&children);
-
+	nmsgs = xactGetCommittedInvalidationMessages(&invalMessages,
+												 &RelcacheInitFileInval);
 	/*
 	 * If we haven't been assigned an XID yet, we neither can, nor do we want
 	 * to write a COMMIT record.
@ -859,13 +937,24 @@ RecordTransactionCommit(void)
 		/*
 		 * Begin commit critical section and insert the commit XLOG record.
 		 */
-		XLogRecData rdata[3];
+		XLogRecData rdata[4];
 		int			lastrdata = 0;
 		xl_xact_commit xlrec;

 		/* Tell bufmgr and smgr to prepare for commit */
 		BufmgrCommit();

+		/*
+		 * Set flags required for recovery processing of commits.
+		 */
+		xlrec.xinfo = 0;
+		if (RelcacheInitFileInval)
+			xlrec.xinfo |= XACT_COMPLETION_UPDATE_RELCACHE_FILE;
+		if (isVacuumFull)
+			xlrec.xinfo |= XACT_COMPLETION_VACUUM_FULL;
+		if (forceSyncCommit)
+			xlrec.xinfo |= XACT_COMPLETION_FORCE_SYNC_COMMIT;
+
 		/*
 		 * Mark ourselves as within our "commit critical section".	This
 		 * forces any concurrent checkpoint to wait until we've updated
@ -890,6 +979,7 @@ RecordTransactionCommit(void)
 		xlrec.xact_time = xactStopTimestamp;
 		xlrec.nrels = nrels;
 		xlrec.nsubxacts = nchildren;
+		xlrec.nmsgs = nmsgs;
 		rdata[0].data = (char *) (&xlrec);
 		rdata[0].len = MinSizeOfXactCommit;
 		rdata[0].buffer = InvalidBuffer;
@ -911,6 +1001,15 @@ RecordTransactionCommit(void)
 			rdata[2].buffer = InvalidBuffer;
 			lastrdata = 2;
 		}
+		/* dump shared cache invalidation messages */
+		if (nmsgs > 0)
+		{
+			rdata[lastrdata].next = &(rdata[3]);
+			rdata[3].data = (char *) invalMessages;
+			rdata[3].len = nmsgs * sizeof(SharedInvalidationMessage);
+			rdata[3].buffer = InvalidBuffer;
+			lastrdata = 3;
+		}
 		rdata[lastrdata].next = NULL;

 		(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata);
@ -1352,6 +1451,13 @@ AtSubAbort_childXids(void)
 	s->childXids = NULL;
 	s->nChildXids = 0;
 	s->maxChildXids = 0;
+
+	/*
+	 * We could prune the unreportedXids array here. But we don't bother.
+	 * That would potentially reduce number of XLOG_XACT_ASSIGNMENT records
+	 * but it would likely introduce more CPU time into the more common
+	 * paths, so we choose not to do that.
+	 */
 }

 /* ----------------------------------------------------------------
@ -1461,9 +1567,23 @@ StartTransaction(void)

 	/*
 	 * Make sure we've reset xact state variables
+	 *
+	 * If recovery is still in progress, mark this transaction as read-only.
+	 * We have lower level defences in XLogInsert and elsewhere to stop us
+	 * from modifying data during recovery, but this gives the normal
+	 * indication to the user that the transaction is read-only.
 	 */
+	if (RecoveryInProgress())
+	{
+		s->startedInRecovery = true;
+		XactReadOnly = true;
+	}
+	else
+	{
+		s->startedInRecovery = false;
+		XactReadOnly = DefaultXactReadOnly;
+	}
 	XactIsoLevel = DefaultXactIsoLevel;
-	XactReadOnly = DefaultXactReadOnly;
 	forceSyncCommit = false;
 	MyXactAccessedTempRel = false;

@ -1475,6 +1595,11 @@ StartTransaction(void)
 	currentCommandId = FirstCommandId;
 	currentCommandIdUsed = false;

+	/*
+	 * initialize reported xid accounting
+	 */
+	nUnreportedXids = 0;
+
 	/*
 	 * must initialize resource-management stuff first
 	 */
@ -1619,7 +1744,7 @@ CommitTransaction(void)
 	/*
 	 * Here is where we really truly commit.
 	 */
-	latestXid = RecordTransactionCommit();
+	latestXid = RecordTransactionCommit(false);

 	TRACE_POSTGRESQL_TRANSACTION_COMMIT(MyProc->lxid);

@ -1853,7 +1978,6 @@ PrepareTransaction(void)
 	StartPrepare(gxact);

 	AtPrepare_Notify();
-	AtPrepare_Inval();
 	AtPrepare_Locks();
 	AtPrepare_PgStat();
 	AtPrepare_MultiXact();
@ -4199,29 +4323,108 @@ xactGetCommittedChildren(TransactionId **ptr)
 *	XLOG support routines
 */

+/*
+ * Before 8.5 this was a fairly short function, but now it performs many
+ * actions for which the order of execution is critical.
+ */
 static void
-xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid)
+xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, XLogRecPtr lsn)
 {
 	TransactionId *sub_xids;
+	SharedInvalidationMessage *inval_msgs;
 	TransactionId max_xid;
 	int			i;

-	/* Mark the transaction committed in pg_clog */
+	/* subxid array follows relfilenodes */
 	sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
-	TransactionIdCommitTree(xid, xlrec->nsubxacts, sub_xids);
+	/* invalidation messages array follows subxids */
+	inval_msgs = (SharedInvalidationMessage *) &(sub_xids[xlrec->nsubxacts]);

-	/* Make sure nextXid is beyond any XID mentioned in the record */
-	max_xid = xid;
-	for (i = 0; i < xlrec->nsubxacts; i++)
-	{
-		if (TransactionIdPrecedes(max_xid, sub_xids[i]))
-			max_xid = sub_xids[i];
-	}
+	max_xid = TransactionIdLatest(xid, xlrec->nsubxacts, sub_xids);
+
+	/*
+	 * Make sure nextXid is beyond any XID mentioned in the record.
+	 *
+	 * We don't expect anyone else to modify nextXid, hence we
+	 * don't need to hold a lock while checking this. We still acquire
+	 * the lock to modify it, though.
+	 */
 	if (TransactionIdFollowsOrEquals(max_xid,
 									 ShmemVariableCache->nextXid))
 	{
+		LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 		ShmemVariableCache->nextXid = max_xid;
 		TransactionIdAdvance(ShmemVariableCache->nextXid);
+		LWLockRelease(XidGenLock);
+	}
+
+	if (!InHotStandby || XactCompletionVacuumFull(xlrec))
+	{
+		/*
+		 * Mark the transaction committed in pg_clog.
+		 *
+		 * If InHotStandby and this is the first commit of a VACUUM FULL INPLACE
+		 * we perform only the actual commit to clog. Strangely, there are two
+		 * commits that share the same xid for every VFI, so we need to skip
+		 * some steps for the first commit. It's OK to repeat the clog update
+		 * when we see the second commit on a VFI.
+		 */
+		TransactionIdCommitTree(xid, xlrec->nsubxacts, sub_xids);
+	}
+	else
+	{
+		/*
+		 * If a transaction completion record arrives that has as-yet unobserved
+		 * subtransactions then this will not have been fully handled by the call
+		 * to RecordKnownAssignedTransactionIds() in the main recovery loop in
+		 * xlog.c. So we need to do bookkeeping again to cover that case. This is
+		 * confusing and it is easy to think this call is irrelevant, which has
+		 * happened three times in development already. Leave it in.
+		 */
+		RecordKnownAssignedTransactionIds(max_xid);
+
+		/*
+		 * Mark the transaction committed in pg_clog. We use async commit
+		 * protocol during recovery to provide information on database
+		 * consistency for when users try to set hint bits. It is important
+		 * that we do not set hint bits until the minRecoveryPoint is past
+		 * this commit record. This ensures that if we crash we don't see
+		 * hint bits set on changes made by transactions that haven't yet
+		 * recovered. It's unlikely but it's good to be safe.
+		 */
+		TransactionIdAsyncCommitTree(xid, xlrec->nsubxacts, sub_xids, lsn);
+
+		/*
+		 * We must mark clog before we update the ProcArray.
+		 */
+		ExpireTreeKnownAssignedTransactionIds(xid, xlrec->nsubxacts, sub_xids);
+
+		/*
+		 * Send any cache invalidations attached to the commit. We must
+		 * maintain the same order as normal commit processing: send
+		 * invalidations first, then release locks.
+		 */
+		if (xlrec->nmsgs > 0)
+		{
+			/*
+			 * Relcache init file invalidation requires processing both
+			 * before and after we send the SI messages. See AtEOXact_Inval()
+			 */
+			if (XactCompletionRelcacheInitFileInval(xlrec))
+				RelationCacheInitFileInvalidate(true);
+
+			SendSharedInvalidMessages(inval_msgs, xlrec->nmsgs);
+
+			if (XactCompletionRelcacheInitFileInval(xlrec))
+				RelationCacheInitFileInvalidate(false);
+		}
+
+		/*
+		 * Release locks, if any. We do this for both two phase and normal
+		 * one phase transactions. In effect we are ignoring the prepare
+		 * phase and just going straight to lock release.
+		 */
+		StandbyReleaseLockTree(xid, xlrec->nsubxacts, sub_xids);
 	}

 	/* Make sure files supposed to be dropped are dropped */
@ -4240,8 +4443,31 @@ xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid)
 		}
 		smgrclose(srel);
 	}
+
+	/*
+	 * We issue an XLogFlush() for the same reason we call ForceSyncCommit() in
+	 * normal operation. For example, in DROP DATABASE, we delete all the files
+	 * belonging to the database, and then commit the transaction. If we crash
+	 * after all the files have been deleted but before the commit, we are left
+	 * with an entry in pg_database without any files. To minimize that window,
+	 * we use ForceSyncCommit() to rush the commit record to disk as quickly as
+	 * possible. The same window exists during recovery, and forcing an
+	 * XLogFlush() (which updates minRecoveryPoint during recovery) helps to
+	 * narrow it for any transaction that requested ForceSyncCommit().
+	 */
+	if (XactCompletionForceSyncCommit(xlrec))
+		XLogFlush(lsn);
 }

+/*
+ * Be careful with the order of execution, as with xact_redo_commit().
+ * The two functions are similar but differ in key places.
+ *
+ * Note also that an abort can be for a subtransaction and its children,
+ * not just for a top level abort. That means we have to consider
+ * topxid != xid, whereas in commit we would find topxid == xid always
+ * because subtransaction commit is never WAL logged.
+ */
 static void
 xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid)
 {
@ -4249,22 +4475,55 @@ xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid)
 	TransactionId max_xid;
 	int			i;

-	/* Mark the transaction aborted in pg_clog */
 	sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
-	TransactionIdAbortTree(xid, xlrec->nsubxacts, sub_xids);
+	max_xid = TransactionIdLatest(xid, xlrec->nsubxacts, sub_xids);

 	/* Make sure nextXid is beyond any XID mentioned in the record */
-	max_xid = xid;
-	for (i = 0; i < xlrec->nsubxacts; i++)
-	{
-		if (TransactionIdPrecedes(max_xid, sub_xids[i]))
-			max_xid = sub_xids[i];
-	}
+	/*
+	 * We don't expect anyone else to modify nextXid, hence we don't need to
+	 * hold a lock while checking this. We still acquire the lock to modify it.
+	 */
 	if (TransactionIdFollowsOrEquals(max_xid,
 									 ShmemVariableCache->nextXid))
 	{
+		LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 		ShmemVariableCache->nextXid = max_xid;
 		TransactionIdAdvance(ShmemVariableCache->nextXid);
+		LWLockRelease(XidGenLock);
+	}
+
+	if (InHotStandby)
+	{
+		/*
+		 * If a transaction completion record arrives that has as-yet unobserved
+		 * subtransactions then this will not have been fully handled by the call
+		 * to RecordKnownAssignedTransactionIds() in the main recovery loop in
+		 * xlog.c. So we need to do bookkeeping again to cover that case. This is
+		 * confusing and it is easy to think this call is irrelevant, which has
+		 * happened three times in development already. Leave it in.
+		 */
+		RecordKnownAssignedTransactionIds(max_xid);
+	}
+
+	/* Mark the transaction aborted in pg_clog, no need for async stuff */
+	TransactionIdAbortTree(xid, xlrec->nsubxacts, sub_xids);
+
+	if (InHotStandby)
+	{
+		/*
+		 * We must mark clog before we update the ProcArray.
+		 */
+		ExpireTreeKnownAssignedTransactionIds(xid, xlrec->nsubxacts, sub_xids);
+
+		/*
+		 * There are no flat files that need updating, nor invalidation
+		 * messages to send or undo.
+		 */
+
+		/*
+		 * Release locks, if any. There are no invalidations to send.
+		 */
+		StandbyReleaseLockTree(xid, xlrec->nsubxacts, sub_xids);
 	}

 	/* Make sure files supposed to be dropped are dropped */
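Note on TransactionIdLatest(): both redo routines now use it in place of the hand-written loop over sub_xids. The loop it replaces relied on a wraparound-aware comparison, which the sketch below illustrates. This is a simplified model rather than the PostgreSQL implementation: xid_precedes and xid_latest are hypothetical names, and the special handling of permanent (non-normal) XIDs is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Xid;

/*
 * True if a is logically before b on the circular 32-bit XID space.
 * The unsigned difference is reinterpreted as signed, so values up to
 * 2^31 apart compare correctly even across the wraparound point.
 */
static int
xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/* Return the logically latest XID among mainxid and xids[0..nxids-1]. */
static Xid
xid_latest(Xid mainxid, int nxids, const Xid *xids)
{
	Xid			result = mainxid;
	int			i;

	for (i = 0; i < nxids; i++)
	{
		if (xid_precedes(result, xids[i]))
			result = xids[i];
	}
	return result;
}
```

A plain `<` comparison would pick the numerically largest value instead, which is wrong once subtransaction ids have wrapped past the top-level xid.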
@ -4297,7 +4556,7 @@ xact_redo(XLogRecPtr lsn, XLogRecord *record)
 	{
 		xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(record);

-		xact_redo_commit(xlrec, record->xl_xid);
+		xact_redo_commit(xlrec, record->xl_xid, lsn);
 	}
 	else if (info == XLOG_XACT_ABORT)
 	{
@ -4315,7 +4574,7 @@ xact_redo(XLogRecPtr lsn, XLogRecord *record)
 	{
 		xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared *) XLogRecGetData(record);

-		xact_redo_commit(&xlrec->crec, xlrec->xid);
+		xact_redo_commit(&xlrec->crec, xlrec->xid, lsn);
 		RemoveTwoPhaseFile(xlrec->xid, false);
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
@ -4325,6 +4584,14 @@ xact_redo(XLogRecPtr lsn, XLogRecord *record)
 		xact_redo_abort(&xlrec->arec, xlrec->xid);
 		RemoveTwoPhaseFile(xlrec->xid, false);
 	}
+	else if (info == XLOG_XACT_ASSIGNMENT)
+	{
+		xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record);
+
+		if (InHotStandby)
+			ProcArrayApplyXidAssignment(xlrec->xtop,
+										xlrec->nsubxacts, xlrec->xsub);
+	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
@ -4333,6 +4600,14 @@ static void
 xact_desc_commit(StringInfo buf, xl_xact_commit *xlrec)
 {
 	int			i;
+	TransactionId *xacts;
+	SharedInvalidationMessage *msgs;
+
+	xacts = (TransactionId *) &xlrec->xnodes[xlrec->nrels];
+	msgs = (SharedInvalidationMessage *) &xacts[xlrec->nsubxacts];

 	appendStringInfoString(buf, timestamptz_to_str(xlrec->xact_time));
+
+	if (XactCompletionRelcacheInitFileInval(xlrec))
+		appendStringInfo(buf, "; relcache init file inval");
 	if (xlrec->nrels > 0)
@ -4348,13 +4623,25 @@ xact_desc_commit(StringInfo buf, xl_xact_commit *xlrec)
 	}
 	if (xlrec->nsubxacts > 0)
 	{
-		TransactionId *xacts = (TransactionId *)
-		&xlrec->xnodes[xlrec->nrels];
-
 		appendStringInfo(buf, "; subxacts:");
 		for (i = 0; i < xlrec->nsubxacts; i++)
 			appendStringInfo(buf, " %u", xacts[i]);
 	}
+	if (xlrec->nmsgs > 0)
+	{
+		appendStringInfo(buf, "; inval msgs:");
+		for (i = 0; i < xlrec->nmsgs; i++)
+		{
+			SharedInvalidationMessage *msg = &msgs[i];
+
+			if (msg->id >= 0)
+				appendStringInfo(buf, " catcache id%d", msg->id);
+			else if (msg->id == SHAREDINVALRELCACHE_ID)
+				appendStringInfo(buf, " relcache");
+			else if (msg->id == SHAREDINVALSMGR_ID)
+				appendStringInfo(buf, " smgr");
+		}
+	}
 }

 static void
@ -4385,6 +4672,17 @@ xact_desc_abort(StringInfo buf, xl_xact_abort *xlrec)
 	}
 }

+static void
+xact_desc_assignment(StringInfo buf, xl_xact_assignment *xlrec)
+{
+	int			i;
+
+	appendStringInfo(buf, "subxacts:");
+
+	for (i = 0; i < xlrec->nsubxacts; i++)
+		appendStringInfo(buf, " %u", xlrec->xsub[i]);
+}
+
 void
 xact_desc(StringInfo buf, uint8 xl_info, char *rec)
 {
@ -4412,16 +4710,28 @@ xact_desc(StringInfo buf, uint8 xl_info, char *rec)
 	{
 		xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared *) rec;

-		appendStringInfo(buf, "commit %u: ", xlrec->xid);
+		appendStringInfo(buf, "commit prepared %u: ", xlrec->xid);
 		xact_desc_commit(buf, &xlrec->crec);
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
 		xl_xact_abort_prepared *xlrec = (xl_xact_abort_prepared *) rec;

-		appendStringInfo(buf, "abort %u: ", xlrec->xid);
+		appendStringInfo(buf, "abort prepared %u: ", xlrec->xid);
 		xact_desc_abort(buf, &xlrec->arec);
 	}
+	else if (info == XLOG_XACT_ASSIGNMENT)
+	{
+		xl_xact_assignment *xlrec = (xl_xact_assignment *) rec;
+
+		/*
+		 * Note that we ignore the WAL record's xid, since we're more
+		 * interested in the top-level xid that issued the record
+		 * and which xids are being reported here.
+		 */
+		appendStringInfo(buf, "xid assignment xtop %u: ", xlrec->xtop);
+		xact_desc_assignment(buf, xlrec);
+	}
 	else
 		appendStringInfo(buf, "UNKNOWN");
 }
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/backend/access/transam/xlog.c,v 1.353 2009/09/13 18:32:07 heikki Exp $
+ * $PostgreSQL: pgsql/src/backend/access/transam/xlog.c,v 1.354 2009/12/19 01:32:33 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -67,6 +67,8 @@ int			XLOGbuffers = 8;
 int			XLogArchiveTimeout = 0;
 bool		XLogArchiveMode = false;
 char	   *XLogArchiveCommand = NULL;
+bool 		XLogRequestRecoveryConnections = true;
+int			MaxStandbyDelay = 30;
 bool		fullPageWrites = true;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
@ -129,10 +131,16 @@ TimeLineID	ThisTimeLineID = 0;
 * recovery mode".  It should be examined primarily by functions that need
 * to act differently when called from a WAL redo function (e.g., to skip WAL
 * logging).  To check whether the system is in recovery regardless of which
- * process you're running in, use RecoveryInProgress().
+ * process you're running in, use RecoveryInProgress() but only after shared
+ * memory startup and lock initialization.
 */
 bool		InRecovery = false;

+/* Are we in Hot Standby mode? Only valid in startup process, see xlog.h */
+HotStandbyState		standbyState = STANDBY_DISABLED;
+
+static 	XLogRecPtr	LastRec;
+
 /*
 * Local copy of SharedRecoveryInProgress variable. True actually means "not
 * known, need to check the shared state".
@ -359,6 +367,8 @@ typedef struct XLogCtlData

 	/* end+1 of the last record replayed (or being replayed) */
 	XLogRecPtr	replayEndRecPtr;
+	/* timestamp of last record replayed (or being replayed) */
+	TimestampTz	recoveryLastXTime;

 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
@ -463,6 +473,7 @@ static void readRecoveryCommandFile(void);
 static void exitArchiveRecovery(TimeLineID endTLI,
 					uint32 endLogId, uint32 endLogSeg);
 static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
+static void CheckRequiredParameterValues(CheckPoint checkPoint);
 static void LocalSetXLogInsertAllowed(void);
 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);

@ -2103,9 +2114,40 @@ XLogAsyncCommitFlush(void)
 bool
 XLogNeedsFlush(XLogRecPtr record)
 {
-	/* XLOG doesn't need flushing during recovery */
+	/*
+	 * During recovery, we don't flush WAL but update minRecoveryPoint
+	 * instead. So "needs flush" is taken to mean whether minRecoveryPoint
+	 * would need to be updated.
+	 */
 	if (RecoveryInProgress())
-		return false;
+	{
+		/* Quick exit if already known updated */
+		if (XLByteLE(record, minRecoveryPoint) || !updateMinRecoveryPoint)
+			return false;
+
+		/*
+		 * Update local copy of minRecoveryPoint. But if the lock is busy,
+		 * just return a conservative guess.
+		 */
+		if (!LWLockConditionalAcquire(ControlFileLock, LW_SHARED))
+			return true;
+		minRecoveryPoint = ControlFile->minRecoveryPoint;
+		LWLockRelease(ControlFileLock);
+
+		/*
+		 * An invalid minRecoveryPoint means that we need to recover all the WAL,
+		 * i.e., we're doing crash recovery.  We never modify the control file's
+		 * value in that case, so we can short-circuit future checks here too.
+		 */
+		if (minRecoveryPoint.xlogid == 0 && minRecoveryPoint.xrecoff == 0)
+			updateMinRecoveryPoint = false;
+
+		/* check again */
+		if (XLByteLE(record, minRecoveryPoint) || !updateMinRecoveryPoint)
+			return false;
+		else
+			return true;
+	}

 	/* Quick exit if already known flushed */
 	if (XLByteLE(record, LogwrtResult.Flush))
@ -3259,10 +3301,11 @@ CleanupBackupHistory(void)
 * ignoring them as already applied, but that's not a huge drawback.
 *
 * If 'cleanup' is true, a cleanup lock is used when restoring blocks.
- * Otherwise, a normal exclusive lock is used.	At the moment, that's just
- * pro forma, because there can't be any regular backends in the system
- * during recovery.  The 'cleanup' argument applies to all backup blocks
- * in the WAL record, that suffices for now.
+ * Otherwise, a normal exclusive lock is used.	During crash recovery, that's
+ * just pro forma because there can't be any regular backends in the system,
+ * but in hot standby mode the distinction is important. The 'cleanup'
+ * argument applies to all backup blocks in the WAL record, that suffices for
+ * now.
 */
 void
 RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
@ -4679,6 +4722,7 @@ BootStrapXLOG(void)
 	checkPoint.oldestXid = FirstNormalTransactionId;
 	checkPoint.oldestXidDB = TemplateDbOid;
 	checkPoint.time = (pg_time_t) time(NULL);
+	checkPoint.oldestActiveXid = InvalidTransactionId;

 	ShmemVariableCache->nextXid = checkPoint.nextXid;
 	ShmemVariableCache->nextOid = checkPoint.nextOid;
@ -5117,22 +5161,43 @@ recoveryStopsHere(XLogRecord *record, bool *includeThis)
 	TimestampTz recordXtime;

 	/* We only consider stopping at COMMIT or ABORT records */
-	if (record->xl_rmid != RM_XACT_ID)
-		return false;
-	record_info = record->xl_info & ~XLR_INFO_MASK;
-	if (record_info == XLOG_XACT_COMMIT)
+	if (record->xl_rmid == RM_XACT_ID)
 	{
-		xl_xact_commit *recordXactCommitData;
+		record_info = record->xl_info & ~XLR_INFO_MASK;
+		if (record_info == XLOG_XACT_COMMIT)
+		{
+			xl_xact_commit *recordXactCommitData;

-		recordXactCommitData = (xl_xact_commit *) XLogRecGetData(record);
-		recordXtime = recordXactCommitData->xact_time;
+			recordXactCommitData = (xl_xact_commit *) XLogRecGetData(record);
+			recordXtime = recordXactCommitData->xact_time;
+		}
+		else if (record_info == XLOG_XACT_ABORT)
+		{
+			xl_xact_abort *recordXactAbortData;
+
+			recordXactAbortData = (xl_xact_abort *) XLogRecGetData(record);
+			recordXtime = recordXactAbortData->xact_time;
+		}
+		else
+			return false;
 	}
-	else if (record_info == XLOG_XACT_ABORT)
+	else if (record->xl_rmid == RM_XLOG_ID)
 	{
-		xl_xact_abort *recordXactAbortData;
+		record_info = record->xl_info & ~XLR_INFO_MASK;
+		if (record_info == XLOG_CHECKPOINT_SHUTDOWN ||
+			record_info == XLOG_CHECKPOINT_ONLINE)
+		{
+			CheckPoint	checkPoint;

-		recordXactAbortData = (xl_xact_abort *) XLogRecGetData(record);
-		recordXtime = recordXactAbortData->xact_time;
+			memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
+			recoveryLastXTime = checkPoint.time;
+		}
+
+		/*
+		 * We don't want to stop recovery on a checkpoint record, but we do
+		 * want to update recoveryLastXTime. So return is unconditional.
+		 */
+		return false;
 	}
 	else
 		return false;
@ -5216,6 +5281,67 @@ recoveryStopsHere(XLogRecord *record, bool *includeThis)
 	return stopsHere;
 }

+/*
+ * Returns true if the server is still in recovery, which is a global state.
+ */
+Datum
+pg_is_in_recovery(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_BOOL(RecoveryInProgress());
+}
+
+/*
+ * Returns timestamp of last recovered commit/abort record.
+ */
+TimestampTz
+GetLatestXLogTime(void)
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	SpinLockAcquire(&xlogctl->info_lck);
+	recoveryLastXTime = xlogctl->recoveryLastXTime;
+	SpinLockRelease(&xlogctl->info_lck);
+
+	return recoveryLastXTime;
+}
+
+/*
+ * Note that the text field supplied is a parameter name and does not require translation
+ */
+#define RecoveryRequiresIntParameter(param_name, currValue, checkpointValue) \
+do { \
+	if ((currValue) < (checkpointValue)) \
+		ereport(ERROR, \
+			(errmsg("recovery connections cannot continue because " \
+					"%s = %d is a lower setting than on WAL source server (value was %d)", \
+					param_name, \
+					currValue, \
+					checkpointValue))); \
+} while(0)
+
+/*
+ * Check to see if required parameters are set high enough on this server
+ * for various aspects of recovery operation.
+ */
+static void
+CheckRequiredParameterValues(CheckPoint checkPoint)
+{
+	/* We ignore autovacuum_max_workers when we make this test. */
+	RecoveryRequiresIntParameter("max_connections",
+									MaxConnections, checkPoint.MaxConnections);
+
+	RecoveryRequiresIntParameter("max_prepared_xacts",
+									max_prepared_xacts, checkPoint.max_prepared_xacts);
+	RecoveryRequiresIntParameter("max_locks_per_xact",
+									max_locks_per_xact, checkPoint.max_locks_per_xact);
+
+	if (!checkPoint.XLogStandbyInfoMode)
+		ereport(ERROR,
+			(errmsg("recovery connections cannot start because the recovery_connections "
+					"parameter is disabled on the WAL source server")));
+}
+
 /*
 * This must be called ONCE during postmaster or standalone-backend startup
 */
@ -5228,7 +5354,6 @@ StartupXLOG(void)
 	bool		reachedStopPoint = false;
 	bool		haveBackupLabel = false;
 	XLogRecPtr	RecPtr,
-				LastRec,
 				checkPointLoc,
 				backupStopLoc,
 				EndOfLog;
@ -5238,6 +5363,7 @@ StartupXLOG(void)
 	uint32		freespace;
 	TransactionId oldestActiveXID;
 	bool		bgwriterLaunched = false;
+	bool		backendsAllowed = false;

 	/*
 	 * Read control file and check XLOG status looks valid.
@ -5506,6 +5632,38 @@ StartupXLOG(void)
 								BACKUP_LABEL_FILE, BACKUP_LABEL_OLD)));
 		}

+		/*
+		 * Initialize recovery connections, if enabled. We won't let backends
+		 * in yet, not until we've reached the min recovery point specified
+		 * in control file and we've established a recovery snapshot from a
+		 * running-xacts WAL record.
+		 */
+		if (InArchiveRecovery && XLogRequestRecoveryConnections)
+		{
+			TransactionId *xids;
+			int nxids;
+
+			CheckRequiredParameterValues(checkPoint);
+
+			ereport(LOG,
+				(errmsg("initializing recovery connections")));
+
+			InitRecoveryTransactionEnvironment();
+
+			if (wasShutdown)
+				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			else
+				oldestActiveXID = checkPoint.oldestActiveXid;
+			Assert(TransactionIdIsValid(oldestActiveXID));
+
+			/* Startup commit log and related stuff */
+			StartupCLOG();
+			StartupSUBTRANS(oldestActiveXID);
+			StartupMultiXact();
+
+			ProcArrayInitRecoveryInfo(oldestActiveXID);
+		}
+
 		/* Initialize resource managers */
 		for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
 		{
@ -5580,7 +5738,9 @@ StartupXLOG(void)
 			do
 			{
 #ifdef WAL_DEBUG
-				if (XLOG_DEBUG)
+				if (XLOG_DEBUG ||
+					(rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||
+					(rmid != RM_XACT_ID && trace_recovery_messages <= DEBUG3))
 				{
 					StringInfoData buf;

@ -5608,27 +5768,29 @@ StartupXLOG(void)
 				}

 				/*
-				 * Check if we were requested to exit without finishing
-				 * recovery.
-				 */
-				if (shutdown_requested)
-					proc_exit(1);
-
-				/*
-				 * Have we passed our safe starting point? If so, we can tell
-				 * postmaster that the database is consistent now.
+				 * Have we passed our safe starting point?
 				 */
 				if (!reachedMinRecoveryPoint &&
-					XLByteLT(minRecoveryPoint, EndRecPtr))
+					XLByteLE(minRecoveryPoint, EndRecPtr))
 				{
 					reachedMinRecoveryPoint = true;
-					if (InArchiveRecovery)
-					{
-						ereport(LOG,
-							  (errmsg("consistent recovery state reached")));
-						if (IsUnderPostmaster)
-							SendPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT);
-					}
+					ereport(LOG,
+							(errmsg("consistent recovery state reached at %X/%X",
+									EndRecPtr.xlogid, EndRecPtr.xrecoff)));
+				}
+
+				/*
+				 * Have we got a valid starting snapshot that will allow
+				 * queries to be run? If so, we can tell postmaster that
+				 * the database is consistent now, enabling connections.
+				 */
+				if (standbyState == STANDBY_SNAPSHOT_READY &&
+					!backendsAllowed &&
+					reachedMinRecoveryPoint &&
+					IsUnderPostmaster)
+				{
+					backendsAllowed = true;
+					SendPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT);
 				}

 				/*
@ -5662,8 +5824,13 @@ StartupXLOG(void)
 				 */
 				SpinLockAcquire(&xlogctl->info_lck);
 				xlogctl->replayEndRecPtr = EndRecPtr;
+				xlogctl->recoveryLastXTime = recoveryLastXTime;
 				SpinLockRelease(&xlogctl->info_lck);

+				/* In Hot Standby mode, keep track of XIDs we've seen */
+				if (InHotStandby && TransactionIdIsValid(record->xl_xid))
+					RecordKnownAssignedTransactionIds(record->xl_xid);
+
 				RmgrTable[record->xl_rmid].rm_redo(EndRecPtr, record);

 				/* Pop the error context stack */
@ -5810,7 +5977,7 @@ StartupXLOG(void)
 	}

 	/* Pre-scan prepared transactions to find out the range of XIDs present */
-	oldestActiveXID = PrescanPreparedTransactions();
+	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);

 	if (InRecovery)
 	{
@ -5891,14 +6058,27 @@ StartupXLOG(void)
 	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
 	TransactionIdRetreat(ShmemVariableCache->latestCompletedXid);

-	/* Start up the commit log and related stuff, too */
-	StartupCLOG();
-	StartupSUBTRANS(oldestActiveXID);
-	StartupMultiXact();
+	/*
+	 * Start up the commit log and related stuff, too. In hot standby mode
+	 * we did this already before WAL replay.
+	 */
+	if (standbyState == STANDBY_DISABLED)
+	{
+		StartupCLOG();
+		StartupSUBTRANS(oldestActiveXID);
+		StartupMultiXact();
+	}

 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();

+	/*
+	 * Shut down the recovery environment. This must occur after
+	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover().
+	 */
+	if (standbyState != STANDBY_DISABLED)
+		ShutdownRecoveryTransactionEnvironment();
+
 	/* Shut down readFile facility, free space */
 	if (readFile >= 0)
 	{
@ -5964,8 +6144,9 @@ RecoveryInProgress(void)

 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
-		 * is finished.  (If you change this, see also
-		 * LocalSetXLogInsertAllowed.)
+		 * is finished. InitPostgres() relies upon this behaviour to ensure
+		 * that InitXLOGAccess() is called at backend startup.  (If you change
+		 * this, see also LocalSetXLogInsertAllowed.)
 		 */
 		if (!LocalRecoveryInProgress)
 			InitXLOGAccess();
@ -6151,7 +6332,7 @@ InitXLOGAccess(void)
 {
 	/* ThisTimeLineID doesn't change so we need no lock to copy it */
 	ThisTimeLineID = XLogCtl->ThisTimeLineID;
-	Assert(ThisTimeLineID != 0);
+	Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode());

 	/* Use GetRedoRecPtr to copy the RedoRecPtr safely */
 	(void) GetRedoRecPtr();
@ -6449,6 +6630,12 @@ CreateCheckPoint(int flags)
 	MemSet(&checkPoint, 0, sizeof(checkPoint));
 	checkPoint.time = (pg_time_t) time(NULL);

+	/* Set important parameter values for use when replaying WAL */
+	checkPoint.MaxConnections = MaxConnections;
+	checkPoint.max_prepared_xacts = max_prepared_xacts;
+	checkPoint.max_locks_per_xact = max_locks_per_xact;
+	checkPoint.XLogStandbyInfoMode = XLogStandbyInfoActive();
+
 	/*
 	 * We must hold WALInsertLock while examining insert state to determine
 	 * the checkpoint REDO pointer.
@ -6624,6 +6811,21 @@ CreateCheckPoint(int flags)

 	CheckPointGuts(checkPoint.redo, flags);

+	/*
+	 * Take a snapshot of running transactions and write this to WAL.
+	 * This allows us to reconstruct the state of running transactions
+	 * during archive recovery, if required. Skip if this info is disabled.
+	 *
+	 * If we are shutting down, or Startup process is completing crash
+	 * recovery we don't need to write running xact data.
+	 *
+	 * Update checkPoint.nextXid since we have a later value
+	 */
+	if (!shutdown && XLogStandbyInfoActive())
+		 LogStandbySnapshot(&checkPoint.oldestActiveXid, &checkPoint.nextXid);
+	else
+		checkPoint.oldestActiveXid = InvalidTransactionId;
+
 	START_CRIT_SECTION();

 	/*
@ -6791,7 +6993,7 @@ RecoveryRestartPoint(const CheckPoint *checkPoint)
 		if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
 			if (!(RmgrTable[rmid].rm_safe_restartpoint()))
 			{
-				elog(DEBUG2, "RM %d not safe to record restart point at %X/%X",
+				elog(trace_recovery(DEBUG2), "RM %d not safe to record restart point at %X/%X",
 					 rmid,
 					 checkPoint->redo.xlogid,
 					 checkPoint->redo.xrecoff);
@ -6923,14 +7125,9 @@ CreateRestartPoint(int flags)
 		LogCheckpointEnd(true);

 	ereport((log_checkpoints ? LOG : DEBUG2),
-			(errmsg("recovery restart point at %X/%X",
-				  lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff)));
-
-	/* XXX this is currently BROKEN because we are in the wrong process */
-	if (recoveryLastXTime)
-		ereport((log_checkpoints ? LOG : DEBUG2),
-				(errmsg("last completed transaction was at log time %s",
-						timestamptz_to_str(recoveryLastXTime))));
+			(errmsg("recovery restart point at %X/%X with latest known log time %s",
+					lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff,
+					timestamptz_to_str(GetLatestXLogTime()))));

 	LWLockRelease(CheckpointLock);
 	return true;
@ -7036,6 +7233,19 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
 		ShmemVariableCache->oldestXid = checkPoint.oldestXid;
 		ShmemVariableCache->oldestXidDB = checkPoint.oldestXidDB;

+		/* Check to see if any changes to max_connections give problems */
+		if (standbyState != STANDBY_DISABLED)
+			CheckRequiredParameterValues(checkPoint);
+
+		if (standbyState >= STANDBY_INITIALIZED)
+		{
+			/*
+			 * Remove stale transactions, if any.
+			 */
+			ExpireOldKnownAssignedTransactionIds(checkPoint.nextXid);
+			StandbyReleaseOldLocks(checkPoint.nextXid);
+		}
+
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
@ -7114,7 +7324,7 @@ xlog_desc(StringInfo buf, uint8 xl_info, char *rec)

 		appendStringInfo(buf, "checkpoint: redo %X/%X; "
 						 "tli %u; xid %u/%u; oid %u; multi %u; offset %u; "
-						 "oldest xid %u in DB %u; %s",
+						 "oldest xid %u in DB %u; oldest running xid %u; %s",
 						 checkpoint->redo.xlogid, checkpoint->redo.xrecoff,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->nextXidEpoch, checkpoint->nextXid,
@ -7123,6 +7333,7 @@ xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
 						 checkpoint->nextMultiOffset,
 						 checkpoint->oldestXid,
 						 checkpoint->oldestXidDB,
+						 checkpoint->oldestActiveXid,
 				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
 	else if (info == XLOG_NOOP)
@ -7155,6 +7366,9 @@ xlog_outrec(StringInfo buf, XLogRecord *record)
 					 record->xl_prev.xlogid, record->xl_prev.xrecoff,
 					 record->xl_xid);

+	appendStringInfo(buf, "; len %u",
+					 record->xl_len);
+
 	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
 	{
 		if (record->xl_info & XLR_SET_BKP_BLOCK(i))
@ -7311,6 +7525,12 @@ pg_start_backup(PG_FUNCTION_ARGS)
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 errmsg("must be superuser to run a backup")));

+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
 	if (!XLogArchivingActive())
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@ -7498,6 +7718,12 @@ pg_stop_backup(PG_FUNCTION_ARGS)
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 (errmsg("must be superuser to run a backup"))));

+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
 	if (!XLogArchivingActive())
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@ -7659,6 +7885,12 @@ pg_switch_xlog(PG_FUNCTION_ARGS)
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 			 (errmsg("must be superuser to switch transaction log files"))));

+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
 	switchpoint = RequestXLogSwitch();

 	/*
@ -7681,6 +7913,12 @@ pg_current_xlog_location(PG_FUNCTION_ARGS)
 {
 	char		location[MAXFNAMELEN];

+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
 	/* Make sure we have an up-to-date local LogwrtResult */
 	{
 		/* use volatile pointer to prevent code rearrangement */
@ -7708,6 +7946,12 @@ pg_current_xlog_insert_location(PG_FUNCTION_ARGS)
 	XLogRecPtr	current_recptr;
 	char		location[MAXFNAMELEN];

+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
 	/*
 	 * Get the current end-of-WAL position ... shared lock is sufficient
 	 */
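The parameter check added to xlog.c above can be illustrated outside the server. The following is a minimal standalone sketch, not the committed code: `SketchSettings` and `first_insufficient_parameter` are hypothetical names, and the sketch returns the offending parameter name instead of raising an error through ereport().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the settings a checkpoint record carries. */
typedef struct SketchSettings
{
	int			max_connections;
	int			max_prepared_xacts;
	int			max_locks_per_xact;
} SketchSettings;

/*
 * Return the name of the first parameter whose local (standby) value is
 * lower than the value recorded by the WAL source server, or NULL if all
 * local values are high enough. The order of the checks follows
 * CheckRequiredParameterValues() in the diff.
 */
const char *
first_insufficient_parameter(const SketchSettings *local,
							 const SketchSettings *checkpoint)
{
	assert(local != NULL && checkpoint != NULL);

	if (local->max_connections < checkpoint->max_connections)
		return "max_connections";
	if (local->max_prepared_xacts < checkpoint->max_prepared_xacts)
		return "max_prepared_xacts";
	if (local->max_locks_per_xact < checkpoint->max_locks_per_xact)
		return "max_locks_per_xact";
	return NULL;
}
```

A standby may set any of these values higher than the source server; only a lower value is reported.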
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@ -13,7 +13,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/dbcommands.c,v 1.228 2009/11/12 02:46:16 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/dbcommands.c,v 1.229 2009/12/19 01:32:34 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -26,6 +26,7 @@

 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
@ -48,6 +49,7 @@
 #include "storage/ipc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/standby.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@ -1941,6 +1943,26 @@ dbase_redo(XLogRecPtr lsn, XLogRecord *record)

 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);

+		if (InHotStandby)
+		{
+			VirtualTransactionId *database_users;
+
+			/*
+			 * Find all users connected to this database and, after the
+			 * usual grace period, ask them politely to kill their sessions
+			 * immediately, before we process the drop database record.
+			 * We don't wait for commit because drop database is
+			 * non-transactional.
+			 */
+		    database_users = GetConflictingVirtualXIDs(InvalidTransactionId,
+													   xlrec->db_id,
+													   false);
+
+			ResolveRecoveryConflictWithVirtualXIDs(database_users,
+												   "drop database",
+												   CONFLICT_MODE_FATAL);
+		}
+
 		/* Drop pages for this database that are in the shared buffer cache */
 		DropDatabaseBuffers(xlrec->db_id);

--- a/src/backend/commands/lockcmds.c
+++ b/src/backend/commands/lockcmds.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/lockcmds.c,v 1.25 2009/06/11 14:48:56 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/lockcmds.c,v 1.26 2009/12/19 01:32:34 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -47,6 +47,16 @@ LockTableCommand(LockStmt *lockstmt)

 		reloid = RangeVarGetRelid(relation, false);

+		/*
+		 * During recovery we only accept these variations:
+		 *   LOCK TABLE foo IN ACCESS SHARE MODE
+		 *   LOCK TABLE foo IN ROW SHARE MODE
+		 *   LOCK TABLE foo IN ROW EXCLUSIVE MODE
+		 * This test must match the restrictions defined in LockAcquire()
+		 */
+		if (lockstmt->mode > RowExclusiveLock)
+			PreventCommandDuringRecovery();
+
 		LockTableRecurse(reloid, relation,
 						 lockstmt->mode, lockstmt->nowait, recurse);
 	}
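The lock mode test in the lockcmds.c hunk above relies on the numeric ordering of PostgreSQL's table lock modes. A minimal standalone sketch of that test follows; the mode numbers mirror PostgreSQL's lock mode definitions, while `lock_mode_allowed_in_recovery` is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Table lock modes, numbered from weakest to strongest. */
#define AccessShareLock				1
#define RowShareLock				2
#define RowExclusiveLock			3
#define ShareUpdateExclusiveLock	4
#define ShareLock					5
#define ShareRowExclusiveLock		6
#define ExclusiveLock				7
#define AccessExclusiveLock			8

/*
 * Hypothetical helper: true if LOCK TABLE in this mode would pass the
 * "lockstmt->mode > RowExclusiveLock" test shown in the diff, i.e. the
 * command is not prevented during recovery.
 */
bool
lock_mode_allowed_in_recovery(int lockmode)
{
	assert(lockmode >= AccessShareLock && lockmode <= AccessExclusiveLock);

	return lockmode <= RowExclusiveLock;
}
```

Because the modes are ordered by strength, a single comparison separates the three accepted modes from the five stronger ones.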
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/sequence.c,v 1.162 2009/10/13 00:53:07 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/sequence.c,v 1.163 2009/12/19 01:32:34 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -458,6 +458,9 @@ nextval_internal(Oid relid)
 				rescnt = 0;
 	bool		logit = false;

+	/* nextval() writes to database and must be prevented during recovery */
+	PreventCommandDuringRecovery();
+
 	/* open and AccessShareLock sequence */
 	init_sequence(relid, &elm, &seqrel);

--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@ -37,7 +37,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/tablespace.c,v 1.63 2009/11/10 18:53:38 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/tablespace.c,v 1.64 2009/12/19 01:32:34 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -50,6 +50,7 @@

 #include "access/heapam.h"
 #include "access/sysattr.h"
+#include "access/transam.h"
 #include "access/xact.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
@ -60,6 +61,8 @@
 #include "miscadmin.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
+#include "storage/procarray.h"
+#include "storage/standby.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@ -1317,11 +1320,58 @@ tblspc_redo(XLogRecPtr lsn, XLogRecord *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);

+		/*
+		 * If we issued a WAL record for a drop tablespace it is
+		 * because there were no files in it at all. That means that
+		 * no permanent objects can exist in it at this point.
+		 *
+		 * It is possible for standby users to be using this tablespace
+		 * as a location for their temporary files, so if we fail to
+		 * remove all files then do conflict processing and try again,
+		 * if currently enabled.
+		 */
 		if (!remove_tablespace_directories(xlrec->ts_id, true))
-			ereport(ERROR,
+		{
+			VirtualTransactionId *temp_file_users;
+
+			/*
+			 * Standby users may be currently using this tablespace for
+			 * their temporary files. We only care about current
+			 * users because temp_tablespace parameter will just ignore
+			 * tablespaces that no longer exist.
+			 *
+			 * Ask everybody to cancel their queries immediately so
+			 * we can ensure no temp files remain and we can remove the
+			 * tablespace. Nuke the entire site from orbit, it's the only
+			 * way to be sure.
+			 *
+			 * XXX: We could work out the pids of active backends
+			 * using this tablespace by examining the temp filenames in the
+			 * directory. We would then convert the pids into VirtualXIDs
+			 * before attempting to cancel them.
+			 *
+			 * We don't wait for commit because drop tablespace is
+			 * non-transactional.
+			 */
+			temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId,
+														InvalidOid,
+														false);
+			ResolveRecoveryConflictWithVirtualXIDs(temp_file_users,
+												   "drop tablespace",
+												   CONFLICT_MODE_ERROR);
+
+			/*
+			 * If we did recovery processing then hopefully the
+			 * backends who wrote temp files should have cleaned up and
+			 * exited by now. So let's recheck before we throw an error.
+			 * If !process_conflicts then this will just fail again.
+			 */
+			if (!remove_tablespace_directories(xlrec->ts_id, true))
+				ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("tablespace %u is not empty",
 							xlrec->ts_id)));
+		}
 	}
 	else
 		elog(PANIC, "tblspc_redo: unknown op code %u", info);
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@ -13,7 +13,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/vacuum.c,v 1.398 2009/12/09 21:57:51 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/vacuum.c,v 1.399 2009/12/19 01:32:34 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -141,6 +141,7 @@ typedef struct VRelStats
 	/* vtlinks array for tuple chain following - sorted by new_tid */
 	int			num_vtlinks;
 	VTupleLink	vtlinks;
+	TransactionId	latestRemovedXid;
 } VRelStats;

 /*----------------------------------------------------------------------
@ -224,7 +225,7 @@ static void scan_heap(VRelStats *vacrelstats, Relation onerel,
 static bool repair_frag(VRelStats *vacrelstats, Relation onerel,
 			VacPageList vacuum_pages, VacPageList fraged_pages,
 			int nindexes, Relation *Irel);
-static void move_chain_tuple(Relation rel,
+static void move_chain_tuple(VRelStats *vacrelstats, Relation rel,
 				 Buffer old_buf, Page old_page, HeapTuple old_tup,
 				 Buffer dst_buf, Page dst_page, VacPage dst_vacpage,
 				 ExecContext ec, ItemPointer ctid, bool cleanVpd);
@ -237,7 +238,7 @@ static void update_hint_bits(Relation rel, VacPageList fraged_pages,
 				 int num_moved);
 static void vacuum_heap(VRelStats *vacrelstats, Relation onerel,
 			VacPageList vacpagelist);
-static void vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage);
+static void vacuum_page(VRelStats *vacrelstats, Relation onerel, Buffer buffer, VacPage vacpage);
 static void vacuum_index(VacPageList vacpagelist, Relation indrel,
 			 double num_tuples, int keep_tuples);
 static void scan_index(Relation indrel, double num_tuples);
@ -1300,6 +1301,7 @@ full_vacuum_rel(Relation onerel, VacuumStmt *vacstmt)
 	vacrelstats->rel_tuples = 0;
 	vacrelstats->rel_indexed_tuples = 0;
 	vacrelstats->hasindex = false;
+	vacrelstats->latestRemovedXid = InvalidTransactionId;

 	/* scan the heap */
 	vacuum_pages.num_pages = fraged_pages.num_pages = 0;
@ -1708,6 +1710,9 @@ scan_heap(VRelStats *vacrelstats, Relation onerel,
 			{
 				ItemId		lpp;

+				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+											&vacrelstats->latestRemovedXid);
+
 				/*
 				 * Here we are building a temporary copy of the page with dead
 				 * tuples removed.	Below we will apply
@ -2025,7 +2030,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 				/* there are dead tuples on this page - clean them */
 				Assert(!isempty);
 				LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-				vacuum_page(onerel, buf, last_vacuum_page);
+				vacuum_page(vacrelstats, onerel, buf, last_vacuum_page);
 				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 			}
 			else
@ -2514,7 +2519,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 					tuple.t_data = (HeapTupleHeader) PageGetItem(Cpage, Citemid);
 					tuple_len = tuple.t_len = ItemIdGetLength(Citemid);

-					move_chain_tuple(onerel, Cbuf, Cpage, &tuple,
+					move_chain_tuple(vacrelstats, onerel, Cbuf, Cpage, &tuple,
 									 dst_buffer, dst_page, destvacpage,
 									 &ec, &Ctid, vtmove[ti].cleanVpd);

@ -2600,7 +2605,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 				dst_page = BufferGetPage(dst_buffer);
 				/* if this page was not used before - clean it */
 				if (!PageIsEmpty(dst_page) && dst_vacpage->offsets_used == 0)
-					vacuum_page(onerel, dst_buffer, dst_vacpage);
+					vacuum_page(vacrelstats, onerel, dst_buffer, dst_vacpage);
 			}
 			else
 				LockBuffer(dst_buffer, BUFFER_LOCK_EXCLUSIVE);
@ -2753,7 +2758,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 		HOLD_INTERRUPTS();
 		heldoff = true;
 		ForceSyncCommit();
-		(void) RecordTransactionCommit();
+		(void) RecordTransactionCommit(true);
 	}

 	/*
@ -2781,7 +2786,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 			LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 			page = BufferGetPage(buf);
 			if (!PageIsEmpty(page))
-				vacuum_page(onerel, buf, *curpage);
+				vacuum_page(vacrelstats, onerel, buf, *curpage);
 			UnlockReleaseBuffer(buf);
 		}
 	}
@ -2917,7 +2922,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 				recptr = log_heap_clean(onerel, buf,
 										NULL, 0, NULL, 0,
 										unused, uncnt,
-										false);
+										vacrelstats->latestRemovedXid, false);
 				PageSetLSN(page, recptr);
 				PageSetTLI(page, ThisTimeLineID);
 			}
@ -2969,7 +2974,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel,
 *		already too long and almost unreadable.
 */
 static void
-move_chain_tuple(Relation rel,
+move_chain_tuple(VRelStats *vacrelstats, Relation rel,
 				 Buffer old_buf, Page old_page, HeapTuple old_tup,
 				 Buffer dst_buf, Page dst_page, VacPage dst_vacpage,
 				 ExecContext ec, ItemPointer ctid, bool cleanVpd)
@ -3027,7 +3032,7 @@ move_chain_tuple(Relation rel,
 		int			sv_offsets_used = dst_vacpage->offsets_used;

 		dst_vacpage->offsets_used = 0;
-		vacuum_page(rel, dst_buf, dst_vacpage);
+		vacuum_page(vacrelstats, rel, dst_buf, dst_vacpage);
 		dst_vacpage->offsets_used = sv_offsets_used;
 	}

@ -3367,7 +3372,7 @@ vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages)
 			buf = ReadBufferExtended(onerel, MAIN_FORKNUM, (*vacpage)->blkno,
 									 RBM_NORMAL, vac_strategy);
 			LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-			vacuum_page(onerel, buf, *vacpage);
+			vacuum_page(vacrelstats, onerel, buf, *vacpage);
 			UnlockReleaseBuffer(buf);
 		}
 	}
@ -3397,7 +3402,7 @@ vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages)
 * Caller must hold pin and lock on buffer.
 */
 static void
-vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage)
+vacuum_page(VRelStats *vacrelstats, Relation onerel, Buffer buffer, VacPage vacpage)
 {
 	Page		page = BufferGetPage(buffer);
 	int			i;
@ -3426,7 +3431,7 @@ vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage)
 		recptr = log_heap_clean(onerel, buffer,
 								NULL, 0, NULL, 0,
 								vacpage->offsets, vacpage->offsets_free,
-								false);
+								vacrelstats->latestRemovedXid, false);
 		PageSetLSN(page, recptr);
 		PageSetTLI(page, ThisTimeLineID);
 	}
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@ -29,7 +29,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/commands/vacuumlazy.c,v 1.124 2009/11/16 21:32:06 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/commands/vacuumlazy.c,v 1.125 2009/12/19 01:32:34 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -98,6 +98,7 @@ typedef struct LVRelStats
 	int			max_dead_tuples;	/* # slots allocated in array */
 	ItemPointer dead_tuples;	/* array of ItemPointerData */
 	int			num_index_scans;
+	TransactionId latestRemovedXid;
 } LVRelStats;


@ -265,6 +266,34 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
 	return heldoff;
 }

+/*
+ * For Hot Standby we need to know the highest transaction id that will
+ * be removed by any change. VACUUM proceeds in a number of passes so
+ * we need to consider how each pass operates. The first phase runs
+ * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it
+ * progresses - these will have a latestRemovedXid on each record.
+ * In some cases this removes all of the tuples to be removed, though
+ * often we have dead tuples with index pointers so we must remember them
+ * for removal in phase 3. Index records for those rows are removed
+ * in phase 2 and index blocks do not have MVCC information attached.
+ * So before we can allow removal of any index tuples we need to issue
+ * a WAL record containing the latestRemovedXid of rows that will be
+ * removed in phase three. This allows recovery queries to block at the
+ * correct place, i.e. before phase two, rather than during phase three
+ * which would be after the rows have become inaccessible.
+ */
+static void
+vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
+{
+	/*
+	 * No need to log changes for temp tables, they do not contain
+	 * data visible on the standby server.
+	 */
+	if (rel->rd_istemp || !XLogArchivingActive())
+		return;
+
+	(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+}

 /*
 *	lazy_scan_heap() -- scan an open heap relation
@ -315,6 +344,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 	nblocks = RelationGetNumberOfBlocks(onerel);
 	vacrelstats->rel_pages = nblocks;
 	vacrelstats->nonempty_pages = 0;
+	vacrelstats->latestRemovedXid = InvalidTransactionId;

 	lazy_space_alloc(vacrelstats, nblocks);

@ -373,6 +403,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 		if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage &&
 			vacrelstats->num_dead_tuples > 0)
 		{
+			/* Log cleanup info before we touch indexes */
+			vacuum_log_cleanup_info(onerel, vacrelstats);
+
 			/* Remove index entries */
 			for (i = 0; i < nindexes; i++)
 				lazy_vacuum_index(Irel[i],
@ -382,6 +415,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			lazy_vacuum_heap(onerel, vacrelstats);
 			/* Forget the now-vacuumed tuples, and press on */
 			vacrelstats->num_dead_tuples = 0;
+			vacrelstats->latestRemovedXid = InvalidTransactionId;
 			vacrelstats->num_index_scans++;
 		}

@ -613,6 +647,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			if (tupgone)
 			{
 				lazy_record_dead_tuple(vacrelstats, &(tuple.t_self));
+				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+												&vacrelstats->latestRemovedXid);
 				tups_vacuumed += 1;
 			}
 			else
@ -661,6 +697,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats);
 			/* Forget the now-vacuumed tuples, and press on */
 			vacrelstats->num_dead_tuples = 0;
+			vacrelstats->latestRemovedXid = InvalidTransactionId;
 			vacuumed_pages++;
 		}

@ -724,6 +761,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 	/* XXX put a threshold on min number of tuples here? */
 	if (vacrelstats->num_dead_tuples > 0)
 	{
+		/* Log cleanup info before we touch indexes */
+		vacuum_log_cleanup_info(onerel, vacrelstats);
+
 		/* Remove index entries */
 		for (i = 0; i < nindexes; i++)
 			lazy_vacuum_index(Irel[i],
@ -868,7 +908,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 		recptr = log_heap_clean(onerel, buffer,
 								NULL, 0, NULL, 0,
 								unused, uncnt,
-								false);
+								vacrelstats->latestRemovedXid, false);
 		PageSetLSN(page, recptr);
 		PageSetTLI(page, ThisTimeLineID);
 	}
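Both vacuum paths above accumulate a latestRemovedXid as dead tuples are found and reset it to InvalidTransactionId after each cleanup pass. The body of HeapTupleHeaderAdvanceLatestRemovedXid() is not part of this section, so the following is only a simplified sketch of the "keep the latest xid seen" idea under 32-bit wraparound. All `Sketch`/`sketch_` names are hypothetical, and the real function inspects tuple header fields rather than taking a bare xid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t SketchXid;

#define SKETCH_INVALID_XID		((SketchXid) 0)
#define SKETCH_FIRST_NORMAL_XID	((SketchXid) 3)

/* True if a is logically later than b, comparing modulo 2^32. */
bool
sketch_xid_follows(SketchXid a, SketchXid b)
{
	return (int32_t) (a - b) > 0;
}

/*
 * Advance *latest to candidate if candidate is a normal xid and is later
 * than the value accumulated so far. Invalid and special xids are ignored.
 */
void
sketch_advance_latest_removed(SketchXid candidate, SketchXid *latest)
{
	assert(latest != NULL);

	if (candidate < SKETCH_FIRST_NORMAL_XID)
		return;
	if (*latest == SKETCH_INVALID_XID ||
		sketch_xid_follows(candidate, *latest))
		*latest = candidate;
}
```

The modular comparison is what lets a small xid issued after wraparound count as later than a large xid issued before it.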
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@ -37,7 +37,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/postmaster/postmaster.c,v 1.596 2009/09/08 17:08:36 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/postmaster/postmaster.c,v 1.597 2009/12/19 01:32:34 sriggs Exp $
 *
 * NOTES
 *
@ -245,8 +245,9 @@ static bool RecoveryError = false;		/* T if WAL recovery failed */
 * When archive recovery is finished, the startup process exits with exit
 * code 0 and we switch to PM_RUN state.
 *
- * Normal child backends can only be launched when we are in PM_RUN state.
- * (We also allow it in PM_WAIT_BACKUP state, but only for superusers.)
+ * Normal child backends can only be launched when we are in PM_RUN or
+ * PM_RECOVERY_CONSISTENT state.  (We also allow launch of normal
+ * child backends in PM_WAIT_BACKUP state, but only for superusers.)
 * In other states we handle connection requests by launching "dead_end"
 * child processes, which will simply send the client an error message and
 * quit.  (We track these in the BackendList so that we can know when they
@ -1868,7 +1869,7 @@ static enum CAC_state
 canAcceptConnections(void)
 {
 	/*
-	 * Can't start backends when in startup/shutdown/recovery state.
+	 * Can't start backends when in startup/shutdown/inconsistent recovery state.
 	 *
 	 * In state PM_WAIT_BACKUP only superusers can connect (this must be
 	 * allowed so that a superuser can end online backup mode); we return
@ -1882,9 +1883,11 @@ canAcceptConnections(void)
 			return CAC_SHUTDOWN;	/* shutdown is pending */
 		if (!FatalError &&
 			(pmState == PM_STARTUP ||
-			 pmState == PM_RECOVERY ||
-			 pmState == PM_RECOVERY_CONSISTENT))
+			 pmState == PM_RECOVERY))
 			return CAC_STARTUP; /* normal startup */
+		if (!FatalError &&
+			 pmState == PM_RECOVERY_CONSISTENT)
+			return CAC_OK; /* connection OK during recovery */
 		return CAC_RECOVERY;	/* else must be crash recovery */
 	}

@ -4003,9 +4006,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		Assert(PgStatPID == 0);
 		PgStatPID = pgstat_start();

-		/* XXX at this point we could accept read-only connections */
-		ereport(DEBUG1,
-				(errmsg("database system is in consistent recovery mode")));
+		ereport(LOG,
+				 (errmsg("database system is ready to accept read only connections")));

 		pmState = PM_RECOVERY_CONSISTENT;
 	}
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@ -1,7 +1,7 @@
 #
 # Makefile for storage/ipc
 #
-# $PostgreSQL: pgsql/src/backend/storage/ipc/Makefile,v 1.22 2009/07/31 20:26:23 tgl Exp $
+# $PostgreSQL: pgsql/src/backend/storage/ipc/Makefile,v 1.23 2009/12/19 01:32:35 sriggs Exp $
 #

 subdir = src/backend/storage/ipc
@ -16,6 +16,6 @@ endif
 endif

 OBJS = ipc.o ipci.o pmsignal.o procarray.o procsignal.o shmem.o shmqueue.o \
-	sinval.o sinvaladt.o
+	sinval.o sinvaladt.o standby.o

 include $(top_srcdir)/src/backend/common.mk
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/storage/ipc/sinvaladt.c,v 1.79 2009/07/31 20:26:23 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/storage/ipc/sinvaladt.c,v 1.80 2009/12/19 01:32:35 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -144,6 +144,13 @@ typedef struct ProcState
 	bool		resetState;		/* backend needs to reset its state */
 	bool		signaled;		/* backend has been sent catchup signal */

+	/*
+	 * Backend only sends invalidations, never receives them. This only makes sense
+	 * for Startup process during recovery because it doesn't maintain a relcache,
+	 * yet it fires inval messages to allow query backends to see schema changes.
+	 */
+	bool		sendOnly;		/* backend only sends, never receives */
+
 	/*
 	 * Next LocalTransactionId to use for each idle backend slot.  We keep
 	 * this here because it is indexed by BackendId and it is convenient to
@ -249,7 +256,7 @@ CreateSharedInvalidationState(void)
 *		Initialize a new backend to operate on the sinval buffer
 */
 void
-SharedInvalBackendInit(void)
+SharedInvalBackendInit(bool sendOnly)
 {
 	int			index;
 	ProcState  *stateP = NULL;
@ -308,6 +315,7 @@ SharedInvalBackendInit(void)
 	stateP->nextMsgNum = segP->maxMsgNum;
 	stateP->resetState = false;
 	stateP->signaled = false;
+	stateP->sendOnly = sendOnly;

 	LWLockRelease(SInvalWriteLock);

@ -579,7 +587,9 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
 	/*
 	 * Recompute minMsgNum = minimum of all backends' nextMsgNum, identify the
 	 * furthest-back backend that needs signaling (if any), and reset any
-	 * backends that are too far back.
+	 * backends that are too far back.  Note that because we ignore sendOnly
+	 * backends here it is possible for them to keep sending messages without
+	 * a problem even when they are the only active backend.
 	 */
 	min = segP->maxMsgNum;
 	minsig = min - SIG_THRESHOLD;
@ -591,7 +601,7 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
 		int			n = stateP->nextMsgNum;

 		/* Ignore if inactive or already in reset state */
-		if (stateP->procPid == 0 || stateP->resetState)
+		if (stateP->procPid == 0 || stateP->resetState || stateP->sendOnly)
 			continue;

 		/*
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@ -0,0 +1,717 @@
+/*-------------------------------------------------------------------------
+ *
+ * standby.c
+ *	  Misc functions used in Hot Standby mode.
+ *
+ *	InitRecoveryTransactionEnvironment()
+ *  ShutdownRecoveryTransactionEnvironment()
+ *
+ *  ResolveRecoveryConflictWithVirtualXIDs()
+ *
+ *  All functions for handling RM_STANDBY_ID, which relate to
+ *  AccessExclusiveLocks and starting snapshots for Hot Standby mode.
+ *
+ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  $PostgreSQL: pgsql/src/backend/storage/ipc/standby.c,v 1.1 2009/12/19 01:32:35 sriggs Exp $
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+#include "access/transam.h"
+#include "access/twophase.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/lmgr.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/sinvaladt.h"
+#include "storage/standby.h"
+#include "utils/ps_status.h"
+
+int		vacuum_defer_cleanup_age;
+
+static List *RecoveryLockList;
+
+static void LogCurrentRunningXacts(RunningTransactions CurrRunningXacts);
+static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks);
+
+/*
+ * InitRecoveryTransactionEnvironment
+ *		Initialize tracking of in-progress transactions in master
+ *
+ * We need to issue shared invalidations and hold locks. Holding locks
+ * means others may want to wait on us, so we need to make lock table
+ * inserts appear like a transaction. We could create and delete
+ * lock table entries for each transaction but it's simpler just to create
+ * one permanent entry and leave it there all the time. Locks are then
+ * acquired and released as needed. Yes, this means you can see the
+ * Startup process in pg_locks once we have run this.
+ */
+void
+InitRecoveryTransactionEnvironment(void)
+{
+	VirtualTransactionId vxid;
+
+	/*
+	 * Initialise shared invalidation management for Startup process,
+	 * being careful to register ourselves as a sendOnly process so
+	 * we don't need to read messages, nor will we get signalled
+	 * when the queue starts filling up.
+	 */
+	SharedInvalBackendInit(true);
+
+	/*
+	 * Record the PID and PGPROC structure of the startup process.
+	 */
+	PublishStartupProcessInformation();
+
+	/*
+	 * Lock a virtual transaction id for Startup process.
+	 *
+	 * We need to do GetNextLocalTransactionId() because
+	 * SharedInvalBackendInit() leaves localTransactionId invalid and
+	 * the lock manager doesn't like that at all.
+	 *
+	 * Note that we don't need to run XactLockTableInsert() because nobody
+	 * needs to wait on xids. That sounds a little strange, but table locks
+	 * are held by vxids and row level locks are held by xids. All queries
+	 * hold AccessShareLocks so never block while we write or lock new rows.
+	 */
+	vxid.backendId = MyBackendId;
+	vxid.localTransactionId = GetNextLocalTransactionId();
+	VirtualXactLockTableInsert(vxid);
+
+	standbyState = STANDBY_INITIALIZED;
+}
+
+/*
+ * ShutdownRecoveryTransactionEnvironment
+ *		Shut down transaction tracking
+ *
+ * Prepare to switch from hot standby mode to normal operation. Shut down
+ * recovery-time transaction tracking.
+ */
+void
+ShutdownRecoveryTransactionEnvironment(void)
+{
+	/* Mark all tracked in-progress transactions as finished. */
+	ExpireAllKnownAssignedTransactionIds();
+
+	/* Release all locks the tracked transactions were holding */
+	StandbyReleaseAllLocks();
+}
+
+
+/*
+ * -----------------------------------------------------
+ * 		Standby wait timers and backend cancel logic
+ * -----------------------------------------------------
+ */
+
+#define STANDBY_INITIAL_WAIT_US  1000
+static int standbyWait_us = STANDBY_INITIAL_WAIT_US;
+
+/*
+ * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs.
+ * Returns true, without sleeping, once max_standby_delay has been exceeded;
+ * otherwise sleeps for a progressively longer interval and returns false.
+ */
+static bool
+WaitExceedsMaxStandbyDelay(void)
+{
+	long	delay_secs;
+	int		delay_usecs;
+
+	/* max_standby_delay = -1 means wait forever, if necessary */
+	if (MaxStandbyDelay < 0)
+		return false;
+
+	/* Are we past max_standby_delay? */
+	TimestampDifference(GetLatestXLogTime(), GetCurrentTimestamp(),
+						&delay_secs, &delay_usecs);
+	if (delay_secs > MaxStandbyDelay)
+		return true;
+
+	/*
+	 * Sleep, then do bookkeeping.
+	 */
+	pg_usleep(standbyWait_us);
+
+	/*
+	 * Progressively increase the sleep times.
+	 */
+	standbyWait_us *= 2;
+	if (standbyWait_us > 1000000)
+		standbyWait_us = 1000000;
+	if (standbyWait_us > (int64) MaxStandbyDelay * 1000000 / 4)
+		standbyWait_us = (int) ((int64) MaxStandbyDelay * 1000000 / 4);
+
+	return false;
+}
+
+/*
+ * This is the main executioner for any query backend that conflicts with
+ * recovery processing. Judgement has already been passed on it within
+ * a specific rmgr. Here we just issue the orders to the procs. The procs
+ * then throw the required error as instructed.
+ *
+ * We may ask for a specific cancel_mode, typically ERROR or FATAL.
+ */
+void
+ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
+									   char *reason, int cancel_mode)
+{
+	char		waitactivitymsg[100];
+
+	Assert(cancel_mode > 0);
+
+	while (VirtualTransactionIdIsValid(*waitlist))
+	{
+		long wait_s;
+		int wait_us;			/* wait in microseconds (us) */
+		TimestampTz waitStart;
+		bool		logged;
+
+		waitStart = GetCurrentTimestamp();
+		standbyWait_us = STANDBY_INITIAL_WAIT_US;
+		logged = false;
+
+		/* wait until the virtual xid is gone */
+		while(!ConditionalVirtualXactLockTableWait(*waitlist))
+		{
+			/*
+			 * Report if we have been waiting for a while now...
+			 */
+			TimestampTz now = GetCurrentTimestamp();
+			TimestampDifference(waitStart, now, &wait_s, &wait_us);
+			if (!logged && (wait_s > 0 || wait_us > 500000))
+			{
+				const char *oldactivitymsg;
+				int			len;
+
+				oldactivitymsg = get_ps_display(&len);
+				snprintf(waitactivitymsg, sizeof(waitactivitymsg),
+						 "waiting for max_standby_delay (%d s)",
+						 MaxStandbyDelay);
+				set_ps_display(waitactivitymsg, false);
+				len = Min(len, (int) sizeof(waitactivitymsg) - 1);
+				memcpy(waitactivitymsg, oldactivitymsg, len);
+				waitactivitymsg[len] = '\0';
+
+				ereport(trace_recovery(DEBUG5),
+						(errmsg("virtual transaction %u/%u is blocking %s",
+								waitlist->backendId,
+								waitlist->localTransactionId,
+								reason)));
+
+				pgstat_report_waiting(true);
+
+				logged = true;
+			}
+
+			/* Is it time to kill it? */
+			if (WaitExceedsMaxStandbyDelay())
+			{
+				pid_t pid;
+
+				/*
+				 * Now find out who to throw out of the balloon.
+				 */
+				Assert(VirtualTransactionIdIsValid(*waitlist));
+				pid = CancelVirtualTransaction(*waitlist, cancel_mode);
+
+				if (pid != 0)
+				{
+					/*
+					 * Startup process debug messages
+					 */
+					switch (cancel_mode)
+					{
+						case CONFLICT_MODE_FATAL:
+							elog(trace_recovery(DEBUG1),
+									"recovery disconnects session with pid %d because of conflict with %s",
+											pid,
+											reason);
+							break;
+						case CONFLICT_MODE_ERROR:
+							elog(trace_recovery(DEBUG1),
+									"recovery cancels virtual transaction %u/%u pid %d because of conflict with %s",
+											waitlist->backendId,
+											waitlist->localTransactionId,
+											pid,
+											reason);
+							break;
+						default:
+							/* No conflict pending, so fall through */
+							break;
+					}
+
+					/*
+					 * Wait awhile for it to die so that we avoid flooding an
+					 * unresponsive backend when system is heavily loaded.
+					 */
+					pg_usleep(5000);
+				}
+			}
+		}
+
+		/* Reset ps display */
+		if (logged)
+		{
+			set_ps_display(waitactivitymsg, false);
+			pgstat_report_waiting(false);
+		}
+
+		/* The virtual transaction is gone now, wait for the next one */
+		waitlist++;
+    }
+}
+
+/*
+ * -----------------------------------------------------
+ * Locking in Recovery Mode
+ * -----------------------------------------------------
+ *
+ * All locks are held by the Startup process using a single virtual
+ * transaction. This implementation is both simpler and, in some senses,
+ * more correct. The locks held mean "some original transaction held
+ * this lock, so query access is not allowed at this time". So the Startup
+ * process is the proxy by which the original locks are implemented.
+ *
+ * We only keep track of AccessExclusiveLocks, which are only ever held by
+ * one transaction on one relation, and don't worry about lock queuing.
+ *
+ * We keep a single dynamically expandable list of locks in local memory,
+ * RecoveryLockList, so we can keep track of the various entries made by
+ * the Startup process's virtual xid in the shared lock table.
+ *
+ * List elements use type xl_standby_lock, since the WAL record type exactly
+ * matches the information that we need to keep track of.
+ *
+ * We use session locks rather than normal locks so we don't need
+ * ResourceOwners.
+ */
+
+
+void
+StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
+{
+	xl_standby_lock	*newlock;
+	LOCKTAG			locktag;
+	bool			report_memory_error = false;
+	int				num_attempts = 0;
+
+	/* Already processed? */
+	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+		return;
+
+	elog(trace_recovery(DEBUG4),
+		 "adding recovery lock: db %u rel %u", dbOid, relOid);
+
+	/* dbOid is InvalidOid when we are locking a shared relation. */
+	Assert(OidIsValid(relOid));
+
+	newlock = palloc(sizeof(xl_standby_lock));
+	newlock->xid = xid;
+	newlock->dbOid = dbOid;
+	newlock->relOid = relOid;
+	RecoveryLockList = lappend(RecoveryLockList, newlock);
+
+	/*
+	 * Attempt to acquire the lock as requested.
+	 */
+	SET_LOCKTAG_RELATION(locktag, newlock->dbOid, newlock->relOid);
+
+	/*
+	 * Wait for lock to clear or kill anyone in our way.
+	 */
+	while (LockAcquireExtended(&locktag, AccessExclusiveLock,
+								true, true, report_memory_error)
+											== LOCKACQUIRE_NOT_AVAIL)
+	{
+		VirtualTransactionId *backends;
+
+		/*
+		 * If blowing away everybody with conflicting locks doesn't work
+		 * after the first two attempts, then we just start blowing everybody
+		 * away until it does work. We do this because it's likely that we
+		 * either have too many locks and we just can't get one at all,
+		 * or that there are many people crowding for the same table.
+		 * Recovery must win; the end justifies the means.
+		 */
+		if (++num_attempts < 3)
+			backends = GetLockConflicts(&locktag, AccessExclusiveLock);
+		else
+		{
+			backends = GetConflictingVirtualXIDs(InvalidTransactionId,
+												 InvalidOid,
+												 true);
+			report_memory_error = true;
+		}
+
+		ResolveRecoveryConflictWithVirtualXIDs(backends,
+											   "exclusive lock",
+											   CONFLICT_MODE_ERROR);
+	}
+}
+
+static void
+StandbyReleaseLocks(TransactionId xid)
+{
+	ListCell   *cell,
+			   *prev,
+			   *next;
+
+	/*
+	 * Release all matching locks and remove them from list
+	 */
+	prev = NULL;
+	for (cell = list_head(RecoveryLockList); cell; cell = next)
+	{
+		xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
+		next = lnext(cell);
+
+		if (!TransactionIdIsValid(xid) || lock->xid == xid)
+		{
+			LOCKTAG		locktag;
+
+			elog(trace_recovery(DEBUG4),
+					"releasing recovery lock: xid %u db %u rel %u",
+							lock->xid, lock->dbOid, lock->relOid);
+			SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
+			if (!LockRelease(&locktag, AccessExclusiveLock, true))
+				elog(trace_recovery(LOG),
+					"RecoveryLockList contains entry for lock "
+					"no longer recorded by lock manager "
+					"xid %u database %u relation %u",
+						lock->xid, lock->dbOid, lock->relOid);
+
+			RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev);
+			pfree(lock);
+		}
+		else
+			prev = cell;
+	}
+}
+
+/*
+ * Release locks for a transaction tree, starting at xid down, from
+ * RecoveryLockList.
+ *
+ * Called during WAL replay of COMMIT/ROLLBACK when in hot standby mode,
+ * to remove any AccessExclusiveLocks requested by a transaction.
+ */
+void
+StandbyReleaseLockTree(TransactionId xid, int nsubxids, TransactionId *subxids)
+{
+	int i;
+
+	StandbyReleaseLocks(xid);
+
+	for (i = 0; i < nsubxids; i++)
+		StandbyReleaseLocks(subxids[i]);
+}
+
+/*
+ * StandbyReleaseLocksMany
+ *		Release standby locks held by XIDs < removeXid
+ *		In some cases, keep prepared transactions.
+ */
+static void
+StandbyReleaseLocksMany(TransactionId removeXid, bool keepPreparedXacts)
+{
+	ListCell   *cell,
+			   *prev,
+			   *next;
+	LOCKTAG		locktag;
+
+	/*
+	 * Release all matching locks.
+	 */
+	prev = NULL;
+	for (cell = list_head(RecoveryLockList); cell; cell = next)
+	{
+		xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
+		next = lnext(cell);
+
+		if ((!TransactionIdIsValid(removeXid) ||
+			 TransactionIdPrecedes(lock->xid, removeXid)) &&
+			!(keepPreparedXacts && StandbyTransactionIdIsPrepared(lock->xid)))
+		{
+			elog(trace_recovery(DEBUG4),
+				 "releasing recovery lock: xid %u db %u rel %u",
+				 lock->xid, lock->dbOid, lock->relOid);
+			SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
+			if (!LockRelease(&locktag, AccessExclusiveLock, true))
+				elog(trace_recovery(LOG),
+					 "RecoveryLockList contains entry for lock "
+					 "no longer recorded by lock manager "
+					 "xid %u database %u relation %u",
+					 lock->xid, lock->dbOid, lock->relOid);
+			RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev);
+			pfree(lock);
+		}
+		else
+			prev = cell;
+	}
+}
+
+/*
+ * Called at end of recovery and when we see a shutdown checkpoint.
+ */
+void
+StandbyReleaseAllLocks(void)
+{
+	elog(trace_recovery(DEBUG2), "release all standby locks");
+	StandbyReleaseLocksMany(InvalidTransactionId, false);
+}
+
+/*
+ * StandbyReleaseOldLocks
+ *		Release standby locks held by XIDs < removeXid, as long
+ *		as they're not prepared transactions.
+ */
+void
+StandbyReleaseOldLocks(TransactionId removeXid)
+{
+	StandbyReleaseLocksMany(removeXid, true);
+}
+
+/*
+ * --------------------------------------------------------------------
+ * 		Recovery handling for Rmgr RM_STANDBY_ID
+ *
+ * These record types will only be created if XLogStandbyInfoActive()
+ * --------------------------------------------------------------------
+ */
+
+void
+standby_redo(XLogRecPtr lsn, XLogRecord *record)
+{
+	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+
+	/* Do nothing if we're not in standby mode */
+	if (standbyState == STANDBY_DISABLED)
+		return;
+
+	if (info == XLOG_STANDBY_LOCK)
+	{
+		xl_standby_locks *xlrec = (xl_standby_locks *) XLogRecGetData(record);
+		int i;
+
+		for (i = 0; i < xlrec->nlocks; i++)
+			StandbyAcquireAccessExclusiveLock(xlrec->locks[i].xid,
+											  xlrec->locks[i].dbOid,
+											  xlrec->locks[i].relOid);
+	}
+	else if (info == XLOG_RUNNING_XACTS)
+	{
+		xl_running_xacts *xlrec = (xl_running_xacts *) XLogRecGetData(record);
+		RunningTransactionsData running;
+
+		running.xcnt = xlrec->xcnt;
+		running.subxid_overflow = xlrec->subxid_overflow;
+		running.nextXid = xlrec->nextXid;
+		running.oldestRunningXid = xlrec->oldestRunningXid;
+		running.xids = xlrec->xids;
+
+		ProcArrayApplyRecoveryInfo(&running);
+	}
+	else
+		elog(PANIC, "standby_redo: unknown op code %u", info);
+}
+
+static void
+standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
+{
+	int			i;
+
+	appendStringInfo(buf,
+					 " nextXid %u oldestRunningXid %u",
+					 xlrec->nextXid,
+					 xlrec->oldestRunningXid);
+	if (xlrec->xcnt > 0)
+	{
+		appendStringInfo(buf, "; %d xacts:", xlrec->xcnt);
+		for (i = 0; i < xlrec->xcnt; i++)
+			appendStringInfo(buf, " %u", xlrec->xids[i]);
+	}
+
+	if (xlrec->subxid_overflow)
+		appendStringInfo(buf, "; subxid ovf");
+}
+
+void
+standby_desc(StringInfo buf, uint8 xl_info, char *rec)
+{
+	uint8		info = xl_info & ~XLR_INFO_MASK;
+
+	if (info == XLOG_STANDBY_LOCK)
+	{
+		xl_standby_locks *xlrec = (xl_standby_locks *) rec;
+		int i;
+
+		appendStringInfo(buf, "AccessExclusive locks:");
+
+		for (i = 0; i < xlrec->nlocks; i++)
+			appendStringInfo(buf, " xid %u db %u rel %u",
+							 xlrec->locks[i].xid, xlrec->locks[i].dbOid,
+							 xlrec->locks[i].relOid);
+	}
+	else if (info == XLOG_RUNNING_XACTS)
+	{
+		xl_running_xacts *xlrec = (xl_running_xacts *) rec;
+
+		appendStringInfo(buf, " running xacts:");
+		standby_desc_running_xacts(buf, xlrec);
+	}
+	else
+		appendStringInfo(buf, "UNKNOWN");
+}
+
+/*
+ * Log details of the current snapshot to WAL. This allows the snapshot state
+ * to be reconstructed on the standby.
+ */
+void
+LogStandbySnapshot(TransactionId *oldestActiveXid, TransactionId *nextXid)
+{
+	RunningTransactions running;
+	xl_standby_lock *locks;
+	int nlocks;
+
+	Assert(XLogStandbyInfoActive());
+
+	/*
+	 * Get details of any AccessExclusiveLocks being held at the moment.
+	 */
+	locks = GetRunningTransactionLocks(&nlocks);
+	if (nlocks > 0)
+		LogAccessExclusiveLocks(nlocks, locks);
+
+	/*
+	 * Log details of all in-progress transactions. This should be the last
+	 * record we write, because standby will open up when it sees this.
+	 */
+	running = GetRunningTransactionData();
+	LogCurrentRunningXacts(running);
+
+	*oldestActiveXid = running->oldestRunningXid;
+	*nextXid = running->nextXid;
+}
+
+/*
+ * Record an enhanced snapshot of running transactions into WAL.
+ *
+ * The definitions of RunningTransactionsData and xl_running_xacts
+ * are similar. We keep them separate because xl_running_xacts
+ * is a contiguous chunk of memory and never exists fully until it is
+ * assembled in WAL.
+ */
+static void
+LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
+{
+	xl_running_xacts	xlrec;
+	XLogRecData 			rdata[2];
+	int						lastrdata = 0;
+	XLogRecPtr	recptr;
+
+	xlrec.xcnt = CurrRunningXacts->xcnt;
+	xlrec.subxid_overflow = CurrRunningXacts->subxid_overflow;
+	xlrec.nextXid = CurrRunningXacts->nextXid;
+	xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid;
+
+	/* Header */
+	rdata[0].data = (char *) (&xlrec);
+	rdata[0].len = MinSizeOfXactRunningXacts;
+	rdata[0].buffer = InvalidBuffer;
+
+	/* array of TransactionIds */
+	if (xlrec.xcnt > 0)
+	{
+		rdata[0].next = &(rdata[1]);
+		rdata[1].data = (char *) CurrRunningXacts->xids;
+		rdata[1].len = xlrec.xcnt * sizeof(TransactionId);
+		rdata[1].buffer = InvalidBuffer;
+		lastrdata = 1;
+	}
+
+	rdata[lastrdata].next = NULL;
+
+	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS, rdata);
+
+	if (CurrRunningXacts->subxid_overflow)
+		ereport(trace_recovery(DEBUG2),
+				(errmsg("snapshot of %u running transactions overflowed (lsn %X/%X oldest xid %u next xid %u)",
+						CurrRunningXacts->xcnt,
+						recptr.xlogid, recptr.xrecoff,
+						CurrRunningXacts->oldestRunningXid,
+						CurrRunningXacts->nextXid)));
+	else
+		ereport(trace_recovery(DEBUG2),
+				(errmsg("snapshot of %u running transaction ids (lsn %X/%X oldest xid %u next xid %u)",
+						CurrRunningXacts->xcnt,
+						recptr.xlogid, recptr.xrecoff,
+						CurrRunningXacts->oldestRunningXid,
+						CurrRunningXacts->nextXid)));
+
+}
+
+/*
+ * Wholesale logging of AccessExclusiveLocks. Other lock types need not be
+ * logged, as described in backend/storage/lmgr/README.
+ */
+static void
+LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
+{
+	XLogRecData		rdata[2];
+	xl_standby_locks	xlrec;
+
+	xlrec.nlocks = nlocks;
+
+	rdata[0].data = (char *) &xlrec;
+	rdata[0].len = offsetof(xl_standby_locks, locks);
+	rdata[0].buffer = InvalidBuffer;
+	rdata[0].next = &rdata[1];
+
+	rdata[1].data = (char *) locks;
+	rdata[1].len = nlocks * sizeof(xl_standby_lock);
+	rdata[1].buffer = InvalidBuffer;
+	rdata[1].next = NULL;
+
+	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK, rdata);
+}
+
+/*
+ * Individual logging of AccessExclusiveLocks for use during LockAcquire()
+ */
+void
+LogAccessExclusiveLock(Oid dbOid, Oid relOid)
+{
+	xl_standby_lock		xlrec;
+
+	/*
+	 * Ensure that a TransactionId has been assigned to this transaction.
+	 * We don't actually need the xid yet but if we don't do this then
+	 * RecordTransactionCommit() and RecordTransactionAbort() will optimise
+	 * away the transaction completion record which recovery relies upon to
+	 * release locks. It's a hack, but for a corner case not worth adding
+	 * code for into the main commit path.
+	 */
+	xlrec.xid = GetTopTransactionId();
+
+	/*
+	 * The caller has already decoded the locktag into its original
+	 * dbOid and relOid values (see lock.h for how a locktag is defined
+	 * for LOCKTAG_RELATION), so we log only those, not the whole locktag.
+	 */
+	xlrec.dbOid = dbOid;
+	xlrec.relOid = relOid;
+
+	LogAccessExclusiveLocks(1, &xlrec);
+}
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/storage/lmgr/README,v 1.24 2008/03/21 13:23:28 momjian Exp $
+$PostgreSQL: pgsql/src/backend/storage/lmgr/README,v 1.25 2009/12/19 01:32:35 sriggs Exp $

 Locking Overview
 ================
@ -517,3 +517,27 @@ interfere with each other.
 User locks are always held as session locks, so that they are not released at
 transaction end.  They must be released explicitly by the application --- but
 they are released automatically when a backend terminates.
+
+Locking during Hot Standby
+--------------------------
+
+The Startup process is the only backend that can make changes during
+recovery; all other backends are read only.  As a result the Startup
+process does not acquire locks on relations or objects except when the lock
+level is AccessExclusiveLock.
+
+Regular backends are only allowed to take locks on relations or objects
+at RowExclusiveLock or lower. This ensures that they do not conflict with
+each other or with the Startup process, unless AccessExclusiveLocks are
+requested by one of the backends.
+
+Deadlocks involving AccessExclusiveLocks are not possible, so we need
+not be concerned that a user initiated deadlock can prevent recovery from
+progressing.
+
+AccessExclusiveLocks on the primary or master node generate WAL records
+that are then applied by the Startup process. Locks are released at end
+of transaction just as they are in normal processing. These locks are
+held by the Startup process, acting as a proxy for the backends that
+originally acquired these locks. Again, these locks cannot conflict with
+one another, so the Startup process cannot deadlock itself either.
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/storage/lmgr/lock.c,v 1.188 2009/06/11 14:49:02 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/storage/lmgr/lock.c,v 1.189 2009/12/19 01:32:35 sriggs Exp $
 *
 * NOTES
 *	  A lock table is a shared memory hash table.  When
@ -38,6 +38,7 @@
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
+#include "storage/standby.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
 #include "utils/resowner.h"
@ -468,6 +469,25 @@ LockAcquire(const LOCKTAG *locktag,
 			LOCKMODE lockmode,
 			bool sessionLock,
 			bool dontWait)
+{
+	return LockAcquireExtended(locktag, lockmode, sessionLock, dontWait, true);
+}
+
+/*
+ * LockAcquireExtended - allows us to specify additional options
+ *
+ * reportMemoryError specifies whether a lock request that fills the
+ * lock table should generate an ERROR or not. This allows a priority
+ * caller to note that the lock table is full and then begin taking
+ * extreme action to reduce the number of other lock holders before
+ * retrying the action.
+ */
+LockAcquireResult
+LockAcquireExtended(const LOCKTAG *locktag,
+			LOCKMODE lockmode,
+			bool sessionLock,
+			bool dontWait,
+			bool reportMemoryError)
 {
 	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
 	LockMethod	lockMethodTable;
@ -490,6 +510,16 @@ LockAcquire(const LOCKTAG *locktag,
 	if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
 		elog(ERROR, "unrecognized lock mode: %d", lockmode);

+	if (RecoveryInProgress() && !InRecovery &&
+		(locktag->locktag_type == LOCKTAG_OBJECT ||
+		 locktag->locktag_type == LOCKTAG_RELATION ) &&
+		lockmode > RowExclusiveLock)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("cannot acquire lockmode %s on database objects while recovery is in progress",
+									lockMethodTable->lockModeNames[lockmode]),
+				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
+
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
 		elog(LOG, "LockAcquire: lock [%u,%u] %s",
@ -578,10 +608,13 @@ LockAcquire(const LOCKTAG *locktag,
 	if (!lock)
 	{
 		LWLockRelease(partitionLock);
-		ereport(ERROR,
-				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("out of shared memory"),
-		  errhint("You might need to increase max_locks_per_transaction.")));
+		if (reportMemoryError)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of shared memory"),
+				  errhint("You might need to increase max_locks_per_transaction.")));
+		else
+			return LOCKACQUIRE_NOT_AVAIL;
 	}
 	locallock->lock = lock;

@ -644,10 +677,13 @@ LockAcquire(const LOCKTAG *locktag,
 				elog(PANIC, "lock table corrupted");
 		}
 		LWLockRelease(partitionLock);
-		ereport(ERROR,
-				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("out of shared memory"),
-		  errhint("You might need to increase max_locks_per_transaction.")));
+		if (reportMemoryError)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of shared memory"),
+				  errhint("You might need to increase max_locks_per_transaction.")));
+		else
+			return LOCKACQUIRE_NOT_AVAIL;
 	}
 	locallock->proclock = proclock;

@ -778,6 +814,25 @@ LockAcquire(const LOCKTAG *locktag,
 			return LOCKACQUIRE_NOT_AVAIL;
 		}

+		/*
+		 * In Hot Standby we abort the lock wait if Startup process is waiting
+		 * since this would result in a deadlock. The deadlock occurs because
+		 * if we are waiting it must be behind an AccessExclusiveLock, which
+		 * can only clear when a transaction completion record is replayed.
+		 * If Startup process is waiting we never will clear that lock, so to
+		 * wait for it just causes a deadlock.
+		 */
+		if (RecoveryInProgress() && !InRecovery &&
+			locktag->locktag_type == LOCKTAG_RELATION)
+		{
+			LWLockRelease(partitionLock);
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_DEADLOCK_DETECTED),
+					 errmsg("possible deadlock detected"),
+					 errdetail("process conflicts with recovery - please resubmit query later"),
+					 errdetail_log("process conflicts with recovery")));
+		}
+
 		/*
 		 * Set bitmask of locks this process already holds on this object.
 		 */
@ -827,6 +882,27 @@ LockAcquire(const LOCKTAG *locktag,

 	LWLockRelease(partitionLock);

+	/*
+	 * Emit a WAL record if acquisition of this lock needs to be replayed in
+	 * a standby server. Only AccessExclusiveLocks can conflict with lock
+	 * types that read-only transactions can acquire in a standby server.
+	 *
+	 * Make sure this definition matches the one in GetRunningTransactionLocks().
+	 */
+	if (lockmode >= AccessExclusiveLock &&
+		locktag->locktag_type == LOCKTAG_RELATION &&
+		!RecoveryInProgress() &&
+		XLogStandbyInfoActive())
+	{
+		/*
+		 * Decode the locktag back to the original values, to avoid
+		 * sending lots of empty bytes with every message.  See
+		 * lock.h to check how a locktag is defined for LOCKTAG_RELATION
+		 */
+		LogAccessExclusiveLock(locktag->locktag_field1,
+							   locktag->locktag_field2);
+	}
+
 	return LOCKACQUIRE_OK;
 }

@ -2193,6 +2269,79 @@ GetLockStatusData(void)
 	return data;
 }

+/*
+ * Returns a list of currently held AccessExclusiveLocks, for use
+ * by LogStandbySnapshot().
+ */
+xl_standby_lock *
+GetRunningTransactionLocks(int *nlocks)
+{
+	PROCLOCK   *proclock;
+	HASH_SEQ_STATUS seqstat;
+	int			i;
+	int 		index;
+	int			els;
+	xl_standby_lock *accessExclusiveLocks;
+
+	/*
+	 * Acquire lock on the entire shared lock data structure.
+	 *
+	 * Must grab LWLocks in partition-number order to avoid LWLock deadlock.
+	 */
+	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+		LWLockAcquire(FirstLockMgrLock + i, LW_SHARED);
+
+	/* Now scan the tables to copy the data */
+	hash_seq_init(&seqstat, LockMethodProcLockHash);
+
+	/* Now we can safely count the number of proclocks */
+	els = hash_get_num_entries(LockMethodProcLockHash);
+
+	/*
+	 * Allocating enough space for all locks in the lock table is overkill,
+	 * but it's more convenient and faster than having to enlarge the array.
+	 */
+	accessExclusiveLocks = palloc(els * sizeof(xl_standby_lock));
+
+	/*
+	 * If lock is a currently granted AccessExclusiveLock then
+	 * it will have just one proclock holder, so locks are never
+	 * accessed twice in this particular case. Don't copy this code
+	 * for use elsewhere because in the general case this will
+	 * give you duplicate locks when looking at non-exclusive lock types.
+	 */
+	index = 0;
+	while ((proclock = (PROCLOCK *) hash_seq_search(&seqstat)))
+	{
+		/* make sure this definition matches the one used in LockAcquire */
+		if ((proclock->holdMask & LOCKBIT_ON(AccessExclusiveLock)) &&
+			proclock->tag.myLock->tag.locktag_type == LOCKTAG_RELATION)
+		{
+			PGPROC	*proc = proclock->tag.myProc;
+			LOCK	*lock = proclock->tag.myLock;
+
+			accessExclusiveLocks[index].xid 	= proc->xid;
+			accessExclusiveLocks[index].dbOid  = lock->tag.locktag_field1;
+			accessExclusiveLocks[index].relOid = lock->tag.locktag_field2;
+
+			index++;
+		}
+	}
+
+	/*
+	 * And release locks.  We do this in reverse order for two reasons: (1)
+	 * Anyone else who needs more than one of the locks will be trying to lock
+	 * them in increasing order; we don't want to release the other process
+	 * until it can get all the locks it needs. (2) This avoids O(N^2)
+	 * behavior inside LWLockRelease.
+	 */
+	for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
+		LWLockRelease(FirstLockMgrLock + i);
+
+	*nlocks = index;
+	return accessExclusiveLocks;
+}
+
 /* Provide the textual name of any lock mode */
 const char *
 GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode)
@ -2288,6 +2437,24 @@ DumpAllLocks(void)
 * Because this function is run at db startup, re-acquiring the locks should
 * never conflict with running transactions because there are none.  We
 * assume that the lock state represented by the stored 2PC files is legal.
+ *
+ * When switching from Hot Standby mode to normal operation, the locks will
+ * be already held by the startup process. The locks are acquired for the new
+ * procs without checking for conflicts, so we don't get a conflict between the
+ * startup process and the dummy procs, even though we will momentarily have
+ * a situation where two procs are holding the same AccessExclusiveLock,
+ * which isn't normally possible because the locks conflict. If we're in standby
+ * mode, but a recovery snapshot hasn't been established yet, it's possible
+ * that some but not all of the locks are already held by the startup process.
+ *
+ * This approach is simple, but also a bit dangerous, because if there isn't
+ * enough shared memory to acquire the locks, an error will be thrown, which
+ * is promoted to FATAL and recovery will abort, bringing down postmaster.
+ * A safer approach would be to transfer the locks like we do in
+ * AtPrepare_Locks, but then again, in hot standby mode it's possible for
+ * read-only backends to use up all the shared lock memory anyway, so that
+ * replaying the WAL record that needs to acquire a lock will throw an error
+ * and PANIC anyway.
 */
 void
 lock_twophase_recover(TransactionId xid, uint16 info,
@ -2443,12 +2610,45 @@ lock_twophase_recover(TransactionId xid, uint16 info,

 	/*
 	 * We ignore any possible conflicts and just grant ourselves the lock.
+	 * We do this not only for simplicity, but also to avoid deadlocks when
+	 * switching from standby to normal mode. See function comment.
 	 */
 	GrantLock(lock, proclock, lockmode);

 	LWLockRelease(partitionLock);
 }

+/*
+ * Re-acquire a lock belonging to a transaction that was prepared, when
+ * starting up into hot standby mode.
+ */
+void
+lock_twophase_standby_recover(TransactionId xid, uint16 info,
+							  void *recdata, uint32 len)
+{
+	TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata;
+	LOCKTAG    *locktag;
+	LOCKMODE	lockmode;
+	LOCKMETHODID lockmethodid;
+
+	Assert(len == sizeof(TwoPhaseLockRecord));
+	locktag = &rec->locktag;
+	lockmode = rec->lockmode;
+	lockmethodid = locktag->locktag_lockmethodid;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	if (lockmode == AccessExclusiveLock &&
+		locktag->locktag_type == LOCKTAG_RELATION)
+	{
+		StandbyAcquireAccessExclusiveLock(xid,
+										  locktag->locktag_field1 /* dboid */,
+										  locktag->locktag_field2 /* reloid */);
+	}
+}
+
+
 /*
 * 2PC processing routine for COMMIT PREPARED case.
 *
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/storage/lmgr/proc.c,v 1.209 2009/08/31 19:41:00 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/storage/lmgr/proc.c,v 1.210 2009/12/19 01:32:36 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -318,6 +318,7 @@ InitProcess(void)
 	MyProc->waitProcLock = NULL;
 	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
 		SHMQueueInit(&(MyProc->myProcLocks[i]));
+	MyProc->recoveryConflictMode = 0;

 	/*
 	 * We might be reusing a semaphore that belonged to a failed process. So
@ -374,6 +375,11 @@ InitProcessPhase2(void)
 * to the ProcArray or the sinval messaging mechanism, either.	They also
 * don't get a VXID assigned, since this is only useful when we actually
 * hold lockmgr locks.
+ *
+ * Startup process however uses locks but never waits for them in the
+ * normal backend sense. Startup process also takes part in sinval messaging
+ * as a sendOnly process, so never reads messages from sinval queue. So
+ * Startup process does have a VXID and does show up in pg_locks.
 */
 void
 InitAuxiliaryProcess(void)
@ -461,6 +467,24 @@ InitAuxiliaryProcess(void)
 	on_shmem_exit(AuxiliaryProcKill, Int32GetDatum(proctype));
 }

+/*
+ * Record the PID and PGPROC structures for the Startup process, for use in
+ * ProcSendSignal().  See comments there for further explanation.
+ */
+void
+PublishStartupProcessInformation(void)
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile PROC_HDR *procglobal = ProcGlobal;
+
+	SpinLockAcquire(ProcStructLock);
+
+	procglobal->startupProc = MyProc;
+	procglobal->startupProcPid = MyProcPid;
+
+	SpinLockRelease(ProcStructLock);
+}
+
 /*
 * Check whether there are at least N free PGPROC objects.
 *
@ -1289,7 +1313,31 @@ ProcWaitForSignal(void)
 void
 ProcSendSignal(int pid)
 {
-	PGPROC	   *proc = BackendPidGetProc(pid);
+	PGPROC	   *proc = NULL;
+
+	if (RecoveryInProgress())
+	{
+		/* use volatile pointer to prevent code rearrangement */
+		volatile PROC_HDR *procglobal = ProcGlobal;
+
+		SpinLockAcquire(ProcStructLock);
+
+		/*
+		 * Check to see whether it is the Startup process we wish to signal.
+		 * This call is made by the buffer manager when it wishes to wake
+		 * up a process that has been waiting for a pin so it can obtain a
+		 * cleanup lock using LockBufferForCleanup(). Startup is not a normal
+		 * backend, so BackendPidGetProc() will not return a PGPROC for it.
+		 * So we remember the information for this special case.
+		 */
+		if (pid == procglobal->startupProcPid)
+			proc = procglobal->startupProc;
+
+		SpinLockRelease(ProcStructLock);
+	}
+
+	if (proc == NULL)
+		proc = BackendPidGetProc(pid);

 	if (proc != NULL)
 		PGSemaphoreUnlock(&proc->sem);
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/tcop/postgres.c,v 1.578 2009/12/16 23:05:00 petere Exp $
+ *	  $PostgreSQL: pgsql/src/backend/tcop/postgres.c,v 1.579 2009/12/19 01:32:36 sriggs Exp $
 *
 * NOTES
 *	  this is the "main" module of the postgres backend and
@ -62,6 +62,7 @@
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinval.h"
+#include "storage/standby.h"
 #include "tcop/fastpath.h"
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
@ -2643,8 +2644,8 @@ StatementCancelHandler(SIGNAL_ARGS)
 		 * the interrupt immediately.  No point in interrupting if we're
 		 * waiting for input, however.
 		 */
-		if (ImmediateInterruptOK && InterruptHoldoffCount == 0 &&
-			CritSectionCount == 0 && !DoingCommandRead)
+		if (InterruptHoldoffCount == 0 && CritSectionCount == 0 &&
+			(DoingCommandRead || ImmediateInterruptOK))
 		{
 			/* bump holdoff count to make ProcessInterrupts() a no-op */
 			/* until we are done getting ready for it */
@ -2735,9 +2736,58 @@ ProcessInterrupts(void)
 					(errcode(ERRCODE_QUERY_CANCELED),
 					 errmsg("canceling autovacuum task")));
 		else
+		{
+			int cancelMode = MyProc->recoveryConflictMode;
+
+			/*
+			 * XXXHS: We don't yet have a clean way to cancel an
+			 * idle-in-transaction session, so make it FATAL instead.
+			 * This isn't as bad as it looks because we don't issue a
+			 * CONFLICT_MODE_ERROR for a session with proc->xmin == 0
+			 * on cleanup conflicts. There's a possibility that we
+			 * marked somebody as a conflict and then they go idle.
+			 */
+			if (DoingCommandRead && IsTransactionBlock() &&
+				cancelMode == CONFLICT_MODE_ERROR)
+			{
+				cancelMode = CONFLICT_MODE_FATAL;
+			}
+
+			switch (cancelMode)
+			{
+				case CONFLICT_MODE_FATAL:
+						Assert(RecoveryInProgress());
+						ereport(FATAL,
+							(errcode(ERRCODE_QUERY_CANCELED),
+							 errmsg("canceling session due to conflict with recovery")));
+
+				case CONFLICT_MODE_ERROR:
+						/*
+						 * We are aborting because we need to release
+						 * locks. So we need to abort out of all
+						 * subtransactions to make sure we release
+						 * all locks at whatever their level.
+						 *
+						 * XXX Should we try to examine the
+						 * transaction tree and cancel just enough
+						 * subxacts to remove locks? Doubt it.
+						 */
+						Assert(RecoveryInProgress());
+						AbortOutOfAnyTransaction();
+						ereport(ERROR,
+							(errcode(ERRCODE_QUERY_CANCELED),
+							 errmsg("canceling statement due to conflict with recovery")));
+						return;
+
+				default:
+						/* No conflict pending, so fall through */
+						break;
+			}
+
 			ereport(ERROR,
 					(errcode(ERRCODE_QUERY_CANCELED),
 					 errmsg("canceling statement due to user request")));
+		}
 	}
 	/* If we get here, do nothing (probably, QueryCancelPending was reset) */
 }
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@ -10,7 +10,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/tcop/utility.c,v 1.324 2009/12/15 20:04:49 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/tcop/utility.c,v 1.325 2009/12/19 01:32:36 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -351,6 +351,7 @@ standard_ProcessUtility(Node *parsetree,
 						break;

 					case TRANS_STMT_PREPARE:
+						PreventCommandDuringRecovery();
 						if (!PrepareTransactionBlock(stmt->gid))
 						{
 							/* report unsuccessful commit in completionTag */
@ -360,11 +361,13 @@ standard_ProcessUtility(Node *parsetree,
 						break;

 					case TRANS_STMT_COMMIT_PREPARED:
+						PreventCommandDuringRecovery();
 						PreventTransactionChain(isTopLevel, "COMMIT PREPARED");
 						FinishPreparedTransaction(stmt->gid, true);
 						break;

 					case TRANS_STMT_ROLLBACK_PREPARED:
+						PreventCommandDuringRecovery();
 						PreventTransactionChain(isTopLevel, "ROLLBACK PREPARED");
 						FinishPreparedTransaction(stmt->gid, false);
 						break;
@ -742,6 +745,7 @@ standard_ProcessUtility(Node *parsetree,
 			break;

 		case T_GrantStmt:
+			PreventCommandDuringRecovery();
 			ExecuteGrantStmt((GrantStmt *) parsetree);
 			break;

@ -923,6 +927,7 @@ standard_ProcessUtility(Node *parsetree,
 		case T_NotifyStmt:
 			{
 				NotifyStmt *stmt = (NotifyStmt *) parsetree;
+				PreventCommandDuringRecovery();

 				Async_Notify(stmt->conditionname);
 			}
@ -931,6 +936,7 @@ standard_ProcessUtility(Node *parsetree,
 		case T_ListenStmt:
 			{
 				ListenStmt *stmt = (ListenStmt *) parsetree;
+				PreventCommandDuringRecovery();

 				CheckRestrictedOperation("LISTEN");
 				Async_Listen(stmt->conditionname);
@ -940,6 +946,7 @@ standard_ProcessUtility(Node *parsetree,
 		case T_UnlistenStmt:
 			{
 				UnlistenStmt *stmt = (UnlistenStmt *) parsetree;
+				PreventCommandDuringRecovery();

 				CheckRestrictedOperation("UNLISTEN");
 				if (stmt->conditionname)
@ -960,10 +967,12 @@ standard_ProcessUtility(Node *parsetree,
 			break;

 		case T_ClusterStmt:
+			PreventCommandDuringRecovery();
 			cluster((ClusterStmt *) parsetree, isTopLevel);
 			break;

 		case T_VacuumStmt:
+			PreventCommandDuringRecovery();
 			vacuum((VacuumStmt *) parsetree, InvalidOid, true, NULL, false,
 				   isTopLevel);
 			break;
@ -1083,12 +1092,21 @@ standard_ProcessUtility(Node *parsetree,
 				ereport(ERROR,
 						(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 						 errmsg("must be superuser to do CHECKPOINT")));
-			RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+			/*
+			 * You might think we should have a PreventCommandDuringRecovery()
+			 * here, but we interpret a CHECKPOINT command during recovery
+			 * as a request for a restartpoint instead. We allow this since
+			 * it can be a useful way of reducing switchover time when
+			 * using various forms of replication.
+			 */
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
+								(RecoveryInProgress() ? 0 : CHECKPOINT_FORCE));
 			break;

 		case T_ReindexStmt:
 			{
 				ReindexStmt *stmt = (ReindexStmt *) parsetree;
+				PreventCommandDuringRecovery();

 				switch (stmt->kind)
 				{
@ -2604,3 +2622,12 @@ GetCommandLogLevel(Node *parsetree)

 	return lev;
 }
+
+void
+PreventCommandDuringRecovery(void)
+{
+	if (RecoveryInProgress())
+		ereport(ERROR,
+			(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+			 errmsg("cannot be executed during recovery")));
+}
--- a/src/backend/utils/adt/txid.c
+++ b/src/backend/utils/adt/txid.c
@ -14,7 +14,7 @@
 *	Author: Jan Wieck, Afilias USA INC.
 *	64-bit txids: Marko Kreen, Skype Technologies
 *
- *	$PostgreSQL: pgsql/src/backend/utils/adt/txid.c,v 1.8 2009/01/01 17:23:50 momjian Exp $
+ *	$PostgreSQL: pgsql/src/backend/utils/adt/txid.c,v 1.9 2009/12/19 01:32:36 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -24,6 +24,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "funcapi.h"
+#include "miscadmin.h"
 #include "libpq/pqformat.h"
 #include "utils/builtins.h"
 #include "utils/snapmgr.h"
@ -338,6 +339,15 @@ txid_current(PG_FUNCTION_ARGS)
 	txid		val;
 	TxidEpoch	state;

+	/*
+	 * Must prevent this during recovery because if an xid is
+	 * not assigned we try to assign one, which would fail.
+	 * Programs already rely on this function to always
+	 * return a valid current xid, so we should not change
+	 * this to return NULL or similar invalid xid.
+	 */
+	PreventCommandDuringRecovery();
+
 	load_xid_epoch(&state);

 	val = convert_xid(GetTopTransactionId(), &state);
--- a/src/backend/utils/adt/xid.c
+++ b/src/backend/utils/adt/xid.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/utils/adt/xid.c,v 1.12 2009/01/01 17:23:50 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/utils/adt/xid.c,v 1.13 2009/12/19 01:32:36 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -102,6 +102,25 @@ xid_age(PG_FUNCTION_ARGS)
 	PG_RETURN_INT32((int32) (now - xid));
 }

+/*
+ * xidComparator
+ *		qsort comparison function for XIDs
+ *
+ * We can't use wraparound comparison for XIDs because that does not respect
+ * the triangle inequality!  Any old sort order will do.
+ */
+int
+xidComparator(const void *arg1, const void *arg2)
+{
+	TransactionId xid1 = *(const TransactionId *) arg1;
+	TransactionId xid2 = *(const TransactionId *) arg2;
+
+	if (xid1 > xid2)
+		return 1;
+	if (xid1 < xid2)
+		return -1;
+	return 0;
+}

 /*****************************************************************************
 *	 COMMAND IDENTIFIER ROUTINES											 *
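
The xidComparator() comment above says wraparound comparison cannot be used for sorting. A minimal standalone sketch of why, assuming a 32-bit XID and ignoring special (non-normal) XIDs: xid_t, xid_precedes(), xid_cmp() and sort_xids() are hypothetical names used for illustration only, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t xid_t;		/* hypothetical stand-in for TransactionId */

/*
 * Wraparound ("circular") ordering: a precedes b when the signed 32-bit
 * difference is negative. This is not transitive, so it is not a valid
 * sort order.
 */
int
xid_precedes(xid_t a, xid_t b)
{
	return (int32_t) (a - b) < 0;
}

/* Plain numeric ordering, the same shape as xidComparator() in the patch. */
int
xid_cmp(const void *arg1, const void *arg2)
{
	xid_t		xid1 = *(const xid_t *) arg1;
	xid_t		xid2 = *(const xid_t *) arg2;

	if (xid1 > xid2)
		return 1;
	if (xid1 < xid2)
		return -1;
	return 0;
}

/* Sort an array of XIDs in plain numeric order. */
void
sort_xids(xid_t *xids, size_t n)
{
	assert(xids != NULL || n == 0);
	if (n > 0)
		qsort(xids, n, sizeof(xid_t), xid_cmp);
}
```

With the three values 0, 0x60000000 and 0xC0000000, xid_precedes() reports that each one precedes the next and that the last precedes the first, which is the cycle that would confuse qsort(). The plain comparator has no such cycle, at the cost of an order that carries no "older/newer" meaning.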
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@ -80,7 +80,7 @@
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/utils/cache/inval.c,v 1.89 2009/06/11 14:49:05 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/utils/cache/inval.c,v 1.90 2009/12/19 01:32:36 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -155,6 +155,11 @@ typedef struct TransInvalidationInfo

 static TransInvalidationInfo *transInvalInfo = NULL;

+static SharedInvalidationMessage *SharedInvalidMessagesArray;
+static int 					numSharedInvalidMessagesArray;
+static int 					maxSharedInvalidMessagesArray;
+
+
 /*
 * Dynamically-registered callback functions.  Current implementation
 * assumes there won't be very many of these at once; could improve if needed.
@ -180,14 +185,6 @@ static struct RELCACHECALLBACK

 static int	relcache_callback_count = 0;

-/* info values for 2PC callback */
-#define TWOPHASE_INFO_MSG			0	/* SharedInvalidationMessage */
-#define TWOPHASE_INFO_FILE_BEFORE	1	/* relcache file inval */
-#define TWOPHASE_INFO_FILE_AFTER	2	/* relcache file inval */
-
-static void PersistInvalidationMessage(SharedInvalidationMessage *msg);
-
-
 /* ----------------------------------------------------------------
 *				Invalidation list support functions
 *
@ -741,38 +738,8 @@ AtStart_Inval(void)
 		MemoryContextAllocZero(TopTransactionContext,
 							   sizeof(TransInvalidationInfo));
 	transInvalInfo->my_level = GetCurrentTransactionNestLevel();
-}
-
-/*
- * AtPrepare_Inval
- *		Save the inval lists state at 2PC transaction prepare.
- *
- * In this phase we just generate 2PC records for all the pending invalidation
- * work.
- */
-void
-AtPrepare_Inval(void)
-{
-	/* Must be at top of stack */
-	Assert(transInvalInfo != NULL && transInvalInfo->parent == NULL);
-
-	/*
-	 * Relcache init file invalidation requires processing both before and
-	 * after we send the SI messages.
-	 */
-	if (transInvalInfo->RelcacheInitFileInval)
-		RegisterTwoPhaseRecord(TWOPHASE_RM_INVAL_ID, TWOPHASE_INFO_FILE_BEFORE,
-							   NULL, 0);
-
-	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
-							   &transInvalInfo->CurrentCmdInvalidMsgs);
-
-	ProcessInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
-								PersistInvalidationMessage);
-
-	if (transInvalInfo->RelcacheInitFileInval)
-		RegisterTwoPhaseRecord(TWOPHASE_RM_INVAL_ID, TWOPHASE_INFO_FILE_AFTER,
-							   NULL, 0);
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
 }

 /*
@ -812,46 +779,98 @@ AtSubStart_Inval(void)
 }

 /*
- * PersistInvalidationMessage
- *		Write an invalidation message to the 2PC state file.
+ * Collect invalidation messages into SharedInvalidMessagesArray array.
 */
 static void
-PersistInvalidationMessage(SharedInvalidationMessage *msg)
+MakeSharedInvalidMessagesArray(const SharedInvalidationMessage *msgs, int n)
 {
-	RegisterTwoPhaseRecord(TWOPHASE_RM_INVAL_ID, TWOPHASE_INFO_MSG,
-						   msg, sizeof(SharedInvalidationMessage));
+	/*
+	 * Initialise array first time through in each commit
+	 */
+	if (SharedInvalidMessagesArray == NULL)
+	{
+		maxSharedInvalidMessagesArray = FIRSTCHUNKSIZE;
+		numSharedInvalidMessagesArray = 0;
+
+		/*
+		 * Although this is being palloc'd we don't actually free it directly.
+		 * We're so close to EOXact that we know we're going to lose it anyhow.
+		 */
+		SharedInvalidMessagesArray = palloc(maxSharedInvalidMessagesArray
+											* sizeof(SharedInvalidationMessage));
+	}
+
+	if ((numSharedInvalidMessagesArray + n) > maxSharedInvalidMessagesArray)
+	{
+		while ((numSharedInvalidMessagesArray + n) > maxSharedInvalidMessagesArray)
+			maxSharedInvalidMessagesArray *= 2;
+
+		SharedInvalidMessagesArray = repalloc(SharedInvalidMessagesArray,
+											maxSharedInvalidMessagesArray
+											* sizeof(SharedInvalidationMessage));
+	}
+
+	/*
+	 * Append the next chunk onto the array
+	 */
+	memcpy(SharedInvalidMessagesArray + numSharedInvalidMessagesArray,
+			msgs, n * sizeof(SharedInvalidationMessage));
+	numSharedInvalidMessagesArray += n;
 }

 /*
- * inval_twophase_postcommit
- *		Process an invalidation message from the 2PC state file.
+ * xactGetCommittedInvalidationMessages() is executed by
+ * RecordTransactionCommit() to add invalidation messages onto the
+ * commit record. This applies only to commit message types, never to
+ * abort records. Must always run before AtEOXact_Inval(), since that
+ * removes the data we need to see.
+ *
+ * Remember that this runs before we have officially committed, so we
+ * must not do anything here to change what might occur *if* we should
+ * fail between here and the actual commit.
+ *
+ * see also xact_redo_commit() and xact_desc_commit()
 */
-void
-inval_twophase_postcommit(TransactionId xid, uint16 info,
-						  void *recdata, uint32 len)
+int
+xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs,
+									 bool *RelcacheInitFileInval)
 {
-	SharedInvalidationMessage *msg;
+	MemoryContext oldcontext;

-	switch (info)
-	{
-		case TWOPHASE_INFO_MSG:
-			msg = (SharedInvalidationMessage *) recdata;
-			Assert(len == sizeof(SharedInvalidationMessage));
-			SendSharedInvalidMessages(msg, 1);
-			break;
-		case TWOPHASE_INFO_FILE_BEFORE:
-			RelationCacheInitFileInvalidate(true);
-			break;
-		case TWOPHASE_INFO_FILE_AFTER:
-			RelationCacheInitFileInvalidate(false);
-			break;
-		default:
-			Assert(false);
-			break;
-	}
+	/* Must be at top of stack */
+	Assert(transInvalInfo != NULL && transInvalInfo->parent == NULL);
+
+	/*
+	 * Relcache init file invalidation requires processing both before and
+	 * after we send the SI messages.  However, we need not do anything
+	 * unless we committed.
+	 */
+	*RelcacheInitFileInval = transInvalInfo->RelcacheInitFileInval;
+
+	/*
+	 * Walk through TransInvalidationInfo to collect all the messages
+	 * into a single contiguous array of invalidation messages. It must
+	 * be contiguous so we can copy directly into WAL message. Maintain the
+	 * order that they would be processed in by AtEOXact_Inval(), to ensure
+	 * emulated behaviour in redo is as similar as possible to original.
+	 * We want the same bugs, if any, not new ones.
+	 */
+	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
+
+	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+	ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs,
+									 MakeSharedInvalidMessagesArray);
+	MemoryContextSwitchTo(oldcontext);
+
+	Assert(!(numSharedInvalidMessagesArray > 0 &&
+			 SharedInvalidMessagesArray == NULL));
+
+	*msgs = SharedInvalidMessagesArray;
+
+	return numSharedInvalidMessagesArray;
 }

-
 /*
 * AtEOXact_Inval
 *		Process queued-up invalidation messages at end of main transaction.
@ -1028,6 +1047,8 @@ CommandEndInvalidationMessages(void)
 * no need to worry about cleaning up if there's an elog(ERROR) before
 * reaching EndNonTransactionalInvalidation (the invals will just be thrown
 * away if that happens).
+ *
+ * Note that these are not replayed in standby mode.
 */
 void
 BeginNonTransactionalInvalidation(void)
@ -1041,6 +1062,9 @@ BeginNonTransactionalInvalidation(void)
 	Assert(transInvalInfo->CurrentCmdInvalidMsgs.cclist == NULL);
 	Assert(transInvalInfo->CurrentCmdInvalidMsgs.rclist == NULL);
 	Assert(transInvalInfo->RelcacheInitFileInval == false);
+
+	SharedInvalidMessagesArray = NULL;
+	numSharedInvalidMessagesArray = 0;
 }

 /*
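
The growth strategy used by MakeSharedInvalidMessagesArray() above (allocate a first chunk lazily, then double the capacity until the next batch fits) can be sketched outside the backend as follows. This is only an illustration under stated assumptions: it uses malloc/realloc instead of palloc/repalloc, and msg_t, msg_array, FIRST_CHUNK and msg_array_append() are hypothetical names, not identifiers from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for SharedInvalidationMessage */
typedef struct
{
	int			id;
} msg_t;

typedef struct
{
	msg_t	   *items;		/* NULL until the first append */
	size_t		count;		/* entries in use */
	size_t		capacity;	/* entries allocated */
} msg_array;

#define FIRST_CHUNK 16		/* assumed initial size, for illustration */

/* Append n messages, doubling capacity as needed. Returns 0 on success. */
int
msg_array_append(msg_array *arr, const msg_t *msgs, size_t n)
{
	assert(arr != NULL);

	/* Allocate the first chunk lazily */
	if (arr->items == NULL)
	{
		arr->items = malloc(FIRST_CHUNK * sizeof(msg_t));
		if (arr->items == NULL)
			return -1;
		arr->capacity = FIRST_CHUNK;
		arr->count = 0;
	}

	/* Double until the new batch fits, then reallocate once */
	if (arr->count + n > arr->capacity)
	{
		size_t		newcap = arr->capacity;
		msg_t	   *grown;

		while (arr->count + n > newcap)
			newcap *= 2;

		grown = realloc(arr->items, newcap * sizeof(msg_t));
		if (grown == NULL)
			return -1;
		arr->items = grown;
		arr->capacity = newcap;
	}

	/* Append the next chunk onto the array */
	if (n > 0)
		memcpy(arr->items + arr->count, msgs, n * sizeof(msg_t));
	arr->count += n;
	return 0;
}
```

Keeping the messages in one contiguous array is what lets the caller copy them straight into a single WAL record payload. Unlike this sketch, the backend version never frees the array explicitly because the memory context is reset at end of transaction.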
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@ -42,7 +42,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/utils/error/elog.c,v 1.219 2009/11/28 23:38:07 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/utils/error/elog.c,v 1.220 2009/12/19 01:32:37 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -2794,3 +2794,21 @@ is_log_level_output(int elevel, int log_min_level)

 	return false;
 }
+
+/*
+ * If trace_recovery_messages is set to make this visible, then show as LOG,
+ * else display as whatever level is set. It may still be shown, but only
+ * if log_min_messages is set lower than trace_recovery_messages.
+ *
+ * Intention is to keep this for at least the whole of the 8.5 production
+ * release, so we can more easily diagnose production problems in the field.
+ */
+int
+trace_recovery(int trace_level)
+{
+	if (trace_level < LOG &&
+		trace_level >= trace_recovery_messages)
+			return LOG;
+
+	return trace_level;
+}
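
The level promotion done by trace_recovery() above can be tried in isolation with the sketch below. The numeric values of the LVL_* constants are assumptions that only mirror the relative ordering of the DEBUGn and LOG levels, and promote_recovery_level() is a hypothetical name that takes the setting as a parameter instead of reading the trace_recovery_messages global.

```c
#include <assert.h>

/* Assumed values: only the ordering DEBUG5 < ... < DEBUG1 < LOG matters here */
enum
{
	LVL_DEBUG5 = 10,
	LVL_DEBUG4,
	LVL_DEBUG3,
	LVL_DEBUG2,
	LVL_DEBUG1,
	LVL_LOG
};

/*
 * Promote a sub-LOG message to LOG when its level is at or above the
 * configured trace threshold; otherwise leave the level unchanged.
 */
int
promote_recovery_level(int trace_level, int trace_setting)
{
	assert(trace_level >= LVL_DEBUG5);

	if (trace_level < LVL_LOG && trace_level >= trace_setting)
		return LVL_LOG;

	return trace_level;
}
```

For example, with the threshold set to LVL_DEBUG2, a LVL_DEBUG2 message is promoted to LVL_LOG while a LVL_DEBUG3 message keeps its level and is then subject to the usual log_min_messages filtering.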
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@ -8,7 +8,7 @@
 *
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/utils/init/postinit.c,v 1.198 2009/10/07 22:14:23 alvherre Exp $
+ *	  $PostgreSQL: pgsql/src/backend/utils/init/postinit.c,v 1.199 2009/12/19 01:32:37 sriggs Exp $
 *
 *
 *-------------------------------------------------------------------------
@ -481,7 +481,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	 */
 	MyBackendId = InvalidBackendId;

-	SharedInvalBackendInit();
+	SharedInvalBackendInit(false);

 	if (MyBackendId > MaxBackends || MyBackendId <= 0)
 		elog(FATAL, "bad backend id: %d", MyBackendId);
@ -495,11 +495,11 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	InitBufferPoolBackend();

 	/*
-	 * Initialize local process's access to XLOG.  In bootstrap case we may
-	 * skip this since StartupXLOG() was run instead.
+	 * Initialize local process's access to XLOG, if appropriate.  In bootstrap
+	 * case we skip this since StartupXLOG() was run instead.
 	 */
 	if (!bootstrap)
-		InitXLOGAccess();
+		(void) RecoveryInProgress();

 	/*
 	 * Initialize the relation cache and the system catalog caches.  Note that
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@ -10,7 +10,7 @@
 * Written by Peter Eisentraut <peter_e@gmx.net>.
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/utils/misc/guc.c,v 1.527 2009/12/11 03:34:56 itagaki Exp $
+ *	  $PostgreSQL: pgsql/src/backend/utils/misc/guc.c,v 1.528 2009/12/19 01:32:37 sriggs Exp $
 *
 *--------------------------------------------------------------------
 */
@ -114,6 +114,9 @@ extern char *default_tablespace;
 extern char *temp_tablespaces;
 extern bool synchronize_seqscans;
 extern bool fullPageWrites;
+extern int	vacuum_defer_cleanup_age;
+
+int	trace_recovery_messages = LOG;

 #ifdef TRACE_SORT
 extern bool trace_sort;
@ -1206,6 +1209,17 @@ static struct config_bool ConfigureNamesBool[] =
 		false, NULL, NULL
 	},

+	{
+		{"recovery_connections", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("During recovery, allows connections and queries. "
+						 " During normal running, causes additional info to be written"
+						 " to WAL to enable hot standby mode on WAL standby nodes."),
+			NULL
+		},
+		&XLogRequestRecoveryConnections,
+		true, NULL, NULL
+	},
+
 	{
 		{"allow_system_table_mods", PGC_POSTMASTER, DEVELOPER_OPTIONS,
 			gettext_noop("Allows modifications of the structure of system tables."),
@ -1347,6 +1361,8 @@ static struct config_int ConfigureNamesInt[] =
 	 * plus autovacuum_max_workers plus one (for the autovacuum launcher).
 	 *
 	 * Likewise we have to limit NBuffers to INT_MAX/2.
+	 *
+	 * See also CheckRequiredParameterValues() if this parameter changes
 	 */
 	{
 		{"max_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
@ -1357,6 +1373,15 @@ static struct config_int ConfigureNamesInt[] =
 		100, 1, INT_MAX / 4, assign_maxconnections, NULL
 	},

+	{
+		{"max_standby_delay", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Sets the maximum delay to avoid conflict processing on Hot Standby servers."),
+			NULL
+		},
+		&MaxStandbyDelay,
+		30, -1, INT_MAX, NULL, NULL
+	},
+
 	{
 		{"superuser_reserved_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
 			gettext_noop("Sets the number of connection slots reserved for superusers."),
@ -1514,6 +1539,9 @@ static struct config_int ConfigureNamesInt[] =
 		1000, 25, INT_MAX, NULL, NULL
 	},

+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
 	{
 		{"max_prepared_transactions", PGC_POSTMASTER, RESOURCES,
 			gettext_noop("Sets the maximum number of simultaneously prepared transactions."),
@ -1572,6 +1600,18 @@ static struct config_int ConfigureNamesInt[] =
 		150000000, 0, 2000000000, NULL, NULL
 	},

+	{
+		{"vacuum_defer_cleanup_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Age by which VACUUM and HOT cleanup should be deferred, if any."),
+			NULL
+		},
+		&vacuum_defer_cleanup_age,
+		0, 0, 1000000, NULL, NULL
+	},
+
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
 	{
 		{"max_locks_per_transaction", PGC_POSTMASTER, LOCK_MANAGEMENT,
 			gettext_noop("Sets the maximum number of locks per transaction."),
@ -2684,6 +2724,16 @@ static struct config_enum ConfigureNamesEnum[] =
 		assign_session_replication_role, NULL
 	},

+	{
+		{"trace_recovery_messages", PGC_SUSET, LOGGING_WHEN,
+			gettext_noop("Sets the message levels that are logged during recovery."),
+			gettext_noop("Each level includes all the levels that follow it. The later"
+						 " the level, the fewer messages are sent.")
+		},
+		&trace_recovery_messages,
+		DEBUG1, server_message_level_options, NULL, NULL
+	},
+
 	{
 		{"track_functions", PGC_SUSET, STATS_COLLECTOR,
 			gettext_noop("Collects function-level statistics on database activity."),
@ -7511,6 +7561,18 @@ assign_transaction_read_only(bool newval, bool doit, GucSource source)
 		if (source != PGC_S_OVERRIDE)
 			return false;
 	}
+
+	/* Can't go to r/w mode while recovery is still active */
+	if (newval == false && XactReadOnly && RecoveryInProgress())
+	{
+		ereport(GUC_complaint_elevel(source),
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("cannot set transaction read-write mode during recovery")));
+		/* source == PGC_S_OVERRIDE means do it anyway, eg at xact abort */
+		if (source != PGC_S_OVERRIDE)
+			return false;
+	}
+
 	return true;
 }

--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@ -181,6 +181,9 @@
 #archive_timeout = 0		# force a logfile segment switch after this
 				# number of seconds; 0 disables

+#recovery_connections = on	# allows connections during recovery
+#max_standby_delay = 30		# max acceptable standby lag (s) to help queries
+				# complete without conflict; -1 disables

 #------------------------------------------------------------------------------
 # QUERY TUNING
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@ -19,7 +19,7 @@
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/utils/time/snapmgr.c,v 1.12 2009/10/07 16:27:18 alvherre Exp $
+ *	  $PostgreSQL: pgsql/src/backend/utils/time/snapmgr.c,v 1.13 2009/12/19 01:32:37 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -224,8 +224,14 @@ CopySnapshot(Snapshot snapshot)
 	else
 		newsnap->xip = NULL;

-	/* setup subXID array */
-	if (snapshot->subxcnt > 0)
+	/*
+	 * Set up subXID array. Don't bother to copy it if it had overflowed,
+	 * though, because it's not used anywhere in that case. The exception is
+	 * a snapshot taken during recovery: all the top-level XIDs are in subxip
+	 * as well in that case, so we mustn't lose them.
+	 */
+	if (snapshot->subxcnt > 0 &&
+		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
 	{
 		newsnap->subxip = (TransactionId *) ((char *) newsnap + subxipoff);
 		memcpy(newsnap->subxip, snapshot->subxip,
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@ -50,7 +50,7 @@
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/utils/time/tqual.c,v 1.113 2009/06/11 14:49:06 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/utils/time/tqual.c,v 1.114 2009/12/19 01:32:37 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -1257,42 +1257,84 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 		return true;

 	/*
-	 * If the snapshot contains full subxact data, the fastest way to check
-	 * things is just to compare the given XID against both subxact XIDs and
-	 * top-level XIDs.	If the snapshot overflowed, we have to use pg_subtrans
-	 * to convert a subxact XID to its parent XID, but then we need only look
-	 * at top-level XIDs not subxacts.
+	 * Snapshot information is stored slightly differently in snapshots
+	 * taken during recovery.
 	 */
-	if (snapshot->subxcnt >= 0)
+	if (!snapshot->takenDuringRecovery)
+	{
+		/*
+		 * If the snapshot contains full subxact data, the fastest way to check
+		 * things is just to compare the given XID against both subxact XIDs and
+		 * top-level XIDs.	If the snapshot overflowed, we have to use pg_subtrans
+		 * to convert a subxact XID to its parent XID, but then we need only look
+		 * at top-level XIDs not subxacts.
+		 */
+		if (!snapshot->suboverflowed)
+		{
+			/* full data, so search subxip */
+			int32		j;
+
+			for (j = 0; j < snapshot->subxcnt; j++)
+			{
+				if (TransactionIdEquals(xid, snapshot->subxip[j]))
+					return true;
+			}
+
+			/* not there, fall through to search xip[] */
+		}
+		else
+		{
+			/* overflowed, so convert xid to top-level */
+			xid = SubTransGetTopmostTransaction(xid);
+
+			/*
+			 * If xid was indeed a subxact, we might now have an xid < xmin, so
+			 * recheck to avoid an array scan.	No point in rechecking xmax.
+			 */
+			if (TransactionIdPrecedes(xid, snapshot->xmin))
+				return false;
+		}
+
+		for (i = 0; i < snapshot->xcnt; i++)
+		{
+			if (TransactionIdEquals(xid, snapshot->xip[i]))
+				return true;
+		}
+	}
+	else
 	{
-		/* full data, so search subxip */
 		int32		j;

+		/*
+		 * In recovery we store all xids in the subxact array because it
+		 * is by far the bigger array, and we mostly don't know which xids
+		 * are top-level and which are subxacts. The xip array is empty.
+		 *
+		 * We start by searching subtrans, if we overflowed.
+		 */
+		if (snapshot->suboverflowed)
+		{
+			/* overflowed, so convert xid to top-level */
+			xid = SubTransGetTopmostTransaction(xid);
+
+			/*
+			 * If xid was indeed a subxact, we might now have an xid < xmin, so
+			 * recheck to avoid an array scan.	No point in rechecking xmax.
+			 */
+			if (TransactionIdPrecedes(xid, snapshot->xmin))
+				return false;
+		}
+
+		/*
+		 * We now have either a top-level xid higher than xmin or an
+		 * indeterminate xid. Whether it's top level or subxact doesn't matter:
+		 * if it's present in subxip, it is still in progress per this snapshot.
+		 */
 		for (j = 0; j < snapshot->subxcnt; j++)
 		{
 			if (TransactionIdEquals(xid, snapshot->subxip[j]))
 				return true;
 		}
-
-		/* not there, fall through to search xip[] */
-	}
-	else
-	{
-		/* overflowed, so convert xid to top-level */
-		xid = SubTransGetTopmostTransaction(xid);
-
-		/*
-		 * If xid was indeed a subxact, we might now have an xid < xmin, so
-		 * recheck to avoid an array scan.	No point in rechecking xmax.
-		 */
-		if (TransactionIdPrecedes(xid, snapshot->xmin))
-			return false;
-	}
-
-	for (i = 0; i < snapshot->xcnt; i++)
-	{
-		if (TransactionIdEquals(xid, snapshot->xip[i]))
-			return true;
 	}

 	return false;
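The recovery branch of XidInMVCCSnapshot() can be sketched in isolation. This is a simplified model, not the tqual.c code: `RecoverySnapshot`, `TopmostLookup`, `xid_precedes`, `xid_in_recovery_snapshot` and `demo_topmost` are hypothetical names, the lookup callback stands in for SubTransGetTopmostTransaction(), and the xmin/xmax range checks that precede this branch in the real function are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;

/* Hypothetical stand-in for SubTransGetTopmostTransaction(). */
typedef Xid (*TopmostLookup) (Xid xid);

/* Only the fields the recovery branch reads; xip[] is empty in recovery. */
typedef struct RecoverySnapshot
{
	Xid			xmin;
	const Xid  *subxip;			/* all known XIDs, top-level and subxact */
	int32_t		subxcnt;
	bool		suboverflowed;
} RecoverySnapshot;

/* Circular comparison, meaningful for XIDs less than 2^31 apart. */
static bool
xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/* Toy lookup for the usage example: subxact 105 belongs to top-level 100. */
static Xid
demo_topmost(Xid xid)
{
	return xid == 105 ? 100 : xid;
}

/* Returns true if xid counts as still in progress for this snapshot. */
static bool
xid_in_recovery_snapshot(Xid xid, const RecoverySnapshot *snap,
						 TopmostLookup topmost)
{
	int32_t		j;

	assert(snap != NULL);

	if (snap->suboverflowed)
	{
		/* overflowed, so map a possible subxact to its top-level xid */
		assert(topmost != NULL);
		xid = topmost(xid);

		/* the parent may be older than xmin; skip the array scan then */
		if (xid_precedes(xid, snap->xmin))
			return false;
	}

	for (j = 0; j < snap->subxcnt; j++)
	{
		if (xid == snap->subxip[j])
			return true;
	}
	return false;
}
```

For example, with `subxip = {100, 102}`, `xmin = 100` and an overflowed snapshot, XID 105 maps to 100 through `demo_topmost` and is reported as in progress, while XID 99 is rejected before the array is scanned.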
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@ -6,7 +6,7 @@
 * copyright (c) Oliver Elphick <olly@lfix.co.uk>, 2001;
 * licence: BSD
 *
- * $PostgreSQL: pgsql/src/bin/pg_controldata/pg_controldata.c,v 1.44 2009/08/31 02:23:22 tgl Exp $
+ * $PostgreSQL: pgsql/src/bin/pg_controldata/pg_controldata.c,v 1.45 2009/12/19 01:32:38 sriggs Exp $
 */
 #include "postgres_fe.h"

@ -196,6 +196,8 @@ main(int argc, char *argv[])
 		   ControlFile.checkPointCopy.oldestXid);
 	printf(_("Latest checkpoint's oldestXID's DB:   %u\n"),
 		   ControlFile.checkPointCopy.oldestXidDB);
+	printf(_("Latest checkpoint's oldestActiveXID:  %u\n"),
+		   ControlFile.checkPointCopy.oldestActiveXid);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Minimum recovery ending location:     %X/%X\n"),
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/access/heapam.h,v 1.144 2009/08/24 02:18:32 tgl Exp $
+ * $PostgreSQL: pgsql/src/include/access/heapam.h,v 1.145 2009/12/19 01:32:42 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -130,11 +130,13 @@ extern XLogRecPtr log_heap_move(Relation reln, Buffer oldbuf,
 			  ItemPointerData from,
 			  Buffer newbuf, HeapTuple newtup,
 			  bool all_visible_cleared, bool new_all_visible_cleared);
+extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
+				TransactionId latestRemovedXid);
 extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
 			   OffsetNumber *redirected, int nredirected,
 			   OffsetNumber *nowdead, int ndead,
 			   OffsetNumber *nowunused, int nunused,
-			   bool redirect_move);
+			   TransactionId latestRemovedXid, bool redirect_move);
 extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
 				TransactionId cutoff_xid,
 				OffsetNumber *offsets, int offcnt);
--- a/src/include/access/htup.h
+++ b/src/include/access/htup.h
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/access/htup.h,v 1.107 2009/06/11 14:49:08 momjian Exp $
+ * $PostgreSQL: pgsql/src/include/access/htup.h,v 1.108 2009/12/19 01:32:42 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -580,6 +580,7 @@ typedef HeapTupleData *HeapTuple;
 #define XLOG_HEAP2_FREEZE		0x00
 #define XLOG_HEAP2_CLEAN		0x10
 #define XLOG_HEAP2_CLEAN_MOVE	0x20
+#define XLOG_HEAP2_CLEANUP_INFO 0x30

 /*
 * All what we need to find changed tuple
@ -668,6 +669,7 @@ typedef struct xl_heap_clean
 {
 	RelFileNode node;
 	BlockNumber block;
+	TransactionId	latestRemovedXid;
 	uint16		nredirected;
 	uint16		ndead;
 	/* OFFSET NUMBERS FOLLOW */
@ -675,6 +677,19 @@ typedef struct xl_heap_clean

 #define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))

+/*
+ * Cleanup_info is required in some cases during a lazy VACUUM.
+ * It reports the result of HeapTupleHeaderAdvanceLatestRemovedXid();
+ * see vacuumlazy.c for a full explanation.
+ */
+typedef struct xl_heap_cleanup_info
+{
+	RelFileNode 	node;
+	TransactionId	latestRemovedXid;
+} xl_heap_cleanup_info;
+
+#define SizeOfHeapCleanupInfo (sizeof(xl_heap_cleanup_info))
+
 /* This is for replacing a page's contents in toto */
 /* NB: this is used for indexes as well as heaps */
 typedef struct xl_heap_newpage
@ -718,6 +733,9 @@ typedef struct xl_heap_freeze

 #define SizeOfHeapFreeze (offsetof(xl_heap_freeze, cutoff_xid) + sizeof(TransactionId))

+extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
+										TransactionId *latestRemovedXid);
+
 /* HeapTupleHeader functions implemented in utils/time/combocid.c */
 extern CommandId HeapTupleHeaderGetCmin(HeapTupleHeader tup);
 extern CommandId HeapTupleHeaderGetCmax(HeapTupleHeader tup);
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.125 2009/07/29 20:56:19 tgl Exp $
+ * $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.126 2009/12/19 01:32:42 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -214,12 +214,13 @@ typedef struct BTMetaPageData
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
 #define XLOG_BTREE_SPLIT_L_ROOT 0x50	/* add tuple with split of root */
 #define XLOG_BTREE_SPLIT_R_ROOT 0x60	/* as above, new item on right */
-#define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuple */
+#define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_DELETE_PAGE	0x80	/* delete an entire page */
 #define XLOG_BTREE_DELETE_PAGE_META 0x90		/* same, and update metapage */
 #define XLOG_BTREE_NEWROOT		0xA0	/* new root page */
 #define XLOG_BTREE_DELETE_PAGE_HALF 0xB0		/* page deletion that makes
 												 * parent half-dead */
+#define XLOG_BTREE_VACUUM		0xC0	/* delete entries on a page during vacuum */

 /*
 * All that we need to find changed index tuple
@ -306,16 +307,53 @@ typedef struct xl_btree_split
 /*
 * This is what we need to know about delete of individual leaf index tuples.
 * The WAL record can represent deletion of any number of index tuples on a
- * single index page.
+ * single index page when *not* executed by VACUUM.
 */
 typedef struct xl_btree_delete
 {
 	RelFileNode node;
 	BlockNumber block;
+	TransactionId	latestRemovedXid;
+	int			numItems;		 /* number of items in the offset array */
+
 	/* TARGET OFFSET NUMBERS FOLLOW AT THE END */
 } xl_btree_delete;

-#define SizeOfBtreeDelete	(offsetof(xl_btree_delete, block) + sizeof(BlockNumber))
+#define SizeOfBtreeDelete	(offsetof(xl_btree_delete, latestRemovedXid) + sizeof(TransactionId))
+
+/*
+ * This is what we need to know about vacuum of individual leaf index tuples.
+ * The WAL record can represent deletion of any number of index tuples on a
+ * single index page when executed by VACUUM.
+ *
+ * The correctness requirement for applying these changes during recovery is
+ * that we must do one of these two things for every block in the index:
+ * 		* lock the block for cleanup and apply any required changes
+ *		* EnsureBlockUnpinned()
+ * The purpose of this is to ensure that no index scans started before we
+ * finish scanning the index are still running by the time we begin to remove
+ * heap tuples.
+ *
+ * Any changes to any one block are registered on just one WAL record. All
+ * blocks that we need to run EnsureBlockUnpinned() before we touch the changed
+ * block are also given on this record as a variable length array. The array
+ * is compressed by way of storing an array of block ranges, rather than an
+ * actual array of blockids.
+ *
+ * Note that the *last* WAL record in any vacuum of an index is allowed to
+ * have numItems == 0. All other WAL records must have numItems > 0.
+ */
+typedef struct xl_btree_vacuum
+{
+	RelFileNode node;
+	BlockNumber block;
+	BlockNumber lastBlockVacuumed;
+	int			numItems;		 /* number of items in the offset array */
+
+	/* TARGET OFFSET NUMBERS FOLLOW */
+} xl_btree_vacuum;
+
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))

 /*
 * This is what we need to know about deletion of a btree page.  The target
@ -537,7 +575,8 @@ extern void _bt_relbuf(Relation rel, Buffer buf);
 extern void _bt_pageinit(Page page, Size size);
 extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems(Relation rel, Buffer buf,
-			 OffsetNumber *itemnos, int nitems);
+			 OffsetNumber *itemnos, int nitems, bool isVacuum,
+			 BlockNumber lastBlockVacuumed);
 extern int _bt_pagedel(Relation rel, Buffer buf,
 			BTStack stack, bool vacuum_full);

--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/access/relscan.h,v 1.67 2009/01/01 17:23:56 momjian Exp $
+ * $PostgreSQL: pgsql/src/include/access/relscan.h,v 1.68 2009/12/19 01:32:42 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -68,6 +68,7 @@ typedef struct IndexScanDescData
 	/* signaling to index AM about killing index tuples */
 	bool		kill_prior_tuple;		/* last-returned tuple is dead */
 	bool		ignore_killed_tuples;	/* do not return killed entries */
+	bool		xactStartedInRecovery;	/* prevents killing/seeing killed tuples */

 	/* index access method's private state */
 	void	   *opaque;			/* access-method-specific info */
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@ -3,7 +3,7 @@
 *
 * Resource managers definition
 *
- * $PostgreSQL: pgsql/src/include/access/rmgr.h,v 1.19 2008/11/19 10:34:52 heikki Exp $
+ * $PostgreSQL: pgsql/src/include/access/rmgr.h,v 1.20 2009/12/19 01:32:42 sriggs Exp $
 */
 #ifndef RMGR_H
 #define RMGR_H
@ -23,6 +23,7 @@ typedef uint8 RmgrId;
 #define RM_DBASE_ID				4
 #define RM_TBLSPC_ID			5
 #define RM_MULTIXACT_ID			6
+#define RM_STANDBY_ID			8
 #define RM_HEAP2_ID				9
 #define RM_HEAP_ID				10
 #define RM_BTREE_ID				11
--- a/src/include/access/subtrans.h
+++ b/src/include/access/subtrans.h
@ -6,7 +6,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/access/subtrans.h,v 1.12 2009/01/01 17:23:56 momjian Exp $
+ * $PostgreSQL: pgsql/src/include/access/subtrans.h,v 1.13 2009/12/19 01:32:42 sriggs Exp $
 */
 #ifndef SUBTRANS_H
 #define SUBTRANS_H
@ -14,7 +14,7 @@
 /* Number of SLRU buffers to use for subtrans */
 #define NUM_SUBTRANS_BUFFERS	32

-extern void SubTransSetParent(TransactionId xid, TransactionId parent);
+extern void SubTransSetParent(TransactionId xid, TransactionId parent, bool overwriteOK);
 extern TransactionId SubTransGetParent(TransactionId xid);
 extern TransactionId SubTransGetTopmostTransaction(TransactionId xid);

--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/access/transam.h,v 1.70 2009/09/01 04:46:49 tgl Exp $
+ * $PostgreSQL: pgsql/src/include/access/transam.h,v 1.71 2009/12/19 01:32:42 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -129,6 +129,9 @@ typedef VariableCacheData *VariableCache;
 * ----------------
 */

+/* in transam/xact.c */
+extern bool TransactionStartedDuringRecovery(void);
+
 /* in transam/varsup.c */
 extern PGDLLIMPORT VariableCache ShmemVariableCache;

--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/access/twophase.h,v 1.12 2009/11/23 09:58:36 heikki Exp $
+ * $PostgreSQL: pgsql/src/include/access/twophase.h,v 1.13 2009/12/19 01:32:42 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -40,8 +40,10 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,

 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
+extern bool StandbyTransactionIdIsPrepared(TransactionId xid);

-extern TransactionId PrescanPreparedTransactions(void);
+extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
+							int *nxids_p);
 extern void RecoverPreparedTransactions(void);

 extern void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
--- a/src/include/access/twophase_rmgr.h
+++ b/src/include/access/twophase_rmgr.h
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/access/twophase_rmgr.h,v 1.9 2009/11/23 09:58:36 heikki Exp $
+ * $PostgreSQL: pgsql/src/include/access/twophase_rmgr.h,v 1.10 2009/12/19 01:32:42 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -23,15 +23,15 @@ typedef uint8 TwoPhaseRmgrId;
 */
 #define TWOPHASE_RM_END_ID			0
 #define TWOPHASE_RM_LOCK_ID			1
-#define TWOPHASE_RM_INVAL_ID		2
-#define TWOPHASE_RM_NOTIFY_ID		3
-#define TWOPHASE_RM_PGSTAT_ID		4
-#define TWOPHASE_RM_MULTIXACT_ID	5
+#define TWOPHASE_RM_NOTIFY_ID		2
+#define TWOPHASE_RM_PGSTAT_ID		3
+#define TWOPHASE_RM_MULTIXACT_ID	4
 #define TWOPHASE_RM_MAX_ID			TWOPHASE_RM_MULTIXACT_ID

 extern const TwoPhaseCallback twophase_recover_callbacks[];
 extern const TwoPhaseCallback twophase_postcommit_callbacks[];
 extern const TwoPhaseCallback twophase_postabort_callbacks[];
+extern const TwoPhaseCallback twophase_standby_recover_callbacks[];


 extern void RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/access/xact.h,v 1.98 2009/06/11 14:49:09 momjian Exp $
+ * $PostgreSQL: pgsql/src/include/access/xact.h,v 1.99 2009/12/19 01:32:42 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -84,19 +84,49 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
 #define XLOG_XACT_ABORT				0x20
 #define XLOG_XACT_COMMIT_PREPARED	0x30
 #define XLOG_XACT_ABORT_PREPARED	0x40
+#define XLOG_XACT_ASSIGNMENT		0x50
+
+typedef struct xl_xact_assignment
+{
+	TransactionId	xtop;		/* assigned XID's top-level XID */
+	int				nsubxacts;	/* number of subtransaction XIDs */
+	TransactionId	xsub[1];	/* assigned subxids */
+} xl_xact_assignment;
+
+#define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)

 typedef struct xl_xact_commit
 {
 	TimestampTz xact_time;		/* time of commit */
+	uint32		xinfo;			/* info flags */
 	int			nrels;			/* number of RelFileNodes */
 	int			nsubxacts;		/* number of subtransaction XIDs */
+	int			nmsgs;			/* number of shared inval msgs */
 	/* Array of RelFileNode(s) to drop at commit */
 	RelFileNode xnodes[1];		/* VARIABLE LENGTH ARRAY */
 	/* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
+	/* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */
 } xl_xact_commit;

 #define MinSizeOfXactCommit offsetof(xl_xact_commit, xnodes)

+/*
+ * These flags are set in the xinfo fields of WAL commit records,
+ * indicating a variety of additional actions that need to occur
+ * when emulating transaction effects during recovery.
+ * They are named XactCompletion... to differentiate them from
+ * EOXact... routines which run at the end of the original
+ * transaction completion.
+ */
+#define XACT_COMPLETION_UPDATE_RELCACHE_FILE	0x01
+#define XACT_COMPLETION_VACUUM_FULL				0x02
+#define XACT_COMPLETION_FORCE_SYNC_COMMIT		0x04
+
+/* Access macros for above flags */
+#define XactCompletionRelcacheInitFileInval(xlrec)	((xlrec)->xinfo & XACT_COMPLETION_UPDATE_RELCACHE_FILE)
+#define XactCompletionVacuumFull(xlrec)				((xlrec)->xinfo & XACT_COMPLETION_VACUUM_FULL)
+#define XactCompletionForceSyncCommit(xlrec)		((xlrec)->xinfo & XACT_COMPLETION_FORCE_SYNC_COMMIT)
+
 typedef struct xl_xact_abort
 {
 	TimestampTz xact_time;		/* time of abort */
@ -106,6 +136,7 @@ typedef struct xl_xact_abort
 	RelFileNode xnodes[1];		/* VARIABLE LENGTH ARRAY */
 	/* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */
 } xl_xact_abort;
+/* Note the intentional lack of an invalidation message array c.f. commit */

 #define MinSizeOfXactAbort offsetof(xl_xact_abort, xnodes)

@ -181,7 +212,7 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);

-extern TransactionId RecordTransactionCommit(void);
+extern TransactionId RecordTransactionCommit(bool isVacuumFull);

 extern int	xactGetCommittedChildren(TransactionId **ptr);

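The xact.h additions use two common WAL record idioms: a fixed header sized with offsetof() followed by a variable-length array, and a flags word tested bit by bit. The sketch below illustrates both under stated assumptions. The struct and flag values are copied from the hunk above, while `xact_assignment_size` and `xinfo_has` are hypothetical helpers, and `TransactionId` is assumed to be a 32-bit unsigned integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* assumption: 32-bit XIDs */

/* Layout as added to xact.h; xsub[1] marks a trailing variable-length array. */
typedef struct xl_xact_assignment
{
	TransactionId xtop;			/* assigned XID's top-level XID */
	int			nsubxacts;		/* number of subtransaction XIDs */
	TransactionId xsub[1];		/* assigned subxids */
} xl_xact_assignment;

#define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub)

/* Hypothetical helper: bytes needed for a record carrying nsubxacts XIDs. */
static size_t
xact_assignment_size(int nsubxacts)
{
	assert(nsubxacts >= 0);
	return MinSizeOfXactAssignment +
		(size_t) nsubxacts * sizeof(TransactionId);
}

#define XACT_COMPLETION_UPDATE_RELCACHE_FILE	0x01
#define XACT_COMPLETION_VACUUM_FULL				0x02
#define XACT_COMPLETION_FORCE_SYNC_COMMIT		0x04

/* Hypothetical helper: nonzero if the given flag bit is set in xinfo. */
static int
xinfo_has(uint32_t xinfo, uint32_t flag)
{
	return (xinfo & flag) != 0;
}
```

Sizing from offsetof() rather than sizeof() avoids counting the placeholder `xsub[1]` element, so a record with no subtransaction XIDs occupies only the fixed header.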
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@ -6,7 +6,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/access/xlog.h,v 1.93 2009/06/26 20:29:04 tgl Exp $
+ * $PostgreSQL: pgsql/src/include/access/xlog.h,v 1.94 2009/12/19 01:32:42 sriggs Exp $
 */
 #ifndef XLOG_H
 #define XLOG_H
@ -133,7 +133,45 @@ typedef struct XLogRecData
 } XLogRecData;

 extern TimeLineID ThisTimeLineID;		/* current TLI */
+
+/*
+ * Prior to 8.4, all activity during recovery was carried out by the startup
+ * process. This local variable continues to be used in many parts of the
+ * code to indicate actions taken by RecoveryManagers. Other processes that
+ * potentially perform work during recovery should check RecoveryInProgress();
+ * see the XLogCtl notes in xlog.c.
+ */
 extern bool InRecovery;
+
+/*
+ * Like InRecovery, standbyState is only valid in the startup process.
+ *
+ * In DISABLED state, we're performing crash recovery or hot standby was
+ * disabled in recovery.conf.
+ *
+ * In INITIALIZED state, we haven't yet received a RUNNING_XACTS or shutdown
+ * checkpoint record to initialize our master transaction tracking system.
+ *
+ * When the transaction tracking is initialized, we enter the SNAPSHOT_PENDING
+ * state. The tracked information might still be incomplete, so we can't allow
+ * connections yet, but redo functions must update the in-memory state when
+ * appropriate.
+ *
+ * In SNAPSHOT_READY mode, we have full knowledge of transactions that are
+ * (or were) running in the master at the current WAL location. Snapshots
+ * can be taken, and read-only queries can be run.
+ */
+typedef enum
+{
+	STANDBY_DISABLED,
+	STANDBY_INITIALIZED,
+	STANDBY_SNAPSHOT_PENDING,
+	STANDBY_SNAPSHOT_READY
+} HotStandbyState;
+extern HotStandbyState standbyState;
+
+#define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)
+
 extern XLogRecPtr XactLastRecEnd;

 /* these variables are GUC parameters related to XLOG */
@ -143,9 +181,12 @@ extern bool XLogArchiveMode;
 extern char *XLogArchiveCommand;
 extern int	XLogArchiveTimeout;
 extern bool log_checkpoints;
+extern bool XLogRequestRecoveryConnections;
+extern int MaxStandbyDelay;

 #define XLogArchivingActive()	(XLogArchiveMode)
 #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0')
+#define XLogStandbyInfoActive()	(XLogRequestRecoveryConnections && XLogArchiveMode)

 #ifdef WAL_DEBUG
 extern bool XLOG_DEBUG;
@ -203,6 +244,7 @@ extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);

 extern bool RecoveryInProgress(void);
 extern bool XLogInsertAllowed(void);
+extern TimestampTz GetLatestXLogTime(void);

 extern void UpdateControlFile(void);
 extern Size XLOGShmemSize(void);
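The HotStandbyState enum relies on its declaration order: InHotStandby is a `>=` comparison, so every state from SNAPSHOT_PENDING onward counts as hot standby. The following sketch restates that as functions taking the state explicitly. The enum is copied from the hunk above; `in_hot_standby` and `queries_allowed` are hypothetical names, and the second function only encodes what the header comment says about SNAPSHOT_READY.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	STANDBY_DISABLED,
	STANDBY_INITIALIZED,
	STANDBY_SNAPSHOT_PENDING,
	STANDBY_SNAPSHOT_READY
} HotStandbyState;

/* Function form of the InHotStandby macro. */
static bool
in_hot_standby(HotStandbyState state)
{
	assert(state <= STANDBY_SNAPSHOT_READY);
	return state >= STANDBY_SNAPSHOT_PENDING;
}

/*
 * Hypothetical helper: per the header comment, snapshots can be taken and
 * read-only queries run only once the state reaches SNAPSHOT_READY.
 */
static bool
queries_allowed(HotStandbyState state)
{
	assert(state <= STANDBY_SNAPSHOT_READY);
	return state == STANDBY_SNAPSHOT_READY;
}
```

In STANDBY_SNAPSHOT_PENDING the first function returns true and the second false: redo functions must already maintain the in-memory transaction state, but connections are not yet possible.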
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@ -11,7 +11,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/access/xlog_internal.h,v 1.25 2009/01/01 17:23:56 momjian Exp $
+ * $PostgreSQL: pgsql/src/include/access/xlog_internal.h,v 1.26 2009/12/19 01:32:42 sriggs Exp $
 */
 #ifndef XLOG_INTERNAL_H
 #define XLOG_INTERNAL_H
@ -71,7 +71,7 @@ typedef struct XLogContRecord
 /*
 * Each page of XLOG file has a header like this:
 */
-#define XLOG_PAGE_MAGIC 0xD063	/* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD166	/* can be used as WAL version indicator */

 typedef struct XLogPageHeaderData
 {
@ -255,5 +255,6 @@ extern Datum pg_current_xlog_location(PG_FUNCTION_ARGS);
 extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS);
 extern Datum pg_xlogfile_name_offset(PG_FUNCTION_ARGS);
 extern Datum pg_xlogfile_name(PG_FUNCTION_ARGS);
+extern Datum pg_is_in_recovery(PG_FUNCTION_ARGS);

 #endif   /* XLOG_INTERNAL_H */
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@ -8,7 +8,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/catalog/pg_control.h,v 1.44 2009/08/31 02:23:23 tgl Exp $
+ * $PostgreSQL: pgsql/src/include/catalog/pg_control.h,v 1.45 2009/12/19 01:32:42 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -40,6 +40,20 @@ typedef struct CheckPoint
 	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
 	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
 	pg_time_t	time;			/* time stamp of checkpoint */
+
+	/* Important parameter settings at time of shutdown checkpoints */
+	int		MaxConnections;
+	int		max_prepared_xacts;
+	int		max_locks_per_xact;
+	bool	XLogStandbyInfoMode;
+
+	/*
+	 * Oldest XID still running. This is only needed to initialize hot standby
+	 * mode from an online checkpoint, so we only bother calculating this for
+	 * online checkpoints and only when archiving is enabled. Otherwise it's
+	 * set to InvalidTransactionId.
+	 */
+	TransactionId   oldestActiveXid;
 } CheckPoint;

 /* XLOG info values for XLOG rmgr */
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/catalog/pg_proc.h,v 1.556 2009/12/06 02:55:54 tgl Exp $
+ * $PostgreSQL: pgsql/src/include/catalog/pg_proc.h,v 1.557 2009/12/19 01:32:42 sriggs Exp $
 *
 * NOTES
 *	  The script catalog/genbki.sh reads this file and generates .bki
@ -3285,6 +3285,9 @@ DESCR("xlog filename and byte offset, given an xlog location");
 DATA(insert OID = 2851 ( pg_xlogfile_name			PGNSP PGUID 12 1 0 0 f f f t f i 1 0 25 "25" _null_ _null_ _null_ _null_ pg_xlogfile_name _null_ _null_ _null_ ));
 DESCR("xlog filename, given an xlog location");

+DATA(insert OID = 3810 (  pg_is_in_recovery 	PGNSP PGUID 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_is_in_recovery _null_ _null_ _null_ ));
+DESCR("true if server is in recovery");
+
 DATA(insert OID = 2621 ( pg_reload_conf			PGNSP PGUID 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_reload_conf _null_ _null_ _null_ ));
 DESCR("reload configuration files");
 DATA(insert OID = 2622 ( pg_rotate_logfile		PGNSP PGUID 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_rotate_logfile _null_ _null_ _null_ ));
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@ -13,7 +13,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/miscadmin.h,v 1.215 2009/12/09 21:57:51 tgl Exp $
+ * $PostgreSQL: pgsql/src/include/miscadmin.h,v 1.216 2009/12/19 01:32:41 sriggs Exp $
 *
 * NOTES
 *	  some of the information in this file should be moved to other files.
@ -236,6 +236,12 @@ extern bool VacuumCostActive;
 /* in tcop/postgres.c */
 extern void check_stack_depth(void);

+/* in tcop/utility.c */
+extern void PreventCommandDuringRecovery(void);
+
+/* in utils/misc/guc.c */
+extern int trace_recovery_messages;
+extern int trace_recovery(int trace_level);

 /*****************************************************************************
 *	  pdir.h --																 *
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/storage/lock.h,v 1.116 2009/04/04 17:40:36 tgl Exp $
+ * $PostgreSQL: pgsql/src/include/storage/lock.h,v 1.117 2009/12/19 01:32:44 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -477,6 +477,11 @@ extern LockAcquireResult LockAcquire(const LOCKTAG *locktag,
 			LOCKMODE lockmode,
 			bool sessionLock,
 			bool dontWait);
+extern LockAcquireResult LockAcquireExtended(const LOCKTAG *locktag,
+			LOCKMODE lockmode,
+			bool sessionLock,
+			bool dontWait,
+			bool report_memory_error);
 extern bool LockRelease(const LOCKTAG *locktag,
 			LOCKMODE lockmode, bool sessionLock);
 extern void LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks);
@ -494,6 +499,17 @@ extern void GrantAwaitedLock(void);
 extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode);
 extern Size LockShmemSize(void);
 extern LockData *GetLockStatusData(void);
+
+extern void ReportLockTableError(bool report);
+
+typedef struct xl_standby_lock
+{
+	TransactionId	xid;	/* xid of holder of AccessExclusiveLock */
+	Oid		dbOid;
+	Oid		relOid;
+} xl_standby_lock;
+
+extern xl_standby_lock *GetRunningTransactionLocks(int *nlocks);
 extern const char *GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode);

 extern void lock_twophase_recover(TransactionId xid, uint16 info,
@ -502,6 +518,8 @@ extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
 						 void *recdata, uint32 len);
 extern void lock_twophase_postabort(TransactionId xid, uint16 info,
 						void *recdata, uint32 len);
+extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
+					  void *recdata, uint32 len);

 extern DeadLockState DeadLockCheck(PGPROC *proc);
 extern PGPROC *GetBlockingAutoVacuumPgproc(void);
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/storage/proc.h,v 1.114 2009/08/31 19:41:00 tgl Exp $
+ * $PostgreSQL: pgsql/src/include/storage/proc.h,v 1.115 2009/12/19 01:32:44 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -95,6 +95,13 @@ struct PGPROC

 	uint8		vacuumFlags;	/* vacuum-related flags, see above */

+	/*
+	 * While in hot standby mode, setting recoveryConflictMode instructs
+	 * the backend to commit suicide. Possible values are the same as those
+	 * passed to ResolveRecoveryConflictWithVirtualXIDs().
+	 */
+	int			recoveryConflictMode;
+
 	/* Info about LWLock the process is currently waiting for, if any. */
 	bool		lwWaiting;		/* true if waiting for an LW lock */
 	bool		lwExclusive;	/* true if waiting for exclusive access */
@ -135,6 +142,9 @@ typedef struct PROC_HDR
 	PGPROC	   *autovacFreeProcs;
 	/* Current shared estimate of appropriate spins_per_delay value */
 	int			spins_per_delay;
+	/* The proc of the Startup process, since not in ProcArray */
+	PGPROC	   *startupProc;
+	int			startupProcPid;
 } PROC_HDR;

 /*
@ -165,6 +175,9 @@ extern void InitProcGlobal(void);
 extern void InitProcess(void);
 extern void InitProcessPhase2(void);
 extern void InitAuxiliaryProcess(void);
+
+extern void PublishStartupProcessInformation(void);
+
 extern bool HaveNFreeProcs(int n);
 extern void ProcReleaseLocks(bool isCommit);

--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/storage/procarray.h,v 1.26 2009/06/11 14:49:12 momjian Exp $
+ * $PostgreSQL: pgsql/src/include/storage/procarray.h,v 1.27 2009/12/19 01:32:44 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@ -15,6 +15,7 @@
 #define PROCARRAY_H

 #include "storage/lock.h"
+#include "storage/standby.h"
 #include "utils/snapshot.h"


@ -26,6 +27,19 @@ extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
 extern void ProcArrayClearTransaction(PGPROC *proc);

+extern void ProcArrayInitRecoveryInfo(TransactionId oldestActiveXid);
+extern void ProcArrayApplyRecoveryInfo(RunningTransactions running);
+extern void ProcArrayApplyXidAssignment(TransactionId topxid,
+							int nsubxids, TransactionId *subxids);
+
+extern void RecordKnownAssignedTransactionIds(TransactionId xid);
+extern void ExpireTreeKnownAssignedTransactionIds(TransactionId xid,
+									  int nsubxids, TransactionId *subxids);
+extern void ExpireAllKnownAssignedTransactionIds(void);
+extern void ExpireOldKnownAssignedTransactionIds(TransactionId xid);
+
+extern RunningTransactions GetRunningTransactionData(void);
+
 extern Snapshot GetSnapshotData(Snapshot snapshot);

 extern bool TransactionIdIsInProgress(TransactionId xid);
@@ -42,6 +56,11 @@ extern bool IsBackendPid(int pid);
 extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
 					  bool excludeXmin0, bool allDbs, int excludeVacuum,
 					  int *nvxids);
+extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin,
+					Oid dbOid, bool skipExistingConflicts);
+extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid,
+						 int cancel_mode);
+
 extern int	CountActiveBackends(void);
 extern int	CountDBBackends(Oid databaseid);
 extern int	CountUserBackends(Oid roleid);
--- a/src/include/storage/sinval.h
+++ b/src/include/storage/sinval.h
@@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/storage/sinval.h,v 1.53 2009/07/31 20:26:23 tgl Exp $
+ * $PostgreSQL: pgsql/src/include/storage/sinval.h,v 1.54 2009/12/19 01:32:44 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@@ -100,4 +100,7 @@ extern void HandleCatchupInterrupt(void);
 extern void EnableCatchupInterrupt(void);
 extern bool DisableCatchupInterrupt(void);

+extern int xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs,
+										bool *RelcacheInitFileInval);
+
 #endif   /* SINVAL_H */
--- a/src/include/storage/sinvaladt.h
+++ b/src/include/storage/sinvaladt.h
@@ -15,7 +15,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/storage/sinvaladt.h,v 1.51 2009/06/11 14:49:12 momjian Exp $
+ * $PostgreSQL: pgsql/src/include/storage/sinvaladt.h,v 1.52 2009/12/19 01:32:44 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@@ -29,7 +29,7 @@
 */
 extern Size SInvalShmemSize(void);
 extern void CreateSharedInvalidationState(void);
-extern void SharedInvalBackendInit(void);
+extern void SharedInvalBackendInit(bool sendOnly);
 extern bool BackendIdIsActive(int backendID);

 extern void SIInsertDataEntries(const SharedInvalidationMessage *data, int n);
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -0,0 +1,106 @@
+/*-------------------------------------------------------------------------
+ *
+ * standby.h
+ *	  Definitions for hot standby mode.
+ *
+ *
+ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * $PostgreSQL: pgsql/src/include/storage/standby.h,v 1.1 2009/12/19 01:32:44 sriggs Exp $
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STANDBY_H
+#define STANDBY_H
+
+#include "access/xlog.h"
+#include "storage/lock.h"
+
+extern int	vacuum_defer_cleanup_age;
+
+/* cancel modes for ResolveRecoveryConflictWithVirtualXIDs */
+#define CONFLICT_MODE_NOT_SET		0
+#define CONFLICT_MODE_ERROR			1	/* Conflict can be resolved by canceling query */
+#define CONFLICT_MODE_FATAL			2	/* Conflict can only be resolved by disconnecting session */
+
+extern void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
+									   char *reason, int cancel_mode);
+
+extern void InitRecoveryTransactionEnvironment(void);
+extern void ShutdownRecoveryTransactionEnvironment(void);
+
+/*
+ * Standby Rmgr (RM_STANDBY_ID)
+ *
+ * Standby recovery manager exists to perform actions that are required
+ * to make hot standby work. That includes logging AccessExclusiveLocks taken
+ * by transactions and running-xacts snapshots.
+ */
+extern void StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid);
+extern void StandbyReleaseLockTree(TransactionId xid,
+								   int nsubxids, TransactionId *subxids);
+extern void StandbyReleaseAllLocks(void);
+extern void StandbyReleaseOldLocks(TransactionId removeXid);
+
+/*
+ * XLOG message types
+ */
+#define XLOG_STANDBY_LOCK			0x00
+#define XLOG_RUNNING_XACTS			0x10
+
+typedef struct xl_standby_locks
+{
+	int				nlocks;		/* number of entries in locks array */
+	xl_standby_lock	locks[1];	/* VARIABLE LENGTH ARRAY */
+} xl_standby_locks;
+
+/*
+ * When we write running xact data to WAL, we use this structure.
+ */
+typedef struct xl_running_xacts
+{
+	int				xcnt;				/* # of xact ids in xids[] */
+	bool			subxid_overflow;	/* snapshot overflowed, subxids missing */
+	TransactionId	nextXid;			/* copy of ShmemVariableCache->nextXid */
+	TransactionId	oldestRunningXid;	/* *not* oldestXmin */
+
+	TransactionId	xids[1];		/* VARIABLE LENGTH ARRAY */
+} xl_running_xacts;
+
+#define MinSizeOfXactRunningXacts offsetof(xl_running_xacts, xids)
+
+
+/* Recovery handlers for the Standby Rmgr (RM_STANDBY_ID) */
+extern void standby_redo(XLogRecPtr lsn, XLogRecord *record);
+extern void standby_desc(StringInfo buf, uint8 xl_info, char *rec);
+
+/*
+ * Declarations for GetRunningTransactionData(). Similar to Snapshots, but
+ * not quite. This has nothing at all to do with visibility on this server,
+ * so this is completely separate from snapmgr.c and snapmgr.h
+ * This data is important for creating the initial snapshot state on a
+ * standby server. We need lots more information than a normal snapshot,
+ * hence we use a specific data structure for our needs. This data
+ * is written to WAL as a separate record immediately after each
+ * checkpoint. That means that wherever we start a standby from we will
+ * almost immediately see the data we need to begin executing queries.
+ */
+
+typedef struct RunningTransactionsData
+{
+	int				xcnt;				/* # of xact ids in xids[] */
+	bool			subxid_overflow;	/* snapshot overflowed, subxids missing */
+	TransactionId 	nextXid;			/* copy of ShmemVariableCache->nextXid */
+	TransactionId	oldestRunningXid;	/* *not* oldestXmin */
+
+	TransactionId  *xids;				/* array of (sub)xids still running */
+} RunningTransactionsData;
+
+typedef RunningTransactionsData *RunningTransactions;
+
+extern void LogAccessExclusiveLock(Oid dbOid, Oid relOid);
+
+extern void LogStandbySnapshot(TransactionId *oldestActiveXid, TransactionId *nextXid);
+
+#endif   /* STANDBY_H */
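Reviewer note on the xl_running_xacts layout in standby.h above. The record ends in a one-element array marked VARIABLE LENGTH ARRAY, and MinSizeOfXactRunningXacts is the offset of that array, so the size of a record carrying xcnt xids is the header size plus xcnt array slots. The following is a minimal standalone sketch of that arithmetic, not code from this commit: the TransactionId typedef is a stand-in for the real one, and running_xacts_size(), running_xacts_build() and running_xacts_get() are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for PostgreSQL's TransactionId (assumed 32-bit unsigned). */
typedef uint32_t TransactionId;

/* Same field layout as xl_running_xacts in the hunk above. */
typedef struct xl_running_xacts
{
    int           xcnt;             /* # of xact ids in xids[] */
    bool          subxid_overflow;  /* snapshot overflowed, subxids missing */
    TransactionId nextXid;
    TransactionId oldestRunningXid;
    TransactionId xids[1];          /* VARIABLE LENGTH ARRAY */
} xl_running_xacts;

#define MinSizeOfXactRunningXacts offsetof(xl_running_xacts, xids)

/* Hypothetical helper: bytes needed for a record carrying xcnt xids. */
static size_t
running_xacts_size(int xcnt)
{
    assert(xcnt >= 0);
    return MinSizeOfXactRunningXacts + (size_t) xcnt * sizeof(TransactionId);
}

/*
 * Hypothetical helper: flatten header fields plus a separate xid array
 * (the shape RunningTransactionsData uses) into one contiguous record.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static xl_running_xacts *
running_xacts_build(const TransactionId *xids, int xcnt,
                    TransactionId nextXid, TransactionId oldestRunningXid)
{
    xl_running_xacts *rec = malloc(running_xacts_size(xcnt));

    if (rec == NULL)
        return NULL;
    rec->xcnt = xcnt;
    rec->subxid_overflow = false;
    rec->nextXid = nextXid;
    rec->oldestRunningXid = oldestRunningXid;
    /* Copy the xids directly behind the fixed-size header. */
    if (xcnt > 0)
        memcpy((char *) rec + MinSizeOfXactRunningXacts, xids,
               (size_t) xcnt * sizeof(TransactionId));
    return rec;
}

/* Hypothetical helper: read the i'th xid without indexing past xids[1]. */
static TransactionId
running_xacts_get(const xl_running_xacts *rec, int i)
{
    TransactionId xid;

    assert(i >= 0 && i < rec->xcnt);
    memcpy(&xid, (const char *) rec + MinSizeOfXactRunningXacts
           + (size_t) i * sizeof(TransactionId), sizeof(xid));
    return xid;
}
```

Using offsetof() rather than sizeof() for the header matters because sizeof(xl_running_xacts) already includes one array slot and any trailing padding, so it would overstate the size of a record with zero xids.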
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -7,7 +7,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/utils/builtins.h,v 1.341 2009/10/21 20:38:58 tgl Exp $
+ * $PostgreSQL: pgsql/src/include/utils/builtins.h,v 1.342 2009/12/19 01:32:44 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@@ -730,6 +730,7 @@ extern Datum xidrecv(PG_FUNCTION_ARGS);
 extern Datum xidsend(PG_FUNCTION_ARGS);
 extern Datum xideq(PG_FUNCTION_ARGS);
 extern Datum xid_age(PG_FUNCTION_ARGS);
+extern int xidComparator(const void *arg1, const void *arg2);
 extern Datum cidin(PG_FUNCTION_ARGS);
 extern Datum cidout(PG_FUNCTION_ARGS);
 extern Datum cidrecv(PG_FUNCTION_ARGS);
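Reviewer note on the xidComparator() declaration exported above. Its signature is the one qsort() and bsearch() expect, which suggests it is meant for sorting and searching arrays of xids. The sketch below shows that usage pattern in standalone form. It assumes a plain unsigned numeric ordering, which is not wraparound-aware, and the body of the real function is not visible in this hunk. The names xid_cmp_sketch(), sort_xids() and sorted_xids_contain() are hypothetical, and TransactionId is a stand-in typedef.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's TransactionId (assumed 32-bit unsigned). */
typedef uint32_t TransactionId;

/*
 * Hypothetical comparator with the same signature as xidComparator().
 * Assumption: xids are ordered as plain unsigned integers.
 */
static int
xid_cmp_sketch(const void *arg1, const void *arg2)
{
    TransactionId xid1 = *(const TransactionId *) arg1;
    TransactionId xid2 = *(const TransactionId *) arg2;

    if (xid1 > xid2)
        return 1;
    if (xid1 < xid2)
        return -1;
    return 0;
}

/* Sort an xid array in place. */
static void
sort_xids(TransactionId *xids, size_t n)
{
    assert(xids != NULL || n == 0);
    if (n > 1)
        qsort(xids, n, sizeof(TransactionId), xid_cmp_sketch);
}

/* Binary-search an array previously sorted with sort_xids(). */
static bool
sorted_xids_contain(const TransactionId *xids, size_t n, TransactionId xid)
{
    if (n == 0)
        return false;
    return bsearch(&xid, xids, n, sizeof(TransactionId),
                   xid_cmp_sketch) != NULL;
}
```

The comparator uses explicit comparisons instead of returning xid1 - xid2, because a subtraction of 32-bit unsigned values does not fit reliably into an int result.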
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -6,7 +6,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/include/utils/snapshot.h,v 1.5 2009/06/11 14:49:13 momjian Exp $
+ * $PostgreSQL: pgsql/src/include/utils/snapshot.h,v 1.6 2009/12/19 01:32:44 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@@ -49,8 +49,10 @@ typedef struct SnapshotData
 	uint32		xcnt;			/* # of xact ids in xip[] */
 	TransactionId *xip;			/* array of xact IDs in progress */
 	/* note: all ids in xip[] satisfy xmin <= xip[i] < xmax */
-	int32		subxcnt;		/* # of xact ids in subxip[], -1 if overflow */
+	int32		subxcnt;		/* # of xact ids in subxip[] */
 	TransactionId *subxip;		/* array of subxact IDs in progress */
+	bool		suboverflowed;	/* has the subxip array overflowed? */
+	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */

 	/*
 	 * note: all ids in subxip[] are >= xmin, but we don't bother filtering
--- a/src/test/regress/GNUmakefile
+++ b/src/test/regress/GNUmakefile
@@ -6,7 +6,7 @@
 # Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 # Portions Copyright (c) 1994, Regents of the University of California
 #
-# $PostgreSQL: pgsql/src/test/regress/GNUmakefile,v 1.80 2009/12/18 21:28:42 momjian Exp $
+# $PostgreSQL: pgsql/src/test/regress/GNUmakefile,v 1.81 2009/12/19 01:32:45 sriggs Exp $
 #
 #-------------------------------------------------------------------------

@@ -149,6 +149,8 @@ installcheck: all
 installcheck-parallel: all
 	$(pg_regress_call) --psqldir=$(PSQLDIR) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT)

+standbycheck: all
+	$(pg_regress_call) --psqldir=$(PSQLDIR) --schedule=$(srcdir)/standby_schedule --use-existing

 # old interfaces follow...

--- a/src/test/regress/expected/hs_standby_allowed.out
+++ b/src/test/regress/expected/hs_standby_allowed.out
@@ -0,0 +1,215 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_allowed.sql
+--
+-- SELECT
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+select count(*) as should_be_2 from hs2;
+ should_be_2 
+-------------
+           2
+(1 row)
+
+select count(*) as should_be_3 from hs3;
+ should_be_3 
+-------------
+           3
+(1 row)
+
+COPY hs1 TO '/tmp/copy_test';
+\! cat /tmp/copy_test
+1
+-- Access sequence directly
+select min_value as sequence_min_value from hsseq;
+ sequence_min_value 
+--------------------
+                  1
+(1 row)
+
+-- Transactions
+begin;
+select count(*)  as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+end;
+begin transaction read only;
+select count(*)  as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+end;
+begin transaction isolation level serializable;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+commit;
+begin;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+commit;
+begin;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+abort;
+start transaction;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+commit;
+begin;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+rollback;
+begin;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+savepoint s;
+select count(*) as should_be_2 from hs2;
+ should_be_2 
+-------------
+           2
+(1 row)
+
+commit;
+begin;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+savepoint s;
+select count(*) as should_be_2 from hs2;
+ should_be_2 
+-------------
+           2
+(1 row)
+
+release savepoint s;
+select count(*) as should_be_2 from hs2;
+ should_be_2 
+-------------
+           2
+(1 row)
+
+savepoint s;
+select count(*) as should_be_3 from hs3;
+ should_be_3 
+-------------
+           3
+(1 row)
+
+rollback to savepoint s;
+select count(*) as should_be_2 from hs2;
+ should_be_2 
+-------------
+           2
+(1 row)
+
+commit;
+-- SET parameters
+-- has no effect on read only transactions, but we can still set it
+set synchronous_commit = on;
+show synchronous_commit;
+ synchronous_commit 
+--------------------
+ on
+(1 row)
+
+reset synchronous_commit;
+discard temp;
+discard all;
+-- CURSOR commands
+BEGIN;
+DECLARE hsc CURSOR FOR select * from hs3;
+FETCH next from hsc;
+ col1 
+------
+  113
+(1 row)
+
+fetch first from hsc;
+ col1 
+------
+  113
+(1 row)
+
+fetch last from hsc;
+ col1 
+------
+  115
+(1 row)
+
+fetch 1 from hsc;
+ col1 
+------
+(0 rows)
+
+CLOSE hsc;
+COMMIT;
+-- Prepared plans
+PREPARE hsp AS select count(*) from hs1;
+PREPARE hsp_noexec (integer) AS insert into hs1 values ($1);
+EXECUTE hsp;
+ count 
+-------
+     1
+(1 row)
+
+DEALLOCATE hsp;
+-- LOCK
+BEGIN;
+LOCK hs1 IN ACCESS SHARE MODE;
+LOCK hs1 IN ROW SHARE MODE;
+LOCK hs1 IN ROW EXCLUSIVE MODE;
+COMMIT;
+-- LOAD
+-- should work, easier if there is no test for that...
+-- ALLOWED COMMANDS
+CHECKPOINT;
+discard all;
--- a/src/test/regress/expected/hs_standby_check.out
+++ b/src/test/regress/expected/hs_standby_check.out
@@ -0,0 +1,20 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_check.sql
+--
+--
+-- If the query below returns false then all other tests will fail after it.
+--
+select case pg_is_in_recovery() when false then
+	'These tests are intended only for execution on a standby server that is reading ' ||
+	'WAL from a server upon which the regression database is already created and into ' ||
+	'which src/test/regress/sql/hs_primary_setup.sql has been run'
+else
+	'Tests are running on a standby server during recovery'
+end;
+                         case                          
+-------------------------------------------------------
+ Tests are running on a standby server during recovery
+(1 row)
+
--- a/src/test/regress/expected/hs_standby_disallowed.out
+++ b/src/test/regress/expected/hs_standby_disallowed.out
@@ -0,0 +1,137 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_disallowed.sql
+--
+SET transaction_read_only = off;
+ERROR:  cannot set transaction read-write mode during recovery
+begin transaction read write;
+ERROR:  cannot set transaction read-write mode during recovery
+commit;
+WARNING:  there is no transaction in progress
+-- SELECT
+select * from hs1 FOR SHARE;
+ERROR:  transaction is read-only
+select * from hs1 FOR UPDATE;
+ERROR:  transaction is read-only
+-- DML
+BEGIN;
+insert into hs1 values (37);
+ERROR:  transaction is read-only
+ROLLBACK;
+BEGIN;
+delete from hs1 where col1 = 1;
+ERROR:  transaction is read-only
+ROLLBACK;
+BEGIN;
+update hs1 set col1 = NULL where col1 > 0;
+ERROR:  transaction is read-only
+ROLLBACK;
+BEGIN;
+truncate hs3;
+ERROR:  transaction is read-only
+ROLLBACK;
+-- DDL
+create temporary table hstemp1 (col1 integer);
+ERROR:  transaction is read-only
+BEGIN;
+drop table hs2;
+ERROR:  transaction is read-only
+ROLLBACK;
+BEGIN;
+create table hs4 (col1 integer);
+ERROR:  transaction is read-only
+ROLLBACK;
+-- Sequences
+SELECT nextval('hsseq');
+ERROR:  cannot be executed during recovery
+-- Two-phase commit transaction stuff
+BEGIN;
+SELECT count(*) FROM hs1;
+ count 
+-------
+     1
+(1 row)
+
+PREPARE TRANSACTION 'foobar';
+ERROR:  cannot be executed during recovery
+ROLLBACK;
+BEGIN;
+SELECT count(*) FROM hs1;
+ count 
+-------
+     1
+(1 row)
+
+COMMIT PREPARED 'foobar';
+ERROR:  cannot be executed during recovery
+ROLLBACK;
+BEGIN;
+SELECT count(*) FROM hs1;
+ count 
+-------
+     1
+(1 row)
+
+PREPARE TRANSACTION 'foobar';
+ERROR:  cannot be executed during recovery
+ROLLBACK PREPARED 'foobar';
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
+ROLLBACK;
+BEGIN;
+SELECT count(*) FROM hs1;
+ count 
+-------
+     1
+(1 row)
+
+ROLLBACK PREPARED 'foobar';
+ERROR:  cannot be executed during recovery
+ROLLBACK;
+-- Locks
+BEGIN;
+LOCK hs1;
+ERROR:  cannot be executed during recovery
+COMMIT;
+BEGIN;
+LOCK hs1 IN SHARE UPDATE EXCLUSIVE MODE;
+ERROR:  cannot be executed during recovery
+COMMIT;
+BEGIN;
+LOCK hs1 IN SHARE MODE;
+ERROR:  cannot be executed during recovery
+COMMIT;
+BEGIN;
+LOCK hs1 IN SHARE ROW EXCLUSIVE MODE;
+ERROR:  cannot be executed during recovery
+COMMIT;
+BEGIN;
+LOCK hs1 IN EXCLUSIVE MODE;
+ERROR:  cannot be executed during recovery
+COMMIT;
+BEGIN;
+LOCK hs1 IN ACCESS EXCLUSIVE MODE;
+ERROR:  cannot be executed during recovery
+COMMIT;
+-- Listen
+listen a;
+ERROR:  cannot be executed during recovery
+notify a;
+ERROR:  cannot be executed during recovery
+unlisten a;
+ERROR:  cannot be executed during recovery
+unlisten *;
+ERROR:  cannot be executed during recovery
+-- disallowed commands
+ANALYZE hs1;
+ERROR:  cannot be executed during recovery
+VACUUM hs2;
+ERROR:  cannot be executed during recovery
+CLUSTER hs2 using hs1_pkey;
+ERROR:  cannot be executed during recovery
+REINDEX TABLE hs2;
+ERROR:  cannot be executed during recovery
+REVOKE SELECT ON hs1 FROM PUBLIC;
+ERROR:  transaction is read-only
+GRANT SELECT ON hs1 TO PUBLIC;
+ERROR:  transaction is read-only
--- a/src/test/regress/expected/hs_standby_functions.out
+++ b/src/test/regress/expected/hs_standby_functions.out
@@ -0,0 +1,40 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_functions.sql
+--
+-- should fail
+select txid_current();
+ERROR:  cannot be executed during recovery
+select length(txid_current_snapshot()::text) >= 4;
+ ?column? 
+----------
+ t
+(1 row)
+
+select pg_start_backup('should fail');
+ERROR:  recovery is in progress
+HINT:  WAL control functions cannot be executed during recovery.
+select pg_switch_xlog();
+ERROR:  recovery is in progress
+HINT:  WAL control functions cannot be executed during recovery.
+select pg_stop_backup();
+ERROR:  recovery is in progress
+HINT:  WAL control functions cannot be executed during recovery.
+-- should return no rows
+select * from pg_prepared_xacts;
+ transaction | gid | prepared | owner | database 
+-------------+-----+----------+-------+----------
+(0 rows)
+
+-- just the startup process
+select locktype, virtualxid, virtualtransaction, mode, granted
+from pg_locks where virtualxid = '1/1';
+  locktype  | virtualxid | virtualtransaction |     mode      | granted 
+------------+------------+--------------------+---------------+---------
+ virtualxid | 1/1        | 1/0                | ExclusiveLock | t
+(1 row)
+
+-- suicide is painless
+select pg_cancel_backend(pg_backend_pid());
+ERROR:  canceling statement due to user request
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -11,7 +11,7 @@
 * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
- * $PostgreSQL: pgsql/src/test/regress/pg_regress.c,v 1.67 2009/11/23 16:02:24 tgl Exp $
+ * $PostgreSQL: pgsql/src/test/regress/pg_regress.c,v 1.68 2009/12/19 01:32:45 sriggs Exp $
 *
 *-------------------------------------------------------------------------
 */
@@ -93,6 +93,7 @@ static char *temp_install = NULL;
 static char *temp_config = NULL;
 static char *top_builddir = NULL;
 static bool nolocale = false;
+static bool use_existing = false;
 static char *hostname = NULL;
 static int	port = -1;
 static bool port_specified_by_user = false;
@@ -1545,7 +1546,7 @@ run_schedule(const char *schedule, test_function tfunc)

 		if (num_tests == 1)
 		{
-			status(_("test %-20s ... "), tests[0]);
+			status(_("test %-24s ... "), tests[0]);
 			pids[0] = (tfunc) (tests[0], &resultfiles[0], &expectfiles[0], &tags[0]);
 			wait_for_tests(pids, statuses, NULL, 1);
 			/* status line is finished below */
@@ -1590,7 +1591,7 @@ run_schedule(const char *schedule, test_function tfunc)
 			bool		differ = false;

 			if (num_tests > 1)
-				status(_("     %-20s ... "), tests[i]);
+				status(_("     %-24s ... "), tests[i]);

 			/*
 			 * Advance over all three lists simultaneously.
@@ -1918,6 +1919,7 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		{"dlpath", required_argument, NULL, 17},
 		{"create-role", required_argument, NULL, 18},
 		{"temp-config", required_argument, NULL, 19},
+		{"use-existing", no_argument, NULL, 20},
 		{NULL, 0, NULL, 0}
 	};

@@ -2008,6 +2010,9 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 			case 19:
 				temp_config = strdup(optarg);
 				break;
+			case 20:
+				use_existing = true;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				fprintf(stderr, _("\nTry \"%s -h\" for more information.\n"),
@@ -2254,19 +2259,25 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Using an existing installation, so may need to get rid of
 		 * pre-existing database(s) and role(s)
 		 */
-		for (sl = dblist; sl; sl = sl->next)
-			drop_database_if_exists(sl->str);
-		for (sl = extraroles; sl; sl = sl->next)
-			drop_role_if_exists(sl->str);
+		if (!use_existing)
+		{
+			for (sl = dblist; sl; sl = sl->next)
+				drop_database_if_exists(sl->str);
+			for (sl = extraroles; sl; sl = sl->next)
+				drop_role_if_exists(sl->str);
+		}
 	}

 	/*
 	 * Create the test database(s) and role(s)
 	 */
-	for (sl = dblist; sl; sl = sl->next)
-		create_database(sl->str);
-	for (sl = extraroles; sl; sl = sl->next)
-		create_role(sl->str, dblist);
+	if (!use_existing)
+	{
+		for (sl = dblist; sl; sl = sl->next)
+			create_database(sl->str);
+		for (sl = extraroles; sl; sl = sl->next)
+			create_role(sl->str, dblist);
+	}

 	/*
 	 * Ready to run the tests
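Reviewer note on the --use-existing flag added to pg_regress.c above. The option table entry {"use-existing", no_argument, NULL, 20} makes getopt_long() return 20 when the flag is seen, and the matching case sets use_existing so that the drop and create steps are skipped. The standalone sketch below reproduces only that parsing path. parse_use_existing() is a hypothetical wrapper, not a function in pg_regress.c, and it assumes a platform that provides getopt_long() in <getopt.h>.

```c
#include <assert.h>
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical wrapper: returns true if --use-existing appears in argv.
 * Other options are ignored here; pg_regress itself handles many more.
 */
static bool
parse_use_existing(int argc, char *argv[])
{
    static struct option long_options[] = {
        {"use-existing", no_argument, NULL, 20},
        {NULL, 0, NULL, 0}
    };
    bool        use_existing = false;
    int         c;

    assert(argc >= 1 && argv != NULL);

    optind = 1;     /* restart scanning so the function can be called again */
    opterr = 0;     /* stay quiet about options this sketch does not know */

    while ((c = getopt_long(argc, argv, "", long_options, NULL)) != -1)
    {
        switch (c)
        {
            case 20:
                use_existing = true;
                break;
            default:
                /* unknown option: ignored in this sketch */
                break;
        }
    }
    return use_existing;
}
```

Because the flag takes no argument, the standbycheck target can append it to the existing pg_regress invocation without changing any other option, which is what the GNUmakefile hunk earlier in this diff does.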
--- a/src/test/regress/sql/hs_primary_extremes.sql
+++ b/src/test/regress/sql/hs_primary_extremes.sql
@@ -0,0 +1,74 @@
+--
+-- Hot Standby tests
+--
+-- hs_primary_extremes.sql
+--
+
+drop table if exists hs_extreme;
+create table hs_extreme (col1 integer);
+
+CREATE OR REPLACE FUNCTION hs_subxids (n integer)
+RETURNS void 
+LANGUAGE plpgsql 
+AS $$
+    BEGIN
+      IF n <= 0 THEN RETURN; END IF;
+      INSERT INTO hs_extreme VALUES (n);
+      PERFORM hs_subxids(n - 1);
+      RETURN;
+    EXCEPTION WHEN raise_exception THEN NULL; END;
+$$;
+
+BEGIN;
+SELECT hs_subxids(257);
+ROLLBACK;
+BEGIN;
+SELECT hs_subxids(257);
+COMMIT;
+
+set client_min_messages = 'warning';
+
+CREATE OR REPLACE FUNCTION hs_locks_create (n integer)
+RETURNS void 
+LANGUAGE plpgsql 
+AS $$
+    BEGIN
+      IF n <= 0 THEN
+		CHECKPOINT;
+		RETURN; 
+	  END IF;
+      EXECUTE 'CREATE TABLE hs_locks_' || n::text || ' ()';
+      PERFORM hs_locks_create(n - 1);
+      RETURN;
+    EXCEPTION WHEN raise_exception THEN NULL; END;
+$$;
+
+CREATE OR REPLACE FUNCTION hs_locks_drop (n integer)
+RETURNS void 
+LANGUAGE plpgsql 
+AS $$
+    BEGIN
+      IF n <= 0 THEN
+		CHECKPOINT;
+		RETURN; 
+	  END IF;
+	  EXECUTE 'DROP TABLE IF EXISTS hs_locks_' || n::text;
+      PERFORM hs_locks_drop(n - 1);
+      RETURN;
+    EXCEPTION WHEN raise_exception THEN NULL; END;
+$$;
+
+BEGIN;
+SELECT hs_locks_drop(257);
+SELECT hs_locks_create(257);
+SELECT count(*) > 257 FROM pg_locks;
+ROLLBACK;
+BEGIN;
+SELECT hs_locks_drop(257);
+SELECT hs_locks_create(257);
+SELECT count(*) > 257 FROM pg_locks;
+COMMIT;
+SELECT hs_locks_drop(257);
+
+SELECT pg_switch_xlog();
+
--- a/src/test/regress/sql/hs_primary_setup.sql
+++ b/src/test/regress/sql/hs_primary_setup.sql
@@ -0,0 +1,25 @@
+--
+-- Hot Standby tests
+--
+-- hs_primary_setup.sql
+--
+
+drop table if exists hs1;
+create table hs1 (col1 integer primary key);
+insert into hs1 values (1);
+
+drop table if exists hs2;
+create table hs2 (col1 integer primary key);
+insert into hs2 values (12);
+insert into hs2 values (13);
+
+drop table if exists hs3;
+create table hs3 (col1 integer primary key);
+insert into hs3 values (113);
+insert into hs3 values (114);
+insert into hs3 values (115);
+
+DROP sequence if exists hsseq;
+create sequence hsseq;
+
+SELECT pg_switch_xlog();
--- a/src/test/regress/sql/hs_standby_allowed.sql
+++ b/src/test/regress/sql/hs_standby_allowed.sql
@@ -0,0 +1,121 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_allowed.sql
+--
+
+-- SELECT
+
+select count(*) as should_be_1 from hs1;
+
+select count(*) as should_be_2 from hs2;
+
+select count(*) as should_be_3 from hs3;
+
+COPY hs1 TO '/tmp/copy_test';
+\! cat /tmp/copy_test
+
+-- Access sequence directly
+select min_value as sequence_min_value from hsseq;
+
+-- Transactions
+
+begin;
+select count(*)  as should_be_1 from hs1;
+end;
+
+begin transaction read only;
+select count(*)  as should_be_1 from hs1;
+end;
+
+begin transaction isolation level serializable;
+select count(*) as should_be_1 from hs1;
+select count(*) as should_be_1 from hs1;
+select count(*) as should_be_1 from hs1;
+commit;
+
+begin;
+select count(*) as should_be_1 from hs1;
+commit;
+
+begin;
+select count(*) as should_be_1 from hs1;
+abort;
+
+start transaction;
+select count(*) as should_be_1 from hs1;
+commit;
+
+begin;
+select count(*) as should_be_1 from hs1;
+rollback;
+
+begin;
+select count(*) as should_be_1 from hs1;
+savepoint s;
+select count(*) as should_be_2 from hs2;
+commit;
+
+begin;
+select count(*) as should_be_1 from hs1;
+savepoint s;
+select count(*) as should_be_2 from hs2;
+release savepoint s;
+select count(*) as should_be_2 from hs2;
+savepoint s;
+select count(*) as should_be_3 from hs3;
+rollback to savepoint s;
+select count(*) as should_be_2 from hs2;
+commit;
+
+-- SET parameters
+
+-- has no effect on read only transactions, but we can still set it
+set synchronous_commit = on;
+show synchronous_commit;
+reset synchronous_commit;
+
+discard temp;
+discard all;
+
+-- CURSOR commands
+
+BEGIN;
+
+DECLARE hsc CURSOR FOR select * from hs3;
+
+FETCH next from hsc;
+fetch first from hsc;
+fetch last from hsc;
+fetch 1 from hsc;
+
+CLOSE hsc;
+
+COMMIT;
+
+-- Prepared plans
+
+PREPARE hsp AS select count(*) from hs1;
+PREPARE hsp_noexec (integer) AS insert into hs1 values ($1);
+
+EXECUTE hsp;
+
+DEALLOCATE hsp;
+
+-- LOCK
+
+BEGIN;
+LOCK hs1 IN ACCESS SHARE MODE;
+LOCK hs1 IN ROW SHARE MODE;
+LOCK hs1 IN ROW EXCLUSIVE MODE;
+COMMIT;
+
+-- LOAD
+-- should work, easier if there is no test for that...
+
+
+-- ALLOWED COMMANDS
+
+CHECKPOINT;
+
+discard all;
--- a/src/test/regress/sql/hs_standby_check.sql
+++ b/src/test/regress/sql/hs_standby_check.sql
@@ -0,0 +1,16 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_check.sql
+--
+
+--
+-- If the query below returns false then all other tests will fail after it.
+--
+select case pg_is_in_recovery() when false then
+	'These tests are intended only for execution on a standby server that is reading ' ||
+	'WAL from a server upon which the regression database is already created and into ' ||
+	'which src/test/regress/sql/hs_primary_setup.sql has been run'
+else
+	'Tests are running on a standby server during recovery'
+end;
--- a/src/test/regress/sql/hs_standby_disallowed.sql
+++ b/src/test/regress/sql/hs_standby_disallowed.sql
@@ -0,0 +1,105 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_disallowed.sql
+--
+
+SET transaction_read_only = off;
+
+begin transaction read write;
+commit;
+
+-- SELECT
+
+select * from hs1 FOR SHARE;
+select * from hs1 FOR UPDATE;
+
+-- DML
+BEGIN;
+insert into hs1 values (37);
+ROLLBACK;
+BEGIN;
+delete from hs1 where col1 = 1;
+ROLLBACK;
+BEGIN;
+update hs1 set col1 = NULL where col1 > 0;
+ROLLBACK;
+BEGIN;
+truncate hs3;
+ROLLBACK;
+
+-- DDL
+
+create temporary table hstemp1 (col1 integer);
+BEGIN;
+drop table hs2;
+ROLLBACK;
+BEGIN;
+create table hs4 (col1 integer);
+ROLLBACK;
+
+-- Sequences
+
+SELECT nextval('hsseq');
+
+-- Two-phase commit transaction stuff
+
+BEGIN;
+SELECT count(*) FROM hs1;
+PREPARE TRANSACTION 'foobar';
+ROLLBACK;
+BEGIN;
+SELECT count(*) FROM hs1;
+COMMIT PREPARED 'foobar';
+ROLLBACK;
+
+BEGIN;
+SELECT count(*) FROM hs1;
+PREPARE TRANSACTION 'foobar';
+ROLLBACK PREPARED 'foobar';
+ROLLBACK;
+
+BEGIN;
+SELECT count(*) FROM hs1;
+ROLLBACK PREPARED 'foobar';
+ROLLBACK;
+
+
+-- Locks
+BEGIN;
+LOCK hs1;
+COMMIT;
+BEGIN;
+LOCK hs1 IN SHARE UPDATE EXCLUSIVE MODE;
+COMMIT;
+BEGIN;
+LOCK hs1 IN SHARE MODE;
+COMMIT;
+BEGIN;
+LOCK hs1 IN SHARE ROW EXCLUSIVE MODE;
+COMMIT;
+BEGIN;
+LOCK hs1 IN EXCLUSIVE MODE;
+COMMIT;
+BEGIN;
+LOCK hs1 IN ACCESS EXCLUSIVE MODE;
+COMMIT;
+
+-- Listen
+listen a;
+notify a;
+unlisten a;
+unlisten *;
+
+-- disallowed commands
+
+ANALYZE hs1;
+
+VACUUM hs2;
+
+CLUSTER hs2 using hs1_pkey;
+
+REINDEX TABLE hs2;
+
+REVOKE SELECT ON hs1 FROM PUBLIC;
+GRANT SELECT ON hs1 TO PUBLIC;
--- a/src/test/regress/sql/hs_standby_functions.sql
+++ b/src/test/regress/sql/hs_standby_functions.sql
@@ -0,0 +1,24 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_functions.sql
+--
+
+-- should fail
+select txid_current();
+
+select length(txid_current_snapshot()::text) >= 4;
+
+select pg_start_backup('should fail');
+select pg_switch_xlog();
+select pg_stop_backup();
+
+-- should return no rows
+select * from pg_prepared_xacts;
+
+-- just the startup process
+select locktype, virtualxid, virtualtransaction, mode, granted
+from pg_locks where virtualxid = '1/1';
+
+-- suicide is painless
+select pg_cancel_backend(pg_backend_pid());
--- a/src/test/regress/standby_schedule
+++ b/src/test/regress/standby_schedule
@@ -0,0 +1,21 @@
+# $PostgreSQL: pgsql/src/test/regress/standby_schedule,v 1.1 2009/12/19 01:32:45 sriggs Exp $
+#
+# Test schedule for Hot Standby
+#
+# First test checks we are on a standby server.
+# Subsequent tests rely upon a setup script having already
+# been executed in the appropriate database on the primary server
+# which is feeding WAL files to target standby.
+#
+# psql -f src/test/regress/sql/hs_primary_setup.sql regression
+#
+test: hs_standby_check
+#
+# These tests will pass on both primary and standby servers
+#
+test: hs_standby_allowed
+#
+# These tests will fail on a non-standby server
+#
+test: hs_standby_disallowed
+test: hs_standby_functions