Reorder EPQ work, to fix rowmark related bugs and improve efficiency.

In ad0bda5d24 I changed the EvalPlanQual machinery to store substitution tuples in slot, instead of using plain HeapTuples. The main motivation for that was that using HeapTuples will be inefficient for future tableams. But it turns out that that conversion was buggy for non-locking rowmarks - the wrong tuple descriptor was used to create the slot. As a secondary issue 5db6df0c0 changed ExecLockRows() to begin EPQ earlier, to allow to fetch the locked rows directly into the EPQ slots, instead of having to copy tuples around. Unfortunately, as Tom complained, that forces some expensive initialization to happen earlier. As a third issue, the test coverage for EPQ was clearly insufficient. Fixing the first issue is unfortunately not trivial: Non-locked row marks were fetched at the start of EPQ, and we don't have the type information for the rowmarks available at that point. While we could change that, it's not easy. It might be worthwhile to change that at some point, but to fix this bug, it seems better to delay fetching non-locking rowmarks when they're actually needed, rather than eagerly. They're referenced at most once, and in cases where EPQ fails, might never be referenced. Fetching them when needed also increases locality a bit. To be able to fetch rowmarks during execution, rather than initialization, we need to be able to access the active EPQState, as that contains necessary data. To do so move EPQ related data from EState to EPQState, and, only for EStates creates as part of EPQ, reference the associated EPQState from EState. To fix the second issue, change EPQ initialization to allow use of EvalPlanQualSlot() to be used before EvalPlanQualBegin() (but obviously still requiring EvalPlanQualInit() to have been done). As these changes made struct EState harder to understand, e.g. by adding multiple EStates, significantly reorder the members, and add a lot more comments. Also add a few more EPQ tests, including one that fails for the first issue above. More is needed. Reported-By: yi huang Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/CAHU7rYZo_C4ULsAx_LAj8az9zqgrD8WDd4hTegDTMM1LMqrBsg@mail.gmail.com https://postgr.es/m/24530.1562686693@sss.pgh.pa.us Backpatch: 12-, where the EPQ changes were introduced
2019-09-05 13:00:20 -07:00 · 2019-09-05 13:00:20 -07:00 · 27cc7cd2bc
parent 7e04160390
commit 27cc7cd2bc
12 changed files with 705 additions and 273 deletions
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@ -3363,8 +3363,7 @@ GetTupleForTrigger(EState *estate,
 				{
 					TupleTableSlot *epqslot;

-					epqslot = EvalPlanQual(estate,
-										   epqstate,
+					epqslot = EvalPlanQual(epqstate,
 										   relation,
 										   relinfo->ri_RangeTableIndex,
 										   oldslot);
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@ -98,8 +98,7 @@ static char *ExecBuildSlotValueDescription(Oid reloid,
 										   TupleDesc tupdesc,
 										   Bitmapset *modifiedCols,
 										   int maxfieldlen);
-static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
-							  Plan *planTree);
+static void EvalPlanQualStart(EPQState *epqstate, Plan *planTree);

 /*
 * Note that GetAllUpdatedColumns() also exists in commands/trigger.c.  There does
@ -979,9 +978,8 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	 */
 	estate->es_tupleTable = NIL;

-	/* mark EvalPlanQual not active */
-	estate->es_epqTupleSlot = NULL;
-	estate->es_epqScanDone = NULL;
+	/* signal that this EState is not used for EPQ */
+	estate->es_epq_active = NULL;

 	/*
 	 * Initialize private state information for each SubPlan.  We must do this
@ -2421,7 +2419,6 @@ ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist)
 * Check the updated version of a tuple to see if we want to process it under
 * READ COMMITTED rules.
 *
- *	estate - outer executor state data
 *	epqstate - state for EvalPlanQual rechecking
 *	relation - table containing tuple
 *	rti - rangetable index of table containing tuple
@ -2439,8 +2436,8 @@ ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist)
 * NULL if we determine we shouldn't process the row.
 */
 TupleTableSlot *
-EvalPlanQual(EState *estate, EPQState *epqstate,
-			 Relation relation, Index rti, TupleTableSlot *inputslot)
+EvalPlanQual(EPQState *epqstate, Relation relation,
+			 Index rti, TupleTableSlot *inputslot)
 {
 	TupleTableSlot *slot;
 	TupleTableSlot *testslot;
@ -2450,7 +2447,7 @@ EvalPlanQual(EState *estate, EPQState *epqstate,
 	/*
 	 * Need to run a recheck subquery.  Initialize or reinitialize EPQ state.
 	 */
-	EvalPlanQualBegin(epqstate, estate);
+	EvalPlanQualBegin(epqstate);

 	/*
 	 * Callers will often use the EvalPlanQualSlot to store the tuple to avoid
@ -2460,11 +2457,6 @@ EvalPlanQual(EState *estate, EPQState *epqstate,
 	if (testslot != inputslot)
 		ExecCopySlot(testslot, inputslot);

-	/*
-	 * Fetch any non-locked source rows
-	 */
-	EvalPlanQualFetchRowMarks(epqstate);
-
 	/*
 	 * Run the EPQ query.  We assume it will return at most one tuple.
 	 */
@ -2498,17 +2490,36 @@ EvalPlanQual(EState *estate, EPQState *epqstate,
 * with EvalPlanQualSetPlan.
 */
 void
-EvalPlanQualInit(EPQState *epqstate, EState *estate,
+EvalPlanQualInit(EPQState *epqstate, EState *parentestate,
 				 Plan *subplan, List *auxrowmarks, int epqParam)
 {
-	/* Mark the EPQ state inactive */
-	epqstate->estate = NULL;
-	epqstate->planstate = NULL;
-	epqstate->origslot = NULL;
+	Index		rtsize = parentestate->es_range_table_size;
+
+	/* initialize data not changing over EPQState's lifetime */
+	epqstate->parentestate = parentestate;
+	epqstate->epqParam = epqParam;
+
+	/*
+	 * Allocate space to reference a slot for each potential rti - do so now
+	 * rather than in EvalPlanQualBegin(), as done for other dynamically
+	 * allocated resources, so EvalPlanQualSlot() can be used to hold tuples
+	 * that *may* need EPQ later, without forcing the overhead of
+	 * EvalPlanQualBegin().
+	 */
+	epqstate->tuple_table = NIL;
+	epqstate->relsubs_slot = (TupleTableSlot **)
+		palloc0(rtsize * sizeof(TupleTableSlot *));
+
 	/* ... and remember data that EvalPlanQualBegin will need */
 	epqstate->plan = subplan;
 	epqstate->arowMarks = auxrowmarks;
-	epqstate->epqParam = epqParam;
+
+	/* ... and mark the EPQ state inactive */
+	epqstate->origslot = NULL;
+	epqstate->recheckestate = NULL;
+	epqstate->recheckplanstate = NULL;
+	epqstate->relsubs_rowmark = NULL;
+	epqstate->relsubs_done = NULL;
 }

 /*
@ -2529,6 +2540,9 @@ EvalPlanQualSetPlan(EPQState *epqstate, Plan *subplan, List *auxrowmarks)

 /*
 * Return, and create if necessary, a slot for an EPQ test tuple.
+ *
+ * Note this only requires EvalPlanQualInit() to have been called,
+ * EvalPlanQualBegin() is not necessary.
 */
 TupleTableSlot *
 EvalPlanQualSlot(EPQState *epqstate,
@ -2536,23 +2550,16 @@ EvalPlanQualSlot(EPQState *epqstate,
 {
 	TupleTableSlot **slot;

-	Assert(rti > 0 && rti <= epqstate->estate->es_range_table_size);
-	slot = &epqstate->estate->es_epqTupleSlot[rti - 1];
+	Assert(relation);
+	Assert(rti > 0 && rti <= epqstate->parentestate->es_range_table_size);
+	slot = &epqstate->relsubs_slot[rti - 1];

 	if (*slot == NULL)
 	{
 		MemoryContext oldcontext;

-		oldcontext = MemoryContextSwitchTo(epqstate->estate->es_query_cxt);
-
-		if (relation)
-			*slot = table_slot_create(relation,
-									  &epqstate->estate->es_tupleTable);
-		else
-			*slot = ExecAllocTableSlot(&epqstate->estate->es_tupleTable,
-									   epqstate->origslot->tts_tupleDescriptor,
-									   &TTSOpsVirtual);
-
+		oldcontext = MemoryContextSwitchTo(epqstate->parentestate->es_query_cxt);
+		*slot = table_slot_create(relation, &epqstate->tuple_table);
 		MemoryContextSwitchTo(oldcontext);
 	}

@ -2560,117 +2567,113 @@ EvalPlanQualSlot(EPQState *epqstate,
 }

 /*
- * Fetch the current row values for any non-locked relations that need
- * to be scanned by an EvalPlanQual operation.  origslot must have been set
- * to contain the current result row (top-level row) that we need to recheck.
+ * Fetch the current row value for a non-locked relation, identified by rti,
+ * that needs to be scanned by an EvalPlanQual operation.  origslot must have
+ * been set to contain the current result row (top-level row) that we need to
+ * recheck.  Returns true if a substitution tuple was found, false if not.
 */
-void
-EvalPlanQualFetchRowMarks(EPQState *epqstate)
+bool
+EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 {
-	ListCell   *l;
+	ExecAuxRowMark *earm = epqstate->relsubs_rowmark[rti - 1];
+	ExecRowMark *erm = earm->rowmark;
+	Datum		datum;
+	bool		isNull;

+	Assert(earm != NULL);
 	Assert(epqstate->origslot != NULL);

-	foreach(l, epqstate->arowMarks)
+	if (RowMarkRequiresRowShareLock(erm->markType))
+		elog(ERROR, "EvalPlanQual doesn't support locking rowmarks");
+
+	/* if child rel, must check whether it produced this row */
+	if (erm->rti != erm->prti)
 	{
-		ExecAuxRowMark *aerm = (ExecAuxRowMark *) lfirst(l);
-		ExecRowMark *erm = aerm->rowmark;
-		Datum		datum;
-		bool		isNull;
-		TupleTableSlot *slot;
+		Oid			tableoid;

-		if (RowMarkRequiresRowShareLock(erm->markType))
-			elog(ERROR, "EvalPlanQual doesn't support locking rowmarks");
+		datum = ExecGetJunkAttribute(epqstate->origslot,
+									 earm->toidAttNo,
+									 &isNull);
+		/* non-locked rels could be on the inside of outer joins */
+		if (isNull)
+			return false;

-		/* clear any leftover test tuple for this rel */
-		slot = EvalPlanQualSlot(epqstate, erm->relation, erm->rti);
-		ExecClearTuple(slot);
+		tableoid = DatumGetObjectId(datum);

-		/* if child rel, must check whether it produced this row */
-		if (erm->rti != erm->prti)
+		Assert(OidIsValid(erm->relid));
+		if (tableoid != erm->relid)
 		{
-			Oid			tableoid;
-
-			datum = ExecGetJunkAttribute(epqstate->origslot,
-										 aerm->toidAttNo,
-										 &isNull);
-			/* non-locked rels could be on the inside of outer joins */
-			if (isNull)
-				continue;
-			tableoid = DatumGetObjectId(datum);
-
-			Assert(OidIsValid(erm->relid));
-			if (tableoid != erm->relid)
-			{
-				/* this child is inactive right now */
-				continue;
-			}
+			/* this child is inactive right now */
+			return false;
 		}
+	}

-		if (erm->markType == ROW_MARK_REFERENCE)
+	if (erm->markType == ROW_MARK_REFERENCE)
+	{
+		Assert(erm->relation != NULL);
+
+		/* fetch the tuple's ctid */
+		datum = ExecGetJunkAttribute(epqstate->origslot,
+									 earm->ctidAttNo,
+									 &isNull);
+		/* non-locked rels could be on the inside of outer joins */
+		if (isNull)
+			return false;
+
+		/* fetch requests on foreign tables must be passed to their FDW */
+		if (erm->relation->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 		{
-			Assert(erm->relation != NULL);
+			FdwRoutine *fdwroutine;
+			bool		updated = false;

-			/* fetch the tuple's ctid */
-			datum = ExecGetJunkAttribute(epqstate->origslot,
-										 aerm->ctidAttNo,
-										 &isNull);
-			/* non-locked rels could be on the inside of outer joins */
-			if (isNull)
-				continue;
+			fdwroutine = GetFdwRoutineForRelation(erm->relation, false);
+			/* this should have been checked already, but let's be safe */
+			if (fdwroutine->RefetchForeignRow == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("cannot lock rows in foreign table \"%s\"",
+								RelationGetRelationName(erm->relation))));

-			/* fetch requests on foreign tables must be passed to their FDW */
-			if (erm->relation->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
-			{
-				FdwRoutine *fdwroutine;
-				bool		updated = false;
+			fdwroutine->RefetchForeignRow(epqstate->recheckestate,
+										  erm,
+										  datum,
+										  slot,
+										  &updated);
+			if (TupIsNull(slot))
+				elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");

-				fdwroutine = GetFdwRoutineForRelation(erm->relation, false);
-				/* this should have been checked already, but let's be safe */
-				if (fdwroutine->RefetchForeignRow == NULL)
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot lock rows in foreign table \"%s\"",
-									RelationGetRelationName(erm->relation))));
-
-				fdwroutine->RefetchForeignRow(epqstate->estate,
-											  erm,
-											  datum,
-											  slot,
-											  &updated);
-				if (TupIsNull(slot))
-					elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");
-
-				/*
-				 * Ideally we'd insist on updated == false here, but that
-				 * assumes that FDWs can track that exactly, which they might
-				 * not be able to.  So just ignore the flag.
-				 */
-			}
-			else
-			{
-				/* ordinary table, fetch the tuple */
-				if (!table_tuple_fetch_row_version(erm->relation,
-												   (ItemPointer) DatumGetPointer(datum),
-												   SnapshotAny, slot))
-					elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");
-			}
+			/*
+			 * Ideally we'd insist on updated == false here, but that assumes
+			 * that FDWs can track that exactly, which they might not be able
+			 * to.  So just ignore the flag.
+			 */
+			return true;
 		}
 		else
 		{
-			Assert(erm->markType == ROW_MARK_COPY);
-
-			/* fetch the whole-row Var for the relation */
-			datum = ExecGetJunkAttribute(epqstate->origslot,
-										 aerm->wholeAttNo,
-										 &isNull);
-			/* non-locked rels could be on the inside of outer joins */
-			if (isNull)
-				continue;
-
-			ExecStoreHeapTupleDatum(datum, slot);
+			/* ordinary table, fetch the tuple */
+			if (!table_tuple_fetch_row_version(erm->relation,
+											   (ItemPointer) DatumGetPointer(datum),
+											   SnapshotAny, slot))
+				elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");
+			return true;
 		}
 	}
+	else
+	{
+		Assert(erm->markType == ROW_MARK_COPY);
+
+		/* fetch the whole-row Var for the relation */
+		datum = ExecGetJunkAttribute(epqstate->origslot,
+									 earm->wholeAttNo,
+									 &isNull);
+		/* non-locked rels could be on the inside of outer joins */
+		if (isNull)
+			return false;
+
+		ExecStoreHeapTupleDatum(datum, slot);
+		return true;
+	}
 }

 /*
@ -2684,8 +2687,8 @@ EvalPlanQualNext(EPQState *epqstate)
 	MemoryContext oldcontext;
 	TupleTableSlot *slot;

-	oldcontext = MemoryContextSwitchTo(epqstate->estate->es_query_cxt);
-	slot = ExecProcNode(epqstate->planstate);
+	oldcontext = MemoryContextSwitchTo(epqstate->recheckestate->es_query_cxt);
+	slot = ExecProcNode(epqstate->recheckplanstate);
 	MemoryContextSwitchTo(oldcontext);

 	return slot;
@ -2695,14 +2698,15 @@ EvalPlanQualNext(EPQState *epqstate)
 * Initialize or reset an EvalPlanQual state tree
 */
 void
-EvalPlanQualBegin(EPQState *epqstate, EState *parentestate)
+EvalPlanQualBegin(EPQState *epqstate)
 {
-	EState	   *estate = epqstate->estate;
+	EState	   *parentestate = epqstate->parentestate;
+	EState	   *recheckestate = epqstate->recheckestate;

-	if (estate == NULL)
+	if (recheckestate == NULL)
 	{
 		/* First time through, so create a child EState */
-		EvalPlanQualStart(epqstate, parentestate, epqstate->plan);
+		EvalPlanQualStart(epqstate, epqstate->plan);
 	}
 	else
 	{
@ -2710,9 +2714,9 @@ EvalPlanQualBegin(EPQState *epqstate, EState *parentestate)
 		 * We already have a suitable child EPQ tree, so just reset it.
 		 */
 		Index		rtsize = parentestate->es_range_table_size;
-		PlanState  *planstate = epqstate->planstate;
+		PlanState  *rcplanstate = epqstate->recheckplanstate;

-		MemSet(estate->es_epqScanDone, 0, rtsize * sizeof(bool));
+		MemSet(epqstate->relsubs_done, 0, rtsize * sizeof(bool));

 		/* Recopy current values of parent parameters */
 		if (parentestate->es_plannedstmt->paramExecTypes != NIL)
@ -2724,7 +2728,7 @@ EvalPlanQualBegin(EPQState *epqstate, EState *parentestate)
 			 * by the subplan, just in case they got reset since
 			 * EvalPlanQualStart (see comments therein).
 			 */
-			ExecSetParamPlanMulti(planstate->plan->extParam,
+			ExecSetParamPlanMulti(rcplanstate->plan->extParam,
 								  GetPerTupleExprContext(parentestate));

 			i = list_length(parentestate->es_plannedstmt->paramExecTypes);
@ -2732,9 +2736,9 @@ EvalPlanQualBegin(EPQState *epqstate, EState *parentestate)
 			while (--i >= 0)
 			{
 				/* copy value if any, but not execPlan link */
-				estate->es_param_exec_vals[i].value =
+				recheckestate->es_param_exec_vals[i].value =
 					parentestate->es_param_exec_vals[i].value;
-				estate->es_param_exec_vals[i].isnull =
+				recheckestate->es_param_exec_vals[i].isnull =
 					parentestate->es_param_exec_vals[i].isnull;
 			}
 		}
@ -2743,8 +2747,8 @@ EvalPlanQualBegin(EPQState *epqstate, EState *parentestate)
 		 * Mark child plan tree as needing rescan at all scan nodes.  The
 		 * first ExecProcNode will take care of actually doing the rescan.
 		 */
-		planstate->chgParam = bms_add_member(planstate->chgParam,
-											 epqstate->epqParam);
+		rcplanstate->chgParam = bms_add_member(rcplanstate->chgParam,
+											   epqstate->epqParam);
 	}
 }

@ -2755,18 +2759,20 @@ EvalPlanQualBegin(EPQState *epqstate, EState *parentestate)
 * the top-level estate rather than initializing it fresh.
 */
 static void
-EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
+EvalPlanQualStart(EPQState *epqstate, Plan *planTree)
 {
-	EState	   *estate;
-	Index		rtsize;
+	EState	   *parentestate = epqstate->parentestate;
+	Index		rtsize = parentestate->es_range_table_size;
+	EState	   *rcestate;
 	MemoryContext oldcontext;
 	ListCell   *l;

-	rtsize = parentestate->es_range_table_size;
+	epqstate->recheckestate = rcestate = CreateExecutorState();

-	epqstate->estate = estate = CreateExecutorState();
+	oldcontext = MemoryContextSwitchTo(rcestate->es_query_cxt);

-	oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
+	/* signal that this is an EState for executing EPQ */
+	rcestate->es_epq_active = epqstate;

 	/*
 	 * Child EPQ EStates share the parent's copy of unchanging state such as
@ -2782,17 +2788,17 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	 * state must *not* propagate back to the parent.  (For one thing, the
 	 * pointed-to data is in a memory context that won't last long enough.)
 	 */
-	estate->es_direction = ForwardScanDirection;
-	estate->es_snapshot = parentestate->es_snapshot;
-	estate->es_crosscheck_snapshot = parentestate->es_crosscheck_snapshot;
-	estate->es_range_table = parentestate->es_range_table;
-	estate->es_range_table_size = parentestate->es_range_table_size;
-	estate->es_relations = parentestate->es_relations;
-	estate->es_queryEnv = parentestate->es_queryEnv;
-	estate->es_rowmarks = parentestate->es_rowmarks;
-	estate->es_plannedstmt = parentestate->es_plannedstmt;
-	estate->es_junkFilter = parentestate->es_junkFilter;
-	estate->es_output_cid = parentestate->es_output_cid;
+	rcestate->es_direction = ForwardScanDirection;
+	rcestate->es_snapshot = parentestate->es_snapshot;
+	rcestate->es_crosscheck_snapshot = parentestate->es_crosscheck_snapshot;
+	rcestate->es_range_table = parentestate->es_range_table;
+	rcestate->es_range_table_size = parentestate->es_range_table_size;
+	rcestate->es_relations = parentestate->es_relations;
+	rcestate->es_queryEnv = parentestate->es_queryEnv;
+	rcestate->es_rowmarks = parentestate->es_rowmarks;
+	rcestate->es_plannedstmt = parentestate->es_plannedstmt;
+	rcestate->es_junkFilter = parentestate->es_junkFilter;
+	rcestate->es_output_cid = parentestate->es_output_cid;
 	if (parentestate->es_num_result_relations > 0)
 	{
 		int			numResultRelations = parentestate->es_num_result_relations;
@ -2803,8 +2809,8 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 			palloc(numResultRelations * sizeof(ResultRelInfo));
 		memcpy(resultRelInfos, parentestate->es_result_relations,
 			   numResultRelations * sizeof(ResultRelInfo));
-		estate->es_result_relations = resultRelInfos;
-		estate->es_num_result_relations = numResultRelations;
+		rcestate->es_result_relations = resultRelInfos;
+		rcestate->es_num_result_relations = numResultRelations;

 		/* Also transfer partitioned root result relations. */
 		if (numRootResultRels > 0)
@ -2813,14 +2819,14 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 				palloc(numRootResultRels * sizeof(ResultRelInfo));
 			memcpy(resultRelInfos, parentestate->es_root_result_relations,
 				   numRootResultRels * sizeof(ResultRelInfo));
-			estate->es_root_result_relations = resultRelInfos;
-			estate->es_num_root_result_relations = numRootResultRels;
+			rcestate->es_root_result_relations = resultRelInfos;
+			rcestate->es_num_root_result_relations = numRootResultRels;
 		}
 	}
 	/* es_result_relation_info must NOT be copied */
 	/* es_trig_target_relations must NOT be copied */
-	estate->es_top_eflags = parentestate->es_top_eflags;
-	estate->es_instrument = parentestate->es_instrument;
+	rcestate->es_top_eflags = parentestate->es_top_eflags;
+	rcestate->es_instrument = parentestate->es_instrument;
 	/* es_auxmodifytables must NOT be copied */

 	/*
@ -2829,7 +2835,7 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	 * from the parent, so as to have access to any param values that were
 	 * already set from other parts of the parent's plan tree.
 	 */
-	estate->es_param_list_info = parentestate->es_param_list_info;
+	rcestate->es_param_list_info = parentestate->es_param_list_info;
 	if (parentestate->es_plannedstmt->paramExecTypes != NIL)
 	{
 		int			i;
@ -2857,41 +2863,19 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)

 		/* now make the internal param workspace ... */
 		i = list_length(parentestate->es_plannedstmt->paramExecTypes);
-		estate->es_param_exec_vals = (ParamExecData *)
+		rcestate->es_param_exec_vals = (ParamExecData *)
 			palloc0(i * sizeof(ParamExecData));
 		/* ... and copy down all values, whether really needed or not */
 		while (--i >= 0)
 		{
 			/* copy value if any, but not execPlan link */
-			estate->es_param_exec_vals[i].value =
+			rcestate->es_param_exec_vals[i].value =
 				parentestate->es_param_exec_vals[i].value;
-			estate->es_param_exec_vals[i].isnull =
+			rcestate->es_param_exec_vals[i].isnull =
 				parentestate->es_param_exec_vals[i].isnull;
 		}
 	}

-	/*
-	 * Each EState must have its own es_epqScanDone state, but if we have
-	 * nested EPQ checks they should share es_epqTupleSlot arrays.  This
-	 * allows sub-rechecks to inherit the values being examined by an outer
-	 * recheck.
-	 */
-	estate->es_epqScanDone = (bool *) palloc0(rtsize * sizeof(bool));
-	if (parentestate->es_epqTupleSlot != NULL)
-	{
-		estate->es_epqTupleSlot = parentestate->es_epqTupleSlot;
-	}
-	else
-	{
-		estate->es_epqTupleSlot = (TupleTableSlot **)
-			palloc0(rtsize * sizeof(TupleTableSlot *));
-	}
-
-	/*
-	 * Each estate also has its own tuple table.
-	 */
-	estate->es_tupleTable = NIL;
-
 	/*
 	 * Initialize private state information for each SubPlan.  We must do this
 	 * before running ExecInitNode on the main query tree, since
@ -2900,15 +2884,49 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	 * run, but since it's not easy to tell which, we just initialize them
 	 * all.
 	 */
-	Assert(estate->es_subplanstates == NIL);
+	Assert(rcestate->es_subplanstates == NIL);
 	foreach(l, parentestate->es_plannedstmt->subplans)
 	{
 		Plan	   *subplan = (Plan *) lfirst(l);
 		PlanState  *subplanstate;

-		subplanstate = ExecInitNode(subplan, estate, 0);
-		estate->es_subplanstates = lappend(estate->es_subplanstates,
-										   subplanstate);
+		subplanstate = ExecInitNode(subplan, rcestate, 0);
+		rcestate->es_subplanstates = lappend(rcestate->es_subplanstates,
+											 subplanstate);
+	}
+
+	/*
+	 * These arrays are reused across different plans set with
+	 * EvalPlanQualSetPlan(), which is safe because they all use the same
+	 * parent EState. Therefore we can reuse if already allocated.
+	 */
+	if (epqstate->relsubs_rowmark == NULL)
+	{
+		Assert(epqstate->relsubs_done == NULL);
+		epqstate->relsubs_rowmark = (ExecAuxRowMark **)
+			palloc0(rtsize * sizeof(ExecAuxRowMark *));
+		epqstate->relsubs_done = (bool *)
+			palloc0(rtsize * sizeof(bool));
+	}
+	else
+	{
+		Assert(epqstate->relsubs_done != NULL);
+		memset(epqstate->relsubs_rowmark, 0,
+			   sizeof(rtsize * sizeof(ExecAuxRowMark *)));
+		memset(epqstate->relsubs_done, 0,
+			   rtsize * sizeof(bool));
+	}
+
+	/*
+	 * Build an RTI indexed array of rowmarks, so that
+	 * EvalPlanQualFetchRowMark() can efficiently access the to be fetched
+	 * rowmark.
+	 */
+	foreach(l, epqstate->arowMarks)
+	{
+		ExecAuxRowMark *earm = (ExecAuxRowMark *) lfirst(l);
+
+		epqstate->relsubs_rowmark[earm->rowmark->rti - 1] = earm;
 	}

 	/*
@ -2916,7 +2934,7 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	 * of the plan tree we need to run.  This opens files, allocates storage
 	 * and leaves us ready to start processing tuples.
 	 */
-	epqstate->planstate = ExecInitNode(planTree, estate, 0);
+	epqstate->recheckplanstate = ExecInitNode(planTree, rcestate, 0);

 	MemoryContextSwitchTo(oldcontext);
 }
@ -2934,16 +2952,32 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 void
 EvalPlanQualEnd(EPQState *epqstate)
 {
-	EState	   *estate = epqstate->estate;
+	EState	   *estate = epqstate->recheckestate;
+	Index		rtsize;
 	MemoryContext oldcontext;
 	ListCell   *l;

+	rtsize = epqstate->parentestate->es_range_table_size;
+
+	/*
+	 * We may have a tuple table, even if EPQ wasn't started, because we allow
+	 * use of EvalPlanQualSlot() without calling EvalPlanQualBegin().
+	 */
+	if (epqstate->tuple_table != NIL)
+	{
+		memset(epqstate->relsubs_slot, 0,
+			   sizeof(rtsize * sizeof(TupleTableSlot *)));
+		ExecResetTupleTable(epqstate->tuple_table, true);
+		epqstate->tuple_table = NIL;
+	}
+
+	/* EPQ wasn't started, nothing further to do */
 	if (estate == NULL)
-		return;					/* idle, so nothing to do */
+		return;

 	oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);

-	ExecEndNode(epqstate->planstate);
+	ExecEndNode(epqstate->recheckplanstate);

 	foreach(l, estate->es_subplanstates)
 	{
@ -2952,7 +2986,7 @@ EvalPlanQualEnd(EPQState *epqstate)
 		ExecEndNode(subplanstate);
 	}

-	/* throw away the per-estate tuple table */
+	/* throw away the per-estate tuple table, some node may have used it */
 	ExecResetTupleTable(estate->es_tupleTable, false);

 	/* close any trigger target relations attached to this EState */
@ -2963,7 +2997,7 @@ EvalPlanQualEnd(EPQState *epqstate)
 	FreeExecutorState(estate);

 	/* Mark EPQState idle */
-	epqstate->estate = NULL;
-	epqstate->planstate = NULL;
+	epqstate->recheckestate = NULL;
+	epqstate->recheckplanstate = NULL;
 	epqstate->origslot = NULL;
 }
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@ -40,8 +40,10 @@ ExecScanFetch(ScanState *node,

 	CHECK_FOR_INTERRUPTS();

-	if (estate->es_epqTupleSlot != NULL)
+	if (estate->es_epq_active != NULL)
 	{
+		EPQState   *epqstate = estate->es_epq_active;
+
 		/*
 		 * We are inside an EvalPlanQual recheck.  Return the test tuple if
 		 * one is available, after rechecking any access-method-specific
@ -51,29 +53,43 @@ ExecScanFetch(ScanState *node,

 		if (scanrelid == 0)
 		{
-			TupleTableSlot *slot = node->ss_ScanTupleSlot;
-
 			/*
 			 * This is a ForeignScan or CustomScan which has pushed down a
 			 * join to the remote side.  The recheck method is responsible not
 			 * only for rechecking the scan/join quals but also for storing
 			 * the correct tuple in the slot.
 			 */
+
+			TupleTableSlot *slot = node->ss_ScanTupleSlot;
+
 			if (!(*recheckMtd) (node, slot))
 				ExecClearTuple(slot);	/* would not be returned by scan */
 			return slot;
 		}
-		else if (estate->es_epqTupleSlot[scanrelid - 1] != NULL)
+		else if (epqstate->relsubs_done[scanrelid - 1])
 		{
+			/*
+			 * Return empty slot, as we already performed an EPQ substitution
+			 * for this relation.
+			 */
+
 			TupleTableSlot *slot = node->ss_ScanTupleSlot;

-			/* Return empty slot if we already returned a tuple */
-			if (estate->es_epqScanDone[scanrelid - 1])
-				return ExecClearTuple(slot);
-			/* Else mark to remember that we shouldn't return more */
-			estate->es_epqScanDone[scanrelid - 1] = true;
+			/* Return empty slot, as we already returned a tuple */
+			return ExecClearTuple(slot);
+		}
+		else if (epqstate->relsubs_slot[scanrelid - 1] != NULL)
+		{
+			/*
+			 * Return replacement tuple provided by the EPQ caller.
+			 */

-			slot = estate->es_epqTupleSlot[scanrelid - 1];
+			TupleTableSlot *slot = epqstate->relsubs_slot[scanrelid - 1];
+
+			Assert(epqstate->relsubs_rowmark[scanrelid - 1] == NULL);
+
+			/* Mark to remember that we shouldn't return more */
+			epqstate->relsubs_done[scanrelid - 1] = true;

 			/* Return empty slot if we haven't got a test tuple */
 			if (TupIsNull(slot))
@ -83,7 +99,30 @@ ExecScanFetch(ScanState *node,
 			if (!(*recheckMtd) (node, slot))
 				return ExecClearTuple(slot);	/* would not be returned by
 												 * scan */
+			return slot;
+		}
+		else if (epqstate->relsubs_rowmark[scanrelid - 1] != NULL)
+		{
+			/*
+			 * Fetch and return replacement tuple using a non-locking rowmark.
+			 */

+			TupleTableSlot *slot = node->ss_ScanTupleSlot;
+
+			/* Mark to remember that we shouldn't return more */
+			epqstate->relsubs_done[scanrelid - 1] = true;
+
+			if (!EvalPlanQualFetchRowMark(epqstate, scanrelid, slot))
+				return NULL;
+
+			/* Return empty slot if we haven't got a test tuple */
+			if (TupIsNull(slot))
+				return NULL;
+
+			/* Check if it meets the access-method conditions */
+			if (!(*recheckMtd) (node, slot))
+				return ExecClearTuple(slot);	/* would not be returned by
+												 * scan */
 			return slot;
 		}
 	}
@ -268,12 +307,13 @@ ExecScanReScan(ScanState *node)
 	ExecClearTuple(node->ss_ScanTupleSlot);

 	/* Rescan EvalPlanQual tuple if we're inside an EvalPlanQual recheck */
-	if (estate->es_epqScanDone != NULL)
+	if (estate->es_epq_active != NULL)
 	{
+		EPQState   *epqstate = estate->es_epq_active;
 		Index		scanrelid = ((Scan *) node->ps.plan)->scanrelid;

 		if (scanrelid > 0)
-			estate->es_epqScanDone[scanrelid - 1] = false;
+			epqstate->relsubs_done[scanrelid - 1] = false;
 		else
 		{
 			Bitmapset  *relids;
@ -295,7 +335,7 @@ ExecScanReScan(ScanState *node)
 			while ((rtindex = bms_next_member(relids, rtindex)) >= 0)
 			{
 				Assert(rtindex > 0);
-				estate->es_epqScanDone[rtindex - 1] = false;
+				epqstate->relsubs_done[rtindex - 1] = false;
 			}
 		}
 	}
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@ -156,8 +156,6 @@ CreateExecutorState(void)

 	estate->es_per_tuple_exprcontext = NULL;

-	estate->es_epqTupleSlot = NULL;
-	estate->es_epqScanDone = NULL;
 	estate->es_sourceText = NULL;

 	estate->es_use_parallel_mode = false;
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@ -420,25 +420,27 @@ void
 ExecIndexOnlyMarkPos(IndexOnlyScanState *node)
 {
 	EState	   *estate = node->ss.ps.state;
+	EPQState   *epqstate = estate->es_epq_active;

-	if (estate->es_epqTupleSlot != NULL)
+	if (epqstate != NULL)
 	{
 		/*
 		 * We are inside an EvalPlanQual recheck.  If a test tuple exists for
 		 * this relation, then we shouldn't access the index at all.  We would
 		 * instead need to save, and later restore, the state of the
-		 * es_epqScanDone flag, so that re-fetching the test tuple is
-		 * possible.  However, given the assumption that no caller sets a mark
-		 * at the start of the scan, we can only get here with es_epqScanDone
+		 * relsubs_done flag, so that re-fetching the test tuple is possible.
+		 * However, given the assumption that no caller sets a mark at the
+		 * start of the scan, we can only get here with relsubs_done[i]
 		 * already set, and so no state need be saved.
 		 */
 		Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;

 		Assert(scanrelid > 0);
-		if (estate->es_epqTupleSlot[scanrelid - 1] != NULL)
+		if (epqstate->relsubs_slot[scanrelid - 1] != NULL ||
+			epqstate->relsubs_rowmark[scanrelid - 1] != NULL)
 		{
 			/* Verify the claim above */
-			if (!estate->es_epqScanDone[scanrelid - 1])
+			if (!epqstate->relsubs_done[scanrelid - 1])
 				elog(ERROR, "unexpected ExecIndexOnlyMarkPos call in EPQ recheck");
 			return;
 		}
@ -455,17 +457,19 @@ void
 ExecIndexOnlyRestrPos(IndexOnlyScanState *node)
 {
 	EState	   *estate = node->ss.ps.state;
+	EPQState   *epqstate = estate->es_epq_active;

-	if (estate->es_epqTupleSlot != NULL)
+	if (estate->es_epq_active != NULL)
 	{
-		/* See comments in ExecIndexOnlyMarkPos */
+		/* See comments in ExecIndexMarkPos */
 		Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;

 		Assert(scanrelid > 0);
-		if (estate->es_epqTupleSlot[scanrelid - 1])
+		if (epqstate->relsubs_slot[scanrelid - 1] != NULL ||
+			epqstate->relsubs_rowmark[scanrelid - 1] != NULL)
 		{
 			/* Verify the claim above */
-			if (!estate->es_epqScanDone[scanrelid - 1])
+			if (!epqstate->relsubs_done[scanrelid - 1])
 				elog(ERROR, "unexpected ExecIndexOnlyRestrPos call in EPQ recheck");
 			return;
 		}
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@ -827,25 +827,27 @@ void
 ExecIndexMarkPos(IndexScanState *node)
 {
 	EState	   *estate = node->ss.ps.state;
+	EPQState   *epqstate = estate->es_epq_active;

-	if (estate->es_epqTupleSlot != NULL)
+	if (epqstate != NULL)
 	{
 		/*
 		 * We are inside an EvalPlanQual recheck.  If a test tuple exists for
 		 * this relation, then we shouldn't access the index at all.  We would
 		 * instead need to save, and later restore, the state of the
-		 * es_epqScanDone flag, so that re-fetching the test tuple is
-		 * possible.  However, given the assumption that no caller sets a mark
-		 * at the start of the scan, we can only get here with es_epqScanDone
+		 * relsubs_done flag, so that re-fetching the test tuple is possible.
+		 * However, given the assumption that no caller sets a mark at the
+		 * start of the scan, we can only get here with relsubs_done[i]
 		 * already set, and so no state need be saved.
 		 */
 		Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;

 		Assert(scanrelid > 0);
-		if (estate->es_epqTupleSlot[scanrelid - 1] != NULL)
+		if (epqstate->relsubs_slot[scanrelid - 1] != NULL ||
+			epqstate->relsubs_rowmark[scanrelid - 1] != NULL)
 		{
 			/* Verify the claim above */
-			if (!estate->es_epqScanDone[scanrelid - 1])
+			if (!epqstate->relsubs_done[scanrelid - 1])
 				elog(ERROR, "unexpected ExecIndexMarkPos call in EPQ recheck");
 			return;
 		}
@ -862,17 +864,19 @@ void
 ExecIndexRestrPos(IndexScanState *node)
 {
 	EState	   *estate = node->ss.ps.state;
+	EPQState   *epqstate = estate->es_epq_active;

-	if (estate->es_epqTupleSlot != NULL)
+	if (estate->es_epq_active != NULL)
 	{
 		/* See comments in ExecIndexMarkPos */
 		Index		scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;

 		Assert(scanrelid > 0);
-		if (estate->es_epqTupleSlot[scanrelid - 1] != NULL)
+		if (epqstate->relsubs_slot[scanrelid - 1] != NULL ||
+			epqstate->relsubs_rowmark[scanrelid - 1] != NULL)
 		{
 			/* Verify the claim above */
-			if (!estate->es_epqScanDone[scanrelid - 1])
+			if (!epqstate->relsubs_done[scanrelid - 1])
 				elog(ERROR, "unexpected ExecIndexRestrPos call in EPQ recheck");
 			return;
 		}
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@ -64,12 +64,6 @@ lnext:
 	/* We don't need EvalPlanQual unless we get updated tuple version(s) */
 	epq_needed = false;

-	/*
-	 * Initialize EPQ machinery. Need to do that early because source tuples
-	 * are stored in slots initialized therein.
-	 */
-	EvalPlanQualBegin(&node->lr_epqstate, estate);
-
 	/*
 	 * Attempt to lock the source tuple(s).  (Note we only have locking
 	 * rowmarks in lr_arowMarks.)
@ -259,12 +253,14 @@ lnext:
 	 */
 	if (epq_needed)
 	{
+		/* Initialize EPQ machinery */
+		EvalPlanQualBegin(&node->lr_epqstate);
+
 		/*
-		 * Now fetch any non-locked source rows --- the EPQ logic knows how to
-		 * do that.
+		 * To fetch non-locked source rows the EPQ logic needs to access junk
+		 * columns from the tuple being tested.
 		 */
 		EvalPlanQualSetSlot(&node->lr_epqstate, slot);
-		EvalPlanQualFetchRowMarks(&node->lr_epqstate);

 		/*
 		 * And finally we can re-evaluate the tuple.
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@ -828,7 +828,7 @@ ldelete:;
 					 * Already know that we're going to need to do EPQ, so
 					 * fetch tuple directly into the right slot.
 					 */
-					EvalPlanQualBegin(epqstate, estate);
+					EvalPlanQualBegin(epqstate);
 					inputslot = EvalPlanQualSlot(epqstate, resultRelationDesc,
 												 resultRelInfo->ri_RangeTableIndex);

@ -843,8 +843,7 @@ ldelete:;
 					{
 						case TM_Ok:
 							Assert(tmfd.traversed);
-							epqslot = EvalPlanQual(estate,
-												   epqstate,
+							epqslot = EvalPlanQual(epqstate,
 												   resultRelationDesc,
 												   resultRelInfo->ri_RangeTableIndex,
 												   inputslot);
@ -1370,7 +1369,6 @@ lreplace:;
 					 * Already know that we're going to need to do EPQ, so
 					 * fetch tuple directly into the right slot.
 					 */
-					EvalPlanQualBegin(epqstate, estate);
 					inputslot = EvalPlanQualSlot(epqstate, resultRelationDesc,
 												 resultRelInfo->ri_RangeTableIndex);

@ -1386,8 +1384,7 @@ lreplace:;
 						case TM_Ok:
 							Assert(tmfd.traversed);

-							epqslot = EvalPlanQual(estate,
-												   epqstate,
+							epqslot = EvalPlanQual(epqstate,
 												   resultRelationDesc,
 												   resultRelInfo->ri_RangeTableIndex,
 												   inputslot);
@ -2013,7 +2010,7 @@ ExecModifyTable(PlanState *pstate)
 	 * case it is within a CTE subplan.  Hence this test must be here, not in
 	 * ExecInitModifyTable.)
 	 */
-	if (estate->es_epqTupleSlot != NULL)
+	if (estate->es_epq_active != NULL)
 		elog(ERROR, "ModifyTable should not be called during EvalPlanQual");

 	/*
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@ -198,9 +198,9 @@ extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
 extern ExecRowMark *ExecFindRowMark(EState *estate, Index rti, bool missing_ok);
 extern ExecAuxRowMark *ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist);
-extern TupleTableSlot *EvalPlanQual(EState *estate, EPQState *epqstate,
-									Relation relation, Index rti, TupleTableSlot *testslot);
-extern void EvalPlanQualInit(EPQState *epqstate, EState *estate,
+extern TupleTableSlot *EvalPlanQual(EPQState *epqstate, Relation relation,
+									Index rti, TupleTableSlot *testslot);
+extern void EvalPlanQualInit(EPQState *epqstate, EState *parentestate,
 							 Plan *subplan, List *auxrowmarks, int epqParam);
 extern void EvalPlanQualSetPlan(EPQState *epqstate,
 								Plan *subplan, List *auxrowmarks);
@ -208,9 +208,9 @@ extern TupleTableSlot *EvalPlanQualSlot(EPQState *epqstate,
 										Relation relation, Index rti);

 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
-extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
+extern bool EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot);
 extern TupleTableSlot *EvalPlanQualNext(EPQState *epqstate);
-extern void EvalPlanQualBegin(EPQState *epqstate, EState *parentestate);
+extern void EvalPlanQualBegin(EPQState *epqstate);
 extern void EvalPlanQualEnd(EPQState *epqstate);

 /*
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@ -571,17 +571,12 @@ typedef struct EState
 	ExprContext *es_per_tuple_exprcontext;

 	/*
-	 * These fields are for re-evaluating plan quals when an updated tuple is
-	 * substituted in READ COMMITTED mode.  es_epqTupleSlot[] contains test
-	 * tuples that scan plan nodes should return instead of whatever they'd
-	 * normally return, or an empty slot if there is nothing to return; if
-	 * es_epqTupleSlot[] is not NULL if a particular array entry is valid; and
-	 * es_epqScanDone[] is state to remember if the tuple has been returned
-	 * already.  Arrays are of size es_range_table_size and are indexed by
-	 * scan node scanrelid - 1.
+	 * If not NULL, this is an EPQState's EState. This is a field in EState
+	 * both to allow EvalPlanQual aware executor nodes to detect that they
+	 * need to perform EPQ related work, and to provide necessary information
+	 * to do so.
 	 */
-	TupleTableSlot **es_epqTupleSlot;	/* array of EPQ substitute tuples */
-	bool	   *es_epqScanDone; /* true if EPQ tuple has been fetched */
+	struct EPQState *es_epq_active;

 	bool		es_use_parallel_mode;	/* can we use parallel workers? */

@ -1057,17 +1052,73 @@ typedef struct PlanState

 /*
 * EPQState is state for executing an EvalPlanQual recheck on a candidate
- * tuple in ModifyTable or LockRows.  The estate and planstate fields are
- * NULL if inactive.
+ * tuples e.g. in ModifyTable or LockRows.
+ *
+ * To execute EPQ a separate EState is created (stored in ->recheckestate),
+ * which shares some resources, like the rangetable, with the main query's
+ * EState (stored in ->parentestate). The (sub-)tree of the plan that needs to
+ * be rechecked (in ->plan), is separately initialized (into
+ * ->recheckplanstate), but shares plan nodes with the corresponding nodes in
+ * the main query. The scan nodes in that separate executor tree are changed
+ * to return only the current tuple of interest for the respective
+ * table. Those tuples are either provided by the caller (using
+ * EvalPlanQualSlot), and/or found using the rowmark mechanism (non-locking
+ * rowmarks by the EPQ machinery itself, locking ones by the caller).
+ *
+ * While the plan to be checked may be changed using EvalPlanQualSetPlan() -
+ * e.g. so all source plans for a ModifyTable node can be processed - all such
+ * plans need to share the same EState.
 */
 typedef struct EPQState
 {
-	EState	   *estate;			/* subsidiary EState */
-	PlanState  *planstate;		/* plan state tree ready to be executed */
-	TupleTableSlot *origslot;	/* original output tuple to be rechecked */
+	/* Initialized at EvalPlanQualInit() time: */
+
+	EState	   *parentestate;	/* main query's EState */
+	int			epqParam;		/* ID of Param to force scan node re-eval */
+
+	/*
+	 * Tuples to be substituted by scan nodes. They need to set up, before
+	 * calling EvalPlanQual()/EvalPlanQualNext(), into the slot returned by
+	 * EvalPlanQualSlot(scanrelid). The array is indexed by scanrelid - 1.
+	 */
+	List	   *tuple_table;	/* tuple table for relsubs_slot */
+	TupleTableSlot **relsubs_slot;
+
+	/*
+	 * Initialized by EvalPlanQualInit(), may be changed later with
+	 * EvalPlanQualSetPlan():
+	 */
+
 	Plan	   *plan;			/* plan tree to be executed */
 	List	   *arowMarks;		/* ExecAuxRowMarks (non-locking only) */
-	int			epqParam;		/* ID of Param to force scan node re-eval */
+
+
+	/*
+	 * The original output tuple to be rechecked.  Set by
+	 * EvalPlanQualSetSlot(), before EvalPlanQualNext() or EvalPlanQual() may
+	 * be called.
+	 */
+	TupleTableSlot *origslot;
+
+
+	/* Initialized or reset by EvalPlanQualBegin(): */
+
+	EState	   *recheckestate;	/* EState for EPQ execution, see above */
+
+	/*
+	 * Rowmarks that can be fetched on-demand using
+	 * EvalPlanQualFetchRowMark(), indexed by scanrelid - 1. Only non-locking
+	 * rowmarks.
+	 */
+	ExecAuxRowMark **relsubs_rowmark;
+
+	/*
+	 * True if a relation's EPQ tuple has been fetched for relation, indexed
+	 * by scanrelid - 1.
+	 */
+	bool	   *relsubs_done;
+
+	PlanState  *recheckplanstate;	/* EPQ specific exec nodes, for ->plan */
 } EPQState;


--- a/src/test/isolation/expected/eval-plan-qual.out
+++ b/src/test/isolation/expected/eval-plan-qual.out
@ -258,6 +258,273 @@ accountid      balance
 checking       1050           
 savings        600            

+starting permutation: wnested2 c1 c2 read
+s2: NOTICE:  upid: text checking = text checking: t
+s2: NOTICE:  up: numeric 600 > numeric 200.0: t
+s2: NOTICE:  lock_id: text checking = text checking: t
+s2: NOTICE:  lock_bal: numeric 600 > numeric 200.0: t
+s2: NOTICE:  upid: text savings = text checking: f
+step wnested2: 
+    UPDATE accounts SET balance = balance - 1200
+    WHERE noisy_oper('upid', accountid, '=', 'checking')
+    AND noisy_oper('up', balance, '>', 200.0)
+    AND EXISTS (
+        SELECT accountid
+        FROM accounts_ext ae
+        WHERE noisy_oper('lock_id', ae.accountid, '=', accounts.accountid)
+            AND noisy_oper('lock_bal', ae.balance, '>', 200.0)
+        FOR UPDATE
+    );
+
+step c1: COMMIT;
+step c2: COMMIT;
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid      balance        
+
+checking       -600           
+savings        600            
+
+starting permutation: wx1 wxext1 wnested2 c1 c2 read
+step wx1: UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+400            
+step wxext1: UPDATE accounts_ext SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+400            
+s2: NOTICE:  upid: text checking = text checking: t
+s2: NOTICE:  up: numeric 600 > numeric 200.0: t
+s2: NOTICE:  lock_id: text checking = text checking: t
+s2: NOTICE:  lock_bal: numeric 600 > numeric 200.0: t
+step wnested2: 
+    UPDATE accounts SET balance = balance - 1200
+    WHERE noisy_oper('upid', accountid, '=', 'checking')
+    AND noisy_oper('up', balance, '>', 200.0)
+    AND EXISTS (
+        SELECT accountid
+        FROM accounts_ext ae
+        WHERE noisy_oper('lock_id', ae.accountid, '=', accounts.accountid)
+            AND noisy_oper('lock_bal', ae.balance, '>', 200.0)
+        FOR UPDATE
+    );
+ <waiting ...>
+step c1: COMMIT;
+s2: NOTICE:  lock_id: text checking = text checking: t
+s2: NOTICE:  lock_bal: numeric 400 > numeric 200.0: t
+s2: NOTICE:  upid: text checking = text checking: t
+s2: NOTICE:  up: numeric 400 > numeric 200.0: t
+s2: NOTICE:  lock_id: text checking = text checking: t
+s2: NOTICE:  lock_bal: numeric 600 > numeric 200.0: t
+s2: NOTICE:  lock_id: text checking = text checking: t
+s2: NOTICE:  lock_bal: numeric 400 > numeric 200.0: t
+s2: NOTICE:  upid: text savings = text checking: f
+step wnested2: <... completed>
+step c2: COMMIT;
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid      balance        
+
+checking       -800           
+savings        600            
+
+starting permutation: wx1 wx1 wxext1 wnested2 c1 c2 read
+step wx1: UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+400            
+step wx1: UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+200            
+step wxext1: UPDATE accounts_ext SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+400            
+s2: NOTICE:  upid: text checking = text checking: t
+s2: NOTICE:  up: numeric 600 > numeric 200.0: t
+s2: NOTICE:  lock_id: text checking = text checking: t
+s2: NOTICE:  lock_bal: numeric 600 > numeric 200.0: t
+step wnested2: 
+    UPDATE accounts SET balance = balance - 1200
+    WHERE noisy_oper('upid', accountid, '=', 'checking')
+    AND noisy_oper('up', balance, '>', 200.0)
+    AND EXISTS (
+        SELECT accountid
+        FROM accounts_ext ae
+        WHERE noisy_oper('lock_id', ae.accountid, '=', accounts.accountid)
+            AND noisy_oper('lock_bal', ae.balance, '>', 200.0)
+        FOR UPDATE
+    );
+ <waiting ...>
+step c1: COMMIT;
+s2: NOTICE:  lock_id: text checking = text checking: t
+s2: NOTICE:  lock_bal: numeric 400 > numeric 200.0: t
+s2: NOTICE:  upid: text checking = text checking: t
+s2: NOTICE:  up: numeric 200 > numeric 200.0: f
+s2: NOTICE:  upid: text savings = text checking: f
+step wnested2: <... completed>
+step c2: COMMIT;
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid      balance        
+
+checking       200            
+savings        600            
+
+starting permutation: wx1 wx1 wxext1 wxext1 wnested2 c1 c2 read
+step wx1: UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+400            
+step wx1: UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+200            
+step wxext1: UPDATE accounts_ext SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+400            
+step wxext1: UPDATE accounts_ext SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+200            
+s2: NOTICE:  upid: text checking = text checking: t
+s2: NOTICE:  up: numeric 600 > numeric 200.0: t
+s2: NOTICE:  lock_id: text checking = text checking: t
+s2: NOTICE:  lock_bal: numeric 600 > numeric 200.0: t
+step wnested2: 
+    UPDATE accounts SET balance = balance - 1200
+    WHERE noisy_oper('upid', accountid, '=', 'checking')
+    AND noisy_oper('up', balance, '>', 200.0)
+    AND EXISTS (
+        SELECT accountid
+        FROM accounts_ext ae
+        WHERE noisy_oper('lock_id', ae.accountid, '=', accounts.accountid)
+            AND noisy_oper('lock_bal', ae.balance, '>', 200.0)
+        FOR UPDATE
+    );
+ <waiting ...>
+step c1: COMMIT;
+s2: NOTICE:  lock_id: text checking = text checking: t
+s2: NOTICE:  lock_bal: numeric 200 > numeric 200.0: f
+s2: NOTICE:  lock_id: text savings = text checking: f
+s2: NOTICE:  upid: text savings = text checking: f
+step wnested2: <... completed>
+step c2: COMMIT;
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid      balance        
+
+checking       200            
+savings        600            
+
+starting permutation: wx1 wxext1 wxext1 wnested2 c1 c2 read
+step wx1: UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+400            
+step wxext1: UPDATE accounts_ext SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+400            
+step wxext1: UPDATE accounts_ext SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+200            
+s2: NOTICE:  upid: text checking = text checking: t
+s2: NOTICE:  up: numeric 600 > numeric 200.0: t
+s2: NOTICE:  lock_id: text checking = text checking: t
+s2: NOTICE:  lock_bal: numeric 600 > numeric 200.0: t
+step wnested2: 
+    UPDATE accounts SET balance = balance - 1200
+    WHERE noisy_oper('upid', accountid, '=', 'checking')
+    AND noisy_oper('up', balance, '>', 200.0)
+    AND EXISTS (
+        SELECT accountid
+        FROM accounts_ext ae
+        WHERE noisy_oper('lock_id', ae.accountid, '=', accounts.accountid)
+            AND noisy_oper('lock_bal', ae.balance, '>', 200.0)
+        FOR UPDATE
+    );
+ <waiting ...>
+step c1: COMMIT;
+s2: NOTICE:  lock_id: text checking = text checking: t
+s2: NOTICE:  lock_bal: numeric 200 > numeric 200.0: f
+s2: NOTICE:  lock_id: text savings = text checking: f
+s2: NOTICE:  upid: text savings = text checking: f
+step wnested2: <... completed>
+step c2: COMMIT;
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid      balance        
+
+checking       400            
+savings        600            
+
+starting permutation: wx1 tocds1 wnested2 c1 c2 read
+step wx1: UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+400            
+step tocds1: UPDATE accounts SET accountid = 'cds' WHERE accountid = 'checking';
+s2: NOTICE:  upid: text checking = text checking: t
+s2: NOTICE:  up: numeric 600 > numeric 200.0: t
+s2: NOTICE:  lock_id: text checking = text checking: t
+s2: NOTICE:  lock_bal: numeric 600 > numeric 200.0: t
+step wnested2: 
+    UPDATE accounts SET balance = balance - 1200
+    WHERE noisy_oper('upid', accountid, '=', 'checking')
+    AND noisy_oper('up', balance, '>', 200.0)
+    AND EXISTS (
+        SELECT accountid
+        FROM accounts_ext ae
+        WHERE noisy_oper('lock_id', ae.accountid, '=', accounts.accountid)
+            AND noisy_oper('lock_bal', ae.balance, '>', 200.0)
+        FOR UPDATE
+    );
+ <waiting ...>
+step c1: COMMIT;
+s2: NOTICE:  upid: text cds = text checking: f
+s2: NOTICE:  upid: text savings = text checking: f
+step wnested2: <... completed>
+step c2: COMMIT;
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid      balance        
+
+cds            400            
+savings        600            
+
+starting permutation: wx1 tocdsext1 wnested2 c1 c2 read
+step wx1: UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
+balance        
+
+400            
+step tocdsext1: UPDATE accounts_ext SET accountid = 'cds' WHERE accountid = 'checking';
+s2: NOTICE:  upid: text checking = text checking: t
+s2: NOTICE:  up: numeric 600 > numeric 200.0: t
+s2: NOTICE:  lock_id: text checking = text checking: t
+s2: NOTICE:  lock_bal: numeric 600 > numeric 200.0: t
+step wnested2: 
+    UPDATE accounts SET balance = balance - 1200
+    WHERE noisy_oper('upid', accountid, '=', 'checking')
+    AND noisy_oper('up', balance, '>', 200.0)
+    AND EXISTS (
+        SELECT accountid
+        FROM accounts_ext ae
+        WHERE noisy_oper('lock_id', ae.accountid, '=', accounts.accountid)
+            AND noisy_oper('lock_bal', ae.balance, '>', 200.0)
+        FOR UPDATE
+    );
+ <waiting ...>
+step c1: COMMIT;
+s2: NOTICE:  lock_id: text cds = text checking: f
+s2: NOTICE:  lock_id: text savings = text checking: f
+s2: NOTICE:  upid: text savings = text checking: f
+step wnested2: <... completed>
+step c2: COMMIT;
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid      balance        
+
+checking       400            
+savings        600            
+
 starting permutation: wx1 updwcte c1 c2 read
 step wx1: UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance;
 balance        
@ -435,8 +702,10 @@ balance

 1050           
 step lockwithvalues: 
-	SELECT * FROM accounts a1, (values('checking'),('savings')) v(id)
-	  WHERE a1.accountid = v.id
+	-- Reference rowmark column that differs in type from targetlist at some attno.
+	-- See CAHU7rYZo_C4ULsAx_LAj8az9zqgrD8WDd4hTegDTMM1LMqrBsg@mail.gmail.com
+	SELECT a1.*, v.id FROM accounts a1, (values('checking'::text, 'nan'::text),('savings', 'nan')) v(id, notnumeric)
+	WHERE a1.accountid = v.id AND v.notnumeric != 'einszwei'
 	  FOR UPDATE OF a1;
 <waiting ...>
 step c2: COMMIT;
--- a/src/test/isolation/specs/eval-plan-qual.spec
+++ b/src/test/isolation/specs/eval-plan-qual.spec
@ -42,6 +42,16 @@ setup
 CREATE TABLE another_parttbl1 PARTITION OF another_parttbl FOR VALUES IN (1);
 CREATE TABLE another_parttbl2 PARTITION OF another_parttbl FOR VALUES IN (2);
 INSERT INTO another_parttbl VALUES (1, 1, 1);
+
+ CREATE FUNCTION noisy_oper(p_comment text, p_a anynonarray, p_op text, p_b anynonarray)
+ RETURNS bool LANGUAGE plpgsql AS $$
+ DECLARE
+  r bool;
+  BEGIN
+  EXECUTE format('SELECT $1 %s $2', p_op) INTO r USING p_a, p_b;
+  RAISE NOTICE '%: % % % % %: %', p_comment, pg_typeof(p_a), p_a, p_op, pg_typeof(p_b), p_b, r;
+  RETURN r;
+  END;$$;
 }

 teardown
@ -53,6 +63,7 @@ teardown
 DROP TABLE table_a, table_b, jointest;
 DROP TABLE parttbl;
 DROP TABLE another_parttbl;
+ DROP FUNCTION noisy_oper(text, anynonarray, text, anynonarray)
 }

 session "s1"
@ -62,6 +73,10 @@ step "wx1"	{ UPDATE accounts SET balance = balance - 200 WHERE accountid = 'chec
 # wy1 then wy2 checks the case where quals pass then fail
 step "wy1"	{ UPDATE accounts SET balance = balance + 500 WHERE accountid = 'checking' RETURNING balance; }

+step "wxext1"	{ UPDATE accounts_ext SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance; }
+step "tocds1"	{ UPDATE accounts SET accountid = 'cds' WHERE accountid = 'checking'; }
+step "tocdsext1" { UPDATE accounts_ext SET accountid = 'cds' WHERE accountid = 'checking'; }
+
 # d1 then wx1 checks that update can deal with the updated row vanishing
 # wx2 then d1 checks that the delete affects the updated row
 # wx2, wx2 then d1 checks that the delete checks the quals correctly (balance too high)
@ -89,7 +104,7 @@ step "writep2"	{ UPDATE p SET b = -b WHERE a = 1 AND c = 0; }
 step "c1"	{ COMMIT; }
 step "r1"	{ ROLLBACK; }

-# these tests are meant to exercise EvalPlanQualFetchRowMarks,
+# these tests are meant to exercise EvalPlanQualFetchRowMark,
 # ie, handling non-locked tables in an EvalPlanQual recheck

 step "partiallock"	{
@ -98,8 +113,10 @@ step "partiallock"	{
 	  FOR UPDATE OF a1;
 }
 step "lockwithvalues"	{
-	SELECT * FROM accounts a1, (values('checking'),('savings')) v(id)
-	  WHERE a1.accountid = v.id
+	-- Reference rowmark column that differs in type from targetlist at some attno.
+	-- See CAHU7rYZo_C4ULsAx_LAj8az9zqgrD8WDd4hTegDTMM1LMqrBsg@mail.gmail.com
+	SELECT a1.*, v.id FROM accounts a1, (values('checking'::text, 'nan'::text),('savings', 'nan')) v(id, notnumeric)
+	WHERE a1.accountid = v.id AND v.notnumeric != 'einszwei'
 	  FOR UPDATE OF a1;
 }
 step "partiallock_ext"	{
@ -231,6 +248,20 @@ step "updwctefail"  { WITH doup AS (UPDATE accounts SET balance = balance + 1100
 step "delwcte"  { WITH doup AS (UPDATE accounts SET balance = balance + 1100 WHERE accountid = 'checking' RETURNING *) DELETE FROM accounts a USING doup RETURNING *; }
 step "delwctefail"  { WITH doup AS (UPDATE accounts SET balance = balance + 1100 WHERE accountid = 'checking' RETURNING *, update_checking(999)) DELETE FROM accounts a USING doup RETURNING *; }

+# Check that nested EPQ works correctly
+step "wnested2" {
+    UPDATE accounts SET balance = balance - 1200
+    WHERE noisy_oper('upid', accountid, '=', 'checking')
+    AND noisy_oper('up', balance, '>', 200.0)
+    AND EXISTS (
+        SELECT accountid
+        FROM accounts_ext ae
+        WHERE noisy_oper('lock_id', ae.accountid, '=', accounts.accountid)
+            AND noisy_oper('lock_bal', ae.balance, '>', 200.0)
+        FOR UPDATE
+    );
+}
+
 step "c2"	{ COMMIT; }
 step "r2"	{ ROLLBACK; }

@ -282,6 +313,15 @@ permutation "wx2" "d2" "d1" "r2" "c1" "read"
 permutation "d1" "wx2" "c1" "c2" "read"
 permutation "d1" "wx2" "r1" "c2" "read"

+# Check that nested EPQ works correctly
+permutation "wnested2" "c1" "c2" "read"
+permutation "wx1" "wxext1" "wnested2" "c1" "c2" "read"
+permutation "wx1" "wx1" "wxext1" "wnested2" "c1" "c2" "read"
+permutation "wx1" "wx1" "wxext1" "wxext1" "wnested2" "c1" "c2" "read"
+permutation "wx1" "wxext1" "wxext1" "wnested2" "c1" "c2" "read"
+permutation "wx1" "tocds1" "wnested2" "c1" "c2" "read"
+permutation "wx1" "tocdsext1" "wnested2" "c1" "c2" "read"
+
 # test that an update to a self-modified row is ignored when
 # previously updated by the same cid
 permutation "wx1" "updwcte" "c1" "c2" "read"