postgresql/contrib/zap_ltv/README

From ccshag@cclabs.missouri.edu Wed Jul 10 04:27:16 1996
Received: from realtime.cc.missouri.edu (realtime.cc.missouri.edu [128.206.212.69]) by ki.net (8.7.5/8.7.5) with ESMTP id EAA12722 for <scrappy@ki.net>; Wed, 10 Jul 1996 04:27:13 -0400 (EDT)
Received: from localhost (ccshag@localhost) by realtime.cc.missouri.edu (8.7.1/8.7.1) with SMTP id DAA00915; Wed, 10 Jul 1996 03:26:42 -0500 (CDT)
X-Authentication-Warning: realtime.cc.missouri.edu: ccshag owned process doing -bs
Date: Wed, 10 Jul 1996 03:26:41 -0500 (CDT)
From: "Paul 'Shag' Walmsley" <ccshag@cclabs.missouri.edu>
Sender: ccshag@cclabs.missouri.edu
To: postgres95-hackers@oozoo.vnet.net
cc: pixel@tiger.coe.missouri.edu, michael.siebenborn@ae3.Hypo.DE,
    scrappy@ki.net
Subject: Workaround for database corruption problems [long]
Message-ID: <Pine.SGI.3.93.960710023021.673B-100000@realtime.cc.missouri.edu>
X-Disclaimer: The opinions of this writer do not necessarily represent those of the University of Missouri-Columbia.
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
X-Status:

I've been poking through the guts of POSTGRES95 in pursuit of the
previously-mentioned "database corruption" problem, and I think that I
have some preliminary answers and a workaround for people to play around
with.

Specifically, the problem that I'm referring to is manifested by
pg_log, pg_time, or pg_variable either growing to huge sizes (usually
128MB) or showing evidence of internal corruption. Symptoms that appear
to the user include receiving messages like the following in psql or
monitor:
"WARN:cannot open segment 1 of relation pg_time"
"WARN:cannot write block 16384 of pg_log"
... which are followed up in a 'postmaster -d 3' debug log by a
"NOTICE:AbortTransaction and not in in-progress state"
It's been my experience that this problem is difficult to reproduce on
demand; I've tried several different ways and haven't succeeded. (If
anyone knows how to reliably reproduce this, please E-mail me.) I was,
however, able to obtain a database in a corrupted state from one of the
fellows on campus here running POSTGRES95; what follows is something of a
post-mortem.
----------
The problem with the database is that after any query, POSTGRES95 attempts
to generate a huge pg_log file and fails (running out of disk space) with
a "WARN:cannot write block 16384 of pg_log" message. The contents of the
database and its internal structure were reasonably intact - I discovered
that I was able to execute a single SQL query per backend, and could
therefore back up the contents of the database with a COPY command.
(I wasn't able to use pg_dump; it failed on its BEGIN TRANSACTION
step.)

Inspection of data/pg_log, data/pg_variable, and data/pg_time revealed the
following:
realtime /usr/postgres95 > od -x data/pg_log.old | head
0000000 0060 2000 2000 2000 f680 0000 ed00 0000
0000020 e300 0000 dc00 0000 ce00 0000 bc00 0000
0000040 b140 0000 a740 0000 9dc0 0000 9180 0000
0000060 8580 0000 7c80 0000 6fc0 0000 6480 0000
0000100 58c0 0000 4fc0 0000 4640 0000 3a80 0000
0000120 30c0 0000 23c0 0000 1900 0000 0c40 0000
0000140 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200 8800 0000 0000 0000 0800 0000 0000 0000
0000220 0800 0000 0000 0000 0800 0000 0000 0000

Now, already something is obviously wrong: pg_log shouldn't have any
values other than 0, 1, 2, 4, 8, a, c, or e in each nybble (see the
XID_... definitions in src/backend/access/transam.h:35; two of these
transaction statuses are packed into each nybble).
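
A quick way to check a pg_log for this kind of damage, short of
squinting at od output, is to scan it for nybbles outside that set.
What follows is only an illustrative sketch (it is NOT the recovery
script mentioned later, and it assumes the values listed above really
are the complete legal set):

#!/usr/bin/perl
# Sketch only: report the first nybble in pg_log that isn't one of the
# values listed above (0, 1, 2, 4, 8, a, c, e).  This is not the
# recovery script -- just a corruption detector.
use strict;

my $file = shift || 'data/pg_log';
my %ok = map { ($_, 1) } (0x0, 0x1, 0x2, 0x4, 0x8, 0xa, 0xc, 0xe);

open(LOG, $file) || die "can't open $file: $!";
binmode(LOG);
my ($buf, $pos) = ('', 0);
while (my $n = read(LOG, $buf, 8192)) {
    for my $i (0 .. $n - 1) {
        my $byte = ord(substr($buf, $i, 1));
        for my $nyb (($byte >> 4) & 0xf, $byte & 0xf) {
            next if $ok{$nyb};
            printf("suspect nybble 0x%x at byte offset %d\n",
                   $nyb, $pos + $i);
            exit 1;
        }
    }
    $pos += $n;
}
close(LOG);
print "pg_log nybbles look sane\n";
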
realtime /usr/postgres95 > od -x data/pg_variable.old | head
0000000 0058 2000 2000 2740 f240 0000 e400 0000
0000020 d780 0000 cac0 0000 b840 0000 a900 0000
0000040 9f80 0000 9280 0000 8580 0000 78c0 0000
0000060 6c80 0000 6200 0000 5800 0000 4f80 0000
0000100 44c0 0000 39c0 0000 2d80 0000 2600 0000
0000120 1c00 0000 0e00 0000 0000 0000 0000 0000
0000140 0000 0000 0000 0000 0000 0000 0000 0000
*
0000700 0000 01bc 0000 009c 0014 ffff ffff 0000
0000720 0004 5f0b 0000 0000 0000 0ba3 0000 0c23

This is obviously broken too; the only bytes that should be non-zero in
pg_variable are the first 16 (see src/backend/access/transam.h:99).
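
The same kind of sanity check is even simpler for pg_variable; here is
an equally illustrative sketch (again, not the recovery script) that
just verifies that everything past byte 16 is zero:

#!/usr/bin/perl
# Sketch only: complain if anything past the first 16 bytes of
# pg_variable is non-zero, per the layout referenced above.
my $file = shift || 'data/pg_variable';
open(VAR, $file) || die "can't open $file: $!";
binmode(VAR);
undef $/;                       # slurp the whole file
my $data = <VAR>;
close(VAR);
if (substr($data, 16) =~ /[^\0]/) {
    printf("pg_variable: non-zero byte at offset %d\n", 16 + length($`));
} else {
    print "pg_variable looks sane\n";
}
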
realtime /usr/postgres95 > od -x data/pg_time.old | head -10
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0004000 31c6 09ae 0000 0000 31c6 09b3 31a9 79a6
0004020 31a9 79a6 31a9 79a8 31a9 79a9 31a9 79a9
0004040 31a9 79a9 31a9 79aa 31a9 79aa 31a9 79aa
0004060 31a9 79aa 31a9 79aa 31a9 79ab 31a9 79ab
0004100 31a9 79ab 31a9 79ac 31a9 79ac 31a9 79ac
0004120 31a9 79ac 31a9 79ad 31a9 79ad 31a9 79ae
0004140 31a9 79ae 31a9 79af 31a9 79af 31a9 79af
0004160 31a9 79b0 31a9 79b0 31a9 79b4 31a9 79b4
realtime /usr/postgres95 > od -x data/pg_time.old +32b | head -10
0032000 6665 6520 666f 7220 31c5 cb61 696e 6720
0032020 6465 7665 6c6f 7065 7273 2062 7579 2061
0032040 206c 6f77 2d69 6e63 6f6d 6520 6170 6172
0032060 746d 656e 7420 636f 6d70 6c65 7869 6e20
0032100 7468 6520 7472 6561 7375 7265 7227 7320
0032120 7761 7264 2066 726f 6d20 7468 6520 6665
0032140 6465 7261 6c20 676f 7665 726e 6d65 6e74
0032160 2066 6f72 2024 313b 2064 6576 656c 6f70
0032200 6572 7361 6c73 6f20 31c5 cb7c 31be 91a2
0032220 31be 91a2 31be 91a2 31be 91a2 31be 91a3

The first part of pg_time looked okay -- filled with 4-byte time values
with the occasional "00 00 00 00" here and there -- but inspection into
pg_time's guts revealed some preposterous numbers.
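
Those 4-byte values decode plausibly as Unix timestamps -- 0x31c609ae,
for instance, lands in mid-June 1996 -- so one way to eyeball pg_time
is to run each entry through localtime and see whether the dates stay
believable. An illustrative sketch, assuming the entries really are
32-bit seconds-since-the-epoch values as the dump above suggests:

#!/usr/bin/perl
# Sketch only: print each 4-byte pg_time entry as a date so that
# implausible values stand out.  "N" (big-endian) matches the SGI
# dump above; a little-endian machine would want "V" (or native "L").
my $file = shift || 'data/pg_time';
open(TIME, $file) || die "can't open $file: $!";
binmode(TIME);
my ($buf, $entry) = ('', 0);
while (read(TIME, $buf, 4) == 4) {
    my $t = unpack("N", $buf);
    printf("entry %d: %s\n", $entry, scalar localtime($t)) if $t;
    $entry++;
}
close(TIME);
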
For your comparing pleasure, I include copies of what these files
_should_ look like -- more or less -- from a newly initdb'd database
directory:
realtime /usr/postgres95 > od -x data-new-initdb/pg_log
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200 8aaa aaaa aaaa aaa8 0000 0000 0000 0000
0000220 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000
realtime /usr/postgres95 > od -x data-new-initdb/pg_variable
0000000 0000 0000 0000 0222 0000 021e 0000 5120
0000020 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000
realtime /usr/postgres95 > od -x data-new-initdb/pg_time
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0004000 31e3 4fb3 0000 0000 31e3 4fb4 31e3 4fb4
0004020 31e3 4fb4 31e3 4fb5 31e3 4fb5 31e3 4fb5
0004040 31e3 4fb6 31e3 4fb6 31e3 4fb6 31e3 4fb6
0004060 31e3 4fb7 31e3 4fb7 31e3 4fb7 31e3 4fb7
0004100 31e3 4fb8 31e3 4fb8 31e3 4fb8 31e3 4fb9
0004120 31e3 4fb9 31e3 4fb9 31e3 4fb9 31e3 4fba
0004140 31e3 4fba 31e3 4fba 31e3 4fbb 31e3 4fbb
0004160 31e3 4fbb 31e3 4fbb 31e3 4fbb 0000 0000
0004200 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000

What's more, if I view pg_log, pg_variable, and pg_time with a text
editor, I discover that the corrupted areas contain fragments of tuples
from the database!
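
This is easy to confirm mechanically, too; a minimal strings(1)-style
scan (again, just an illustration, not part of the workaround below)
pulls the leaked tuple text right out of these files:

#!/usr/bin/perl
# Sketch only: print runs of 16 or more printable characters from a
# binary file -- enough to spot tuple text that has leaked into
# pg_log, pg_time, or pg_variable.
my $file = shift || die "usage: $0 file\n";
open(F, $file) || die "can't open $file: $!";
binmode(F);
undef $/;                       # slurp
my $data = <F>;
close(F);
while ($data =~ /([\040-\176]{16,})/g) {
    printf("offset %d: %s\n", pos($data) - length($1), $1);
}
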
----------
Since I have been unable to reproduce this problem on demand (and
therefore don't have a complete picture of all of the queries executed
on the database), I've only been able to speculate on the original cause of the
problem. One explanation would be some kind of buffer manager bug; or
perhaps the backend is running low on file descriptors and bollixing
existing ones? I intend to continue tracking this down as I have the
time, but in the interim, I've created a quick Perl script that is
designed to restore some sense of rationality to the
pg_{log,time,variable} files. It's a horribly barbaric way to do this --
at the very least, it almost certainly breaks time travel, and will
almost certainly result in duplicate oid problems -- but it restored my
pathological database here to a sufficiently working state such that I
could run pg_dump on it and do various other queries with it. The script
is included at the bottom of this message.

Anyway, fellow posthackers, assuming that no one raises any red flags
about this, I'll post it on postgres95@oozoo.vnet.net as a temporary
stop-gap workaround to allow people to recover their databases when this
sort of problem occurs.

I wish I had more time to do this kind of thing.

- Paul "Shag" Walmsley <ccshag@cclabs.missouri.edu>
"Knowing is not enough." -- Hal Hartley, "Surviving Desire"