memory errors

Larry Parmelee parmelee at wayback.cs.cornell.edu
Tue Feb 23 01:05:13 AEST 1988


In article <128 at thebes.Thalatta.COM> bossert at Thalatta.COM 
(John Bossert) writes:
> The true error message I was getting on a 11/780 with
> 8 Mb of interleaved memory was:
> 
> 	mcr0: soft ecc addr 110dd syn 26
> 

Credentials:
We have an 11/780, with 8Mb of memory interleaved on two memory
controllers, using 256kb memory boards, running 4.3BSD unix.

Until recently, we had lots of memory errors.  Finally, with the
help of our DEC technician, we figured out how to interpret that
message down to the board level.

First, "mcr#" is "Memory Controller #".  (That was easy, right?)
The "addr" part is a little more interesting.

We have 256kb memory boards, which works out to 0x40000 bytes
per board.  Memory error correction/detection works on 4-byte
"globs", so a 256kb memory board has

    0x40000(bytes) / 4 (bytes-per-glob) = 0x10000 (globs-per-board).

Now, the number following "addr" in the message above is the
"glob" number where the error occurred, so

    0x110dd'th glob / 0x10000 (globs-per-board) = board number 1

(Ignore the remainder for now).  For these purposes, the boards
are numbered starting with 0 (the memory in each memory controller
is considered individually).  Physically, board 0 is the leftmost
memory board in the given controller. 
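
The arithmetic above can be sketched in a few lines of Python.  This
is only an illustration of the division described here: the names
locate_board and GLOB_BYTES are made up for the example, and the
4-byte glob size is the one stated above.

```python
GLOB_BYTES = 4  # ECC works on 4-byte "globs", per the description above


def locate_board(glob_addr, board_bytes):
    """Return (board number, remainder) for a glob address.

    Boards are numbered from 0 within a single memory controller.
    """
    globs_per_board = board_bytes // GLOB_BYTES
    return divmod(glob_addr, globs_per_board)


# The address from the quoted message, with 256kb boards.
board, remainder = locate_board(0x110dd, 256 * 1024)
print("board %d, remainder 0x%x" % (board, remainder))
```

For the quoted message this gives board 1 with a remainder of 0x10dd.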

The remainder from above and the syndrome (the number following
"syn") can be used to figure out which chip had the problem, but
with 256kb memory going for $25 a board (used) nowadays, it wasn't
worth figuring out the rest.  We just replaced the board(s).

I think you said you had 1Mb boards.  In that case, it would work
out like this:  0x100000 bytes per board, or 0x40000 "globs" per
board.  Then 0x110dd / 0x40000 (dropping the remainder) = board
number 0.
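
To compare board sizes directly from a console line, something like
the following Python sketch might help.  The message layout is
assumed from the one line quoted above, the address is read as hex
(as in the arithmetic here), and the syndrome is passed through
untouched as text.  The names MSG and decode are hypothetical.

```python
import re

# Assumed layout, based only on the single quoted message:
#   mcr<N>: soft ecc addr <hex> syn <value>
MSG = re.compile(r"mcr(\d+): soft ecc addr ([0-9a-fA-F]+) syn (\S+)")


def decode(line, board_bytes, glob_bytes=4):
    """Return (controller, board, syndrome text), or None if no match."""
    m = MSG.search(line)
    if m is None:
        return None
    controller = int(m.group(1))
    glob_addr = int(m.group(2), 16)
    globs_per_board = board_bytes // glob_bytes
    return controller, glob_addr // globs_per_board, m.group(3)


line = "mcr0: soft ecc addr 110dd syn 26"
print(decode(line, 256 * 1024))    # 256kb boards
print(decode(line, 1024 * 1024))   # 1Mb boards
```

The same address lands on board 1 with 256kb boards and on board 0
with 1Mb boards, matching the two calculations above.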

Have fun!
-Larry Parmelee
parmelee at wayback.cs.cornell.edu
parmelee at cornell.uucp


