Monday, July 15, 2024

Diagnosing core memory issues now that I have a full display to watch

PATTERN DETECTED IN THE REAL PARITY ERRORS

I noticed that in every case either bit 11 was a 1 while the parity bit (bit 17) was a 0, or bit 11 was a 0 while bit 17 was a 1. In both situations the remainder of the word meant that bits 11 and 17 should have matched: either both 0 or both 1.
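A minimal sketch of the parity check behind this observation. It rests on assumptions not stated in this post: bit 0 is the most significant bit of the 16-bit data word, bit 17 is a parity bit covering data bits 8 through 15, and parity is odd. The function names are made up for illustration.

```python
# Assumptions (not from the post): bit 0 is the most significant data bit,
# bit 17 is the parity bit for data bits 8-15, and parity is odd.

def data_bit(word, n):
    """Return data bit n (0 = most significant) of a 16-bit word."""
    return (word >> (15 - n)) & 1

def expected_p17(word):
    """Parity bit value that makes bits 8-15 plus the parity bit hold an odd number of 1s."""
    ones = sum(data_bit(word, n) for n in range(8, 16))
    return 0 if ones % 2 == 1 else 1

def parity_ok(word, p17):
    """True when the stored parity bit agrees with the data bits it covers."""
    return p17 == expected_p17(word)
```

Under these assumptions, a spurious flip of bit 11 alone, or of bit 17 alone, makes `parity_ok` return False, which is the kind of mismatch described above.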

There are two cards that implement the inhibit and sense operations for bits 11 and 17, which are unique to the upper 4K of memory. This fits the symptoms. I swapped the 3475 cards between the lower and upper 4K, then swapped the 3466 cards between the lower and upper 4K. These swaps were at the K2/L2 and M4/N4 card slots. The error did not go away in the upper 4K, so it is NOT the cards. 

I loaded all of memory with zeroes, then ran a Storage Display scan, which normally stops when it hits a parity error. With the Parity Run switch on, it continued in spite of the errors. This was illuminating because I saw bits 7, 9 and 11 flipping on sporadically as the display continuously cycled around all of memory. 

These errors are also specific to certain address ranges. That is, they only occur in the upper 4K when address bit 9 is a 1 and for certain combinations of the four lowest address bits 12, 13, 14 and 15. It never fails if bits 14 and 15 are both 0, nor does it fail with just bit 15 set. However, it fails very reliably when bit 14 is on by itself or with other combinations of bits 12, 13 and 14. 
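One way to check a correlation like this is to note the addresses where the scan stops and tally them by address bit 9 and the low four bits. The sketch below assumes a 16-bit address with bit 0 as the most significant bit; the helper names are hypothetical and the list of failing addresses would have to be collected by hand.

```python
from collections import Counter

# Assumption (not from the post): addresses are 16 bits wide, bit 0 is most significant.

def addr_bit(addr, n):
    """Return address bit n (0 = most significant) of a 16-bit address."""
    return (addr >> (15 - n)) & 1

def low_pattern(addr):
    """Address bits 12-15 as a string, e.g. '0010' means bit 14 on by itself."""
    return format(addr & 0xF, "04b")

def tally_failures(failing_addrs):
    """Count failures per (bit 9 value, low-four-bit pattern) pair."""
    return Counter((addr_bit(a, 9), low_pattern(a)) for a in failing_addrs)
```

If the correlation holds, every key in the tally would have bit 9 equal to 1, and patterns such as '0000' and '0001' would never appear.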

Somehow, in all the swapping of cards and testing, I am now seeing errors in the low 4K of memory as well, with the same low address bit pattern described above. 

IMPLICATIONS AT THE FIRST GLANCE

The issue appears to be a failure of the inhibit line to suppress the flip of bits 7, 9 and 11 to the 1 state. In every write cycle, the X and Y axis currents will flip every selected core to a 1 state. With a counter current in the inhibit wire, that specific bit does not flip and remains at the 0 state it received during the preceding read cycle. 

However, the correlation with ranges of addresses makes me think that the core memory is far enough out of adjustment that it is only marginally working. Certain wires through the core planes carry just a bit more current - the address range correlation is why I suspect this. The inhibit current is right on the fringe, and when it is working against the slightly higher current of some of the address lines, it fails to block the flip of the core. 
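The marginal-inhibit idea can be illustrated with a toy model of one core during a write cycle. The current values and the switching threshold below are arbitrary units chosen only for illustration, not measurements from this machine, and `write_bit` is a hypothetical name.

```python
# Toy model of a single core in a write cycle. All numbers are arbitrary
# illustrative units, NOT measured values from the actual memory.

SWITCH_THRESHOLD = 75  # assumed net current needed to flip a core to 1

def write_bit(x_current, y_current, inhibit_current, threshold=SWITCH_THRESHOLD):
    """Return the state of the core after the write cycle (it starts at 0).

    The X and Y currents add in the selected core; the inhibit current opposes them.
    """
    net = x_current + y_current - inhibit_current
    return 1 if net >= threshold else 0

# Nominal case: two half-select currents with no inhibit write a 1.
nominal_one = write_bit(50, 50, 0)

# Nominal case: the inhibit current holds the core at 0.
nominal_zero = write_bit(50, 50, 50)

# Marginal case: a slightly strong X line plus a weak inhibit lets the core flip.
marginal = write_bit(58, 50, 32)
```

In this model the marginal case writes a 1 where a 0 was intended, and only for the address lines that happen to carry the extra current, which would match an error pattern tied to specific address bits.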

ADJUSTING CORE MEMORY IS A COMPLEX PROCESS

The brief summary of the adjustment process is 22 steps long, involving setting voltages, measuring temperatures, adjusting various settings while watching oscilloscope traces, and some differential voltage measurements. 

ALSO NEED TO CHECK WIRE WRAP THAT AFFECTS L2 AND N4 SLOTS

The cards involved in some of the error bits show some wire wrap additions on the backplane. I need to understand these and figure out whether they compensate for some trace failure that has since gotten worse, or are defects introduced when the CHI memory expansion was added to the machine.



