Wednesday, June 29, 2022

Repaired bit-flipping error in memory, running core diagnostics; one anomaly to look into

MOVING FAULT SUGGESTED CARD DEFECT, BUT POSSIBLE IT IS A COMMON SIGNAL

I had a defect on bit 0, so I moved that card up to the slot that generates bit 8. The fault moved to bit 8, a sign that I had a bad card. I found a spare and swapped it in, but then the fault appeared on bit 2. That bit is handled by the slot above the one where I had just put the spare card, which did not make sense to me.

That left a few possibilities to investigate. First, the act of inserting the card may have disturbed the card above it or its signal traces. Second, the new spare card may have its own defect, one that affects a signal on which the newly failing bit 2 card depends. Third, there may be a common signal whose path is broken, leading to the faults.

For example, the memory in the 1130 is divided into the upper and lower 4K of locations. There is a unique card for each half of the memory. Thus, bit 0 sense amplifiers are in slot A7 for the high 4K and slot B7 for the low 4K. A logic signal tells the card whether the current memory access is occurring in the upper or lower half of core, thus disabling the operation of the card that is not involved in that memory access. 

If the logic signal is not reaching a card, it may output a sense bit 1 when it should not, stepping on the intended card which is outputting a 0. There are also strobe and enabling voltage inputs to the cards which may be impacted if a trace on the backplane has failed.
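To make the half-select reasoning concrete, here is a minimal sketch of how a card that never receives its disable signal could corrupt a read. It is only a model: the function names are hypothetical, and it assumes the two cards' outputs for a bit are simply ORed together, which I have not confirmed against the schematics.

```python
def card_output(selected: bool, sensed_bit: int) -> int:
    """A sense card drives its sensed bit only while it is selected."""
    return sensed_bit if selected else 0


def read_bit(access_high: bool, high_sensed: int, low_sensed: int,
             low_select_broken: bool = False) -> int:
    """Combine the outputs of the high-4K and low-4K cards for one bit.

    Assumption: the two outputs are ORed together. If the select signal
    never reaches the low card, that card is modeled as always enabled.
    """
    high_selected = access_high
    low_selected = (not access_high) or low_select_broken
    return (card_output(high_selected, high_sensed)
            | card_output(low_selected, low_sensed))


# Access in the high half: the intended card senses 0, the other card senses 1.
healthy = read_bit(access_high=True, high_sensed=0, low_sensed=1)
faulty = read_bit(access_high=True, high_sensed=0, low_sensed=1,
                  low_select_broken=True)
```

In this model the healthy case reads 0, while the broken select line lets the uninvolved card push a 1 onto the result.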

CONTINUITY TEST AROUND THE CARDS WHERE I HAVE SEEN FAULTS

I created a list of the pins on these cards and tested that each one is well connected to the proper spots in the compartment. That eliminated any solid failures. There is still a chance that a connection sporadically goes open due to minute vibrations during operation. Catching that would require a scope on the signals at the moment the flipped-bit error occurs.

I doubt that an erratic connection is the cause, since I can load memory with all-zero words and then cycle the machine for long periods, reading every word in a giant loop through memory. This has not produced a parity stop, so the fault does not appear to be due to bad traces.
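The logic of that zero-fill test can be sketched as follows. This is an illustration only: it assumes a single odd-parity bit over the 16 data bits, which may not match the machine's actual parity arrangement, and the helper names are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Return the parity bit that makes the total count of 1s odd (assumed scheme)."""
    ones = bin(word & 0xFFFF).count("1")
    return 0 if ones % 2 == 1 else 1


def scan_for_parity_stop(stored, read_word):
    """Read every word once; return the first address whose parity fails, else None.

    stored    -- list of (word, parity) pairs as written to memory
    read_word -- callable(addr, word) returning the word as actually sensed
    """
    for addr, (word, parity) in enumerate(stored):
        sensed = read_word(addr, word)
        if parity_bit(sensed) != parity:
            return addr
    return None


# Memory filled with all-zero words, each stored with its parity bit.
zeros = [(0, parity_bit(0))] * 8

clean = scan_for_parity_stop(zeros, lambda addr, word: word)
# Hypothetical intermittent fault: the top data bit reads as 1 at address 5.
flaky = scan_for_parity_stop(
    zeros, lambda addr, word: word | 0x8000 if addr == 5 else word)
```

Any single bit that wrongly reads as 1 in an all-zero word changes the parity, so a long loop with no parity stop argues against a connection that drops out intermittently.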


PUT BACK SUSPECT CARD IN B6, FAULT AT BIT 8 RETURNS

The original card that was in slot B7 controlling bit 0 had been moved to slot B6 in order to verify or exclude a bad sense card. The problem moved from bit 0 over to bit 8, which is the assignment of the card in B6. This seemed to be definitive and I replaced it with a spare card that came with the machine.

The trouble was that with a spare in B6, I began getting faults on bit 2. That bit is controlled by the card above, in slot B5, and should have nothing to do with the spare. A second spare still gave me the bit 2 errors. On a hunch, I put another spare in B5 alongside the spare that was in B6. The problems went away!

Apparently there was a flaw in the card in B5 which was masked by the defect in the card originally in B7 and then swapped to B6. With two replacements, the memory seems to be working well.

LOADED THE HI CORE DIAGNOSTIC AND RAN IT

The diagnostic program runs six routines, setting bits to various patterns throughout memory to test for weak or defective bits. Five of the six routines ran successfully to completion as many times as I tried them. However, one routine failed consistently.

It is possible that I have some corruption in the core load from my dump and memory tool. Previously this was true for the CPU Instructions diagnostic, but with a reload of the proper contents atop the erroneous spots, it was able to run fully to completion. This may well be what is occurring with this failure, thus I am preparing a reload file for the portion of core that contains the routine, its unique subroutines and data areas. 

The routine, however, is a key one that must run properly or I have a different memory problem to troubleshoot. This one attempts to verify that memory addressing is correct by writing the address of each word as its contents, then reading memory to test whether its value is the same as its address.
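The idea behind this kind of check can be sketched in a few lines. This is not the diagnostic's actual code, only an illustration: the `AliasedMemory` class and its `aliases` map are hypothetical stand-ins for an addressing defect in which one address selects the wrong word.

```python
class AliasedMemory:
    """Toy memory in which some addresses wrongly select another word."""

    def __init__(self, size, aliases=None):
        self.words = [0] * size
        self.aliases = aliases or {}  # hypothetical: bad address -> word really selected

    def _decode(self, addr):
        return self.aliases.get(addr, addr)

    def write(self, addr, value):
        self.words[self._decode(addr)] = value & 0xFFFF

    def read(self, addr):
        return self.words[self._decode(addr)]


def address_test(mem, size):
    """Write each address as its own contents, then read everything back.

    Returns (address, expected, actual) for the first mismatch, or None.
    """
    for addr in range(size):
        mem.write(addr, addr)
    for addr in range(size):
        actual = mem.read(addr)
        if actual != addr:
            return (addr, addr, actual)
    return None


good = address_test(AliasedMemory(16), 16)
bad = address_test(AliasedMemory(16, aliases={5: 3}), 16)
```

With address 5 wrongly selecting word 3, the write for address 5 overwrites word 3, so the readback reports a mismatch at address 3 with 5 as the actual contents. A pattern test that writes the same value everywhere would not notice this, which is why the address-as-contents routine matters.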

The failure occurs immediately on the first address, which is suggestive of a code corruption issue rather than a hardware addressing defect, but I must be certain that word addressing is correct before I declare the memory to be 100% functional. 

Wait 3005 shows the address of the fault in the accumulator: 0800

Wait 3004 shows the expected value in EXT and the actual contents in ACC

After reloading the key parts of the diagnostic from the file I created tonight, I will attempt to run routine 2 again tomorrow. If the problem recurs, I can put in stops and watch the behavior, since this error is happening immediately after the routine begins to execute. 

