Monday, June 20, 2022

Narrowing in on the two failures after verifying they are consistent; both are likely resolved

FAILURE 1 - STX TEST FAILS

The test that fails here is pretty simple. A value of xFFFF is stored in a fixed location, then index register 1 (IX1) is loaded with the value x0000 and an STX instruction stores the contents of IX1 into that same fixed location. The fixed location is then loaded, and if it isn't zero, the STX didn't store properly and the test stops.
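As a rough illustration of the check described above, here is a minimal Python sketch, not the actual diagnostic code. The address `FIXED`, the helper names and the result strings are all made up for the example; only the sequence of steps comes from the description.

```python
FIXED = 0x0100  # hypothetical address of the fixed location

def stx(memory, address, index_register):
    """Model of a working STX: store the index register at address."""
    memory[address] = index_register & 0xFFFF

def broken_stx(memory, address, index_register):
    """Model of a failing STX: the store never happens."""
    pass

def run_stx_test(store_index):
    memory = {FIXED: 0xFFFF}          # preload the fixed location with xFFFF
    ix1 = 0x0000                      # load index register 1 with x0000
    store_index(memory, FIXED, ix1)   # STX: IX1 -> fixed location
    loaded = memory[FIXED]            # load the fixed location back
    return "pass" if loaded == 0 else "error stop"

print(run_stx_test(stx))
print(run_stx_test(broken_stx))
```

Preloading xFFFF matters: if the location started at zero, a store that never happened would go unnoticed.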

Single stepping always works properly, but at speed this seemed to fail consistently. I first reran the test to verify that it misbehaves at normal run speed. Embarrassingly, I found that the instruction immediately after the point where I had stopped when single stepping was wrong: it was another copy of the error wait, 30DF, instead of the proper instruction.

When I fixed the incorrect value, the machine ran right through this with no errors at all. So this was not a failure of the machine processing STX instructions.

FAILURE 2 - MULTIPLY/DIVIDE LOOP GETS PARITY ERROR

This is a long loop that runs through all possible values from lowest negative to highest positive, doing a multiply and then a divide. It uses four seed values by which it multiplies and divides, thus four loops from -32768 to +32767. That is 65,536 multiplies and 65,536 divides for each of the four seed values.

The average execution time is 33 microseconds for the multiply being done and 76 microseconds for the divide. That gives us about 8.7 seconds of multiplication and 20 seconds of division, or a total loop time of roughly 29 seconds. That is 1/4 of the entire diagnostic test's execution time for this one comprehensive multiply-divide test.
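The arithmetic behind those figures can be checked with a short Python sketch. The constants are the ones quoted above; the variable names are just for illustration, and the total counts only the multiply and divide instructions themselves, not the rest of the loop.

```python
# Figures taken from the text above.
SEEDS = 4
VALUES_PER_SEED = 65_536          # -32768 through +32767
MULTIPLY_US = 33                  # average multiply time, microseconds
DIVIDE_US = 76                    # average divide time, microseconds

pairs = SEEDS * VALUES_PER_SEED
multiply_s = pairs * MULTIPLY_US / 1_000_000
divide_s = pairs * DIVIDE_US / 1_000_000
total_s = multiply_s + divide_s

print(f"{pairs:,} multiply/divide pairs")
print(f"multiply: {multiply_s:.2f} s, divide: {divide_s:.2f} s, total: {total_s:.2f} s")
```

This works out to 262,144 pairs and a little under 29 seconds of pure multiply and divide time; the loop's other instructions add to that.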

I obviously can't hand step through 262,144 pairs of multiply and divide, but this one does trigger a parity stop which is a signal that I can use to latch up the scope and/or logic analyzer. I ran this again to be sure that it does consistently fail with the parity stop, probably because it executes so many times that this sporadic issue is sure to crop up. 

These parity errors don't appear to exist in the core memory itself, just in the value read into the B register during the read part of a core memory cycle. I believe this because I can immediately run a Storage Display loop that reads all memory; that scan never sees a parity error, so the data stored in core is not wrong. Instead, it seems that bit 0 of the B register is set in error during a read cycle.

I will monitor the sense amplifier output to see whether we are getting bad sensing or whether something else is causing the B register Bit 0 to latch on. I have two other leads which I will hang on some of the gating signals that might cause other random data to flip on the bit latch.

The sense amplifiers of the SJ-4 memory are split: one card handles bits 0 and 6 for addresses from 0 to 4095 and the other handles the same two bits for addresses from 4096 to 8191. Thus there are two different sense amplifiers, with an addressing bit gating whether the lower 4K or higher 4K sense amp is connected to the output.
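To make the split concrete, here is a small Python sketch of which card serves a given address. The function name and the "lower"/"upper" labels are my own; the assumption that the gating bit is the one worth 4096 follows directly from the 0-4095 / 4096-8191 ranges above.

```python
MEMORY_SIZE = 8192      # addresses 0..8191 in total

def sense_card_for(address: int) -> str:
    """Return which bit 0/6 sense amplifier card serves this address."""
    if not 0 <= address < MEMORY_SIZE:
        raise ValueError(f"address {address} outside 0..{MEMORY_SIZE - 1}")
    # The address bit worth 4096 (0x1000) selects the upper 4K card.
    return "upper" if address & 0x1000 else "lower"

for addr in (0, 4095, 4096, 8191):
    print(addr, sense_card_for(addr))
```

A failure that only ever shows up for "lower" addresses would point at that one card or its connections.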

So far, my issues have all occurred in the lower 4K, but I could relocate the failing code up above the line and see if the results are the same. That would point me at a bad card or connection if it only fails in lower core addresses. Fortunately, I don't have to do this - see below. 

The B register bit 0 flip-flop has a number of inputs coming from the A register, the I register and the I/O (device controller) registers. These should only be passed on to the latch if the corresponding sample pulse signal goes negative. For example, if -A to B SP 0-7 is activated while the A bit is 1 (gate signal -A Bit 0 is low), then this triggers the latching of the B Bit 0 flip-flop. Similarly, -I/O to B SP 0-7 and -I to B SP 0-7 will latch for a 1 in I/O or I bit 0.

The pulses are sent to all eight bits, 0 to 7, yet only bit 0 is latching up. It cannot be an error in the generation of these sample pulses, but it might be a signal path fault bringing that signal to the pin for the Bit 0 instance of the B register logic. It could also be a path error with -Sense Amp Bit 0 coming to the card. 
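The set conditions for the B bit 0 latch, as described above, can be modeled roughly in Python. This is a simplified logical model under my own assumptions: every argument is True when the signal is asserted (the real signals are active low), the sense amplifier input is treated as ungated, and the function and parameter names are invented for the example.

```python
def b_bit0_set(sense_amp_bit0: bool,
               a_to_b_sp: bool, a_bit0: bool,
               io_to_b_sp: bool, io_bit0: bool,
               i_to_b_sp: bool, i_bit0: bool) -> bool:
    """Return True if any path would set the B register bit 0 latch."""
    return (sense_amp_bit0                   # core read sensed a 1
            or (a_to_b_sp and a_bit0)        # A register gated in
            or (io_to_b_sp and io_bit0)      # I/O register gated in
            or (i_to_b_sp and i_bit0))       # I register gated in

# A 1 in A bit 0 without its sample pulse must not set the latch.
print(b_bit0_set(False, False, True, False, False, False, False))
# A spurious 1 on the sense amplifier line sets it regardless of the pulses.
print(b_bit0_set(True, False, False, False, False, False, False))
```

The model shows why there are two families of suspects: a bad sense amplifier output, or a sample pulse path that wrongly gates in a register bit for this one bit position.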

I wrote up the relevant pins and paths to verify, applying the VOM to the backplane to test connectivity before I start the scope and logic analyzer captures. All the paths were well connected. Interestingly, the path from the sense amp up to the edge connector had a wire wrap on exactly this bit. It was good, however, so I moved on.

Using the scope and triggering on the generation of -Parity Stop, I could see a clear 1 bit coming from the sense amplifier line. The memory module uses identical SLT cards (type 3475), each handling the inhibit and sense duties for a pair of bits over a 4K group of addresses, so it hosts 18 of these cards.

The locations for bits 0 and 6 are A7 and B7 in the B gate, C1 compartment, which is where the memory sits. I swapped the card with another, the one in B6, which is responsible for other bits. I ran the Multiply-Divide test again and got a Parity Stop again, but this time the bit that was flipped on spuriously was bit 8! That is the responsibility of the card in B6.

My working assumption is that the card currently in B6 has some fault that causes it to sometimes report a 1 value when the core was actually zero. The museum had a box full of spare SLT cards including a 3475. I will swap in the spare card and see whether I can get this test to run successfully. 

If it does, then all of the CPU instructions were validated by the diagnostic and I can consider both the CPU and the memory (because of this replacement) to be good. I will do the card change and retest tomorrow as it is the end of my time in the shop for today.
