ERROR STOP WE ARE TRYING TO CAPTURE
The IBM 1130 employs odd parity to detect core memory errors: each 8-bit half of a memory word has an associated parity bit, set so that the count of 1 bits in the half plus its parity bit is odd. If the count is not odd when retrieving a word, the machine stops with a Parity Check error. My replacement for the core memory uses a much more reliable memory technology and thus does not bother storing parity bits with each word. It instead generates the correct value for the parity bits as a word of data is read.
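As a minimal sketch of that parity generation (Python used purely for illustration, not the FPGA code on the board; the function names are hypothetical), the two parity bits for a 16-bit word can be derived like this:

```python
def odd_parity_bit(half):
    """Return the parity bit that makes the total number of 1 bits odd."""
    ones = bin(half & 0xFF).count("1")
    # If the data already has an odd count, the parity bit must be 0.
    return 0 if ones % 2 == 1 else 1


def parity_bits(word):
    """Parity bits for the upper and lower 8-bit halves of a 16-bit word."""
    upper = (word >> 8) & 0xFF
    lower = word & 0xFF
    return odd_parity_bit(upper), odd_parity_bit(lower)


def parity_check_ok(word, p_upper, p_lower):
    """True when both halves, with their parity bits, have odd parity."""
    return (p_upper, p_lower) == parity_bits(word)
```

A word of all zeros needs both parity bits set to 1, which is why a missed pulse on either a data bit or a parity bit shows up as a Parity Check.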
I can load and display memory several ways using my 1130 MRAM memory replacement board, with no detected errors. However, if I load the IBM high core memory diagnostic program and run it, the program encounters parity stops at predictable, repeatable points in the code. This happens after the code has successfully run through the same point more than a thousand times.
I could also reproduce the error by storing and then displaying words at certain addresses. Any data word with a 1 in bits 8, 12, 13 and 14 would come back with bits 12, 13 and 14 as 0 instead of 1. Because an odd number of 1 bits had been removed, the parity bit would have had to change to preserve odd parity, but it still held the value that was correct when bits 12, 13 and 14 read as 1, thus triggering the parity error.
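The failure case above can be worked through in a few lines. This is only a sketch; it assumes the IBM convention that bit 0 is the most significant bit of the 16-bit word, and the helper names are made up for illustration:

```python
def mask(bits):
    """Build a 16-bit mask from IBM-style bit numbers (assumed: bit 0 = MSB)."""
    m = 0
    for b in bits:
        m |= 1 << (15 - b)
    return m


def half_is_odd(half, parity_bit):
    """True when an 8-bit half plus its parity bit holds an odd number of 1s."""
    return (bin(half & 0xFF).count("1") + parity_bit) % 2 == 1


stored = mask([8, 12, 13, 14])                     # word written to memory
read_back = stored & ~mask([12, 13, 14]) & 0xFFFF  # bits 12-14 fail to set

# Parity bit generated for the lower half of the word as stored.
parity = 0 if bin(stored & 0xFF).count("1") % 2 == 1 else 1

ok_as_stored = half_is_odd(stored & 0xFF, parity)       # True
ok_as_read = half_is_odd(read_back & 0xFF, parity)      # False: parity stop
```

Losing three 1 bits flips the parity of the half, so the unchanged parity bit no longer matches and the machine stops.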
LOGIC ANALYZER SETUP
I used my DSLogic USB logic analyzer because it is very portable and easy to connect, but it is limited to 16 channels of data capture. That forces me to move the probes iteratively as I narrow in on a problem. Of course, my initial 16 channels of data need to provide some clue about where to look next.
I need a trigger signal, which is the +Parity Stop signal when the error is first recognized. I will capture the two parity bit flipflop values and the signals generated when a halfword plus the parity bit has even parity. The two sense signals from my board whose pulses set a 1 value into the parity bits, -Sense 16 and -Sense 17, will also be recorded. Rounding this out will be the +Storage Read signal, to place the activity in the context of a memory cycle.
The other eight channels will record sets of data bits from the Storage Buffer Register (SBR) or address bits from the Storage Address Register (SAR). Since there are 16 data bits and 13 address bits, it will take at least four runs to record all of them in conjunction with the eight channels from the paragraph above.
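The run count follows from simple division. A quick sketch, with hypothetical signal labels used only to make the plan readable:

```python
import math

DATA_BITS = 16       # SBR data bits, from the text
ADDRESS_BITS = 13    # SAR address bits, from the text
SPARE_CHANNELS = 8   # channels left after the eight fixed signals


def plan_runs(signals, per_run):
    """Split the signal list into consecutive groups of at most per_run."""
    return [signals[i:i + per_run] for i in range(0, len(signals), per_run)]


# Hypothetical labels; the real probe points are backplane pins.
signals = [f"SBR{b}" for b in range(DATA_BITS)] + \
          [f"SAR{b}" for b in range(ADDRESS_BITS)]

runs = plan_runs(signals, SPARE_CHANNELS)
min_runs = math.ceil(len(signals) / SPARE_CHANNELS)  # 29 signals / 8 channels
```

The last run has only five signals, leaving three channels free for whatever looks suspicious after the earlier captures.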
OSCILLOSCOPE USED TO LOOK AT SENSE PULSES FOR BITS 12, 13 AND 14
The four channel scope was set up to trigger on the +Parity Stop signal in one channel and display the -Sense Bit 12, -Sense Bit 13 and -Sense Bit 14 signals on the other three channels. I wanted to confirm whether the 1130 MRAM board is emitting pulses to set the bits to a 1 or not.
Indeed, I could see pulses coming from my board, but the SBR bit was not being set in the 1130. I experimented with longer pulses, with no effect. I then tried separating the pulses from one another, using 40 ns pulses within each 75 ns state machine step, but the results became worse.
RECOGNIZING THAT THE SAME OLD UNEXPLAINED ANALOG ISSUE PERSISTS
I rebuilt my board entirely from the prior design that seemed to have too much ground bounce, thus encountering spurious retriggering that produced sense bit pulses at improper times. I continually strengthened the ground plane, the power plane and the size of the power connections within my board. I drove the pulses with very fast discrete transistors controlled by the logic chips. I varied timing and spread out the bit setting. Ultimately, none of these changes gave me a memory that worked reliably and consistently.
There are constraints on how early in a read cycle I can set bits in the SBR. I believe I must wait at least 450 ns so that I am in clock step T1 of the four clock step read cycle (T0 - T3), but I must also complete all the bit settings before reaching clock step T2, because the 1130 may begin gating the results in the SBR to other registers in the system within that step.
The two parity bits do not need to be emitted within that constraint. They only need to be set by the end of the read cycle since they are interrogated midway through the ensuing write cycle at clock step T6. This still requires the 16 data bits to be pulsed in that interval.
This gave me a tight window of 450 ns in which to set 16 bits, just over 28 ns if each bit were set separately. The Solid Logic Technology (SLT) family used in the IBM 1130 is 30 ns nominally, making this impossible to achieve with a single bit in a step.
If I pulse two bits in each FPGA step, then I have just over 56 ns for the pulse, which should be long enough. However, since I don't understand what is happening with the analog behavior that leads to this issue, I am not satisfied that such a change is sufficient.
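The pulse budget in the last two paragraphs is straightforward arithmetic. A small sketch, using only the figures quoted above (the function name is made up):

```python
WINDOW_NS = 450.0        # clock step T1, the window for setting data bits
DATA_BITS = 16           # bits that must be pulsed into the SBR
SLT_NOMINAL_NS = 30.0    # nominal SLT figure quoted in the text


def pulse_budget_ns(bits_per_step):
    """Time available per step when bits_per_step bits are pulsed together."""
    steps = DATA_BITS / bits_per_step
    return WINDOW_NS / steps


one_bit = pulse_budget_ns(1)    # 28.125 ns, under the 30 ns nominal figure
two_bits = pulse_budget_ns(2)   # 56.25 ns, comfortably above it
```

Pulsing four bits per step would stretch the budget to 112.5 ns, at the cost of more simultaneous current through the ground connection.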
MUSINGS
Each pulse sent to the 1130 is in fact a current sink from the IBM 1130 through an open collector transistor on my board. The power rail on my board is not involved in this current flow, only in the minor 1.5 mA drive current to the transistors and the minor switching current in the 74LVC08 quad 2-input AND gate that delivers the 1.5 mA to each transistor.
The ground plane of my board is joined to the ground bus of the 1130 system with stranded 18 gauge wire, which should keep my ground plane from straying too much from the 1130's. However, my hunch is that this is where the problems arise. When I watched pulses that failed to set the SBR bits, the pulses didn't make it all the way to zero volts on the scope. They seemed to bottom out higher, which could be caused by a voltage differential between the ground planes.
With an inductance of 800 nanohenries and an effective resistance at 2.2 MHz of approximately 60 milliohms, the resulting impedance is around 20 ohms, giving me a voltage drop of almost one-half volt on the ground conductor for those high frequency signals if they were a continuous train.
This is quick and dirty, but it is consistent with the pulse bottom observed on the scope rising above 0V. The germanium diodes in SLT circuitry have a voltage drop around 0.3V, thus I could easily drive up the pulse bottom so high that it fails to switch the transistor in the register. Depending upon what other pulses were produced close in time to the affected one and what ringing might occur in the ground wire, I could see that it would be possible to get instances where it fails to set the bit.
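For anyone who wants to rerun the estimate, here is a sketch that treats the ground wire as an ideal series resistance and inductance. That model is an assumption, and the current passed to the drop function is a hypothetical input, not a measured value. With the quoted inductance and frequency, this idealized formula gives roughly 11 ohms, so the round figure above should be read as an order-of-magnitude estimate.

```python
import math

L_HENRY = 800e-9   # ground wire inductance quoted in the text
R_OHM = 0.060      # effective resistance at 2.2 MHz quoted in the text
F_HZ = 2.2e6       # frequency quoted in the text


def series_rl_impedance(r_ohm, l_henry, f_hz):
    """Magnitude of an ideal series R-L impedance at frequency f_hz."""
    reactance = 2 * math.pi * f_hz * l_henry
    return math.hypot(r_ohm, reactance)


def ground_drop_volts(current_a, z_ohm):
    """Voltage across the ground wire for a given (hypothetical) current."""
    return current_a * z_ohm


z = series_rl_impedance(R_OHM, L_HENRY, F_HZ)
```

The inductive term dominates completely; the 60 milliohm resistance barely changes the result, which is why a shorter or wider ground connection matters more than a thicker one.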
I still don't see how specific data patterns cause the failures. These are spread across three SLT cards in the B gate compartment B1 at H2/H3, K2/K3 and L2/L3, across two of the cables between my board and the 1130 and across multiple parts on my board.
DOING MORE INVESTIGATION
After tightening up the FPGA code even further, I found that the 1130 would run for 5-15 seconds before encountering a parity stop. I noticed that bits 10 and 13 were the most likely to not register in the SBR when they should be 1.
I then hooked the scope up to the -Sense Bit 10 and -Sense Bit 13 pins on the 1130 to watch the signals when a failure tripped a parity stop. Interestingly, putting the scope probe on one of the pins dramatically reduced the rate of that bit failing to set. Putting probes on both led to the machine running 25 to 30 seconds, looping through memory successfully, prior to hitting a parity stop. The machine executes almost 277,778 reads and writes per second, thus the failure rate was around once per 7 million accesses. However, only words with a susceptible bit set to 1 can cause a parity error; if about half of the words qualify, the actual error rate is closer to 1 in 3.5 million accesses.
Close, so close, but far from acceptable when the computer may run for many hours to days. However, the fact that putting leads on the backplane pins lowers the error rate is a tantalizing clue. The scope doesn't show ringing on the signal when observed at the pin, but that may be due to the impedance of the scope probe, that is, its capacitance and resistance. The input resistance should be around 10 megohms and the capacitance perhaps 10 picofarads. For 100 MHz signals, the impedance is closer to 100 ohms and at 1 MHz the impedance is still around 10K ohms.
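The failure rate arithmetic can be sketched as follows. The fraction of accesses that touch a susceptible word is an assumption (one half), chosen because it reproduces the 1 in 3.5 million figure:

```python
ACCESSES_PER_SECOND = 277_778   # memory cycles per second, from the text
SUSCEPTIBLE_FRACTION = 0.5      # assumption: half the words have the bit set


def accesses_before_failure(run_seconds):
    """Total memory accesses completed before the parity stop."""
    return ACCESSES_PER_SECOND * run_seconds


def susceptible_accesses(run_seconds):
    """Accesses that could actually have failed, under the assumed fraction."""
    return accesses_before_failure(run_seconds) * SUSCEPTIBLE_FRACTION


low = accesses_before_failure(25)    # 6,944,450 accesses
high = accesses_before_failure(30)   # 8,333,340 accesses
```

At that rate a machine running for a full day would still hit thousands of parity stops, so the improvement from the probes is a clue rather than a fix.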
The effect of the frequency dependent impedance is that the probe absorbs the higher frequencies more than lower, rounding the pulses. It acts to slow the rise and fall times of the pulses, which appears to improve the reliability of the memory. Thus I may need to develop a filter to produce similar but larger rounding of the pulses.
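The frequency dependence of the probe loading can be sketched with an ideal parallel R-C model. The 10 megohm and 10 picofarad values are the rough figures from above, and real probes will differ:

```python
import math

PROBE_R_OHM = 10e6      # rough probe input resistance from the text
PROBE_C_FARAD = 10e-12  # rough probe capacitance from the text


def probe_impedance(f_hz, r_ohm=PROBE_R_OHM, c_farad=PROBE_C_FARAD):
    """Magnitude of an ideal resistor and capacitor in parallel at f_hz."""
    omega_rc = 2 * math.pi * f_hz * r_ohm * c_farad
    return r_ohm / math.sqrt(1 + omega_rc ** 2)


z_100mhz = probe_impedance(100e6)  # on the order of 100 to 200 ohms
z_1mhz = probe_impedance(1e6)      # on the order of 10 to 20 kilohms
```

Under this model the impedance falls by a factor of 100 between 1 MHz and 100 MHz, which matches the picture of the probe loading the fast edges far more than the body of the pulse.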