Sunday, September 7, 2014

Brainstorming possible causes for 1131 memory problems, while traveling


Although I had intended not to blog until I returned home and worked on the system again, I have been thinking quite a bit about what might be causing the cascading problems with the processor. I decided to go through some of the possibilities I am positing, as a way of documenting the thinking process behind vintage system restoration when roadblocks appear.

These are in no particular order and some may be quite unlikely, but I can devise tests to validate or disprove the more likely ones first, eventually settling on the actual cause.

Corrosion on backplane pins and/or SLT card connectors

If the pins and/or connectors have oxidized from age, the oxide would have formed on the exposed surfaces but not on the mating areas where the metal of both sides was pressed together. Due to minute irregularities in the surfaces of the metals, the actual areas of contact are not the entire face but only the high points of both sides.

After removal and insertion, if the alignment of the card is slightly different, high spots on a pin may ride on high oxide spots on the connector, or vice versa, so that contact is high resistance or an open circuit. I am not sure what I would do if this is the case, especially if it involves the connectors with their recessed metal, but it might involve some very careful oxide removal process. There is a very thin gold plate on the pins and contacts which I have to protect as much as possible.

Cracking or other failure mode of the backplanes

If the backplane, which is an epoxied sandwich of material that includes connections between pins on different cards, has begun to fail due to age-related material failure, it may be breaking signal and/or power connections to the card slot pins. Each time I remove and insert a card, the pressure of the 'snap' action that signals a good insertion may be splintering internal layers and connections.

Resolving this situation would be very difficult, amounting to reworking or replacing the backplanes for the eight card compartments. The fact that other 1130s are restored and working is the best evidence that this should not be happening to my machine; otherwise, the operators of the other systems would be reporting similar problems.

Injection of bits and control signals from unconnected peripherals

I don't have the signal and power cables connected for the 1442 reader/punch and 1132 printer. Since both of these devices can write data into the 1131 core memory during read operations, both can send their status into the system, both can request interrupts and the 1132 can trigger cycle steal accesses to memory, they have the potential to introduce at least some of these symptoms.

It is the adapter logic that actually does these actions and for these two peripherals, the adapters are built into the 1131. However, it is possible that the unconnected lines from the devices might be confusing the adapter, causing one of them to inject bits when it shouldn't or trigger cycle stealing or interrupts.

I am not seeing the interrupt level active, which rules out that specific case, but cycle steal would not be easily apparent to the operator, and of course a malfunction that clobbers the IO bus or B register will cause problems such as I am experiencing. I suppose that a malfunctioning adapter for a connected device, such as the 1053 typewriter console or the keyboard, could also inject bits, but I would categorize that as a card failure, whereas this category is for errors that are caused simply because the signal cable is not connected.

There are jumpers I can install to block the cycle steal requests, and I could also block off the gate that allows bits in from various devices. That makes this case easy to spot (scope the cycle steal state signal) and easy to validate by using those jumpers.

Injection of data bits and control signals from the unconnected SAC

Unlike the other peripherals, whose adapter logic sits inside the 1131 and is what directly injects bits, requests interrupts or activates cycle stealing, the Storage Access Channel feature is used with devices whose adapters are external. That means the SAC connects signals on the cable to the interrupt, cycle steal and data lines. Since I have no cable installed, this is an even more likely candidate for injecting bits, triggering cycle steals and requesting interrupts.

I haven't seen an interrupt go active but the other situations are possible. I can tie down the cycle steal and interrupt lines with jumpers, as well as blocking the gate of inbound bits, as a way of verifying if this is what is occurring in my system.

Bad grounding or power connections at card compartments

Step 1 in any restoration is to verify that power is correct and clean. I set the voltages very carefully, checked ripple/noise at the supplies, and did a checkout at the memory frame, but didn't do a methodical check gate by gate and backplane by backplane.

I will measure the resistance between the ground on a gate and the central ground point on the frame, since a poor connection introduces a voltage drop that shifts all of the voltages seen on that gate. If the voltages aren't what is expected, circuits and logic may malfunction.
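To put rough numbers on that reasoning, here is a minimal sketch of the Ohm's law arithmetic. The function name and the resistance and current figures are hypothetical placeholders, not measurements from my 1131.

```python
def ground_offset_volts(resistance_ohms: float, current_amps: float) -> float:
    """Voltage developed across a ground connection, by Ohm's law (V = I * R)."""
    return resistance_ohms * current_amps


# Hypothetical values for illustration only: a 0.05 ohm ground joint
# carrying 10 A of return current from a gate.
offset = ground_offset_volts(0.05, 10.0)
print(f"Ground offset: {offset:.2f} V")
```

Even a small resistance matters when the return current is large, which is why the measurement is worth making gate by gate.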

Bad voltage levels at card compartments or bad connections to backplane

After checking the ground connection, I will measure the voltages as they are delivered to cards on the backplane, ensuring they meet spec. If there is too much drop, then I would bump up the voltage at the regulators to yield the target amount where it counts, at the logic cards.
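The check and the adjustment can be sketched as below. This is only an illustration: the function names, the 3.0 V nominal rail and the 4% tolerance are assumptions made for the example, not figures from the IBM documentation, and the setpoint calculation assumes the drop stays the same after the regulator is raised.

```python
def within_spec(measured: float, nominal: float, tolerance_fraction: float) -> bool:
    """True if the measured voltage is within the allowed fraction of nominal."""
    return abs(measured - nominal) <= abs(nominal) * tolerance_fraction


def compensated_setpoint(nominal: float, measured_at_card: float,
                         current_setpoint: float) -> float:
    """Regulator setting that would deliver the nominal voltage at the card,
    assuming the distribution drop does not change (an approximation)."""
    drop = current_setpoint - measured_at_card
    return nominal + drop


# Hypothetical readings: regulator set to 3.0 V, 2.8 V arriving at the card.
print(within_spec(2.8, 3.0, 0.04))
print(compensated_setpoint(3.0, 2.8, 3.0))
```

The same arithmetic works for a negative rail, since the drop is computed with its sign.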

Noise spikes on power supply

With enough noise or spikes, gates can be improperly triggered. Generally this manifests as erratic behavior, not the consistent failures I am experiencing, but it belongs on the list. I will have to monitor the power lines for longer periods to see if I spot any noise. I could also set the storage scope to trigger on a threshold high enough to represent noise.
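If the scope trace were exported as a list of samples, the same threshold idea could be applied after the fact. This is a rough sketch under that assumption; the function name, sample values and the 0.3 V threshold are hypothetical.

```python
def find_spikes(samples: list[float], nominal: float, threshold: float) -> list[int]:
    """Return the indices of samples that deviate from nominal by more than threshold."""
    return [i for i, v in enumerate(samples) if abs(v - nominal) > threshold]


# Hypothetical trace of a 3.0 V rail with two excursions.
trace = [3.0, 3.02, 3.6, 2.98, 2.3]
print(find_spikes(trace, nominal=3.0, threshold=0.3))
```

This flags excursions in either direction, whereas a simple scope trigger typically fires on one polarity at a time.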

Failure of components due to operating stress

Another possible effect of aging would be imperfectly sealed components having degraded - the discrete transistors on the boards are a likely candidate - so that they fail from the stress of the heat shock of power up/power down and from continued operation.

For this many to keel over in a relatively short time would suggest a pervasive issue that should have been spotted by other 1130 operators. When I diagnose failed cards down to the malfunctioning component, I will see if there is a pattern here.

Failing single card whose operation is worsening over time

Occam's Razor applied to the problems I am facing implies that I should prefer explanations that involve the fewest simultaneous failures, ideally looking for one card or cable or other component that would cause what I experience.

It is possible that a card had a component declining, so that initially it was a marginal problem that would tip only one part of the card into failing, but as the component worsened, it could cascade to other parts of the card circuitry that had more resilience to marginal voltages.

For example, if a filter capacitor on an input power line were to become increasingly shorted, pulling down the voltage, that could widen the scope of failures, either affecting more circuits on the card or impacting downstream cards which would now receive improper signal levels.

If I can collect enough evidence from scoping, looking at logic levels and thinking about the behavior, I could deduce likely bad cards. Swapping a candidate with another of the same card type from a different part of the machine should radically change the behavior, a clear confirmation.

Loose connectors onto backplanes

I tried to test the seating of as many connectors that are fitted onto the backplanes as I could, but there are some that are relatively inaccessible and I didn't pull and reseat most of the connectors anyway. This should be detectable by observing a difference in signals at the source and destination end, highlighting those that are open or distorted.

Destroying cards through static charge each time I pull one out

A nightmare scenario is that many cards I pulled out to reseat were destroyed by static electricity from my hands. It is a nightmare because if a majority of the SLT cards are bad, the chances are exceedingly low that I could find replacements or repair them all.

The CE manuals and instructions don't mention any anti-static precautions for handling boards. These are not integrated circuits but discrete components, even though some are encapsulated on a tiny ceramic square. They are not MOS, which is the type of semiconductor device that is most prone to static discharge damage.

Thus, this case is not that likely, but has to be considered along with all the others. I powered down before opening any compartment and removing any board, following all the recommendations from the maintenance manuals. If I later discover that many cards are bad and the failed components seem to match a punch-through of diode or transistor junctions, I might change my mind.

Open to other scenarios from readers of this blog

If any reader of this blog has other scenarios I should be considering, or advice of any kind to help track down the plague infecting the 1131, please either post them publicly or email me. They will be gratefully accepted.

1 comment:

  1. I received some great feedback and suggestions from Peter Vaughan of The National Museum of Computing in Milton Keynes, UK, who has restored an 1130 system. I will summarize some of the points later and add a case I hadn't covered in my post.