Tuesday, August 11, 2015

CHM 1401 issue with group mark resolved, working on improved fast SPI link logic on SAC Interface Box


I continued to set up the logic to produce a 16-bit CRC of the 104 bits of data in a frame; the resulting 120 bits are then expanded to 128 bits using a SECDED Hamming code scheme. I also picked one of the 104 data bits to be a 'parity error on last frame' indicator back to the sender, allowing retransmission of frames so that I don't miss transient conditions that occur while frames are being discarded due to link errors.
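A minimal Python sketch of that framing arithmetic, purely illustrative and not the FPGA logic itself. The CRC polynomial (0x1021), its initial value, the bit ordering, and the placement of check bits (overall parity at bit 0, Hamming parity at the power-of-two positions) are all assumptions, since none of them are specified above; the names crc16 and encode_frame are hypothetical.

```python
DATA_BITS = 104
CRC_BITS = 16
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]          # Hamming check bits
# Every non-power-of-two position in 1..127 carries one payload bit (120 total).
DATA_POSITIONS = [p for p in range(1, 128) if p & (p - 1)]


def crc16(data: int, nbits: int = DATA_BITS,
          poly: int = 0x1021, init: int = 0xFFFF) -> int:
    """Bitwise CRC-16, most significant bit first (polynomial is assumed)."""
    crc = init
    for i in reversed(range(nbits)):
        bit = (data >> i) & 1
        top = (crc >> 15) & 1
        crc = (crc << 1) & 0xFFFF
        if top ^ bit:
            crc ^= poly
    return crc


def encode_frame(data: int) -> int:
    """Turn 104 data bits into a 128-bit SECDED-protected word."""
    payload = (data << CRC_BITS) | crc16(data)       # 104 + 16 = 120 bits

    # Scatter the payload over the non-power-of-two positions.
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (payload >> i) & 1:
            word |= 1 << pos

    # XOR of the positions of all set bits; setting the matching parity
    # bits forces this value to zero for a valid codeword.
    syndrome = 0
    for pos in DATA_POSITIONS:
        if (word >> pos) & 1:
            syndrome ^= pos
    for p in PARITY_POSITIONS:
        if syndrome & p:
            word |= 1 << p

    # Bit 0 is the overall parity bit: make the total number of ones even.
    if bin(word).count("1") & 1:
        word |= 1
    return word
```

The bit budget works out as 120 payload bits plus 7 Hamming check bits plus 1 overall parity bit, which is where the 128 comes from.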

Right now my Hamming receiver logic will detect single- or double-bit errors in the combined data + CRC frame, but I have not implemented the single-bit error correction that this error checking scheme makes possible.
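For reference, here is a sketch of how the missing correction step could work in a SECDED receiver. It assumes a hypothetical layout in which bit 0 of the 128-bit word is the overall parity bit and bits 1 through 127 form a standard Hamming code with check bits at the power-of-two positions; the function name check_and_correct is made up for illustration.

```python
def check_and_correct(word: int):
    """Classify a 128-bit SECDED word and repair a single-bit error.

    Returns (status, word) where status is 'clean', 'corrected' or 'double'.
    Assumed layout: bit 0 = overall parity, bits 1..127 = Hamming code.
    """
    # The syndrome is the XOR of the positions of all set bits; it is zero
    # for a valid codeword and equals the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 128):
        if (word >> pos) & 1:
            syndrome ^= pos

    overall_odd = bin(word).count("1") & 1

    if syndrome == 0 and not overall_odd:
        return "clean", word
    if overall_odd:
        # Odd overall parity means one bit flipped. A zero syndrome here
        # means the overall parity bit itself (bit 0) was the one hit.
        return "corrected", word ^ (1 << syndrome)
    # Non-zero syndrome with even overall parity: two bits flipped,
    # which can be detected but not corrected.
    return "double", word
```

Three or more flipped bits can be misclassified as a correctable single error, which is one reason to keep the CRC check behind the Hamming check.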

I really have to think through the state machines for sending and receiving frames, as there are multiple places to detect errors (Hamming check and CRC check) and various ways the logic should recover, depending on the error.

For example, a single-bit error that is detected and corrected, and which then passes the CRC test, should be considered the same as an error-free frame. However, if the CRC does not match, the frame needs to be rejected and a flag bit sent back asking the sender to repeat the frame.
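One possible way to write down that policy, as a sketch only. The status strings and the function name frame_action are hypothetical, and treating an uncorrectable double-bit error the same as a CRC mismatch is an assumption rather than a settled design decision.

```python
def frame_action(hamming_status: str, crc_ok: bool):
    """Decide what to do with a received frame.

    hamming_status is one of 'clean', 'corrected' or 'double'
    (hypothetical labels for the Hamming check outcome).
    Returns (accept_frame, request_resend_flag).
    """
    if hamming_status in ("clean", "corrected") and crc_ok:
        # A corrected single-bit error that passes the CRC is treated
        # exactly like an error-free frame.
        return True, False
    # Double-bit error or CRC mismatch: discard the frame and set the
    # flag bit in the next outgoing frame asking for a repeat.
    return False, True
```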

I am kicking around various designs of the state machine(s) and thinking about the implications of each choice. If I reject a frame in error and ask for it to be resent, some short-duration signal changes might come and go entirely between the time the original frame was composed and the time the next frame of fresh content is sent, that is, while the old frame is being retransmitted.

The alternative is to always use an interlocking scheme to request actions, so that a request or response can't be lost due to short-term link noise. This also allows me to just throw away frames in error, rather than seeking a retransmit. It is more consistent with the idea for the link, which is a bidirectional data pump that simply copies a set of 104 registered signals from one side to the other while the other side sends its own set of 104 registered signals back to the first. The signals in each direction are independent except for any interlocked transaction protocols I set up.
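To illustrate why interlocking tolerates discarded frames, here is a small simulation of a four-phase request/acknowledge handshake carried as level signals over a link that randomly drops frames. It is a sketch under assumptions: the signal names (req, ack), the drop model, and the function run_interlocked_requests are all invented for the example and are not part of the actual design.

```python
import random


def run_interlocked_requests(n_requests: int, drop_rate: float,
                             seed: int = 1, max_frames: int = 10_000) -> int:
    """Return how many requests the responder acted on."""
    rng = random.Random(seed)
    req = ack = False             # levels driven by requester / responder
    req_seen = ack_seen = False   # last copies that survived the link
    pending, served = n_requests, 0

    for _ in range(max_frames):
        # Each direction's frame either arrives intact or is thrown away;
        # a discarded frame just leaves the old copy in place.
        if rng.random() >= drop_rate:
            req_seen = req
        if rng.random() >= drop_rate:
            ack_seen = ack

        # Responder: act once when the request appears, hold the
        # acknowledge until the request is seen to drop.
        if req_seen and not ack:
            served += 1
            ack = True
        elif not req_seen and ack:
            ack = False

        # Requester: hold the request until it is acknowledged, and only
        # start a new one after the acknowledge has gone away.
        if not req and not ack_seen and pending:
            req = True
            pending -= 1
        elif req and ack_seen:
            req = False

        if pending == 0 and not (req or ack or req_seen or ack_seen):
            break
    return served
```

Because each side holds its level until it sees the other side respond, a lost frame only delays the handshake; for example, run_interlocked_requests(20, 0.3) still serves every request exactly once. A one-frame pulse, by contrast, would vanish if its frame happened to be discarded.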


One team member found mention of an RPQ that could be put on a 1401 whose purpose was to treat a groupmark as a 12-5-8 instead of the usual 12-7-8 card code. No documentation or wiring diagrams at the museum show this RPQ installed, but another team member remembered having seen some mystery switches inside the card reader/punch.

We looked today at lunchtime and found the mystery switches, clearly marked as involving the groupmark for read and punch separately. Both had a 'normal' mode and an 'IBM 705' mode, which makes sense since the older 705 mainframe used 12-5-8 for a groupmark instead of 12-7-8. So, we have the RPQ installed but our ALDs and other wiring diagrams don't show it at all.

The RPQ switches inside the reader/punch
The switches were both in their "705" position. We flipped them up to "Normal", hooked up the mystery wire on gate 01B4, and the machine now reads 12-7-8 cards into memory as a groupmark without any validity checks. It also punches a groupmark in memory (BA8421) as a 12-7-8, just as it should.

Back when this problem was first experienced, somebody must have recently reached inside the machine and changed the setting of the switches. The first time the museum demonstrators tried to run the tape program after this mystery change of the switches, using a deck with a 12-7-8 in some cards, they got the read validity check and reported it to us. With no mention of the RPQ, all our documentation said that a 12-7-8 was a valid groupmark, so we appeared to have an erroneous check condition. Later, others discovered that punching a groupmark produced the wrong hole pattern.

It took weeks of diagnosis because the ALD manuals and other documentation did not reflect the machine as it exists. This RPQ is rare enough that nobody on the team had ever experienced it on another 1401; if someone had, we would at least have known to check for the RPQ and perhaps looked inside the covers of the reader/punch for the switches.

The upside is that the team learned quite a bit about some facets of the machine and its design as a result of investigating this situation. The downsides include knowing our schematics are out of sync with the machine and, since there is an unmarked third switch next to the RPQ switches, we may have yet another unknown RPQ on the machine which could bite us down the road. 
