Friday, June 24, 2022

Set up core tests to run on the machine next

IBM CORE MEMORY TEST ROUTINES

IBM provides two routines to test memory, called the high and low tests. The only difference is where they place the executing code, because the locations holding the test itself aren't checked. Each of these will also check the wraparound capability. The highest address on this machine is 8191 decimal or 1FFF in hex. If you add 1 to that address it should wrap around to 0, which the tests verify.

The tests have six stages. The first will write all 1 bits in each location and then all 0 bits. The second writes the address of a word into that location, so that the addressing logic can be verified. The third writes alternating AAAA and 5555 patterns, called a checkerboard. The fourth first sets all bits 0 except for moving a 1 from left to right in the word, then it does the complement with all 1 and a moving 0. The fifth and sixth will write alternating blocks of ones and zeroes which is the worst case pattern for generating noise that can trigger a misadjusted sense amplifier. 
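As a rough illustration of the patterns described above, here is a minimal Python sketch. It is not IBM's diagnostic code; the function names are made up for illustration, and the choice of which address gets AAAA versus 5555 is an assumption. Bit 0 is treated as the leftmost (high) bit of the 16-bit word, as on the 1130.

```python
WORD_MASK = 0xFFFF      # 16-bit data words
MEMORY_WORDS = 8192     # addresses 0000 through 1FFF on this machine


def address_pattern(address):
    """Second stage: each word holds its own address."""
    return address & WORD_MASK


def checkerboard(address):
    """Third stage: alternate AAAA and 5555 (assumed: AAAA at even addresses)."""
    return 0xAAAA if address % 2 == 0 else 0x5555


def walking_ones():
    """Fourth stage, first half: a single 1 moving from bit 0 (left) to bit 15."""
    return [0x8000 >> bit for bit in range(16)]


def walking_zeros():
    """Fourth stage, second half: the complement, a single 0 moving left to right."""
    return [(~word) & WORD_MASK for word in walking_ones()]


def next_address(address):
    """Stepping past the highest address wraps around to 0."""
    return (address + 1) % MEMORY_WORDS
```

For example, next_address(0x1FFF) gives 0, which is the wraparound the diagnostics verify.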

SETTING UP THE TEST CODE

I used the IBM 1130 Simulator to boot the 1442 Relocating Loader with the Hi Memory diagnostic behind it. When it stopped at the first wait, I dumped the memory to a file that I can use to load core in the physical 1130. I then did the same but with the Lo Memory diagnostic, giving me a second load file.

My Memory Load Tool will toggle these into core after which I can set the IAR to the address of the wait and push Prog Start to run the tests. I need as much information as I can get to hunt down the problem this machine is having with the bit flip and parity stop. 

Tuesday, June 21, 2022

Test with the spare 3475 card in the memory module - parity errors shifting to other bits

SWAPPING IN THE SPARE CARD

I pulled out the card that I suspected was bad and put in a spare card of the same type. The suspect card was sitting in the position that covers bits 4 & 8, whereas its original position handled bits 0 & 6. My original failure was always bit 0 flipping on erroneously. After moving the card up I had bit 8 flipping on in error. This is why I suspected the card and did a replacement. 

CLEAN UP THE MULTIPLY-DIVIDE TEST AREA TO BE SURE NO BITS ARE FLIPPED

A parity error leaves memory corrupted, although parity is regenerated so that the new pattern is stored with correct parity. When the location was mis-read with bit 0 as a 1, the stored data bits plus the parity bit held an odd count of 1 bits, but with this extra 1 the count was no longer odd. 

Since memory is read destructively, all the bits are flipped to zero with sense amplifiers reporting those that had previously been a 1. Those 1 bits are saved in the B register and then in the second half of the memory cycle, the hardware writes back the value in the B register. This means that the process of mis-reading gives us corrupted data that is immediately written back. 

A memory cycle consists of T-Clock steps T0 to T7. The first half, steps T0 to T3, is the destructive read part of the cycle where the value read out is latched into the B register. The second half, steps T4 to T7, does the write of the B register to memory. When the CPU is storing new data in a location, the B register contents are replaced, discarding what was read out of the location, so that the new contents of B are written back. 

The parity checking occurs during the first half of the memory cycle, while proper parity for the word to be written is generated in the second half. Each 8-bit half of the word has its own parity bit. If the number of 1 bits in the 8 bits of data plus the parity bit, as read, is not odd for either half, then we have a parity error. The latch turns on in step T6, when the B register is written back to memory. 

If it stopped earlier we would have a completely zeroed word and both halves would calculate as even parity. We want valid parity on memory so we have to generate proper parity in the second half of the cycle and then stop after it is written back. 

Thus, words where we have a parity error are written back with good parity but incorrect contents since a flipped bit is what triggered the parity error in the first place. I wanted to restore the multiply-divide routine and its data areas to the correct values, which I did by stripping down the load file to just those locations and letting my Memory Load Tool toggle it in.
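
To make the parity behavior concrete, here is a minimal Python sketch of odd parity kept separately for each 8-bit half of a word. The function names are hypothetical, and the 0005 value misread as 8005 is borrowed from the June 18 entry purely as an example. This models only the arithmetic, not the T-Clock timing of the hardware.

```python
def ones(value):
    """Count the 1 bits in a value."""
    return bin(value).count("1")


def parity_bit(byte):
    """Parity bit that makes 8 data bits plus parity hold an odd number of 1s."""
    return 0 if ones(byte) % 2 == 1 else 1


def stored_parity(word):
    """Parity bits for the left half (bits 0-7) and right half (bits 8-15)."""
    return parity_bit(word >> 8), parity_bit(word & 0xFF)


def parity_error(word_read, parity_read):
    """True if either half plus its parity bit has an even count of 1s."""
    halves = (word_read >> 8, word_read & 0xFF)
    return any((ones(h) + p) % 2 == 0 for h, p in zip(halves, parity_read))


good = 0x0005
parity = stored_parity(good)        # parity bits as originally written to core
bad = good | 0x8000                 # bit 0 spuriously sensed as a 1

print(parity_error(good, parity))   # False
print(parity_error(bad, parity))    # True
print(stored_parity(bad))           # (0, 1): parity regenerated for the bad word
```

The last line shows why the corrupted word looks clean afterwards: the write half of the cycle stores 8005 with freshly generated parity, so a later read of that location raises no error.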

RERUN THE TEST TO COMPLETION

The test ran for almost two minutes and finished with a normal completion wait (3003). This validates the hardware for multiplication and division, finishing the checkout of all the instructions. I decided to run it a second time, which I started but it stopped with a Parity Stop!

The bit being flipped on was bit 2 this time. It did this consistently. The card that handles bits 2 & 3 is up a level in B5 rather than B6 where I swapped the card. This is perplexing. Something more subtle is happening than a bad sense amplifier. 

ANOTHER OBSERVATION ABOUT THE PARITY STOP

I corrected the value in the core word and reran the test a few times, always getting a bit 2 turned on to trigger the Parity Stop. More interestingly, it was always the same location where this happened. It is always executing an EOR instruction, long format, indirect. The failure occurs in fetching the second word of the instruction, in other words during the I2 cycle. 

I remember that this was the same place where bit 8 was going on before I swapped the card, the second word of the EOR instruction at location 0D36 and 0D37. Very curious. 

INVESTIGATIONS AHEAD

In order to investigate this, I need to use the IBM 1130 Simulator to load the CPU Core Test diagnostics, create a load file and have it entered in the core memory of this 1130. That will shake down the memory and give me a better idea of what kind of error lurks there.

If this is an issue with that one word of core, it is a very strange error. Earlier I had experienced the parity stop with a simple loop at an entirely different address, thus I suspect this is not associated with one address. That would be very unusual since the failures happened on different core planes - bits 0, 2 and 8. 

I need to ponder the circuitry of the memory to see if I can find any common factor. There are steering diodes that handle the addressing, the inhibit and the sense operations, so that the same wire can have current flowing in different directions at different times of the cycle. A bad diode could do funky things, but the core tests will help flag this. 

Monday, June 20, 2022

Narrowing in on the two failures after verifying they are consistent, likely both are resolved

FAILURE 1 - STX TEST FAILS

The test that fails here is pretty simple. A value of xFFFF is stored in a fixed location, then index register 1 is loaded with the value x0000 and a STX instruction puts the contents of IX1 into the fixed memory location from before. The fixed location is then loaded and if it isn't zero, it indicates that the STX didn't store properly causing a stop. 

Single stepping always works properly, but at speed this seemed to consistently fail. I first reran to verify that this misbehaves at normal run speed. Embarrassingly, I found that the instruction immediately after where I stopped when single stepping was wrong, another copy of the error wait 30DF instead of the proper instruction. 

When I fixed the incorrect value, the machine ran right through this with no errors at all. This is indeed not a failure of the machine processing STX instructions. 

FAILURE 2 - MULTIPLY/DIVIDE LOOP GETS PARITY ERROR

This is a long loop that runs through all possible values from lowest negative to highest positive, doing a multiply and then a divide. It uses four seed values by which it multiplies and divides, thus four passes from -32768 to +32767. That is 65,536 multiplies and 65,536 divides for each of the four seed values. 

The average execution time is 33 microseconds for the multiply being done and 76 microseconds for the divide. That gives us about 8.7 seconds of multiplication execution and 20 seconds of division, or a total loop of roughly 29 seconds. That is about 1/4 of the entire diagnostic test's execution time for this one comprehensive multiply-divide test. 
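
The arithmetic behind those figures can be checked with a few lines of Python. The variable names are only for illustration, and the result counts just the multiply and divide instructions themselves, not the loop overhead around them.

```python
values_per_seed = 65_536      # every value from -32768 to +32767
seeds = 4
multiply_us = 33              # average multiply time in microseconds
divide_us = 76                # average divide time in microseconds

pairs = values_per_seed * seeds
multiply_s = pairs * multiply_us / 1_000_000
divide_s = pairs * divide_us / 1_000_000

print(pairs)                                 # 262144
print(f"{multiply_s:.2f}")                   # 8.65
print(f"{divide_s:.2f}")                     # 19.92
print(f"{multiply_s + divide_s:.2f}")        # 28.57
```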

I obviously can't hand step through 262,144 pairs of multiply and divide, but this one does trigger a parity stop which is a signal that I can use to latch up the scope and/or logic analyzer. I ran this again to be sure that it does consistently fail with the parity stop, probably because it executes so many times that this sporadic issue is sure to crop up. 

These parity errors don't appear to exist in the core memory, just in the value read into the B register during the read part of a core memory cycle. I believe this because I can immediately run a Storage Display loop that reads all memory; that scan never sees a parity error so the data is not written in core wrong. Instead, it seems to be that bit 0 of B register is set in error during a read cycle. 

I will monitor the sense amplifier output to see whether we are getting bad sensing or whether something else is causing the B register Bit 0 to latch on. I have two other leads which I will hang on some of the gating signals that might cause other random data to flip on the bit latch.

The sense amplifiers of the SJ-4 memory are split - one card handles bit 0 and 6 for addresses from 0 to 4095 and the other handles the same two bits for addresses from 4096 to 8191. Thus there are two different sense amplifiers, with an addressing bit gating whether the lower 4K or higher 4K sense amp is connected to the output. 

So far, my issues have all occurred in the lower 4K, but I could relocate the failing code up above the line and see if the results are the same. That would point me at a bad card or connection if it only fails in lower core addresses. Fortunately, I don't have to do this - see below. 
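
As a small sketch of that split, the Python function below reports which 4K half an address falls in, assuming the gating is done on the address bit worth 4096 (hex 1000). The function name is hypothetical, and nothing here says which physical card slot serves which half.

```python
def sense_half(address):
    """Return which 4K half of the 8K memory an address falls in."""
    return "upper 4K" if address & 0x1000 else "lower 4K"


print(sense_half(0x0D36))   # lower 4K
print(sense_half(0x1000))   # upper 4K
```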

The B register bit 0 flip flop has a number of inputs coming from the A register, I register and I/O (device controller) registers. These should only be passed on to the latch if the sample pulse signal goes negative. For example, if -A to B SP 0-7 is activated while the A bit is 1 (gate signal -A Bit 0 is low), then this triggers the latching of the B Bit 0 flipflop. Similarly, -I/O to B SP 0-7 and -I to B SP 0-7 will latch for a 1 in I/O or I bit 0. 

The pulses are sent to all eight bits, 0 to 7, yet only bit 0 is latching up. It cannot be an error in the generation of these sample pulses, but it might be a signal path fault bringing that signal to the pin for the Bit 0 instance of the B register logic. It could also be a path error with -Sense Amp Bit 0 coming to the card. 

I wrote up the relevant pins and paths to verify, applying the VOM to the backplane to test connectivity before I start the scope and logic analyzer captures. All the paths were well connected. Interestingly, the path from the sense amp up to the edge connector had a wire wrap on exactly this bit. It was good, however, so I moved on.

Using the scope and triggering on the generation of -Parity Stop, I could see a clear 1 bit coming from the sense amplifier line. The memory module hosts 18 identical SLT cards (type 3475), each of which handles the inhibit and the sense duties for a pair of bits for a 4K group of addresses.

The locations for bits 0 and 6 are A7 and B7 in the B gate, C1 compartment which is where the memory sits. I swapped the card with another - B6 which is responsible for other bits. I ran the Multiply-Divide test again and got a Parity Stop again but this time the bit that was flipped on spuriously was bit 8! That is the responsibility of the card in B6. 

My working assumption is that the card currently in B6 has some fault that causes it to sometimes report a 1 value when the core was actually zero. The museum had a box full of spare SLT cards including a 3475. I will swap in the spare card and see whether I can get this test to run successfully. 

If it does, then all of the CPU instructions were validated by the diagnostic and I can consider both the CPU and the memory (because of this replacement) to be good. I will do the card change and retest tomorrow as it is the end of my time in the shop for today.

Sunday, June 19, 2022

Adding stops to the CPU Test to figure out how far it gets successfully

LISTING OF THE DIAGNOSTIC GIVES ME LOCATIONS OF THE START OF EACH SECTION

I can replace the first instruction word with a special halt - using the unassigned operation code b11111 to form words of the form F80n where n is the number of each stop. I have a spreadsheet with the original value of those words, so that once it stops at a point, I can restore the proper instruction and let it continue.

At each point, I will know that all the tests up to that point were completed successfully. Once it begins looping I know the issue arises from the last wait point forward and can more granularly sprinkle F80n waits to zoom in on the misbehaving instruction. 
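
The bookkeeping for these stops can be sketched in Python as follows. This is only an illustration under stated assumptions: memory is modeled as a dictionary of address to word, the section addresses and original words in the example are invented, and the function names are hypothetical. The F80n form follows from putting binary 11111 in the top five bits of the word.

```python
def stop_word(n):
    """Build the halt word F80n for stop number n (one hex digit)."""
    if not 0 <= n <= 0xF:
        raise ValueError("stop number must fit in one hex digit")
    return 0xF800 | n


def insert_stops(memory, section_starts):
    """Overwrite the first word of each section with F80n.

    Returns a table of the original words so they can be restored later.
    """
    saved = {}
    for n, address in enumerate(section_starts, start=1):
        saved[address] = memory[address]
        memory[address] = stop_word(n)
    return saved


def restore(memory, saved, address):
    """Put the original instruction back so the run can continue."""
    memory[address] = saved[address]


# Invented addresses and contents, for illustration only
memory = {0x0200: 0xC000, 0x0300: 0x1234}
saved = insert_stops(memory, [0x0200, 0x0300])
print(f"{memory[0x0200]:04X} {memory[0x0300]:04X}")   # F801 F802
restore(memory, saved, 0x0200)
print(f"{memory[0x0200]:04X}")                        # C000
```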

RESULTS OF RUNNING THE MODIFIED CPU TEST DIAGNOSTIC

I discovered one corrupted word in memory that caused the looping and repaired it. I then ran through sections, with the waits I had inserted. I got through almost every section without issues. There were two anomalies.

First, the diagnostic gave an error stop while testing the Store Index (STX) instruction. When I single step through that part of the test it works perfectly and doesn't get the error, but when I run at normal speed, it fails. I must have a timing issue here that needs to be checked.

Second, the section where it attempts to test multiply and divide cases had a parity stop in fetching the second word of a long instruction, again with bit 0 flipped on to cause the parity error. I repaired the location, started the section where it looped for a bit and then stopped with the same bit flip parity error. 

I guess the good news is that I have some code that will repeatedly cause the bit flip, thus I can begin instrumenting the machine to catch it in the act. I am not certain how to catch whatever problem is happening with the STX test section. That too failed the same way several times, but it doesn't trigger a parity error, which is a definitive trigger for logic analyzers and oscilloscopes, instead just executing improperly in an unknown way.


Saturday, June 18, 2022

Continued loading the CPU Test diagnostics into the 1130 core memory and ran them, not successfully

ADJUSTED THE TOOL TO OPERATE FASTER

I made some improvements to the Memory Load tool which now loads each 1K words in just under 6 minutes. I expect that a full memory load (8K words) would take 47 minutes to complete. 

LOAD COMPLETED AFTER 23 MINUTES

To my delight the CPU Test diagnostic had a footprint of only 4K words. This makes sense because IBM did sell a 4K low end version of the machine. Thus the load process was faster than I had anticipated. 

WHAT I EXPECTED RUNNING THE DIAGNOSTIC

The documentation, as well as the behavior on the IBM 1130 Simulator, indicates that the program should run for a couple of minutes and then stop with a wait instruction 3003 indicating successful completion of all tests. 

ACTUAL RESULTS NOT AS IDEAL AS I HAD HOPED

When I began the test, it ran but continued to run long past the two minute point where it should have stopped. A bit later, it stopped with a Parity Stop, meaning that we had a parity error in core. These were the same symptoms I had seen before, the high bit (0) turned on when the parity value indicates that it should have been a zero. 

Red Parity Stop lamp on left side is lit

Since I had the listing for the code that was running at the time, I could see that it was loading a value of 0005 from a memory location but the value in the Accumulator was 8005 because of the high bit flip. I immediately ran a Storage Display where the hardware cycles around through all memory locations reading the contents of each word - with no parity error indicated. 

Executing Store long format, fetching word 2 of the instruction

This suggests to me that some process is flipping bit 0 to a 1 on a read but not actually flipping the core. It could be an out of adjustment sense amplifier or it could be some errant logic elsewhere that is ORed to set the flipflop for bit 0. 

Further, the code that is executing is the code that would be invoked if I had requested looping on an error condition, but I had set all the CES switches to zero thus asking for a single pass. In order to get to that code, something had gone awry in the execution of the diagnostic, but I don't know where or even when it happened. 

I may have to patch in some stops into the diagnostic so that I can find where it reaches. If I know that it has successfully tested some percentage of the instructions, I can at least consider them to be fully operational. Further, I could do some binary search to home in on where the divergence begins and get a clue about the defect causing it. 
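
The binary search idea could look something like the Python sketch below. It assumes the stops are numbered in execution order and that once the diagnostic goes wrong it never reaches any later stop. The reaches callback is hypothetical: on the real machine it stands for patching in one stop, running from the start, and seeing whether the machine halts there.

```python
def first_unreached(stop_count, reaches):
    """Find the lowest-numbered stop that the run never gets to.

    reaches(i) must return True if execution arrives at stop i.
    Returns stop_count if every stop is reached.
    """
    lo, hi = 0, stop_count
    while lo < hi:
        mid = (lo + hi) // 2
        if reaches(mid):
            lo = mid + 1      # divergence is after this stop
        else:
            hi = mid          # divergence is at or before this stop
    return lo


# Simulated run that goes astray just before stop 7 of 10
print(first_unreached(10, lambda i: i < 7))   # 7
```

With ten stops this needs at most four runs, compared with up to ten if the stops are tried one after another.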

I may also have to troubleshoot the bit flip parity problem, which does not occur with continual Storage Display access but does with some loops. I will build some loops and set them running to see if I can force the failure. It may allow me to record enough information when the parity error is detected to find the culprit.



Friday, June 17, 2022

Dumping the cpu test diagnostic from simulator and loading on the real 1130

IBM 1130 SIMULATOR USED TO BOOT THE CPU TEST DECKS

Brian Knittel created an IBM 1130 simulator with graphical interface, based on Supnik's simh simulator framework. I use it to run real programs from the 1130 and to sort out how various things should work, since it is a very faithful recreation.

In an earlier project I read and archived all the card decks that I had collected, which included all of the IBM maintenance/diagnostic decks that were used to troubleshoot and adjust the machine. There is a CPU test program which will exercise all the instructions and functions, with particular attention to all the special cases that might unearth even a single gate that is malfunctioning in the processor.

This CPU Test program deck is put at the rear of the Basic Diagnostics Loader deck, then the combined deck is loaded using the Program Load button on the machine. After the decks complete loading, the program stops at location x012D with 3000 as the wait instruction showing in the Storage Buffer Register. From there the instructions tell you how to make it execute and what options you can select.

The entire set of tests runs for about two minutes on the 3.6 microsecond versions of the 1130. It would be a wonderful comprehensive test to apply to this machine to be confident in the restoration. 

I used the simulator to Program Load the combined card deck images, with the simulator stopping at the beginning at x012D waiting for me to continue. If I transfer the contents of the simulated 8K of storage over to the real machine, then start the machine at address x012D, it will let me run the tests exactly as if it had a card reader and I booted up those decks. 

DUMP COMMAND PRODUCES TEXT FILE WITH CONTENTS

The simulator offers a command, DUMP, which puts any range of memory addresses you want into a text file in the same format as I chose for the Memory Loader tool that is installed on this system. The file begins with a reminder of the current execution address x012D, then sets the memory location to x0000 and begins entering words, one at a time with four hex characters. 

It provides a shortcut for long bursts of zero value words, Znnnn where nnnn is the number of words, in hex, to load with zeroes. The result was 8,192 words of content, some of them runs of zeroes, but for the most part the contents filled all of memory. 

NEED TO TWEAK FILE TO FORMAT FOR MY LOADER PROGRAM

My loader program supports the lines that load the memory location and the lines that load a particular word value into memory, but did not handle the Znnnn entries. I could have written a simple Python program to convert these into nnnn sequential entries of 0000 but instead I combined that into a program that opens a text file on my PC, connects over the serial USB link to the tools, then reads the file and sends appropriate commands to the loader including converting Z into a series of 0000 words. 
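
The Znnnn expansion on its own is simple, and a minimal Python sketch of it follows. This is not the actual program described above, which also handles the serial link. It assumes a zero-run entry is a line consisting of exactly the letter Z followed by four hex digits, and it passes every other line through untouched since the rest of the file syntax is not spelled out here. The function name is hypothetical.

```python
def expand_zero_runs(lines):
    """Replace each Znnnn line with nnnn (hex) lines of 0000."""
    out = []
    for line in lines:
        text = line.strip()
        if len(text) == 5 and text[0] in "Zz":
            count = int(text[1:], 16)       # nnnn is a hex word count
            out.extend(["0000"] * count)
        else:
            out.append(text)                # word or address line, unchanged
    return out


print(expand_zero_runs(["C0A1", "Z0003", "30DF"]))
# ['C0A1', '0000', '0000', '0000', '30DF']
```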

LOADING CORE CONTENTS

The loader processes entries at approximately 1 per second, since it is flipping Console Entry Switches and pushing the Prog Start button for each entry. Due to the debounce logic for the pushbuttons and other factors, I didn't want to go much faster in order to ensure reliable loading of memory.

At this rate, the entire memory is loaded in just under two and a third hours. On my own 1130 with its Storage Access Channel, I was able to use my FPGA based extension box to load that amount of memory in a couple of seconds. This machine does not have the SAC and thus I fall back to the much slower method of manipulating the console switches and buttons remotely.
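
The load time estimate is easy to reproduce. The sketch below assumes one entry per word and ignores the handful of extra entries that set addresses; the function name is hypothetical.

```python
def load_minutes(words, seconds_per_entry=1.0):
    """Estimated minutes to toggle in the given number of words."""
    return words * seconds_per_entry / 60


full = load_minutes(8192)
print(f"{full:.1f} minutes, {full / 60:.2f} hours")   # 136.5 minutes, 2.28 hours
```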


An Arduino controls several relay boards, which are hooked to the console entry switches and to both the Prog Start and the Load IAR buttons. When activated, they produce the same result as if the CES switch was flipped on or the button was pushed. I would never be able to toggle in data as fast as the tool does. Slow as it is, it is more than ten times as fast as I would be, much more accurate, and spares my hands all the wear and tear. 

STABBED IN THE BACK BY MY WINDOWS 11 BASED LENOVO PC

I kicked off the load process, ready to work on other projects for the 2.3 hours that the 1130 would be busy getting everything loaded into memory. I was more than a third of the way through the load process, almost 50 minutes after I started it, when the hardware or software decided to crash and reboot. 

Now I need to modify the deck so that it will set the proper start address and begin loading where it left off, for the remaining 1 2/3 hours of load time. I don't want to get this wrong, otherwise I ruin the entire load, so I went home and will work on it when I am calmed down. 

While I work to recover from this setback, you can enjoy a few minutes of loading without any comments. 



More testing of the console printer controller logic in the IBM 1130

SHORT VIDEOS OF SOLENOIDS ENGAGING FROM XIO WRITE COMMAND EXECUTION

Here are two videos in slow motion of characters being requested - you can see a few solenoids activate to fire off the selection of that character and trigger a print cycle. These are two different character codes thus different solenoids of the character selection group trip in each. The sound of the fans, slowed down, is an annoying buzz.



The third video is the solenoid in the function group activating to trigger a line feed. This happens when the XIO Write sends the code 0300 for a line feed operation. Sorry that due to the orientation when taking video, YouTube insists on calling this a short rather than a regular video.


DEVICE GOES NOT READY AND BUSY IF IT NEEDS TO SHIFT TO UPPER CASE ON BALL

When the controller sees a character code request for a position on the opposite hemisphere from where the typewriter is currently resting - in other words the 'upper case' or 'lower case' side of the ball - it first fires off a shift solenoid to flip the ball around. The logic waits for a positive confirmation through a microswitch that this has completed, staying busy until that point. 

Since the original printer is gummed up and not under motor power, that cycle does not take place and this leaves the controller logic hung in the busy state. I can see that with an XIO Sense Device execution. This is a healthy sign from the controller logic.

REMOVING PRINTER FOR RESTORATION

I removed the console printer from the computer. This involves removing the faceplate which has the 16 Console Entry Switches which are cabled to the CPU itself. You then have to pull some SMS paddle cards from the signal and power SMS cages inside the machine. Finally the cable has to be snaked out of the machine, a tedious task.

Printer on its side to video the solenoids

1053 moved to the bench for restoration

WILL PUT MY 1130'S PRINTER ON THIS MACHINE AS IT IS MOSTLY WORKING RIGHT

I grabbed my 1053 from the bench where I was finishing up its restoration and moved it over near the 1130 I am working on currently. I will set up a table where it can sit, then plug it into the 1130 and make use of it to further validate the device controller logic.

My 1053 ready to install on the 1130 being restored

My code to fire off characters is already updated to provide for a short interrupt routine that simply resets the printer response status and branches out to resume the mainline execution. I will make a further tweak where I can read the CES switches and use that as the character to type, a convenience compared to loading the IAR and then loading the data value with several button presses, switch rotations and switch settings. 

1053 EMULATOR IS READY TO BE CABLED AND TESTED

It has been years since I built this emulator to plug into an 1130 in place of the Selectric typewriter printer. As such, I am not sure how debugged it was but at some convenient time I will plug this in and see what results I get.