1053 CONSOLE PRINTER RESTORATION
I experimented with removing some oil and grease from around the torque limiter and CR clutch springs, hoping that the speed and energy of the return will increase. It seemed to go the wrong way initially, which suggests I moved some oil inside with my first treatment.
SAC INTERFACE FOR ADDING PERIPHERALS TO THE 1130
My core loading function is placing values in core, but skipping over some locations. It puts 4400 in location 0100 but then 0101 to 010B are skipped, before the next word 0F1F is placed at 010C instead of 0101. After handling a few cycle steal write transactions, when I tried to set the SAR to the next starting location for another group of words, what came back didn't match what I sent.
Suspects include my auto-increment function for the SAR and the logic to save incoming words and copy them to outgoing buffers. For the first problem, the approach to take is careful examination of the FSMs involved looking for race hazards or other possible flaws. The second is more puzzling since the first SAR setting transaction worked properly.
This continues to be hard to track down - I keep developing new diagnostic LED assignments and testing, hoping to discover the root causes. A big part of the problem is the poorly documented USB transfer mechanism from opencores atop a complex Cypress mechanism, coupled with the poorly documented pyUSB module. I really, really, really need a rock solid transactional transport to make this system work. Time for some research, fresh thinking and experimentation.
I decided on a restructuring of my code on the PC side to send all 12 words as a single write to the USB, just as the return of the 12 word results is a single read. I then redesigned the main read/write logic to the USB from the FPGA side to pull in all 12 words as a continuous sequence.
I make the assumption that the FIFO in the USB hardware is filled with all 12 words thus I won't have any pauses between words. I can then set the system up as twelve sequential cycles in the FSM, no interlocking or conditional state changes, which hopefully will make the transport more steady.
A similar change is made to the write side, sending all 12 output words in a continuous sequence of 12 cycles. Finally, I added a signal that tells me when we start reading words but have not finished fetching the last of the 12. It is useful for those FSMs that get triggered as soon as I see the command in word 1, but need information from the entire incoming packet.
To test this, I built a testing scaffold in the FPGA that will light diagnostic lamps to validate that I have the proper contents from a test transaction I will send in from the PC. This will have fixed values for the various data words, which I will check and use to light the diagnostic LEDS. I should immediately see if my inbound half of the transaction has worked properly.
The second half of the scaffolding emits a different fixed set of words as the return packet, which I will examine in the Python code to determine if my outbound side works properly. These two tests will let me narrow down my focus on further improvements.
Having completed all the myriad adjustments in the fpga, except that I am only echoing back the incoming words rather than replacing them with new content. I can switch this function to a fixed set of output values when I am ready for those tests.
I did uncover a fascinating fact that helps explain the reason that my minimum transaction is in the range of 1.5 ms. The logic in the ezusb_IO module from Opencores is set up to send short packets (e.g. packets smaller than the block of 1024 that is the default configuration) after a timer expires from the last write from the FPGA.
This time is defined as the number of sets of 65, 536 clocks that have occurred - with the smallest value you can set for this is 1 - thus about 1.3ms from when the packet arrives and I begin reading words until the outgoing short packet is flushed out, containing my 12 words of real payload. If i set this parameter to 0, the USB will wait forever to fill a 1024 byte packet.
If I can change this behavior to flush out the short packet much faster, I should materially improve the performance of all the USB transactions. Very important discovery, and I quickly found the logic involved in the Verilog code for the ezusb_io module. It checks for the top half of a 32 bit counter to match the timeout parameter, thus the logic is in units of 65,536 clock ticks. I just need to change the scope of the comparison to shorten the timeout substantially.
I decided that 256 clock cycles or roughly 5.3 microseconds is long enough. Change made and included with other changes in my testing regime. The echo transaction did not return the expected values, so working with LED diagnostics to figure out what is captured on the read.
After a few runs to check the words seven bits at a time, I confirmed that what was sent back was the same data we had latched on the inbound side. Not what I thought I was sending, but that could be multiple problems. It might be defects in how I latch incoming words on the link, but also it could be due to the changes I made to try to write all 12 words from Python as a single USB write. I put in some diagnostics on the Python side to show what is going outbound.
I corrected a flaw in my Python code to sent out the 12 word group and began to get back nearly the right data. It seemed I was one off, due to a cycle delay on startup of reading messages. Now the test message came back perfectly and my machine status transactions are working properly.
I moved back to debugging the core load function. Things were looking up. I loaded blazingly fast and ran into verification problem for the sections where I used the "Z" code in the PC file, which asks to write nnn words of zero beginning at the current SAR. I have some flaw in the Python code either writing or verifying those zeroes - as a first hypothesis - and will look at code and then build some diagnostics if necessary.
I believe my problem is in the logic to verify the contents. I set the SAR with a call and then prefetch a block of 11 words from core, pulling from them as needed. The problem is when I encounter a "@" directive in the PC file, which says to jump to a new block of core. I need to fetch a new block of 11 words from this new location.
Another problem occurred when I encountered a Z directive and fetched the next block of 11, when I should just have continued unpacking from the buffer and checking for zeroes. That resulting in my skipping ahead of the intended location in core, ensuring mismatches after this point.
Change made, testing again, this time trying to verify the contents of core manually. The first block of storage - 11 words - was perfect. The next words were a copy of the first block, not a new group - uhoh. I also had a failure where the code to test contents received different contents than were actually in memory.
The problems I now am shooting could be caused by either or both sides - PC program or fpga logic. Back to the diagnostics - emitting messages at key points in the core load code and turning on status lights to ensure that my cycle steal read, CS write, fetch/store and other FSMs are properly resetting after each transaction.
My first test mainly highlighted problems in my diagnostic writes, Quickly fixed, then I could see what was really occurring. The reason for my repeated sets of the same data seems to be that my FSM which walks through the eleven words, using the cycle steal read or write FSM repeatedly, is somehow taking a second pass through the FSM, writing the same 11 words twice. Should be easy to spot now that I know what is happening.
Problem found and fixed by adding a last interlocking step in the FSM that drives the 1 to 11 word CS read or write cycles. I manually checked core and validated that the first half of the function, which writes core, had worked properly. The remaining flaw is in the verification process.
A quick look showed me that I was fetching the first 11 words at the correct address, but then fetching another time so that I really had words 12 to 22 not 1 to 11. This is a flaw in the Python code doing the verification - will find and fix it.
While doing this last testing, I had a troubling symptom. I flipped the power switch for the 1130 and it flickered but didn't come on. I had to turn the main breaker off then on to get it to come up. The first time it came up, it stayed in 'reset' mode but after a power cycle it was back to normal. I hope this isn't an emerging problem with the machine.
One last test proved out the functioning on my smaller test file, so I let it rip with the full DCIP file which will load all 16K of core. That is a good relative speed test for my code that will write 16,384 words then read them all back. At full bore 1130 speed that would take a bit more than a tenth of a second, at 3.6 microseconds memory access speed.
Since I made the change with the short packet timeout, lowering it to 5.3 microseconds, we are operating on a similar timescale to the 1130 itself. I am packing multiple words, close to 11 per transaction, or half a microsecond per word added to the memory access itself. When I loaded the core file that does the write then readback of 16,384 words, it seemed to take less than a second. The results were perfect, I hit the Int Req key and the utility began typing out the menu on the 1053.
From a speed standpoint, this is exceptionally good news. It tells me that my virtual disk should run acceptably fast - it may be necessary to add in the delays while the virtual disk platter rotates the requested sector under the head. I will do the testing tomorrow on the virtual disk function.
I do have a remaining flaw in my validation routine, where it seems to get out of sync when it is reading back sections of zeroes written by the 'Z" directive. I will look at that as well tomorrow. For now, I am very pleased with the progress.
This continues to be hard to track down - I keep developing new diagnostic LED assignments and testing, hoping to discover the root causes. A big part of the problem is the poorly documented USB transfer mechanism from opencores atop a complex Cypress mechanism, coupled with the poorly documented pyUSB module. I really, really, really need a rock solid transactional transport to make this system work. Time for some research, fresh thinking and experimentation.
I decided on a restructuring of my code on the PC side to send all 12 words as a single write to the USB, just as the return of the 12 word results is a single read. I then redesigned the main read/write logic to the USB from the FPGA side to pull in all 12 words as a continuous sequence.
I make the assumption that the FIFO in the USB hardware is filled with all 12 words thus I won't have any pauses between words. I can then set the system up as twelve sequential cycles in the FSM, no interlocking or conditional state changes, which hopefully will make the transport more steady.
A similar change is made to the write side, sending all 12 output words in a continuous sequence of 12 cycles. Finally, I added a signal that tells me when we start reading words but have not finished fetching the last of the 12. It is useful for those FSMs that get triggered as soon as I see the command in word 1, but need information from the entire incoming packet.
To test this, I built a testing scaffold in the FPGA that will light diagnostic lamps to validate that I have the proper contents from a test transaction I will send in from the PC. This will have fixed values for the various data words, which I will check and use to light the diagnostic LEDS. I should immediately see if my inbound half of the transaction has worked properly.
The second half of the scaffolding emits a different fixed set of words as the return packet, which I will examine in the Python code to determine if my outbound side works properly. These two tests will let me narrow down my focus on further improvements.
Having completed all the myriad adjustments in the fpga, except that I am only echoing back the incoming words rather than replacing them with new content. I can switch this function to a fixed set of output values when I am ready for those tests.
I did uncover a fascinating fact that helps explain the reason that my minimum transaction is in the range of 1.5 ms. The logic in the ezusb_IO module from Opencores is set up to send short packets (e.g. packets smaller than the block of 1024 that is the default configuration) after a timer expires from the last write from the FPGA.
This time is defined as the number of sets of 65, 536 clocks that have occurred - with the smallest value you can set for this is 1 - thus about 1.3ms from when the packet arrives and I begin reading words until the outgoing short packet is flushed out, containing my 12 words of real payload. If i set this parameter to 0, the USB will wait forever to fill a 1024 byte packet.
If I can change this behavior to flush out the short packet much faster, I should materially improve the performance of all the USB transactions. Very important discovery, and I quickly found the logic involved in the Verilog code for the ezusb_io module. It checks for the top half of a 32 bit counter to match the timeout parameter, thus the logic is in units of 65,536 clock ticks. I just need to change the scope of the comparison to shorten the timeout substantially.
I decided that 256 clock cycles or roughly 5.3 microseconds is long enough. Change made and included with other changes in my testing regime. The echo transaction did not return the expected values, so working with LED diagnostics to figure out what is captured on the read.
After a few runs to check the words seven bits at a time, I confirmed that what was sent back was the same data we had latched on the inbound side. Not what I thought I was sending, but that could be multiple problems. It might be defects in how I latch incoming words on the link, but also it could be due to the changes I made to try to write all 12 words from Python as a single USB write. I put in some diagnostics on the Python side to show what is going outbound.
I corrected a flaw in my Python code to sent out the 12 word group and began to get back nearly the right data. It seemed I was one off, due to a cycle delay on startup of reading messages. Now the test message came back perfectly and my machine status transactions are working properly.
I moved back to debugging the core load function. Things were looking up. I loaded blazingly fast and ran into verification problem for the sections where I used the "Z" code in the PC file, which asks to write nnn words of zero beginning at the current SAR. I have some flaw in the Python code either writing or verifying those zeroes - as a first hypothesis - and will look at code and then build some diagnostics if necessary.
I believe my problem is in the logic to verify the contents. I set the SAR with a call and then prefetch a block of 11 words from core, pulling from them as needed. The problem is when I encounter a "@" directive in the PC file, which says to jump to a new block of core. I need to fetch a new block of 11 words from this new location.
Another problem occurred when I encountered a Z directive and fetched the next block of 11, when I should just have continued unpacking from the buffer and checking for zeroes. That resulting in my skipping ahead of the intended location in core, ensuring mismatches after this point.
Change made, testing again, this time trying to verify the contents of core manually. The first block of storage - 11 words - was perfect. The next words were a copy of the first block, not a new group - uhoh. I also had a failure where the code to test contents received different contents than were actually in memory.
The problems I now am shooting could be caused by either or both sides - PC program or fpga logic. Back to the diagnostics - emitting messages at key points in the core load code and turning on status lights to ensure that my cycle steal read, CS write, fetch/store and other FSMs are properly resetting after each transaction.
My first test mainly highlighted problems in my diagnostic writes, Quickly fixed, then I could see what was really occurring. The reason for my repeated sets of the same data seems to be that my FSM which walks through the eleven words, using the cycle steal read or write FSM repeatedly, is somehow taking a second pass through the FSM, writing the same 11 words twice. Should be easy to spot now that I know what is happening.
Problem found and fixed by adding a last interlocking step in the FSM that drives the 1 to 11 word CS read or write cycles. I manually checked core and validated that the first half of the function, which writes core, had worked properly. The remaining flaw is in the verification process.
A quick look showed me that I was fetching the first 11 words at the correct address, but then fetching another time so that I really had words 12 to 22 not 1 to 11. This is a flaw in the Python code doing the verification - will find and fix it.
While doing this last testing, I had a troubling symptom. I flipped the power switch for the 1130 and it flickered but didn't come on. I had to turn the main breaker off then on to get it to come up. The first time it came up, it stayed in 'reset' mode but after a power cycle it was back to normal. I hope this isn't an emerging problem with the machine.
One last test proved out the functioning on my smaller test file, so I let it rip with the full DCIP file which will load all 16K of core. That is a good relative speed test for my code that will write 16,384 words then read them all back. At full bore 1130 speed that would take a bit more than a tenth of a second, at 3.6 microseconds memory access speed.
Since I made the change with the short packet timeout, lowering it to 5.3 microseconds, we are operating on a similar timescale to the 1130 itself. I am packing multiple words, close to 11 per transaction, or half a microsecond per word added to the memory access itself. When I loaded the core file that does the write then readback of 16,384 words, it seemed to take less than a second. The results were perfect, I hit the Int Req key and the utility began typing out the menu on the 1053.
From a speed standpoint, this is exceptionally good news. It tells me that my virtual disk should run acceptably fast - it may be necessary to add in the delays while the virtual disk platter rotates the requested sector under the head. I will do the testing tomorrow on the virtual disk function.
I do have a remaining flaw in my validation routine, where it seems to get out of sync when it is reading back sections of zeroes written by the 'Z" directive. I will look at that as well tomorrow. For now, I am very pleased with the progress.
Nice catch on the USB i/o routines!
ReplyDelete