Friday, December 9, 2022

Found the source of the erratic behavior and improved it

CAREFULLY INSTRUMENTED ANALYZER SETUP

Armed with an error latch that was set when the byte transfer from the Arduino ended before the read from the memory interface had completed, plus a shadow of that latch properly transitioned across clock domains to the signals driven by the memory interface (ramclock is generated), I set up the analyzer on all the signals of the user interface to the memory module.

FOUND THE CASE THAT PRODUCED THE STALL

When the error latch went on and the shadow latch appeared a couple of cycles later, I immediately saw the condition that would clearly and obviously cause a stall. The user interface requires that the app_en signal remain asserted while app_rdy is false; otherwise the request is lost. It must remain asserted until the clock cycle when both the enable signal and the ready signal are true. At that point the request is accepted and we can drop the enable.
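
The handshake rule just described can be sketched in a few lines of Python. This is only an illustrative model, not HDL from the project: the function name accepted_cycle and the list-of-booleans trace format are assumptions made for the example, with one list entry per clock cycle.

```python
from typing import Optional, Sequence


def accepted_cycle(app_en: Sequence[bool], app_rdy: Sequence[bool]) -> Optional[int]:
    """Return the cycle in which the request is accepted, or None if it is lost.

    The request begins on the first cycle where app_en is high. It is accepted
    on the first cycle where app_en and app_rdy are both high. If app_en drops
    before that cycle arrives, the request is lost.
    """
    started = False
    for cycle, (en, rdy) in enumerate(zip(app_en, app_rdy)):
        if en:
            started = True
            if rdy:
                return cycle  # both high in the same cycle: accepted
        elif started:
            return None  # enable dropped while ready was still false
    return None  # never asserted, or trace ended before acceptance


# Enable held through a not-ready period: accepted once ready returns.
held = accepted_cycle([False, True, True, True, False],
                      [True, False, False, True, True])

# Enable pulsed for one cycle while ready is false: request lost.
dropped = accepted_cycle([False, True, False, False],
                         [True, False, False, True])
```

Here held is the cycle index where both signals are high, and dropped is None because enable went away too early.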

I saw that the app_rdy signal dropped to false, during a refresh cycle for the DRAM, in the same cycle when my logic asserted app_en. I then dropped app_en although the ready signal was still false. It appeared that if the ready signal was already false a cycle before I was to assert enable, I would handle it correctly, and of course if the interface was ready when I asserted enable, it worked properly. The failure occurs only if ready happens to drop just as I prepare to assert enable.

This is very timing dependent and nondeterministic from the point of view of my logic, as the timing of the memory interface's refresh cycles is buried inaccessibly in the interface IP. Things had to align just right (or just wrong) to fail, something that happened often enough to fail during a 321 word transaction of reads but not so often that every read would fail. Exactly the kind of erratic situation I knew was the cause of the bad behavior.

DEFECT IN MY STATE MACHINE HANDLING OF MEMORY INTERFACE BUSY

My state machine for reading and writing memory would check the ready signal and, if it were true, advance to a next state which asserted the enable signal for one cycle. This worked fine if ready were true during the whole process, and worked properly if ready was false when tested, but if ready were to drop to false at the start of the next state, when I was raising enable, things would go wrong.

This is due to the change times of the various signals. Inputs to the state machine such as app_rdy, which determine the next state to enter at the clock edge, might change at that same clock edge, just as we move to the next state.

I should have tested the ready signal in the same cycle where I was asserting the enable signal. Testing ready in one state and asserting enable in the next was a stylistic choice, and it led to the error in handling the case where ready drops just as I asserted enable.
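
A cycle-by-cycle model of the defect, as a hedged Python sketch rather than the actual HDL. The state names CHECK, ASSERT and DONE and both function names are invented for illustration; the only thing taken from the description above is the structure: test ready in one state, then pulse enable for exactly one cycle in the next.

```python
def run_check_then_pulse(app_rdy):
    """Model of the defective approach; returns the app_en trace per cycle."""
    state = "CHECK"
    app_en = []
    for rdy in app_rdy:
        if state == "CHECK":
            app_en.append(False)
            if rdy:
                state = "ASSERT"  # decision made from this cycle's ready
        elif state == "ASSERT":
            app_en.append(True)   # one-cycle pulse, ready is not re-tested
            state = "DONE"
        else:
            app_en.append(False)
    return app_en


def request_accepted(app_en, app_rdy):
    """True if enable and ready were ever high in the same cycle."""
    return any(en and rdy for en, rdy in zip(app_en, app_rdy))


# Ready stays high: works.
always_ready = [True, True, True, True]
# Ready already low when tested: the machine waits, then works.
late_ready = [False, False, True, True, True]
# Ready is high when tested but drops in the very next cycle: request lost.
refresh_hits = [True, False, False, True, True]

ok_1 = request_accepted(run_check_then_pulse(always_ready), always_ready)
ok_2 = request_accepted(run_check_then_pulse(late_ready), late_ready)
lost = not request_accepted(run_check_then_pulse(refresh_hits), refresh_hits)
```

Only the third trace fails, which matches the observation that the stall needs ready to drop at exactly the cycle where enable is raised.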

CORRECTION WAS EASY

I changed the state machine so that it doesn't raise app_en (or app_wdf_en for writing) until it reaches a state that also checks ready in the same cycle. In that state it either keeps the enable high because ready is false, or drops enable and moves on to further states to complete the read or write operation.
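
The corrected behavior can be modeled the same way. Again this is a sketch under assumptions, not the project's HDL: the state names IDLE, REQUEST and DONE and the function name run_hold_until_ready are hypothetical. The point it illustrates is that enable stays high until a cycle in which ready is also high.

```python
def run_hold_until_ready(app_rdy):
    """Model of the corrected approach; returns the app_en trace per cycle."""
    state = "IDLE"
    app_en = []
    for rdy in app_rdy:
        if state == "IDLE":
            app_en.append(False)
            state = "REQUEST"
        elif state == "REQUEST":
            app_en.append(True)      # enable is high for this whole cycle
            if rdy:
                state = "DONE"       # ready seen in the same cycle: accepted
            # otherwise stay in REQUEST and keep enable high
        else:
            app_en.append(False)
    return app_en


# The trace that broke the one-cycle pulse: ready drops right after cycle 0.
refresh_hits = [True, False, False, True, True]
en_trace = run_hold_until_ready(refresh_hits)

# Count cycles where the interface actually accepted the request.
accept_count = sum(1 for en, rdy in zip(en_trace, refresh_hits) if en and rdy)
```

With this structure the request is accepted exactly once no matter where the not-ready period falls, as long as ready eventually returns.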

QUICK TEST SHOWS THE STALL CONDITION HAS DISAPPEARED

I reran my testing and the error latch never went on. Further, the state machine didn't stall and returned to idle once the unload transaction was complete. Looking at the data returned, however, shows that this is still not operating correctly.

Thursday, December 8, 2022

Can't trust the DRAM memory interface - considering radical restructuring to use on chip static ram

STILL DON'T HAVE CAUSE OF ERRATIC BEHAVIOR NAILED DOWN, BUT . . .

I can instrument internal logic analyzer cores for a relatively small number of signals at a time and can only record 8K cycles on the core. Second, each clock domain requires a separate analyzer core and they aren't easy to trigger 'simultaneously'. Third, signals in the SPI link domain can't be traced by an analyzer core because it needs a constant rather than intermittent clock signal.

Thus each time I have a new suspicion I have to resynthesize, set up the testbed and then capture only a small sample in time. The failure is so likely that I can't get through a single unload of 321 words, but it is not deterministic, so it fails on different words and perhaps in different ways on each test. If it always failed on a given word of the sector I could set better triggering for the analyzer cores.

Grossly, however, I seem to have stalling of the state machines and only return the last good value for all the subsequent transactions. My current suspicion is that it is triggered by a refresh cycle of the DRAM at exactly the worst moment. 

The delay could be 32 clock cycles, which when added to the delays transiting through FIFOs across clock domains, can add up to the major SPI link state machine having moved beyond the point where it needed the RAM data. I don't have proof that this is happening, although I will likely continue to construct tests where I might observe the smoking gun.

What I do know, however, is that were I to have a memory with a known and consistent access time that fits inside the state machine steps for the SPI link, I could have a reliable upload. Thus, if I can't find and fix the cause of erratic behavior, I might shift to a deterministic and reliable method to avoid said erratic conditions.

STATIC RAM ON FPGA CHIP CAN OFFER DETERMINISTIC READ AND WRITE

FPGA chips have static RAM available onboard. It comes as both block RAM and distributed RAM. Block RAM consists of sections of SRAM that are embedded in the chip and available to the designer. The look up tables and flip flops that are usually employed to create logic circuits can also be configured as SRAM, and these are distributed among the LUTs of the FPGA.

Block RAM use has essentially zero impact on the amount of logic that can be instantiated on the FPGA chip, since it occupies distinct areas of the chip that are not involved in generalized logic. Each chip has a fixed capacity of block RAM - in the case of the board I am using, 1,658,880 bits, organized in words up to 18 bits wide.

Distributed RAM, on the other hand, takes up LUTs that otherwise would be available to form logic circuits. The more memory you instantiate, the less logic you can create. There are only 3,650 LUTs, the basic building block of an FPGA, on my chip. Each LUT used as distributed RAM holds 16 bits. An entire cartridge would require 521,304 LUTs and even a single cylinder would consume a large fraction of the available LUT capacity.

SIZE CHALLENGE AS ENTIRE CARTRIDGE IMAGE CAN'T FIT ON THIS FPGA CHIP

One cylinder of the 2315 disk has eight sectors of 321 words, each 16 bits, thus it takes only 41,088 bits to hold that cylinder. The problem is when you look at the entire cartridge, all 203 cylinders of it, which would take about five times the capacity of the block RAM to hold in its entirety. Distributed RAM provides little additional capacity.
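
The sizing figures in this post can be checked with a short calculation. All inputs below are numbers quoted in the text (sector and cylinder geometry, block RAM bits, LUT count, 16 bits per LUT); the variable names are just labels chosen for the example.

```python
# Geometry and chip figures as quoted in the post.
WORDS_PER_SECTOR = 321
SECTORS_PER_CYLINDER = 8
BITS_PER_WORD = 16
CYLINDERS = 203
BLOCK_RAM_BITS = 1_658_880
LUTS_AVAILABLE = 3_650
BITS_PER_LUT = 16

cylinder_bits = WORDS_PER_SECTOR * SECTORS_PER_CYLINDER * BITS_PER_WORD
cartridge_bits = cylinder_bits * CYLINDERS

# How many times the block RAM capacity a full cartridge would need.
block_ram_ratio = cartridge_bits / BLOCK_RAM_BITS

# LUTs needed if the data were held in distributed RAM instead.
cartridge_luts = cartridge_bits // BITS_PER_LUT
cylinder_luts = cylinder_bits // BITS_PER_LUT
cylinder_lut_fraction = cylinder_luts / LUTS_AVAILABLE

print(f"cylinder:  {cylinder_bits:,} bits, {cylinder_luts:,} LUTs "
      f"({cylinder_lut_fraction:.0%} of available)")
print(f"cartridge: {cartridge_bits:,} bits, {cartridge_luts:,} LUTs, "
      f"{block_ram_ratio:.2f}x block RAM")
```

One cylinder fits in block RAM many times over, while the whole cartridge needs roughly five times the block RAM on the chip.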

The erratic DDR3 DRAM, on the other hand, is 256 MB, far more than is needed for a cartridge. This is the reason I selected the DRAM initially to hold the cartridge image while the virtual drive was operating.

CONSIDERING USING BLOCK RAM FOR ONE CYLINDER AT A TIME, DRAM FOR REST

If I have an entire cylinder in the block RAM, then the Unload transaction up to the Arduino will be deterministic and reliable. The disk drive controller reading and writing through the head electronics would also be satisfied easily and reliably from this cylinder buffer. 

When moving to a different cylinder, the current contents (potentially updated if writes from the CPU have taken place) would be written to the DRAM and then the contents of the new cylinder would be read from DRAM and written to the block RAM.
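
The swap described above amounts to a one-cylinder write-back buffer. A minimal Python sketch of the idea follows. The class name CylinderBuffer, its methods, and the use of plain lists to stand in for the DRAM and the block RAM are all assumptions for illustration and say nothing about how the FPGA logic would actually be structured.

```python
WORDS_PER_CYLINDER = 8 * 321  # eight sectors of 321 words


class CylinderBuffer:
    """Holds one cylinder locally; the rest of the cartridge lives in 'dram'."""

    def __init__(self, dram):
        self.dram = dram                  # list of per-cylinder word lists
        self.current = 0                  # cylinder now held in the buffer
        self.words = list(dram[0])        # stands in for the block RAM

    def read(self, offset):
        return self.words[offset]

    def write(self, offset, value):
        self.words[offset] = value & 0xFFFF   # 16-bit words

    def seek(self, cylinder):
        """Write the current cylinder back, then load the new one."""
        if cylinder == self.current:
            return
        self.dram[self.current] = list(self.words)   # dump to DRAM
        self.words = list(self.dram[cylinder])       # load from DRAM
        self.current = cylinder


# Three small test cylinders, each filled with its own cylinder number.
dram = [[c] * WORDS_PER_CYLINDER for c in range(3)]
buf = CylinderBuffer(dram)
buf.write(5, 0x1234)   # CPU write lands in the buffer only
buf.seek(2)            # swap: cylinder 0 is written back, cylinder 2 loaded
```

All reads and writes between seeks touch only the local buffer, which is what makes their timing predictable; only seek touches the DRAM.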

SOME CHALLENGES TO CONSIDER WITH THE DUAL MEMORY APPROACH

The time it would take to dump 321 words from block RAM to DRAM, then load new block RAM contents from DRAM, may be longer than the time a real disk drive would take to perform a single cylinder seek. The minimum seek time is 15 milliseconds, a relative eternity to the FPGA operating with 10 or 20 ns cycles; at 20 ns that provides about 2,336 cycles per word to do both a read and a write.
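
The cycle budget can be worked out directly. The 15 ms seek time, the 10 and 20 ns cycle times and the 321-word count come from the text; the 2,336 figure corresponds to 321 words at a 20 ns cycle. The full-cylinder case (eight sectors, 2,568 words) is added here only as an extra data point based on the cylinder size given earlier, and is an assumption about what a complete swap would move.

```python
SEEK_TIME_NS = 15_000_000  # 15 ms minimum seek, in nanoseconds


def cycles_per_word(words: int, cycle_ns: int) -> int:
    """Whole clock cycles available per word within one minimum seek time."""
    total_cycles = SEEK_TIME_NS // cycle_ns
    return total_cycles // words


ONE_SECTOR = 321
FULL_CYLINDER = 8 * 321  # assumption: a swap moves all eight sectors

for words in (ONE_SECTOR, FULL_CYLINDER):
    for cycle_ns in (10, 20):
        budget = cycles_per_word(words, cycle_ns)
        print(f"{words:>5} words at {cycle_ns} ns: {budget} cycles per word")
```

Even in the tightest case shown, a full cylinder at 20 ns, there are a few hundred cycles per word to cover one DRAM write and one DRAM read.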

The consequence of not meeting that timing would be that my virtual 2315 will no longer be timing accurate on short seeks. Even worse, the drive controller signals that the access is complete via a single shot timer, not some signal from the drive, thus the CPU will be justified in beginning to read or write before our slower dump/restore has completed.

Another issue results from the current SPI link protocol, where the Arduino specifies the particular sector (including cylinder) that it wants to load or upload as part of each transaction. Thus, the FPGA might be commanded to move to a new cylinder by the first two words of the transaction, yet be expected to deliver words almost instantly on word 3, which is far too quick for the swap to occur.

It is conceivable that I could implement a reverse feedback signal to the Arduino that would hold it in mid word of a transaction until the swap completed. This is the major remaining problem, because the seek timing issues raised at the start of this section turn out not to be a real constraint.

THIS CONCEPT IS BEING EXPANDED

It appears I can keep up with the disk drive seeks rather easily, so my only issue is in holding off the Arduino Unload or Load transactions. I am looking at various ways to handle this elegantly.