CAREFULLY INSTRUMENTED ANALYZER SETUP
Armed with an error latch that was set when the byte transfer from the Arduino ended before the read from the memory interface had completed, plus a shadow of that latch properly synchronized across clock domains to the signals driven by the memory interface (ramclock is generated), I set up the analyzer to capture all the signals of the user interface with the memory module.
FOUND THE CASE THAT PRODUCED THE STALL
When the error latch went on and the shadow latch appeared a couple of cycles later, I immediately saw the condition that would cause a stall. The user interface requires that the app_en signal remain asserted while app_rdy is false; otherwise the request is lost. It must remain asserted until the clock cycle when both the enable signal and the ready signal are true. At that point the request is accepted and we can drop the enable.
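The acceptance rule can be sketched as a small behavioral model. This is Python rather than HDL, purely for illustration; the function name accepted_cycles and the sample traces are assumptions made for the example, not captured analyzer data.

```python
def accepted_cycles(app_en, app_rdy):
    """Return the cycle numbers at which a request is accepted.

    A request is accepted only in a cycle where both app_en and
    app_rdy are true.
    """
    return [i for i, (en, rdy) in enumerate(zip(app_en, app_rdy)) if en and rdy]


# Hypothetical traces: ready is low during cycles 1 and 2 (e.g. a refresh).
rdy = [True, False, False, True, True]

# Enable held until ready returns: the request is accepted at cycle 3.
held = [False, True, True, True, False]

# Enable dropped after one cycle while ready is still low: request lost.
dropped = [False, True, False, False, False]

print(accepted_cycles(held, rdy))     # [3]
print(accepted_cycles(dropped, rdy))  # []
```

The second trace is the losing case: enable was never high in a cycle where ready was also high, so the interface never saw the request.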
I saw that the app_rdy signal dropped to false, during a refresh cycle for the DRAM, in the same cycle that my logic asserted app_en. I then dropped app_en although the ready signal was still false. It appeared that if the ready signal was already false a cycle before I was to assert enable, I handled it correctly, and of course if the interface was ready when I asserted enable, it worked properly. The failure occurs only if ready happens to drop just as I assert enable.
This is very timing dependent and nondeterministic from the point of view of my logic, as the times when the memory interface performs refresh cycles are buried inaccessibly in the interface IP. Things had to align just right (or just wrong) to fail, something that happened often enough to fail during a 321-word transaction of reads but not so often that every read would fail. Exactly the kind of erratic situation I knew was the cause of the bad behavior.
DEFECT IN MY STATE MACHINE HANDLING OF MEMORY INTERFACE BUSY
My state machine for reading and writing memory would check the ready signal and, if it was true, advance to a next state which asserted the enable signal for one cycle. This worked fine if ready was true during the whole process, and worked properly if ready was false while I was testing it, but if ready dropped to false at the start of the next state, just when I was raising enable, things would go wrong.
This comes down to when the various signals change. Inputs to the state machine such as app_rdy, which determine the next state to enter at the clock edge, might themselves change at that clock edge just as we move to the next state.
I should have tested the ready signal in the same cycle where I was asserting the enable signal. My habit of testing ready in one state and asserting enable in the next was a stylistic choice, and it led to the error in handling the case where ready drops just as I assert enable.
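A cycle-by-cycle sketch of the check-then-assert pattern shows the window. This is a Python behavioral model, not my actual HDL; the state names CHECK, ASSERT and DONE and the ready traces are hypothetical.

```python
def run_check_then_assert(app_rdy):
    """Model of the flawed pattern: test ready in one state, then assert
    enable for exactly one cycle in the next state without looking at
    ready again.

    Returns (trace of app_en per cycle, whether the request was accepted).
    """
    state = "CHECK"
    accepted = False
    en_trace = []
    for rdy in app_rdy:
        en = state == "ASSERT"      # enable is high only in ASSERT
        en_trace.append(en)
        if en and rdy:
            accepted = True         # both high in the same cycle
        # Next-state logic, evaluated at the end of the cycle.
        if state == "CHECK":
            if rdy:
                state = "ASSERT"
        elif state == "ASSERT":
            state = "DONE"          # enable dropped unconditionally
    return en_trace, accepted


# Ready stays high: works.
print(run_check_then_assert([True, True, True]))
# Ready already low while checking: the machine waits, then works.
print(run_check_then_assert([False, False, True, True]))
# Ready drops in the cycle right after the check passed: request lost.
print(run_check_then_assert([True, False, False, True]))
```

Only the third trace fails. Ready was true when tested, false one cycle later when enable went high, and the machine moved on anyway.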
CORRECTION WAS EASY
I changed the state machine so that it doesn't raise app_en (or app_wdf_en for writing) until it reaches a state that also checks ready. In that state it either keeps the enable high because ready is false, or drops the enable and moves on to further states to complete the read or write operation.
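The corrected behavior can be sketched the same way, as a Python behavioral model rather than the real HDL. The state names REQUEST and DONE are assumptions for the example; the point is only that enable and the ready test live in the same state.

```python
def run_hold_until_ready(app_rdy):
    """Model of the corrected pattern: assert enable and test ready in the
    same state, holding enable until a cycle in which ready is true.

    Returns (trace of app_en per cycle, whether the request was accepted).
    """
    state = "REQUEST"
    accepted = False
    en_trace = []
    for rdy in app_rdy:
        en = state == "REQUEST"     # enable is high for the whole state
        en_trace.append(en)
        if state == "REQUEST" and rdy:
            accepted = True         # enable and ready true together
            state = "DONE"          # only now may enable drop
    return en_trace, accepted


# Ready low for the first two cycles: enable is held, then accepted.
print(run_hold_until_ready([False, False, True, True]))
# Ready high immediately: enable lasts a single cycle.
print(run_hold_until_ready([True, True]))
```

Because the decision to drop enable is made in the very cycle where ready is observed, it no longer matters when a refresh pulls ready low.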
QUICK TEST SHOWS THE STALL CONDITION HAS DISAPPEARED
I reran my testing and the error latch never went on. Further, the state machine didn't stall and returned to idle once the unload transaction was complete. Looking at the data returned, however, showed that this is still not operating correctly.