EVIDENCE INDICATES THIS IS ERRATIC AND TIMING DEPENDENT
I can run the test transactions from the Arduino to the FPGA multiple times and I see it failing at different points. I load a fixed pattern where each word of the sector has its word number as the content - 1, 2, 3 etc. I then fetch the sector content up to the Arduino and report where the returned value does not match the word number.
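The check described above can be sketched in a few lines of host-side C++. This is only an illustration: the names makePattern and firstMismatch and the constant kSectorWords are made up for this example, and the 320 word length is taken from the unload size mentioned in this post rather than from the real sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed sector length in words (hypothetical constant for this sketch).
constexpr std::size_t kSectorWords = 320;

// Build the test pattern: word n holds the value n, counting from 1.
std::vector<uint16_t> makePattern(std::size_t words = kSectorWords) {
    std::vector<uint16_t> pattern(words);
    for (std::size_t i = 0; i < words; ++i) {
        pattern[i] = static_cast<uint16_t>(i + 1);
    }
    return pattern;
}

// Return the 1-based word number of the first mismatch, or 0 if every
// received word equals its own word number.
std::size_t firstMismatch(const std::vector<uint16_t>& received) {
    for (std::size_t i = 0; i < received.size(); ++i) {
        if (received[i] != static_cast<uint16_t>(i + 1)) {
            return i + 1;
        }
    }
    return 0;
}
```

Running firstMismatch over the words read back from each unload gives the word number that varies from run to run.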
I find two broad cases. In the first, after the FPGA hits some unknown state, it returns gibberish that is constant for every word and every transfer over the SPI link. The second and more meaningful case begins with agreement for some number of words, after which the returned value stays fixed at one value for all subsequent words.
The interesting observation for the second case is that the word number where it stops sending the proper value changes from test run to test run. It might be the third word, it might be the 50th word, but it is certain to occur sometime during the 320 words of an unload transaction.
This tells me that I don't have a rare situation like a metastable signal or a clock domain crossing problem. The failure window is large enough that it is certain to hit a transfer sometime during a single transaction. This is good, in that it should be easier to find than a very infrequent issue. However, it has not been obvious to me so far.
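To tell the two cases apart automatically, the returned words can be classified on the host side. The following is a minimal sketch, assuming the received words are already collected in a vector; the Failure enum, the Diagnosis struct and classify are names invented for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Failure { None, StuckFromStart, FrozenMidway, Other };

struct Diagnosis {
    Failure kind;
    std::size_t firstBadWord;  // 1-based word number; 0 when nothing failed
    uint16_t stuckValue;       // the repeated value for the two stuck cases
};

// Compare each word with its 1-based word number, then check whether
// everything from the first bad word onward repeats one fixed value.
Diagnosis classify(const std::vector<uint16_t>& rx) {
    std::size_t bad = 0;
    while (bad < rx.size() && rx[bad] == static_cast<uint16_t>(bad + 1)) {
        ++bad;
    }
    if (bad == rx.size()) {
        return {Failure::None, 0, 0};
    }
    const uint16_t stuck = rx[bad];
    for (std::size_t i = bad; i < rx.size(); ++i) {
        if (rx[i] != stuck) {
            return {Failure::Other, bad + 1, 0};  // not a constant tail
        }
    }
    // bad == 0 means no word ever matched: the first case in the text.
    return {bad == 0 ? Failure::StuckFromStart : Failure::FrozenMidway,
            bad + 1, stuck};
}
```

Logging firstBadWord and stuckValue across many runs makes the run-to-run variation easy to see.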
MANY FIXES IN ATTEMPT TO TIGHTEN UP RESISTANCE TO TIMING VARIATIONS
Because this was clearly a timing issue that varied from run to run, I focused on the timing between state machines and on all signals crossing clock domains. I had put synchronizers on all external signals coming into the FPGA. I even added a synchronizer plus debouncer/hysteresis for the key signal that brackets each two-byte word of the SPI transaction.
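As an illustration of the synchronizer plus debouncer idea, here is a small software model in C++. It is not the HDL from the FPGA. The class name SyncDebounce, the two-stage depth and the stableClocks parameter are all assumptions made for this sketch.

```cpp
// Software model of a two-stage synchronizer followed by a hysteresis
// filter. Each call to clock() represents one FPGA clock edge.
class SyncDebounce {
public:
    // stableClocks: how many consecutive synchronized samples must
    // disagree with the current output before the output changes.
    explicit SyncDebounce(unsigned stableClocks) : needed_(stableClocks) {}

    bool clock(bool rawInput) {
        // Two flip-flops in series: the filter only sees a sample that
        // has already passed through both stages.
        const bool synced = stage2_;
        stage2_ = stage1_;
        stage1_ = rawInput;

        if (synced == output_) {
            count_ = 0;                 // input agrees, nothing pending
        } else if (++count_ >= needed_) {
            output_ = synced;           // stable long enough, accept it
            count_ = 0;
        }
        return output_;
    }

private:
    unsigned needed_;
    unsigned count_ = 0;
    bool stage1_ = false;
    bool stage2_ = false;
    bool output_ = false;
};
```

In this model a one-clock glitch on the raw input never reaches the output, while a steady level passes through after the synchronizer delay plus the filter length.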
In my refactoring I put in a tightly interlocked set of signals to keep state machines in sync. One raises a trigger for the other but won't drop that trigger until the response signal is seen. The driven state machine will raise a response signal when it sees the trigger and won't drop the response until it sees the trigger go away.
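The interlock described above is a four-phase handshake. Below is a minimal C++ model of it, assuming both machines run on one clock and each samples the other's signal as it was before the clock edge. The names Requester, Responder and runHandshakes are invented for this sketch, and the real logic lives in the FPGA, not in C++.

```cpp
enum class ReqState { Idle, WaitResponse, WaitRelease };

// The driving machine: raises trigger and holds it until response is seen.
struct Requester {
    ReqState state = ReqState::Idle;
    bool trigger = false;
    int completed = 0;

    void request() {
        if (state == ReqState::Idle) {
            trigger = true;
            state = ReqState::WaitResponse;
        }
    }
    void clock(bool response) {
        switch (state) {
        case ReqState::Idle:
            break;
        case ReqState::WaitResponse:
            if (response) { trigger = false; state = ReqState::WaitRelease; }
            break;
        case ReqState::WaitRelease:
            if (!response) { ++completed; state = ReqState::Idle; }
            break;
        }
    }
};

// The driven machine: raises response on seeing trigger and keeps it
// raised until trigger goes away.
struct Responder {
    bool response = false;
    void clock(bool trigger) {
        if (!response && trigger) response = true;
        else if (response && !trigger) response = false;
    }
};

// Run up to maxClocks clock edges; return how many handshakes completed.
int runHandshakes(int count, int maxClocks) {
    Requester req;
    Responder resp;
    for (int clk = 0; clk < maxClocks && req.completed < count; ++clk) {
        req.request();                        // no effect unless idle
        const bool t = req.trigger;           // sample both signals
        const bool r = resp.response;         // before the clock edge
        req.clock(r);
        resp.clock(t);
    }
    return req.completed;
}
```

In this model neither side can drop its signal early, so a handshake either completes in order or both machines visibly stall in a wait state.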
CURRENTLY LOOKING AT THE RAM STATE MACHINE AS IT IS LOCKING UP
I have recently found that the central memory access state machine, the one that drives the memory interface IP that in turn controls the DDR3 DRAM, will end up stuck in some state other than its rest or idle state. That aligns with the symptoms: when it locks up, it either stops responding with incrementing word values or never returns even the first one, hence the mismatched values I saw on the Arduino.
When the first error case occurs, with no meaningful match for any word, the value being returned is consistently the first value that was received to declare this as an unload transaction. That value is F8 09, which combines the code for unload (B11111) with the targeted sector number for my test (B00000001001). The outbound link thus remains frozen, with that first value returned back to the Arduino.
Normally, we send the value x0000 while receiving the first word, which defines the command. We then send the command value back in the second word of the transaction while receiving the inverse of the command. Error checking verifies that x07F6 is the valid inverse of the command word xF809, and we proceed to access RAM and send up the contents for the next 321 words. Being stuck, we see xF809 coming back.
In the second case, we do fetch the RAM locations properly for a while, sending each value up the link, but then we freeze, so the upward-bound link keeps sending the last properly fetched value all the way to the end.
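The bit layout implied by those values can be written out as a short C++ sketch. The layout is inferred from the numbers given here: a 5-bit command code in the top bits and an 11-bit sector number below it. The names Command, encode, decode and kUnloadOpcode are made up for the example.

```cpp
#include <cstdint>

// A command word split into its two fields.
struct Command {
    uint8_t opcode;   // upper 5 bits
    uint16_t sector;  // lower 11 bits
};

constexpr uint8_t kUnloadOpcode = 0b11111;

// Pack a 5-bit opcode and an 11-bit sector number into one 16-bit word.
constexpr uint16_t encode(uint8_t opcode, uint16_t sector) {
    return static_cast<uint16_t>(((opcode & 0x1F) << 11) | (sector & 0x07FF));
}

// Split a 16-bit command word back into opcode and sector.
constexpr Command decode(uint16_t word) {
    return Command{static_cast<uint8_t>(word >> 11),
                   static_cast<uint16_t>(word & 0x07FF)};
}

// Unload of sector B00000001001 (decimal 9) gives the F8 09 seen on the link.
static_assert(encode(kUnloadOpcode, 9) == 0xF809, "unload sector 9 is xF809");
```

Decoding the stuck value this way confirms it is the command word itself and not data from RAM.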
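The inverse check and the expected start of a healthy reply can be expressed as a small C++ sketch. The function names are invented for illustration, and the reply order (x0000 first, then the echoed command) is taken from the description above.

```cpp
#include <cstdint>

// Bitwise inverse of a 16-bit command word.
constexpr uint16_t inverseOf(uint16_t command) {
    return static_cast<uint16_t>(~command);
}

// True when the second word received is the inverse of the first.
constexpr bool commandConfirmed(uint16_t command, uint16_t secondWord) {
    return secondWord == inverseOf(command);
}

// A healthy reply starts with x0000 and then echoes the command.
constexpr bool replyHeaderHealthy(uint16_t command, uint16_t firstReply,
                                  uint16_t secondReply) {
    return firstReply == 0x0000 && secondReply == command;
}

static_assert(inverseOf(0xF809) == 0x07F6, "x07F6 is the inverse of xF809");
```

A healthy header only shows that the command was accepted. The stuck case is the echoed command continuing to appear in the data words that follow.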