Sunday, November 20, 2022

Update on debugging the SPI link between Virtual 2315 FPGA and Arduino sides

EVIDENCE INDICATES THIS IS ERRATIC AND TIMING DEPENDENT

I can run the test transactions from the Arduino to the FPGA multiple times and I see it failing at different points. I load a fixed pattern where each word of the sector has its word number as the content - 1, 2, 3 etc. I then fetch the sector content up to the Arduino and report where the returned value does not match the word number. 
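To make the check concrete, here is a minimal sketch in Python of the comparison described above. It is not the actual Arduino code; the names `make_pattern` and `first_mismatch` are hypothetical, and the sector length of 320 words is an assumption taken from the word count mentioned later in this post.

```python
SECTOR_WORDS = 320  # assumed sector length for this illustration


def make_pattern(n_words=SECTOR_WORDS):
    """Build the test sector: each word holds its own 1-based word number."""
    return list(range(1, n_words + 1))


def first_mismatch(returned):
    """Return the 1-based word number of the first bad word, or None if all match."""
    for word_number, value in enumerate(returned, start=1):
        if value != word_number:
            return word_number
    return None


# Simulate a link that freezes after word 49 and repeats that value to the end
pattern = make_pattern()
frozen = pattern[:49] + [pattern[48]] * (SECTOR_WORDS - 49)
print(first_mismatch(pattern))  # clean transfer
print(first_mismatch(frozen))   # transfer that freezes partway through
```

A clean transfer reports no mismatch, while the frozen transfer reports the first word after the freeze.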

I find two broad cases. In the first, after the FPGA hits some unknown state, it returns gibberish that is constant for every word and every transfer over the SPI link. The second and more meaningful case begins with agreement for some number of words; then, from some word onward, the value returned is fixed for all subsequent words. 

The interesting observation for the second case is that the word number where it stops sending the proper value changes from test run to test run. It might be on the third word, it might be on the 50th word, but it is certain to occur sometime during the 320 words of an unload transaction. 

This tells me that I don't have a rare situation like a metastable signal or a clock domain crossing problem. Instead, there is a large window that is certain to hit a transfer sometime during a single transaction. This is good, in that it should be easier to find than a very infrequent issue. However, the cause has not been obvious to me so far. 

MANY FIXES IN ATTEMPT TO TIGHTEN UP RESISTANCE TO TIMING VARIATIONS

Because this was clearly a timing issue that varied from run to run, I focused on the timing between state machines and on all signals crossing clock domains. I had put synchronizers on all external signals coming into the FPGA. I even added a synchronizer plus debouncer/hysteresis for the key signal that brackets each two-byte word of the SPI transaction. 

In my refactoring I put in a tightly interlocked set of signals to keep state machines in sync. One raises a trigger for the other but won't drop that trigger until the response signal is seen. The driven state machine will raise a response signal when it sees the trigger and won't drop the response until it sees the trigger go away. 
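The interlock described above is the classic four-phase handshake. As a rough illustration only, here is a Python model of it. The real logic is in the FPGA and both machines run concurrently off clocks, whereas this sketch steps them one at a time. The names `Handshake`, `requester_step` and `responder_step` are hypothetical.

```python
class Handshake:
    """Two wires: 'trigger' is owned by the requester, 'response' by the responder."""

    def __init__(self):
        self.trigger = False
        self.response = False


def requester_step(hs, want_request):
    """Raise the trigger when idle; drop it only after the response is seen."""
    if not hs.trigger and not hs.response and want_request:
        hs.trigger = True
    elif hs.trigger and hs.response:
        hs.trigger = False


def responder_step(hs):
    """Raise the response on seeing the trigger; drop it only once the trigger is gone."""
    if hs.trigger and not hs.response:
        hs.response = True
    elif not hs.trigger and hs.response:
        hs.response = False


def run_one_transfer():
    """Step the two sides alternately and record (trigger, response) after each step."""
    hs = Handshake()
    trace = []
    for side, want in (("req", True), ("resp", None), ("req", False), ("resp", None)):
        if side == "req":
            requester_step(hs, want)
        else:
            responder_step(hs)
        trace.append((hs.trigger, hs.response))
    return trace


print(run_one_transfer())
```

Neither side can get ahead of the other, because each transition waits on a level driven by the opposite side rather than on elapsed time.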

CURRENTLY LOOKING AT THE RAM STATE MACHINE AS IT IS LOCKING UP

I have recently found that the central memory access state machine, the one that drives the memory interface IP that in turn controls the DDR3 DRAM, will end up stuck in some state other than its rest or idle state. That aligns with the symptoms: when it locks up, it stops responding with incrementing word values, or it does not return even the first value, hence the mismatched values I saw in the Arduino. 

When the first error case occurs, with no meaningful match for any word, the value being returned is consistently the first value that was received to declare this as an unload transaction. That value is F8 09, which combines the code for unload (B11111) with the targeted sector number for my test (B00000001001). The outbound link thus remains frozen with that first value being returned to the Arduino. 

Normally, we send the value x0000 while we are receiving the first word, which defines the command. We then send the command value back in the second word of the transaction while we are receiving the inverse of the command. Error checking verifies that x07F6 is the valid inverse of the command word xF809, and we proceed to read RAM and send up the contents for the next 321 words. Being stuck, we see xF809 coming back. 
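The arithmetic behind those two words can be checked with a short Python sketch. The field layout used here (5-bit command in the high bits, 11-bit sector number in the low bits) is my reading of the values quoted above, and the function names are hypothetical, not taken from the FPGA or Arduino code.

```python
UNLOAD_CMD = 0b11111        # 5-bit command code for unload
TEST_SECTOR = 0b00000001001  # 11-bit sector number used in the test


def make_command_word(cmd, sector):
    """Pack a 5-bit command and an 11-bit sector number into one 16-bit word."""
    return ((cmd & 0x1F) << 11) | (sector & 0x7FF)


def inverse_word(word):
    """Bitwise inverse, limited to 16 bits."""
    return ~word & 0xFFFF


def command_is_valid(word, inverse):
    """The pair is valid when every bit of one is the complement of the other."""
    return (word ^ inverse) == 0xFFFF


command = make_command_word(UNLOAD_CMD, TEST_SECTOR)
print(f"{command:04X} {inverse_word(command):04X}")
print(command_is_valid(command, 0x07F6))
```

Sending the inverse as a second word means any single stuck or flipped bit on the link makes the pair fail the check.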

In the second case, we do fetch the RAM locations properly for a while, sending that value up the link, but then we are frozen so the upward bound link keeps sending the last properly fetched value all the way to the end. 

Saturday, November 12, 2022

Hurricane Nicole now in the rear view mirror; tweaking simulation and nailing the reset and startup sequences with refactored design

HURRICANE NICOLE VISITED UNEXPECTEDLY

A rare November hurricane formed with little warning and was upon us midweek. It reached Category 1 intensity, with windspeeds around 70 mph, and made landfall approximately 40 miles south of me. The experience was similar to Ian, which, while more intense when it first made landfall on the Gulf side of the state, was down to Cat 1 by the time it passed over us last month. 

Zero water or damage to the workshop and its computers, zero damage to my condo. Water flying at the windows with gusts to 85 mph finds its way through even the best sealing, such that I had maybe two quarts of water puddling on the tile along the ocean side windows. A few towels soaked that all up and all was well. 

The beach in front of my building did get pounded. Lots of erosion. My building has a very high seawall so that even with a full moon, high tide and six foot storm surge, no water made it up to the ground level or the garages. The sand down on the beach was vacuumed away, however. There is a plan to dredge up sand and reestablish the beaches all along the coast here, although not instantly. 

Also, the crashing waves, coming sideways due to the rotating hurricane winds, smashed up most of the wood stairways that lead down to the beach. All 12 of the public access walkways in our town, for example, were damaged and impassable. My building used to have a stairway, too, but all we have now is a 'diving platform' looking down to the sand below. 

IMPROVED MY SIMULATION WITH RANDOM TIME DELAYS

The major issue to validate is how my logic performs with the SPI link, which is driven by the Arduino and is completely unrelated to any of my clocks or logic. I therefore made use of a random number generator in the testbench to vary the timing of the bytes I present to the FPGA when simulating the Arduino. 
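The idea of jittering the byte arrival times can be sketched as follows. This is a Python illustration only, not the actual testbench, which lives in the FPGA simulation environment. The function name `byte_schedule` and the gap limits are hypothetical values chosen for the example.

```python
import random


def byte_schedule(n_bytes, min_gap_ns, max_gap_ns, seed=None):
    """Return the start time (ns) of each byte, with a random gap before each one.

    A fixed seed makes a failing run repeatable, which helps when chasing
    a timing-dependent bug.
    """
    rng = random.Random(seed)
    t = 0
    times = []
    for _ in range(n_bytes):
        t += rng.randint(min_gap_ns, max_gap_ns)  # inclusive on both ends
        times.append(t)
    return times


# Ten bytes with gaps between 100 ns and 500 ns (illustrative numbers only)
schedule = byte_schedule(10, 100, 500, seed=1)
print(schedule)
```

Because the gaps are unrelated to any FPGA clock, repeated runs with different seeds sweep the incoming bytes across many phase relationships with the internal state machines.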

NAILED DOWN STARTUP/RESET TIMING WITH SIMULATION

Using the post-implementation simulation, modeling the actual structure of logic cells and routing from my design, I was able to spot and repair some weaknesses in the relative timing of starting various state machines and the initialization of the memory interface and FIFO IP that I am using. 

Monday, November 7, 2022

Completely refactoring the SPI link logic

STATE MACHINES DEPEND ON TIMING OF INCOMING WORDS TO ADVANCE

Several of the state machines driving the SPI link depend on the SlaveSelect line, which is active for each two-byte word being transmitted. Others used an SPIbusy signal, which in turn was driven by SlaveSelect. In both cases, the state machine first sits waiting while the transmission/reception of a word is underway, then advances to snag the output when SlaveSelect turns off. 

I suspect that there are times when I have SlaveSelect already active but I am first waiting for it to turn off, or vice versa, because of the relative timing of the Arduino driven SPI signals and what I am doing inside the logic in the FPGA. That certainly aligns with the symptoms I see, where the SPI machine is out of sync with the words being sent by the Arduino or one of the state machines stalls. 

REFACTORING IS MY SOLUTION WHEN I AM ENCOUNTERING FLAKY BEHAVIOR

If I spend enough time fighting with erratic behavior, it is time to look at the problem again and refactor the design. I try to come at the required behavior in a different way, focusing especially on interlocking or other means of ensuring that various state machines work together as intended.

SPI LINK LOGIC BEING REDESIGNED

It is now time to refactor all the state machine gear. I have evolved it several times, in some cases because the way the Arduino worked was different than I expected and in some cases due to defects or poor approaches I found. The longer you layer fixes atop some code, the worse it tends to get. Refactoring lets me redesign with the benefit of all the correct information about the Arduino and all the experience I gained working on the logic.