REDESIGNED AND IMPLEMENTED THE BRIDGE ASSOCIATED FPGA LOGIC
I put in a dozen hours working on the most bulletproof schemes for transferring data over the H2F LW bridge as well as the F2SDRAM bridge that accesses the last megabyte of RAM. One weakness was the need to complete two H2F LW transfers for each F2SDRAM transfer since the bridge widths were 32 and 64 bits respectively.
I shrank the F2SDRAM width to 32, which triggered a warning that this is suboptimal for throughput, but it allows a very clean 1 to 1 handoff as I move data from one bridge to the other in the load or unload loop. The loop iterations doubled to 260,625 but I am now only pushing one transaction on the H2F LW bridge in the loop body.
The small change of width also changes the addressing mechanism for the RAM. Where before I transferred a 29 bit address to grab a block of eight bytes, I now have to send a 30 bit address to retrieve a block of four. That is, the bottom three bits (64 bit width) or two bits (32 bit width) of the byte address select a byte within one block, and the bridge address excludes them since you get the entire block in one transaction. Thus a 32 bit address becomes 29 bits for the 64 bit width, 30 for the 32 bit width, and would be 28 if the memory bridge were widened to 128 bits.
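As a quick check of that arithmetic, here is a minimal Python sketch (a model only, not the FPGA logic itself). The function names and the example byte address are my own illustrations and not part of the design.

```python
def bridge_address_bits(data_width_bits, byte_address_bits=32):
    """Address bits left once the byte-within-block bits are dropped."""
    bytes_per_block = data_width_bits // 8
    # A power-of-two block of N bytes consumes log2(N) low-order address bits.
    offset_bits = bytes_per_block.bit_length() - 1
    return byte_address_bits - offset_bits


def block_address(byte_address, data_width_bits):
    """Convert a byte address into the block address sent over the bridge."""
    offset_bits = (data_width_bits // 8).bit_length() - 1
    return byte_address >> offset_bits


for width in (32, 64, 128):
    print(width, "bit bridge ->", bridge_address_bits(width), "bit address")

# Illustrative byte address only; the real base address is not given here.
print(hex(block_address(0x3FF00008, 32)))
```

The same byte address therefore maps to a different block address depending on the bridge width, which is why narrowing the bridge forced the addressing change.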
I calculate the relative 1130 word (a 16 bit value) within the cartridge from the cylinder, head, sector and word within the record. Each 1130 word is only 16 bits and our memory is fetched in units of 32 bits, i.e. two 1130 words at a time. The bottom bit of the calculated relative word number is therefore dropped from the RAM address, since it is simple selection logic for which half of a 32 bit block contains the desired word.
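The calculation can be sketched in Python as follows. The geometry constants (2 heads, 4 sectors, 321 words per sector), the cylinder-then-head-then-sector ordering, and the choice of which half of the block holds the even-numbered word are all assumptions made for illustration; the FPGA logic may differ on any of them.

```python
# Assumed geometry, not taken from the FPGA source.
HEADS = 2
SECTORS = 4
WORDS_PER_SECTOR = 321


def relative_word(cylinder, head, sector, word):
    """Relative 16 bit word number within the cartridge image."""
    sector_index = (cylinder * HEADS + head) * SECTORS + sector
    return sector_index * WORDS_PER_SECTOR + word


def block_and_half(rel_word):
    """Split a relative word number into (32 bit block index, half select)."""
    return rel_word >> 1, rel_word & 1


def extract_word(block32, half):
    """Pick one 16 bit word out of a 32 bit block.

    Assumption: half 0 is the low 16 bits, half 1 the high 16 bits.
    """
    if half == 0:
        return block32 & 0xFFFF
    return (block32 >> 16) & 0xFFFF


block, half = block_and_half(relative_word(0, 0, 1, 0))
print(block, half)
```

Only the block index goes out as the RAM address; the half select never leaves the FPGA.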
The design of the bridges is such that they can transfer one block in every clock cycle and can stream continuously. The only way to hold off a transfer is for the slave side to raise the waitrequest signal. Essentially we will see the read or write request but the other side remains frozen until the clock cycle when we have dropped waitrequest.
My logic has to correctly handle the waitrequest signal to keep the applications running on the Linux processor side in sync with my logic in the FPGA. My prior state machines had points where there were extra cycles that could allow a write from the application to be presented and dropped by my side.
My solution was a dual state machine that worked within the single cycle constraints of the Avalon Memory Mapped protocol used with the bridges. One grabs or sends data in a single cycle. The other raises waitrequest immediately to hold off the master (Linux) side until we are sure we processed our transaction. Thus I am throttling the bridge, allowing data in only as fast as I am ready to handle it. The state machine managing waitrequest issues a signal that a read or write has occurred but doesn't drop waitrequest until the requesting logic raises its own signal that it received our results.
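To make the handshake concrete, here is a cycle-level toy model in Python of the write direction. It is a simplified sketch and not the actual HDL: the two state machines are folded into one class, and the names `H2FWriteSlaveModel`, `write_seen` and `stalled_cycles` are invented for the example. Only `getdata_grabbed` is a real signal name from my logic.

```python
class H2FWriteSlaveModel:
    """Toy model of the slave side of one H2F LW write."""

    def __init__(self):
        self.holding = False     # True while a write is frozen by waitrequest
        self.data = None         # registered copy of writedata
        self.write_seen = False  # hypothetical flag raised to the loop logic

    def clock(self, write, writedata, getdata_grabbed):
        """Advance one clock; return the waitrequest level the master sees."""
        if not self.holding:
            if not write:
                return False
            # Register the data in the cycle the write first appears and
            # stall the master immediately.
            self.data = writedata
            self.write_seen = True
            self.holding = True
            return True
        if getdata_grabbed:
            # The consumer has used the data, so let the transfer complete.
            self.holding = False
            self.write_seen = False
            return False
        return True


def stalled_cycles(slave, writedata, ack_after):
    """Drive one write; the consumer acknowledges after `ack_after` cycles."""
    cycle = 0
    while slave.clock(True, writedata, getdata_grabbed=(cycle >= ack_after)):
        cycle += 1
    return cycle


slave = H2FWriteSlaveModel()
print(stalled_cycles(slave, 0x1234, ack_after=5), hex(slave.data))
```

In this model the master is held for exactly as long as the consumer takes to acknowledge, while the data itself is captured on the first cycle.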
The load and unload logic is a big loop in the FPGA which should be mirroring the loop in the application on Linux, running 260,625 times to move an entire 2315 cartridge image between RAM and the Linux side. The FPGA is the slave to the H2F LW bridge, since the Linux application controls when we load or unload. This means we issue the waitrequest up to the application.
The RAM, on the other hand, has the FPGA as the master for the F2SDRAM bridge. That means that it is the RAM circuitry that raises or lowers its own waitrequest, to tell us that a particular attempt to write or read RAM is not yet complete. The number of cycles is variable, a characteristic of DRAM, which is why we depend on the waitrequest signal.
Load is conceptually just a matter of having the application write a block to us, which we then write to the RAM. The RAM may take a while to complete the write operation, but the Linux application may have already attempted to write the next block down to us on H2F LW. Thus, we need the waitrequest throttling to freeze the Linux application until we know we finished putting the last block into RAM.
Unload is the inverse, where we read a block of RAM and then have it ready to provide to the application which is reading from us. Note that in this latter case, the application may be issuing the read way ahead of when we have the data from RAM. We need the throttling of the H2F LW waitrequest to make it freeze until we are ready to complete the read with the data we have received from RAM.
The bridge state machine that handles the H2F LW waitrequest is the key to this working properly. It tells the load logic that we have gotten a write from the Linux side. The load logic takes the data and issues the F2SDRAM write, but it doesn't confirm back to the H2F LW state machine that it got the written data until it sees the RAM write complete. It is at that point that we confirm to the H2F LW state machine that we got the data and it correspondingly drops waitrequest.
For unload, we are always prefetching the next block from the F2SDRAM bridge and then waiting for the H2F LW side to inform us that it has had a read request. We only allow the waitrequest for H2F LW to drop if we have finished our prefetch.
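The unload rule amounts to releasing the read on whichever event comes later, the prefetch finishing or the read request arriving. A tiny Python sketch of that timing, with a hypothetical function name and cycle counts chosen only for illustration:

```python
def unload_release_cycle(prefetch_done_at, read_arrives_at):
    """Cycle on which H2F LW waitrequest is low for a pending read."""
    cycle = 0
    while True:
        prefetch_done = cycle >= prefetch_done_at
        read_pending = cycle >= read_arrives_at
        # A pending read is held off until the prefetch has finished.
        waitrequest = read_pending and not prefetch_done
        if read_pending and not waitrequest:
            return cycle
        cycle += 1


# Read arrives early and is frozen until the prefetch completes.
print(unload_release_cycle(prefetch_done_at=10, read_arrives_at=2))
# Prefetch finishes first, so the read completes as soon as it arrives.
print(unload_release_cycle(prefetch_done_at=3, read_arrives_at=7))
```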
If our logic handling a load or unload loop is not active, we will see at most one read or write transaction from the application running on the dual ARM processor, because as soon as we see either the read or the write we freeze the processor using waitrequest until we can satisfy that transfer request.
For a write down from the application to the FPGA, the transfer does not formally complete until we release waitrequest by sending the confirmation that we have used the data (getdata_grabbed). In a load loop, which is where the application does the write down to us, we need the contents of the block before that point: we pass it along to be written out to RAM using F2SDRAM, and only release waitrequest after the RAM has successfully completed the write. This means that my state machine processing the read or write on the H2F LW bridge must register the written block as soon as we see the operation and not wait until we let it complete by dropping waitrequest. This is a subtlety of the design that must be considered, as it has to fit into the larger interlock scheme where the loop is being implemented.
For a read by the application seeking to get the next block that had been extracted from RAM, we have to first request the block over F2SDRAM and wait until it is complete. At that point we set up the data and release the waitrequest to let the read operation over H2F LW complete. All of this logic for read and write by H2F LW has to consider that the master is the application and it is able to issue a write or read at any time relative to where our state machines might be inside the FPGA. Thus we always have to see the request, freeze it, and only release at the end of each loop iteration of the load or unload.
The state machine for F2SDRAM is easier than for H2F LW because we are not the side that generates the waitrequest when the slave is not ready to complete any transfer. It is the RAM controller circuitry up in the Hard Processor System (HPS) side of the chip that does this and has implemented a slave that can read or write to RAM independent of the accesses by the two ARM processors or other mechanisms using DMA.
As there are multiple users of RAM and because DRAM itself has periods when it is refreshing cells, the time it takes to complete a RAM access is not predictable from the FPGA side. Thus waitrequest is essential to honor in the state machine. It is our state machine for F2SDRAM that heeds the signal.
All we have to do is set up a block of data and address, then raise the write signal to F2SDRAM. We keep that raised as long as we see waitrequest asserted, then we drop it and our write transaction is complete. If reading, we set up the address and raise the read signal. We keep it asserted as long as waitrequest is active, then we wait for the readdatavalid signal to be sent by the slave on the HPS side. This tells us it is time to latch the data as the read from RAM has completed.
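The master side sequence can be modeled in a few lines of Python. This is a sketch under stated assumptions: `ToySdramSlave` is a made-up stand-in for the HPS memory controller with a fixed stall and read latency (the real one is variable), and the function names are hypothetical.

```python
class ToySdramSlave:
    """Stand-in slave: stalls each command, returns read data later."""

    def __init__(self, stall_cycles, read_latency):
        self.mem = {}
        self.stall_cycles = stall_cycles
        self.read_latency = read_latency  # must be at least 1
        self._stall_left = None           # None means no command in progress
        self._pending = []                # [cycles_left, data] per accepted read

    def cycle(self, read=False, write=False, address=0, writedata=0):
        """One clock; returns (waitrequest, readdatavalid, readdata)."""
        valid, rdata = False, 0
        for entry in self._pending:
            entry[0] -= 1
        if self._pending and self._pending[0][0] <= 0:
            valid, rdata = True, self._pending.pop(0)[1]
        waitrequest = False
        if read or write:
            if self._stall_left is None:
                self._stall_left = self.stall_cycles
            if self._stall_left > 0:
                self._stall_left -= 1
                waitrequest = True
            else:
                self._stall_left = None  # command accepted this cycle
                if write:
                    self.mem[address] = writedata
                else:
                    data = self.mem.get(address, 0)
                    self._pending.append([self.read_latency, data])
        return waitrequest, valid, rdata


def master_write(slave, address, data, max_cycles=1000):
    """Hold write asserted until a cycle where waitrequest is low."""
    for cycle in range(max_cycles):
        wait, _, _ = slave.cycle(write=True, address=address, writedata=data)
        if not wait:
            return cycle + 1
    raise TimeoutError("write never accepted")


def master_read(slave, address, max_cycles=1000):
    """Hold read until accepted, then wait for readdatavalid and latch."""
    accepted = False
    for cycle in range(max_cycles):
        wait, valid, data = slave.cycle(read=not accepted, address=address)
        if valid:
            return data, cycle + 1
        if not accepted and not wait:
            accepted = True
    raise TimeoutError("read never completed")


slave = ToySdramSlave(stall_cycles=3, read_latency=2)
print(master_write(slave, 0x10, 0xCAFEF00D))
print(master_read(slave, 0x10))
```

The point of the model is that the master never counts cycles itself; it only reacts to waitrequest and readdatavalid.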
For safety reasons, I can put in a mechanism to unfreeze the ARM processor by forcing waitrequest to drop. The challenge is to find a way to detect a failure where we are in a deadly embrace, drop the request and then inform the application that the load or unload has failed so it doesn't corrupt the virtual cartridge file or continue attempting to do a load or unload.
The reason this is challenging is that the application thread is frozen - it executed a store or load instruction to a memory address that is mapped to drive our H2F LW bridge. The processor hangs and thus the application can't process interrupts or read status over the other channel of the H2F LW bridge. We don't have a good means of informing the application that its most recent transfer request failed and it needs to abandon efforts.
The easy part is implementing something like a timer that is fired off when we start a loop iteration and reset when we finish each iteration. Detecting a hang is therefore not the problem; the problem is getting the application in sync with the failure condition so that it doesn't blindly continue to attempt the load or unload. I will be musing over methods to make this happen reliably, because that is required before I implement this watchdog mechanism.
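For the detection half only, a per-iteration timer could look like the following Python model. It is a sketch of the idea, not something I have built: the class name, the signal names and the timeout value are all hypothetical, and it does nothing about the harder problem of informing the application.

```python
class IterationWatchdog:
    """Counts clocks during a loop iteration and latches on timeout."""

    def __init__(self, timeout_cycles):
        self.timeout_cycles = timeout_cycles  # hypothetical limit
        self.count = 0
        self.running = False
        self.expired = False

    def clock(self, iteration_start=False, iteration_done=False):
        """Advance one clock; return True once a hang has been detected."""
        if iteration_done:
            # Normal completion resets the timer.
            self.running = False
            self.count = 0
        elif iteration_start:
            self.running = True
            self.count = 0
        elif self.running:
            self.count += 1
            if self.count >= self.timeout_cycles:
                self.expired = True
                self.running = False
        return self.expired


wd = IterationWatchdog(timeout_cycles=5)
wd.clock(iteration_start=True)
for _ in range(3):
    wd.clock()
print(wd.clock(iteration_done=True))  # iteration finished in time
```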