Wednesday, November 30, 2016

Switched over to the 1200K fpga board as the other board was unusable


While waiting to find/buy a new USB cable with a better fit to the fpga board, I can't run the disk drive while the fpga is randomly resetting or emitting incorrect signals. I will switch over to debugging my emulator role in the interim.

Unfortunately, the situation with the connection and the fpga board is bad enough that I couldn't get the board to respond even in emulator mode. I set up a few diagnostic signals and synthesized so that I could look again when I returned from my day with the 1401 restoration team at CHM.

I found another cable and had the same very erratic behavior with the board. Since I have another fpga board, I used that instead but it is the 1200K gate version, not the 500K gate variant. Something is wrong with the USB link on the 500K board.

I had to resynthesize for the 1200K configuration before I could test the new board. This also required some config changes to the UCF file because the 1200K version of the board hooks four of the LEDs to different FPGA signals than the 500K board does.

I began testing the emulator function, slowly working through the logic to get it working. Glad to be moving forward again. I also have a reliable board now which will let me hook up the new driver board and test it out.

Tuesday, November 29, 2016

Board ready but USB cable issue, finished coding emulator role


This morning I installed the cables on the Diablo disk drive and hooked up the fpga. It was still loaded with the prior bitstream, but the changes in this interface require an update. Among other changes, this version of the board has active SelectUnit1 and ReadGate signals, reads the SectorNumber from the drive and does not receive IndexMarker.

First up should be a test with the drive powered but not fully spinning, so that I can verify that key signals such as WriteGate are not asserted.  I did this and did verify those signals, but found other issues.
New board attached to FPGA
I couldn't find my original USB cable that connects the FPGA board to the PC - it disappeared during the holiday storage - but I substituted another one I found. The connection is erratic, leading to random resets, spurious data transfers and likely unrequested transaction initiations. That isn't safe with the drive spinning, so I need to resolve this before doing anything else.

I concluded the coding of the emulator task with the process to write one or more records on the current sector. It was synthesized and set up to begin testing of the emulation function, initially with the stubby monitoring board attached to the fpga board. This will let me inject given inputs and watch the outputs with the scope and logic analyzer.

Stubby board to allow easy access to inputs and outputs

Monday, November 28, 2016

Board checkout completed, ready to fire up test bed


I finally found the intermittent contact in the new disk driver board and repaired it. I went on through the remaining circuits to finish all the continuity/correctness/shorts testing. Finally, all the careful checking was done.

Next, I populated the chip sockets, turned on power to the board and validated that injecting 1 and 0 values to either side produces the proper results on the converse connector pin associated with each signal.

The twelve input signals worked properly, swinging the fpga input pin between 3.3 and near zero volts as the pin on the Diablo connector was grounded. The initial test on the output signals, where I supplied 3.3 or 0 volts to the fpga output pins and watched the Diablo connector, didn't produce any voltage swing.

It was only after a few minutes of puzzlement that I remembered these are open collector drivers which depend on the terminator resistor network to pull up when the driver is not grounding. I had to add a pull-up resistor and supply +5V before I could get the outputs to swing as needed.

With the pull-up in place, I verified that all 15 of the output signals would swing up to 5V when the input was pulled to ground, but would drop to zero if the inputs floated up. It all looked good, each circuit performing well.

Tomorrow I connect the cable to the Diablo, plug in the terminator, and hook the fpga board to the driver board. With that, I can bring up the new board and disk drive to test its operation on all the driver functions that had previously tested as working correctly. 

Set up to test again, checking out new board


On the design front, I worked on the logic for the emulator role where it will handle WriteGate switching on. This occurs in two cases - when writing an entire sector of three records from scratch or when updating the later records within an existing sector.

The process that is continually reading sectors should continue to run, but its control of the RAM is blocked whenever WriteGate is on. The reading process lets us know which record we are in, essential for the writing process to properly address RAM locations. We don't, however, need to be absolutely synchronized to when the read thinks it is in preamble, postamble, sync or specific words of a record.

The write process will decode the incoming WriteData&Clock signal to follow but throw away the preamble and sync word, before capturing and writing the header, label or data record words into RAM. It will verify checksums just for completeness although there is nothing appropriate that the emulator can present back to the Alto if the checksum does not match.

The write process will then follow and throw away the postamble of the record it is writing. If there are more records in the sector to write, it will iterate until done with the data record. A bit tricky to set up, but I will work on it after I set up the test bed again.

I felt it time to switch over and test the second version of the driver role interface board I had built, the one I will be passing along to Al Kossow once we are done with our testing on the restored Alto. I performed one more continuity/shorts/wiring test then populated the chip sockets and began circuit tests.

With power applied but the fpga and disk drive unconnected, I began to inject voltages to each side in order to check that the appropriate results appeared on the other connector. Only when these are judged correct would I cable everything together to test out some reading and writing.

Ended the day chasing some peculiarities, thus no full test yet.

Saturday, November 26, 2016

Emulator role built for reading every sector as it rotates under head


Guests departing, finally back to coding and later set up to test. Have to haul logic analyzers, scopes, drives, and other items back from storage and hook them all up.

I coded the read sector process for the emulator which will continually spit out the clock and data bits as the virtual cartridge rotates under the virtual head. As soon as the FileReady is on and we are not in the midst of a seek, the process will wait for the beginning of the next SectorMark and then generate continuously.

Testing is much easier this way, since the disk tool itself will start the drive, turn on FileReady and emit the clock and data bits. All I need to do for testing is override the ReadGate input with a slide switch so that it allows the ReadClock and ReadData pulses to emerge on the outputs.

The logic for the read sector process and the other new parts are all synthesizing without complaint, thus it is time for me to set up for testing. For that, I need to get all the gear out of storage and onto the workbench. I have a party to attend tonight so testing won't commence until Sunday.

Thursday, November 24, 2016

Finishing read sector process for disk emulator role

Holiday interruption, just when I am returning from vacation interruption! To prepare for the US Thanksgiving holiday and guests, I had to disassemble and store much of my testbed and development area. On the weekend, when guests have departed, I can return my house to a laboratory. 


After I returned home from my vacation, I got back to coding of the disk emulator role for the disk tool. I have a few final bits to add to the process that emits a record - the preamble of words of zero, a sync word, the appropriate count of words for the record and the four postamble words of zero.

This will be triggered by a higher level process for the entire sector. That will set the preamble count, data word count and RAM address for the record in question, then trigger the record emitting process. The higher level process will run whenever the FileReady status is on, constantly generating each sector as it flies under the virtual disk head. It will only be interrupted during seek operations.

I spent a bit of time blocking out the higher level process which will read an entire sector, executing iteratively as long as the virtual disk is rotating and online. 

Friday, November 18, 2016

Developed logic for seeks and data separation during write to emulated disk drive


Today I built the logic to handle seeks when the Alto system requests it of the disk emulator. It is essentially ready for testing. It should be accurate in timing, appearing to take the same time as a physical Diablo drive.

I set up the input and output signals to be sensitive to the Select Unit 1 signal and also refuse to respond if the virtual cartridge is not loaded. In addition, the ReadData and ReadClock outputs are gated by ReadGate.

A key component needed to handle writing to the emulated disk is a data separator. The computer sends 100 ns wide pulses, with the timing between pulses determining whether it is sending a 0 or a 1 bit. That is, a zero bit value is transmitted by delivering a 100 ns pulse followed by 500 ns of delay before the next 'clock' pulse. A one bit value is transmitted by sending the 100 ns clock pulse, a delay of 200 ns, a 100 ns data pulse, and a final delay of 200 ns before the next clock pulse.
I receive the bottom stream of pulses and must break out clock and data as above

The data separator sees the continual train of pulses every 600 ns as the clock pulses. It watches in the interval between clock pulses to see if a second pulse is transmitted midway between - that represents a 1 bit value. I built the logic to accomplish this and to drive my deserializer which shifts the incoming bit values into words which will be stored in RAM as the record is written.
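The separation rule can be sketched in a few lines of Python. This is only a model of the behavior described above - the real separator is a VHDL state machine in the FPGA - and the timestamps and tolerance value are illustrative assumptions:

```python
def separate(pulse_times_ns, cell_ns=600, tolerance_ns=100):
    """Classify a train of pulse timestamps into data bits.

    Pulses arriving on the ~600 ns cell boundary are clock pulses;
    a pulse roughly midway between two clocks marks a 1 bit.
    Assumes the first pulse in the list is a clock pulse.
    """
    bits = []
    last_clock = pulse_times_ns[0]       # first pulse taken as a clock
    pending = 0                          # bit value for the current cell
    for t in pulse_times_ns[1:]:
        gap = t - last_clock
        if gap < cell_ns - tolerance_ns: # mid-cell pulse -> data '1'
            pending = 1
        else:                            # next clock pulse ends the cell
            bits.append(pending)
            pending = 0
            last_clock = t
    return bits

# 0 bit = clock then 500 ns gap; 1 bit = clock, 200 ns, data pulse, 200 ns
train = [0, 300, 600, 1200, 1800, 2100, 2400]
print(separate(train))   # -> [1, 0, 0, 1]
```

The real logic must also tolerate the jitter discussed elsewhere in this log, which a fixed threshold like this handles only crudely.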

Next up is the process that will continually read the sectors rotating under the virtual head, whether the results are transmitted to the CPU or not. This is the central process that will underlie the entire disk emulation. I spent the day building this in more detail, drawing Moore and Mealy diagrams until I felt it was ready, but hadn't yet begun to code in VHDL.

Wednesday, November 16, 2016

Outline of the disk emulator role logic


The core of the disk emulator logic is a process that will start up with the first sector marker it sees while the disk is loaded and not currently seeking. This will generate the data and clock signals for every sector as it rotates past the virtual head, also keeping track of whether we are starting the header, label or data record of the sector.

Those data and clock signals will not be delivered over the interface unless the computer raises ReadGate. With ReadGate true, we pass along the signals to the computer as ReadClock and ReadData.

Thus, if the computer decides it is reading a given sector, it will turn on ReadGate and see the appropriate bits stream in. If it is an update operation, it sees the first record(s) using ReadGate and then starts writing by raising WriteGate.

When we see WriteGate go true, the separator watches for alternations of the WriteClockandData line; each such reversal is seen as a clock or a data bit, depending on the timing from when we see the very first transition.

A transition that occurs less than 500 ns after the prior clock transition is read as a 1 bit. This will be followed by another transition in much less than 600ns to represent the clock pulse ending the bit cell. If the transition occurs after a 500ns delay from the prior clock, it is the succeeding clock bit and there was a 0 data value in the bit cell.

Each bit cell injects the 0 or 1 bit value into the deserializer, shifting in to form 16 bit words. The deserializer signals when a word has been accumulated, so that it can be sampled by other processes.
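A minimal sketch of the deserializer behavior, in Python rather than the actual VHDL (MSB-first shifting is my assumption here):

```python
class Deserializer:
    """Shift incoming bit values into 16-bit words (MSB first assumed).

    push() returns the completed word each time 16 bits have
    accumulated, mirroring the 'word ready for sampling' signal.
    """
    def __init__(self):
        self.shift = 0
        self.count = 0

    def push(self, bit):
        self.shift = ((self.shift << 1) | (bit & 1)) & 0xFFFF
        self.count += 1
        if self.count == 16:
            word, self.shift, self.count = self.shift, 0, 0
            return word      # word assembled, ready for sampling
        return None          # still accumulating

d = Deserializer()
bits = [1] + [0] * 14 + [1]  # one 16-bit word: 0b1000000000000001
words = [w for w in (d.push(b) for b in bits) if w is not None]
print(hex(words[0]))         # -> 0x8001
```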

When the disk is selected, loaded and WriteGate turns on, it begins a process to write one or more records in the sector. The continual read process tells us whether this is going to be the header, label or data record we are writing. We begin tracking the incoming words that the computer writes to us, looking for the words that will be stored into RAM at appropriate locations.

The write process drops zero bits, looking for a 1 bit that represents the sync word which begins a record. This is what actually syncs up the deserializer to begin informing us as words are ready for sampling.

After sync is achieved, we extract 2, 8 or 256 words, storing each in the RAM slot assigned to it for this sector. Meanwhile, we calculate a running checksum using bitwise XOR and a seed value of 0x0151.

Following the last data word of the record, we extract a checksum and verify that it matches the checksum we calculated. There is no way to reflect the error back to the processor, but we can flag it for our awareness.
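The checksum itself is simple to model in Python, using the XOR-over-a-seed rule described above (the record words below are made-up example values):

```python
SEED = 0x0151   # checksum seed value noted above

def checksum(words):
    """Running checksum: bitwise XOR of all record words over the seed."""
    acc = SEED
    for w in words:
        acc ^= w
    return acc

record = [0x1234, 0xABCD, 0x00FF]      # arbitrary example words
stored = checksum(record)              # what the writer would append
assert checksum(record) == stored      # reader recomputes and compares
print(hex(stored))                     # -> 0xb857
```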

After a checksum word, there is a stream of up to 4 words of zeroes, which tells us to drop sync and prepare to write the next record of the sector (unless we just finished the data record). The deserializer drops the sync condition, ignores zero bits and waits for the first 1 bit to act as the sync word.

If WriteGate is dropped, we stop and go back to idle state. The read process continues to run at all times the disk is loaded, fetching the newly written words once the virtual platter rotates the sector back under the virtual head.

We always address RAM with the current sector number and the current value of the Head signal coming from the computer. The cylinder value that addresses RAM is initially zero and is updated by any seek process.

The seek process sits at idle until the disk is selected, loaded and the Strobe signal is activated. ReadytoSeekReadWrite goes false in 2.5 us and we wait another 27.5 us before signalling AddressAcknowledge (or LogicalAddressInterlock if the requested cylinder is >202).

The process calculates the movement distance based on the prior cylinder address and the value presented by the computer when Strobe is activated. We wait 600 us per cylinder plus 14400 us settle time, to model the physical seek duration on a real drive. The process will then respond by returning ReadytoSeekReadWrite to true, which signals completion of the seek.
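The seek timing can be sketched as a small Python model (all figures are taken from the description above; the real process is a VHDL state machine driven by counters):

```python
MAX_CYL = 202          # highest valid cylinder address

def seek_response(current_cyl, target_cyl):
    """Model the emulator's seek response and timing.

    Returns (signal_name, total_delay_us). A sketch of the behavior
    described in the log, not the actual VHDL implementation.
    """
    if target_cyl > MAX_CYL:
        # illegal address: interlock after the 2.5 + 27.5 us window
        return ("LogicalAddressInterlock", 2.5 + 27.5)
    distance = abs(target_cyl - current_cyl)
    # acknowledge, then 600 us per cylinder moved plus settle time
    total = 2.5 + 27.5 + distance * 600 + 14400
    return ("AddressAcknowledge", total)

print(seek_response(0, 10))    # -> ('AddressAcknowledge', 20430.0)
print(seek_response(0, 300))   # -> ('LogicalAddressInterlock', 30.0)
```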

There needs to be a mechanism to load or unload the virtual drive. When loaded, FileReady as well as ReadytoSeekReadWrite are turned on. Unloading waits until all in-flight actions such as seek or write are idle, then resets those two signals.

It is up to the user to load RAM contents from the PC, or fetch them back to the PC, while the drive is unloaded. The emulator will serve up whatever is in RAM as data or replace selected words during writes. 

Tuesday, November 15, 2016

On vacation in Kaua'i but doing design work from the cabana

My posts will be short and erratic this week as I am on holiday. I left the Diablo drive, the dog, the 1130 and the housesitter to spend some time at the Grand Hyatt Kaua'i lurking in my cabana. 

View from my cabana


I will do the design work for the emulator role, where the tool will attach to an Alto II computer and act as a disk drive, while on vacation. I worked out the broad strokes for all the functionality, but have not begun generating VHDL yet. 

Saturday, November 12, 2016

Worked on Alto II and prepared for vacation


The team got together today to work on various open tasks and to experiment with some of the applications on the system. 

We cleaned one of the mechanical mice loaned to us by Xerox PARC and it works quite well. We cleaned a second similar mouse but a bearing near the top of the ball cavity failed making it perform erratically. Several of the balls in the bearing fell out - each barely visible to the naked eye - and it does not look like it can be repaired. Finally, we still need to acquire and connect a DE19 plug to the optical mouse that came with the system. 

We serviced the disk drive, replacing the air filter and adding a touch of oil to the wipers on the arm bearings. The drive is working quite well. 

We undertook some scoping and capture of the ethernet connection, to provide final specs for Ken's ethernet adapter project. We weren't seeing what we expected, so the team hooked up the logic analyzer to probe the adapter more fully. 


I am away on holiday for the next week, having to put aside the work while I pack tonight.

Wednesday, November 9, 2016

Preparing to study the circuit failure in the Diablo drive

Spent part of the day at Computer History Museum working on an odd problem with card reader errors that occur only when executing op code 3 - a combined instruction that writes a line on the 1403 printer and then reads a card. 


I am going to stick the Diablo data separator circuitry onto the logic analyzer to understand more deeply what is occurring in those times where the circuits fail to properly handle timeshifted transitions and report a clock pulse as a data bit, for example.

Points to capture and feed to the logic analyzer

To do this, I can rely on three test points but still need to attach to three pins on ICs deep on the board. The board sits in a group of PCBs such that the surface of this board is fractions of an inch away from the back of the adjacent board.

There is no room for traditional grabbers without using a board extender - which I don't have. I tack-soldered wires onto the outsides of the selected chip pins so that the wires extended out of the congested area.

The lines were hooked up to the logic analyzer and ready to begin testing tomorrow morning. 

Tuesday, November 8, 2016

Discovered cause of sectors with permanent checksum validation errors, working on possible solutions


The bit error rate is currently around one wrong bit in every 150,500. Since there are about 4,320 bits in the data words, checksum and sync, this comes out to the 2.87% rate of checksum validation failures.

In order for this error rate to produce a sector which passes the checksum but has corrupted contents, we need an even number of bit flips in each of the 16 positions in a word. By position, I mean that the high order bit of a word is one position, the least significant bit is another position, etc.

If we have two bit flips, but they are not aligned in the same position, the checksum will catch it. Only when an even number of flips occur in a position will that error slip through. The checksum is a simple XOR of all the words; XOR by definition is independent in its action in each position.

To have a read which appears valid but is not, we need 2 bit flips in the same position in the same record. Crudely, this is 2.87% x 2.87% for two bits, and then at least 16x rarer to have the second bit occur in the same position as the first. That is a 0.005% chance, or 1 in 19,424 sectors. With 4,872 sectors, the chance of an entire cartridge having one sector with a false positive read is 25%.
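The arithmetic above can be checked with a few lines of Python, under the same crude independence assumptions as in the text:

```python
bit_error_rate = 1 / 150_500     # one flipped bit per ~150,500 read
bits_per_record = 4_320          # data words + checksum + sync

# Chance a record contains a flipped bit (a checksum failure)
p_fail = bits_per_record * bit_error_rate
print(f"{p_fail:.2%}")                          # -> 2.87%

# Crude estimate of an undetected error: two flips in one record,
# the second landing in the same bit position as the first (1 of 16)
p_undetected = p_fail * p_fail / 16
print(f"1 in {1 / p_undetected:,.0f} sectors")  # ~1 in 19,400

sectors = 4_872                  # sectors on a full cartridge
p_cartridge = sectors * p_undetected
print(f"{p_cartridge:.0%} chance per cartridge")  # -> 25%
```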

All this depends on what causes the problems. If there are conditions that always or mostly lead to the error, then I might be able to work around it by mitigating the underlying cause, otherwise it is purely random.

In the random case, running the ReadEntireCartridge several times and using autoretry should produce files that can be merged with a majority vote to clean up any sporadic corrupted sectors that sneak through the checksum test.
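The majority-vote merge could look like this Python sketch. It assumes the reads are aligned word-for-word, and `merge_reads` is a hypothetical helper, not a function of the actual disk tool:

```python
from collections import Counter

def merge_reads(images):
    """Merge several full-cartridge reads word by word, majority vote.

    images: equal-length word lists from repeated ReadEntireCartridge
    passes. Assumes corrupted words are sporadic and uncorrelated
    between passes, so the most common value at each slot is correct.
    """
    merged = []
    for words in zip(*images):
        merged.append(Counter(words).most_common(1)[0][0])
    return merged

a = [0x1111, 0x2222, 0x3333]
b = [0x1111, 0xDEAD, 0x3333]   # one corrupted word in this pass
c = [0x1111, 0x2222, 0x3333]
print([hex(w) for w in merge_reads([a, b, c])])
```

As the next paragraph notes, this only helps against random corruption; a condition tied to the recorded signal itself will corrupt the same sector the same way on every pass.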

In the situation where this is conditionally induced, the right combination of conditions in one sector may make it sneak through even with multiple cartridge read passes. That is, the conditions occur in just the right (wrong) points in a given sector so that it will almost always have a pair of flipped bits whenever it is read.

I set up the scope and went after the sectors which incurred checksum failures, to look for conditions that cause the errors. It would be easiest when the conditions are in the shorter records - header or label - but essentially all of them are in the data record, which is the only one long enough to have a real risk of a bit flip if the root cause is randomly occurring.

I looked at Cylinder 1, Head 0, Sector 2 which had a hard checksum failure on the data record. I found a spot in the data record where a clock pulse was clearly missing coming from the drive! Very consistent and of course it would shift all the bits over by 1 after that point.

Missing clock pulse leading to checksum validation error

Looking at the waveforms inside the drive showed that the flux reversals were a bit smaller at that spot, which could be symptomatic of a bad spot on the surface, but that is far from conclusive. Below is the output of the differential amplifier, where we see the reversal is stunted in both positive-going and negative-going limits compared to all the others.

smaller clock flux reversal at time when clock pulse is dropped

When we look at the signal after the clipping amplifier, which should even out such variations, we see the waveform below. All I notice is that the transition is a bit squeezed compared to the others, closer to where it might look like a data pulse transition instead of a clock transition.

Clipping amp output showing slight 'squishing' of transition

The suspicion is that the data separator logic is malfunctioning when presented with this waveform, neither seeing it as a data bit value of 1 nor as a clock pulse. It makes sense that timing shifts like this would be able to block the pulse, if we look at the relevant part of the J10 card schematic.

Circuits at left pick time when pulse is Data versus Clock
Once again, jitter on the disk signal is leading to problems. In this case, the next clock pulse arrives too early while the data separator is still looking for a data value transition, thus blocking the pulse from reaching the ReadClock line.

I need to validate this by looking at other test points on the card. Specifically, I want to verify that the pulse itself is generated at TP6, after the clip amplifier, since we need a pulse to pass through. Interestingly, there is another electrolytic capacitor at the output of the pulse driver transistor. I decided to replace it with a new known good capacitor.

I ran the same read of Cyl 1, Head 0, Sector 2 with the changed capacitor and watched on the scope. The same thing happened - lost clock pulse and spurious ReadData bit. I will defer the look at TP6 because it appears the pulse is getting emitted, it is just the timing that is far enough off to fool the data separator.

If the pulse at TP6 exists, then the next point to watch is TP5 to see whether the Data Gate is on, blocking the pulse, or not. Finally, I want to look at the clock line at TP3. TP3 and TP5 are easily reached, but I have to find access to TP6, which is awkwardly buried in the card, making it hard to reach without the card and cable extender (that I don't have).

We can see that the TP 5 Data Gate signal gets scrunched up to left of center, then loses the clock
Lack of data gate pulse leads to failure to pass pulse as Clock here at TP3
TP4 shows the pulse slipping through as Data due to separator error

Once again, jitter in the signal recorded on the platter has led to failure to correctly read the sector. In this case, the jitter has caused the data separator to fail to properly allot the transition as a clock rather than data pulse.

I can see how to spot and compensate for this particular condition - having a watchdog timer to spot a missing clock pulse - but unclear how I would handle jitter surrounding bitcells each with a 1 data bit as it would simply shift the next data bit to come out as a clock pulse.
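The watchdog idea might look like this in Python (a sketch only; the 900 ns threshold, 1.5 bit cells, is my assumption for "a clock pulse went missing"):

```python
CELL_NS = 600          # nominal Diablo bit-cell length
WATCHDOG_NS = 900      # 1.5 cells with no clock -> a pulse went missing

def find_missing_clocks(clock_times_ns):
    """Flag gaps in the ReadClock train longer than the watchdog limit.

    A sketch of the compensation idea above: a timer that notices when
    a clock pulse fails to arrive, so the logic could insert a
    placeholder cell instead of letting every later bit shift by one.
    """
    gaps = []
    for prev, cur in zip(clock_times_ns, clock_times_ns[1:]):
        if cur - prev > WATCHDOG_NS:
            gaps.append(prev + CELL_NS)   # estimated time of lost pulse
    return gaps

clocks = [0, 600, 1200, 2400, 3000]      # the pulse at 1800 never arrived
print(find_missing_clocks(clocks))       # -> [1800]
```

This only catches a dropped clock; as noted above, it would not help when jitter between two 1-bit cells shifts a data pulse into the clock slot, since the pulse train still looks regular.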

If only the drive emitted the unseparated pulse train and let me assign them as clock or data signals, I could address the shift more easily in my logic. However, that is not the interface offered by a Diablo drive.

I need to do a lot of thinking. How the Diablo mechanism might be adjusted to handle the timeshifting better. How I might be able to recognize and compensate for the Diablo issues. What scope and logic analyzer traces I might add to the card to understand what happens with the timeshifting situations. 

Monday, November 7, 2016

New logic for handling incoming bit stream working, dropped sector error rate by half


My initial testing showed much of the new logic seemed okay, but the critical part that recognizes whether a bit cell has ended and whether it contained a 1 or a 0 did not appear to be working properly.

Fortunately, I instrumented quite a bit of the new logic so this means some painstaking stepping through the logic analyzer traces while examining my state machines. The goal is to find behavior that doesn't reflect what I expect to see, then look at the logic to find the flaw.

The logic for the fuzzy recognition of ReadData and ReadClock worked - three in a row got these machines to one of the two end states, while a sequence of erratic readings are ignored. The logic to recognize the sync word worked properly.

The fourth machine is the critical one that will delineate bit cells, remember whether ReadData reached the 1 state during the cell, output the value of the bit cell as 1 or 0, and trigger the deserializer to put that bit value into the shift register.

The fourth machine is not working correctly. It is stuck forever with a bit value of 1 once it finds any bitcell with a 1 inside. I added reset logic to set the bit value back to 0, then watch for a ReadData of 1 to flip it back on. Back to testing.

I now find the logic is not starting properly at sync, loading an incorrect initial bit of 0 when we are just at the end of the sync word bitcell and not at the end of the first real word bitcell.

The solution is to extend the synchronization state machine - which had turned on the synced state after seeing a ReadData value of 1 and then the ReadClock pulse - but now will wait until ReadClock goes back to off before flipping on sync. Actually, I am using the fuzzy machines that handle ReadData and ReadClock.

Testing this refinement produced a big improvement in reading accuracy. I only encountered checksum errors on 140 sectors, a rate of 2.87%. I slashed the rate of errors in half with my redesigned state machines. Some of the sectors that previously gave me pretty consistent errors are now reading clean every time.

Next, I will investigate any sectors that still get errors, to see if I can find anything on them that could explain the error and possibly let me further reduce the error rates.

As an experiment, I ran the ReadEntireCartridge transaction with autoretry enabled, to see what rate of errors occur in that case. The rate dropped to 1.6%, or 78 sectors with checksum errors out of 4,872.

Tomorrow I will go after the sectors with the errors, with and without retry, trying to dig into what is occurring in those cases. I will keep digging until I can find no more ways to improve the logic or compensate for errors or mitigate errors. 

Sunday, November 6, 2016

Developed new approach to handling the clock and data signals from the drive


I spent a couple of days examining different ways to handle the incoming ReadClock and ReadData signals, to deal with the imperfect timing of the real disk drive. That is, the time between clock pulses is not a constant 600 ns. Further, if the signal changes just at the clock edge, we might have an unstable state for a short time which could confuse my state machines.

Dealing with asynchronous incoming signals is challenging, once you go down the rat holes imagining all the conditions you might face with your logic. The key is to develop logic that results in the correct state or answer in spite of the various conditions you imagine.

My clock state recognition machine is now driven by finding three of a given value in a row to swing the state of the clock to one of the binary states. That is, each on or off signal value moves us 'rightward' or 'leftward' between four conditions - clock value 0, intermediate 1, intermediate 2, and clock value 1.

As long as the signal holds the same value for three samples in a row, we reach the end point that reflects that incoming value. One or two conflicting samples in a sea of contrary ones will dither slightly off the end points but never reach the alternate end point.

Our clock signals are on for a minimum of 100 ns coming from the Diablo, which is approximately five cycles of the FPGA. We should read the on state of the clock during that interval. The off condition is a bit less than 500 ns, or 25 cycles of the fpga.

Since this state machine has two indeterminate, intermediate states between the valid off and on states, I have to be mindful that any other FSM which depends on the clock state be written with this reality in mind.
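A Python sketch of this four-state filter (the actual machine is VHDL sampled on the FPGA clock; the class and sample values here are illustrative):

```python
class Debounce:
    """Four-state saturating filter for a noisy asynchronous input.

    States 0..3: 0 = resolved low, 3 = resolved high, 1-2 intermediate.
    Three identical samples in a row are needed to swing from one
    resolved end to the other; brief glitches only dither near an end.
    """
    def __init__(self):
        self.state = 0

    def sample(self, raw):
        if raw:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)
        if self.state == 0:
            return 0          # resolved low
        if self.state == 3:
            return 1          # resolved high
        return None           # intermediate -- hold previous decision

d = Debounce()
samples = [1, 0, 1, 1, 1, 0, 0, 0]   # glitchy rise, then a clean fall
print([d.sample(s) for s in samples])
# -> [None, 0, None, None, 1, None, None, 0]
```

The `None` results are the intermediate conditions the paragraph above warns about: any downstream FSM must hold its last resolved value while the filter is between the end points.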

The next FSM is the one that recognizes whether we got a ReadData value of 1 during a particular bit cell (interval between clock pulses). It can use a similar strategy where it takes three or more of the same value to drive the FSM to the on or off endpoints.

However, what I truly care about is whether the machine ever reached the on endpoint during the bitcell time, regardless of its current value. It can swing to on for a short time, then swing all the way to off, but we have to remember that we saw a '1' in order to insert the appropriate bit value into the deserializer.

The third piece of this machinery that is critical is the FSM which decides when to recognize that we had a 0 or 1 bit cell. This must reliably detect when the bit cell has ended and also must keep an eye on the ReadData machine to see if we ever reached the on state during the cell.

This third machine was the most critical of all. After many false starts and alternate approaches, I settled on one that I believed would give the most reliable results. It was all done and time to start testing, although the extensive redesign made it unlikely it would all work properly first time around.

The fourth machine must detect the first 1 bit that turns on sync when we begin to read a record, going through the preamble of many zeros until the 1 bit flags the start of data words. I will retain the existing machine for now, as it appears sound to me.

When doing major changes, one has to wait to see how the tool appears to be misbehaving and then develop all the instrumentation needed to attack the problem. I was iterating to set up logic analyzer signals and test the new logic through the evening. 

Friday, November 4, 2016

Possible situation causing misreads


I spent the day poking at various locations, observing the signals and trying to find a cause for intermittent checksum failures at these sectors. Nothing jumped out at me but I will keep looking for a clue that is reproducible and can be clearly tracked to a cause.

My working hypothesis is shifting of the clock timing. There is a specific point where the clock pulses don't arrive 600 ns apart, right before two data bits that sometimes clock in as 1100 and sometimes as 0110, meaning I am sometimes skipping a clock cell ahead. The latter pattern, 0110, which includes an extra 0 value that doesn't exist, is when the checksum error occurs.

Sometimes this is captured as 01101100 and sometimes as 01100110
Clock bits where this happens are irregularly spaced

Therefore my logic to recognize clock pulses, set up the data bit values, and push them into the deserializer has some weakness that is triggered by the slightly deformed timing of clock pulses. I will have to stare at the state machine, change the data being emitted to the logic analyzer, and test some more to spot where it goes awry.

Looking at the two sides of the differential amplifier decoding the head signals, we see the following:

one side of differential amplifier

other side of differential amplifier
The signals from the differential amplifier look well formed and about the same, except for a bit of bias (notice the one bit at left rises peak to peak in the bottom scan but is flatter peak to peak in the top scan). I can't do anything with this observation yet, but I am saving it in case it becomes relevant later.

If somehow the decision to count extra bit cells is due to a weak process in my fpga logic, then I could improve reading quality by redesigning this part of the hardware. I will look into a different way to handle the incoming ReadClock and ReadData pulses that might avoid this problem.

Thursday, November 3, 2016

Inconclusive investigation of one sector that receives checksum errors


Today I am going to dig deeper into sectors that had checksum errors while being read, to see if I can spot anything in the signal that preserves the real information even though it foils correct reading. Potentially there are conditions I could recognize to improve the successful reading rate, but it all depends on why these sectors fail.

I set up the logic analyzer to trigger the oscilloscope at the start of reading a sector, employing various delays to look at different portions of the incoming signal. The scope can look at the output of the differential amplifier, which should show the actual magnetic flux reversals on the disk surface. It can also show the separated data bits.

I picked one of the sectors and triggered the scope to capture what was occurring, both as ReadData bits emitted from the separator and as raw transitions from the disk surface. I found a spot in the Label record where I could see a bit that appeared or didn't over multiple reads.

I moved the scope to focus more on this time and verified that the bit appears or is hidden on different passes. It is consistently that one bit in this area that pops up or does not.

Region when bit is not detected
Same region, extra 1 bit shows up

I then looked at the raw transitions relating to this and began to study them for signs of some anomaly. Note that it is the Diablo hardware which is responsible for this bit appearing or hiding, not my logic. The signals above are sampled from the incoming ReadData line.

Anomalous region where the random bit appears

A region with all zero data bits corresponds to a flux transition every 600 ns, as shown below. The spacing is nice and regular, as are the voltage swings.

Waveform during string of 0 data bits
Next we look at a region with some 1 bits mixed in, to see what the waveforms look like. I have included a diagram of what we should see as input and the effect it has on recovered data bits.

Some 1 bits, adding transition between the 600 ns clock swings
What mixed waveforms should look like and bit recovery

Repeating our anomalous region to compare to the waveforms, we interpret it based on the midpoint of each transition. Using the centerline as time 0, the bit cell with a data value of 0 ends with the transition at 250 ns. The cell with a 1 data value has the extra transition occurring at 550 ns and is completed at 850 ns.

Repeat of suspect signal

We see another transition at 1150 ns, which is the timing for another 1 bit. The end of that bit cell is at 1500 ns, close to when it should occur. The next transition is at 2150 ns, consistent with a bit cell containing a 0 data bit.
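The by-eye interpretation above can be captured as a small decoding routine. A sketch, assuming 600 ns nominal cells with boundary (clock) transitions roughly a full cell apart and data transitions near mid-cell; the 450 ns threshold is my choice for illustration, not a value from the drive documentation.

```python
CELL_NS = 600       # nominal bit cell length
MID_MAX_NS = 450    # a gap shorter than this is taken as a mid-cell data pulse

def decode(transitions):
    """Decode data bits from flux transition times in ns.
    The first transition is assumed to be a cell boundary (clock)."""
    bits = []
    i = 0
    while i + 1 < len(transitions):
        gap = transitions[i + 1] - transitions[i]
        if gap < MID_MAX_NS:
            bits.append(1)   # short gap: extra transition mid-cell -> data 1
            i += 2           # skip past the data pulse to the next boundary
        else:
            bits.append(0)   # full-cell gap with no extra transition -> data 0
            i += 1
    return bits

# The transition times read off the scope above, starting at the 250 ns boundary:
print(decode([250, 550, 850, 1150, 1500, 2150]))   # [1, 1, 0]
```

This agrees with the reading in the text: two 1 bits in the cells following the 250 ns boundary, then a 0 bit ending at 2150 ns.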

I have to conclude that this is not an anomaly at all, but a pair of 1 bits. I was catching the wrong part of the signal on the scope. I had to retest and shift the scope around to find the location with the random bit value - somewhere to the right of the 561.9 us point where this picture was taken.

The captured spot is a clean 1 bit, so I went back to watching the output of ReadData to see if I can spot it failing to output the 1 value.

I did see that this sector will sporadically get a Header record checksum failure, often enough to watch for differences in the delivered bits on ReadData. This is easy because one of the two words is 0000, so all I have to watch is the other header word and the checksum.

What I found was zero difference between times when the read was successful and times when it failed, at least as far as I could see. Below are the four scope images, two pairs of successful and unsuccessful reads of the header word or checksum.

First set, first photo

First set, second photo
Second set, first photo

Second set, second photo

This is especially confusing, meaning it might be a race hazard or other vulnerability in my logic that is decoding these improperly. One more thing to check - I need to watch a long stretch before this header record begins to be sure that there are no spurious 1 bits sneaking in to cause the checksum failure.

Nothing conclusive yet, but I will keep investigating in the hope that I find something which makes reading even more reliable. 

Wednesday, November 2, 2016

Restored the erased sectors on Cylinder 0, then began data analysis of the extracted disk image

Today is my day working at CHM on the 1401 systems, but I did some work in the late afternoon and evening. 


I received a copy of some informal Python code from Ken Shirriff to transpose between my disk tool file format and the 'standard' archive format used with Alto simulators and for storage. The code works in one direction, from my file format to the standard, but I need it to work both ways.

The changed code was used to build my image file. I set up for a test, downloaded the file to RAM, then wrote the 24 sectors on Cylinder 0 using this file data as the content. I then uploaded RAM after having read a few of those rewritten sectors - all without any checksum errors on those read operations. 

The uploaded file was transformed back to the standard format and compared to the archive file from which the recovered sectors were taken. There are some differences but these are all in areas where just running Contralto and booting the cartridge causes very similar changes, comparing the archive file before and after I booted it on the simulator. 

I have now replaced the 24 sectors that were accidentally erased when I had the procedural error a month ago, although I may rewrite them with the pure archive version at some point. 

This accident occurred when I loaded new firmware to the fpga while the disk was still spinning, apparently activating the WriteGate and EraseGate lines while the disk spun under the heads at cylinder 0. The fpga wasn't modulating the WriteData&Clock line, so the result was a fully erased track on both head 0 and 1, with no clock or data transitions at all. 

I ran the ReadEntireCartridge transaction to see which sectors reported errors and where the sectors differ from the archived image. I plan to monitor the raw data coming from test point 1, compare it to the ReadData bits returned, and see if I can spot why the record had a checksum error.

It will be easiest with Header or Label records, but most of the errors I captured this time were in the 256 word Data record. That will require looking at 4,112 bit cells (one or two transitions per bit cell, each producing a 0 or 1 data value), covering a span of 2.467 milliseconds, the majority of the sector. Much better to find a Header or Label record, requiring only 48 or 144 bit cells and about 29 or 86 microseconds.
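The arithmetic behind those spans, as a quick check (each record is its data words plus one trailing checksum word, 16 bits per word, 600 ns per bit cell - all figures from this log):

```python
WORD_BITS = 16
CELL_NS = 600

def record_cells(data_words):
    # a record carries its data words plus one trailing checksum word
    return (data_words + 1) * WORD_BITS

header = record_cells(2)      # Header: 2 words + checksum
label = record_cells(8)       # Label: 8 words + checksum
data = record_cells(256)      # Data: 256 words + checksum
print(header, label, data)                     # 48 144 4112
print(round(data * CELL_NS / 1e6, 3), "ms")    # 2.467 ms
```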

With 256 words, I can't possibly hand calculate the checksum of the bits I receive. Fortunately, I can have the logic analyzer display the calculated checksum and all I have to do is compare it to the word read in. That may tell me which bit position(s) didn't checksum properly, but I will also look for unusual signals coming in anywhere in that record. 

I have identified sectors that have a checksum error in the Label record:
  • Cylinder 73, Head 0, Sector 5
  • Cylinder 110, Head 1, Sector 11
A sector with errors on both Label and Data records:
  • Cylinder 137, Head 1, Sector 0
Finally, I found a sector with checksum errors in the Header and Data records:
  • Cylinder 141, Head 1, Sector 11
There are a few other sectors with Label record errors and plenty of sectors with errors only in the Data record, but I can learn something from the first four above, which might guide me to a compensatory strategy or perhaps require investigation of more sectors with errors.

I have been using some analysis and visualization tools built by Ken to compare my uploaded file with the archive and other versions of xmsmall.dsk. They found many differences in the index word of each sector in the archive, which is generated by the archiving tool but not actually read or written by an Alto.

I changed the code to ignore the index word, as it is irrelevant to analyzing the accuracy of disk reading. I then found that the number of sectors with differences between my uploaded file and the archive file from bitsavers was 374.

The number of sectors with checksum errors was over 200, meaning we had about 3.5% difference. That matches quite well with the 3.7% difference rate between the bitsavers archive before and after it was booted under Contralto. 

My in depth study of the disk surface will take place tomorrow, as it is getting late now. 

Tuesday, November 1, 2016

WriteSector working properly now


It is awkward to switch between monitoring writing and reading, as the signals to be sent to the logic analyzer are mostly different. I finally set up a slide switch that will alternate between banks of signals appropriate to each operation.

With it all generated, and the corresponding changes made to two saved logic analyzer settings for reading and writing, it was lunchtime. I fired up the testbed for a first capture of a WriteSector right after I finished my meal.

The capture looks good up to the end of the first record of the WriteSector. My trigger conditions are to wait for the WriteSector FSM to enter the sector wait step, then trigger on the gotsector output of the sector matching logic.

This assumes that gotsector is issued at the beginning of the correct sector. I can see that SectorMark is on when gotsector is issued, but can't verify the correctness otherwise. I will do independent tests to check for this later.

I did discover the cause of the sporadic power-down of the disk drive I had been experiencing. The low cost power supply that delivers +15 and -15 to the drive seems to thermal out after a period of operation. In spite of the specifications that assert this can power the drive, it clearly can't or it has some internal defect.  The solution is shorter test intervals with a cooldown in between.

I watched the delivery of the proper header record, all the times and the WriteData&Clock states matched what should be happening. Since my ReadSector logic appears to read this record properly, but finds a checksum error on the label record that follows it, I need to watch further along in the WriteSector operation.

At first I worried because the Alto drops ReadGate between records but my logic does not, and I somehow feared the drive wouldn't resync; the schematics proved that to be a needless worry. All ReadGate does is gate or ungate the output of the separator, which is emitting clock and data pulses - it does not affect anything that would cause the drive or logic to sync up.

I did some tests using the logic analyzer trigger output to trigger the scope - looking at SectorMark at the time that I received the gotsector signal. They were essentially at the same time, except for a few 20 ns cycles used by my matching logic. As long as this is the correct sector number, I am beginning to write at precisely the proper place - the beginning of a sector.

Next up, I triggered the scope at the same gotsector point, advanced to the time when the sync bit would be emitted to start the label record, and did see that transition, both on the logic analyzer and by reading from the differential amplifier inside the drive via test point 1.

The odd dip of the signal occurs when writing, which definitely confuses the data separator and makes it return junk. I still don't have official statements about the validity of the ReadClock and ReadData signals during writing, only hints from comments in Alto design documents, so I can't clearly flag this as a flaw.
Reflected write signal showing sync word 1 bit is 2 us to right of center, but also notice dips
I did see an anomaly - the bits being written for the label record are not correct - I see the first four bits of the data word going out as '1' but the intended content is not that. I can't double check the word showing in the logic analyzer against this because I am only emitting the low 8 bits of the word. I do see what appears to be an incorrect memory address, so I will look there.

I will also emit the sector number so that I can check it against the gotsector signal, since I am going through the investment of a 30 minute cycle to generate the new bitstream with high data word and sector number signals.

Testing with the new signals and the modified logic analyzer formatting showed me two things right away. First, the sector number is indeed 0000 when gotsector fires, which is what it should be. Second, the fetched word to serialize is incorrect. Not sure why, but time to zoom in on the memory access as part of the WriteField process.

Time to look for race hazards or timing errors in my use of the memory access FSM. Another half hour of idling about until I could watch the signals I selected - bits 16 down to 1 of the memory address and the low order three bits of the fetched data word. I can easily check this against the contents of RAM which I can fetch through my USB link transactions.

It appears that it is a timing issue, which I hope I corrected by adding one extra cycle before loading the serializer with the output of the RAM access. I also switched the instrumentation to show the full output data word from memory.

It appears I am writing the contents correctly now - verified at least through the header and label records - but the checksum that is written out by WriteSector is not valid when doing a ReadSector. This is undoubtedly also a timing issue that just needs a bit of adjustment to get right.

As I suspected, the signals from testpoint 1 are clean, with no dips in the middle, when doing a read. Similarly, the separated data bits are clean and correct. This tells me that at least with this drive, it is not possible to 'loop back' the ReadClock and ReadData to check what is being written.

Testpoint 1 while reading - sync bit 1 to left of center and first word x7900

ReadData bits corresponding to signal above - good data separation
I am fairly certain I found the problem. It is a classic mistake where a clocked process in VHDL is coded as if it executed sequentially, like software, when in fact all signal assignments take effect only at the next clock edge.

I updated the checksum by XORing the last word of the record, then later output the checksum as the next output word for the serializer. Sounds correct, but in fact the latter step will use the old value of the checksum, prior to the XOR. Thus we leave off the value of the last word from the checksum.
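That signal-scheduling pitfall can be modeled outside VHDL. In the toy model below (Python standing in for the HDL; the word values are made up), every assignment in a cycle is computed from the old state and applied together at the "clock edge" - exactly the behavior that left the last word out of the checksum.

```python
def clocked_cycle(state, updates):
    """Apply all updates together, as VHDL signals change only at the clock
    edge. Note the caller computes `updates` from the OLD state."""
    new = dict(state)
    new.update(updates)
    return new

state = {"checksum": 0x0123, "out_word": 0x0000}   # made-up values
last_word = 0x4567

# The bug: XOR in the last word and emit the checksum in the SAME cycle.
state = clocked_cycle(state, {
    "checksum": state["checksum"] ^ last_word,  # takes effect next edge
    "out_word": state["checksum"],              # still reads the OLD checksum
})
assert state["out_word"] == 0x0123              # last word was left out

# The fix: emit one cycle later, after the final XOR has landed.
state = clocked_cycle(state, {"out_word": state["checksum"]})
assert state["out_word"] == 0x0123 ^ 0x4567
```

Adding the one extra cycle before emitting the checksum is the same shape of fix as the serializer-load delay earlier in the day.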

The change was made, as well as displaying the running checksum on the logic analyzer - after the obligatory 30 minutes of processing - and I tested again. First I run a WriteSector and observe the checksums generated and written by the logic. If that passes, I will switch over and test a ReadSector.

As a means of testing checksum generation, I built a little spreadsheet and used that with the label record value of the real sector 0 I am trying to write. The checksum should be x7CF3 if it includes all eight words of the record.
My spreadsheet to calculate checksum values for records
This explains why the ReadSector validated the checksum of the header record but rejected the label and data records. The contents of the two words of this header are 0000 and 0000, which produce the same checksum whether I drop the last word or not. XOR by 0000 leaves the checksum unchanged.
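The blind spot is easy to demonstrate. A sketch assuming an XOR-style running checksum, as described above; the 0x151 seed is my assumption for illustration and not taken from this post.

```python
SEED = 0x0151   # assumed initial value, for illustration only

def checksum(words):
    c = SEED
    for w in words:
        c ^= w   # running XOR over the record's words
    return c

header = [0x0000, 0x0000]            # the all-zero header from the post
label = [0x1234, 0x0000, 0xABCD]     # hypothetical nonzero record

# Dropping the last word changes nothing when that word is zero...
print(checksum(header) == checksum(header[:-1]))   # True
# ...but is caught as soon as the last word is nonzero.
print(checksum(label) == checksum(label[:-1]))     # False
```

So a record ending in 0000 produces the same checksum with or without the bug, which is why the header record kept validating while the label and data records failed.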

When the new bitstream was ready, I set up the testbed and ran my WriteSector. The checksums were correct. When I ran the ReadSector, it completed with no checksum errors on any of the three records. I dumped the RAM up to the PC and validated that it was a word for word copy of what I transferred down to RAM to write to sector 0.

I will call it a day - a nice success point in the testing, so I leave on a high note. Next up is to adapt or write a program that lets me easily load all the sectors of cylinder 0, which had been inadvertently erased, into RAM, allowing me to restore cylinder 0 to the contents found in the xmsmall.dsk archive that matched this cartridge.

Digging into what is on the disk surface, what I read and what I wrote


I happened to set up the scope to trigger on SectorMark and show the incoming stream from the disk heads at various delays afterwards. I noticed that most of the stream was the unmagnetized, essentially erased noise signal but there are patches with strong clock and data modulation.

I then did single shot traces, which would mostly give me noise but sometimes would hit a sector with the good data. Digging through that data showed me that I have a major timing problem with my WriteSector process.

The noise ran for about 240 microseconds then the disk surface had strong clear disk clock transitions. I forwarded almost 35 word times ahead to where it should be writing the sync word and header record, where I found some 1 bits modulated between the clock pulses.

I should be erasing almost immediately after the SectorMark, and clock transitions should begin within 2 us afterwards. Something is clearly wrong with my logic, which is all correct from the issuing of WriteGate/EraseGate onwards, but the time at which these are first asserted is apparently wrong.

Interestingly, I had very good extraction of the 1 bits and conversely no false bits extracted during the long preamble time where data is always 0 valued. Things are looking up, in an odd way. It does appear that if I can figure out how my WriteSector is malfunctioning and fix it, I should be able to write properly.

Time to go back to my logic analyzer trace of the WriteSector process, but begin the triggering at the rise of SectorMark to give me good timings for when the state machines are transitioning. To do this, I need to change the instrumentation on the fpga to emit those signals.

To reduce the amount of erased junk on the surface, I did a WriteSector of all 12 sectors on cylinder 0 head 0, which I confirmed by running the scope free and watching the signal from the heads. At least I know that the drive is capable of writing clock and data signals.

Working mostly with the scope at first, I triggered on IndexMarker and delayed far enough to be into sector 0. The first 1 bit, for the sync word, seemed to be properly placed inside the sector. Whether the remainder is well formed, correct and in its place will require slow careful examination.