Friday, November 18, 2016

Developed logic for seeks and data separation during write to emulated disk drive

ALTO DISK TOOL

Today I built the logic to handle seeks when the Alto system requests it of the disk emulator. It is essentially ready for testing. It should be accurate in timing, appearing to take the same time as a physical Diablo drive.

I set up the input and output signals to be sensitive to the Select Unit 1 signal and also refuse to respond if the virtual cartridge is not loaded. In addition, the ReadData and ReadClock outputs are gated by ReadGate.

A key component needed to handle writing to the emulated disk is a data separator. The computer sends 100 ns wide pulses, with the timing between pulses determining whether it is sending a 0 or a 1 bit. That is, a zero bit value is transmitted by delivering a 100 ns pulse followed by 500 ns of delay before the next 'clock' pulse. A one bit value is transmitted by sending the 100 ns clock pulse, a delay of 200 ns, a 100 ns data pulse, and a final delay of 200 ns before the next clock pulse.
I receive the bottom stream of pulses and must break out clock and data as shown above

The data separator sees the continual train of pulses every 600 ns as the clock pulses. It watches in the interval between clock pulses to see if a second pulse is transmitted midway between - that represents a 1 bit value. I built the logic to accomplish this and to drive my deserializer which shifts the incoming bit values into words which will be stored in RAM as the record is written.
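The pulse timing described above can be sketched in a few lines of Python. This is only an illustrative model, not the VHDL in the tool: the function names, the use of pulse start times, and the 450 ns decision threshold (roughly midway between a data pulse at 300 ns and the next clock at 600 ns) are all assumptions made for the example.

```python
CLOCK_PERIOD_NS = 600  # clock pulse to clock pulse, one bit cell
DATA_OFFSET_NS = 300   # 100 ns clock pulse + 200 ns delay before a data pulse


def encode_bits(bits):
    """Return the start time (ns) of every pulse on the combined write line."""
    pulses = []
    for i, bit in enumerate(bits):
        cell_start = i * CLOCK_PERIOD_NS
        pulses.append(cell_start)  # clock pulse opens each bit cell
        if bit:
            pulses.append(cell_start + DATA_OFFSET_NS)  # mid-cell data pulse
    pulses.append(len(bits) * CLOCK_PERIOD_NS)  # clock pulse closing the last cell
    return pulses


def separate(pulses, threshold_ns=450):
    """Recover bits: a pulse sooner than threshold_ns after a clock is data.

    threshold_ns is an assumed value chosen for this sketch.
    """
    bits = []
    last_clock = pulses[0]  # the first pulse is taken to be a clock
    saw_data = False
    for t in pulses[1:]:
        if t - last_clock < threshold_ns:
            saw_data = True  # pulse landed mid-cell, so the cell holds a 1
        else:
            bits.append(1 if saw_data else 0)  # next clock closes the cell
            last_clock = t
            saw_data = False
    return bits
```

Feeding `encode_bits([1, 0])` produces pulses at 0, 300, 600 and 1200 ns, and `separate` turns that train back into `[1, 0]`.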

Next up is the process that will continually read the sectors rotating under the virtual head, whether the results are transmitted to the CPU or not. This is the central process that will underlie the entire disk emulation. I spent the day building this in more detail, drawing Moore and Mealy diagrams until I felt it was ready, but I haven't yet begun to code it in VHDL.

Wednesday, November 16, 2016

Outline of the disk emulator role logic

ALTO DISK TOOL

The core of the disk emulator logic is a process that will start up with the first sector marker it sees while the disk is loaded and not currently seeking. This will generate the data and clock signals for every sector as it rotates past the virtual head, also keeping track of whether we are starting the header, label or data record of the sector.

Those data and clock signals will not be delivered over the interface unless the computer raised ReadGate. With ReadGate true, we pass along the signals to the computer as ReadClock and ReadData.

Thus, if the computer decides it is reading a given sector, it will turn on ReadGate and see the appropriate bits stream in. If it is an update operation, it sees the first record(s) using ReadGate and then starts writing by raising WriteGate.

When we see WriteGate go true, the separator watches for alternations of the WriteClockandData line. Each such reversal is seen as a clock or a data bit, depending on the timing from when we see the very first transition.

A transition that occurs less than 500 ns after the prior clock transition is read as a 1 bit. This will be followed by another transition in much less than 600ns to represent the clock pulse ending the bit cell. If the transition occurs after a 500ns delay from the prior clock, it is the succeeding clock bit and there was a 0 data value in the bit cell.

Each bit cell injects the 0 or 1 bit value into the deserializer, shifting in to form 16 bit words. The deserializer signals when a word has been accumulated, so that it can be sampled by other processes.

When the disk is selected, loaded and WriteGate turns on, it begins a process to write one or more records in the sector. The continual read process tells us whether this is going to be the header, label or data record we are writing. We begin tracking the incoming words that the computer writes to us, looking for the words that will be stored into RAM at appropriate locations.

The write process drops zero bits, looking for a 1 bit that represents the sync word which begins a record. This is what actually syncs up the deserializer to begin informing us as words are ready for sampling.
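A rough Python model of the sync-then-shift behavior follows. The class and method names are invented for illustration, and shifting the most significant bit in first is an assumption; the source only says that zero bits are dropped until a 1 bit arrives and that 16-bit words are then accumulated.

```python
class Deserializer:
    """Drops leading zero bits, syncs on the first 1, then builds 16-bit words."""

    def __init__(self):
        self.synced = False
        self.shift = 0
        self.count = 0
        self.words = []  # completed words, in arrival order

    def push_bit(self, bit):
        if not self.synced:
            if bit == 1:
                self.synced = True  # the sync bit itself is not stored
            return
        # Assumed bit order: first bit received is the most significant.
        self.shift = ((self.shift << 1) | bit) & 0xFFFF
        self.count += 1
        if self.count == 16:
            self.words.append(self.shift)  # a word is ready for sampling
            self.shift = 0
            self.count = 0

    def drop_sync(self):
        """Return to hunting for the next sync bit (between records)."""
        self.synced = False
        self.shift = 0
        self.count = 0


def word_bits(word):
    """Helper for the example: the 16 bits of a word, MSB first."""
    return [(word >> i) & 1 for i in range(15, -1, -1)]
```

Pushing three zero bits, a single 1 bit, and then the bits of two words leaves exactly those two words in `words`; the leading zeros and the sync bit never reach the shift register.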

After sync is achieved, we extract 2, 8 or 256 words, storing each in the RAM slot assigned to it for this sector. Meanwhile, we calculate a running checksum using bitwise XOR and a seed value of 0x0151.

Following the last data word of the record, we extract a checksum and verify that it matches the checksum we calculated. There is no way to reflect the error back to the processor, but we can flag it for our awareness.
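The running checksum is simple enough to show in Python. This sketch uses the XOR operation and the 0x0151 seed given above; the function names and the idea of returning a flag from `verify_record` are illustrative assumptions.

```python
from functools import reduce

CHECKSUM_SEED = 0x0151  # seed value stated in the text


def record_checksum(words):
    """XOR all 16-bit words of a record together, starting from the seed."""
    return reduce(lambda acc, w: acc ^ (w & 0xFFFF), words, CHECKSUM_SEED)


def verify_record(words, received_checksum):
    """True when the checksum word written by the computer matches ours."""
    return record_checksum(words) == received_checksum
```

Since the error cannot be reported to the processor, a caller would simply record a `False` result from `verify_record` for later inspection.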

After a checksum word, there is a stream of up to 4 words of zeroes, which tells us to drop sync and prepare to write the next record of the sector (unless we just finished the data record). The deserializer drops the sync condition, ignores zero bits and waits for the first 1 bit to act as the sync word.

If WriteGate is dropped, we stop and go back to idle state. The read process continues to run at all times the disk is loaded, fetching the newly written words once the virtual platter rotates the sector back under the virtual head.

We always address RAM with the current sector number and the current value of the Head signal coming from the computer. The cylinder value that addresses RAM is initially zero and is updated by any seek process.

The seek process sits at idle until the disk is selected, loaded and the Strobe signal is activated. ReadytoSeekReadWrite goes false in 2.5 us and we wait another 27.5 us before signalling either AddressAcknowledge or, if the cylinder requested is >202, LogicalAddressInterlock.

The process calculates the movement distance based on the prior cylinder address and the value presented by the computer when Strobe is activated. We wait 600 us per cylinder plus 14400 us settle time, to model the physical seek duration on a real drive. The process will then respond by returning ReadytoSeekReadWrite to true which signals completion of the seek.
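As a worked example of the seek timing, here is a small Python sketch using the figures above. The function name and the returned dictionary are made up for illustration. Two details are assumptions rather than anything stated in the text: the settle time is charged even for a zero-distance seek, and a rejected request leaves the cylinder unchanged.

```python
STROBE_TO_NOT_READY_US = 2.5   # Strobe until ReadytoSeekReadWrite goes false
NOT_READY_TO_ACK_US = 27.5     # further wait before the acknowledge signal
PER_CYLINDER_US = 600
SETTLE_US = 14_400
MAX_CYLINDER = 202


def seek_plan(current_cyl, requested_cyl):
    """Describe how the emulator would answer one Strobe."""
    ack_delay_us = STROBE_TO_NOT_READY_US + NOT_READY_TO_ACK_US
    if requested_cyl > MAX_CYLINDER:
        # Assumption: an illegal address does not move the virtual arm.
        return {"response": "LogicalAddressInterlock",
                "ack_delay_us": ack_delay_us,
                "seek_us": 0,
                "new_cyl": current_cyl}
    distance = abs(requested_cyl - current_cyl)
    return {"response": "AddressAcknowledge",
            "ack_delay_us": ack_delay_us,
            "seek_us": distance * PER_CYLINDER_US + SETTLE_US,
            "new_cyl": requested_cyl}
```

A 10 cylinder move works out to 10 x 600 + 14400 = 20400 us, and a full stroke from cylinder 0 to 202 to 135600 us.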

There needs to be a mechanism to load or unload the virtual drive. When loaded, FileReady as well as ReadytoSeekReadWrite are turned on. Unloading waits until all in-flight actions such as seek or write are idle, then resets those two signals.

It is up to the user to load RAM contents from the PC, or fetch them back to the PC, while the drive is unloaded. The emulator will serve up whatever is in RAM as data or replace selected words during writes.

Tuesday, November 15, 2016

On vacation in Kaua'i but doing design work from the cabana

My posts will be short and erratic this week as I am on holiday. I left the Diablo drive, the dog, the 1130 and the housesitter to spend some time at the Grand Hyatt Kaua'i lurking in my cabana. 

View from my cabana


ALTO DISK TOOL

I will do the design work for the emulator role, where the tool will attach to an Alto II computer and act as a disk drive, while on vacation. I worked out the broad strokes for all the functionality, but have not begun generating VHDL yet. 

Saturday, November 12, 2016

Worked on Alto II and prepared for vacation

XEROX ALTO II RESTORATION

The team got together today to work on various open tasks and to experiment with some of the applications on the system. 

We cleaned one of the mechanical mice loaned to us by Xerox PARC and it works quite well. We cleaned a second similar mouse but a bearing near the top of the ball cavity failed making it perform erratically. Several of the balls in the bearing fell out - each barely visible to the naked eye - and it does not look like it can be repaired. Finally, we still need to acquire and connect a DE19 plug to the optical mouse that came with the system. 

We serviced the disk drive, replacing the air filter and adding a touch of oil to the wipers on the arm bearings. The drive is working quite well.

We undertook some scoping and capture of the ethernet connection, to provide final specs for Ken's ethernet adapter project. We weren't seeing what we expected, so the team hooked up the logic analyzer to probe the adapter more fully. 

ALTO DISK TOOL

I am away on holiday for the next week, having to put aside the work while I pack tonight.

Wednesday, November 9, 2016

Preparing to study the circuit failure in the Diablo drive

Spent part of the day at Computer History Museum working on an odd problem with card reader errors that occur only when executing op code 3 - a combined instruction that writes a line on the 1403 printer and then reads a card. 

ALTO DISK TOOL

I am going to stick the Diablo data separator circuitry onto the logic analyzer to understand more deeply what is occurring in those times where the circuits fail to properly handle timeshifted transitions and report a clock pulse as a data bit, for example.

Points to capture and feed to the logic analyzer

To do this, I can rely on three test points but still need to attach to three pins on ICs deep on the board. The board sits in a group of PCBs such that the surface of this board is fractions of an inch away from the back of the adjacent board.

There is no room for traditional grabbers without using a board extender - which I don't have. I tacked solder wires on the outsides of the selected chip pins so that the wires extended out of the congested area.

The lines were hooked up to the logic analyzer and ready to begin testing tomorrow morning. 

Discovered cause of sectors with permanent checksum validation errors, working on possible solutions

ALTO DISK TOOL

The bit error rate is currently around one wrong bit in every 150,500. Since there are about 4,320 bits in the data words, checksum and sync, this comes out at the 2.87% error rate of checksum validation failures.

In order for this error rate to produce a sector which passes the checksum but has corrupted contents, we need an even number of bit flips in each of the 16 positions in a word. By position, I mean that the high order bit of a word is one position, the least significant bit is another position, etc.

If we have two bit flips, but they are not aligned in the same position, the checksum will catch it. Only when an even number of flips occur in a position will that error slip through. The checksum is a simple XOR of all the words; XOR by definition is independent in its action in each position.
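A tiny Python demonstration makes the point concrete. The word values and bit positions are arbitrary examples; only the XOR checksum and the 0x0151 seed come from these notes.

```python
def xor_checksum(words, seed=0x0151):
    """XOR every word into the seed, as the record checksum does."""
    total = seed
    for w in words:
        total ^= w & 0xFFFF
    return total


def flip(words, index, position):
    """Return a copy of words with one bit flipped in words[index]."""
    damaged = list(words)
    damaged[index] ^= 1 << position
    return damaged


good = [0x1234, 0xABCD, 0x0F0F]          # arbitrary example record
one_flip = flip(good, 0, 3)               # single error: caught
two_same = flip(flip(good, 0, 3), 2, 3)   # two errors, same position: missed
two_diff = flip(flip(good, 0, 3), 2, 7)   # two errors, different positions: caught
```

Only `two_same` yields the same checksum as `good`, because the two flips in position 3 cancel each other in the XOR.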

To have a read which appears valid but is not, we need two bit flips in the same position to occur in the same record. Crudely, this is 2.87% x 2.87% for two bits and then at least 16 times rarer to have the second bit occur in the same position as the first. That is about 0.005%, or 1 in 19,424 sectors. With 4,872 sectors, the chance of an entire cartridge having one sector with a false positive read is about 25%.
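The crude estimate can be reproduced in Python. The function name and return layout are just for illustration. The 25% figure corresponds to the expected number of false-pass sectors per cartridge (4,872 times the per-sector chance); if the sectors are treated as independent, the chance of at least one such sector is somewhat lower, which the last value shows.

```python
def false_pass_estimate(p_sector_fail=0.0287, sectors=4872, positions=16):
    """Reproduce the back-of-envelope false-positive arithmetic."""
    # Two flips in one record, with the second landing in the same bit position.
    p_false_pass = p_sector_fail * p_sector_fail / positions
    one_in = 1 / p_false_pass
    # Expected number of false-pass sectors on a whole cartridge.
    expected_per_cartridge = sectors * p_false_pass
    # Probability of at least one, assuming independent sectors (an assumption).
    p_at_least_one = 1 - (1 - p_false_pass) ** sectors
    return p_false_pass, one_in, expected_per_cartridge, p_at_least_one


p, one_in, expected, at_least_one = false_pass_estimate()
print(f"{p:.5%} per sector, 1 in {one_in:,.0f}")
print(f"expected per cartridge {expected:.3f}, P(at least one) {at_least_one:.3f}")
```

Either way the conclusion is the same: a single pass over the cartridge carries a real risk of one silently corrupted sector.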

All this depends on what causes the problems. If there are conditions that always or mostly lead to the error, then I might be able to work around it by mitigating the underlying cause, otherwise it is purely random.

In the random case, running the ReadEntireCartridge several times and using autoretry should produce files that can be merged with a majority vote to clean up any sporadic corrupted sectors that sneak through the checksum test.
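One way such a merge could work is a word-by-word vote across passes, sketched below in Python. The function name and the list-of-words representation are assumptions; the real cartridge files may be organized differently.

```python
from collections import Counter


def majority_merge(passes):
    """Merge several reads of the same sector by per-word majority vote.

    passes: list of reads, each a list of 16-bit words of equal length.
    Use an odd number of passes so a two-way tie cannot occur.
    """
    merged = []
    for column in zip(*passes):  # the same word position from every pass
        word, _votes = Counter(column).most_common(1)[0]
        merged.append(word)
    return merged
```

With three passes where each has a different single corrupted word, `majority_merge` recovers the clean sector, since every bad word is outvoted two to one.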

In the situation where this is conditionally induced, the right combination of conditions in one sector may make it sneak through even with multiple cartridge read passes. That is, the conditions occur in just the right (wrong) points in a given sector so that it will almost always have a pair of flipped bits whenever it is read.

I set up the scope and went after the sectors which incurred checksum failures, to look for conditions that cause the errors. It would be easiest when the conditions are in the shorter records - header or label - but essentially all of them are in the data record which is the only one long enough to have a real risk of a bit flip if the root cause is randomly occurring.

I looked at Cylinder 1, Head 0, Sector 2 which had a hard checksum failure on the data record. I found a spot in the data record where a clock pulse was clearly missing coming from the drive! Very consistent and of course it would shift all the bits over by 1 after that point.

Missing clock pulse leading to checksum validation error

Looking at the waveforms inside the drive showed that the flux reversals were a bit smaller at that spot, which could be symptomatic of a bad spot on the surface, but that is far from conclusive. Below is the output of the differential amplifier, where we see the reversal is stunted in both positive-going and negative-going limits compared to all the others.

smaller clock flux reversal at time when clock pulse is dropped

When we look at the signal after the clipping amplifier, which should even out such variations, we see the waveform below. All I notice is that the transition is a bit squeezed compared to the others, closer to where it might look like a data pulse transition instead of a clock transition.

Clipping amp output showing slight 'squishing' of transition

The suspicion is that the data separator logic is malfunctioning when presented with this waveform, neither seeing it as a data bit value of 1 nor as a clock pulse. It makes sense that timing shifts like this would be able to block the pulse, if we look at the relevant part of the J10 card schematic.

Circuits at left pick time when pulse is Data versus Clock
Once again, jitter on the disk signal is leading to problems. In this case, the next clock pulse arrives too early while the data separator is still looking for a data value transition, thus blocking the pulse from reaching the ReadClock line.

I need to validate this by looking at other test points on the card. Specifically, I want to confirm that the pulse itself is generated at TP6, after the clip amplifier, since we need a pulse to pass through. Interestingly, there is another electrolytic capacitor at the output of the pulse driver transistor. I decided to replace it with a new known good capacitor.

I ran the same read of Cyl 1, Head 0, Sector 2 with the changed capacitor and watched on the scope. The same thing happened - lost clock pulse and spurious ReadData bit. I will defer the look at TP6 because it appears the pulse is getting emitted, it is just the timing that is far enough off to fool the data separator.

If the pulse at TP6 exists, then the next point to watch is TP5 to see whether the Data Gate is on, blocking the pulse, or not. Finally, I want to look at the clock line at TP3. TP3 and TP5 are easily reached, but I have to find access to TP6 which is awkwardly buried in the card making it hard to reach without the card and cable extender (that I don't have).

We can see that the TP 5 Data Gate signal gets scrunched up to left of center, then loses the clock
Lack of data gate pulse leads to failure to pass pulse as Clock here at TP3
TP4 shows the pulse slipping through as Data due to separator error

Once again, jitter in the signal recorded on the platter has led to failure to correctly read the sector. In this case, the jitter has caused the data separator to fail to properly allot the transition as a clock rather than data pulse.

I can see how to spot and compensate for this particular condition - having a watchdog timer to spot a missing clock pulse - but it is unclear how I would handle jitter surrounding bitcells each with a 1 data bit, as it would simply shift the next data bit to come out as a clock pulse.
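The watchdog idea can be modelled offline in Python on a list of ReadClock pulse times. Everything here is an assumption for illustration: the function name, the nominal 600 ns cell length on the read side, and the 50% tolerance before a gap counts as a dropped clock.

```python
def find_missing_clocks(clock_times_ns, period_ns=600, tolerance=0.5):
    """Flag gaps between clock pulses that are too long for one bit cell.

    Returns (time of last good clock, estimated clocks missing) per gap.
    period_ns and tolerance are assumed values, not measured ones.
    """
    missing = []
    for prev, cur in zip(clock_times_ns, clock_times_ns[1:]):
        gap = cur - prev
        if gap > period_ns * (1 + tolerance):  # watchdog would have expired
            missing.append((prev, round(gap / period_ns) - 1))
    return missing
```

For clocks at 0, 600, 1200, 2400 and 3000 ns, the sketch reports one missing clock after the pulse at 1200 ns. It does nothing for the harder case of a data pulse that has shifted far enough to be taken for a clock.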

If only the drive emitted the unseparated pulse train and let me assign them as clock or data signals, I could address the shift more easily in my logic. However, that is not the interface offered by a Diablo drive.

I need to do a lot of thinking. How the Diablo mechanism might be adjusted to handle the timeshifting better. How I might be able to recognize and compensate for the Diablo issues. What scope and logic analyzer traces I might add to the card to understand what happens with the timeshifting situations. 

Monday, November 7, 2016

New logic for handling incoming bit stream working, dropped sector error rate by half

ALTO DISK TOOL

My initial testing showed much of the new logic seemed okay, but the critical part that recognizes whether a bit cell has ended and if it had 1 or 0 inside, does not appear to be working properly.

Fortunately, I instrumented quite a bit of the new logic so this means some painstaking stepping through the logic analyzer traces while examining my state machines. The goal is to find behavior that doesn't reflect what I expect to see, then look at the logic to find the flaw.

The logic for the fuzzy recognition of ReadData and ReadClock worked - three in a row got these machines to one of the two end states, while a sequence of erratic readings are ignored. The logic to recognize the sync word worked properly.
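The three-in-a-row rule can be modelled in Python as below. The class name and the per-sample interface are invented for the example; the actual logic is a hardware state machine sampling the signal on a clock.

```python
class FuzzyInput:
    """Changes its output only after `needed` identical samples in a row."""

    def __init__(self, needed=3):
        self.needed = needed
        self.state = 0       # filtered output, starts at 0
        self.run_value = 0   # value of the current run of samples
        self.run_length = 0

    def sample(self, raw):
        if raw == self.run_value:
            self.run_length += 1
        else:
            self.run_value = raw
            self.run_length = 1
        if self.run_length >= self.needed:
            self.state = raw  # a stable run reached an end state
        return self.state
```

Isolated glitches such as a lone 1 among zeros never reach the output, which only flips once three matching samples arrive back to back.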

The fourth machine is the critical one that will delineate bit cells, remember whether ReadData reached the 1 state during the cell, output the value of the bit cell as 1 or 0, and trigger the deserializer to put that bit value into the shift register.

The fourth machine is not working correctly. It is stuck forever with a bit value of 1 once it finds any bitcell with a 1 inside. I added correct reset logic to set the bitvalue back to 0 at the end of each cell, and then watch for a ReadData of 1 to flip bitvalue back on. Back to testing.

I now find the logic is not starting properly at sync, loading an incorrect initial bit of 0 when we are just at the end of the sync word bitcell and not at the end of the first real word bitcell.

The solution is to extend the synchronization state machine - which had turned on the synced state after seeing a ReadData value of 1 and then the ReadClock pulse - but now will wait until ReadClock goes back to off before flipping on sync. Actually, I am using the fuzzy machines that handle ReadData and ReadClock.
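Here is a Python sketch of the bit cell machine with both fixes applied: the bit value is reset at the end of every cell, and sync is only declared once the clock pulse that ends the sync cell has gone back to off. The state names, the sample format, and the `cell` helper are assumptions made for the example.

```python
def decode_cells(samples):
    """samples: sequence of (clock, data) levels, already filtered."""
    bits = []
    state = "HUNT"  # HUNT -> SAW_SYNC_DATA -> SYNC_CLOCK_HIGH -> SYNCED
    bit_value = 0
    prev_clock = 0
    for clock, data in samples:
        rising = clock == 1 and prev_clock == 0
        falling = clock == 0 and prev_clock == 1
        if state == "HUNT":
            if data == 1:
                state = "SAW_SYNC_DATA"      # the 1 bit of the sync word
        elif state == "SAW_SYNC_DATA":
            if rising:
                state = "SYNC_CLOCK_HIGH"    # clock pulse ending the sync cell
        elif state == "SYNC_CLOCK_HIGH":
            if falling:
                state = "SYNCED"             # wait for clock off before syncing
                bit_value = 0
        else:
            if data == 1:
                bit_value = 1                # remember a 1 seen during the cell
            if rising:
                bits.append(bit_value)       # clock ends the cell: emit the bit
                bit_value = 0                # reset so a 1 does not stick
        prev_clock = clock
    return bits


def cell(d):
    """Example helper: four samples of one bit cell holding data value d."""
    return [(0, d), (0, 0), (1, 0), (0, 0)]
```

Given two zero cells, a sync cell, and then cells holding 1, 0, 1, the decoder emits `[1, 0, 1]`: no spurious leading 0 from the sync cell, and no bit stuck at 1.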

Testing this refinement produced a big improvement in reading accuracy. I only encountered checksum errors on 140 sectors, a rate of 2.87%. I slashed the rate of errors in half with my redesigned state machines. Some of the sectors that gave me pretty consistent errors are now reading clean every time.

Next, I will investigate any sectors that still get errors, to see if I can find anything on them that could explain the error and possibly let me further reduce the error rates.

As an experiment, I ran the ReadEntireCartridge transaction with autoretry enabled, to see what rate of errors occurs in that case. The rate dropped to 1.6%, or 78 sectors with checksum errors out of 4,872.

Tomorrow I will go after the sectors with the errors, with and without retry, trying to dig into what is occurring in those cases. I will keep digging until I can find no more ways to improve the logic or compensate for errors or mitigate errors.