Tuesday, November 8, 2016

Discovered cause of sectors with permanent checksum validation errors, working on possible solutions


The bit error rate is currently around one wrong bit in every 150,500. Since there are about 4,320 bits in the data words, checksum and sync, this comes out at the 2.87% error rate of checksum validation failures.

In order for this error rate to produce a sector which passes the checksum but has corrupted contents, the number of bit flips in each of the 16 bit positions in a word must be even (zero counts as even, but at least one position must have flips). By position, I mean that the high order bit of a word is one position, the least significant bit is another position, etc.

If we have two bit flips, but they are not aligned in the same position, the checksum will catch it. Only when an even number of flips occur in a position will that error slip through. The checksum is a simple XOR of all the words; XOR by definition is independent in its action in each position.
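To make the position argument concrete, here is a minimal sketch that assumes the checksum is nothing more than the XOR of all 16-bit words, as described above. The sector contents and the helper names `xor_checksum` and `flip_bit` are made up for illustration and are not taken from the actual tool.

```python
from functools import reduce
from operator import xor


def xor_checksum(words):
    """XOR all 16-bit words together; each bit position is folded independently."""
    return reduce(xor, words, 0) & 0xFFFF


def flip_bit(words, index, position):
    """Return a copy of words with one bit flipped in the word at index."""
    damaged = list(words)
    damaged[index] ^= 1 << position
    return damaged


# Hypothetical sector contents, chosen only for illustration.
sector = [0x1234, 0xABCD, 0x0F0F, 0x8001]
good = xor_checksum(sector)

one_flip = flip_bit(sector, 0, 15)              # one flip in the high order position
same_position = flip_bit(one_flip, 2, 15)       # second flip, same position
different_position = flip_bit(one_flip, 2, 3)   # second flip, different position

print(xor_checksum(one_flip) == good)            # False: a single flip is caught
print(xor_checksum(same_position) == good)       # True: the pair slips through
print(xor_checksum(different_position) == good)  # False: misaligned pair is caught
```

The second case is the dangerous one: the data differs from the original but the checksum is unchanged.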

To have a read which appears valid but is not, we need two bit flips in the same position to occur in the same record. Crudely, this is 2.87% x 2.87% for two flips and then at least 16 x rarer to have the second flip occur in the same position as the first. That is about 0.005%, or 1 in 19,424 sectors. With 4,872 sectors, the chance of an entire cartridge having one sector with a false positive read is roughly 25%.
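The crude estimate can be reproduced with a few lines of arithmetic. This is only a sketch of the same back-of-envelope model (two independent flips, 1 in 16 chance of landing in the same position); the variable names are mine, and because it uses unrounded inputs the results land close to, but not exactly on, the rounded figures quoted above.

```python
bits_per_sector = 4320         # data words, checksum and sync
bits_per_error = 150_500       # one wrong bit in this many bits read
sectors_per_cartridge = 4872
positions = 16                 # bit positions in a word

# Chance that a single sector read fails the checksum.
p_fail = bits_per_sector / bits_per_error

# Crude chance of two flips that also line up in the same position.
p_false_pass = p_fail * p_fail / positions

# Expected number of silently corrupted sectors in one full cartridge read.
expected = sectors_per_cartridge * p_false_pass

# Probability that at least one sector in the cartridge is silently corrupted.
p_at_least_one = 1 - (1 - p_false_pass) ** sectors_per_cartridge

print(f"checksum failure rate: {p_fail:.2%}")
print(f"false pass: 1 in {1 / p_false_pass:,.0f} sectors")
print(f"expected bad sectors per cartridge: {expected:.2f}")
print(f"chance of at least one: {p_at_least_one:.1%}")
```

The expected count per cartridge comes out near 0.25, which is where the 25% figure comes from; the probability of at least one such sector is slightly lower, around 22%, which does not change the conclusion.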

All this depends on what causes the problems. If there are conditions that always or mostly lead to the error, then I might be able to work around it by mitigating the underlying cause; otherwise it is purely random.

In the random case, running the ReadEntireCartridge several times and using autoretry should produce files that can be merged with a majority vote to clean up any sporadic corrupted sectors that sneak through the checksum test.
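A majority vote merge could look something like the sketch below. It assumes each cartridge pass has already been loaded into memory as a list of sectors, each sector a tuple of 16-bit words; the file format written by ReadEntireCartridge is not modeled here, and `majority_merge` is a hypothetical helper, not part of the existing tool.

```python
from collections import Counter


def majority_merge(passes):
    """Merge several reads of the same cartridge, sector by sector.

    passes: list of reads; each read is a list of sectors, and each sector
    is a tuple of 16-bit words. Returns (merged, disputed), where disputed
    lists the sector indexes that had no strict majority.
    """
    merged, disputed = [], []
    for index, versions in enumerate(zip(*passes)):
        winner, votes = Counter(versions).most_common(1)[0]
        merged.append(winner)
        if votes * 2 <= len(versions):
            disputed.append(index)  # tie or plurality only; needs another pass
    return merged, disputed


# Three hypothetical passes over a two-sector "cartridge".
read_a = [(0x1234, 0xABCD), (0x0F0F, 0x8001)]
read_b = [(0x1234, 0xABCD), (0x8F0F, 0x0001)]  # sector 1: aligned pair of flips
read_c = [(0x1234, 0xABCD), (0x0F0F, 0x8001)]

merged, disputed = majority_merge([read_a, read_b, read_c])
print(merged == read_a, disputed)  # True []
```

Using an odd number of passes avoids most ties, and any sector reported in `disputed` is a candidate for rereading rather than trusting the vote.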

In the situation where this is conditionally induced, the right combination of conditions in one sector may make it sneak through even with multiple cartridge read passes. That is, the conditions occur in just the right (wrong) points in a given sector so that it will almost always have a pair of flipped bits whenever it is read.

I set up the scope and went after the sectors which incurred checksum failures, to look for conditions that cause the errors. It would be easiest when the conditions are in the shorter records - header or label - but essentially all of them are in the data record, which is the only one long enough to have a real risk of a bit flip if the root cause is randomly occurring.

I looked at Cylinder 1, Head 0, Sector 2 which had a hard checksum failure on the data record. I found a spot in the data record where a clock pulse was clearly missing coming from the drive! Very consistent and of course it would shift all the bits over by 1 after that point.

Missing clock pulse leading to checksum validation error

Looking at the waveforms inside the drive showed that the flux reversals were a bit smaller at that spot, which could be symptomatic of a bad spot on the surface, but that is far from conclusive. Below is the output of the differential amplifier, where we see the reversal is stunted in both positive-going and negative-going limits compared to all the others.

smaller clock flux reversal at time when clock pulse is dropped

When we look at the signal after the clipping amplifier, which should even out such variations, we see the waveform below. All I notice is that the transition is a bit squeezed compared to the others, closer to where it might look like a data pulse transition instead of a clock transition.

Clipping amp output showing slight 'squishing' of transition

The suspicion is that the data separator logic is malfunctioning when presented with this waveform, neither seeing it as a data bit value of 1 nor as a clock pulse. It makes sense that timing shifts like this would be able to block the pulse, if we look at the relevant part of the J10 card schematic.

Circuits at left pick time when pulse is Data versus Clock
Once again, jitter on the disk signal is leading to problems. In this case, the next clock pulse arrives too early while the data separator is still looking for a data value transition, thus blocking the pulse from reaching the ReadClock line.

I need to validate this by looking at other test points on the card. Specifically, I want to confirm that the pulse itself is generated at TP6, after the clipping amplifier, since we need a pulse to pass through. Interestingly, there is another electrolytic capacitor at the output of the pulse driver transistor. I decided to replace it with a new known good capacitor.

I ran the same read of Cyl 1, Head 0, Sector 2 with the changed capacitor and watched on the scope. The same thing happened - lost clock pulse and spurious ReadData bit. I will defer the look at TP6 because it appears the pulse is getting emitted, it is just the timing that is far enough off to fool the data separator.

If the pulse at TP6 exists, then the next point to watch is TP5 to see whether the Data Gate is on, blocking the pulse, or not. Finally, I want to look at the clock line at TP3. TP3 and TP5 are easily reached, but I have to find access to TP6, which is awkwardly buried in the card, making it hard to reach without the card and cable extender (that I don't have).

We can see that the TP 5 Data Gate signal gets scrunched up to left of center, then loses the clock
Lack of data gate pulse leads to failure to pass pulse as Clock here at TP3
TP4 shows the pulse slipping through as Data due to separator error

Once again, jitter in the signal recorded on the platter has led to failure to correctly read the sector. In this case, the jitter has caused the data separator to fail to classify the transition as a clock pulse rather than a data pulse.

I can see how to spot and compensate for this particular condition - having a watchdog timer to spot a missing clock pulse - but it is unclear how I would handle jitter surrounding bitcells that each hold a 1 data bit, as it would simply shift the next data bit to come out as a clock pulse.
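As a rough model of the watchdog idea, the sketch below scans a list of clock pulse arrival times and flags any gap that runs well past one bit cell. It is a software illustration only: the real detection would live in the interface logic, the 600 ns cell time is a placeholder rather than a measured Diablo figure, and `find_missing_clocks` and its 50% tolerance are assumptions of mine. It does nothing for the harder case of jittered data bits.

```python
def find_missing_clocks(clock_times_ns, cell_ns, tolerance=0.5):
    """Return (index, missing_count) for each gap where clock pulses were dropped.

    index is the position in clock_times_ns of the pulse that ended the
    overlong gap; missing_count estimates how many clocks were lost in it.
    """
    limit = cell_ns * (1 + tolerance)  # watchdog timeout
    missing = []
    for i in range(1, len(clock_times_ns)):
        gap = clock_times_ns[i] - clock_times_ns[i - 1]
        if gap > limit:
            missing.append((i, round(gap / cell_ns) - 1))
    return missing


CELL_NS = 600  # hypothetical nominal bit cell time, not a measured value

# Jittery clock arrivals; the pulse expected near 2400 ns never shows up.
clocks = [0, 600, 1210, 1795, 3005, 3600]
print(find_missing_clocks(clocks, CELL_NS))  # [(4, 1)]
```

Once a dropped clock is located, the bits that follow could be shifted back by one cell before the checksum is recomputed.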

If only the drive emitted the unseparated pulse train and let me assign them as clock or data signals, I could address the shift more easily in my logic. However, that is not the interface offered by a Diablo drive.

I need to do a lot of thinking: how the Diablo mechanism might be adjusted to handle the timeshifting better, how I might be able to recognize and compensate for the Diablo issues, and what scope and logic analyzer traces I might add to the card to understand what happens in the timeshifting situations.
