Thursday, October 20, 2016

Added automatic reread for sectors with checksum errors, discovered need for Diablo mod, working on new board


My first test of the morning showed that the USB link was no longer working, which I think I had traced to an unconnected signal from my new register logic. Half an hour later, I tested that hypothesis and the corrected code.

I retrieved the checksum status fields perfectly and began to map out the bad sectors. I noticed that head 1 (lower surface) began to get quite a few sector errors starting around cylinder 128 and continuing out to nearly 150, with most of them clustered in the 130s. This would be consistent with a surface defect on the disk.

Of course, cyl 0 head 0 was erased by my procedural error of the day before, but I should be able to rewrite those sectors with my disk tool. I will go over my WriteSector logic, enable it and give it a test on one of the sectors of that first track.

I made the decision to add an auto-recovery option to the ReadEntireCartridge function, one that will iterate up to 32 times if a given sector has checksum errors. If it gets a clean read, it moves on, otherwise it tries up to the maximum. This will be controlled by one of the slide switches on the board.
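As a rough illustration of the retry behavior described above, here is a minimal Python sketch. It is not the actual logic on the board; `read_with_retry`, the `read_sector` callback and the `make_flaky_reader` helper are hypothetical names invented for this example. The only details taken from the post are the cap of 32 attempts per sector and the on/off switch.

```python
MAX_ATTEMPTS = 32  # one initial read plus up to 31 retries, per the post


def read_with_retry(read_sector, cyl, head, sector,
                    max_attempts=MAX_ATTEMPTS, auto_recover=True):
    """Read one sector, rereading on checksum errors.

    read_sector is a hypothetical callback returning (data, checksum_ok).
    auto_recover stands in for the slide switch: when False, only a
    single read is attempted.
    Returns (data, checksum_ok, attempts_used).
    """
    allowed = max_attempts if auto_recover else 1
    data = b""
    for attempt in range(1, allowed + 1):
        data, ok = read_sector(cyl, head, sector)
        if ok:
            # Clean read: stop retrying and move on to the next sector.
            return data, True, attempt
    # Every attempt failed: report the last data read, flagged as bad.
    return data, False, allowed


def make_flaky_reader(good_on_attempt):
    """Fake reader for demonstration: fails until the Nth call (never if None)."""
    calls = {"n": 0}

    def reader(cyl, head, sector):
        calls["n"] += 1
        ok = good_on_attempt is not None and calls["n"] >= good_on_attempt
        return b"\x00" * 4, ok

    return reader
```

With a fake reader that succeeds on its third call, the function stops after three attempts; with one that never succeeds, it gives up after 32.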

Initially this was not included, on the recommendation of one of the team, based on his experience where lingering over a bad spot resulted in the failure increasing in seriousness. After some analysis of the practical effect with and without the auto-recovery, I decided to go ahead.

The disk head flies over the surface of the platter, thus lingering will only be a problem if the surface is raised enough to impinge on the head or disturb its air cushion. Thus, for any defect that does not involve a substantially raised surface (relatively speaking, since we are talking about a flying height of 7 thousandths of an inch), lingering adds no risk.

Next, the reality is that the head is flying somewhere on the disk from the moment the heads are loaded at startup until it is switched off. In practice, the head flies above the last cylinder that the arm did a seek to, until the next time a seek is performed. It does this whether or not we read anything.

The cylinders are roughly 1/100 of an inch apart, so that the low spot on the head where contact might occur is wide enough to span over a dozen or so cylinders. Thus, lingering anywhere in that 12-ish cylinder zone around the high spot is a risk.

Whether one reads the sector 1000 times in a row or not, the arm is sitting at some cylinder and the disk is spinning under the flying head. From a risk standpoint, the only way we lower the risk is by moving the arm away from the defect. The arm must be more than 12 cylinders away to be safe, thus the risk zone is 24-ish cylinders wide, or about 12% of the 203 cylinders on the disk.

My ReadEntireCartridge function takes only a few seconds to complete. We need no more than four rotations, 1/10 second, to read all 24 sectors, plus the time to seek to the next cylinder. If it were to reread 5% of all the sectors, and those were to require 20 retries each, it would only double the linger time.
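The doubling estimate can be checked with a few lines of Python. This sketch assumes, as a simplification, that linger time is proportional to the number of sector reads performed; the function name is made up for the example, and the 5% and 20-retry figures are the ones quoted above.

```python
def linger_ratio(reread_fraction, retries_per_bad_sector):
    """Ratio of total sector reads with retries to reads in a single pass.

    Assumes linger time scales with the number of sector reads, which
    ignores seek time and rotational latency.
    """
    extra_reads_per_sector = reread_fraction * retries_per_bad_sector
    return 1.0 + extra_reads_per_sector


# 5% of sectors needing 20 retries each adds as many reads as the pass itself.
ratio = linger_ratio(0.05, 20)
print(f"linger time multiplier: {ratio:.2f}")
```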

Finally, consider that manual rereading involves seeking to the cylinder, where the head will fly for much longer than the time needed to auto-recover and move on. The most serious risks are when the raised flaw is within the first 12 or last 12 cylinders, as the arm hovers at 0 when loaded and hovers at 202 when my function is completed.

No amount of caution will protect against damage if the flaw is in those two critical areas. Practically speaking, a user attempting to recover sectors missed by a one-pass, non-recovering function would park the head in the remaining 179 cylinders far longer than would the auto-recovering version.

I completed the change to the logic, which was more minor than I expected, and set out to test it in the late afternoon. The logic ran through all the sectors, pausing for a noticeable time to handle the retries, and ultimately completed. The checksum status vector was dumped and shows 30 sectors that weren't recovered even with 31 retries apiece, but that is a failure rate of 0.615%, an order of magnitude better than the one-pass method.
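For reference, the failure rate follows from the cartridge geometry mentioned earlier: cylinders 0 through 202 with 24 sectors per cylinder. A quick Python check, with variable names chosen just for this example:

```python
CYLINDERS = 203             # cylinders 0 through 202
SECTORS_PER_CYLINDER = 24   # both heads combined, as stated in the post

total_sectors = CYLINDERS * SECTORS_PER_CYLINDER
unrecovered = 30            # sectors still bad after all retries

failure_rate_percent = 100.0 * unrecovered / total_sectors
print(f"{unrecovered} of {total_sectors} sectors unrecovered "
      f"({failure_rate_percent:.3f}%)")
```

This comes out at roughly 0.6%, in line with the figure quoted above.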

Al raised the point of the timing board modification made to Diablo 31 drives for use with the Alto - two resistors, F28 and H53, on board J10 are replaced with values that accommodate the slightly faster bit rate of the Alto compared to the Diablo spec.

I looked at the board and the resistors don't look reworked; they seem to be the original versions. Every drive has hand-selected resistors for these two components, making it impossible to check against a schematic, but my guess is this drive is not from an Alto.

Al confirmed that this is a stock Diablo drive, which could explain the errors I am getting on reading. I could modify this drive or just accept the error rates for now and see how well things read when on the drive attached to the Alto.

I have ordered a pair of 200K potentiometers, which are required to tune the Diablo board to the window duration needed for the Alto - 440 to 460ns - rather than the factory default 450 to 470ns. They will arrive tomorrow, when I can determine the needed resistances, then pick up fixed resistors of those values and solder them onto the board.

I continued wiring up the new disk driver board, with all the ground wires now added and four of the last 15 signal wires from the Diablo cable wired in by late afternoon. After dinner, I had another four of the Diablo signal wires hooked up, with just seven to go.

By bedtime, all the +5V lines were completed and all that was left were the 3.3V lines between the fpga connector and the level shifter components.

I continue to have issues with the really hard, stiff ribbon cable connected to the Diablo connector. It is way too tough for press-on IDC connectors. I am still thinking about the best way to deal with this. I have a female Diablo terminator that could be wired to a regular ribbon cable, so I have ordered the parts to try this.
