Sunday, January 17, 2016

Performance analysis of virtual disk drive and improvement plans

SAC INTERFACE FOR ADDING PERIPHERALS TO THE 1130

On the testing front, I tried the new DCIP core load and my virtual 2310 logic to dump some sectors and test out initializing a virtual blank cartridge. Once the card reader is debugged I will boot a virtual DMS boot card from the reader and then continue the IPL on the virtual 2310 disk drive.

First test, dumping a sector, worked perfectly now that I have a core load of the DCIP utility that did not falsely discover a 1403 printer, as had occurred previously on the IBM 1130 simulator until I fixed a bug in its printer support.

I then attempted to initialize a virtual disk cartridge using the utility. I had not previously fully debugged the write functionality and bumped into a Python code bug. With that repaired, I tested again. I ran an initialize, which is faster than an analyze. The init code writes and then reads a pattern three times for each of the 1600 sectors, which is 9.600 IO operations, whereas the analyze will do 16 for each sector or 25,600. It will only take 3/8 the time to do an initialize as to do an analyze of a virtual cartridge. I still expect this to be very slow, probably most of a day because of the large negative performance ratio of the virtual disk to a real 2310 drive.

I ran the initialize for ten minutes and then copied the virtual disk over to check how far it had gotten on the simulator. I discovered that I had forgotten to 'seek' the virtual disk cartridge file on the PC thus was always writing over the first sector. Corrections made, I got back to the testing.

I discovered there was a silent exception taking place during my attempt to seek or write the new sector back into the file. I did a bit of research on the file IO of Python and intended to stick in some instrumentation to narrow down the issue. However, it was horrifyingly clear right away that I hadn't opened the file for read-write (rb+), only for reading. With that fixed, the program began initializing.

The utility initializes disks by writing each sector of a cylinder with three fixed patterns. It first writes 0xAAAA to all eight sectors, then reads those sectors back. It writes 0x5555 to all eight sectors and then reads back the eight. Finally, it writes 0x0000 to the eight and reads them back. That gives us 48 I/O operations per cylinder.

I timed a sequence of writing eight sectors, somewhat slower because the python program wrote a diagnostic message for each write to my debugging console. It took just under 7 seconds to complete the sequence. That gives me a sense of the speed ratio, which is not good. My virtual disk is more than 25 times slower than a real 2310.

On a real disk, if each write required a complete rotation (we missed the next sector and had to rotate once) it would take 40ms per rotation or 1/3 second for the batch of 8 to complete. If we had only two misses, getting all four on each side of the platter in one burst but having to miss once to start and then once for the head switch delay, it would have taken less than half the time. I will split the difference and say 1/4 second as an outside value.

Seeking one cylinder at a time takes about 60ms. Thus, a full pack initialize is 12 seconds of seeks and 300 seconds for writing and reading each cylinder three times. My seek will be just as fast as a real drive, but the read/write time will balloon to over two hours versus the 5 minutes this task should take on a real 2310.

With the same timing, an analyze, which is reading each sector 16 times, or 208 I/O per cylinder compared to the 48 per cylinder of the initialize, will run 9+ hours on the virtual drive but only about 20 minutes on a real device. A disk compare or disk copy would be a faster operation, requiring only two IOs per sector or 2+ minutes total time on a pair of real drives. With my current slowdown ratio, it would be about an hour long event, tolerable but not fun.

If I can slash the slowdown to perhaps 5X instead of 25+X, I could live with most of these times. I would never do an analyze of a pack, because that would still take on the order of 100 minutes to accomplish. Since there will never be an error on the virtual cartridge, it is a pointless operation to run anyhow. Initializing for 25 minutes is bearable, especially since I could init the pack on the IBM 1130 simulator on the PC side as easily.

The main operations where the utility really makes sense are for disk to disk copies and compares, where my existing requires a painful near-hour but a 5X speedup gets this to a tolerable 10 or so minutes. The bottom line, however, is that even at current speeds this is good enough to be usable, even if it would be better to improve the speed.

Time to figure out exactly where the slowdown comes from:

  • Python interpreter
  • Inefficient code in Python
  • locking and queue management overhead in my Python code
  • delays in USB IO and transit time
I am spending some time to find instrumentation to record these various factors. If there is one major cause that is reasonable to attack, it becomes the next. My first measurement was to take the core load for the DCIP utility, which involves more than 4000 transactions across the USB link, and measure the time it took to execute. I saw roughly 9.7 seconds from start until it finished, which works out to less than 2.5ms per transaction and that includes reading from a PC file and processing the text as well as the actual transactions over USB. A memory cycle on the 1130 is 3.6us so much faster.

A disk IO will involve under 330 transactions in total, cumulatively 0.8 second, while a real 2310 read or write takes .01 to .04 depending on rotational delay. This is root cause of the speed differential. My core load function runs at a mere 0.15% the rate of the real machine, with a disk IO requiring 321 core load transactions.

I have to determine what part of this is due to transactional limits, so I created a python thread that does nothing but issue transactions to the fpga, saving the time when we start the loop and the final time when we finish a given number. I can do this submitting these to the thread safe queue but also I can do this directly at startup to time the pure speed of the link out of Python.

My basic round trip transactions took about 1.6 ms each, it didn't matter if it was a very simple one that just fetches the last value from an XIO or if it involved a complete cycle steal memory cycle to read or write to core memory - the answer was identical. With a 480Mbps data link over USB, the delay is almost completely unrelated to the number of bytes transferred but instead reflects the pure overhead of a transaction.

I am planning out a change to the link where I will send 12 word transactions rather than 2 word ones. This will give me up to a ten fold increase in performance for core access and other data transfers over the link. It also allows me to consolidate all the device oriented information in one transaction, where previously I had separate calls to ask about XIO function, WCA, modifiers, etc.

There are quite a few FSMs in the fpga logic that are tightly synced to the transactional flow, thus I have to be extremely careful in the change to avoid breaking any of those aspects. This will take some time to get right but the performance advantage will be significant for the virtual 2310 and other high speed peripherals.

Meanwhile, I wanted to start some testing on the virtual 2501 card reader logic, since this was in decent shape when I last worked on it. I wanted to test out the 2501 reader, then the 1403 printer, while I am plotting out all the changes to make for the 12 word transaction approach.

Unfortunately, my Acer laptop that I use for the Python side of the link began exhibiting bizarre behavior, with the cursor flying all around the screen like it was possessed, even after a reboot. This suggests the hardware is taking a dive, which is unfortunate (and poorly timed).

2 comments:

  1. Carl, (Understanding that I only understand a small portion of what you have done...) Could you "mount" a drive on the FPGA .. Copy from the PC to the FPGA memory, then either retrieve it or, if you don't care about updates from the 1130, throw it away? Remove Python from the transactions?

    ReplyDelete
  2. A cartridge contains 523, 872 words (about 1MB) that could be transferred to the FPGA as a 'mount'. At the original performance of 2.5ms per word, that would take close to four hours to transfer.

    Second issue - where to store the 1MB. It would use up a lot of the silicon in the fpga, but could be put into separate RAM chips. The 'cost' of that is the need to devote 7 address and 16 data bits plus control lines - valuable IO pins from the fpga board.

    Another option would be using a second FPGA linked with my high speed serial link - either using RAM space on the fpga or the other options outlined above. I

    In any case, the performance penalty shipping 520K+ words over the USB link is unacceptable. If I fix the speed issue, such as with my increase of a transaction from 1 to 12 words at a time, I should achieve almost as big an increase in performance.

    That improvement would improve a virtual 'load' of the entire cartridge over the link to under 20 minutes. However, the improvement will also improve the performance of disk IO leaving the data on the PC. It should make virtual disk performance much closer to real disk speeds.

    At this point, I need the USB performance improvement for many functions not just disk, so I will concentrate on this rather than looking at migrating the disk contents down to the FPGA side.

    ReplyDelete