Monday, January 26, 2015

Halfway through adding IBM 1130 Simulator support to keypunch interface

Traveling today from the west coast to Raleigh-Durham where I will spend the next two days with a client, coming home again on Thursday. I was fortunate not to be caught up in the blizzard affecting the northeast US.


NEW KEYPUNCH INTERFACE DEVELOPMENT

I started banging out code to deal with the 1130 Simulator formats when I ran into the realities of Unicode. I had naively expected characters encoded as 16-bit values, so that an 80-column card record would be 160 bytes long. Instead, I had to learn about codecs, encodings, UTF-8, various encodings used on Windows and other operating systems, and all the esoterica of handling strings with Unicode characters that are not part of basic ASCII.

In order to handle the files used with the IBM 1130 Simulator on Windows, it is necessary to deal with Unicode files, not simple ASCII text files. I can't just substitute a regular ASCII character for the cent sign and logical not sign characters, which Brian Knittel represents with U+00A2 and U+00AC. While these could fit in a byte, allowing a potential encoding that is still one byte per character, that wouldn't work for any Unicode character above U+00FF.
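To make the byte counts concrete, here is a minimal Python 3 sketch. The helper name encoded_length is made up for this illustration and is not part of the keypunch interface code. It shows that the cent and not signs fit in one byte under Latin-1 but take two bytes each under UTF-8, and that a character above U+00FF has no single-byte Latin-1 form at all.

```python
# Illustration only: how many bytes a string occupies under a given encoding.
CENT_SIGN = "\u00a2"   # cent sign
NOT_SIGN = "\u00ac"    # logical not sign


def encoded_length(text, encoding):
    """Return the number of bytes `text` occupies when encoded (hypothetical helper)."""
    return len(text.encode(encoding))


def fits_in_latin1(text):
    """Return True if every character can be stored as one Latin-1 byte."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for ch in (CENT_SIGN, NOT_SIGN):
        print(hex(ord(ch)), encoded_length(ch, "latin-1"), encoded_length(ch, "utf-8"))
    print(fits_in_latin1("\u0100"))
```

So an 80-column card is not a fixed number of bytes once it has been written as UTF-8; the length depends on which characters are on the card.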

The binary format for the simulator is much simpler - it is just two bytes per card column in pure binary. I just have to read it in fixed-size records instead of with a readline() function, then translate it to the ASCII-oriented standard binary encoding in this system. It is, however, yet a third way to do I/O that I must select depending on the user's choice of encoding - codecs, ASCII lines or binary fixed records.
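The fixed-record read could look something like the following Python 3 sketch. The function name read_binary_cards is hypothetical, and the byte order of each two-byte column is an assumption (little-endian by default here); it should be checked against real simulator files before trusting the values.

```python
import struct

COLUMNS = 80
RECORD_BYTES = COLUMNS * 2  # two bytes per card column


def read_binary_cards(stream, byte_order="<"):
    """Yield each card as a list of 80 integers, one per column.

    `stream` must be a file opened in binary mode ("rb").
    `byte_order` is a struct prefix: "<" little-endian (assumed), ">" big-endian.
    """
    fmt = byte_order + str(COLUMNS) + "H"  # 80 unsigned 16-bit values
    while True:
        record = stream.read(RECORD_BYTES)
        if not record:
            break  # clean end of file
        if len(record) != RECORD_BYTES:
            raise ValueError("truncated card record: %d bytes" % len(record))
        yield list(struct.unpack(fmt, record))
```

Each yielded list can then be handed to whatever routine translates column values into the internal encoding; that translation is deliberately left out here.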

The usual read and write calls are fine for plain ASCII, but in Python they throw an exception for any value above basic ASCII (that is, above U+007F). I have no idea how Windows is encoding these characters in files, and it seems that one can configure Windows to use different encodings depending on the localization, i.e. language, country, region, etc. That means the only safe way to read 80 Unicode characters from a file is to use the codecs module and tell it to use the Windows-configured encoding.
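A minimal sketch of that idea in Python, assuming that locale.getpreferredencoding() reports the encoding Windows is configured to use. The function name read_card_lines is made up for this example. It uses io.open, which takes an encoding argument in the same way codecs.open does and behaves the same on Python 2.7 and 3.

```python
import io
import locale


def read_card_lines(path, encoding=None):
    """Yield each line of a text card file as a Unicode string.

    If no encoding is given, fall back to the system's preferred encoding
    (an assumption about what the file was written with).
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    with io.open(path, "r", encoding=encoding) as card_file:
        for line in card_file:
            # Drop the line terminator; the card image is what remains.
            yield line.rstrip("\r\n")
```

If the file was actually written in some other encoding, such as UTF-8, the caller can pass it explicitly, for example read_card_lines(path, encoding="utf-8").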

It is possible to get oneself into trouble easily when mixing ordinary (ASCII) strings and Unicode strings in Python, so I have to be careful to isolate all the Unicode handling away from the rest of the program. I only need to get things into Unicode just before I write them out, and on input I will immediately translate Unicode into my EBCDIC encoding in plain ASCII.

I am generating the code for this, then will give it a try with some files from the IBM1130.org simulator. I am now working through the code to make sure it works.
