[Done] Error: the BDEBUGGER displays Cyrillic CP 866 characters incorrectly

I tried to open a .TXT file containing some Russian text encoded in code page 866 (the output of a .BAT file I'm developing) in TCC's IDE. The IDE displays the text incorrectly and has no setting to correct this. Interestingly, .CMD files containing CP 866 characters are displayed correctly in the IDE.
 


rconn

Administrator
Staff member
May 14, 2008
11,894
133
There are only two ways to code a batch file with a specific language:

1. Encode the file as UTF-16
2. Encode the file as UTF-8 with a BOM

Without one of the above, your file will be treated as ASCII, and the characters will be displayed based on the active codepage.
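The decision rule above amounts to simple BOM sniffing, which can be sketched in a few lines. This is a minimal illustration of the stated rules, not TCC's actual detection code; the `classify()` helper name is mine:

```python
# Sketch of the rule described above: a batch file is treated as UTF-16 or
# UTF-8 only if it carries the corresponding BOM; otherwise its bytes are
# interpreted through the active code page.
def classify(raw: bytes) -> str:
    if raw.startswith(b'\xff\xfe') or raw.startswith(b'\xfe\xff'):
        return 'UTF-16'                 # little- or big-endian BOM
    if raw.startswith(b'\xef\xbb\xbf'):
        return 'UTF-8 with BOM'
    return 'ASCII (active codepage)'    # no BOM: bytes taken as-is

print(classify('echo привет'.encode('utf-16')))              # Python prepends a BOM
print(classify(b'\xef\xbb\xbf' + 'echo привет'.encode('utf-8')))
print(classify('echo привет'.encode('cp866')))               # no BOM: falls through
```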
 

Charles Dye

Super Moderator
Staff member
May 20, 2008
4,188
72
Albuquerque, NM
prospero.unm.edu
Without one of the above, your file will be treated as ASCII, and the characters will be displayed based on the active codepage.
Interpreting high-order OEM characters according to the current code page would be the Right Thing. But it seems the BDEBUGGER/IDE/TCEDIT is not interpreting them per the code page, but displaying them as hex numbers in reverse video.
 

Charles Dye
Adding: .BAT / .BTM / .CMD files should be interpreted per the current OEM code page, since that's how TCC itself reads them.

Whether other extensions, e.g. .TXT, should use the Windows (“ANSI”) code page instead... I could argue that one either way!
 
Most editors let users choose the code page used to display text files. Maybe you will include this option in a future version? It seems very strange for a program to display the same text differently depending on the filename extension.

My .CMD file builds some e-mail text in Russian via redirected ECHO commands, and I wanted to see how that text is written to the output file. I opened the output file in TCC's IDE and saw garbage in place of the Russian words.
 

Charles Dye
After some testing, I find that BDEBUGGER / IDE / TCEDIT consistently display inverted hex for all high-order characters in 8-bit-encoded files. The file extension and code page don't matter; it always happens. Dmitry, I have to wonder if your .CMD file isn't actually UTF-16.

TCEdit - High OEM characters.png


My OEM code page is 437, and my Windows code page is 1252. I'm guessing that the editor control isn't trying to remap these, just making them Unicode code points with the same values — C1 control characters.

On the bright side, TCEDIT et al. seem quite happy dealing with high Unicode characters — those outside the BMP. Many Windows programs don't even try:

TCEdit - Smilies.png
 

Charles Dye
My OEM code page is 437, and my Windows code page is 1252. I'm guessing that the editor control isn't trying to remap these, just making them Unicode code points with the same values — C1 control characters.
But in that case, 0xA0 through 0xFF should be printable Latin-1 characters. They aren't. So... I don't know what the editor is doing with high-order characters in OEM files, but it isn't right.
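That objection is easy to verify: if the editor really mapped each byte to the equal-valued Unicode code point (i.e. decoded as Latin-1), then bytes 0xA0 through 0xFF would come out as ordinary printable characters. A quick sketch:

```python
# If bytes were mapped straight to equal-valued Unicode code points
# (a Latin-1 decode), the 0xA0-0xFF range would be ordinary printable
# accented characters such as 'é' -- not reverse-video hex markers.
high = bytes(range(0xA0, 0x100))
text = high.decode('latin-1')

assert all(ord(ch) == b for ch, b in zip(text, high))  # identity mapping
assert text[0x49] == 'é'   # byte 0xE9 decodes to U+00E9
print(text[0x20:0x30])     # a printable sample of the Latin-1 high range
```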
 
The standard Windows font-selection dialog allows setting the "character set". Both your IDE.EXE and TCEDIT.EXE call this dialog, but for some unclear reason continue to show hex codes instead of Cyrillic characters despite the "Cyrillic charset" being selected in the dialog (see the attached screenshot, BDEBUGGER.with CP1251 chars & Font Selection dialog.gif).
 

Charles Dye
Both your IDE.EXE and TCEDIT.EXE call this dialog, but for some unclear reason continue to show hex codes instead of Cyrillic characters despite the "Cyrillic charset" being selected in the dialog (see the attached screenshot)
Just to clarify a little bit, this isn't a problem only with Cyrillic or Russian or code page 866. The editor control does this to all high-order characters in non-Unicode files. Anybody using non-ASCII characters in an 8-bit text file will have the same issue. Here's a text file using some CP1252 characters:

TCEdit - CP1252.png
 

Charles Dye
Okay, I've been looking over the Scintilla documentation and I think I have a better understanding of what's going on. Their docs on "Character representations" say that
Invalid bytes are shown in a similar way with an 'x' followed by their value in hexadecimal, like "xFE".
That sounds familiar. But... what's an "invalid byte"? In an 8-bit text file, any byte should be a valid character. So... maybe Scintilla is thinking this file uses some other encoding. If it were UTF-8, then yeah; these scattered high-order bytes probably would not form any valid character.
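That "invalid byte" theory is easy to check: stray high-order bytes from an 8-bit file are almost never well-formed UTF-8. A small sketch (the helper name is mine):

```python
# Scattered high-order bytes from an OEM/ANSI file rarely form valid UTF-8:
# a lone 0x80-0xBF is a continuation byte with no lead byte, and a lead
# byte must be followed by the right number of continuation bytes.
def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b'\xfe'))                  # False: 0xFE never occurs in UTF-8
print(is_valid_utf8(b'caf\xe9'))               # False: CP1252 'é' is a lone lead byte
print(is_valid_utf8('café'.encode('utf-8')))   # True: proper two-byte sequence
```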

I downloaded the demo text editor, SciTE, and tried opening my silly text file in that:

SciTE-1.png


Looks great. But SciTE has a File / Encoding submenu, which TCEdit et al. lack. I check, and it's set to "Code Page Property" — which sounds good — but I can change it. And if I set it to UTF-8:

SciTE-2.png


That looks very familiar. So my theory is that TCEdit (IDE, BDEBUGGER) is not correctly recognizing OEM text files; they get misinterpreted as UTF-8.

Which explains another mystery, one that I'd been ignoring. Dmitry, in your first screen shot, somehow there is a Chinese character amongst the hexadecimal. The three Cyrillic letters ч и к are encoded in CP866 as 0xE7 0xA8 0xAA. Which just happens to form a valid UTF-8 sequence. It works out to U+7A2A, which is 稪. Or, in Japanese, mojibake.
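The byte arithmetic behind that mojibake can be checked directly. A short sketch, relying only on Python's standard codec tables:

```python
# The three CP866 bytes for "чик" happen to form one valid UTF-8 sequence.
raw = bytes([0xE7, 0xA8, 0xAA])

print(raw.decode('cp866'))   # чик  (what the file actually says)
print(raw.decode('utf-8'))   # 稪   (U+7A2A, what a UTF-8 reader sees)

# Decode 0xE7 0xA8 0xAA by hand: 1110_0111  10_101000  10_101010
code_point = ((0xE7 & 0x0F) << 12) | ((0xA8 & 0x3F) << 6) | (0xAA & 0x3F)
assert code_point == 0x7A2A
```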

@Rex: Shall I send you my LooksLikeUTF8() function?
 

Charles Dye
Well, now I understand the underlying issue. May I ask when this will be corrected? Or do you not consider this an error?
I don't speak for Rex. But in my opinion, this is (A) a bug, and (B) easily fixable.

Most batch files use OEM encoding. No batch files, so far as I know, use UTF-8. So if there were absolutely no way to distinguish between the two, BDEBUGGER should assume OEM.

(But in fact, it's not difficult to recognize UTF-8, even without a BOM.)
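The thread never shows the actual LooksLikeUTF8() function, but the usual heuristic is just a strict UTF-8 decode attempt. A hedged sketch (`looks_like_utf8` is my name, not the real function):

```python
# Heuristic sketch -- the actual LooksLikeUTF8() is not shown in the thread.
# A byte string "looks like" UTF-8 if it decodes under strict error checking;
# typical OEM text trips over a stray continuation byte almost immediately.
def looks_like_utf8(raw: bytes) -> bool:
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8('привет'.encode('cp866')))   # False: 0xAF has no lead byte
print(looks_like_utf8('привет'.encode('utf-8')))   # True
print(looks_like_utf8(b'plain ascii'))             # True: ASCII is valid UTF-8 too

# Short CP866 strings can still pass by accident: 'чик' (E7 A8 AA) is exactly
# the false positive that produced the Chinese character earlier in this thread.
```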
 
It seems the IDE tries to display the file using code page 866, not 437. Code page 866 is set by my TCSTART.BAT file. But it is not correct to assume that all batch files use the code page set by TCSTART.BAT. The Russian code pages are 866 and 1251, and both may be used on one computer. The IDE must allow the user to choose this independently of what is set in the startup files.
 

rconn
The editor in IDE and TCEDIT does everything in UTF-8.

If you edit a UTF-16 file, it is converted to UTF-8 and everything is displayed as expected.

If you edit a UTF-8 file, nothing needs converting and everything is displayed as expected.

If you edit an ASCII file, it has to be converted to UTF-8. The editor does this using CP_OEMCP.

The only good solution is to use Unicode. The awkward solution is to add an option to the IDE to specify (either on a per-file basis or for all subsequent files) a code page. Code pages are definitely the old tech way to handle it -- particularly since Windows cannot always convert reliably from ASCII -> Unicode -> ASCII and get the same results.
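The round-trip caveat is easy to demonstrate. A small sketch, assuming Python's `cp866`/`cp437` codec tables match the Windows code pages of the same numbers:

```python
# A code-page round-trip is only lossless when every character actually
# exists in that code page; otherwise data is silently replaced.
text = 'чай'  # Russian word, representable in CP866

assert text.encode('cp866').decode('cp866') == text   # right code page: lossless

# Wrong OEM code page (437 has no Cyrillic): the text is destroyed for good.
lossy = text.encode('cp437', errors='replace').decode('cp437')
print(lossy)        # '???'
assert lossy != text
```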
 
The default Russian OEM (terminal) code page is CP866.
CP1251 is the default Russian ANSI (Windows GUI non-Unicode) code page. It is normally not used in the terminal.
If, as @rconn said, the IDE uses an OEMCP-to-UTF-8 conversion and you see wrong results, it may mean that your OEM code page is wrong.
If you are using Windows 10, please check that your "locale for non-Unicode programs" is not set to "Unicode".