[Done] Error: the BDEBUGGER displays Cyrillic CP 866 characters incorrectly

I tried to open a .TXT file containing some Russian text encoded in code page 866 (the output of a .BAT file I'm developing) in TCC's IDE. The IDE displays the text incorrectly and has no setting to correct this. Interestingly, .CMD files containing CP 866 characters are displayed correctly in the IDE.
 


rconn

Administrator
Staff member
May 14, 2008
11,894
133
There are only two ways to code a batch file with a specific language:

1. Encode the file as UTF-16
2. Encode the file as UTF-8 with a BOM

Without one of the above, your file will be treated as ASCII, and the characters will be displayed based on the active codepage.
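The decision rule above amounts to simple BOM sniffing, which can be sketched in a few lines. This is a minimal illustration of the stated rules, not TCC's actual detection code; the `classify()` helper name is mine:

```python
# Sketch of the rule described above: a batch file is treated as UTF-16 or
# UTF-8 only if it carries the corresponding BOM; otherwise its bytes are
# interpreted through the active code page.
def classify(raw: bytes) -> str:
    if raw.startswith(b'\xff\xfe') or raw.startswith(b'\xfe\xff'):
        return 'UTF-16'                 # little- or big-endian BOM
    if raw.startswith(b'\xef\xbb\xbf'):
        return 'UTF-8 with BOM'
    return 'ASCII (active codepage)'    # no BOM: bytes taken as-is

print(classify('echo привет'.encode('utf-16')))              # Python prepends a BOM
print(classify(b'\xef\xbb\xbf' + 'echo привет'.encode('utf-8')))
print(classify('echo привет'.encode('cp866')))               # no BOM: falls through
```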
 

Charles Dye

Super Moderator
Staff member
May 20, 2008
4,188
72
Albuquerque, NM
prospero.unm.edu
Without one of the above, your file will be treated as ASCII, and the characters will be displayed based on the active codepage.
Interpreting high-order OEM characters according to the current code page would be the Right Thing. But it seems the BDEBUGGER/IDE/TCEDIT is not interpreting them per the code page, but displaying them as hex numbers in reverse video.
 

Charles Dye
Adding: .BAT / .BTM / .CMD files should be interpreted per the current OEM code page, since that's how TCC itself reads them.

Whether other extensions, e.g. .TXT, should use the Windows (“ANSI”) code page instead... I could argue that one either way!
 
Most editors let users choose the code page used to display text files. Maybe you will include this option in a future version? It seems very strange for a program to display the same text differently depending on the filename extension.

My .CMD file builds some e-mail text in Russian via redirected ECHO commands, and I wanted to see how that text is written to the output file. I opened the output file in TCC's IDE and saw garbage in place of the Russian words.
 

Charles Dye
After some testing, I find that BDEBUGGER / IDE / TCEDIT consistently display inverted hex for all high-order characters in 8-bit-encoded files. The file extension and code page don't matter; it always happens. Dmitry, I have to wonder if your .CMD file isn't actually UTF-16.

TCEdit - High OEM characters.png


My OEM code page is 437, and my Windows code page is 1252. I'm guessing that the editor control isn't trying to remap these, just making them Unicode code points with the same values — C1 control characters.

On the bright side, TCEDIT et al. seem quite happy dealing with high Unicode characters — those outside the BMP. Many Windows programs don't even try:

TCEdit - Smilies.png
 

Charles Dye
My OEM code page is 437, and my Windows code page is 1252. I'm guessing that the editor control isn't trying to remap these, just making them Unicode code points with the same values — C1 control characters.
But in that case, 0xA0 through 0xFF should be printable Latin-1 characters. They aren't. So... I don't know what the editor is doing with high-order characters in OEM files, but it isn't right.
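That objection is easy to verify: if the editor really mapped each byte to the equal-valued Unicode code point (i.e. decoded as Latin-1), then bytes 0xA0 through 0xFF would come out as ordinary printable characters. A quick sketch:

```python
# If bytes were mapped straight to equal-valued Unicode code points
# (a Latin-1 decode), the 0xA0-0xFF range would be ordinary printable
# accented characters such as 'é' -- not reverse-video hex markers.
high = bytes(range(0xA0, 0x100))
text = high.decode('latin-1')

assert all(ord(ch) == b for ch, b in zip(text, high))  # identity mapping
assert text[0x49] == 'é'   # byte 0xE9 decodes to U+00E9
print(text[0x20:0x30])     # a printable sample of the Latin-1 high range
```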
 
The standard Windows font-selection dialog allows setting the "character set". Both your IDE.EXE and TCEDIT.EXE call this dialog, but for some unclear reason continue to show hex codes instead of Cyrillic characters despite the "Cyrillic charset" being selected in the dialog (see the attached screenshot, BDEBUGGER.with CP1251 chars & Font Selection dialog.gif).
 

Charles Dye
Both your IDE.EXE and TCEDIT.EXE call this dialog, but for some unclear reason continue to show hex codes instead of Cyrillic characters despite the "Cyrillic charset" being selected in the dialog (see the attached screenshot)
Just to clarify a little bit, this isn't a problem only with Cyrillic or Russian or code page 866. The editor control does this to all high-order characters in non-Unicode files. Anybody using non-ASCII characters in an 8-bit text file will have the same issue. Here's a text file using some CP1252 characters:

TCEdit - CP1252.png
 

Charles Dye
Okay, I've been looking over the Scintilla documentation and I think I have a better understanding of what's going on. Their docs on "Character representations" say that
Invalid bytes are shown in a similar way with an 'x' followed by their value in hexadecimal, like "xFE".
That sounds familiar. But... what's an "invalid byte"? In an 8-bit text file, any byte should be a valid character. So... maybe Scintilla is thinking this file uses some other encoding. If it were UTF-8, then yeah; these scattered high-order bytes probably would not form any valid character.
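That "invalid byte" theory is easy to check: stray high-order bytes from an 8-bit file are almost never well-formed UTF-8. A small sketch (the helper name is mine):

```python
# Scattered high-order bytes from an OEM/ANSI file rarely form valid UTF-8:
# a lone 0x80-0xBF is a continuation byte with no lead byte, and a lead
# byte must be followed by the right number of continuation bytes.
def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b'\xfe'))                  # False: 0xFE never occurs in UTF-8
print(is_valid_utf8(b'caf\xe9'))               # False: CP1252 'é' is a lone lead byte
print(is_valid_utf8('café'.encode('utf-8')))   # True: proper two-byte sequence
```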

I downloaded the demo text editor, SciTE, and tried opening my silly text file in that:

SciTE-1.png


Looks great. But SciTE has a File / Encoding submenu, which TCEdit et al. lack. I check, and it's set to "Code Page Property" — which sounds good — but I can change it. And if I set it to UTF-8:

SciTE-2.png


That looks very familiar. So my theory is that TCEdit (IDE, BDEBUGGER) is not correctly recognizing OEM text files; they get misinterpreted as UTF-8.

Which explains another mystery, one that I'd been ignoring. Dmitry, in your first screen shot, somehow there is a Chinese character amongst the hexadecimal. The three Cyrillic letters ч и к are encoded in CP866 as 0xE7 0xA8 0xAA. Which just happens to form a valid UTF-8 sequence. It works out to U+7A2A, which is 稪. Or, in Japanese, mojibake.
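The byte arithmetic behind that mojibake can be checked directly. A short sketch, relying only on Python's standard codec tables:

```python
# The three CP866 bytes for "чик" happen to form one valid UTF-8 sequence.
raw = bytes([0xE7, 0xA8, 0xAA])

print(raw.decode('cp866'))   # чик  (what the file actually says)
print(raw.decode('utf-8'))   # 稪   (U+7A2A, what a UTF-8 reader sees)

# Decode 0xE7 0xA8 0xAA by hand: 1110_0111  10_101000  10_101010
code_point = ((0xE7 & 0x0F) << 12) | ((0xA8 & 0x3F) << 6) | (0xAA & 0x3F)
assert code_point == 0x7A2A
```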

@Rex: Shall I send you my LooksLikeUTF8() function?
 

Charles Dye
Well, now I understand the underlying issue. May I ask when this will be corrected? Or do you not consider this an error?
I don't speak for Rex. But in my opinion, this is (A) a bug, and (B) easily fixable.

Most batch files use OEM encoding. No batch files, so far as I know, use UTF-8. So if there were absolutely no way to distinguish between the two, BDEBUGGER should assume OEM.

(But in fact, it's not difficult to recognize UTF-8, even without a BOM.)
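The thread never shows the actual LooksLikeUTF8() function, but the usual heuristic is just a strict UTF-8 decode attempt. A hedged sketch (`looks_like_utf8` is my name, not the real function):

```python
# Heuristic sketch -- the actual LooksLikeUTF8() is not shown in the thread.
# A byte string "looks like" UTF-8 if it decodes under strict error checking;
# typical OEM text trips over a stray continuation byte almost immediately.
def looks_like_utf8(raw: bytes) -> bool:
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8('привет'.encode('cp866')))   # False: 0xAF has no lead byte
print(looks_like_utf8('привет'.encode('utf-8')))   # True
print(looks_like_utf8(b'plain ascii'))             # True: ASCII is valid UTF-8 too

# Short CP866 strings can still pass by accident: 'чик' (E7 A8 AA) is exactly
# the false positive that produced the Chinese character earlier in this thread.
```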
 
It seems the IDE tries to display the file using code page 866, not 437. Code page 866 is set by my TCSTART.BAT file. But it is not correct to assume that all batch files use the code page set by TCSTART.BAT. The Russian code pages are 866 and 1251, and both may be used on one computer. The IDE must allow the user to choose this independently of what is set in the startup files.
 

rconn
The editor in IDE and TCEDIT does everything in UTF-8.

If you edit a UTF-16 file, it is converted to UTF-8 and everything is displayed as expected.

If you edit a UTF-8 file, nothing needs converting and everything is displayed as expected.

If you edit an ASCII file, it has to be converted to UTF-8. The editor does this using CP_OEMCP.

The only good solution is to use Unicode. The awkward solution is to add an option to the IDE to specify (either on a per-file basis or for all subsequent files) a code page. Code pages are definitely the old tech way to handle it -- particularly since Windows cannot always convert reliably from ASCII -> Unicode -> ASCII and get the same results.
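The round-trip caveat is easy to demonstrate. A small sketch, assuming Python's `cp866`/`cp437` codec tables match the Windows code pages of the same numbers:

```python
# A code-page round-trip is only lossless when every character actually
# exists in that code page; otherwise data is silently replaced.
text = 'чай'  # Russian word, representable in CP866

assert text.encode('cp866').decode('cp866') == text   # right code page: lossless

# Wrong OEM code page (437 has no Cyrillic): the text is destroyed for good.
lossy = text.encode('cp437', errors='replace').decode('cp437')
print(lossy)        # '???'
assert lossy != text
```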
 
The default Russian OEM (terminal) code page is CP866.
CP1251 is the default Russian ANSI (Windows GUI non-Unicode) code page. It is normally not used in the terminal.
If, as @rconn said, the IDE uses an OEMCP-to-UTF-8 conversion and you see wrong results, it may mean that your OEM code page is wrong.
If you are using Windows 10, please check that your "locale for non-Unicode programs" is not set to "Unicode".