WAD TCC: inconsistent character handling

May 20, 2008
10,555
78
Syracuse, NY, USA
The file in question is ASCII. The character in question is 0xB1 (plus/minus). My console font is Consolas, which can handle that character. TCC handles that character rather inconsistently, displaying it in at least 4 different ways.

[screenshot attached]
 

rconn

Administrator
Staff member
May 14, 2008
11,910
133
I've mentioned a few dozen times in the past that you will never, ever be satisfied with the results if you convert extended ASCII characters to Unicode and then back to ASCII. (Or even worse, back and forth and back and forth like in your examples.) That's the way Windows works; what you get will depend on your codepage and your font, but it will almost never be what you want. If you're unhappy with the results you should be using UTF-16 or UTF-8 files. Or at the very least, UnicodeOutput or UTF8Output, and/or change your code page to 65001.

In a TCC console window, Windows handles the character display; all TCC does is pass the character along, and how it appears is up to Windows.
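The lossy round trip rconn warns about can be sketched in a few lines, using Python's codecs as a stand-in for the Windows conversion APIs (an approximation; MultiByteToWideChar / WideCharToMultiByte are the real calls):

```python
# Simulate converting an extended-ASCII byte to Unicode and back under
# MISMATCHED code pages -- the situation rconn describes as unwinnable.
raw = b"\xb1"                       # one extended-ASCII byte

# Decoded under OEM code page 437, 0xB1 is MEDIUM SHADE (U+2592).
as_cp437 = raw.decode("cp437")
assert as_cp437 == "\u2592"

# Re-encode that character under a *different* code page (the Windows
# ANSI code page 1252) and the character is simply lost.
back = as_cp437.encode("cp1252", errors="replace")
print(back)  # b'?'
```

The same byte survives only if the identical code page is used in both directions, which is why a UTF-8 or UTF-16 file (or code page 65001) sidesteps the problem entirely.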
 
rconn said:
I've mentioned a few dozen times in the past that you will never, ever be satisfied with the results if you convert extended ASCII characters to Unicode and then back to ASCII. (Or even worse, back and forth and back and forth like in your examples.) That's the way Windows works; what you get will depend on your codepage and your font, but it will almost never be what you want. If you're unhappy with the results you should be using UTF-16 or UTF-8 files. Or at the very least, UnicodeOutput or UTF8Output, and/or change your code page to 65001.
As I said in Charles Dye's thread about HEAD:

I have a 256-byte file (0255.bin) containing the bytes 0x0 through 0xFF. I wrote a test app to read that file into a buffer, print the decimal values of the bytes in the buffer, use MultiByteToWideChar followed by WideCharToMultiByte (with lpDefaultChar equal to NULL) on the buffer, then print the decimal values again.

I did that for the ANSI, OEM, and thread-ANSI code pages (CP_ACP, CP_OEMCP, and CP_THREAD_ACP).

In all three cases, the before/after decimal values were identical; i.e., the decimals 0 through 255.
I later did the same (successfully) with CP 866.
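A rough analogue of that round-trip test, with Python codecs in place of the Win32 calls (note Python's cp1252 codec rejects the few bytes Windows maps to C1 controls, so only the complete OEM tables are shown here):

```python
# Every byte value 0..255 survives a bytes -> Unicode -> bytes round
# trip, as long as the SAME code page is used in both directions --
# mirroring the MultiByteToWideChar / WideCharToMultiByte test above.
data = bytes(range(256))

for codepage in ("cp437", "cp866"):     # OEM code pages with full tables
    round_tripped = data.decode(codepage).encode(codepage)
    assert round_tripped == data, codepage
```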

This thread started when I simply TYPE'd the file (no pipes). Here's what I saw/see.

[screenshot attached]
 
Let's start over.

PlusMinusSign is Unicode 0xB1; supported by Consolas; not in my code page (437).
MediumShade is Unicode 0x2592; supported by Consolas; 0xB1 in code page 437. (see it below)

Rex, please explain why/how these are different.

[screenshot attached]
 

Charles Dye

Super Moderator
Staff member
May 20, 2008
4,193
72
Albuquerque, NM
prospero.unm.edu
Pretty sure this is the same as AnrDaemon's issue. When HEAD/TAIL read from a pipe, they seem to assume that everything is Unicode. If you have 8-bit text, it becomes 8-bit Unicode — bytes zero-extended to words. So your 0xB1 becomes U+00B1, the plus-minus sign.
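Charles's description can be modeled in a few lines: zero-extending each byte to a 16-bit code unit is exactly what a Latin-1 decode does, since U+0000..U+00FF coincide with Latin-1.

```python
# A minimal model of the suspected HEAD/TAIL behavior: treating 8-bit
# text as "8-bit Unicode" (bytes zero-extended to words) is equivalent
# to decoding it as Latin-1.
raw = b"\xb1"                           # MEDIUM SHADE in code page 437

zero_extended = raw.decode("latin-1")   # what HEAD/TAIL appear to do
code_page_aware = raw.decode("cp437")   # what TYPE (no pipe) shows

assert zero_extended == "\u00b1"        # PLUS-MINUS SIGN
assert code_page_aware == "\u2592"      # MEDIUM SHADE
```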
 
Charles Dye said:
Pretty sure this is the same as AnrDaemon's issue. When HEAD/TAIL read from a pipe, they seem to assume that everything is Unicode. If you have 8-bit text, it becomes 8-bit Unicode — bytes zero-extended to words. So your 0xB1 becomes U+00B1, the plus-minus sign.
Yup! Using codepage 866 and piping to HEAD or TAIL clobbers the entire Cyrillic alphabet, uppercase and lowercase. HEAD and TAIL just seem to ignore codepages altogether (I have no idea why). In contrast, piping to TPIPE seems to respect the current codepage.
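The Cyrillic clobbering is easy to illustrate. A quick sketch with code page 866, using a hypothetical sample string and a Latin-1 decode to model the zero-extension:

```python
# In code page 866 every Cyrillic letter lives in 0x80-0xFF, so a
# reader that ignores the code page and zero-extends bytes (modeled
# here as a Latin-1 decode) destroys all of them.
text = "Привет"                   # hypothetical sample string
raw = text.encode("cp866")

assert raw.decode("cp866") == text        # code-page-aware: intact
assert raw.decode("latin-1") != text      # zero-extended: clobbered
```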
 

Charles Dye
Yup! Using codepage 866 and piping to HEAD or TAIL clobbers the entire Cyrillic alphabet, uppercase and lowercase. HEAD and TAIL just seem to ignore codepages altogether (I have no idea why). In contrast, piping to TPIPE seems to respect the current codepage.
Only from a pipe, though. Using a |! pseudopipe prevents the problem. I think there's a missing MultiByteToWideChar() somewhere.
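If that guess is right, the missing step is a single code-page-aware decode of the incoming bytes before they are widened. A minimal sketch, with Python standing in for the Win32 side; `read_pipe_text` and the `console_cp` parameter are illustrative names, not TCC internals:

```python
# The equivalent of the missing MultiByteToWideChar() call: decode the
# pipe's raw bytes with the console's code page instead of
# zero-extending them.
def read_pipe_text(raw: bytes, console_cp: str = "cp437") -> str:
    return raw.decode(console_cp)

assert read_pipe_text(b"\xb1") == "\u2592"      # MEDIUM SHADE, as TYPE shows
assert read_pipe_text(b"\x80", "cp866") == "А"  # Cyrillic А survives
```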