WAD TCC: inconsistent character handling

May 20, 2008
10,555
78
Syracuse, NY, USA
The file in question is ASCII. The character in question is 0xB1 (plus/minus). My console font is Consolas, which can handle that character. TCC handles that character rather inconsistently, displaying it in at least 4 different ways.

[screenshot attached]
 

rconn

Administrator
Staff member
May 14, 2008
11,910
133
I've mentioned a few dozen times in the past that you will never, ever be satisfied with the results if you convert extended ASCII characters to Unicode and then back to ASCII. (Or even worse, back and forth and back and forth like in your examples.) That's the way Windows works; what you get will depend on your codepage and your font, but it will almost never be what you want. If you're unhappy with the results you should be using UTF-16 or UTF-8 files. Or at the very least, UnicodeOutput or UTF8Output, and/or change your code page to 65001.

In a TCC console window, Windows handles the character display; all TCC does is pass the character along, and how it appears is up to Windows.
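The lossy round trip rconn warns about can be sketched in a few lines, using Python's codecs as a stand-in for the Windows conversion APIs (an approximation; MultiByteToWideChar / WideCharToMultiByte are the real calls):

```python
# Simulate converting an extended-ASCII byte to Unicode and back under
# MISMATCHED code pages -- the situation rconn describes as unwinnable.
raw = b"\xb1"                       # one extended-ASCII byte

# Decoded under OEM code page 437, 0xB1 is MEDIUM SHADE (U+2592).
as_cp437 = raw.decode("cp437")
assert as_cp437 == "\u2592"

# Re-encode that character under a *different* code page (the Windows
# ANSI code page 1252) and the character is simply lost.
back = as_cp437.encode("cp1252", errors="replace")
print(back)  # b'?'
```

The same byte survives only if the identical code page is used in both directions, which is why a UTF-8 or UTF-16 file (or code page 65001) sidesteps the problem entirely.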
 
rconn said:
I've mentioned a few dozen times in the past that you will never, ever be satisfied with the results if you convert extended ASCII characters to Unicode and then back to ASCII. (Or even worse, back and forth and back and forth like in your examples.) That's the way Windows works; what you get will depend on your codepage and your font, but it will almost never be what you want. If you're unhappy with the results you should be using UTF-16 or UTF-8 files. Or at the very least, UnicodeOutput or UTF8Output, and/or change your code page to 65001.
As I said in Charles Dye's thread about HEAD:

I have a 256-byte file (0255.bin) containing the bytes 0x0 through 0xFF. I wrote a test app to read that file into a buffer, print the decimal values of the bytes in the buffer, use MultiByteToWideChar followed by WideCharToMultiByte (with lpDefaultChar equal to NULL) on the buffer, then print the decimal values again.

I did that for the ANSI, OEM, and thread-ANSI code pages (CP_ACP, CP_OEMCP, and CP_THREAD_ACP).

In all three cases, the before/after decimal values were identical; i.e., the decimals 0 through 255.
I later did the same (successfully) with CP 866.
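A rough analogue of that round-trip test, with Python codecs in place of the Win32 calls (note Python's cp1252 codec rejects the few bytes Windows maps to C1 controls, so only the complete OEM tables are shown here):

```python
# Every byte value 0..255 survives a bytes -> Unicode -> bytes round
# trip, as long as the SAME code page is used in both directions --
# mirroring the MultiByteToWideChar / WideCharToMultiByte test above.
data = bytes(range(256))

for codepage in ("cp437", "cp866"):     # OEM code pages with full tables
    round_tripped = data.decode(codepage).encode(codepage)
    assert round_tripped == data, codepage
```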

This thread started when I simply TYPE'd the file (no pipes). Here's what I saw/see.

[screenshot attached]
 
Let's start over.

PlusMinusSign is Unicode 0xB1; supported by Consolas; not in my code page (437).
MediumShade is Unicode 0x2592; supported by Consolas; 0xB1 in code page 437. (see it below)

Rex, please explain why/how these are different.

[screenshot attached]
 

Charles Dye

Super Moderator
Staff member
May 20, 2008
4,193
72
Albuquerque, NM
prospero.unm.edu
Pretty sure this is the same as AnrDaemon's issue. When HEAD/TAIL read from a pipe, they seem to assume that everything is Unicode. If you have 8-bit text, it becomes 8-bit Unicode — bytes zero-extended to words. So your 0xB1 becomes U+00B1, the plus-minus sign.
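Charles's description can be modeled in a few lines: zero-extending each byte to a 16-bit code unit is exactly what a Latin-1 decode does, since U+0000..U+00FF coincide with Latin-1.

```python
# A minimal model of the suspected HEAD/TAIL behavior: treating 8-bit
# text as "8-bit Unicode" (bytes zero-extended to words) is equivalent
# to decoding it as Latin-1.
raw = b"\xb1"                           # MEDIUM SHADE in code page 437

zero_extended = raw.decode("latin-1")   # what HEAD/TAIL appear to do
code_page_aware = raw.decode("cp437")   # what TYPE (no pipe) shows

assert zero_extended == "\u00b1"        # PLUS-MINUS SIGN
assert code_page_aware == "\u2592"      # MEDIUM SHADE
```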
 
Charles Dye said:
Pretty sure this is the same as AnrDaemon's issue. When HEAD/TAIL read from a pipe, they seem to assume that everything is Unicode. If you have 8-bit text, it becomes 8-bit Unicode — bytes zero-extended to words. So your 0xB1 becomes U+00B1, the plus-minus sign.
Yup! Using codepage 866 and piping to HEAD or TAIL clobbers the entire Cyrillic alphabet, uppercase and lowercase. HEAD and TAIL just seem to ignore codepages altogether (I have no idea why). In contrast, piping to TPIPE seems to respect the current codepage.
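The Cyrillic clobbering is easy to illustrate. A quick sketch with code page 866, using a hypothetical sample string and a Latin-1 decode to model the zero-extension:

```python
# In code page 866 every Cyrillic letter lives in 0x80-0xFF, so a
# reader that ignores the code page and zero-extends bytes (modeled
# here as a Latin-1 decode) destroys all of them.
text = "Привет"                   # hypothetical sample string
raw = text.encode("cp866")

assert raw.decode("cp866") == text        # code-page-aware: intact
assert raw.decode("latin-1") != text      # zero-extended: clobbered
```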
 

Charles Dye
Yup! Using codepage 866 and piping to HEAD or TAIL clobbers the entire Cyrillic alphabet, uppercase and lowercase. HEAD and TAIL just seem to ignore codepages altogether (I have no idea why). In contrast, piping to TPIPE seems to respect the current codepage.
Only from a pipe, though. Using a |! pseudopipe prevents the problem. I think there's a missing MultiByteToWideChar() somewhere.
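If that guess is right, the missing step is a single code-page-aware decode of the incoming bytes before they are widened. A minimal sketch, with Python standing in for the Win32 side; `read_pipe_text` and the `console_cp` parameter are illustrative names, not TCC internals:

```python
# The equivalent of the missing MultiByteToWideChar() call: decode the
# pipe's raw bytes with the console's code page instead of
# zero-extending them.
def read_pipe_text(raw: bytes, console_cp: str = "cp437") -> str:
    return raw.decode(console_cp)

assert read_pipe_text(b"\xb1") == "\u2592"      # MEDIUM SHADE, as TYPE shows
assert read_pipe_text(b"\x80", "cp866") == "А"  # Cyrillic А survives
```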