Incorrect Unicode detection in "type" and "head" commands

LurkingKiwi · Feb 22, 2022

TCC 26.02.43 x64 Windows 10 [Version 10.0.19044.1503]

I have an ASCII file "fred.hex" containing one very long line, mostly repetitions of "|00 00" and terminated by CRLF. When the line is longer than 512 chars, the output from TYPE (or HEAD) shows the per mille symbol (‰), the unknown-char symbol and a space. TYPE /X shows the correct hex codes on the left but also renders them as Unicode chars on the right (as per attached PNG).

The line is:
-13:03:59| 0| 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00|00 00|00 00|00 00|00 00|00 00|00 00|00 00...

Originally it was 1700 chars long, but continually reducing the length (by bisection initially) resulted in correct display as the length dropped below 512 chars. Examining the critical 6 chars shows no hidden non-ASCII chars. Deleting one of the " 00" from the initial group also results in the correct display, even though the line is over 512 chars.
Regenerating the file by hand in Notepad exhibits the same behaviour, which rules out a hidden control char that I missed.
I do not have UTF-8 or Unicode output enabled in "option".
Is there any way to force ASCII display?
Thanks, Len.

vefatica · Feb 22, 2022

Similar (but different) here (with TCC v28). VIEW gets it right (as does GNU cat).

But TYPE (without /X), LIST, HEAD, and TAIL all show garbled text [screenshot not preserved].

With /X, TYPE shows the hex correctly but the text is as above.

I wonder if it's the Win32 function IsTextUnicode? I'll test it.

I use codepage 1252 if it matters.
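If the file really is being treated as UTF-16 once detection says "Unicode" (an assumption; nothing in this thread confirms what TCC does internally), then each pair of ASCII bytes would be read as one 16-bit code unit. The following is a minimal Python sketch of that reinterpretation, not TCC's code; the helper name `as_utf16le` is made up for illustration.

```python
import unicodedata

def as_utf16le(ascii_text: str) -> str:
    """Reinterpret ASCII bytes as UTF-16LE code units, two bytes per character."""
    data = ascii_text.encode("ascii")
    if len(data) % 2:
        data = data[:-1]  # a trailing odd byte cannot form a 16-bit code unit
    return data.decode("utf-16-le")

# Byte pairs that occur in a line made of "|00 00" groups.
for pair in ("0 ", " 0", "00", "|0"):
    ch = as_utf16le(pair)
    print(repr(pair), "->", f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
```

The pair "0 " (bytes 0x30 0x20) comes out as U+2030, the per mille sign, which matches the symbol reported in the first post. Which pairs appear depends on byte alignment, which may be why removing a single " 00" changes the result.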

vefatica · Feb 22, 2022

Since I don't know which tests TCC uses, I told IsTextUnicode to use all tests (lpiResult = nullptr). I got

Code:

546 bytes were read
IsTextUnicode() returned TRUE

I don't know if anything can be done about that.

vefatica · Feb 22, 2022

And when I shorten the line, I get

Code:

508 bytes were read
IsTextUnicode() returned FALSE
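For anyone who wants to repeat the length experiment, here is a rough Python sketch that builds test lines of an exact byte count. The prefix and the `initial_zeros` parameter are hypothetical stand-ins modelled on the line in the first post (the full original line was not posted), so there is no guarantee that these files trigger the same detection result.

```python
UNIT = b"|00 00"  # the repeating group described in the first post

def make_line(total_bytes: int, initial_zeros: int = 8) -> bytes:
    """Return an ASCII line of exactly total_bytes, including the trailing CRLF.

    initial_zeros is the number of " 00" items in the leading group; it is a
    hypothetical knob, since the thread notes that removing one changes the result.
    """
    prefix = b"-13:03:59| 0|" + b" 00" * initial_zeros
    body_len = total_bytes - 2  # leave room for CRLF
    if body_len < len(prefix):
        raise ValueError("total_bytes is too small for the prefix plus CRLF")
    repeats = (body_len - len(prefix)) // len(UNIT) + 1
    body = (prefix + UNIT * repeats)[:body_len]
    return body + b"\r\n"

def write_test_file(path: str, total_bytes: int, initial_zeros: int = 8) -> None:
    """Write one generated line to path in binary mode (no newline translation)."""
    with open(path, "wb") as f:
        f.write(make_line(total_bytes, initial_zeros))
```

Calling `write_test_file("test546.hex", 546)` and `write_test_file("test508.hex", 508)` gives two files with the byte counts reported above, which can then be passed to TYPE or to a test program that calls IsTextUnicode().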

Alpengreis · Feb 23, 2022

Yup, same problem here! Weird!

PS: The TYPE command of CMD shows it correctly (same console settings).

vefatica · Feb 23, 2022

IsTextUnicode() is a Microsoft Win32 API function. I also tested TCC's QueryIsFileUnicode() function (which, no doubt, uses the Win32 function) in a plugin. The results were as I reported above.

Alpengreis · Feb 23, 2022

@vefatica

Ok, thanks for info. Then I guess it's really not easy to change this behaviour ...

Alpengreis · Feb 23, 2022

@LurkingKiwi

If there is no workaround possible (we will see), then it may be best to use an external command for this.

For example, "cat" (included with Git), which works here.
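Another external option, assuming Python is installed, is to read the file as raw bytes and decode it with a fixed single-byte code page so that no Unicode detection happens at all. This is only a sketch; the function name is made up, and cp1252 is used as the default because it is the code page mentioned earlier in the thread.

```python
def read_as_single_byte_text(path: str, encoding: str = "cp1252") -> str:
    """Read a file as bytes and decode it with a fixed single-byte code page.

    No Unicode detection is attempted, so plain ASCII is never reinterpreted
    as UTF-16. Bytes undefined in the code page are replaced with U+FFFD.
    """
    with open(path, "rb") as f:
        data = f.read()
    return data.decode(encoding, errors="replace")
```

Printing the result of `read_as_single_byte_text("fred.hex")` should show the line as plain text regardless of its length.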
