"HEAD" mangles stream encoding

From the nearby thread, the example
Code:
wmiquery . "select Name,ProcessId,DisplayName from Win32_service where Started='TRUE'"
produces output like
Code:
DisplayName = Информация о совместимости приложений
Name = AeLookupSvc
ProcessId = 540

But when I pipe it through TCC's internal "HEAD", the character encoding is utterly destroyed…

Code:
DisplayName = ?-aRa┐ ??i R aR?┐?aa?┐Raa? ?a?<R│?-?c
Name = AeLookupSvc
ProcessId = 540
 
TCC will send the text as ASCII
the output of WMIQUERY is in CP866.

In INI file,
UnicodeOutput=No

Setting it to "Yes" breaks expectations in many other places. (Not to mention, it uses UTF-16 rather than UTF-8, which breaks compatibility on a very high level.)
 
Is HEAD not using the current code page to interpret non-Unicode input?
 
Why does it have to use any code page to begin with?
If your output is in Unicode, then code pages are irrelevant.

But if it is not Unicode, then it must be some other encoding. And the current code page identifies what that "other encoding" is. TCC should be using the code page to determine whether, say, character #136 is И or ט or Ι or ê or ....
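
To make the point about character #136 concrete, here is a minimal Python sketch that decodes that single byte under four OEM code pages. It relies on Python's built-in codec tables (cp866, cp862, cp737, cp437) as a stand-in for the Windows code pages; it does not call TCC or any Windows API, and the variable names are illustrative only.

```python
import unicodedata

# Byte 136 (0x88) has no meaning on its own; the code page supplies it.
BYTE = bytes([136])
CODE_PAGES = ["cp866", "cp862", "cp737", "cp437"]  # Cyrillic, Hebrew, Greek, US OEM

# Decode the same byte under each code page (Python codec tables, an assumption
# standing in for the corresponding Windows code pages).
decoded = {cp: BYTE.decode(cp) for cp in CODE_PAGES}

for cp, ch in decoded.items():
    # Print code point and Unicode name so the output stays plain ASCII.
    print(f"{cp}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The same byte comes out as a Cyrillic, Hebrew, Greek, or Latin letter depending solely on which table is applied.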
 
I'm sorry, Rex, but I have to agree with the OP. Something's not right with HEAD and TAIL reading OEM text from a pipe. HEAD and TAIL should interpret OEM text per the current code page, but it looks like they are actually treating high-order OEM characters as Unicode U+0080 through U+00FF. (Including C1 control codes!) TYPE gets it right.

Code:
C:\Bin\JPSDK\TextUtils>chcp
Active code page: 437

C:\Bin\JPSDK\TextUtils>type /x Upper.txt
0000 0000 80 81 82 83 84 85 86 87  88 89 8a 8b 8c 8d 8e 8f  ÇüéâäàåçêëèïîìÄÅ
0000 0010 0d 0a 90 91 92 93 94 95  96 97 98 99 9a 9b 9c 9d  ..ÉæÆôöòûùÿÖÜ¢£¥
0000 0020 9e 9f 0d 0a a0 a1 a2 a3  a4 a5 a6 a7 a8 a9 aa ab  ₧ƒ..áíóúñѪº¿⌐¬½
0000 0030 ac ad ae af 0d 0a b0 b1  b2 b3 b4 b5 b6 b7 b8 b9  ¼¡«»..░▒▓│┤╡╢╖╕╣
0000 0040 ba bb bc bd be bf 0d 0a  c0 c1 c2 c3 c4 c5 c6 c7  ║╗╝╜╛┐..└┴┬├─┼╞╟
0000 0050 c8 c9 ca cb cc cd ce cf  0d 0a d0 d1 d2 d3 d4 d5  ╚╔╩╦╠═╬╧..╨╤╥╙╘╒
0000 0060 d6 d7 d8 d9 da db dc dd  de df 0d 0a e0 e1 e2 e3  ╓╫╪┘┌█▄▌▐▀..αßΓπ
0000 0070 e4 e5 e6 e7 e8 e9 ea eb  ec ed ee ef 0d 0a f0 f1  ΣσµτΦΘΩδ∞φε∩..≡±
0000 0080 f2 f3 f4 f5 f6 f7 f8 f9  fa fb fc fd fe ff 0d 0a  ≥≤⌠⌡÷≈°∙·√ⁿ²■ ..

C:\Bin\JPSDK\TextUtils>option unicodeoutput
unicodeoutput=No

C:\Bin\JPSDK\TextUtils>type Upper.txt | type
ÇüéâäàåçêëèïîìÄÅ
ÉæÆôöòûùÿÖÜ¢£¥₧ƒ
áíóúñѪº¿⌐¬½¼¡«»
░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
└┴┬├─┼╞╟╚╔╩╦╠═╬╧
╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
αßΓπΣσµτΦΘΩδ∞φε∩
≡±≥≤⌠⌡÷≈°∙·√ⁿ²■

C:\Bin\JPSDK\TextUtils>type Upper.txt | head


¡¢£¤¥¦§¨©ª«¬®¯
°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîï
ðñòóôõö÷øùúûüýþÿ

C:\Bin\JPSDK\TextUtils>type Upper.txt | tail


¡¢£¤¥¦§¨©ª«¬®¯
°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîï
ðñòóôõö÷øùúûüýþÿ

C:\Bin\JPSDK\TextUtils>chcp 866
Active code page: 866

C:\Bin\JPSDK\TextUtils>type Upper.txt | type
АБВГДЕЖЗИЙКЛМНОП
РСТУФХЦЧШЩЪЫЬЭЮЯ
абвгдежзийклмноп
░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
└┴┬├─┼╞╟╚╔╩╦╠═╬╧
╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
рстуфхцчшщъыьэюя
ЁёЄєЇїЎў°∙·√№¤■

C:\Bin\JPSDK\TextUtils>type Upper.txt | head


¡¢£¤¥¦§¨©ª«¬®¯
°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîï
ðñòóôõö÷øùúûüýþÿ

C:\Bin\JPSDK\TextUtils>type Upper.txt | tail


¡¢£¤¥¦§¨©ª«¬®¯
°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîï
ðñòóôõö÷øùúûüýþÿ

C:\Bin\JPSDK\TextUtils>chcp 737
Active code page: 737

C:\Bin\JPSDK\TextUtils>type Upper.txt | type
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠ
ΡΣΤΥΦΧΨΩαβγδεζηθ
ικλμνξοπρσςτυφχψ
░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
└┴┬├─┼╞╟╚╔╩╦╠═╬╧
╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
ωάέήϊίόύϋώΆΈΉΊΌΎ
Ώ±≥≤ΪΫ÷≈°∙·√ⁿ²■

C:\Bin\JPSDK\TextUtils>type Upper.txt | head


¡¢£¤¥¦§¨©ª«¬®¯
°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîï
ðñòóôõö÷øùúûüýþÿ

C:\Bin\JPSDK\TextUtils>type Upper.txt | tail


¡¢£¤¥¦§¨©ª«¬®¯
°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîï
ðñòóôõö÷øùúûüýþÿ

HEAD and TAIL work as expected when reading from a file:

Code:
C:\Bin\JPSDK\TextUtils>chcp 437
Active code page: 437

C:\Bin\JPSDK\TextUtils>head Upper.txt
ÇüéâäàåçêëèïîìÄÅ
ÉæÆôöòûùÿÖÜ¢£¥₧ƒ
áíóúñѪº¿⌐¬½¼¡«»
░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
└┴┬├─┼╞╟╚╔╩╦╠═╬╧
╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
αßΓπΣσµτΦΘΩδ∞φε∩
≡±≥≤⌠⌡÷≈°∙·√ⁿ²■

C:\Bin\JPSDK\TextUtils>chcp 866
Active code page: 866

C:\Bin\JPSDK\TextUtils>head Upper.txt
АБВГДЕЖЗИЙКЛМНОП
РСТУФХЦЧШЩЪЫЬЭЮЯ
абвгдежзийклмноп
░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
└┴┬├─┼╞╟╚╔╩╦╠═╬╧
╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
рстуфхцчшщъыьэюя
ЁёЄєЇїЎў°∙·√№¤■

And they work as expected when input is redirected:

Code:
C:\Bin\JPSDK\TextUtils>chcp 437
Active code page: 437

C:\Bin\JPSDK\TextUtils>head < Upper.txt
ÇüéâäàåçêëèïîìÄÅ
ÉæÆôöòûùÿÖÜ¢£¥₧ƒ
áíóúñѪº¿⌐¬½¼¡«»
░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
└┴┬├─┼╞╟╚╔╩╦╠═╬╧
╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
αßΓπΣσµτΦΘΩδ∞φε∩
≡±≥≤⌠⌡÷≈°∙·√ⁿ²■

C:\Bin\JPSDK\TextUtils>chcp 866
Active code page: 866

C:\Bin\JPSDK\TextUtils>head < Upper.txt
АБВГДЕЖЗИЙКЛМНОП
РСТУФХЦЧШЩЪЫЬЭЮЯ
абвгдежзийклмноп
░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
└┴┬├─┼╞╟╚╔╩╦╠═╬╧
╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
рстуфхцчшщъыьэюя
ЁёЄєЇїЎў°∙·√№¤■

I don't understand why reading redirected input from a pipe should behave differently from reading redirected input from a file, but it does. It looks as if a MultiByteToWideChar() is missing somewhere...?
 

Attachments

  • Upper.zip
    398 bytes · Views: 262
AnrDaemon: Can you verify that piping to TAIL fails in the same way, but piping to TYPE works as expected?
 
I'm sorry, Rex, but I have to agree with the OP. Something's not right with HEAD and TAIL reading OEM text from a pipe. HEAD and TAIL should interpret OEM text per the current code page, but it looks like they are actually treating high-order OEM characters as Unicode U+0080 through U+00FF. (Including C1 control codes!) TYPE gets it right.

That's all well-reasoned, but alas, not what's happening. And this is why I hate ASCII & Windows.

What *is* happening is that (depending on your code page) when you do a MultiByteToWideChar() to convert your ASCII file to Unicode so that TCC (and Windows) can understand it, the string is translated as you expect. Unfortunately, if you then do a WideCharToMultiByte on your new Unicode string -- Surprise! The new ASCII string doesn't match the original ASCII string.

So TYPE in TCC converts your ASCII file to Unicode. It then sends it to the output routine, which determines that STDOUT is a (ASCII) pipe, so it converts the Unicode back to ASCII. The child pipe process reads the newly-mangled ASCII and converts it back to Unicode before displaying it.

The key here is that the conversions to Unicode are working; the problem is the conversions from Unicode to ASCII don't work reliably with extended characters, and never have in Windows. If you complain to Microsoft, they'll tell you not to use ASCII codepages.

Or, you could use UTF8 or UTF16 output in TCC, and everything works.
 
I'm not understanding (and quite likely my test is flawed). I have a 256-byte file (0255.bin) containing the bytes 0x0 through 0xFF. I wrote a test app to read that file into a buffer, print the decimal values of the bytes in the buffer, use MultiByteToWideChar followed by WideCharToMultiByte (with lpDefaultChar equal to NULL) on the buffer, then print the decimal values again.

I did that for the ANSI, OEM, and THREAD code pages.

In all three cases, the before/after decimal values were identical; i.e., the decimals 0 through 255.
 
I did that for the ANSI, OEM, and THREAD code pages.

In all three cases, the before/after decimal values were identical; i.e., the decimals 0 through 255.

I also did it for CP 866. The result was the same, before and after identical.
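
A rough analogue of that round-trip test can be sketched in Python. This is an assumption-laden stand-in: it uses Python's own codec tables for cp437, cp866 and cp737 rather than MultiByteToWideChar/WideCharToMultiByte, so it only shows that these particular mapping tables are reversible, not how any given Windows build or TCC behaves.

```python
# All 256 byte values, like the 0255.bin test file described above.
ALL_BYTES = bytes(range(256))

def round_trips(code_page: str) -> bool:
    """Decode every byte to Unicode, encode it back, and compare with the input."""
    text = ALL_BYTES.decode(code_page)
    return text.encode(code_page) == ALL_BYTES

# Code pages used elsewhere in this thread (Python codec names).
results = {cp: round_trips(cp) for cp in ("cp437", "cp866", "cp737")}

for cp, ok in results.items():
    print(f"{cp}: {'identical' if ok else 'differs'} after the round trip")
```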
 
I don't think the text is being translated via MultiByteToWideChar(). I think bytes are just being zero-extended to words. So e.g. a lowercase Cyrillic н, which is character 0xAD in code page 866, becomes Unicode code point U+00AD, a soft hyphen (and then the console further mangles that to a regular hyphen.) HEAD and TAIL treat OEM input text from a pipe as if it were Unicode. Eight-bit-wide Unicode.
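
The difference between the two hypotheses can be sketched in a few lines of Python. This is only an illustration of zero-extension versus a code page lookup, using Python's cp866 and cp437 codecs as stand-ins; it makes no claim about what TCC's HEAD and TAIL actually do internally.

```python
# Lowercase Cyrillic "н" as it would be stored in a CP866 text file.
raw = "н".encode("cp866")

# Hypothesis 1: proper conversion through the code page table.
via_code_page = raw.decode("cp866")

# Hypothesis 2: each byte is simply widened to a 16-bit value (zero-extension).
zero_extended = "".join(chr(b) for b in raw)

print(f"byte: 0x{raw[0]:02X}")
print(f"code page lookup: U+{ord(via_code_page):04X}")
print(f"zero-extended:    U+{ord(zero_extended):04X}")

# The same comparison for one full row of the test file, bytes 0xB0..0xBF.
row = bytes(range(0xB0, 0xC0))
row_via_cp437 = row.decode("cp437")                 # box-drawing characters
row_zero_extended = "".join(chr(b) for b in row)    # Latin-1 symbols
```

Zero-extending the row 0xB0..0xBF gives the same Latin-1 symbols that appear in the piped HEAD and TAIL output above, regardless of the active code page.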

Rex, would I be correct in assuming that HEAD and TAIL are calling QueryIsFileUnicode() to analyze the input text?

And if so, how does QueryIsFileUnicode() deal with a pipe handle, as opposed to a file handle? I'm guessing either "badly" or "not at all".
 
Rex, would I be correct in assuming that HEAD and TAIL are calling QueryIsFileUnicode() to analyze the input text?

Only if you're reading a file. If you're reading a pipe, it's dependent on whether you've specified Unicode or UTF8 input. Pipes can't be rewound, so TCC can't read a block of input to try to decode the input type and then rewind it to the beginning to allow HEAD or TAIL to start reading it.
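
The "pipes can't be rewound" constraint can be illustrated with a small Python sketch that compares a pipe with a temporary file. It uses only the standard library (`os.pipe`, `tempfile`) and says nothing about TCC's own I/O code; the behaviour described is what a POSIX-style pipe reports and may be reported differently on other platforms.

```python
import os
import tempfile

SAMPLE = b"\x88\xad\r\n"  # a couple of high-order OEM bytes plus CR LF

# A pipe: data can be read once, front to back.
read_fd, write_fd = os.pipe()
os.write(write_fd, SAMPLE)
os.close(write_fd)
with os.fdopen(read_fd, "rb") as pipe:
    pipe_seekable = pipe.seekable()
    pipe_data = pipe.read()

# A regular file: the reader can sniff a block and then seek back to offset 0.
with tempfile.TemporaryFile() as f:
    f.write(SAMPLE)
    file_seekable = f.seekable()
    f.seek(0)
    file_data = f.read()

print("pipe seekable:", pipe_seekable)
print("file seekable:", file_seekable)
```

A reader that wants to sniff the encoding of piped input therefore has to keep the bytes it has already consumed, since it cannot seek back to the start.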
 