TCC doesn't respect active codepage?

Dec 5, 2009
9
0
#1
If I enable Unicode mode and redirect (say) some DIR output to a file, it always appears to be encoded in UTF-16 (with BOM) no matter what I set the codepage to (e.g. "CHCP 65001" for UTF-8).

Is there a way to make it respect the console's codepage setting? Sometimes it's quite important for TCC's output to be compatible with other software that does respect the current codepage.
 

rconn

Administrator
Staff member
May 14, 2008
10,100
85
#2
> If I enable Unicode mode and redirect (say) some DIR output to a file,
> it always appears to be encoded in UTF-16 (with BOM) no matter what I
> set the codepage to (e.g. "CHCP 65001" for UTF-8).
>
> Is there a way to make it respect the console's codepage setting?
> Sometimes it's quite important for TCC's output to be compatible with
> other software that does respect the current codepage.
TCC does not support UTF-8 output.

Rex Conn
JP Software
 
Dec 5, 2009
9
0
#3
Further to the above, something else I just noticed.

My test TCC is using unicode ("tcc /u ...."), utf-16 ("chcp 10000") and Courier New (seemingly the most comprehensive standard monospaced Unicode font provided with XP).

So, if TCC is outputting utf-16, it should be capable of accurately displaying any filenames that I can create in Explorer.

So I made the following test files (attached as ZIP, they're only a byte each):

Code:
1_@bµ.txt
2_a%20b.txt
3_a[b'c_3.txt
4_a`b]c.txt
5_a`b'c.txt
6_ab'c.txt
7_åßĉ.txt
8_ấъç.txt
and tried a DIR on them:
Code:
[C:\Projects\xcrc\src\test.probs] dir /b
1_@bµ.txt
2_a%20b.txt
3_a[b'c_3.txt
4_a`b]c.txt
5_a`b'c.txt
6_ab'c.txt
7_åß?.txt
8_??ç.txt
Observe that the last two are wrong, though I can redirect via the clipboard to a utf-16 editor (also displaying in Courier New) and they display correctly. Hence it seems a "display" issue rather than a "wrong data" issue.

Looking at the Unicode code points of the non-ASCII characters in the two failing filenames, it seems that characters whose high byte is zero (U+00FF and below) display correctly, while those with a non-zero high byte don't:

Code:
7_åßĉ.txt -> 7_åß?.txt
  å = 0x000E5  (displayed ok)
  ß = 0x000DF  (displayed ok)
  ĉ = 0x00109  (displayed wrong)

8_ấъç.txt -> 8_??ç.txt:
  ấ = 0x01ea5  (displayed wrong)
  ъ = 0x0044a  (displayed wrong)
  ç = 0x000e7  (displayed ok)
Any thoughts? Am I being stupid and missing something obvious?
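One way to test the pattern above is to ask which of these characters exist in a single-byte code page at all. The following is a minimal Python sketch, not anything TCC does internally. It assumes the console was on Windows-1252 (`cp1252`), which the thread never states, and the helper name `check_chars` is made up for illustration.

```python
def check_chars(name: str, codepage: str = "cp1252") -> list:
    """Return (char, code point, representable?) for each non-ASCII char.

    The default code page is an ASSUMPTION; the thread does not say
    which code page the console was actually using.
    """
    rows = []
    for ch in name:
        if ord(ch) < 0x80:
            continue  # plain ASCII is present in every Windows code page
        try:
            ch.encode(codepage)
            ok = True
        except UnicodeEncodeError:
            ok = False
        rows.append((ch, f"U+{ord(ch):04X}", ok))
    return rows


# The two failing filenames, written with escapes so the script itself
# does not depend on the editor's or terminal's encoding.
names = (
    "7_\u00e5\u00df\u0109.txt",  # a-ring, sharp s, c-circumflex
    "8_\u1ea5\u044a\u00e7.txt",  # a-circumflex-acute, Cyrillic hard sign, c-cedilla
)

for name in names:
    for _, code_point, ok in check_chars(name):
        print(code_point, "representable" if ok else "not representable")
```

Under that assumption, the characters that displayed as "?" are the ones with no `cp1252` equivalent, which would also explain why the split appears to fall at U+00FF.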
 

Dec 5, 2009
9
0
#4
> TCC does not support UTF-8 output.
>
> Rex Conn
> JP Software
I'm genuinely surprised by that. May I suggest proper codepage support for the next version then?

Or, at least, something that can be piped through to do any required conversion?

Actually, would it be possible to do something like that with a plugin?
 
Dec 5, 2009
9
0
#5
Update after more googling: the information is conflicting. cp10000 is also claimed to be a Mac codepage, not utf-16. In fact it's not clear that the Windows console supports anything other than utf-8 for Unicode.

If that's true (checking further now), it means TCC's lack of utf-8 support is the killer blow. And IMHO the manual should be changed to state exactly what Unicode support entails - an interactive textual console application can't reasonably claim general Unicode support if it doesn't support the only way that a console can encode Unicode (utf-8).
 

rconn

#6
> Update after more googling- conflicting info, cp10000 is also claimed
> to be a Mac codepage not utf-16. In fact it's not clear that the
> Windows console supports anything other than utf-8 for Unicode.
The Windows console doesn't support anything other than utf-16 for Unicode.
(This is not TCC-specific.)

Rex Conn
JP Software
 

rconn

#7
> My test TCC is using unicode ("tcc /u ...."), utf-16 ("chcp 10000") and
> Courier New (seemingly the most comprehensive standard monospaced
> Unicode font provided with XP).
The TCC /U option only affects output written to files (normally via
redirection). All Windows console internal storage & display output is
Unicode (utf-16) regardless of the code page; it's just a matter of what
kind of translation Windows is doing from the keyboard and/or file system to
what you see on the screen. (And that's dependent on the console font and
the code page.)

Rex Conn
JP Software
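Since /U only changes what is written to files, the quickest check of what a redirected file actually contains is to look at its first bytes. A minimal Python sketch follows; the function name `sniff_bom` is hypothetical, and it only recognises byte order marks, so a file without a BOM is reported as "unknown" rather than guessed at.

```python
import codecs


def sniff_bom(data: bytes) -> str:
    """Name the encoding implied by a leading byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Note: a UTF-32 LE BOM also begins with FF FE; this sketch does not
    # distinguish that case and would report it as UTF-16 LE.
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return "unknown"


# Example: the first bytes of a file that starts with "D" in UTF-16 LE.
print(sniff_bom(b"\xff\xfeD\x00"))
```

To use it on a real file, read the first few bytes in binary mode (for example `open(path, "rb").read(4)`) and pass them in.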
 

rconn

#8
> ---Quote (Originally by rconn)---
> TCC does not support UTF-8 output.
>
> Rex Conn
> JP Software
> ---End Quote---
> I'm genuinely surprised by that. May I suggest proper codepage support
> for the next version then?
>
> Or, at least, something that can be piped through to do any required
> conversion?
>
> Actually, would it be possible to do something like that with a plugin?
CMD behaves the same way as TCC (i.e., no utf-8 output). Not sure why you
would want it -- you're the first person in 15 years that's asked for utf-8.
(Windows internally is all utf-16.)

TCC *could* generate it, though (1) it'd be slow, and (2) almost nothing
else in Windows would be able to recognize it.

It would be simple to write a plugin that took utf-16 in STDIN and wrote
utf-8 to STDOUT.

Rex Conn
JP Software
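The conversion described above does not have to be a TCC plugin. As a rough sketch, the same utf-16 to utf-8 step can be run in Python on a file that was redirected with /U. This is not the TCC plugin API; the function names and the two-argument command line are assumptions made for the example, and input without a BOM is assumed to be little-endian.

```python
import codecs
import sys


def utf16_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16 bytes (with or without BOM) to UTF-8 without BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):
        text = data[2:].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):
        text = data[2:].decode("utf-16-be")
    else:
        # ASSUMPTION: no BOM means little-endian, the usual Windows order.
        text = data.decode("utf-16-le")
    return text.encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a UTF-16 file and write its contents as UTF-8."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(utf16_to_utf8(raw))


# Hypothetical usage: python utf16to8.py dir_utf16.txt dir_utf8.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    convert_file(sys.argv[1], sys.argv[2])
```

Working in binary mode throughout keeps line endings exactly as TCC wrote them; only the character encoding changes.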