
TCC doesn't respect active codepage?

Discussion in 'Support' started by NeBlackCat, Dec 6, 2009.

  1. NeBlackCat

    Joined:
    Dec 5, 2009
    Messages:
    9
    Likes Received:
    0
    If I enable Unicode mode and redirect (say) some DIR output to a file, it always appears to be encoded in UTF-16 (with BOM), no matter what I set the codepage to (e.g. "CHCP 65001" for UTF-8).

    Is there a way to make it respect the console's codepage setting? Sometimes it's quite important for TCC's output to be compatible with other software that does respect the current codepage.
     
  2. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,854
    Likes Received:
    83
    TCC does not support UTF-8 output.

    Rex Conn
    JP Software
     
  3. NeBlackCat

    Joined:
    Dec 5, 2009
    Messages:
    9
    Likes Received:
    0
    Further to the above, something else I just noticed.

    My test TCC is using unicode ("tcc /u ...."), utf-16 ("chcp 10000") and Courier New (seemingly the most comprehensive standard monospaced Unicode font provided with XP).

    So, if TCC is outputting utf-16, it should be capable of accurately displaying any filenames that I can create in Explorer.

    So I made the following test files (attached as ZIP, they're only a byte each):

    Code:
    1_@bµ.txt
    2_a%20b.txt
    3_a[b'c_3.txt
    4_a`b]c.txt
    5_a`b'c.txt
    6_ab'c.txt
    7_åßĉ.txt
    8_ấъç.txt
    
    and tried a DIR on them:
    Code:
    [C:\Projects\xcrc\src\test.probs] dir /b
    1_@bµ.txt
    2_a%20b.txt
    3_a[b'c_3.txt
    4_a`b]c.txt
    5_a`b'c.txt
    6_ab'c.txt
    7_åß?.txt
    8_??ç.txt
    Observe that the last two are wrong, though I can redirect via the clipboard to a utf-16 editor (also displaying in Courier New) and they display correctly. Hence it seems a "display" issue rather than a "wrong data" issue.

    Looking at the Unicode code points of the non-ASCII characters in the two failing filenames, it seems that characters whose high byte is zero (code points up to 0xFF) display correctly, while those with a nonzero high byte don't:

    Code:
    7_åßĉ.txt -> 7_åß?.txt
      å = 0x000E5  (displayed ok)
      ß = 0x000DF  (displayed ok)
      ĉ = 0x00109  (displayed wrong)
    
    8_ấъç.txt -> 8_??ç.txt:
      ấ = 0x01ea5  (displayed wrong)
      ъ = 0x0044a  (displayed wrong)
      ç = 0x000e7  (displayed ok)
    

    Any thoughts? Am I being stupid and missing something obvious?
     


  4. NeBlackCat

    Joined:
    Dec 5, 2009
    Messages:
    9
    Likes Received:
    0
    I'm genuinely surprised by that. May I suggest proper codepage support for the next version then?

    Or, at least, something that can be piped through to do any required conversion?

    Actually, would it be possible to do something like that with a plugin?
     
  5. NeBlackCat

    Joined:
    Dec 5, 2009
    Messages:
    9
    Likes Received:
    0
    Update after more googling: the information is conflicting. Code page 10000 is also claimed to be a Mac codepage, not utf-16. In fact, it's not clear that the Windows console supports anything other than utf-8 for Unicode.

    If that's true (checking further now), it means TCC's lack of utf-8 support is the killer blow. And IMHO the manual should be changed to state exactly what Unicode support entails - an interactive textual console application can't reasonably claim general Unicode support if it doesn't support the only way that a console can encode Unicode (utf-8).
     
  6. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,854
    Likes Received:
    83
    The Windows console doesn't support anything other than utf-16 for Unicode.
    (This is not TCC-specific.)

    Rex Conn
    JP Software
     
  7. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,854
    Likes Received:
    83
    The TCC /U option only affects output written to files (normally via
    redirection). All Windows console internal storage & display output is
    Unicode (utf-16) regardless of the code page; it's just a matter of what
    kind of translation Windows is doing from the keyboard and/or file system to
    what you see on the screen. (And that's dependent on the console font and
    the code page.)
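
    One way to check what a redirected file actually contains is to look at its first bytes for a byte order mark. The following is a minimal sketch in Python, independent of TCC; the function name `sniff_bom` is hypothetical, and it only distinguishes the UTF-8 and UTF-16 BOMs (a UTF-32 little-endian BOM would be misreported as utf-16).

```python
import codecs

def sniff_bom(data: bytes) -> str:
    """Guess the encoding of a byte string from its leading BOM, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    return "unknown"

# Example: Python's "utf-16" codec writes a BOM, much like the redirected
# output described in the first post.
sample = "7_\u00e5\u00df\u0109.txt".encode("utf-16")
print(sniff_bom(sample))
```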

    Rex Conn
    JP Software
     
  8. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,854
    Likes Received:
    83
    CMD behaves the same way as TCC (i.e., no utf-8 output). Not sure why you
    would want it -- you're the first person in 15 years that's asked for utf-8.
    (Windows internally is all utf-16.)

    TCC *could* generate it, though (1) it'd be slow, and (2) almost nothing
    else in Windows would be able to recognize it.

    It would be simple to write a plugin that took utf-16 in STDIN and wrote
    utf-8 to STDOUT.
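
    As a rough illustration of that conversion step (a standalone Python sketch, not a TCC plugin and not the plugin API; the function name `utf16_to_utf8` is hypothetical), the core is a decode followed by an encode:

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16 bytes to UTF-8 bytes.

    The "utf-16" codec consumes a leading BOM if present; without one it
    assumes the platform's native byte order (an assumption worth checking
    for BOM-less input).
    """
    return data.decode("utf-16").encode("utf-8")

# Example with one of the characters that failed to display above.
converted = utf16_to_utf8("\u0109".encode("utf-16"))
print(converted.hex())
```

    To use it as a pipe filter, one would read from `sys.stdin.buffer` and write the result to `sys.stdout.buffer`, so that no text-mode translation gets in the way.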

    Rex Conn
    JP Software
     
