Fixed Using codepage 65001 (UTF-8) breaks non-ASCII characters

#1
When typing any non-US-English character, an empty character (probably 0x00, NUL) is displayed to the right of the typed character. It doesn't overwrite the underlying character (if any), but the string is displayed as if you had pressed `space` after each character. The weirder part is that this only manifests itself when the non-US-English characters are typed on two separate lines, which leads me to believe this is a problem with the output stream.

1. Open TCC 20 (either standalone or under Take Command).
2. Type `chcp 65001` to switch to the UTF-8 "codepage".
3. Use any non-US English keyboard (for my example, I'm using Greek, but I've done this with Russian and German keyboards).
4. Type any character outside the 32-127 range (my example: τεστ, which means "test" in Greek) and press Enter (or type `echo τεστ` and press Enter, to verify that the error stream is not the one with the problem).
5. Repeat step 4.
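
As a quick sanity check on step 4, the following Python sketch (not part of the original report) lists which characters of the test input fall outside the 32-127 range and which bytes each one becomes in UTF-8. The helper name `describe` is made up for illustration. It does not reproduce the TCC behavior; it only shows what the console is being asked to display.

```python
def describe(text):
    """Return (char, code point, UTF-8 bytes as hex) for each character outside 32-127."""
    rows = []
    for ch in text:
        if not 32 <= ord(ch) <= 127:
            utf8_hex = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
            rows.append((ch, f"U+{ord(ch):04X}", utf8_hex))
    return rows


# Print only the code point and the bytes, so the output is plain ASCII.
for _, code_point, utf8_hex in describe("echo τεστ"):
    print(code_point, utf8_hex)
```

Each Greek letter in the example is a single code point that takes two bytes in UTF-8, while the `echo ` prefix stays within the ASCII range and is skipped.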

Actual results:


Expected results:
(This worked with TCC 19)


Notes:

I've done this with empty settings (they were generated from scratch), just to make sure it wasn't some setting that messed things up.
 
#3
BTW, isn't the version supposed to be 10.0.14393? Why does TCC go to the fallback version (6.3)?
 
#5
If it's a Windows issue, why does it work with TCC 19 and it doesn't work with TCC 20?

EDIT: This happens exactly the same under Take Command (I just wanted to isolate the problem).

 

rconn

Administrator
Staff member
#6
> If it's a Windows issue, why does it work with TCC 19 and it doesn't work with TCC 20?
I keep telling people that Windows does not actually support UTF-8 (other than in a handful of conversion APIs), but nobody wants to listen ...

It behaves differently in v19 vs. v20 because v20 uses different APIs to fix a different problem with Take Command: Windows uses different code pages in GUI windows and console windows, and v20 goes to great lengths to try to reconcile those differences.

You're seeing blanks in the output because TCC is querying Windows for the width of the characters, and Windows is returning "2". I could add a hack to check for codepage 65001 and always assume they're really single-width characters, though that would break Japanese / Korean / Chinese support.
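
To illustrate the trade-off described above, here is a small Python sketch that estimates column widths from the Unicode East Asian Width property. This is not how TCC or Windows computes widths, and it may or may not be related to the value Windows reports; `display_width` and its `ambiguous_wide` flag are hypothetical names used only for illustration.

```python
import unicodedata


def display_width(text, ambiguous_wide=False):
    """Estimate terminal columns for text (simplified: ignores combining marks)."""
    total = 0
    for ch in text:
        kind = unicodedata.east_asian_width(ch)
        if kind in ("W", "F"):
            total += 2  # wide / fullwidth, e.g. CJK ideographs
        elif kind == "A" and ambiguous_wide:
            total += 2  # "ambiguous" characters, counted as wide on request
        else:
            total += 1  # narrow, halfwidth, neutral
    return total


print(display_width("τεστ"))
print(display_width("τεστ", ambiguous_wide=True))
print(display_width("日本語"))
```

Greek letters are classified as "ambiguous" width in Unicode, so the test string counts as four columns or eight depending on the flag, whereas CJK ideographs are always two columns each. A blanket "always single-width" rule would therefore misalign CJK text.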

The question I have for you is -- why are you using 65001? Do you think it will provide some benefit?
 
#7
> The question I have for you is -- why are you using 65001? Do you think it will provide some benefit?
I use 65001 (but only temporarily, not persistently in a TCC session) when outputting UTF-8 text. It has worked fine so far.
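
For readers who want the same "temporary, not persistent" behavior from a script, the sketch below switches the console output code page for the duration of a block and then restores it. It is an assumption-laden illustration, not the poster's method: `console_output_codepage` is a hypothetical helper, it relies on the Win32 `GetConsoleOutputCP` / `SetConsoleOutputCP` calls via `ctypes`, and it changes only the output code page.

```python
import contextlib
import ctypes
import sys


@contextlib.contextmanager
def console_output_codepage(codepage):
    """Temporarily switch the console output code page (Windows only).

    Yields the previous code page, or None when nothing was changed
    (non-Windows platform, no console attached, or the switch failed).
    """
    if sys.platform != "win32":
        yield None
        return
    kernel32 = ctypes.windll.kernel32
    previous = kernel32.GetConsoleOutputCP()  # 0 is treated here as "no console"
    if previous == 0 or not kernel32.SetConsoleOutputCP(codepage):
        yield None
        return
    try:
        yield previous
    finally:
        kernel32.SetConsoleOutputCP(previous)  # restore the original code page


if __name__ == "__main__":
    with console_output_codepage(65001):
        # Write raw UTF-8 bytes while the console expects UTF-8.
        sys.stdout.buffer.write("τεστ\n".encode("utf-8"))
        sys.stdout.flush()
```

On non-Windows systems the context manager does nothing, so the same script can run unchanged elsewhere.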
 
#8
> The question I have for you is -- why are you using 65001? Do you think it will provide some benefit?
Well, yes, it does. It's more-or-less required if you're working with Python on the console.

> I keep telling people that Windows does not actually support UTF-8
Well, it seems to be working in PowerShell/CMD.