Character 160 (nbsp) screen vs. clip:

[Attached screenshot: 1722260676527.webp]


After pasting,

[Attached screenshot: 1722260716077.webp]


What's happening and how do I get the real thing into the clipboard?

TMP1: works differently (and AFAICT you can't TEE to it).

[Attached screenshot: 1722261088363.webp]


And CLIP: automatically gets a trailing CRLF (which I didn't want) while TMP1: does not.
 
I don't see this:

Code:
C:\>ver

TCC  33.00.2 x64   Windows 11 [Version 10.0.22621.3880]

C:\>option unicodeoutput
unicodeoutput=No

C:\>echos **%@char[160]** | tee clip:
** **

C:\>echo %@ascii[%@clip[]]
42 42 160 42 42

C:\>
 
Here,

Code:
v:\> echos **%@char[160]** | tee clip:
** **

v:\> echo %@ascii[%@clip[]]
42 42 225 42 42

I use CP 1252 where, according to the few sources I found, 0xA0 is NBSP.
 
Okay, using CP1252 I can reproduce that:

Code:
C:\>chcp 1252
Active code page: 1252

C:\>option //unicodeoutput=no

C:\>echos **%@char[160]** | tee clip:
** **

C:\>echo %@ascii[%@clip[]]
42 42 225 42 42

C:\>option //unicodeoutput=yes

C:\>echos **%@char[160]** | tee clip:
** **

C:\>echo %@ascii[%@clip[]]
42 42 160 42 42

C:\>
 
Any idea what's going on? I have no clue. 0xA0 is NBSP in CP 1252 and Unicode. And it's 'á' in CP 437.

And a TCMD anomaly???? Now, in TCMD, the first 5 characters below were pasted with Ctrl-V and did not have a trailing newline. The second 5 characters were pasted with TCMD's right-click menu and did have a trailing newline.

[Attached screenshot: 1722275078344.webp]
 
This isn't DOS, and you aren't creating ANSI text (well, 1252 is "pseudo ANSI").

Your string is UTF16 with a rather odd character in the middle. When it's written to a temp file (to support the CLIP: pseudo-device), it is converted from UTF16 to ASCII, and that's where the 'á' is created.

Solution - use Unicode (either UTF16 or UTF8) everywhere, and don't try to mix and match Unicode and ASCII when you're using extended characters.
 
I don't understand a word of that. How do I get both of these to produce a NBSP?
 
The byte 160 (0xA0, 0b10100000) is interpreted as NBSP ' ' in Unicode (and Windows-1252) while it corresponds to 'á' in CP 437, which is the encoding Windows will use as "ASCII" if you don't tell it otherwise. This mixing and matching of encodings is what produces the problem.
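
To make that concrete, here is a minimal sketch in Python (not part of this thread; used only because its codec names make the code pages explicit). It decodes the single byte 0xA0 under each encoding mentioned above. Nothing here calls TCC or the Windows API.

```python
# Decode the same single byte under the code pages discussed in this thread.
raw = bytes([160])  # 0xA0

as_cp1252 = raw.decode("cp1252")    # Windows-1252 ("ANSI")
as_latin1 = raw.decode("latin-1")   # same values as the first 256 Unicode code points
as_cp437 = raw.decode("cp437")      # OEM code page 437

for name, ch in [("cp1252", as_cp1252), ("latin-1", as_latin1), ("cp437", as_cp437)]:
    print(f"{name:8} -> U+{ord(ch):04X} ({ord(ch)}) {ch!r}")
```

The CP 437 reading comes out as 'á', whose Unicode value is 225, the same number %@ascii reported in the failing case.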
 
I'm not explicitly asking for any particular encoding. And when I don't, I kinda expect TCC to be consistent. Who's mixing and matching? It's not me, at least not on purpose. According to Google, CP 65001 is UTF8, and that's no better.

Code:
v:\> chcp 65001
Active code page: 65001

v:\> echos **%@char[160]** | tee clip:
** **

v:\wordle> echo %@clip[0]
** **

And, my windows ACP is 1252.

Code:
v:\> echo %@regquery[HKLM\system\currentcontrolset\Control\Nls\CodePage\ACP]
1252
 
You are:

1. Creating a UTF16 string
2. Writing that UTF16 string to an ASCII file (necessary because a temp file is needed to handle the CLIP: pseudo-device)
3. Which calls the Windows API WideCharToMultiByte
4. WideCharToMultiByte is using your 1252 codepage (you may be the only one using that codepage in the last 30 years ...)
5. WideCharToMultiByte converts your NBSP Unicode character to 'á' for a 1252 codepage (if you don't like that, you can complain to Microsoft, but I doubt you'll get much joy from them) and is written to the ASCII file.
6. The ASCII file is then opened, the line read, and converted back to UTF16 before being copied into the clipboard. But the character has already been changed (in #5).

ANSI (or pseudo-ANSI in the case of CP 1252) is not Unicode.
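
The round trip in steps 1-6 can be sketched in Python. This is only a model, not TCC's code: `round_trip` is a hypothetical helper, and the choice of CP 1252 for writing and CP 437 for reading back is an assumption that happens to reproduce the 225 seen earlier in the thread. Which code pages TCC actually uses at each step is not confirmed here.

```python
def round_trip(text, write_cp, read_cp):
    """Encode text to 8-bit bytes with one code page, decode with another,
    and return the resulting code points (similar to what %@ascii prints)."""
    eight_bit = text.encode(write_cp)     # wide string -> 8-bit temp file
    restored = eight_bit.decode(read_cp)  # 8-bit temp file -> wide string again
    return [ord(ch) for ch in restored]

text = "**\u00a0**"  # two asterisks, NBSP, two asterisks

matched = round_trip(text, "cp1252", "cp1252")     # same code page both ways
mismatched = round_trip(text, "cp1252", "cp437")   # ASSUMED mismatch, for illustration

print(matched)
print(mismatched)
```

When both conversions use the same code page the NBSP survives; it only changes when the byte is written under one code page and read back under another.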
 
5. WideCharToMultiByte converts your NBSP Unicode character to 'á' for a 1252 codepage (if you don't like that, you can complain to Microsoft, but I doubt you'll get much joy from them) and is written to the ASCII file.
So that (which is seemingly wrong) is the problem, eh? Char 160 is nbsp in CP 1252.
 
So what do I do if I want to be able to use nbsp ( also char 177, '±') freely and have it appear the same everywhere in TCC and have files written by TCC to be (somehow) 8-bit encoded?
 
You're obsessed with ASCII. Windows is UTF16, the rest of the world is UTF8. ASCII & 8-bit characters died with Win98 & DOS 7.

If you want your files to be (mostly) 8-bit encoded, use UTF8. It'll be ASCII everywhere except where it needs to be two (or three or four) byte characters.
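
A small Python sketch of that last point (the sample characters are just illustrative picks): plain ASCII stays one byte in UTF-8, while NBSP and '±' take two bytes, and other characters take three or four.

```python
# UTF-8 length of a few sample characters: ASCII, NBSP, plus-minus, euro sign, an emoji.
samples = ["*", "\u00a0", "\u00b1", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

lengths = [len(ch.encode("utf-8")) for ch in samples]
```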
 
OK, but precisely how do I do that ... what settings in the OPTION dialog? And what about my Windows ACP/OEMCP?
 
There must be more to it!

Code:
v:\> option utf8
utf8=Yes

v:\> option utf8output
utf8output=Yes

v:\> chcp 65001
Active code page: 65001

v:\> echo **%@char[160]** | tee clip:
** **

v:\> echo %@clip[0]
** **

v:\> echo **%@char[177]** | tee clip:
**±**

v:\> type clip:
**┬▒**
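
The '┬▒' in that last output is what the two UTF-8 bytes of '±' look like when read as CP 437. A minimal Python sketch, assuming (this thread does not confirm it) that the clipboard text was written as UTF-8 and then displayed through an OEM 437 conversion:

```python
# '±' is U+00B1; in UTF-8 it becomes the two bytes C2 B1.
utf8_bytes = "**\u00b1**".encode("utf-8")

# ASSUMPTION: the reader interprets those bytes as CP 437 instead of UTF-8.
misread = utf8_bytes.decode("cp437")

print(utf8_bytes.hex(" "))
print(misread)
```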
 
Didn't Rex add syntax to piping and redirection to specify the output encoding?

Code:
C:\>chcp
Active code page: 1252

C:\>option unicodeoutput
unicodeoutput=No

C:\>echo **%@char[160]** |:u tee clip:
** **

C:\>hexdump clip:
00000000  ff fe 2a 00 2a 00 a0 00  2a 00 2a 00 0d 00 0a 00                                                    · * *   * * · ·


C:\>

Wait, where did that Byte Order Mark come from?
 
You can use |:u to pipe as UTF-16. Which I think is generally what you want, with the clipboard. |:8 would work too; UTF-8 and UTF-16 map one-to-one to each other.
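
The "one-to-one" remark can be checked with a short Python sketch: converting between UTF-8 and UTF-16 in either direction loses nothing, so either form should carry NBSP and '±' intact. This illustrates the encodings only; it says nothing about how |:u or |:8 are implemented.

```python
text = "**\u00a0** **\u00b1**"

as_utf16 = text.encode("utf-16-le")
as_utf8 = text.encode("utf-8")

# Transcode each form into the other and compare with the direct encoding.
via_utf16 = as_utf16.decode("utf-16-le").encode("utf-8")
via_utf8 = as_utf8.decode("utf-8").encode("utf-16-le")

print(via_utf16 == as_utf8, via_utf8 == as_utf16)
```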
 
@Charles Dye

Thank you, Charles - I saw that |:u in your example above ... |:8 does also work with CP 65001 ... but what is with the BOM (ff fe) - where did that come from?

BTW: which "hexdump" do you use?
 
I'm also confused by the BOM; sometimes TCC writes a BOM, sometimes it doesn't. But I think in most cases it just doesn't matter. Most programs will silently ignore the initial BOM.
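
For what it's worth, here is a Python sketch that decodes the exact bytes from the hexdump above. It shows what "ignoring the BOM" means in practice: a BOM-aware decoder consumes ff fe and uses it to pick little-endian, while a decoder told the byte order up front keeps it as an invisible U+FEFF character.

```python
import codecs

# The 16 bytes shown in the hexdump of clip: above.
dump = bytes.fromhex("ff fe 2a 00 2a 00 a0 00 2a 00 2a 00 0d 00 0a 00")

has_bom = dump.startswith(codecs.BOM_UTF16_LE)

bom_aware = dump.decode("utf-16")        # BOM is consumed and selects the byte order
bom_unaware = dump.decode("utf-16-le")   # BOM survives as a leading U+FEFF

print(has_bom)
print([ord(ch) for ch in bom_aware])
print([ord(ch) for ch in bom_unaware])
```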

The hexdump is my own, but I think you would get the same results from any other. The data is the data, yeah?
 
Ok, thank you!

I HAVE a hexdump, but it can't read things like "clip:", so a direct view/dump of that isn't possible, unfortunately. Not really a problem, of course; redirecting works well enough too.
 