Declined FFIND and code page?

vefatica · Oct 22, 2021

Among other things, TYPE wmips.btm gives this line (which looks as desired).

Code:

:: rewrite YYYYMMDDHHMMSS.mmmmmm±ZZZ

Among other things FFIND /s /t"write" *.btm gives this (which doesn't look as desired).

Code:

---- V:\wmips.btm
:: rewrite YYYYMMDDHHMMSS.mmmmmm┬▒ZZZ

Could FFIND be made to do whatever TYPE is doing?

Charles Dye · Oct 22, 2021

Looks like the input file is UTF-8. Does it have a BOM?

vefatica · Oct 22, 2021

Charles Dye said:
Looks like the input file is UTF-8. Does it have a BOM?

There's no BOM. The character in question is encoded as two bytes.

rconn · Oct 22, 2021

FFIND could do that, if you don't mind slowing it down (probably somewhere between 2x and 10x slower, depending on the mix of files it's reading).

Charles Dye · Oct 22, 2021

rconn said:
FFIND could do that, if you don't mind slowing it down (probably somewhere between 2x and 10x slower, depending on the mix of files it's reading).

What would be the slowdown? Displaying UTF-8, or detecting UTF-8 without a BOM?

vefatica · Oct 22, 2021

Charles Dye said:
What would be the slowdown? Displaying UTF-8, or detecting UTF-8 without a BOM?

"Slowdown" ... huh?

TYPE gets it right. FFIND gets it wrong.

rconn · Oct 22, 2021

Charles Dye said:
What would be the slowdown? Displaying UTF-8, or detecting UTF-8 without a BOM?

Without a BOM, TCC would have to read each file twice to determine the encoding. TYPE does this because it doesn't slow it down noticeably (the write to the screen is much slower than the file reads).

rconn · Oct 22, 2021

vefatica said:
"Slowdown" ... huh?

TYPE gets it right. FFIND gets it wrong.

No, TYPE reads every file twice because you didn't add the BOM. FFIND could do that too, but you'd pay a heavy price in speed.

vefatica · Oct 22, 2021

I can save as UTF8 but my editor doesn't add a BOM.

vefatica · Oct 22, 2021

vefatica said:
I can save as UTF8 but my editor doesn't add a BOM.

In fact, it removes any UTF8 BOM.

Charles Dye · Oct 22, 2021

rconn said:
Without a BOM, TCC would have to read each file twice to determine the encoding. TYPE does this because it doesn't slow it down noticeably (the write to the screen is much slower than the file reads).

If there is no Byte Order Mark, then I think it's perfectly reasonable to assume the OEM code page, 437 or whatever. But if there is a UTF-8 BOM, then you should probably assume UTF-8.

rconn · Oct 22, 2021

Charles Dye said:
If there is no Byte Order Mark, then I think it's perfectly reasonable to assume the OEM code page, 437 or whatever. But if there is a UTF-8 BOM, then you should probably assume UTF-8.

That's how it works now.
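
The rule described above (UTF-8 only when a BOM is present, otherwise the OEM code page) can be sketched in a few lines of Python. This is only an illustration of the rule, not TCC's code; the function name `pick_encoding` and the default of code page 437 are assumptions made for the example.

```python
import codecs

def pick_encoding(head: bytes, oem: str = "cp437") -> str:
    """Choose a decoder by looking only at the first bytes of a file."""
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and drops the BOM
    return oem              # no BOM: assume the single-byte OEM code page

bom_file = codecs.BOM_UTF8 + "±".encode("utf-8")
plain_file = "±".encode("utf-8")  # same character, but no BOM

print(bom_file.decode(pick_encoding(bom_file)))      # ±
print(plain_file.decode(pick_encoding(plain_file)))  # ┬▒
```

The second line of output is the same mojibake FFIND printed for the BOM-less file.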

rconn · Oct 22, 2021

vefatica said:
In fact, it removes any UTF8 BOM.

The default Linux behavior is to not use UTF8 BOMs. But that only works on Linux because ALL files are considered to be UTF8, so apps don't need to check for the encoding.

In Windows, files are assumed to be ASCII unless the app also supports UTF16. A few apps (including TCC) also support UTF8, but lacking a BOM the only way to distinguish ASCII and UTF8 is to read the file and look for the extended bytes.
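
As a rough illustration of "read the file and look for the extended bytes", here is a minimal Python sketch. The function `classify` and its three labels are hypothetical; it only shows why the data has to be examined before it can be decoded, and says nothing about how TCC implements the check.

```python
def classify(data: bytes) -> str:
    """Guess the encoding family of BOM-less data by inspecting its bytes."""
    if data.isascii():
        return "ascii"        # no byte >= 0x80, so every code page agrees
    try:
        data.decode("utf-8")  # extended bytes must form valid UTF-8 sequences
    except UnicodeDecodeError:
        return "single-byte"  # treat as ANSI/OEM text
    return "utf-8"

print(classify(b":: rewrite"))        # ascii
print(classify("±".encode("utf-8")))  # utf-8
print(classify(b"\xb1"))              # single-byte
```

The verdict is only known once the whole buffer has been inspected, which is the extra pass over the file discussed above.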

vefatica · Oct 22, 2021

I still don't get it.

Now the +/- character is a single byte (0xb1) in the BTM file. It looks OK in my editor with save-encoding "default" (whatever that is). ECHO %@CHAR[0xb1] looks OK in a console after CHCP with 437 or 1252. The +/- character is 0xb1 in my font (Consolas, according to CHARMAP). Now both TYPE and FFIND show ▒. [TYPE showed ± earlier because I was piping to GREP.]

So what are TYPE and FFIND doing?
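
For reference, the symptom is consistent with the single byte 0xB1 being interpreted under two different code pages. A small Python check (Python's `cp437` and `cp1252` codecs stand in for the console code pages here) shows the difference:

```python
raw = b"\xb1"  # the single byte stored in the BTM file

for code_page in ("cp437", "cp1252"):
    print(code_page, raw.decode(code_page))
# cp437 ▒
# cp1252 ±
```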

vefatica · Oct 22, 2021

My mistake, I think. According to CHARMAP, that character is in "DOS:WesternEurope" and not in "DOS:UnitedStates" (are they 437 and 1252, in that order?).

Charles Dye · Oct 22, 2021

TCC uses Unicode internally. @CHAR returns a Unicode character.

± is Unicode character 00B1. In the UTF-8 encoding, that works out to 0xC2 0xB1.

0xC2 0xB1 in code page 437 is ┬▒

0xC2 0xB1 in code page 1252 is Â±
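
The byte values above can be reproduced with a short Python snippet. It only restates the arithmetic, using Python's built-in `cp437` and `cp1252` codecs as stand-ins for the console code pages.

```python
encoded = "\u00b1".encode("utf-8")  # ± as UTF-8
print(encoded.hex())                # c2b1

for code_page in ("cp437", "cp1252"):
    # Misread the two UTF-8 bytes as two single-byte characters.
    print(code_page, encoded.decode(code_page))
# cp437 ┬▒
# cp1252 Â±
```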

vefatica · Oct 22, 2021

What if it's not in UTF8? My consoles use 1252; at least that's what this says.

Code:

v:\> echo %@regquery[hkcu\console\codepage]
1252

But I don't suppose there's a way to tell the WideChar/Multibyte conversion functions to use the default console code page.

I didn't think so earlier, but TYPE and FFIND do OK after CHCP 1252. Is there a way to tell TCC what code page to use by default?

rconn · Oct 23, 2021

TCC uses the default code page provided by Windows for the console session. If you want to change it, either change the system default or put a CHCP in your TCSTART.

However, this has nothing to do with FFIND detecting UTF8 files without a BOM. (It won't, and I'm not going to introduce a "feature" that slows it down by 10x in order to search every file on the minuscule chance they might be a UTF8 file w/o a BOM.)

vefatica · Oct 23, 2021

rconn said:
TCC uses the default code page provided by Windows for the console session. If you want to change it, either change the system default or put a CHCP in your TCSTART.

I'm not sure what you mean there. How do you figure out what code page that is? I have this.

Code:

v:\> echo %@regquery[hkcu\console\codepage]
1252

and TCC starts up like this.

Code:

v:\> chcp
Active code page: 437

rconn said:
However, this has nothing to do with FFIND detecting UTF8 files without a BOM. (It won't, and I'm not going to introduce a "feature" that slows it down by 10x in order to search every file on the minuscule chance they might be a UTF8 file w/o a BOM.)

This isn't (never was) about UTF8. The file is not encoded.

vefatica · Oct 23, 2021

I think the problem is that setting HKCU\Console\CodePage doesn't do anything. Although I have

Code:

v:\> echo %@regquery[hkcu\console\codepage]
1252

(with no CodePage setting specific to TCC)

TCC (and everything else) starts with CP 437.

But if I add

Code:

echo %@regquery[hkcu\console\d:_tc28_tcc.exe\CodePage]
1252

Then TCC starts with CP 1252 and FFIND shows ± just fine.

rconn · Oct 23, 2021

TCC doesn't set the code page unless you run CHCP. TCC also doesn't display text; that's done by Windows according to the current code page. There are several ways you can set the default user code page, none of which involve TCC.

If your file isn't UTF8, why did you say that the problematic character was encoded as two bytes (which is DEFINITELY not ASCII)? Are you using some non-ASCII and non-UTF8 MBCS?

vefatica · Oct 23, 2021

rconn said:
If your file isn't UTF8, why did you say that the problematic character was encoded as two bytes (which is DEFINITELY not ASCII)? Are you using some non-ASCII and non-UTF8 MBCS?

That was temporary (as I discovered later); it happened when I was messing with TextPad's save-as encoding. Since TextPad wasn't adding a BOM, it screwed things up.

AnrDaemon · Oct 31, 2021

rconn said:
FFIND could do that too, but you'd pay a heavy price in speed.

Not really, if you limit your expectations to UTF-8 first, then ANSI/OEM.
You would read the file straight through until you hit an invalid UTF-8 byte sequence, then switch the reading mode to SBCS and re-read the last block. That would have a negligible impact on speed but greatly improve usability in the modern age.
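
A minimal Python sketch of this single-pass idea follows. It is not FFIND's code; `decode_blocks`, the block size, and the `cp437` fallback are assumptions chosen for illustration. It decodes block by block as UTF-8 and, at the first invalid sequence, re-decodes that block (plus any bytes carried over from the previous one) with the single-byte code page.

```python
import codecs
import io

def decode_blocks(stream, fallback="cp437", block_size=4096):
    """Yield decoded text, trying UTF-8 first and a single-byte page on failure."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    utf8_ok = True
    while True:
        block = stream.read(block_size)
        if utf8_ok:
            # Bytes of an incomplete sequence held over from the previous block.
            pending = decoder.getstate()[0]
            try:
                # An empty block means end of file, so flush the decoder.
                text = decoder.decode(block, final=not block)
            except UnicodeDecodeError:
                utf8_ok = False  # switch modes for the rest of the file
                text = (pending + block).decode(fallback)
        else:
            text = block.decode(fallback)
        if text:
            yield text
        if not block:
            break

print("".join(decode_blocks(io.BytesIO("a±b".encode("utf-8")))))  # a±b
print("".join(decode_blocks(io.BytesIO(b"a\xb1b"))))              # a▒b
```

One limitation of the sketch: text emitted before the first invalid sequence has already been decoded as UTF-8, so a file that mixes valid-looking and invalid sequences may be decoded inconsistently.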
