Declined FFIND and code page?

Among other things, TYPE wmips.btm gives this line (which looks as desired).

Code:
:: rewrite YYYYMMDDHHMMSS.mmmmmm±ZZZ

Among other things, FFIND /s /t"write" *.btm gives this (which doesn't look as desired).

Code:
---- V:\wmips.btm
:: rewrite YYYYMMDDHHMMSS.mmmmmm┬▒ZZZ

Could FFIND be made to do whatever TYPE is doing?
 
FFIND could do that, if you don't mind slowing it down (probably somewhere between 2x and 10x slower, depending on the mix of files it's reading).

What would be the slowdown? Displaying UTF-8, or detecting UTF-8 without a BOM?
 
Without a BOM, TCC would have to read each file twice to determine the encoding. TYPE does this because it doesn't slow it down noticeably (the write to the screen is much slower than the file reads).

If there is no Byte Order Mark, then I think it's perfectly reasonable to assume the OEM code page, 437 or whatever. But if there is a UTF-8 BOM, then you should probably assume UTF-8.
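A minimal Python sketch of that rule, not anything TCC actually does: check for the UTF-8 BOM and otherwise fall back to an OEM code page. The function name `decode_by_bom` and the `cp437` default are assumptions made for illustration.

```python
import codecs

def decode_by_bom(data: bytes, oem_codepage: str = "cp437") -> str:
    """Decode as UTF-8 only when a UTF-8 BOM is present, else use the OEM code page."""
    if data.startswith(codecs.BOM_UTF8):
        # Strip the three BOM bytes (EF BB BF) and decode the rest as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: assume the (hypothetical default) OEM code page.
    return data.decode(oem_codepage)
```

With a BOM, the bytes C2 B1 come back as a single plus-minus sign; without one, the same two bytes are read as two code page 437 characters.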
 
In fact, it removes any UTF8 BOM. :confused:

The default Linux behavior is to not use UTF8 BOMs. But that only works on Linux because ALL files are considered to be UTF8, so apps don't need to check for the encoding.

In Windows, files are assumed to be ASCII unless the app also supports UTF16. A few apps (including TCC) also support UTF8, but lacking a BOM the only way to distinguish ASCII and UTF8 is to read the file and look for the extended bytes.
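To make "read the file and look for the extended bytes" concrete, here is a rough Python sketch. The function name `classify` and the three labels are made up for this example, and it reads the whole buffer, which is exactly the extra pass being discussed.

```python
def classify(data: bytes) -> str:
    """Guess the encoding of a BOM-less buffer by inspecting its bytes."""
    # Pure 7-bit content is plain ASCII (and therefore also valid UTF-8).
    if all(b < 0x80 for b in data):
        return "ascii"
    # Extended bytes are present: see whether they form valid UTF-8 sequences.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so presumably some single-byte code page.
        return "single-byte"
```

Note that a "utf-8" result is only a strong hint: a single-byte file can contain byte pairs that happen to be valid UTF-8.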
 
I still don't get it.

Now the +/- character is a single byte (0xb1) in the BTM file. It looks OK in my editor with save-encoding "default" (whatever that is). ECHO %@CHAR[0xb1] looks OK in a console after CHCP with 437 or 1252. The +/- character is 0xb1 in my font (Consolas, according to CHARMAP). Now both TYPE and FFIND show ▒. [TYPE showed ± earlier because I was piping to GREP.]

So what are TYPE and FFIND doing?
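For reference, the single byte 0xb1 can be checked outside TCC with a short Python snippet. It uses only Python's built-in `cp437` and `cp1252` codecs and says nothing about how TYPE or FFIND work internally.

```python
raw = b"\xb1"  # the single byte stored in the BTM file

# The same byte maps to different characters depending on the code page.
as_cp437 = raw.decode("cp437")    # '▒' (U+2592, medium shade)
as_cp1252 = raw.decode("cp1252")  # '±' (U+00B1, plus-minus sign)
```

So a file holding 0xb1 shows ▒ when the console is on code page 437 and ± on 1252.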
 
My mistake, I think. According to CHARMAP, that character is in "DOS:WesternEurope" and not in "DOS:UnitedStates" (are they 437 and 1252, in that order?).
 
TCC uses Unicode internally. @CHAR returns a Unicode character.

± is Unicode character 00B1. In the UTF-8 encoding, that works out to 0xC2 0xB1.

0xC2 0xB1 in code page 437 is ┬▒

0xC2 0xB1 in code page 1252 is Â±
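The same arithmetic can be reproduced in Python as a quick sanity check. This is only an illustration using Python's standard codecs, not TCC code.

```python
plus_minus = "\u00b1"  # Unicode PLUS-MINUS SIGN

# UTF-8 encodes U+00B1 as two bytes.
utf8_bytes = plus_minus.encode("utf-8")        # b'\xc2\xb1'

# Reading those two bytes as single-byte code pages gives two characters each.
seen_in_cp437 = utf8_bytes.decode("cp437")     # '┬▒'
seen_in_cp1252 = utf8_bytes.decode("cp1252")   # 'Â±'
```

The cp437 result matches the FFIND output shown at the top of the thread.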
 
What if it's not in UTF8? My consoles use 1252; at least that's what this says.

Code:
v:\> echo %@regquery[hkcu\console\codepage]
1252

But I don't suppose there's a way to tell the WideChar/Multibyte conversion functions to use the default console code page.

I didn't think so earlier, but TYPE and FFIND do OK after CHCP 1252. Is there a way to tell TCC what code page to use by default?
 
TCC uses the default code page provided by Windows for the console session. If you want to change it, either change the system default or put a CHCP in your TCSTART.

However, this has nothing to do with FFIND detecting UTF8 files without a BOM. (It won't, and I'm not going to introduce a "feature" that slows it down by 10x in order to search every file on the minuscule chance they might be a UTF8 file w/o a BOM).
 
TCC uses the default code page provided by Windows for the console session. If you want to change it, either change the system default or put a CHCP in your TCSTART.

I'm not sure what you mean there. How do you figure out what code page that is? I have this.

Code:
v:\> echo %@regquery[hkcu\console\codepage]
1252

and TCC starts up like this.

Code:
v:\> chcp
Active code page: 437

However, this has nothing to do with FFIND detecting UTF8 files without a BOM. (It won't, and I'm not going to introduce a "feature" that slows it down by 10x in order to search every file on the minuscule chance they might be a UTF8 file w/o a BOM).

This isn't (never was) about UTF8. The file is not encoded.
 
I think the problem is that setting HKCU\Console\CodePage doesn't do anything. Although I have

Code:
v:\> echo %@regquery[hkcu\console\codepage]
1252

(with no CodePage setting specific to TCC)

TCC (and everything else) starts with CP 437.

But if I add

Code:
echo %@regquery[hkcu\console\d:_tc28_tcc.exe\CodePage]
1252

Then TCC starts with CP 1252 and FFIND shows ± just fine.
 
TCC doesn't set the code page unless you run CHCP. TCC also doesn't display text; that's done by Windows according to the current code page. There are several ways you can set the default user code page, none of which involve TCC.

If your file isn't UTF8, why did you say that the problematic character was encoded as two bytes (which is DEFINITELY not ASCII)? Are you using some non-ASCII and non-UTF8 MBCS?
 
If your file isn't UTF8, why did you say that the problematic character was encoded as two bytes (which is DEFINITELY not ASCII)? Are you using some non-ASCII and non-UTF8 MBCS?
That was temporary (as I discovered later); it happened when I was messing with TextPad's save-as encoding. Since TextPad wasn't adding a BOM, it screwed things up.
 
FFIND could do that too, but you'd pay a heavy price in speed.
Not really, if you limit your expectations to UTF-8, then ANSI/OEM.
You would read the file straight through as UTF-8 until you detect an invalid UTF-8 byte sequence. Then you switch the reading mode to SBCS and re-read the last block. That would have a negligible impact on speed, but would greatly improve usability in the modern age.
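A rough Python sketch of that idea, under stated assumptions: the names `decode_blocks` and `read_text`, the 4096-byte block size, and the `cp437` fallback are all invented for illustration, and this is not how FFIND is implemented.

```python
import codecs
import io

def decode_blocks(stream, fallback="cp437", block_size=4096):
    """Yield decoded text block by block: UTF-8 first, single-byte after the first error."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    utf8_mode = True
    while True:
        block = stream.read(block_size)
        if not block:
            break
        if utf8_mode:
            # Bytes of an incomplete sequence held over from the previous block.
            pending = decoder.getstate()[0]
            try:
                yield decoder.decode(block)
                continue
            except UnicodeDecodeError:
                # Invalid UTF-8: switch modes and redo only this block.
                utf8_mode = False
                block = pending + block
        # cp437 maps every byte value; other code pages may reject a few bytes.
        yield block.decode(fallback)
    if utf8_mode:
        # A truncated sequence at end of file is not valid UTF-8 either.
        leftover = decoder.getstate()[0]
        if leftover:
            yield leftover.decode(fallback)

def read_text(data: bytes, **kwargs) -> str:
    """Convenience wrapper for trying the generator on an in-memory buffer."""
    return "".join(decode_blocks(io.BytesIO(data), **kwargs))
```

Only the block that fails is decoded twice, so the cost stays close to a single pass. The trade-off is that text already emitted as UTF-8 before the first invalid sequence is not revisited.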
 
