FFIND and code page?

May 20, 2008
11,840
120
Syracuse, NY, USA
Among other things, TYPE wmips.btm gives this line (which looks as desired).

Code:
:: rewrite YYYYMMDDHHMMSS.mmmmmm±ZZZ

Among other things, FFIND /s /t"write" *.btm gives this (which doesn't look as desired).

Code:
---- V:\wmips.btm
:: rewrite YYYYMMDDHHMMSS.mmmmmm┬▒ZZZ

Could FFIND be made to do whatever TYPE is doing?
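
For reference, the ┬▒ pair matches what the two UTF-8 bytes of ± look like when each byte is shown through code page 437. A minimal Python sketch of that effect, assuming the file held ± as UTF-8 and the console was on 437 (Python is used only for illustration; this is not what TCC does internally):

```python
# Assumption: the .btm file stored "±" as UTF-8, which takes two bytes.
raw = "±".encode("utf-8")        # b'\xc2\xb1'

# Showing each byte through OEM code page 437 gives two unrelated glyphs.
as_oem = raw.decode("cp437")     # '┬▒'

# Decoding the same two bytes as UTF-8 restores the intended character.
as_utf8 = raw.decode("utf-8")    # '±'
```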
 

Charles Dye

Super Moderator
Staff member
May 20, 2008
4,576
97
Albuquerque, NM
prospero.unm.edu
Without a BOM, TCC would have to read each file twice to determine the encoding. TYPE does this because it doesn't slow it down noticeably (the write to the screen is much slower than the file reads).

If there is no Byte Order Mark, then I think it's perfectly reasonable to assume the OEM code page, 437 or whatever. But if there is a UTF-8 BOM, then you should probably assume UTF-8.
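
The rule suggested above can be sketched in a few lines of Python. The function name `pick_encoding` and the default of code page 437 are assumptions made for illustration, not anything TCC exposes:

```python
import codecs

def pick_encoding(data: bytes, oem: str = "cp437") -> str:
    """Return "utf-8-sig" if the data starts with a UTF-8 BOM, else the OEM code page."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # this codec also strips the BOM while decoding
    return oem

with_bom = codecs.BOM_UTF8 + "±".encode("utf-8")
without_bom = b"\xf1"        # "±" as stored in code page 437

text_a = with_bom.decode(pick_encoding(with_bom))        # '±'
text_b = without_bom.decode(pick_encoding(without_bom))  # '±'
```

Only the first three bytes are inspected, so the check costs nothing noticeable per file.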
 

rconn

Administrator
Staff member
May 14, 2008
12,426
153
In fact, it removes any UTF8 BOM. :confused:

The default Linux behavior is to not use UTF8 BOMs. But that only works on Linux because ALL files are considered to be UTF8, so apps don't need to check for the encoding.

In Windows, files are assumed to be ASCII unless the app also supports UTF16. A few apps (including TCC) also support UTF8, but lacking a BOM the only way to distinguish ASCII and UTF8 is to read the file and look for the extended bytes.
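
A rough Python sketch of what "read the file and look for the extended bytes" could mean in practice. The function name `classify` and its return labels are made up for this example; it is a heuristic, since some byte sequences are valid in both UTF-8 and a single-byte code page:

```python
def classify(data: bytes) -> str:
    """Guess the encoding family of BOM-less data by scanning all of it."""
    if data.isascii():
        return "ascii"                    # no byte >= 0x80 anywhere
    try:
        data.decode("utf-8")              # strict: raises on any invalid sequence
    except UnicodeDecodeError:
        return "single-byte code page"    # extended bytes that are not valid UTF-8
    return "utf-8"

kinds = [
    classify(b":: rewrite"),              # 'ascii'
    classify("±".encode("utf-8")),        # 'utf-8'
    classify(b"\xb1"),                    # 'single-byte code page'
]
```

The scan has to cover the whole file before the first line can be interpreted with confidence, which is the extra read being discussed.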
 
May 20, 2008
11,840
120
Syracuse, NY, USA
I still don't get it.

Now the +/- character is a single byte (0xb1) in the BTM file. It looks OK in my editor with save-encoding "default" (whatever that is). ECHO %@CHAR[0xb1] looks OK in a console after CHCP with 437 or 1252. The +/- character is 0xb1 in my font (Consolas, according to CHARMAP). Now both TYPE and FFIND show ▒. [TYPE showed ± earlier because I was piping to GREP.]

So what are TYPE and FFIND doing?
 
May 20, 2008
11,840
120
Syracuse, NY, USA
My mistake, I think. According to CHARMAP, that character is in "DOS:WesternEurope" and not in "DOS:UnitedStates" (are they 437 and 1252, in that order?).
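
To see how the single byte 0xb1 and the character ± map in the code pages involved, here is a small Python comparison. As far as I know, CHARMAP's "DOS: United States" is 437 and "DOS: Western Europe" is 850, with 1252 being the Windows Western page; treat that naming as an assumption:

```python
PAGES = ("cp437", "cp850", "cp1252")

# What the byte 0xb1 displays as under each code page.
byte_b1 = {cp: b"\xb1".decode(cp) for cp in PAGES}
# {'cp437': '▒', 'cp850': '▒', 'cp1252': '±'}

# Which byte each code page uses to store "±".
plus_minus = {cp: "±".encode(cp) for cp in PAGES}
# {'cp437': b'\xf1', 'cp850': b'\xf1', 'cp1252': b'\xb1'}
```

So a file holding 0xb1 shows ± only when the console is on 1252; on either DOS page the same byte is the shaded block.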
 
May 20, 2008
11,840
120
Syracuse, NY, USA
What if it's not in UTF8? My consoles use 1252; at least that's what this says.

Code:
v:\> echo %@regquery[hkcu\console\codepage]
1252

But I don't suppose there's a way to tell the WideChar/Multibyte conversion functions to use the default console code page.

I didn't think so earlier, but TYPE and FFIND do OK after CHCP 1252. Is there a way to tell TCC what code page to use by default?
 

rconn

Administrator
Staff member
May 14, 2008
12,426
153
TCC uses the default code page provided by Windows for the console session. If you want to change it, either change the system default or put a CHCP in your TCSTART.

However, this has nothing to do with FFIND detecting UTF8 files without a BOM. (It won't, and I'm not going to introduce a "feature" that slows it down by 10x in order to search every file on the minuscule chance they might be a UTF8 file w/o a BOM).
 
May 20, 2008
11,840
120
Syracuse, NY, USA
rconn said:
TCC uses the default code page provided by Windows for the console session. If you want to change it, either change the system default or put a CHCP in your TCSTART.

I'm not sure what you mean there. How do you figure out what code page that is? I have this.

Code:
v:\> echo %@regquery[hkcu\console\codepage]
1252

and TCC starts up like this.

Code:
v:\> chcp
Active code page: 437

rconn said:
However, this has nothing to do with FFIND detecting UTF8 files without a BOM. (It won't, and I'm not going to introduce a "feature" that slows it down by 10x in order to search every file on the minuscule chance they might be a UTF8 file w/o a BOM).

This isn't (never was) about UTF8. The file is not encoded.
 
May 20, 2008
11,840
120
Syracuse, NY, USA
I think the problem is that setting HKCU\Console\CodePage doesn't do anything. Although I have

Code:
v:\> echo %@regquery[hkcu\console\codepage]
1252

(with no CodePage setting specific to TCC)

TCC (and everything else) starts with CP 437.

But if I add

Code:
echo %@regquery[hkcu\console\d:_tc28_tcc.exe\CodePage]
1252

Then TCC starts with CP 1252 and FFIND shows ± just fine.
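
The subkey name in that query looks like the executable's full path with each backslash replaced by an underscore. A tiny Python sketch of that mapping, where the helper name `console_subkey` and the path d:\tc28\tcc.exe are inferred from the key shown above rather than stated anywhere in this thread:

```python
def console_subkey(exe_path: str) -> str:
    """Map an executable path to the per-application subkey name under HKCU\\Console."""
    # Registry key names cannot contain backslashes, so each one becomes "_".
    return exe_path.replace("\\", "_")

subkey = console_subkey(r"d:\tc28\tcc.exe")   # 'd:_tc28_tcc.exe'
```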
 

rconn

Administrator
Staff member
May 14, 2008
12,426
153
TCC doesn't set the code page unless you run CHCP. TCC also doesn't display text; that's done by Windows according to the current code page. There are several ways you can set the default user code page, none of which involve TCC.

If your file isn't UTF8, why did you say that the problematic character was encoded as two bytes (which is DEFINITELY not ASCII)? Are you using some non-ASCII and non-UTF8 MBCS?
 
May 20, 2008
11,840
120
Syracuse, NY, USA
rconn said:
If your file isn't UTF8, why did you say that the problematic character was encoded as two bytes (which is DEFINITELY not ASCII)? Are you using some non-ASCII and non-UTF8 MBCS?

That was temporary (as I discovered later); it happened when I was messing with TextPad's save-as encoding. Since TextPad wasn't adding a BOM, it screwed things up.
 
Aug 23, 2010
678
9
FFIND could do that too, but you'd pay a heavy price in speed.
Not really, if you limit your expectations to UTF-8 first, then ANSI/OEM.
You would read the file as UTF-8 from the start until you hit an invalid UTF-8 byte sequence. Then you switch the reading mode to SBCS and re-read the last block. That would have a negligible impact on speed but greatly improve usability in this day and age.
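
A minimal Python sketch of that idea, assuming block-wise reads and a caller-chosen single-byte fallback. The generator name `decode_blocks` and its parameters are invented for this example, and it says nothing about how FFIND is actually implemented:

```python
import codecs
import io

def decode_blocks(stream, fallback="cp437", block_size=4096):
    """Yield text block by block: UTF-8 until it fails, then a single-byte code page."""
    utf8 = codecs.getincrementaldecoder("utf-8")()
    single = codecs.getincrementaldecoder(fallback)()
    use_fallback = False
    while True:
        block = stream.read(block_size)
        if not block:
            break
        if not use_fallback:
            held = utf8.getstate()[0]     # bytes of a sequence split across blocks
            try:
                text = utf8.decode(block)
            except UnicodeDecodeError:
                use_fallback = True
                block = held + block      # re-read the failing block, plus held bytes
            else:
                yield text
                continue
        yield single.decode(block)
    if not use_fallback:
        tail = utf8.getstate()[0]         # incomplete sequence left at end of file
        if tail:
            yield single.decode(tail)

sample = "".join(decode_blocks(io.BytesIO(b"abc\xb1def"), fallback="cp1252", block_size=4))
# sample is 'abc±def'
```

Text already emitted from earlier blocks is not revisited, which is harmless when those blocks were plain ASCII but could mix interpretations if they held valid multi-byte sequences.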
 
