detecting BOM, FFIND multibyte regex

May 31, 2008
382
2
#1
While I was trying to adapt a *nix tip for detecting the BOM marker, I tripped on a puzzling error message. Can anyone please explain what is going on?

C:\>ffind /S /E"\xEF\xBB\xBF" C:\test\*
TCC: too short multibyte code string "\xEF\xBB\xBF"

0 lines in 0 files

C:\>ver /r

TCC 9.02.157 Windows XP [Version 5.1.2600]
TCC Build 157 Windows XP Build 2600 Service Pack 3

Is there a way to pass FFIND a regex matching the BOM marker?
Or can anyone suggest another way to find files that include the BOM marker?
Thanks in advance.
 
#3
Stefano Piccardi wrote:
| Is there a way to pass FFIND a regex matching the BOM marker?
| Or can anyone suggest another way to find files that include the BOM
| marker?

I seriously doubt any method other than @FILEREADB previously suggested, or
its equivalent @BREAD would work. Any other method that reads the file or a
portion of it reads the first two bytes, and interprets the file as either
Unicode or ASCII text, depending on those bytes matching or mismatching BOM,
and then processes it.

Suggestion for V12: a new function (@FILETYPE) to read the first two bytes
of a file, and return either a numeric code or a string indicating the file
type. This ought to include not only the BOM (0xFF 0xFE) indicating unicode,
but also the "MZ", "PK", etc. codes of the most common filetypes. Some other
possible "file" types to detect this way: "symbolic link", "junction", etc.
--
Steve
 
Feb 1, 2010
38
0
#5
Steve Fábián wrote:
| Suggestion for V12: a new function (@FILETYPE) to read the first two bytes of a file, and return either a numeric code or a string indicating the file type. This ought to include not only the BOM (0xFF 0xFE) indicating unicode
BOM length depends on the encoding and can be up to 4 bytes.
 
#6
Patulus wrote:
| ---Quote (Originally by Steve Fábián)---
|| Suggestion for V12: a new function (@FILETYPE) to read the first two
|| bytes of a file, and return either a numeric code or a string
|| indicating the file type. This ought to include not only the BOM
|| (0xFF 0xFE) indicating unicode
| ---End Quote---
| BOM length depends on the encoding and can be up to 4 bytes.

Thanks for the correction. I guess I should have refrained from specifying
how much of the file should be checked, as the size of the signature itself
depends on the type. However, there is a set of types that have unique
signatures. The V12 suggestion is that using a compilation of such
signatures the new function would identify the type of a file if it has one
of the known signatures, or report it as "unknown".
--
Steve
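A rough sketch of the signature-table idea, written in Python rather than TCC since no @FILETYPE function exists. The table entries, the function names `file_type` and `file_type_of`, and the 8-byte read length are all assumptions made for illustration; a real compilation of signatures would be much longer.

```python
# Hypothetical signature table: (leading bytes, label). Illustrative only,
# not something TCC provides.
SIGNATURES = [
    (b"\xef\xbb\xbf", "UTF-8 text (BOM)"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP"),
    (b"%PDF", "PDF"),
    (b"GIF8", "GIF"),
    (b"MZ", "EXE (MZ)"),
]


def file_type(head: bytes) -> str:
    """Return the label of the longest signature that `head` starts with."""
    # Try longer signatures first so a short one cannot shadow a longer one.
    for magic, name in sorted(SIGNATURES, key=lambda s: len(s[0]), reverse=True):
        if head.startswith(magic):
            return name
    return "unknown"


def file_type_of(path: str) -> str:
    """Read the first few bytes of a file in binary mode and classify them."""
    with open(path, "rb") as f:
        return file_type(f.read(8))
```

Reading in binary mode matters here: nothing interprets the leading bytes as text before the comparison is made.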
 
#8
Patulus wrote:
| ---Quote (Originally by Steve Fábián)---
|| The V12 suggestion is that
| ---End Quote---
| Steve, what is V12?

The next version of TCMD. There has been no announcement about it as
yet. By writing "V12" I suggested that it be implemented in the very next
version, not left for some later one.

| I agree with your suggestion. Btw, can it be
| implemented by means of the brand new plug-in mechanism?

I am sure you could do that, or even do it with just a batch file for that
matter (and executing the batch file to evaluate the function I suggested).
However, the plug-in mechanism is nearly four years old, from version 7, so
it is not "brand new"...
--
Steve
 
#9
On Fri, 05 Feb 2010 09:34:36 -0500, Steve Fábián <> wrote:

|I am sure you could do that, or even do it with just a batch file for that
|matter (and executing the batch file to evaluate the function I suggested).
|However, the plug-in mechanism is nearly four years old, from version 7, so
|it is not "brand new"...

I could easily include an @BOM[file] in my 4UTILS plugin. How should it work
... 0 = no BOM, 1 = BOM, or more explicit, distinguishing the five (AFAIK) BOMs?
I believe only one is meaningful in Windows.
--
- Vince
 
#10
If you're going to detect file type based on the first few characters, file
types could also include exe, pdf, gif, jpg, wp5, uce, wav, etc.

If you're actually talking BOM, Wikipedia lists at least eleven.

--
Jim Cook
2010 Sundays: 4/4, 6/6, 8/8, 10/10, 12/12 and 5/9, 9/5, 7/11, 11/7.
Next year they're Monday.
 
#11
On Fri, 05 Feb 2010 11:28:03 -0500, Jim Cook <> wrote:

|If you're going to detect file type based on the first few characters, file
|types could also include exe, pdf, gif, jpg, wp5, uce, wav, etc.
|

TCC already has @EXETYPE. I'm not going to get into the others.

|If you're actually talking BOM, Wikipedia lists at least eleven.

As for BOMs, are all eleven of those relevant to the Windows environment?

--
- Vince
 
#12
vefatica wrote:
| On Fri, 05 Feb 2010 11:28:03 -0500, Jim Cook <> wrote:
|
|| If you're going to detect file type based on the first few
|| characters, file types could also include exe, pdf, gif, jpg, wp5,
|| uce, wav, etc.
|
| TCC already has @EXETYPE. I'm not going to get into the others.

In fact, my suggestion was especially for the "others", e.g. pdf, jpg, etc.
The file extension does not always represent the file type correctly. The
original issue was, of course, to determine the encoding of a text file
(ASCII or Unicode). BOM detection serves solely that purpose. I would find
it harmless to detect any of the eleven possible representations of BOM, and
the extra time to check for all 11 instead of just 1 or 2 in compiled code
(e.g. a plug-in) negligible, esp. in comparison to the time it takes to
open, read, and close the file. This is probably true even for a file on a
virtual disk in internal storage.
--
Steve
 
#13
Steve Fábián wrote:
| However, the plug-in mechanism is nearly four years old, from version 7, so it is not "brand new"...
Oops :). Didn't know about this. I just noticed it and the html page dedicated to plugins is kind of misleading, it says "Take Command 11.0 provides a Plugin architecture..." from which I have decided that it wasn't in the previous versions.
 
#15
Patulus wrote:
| ---Quote (Originally by Steve Fábián)---
|| However, the plug-in mechanism is nearly four years old, from
|| version 7, so it is not "brand new"...
| ---End Quote---
| Oops :). Didn't know about this. I just noticed it and the html page
| dedicated to plugins is kind of misleading, it says "*Take Command
| 11.0 *provides a Plugin architecture..." from which I have decided
| that it wasn't in the previous versions.

No harm done. I have repeatedly suffered from "foot in mouth" disease
myself, with sometimes conflicting requests. However, if a feature is new
in a specific version, it is listed in the "What's New" section, which is
the only way to know whether or not a feature is new. I agree that the help
text you quoted could mislead one to think it's new in V11.
--
Steve
 

samintz

Scott Mintz
May 20, 2008
1,295
11
Solon, OH, USA
#16
Code:
setlocal
setdos /x-45678
set fh=%@fileopen[%1,r,b]
set r=%@filereadb[%fh,4]
set w=%@format[02,%@convert[10,16,%@word[0,%r]]]
set x=%@format[02,%@convert[10,16,%@word[1,%r]]]
set y=%@format[02,%@convert[10,16,%@word[2,%r]]]
set z=%@format[02,%@convert[10,16,%@word[3,%r]]]
set bom2=%[w]%[x]
set bom3=%[w]%[x]%[y]
set bom4=%[w]%[x]%[y]%[z]

SWITCH %bom4
CASE FFFE0000
    echo UTF-32LE
CASE 0000FEFF
    echo UTF-32BE
CASE 84319533
    echo GB-18030
CASE 2B2F7638 .or. 2B2F7639 .or. 2B2F762B .or. 2B2F762F
    echo UTF-7
CASE DD736673
    echo UTF-EBCDIC
DEFAULT
    SWITCH %bom3
    CASE EFBBBF
        echo UTF-8
    CASE F7644C
        echo UTF-1
    CASE 0EFEFF
        echo SCSU
    CASE FBEE28
        echo BOCU-1
    DEFAULT
        SWITCH %bom2
        CASE FEFF
            echo UTF-16BE
        CASE FFFE
            echo UTF-16LE
        DEFAULT
            echo Unknown: %bom4
        ENDSWITCH
    ENDSWITCH
ENDSWITCH
set fh=%@fileclose[%fh]
endlocal
The above works for detecting the BOM type. Assume it is named BOM.BTM:

[C:\Temp] function bom=%%@execstr[c:\temp\bom.btm %%1]

[C:\Temp] echo %@bom[utf8.txt]
UTF-8

[C:\Temp] echo %@bom[utf16le.txt]
UTF-16LE

[C:\Temp] echo %@bom[utf16be.txt]
UTF-16BE

[C:\Temp] echo %@bom[AUCHECK_PARSER.txt ]
Unknown: 5B576564

-Scott
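For comparison, a minimal Python sketch of the same longest-match-first logic, for readers outside TCC. The byte values are copied from the batch file above; the names `BOMS` and `bom_name` are illustrative assumptions, not part of any library.

```python
# BOM table taken from the batch file above, ordered 4-byte, 3-byte, 2-byte
# so that FF FE 00 00 (UTF-32LE) is tested before FF FE (UTF-16LE).
BOMS = [
    (bytes.fromhex("FFFE0000"), "UTF-32LE"),
    (bytes.fromhex("0000FEFF"), "UTF-32BE"),
    (bytes.fromhex("84319533"), "GB-18030"),
    (bytes.fromhex("2B2F7638"), "UTF-7"),
    (bytes.fromhex("2B2F7639"), "UTF-7"),
    (bytes.fromhex("2B2F762B"), "UTF-7"),
    (bytes.fromhex("2B2F762F"), "UTF-7"),
    (bytes.fromhex("DD736673"), "UTF-EBCDIC"),
    (bytes.fromhex("EFBBBF"), "UTF-8"),
    (bytes.fromhex("F7644C"), "UTF-1"),
    (bytes.fromhex("0EFEFF"), "SCSU"),
    (bytes.fromhex("FBEE28"), "BOCU-1"),
    (bytes.fromhex("FEFF"), "UTF-16BE"),
    (bytes.fromhex("FFFE"), "UTF-16LE"),
]


def bom_name(head: bytes) -> str:
    """Classify the leading bytes of a file by their BOM, if any."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    # Same fallback as the batch file: show the first four bytes in hex.
    return "Unknown: " + head[:4].hex().upper()
```

Passing the first four bytes of a file opened in binary mode, e.g. `bom_name(open(path, "rb").read(4))`, corresponds to the @fileopen/@filereadb pair in the script.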

 

#17
Remove the setdos statement from the original example I published. Setdos
/x-7 turns off quoting, so @fileopen fails on LFNs that require
quotes.
In this particular example setdos is not needed anyway; I copied and
pasted from the hexdump script I had written earlier and modified it.

Using the example function and script you can find all the UTF-16LE files
in a given directory like this:

for %f in (*.txt) do if %@bom["%f"]==UTF-16LE echo %f

-Scott
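The same kind of directory scan can be sketched in Python, which also covers the original question of finding files that start with the UTF-8 BOM. This is only an illustration: `files_with_bom` is a hypothetical helper, while the `codecs.BOM_*` constants are part of the Python standard library.

```python
import codecs
from pathlib import Path


def files_with_bom(directory, bom=codecs.BOM_UTF16_LE, pattern="*.txt"):
    """Return the names of files matching `pattern` that start with `bom`."""
    hits = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        with path.open("rb") as f:   # binary mode: no text decoding at all
            head = f.read(4)
        if not head.startswith(bom):
            continue
        # FF FE 00 00 is the UTF-32LE BOM, so do not report it as UTF-16LE.
        if bom == codecs.BOM_UTF16_LE and head.startswith(codecs.BOM_UTF32_LE):
            continue
        hits.append(path.name)
    return hits
```

For example, `files_with_bom(r"C:\test", codecs.BOM_UTF8, "*")` would list the files in that directory that begin with EF BB BF.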
