Q about QueryIsFileUnicode

#1
When QueryIsFileUnicode is used on a disk file and a BOM is found, the function seems to leave the file pointer positioned after the BOM. Is that correct?

Besides looking for a BOM, what else does QueryIsFileUnicode do? I ask because it fails to identify a file containing only this line (below), with no BOM, as Unicode.

Code:
0000 0000 61 00 20 00 3e 00 20 00  62 00 20 00 3c 00 20 00  a. .>. .b. .<. .
0000 0010 63 00 20 00 26 00 20 00  64 00 20 00 7c 00 20 00  c. .&. .d. .|. .
0000 0020 65 00 20 00 60 00 20 00  66 00 20 00 25 00 66 00  e. .`. .f. .%.f.
0000 0030 6f 00 6f 00 20 00 25 00  70 00 61 00 74 00 68 00  o.o. .%.p.a.t.h.
0000 0040 0d 00 0a 00                                       ....
 

rconn

Administrator
Staff member
May 14, 2008
10,554
97
#2
> When QueryIsFileUnicode is used on a disk file and a BOM is found, the
> function seems to leave the file pointer positioned after the BOM. Is
> that correct?
Yes.


> Besides looking for a BOM, what else does QueryIsFileUnicode do? I ask
> because it fails to identify a file containing only this line (below),
> with no BOM, as Unicode.
QueryIsFileUnicode does not look for a BOM; it just skips it if the text is
declared (by Windows) to be Unicode. It calls the Windows API
IsTextUnicode; if you have a problem with that API you should ask Microsoft
for details.

Rex Conn
JP Software
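
The order described above (detect first, then skip the BOM only if the text was declared Unicode) can be sketched in portable C. This is a minimal illustration, not JP Software's implementation: both function names are hypothetical, and the `detected_unicode` argument stands in for whatever the detector (IsTextUnicode, in this thread) reported.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of a UTF-16LE BOM (FF FE) at the start of buf,
   or 0 if there is none. */
size_t utf16le_bom_length(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) ? 2 : 0;
}

/* Hypothetical helper: offset at which reading should resume.
   The BOM is skipped only when the detector has already said "Unicode";
   otherwise the position stays at the start of the buffer. */
size_t offset_after_detection(const unsigned char *buf, size_t len,
                              int detected_unicode)
{
    if (!detected_unicode)
        return 0;
    return utf16le_bom_length(buf, len);
}
```

With a buffer that starts FF FE and a positive detection, the returned offset is 2, which matches the observation in #1 that the file pointer ends up just past the BOM.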
 

rconn

#4
> QueryIsFileUnicode does not look for a BOM; it just skips it if the
> text is declared (by Windows) to be Unicode. It calls the Windows API
> IsTextUnicode; if you have a problem with that API you should ask
> Microsoft for details.
>
> What tests do you ask IsTextUnicode to do?
IS_TEXT_UNICODE_ASCII16 | IS_TEXT_UNICODE_SIGNATURE |
IS_TEXT_UNICODE_ILLEGAL_CHARS

After trying dozens of combinations, that's the one I've found to get the
best overall results.

Rex Conn
JP Software
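
For readers who want to try the same mask, here is a hedged sketch of how those three flags would be passed to IsTextUnicode. The wrapper name `looks_unicode` and the helper `unicode_test_mask` are hypothetical; the flag values in the non-Windows branch are copied from the Windows SDK headers only so the fragment compiles elsewhere, and the actual API call exists only on Windows (link with Advapi32).

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>   /* declares IsTextUnicode and the IS_TEXT_UNICODE_* flags */
#else
/* Values as defined in the Windows SDK, repeated for non-Windows builds. */
#define IS_TEXT_UNICODE_ASCII16        0x0001
#define IS_TEXT_UNICODE_SIGNATURE      0x0008
#define IS_TEXT_UNICODE_ILLEGAL_CHARS  0x0100
#endif

/* The combination of tests quoted above. */
int unicode_test_mask(void)
{
    return IS_TEXT_UNICODE_ASCII16
         | IS_TEXT_UNICODE_SIGNATURE
         | IS_TEXT_UNICODE_ILLEGAL_CHARS;
}

#ifdef _WIN32
/* Hypothetical wrapper: returns nonzero if IsTextUnicode accepts the buffer.
   On entry *results holds the tests to run; on return it holds the bits of
   the requested tests that the buffer passed. */
int looks_unicode(const void *buf, int size, int *results)
{
    *results = unicode_test_mask();
    return IsTextUnicode(buf, size, results) != 0;
}
#endif
```

Inspecting `*results` after the call is one way to see which individual test accepted or rejected a given file.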
 
#5
> IS_TEXT_UNICODE_ASCII16 | IS_TEXT_UNICODE_SIGNATURE |
> IS_TEXT_UNICODE_ILLEGAL_CHARS
>
> After trying dozens of combinations, that's the one I've found to get the
> best overall results.
Yes, that's reasonable. And from the description of IS_TEXT_UNICODE_ASCII16 I'd expect it to catch
Code:
L"a > b < c & d | e ` f %foo %path"
But it doesn't. The only tests that identify it as Unicode are IS_TEXT_UNICODE_STATISTICS, IS_TEXT_UNICODE_CONTROLS, and IS_TEXT_UNICODE_NULL_BYTES. [Hmmm! I just found some articles suggesting IsTextUnicode is useless/inconsistent.]
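
For BOM-less files like the one dumped in #1, a simple byte-pattern check is one possible fallback. The sketch below is not IsTextUnicode's algorithm and not anything QueryIsFileUnicode is known to do; `looks_like_ascii_utf16le` is a hypothetical name, and the rule it applies (every odd-offset byte is zero, no even-offset byte is zero) only recognizes UTF-16LE text whose code points are all below U+0100.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical heuristic: report UTF-16LE when the length is even and
   nonzero, every high (odd-offset) byte is 0x00, and no low (even-offset)
   byte is 0x00. Text with code points at or above U+0100 is not detected. */
int looks_like_ascii_utf16le(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    if (len == 0 || len % 2 != 0)
        return 0;
    for (i = 0; i < len; i += 2) {
        if (buf[i] == 0x00 || buf[i + 1] != 0x00)
            return 0;
    }
    return 1;
}
```

Fed the bytes from the hex dump in #1 (61 00 20 00 3e 00 ...), this check accepts the buffer, while the same text stored as single bytes (61 20 3e 20 ...) is rejected because its odd-offset bytes are not zero.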