
TYPE goes crazy with no-BOM Unicode file

I asked W32Time to produce a log file. It produced a Unicode file with no BOM. Here are the first few lines (copied/pasted from Notepad).
152593 04:23:58.7934268s - ---------- Log File Opened -----------------
152593 04:23:58.7936246s - RPC Call - Query Configuration
152593 04:23:58.7936990s - RPC Call - Query Provider Configuration
152593 04:23:58.8100194s - TimeProvCommand([NtpClient], TPC_Query) called.
152593 04:23:58.8101640s - RPC Call - Query Provider Configuration
152593 04:24:07.5284957s - RPC Caller is BB\vefatica (S-1-5-21-3764633515-3696517045-806287659-1001)

Here they are again according to CMD's TYPE.
v:\> cmd /c type w32tm.log
1 5 2 5 9 3   0 4 : 2 3 : 5 8 . 7 9 3 4 2 6 8 s   -   - - - - - - - - - -   L o g   F i l e   O p e n e d   - - -
  - - - - - - - - - - - - -
 1 5 2 5 9 3   0 4 : 2 3 : 5 8 . 7 9 3 6 2 4 6 s   -   R P C   C a l l   -   Q u e r y   C o n f i g u r a t i o
 1 5 2 5 9 3   0 4 : 2 3 : 5 8 . 7 9 3 6 9 9 0 s   -   R P C   C a l l   -   Q u e r y   P r o v i d e r   C o n
  g u r a t i o n
 1 5 2 5 9 3   0 4 : 2 3 : 5 8 . 8 1 0 0 1 9 4 s   -   T i m e P r o v C o m m a n d ( [ N t p C l i e n t ] ,
  C _ Q u e r y )   c a l l e d .
 1 5 2 5 9 3   0 4 : 2 3 : 5 8 . 8 1 0 1 6 4 0 s   -   R P C   C a l l   -   Q u e r y   P r o v i d e r   C o n
  g u r a t i o n
 1 5 2 5 9 3   0 4 : 2 4 : 0 7 . 5 2 8 4 9 5 7 s   -   R P C   C a l l e r   i s   B B \ v e f a t i c a   ( S -
  5 - 2 1 - 3 7 6 4 6 3 3 5 1 5 - 3 6 9 6 5 1 7 0 4 5 - 8 0 6 2 8 7 6 5 9 - 1 0 0 1 )
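
The spaced-out output is what you would expect if a UTF-16LE file is read one byte at a time: every ASCII character is followed by a zero byte, and the console draws that byte as a blank. A minimal Python sketch of the effect, assuming the log is UTF-16LE and that the console renders NUL as a space (the `replace` call below only imitates that rendering):

```python
# Sketch: why a byte-oriented TYPE prints "1 5 2 5 9 3 ..." for a UTF-16LE file.
line = "152593 04:23:58"

raw = line.encode("utf-16-le")            # this codec writes no BOM
as_single_bytes = raw.decode("latin-1")   # treat every byte as one character
shown = as_single_bytes.replace("\x00", " ")  # imitate a console drawing NUL as a blank

print(raw[:8])   # b'1\x005\x002\x005\x00'
print(shown)
```

A real space in the text comes out as three blanks (NUL, space, NUL), which matches the wide gaps in the CMD output above.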

Here's some of what TCC's TYPE gives ... garbage, and apparently not related to what's in the file.
v:\> type w32tm.log
1 :  ]  ၪᰀ耀ᖨ翵翼     㚉䐡V:\  $p ၯᴀ耀ᖨ翵翼     ep\ \  S  ၬḀ退耀귅翼 ✐ȱ            ၡἀ退耀귅翼 Ⳡȱ     
ၦ 退㪭堧ƻ䅤욕孵떠ⴖ            ၻ℀退耀귅翼 ⵀȱ            ၸ∀退捉酥䮩䅃꺡뱿똠백            ၽ⌀退Őȱ  ȱ  ⦐ȱ  ȱ
ၲ␀鐀ᖨ翵翼 ⶠȱ    耀        ၷ─退耀귅翼 ⪠ȱ            ၴ☀耀ᖨ翵翼     liV:\  p  ၉✀耀\??\v:\w32tm.log  s ၎⠀退耀
귅翼 ⫐ȱ                    뷟肇  㤀ȱ  룰ȱ      [4324]  v:\
ᨀ脄 ȱ  ᤀ脅 ꫐ȱ  ȱ
WS _LINES_MAXLEN=94 _LINES_MAXLOC=3         병톶脚ࠀ꭛翼  ꭙ翼    䁨ȱ  ꭛翼      ꭛翼  ꭛翼  ꭛翼  ꭛翼  ꭛翼  ꭛翼  ꭛翼
LIST handles the file better, but there's an interesting distinction between the x64 and x86 versions of TCC. The x64 version shows

while the x86 version shows

And if I go back to v16, TYPE works better, showing the file with spaces between the characters, like CMD's.
It was a really weird file that the Windows IsTextUnicode API couldn't decipher. I added a hack.
What was so weird about it? To me, it looked like what the docs refer to as:

IS_TEXT_UNICODE_ASCII16 The text is Unicode, and contains only zero-extended ASCII values/characters.
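
For illustration, here is roughly what that flag describes, written as a Python check. This is only a sketch of the idea, not how IsTextUnicode is actually implemented; the function name `looks_like_ascii16` is made up, and it assumes little-endian byte order.

```python
def looks_like_ascii16(buf: bytes) -> bool:
    """Return True if buf reads as UTF-16LE made only of zero-extended ASCII."""
    if len(buf) < 2 or len(buf) % 2:
        return False
    low, high = buf[0::2], buf[1::2]
    # Every high byte must be zero and every low byte a non-NUL 7-bit value.
    return not any(high) and all(0 < b < 0x80 for b in low)


sample = "152593 04:23:58.7934268s - RPC Call\r\n".encode("utf-16-le")
print(looks_like_ascii16(sample))               # True
print(looks_like_ascii16(b"plain ASCII text"))  # False
```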
There's a fairly easy way of "hacking" text files that, thanks to the characteristics of the Unicode encodings, needs little read-ahead to detect the correct encoding with high certainty.

1. Treat the input as UTF-8 (ASCII-compatible).
2. If you see a byte sequence that does not decode as UTF-8, check whether the same byte repeats at every 4th position. If it does, see if you can decode the input as UTF-32 (LE/BE).
3. If that fails, try to decode it as UTF-16.
4. If all else fails, assume ASCII/extended ASCII.
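
The steps above can be sketched in Python as follows. Treat it as a rough illustration under a few assumptions of mine, not a definitive detector: `guess_encoding` is a made-up name; NUL bytes are treated as failing the UTF-8 test (they are technically valid UTF-8, but zero-extended ASCII would otherwise never get past step 1); UTF-16 byte order is guessed from where the zero bytes fall; and "extended ASCII" is represented by Latin-1.

```python
def guess_encoding(sample: bytes) -> str:
    """Guess the encoding of a BOM-less sample using the four steps above."""
    # Step 1: UTF-8 (covers plain ASCII). Assumption: a NUL byte disqualifies it.
    if b"\x00" not in sample:
        try:
            sample.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            pass

    # Step 2: UTF-32. One byte in every group of four is always zero.
    if sample and len(sample) % 4 == 0:
        for offset, name in ((3, "utf-32-le"), (0, "utf-32-be")):
            if not any(sample[offset::4]):
                try:
                    sample.decode(name)
                    return name
                except UnicodeDecodeError:
                    pass

    # Step 3: UTF-16. Guess the byte order from where the zero bytes sit.
    if len(sample) % 2 == 0:
        even_zeros = sample[0::2].count(0)
        odd_zeros = sample[1::2].count(0)
        name = None
        if odd_zeros > even_zeros:
            name = "utf-16-le"
        elif even_zeros > odd_zeros:
            name = "utf-16-be"
        if name is not None:
            try:
                sample.decode(name)
                return name
            except UnicodeDecodeError:
                pass

    # Step 4: fall back to a single-byte encoding (Latin-1 never fails to decode).
    return "latin-1"


print(guess_encoding("152593 04:23:58".encode("utf-16-le")))  # utf-16-le
print(guess_encoding("Log File".encode("utf-32-be")))         # utf-32-be
print(guess_encoding(b"caf\xe9"))                             # latin-1
```

UTF-16 text with no zero bytes at all (for example, purely CJK content) would slip through to the fallback here, so a real detector would need more than this.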

You can buffer lines as long as the encoding is uncertain.
Once the encoding is certain, you can stop buffering and send the input straight to the decoder.
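
A minimal Python sketch of that buffering idea, using the standard library's incremental decoders. The names `decode_stream` and `toy_detect`, the `max_buffer` limit, and the Latin-1 fallback are all assumptions for illustration; any detector that returns an encoding name, or `None` while it is still unsure, could be plugged in.

```python
import codecs


def decode_stream(chunks, detect, max_buffer=4096, fallback="latin-1"):
    """Yield decoded text, buffering raw bytes until detect() is certain."""
    pending = bytearray()
    decoder = None
    for chunk in chunks:
        if decoder is None:
            pending += chunk
            name = detect(bytes(pending))
            if name is None and len(pending) < max_buffer:
                continue  # still uncertain: keep buffering
            decoder = codecs.getincrementaldecoder(name or fallback)(errors="replace")
            chunk, pending = bytes(pending), bytearray()
        text = decoder.decode(chunk)
        if text:
            yield text
    if decoder is None:
        # Input ended while still uncertain: decode the buffer with the fallback.
        decoder = codecs.getincrementaldecoder(fallback)(errors="replace")
    tail = decoder.decode(bytes(pending), final=True)
    if tail:
        yield tail


def toy_detect(sample):
    # Hypothetical stand-in detector: certain only once 4 bytes have been seen.
    if len(sample) < 4:
        return None
    return "utf-16-le" if sample[1] == 0 and sample[3] == 0 else "utf-8"


data = "Log File Opened".encode("utf-16-le")
pieces = [data[i:i + 3] for i in range(0, len(data), 3)]  # odd-sized chunks
print("".join(decode_stream(pieces, toy_detect)))  # Log File Opened
```

The incremental decoder keeps any half-received code unit between calls, so the chunks do not have to end on character boundaries.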