TYPE goes crazy with no-BOM Unicode file

May 20, 2008
10,498
77
Syracuse, NY, USA
I asked W32Time to produce a log file. It produced a Unicode file with no BOM. Here are the first few lines (copied/pasted from Notepad).
Code:
152593 04:23:58.7934268s - ---------- Log File Opened -----------------
152593 04:23:58.7936246s - RPC Call - Query Configuration
152593 04:23:58.7936990s - RPC Call - Query Provider Configuration
152593 04:23:58.8100194s - TimeProvCommand([NtpClient], TPC_Query) called.
152593 04:23:58.8101640s - RPC Call - Query Provider Configuration
152593 04:24:07.5284957s - RPC Caller is BB\vefatica (S-1-5-21-3764633515-3696517045-806287659-1001)
Here they are again according to CMD's TYPE.
Code:
v:\> cmd /c type w32tm.log
1 5 2 5 9 3   0 4 : 2 3 : 5 8 . 7 9 3 4 2 6 8 s   -   - - - - - - - - - -   L o g   F i l e   O p e n e d   - - -
  - - - - - - - - - - - - -
 1 5 2 5 9 3   0 4 : 2 3 : 5 8 . 7 9 3 6 2 4 6 s   -   R P C   C a l l   -   Q u e r y   C o n f i g u r a t i o
 1 5 2 5 9 3   0 4 : 2 3 : 5 8 . 7 9 3 6 9 9 0 s   -   R P C   C a l l   -   Q u e r y   P r o v i d e r   C o n
  g u r a t i o n
 1 5 2 5 9 3   0 4 : 2 3 : 5 8 . 8 1 0 0 1 9 4 s   -   T i m e P r o v C o m m a n d ( [ N t p C l i e n t ] ,
  C _ Q u e r y )   c a l l e d .
 1 5 2 5 9 3   0 4 : 2 3 : 5 8 . 8 1 0 1 6 4 0 s   -   R P C   C a l l   -   Q u e r y   P r o v i d e r   C o n
  g u r a t i o n
 1 5 2 5 9 3   0 4 : 2 4 : 0 7 . 5 2 8 4 9 5 7 s   -   R P C   C a l l e r   i s   B B \ v e f a t i c a   ( S -
  5 - 2 1 - 3 7 6 4 6 3 3 5 1 5 - 3 6 9 6 5 1 7 0 4 5 - 8 0 6 2 8 7 6 5 9 - 1 0 0 1 )
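The spacing in that output is what you would expect if the file is UTF-16LE and is being read one byte at a time. A minimal sketch of the effect, assuming the log is little-endian UTF-16 (the sample line is taken from the log above; the variable names are just for illustration):

```python
# One line of the log, encoded the way a BOM-less UTF-16LE writer would store it.
line = "152593 04:23:58.7934268s - RPC Call - Query Configuration"
raw = line.encode("utf-16-le")  # this codec does not emit a BOM

# Each ASCII character becomes two bytes: the character itself, then 0x00.
print(raw[:12])

# A reader that treats every byte as one character sees a NUL after each
# real character; a console typically renders that NUL as a blank cell.
as_single_byte = raw.decode("latin-1").replace("\x00", " ")
print(as_single_byte[:24])
```

That would account for CMD's "spaced out" lines, and for the lines wrapping at about twice their real length.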
Here's some of what TCC's TYPE gives ... garbage, and apparently not related to what's in the file.
Code:
v:\> type w32tm.log
1 :  ]  ၪᰀ耀ᖨ翵翼     㚉䐡V:\  $p ၯᴀ耀ᖨ翵翼     ep\ \  S  ၬḀ退耀귅翼 ✐ȱ            ၡἀ退耀귅翼 Ⳡȱ     
ၦ 退㪭堧ƻ䅤욕孵떠ⴖ            ၻ℀退耀귅翼 ⵀȱ            ၸ∀退捉酥䮩䅃꺡뱿똠백            ၽ⌀退Őȱ  ȱ  ⦐ȱ  ȱ
ၲ␀鐀ᖨ翵翼 ⶠȱ    耀        ၷ─退耀귅翼 ⪠ȱ            ၴ☀耀ᖨ翵翼     liV:\  p  ၉✀耀\??\v:\w32tm.log  s ၎⠀退耀
귅翼 ⫐ȱ                    뷟肇  㤀ȱ  룰ȱ      [4324]  v:\
ᨀ脄 ȱ  ᤀ脅 ꫐ȱ  ȱ
WS _LINES_MAXLEN=94 _LINES_MAXLOC=3         병톶脚ࠀ꭛翼  ꭙ翼    䁨ȱ  ꭛翼      ꭛翼  ꭛翼  ꭛翼  ꭛翼  ꭛翼  ꭛翼  ꭛翼
 
LIST handles the file better but there's an interesting distinction between the x64 and x86 versions of TCC. The x64 version shows
[screenshot: 1539581916033.png]

while the x86 version shows
[screenshot: 1539582127712.png]


And if I go back to v16, TYPE works better, showing the file with spaces between the characters, like CMD.
 
"It was a really weird file that the Windows IsTextUnicode API couldn't decipher. I added a hack."
What was so weird about it? To me, it looked like what the docs refer to as:

Code:
IS_TEXT_UNICODE_ASCII16 The text is Unicode, and contains only zero-extended ASCII values/characters.
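
For illustration, the condition that flag describes can be approximated in a few lines. This is a sketch of the idea only, not how IsTextUnicode works internally; the function name looks_like_ascii16_le is made up, and it assumes little-endian byte order.

```python
def looks_like_ascii16_le(data: bytes) -> bool:
    """Return True if data looks like BOM-less UTF-16LE holding only ASCII."""
    # UTF-16 text is a whole number of 2-byte code units.
    if len(data) < 2 or len(data) % 2 != 0:
        return False
    low_bytes = data[0::2]   # first byte of each unit: the ASCII value
    high_bytes = data[1::2]  # second byte of each unit: the zero extension
    return (all(b == 0 for b in high_bytes)
            and all(0 < b < 0x80 for b in low_bytes))


print(looks_like_ascii16_le("RPC Call - Query Configuration".encode("utf-16-le")))
print(looks_like_ascii16_le(b"RPC Call - Query Configuration"))
```

The first few lines of the log shown earlier would pass a check like this, which is why the file does not look especially weird to me.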
 
Aug 23, 2010
562
7
There's a rather easy way of "hacking" text files that relies on characteristics of the Unicode encodings and needs little read-ahead to detect the correct encoding with high certainty.

1. Treat the input as UTF-8 (which is ASCII-compatible).
2. If you see a byte sequence that does not decode as UTF-8, check whether the same byte repeats at every 4th position. If it does, try to decode the input as UTF-32 (LE/BE).
3. If that fails, try to decode it as UTF-16.
4. If all else fails, assume ASCII/extended ASCII.

You can buffer lines as long as the encoding is uncertain.
Once the encoding is certain, you can stop buffering and send the input straight to the decoder.
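
A rough sketch of those four steps, run on whatever has been buffered so far. The names guess_encoding and _decodes are made up for this example, and two things are assumptions of mine rather than part of the steps: a NUL byte is treated as "does not decode as UTF-8" (zero-extended ASCII in UTF-16 is otherwise valid UTF-8), and UTF-16 is only accepted when zero bytes are present, so text without any Latin-range characters would be missed. The fallback code page cp1252 is also just a guess.

```python
def _decodes(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly with the given codec."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def guess_encoding(sample: bytes) -> str:
    """Guess the encoding of BOM-less text by trying the steps in order."""
    # Step 1: UTF-8, which also covers plain ASCII.
    # Assumption: any NUL byte disqualifies UTF-8.
    if b"\x00" not in sample and _decodes(sample, "utf-8"):
        return "utf-8"

    # Step 2: UTF-32 has a zero byte in the same slot of every 4-byte unit.
    if len(sample) >= 4 and len(sample) % 4 == 0:
        if not any(sample[3::4]) and _decodes(sample, "utf-32-le"):
            return "utf-32-le"
        if not any(sample[0::4]) and _decodes(sample, "utf-32-be"):
            return "utf-32-be"

    # Step 3: UTF-16. Nearly any even-length input decodes, so choose the
    # byte order by which half of the 2-byte units holds more zero bytes.
    if len(sample) >= 2 and len(sample) % 2 == 0:
        zeros_odd = sample[1::2].count(0)   # high bytes if little-endian
        zeros_even = sample[0::2].count(0)  # high bytes if big-endian
        if zeros_odd or zeros_even:         # assumption: need some zeros
            order = "utf-16-le" if zeros_odd >= zeros_even else "utf-16-be"
            if _decodes(sample, order):
                return order

    # Step 4: fall back to a single-byte code page (assumed cp1252 here).
    return "cp1252"


print(guess_encoding("RPC Call".encode("utf-16-le")))
print(guess_encoding(b"caf\xe9"))
```

In practice you would call something like this on the buffered prefix, keep buffering while the answer could still change, and hand the rest of the stream to the chosen decoder once it settles.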