BOM when UnicodeOutput=Yes

Discussion in 'Support' started by Charles Dye, Mar 17, 2012.

  1. Charles Dye

    Charles Dye Super Moderator
    Staff Member

    Joined:
    May 20, 2008
    Messages:
    3,288
    Likes Received:
    39
    It seems that when UnicodeOutput is enabled:

    • redirection to a file generates a Byte Order Mark
    • redirection to a pipe does not generate a Byte Order Mark
    • DIR |! LIST /X will therefore show two more bytes than DIR | LIST /X

    WAD?
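
To make the two-byte difference concrete, here is a minimal Python sketch (not TCC itself). It assumes that Unicode output means UTF-16 and that the file case behaves like Python's BOM-writing `utf-16` codec; the sample text is made up.

```python
import codecs

# Made-up sample standing in for one line of DIR output.
text = "DIR output\r\n"

# The "utf-16" codec prepends a Byte Order Mark, like the file-redirection case.
with_bom = text.encode("utf-16")

# "utf-16-le" emits the same code units with no BOM, like the pipe case.
without_bom = text.encode("utf-16-le")

# The only size difference between the two encodings is the BOM itself.
bom_length = len(with_bom) - len(without_bom)
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(bom_length, starts_with_bom)
```

The length difference is the two bytes that LIST /X shows in the |! case.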
     
  2. Steve Fabian

    Joined:
    May 20, 2008
    Messages:
    3,523
    Likes Received:
    4
    The difference described above seems natural at first: a file might be read on another system with a different byte order, whereas a pipe is always local. However, temporary files such as those used by |! cannot be copied to any other system, so they don't actually need the Byte Order Mark (regardless of what Microsoft does). They do not even need an indication of encoding (ASCII, UTF-8, UTF-16, etc.), because the program that wrote them is the only one that can read them, and it already knows which encoding it used. By this logic the two operations should be identical: regardless of which command (internal or external) feeds the pipe and which command consumes its contents, standard pipes (concurrent processes) and in-process pipes should generate identical results. It should not be WAD. If an MS API is responsible for the BOM of the in-process pipe, IMHO TCC ought to work around it. OTOH, if a user decides to redirect command output to a Unicode file and later processes that file, the BOM is a natural part of that file. After all, it may be sent to a big-endian system!
     
  3. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,808
    Likes Received:
    82
    WAD (for a number of internal architectural reasons), and it's worked that way for 10+ years. A |! is *not* a pipe, it's executed as:

    DIR > tempfile & LIST /x < tempfile & del tempfile

    If I had to change an in-process pipe to match a real pipe exactly in all ways (including UnicodeOutput) -- well, I'd have to delete in-process pipes altogether, because it cannot be done.

    Why would you possibly care?
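
The temp-file expansion above can be imitated outside TCC. This is only a sketch: the `in_process_pipe` helper and its callbacks are hypothetical, and Python's `utf-16` text mode stands in for UnicodeOutput=Yes.

```python
import os
import tempfile

def in_process_pipe(produce, consume, encoding="utf-16"):
    """Hypothetical stand-in for `A |! B`: run A into a temp file, then feed it to B."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Step 1: DIR > tempfile (the text-mode writer emits a BOM first).
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(produce())
        # Step 2: LIST /x < tempfile (the consumer sees the raw bytes, BOM included).
        with open(path, "rb") as f:
            return consume(f.read())
    finally:
        # Step 3: del tempfile
        os.remove(path)

raw = in_process_pipe(lambda: "hello", lambda data: data)
print(len(raw), raw[:2])
```

Because the consumer reads an ordinary file, it gets whatever the file writer produced, including the BOM.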
     
  4. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,808
    Likes Received:
    82
    I should also point out that in-process pipes were implemented for users who were still stuck in DOS-mode-thinking, and there's rarely any reason to be using them nowadays.
     
  5. Charles Dye

    Charles Dye Super Moderator
    Staff Member

    Joined:
    May 20, 2008
    Messages:
    3,288
    Likes Received:
    39
    I don't, really. I was troubleshooting a problem of my own creation, and was confused because I didn't expect different data from the same command.

    (It seems that UTF-8 output doesn't get a BOM whether it's to a file or a pipe?)
     
  6. Charles Dye

    Charles Dye Super Moderator
    Staff Member

    Joined:
    May 20, 2008
    Messages:
    3,288
    Likes Received:
    39
    That's a strange statement; of course you can pipe from one program to another. I guess I'm not understanding what you're trying to say?
     
  7. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,808
    Likes Received:
    82
    WAD -- there's no standard agreement on whether UTF-8 files are supposed to have a BOM. (And it's difficult-to-impossible to even determine whether most files are UTF-8.)
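
As a rough illustration of why detection is hard, here is a sketch of BOM sniffing with a fallback validity check. The `sniff_encoding` name and its return labels are invented for this example, and UTF-32 is ignored for brevity.

```python
import codecs

def sniff_encoding(data: bytes) -> str:
    """Hypothetical helper: guess an encoding from leading bytes, then from validity."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"          # EF BB BF signature present
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"             # FF FE or FE FF
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"            # definitely not valid UTF-8
    # Valid UTF-8 is only a hint: ASCII and some legacy-codepage text also pass.
    return "utf-8-compatible"

print(sniff_encoding(b"\xef\xbb\xbfabc"))
print(sniff_encoding(b"abc"))
print(sniff_encoding(b"\xe9"))
```

Without a signature, the best the check can report is that the bytes are consistent with UTF-8, not that they were written as UTF-8.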
     
  8. Charles Dye

    Charles Dye Super Moderator
    Staff Member

    Joined:
    May 20, 2008
    Messages:
    3,288
    Likes Received:
    39
    No argument, just clarifying some details that aren't in the help.
     
  9. mfarah

    Joined:
    Nov 2, 2009
    Messages:
    226
    Likes Received:
    5
    Really? The Unicode standard recommends against the BOM for UTF-8: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."

    It can be accepted and processed, but shouldn't be generated in the first place.

    That's another issue. Although in a few cases (for specific languages), some heuristics can be applied...
     
  10. Charles Dye

    Charles Dye Super Moderator
    Staff Member

    Joined:
    May 20, 2008
    Messages:
    3,288
    Likes Received:
    39
    And it's obviously UTF-8 when a BOM is used as a UTF-8 signature.
     
  11. mfarah

    Joined:
    Nov 2, 2009
    Messages:
    226
    Likes Received:
    5
    ... that shouldn't be there in the first place. The BOM's purpose (and it's a kludge at that) is to declare whether a UTF-16 text file is big-endian or little-endian [... another kludge - Unicode should have chosen just one]. In UTF-8 it serves no purpose and breaks stuff all around. For example, in an XML file, the very first character is supposed to be a '<'...
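
A small sketch of the breakage, using Python's codecs rather than any particular XML parser. The XML snippet is made up, and the "consumer" is just a naive check of the first character.

```python
# Made-up document; "utf-8-sig" writes the EF BB BF signature before the text.
xml = '<?xml version="1.0"?><root/>'
with_sig = xml.encode("utf-8-sig")

# A plain UTF-8 decode keeps the signature as U+FEFF, so '<' is no longer first.
plain_decode = with_sig.decode("utf-8")

# A signature-aware decode strips it and restores the expected first character.
sig_decode = with_sig.decode("utf-8-sig")

print(repr(plain_decode[0]), repr(sig_decode[0]))
```

Any consumer that inspects the first character without handling the signature sees U+FEFF instead of '<'.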
     
  12. Steve Fabian

    Joined:
    May 20, 2008
    Messages:
    3,523
    Likes Received:
    4
    I was indicating that technically the temporary files implementing in-process pipes do not need encoding indicators (BOM), because they are both written and read by the same TCC process on the same computer. That process cannot change its byte ordering, so the encoding will always be the same without any BOM.
     
