BOM when UnicodeOutput=Yes

Charles Dye

It seems that when UnicodeOutput is enabled:

• redirection to a file generates a Byte Order Mark
• redirection to a pipe does not generate a Byte Order Mark
• DIR |! LIST /X will therefore show two more bytes than DIR | LIST /X

WAD?
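
The two-byte difference in the list above can be reproduced outside TCC. This is only a sketch: it assumes the Unicode output is UTF-16 little-endian (the thread does not state the byte order), and the sample string is a made-up stand-in for DIR output.

```python
import codecs

text = "DIR output\r\n"  # hypothetical stand-in for real command output

# What a real pipe carries in this sketch: raw UTF-16LE code units, no BOM.
piped = text.encode("utf-16-le")

# What a redirected file contains: the same code units preceded by the BOM.
to_file = codecs.BOM_UTF16_LE + piped

print(len(to_file) - len(piped))  # 2
print(to_file[:2].hex())          # fffe
```

The payload is identical in both cases; only the two leading bytes differ.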
 
The difference described above looks natural at first: a file might be read on another system with a different byte order, whereas a pipe is always local. But temporary files, such as those used by |!, are never copied to another system, so they don't actually need the Byte Order Mark (regardless of what Microsoft does). They do not even need an indication of encoding (ASCII, UTF-8, UTF-16, etc.) because the same program that wrote them is the only one that can read them, and the program already knows which encoding it used. By this logic the two operations should be identical: regardless of which command (internal or external) feeds the pipe and which command consumes its contents, standard pipes (concurrent processes) and in-process pipes should produce identical results. It should not be WAD. If an MS API is responsible for the BOM in the in-process pipe, IMHO TCC ought to work around it. OTOH, if a user redirects command output to a Unicode file and later processes that file, the BOM is a natural part of that file. After all, it may be sent to a big-endian system!
 
It seems that when UnicodeOutput is enabled:

• redirection to a file generates a Byte Order Mark
• redirection to a pipe does not generate a Byte Order Mark
• DIR |! LIST /X will therefore show two more bytes than DIR | LIST /X

WAD?

WAD (for a number of internal architectural reasons), and it's worked that way for 10+ years. A |! is *not* a pipe; it's executed as:

DIR > tempfile & LIST /x < tempfile & del tempfile

If I had to change an in-process pipe to match a real pipe exactly in all ways (including UnicodeOutput) -- well, I'd have to delete in-process pipes altogether, because it cannot be done.
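
The temp-file sequence above can be modeled in a few lines. This is a rough sketch, not TCC's implementation: the function names are hypothetical, UTF-16LE is assumed, and the BOM is written by hand to mimic what file redirection does with UnicodeOutput enabled.

```python
import os
import tempfile

def via_temp_file(text: str) -> bytes:
    """Model `producer > tempfile & consumer < tempfile & del tempfile`."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # producer > tempfile: file redirection, so a BOM is written first
        with open(path, "wb") as f:
            f.write(b"\xff\xfe" + text.encode("utf-16-le"))
        # consumer < tempfile: the consumer sees every byte, BOM included
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)  # del tempfile

def via_real_pipe(text: str) -> bytes:
    """Model a true pipe: bytes go straight from writer to reader, no BOM."""
    r, w = os.pipe()
    with os.fdopen(w, "wb") as writer:
        writer.write(text.encode("utf-16-le"))
    with os.fdopen(r, "rb") as reader:
        return reader.read()
```

Under these assumptions, `via_temp_file("abc")` returns the same bytes as `via_real_pipe("abc")` with `b"\xff\xfe"` in front, which is the difference LIST /X displays.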

Why would you possibly care?
 
Why would you possibly care?

I don't, really. I was troubleshooting a problem of my own creation, and was confused because I didn't expect different data from the same command.

(It seems that UTF-8 output doesn't get a BOM whether it's to a file or a pipe?)
 
They do not even need an indication of encoding (ASCII, UTF-8, UTF-16, etc.) because the same program that wrote them is the only one that can read them, and the program already knows which encoding it used.

That's a strange statement; of course you can pipe from one program to another. I guess I'm not understanding what you're trying to say?
 
WAD -- there's no standard agreement on whether UTF-8 files are supposed to have a BOM.

Really? The Unicode standard recommends against the BOM for UTF-8: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."

It can be accepted and processed, but shouldn't be generated in the first place.
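
The "accept but don't generate" policy can be illustrated with Python's standard codecs, used here only as an example unrelated to TCC itself: `utf-8-sig` drops a leading BOM on input if one is present, while plain `utf-8` never writes one on output. The payload string is made up.

```python
import codecs

payload = "<root/>"
with_bom = codecs.BOM_UTF8 + payload.encode("utf-8")

# Plain UTF-8 decoding keeps the BOM as a leading U+FEFF character.
plain = with_bom.decode("utf-8")

# "utf-8-sig" accepts the signature and strips it; BOM-less input passes through.
stripped = with_bom.decode("utf-8-sig")
untouched = payload.encode("utf-8").decode("utf-8-sig")

# Writing with plain "utf-8" generates no BOM.
written = stripped.encode("utf-8")
```

Here `written` starts with `<`, whereas `with_bom` starts with the three signature bytes EF BB BF.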

(And it's difficult-to-impossible to even determine whether most files are UTF-8.)

That's another issue. Although in a few cases (for specific languages), some heuristics can be applied...
 
And it's obviously UTF-8 when a BOM is used as a UTF-8 signature.

... that shouldn't be there in the first place. The BOM's purpose (and a kludge, at that) is to declare whether a UTF-16 text file is big-endian or little-endian [... another kludge - Unicode should have chosen just one]. In UTF-8 it serves no purpose and breaks stuff all around. For example, in an XML file, the very first character is supposed to be a '<'...
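
To make the endianness point concrete, here is a small sketch of BOM sniffing. `sniff_bom` is a hypothetical helper, not part of any tool discussed in this thread, and it deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16LE one.

```python
import codecs
from typing import Optional

# Signature bytes mapped to the codec a reader could use afterwards.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

# The same character has a different byte order in each UTF-16 flavor.
le = "A".encode("utf-16-le")  # bytes 41 00
be = "A".encode("utf-16-be")  # bytes 00 41
```

Without a BOM, `sniff_bom(le)` and `sniff_bom(be)` both return `None`, so the reader has to know the byte order some other way. UTF-8 has no byte-order ambiguity, so there the BOM can only act as a signature.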
 
That's a strange statement; of course you can pipe from one program to another. I guess I'm not understanding what you're trying to say?
I was indicating that technically the temporary files implementing in-process pipes do not need encoding indicators (BOM), because they are both written and read by the same TCC process on the same computer. That process cannot change its byte order, so the order will always be the same without any BOM.
 