
Unicode question

Discussion in 'Plugins' started by vefatica, Apr 19, 2009.

  1. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,785
    Likes Received:
    29
    If the user does
    Code:
    ECHO %@CHAR[27]
    then a left-pointing arrow appears on the screen. What is actually in the console screen buffer is the Unicode character 8592 (0x2190). If a plugin internal variable (_CURCHAR) returns that character like this
    Code:
    ReadConsoleOutput(STD_OUT, &ci, cdOne, cdZero, &sr);
    swprintf(pszArgs, L"%c", ci.Char.UnicodeChar);
    return 0;
    then the only tests for it I can find are
    Code:
    IF %@CHAR[%_CURCHAR] == 8592
    IF %@UNICODE[%_CURCHAR] == 8592
    Is there any way to test for that character using the more familiar number 27? If, internally, I try WideCharToMultiByte(CP_ACP) on it, I wind up with 63, i.e., the question mark (the default character for unmappable code points, I suppose).

    Thanks!
     
  2. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,785
    Likes Received:
    29
    That last code was in error; it should have read:
    Code:
    IF %@ASCII[%_CURCHAR] == 8592
    IF %@UNICODE[%_CURCHAR] == 8592
     
  3. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,730
    Likes Received:
    80
    vefatica wrote:

    This is a Windows / console manager issue, not TCC. I don't know the
    answer; try Microsoft.

    Rex Conn
    JP Software
     
  4. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,785
    Likes Received:
    29
    Referring to an <Esc> glyph in the console screen buffer ...

    If I use ReadConsoleOutputW() on that character, I get

    CHAR_INFO::Char.UnicodeChar = 8592 (the right glyph)
    CHAR_INFO::Char.AsciiChar = 65424 [garbage?]

    If I use ReadConsoleOutputCharacterW(), I get 8592.

    If I use ReadConsoleOutputCharacterA(), I get 27.

    I didn't try ReadConsoleOutputA().

    Any thoughts Rex?
     
  5. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,785
    Likes Received:
    29
    On Mon, 20 Apr 2009 21:01:59 -0500, vefatica <> wrote:

    |Referring to an <Esc> glyph in the console screen buffer ...
    |
    |If I use ReadConsoleOutputW() on that character, I get
    |
    | CHAR_INFO::Char.UnicodeChar = 8592 (the right glyph)
    | CHAR_INFO::Char.AsciiChar = 65424 [garbage?]
    |
    |If I use ReadConsoleOutputCharacterW(), I get 8592.
    |
    |If I use ReadConsoleOutputCharacterA(), I get 27.
    |
    |I didn't try ReadConsoleOutputA().

    My guess is that the console screen buffer **is** ASCII (since you're stuck with
    some code page) and the ReadConsoleOutput[Character]W() functions translate into
    an appropriate Unicode glyph. Make sense? What do you think Rex?
    --
    - Vince
     
  6. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,730
    Likes Received:
    80
    vefatica wrote:

    That's what I would expect. What's your question?

    Rex Conn
    JP Software
     
  7. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,730
    Likes Received:
    80
    vefatica wrote:

    Other way around -- the console buffer is Unicode and the translations
    are into ASCII. (Which results occasionally in some odd conversions.)

    All of XP/Vista/etc. is Unicode internally.

    Rex Conn
    JP Software
     
  8. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,785
    Likes Received:
    29
    On Mon, 20 Apr 2009 21:42:38 -0500, rconn <> wrote:

    |vefatica wrote:
    |
    |
    |---Quote---
    |> ---Quote (Originally by rconn)---
    |> This is a Windows / console manager issue, not TCC. I don't know the
    |> answer; try Microsoft.
    |> ---End Quote---
    |> Referring to an <Esc> glyph in the console screen buffer ...
    |>
    |> If I use ReadConsoleOutputW() on that character, I get
    |>
    |> CHAR_INFO::Char.UnicodeChar = 8592 (the right glyph)
    |> CHAR_INFO::Char.AsciiChar = 65424 [garbage?]
    |>
    |> If I use ReadConsoleOutputCharacterW(), I get 8592.
    |>
    |> If I use ReadConsoleOutputCharacterA(), I get 27.
    |>
    |> I didn't try ReadConsoleOutputA().
    |>
    |> Any thoughts Rex?
    |---End Quote---
    |That's what I would expect. What's your question?

    I guess it's this: Is the console screen buffer both ASCII and Unicode, keeping
    a record of both?
    --
    - Vince
     
  9. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,730
    Likes Received:
    80
    vefatica wrote:
    >

    Not to my knowledge. But the only one who'd know for sure is the author
    of the console manager.

    Rex Conn
    JP Software
     
  10. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,785
    Likes Received:
    29
    A seemingly knowledgeable gent replied to my newsgroup query as follows
    (below). It's beyond me. Does it make sense to you? I can accurately get
    the character under the mouse cursor and reproduce it. As for turning it
    into a **familiar** number (some character code), I think I'm SOL.

    Quoting:

    You have dipped into a subject that mixes ancient history and modern
    internationalization.

    The original IBM CGA display included fonts in its ROM that had glyphs in
    all 256 places, including the control characters and the high 128
    characters. The glyph for 0x1B was a left-facing arrow.

    Today, this character set lives on as the default 8-bit code page for
    command shells, CP437. The console buffer (essentially a virtualization of
    the CGA text-mode buffer at 0xB8000) is an 8-bit buffer, so the value that
    is written is the 8-bit value 0x1B.

    When you use ReadConsoleOutputW, the system does an ANSI-to-Unicode
    conversion for you, using the CP437 code page. Since 0x1B in CP437 is
    left-pointing-arrow, you read 0x2190.

    -It's interesting. If I use ReadConsoleOutputW() on that character, I get
    -
    - CHAR_INFO::Char.UnicodeChar = 8592
    - CHAR_INFO::Char.AsciiChar = 65424 [garbage?]

    This would have made more sense if you had looked at this in hex.

    8592 = 0x2190
    65424 = 0xff90

    This is just taking the low-order byte of the Unicode character you got,
    and sign-extending it.

    -If I use ReadConsoleOutputCharacterW(), I get 8592.
    -If I use ReadConsoleOutputCharacterA(), I get 27.
    -If I use ReadConsoleOutputA(), I get 27.
    -
    -So the "A" version of the functions is doing some translating (or the "W"
    -version is). WideCharToMultiByte() always failed to translate correctly,
    -turning 8592 into 63 ("?", the default for unprintables). I wish I
    -understood what's going on.

    When YOU call WideCharToMultiByte, you are using some other 8-bit code
    page, and that code page does not have an encoding for "left-pointing
    arrow". If you called WideCharToMultiByte with CP437, you would get 0x1B.
    --
    - Vince
     
  11. Steve Fabian

    Joined:
    May 20, 2008
    Messages:
    3,523
    Likes Received:
    4
    vefatica wrote:
    | A seemingly knowledgeable gent replied to my newsgroup query thus
    | (below). It's beyond me. Does it make sense to you. I can
    | accurately get the character under the mouse cursor and reproduce it.
    | As for turning it into a **familiar** number (some character code) I
    | think I'm SOL.
    |
    | Quoting:
    |
    | You have dipped into subject that mixes ancient history and modern
    | internationalization.
    |
    | The original IBM CGA display included fonts in its ROM that had
    | glyphs in
    | all 256 places, including the control characters and the high 128
    | characters. The glyph for 0x1B was a left-facing arrow.
    |
    | Today, this character set lives on as the default 8-bit code page for
    | command shells, CP437. The console buffer (essentially a
    | virtualization of
    | the CGA text-mode buffer at 0xB8000) is an 8-bit buffer, so the value
    | that
    | is written is the 8-bit value 0x1B.
    |
    | When you use ReadConsoleOutputW, the system does an ANSI-to-Unicode
    | conversion for you, using the CP437 code page. Since 0x1B in CP437 is
    | left-pointing-arrow, you read 0x2190.
    |
    | -It's interesting. If I use ReadConsoleOutputW() on that character,
    | I get -
    | - CHAR_INFO::Char.UnicodeChar = 8592
    | - CHAR_INFO::Char.AsciiChar = 65424 [garbage?]
    |
    | This would have made more sense if you had looked at this in hex.
    |
    | 8592 = 0x2190
    | 65424 = 0xff90
    |
    | This is just taking the low-order byte of the Unicode character you
    | got,
    | and sign-extending it.
    |
    | -If I use ReadConsoleOutputCharacterW(), I get 8592.
    | -If I use ReadConsoleOutputCharacterA(), I get 27.
    | -If I use ReadConsoleOutputA(), I get 27.
    | -
    | -So the "A" version of the functions is doing some translating (or
    | the "W"
    | -version is). WideCharToMultiByte() always failed to translate
    | correctly for
    | -returning 8592 into 63 ("?", the default un-printable). I wish I
    | understood
    | -what's going on.
    |
    | When YOU call WideCharToMultiByte, you are using some other 8-bit code
    | page, and that code page does not have an encoding for "left-pointing
    | arrow". If you called WideCharToMultiByte with CP437, you would get
    | 0x1B.

    Seems that the W mode performs glyph-based code translation. The glyph for
    any octet that is not a printable ASCII character (0x00-0x1F, 0x7F-0xFF)
    depends on the codepage, and is thus translated. For printable characters
    within the ASCII range (0x20-0x7E) the W mode should be OK. The A mode
    seems to be OK - it does not translate; it returns the actual octets.

    PS: Nearly 20 years ago I used the CP437 non-printable codes to draw on
    the screen the outline of an add-on PC card, showing its proper jumper
    settings for the BIOS and other add-in cards in use (to select memory
    mapping and port selection).
    --
    Steve
     
  12. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,785
    Likes Received:
    29
    On Tue, 21 Apr 2009 23:27:56 -0500, Steve Fábián <> wrote:

    |Seems that the W mode performs glyph-based code translation. The glyph for
    |any octet that is not a printable ASCII character (0x00-0x1F, 0x7F-0xFF)
    |depends on the codepage, and is thus translated. Printable characters within
    |the ASCII range (0x20-0x7E) the W codes should be OK. The A mode seems to be
    |OK - it does not translate, returns the actual octets.
    |
    |PS: Nearly 20 years ago I had used the CP437 non-printable codes to display
    |on the screen the outline drawing of an add-on PC card, showing its proper
    |jumper settings for the BIOS and other add-in cards in use (to select memory
    |mapping and port selection).

    It's all unintelligible to me. Can someone explain this? (I'm not complaining.)

    chcp
    Active code page: 437

    echo %@ascii[%@char[240]]
    240

    Fine!

    But if I "echo %@char[240]" then copy/paste the result into %@ascii[], I get

    echo %@ascii[d]
    100

    What's going on?
    --
    - Vince
     
  13. Steve Fabian

    Joined:
    May 20, 2008
    Messages:
    3,523
    Likes Received:
    4
    vefatica wrote:
    | It's all unintelligible to me. Can someone explain this (I'm not
    | complaining).
    |
    | chcp
    | Active code page: 437
    |
    | echo %@ascii[%@char[240]]
    | 240
    |
    | Fine!
    |
    | But if I "echo %@char[240]" then copy/paste the result into
    | %@ascii[], I get
    |
    | echo %@ascii[d]
    | 100

    In standalone TCC 10.00.67 on Windows XP (SP3), with UnicodeOutput=No, I
    got the same result (100) from the command
    echo %@ascii[%@execstr[echo %@char[240]]]
    as in your last example.

    When I switched to UnicodeOutput=Yes the command displayed 240! I was
    amazed...

    The problem is that when UnicodeOutput=No, the display is in ASCII with
    CP437 extensions (Ax437 below), so TCC sends its output (including that
    which is processed via @EXECSTR without actual display) through a
    many-to-few mapping (translation) of Unicode to Ax437, and then an Ax437
    to Unicode (one-to-one) mapping before it is used in @ASCII; the
    many-to-few step is what turns character 240 into lowercase d. When
    UnicodeOutput=Yes, no mapping is done, so the character round-trips
    unchanged and @ASCII reports 240.

    BTW, my myriad of X3.64 color-changing escape sequences work perfectly well
    when UnicodeOutput=Yes...
    --
    Steve
     
  14. Steve Fabian

    Joined:
    May 20, 2008
    Messages:
    3,523
    Likes Received:
    4
    Steve Fabian wrote:
    | vefatica wrote:
    || It's all unintelligible to me. Can someone explain this (I'm not
    || complaining).
    ||
    || chcp
    || Active code page: 437
    ||
    || echo %@ascii[%@char[240]]
    || 240
    ||
    || Fine!
    ||
    || But if I "echo %@char[240]" then copy/paste the result into
    || %@ascii[], I get
    ||
    || echo %@ascii[d]
    || 100
    |
    | In standalone TCC 10.00.67 in Windows XP (SP3), with
    | UnicodeOutput=No, I had the same result (100) from the command
    | echo %@ascii[%@execstr[echo %@char[240]]]
    | as your last point.
    |
    | When I switched to UnicodeOutput=Yes the command displayed 240! I was
    | amazed...
    |
    | The problem is that when UnicodeOutput=No, the display is in ASCII
    | with CP437 extensions (Ax437 below), so TCC sends its output
    | (including that which is processed via @EXECSTR without actual
    | display) through a many-to-few mapping (translation) of Unicode to
    | Ax437, and an Ax437 to Unicode (one-to-one) mapping before it is used
    | in @ASCII. When UnicodeOutput=Yes, no mapping is done, thus the
    | output both ways is the same (lowercase d).
    |
    | BTW, my myriad of X3.64 color-changing escape sequences work
    | perfectly well when UnicodeOutput=Yes...
     
