
Character / string classification function

Discussion in 'Suggestions' started by Steve Fabian, Oct 24, 2011.

  1. Steve Fabian

    Joined:
    May 20, 2008
    Messages:
    3,523
    Likes Received:
    4
    We have a bunch of functions which can test whether or not a character or a string belongs to one of several classes (@ISDIGIT, @ISALPHA, @ISPUNCT, etc.). These can be used in a series of IFF / ELSEIFF tests. However, it would often be better to use a single function that classifies the string, indicating which class(es) it belongs to, and then use a single SWITCH statement:
    SWITCH %@CLASS[%string]
    CASE digit
    ...
    CASE alpha
    ...

    etc.
    --
    Steve
     
  2. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,794
    Likes Received:
    29
    On Mon, 24 Oct 2011 10:18:28 -0400, Steve Fabian wrote:

    |We have a bunch of functions which can test whether or not a character or a string belongs to one of several classes (@ISDIGIT, @ISALPHA, @ISPUNCT, etc.). These can be used in a series of IFF / ELSEIFF tests. However, often one could better use a single function that classifies the string to indicate which one(s) it belongs to, and use a single SWITCH statement:
    |SWITCH %@CLASS[%string]
    |CASE digit
    |...
    |CASE alpha
    |...
    |
    |etc.
    |--

    Hmmm! ... like a function (@STRTYPE[]?) that returns a bit map ...


    #define CONTAINS_NUMERIC 0x01
    #define CONTAINS_LOWER 0x02
    #define CONTAINS_UPPER 0x04
    #define CONTAINS_PUNCT 0x08
    #define CONTAINS_SPACE 0x10
    #define IS_QUOTED 0x20

    What did you have in mind? What would you use it for?
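A minimal C sketch of what such a bit-map function might compute, using the flag values listed above. The name `str_type` is hypothetical, and so is the rule that "quoted" means the string begins and ends with a double quote; neither comes from TCC.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Flag values copied from the post above. */
#define CONTAINS_NUMERIC 0x01
#define CONTAINS_LOWER   0x02
#define CONTAINS_UPPER   0x04
#define CONTAINS_PUNCT   0x08
#define CONTAINS_SPACE   0x10
#define IS_QUOTED        0x20

/* Hypothetical helper: returns the OR of every flag that applies to s.
   Assumption: "quoted" means s starts and ends with a double quote. */
unsigned str_type(const char *s)
{
    assert(s != NULL);
    unsigned flags = 0;
    size_t len = strlen(s);

    if (len >= 2 && s[0] == '"' && s[len - 1] == '"')
        flags |= IS_QUOTED;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (isdigit(c)) flags |= CONTAINS_NUMERIC;
        if (islower(c)) flags |= CONTAINS_LOWER;
        if (isupper(c)) flags |= CONTAINS_UPPER;
        if (ispunct(c)) flags |= CONTAINS_PUNCT;
        if (isspace(c)) flags |= CONTAINS_SPACE;
    }
    return flags;
}
```

In the default "C" locale, `str_type("3.14")` yields `CONTAINS_NUMERIC | CONTAINS_PUNCT`. The quote characters themselves count as punctuation, so a quoted string always has `CONTAINS_PUNCT` set as well.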
     
  3. samintz

    samintz Scott Mintz

    Joined:
    May 20, 2008
    Messages:
    1,179
    Likes Received:
    11
    Functionally there is no difference. And very little difference from a
    coding perspective. What's the advantage?

    IFF %@ISDIGIT[%string] THEN
    ...
    ELSEIFF %@ISALPHA[%string] THEN
    ...
    etc.

    But you could always roll your own:
    function CLASS=`%@IF[%@ISDIGIT[%1],digit,%@IF[%@ISALPHA[%1],alpha,%@IF[%@ISPUNCT[%1],punct,none]]]`

    -Scott
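For comparison, the same first-match cascade can be sketched in C. This is only an illustration of the logic of the nested @IF calls, not of how TCC implements them; `all_of` and `classify` are hypothetical names, and treating the empty string as matching no class is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if every character of s satisfies pred, 0 otherwise.
   Assumption: an empty string matches no class. */
static int all_of(const char *s, int (*pred)(int))
{
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++)
        if (!pred((unsigned char)*s))
            return 0;
    return 1;
}

/* First matching class wins, in the same order as the nested @IF calls. */
const char *classify(const char *s)
{
    assert(s != NULL);
    if (all_of(s, isdigit)) return "digit";
    if (all_of(s, isalpha)) return "alpha";
    if (all_of(s, ispunct)) return "punct";
    return "none";
}
```

Each predicate rescans the string from the start, so a string that ends up as "punct" or "none" is examined up to three times.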




     
  4. Steve Fabian

    From: vefatica
    | Hmmm! ... like function (@STRTYPE[]?) to return a bit map ...
    |
    |
    | #define CONTAINS_NUMERIC 0x01
    | #define CONTAINS_LOWER 0x02
    | #define CONTAINS_UPPER 0x04
    | #define CONTAINS_PUNCT 0x08
    | #define CONTAINS_SPACE 0x10
    | #define IS_QUOTED 0x20
    |
    | What did you have in mind?

    Yes, that's the kind of thing I have in mind.

    | What would you use it for?

    I'd use it as the control parameter of a SWITCH statement when parsing batch parameters, and possibly text files. I also want to limit the list of values in the CASE statements, so I may want flexibility in the function's values, e.g., an optional parameter to indicate whether or not upper and lower case should be distinguished. The reason for wanting fewer values is my GUESS that each of the .OR.'d values in a CASE requires a separate comparison before a mismatch is declared.
    --
    Steve
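A rough C sketch of the idea: classify the string once, with an optional flag controlling whether upper and lower case are reported separately, then branch on the single result. All names here (`classify_once`, `describe`, the `CLASS_*` values) are hypothetical; nothing below is an existing TCC function.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum str_class {
    CLASS_NONE, CLASS_DIGIT, CLASS_ALPHA, CLASS_LOWER,
    CLASS_UPPER, CLASS_PUNCT, CLASS_MIXED
};

static int is_letter_class(enum str_class k)
{
    return k == CLASS_ALPHA || k == CLASS_LOWER || k == CLASS_UPPER;
}

/* Class of one character; anything unlisted (e.g. whitespace) is MIXED. */
static enum str_class char_class(unsigned char c, int distinguish_case)
{
    if (isdigit(c)) return CLASS_DIGIT;
    if (isalpha(c)) {
        if (!distinguish_case) return CLASS_ALPHA;
        return islower(c) ? CLASS_LOWER : CLASS_UPPER;
    }
    if (ispunct(c)) return CLASS_PUNCT;
    return CLASS_MIXED;
}

/* Single pass over s; stops early once the string is known to be mixed. */
enum str_class classify_once(const char *s, int distinguish_case)
{
    assert(s != NULL);
    if (*s == '\0')
        return CLASS_NONE;
    enum str_class result = char_class((unsigned char)*s, distinguish_case);
    for (s++; *s != '\0' && result != CLASS_MIXED; s++) {
        enum str_class k = char_class((unsigned char)*s, distinguish_case);
        if (k == result)
            continue;
        /* Lower plus upper case letters still count as alphabetic. */
        result = (is_letter_class(k) && is_letter_class(result))
                     ? CLASS_ALPHA : CLASS_MIXED;
    }
    return result;
}

/* One switch on the precomputed class, mirroring SWITCH / CASE. */
const char *describe(const char *s)
{
    switch (classify_once(s, 0)) {
    case CLASS_DIGIT: return "digit";
    case CLASS_ALPHA: return "alpha";
    case CLASS_PUNCT: return "punct";
    default:          return "other";
    }
}
```

With case folding (`distinguish_case` set to 0) the switch needs only one letter case instead of separate upper and lower cases, which is the kind of shorter CASE list described above.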
     
  5. Steve Fabian

    From: samintz

    | Functionally there is no difference. And very little difference from a
    | coding perspective. What's the advantage?
    |
    | IFF %@ISDIGIT[%string] THEN
    | ...
    | ELSEIFF %@ISALPHA[%string] THEN
    | ...
    | etc.

    Timing. Instead of classifying the string many times, do it just once, and use the result.
    --
    Steve
     
  6. samintz

    samintz Scott Mintz

    In order to create a string class or a bitmap of the types, it *still* has
    to be classified many times. How else would you create the bitmap?

    -Scott



     
  7. Steve Fabian

    From: samintz
    | In order to create a string class or a bitmap of the types, it *still* has
    | to be classified many times. How else would you create the bitmap?

    If you look at the typical implementation in the Standard C library of the character classification functions underlying TCC's isdigit, isxdigit, etc., you will see that they use a constant array indexed by the character code. Its values are laid out in the manner Vince suggested: one bit for decimal digit, another for hexadecimal digit, one for lower case letter, one for whitespace, one for punctuation, etc. To classify a string, you just bitwise-OR the class codes of each character in the string and evaluate the final result; e.g., @ISDIGIT checks that the only bit set is the one for decimal digit. That is, of course, why floating point numbers do not match (ref. your post in TC Support): they would also have the punctuation bit set by the decimal separator (and possibly by the thousands separator). So this is a much faster test. Look at <ctype.h>. It is unfortunate that Standard C never specified a function that explicitly returns the actual table entry or, for a string, the bitwise OR of the entries, which would allow the user to check for custom classes.
    --
    Steve
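The table-and-OR scheme described above can be sketched in C as follows. This is an illustration under stated assumptions, not the code of any particular C library: the bit names and function names are made up, the table is filled at first use rather than being a compiled-in constant, and the character ranges assume ASCII. Every byte value gets exactly one bit (with `C_OTHER` as the catch-all), so the OR over a string records precisely which classes occur in it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical class bits; real libraries use their own layouts. */
enum {
    C_DIGIT = 0x01, C_LOWER = 0x02, C_UPPER = 0x04,
    C_SPACE = 0x08, C_PUNCT = 0x10, C_OTHER = 0x20
};

static unsigned char class_table[256];
static int table_ready = 0;

/* Assumes ASCII: contiguous digits and letters, '\t'..'\r' whitespace. */
static unsigned char class_of(int c)
{
    if (c >= '0' && c <= '9') return C_DIGIT;
    if (c >= 'a' && c <= 'z') return C_LOWER;
    if (c >= 'A' && c <= 'Z') return C_UPPER;
    if (c == ' ' || (c >= '\t' && c <= '\r')) return C_SPACE;
    if (c > ' ' && c < 127) return C_PUNCT;
    return C_OTHER;               /* control and non-ASCII bytes */
}

static void build_table(void)
{
    for (int c = 0; c < 256; c++)
        class_table[c] = class_of(c);
    table_ready = 1;
}

/* One pass, one table lookup per character, results OR'ed together. */
unsigned class_bits(const char *s)
{
    assert(s != NULL);
    if (!table_ready)
        build_table();
    unsigned bits = 0;
    for (; *s != '\0'; s++)
        bits |= class_table[(unsigned char)*s];
    return bits;
}

/* "All digits" means the digit bit is the only bit set. */
int is_digit_string(const char *s)
{
    return class_bits(s) == C_DIGIT;
}
```

For "3.14" the result is `C_DIGIT | C_PUNCT`, so `is_digit_string` rejects it, matching the floating point behaviour mentioned above. One caveat of the OR approach: if a character could carry two bits at once (say, a letter that is also a hexadecimal digit), the OR alone would no longer tell you whether every character belongs to a class, which is why this sketch gives each character a single bit.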
     
  8. samintz

    samintz Scott Mintz

    That's only practical for ASCII characters where the table is only 256
    bytes long. For Unicode, I don't know how those functions work -
    especially for non-English languages.
    -Scott




     
  9. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,732
    Likes Received:
    81
    TCC doesn't use those RTL functions, because they won't work in anything but
    English.
     
  10. Steve Fabian

    From: samintz
    | That's only practical for ASCII characters where the table is only 256
    | bytes long. For Unicode, I don't know how those functions work -
    | especially for non-English languages.

    Unicode as Windows stores it (16-bit UTF-16 code units) is still only a 16-bit code, and on today's machines a 65536-entry table is not excessive. BTW, IIRC the table entries are 16-bit integers, so you'd use 131072 bytes, the same as the maximum length of a fully expanded TCC command (including the terminating NUL). Note, however, Rex's comment that TCC does not use the Standard C RTL functions because they are implemented only for 8-bit character sets.
    --
    Steve
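The size arithmetic above can be checked with a small C sketch of a table indexed by 16-bit code unit. Everything here is hypothetical: the names, the bit values, and in particular the table contents, which classify only the ASCII range and mark every other code unit as `W_OTHER`. A real table would need proper Unicode character data, and characters outside the 16-bit range (stored in UTF-16 as surrogate pairs) cannot be classified one code unit at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical class bits in a 16-bit table entry. */
enum { W_DIGIT = 0x0001, W_LOWER = 0x0002, W_UPPER = 0x0004, W_OTHER = 0x8000 };

/* 65536 entries of 2 bytes each: 131072 bytes in total. */
static uint16_t wide_table[65536];
static int wide_table_ready = 0;

size_t wide_table_bytes(void)
{
    return sizeof wide_table;
}

/* Placeholder contents: only ASCII digits and letters are classified. */
static void build_wide_table(void)
{
    for (uint32_t c = 0; c < 65536; c++) {
        if (c >= 0x30 && c <= 0x39)      wide_table[c] = W_DIGIT;
        else if (c >= 0x61 && c <= 0x7A) wide_table[c] = W_LOWER;
        else if (c >= 0x41 && c <= 0x5A) wide_table[c] = W_UPPER;
        else                             wide_table[c] = W_OTHER;
    }
    wide_table_ready = 1;
}

/* OR of the table entries for a NUL-terminated string of 16-bit units. */
unsigned wide_class_bits(const uint16_t *s)
{
    assert(s != NULL);
    if (!wide_table_ready)
        build_wide_table();
    unsigned bits = 0;
    for (; *s != 0; s++)
        bits |= wide_table[*s];
    return bits;
}
```

The lookup itself is no slower than with a 256-entry table; the cost of going to 16 bits lies in the memory and in filling the table correctly for every language.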
     
