Character / string classification function

#1
We have a bunch of functions which can test whether or not a character or a string belongs to one of several classes (@ISDIGIT, @ISALPHA, @ISPUNCT, etc.). These can be used in a series of IFF / ELSEIFF tests. However, it would often be better to have a single function that classifies the string, indicating which class(es) it belongs to, so that the result can drive a single SWITCH statement:
SWITCH %@CLASS[%string]
CASE digit
...
CASE alpha
...

etc.
--
Steve
 
#2
On Mon, 24 Oct 2011 10:18:28 -0400, Steve Fabian wrote:

|We have a bunch of functions which can test whether or not a character or a string belongs to one of several classes (@ISDIGIT, @ISALPHA, @ISPUNCT, etc.). These can be used in a series of IFF / ELSEIFF tests. However, often one could better use a single function that classifies the string to indicate which one(s) it belongs to, and use a single SWITCH statement:
|SWITCH %@CLASS[%string]
|CASE digit
|...
|CASE alpha
|...
|
|etc.
|--

Hmmm! ... like a function (@STRTYPE[]?) that returns a bit map ...


#define CONTAINS_NUMERIC 0x01
#define CONTAINS_LOWER 0x02
#define CONTAINS_UPPER 0x04
#define CONTAINS_PUNCT 0x08
#define CONTAINS_SPACE 0x10
#define IS_QUOTED 0x20

What did you have in mind? What would you use it for?
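
For illustration, the bit map idea above could be sketched in C roughly as follows. This is only a sketch under assumptions: the function name str_type is hypothetical, "quoted" is assumed to mean that the string starts and ends with a double quote, and the standard <ctype.h> tests stand in for whatever TCC does internally. The flag values are the ones listed above.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Flag values copied from the post above. */
#define CONTAINS_NUMERIC 0x01
#define CONTAINS_LOWER   0x02
#define CONTAINS_UPPER   0x04
#define CONTAINS_PUNCT   0x08
#define CONTAINS_SPACE   0x10
#define IS_QUOTED        0x20

/* Hypothetical helper: returns the OR of the flags of every character in s. */
unsigned str_type(const char *s)
{
    assert(s != NULL);
    unsigned flags = 0;
    size_t len = strlen(s);

    /* Assumption: "quoted" means the string starts and ends with '"'. */
    if (len >= 2 && s[0] == '"' && s[len - 1] == '"')
        flags |= IS_QUOTED;

    for (size_t i = 0; i < len; i++) {
        /* The ctype functions require a value representable as unsigned char. */
        unsigned char c = (unsigned char)s[i];
        if (isdigit(c)) flags |= CONTAINS_NUMERIC;
        if (islower(c)) flags |= CONTAINS_LOWER;
        if (isupper(c)) flags |= CONTAINS_UPPER;
        if (ispunct(c)) flags |= CONTAINS_PUNCT;
        if (isspace(c)) flags |= CONTAINS_SPACE;
    }
    return flags;
}
```

Because the quote characters are themselves punctuation, a quoted string also reports CONTAINS_PUNCT in this sketch. A caller would test the result against a mask, e.g. a value equal to CONTAINS_NUMERIC means "digits only".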
 

samintz

Scott Mintz
May 20, 2008
1,288
11
Solon, OH, USA
#3
Functionally there is no difference. And very little difference from a
coding perspective. What's the advantage?

IFF %@ISDIGIT[%string] THEN
...
ELSEIFF %@ISALPHA[%string] THEN
...
etc.

But you could always roll your own:
function CLASS=`%@IF[%@ISDIGIT[%1],digit,%@IF[%@ISALPHA[%1],alpha,%@IF[%@ISPUNCT[%1],punct,none]]]`

-Scott


 
#4
From: vefatica
| Hmmm! ... like function (@STRTYPE[]?) to return a bit map ...
|
|
| #define CONTAINS_NUMERIC 0x01
| #define CONTAINS_LOWER 0x02
| #define CONTAINS_UPPER 0x04
| #define CONTAINS_PUNCT 0x08
| #define CONTAINS_SPACE 0x10
| #define IS_QUOTED 0x20
|
| What did you have in mind?

Yes, that's the kind of thing I have in mind.

| What would you use it for?

I'd use it for parsing batch parameters, and possibly text files, as the control parameter of a SWITCH statement. I also want to limit the list of values in the CASE statements, so I may want flexibility in the function values, e.g., optional parameters to indicate whether or not upper and lower case should be distinguished. I want fewer values because my GUESS is that each .OR.'d value in a CASE requires its own comparison before a mismatch is declared.
--
Steve
 
#5
From: samintz

| Functionally there is no difference. And very little difference from a
| coding perspective. What's the advantage?
|
| IFF %@ISDIGIT[%string] THEN
| ...
| ELSEIFF %@ISALPHA[%string] THEN
| ...
| etc.

Timing. Instead of classifying the string many times, do it just once, and use the result.
--
Steve
 

#6
In order to create a string class or a bitmap of the types, it *still* has
to be classified many times. How else would you create the bitmap?

-Scott



 
#7
From: samintz
| In order to create a string class or a bitmap of the types, it *still* has
| to be classified many times. How else would you create the bitmap?

If you look at the typical implementation in the Standard C library of the character classification functions underlying TCC's isdigit, isxdigit, etc., you will see that they use a constant array indexed by the character code. Its values are bit masks of the kind Vince suggested: one bit for decimal digit, another for hexadecimal digit, one for lower case letter, one for whitespace, one for punctuation, etc. To classify a string, you just bit-wise OR the class codes of each character in the string and evaluate the final result; e.g., @ISDIGIT checks that the only bit set is the one for decimal digit. That is of course why floating point numbers do not match (ref. your post in TC Support): they would also have the punctuation bit set, from the decimal separator (and possibly from the thousands separator). So this is a much faster test. Look at <ctype.h>. It is unfortunate that Standard C never specified a function which explicitly returns the actual table entry or, for a string, the bitwise-OR of the entries, which would allow the user to check for custom classes.
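
The table approach described above might look something like this in C. It is a minimal sketch, not the code of any actual C library or of TCC: the names class_table, classify and is_all_digits are hypothetical, the bit assignments are arbitrary, ASCII is assumed, and the table is filled on first use here rather than being a compiled-in constant. Note that in this sketch decimal digits carry both the decimal and the hexadecimal bit, so the "digits only" check compares against both.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical class bits; the real values are library specific. */
enum {
    CLS_DIGIT  = 0x01,
    CLS_XDIGIT = 0x02,
    CLS_LOWER  = 0x04,
    CLS_UPPER  = 0x08,
    CLS_SPACE  = 0x10,
    CLS_PUNCT  = 0x20
};

static unsigned char class_table[256];
static int table_ready = 0;

/* Fill the table once. Assumes ASCII ordering of character codes. */
static void build_table(void)
{
    for (int c = 0; c < 256; c++) {
        unsigned char bits = 0;
        if (c >= '0' && c <= '9') bits |= CLS_DIGIT | CLS_XDIGIT;
        if (c >= 'a' && c <= 'z') bits |= CLS_LOWER;
        if (c >= 'A' && c <= 'Z') bits |= CLS_UPPER;
        if ((c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')) bits |= CLS_XDIGIT;
        if (c == ' ' || (c >= '\t' && c <= '\r')) bits |= CLS_SPACE;
        if ((c >= '!' && c <= '/') || (c >= ':' && c <= '@') ||
            (c >= '[' && c <= '`') || (c >= '{' && c <= '~')) bits |= CLS_PUNCT;
        class_table[c] = bits;
    }
    table_ready = 1;
}

/* One pass over the string: OR together the table entry of each character. */
unsigned classify(const char *s)
{
    assert(s != NULL);
    if (!table_ready) build_table();
    unsigned bits = 0;
    for (; *s; s++)
        bits |= class_table[(unsigned char)*s];
    return bits;
}

/* Non-empty and nothing but decimal digits (which also carry CLS_XDIGIT). */
int is_all_digits(const char *s)
{
    return classify(s) == (CLS_DIGIT | CLS_XDIGIT);
}
```

With this table, classify("3.14") has the punctuation bit set in addition to the digit bits, so is_all_digits("3.14") is false, which matches the floating point behaviour mentioned above. The value returned by classify is the string-level bitwise-OR that the post wishes Standard C exposed.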
--
Steve
 

#8
That's only practical for ASCII characters where the table is only 256
bytes long. For Unicode, I don't know how those functions work -
especially for non-English languages.
-Scott




 
#10
From: samintz
| That's only practical for ASCII characters where the table is only 256
| bytes long. For Unicode, I don't know how those functions work -
| especially for non-English languages.

Unicode, at least as Windows handles it, is still built on 16-bit code units, and on today's machines a 65536-entry table is not excessive. BTW, IIRC the table entries are 16-bit integers, so you'd use 131072 bytes - the maximum length for a fully expanded TCC command (including the terminating NUL). Note, however, Rex's comment that TCC does not use the Standard-C RTL functions because they are implemented only for 8-bit character sets.
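
The size arithmetic above can be checked with a small C sketch. Everything named here is hypothetical (wide_class_table, WCLS_DIGIT, init_wide_table, classify_wide), and the table is filled only for ASCII digits as a placeholder; a usable table would need entries for every script. It also assumes one entry per 16-bit code unit, so characters encoded as surrogate pairs are not classified meaningfully.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CODE_UNITS 65536u   /* one entry per 16-bit code unit */
#define WCLS_DIGIT 0x0001u  /* hypothetical class bit */

/* 65536 entries of 2 bytes each = 131072 bytes. */
static uint16_t wide_class_table[CODE_UNITS];

size_t wide_table_bytes(void)
{
    return sizeof wide_class_table;
}

/* Placeholder fill: marks only ASCII '0'..'9' as digits. */
void init_wide_table(void)
{
    for (unsigned c = '0'; c <= '9'; c++)
        wide_class_table[c] = WCLS_DIGIT;
}

/* OR together the table entries of len 16-bit code units. */
unsigned classify_wide(const uint16_t *s, size_t len)
{
    assert(s != NULL || len == 0);
    unsigned bits = 0;
    for (size_t i = 0; i < len; i++)
        bits |= wide_class_table[s[i]];
    return bits;
}
```

The lookup loop is the same as for an 8-bit table; only the index range and the memory footprint change.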
--
Steve