Declined Character / string classification function

We have a bunch of functions which can test whether or not a character or a string belongs to one of several classes (@ISDIGIT, @ISALPHA, @ISPUNCT, etc.). These can be used in a series of IFF / ELSEIFF tests. However, often one could better use a single function that classifies the string to indicate which one(s) it belongs to, and use a single SWITCH statement:
SWITCH %@CLASS[%string]
CASE digit
...
CASE alpha
...

etc.
--
Steve
 
On Mon, 24 Oct 2011 10:18:28 -0400, Steve Fabian wrote:

|We have a bunch of functions which can test whether or not a character or a string belongs to one of several classes (@ISDIGIT, @ISALPHA, @ISPUNCT, etc.). These can be used in a series of IFF / ELSEIFF tests. However, often one could better use a single function that classifies the string to indicate which one(s) it belongs to, and use a single SWITCH statement:
|SWITCH %@CLASS[%string]
|CASE digit
|...
|CASE alpha
|...
|
|etc.
|--

Hmmm! ... like function (@STRTYPE[]?) to return a bit map ...


#define CONTAINS_NUMERIC 0x01
#define CONTAINS_LOWER 0x02
#define CONTAINS_UPPER 0x04
#define CONTAINS_PUNCT 0x08
#define CONTAINS_SPACE 0x10
#define IS_QUOTED 0x20

What did you have in mind? What would you use it for?
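
A minimal C sketch of what such a bit-map function might compute, assuming the flag values listed above. The name `str_type` and the meaning given to IS_QUOTED (the string starts and ends with a double quote) are assumptions made for illustration; nothing here is an existing TCC function.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Flag values as proposed in the post above. */
#define CONTAINS_NUMERIC 0x01
#define CONTAINS_LOWER   0x02
#define CONTAINS_UPPER   0x04
#define CONTAINS_PUNCT   0x08
#define CONTAINS_SPACE   0x10
#define IS_QUOTED        0x20

/* Hypothetical helper: OR together the class flags of every character in s. */
unsigned str_type(const char *s)
{
    unsigned mask = 0;
    size_t len = strlen(s);

    /* Assumption: "quoted" means the string begins and ends with a double quote. */
    if (len >= 2 && s[0] == '"' && s[len - 1] == '"')
        mask |= IS_QUOTED;

    for (size_t i = 0; i < len; i++) {
        /* The <ctype.h> functions require an unsigned char value. */
        unsigned char c = (unsigned char)s[i];
        if (isdigit(c)) mask |= CONTAINS_NUMERIC;
        if (islower(c)) mask |= CONTAINS_LOWER;
        if (isupper(c)) mask |= CONTAINS_UPPER;
        if (ispunct(c)) mask |= CONTAINS_PUNCT;
        if (isspace(c)) mask |= CONTAINS_SPACE;
    }
    return mask;
}
```

Note that in this sketch the quote characters themselves count as punctuation, so a quoted string also reports CONTAINS_PUNCT.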
 
Functionally there is no difference. And very little difference from a
coding perspective. What's the advantage?

IFF %@ISDIGIT[%string] THEN
...
ELSEIFF %@ISALPHA[%string] THEN
...
etc.

But you could always roll your own:
function CLASS=`%@IF[%@ISDIGIT[%1],digit,%@IF[%@ISALPHA[%1],alpha,%@IF[%@ISPUNCT[%1],punct,none]]]`

-Scott


 
From: vefatica
| Hmmm! ... like function (@STRTYPE[]?) to return a bit map ...
|
|
| #define CONTAINS_NUMERIC 0x01
| #define CONTAINS_LOWER 0x02
| #define CONTAINS_UPPER 0x04
| #define CONTAINS_PUNCT 0x08
| #define CONTAINS_SPACE 0x10
| #define IS_QUOTED 0x20
|
| What did you have in mind?

Yes, that's the kind of thing I have in mind.

| What would you use it for?

I'd use it as the control parameter of a SWITCH statement when parsing batch parameters, and possibly text files. I also want to limit the list of values in the CASE statements, so I may want some flexibility in the function's return values, e.g., optional parameters to indicate whether or not upper and lower case should be distinguished. The reason for wanting fewer values is my GUESS that each of the .OR.'d values in a CASE requires a separate comparison before a mismatch can be declared.
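
The "classify once, then use a single SWITCH" idea can be sketched in C. Everything in this sketch is hypothetical: `str_class`, `describe`, the `CLS_*` bits, and the `fold_case` parameter, which stands in for the optional "distinguish upper and lower case" argument mentioned above.

```c
#include <ctype.h>

/* Hypothetical class bits for this sketch. */
enum { CLS_DIGIT = 0x01, CLS_LOWER = 0x02, CLS_UPPER = 0x04, CLS_OTHER = 0x08 };

/* OR of the class bits of all characters in s.
   With fold_case non-zero, upper case letters are reported as CLS_LOWER,
   so the caller has fewer distinct values to list in its case labels. */
unsigned str_class(const char *s, int fold_case)
{
    unsigned mask = 0;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (isdigit(c))      mask |= CLS_DIGIT;
        else if (islower(c)) mask |= CLS_LOWER;
        else if (isupper(c)) mask |= fold_case ? CLS_LOWER : CLS_UPPER;
        else                 mask |= CLS_OTHER;
    }
    return mask;
}

/* The string is classified once; the switch only compares integers. */
const char *describe(const char *s)
{
    switch (str_class(s, 1)) {
    case 0:                     return "empty";
    case CLS_DIGIT:             return "digit";
    case CLS_LOWER:             return "alpha";
    case CLS_DIGIT | CLS_LOWER: return "alnum";
    default:                    return "other";
    }
}
```

With case folding on, "Steve" and "steve" land in the same case label; with it off, the caller would need separate labels for the lower, upper, and mixed combinations.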
--
Steve
 
From: samintz

| Functionally there is no difference. And very little difference from a
| coding perspective. What's the advantage?
|
| IFF %@ISDIGIT[%string] THEN
| ...
| ELSEIFF %@ISALPHA[%string] THEN
| ...
| etc.

Timing. Instead of classifying the string many times, do it just once, and use the result.
--
Steve
 
In order to create a string class or a bitmap of the types, it *still* has
to be classified many times. How else would you create the bitmap?

-Scott



 
From: samintz
| In order to create a string class or a bitmap of the types, it *still* has
| to be classified many times. How else would you create the bitmap?

If you look at the typical implementation in the Standard C library of the character classification functions underlying TCC's isdigit, isxdigit, etc., you will see that they use a constant array indexed by the character code. Its entries are bit maps of the kind Vince suggested: one bit for decimal digit, another for hexadecimal digit, one for lower case letter, one for whitespace, one for punctuation, etc. To classify a string, you just bitwise-OR the class codes of all its characters and evaluate the final result; e.g., @ISDIGIT only has to check that no bit is set other than those a decimal digit carries. That, of course, is why floating point numbers do not match (ref. your post in TC Support): they would also have the punctuation bit set, from the decimal separator (and possibly from the thousands separator). So this is a much faster test than classifying the string separately for each class. Look at <ctype.h>. It is unfortunate that Standard C never specified a function that explicitly returns the actual table entry (or, for a string, the bitwise OR of the entries), which would allow the user to check for custom classes.
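
A rough C sketch of the table approach described above, not a copy of any particular library's internals. The bit names, `class_mask`, and `all_digits` are made up for illustration, and the table is filled from the <ctype.h> functions at first use instead of being a compiled-in constant. One added assumption: a `C_OTHER` bit marks characters that fall into none of the listed classes, so that such a character cannot slip through unnoticed when the codes are OR'd.

```c
#include <ctype.h>

/* Hypothetical class bits, one per class. */
enum {
    C_DIGIT = 0x01, C_XDIGIT = 0x02, C_LOWER = 0x04, C_UPPER = 0x08,
    C_SPACE = 0x10, C_PUNCT  = 0x20, C_OTHER = 0x40
};

static unsigned char class_table[256];  /* one entry per 8-bit character code */
static int table_ready = 0;             /* lazy init; not thread-safe */

static void build_table(void)
{
    for (int c = 0; c < 256; c++) {
        unsigned char bits = 0;
        if (isdigit(c))  bits |= C_DIGIT;
        if (isxdigit(c)) bits |= C_XDIGIT;
        if (islower(c))  bits |= C_LOWER;
        if (isupper(c))  bits |= C_UPPER;
        if (isspace(c))  bits |= C_SPACE;
        if (ispunct(c))  bits |= C_PUNCT;
        if (bits == 0)   bits = C_OTHER;  /* e.g. control characters */
        class_table[c] = bits;
    }
    table_ready = 1;
}

/* One pass over the string: OR the table entries of all characters. */
unsigned class_mask(const char *s)
{
    unsigned mask = 0;
    if (!table_ready)
        build_table();
    for (; *s != '\0'; s++)
        mask |= class_table[(unsigned char)*s];
    return mask;
}

/* Non-empty and only decimal digits: every decimal digit carries exactly
   the C_DIGIT and C_XDIGIT bits, so nothing else may be set. */
int all_digits(const char *s)
{
    return class_mask(s) == (unsigned)(C_DIGIT | C_XDIGIT);
}
```

The mask is computed once and can then be tested against any number of classes. Not every test can be expressed with OR alone: "all hexadecimal digits", for instance, would need the AND of the entries, since "fg" and "ff" produce the same OR'd mask.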
--
Steve
 
That's only practical for ASCII characters where the table is only 256
bytes long. For Unicode, I don't know how those functions work -
especially for non-English languages.
-Scott




 
From: samintz
| That's only practical for ASCII characters where the table is only 256
| bytes long. For Unicode, I don't know how those functions work -
| especially for non-English languages.

Unicode as Windows stores it is still only a 16-bit code (UTF-16 code units; characters beyond the first 65536 need surrogate pairs), and on today's machines a 65536-entry table is not excessive. BTW, IIRC the table entries are 16-bit integers, so you'd use 131072 bytes - the maximum length for a fully expanded TCC command (including the terminating NUL). Note, however, Rex's comment that TCC does not use the Standard C RTL functions because they are implemented only for 8-bit character sets.
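
To make the size arithmetic concrete, here is a hedged C sketch of a 65536-entry table of 16-bit entries, indexed by 16-bit code unit. The names (`wide_table`, `build_wide_table`, `wide_mask`, the `W_*` bits) are invented for this example, and filling the table from the <wctype.h> functions is an assumption: the results depend on the current locale, and this is not how TCC does it.

```c
#include <stdint.h>
#include <wctype.h>

/* Hypothetical class bits for the wide table. */
enum { W_DIGIT = 0x0001, W_LOWER = 0x0002, W_UPPER = 0x0004,
       W_SPACE = 0x0008, W_PUNCT = 0x0010 };

#define CODE_UNITS 65536u

/* 65536 entries * 2 bytes each = 131072 bytes. */
static uint16_t wide_table[CODE_UNITS];

/* Fill the table once; classification follows the current C locale. */
void build_wide_table(void)
{
    for (uint32_t c = 0; c < CODE_UNITS; c++) {
        uint16_t bits = 0;
        if (iswdigit((wint_t)c)) bits |= W_DIGIT;
        if (iswlower((wint_t)c)) bits |= W_LOWER;
        if (iswupper((wint_t)c)) bits |= W_UPPER;
        if (iswspace((wint_t)c)) bits |= W_SPACE;
        if (iswpunct((wint_t)c)) bits |= W_PUNCT;
        wide_table[c] = bits;
    }
}

/* OR the entries for a NUL-terminated string of 16-bit code units.
   Surrogate halves are looked up like any other code unit here. */
unsigned wide_mask(const uint16_t *s)
{
    unsigned mask = 0;
    for (; *s != 0; s++)
        mask |= wide_table[*s];
    return mask;
}
```

`build_wide_table()` has to be called once before `wide_mask()` is used; after that, classifying a string costs one table lookup per code unit.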
--
Steve
 