@REGEX revisited

#1
The help says "Returns the number of matching groups in the string" (if groups were specified). I may have misinterpreted that (probably did) to mean the number of matches, whereas I suppose you meant the number of groups for which matches were found. [Recommend better wording]

In any case, @REGEX does not live up to its documentation, as this example shows.

Code:
echo %@regex[(a)|(b)|(c),maam]
4
One group was matched, and there were two matches.

I have some code that counts both matches and matched groups (it could do either, I suppose) ... I'd be glad to share it with you. Some examples are below.

It doesn't seem to make much sense to count groups that were matched unless the regex is a disjunction of groups. For example
Code:
%@REGEX[(a)(b),...]
will only match if both groups match. Even with a disjunction of groups, the results may not be as expected. For example, in
Code:
%@REGEX[(\d)|(2),...]
(2) will never be matched because each onig_search or onig_match will match (\d) first (and stop). [But all that's the user's problem.]
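
To illustrate the point about alternation order, here is a minimal sketch using Python's `re` module. It is not Oniguruma, but it also tries alternatives left to right at each position, so the effect is the same. The helper name `groups_hit` and the sample string "1223" are made up for this illustration.

```python
import re

def groups_hit(pattern, text):
    """Return the set of group numbers that took part in at least one match."""
    hit = set()
    for m in re.finditer(pattern, text):
        # m.group(i) is None when group i did not participate in this match
        hit.update(i for i in range(1, m.re.groups + 1) if m.group(i) is not None)
    return hit

# (\d) is tried first at every position, so (2) never gets a chance
print(groups_hit(r"(\d)|(2)", "1223"))  # {1}

# Putting the more specific alternative first lets both groups match
print(groups_hit(r"(2)|(\d)", "1223"))  # {1, 2}
```

Reordering the alternatives is the only way to let the narrower group be reported, which is why the count of matched groups depends on how the user writes the disjunction.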

Here's an example of what I've got. %@XMATCH[] returns a pair: number of matches,number of matched groups.

Code:
v:\> echo %@xmatch[(i)|(s)|(q),Mississippi]
8,2
8 matches, 2 groups matched
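
For readers without the plugin, the pair described above can be approximated in a few lines of Python. This is only a sketch of the idea, assuming non-overlapping, left-to-right matching; the function name `xmatch` is borrowed from the post, but the implementation uses Python's `re` rather than Oniguruma and is not the plugin's code.

```python
import re

def xmatch(pattern, text):
    """Return "matches,groups": the number of non-overlapping matches and
    the number of distinct groups that matched at least once."""
    regex = re.compile(pattern)
    matches = 0
    seen = set()  # group numbers that have participated in some match
    for m in regex.finditer(text):
        matches += 1
        for i in range(1, regex.groups + 1):
            if m.group(i) is not None:
                seen.add(i)
    return f"{matches},{len(seen)}"

print(xmatch("(i)|(s)|(q)", "Mississippi"))  # 8,2
print(xmatch("(a)|(b)|(c)", "maam"))         # 2,1
```

The second call reproduces the earlier "maam" observation: two matches, one group matched.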

I don't know what sort of thing to test here. Folks, please make suggestions.

Here's the working part of my code (it's quite fast; handles up to 63 groups).

P.S. Did you fix vbulletin's mishandling of less/greater symbols inside code tags? There doesn't seem to be a problem any more.

Code:
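    OnigRegion    *region = onig_region_new();
    UChar        *at = (UChar*) pString,
                *mend = (UChar*) strend(pString);
    INT            matches = 0,        // non-overlapping matches found
                rmatches = 0,       // distinct groups that matched at least once
                matchlen;
    ULONGLONG    regionmap = 0;      // bit i is set once group i has matched (hence the 63-group limit)

    while ( at < mend )
    {
        // try to match at exactly this position; the result is the match length in bytes
        matchlen = onig_match(regex, (UChar*) pString, mend, at, region, option);
        if ( matchlen >= 0 )
        {
            matches += 1;
            // step over the match; an empty match still advances one character
            // so the loop cannot stall on patterns that can match nothing
            at += matchlen ? matchlen : 2;
            for ( INT i = 1; i < region->num_regs; i += 1 )
            {
                // beg[i] is negative when group i did not participate in this match
                if ( region->beg[i] >= 0 )
                {
                    if ( !(regionmap & (1I64 << i)) )
                    {
                        rmatches += 1;
                        regionmap |= (1I64 << i);
                    }
                }
            }
        }
        else
        {
            at += 2;    // no match here; advance one 2-byte character
        }
    }
    Sprintf(psz, L"%d,%d", matches, rmatches);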
 
#2
The help says, of @REGEX, "Returns the number of matching groups in the string" [if groups were specified]. I reported quite a while ago that it doesn't do that. Will it be fixed, or the documentation changed?

Code:
v:\> echo %@regex[(a)|(b)|(c)|(d)|(e),zap]
6
 

rconn

#3
> The help says, of @REGEX, "Returns the number of matching groups in
> the string" [if groups were specified]. I reported quite a while ago
> that it doesn't do that. Will it be fixed, or the documentation
> changed?
It will definitely not be changed. I will clarify the docs to explain what
it's doing, and that your syntax will return different results depending on
which RE syntax you've selected.

Rex Conn
JP Software
 
#5
> It will definitely not be changed. I will clarify the docs to explain what it's doing, and that your syntax will return different results depending on which RE syntax you've selected.
I respectfully beg to differ. When the correct syntax is used, @REGEX, with groups, always counts the number of parenthesized groups in the regex (plus one for the entire regex). That is the value of OnigRegion::num_regs after an onig_search() or an onig_match(). The user already knows this number.
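
The claim that the user already knows this number can be checked without running a match at all. The sketch below uses Python's `re` as a stand-in: its compiled-pattern attribute `groups` plays the role of `num_regs - 1`. The function name `regex_result` is hypothetical, and the sketch assumes the match succeeds.

```python
import re

def regex_result(pattern):
    """Group count plus one for the whole match; the subject string plays no part."""
    return re.compile(pattern).groups + 1

print(regex_result("(a)|(b)|(c)"))          # 4
print(regex_result("(a)|(b)|(c)|(d)|(e)"))  # 6
```

These agree with the 4 and 6 reported earlier in the thread for "maam" and "zap", even though the subject string is never consulted.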

And you can reliably loop to find/count all non-overlapping matches in the string, as the plugin @XMATCH does (see below). Such looping makes my @XMATCH about 8% slower than @REGEX, perhaps a reasonable price to pay for a more useful result.

The script below and its output show the reliability of both functions, regardless of the specified regular expression syntax.
Code:
type xsyntax.bat & echo ^r^nRESULTS^r^n & xsyntax.bat

option //RegularExpressions=perl
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

option //RegularExpressions=ruby
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

option //RegularExpressions=grep
echo Syntax: %@option[RegularExpressions]
echo `@regex[\(a\)\|\(b\)\|\(c\),baccarat]` = %@regex[\(a\)\|\(b\)\|\(c\),baccarat]
echo `@xmatch[\(a\)\|\(b\)\|\(c\),baccarat]` = %@xmatch[\(a\)\|\(b\)\|\(c\),baccarat]
echo.

option //RegularExpressions=gnu
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

option //RegularExpressions=posix
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

option //RegularExpressions=java
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

RESULTS

Syntax: Perl
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6

Syntax: Ruby
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6

Syntax: Grep
@regex[\(a\)\|\(b\)\|\(c\),baccarat] = 4
@xmatch[\(a\)\|\(b\)\|\(c\),baccarat] = 6

Syntax: GNU
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6

Syntax: POSIX
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6

Syntax: Java
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6