@REGEX revisited

#1
The help says "Returns the number of matching groups in the string" (if groups were specified). I probably misinterpreted that to mean the number of matches, whereas I suppose you meant the number of groups for which matches were found. [Recommend better wording]

In any case, @REGEX does not live up to its documentation, as this example shows.

Code:
echo %@regex[(a)|(b)|(c),maam]
4
One group was matched, and there were two matches.

I have some code that does both (it could do either, I suppose) ... I'd be glad to share it with you. Some examples far below.

It doesn't seem to make much sense to count groups that were matched unless the regex is a disjunction of groups. For example
Code:
%@REGEX[(a)(b),...]
will only match if both groups match. Even with a disjunction of groups, the results may not be as expected. For example, in
Code:
%@REGEX[(\d)|(2),...]
(2) will never be matched because each onig_search or onig_match will match (\d) first (and stop). [But all that's the user's problem.]
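
To see both effects outside TCC, here is a minimal sketch that uses the C++ standard library's std::regex (default ECMAScript grammar) as a stand-in for Oniguruma. That substitution is an assumption on my part, and the function name participating_groups is made up for illustration. It reports which groups took part in at least one non-overlapping match.

```cpp
#include <regex>
#include <set>
#include <string>

// Returns the indices of the capture groups that took part in at least one
// non-overlapping match of `pattern` in `subject`.
// Assumption: std::regex (ECMAScript grammar) stands in for Oniguruma here.
std::set<std::size_t> participating_groups(const std::string& pattern,
                                           const std::string& subject)
{
    const std::regex re(pattern);
    std::set<std::size_t> seen;
    for (std::sregex_iterator it(subject.begin(), subject.end(), re), end;
         it != end; ++it)
    {
        // Index 0 is the whole match; the groups start at index 1.
        for (std::size_t g = 1; g < it->size(); ++g)
        {
            if ((*it)[g].matched)
                seen.insert(g);
        }
    }
    return seen;
}
```

With the pattern (\d)|(2) and the subject "1 2 3 22", only group 1 is reported, because the first alternative already matches every digit. With (a)(b) and the subject "abab", groups 1 and 2 are always reported together, since the concatenation matches only when both do.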

Here's an example of what I've got. %@XMATCH[] returns a pair: number of matches, number of matched groups.

Code:
v:\> echo %@xmatch[(i)|(s)|(q),Mississippi]
8,2
8 matches, 2 groups matched

I don't know what sort of thing to test here. Folks, please make suggestions.

Here's the working part of my code (it's quite fast; handles up to 63 groups).

P.S. Did you fix vbulletin's mishandling of less/greater symbols inside code tags? There doesn't seem to be a problem any more.

Code:
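    OnigRegion    *region = onig_region_new();
    UChar        *at = (UChar*) pString,
                *mend = (UChar*) strend(pString);
    INT            matches = 0,        // total non-overlapping matches
                rmatches = 0,        // distinct groups matched at least once
                matchlen;
    ULONGLONG    regionmap = 0;        // bit i is set once group i has matched

    while ( at < mend )
    {
        // try to match at the current position only; returns the match length in bytes
        matchlen = onig_match(regex, (UChar*) pString, mend, at, region, option);
        if ( matchlen >= 0 )
        {
            matches += 1;
            // skip past the match; after an empty match step 2 bytes instead,
            // otherwise the loop would never advance
            at += ( matchlen > 0 ) ? matchlen : 2;
            for ( INT i = 1; i < region->num_regs; i += 1 )
            {
                // beg[i] is negative when group i did not take part in this match
                if ( region->beg[i] >= 0 )
                {
                    if ( !(regionmap & (1I64 << i)) )
                    {
                        rmatches += 1;
                        regionmap |= (1I64 << i);
                    }
                }
            }
        }
        else
        {
            // no match here; advance 2 bytes (one UTF-16 code unit, assuming wide-character input)
            at += 2;
        }
    }
    Sprintf(psz, L"%d,%d", matches, rmatches);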
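
For anyone who wants to try the same counting without Oniguruma, here is a rough portable sketch using std::regex from the C++ standard library. It is not the plugin's code: std::regex (ECMAScript grammar) is swapped in for Oniguruma, and the name count_matches_and_groups is hypothetical. Like the loop above, it keeps a 64-bit map of the groups already seen, so groups beyond index 63 are ignored.

```cpp
#include <cstdint>
#include <regex>
#include <string>
#include <utility>

// Returns {number of non-overlapping matches, number of distinct groups that
// matched at least once}.
// Assumption: std::regex stands in for Oniguruma; the name is illustrative.
std::pair<int, int> count_matches_and_groups(const std::string& pattern,
                                             const std::string& subject)
{
    const std::regex re(pattern);
    int matches = 0;
    int groups_matched = 0;
    std::uint64_t group_map = 0;   // bit g is set once group g has matched

    // sregex_iterator walks the non-overlapping matches and steps past
    // empty matches on its own.
    for (std::sregex_iterator it(subject.begin(), subject.end(), re), end;
         it != end; ++it)
    {
        ++matches;
        for (std::size_t g = 1; g < it->size() && g < 64; ++g)
        {
            const std::uint64_t bit = std::uint64_t{1} << g;
            if ((*it)[g].matched && !(group_map & bit))
            {
                group_map |= bit;
                ++groups_matched;
            }
        }
    }
    return {matches, groups_matched};
}
```

Called with (i)|(s)|(q) and Mississippi it gives the pair 8 and 2, the same as the @XMATCH example; with (a)|(b)|(c) and maam it gives 2 and 1.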
 
#2
The help says, of @REGEX, "Returns the number of matching groups in the string" [if groups were specified]. I reported quite a while ago that it doesn't do that. Will it be fixed, or the documentation changed?

Code:
v:\> echo %@regex[(a)|(b)|(c)|(d)|(e),zap]
6
 

rconn

#3
> The help says, of @REGEX, "Returns the number of matching groups in
> the string" [if groups were specified]. I reported quite a while ago
> that it doesn't do that. Will it be fixed, or the documentation
> changed?
It will definitely not be changed. I will clarify the docs to explain what
it's doing, and that your syntax will return different results depending on
which RE syntax you've selected.

Rex Conn
JP Software
 
#4
> It will definitely not be changed. I will clarify the docs to explain what
> it's doing, and that your syntax will return different results depending on
> which RE syntax you've selected.
Are there any syntaxes for which it does return the number of matching groups in the string?
 
#5
> It will definitely not be changed. I will clarify the docs to explain what it's doing, and that your syntax will return different results depending on which RE syntax you've selected.
I respectfully beg to differ. When the correct syntax is used, @REGEX, with groups, always counts the number of parenthesized groups in the regex (plus one for the entire regex). That is the value of OnigRegion::num_regs after an onig_search() or an onig_match(). The user already knows this number.
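
The "groups plus one" number has a direct analogue in the C++ standard library, which makes the point easy to check without TCC. This is only a sketch under the assumption that std::regex behaves like Oniguruma in this respect; the function names slot_count and slots_after_search are made up. std::regex::mark_count() is the number of parenthesized groups, and std::smatch::size() after a successful search plays the role of OnigRegion::num_regs.

```cpp
#include <regex>
#include <string>

// Number of result slots a search will report: one per parenthesized group
// plus one for the entire match. It depends on the pattern only.
std::size_t slot_count(const std::string& pattern)
{
    const std::regex re(pattern);
    return re.mark_count() + 1;
}

// The same number as observed after an actual search, or 0 if nothing matched.
std::size_t slots_after_search(const std::string& pattern,
                               const std::string& subject)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(subject, m, re) ? m.size() : 0;
}
```

For (a)|(b)|(c) both functions give 4 whether the subject is maam or baccarat, and for (a)|(b)|(c)|(d)|(e) with zap they give 6, which are the values @REGEX printed above.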

And you can reliably loop to find/count all non-overlapping matches in the string, as the plugin @XMATCH does (see below). Such looping makes my @XMATCH about 8% slower than @REGEX, perhaps a reasonable price to pay for a more useful result.

The script below and its output show the reliability of both functions, regardless of the specified regular expression syntax.
Code:
type xsyntax.bat & echo ^r^nRESULTS^r^n & xsyntax.bat

option //RegularExpressions=perl
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

option //RegularExpressions=ruby
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

option //RegularExpressions=grep
echo Syntax: %@option[RegularExpressions]
echo `@regex[\(a\)\|\(b\)\|\(c\),baccarat]` = %@regex[\(a\)\|\(b\)\|\(c\),baccarat]
echo `@xmatch[\(a\)\|\(b\)\|\(c\),baccarat]` = %@xmatch[\(a\)\|\(b\)\|\(c\),baccarat]
echo.

option //RegularExpressions=gnu
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

option //RegularExpressions=posix
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

option //RegularExpressions=java
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

RESULTS

Syntax: Perl
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6

Syntax: Ruby
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6

Syntax: Grep
@regex[\(a\)\|\(b\)\|\(c\),baccarat] = 4
@xmatch[\(a\)\|\(b\)\|\(c\),baccarat] = 6

Syntax: GNU
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6

Syntax: POSIX
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6

Syntax: Java
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6