@REGEX revisited

May 20, 2008
11,400
99
Syracuse, NY, USA
The help says "Returns the number of matching groups in the string" (if groups were specified). I may have misinterpreted that (probably did) to mean the number of matches, whereas I suppose you meant the number of groups for which matches were found. [I recommend better wording.]

In any case, @REGEX does not live up to its documentation, as this example shows.

Code:
echo %@regex[(a)|(b)|(c),maam]
4
One group was matched, and there were two matches.

I have some code that does both (it could do either, I suppose) ... I'd be glad to share it with you. Some examples far below.

It doesn't seem to make much sense to count groups that were matched unless the regex is a disjunction of groups. For example
Code:
%@REGEX[(a)(b),...]
will only match if both groups match. Even with a disjunction of groups, the results may not be as expected. For example, in
Code:
%@REGEX[(\d)|(2),...]
(2) will never be matched because each onig_search or onig_match will match (\d) first (and stop). [But all that's the user's problem.]
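The ordering point can be checked outside TCC. This is only a sketch: it uses std::regex (ECMAScript grammar) as a stand-in for Oniguruma, and first_matched_group is a made-up helper name, not anything in TCC or the plugin.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the 1-based index of the first capture group
// that took part in the first match of `pattern` in `text`, or 0 if nothing
// matched. std::regex stands in for Oniguruma here (an assumption).
int first_matched_group(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);          // ECMAScript grammar by default
    std::smatch m;
    if (!std::regex_search(text, m, re))
        return 0;
    for (std::size_t i = 1; i < m.size(); ++i)
        if (m[i].matched)
            return static_cast<int>(i);
    return 0;
}
```

With the pattern (\d)|(2), a subject of "2" is claimed by group 1, so group 2 never gets a turn. Swapping the alternatives to (2)|(\d) lets both groups match on suitable input.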

Here's an example of what I've got. %@XMATCH[] returns a pair: number of matches,number of matched groups.

Code:
v:\> echo %@xmatch[(i)|(s)|(q),Mississippi]
8,2
8 matches, 2 groups matched

I don't know what sort of thing to test here. Folks, please make suggestions.

Here's the working part of my code (it's quite fast; handles up to 63 groups).

P.S. Did you fix vbulletin's mishandling of less/greater symbols inside code tags? There doesn't seem to be a problem any more.

Code:
    OnigRegion    *region = onig_region_new();
    UChar        *at = (UChar*) pString,
                *mend = (UChar*) strend(pString);
    INT            matches = 0,
                rmatches = 0,
                matchlen;
    ULONGLONG    regionmap = 0;

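    // Try an anchored match at each position; after a hit, resume just past it.
    while ( at < mend )
    {
        matchlen = onig_match(regex, (UChar*) pString, mend, at, region, option);
        if ( matchlen > 0 )
        {
            matches += 1;
            at += matchlen;
            // Record each group (1..num_regs-1) that took part, once only.
            for ( INT i = 1; i < region->num_regs; i += 1 )
            {
                if ( region->beg[i] >= 0 && !(regionmap & (1I64 << i)) )
                {
                    rmatches += 1;
                    regionmap |= (1I64 << i);
                }
            }
        }
        else
        {
            // No match here, or an empty match (which would otherwise loop
            // forever): step over one 2-byte (UTF-16) character.
            at += 2;
        }
    }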
    Sprintf(psz, L"%d,%d", matches, rmatches);
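For anyone who wants to try the counting idea without Oniguruma or the plugin SDK, here is a rough portable sketch. It is not the plugin's code: it swaps in std::regex and std::sregex_iterator (which also takes care of empty matches), and count_matches is a hypothetical name. The 64-bit map mirrors the 63-group limit mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns {non-overlapping matches, distinct groups that
// took part in at least one match}. std::regex is a stand-in for Oniguruma.
std::pair<int, int> count_matches(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    assert(re.mark_count() < 64);   // group i is tracked in bit i of a 64-bit map

    std::uint64_t seen = 0;
    int matches = 0, groups = 0;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
    {
        ++matches;
        for (std::size_t i = 1; i < it->size(); ++i)
        {
            const std::uint64_t bit = std::uint64_t{1} << i;
            if ((*it)[i].matched && !(seen & bit))
            {
                seen |= bit;        // first time this group has matched
                ++groups;
            }
        }
    }
    return {matches, groups};
}
```

Called with (i)|(s)|(q) and Mississippi it gives the pair 8 and 2, the same as the @XMATCH example above; with (a)|(b)|(c) and maam it gives 2 and 1.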
 
The help says, of @REGEX, "Returns the number of matching groups in the string" [if groups were specified]. I reported quite a while ago that it doesn't do that. Will it be fixed, or the documentation changed?

Code:
v:\> echo %@regex[(a)|(b)|(c)|(d)|(e),zap]
6
 

rconn

Administrator
Staff member
May 14, 2008
12,345
150
> The help says, of @REGEX, "Returns the number of matching groups in
> the string" [if groups were specified]. I reported quite a while ago
> that it doesn't do that. Will it be fixed, or the documentation
> changed?

It will definitely not be changed. I will clarify the docs to explain what
it's doing, and that your syntax will return different results depending on
which RE syntax you've selected.

Rex Conn
JP Software
 
> It will definitely not be changed. I will clarify the docs to explain what
> it's doing, and that your syntax will return different results depending on
> which RE syntax you've selected.

Are there any syntaxes for which it does return the number of matching groups in the string?
 
> It will definitely not be changed. I will clarify the docs to explain what it's doing, and that your syntax will return different results depending on which RE syntax you've selected.

I respectfully beg to differ. When the correct syntax is used, @REGEX, with groups, always counts the number of parenthesized groups in the regex (plus one for the entire regex). That is the value of OnigRegion::num_regs after an onig_search() or an onig_match(). The user already knows this number.
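To illustrate why that number tells the user nothing new, here is a small sketch with std::regex standing in for Oniguruma (an assumption; the two function names are made up). In std::regex, match_results::size() plays roughly the role of num_regs: one slot for the whole match plus one per parenthesized group, whether or not the group matched.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: number of sub-match slots reported after a successful
// search, or 0 if the search fails. Analogous (by assumption) to num_regs.
int reported_slots(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re))
        return 0;
    return static_cast<int>(m.size());
}

// Hypothetical helper: the same number computed from the pattern alone,
// without looking at any subject string.
int group_count_plus_one(const std::string& pattern)
{
    return static_cast<int>(std::regex(pattern).mark_count()) + 1;
}
```

For (a)|(b)|(c) against maam the slot count is 4, and for (a)|(b)|(c)|(d)|(e) against zap it is 6, which are the values @REGEX printed in the examples earlier in this thread. Both are simply the group count plus one.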

And you can reliably loop to find/count all non-overlapping matches in the string, as the plugin @XMATCH does (see below). Such looping makes my @XMATCH about 8% slower than @REGEX, perhaps a reasonable price to pay for a more useful result.

The script below and its output show the reliability of both functions, regardless of the specified regular expression syntax.
Code:
type xsyntax.bat & echo ^r^nRESULTS^r^n & xsyntax.bat

option //RegularExpressions=perl
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

option //RegularExpressions=ruby
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

option //RegularExpressions=grep
echo Syntax: %@option[RegularExpressions]
echo `@regex[\(a\)\|\(b\)\|\(c\),baccarat]` = %@regex[\(a\)\|\(b\)\|\(c\),baccarat]
echo `@xmatch[\(a\)\|\(b\)\|\(c\),baccarat]` = %@xmatch[\(a\)\|\(b\)\|\(c\),baccarat]
echo.

option //RegularExpressions=gnu
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

option //RegularExpressions=posix
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

option //RegularExpressions=java
echo Syntax: %@option[RegularExpressions]
echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
echo.

RESULTS

Syntax: Perl
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6

Syntax: Ruby
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6

Syntax: Grep
@regex[\(a\)\|\(b\)\|\(c\),baccarat] = 4
@xmatch[\(a\)\|\(b\)\|\(c\),baccarat] = 6

Syntax: GNU
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6

Syntax: POSIX
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6

Syntax: Java
@regex[(a)|(b)|(c),baccarat] = 4
@xmatch[(a)|(b)|(c),baccarat] = 6
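The same kind of cross-syntax check can be sketched outside TCC. This is only an analogy: std::regex offers its own grammars (ECMAScript, POSIX extended, and so on) rather than Oniguruma's Perl/Ruby/Java ones, and count_with is a hypothetical helper.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts non-overlapping matches of `pattern` in `text`
// under the given std::regex grammar (a stand-in for TCC's syntax option).
int count_with(std::regex_constants::syntax_option_type grammar,
               const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, grammar);
    const std::sregex_iterator first(text.begin(), text.end(), re), last;
    return static_cast<int>(std::distance(first, last));
}
```

For (a)|(b)|(c) against baccarat, both std::regex_constants::ECMAScript and std::regex_constants::extended give 6 matches, in line with the @XMATCH column above.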
 