
@REGEX revisited

Discussion in 'Support' started by vefatica, Jul 20, 2010.

  1. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,972
    Likes Received:
    30
    The help says "Returns the number of matching groups in the string" (if groups were specified). I probably misread that as the number of matches, whereas I suppose you meant the number of groups for which matches were found. [Recommend better wording]

    In any case, @REGEX does not live up to its documentation, as this example shows.

    Code:
    echo %@regex[(a)|(b)|(c),maam]
    4
    Only one group, (a), was matched, and there were two matches, yet the result is 4.

    I have some code that does both (it could do either, I suppose) ... I'd be glad to share it with you. Some examples far below.

    It doesn't seem to make much sense to count groups that were matched unless the regex is a disjunction of groups. For example
    Code:
    %@REGEX[(a)(b),...]
    will only match if both groups match. Even with a disjunction of groups, the results may not be as expected. For example, in
    Code:
    %@REGEX[(\d)|(2),...]
    (2) will never be matched, because each onig_search or onig_match will match (\d) first (and stop). [But all that's the user's problem.]
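
    The ordering effect is easy to see in a small sketch. This uses Python's re module as a stand-in, which is an assumption: it is not Oniguruma, but it also tries alternatives left to right and stops at the first one that succeeds. The helper name winning_groups is made up for illustration.

```python
import re

def winning_groups(pattern, text):
    """For each non-overlapping match, report which group number matched."""
    # m.lastindex is the index of the last group that took part in the match;
    # in a plain disjunction of groups that is the one alternative that won.
    return [m.lastindex for m in re.finditer(pattern, text)]

print(winning_groups(r"(\d)|(2)", "1223"))  # [1, 1, 1, 1]  group 2 never wins
print(winning_groups(r"(2)|(\d)", "1223"))  # [2, 1, 1, 2]  reordering lets it win
```

    Putting the more specific alternative first is the usual way around it.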

    Here's an example of what I've got. %XMATCH[] returns a pair: number of matches,number of matched groups.

    Code:
    v:\> echo %@xmatch[(i)|(s)|(q),Mississippi]
    8,2
    8 matches, 2 groups matched
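
    To make the two numbers concrete, here is a rough Python sketch of the same counting idea. It is not the plugin's code: it relies on Python's re.finditer rather than Oniguruma, and the function name xmatch_counts is hypothetical.

```python
import re

def xmatch_counts(pattern, text):
    """Return (matches, groups_matched) over all non-overlapping matches."""
    matches = 0
    seen = set()                      # group numbers that matched at least once
    for m in re.finditer(pattern, text):
        matches += 1
        # m.groups() holds None for every group that did not take part
        for i, g in enumerate(m.groups(), start=1):
            if g is not None:
                seen.add(i)
    return matches, len(seen)

print(xmatch_counts(r"(i)|(s)|(q)", "Mississippi"))  # (8, 2)
print(xmatch_counts(r"(a)|(b)|(c)", "maam"))         # (2, 1)
```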

    I don't know what sort of thing to test here. Folks, please make suggestions.

    Here's the working part of my code (it's quite fast; handles up to 63 groups).

    P.S. Did you fix vbulletin's mishandling of less/greater symbols inside code tags? There doesn't seem to be a problem any more.

    Code:
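        OnigRegion    *region = onig_region_new();
        UChar        *at = (UChar*) pString,            // current scan position
                    *mend = (UChar*) strend(pString);   // end of the subject string
        INT            matches = 0,                     // total number of matches
                    rmatches = 0,                       // number of distinct groups matched
                    matchlen;
        ULONGLONG    regionmap = 0;                     // bit i is set once group i has matched
    
        while ( at < mend )
        {
            // onig_match() is anchored at "at"; it returns the match length in bytes, or a negative value
            matchlen = onig_match(regex, (UChar*) pString, mend, at, region, option);
            if ( matchlen >= 0 )
            {
                matches += 1;
                // step past the match; on an empty match step one character so the loop cannot stall
                at += matchlen ? matchlen : 2;
                for ( INT i = 1; i < region->num_regs; i += 1 )
                {
                    if ( region->beg[i] >= 0 )          // group i took part in this match
                    {
                        if ( !(regionmap & (1I64 << i)) )
                        {
                            rmatches += 1;
                            regionmap |= (1I64 << i);
                        }
                    }
                }
            }
            else
            {
                at += 2;                                // no match here: advance 2 bytes (one wide character)
            }
        }
        Sprintf(psz, L"%d,%d", matches, rmatches);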
    
     
  2. vefatica

    The help says, of @REGEX, "Returns the number of matching groups in the string" [if groups were specified]. I reported quite a while ago that it doesn't do that. Will it be fixed, or the documentation changed?

    Code:
    v:\> echo %@regex[(a)|(b)|(c)|(d)|(e),zap]
    6
     
  3. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,870
    Likes Received:
    83
    It will definitely not be changed. I will clarify the docs to explain what
    it's doing, and that your syntax will return different results depending on
    which RE syntax you've selected.

    Rex Conn
    JP Software
     
  4. vefatica

    Are there any syntaxes for which it does return the number of matching groups in the string?
     
  5. vefatica

    I respectfully beg to differ. When the correct syntax is used, @REGEX, with groups, always counts the number of parenthesized groups in the regex (plus one for the entire regex). That is the value of OnigRegion::num_regs after an onig_search() or an onig_match(). The user already knows this number.
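
    The point can be illustrated outside TCC. In the sketch below, Python's re module stands in for Oniguruma (an assumption; the two engines are not identical), and the function name num_regs is only borrowed from the field name above. The count is fixed by the pattern text alone and does not depend on the string being searched.

```python
import re

def num_regs(pattern):
    """Capture groups in the pattern, plus one for the whole match."""
    return re.compile(pattern).groups + 1

print(num_regs(r"(a)|(b)|(c)"))          # 4, whatever string is searched
print(num_regs(r"(a)|(b)|(c)|(d)|(e)"))  # 6
```

    Those are the same values @REGEX printed for "maam" and "zap" earlier in the thread.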

    And you can reliably loop to find/count all non-overlapping matches in the string, as the plugin @XMATCH does (see below). Such looping makes my @XMATCH about 8% slower than @REGEX, perhaps a reasonable price to pay for a more useful result.

    The script below and its output show the reliability of both functions, regardless of the specified regular expression syntax.
    Code:
    type xsyntax.bat & echo ^r^nRESULTS^r^n & xsyntax.bat
    
    option //RegularExpressions=perl
    echo Syntax: %@option[RegularExpressions]
    echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
    echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
    echo.
    
    option //RegularExpressions=ruby
    echo Syntax: %@option[RegularExpressions]
    echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
    echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
    echo.
    
    option //RegularExpressions=grep
    echo Syntax: %@option[RegularExpressions]
    echo `@regex[\(a\)\|\(b\)\|\(c\),baccarat]` = %@regex[\(a\)\|\(b\)\|\(c\),baccarat]
    echo `@xmatch[\(a\)\|\(b\)\|\(c\),baccarat]` = %@xmatch[\(a\)\|\(b\)\|\(c\),baccarat]
    echo.
    
    option //RegularExpressions=gnu
    echo Syntax: %@option[RegularExpressions]
    echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
    echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
    echo.
    
    option //RegularExpressions=posix
    echo Syntax: %@option[RegularExpressions]
    echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
    echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
    echo.
    
    option //RegularExpressions=java
    echo Syntax: %@option[RegularExpressions]
    echo `@regex[(a)|(b)|(c),baccarat]` = %@regex[(a)|(b)|(c),baccarat]
    echo `@xmatch[(a)|(b)|(c),baccarat]` = %@xmatch[(a)|(b)|(c),baccarat]
    echo.
    
    RESULTS
    
    Syntax: Perl
    @regex[(a)|(b)|(c),baccarat] = 4
    @xmatch[(a)|(b)|(c),baccarat] = 6
    
    Syntax: Ruby
    @regex[(a)|(b)|(c),baccarat] = 4
    @xmatch[(a)|(b)|(c),baccarat] = 6
    
    Syntax: Grep
    @regex[\(a\)\|\(b\)\|\(c\),baccarat] = 4
    @xmatch[\(a\)\|\(b\)\|\(c\),baccarat] = 6
    
    Syntax: GNU
    @regex[(a)|(b)|(c),baccarat] = 4
    @xmatch[(a)|(b)|(c),baccarat] = 6
    
    Syntax: POSIX
    @regex[(a)|(b)|(c),baccarat] = 4
    @xmatch[(a)|(b)|(c),baccarat] = 6
    
    Syntax: Java
    @regex[(a)|(b)|(c),baccarat] = 4
    @xmatch[(a)|(b)|(c),baccarat] = 6
     
