@REGEX question

#1
How am I to interpret the return value of @REGEX[] below? It doesn't seem to be the (documented) "number of matching groups".

Code:
e:\logs\mercury> echo %@REGEX["(efused)|(uthor)|(known)",refused]
4
 
#2
[Working with this Oniguruma stuff gives me a headache!]

The snippet below comes close to what the documentation describes: it always returns a count of the matches.

Code:
    UChar    *mstart = (UChar*)szString,
             *mend = (UChar*)szString + 2 * lstrlen(szString);    // 2 bytes per WCHAR

    OnigRegion *region = onig_region_new();

    // here's the interesting stuff
    INT matches = 0, i;
    while ( onig_search(regex, mstart, mend, mstart, mend, region, 0) >= 0 )
    {
        matches += 1;

        // find the match and move past it
        // first see if the match was a group
        for ( i = 1; i < region->num_regs; i++ )
        {
            if ( region->beg[i] >= 0 ) // match was a group
            {
                mstart += region->end[i];
                break;    // keep looking (continue the while)
            }
        }

        if ( i == region->num_regs ) // match was not a group (region 0)
        {
            mstart += region->end[0];
        }
        // keep looking (continue the while)
    }
    onig_region_free(region, 1);    // release the region and its offset arrays
    Sprintf(psz, L"%d", matches);
Here are a few examples.

Code:
g:\projects\4utils\release> echo %@regex[o|g,doggiepoo]
5

g:\projects\4utils\release> echo %@regex[(oo)|(g),doggiepoo]
3

g:\projects\4utils\release> echo %@regex[(oo)|(gg),doggiepoo]
2

g:\projects\4utils\release> echo %@regex[(o)|(g),doggie]
3

g:\projects\4utils\release> echo %@regex[(o)|g,doggie]
3

g:\projects\4utils\release> echo %@regex[o|g,doggie]
3

g:\projects\4utils\release> echo %@regex[(s)|f,doggie]
0

g:\projects\4utils\release> echo %@regex[o|h,dog]
1

g:\projects\4utils\release> echo %@regex[(foo),foozzz]
1

g:\projects\4utils\release> echo %@regex[(foo),foozzzfoo]
2
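
For anyone who wants to try the counting loop without building against Oniguruma, here is a minimal sketch of the same idea using POSIX regex.h instead. That is a swapped-in library, so the patterns are extended POSIX syntax rather than Perl syntax, and matching is leftmost-longest. The name count_matches is made up for this illustration; it is not part of TCC or Oniguruma.

```c
#include <regex.h>

/* Count non-overlapping matches of an extended POSIX pattern in subject.
   Hypothetical helper for illustration; returns -1 if the pattern is invalid. */
int count_matches(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t m[1];                /* slot 0 = whole match; groups are not needed */
    const char *p = subject;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* After the first pass, p is no longer the true start of the line. */
    while (regexec(&re, p, 1, m, p == subject ? 0 : REG_NOTBOL) == 0) {
        count++;
        if (m[0].rm_eo > m[0].rm_so) {
            p += m[0].rm_eo;        /* resume just past the match */
        } else {
            /* empty match: step one character so the loop cannot stall */
            if (p[m[0].rm_eo] == '\0')
                break;
            p += m[0].rm_eo + 1;
        }
    }
    regfree(&re);
    return count;
}
```

With this sketch, count_matches("o|g", "doggiepoo") gives 5 and count_matches("(oo)|(gg)", "doggiepoo") gives 2, in line with the examples above. It always advances past the end of the whole match (slot 0), so the groups never need to be inspected.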
 
#3
You can shorten that by looping backwards so the region 0 match only gets counted if no group match was found.

Code:
    INT matches = 0;
    while ( onig_search(regex, mstart, mend, mstart, mend, region, 0) >= 0 )
    {
        matches += 1;

        for ( INT i = region->num_regs-1; i >= 0; i-- )
        {
            if ( region->beg[i] >= 0 )
            {
                mstart += region->end[i];
                break;    
            }
        }
    }
    Sprintf(psz, L"%d", matches);
 

#4
> How am I to interpret the return value of @REGEX[] below? It doesn't
> seem to be the (documented) "number of matching groups".
>
>
> Code:
> ---------
> e:\logs\mercury> echo %@REGEX["(efused)|(uthor)|(known)",refused]
> 4
> ---------
I tried that on several regular expression testers, and got results of 0, 1,
or 4, depending on the RE emulation desired.

So -- what are you trying to do, and what language syntax are you using?

Rex Conn
JP Software
 
#5
On Sun, 11 Jul 2010 22:25:42 -0400, rconn wrote:

|---Quote---
|> How am I to interpret the return value of @REGEX[] below? It doesn't
|> seem to be the (documented) "number of matching groups".
|>
|>
|> Code:
|> ---------
|> e:\logs\mercury> echo %@REGEX["(efused)|(uthor)|(known)",refused]
|> 4
|> ---------
|---End Quote---
|I tried that on several regular expression testers, and got results of 0, 1,
|or 4, depending on the RE emulation desired.
|
|So -- what are you trying to do, and what language syntax are you using?

I use Perl syntax. Your return value doesn't seem to depend on how
many matches are found. Are you returning region.num_regs? That's
always the number of parenthesized groups (plus 1) in the regex.
That's what it looks like (see below). You have to loop to get all
the matches.

Code:
v:\> echo %@regex[(a)|(b)|(c),cat]
4

v:\> echo %@regex[(a)|(b)|(c),ccaat]
4

v:\> echo %@regex[(a)|(b)|(c),cccaaat]
4

v:\> echo %@regex[(a)|(b)|(c)|(d),cccaaat]
5

v:\> echo %@regex[(a)|(b)|(c)|(d),ccaat]
5

v:\> echo %@regex[(a)|(b)|(c)|(d),cat]
5
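
To show why a value like that cannot depend on the subject string, here is a small sketch using POSIX regex.h (a stand-in for Oniguruma, not the code TCC uses). After compiling, re_nsub holds the number of parenthesized groups, and adding 1 for the whole-match slot gives the same kind of figure as num_regs. The function name capture_slots is invented for this example.

```c
#include <regex.h>

/* Number of capture slots for a pattern: slot 0 (the whole match) plus one
   per parenthesized group. Hypothetical helper; returns -1 if the pattern
   is invalid. Note that no subject string is involved at all. */
int capture_slots(const char *pattern)
{
    regex_t re;
    int slots;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    slots = (int)re.re_nsub + 1;    /* fixed when the pattern is compiled */
    regfree(&re);
    return slots;
}
```

Here capture_slots("(a)|(b)|(c)") is 4 and capture_slots("(a)|(b)|(c)|(d)") is 5, whatever string is searched afterwards.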
 
#6
On Sun, 11 Jul 2010 22:25:42 -0400, rconn wrote:

|So -- what are you trying to do

I was just pointing out that, contrary to the help, @REGEX[] doesn't
return the number of matching groups. The code I posted (and the
complete version I emailed you) simply always returns the number of
matches. As far as counting matches is concerned, groups are not
significant; there are 3 matches here [a|b|c,cab] as well as here
[(a)|(b)|(c),cab] ... also here [(a|b|c),cab]. I'm not even sure
whether there's any point in using groups in a simple "find_a_match"
or "count_the_matches" function.
 
#7
Here's a simpler, faster, and much more intuitive way to count matches than the code I posted earlier.

Code:
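    UChar    *at = (UChar*) pString,
             *mend = (UChar*) pString + lstrlen(pString) * sizeof(WCHAR);
    INT      matches = 0,
             matchlen;

    while ( at < mend )
    {
        // does a match start exactly at "at"? (length in bytes, or negative if not)
        matchlen = onig_match(regex, (UChar*) pString, mend, at, NULL, option);
        if ( matchlen > 0 )
        {
            matches += 1;
            at += matchlen;    // skip past the match
        }
        else
        {
            if ( matchlen == 0 )
                matches += 1;       // count an empty match ...
            at += sizeof(WCHAR);    // ... but always advance one character so the loop ends
        }
    }

    Sprintf(psz, L"%d", matches);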
If you want to count matches, you must plow through the string looking for subsequent ones. The onig_match function is a bit odd: it only checks whether a match starts at "at". The parameter indicating the beginning of the whole string (pString, above) appears irrelevant; the function works even if that parameter is NULL or greater than "at", so in my tests it does not seem to be used at all.
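
The "test each position" approach can also be sketched without Oniguruma. POSIX regex.h has no anchored-match call, so this version emulates one by wrapping the pattern as ^(pattern) and applying it to subject + offset; that wrapping is an assumption of this sketch, not how onig_match works internally. The name count_anchored is hypothetical, and empty matches are skipped rather than counted.

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Count matches by asking, at each offset, whether a match starts exactly
   there. Hypothetical helper; returns -1 on a bad pattern or failed malloc. */
int count_anchored(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t m[1];
    size_t len = strlen(subject), at = 0;
    size_t n = strlen(pattern) + 4;         /* "^(" + pattern + ")" + NUL */
    char *wrapped = malloc(n);
    int count = 0, rc;

    if (wrapped == NULL)
        return -1;
    snprintf(wrapped, n, "^(%s)", pattern);
    rc = regcomp(&re, wrapped, REG_EXTENDED);
    free(wrapped);
    if (rc != 0)
        return -1;

    while (at < len) {
        if (regexec(&re, subject + at, 1, m, 0) == 0 && m[0].rm_eo > 0) {
            count++;
            at += (size_t)m[0].rm_eo;       /* skip the matched text */
        } else {
            at += 1;                        /* nothing (or only an empty match) here */
        }
    }
    regfree(&re);
    return count;
}
```

With this sketch, count_anchored gives 3 on the subject cab for each of a|b|c, (a)|(b)|(c) and (a|b|c), which is the point made earlier in the thread: grouping makes no difference to the count. One caveat is that every offset is treated as the start of the string, so a pattern containing its own ^ would behave differently than in a single search over the whole string.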