@REGEXSUB issue

It seems that @REGEXSUB handles regex alternatives incorrectly.
Code:
C:\> for %i in (a b c) echo %i in (a+)^|(b+): %@REGEXINDEX["(a+)|(b+)",x%i] (%@REGEXSUB[1,"(a+)|(b+)",x%i])

a in (a+)|(b+): 1 (a)
b in (a+)|(b+): 1 ()          <=== should match "b"
c in (a+)|(b+): -1 ()

C:\> ver
TCC  9.02.152   Windows XP [Version 5.1.2600]
For comparison, @REGEXINDEX matches the second capture in the second line, (b+), while @REGEXSUB doesn't.
 
On Mon, 23 Mar 2009 17:04:59 -0500, Stefano Piccardi <>
wrote:

|No answer? Then does @REGEXSUB work correctly in version 10?

It would seem it's not working correctly. In writing @XREPLACE (4UTILS) I did
not go out of my way to accommodate this particular scenario. Alternatives
(Perl syntax) seem to be OK as far as Oniguruma is concerned:

v:\> echo %@xreplace[(a+)|(b+),z,xaax]
xzx

v:\> echo %@xreplace[(a+)|(b+),z,xbbx]
xzx

v:\> echo %@xreplace[(a+)|(b+),z,xccx]
xccx
--
- Vince
 
Thank you for confirming this issue in @REGEXSUB.
The regex is given to me as part of a configuration file, so I can't rewrite it to work around REGEXSUB.
I need to isolate the content of the capture.
I suppose I could do it with @XREPLACE + @WORD, but I noticed a similar issue in @XREPLACE.

Run in batch file: for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\1z,x%i])

output:
aa in (a+)|(b+): (xzaaz)
bb in (a+)|(b+): (xzz)
cc in (a+)|(b+): (xcc)

The second line should be (xzbbz).

BTW, if I don't quote the regex I get a different, still incorrect, output:

aa in (a+)|(b+): (xaa)
bb in (a+)|(b+): (xbb)
cc in (a+)|(b+): (xcc)
 
On Tue, 24 Mar 2009 09:44:59 -0500, Stefano Piccardi <>
wrote:

|for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\1z,x%i])
|BTW, if I don't quote the regex I get a different, still incorrect, output:
|
|aa in (a+)|(b+): (xaa)
|bb in (a+)|(b+): (xbb)
|cc in (a+)|(b+): (xcc)

These should be:

v:\> for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\1z,x%i])
aa in (a+)|(b+): (xzaaz)
bb in (a+)|(b+): (xzz) [\1 not found]
cc in (a+)|(b+): (xcc)

v:\> for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\2z,x%i])
aa in (a+)|(b+): (xzz) [\2 not found]
bb in (a+)|(b+): (xzbbz)
cc in (a+)|(b+): (xcc)

I found that bug last night while experimenting after reading your post. It
resulted from a change (quite a while back) from wcscpyn() to lstrncpy() which
behaves a little differently. There's a new one in the VC9 plugin directory on
lucky.syr.edu.
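
For comparison only, here is a small Python sketch of the numbering used in the expected outputs above: groups are counted left to right, and a group that did not take part in the match substitutes as an empty string. It uses Python's `re` module rather than Oniguruma, so it illustrates the convention and says nothing about the plugin's code. `xreplace_like` is a hypothetical helper name, and the empty-string substitution assumes Python 3.5 or later.

```python
import re

# Same pattern as in the thread: two alternatives, each with its own group.
PATTERN = re.compile(r"(a+)|(b+)")


def xreplace_like(replacement: str, text: str) -> str:
    """Replace every match of PATTERN in text (hypothetical stand-in for @XREPLACE)."""
    return PATTERN.sub(replacement, text)


for word in ("aa", "bb", "cc"):
    subject = "x" + word
    # \1 is empty when the second alternative matched, and \2 is empty
    # when the first alternative matched.
    print(word, xreplace_like(r"z\1z", subject), xreplace_like(r"z\2z", subject))
```

With this convention, `xbb` becomes `xzz` for `z\1z` and `xzbbz` for `z\2z`, the same as the outputs listed above.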

Please check out TYPEX (like TYPE /X) and UNTYPEX. If you redirect TYPEX to a
file UNTYPEX will re-construct the original file from the hex values. Don't
overwrite anything important. UNTYPEX needs the exact format that TYPEX outputs.
If you edit TYPEX's output, do it carefully. These are experimental. Example:

v:\> typex fleas.txt
00000000 4D 79 20 64 6F 67 20 68 61 73 20 66 6C 65 61 73 My dog has fleas
00000010 21 0D 0A !..

v:\> typex fleas.txt > fleas.hex

v:\> edit fleas.hex & rem change 61 to 41

v:\> untypex fleas.hex fleas2.txt

v:\> typex fleas2.txt
00000000 4D 79 20 64 6F 67 20 68 41 73 20 66 6C 65 41 73 My dog hAs fleAs
00000010 21 0D 0A !..
--
- Vince
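
A rough Python sketch of the hex round trip described above may help show why the dump has to keep its exact format. This is not the plugin's code: `hexdump` and `undump` are hypothetical names, and the fixed-width hex column (padded to 47 characters so the text column can be ignored by position) is an assumption about the layout.

```python
HEX_WIDTH = 16 * 2 + 15  # 16 two-digit bytes separated by single spaces


def hexdump(data: bytes) -> str:
    """Format data as 'offset hex-bytes text' rows of 16 bytes each."""
    rows = []
    for offset in range(0, len(data), 16):
        chunk = data[offset:offset + 16]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Non-printable bytes are shown as '.' in the text column.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08X} {hex_part:<{HEX_WIDTH}} {text}")
    return "\n".join(rows)


def undump(dump: str) -> bytes:
    """Rebuild the bytes from the hex column only; the text column is ignored."""
    out = bytearray()
    for row in dump.splitlines():
        # Skip the 8-digit offset and the space after it, then read the hex column.
        out.extend(bytes.fromhex(row[9:9 + HEX_WIDTH]))
    return bytes(out)


original = b"My dog has fleas!\r\n"
edited = undump(hexdump(original).replace("61", "41"))
print(edited)
```

Because `undump` reads the hex column by position, an edit that shifts the columns would corrupt the result, which is the same reason for editing such a dump carefully.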
 
|These should be:
|
|v:\> for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\1z,x%i])
|aa in (a+)|(b+): (xzaaz)
|bb in (a+)|(b+): (xzz) [\1 not found]
|cc in (a+)|(b+): (xcc)
|
|v:\> for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\2z,x%i])
|aa in (a+)|(b+): (xzz) [\2 not found]
|bb in (a+)|(b+): (xzbbz)
|cc in (a+)|(b+): (xcc)

I weakly disagree. IMO in an alternate each capture should count as \1, so the output should be
(xzaaz)
(xzbbz)
(xcc)
But I hesitate to make a stronger point because this level of detail is left open to interpretation in the regex documentation.

However, if in an alternate the first capture is sometimes called \1 and other times it's called \2 then we're out of luck. Consider:
prefix A(\d+)|prefix B(\w+) should capture \1 as
a number when it's prefixed by prefix A or as a word when it's prefixed by prefix B. It can't be rewritten as
prefix A|prefix B(\d+|\w+)
to lock the alternate capture into the same group of parentheses. The second regex is not equivalent to the first one.

IMO the cardinal number of the capture should be assigned as the match is being evaluated, not as it is being compiled from left to right.
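
To make the non-equivalence concrete, here is a minimal Python sketch comparing the two patterns. It uses Python's `re` module, not Oniguruma, and `captures` is a hypothetical helper, so it only illustrates how alternation binds in a typical Perl-style engine.

```python
import re

ORIGINAL = r"prefix A(\d+)|prefix B(\w+)"
REWRITTEN = r"prefix A|prefix B(\d+|\w+)"


def captures(pattern: str, text: str):
    """Return the tuple of capture groups for the first match, or None."""
    m = re.search(pattern, text)
    return None if m is None else m.groups()


for text in ("prefix A123", "prefix Bfoo"):
    print(text, captures(ORIGINAL, text), captures(REWRITTEN, text))
```

For `prefix A123` the original pattern captures `123` in its first group, while the rewritten pattern matches only the bare `prefix A` and captures nothing, because `|` splits the whole expression and not just the prefixes.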
 
On Tue, 24 Mar 2009 13:13:31 -0500, Stefano Piccardi <>
wrote:

|v:\> for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\2z,x%i])
|aa in (a+)|(b+): (xzz) [\2 not found]
|bb in (a+)|(b+): (xzbbz)
|cc in (a+)|(b+): (xcc)

|I weakly disagree. IMO in an alternate each capture should count as \1, so the output should be
|(xzaaz)
|(xzbbz)
|(xcc)
|But I hesitate to make a stronger point because this level of detail is left open to interpretation in the regex documentation.

It's always been clear to me. \1, \2, ... refer to the parenthesized
expressions in order. To have it your way, do this:

for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+|b+)",z\1z,x%i])
aa in (a+)|(b+): (xzaaz)
bb in (a+)|(b+): (xzbbz)
cc in (a+)|(b+): (xcc)
--
- Vince
 
|It's always been clear to me. \1, \2, ... refer to the parenthesized
|expressions in order. To have it your way, do this:
|
|for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+|b+)",z\1z,x%i])
|aa in (a+)|(b+): (xzaaz)
|bb in (a+)|(b+): (xzbbz)
|cc in (a+)|(b+): (xcc)
|--
|- Vince

However, if in an alternate the first capture is sometimes called \1 and other times it's called \2 then we're out of luck. Consider:
prefix A(\d+)|prefix B(\w+) should capture \1 as
a number when it's prefixed by prefix A or as a word when it's prefixed by prefix B. It can't be rewritten as
prefix A|prefix B(\d+|\w+)
to lock the alternate capture into the same group of parentheses. The second regex is not equivalent to the first one.

IMO the cardinal number of the capture should be assigned as the match is being evaluated, not as it is being compiled from left to right.
 
On Tue, 24 Mar 2009 13:54:48 -0500, Stefano Piccardi <>
wrote:

|However, if in an alternate the first capture is sometimes called \1 and other times it's called \2 then we're out of luck. Consider:
|prefix A(\d+)|prefix B(\w+) should capture \1 as
|a number when it's prefixed by prefix A or as a word when it's prefixed by prefix B. It can't be rewritten as
|prefix A|prefix B(\d+|\w+)
|to lock the alternate capture into the same group of parentheses. The second regex is not equivalent to the first one.

What about (A\d+|B\w+)?

|IMO the cardinal number of the capture should be assigned as the match is being evaluated, not as it is being compiled from left to right.

Then you wouldn't know what was matched.
--
- Vince
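
As an illustration of the point about knowing what was matched: when groups keep their left-to-right numbers, the number of the group that took part in the match identifies the alternative. The sketch below uses Python's `re` module (not Oniguruma) and its `Match.lastindex` attribute; `classify` and the labels are hypothetical names chosen for this example.

```python
import re

PATTERN = re.compile(r"prefix A(\d+)|prefix B(\w+)")

# Hypothetical labels, keyed by the left-to-right group number.
LABELS = {1: "number after prefix A", 2: "word after prefix B"}


def classify(text: str):
    """Return (label, captured text) for the alternative that matched, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # lastindex is the number of the last group that took part in the match;
    # with one group per alternative it identifies the alternative.
    return LABELS[m.lastindex], m.group(m.lastindex)


print(classify("prefix A42"))
print(classify("prefix B42"))
```

Both inputs capture the same text, `42`, and only the group number tells the two cases apart.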
 
Stefano Piccardi wrote:

> It seems that @REGEXSUB handles regex alternatives incorrectly.
>
> Code:
> ---------
> C:\> for %i in (a b c) echo %i in (a+)^|(b+): %@REGEXINDEX["(a+)|(b+)",x%i] (%@REGEXSUB[1,"(a+)|(b+)",x%i])
>
> a in (a+)|(b+): 1 (a)
> b in (a+)|(b+): 1 () <=== should match "b"
> c in (a+)|(b+): -1 ()
>
> C:\> ver
> TCC 9.02.152 Windows XP [Version 5.1.2600]
> ---------
> For comparison, @REGEXINDEX matches the second capture in the second line, (b+), while @REGEXSUB doesn't.

I'll pass it on to the Oniguruma developers (though in every "bug"
reported in regular expressions for the past couple of years Oniguruma
has been correct).

Rex Conn
JP Software
 
On Tue, 24 Mar 2009 21:52:54 -0500, rconn <> wrote:

|Stefano Piccardi wrote:
|
|
|---Quote---
|> It seems that @REGEXSUB handles regex alternatives incorrectly.
|>
|> Code:
|> ---------
|> C:\> for %i in (a b c) echo %i in (a+)^|(b+): %@REGEXINDEX["(a+)|(b+)",x%i] (%@REGEXSUB[1,"(a+)|(b+)",x%i])
|>
|> a in (a+)|(b+): 1 (a)
|> b in (a+)|(b+): 1 () <=== should match "b"
|> c in (a+)|(b+): -1 ()
|>
|> C:\> ver
|> TCC 9.02.152 Windows XP [Version 5.1.2600]
|> ---------
|> For comparison, @REGEXINDEX matches the second capture in the second line, (b+), while @REGEXSUB doesn't.
|---End Quote---
|I'll pass it on to the Oniguruma developers (though in every "bug"
|reported in regular expressions for the past couple of years Oniguruma
|has been correct).

I don't think it's Onig (or usage). If I do

while ( onig_search(regex, mstart, mend, mstart, mend, region, 0) >= 0 )

with regex pointing to the (unquoted) "(a+)|(b+)", and mstart/mend delimiting
the string "xxbxx", it finds a match (and knows where it is):

v:\> echo %@xreplace["(a+)|(b+)",**\2**,xxbxx]
xx**b**xx

In Stefano's faulty case, @REGEXINDEX is finding the pattern while @REGEXSUB is
not.
--
- Vince
 
On Tue, 24 Mar 2009 21:52:54 -0500, rconn <> wrote:

|---Quote---
|> It seems that @REGEXSUB handles regex alternatives incorrectly.
|>
|> Code:
|> ---------
|> C:\> for %i in (a b c) echo %i in (a+)^|(b+): %@REGEXINDEX["(a+)|(b+)",x%i] (%@REGEXSUB[1,"(a+)|(b+)",x%i])
|>
|> a in (a+)|(b+): 1 (a)
|> b in (a+)|(b+): 1 () <=== should match "b"
|> c in (a+)|(b+): -1 ()
|>
|> C:\> ver
|> TCC 9.02.152 Windows XP [Version 5.1.2600]
|> ---------
|> For comparison, @REGEXINDEX matches the second capture in the second line, (b+), while @REGEXSUB doesn't.
|---End Quote---
|I'll pass it on to the Oniguruma developers (though in every "bug"
|reported in regular expressions for the past couple of years Oniguruma
|has been correct).

Well, you know, it does (sort of) work, but not in a way that's very useful:

v:\> echo %@REGEXSUB[1,"(a+)|(b+)",cbbc]
ECHO is OFF

v:\> echo %@REGEXSUB[2,"(a+)|(b+)",cbbc]
bb

The help says of @REGEXSUB that it "returns the nth matching group in the string". So
the discrepancy is between two notions of "the nth matching group". Above,
there actually was a **first** match but it matched the second parenthesized
pattern; there certainly wasn't a 2nd match. IMO better behavior would be:

v:\> echo %@REGEXSUB[1,"(a+)|(b+)",cbbc]
bb [a first match]

v:\> echo %@REGEXSUB[2,"(a+)|(b+)",cbbc]
ECHO is OFF [no second match]
--
- Vince
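
The two notions of "the nth matching group" can be put side by side in a short Python sketch. It uses Python's `re` module, not TCC or Oniguruma, and both function names are hypothetical; it is only meant to show the difference in behavior, not how @REGEXSUB is implemented.

```python
import re


def group_by_position(n: int, pattern: str, text: str) -> str:
    """Nth parenthesized expression, counted left to right in the pattern."""
    m = re.search(pattern, text)
    if m is None:
        return ""
    return m.group(n) or ""  # a group that did not take part yields ""


def group_by_participation(n: int, pattern: str, text: str) -> str:
    """Nth group among those that actually took part in the match."""
    m = re.search(pattern, text)
    if m is None:
        return ""
    matched = [g for g in m.groups() if g is not None]
    return matched[n - 1] if 1 <= n <= len(matched) else ""


for n in (1, 2):
    print(n,
          repr(group_by_position(n, r"(a+)|(b+)", "cbbc")),
          repr(group_by_participation(n, r"(a+)|(b+)", "cbbc")))
```

By position, `cbbc` gives an empty result for n=1 and `bb` for n=2, which is the behavior reported above. By participation, it gives `bb` for n=1 and an empty result for n=2, which is the behavior proposed above.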
 
