@REGEXSUB issue

Stefano Piccardi · Mar 17, 2009

It seems that @REGEXSUB handles regex alternatives incorrectly.

Code:

C:\> for %i in (a b c) echo %i in (a+)^|(b+): %@REGEXINDEX["(a+)|(b+)",x%i] (%@REGEXSUB[1,"(a+)|(b+)",x%i])

a in (a+)|(b+): 1 (a)
b in (a+)|(b+): 1 ()          <=== should match "b"
c in (a+)|(b+): -1 ()

C:\> ver
TCC  9.02.152   Windows XP [Version 5.1.2600]

For comparison, @REGEXINDEX matches the second capture in the second line, (b+), while @REGEXSUB doesn't.
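
For readers who want to check the numbering outside TCC, here is a minimal sketch using Python's re module. That choice is an assumption made for illustration: TCC uses Oniguruma, and index_and_groups is a hypothetical helper, not a TCC function. It reports the match offset, as @REGEXINDEX does, next to the value of every parenthesized group numbered left to right.

```python
import re

# Same pattern as in the report; groups are numbered left to right.
PATTERN = re.compile(r"(a+)|(b+)")

def index_and_groups(text):
    """Return (offset of the match, tuple of group values), or (-1, ()) if no match."""
    m = PATTERN.search(text)
    if m is None:
        return -1, ()
    # Groups that did not take part in the match are reported as None.
    return m.start(), m.groups()

for text in ("xa", "xb", "xc"):
    print(text, index_and_groups(text))
```

Under Python's numbering, the match in xb starts at offset 1 and its text is held by group 2 while group 1 is unset, which would be consistent with @REGEXSUB[1,...] returning an empty string there.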

Stefano Piccardi · Mar 23, 2009

No answer? Then does @REGEXSUB work correctly in version 10?

vefatica · Mar 23, 2009

On Mon, 23 Mar 2009 17:04:59 -0500, Stefano Piccardi wrote:

|No answer? Then does @REGEXSUB work correctly in version 10?

It would seem it's not working correctly. In writing @XREPLACE (4UTILS) I did
not go out of my way to accommodate this particular scenario. Alternatives
(Perl syntax) seem to be OK as far as Oniguruma is concerned:

v:\> echo %@xreplace[(a+)|(b+),z,xaax]
xzx

v:\> echo %@xreplace[(a+)|(b+),z,xbbx]
xzx

v:\> echo %@xreplace[(a+)|(b+),z,xccx]
xccx
--
- Vince

Stefano Piccardi · Mar 24, 2009

Thank you for confirming this issue in @REGEXSUB.
The regex is given to me as part of a configuration file, so I can't rewrite it to work around REGEXSUB.
I need to isolate the content of the capture.
I suppose that I could do it with @XREPLACE + @WORD, but I noticed a similar issue in @XREPLACE.

Run in batch file: for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\1z,x%i])

output:
aa in (a+)|(b+): (xzaaz)
bb in (a+)|(b+): (xzz)
cc in (a+)|(b+): (xcc)

The second line should be (xzbbz).

BTW, if I don't quote the regex I get a different, still incorrect, output:

aa in (a+)|(b+): (xaa)
bb in (a+)|(b+): (xbb)
cc in (a+)|(b+): (xcc)

vefatica · Mar 24, 2009

On Tue, 24 Mar 2009 09:44:59 -0500, Stefano Piccardi wrote:

|for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\1z,x%i])
|BTW, if I don't quote the regex I get a different, still incorrect, output:
|
|aa in (a+)|(b+): (xaa)
|bb in (a+)|(b+): (xbb)
|cc in (a+)|(b+): (xcc)

These should be:

v:\> for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\1z,x%i])
aa in (a+)|(b+): (xzaaz)
bb in (a+)|(b+): (xzz) [\1 not found]
cc in (a+)|(b+): (xcc)

v:\> for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\2z,x%i])
aa in (a+)|(b+): (xzz) [\2 not found]
bb in (a+)|(b+): (xzbbz)
cc in (a+)|(b+): (xcc)

I found that bug last night while experimenting after reading your post. It
resulted from a change (quite a while back) from wcsncpy() to lstrcpyn(), which
behaves a little differently. There's a new one in the VC9 plugin directory on
lucky.syr.edu.
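
The corrected output listed above can be cross-checked in another engine. This is a sketch with Python's re module (version 3.5 or later, where a group that did not take part in the match expands to an empty string). It is not @XREPLACE or Oniguruma, and replace_with_group is a hypothetical helper name.

```python
import re

PATTERN = re.compile(r"(a+)|(b+)")

def replace_with_group(group_number, text):
    """Replace each match with z<group>z, where <group> is the numbered capture."""
    # \g<N> is Python's unambiguous spelling of the backreference \N.
    return PATTERN.sub(r"z\g<%d>z" % group_number, text)

for text in ("xaa", "xbb", "xcc"):
    print(text, replace_with_group(1, text), replace_with_group(2, text))
```

With left-to-right numbering, \1 is empty whenever the (b+) alternative is the one that matched, and \2 is empty whenever (a+) matched.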

Please check out TYPEX (like TYPE /X) and UNTYPEX. If you redirect TYPEX's output
to a file, UNTYPEX will reconstruct the original file from the hex values. Don't
overwrite anything important. UNTYPEX needs the exact format that TYPEX outputs,
so if you edit TYPEX's output, do it carefully. These are experimental. Example:

v:\> typex fleas.txt
00000000 4D 79 20 64 6F 67 20 68 61 73 20 66 6C 65 61 73 My dog has fleas
00000010 21 0D 0A !..

v:\> typex fleas.txt > fleas.hex

v:\> edit fleas.hex & rem change 61 to 41

v:\> untypex fleas.hex fleas2.txt

v:\> typex fleas2.txt
00000000 4D 79 20 64 6F 67 20 68 41 73 20 66 6C 65 41 73 My dog hAs fleAs
00000010 21 0D 0A !..
--
- Vince

Stefano Piccardi · Mar 24, 2009

vefatica said:
These should be:

v:\> for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\1z,x%i])
aa in (a+)|(b+): (xzaaz)
bb in (a+)|(b+): (xzz) [\1 not found]
cc in (a+)|(b+): (xcc)

v:\> for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\2z,x%i])
aa in (a+)|(b+): (xzz) [\2 not found]
bb in (a+)|(b+): (xzbbz)
cc in (a+)|(b+): (xcc)

I weakly disagree. IMO in an alternation each capture should count as \1, so the output should be
(xzaaz)
(xzbbz)
(xcc)
But I hesitate to make a stronger point because this level of detail is left open to interpretation in the regex documentation.

However, if in an alternation the first capture is sometimes called \1 and other times it's called \2, then we're out of luck. Consider:
prefix A(\d+)|prefix B(\w+)
It should capture \1 as a number when it's prefixed by prefix A, or as a word when it's prefixed by prefix B. It can't be rewritten as
prefix A|prefix B(\d+|\w+)
to lock the alternate capture into the same group of parentheses. The second regex is not equivalent to the first one.

IMO the cardinal number of the capture should be assigned as the match is being evaluated, not as it is being compiled from left to right.
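
The claim that the two patterns are not equivalent can be checked with a short sketch. It uses Python's re module as a stand-in for the engine (an assumption; this thread is about Oniguruma in TCC), and captures is a hypothetical helper that returns the whole match together with the value of each group.

```python
import re

ORIGINAL = re.compile(r"prefix A(\d+)|prefix B(\w+)")
REWRITTEN = re.compile(r"prefix A|prefix B(\d+|\w+)")

def captures(pattern, text):
    """Return (whole match, tuple of group values), or None if nothing matched."""
    m = pattern.search(text)
    if m is None:
        return None
    # Groups that did not take part in the match show up as None.
    return m.group(0), m.groups()

for text in ("prefix A42", "prefix Bfoo"):
    print(text, captures(ORIGINAL, text), captures(REWRITTEN, text))
```

For prefix A42 the rewritten pattern stops after prefix A and captures nothing, because its alternation splits the whole expression rather than just the two prefixes.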

vefatica · Mar 24, 2009

Stefano Piccardi · Mar 24, 2009

vefatica said:
Please check out TYPEX (like TYPE /X) and UNTYPEX.

I like them! They seem to work well.
Also, the regex quoting seems to be fixed, thank you.

Stefano Piccardi · Mar 24, 2009

vefatica said:
It's always been clear to me. \1, \2, ... refer to the parenthesized
expressions in order. To have it your way, do this:

for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+|b+)",z\1z,x%i])
aa in (a+)|(b+): (xzaaz)
bb in (a+)|(b+): (xzbbz)
cc in (a+)|(b+): (xcc)
--
- Vince

However, if in an alternation the first capture is sometimes called \1 and other times it's called \2, then we're out of luck. Consider:
prefix A(\d+)|prefix B(\w+)
It should capture \1 as a number when it's prefixed by prefix A, or as a word when it's prefixed by prefix B. It can't be rewritten as
prefix A|prefix B(\d+|\w+)
to lock the alternate capture into the same group of parentheses. The second regex is not equivalent to the first one.

IMO the cardinal number of the capture should be assigned as the match is being evaluated, not as it is being compiled from left to right.

vefatica · Mar 24, 2009

On Tue, 24 Mar 2009 13:54:48 -0500, Stefano Piccardi wrote:

|However, if in an alternation the first capture is sometimes called \1 and other times it's called \2, then we're out of luck. Consider:
|prefix A(\d+)|prefix B(\w+)
|It should capture \1 as a number when it's prefixed by prefix A, or as a word when it's prefixed by prefix B. It can't be rewritten as
|prefix A|prefix B(\d+|\w+)
|to lock the alternate capture into the same group of parentheses. The second regex is not equivalent to the first one.

What about (A\d+|B\w+)?

|IMO the cardinal number of the capture should be assigned as the match is being evaluated, not as it is being compiled from left to right.

Then you wouldn't know what was matched.
--
- Vince

rconn · Mar 24, 2009

Stefano Piccardi wrote:

> It seems that @REGEXSUB handles regex alternatives incorrectly.
>
> Code:
> ---------
> C:\> for %i in (a b c) echo %i in (a+)^|(b+): %@REGEXINDEX["(a+)|(b+)",x%i] (%@REGEXSUB[1,"(a+)|(b+)",x%i])
>
> a in (a+)|(b+): 1 (a)
> b in (a+)|(b+): 1 () <=== should match "b"
> c in (a+)|(b+): -1 ()
>
> C:\> ver
> TCC 9.02.152 Windows XP [Version 5.1.2600]
> ---------
> For comparison, @REGEXINDEX matches the second capture in the second line, (b+), while @REGEXSUB doesn't.

I'll pass it on to the Oniguruma developers (though in every "bug"
reported in regular expressions for the past couple of years Oniguruma
has been correct).

Rex Conn
JP Software

vefatica · Mar 25, 2009

On Tue, 24 Mar 2009 21:52:54 -0500, rconn wrote:

|---Quote---
|> It seems that @REGEXSUB handles regex alternatives incorrectly.
|>
|> Code:
|> ---------
|> C:\> for %i in (a b c) echo %i in (a+)^|(b+): %@REGEXINDEX["(a+)|(b+)",x%i] (%@REGEXSUB[1,"(a+)|(b+)",x%i])
|>
|> a in (a+)|(b+): 1 (a)
|> b in (a+)|(b+): 1 () <=== should match "b"
|> c in (a+)|(b+): -1 ()
|>
|> C:\> ver
|> TCC 9.02.152 Windows XP [Version 5.1.2600]
|> ---------
|> For comparison, @REGEXINDEX matches the second capture in the second line, (b+), while @REGEXSUB doesn't.
|---End Quote---
|I'll pass it on to the Oniguruma developers (though in every "bug"
|reported in regular expressions for the past couple of years Oniguruma
|has been correct).

Well, you know, it does (sort of) work, but not in a way that's very useful:

v:\> echo %@REGEXSUB[1,"(a+)|(b+)",cbbc]
ECHO is OFF

v:\> echo %@REGEXSUB[2,"(a+)|(b+)",cbbc]
bb

The help says of @REGEXSUB that it "returns the nth matching group in the string". So
the discrepancy is between two notions of "the nth matching group". Above,
there actually was a **first** match, but it matched the second parenthesized
pattern; there certainly wasn't a 2nd match. IMO better behavior would be:

v:\> echo %@REGEXSUB[1,"(a+)|(b+)",cbbc]
bb [a first match]

v:\> echo %@REGEXSUB[2,"(a+)|(b+)",cbbc]
ECHO is OFF [no second match]
--
- Vince
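
The two notions of "the nth matching group" can be put side by side. This sketch uses Python's re module rather than Oniguruma, and both function names are hypothetical: nth_pattern_group follows left-to-right numbering of the parentheses, while nth_participating_group implements the proposed behavior by counting only the groups that actually took part in the match.

```python
import re

def nth_pattern_group(n, pattern, text):
    """Left-to-right numbering: value of parenthesized group n, or "" if unset."""
    m = re.search(pattern, text)
    if m is None or not 1 <= n <= m.re.groups:
        return ""
    return m.group(n) or ""

def nth_participating_group(n, pattern, text):
    """Match-time numbering: value of the nth group that took part in the match."""
    m = re.search(pattern, text)
    if m is None:
        return ""
    # Keep only groups that matched something (possibly the empty string).
    matched = [g for g in m.groups() if g is not None]
    return matched[n - 1] if 1 <= n <= len(matched) else ""

for n in (1, 2):
    print(n, repr(nth_pattern_group(n, r"(a+)|(b+)", "cbbc")),
          repr(nth_participating_group(n, r"(a+)|(b+)", "cbbc")))
```

The trade-off raised earlier in the thread still applies: the second function returns the captured text but no longer says which parenthesized pattern produced it.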
