
@REGEXSUB issue

Discussion in 'Support' started by Stefano Piccardi, Mar 17, 2009.

  1. Stefano Piccardi

    Joined:
    May 31, 2008
    Messages:
    376
    Likes Received:
    2
    It seems that @REGEXSUB handles regex alternatives incorrectly.
    Code:
    C:\> for %i in (a b c) echo %i in (a+)^|(b+): %@REGEXINDEX["(a+)|(b+)",x%i] (%@REGEXSUB[1,"(a+)|(b+)",x%i])
    
    a in (a+)|(b+): 1 (a)
    b in (a+)|(b+): 1 ()          <=== should match "b"
    c in (a+)|(b+): -1 ()
    
    C:\> ver
    TCC  9.02.152   Windows XP [Version 5.1.2600]
    
    For comparison, in the second line @REGEXINDEX finds the match made by the second capture, (b+), while @REGEXSUB returns nothing.
     
  2. Stefano Piccardi

    Joined:
    May 31, 2008
    Messages:
    376
    Likes Received:
    2
    No answer? Then does @REGEXSUB work correctly in version 10?
     
  3. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,792
    Likes Received:
    29
    On Mon, 23 Mar 2009 17:04:59 -0500, Stefano Piccardi <>
    wrote:

    |No answer? Then does @REGEXSUB work correctly in version 10?

    It would seem it's not working correctly. In writing @XREPLACE (4UTILS) I did
    not go out of my way to accommodate this particular scenario. Alternatives
    (Perl syntax) seem to be OK as far as Oniguruma is concerned:

    v:\> echo %@xreplace[(a+)|(b+),z,xaax]
    xzx

    v:\> echo %@xreplace[(a+)|(b+),z,xbbx]
    xzx

    v:\> echo %@xreplace[(a+)|(b+),z,xccx]
    xccx
    --
    - Vince
     
  4. Stefano Piccardi

    Joined:
    May 31, 2008
    Messages:
    376
    Likes Received:
    2
    Thank you for confirming this issue in @REGEXSUB.
    The regex is given to me as part of a configuration file, so I can't rewrite it to work around REGEXSUB.
    I need to isolate the content of the capture.
    I suppose that I could do it with @XREPLACE + @WORD, but I noticed a similar issue in @XREPLACE

    Run in batch file: for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\1z,x%i])

    output:
    aa in (a+)|(b+): (xzaaz)
    bb in (a+)|(b+): (xzz)
    cc in (a+)|(b+): (xcc)

    The second line should be (xzbbz).

    BTW, if I don't quote the regex I get a different, still incorrect, output:

    aa in (a+)|(b+): (xaa)
    bb in (a+)|(b+): (xbb)
    cc in (a+)|(b+): (xcc)
     
  5. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,792
    Likes Received:
    29
    On Tue, 24 Mar 2009 09:44:59 -0500, Stefano Piccardi <>
    wrote:

    |for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\1z,x%i])
    |BTW, if I don't quote the regex I get a different, still incorrect, output:
    |
    |aa in (a+)|(b+): (xaa)
    |bb in (a+)|(b+): (xbb)
    |cc in (a+)|(b+): (xcc)

    These should be:

    v:\> for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\1z,x%i])
    aa in (a+)|(b+): (xzaaz)
    bb in (a+)|(b+): (xzz) [\1 not found]
    cc in (a+)|(b+): (xcc)

    v:\> for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\2z,x%i])
    aa in (a+)|(b+): (xzz) [\2 not found]
    bb in (a+)|(b+): (xzbbz)
    cc in (a+)|(b+): (xcc)

    I found that bug last night while experimenting after reading your post. It
    resulted from a change (quite a while back) from wcscpyn() to lstrncpy() which
    behaves a little differently. There's a new one in the VC9 plugin directory on
    lucky.syr.edu.
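
For comparison, the corrected outputs listed above are also what a different engine gives. This is only a sketch using Python's `re` module (3.5 or later, where a group that did not take part in the match expands to an empty string in the replacement). It is not Oniguruma and not the @XREPLACE source; `xreplace_like` is a made-up name for illustration.

```python
import re

# Same pattern as in the thread: two capture groups joined by alternation.
PATTERN = re.compile(r"(a+)|(b+)")

def xreplace_like(replacement, text):
    """Replace every match of PATTERN in text.

    A backreference to a group that did not participate in the match
    expands to an empty string (Python 3.5+ behaviour).
    """
    return PATTERN.sub(replacement, text)

for word in ("aa", "bb", "cc"):
    subject = "x" + word
    print(word, xreplace_like(r"z\1z", subject), xreplace_like(r"z\2z", subject))
# aa xzaaz xzz
# bb xzz xzbbz
# cc xcc xcc
```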

    Please check out TYPEX (like TYPE /X) and UNTYPEX. If you redirect TYPEX to a
    file UNTYPEX will re-construct the original file from the hex values. Don't
    overwrite anything important. UNTYPEX needs the exact format that TYPEX outputs.
    If you edit TYPEX's output, do it carefully. These are experimental. Example:

    v:\> typex fleas.txt
    00000000 4D 79 20 64 6F 67 20 68 61 73 20 66 6C 65 61 73 My dog has fleas
    00000010 21 0D 0A !..

    v:\> typex fleas.txt > fleas.hex

    v:\> edit fleas.hex & rem change 61 to 41

    v:\> untypex fleas.hex fleas2.txt

    v:\> typex fleas2.txt
    00000000 4D 79 20 64 6F 67 20 68 41 73 20 66 6C 65 41 73 My dog hAs fleAs
    00000010 21 0D 0A !..
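
The round trip shown above can be sketched in Python. This is not the TYPEX/UNTYPEX code and the exact column layout is an assumption of mine (8-digit offset, one space, a hex column padded to 47 characters, two spaces, then the text column); `typex_like` and `untypex_like` are hypothetical names. The point is only that the reverse step reads the hex column and ignores the text column, which is why editing the hex values is enough.

```python
def typex_like(data: bytes) -> str:
    """Dump bytes as lines of: offset, 16 hex values, printable text."""
    lines = []
    for offset in range(0, len(data), 16):
        chunk = data[offset:offset + 16]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Non-printable bytes are shown as '.' in the text column.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        # Assumed layout: the hex column is padded to 16*3-1 = 47 characters.
        lines.append(f"{offset:08X} {hex_part:<47}  {text_part}")
    return "\n".join(lines)

def untypex_like(dump: str) -> bytes:
    """Rebuild the bytes from the hex column only; the text column is ignored."""
    out = bytearray()
    for line in dump.splitlines():
        hex_part = line[9:9 + 47]  # fixed-width hex column (assumed layout)
        out.extend(int(token, 16) for token in hex_part.split())
    return bytes(out)

original = b"My dog has fleas!\r\n"
dump = typex_like(original)
edited = dump.replace(" 61", " 41")  # change 61 to 41, as in the example
print(untypex_like(edited))
# b'My dog hAs fleAs!\r\n'
```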
    --
    - Vince
     
  6. Stefano Piccardi

    Joined:
    May 31, 2008
    Messages:
    376
    Likes Received:
    2
    I weakly disagree. IMO in an alternate each capture should count as \1, so the output should be
    (xzaaz)
    (xzbbz)
    (xcc)
    But I hesitate to make a stronger point because this level of detail is left open to interpretation in the regex documentation.

    However, if in an alternate the first capture is sometimes called \1 and other times it's called \2 then we're out of luck. Consider:
    prefix A(\d+)|prefix B(\w+) should capture \1 as
    a number when it's prefixed by prefix A or as a word when it's prefixed by prefix B. It can't be rewritten as
    prefix A|prefix B(\d+|\w+)
    to lock the alternate capture into the same group of parentheses. The second regex is not equivalent to the first one.

    IMO the cardinal number of the capture should be assigned as the match is being evaluated, not as it is being compiled from left to right.
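
One way to get "whichever capture matched" without changing how groups are numbered is to ask the engine which group took part in the match. A minimal sketch in Python's `re` (not TCC or Oniguruma; `captured_value` is a hypothetical helper, and it assumes exactly one group participates per match, as in the regex above):

```python
import re

# The regex from the post: the groups are numbered 1 and 2, left to right.
PATTERN = re.compile(r"prefix A(\d+)|prefix B(\w+)")

def captured_value(text):
    """Return the text of whichever capture group took part in the match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # lastindex is the number of the last group that participated;
    # with this pattern only one group can participate per match.
    return m.group(m.lastindex)

print(captured_value("prefix A123"))   # 123
print(captured_value("prefix Bword"))  # word
print(captured_value("prefix C9"))     # None
```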
     
  7. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,792
    Likes Received:
    29
    On Tue, 24 Mar 2009 13:13:31 -0500, Stefano Piccardi <>
    wrote:

    |v:\> for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+)|(b+)",z\2z,x%i])
    |aa in (a+)|(b+): (xzz) [\2 not found]
    |bb in (a+)|(b+): (xzbbz)
    |cc in (a+)|(b+): (xcc)

    |I weakly disagree. IMO in an alternate each capture should count as \1, so the output should be
    |(xzaaz)
    |(xzbbz)
    |(xcc)
    |But I hesitate to make a stronger point because this level of detail is left open to interpretation in the regex documentation.

    It's always been clear to me. \1, \2, ... refer to the parenthesized
    expressions in order. To have it your way, do this:

    for %i in (aa bb cc) echo %i in (a+)^|(b+): (%@xreplace["(a+|b+)",z\1z,x%i])
    aa in (a+)|(b+): (xzaaz)
    bb in (a+)|(b+): (xzbbz)
    cc in (a+)|(b+): (xcc)
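
The numbering rule described above (groups are counted by their opening parenthesis, left to right) can be checked in any engine that follows Perl conventions. A short sketch in Python's `re`, used here only as a stand-in for Oniguruma; `groups_for` is a made-up name:

```python
import re

def groups_for(pattern, text):
    """Return the tuple of capture groups for the first match, or None."""
    m = re.search(pattern, text)
    return None if m is None else m.groups()

# Two groups: the match on "bb" is made by group 2, so group 1 is unset.
print(groups_for(r"(a+)|(b+)", "xbb"))  # (None, 'bb')

# One group around the whole alternation: the match is always group 1.
print(groups_for(r"(a+|b+)", "xbb"))    # ('bb',)
print(groups_for(r"(a+|b+)", "xaa"))    # ('aa',)
```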
    --
    - Vince
     
  8. Stefano Piccardi

    Joined:
    May 31, 2008
    Messages:
    376
    Likes Received:
    2
    I like them! They seem to work well.
    Also, the regex quoting seems to be fixed, thank you.
     
  9. Stefano Piccardi

    Joined:
    May 31, 2008
    Messages:
    376
    Likes Received:
    2
    However, if in an alternate the first capture is sometimes called \1 and other times it's called \2 then we're out of luck. Consider:
    prefix A(\d+)|prefix B(\w+) should capture \1 as
    a number when it's prefixed by prefix A or as a word when it's prefixed by prefix B. It can't be rewritten as
    prefix A|prefix B(\d+|\w+)
    to lock the alternate capture into the same group of parentheses. The second regex is not equivalent to the first one.

    IMO the cardinal number of the capture should be assigned as the match is being evaluated, not as it is being compiled from left to right.
     
  10. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,792
    Likes Received:
    29
    On Tue, 24 Mar 2009 13:54:48 -0500, Stefano Piccardi <>
    wrote:

    |However, if in an alternate the first capture is sometimes called \1 and other times it's called \2 then we're out of luck. Consider:
    |prefix A(\d+)|prefix B(\w+) should capture \1 as
    |a number when it's prefixed by prefix A or as a word when it's prefixed by prefix B. It can't be rewritten as
    |prefix A|prefix B(\d+|\w+)
    |to lock the alternate capture into the same group of parentheses. The second regex is not equivalent to the first one.

    What about (A\d+|B\w+)?

    |IMO the cardinal number of the capture should be assigned as the match is being evaluated, not as it is being compiled from left to right.

    Then you wouldn't know what was matched.
    --
    - Vince
     
  11. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,732
    Likes Received:
    81
    Stefano Piccardi wrote:

    I'll pass it on to the Oniguruma developers (though in every "bug"
    reported in regular expressions for the past couple of years Oniguruma
    has been correct).

    Rex Conn
    JP Software
     
  12. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,792
    Likes Received:
    29
    On Tue, 24 Mar 2009 21:52:54 -0500, rconn <> wrote:

    |Stefano Piccardi wrote:
    |
    |
    |---Quote---
    |> It seems that @REGEXSUB handles regex alternatives incorrectly.
    |>
    |> Code:
    |> ---------
    |> C:\> for %i in (a b c) echo %i in (a+)^|(b+): %@REGEXINDEX["(a+)|(b+)",x%i] (%@REGEXSUB[1,"(a+)|(b+)",x%i])
    |>
    |> a in (a+)|(b+): 1 (a)
    |> b in (a+)|(b+): 1 () <=== should match "b"
    |> c in (a+)|(b+): -1 ()
    |>
    |> C:\> ver
    |> TCC 9.02.152 Windows XP [Version 5.1.2600]
    |> ---------
    |> For comparison, @REGEXINDEX matches the second capture in the second line, (b+), while @REGEXSUB doesn't.
    |---End Quote---
    |I'll pass it on to the Oniguruma developers (though in every "bug"
    |reported in regular expressions for the past couple of years Oniguruma
    |has been correct).

    I don't think it's Onig (or usage). If I do

    while ( onig_search(regex, mstart, mend, mstart, mend, region, 0) >= 0 )

    with regex pointing to the (unquoted) "(a+)|(b+)", and mstart/mend delimiting
    the string "xxbxx", it finds a match (and knows where it is):

    v:\> echo %@xreplace["(a+)|(b+)",**\2**,xxbxx]
    xx**b**xx

    In Stefano's faulty case, @REGEXINDEX is finding the pattern while @REGEXSUB is
    not.
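
To illustrate the "finds a match and knows where it is" point, here is a sketch in Python's `re` rather than the Oniguruma C API. It reports the offset of the overall match (roughly what @REGEXINDEX appears to return in the examples above) together with the span of each group; a group that did not participate gets the span (-1, -1). `describe_match` is a hypothetical helper.

```python
import re

def describe_match(pattern, text):
    """Return (offset of the match, list of per-group spans), or None."""
    m = re.search(pattern, text)
    if m is None:
        return None
    # m.re.groups is the number of capture groups in the compiled pattern.
    spans = [m.span(i) for i in range(1, m.re.groups + 1)]
    return m.start(), spans

print(describe_match(r"(a+)|(b+)", "xxbxx"))
# (2, [(-1, -1), (2, 3)])
```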
    --
    - Vince
     
  13. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,792
    Likes Received:
    29
    On Tue, 24 Mar 2009 21:52:54 -0500, rconn <> wrote:

    |---Quote---
    |> It seems that @REGEXSUB handles regex alternatives incorrectly.
    |>
    |> Code:
    |> ---------
    |> C:\> for %i in (a b c) echo %i in (a+)^|(b+): %@REGEXINDEX["(a+)|(b+)",x%i] (%@REGEXSUB[1,"(a+)|(b+)",x%i])
    |>
    |> a in (a+)|(b+): 1 (a)
    |> b in (a+)|(b+): 1 () <=== should match "b"
    |> c in (a+)|(b+): -1 ()
    |>
    |> C:\> ver
    |> TCC 9.02.152 Windows XP [Version 5.1.2600]
    |> ---------
    |> For comparison, @REGEXINDEX matches the second capture in the second line, (b+), while @REGEXSUB doesn't.
    |---End Quote---
    |I'll pass it on to the Oniguruma developers (though in every "bug"
    |reported in regular expressions for the past couple of years Oniguruma
    |has been correct).

    Well, you know, it does (sort of) work, but not in a way that's very useful:

    v:\> echo %@REGEXSUB[1,"(a+)|(b+)",cbbc]
    ECHO is OFF

    v:\> echo %@REGEXSUB[2,"(a+)|(b+)",cbbc]
    bb

    The help says of @REGEXSUB that it "returns the nth matching group in the string". So
    the discrepancy is between two notions of "the nth matching group". Above,
    there actually was a **first** match but it matched the second parenthesized
    pattern; there certainly wasn't a 2nd match. IMO better behavior would be:

    v:\> echo %@REGEXSUB[1,"(a+)|(b+)",cbbc]
    bb [a first match]

    v:\> echo %@REGEXSUB[2,"(a+)|(b+)",cbbc]
    ECHO is OFF [no second match]
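
The two notions of "the nth matching group" can be put side by side. This is a sketch in Python's `re`, not the TCC implementation; both function names are invented, and returning an empty string for "no such group" is my assumption, chosen to mirror the empty results shown above.

```python
import re

def regexsub_by_number(n, pattern, text):
    """nth group as numbered in the pattern (the behaviour observed above)."""
    m = re.search(pattern, text)
    if m is None or not 0 <= n <= m.re.groups:
        return ""
    return m.group(n) or ""

def regexsub_by_position(n, pattern, text):
    """nth group among those that actually took part in the match
    (the behaviour proposed above)."""
    m = re.search(pattern, text)
    if m is None:
        return ""
    participating = [g for g in m.groups() if g is not None]
    return participating[n - 1] if 1 <= n <= len(participating) else ""

print(repr(regexsub_by_number(1, r"(a+)|(b+)", "cbbc")))    # ''
print(repr(regexsub_by_number(2, r"(a+)|(b+)", "cbbc")))    # 'bb'
print(repr(regexsub_by_position(1, r"(a+)|(b+)", "cbbc")))  # 'bb'
print(repr(regexsub_by_position(2, r"(a+)|(b+)", "cbbc")))  # ''
```

With the second variant the caller can no longer tell which parenthesized pattern produced the text, which is the trade-off raised earlier in the thread.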
    --
    - Vince
     
