Regex problem

Discussion in 'Support' started by djspits, Feb 6, 2014.

  1. djspits

    Joined:
    Apr 13, 2010
    Messages:
    189
    Likes Received:
    2
    I would expect the following regex to return "long". It doesn't, and I don't understand why not. Of course, I've also tried every combination of quotes and parentheses I could think of.

    Code:
    echo [EMAIL]%@regexsub[1,([/-]{1}([A-Za-z]{1})|--([A-Za-z]+)),--long:3][/EMAIL]
    Could someone explain to me what I'm doing wrong?

    DJ

    P.S. I also have no idea where these EMAIL tags came from.
     
  2. JohnQSmith

    Joined:
    Jan 19, 2011
    Messages:
    564
    Likes Received:
    8
    Code:
    echo %@regexsub[1,([/-]{1}([A-Za-z]{1})|--([A-Za-z]+)),--long:3]
                      (                                  )  first match group
                       ********************|*************   left or right
                                            ^^^^^^^^^^^^^   this matches
    
    So it returns "--long" which is "two dashes followed by one or more upper or lower case letters".
     
    #2 JohnQSmith, Feb 6, 2014
    Last edited: Feb 6, 2014
  3. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,883
    Likes Received:
    29
    I get "--long" and I'm not too surprised.

    "--long" matches "--([A-Za-z]+)". So it matches the disjunction of the two expressions and so it matches the expression inside the outer parentheses which is expression 1.

    The expression ([A-Za-z]+) is expression (3). If you ask for a match to expression 3, you'll get "long".

    Code:
    v:\> echo %@regexsub[3,([/-]{1}([A-Za-z]{1})|--([A-Za-z]+)),--long:3]
    long
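
    The group numbering can be cross-checked outside TCC. This is only a rough sketch: it assumes Python's re module numbers and fills groups the same way TCC's engine does for this pattern, which the thread does not confirm.

```python
import re

# Pattern and subject copied from the original post.
pattern = r"([/-]{1}([A-Za-z]{1})|--([A-Za-z]+))"
m = re.search(pattern, "--long:3")

group1 = m.group(1)  # the outer parentheses: the whole alternation
group2 = m.group(2)  # the letter in the short-option branch (branch not taken)
group3 = m.group(3)  # the letters in the long-option branch

print(group1, group2, group3)  # --long None long
```

    Group 2 comes back as None in Python because its branch took no part in the match.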
     
  4. JohnQSmith

    Joined:
    Jan 19, 2011
    Messages:
    564
    Likes Received:
    8
    Now that I think about it, it could have just as easily returned "-l" since that is "single slash or dash followed by a single upper or lower case letter".

    Is there a reason it returned the second alternation instead of the first? Does it consider the longer string to be "more correct"?
     
  5. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,883
    Likes Received:
    29
    That is odd. I'd expect the first of the two disjuncts, as here
    Code:
    v:\> echo %@regexsub[1,((ab)|(cde)),abcde]
    ab
    Undefined behavior?
     
  6. mikea

    Joined:
    Dec 7, 2009
    Messages:
    210
    Likes Received:
    2
    This slightly simplified version returns long:

    echo %@regexsub[2,([/-][A-Za-z])|--([A-Za-z]+),--long:3]

    It does not use the nested parens, which I'm not getting the purpose of. But then I must not be understanding the problem fully. I don't get why the above returns the desired string only with group *2*. I assume that in TCC '|' means 'or', and because of the 'or' I have what amounts to only one (group) in the regular expression. Why would TCC figure this is two groups? Or is this just a convention of %@regexsub[] itself?

    Regarding the original, meaning %@regexsub[1,([/-]{1}([A-Za-z]{1})|--([A-Za-z]+)),--long:3]

    I assume that in TCC the '{1}' means what it would mean in Perl, namely exactly one of the preceding character or expression. Since in the absence of some other operator:

    [/-] by itself means exactly one of either '/' or '-'

    and

    [A-Za-z] by itself means exactly one alphabetic character, upper- or lower-case...

    ... why include '{1}' in those two situations? As I recall the best-practice advice for scripting, at least in Perl, is: if you can avoid backtracking, do avoid it.

    (When I tried the original "echo" command and omitted the two occurrences of '{1}' there was no change in the output.)
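
    The redundancy of '{1}' can also be checked mechanically. A minimal sketch, assuming Python's re module is an acceptable stand-in for TCC's engine here; the helper name all_groups and the sample strings other than "--long:3" are made up for illustration.

```python
import re

with_quantifier = r"([/-]{1}([A-Za-z]{1})|--([A-Za-z]+))"
without_quantifier = r"([/-]([A-Za-z])|--([A-Za-z]+))"

# Hypothetical sample inputs; only "--long:3" comes from the thread.
samples = ["--long:3", "-l", "/x", "plain"]

def all_groups(pattern, text):
    """Return (whole match, group 1, group 2, group 3), or None if no match."""
    m = re.search(pattern, text)
    return None if m is None else (m.group(0),) + m.groups()

# True when both spellings agree on every sample.
results_match = all(
    all_groups(with_quantifier, s) == all_groups(without_quantifier, s)
    for s in samples
)
print(results_match)  # True
```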
     
  7. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,883
    Likes Received:
    29
    Simply, there are two groups; a matching pair of () is a group. You asked for the second one. However, if, with all else the same, I ask for the first group (which I think should be "-l"), I get nothing.
    Code:
    v:\> echo %@regexsub[2,([/-][A-Za-z])|--([A-Za-z]+),--long:3]
    long
    
    v:\> echo %@regexsub[1,([/-][A-Za-z])|--([A-Za-z]+),--long:3]
    ECHO is OFF
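
    One way to see why group 1 is empty here: "-l" is a possible match, but it starts at index 1, and a match starting at index 0 already exists. The sketch below assumes Python's re module scans the same way as TCC's engine; starting the scan at index 1 is purely a hypothetical experiment, not something @regexsub offers.

```python
import re

pattern = re.compile(r"([/-][A-Za-z])|--([A-Za-z]+)")
text = "--long:3"

# Normal search: the first position where anything matches is index 0,
# and only the second alternative can match there.
m = pattern.search(text)
from_start = (m.group(0), m.group(1), m.group(2))

# Hypothetical experiment: begin scanning at index 1 instead.
m1 = pattern.search(text, 1)
from_index_1 = (m1.group(0), m1.group(1), m1.group(2))

print(from_start)    # ('--long', None, 'long')
print(from_index_1)  # ('-l', '-l', None)
```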
     
  8. mikea

    Joined:
    Dec 7, 2009
    Messages:
    210
    Likes Received:
    2
    Yes, in a single regular expression, the first expression in parens is group 1. The second is group 2. Etc. (With some complications if they're nested.) However, since we're talking EITHER/OR here (the "|" character), at least in the simplified example I provided, I'd have expected that to count as only one group. Not two.

    In any case, for this variable function's purpose, clearly that doesn't matter.
     
  9. vefatica

    Joined:
    May 20, 2008
    Messages:
    7,883
    Likes Received:
    29
    Well, very simple examples show that something fishy is going on. I can't explain the difference here
    Code:
    v:\> echo %@regexsub[1,(1|123),1234]
    1
    
    v:\> echo %@regexsub[1,(2|123),1234]
    123
    or here
    Code:
    v:\> echo %@regexsub[1,(1)|(123),1234]
    1
    
    v:\> echo %@regexsub[1,(2)|(123),1234]
    ECHO is OFF
     
  10. mikea

    Joined:
    Dec 7, 2009
    Messages:
    210
    Likes Received:
    2
    In the following example, it seems to return the match that it encounters first:

    Code:
    v:\> echo %@regexsub[1,(1|123),1234]
    1
    In the next example, I don't know why it doesn't do much the same thing, and return '2' since that's the first match it might encounter. But does this function assume that the expressions are at the start of the target string unless indicated otherwise?

    Code:
    v:\> echo %@regexsub[1,(2|123),1234]
    123
    Your next example:

    Code:
    v:\> echo %@regexsub[1,(1)|(123),1234]
    1
    If you reverse the order of the expressions:

    Code:
    v:\> echo %@regexsub[1,(123)|(1),1234]
    ... it returns '123'.

    But:

    Code:
    v:\> echo %@regexsub[1,(234)|(1),1234]
    In that case it displays "ECHO is OFF". Why not '234'? Are we in 'Only at the START of the target string' territory again?

    As for this one:

    Code:
    v:\> echo %@regexsub[1,(2)|(123),1234]
    ECHO is OFF
    If you do this instead -- change the group you're asking it to match:

    Code:
    v:\> echo %@regexsub[2,(2)|(123),1234]
    ...the function returns '123'.

    Two more for the road:

    Code:
    v:\> echo %@regexsub[1,(234)|(1),1234]
    ECHO is OFF
    
    v:\> echo %@regexsub[2,(234)|(1),1234]
    1
    I dunno. Is '|' the culprit here?
     
  11. ben

    Joined:
    Jan 3, 2012
    Messages:
    25
    Likes Received:
    4
    Everything is working as it should be. There are no culprits.

    The recogniser uses these rules:

    1. Capturing. Number each parenthesised subexpression by the position of its first open parenthesis symbol, counting from 1 from the left, ignoring any nesting.

    2. Leftmost match. For the expression as a whole and for each subexpression, consider only leftmost matches.

    3. Alternation. Of several alternatives that match, always choose either (a) the leftmost or (b) the longest. Perl 5's and TCC's recognisers both choose (a).

    %@regexsub[1,(1|123),1234]

    Number the whole parenthesised expression 1.
    Consider only the leftmost matches, that is, both of the alternatives.
    Of the matching alternatives, choose the leftmost: 1.
    Return what subexpression 1 matched: 1.

    %@regexsub[1,(2|123),1234]

    Number the whole parenthesised expression 1.
    Consider only the leftmost match, that is, the second alternative.
    Choose the only matching alternative: 123.
    Return what subexpression 1 matched: 123.

    %@regexsub[1,(1)|(123),1234]

    Number the first alternative 1, the second alternative 2.
    Consider only the leftmost matches, that is, both of the alternatives.
    Of the matching alternatives, choose the leftmost: (1).
    Return what subexpression 1 matched: 1.

    %@regexsub[1,(123)|(1),1234]

    Number the first alternative 1, the second alternative 2.
    Consider only the leftmost matches, that is, both of the alternatives.
    Of the matching alternatives, choose the leftmost: (123).
    Return what subexpression 1 matched: 123.

    %@regexsub[1,(234)|(1),1234]

    Number the first alternative 1, the second alternative 2.
    Consider only the leftmost match, that is, the second alternative.
    Choose the only matching alternative: (1).
    Return what subexpression 1 matched: [nothing].

    %@regexsub[1,(2)|(123),1234]

    Number the first alternative 1, the second alternative 2.
    Consider only the leftmost match, that is, the second alternative.
    Choose the only matching alternative: (123).
    Return what subexpression 1 matched: [nothing].

    %@regexsub[2,(2)|(123),1234]

    Number the first alternative 1, the second alternative 2.
    Consider only the leftmost match, that is, the second alternative.
    Choose the only matching alternative: (123).
    Return what subexpression 2 matched: 123.

    %@regexsub[1,(234)|(1),1234]

    Number the first alternative 1, the second alternative 2.
    Consider only the leftmost match, that is, the second alternative.
    Choose the only matching alternative: (1).
    Return what subexpression 1 matched: [nothing].

    %@regexsub[2,(234)|(1),1234]

    Number the first alternative 1, the second alternative 2.
    Consider only the leftmost match, that is, the second alternative.
    Choose the only matching alternative: (1).
    Return what subexpression 2 matched: 1.
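
    The walkthroughs above can be replayed in Python as a sanity check. This is a sketch under an assumption: that Python's re module, like Perl 5, picks the leftmost match and the leftmost matching alternative. The function regexsub below is a hypothetical stand-in written for this example, not TCC's implementation.

```python
import re

def regexsub(n, pattern, text):
    """Hypothetical stand-in for %@regexsub: text of group n from the
    first match, or an empty string if that group did not take part."""
    m = re.search(pattern, text)
    if m is None or m.group(n) is None:
        return ""
    return m.group(n)

# (group number, pattern, result reported in the thread)
cases = [
    (1, r"(1|123)", "1"),
    (1, r"(2|123)", "123"),
    (1, r"(1)|(123)", "1"),
    (1, r"(123)|(1)", "123"),
    (1, r"(234)|(1)", ""),
    (1, r"(2)|(123)", ""),
    (2, r"(2)|(123)", "123"),
    (2, r"(234)|(1)", "1"),
]

results = [regexsub(n, p, "1234") for n, p, _ in cases]
for (n, p, expected), got in zip(cases, results):
    print(f"group {n} of {p!r}: got {got!r}, thread reports {expected!r}")
```

    An empty string here corresponds to the "ECHO is OFF" lines in the TCC sessions, where echo received nothing to print.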
     
  12. JohnQSmith

    Joined:
    Jan 19, 2011
    Messages:
    564
    Likes Received:
    8
    ben,

    Outstanding explanation!

    The leftmost match answers why "--long" was chosen over "-l".
     
  13. mikea

    Joined:
    Dec 7, 2009
    Messages:
    210
    Likes Received:
    2
    Yes, an excellent explanation. Thanks for taking the time to write it out.
     
  14. ben

    Joined:
    Jan 3, 2012
    Messages:
    25
    Likes Received:
    4
    You're welcome. I'm glad if it's useful.
     
  15. djspits

    Joined:
    Apr 13, 2010
    Messages:
    189
    Likes Received:
    2
    The way I understand this is that any parenthesised expression becomes a uniquely numbered group. That means that with @REGEXSUB and a regex containing an alternation, you can never catch either one or the other of the alternatives, because @REGEXSUB's first parameter determines which one is returned (if it matches).

    Thanks, everyone.
     
  16. ben

    Joined:
    Jan 3, 2012
    Messages:
    25
    Likes Received:
    4
    I believe that is the case.

    But if you want to extract the short or long option name without its prefix hyphen(s) or slash, try

    echo %@regexsub[1,((?<=--)[A-Za-z]+|(?<=[/-])[A-Za-z]),%option]

    Note the order of the alternation.
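
    Why the order matters can be seen by swapping the alternatives. A minimal sketch, assuming Python's re module handles these fixed-width lookbehinds the same way as TCC's engine; the names option_name and swapped, and the inputs "-l" and "/x", are made up for this example.

```python
import re

# Long form first, as in the post above.
pattern = re.compile(r"((?<=--)[A-Za-z]+|(?<=[/-])[A-Za-z])")
# Same alternatives in the opposite order, for comparison only.
swapped = re.compile(r"((?<=[/-])[A-Za-z]|(?<=--)[A-Za-z]+)")

def option_name(option, regex=pattern):
    """Return the option name without its prefix, or "" if none is found."""
    m = regex.search(option)
    return m.group(1) if m else ""

names = [option_name(o) for o in ("--long:3", "-l", "/x")]
print(names)                             # ['long', 'l', 'x']
print(option_name("--long:3", swapped))  # l
```

    With the short form first, the single-letter alternative wins at the position right after "--", so only the first letter of the long option is captured.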
     
  17. ben

    Joined:
    Jan 3, 2012
    Messages:
    25
    Likes Received:
    4
    Or

    echo %@rereplace[(?:[/-]([A-Za-z])|--([A-Za-z]+)).*,\1\2,%option]
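
    The same replace-both-groups idea, sketched in Python. It relies on the group that took no part in the match expanding to nothing, which Python's re.sub does from version 3.5 onward; that TCC's @rereplace behaves the same way is implied by the post but is an assumption here. The function name strip_prefix is hypothetical.

```python
import re

# Whichever branch matches, the other group is empty, so \1\2 yields the name.
pattern = r"(?:[/-]([A-Za-z])|--([A-Za-z]+)).*"

def strip_prefix(option):
    """Replace the whole option with just its name (Python 3.5+)."""
    return re.sub(pattern, r"\1\2", option)

stripped = [strip_prefix(o) for o in ("--long:3", "-l", "/x")]
print(stripped)  # ['long', 'l', 'x']
```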
     
  18. djspits

    Joined:
    Apr 13, 2010
    Messages:
    189
    Likes Received:
    2
    You've solved it using a positive lookbehind!
    That is what I call an excellent answer.
    I'm grateful. Thank you.

    DJ
     
