Regex problem

Apr 13, 2010
190
2
57
The Hague
#1
I would expect the following regex to return "long". It doesn't, and I don't understand why. Of course, I've also tried every combination of quotes and parentheses I could think of.

Code:
echo [EMAIL]%@regexsub[1,([/-]{1}([A-Za-z]{1})|--([A-Za-z]+)),--long:3][/EMAIL]
Could someone explain to me what I'm doing wrong?

DJ

P.S. I also have no idea where these EMAIL tags came from.
 
Jan 19, 2011
559
7
Norman, OK
#2
Code:
echo %@regexsub[1,([/-]{1}([A-Za-z]{1})|--([A-Za-z]+)),--long:3]
                  (                                  )  first match group
                   ********************|*************   left or right
                                        ^^^^^^^^^^^^^   this matches
So it returns "--long" which is "two dashes followed by one or more upper or lower case letters".
 
#3
I get "--long" and I'm not too surprised.

"--long" matches "--([A-Za-z]+)". So it matches the disjunction of the two expressions and so it matches the expression inside the outer parentheses which is expression 1.

The expression ([A-Za-z]+) is expression (3). If you ask for a match to expression 3, you'll get "long".

Code:
v:\> echo %@regexsub[3,([/-]{1}([A-Za-z]{1})|--([A-Za-z]+)),--long:3]
long
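For anyone without TCC at hand, the group numbering can be checked with Python's re module. This is only a sketch: it assumes Python's engine numbers and matches the groups the same way TCC's does for this pattern, and Python reports an unmatched group as None where @REGEXSUB returns an empty string.

```python
import re

# The pattern from the original post; groups are numbered by their opening parenthesis.
PATTERN = r"([/-]{1}([A-Za-z]{1})|--([A-Za-z]+))"

match = re.search(PATTERN, "--long:3")

# Group 1 is the outer parentheses, group 2 the short-option letter,
# group 3 the long-option name.
for number in (1, 2, 3):
    print(number, repr(match.group(number)))
# 1 '--long'
# 2 None
# 3 'long'
```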
 
#4
Now that I think about it, it could have just as easily returned "-l" since that is "single slash or dash followed by a single upper or lower case letter".

Is there a reason it returned the second alternation instead of the first? Does it consider the longer string to be "more correct"?
 
#5
That is odd. I'd expect the first of the two disjuncts, as here
Code:
v:\> echo %@regexsub[1,((ab)|(cde)),abcde]
ab
Undefined behavior?
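One way to probe this is to look at where each alternative can first match. A rough check in Python's re module (assumed here to scan the text left to right the way TCC's engine does) prints the start and end offsets:

```python
import re

TEXT = "--long:3"

short_alt = re.search(r"[/-][A-Za-z]", TEXT)            # first alternative on its own
long_alt = re.search(r"--[A-Za-z]+", TEXT)              # second alternative on its own
both = re.search(r"([/-][A-Za-z]|--[A-Za-z]+)", TEXT)   # the two combined

print(short_alt.group(), short_alt.span())  # -l (1, 3)
print(long_alt.group(), long_alt.span())    # --long (0, 6)
print(both.group(1), both.span())           # --long (0, 6)
```

In this sketch the first alternative cannot match at offset 0, because the second character is a dash rather than a letter, so the earliest possible match belongs to the second alternative. In the abcde example, "ab" starts at offset 0 and "cde" only at offset 2.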
 
#6
This slightly simplified version returns long:

echo %@regexsub[2,([/-][A-Za-z])|--([A-Za-z]+),--long:3]

It does not use the nested parens, which I'm not getting the purpose of. But then I must not be understanding the problem fully. I don't get why the above returns the desired string only with group *2*. I assume that in TCC '|' means 'or', and because of the 'or' I have what amounts to only one (group) in the regular expression. Why would TCC figure this is two groups? Or is this just a convention of %@regexsub[] itself?

Regarding the original, meaning %@regexsub[1,([/-]{1}([A-Za-z]{1})|--([A-Za-z]+)),--long:3]

I assume that in TCC the '{1}' means what it would mean in Perl, namely exactly one of the preceding character or expression. Since in the absence of some other operator:

[/-] by itself means exactly one of either '/' or '-'

and

[A-Za-z] by itself means exactly one alphabetic character, upper- or lower-case...

... why include '{1}' in those two situations? As I recall the best-practice advice for scripting, at least in Perl, is: if you can avoid backtracking, do avoid it.

(When I tried the original "echo" command and omitted the two occurrences of '{1}' there was no change in the output.)
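The same comparison can be repeated outside TCC. The sketch below uses Python's re module as a stand-in (an assumption, since TCC's engine is a different one) and a few sample inputs that are made up for illustration, to check that dropping '{1}' leaves the captured groups unchanged:

```python
import re

WITH_QUANTIFIER = r"([/-]{1}([A-Za-z]{1})|--([A-Za-z]+))"
WITHOUT_QUANTIFIER = r"([/-]([A-Za-z])|--([A-Za-z]+))"

def captured(pattern, text):
    """Return the tuple of captured groups, or None when nothing matches."""
    match = re.search(pattern, text)
    return match.groups() if match else None

# Illustrative inputs; only the first one comes from the thread.
SAMPLES = ["--long:3", "-l", "/x", "plain"]

for sample in SAMPLES:
    same = captured(WITH_QUANTIFIER, sample) == captured(WITHOUT_QUANTIFIER, sample)
    print(sample, same)
```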
 
#7
Simply, there are two groups; a matching pair of () is a group. You asked for the second one. However, if, with all else the same, I ask for the first group (which I think should be "-l"), I get nothing.
Code:
v:\> echo %@regexsub[2,([/-][A-Za-z])|--([A-Za-z]+),--long:3]
long

v:\> echo %@regexsub[1,([/-][A-Za-z])|--([A-Za-z]+),--long:3]
ECHO is OFF
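The same pair of results shows up in Python's re module, which I am assuming behaves like TCC's engine here. Python prints None for a group that took no part in the match, where @REGEXSUB hands ECHO an empty string:

```python
import re

# Two separate groups joined by alternation, as in the simplified command above.
match = re.search(r"([/-][A-Za-z])|--([A-Za-z]+)", "--long:3")

print(repr(match.group(0)))  # '--long'  (the whole match)
print(repr(match.group(1)))  # None      (the first alternative was not used)
print(repr(match.group(2)))  # 'long'
```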
 
#8
Yes, in a single regular expression, the first expression in parens is group 1. The second is group 2. Etc. (With some complications if they're nested.) However, since we're talking EITHER/OR here (the "|" character), at least in the more simplified example I provided, that's only one group. Not two.

In any case, for this variable function's purpose, clearly that doesn't matter.
 
#9
Well, very simple examples show that something fishy is going on. I can't explain the difference here
Code:
v:\> echo %@regexsub[1,(1|123),1234]
1

v:\> echo %@regexsub[1,(2|123),1234]
123
or here
Code:
v:\> echo %@regexsub[1,(1)|(123),1234]
1

v:\> echo %@regexsub[1,(2)|(123),1234]
ECHO is OFF
 
#10
In the following example, it seems to return the match that it encounters first:

Code:
v:\> echo %@regexsub[1,(1|123),1234]
1
In the next example, I don't know why it doesn't do much the same thing, and return '2' since that's the first match it might encounter. But does this function assume that the expressions are at the start of the target string unless indicated otherwise?

Code:
v:\> echo %@regexsub[1,(2|123),1234]
123
Your next example:

Code:
v:\> echo %@regexsub[1,(1)|(123),1234]
1
If you reverse the order of the expressions:

Code:
v:\> echo %@regexsub[1,(123)|(1),1234]
... it returns '123'.

But:

Code:
v:\> echo %@regexsub[1,(234)|(1),1234]
In that case it displays "ECHO is OFF". Why not '234'? Are we in 'Only at the START of the target string' territory again?

As for this one:

Code:
v:\> echo %@regexsub[1,(2)|(123),1234]
ECHO is OFF
If you do this instead -- change the group you're asking it to match:

Code:
v:\> echo %@regexsub[2,(2)|(123),1234]
...the function returns '123'.

Two more for the road:

Code:
v:\> echo %@regexsub[1,(234)|(1),1234]
ECHO is OFF

v:\> echo %@regexsub[2,(234)|(1),1234]
1
I dunno. Is '|' the culprit here?
 
Jan 3, 2012
25
4
UK
#11
Everything is working as it should be. There are no culprits.

The recogniser uses these rules:

1. Capturing. Number each parenthesised subexpression by the position of its first open parenthesis symbol, counting from 1 from the left, ignoring any nesting.

2. Leftmost match. For the expression as a whole and for each subexpression, consider only leftmost matches.

3. Alternation. Of several alternatives that match, always choose either (a) the leftmost or (b) the longest. Perl 5's and TCC's recognisers both choose (a).
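As a rough illustration of rules 1 and 3, here is a sketch in Python, whose re module also chooses (a). The two "choice" variables are a hand-rolled model of the two policies for alternatives that match at the start of the text; they are not how any real recogniser is implemented.

```python
import re

TEXT = "1234"
ALTERNATIVES = ["1", "123"]

# Alternatives that can match at the very start of the text, in written order.
candidates = [alt for alt in ALTERNATIVES if TEXT.startswith(alt)]

leftmost_choice = candidates[0]            # policy (a): first alternative that matches
longest_choice = max(candidates, key=len)  # policy (b): longest alternative that matches

# What Python's engine actually returns for (1|123).
engine_choice = re.search("(" + "|".join(ALTERNATIVES) + ")", TEXT).group(1)

print(leftmost_choice, longest_choice, engine_choice)  # 1 123 1

# Rule 1: groups are numbered by the position of their opening parenthesis.
nested = re.search(r"((a)(b(c)))", "abc").groups()
print(nested)  # ('abc', 'a', 'bc', 'c')
```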

%@regexsub[1,(1|123),1234]

Number the whole parenthesised expression 1.
Consider only the leftmost matches, that is, both of the alternatives.
Of the matching alternatives, choose the leftmost: 1.
Return what subexpression 1 matched: 1.

%@regexsub[1,(2|123),1234]

Number the whole parenthesised expression 1.
Consider only the leftmost match, that is, the second alternative.
Choose the only matching alternative: 123.
Return what subexpression 1 matched: 123.

%@regexsub[1,(1)|(123),1234]

Number the first alternative 1, the second alternative 2.
Consider only the leftmost matches, that is, both of the alternatives.
Of the matching alternatives, choose the leftmost: (1).
Return what subexpression 1 matched: 1.

%@regexsub[1,(123)|(1),1234]

Number the first alternative 1, the second alternative 2.
Consider only the leftmost matches, that is, both of the alternatives.
Of the matching alternatives, choose the leftmost: (123).
Return what subexpression 1 matched: 123.

%@regexsub[1,(234)|(1),1234]

Number the first alternative 1, the second alternative 2.
Consider only the leftmost match, that is, the second alternative.
Choose the only matching alternative: (1).
Return what subexpression 1 matched: [nothing].

%@regexsub[1,(2)|(123),1234]

Number the first alternative 1, the second alternative 2.
Consider only the leftmost match, that is, the second alternative.
Choose the only matching alternative: (123).
Return what subexpression 1 matched: [nothing].

%@regexsub[2,(2)|(123),1234]

Number the first alternative 1, the second alternative 2.
Consider only the leftmost match, that is, the second alternative.
Choose the only matching alternative: (123).
Return what subexpression 2 matched: 123.

%@regexsub[1,(234)|(1),1234]

Number the first alternative 1, the second alternative 2.
Consider only the leftmost match, that is, the second alternative.
Choose the only matching alternative: (1).
Return what subexpression 1 matched: [nothing].

%@regexsub[2,(234)|(1),1234]

Number the first alternative 1, the second alternative 2.
Consider only the leftmost match, that is, the second alternative.
Choose the only matching alternative: (1).
Return what subexpression 2 matched: 1.
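The walk-throughs above can be replayed in one go. In this sketch, regexsub is a hypothetical Python stand-in for @REGEXSUB, not TCC's own implementation; it assumes Python's re module follows the same three rules and maps an unmatched group to an empty string.

```python
import re

def regexsub(n, pattern, text):
    """Hypothetical stand-in for @REGEXSUB: group n of the first match, or ''."""
    match = re.search(pattern, text)
    if match is None or match.group(n) is None:
        return ""
    return match.group(n)

# (group number, pattern, result reported in the thread); '' means "ECHO is OFF".
CASES = [
    (1, r"(1|123)", "1"),
    (1, r"(2|123)", "123"),
    (1, r"(1)|(123)", "1"),
    (1, r"(123)|(1)", "123"),
    (1, r"(234)|(1)", ""),
    (1, r"(2)|(123)", ""),
    (2, r"(2)|(123)", "123"),
    (2, r"(234)|(1)", "1"),
]

for n, pattern, expected in CASES:
    result = regexsub(n, pattern, "1234")
    print(n, pattern, repr(result), result == expected)
```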
 
#12
ben,

Outstanding explanation!

The leftmost match answers why "--long" was chosen over "-l".
 
#15
As I understand it, every parenthesised expression becomes a uniquely numbered group. So with @REGEXSUB and a regex whose alternatives are separate groups, a single call can never return whichever alternative happened to match, because @REGEXSUB's first parameter determines which group is returned (if it matched).

Thanks, everyone.
 
#16
I believe that is the case.

But if you want to extract the short or long option name without its prefix hyphen(s) or slash, try

echo %@regexsub[1,((?<=--)[A-Za-z]+|(?<=[/-])[A-Za-z]),%option]

Note the order of the alternation.
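To show why the order matters, here is a sketch in Python, assuming its re module treats these fixed-width lookbehinds the way TCC's engine does. The option_name helper and the "-l" and "/x" inputs are made up for illustration.

```python
import re

LONG_FIRST = r"((?<=--)[A-Za-z]+|(?<=[/-])[A-Za-z])"   # order used above
SHORT_FIRST = r"((?<=[/-])[A-Za-z]|(?<=--)[A-Za-z]+)"  # alternatives swapped

def option_name(pattern, option):
    """Return group 1 of the first match, or '' when nothing matches."""
    match = re.search(pattern, option)
    return match.group(1) if match else ""

for option in ("--long:3", "-l", "/x"):
    print(option, option_name(LONG_FIRST, option), option_name(SHORT_FIRST, option))
# --long:3 long l
# -l l l
# /x x x
```

With the short form first, the lookbehind for a single dash also succeeds just after "--", so only the first letter of the long name is captured.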
 
#18
You've solved it using a positive lookbehind!
That is what I call an excellent answer.
I'm grateful. Thank you.

DJ