Regex question

#1
I discovered that the Oniguruma library that TCC uses allows for up to 32 captures, which might later be used in substitutions (as @XREPLACE does). GNU sed, for example, allows only the back-references \0 to \9.

As it stands (I think), @XREPLACE allows \0 to \31, but this leaves the problem of how to interpret, say, \10 in a replacement string: should it insert capture number 10, or capture number 1 followed by a 0? Currently, @XREPLACE substitutes capture number 10.

I am tempted to allow only \0 to \9 (as @XREPLACE's documentation already says), which would avoid the ambiguity mentioned above and be more like sed.

Any thoughts?
 
#2
In Perl, s//$10/ substitutes capture 10. If you want capture 1 followed by a
'0', use s//${1}0/. The ${} syntax is used to disambiguate when what follows
the variable name would otherwise be read as part of it.

D:\>perl -e "$v.=$_ for (a..z); $v =~ /(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)/; print qq($1 $10 ${1}0)"

a j a0
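
For comparison, Python's re module resolves the same ambiguity in a similar way: \10 in a replacement template is read as group 10, and \g<1>0 means group 1 followed by a literal 0. A small sketch (the test string and the eleven single-character groups simply mirror the Perl one-liner; nothing here is TCC or @XREPLACE code):

```python
import re

# Eleven single-character capture groups, as in the Perl one-liner.
pattern = re.compile(r"(.)" * 11)
m = pattern.match("abcdefghijklmnopqrstuvwxyz")

# \10 is read as group 10; \g<1>0 is group 1 followed by a literal "0".
result = m.expand(r"\1 \10 \g<1>0")
print(result)  # a j a0
```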


On Mon, Jun 14, 2010 at 12:01 PM, vefatica <> wrote:


> As it stands (I think) @XREPLACE allows \0 to \31 but this leaves the
> problem of how to interpret, say, \10 in a replacement string ... should it
> insert capture number 10 or capture number 1 followed by a 0? As it stands,
> @XREPLACE substitutes capture number 10.


--
Jim Cook
2010 Sundays: 4/4, 6/6, 8/8, 10/10, 12/12 and 5/9, 9/5, 7/11, 11/7.
Next year they're Monday.
 
#3
On Mon, 14 Jun 2010 15:27:38 -0400, Jim Cook <> wrote:

|In perl, s//$10/ replaces parameter 10. If you want parameter 1, and a '0',
|use s//${1}0/. The ${} syntax is used to disambiguate when what follows the
|variable name would otherwise be misinterpreted.
|
|D:\>perl -e "$v.=$_ for (a..z); $v =~ /(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)/;
|print qq($1 $10 ${1}0)"
|
|a j a0

Thanks. Is there any possibility for ambiguity?

Does this method of processing a '\' sound right?

1. if the next char is '\', it's a literal '\' (eat two characters)

2. else if the next char is a digit, get the number following the '\' (may be
more than one digit) and use the corresponding back-reference

3. else if the next char is '{', read the number that follows (up to the '}'?)
and use the corresponding backref (what if no number follows '\{' or no '}'
appears after the number ... bad syntax?)

4. else treat it as a literal '\'
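
The four rules above can be sketched in Python. This is an illustration only, not the plugin's code: expand_backrefs and groups are made-up names, and raising an error for a malformed \{...} is just one possible answer to the open question in rule 3.

```python
DIGITS = set("0123456789")

def expand_backrefs(template, groups):
    """Expand back-references in template; groups[0] is the whole match.

    Hypothetical helper. An out-of-range group number raises IndexError.
    """
    out = []
    i = 0
    n = len(template)
    while i < n:
        ch = template[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        nxt = template[i + 1] if i + 1 < n else ""
        if nxt == "\\":
            # Rule 1: "\\" is a literal backslash (eat two characters).
            out.append("\\")
            i += 2
        elif nxt in DIGITS:
            # Rule 2: read every following digit, so \10 means group 10.
            j = i + 1
            while j < n and template[j] in DIGITS:
                j += 1
            out.append(groups[int(template[i + 1:j])])
            i = j
        elif nxt == "{":
            # Rule 3: \{number}; a missing number or missing '}' is an error
            # here (an assumption, the thread leaves this open).
            close = template.find("}", i + 2)
            number = template[i + 2:close] if close != -1 else ""
            if not number or any(c not in DIGITS for c in number):
                raise ValueError("bad \\{number} syntax at offset %d" % i)
            out.append(groups[int(number)])
            i = close + 1
        else:
            # Rule 4: any other backslash stays a literal backslash.
            out.append("\\")
            i += 1
    return "".join(out)

# Eleven captures a..k, preceded by the whole match at index 0.
groups = ["abcdefghijk"] + list("abcdefghijk")
print(expand_backrefs(r"\1 \10 \{1}0", groups))  # a j a0
```

With greedy digit reading in rule 2, \10 always means capture 10, and \{1}0 is the only way to get capture 1 followed by a 0.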
--
- Vince
 
#4
Perl throws a syntax error on unpaired {}. Where variables are allowed, it is
perfectly happy to substitute one, e.g.: $plural = "${singular}s". An
undefined variable such as "${undefined}" behaves like %undefined% and just
expands to nothing (ok, undef, but that's nitpicking).

\1 and \2 are backreferences. However, \01 and \001 are the character with
code 0x01 and not a backreference. The \oct syntax consumes at most three
octal digits, stopping at the first non-octal character or after three
digits. \{oct} is not supported. Certain other escaped characters look like
C, e.g.: \n \r \t \a
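
A minimal Python sketch of the "at most three octal digits" rule described above. It assumes the caller has already consumed the backslash and decided that an octal escape follows (for example because the first digit is 0); read_octal_escape is a hypothetical helper name.

```python
OCTAL_DIGITS = set("01234567")

def read_octal_escape(text, start):
    """Read up to three octal digits beginning at text[start].

    Returns (character, index just past the digits consumed).
    """
    end = start
    while end < len(text) and end - start < 3 and text[end] in OCTAL_DIGITS:
        end += 1
    if end == start:
        raise ValueError("no octal digits at offset %d" % start)
    return chr(int(text[start:end], 8)), end

# "018": the 8 is not an octal digit, so only "01" is consumed.
print(read_octal_escape("018", 0) == ("\x01", 2))  # True
```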

My temptation would be to ignore the \oct and \n things in XREPLACE, but I
wanted to make you aware of them if you weren't already.

Your rule 4 (else treat it as a literal '\') means that "\\" becomes "\" and
"\1" becomes backreference 1, but "\q" becomes "\q" which seems
counterintuitive. I'd make "\q" become "q". In other words, the \ is
consumed in all cases and affects what comes just after it.

On Mon, Jun 14, 2010 at 1:43 PM, vefatica <> wrote:


> Thanks. Is there any possibility for ambiguity?
>
> Does this method of processing a '\' sound right?


--
Jim Cook
 
#5
On Mon, 14 Jun 2010 18:11:06 -0400, Jim Cook <> wrote:

|Specifying \1, \2 are backreferences. However, \01 and \001 are the binary
|code 0x01 and not a backreference. The \oct syntax consumes at most three
|octal digits; stopping on non-digit or three count. \{oct} is not supported.
|Certain other escaped characters look like C, e.g.: \n \r \t \a

I think I'll allow \n (and insert a CRLF) and \t. What's \a?

|My temptation would be to ignore the \oct and \n things in XREPLACE, but I
|wanted to make you aware of them if you weren't already.
|
|Your rule 4 (else treat it as a literal '\') means that "\\" becomes "\" and
|"\1" becomes backreference 1, but "\q" becomes "\q" which seems
|counterintuitive. I'd make "\q" become "q". In other words, the \ is
|consumed in all cases and affects what comes just after it.

So if it's not \\, \n, \t, \number, or \{number}, I'll just drop the '\' and
take the next char literally.

Sound good?
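
A rough Python sketch of that final rule set, for illustration only. expand_replacement and groups are hypothetical names. Treating a malformed \{...} like any other unknown escape (drop the backslash) and dropping a trailing lone backslash are assumptions; neither case is settled in the thread.

```python
import re

# After a backslash, try in order: \number, \{number}, any single
# character, or end of string (a trailing lone backslash).
TOKEN = re.compile(r"\\(?:([0-9]+)|\{([0-9]+)\}|(.)|$)", re.DOTALL)

# Escapes with a fixed expansion; \n inserts CRLF as proposed above.
SIMPLE = {"\\": "\\", "n": "\r\n", "t": "\t"}

def expand_replacement(template, groups):
    """Expand a replacement template; groups[0] is the whole match."""
    def repl(m):
        number = m.group(1) or m.group(2)
        if number is not None:
            return groups[int(number)]  # IndexError if out of range
        ch = m.group(3)
        if ch is None:
            return ""  # trailing backslash: nothing left to take
        # Known escape, or else drop the backslash and keep the character.
        return SIMPLE.get(ch, ch)
    return TOKEN.sub(repl, template)

# Eleven captures a..k, preceded by the whole match at index 0.
groups = ["abcdefghijk"] + list("abcdefghijk")
print(expand_replacement(r"\1 \10 \{1}0 \q", groups))  # a j a0 q
```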
--
- Vince
 
#6
\a is alarm (0x07). I regularly code \b first, then remember that isn't bell
but backspace :)

I believe perl conforms to the C standard, which defines these:

\a (alert) Produces an audible or visible alert without changing the active position.
\b (backspace) Moves the active position to the previous position on the current line. If the active position is at the initial position of a line, the behavior of the display device is unspecified.
\f (form feed) Moves the active position to the initial position at the start of the next logical page.
\n (new line) Moves the active position to the initial position of the next line.
\r (carriage return) Moves the active position to the initial position of the current line.
\t (horizontal tab) Moves the active position to the next horizontal tabulation position on the current line. If the active position is at or past the last defined horizontal tabulation position, the behavior of the display device is unspecified.
\v (vertical tab) Moves the active position to the initial position of the next vertical tabulation position. If the active position is at or past the last defined vertical tabulation position, the behavior of the display device is unspecified.


On Mon, Jun 14, 2010 at 5:11 PM, vefatica <> wrote:


> I think I'll allow \n (and insert a CRLF) and \t. What's \a?


--
Jim Cook