Regular Expression Syntax
Oniguruma Regular Expressions Version 5.9.1 2007/09/05
This section covers the Ruby regular expression syntax. For information on Perl regular expression syntax, see your Perl documentation or http://www.perl.com/doc/manual/html/pod/perlre.html.
1. Syntax elements
| \ | escape (enable or disable meta character meaning) |
| | | alternation |
| (...) | group |
| [...] | character class |
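The four syntax elements above can be tried together in a few lines of Ruby. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```ruby
# Escape: the backslash removes the special meaning of '+'.
plus = /1\+1/.match?("1+1")

# Alternation inside a (non-captured) group: "gray" or "grey".
spelling = "grey and gray".scan(/gr(?:a|e)y/)

# Character class: any single vowel.
vowels = "regexp".scan(/[aeiou]/)

p plus      # true
p spelling  # ["grey", "gray"]
p vowels    # ["e", "e"]
```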
2. Characters
| \t | horizontal tab (0x09) |
| \v | vertical tab (0x0B) |
| \n | newline (0x0A) |
| \r | return (0x0D) |
| \b | back space (0x08) |
| \f | form feed (0x0C) |
| \a | bell (0x07) |
| \e | escape (0x1B) |
| \nnn | octal char (encoded byte value) |
| \xHH | hexadecimal char (encoded byte value) |
| \x{7HHHHHHH} | wide hexadecimal char (character code point value) |
| \cx | control char (character code point value) |
| \C-x | control char (character code point value) |
| \M-x | meta (x|0x80) (character code point value) |
| \M-\C-x | meta control char (character code point value) |
(* \b means backspace only inside a character class [...]; elsewhere it is a word boundary.)
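A short Ruby sketch of a few of these escapes, including the two meanings of \b. The test strings are illustrative only.

```ruby
tab_hit = /\t/.match?("a\tb")      # horizontal tab (0x09)
hex_hit = /\x41/.match?("A")       # \xHH: 0x41 is "A"
esc_hit = /\e/.match?("\x1B")      # escape (0x1B)

backspace   = "\b"                       # Ruby string escape for 0x08
in_class    = /[\b]/.match?(backspace)   # inside [...]: backspace
as_boundary = /\b/.match?(backspace)     # outside [...]: word boundary; no word char here
word_edge   = ("a cat" =~ /\bcat\b/)     # index where the whole word "cat" starts

p [tab_hit, hex_hit, esc_hit]   # [true, true, true]
p [in_class, as_boundary]       # [true, false]
p word_edge                     # 2
```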
3. Character types
| . | any character (except newline) |
| \w | word character |
Not Unicode:
alphanumeric, "_" and multibyte char.
Unicode:
General_Category -- (Letter|Mark|Number|Connector_Punctuation)
| \W | non word char |
| \s | whitespace char |
Not Unicode:
\t, \n, \v, \f, \r, \x20
Unicode:
0009, 000A, 000B, 000C, 000D, 0085(NEL),
General_Category -- Line_Separator
-- Paragraph_Separator
-- Space_Separator
| \S | non whitespace char |
| \d | decimal digit char |
Unicode: General_Category -- Decimal_Number
| \D | non decimal digit char |
| \h | hexadecimal digit char [0-9a-fA-F] |
| \H | non hexadecimal digit char |
Character Property
* \p{property-name}
* \p{^property-name} (negative)
* \P{property-name} (negative)
property-name:
+ works on all encodings
Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower, Print, Punct, Space, Upper, XDigit, Word, ASCII
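The character types and the three property forms can be compared on ASCII input as follows. This is a sketch only: it deliberately avoids non-ASCII text, where the results depend on the encoding rules listed above.

```ruby
hex_runs  = "0x1F".scan(/\h+/)           # runs of hexadecimal digits
digits    = "room 101"[/\d+/]            # first run of decimal digits
letters   = "ab1".scan(/\p{Alpha}/)      # property match
negated_a = "ab1".scan(/\p{^Alpha}/)     # negative form with ^
negated_b = "ab1".scan(/\P{Alpha}/)      # negative form with upper-case P

p hex_runs    # ["0", "1F"]
p digits      # "101"
p letters     # ["a", "b"]
p negated_a   # ["1"]
p negated_b   # ["1"]
```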
4. Quantifier
greedy
? 1 or 0 times
* 0 or more times
+ 1 or more times
{n,m} at least n but not more than m times
{n,} at least n times
{,n} at least 0 but not more than n times ({0,n})
{n} n times
reluctant
?? 1 or 0 times
*? 0 or more times
+? 1 or more times
{n,m}? at least n but not more than m times
{n,}? at least n times
{,n}? at least 0 but not more than n times (== {0,n}?)
possessive (greedy and does not backtrack after repeated)
?+ 1 or 0 times
*+ 0 or more times
++ 1 or more times
({n,m}+, {n,}+, {n}+ are possessive operators in ONIG_SYNTAX_JAVA only)
ex. /a*+/ === /(?>a*)/
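The difference between the three quantifier families shows up on small inputs. A minimal Ruby sketch, with sample strings chosen for illustration:

```ruby
text = "<a><b>"

greedy    = text[/<.+>/]      # takes as much as possible
reluctant = text[/<.+?>/]     # takes as little as possible
bounded   = "aaaa"[/a{,2}/]   # {,n} is the same as {0,n}

# Possessive a*+ consumes every "a" and never gives one back,
# so the trailing "a" in the pattern cannot match.
possessive = /a*+a/.match?("aaa")
atomic     = /(?>a*)a/.match?("aaa")   # equivalent atomic-group form
plain      = /a*a/.match?("aaa")       # greedy a* backtracks by one "a"

p greedy      # "<a><b>"
p reluctant   # "<a>"
p bounded     # "aa"
p [possessive, atomic, plain]   # [false, false, true]
```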
5. Anchors
| ^ | beginning of the line |
| $ | end of the line |
| \b | word boundary |
| \B | not word boundary |
| \A | beginning of string |
| \Z | end of string, or before newline at the end |
| \z | end of string |
| \G | matching start position (*) |
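The line anchors and string anchors differ only when the subject contains newlines. A small Ruby sketch (the sample text is illustrative; \G is not exercised here):

```ruby
text = "foo\nbar\n"

line_match   = (text =~ /^bar$/)    # ^ and $ work per line
string_start = (text =~ /\Abar/)    # \A only matches at the very start
before_nl    = (text =~ /bar\Z/)    # \Z allows one trailing newline
absolute_end = (text =~ /bar\z/)    # \z does not

p line_match     # 4
p string_start   # nil
p before_nl      # 4
p absolute_end   # nil
```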
6. Character class
| ^... | negative class (lowest precedence operator) |
| x-y | range from x to y |
| [...] | set (character class in character class) |
| ..&&.. | intersection (low precedence, second only to ^) |
ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]
* To use '[', '-', or ']' as a normal character in a character class, escape it with '\'.
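The intersection example above can be checked by scanning the lower-case alphabet. A minimal Ruby sketch; the second and third patterns are extra illustrations, not taken from the text.

```ruby
alphabet = ("a".."z").to_a.join

from_doc   = alphabet.scan(/[a-w&&[^c-g]z]/).join   # the example above
equivalent = alphabet.scan(/[abh-w]/).join          # its stated equivalent

consonants = "hello".scan(/[a-z&&[^aeiou]]/)        # letters that are not vowels
escaped    = "a-b[c]".scan(/[\[\-\]]/)              # '[', '-', ']' as literals

p from_doc == equivalent   # true
p consonants               # ["h", "l", "l"]
p escaped                  # ["-", "[", "]"]
```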
POSIX bracket ([:xxxxx:], negate [:^xxxxx:])
Not Unicode Case:
alnum alphabet or digit char
alpha alphabet
ascii code value: [0 - 127]
blank \t, \x20
cntrl
digit 0-9
graph include all of multibyte encoded characters
lower
print include all of multibyte encoded characters
punct
space \t, \n, \v, \f, \r, \x20
upper
word alphanumeric, "_" and multibyte characters
xdigit 0-9, a-f, A-F
Unicode Case:
alnum Letter | Mark | Decimal_Number
alpha Letter | Mark
ascii 0000 - 007F
blank Space_Separator | 0009
cntrl Control | Format | Unassigned | Private_Use | Surrogate
digit Decimal_Number
graph [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
lower Lowercase_Letter
print [[:graph:]] | [[:space:]]
punct Connector_Punctuation | Dash_Punctuation | Close_Punctuation | Final_Punctuation | Initial_Punctuation | Other_Punctuation | Open_Punctuation
space Space_Separator | Line_Separator | Paragraph_Separator | 0009 | 000A | 000B | 000C | 000D | 0085
upper Uppercase_Letter
word Letter | Mark | Decimal_Number | Connector_Punctuation
xdigit 0030 - 0039 | 0041 - 0046 | 0061 - 0066 (0-9, a-f, A-F)
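POSIX brackets are written inside a character class, so the brackets appear doubled. A short Ruby sketch on ASCII-only input (the sample string is illustrative):

```ruby
sample = "ab 12\t_"

digits     = sample.scan(/[[:digit:]]+/)          # runs of 0-9
non_digits = sample.scan(/[[:^digit:]]/).length   # negated bracket
blanks     = sample.scan(/[[:blank:]]/)           # space and tab
hex_runs   = "0x1F".scan(/[[:xdigit:]]+/)         # 0-9, a-f, A-F

p digits       # ["12"]
p non_digits   # 5
p blanks       # [" ", "\t"]
p hex_runs     # ["0", "1F"]
```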
7. Extended groups
| (?#...) | comment |
| (?imx-imx) | option on/off |
i: ignore case
m: multi-line (dot (.) matches newline)
x: extended form
| (?imx-imx:subexp) | option on/off for subexp |
| (?:subexp) | not captured group |
| (subexp) | captured group |
| (?=subexp) | look-ahead |
| (?!subexp) | negative look-ahead |
| (?<=subexp) | look-behind |
| (?<!subexp) | negative look-behind |
Subexp of look-behind must have a fixed character length. Alternatives of different lengths are allowed at the top level only.
ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
In a negative look-behind, a captured group isn't allowed, but a shy group (?:) is allowed.
| (?>subexp) | atomic group |
no backtracking into subexp once it has matched.
| (?<name>subexp) | define named group |
(All characters of the name must be word characters, and the first character must not be a digit or an upper-case letter.)
A number is assigned as well as the name, as for a captured group.
Assigning the same name to two or more subexps is allowed. In this case, a subexp call cannot be performed, although a back reference is possible.
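Several of the extended groups above can be seen side by side in Ruby. This is a sketch with made-up sample strings and group names (year, month).

```ruby
ahead      = "foobar"[/foo(?=bar)/]        # look-ahead is not part of the match
not_ahead  = "foobaz"[/foo(?!bar)/]        # negative look-ahead
behind     = 'price: $42'[/(?<=\$)\d+/]    # look-behind
not_behind = "a1 b2"[/(?<!a)\d/]           # negative look-behind

inline_on  = /a(?i:b)c/.match?("aBc")      # ignore case for "b" only
inline_off = /a(?i:b)c/.match?("ABc")      # "a" is still case-sensitive
atomic     = /(?>a+)ab/.match?("aaab")     # atomic group keeps all the "a"s

# A named group receives a number as well as a name.
m = /(?<year>\d{4})-(?<month>\d{2})/.match("2007-09")

p [ahead, not_ahead, behind, not_behind]   # ["foo", "foo", "42", "2"]
p [inline_on, inline_off, atomic]          # [true, false, false]
p [m[:year], m[1], m[:month]]              # ["2007", "2007", "09"]
```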
8. Back reference
| \n | back reference by group number (n >= 1) |
| \k<n> | back reference by group number (n >= 1) |
| \k'n' | back reference by group number (n >= 1) |
| \k<-n> | back reference by relative group number (n >= 1) |
| \k'-n' | back reference by relative group number (n >= 1) |
| \k<name> | back reference by group name |
| \k'name' | back reference by group name |
When a back reference uses a name that is defined more than once, the subexp with the largest number is referred to first. (If it has not matched, a group with a smaller number is referred to.)
* Back reference by group number is forbidden if a named group is defined in the pattern and ONIG_OPTION_CAPTURE_GROUP is not set.
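The back reference forms can be sketched in Ruby as below. The group name q and the sample strings are illustrative. The last part assumes default Ruby options, where mixing a named group with a numbered back reference is rejected at compile time.

```ruby
doubled  = "hello"[/(\w)\1/]                          # by group number
quoted   = %q{say "hi"}[/(?<q>['"]).*?\k<q>/]         # by group name
relative = "abb"[/(a)(b)\k<-1>/]                      # nearest preceding group

# Numbered back reference together with a named group:
# expected to fail under default options.
mixed_error =
  begin
    Regexp.new('(?<x>a)\1')
    nil
  rescue RegexpError => e
    e.message
  end

p doubled    # "ll"
p quoted     # "\"hi\""
p relative   # "abb"
p mixed_error
```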
Back reference with nest level
level: 0, 1, 2, ...
\k<n+level> (n >= 1)
\k<n-level> (n >= 1)
\k'n+level' (n >= 1)
\k'n-level' (n >= 1)
\k<name+level>
\k<name-level>
\k'name+level'
\k'name-level'
Specifies the nest level relative to the back reference position.
example 1.
/\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")
example 2.
r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
(?<element> \g<stag> \g<content>* \g<etag> ){0}
(?<stag> < \g<name> \s* > ){0}
(?<name> [a-zA-Z_:]+ ){0}
(?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
(?<etag> </ \k<name+1> >){0}
\g<element>
__REGEXP__
p r.match('<foo>f<bar>bbb</bar>f</foo>').captures
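Example 1 reads as a palindrome test: \k<b+0> refers to the value of group b captured at the same recursion level as the back reference. A minimal Ruby sketch of that reading follows; the inputs other than "reer" are illustrative.

```ruby
palindrome = /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/

even_length = palindrome.match?("reer")      # outer r..r, inner e..e
odd_length  = palindrome.match?("racecar")   # single "." handles the middle char
mismatch    = palindrome.match?("reed")      # last char differs from first

p [even_length, odd_length, mismatch]
```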
9. Subexp call ("Tanaka Akira special")
| \g<name> | call by group name |
| \g'name' | call by group name |
| \g<n> | call by group number (n >= 1) |
| \g'n' | call by group number (n >= 1) |
| \g<-n> | call by relative group number (n >= 1) |
| \g'-n' | call by relative group number (n >= 1) |
* A left-most recursive call is not allowed.
ex. (?<name>a|\g<name>b) => error
(?<name>a|b\g<name>c) => OK
* Call by group number is forbidden if named group is defined in the pattern and ONIG_OPTION_CAPTURE_GROUP is not set.
* If the option status of the called group differs from that of the calling position, the group's option is effective.
ex. (?-i:\g<name>)(?i:(?<name>a)){0} matches "A"
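A subexp call lets a pattern recurse into itself. The Ruby sketch below uses a hypothetical group name, paren, to match balanced parentheses, then tries the two recursion examples above.

```ruby
# Recursive call: a parenthesised unit may contain further units.
balanced = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/
closed   = balanced.match?("(a(b)c)")
unclosed = balanced.match?("(a(b)c")

# Recursion that is not left-most is accepted.
right_ok = /\A(?<name>a|b\g<name>c)\z/.match?("bbacc")

# Left-most recursion is expected to be rejected when the pattern is compiled.
left_error =
  begin
    Regexp.new('(?<name>a|\g<name>b)')
    nil
  rescue RegexpError => e
    e.message
  end

p [closed, unclosed, right_ok]   # [true, false, true]
p left_error
```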
10. Captured group
The behavior of an unnamed group (...) changes under the following conditions. (The behavior of a named group does not change.)
case 1. /.../ (named group is not used, no option)
(...) is treated as a captured group.
case 2. /.../g (named group is not used, 'g' option)
(...) is treated as a non-captured group (?:...).
case 3. /..(?<name>..)../ (named group is used, no option)
(...) is treated as a non-captured group (?:...).
numbered-backref/call is not allowed.
case 4. /..(?<name>..)../G (named group is used, 'G' option)
(...) is treated as a captured group.
numbered-backref/call is allowed.
where
g: ONIG_OPTION_DONT_CAPTURE_GROUP
G: ONIG_OPTION_CAPTURE_GROUP
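Cases 1 and 3 can be observed from Ruby with default options. This is a sketch only: the 'g' and 'G' options are library-level flags and are not exercised here, and the group name first is made up.

```ruby
# case 1: no named group, so (...) captures.
unnamed = /(\d+)-(\d+)/.match("12-34").captures

# case 3: a named group is present, so the plain (...) does not capture.
mixed = /(?<first>\d+)-(\d+)/.match("12-34")

p unnamed          # ["12", "34"]
p mixed.captures   # ["12"]
p mixed.names      # ["first"]
```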
A-1. Syntax dependent options
+ RUBY
(?m): dot (.) matches newline
+ PERL and JAVA
(?s): dot (.) matches newline
(?m): ^ matches after newline, $ matches before newline
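In the RUBY syntax, (?m) therefore plays the role that (?s) plays in Perl, while ^ and $ are line anchors without any option. A small Ruby sketch with illustrative strings:

```ruby
text = "a\nb"

plain_dot  = /a.b/.match?(text)        # dot does not match newline by default
inline_m   = /(?m:a.b)/.match?(text)   # (?m) lets dot match newline
flag_m     = /a.b/m.match?(text)       # same option given as a literal flag
line_start = ("x\ny" =~ /^y/)          # ^ matches after newline with no option

p [plain_dot, inline_m, flag_m]   # [false, true, true]
p line_start                      # 2
```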
A-2. Original extensions
+ hexadecimal digit char type \h, \H
+ named group (?<name>...)
+ named backref \k<name>
+ subexp call \g<name>, \g<group-num>
A-3. Missing features compared with Perl 5.8.0
+ \N{name}
+ \l,\u,\L,\U, \X, \C
+ (?{code})
+ (??{code})
+ (?(condition)yes-pat|no-pat)
* \Q...\E is effective in the PERL and JAVA syntaxes.
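Since \Q...\E is not available in the RUBY syntax, one possible substitute when writing Ruby code is to quote the literal text before compiling it. The sketch below uses Ruby's core Regexp.escape, which is not part of the syntax described here; the sample text is illustrative.

```ruby
literal = "1+1=2 (really?)"

# Regexp.escape backslash-escapes every meta character in the string.
quoted   = Regexp.new(Regexp.escape(literal))
unquoted = Regexp.new("1+1")   # here "+" is still a quantifier

p quoted.match?("so 1+1=2 (really?) yes")   # true
p unquoted.match?("1+1")                    # false: needs "11", "111", ...
```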