Onigmo Regular Expressions Version 6.2.0
This section covers the Ruby regular expression syntax. For information on Perl regular expression syntax, see your Perl documentation or https://perldoc.perl.org/perlre.html.
\x{7HHHHHHH} wide hexadecimal char (character code point value)
(* \b means backspace only inside a character class [...]; elsewhere it is a word boundary)
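The two meanings of \b can be checked with a short Ruby sketch. The variable names are illustrative only, and the snippet assumes a stock Ruby interpreter, whose regular expression engine is Onigmo.

```ruby
# Inside a character class, \b stands for the backspace character (0x08).
backspace_index = "a\bb" =~ /[\b]/

# Outside a character class, \b is a zero-width word boundary.
boundary_words = "foo bar".scan(/\bb\w+/)

p backspace_index   # 1
p boundary_words    # ["bar"]
```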
\W non word char
-- Paragraph_Separator -- Space_Separator
Character Property
  * \p{property-name}
  * \p{^property-name}  (negative)
  * \P{property-name}   (negative)
property-name:
+ works on all encodings
  Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower, Print, Punct, Space, Upper, XDigit, Word, ASCII
+ works on UTF8, UTF16, UTF32
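A minimal Ruby sketch of the three property forms, using only the Alpha property from the "works on all encodings" list. The sample text and variable names are made up for illustration.

```ruby
text = "Ruby 3 rocks"

letters     = text.scan(/\p{Alpha}+/)    # runs of alphabetic characters
non_letters = text.scan(/\p{^Alpha}+/)   # negated with ^ inside the braces
same_thing  = text.scan(/\P{Alpha}+/)    # negated with an upper-case P

p letters       # ["Ruby", "rocks"]
p non_letters   # [" 3 "]
p same_thing    # [" 3 "]
```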
\R Linebreak
Unicode: (?>\x0D\x0A|[\x0A-\x0D\x{85}\x{2028}\x{2029}])
Not Unicode: (?>\x0D\x0A|[\x0A-\x0D])
\X eXtended grapheme cluster
Unicode: (?>\P{M}\p{M}*)
Not Unicode: (?m:.)
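As a rough illustration in Ruby (UTF-8 strings assumed, names illustrative): \R treats "\r\n" as a single line break, and \X keeps a base character together with its combining mark.

```ruby
# \R matches CR LF as one unit, so no empty fields appear between "\r" and "\n".
lines = "one\r\ntwo\nthree\rfour".split(/\R/)

# "e" followed by U+0301 (combining acute accent) forms one grapheme cluster.
clusters = "e\u0301a".scan(/\X/)

p lines             # ["one", "two", "three", "four"]
p clusters.length   # 2
```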
greedy
reluctant
possessive (greedy and does not backtrack after repeated)
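The three quantifier kinds behave differently on the same input. The following Ruby sketch uses made-up sample strings; it assumes the Ruby syntax, where a trailing + on ?, * or + makes the quantifier possessive.

```ruby
html = "<a><b>"

greedy    = html[/<.+>/]    # takes as much as possible, then backtracks
reluctant = html[/<.+?>/]   # takes as little as possible

# Possessive: .*+ swallows the closing quote and never gives it back,
# so the final '"' in the pattern cannot match.
possessive = '"abc"' =~ /".*+"/

p greedy       # "<a><b>"
p reluctant    # "<a>"
p possessive   # nil
```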
ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]
* To use '[', '-', ']' as normal characters in a character class, escape them with '\'.
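The set-operation example and the escaping rule can be tried directly in Ruby. This is only a sketch; the variable names are illustrative.

```ruby
alphabet = ("a".."z").to_a.join

# [a-w] AND ([^c-g] OR z): only a, b and h through w survive.
intersected = alphabet.scan(/[a-w&&[^c-g]z]/).join

# '[', '-' and ']' escaped with '\' are ordinary members of the class.
literals = "a-b[c]".scan(/[\[\-\]]/)

p intersected   # "abhijklmnopqrstuvw"
p literals      # ["-", "[", "]"]
```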
POSIX bracket ([:xxxxx:], negate [:^xxxxx:])
Not Unicode Case:

  alnum    alphabet or digit char
  alpha    alphabet
  ascii    code value: [0 - 127]
  blank    \t, \x20
  cntrl
  digit    0-9
  graph    include all of multibyte encoded characters
  lower
  print    include all of multibyte encoded characters
  punct
  space    \t, \n, \v, \f, \r, \x20
  upper
  word     alphanumeric, "_" and multibyte characters
  xdigit   0-9, a-f, A-F

Unicode Case:

  alnum    Letter | Mark | Decimal_Number
  alpha    Letter | Mark
  ascii    0000 - 007F
  blank    Space_Separator | 0009
  cntrl    Control | Format | Unassigned | Private_Use | Surrogate
  digit    Decimal_Number
  graph    [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
  lower    Lowercase_Letter
  print    [[:graph:]] | [[:space:]]
  punct    Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
           Final_Punctuation | Initial_Punctuation | Other_Punctuation |
           Open_Punctuation
  space    Space_Separator | Line_Separator | Paragraph_Separator |
           0009 | 000A | 000B | 000C | 000D | 0085
  upper    Uppercase_Letter
  word     Letter | Mark | Decimal_Number | Connector_Punctuation
  xdigit   0030 - 0039 | 0041 - 0046 | 0061 - 0066
           (0-9, a-f, A-F)
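A small Ruby sketch of POSIX brackets on a UTF-8 string (the Unicode case). The sample strings and variable names are illustrative.

```ruby
word = "caf\u00E9"   # "café", with a precomposed e-acute

alpha_run = word[/[[:alpha:]]+/]               # Letter | Mark covers U+00E9
ascii_run = word[/[[:ascii:]]+/]               # stops at the first non-ASCII char
blanks    = "x\ty z".scan(/[[:blank:]]/)       # tab and space
not_digit = "ab12".scan(/[[:^digit:]]/)        # negated bracket

p alpha_run   # "café"
p ascii_run   # "caf"
p blanks      # ["\t", " "]
p not_digit   # ["a", "b"]
```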
(?imxdau-imx) option on/off
  i: ignore case
  m: multi-line (dot(.) matches newline)
  x: extended form

  character set option (character range option)
  d: Default (compatible with Ruby 1.9.3)
     \w, \d and \s don't match non-ASCII characters.
     \b, \B and POSIX brackets use each encoding's rules.
  a: ASCII
     ONIG_OPTION_ASCII_RANGE option is turned on.
     \w, \d, \s and POSIX brackets don't match non-ASCII characters.
     \b and \B use the ASCII rules.
  u: Unicode
     ONIG_OPTION_ASCII_RANGE option is turned off.
     \w (\W), \d (\D), \s (\S), \b (\B) and POSIX brackets use
     each encoding's rules.
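The inline options can be exercised from Ruby as follows. This is a sketch under the assumption of a UTF-8 subject string; the variable names are illustrative.

```ruby
word = "caf\u00E9"   # "café"

default_w   = word[/\w+/]                 # d: \w is ASCII-only
unicode_w   = word[/(?u)\w+/]             # u: \w follows the encoding's rules
ascii_posix = word[/(?a)[[:alpha:]]+/]    # a: POSIX brackets become ASCII-only

ignore_case = /(?i)ruby/.match?("RUBY")         # i switched on
dot_newline = /(?m:a.b)/.match?("a\nb")         # m: dot matches newline
extended    = /(?x: r u b y )/.match?("ruby")   # x: whitespace in pattern ignored
switched    = /(?i)a(?-i)b/.match?("AB")        # i switched off again before b

p default_w, unicode_w, ascii_posix
p ignore_case, dot_newline, extended, switched
```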
Another way to express look-behind: the text matched to the left of \K is required for the match but is not included in the result.
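For instance, \K lets a substitution replace only the part after a required prefix. A minimal Ruby sketch with made-up sample text:

```ruby
line = "price: 42 USD"

# "price: " must match, but only the digits belong to the reported match.
amount  = line[/price: \K\d+/]
updated = line.sub(/price: \K\d+/, "99")

p amount    # "42"
p updated   # "price: 99 USD"
```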
(?(cond)yes-subexp), (?(cond)yes-subexp|no-subexp)
  conditional expression
  Matches yes-subexp if (cond) yields a true value, matches no-subexp otherwise.
  The following (cond) forms can be used:

  (n) (n >= 1)
    Checks if the numbered capturing group has matched something.

  (<name>), ('name')
    Checks if a group with the given name has matched something.
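A sketch of both condition forms in Ruby: a closing delimiter is required only when the optional opening delimiter matched. The patterns and names are illustrative, not taken from the Onigmo sources.

```ruby
# Numbered condition: ")" is required only if group 1 matched "(".
numbered = /\A(\()?\d+(?(1)\))\z/

# Named condition: ">" is required only if the group "open" matched "<".
named = /\A(?<open><)?\w+(?(<open>)>)\z/

p numbered.match?("(12)")   # true
p numbered.match?("12")     # true
p numbered.match?("(12")    # false
p named.match?("<tag>")     # true
p named.match?("<tag")      # false
```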
When a name is defined more than once (multiplex definition), a back reference by that name refers preferentially to the group with the largest number. (If that group has not matched, a group with a smaller number is referred to.)
* Back reference by group number is forbidden if a named group is defined in the pattern and ONIG_OPTION_CAPTURE_GROUP is not set.
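The restriction on numbered back references is easy to observe from Ruby, which does not set ONIG_OPTION_CAPTURE_GROUP. The following is a minimal sketch with illustrative names.

```ruby
# Named back reference: the closing quote must equal the opening one.
quoted = /(?<q>['"])(?<body>.*?)\k<q>/
m = quoted.match(%q{say "hi" now})
p m[:body]   # "hi"

# Mixing a named group with a numbered back reference is rejected
# when the pattern is compiled.
error = begin
  Regexp.new('(?<q>a)\1')
  nil
rescue RegexpError => e
  e
end
p error.class   # RegexpError
```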
Back reference with nest level
level: 0, 1, 2, ...
  \k<n+level>     \k'n+level'     (n >= 1)
  \k<n-level>     \k'n-level'     (n >= 1)
  \k<-n+level>    \k'-n+level'    (n >= 1)
  \k<-n-level>    \k'-n-level'    (n >= 1)
  \k<name+level>  \k'name+level'
  \k<name-level>  \k'name-level'

Designates the nest level relative to the back reference position.
example 1.
/\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")
example 2.
r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
(?<element> \g<stag> \g<content>* \g<etag> ){0}
(?<stag> < \g<name> \s* > ){0}
(?<name> [a-zA-Z_:]+ ){0}
(?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
(?<etag> </ \k<name+1> >){0}
\g<element>
__REGEXP__

p r.match('<foo>f<bar>bbb</bar>f</foo>').captures
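The pattern from example 1 acts as a palindrome matcher: \k<b+0> refers to the capture of group b at the same recursion level as the back reference itself. A short Ruby sketch applying it to a few words (the word list and the name palindrome are illustrative):

```ruby
palindrome = /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/

words = %w[reer level rear]
matching = words.select { |w| palindrome.match?(w) }

p matching   # ["reer", "level"]
```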
  \g<0>    call the whole pattern recursively
  \g'0'    call the whole pattern recursively
  \g<-n>   call by relative group number (n >= 1)
  \g'-n'   call by relative group number (n >= 1)
  \g<+n>   call by relative group number (n >= 1)
  \g'+n'   call by relative group number (n >= 1)
* left-most recursive call is not allowed.
* Call by group number is forbidden if named group is defined in the pattern and ONIG_OPTION_CAPTURE_GROUP is not set.
* If the option status of called group is different from calling position then the group's option is effective.
ex. (?-i:\g<name>)(?i:(?<name>a)){0} matches "A"
Perl syntax: use (?&name), (?n), (?-n), (?+n), (?R) or (?0) instead.
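Two common uses of subexp calls, sketched in Ruby with illustrative patterns: \g<0> for recursion over the whole pattern, and \g<name> to reuse a group's sub-pattern. In both cases the call is not the left-most element, so the restriction above does not apply.

```ruby
# Balanced parentheses: the whole pattern calls itself for nested pairs.
balanced = /\((?:[^()]|\g<0>)*\)/
nested = "f(a(b)c) + g(d"[balanced]

# Reuse the sub-pattern of the named group "octet" three more times.
dotted_quad = /\A(?<octet>\d{1,3})(?:\.\g<octet>){3}\z/

p nested                                # "(a(b)c)"
p dotted_quad.match?("192.168.0.1")     # true
p dotted_quad.match?("192.168.0")       # false
```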
The behavior of an unnamed group (...) changes under the following conditions. (Named groups are not affected.)
case 1. /.../ (named group is not used, no option)
(...) is treated as a capturing group.
case 2. /.../g (named group is not used, 'g' option)
(...) is treated as a non-capturing group (?:...).
case 3. /..(?<name>..)../ (named group is used, no option)
(...) is treated as a non-capturing group (?:...). numbered-backref/call is not allowed.
case 4. /..(?<name>..)../G (named group is used, 'G' option)
(...) is treated as a capturing group. numbered-backref/call is allowed.
where
  g: ONIG_OPTION_DONT_CAPTURE_GROUP
  G: ONIG_OPTION_CAPTURE_GROUP
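Cases 1 and 3 can be observed from Ruby, which sets neither the 'g' nor the 'G' option. The sketch below uses made-up patterns and names.

```ruby
# Case 1: no named group, so both (...) groups capture.
plain = Regexp.new('(\d+)-(\d+)').match("2024-05")

# Case 3: a named group is present, so the plain (...) no longer captures.
mixed = Regexp.new('(?<year>\d+)-(\d+)').match("2024-05")

p plain.captures   # ["2024", "05"]
p mixed.captures   # ["2024"]
p mixed.names      # ["year"]
```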
+ RUBY
  (?m): dot(.) matches newline

+ PERL, JAVA, and Python
  (?s): dot(.) matches newline
  (?m): ^ matches after newline, $ matches before newline

+ PERL
  (?d), (?l): same as (?u)
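In the Ruby syntax, ^ and $ work at line boundaries without any option, and m only affects the dot. A brief Ruby sketch with illustrative names:

```ruby
text = "first\nsecond"

line_anchor  = text =~ /^second$/              # ^ and $ need no option in Ruby
dot_default  = /first.second/.match?(text)     # dot does not match "\n"
dot_with_m   = /first.second/m.match?(text)    # Ruby's m: dot matches "\n"
string_start = /\Asecond/.match?(text)         # \A anchors to the string start

p line_anchor    # 6
p dot_default    # false
p dot_with_m     # true
p string_start   # false
```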
+ hexadecimal digit char type  \h, \H
+ named group                  (?<name>...)
+ named backref                \k<name>
+ subexp call                  \g<name>, \g<group-num>
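A quick Ruby sketch of the hexadecimal digit type together with a named group; the sample strings and names are illustrative.

```ruby
hex_runs     = "ff 0G".scan(/\h+/)   # runs of hexadecimal digits
non_hex_runs = "ff 0G".scan(/\H+/)   # runs of everything else

# A named group is read back by name from the MatchData.
m = /#(?<hex>\h+)/.match("color: #ff8800;")

p hex_runs       # ["ff", "0"]
p non_hex_runs   # [" ", "G"]
p m[:hex]        # "ff8800"
```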
+ \N{name}, \N{U+xxxx}, \N
+ \l,\u,\L,\U, \C
+ \v, \V, \h, \H, \o{xxx}
+ (?{code})
+ (??{code})
+ (?|...)
+ (*VERB:ARG)

* \Q...\E
  This is effective in PERL and JAVA.
+ capture history
(?@...) and (?@<name>...)
ex. /(?@a)*/.match("aaa") ==> [<0-1>, <1-2>, <2-3>]
+ Invalid encoding byte sequence is not checked.
ex. UTF-8
* Invalid first byte is treated as a character.
  /./u =~ "\xa3"
* Incomplete byte sequence is not checked.
  /\w+/ =~ "a\xf3\x8ec"