Capturing Parentheses
Capturing Parentheses are of the form
`(<regex>)' and can
be used to group arbitrarily complex regular expressions.
Parentheses can be nested, but the total number of
parentheses, nested or otherwise, is limited to 50
pairs. The text that is matched by the regular expression
between a matched set of parentheses is captured and
available for text substitutions and backreferences
(see below.) Capturing parentheses carry a fairly
high overhead both in terms of memory used and execution
speed, especially if quantified by `*'
or `+'.
Non-Capturing Parentheses
Non-Capturing Parentheses are of the
form `(?:<regex>)'
and facilitate grouping only and do not incur the
overhead of normal capturing parentheses. They should
not be counted when determining numbers for capturing
parentheses which are used with backreferences and
substitutions. Because of the limit on the number
of capturing parentheses allowed in a regex, it is
advisable to use non-capturing parentheses when possible.
Positive Look-Ahead
Positive look-ahead constructs are
of the form `(?=<regex>)'
and implement a zero width assertion of the enclosed
regular expression. In other words, a match of the
regular expression contained in the positive look-ahead
construct is attempted. If it succeeds, control is
passed to the next regular expression atom, but the
text that was consumed by the positive look-ahead
is first unmatched (backtracked) to the place in the
text where the positive look-ahead was first encountered.
One application of positive look-ahead
is the manual implementation of a first character
discrimination optimization. You can include a positive
look-ahead that contains a character class which lists
every character that the following (potentially complex)
regular expression could possibly start with. This
will quickly filter out match attempts that can not
possibly succeed.
Negative Look-Ahead
Negative look-ahead takes the form
`(?!<regex>)' and
is exactly the same as positive look-ahead except
that the enclosed regular expression must NOT match.
This can be particularly useful when you have an expression
that is general, and you want to exclude some special
cases. Simply precede the general expression with
a negative look-ahead that covers the special cases
that need to be filtered out.
Case Sensitivity
There are two parenthetical constructs
that control case sensitivity:
- (?i<regex>)
Case insensitive; `AbcD'
and `aBCd' are
equivalent.
- (?I<regex>)
Case sensitive; `AbcD'
and `aBCd' are
different.
Regular expressions are case sensitive
by default, i.e `(?I<regex>)'
is assumed. All regular expression token types respond
appropriately to case insensitivity including character
classes and backreferences. There is some extra overhead
involved when case insensitivity is in effect, but
only to the extent of converting each character compared
to lower case.
Matching Newlines
NEdit regular expressions by default
handle the matching of newlines in a way that should
seem natural for most editing tasks. There are situations,
however, that require finer control over how newlines
are matched by some regular expression tokens.
By default, NEdit regular expressions
will NOT match a newline character for the following
regex tokens: dot (`.');
a negated character class (`[^...]');
and the following shortcuts for character classes:
- `\d', `\D',
`\l', `\L',
`\s', `\S',
`\w', `\W',
`\Y'
The matching of newlines can be controlled
for the `.' token, negated
character classes, and the `\s'
and `\S' shortcuts by using
one of the following parenthetical constructs:
- (?n<regex>)
`.', `[^...]',
`\s', `\S'
match newlines
- (?N<regex>)
`.', `[^...]',
`\s', `\S'
don't match newlines
- `(?N<regex>)'
is the default behavior.
Notes on New Parenthetical Constructs
Except for plain parentheses, none
of the parenthetical constructs capture text. If that
is desired, the construct must be wrapped with capturing
parentheses, e.g. `((?i<regex))'.
All parenthetical constructs can be
nested as deeply as desired, except for capturing
parentheses which have a limit of 50 sets of parentheses,
regardless of nesting level.
Back References
Backreferences allow you to match text
captured by a set of capturing parenthesis at some
latter position in your regular expression. A backreference
is specified using a single backslash followed by
a single digit from 1 to 9 (example: \3).
Backreferences have similar syntax to substitutions
(see below), but are different from substitutions
in that they appear within the regular expression,
not the substitution string. The number specified
with a backreference identifies which set of text
capturing parentheses the backreference is associated
with. The text that was most recently captured by
these parentheses is used by the backreference to
attempt a match. As with substitutions, open parentheses
are counted from left to right beginning with 1. So
the backreference `\3'
will try to match another occurrence of the text most
recently matched by the third set of capturing parentheses.
As an example, the regular expression `(\d)\1'
could match `22', `33',
or `00', but wouldn't match
`19' or `01'.
A backreference must be associated
with a parenthetical expression that is complete.
The expression `(\w(\1))'
contains an invalid backreference since the first
set of parentheses are not complete at the point where
the backreference appears.
Substitution
Substitution strings are used to replace
text matched by a set of capturing parentheses. The
substitution string is mostly interpreted as ordinary
text except as follows.
The escape sequences described above
for special characters, and octal and hexadecimal
escapes are treated the same way by a substitution
string. When the substitution string contains the
`&' character, NEdit
will substitute the entire string that was matched
by the `Find...' operation. Any of the first nine
sub-expressions of the match string can also be inserted
into the replacement string. This is done by inserting
a `\' followed by a digit
from 1 to 9 that represents the string matched by
a parenthesized expression within the regular expression.
These expressions are numbered left-to-right in order
of their opening parentheses.
The capitalization of text inserted
by `&' or `\1',
`\2', ... `\9'
can be altered by preceding them with `\U',
`\u', `\L',
or `\l'. `\u'
and `\l' change only the
first character of the inserted entity, while `\U'
and `\L' change the entire
entity to upper or lower case, respectively.
|