Regular Expressions: Parenthetical Constructs

Capturing Parentheses

Capturing Parentheses are of the form `(<regex>)' and can be used to group arbitrarily complex regular expressions. Parentheses can be nested, but the total number of parentheses, nested or otherwise, is limited to 50 pairs. The text that is matched by the regular expression between a matched set of parentheses is captured and available for text substitutions and backreferences (see below.) Capturing parentheses carry a fairly high overhead both in terms of memory used and execution speed, especially if quantified by `*' or `+'.

Non-Capturing Parentheses

Non-Capturing Parentheses are of the form `(?:<regex>)' and facilitate grouping only and do not incur the overhead of normal capturing parentheses. They should not be counted when determining numbers for capturing parentheses which are used with backreferences and substitutions. Because of the limit on the number of capturing parentheses allowed in a regex, it is advisable to use non-capturing parentheses when possible.

Positive Look-Ahead

Positive look-ahead constructs are of the form `(?=<regex>)' and implement a zero width assertion of the enclosed regular expression. In other words, a match of the regular expression contained in the positive look-ahead construct is attempted. If it succeeds, control is passed to the next regular expression atom, but the text that was consumed by the positive look-ahead is first unmatched (backtracked) to the place in the text where the positive look-ahead was first encountered.

One application of positive look-ahead is the manual implementation of a first character discrimination optimization. You can include a positive look-ahead that contains a character class which lists every character that the following (potentially complex) regular expression could possibly start with. This will quickly filter out match attempts that can not possibly succeed.

Negative Look-Ahead

Negative look-ahead takes the form `(?!<regex>)' and is exactly the same as positive look-ahead except that the enclosed regular expression must NOT match. This can be particularly useful when you have an expression that is general, and you want to exclude some special cases. Simply precede the general expression with a negative look-ahead that covers the special cases that need to be filtered out.

Case Sensitivity

There are two parenthetical constructs that control case sensitivity:

(?i<regex>) Case insensitive; `AbcD' and `aBCd' are equivalent.
(?I<regex>) Case sensitive; `AbcD' and `aBCd' are different.

Regular expressions are case sensitive by default, i.e `(?I<regex>)' is assumed. All regular expression token types respond appropriately to case insensitivity including character classes and backreferences. There is some extra overhead involved when case insensitivity is in effect, but only to the extent of converting each character compared to lower case.

Matching Newlines

NEdit regular expressions by default handle the matching of newlines in a way that should seem natural for most editing tasks. There are situations, however, that require finer control over how newlines are matched by some regular expression tokens.

By default, NEdit regular expressions will NOT match a newline character for the following regex tokens: dot (`.'); a negated character class (`[^...]'); and the following shortcuts for character classes:

`\d', `\D', `\l', `\L', `\s', `\S', `\w', `\W', `\Y'

The matching of newlines can be controlled for the `.' token, negated character classes, and the `\s' and `\S' shortcuts by using one of the following parenthetical constructs:

(?n<regex>) `.', `[^...]', `\s', `\S' match newlines
(?N<regex>) `.', `[^...]', `\s', `\S' don't match newlines
`(?N<regex>)' is the default behavior.

Notes on New Parenthetical Constructs

Except for plain parentheses, none of the parenthetical constructs capture text. If that is desired, the construct must be wrapped with capturing parentheses, e.g. `((?i<regex))'.

All parenthetical constructs can be nested as deeply as desired, except for capturing parentheses which have a limit of 50 sets of parentheses, regardless of nesting level.

Back References

Backreferences allow you to match text captured by a set of capturing parenthesis at some latter position in your regular expression. A backreference is specified using a single backslash followed by a single digit from 1 to 9 (example: \3). Backreferences have similar syntax to substitutions (see below), but are different from substitutions in that they appear within the regular expression, not the substitution string. The number specified with a backreference identifies which set of text capturing parentheses the backreference is associated with. The text that was most recently captured by these parentheses is used by the backreference to attempt a match. As with substitutions, open parentheses are counted from left to right beginning with 1. So the backreference `\3' will try to match another occurrence of the text most recently matched by the third set of capturing parentheses. As an example, the regular expression `(\d)\1' could match `22', `33', or `00', but wouldn't match `19' or `01'.

A backreference must be associated with a parenthetical expression that is complete. The expression `(\w(\1))' contains an invalid backreference since the first set of parentheses are not complete at the point where the backreference appears.

Substitution

Substitution strings are used to replace text matched by a set of capturing parentheses. The substitution string is mostly interpreted as ordinary text except as follows.

The escape sequences described above for special characters, and octal and hexadecimal escapes are treated the same way by a substitution string. When the substitution string contains the `&' character, NEdit will substitute the entire string that was matched by the `Find...' operation. Any of the first nine sub-expressions of the match string can also be inserted into the replacement string. This is done by inserting a `\' followed by a digit from 1 to 9 that represents the string matched by a parenthesized expression within the regular expression. These expressions are numbered left-to-right in order of their opening parentheses.

The capitalization of text inserted by `&' or `\1', `\2', ... `\9' can be altered by preceding them with `\U', `\u', `\L', or `\l'. `\u' and `\l' change only the first character of the inserted entity, while `\U' and `\L' change the entire entity to upper or lower case, respectively.


<< Previous Section Regular Exp. Escape Sequences	Table of Contents	Next Section >> Regular Exp. Advanced Topics

Released on Wed, 6 Nov 2002 by C. Denat