Regular expressions play an important role in most text parsing and text matching tasks. They underpin the `-split` and `-match` operators, the `switch` statement, the `Select-String` cmdlet, and more. Table B.1, “Character classes: Patterns that represent sets of characters” through Table B.9, “Character escapes: Character sequences that represent another character” list commonly used regular expressions.
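As a rough illustration of where these patterns show up, the sketch below runs a regular expression through each of the four features named above. The sample strings and variable names are made up for the example.

```powershell
# -split treats its right-hand argument as a regular expression.
$parts = 'a1b22c' -split '\d+'                  # a, b, c

# -match returns True or False and fills the automatic $Matches variable.
$isMatch = 'Error 42' -match '^error (\d+)$'    # True (-match ignores case)
$code = $Matches[1]                             # 42

# switch -Regex tests the input against each pattern in turn.
$kind = switch -Regex ('warning: disk almost full') {
    '^error'   { 'error' }
    '^warning' { 'warning' }
    default    { 'other' }
}

# Select-String also interprets -Pattern as a regular expression.
$hits = 'alpha', 'beta', 'gamma' | Select-String -Pattern '^(a|g)'
```

Here `$kind` ends up as `warning`, and `$hits` holds the matches for `alpha` and `gamma`.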
Table B.1. Character classes: Patterns that represent sets of characters
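| Character class | Matches |
|---|---|
| `.` | Any character except for a newline. If the regular expression uses the SingleLine option, it matches any character. `PS > "T" -match '.'` returns `True`. |
| `[characters]` | Any character in the brackets. For example: `PS > "Test" -match '[Tes]'` returns `True`. |
| `[^characters]` | Any character not in the brackets. For example: `PS > "Test" -match '[^Tes]'` returns `False`. |
| `[start-end]` | Any character between the characters `start` and `end`, inclusive. `PS > "Test" -match '[e-t]'` returns `True`. |
| `[^start-end]` | Any character not between any of the character ranges `start` through `end`, inclusive. `PS > "Test" -match '[^e-t]'` returns `False`. |
| `\p{character class}` | Any character in the Unicode group or block range specified by `{character class}`. `PS > "+" -match '\p{Sm}'` returns `True`. |
| `\P{character class}` | Any character not in the Unicode group or block range specified by `{character class}`. `PS > "+" -match '\P{Sm}'` returns `False`. |
| `\w` | Any word character. This includes digits and the underscore, and is not limited to ASCII letters. `PS > "a" -match '\w'` returns `True`. |
| `\W` | Any nonword character. `PS > "!" -match '\W'` returns `True`. |
| `\s` | Any whitespace character. `` PS > "`t" -match '\s' `` returns `True`. |
| `\S` | Any nonwhitespace character. `` PS > " `t" -match '\S' `` returns `False`. |
| `\d` | Any decimal digit. `PS > "5" -match '\d'` returns `True`. |
| `\D` | Any character that isn't a decimal digit. `PS > "!" -match '\D'` returns `True`. |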
First column: I think I don't quite understand which things are formatted as code here and which are formatted as normal text. I think the complete first column should use code formatting. Likewise for the other tables too.
For \w it could be mentioned that “word character” includes digits and the underscore. As well as it's not restricted to ASCII but matches pretty much any letter, including foreign scripts, CJK ideographs, sub- and superscripted letters (though no sub-/superscripted digits), letterlike forms (ligatures, math stuff, ...) and some things probably no one would consider a “word character” like U+FE4F Wavy Low Line (﹏).
Ah well, shorten it how you see fit but mentioning that “word character” includes more than just [a-z] would be a good idea, I think.
Side note:
-join((32..65535|%{[string][char]$_}) -match '\w') > test.txt
works pretty well in generating a list of what \w matches—it's more than I expected.
last row: “Any nondecimal digit.” could be worded “Any character that isn't a decimal digit.” otherwise it might sound like it would match digits of other bases only.
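To make the point about `\w` concrete, here is a small sketch that tests a handful of characters against it. The sample set is an arbitrary choice for illustration; the non-ASCII characters are built from code points so the script does not depend on file encoding.

```powershell
# Hypothetical sample characters, chosen to probe the edges of \w.
$samples = [ordered]@{
    'ASCII letter'    = 'a'
    'Digit'           = '5'
    'Underscore'      = '_'
    'Accented letter' = [string][char]0x00E9   # e with acute accent
    'CJK ideograph'   = [string][char]0x4E2D
    'Hyphen'          = '-'
    'Superscript two' = [string][char]0x00B2
}

$results = foreach ($name in $samples.Keys) {
    [pscustomobject]@{
        Sample           = $name
        MatchesWordClass = ($samples[$name] -match '\w')
    }
}
$results
```

The first five samples report `True`. The hyphen and the superscript digit report `False`, which agrees with the observation above that superscripted digits are not treated as word characters.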
Table B.2. Quantifiers: Expressions that enforce quantity on the preceding expression
| Quantifier | Meaning |
|---|---|
| `<none>` | One match. `PS > "T" -match 'T'` returns `True`. |
| `*` | Zero or more matches, matching as much as possible. `PS > "A" -match 'T*'` returns `True`. `PS > "TTTTT" -match '^T*$'` returns `True`. |
| `+` | One or more matches, matching as much as possible. `PS > "A" -match 'T+'` returns `False`. `PS > "TTTTT" -match '^T+$'` returns `True`. |
| `?` | Zero or one matches, matching as much as possible. `PS > "TTTTT" -match '^T?$'` returns `False`. |
| `{n}` | Exactly `n` matches. `PS > "TTTTT" -match '^T{5}$'` returns `True`. |
| `{n,}` | `n` or more matches, matching as much as possible. `PS > "TTTTT" -match '^T{4,}$'` returns `True`. |
| `{n,m}` | Between `n` and `m` matches (inclusive), matching as much as possible. `PS > "TTTTT" -match '^T{4,6}$'` returns `True`. |
| `*?` | Zero or more matches, matching as little as possible. `PS > "A" -match '^AT*?$'` returns `True`. |
| `+?` | One or more matches, matching as little as possible. `PS > "A" -match '^AT+?$'` returns `False`. |
| `??` | Zero or one matches, matching as little as possible. `PS > "A" -match '^AT??$'` returns `True`. |
| `{n}?` | Exactly `n` matches. `PS > "TTTTT" -match '^T{5}?$'` returns `True`. |
| `{n,}?` | `n` or more matches, matching as little as possible. `PS > "TTTTT" -match '^T{4,}?$'` returns `True`. |
| `{n,m}?` | Between `n` and `m` matches (inclusive), matching as little as possible. `PS > "TTTTT" -match '^T{4,6}?$'` returns `True`. |
I think some of the examples in here could benefit from a dump of parts of $Matches; especially the non-greedy quantifiers, to show the difference in how much they actually match. Something along the line of the following:
PS > 'ATTT' -match 'AT*'; $Matches[0]
True
ATTT
PS > 'ATTT' -match 'AT*?'; $Matches[0]
True
A
PS > 'ATTT' -match 'AT+'; $Matches[0]
True
ATTT
PS > 'ATTT' -match 'AT+?'; $Matches[0]
True
AT
PS > 'ATTT' -match 'AT?'; $Matches[0]
True
AT
PS > 'ATTT' -match 'AT??'; $Matches[0]
True
A
You have done something similar in the table below to illustrate grouping and what gets captured by groups.
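The same idea extends to the counted quantifiers. The sketch below (variable names are illustrative) shows how much text `{n,m}` and `{n,m}?` consume, and why an anchored lazy pattern still matches the whole string.

```powershell
# Greedy: takes as many T characters as the upper bound allows.
$null = 'TTTTT' -match 'T{2,4}'
$greedy = $Matches[0]      # TTTT

# Lazy: stops as soon as the lower bound is satisfied.
$null = 'TTTTT' -match 'T{2,4}?'
$lazy = $Matches[0]        # TT

# With ^ and $ around it, the lazy quantifier must expand until $ can match.
$null = 'TTTTT' -match '^T{4,6}?$'
$anchored = $Matches[0]    # TTTTT
```

This is why the anchored examples in the table return `True` for both the greedy and the lazy forms: the anchors leave only one possible match.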
Table B.3. Grouping constructs: Expressions that let you group characters, patterns, and other expressions
| Grouping construct | Description |
|---|---|
| `(text)` | Captures the text matched inside the parentheses. These captures are named by number (starting at one) based on the order of the opening parenthesis. `PS > "Hello" -match '^(.*)llo$'; $matches[1]` returns `True` and `He`. |
| `(?<name>)` | Captures the text matched inside the parentheses. These captures are named by the name given in `name`. `PS > "Hello" -match '^(?<One>.*)llo$'; $matches.One` returns `True` and `He`. |
| `(?<name1-name2>)` | A balancing group definition. This is an advanced regular expression construct, but it lets you match evenly balanced pairs of terms. |
| `(?:)` | Noncapturing group. `PS > "A1" -match '((A\|B)\d)'; $matches` returns `True`, with group `0` = `A1`, group `1` = `A1`, and group `2` = `A`. `PS > "A1" -match '((?:A\|B)\d)'; $matches` returns `True`, with group `0` = `A1` and group `1` = `A1`. |
| `(?imnsx-imnsx:)` | Applies or disables the given option for this group. Supported options are: `i` (case-insensitive), `m` (multiline), `n` (explicit capture), `s` (singleline), and `x` (ignore whitespace). `` PS > "Te`nst" -match '(T e.st)' `` returns `False`. `` PS > "Te`nst" -match '(?sx:T e.st)' `` returns `True`. |
| `(?=)` | Zero-width positive lookahead assertion. Ensures that the given pattern matches to the right, without actually performing the match. `PS > "555-1212" -match '(?=...-)(.*)'; $matches[1]` returns `True` and `555-1212`. |
| `(?!)` | Zero-width negative lookahead assertion. Ensures that the given pattern does not match to the right, without actually performing the match. `PS > "friendly" -match '(?!friendly)friend'` returns `False`. |
| `(?<=)` | Zero-width positive lookbehind assertion. Ensures that the given pattern matches to the left, without actually performing the match. `PS > "public int X" -match '^.*(?<=public )int .*$'` returns `True`. |
| `(?<!)` | Zero-width negative lookbehind assertion. Ensures that the given pattern does not match to the left, without actually performing the match. `PS > "private int X" -match '^.*(?<!private )int .*$'` returns `False`. |
| `(?>)` | Nonbacktracking subexpression. Matches only if this subexpression can be matched completely. `PS > "Hello World" -match '(Hello.*)orld'` returns `True`. `PS > "Hello World" -match '(?>Hello.*)orld'` returns `False`. The nonbacktracking version of the subexpression fails to match, as its complete match would be "Hello World". |
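As a minimal sketch of how named groups and lookaheads tend to be used together in a script, the following pulls fields out of a date string and extracts a number that is followed by a unit. The input strings and group names are hypothetical.

```powershell
$date = '2024-01-15'   # hypothetical input

# Named groups become properties on $Matches.
if ($date -match '^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$') {
    $year  = $Matches.year    # 2024
    $month = $Matches.month   # 01
    $day   = $Matches.day     # 15
}

# The lookahead checks for ' USD' without making it part of the match.
$null = 'price: 100 USD' -match '\d+(?= USD)'
$amount = $Matches[0]         # 100
```

Because the lookahead is zero-width, `$amount` contains only the digits, not the currency suffix.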
Table B.4. Atomic zero-width assertions: Patterns that restrict where a match may occur
| Assertion | Restriction |
|---|---|
| `^` | The match must occur at the beginning of the string (or line, if the Multiline option is in effect). `PS > "Test" -match '^est'` returns `False`. |
| `$` | The match must occur at the end of the string (or line, if the Multiline option is in effect). `PS > "Test" -match 'Tes$'` returns `False`. |
| `\A` | The match must occur at the beginning of the string. `` PS > "The`nTest" -match '(?m:^Test)' `` returns `True`. `` PS > "The`nTest" -match '(?m:\ATest)' `` returns `False`. |
| `\Z` | The match must occur at the end of the string, or before `\n` at the end of the string. `` PS > "The`nTest`n" -match '(?m:The$)' `` returns `True`. `` PS > "The`nTest`n" -match '(?m:The\Z)' `` returns `False`. `` PS > "The`nTest`n" -match 'Test\Z' `` returns `True`. |
| `\z` | The match must occur at the end of the string. `` PS > "The`nTest`n" -match 'Test\z' `` returns `False`. |
| `\G` | The match must occur where the previous match ended. Used when matching repeatedly, such as with the `NextMatch()` method of a .NET `Match` object. |
| `\b` | The match must occur on a word boundary: the first or last characters in words separated by nonalphanumeric characters. `PS > "Testing" -match 'ing\b'` returns `True`. |
| `\B` | The match must not occur on a word boundary. `PS > "Testing" -match 'ing\B'` returns `False`. |
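The `\G` row has no example in the table, so here is one possible sketch using the .NET `[regex]::Matches()` method, which applies the pattern repeatedly. The input string is made up; note that `[regex]` methods are case-sensitive by default, unlike `-match`.

```powershell
# With \G, each match must begin exactly where the previous one ended,
# so matching stops at the first character that is not 'a'.
$contiguous = [regex]::Matches('aaabaa', '\Ga')
$contiguousCount = $contiguous.Count    # 3

# Without \G, the engine skips over the 'b' and keeps finding matches.
$all = [regex]::Matches('aaabaa', 'a')
$allCount = $all.Count                  # 5

# \b keeps a replacement from touching 'cat' inside a longer word.
$replaced = 'cat concatenate' -replace '\bcat\b', 'dog'   # dog concatenate
```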
Table B.5. Substitution patterns: Patterns used in a regular expression replace operation
| Pattern | Substitution |
|---|---|
| `$number` | The text matched by group number `number`. `PS > "Test" -replace "(.*)st",'$1ar'` returns `Tear`. |
| `${name}` | The text matched by group named `name`. `PS > "Test" -replace "(?<pre>.*)st",'${pre}ar'` returns `Tear`. |
| `$$` | A literal `$`. `PS > "Test" -replace ".",'$$'` returns `$$$$`. |
| `$&` | A copy of the entire match. `PS > "Test" -replace "^.*$",'Found: $&'` returns `Found: Test`. |
| `` $` `` | The text of the input string that precedes the match. `` PS > "Test" -replace "est$",'Te$`' `` returns `TTeT`. |
| `$'` | The text of the input string that follows the match. `PS > "Test" -replace "^Tes",'Res$'''` returns `Restt`. |
| `$+` | The last group captured. `PS > "Testing" -replace "(.*)ing",'$+ed'` returns `Tested`. |
| `$_` | The entire input string. `PS > "Testing" -replace "(.*)ing",'String: $_'` returns `String: Testing`. |
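A short sketch of these substitution patterns in a typical reordering task follows. The names and strings are illustrative. The replacement strings are single-quoted so that PowerShell passes `$1`, `${first}`, and `$&` through to the regular expression engine instead of trying to expand them as variables.

```powershell
# Swap "Last, First" into "First Last" using named groups.
$swapped = 'Doe, John' -replace '^(?<last>\w+), (?<first>\w+)$', '${first} ${last}'   # John Doe

# $& inserts the whole match, which is handy for wrapping text.
$wrapped = '42' -replace '\d+', '[$&]'                                                 # [42]

# In a double-quoted replacement, escape the dollar sign with a backtick.
$escaped = 'Test' -replace '(.*)st', "`$1ar"                                           # Tear
```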
Table B.6. Alternation constructs: Expressions that let you perform either/or logic
| Alternation construct | Description |
|---|---|
| `\|` | Matches any of the terms separated by the vertical bar character. `PS > "Test" -match '(B\|T)est'` returns `True`. |
| `(?(expression)yes\|no)` | Matches the `yes` term if `expression` matches at this point. Otherwise, matches the `no` term. The `no` term is optional. `PS > "3.14" -match '(?(\d)3.14\|Pi)'` returns `True`. `PS > "Pi" -match '(?(\d)3.14\|Pi)'` returns `True`. `PS > "2.71" -match '(?(\d)3.14\|Pi)'` returns `False`. |
| `(?(name)yes\|no)` | Matches the `yes` term if the capture group named `name` has a capture at this point. Otherwise, matches the `no` term. The `no` term is optional. `PS > "123" -match '(?<one>1)?(?(one)23\|234)'` returns `True`. `PS > "23" -match '(?<one>1)?(?(one)23\|234)'` returns `False`. `PS > "234" -match '(?<one>1)?(?(one)23\|234)'` returns `True`. |
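One detail worth seeing in practice is how far the vertical bar reaches: it splits the entire enclosing expression, anchors included, unless a group limits it. The sketch below uses made-up inputs to show the difference.

```powershell
# Without a group, this reads as "(^cat) or (dog$)".
$loose  = 'hotdog' -match '^cat|dog$'      # True: the string ends in 'dog'

# With a group, both anchors apply to each alternative.
$strict = 'hotdog' -match '^(cat|dog)$'    # False
$exact  = 'dog'    -match '^(cat|dog)$'    # True
```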
Table B.7. Backreference constructs: Expressions that refer to a capture group within the expression
| Backreference construct | Refers to |
|---|---|
| `\number` | Group number `number` in the expression. `PS > "\|Text\|" -match '(.)Text\1'` returns `True`. `PS > "\|Text+" -match '(.)Text\1'` returns `False`. |
| `\k<name>` | The group named `name` in the expression. `PS > "\|Text\|" -match '(?<Symbol>.)Text\k<Symbol>'` returns `True`. `PS > "\|Text+" -match '(?<Symbol>.)Text\k<Symbol>'` returns `False`. |
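A common use of backreferences is spotting an accidentally doubled word. The following is a minimal sketch with a made-up sentence; the `\b` anchors keep the pattern from matching the `is` inside `This`.

```powershell
$text = 'This is is a test'   # hypothetical input

# \1 must repeat whatever the first group captured.
$found = $text -match '\b(\w+)\s+\1\b'              # True
$doubled = $Matches[1]                              # is

# The same pattern with $1 in the replacement collapses the pair.
$fixed = $text -replace '\b(\w+)\s+\1\b', '$1'      # This is a test
```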
Table B.8. Other constructs: Other expressions that modify a regular expression
| Construct | Description |
|---|---|
| `(?imnsx-imnsx)` | Applies or disables the given option for the rest of this expression. Supported options are: `i` (case-insensitive), `m` (multiline), `n` (explicit capture), `s` (singleline), and `x` (ignore whitespace). `` PS > "Te`nst" -match '(?sx)T e.st' `` returns `True`. |
| `(?# )` | Inline comment. This terminates at the first closing parenthesis. `PS > "Test" -match "(?# Match 'Test')Test"` returns `True`. |
| `# [to end of line]` | Comment form allowed when the regular expression has the IgnoreWhitespace option enabled. `PS > "Test" -match '(?x)Test # Matches Test'` returns `True`. |
The list of options in the first row looks a bit weird, given that it looks like the example code below, even though it's actually a list. Don't know how that'd apply to the final printed version, though. The usual definition list might be strange too, since the definition terms are just single letters.
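The `x` option and end-of-line comments are most useful when a pattern is spread over several lines. Here is one way that might look, using a single-quoted here-string; the phone-number format and variable names are assumptions for the example.

```powershell
# With (?x), unescaped whitespace in the pattern is ignored and
# everything from # to the end of the line is a comment.
$pattern = @'
(?x)            # ignore pattern whitespace, allow comments
^(\d{3})        # three-digit prefix
-               # literal separator
(\d{4})$        # four-digit line number
'@

$ok = '555-1212' -match $pattern    # True
$prefix = $Matches[1]               # 555
$line   = $Matches[2]               # 1212
```

A here-string avoids quoting problems and keeps each piece of the pattern next to its explanation.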
Table B.9. Character escapes: Character sequences that represent another character
| Escaped character | Match |
|---|---|
| `<ordinary characters>` | Characters other than `\ . $ ^ { [ ( \| ) * + ?` match themselves. |
| `\a` | A bell (alarm) `\u0007`. |
| `\b` | A backspace `\u0008`, when used inside a `[]` character class. |
| `\t` | A tab `\u0009`. |
| `\r` | A carriage return `\u000D`. |
| `\v` | A vertical tab `\u000B`. |
| `\f` | A form feed `\u000C`. |
| `\n` | A new line `\u000A`. |
| `\e` | An escape `\u001B`. |
| `\ddd` | An ASCII character as octal (up to three digits). Numbers with no leading zero are treated as backreferences if they have only one digit, or if they correspond to a capturing group number. |
| `\xdd` | An ASCII character using hexadecimal representation (exactly two digits). |
| `\cC` | An ASCII control character; for example, `\cC` is control-C. |
| `\udddd` | A Unicode character using hexadecimal representation (exactly four digits). |
| `\` | When followed by a character that is not recognized as an escaped character, matches that character. For example, `\*` is the literal character `*`. |
last row: “For example, \*is the literal character *.” → “For example, \* is the literal character *.” (missing space after \*)
Row \b: “[]” should be formatted as code (twice). “b” as well (once; the other occurrences are ok).
Row \cC: “cC” should be formatted as code.
Last row, first column: \ should be formatted as code. The difference is subtle but noticeable ;-) (ok, very subtle)
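A few of these escapes can be checked directly at the prompt. The sketch below also uses the .NET `[regex]::Escape()` method, which is not part of the table but is a convenient way to produce the backslash escapes for arbitrary literal text.

```powershell
$tab     = "`t" -match '\t'                 # True: regex \t matches a tab
$hex     = 'A' -match '\x41'                # True: U+0041 is 'A'
$unicode = 'A' -match '\u0041'              # True
$control = [string][char]3 -match '\cC'     # True: control-C is U+0003

# A backslash before a metacharacter makes it literal.
$star    = '*' -match '\*'                  # True
$notStar = 'a' -match '\*'                  # False

# [regex]::Escape() adds the backslashes for you.
$escapedText = [regex]::Escape('1+1=2?')    # 1\+1=2\?
```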