China DOS Union

-- Unite DOS · Advance DOS · Grow DOS --

Union site: www.cn-dos.net Forum site: www.cn-dos.net/forum
DOS stands for freedom, openness and progress. Let us work hard, learn from the openness and GNU spirit of FreeDOS and Linux, and together build and grow a free GNU GPL world!

[Recommendation] Collection of Articles on Regular Expressions (Regex Digest I)
Original Poster Posted 2006-10-26 11:42 · Ningbo, Zhejiang, China (Dr. Peng Broadband)
Honorary Moderator
★★★
Credits 1,338
Posts 356
Joined 2005-07-15 12:09
20-year member
UID 40733
Gender Male
Status Offline
Post some articles about regular expressions that I collected. I will modify this post to add resources and related links.

Regular expression library http://regexlib.com/default.aspx
Recommended online regular expression verification http://osteele.com/tools/rework/#
Online regular expression demonstration http://osteele.com/tools/reanimator/
Online regular expression verification (Chinese) http://www.regexlab.com/zh/workshop.asp
RegexBuddy, the best regular expression learning and verification tool: http://www.regexbuddy.com/

Post these first, and I will supplement them when I think of them.

[ Last edited by 无奈何 on 2006-10-26 at 12:57 PM ]
  ☆Start\Run (WIN+R)☆
%ComSpec% /cset,=何奈无── 。何奈可无是原,事奈无做人奈无&for,/l,%i,in,(22,-1,0)do,@call,set/p= %,:~%i,1%<nul&ping/n 1 127.1>nul

Floor 2 Posted 2006-10-26 11:42 · Ningbo, Zhejiang, China (Dr. Peng Broadband)
Regular Expression Syntax
JScript and VBScript Regular Expressions

Regular Expression Syntax
A regular expression is a text pattern composed of ordinary characters (such as characters a to z) as well as special characters (called metacharacters). This pattern describes one or more strings to be matched when searching the text body. A regular expression acts as a template, matching a certain character pattern with the string being searched.
Here are some examples of regular expressions that may be encountered:
| JScript | VBScript | Match |
| /^[ \t]*$/ | "^[ \t]*$" | Match a blank line. |
| /\d{2}-\d{5}/ | "\d{2}-\d{5}" | Verify whether an ID number consists of a 2-digit number, a hyphen, and a 5-digit number. |
| /<(.*)>.*<\/\1>/ | "<(.*)>.*<\/\1>" | Match an HTML tag. |
The following table is a complete list of metacharacters and their behaviors in the context of regular expressions:
Character
Description
\
Mark the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and "\(" matches "(".
^
Match the start position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$
Match the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*
Match the preceding subexpression zero or more times. For example, zo* can match "z" and "zoo". * is equivalent to {0,}.
+
Match the preceding subexpression one or more times. For example, 'zo+' can match "zo" and "zoo", but not "z". + is equivalent to {1,}.
?
Match the preceding subexpression zero or one time. For example, "do(es)?" can match "do" or "does". ? is equivalent to {0,1}.
{n}
n is a non-negative integer. Match exactly n times. For example, 'o{2}' cannot match 'o' in "Bob", but can match the two o's in "food".
{n,}
n is a non-negative integer. Match at least n times. For example, 'o{2,}' cannot match 'o' in "Bob", but can match all o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}
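m and n are both non-negative integers, where n <= m. Match at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.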
?
When this character immediately follows any other qualifier (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. The non-greedy mode matches the searched string as few times as possible, while the default greedy mode matches the searched string as many times as possible. For example, for the string "oooo", 'o+?' will match a single "o", while 'o+' will match all 'o's.
.
Match any single character except "\n". To match any character including '\n', use a pattern like '[\s\S]'.
(pattern)
Match pattern and capture this match. The captured match can be obtained from the resulting Matches collection. In VBScript, use the SubMatches collection, and in JScript, use the $0…$9 properties. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)
Match pattern but do not capture the match result, that is, this is a non-capturing match and is not stored for later use. This is useful when using the "or" character (|) to combine parts of a pattern. For example, 'industr(?:y|ies)' is a more concise expression than 'industry|industries'.
(?=pattern)
Positive lookahead: matches at any position where the text that follows matches pattern. This is a non-capturing match, that is, the match is not captured for later use. For example, 'Windows (?=95|98|NT|2000)' can match "Windows" in "Windows 2000", but cannot match "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)
Negative lookahead: matches at any position where the text that follows does not match pattern. This is a non-capturing match, that is, the match is not captured for later use. For example, 'Windows (?!95|98|NT|2000)' can match "Windows" in "Windows 3.1", but cannot match "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.
x|y
Match x or y. For example, 'z|food' can match "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]
Character set. Match any of the included characters. For example, '[abc]' can match 'a' in "plain".
[^xyz]
Negative character set. Match any character not included. For example, '[^abc]' can match 'p' in "plain".
[a-z]
Character range. Match any character within the specified range. For example, '[a-z]' can match any lowercase letter character within the range 'a' to 'z'.
[^a-z]
Negative character range. Match any character not within the specified range. For example, '[^a-z]' can match any character not within the range 'a' to 'z'.
\b
Match a word boundary, that is, the position between a word and a space. For example, 'er\b' can match 'er' in "never", but cannot match 'er' in "verb".
\B
Match a non-word boundary. 'er\B' can match 'er' in "verb", but cannot match 'er' in "never".
\cx
Match the control character indicated by x . For example, \cM matches a Control-M or carriage return. x must be one of A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d
Match a digit character. Equivalent to [0-9].
\D
Match a non-digit character. Equivalent to [^0-9].
\f
Match a form feed. Equivalent to \x0c and \cL.
\n
Match a newline character. Equivalent to \x0a and \cJ.
\r
Match a carriage return. Equivalent to \x0d and \cM.
\s
Match any whitespace character, including space, tab, form feed, etc. Equivalent to [ \f\n\r\t\v].
\S
Match any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t
Match a tab character. Equivalent to \x09 and \cI.
\v
Match a vertical tab character. Equivalent to \x0b and \cK.
\w
Match any word character including underscore. Equivalent to '[A-Za-z0-9_]'.
\W
Match any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn
Match n, where n is a hexadecimal escape value. The hexadecimal escape value must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' & "1". ASCII encoding can be used in regular expressions.
\num
Match num, where num is a positive integer. A reference to the captured match. For example, '(.)\1' matches two consecutive identical characters.
\n
Identify an octal escape value or a backreference. If there are at least n captured subexpressions before \n, then n is a backreference. Otherwise, if n is an octal digit (0-7), then n is an octal escape value.
\nm
Identify an octal escape value or a backreference. If there are at least nm obtained subexpressions before \nm, then nm is a backreference. If there are at least n captures before \nm, then n is a backreference followed by the literal m . If none of the above conditions are met, and n and m are both octal digits (0-7), then \nm will match the octal escape value nm.
\nml
If n is an octal digit (0-3), and m and l are both octal digits (0-7), then match the octal escape value nml.
\un
Match n, where n is a Unicode character represented by four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
2001 Microsoft Corporation. All rights reserved.
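A few rows of the table above can be tried out directly. The sketch below uses Python's `re` module, which is an assumption on my part (the table itself describes JScript and VBScript); for these particular constructs the syntax is the same.

```python
import re

# Greedy vs. non-greedy quantifiers on the string "oooo".
greedy = re.search(r"o+", "oooo").group()   # takes as many o's as possible
lazy = re.search(r"o+?", "oooo").group()    # takes as few o's as possible

# Positive lookahead: "Windows " is matched only if a listed version follows.
lookahead = re.compile(r"Windows (?=95|98|NT|2000)")
hit = lookahead.search("Windows 2000")
miss = lookahead.search("Windows 3.1")

# Backreference: (.)\1 matches two consecutive identical characters.
doubled = re.search(r"(.)\1", "food").group()

print(repr(greedy), repr(lazy), repr(hit.group()), miss, repr(doubled))
```

Note that the lookahead match contains only "Windows " (with the trailing space); the version number is checked but not consumed.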

[ Last edited by 无奈何 on 2006-10-26 at 11:45 AM ]

Floor 3 Posted 2006-10-26 11:43 · Ningbo, Zhejiang, China (Dr. Peng Broadband)
### A Concise Explanation of Regular Expressions (Part 1)
Original post address: http://dragon.cnblogs.com/archive/2006/05/08/394078.html

Foreword:
Half a year ago, I became interested in regular expressions. I searched a lot of materials on the Internet and read many tutorials. Finally, when I used a regular expression tool RegexBuddy, I found that its tutorials were written very well, which can be said to be the best regular expression tutorial I have seen so far. So I have always wanted to translate it. This wish was only realized during this May Day holiday, and this article came into being. Regarding the name of this article, using "A Concise Explanation" seems to be too common. But after reading the original text thoroughly, I think that only "A Concise Explanation" can accurately express the feeling I got from this tutorial, so I can't avoid being common.
This article is a translation of the tutorial written by Jan Goyvaerts for RegexBuddy. The copyright belongs to the original author. Reprinting is welcome. But in order to respect the labor of the original author and the translator, please indicate the source! Thank you!

1. What is a Regular Expression
Basically, a regular expression is a pattern used to describe a certain amount of text. Regex stands for Regular Expression. In this article, <<regex>> will be used to represent a specific regular expression.
A piece of literal text is the most basic pattern; it simply matches that same text.
2. Different Regular Expression Engines
A regular expression engine is software that can process regular expressions. Usually, the engine is part of a larger application. In the software world, different regular expression engines are not fully compatible with each other. This tutorial focuses on the Perl 5 style of engine because it is the most widely used. We will also mention some differences from other engines. Many modern engines, for example the .NET regular expression library and the JDK regular expression package, are similar but not exactly the same.
3. Literal Characters
The most basic regular expression consists of a single literal character, for example <<a>>. It matches the first occurrence of that character in the string. For the string "Jack is a boy", the "a" after the "J" is matched; the second "a" is not.
The regular expression can match the second "a" as well, but only if you tell the regular expression engine to start searching after the first match. In a text editor, you can use "Find Next". In a programming language, there is usually a function that lets you continue searching from the position of the previous match.
Similarly, <<cat>> will match "cat" in "About cats and dogs". This tells the regular expression engine to find a <<c>>, immediately followed by an <<a>>, and then a <<t>>.
Note that regular expression engines are case-sensitive by default. Unless you tell the engine to ignore case, <<cat>> will not match "Cat".
· Special Characters
For literal characters, 11 characters are reserved for special purposes. They are:
[ \ ^ $ . | ? * + ( )
These special characters are also called metacharacters.
If you want to use these characters as literal characters in a regular expression, you need to escape them with a backslash "\". For example, if you want to match "1+1=2", the correct expression is <<1\+1=2>>.
Note that <<1+1=2>> is also a valid regular expression. However, it will not match "1+1=2"; instead it will match "111=2" in "123+111=234", because "+" here has a special meaning (repeat one or more times).
In a programming language, note that some special characters are processed by the compiler first and only then passed to the regular expression engine. Therefore, the regular expression <<1\+1=2>> must be written as "1\\+1=2" in C++. To match "C:\temp", you need the regular expression <<C:\\temp>>, and in C++ this becomes "C:\\\\temp".
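The same two layers of escaping can be seen in other languages. Here is a small sketch in Python (my choice of language, not the tutorial's): a raw string `r"..."` passes backslashes to the regex engine untouched, while a normal string literal needs them doubled, just like the C++ case above.

```python
import re

text = "123+111=234"

# Unescaped: "+" is a quantifier, so this means "one or more 1s, then 1=2".
unescaped = re.search("1+1=2", text)

# Escaped: "\+" is a literal plus sign, so nothing in `text` matches.
escaped_on_text = re.search(r"1\+1=2", text)

# Raw string vs. normal string literal: both produce the regex 1\+1=2.
escaped_raw = re.search(r"1\+1=2", "1+1=2")
escaped_plain = re.search("1\\+1=2", "1+1=2")

# A literal backslash is \\ in the regex, hence four backslashes in a
# normal string literal. The subject string "C:\\temp" is the text C:\temp.
path = re.search("C:\\\\temp", "C:\\temp")

print(unescaped.group(), escaped_on_text, escaped_raw.group(), path.group())
```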
· Invisible Characters
Special character sequences can be used to represent some invisible characters:
<<\t>> represents Tab (0x09)
<<\r>> represents the carriage return character (0x0D)
<<\n>> represents the newline character (0x0A)
Note that text files on Windows use "\r\n" to end a line, while Unix uses "\n".
4. Internal Working Mechanism of Regular Expression Engines
Knowing how the regular expression engine works will help you quickly understand why a certain regular expression doesn't work as you expect.
There are two types of engines: text-directed engines and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines. This article is about regex-directed engines. This is because some very useful features, such as "lazy" quantifiers and backreferences, can only be implemented in regex-directed engines. So it is no surprise that this engine is currently the most popular engine.
You can easily tell whether the engine you are using is text-directed or regex-directed. If backreferences or "lazy" quantifiers are implemented, you can be sure the engine is regex-directed. You can also run the following test: apply the regular expression <<regex|regex not>> to the string "regex not". If the matching result is "regex", the engine is regex-directed. If the result is "regex not", it is text-directed. This is because the regex-directed engine is "eager": it reports the first match it finds.
· The regex-directed engine always returns the leftmost match
This is an important point you need to understand: even if there may be a "better" match later, the regex-directed engine always returns the leftmost match.
When applying <<cat>> to "He captured a catfish for his cat", the engine first compares <<c>> with "H", and fails. Then the engine compares <<c>> with "e", and fails again. This continues until the fourth character, where <<c>> matches "c". <<a>> then matches the fifth character. At the sixth character, <<t>> fails to match "p". The engine goes back and retries the match starting from the fifth character. Not until the fifteenth character does <<cat>> match the "cat" in "catfish", and the regular expression engine eagerly returns this first match without searching any further for other, better matches.
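Both behaviours described above (leftmost match, and eagerness between alternatives) are easy to confirm. A minimal sketch in Python, whose `re` module is a regex-directed engine; the variable names are only for illustration.

```python
import re

text = "He captured a catfish for his cat"

# search() returns the leftmost match: the "cat" inside "catfish".
first = re.search(r"cat", text)
# start() is 0-based, so index 14 is the fifteenth character.
print(first.start(), first.group())

# The engine test from the tutorial: a regex-directed engine stops at
# the first alternative that succeeds.
eager = re.search(r"regex|regex not", "regex not").group()
print(eager)
```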
5. Character Sets
A character set is a collection of characters enclosed in a pair of square brackets "[]". Using a character set, you can tell the regular expression engine to match only one of several characters. If you want to match an "a" or an "e", use <<[ae]>>. You can use <<gr[ae]y>> to match gray or grey. This is especially useful when you are not sure whether the text you are searching uses American or British spelling. Conversely, <<gr[ae]y>> will not match graay or graey. The order of the characters in a character set does not matter; the result is the same.
You can use the hyphen "-" to define a range of characters in a character set. <<[0-9]>> matches a single digit from 0 to 9. You can use more than one range: <<[0-9a-fA-F]>> matches a single hexadecimal digit, in either case. You can also combine ranges with single characters: <<[0-9a-fxA-FX]>> matches a hexadecimal digit or the letter X. Again, the order of characters and ranges has no effect on the result.
· Some Applications of Character Sets
Find a word that may be misspelled, such as <<sep[ae]r[ae]te>> or <<li[cs]en[cs]e>>.
Find programming language identifiers, <<[A-Za-z_][A-Za-z_0-9]*>>. (* means repeat zero or more times)
Find C-style hexadecimal numbers, <<0[xX][A-Fa-f0-9]+>>. (+ means repeat one or more times)
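As a rough illustration of these applications, here is a sketch in Python. The three patterns are the usual textbook forms for these tasks and are my assumption of what the tutorial intended; the sample strings are made up.

```python
import re

# Character-set patterns for the three applications listed above.
spelling = re.compile(r"sep[ae]r[ae]te")
identifier = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")
hex_number = re.compile(r"0[xX][A-Fa-f0-9]+")

# Accepts the correct spelling and a common misspelling, but not "seporate".
candidates = ["separate", "seperate", "seporate"]
accepted = [w for w in candidates if spelling.fullmatch(w)]

identifiers = identifier.findall("total = _count2 + x")
hex_numbers = hex_number.findall("0x1F + 0Xab")

print(accepted, identifiers, hex_numbers)
```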
· Negated Character Sets
Placing a caret "^" immediately after the left square bracket "[" negates the character set: the set then matches any single character that is not listed in it.
· Metacharacters in Character Sets
Inside a character set definition, the only metacharacters are "] \ ^ -". "]" represents the end of the character set definition; "\" represents escape; "^" represents negation; "-" represents range definition. Other common metacharacters are normal characters inside the character set definition and do not need to be escaped. For example, to search for an asterisk * or a plus sign +, you can use <<[+*]>>. Of course, if you do escape those usual metacharacters, your regular expression will still work, but it will be less readable.
In a character set definition, to use the backslash "\" as a literal character rather than as an escape, you need to escape it with another backslash. <<[\\x]>> will match a backslash or an x. "] ^ -" can each be escaped with a backslash, or placed in a position where they cannot take on their special meaning. We recommend the latter because it improves readability. For example, the caret "^" is a literal character anywhere except immediately after the left bracket "[", so <<[x^]>> will match an "x" or a "^". A hyphen placed first or last in the set is likewise literal: <<[-x]>> or <<[x-]>> will match a "-" or an "x".
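To see which characters keep a special meaning inside a character set, the following sketch (Python, my assumption; the sample strings are made up) tries a few sets where the metacharacters are placed so that they are taken literally.

```python
import re

# "+" and "*" are ordinary characters inside a character set.
plus_or_star = re.findall(r"[+*]", "1+2*3")

# A literal backslash inside a set still has to be escaped.
# The subject string "a\\x" is the three characters a, \, x.
backslash_or_x = re.findall(r"[\\x]", "a\\x")

# A caret that is not right after "[" is literal.
x_or_caret = re.findall(r"[x^]", "x^y")

# A hyphen at the start or the end of the set is literal.
hyphen_first = re.findall(r"[-x]", "a-x")
hyphen_last = re.findall(r"[x-]", "a-x")

print(plus_or_star, backslash_or_x, x_or_caret, hyphen_first, hyphen_last)
```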
· Shorthand for Character Sets
Because some character sets are very common, there are shorthand forms for them.
<<\d>> represents <<[0-9]>>;
<<\w>> represents word characters. This varies between regular expression implementations. In most implementations, the word character set includes <<[A-Za-z0-9_]>>.
<<\s>> represents whitespace characters. This also depends on the implementation. In most implementations, it includes the space, the Tab character, and the carriage return and newline characters: <<[ \t\r\n]>>.
The shorthand forms can be used inside or outside square brackets. <<\s\d>> matches a whitespace character followed by a digit. <<[\s\d]>> matches a single whitespace character or digit. <<[\da-fA-F]>> matches a hexadecimal digit.
Shorthand for negated character sets
<<\D>> = <<[^\d]>>
<<\W>> = <<[^\w]>>
<<\S>> = <<[^\s]>>
· Repetition of Character Sets
If you use one of the "?", "*" or "+" operators to repeat a character set, you repeat the entire character set, not just the character it happened to match. The regular expression <<[0-9]+>> will match both 837 and 222.
If you only want to repeat the matched character, you can achieve it with backreferences. We will talk about backreferences later.
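The difference between repeating the set and repeating the matched character can be sketched as follows (Python; the backreference pattern `([0-9])\1+` is my own illustration of the idea, since the tutorial only covers backreferences later).

```python
import re

# Repeats the whole set: each position may be any digit.
any_digits = re.compile(r"[0-9]+")

# Repeats the matched character: \1 must equal whatever group 1 captured.
same_digit = re.compile(r"([0-9])\1+")

results = {
    "837": (bool(any_digits.fullmatch("837")), bool(same_digit.fullmatch("837"))),
    "222": (bool(any_digits.fullmatch("222")), bool(same_digit.fullmatch("222"))),
}
print(results)
```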
6. Using ?, * or + for Repetition
? : Tells the engine to match the preceding character 0 or 1 times. In fact, it means that the preceding character is optional.
+ : Tells the engine to match the preceding character 1 or more times
* : Tells the engine to match the preceding character 0 or more times
To match an HTML tag without attributes, use <<<[A-Za-z][A-Za-z0-9]*>>>. The angle brackets "<" and ">" are literal characters. The first character set matches a letter, and the second character set matches a letter or digit.
It might seem that we could also use <<<[A-Za-z0-9]+>>>, but that would also match <1>. Even so, this regular expression is good enough when you know that the string you are searching does not contain invalid tags like that.
· Restrictive Repetition
Many modern regular expression implementations allow you to define how many times a character is repeated. The syntax is: {min,max}. min and max are non-negative integers. If there is a comma and max is omitted, then max is unlimited. If the comma and max are both omitted, then repeat min times.
Therefore, {0,} is the same as *, and {1,} is the same as +.
You can use <<\b[1-9][0-9]{3}\b>> to match numbers between 1000 and 9999 ("\b" means word boundary). <<\b[1-9][0-9]{2,4}\b>> matches a number between 100 and 99999.
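A quick check of the {min,max} syntax, sketched in Python. The two patterns are the standard way to express these number ranges and are stated here as an assumption; the test string is made up.

```python
import re

# Exactly four digits with a non-zero first digit: 1000 to 9999.
four_digits = re.compile(r"\b[1-9][0-9]{3}\b")

# Three to five digits with a non-zero first digit: 100 to 99999.
three_to_five = re.compile(r"\b[1-9][0-9]{2,4}\b")

text = "7 42 100 999 1000 9999 12345 123456"
print(four_digits.findall(text))
print(three_to_five.findall(text))
```

The word boundaries are what stop the patterns from matching part of a longer number such as 123456.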
· Note on Greediness
Suppose you want to use a regular expression to match an HTML tag. You know that the input will be a valid HTML file, so the regular expression does not need to exclude invalid tags. Anything between a pair of angle brackets can therefore be treated as an HTML tag.
Many novice regular expression users will first think of the regular expression <<<.+>>>, and they will be surprised by the result. For the test string "This is a <EM>first</EM> test", you might expect it to return <EM>, and then, when continuing to match, </EM>.
But that is not what happens. The regular expression will match "<EM>first</EM>". Obviously this is not what we want. The reason is that "+" is greedy. That is to say, "+" causes the regular expression engine to repeat the preceding character as many times as possible. Only when this repetition would cause the entire regular expression to fail does the engine backtrack. That is, it gives up the last "repetition" and then processes the remaining part of the regular expression.
Like "+", the quantifiers "*", "?" and "{}" are also greedy.
· Deep into the Regular Expression Engine
Let's see how the regular expression engine matches the previous example. The first token is "<", a literal character, which matches the first "<" in the string. The second token is ".", which matches "E". Because "." is repeated by the greedy "+", the engine keeps repeating it until "." fails on the newline character at the end of the line. So far, "<.+" has matched "<EM>first</EM> test". The engine then tries to match ">" with the newline character, and fails. So the engine backtracks: "<.+" now matches "<EM>first</EM> tes", and the engine tries to match ">" with "t". Obviously this fails too. The process continues until "<.+" matches "<EM>first</EM" and ">" matches ">". So the engine finds the match "<EM>first</EM>". Remember, the regex-directed engine is "eager": it reports the first match it finds and does not continue backtracking, even if there may be a better match, such as "<EM>". So we can see that, because "+" is greedy, the regular expression engine returns the leftmost, longest match.
· Replace Greediness with Laziness
A possible fix for the above problem is to use the lazy version of "+" instead of the greedy one. You do this by adding a question mark "?" after the "+". The repetitions expressed by "*", "{}", and "?" can be made lazy in the same way. So in the above example, we can use "<.+?>". Let's look at how the regular expression engine processes it.
Again, the regular expression token "<" matches the first "<" in the string. The next token is ".", this time with a lazy repetition, so the engine repeats it as few times as possible: "." matches "E" once. The engine then tries to match ">" with "M", and fails. The engine backtracks. Unlike in the previous example, because the repetition is lazy, the engine expands the repetition instead of reducing it, so "<.+" now matches "<EM". The engine then matches ">" with the ">" in the string, and this time the match succeeds. The engine reports "<EM>" as a successful match. The whole process is roughly like this.
· An Alternative to Lazy Expansion
There is also a better alternative. You can use a greedy repetition with a negated character set: "<[^>]+>". This is the better solution because, with lazy repetition, the engine backtracks for each character before it finds a successful match, whereas with a negated character set it does not need to backtrack at all.
Finally, it should be remembered that this tutorial only talks about regex-directed engines. Text-directed engines do not backtrack. But at the same time they also do not support lazy repetition operations.
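The three approaches from this section (greedy, lazy, and negated character set) can be compared side by side. A minimal sketch in Python; the test string with `<EM>` tags is assumed to be the one the tutorial uses.

```python
import re

text = "This is a <EM>first</EM> test"

# Greedy: ".+" runs to the end and backtracks to the LAST ">".
greedy = re.search(r"<.+>", text).group()

# Lazy: ".+?" expands one character at a time and stops at the FIRST ">".
lazy = re.findall(r"<.+?>", text)

# Negated character set: "[^>]+" can never run past a ">", so no
# backtracking is needed to find the end of the tag.
negated = re.findall(r"<[^>]+>", text)

print(greedy)
print(lazy)
print(negated)
```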
7. Using "." to Match Almost Any Character
In regular expressions, "." is one of the most commonly used symbols. Unfortunately, it is also one of the most easily misused.
"." matches a single character regardless of what that character is. The only exception is the newline character. In the engines discussed in this tutorial, "." does not match the newline character by default. Therefore, by default, "." is shorthand for the character set [^\n\r] (Windows) or [^\n] (Unix).
This exception is due to historical reasons. Because the early tools using regular expressions were line-based. They all read a file line by line and applied regular expressions to each line separately. In these tools, the string does not contain newline characters. Therefore, "." never matches a newline character.
Modern tools and languages can apply regular expressions to very large strings or even entire files. All regular expression implementations discussed in this tutorial provide an option that makes "." match all characters, including newline characters. In tools such as RegexBuddy, EditPad Pro or PowerGREP, you can simply select "Dot matches newline". In Perl, the mode in which "." matches newline characters is called "single-line mode". Unfortunately, this is a confusing term, because there is also a so-called "multiline mode". Multiline mode only affects the anchors for the beginning and end of a line, while single-line mode only affects ".".
Other languages and regular expression libraries have adopted Perl's terminology. When using the regular expression classes in the .NET Framework, you can activate single-line mode with a statement like the following: Regex.Match("string", "regex", RegexOptions.Singleline)
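For comparison, here is the same option in Python, where Perl's "single-line mode" is called `re.DOTALL` (alias `re.S`). This is a sketch with a made-up two-line string, not something taken from the tutorial.

```python
import re

text = "first line\nsecond line"

# Default: "." stops at the newline, so the match stays on the first line.
default = re.search(r"first.*line", text).group()

# re.DOTALL: "." also matches "\n", so the greedy ".*" spans both lines.
dotall = re.search(r"first.*line", text, flags=re.DOTALL).group()

print(repr(default))
print(repr(dotall))
```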
· Conservative Use of Dot "."
The dot can be said to be the most powerful metacharacter. It allows you to be lazy: with a dot, you can match almost all characters. But the problem is that it often matches characters that should not be matched.
I will use a simple example to illustrate. Let's see how to match a date in the "mm/dd/yy" format while allowing the user to choose the delimiter. A solution that quickly comes to mind is <<\d\d.\d\d.\d\d>>. It appears to work, since it matches the date "02/12/03". The problem is that "02512703" will also be considered a valid date.
<<\d\d[-/.]\d\d[-/.]\d\d>> seems to be a better solution. Remember that the dot is not a metacharacter inside a character set. This solution is still far from perfect, because it will match "99/99/99". <<[0-1]\d[-/.][0-3]\d[-/.]\d\d>> goes a step further, although it will still match "19/39/99". How perfect your regular expression needs to be depends on what you want to achieve. If you want to validate user input, it needs to be as strict as possible. If you only want to parse a known source that you know contains no bad data, a regular expression that is merely good enough to match the characters you are looking for will do.

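The progression of date patterns in the section on the dot can be checked with a short sketch (Python). The three patterns are my reading of what the tutorial describes, so treat them as assumptions.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")                    # any delimiter at all
separators = re.compile(r"\d\d[-/.]\d\d[-/.]\d\d")       # only - / . allowed
stricter = re.compile(r"[0-1]\d[-/.][0-3]\d[-/.]\d\d")   # limits first digits

samples = ["02/12/03", "02512703", "99/99/99", "19/39/99"]
for name, pattern in [("loose", loose), ("separators", separators),
                      ("stricter", stricter)]:
    accepted = [s for s in samples if pattern.fullmatch(s)]
    print(name, accepted)
```

Even the strictest of the three still accepts "19/39/99", which is the point the tutorial makes about how far to take validation.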
8. Anchoring at the Start and End of the String
Anchors are different from ordinary regular expression symbols. They do not match any characters. Instead, they match a position before or after a character. "^" matches the position before the first character of the string. <<^a>> will match "a" in the string "abc". <<^b>> will not match anything in "abc".
Similarly, "$" matches the position after the last character in the string. So <<c$>> matches "c" in "abc".
· Application of Anchors
When validating user input in a programming language, using anchors is very important. If you want to verify that the user's input is an integer, use <<^\d+$>>.
User input often contains redundant leading or trailing spaces. You can use <<^\s+>> and <<\s+$>> to match leading or trailing whitespace.
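A small sketch of both uses in Python. The helper names `is_integer` and `trim` are hypothetical, and the patterns are the common forms for these tasks rather than anything confirmed by the tutorial.

```python
import re

def is_integer(text):
    # ^ and $ anchor the pattern, so the digits must span the whole input.
    return re.search(r"^\d+$", text) is not None

def trim(text):
    # Remove leading whitespace (^\s+) or trailing whitespace (\s+$).
    return re.sub(r"^\s+|\s+$", "", text)

# Without anchors, "12a45" would be accepted because it contains digits.
unanchored = re.search(r"\d+", "12a45") is not None

print(is_integer("12345"), is_integer("12a45"), unanchored)
print(repr(trim("  hello world \t")))
```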
· Using "^" and "$" as Line Start and End Anchors
Suppose you have a string containing multiple lines, for example "first line\nsecond line" (where \n represents a newline character). Often each line needs to be processed separately rather than the string as a whole. Therefore, almost all regular expression engines provide an option that extends the meaning of these two anchors. "^" then matches at the start of the string (before the f) and after each newline character (between \n and s). Similarly, "$" matches at the end of the string (after the last e) and before each newline character (between e and \n).
In .NET, the following code makes the anchors match before and after each newline character: Regex.Match("string", "regex", RegexOptions.Multiline)
Application: string str = Regex.Replace(Original, "^", "> ", RegexOptions.Multiline) will insert "> " at the beginning of each line.
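The .NET application above has a direct counterpart in Python, where the option is `re.MULTILINE`. A minimal sketch with a made-up two-line string:

```python
import re

original = "first line\nsecond line"

# With re.MULTILINE, "^" matches at the start of the string and after
# every newline, so "> " is inserted in front of each line.
quoted = re.sub(r"^", "> ", original, flags=re.MULTILINE)

# Without the flag, "^" matches only at the very start of the string.
first_only = re.sub(r"^", "> ", original)

print(quoted)
print(first_only)
```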
· Absolute Anchors
<<\A>> only matches the start position of the entire string, and <<\Z>> only matches the end position of the entire string. Even in "multiline mode", <<\A>> and <<\Z>> never match at line breaks inside the string.
Even though \Z and $ only match the end position of the string (when multiline mode is off), there is one exception. If the string ends with a newline character, \Z and $ will match at the position before that newline character as well as at the very end of the string. This "improvement" was introduced by Perl and then followed by many regular expression implementations, including Java, .NET, etc. If you apply <<^[a-z]+$>> to "joe\n", the matching result will be "joe" instead of "joe\n".
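The trailing-newline exception can be observed in Python as well, with one caveat: in Python's `re` module `\Z` matches only at the very end of the string, so it behaves more strictly than the Perl-style `\Z` described above. The sketch below shows both points.

```python
import re

# "$" also matches just before a newline at the end of the string.
dollar = re.search(r"^[a-z]+$", "joe\n")

# Python's \Z matches only at the very end, so this finds nothing.
strict_end = re.search(r"^[a-z]+\Z", "joe\n")

# \A ignores multiline mode; "^" does not.
text = "one\ntwo"
with_caret = re.findall(r"^\w+", text, flags=re.MULTILINE)
with_A = re.findall(r"\A\w+", text, flags=re.MULTILINE)

print(repr(dollar.group()), strict_end, with_caret, with_A)
```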

[ Last edited by 无奈何 on 2006-10-26 at 11:58 AM ]

Floor 4 Posted 2006-10-26 11:43 · Ningbo, Zhejiang, China (Dr. Peng Broadband)
### A Concise Explanation of Regular Expressions (Part 2)
Original post address: http://dragon.cnblogs.com/archive/2006/05/09/394923.html
Foreword:
This article is a sequel to the previous article, "A Concise Explanation of Regular Expressions (Part 1)". It covers groups and backreferences, positive and negative lookaheads, conditional tests, word boundaries, the alternation operator and more, with examples, and analyzes what the regular expression engine does internally when it performs a match.
This article is a translation of a tutorial written by Jan Goyvaerts for RegexBuddy. The copyright belongs to the original author. Reprinting is welcome. But in order to respect the labor of the original author and the translator, please indicate the source! Thank you!

9. Word Boundaries

The metacharacter \b is also an "anchor" that matches positions. Such a match is a zero-length match.
There are 4 positions considered as "word boundaries":
1) The position before the first character in the string (if the first character of the string is a "word character")
2) The position after the last character in the string (if the last character of the string is a "word character")
3) Between a "word character" and a "non-word character", where the "non-word character" immediately follows the "word character"
4) Between a "non-word character" and a "word character", where the "word character" immediately follows the "non-word character"
A "word character" is a character that can be matched by \w, and a "non-word character" is a character that can be matched by \W. In most regular expression implementations, "word characters" include letters, digits, and the underscore _.
For example, \b4\b can match a single 4 but not a 4 that is part of a larger number. This regular expression will not match the 4 in "44".
In other words, it can almost be said that \b matches the positions at the start and end of an "alphanumeric sequence".
The complement of a "word boundary" is \B, which matches positions between two "word characters" or between two "non-word characters".
· Delving into the Inside of the Regular Expression Engine
Let's look at applying the regular expression \bis\b to the string "This island is beautiful". The engine first processes the token \b. Since \b is zero-length, the position in front of the first character T is examined. Because T is a "word character" and the character in front of it is the void before the start of the string, \b matches a word boundary here. Then the token i fails to match the first character "T". The matching process continues until the fifth character, a space, where \b matches between the fourth character "s" and the space. However, the space does not match i. Moving on to the sixth character "i", \b matches between the fifth character (the space) and "i", and then i and s match the sixth and seventh characters. However, the position before the eighth character is not a word boundary, so the second \b fails and the match fails again. At the 13th character, i, there is a word boundary with the preceding space, so \b matches, and then i and s match "is". The engine then tries to match the second \b. Because the 15th character, a space, forms a word boundary with "s", the match succeeds. The engine "eagerly" returns the result of the successful match.
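The walk-through above can be verified with a few lines of Python (my choice of language; the patterns \bis\b and \b4\b are assumed to be the ones the tutorial uses). Positions reported by Python are 0-based, so the 13th character has index 12.

```python
import re

text = "This island is beautiful"

# Every place where the letters "is" occur, with or without boundaries.
all_is = [m.start() for m in re.finditer(r"is", text)]

# Only the standalone word "is" has a word boundary on both sides.
word_is = re.search(r"\bis\b", text)

# \b4\b matches a lone 4, but not a 4 inside a larger number.
lone_four = re.search(r"\b4\b", "a 4 b")
inside_44 = re.search(r"\b4\b", "44")

print(all_is, word_is.span(), lone_four.group(), inside_44)
```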
10. Alternation Operator
The "|" in the regular expression means alternation. You can use the alternation operator to match one of several possible regular expressions.
If you want to search for the text "cat" or "dog", you can use \b(cat|dog)\b. If you want more options, you just need to expand the list \b(cat|dog|rabbit)\b.
The alternation operator has the lowest precedence of all regular expression operators: it tells the engine to match either everything to its left or everything to its right. You can use parentheses to limit the scope of the alternation. For example, in \b(cat|dog)\b the parentheses make the engine treat cat|dog as a single unit, so that both \b anchors apply to either alternative.
· Note the "eagerness to claim success" of the Regular Expression Engine
The regular expression engine is eager: as soon as it finds a valid match, it stops searching. Therefore, under certain conditions, the order of the alternatives affects the result. Suppose you want to search for the names of four functions in a programming language: Get, GetValue, Set, or SetValue. An obvious solution is Get|GetValue|Set|SetValue. Let's see what happens when searching "SetValue".
The alternatives Get and GetValue both fail, and then Set matches successfully. Because a regex-directed engine is "eager", it returns this first successful match, "Set", and does not go on to look for a better one.
Contrary to our expectation, the regular expression does not match the entire string. There are several possible solutions. One is to take the engine's "eagerness" into account and change the order of the alternatives, for example SetValue|Set|GetValue|Get, so that the longest alternative is tried first. We can also combine the four alternatives into two: Get(Value)?|Set(Value)?. Because the question mark quantifier is greedy, SetValue will always be matched in preference to Set.
A better solution is to use word boundaries: \b(Get|GetValue|Set|SetValue)\b or \b(Get(Value)?|Set(Value)?)\b. With the trailing \b in place, the alternative Set cannot succeed inside "SetValue", so the engine backtracks and tries SetValue. Furthermore, since the alternatives share the same optional ending, we can simplify the regular expression to \b(Get|Set)(Value)?\b.
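The effect of alternative order and of word boundaries when searching "SetValue" can be seen in a small Python sketch. The variable names are only illustrative.

```python
import re

subject = "SetValue"

# No boundaries: the first alternative that succeeds wins.
eager = re.search(r"Get|GetValue|Set|SetValue", subject).group()

# Longest alternatives first.
reordered = re.search(r"SetValue|Set|GetValue|Get", subject).group()

# Word boundaries force the engine to backtrack past "Set".
bounded = re.search(r"\b(Get|GetValue|Set|SetValue)\b", subject).group()

# Compact form with a greedy optional suffix.
compact = re.search(r"\b(Get|Set)(Value)?\b", subject).group()

print(eager, reordered, bounded, compact)
```

Only the first pattern stops at "Set"; the other three return the whole name.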
11. Groups and Backreferences
Put a part of the regular expression inside parentheses, and you can form a group. Then you can perform some regular operations on the entire group, such as the quantifier operation.
It should be noted that only parentheses "()" can be used to form groups. "[]" is used to define character sets, and "{}" is used to define quantifiers.
When a regular expression group is defined with "()", the regular engine will number the matched groups in order and store them in the cache. When backreferencing the matched group, you can use the form "\number" for reference. \1 refers to the first matched backreference group, \2 refers to the second group, and so on, \n refers to the nth group. And \0 refers to the entire matched regular expression itself. Let's look at an example.
Suppose you want to match the start tag and end tag of an HTML element, as well as the text between them. For example, in <B>This is a test</B>, we want to match <B> and </B> and the text in between. We can use the following regular expression: "<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>".
First, "<" matches the first character "<" of "<B>". Then [A-Z] matches "B", [A-Z0-9]* matches zero or more alphanumeric characters, and [^>]* matches zero or more characters that are not ">". Next, the ">" in the regular expression matches the ">" of "<B>". The engine then lazily matches the characters before the end tag until it encounters "</". At that point "\1" refers to the text matched by the group "([A-Z][A-Z0-9]*)", which in this example is the tag name "B". So the end tag to be matched is "</B>".
You can refer to the same backreference group multiple times: ([a-c])x\1x\1 will match "axaxa", "bxbxb", and "cxcxc". If a numbered group did not take part in the match, the content it refers to is simply empty (engines differ on this point).
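A short Python sketch of the two backreference examples. The character classes in the patterns ([A-Z], [A-Z0-9], [^>], [a-c]) are reconstructed from the description, since the forum software stripped the square brackets, so treat them as an assumption. The sample HTML is made up.

```python
import re

# Tag name is captured in group 1 and reused in the end tag via \1.
tag = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>")
html = "<B>This is a test</B> and <I>mismatched</B>"
pairs = [(m.group(1), m.group(0)) for m in tag.finditer(html)]

# The same group can be referenced more than once.
triple = re.compile(r"([a-c])x\1x\1")
candidates = ["axaxa", "bxbxb", "cxcxc", "axbxa"]
hits = [s for s in candidates if triple.fullmatch(s)]

print(pairs)
print(hits)
```

The <I>...</B> fragment is skipped because the end tag does not repeat the captured name.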
A backreference cannot be used inside the group it refers to: ([abc]\1) does not work. For the same reason, you cannot use \0 inside the regular expression to match the expression itself; it can only be used in replacement operations.
Backreferences cannot be used inside character sets. In (a)[\1b], the \1 is not a backreference; inside a character set, \1 may be interpreted as an octal-encoded character.
Backreferences slow down the engine because it needs to store the matched groups. If you don't need a backreference, you can tell the engine not to store a group, for example (?:Value). The "?:" immediately after the opening "(" tells the engine not to store the text matched by the group (Value) for backreference.
· Quantifier Operations and Backreferences
When a quantifier is applied to a group, the backreference stored in the cache is overwritten on every repetition, so only the last matched content is kept. For example, ([abc]+)=\1 will match "cab=cab", but ([abc])+=\1 will not. In the second expression, ([abc]) first matches "c", so "\1" holds "c"; it then goes on to match "a" and "b". In the end "\1" holds "b", so this expression matches "cab=b".
Application: checking for repeated words. When editing text, it is easy to type a word twice, such as "the the". Using \b(\w+)\s+\1\b can detect these repeated words. To delete the second word, simply use the replace function and replace the match with "\1".
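The difference between quantifying inside and outside the group, and the repeated-word cleanup, can be sketched in Python. The [abc] classes are reconstructed from the explanation and are an assumption; the sample sentence is made up.

```python
import re

# Quantifier inside the group: \1 holds the whole run "cab".
inside = re.fullmatch(r"([abc]+)=\1", "cab=cab")

# Quantifier outside the group: \1 keeps only the last repetition, "b".
outside_full = re.fullmatch(r"([abc])+=\1", "cab=cab")
outside_last = re.fullmatch(r"([abc])+=\1", "cab=b")

# Collapse a doubled word by replacing the match with group 1.
cleaned = re.sub(r"\b(\w+)\s+\1\b", r"\1", "Paris in the the spring")

print(inside is not None, outside_full is None, outside_last.group(1))
print(cleaned)
```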
· Naming and Referencing of Groups
In PHP and Python, you can use (?P<name>group) to name a group. In this example, the syntax ?P<name> names the group (group), where name is the name you give to it. You can use (?P=name) to refer back to it.
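In Python this looks as follows. The group name "word" and the sample sentence are made up for the sketch; \g<word> is Python's way of referring to a named group in a replacement string.

```python
import re

# (?P<word>...) names the group; (?P=word) refers back to it.
doubled = re.compile(r"(?P<word>\w+)\s+(?P=word)")

m = doubled.search("it was was fine")
print(m.group("word"), m.group(0))

# Keep one copy of the doubled word.
fixed = doubled.sub(r"\g<word>", "it was was fine")
print(fixed)
```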
Named Groups in .NET
The .NET framework also supports named groups. Unfortunately, Microsoft's programmers decided to invent their own syntax instead of following the conventions of Perl and Python. When this article was written, no other regular expression implementation supported the syntax invented by Microsoft.
Here is an example in .NET:
(?<first>group)(?'second'group)
As you can see, .NET provides two syntaxes for creating named groups: one uses angle brackets "<>", the other uses single quotes "''". Angle brackets are more convenient inside strings, while single quotes are more useful in ASP code, because "<>" is used for HTML tags there.
To reference a named group, use \k<name> or \k'name'.
When performing search and replace, you can use "${name}" to reference a named group.
12. Matching Modes of Regular Expressions
The regular expression engines discussed in this tutorial all support three matching modes:
i makes the regular expression case-insensitive;
s enables "single-line mode", in which the dot "." also matches the newline character;
m enables "multi-line mode", in which "^" and "$" also match at the positions just after and just before each newline character.
· Turning Modes On or Off Inside the Regular Expression
If you insert the modifier (?ism) inside the regular expression, the modifier only affects the part of the regular expression to its right. (?-i) turns case insensitivity off again. You can test this quickly: (?i)te(?-i)st should match test and TEst, but not teST or TEST.
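A sketch of the three modes in Python. One caveat: Python's re module does not allow switching a flag on and off mid-pattern with (?i)...(?-i); the scoped form (?i:...) is used below as the closest equivalent, so this illustrates the idea rather than the exact syntax above.

```python
import re

# Case-insensitive only for "te"; "st" must be lower case.
partly_ci = re.compile(r"(?i:te)st")
words = ["test", "TEst", "teST", "TEST"]
accepted = [w for w in words if partly_ci.fullmatch(w)]

# s: the dot also matches a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_single_line = re.search(r"(?s)a.b", "a\nb")

# m: ^ and $ also match at line breaks.
lines_default = re.findall(r"^\w+$", "one\ntwo")
lines_multi = re.findall(r"(?m)^\w+$", "one\ntwo")

print(accepted, lines_default, lines_multi)
```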
13. Atomic Groups and Preventing Backtracking
In some special cases, backtracking can make the engine extremely inefficient.
Let's look at an example: to match a string in which the fields are separated by commas and the 12th field starts with P.
An obvious first attempt is ^(.*?,){11}P. This regular expression works well in normal cases. But in the extreme case where the 12th field does not start with P, catastrophic backtracking occurs. Take the string "1,2,3,4,5,6,7,8,9,10,11,12,13". First, the regular expression matches successfully up to the 12th field. At this point the string consumed by (.*?,){11} is "1,2,3,4,5,6,7,8,9,10,11,", and the next token P does not match the "1" of "12". So the engine backtracks, and the consumed string becomes "1,2,3,4,5,6,7,8,9,10,11". Matching resumes with the dot, which can match the next comma ",". The comma token then fails to match the "1" in "12", so this attempt fails too and backtracking continues. You can imagine that the number of such backtracking combinations is very large, so the engine may appear to hang or crash.
There are several solutions to prevent such huge backtracking:
One simple solution is to make the match as precise as possible: replace the dot with a negated character set. For example, the regular expression ^([^,\r\n]*,){11}P reduces the number of backtracking steps after a failure to 11.
Another solution is to use atomic groups.
The purpose of atomic groups is to make the regular engine fail faster, so they can effectively prevent massive backtracking. The syntax of an atomic group is (?>regular expression). Everything between (?> and ) is treated as a single token: once the group has matched, the engine will not backtrack into it. If a later token fails, the engine backtracks to the part of the regular expression before the atomic group. The previous example can be written with an atomic group as ^(?>(.*?,){11})P. Once the 12th field fails to match, the engine backtracks straight to ^.
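The first remedy can be tried out in Python. The sketch below assumes the patterns ^(.*?,){11}P and ^([^,\r\n]*,){11}P (the character class is reconstructed, since the brackets were lost in the post). Atomic groups are left out because Python's re module only accepts (?>...) from version 3.11 on. The third sample line is made up to show a side effect of the lazy dot.

```python
import re

lazy_dot = re.compile(r"^(.*?,){11}P")
negated = re.compile(r"^([^,\r\n]*,){11}P")

good = "1,2,3,4,5,6,7,8,9,10,11,P12,13"     # 12th field starts with P
bad = "1,2,3,4,5,6,7,8,9,10,11,12,13"       # no field starts with P
shifted = "1,2,3,4,5,6,7,8,9,10,11,12,P13"  # 13th field starts with P

results = {
    name: (lazy_dot.search(s) is not None, negated.search(s) is not None)
    for name, s in [("good", good), ("bad", bad), ("shifted", shifted)]
}
print(results)
```

Besides backtracking less, the negated set is also more precise: the lazy dot can swallow a comma, so it accepts the "shifted" line even though the P is in the 13th field.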
14. Look-Aheads and Look-Behinds
Perl 5 introduced two powerful regular expression constructs: "look-aheads" and "look-behinds". They are also called "zero-length assertions". Like anchors, they are zero-length (that is, they do not consume any of the string being matched). The difference is that look-aheads and look-behinds actually match characters, then discard the match and return only the result: match or no match. This is why they are called "assertions": they do not consume characters in the string, but only assert whether a match is possible.
Almost all the regular expression implementations discussed in this article support both look-aheads and look-behinds. The one exception, when this article was written, was JavaScript, which supported only look-aheads.
· Positive and Negative Look-Aheads
As we mentioned in a previous example: to find a q that is not followed by a u. That is to say, either there is no character after q or the character after it is not u. A solution using negative look-ahead is q(?!u). The syntax of negative look-ahead is (?!look-ahead content).
Positive look-ahead is similar to negative look-ahead: (?=look-ahead content).
If there is a group inside the "look-ahead content", it generates a backreference as usual. But the look-ahead itself does not generate a backreference, nor is it counted in the numbering of backreferences. This is because the look-ahead match itself is discarded, and only the verdict of match or no match is retained. If you want to keep what the look-ahead matched as a backreference, put capturing parentheses inside the look-ahead, as in (?=(regex)).
· Positive and Negative Look-Behinds
Look-behinds have the same effect as look-aheads, but in the opposite direction.
The syntax of negative look-behind is (?<!look-behind content).
The syntax of positive look-behind is (?<=look-behind content).
As you can see, compared with look-aheads, there is an additional left angle bracket to indicate the direction.
Example: (?<!a)b will match a b that is not preceded by an a.
It is worth noting that a look-ahead starts matching its inner expression from the current position in the string, whereas a look-behind first steps back one character from the current position and then starts matching its inner expression.
· Delving into the Inside of the Regular Expression Engine
Let's look at a simple example.
Apply the regular expression q(?!u) to the string "Iraq". The first token of the regular expression is the literal q. As we know, the engine scans through the string until q matches. When the fourth character "q" has been matched, only the void at the end of the string follows it. The next token is the look-ahead. The engine notes that it has entered a look-ahead and starts matching the expression inside it. The next token is therefore u, which does not match the void, so the expression inside the look-ahead fails. Because the look-ahead is negative, this means that the look-ahead as a whole succeeds. So the match result "q" is returned.
We apply the same regular expression to "quit". q matches "q". The next token is the u inside the look-ahead, which matches the second character "u" in the string. The engine advances to the next character "i". However, the engine then notes that the look-ahead part has been fully processed and has matched. So the engine discards the matched text, which takes it back to the character "u".
Because the look-ahead is negative, the successful match of the look-ahead part means the look-ahead as a whole fails, so the engine has to backtrack. Finally, because there is no other "q" in the string for q to match, the entire match fails.
To make sure you understand how look-ahead works, let's apply q(?=u)i to "quit". Here the look-ahead is positive and is followed by another token. q first matches "q". Then the look-ahead successfully matches "u"; the matched part is discarded, and only the verdict of a match is kept. The engine steps back from the character "i" to "u". Since the look-ahead succeeded, the engine continues with the next token i. But i does not match "u", so this attempt fails. Since there is no other "q" later in the string, the match of the entire regular expression fails.
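The three look-ahead traces can be reproduced with a few lines of Python, assuming the patterns q(?!u) and q(?=u)i. Offsets are 0-based.

```python
import re

# Negative look-ahead: q not followed by u.
iraq = re.search(r"q(?!u)", "Iraq")
quit_negative = re.search(r"q(?!u)", "quit")

# Positive look-ahead: the u is checked but not consumed.
quit_positive = re.search(r"q(?=u)", "quit")

# After the look-ahead the engine is still in front of "u", so i fails.
quit_then_i = re.search(r"q(?=u)i", "quit")

print(iraq.span(), quit_negative, quit_positive.group(), quit_then_i)
```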
· Further Understanding of the Internal Mechanism of the Regular Expression Engine
Let's apply (?<=a)b to "thingamabob". The engine starts with the look-behind and the first character in the string. In this example, the look-behind tells the engine to step back one character and check whether an "a" can be matched there. Because there is no character in front of "t", the engine cannot step back, so the look-behind fails. The engine moves on to the next character "h". Again, the engine temporarily steps back one character and checks whether an "a" can be matched. It finds a "t", and the look-behind fails again.
The look-behind continues to fail until the engine reaches "m" in the string, where the positive look-behind matches. Because it is zero-length, the current position in the string is still "m". The next token is b, which fails to match "m". The next character is the second "a" in the string. The engine temporarily steps back one character and finds that a does not match "m".
The next character is the first "b" in the string. The engine temporarily steps back one character and finds that the look-behind is satisfied, and b matches "b". So the entire regular expression has matched, and it returns the first "b" in the string.
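A quick check of the look-behind trace in Python, assuming the pattern (?<=a)b. The negative form (?<!a)b is added for contrast.

```python
import re

word = "thingamabob"

# Positive look-behind: a b that is preceded by an a.
after_a = re.search(r"(?<=a)b", word)

# Negative look-behind: a b that is not preceded by an a.
not_after_a = re.search(r"(?<!a)b", word)

print(after_a.span(), not_after_a.span())
```

The positive form finds the first "b" (0-based index 8), the negative form finds the last one.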
· Applications of Look-Aheads and Look-Behinds
Let's look at an example: find a word that has 6 characters and contains "cat".
First, we can solve it without look-aheads or look-behinds, for example: cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat.
It's simple enough! But when the requirement becomes to find a word that has 6 to 12 characters and contains "cat", "dog", or "mouse", this method becomes rather clumsy.
Let's look at the solution using look-aheads. In this example, we have two basic requirements to meet: the word must have 6 characters, and the word must contain "cat".
The regular expression that meets the first requirement is \b\w{6}\b. The regular expression that meets the second requirement is \b\w*cat\w*\b.
Combining the two, we get the following regular expression:
\b(?=\w{6}\b)\w*cat\w*
The detailed matching process is left to the reader. One thing to note is that look-aheads do not consume characters, so after the look-ahead has confirmed that the word has 6 characters, the engine continues matching the rest of the regular expression from the position where the look-ahead started.
Finally, with some optimization we get the following regular expression for the extended requirement: \b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*
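Here is a hedged Python sketch of the look-ahead solution, assuming the patterns \b(?=\w{6}\b)\w*cat\w* and \b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*. The word lists are made up, and the group is written as non-capturing, (?:...), only so that re.findall returns whole words.

```python
import re

# Exactly 6 word characters, containing "cat".
six_letter = re.compile(r"\b(?=\w{6}\b)\w*cat\w*")
six = six_letter.findall("bobcat scatter cats concat catnip")

# 6 to 12 word characters, containing "cat", "dog" or "mouse".
flexible = re.compile(r"\b(?=\w{6,12}\b)\w{0,9}(?:cat|dog|mouse)\w*")
longer = flexible.findall("cat catalog mousetrap hotdogs doghouse")

print(six)
print(longer)
```

The look-ahead checks the length first; the rest of the pattern then matches the same word again from its first letter.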
15. Conditional Tests in Regular Expressions
The syntax of conditional tests is (?(if)then|else). The "if" part can be a look-ahead or look-behind expression. If it is a look-ahead, the syntax becomes (?(?=look-ahead)then|else).
If the if part is true, the regular engine will try to match the then part; otherwise it will try to match the else part.
It should be remembered that look-aheads and look-behinds do not actually consume any characters, so the then and else parts start matching from the position where the if test began.
16. Adding Comments to Regular Expressions
The syntax for adding comments to a regular expression is (?#comment).
Example: Add comments to a regular expression used to match valid dates:
(?#year)(19|20)\d\d(?#month)(0[1-9]|1[012])(?#day)(0[1-9]|[12][0-9]|3[01])
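Python's re module accepts (?#comment) as well. The sketch below assumes the date pattern is for dates written as eight digits without separators, with the character classes filled in as shown (they were lost in the post, so this is a reconstruction). The second form uses Python's verbose mode, an alternative way to comment a pattern.

```python
import re

# Inline comments with (?#...).
date = re.compile(
    r"(?#year)(19|20)\d\d(?#month)(0[1-9]|1[012])(?#day)(0[1-9]|[12][0-9]|3[01])"
)
valid = date.fullmatch("20061026")
invalid = date.fullmatch("20061326")  # month 13 is rejected

# The same pattern in verbose mode, where "#" starts a comment.
verbose = re.compile(
    r"""
    (19|20)\d\d                 # year
    (0[1-9]|1[012])             # month
    (0[1-9]|[12][0-9]|3[01])    # day
    """,
    re.VERBOSE,
)
same = verbose.fullmatch("20061026")

print(valid.groups(), invalid, same.groups())
```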

[ Last edited by 无奈何 on 2006-10-26 at 11:47 AM ]

Floor 5 Posted 2006-10-26 11:43 ·  中国 浙江 宁波 鹏博士宽带
Repost Note: There is an error in the reference link after reposting. Please go to the original web page if needed.

http://www.regexlab.com/zh/regref.htm
Introduction
A regular expression describes a pattern for string matching and can be used for: (1) checking if a substring matching a certain rule exists in a string and obtaining this substring; (2) performing flexible replacement operations on the string according to the matching rule.

Learning regular expressions is actually quite simple, and the few relatively abstract concepts are easy to understand. Many people find regular expressions complicated for two reasons. First, most documents do not build up step by step and do not present the concepts in order of importance, which makes them hard to follow. Second, the documentation for each engine tends to dwell on that engine's unique features, which are not what we need to understand first.

Each example in the article can be clicked to enter the test page for testing. Enough talk, let's start.


--------------------------------------------------------------------------------

1. Regular Expression Rules
1.1 Ordinary Characters
Letters, numbers, Chinese characters, underscores, and punctuation marks not specially defined in the following sections are all "ordinary characters". The ordinary characters in the expression, when matching a string, match the same character.

Example 1: The expression "c", when matching the string "abcde", the matching result is: successful; the matched content is: "c"; the matched position is: starts at 2, ends at 3. (Note: Whether the subscript starts from 0 or 1 may be different depending on the current programming language)

Example 2: The expression "bcd", when matching the string "abcde", the matching result is: successful; the matched content is: "bcd"; the matched position is: starts at 1, ends at 4.


--------------------------------------------------------------------------------

1.2 Simple Escape Characters
Some characters that are awkward to write literally are written with a "\" in front. We are already familiar with these.

Expression    Can match
\r, \n        Carriage return and newline characters
\t            Tab character
\\            The "\" character itself


There are also some punctuation marks that have special uses in the following sections. After adding "\" in front, they represent the symbol itself. For example: ^, $ have special meanings. If you want to match the "^" and "$" characters in the string, the expression needs to be written as "\^" and "\$".

Expression    Can match
\^            The ^ symbol itself
\$            The $ symbol itself
\.            The decimal point (.) itself


These escape sequences match in the same way as "ordinary characters": each one matches exactly the character it stands for.

Example 1: The expression "\$d", when matching the string "abc$de", the matching result is: successful; the matched content is: "$d"; the matched position is: starts at 3, ends at 5.


--------------------------------------------------------------------------------

1.3 Expressions That Can Match 'Multiple Characters'
Some notations in regular expressions can match any one of several characters. For example, the expression "\d" can match any one digit. Although it can match any of these characters, it matches only one at a time, not several. It is like a joker in a card game: it can stand in for any card, but only for one card.

Expression    Can match
\d            Any one digit, any one of 0~9
\w            Any one letter, digit, or underscore, that is, any one of A~Z, a~z, 0~9, _
\s            Any one whitespace character, including spaces, tabs, form feeds, etc.
.             The decimal point can match any character except the newline character (\n)


Example 1: The expression "\d\d", when matching "abc123", the matching result is: successful; the matched content is: "12"; the matched position is: starts at 3, ends at 5.

Example 2: The expression "a.\d", when matching "aaa100", the matching result is: successful; the matched content is: "aa1"; the matched position is: starts at 1, ends at 4.
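The positions quoted in the examples so far can be reproduced with Python, whose offsets start at 0 and whose end offset points just past the last matched character. The helper name describe is made up for this sketch.

```python
import re

def describe(pattern, text):
    """Return (matched content, start, end) for the first match, or None."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return (m.group(), m.start(), m.end())

results = [
    describe(r"c", "abcde"),
    describe(r"bcd", "abcde"),
    describe(r"\$d", "abc$de"),
    describe(r"\d\d", "abc123"),
    describe(r"a.\d", "aaa100"),
]
for r in results:
    print(r)
```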


--------------------------------------------------------------------------------

1.4 Custom Expressions That Can Match 'Multiple Characters'
Enclosing a series of characters in square brackets [ ] matches any one of them. Enclosing them in [^ ] matches any one character other than them. As before, although such an expression can match any one of these characters, it matches only one at a time, not several.

Expression    Can match
[ab5@]        "a" or "b" or "5" or "@"
[^abc]        Any one character other than "a", "b", "c"
[f-k]         Any one letter between "f"~"k"
[^A-F0-3]     Any one character other than "A"~"F", "0"~"3"


Example 1: The expression "[bcd][bcd]" matches "abc123", the matching result is: successful; the matched content is: "bc"; the matched position is: starts at 1, ends at 3.

Example 2: The expression "[^abc]" matches "abc123", the matching result is: successful; the matched content is: "1"; the matched position is: starts at 3, ends at 4.
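A small Python sketch of bracket expressions, assuming the expressions in this section are [bcd][bcd], [^abc], [f-k] and [^A-F0-3] (the brackets were stripped by the forum software, so these are reconstructed from the descriptions). The last two test strings are made up.

```python
import re

pair = re.search(r"[bcd][bcd]", "abc123")
other = re.search(r"[^abc]", "abc123")

# A range inside brackets, and a negated set with two ranges.
in_range = re.findall(r"[f-k]", "efgklm")
outside = re.findall(r"[^A-F0-3]", "AF03G4")

print(pair.group(), pair.span())
print(other.group(), other.span())
print(in_range, outside)
```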


--------------------------------------------------------------------------------

1.5 Special Symbols for Modifying Matching Times
The expressions mentioned in the previous sections, whether they can only match one character or can match any one of multiple characters, can only match once. If you use an expression plus a special symbol for modifying matching times, you can repeat the matching without repeatedly writing the expression.

The usage is: the "repetition modifier" is placed after the expression it modifies. For example: "[bcd][bcd]" can be written as "[bcd]{2}".

Expression    Function
{n}           The expression repeats n times, for example: "\w{2}" is equivalent to "\w\w"; "a{5}" is equivalent to "aaaaa"
{m,n}         The expression repeats at least m times and at most n times, for example: "ba{1,3}" can match "ba" or "baa" or "baaa"
{m,}          The expression repeats at least m times, for example: "\w\d{2,}" can match "a12", "_456", "M12344"...
?             Matches the expression 0 times or 1 time, equivalent to {0,1}, for example: "a[cd]?" can match "a", "ac", "ad"
+             The expression appears at least 1 time, equivalent to {1,}, for example: "a+b" can match "ab", "aab", "aaab"...
*             The expression does not appear or appears any number of times, equivalent to {0,}, for example: "\^*b" can match "b", "^^^b"...


Example 1: The expression "\d+\.?\d*" matches "It costs $12.5", the matching result is: successful; the matched content is: "12.5"; the matched position is: starts at 10, ends at 14.

Example 2: The expression "go{2,8}gle" matches "Ads by goooooogle", the matching result is: successful; the matched content is: "goooooogle"; the matched position is: starts at 7, ends at 17.
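The repetition modifiers can be exercised in Python as follows. The candidate word lists are made up; the two searches are the examples from the text.

```python
import re

# {m,n}: between 1 and 3 repetitions of "a".
counted = [s for s in ["b", "ba", "baa", "baaa", "baaaa"]
           if re.fullmatch(r"ba{1,3}", s)]

# ?: the bracket expression is optional.
optional = [s for s in ["a", "ac", "ad", "ab"]
            if re.fullmatch(r"a[cd]?", s)]

price = re.search(r"\d+\.?\d*", "It costs $12.5")
ad = re.search(r"go{2,8}gle", "Ads by goooooogle")

print(counted, optional)
print(price.group(), price.span())
print(ad.group(), ad.span())
```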


--------------------------------------------------------------------------------

1.6 Other Special Symbols Representing Abstract Meanings
Some symbols represent abstract special meanings in the expression:

Expression    Function
^             Matches the beginning of the string; does not match any character
$             Matches the end of the string; does not match any character
\b            Matches a word boundary, for example the position between a word and a space; does not match any character


A purely textual description is still rather abstract, so here are some examples to aid understanding.

Example 1: The expression "^aaa" matches "xxx aaa xxx", the matching result is: failure. Because "^" requires matching the beginning of the string, so only when "aaa" is at the beginning of the string can "^aaa" match, for example: "aaa xxx xxx".

Example 2: The expression "aaa$" matches "xxx aaa xxx", the matching result is: failure. Because "$" requires matching the end of the string, so only when "aaa" is at the end of the string can "aaa$" match, for example: "xxx xxx aaa".

Example 3: The expression ".\b." matches "@@@abc", the matching result is: successful; the matched content is: "@a"; the matched position is: starts at 2, ends at 4.
Further explanation: "\b" is similar to "^" and "$": it does not match any character itself, but requires that, at its position in the match, the character on one side is in the "\w" range and the character on the other side is not.

Example 4: The expression "\bend\b" matches "weekend,endfor,end", the matching result is: successful; the matched content is: "end"; the matched position is: starts at 15, ends at 18.
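Examples 1 to 4 can be checked in Python. The subject of Example 4 is written without spaces after the commas ("weekend,endfor,end"), which is what makes the quoted offsets 15 and 18 come out.

```python
import re

# ^ and $ anchor the match to the two ends of the string.
at_start = re.search(r"^aaa", "xxx aaa xxx")
at_end = re.search(r"aaa$", "xxx aaa xxx")
start_ok = re.search(r"^aaa", "aaa xxx xxx")

# \b sits between a non-word and a word character.
across = re.search(r".\b.", "@@@abc")
whole_word = re.search(r"\bend\b", "weekend,endfor,end")

print(at_start, at_end, start_ok.span())
print(across.group(), across.span())
print(whole_word.span())
```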

Some symbols can affect the relationship between sub-expressions inside the expression:

Expression    Function
|             "Or" relationship between the expressions on its left and right sides: matches the left or the right
( )           (1) When modifying the matching times, the expression in the parentheses can be modified as a whole
              (2) When obtaining the matching result, the content matched by the expression in the parentheses can be obtained separately


Example 5: The expression "Tom|Jack" matches the string "I'm Tom, he is Jack", the matching result is: successful; the matched content is: "Tom"; the matched position is: starts at 4, ends at 7. When matching the next one, the matching result is: successful; the matched content is: "Jack"; the matched position is: starts at 15, ends at 19.

Example 6: The expression "(go\s*)+" matches "Let's go go go!", the matching result is: successful; the matched content is: "go go go"; the matched position is: starts at 6, ends at 14.

Example 7: The expression "¥(\d+\.?\d*)" matches "$10.9, ¥20.5", the matching result is: successful; the matched content is: "¥20.5"; the matched position is: starts at 7, ends at 12. The content matched by the parentheses alone is: "20.5".
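Examples 5 to 7 in Python. Offsets are computed for the subject strings exactly as written in the code (0-based, end exclusive).

```python
import re

# Alternation: each successive match is one of the two names.
names = [(m.group(), m.start(), m.end())
         for m in re.finditer(r"Tom|Jack", "I'm Tom, he is Jack")]

# A quantified group: the whole group repeats.
gos = re.search(r"(go\s*)+", "Let's go go go!")

# Parentheses let the number be read separately from the currency sign.
price = re.search(r"¥(\d+\.?\d*)", "$10.9, ¥20.5")

print(names)
print(gos.group(), gos.span())
print(price.group(), price.span(), price.group(1))
```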


--------------------------------------------------------------------------------

2. Some Advanced Rules in Regular Expressions
2.1 Greed and Non-greed in Matching Times
When a repetition modifier is used, several forms let the same expression match a varying number of times, such as "{m,n}", "{m,}", "?", "*", "+"; the actual number of matches depends on the string being matched. An expression repeated an indefinite number of times always matches as many times as possible. For example, for the text "dxxxdxxxd":

Expression     Matching result
(d)(\w+)       "\w+" matches all the characters "xxxdxxxd" after the first "d"
(d)(\w+)(d)    "\w+" matches all the characters "xxxdxxx" between the first "d" and the last "d". Although "\w+" could also match the last "d", it "gives up" that "d" so that the entire expression can match


It can be seen that "\w+" always matches as many qualifying characters as possible. In the second example it leaves the last "d" unmatched, but only so that the entire expression can succeed. Similarly, expressions with "*" and "{m,n}" match as much as possible, and an expression with "?" chooses to match whenever matching is optional. This matching principle is called "greedy" mode.

Non-greedy mode:

Adding a "?" after a repetition modifier makes an expression with an indefinite number of matches match as few times as possible, and makes an optional expression prefer not to match. This matching principle is called "non-greedy" mode, also called "reluctant" mode. If matching fewer times would cause the entire expression to fail, then, just as in greedy mode, the non-greedy expression matches the minimum extra amount needed for the entire expression to succeed. For example, for the text "dxxxdxxxd":

Expression      Matching result
(d)(\w+?)       "\w+?" matches as few characters as possible after the first "d", so "\w+?" matches only one "x"
(d)(\w+?)(d)    For the entire expression to succeed, "\w+?" has to match "xxx" so that the following "d" can match. The result is that "\w+?" matches "xxx"


Further examples:

Example 1: The expression "<td>(.*)</td>" matches the string "<td><p>aa</p></td> <td><p>bb</p></td>", the matching result is: successful; the matched content is the entire string "<td><p>aa</p></td> <td><p>bb</p></td>", and the "</td>" in the expression will match the last "</td>" in the string.

Example 2: In contrast, the expression "<td>(.*?)</td>" matches the same string in Example 1, and will only get "<td><p>aa</p></td>", and when matching the next one, the second "<td><p>bb</p></td>" can be obtained.
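The greedy and non-greedy results above can be confirmed with Python's re module, which uses the same "?" suffix for reluctant matching.

```python
import re

text = "dxxxdxxxd"

greedy_open = re.match(r"(d)(\w+)", text).group(2)
greedy_closed = re.match(r"(d)(\w+)(d)", text).group(2)
lazy_open = re.match(r"(d)(\w+?)", text).group(2)
lazy_closed = re.match(r"(d)(\w+?)(d)", text).group(2)

cells = "<td><p>aa</p></td> <td><p>bb</p></td>"
# findall returns the content of group 1 for every match.
greedy_cells = re.findall(r"<td>(.*)</td>", cells)
lazy_cells = re.findall(r"<td>(.*?)</td>", cells)

print(greedy_open, greedy_closed, lazy_open, lazy_closed)
print(greedy_cells)
print(lazy_cells)
```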


--------------------------------------------------------------------------------

2.2 Backreferences \1, \2...
While matching, the expression engine records the string matched by each sub-expression enclosed in parentheses "( )". When the matching result is retrieved, the string matched by a parenthesized sub-expression can be obtained separately. This has been shown several times in the previous examples. In practice, when you search using some delimiter but do not want the delimiter included in the result, you must use parentheses to mark the part you want. For example, the earlier "<td>(.*?)</td>".

In fact, "the string matched by the expression enclosed in parentheses" can not only be used after the matching is over, but also can be used during the matching process. The part behind the expression can refer to the string that has been matched by the "sub-matching in the parentheses" before. The reference method is "\" plus a number. "\1" refers to the string matched by the first pair of parentheses, "\2" refers to the string matched by the second pair of parentheses... and so on. If there is another pair of parentheses inside a pair of parentheses, the outer pair of parentheses is sorted first. In other words, which pair of parentheses has the left parenthesis "(" first, then this pair is sorted first.

Examples are as follows:

Example 1: The expression "('|")(.*?)(\1)" matches "'Hello', "World"", the matching result is: successful; the matched content is: " 'Hello' ". When matching the next one, " "World" " can be matched.

Example 2: The expression "(\w)\1{4,}" matches "aa bbbb abcdefg ccccc 111121111 999999999", the matching result is: successful; the matched content is "ccccc". When matching the next one, 999999999 will be obtained. This expression requires that the character in the "\w" range is repeated at least 5 times, pay attention to the difference from "\w{5,}".

Example 3: The expression "<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>" matches "<td id='td1' style="bgcolor:white"></td>", and the match succeeds. If the closing tag does not pair with the opening tag (for example "<td>" closed by something other than "</td>"), the match fails; if both are changed to another matching pair, the match succeeds again.
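The three examples above can be tried out with a short script. This is only a sketch in Python, on the assumption that its re module treats "\1" and "\4" the same way as the engine the article describes; the variable names are made up for illustration.

```python
import re

# Example 1: \1 must repeat whichever quote character group 1 captured.
quoted = re.compile(r"""('|")(.*?)(\1)""")
quoted_matches = [m.group(0) for m in quoted.finditer('\'Hello\', "World"')]
print(quoted_matches)        # ["'Hello'", '"World"']

# Example 2: one word character followed by at least four copies of itself.
text = "aa bbbb abcdefg ccccc 111121111 999999999"
same_char_runs = [m.group(0) for m in re.finditer(r"(\w)\1{4,}", text)]
print(same_char_runs)        # ['ccccc', '999999999']

# For contrast, \w{5,} accepts any five or more word characters.
any_runs = re.findall(r"\w{5,}", text)
print(any_runs)              # ['abcdefg', 'ccccc', '111121111', '999999999']

# Example 3: </\1> must close the same tag name that group 1 captured.
tag = re.compile(r"""<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>""")
paired = tag.search("""<td id='td1' style="bgcolor:white"></td>""")
unpaired = tag.search("<td></tr>")
print(paired is not None, unpaired is None)   # True True
```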


--------------------------------------------------------------------------------

2.3 Positive Lookahead, Negative Lookahead; Positive Lookbehind, Negative Lookbehind
In the previous sections, I talked about several special symbols representing abstract meanings: "^", "$", "\b". They have one thing in common: they do not match any character by themselves, but only attach a condition to "the two ends of the string" or "the gap between characters". After understanding this concept, this section will continue to introduce another more flexible representation method that attaches conditions to "the two ends" or "the gap".

Lookahead: "(?=xxxxx)", "(?!xxxxx)"

Format: "(?=xxxxx)", in the matched string, the condition attached to the "gap" or "two ends" where it is located is: the right side of the gap where it is located must be able to match the expression of xxxxx. Because it is only used as a condition attached to this gap here, it does not affect the subsequent expression to really match the characters after this gap. This is similar to "\b", which does not match any character by itself. "\b" just takes the characters before and after the gap where it is located for judgment, and will not affect the subsequent expression to really match.

Example 1: The expression "Windows (?=NT|XP)" matches "Windows 98, Windows NT, Windows 2000", and will only match "Windows " in "Windows NT", and other "Windows " words will not be matched.

Example 2: The expression "(\w)((?=\1\1\1)(\1))+" matches the string "aaa ffffff 999999999", and will match the first 4 of the 6 "f"s and the first 7 of the 9 "9"s. This expression can be read as: if a letter or digit is repeated 4 or more times, match everything before its last 2 repetitions. Of course, the expression does not have to be written this way; it is written like this here only for demonstration.

Format: "(?!xxxxx)", the right side of the gap where it is located must not be able to match the expression of xxxxx.

Example 3: The expression "((?!\bstop\b).)+" matches "fdjka ljfdl stop fjdsla fdj", and will match from the beginning to the position before "stop", and if there is no "stop" in the string, it will match the entire string.

Example 4: The expression "do(?!\w)" matches the string "done, do, dog", and can only match "do". In this example, using "(?!\w)" after "do" has the same effect as using "\b".

Lookbehind: "(?<=xxxxx)", "(?<!xxxxx)"

The concepts of these two formats are similar to lookahead. Lookbehind examines the "left side" of the gap where it is located instead of the right side: "(?<=xxxxx)" requires that the text to the left can match the specified expression, and "(?<!xxxxx)" requires that it cannot. As with lookahead, both are conditions attached to the gap where they are located, and they do not match any character by themselves.

Example 5: The expression "(?<=\d{4})\d+(?=\d{4})" matches "1234567890123456", and will match the middle 8 digits, excluding the first 4 digits and the last 4 digits. Since JScript.RegExp does not support lookbehind, this example cannot be demonstrated. Many other engines do support lookbehind, such as the java.util.regex package in Java 1.4 and above, the System.Text.RegularExpressions namespace in .NET, and the simplest and easiest-to-use DEELX regular engine recommended on this site.
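Examples 1 to 5 of this section can be reproduced as follows. This is a rough sketch in Python, chosen only because its re module supports both lookahead and fixed-width lookbehind; the variable names are illustrative.

```python
import re

# Example 1: "Windows " is matched only where NT or XP follows.
m = re.search(r"Windows (?=NT|XP)", "Windows 98, Windows NT, Windows 2000")
windows_start = m.start()
print(windows_start)          # 12, the "Windows " in front of "NT"

# Example 2: stop two characters before the end of a run of 4 or more.
runs = [m.group(0) for m in
        re.finditer(r"(\w)((?=\1\1\1)(\1))+", "aaa ffffff 999999999")]
print(runs)                   # ['ffff', '9999999']

# Example 3: consume characters until the whole word "stop" is next.
before_stop = re.search(r"((?!\bstop\b).)+", "fdjka ljfdl stop fjdsla fdj").group(0)
print(repr(before_stop))      # 'fdjka ljfdl '

# Example 4: "do" not followed by a word character.
do_hits = [m.start() for m in re.finditer(r"do(?!\w)", "done, do, dog")]
print(do_hits)                # [6]

# Example 5: digits with 4 digits to the left and 4 digits to the right.
middle = re.search(r"(?<=\d{4})\d+(?=\d{4})", "1234567890123456").group(0)
print(middle)                 # 56789012
```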


--------------------------------------------------------------------------------

3. Other General Rules
There are also some general rules among various regular expression engines that were not mentioned in the previous explanations.

3.1 In the expression, "\xXX" and "\uXXXX" can be used to represent a character ("X" represents a hexadecimal number)

Form      Character range
\xXX      Characters with numbers in the range 0 ~ 255, for example: space can be represented by "\x20"
\uXXXX    Any character can be represented by "\u" plus its 4-digit hexadecimal number, for example: "\u4E2D"


3.2 While "\s", "\d", "\w", "\b" in the expression represent special meanings, the corresponding uppercase letters represent the opposite meanings

Expression    Can match
\S            Match all non-whitespace characters ("\s" can match various whitespace characters)
\D            Match all non-digit characters
\W            Match all characters other than letters, digits, and underscores
\B            Match non-word boundaries, that is, the character gaps when both sides are in the "\w" range or both sides are not in the "\w" range


3.3 Characters that have special meanings in the expression and need to add "\" to match the character itself are summarized

Character    Description
^            Matches the start position of the input string. To match the ^ character itself, use "\^"
$            Matches the end position of the input string. To match the $ character itself, use "\$"
( )          Marks the start and end positions of a sub-expression. To match parentheses, use "\(" and "\)"
[ ]          Used to define a custom expression that can match any one of 'multiple characters'. To match square brackets, use "\[" and "\]"
{ }          Symbol for modifying matching times. To match braces, use "\{" and "\}"
.            Matches any character except the newline character (\n). To match the decimal point itself, use "\."
?            Modifies the matching times to 0 times or 1 time. To match the "?" character itself, use "\?"
+            Modifies the matching times to at least 1 time. To match the "+" character itself, use "\+"
*            Modifies the matching times to 0 times or any times. To match the "*" character itself, use "\*"
|            "Or" relationship between the expressions on the left and right sides. To match "|" itself, use "\|"


3.4 If the sub-expression inside the parentheses "( )" does not want the matching result to be recorded for future use, the format "(?:xxxxx)" can be used

Example 1: The expression "(?:(\w)\1)+" matches "a bbccdd efg", and the result is "bbccdd". The match of the "(?:)" parentheses is not recorded, so "\1" refers to the string matched by "(\w)".
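A quick way to see that "(?:)" really takes no group number is to ask the compiled expression how many groups it has. The following is a small sketch in Python; the names are illustrative only.

```python
import re

pattern = re.compile(r"(?:(\w)\1)+")
match = pattern.search("a bbccdd efg")

print(match.group(0))    # bbccdd
print(pattern.groups)    # 1, only (\w) is numbered; (?:...) is not
print(match.group(1))    # d, the value captured on the last repetition
```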

3.5 Introduction to the commonly used expression attribute settings: Ignorecase, Singleline, Multiline, Global

Expression attribute
Description

Ignorecase
By default, the letters in the expression are case-sensitive. Configuring Ignorecase can make the matching case-insensitive. Some expression engines extend the concept of "case" to the case of the UNICODE range.

Singleline
By default, the decimal point "." matches characters except the newline character (\n). Configuring Singleline can make the decimal point match all characters including the newline character.

Multiline
By default, the expressions "^" and "$" only match the start ① and end ④ positions of the string. For example:

①xxxxxxxxx②\n
③xxxxxxxxx④

Configuring Multiline can make "^" match ①, and also match the position ③ before the start of the next line after the newline character, and make "$" match ④, and also match the position ② at the end of a line before the newline character.

Global
Mainly plays a role when using the expression for replacement. Configuring Global means replacing all matches.
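The four attributes can be compared side by side. The sketch below uses Python, where the names differ a little from the table: re.IGNORECASE, re.DOTALL (the counterpart of Singleline) and re.MULTILINE. Python has no Global flag; re.sub replaces every match unless a count is given, so count=1 stands in for "Global off" here.

```python
import re

text = "abc\nABC"

# Ignorecase
case_sensitive = re.findall(r"abc", text)
case_insensitive = re.findall(r"abc", text, re.IGNORECASE)
print(case_sensitive, case_insensitive)    # ['abc'] ['abc', 'ABC']

# Singleline (re.DOTALL in Python): "." may now match the newline
dot_default = re.search(r"abc.ABC", text)
dot_singleline = re.search(r"abc.ABC", text, re.DOTALL)
print(dot_default is None, dot_singleline is not None)    # True True

# Multiline: "^" and "$" also match at line breaks
anchors_default = re.findall(r"^\w+$", text)
anchors_multiline = re.findall(r"^\w+$", text, re.MULTILINE)
print(anchors_default, anchors_multiline)  # [] ['abc', 'ABC']

# Global: replace only the first match, or all of them
replace_first = re.sub(r"b", "-", "abcabc", count=1)
replace_all = re.sub(r"b", "-", "abcabc")
print(replace_first, replace_all)          # a-cabc a-ca-c
```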


--------------------------------------------------------------------------------


4. Other Tips
4.1 If you want to understand what complex regular grammars the advanced regular engine supports, you can refer to the description document of the DEELX regular engine on this site.

4.2 If you want the content matched by the expression to be the entire string, not a part from the string, then you can use "^" and "$" at the beginning and end of the expression, for example: "^\d+$" requires the entire string to only have digits.

4.3 If you require the matched content to be a complete word, not a part of a word, then use "\b" at the beginning and end of the expression, for example: use "\b(if|while|else|void|int……)\b" to match keywords in the program.

4.4 The expression should not be able to match an empty string. Otherwise it will always report a successful match, but the result will contain nothing. For example, when writing an expression that matches forms such as "123", "123.", "123.5" and ".5", the integer part, the decimal point, and the fractional part can each be omitted, but do not write the expression as "\d*\.?\d*", because this expression also matches successfully when there is nothing at all. A better way to write it is: "\d+\.?\d*|\.\d+".

4.5 Do not loop infinitely for sub-matches that can match empty strings. If each part of the sub-expression inside the parentheses can match 0 times, and this parentheses as a whole can match infinitely, then the situation may be more serious than the previous one, and there may be an infinite loop during the matching process. Although some regular expression engines have avoided this kind of infinite loop, such as the regular expression in .NET, we should still try to avoid this situation. If we encounter an infinite loop when writing an expression, we can also start from this point and see if it is the reason mentioned in this section.

4.6 Reasonably choose greedy mode and non-greedy mode, refer to the topic discussion.

4.7 For the expressions on the left and right sides of "|", it is best that any given character can be matched by only one side, so that the result does not change when the two expressions on either side of "|" swap positions.
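Tips 4.2 to 4.4 are easy to check by hand. Here is a minimal sketch in Python; the sample strings and variable names are invented for the demonstration.

```python
import re

# 4.2: "^" and "$" require the whole string to consist of digits.
digits_only = re.compile(r"^\d+$")
whole_ok = digits_only.search("2006") is not None
whole_bad = digits_only.search("20a6") is not None
print(whole_ok, whole_bad)     # True False

# 4.3: "\b" keeps "int" from being found inside "print".
source = "int x; if (print) while"
whole_words = re.findall(r"\b(if|while|else|void|int)\b", source)
partial_words = re.findall(r"(if|while|else|void|int)", source)
print(whole_words)             # ['int', 'if', 'while']
print(partial_words)           # ['int', 'if', 'int', 'while']

# 4.4: the loose expression accepts an empty string, the strict one does not.
loose = re.compile(r"\d*\.?\d*")
strict = re.compile(r"\d+\.?\d*|\.\d+")
samples = ["123", "123.", "123.5", ".5"]
loose_accepts_empty = loose.fullmatch("") is not None
strict_accepts_empty = strict.fullmatch("") is not None
strict_accepts_samples = all(strict.fullmatch(s) for s in samples)
print(loose_accepts_empty, strict_accepts_empty, strict_accepts_samples)  # True False True
```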

[ Last edited by 无奈何 on 2006-10-26 at 12:17 PM ]

Floor 6 Posted 2006-10-26 11:43 ·  中国 浙江 宁波 鹏博士宽带
Repost Note: If there are errors in the quoted links after reposting, please go to the original web page if needed.

Regular Expression Topic
http://www.regexlab.com/zh/regtopic.htm

Introduction
This article will gradually discuss some topics about the use of regular expressions. This article is an extension after the basic article of this site. Before reading this article, it is recommended to read the "Regular Expression Reference Document" article first.


--------------------------------------------------------------------------------

1. Recursive Matching of Expressions
Sometimes, we need to use regular expressions to analyze the pairing of parentheses in a calculation formula. For example, the expression "\( [^)]* \)" or "\( .*? \)" can match a pair of parentheses. But if another layer of parentheses is nested inside, such as "( ( ) )", then this way of writing will not match correctly, and the result obtained is "( ( )". Similar situations include nested HTML tags such as "<font> </font>". This section discusses how to match paired parentheses or paired tags that contain nesting.

Matching nested unknown levels:

Some regular expression engines have specific support for such nesting. And as long as the stack space allows, they can support arbitrary unknown levels of nesting: such as Perl, PHP, GRETA, etc. In PHP and GRETA, the expression uses "(?R)" to represent the nested part.

The expression for matching a pair of parentheses nested to an unknown number of levels is written as: "\( ([^()] | (?R))* \)".



Matching nested with limited levels:

For regular expression engines that do not support nesting, a certain method can be used to match nested with limited levels. The idea is as follows:

Step 1, write an expression that does not support nesting: "\( [^()]* \)", "<font>((?!</?font>).)*</font>". When matching nested text, these two expressions match only the innermost layer.

Step 2, write an expression that can match one layer of nesting: "\( ([^()] | \( [^()]* \))* \)". When the nesting is deeper than one layer, this expression can match only the innermost two layers. It can also match text without nesting, or the innermost layer of nested text.

To match the "<font>" tag nested one layer, the expression is: "<font>((?!</?font>).|(<font>((?!</?font>).)*</font>))*</font>". This expression, when matching text with a nesting level of "<font>" greater than one, only matches the innermost two layers.

Step 3, find the relationship between the expression that can match nested (n) layers and the expression that matches nested (n-1) layers. For example, the expression that can match nested (n) layers is:

[opening mark] ( [expression matching anything other than the opening and closing marks] | [expression matching (n-1) layers] )* [closing mark]

Looking back at the "expression that can match nested one layer" written earlier:

\( ( [^()] | \(([^()])*\) )* \)
<font> ( (?!</?font>). | (<font>((?!</?font>).)*</font>) )* </font>

The convenience of PHP and GRETA is that the expression that matches nested (n-1) layers is represented by (?R):
\( ( [^()] | (?R) )* \)

Step 4, by analogy, the expression that matches nested (n) layers with limited levels can be written. The expression written in this way, although it looks very long, the matching efficiency is still very high after compilation.
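Steps 1 to 4 can also be carried out by a program instead of by hand. The following is a sketch in Python, picked because its standard re module has no "(?R)", so the limited-level approach is the one that applies. The function name nested_parens is made up for this illustration, and the spaces shown in the article's expressions are left out because no ignore-whitespace option is used.

```python
import re

def nested_parens(levels):
    """Build an expression for a parenthesis pair containing up to `levels` nested layers."""
    pattern = r"\([^()]*\)"      # Step 1: no nesting at all
    for _ in range(levels):
        # Step 3: [opening] ( [non-parenthesis] | [expression for n-1 layers] )* [closing]
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return pattern

formula = "(a(b(c)d)e)"
found = [re.search(nested_parens(n), formula).group(0) for n in range(3)]
for n, text in enumerate(found):
    print(n, text)
# 0 (c)
# 1 (b(c)d)
# 2 (a(b(c)d)e)
```

Each extra level wraps the previous expression once more, so the text grows quickly, but the expression only needs to be built and compiled once.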


--------------------------------------------------------------------------------

2. Efficiency of Non-Greedy Matching
Maybe many people have had the same experience as me: when we want to match text like "<td>content</td>" or "bold", we write an expression based on lookahead, such as "<td>([^<]|<(?!/td>))*</td>" or "<td>((?!</td>).)*</td>".

When we find the non-greedy matching, we suddenly realize that the same functional expression can be written so simply: "<td>.*?</td>". Suddenly, it's like finding a treasure. Whenever matching by boundaries, try to use the simple non-greedy matching ".*?". Especially for complex expressions, the expression written with the non-greedy matching ".*?" is indeed much more concise.

However, when an expression contains several non-greedy matches, or several sub-expressions with an unknown number of repetitions, it may hide an efficiency trap. Sometimes the matching is inexplicably slow, and we may even start to doubt whether regular expressions are practical.

Generation of Efficiency Traps:

In the basic article of this site, the description of non-greedy matching says: "If matching less will cause the entire expression to fail to match, similar to the greedy mode, the non-greedy mode will minimally match some more to make the entire expression match successfully."

The specific matching process is as follows:

The "non-greedy part" first matches the minimum number of times, and then the engine tries to match the "right expression".
If the right expression matches successfully, the whole match ends. If the right expression fails to match, the "non-greedy part" matches one more time, and the "right expression" is tried again.
If the right expression still fails, the "non-greedy part" matches one more time again, and the "right expression" is tried once more.
And so on. In the end, either the "non-greedy part" uses the minimum number of repetitions that lets the entire expression match, or the match finally fails.

When there are multiple non-greedy matches in an expression, take "d(\w+?)d(\w+?)z" as an example. For the "\w+?" in the first parentheses, the "d(\w+?)z" on its right is its "right expression". For the "\w+?" in the second parentheses, the "z" on its right is its "right expression".

When "z" fails to match, the second "\w+?" matches one more time and "z" is tried again. If "z" still cannot match no matter how far the second "\w+?" is extended, up to the end of the text, then "d(\w+?)z" has failed, that is, the "right expression" of the first "\w+?" has failed. The first "\w+?" then matches one more time, and "d(\w+?)z" is tried again from scratch. This process repeats until, no matter how far the first "\w+?" is extended, the following "d(\w+?)z" still cannot match; only then is the entire expression declared to have failed.

In fact, for the entire expression to match successfully, the greedy matching will also appropriately "give up" the matched characters. Therefore, the greedy matching also has a similar situation. When there are more expressions with unknown matching times in an expression, in order to make the entire expression match successfully, each greedy or non-greedy expression has to try to reduce or increase the matching times, which is likely to form a large loop of attempts, resulting in a long matching time. This article calls it a "trap" because this efficiency problem is often not easy to detect.

For example: "d(\w+?)d(\w+?)d(\w+?)z" when matching "ddddddddddd...", it will take a long time to judge that the matching failed.
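The growth in matching time can be observed directly. The sketch below uses Python's re module as a stand-in for the engines discussed here, on the assumption that it backtracks in the same way; the helper name time_search is invented, and the timings will differ from machine to machine, so none are quoted.

```python
import re
import time

trap = re.compile(r"d(\w+?)d(\w+?)d(\w+?)z")

def time_search(text):
    """Return (match, seconds) for one search of `trap` over `text`."""
    start = time.perf_counter()
    match = trap.search(text)
    return match, time.perf_counter() - start

# A text that can match is handled quickly...
hit, _ = time_search("dadbdcz")
print(hit.groups())                      # ('a', 'b', 'c')

# ...but a text with no "z" forces every combination of lengths to be tried
# before failure can be reported, so the time climbs steeply with the length.
for n in (20, 40, 80):
    miss, seconds = time_search("d" * n)
    print(n, miss, f"{seconds:.4f}s")    # miss is None every time
```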

Avoidance of Efficiency Traps:

The principle to avoid efficiency traps is: avoid "multiple loops" of "attempted matching". It's not that non-greedy matching is bad, but when using non-greedy matching, attention should be paid to avoiding the problem of excessive "loop attempts".

Situation 1: For an expression with only one non-greedy or greedy match, there is no efficiency trap. That is, to match text like "<td> content </td>", the expressions "<td>([^<]|<(?!/td>))*</td>", "<td>((?!</td>).)*</td>" and "<td>.*?</td>" have the same efficiency.

Situation 2: If there are multiple expressions with unknown matching times in an expression, unnecessary attempted matching should be prevented.

For example, take the expression "<script language='(.*?)'>(.*?)</script>". If the first part of the expression matches when it meets "<script language='vbscript'>", but the following "(.*?)</script>" fails to match, the first ".*?" will match more characters and try again. Given the real purpose of the expression, it is wrong for the first ".*?" to grow past "vbscript'>", so this attempt is unnecessary.

Therefore, for expressions delimited by boundaries, do not let the part with an unknown number of repetitions cross its boundary. In the expression above, the first ".*?" should be rewritten as "[^']*". The second ".*?" has no other expression with an unknown number of repetitions to its right, so this non-greedy match has no efficiency trap. The expression for matching this script block is therefore better written as: "<script language='([^']*)'>(.*?)</script>".
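Besides wasting time, letting a non-greedy part cross its boundary can also give a wrong result. The following sketch in Python compares a ".*?" version of the script expression with a version whose first group cannot contain a quote. The sample text is invented for the purpose: its first tag carries an extra attribute, so "'>" does not follow the language name directly.

```python
import re

loose = re.compile(r"<script language='(.*?)'>(.*?)</script>")
tight = re.compile(r"<script language='([^']*)'>(.*?)</script>")

text = "<script language='vbscript' defer>bad <script language='js'>x</script>"

loose_groups = loose.search(text).groups()
tight_groups = tight.search(text).groups()

# The first ".*?" keeps growing until "'>" finally follows, far past its boundary.
print(loose_groups)   # ("vbscript' defer>bad <script language='js", 'x')

# "[^']*" cannot pass the closing quote, so the broken first tag is simply skipped.
print(tight_groups)   # ('js', 'x')
```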

[ Last edited by 无奈何 on 2006-10-26 at 12:20 PM ]

Floor 7 Posted 2006-10-26 11:44 ·  中国 浙江 宁波 鹏博士宽带
Regular Expression Reference Manual__Mini Version
A regular expression is a text pattern composed of ordinary characters (such as characters from a to z) and special characters (called metacharacters). This pattern describes one or more strings to be matched when searching the text body. A regular expression acts as a template to match a certain character pattern with the string being searched.
This article lists in detail the characters that can be used in regular expressions to match text. It can serve as a quick reference when you need to interpret an existing regular expression. For more detail, please refer to: Francois Liger, Craig McQueen, Paul Wilton, C# String and Regular Expression Reference Manual, Beijing: Tsinghua University Press, 2003.2
I. Matching Characters
Character Classes
Characters Matched
Examples
\d
Any digit from 0-9
\d\d matches 72, but not aa or 7a
\D
Any non-digit character
\D\D\D matches abc, but not 123
\w
Any word character, including A-Z, a-z, 0-9, and underscore
\w\w\w\w matches Ab_2, but not ∑£$%* or Ab_@
\W
Any non-word character
\W matches @, but not a
\s
Any whitespace character, including tab, newline, carriage return, form feed, and vertical tab
Matches all traditional whitespace characters defined in HTML, XML, and other standards
\S
Any non-whitespace character
Any character other than whitespace, such as A%&g3; etc.
.
Any character
Matches any character except newline unless the SingleLine option is set
[...]
Any character in the brackets
[abc] will match a single character: a, b, or c.
[a-z] will match any character from a to z
[^...]
Any character not in the brackets
[^abc] will match a single character other than a, b, c, such as d, e, or A, B, C
[^a-z] will match any character not in a-z, including all uppercase letters
II. Repeating Characters
Repeating Characters
Meaning
Examples
{n}
Matches the preceding character n times
x{2} matches xx, but not x or xxx
{n,}
Matches the preceding character at least n times
x{2,} matches 2 or more x, such as xx, xxx, xxxx...
{n,m}
Matches the preceding character at least n times and at most m times. If n is 0, this parameter is optional
x{2,4} matches xx, xxx, xxxx, but not xxxxx
?
Matches the preceding character 0 or 1 time, essentially optional
x? matches x or zero x
+
Matches the preceding character 1 or more times
x+ matches x or xx or any number of x greater than 0
*
Matches the preceding character 0 or more times
x* matches 0, 1, or more x
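The repetition table can be checked mechanically. This is a small sketch in Python; fully_matches is a helper invented for the demonstration, and it uses re.fullmatch so that each test string must be matched in its entirety.

```python
import re

def fully_matches(pattern, text):
    """True when `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "x{2}":   [("xx", True), ("x", False), ("xxx", False)],
    "x{2,}":  [("xx", True), ("xxxx", True), ("x", False)],
    "x{2,4}": [("xx", True), ("xxxx", True), ("xxxxx", False)],
    "x?":     [("", True), ("x", True), ("xx", False)],
    "x+":     [("", False), ("x", True), ("xxx", True)],
    "x*":     [("", True), ("x", True), ("xxx", True)],
}

results = {pattern: [fully_matches(pattern, text) for text, _ in cases]
           for pattern, cases in checks.items()}
for pattern, cases in checks.items():
    print(pattern, results[pattern] == [expected for _, expected in cases])   # True on every line
```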
III. Anchoring Characters
Anchoring Characters
Description
^
The following pattern must be at the start of the string. If the multi-line flag is set, it may also be at the start of a line. For multi-line text (a string containing newlines), the multi-line flag needs to be set
$
The preceding pattern must be at the end of the string. If the multi-line flag is set, it may also be at the end of a line
\A
The following pattern must be at the start of the string, ignoring the multi-line flag
\z
The preceding pattern must be at the end of the string, ignoring the multi-line flag
\Z
The preceding pattern must be at the end of the string, or before a newline at the end of the string
\b
Matches a word boundary, that is, the point between a word character and a non-word character. Remember that a word character is one of [a-zA-Z0-9_]. For example, the position at the start of a word
\B
Matches a position that is not a word boundary, for example not at the start of a word
Note: Anchoring characters can be applied to characters or combinations, placed at the left or right end of the string
IV. Grouping Characters
Grouping Characters
Definition
Examples
()
Groups the characters matched by the pattern inside the parentheses. It is a capturing group: the characters matched by the pattern are captured as a group unless the ExplicitCapture option is set, and that option is not set by default
The input string is: ABC1DEF2XY
The regular expression that matches 3 characters from A to Z followed by 1 digit: ([A-Z]{3})\d
Will produce two matches: Match 1=ABC1; Match 2=DEF2
Each match corresponds to a group: the first group of Match 1=ABC; the first group of Match 2=DEF
With backreferences, you can access the group by its number, both inside the regular expression and in C# through the classes Group and GroupCollection. If the ExplicitCapture option is set, the content captured by the group cannot be used
(?:)
Groups the characters matched by the pattern inside the parentheses. It is a non-capturing group, which means that the characters matched by the pattern are not captured as a group, but they still form part of the final match result. It behaves like the group type above with the ExplicitCapture option set
The input string is: 1A BB SA 1 C
The regular expression that matches a digit or a letter from A to Z followed by any word character is: (?:\d|[A-Z])\w
It will produce 3 matches: Match 1=1A; Match 2=BB; Match 3=SA
But no group is captured
(?<name>)
Groups the characters matched by the pattern inside the parentheses and names the group with the value specified in the angle brackets. In the regular expression, backreferences can use the name instead of the number. It is a capturing group even if the ExplicitCapture option is set. This means that backreferences can use the characters matched by the group, and they can be accessed through the Group class
The input string is: Characters in Seinfeld included Jerry Seinfeld, Elaine Benes, Cosmo Kramer and George Costanza
The regular expression that matches their names and captures the last name in a group lastName is: \b[A-Z][a-z]+ (?<lastName>[A-Z][a-z]+)\b
It produces 4 matches: First Match=Jerry Seinfeld; Second Match=Elaine Benes; Third Match=Cosmo Kramer; Fourth Match=George Costanza
Each match corresponds to a lastName group:
1st match: lastName group=Seinfeld
2nd match: lastName group=Benes
3rd match: lastName group=Kramer
4th match: lastName group=Costanza
The group will be captured regardless of whether the option ExplicitCapture is set
(?=)
Positive assertion. The right side of the assertion must be the pattern specified in the brackets. This pattern does not constitute part of the final match
The regular expression is: \S+(?=.NET). The input string to be matched is: The languages were Java, C#.NET, VB.NET, C, JScript.NET, Pascal
Will produce the following matches:
C#
VB
JScript
(?!)
Negative assertion. The pattern specified in the brackets must not appear immediately to the right of the assertion. This pattern does not constitute part of the final match
The regular expression is: \d{3}(?![A-Z]). The input string to be matched is: 123A 456 789 111C
Will produce the following matches:
456
789
(?<=)
Reverse positive assertion. The left side of the assertion must be the pattern specified in the brackets. This pattern does not constitute part of the final match
The regular expression (?
It will produce the following matches:
Mexico
England
(?<!)
Reverse negative assertion. The left side of the assertion must not be the pattern specified in the brackets. This pattern does not constitute part of the final match
The regular expression (?
It will achieve the following matches:
56F
89C
(?>)
Non-backtracking group. Prevents the Regex engine from backtracking inside the group, which may prevent a match from being achieved
Suppose you want to match all words ending with "ing". The input string is as follows: He was very trusting
The regular expression is: .*ing
It will achieve one match that ends with the word trusting. ".*" matches any characters, so at first it also consumes "ing". The Regex engine then backtracks until ".*" stops after the 2nd "t", and the specified pattern "ing" can match. However, if backtracking is disabled: (?>.*)ing
It will achieve 0 matches. ".*" consumes all characters, including "ing", and cannot give them back, so "ing" cannot match and the match fails
V. Decision Characters
Characters
Description
Examples
(?(regex)yes_regex|no_regex )
If the expression regex matches, then it will try to match the expression yes. Otherwise, it matches the expression no. The regular expression no is an optional parameter. Note that the width of the pattern making the decision is 0. This means that the expression yes or no will start matching from the same position as the regex expression
The regular expression is: (?(\d)\dA|[A-Z]B). The input string to be matched is: 1A CB 3A 5C 3B
The matches it achieves are:
1A
CB
3A
(?(group name or number)yes_regex|no_regex )
If the regular expression in the group achieves a match, then it tries to match the yes regular expression. Otherwise, it tries to match the regular expression no. no is optional
The regular expression
(\d7)?-(?(1)\d\d| for the input string to be matched is:
77 -77A 69-AA 57-B
The matches it achieves are:
77 -77A
- AA
Note: The characters listed in the above table force the processor to perform an if-else decision
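Several of the grouping and decision constructs in sections IV and V can be tried outside .NET as well. The sketch below uses Python, with these caveats: Python writes a named group as (?P<name>...) rather than (?<name>...), it only offers the "group name or number" form of the decision construct, and it has no ExplicitCapture option. The name expression and the completed decision expression are assumptions made for illustration, since the manual's own text for them did not survive intact.

```python
import re

# Named group (Python syntax). The pattern is an assumed reconstruction.
names = re.compile(r"\b[A-Z][a-z]+ (?P<lastName>[A-Z][a-z]+)\b")
cast = ("Characters in Seinfeld included Jerry Seinfeld, Elaine Benes, "
        "Cosmo Kramer and George Costanza")
last_names = [m.group("lastName") for m in names.finditer(cast)]
print(last_names)      # ['Seinfeld', 'Benes', 'Kramer', 'Costanza']

# Positive assertion: text followed by any character and then "NET".
languages = "The languages were Java, C#.NET, VB.NET, C, JScript.NET, Pascal"
dotnet = re.findall(r"\S+(?=.NET)", languages)
print(dotnet)          # ['C#', 'VB', 'JScript']

# Negative assertion: three digits not followed by a capital letter.
three_digits = re.findall(r"\d{3}(?![A-Z])", "123A 456 789 111C")
print(three_digits)    # ['456', '789']

# Decision on a group: if group 1 took part, expect two digits and a letter,
# otherwise expect two letters. The part after "|" is a hypothetical completion.
decision = re.compile(r"(\d7)?-(?(1)\d\d[A-Z]|[A-Z][A-Z])")
decided = [m.group(0) for m in decision.finditer("77-77A 69-AA 57-B")]
print(decided)         # ['77-77A', '-AA']
```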
VI. Replacement Characters
Characters
Description
$group
Substitutes the substring matched by the group whose number is group
${name}
Substitutes the last substring matched by a (?<name>) group
$$
Substitutes a literal $ character
$&
Substitutes the entire match
$`
Substitutes all text of the input string before the match
$'
Substitutes all text of the input string after the match
$+
Substitutes the last captured group
$_
Substitutes the entire input string
Note: The above are common replacement characters, not all
VII. Escape Sequences
Characters
Description
\\
Matches the character "\"
\.
Matches the character "."
\*
Matches the character "*"
\+
Matches the character "+"
\?
Matches the character "?"
\|
Matches the character "|"
\(
Matches the character "("
\)
Matches the character ")"
\{
Matches the character "{"
\}
Matches the character "}"
\^
Matches the character "^"
\$
Matches the character "$"
\n
Matches newline
\r
Matches carriage return
\t
Matches tab
\v
Matches vertical tab
\f
Matches form feed
\nnn
Matches the ASCII character specified by the octal number nnn. For example, \103 matches uppercase C
\xnn
Matches the ASCII character specified by the hexadecimal number nn. For example, \x43 matches uppercase C
\unnnn
Matches the Unicode character specified by the 4-digit hexadecimal number nnnn
\cV
Matches a control character, such as \cV matches Ctrl-V
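The numeric escapes are easy to confirm, since \103, \x43 and \u0043 should all name the same character. Below is a sketch in Python, whose re module accepts these three forms; as far as I know it has no \cV form, so that row is left out.

```python
import re

# Three spellings of the same character, uppercase C (code 67).
numeric = [re.fullmatch(p, "C") is not None for p in (r"\103", r"\x43", r"\u0043")]
print(numeric)                       # [True, True, True]

# An escaped metacharacter matches only itself.
any_char = re.findall(r".", "a.b")
dot_only = re.findall(r"\.", "a.b")
print(any_char, dot_only)            # ['a', '.', 'b'] ['.']

# re.escape adds the backslashes for a whole literal string.
literal = re.escape("1+1=2?")
escaped_ok = re.fullmatch(literal, "1+1=2?") is not None
print(escaped_ok)                    # True
```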
VIII. Option Flags
Option Flags
Names
I
IgnoreCase
M
Multiline
N
ExplicitCapture
S
SingleLine
X
IgnorePatternWhitespace
Note: The meaning of the options themselves is as shown in the following table:
Flags
Names
IgnoreCase
Makes pattern matching case-insensitive. The default option is case-sensitive matching
RightToLeft
Searches the input string from right to left. The default is from left to right to conform to the reading habits of English, etc., but not to the reading habits of Arabic or Hebrew
None
No flags are set. This is the default option
Multiline
Specifies that ^ and $ can match the start and end of lines, as well as the start and end of the string. This means that each line separated by a newline can be matched. However, the character "." still does not match newline
SingleLine
Specifies that the special character "." matches any character, including newline. By default, the special character "." does not match newline. Usually used together with the MultiLine option
ECMAScript
ECMA (European Computer Manufacturers Association) has defined how regular expressions should be implemented, and this is set out in the ECMAScript specification, the standard on which JavaScript is based. This option can only be used with the IgnoreCase and MultiLine flags. Using it with any other flag will cause an exception
IgnorePatternWhitespace
This option removes all unescaped whitespace characters from the regular expression pattern. It allows an expression to span multiple lines of text, but any whitespace that is meant to be matched must be escaped. If this option is set, the "#" character can also be used to comment the regular expression
Compiled
It compiles the regular expression into code closer to machine code. This is fast, but does not allow any modification to it

[ Last edited by 无奈何 on 2006-10-26 at 11:51 AM ]

Floor 8 Posted 2006-10-26 11:44 ·  中国 浙江 宁波 鹏博士宽带
Foreword
  Regular Expressions (regular expressions, hereinafter referred to as RE) has always been a mysterious area for me. Seeing some great people on the Internet simply use RE to solve certain text problems, I got the idea of learning RE. But I am naturally a bit lazy and always hope to see if there is a way to learn it quickly. So I invited the Google god again. With His power, I found Mr. Jim Hollenhorst's article on the Internet. After reading it, I thought it was really good, so I made a small summary report to share with the friends of Move-to.Net, hoping to bring a little help to you all in learning RE. The URL of Mr. Jim Hollenhorst's article is as follows. Those who need it can directly click the link.
  The 30 Minute Regex Tutorial By Jim Hollenhorst
  http://www.codeproject.com/useritems/RegexTutorial.asp
  What is RE?
  I believe that all of you have used the wildcard "*" when searching for files. For example, to find all Word files in the Windows directory, you might search for "*.doc", because "*" stands for any sequence of characters. What RE does is similar, but much more powerful.
  When writing a program, it is often necessary to check whether a string matches a specific pattern. The main function of RE is to describe such a pattern, so an RE can be regarded as a description of a specific pattern. For example, "\w+" represents any non-empty string composed of letters and digits. The .NET framework provides a very powerful class library that makes it easy to use RE for searching and replacing text, decoding complex headers, validating text, and other tasks.
  The best way to learn RE is to do it yourself through examples. Mr. Jim Hollenhorst also provides a tool program Expresso (have a cup of coffee), to help us learn RE. The download URL is http://www.codeproject.com/useritems/RegexTutorial/ExpressoSetup2_1C.zip.
  Next, let's experience some examples.
  Some simple examples
  Suppose you want to find, in a piece of text, a string in which Elvis is followed by alive. With RE you might go through the following steps; the text in parentheses explains each RE:
  1. elvis (find elvis)
  This searches for the character sequence elvis. In .NET you can choose to ignore case, so "Elvis", "ELVIS" and "eLvIs" all match RE 1. But because it only looks for the characters e, l, v, i, s in that order, pelvis also matches RE 1. RE 2 improves on this.
  2. \belvis\b (find elvis as a whole word, such as elvis, or Elvis when case is ignored)
  "\b" has a special meaning in RE: it denotes a word boundary. So \belvis\b uses \b to mark the boundaries before and after elvis, that is, it finds the whole word elvis.
  Suppose you want to find a string on a single line in which elvis is followed by alive. For that you need two more special characters, "." and "*". "." represents any character except the newline character, and "*" repeats the item before it any number of times, as many as needed to find a match. So ".*" means any number of characters other than newline. To find elvis followed by alive on the same line, you can enter RE 3 as follows.
  3. \belvis\b.*\balive\b (find a string where elvis is followed by alive, such as elvis is alive)
  You can form a powerful RE with simple special characters, but you also find that when using more and more special characters, the RE will be more and more difficult to understand.
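  To try examples 1 to 3 outside Expresso, here is a minimal sketch. It assumes Python's re module rather than the .NET class library the article uses, and the sample sentence is made up for illustration.

```python
import re

text = "Elvis is alive and his pelvis is fine"

# RE 1: plain character sequence, so the "elvis" inside "pelvis" also matches.
plain = re.findall(r"elvis", text, flags=re.IGNORECASE)

# RE 2: \b on both sides restricts the match to the whole word.
whole = re.findall(r"\belvis\b", text, flags=re.IGNORECASE)

# RE 3: "elvis" followed later on the same line by "alive".
both = re.search(r"\belvis\b.*\balive\b", text, flags=re.IGNORECASE)

print(plain)          # ['Elvis', 'elvis']
print(whole)          # ['Elvis']
print(both.group(0))  # Elvis is alive
```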
  Let's look at another example
  Finding a valid phone number
  Suppose you want to collect 7-digit phone numbers in the format xxx-xxxx from a web page, where x is a digit. The RE might be written like this.
  4. \b\d\d\d-\d\d\d\d (find a 7-digit phone number, such as 123-1234)
  Each \d represents one digit, and "-" is an ordinary hyphen. To avoid writing \d so many times, the RE can be rewritten as in 5.
  5. \b\d{3}-\d{4} (a better way to find a 7-digit phone number, such as 123-1234)
  The {3} after \d means repeat the previous item three times, that is, it is equivalent to \d\d\d.
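  A quick check that the long form \b\d\d\d-\d\d\d\d and the short form \b\d{3}-\d{4} find the same numbers. This is only a sketch using Python's re module (an assumption, the article itself targets .NET), and the sample text is invented.

```python
import re

page = "Call 555-1234 or 123-4567 today; order 98-76543 is not a phone number."

long_form = re.findall(r"\b\d\d\d-\d\d\d\d", page)   # RE 4
short_form = re.findall(r"\b\d{3}-\d{4}", page)      # RE 5

print(long_form)                # ['555-1234', '123-4567']
print(long_form == short_form)  # True
```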
  RE learning and testing tool Expresso
  Because RE is not easy to read and users are prone to making wrong RE, Mr. Jim developed a tool software Expresso to help users learn and test RE. In addition to the URL mentioned above, you can also go to the Ultrapico website (http://www.Ultrapico.com). After installing Expresso, in the Expression Library, Mr. Jim has established the examples of the article in it. You can test while reading the article, and you can also try to modify the RE of the example, and you can see the result immediately. I think it is very useful. You can give it a try.
  Basic concepts of RE in .NET
  Special characters
  Some characters have special meanings, such as "\b", ".", "*", and "\d" that we saw before. "\s" represents any whitespace character, such as a space, tab, or newline. "\w" represents any letter or digit character (the underscore also counts).
  Let's look at some examples
  6. \ba\w*\b (find a word starting with a, such as able)
  This RE says that we want to find the start boundary of a word (\b), then the letter "a", then any number of letters and digits (\w*), then the end boundary of the word (\b).
  7. \d+ (find a numeric string)
  "+" is very similar to "*", except that + repeats the previous item at least once. That is, there is at least one digit.
  8. \b\w{6}\b (find a word of six alphanumeric characters, such as ab123c)
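  Examples 6 to 8 can be tried as follows. This is a sketch with Python's re module (an assumption, not the .NET library the article describes), with example 8 written out as \b\w{6}\b; the sample text is made up.

```python
import re

text = "able was ab123c in 2006"

a_words = re.findall(r"\ba\w*\b", text)     # RE 6: words starting with "a"
numbers = re.findall(r"\d+", text)          # RE 7: runs of one or more digits
six_long = re.findall(r"\b\w{6}\b", text)   # RE 8: words of exactly six characters

print(a_words)   # ['able', 'ab123c']
print(numbers)   # ['123', '2006']
print(six_long)  # ['ab123c']
```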
  The following table is the commonly used special characters of RE
  . Any character except the newline character
  \w Any alphanumeric character (letter, digit, or underscore)
  \s Any whitespace character
  \d Any digit character
  \b A word boundary
  ^ Start of the string, such as "^The" to indicate that "The" appears at the start of the string
  $ End of the string, such as "End$" to indicate that "End" appears at the end of the string
  The special characters "^" and "$" are used when something must appear at the start or end of the text. This is especially useful for verifying whether input matches a certain pattern. For example, to verify a 7-digit phone number, you might enter RE 9 as follows.
  9. ^\d{3}-\d{4}$ (verify a 7-digit phone number)
  This is the same as RE 5, except that no other characters may come before or after it, that is, the entire string must be just this 7-digit phone number. In .NET, if the Multiline option is set, "^" and "$" are matched line by line: a match only needs to line up with the start and end of some line, rather than with the start and end of the whole string.
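  The effect of the anchors and of the Multiline option can be sketched like this. The code assumes Python's re module, where the corresponding flag is re.MULTILINE; is_phone is a hypothetical helper name and the sample data is invented.

```python
import re

def is_phone(s):
    # RE 9: the whole string must be exactly ddd-dddd.
    return re.search(r"^\d{3}-\d{4}$", s) is not None

lines = "123-4567\nnot a number\n765-4321"

# Without the flag, ^ matches only at the very start of the string and
# $ only at its end (or just before a final newline).
whole_string = re.findall(r"^\d{3}-\d{4}$", lines)

# With the flag, ^ and $ also match at the start and end of every line.
per_line = re.findall(r"^\d{3}-\d{4}$", lines, flags=re.MULTILINE)

print(is_phone("123-4567"), is_phone("call 123-4567"))  # True False
print(whole_string)  # []
print(per_line)      # ['123-4567', '765-4321']
```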
  Escaped characters
  Sometimes you simply need the literal "^" or "$" rather than their special meaning. In that case, the "\" character is used to remove the special meaning of a special character. Thus "\^", "\.", and "\\" stand for the literal characters "^", ".", and "\" respectively.
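  A small sketch of what escaping changes, assuming Python's re module. re.escape is a Python standard-library helper that adds the backslashes for you; it is not something the article mentions, and the sample strings are made up.

```python
import re

sample = "3.14 3x14 3514"

unescaped = re.findall(r"3.14", sample)   # "." matches any character
escaped = re.findall(r"3\.14", sample)    # "\." matches only a literal dot

# "\\" in the pattern matches one literal backslash.
backslashes = re.findall(r"\\", r"C:\DOS\COMMAND.COM")

# re.escape builds a pattern that matches its argument literally.
literal = re.fullmatch(re.escape("^a.b$"), "^a.b$")

print(unescaped)         # ['3.14', '3x14', '3514']
print(escaped)           # ['3.14']
print(len(backslashes))  # 2
print(literal is not None)  # True
```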
  Repeat the previous item
  We have already seen that "{3}" and "*" can be used to repeat the preceding character. Later we will see how the same syntax repeats an entire subexpression. The following table lists the ways to repeat the previous item.
  * Repeat any number of times
  + Repeat at least once
  ? Repeat zero or one time
  {n} Repeat n times
  {n,m} Repeat at least n times, but not more than m times
  {n,} Repeat at least n times
  Let's try some examples
  10. \b\w{5,6}\b (find a word of five or six alphanumeric characters, such as as25d, d58sdf, etc.)
  11. \b\d{3}\s\d{3}-\d{4} (find a 10-digit phone number, such as 800 123-1234)
  12. \d{3}-\d{2}-\d{4} (find a social security number, such as 123-45-6789)
  13. ^\w* (the first word of each line or of the entire text)
  Try it in Expresso with and without the Multiline option to see the difference.
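  If Expresso is not at hand, examples 10 to 13 can be tried with the sketch below. It assumes Python's re module (re.MULTILINE standing in for the .NET Multiline option) and writes the quantifiers out in full; the sample text is invented.

```python
import re

text = "as25d d58sdf toolongword 800 123-1234 123-45-6789"

words = re.findall(r"\b\w{5,6}\b", text)            # RE 10
phone = re.findall(r"\b\d{3}\s\d{3}-\d{4}", text)   # RE 11
ssn = re.findall(r"\d{3}-\d{2}-\d{4}", text)        # RE 12

two_lines = "first line\nsecond line"
first_only = re.findall(r"^\w*", two_lines)                      # RE 13
each_line = re.findall(r"^\w*", two_lines, flags=re.MULTILINE)   # RE 13, multiline

print(words)       # ['as25d', 'd58sdf']
print(phone)       # ['800 123-1234']
print(ssn)         # ['123-45-6789']
print(first_only)  # ['first']
print(each_line)   # ['first', 'second']
```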
  Match characters in a certain range
  What should you do when you need to find one of several specific characters? This is where square brackets "[]" come in handy. Thus [aeiou] matches any of the vowels "a", "e", "i", "o", "u", and [.?!] matches any of the symbols ".", "?", "!". Special characters inside square brackets lose their special meaning and are taken literally. You can also specify ranges of characters, such as "[a-z0-9]", which means any lowercase letter or any digit.
  Next, let's look at a more complex RE for finding phone numbers
  14. \(?\d{3}[) ]\s?\d{3}[- ]\d{4} (find a 10-digit phone number, such as (080) 333-1234)
  This RE finds phone numbers in more formats, such as (080) 123-4567 or 511 254 6654. "\(?" represents zero or one left parenthesis "(", "[) ]" matches one right parenthesis ")" or a space, and "\s?" means zero or one whitespace character. But this RE will also find a phone number like "800) 453-3321", that is, it does not check that the parentheses are balanced. Later, we will use alternatives to solve this problem.
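  The character classes above, written out in full as [aeiou], [.?!] and [a-z0-9], and example 14 as \(?\d{3}[) ]\s?\d{3}[- ]\d{4}, can be tried with this sketch. It assumes Python's re module, and the test strings are made up. Note that the number with the unbalanced parenthesis is accepted too.

```python
import re

vowels = re.findall(r"[aeiou]", "regular")
marks = re.findall(r"[.?!]", "Really? Yes. Wow!")
lower_or_digit = re.findall(r"[a-z0-9]+", "Win98 DOS6")

phone = re.compile(r"\(?\d{3}[) ]\s?\d{3}[- ]\d{4}")  # RE 14
candidates = ["(080) 123-4567", "511 254 6654", "800) 453-3321"]
accepted = [c for c in candidates if phone.fullmatch(c)]

print(vowels)          # ['e', 'u', 'a']
print(marks)           # ['?', '.', '!']
print(lower_or_digit)  # ['in98', '6']
print(accepted)        # all three, including the unbalanced one
```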
  Negation
  Sometimes you need to find characters not in a certain specific character group. The following table shows how to make such a description.
  \W Any character that is not alphanumeric
  \S Any character that is not a whitespace character
  \D Any character that is not a digit character
  \B Not at the word boundary position
  [^x] Any character that is not x
  [^aeiou] Any character that is not a, e, i, o, u
  15. \S+ (a string that does not contain whitespace characters)
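  A short sketch of the negated forms, assuming Python's re module; the negated character class is written as [^aeiou], and the sample strings are made up.

```python
import re

tokens = re.findall(r"\S+", "no  white\tspace here")   # RE 15
non_vowels = re.findall(r"[^aeiou]", "audio")
non_digits = re.findall(r"\D", "2006-10-26")
non_word = re.findall(r"\W", "a-b c")

print(tokens)      # ['no', 'white', 'space', 'here']
print(non_vowels)  # ['d']
print(non_digits)  # ['-', '-']
print(non_word)    # ['-', ' ']
```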
  Alternatives
  Sometimes you need to match one of several specific choices. This is where the special character "|" comes in handy. For example, to find both 5-digit postal codes and 9-digit ones (written with a "-").
  16. \b\d{5}-\d{4}\b|\b\d{5}\b (find 5-digit and 9-digit (with "-") postal codes)
  When using alternatives, pay attention to their order, because the RE gives priority to the leftmost alternative that matches. In 16, if the 5-digit alternative were placed first, this RE would only ever find 5-digit postal codes. With alternatives understood, we can improve 14.
  17. (\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4} (a 10-digit phone number)
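  The effect of the order of alternatives, and the improved phone pattern, can be sketched as below. Python's re module is assumed, the patterns are written out in full, and the sample data is invented.

```python
import re

text = "12345-6789 and 54321"

# RE 16: the 9-digit alternative comes first.
nine_first = re.findall(r"\b\d{5}-\d{4}\b|\b\d{5}\b", text)
# Swapped order: the 5-digit alternative wins and the "-6789" part is lost.
five_first = re.findall(r"\b\d{5}\b|\b\d{5}-\d{4}\b", text)

# RE 17: either "(ddd)" or "ddd", so a lone ")" is no longer accepted.
phone = re.compile(r"(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}")
balanced = [bool(phone.fullmatch(s)) for s in
            ["(080) 123-4567", "511 254 6654", "800) 453-3321"]]

print(nine_first)  # ['12345-6789', '54321']
print(five_first)  # ['12345', '54321']
print(balanced)    # [True, True, False]
```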
  Grouping
  Parentheses can be used to define a subexpression. Through the definition of the subexpression, you can repeat or perform other processing on the subexpression.
  18. (\d{1,3}\.){3}\d{1,3} (a simple RE for finding an IP address)
  The first part of this RE, (\d{1,3}\.), means a number of at least one and at most three digits followed by a "." symbol. This part occurs three times and is then followed by one to three more digits, giving a number like 192.72.28.1.
  But there is a shortcoming: each number in an IP address can be at most 255, yet the RE above accepts any number of one to three digits. The number needs to be less than 256, but RE cannot do numeric comparison. In 19, alternatives are used to restrict each number to the required range of 0 to 255.
  19. ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) (find an IP address)
  Have you noticed that RE looks more and more like alien language? Even for something as simple as an IP address, the RE is quite hard to understand at a glance.
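  To see the difference between the simple and the range-limited IP patterns, here is a sketch assuming Python's re module, with both patterns written out in full. fullmatch is used so that the whole string must be an address; with a plain search the strict pattern could still find a valid-looking piece inside an invalid address.

```python
import re

ip_simple = re.compile(r"(\d{1,3}\.){3}\d{1,3}")  # RE 18
ip_strict = re.compile(
    r"((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)"  # RE 19
)

addresses = ["192.72.28.1", "255.255.255.0", "300.1.1.1", "256.1.1.1"]

simple_ok = [bool(ip_simple.fullmatch(a)) for a in addresses]
strict_ok = [bool(ip_strict.fullmatch(a)) for a in addresses]

print(simple_ok)  # [True, True, True, True]
print(strict_ok)  # [True, True, False, False]
```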
  Expresso Analyzer View
  Expresso provides a function that can turn the entered RE into a tree-like description, separated into groups, providing a good debugging environment. Other functions, such as partial match (Partial Match only finds the part of the highlighted RE) and exclude match (Exclude Match only does not find the part of the highlighted RE) are left for you to try.
  When a subexpression is grouped with parentheses, the text that matches the subexpression can be used in subsequent program processing or the RE itself. Under the default situation, the matched groups are named by numbers, starting from 1, and the order is from left to right. This automatic group naming can be seen in the skeleton view or result view in Expresso.
  Backreference is used to find the same text as the matched text captured in the group. For example, "\1" refers to the text captured in group 1.
  20. \b(\w+)\b\s*\1\b (find repeated words, that is, the same word twice with whitespace in between, such as dog dog)
  (\w+) captures a word of at least one letter or digit and names it group 1. Then come any whitespace characters, and then the same text as group 1.
  If you don't like the automatically named 1 of the group, you can also name it yourself. For example, in the above example, (\w+) is rewritten as (?<Word>\w+), which is to name the captured group as Word. Backreference should be rewritten as \k<Word>
  21. \b(?<Word>\w+)\b\s*\k<Word>\b (use a self-named group to capture repeated words)
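  A sketch of the repeated-word search, assuming Python's re module. The numbered backreference \1 works the same way, but Python spells named groups differently from .NET: (?P<Word>...) to define the group and (?P=Word) to refer back to it, instead of (?<Word>...) and \k<Word>. The sample sentence is made up.

```python
import re

text = "the dog dog ran and and fell"

# RE 20 with a numbered group; findall returns the text captured by group 1.
repeated = re.findall(r"\b(\w+)\b\s*\1\b", text)

# RE 21 rewritten in Python's named-group syntax.
named = re.search(r"\b(?P<Word>\w+)\b\s*(?P=Word)\b", text)

print(repeated)             # ['dog', 'and']
print(named.group("Word"))  # dog
print(named.group(0))       # dog dog
```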
  There are many special syntax elements when using parentheses. The more common list is as follows:
  Captures
  (exp) Match exp and capture it into an automatically named group
  (?<name>exp) Match exp and capture it into a named group name
  (?:exp) Match exp, do not capture it
  Lookarounds
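  (?=exp) Match any position followed by the suffix exp
  (?<=exp) Match any position preceded by the prefix exp
  (?!exp) Match any position not followed by the suffix exp
  (?<!exp) Match any position not preceded by the prefix exp
  30. (?<=<(\w+)>).*(?=<\/\1>) (text between HTML tags)
  This uses lookahead and lookbehind assertions to extract the text between HTML tags, excluding the tags themselves.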
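  Lookahead and lookbehind can be tried with the sketch below. It assumes Python's re module, which, unlike .NET, only accepts fixed-width lookbehind, so the HTML example hard-codes the tag name <b> instead of capturing it. The sample strings are made up.

```python
import re

# Lookahead: the start of words that end in "ing"; the suffix is not included.
stems = re.findall(r"\b\w+(?=ing\b)", "singing and dancing")

# Lookbehind: digits preceded by a dollar sign; the sign is not included.
prices = re.findall(r"(?<=\$)\d+", "cost $15 or 20")

# Negative lookahead: words containing a "q" that is not followed by "u".
q_words = re.findall(r"\b\w*q(?!u)\w*\b", "Iraq quit qat")

# Text between <b> tags, without the tags themselves.
inner = re.search(r"(?<=<b>).*(?=</b>)", "<b>bold</b>").group(0)

print(stems)    # ['sing', 'danc']
print(prices)   # ['15']
print(q_words)  # ['Iraq', 'qat']
print(inner)    # bold
```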
  Comments Please
  Parentheses also have another special use: enclosing comments. The syntax is "(?#comment)". If the "Ignore Pattern Whitespace" option is set, whitespace characters in the RE are ignored when the RE is used, and the text from "#" to the end of the line is ignored as well.
  31. Text between HTML tags, with comments
  (?<= #Find the prefix, but do not include it
  <(\w+)> #HTML tag
  ) #End the prefix search
  .* #Match any text
  (?= #Find the suffix, but do not include it
  <\/\1> #Match the string captured in group 1, that is, the HTML tag in the previous parentheses
  ) #End the suffix search
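  A commented pattern of this kind can be sketched in Python, where the re.VERBOSE flag plays the role of the "Ignore Pattern Whitespace" option. This is an assumption-laden variant, not the article's exact RE: Python requires fixed-width lookbehind, so the tag name <b> is hard-coded, and between_bold is a hypothetical name.

```python
import re

between_bold = re.compile(r"""
    (?<=       # look behind for the prefix, but do not include it
      <b>      # literal opening tag (hard-coded to keep the lookbehind fixed-width)
    )          # end of the prefix
    .*         # match any text
    (?=        # look ahead for the suffix, but do not include it
      </b>     # literal closing tag
    )          # end of the suffix
""", re.VERBOSE)

found = between_bold.search("<b>Regex Digest</b>").group(0)

# The inline form (?#comment) works even without the verbose flag.
inline = re.findall(r"\d{3}(?#exchange)-\d{4}(?#line number)", "555-1234")

print(found)   # Regex Digest
print(inline)  # ['555-1234']
```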
  Greedy and Lazy
  When the RE is to find a range of repetitions (such as ".*"), it usually finds the most characters that match, that is, Greedy matching. For example.
  32. a.*b (the most characters that match from a to b)
  If there is a string "aabab", the above RE matches "aabab", because it looks for the most characters. Sometimes you want to match the fewest characters instead, that is, lazy matching. Appending a question mark (?) to any entry in the table of repetitions turns it into a lazy match. Therefore, "*?" means repeat any number of times, but use the fewest repetitions that still match. For example:
  33. a.*?b (the least characters that match from a to b)
  If there is a string "aabab", the first matched string obtained by using the above RE is "aab" and then "ab", because this is to find the least characters.
  *? Repeat any number of times, with the principle of the least number of repetitions
  +? Repeat at least once, with the principle of the least number of repetitions
  ?? Repeat zero or one time, with the principle of the least number of repetitions
  {n,m}? Repeat at least n times, but not more than m times, with the principle of the least number of repetitions
  {n,}? Repeat at least n times, with the principle of the least number of repetitions
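  The "aabab" example can be checked directly. A minimal sketch, assuming Python's re module, whose greedy and lazy quantifiers behave the same way here:

```python
import re

greedy = re.findall(r"a.*b", "aabab")   # RE 32
lazy = re.findall(r"a.*?b", "aabab")    # RE 33

print(greedy)  # ['aabab']
print(lazy)    # ['aab', 'ab']
```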
  What else is not mentioned?
  So far, many elements for building RE have been mentioned. Of course, there are still many elements not mentioned. The following table sorts out some elements not mentioned. The number in the leftmost field is the example in Expresso.
  # Syntax Description
  \a Bell character
  \b Usually refers to the word boundary, and in the character group it represents backspace
  \t Tab
  34 \r Carriage return
  \v Vertical Tab
  \f Form feed
  35 \n New line
  \e Escape
  36 \nnn The character whose octal ASCII code is nnn
  37 \xnn The character whose hexadecimal code is nn
  38 \unnnn The character whose Unicode code is nnnn
  39 \cN Control character N. For example, Ctrl-M is \cM
  40 \A Start of the string (similar to ^, but unaffected by the Multiline option)
  41 \Z End of the string, or before the newline at the end of the string
  \z End of the string
  42 \G Start of the current search
  43 \p{name} Any character in the Unicode character class called name. For example, \p{Lowercase_Letter} refers to lowercase letters
  (?>exp) Greedy subexpression, also known as a non-backtracking subexpression. It is matched only once and is never backtracked into.
  44 (?<x-y>exp)
  or (?<-y>exp) Balancing group. Complex but powerful: it lets named capture groups be pushed onto and popped off a stack. (I don't understand this one either)
  45 (?im-nsx:exp) Change the RE option for subexpression exp. For example, (?-i:Elvis) is to turn off the option of ignoring case of Elvis.
  46 (?im-nsx) Change the RE options for the rest of the enclosing group.
  (?(exp)yes|no) The subexpression exp is regarded as zero-width positive lookahead. If there is a match at this time, the yes subexpression is the next match target. If not, the no subexpression is the next match target.
  (?(exp)yes) The same as above but without the no subexpression
  (?(name)yes|no) If the group called name has captured a match, then the yes subexpression is the next match target. If not, the no subexpression is the next match target.
  47 (?(name)yes) The same as above but without the no subexpression
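  Two entries of this table can be tried in Python, which supports the conditional form with a group number or name and, from version 3.6, option changes scoped to a subexpression. This is only a sketch under those assumptions; the phone pattern is made up for illustration and is not one of the article's examples.

```python
import re

# Group 1 optionally captures "(", and (?(1)\)) demands ")" only if it did.
phone = re.compile(r"^(\()?\d{3}(?(1)\))[- ]?\d{3}-\d{4}$")

candidates = ["(800) 123-4567", "800 123-4567", "800) 123-4567", "(800 123-4567"]
results = [bool(phone.search(c)) for c in candidates]

# (?i:...) ignores case for "elvis" only; " lives" stays case-sensitive.
scoped = re.findall(r"(?i:elvis) lives", "ELVIS lives, Elvis LIVES")

print(results)  # [True, True, False, False]
print(scoped)   # ['ELVIS lives']
```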

[ Last edited by 无奈何 on 2006-10-26 at 11:53 AM ]

Floor 9 Posted 2006-10-26 11:44 ·  中国 浙江 宁波 鹏博士宽带
Foreword
  Regular expressions (abbreviated as RE hereinafter) have always been a mysterious area for me. Seeing experts on the Internet easily solve text problems using RE made me want to learn it. But I am naturally a bit lazy and always hope to find a quick way to learn. So I turned to the Google god. With His power, I found an article by Mr. Jim Hollenhorst on the Internet. After reading it, I thought it was really good, so I wrote a short summary to share with the friends of Move-to.Net, hoping it helps everyone a little in learning RE. The URL of Mr. Jim Hollenhorst's article is given below for those who need it.
  The 30 Minute Regex Tutorial By Jim Hollenhorst
  http://www.codeproject.com/useritems/RegexTutorial.asp

  What is RE?
  I believe everyone has used the wildcard "*" when searching for files. For example, to find all Word files in the Windows directory, you might search for "*.doc", because "*" stands for any sequence of characters. RE does something similar, but is far more powerful.
  When writing a program, you often need to check whether a string matches a specific pattern. The main job of RE is to describe such a pattern, so an RE can be regarded as the description of a pattern. For example, "\w+" represents any non-empty string composed of letters and digits. The .NET Framework provides a very powerful class library that makes it easy to use RE for text search and replacement, decoding complex headers, validating text, and so on.
  The best way to learn RE is to work through examples yourself. Mr. Jim Hollenhorst also provides a tool, Expresso (have a cup of coffee), to help us learn RE. The download URL is http://www.codeproject.com/useritems/RegexTutorial/ExpressoSetup2_1C.zip.
  Next, let's experience some examples.
  Some simple examples
  Suppose you want to find a string with Elvis followed by alive in the article, using RE may go through the following process, and the parentheses are the meaning of the RE:
  1. elvis (search for elvis)
  This searches for the character sequence elvis. In .NET you can choose to ignore case, so "Elvis", "ELVIS" and "eLvIs" all match RE 1. But because it only looks for the characters e, l, v, i, s in that order, pelvis also matches RE 1. RE 2 improves on this.
  2. \belvis\b (search for elvis as a whole word, such as elvis, or Elvis when case is ignored)
  "\b" has a special meaning in RE: it denotes a word boundary. So \belvis\b uses \b to mark the boundaries before and after elvis, that is, it finds the whole word elvis.
  Suppose you want to find a string with elvis followed by alive on the same line. For that you need two more special characters, "." and "*". "." represents any character except the newline character, and "*" repeats the item before it any number of times, as many as needed to find a match. So ".*" means any number of characters other than newline. To find elvis followed by alive on the same line, you can enter RE 3 as follows.
  3. \belvis\b.*\balive\b (search for the string with elvis followed by alive, such as elvis is alive)
  You can form a powerful RE with simple special characters, but you also find that when using more and more special characters, the RE will become more and more difficult to understand.
Let's look at another example
  Form a valid phone number
  Suppose you want to collect a 7-digit phone number in the format xxx-xxxx from a web page, where x is a digit, the RE may be written like this.
  4. \b\d\d\d-\d\d\d\d (search for a 7-digit phone number, such as 123-1234)
  Each \d represents a digit. "-" is a general hyphen. To avoid too many repeated \d, the RE can be rewritten in the way of 5.
  5. \b\d{3}-\d{4} (a better way to search for a 7-digit phone number, such as 123-1234)
  The {3} after \d means repeating the previous item three times, which is equivalent to \d\d\d.
  RE learning and testing tool Expresso
  Because RE is not easy to read and users are prone to writing a wrong RE, Mr. Jim developed a tool, Expresso, to help users learn and test RE. In addition to the URL mentioned above, you can also go to the Ultrapico website (http://www.Ultrapico.com). After installing Expresso, you will find all the examples of the article in the Expression Library. You can test while reading the article, and you can also try modifying the RE of an example and see the result immediately. I think it is very easy to use. Give it a try.
  Basic concepts of RE in .NET
  Special characters
  Some characters have special meanings, such as "\b", ".", "*", "\d" that we have seen before. "\s" represents any whitespace character, such as spaces, tabs, newlines, etc. "\w" represents any letter or digit character.
  Let's look at some more examples
  6. \ba\w*\b (search for words starting with a, such as able)
  This RE describes that you want to find the start boundary of a word (\b), then the letter "a", then any number of letters and digits (\w*), then the end boundary of this word (\b).
  7. \d+ (search for a string of digits)
  "+" is very similar to "*", except that + repeats the previous item at least once. That is, there is at least one digit.
  8. \b\w{6}\b (search for a word of six alphanumeric characters, such as ab123c)
  The following table shows the commonly used special characters in RE
  . Any character except the newline character
  \w Any alphanumeric character
  \s Any whitespace character
  \d Any digit character
  \b Define word boundary
  ^ Start of the article, such as "^The" to indicate that the string appearing at the start of the article is "The"
  $ End of the article, such as "End$" to indicate that it appears at the end of the article as "End"
  The special characters "^" and "$" are used to find that certain words must be at the start or end of the article. This is especially useful when verifying whether the input conforms to a certain pattern. For example, to verify a 7-digit phone number, you may enter the RE of 9 as follows.
  9. ^\d{3}-\d{4}$ (verify a 7-digit phone number)
  This is the same as RE 5, except that no other characters may come before or after it, that is, the entire string must be just this 7-digit phone number. In .NET, if the Multiline option is set, "^" and "$" are matched line by line: a match only needs to line up with the start and end of some line, rather than with the start and end of the whole string.
  Escaped characters
  Sometimes you may need the literal meaning of "^" and "$" instead of treating them as special characters. In that case, the "\" character is used to remove the special meaning of a special character. Thus "\^", "\.", and "\\" stand for the literal characters "^", ".", and "\" respectively.
  Repeat the previous item
  We have seen that "{3}" and "*" can be used to repeat the preceding character. Later, we will see how the same syntax repeats an entire subexpression. The following table lists the ways to repeat the previous item.
  * Repeat any number of times
  + Repeat at least once
  ? Repeat zero or one time
  {n} Repeat n times
  {n,m} Repeat at least n times, but not more than m times
  {n,} Repeat at least n times
  Let's try some more examples
  10. \b\w{5,6}\b (search for words of five or six alphanumeric characters, such as as25d, d58sdf, etc.)
  11. \b\d{3}\s\d{3}-\d{4} (search for a 10-digit phone number, such as 800 123-1234)
  12. \d{3}-\d{2}-\d{4} (search for a social security number, such as 123-45-6789)
  13. ^\w* (the first word of each line or of the entire text)
  Try it in Expresso with and without the Multiline option to see the difference.
  Match characters in a certain range
  What should you do when you need to find one of several specific characters? This is where square brackets "[]" come in handy. Thus [aeiou] matches any of the vowels "a", "e", "i", "o", "u", and [.?!] matches any of the symbols ".", "?", "!". Special characters inside square brackets lose their special meaning and are taken literally. You can also specify ranges of characters, such as "[a-z0-9]", which means any lowercase letter or any digit.
  Next, let's look at a more complex example of finding a phone number's RE
  14. \(?\d{3}[) ]\s?\d{3}[- ]\d{4} (search for a 10-digit phone number, such as (080) 333-1234)
  This RE finds phone numbers in more formats, such as (080) 123-4567 or 511 254 6654. "\(?" represents zero or one left parenthesis "(", "[) ]" matches one right parenthesis ")" or a space, and "\s?" means zero or one whitespace character. But this RE will also find a phone number like "800) 453-3321", that is, it does not check that the parentheses are balanced. Later, we will use alternatives to solve this problem.
  Negation
  Sometimes you need to find characters not included in a certain specific character group. The following table shows how to make such a description.
  \W Any character that is not alphanumeric
  \S Any character that is not a whitespace character
  \D Any character that is not a digit character
  \B Not at the word boundary position
  [^x] Any character that is not x
  [^aeiou] Any character that is not a, e, i, o, u
  15. \S+ (a string that does not contain whitespace characters)
  Alternatives
  Sometimes you need to find several specific choices. At this time, the special character "|" comes in handy. For example, to find a 5-digit and a 9-digit (with "-" sign) postal code.
  16. \b\d{5}-\d{4}\b|\b\d{5}\b (search for 5-digit and 9-digit (with "-" sign) postal codes)
  When using alternatives, pay attention to their order, because the RE gives priority to the leftmost alternative that matches. In 16, if the 5-digit alternative were placed first, this RE would only ever find 5-digit postal codes. With alternatives understood, we can improve 14.
  17. (\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4} (a 10-digit phone number)
  Grouping
  Parentheses can be used to define a subexpression. Through the definition of the subexpression, you can perform repetition or other processing on the subexpression.
  18. (\d{1,3}\.){3}\d{1,3} (a simple RE for finding an IP address)
  The first part of this RE, (\d{1,3}\.), means a number of at least one and at most three digits followed by a "." symbol. This part occurs three times and is then followed by 1 to 3 more digits, giving a number like 192.72.28.1.
  But there is a shortcoming: each number in an IP address can be at most 255, yet the RE above accepts any number of 1 to 3 digits. The number needs to be less than 256, but RE cannot do numeric comparison. In 19, alternatives are used to restrict each number to the required range of 0 to 255.
  19. ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) (search for an IP address)
  Have you found that RE is more and more like what aliens say? Just looking for an IP address simply, it is quite difficult to understand directly from the RE.
  Expresso Analyzer View
  Expresso provides a function that turns the entered RE into a tree-like explanation, separated into groups, which makes a good debugging environment. Other functions, such as partial match (Partial Match searches only for the highlighted part of the RE) and exclude match (Exclude Match leaves the highlighted part of the RE out of the search), are left for you to try.
  When a subexpression is grouped by parentheses, the text that matches the subexpression can be used in subsequent program processing or in the RE itself. Under the default situation, the matched groups are named by numbers, starting from 1, and the order is from left to right. This automatic group naming can be seen in the skeleton view or result view in Expresso.
  Backreference is used to find the same text as the matched text captured in the group. For example, "\1" refers to the text captured in group 1.
  20. \b(\w+)\b\s*\1\b (search for repeated words, that is, the same word twice with whitespace in between, such as dog dog)
  (\w+) captures a word of at least one letter or digit and names it group 1, then come any whitespace characters, and then the same text as group 1.
  If you don't like the automatic name 1 for the group, you can also name it yourself. For the above example, (\w+) is rewritten as (?<Word>\w+), which names the captured group Word, and the backreference is rewritten as \k<Word>
  21. \b(?<Word>\w+)\b\s*\k<Word>\b (use a self-named group to capture repeated words)
  There are many special syntax elements when using parentheses. The more common list is as follows:
  Captures
  (exp) Match exp and capture it into an automatically named group
  (?<name>exp) Match exp and capture it into a named group name
  (?:exp) Match exp, but do not capture it
  Lookarounds
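  (?=exp) Match any position followed by the suffix exp
  (?<=exp) Match any position preceded by the prefix exp
  (?!exp) Match any position not followed by the suffix exp
  (?<!exp) Match any position not preceded by the prefix exp
  30. (?<=<(\w+)>).*(?=<\/\1>) (text between HTML tags)
  This uses lookahead and lookbehind assertions to extract the text between HTML tags, not including the tags themselves.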
  Comments Please
  Parentheses also have another special use: enclosing comments. The syntax is "(?#comment)". If the "Ignore Pattern Whitespace" option is set, whitespace characters in the RE are ignored when the RE is used, and the text from "#" to the end of the line is ignored as well.
  31. Text between HTML tags, with comments
  (?<= #Find the prefix, but do not include it
  <(\w+)> #HTML tag
  ) #End the prefix search
  .* #Match any text
  (?= #Find the suffix, but do not include it
  <\/\1> #Match the string captured in group 1, that is, the HTML tag in the previous parentheses
  ) #End the suffix search
  Greedy and Lazy
  When the RE is to find a range of repetition (such as ".*"), it usually finds the most characters of the matching word, that is, Greedy matching. For example.
  32. a.*b (the matching word with the most characters starting with a and ending with b)
  If there is a string "aabab", the matching string obtained using the above RE is "aabab", because this is to find the word with the most characters. Sometimes you want to match the word with the least characters, that is, lazy matching. As long as you add a question mark (?) to the table of repeating the previous item, you can turn them all into lazy matching. Therefore, "*?" means repeating any number of times, but using the least number of repetitions to match. For example:
  33. a.*?b (the matching word with the least characters starting with a and ending with b)
  If there is a string "aabab", the first matching string obtained using the above RE is "aab" and then "ab", because this is to find the word with the least characters.
  *? Repeat any number of times, with the principle of the least number of repetitions
  +? Repeat at least once, with the principle of the least number of repetitions
  ?? Repeat zero or one time, with the principle of the least number of repetitions
  {n,m}? Repeat at least n times, but not more than m times, with the principle of the least number of repetitions
  {n,}? Repeat at least n times, with the principle of the least number of repetitions
What else is not mentioned?
  So far, many elements for building RE have been mentioned. Of course, there are still many elements not mentioned. The following table sorts out some of the elements not mentioned. The number in the leftmost field is the explanation in the example in Expresso.
  # Syntax Explanation
  \a Bell character
  \b Usually refers to the word boundary, and in the character group, it represents backspace
  \t Tab
  34 \r Carriage return
  \v Vertical Tab
  \f Form feed
  35 \n New line
  \e Escape
  36 \nnn ASCII octal code is nnn character
  37 \xnn Hexadecimal code is nn character
  38 \unnnn Unicode is nnnn character
  39 \cN Control N character, for example, Ctrl-M is \cM
  40 \A Start of the string (similar to ^, but without the need for the multiline option)
  41 \Z End of the string, or just before a newline at the end of the string
  \z End of the string
  42 \G Start of the current search
  43 \p{name} Unicode character group name is name character, for example, \p{Lowercase_Letter} refers to lowercase letters
  (?>exp) Greedy subexpression, also known as a non-backtracking subexpression. It is matched only once and is never backtracked into.
  44 (?<name1-name2>exp)
  or (?<-name>exp) Balancing group. Although complex, it is handy to use. It lets named capture groups be manipulated as a stack. (I don't understand this either)
  45 (?im-nsx:exp) Change the RE options for the subexpression exp. For example, (?-i:Elvis) turns off case-insensitive matching for Elvis.
  46 (?im-nsx) Change the RE options for the rest of the enclosing group.
  (?(exp)yes|no) The subexpression exp is treated as a zero-width positive lookahead. If it matches at this point, the yes subexpression is tried next; otherwise the no subexpression is tried.
  (?(exp)yes) The same as above, but without the no subexpression
  (?(name)yes|no) If the group called name has captured a match, the yes subexpression is tried next; otherwise the no subexpression is tried.
  47 (?(name)yes) The same as above, but without the no subexpression
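  A few of these entries can be tried in Python's re module. This is only a sketch under that assumption: the table follows the .NET engine used by Expresso, and Python differs in places (for instance, its \Z matches only at the very end of the string, and (?-i:...) needs Python 3.6 or later). The helper is_balanced is a hypothetical name used for illustration.

```python
import re

# \A matches only at the start of the whole string, while ^ also matches
# after each newline once the multiline option is on.
text = "one\ntwo"
caret_hits = re.findall(r"^\w+", text, re.MULTILINE)    # every line start
anchor_hits = re.findall(r"\A\w+", text, re.MULTILINE)  # string start only

# (?-i:exp) turns case-insensitive matching off for one subexpression only.
greeting = re.compile(r"(?i)hello (?-i:Elvis)")
ok = greeting.search("HELLO Elvis") is not None
bad = greeting.search("hello ELVIS") is not None

# (?(name)yes): the closing quote is required only if the opening one matched.
quoted = re.compile(r'(?P<quote>")?\w+(?(quote)")')


def is_balanced(s):
    """True if s is a word with either both quotes or none."""
    return quoted.fullmatch(s) is not None


print(caret_hits, anchor_hits)
print(ok, bad)
print(is_balanced('"abc"'), is_balanced('"abc'))
```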

[ Last edited by 无奈何 on 2006-10-26 at 12:22 PM ]

Floor 10 Posted 2006-10-26 11:51 ·  中国 四川 成都 教育网
Bumping the post above. RegExp is useful but hard to remember, and the syntax differs from one program to the next: findstr, UltraEdit, Perl, Python, and JavaScript all do it differently, which is frustrating.

Floor 11 Posted 2006-10-26 20:17 ·  中国 福建 泉州 电信
Save it for later study.
Floor 12 Posted 2006-10-26 20:45 ·  中国 北京 朝阳区 联通
Very useful classic, collect it ~ :)
Floor 13 Posted 2006-10-27 00:17 ·  中国 湖北 武汉 电信
Regular expressions are definitely worth learning... Hehe... Bookmarked... Thanks to the moderator for the hard work...
Floor 14 Posted 2006-10-27 00:21 ·  中国 辽宁 大连 电信
Regular expressions seem simple. But it's really hard to use them well.
Floor 15 Posted 2006-10-27 01:28 ·  中国 甘肃 甘南藏族自治州 合作市 电信
They really are hard to use well. I'll study this first.