Foreword
Regular expressions (abbreviated as RE below) have always been a mysterious area for me. Seeing experts on the Internet solve text problems so easily with RE made me want to learn it. But I am naturally a bit lazy and always hope to find a quick way to learn, so I turned to the Google god. With His power, I found an article by Mr. Jim Hollenhorst. After reading it, I thought it was really good, so I wrote up this short summary to share with the friends at Move-to.Net, hoping it brings a little help to you great people in learning RE. The URL of Mr. Jim Hollenhorst's article is given below for those who need it.
The 30 Minute Regex Tutorial By Jim Hollenhorst
http://www.codeproject.com/useritems/RegexTutorial.asp
What is RE?
I believe all of you great people have used the wildcard "*" when searching for files. For example, to find all Word files in the Windows directory, you might search for "*.doc", because "*" stands for any sequence of characters. RE does something similar, but it is far more powerful.
When writing a program, you often need to check whether a string matches a specific pattern. The main job of RE is to describe that pattern, so an RE can be regarded as a description of a specific pattern. For example, "\w+" represents any non-empty string composed of letters and digits. The .NET Framework provides a very powerful class library that makes it easy to use RE to search and replace text, decode complex headers, validate text, and so on.
The best way to learn RE is through examples. Mr. Jim Hollenhorst also provides a tool, Expresso (have a cup of coffee), to help us learn RE. The download URL is:
http://www.codeproject.com/useritems/RegexTutorial/ExpressoSetup2_1C.zip
Next, let's experience some examples.
Some simple examples
Suppose you want to find "elvis" followed by "alive" in a piece of text. Using RE, you might go through the following steps; the text in parentheses explains what each RE means:
1. elvis (search for elvis)
This RE searches for the characters e, l, v, i, s in that order. In .NET you can set an option to ignore case, so "Elvis", "ELVIS" and "eLvIs" all match RE 1. But because RE 1 only cares that the characters appear in the order elvis, "pelvis" also matches. RE 2 improves on this.
2. \belvis\b (regard elvis as a whole word to search, such as elvis, Elvis when ignoring case of characters)
"\b" has a special meaning in RE. In the above example, it refers to the word boundary, so \belvis\b uses \b to define the front and back boundaries of elvis, that is, to find the word elvis.
Suppose you want to find elvis followed by alive on the same line. For this you need two more special characters, "." and "*". "." represents any character except the newline character, and "*" repeats the item before it any number of times (including zero). So ".*" means any number of characters other than newline. To find elvis followed by alive on the same line, you can enter RE 3 as follows.
3. \belvis\b.*\balive\b (search for the string with elvis followed by alive, such as elvis is alive)
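As a rough illustration of REs 1 to 3, here is a minimal sketch in Python. The article itself targets .NET, so the `re` module and the sample sentence are assumptions made only for this example; `re.IGNORECASE` plays the role of the ignore-case option mentioned above.

```python
import re

# Sample text invented for this sketch.
text = "Elvis has left the pelvis clinic. Some say elvis is alive."

# RE 1: plain character sequence, so the "elvis" inside "pelvis" also matches.
plain = re.findall(r"elvis", text, flags=re.IGNORECASE)

# RE 2: \b on both sides restricts the match to the whole word.
whole = re.findall(r"\belvis\b", text, flags=re.IGNORECASE)

# RE 3: "elvis" followed later on the same line by "alive".
m = re.search(r"\belvis\b.*\balive\b", text, flags=re.IGNORECASE)
same_line = m.group(0) if m else None

print(plain)      # three hits, one of them from "pelvis"
print(whole)      # two hits
print(same_line)
```

Because ".*" takes as much as it can, the match for RE 3 starts at the first "Elvis" and runs up to the last "alive" on the line.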
You can build a powerful RE from a few simple special characters, but you will also find that the more special characters you use, the harder the RE becomes to read.
Let's look at another example
Form a valid phone number
Suppose you want to collect a 7-digit phone number in the format xxx-xxxx from a web page, where x is a digit, the RE may be written like this.
4. \b\d\d\d-\d\d\d\d (search for a 7-digit phone number, such as 123-1234)
Each \d represents a digit. "-" is a general hyphen. To avoid too many repeated \d, the RE can be rewritten in the way of 5.
5. \b\d{3}-\d{4} (a better way to search for a 7-digit phone number, such as 123-1234)
The {3} after \d means repeating the previous item three times, which is equivalent to \d\d\d.
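To check that REs 4 and 5 really are equivalent, here is a small sketch using Python's `re` module (an assumption; the article uses .NET). The sample text is made up for illustration.

```python
import re

# Invented sample text: two phone numbers and one part number.
page = "Call 123-1234 or 555-0199; part number 12-34567 is not a phone."

# RE 4: every digit written out.
long_form = re.findall(r"\b\d\d\d-\d\d\d\d", page)

# RE 5: the same pattern with {n} repetition.
short_form = re.findall(r"\b\d{3}-\d{4}", page)

print(long_form)
print(short_form)
```

Both calls should return the same two phone numbers and skip the part number, which has only two digits before the hyphen.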
RE learning and testing tool Expresso
Because RE is hard to read and easy to get wrong, Mr. Jim developed the tool Expresso to help users learn and test RE. In addition to the URL mentioned above, you can also get it from the Ultrapico website:
http://www.Ultrapico.com
After installing Expresso, you will find that Mr. Jim has put all the examples from the article into the Expression Library. You can test while reading the article, modify the RE of an example, and see the result immediately. I think it is very easy to use. You great people can give it a try.
Basic concepts of RE in .NET
Special characters
Some characters have special meanings, such as "\b", ".", "*", "\d" that we have seen before. "\s" represents any whitespace character, such as spaces, tabs, newlines, etc. "\w" represents any letter or digit character.
Let's look at some more examples
6. \ba\w*\b (search for words starting with a, such as able)
This RE describes that you want to find the start boundary of a word (\b), then the letter "a", then any number of letters and digits (\w*), then the end boundary of this word (\b).
7. \d+ (search for a string of digits)
"+" is very similar to "*", except that + repeats the previous item at least once. That is, there is at least one digit.
8. \b\w{6}\b (search for a word of six alphanumeric characters, such as ab123c)
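REs 6 to 8 can be tried out with a few lines of Python. This is only a sketch: the `re` module stands in for the .NET classes, and the sample sentence is invented.

```python
import re

# Invented sample sentence.
text = "an able cat ate 42 figs, code ab123c"

# RE 6: words starting with "a".
a_words = re.findall(r"\ba\w*\b", text)

# RE 7: runs of one or more digits (note that it also finds digits inside words).
numbers = re.findall(r"\d+", text)

# RE 8: words of exactly six alphanumeric characters.
six = re.findall(r"\b\w{6}\b", text)

print(a_words)
print(numbers)
print(six)
```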
The following table shows the commonly used special characters in RE
. Any character except the newline character
\w Any alphanumeric character
\s Any whitespace character
\d Any digit character
\b Define word boundary
^ Start of the article, such as "^The" to indicate that the string appearing at the start of the article is "The"
$ End of the article, such as "End$" to indicate that it appears at the end of the article as "End"
The special characters "^" and "$" require the match to be at the start or end of the text. This is especially useful when verifying whether an input conforms to a certain pattern. For example, to verify a 7-digit phone number, you may enter RE 9 as follows.
9. ^\d{3}-\d{4}$ (verify a 7-digit phone number)
This is the same as RE 5, except that no other characters are allowed before or after it; that is, the entire string must consist of just this 7-digit phone number. In .NET, if the Multiline option is set, "^" and "$" are matched line by line, so it is enough for the start and end of a single line to match the RE, instead of the whole text being compared at once.
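The difference between validating a whole string and matching line by line can be sketched as follows. This uses Python's `re` module rather than .NET, with `re.MULTILINE` assumed to correspond to the Multiline option; the input strings are invented.

```python
import re

pattern = r"^\d{3}-\d{4}$"   # RE 9

# Validation: the whole string must be the phone number.
ok = re.search(pattern, "123-1234") is not None
bad = re.search(pattern, "call 123-1234 now") is not None

# Several lines of text: without the multiline flag nothing matches,
# with it every line is checked on its own.
lines = "123-1234\nnot a number\n555-0199"
single = re.findall(pattern, lines)
multi = re.findall(pattern, lines, flags=re.MULTILINE)

print(ok, bad)   # True False
print(single)    # []
print(multi)     # ['123-1234', '555-0199']
```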
Escaped characters
Sometimes you may need the literal meaning of characters such as "^" and "$" instead of their special meaning. In that case, the "\" character is used to remove the special meaning of the character that follows it. Therefore, "\^", "\." and "\\" represent the literal characters "^", "." and "\" respectively.
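A short sketch of escaping, again in Python (an assumption; the escaping rule shown is the same backslash rule described above). The sample strings are invented.

```python
import re

text = "Cost: 3.50 or 3x50? Use ^ for the start."

# Unescaped "." matches any character, so "3x50" is found too.
dot_any = re.findall(r"3.50", text)

# Escaped "\." matches only a literal dot.
dot_literal = re.findall(r"3\.50", text)

# "\^" matches a literal caret, "\\" a literal backslash.
carets = re.findall(r"\^", text)
backslashes = re.findall(r"\\", r"C:\temp")

print(dot_any)       # ['3.50', '3x50']
print(dot_literal)   # ['3.50']
print(len(carets), len(backslashes))   # 1 1
```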
Repeat the previous item
We have seen that "{3}" and "*" can be used to repeat the previous character. Later, we will see how to use the same syntax to repeat an entire subexpression. The following table shows the ways to repeat the previous item.
* Repeat any number of times
+ Repeat at least once
? Repeat zero or one time
{n} Repeat n times
{n,m} Repeat at least n times, but not more than m times
{n,} Repeat at least n times
Let's try some more examples
10. \b\w{5,6}\b (search for words of five or six alphanumeric characters, such as as25d, d58sdf, etc.)
11. \b\d{3}\s\d{3}-\d{4} (search for a 10-digit phone number, such as 800 123-1234)
12. \d{3}-\d{2}-\d{4} (search for a social security number, such as 123-45-6789)
13. ^\w* (the first word of each line or of the entire article)
Try 13 in Expresso with and without the Multiline option to see the difference.
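For readers without Expresso at hand, REs 10 to 13 can be sketched in Python. The `re` module and the sample strings are assumptions for this example; `re.MULTILINE` is used in place of the Multiline option.

```python
import re

# Invented sample text.
text = "as25d d58sdf toolongword 800 123-1234 ssn 123-45-6789"

five_or_six = re.findall(r"\b\w{5,6}\b", text)       # RE 10
phone10 = re.findall(r"\b\d{3}\s\d{3}-\d{4}", text)  # RE 11
ssn = re.findall(r"\d{3}-\d{2}-\d{4}", text)         # RE 12

# RE 13 with and without the multiline flag.
two_lines = "first line\nsecond line"
first_only = re.findall(r"^\w*", two_lines)
each_line = re.findall(r"^\w*", two_lines, flags=re.MULTILINE)

print(five_or_six)   # ['as25d', 'd58sdf']
print(phone10)       # ['800 123-1234']
print(ssn)           # ['123-45-6789']
print(first_only)    # ['first']
print(each_line)     # ['first', 'second']
```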
Match characters in a certain range
What should you do when you need to find one of a set of specific characters? This is where the square brackets "[ ]" come in handy. For example, [aeiou] finds any of the vowels "a", "e", "i", "o", "u", and [.?!] finds any of the symbols ".", "?", "!". Special characters lose their special meaning inside square brackets and are interpreted literally. You can also specify ranges of characters, such as "[a-z0-9]", which means any lowercase letter or any digit.
Next, let's look at a more complex example of finding a phone number's RE
14. \(?\d{3}[) ]\s?\d{3}[- ]\d{4} (search for a 10-digit phone number, such as (080) 333-1234)
Such an RE can find phone numbers in more formats, such as (080) 123-4567, 511 254 6654, etc. "\(?" represents zero or one left parenthesis "(", "[) ]" represents either a right parenthesis ")" or a space, and "\s?" represents zero or one whitespace character. But such an RE will also find a phone number like "800) 445-3321"; that is, it does not check that the parentheses are balanced. Later, we will learn about alternatives to solve such problems.
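The character classes above, and the unbalanced-parenthesis weakness of RE 14, can be seen in a short Python sketch. Python's `re` module is assumed here in place of .NET, and the test strings are invented.

```python
import re

vowels = re.findall(r"[aeiou]", "regex")
marks = re.findall(r"[.?!]", "Really? Yes. Wow!")
lower_or_digit = re.findall(r"[a-z0-9]+", "Room 42b")

# RE 14: optional "(", then ")" or space, then "-" or space as separator.
phone = r"\(?\d{3}[) ]\s?\d{3}[- ]\d{4}"
samples = ["(080) 333-1234", "511 254 6654", "800) 445-3321"]
results = [re.fullmatch(phone, s) is not None for s in samples]

print(vowels)           # ['e', 'e']
print(marks)            # ['?', '.', '!']
print(lower_or_digit)   # ['oom', '42b']
print(results)          # the unbalanced third sample is accepted as well
```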
Negation
Sometimes you need to find characters not included in a certain specific character group. The following table shows how to make such a description.
\W Any character that is not alphanumeric
\S Any character that is not a whitespace character
\D Any character that is not a digit character
\B Not at the word boundary position
[^x] Any character that is not x
[^aeiou] Any character that is not a, e, i, o, u
15. \S+ (a string that does not contain whitespace characters)
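A few of the negated forms can be tried like this. The sketch uses Python's `re` module (an assumption) and invented input strings.

```python
import re

# RE 15: runs of non-whitespace characters (the tab also counts as whitespace).
chunks = re.findall(r"\S+", "id: A-17\tok")

non_word = re.findall(r"\W", "a-b c")          # neither letter nor digit
non_digits = re.findall(r"\D+", "tel 555")     # runs of non-digits
non_vowels = re.findall(r"[^aeiou]", "audio")  # negated character class

print(chunks)       # ['id:', 'A-17', 'ok']
print(non_word)     # ['-', ' ']
print(non_digits)   # ['tel ']
print(non_vowels)   # ['d']
```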
Alternatives
Sometimes you need to find several specific choices. At this time, the special character "|" comes in handy. For example, to find a 5-digit and a 9-digit (with "-" sign) postal code.
16. \b\d{5}-\d{4}\b|\b\d{5}\b (search for a 5-digit and a 9-digit (with "-" sign) postal code)
When using alternatives, you need to pay attention to their order, because the RE tries the alternatives from left to right and takes the first one that matches. In 16, if the 5-digit alternative is placed first, the RE will only ever find the 5-digit postal code. Now that you understand alternatives, you can make a better version of 14.
17. (\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4} (a 10-digit phone number)
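The effect of the order of alternatives, and the improvement RE 17 brings over RE 14, can be sketched in Python (the `re` module and the sample strings are assumptions for this example).

```python
import re

text = "ZIP 12345-6789 and 54321"

# RE 16: the 9-digit alternative comes first.
long_first = re.findall(r"\b\d{5}-\d{4}\b|\b\d{5}\b", text)

# Same alternatives in the opposite order: the 9-digit form is never found.
short_first = re.findall(r"\b\d{5}\b|\b\d{5}-\d{4}\b", text)

# RE 17: either "(ddd)" or "ddd", so a lone ")" is rejected.
phone = r"(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}"
balanced = re.fullmatch(phone, "(080) 333-1234") is not None
no_parens = re.fullmatch(phone, "080 333-1234") is not None
unbalanced = re.search(phone, "800) 445-3321") is not None

print(long_first)    # ['12345-6789', '54321']
print(short_first)   # ['12345', '54321']
print(balanced, no_parens, unbalanced)   # True True False
```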
Grouping
Parentheses can be used to define a subexpression. Through the definition of the subexpression, you can perform repetition or other processing on the subexpression.
18. (\d{1,3}\.){3}\d{1,3} (a simple RE for finding an IP address)
The first part of this RE, (\d{1,3}\.), means a number of at least one and at most three digits followed by a "." symbol. This part is repeated three times and then followed by another number of 1 to 3 digits, giving something like 192.72.28.1.
But there is a shortcoming: each number in an IP address can be at most 255, yet the RE above accepts any number of 1 to 3 digits. What is needed is a check that each number is less than 256, but RE cannot compare numeric values directly. In 19, alternatives are used to restrict each number to the required range, that is, 0 to 255.
19. ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) (search for an IP address)
Have you found that RE is more and more like what aliens say? Just looking for an IP address simply, it is quite difficult to understand directly from the RE.
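To see what RE 19 gains over RE 18, here is a small Python sketch. It assumes the `re` module and uses `re.fullmatch` so that the whole string has to be an address; the test addresses are invented.

```python
import re

loose = r"(\d{1,3}\.){3}\d{1,3}"   # RE 18
strict = r"((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)"   # RE 19

def matches(pattern: str, candidate: str) -> bool:
    """Return True if the whole candidate string matches the pattern."""
    return re.fullmatch(pattern, candidate) is not None

for address in ["192.72.28.1", "255.255.255.255", "256.1.1.1", "999.1.1.1"]:
    print(address, matches(loose, address), matches(strict, address))
```

The loose RE accepts all four strings, while the strict one rejects the two whose first number is above 255.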
Expresso Analyzer View
Expresso can turn the entered RE into a tree-like explanation, broken down into groups, which provides a good debugging environment. Other functions, such as partial match (Partial Match searches only for the highlighted part of the RE) and exclude match (Exclude Match searches without the highlighted part of the RE), are left for you great people to try.
When a subexpression is grouped by parentheses, the text that matches the subexpression can be used in subsequent program processing or in the RE itself. Under the default situation, the matched groups are named by numbers, starting from 1, and the order is from left to right. This automatic group naming can be seen in the skeleton view or result view in Expresso.
A backreference is used to find the same text as the text already captured in a group. For example, "\1" refers to the text captured in group 1.
20. \b(\w+)\b\s*\1\b (search for repeated words, here the repetition means the same word, with a space in between, such as dog dog)
(\w+) will capture a word of at least one character of letters or digits and name it group 1, then find any whitespace characters, and then the same text as group 1.
If you don't like the automatically assigned name 1 of the group, you can also name it yourself. For the above example, (\w+) is rewritten as (?<Word>\w+), which names the captured group Word, and the backreference is rewritten as \k<Word>.
21. \b(?<Word>\w+)\b\s*\k<Word>\b (use a self-named group to capture repeated words)
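Here is a minimal sketch of REs 20 and 21 in Python. One difference to keep in mind: Python's `re` module writes named groups as `(?P<Word>...)` and named backreferences as `(?P=Word)`, not the .NET forms `(?<Word>...)` and `\k<Word>`. The sample sentence is invented.

```python
import re

text = "the dog dog barked at at the cat"

# RE 20: numbered group and backreference \1.
numbered = [m.group(0) for m in re.finditer(r"\b(\w+)\b\s*\1\b", text)]

# RE 21 in Python spelling: (?P<Word>...) and (?P=Word).
named_re = re.compile(r"\b(?P<Word>\w+)\b\s*(?P=Word)\b")
repeated_words = [m.group("Word") for m in named_re.finditer(text)]

print(numbered)         # ['dog dog', 'at at']
print(repeated_words)   # ['dog', 'at']
```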
There are many special syntax elements when using parentheses. The more common list is as follows:
Captures
(exp) Match exp and capture it into an automatically named group
(?<name>exp) Match exp and capture it into a group named name
(?:exp) Match exp, but do not capture it
Lookarounds
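(?=exp) Match a position that is followed by exp (lookahead)
(?<=exp) Match a position that is preceded by exp (lookbehind)
(?!exp) Match a position that is not followed by exp
(?<!exp) Match a position that is not preceded by exp
30. (?<=<(\w+)>).*(?=<\/\1>) (text between HTML tags)
This uses a lookbehind and a lookahead assertion to extract the text between a pair of HTML tags, not including the tags themselves.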
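A sketch of the same idea in Python follows, with one caveat: Python's `re` module only accepts fixed-width lookbehind, so a prefix like `<(\w+)>` cannot be used there as it can in .NET. The example therefore fixes the tag name to `b` for the lookaround version and, as an alternative technique, uses ordinary capture groups for arbitrary tags. The HTML snippets are invented.

```python
import re

# Lookbehind/lookahead with a fixed tag name (fixed width, so Python accepts it).
html = "<b>bold text</b>"
inner = re.search(r"(?<=<b>).*(?=</b>)", html).group(0)

# Alternative for any tag name: capture the tag, then read the text
# from group 2 instead of relying on lookbehind.
m = re.search(r"<(\w+)>(.*)</\1>", "<em>hi</em>")
tag, generic_inner = m.group(1), m.group(2)

print(inner)                # bold text
print(tag, generic_inner)   # em hi
```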
Comments Please
Parentheses also have a special use to enclose comments. The syntax is "(?#comment)". If the "Ignore Pattern Whitespace" option is set, the whitespace characters in the RE will be ignored when the RE is used. When this option is set, the text after "#" will be ignored.
31. Text between HTML tags, with comments
(?<= #Find the prefix, but do not include it
<(\w+)> #HTML tag
) #End the prefix to find
.* #Match any text
(?= #Find the end, but do not include it
<\/\1> #Match the string of the captured group 1, that is, the HTML tag in the previous parentheses
) #End the suffix to find
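Commented patterns can be tried in Python as well, where `re.VERBOSE` is assumed to play the role of the "Ignore Pattern Whitespace" option. Because Python's lookbehind must be fixed-width, this sketch hard-codes the tag `b` instead of capturing it; the input strings are invented.

```python
import re

# Whitespace in the pattern is ignored and "#" starts a comment.
between_b_tags = re.compile(r"""
    (?<=<b>)   # prefix to find, but not include (fixed tag name)
    .*         # match any text
    (?=</b>)   # suffix to find, but not include
""", re.VERBOSE)

found = between_b_tags.search("<b>hello</b>").group(0)

# The inline form (?#comment) works without any option.
phone = re.search(r"\d{3}(?#first three digits)-\d{4}", "call 555-0199").group(0)

print(found)   # hello
print(phone)   # 555-0199
```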
Greedy and Lazy
When an RE contains a repetition with a range of possible lengths (such as ".*"), it normally matches as many characters as possible. This is called greedy matching. For example:
32. a.*b (the matching word with the most characters starting with a and ending with b)
Given the string "aabab", the match obtained with the above RE is "aabab", because it looks for the match with the most characters. Sometimes you want the match with the fewest characters instead, that is, lazy matching. Adding a question mark (?) after any of the quantifiers in the repetition table turns it into its lazy form. Therefore, "*?" means repeating any number of times, but using as few repetitions as possible. For example:
33. a.*?b (the matching word with the least characters starting with a and ending with b)
If there is a string "aabab", the first matching string obtained using the above RE is "aab" and then "ab", because this is to find the word with the least characters.
*? Repeat any number of times, with the principle of the least number of repetitions
+? Repeat at least once, with the principle of the least number of repetitions
?? Repeat zero or one time, with the principle of the least number of repetitions
{n,m}? Repeat at least n times, but not more than m times, with the principle of the least number of repetitions
{n,}? Repeat at least n times, with the principle of the least number of repetitions
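The "aabab" example can be reproduced directly. This sketch uses Python's `re` module (an assumption; greedy and lazy quantifiers are written the same way there).

```python
import re

text = "aabab"

greedy = re.findall(r"a.*b", text)   # RE 32: as many characters as possible
lazy = re.findall(r"a.*?b", text)    # RE 33: as few characters as possible

print(greedy)   # ['aabab']
print(lazy)     # ['aab', 'ab']
```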
What else is not mentioned?
So far, many elements for building RE have been covered, but there are still many that have not. The following table lists some of them. The number in the leftmost column is the number of the corresponding example in Expresso.
# Syntax Explanation
\a Bell character
\b Usually refers to the word boundary, and in the character group, it represents backspace
\t Tab
34 \r Carriage return
\v Vertical Tab
\f Form feed
35 \n New line
\e Escape
36 \nnn Character with ASCII octal code nnn
37 \xnn Character with hexadecimal code nn
38 \unnnn Character with Unicode code nnnn
39 \cN Control N character, for example, Ctrl-M is \cM
40 \A Start of the string (similar to ^, but without the need for the multiline option)
41 \Z End of the string, or just before a newline at the end of the string
\z End of the string only
42 \G Start of the current search
43 \p{name} Any character in the Unicode character group named name, for example, \p{Ll} refers to lowercase letters (Lowercase_Letter)
(?>exp) Greedy subexpression, also known as non-backtracking subexpression. This only matches once and does not backtrack.
44 (?<x-y>exp) or (?<-y>exp) Balanced group. Although complex, it is easy to use. It allows named capture groups to be operated on as a stack. (I don't understand this either)
45 (?im-nsx:exp) Change the RE option for subexpression exp. For example, (?-i:Elvis) is to turn off the option of ignoring the case of Elvis.
46 (?im-nsx) Change the RE option for the subsequent group.
(?(exp)yes|no) The subexpression exp is regarded as a zero-width positive lookahead. If there is a match at this time, the yes subexpression is the next matching target. If not, the no subexpression is the next matching target.
(?(exp)yes) The same as above but without the no subexpression
(?(name)yes|no) If the name group is a valid group name, then the yes subexpression is the next matching target. If not, the no subexpression is the next matching target.
47 (?(name)yes) The same as above but without the no subexpression
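Two of the constructs in this table can be sketched in Python, which supports scoped options such as `(?-i:exp)` (Python 3.7 or later is assumed) and the conditional `(?(group)yes|no)` keyed on a group number or name. The phone pattern below is a hypothetical variation on RE 17 written for this example, not something taken from the article.

```python
import re

# Scoped option: ignore case everywhere except inside (?-i:...).
scoped = r"(?-i:Elvis) is alive"
exact = re.search(scoped, "Elvis IS ALIVE", re.IGNORECASE) is not None
wrong_case = re.search(scoped, "elvis is alive", re.IGNORECASE) is not None

# Conditional: if group 1 (the opening parenthesis) matched, require ")",
# otherwise require a whitespace character.
phone = r"(\()?\d{3}(?(1)\)\s?|\s)\d{3}[- ]\d{4}"
with_parens = re.fullmatch(phone, "(080) 333-1234") is not None
without_parens = re.fullmatch(phone, "080 333-1234") is not None
unbalanced = re.fullmatch(phone, "800) 445-3321") is not None

print(exact, wrong_case)                         # True False
print(with_parens, without_parens, unbalanced)   # True True False
```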
[ Last edited by 无奈何 on 2006-10-26 at 12:22 PM ]