|
无奈何
荣誉版主
      
积分 1338
发帖 356
注册 2005-7-15
状态 离线
|
 『楼 主』:
[推荐]正则表达式Regex文章收集
贴一些我收集的关于正则表达式的文章,资源以及相关链接我会修改本帖补充。
正则表达式库 http://regexlib.com/default.aspx
正则表达式在线验证(荐) http://osteele.com/tools/rework/#
正则表达式在线演示 http://osteele.com/tools/reanimator/
正则表达式在线验证(中文) http://www.regexlab.com/zh/workshop.asp
RegexBuddy最好的正则表达式学习验证工具 http://www.regexbuddy.com/
先贴这些,想到后再补充。
[ Last edited by 无奈何 on 2006-10-26 at 12:57 PM ]
Here are some articles on regular expressions that I have collected. I will edit this post to add more resources and related links.
Regular expression library: http://regexlib.com/default.aspx
Online regular expression tester (recommended): http://osteele.com/tools/rework/#
Online regular expression demonstration: http://osteele.com/tools/reanimator/
Online regular expression tester (Chinese): http://www.regexlab.com/zh/workshop.asp
RegexBuddy, the best tool for learning and testing regular expressions: http://www.regexbuddy.com/
That is all for now; I will add more as I think of them.
2006-10-26 11:42 |
『第 2 楼』:
正则表达式语法
正则表达式语法
JScript 和 VBScript 正则表达式
正则表达式语法
一个正则表达式就是由普通字符(例如字符 a 到 z)以及特殊字符(称为元字符)组成的文字模式。该模式描述在查找文字主体时待匹配的一个或多个字符串。正则表达式作为一个模板,将某个字符模式与所搜索的字符串进行匹配。
这里有一些可能会遇到的正则表达式示例:
JScript
VBScript
匹配
/^\*$/
"^\*$"
匹配一个空白行。
/\d{2}-\d{5}/
"\d{2}-\d{5}"
验证一个ID 号码是否由一个2位数字,一个连字符以及一个5位数字组成。
/.*/
".*"
匹配一个 HTML 标记。
下表是元字符及其在正则表达式上下文中的行为的一个完整列表:
字符
描述
\
将下一个字符标记为一个特殊字符、或一个原义字符、或一个 向后引用、或一个八进制转义符。例如,'n' 匹配字符 "n"。'\n' 匹配一个换行符。序列 '\\' 匹配 "\" 而 "\(" 则匹配 "("。
^
匹配输入字符串的开始位置。如果设置了 RegExp 对象的 Multiline 属性,^ 也匹配 '\n' 或 '\r' 之后的位置。
$
匹配输入字符串的结束位置。如果设置了RegExp 对象的 Multiline 属性,$ 也匹配 '\n' 或 '\r' 之前的位置。
*
匹配前面的子表达式零次或多次。例如,zo* 能匹配 "z" 以及 "zoo"。* 等价于{0,}。
+
匹配前面的子表达式一次或多次。例如,'zo+' 能匹配 "zo" 以及 "zoo",但不能匹配 "z"。+ 等价于 {1,}。
?
匹配前面的子表达式零次或一次。例如,"do(es)?" 可以匹配 "do" 或 "does" 中的"do" 。? 等价于 {0,1}。
{n}
n 是一个非负整数。匹配确定的 n 次。例如,'o{2}' 不能匹配 "Bob" 中的 'o',但是能匹配 "food" 中的两个 o。
{n,}
n 是一个非负整数。至少匹配n 次。例如,'o{2,}' 不能匹配 "Bob" 中的 'o',但能匹配 "foooood" 中的所有 o。'o{1,}' 等价于 'o+'。'o{0,}' 则等价于 'o*'。
{n,m}
m 和 n 均为非负整数,其中n
?
当该字符紧跟在任何一个其他限制符 (*, +, ?, {n}, {n,}, {n,m}) 后面时,匹配模式是非贪婪的。非贪婪模式尽可能少的匹配所搜索的字符串,而默认的贪婪模式则尽可能多的匹配所搜索的字符串。例如,对于字符串 "oooo",'o+?' 将匹配单个 "o",而 'o+' 将匹配所有 'o'。
.
匹配除 "\n" 之外的任何单个字符。要匹配包括 '\n' 在内的任何字符,请使用象 '' 的模式。
(pattern)
匹配 pattern 并获取这一匹配。所获取的匹配可以从产生的 Matches 集合得到,在VBScript 中使用 SubMatches 集合,在JScript 中则使用 $0…$9 属性。要匹配圆括号字符,请使用 '\(' 或 '\)'。
(?:pattern)
匹配 pattern 但不获取匹配结果,也就是说这是一个非获取匹配,不进行存储供以后使用。这在使用 "或" 字符 (|) 来组合一个模式的各个部分是很有用。例如, 'industr(?:y|ies) 就是一个比 'industry|industries' 更简略的表达式。
(?=pattern)
正向预查,在任何匹配 pattern 的字符串开始处匹配查找字符串。这是一个非获取匹配,也就是说,该匹配不需要获取供以后使用。例如,'Windows (?=95|98|NT|2000)' 能匹配 "Windows 2000" 中的 "Windows" ,但不能匹配 "Windows 3.1" 中的 "Windows"。预查不消耗字符,也就是说,在一个匹配发生后,在最后一次匹配之后立即开始下一次匹配的搜索,而不是从包含预查的字符之后开始。
(?!pattern)
负向预查,在任何不匹配 pattern 的字符串开始处匹配查找字符串。这是一个非获取匹配,也就是说,该匹配不需要获取供以后使用。例如'Windows (?!95|98|NT|2000)' 能匹配 "Windows 3.1" 中的 "Windows",但不能匹配 "Windows 2000" 中的 "Windows"。预查不消耗字符,也就是说,在一个匹配发生后,在最后一次匹配之后立即开始下一次匹配的搜索,而不是从包含预查的字符之后开始
x|y
匹配 x 或 y。例如,'z|food' 能匹配 "z" 或 "food"。'(z|f)ood' 则匹配 "zood" 或 "food"。
xyz]
字符集合。匹配所包含的任意一个字符。例如, '' 可以匹配 "plain" 中的 'a'。
xyz]
负值字符集合。匹配未包含的任意字符。例如, '' 可以匹配 "plain" 中的'p'。
a-z]
字符范围。匹配指定范围内的任意字符。例如,'' 可以匹配 'a' 到 'z' 范围内的任意小写字母字符。
a-z]
负值字符范围。匹配任何不在指定范围内的任意字符。例如,'' 可以匹配任何不在 'a' 到 'z' 范围内的任意字符。
\b
匹配一个单词边界,也就是指单词和空格间的位置。例如, 'er\b' 可以匹配"never" 中的 'er',但不能匹配 "verb" 中的 'er'。
\B
匹配非单词边界。'er\B' 能匹配 "verb" 中的 'er',但不能匹配 "never" 中的 'er'。
\cx
匹配由 x 指明的控制字符。例如, \cM 匹配一个 Control-M 或回车符。x 的值必须为 A-Z 或 a-z 之一。否则,将 c 视为一个原义的 'c' 字符。
\d
匹配一个数字字符。等价于 。
\D
匹配一个非数字字符。等价于 。
\f
匹配一个换页符。等价于 \x0c 和 \cL。
\n
匹配一个换行符。等价于 \x0a 和 \cJ。
\r
匹配一个回车符。等价于 \x0d 和 \cM。
\s
匹配任何空白字符,包括空格、制表符、换页符等等。等价于 。
\S
匹配任何非空白字符。等价于 。
\t
匹配一个制表符。等价于 \x09 和 \cI。
\v
匹配一个垂直制表符。等价于 \x0b 和 \cK。
\w
匹配包括下划线的任何单词字符。等价于''。
\W
匹配任何非单词字符。等价于 ''。
\xn
匹配 n,其中 n 为十六进制转义值。十六进制转义值必须为确定的两个数字长。例如,'\x41' 匹配 "A"。'\x041' 则等价于 '\x04' & "1"。正则表达式中可以使用 ASCII 编码。.
\num
匹配 num,其中 num 是一个正整数。对所获取的匹配的引用。例如,'(.)\1' 匹配两个连续的相同字符。
\n
标识一个八进制转义值或一个向后引用。如果 \n 之前至少 n 个获取的子表达式,则 n 为向后引用。否则,如果 n 为八进制数字 (0-7),则 n 为一个八进制转义值。
\nm
标识一个八进制转义值或一个向后引用。如果 \nm 之前至少有 nm 个获得子表达式,则 nm 为向后引用。如果 \nm 之前至少有 n 个获取,则 n 为一个后跟文字 m 的向后引用。如果前面的条件都不满足,若 n 和 m 均为八进制数字 (0-7),则 \nm 将匹配八进制转义值 nm。
\nml
如果 n 为八进制数字 (0-3),且 m 和 l 均为八进制数字 (0-7),则匹配八进制转义值 nml。
\un
匹配 n,其中 n 是一个用四个十六进制数字表示的 Unicode 字符。例如, \u00A9 匹配版权符号 (©)
2001 Microsoft Corporation. 保留所有权利。
[ Last edited by 无奈何 on 2006-10-26 at 11:45 AM ]
Regular Expression Syntax
JScript and VBScript Regular Expressions
A regular expression is a text pattern composed of ordinary characters (such as characters a to z) as well as special characters (called metacharacters). This pattern describes one or more strings to be matched when searching the text body. A regular expression acts as a template, matching a certain character pattern with the string being searched.
Here are some examples of regular expressions you might encounter:
JScript | VBScript | Matches
/^[ \t]*$/ | "^[ \t]*$" | A blank line.
/\d{2}-\d{5}/ | "\d{2}-\d{5}" | Validates an ID number consisting of 2 digits, a hyphen, and 5 digits.
/<(.*)>.*<\/\1>/ | "<(.*)>.*<\/\1>" | An HTML tag.
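The three example patterns can be tried out in any Perl-style engine. The following is a minimal sketch using Python's `re` module as a stand-in for JScript/VBScript (an assumption of this example, not something the original article uses); the sample strings and variable names are made up for illustration.

```python
import re

# ID number: two digits, a hyphen, then five digits.
# fullmatch requires the whole string to fit the pattern.
id_pattern = re.compile(r"\d{2}-\d{5}")
id_ok = id_pattern.fullmatch("12-34567") is not None
id_bad = id_pattern.fullmatch("123-4567") is not None

# Blank line: nothing but spaces or tabs between start and end.
blank_pattern = re.compile(r"^[ \t]*$")
blank_ok = blank_pattern.match(" \t ") is not None

# HTML element: \1 must repeat whatever the first group captured.
tag_pattern = re.compile(r"<(.*)>.*</\1>")
tag_match = tag_pattern.search("<b>bold</b>")
tag_name = tag_match.group(1)
```

Note that the forward slash only needs escaping (`<\/`) inside a JScript `/.../` literal; in a Python string it can be written as is.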
The following table is a complete list of metacharacters and their behaviors in the context of regular expressions:
Character
Description
\
Mark the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and "\(" matches "(".
^
Match the start position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$
Match the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*
Match the preceding subexpression zero or more times. For example, zo* can match "z" and "zoo". * is equivalent to {0,}.
+
Match the preceding subexpression one or more times. For example, 'zo+' can match "zo" and "zoo", but not "z". + is equivalent to {1,}.
?
Match the preceding subexpression zero or one time. For example, 'do(es)?' can match "do" or "does". ? is equivalent to {0,1}.
{n}
n is a non-negative integer. Match exactly n times. For example, 'o{2}' cannot match 'o' in "Bob", but can match the two o's in "food".
{n,}
n is a non-negative integer. Match at least n times. For example, 'o{2,}' cannot match 'o' in "Bob", but can match all o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}
m and n are both non-negative integers, where n <= m. Match at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'.
?
When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all of the o's.
.
Match any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'.
(pattern)
Match pattern and capture this match. The captured match can be obtained from the resulting Matches collection. In VBScript, use the SubMatches collection, and in JScript, use the $0…$9 properties. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)
Match pattern but do not capture the match; that is, this is a non-capturing group whose match is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more concise expression than 'industry|industries'.
(?=pattern)
Positive lookahead: matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not captured for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)
Negative lookahead: matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not captured for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1" but not "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y
Match x or y. For example, 'z|food' can match "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]
Character set. Match any one of the enclosed characters. For example, '[abc]' can match 'a' in "plain".
[^xyz]
Negative character set. Match any character not enclosed. For example, '[^abc]' can match 'p' in "plain".
[a-z]
Character range. Match any character within the specified range. For example, '[a-z]' can match any lowercase letter in the range 'a' to 'z'.
[^a-z]
Negative character range. Match any character not within the specified range. For example, '[^a-z]' can match any character not in the range 'a' to 'z'.
\b
Match a word boundary, that is, the position between a word and a space. For example, 'er\b' can match 'er' in "never", but cannot match 'er' in "verb".
\B
Match a non-word boundary. 'er\B' can match 'er' in "verb", but cannot match 'er' in "never".
\cx
Match the control character indicated by x . For example, \cM matches a Control-M or carriage return. x must be one of A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d
Match a digit character. Equivalent to [0-9].
\D
Match a non-digit character. Equivalent to [^0-9].
\f
Match a form feed. Equivalent to \x0c and \cL.
\n
Match a newline character. Equivalent to \x0a and \cJ.
\r
Match a carriage return. Equivalent to \x0d and \cM.
\s
Match any whitespace character, including space, tab, form feed, etc. Equivalent to [ \f\n\r\t\v].
\S
Match any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t
Match a tab character. Equivalent to \x09 and \cI.
\v
Match a vertical tab character. Equivalent to \x0b and \cK.
\w
Match any word character, including underscore. Equivalent to '[A-Za-z0-9_]'.
\W
Match any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn
Match n, where n is a hexadecimal escape value. The hexadecimal escape value must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' & "1". This allows ASCII codes to be used in regular expressions.
\num
Match num, where num is a positive integer. A reference to the captured match. For example, '(.)\1' matches two consecutive identical characters.
\n
Identify an octal escape value or a backreference. If there are at least n captured subexpressions before \n, then n is a backreference. Otherwise, if n is an octal digit (0-7), then n is an octal escape value.
\nm
Identify an octal escape value or a backreference. If there are at least nm obtained subexpressions before \nm, then nm is a backreference. If there are at least n captures before \nm, then n is a backreference followed by the literal m . If none of the above conditions are met, and n and m are both octal digits (0-7), then \nm will match the octal escape value nm.
\nml
If n is an octal digit (0-3), and m and l are both octal digits (0-7), then match the octal escape value nml.
\un
Match n, where n is a Unicode character represented by four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
© 2001 Microsoft Corporation. All rights reserved.
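Several rows of the table above are easier to grasp when run. This is a small sketch in Python's `re` module, used here only as a convenient Perl-style engine (the table itself describes JScript and VBScript, and a few entries such as `\cx` are not supported by Python). The test word "bookkeeper" and the variable names are illustrative assumptions.

```python
import re

# Greedy vs. non-greedy quantifier on the string "oooo".
greedy = re.search(r"o+", "oooo").group()
lazy = re.search(r"o+?", "oooo").group()

# Positive and negative lookahead: the version number is checked, not consumed.
win_2000 = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000")
win_31 = re.search(r"Windows (?=95|98|NT|2000)", "Windows 3.1")
not_listed = re.search(r"Windows (?!95|98|NT|2000)", "Windows 3.1")

# Backreference: \1 repeats the character captured by the first group.
doubled = re.search(r"(.)\1", "bookkeeper").group()

# Word boundary: "er" at the end of a word vs. in the middle of one.
er_at_end = re.search(r"er\b", "never")
er_inside = re.search(r"er\b", "verb")
```

The lookahead matches return only "Windows " (with the trailing space), which shows that the version number is not part of the match.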
『第 3 楼』:
深入浅出之正则表达式(一)
深入浅出之正则表达式(一)
原贴地址: http://dragon.cnblogs.com/archive/2006/05/08/394078.html
前言:
半年前我对正则表达式产生了兴趣,在网上查找过不少资料,看过不少的教程,最后在使用一个正则表达式工具RegexBuddy时发现他的教程写的非常好,可以说是我目前见过最好的正则表达式教程。于是一直想把他翻译过来。这个愿望直到这个五一长假才得以实现,结果就有了这篇文章。关于本文的名字,使用“深入浅出”似乎已经太俗。但是通读原文以后,觉得只有用“深入浅出”才能准确的表达出该教程给我的感受,所以也就不能免俗了。
本文是Jan Goyvaerts为RegexBuddy写的教程的译文,版权归原作者所有,欢迎转载。但是为了尊重原作者和译者的劳动,请注明出处!谢谢!
1. 什么是正则表达式
基本说来,正则表达式是一种用来描述一定数量文本的模式。Regex代表Regular Express。本文将用>来表示一段具体的正则表达式。
一段文本就是最基本的模式,简单的匹配相同的文本。
2. 不同的正则表达式引擎
正则表达式引擎是一种可以处理正则表达式的软件。通常,引擎是更大的应用程序的一部分。在软件世界,不同的正则表达式并不互相兼容。本教程会集中讨论Perl 5 类型的引擎,因为这种引擎是应用最广泛的引擎。同时我们也会提到一些和其他引擎的区别。许多近代的引擎都很类似,但不完全一样。例如.NET正则库,JDK正则包。
3. 文字符号
最基本的正则表达式由单个文字符号组成。如>,它将匹配字符串中第一次出现的字符“a”。如对字符串“Jack is a boy”。“J”后的“a”将被匹配。而第二个“a”将不会被匹配。
正则表达式也可以匹配第二个“a”,这必须是你告诉正则表达式引擎从第一次匹配的地方开始搜索。在文本编辑器中,你可以使用“查找下一个”。在编程语言中,会有一个函数可以使你从前一次匹配的位置开始继续向后搜索。
类似的,>会匹配“About cats and dogs”中的“cat”。这等于是告诉正则表达式引擎,找到一个>,紧跟一个>,再跟一个>。
要注意,正则表达式引擎缺省是大小写敏感的。除非你告诉引擎忽略大小写,否则>不会匹配“Cat”。
· 特殊字符
对于文字字符,有11个字符被保留作特殊用途。他们是:
\ ^ $ . | ? * + ( )
这些特殊字符也被称作元字符。
如果你想在正则表达式中将这些字符用作文本字符,你需要用反斜杠“\”对其进行换码 (escape)。例如你想匹配“1+1=2”,正确的表达式为>.
需要注意的是,>也是有效的正则表达式。但它不会匹配“1+1=2”,而会匹配“123+111=234”中的“111=2”。因为“+”在这里表示特殊含义(重复1次到多次)。
在编程语言中,要注意,一些特殊的字符会先被编译器处理,然后再传递给正则引擎。因此正则表达式>在C++中要写成“1\\+1=2”。为了匹配“C:\temp”,你要用正则表达式>。而在C++中,正则表达式则变成了“C:\\\\temp”。
· 不可显示字符
可以使用特殊字符序列来代表某些不可显示字符:
>代表Tab(0x09)
>代表回车符(0x0D)
>代表换行符(0x0A)
要注意的是Windows中文本文件使用“\r\n”来结束一行而Unix使用“\n”。
4. 正则表达式引擎的内部工作机制
知道正则表达式引擎是如何工作的有助于你很快理解为何某个正则表达式不像你期望的那样工作。
有两种类型的引擎:文本导向(text-directed)的引擎和正则导向(regex-directed)的引擎。Jeffrey Friedl把他们称作DFA和NFA引擎。本文谈到的是正则导向的引擎。这是因为一些非常有用的特性,如“惰性”量词(lazy quantifiers)和反向引用(backreferences),只能在正则导向的引擎中实现。所以毫不意外这种引擎是目前最流行的引擎。
你可以轻易分辨出所使用的引擎是文本导向还是正则导向。如果反向引用或“惰性”量词被实现,则可以肯定你使用的引擎是正则导向的。你可以作如下测试:将正则表达式>应用到字符串“regex not”。如果匹配的结果是regex,则引擎是正则导向的。如果结果是regex not,则是文本导向的。因为正则导向的引擎是“猴急”的,它会很急切的进行表功,报告它找到的第一个匹配 。
· 正则导向的引擎总是返回最左边的匹配
这是需要你理解的很重要的一点:即使以后有可能发现一个“更好”的匹配,正则导向的引擎也总是返回最左边的匹配。
当把>应用到“He captured a catfish for his cat”,引擎先比较>和“H”,结果失败了。于是引擎再比较>和“e”,也失败了。直到第四个字符,>匹配了“c”。>匹配了第五个字符。到第六个字符>没能匹配“p”,也失败了。引擎再继续从第五个字符重新检查匹配性。直到第十五个字符开始,>匹配上了“catfish”中的“cat”,正则表达式引擎急切的返回第一个匹配的结果,而不会再继续查找是否有其他更好的匹配。
5. 字符集
字符集是由一对方括号“”括起来的字符集合。使用字符集,你可以告诉正则表达式引擎仅仅匹配多个字符中的一个。如果你想匹配一个“a”或一个“e”,使用>。你可以使用>匹配gray或grey。这在你不确定你要搜索的字符是采用美国英语还是英国英语时特别有用。相反,>将不会匹配graay或graey。字符集中的字符顺序并没有什么关系,结果都是相同的。
你可以使用连字符“-”定义一个字符范围作为字符集。>匹配0到9之间的单个数字。你可以使用不止一个范围。>匹配单个的十六进制数字,并且大小写不敏感。你也可以结合范围定义与单个字符定义。>匹配一个十六进制数字或字母X。再次强调一下,字符和范围定义的先后顺序对结果没有影响。
· 字符集的一些应用
查找一个可能有拼写错误的单词,比如> 或 >。
查找程序语言的标识符,>。(*表示重复0或多次)
查找C风格的十六进制数>。(+表示重复一次或多次)
· 取反字符集
在左方括号“ \ ^ -”。“]”代表字符集定义的结束;“\”代表转义;“^”代表取反;“-”代表范围定义。其他常见的元字符在字符集定义内部都是正常字符,不需要转义。例如,要搜索星号*或加号+,你可以用>。当然,如果你对那些通常的元字符进行转义,你的正则表达式一样会工作得很好,但是这会降低可读性。
在字符集定义中为了将反斜杠“\”作为一个文字字符而非特殊含义的字符,你需要用另一个反斜杠对它进行转义。>将会匹配一个反斜杠和一个X。“]^-”都可以用反斜杠进行转义,或者将他们放在一个不可能使用到他们特殊含义的位置。我们推荐后者,因为这样可以增加可读性。比如对于字符“^”,将它放在除了左括号“”或“x”。>或>都会匹配一个“-”或“x”。
· 字符集的简写
因为一些字符集非常常用,所以有一些简写方式。
>代表>;
>代表单词字符。这个是随正则表达式实现的不同而有些差异。绝大多数的正则表达式实现的单词字符集都包含了>。
>代表“白字符”。这个也是和不同的实现有关的。在绝大多数的实现中,都包含了空格符和Tab符,以及回车换行符>。
字符集的缩写形式可以用在方括号之内或之外。>匹配一个白字符后面紧跟一个数字。>匹配单个白字符或数字。>将匹配一个十六进制数字。
取反字符集的简写
> = >
> = >
> = >
· 字符集的重复
如果你用“?*+”操作符来重复一个字符集,你将会重复整个字符集。而不仅是它匹配的那个字符。正则表达式>会匹配837以及222。
如果你仅仅想重复被匹配的那个字符,可以用向后引用达到目的。我们以后将讲到向后引用。
6. 使用?*或+ 进行重复
?:告诉引擎匹配前导字符0次或一次。事实上是表示前导字符是可选的。
+:告诉引擎匹配前导字符1次或多次
*:告诉引擎匹配前导字符0次或多次
匹配没有属性的HTML标签,“”是文字符号。第一个字符集匹配一个字母,第二个字符集匹配一个字母或数字。
我们似乎也可以用。但是它会匹配。但是这个正则表达式在你知道你要搜索的字符串不包含类似的无效标签时还是足够有效的。
· 限制性重复
许多现代的正则表达式实现,都允许你定义对一个字符重复多少次。词法是:{min,max}。min和max都是非负整数。如果逗号有而max被忽略了,则max没有限制。如果逗号和max都被忽略了,则重复min次。
因此{0,}和*一样,{1,}和+ 的作用一样。
你可以用>匹配1000~9999之间的数字(“\b”表示单词边界)。>匹配一个在100~99999之间的数字。
· 注意贪婪性
假设你想用一个正则表达式匹配一个HTML标签。你知道输入将会是一个有效的HTML文件,因此正则表达式不需要排除那些无效的标签。所以如果是在两个尖括号之间的内容,就应该是一个HTML标签。
许多正则表达式的新手会首先想到用正则表达式 >>,他们会很惊讶的发现,对于测试字符串,“This is a first test”,你可能期望会返回,然后继续进行匹配的时候,返回。
但事实是不会。正则表达式将会匹配“first”。很显然这不是我们想要的结果。原因在于“+”是贪婪的。也就是说,“+”会导致正则表达式引擎试图尽可能的重复前导字符。只有当这种重复会引起整个正则表达式匹配失败的情况下,引擎会进行回溯。也就是说,它会放弃最后一次的“重复”,然后处理正则表达式余下的部分。
和“+”类似,“?*”的重复也是贪婪的。
· 深入正则表达式引擎内部
让我们来看看正则引擎如何匹配前面的例子。第一个记号是“”。到目前为止,“first test”。引擎会试图将“>”与换行符进行匹配,结果失败了。于是引擎进行回溯。结果是现在“first tes”。于是引擎将“>”与“t”进行匹配。显然还是会失败。这个过程继续,直到“first”与“>”匹配。于是引擎找到了一个匹配“first”。记住,正则导向的引擎是“急切的”,所以它会急着报告它找到的第一个匹配。而不是继续回溯,即使可能会有更好的匹配,例如“”。所以我们可以看到,由于“+”的贪婪性,使得正则表达式引擎返回了一个最左边的最长的匹配。
· 用懒惰性取代贪婪性
一个用于修正以上问题的可能方案是用“+”的惰性代替贪婪性。你可以在“+”后面紧跟一个问号“?”来达到这一点。“*”,“{}”和“?”表示的重复也可以用这个方案。因此在上面的例子中我们可以使用“”。让我们再来看看正则表达式引擎的处理过程。
再一次,正则表达式记号“”匹配“M”,结果失败了。引擎会进行回溯,和上一个例子不同,因为是惰性重复,所以引擎是扩展惰性重复而不是减少,于是“”。这次得到了一个成功匹配。引擎于是报告“”是一个成功的匹配。整个过程大致如此。
· 惰性扩展的一个替代方案
我们还有一个更好的替代方案。可以用一个贪婪重复与一个取反字符集:“]+>”。之所以说这是一个更好的方案在于使用惰性重复时,引擎会在找到一个成功匹配前对每一个字符进行回溯。而使用取反字符集则不需要进行回溯。
最后要记住的是,本教程仅仅谈到的是正则导向的引擎。文本导向的引擎是不回溯的。但是同时他们也不支持惰性重复操作。
7. 使用“.”匹配几乎任意字符
在正则表达式中,“.”是最常用的符号之一。不幸的是,它也是最容易被误用的符号之一。
“.”匹配一个单个的字符而不用关心被匹配的字符是什么。唯一的例外是新行符。在本教程中谈到的引擎,缺省情况下都是不匹配新行符的。因此在缺省情况下,“.”等于是字符集(Window)或( Unix)的简写。
这个例外是因为历史的原因。因为早期使用正则表达式的工具是基于行的。它们都是一行一行的读入一个文件,将正则表达式分别应用到每一行上去。在这些工具中,字符串是不包含新行符的。因此“.”也就从不匹配新行符。
现代的工具和语言能够将正则表达式应用到很大的字符串甚至整个文件上去。本教程讨论的所有正则表达式实现都提供一个选项,可以使“.”匹配所有的字符,包括新行符。在RegexBuddy, EditPad Pro或PowerGREP等工具中,你可以简单的选中“点号匹配新行符”。在Perl中,“.”可以匹配新行符的模式被称作“单行模式”。很不幸,这是一个很容易混淆的名词。因为还有所谓“多行模式”。多行模式只影响行首行尾的锚定(anchor),而单行模式只影响“.”。
其他语言和正则表达式库也采用了Perl的术语定义。当在.NET Framework中使用正则表达式类时,你可以用类似下面的语句来激活单行模式:Regex.Match(“string”,”regex”,RegexOptions.SingleLine)
· 保守的使用点号“.”
点号可以说是最强大的元字符。它允许你偷懒:用一个点号,就能匹配几乎所有的字符。但是问题在于,它也常常会匹配不该匹配的字符。
我会以一个简单的例子来说明。让我们看看如何匹配一个具有“mm/dd/yy”格式的日期,但是我们想允许用户来选择分隔符。很快能想到的一个方案是>。看上去它能匹配日期“02/12/03”。问题在于02512703也会被认为是一个有效的日期。
>看上去是一个好一点的解决方案。记住点号在一个字符集里不是元字符。这个方案远不够完善,它会匹配“99/99/99”。而>又更进一步。尽管他也会匹配“19/39/99”。你想要你的正则表达式达到如何完美的程度取决于你想达到什么样的目的。如果你想校验用户输入,则需要尽可能的完美。如果你只是想分析一个已知的源,并且我们知道没有错误的数据,用一个比较好的正则表达式来匹配你想要搜寻的字符就已经足够。
8. 字符串开始和结束的锚定
锚定和一般的正则表达式符号不同,它不匹配任何字符。相反,他们匹配的是字符之前或之后的位置。“^”匹配一行字符串第一个字符前的位置。>将会匹配字符串“abc”中的a。>将不会匹配“abc”中的任何字符。
类似的,$匹配字符串中最后一个字符的后面的位置。所以>匹配“abc”中的c。
· 锚定的应用
在编程语言中校验用户输入时,使用锚定是非常重要的。如果你想校验用户的输入为整数,用>。
用户输入中,常常会有多余的前导空格或结束空格。你可以用>和>来匹配前导空格或结束空格。
· 使用“^”和“$”作为行的开始和结束锚定
如果你有一个包含了多行的字符串。例如:“first line\n\rsecond line”(其中\n\r表示一个新行符)。常常需要对每行分别处理而不是整个字符串。因此,几乎所有的正则表达式引擎都提供一个选项,可以扩展这两种锚定的含义。“^”可以匹配字串的开始位置(在f之前),以及每一个新行符的后面位置(在\n\r和s之间)。类似的,$会匹配字串的结束位置(最后一个e之后),以及每个新行符的前面(在e与\n\r之间)。
在.NET中,当你使用如下代码时,将会定义锚定匹配每一个新行符的前面和后面位置:Regex.Match("string", "regex", RegexOptions.Multiline)
应用:string str = Regex.Replace(Original, "^", "> ", RegexOptions.Multiline)--将会在每行的行首插入“> ”。
· 绝对锚定
>只匹配整个字符串的开始位置,>只匹配整个字符串的结束位置。即使你使用了“多行模式”,>和>也从不匹配新行符。
即使\Z和$只匹配字符串的结束位置,仍然有一个例外的情况。如果字符串以新行符结束,则\Z和$将会匹配新行符前面的位置,而不是整个字符串的最后面。这个“改进”是由Perl引进的,然后被许多的正则表达式实现所遵循,包括Java,.NET等。如果应用>到“joe\n”,则匹配结果是“joe”而不是“joe\n”。
[ Last edited by 无奈何 on 2006-10-26 at 11:58 AM ]
### A Concise Explanation of Regular Expressions (Part 1)
Original post address: http://dragon.cnblogs.com/archive/2006/05/08/394078.html
Foreword:
About half a year ago I became interested in regular expressions. I searched for a lot of material online and read many tutorials, and finally, while using the regular expression tool RegexBuddy, I found that its tutorial is very well written, arguably the best regular expression tutorial I have seen so far. I had wanted to translate it ever since, and that wish only came true during this May Day holiday, which is how this article came about. As for the title, "A Concise Explanation" may sound clichéd, but after reading the original from start to finish I felt that only this phrase accurately conveys the impression the tutorial left on me, so I could not avoid the cliché.
This article is a translation of the tutorial written by Jan Goyvaerts for RegexBuddy. The copyright belongs to the original author. Reprinting is welcome, but out of respect for the work of the original author and the translator, please credit the source. Thank you!
1. What is a Regular Expression
Basically, a regular expression is a pattern that describes a certain amount of text. "Regex" is short for "regular expression". In this article, <<regex>> denotes a specific regular expression.
The most basic pattern is a piece of literal text, which simply matches the same text.
2. Different Regular Expression Engines
A regular expression engine is a piece of software that can process regular expressions. Usually the engine is part of a larger application. In the software world, different regular expression engines are not fully compatible with each other. This tutorial focuses on Perl 5 style engines because they are the most widely used, and it also mentions some differences from other engines. Many modern engines are very similar but not identical, for example the .NET regex library and the JDK regex package.
3. Literal Characters
The most basic regular expression consists of a single literal character, such as <<a>>. It matches the first occurrence of the character "a" in the string. For example, in the string "Jack is a boy", the "a" after the "J" is matched; the second "a" is not.
The regular expression can also match the second "a", but only if you tell the regex engine to resume searching after the first match. In a text editor you can use "Find Next". In a programming language there is usually a function that lets you continue searching forward from the position of the previous match.
Similarly, <<cat>> matches "cat" in "About cats and dogs". This is equivalent to telling the regex engine: find a <<c>>, immediately followed by an <<a>>, immediately followed by a <<t>>.
Note that regex engines are case-sensitive by default. Unless you tell the engine to ignore case, <<cat>> will not match "Cat".
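The behaviour described above (first match only, "find next", case sensitivity) can be sketched as follows. Python's `re` module is used as one example of a Perl-style engine; it is not named in the original tutorial, and the variable names are illustrative.

```python
import re

text = "Jack is a boy"

# search() stops at the first "a" (the one right after "J").
first = re.search(r"a", text)

# finditer() plays the role of "Find Next": it resumes after each match.
all_positions = [m.start() for m in re.finditer(r"a", text)]

# A multi-character literal must match character by character.
cat_pos = re.search(r"cat", "About cats and dogs").start()

# Matching is case-sensitive unless told otherwise.
case_sensitive = re.search(r"cat", "Cat")
case_insensitive = re.search(r"cat", "Cat", re.IGNORECASE)
```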
· Special Characters
Among literal characters, 11 are reserved for special purposes. They are:
[ \ ^ $ . | ? * + ( )
These special characters are also called metacharacters.
If you want to use any of these characters as a literal in a regular expression, you need to escape it with a backslash "\". For example, to match "1+1=2", the correct expression is <<1\+1=2>>.
Note that <<1+1=2>> is also a valid regular expression. It does not match "1+1=2", however; it matches "111=2" in "123+111=234", because "+" has a special meaning here (repeat one or more times).
In programming languages, be aware that some special characters are processed by the compiler first and only then passed to the regex engine. The regular expression <<1\+1=2>> must therefore be written as "1\\+1=2" in C++. To match "C:\temp", you need the regular expression <<C:\\temp>>, which in C++ becomes "C:\\\\temp".
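The two layers of escaping (the language's string literal first, the regex engine second) look like this in Python, used here as an illustrative stand-in for the C++ case in the text. Raw strings (`r"..."`) are Python's way of skipping the first layer; `re.escape` is a standard-library helper that escapes a literal for you.

```python
import re

# In the regex, "+" must be escaped to be taken literally.
escaped = re.search(r"1\+1=2", "1+1=2")

# Unescaped, "1+" means "one or more 1s", so a different text is matched.
unescaped = re.search(r"1+1=2", "123+111=234")
no_match = re.search(r"1+1=2", "1+1=2")

# A raw string keeps backslashes as typed; a normal literal needs them doubled.
raw_pattern = r"C:\\temp"
plain_pattern = "C:\\\\temp"
path_match = re.search(raw_pattern, "C:\\temp")  # the subject is C:\temp

# re.escape builds the escaped form of a literal automatically.
auto = re.search(re.escape("1+1=2"), "1+1=2")
```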
· Non-Printable Characters
Special character sequences can be used to represent certain non-printable characters:
<<\t>> represents a tab (0x09)
<<\r>> represents a carriage return (0x0D)
<<\n>> represents a line feed (0x0A)
Note that text files on Windows end each line with "\r\n", whereas Unix uses "\n".
4. Internal Working Mechanism of Regular Expression Engines
Knowing how the regex engine works helps you understand quickly why a particular regular expression does not behave as you expect.
There are two types of engines: text-directed engines and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines. This article covers regex-directed engines, because some very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines. It is no surprise that this type of engine is currently the most popular.
You can easily tell whether the engine you are using is text-directed or regex-directed. If backreferences or lazy quantifiers are available, you can be sure the engine is regex-directed. You can also run the following test: apply the regular expression <<regex|regex not>> to the string "regex not". If the result is "regex", the engine is regex-directed. If the result is "regex not", it is text-directed. This is because a regex-directed engine is "eager": it reports the first match it finds.
· The regex-directed engine always returns the leftmost match
This is an important point to understand: even if a "better" match could be found later, a regex-directed engine always returns the leftmost match.
When <<cat>> is applied to "He captured a catfish for his cat", the engine first compares <<c>> with "H", which fails. It then compares <<c>> with "e", which also fails. At the fourth character, <<c>> matches "c", and <<a>> matches the fifth character. At the sixth character, <<t>> fails to match "p". The engine then starts over from the fifth character. This continues until, starting at the fifteenth character, <<cat>> matches the "cat" in "catfish". The engine eagerly returns this first match and does not go on to look for other, possibly better matches.
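Both the "eager" test and the leftmost-match walk-through can be reproduced in a few lines. This sketch assumes Python's `re` module, which is a backtracking (regex-directed) engine; note that Python counts positions from 0, so the fifteenth character is index 14.

```python
import re

# Eagerness test: a regex-directed engine stops at the first alternative that works.
eager_result = re.search(r"regex|regex not", "regex not").group()

# Leftmost match: "cat" inside "catfish" wins over the standalone word "cat".
text = "He captured a catfish for his cat"
leftmost = re.search(r"cat", text)

# Asking for every match shows the later occurrence that search() never reports.
all_starts = [m.start() for m in re.finditer(r"cat", text)]
```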
5. Character Sets
A character set is a collection of characters enclosed in a pair of square brackets "[]". With a character set you tell the regex engine to match only one out of several characters. To match an "a" or an "e", use <<[ae]>>. You can use <<gr[ae]y>> to match gray or grey, which is especially useful when you are not sure whether the text you are searching uses American or British spelling. Conversely, <<gr[ae]y>> will not match graay or graey. The order of the characters inside a character set does not matter; the result is the same.
You can use a hyphen "-" inside a character set to define a range of characters. <<[0-9]>> matches a single digit from 0 to 9. You can use more than one range: <<[0-9a-fA-F]>> matches a single hexadecimal digit, case-insensitively. You can also combine ranges with single characters: <<[0-9a-fA-FX]>> matches a hexadecimal digit or the letter X. Again, the order of the characters and ranges has no effect on the result.
· Some Applications of Character Sets
Find a word that may be misspelled, such as <<sep[ae]r[ae]te>> or <<li[cs]en[cs]e>>.
Find identifiers in a programming language: <<[A-Za-z_][A-Za-z_0-9]*>>. (* means repeat zero or more times.)
Find C-style hexadecimal numbers: <<0[xX][A-Fa-f0-9]+>>. (+ means repeat one or more times.)
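The character-set applications listed above can be checked with a short script. The patterns are the ones from the tutorial; the sample strings and the use of Python's `re.findall` are assumptions made for this sketch.

```python
import re

# One character out of a set: both spellings match, doubled vowels do not.
spellings = re.findall(r"gr[ae]y", "gray grey graay")

# Tolerating a common misspelling.
separate = re.findall(r"sep[ae]r[ae]te", "seperate and separate")

# Identifiers: a letter or underscore, then letters, digits or underscores.
identifiers = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", "x1 = _tmp + 2y")

# C-style hexadecimal numbers.
hex_numbers = re.findall(r"0[xX][A-Fa-f0-9]+", "0x1F 0XaB 0xZZ 12")
```

In the identifier example, "2y" yields only "y": no word boundary is required by the pattern, so the engine simply skips the digit it cannot start with.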
· Negated Character Sets
Placing a caret "^" immediately after the left square bracket "[" negates the character set: it then matches any single character that is not listed between the brackets.
Inside a character set, only four characters have a special meaning: "] \ ^ -". "]" ends the character set definition, "\" escapes, "^" negates, and "-" defines a range. All other common metacharacters are ordinary characters inside a character set and do not need to be escaped. For example, to search for an asterisk * or a plus sign +, you can use <<[+*]>>. If you do escape those metacharacters, your regular expression will still work, but it will be less readable.
To use a backslash "\" inside a character set as a literal character rather than as an escape, you need to escape it with another backslash. <<[\\x]>> matches a backslash or an x. The characters "] ^ -" can either be escaped with a backslash or be placed where their special meaning cannot apply. We recommend the latter because it improves readability. For example, a "^" placed anywhere other than directly after the left bracket "[" is taken literally, so <<[x^]>> matches an x or a caret. <<[]x]>> matches a "]" or an "x". <<[-x]>> and <<[x-]>> both match a "-" or an "x".
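The rules for special characters inside a character set can be verified directly. This sketch assumes Python's `re` module; other engines mostly agree, although some (JavaScript, for instance) treat a leading "]" differently, so take the `[]x]` line as Python-specific behaviour.

```python
import re

# Ordinary metacharacters lose their special meaning inside a set.
star_or_plus = re.findall(r"[+*]", "2*3+4")

# A literal backslash must still be escaped inside a set.
backslash_or_x = re.findall(r"[\\x]", "a\\x")  # the subject text is a\x

# A caret that is not first, and a hyphen that is first or last, are literals.
x_or_caret = re.findall(r"[x^]", "x^y")
hyphen_first = re.findall(r"[-x]", "a-x")
hyphen_last = re.findall(r"[x-]", "a-x")

# A closing bracket placed first is a literal in Python's engine.
bracket_or_x = re.findall(r"[]x]", "a]x")

# A caret placed first negates the set.
non_digits = re.findall(r"[^0-9]", "a1b2")
```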
· Shorthand for Character Sets
Because some character sets are used very often, there are shorthand forms for them.
<<\d>> stands for <<[0-9]>>.
<<\w>> stands for word characters. What this covers varies between regex implementations; in most of them the word character set includes <<[A-Za-z0-9_]>>.
<<\s>> stands for whitespace characters. This also depends on the implementation; in most of them it includes the space, the tab, and the carriage return and line feed characters, <<[ \t\r\n]>>.
The shorthand forms can be used both inside and outside square brackets. <<\s\d>> matches a whitespace character followed by a digit. <<[\s\d]>> matches a single character that is either whitespace or a digit. <<[\da-fA-F]>> matches a hexadecimal digit.
Shorthand for negated character sets:
<<\D>> = <<[^\d]>>
<<\W>> = <<[^\w]>>
<<\S>> = <<[^\s]>>
· Repetition of Character Sets
If you repeat a character set with the "?", "*" or "+" operators, you repeat the entire character set, not just the character that it matched. The regular expression <<[0-9]+>> matches 837 as well as 222.
If you only want to repeat the character that was matched, you can do so with a backreference. Backreferences are covered later.
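The difference between repeating a character set and repeating the matched character can be seen side by side. The sketch below uses Python's `re` module (an assumption of this example); the backreference pattern `([0-9])\1+` is an illustration of the idea the text defers to later, not a pattern taken from the tutorial. In Python 3, `\d` and `\s` on text strings also cover non-ASCII digits and whitespace, which the ASCII-only sample strings here do not exercise.

```python
import re

text = "837 and 222"

# Repeating the set: each repetition may pick a different digit.
any_digits = re.findall(r"[0-9]+", text)

# Repeating the matched character: \1 must equal what the group captured.
same_digit = [m.group() for m in re.finditer(r"([0-9])\1+", text)]

# Shorthands outside and inside square brackets.
space_then_digit = re.findall(r"\s\d", "a 1 b2")
space_or_digit = re.findall(r"[\s\d]", "a 1")
hex_digits = re.findall(r"[\da-fA-F]", "3fz")
```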
6. Using ?, * or + for Repetition
?: tells the engine to match the preceding token zero times or once. In effect, it makes the preceding token optional.
+: tells the engine to match the preceding token one or more times.
*: tells the engine to match the preceding token zero or more times.
<<<[A-Za-z][A-Za-z0-9]*>>> matches an HTML tag without attributes. "<" and ">" are literal characters. The first character set matches a letter; the second matches a letter or a digit.
It might seem that <<<[A-Za-z0-9]+>>> would work as well, but it also matches "<1>". Still, that regular expression is good enough when you know that the text you are searching does not contain such invalid tags.
· Restrictive Repetition
Many modern regex implementations let you specify how many times a token is repeated. The syntax is {min,max}, where min and max are non-negative integers. If the comma is present but max is omitted, there is no upper limit. If both the comma and max are omitted, the token is repeated exactly min times.
Therefore {0,} is the same as *, and {1,} is the same as +.
You can use <<\b[1-9][0-9]{3}\b>> to match a number between 1000 and 9999 ("\b" denotes a word boundary). <<\b[1-9][0-9]{2,4}\b>> matches a number between 100 and 99999.
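A quick check of the two range patterns shows how the word boundaries keep shorter and longer numbers out. The sample strings are made up for this sketch, which again assumes Python's `re` module.

```python
import re

# Exactly four digits, not starting with 0, delimited by word boundaries.
four_digit = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000 0123")

# Three to five digits in total.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

# {0,} behaves like * and {1,} behaves like +.
star_equivalent = re.fullmatch(r"zo{0,}", "z") is not None
plus_equivalent = re.fullmatch(r"zo{1,}", "z") is not None
```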
· Note on Greediness
Suppose you want to use a regular expression to match an HTML tag. You know the input will be a valid HTML file, so the regular expression does not need to exclude invalid tags: anything between two angle brackets should be an HTML tag.
Many regex beginners will first think of the regular expression <<<.+>>>, and they will be surprised by what it does with the test string "This is a <EM>first</EM> test". You might expect it to return "<EM>" and then, when matching continues, "</EM>".
But it does not. The regular expression matches "<EM>first</EM>", which is obviously not what we want. The reason is that "+" is greedy: it makes the regex engine repeat the preceding token as many times as possible. Only when that causes the overall match to fail does the engine backtrack, that is, give up the last repetition and then try the remainder of the regular expression again.
Like "+", the repetition operators "?" and "*" are also greedy.
· A Look Inside the Regex Engine
Let's see how the regex engine handles the previous example. The first token is <<<>>, a literal that matches the first "<" in the string. The next token is the dot, repeated by the greedy plus, so the engine repeats the dot as many times as it can, all the way to the end of the string. Only then does the engine move on to the next token, <<>>>. So far, <<<.+>> has matched "<EM>first</EM> test". The engine tries to match <<>>> at the end of the string, which fails, so it backtracks. <<<.+>> now matches "<EM>first</EM> tes", and the engine tries <<>>> against "t". That fails as well. This process continues until <<<.+>> matches "<EM>first</EM" and <<>>> matches the ">" that follows. The engine has now found the match "<EM>first</EM>". Remember that a regex-directed engine is eager: it reports the first match it finds and does not keep backtracking, even though there may be a better match such as "<EM>". So because "+" is greedy, the engine returns the leftmost match and makes it as long as possible.
· Replace Greediness with Laziness
One possible fix for the problem above is to make "+" lazy instead of greedy. You do this by putting a question mark "?" directly after the "+". The same works for the repetitions expressed by "*", "{}" and "?". In the example above we can therefore use <<<.+?>>>. Let's look at how the regex engine processes it.
Again, the token <<<>> matches the first "<". The next token is the dot, this time repeated by a lazy plus, which tells the engine to repeat the dot as few times as possible. The dot matches "E" once, and the engine then tries <<>>> against "M", which fails. The engine backtracks, but unlike in the previous example, the repetition is lazy, so the engine expands it instead of reducing it: <<<.+?>> now matches "<EM". The engine tries <<>>> against the next character, ">", and this time the match succeeds. The engine reports "<EM>" as a successful match. That is roughly how the whole process works.
· An Alternative to Lazy Expansion
There is an even better alternative: a greedy repetition combined with a negated character set, "<[^>]+>". It is better because with lazy repetition the engine backtracks once for every character it adds before it finds a match, whereas the negated character set needs no backtracking at all.
Finally, keep in mind that this tutorial only covers regex-directed engines. Text-directed engines do not backtrack, but they also do not support lazy repetition.
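The three variants discussed in this section can be compared side by side. This is only a minimal sketch in Python (whose re module is a backtracking, regex-directed engine). The sample string is assumed to be the tutorial's "This is a <EM>first</EM> test", and the variable names are purely illustrative.

```python
import re

text = "This is a <EM>first</EM> test"

# Greedy: ".+" grabs as much as it can, then backtracks to the last ">".
greedy = re.search(r"<.+>", text).group()

# Lazy: ".+?" grows one character at a time until ">" can match.
lazy = re.search(r"<.+?>", text).group()

# Negated character set: cannot run past ">", so no backtracking is needed.
negated = re.findall(r"<[^>]+>", text)

print(greedy)   # <EM>first</EM>
print(lazy)     # <EM>
print(negated)  # ['<EM>', '</EM>']
```

The lazy and negated-set versions find the same tags here; the difference lies in how much work the engine does to get there.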
7. Using "." to Match Almost Any Character
In regular expressions, "." is one of the most commonly used symbols. Unfortunately, it is also one of the most easily misused.
"." matches any single character, whatever it is, with one exception: the newline character. In the engines discussed in this tutorial, the dot does not match a newline by default. So by default "." is shorthand for the character set "[^\n\r]" (Windows) or "[^\n]" (Unix).
This exception is due to historical reasons. Because the early tools using regular expressions were line-based. They all read a file line by line and applied regular expressions to each line separately. In these tools, the string does not contain newline characters. Therefore, "." never matches a newline character.
Modern tools and languages can apply regular expressions to very large strings or even entire files. All regex implementations discussed in this tutorial provide an option that makes "." match every character, including newlines. In tools such as RegexBuddy, EditPad Pro, or PowerGREP, you simply select "Dot matches newline". In Perl, the mode in which "." matches newlines is called "single-line mode". Unfortunately this is a confusing term, because there is also a so-called "multiline mode". Multiline mode only affects how the start and end of a line are anchored, while single-line mode only affects ".".
Other languages and regex libraries have adopted Perl's terminology. With the regex classes of the .NET Framework, you can activate single-line mode with a call such as: Regex.Match("string", "regex", RegexOptions.Singleline)
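For comparison, here is a small Python sketch of the same idea. Python calls the option re.DOTALL (also available inline as "(?s)"). The sample text and variable names are assumptions made for illustration.

```python
import re

text = "first line\nsecond line"

# Default: "." stops at the newline.
default = re.search(r"first.*", text).group()

# With DOTALL ("single-line mode"), "." also matches the newline.
dotall = re.search(r"first.*", text, re.DOTALL).group()

# The same switch written inside the pattern.
inline = re.search(r"(?s)first.*", text).group()

print(repr(default))  # 'first line'
print(repr(dotall))   # 'first line\nsecond line'
```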
· Conservative Use of Dot "."
The dot can be said to be the most powerful metacharacter. It allows you to be lazy: with a dot, you can match almost all characters. But the problem is that it often matches characters that should not be matched.
I will illustrate with a simple example. Suppose we want to match a date in "mm/dd/yy" format, but let the user choose the separator. A solution that quickly comes to mind is "\d\d.\d\d.\d\d". It looks fine at first: it matches the date "02/12/03". The problem is that "02512703" is also accepted as a valid date.
"\d\d[- /.]\d\d[- /.]\d\d" looks like a better solution. Remember that the dot is not a metacharacter inside a character set. This is still far from perfect: it matches "99/99/99". "[0-1]\d[- /.][0-3]\d[- /.]\d\d" goes a step further, although it still matches "19/39/99". How perfect your regular expression needs to be depends on what you want to achieve. If you are validating user input, it should be as strict as possible. If you are only parsing data from a known source that you know contains no bad data, a reasonably good regular expression that matches what you are looking for is enough.
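The three date patterns can be checked against the sample inputs from the text. This is a hedged sketch in Python: the patterns are assumed to be "\d\d.\d\d.\d\d", "\d\d[- /.]\d\d[- /.]\d\d", and "[0-1]\d[- /.][0-3]\d[- /.]\d\d", and the names loose, separated, and ranged are invented for the example.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")                       # any separator, even a digit
separated = re.compile(r"\d\d[- /.]\d\d[- /.]\d\d")           # only real separators
ranged = re.compile(r"[0-1]\d[- /.][0-3]\d[- /.]\d\d")        # rough range check

samples = ["02/12/03", "02512703", "99/99/99", "19/39/99"]

# For each sample, record which of the three patterns accept it.
results = {
    s: (bool(loose.fullmatch(s)),
        bool(separated.fullmatch(s)),
        bool(ranged.fullmatch(s)))
    for s in samples
}

for s, flags in results.items():
    print(s, flags)
```

Each step rejects more bad input, but even the last pattern accepts "19/39/99", which is the trade-off the paragraph above describes.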
8. Anchoring at the Start and End of the String
Anchors differ from ordinary regex tokens. They do not match any character. Instead, they match a position before or after a character. "^" matches the position before the first character of the string. "^a" matches the "a" in the string "abc", while "^b" matches nothing in "abc".
Similarly, "$" matches the position after the last character of the string. So "c$" matches the "c" in "abc".
· Application of Anchors
When validating user input in a programming language, anchors are very important. To check that the user's input is an integer, use "^\d+$".
User input often contains stray leading or trailing spaces. You can use "^\s+" and "\s+$" to match leading and trailing whitespace.
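Both applications can be sketched in a few lines of Python. The helper names is_integer and trim are hypothetical, and the patterns are assumed to be "^\d+$" for the integer check and "^\s+" / "\s+$" for the whitespace.

```python
import re

def is_integer(value: str) -> bool:
    """Return True if the whole input consists of digits only."""
    return re.search(r"^\d+$", value) is not None

def trim(value: str) -> str:
    """Remove leading and trailing whitespace using the two anchored patterns."""
    return re.sub(r"^\s+|\s+$", "", value)

print(is_integer("12345"))          # True
print(is_integer("12a45"))          # False
print(repr(trim("  hello world \t")))  # 'hello world'
```

Without the anchors, "\d+" would happily match the "12" inside "12a45", which is why validation patterns should be anchored at both ends.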
· Using "^" and "$" as Line Start and End Anchors
Suppose you have a string that contains several lines, for example "first line\n\rsecond line" (where \n\r stands for a line break). Often each line should be processed separately rather than the string as a whole. For this reason, almost all regex engines offer an option that extends the meaning of the two anchors. "^" then matches at the start of the string (before the f) and after each line break (between \n\r and s). Likewise, "$" matches at the end of the string (after the last e) and before each line break (between e and \n\r).
In .NET, the following call makes the anchors match before and after each line break: Regex.Match("string", "regex", RegexOptions.Multiline)
Application: string str = Regex.Replace(Original, "^", "> ", RegexOptions.Multiline) inserts "> " at the start of each line.
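The same line-prefixing trick can be sketched in Python, where the option is called re.MULTILINE. The sample text is an assumption chosen for illustration.

```python
import re

original = "first line\nsecond line\nthird line"

# With MULTILINE, "^" matches at the start of the string and after each newline.
quoted = re.sub(r"^", "> ", original, flags=re.MULTILINE)

# Without the flag, "^" only matches at the very start of the string.
first_only = re.sub(r"^", "> ", original)

print(quoted)
```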
· Absolute Anchors
"\A" only matches at the start of the entire string, and "\Z" only matches at the end of the entire string. Even in "multiline mode", "\A" and "\Z" never match at line breaks.
Although "\Z" and "$" only match at the end of the string, there is one exception. If the string ends with a line break, "\Z" and "$" match at the position before that line break rather than at the very end of the string. This "improvement" was introduced by Perl and then adopted by many regex implementations, including Java and .NET. If you apply "^[a-z]+$" to "joe\n", the match is "joe" and not "joe\n".
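The trailing-newline exception is easy to observe. The sketch below uses Python and assumes the pattern "^[a-z]+$" for the example. One caveat: in Python's re module "$" follows the Perl behavior described above, but Python's "\Z" is strict and matches only at the very end of the string, so it does not show the exception.

```python
import re

line = "joe\n"

# "$" matches just before the final newline, so the match is "joe".
dollar = re.search(r"^[a-z]+$", line)

# Python's "\Z" only matches at the absolute end, after the newline.
strict = re.search(r"\A[a-z]+\Z", line)

print(repr(dollar.group()))  # 'joe'
print(strict)                # None
```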
Last edited by 无奈何 on 2006-10-26 at 11:58 AM ]
2006-10-26 11:43 |
|
|
无奈何
『第 4 楼』:
深入浅出之正则表达式(二)
深入浅出之正则表达式(二)
原贴地址: http://dragon.cnblogs.com/archive/2006/05/09/394923.html
前言:
本文是前一片文章《深入浅出之正则表达式(一)》的续篇,在本文中讲述了正则表达式中的组与向后引用,先前向后查看,条件测试,单词边界,选择符等表达式及例子,并分析了正则引擎在执行匹配时的内部机理。
本文是Jan Goyvaerts为RegexBuddy写的教程的译文,版权归原作者所有,欢迎转载。但是为了尊重原作者和译者的劳动,请注明出处!谢谢!
9. 单词边界
元字符>也是一种对位置进行匹配的“锚”。这种匹配是0长度匹配。
有4种位置被认为是“单词边界”:
1) 在字符串的第一个字符前的位置(如果字符串的第一个字符是一个“单词字符”)
2) 在字符串的最后一个字符后的位置(如果字符串的最后一个字符是一个“单词字符”)
3) 在一个“单词字符”和“非单词字符”之间,其中“非单词字符”紧跟在“单词字符”之后
4) 在一个“非单词字符”和“单词字符”之间,其中“单词字符”紧跟在“非单词字符”后面
“单词字符”是可以用“\w”匹配的字符,“非单词字符”是可以用“\W”匹配的字符。在大多数的正则表达式实现中,“单词字符”通常包括>。
例如:>能够匹配单个的4而不是一个更大数的一部分。这个正则表达式不会匹配“44”中的4。
换种说法,几乎可以说>匹配一个“字母数字序列”的开始和结束的位置。
“单词边界”的取反集为>,他要匹配的位置是两个“单词字符”之间或者两个“非单词字符”之间的位置。
· 深入正则表达式引擎内部
让我们看看把正则表达式>应用到字符串“This island is beautiful”。引擎先处理符号>。因为\b是0长度 ,所以第一个字符T前面的位置会被考察。因为T是一个“单词字符”,而它前面的字符是一个空字符(void),所以\b匹配了单词边界。接着>和第一个字符“T”匹配失败。匹配过程继续进行,直到第五个空格符,和第四个字符“s”之间又匹配了>。然而空格符和>不匹配。继续向后,到了第六个字符“i”,和第五个空格字符之间匹配了>,然后>和第六、第七个字符都匹配了。然而第八个字符和第二个“单词边界”不匹配,所以匹配又失败了。到了第13个字符i,因为和前面一个空格符形成“单词边界”,同时>和“is”匹配。引擎接着尝试匹配第二个>。因为第15个空格符和“s”形成单词边界,所以匹配成功。引擎“急着”返回成功匹配的结果。
10. 选择符
正则表达式中“|”表示选择。你可以用选择符匹配多个可能的正则表达式中的一个。
如果你想搜索文字“cat”或“dog”,你可以用>。如果你想有更多的选择,你只要扩展列表>。
选择符在正则表达式中具有最低的优先级,也就是说,它告诉引擎要么匹配选择符左边的所有表达式,要么匹配右边的所有表达式。你也可以用圆括号来限制选择符的作用范围。如>,这样告诉正则引擎把(cat|dog)当成一个正则表达式单位来处理。
· 注意正则引擎的“急于表功”性
正则引擎是急切的,当它找到一个有效的匹配时,它会停止搜索。因此在一定条件下,选择符两边的表达式的顺序对结果会有影响。假设你想用正则表达式搜索一个编程语言的函数列表:Get,GetValue,Set或SetValue。一个明显的解决方案是>。让我们看看当搜索SetValue时的结果。
因为>和>都失败了,而>匹配成功。因为正则导向的引擎都是“急切”的,所以它会返回第一个成功的匹配,就是“Set”,而不去继续搜索是否有其他更好的匹配。
和我们期望的相反,正则表达式并没有匹配整个字符串。有几种可能的解决办法。一是考虑到正则引擎的“急切”性,改变选项的顺序,例如我们使用>,这样我们就可以优先搜索最长的匹配。我们也可以把四个选项结合起来成两个选项:>。因为问号重复符是贪婪的,所以SetValue总会在Set之前被匹配。
一个更好的方案是使用单词边界:>或>。更进一步,既然所有的选择都有相同的结尾,我们可以把正则表达式优化为>。
11. 组与向后引用
把正则表达式的一部分放在圆括号内,你可以将它们形成组。然后你可以对整个组使用一些正则操作,例如重复操作符。
要注意的是,只有圆括号“()”才能用于形成组。“”用于定义字符集。“{}”用于定义重复操作。
当用“()”定义了一个正则表达式组后,正则引擎则会把被匹配的组按照顺序编号,存入缓存。当对被匹配的组进行向后引用的时候,可以用“\数字”的方式进行引用。>引用第一个匹配的后向引用组,>引用第二个组,以此类推,>引用第n个组。而>则引用整个被匹配的正则表达式本身。我们看一个例子。
假设你想匹配一个HTML标签的开始标签和结束标签,以及标签中间的文本。比如This is a test,我们要匹配和以及中间的文字。我们可以用如下正则表达式:“]*>.*?”
首先,“”将会匹配“”的第一个字符“”。然后匹配B,*将会匹配0到多次字母数字,后面紧接着0到多个非“>”的字符。最后正则表达式的“>”将会匹配“”的“>”。接下来正则引擎将对结束标签之前的字符进行惰性匹配,直到遇到一个“”符号。然后正则表达式中的“\1”表示对前面匹配的组“(*)”进行引用,在本例中,被引用的是标签名“B”。所以需要被匹配的结尾标签为“”
你可以对相同的后向引用组进行多次引用,>将匹配“axaxa”、“bxbxb”以及“cxcxc”。如果用数字形式引用的组没有有效的匹配,则引用到的内容简单的为空。
一个后向引用不能用于它自身。>是错误的。因此你不能将>用于一个正则表达式匹配本身,它只能用于替换操作中。
后向引用不能用于字符集内部。>中的>并不表示后向引用。在字符集内部,>可以被解释为八进制形式的转码。
向后引用会降低引擎的速度,因为它需要存储匹配的组。如果你不需要向后引用,你可以告诉引擎对某个组不存储。例如:>。其中“(”后面紧跟的“?:”会告诉引擎对于组(Value),不存储匹配的值以供后向引用。
· 重复操作与后向引用
当对组使用重复操作符时,缓存里后向引用内容会被不断刷新,只保留最后匹配的内容。例如:>将匹配“cab=cab”,但是>却不会。因为()第一次匹配“c”时,“\1”代表“c”;然后()会继续匹配“a”和“b”。最后“\1”代表“b”,所以它会匹配“cab=b”。
应用:检查重复单词--当编辑文字时,很容易就会输入重复单词,例如“the the”。使用>可以检测到这些重复单词。要删除第二个单词,只要简单的利用替换功能替换掉“\1”就可以了。
· 组的命名和引用
在PHP,Python中,可以用group)>>来对组进行命名。在本例中,词法?P就是对组(group)进行了命名。其中name是你对组的起的名字。你可以用(?P=name)进行引用。
.NET的命名组
.NET framework也支持命名组。不幸的是,微软的程序员们决定发明他们自己的语法,而不是沿用Perl、Python的规则。目前为止,还没有任何其他的正则表达式实现支持微软发明的语法。
下面是.NET中的例子:
(?group)(?’second’group)
正如你所看到的,.NET提供两种词法来创建命名组:一是用尖括号“”,或者用单引号“’’”。尖括号在字符串中使用更方便,单引号在ASP代码中更有用,因为ASP代码中“”被用作HTML标签。
要引用一个命名组,使用\k或\k’name’.
当进行搜索替换时,你可以用“${name}”来引用一个命名组。
12. 正则表达式的匹配模式
本教程所讨论的正则表达式引擎都支持三种匹配模式:
>使正则表达式对大小写不敏感,
>开启“单行模式”,即点号“.”匹配新行符
>开启“多行模式”,即“^”和“$”匹配新行符的前面和后面的位置。
· 在正则表达式内部打开或关闭模式
如果你在正则表达式内部插入修饰符(?ism),则该修饰符只对其右边的正则表达式起作用。(?-i)是关闭大小写不敏感。你可以很快的进行测试。>应该匹配TEst,但是不能匹配teST或TEST.
13. 原子组与防止回溯
在一些特殊情况下,因为回溯会使得引擎的效率极其低下。
让我们看一个例子:要匹配这样的字串,字串中的每个字段间用逗号做分隔符,第12个字段由P开头。
我们容易想到这样的正则表达式>。这个正则表达式在正常情况下工作的很好。但是在极端情况下,如果第12个字段不是由P开头,则会发生灾难性的回溯。如要搜索的字串为“1,2,3,4,5,6,7,8,9,10,11,12,13”。首先,正则表达式一直成功匹配直到第12个字符。这时,前面的正则表达式消耗的字串为“1,2,3,4,5,6,7,8,9,10,11,”,到了下一个字符,并不匹配“12”。所以引擎进行回溯,这时正则表达式消耗的字串为“1,2,3,4,5,6,7,8,9,10,11”。继续下一次匹配过程,下一个正则符号为点号>,可以匹配下一个逗号“,”。然而,>>并不匹配字符“12”中的“1”。匹配失败,继续回溯。大家可以想象,这样的回溯组合是个非常大的数量。因此可能会造成引擎崩溃。
用于阻止这样巨大的回溯有几种方案:
一种简单的方案是尽可能的使匹配精确。用取反字符集代替点号。例如我们用如下正则表达式>,这样可以使失败回溯的次数下降到11次。
另一种方案是使用原子组。
原子组的目的是使正则引擎失败的更快一点。因此可以有效的阻止海量回溯。原子组的语法是正则表达式)>>。位于(?>)之间的所有正则表达式都会被认为是一个单一的正则符号。一旦匹配失败,引擎将会回溯到原子组前面的正则表达式部分。前面的例子用原子组可以表达成(.*?,){11})P>>。一旦第十二个字段匹配失败,引擎回溯到原子组前面的>。
14. 向前查看与向后查看
Perl 5 引入了两个强大的正则语法:“向前查看”和“向后查看”。他们也被称作“零长度断言”。他们和锚定一样都是零长度的(所谓零长度即指该正则表达式不消耗被匹配的字符串)。不同之处在于“前后查看”会实际匹配字符,只是他们会抛弃匹配只返回匹配结果:匹配或不匹配。这就是为什么他们被称作“断言”。他们并不实际消耗字符串中的字符,而只是断言一个匹配是否可能。
几乎本文讨论的所有正则表达式的实现都支持“向前向后查看”。唯一的一个例外是Javascript只支持向前查看。
· 肯定和否定式的向前查看
如我们前面提过的一个例子:要查找一个q,后面没有紧跟一个u。也就是说,要么q后面没有字符,要么后面的字符不是u。采用否定式向前查看后的一个解决方案为>。否定式向前查看的语法是查看的内容)>>。
肯定式向前查看和否定式向前查看很类似:查看的内容)>>。
如果在“查看的内容”部分有组,也会产生一个向后引用。但是向前查看本身并不会产生向后引用,也不会被计入向后引用的编号中。这是因为向前查看本身是会被抛弃掉的,只保留匹配与否的判断结果。如果你想保留匹配的结果作为向后引用,你可以用>来产生一个向后引用。
· 肯定和否定式的先后查看
向后查看和向前查看有相同的效果,只是方向相反
否定式向后查看的语法是:查看内容)>>
肯定式向后查看的语法是:查看内容)>>
我们可以看到,和向前查看相比,多了一个表示方向的左尖括号。
例:>将会匹配一个没有“a”作前导字符的“b”。
值得注意的是:向前查看从当前字符串位置开始对“查看”正则表达式进行匹配;向后查看则从当前字符串位置开始先后回溯一个字符,然后再开始对“查看”正则表达式进行匹配。
· 深入正则表达式引擎内部
让我们看一个简单例子。
把正则表达式>应用到字符串“Iraq”。正则表达式的第一个符号是>。正如我们知道的,引擎在匹配>以前会扫过整个字符串。当第四个字符“q”被匹配后,“q”后面是空字符(void)。而下一个正则符号是向前查看。引擎注意到已经进入了一个向前查看正则表达式部分。下一个正则符号是>,和空字符不匹配,从而导致向前查看里的正则表达式匹配失败。因为是一个否定式的向前查看,意味着整个向前查看结果是成功的。于是匹配结果“q”被返回了。
我们在把相同的正则表达式应用到“quit”。>匹配了“q”。下一个正则符号是向前查看部分的>,它匹配了字符串中的第二个字符“i”。引擎继续走到下个字符“i”。然而引擎这时注意到向前查看部分已经处理完了,并且向前查看已经成功。于是引擎抛弃被匹配的字符串部分,这将导致引擎回退到字符“u”。
因为向前查看是否定式的,意味着查看部分的成功匹配导致了整个向前查看的失败,因此引擎不得不进行回溯。最后因为再没有其他的“q”和>匹配,所以整个匹配失败了。
为了确保你能清楚地理解向前查看的实现,让我们把>应用到“quit”。>首先匹配“q”。然后向前查看成功匹配“u”,匹配的部分被抛弃,只返回可以匹配的判断结果。引擎从字符“i”回退到“u”。由于向前查看成功了,引擎继续处理下一个正则符号>。结果发现>和“u”不匹配。因此匹配失败了。由于后面没有其他的“q”,整个正则表达式的匹配失败了。
· 更进一步理解正则表达式引擎内部机制
让我们把>应用到“thingamabob”。引擎开始处理向后查看部分的正则符号和字符串中的第一个字符。在这个例子中,向后查看告诉正则表达式引擎回退一个字符,然后查看是否有一个“a”被匹配。因为在“t”前面没有字符,所以引擎不能回退。因此向后查看失败了。引擎继续走到下一个字符“h”。再一次,引擎暂时回退一个字符并检查是否有个“a”被匹配。结果发现了一个“t”。向后查看又失败了。
向后查看继续失败,直到正则表达式到达了字符串中的“m”,于是肯定式的向后查看被匹配了。因为它是零长度的,字符串的当前位置仍然是“m”。下一个正则符号是>,和“m”匹配失败。下一个字符是字符串中的第二个“a”。引擎向后暂时回退一个字符,并且发现>不匹配“m”。
在下一个字符是字符串中的第一个“b”。引擎暂时性的向后退一个字符发现向后查看被满足了,同时>匹配了“b”。因此整个正则表达式被匹配了。作为结果,正则表达式返回字符串中的第一个“b”。
· 向前向后查看的应用
我们来看这样一个例子:查找一个具有6位字符的,含有“cat”的单词。
首先,我们可以不用向前向后查看来解决问题,例如:
>
足够简单吧!但是当需求变成查找一个具有6-12位字符,含有“cat”,“dog”或“mouse”的单词时,这种方法就变得有些笨拙了。
我们来看看使用向前查看的方案。在这个例子中,我们有两个基本需求要满足:一是我们需要一个6位的字符,二是单词含有“cat”。
满足第一个需求的正则表达式为>。满足第二个需求的正则表达式为>。
把两者结合起来,我们可以得到如下的正则表达式:
>
具体的匹配过程留给读者。但是要注意的一点是,向前查看是不消耗字符的,因此当判断单词满足具有6个字符的条件后,引擎会从开始判断前的位置继续对后面的正则表达式进行匹配。
最后作些优化,可以得到下面的正则表达式:
>
15. 正则表达式中的条件测试
条件测试的语法为>。“if”部分可以是向前向后查看表达式。如果用向前查看,则语法变为:>,其中else部分是可选的。
如果if部分为true,则正则引擎会试图匹配then部分,否则引擎会试图匹配else部分。
需要记住的是,向前先后查看并不实际消耗任何字符,因此后面的then与else部分的匹配时从if测试前的部分开始进行尝试。
16. 为正则表达式添加注释
在正则表达式中添加注释的语法是:>
例:为用于匹配有效日期的正则表达式添加注释:
(?#year)(19|20)\d\d(?#month)(0|1)(?#day)(0||3)
Last edited by 无奈何 on 2006-10-26 at 11:47 AM ]
In-depth Explanation of Regular Expressions (Part 2)
Original post address: http://dragon.cnblogs.com/archive/2006/05/09/394923.html
Foreword:
This article is the sequel to "In-depth Explanation of Regular Expressions (Part 1)". It covers groups and backreferences, lookahead and lookbehind, conditional tests, word boundaries, alternation, and related constructs, with examples, and it analyzes what the regex engine does internally while matching.
This article is a translation of a tutorial written by Jan Goyvaerts for RegexBuddy. The copyright belongs to the original author. Reprinting is welcome. But in order to respect the labor of the original author and the translator, please indicate the source! Thank you!
9. Word Boundaries
The metacharacter "\b" is also an "anchor" that matches a position. Such a match is a zero-length match.
There are 4 positions that count as "word boundaries":
1) The position before the first character in the string (if the first character of the string is a "word character")
2) The position after the last character in the string (if the last character of the string is a "word character")
3) Between a "word character" and a "non-word character", where the "non-word character" immediately follows the "word character"
4) Between a "non-word character" and a "word character", where the "word character" immediately follows the "non-word character"
A "word character" is a character that can be matched by "\w", and a "non-word character" is a character that can be matched by "\W". In most regex implementations, "word characters" include letters, digits, and the underscore.
For example, "\b4\b" matches a 4 that stands alone, not a 4 that is part of a larger number. This regular expression does not match the 4s in "44".
In other words, you could almost say that "\b" matches at the start and at the end of an "alphanumeric sequence".
The negated version of "\b" is "\B". It matches at every position that is not a word boundary, that is, between two "word characters" or between two "non-word characters".
· Delving into the Inside of the Regular Expression Engine
Let's apply the regular expression "\bis\b" to the string "This island is beautiful". The engine starts with the token "\b". Because "\b" is zero-length, it inspects the position before the first character, T. T is a "word character" and what precedes it is the void before the start of the string, so "\b" matches a word boundary. Next, "i" is tried against the first character "T" and fails. The process continues through the string until the position between the fourth character "s" and the fifth character, a space, where "\b" matches again. However, the space does not match "i". Moving on to the sixth character "i", "\b" matches between the space and "i", and then "is" matches the sixth and seventh characters. But the position before the eighth character is not a word boundary, so the second "\b" fails and the attempt fails again. At the 13th character, "i", there is a word boundary with the preceding space, and "is" matches "is". The engine then tries the second "\b". The 15th character is a space, which forms a word boundary with "s", so the match succeeds. The engine "eagerly" returns this successful match.
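The walk-through can be confirmed with a short Python sketch. The patterns are assumed to be "\bis\b" and "\b4\b" as used in this section. Note that Python reports 0-based offsets, so the 13th and 14th characters appear as the span (12, 14).

```python
import re

text = "This island is beautiful"

# Only the standalone word "is" is matched, not the "is" in "This" or "island".
match = re.search(r"\bis\b", text)
span = match.span()

# "\b4\b" finds the lone 4 but not the digits inside 44.
fours = re.findall(r"\b4\b", "4 44")

# Without word boundaries the first hit is inside "This".
plain = re.search(r"is", text).span()

print(span)   # (12, 14)
print(fours)  # ['4']
print(plain)  # (2, 4)
```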
10. Alternation Operator
The "|" in the regular expression means alternation. You can use the alternation operator to match one of several possible regular expressions.
If you want to search for the text "cat" or "dog", you can use "cat|dog". If you want more options, just extend the list: "cat|dog|mouse|fish".
The alternation operator has the lowest precedence of all regex operators. That is, it tells the engine to match either everything to the left of the operator or everything to the right. You can use parentheses to limit the scope of the alternation, as in "\b(cat|dog)\b", which tells the engine to treat (cat|dog) as a single regex unit.
· Remember That the Regex Engine Is Eager
The regex engine is eager: as soon as it finds a valid match, it stops searching. Under certain conditions, therefore, the order of the alternatives affects the result. Suppose you want to search for a list of function names in a programming language: Get, GetValue, Set, or SetValue. An obvious solution is "Get|GetValue|Set|SetValue". Let's see what happens when we search "SetValue".
Both "Get" and "GetValue" fail, and then "Set" matches. Because a regex-directed engine is "eager", it returns this first successful match, "Set", and does not go on to look for a better one.
Contrary to what we expected, the regular expression does not match the whole string. There are several possible fixes. One is to take the engine's eagerness into account and change the order of the options, for example "GetValue|Get|SetValue|Set", so that the longer options are tried first. We can also fold the four options into two: "Get(Value)?|Set(Value)?". Because the question mark is greedy, SetValue is always tried before Set.
A better solution is to use word boundaries: "\b(Get|GetValue|Set|SetValue)\b" or "\b(Get(Value)?|Set(Value)?)\b". Going further, since all options share the same ending, we can optimize the regular expression to "\b(Get|Set)(Value)?\b".
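The effect of alternative order can be seen directly in Python. This is a sketch only: the patterns are assumed to be the ones named in this section, and the variable names are invented for the example.

```python
import re

subject = "SetValue"

# Eager engine: "Set" is the first alternative that succeeds.
naive = re.search(r"Get|GetValue|Set|SetValue", subject).group()

# Longer alternatives first.
reordered = re.search(r"GetValue|Get|SetValue|Set", subject).group()

# Greedy optional suffix.
optional = re.search(r"Get(Value)?|Set(Value)?", subject).group()

# Word boundaries force the engine to reject the partial match "Set".
bounded = re.search(r"\b(Get|GetValue|Set|SetValue)\b", subject).group()
compact = re.search(r"\b(Get|Set)(Value)?\b", subject).group()

print(naive)      # Set
print(reordered)  # SetValue
print(compact)    # SetValue
```

With the word boundaries in place, the order of the alternatives no longer matters, because a match that stops in the middle of a word is rejected and the engine tries the next option.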
11. Groups and Backreferences
By placing part of a regular expression inside parentheses, you form a group. You can then apply regex operators, such as repetition, to the whole group.
Note that only round brackets "()" form groups. Square brackets "[]" define character sets, and curly braces "{}" define repetition.
When a group is defined with "()", the engine numbers the matched groups in order and stores them. A matched group can then be referenced with "\number": "\1" refers to the first group, "\2" to the second, and so on, with "\n" referring to the nth group. "\0" refers to the whole regex match. Let's look at an example.
Suppose you want to match the opening and closing tags of an HTML element together with the text in between. For example, in "<B>This is a test</B>" we want to match "<B>", "</B>", and the text between them. We can use the following regular expression: "<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>".
First, "<" matches the first character "<" of "<B>". Then "[A-Z]" matches "B", "[A-Z0-9]*" matches zero or more alphanumeric characters, and "[^>]*" matches zero or more characters other than ">". The ">" of the regular expression then matches the ">" of "<B>". Next, ".*?" lazily matches the characters before the closing tag until a "</" is reached. Finally, "\1" refers back to what the group "([A-Z][A-Z0-9]*)" matched, which in this example is the tag name "B". So the closing tag that must match is "</B>".
You can reference the same group more than once: "([a-c])x\1x\1" matches "axaxa", "bxbxb", and "cxcxc". If a group referenced by number did not take part in the match, the reference is simply empty.
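Here is a hedged Python sketch of both backreference examples. The tag pattern is assumed to be "<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>" and the repeated-letter pattern "([a-c])x\1x\1"; the sample strings are invented.

```python
import re

tag = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>")

# "\1" must repeat whatever the first group captured (here the tag name "B").
found = tag.search("Some <B>This is a test</B> text")
whole, name = found.group(0), found.group(1)

# A closing tag with a different name cannot satisfy "\1".
mismatch = tag.search("<B>mismatched</I>")

# The same group referenced twice.
triple = re.compile(r"([a-c])x\1x\1")
checks = [bool(triple.fullmatch(s)) for s in ["axaxa", "bxbxb", "cxcxc", "axbxa"]]

print(whole)     # <B>This is a test</B>
print(name)      # B
print(mismatch)  # None
print(checks)    # [True, True, True, False]
```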
A backreference cannot be used inside the group it refers to: "([abc]\1)" is an error. For the same reason you cannot use "\0" inside the regular expression itself; it can only be used in replacement text.
Backreferences cannot be used inside a character set. In "(a)[\1b]", the "\1" is not a backreference. Inside a character set, "\1" may be interpreted as an octal escape.
Backreferences slow the engine down because it has to store the matched groups. If you do not need a backreference, you can tell the engine not to store a group, for example "(?:Value)". The "?:" immediately after the "(" tells the engine not to store what the group (Value) matched for later backreference.
· Quantifier Operations and Backreferences
When a repetition operator is applied to a group, the stored backreference is overwritten on each repetition, so only the last match is kept. For example, "([abc]+)=\1" matches "cab=cab", but "([abc])+=\1" does not. When "([abc])" first matches "c", "\1" holds "c"; the group then goes on to match "a" and "b", so in the end "\1" holds "b". The second regex therefore matches "cab=b".
Application: checking for doubled words. When editing text it is easy to type a word twice, such as "the the". "\b(\w+)\s+\1\b" detects these doubled words. To remove the second word, simply use the replace function and replace the match with "\1".
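The difference between repeating inside and outside the group, and the doubled-word fix, can be sketched in Python as follows. The patterns are assumed to be "([abc]+)=\1", "([abc])+=\1", and "\b(\w+)\s+\1\b"; the sample sentence is invented.

```python
import re

whole = re.compile(r"([abc]+)=\1")   # group captures the full run "cab"
last = re.compile(r"([abc])+=\1")    # group is overwritten; ends up holding "b"

whole_ok = whole.fullmatch("cab=cab") is not None
last_fails = last.fullmatch("cab=cab") is None
last_capture = last.fullmatch("cab=b").group(1)

# Doubled-word detection: replace "word word" with a single "word".
doubled = re.compile(r"\b(\w+)\s+\1\b")
fixed = doubled.sub(r"\1", "Paris in the the spring")

print(whole_ok, last_fails, last_capture)  # True True b
print(fixed)                               # Paris in the spring
```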
· Naming and Referencing of Groups
In PHP and Python, you can name a group with "(?P<name>group)". Here the "?P<name>" syntax names the group, where name is the name you choose. You can reference the group with "(?P=name)".
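A minimal Python sketch of the named-group syntax follows. The group name word and the sample sentence are assumptions made for the example; in replacement text Python refers to a named group as "\g<name>".

```python
import re

# (?P<word>...) names the group; (?P=word) refers back to it inside the pattern.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

match = doubled.search("it was the the best")
captured = match.group("word")

# In the replacement string, \g<word> inserts the named group's text.
fixed = doubled.sub(r"\g<word>", "it was the the best")

print(captured)  # the
print(fixed)     # it was the best
```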
Named Groups in .NET
The .NET Framework also supports named groups. Unfortunately, the Microsoft programmers decided to invent their own syntax rather than follow the Perl and Python convention. At the time of writing, no other regex implementation supported the syntax Microsoft invented.
Here is an example in .NET:
(?<first>group)(?'second'group)
As you can see, .NET offers two syntaxes for creating a named group: one with angle brackets "<>" and one with single quotes "''". Angle brackets are more convenient inside strings, while single quotes are more useful in ASP code, where "<>" delimits HTML tags.
To reference a named group, use \k<name> or \k'name'.
When performing search and replace, you can use "${name}" to reference a named group.
12. Matching Modes of Regular Expressions
The regular expression engines discussed in this tutorial all support three matching modes:
"/i" makes the regular expression case-insensitive,
"/s" enables "single-line mode", in which the dot "." also matches newline characters,
"/m" enables "multi-line mode", in which "^" and "$" also match after and before each line break, respectively.
· Turning Modes On or Off Inside the Regular Expression
If you insert a modifier such as "(?ism)" inside the regular expression, it only affects the part of the regular expression to its right. "(?-i)" turns case insensitivity off. You can test this quickly: "(?i)te(?-i)st" should match "TEst" but not "teST" or "TEST".
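Python's re module does not accept a mode switch such as "(?-i)" in the middle of a pattern, so the sketch below uses the scoped form "(?i:...)" (available since Python 3.6) to get the same effect as the example above: "te" is matched case-insensitively, "st" is not. The sample inputs are assumptions.

```python
import re

# Only the part inside (?i:...) ignores case; "st" must be lowercase.
pattern = re.compile(r"(?i:te)st")

samples = ["TEst", "test", "teST", "TEST"]
results = [bool(pattern.fullmatch(s)) for s in samples]

print(dict(zip(samples, results)))
```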
13. Atomic Groups and Preventing Backtracking
In some special cases, backtracking can make the engine extremely inefficient.
Let's look at an example: we want to match a string whose fields are separated by commas and whose 12th field starts with P.
A regular expression that readily comes to mind is "^(.*?,){11}P". It works well under normal conditions. But in the extreme case where the 12th field does not start with P, catastrophic backtracking occurs. Suppose the string to be searched is "1,2,3,4,5,6,7,8,9,10,11,12,13". At first the regular expression matches successfully up to the 12th field. At that point the preceding part of the regular expression has consumed "1,2,3,4,5,6,7,8,9,10,11,", and the next token "P" does not match the "1" of "12". So the engine backtracks, and the consumed string becomes "1,2,3,4,5,6,7,8,9,10,11". The matching process continues: the next token is the dot ".", which can match the following comma ",". However, the "," in the regular expression then does not match the "1" of "12". The match fails and backtracking continues. As you can imagine, the number of such backtracking combinations is very large, so the engine may even crash.
There are several solutions to prevent such huge backtracking:
One simple solution is to make the match as exact as possible by replacing the dot with a negated character set. For example, with the regular expression "^([^,\r\n]*,){11}P", the number of backtracking steps on failure drops to 11.
Another solution is to use atomic groups.
The purpose of an atomic group is to make the engine fail faster, which effectively prevents massive backtracking. The syntax of an atomic group is "(?>regex)". Everything between "(?>" and ")" is treated as a single regex token. Once matching fails after the group, the engine backtracks straight to the part of the regular expression that precedes the atomic group. With an atomic group, the previous example can be written as "^(?>(.*?,){11})P". As soon as the 12th field fails to match, the engine backtracks to the "^" in front of the atomic group.
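The following Python sketch compares the lazy-dot pattern with the negated-character-set pattern. The patterns are assumed to be "^(.*?,){11}P" and "^([^,\r\n]*,){11}P", and the sample strings are invented. Atomic groups are left out because Python's re module only supports "(?>...)" from version 3.11 on. Besides being slower on failure, the dot version can also stretch across commas and accept a P that is not in the 12th field.

```python
import re

lazy_dot = re.compile(r"^(.*?,){11}P")
negated = re.compile(r"^([^,\r\n]*,){11}P")

good = "1,2,3,4,5,6,7,8,9,10,11,P12,13"   # 12th field starts with P
bad = "1,2,3,4,5,6,7,8,9,10,11,12,13"     # no field starts with P
late = "1,2,3,4,5,6,7,8,9,10,11,12,P13"   # 13th field starts with P

# Both patterns accept the good string and reject the bad one.
both_good = bool(lazy_dot.search(good)) and bool(negated.search(good))
both_bad = lazy_dot.search(bad) is None and negated.search(bad) is None

# After backtracking, one ".*?," swallows "11,12,", so the dot version
# wrongly accepts the P in the 13th field. The negated set cannot do that.
dot_accepts_late = bool(lazy_dot.search(late))
negated_accepts_late = bool(negated.search(late))

print(both_good, both_bad, dot_accepts_late, negated_accepts_late)
```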
14. Lookahead and Lookbehind
Perl 5 introduced two powerful regex constructs: "lookahead" and "lookbehind". Together they are also called "zero-length assertions". Like anchors, they are zero-length (zero-length means the construct does not consume any of the string being matched). The difference is that lookahead and lookbehind actually match characters, but then give up the match and return only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string; they only assert whether a match is possible.
Almost all the regex implementations discussed in this article support both lookahead and lookbehind. The one exception at the time of writing is JavaScript, which supports only lookahead.
· Positive and Negative Lookahead
Take an example we mentioned earlier: finding a q that is not followed by a u. That is, either nothing follows the q, or the character that follows is not u. A solution using negative lookahead is "q(?!u)". The syntax of negative lookahead is "(?!regex)".
Positive lookahead is very similar: "(?=regex)".
If the regex inside the lookahead contains a group, that group creates a backreference as usual. The lookahead itself, however, does not create a backreference and is not counted in the backreference numbering, because the lookahead match is discarded and only the match/no-match result is kept. If you want to keep what the lookahead matched, put a capturing group inside it: "(?=(regex))".
· Positive and Negative Lookbehind
Lookbehind has the same effect as lookahead, but works in the opposite direction.
The syntax of negative lookbehind is "(?<!regex)".
The syntax of positive lookbehind is "(?<=regex)".
As you can see, compared with lookahead there is an extra left angle bracket that indicates the direction.
Example: "(?<!a)b" matches a "b" that is not preceded by an "a".
Note that lookahead matches its regex starting from the current position in the string, while lookbehind first steps back one character from the current position and then matches its regex from there.
· Delving into the Inside of the Regular Expression Engine
Let's look at a simple example.
Apply the regular expression "q(?!u)" to the string "Iraq". The first token of the regex is the literal "q". As we know, the engine walks through the string until it can match "q". Once the fourth character "q" has been matched, what follows it is the void after the end of the string. The next token is the lookahead, and the engine notes that it is now inside a lookahead. The next token is "u", which does not match the void, so the regex inside the lookahead fails. Because this is a negative lookahead, that means the lookahead as a whole succeeds. So the match "q" is returned.
Now apply the same regular expression to "quit". "q" matches "q". The next token is the "u" inside the lookahead, which matches the second character "u" in the string. The engine advances to the next character, "i". At that point it notes that the lookahead has been fully processed and that the regex inside it matched. The engine then discards the matched part, which moves it back to the character "u".
Because the lookahead is negative, a successful match inside it makes the lookahead as a whole fail, so the engine has to backtrack. In the end, since there is no other "q" for the token "q" to match, the overall match fails.
To make sure you understand how lookahead works, let's apply "q(?=u)i" to "quit". "q" first matches "q". Then the lookahead successfully matches "u"; the matched part is discarded and only the result, a match, is kept. The engine steps back from the character "i" to "u". Since the lookahead succeeded, the engine goes on to the next token, "i". But "i" does not match "u", so this attempt fails. Since there is no other "q" later in the string, the whole regular expression fails to match.
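The lookahead cases traced above can be reproduced with a short Python sketch. The patterns are assumed to be "q(?!u)" and "q(?=u)i"; the last two lines are an extra illustration that the lookahead consumes nothing and that a capturing group inside it still captures.

```python
import re

# Negative lookahead: q not followed by u.
iraq = re.search(r"q(?!u)", "Iraq")
quit_negative = re.search(r"q(?!u)", "quit")

# Positive lookahead does not consume "u", so the next token sees "u", not "i".
quit_positive = re.search(r"q(?=u)i", "quit")

# After the lookahead the engine is still positioned at "u".
still_at_u = re.search(r"q(?=u)u", "quit").group()

# A capturing group inside the lookahead keeps what it matched.
captured = re.search(r"q(?=(u))", "quit")

print(iraq.span())     # (3, 4)
print(quit_negative)   # None
print(quit_positive)   # None
print(still_at_u)      # qu
print(captured.group(0), captured.group(1))  # q u
```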
· Further Understanding of the Internal Mechanism of the Regular Expression Engine
Let's apply "(?<=a)b" to "thingamabob". The engine starts with the lookbehind and the first character of the string. In this example the lookbehind tells the engine to step back one character and check whether an "a" can be matched there. There is no character before "t", so the engine cannot step back and the lookbehind fails. The engine moves on to the next character, "h". Again it temporarily steps back one character and checks for an "a". It finds a "t", so the lookbehind fails again.
The lookbehind keeps failing until the engine reaches the "m" in the string, where the positive lookbehind matches. Because it is zero-length, the current position in the string is still at "m". The next token is "b", which does not match "m". The next character is the second "a" in the string. The engine temporarily steps back one character and finds that "a" does not match "m".
The next character is the first "b" in the string. The engine temporarily steps back one character, finds that the lookbehind is satisfied, and "b" matches "b". So the whole regular expression matches, and the result is the first "b" in the string.
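A brief Python check of the lookbehind example, assuming the pattern is "(?<=a)b" and using 0-based offsets ("thingamabob" has its first "b" at offset 8 and its second at offset 10):

```python
import re

word = "thingamabob"

# Positive lookbehind: a "b" that is preceded by "a".
after_a = re.search(r"(?<=a)b", word)
first_b_offset = after_a.start()

# Negative lookbehind: every "b" that is NOT preceded by "a".
not_after_a = [m.start() for m in re.finditer(r"(?<!a)b", word)]

print(first_b_offset)  # 8
print(not_after_a)     # [10]
```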
· Applications of Lookaheads and Lookbehinds
Consider this example: find a word that has 6 characters and contains "cat".
First, we can solve it without lookaheads or lookbehinds, for example: \b(?:cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat)\b.
Simple enough! But when the requirement becomes finding a word of 6 to 12 characters that contains "cat", "dog", or "mouse", this approach becomes clumsy.
Now the solution using a lookahead. In this example there are two basic requirements: the word must have 6 characters, and the word must contain "cat".
The regular expression for the first requirement is \b\w{6}\b. The regular expression for the second requirement is \b\w*cat\w*\b.
Combining the two gives the following regular expression:
(?=\b\w{6}\b)\b\w*cat\w*\b
The detailed matching process is left to the reader. One thing to note: a lookahead does not consume characters, so once it has confirmed that the word has 6 characters, the engine continues matching the rest of the expression from the position where the lookahead started.
Finally, after some optimization, the 6-to-12-character requirement with "cat", "dog", or "mouse" can be met with: \b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*
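A minimal sketch of the lookahead approach, assuming Python's re module. The patterns are one possible way to write "check the length in a lookahead, then check the content"; the sample sentences and variable names are made up for illustration.

```python
import re

# The lookahead checks the word length; the rest of the pattern checks the content.
six_with_cat = re.compile(r"\b(?=\w{6}\b)\w*cat\w*")

# Extended requirement: 6 to 12 characters, containing cat, dog, or mouse.
# (?:...) is used so that findall returns whole words instead of the group.
long_with_animal = re.compile(r"\b(?=\w{6,12}\b)\w{0,9}(?:cat|dog|mouse)\w*")

words_a = six_with_cat.findall("bobcat catnip category concat cat")
words_b = long_with_animal.findall("bulldog mousetrap cat doghouse")
```

Because the lookahead consumes nothing, the part after it starts again at the beginning of the word.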
15. Conditional Tests in Regular Expressions
The syntax of a conditional test is (?(condition)then|else). The "if" part can be a lookahead or lookbehind expression. With a lookahead, the syntax becomes (?(?=regex)then|else).
If the if part is true, the regex engine tries to match the then part; otherwise it tries to match the else part.
Remember that lookaheads and lookbehinds do not consume any characters, so matching of the then or else part starts from the position where the if test started.
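Support for conditionals varies by engine. As an illustration only: Python's re module does not accept a lookahead as the condition. It supports the form (?(1)then|else), which tests whether capture group 1 took part in the match. The sketch below shows that variant; the pattern and sample strings are assumptions made up for this example.

```python
import re

# If group 1 (an opening parenthesis) matched, require a closing parenthesis;
# otherwise require nothing (the else branch is left empty).
number = re.compile(r"^(\()?\d+(?(1)\))$")

samples = ["123", "(123)", "(123", "123)"]
results = {s: bool(number.match(s)) for s in samples}
```

Only the balanced forms "123" and "(123)" are accepted.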
16. Adding Comments to Regular Expressions
The syntax for adding a comment to a regular expression is (?#comment).
Example: adding comments to a regular expression that matches valid dates:
(?#year)(19|20)\d\d[- /.](?#month)(0[1-9]|1[012])[- /.](?#day)(0[1-9]|[12][0-9]|3[01])
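A short sketch of commented patterns, assuming Python's re module, which accepts (?#comment). The date pattern here is written with separator classes such as [- /.], which is an assumption about the intended expression. The second form uses re.VERBOSE, a Python-specific alternative in which "#" starts a comment.

```python
import re

# Inline comments written as (?#...) are ignored by the engine.
commented = re.compile(
    r"(?#year)(19|20)\d\d[- /.](?#month)(0[1-9]|1[012])[- /.](?#day)(0[1-9]|[12][0-9]|3[01])"
)

# The same pattern in verbose mode: whitespace outside character classes
# is ignored and "#" starts a comment.
verbose = re.compile(
    r"""
    (19|20)\d\d                 # year
    [- /.]                      # separator
    (0[1-9]|1[012])             # month
    [- /.]                      # separator
    (0[1-9]|[12][0-9]|3[01])    # day
    """,
    re.VERBOSE,
)

sample = "2006-10-26"
parts = commented.fullmatch(sample).groups()
same_groups = parts == verbose.fullmatch(sample).groups()
bad_month = commented.fullmatch("2006-13-26")
```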
[Last edited by 无奈何 on 2006-10-26 at 11:47 AM]
|

|
|
2006-10-26 11:43 |
|
|
无奈何
荣誉版主
      
积分 1338
发帖 356
注册 2005-7-15
状态 离线
|
『第 5 楼』:
揭开正则表达式的神秘面纱
转贴注:转贴后引用链接有错误,如需要请到原始网页查看。
http://www.regexlab.com/zh/regref.htm
引言
正则表达式(regular expression)描述了一种字符串匹配的模式,可以用来:(1)检查一个串中是否含有符合某个规则的子串,并且可以得到这个子串;(2)根据匹配规则对字符串进行灵活的替换操作。
正则表达式学习起来其实是很简单的,不多的几个较为抽象的概念也很容易理解。之所以很多人感觉正则表达式比较复杂,一方面是因为大多数的文档没有做到由浅入深地讲解,概念上没有注意先后顺序,给读者的理解带来困难;另一方面,各种引擎自带的文档一般都要介绍它特有的功能,然而这部分特有的功能并不是我们首先要理解的。
文章中的每一个举例,都可以点击进入到测试页面进行测试。闲话少说,开始。
--------------------------------------------------------------------------------
1. 正则表达式规则
1.1 普通字符
字母、数字、汉字、下划线、以及后边章节中没有特殊定义的标点符号,都是"普通字符"。表达式中的普通字符,在匹配一个字符串的时候,匹配与之相同的一个字符。
举例1:表达式 "c",在匹配字符串 "abcde" 时,匹配结果是:成功;匹配到的内容是:"c";匹配到的位置是:开始于2,结束于3。(注:下标从0开始还是从1开始,因当前编程语言的不同而可能不同)
举例2:表达式 "bcd",在匹配字符串 "abcde" 时,匹配结果是:成功;匹配到的内容是:"bcd";匹配到的位置是:开始于1,结束于4。
--------------------------------------------------------------------------------
1.2 简单的转义字符
一些不便书写的字符,采用在前面加 "\" 的方法。这些字符其实我们都已经熟知了。
表达式
可匹配
\r, \n
代表回车和换行符
\t
制表符
\\
代表 "\" 本身
还有其他一些在后边章节中有特殊用处的标点符号,在前面加 "\" 后,就代表该符号本身。比如:^, $ 都有特殊意义,如果要想匹配字符串中 "^" 和 "$" 字符,则表达式就需要写成 "\^" 和 "\$"。
表达式
可匹配
\^
匹配 ^ 符号本身
\$
匹配 $ 符号本身
\.
匹配小数点(.)本身
这些转义字符的匹配方法与 "普通字符" 是类似的。也是匹配与之相同的一个字符。
举例1:表达式 "\$d",在匹配字符串 "abc$de" 时,匹配结果是:成功;匹配到的内容是:"$d";匹配到的位置是:开始于3,结束于5。
--------------------------------------------------------------------------------
1.3 能够与 '多种字符' 匹配的表达式
正则表达式中的一些表示方法,可以匹配 '多种字符' 其中的任意一个字符。比如,表达式 "\d" 可以匹配任意一个数字。虽然可以匹配其中任意字符,但是只能是一个,不是多个。这就好比玩扑克牌时候,大小王可以代替任意一张牌,但是只能代替一张牌。
表达式
可匹配
\d
任意一个数字,0~9 中的任意一个
\w
任意一个字母或数字或下划线,也就是 A~Z,a~z,0~9,_ 中任意一个
\s
包括空格、制表符、换页符等空白字符的其中任意一个
.
小数点可以匹配除了换行符(\n)以外的任意一个字符
举例1:表达式 "\d\d",在匹配 "abc123" 时,匹配的结果是:成功;匹配到的内容是:"12";匹配到的位置是:开始于3,结束于5。
举例2:表达式 "a.\d",在匹配 "aaa100" 时,匹配的结果是:成功;匹配到的内容是:"aa1";匹配到的位置是:开始于1,结束于4。
--------------------------------------------------------------------------------
1.4 自定义能够匹配 '多种字符' 的表达式
使用方括号 [ ] 包含一系列字符,能够匹配其中任意一个字符。用 [^ ] 包含一系列字符,则能够匹配其中字符之外的任意一个字符。同样的道理,虽然可以匹配其中任意一个,但是只能是一个,不是多个。
表达式
可匹配
[ab5@]
匹配 "a" 或 "b" 或 "5" 或 "@"
[^abc]
匹配 "a","b","c" 之外的任意一个字符
[f-k]
匹配 "f"~"k" 之间的任意一个字母
[^A-F0-3]
匹配 "A"~"F","0"~"3" 之外的任意一个字符
举例1:表达式 "[bcd][bcd]" 匹配 "abc123" 时,匹配的结果是:成功;匹配到的内容是:"bc";匹配到的位置是:开始于1,结束于3。
举例2:表达式 "[^abc]" 匹配 "abc123" 时,匹配的结果是:成功;匹配到的内容是:"1";匹配到的位置是:开始于3,结束于4。
--------------------------------------------------------------------------------
1.5 修饰匹配次数的特殊符号
前面章节中讲到的表达式,无论是只能匹配一种字符的表达式,还是可以匹配多种字符其中任意一个的表达式,都只能匹配一次。如果使用表达式再加上修饰匹配次数的特殊符号,那么不用重复书写表达式就可以重复匹配。
使用方法是:"次数修饰"放在"被修饰的表达式"后边。比如:"[bcd][bcd]" 可以写成 "[bcd]{2}"。
表达式
作用
{n}
表达式重复n次,比如:"\w{2}" 相当于 "\w\w";"a{5}" 相当于 "aaaaa"
{m,n}
表达式至少重复m次,最多重复n次,比如:"ba{1,3}"可以匹配 "ba"或"baa"或"baaa"
{m,}
表达式至少重复m次,比如:"\w\d{2,}"可以匹配 "a12","_456","M12344"...
?
匹配表达式0次或者1次,相当于 {0,1},比如:"a[cd]?"可以匹配 "a","ac","ad"
+
表达式至少出现1次,相当于 {1,},比如:"a+b"可以匹配 "ab","aab","aaab"...
*
表达式不出现或出现任意次,相当于 {0,},比如:"\^*b"可以匹配 "b","^^^b"...
举例1:表达式 "\d+\.?\d*" 在匹配 "It costs $12.5" 时,匹配的结果是:成功;匹配到的内容是:"12.5";匹配到的位置是:开始于10,结束于14。
举例2:表达式 "go{2,8}gle" 在匹配 "Ads by goooooogle" 时,匹配的结果是:成功;匹配到的内容是:"goooooogle";匹配到的位置是:开始于7,结束于17。
--------------------------------------------------------------------------------
1.6 其他一些代表抽象意义的特殊符号
一些符号在表达式中代表抽象的特殊意义:
表达式
作用
^
与字符串开始的地方匹配,不匹配任何字符
$
与字符串结束的地方匹配,不匹配任何字符
\b
匹配一个单词边界,也就是单词和空格之间的位置,不匹配任何字符
进一步的文字说明仍然比较抽象,因此,举例帮助大家理解。
举例1:表达式 "^aaa" 在匹配 "xxx aaa xxx" 时,匹配结果是:失败。因为 "^" 要求与字符串开始的地方匹配,因此,只有当 "aaa" 位于字符串的开头的时候,"^aaa" 才能匹配,比如:"aaa xxx xxx"。
举例2:表达式 "aaa$" 在匹配 "xxx aaa xxx" 时,匹配结果是:失败。因为 "$" 要求与字符串结束的地方匹配,因此,只有当 "aaa" 位于字符串的结尾的时候,"aaa$" 才能匹配,比如:"xxx xxx aaa"。
举例3:表达式 ".\b." 在匹配 "@@@abc" 时,匹配结果是:成功;匹配到的内容是:"@a";匹配到的位置是:开始于2,结束于4。
进一步说明:"\b" 与 "^" 和 "$" 类似,本身不匹配任何字符,但是它要求它在匹配结果中所处位置的左右两边,其中一边是 "\w" 范围,另一边是 非"\w" 的范围。
举例4:表达式 "\bend\b" 在匹配 "weekend,endfor,end" 时,匹配结果是:成功;匹配到的内容是:"end";匹配到的位置是:开始于15,结束于18。
一些符号可以影响表达式内部的子表达式之间的关系:
表达式
作用
|
左右两边表达式之间 "或" 关系,匹配左边或者右边
( )
(1). 在被修饰匹配次数的时候,括号中的表达式可以作为整体被修饰
(2). 取匹配结果的时候,括号中的表达式匹配到的内容可以被单独得到
举例5:表达式 "Tom|Jack" 在匹配字符串 "I'm Tom, he is Jack" 时,匹配结果是:成功;匹配到的内容是:"Tom";匹配到的位置是:开始于4,结束于7。匹配下一个时,匹配结果是:成功;匹配到的内容是:"Jack";匹配到的位置时:开始于15,结束于19。
举例6:表达式 "(go\s*)+" 在匹配 "Let's go go go!" 时,匹配结果是:成功;匹配到内容是:"go go go";匹配到的位置是:开始于6,结束于14。
举例7:表达式 "¥(\d+\.?\d*)" 在匹配 "$10.9,¥20.5" 时,匹配的结果是:成功;匹配到的内容是:"¥20.5";匹配到的位置是:开始于6,结束于11。单独获取括号范围匹配到的内容是:"20.5"。
--------------------------------------------------------------------------------
2. 正则表达式中的一些高级规则
2.1 匹配次数中的贪婪与非贪婪
在使用修饰匹配次数的特殊符号时,有几种表示方法可以使同一个表达式能够匹配不同的次数,比如:"{m,n}", "{m,}", "?", "*", "+",具体匹配的次数随被匹配的字符串而定。这种重复匹配不定次数的表达式在匹配过程中,总是尽可能多的匹配。比如,针对文本 "dxxxdxxxd",举例如下:
表达式
匹配结果
(d)(\w+)
"\w+" 将匹配第一个 "d" 之后的所有字符 "xxxdxxxd"
(d)(\w+)(d)
"\w+" 将匹配第一个 "d" 和最后一个 "d" 之间的所有字符 "xxxdxxx"。虽然 "\w+" 也能够匹配上最后一个 "d",但是为了使整个表达式匹配成功,"\w+" 可以 "让出" 它本来能够匹配的最后一个 "d"
由此可见,"\w+" 在匹配的时候,总是尽可能多的匹配符合它规则的字符。虽然第二个举例中,它没有匹配最后一个 "d",但那也是为了让整个表达式能够匹配成功。同理,带 "*" 和 "{m,n}" 的表达式都是尽可能地多匹配,带 "?" 的表达式在可匹配可不匹配的时候,也是尽可能的 "要匹配"。这 种匹配原则就叫作 "贪婪" 模式 。
非贪婪模式:
在修饰匹配次数的特殊符号后再加上一个 "?" 号,则可以使匹配次数不定的表达式尽可能少的匹配,使可匹配可不匹配的表达式,尽可能的 "不匹配"。这种匹配原则叫作 "非贪婪" 模式,也叫作 "勉强" 模式。如果少匹配就会导致整个表达式匹配失败的时候,与贪婪模式类似,非贪婪模式会最小限度的再匹配一些,以使整个表达式匹配成功。举例如下,针对文本 "dxxxdxxxd" 举例:
表达式
匹配结果
(d)(\w+?)
"\w+?" 将尽可能少的匹配第一个 "d" 之后的字符,结果是:"\w+?" 只匹配了一个 "x"
(d)(\w+?)(d)
为了让整个表达式匹配成功,"\w+?" 不得不匹配 "xxx" 才可以让后边的 "d" 匹配,从而使整个表达式匹配成功。因此,结果是:"\w+?" 匹配 "xxx"
更多的情况,举例如下:
举例1:表达式 "<td>(.*)</td>" 与字符串 "<td><p>aa</p></td> <td><p>bb</p></td>" 匹配时,匹配的结果是:成功;匹配到的内容是 "<td><p>aa</p></td> <td><p>bb</p></td>" 整个字符串, 表达式中的 "</td>" 将与字符串中最后一个 "</td>" 匹配。
举例2:相比之下,表达式 "<td>(.*?)</td>" 匹配举例1中同样的字符串时,将只得到 "<td><p>aa</p></td>", 再次匹配下一个时,可以得到第二个 "<td><p>bb</p></td>"。
--------------------------------------------------------------------------------
2.2 反向引用 \1, \2...
表达式在匹配时,表达式引擎会将小括号 "( )" 包含的表达式所匹配到的字符串记录下来。在获取匹配结果的时候,小括号包含的表达式所匹配到的字符串可以单独获取。这一点,在前面的举例中,已经多次展示了。在实际应用场合中,当用某种边界来查找,而所要获取的内容又不包含边界时,必须使用小括号来指定所要的范围。比如前面的 "<td>(.*?)</td>"。
其实,"小括号包含的表达式所匹配到的字符串" 不仅是在匹配结束后才可以使用,在匹配过程中也可以使用。表达式后边的部分,可以引用前面 "括号内的子匹配已经匹配到的字符串"。引用方法是 "\" 加上一个数字。"\1" 引用第1对括号内匹配到的字符串,"\2" 引用第2对括号内匹配到的字符串……以此类推,如果一对括号内包含另一对括号,则外层的括号先排序号。换句话说,哪一对的左括号 "(" 在前,那这一对就先排序号。
举例如下:
举例1:表达式 "('|")(.*?)(\1)" 在匹配 " 'Hello', "World" " 时,匹配结果是:成功;匹配到的内容是:" 'Hello' "。再次匹配下一个时,可以匹配到 " "World" "。
举例2:表达式 "(\w)\1{4,}" 在匹配 "aa bbbb abcdefg ccccc 111121111 999999999" 时,匹配结果是:成功;匹配到的内容是 "ccccc"。再次匹配下一个时,将得到 999999999。这个表达式要求 "\w" 范围的字符至少重复5次,注意与 "\w{5,}" 之间的区别。
举例3:表达式 "<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>" 在匹配 "<td id='td1' style="bgcolor:white"></td>" 时,匹配结果是成功。如果 "<td>" 与 "</td>" 不配对,则会匹配失败;如果改成其他配对,也可以匹配成功。
--------------------------------------------------------------------------------
2.3 预搜索,不匹配;反向预搜索,不匹配
前面的章节中,我讲到了几个代表抽象意义的特殊符号:"^","$","\b"。它们都有一个共同点,那就是:它们本身不匹配任何字符,只是对 "字符串的两头" 或者 "字符之间的缝隙" 附加了一个条件。理解到这个概念以后,本节将继续介绍另外一种对 "两头" 或者 "缝隙" 附加条件的,更加灵活的表示方法。
正向预搜索:"(?=xxxxx)","(?!xxxxx)"
格式:"(?=xxxxx)",在被匹配的字符串中,它对所处的 "缝隙" 或者 "两头" 附加的条件是:所在缝隙的右侧,必须能够匹配上 xxxxx 这部分的表达式。因为它只是在此作为这个缝隙上附加的条件,所以它并不影响后边的表达式去真正匹配这个缝隙之后的字符。这就类似 "\b",本身不匹配任何字符。"\b" 只是将所在缝隙之前、之后的字符取来进行了一下判断,不会影响后边的表达式来真正的匹配。
举例1:表达式 "Windows (?=NT|XP)" 在匹配 "Windows 98, Windows NT, Windows 2000" 时,将只匹配 "Windows NT" 中的 "Windows ",其他的 "Windows " 字样则不被匹配。
举例2:表达式 "(\w)((?=\1\1\1)(\1))+" 在匹配字符串 "aaa ffffff 999999999" 时,将可以匹配6个"f"的前4个,可以匹配9个"9"的前7个。这个表达式可以读解成:重复4次以上的字母数字,则匹配其剩下最后2位之前的部分。当然,这个表达式可以不这样写,在此的目的是作为演示之用。
格式:"(?!xxxxx)",所在缝隙的右侧,必须不能匹配 xxxxx 这部分表达式。
举例3:表达式 "((?!\bstop\b).)+" 在匹配 "fdjka ljfdl stop fjdsla fdj" 时,将从头一直匹配到 "stop" 之前的位置,如果字符串中没有 "stop",则匹配整个字符串。
举例4:表达式 "do(?!\w)" 在匹配字符串 "done, do, dog" 时,只能匹配 "do"。在本条举例中,"do" 后边使用 "(?!\w)" 和使用 "\b" 效果是一样的。
反向预搜索:"(?<=xxxxx)","(?<!xxxxx)"
这两种格式的概念和正向预搜索是类似的,反向预搜索要求的条件是:所在缝隙的 "左侧",两种格式分别要求必须能够匹配和必须不能够匹配指定表达式,而不是去判断右侧。与 "正向预搜索" 一样的是:它们都是对所在缝隙的一种附加条件,本身都不匹配任何字符。
举例5:表达式 "(?<=\d{4})\d+(?=\d{4})" 在匹配 "1234567890123456" 时,将匹配除了前4个数字和后4个数字之外的中间8个数字。由于 JScript.RegExp 不支持反向预搜索,因此,本条举例不能够进行演示。很多其他的引擎可以支持反向预搜索,比如:Java 1.4 以上的 java.util.regex 包,.NET 中System.Text.RegularExpressions 命名空间,以及本站推荐的最简单易用的 DEELX 正则引擎。
--------------------------------------------------------------------------------
3. 其他通用规则
还有一些在各个正则表达式引擎之间比较通用的规则,在前面的讲解过程中没有提到。
3.1 表达式中,可以使用 "\xXX" 和 "\uXXXX" 表示一个字符("X" 表示一个十六进制数)
形式
字符范围
\xXX
编号在 0 ~ 255 范围的字符,比如:空格可以使用 "\x20" 表示
\uXXXX
任何字符可以使用 "\u" 再加上其编号的4位十六进制数表示,比如:"\u4E2D"
3.2 在表达式 "\s","\d","\w","\b" 表示特殊意义的同时,对应的大写字母表示相反的意义
表达式
可匹配
\S
匹配所有非空白字符("\s" 可匹配各个空白字符)
\D
匹配所有的非数字字符
\W
匹配所有的字母、数字、下划线以外的字符
\B
匹配非单词边界,即左右两边都是 "\w" 范围或者左右两边都不是 "\w" 范围时的字符缝隙
3.3 在表达式中有特殊意义,需要添加 "\" 才能匹配该字符本身的字符汇总
字符
说明
^
匹配输入字符串的开始位置。要匹配 "^" 字符本身,请使用 "\^"
$
匹配输入字符串的结尾位置。要匹配 "$" 字符本身,请使用 "\$"
( )
标记一个子表达式的开始和结束位置。要匹配小括号,请使用 "\(" 和 "\)"
[ ]
用来自定义能够匹配 '多种字符' 的表达式。要匹配中括号,请使用 "\[" 和 "\]"
{ }
修饰匹配次数的符号。要匹配大括号,请使用 "\{" 和 "\}"
.
匹配除了换行符(\n)以外的任意一个字符。要匹配小数点本身,请使用 "\."
?
修饰匹配次数为 0 次或 1 次。要匹配 "?" 字符本身,请使用 "\?"
+
修饰匹配次数为至少 1 次。要匹配 "+" 字符本身,请使用 "\+"
*
修饰匹配次数为 0 次或任意次。要匹配 "*" 字符本身,请使用 "\*"
|
左右两边表达式之间 "或" 关系。匹配 "|" 本身,请使用 "\|"
3.4 括号 "( )" 内的子表达式,如果希望匹配结果不进行记录供以后使用,可以使用 "(?:xxxxx)" 格式
举例1:表达式 "(?:(\w)\1)+" 匹配 "a bbccdd efg" 时,结果是 "bbccdd"。括号 "(?:)" 范围的匹配结果不进行记录,因此 "(\w)" 使用 "\1" 来引用。
3.5 常用的表达式属性设置简介:Ignorecase,Singleline,Multiline,Global
表达式属性
说明
Ignorecase
默认情况下,表达式中的字母是要区分大小写的。配置为 Ignorecase 可使匹配时不区分大小写。有的表达式引擎,把 "大小写" 概念延伸至 UNICODE 范围的大小写。
Singleline
默认情况下,小数点 "." 匹配除了换行符(\n)以外的字符。配置为 Singleline 可使小数点可匹配包括换行符在内的所有字符。
Multiline
默认情况下,表达式 "^" 和 "$" 只匹配字符串的开始 ① 和结尾 ④ 位置。如:
①xxxxxxxxx②\n
③xxxxxxxxx④
配置为 Multiline 可以使 "^" 匹配 ① 外,还可以匹配换行符之后,下一行开始前 ③ 的位置,使 "$" 匹配 ④ 外,还可以匹配换行符之前,一行结束 ② 的位置。
Global
主要在将表达式用来替换时起作用,配置为 Global 表示替换所有的匹配。
--------------------------------------------------------------------------------
4. 其他提示
4.1 如果想要了解高级的正则引擎还支持那些复杂的正则语法,可参见本站 DEELX 正则引擎的说明文档。
4.2 如果要要求表达式所匹配的内容是整个字符串,而不是从字符串中找一部分,那么可以在表达式的首尾使用 "^" 和 "$",比如:"^\d+$" 要求整个字符串只有数字。
4.3 如果要求匹配的内容是一个完整的单词,而不会是单词的一部分,那么在表达式首尾使用 "\b",比如:使用 "\b(if|while|else|void|int……)\b" 来匹配程序中的关键字。
4.4 表达式不要匹配空字符串。否则会一直得到匹配成功,而结果什么都没有匹配到。比如:准备写一个匹配 "123"、"123."、"123.5"、".5" 这几种形式的表达式时,整数、小数点、小数数字都可以省略,但是不要将表达式写成:"\d*\.?\d*",因为如果什么都没有,这个表达式也可以匹配成功。更好的写法是:"\d+\.?\d*|\.\d+"。
4.5 能匹配空字符串的子匹配不要循环无限次。如果括号内的子表达式中的每一部分都可以匹配 0 次,而这个括号整体又可以匹配无限次,那么情况可能比上一条所说的更严重,匹配过程中可能死循环。虽然现在有些正则表达式引擎已经通过办法避免了这种情况出现死循环了,比如 .NET 的正则表达式,但是我们仍然应该尽量避免出现这种情况。如果我们在写表达式时遇到了死循环,也可以从这一点入手,查找一下是否是本条所说的原因。
4.6 合理选择贪婪模式与非贪婪模式,参见话题讨论。
4.7 或 "|" 的左右两边,对某个字符最好只有一边可以匹配,这样,不会因为 "|" 两边的表达式因为交换位置而有所不同。
Last edited by 无奈何 on 2006-10-26 at 12:17 PM ]
Repost note: the reference links were broken by reposting. If you need them, see the original page:
http://www.regexlab.com/zh/regref.htm
Introduction
A regular expression describes a pattern for matching strings. It can be used to (1) check whether a string contains a substring that fits a given rule, and extract that substring; (2) perform flexible replacements on a string according to a matching rule.
Regular expressions are actually simple to learn, and the few somewhat abstract concepts are easy to grasp. Many people find them complicated for two reasons. First, most documentation does not build up from simple to advanced and pays no attention to the order in which concepts are introduced, which makes it hard to follow. Second, the documentation that ships with each engine tends to focus on that engine's special features, which are not what a beginner needs to understand first.
In the original article, every example links to a test page where it can be tried out. Enough talk; let's begin.
--------------------------------------------------------------------------------
1. Regular Expression Rules
1.1 Ordinary Characters
Letters, digits, Chinese characters, underscores, and any punctuation not given a special meaning in later sections are all "ordinary characters". An ordinary character in an expression matches one identical character in the string.
Example 1: The expression "c" matched against the string "abcde" succeeds; the matched content is "c", starting at position 2 and ending at position 3. (Note: whether indexes start from 0 or from 1 depends on the programming language in use.)
Example 2: The expression "bcd" matched against the string "abcde" succeeds; the matched content is "bcd", starting at 1 and ending at 4.
--------------------------------------------------------------------------------
1.2 Simple Escape Characters
Some characters that are awkward to write are expressed by putting "\" in front of them. These are already familiar:
Expression    Matches
\r, \n        carriage return and line feed
\t            tab
\\            the "\" character itself
Some punctuation marks that have special uses in later sections represent themselves when preceded by "\". For example, ^ and $ have special meanings, so to match the literal characters "^" and "$" in a string, write "\^" and "\$".
Expression    Matches
\^            the ^ character itself
\$            the $ character itself
\.            the dot (.) itself
These escaped characters match the same way ordinary characters do: each matches one identical character.
Example 1: The expression "\$d" matched against the string "abc$de" succeeds; the matched content is "$d", starting at 3 and ending at 5.
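The examples in 1.1 and 1.2 can be reproduced with a short sketch. It assumes Python's re module, where positions are 0-based and the end position is exclusive; the variable names are only illustrative.

```python
import re

# Ordinary characters match themselves.
span_c = re.search(r"c", "abcde").span()
span_bcd = re.search(r"bcd", "abcde").span()

# "$" is special, so it is escaped to match a literal dollar sign.
escaped = re.search(r"\$d", "abc$de")
escaped_text = escaped.group()
escaped_span = escaped.span()
```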
--------------------------------------------------------------------------------
1.3 Expressions That Match Any One of Several Characters
Some constructs in regular expressions can match any one character out of a set. For example, the expression "\d" matches any single digit. Although it can match any character in the set, it matches only one character, not several. It is like the jokers in a card game: a joker can stand in for any card, but only for one card.
Expression    Matches
\d            any one digit, 0~9
\w            any one letter, digit, or underscore, that is, any one of A~Z, a~z, 0~9, _
\s            any one whitespace character, such as a space, tab, or form feed
.             any one character except the newline character (\n)
Example 1: The expression "\d\d" matched against "abc123" succeeds; the matched content is "12", starting at 3 and ending at 5.
Example 2: The expression "a.\d" matched against "aaa100" succeeds; the matched content is "aa1", starting at 1 and ending at 4.
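A quick sketch of these character classes, again assuming Python's re module; the extra sample strings for \w and \s are made up for illustration.

```python
import re

two_digits = re.search(r"\d\d", "abc123")
a_any_digit = re.search(r"a.\d", "aaa100")

# Each class matches exactly one character per occurrence.
word_chars = re.findall(r"\w", "a_1 -")
whitespace = re.findall(r"\s", "a b\tc")
```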
--------------------------------------------------------------------------------
1.4 Custom Expressions That Match Any One of Several Characters
Square brackets [ ] enclosing a series of characters match any one of those characters. [^ ] enclosing a series of characters matches any one character that is not among them. As before, although any character in the set can be matched, only one character is matched, not several.
Expression    Matches
[ab5@]        "a" or "b" or "5" or "@"
[^abc]        any one character other than "a", "b", "c"
[f-k]         any one letter from "f" to "k"
[^A-F0-3]     any one character other than "A"~"F" and "0"~"3"
Example 1: The expression "[bcd][bcd]" matched against "abc123" succeeds; the matched content is "bc", starting at 1 and ending at 3.
Example 2: The expression "[^abc]" matched against "abc123" succeeds; the matched content is "1", starting at 3 and ending at 4.
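A sketch of bracket expressions, assuming Python's re module and assuming the bracket patterns that the descriptions above imply ([bcd][bcd], [^abc], [f-k], [^A-F0-3]); the last two sample strings are made up.

```python
import re

pair = re.search(r"[bcd][bcd]", "abc123")
not_abc = re.search(r"[^abc]", "abc123")

# A range inside brackets, and a negated set built from two ranges.
in_range = re.findall(r"[f-k]", "abcfghklm")
outside = re.findall(r"[^A-F0-3]", "AF03G4")
```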
--------------------------------------------------------------------------------
1.5 Special Symbols That Specify the Number of Repetitions
The expressions described so far, whether they match one specific character or any one of several characters, match only once. Adding a repetition quantifier to an expression lets it match repeatedly without writing the expression out several times.
The quantifier is placed after the expression it modifies. For example, "[bcd][bcd]" can be written as "[bcd]{2}".
Expression    Effect
{n}           repeats the expression n times; for example, "\w{2}" is equivalent to "\w\w", and "a{5}" is equivalent to "aaaaa"
{m,n}         repeats the expression at least m and at most n times; for example, "ba{1,3}" matches "ba", "baa", or "baaa"
{m,}          repeats the expression at least m times; for example, "\w\d{2,}" matches "a12", "_456", "M12344"...
?             matches the expression 0 or 1 time, equivalent to {0,1}; for example, "a[cd]?" matches "a", "ac", "ad"
+             matches the expression at least once, equivalent to {1,}; for example, "a+b" matches "ab", "aab", "aaab"...
*             matches the expression zero or more times, equivalent to {0,}; for example, "\^*b" matches "b", "^^^b"...
Example 1: The expression "\d+\.?\d*" matched against "It costs $12.5" succeeds; the matched content is "12.5", starting at 10 and ending at 14.
Example 2: The expression "go{2,8}gle" matched against "Ads by goooooogle" succeeds; the matched content is "goooooogle", starting at 7 and ending at 17.
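A sketch of the quantifier examples, assuming Python's re module. The sample for the second example is built in code so that the six "o" characters are explicit, and the pattern "a[cd]?" is an assumed reading of the "?" row.

```python
import re

price = re.search(r"\d+\.?\d*", "It costs $12.5")

# Build the sample so the number of "o"s is explicit: six of them.
ads = "Ads by g" + "o" * 6 + "gle"
google = re.search(r"go{2,8}gle", ads)

# "?" makes the bracket expression optional: zero or one of "c" or "d".
optional = [bool(re.fullmatch(r"a[cd]?", s)) for s in ("a", "ac", "ad", "acd")]
```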
--------------------------------------------------------------------------------
1.6 Other Special Symbols with Abstract Meanings
Some symbols stand for an abstract condition rather than a character:
Expression    Effect
^             matches at the start of the string; matches no character
$             matches at the end of the string; matches no character
\b            matches a word boundary, that is, the position between a word character and a non-word character such as a space; matches no character
A further description in words would still be rather abstract, so here are some examples.
Example 1: The expression "^aaa" matched against "xxx aaa xxx" fails. Because "^" must match at the start of the string, "^aaa" can match only when "aaa" is at the start of the string, as in "aaa xxx xxx".
Example 2: The expression "aaa$" matched against "xxx aaa xxx" fails. Because "$" must match at the end of the string, "aaa$" can match only when "aaa" is at the end of the string, as in "xxx xxx aaa".
Example 3: The expression ".\b." matched against "@@@abc" succeeds; the matched content is "@a", starting at 2 and ending at 4.
Further explanation: like "^" and "$", "\b" matches no character itself, but it requires that, at its position in the match, the character on one side is in the "\w" range and the character on the other side is not.
Example 4: The expression "\bend\b" matched against "weekend,endfor,end" succeeds; the matched content is "end", starting at 15 and ending at 18.
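The anchor examples can be sketched as follows, assuming Python's re module. The sample for Example 4 is written without spaces after the commas so that the positions come out as stated.

```python
import re

starts = re.search(r"^aaa", "xxx aaa xxx")   # no match: "aaa" is not at the start
ends = re.search(r"aaa$", "xxx xxx aaa")     # matches: "aaa" is at the end
boundary = re.search(r".\b.", "@@@abc")      # one character on each side of a boundary
whole_word = re.search(r"\bend\b", "weekend,endfor,end")
```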
Some symbols affect the relationship between the sub-expressions inside an expression:
Expression    Effect
|             an "or" between the expressions on its left and right: matches either the left side or the right side
( )           (1) when a quantifier is applied, the expression in the parentheses is repeated as a whole; (2) when the result is retrieved, the content matched by the expression in the parentheses can be obtained separately
Example 5: The expression "Tom|Jack" matched against "I'm Tom, he is Jack" succeeds; the matched content is "Tom", starting at 4 and ending at 7. Matching again succeeds; the matched content is "Jack", starting at 15 and ending at 19.
Example 6: The expression "(go\s*)+" matched against "Let's go go go!" succeeds; the matched content is "go go go", starting at 6 and ending at 14.
Example 7: The expression "¥(\d+\.?\d*)" matched against "$10.9,¥20.5" succeeds; the matched content is "¥20.5", starting at 6 and ending at 11. The content matched by the parentheses alone is "20.5".
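A sketch of alternation and grouping, assuming Python's re module with character-based, end-exclusive positions. The sample for Example 7 is written without a space after the comma.

```python
import re

# "|" tries the left alternative, then the right one, at each position.
names = [m.span() for m in re.finditer(r"Tom|Jack", "I'm Tom, he is Jack")]

# Parentheses let "+" repeat the whole sub-expression.
repeated = re.search(r"(go\s*)+", "Let's go go go!")

# Parentheses also capture part of the match for separate retrieval.
amount = re.search(r"¥(\d+\.?\d*)", "$10.9,¥20.5")
```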
--------------------------------------------------------------------------------
2. Some Advanced Rules in Regular Expressions
2.1 Greedy and Non-greedy Repetition
Several of the repetition quantifiers let the same expression match a varying number of times: "{m,n}", "{m,}", "?", "*", "+". The actual number of repetitions depends on the string being matched. During matching, such an expression always matches as many times as possible. For example, for the text "dxxxdxxxd":
Expression     Result
(d)(\w+)       "\w+" matches all the characters after the first "d": "xxxdxxxd"
(d)(\w+)(d)    "\w+" matches everything between the first "d" and the last "d": "xxxdxxx". "\w+" could also match the last "d", but it "gives up" that "d" so that the whole expression can match
So "\w+" always matches as many qualifying characters as it can. In the second example it does not match the last "d", but only so that the whole expression can succeed. Likewise, expressions with "*" and "{m,n}" match as much as possible, and an expression with "?" prefers to match when matching is optional. This matching principle is called "greedy" mode.
Non-greedy mode:
Adding a "?" after a repetition quantifier makes an expression with a variable number of repetitions match as few times as possible, and makes an optional expression prefer not to match. This is called "non-greedy" mode, also known as "reluctant" mode. If matching fewer would make the whole expression fail, then, as in greedy mode, a non-greedy expression matches just enough more for the whole expression to succeed. For example, again for the text "dxxxdxxxd":
Expression      Result
(d)(\w+?)       "\w+?" matches as few characters as possible after the first "d", so "\w+?" matches only one "x"
(d)(\w+?)(d)    for the whole expression to succeed, "\w+?" has to match "xxx" so that the following "d" can match; the result is that "\w+?" matches "xxx"
More examples:
Example 1: The expression "<td>(.*)</td>" matched against the string "<td><p>aa</p></td> <td><p>bb</p></td>" succeeds; the matched content is the entire string "<td><p>aa</p></td> <td><p>bb</p></td>", and the "</td>" in the expression matches the last "</td>" in the string.
Example 2: By contrast, the expression "<td>(.*?)</td>" matched against the same string yields only "<td><p>aa</p></td>"; matching again yields the second one, "<td><p>bb</p></td>".
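The greedy and non-greedy results can be compared side by side in a short sketch, assuming Python's re module; findall here returns the content of the single capture group.

```python
import re

text = "dxxxdxxxd"
greedy_open = re.search(r"(d)(\w+)", text).group(2)
greedy_closed = re.search(r"(d)(\w+)(d)", text).group(2)
lazy_open = re.search(r"(d)(\w+?)", text).group(2)
lazy_closed = re.search(r"(d)(\w+?)(d)", text).group(2)

html = "<td><p>aa</p></td> <td><p>bb</p></td>"
greedy_cells = re.findall(r"<td>(.*)</td>", html)   # one match spanning both cells
lazy_cells = re.findall(r"<td>(.*?)</td>", html)    # one match per cell
```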
--------------------------------------------------------------------------------
2.2 Backreferences \1, \2...
While matching, the engine records the string matched by each sub-expression enclosed in parentheses "( )". When the result is retrieved, the string matched by each parenthesized sub-expression can be obtained separately, as earlier examples have shown several times. In practice, when you search using some delimiters but want content that excludes the delimiters, you must use parentheses to mark the part you want, as in the earlier "<td>(.*?)</td>".
In fact, the string matched by a parenthesized sub-expression is available not only after matching is finished but also during matching. A later part of the expression can refer back to the string that an earlier parenthesized group has already matched. The reference is written as "\" followed by a number: "\1" refers to the string matched by the first pair of parentheses, "\2" to the second, and so on. If one pair of parentheses contains another, the outer pair is numbered first. In other words, groups are numbered in the order of their opening parenthesis "(".
Examples:
Example 1: The expression "('|")(.*?)(\1)" matched against " 'Hello', "World" " succeeds; the matched content is " 'Hello' ". Matching again finds " "World" ".
Example 2: The expression "(\w)\1{4,}" matched against "aa bbbb abcdefg ccccc 111121111 999999999" succeeds; the matched content is "ccccc". Matching again yields 999999999. This expression requires the same "\w" character to be repeated at least 5 times; note the difference from "\w{5,}".
Example 3: The expression "<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>" matched against "<td id='td1' style="bgcolor:white"></td>" succeeds. If "<td>" and "</td>" are not paired, the match fails; if they are changed to another matching pair of tags, the match still succeeds.
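A sketch of the three backreference examples, assuming Python's re module, where "\1" and "\4" inside a pattern refer to earlier groups. The mismatched tag sample at the end is made up for illustration.

```python
import re

# \1 must repeat whichever quote character group 1 matched.
quote_pattern = re.compile(r"""('|")(.*?)(\1)""")
quoted = [m.group() for m in quote_pattern.finditer("'Hello', \"World\"")]

# The same character repeated at least 5 times in a row.
repeats = [m.group() for m in re.finditer(
    r"(\w)\1{4,}", "aa bbbb abcdefg ccccc 111121111 999999999")]

# The closing tag name must equal the opening tag name captured in group 1.
tag_pattern = re.compile(r"""<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>""")
paired = tag_pattern.search("""<td id='td1' style="bgcolor:white"></td>""")
mismatched = tag_pattern.search("<td></tr>")
```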
--------------------------------------------------------------------------------
2.3 Lookahead and Lookbehind, Which Match No Characters
In the previous sections I described several special symbols with abstract meanings: "^", "$", "\b". They have one thing in common: they match no character themselves and only attach a condition to "the two ends of the string" or to "a gap between characters". With that concept in mind, this section introduces a more flexible way to attach conditions to the "ends" or "gaps".
Lookahead: "(?=xxxxx)", "(?!xxxxx)"
Format "(?=xxxxx)": the condition it attaches to the gap or end where it sits is that the text to the right of that gap must match the expression xxxxx. Because it only acts as a condition on that gap, it does not prevent the expressions after it from actually matching the characters that follow the gap. This is similar to "\b", which matches no character itself: "\b" merely inspects the characters before and after its gap and does not affect what the following expressions match.
Example 1: The expression "Windows (?=NT|XP)" matched against "Windows 98, Windows NT, Windows 2000" matches only the "Windows " in "Windows NT"; the other occurrences of "Windows " are not matched.
Example 2: The expression "(\w)((?=\1\1\1)(\1))+" matched against "aaa ffffff 999999999" matches the first 4 of the 6 "f"s and the first 7 of the 9 "9"s. The expression can be read as: for a letter or digit repeated 4 or more times, match everything before its last 2 occurrences. The expression could of course be written differently; it is used here for demonstration only.
Format "(?!xxxxx)": the text to the right of the gap must not match the expression xxxxx.
Example 3: The expression "((?!\bstop\b).)+" matched against "fdjka ljfdl stop fjdsla fdj" matches from the beginning up to the position just before "stop". If the string contains no "stop", it matches the entire string.
Example 4: The expression "do(?!\w)" matched against "done, do, dog" matches only the standalone "do". In this example, putting "(?!\w)" after "do" has the same effect as putting "\b" there.
Lookbehind: "(?<=xxxxx)", "(?<!xxxxx)"
These two formats are conceptually similar to lookahead, except that the condition applies to the text on the left of the gap rather than the right: the two formats require, respectively, that it matches and that it does not match the specified expression. As with lookahead, both only attach a condition to the gap and match no character themselves.
Example 5: The expression "(?<=\d{4})\d+(?=\d{4})" matched against "1234567890123456" matches the middle 8 digits, excluding the first 4 and the last 4. Because JScript.RegExp does not support lookbehind, this example cannot be demonstrated there. Many other engines do support lookbehind, for example the java.util.regex package in Java 1.4 and later, the System.Text.RegularExpressions namespace in .NET, and the DEELX regular expression engine, which the original site recommends as the simplest and easiest to use.
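The five examples can be sketched as follows, assuming Python's re module, which supports lookahead and fixed-width lookbehind. Variable names are only illustrative.

```python
import re

# Lookahead: "Windows " counts only when followed by NT or XP.
nt_only = [m.span() for m in re.finditer(
    r"Windows (?=NT|XP)", "Windows 98, Windows NT, Windows 2000")]

# Each repeated character is consumed only while three more copies follow it.
runs = [m.group() for m in re.finditer(
    r"(\w)((?=\1\1\1)(\1))+", "aaa ffffff 999999999")]

# Negative lookahead: consume characters until the word "stop" would begin.
before_stop = re.match(r"((?!\bstop\b).)+", "fdjka ljfdl stop fjdsla fdj").group()

# "do" not followed by a word character.
do_only = [m.span() for m in re.finditer(r"do(?!\w)", "done, do, dog")]

# Lookbehind and lookahead together: digits with 4 digits on each side.
middle = re.search(r"(?<=\d{4})\d+(?=\d{4})", "1234567890123456").group()
```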
--------------------------------------------------------------------------------
3. Other General Rules
Some rules that are fairly common across regular expression engines were not covered in the sections above.
3.1 "\xXX" and "\uXXXX" can be used in an expression to represent a character (each "X" is a hexadecimal digit)
Form      Character range
\xXX      characters numbered 0 ~ 255; for example, a space can be written as "\x20"
\uXXXX    any character, written as "\u" followed by its 4-digit hexadecimal code; for example, "\u4E2D"
3.2 "\s", "\d", "\w", and "\b" have special meanings, and the corresponding uppercase letters have the opposite meanings
Expression    Matches
\S            any non-whitespace character ("\s" matches any whitespace character)
\D            any non-digit character
\W            any character other than letters, digits, and the underscore
\B            a non-word boundary, that is, a gap between characters where both sides are in the "\w" range or neither side is
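A brief sketch of the hexadecimal escapes and the negated classes, assuming Python's re module (which accepts \xXX and \uXXXX in string patterns); the sample strings are made up.

```python
import re

space = re.search(r"\x20", "a b").span()        # \x20 is the space character
zhong = re.search(r"\u4E2D", "中文").group()     # \u4E2D is the character 中

non_space = re.findall(r"\S+", "a b\tc")
non_digit = re.findall(r"\D", "a1b2")
non_word = re.findall(r"\W", "a_b-c")

# \B: "end" only where it is not at the start of a word.
inner_end = re.search(r"\Bend", "weekend end").span()
```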
3.3 Summary of characters that have a special meaning in expressions and must be preceded by "\" to match the character itself
Character    Description
^            matches the start of the input string. To match the "^" character itself, use "\^"
$            matches the end of the input string. To match the "$" character itself, use "\$"
( )          mark the start and end of a sub-expression. To match parentheses, use "\(" and "\)"
[ ]          define a custom expression that matches any one of several characters. To match square brackets, use "\[" and "\]"
{ }          specify the number of repetitions. To match braces, use "\{" and "\}"
.            matches any character except the newline character (\n). To match the dot itself, use "\."
?            sets the number of repetitions to 0 or 1. To match the "?" character itself, use "\?"
+            sets the number of repetitions to at least 1. To match the "+" character itself, use "\+"
*            sets the number of repetitions to 0 or more. To match the "*" character itself, use "\*"
|            an "or" between the expressions on its left and right. To match "|" itself, use "\|"
3.4 If the match of a sub-expression in parentheses "( )" should not be recorded for later use, use the format "(?:xxxxx)"
Example 1: The expression "(?:(\w)\1)+" matched against "a bbccdd efg" gives "bbccdd". The match of the "(?:)" parentheses is not recorded, so "(\w)" is still the first recorded group and is referenced as "\1".
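A short sketch of the non-capturing group, assuming Python's re module. The groups attribute reports how many capturing groups the compiled pattern has.

```python
import re

# (?:...) groups for repetition without recording; (\w) remains group 1.
pairs = re.compile(r"(?:(\w)\1)+")
found = pairs.search("a bbccdd efg")

whole = found.group()       # the full match
last_char = found.group(1)  # group 1 keeps its last repetition
group_count = pairs.groups  # only one capturing group exists
```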
3.5 Introduction to the commonly used expression attribute settings: Ignorecase, Singleline, Multiline, Global
Expression attribute
Description
Ignorecase
By default, the letters in the expression are case-sensitive. Configuring Ignorecase can make the matching case-insensitive. Some expression engines extend the concept of "case" to the case of the UNICODE range.
Singleline
By default, the decimal point "." matches characters except the newline character (\n). Configuring Singleline can make the decimal point match all characters including the newline character.
Multiline
By default, the expressions "^" and "$" only match the start ① and end ④ positions of the string. For example:
①xxxxxxxxx②\n
③xxxxxxxxx④
With Multiline set, "^" matches ① and also ③, the position right after the newline where the next line starts; "$" matches ④ and also ②, the position at the end of a line just before the newline.
Global
Mainly matters when the expression is used for replacement: with Global set, all matches are replaced.
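For comparison, here is how these four attributes might look in Python's re module. The mapping is my assumption: IGNORECASE, DOTALL, and MULTILINE play the roles of Ignorecase, Singleline, and Multiline, and "replace all" is the default of re.sub, with count=1 as the non-global case.

```python
import re

text = "Foo\nfoo"  # two lines, made-up sample

ignore_case = re.findall(r"foo", text, re.IGNORECASE)

# "." does not cross the newline unless DOTALL (Singleline) is set
dot_default = re.search(r"Foo.foo", text)
dot_all = re.search(r"Foo.foo", text, re.DOTALL)

# "^" and "$" match only at the ends of the string unless MULTILINE is set
lines_default = re.findall(r"^\w+$", text)
lines_multi = re.findall(r"^\w+$", text, re.MULTILINE)

# Global versus single replacement
replace_all = re.sub(r"o", "0", text)
replace_first = re.sub(r"o", "0", text, count=1)
```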
--------------------------------------------------------------------------------
4. Other Tips
4.1 To learn which advanced syntax a full-featured regular expression engine supports, see the documentation of the DEELX regular expression engine on this site.
4.2 If the expression must match the entire string rather than a part of it, put "^" and "$" at the start and end of the expression. For example, "^\d+$" requires the whole string to consist of digits only.
4.3 If the match must be a complete word rather than part of a word, put "\b" at the start and end of the expression. For example, "\b(if|while|else|void|int……)\b" matches keywords in a program.
4.4 An expression should not be able to match the empty string; otherwise it always succeeds but may match nothing. For example, in an expression meant to match "123", "123.", "123.5", ".5" and similar forms, the integer part, the decimal point, and the fractional part can each be omitted, but do not write "\d*\.?\d*", because it also matches successfully when nothing is there. A better form is "\d+\.?\d*|\.\d+".
4.5 Do not repeat without limit a sub-expression that can match the empty string. If every part of a parenthesized sub-expression can match 0 times and the group as a whole can repeat without limit, the situation can be worse than in the previous tip: the matching process may loop forever. Some engines, such as the one in .NET, already guard against this kind of infinite loop, but it is still best avoided. If an expression you write hangs, check whether this is the cause.
4.6 Choose between greedy and non-greedy mode sensibly; see the topic discussion.
4.7 For the alternatives on the two sides of "|", it is best that only one side can match a given character, so that swapping the two sides does not change the result.
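Tips 4.2, 4.3, 4.4, and 4.7 can be tried out in a few lines. The sketch below assumes Python's re module and made-up inputs; the keyword list is shortened to the five names spelled out in tip 4.3.

```python
import re

# 4.2: anchor both ends so the whole string must be digits
all_digits = [bool(re.search(r"^\d+$", s)) for s in ["12345", "123a"]]

# 4.3: \b keeps "int" from matching inside "printf" or "interval"
code = "int printf(void) { if (x) interval(); }"
keywords = re.findall(r"\b(if|while|else|void|int)\b", code)

# 4.4: the loose pattern accepts the empty string, the stricter one does not
loose = re.compile(r"\d*\.?\d*")
strict = re.compile(r"\d+\.?\d*|\.\d+")
loose_accepts_empty = loose.fullmatch("") is not None
strict_accepts_empty = strict.fullmatch("") is not None
strict_ok = all(strict.fullmatch(s) for s in ["123", "123.", "123.5", ".5"])

# 4.7: when both sides of "|" can match, their order decides the result
first = re.search(r"a|ab", "ab").group(0)
swapped = re.search(r"ab|a", "ab").group(0)
```

In a backtracking engine like this one the left alternative is tried first, which is why "a|ab" stops at "a" while "ab|a" returns "ab".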
Last edited by 无奈何 on 2006-10-26 at 12:17 PM ]
|

|
|
2006-10-26 11:43 |
|
|
无奈何
荣誉版主
      
积分 1338
发帖 356
注册 2005-7-15
状态 离线
|
『Post 6』:
Regular Expression Topics
Repost note: the quoted links contain errors after reposting; see the original web page if needed.
Regular Expression Topics
http://www.regexlab.com/zh/regtopic.htm
Introduction
This article discusses, step by step, several topics in the use of regular expressions. It extends the basic article on this site, so reading the "Regular Expression Reference Document" first is recommended.
--------------------------------------------------------------------------------
1. Recursive Matching of Expressions
Sometimes we need a regular expression to analyze how parentheses pair up in a formula. For example, the expression "\( [^)]* \)" or "\( .*? \)" can match a pair of parentheses. But if another pair is nested inside, as in "( ( ) )", these expressions no longer match correctly; the result is "( ( )". The same happens with HTML tags that can nest, such as "<font> </font>". This section discusses how to match nested pairs of parentheses or tags.
Matching nesting of unknown depth:
Some regular expression engines support such nesting directly and, as long as stack space allows, can handle nesting of arbitrary unknown depth: Perl, PHP, and GRETA, for example. In PHP and GRETA, "(?R)" in an expression stands for the nested part.
An expression that matches a "pair of parentheses" nested to unknown depth is written: "\( ([^()] | (?R))* \)".
Matching nesting of limited depth:
For engines that do not support nesting, only a limited depth can be matched, using the following approach:
Step 1: write an expression that does not support nesting: "\( [^()]* \)", "<font>((?!</?font>).)*</font>". When matched against nested text, these two expressions match only the innermost level.
Step 2: write an expression that matches one level of nesting: "\( ([^()] | \( [^()]* \))* \)". When the nesting is deeper than one level, this expression matches only the innermost two levels. It also matches text with no nesting, or the innermost level alone.
For a "<font>" tag nested one level deep, the expression is: "<font>((?!</?font>).|(<font>((?!</?font>).)*</font>))*</font>". When the "<font>" nesting is deeper than one level, this expression matches only the innermost two levels.
Step 3: find the relationship between the expression for (n) levels and the one for (n-1) levels. For example, an expression that matches (n) levels of nesting has the form:
[opening mark] ( [expression matching anything other than the opening and closing marks] | [expression matching (n-1) levels] )* [closing mark]
Looking back at the "one level of nesting" expressions written earlier:
\( ( [^()] | \(([^()])*\) )* \)
<font> ( (?!</?font>). | (<font>((?!</?font>).)*</font>) )* </font>
The convenience of PHP and GRETA is that the expression for (n-1) levels is written simply as (?R):
\( ( [^()] | (?R) )* \)
Step 4: continuing in the same way, an expression for any limited number (n) of levels can be written. Such expressions look long, but once compiled they still match efficiently.
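Steps 1 to 4 can be automated by building the pattern in a loop. The sketch below is an illustration in Python, whose built-in re module has no "(?R)"; the helper name nested_parens is hypothetical, and it assumes the character class that belongs in the parenthesis expressions is "[^()]".

```python
import re

def nested_parens(levels: int) -> str:
    """Return a pattern for parentheses nested up to `levels` levels deep."""
    pattern = r"\([^()]*\)"  # step 1: a pair with no nesting inside
    for _ in range(levels):
        # step 3: level n = "(" (non-parenthesis char | level n-1)* ")"
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return pattern

text = "f(a(b)c) and (x(y(z)))"  # made-up sample

# One level of nesting: the triple-nested group only yields its inner two levels
one_level = re.findall(nested_parens(1), text)

# Two levels of nesting are enough to match the whole second group
two_levels = re.fullmatch(nested_parens(2), "(x(y(z)))")

print(one_level)
```

Non-capturing groups are used so that findall returns whole matches; the pattern grows with each level, which matches the article's remark that such expressions look long.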
--------------------------------------------------------------------------------
2. Efficiency of Non-Greedy Matching
Many people may have had the same experience as I did: to match text such as "<td>content</td>" or "[b]bold[/b]", we used the look-ahead feature to write expressions like "<td>([^<]|<(?!/td>))*</td>" or "<td>((?!</td>).)*</td>".
On discovering non-greedy matching, we suddenly realized that the same expression can be written far more simply: "<td>.*?</td>". It felt like finding a treasure, and from then on we used the short non-greedy ".*?" wherever text is matched by its boundaries. For complex expressions in particular, ".*?" does make the expression much more concise.
However, when an expression contains several non-greedy parts, or several parts with an unknown number of repetitions, it may hide an efficiency trap. Sometimes matching becomes inexplicably slow, to the point that one starts to doubt whether regular expressions are practical.
How the efficiency trap arises:
The basic article on this site says of non-greedy matching: "If matching less would make the whole expression fail, then, as in greedy mode, the non-greedy part matches a little more, as little as possible, so that the whole expression succeeds."
The matching process in detail:
The "non-greedy part" first matches the minimum number of times, then the engine tries the "expression on the right".
If the expression on the right matches, the whole match is finished. If it fails, the "non-greedy part" matches one more time and the expression on the right is tried again.
If the expression on the right fails again, the "non-greedy part" matches one more time, and the right side is tried again.
And so on. In the end, either the "non-greedy part" has matched as few times as possible while letting the whole expression succeed, or the match fails.
When an expression contains several non-greedy parts, take "d(\w+?)d(\w+?)z" as an example. For the "\w+?" in the first parentheses, "d(\w+?)z" is its "expression on the right". For the "\w+?" in the second parentheses, "z" is its "expression on the right".
When "z" fails to match, the second "\w+?" matches one more time and "z" is tried again. If "z" still cannot match however far the second "\w+?" is extended, up to the end of the text, then "d(\w+?)z" has failed, which means the right side of the first "\w+?" has failed. The first "\w+?" then matches one more time and "d(\w+?)z" is tried again from the new position. This cycle repeats until "d(\w+?)z" cannot match however far the first "\w+?" is extended; only then is the whole expression declared a failure.
In fact, to let the whole expression succeed, a greedy part likewise "gives back" characters it has already matched, so greedy matching has the same problem. When an expression contains many parts with an unknown number of repetitions, each greedy or non-greedy part has to try fewer or more repetitions, which easily turns into one large nested loop of attempts and a very long matching time. This article calls it a "trap" because the efficiency problem is often hard to notice.
For example, when "d(\w+?)d(\w+?)d(\w+?)z" is matched against "ddddddddddd...", it takes a long time to determine that the match fails.
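The step-by-step expansion can be observed on a small input. This sketch assumes Python's re module and deliberately short made-up strings, so the failing case finishes quickly instead of demonstrating the slowdown itself.

```python
import re

two_lazy = re.compile(r"d(\w+?)d(\w+?)z")
three_lazy = re.compile(r"d(\w+?)d(\w+?)d(\w+?)z")

# Each lazy group grows one character at a time until its right side matches.
ok = two_lazy.search("dabcdxyz")
ok_groups = ok.groups()

# No "z" anywhere: every combination of group lengths is tried before
# failure is reported. The input is kept short on purpose.
failed = three_lazy.search("d" * 20)
```

With longer runs of "d" the number of combinations grows quickly, which is the trap described above.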
Avoiding the efficiency trap:
The principle is to avoid "nested loops" of "match attempts". Non-greedy matching is not bad in itself; when using it, just take care not to cause too many "repeated attempts".
Situation 1: an expression with only one non-greedy or greedy part has no efficiency trap. That is, to match text like "<td> content </td>", the expressions "<td>([^<]|<(?!/td>))*</td>", "<td>((?!</td>).)*</td>" and "<td>.*?</td>" have exactly the same efficiency.
Situation 2: if an expression contains several parts with an unknown number of repetitions, unnecessary match attempts should be prevented.
For example, with the expression "<script language='(.*?)'>(.*?)</script>", if the first part matches "<script language='vbscript'>" but the following "(.*?)</script>" then fails, the first ".*?" is extended and tried again. Given the real purpose of the expression, letting the first ".*?" grow to "vbscript'>" is wrong, so this attempt is unnecessary.
Therefore, in expressions that rely on boundaries, do not let a part with an unknown number of repetitions cross its boundary. In the expression above, the first ".*?" should be rewritten as "[^']*". The second ".*?" has no further part with an unknown number of repetitions to its right, so it carries no efficiency trap. The expression for matching this script block is therefore better written as: "<script language='([^']*)'>(.*?)</script>".
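The effect of letting a lazy part cross its boundary shows up in what group 1 captures. In this Python sketch the input with an extra id attribute is made up, and the bounded class "[^']*" is my assumption for the rewritten first group.

```python
import re

text = "<script language='vbscript' id='a'>x</script>"  # made-up input

lazy = re.search(r"<script language='(.*?)'>(.*?)</script>", text)
bounded = re.search(r"<script language='([^']*)'>(.*?)</script>", text)

# The lazy group keeps growing past the closing quote until "'>" is found.
lazy_language = lazy.group(1)
```

The bounded version gives up at the first quote and reports no match here, rather than capturing a wrong "language" value after extra attempts.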
Last edited by 无奈何 on 2006-10-26 at 12:20 PM ]
|

|
|
2006-10-26 11:43 |
|
|
无奈何
荣誉版主
      
积分 1338
发帖 356
注册 2005-7-15
状态 离线
|
『Post 7』:
Regular Expression Reference Manual (Mini Version)
Regular Expression Reference Manual (Mini Version)
A regular expression is a text pattern made up of ordinary characters (such as the letters a to z) and special characters (called metacharacters). The pattern describes one or more strings to match when searching a body of text. A regular expression acts as a template that matches a character pattern against the string being searched.
This article lists the characters that can be used in regular expressions to match text. It can serve as a quick reference when you need to interpret an existing regular expression. For more detail, see: Francois Liger, Craig McQueen, Paul Wilton, C# String and Regular Expression Reference Manual, Beijing: Tsinghua University Press, 2003.2
I. Matching Characters
Character Classes
Characters Matched
Examples
\d
Any digit from 0-9
\d\d matches 72, but not aa or 7a
\D
Any non-digit character
\D\D\D matches abc, but not 123
\w
Any word character, including A-Z, a-z, 0-9, and underscore
\w\w\w\w matches Ab_2, but not ∑£$%* or Ab_@
\W
Any non-word character
\W matches @, but not a
\s
Any whitespace character, including tab, newline, carriage return, form feed, and vertical tab
Matches all traditional whitespace characters defined in HTML, XML, and other standards
\S
Any non-whitespace character
Any character other than whitespace, such as A%&g3; etc.
.
Any character
Matches any character except a newline, unless the SingleLine option is set
[...]
Any character in the brackets
[abc] matches a single character: a, b, or c
[a-z] matches any character from a to z
[^...]
Any character not in the brackets
[^abc] matches a single character other than a, b, or c, such as d or e, or A, B, C
[^a-z] matches any character that is not in a-z, so it matches all uppercase letters
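The bracket classes and "\w" can be checked quickly. This is a sketch in Python's re module (an assumption of the example; the manual targets C#), with made-up inputs.

```python
import re

in_set = re.findall(r"[abc]", "aXbYc")        # only a, b, c
in_range = re.findall(r"[a-z]", "aB3d")       # lowercase letters only
not_in_set = re.findall(r"[^abc]", "abcdA")   # anything except a, b, c
not_in_range = re.findall(r"[^a-z]", "aBc1")  # anything except a-z

# Four word characters: the underscore counts, "-" and "@" do not
four_word = [bool(re.fullmatch(r"\w\w\w\w", s)) for s in ["Ab_2", "Ab-2", "Ab_@"]]
```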
II. Repeating Characters
Repeating Characters
Meaning
Examples
{n}
Matches the preceding character n times
x{2} matches xx, but not x or xxx
{n,}
Matches the preceding character at least n times
x{2,} matches 2 or more x, such as xx, xxx, xxxx..
{n,m}
Matches the preceding character at least n times and at most m times. If n is 0, the preceding character is optional
x{2,4} matches xx, xxx, xxxx, but not xxxxx
?
Matches the preceding character 0 or 1 time, essentially optional
x? matches x or zero x
+
Matches the preceding character 1 or more times
x+ matches x or xx or any number of x greater than 0
*
Matches the preceding character 0 or more times
x* matches 0, 1, or more x
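Each row of the table can be verified by asking whether the quantified "x" matches a whole test string. The helper name full below is hypothetical, and Python's re module is an assumption of this sketch.

```python
import re

def full(pattern: str, s: str) -> bool:
    """True if the pattern matches the entire string."""
    return re.fullmatch(pattern, s) is not None

results = {
    "x{2}":   [full(r"x{2}", s) for s in ("x", "xx", "xxx")],
    "x{2,}":  [full(r"x{2,}", s) for s in ("x", "xx", "xxxxx")],
    "x{2,4}": [full(r"x{2,4}", s) for s in ("xx", "xxxx", "xxxxx")],
    "x?":     [full(r"x?", s) for s in ("", "x", "xx")],
    "x+":     [full(r"x+", s) for s in ("", "x", "xxx")],
    "x*":     [full(r"x*", s) for s in ("", "x", "xxx")],
}
```

The difference between "+" and "*" shows only on the empty string: "x+" needs at least one x, "x*" does not.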
III. Anchoring Characters
Anchoring Characters
Description
^
The pattern that follows must be at the start of the string or, in a multi-line string, at the start of a line. For multi-line text (a string containing newlines), the multi-line flag must be set
$
The preceding pattern must be at the end of the string or, in a multi-line string, at the end of a line
\A
The pattern that follows must be at the start of the string; the multi-line flag is ignored
\z
The preceding pattern must be at the end of the string; the multi-line flag is ignored
\Z
The preceding pattern must be at the end of the string, or just before a final newline
\b
Matches a word boundary, that is, the point between a word character and a non-word character, such as the start of a word. Remember that a word character is one of [a-zA-Z0-9_]
\B
Matches a position that is not a word boundary, for example a position that is not the start of a word
Note: Anchoring characters can be applied to characters or groups, placed at the left or right end of the pattern
IV. Grouping Characters
Grouping Characters
Definition
Examples
()
Groups the characters matched by the pattern inside the parentheses. It is a capturing group: the characters it matches are stored as a group of the match, unless the ExplicitCapture option is set (it is off by default)
The input string is: ABC1DEF2XY
The regular expression that matches 3 characters from A to Z followed by 1 digit: ([A-Z]{3})\d
It produces two matches: Match 1=ABC1; Match 2=DEF2
Each match has one group: the first group of Match 1=ABC; the first group of Match 2=DEF
A backreference lets you refer to the group by its number inside the regular expression, and in C# through the Group and GroupCollection classes. If the ExplicitCapture option is set, the content of this group is not captured and cannot be used
(?: )
Groups the characters matched by the pattern inside the parentheses. It is a non-capturing group: the characters it matches are not captured as a group, but they are part of the final match. It behaves like the group above with the ExplicitCapture option set
The input string is: 1A BB SA 1 C
The regular expression that matches a digit or a letter from A to Z, followed by any word character, is: (?:\d|[A-Z])\w
It produces 3 matches: 1st match=1A; 2nd match=BB; 3rd match=SA
But no group is captured
(?<name> )
Groups the characters matched by the pattern inside the parentheses and names the group with the value given in the angle brackets. Inside the regular expression, a backreference can use the name instead of the number. Even if the ExplicitCapture option is set, it is still a capturing group, so backreferences can use the characters matched by the group, and they can be accessed through the Group class
The input string is: Characters in Seinfeld included Jerry Seinfeld, Elaine Benes, Cosmo Kramer and George Costanza. The regular expression that matches their names and captures the last name in a group lastName is: \b[A-Z][a-z]+ (?<lastName>[A-Z][a-z]+)\b
It produces 4 matches: First Match=Jerry Seinfeld; Second Match=Elaine Benes; Third Match=Cosmo Kramer; Fourth Match=George Costanza
Each match has a lastName group:
1st match: lastName group=Seinfeld
2nd match: lastName group=Benes
3rd match: lastName group=Kramer
4th match: lastName group=Costanza
The group is captured regardless of whether the ExplicitCapture option is set
(?= )
Positive lookahead assertion. The text to the right of the assertion must match the pattern in the parentheses. This pattern is not part of the final match
The regular expression \S+(?=.NET) is matched against the input string: The languages were Java, C#.NET, VB.NET, C, JScript.NET, Pascal
It produces the following matches:
C#
VB
JScript
(?! )
Negative lookahead assertion. The pattern in the parentheses must not appear immediately to the right of the assertion. This pattern is not part of the final match
\d{3}(?![A-Z]) is matched against the input string: 123A 456 789 111C
It produces the following matches:
456
789
(?<= )
Positive lookbehind assertion. The text to the left of the assertion must match the pattern in the parentheses. This pattern is not part of the final match
The regular expression (?<=…
It produces the following matches:
Mexico
England
(?<! )
Negative lookbehind assertion. The text to the left of the assertion must not match the pattern in the parentheses. This pattern is not part of the final match
The regular expression (?<!…
It produces the following matches:
56F
89C
(?> )
Non-backtracking group. Prevents the Regex engine from backtracking into the group, which can prevent a match that would otherwise succeed
Suppose you want to match text ending with "ing". The input string is: He was very trusting
The regular expression is: .*ing
It produces one match, ending with the word trusting. ".*" first consumes every character, including "ing". The Regex engine then backtracks to just after the second "t", so that the pattern "ing" can match. However, if backtracking is disabled: (?>.*)ing
It produces 0 matches. ".*" consumes all the characters, including "ing"; without backtracking nothing is left for "ing" to match, so the match fails
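The named group and the four lookaround assertions can be tried as follows. This sketch uses Python's re module, where a named group is written "(?P<name>...)" instead of the .NET form "(?<name>...)". The lookbehind patterns and their input strings are my own illustrations, chosen to give results like those listed in the table; they are not the manual's original examples.

```python
import re

cast = ("Characters in Seinfeld included Jerry Seinfeld, Elaine Benes, "
        "Cosmo Kramer and George Costanza")
# Named capturing group (Python spelling)
name_re = re.compile(r"\b[A-Z][a-z]+ (?P<lastName>[A-Z][a-z]+)\b")
last_names = [m.group("lastName") for m in name_re.finditer(cast)]

langs = "The languages were Java, C#.NET, VB.NET, C, JScript.NET, Pascal"
# Positive lookahead: ".NET" must follow but is not part of the match
before_net = re.findall(r"\S+(?=\.NET)", langs)

# Negative lookahead: three digits not followed by an uppercase letter
no_letter = re.findall(r"\d{3}(?![A-Z])", "123A 456 789 111C")

# Positive lookbehind (illustrative pattern and input)
after_new = re.findall(r"(?<=New )\w+", "New Mexico, West Virginia, New England")

# Negative lookbehind (illustrative pattern and input)
not_after_1 = re.findall(r"(?<!1)\d{2}[A-Z]", "123A 456F 789C 111A")
```

The atomic group "(?>...)" is left out because older Python versions do not accept that syntax.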
V. Decision Characters
Characters
Description
Examples
(?(regex)yes_regex|no_regex )
If the expression regex matches, the engine tries to match the expression yes; otherwise it tries the expression no. The expression no is optional. Note that the pattern making the decision has zero width, which means the expression yes or no starts matching at the same position as the regex expression
The regular expression (?(\d)\dA|[A-Z]B) is matched against the input string: 1A CB 3A 5C 3B
The matches it achieves are:
1A
CB
3A
(?(group name or number)yes_regex|no_regex )
If the group has matched, the engine tries to match the yes expression; otherwise it tries the no expression. The no expression is optional
The regular expression
(\d7)?-(?(1)\d\d| for the input string to be matched is:
77 -77A 69-AA 57-B
The matches it achieves are:
77 -77A
- AA
Note: The characters listed in the above table force the processor to perform an if-else decision
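The group-based form of the decision construct also exists in Python's re module, written "(?(1)yes|no)"; the lookahead-based form "(?(regex)...)" is not available there. The pattern below is my own illustration, not one from the manual: it accepts three digits with or without parentheses, but not with only one of them.

```python
import re

# Group 1 is an optional "("; ")" is required only if group 1 matched.
pattern = re.compile(r"(\()?\d{3}(?(1)\))")

samples = ["(123)", "123", "(123", "123)"]
results = [pattern.fullmatch(s) is not None for s in samples]
```

When group 1 did not take part in the match and no "no" branch is given, the conditional matches the empty string.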
VI. Replacement Characters
Characters
Description
$group
Substitutes the text of the group whose number is given by group
${name}
Substitutes the last substring matched by a (?<name> ) group
$$
Substitutes a literal $ character
$&
Substitutes the entire match
$`
Substitutes all the text of the input string before the match
$'
Substitutes all the text of the input string after the match
$+
Substitutes the last captured group
$_
Substitutes the entire input string
Note: these are the common replacement characters; the list is not complete
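Replacement syntax differs between engines. The table above uses the .NET forms; the sketch below shows rough counterparts in Python's re.sub, where groups are written "\1" or "\g<name>" and the whole match is "\g<0>". The inputs are made up.

```python
import re

# Named groups in the replacement, similar in spirit to ${name}
swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)", r"\g<last>, \g<first>",
                 "Jerry Seinfeld")

# Numbered groups, similar in spirit to $1 and $2
by_number = re.sub(r"(\w+) (\w+)", r"\2 \1", "Elaine Benes")

# The whole match, similar in spirit to $&
bracketed = re.sub(r"\d+", r"[\g<0>]", "a1b22")

# "$" has no special meaning in a Python replacement string
dollar = re.sub(r"USD", "$", "USD 5")
```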
VII. Escape Sequences
Characters
Description
\\
Matches the character "\"
\.
Matches the character "."
\*
Matches the character "*"
\+
Matches the character "+"
\?
Matches the character "?"
\|
Matches the character "|"
\(
Matches the character "("
\)
Matches the character ")"
\{
Matches the character "{"
\}
Matches the character "}"
\^
Matches the character "^"
\$
Matches the character "$"
\n
Matches newline
\r
Matches carriage return
\t
Matches tab
\v
Matches vertical tab
\f
Matches form feed
\nnn
Matches the ASCII character specified by the octal number nnn. For example, \103 matches uppercase C
\xnn
Matches the ASCII character specified by the hexadecimal number nn. For example, \x43 matches uppercase C
\unnnn
Matches the Unicode character specified by the 4 hexadecimal digits nnnn
\cV
Matches a control character, such as \cV matches Ctrl-V
VIII. Option Flags
Option Flags
Names
I
IgnoreCase
M
Multiline
N
ExplicitCapture
S
SingleLine
X
IgnorePatternWhitespace
Note: The meaning of the options themselves is as shown in the following table:
Flags
Names
IgnoreCase
Makes pattern matching case-insensitive. The default option is case-sensitive matching
RightToLeft
Searches the input string from right to left. The default is from left to right to conform to the reading habits of English, etc., but not to the reading habits of Arabic or Hebrew
None
No flags are set. This is the default option
Multiline
Specifies that ^ and $ can match the start and end of lines, as well as the start and end of the string. This means that each line separated by a newline can be matched. However, the character "." still does not match newline
SingleLine
Specifies that the special character "." matches any character, including newline. By default, the special character "." does not match newline. Usually used together with the MultiLine option
ECMAScript
ECMA (European Computer Manufacturers Association) defined how regular expressions should behave in the ECMAScript specification, the standardized form of JavaScript. This option turns on ECMAScript-compliant behavior and can only be combined with the IgnoreCase and Multiline flags. Combining it with any other flag causes an exception
IgnorePatternWhitespace
This option removes all unescaped whitespace from the regular expression pattern, so the expression can be spread over several lines of text. Any whitespace that must be matched literally has to be escaped. When this option is set, the "#" character can also be used to add comments to the regular expression
Compiled
Compiles the regular expression into code closer to machine code. Matching is faster, but the compiled expression cannot be modified afterwards
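To see the effect of the main flags, here is a small Python sketch. The mapping to Python flag names is my assumption for illustration only: IgnoreCase roughly corresponds to re.IGNORECASE, Multiline to re.MULTILINE and SingleLine to re.DOTALL. The sample text is made up.

```python
import re

text = "Elvis lives\nelvis is alive"

# IgnoreCase: both spellings are found.
ignore_case = re.findall(r"elvis", text, re.IGNORECASE)

# Multiline: ^ also matches right after each newline.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# SingleLine: by default "." stops at a newline; with DOTALL it crosses it.
dot_default = re.search(r"lives.elvis", text)
dot_all = re.search(r"lives.elvis", text, re.DOTALL)
```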
[ Last edited by 无奈何 on 2006-10-26 at 11:51 AM ]
|

|
|
2006-10-26 11:44 |
|
|
无奈何
|
『Post #8』:
Advanced learning techniques for regular expressions
Foreword
Regular expressions (RE for short below) have always been a mysterious area for me. After seeing experts online solve text problems with a simple RE, I decided to learn it myself. Being naturally a bit lazy, I hoped for a quick way to learn, so I asked Google once again and found Mr. Jim Hollenhorst's article. After reading it I thought it was really good, so I wrote up these notes to share with friends at Move-to.Net, hoping they help a little when learning RE. The URL of Mr. Hollenhorst's article is below for anyone who wants to go straight to it.
The 30 Minute Regex Tutorial By Jim Hollenhorst
http://www.codeproject.com/useritems/RegexTutorial.asp
What is RE?
You have probably used the wildcard "*" when searching for files. For example, to find all Word files in the Windows directory, you might search for "*.doc", because "*" stands for any sequence of characters. RE does something similar, but is far more powerful.
When writing a program, you often need to check whether a string matches a specific pattern. The main job of RE is to describe that pattern, so an RE can be regarded as a pattern description. For example, "\w+" describes any non-empty string made up of letters and digits. The .NET Framework provides a powerful class library that makes it easy to use RE for searching and replacing text, decoding complex headers, validating text, and similar tasks.
The best way to learn RE is to try the examples yourself. Mr. Jim Hollenhorst also provides a tool, Expresso (time for a cup of coffee), to help us learn RE. The download URL is http://www.codeproject.com/useritems/RegexTutorial/ExpressoSetup2_1C.zip.
Next, let's experience some examples.
Some simple examples
Suppose you want to find text in which Elvis is followed by alive. With RE you might go through the following steps. The text in parentheses explains each RE:
1. elvis (find elvis)
This says that the characters to find, in order, are elvis. In .NET you can choose to ignore case, so "Elvis", "ELVIS" and "eLvIs" all match RE 1. But because it only cares that the characters e-l-v-i-s appear in order, pelvis matches RE 1 as well. RE 2 improves on this.
2. \belvis\b (find elvis as a whole word, such as elvis or Elvis when case is ignored)
"\b" has a special meaning in RE. In the example above it means a word boundary, so \belvis\b marks the boundaries before and after elvis, that is, it finds the word elvis.
Suppose you want to find text on a single line in which elvis is followed by alive. This needs two more special characters, "." and "*". "." stands for any character except the newline character, and "*" means the item before it is repeated any number of times, as many as needed for the rest of the RE to match. So ".*" means any number of characters other than newline. To find text on a single line in which elvis is followed by alive, use RE 3.
3. \belvis\b.*\balive\b (find text in which elvis is followed by alive, such as elvis is alive)
A few simple special characters are enough to build a powerful RE, but the more special characters you use, the harder the RE becomes to read.
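For anyone who wants to try REs 1 to 3 without Expresso, here is a minimal Python sketch. Python's re module and the sample sentence are my own choices for illustration, not part of the original article. re.IGNORECASE plays the role of the .NET ignore-case setting.

```python
import re

text = "Elvis has a sore pelvis, but elvis is alive"

# RE 1: also finds the "elvis" hidden inside "pelvis".
plain = re.findall(r"elvis", text, re.IGNORECASE)

# RE 2: word boundaries exclude "pelvis".
whole_word = re.findall(r"\belvis\b", text, re.IGNORECASE)

# RE 3: the word elvis, then anything on the same line, then the word alive.
followed = re.search(r"\belvis\b.*\balive\b", text, re.IGNORECASE)
```

Because ".*" is greedy and the sentence starts with Elvis and ends with alive, the match from RE 3 spans the whole sentence here.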
Let's look at another example
Matching a valid phone number
Suppose you want to collect customers' 7-digit phone numbers in the format xxx-xxxx from a web page, where x is a digit. The RE might be written like this.
4. \b\d\d\d-\d\d\d\d (find a 7-digit phone number, such as 123-1234)
Each \d stands for one digit. "-" is an ordinary hyphen. To avoid repeating \d so many times, the RE can be rewritten as in 5.
5. \b\d{3}-\d{4} (a better way to find a 7-digit phone number, such as 123-1234)
The {3} after \d means the previous item is repeated three times, which is the same as \d\d\d.
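The equivalence of REs 4 and 5 can be checked with a short Python sketch (Python and the sample text are assumptions made for illustration):

```python
import re

long_form = re.compile(r"\b\d\d\d-\d\d\d\d")   # RE 4
short_form = re.compile(r"\b\d{3}-\d{4}")      # RE 5

text = "Call 123-1234 or 555-0199 today"

found_long = long_form.findall(text)
found_short = short_form.findall(text)
```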
RE learning and testing tool Expresso
Because RE is hard to read and easy to get wrong, Mr. Jim developed a tool called Expresso to help users learn and test RE. Besides the URL given above, you can also get it from the Ultrapico website ( http://www.Ultrapico.com). After Expresso is installed, the Expression Library contains the examples from the article, so you can test while you read and modify the sample REs to see the results immediately. I find it very handy. Give it a try.
Basic concepts of RE in .NET
Special characters
Some characters have special meanings, such as the "\b", ".", "*" and "\d" seen earlier. "\s" stands for any whitespace character, such as a space, tab or newline. "\w" stands for any letter or digit character.
Let's look at some examples
6. \ba\w*\b (find a word starting with a, such as able)
This RE looks for the boundary that starts a word (\b), then the letter "a", then any number of letters or digits (\w*), then the boundary that ends the word (\b).
7. \d+ (find a string of digits)
"+" is very similar to "*", except that + requires the previous item at least once. In other words, there must be at least one digit.
8. \b\w{6}\b (find a word of six alphanumeric characters, such as ab123c)
The following table is the commonly used special characters of RE
. Any character except the newline character
\w Any alphanumeric character
\s Any whitespace character
\d Any digit character
\b Word boundary
^ Start of the text, such as "^The" to say that the text must begin with "The"
$ End of the text, such as "End$" to say that the text must end with "End"
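REs 6 to 8 can be tried on a made-up sentence with this Python sketch (Python and the sample text are assumptions for illustration; matching is case-sensitive here, so "An" is not found by RE 6):

```python
import re

text = "An able ant ate 42 apples; code ab123c"

a_words = re.findall(r"\ba\w*\b", text)      # RE 6: words starting with "a"
numbers = re.findall(r"\d+", text)           # RE 7: runs of digits
six_chars = re.findall(r"\b\w{6}\b", text)   # RE 8: words of exactly six characters
```

Note that \d+ also finds the "123" inside "ab123c", because RE 7 has no word boundaries.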
The special characters "^" and "$" require a match to sit at the start or the end of the text. This is especially useful for validating that input matches a pattern. For example, to validate a 7-digit phone number, you might use RE 9.
9. ^\d{3}-\d{4}$ (validate a 7-digit phone number)
This is the same as RE 5, except that no other characters may come before or after it, so the whole string must be exactly the 7-digit phone number. In .NET, if the Multiline option is set, "^" and "$" are tested line by line: it is enough for one line to match from its start to its end, rather than the whole text being tested as one string.
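The difference between searching and validating, and the effect of Multiline, can be sketched in Python as follows (Python, re.MULTILINE as the counterpart of the .NET option, and the sample strings are assumptions for illustration):

```python
import re

validate = re.compile(r"^\d{3}-\d{4}$")   # RE 9

ok = validate.search("123-1234") is not None              # exactly a phone number
extra = validate.search("call 123-1234 now") is not None  # surrounded by other text

# With Multiline, each line is tested on its own.
multi = re.compile(r"^\d{3}-\d{4}$", re.MULTILINE)
lines = multi.findall("home\n123-1234\nwork 555-0199")
```

Only the middle line is returned, because it is the only line that consists of nothing but the phone number.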
Escaped characters
Sometimes you need the literal meaning of "^" or "$" rather than their special meaning. The "\" character removes the special meaning of a special character, so "\^", "\." and "\\" stand for the literal characters "^", "." and "\".
Repeating the previous item
We saw earlier that "{3}" and "*" can repeat the previous character. Later we will see how the same syntax repeats a whole subexpression. The following table lists the ways to repeat the previous item.
* Repeat any number of times
+ Repeat at least once
? Repeat zero or one time
{n} Repeat n times
{n,m} Repeat at least n times, but not more than m times
{n,} Repeat at least n times
Let's try some examples
10. \b\w{5,6}\b (find a word of five or six alphanumeric characters, such as as25d, d58sdf, etc.)
11. \b\d{3}\s\d{3}-\d{4} (find a 10-digit phone number, such as 800 123-1234)
12. \d{3}-\d{2}-\d{4} (find a social security number, such as 123-45-6789)
13. ^\w* (the first word of each line or of the entire text)
In Expresso, try this with and without Multiline to see the difference.
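For readers without Expresso, REs 10 to 13 can be tried with this Python sketch (Python and the sample strings are assumptions for illustration). The last two lines show RE 13 with and without Multiline.

```python
import re

words_5_or_6 = re.findall(r"\b\w{5,6}\b", "as25d d58sdf toolongword abc")  # RE 10
phone = re.search(r"\b\d{3}\s\d{3}-\d{4}", "dial 800 123-1234").group()    # RE 11
ssn = re.search(r"\d{3}-\d{2}-\d{4}", "SSN 123-45-6789").group()           # RE 12

text = "first line\nsecond line"
first_word_only = re.findall(r"^\w*", text)                       # RE 13, no Multiline
first_word_each_line = re.findall(r"^\w*", text, re.MULTILINE)    # RE 13, Multiline
```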
Matching characters in a certain range
Sometimes you need to find one of a specific set of characters. What then? This is where square brackets "[]" come in. "[aeiou]" finds any of the vowels "a", "e", "i", "o", "u", and "[.?!]" finds any of the symbols ".", "?", "!". Special characters inside square brackets lose their special meaning and are read as literals. You can also give ranges of characters: "[a-z0-9]" means any lowercase letter or any digit.
Next, let's look at a more complex RE for finding a phone number
14. \(?\d{3}[) ]\s?\d{3}[- ]\d{4} (find a 10-digit phone number, such as (080) 333-1234 )
Such an RE finds phone numbers in several formats, such as (080) 123-4567 and 511 254 6654. "\(?" means zero or one left parenthesis "(", "[) ]" means one right parenthesis ")" or a space, and "\s?" means zero or one whitespace character. However, this RE also finds a number like "800) 453-3321", where the parentheses are not balanced. Alternatives, covered later, solve this problem.
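Here is a small Python sketch of character classes and of RE 14, including the unbalanced-parenthesis weakness just described (Python and the sample numbers are assumptions for illustration):

```python
import re

vowels = re.findall(r"[aeiou]", "regex")

phone = re.compile(r"\(?\d{3}[) ]\s?\d{3}[- ]\d{4}")   # RE 14

samples = ["(080) 333-1234", "511 254 6654", "800) 453-3321"]
# All three are accepted, including the last one with an unbalanced ")".
matched = [s for s in samples if phone.fullmatch(s)]
```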
Negation
Sometimes you need to find characters not in a certain specific character group. The following table shows how to make such a description.
\W Any character that is not alphanumeric
\S Any character that is not a whitespace character
\D Any character that is not a digit character
\B Not at the word boundary position
[^x] Any character that is not x
[^aeiou] Any character that is not a, e, i, o, u
15. \S+ (a string that does not contain whitespace characters)
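The negated forms can be tried with this Python sketch (Python and the sample strings are assumptions for illustration):

```python
import re

non_vowels = re.findall(r"[^aeiou]", "regex")            # everything except vowels
tokens = re.findall(r"\S+", "no  white\tspace\nhere")    # RE 15: runs without whitespace
digits_only = re.sub(r"\D", "", "tel: 123-4567")         # delete every non-digit
```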
Alternatives
Sometimes you need to find one of several specific choices. This is where the special character "|" comes in. For example, to find postal codes that have either five digits or nine digits with a "-":
16. \b\d{5}-\d{4}\b|\b\d{5}\b (find a 5-digit or a 9-digit (with "-") postal code)
When using alternatives, pay attention to their order, because the RE takes the leftmost alternative that matches. In 16, if the 5-digit alternative came first, the RE would only ever find 5-digit postal codes. With alternatives, 14 can be improved as follows.
17. (\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4} (a 10-digit phone number)
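The effect of the order of alternatives, and the fix that RE 17 brings, can be seen in this Python sketch (Python and the sample data are assumptions for illustration):

```python
import re

text = "ZIP 12345-6789 and 54321"

# RE 16: the 9-digit alternative is tried first.
long_first = re.findall(r"\b\d{5}-\d{4}\b|\b\d{5}\b", text)
# Reversed order: the 5-digit alternative wins and the "-6789" part is lost.
short_first = re.findall(r"\b\d{5}\b|\b\d{5}-\d{4}\b", text)

# RE 17: either "(ddd)" or "ddd", so an unbalanced ")" no longer matches.
balanced = re.compile(r"(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}")
good = [s for s in ["(080) 333-1234", "511 254 6654"] if balanced.fullmatch(s)]
bad = balanced.search("800) 453-3321")
```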
Grouping
Parentheses can be used to define a subexpression. Through the definition of the subexpression, you can repeat or perform other processing on the subexpression.
18. (\d{1,3}\.){3}\d{1,3} (a simple RE for finding an IP address)
The first part of this RE, (\d{1,3}\.){3}, means a number of one to three digits followed by a "." symbol, repeated three times. It is followed by one more number of one to three digits, giving something like 192.72.28.1.
This has a shortcoming: each number in an IP address can be at most 255, yet the RE above accepts any number of one to three digits. The numbers need to be less than 256, but an RE cannot compare numeric values directly. Example 19 uses alternatives to restrict each number to the required range, 0 to 255.
19. ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) (find an IP address)
Have you noticed that RE looks more and more like alien speech? Even for something as simple as finding an IP address, the RE is quite hard to understand at a glance.
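One way to make RE 19 less alien is to build it from a named piece. The Python sketch below does that. The variable names are my own and the sample addresses are made up, so treat it as an illustration rather than the article's method.

```python
import re

# RE 18: accepts any one- to three-digit numbers.
loose = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# One number from 0 to 255: 200-249, 250-255, or 0-199.
octet = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"
# RE 19: three "octet plus dot" groups, then a final octet.
strict = re.compile(r"(" + octet + r"\.){3}" + octet)

loose_accepts_bad = loose.fullmatch("999.999.999.999") is not None
strict_accepts_good = strict.fullmatch("192.72.28.1") is not None
strict_accepts_max = strict.fullmatch("255.255.255.255") is not None
strict_accepts_bad = strict.fullmatch("256.1.1.1") is not None
```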
Expresso Analyzer View
Expresso can turn the RE you entered into a tree-like description that explains it group by group, which makes a good debugging environment. Other features, such as Partial Match (search using only the highlighted part of the RE) and Exclude Match (search using everything except the highlighted part), are left for you to try.
When a subexpression is grouped with parentheses, the text it matches can be used in later program processing or in the RE itself. By default, groups are numbered starting from 1, from left to right. This automatic numbering can be seen in Expresso's skeleton view or result view.
Backreference is used to find the same text as the matched text captured in the group. For example, "\1" refers to the text captured in group 1.
20. \b(\w+)\b\s*\1\b (find repeated words, meaning the same word twice with whitespace in between, such as dog dog)
(\w+) captures a word of at least one letter or digit and stores it as group 1. The RE then looks for any whitespace, followed by the same text as group 1.
If you don't like the automatic group number 1, you can name the group yourself. In the example above, (\w+) is rewritten as (?<Word>\w+), which names the captured group Word. The backreference is then rewritten as \k<Word>
21. \b(?<Word>\w+)\b\s*\k<Word>\b (use a self-named group to capture repeated words)
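REs 20 and 21 can be tried in Python with one caveat: Python writes named groups as (?P<Word>...) and the named backreference as (?P=Word), instead of the .NET forms (?<Word>...) and \k<Word>. The sketch below uses the Python spelling, and the sample sentence is made up for illustration.

```python
import re

text = "the dog dog barked at at the cat"

numbered = re.compile(r"\b(\w+)\b\s*\1\b")                  # RE 20
named = re.compile(r"\b(?P<Word>\w+)\b\s*(?P=Word)\b")      # RE 21, Python syntax

repeats_numbered = [m.group(1) for m in numbered.finditer(text)]
repeats_named = [m.group("Word") for m in named.finditer(text)]
```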
Parentheses are used in many other special syntax elements. The more common ones are listed below:
Captures
(exp) Match exp and capture it into an automatically named group
(?<name>exp) Match exp and capture it into a group named name
(?:exp) Match exp, do not capture it
Lookarounds
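(?=exp) Match a position followed by the suffix exp (lookahead)
(?<=exp) Match a position preceded by the prefix exp (lookbehind)
(?!exp) Match a position not followed by the suffix exp
(?<!exp) Match a position not preceded by the prefix exp
30. (?<=<(\w+)>).*(?=<\/\1>) (text between HTML tags)
This uses lookahead and lookbehind assertions to extract the text between HTML tags, excluding the tags themselves.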
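Lookarounds can be tried in Python too, with one limitation worth stating: Python's lookbehind must have a fixed width, so a lookbehind containing (\w+) is accepted by .NET but not by Python's re module. The sketch below therefore uses a fixed tag in the lookbehind and, as an alternative of my own for arbitrary tags, a plain capturing group. The sample strings are made up.

```python
import re

# Lookahead: the part of a word that comes before a final "ing".
stem = re.search(r"\b\w+(?=ing\b)", "I am singing").group()

# Fixed-width lookbehind plus lookahead: text between <b> and </b>.
inner = re.search(r"(?<=<b>).*(?=</b>)", "<b>bold text</b>").group()

# Any tag name: capture the inner text as group 2 instead of using lookbehind.
any_tag = re.search(r"<(\w+)>(.*)</\1>", "<em>hello</em>").group(2)
```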
Comments Please
Parentheses have one more special use: enclosing comments, with the syntax "(?#comment)". If the "Ignore Pattern Whitespace" option is set, whitespace in the RE is ignored when the RE is used, and any text from "#" to the end of the line is ignored as well.
31. Text between HTML tags, plus comments
(?<= #Find the prefix, but do not include it
<(\w+)> #HTML tag
) #End the prefix search
.* #Match any text
(?= #Find the suffix, but do not include it
<\/\1> #Match the string captured in group 1, that is, the HTML tag in the earlier parentheses
) #End the suffix search
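The same commenting style works in Python through the re.VERBOSE flag, which I am assuming here as the counterpart of Ignore Pattern Whitespace. Because Python does not allow the variable-width lookbehind of RE 31, this sketch captures the inner text in a group instead. It is an adaptation for illustration, not the article's pattern.

```python
import re

tag_text = re.compile(r"""
    <(\w+)>     # opening HTML tag, name captured as group 1
    (.*)        # any text, captured as group 2
    </\1>       # closing tag with the same name as group 1
""", re.VERBOSE)

content = tag_text.search("<p>commented pattern</p>").group(2)
```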
Greedy and Lazy
When an RE contains a repetition that can match a range of lengths (such as ".*"), it normally matches as many characters as possible. This is greedy matching. For example:
32. a.*b (the longest match that starts with a and ends with b)
For the string "aabab", this RE matches "aabab", the longest possible match. Sometimes you want the shortest match instead, which is lazy matching. Adding a question mark (?) to any of the quantifiers in the repetition table makes it lazy. So "*?" means repeat any number of times, but as few times as possible. For example:
33. a.*?b (the shortest match that starts with a and ends with b)
For the string "aabab", this RE first matches "aab" and then "ab", because each match uses as few characters as possible.
*? Repeat any number of times, but as few as possible
+? Repeat at least once, but as few times as possible
?? Repeat zero or one time, but as few times as possible
{n,m}? Repeat at least n times, but not more than m times, and as few times as possible
{n,}? Repeat at least n times, but as few times as possible
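The "aabab" example can be reproduced with two lines of Python (Python is an assumption for illustration; the string is the one used in the text):

```python
import re

greedy = re.findall(r"a.*b", "aabab")    # RE 32: one long match
lazy = re.findall(r"a.*?b", "aabab")     # RE 33: two short matches
```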
What else is not mentioned?
So far we have covered many of the elements used to build an RE, but many others remain. The following table lists some of them. The number in the leftmost column is the number of the matching example in Expresso.
# Syntax Description
\a Bell character
\b Normally a word boundary, but inside a character class it means backspace
\t Tab
34 \r Carriage return
\v Vertical tab
\f Form feed
35 \n New line
\e Escape
36 \nnn Character whose ASCII octal code is nnn
37 \xnn Character whose hexadecimal code is nn
38 \unnnn Character whose Unicode code is nnnn
39 \cN Control-N character. For example, Ctrl-M is \cM
40 \A Start of the string (similar to ^, but not affected by the multiline option)
41 \Z End of the string, or just before a final newline
\z End of the string
42 \G Start of the current search
43 \p{name} Any character in the Unicode class named name. For example, \p{Lowercase_Letter} means lowercase letters
(?>exp) Greedy subexpression, also known as a non-backtracking subexpression. It is matched only once and is not backtracked into.
44 (?<x-y>exp)
or (?<-y>exp) Balancing group. Complex but useful. It lets named capture groups be manipulated on a stack. (I don't really understand this one either)
45 (?im-nsx:exp) Change the RE options for the subexpression exp. For example, (?-i:Elvis) turns off the ignore-case option for Elvis.
46 (?im-nsx) Change the RE options for the rest of the enclosing group.
(?(exp)yes|no) The subexpression exp is treated as a zero-width positive lookahead. If it matches at this point, the yes subexpression is matched next. If not, the no subexpression is matched next.
(?(exp)yes) Same as above but without the no subexpression
(?(name)yes|no) If name is a valid group name and that group has captured text, the yes subexpression is matched next. If not, the no subexpression is matched next.
47 (?(name)yes) Same as above but without the no subexpression
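A few of these elements also exist in Python's re module, so they can be tried there. This is only a partial sketch under that assumption, and the details differ from .NET: for example, Python's \Z matches only at the very end of the string, and Python has no \G or balancing groups. The sample strings are made up.

```python
import re

text = "one\ntwo"

# ^ with Multiline matches at every line start; \A only at the string start.
caret_multiline = re.findall(r"^\w+", text, re.MULTILINE)
string_start_only = re.findall(r"\A\w+", text, re.MULTILINE)

# Scoped options: ignore case overall, but match "Elvis" case-sensitively.
scoped = re.compile(r"(?i)the (?-i:Elvis)")
exact_case = scoped.search("THE Elvis") is not None
wrong_case = scoped.search("the elvis") is not None
```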
[ Last edited by 无奈何 on 2006-10-26 at 11:53 AM ]
|

|
|
2006-10-26 11:44 |
|
|
无奈何
|
『Post #9』:
Advanced learning techniques for regular expressions
Foreword
Regular expressions (abbreviated RE below) have always been a mysterious area for me. After seeing experts on the Internet solve text problems with RE so easily, I decided to learn it too. Being a bit lazy by nature, I hoped to find a quick way to learn, so I turned to Google and found an article by Mr. Jim Hollenhorst. I thought it was really good, so I wrote this short summary to share with the friends at Move-to.Net, hoping it helps a little in learning RE. The article is at the URL below for anyone who wants to read it directly.
The 30 Minute Regex Tutorial by Jim Hollenhorst
http://www.codeproject.com/useritems/RegexTutorial.asp
What is RE?
You have probably used the wildcard "*" when searching for files. For example, to find all Word files in the Windows directory you might search for "*.doc", because "*" stands for any sequence of characters. RE does something similar, but is far more powerful.
When writing a program, you often need to check whether a string matches a specific pattern. The main job of an RE is to describe that pattern, so an RE can be regarded as a description of a specific pattern. For example, "\w+" stands for any non-empty string made up of letters and digits. The .NET Framework provides a very powerful class library that makes it easy to use RE for searching and replacing text, decoding complex headers, validating text, and so on.
The best way to learn RE is through examples. Mr. Jim Hollenhorst also provides a tool, Expresso (as in a cup of coffee), to help us learn RE. It can be downloaded from
http://www.codeproject.com/useritems/RegexTutorial/ExpressoSetup2_1C.zip
Next, let's experience some examples.
Some simple examples
Suppose you want to find text in which Elvis is followed by alive. With RE you might go through the following steps (the text in parentheses explains each RE):
1. elvis (search for elvis)
This searches for the characters e, l, v, i, s in that order. In .NET you can choose to ignore case, so "Elvis", "ELVIS", and "eLvIs" all match RE 1. But because it only cares that the characters appear in the order elvis, "pelvis" matches RE 1 as well. RE 2 improves on this.
2. \belvis\b (search for elvis as a whole word, such as elvis, or Elvis when ignoring case)
"\b" has a special meaning in RE. In the example above it stands for a word boundary, so \belvis\b marks the boundaries before and after elvis and finds only the whole word elvis.
Suppose you want to find elvis followed by alive on the same line. You then need two more special characters, "." and "*". "." stands for any character except a newline, and "*" repeats the item before it any number of times. So ".*" means any number of characters other than newlines, and RE 3 finds elvis followed by alive on the same line.
3. \belvis\b.*\balive\b (search for elvis followed by alive, such as elvis is alive)
Simple special characters can be combined into a powerful RE, but the more special characters you use, the harder the RE becomes to read.
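As a rough illustration of REs 1 to 3, here is a small sketch that uses Python's re module instead of .NET (that choice is an assumption of this example; these three patterns are written the same way in both). The sample text and variable names are made up for the demonstration.

```python
import re

# Hypothetical sample text: four lines, only the first contains both words.
text = "Elvis is alive\npelvis exam\nElvis left\nstill alive"

# RE 1: a plain character sequence, so the "elvis" inside "pelvis" matches too.
loose = re.findall(r"elvis", text, flags=re.IGNORECASE)

# RE 2: \b on both sides restricts the match to the whole word.
whole_word = re.findall(r"\belvis\b", text, flags=re.IGNORECASE)

# RE 3: "." does not match a newline here, so "alive" must be on the same line.
same_line = re.findall(r"\belvis\b.*\balive\b", text, flags=re.IGNORECASE)

print(loose)       # ['Elvis', 'elvis', 'Elvis']
print(whole_word)  # ['Elvis', 'Elvis']
print(same_line)   # ['Elvis is alive']
```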
Let's look at another example.
Matching a valid phone number
Suppose you want to collect 7-digit phone numbers of the form xxx-xxxx from a web page, where x is a digit. The RE might be written like this:
4. \b\d\d\d-\d\d\d\d (search for a 7-digit phone number, such as 123-1234)
Each \d stands for one digit, and "-" is an ordinary hyphen. To avoid repeating \d so many times, the RE can be rewritten as in 5.
5. \b\d{3}-\d{4} (a better way to search for a 7-digit phone number, such as 123-1234)
The {3} after \d repeats the preceding item three times, so \d{3} is equivalent to \d\d\d.
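A minimal sketch of the two phone-number patterns, again assuming Python's re module in place of .NET. The sample sentence is invented; the point is only that the long form and the {n} form find the same numbers.

```python
import re

long_form = re.compile(r"\b\d\d\d-\d\d\d\d")   # RE 4
short_form = re.compile(r"\b\d{3}-\d{4}")      # RE 5, using {n} repetition

# Made-up sample text; "12-34567" has the wrong digit grouping.
sample = "Call 123-1234 or 555-0199, not 12-34567."

numbers = short_form.findall(sample)
same_result = long_form.findall(sample) == numbers

print(numbers)      # ['123-1234', '555-0199']
print(same_result)  # True
```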
RE learning and testing tool Expresso
Because REs are hard to read and easy to get wrong, Mr. Jim developed a tool, Expresso, to help users learn and test RE. Besides the URL mentioned above, it is also available from the Ultrapico website (http://www.Ultrapico.com). After you install Expresso, its Expression Library contains all the examples from the article, so you can test them while reading, modify them, and see the results immediately. I find it very easy to use; give it a try.
Basic concepts of RE in .NET
Special characters
Some characters have special meanings, such as the "\b", ".", "*", and "\d" seen above. "\s" stands for any whitespace character, such as a space, tab, or newline. "\w" stands for any letter or digit.
Let's look at some more examples.
6. \ba\w*\b (search for words starting with a, such as able)
This RE looks for the start of a word (\b), then the letter "a", then any number of letters and digits (\w*), then the end of the word (\b).
7. \d+ (search for a string of digits)
"+" is very similar to "*", except that + repeats the preceding item at least once. That is, there must be at least one digit.
8. \b\w{6}\b (search for a word of six letters or digits, such as ab123c)
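REs 6 to 8 can be tried out with a short sketch like the one below. It assumes Python's re module rather than .NET, and the sample sentence is made up.

```python
import re

# Hypothetical sample sentence for the three patterns.
sample = "an able ab123c costs 42 dollars"

# RE 6: whole words that start with "a".
words_with_a = re.findall(r"\ba\w*\b", sample)

# RE 7: runs of one or more digits (also found inside "ab123c").
digit_runs = re.findall(r"\d+", sample)

# RE 8: whole words of exactly six letters or digits.
six_chars = re.findall(r"\b\w{6}\b", sample)

print(words_with_a)  # ['an', 'able', 'ab123c']
print(digit_runs)    # ['123', '42']
print(six_chars)     # ['ab123c']
```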
The following table shows the commonly used special characters in RE
. Any character except the newline character
\w Any alphanumeric character
\s Any whitespace character
\d Any digit character
\b Define word boundary
^ Start of the text; for example, "^The" matches "The" at the start of the text
$ End of the text; for example, "End$" matches "End" at the end of the text
The special characters "^" and "$" require a match to occur at the start or end of the text. This is especially useful when validating that input conforms to a pattern. For example, to validate a 7-digit phone number you might use RE 9.
9. ^\d{3}-\d{4}$ (validate a 7-digit phone number)
This is RE 5 anchored at both ends: no other characters may appear before or after it, so the whole string must be exactly the 7-digit phone number. In .NET, if the Multiline option is set, "^" and "$" are tested line by line, so the RE matches whenever the start and end of a single line conform, rather than comparing against the whole text at once.
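The effect of the Multiline option can be sketched as follows. This assumes Python's re module, where the corresponding flag is re.MULTILINE; the .NET option described above behaves analogously for this pattern. The input strings are invented.

```python
import re

phone = re.compile(r"^\d{3}-\d{4}$")   # RE 9

# Validation: the whole input must be the phone number.
exact_input = phone.search("123-1234") is not None
padded_input = phone.search("call 123-1234 now") is not None

# Three lines of made-up text; two of them are phone numbers.
lines = "123-1234\nnot a number\n555-0199"

# Without the flag, ^ and $ refer to the whole text, so nothing matches.
whole_text = re.findall(r"^\d{3}-\d{4}$", lines)

# With the flag, ^ and $ are tested against each line.
per_line = re.findall(r"^\d{3}-\d{4}$", lines, flags=re.MULTILINE)

print(exact_input, padded_input)  # True False
print(whole_text)                 # []
print(per_line)                   # ['123-1234', '555-0199']
```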
Escaped characters
Sometimes you need the literal "^" or "$" rather than its special meaning. In that case, put a "\" in front of the character to remove its special meaning. Thus "\^", "\.", and "\\" stand for the literal characters "^", ".", and "\" respectively.
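A quick sketch of escaping, assuming Python's re module. The helper re.escape is a Python convenience that is not mentioned in the article; it is shown here only as one way to escape a whole literal string at once.

```python
import re

# An escaped dot matches only a literal ".", an unescaped one matches anything.
dots = re.findall(r"\.", "a.b")
anything = re.findall(r".", "a.b")

# Escaped caret and escaped backslash match those characters literally.
carets = re.findall(r"\^", "2^10")
backslashes = re.findall(r"\\", r"C:\temp")

# re.escape (Python-specific) escapes every special character in a string.
literal = "1.5^2"
escaped = re.escape(literal)
matches_itself = re.fullmatch(escaped, literal) is not None
matches_other = re.fullmatch(escaped, "1x5^2") is not None

print(dots, anything)                 # ['.'] ['a', '.', 'b']
print(carets, len(backslashes))       # ['^'] 1
print(matches_itself, matches_other)  # True False
```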
Repeat the previous item
We have seen that "{3}" and "*" can be used to repeat the preceding character. Later we will see how the same syntax can repeat an entire subexpression. The following table lists the ways to repeat the preceding item.
* Repeat any number of times
+ Repeat at least once
? Repeat zero or one time
{n} Repeat n times
{n,m} Repeat at least n times, but not more than m times
{n,} Repeat at least n times
Let's try some more examples.
10. \b\w{5,6}\b (search for words of five or six letters or digits, such as as25d, d58sdf)
11. \b\d{3}\s\d{3}-\d{4} (search for a 10-digit phone number, such as 800 123-1234)
12. \d{3}-\d{2}-\d{4} (search for a social security number, such as 123-45-6789)
13. ^\w* (the first word of each line or of the whole text)
In Expresso, try this with and without the Multiline option to see the difference.
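REs 10 to 13 can also be tried outside Expresso. The sketch below assumes Python's re module (with re.MULTILINE standing in for the Multiline option) and uses invented sample strings.

```python
import re

# RE 10: words of five or six letters or digits.
five_or_six = re.findall(r"\b\w{5,6}\b", "as25d d58sdf toolongword abc")

# RE 11 and RE 12: a 10-digit phone number and a social security number.
phone10 = re.search(r"\b\d{3}\s\d{3}-\d{4}", "Dial 800 123-1234 today").group()
ssn = re.search(r"\d{3}-\d{2}-\d{4}", "SSN: 123-45-6789").group()

# RE 13 with and without the multiline flag.
text = "first line\nsecond line"
first_of_text = re.findall(r"^\w*", text)
first_of_each_line = re.findall(r"^\w*", text, flags=re.MULTILINE)

print(five_or_six)         # ['as25d', 'd58sdf']
print(phone10, ssn)        # 800 123-1234 123-45-6789
print(first_of_text)       # ['first']
print(first_of_each_line)  # ['first', 'second']
```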
Match characters in a certain range
What if you need to find one of several specific characters? This is where square brackets "[]" come in handy. [aeiou] finds any of the vowels "a", "e", "i", "o", "u", and [.?!] finds any of the symbols ".", "?", "!". Most special characters inside square brackets lose their special meaning and are taken literally. You can also specify ranges of characters, such as "[a-z0-9]", which means any lowercase letter or any digit.
Next, let's look at a more complex RE for finding phone numbers.
14. \(?\d{3}[) ]\s?\d{3}[- ]\d{4} (search for a 10-digit phone number, such as (080) 333-1234)
This RE finds phone numbers in more formats, such as (080) 123-4567 or 511 254 6654. "\(?" stands for zero or one left parenthesis "(", "[) ]" stands for one right parenthesis ")" or one space, and "\s?" stands for zero or one whitespace character. But this RE also finds a phone number like "800) 456-3321", in which the parentheses are not balanced. Alternatives, covered later, solve this problem.
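The character classes and RE 14 can be checked with a sketch like this one, assuming Python's re module. The candidate strings are made up; the last one shows the unbalanced-parenthesis weakness described above.

```python
import re

vowels = re.findall(r"[aeiou]", "regex")
marks = re.findall(r"[.?!]", "Hi! Ok? Yes.")
runs = re.findall(r"[a-z0-9]+", "abc-123 XYZ")

# RE 14: accepts several phone formats, including an unbalanced ")".
phone = re.compile(r"\(?\d{3}[) ]\s?\d{3}[- ]\d{4}")
candidates = ["(080) 333-1234", "511 254 6654", "800) 456-3321"]
accepted = {c: bool(phone.fullmatch(c)) for c in candidates}

print(vowels)  # ['e', 'e']
print(marks)   # ['!', '?', '.']
print(runs)    # ['abc', '123']
print(accepted)
```

All three candidates are accepted, including the one with only a closing parenthesis.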
Negation
Sometimes you need to find characters not included in a certain specific character group. The following table shows how to make such a description.
\W Any character that is not alphanumeric
\S Any character that is not a whitespace character
\D Any character that is not a digit character
\B Not at the word boundary position
[^x] Any character that is not x
[^aeiou] Any character that is not a, e, i, o, u
15. \S+ (a string that does not contain whitespace characters)
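The negated forms in the table can be sketched as follows, assuming Python's re module, where these escapes behave the same way for plain ASCII input. The sample strings are invented.

```python
import re

non_space = re.findall(r"\S+", "no  white\tspace")   # RE 15
non_vowels = re.findall(r"[^aeiou]", "audio")
non_digits = re.findall(r"\D+", "a1b22c")
non_word = re.findall(r"\W", "a-b c")

# \B: "is" only where it is NOT at the start of a word (inside "this").
inside_word = [m.start() for m in re.finditer(r"\Bis", "this is")]

print(non_space)    # ['no', 'white', 'space']
print(non_vowels)   # ['d']
print(non_digits)   # ['a', 'b', 'c']
print(non_word)     # ['-', ' ']
print(inside_word)  # [2]
```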
Alternatives
Sometimes you need to match any one of several choices. This is where the special character "|" comes in handy. For example, to find both 5-digit and 9-digit (with a "-") postal codes:
16. \b\d{5}-\d{4}\b|\b\d{5}\b (search for 5-digit and 9-digit (with a "-") postal codes)
When using alternatives, pay attention to their order, because the RE tries them from left to right and takes the first one that matches. In 16, if the 5-digit alternative were placed first, the RE would only ever find the 5-digit part of a postal code. With alternatives, 14 can be improved:
17. (\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4} (a 10-digit phone number)
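The effect of the order of alternatives, and the improvement RE 17 makes over RE 14, can be sketched like this. The example assumes Python's re module; the reordered pattern and the sample strings are made up for illustration.

```python
import re

zip_re = re.compile(r"\b\d{5}-\d{4}\b|\b\d{5}\b")         # RE 16
zip_reordered = re.compile(r"\b\d{5}\b|\b\d{5}-\d{4}\b")  # alternatives swapped

text = "12345-6789 and 54321"
in_order = zip_re.findall(text)
swapped = zip_reordered.findall(text)

# RE 17: parentheses around the area code must now be balanced.
phone17 = re.compile(r"(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}")
candidates = ["(800) 456-3321", "800 456-3321", "800) 456-3321"]
accepted = {c: bool(phone17.fullmatch(c)) for c in candidates}

print(in_order)  # ['12345-6789', '54321']
print(swapped)   # ['12345', '54321']
print(accepted)
```

With the alternatives swapped, the 9-digit code is cut down to its first five digits, and RE 17 rejects the candidate that has only a closing parenthesis.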
Grouping
Parentheses can be used to define a subexpression. Through the definition of the subexpression, you can perform repetition or other processing on the subexpression.
18. (\d{1,3}\.){3}\d{1,3} (a simple RE for finding an IP address)
The first part, (\d{1,3}\.), matches a number of one to three digits followed by a "." symbol. This group is repeated three times and then followed by one more number of one to three digits, giving something like 192.72.28.1.
This has a shortcoming: each number in an IP address can be at most 255, but the RE above accepts any number of one to three digits. The numbers need to be less than 256, and RE has no way to compare numeric values directly. RE 19 uses alternatives to restrict each number to the required range of 0 to 255.
19. ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) (search for an IP address)
Have you noticed that RE looks more and more like an alien language? Even for something as simple as finding an IP address, the RE is quite hard to understand at a glance.
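To see the difference between the simple and the range-checked IP patterns, here is a small sketch that assumes Python's re module. The name octet is a hypothetical helper introduced only to avoid typing the 0 to 255 alternative twice; the assembled pattern is the same as RE 19.

```python
import re

simple = re.compile(r"(\d{1,3}\.){3}\d{1,3}")    # RE 18

# One number from 0 to 255: 200-249, 250-255, or 0-199.
octet = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"
strict = re.compile(r"(" + octet + r"\.){3}" + octet)   # RE 19

addresses = ["192.72.28.1", "255.255.255.255", "256.1.1.1", "999.999.999.999"]

# For each address: (accepted by RE 18, accepted by RE 19).
results = {a: (bool(simple.fullmatch(a)), bool(strict.fullmatch(a)))
           for a in addresses}

for address, verdict in results.items():
    print(address, verdict)
```

The simple pattern accepts all four strings, while the stricter one rejects the two that contain a number above 255.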
Expresso Analyzer View
Expresso can turn the RE you enter into a tree-shaped explanation, broken down group by group, which makes a good debugging environment. Other features, such as Partial Match (search using only the highlighted part of the RE) and Exclude Match (search using the RE with the highlighted part left out), are left for you to try.
When a subexpression is grouped with parentheses, the text it matches can be used later by the program or within the RE itself. By default, matched groups are named with numbers, starting from 1 and counting from left to right. This automatic naming can be seen in Expresso's skeleton view or result view.
A backreference finds the same text that a group has already captured. For example, "\1" refers to the text captured by group 1.
20. \b(\w+)\b\s*\1\b (search for repeated words, that is, the same word twice with whitespace in between, such as dog dog)
(\w+) captures a word of at least one letter or digit as group 1; the RE then looks for any whitespace, followed by the same text as group 1.
If you don't like the automatic name 1, you can name the group yourself. In the example above, (\w+) becomes (?<Word>\w+), which names the captured group Word, and the backreference becomes \k<Word>.
21. \b(?<Word>\w+)\b\s*\k<Word>\b (use a named group to find repeated words)
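Backreferences can be sketched as below. This example assumes Python's re module, which spells named groups differently from .NET: (?P<Word>...) defines the group and (?P=Word) refers back to it. The numbered form \1 is the same in both. The sample sentence is invented.

```python
import re

sentence = "the dog dog barked at at the cat"

# RE 20: numbered backreference; findall returns the text of group 1.
numbered = re.findall(r"\b(\w+)\b\s*\1\b", sentence)

# RE 21 in Python spelling: (?P<Word>...) and (?P=Word) instead of
# the .NET forms (?<Word>...) and \k<Word>.
named_re = re.compile(r"\b(?P<Word>\w+)\b\s*(?P=Word)\b")
named = [m.group("Word") for m in named_re.finditer(sentence)]

print(numbered)  # ['dog', 'at']
print(named)     # ['dog', 'at']
```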
There are many special syntax elements when using parentheses. The more common list is as follows:
Captures
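(exp) Match exp and capture it in an automatically numbered group
(?<name>exp) Match exp and capture it in a group named name
(?:exp) Match exp, but do not capture it
Lookarounds
(?=exp) Match a position that is followed by exp
(?<=exp) Match a position that is preceded by exp
(?!exp) Match a position that is not followed by exp
(?<!exp) Match a position that is not preceded by exp
30. (?<=<(\w+)>).*(?=<\/\1>) (text between HTML tags)
This uses lookahead and lookbehind assertions to extract the text between HTML tags, without including the tags themselves.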
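A sketch of lookahead and lookbehind, assuming Python's re module. One difference matters here: Python only accepts fixed-width lookbehind, so the .NET pattern for arbitrary tags does not compile there. The example therefore shows a fixed tag with lookarounds, and then swaps in a plain capturing group (a substitute technique, not the article's RE) for arbitrary tags. The sample HTML is made up.

```python
import re

html = "<p><b>bold text</b></p>"

# Fixed-width prefix and suffix: the tags are checked but not included.
inner_b = re.search(r"(?<=<b>).*(?=</b>)", html).group()

# The variable-width lookbehind used for arbitrary tags is rejected by Python.
try:
    re.compile(r"(?<=<(\w+)>).*(?=</\1>)")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Substitute for arbitrary tags: match the tags and read the text from group 2.
inner_any = re.search(r"<(\w+)>(.*)</\1>", "<i>hello</i>").group(2)

print(inner_b)            # bold text
print(variable_width_ok)  # False
print(inner_any)          # hello
```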
Comments Please
Parentheses have another special use: enclosing comments, with the syntax "(?#comment)". In addition, if the "Ignore Pattern Whitespace" option is set, whitespace in the RE is ignored, and so is any text after a "#" up to the end of the line.
31. Text between HTML tags, with comments
(?<= #Find the prefix, but do not include it
<(\w+)> #HTML tag
) #End the prefix to find
.* #Match any text
(?= #Find the suffix, but do not include it
<\/\1> #Match the text captured by group 1, that is, the HTML tag name in the earlier parentheses
) #End the suffix to find
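Commented patterns can be sketched as follows, assuming Python's re module, where the flag re.VERBOSE plays the role of the "Ignore Pattern Whitespace" option. Because Python does not allow the variable-width lookbehind of RE 31, this sketch uses a capturing-group variant of the same idea; the sample strings are invented.

```python
import re

# Whitespace and "#" comments inside the pattern are ignored under re.VERBOSE.
tag_text = re.compile(r"""
    <(\w+)>   # opening HTML tag; its name is captured as group 1
    (.*)      # any text, captured as group 2
    </\1>     # closing tag with the same name as group 1
""", re.VERBOSE)

inner = tag_text.search("<em>some text</em>").group(2)

# Inline (?#...) comments work without any option.
inline = re.search(r"\d{3}(?#three digits)-\d{4}(?#four digits)", "123-4567").group()

print(inner)   # some text
print(inline)  # 123-4567
```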
Greedy and Lazy
When an RE contains an open-ended repetition (such as ".*"), it normally matches as many characters as possible. This is greedy matching. For example:
32. a.*b (the longest match that starts with a and ends with b)
For the string "aabab", this RE matches "aabab", because it looks for the match with the most characters. Sometimes you want the match with the fewest characters instead, which is lazy matching. Adding a question mark (?) to any entry in the repetition table turns it into its lazy version. So "*?" means repeat any number of times, but as few times as possible. For example:
33. a.*?b (the shortest match that starts with a and ends with b)
For the string "aabab", this RE first matches "aab" and then "ab", because it looks for the match with the fewest characters.
*? Repeat any number of times, but as few as possible
+? Repeat at least once, but as few times as possible
?? Repeat zero or one time, but as few as possible
{n,m}? Repeat at least n times but not more than m times, as few as possible
{n,}? Repeat at least n times, but as few as possible
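The greedy and lazy examples above can be reproduced with a short sketch, assuming Python's re module (the quantifiers behave the same way as described for .NET here).

```python
import re

text = "aabab"

greedy = re.findall(r"a.*b", text)   # RE 32
lazy = re.findall(r"a.*?b", text)    # RE 33

print(greedy)  # ['aabab']
print(lazy)    # ['aab', 'ab']
```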
What else is not mentioned?
So far many building blocks of RE have been covered, but of course many remain. The following table lists some of them. The number in the leftmost column is the number of the corresponding example in Expresso.
# Syntax Explanation
\a Bell character
\b Normally a word boundary; inside a character class it stands for backspace
\t Tab
34 \r Carriage return
\v Vertical tab
\f Form feed
35 \n New line
\e Escape
36 \nnn Character whose ASCII code in octal is nnn
37 \xnn Character whose code in hexadecimal is nn
38 \unnnn Character whose Unicode code is nnnn
39 \cN Control-N character; for example, Ctrl-M is \cM
40 \A Start of the string (like ^, but unaffected by the Multiline option)
41 \Z End of the string, or just before a newline at the end of the string
\z End of the string
42 \G Start of the current search
43 \p{name} Any character in the Unicode character class named name; for example, \p{Lowercase_Letter} refers to lowercase letters
(?>exp) Greedy subexpression, also known as a non-backtracking subexpression. It is matched only once and is not backtracked into.
44 (?<name1-name2>exp) or (?<-name2>exp) Balancing group. Complex but useful. It lets named capture groups be manipulated like a stack. (I don't fully understand this one either.)
45 (?im-nsx:exp) Change the RE options for the subexpression exp; for example, (?-i:Elvis) turns off the ignore-case option for Elvis
46 (?im-nsx) Change the RE options for the rest of the enclosing group
(?(exp)yes|no) The subexpression exp is treated as a zero-width positive lookahead. If it matches, the yes subexpression is matched next; otherwise the no subexpression is matched next.
(?(exp)yes) Same as above, but without the no subexpression
(?(name)yes|no) If name is a valid group name and that group has captured a match, the yes subexpression is matched next; otherwise the no subexpression is matched next.
47 (?(name)yes) Same as above, but without the no subexpression
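A few of the constructs in this table can be tried in a sketch like the one below. It assumes Python's re module, which supports \A, scoped options such as (?-i:exp), and conditionals on a group, but not every .NET construct listed above (balancing groups, for example, are not available). The patterns and sample strings are made up for illustration.

```python
import re

text = "one\ntwo"

# ^ follows the multiline flag, \A always means the start of the string.
caret = re.findall(r"^\w+", text, flags=re.MULTILINE)
anchor_a = re.findall(r"\A\w+", text, flags=re.MULTILINE)

# Scoped option: ignore case everywhere except inside (?-i:...).
scoped = re.compile(r"(?-i:Elvis) lives", re.IGNORECASE)
exact_case = scoped.search("Elvis LIVES") is not None
wrong_case = scoped.search("elvis lives") is not None

# Conditional: require ")" only if group 1 captured an opening "(".
area = re.compile(r"(\()?\d{3}(?(1)\))")
balanced = {s: bool(area.fullmatch(s)) for s in ["(800)", "800", "(800", "800)"]}

print(caret, anchor_a)         # ['one', 'two'] ['one']
print(exact_case, wrong_case)  # True False
print(balanced)
```

The conditional accepts "(800)" and "800" and rejects the two unbalanced forms.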
[ Last edited by 无奈何 on 2006-10-26 at 12:22 PM ]
|
|
2006-10-26 11:44 |
|
|
electronixtar
Platinum Member
Points 7493
Posts 2672
Registered 2005-9-2
Status Offline
|
Post #10:
Bumping the first post. RegExp is handy but hard to remember, and the syntax differs between programs. findstr, UltraEdit, Perl, Python, and JavaScript all do it differently, which is frustrating.
|
2006-10-26 11:51 |
|
|
chenall
Silver Member
Points 1276
Posts 469
Registered 2002-12-23, from Quanzhou, Fujian
Status Offline
|
|
2006-10-26 20:17 |
|
|
redtek
Gold Member
Points 2902
Posts 1147
Registered 2006-9-21
Status Offline
|
Post #12:
A very useful classic, bookmarked ~ :)
|
2006-10-26 20:45 |
|
|
lxmxn
Moderator
Points 11386
Posts 4938
Registered 2006-7-23
Status Offline
|
Post #13:
Regular expressions are a must-learn... hehe... bookmarked... thanks to the moderator for the hard work...
|
|
2006-10-27 00:17 |
|
|
IceCrack
Intermediate User
Friend of DOS
Points 332
Posts 168
Registered 2005-10-6, from Tianya
Status Offline
|
Post #14:
Regular expressions look simple, but using them well is not easy.
|
2006-10-27 00:21 |
|
|
vkill
Gold Member
Points 4103
Posts 1744
Registered 2006-1-20, from Linze, Gansu
Status Offline
|
Post #15:
Using them well is really hard. Time to study first.
|
|
2006-10-27 01:28 |
|
|