Web Article Download and Organization - Experiencing VIM
Author: 无奈何@cn-dos.net
VIM's powerful features and flexible customizability make its fans obsessive about it. I have only just started using it, but I can already see how remarkable it is. Below, a real-world example shows off VIM's text processing and its ability to cooperate with other programs. I hope this hands-on walkthrough sparks your interest in VIM. This article does not cover VIM's basic operations in any depth; it explains only the commands and operations the example needs. To learn more, consult the manual or search the web for the many excellent articles about VIM.
I came across "Mr. YY's Quotations" (《YY先生语录》) online and really liked the author's humor, which is rooted in everyday life. I wanted to collect all of the quotations, which naturally led me to the author's home page:
http://web.mblogger.cn/philewar/category/612.aspx
Our job is to collect every article in that list. The plan has three steps: first build the list of index pages and download them; then extract the article links from those pages and download every linked page; finally merge the pages in order and pull out the content we want.
Looking at the link above, the index pages follow a regular pattern, from ...612.aspx?p=1 to ...612.aspx?p=13. Let's get to work. VIM must of course be installed and running; either gvim or vim will do. gvim starts in normal mode. We want to keep all the files in one directory, say D:\YY, and that takes a few commands, so we need to switch to command-line mode. It sounds complicated but is actually simple: type ":" and the cursor jumps to the bottom line of the window, waiting for a command. Enter the following commands in order:
- :cd D:\
- :sil !md YY
- :cd YY
We are now working in D:\YY, which you can think of as the current directory in CMD. A word about the second line: write ":sil !md YY", not ":sil! md YY". The latter does not create the directory, because the ! then belongs to :silent and md is parsed as a VIM command rather than an external one. On the VIM command line, ! runs an external command; its other meanings come up in later examples. :sil is short for :silent, which runs a command silently. The line could be written as just ":!md YY", but that brings up a confirmation prompt we do not need, so we call it through :sil. Read in full, the line means: silently run the external command "md YY".
Everything is ready. Now we generate the list of pages 1 through 13. VIM offers several ways to do it; let's try three typical ones.
○Replacement method:
First copy the address of the first page,
http://web.mblogger.cn/philewar/category/612.aspx?p=1
to the clipboard and paste it into gvim with <S-Insert>. <S-Insert> means Shift + Insert; the rest of this article uses the same notation, so <C-a> means Ctrl + a. The line now appears in the text window. In normal mode, press Y to copy the current line. If you are not sure which mode you are in, press <ESC> twice to make sure you are back in normal mode. Then type 12p: the line is pasted 12 times, giving 13 lines of text in total. Run the following command:
:%s/\d\+$/\=line(".")/
The list we wanted is done. To explain: % applies the command to every line; s/1/2/ is the substitute command, which replaces 1 with 2; \d\+$ is a regular expression matching the run of digits at the end of a line; \= evaluates an expression; line() is the function that returns a line number; and "." stands for the current line. Put together: on every line, replace the digits at the end of the line with the current line number.
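For readers who find a scripting language easier to read than an Ex command, here is a rough Python sketch of what the substitute command above does. It is only an illustration and not part of the VIM workflow; the function name number_by_line is made up for this example.

```python
import re

def number_by_line(lines):
    """Replace the digits at the end of each line with its 1-based line number."""
    return [
        re.sub(r"\d+$", str(number), line)
        for number, line in enumerate(lines, start=1)
    ]

# Thirteen copies of the first page's address, as after Y and 12p.
first_page = "http://web.mblogger.cn/philewar/category/612.aspx?p=1"
pages = number_by_line([first_page] * 13)
```

The "612" in the middle of the address is left alone for the same reason as in VIM: the pattern is anchored to the end of the line.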
○Macro operation method:
First press u twice to undo back to the point where there was a single line of text; if you undo too far, press <C-r> to redo. Now press qaYp$<C-a>q in sequence (qa starts recording into register a, Yp duplicates the line, $ moves to the end of the line, <C-a> increments the number there, and q stops recording). There are now two lines, the second numbered one higher than the first, and the whole operation has been recorded into macro a. To replay the macro, type 11@a and the list is generated once more; @ invokes a macro, and here we ran it 11 times. Note that a default VIM installation on Windows remaps <C-a> to match the Windows select-all convention, which is a real loss. You can define the mapping again, or simply turn off the Windows key conventions: edit _vimrc in the VIM installation directory, find the lines containing source $VIMRUNTIME/mswin.vim and behave mswin, and put a double quote at the start of each line to comment it out.
○External command method:
Did the two methods above leave your head spinning? As members of the DOS Union, wouldn't we rather hand a problem like this to our familiar CMD? Fine. Delete all the text by typing ggdG, and the text is gone. Run the following command:
:%!for /l \%i in (1,1,13) do @echo.http://web.mblogger.cn/philewar/category/612.aspx?p=\%i
How about that: a familiar command proves useful again, and the job is done once more. The only difference from typing it in CMD is that each % must be escaped as \% on the VIM command line, because VIM otherwise replaces % with the current file name.
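The idea behind :%! is that the buffer's lines go to the external command on standard input, and whatever the command prints replaces them. A minimal Python sketch of that filter model follows, under stated assumptions: filter_through is a hypothetical helper, and a small Python one-liner stands in for the CMD for /l loop so that the sketch also runs outside Windows.

```python
import subprocess
import sys

def filter_through(lines, argv):
    """Send lines to a command's standard input and return what it prints."""
    result = subprocess.run(
        argv,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()

# Stand-in for the for /l loop: it ignores its input and prints 13 addresses.
generator = [
    sys.executable,
    "-c",
    "for i in range(1, 14): "
    "print('http://web.mblogger.cn/philewar/category/612.aspx?p=%d' % i)",
]

# The buffer holds one empty line after ggdG; the filter output replaces it.
buffer_lines = filter_through([""], generator)
```

Like the for /l loop, the stand-in command never reads its input, which is why an empty buffer is enough to start from.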
Next we download these links. The netrw plugin can download web pages (it needs wget.exe), but we will skip the plugin and do it ourselves. First go to
http://users.ugent.be/~bpuype/wget/
download wget.exe, and put it somewhere on the system search path. Switch back to the gvim window and run the following command:
:%!wget -i - -w 3 -q -O -
After you run this command, gvim pops up a CMD window; just wait for the command to finish. How long that takes depends on your network speed. Let's use the time to go over the wget options: -i reads the list of URLs from a file, and - means read it from standard input; -w sets the wait after each download (this site returns a repeat of the previous page when pages are fetched back to back, and waiting 3 seconds avoids that); -q suppresses the download messages; -O - writes the downloaded files to standard output. When the download finishes, the text window suddenly holds more than nine thousand lines. Nice, isn't it? (A reminder: the point here is only how VIM cooperates with external commands. In my repeated experiments this approach was occasionally unstable. If in the following steps your text differs slightly from what I describe, for example a line has a few extra ... characters, don't be alarmed; fix it by hand and carry on. A better method is introduced later in the article.) Now take a quick look through the file by running a search:
/YY先生语录\D\+
Press n to skim through the matches. The lines holding the links we want turn out to be quite regular: they all contain the string "_TitleUrl". We delete every other line. Run the following commands in order:
- :g!/_TitleUrl/d
- :%s/.*href="//
- :%s/"> */\t/
- :%s/<\/a>.*//
The first line is a global command that deletes every line that does not match the pattern. This is the other use of ! mentioned earlier: placed before the pattern, it negates the match. The three substitute commands that follow are fairly simple, and they can also be merged into one command :%s/^.*href="\([^"]*\)"> *\([^<]*\).*/\1\t\2/ . We then notice that the list is in reverse order, which we don't want, so let's flip it. Run the command:
:g/^/m0
This command is puzzling at first, so it deserves an explanation. The pattern /^/ matches every line: every line has an invisible "start of line", so no line is left out. m0 moves the matched line to line 0, a virtual address meaning the very top of the file. Each line in turn is moved to the top, so the file ends up reversed. Looking at the result, we find lines mixed in that are not part of the numbered "YY先生语录" series, such as "YY先生语录前传x". We move these stray lines to the end of the list and then put them in order. Run the commands:
- :g!/\(YY先生语录\|YY(\)\d\+/m$
- :/YY先生语录296/+1,$ sort /.*\t/ n
The first command above moves the lines that do not match the pattern to the end of the text. Want to see exactly what the regular expression matches? Type /\(YY先生语录\|YY(\)\d\+/ and press Enter to see the matches. The second command is a little more involved: it sorts the lines from the one after YY先生语录296 through the end of the file, ignoring everything up to and including the <TAB> and sorting in ascending numeric order. Sorting this flexible is powerful; to learn more, run :help :sort . Other unwanted lines, such as the "XX女士语录" line, can be deleted by hand: move the cursor to the line and press dd in normal mode. The text after the tab was only there to help us order the lines. It has served its purpose, so we delete it:
:%s/\t.*//
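The list-cleaning steps above (keep the _TitleUrl lines, cut each down to address and title, reverse, push the stray lines to the end, sort the tail by number) can be pictured with a short Python sketch. It is an approximation for illustration only: the function names and the sample lines in the test data are hypothetical, and the real page markup may differ.

```python
import re

def extract_links(html_lines):
    """Keep the _TitleUrl lines and reduce each one to 'url<TAB>title'."""
    rows = []
    for line in html_lines:
        if "_TitleUrl" not in line:                    # :g!/_TitleUrl/d
            continue
        line = re.sub(r'.*href="', "", line, count=1)  # :%s/.*href="//
        line = re.sub(r'"> *', "\t", line, count=1)    # :%s/"> */\t/
        line = re.sub(r"</a>.*", "", line, count=1)    # :%s/<\/a>.*//
        rows.append(line)
    return rows

def reverse_by_moving_to_top(rows):
    """Mimic :g/^/m0: take each line in turn and move it to the top."""
    result = []
    for row in rows:
        result.insert(0, row)
    return result

NUMBERED = re.compile(r"(YY先生语录|YY\()\d+")

def move_unnumbered_to_end(rows):
    """Mimic :g!/.../m$: non-matching lines go to the end, keeping their order."""
    matching = [row for row in rows if NUMBERED.search(row)]
    others = [row for row in rows if not NUMBERED.search(row)]
    return matching + others

def sort_tail_by_number(rows, marker):
    """Sort the lines after the first line containing marker by the number after the tab."""
    start = next(i for i, row in enumerate(rows) if marker in row) + 1

    def number_after_tab(row):
        found = re.search(r"\d+", row.rsplit("\t", 1)[-1])
        # Lines without a number are placed first (a choice made for this sketch).
        return (1, int(found.group())) if found else (0, 0)

    return rows[:start] + sorted(rows[start:], key=number_after_tab)

# Hypothetical sample lines standing in for the downloaded index pages.
sample = [
    "<div>unrelated</div>",
    '<a id="a_TitleUrl" href="http://example.invalid/3.aspx"> YY先生语录3</a></div>',
    '<a id="b_TitleUrl" href="http://example.invalid/x.aspx"> YY先生语录前传1</a></div>',
    '<a id="c_TitleUrl" href="http://example.invalid/2.aspx"> YY先生语录2</a></div>',
]
rows = move_unnumbered_to_end(reverse_by_moving_to_top(extract_links(sample)))
```

Skipping everything before the tab when sorting matters because the addresses themselves contain digits, which would otherwise be picked up as the sort key.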
We haven't saved the file yet. Execute this command to save it.
:w list.txt
Next, what we need to do is to download all these links. Execute the following command:
:!wget -w 3 -nc -k -i %
This wget run has two options we did not use last time: -nc skips files that already exist, and -k converts relative links to absolute links. Consider what they buy us. With this many files and a poor connection, there is no guarantee that every file downloads correctly, rare as failures are. -nc gives us a way to check: simply run the same command again to pick up any files that were missed. -k is very useful for tidying up pages with links and images later on, if you want to keep that information. By default wget prints download information to the screen, so we can follow the progress at any time; watching the screen scroll beats waiting with no idea how long it will take. When the files have been downloaded, close the CMD window and return to gvim, where we merge the files:
:%!for /f \%i in (%) do @type \%~nxi
When it finishes, all the files have been merged in the order we arranged earlier: for /f reads list.txt line by line, and %~nxi reduces each address to its file name and extension, the name wget saved the page under. In normal mode press G to jump to the end of the text. Goodness, more than 170,000 lines! The text is huge, but there is no need to worry about VIM's speed. Press <C-b> and <C-f> to page backward and forward, skim to see where the text we want is hiding, and find the strings that mark the text block. Press G again to jump to the end, press ma to set a mark, and then run the following command:
:g/<div class="post">/,/<\/div><link/m$
The parts we need are now at the end of the file. Press 'a to jump to the mark, press dgg to delete everything from the mark back to the top of the file, and save the result under a new name:
:saveas YY.txt
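The net effect of moving every block from <div class="post"> through the next </div><link line to the end, and then deleting everything up to the mark, is that only those blocks survive, in their original order. A simplified Python sketch of that result follows; extract_blocks is a hypothetical name, and the sketch ignores edge cases such as a start marker appearing inside a block.

```python
def extract_blocks(lines, start_marker, end_marker):
    """Collect each run of lines from a start-marker line through the next end-marker line."""
    kept = []
    inside = False
    for line in lines:
        if inside:
            kept.append(line)
            # The end marker is only looked for on lines after the start line.
            if end_marker in line:
                inside = False
        elif start_marker in line:
            kept.append(line)
            inside = True
    return kept

# Hypothetical miniature of the merged pages.
merged = [
    "<html>",
    '<div class="post">',
    "text 1",
    '</div><link rel="x">',
    "junk between posts",
    '<div class="post">',
    "text 2",
    '</div><link rel="y">',
    "</html>",
]
posts = extract_blocks(merged, '<div class="post">', "</div><link")
```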
After :saveas, gvim automatically switches to the newly saved file, so every change from here on applies to this filtered copy. Not much technique is left: run a series of substitutions. The commands below replace HTML tags and common character entities; naturally, not all of them will occur in this particular text.
:%s/<[^>]*>//g
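:%s/&quot;\|&#34;/"/g
:%s/&amp;\|&#38;/\&/g
:%s/&lt;\|&#60;/</g
:%s/&gt;\|&#62;/>/g
:%s/&nbsp;/ /g
:%s/&middot;\|&#183;/·/g
:%s/&hellip;/…/g
:%s/&ndash;\|&#8211;/–/g
:%s/&mdash;\|&#8212;/—/g
:%s/&lsquo;\|&#8216;/‘/g
:%s/&rsquo;\|&#8217;/’/g
:%s/&ldquo;\|&#8220;/“/g
:%s/&rdquo;\|&#8221;/”/g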
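As a point of comparison, the same clean-up (drop the tags, then turn character entities back into characters) can be sketched in Python with the standard library's html.unescape. This is an illustration only, not what the article runs; strip_markup is a hypothetical name, and the tag pattern is a rough one that simply removes anything between < and >.

```python
import html
import re

def strip_markup(text):
    """Remove anything that looks like a tag, then decode character entities."""
    without_tags = re.sub(r"<[^>]*>", "", text)
    decoded = html.unescape(without_tags)
    # The substitutions above turn a non-breaking space into an ordinary space.
    return decoded.replace("\xa0", " ")

cleaned = strip_markup("<p>&ldquo;A &amp; B&rdquo;&nbsp;&lt;ok&gt;</p>")
```

Tags are removed before entities are decoded, in the same order as the commands above; decoding first would turn &lt;ok&gt; into something that looks like a tag and would then be deleted.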
Then run another series of substitutions to finish the formatting.
:%s/\r//g
:%s/^\s*//g
:%s/^$\n//g
:%s/^\d\+年\d\+月\d\+日.*/\t\t&\r\r/g
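The final formatting pass can be pictured with the following Python sketch. It rests on an assumption: the second command is taken to strip leading whitespace from each line. The function name tidy and the sample text are made up for this example.

```python
import re

DATE_LINE = re.compile(r"\d+年\d+月\d+日")

def tidy(text):
    """Drop carriage returns, leading whitespace and empty lines; set off date lines."""
    text = text.replace("\r", "")                             # :%s/\r//g
    lines = [line.lstrip() for line in text.split("\n")]      # leading whitespace (assumed)
    lines = [line for line in lines if line]                  # :%s/^$\n//g
    out = []
    for line in lines:
        if DATE_LINE.match(line):
            # Two tabs in front, two line breaks behind, as in \t\t&\r\r.
            out.extend(["\t\t" + line, "", ""])
        else:
            out.append(line)
    return "\n".join(out)

sample_text = "  first post\r\n\r\n2006年11月22日 06:15\r\nsecond post"
formatted = tidy(sample_text)
```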
You can see the finished result here:
http://www.cn-dos.net/forum/viewthread.php?tid=24956
Did we type all these commands just to finish this one task? No. VIM can save commands and operations as a script and run them again and again; how to do that will have to wait for next time.
Postscript: this article is a worked example of VIM in use and says little about VIM's basic operations, so beginners will find it hard to follow step by step. I recommend first working through vimtutor, the 30-minute tutorial that ships with the VIM help. The Chinese version of the help can be downloaded from:
http://vimcdoc.sourceforge.net/
[ Last edited by 无奈何 on 2006-11-22 at 06:15 AM ]