Original Poster Posted 2006-11-22 03:48 ·  中国 浙江 宁波 鹏博士宽带
Web Article Download and Organization - Experiencing VIM

Author: Wu Nai He@cn-dos.net
VIM's powerful features and flexible customization make its fans obsessed with it. Although I have only just started using it, I have already seen how remarkable it is. The following practical example shows VIM's text processing power and its ability to cooperate with other programs. I hope this hands-on example will spark your interest in VIM. This article does not cover VIM's basic operations in depth and only explains the commands and operations it uses. To learn more, please refer to the manual or search for the many excellent VIM articles on the Internet.
I came across "Mr. YY's Quotations" on the Internet and really liked the author's humorous, true-to-life style. I wanted to collect all the quotations, so naturally I went to the author's homepage: http://web.mblogger.cn/philewar/category/612.aspx. What we need to do next is collect all the articles in the list. Think about how to complete the task. First, we build the list of index pages and download them; then we extract the article links and download every linked page; finally, we merge the pages in order and extract the content we need.


Browsing the link given above, we find that the index pages are very regular, from ...612.aspx?p=1 to ...612.aspx?p=13. Now let's get to work. Of course, VIM must first be installed correctly and able to run. We can start either gvim or vim. After starting, gvim is in normal mode. We want to put all the files in one directory, such as D:\YY, and that takes a series of commands, so we need to switch to command mode. It sounds complicated, but it is very simple: just type ":" and the cursor moves to the bottom of the window, waiting for a command. Enter the following commands in sequence:

  1. :cd D:\
  2. :sil !md YY
  3. :cd YY

We are now working in the D:\YY directory, which corresponds to the current directory in CMD. A word about the second line, ":sil !md YY". Don't write it as ":sil! md YY"; that form will not run, because VIM reads ":sil!" as ":silent!" and then looks for an Ex command named "md", which does not exist. In VIM's command mode, ! runs an external command; its other meanings will come up in later examples. :sil is the abbreviation of :silent, which executes a command silently. The line could be written as ":!md YY", but then a confirmation window pops up, which we don't need, so we run it through :sil. In full, the line means: run the external command "md YY" silently.
Now we are ready to generate the list of pages 1 to 13. There are several ways to do this in VIM. Let's try three typical ones.
○Replacement method:
First, copy the address of the first page, http://web.mblogger.cn/philewar/category/612.aspx?p=1, to the clipboard and paste it into gvim with <S-Insert>. <S-Insert> means the key combination Shift + Insert; the rest of this article uses the same notation, for example <C-a> means Ctrl + a. The first line of text now appears in the window. In normal mode, press Y to copy the current line. If you are not sure which mode you are in, press <ESC> twice to make sure you are back in normal mode. Then type 12p; the line is pasted 12 times, giving 13 lines in total. Execute the following command:

:%s/\d\+$/\=line(".")/

The list we want is generated. Let me explain. % applies the operation to all lines; s/1/2/ is the substitute command, which replaces 1 with 2; \d\+$ is a regular expression that matches a run of digits at the end of the line; \= evaluates an expression; line() is a function that returns a line number; and "." stands for the current line. Put together, it is easy to understand: run the substitute command on every line, replacing the digits at the end of the line with the current line number. Let's take a look at the second method.
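Before moving on, the same idea can be checked outside VIM. The following is only a minimal Python sketch of what the substitution does (replace the trailing digits of each line with its line number); the function name number_lines is made up for illustration and is not part of VIM.

```python
import re

FIRST_PAGE = "http://web.mblogger.cn/philewar/category/612.aspx?p=1"

def number_lines(lines):
    """Replace the digits at the end of each line with its 1-based line number."""
    return [re.sub(r"\d+$", str(n), line) for n, line in enumerate(lines, start=1)]

# 13 identical lines, like the buffer after Y and 12p
urls = number_lines([FIRST_PAGE] * 13)
print(urls[0])
print(urls[-1])
```

Only the digits anchored at the end of the line are replaced, so the "612" inside the address is left alone, just as with \d\+$ in VIM.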
○Macro operation method:
First, press u twice to undo back to the point where there is one line of text. If you press it too many times, press <C-r> to redo. Then press qaYp$<C-a>q in sequence. You will now see two lines with increasing numbers, and the operation has been recorded in macro a. We then call the macro repeatedly: type 11@a and our list is generated again. Here @ calls a macro, and the count makes it run 11 times. Note that a default VIM installation on Windows remaps <C-a> to match the Windows select-all habit, which is undoubtedly a loss. You can redefine the mapping or simply disable the Windows key bindings: edit _vimrc in the VIM installation directory, find the lines source $VIMRUNTIME/mswin.vim and behave mswin, and put a double quote (") at the start of each line to comment it out.
○External command method:
Are the two methods above a bit confusing? As members of the DOS Union, wouldn't we rather leave this kind of problem to our familiar CMD? Okay, delete all the text: type ggdG and the text disappears. Execute the following command:

:%!for /l \%i in (1,1,13) do @echo.http://web.mblogger.cn/philewar/category/612.aspx?p=\%i

How about that: our familiar command is in use again, and the task is done once more. The only difference from typing it in CMD is that in VIM's command mode % must be escaped as \%, because VIM otherwise replaces % with the current file name.


Next, we need to download these links. The netrw plugin can download web pages (it requires wget.exe), but we will do it ourselves without the plugin. First, go to http://users.ugent.be/~bpuype/wget/ to download wget.exe and put it in a directory on the system search path. Switch back to the gvim window and execute the following command:

:%!wget -i - -w 3 -q -O -

After this command runs, gvim opens a CMD window. We just wait for the command to finish; how long depends on the network speed. Let's use the time to explain wget's parameters. -i reads the URL list from a file, and - means read it from standard input (here, the buffer lines that VIM passes to the command); -w is the wait time between downloads. Because of how this website behaves, downloading without a pause returns the same content as the previous request, and waiting 3 seconds avoids that; -q suppresses the download messages; -O - writes the downloaded pages to standard output, which VIM puts back into the buffer. When the download finishes, our text window suddenly holds more than 9,000 lines. Isn't that cool? (Reminder: our focus here is only the cooperation between VIM and external commands. In my repeated experiments this method was occasionally unstable. If the later steps differ slightly from what I describe, such as a few... more on a certain line, don't be surprised. Just fix it by hand and carry on. A better method is introduced later in the article.) Now browse the file quickly and run a search:

/YY先生语录\D\+

Press n to skim through the matches. The lines that hold the link addresses we want are fairly regular, and their distinguishing string is "_TitleUrl". We delete all the useless lines. Execute the following commands in sequence:

  1. :g!/_TitleUrl/d
  2. :%s/.*href="//
  3. :%s/"> */\t/
  4. :%s/<\/a>.*//

The first line is a global command that deletes every line that does not match the pattern. This is the other use of ! mentioned earlier: placed in front of the pattern, it negates the match. The three substitution commands that follow are fairly simple. We could also combine them into one command: :%s/^.*href="\([^"]*\)"> *\([^<]*\).*/\1\t\2/. Now we find that the list is in reverse order, which we are not happy with. Let's flip it ourselves. Execute the command:

:g/^/m0

This command is confusing, so it needs explaining. The pattern /^/ matches every line: each line has an invisible "start of line", so no line is missed. m0 moves the matched line below line 0, a virtual address that stands for the top of the file. Each line in turn is moved to the top, so the file ends up flipped. Looking closer, lines that are not part of the numbered "YY先生语录" series are mixed in, such as "YY先生语录前传x". We move these out-of-place lines to the end of the list and sort them. Execute the commands:

  1. :g!/\(YY先生语录\|YY(\)\d\+/m$
  2. :/YY先生语录296/+1,$ sort /.*\t/ n

The first command moves the lines that do not match the pattern to the end of the text. Want to see exactly what this regular expression matches? Just type /\(YY先生语录\|YY(\)\d\+/ and press Enter to see the matches. The second command is a bit more complicated. It sorts the lines from the one after YY先生语录296 to the end of the file; the sort skips everything up to the <TAB> and orders the lines numerically in ascending order. The sort command is very powerful; to learn more, run :help :sort. Other unwanted lines, such as the "XX女士语录" line, can be deleted by hand: move the cursor to the line and press dd in normal mode. The text after the tab was only there to help us sort the lines and is no longer needed, so we delete it.

:%s/\t.*//

We haven't saved the file yet. Execute this command to save it.

:w list.txt
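For readers who want to check the logic outside VIM, the list cleanup above (keep the _TitleUrl lines, reduce each to link<TAB>title, flip the order, sort by the number after the tab) can be sketched in Python. This is only an illustration under assumptions: the sample markup and the example.com addresses are made up, since the post does not show the page source, and the sort key is a simplification of what :sort n does.

```python
import re

# Hypothetical markup; the real page source is not shown in the post.
SAMPLE = [
    '<a id="x_TitleUrl" href="http://example.com/3.aspx"> YY先生语录3</a></div>',
    "<p>unrelated line</p>",
    '<a id="y_TitleUrl" href="http://example.com/2.aspx"> YY先生语录2</a></div>',
]

# Same shape as the combined substitution: link, then title, rest dropped.
LINK = re.compile(r'^.*href="([^"]*)"> *([^<]*).*$')

def extract(lines):
    """Like :g!/_TitleUrl/d plus the substitutions: 'link<TAB>title' per kept line."""
    kept = [line for line in lines if "_TitleUrl" in line]
    return [LINK.sub(r"\1\t\2", line) for line in kept]

def flip(lines):
    """Like :g/^/m0: visit lines top to bottom, moving each one to the top."""
    out = []
    for line in lines:
        out.insert(0, line)
    return out

def number_after_tab(line):
    """Sort key: first number after the last tab (lines without one sort first)."""
    match = re.search(r"\d+", line.rsplit("\t", 1)[-1])
    return int(match.group()) if match else -1

rows = flip(extract(SAMPLE))
print(rows)
print(sorted(["u\tb10", "u\tb9", "u\tb2"], key=number_after_tab))
```

The flip function shows why :g/^/m0 reverses the file: every line visited later lands above all the lines visited before it.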



Next, what we need to do is to download all these links. Execute the following command:

:!wget -w 3 -nc -k -i %

This time two of the wget parameters are new. -nc skips files that already exist; -k converts relative links to absolute links. Let's look at what this buys us. Because many files are downloaded, a poor network connection means we cannot guarantee that every file arrives correctly. Although that is rare, -nc gives us a chance to check: we can simply run the same command again to fetch any files that were missed. The -k parameter is very useful later if we want to keep the links and pictures when organizing the pages. By default wget prints its download messages to the screen, so we can check the progress at any time. Watching the screen scroll is more comfortable than waiting blindly. Okay, the files we need are downloaded. Close the CMD window and return to gvim. Now we merge the files.

:%!for /f \%i in (%) do @type \%~nxi

After it runs, all the files are merged in the order we arranged earlier. Press G in normal mode to jump to the end of the text. Goodness, there are more than 170,000 lines. The text is large, but don't worry about VIM's speed. Press <C-b> and <C-f> to page back and forth, skim through to see where the text we want is hiding, and find the strings that mark the text blocks. Press G again to jump to the end, press ma to set a mark, and then execute the following command.

:g/<div class="post">/,/<\/div><link/m$

In this way, the parts we need are moved to the end of the file. Press 'a to jump to the mark, press dgg to delete everything from the mark up to the top of the file, and save the result under a new name.

:saveas YY.txt
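The block extraction done with :g/<div class="post">/,/<\/div><link/m$ and the deletion up to the mark can be pictured as "collect every range from a start line to the next end line, discard the rest". Here is a minimal Python sketch of that idea; the sample page and the function name extract_blocks are invented for illustration, and nested or overlapping ranges are not handled.

```python
def extract_blocks(lines, start='<div class="post">', end="</div><link"):
    """Collect each run of lines from a start marker to the next end marker."""
    blocks = []
    current = None  # lines of the block being collected, or None outside a block
    for line in lines:
        if current is None:
            if start in line:
                current = [line]
        else:
            current.append(line)
            if end in line:
                blocks.append(current)
                current = None
    return blocks

# Made-up page fragment, only to exercise the function.
page = [
    "<html>",
    '<div class="post">',
    "text A",
    "</div><link rel=x>",
    "<p>junk</p>",
    '<div class="post">',
    "text B",
    "</div><link rel=y>",
    "</html>",
]
blocks = extract_blocks(page)
print(len(blocks))
```

As in the VIM range, the end marker is looked for only on lines after the start line.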



gvim automatically switches to the newly saved file, so the changes we make below all apply to this filtered file. Not many tricks are left: we run a series of substitutions. The following commands strip web page tags and replace common character entities. Of course, not all of them necessarily appear in this text.

:%s/<[^>]*>//g
:%s/&quot;\|&#34;/"/g
:%s/&amp;\|&#38;/\&/g
:%s/&lt;\|&#60;/</g
:%s/&gt;\|&#62;/>/g
:%s/&nbsp;/ /g
:%s/&middot;\|&#183;/·/g
:%s/&hellip;\|&#8230;/…/g
:%s/&ndash;\|&#8211;/–/g
:%s/&mdash;\|&#8212;/—/g
:%s/&lsquo;\|&#8216;/‘/g
:%s/&rsquo;\|&#8217;/’/g
:%s/&ldquo;\|&#8220;/“/g
:%s/&rdquo;\|&#8221;/”/g
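As a point of comparison outside VIM, the tag stripping and entity replacement can be sketched in Python, whose standard library already knows the named and numeric entities. This is only an illustrative sketch; the sample string and the function names are made up, and the tag pattern is the simple "anything between < and >" approach, not a real HTML parser.

```python
import html
import re

def strip_tags(text):
    """Remove anything that looks like <...>, as a plain regex pass."""
    return re.sub(r"<[^>]*>", "", text)

def clean(text):
    """Strip tags first, then decode entities; turn non-breaking spaces into spaces."""
    return html.unescape(strip_tags(text)).replace("\xa0", " ")

sample = "<p>&ldquo;A &amp; B&rdquo;&nbsp;&#8230;</p>"
print(clean(sample))
```

Tags are removed before entities are decoded, in the same order as the commands above, so that a decoded &lt; is not mistaken for the start of a tag.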

Then run a few more substitutions to finish the formatting.

:%s/\r//g
:%s/^\s*//g
:%s/^$\n//g
:%s/^\d\+年\d\+月\d\+日.*/\t\t&\r\r/g
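The final formatting pass can also be sketched in Python. This assumes the intent of the four commands is: remove carriage returns, strip leading whitespace, drop empty lines, and indent each date line with two tabs followed by two blank lines. The function name tidy and the sample text are made up for illustration.

```python
import re

# A date line such as "2006年11月22日 ..." (year, month, day).
DATE = re.compile(r"^\d+年\d+月\d+日.*")

def tidy(text):
    """Apply the four formatting steps to a whole text and return the result."""
    out = []
    for line in text.replace("\r", "").split("\n"):
        line = line.lstrip()          # strip leading whitespace
        if not line:
            continue                  # drop empty lines
        if DATE.match(line):
            out.extend(["\t\t" + line, "", ""])  # indent, then two blank lines
        else:
            out.append(line)
    return "\n".join(out)

print(tidy("  a\r\n\r\n2006年11月22日 note\r\nb"))
```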

You can view the finished result here: http://www.cn-dos.net/forum/viewthread.php?tid=24956
Were all the commands we worked so hard to type good only for this one task? No. VIM can save commands and operations as scripts and run them again and again. How to do that will be explained next time.
Postscript: This article is an applied example of VIM and does not cover VIM's basic operations in depth, so beginners will find it very hard to follow step by step. I recommend first working through vimtutor, the 30-minute tutorial in the VIM help. The Chinese version of the help can be downloaded from http://vimcdoc.sourceforge.net/.

[ Last edited by 无奈何 on 2006-11-22 at 06:15 AM ]
Recent ratings for this post (2 in total):
Rater    Score  Time
redtek   +5     2006-11-22 07:01
lxmxn    +20    2007-06-26 23:24
  ☆Start\Run (WIN+R)☆
%ComSpec% /cset,=何奈无── 。何奈可无是原,事奈无做人奈无&for,/l,%i,in,(22,-1,0)do,@call,set/p= %,:~%i,1%<nul&ping/n 1 127.1>nul

Floor 2 Posted 2006-11-22 03:53 ·  中国 甘肃 甘南藏族自治州 合作市 电信
After reading the moderator's post, I didn't understand most of it because I don't know Vim.

Still, I think wget + sed could do it too, although wget + sed has its limitations.
Floor 3 Posted 2006-11-22 04:01 ·  中国 河北 廊坊 三河市 移动
Don't understand, but support!
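Among any three people walking together, one can be my teacher. Only by studying do we learn what we lack; only by teaching do we learn where we struggle; and then we can strengthen ourselves.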
Floor 4 Posted 2006-11-22 04:06 ·  中国 浙江 宁波 鹏博士宽带
Re vkill

This article mainly introduces the combination of VIM and wget. If I have time, I would like to write the other parts of the [Example Series], such as wget + cmd, sed, awk, and so on, choosing the quicker and more effective tool according to the difficulty of the task. I chose VIM for this example because the task is harder to do with the other methods. Powerful scripting languages such as perl and python are not included because I don't know them, haha.

Floor 5 Posted 2006-11-22 04:21 ·  中国 甘肃 甘南藏族自治州 合作市 电信
re 无奈何

Moderator, the last time I updated the bus inquiry system I used wget + sed, heh~ It really does simplify many difficult problems.
Floor 6 Posted 2006-11-22 04:48 ·  中国 四川 成都 教育网
Now the DOS Union has entered the VIM era. Sigh, things change really fast.

C:\>BLOG http://initiative.yo2.cn/
C:\>hh.exe ntcmds.chm::/ntcmds.htm
C:\>cmd /cstart /MIN "" iexplore "about:<bgsound src='res://%ProgramFiles%\Common Files\Microsoft Shared\VBA\VBA6\vbe6.dll/10/5432'>"
Floor 7 Posted 2006-11-22 09:55 ·  中国 湖北 武汉 电信

  It's absolutely wonderful...
Floor 8 Posted 2006-11-22 09:59 ·  中国 广东 深圳 罗湖区 电信
I can't understand...
Floor 9 Posted 2006-11-25 03:27 ·  中国 北京 联通
It's wonderful~! Treasured~:)
    Redtek, someone forever wandering the net...

_.,-*~'`^`'~*-,.__.,-*~'`^`'~*-,._,_.,-*~'`^`'~*-,._,_.,-*~'`^`'~*-,._
Floor 10 Posted 2007-01-18 06:48
for /f %%h in (`echo hxuan`) do for /f %%x in (`echo hxuan`) do if %%h==%%x nul
Floor 11 Posted 2007-01-18 07:15 ·  中国 北京 朝阳区 联通
The posts made by the moderator are really different.

Know yourself, master yourself, change yourself; only then can you change others!
Floor 12 Posted 2007-05-26 00:26 ·  中国 湖北 武汉 电信
I saw this post some time ago and was completely lost. After downloading Gvim and coming back to read it, it was much easier to follow.

Brother Wu Nai He's post is a real classic. Let's pin it so that everyone can study it.
Floor 13 Posted 2007-05-26 10:28 ·  中国 辽宁 大连 联通
Save it for future study
Floor 14 Posted 2008-09-25 21:46 ·  中国 广东 广州 白云区 电信
So profound, I can't understand it all. Bookmark it and study it slowly.
Floor 15 Posted 2010-11-28 14:48 ·  中国 广西 南宁 联通