Web Article Download and Organization - Experiencing VIM
Author: Wu Nai He@cn-dos.net
VIM's powerful features and flexible customization win over the people who use it. Although I have only just started with it, I have already seen how much it can do. This article walks through a practical example to show VIM's text processing and its cooperation with external programs, in the hope of arousing your interest in VIM. It does not cover VIM's basic operations; only the commands and keystrokes actually used are explained. For more, refer to the manual or to the many excellent VIM articles on the Internet.
I came across "Mr. YY's Quotations" (YY先生语录) on the Internet and really liked the author's humorous, true-to-life style. I wanted to collect all the quotations, so naturally I went to the author's homepage: http://web.mblogger.cn/philewar/category/612.aspx. Our task is to collect every article in that list. Think about how to go about it. First, we work out the list of directory pages and download them; then we extract the article links from those pages and download every linked page; finally, we merge the pages in order and extract the content we need.
Browsing the link given above, we find that the directory pages follow a very regular pattern, from ...612.aspx?p=1 to ...612.aspx?p=13. Now let's get to work. First of all, of course, VIM must be installed correctly and able to run. We can start either gvim or vim. After starting, gvim is in normal mode. We want to put all the files in one directory, say D:\YY. That takes a series of commands, so we need to switch to command-line mode. It sounds complicated, but it is very simple: type ":" and the cursor moves to the bottom of the window, waiting for a command. Enter the following commands in sequence:
- :cd D:\
- :sil !md YY
- :cd YY
Now we are working in the D:\YY directory, which is the equivalent of the current directory in CMD. A word about the second line, ":sil !md YY". Do not write it as ":sil! md YY"; that form will not run the external command, because a "!" attached directly to :sil belongs to :silent itself. In VIM's command-line mode, ! runs an external command; its other meanings come up in later examples. :sil is the abbreviation of :silent, which executes a command silently. The line could be written simply as ":!md YY", but then a confirmation window pops up that we do not need, so we call it through :sil. Read in full, the line means: run the external command "md YY" silently.
We are ready. Now we generate the list of pages 1 to 13. VIM offers several ways to do this; let's try three typical ones.
○Replacement method:
First, copy the address of the first page, http://web.mblogger.cn/philewar/category/612.aspx?p=1, to the clipboard and paste it into gvim by pressing <S-Insert>. <S-Insert> stands for the key combination Shift + Insert, and the rest of the article uses the same notation; for example, <C-a> means Ctrl + a. The first line of text now appears in the window. In normal mode, press Y to copy the current line. If you are not sure which mode you are in, press <ESC> twice to make sure you are back in normal mode. Then type 12p: the line is pasted 12 times, giving 13 lines of text in total. Execute the following command:
:%s/\d\+$/\=line(".")/
The list we want is generated. Let's explain. % applies the operation to all lines; s/1/2/ is the substitution command, replacing 1 with 2; \d\+$ is a regular expression that matches consecutive digits at the end of the line; \= evaluates an expression; line() is a function that returns a line number; and . stands for the current line. Put together, it is easy to understand: on every line, replace the digits at the end of the line with the current line number. Let's take a look at the second method.
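If you would like to double-check what the substitution produces without opening VIM, the same idea can be sketched in Python. This is only an illustration under my own assumptions: the function name number_by_line is made up for the example, and Python's re.sub stands in for VIM's :s with \=line(".").

```python
import re

def number_by_line(lines):
    """Replace the trailing digits of each line with its 1-based line number."""
    return [
        re.sub(r"\d+$", str(number), line)
        for number, line in enumerate(lines, start=1)
    ]

# Thirteen copies of the first page address, as after typing 12p.
first = "http://web.mblogger.cn/philewar/category/612.aspx?p=1"
urls = number_by_line([first] * 13)

print(urls[0])
print(urls[-1])
```

The first line is unchanged and the last one ends in ?p=13, which is the list the VIM command builds.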
○Macro operation method:
First, press u twice to undo back to the point where there was a single line of text. If you undo too far, press <C-r> to redo. Then press qaYp$<C-a>q in sequence. You will now see two lines, the second with the number increased by one, and the operation has been recorded as macro a. To run the macro repeatedly, type 11@a, and the list is generated again. Here @ calls a macro, and the count makes it run 11 times. Note that a default VIM installation on Windows remaps <C-a> to Select All to match Windows habits, which is undoubtedly a loss. You can redefine the mapping, or simply disable the Windows key conventions: edit _vimrc in the VIM installation directory, find the lines source $VIMRUNTIME/mswin.vim and behave mswin, and comment them out by putting a double quote at the start of each line.
○External command method:
Are the two methods above a bit confusing? As members of the DOS Union, wouldn't we rather leave this kind of problem to our familiar CMD? All right: delete all the text by typing ggdG, and the text disappears. Execute the following command:
:%!for /l \%i in (1,1,13) do @echo.http://web.mblogger.cn/philewar/category/612.aspx?p=\%i
How about that: our familiar command is put to use, and the task is done once more. The only difference between VIM's command-line mode and CMD here is that % must be escaped as \%, because VIM otherwise replaces % with the name of the current file.
Next, we need to download these links. The netrw plugin can download web pages (it requires wget.exe), but we will handle this ourselves without the plugin. First, download wget.exe from http://users.ugent.be/~bpuype/wget/ and place it in a directory on the system search path. Switch back to the gvim window and execute the following command:
:%!wget -i - -w 3 -q -O -
After this command is executed, gvim pops up a CMD window, and we just wait for the command to finish. How long that takes depends on the network speed. Let's use the time to explain wget's parameters. -i reads the URL list from a file, and - means the list comes from standard input, which here is the buffer text that VIM feeds to the command; -w is the number of seconds to wait after each page (because of the way this website behaves, downloading without a pause returns the same content as the previous request, and waiting 3 seconds avoids that); -q suppresses the download messages; -O - writes the downloaded content to standard output. When the download finishes, our text window suddenly holds more than 9,000 lines. Isn't that cool? (Reminder: our focus here is only the cooperation between VIM and external commands. In my repeated experiments this method was occasionally unstable. If what you see in the later steps differs slightly from what I describe, such as a few... more on a certain line, don't be surprised. Just fix it by hand and carry on. A better method is introduced later in the article.) Now let's browse the file and execute a search command:
/YY先生语录\D\+
Press n to skim through the matches. The lines containing the link addresses we want to extract turn out to be fairly regular, and they all contain the characteristic string "_TitleUrl". Let's delete all the useless lines. Execute the following commands in sequence:
- :g!/_TitleUrl/d
- :%s/.*href="//
- :%s/"> */\t/
- :%s/<\/a>.*//
The first line is a global command that deletes every line that does not match the pattern. This is another use of the ! mentioned earlier: placed before the pattern, it negates the match. The three substitution commands that follow are fairly simple, and they can also be combined into one command: :%s/^.*href="\([^"]*\)"> *\([^<]*\).*/\1\t\2/. The list turns out to be in reverse order, which we do not want, so let's flip it ourselves. Execute the command:
:g/^/m0
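For readers who want to watch the flip happen step by step, here is a small Python simulation. It is a sketch of my own, not part of VIM: the function name flip_like_g_m0 is hypothetical, and a Python list stands in for the buffer.

```python
def flip_like_g_m0(lines):
    """Simulate :g/^/m0 on a list of lines.

    Every line is visited in its original order, and each visited line
    is moved to the very top of the buffer.
    """
    buffer = list(lines)
    for index in range(len(buffer)):
        # Lines already moved sit above position `index`, so the next
        # original line is still found at `index`.
        line = buffer.pop(index)
        buffer.insert(0, line)
    return buffer

print(flip_like_g_m0(["page 3", "page 2", "page 1"]))
```

Since every later line lands above all the earlier ones, the final order is the reverse of the original.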
This command looks puzzling, so it deserves an explanation. The pattern /^/ matches every line: each line has an invisible "line start", so no line is missed. m0 moves the matched line to below line 0, a virtual address that stands for the top of the file. Each line in turn is moved to the top, so the file ends up flipped. Looking at the result, we find lines mixed in that do not belong to the numbered "YY先生语录" series, such as "YY先生语录前传x". We move these out-of-place lines to the end of the list and sort them. Execute the commands:
- :g!/\(YY先生语录\|YY(\)\d\+/m$
- :/YY先生语录296/+1,$ sort /.*\t/ n
The first command moves the lines that do not match the pattern to the end of the text. Would you like to see exactly what this regular expression matches? Just type /\(YY先生语录\|YY(\)\d\+/ and press Enter to see the matches. The second command is a little more complicated. It sorts the lines from the one after YY先生语录296 to the end of the file; the sorting rule is to skip everything up to the <TAB> and sort in ascending numeric order. The sort command is very powerful; to learn more, execute :help :sort. Other unwanted lines, such as the line "XX女士语录", can be deleted by hand: move the cursor to the line and press dd in normal mode. The text after the tab was only there to help us sort the lines. It is no longer useful, so we delete it.
:%s/\t.*//
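The sorting rule and the clean-up that follows it can be sketched in Python as well. This is a simplified illustration under stated assumptions: every line is assumed to contain a tab, the whole list is sorted rather than a range, negative numbers are ignored, and the sample addresses and the function names are hypothetical.

```python
import re

def sort_by_number_after_tab(lines):
    """Sort lines by the first number found after the last tab, ascending."""
    def key(line):
        tail = line.rsplit("\t", 1)[-1]      # skip everything up to the tab
        match = re.search(r"\d+", tail)      # first run of digits in the title
        # Lines without a number are placed first, in their original order.
        return (0, 0) if match is None else (1, int(match.group()))
    return sorted(lines, key=key)            # sorted() is stable

def drop_titles(lines):
    """Remove the first tab and everything after it from each line."""
    return [line.split("\t", 1)[0] for line in lines]

sample = [
    "http://example.invalid/c.aspx\tYY先生语录前传3",
    "http://example.invalid/a.aspx\tYY先生语录前传1",
    "http://example.invalid/b.aspx\tYY先生语录前传10",
]
ordered = sort_by_number_after_tab(sample)
print(drop_titles(ordered))
```

Note that 10 sorts after 3 because the comparison is numeric, not alphabetical, which is the point of the n flag.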
We haven't saved the file yet. Execute this command to save it.
:w list.txt
Next, what we need to do is to download all these links. Execute the following command:
:!wget -w 3 -nc -k -i %
Two parameters are new compared with the previous wget call: -nc skips files that already exist, and -k converts relative links to absolute links. Let's see what they buy us. Because many files are downloaded, a poor network connection means we cannot be sure that every file arrives correctly. Such failures are rare, but -nc gives us a chance to check: we can simply run the same command again, and only the files that are still missing will be downloaded. The -k parameter is very useful later on if we want to keep the links and pictures when organizing the pages. By default wget prints its download messages to the screen, so we can check the progress at any time; watching the screen scroll is more comfortable than waiting blindly. All right, the files we need have been downloaded. Close the CMD window and return to gvim. Now we merge the files.
:%!for /f \%i in (%) do @type \%~nxi
After execution, all the files are merged in the order we arranged earlier. Press G in normal mode to jump to the end of the text. Goodness, there are more than 170,000 lines. The text is very large, but there is no need to worry about VIM's speed. Press <C-b> and <C-f> to page backward and forward, skim through to see where the text we want is hidden, and find the strings that mark the beginning and end of each text block. Press G again to jump to the end, press ma to set a mark, and then execute the following command.
:g/<div class="post">/,/<\/div><link/m$
In this way, the part we need is placed at the end of the file. Press 'a to go to the mark, press dgg to delete the content before the mark, and save it separately.
:saveas YY.txt
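The extraction step, collecting every block that runs from a start marker to the next end marker, can be pictured with a short Python sketch. The function name extract_blocks and the sample lines are hypothetical, and unlike VIM the sketch does not report an error when an end marker is missing.

```python
def extract_blocks(lines, start='<div class="post">', end="</div><link"):
    """Collect each run of lines from a start marker to the next end marker."""
    blocks = []
    inside = False
    for line in lines:
        if inside:
            blocks.append(line)
            if end in line:          # the end marker closes the current block
                inside = False
        elif start in line:          # a start marker opens a new block
            inside = True
            blocks.append(line)
    return blocks

page = [
    "<html>",
    '<div class="post">',
    "first article",
    "</div><link rel='x'>",
    "navigation junk",
    '<div class="post">',
    "second article",
    "</div><link rel='y'>",
    "</html>",
]
print(extract_blocks(page))
```

Everything outside the marked blocks is dropped, which matches moving the blocks to the end of the file and deleting what comes before the mark.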
gvim automatically switches to the newly saved file, so all the changes we make from here on apply to this filtered file. There are not many tricks left: we execute a series of substitutions. The following commands remove web page tags and replace common character entities. Of course, not all of these entities necessarily appear in this text.
:%s/[ \t]*<[^>]*>//g
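:%s/&quot;\|&#34;/"/g
:%s/&amp;\|&#38;/\&/g
:%s/&lt;\|&#60;/</g
:%s/&gt;\|&#62;/>/g
:%s/&nbsp;/ /g
:%s/&middot;\|&#183;/·/g
:%s/&hellip;/…/g
:%s/&ndash;\|&#8211;/–/g
:%s/&mdash;\|&#8212;/—/g
:%s/&lsquo;\|&#8216;/‘/g
:%s/&rsquo;\|&#8217;/’/g
:%s/&ldquo;\|&#8220;/“/g
:%s/&rdquo;\|&#8221;/”/g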
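To see the combined effect of stripping tags and replacing entities on a small sample, here is a Python sketch. It is an illustration only: it uses the standard library's html.unescape, which decodes many more entities than the handful listed above, and the function name and sample string are made up for the example.

```python
import html
import re

def strip_tags_and_entities(text):
    """Remove HTML tags, then turn character entities into plain characters."""
    without_tags = re.sub(r"<[^>]*>", "", text)
    decoded = html.unescape(without_tags)
    # html.unescape turns &nbsp; into a non-breaking space (U+00A0);
    # the article replaces it with an ordinary space instead.
    return decoded.replace("\u00a0", " ")

sample = "<p>&ldquo;YY&rdquo; &amp; Co&hellip;&nbsp;&lt;end&gt;</p>"
print(strip_tags_and_entities(sample))
```

As in the article, the tags are removed before the entities are decoded, so text such as &lt;end&gt; survives as literal angle brackets instead of being mistaken for a tag.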
Then execute a series of substitutions to complete the final format organization.
:%s/\r//g
:%s/^[ \t]*//g
:%s/^$\n//g
:%s/^\d\+年\d\+月\d\+日.*/\t\t&\r\r/g
You can directly view the organized effect here: http://www.cn-dos.net/forum/viewthread.php?tid=24956
Were all the commands we worked so hard to type good for this one task only? No. VIM can save commands and operations as scripts to be run again and again. How to do that will be explained next time.
Postscript: This article is an example application of VIM and does not cover its basic operations, so beginners may find it hard to follow step by step. I recommend first working through vimtutor, the 30-minute tutorial mentioned in the VIM help. The Chinese version of the help can be downloaded from http://vimcdoc.sourceforge.net/.
[ Last edited by 无奈何 on 2006-11-22 at 06:15 AM ]
