Practical Needs
When you run into a long, classic discussion thread;
when you find a software tutorial split across several pages;
when you discover a serialized novel you just can't put down —
saving all of that multi-page content turns into a tedious, mind-numbing mechanical chore.
Whether you copy by hand or lean on some download tool, a lot of manual intervention is still required, and that is something we intelligent beings should not have to put up with.
Computers were invented precisely to take repetitive work off our hands, so why not hand as much of it to them as we can?
Unfortunately, Tofu has not yet found any software that does what I want, and since nothing ready-made exists, the only option is to build it myself.
Analyzing the Approach
To solve a problem we first need a concrete setting, since no single solution can cover every case, so let's assume the task is to merge the multi-page threads commonly seen on forums.
To merge a multi-page thread, we first have to fetch the content of every one of its pages; repetitive work like this is exactly what machines are best at.
Next, we need to work out where the user-posted content on a page begins and where it ends. A human has to do this part once; after that, the machine takes over.
Finally, we extract the content we want and reassemble it into the final result, which the machine can again handle perfectly well on its own.
Once these three points are covered, we have freed ourselves from the repetitive work and can go do something else.
Solution
High-level languages require dedicated study and supporting software, which quietly raises the barrier to entry, so in the end Tofu chose to do the whole job with the Windows CMD command line.
Of course, CMD has no built-in way to fetch web pages, so we also enlist the powerful command-line tool Curl to lend a hand.
As an example, let's merge the CCF Elite Technology Forum thread MPlayer 2006-03-03 K&K Update at Post 992, following the approach above step by step toward the final goal.
Web Page Crawling
With Curl's help, we can easily fetch the page we want from the command line:
curl -o tmp1.txt "http://bbs.et8.net/bbs/showthread.php?t=634659&page=1&pp=15"
This saves the first page of the thread into tmp1.txt. The URL is quoted so that CMD does not treat the & characters in the query string as command separators.
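Alternatively, each & can be escaped with a caret instead of quoting the whole URL; this is the form used in the batch script further down:
curl -o tmp1.txt http://bbs.et8.net/bbs/showthread.php?t=634659^&page=1^&pp=15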
For sites that check the browser's identity, we can add
-A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
to masquerade as Internet Explorer.
For sites that require cookies, we can use
-D cookie1.txt
to save the cookies the server sends and
-b cookie1.txt
to send them back on the next request.
For sites that block hot-linking, we can use
-e "http://bbs.et8.net/"
to pretend we arrived via a related link. A combined single-page fetch using these options is sketched below. Pairing Curl with CMD's powerful FOR command and variables, plus a little human ingenuity, gives us a script that fetches every page of the thread automatically.
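For instance, a one-off command for fetching page 2 while replaying the cookies saved from page 1 might look roughly like this (the cookie file names here are only illustrative):
curl -b cookie1.txt -D cookie2.txt -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" -e "http://bbs.et8.net/" -o tmp2.txt "http://bbs.et8.net/bbs/showthread.php?t=634659&page=2&pp=15"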
Looking at the thread's URL, we can see that page= carries the page number, which gives us the hook for automation; we also know the thread has 73 pages. The final fetching script looks like this:
@echo off
setlocal ENABLEDELAYEDEXPANSION
rem "last" remembers the previously fetched page, used for the Referer and the cookie file
set last=1
for /l %%i in (1,1,73) do (
echo %%i
curl -b cookie!last!.txt -D cookie%%i.txt -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" -e "http://bbs.et8.net/bbs/showthread.php?t=634659^&page=!last!^&pp=15" -o tmp%%i.txt http://bbs.et8.net/bbs/showthread.php?t=634659^&page=%%i^&pp=15
rem remember the page just fetched so the next request replays its cookies and Referer
set last=%%i
)
rem concatenate the per-page files, then clean up
copy tmp*.txt temp.txt
del cookie*.txt
del tmp*.txt
endlocal
Save the script above as grab.cmd; after it runs we have temp.txt, holding all 73 pages of the thread.
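One caveat: copy with a wildcard appends the files in whatever order the file system lists them (on NTFS that is name order, so tmp10.txt comes before tmp2.txt), which can leave the pages out of sequence. A minimal fix, assuming the page count is known as above, is to replace the copy line with a FOR /L loop that appends the pages numerically:
del temp.txt 2>nul
for /l %%i in (1,1,73) do type tmp%%i.txt >>temp.txt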
Content Analysis
Because CMD has trouble with other text encodings, first re-save temp.txt in ANSI encoding.
Looking at a single page, Tofu found that the forum software emits a <div id="posts"> exactly once per page, just before the user content begins,
and an equally unique <!-- start content table --> where it ends. These are exactly the markers we were hoping to find.
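If you want a quick sanity check, counting the lines in temp.txt that contain each marker should come to 73 for both, one per page, provided those strings really are unique within a page:
find /c "div id=" temp.txt
find /c "start content table" temp.txt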
Text Processing
A FOR command can only apply one set of parsing rules to each line it reads, so Tofu processes the whole file with nested FOR loops.
First,
for /f "delims=" %%i in (temp.txt) do ( echo %%i >tmp.txt )
writes the content of temp.txt into tmp.txt one line at a time,
and a second FOR, nested inside it, then re-parses that single line from tmp.txt.
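In skeleton form the pattern looks like this; a minimal sketch where the inner loop simply prints the first token of each line:
@echo off
for /f "delims=" %%i in (temp.txt) do (
echo %%i >tmp.txt
rem the inner FOR sees only the one line currently sitting in tmp.txt
for /f "tokens=1 delims=< " %%j in (tmp.txt) do echo first token: %%j
)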
Flag Setting
The delims= and tokens= options of FOR let us split a line and keep the pieces.
We use
for /f "tokens=1-3 delims=<->= " %%j in (tmp.txt)
to split the line on the characters "<", ">", "-", "=" and space,
and store the first three resulting tokens in the variables %%j, %%k and %%l. Then if statements test whether these three tokens match the conditions for setting the flag:
if "%%j"=="div" if "%%k"=="id" if %%l=="posts" set flag=1
if "%%j"=="start" if "%%k"=="content" if "%%l"=="table" set flag=0
flag=1 marks the start of the user content, and flag=0 marks its end.
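To see the tokenization concretely, here is a small self-contained sketch that writes the start marker into tmp.txt and runs the same test on it. Note that %%l comes out as "posts" with the quote characters still attached, because " is not one of the delimiters, which is why the comparison if %%l=="posts" carries no extra pair of quotes:
@echo off
setlocal ENABLEDELAYEDEXPANSION
set flag=0
rem write the start marker into tmp.txt; < and > must be caret-escaped outside quotes
>tmp.txt echo ^<div id="posts"^>
for /f "tokens=1-3 delims=<->= " %%j in (tmp.txt) do (
echo tokens: %%j / %%k / %%l
if "%%j"=="div" if "%%k"=="id" if %%l=="posts" set flag=1
)
echo flag is now !flag!
endlocal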
Content Trimming
Because delayed expansion is enabled, the "!" in the HTML comment opener "<!--" gets swallowed when a line is echoed, turning it into "<--"; the browser then no longer recognizes the comment, and content we never wanted to see shows up in the output.
We can add one more FOR to filter those lines out:
for /f "tokens=1-8 delims=< " %%m in (tmp.txt) do (
if not "%%t"=="-->" if not "%%s"=="-->" if not "%%r"=="-->" if not "%%q"=="-->" if not "%%p"=="-->" if not "%%o"=="-->" if not "%%m"=="ECHO" if !flag!==1 echo %%i >>new.htm
)
This same FOR also does the real work: while the start flag is set, it appends the current line to new.htm.
Final Script
@echo off
setlocal ENABLEDELAYEDEXPANSION
set flag=0
rem outer loop: feed temp.txt into tmp.txt one line at a time
for /f "delims=" %%i in (temp.txt) do (
echo %%i >tmp.txt
rem inner loop 1: look for the start and end markers and flip the flag
for /f "tokens=1-3 delims=<->= " %%j in (tmp.txt) do (
if "%%j"=="div" if "%%k"=="id" if %%l=="posts" set flag=1
if "%%j"=="start" if "%%k"=="content" if "%%l"=="table" set flag=0
rem inner loop 2: skip leftover comment markers, then append the line while the flag is set
for /f "tokens=1-8 delims=< " %%m in (tmp.txt) do (
if not "%%t"=="-->" if not "%%s"=="-->" if not "%%r"=="-->" if not "%%q"=="-->" if not "%%p"=="-->" if not "%%o"=="-->" if not "%%m"=="ECHO" if !flag!==1 echo %%i >>new.htm
)
)
)
del tmp.txt
endlocal
Save the script as merge.cmd; after it runs, the merged new.htm contains all 1083 posts of the thread.
Optimization and Improvement
The script only extracts the text content. We could go further and pick up the images as well, by detecting IMG elements
and expanding the relative paths in their src attributes into full URLs, so that the pictures inside the posts display correctly.
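As a first step, and purely as a rough sketch (it assumes the image tag sits at the start of its line, which is not guaranteed), another FOR over tmp.txt could flag the lines carrying an img tag for later rewriting:
rem rough sketch: detect lines whose first tag is an img, so they can be post-processed later
for /f "tokens=1 delims=< " %%x in (tmp.txt) do (
if /i "%%x"=="img" echo this line contains an image tag
)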
Postscript and Summary
Combining CMD with Curl can take care of a lot of complex batch work. It costs a little extra time the first time around, but after that the script can be reused with almost no effort.
This script can fetch and merge any thread on the CCF Elite Technology Forum, as well as some other forums running vBulletin; for other forums it will need to be adapted before it can be used.
This article was written by chenke_ikari and first published on Tofu's Simple Hut.
It is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 China license.