Practical Needs
When you come across a long, classic discussion thread;
When you find a software tutorial split across multiple pages;
When you discover a serialized novel that is hard to put down.
Saving such multi-page content quickly becomes a tedious, mechanical chore.
Whether you copy by hand or rely on software to save each page, a great deal of manual intervention is required, which is something we intelligent beings should not have to tolerate.
Since computers exist to take over tedious work from people, why not hand as much of it to them as possible?
Unfortunately, Tofu has not yet found software that meets these requirements. Since nothing ready-made is available, Tofu has to build it.
Analysis of the Approach
A problem can only be solved in a concrete setting, because no single solution covers every case. Let's assume the problem is to merge the multi-page topics that are common in forums.
To merge a multi-page topic, we first need to obtain the content of every page of the topic. This repetitive work is best left to the machine.
Second, we need to identify where the content posted by users starts and where it ends. A human has to work this out once; after that, the machine can apply it to every page.
Finally, we need to extract the content we want and reassemble it into the final result, which the machine can also do well.
Once these three steps are covered, we are free of the repetitive work and can do other things.
Solution
High-level languages require dedicated study and supporting software, which raises the barrier to using them, so Tofu chose the CMD command line for this task.
Of course, CMD has no built-in command for fetching web content, so we also need the powerful command-line tool Curl.
Let's take the CCF Elite Technology Forum topic MPlayer 2006-03-03 K&K Update at Post 992 as an example, and follow the steps above one by one to reach the final goal.
Web Page Crawling
With the help of Curl, we can easily crawl the web pages we want through the command line:
curl -o tmp1.txt "http://bbs.et8.net/bbs/showthread.php?t=634659&page=1&pp=15"
This saves the first page of the topic to tmp1.txt. The URL is quoted because CMD would otherwise treat each & as a command separator.
For websites that check browser information, we can use
-A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" to pose as an IE browser.
For websites that require cookies, we can use
-D cookie1.txt to save the response headers, which carry the cookies, and -b cookie1.txt to read cookies.
For websites with anti-hotlinking protection, we can use
-e "http://bbs.et8.net/" to make the request appear to come from a related link.
Combined with the powerful FOR command and variables in CMD, plus a little human ingenuity, we can write a script that automatically fetches all the content of this topic.
Analyzing the URL of this topic shows that page= holds the page number, which gives us the basis for automation. We also know that this topic has 73 pages. The final crawling script is as follows:
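@echo off
setlocal ENABLEDELAYEDEXPANSION
rem last is the page fetched before the current one. It selects the cookie file and the referer.
set last=1
if exist temp.txt del temp.txt
for /l %%i in (1,1,73) do (
    echo %%i
    rem Inside double quotes the ampersand needs no caret escape.
    curl -b cookie!last!.txt -D cookie%%i.txt -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" -e "http://bbs.et8.net/bbs/showthread.php?t=634659&page=!last!&pp=15" -o tmp%%i.txt "http://bbs.et8.net/bbs/showthread.php?t=634659&page=%%i&pp=15"
    rem Append each page right away so that temp.txt stays in numeric page order.
    rem A wildcard copy would follow file name order instead, with tmp10 before tmp2.
    type tmp%%i.txt >>temp.txt
    set last=%%i
)
del cookie*.txt
del tmp*.txt
endlocal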
Save the script above as grab.cmd. Running it produces temp.txt, which contains all 73 pages of the topic.
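The thread ID and page count are hard-coded in the script. As a minimal sketch, assuming the same URL layout holds for other topics on this forum, they could be taken from the command line instead. The script name grab2.cmd and the variables tid and pages are hypothetical names used only for this illustration, and the cookie and referer handling is left out to keep it short.

```batch
@echo off
setlocal
rem Hypothetical usage: grab2.cmd THREAD_ID PAGE_COUNT
set "tid=%~1"
set "pages=%~2"
if "%pages%"=="" (
    echo Usage: grab2.cmd THREAD_ID PAGE_COUNT
    exit /b 1
)
if exist temp.txt del temp.txt
for /l %%i in (1,1,%pages%) do (
    echo Fetching page %%i of %pages%
    curl -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" -o tmp%%i.txt "http://bbs.et8.net/bbs/showthread.php?t=%tid%&page=%%i&pp=15"
    rem Append page by page so the merged file keeps numeric page order.
    type tmp%%i.txt >>temp.txt
)
del tmp*.txt
endlocal
```

For the example topic, this sketch would be called as grab2.cmd 634659 73.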
Content Analysis
Because of the way CMD handles characters, we first re-save temp.txt in ANSI encoding.
After analyzing the content of a single page, Tofu found that the forum software emits <div id="posts">, which occurs only once per page, just before the user content starts, and an equally unique <!-- start content table --> where it ends. These are exactly the markers we were hoping to find.
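Before relying on these two markers, it may be worth checking that each one really appears once per page. This is a small sketch, assuming temp.txt from the previous step is in the current directory. find /c counts the lines that contain a string, and a quotation mark inside the search string is written twice.

```batch
@echo off
rem Count the lines of temp.txt that contain the start marker.
find /c "<div id=""posts"">" temp.txt
rem Count the lines of temp.txt that contain the end marker.
find /c "<!-- start content table -->" temp.txt
```

If both counts equal the number of pages (73 here), the markers are safe to use as start and end flags. A different count would suggest that a marker also occurs inside post content or is missing on some pages.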
Text Processing
Since a FOR command processes one line at a time under a single set of rules, Tofu nests several FOR commands to work through the whole file.
First, use
for /f "delims=" %%i in (temp.txt) do ( echo %%i >tmp.txt )
to write the content of temp.txt into tmp.txt one line at a time.
Then apply another FOR to process the single line held in tmp.txt.
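One detail of FOR /F is worth knowing at this point: it skips empty lines. The following sketch illustrates this with a throwaway file; sample.txt is a hypothetical name chosen for the demonstration.

```batch
@echo off
rem Write three lines to a scratch file; the middle one is empty.
(
    echo first line
    echo.
    echo third line
) >sample.txt
rem Print every line that FOR /F hands to the loop body.
for /f "delims=" %%i in (sample.txt) do echo [%%i]
del sample.txt
```

Run as a batch file, this should print [first line] and [third line] only; the empty line never reaches the loop body. For merged HTML this is harmless, since blank lines carry no content.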
Flag Setting
The delims= and tokens= parameters of FOR let us split a line and keep selected pieces.
We use
for /f "tokens=1-3 delims=<->= " %%j in (tmp.txt)
to split a line at "<", ">", "-", "=" and the space character, and to store the first three pieces in the variables %%j, %%k and %%l.
Then we use IF statements to check whether these three variables match one of the markers:
if "%%j"=="div" if "%%k"=="id" if %%l=="posts" set flag=1
if "%%j"=="start" if "%%k"=="content" if "%%l"=="table" set flag=0
flag=1 means that the user content has started, and flag=0 means that it has ended.
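To see what the three variables actually receive, the splitting can be tried on a single sample line. This is a minimal sketch; sample.txt is a hypothetical scratch file, and the carets only keep CMD from treating < and > as redirection while the sample line is written.

```batch
@echo off
rem Write the start marker to a scratch file.
>sample.txt echo ^<div id="posts"^>
rem Split it with the same options as the merge script and show the tokens.
for /f "tokens=1-3 delims=<->= " %%j in (sample.txt) do echo [%%j] [%%k] [%%l]
del sample.txt
```

This should print [div] [id] ["posts"]. The third token still carries its quotation marks because the double quote is not among the delimiters, which is why the condition compares %%l with "posts" without putting quotes around %%l.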
Content Trimming
When CMD echoes a line in this script, the HTML comment opener "<!--" comes out as "<--", which would cause unintended content to be displayed.
We can add another FOR to solve this problem:
for /f "tokens=1-8 delims=< " %%m in (tmp.txt) do (
    if not "%%t"=="-->" if not "%%s"=="-->" if not "%%r"=="-->" if not "%%q"=="-->" if not "%%p"=="-->" if not "%%o"=="-->" if not "%%m"=="ECHO" if !flag!==1 echo %%i >>new.htm)
This step also writes every line that follows the start flag into new.htm.
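The loss of the exclamation mark can be reproduced in isolation. The sketch below uses a scratch file named sample.txt (a hypothetical name) and writes one comment line to it before delayed expansion is switched on.

```batch
@echo off
setlocal DISABLEDELAYEDEXPANSION
rem Write one HTML comment line; the carets escape the angle brackets.
>sample.txt echo ^<!-- start content table --^>
setlocal ENABLEDELAYEDEXPANSION
rem Echo the line back the same way the merge script does.
for /f "delims=" %%i in (sample.txt) do echo %%i
endlocal
del sample.txt
endlocal
```

The echoed line should read <-- start content table -->. With delayed expansion enabled, CMD treats the lone ! as the start of a variable reference and drops it.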
Final Script
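@echo off
setlocal ENABLEDELAYEDEXPANSION
rem flag is 1 while we are inside the user content and 0 otherwise.
set flag=0
rem Outer loop: copy temp.txt into tmp.txt one line at a time.
for /f "delims=" %%i in (temp.txt) do (
    echo %%i >tmp.txt
    rem Middle loop: split the line into tokens and update the flag.
    rem The third token of the start marker keeps its own quotation marks.
    for /f "tokens=1-3 delims=<->= " %%j in (tmp.txt) do (
        if "%%j"=="div" if "%%k"=="id" if %%l=="posts" set flag=1
        if "%%j"=="start" if "%%k"=="content" if "%%l"=="table" set flag=0
        rem Inner loop: skip damaged comment lines and keep lines inside the user content.
        for /f "tokens=1-8 delims=< " %%m in (tmp.txt) do (
            if not "%%t"=="-->" if not "%%s"=="-->" if not "%%r"=="-->" if not "%%q"=="-->" if not "%%p"=="-->" if not "%%o"=="-->" if not "%%m"=="ECHO" if !flag!==1 echo %%i >>new.htm
        )
    )
)
del tmp.txt
endlocal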
Save the script as merge.cmd. Running it produces new.htm, which contains all 1083 posts of the topic.
Optimization and Improvement
This script only collects the text content. We could also locate images by checking for IMG elements and expand the path in each src attribute to a full URL, so that the pictures in the content display correctly.
Postscript and Summary
The combination of CMD and Curl can handle many complex batch tasks. It takes a little more time to set up the first time, but after that it is convenient to reuse.
This script can crawl and merge any topic on the CCF Elite Technology Forum and on some other vBulletin-based forums without trouble, but it has to be adapted individually before it will work on other forums.
This article is an original work by chenke_ikari and was first published on Tofu's Simple Hut.
This article is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 China License.

