China DOS Union

Union site: www.cn-dos.net Forum site: www.cn-dos.net/forum
[Original] CMD and Curl Combined: Automatically Merging Multi-Page Threads
Original Poster Posted 2006-08-01 23:07 · Chengdu, Sichuan, China (China Telecom)
Practical Needs

When encountering long and classic discussion threads;
When seeing software tutorials divided into multiple pages;
When discovering serialized novels that are hard to put down.

Saving such multi-page content is a tedious, mechanical chore.
Whether we copy by hand or rely on software to save the pages, a great deal of human intervention is needed, which is something we intelligent beings cannot tolerate.
Computers exist to take over complicated work from people, so why not leave as much of it to them as possible?
Unfortunately, Tofu has not yet found software that meets these requirements. With nothing ready-made available, Tofu has to build it.

Analysis of the Approach

A problem can only be solved in a concrete setting, since no single solution covers every case. Let's assume the problem is merging the multi-page topics that are common in forums.
To merge a multi-page topic, we first need to fetch the content of every page. This repetitive work is ideal for a machine.
Second, we need to tell where the user-posted content starts and where it ends. A human has to work this out once; after that the machine can take over.
Finally, we need to extract the content we want and reassemble it into the final result, which the machine can also do well.
Once these three points are covered, we are free of the repetitive work and can do other things.

Solution

High-level languages take dedicated study and supporting software, which quietly raises the barrier to use, so Tofu finally chose the CMD command line for this task.
CMD itself has no command for fetching web content, so we also need the powerful command-line tool Curl.
Let's take the CCF Elite Technology Forum thread "MPlayer 2006-03-03 K&K Update at Post 992" as the example and follow the ideas above step by step toward the final goal.

Web Page Crawling

With the help of Curl, we can easily fetch the web pages we want from the command line. The URL is quoted because CMD would otherwise treat each & as a command separator:

curl -o tmp1.txt "http://bbs.et8.net/bbs/showthread.php?t=634659&page=1&pp=15"


In this way, we have saved the content of the first page of this topic in the tmp1.txt file.
For websites that check browser information, we can add
-A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
to pose as an IE browser.
For websites that rely on cookies, we can add
-D cookie1.txt
to save the response headers (including any cookies the server sets) and
-b cookie1.txt
to send those cookies back.
For websites with anti-hotlinking checks, we can add
-e "http://bbs.et8.net/"
to make the request look as if it came from a related link.

Combined with CMD's FOR command and variables, plus a little human ingenuity, we can write a script that crawls the whole topic automatically.
The URL of this topic shows that page= carries the page number, which is the basis for automation. We also know that the topic has 73 pages. The final crawling script is as follows:

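@echo off
setlocal ENABLEDELAYEDEXPANSION
rem last is the previously fetched page and selects the cookie file and the referer
set last=1
rem start from an empty result file because pages are appended one by one
if exist temp.txt del temp.txt
for /l %%i in (1,1,73) do (
echo %%i
curl -b cookie!last!.txt -D cookie%%i.txt -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" -e "http://bbs.et8.net/bbs/showthread.php?t=634659^&page=!last!^&pp=15" -o tmp%%i.txt http://bbs.et8.net/bbs/showthread.php?t=634659^&page=%%i^&pp=15
rem append in numeric page order, a wildcard copy would place tmp10 before tmp2
type tmp%%i.txt >>temp.txt
set last=%%i
)
del cookie*.txt
del tmp*.txt
endlocal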


Save the above script as grab.cmd. After running it, we get the temp.txt file that saves all 73 pages of this topic.
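
For readers who would like to see the same crawl loop outside CMD, here is a minimal Python sketch. It is not the script used in this post: the names page_url and fetch_thread are hypothetical, and it assumes the thread URL pattern and the 73-page count given above. A shared cookie jar and explicit User-Agent and Referer headers stand in for curl's -b/-D, -A and -e options.

```python
import urllib.request
from http.cookiejar import CookieJar

# URL pattern taken from the post; {page} is the page number.
BASE = "http://bbs.et8.net/bbs/showthread.php?t=634659&page={page}&pp=15"
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"


def page_url(page):
    """Return the URL of one page of the thread."""
    return BASE.format(page=page)


def fetch_thread(last_page, out_path="temp.txt"):
    """Download pages 1..last_page in order and append them to out_path."""
    # One cookie jar shared by all requests plays the role of curl's -b / -D.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    referer = page_url(1)
    with open(out_path, "wb") as out:
        for page in range(1, last_page + 1):
            request = urllib.request.Request(
                page_url(page),
                headers={"User-Agent": USER_AGENT, "Referer": referer},
            )
            with opener.open(request) as response:
                out.write(response.read())
            # The page just fetched becomes the referer for the next request.
            referer = page_url(page)
```

Under these assumptions, calling fetch_thread(73) would write all pages into temp.txt in page order; nothing is downloaded until that call is made.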

Content Analysis

Because CMD handles non-ANSI text poorly, we first re-save temp.txt in ANSI encoding.
After analyzing a single page, Tofu found that the forum software emits <div id="posts">, which appears only once per page, just before the user content starts,
and an equally unique <!-- start content table --> at the end. These are exactly the markers we were hoping to find.

Text Processing

Since a FOR command applies the same rule to one line at a time, Tofu uses nested FOR loops to work through the whole file.
First, use
for /f "delims=" %%i in (temp.txt) do ( echo %%i >tmp.txt )
to write each line of temp.txt, one at a time, into tmp.txt, overwriting the previous line.
Then apply a second FOR to the single line now in tmp.txt.

Flag Setting

The delims= and tokens= parameters of FOR split a line and keep the pieces we ask for.
We use
for /f "tokens=1-3 delims=<->= " %%j in (tmp.txt)
to split the line at "<", ">", "-", "=" and " ",
and store the first three pieces in the variables %%j, %%k and %%l. An IF statement then checks whether these three variables meet the conditions for setting the flag:

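if "%%j"=="div" if "%%k"=="id" if %%l=="posts" set flag=1
if "%%j"=="start" if "%%k"=="content" if "%%l"=="table" set flag=0


flag=1 means the user content has started, and flag=0 means it has ended. Note that %%l is left unquoted in the first test: the quotation mark is not a delimiter, so for <div id="posts"> the third token is already "posts", quotes included.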
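
To make the splitting rule easier to follow, here is a small Python sketch that imitates it. It only approximates FOR /F, and the names split_tokens and update_flag are hypothetical. Dropping the "!" before splitting mimics what CMD does to the comment marker when delayed expansion is enabled.

```python
import re

DELIMS = "<->= "  # the same delimiter characters as in delims=<->=


def split_tokens(line, delims=DELIMS, count=3):
    """Split roughly like FOR /F: runs of delimiters separate the tokens."""
    pattern = "[" + re.escape(delims) + "]+"
    tokens = [t for t in re.split(pattern, line) if t]
    return tokens[:count]


def update_flag(flag, line):
    """Return the flag value after looking at one line of the page."""
    # CMD with delayed expansion drops the lone "!" of "<!--"; do the same here.
    tokens = split_tokens(line.replace("!", ""))
    if tokens == ["div", "id", '"posts"']:
        return 1  # user content starts
    if tokens == ["start", "content", "table"]:
        return 0  # user content ends
    return flag
```

The third token of the start marker keeps its quotation marks here as well, which is why it is compared against '"posts"' rather than 'posts'.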

Content Trimming

With delayed expansion enabled, CMD consumes the exclamation mark when it echoes a line, so the HTML comment opener "<!--" becomes "<--", which would let unexpected content through.
We can add another FOR to solve this problem:

for /f "tokens=1-8 delims=< " %%m in (tmp.txt) do (
if not "%%t"=="-->" if not "%%s"=="-->" if not "%%r"=="-->" if not "%%q"=="-->" if not "%%p"=="-->" if not "%%o"=="-->" if not "%%m"=="ECHO" if !flag!==1 echo %%i >>new.htm)


This FOR skips lines with a "-->" among tokens three to eight, as well as lines whose first token is ECHO (such as the "ECHO is off." message CMD writes when the echoed line is blank). It also does the work of storing the content after the start flag into new.htm.

Final Script

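@echo off
setlocal ENABLEDELAYEDEXPANSION
set flag=0
rem start from an empty output file because lines are appended
if exist new.htm del new.htm
for /f "delims=" %%i in (temp.txt) do (
rem copy the current line into tmp.txt so the inner loops can tokenize it
echo %%i >tmp.txt
for /f "tokens=1-3 delims=<->= " %%j in (tmp.txt) do (
rem switch the flag on at the posts div and off at the end marker
if "%%j"=="div" if "%%k"=="id" if %%l=="posts" set flag=1
if "%%j"=="start" if "%%k"=="content" if "%%l"=="table" set flag=0
for /f "tokens=1-8 delims=< " %%m in (tmp.txt) do (
rem keep the line only while the flag is on, skipping comment and ECHO lines
if not "%%t"=="-->" if not "%%s"=="-->" if not "%%r"=="-->" if not "%%q"=="-->" if not "%%p"=="-->" if not "%%o"=="-->" if not "%%m"=="ECHO" if !flag!==1 echo %%i >>new.htm
)
)
)
del tmp.txt
endlocal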


Save the script as merge.cmd. After running it, the merged new.htm file contains all 1083 posts of this topic.
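
The overall effect of merge.cmd can be pictured with the following Python sketch. It is a simplification and not a port: it looks for the two markers directly instead of tokenizing each line, and it does not reproduce the extra filtering of "-->" and ECHO lines. The name extract_posts is hypothetical.

```python
START_MARK = '<div id="posts">'
END_MARK = "<!-- start content table -->"


def extract_posts(lines):
    """Keep only the lines from each start marker up to the next end marker."""
    kept = []
    inside = False
    for line in lines:
        if START_MARK in line:
            inside = True       # same role as "set flag=1"
        elif END_MARK in line:
            inside = False      # same role as "set flag=0"
        if inside:
            kept.append(line)   # the start marker line itself is kept
    return kept
```

As in the CMD version, the line carrying the start marker is written out and the line carrying the end marker is not, so the posts of consecutive pages end up back to back.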

Optimization and Improvement

This script only handles text content. We could also detect images by looking for IMG elements
and expand the path in each src attribute to a full URL, so that the pictures in the merged content display correctly.
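
One possible way to complete the image paths is sketched below in Python. This is only an illustration of the idea, not something the scripts above do: the function name absolutize_images is hypothetical, and it assumes that every src value is written in double quotes and that the page URL is known.

```python
import re
from urllib.parse import urljoin

# Matches an <img ...> tag up to and including a double-quoted src value.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', re.IGNORECASE)


def absolutize_images(html, base_url):
    """Rewrite each double-quoted img src so that it is an absolute URL."""
    def fix(match):
        absolute = urljoin(base_url, match.group(2))
        return match.group(1) + absolute + match.group(3)
    return IMG_SRC.sub(fix, html)
```

Relative paths are resolved against the page URL, while src values that are already absolute come back unchanged.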

Postscript and Summary

The combination of CMD and Curl can handle many complex batch tasks. It takes a little more time to set up the first time, but afterwards it is convenient to reuse.
This script crawls and merges any topic on the CCF Elite Technology Forum, and on some other vBulletin-based forums, without trouble. For other forums it has to be adapted case by case.

This article is an original work by chenke_ikari, first published on Tofu's Simple Hut.
This article is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 China License.


Floor 2 Posted 2006-08-02 21:15 · Dalian, Liaoning, China (Education Network)
Why can't I open Tofu's Simple Hut from here?
Floor 3 Posted 2006-08-02 22:52 · Huludao, Liaoning, China (China Mobile Tietong)
RE ikari
As soon as I logged in to the forum, I saw your three articles marked as digest posts. First of all, welcome to our forum; I hope you will take part in the discussions more often.
Regarding the use of curl, why not choose the form curl "http://bbs.et8.net/bbs/showthread.php?t=634659&page=" -o tmp#1.htm to download the multi-page links?
In fact, I am also writing a batch script with crawler-like functions, but it has only just been started. At the outset I hesitated between curl and wget. Curl is very good at simulating a browser but cannot download links recursively. Wget has powerful recursive downloading but lacks curl's convenience for fetching a regular, ordered series of links. In the end I chose wget, because I value recursive downloading and the conversion of relative links to absolute links more; both make further processing of the pages easier. To make up for wget's shortcoming, I wrote a script that downloads multi-page links, which is part of the batch script I mentioned.

downhtm.cmd



@echo off
rem each line of url.txt holds a URL and an optional page range
for /f "eol=# tokens=1,2 delims= " %%i in (url.txt) do (
call :setpage "%%i" %%j
)
goto :EOF

:setpage
rem %1 is the quoted URL template and %2 the page range, such as 1-5 or A01-05
set flag=0
set _url="%~1"
set pages=%2
rem a line without a page range is downloaded as it stands
if "%pages%"=="" (
wget -k %_url%
goto :EOF
)
set endpage=%pages:*-=%
call set startpage=%%pages:-%endpage%=%%
rem a leading letter marks zero-padded numbers, so a 1 is put in front to protect the zeros
if "%pages:~0,1%" GTR "9" (
set pages=%pages:~1%
set startpage=1%startpage:~1%
set endpage=1%endpage%
set flag=1
)
for /l %%i in (%startpage%,1,%endpage%) do (
call :download %%i
)
goto :EOF

:download
set num=%1
rem remove the protective 1 again before building the URL
if "%flag%" == "1" (
set num=%num:~1%
)
call set url=%%_url:(*)=%num%%%
wget -k %url%
goto :EOF
Posted by Helpless on 2006-08-02 22:38



url.txt
The format is as follows:

#This line is a comment line, starting with "#"
#A page range that starts with a letter means the page numbers are zero-padded to a fixed width.
http://www.cn-dos.net/forum/forumdisplay.php?fid=9&page=(*) 1-5
http://www1.mydeskcity.com/xpbz(*).htm A01-05
http://www.cn-dos.net/forum/forumdisplay.php?fid=23

Please note that in the second line of downhtm.cmd, delims= is followed by a tab character, which may be displayed as one or more spaces.
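
To illustrate how the url.txt format is meant to be read, here is a short Python sketch that expands one line into the URLs it describes. It reflects my reading of the format above and is not part of downhtm.cmd; the function name expand_line is hypothetical, and it accepts either tabs or spaces between the URL and the page range.

```python
import re


def expand_line(line):
    """Expand one url.txt line into the list of URLs it describes."""
    line = line.strip()
    if not line or line.startswith("#"):
        return []                      # blank line or comment line
    parts = line.split()
    if len(parts) == 1:
        return [parts[0]]              # plain URL without a page range
    url, page_range = parts[0], parts[1]
    match = re.fullmatch(r"([A-Za-z]?)(\d+)-(\d+)", page_range)
    if match is None:
        raise ValueError("unrecognized page range: " + page_range)
    letter, start, end = match.groups()
    # A leading letter means: pad every number to the width of the first one.
    width = len(start) if letter else 0
    return [url.replace("(*)", str(n).zfill(width))
            for n in range(int(start), int(end) + 1)]
```

With the sample file above, the first URL line yields page=1 through page=5, the second yields xpbz01.htm through xpbz05.htm, and the third is returned unchanged.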

Floor 4 Posted 2006-08-03 08:52 · Chengdu, Sichuan, China (China Telecom)
Originally posted by IceCrack at 2006-8-2 21:15:
Why can't I open Tofu's Simple Hut from here?


Because it uses a foreign DDNS service, and DNS resolution fails in some provinces and cities. If needed, you can use a proxy or change your DNS server to
61.144.227.5
203.198.7.66
This has no side effects and also fixes access problems with services such as Google and Gmail.
Floor 5 Posted 2006-08-03 09:02 · Chengdu, Sichuan, China (China Telecom)
Originally posted by Helpless at 2006-8-2 22:52:
RE ikari
As soon as I log in to the forum, I see your three articles that have been set as essence. First of all, welcome to join our forum. At the same time, I hope that you can participate in the forum discussions more in the future.
Regarding the use of curl, for...



Hehe, that was kind of the moderators, and I hardly deserve it. Whether it is cmd or curl, Tofu is learning on the fly and writing down whatever comes to mind. There is still a lot to learn, and I post here just to get a discussion going.

As for downloading the links with curl, Tofu really did use a clumsy method. I will study wget and the moderator's script carefully before coming back to ask for advice.
Floor 6 Posted 2006-12-23 08:20 · Chengdu, Sichuan, China (Education Network)
Good post, bump~~
Floor 7 Posted 2006-12-24 04:31 · Dongguan, Guangdong, China (China Telecom)
Learning
Floor 8 Posted 2009-11-09 08:31 · Nanning, Guangxi, China (China Telecom)
Moderator Helpless's comment is quite accurate. It would be nice if the two programs were combined.