Getting online at school has never been very convenient, so I rarely get the chance to dig into something like this. I'm posting it to get a discussion started on batch downloading and smarter downloading.
Requirement:
I saw some nice pictures on 6dzone.com,
http://www.6dzone.com/photo/wyxc/userphoto.asp?pid=8029
http://www.6dzone.com/photo/wyxc/userphoto.asp?pid=10369
But saving them is a real chore: the connection is already painfully slow, and each picture means opening its page, right-clicking the image and choosing "Save as". Batch downloading seemed worth a look.
Analysis:
The most widely used batch image downloader seems to be GlobalFetch. It downloads plenty of pictures, but 99% of them are not the ones I want.
Thunder (Xunlei) and FlashGet also have batch download features, but they only work when the download addresses share a common pattern, such as ***001.jpg, ***002.jpg, ***003.jpg...
I haven't found any software that meets my needs, so I decided to write a program that simulates what the user does. All the data travels over HTTP anyway, and the HTML source has regular patterns, so the repetitive steps can simply be handed to a program.
Every language has its strengths. cmd batch is convenient and efficient for this kind of job: write it and run it on the spot. Doing it in a high-level language would be far more trouble. "Why use an ox cleaver to kill a chicken"!
Required command-line tools
curl: a powerful command-line browser and download tool
http://www.cn-dos.net/forum/viewthread.php?tid=20453&fpage=1&highlight=curl
wget: a powerful command-line download tool
http://baike.baidu.com/view/1312507.htm
sed: a powerful command-line stream editor
http://www.cn-dos.net/forum/viewthread.php?tid=24210&fpage=1&highlight=sed%20%2B%20wget%20%2B%20%E6%AD%A3%E5%88%99
To use tools like sed, grep and awk well, you also need to know regular expressions
Regular expressions
http://www.cn-dos.net/forum/viewthread.php?tid=24206&fpage=1&highlight=sed%20%2B%20wget%20%2B%20%E6%AD%A3%E5%88%99
A solution always targets a specific setup, since no single approach handles every site; this one only covers album downloads from 6dzone. I like the Maxthon browser, so I first add the albums I want to the favorites, then export the favorites to a bookmark.htm file.
Use sed to parse bookmark.htm and pull out the URLs we need.
First, find the lines that contain a URL:
sed "/photo/!d" bookmark.htm
Or
sed -n "/photo/p" bookmark.htm
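Then extract the URL between the quotes. The ASCII code of " is 34, written \x22 in GNU sed, so the quote never has to appear literally inside the cmd string. The first substitution deletes everything up to and including the first quote, and the second deletes the closing quote and everything after it. Combined:
sed "/photo/!d;s/[^\x22]*\x22//;s/\x22.*//" bookmark.htm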
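For anyone who wants to check the extraction logic outside cmd, here is a minimal Python sketch of what the sed step is meant to do: keep the lines that contain photo, then take the text between the first pair of double quotes. The sample bookmark lines and the function name extract_album_urls are my own assumptions for illustration; a real Maxthon export may be laid out differently.

```python
import re

def extract_album_urls(bookmark_html, keyword="photo"):
    """Return the first double-quoted string on every line that contains keyword."""
    urls = []
    for line in bookmark_html.splitlines():
        if keyword not in line:  # same filter as sed "/photo/!d"
            continue
        match = re.search(r'"([^"]*)"', line)  # text between the first pair of quotes
        if match:
            urls.append(match.group(1))
    return urls

# Hypothetical sample lines; the real export format may differ.
SAMPLE = (
    '<DT><A HREF="http://www.6dzone.com/photo/wyxc/userphoto.asp?pid=8029" '
    'ADD_DATE="1205560000">album</A>\n'
    '<DT><A HREF="http://www.cn-dos.net/forum/">forum</A>'
)

if __name__ == "__main__":
    for url in extract_album_urls(SAMPLE):
        print(url)
```

Only the first line of the sample mentions photo, so only the album address is kept.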
6dzone also requires a logged-in registered user before it shows the full-size pictures, so curl has to simulate the login and save the cookie.
Start by examining the page source and its form. I recommend the View page plugin; HttpWatch for IE and TamperData for Firefox are also very good!
curl can submit a form in two ways, GET or POST, depending on the form's method attribute. You also need the form's action and the name of each field to be submitted.
Below is the HTML source of the cn-dos forum login page:
----------------------------------------------------------------------------------------------------------
<FORM
action=logging.php?action=login method=post><INPUT type=hidden value=28c5c8a4 name=formhash> <INPUT type=hidden value=http://www.cn-dos.net/forum/viewthread.php?tid=22634&fpage=1&highlight=sed%20%2B%20wget%20%2B%20%E6%AD%A3%E5%88%99&sid=FHJYXn name=referer>
<TABLE cellSpacing=0 cellPadding=0 width="99%" align=center border=0>
<TBODY>
<TR>
<TD bgColor=#dde3ec>
<TABLE cellSpacing=1 cellPadding=4 width="100%" border=0>
<TBODY>
<TR class=header>
<TD colSpan=2>Member Login</TD></TR>
<TR>
<TD bgColor=#f8f9fc>Invisible Login:</TD>
<TD class=smalltxt bgColor=#ffffff><SELECT name=loginmode> <OPTION value="" selected>- Use Default -</OPTION> <OPTION value=normal>Normal Mode</OPTION> <OPTION value=invisible>Invisible Mode</OPTION></SELECT> </TD></TR>
<TR>
<TD bgColor=#f8f9fc>Interface Style:</TD>
<TD bgColor=#ffffff><SELECT name=styleid><OPTION value="" selected>- Use Default -</OPTION> <OPTION value=1>Default Style</OPTION></SELECT> </TD></TR>
<TR>
<TD bgColor=#f8f9fc>Cookie Validity Period:</TD>
<TD class=smalltxt bgColor=#ffffff><INPUT type=radio value=31536000 name=cookietime> One Year <INPUT style="BACKGROUND: #ffffcc" type=radio CHECKED value=31536000 name=cookietime> One Month <INPUT type=radio value=86400 name=cookietime> One Day <INPUT type=radio value=0 name=cookietime> Browser Process <A href="faq.php?page=usermaint#2" target=_blank></A></TD></TR>
<TR>
<TD bgColor=#ffffff colSpan=2 height=1></TD></TR>
<TR>
<TD align=middle colSpan=2><FONT color=red>Note:</FONT> Existing users: before logging in to the converted PHP forum for the <B>first time</B>, please repair your password. For details, see the <A href="http://www.cn-dos.net/forum/announcement.php?id=2#2">Forum Announcement</A>.</TD></TR>
<TR>
<TR>
<TD width="21%" bgColor=#f8f9fc>Username (Required):</TD>
<TD bgColor=#ffffff><INPUT style="BACKGROUND: #ffffcc" tabIndex=1 maxLength=40 size=25
name=username> <SPAN class=smalltxt><A href="register.php">Register Now</A></SPAN></TD></TR>
<TR>
<TD bgColor=#f8f9fc>Password (Required):</TD>
<TD bgColor=#ffffff><INPUT style="BACKGROUND: #ffffcc" tabIndex=2 type=password size=25 value=""
name=password> <SPAN class=smalltxt><A href="member.php?action=lostpasswd">Forgot Password</A></SPAN></TD></TR>
----------------------------------------------------------------------------------------------------------
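Reading the action, method and field names off a form like the one above can also be done by a program. The sketch below uses html.parser from the Python standard library; the class name FormInspector, the helper inspect_form and the shortened SAMPLE_FORM are hypothetical and only there for illustration.

```python
from html.parser import HTMLParser

class FormInspector(HTMLParser):
    """Collect the form's action, its method, and the names of its fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lowercased
        if tag == "form":
            self.action = attrs.get("action")
            self.method = (attrs.get("method") or "get").lower()
        elif tag in ("input", "select", "textarea"):
            name = attrs.get("name")
            if name and name not in self.field_names:
                self.field_names.append(name)

def inspect_form(html):
    inspector = FormInspector()
    inspector.feed(html)
    inspector.close()
    return inspector

# Shortened copy of the login form quoted above (an assumption for the demo).
SAMPLE_FORM = (
    '<FORM action=logging.php?action=login method=post>'
    '<INPUT type=hidden value=28c5c8a4 name=formhash>'
    '<SELECT name=loginmode><OPTION value="" selected>- Use Default -</OPTION></SELECT>'
    '<INPUT tabIndex=1 maxLength=40 size=25 name=username>'
    '<INPUT tabIndex=2 type=password size=25 value="" name=password>'
    '</FORM>'
)

if __name__ == "__main__":
    form = inspect_form(SAMPLE_FORM)
    print(form.method, form.action, form.field_names)
```

The method decides between GET and POST, the action gives the address to submit to, and the field names are what goes into the data string.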
Use curl to log in to the cn-dos forum
curl -d "username=ngd&password=cndos" http://www.cn-dos.net/forum/logging.php?action=login
Incidentally, to log in straight from the browser you can open a URL like this:
http://www.cn-dos.net/forum/logging.php?action=login&username=ngd&password=cndos&loginsubmit=.
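The two submissions above differ only in where the fields travel: with POST they go in the request body, which is what curl -d sends, and with GET they are appended to the URL. A small Python sketch of that encoding follows; the helper names build_get_url and build_post_body are made up for illustration, and the field values are the ones used in this post.

```python
from urllib.parse import urlencode

def build_get_url(action_url, fields):
    """Append the encoded fields to the URL, as a browser does for a GET form."""
    separator = "&" if "?" in action_url else "?"
    return action_url + separator + urlencode(fields)

def build_post_body(fields):
    """Encode the fields as a request body, like the string passed to curl -d."""
    return urlencode(fields).encode("ascii")

LOGIN_URL = "http://www.cn-dos.net/forum/logging.php?action=login"

if __name__ == "__main__":
    print(build_get_url(LOGIN_URL, {"username": "ngd", "password": "cndos", "loginsubmit": "."}))
    print(build_post_body({"username": "ngd", "password": "cndos"}))
```

urlencode also percent-encodes characters that are not safe in a URL, which matters once a user name or password contains spaces or symbols.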
Save the logged-in cookie in 6dzonecookie.txt
curl -c 6dzonecookie.txt
Use the cookie file
curl -b 6dzonecookie.txt
Pretend to be an IE browser
curl -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.01)"
Combined, it is
curl -c 6dzonecookie.txt -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.01)" -d "username=dddddd6&pwd=cndos" http://www.6dzone.com/user/f_login.asp>nul
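As a rough equivalent for readers who prefer a script, the same three ideas (cookie file, browser User-Agent, POST login) can be sketched with the Python standard library. This is only a sketch under assumptions: make_session and login are hypothetical helper names, and I have not run it against 6dzone. MozillaCookieJar reads and writes the cookies.txt format that curl -c also produces.

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode

USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.01)"

def make_session(cookie_file):
    """Build an opener that stores cookies and sends an IE-style User-Agent."""
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", USER_AGENT)]  # same role as curl -A
    return opener, jar

def login(opener, jar, login_url, fields):
    """POST the form fields (like curl -d) and save the cookies (like curl -c)."""
    body = urlencode(fields).encode("ascii")
    with opener.open(login_url, data=body, timeout=30) as response:
        response.read()  # the page itself is discarded, as with >nul
    jar.save(ignore_discard=True)  # keep session cookies too

# Usage (needs network access and a valid account):
#   opener, jar = make_session("6dzonecookie.txt")
#   login(opener, jar, "http://www.6dzone.com/user/f_login.asp",
#         {"username": "dddddd6", "pwd": "cndos"})
```

Later requests made through the same opener send the stored cookies automatically, which corresponds to curl -b.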
The simplest usage of wget
wget http://www.cn-dos.net/forum/images/default/logo.gif
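My connection is unstable and very slow, so I add a few more options: -t 8 tries up to 8 times, -w 3 waits 3 seconds between retrievals, -T 30 sets a 30-second timeout, -c resumes partial downloads, and -N skips files that are not newer than the local copy.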
wget -t 8 -w 3 -T 30 -c -N
The rest needs little explanation; sed does most of the work. Wrap it all in a nested for loop and it's done.
The complete code:
@echo off
rem code by 拟谷盗 for download 6dzone photo.
rem Log in once and save the session cookie.
curl -c 6dzonecookie.txt -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.01)" -d "username=dddddd6&pwd=cndos" http://www.6dzone.com/user/f_login.asp>nul
rem Outer loop: one album URL per bookmark line that contains photo.
for /f "delims=" %%a in ('sed "/photo/!d;s/[^\x22]*\x22//;s/\x22.*//" bookmark.htm') do (
rem Inner loop: turn each pic_id link on the album page into an absolute pic.asp URL.
for /f "usebackq delims=" %%b in (`curl %%a ^| sed "/pic_id/!d;s/[^\x22]*\x22//;s/\x22.*//;s/photo.asp/pic.asp/g;s/\/photo/http:\/\/www.6dzone.com&/g"`) do (
rem Fetch the picture page with the cookie and collect the jpg address.
curl -b 6dzonecookie.txt -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.01)" "%%b" | sed "/http:\/\/.*jpg/!d;s/.*http/http/g;s/jpg.*/jpg/g" >>picurl.list
wget -t 8 -w 3 -T 30 -c -N -i picurl.list
del picurl.list
)
)
del 6dzonecookie.txt
exit /b
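The two sed steps inside the loops are the hardest part to read, so here is a hedged Python sketch that mirrors them one substitution at a time. The function names and the sample strings are hypothetical; I am assuming the album page links look like /photo/.../photo.asp?pic_id=N and that the picture page contains one absolute .jpg address per line, which may not match the real site.

```python
import re

def to_picture_page(href):
    """Mirror of: s/photo.asp/pic.asp/g ; s/\\/photo/http:\\/\\/www.6dzone.com&/g"""
    # As in the sed script, the dot is unescaped and matches any character.
    href = re.sub(r"photo.asp", "pic.asp", href)
    # Prefix every /photo with the host; sed's & is the matched text.
    return re.sub(r"/photo", lambda m: "http://www.6dzone.com" + m.group(0), href)

def extract_jpg_url(line):
    """Mirror of: /http:\\/\\/.*jpg/!d ; s/.*http/http/g ; s/jpg.*/jpg/g"""
    if not re.search(r"http://.*jpg", line):
        return None  # sed deletes lines that do not match
    line = re.sub(r".*http", "http", line)  # greedy: drops everything before the last http
    return re.sub(r"jpg.*", "jpg", line)    # drops everything after the first jpg

# Hypothetical inputs for illustration only.
SAMPLE_HREF = "/photo/wyxc/photo.asp?pic_id=123"
SAMPLE_LINE = '<img src="http://www.6dzone.com/upload/123.jpg" border=0>'

if __name__ == "__main__":
    print(to_picture_page(SAMPLE_HREF))
    print(extract_jpg_url(SAMPLE_LINE))
```

Because the first pattern is greedy, a line with several addresses yields only the last one, so this approach relies on each picture sitting on its own line.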
Looking it over, the write-up is pretty messy, but it will have to do for now. I'll tidy it up when I have time.
The code could also be more concise. If any expert can help improve it, downloading pictures from web pages will be much easier in the future!
curl + wget + sed + bat file download:
http://upload.cn-dos.net/img/095.rar
Extract the archive and run the bat file to download the pictures.
[Last edited by ngd on 2008-3-15 at 02:05 PM]