
Title: Help: how do I extract image URLs from a web page's source?

Author: tianzizhi     Time: 2007-1-22 06:35    Title: Help: how do I extract image URLs from a web page's source?

I want to extract the URLs of all the images in the page source below and write them to a file named tu.txt, one URL per line, like this:
http://www.pcpop.com/pp/images/pp4_r3_c3.gif
http://www.pcpop.com/pp/images/pp4_r3_c3.gif
http://www.pcpop.com/pp/images/pp4_r3_c3.gif
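The images are jpg and gif files, and the image links appear at no fixed position in the source. Only complete image URLs beginning with http should be extracted.
Page source for download:
p.txt
You can also take the source of any page that contains images and test with that. No VBS solutions, please.
Part of the page source: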
<TR align=middle>
<TD align=left width=306>
<TABLE cellSpacing=0 cellPadding=0 width="99%" border=0>
<TBODY>
<TR>
<TD width="7%"><IMG height=25 src="http://www.pcpop.com/pp/images/pp4_r3_c3.gif" width=21 border=0></TD>
<TD vAlign=bottom width="90%" background=images/pp4_r3_c10.gif>
<TABLE cellSpacing=0 cellPadding=0 width="100%" border=0>
<TBODY>
<TR>
<TD class=ffffff_14 align=left height=22><STRONG><A class=ffffff_14 title=美女风情 href="http://www.pcpop.com/pp/t007400212_12512_1.html" target=_blank>美女风情</A></STRONG></TD></TR></TBODY></TABLE></TD>
<TD width="3%"><A href="http://www.pcpop.com/pp/t007400212_12512_1.html" target=_blank><IMG height=25 src="http://www.pcpop.com/pp/images/pp4_r3_c12.gif" width=51 border=0></A></TD></TR></TBODY></TABLE>
<TABLE cellSpacing=0 cellPadding=0 width="99%" border=0>
<TBODY>
<TR>
<TD height=1></TD></TR></TBODY></TABLE>
<TABLE cellSpacing=0 cellPadding=0 width="99%" border=0>
<TBODY>
<TR>
<TD bgColor=#ffffff height=2></TD></TR></TBODY></TABLE>
<TABLE cellSpacing=0 cellPadding=0 width="99%" border=0>
<TBODY>
<TR>
<TD align=middle bgColor=#f2f2f2>
<TABLE cellSpacing=0 cellPadding=0 width=101 border=0>
<TBODY>
<TR>
<TD vAlign=top align=middle width=101 background=images/pp1_6.jpg height=132>
<TABLE cellSpacing=0 cellPadding=0 width=20 border=0>
<TBODY>
<TR>
<TD height=4></TD></TR></TBODY></TABLE>
<TABLE cellSpacing=0 cellPadding=0 width="100%" border=0>
<TBODY>
<TR>
<TD align=right><A href="http://union.soqii.com/page/01/" target=_blank><IMG height=120 src="http://img2.pcpop.com/ArticleImages/0x0/0/386/000386225.jpg" width=90 border=0></A></TD>
<TD width=6></TD></TR></TBODY></TABLE></TD></TR></TBODY></TABLE>
<TABLE cellSpacing=0 cellPadding=0 width="100%" border=0>
<TBODY>
<TR>
<TD align=middle bgColor=#ffffff height=40><A class=x434343 href="http://union.soqii.com/page/01/" target=_blank>即将发生的诱惑</A><BR><A class=x434343 href="http://union.soqii.com/page/01/" target=_blank>(632P)</A></TD></TR></TBODY></TABLE></TD>
<TD width=20 background=images/pp3_4.gif>&nbsp;</TD>
<TD align=middle bgColor=#f2f2f2>
<TABLE cellSpacing=0 cellPadding=0 width=101 border=0>
<TBODY>
<TR>
<TD vAlign=top align=middle width=101 background=images/pp1_6.jpg height=132>
<TABLE cellSpacing=0 cellPadding=0 width=20 border=0>
<TBODY>
<TR>
<TD height=4></TD></TR></TBODY></TABLE>
<TABLE cellSpacing=0 cellPadding=0 width="100%" border=0>
<TBODY>
<TR>
<TD align=right><A href="http://www.huzhai.com/bbs/" target=_blank><IMG height=120 src="http://img2.pcpop.com/ArticleImages/0x0/0/332/000332157.jpg" width=90 border=0></A></TD>
<TD width=6></TD></TR></TBODY></TABLE></TD></TR></TBODY></TABLE>
<TABLE cellSpacing=0 cellPadding=0 width="100%" border=0>
<TBODY>
<TR>
<TD align=middle bgColor=#ffffff height=40><A class=x434343 href="http://www.huzhai.com/bbs/" target=_blank>偷拍对面楼里的</A><BR><A class=x434343 href="http://www.huzhai.com/bbs/" target=_blank>(354P)</A></TD></TR></TBODY></TABLE></TD></TR></TBODY></TABLE>
<TABLE cellSpacing=0 cellPadding=0 width="99%" border=0>
<TBODY>
<TR>
<TD height=2></TD></TR></TBODY></TABLE></TD>
<TD align=right width=306>
<TABLE cellSpacing=0 cellPadding=0 width="99%" border=0>
<TBODY>
<TR>
<TD width="7%"><IMG height=25 src="http://www.pcpop.com/pp/images/pp4_r3_c3.gif" width=21 border=0></TD>
<TD vAlign=bottom width="90%" background=images/pp4_r3_c10.gif>
<TABLE cellSpacing=0 cellPadding=0 width="100%" border=0>
<TBODY>
<TR>
<TD class=ffffff_14 align=left height=22><STRONG><A class=ffffff_14 title=婷婷玉立 href="http://www.pcpop.com/pp/t007400212_11894_1.html" target=_blank>婷婷玉立</A></STRONG></TD></TR></TBODY></TABLE></TD>
<TD width="3%"><A href="http://www.pcpop.com/pp/t007400212_11894_1.html" target=_blank><IMG height=25 src="http://www.pcpop.com/pp/images/pp4_r3_c12.gif" width=51 border=0></A></TD></TR></TBODY></TABLE>
<TABLE cellSpacing=0 cellPadding=0 width="99%" border=0>
<TBODY>
<TR>

[ Last edited by tianzizhi on 2007-1-22 at 06:44 AM ]
Author: lxmxn     Time: 2007-1-22 06:54

  This should be fairly easy to solve with sed's regular-expression matching.

Author: namejm     Time: 2007-1-22 07:23
  Here is a pure batch solution:
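@echo off
rem Prints one .gif URL per matching line of test.txt, taken from the first src attribute.
setlocal enabledelayedexpansion
for /f "delims=" %%i in ('findstr "http://.*gif" test.txt') do (
    set "var=%%i"
    rem Mark the start and the end of the URL with a star.
    set "var=!var: src=☆!"
    set "var=!var:.gif"=☆!"
    rem Discard everything up to and including the first star.
    set "var=!var:*☆=!"
    rem The first token is the URL without its .gif extension, so append it again.
    for /f "tokens=1 delims==☆" %%j in ("!var!") do echo %%~j.gif
)
pause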

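Author: dikex     Time: 2007-1-22 07:41
HTML seems to be very lax about line breaks: all of the markup can be written on a single line, so it would not be unusual for two image URLs introduced by src= to appear on the same line. In that case namejm's code can probably only find the first one.
I don't know much about web languages, so I may be misremembering...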
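
To illustrate the point about several images sharing one line, here is a minimal sketch. It assumes a Unix-style shell and a grep that supports the -o option (GNU grep does), so it is not a pure batch solution. The function name extract_image_urls is made up for illustration and does not come from the thread.

```shell
#!/bin/sh
# Sketch only: print every absolute http:// image URL that ends in
# .gif or .jpg, one per line, even when several share one input line.
# extract_image_urls is a hypothetical helper name.
extract_image_urls() {
    # -o prints each match on its own line, -i ignores case (.GIF, .JPG),
    # -E enables the (gif|jpg) alternation.
    # [^" <>]* keeps a match inside a single attribute value.
    # LC_ALL=C treats the input as plain bytes, so non-UTF-8 page text
    # (an assumption about the page encoding) does not upset grep.
    LC_ALL=C grep -oiE 'http://[^" <>]*\.(gif|jpg)'
}
```

Running extract_image_urls < p.txt > tu.txt should then write one URL per line. Relative paths such as background=images/pp4_r3_c10.gif are skipped because the pattern requires the http:// prefix.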
Author: tianzizhi     Time: 2007-1-22 07:44
thanks very much!!!!
Author: vkill     Time: 2007-1-22 08:15
sed "/http/s/.*\(http:\/\/.*\.gif\).*/\1/;/^http:/!d" test.txt
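Author: namejm     Time: 2007-1-22 08:43
  If several gif links may appear on the same line, use the code below instead (to stay compatible with the & character, the results are wrapped in quotes; exclamation marks are still filtered out):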
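@echo off
setlocal enabledelayedexpansion
for /f "delims=" %%i in ('findstr "http://.*gif" test.txt') do (
    set "var=%%i"
    rem Remove all double quotes, then mark each src with a white star
    rem and each .gif with a black star.
    set "var=!var:"=!"
    set "var=!var: src=☆!"
    set "var=!var:.gif=★!"
    call :pick-up "!var!"
)
pause
goto :eof

rem Subroutine: echo one URL per white star left in var.
:pick-up
rem Discard everything up to and including the next white star.
set "var=%var:*☆=%"
for /f "tokens=1 delims==★" %%j in ("%var%") do echo "%%j.gif"
rem Loop again if var still contains a white star.
set "str_tmp=%var:☆=%"
if not "%str_tmp%"=="%var%" goto pick-up
goto :eof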

Author: amao     Time: 2007-2-4 00:13
Note what the original poster said: the image types are jpg and gif.

@sed "/jpg\|gif/!d;s/.*src=\x22\([^\x22]*\)\x22.*/\1/;/^http/!d" p.txt> temp.txt

Based on GNU sed 4.1.4.
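
The one-liner above still yields at most one URL per input line. A possible variant, again only a sketch, first puts every tag on its own line and then keeps the URL from the tags that contain one. It assumes GNU sed (for \n in the replacement, \| and the I flag) and a Unix-style shell; under cmd the quoting would have to change. extract_image_urls_sed is a hypothetical name.

```shell
#!/bin/sh
# Sketch only, GNU sed assumed. extract_image_urls_sed is a hypothetical name.
extract_image_urls_sed() {
    # Step 1: start a new line before every "<" so each tag sits alone.
    # Step 2: on lines holding an absolute .gif/.jpg URL, keep just the URL;
    #         -n with the p flag suppresses every other line.
    # LC_ALL=C makes sed work on bytes, so non-UTF-8 page text is harmless.
    LC_ALL=C sed 's/</\n</g' |
    LC_ALL=C sed -n 's/.*\(http:\/\/[^" <>]*\.\(gif\|jpg\)\).*/\1/Ip'
}
```

Used as extract_image_urls_sed < p.txt > tu.txt. Because each tag is handled separately, two IMG tags on one line produce two output lines, and the http:// prefix in the pattern leaves out relative paths.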

[ Last edited by amao on 2007-2-4 at 10:59 AM ]
Author: minmin888     Time: 2007-5-8 11:10
Learned something! The '<' character is not handled properly, though.