Title: How do I extract this content?

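Author: oicq63236     Time: 2009-4-28 12:37    Title: How do I extract this content?
This is a joke site. I download the page automatically with a script and want to extract the part shown in red below into a file, then call the DOS (command-line) version of Fetion (飞信) to send it to my phone as a text message. Right now I can't get the red content out.

The red content is different for every item, but the blue parts are fixed; in other words, I want whatever sits between the two blue regions.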




<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head >
<meta name="description" content=
"手机短信网:恋爱宝典:人
  会谈恋爱,
  不特殊;
  牛
  会吃青草,
  不特殊;
  猪
  会按电话,
  才特殊;
  还按!
  真是神猪!
  哇噻!还会笑!
  真是酷呆了的猪!"
 />
<title>恋爱宝典</title>
<link href="dx.css" rel="stylesheet" type="text/css" media="screen" />
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<script language="javascript" src="kj/ArClickCount.asp?ID=10001"></script>
</head>
<body >
<form name="form1" method="post" action="ck.aspx?id=10001" id="form1">
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTE2MTc5NTk2NDAPZBYCZg9kFgICAg88KwANAQAPFgYeC18hRGF0YUJvdW5kZx4JUGFnZUNvdW50AgEeC18hSXRlbUNvdW50AgVkFgJmD2QWDAIBD2QWBmYPZBYCZg8VAwI1OQzmlbTom4rkuJPlrrYM5pW06JuK5LiT5a62ZAIBD2QWAmYPFQIFMTAxMDdq5LiA5pelIOS4gOWvueiLjeidh+avjeWtkOWcqOS4gOi1t+WQg+WNiOmkkCAgIA0K44CA44CA5YS/5a2Q6Zeu6IuN6J2H5aaI5aaIOuS4uuS7gOS5iOaIkeS7rOavj+WkqemDveWQg+Wkp2QCAg8PFgIeBFRleHQFAzEwMGRkAgIPZBYGZg9kFgJmDxUDAjY0DOeUn+aXpeefreS/oQznlJ/ml6Xnn63kv6FkAgEPZBYCZg8VAgUxMDA0MHDmlZnmjojlr7nkuIDlkI3mmbrlipvml6nnhp/nmoQ25bKB55S35a2p6L+b6KGM5rWL6aqM44CC5pWZ5o6I6Zeu77ya5L2g55qE55Sf5pel5piv5ZOq5LiA5aSp77yf5bCP5a2p77yaMuaciDIw5pelZAICDw8WAh8DBQIyM2RkAgMPZBYGZg9kFgJmDxUDAjYzDOaBi+eIseWuneWFuAzmgYvniLHlrp3lhbhkAgEPZBYCZg8VAgUxMDExNCPlr7nlprMg5LiN566h6Zi05pm05ZyG57y6IOS5n+S4jeWPmGQCAg8PFgIfAwUCMTJkZAIED2QWBmYPZBYCZg8VAwI2MQzmg4XotqPnn63kv6EM5oOF6Laj55+t5L+hZAIBD2QWAmYPFQIFMTAxODl45ZCM5a2m6IGa6aSQ77yM5Yia5LiK55qE5LiA55uY6bih56uL5Yi76KKr5oqi5YWJ77yM5pyA5ZCO5Ymp5LiL6bih5aS05ZKM6bih5bGB6IKh77yM5LiA5ZCM5a2m56qB5Y+R5aWH5oOz77ya5aSn5a6254yc546wZAICDw8WAh8DBQM1MThkZAIFD2QWBmYPZBYCZg8VAwI2NAznlJ/ml6Xnn63kv6EM55Sf5pel55+t5L+hZAIBD2QWAmYPFQIFMTAwMjVL56WI5oS/5oKo55qE55Sf5pel77yM5Li65oKo5bim5p2l5LiA5Liq5pyA55Gw5Li95pyA6YeR56Kn6L6J54WM55qE5LiA55Sf44CCZAICDw8WAh8DBQE0ZGQCBg8PFgIeB1Zpc2libGVoZGQYAQUJR3JpZFZpZXcxD2dk+e6Jj6LcuYHd40BDIRCE1wgjaTw=" />

<div id="bg">
<div id="container">

<div id="Header">
<div id="toplogo">
</div>

<div id="topr">
<div id="topcenter">
<span class="STYLE2">打造收录短信最新、最全的手机短信网</span>&nbsp;&nbsp;&nbsp; <a href="#" onClick="this.style.behavior='url(#default#homepage)';this.sethomepage('http://www.52dxx.com');return false;">设为首页</A>
</div>
<div id="dht">
<div id="touleft">
<div id="nav">
<ul>
<li><a href="index.aspx" title="短信首页" >短信首页</a></li>
<li class="navjg"></li>
<li><a href="jrdx.aspx" title="节日手机短信"

Author: freeants001     Time: 2009-4-28 22:35
Maybe a single for command is enough. Give this a try:
for /f "tokens=* eol=<" %%i in (test.htm) do @echo.%%i>dest_text.txt

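Author: jmz573515     Time: 2009-4-29 02:50
A VBScript version:

' Read the whole downloaded page into one string.
set fso=createobject("scripting.filesystemobject")
set file=fso.opentextfile("test.htm")
s=file.readall
file.close

' Return every regex match found in Str, one match per line.
Function getImages(Str)
    Set re = New RegExp
    re.global=true
    ' A newline, then a run of non-word characters ending in a double quote.
    ' VBScript's \W excludes only A-Z, a-z, 0-9 and _, so it also matches the Chinese text.
    re.Pattern = "\n\W+"""
    Set Contents = re.Execute(Str)
    For Each Match in Contents
        Images = Images & Match.Value & vbcrlf
    Next
    getImages = Images
End Function

' Write the matches to #test.txt, then open it with the default program.
set file=fso.createtextfile("#test.txt")
file.write getimages(s)
file.close
createobject("wscript.shell").run "#test.txt"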

Author: yangfengoo     Time: 2009-5-3 01:10
Originally posted by freeants001 at 2009-4-28 22:35:
Maybe a single for command is enough. Give this a try:
for /f "tokens=* eol=<" %%i in (test.htm) do @echo.%%i>dest_text.txt


Good idea, but there is a small problem. Corrected version:
for /f "tokens=1 eol=<" %%i in (test.htm) do @echo.%%i>>dest_text.txt

Author: freeants001     Time: 2009-5-3 01:27
You say you found a problem. Can you say what exactly it is? In my tests I didn't find any, apart from the red part of the line below also being extracted. Your modified version doesn't seem to fix that either ;)
真是酷呆了的猪!"  />

Author: sady2009     Time: 2009-5-3 01:53
sed -n "/description/,/\/>/p" test.htm | sed -e "s/<meta name=\"description\" content=//g" -e "s/\/>//g" >b.txt
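
For readers without sed, the logic of the pipeline above can be sketched in Python. This is only an illustration, not sady2009's code: the name extract_range and the shortened SAMPLE_LINES are made up here, and plain substring tests stand in for sed's regular expressions. It prints the lines from the first one containing "description" through the next one containing "/>", with the same two strings deleted.

```python
# Shortened stand-in for test.htm (hypothetical sample, not the full page).
SAMPLE_LINES = [
    '<head >',
    '<meta name="description" content=',
    '"手机短信网:恋爱宝典:人',
    '  会谈恋爱,',
    '  真是酷呆了的猪!"',
    ' />',
    '<title>恋爱宝典</title>',
]

def clean(line):
    # The two deletions done by the second sed call.
    return line.replace('<meta name="description" content=', "").replace("/>", "")

def extract_range(lines):
    # Range selection as in sed -n "/start/,/end/p": the end pattern is only
    # tested on lines after the one that opened the range.
    out = []
    inside = False
    for line in lines:
        if not inside:
            if "description" in line:
                inside = True
                out.append(clean(line))
            continue
        out.append(clean(line))
        if "/>" in line:
            inside = False
    return out

if __name__ == "__main__":
    for text in extract_range(SAMPLE_LINES):
        print(text)
```

On this sample the double quotes around the text and the now-empty first and last lines are still in the output, so a further cleanup step may be wanted before sending the message.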

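Author: yangfengoo     Time: 2009-5-3 06:37
Originally posted by freeants001 at 2009-5-3 01:27:
You say you found a problem. Can you say what exactly it is? In my tests I didn't find any, apart from the red part of the line below also being extracted. Your modified version doesn't seem to fix that either ;)
真是酷呆了的猪!"  />


That shouldn't happen. You mean the extra />, right?
In my test there is no />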

Author: freeants001     Time: 2009-5-3 07:08
Heh, you're right, the /> is gone.
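
Why the two commands differ can be sketched with a rough Python model of for /f. This is a simplification written for this thread, not cmd's actual parser: the function name for_f_tokens and the sample lines are hypothetical, the default delimiters are taken to be space and tab, and testing eol after leading delimiters are stripped is an assumption. With tokens=1 only the first delimiter-separated token of each line is kept, so a trailing /> after spaces is dropped; with tokens=* the rest of the line is kept.

```python
def for_f_tokens(lines, tokens="1", eol="<", delims=" \t"):
    # Simplified model: strip leading delimiters, skip empty lines and
    # lines starting with the eol character, then pick the token(s).
    out = []
    for line in lines:
        stripped = line.lstrip(delims)
        if not stripped or stripped.startswith(eol):
            continue
        if tokens == "*":
            out.append(stripped)  # whole remainder of the line
        else:
            first = stripped
            for d in delims:
                first = first.split(d)[0]  # cut at the first delimiter
            out.append(first)
    return out

# Hypothetical sample with the closing /> on the same line as the text.
SAMPLE_LINES = [
    '<meta name="description" content=',
    '"手机短信网:恋爱宝典:人',
    '  会谈恋爱,',
    '  真是酷呆了的猪!"  />',
    '<title>恋爱宝典</title>',
]

if __name__ == "__main__":
    print(for_f_tokens(SAMPLE_LINES, "*"))
    print(for_f_tokens(SAMPLE_LINES, "1"))
```

The other change in the corrected command, > to >>, matters as well: > recreates dest_text.txt on every loop iteration, so only the last line survives, while >> appends each line. Note that tokens=1 would also cut off any joke line that contains a space in the middle.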

Author: yishanju     Time: 2009-5-3 11:55
Using a regular expression would still be a bit more reliable.
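
As one possible reading of the regex suggestion, here is a minimal Python sketch. Nobody in the thread posted this; the pattern, the name extract_description and the shortened SAMPLE_HTML are assumptions for illustration. It captures whatever lies between the quote after content= and the quote before />, then trims each line. The real test.htm declares charset=gb2312, so it would presumably need to be read with open("test.htm", encoding="gb2312").

```python
import re

# Shortened stand-in for the downloaded page (hypothetical sample).
SAMPLE_HTML = '''<head >
<meta name="description" content=
"手机短信网:恋爱宝典:人
  会谈恋爱,
  不特殊;
  真是酷呆了的猪!"
 />
<title>恋爱宝典</title>
</head>'''

# DOTALL lets "." cross line breaks; the lazy group stops at the first
# quote that is followed by optional whitespace and "/>".
DESCRIPTION_RE = re.compile(
    r'<meta\s+name="description"\s+content=\s*"(.*?)"\s*/>',
    re.DOTALL | re.IGNORECASE,
)

def extract_description(html):
    match = DESCRIPTION_RE.search(html)
    if match is None:
        return ""
    # Drop indentation and blank lines inside the captured text.
    lines = (line.strip() for line in match.group(1).splitlines())
    return "\n".join(line for line in lines if line)

if __name__ == "__main__":
    print(extract_description(SAMPLE_HTML))
```

Because the pattern is anchored on the meta tag itself, other lines of the page that happen not to start with < cannot leak into the result, and neither the quotes nor the /> end up in the output.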