Scraping and Regular Expressions

Scraping

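Setting the HTTP timeout for urllib and urllib2

import socket
# Applies to every socket created afterwards, including those opened by urllib and urllib2
socket.setdefaulttimeout(5)  # seconds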
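
To confirm that the default took effect, the value can be read back. A minimal sketch; the variable name current_timeout is only illustrative:

```python
import socket

# Process-wide default: applies to sockets created after this call.
socket.setdefaulttimeout(5)

# Read the value back to confirm it is in effect.
current_timeout = socket.getdefaulttimeout()
print(current_timeout)

# For a single request only, urllib2.urlopen(url, timeout=5) is an
# alternative on Python 2.6 and later.
```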

The simplest way to fetch a page:

import urllib

# One-liner
c = urllib.urlopen(url).read()

# Or, closing the connection explicitly
u = urllib.urlopen(url)
c = u.read()
u.close()
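
If read() raises, the close() call above is skipped. One possible way around that is contextlib.closing, sketched here. The helper name fetch and its timeout parameter are assumptions, and the import fallback is there only so the snippet also runs on Python 3, where urlopen lives in urllib.request:

```python
from contextlib import closing

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

def fetch(url, timeout=5):
    """Return the raw body of url; the connection is closed even on error."""
    with closing(urlopen(url, timeout=timeout)) as u:
        return u.read()
```

Usage would look like c = fetch('http://url').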

POSTing data:

# Passing a second argument makes urlopen send a POST request
u = urllib.urlopen('http://url', urllib.urlencode({'gtalk': fromid, 'msg': content}))
c = u.read()
u.close()
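
What gets sent can be inspected without touching the network by building the request object first. A sketch under assumptions: build_post is a hypothetical helper, http://example.com/send is a placeholder URL, and the import fallback only serves to keep the snippet runnable on Python 3:

```python
try:
    from urllib.parse import urlencode  # Python 3
    from urllib.request import Request
except ImportError:
    from urllib import urlencode  # Python 2
    from urllib2 import Request

def build_post(url, fields):
    """Encode fields as a form body and wrap them in a Request."""
    body = urlencode(fields).encode('ascii')
    # A Request that carries data is sent as POST instead of GET.
    return Request(url, data=body)

req = build_post('http://example.com/send', {'msg': 'hi there'})
print(req.get_method())
```

The request can then be passed to urlopen in place of the URL string.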

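Passing cookies

import cookielib, urllib2, urllib

cj = cookielib.CookieJar()
# Build an opener that stores and resends cookies through cj
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# A fixed Cookie header can also be attached to every request
opener.addheaders = [("Cookie", "dV9pZA**=Mg**; iCast2_1470_1097hO=0_2_20160; _iCast2_1470_1097hO=1")]
url_post = 'http://test.soften.cn'
# posts is a dict of form fields, defined elsewhere
content = opener.open(url_post, urllib.urlencode(posts)).read()
opener.close()

print content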
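
When the server sets the cookies itself (for example after a login), the jar can do the bookkeeping and no Cookie header needs to be written by hand. A minimal sketch; make_cookie_opener is a hypothetical helper and the import fallback is only there for Python 3:

```python
try:
    from http.cookiejar import CookieJar  # Python 3
    from urllib.request import build_opener, HTTPCookieProcessor
except ImportError:
    from cookielib import CookieJar  # Python 2
    from urllib2 import build_opener, HTTPCookieProcessor

def make_cookie_opener():
    """Return (opener, jar); cookies from responses are collected in jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar

opener, jar = make_cookie_opener()
# Every opener.open(...) call now reads cookies from jar and stores new ones in it.
print(len(jar))
```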

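Regular expressions

Matching:

import re

# (.*?) is non-greedy: it captures as little as possible before the closing ">
m = re.search('tags=(.*?)">', c, re.I | re.S | re.M)
if m:
    print m.group(1)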
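
The same pattern applied to a made-up fragment (the sample string is an assumption, not taken from a real page) shows what the non-greedy group captures:

```python
import re

# Hypothetical HTML fragment used only for illustration.
sample = '<a href="/search?TAGS=python">python</a>'

# re.I makes "tags" match "TAGS"; (.*?) stops at the first '">'.
m = re.search(r'tags=(.*?)">', sample, re.I | re.S)
tag = m.group(1) if m else None
print(tag)
```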

Substitution:

# Replace every <style> and <script> block with a single space
pattern = re.compile('<style.*?</style>|<script.*?</script>', re.S | re.I)
html = re.sub(pattern, ' ', html)
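
A quick check of the substitution on a made-up snippet (the html string here is an assumption); re.S lets the blocks span several lines and re.I handles upper-case tags:

```python
import re

pattern = re.compile(r'<style.*?</style>|<script.*?</script>', re.S | re.I)

# Hypothetical input with a multi-line, upper-case script block.
html = '<p>a</p><SCRIPT>\nalert(1)\n</SCRIPT><style>p{}</style><p>b</p>'

# Each matched block is replaced by one space.
cleaned = pattern.sub(' ', html)
print(cleaned)
```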

URL encoding

m = {'name': 'somebody', 'gender': 'male'}
s = urllib.urlencode(m)
print s
# e.g. gender=male&name=somebody (dict ordering is not guaranteed)
content = "zhongwen zifu"
print urllib.quote(content)
# zhongwen%20zifu
# urllib.unquote() reverses the encoding
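
For actual Chinese characters the text has to be turned into bytes first. A sketch assuming UTF-8 as the encoding; the sample text and variable names are illustrative, and the import fallback is only there for Python 3:

```python
try:
    from urllib.parse import quote, unquote  # Python 3
except ImportError:
    from urllib import quote, unquote  # Python 2

text = u'\u4e2d\u6587'  # the two characters of "zhongwen"

# Encode to UTF-8 bytes, then percent-encode each byte.
quoted = quote(text.encode('utf-8'))
print(quoted)

# unquote reverses the percent-encoding (on Python 2 it returns the raw bytes).
restored = unquote(quoted)
```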

Resources

python/抓取与正则.txt · Last modified: 2009/11/01 08:29 by kenvin