====== Scraping and Regular Expressions ======
===== Scraping =====
Setting an HTTP timeout for urllib and urllib2:
import socket
socket.setdefaulttimeout(5)
The shortest fetch statement: urllib.urlopen(url).read()
import urllib
u = urllib.urlopen(url)
c = u.read()
u.close()
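As a possible alternative to the global socket timeout above, urllib2.urlopen also accepts a per-call timeout argument (Python 2.6 and later). The sketch below wraps the fetch in a hypothetical helper called fetch (not part of these notes) and uses contextlib.closing so the handle is closed even if read() raises. The import fallback is only there so the same snippet also runs on Python 3.

```python
from contextlib import closing

try:
    from urllib2 import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen   # Python 3


def fetch(url, timeout=5):
    """Return the raw body of url; timeout is in seconds (hypothetical helper)."""
    # closing() calls u.close() on exit, even when read() raises
    with closing(urlopen(url, timeout=timeout)) as u:
        return u.read()
```

Usage would be something like c = fetch('http://example.com'), where the URL is a placeholder.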
POSTing data:
u = urllib.urlopen('http://url', urllib.urlencode({'gtalk':fromid,'msg':content}))
c = u.read()
u.close()
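To see what actually gets sent, the sketch below builds the request without opening it. The values standing in for fromid and content and the URL are made up; the point is only that urlencode turns the fields into a form-encoded body and that supplying a body switches the request from GET to POST.

```python
try:
    from urllib import urlencode              # Python 2
    from urllib2 import Request
except ImportError:
    from urllib.parse import urlencode        # Python 3
    from urllib.request import Request

# Made-up values standing in for fromid and content
fields = [('gtalk', 'alice@example.com'), ('msg', 'hello world')]

# A list of pairs keeps the field order fixed; a dict works as well
body = urlencode(fields)          # 'gtalk=alice%40example.com&msg=hello+world'

# Python 3 wants the body as bytes; for ASCII text on Python 2 this changes nothing
req = Request('http://example.com/post', body.encode('ascii'))

print(body)
print(req.get_method())           # 'POST', because a body was supplied
```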
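Passing cookies:
import cookielib, urllib2, urllib
cj = cookielib.CookieJar()
# build an opener that stores cookies in cj
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders=[("Cookie","dV9pZA**=Mg**; iCast2_1470_1097hO=0_2_20160; _iCast2_1470_1097hO=1")]
url_post = 'http://test.soften.cn'
# posts is assumed to be a dict of form fields defined earlier
content = opener.open(url_post, urllib.urlencode(posts)).read()
opener.close()
print content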
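The notes above send a hand-written Cookie header. A minimal sketch of the same setup follows, using a hypothetical helper make_opener and made-up cookie values: the HTTPCookieProcessor keeps cookies the server sets in the jar, and the extra header carries cookies that are already known.

```python
try:
    import cookielib                                   # Python 2
    from urllib2 import build_opener, HTTPCookieProcessor
except ImportError:
    import http.cookiejar as cookielib                 # Python 3
    from urllib.request import build_opener, HTTPCookieProcessor


def make_opener(cookies):
    """Return (opener, jar); cookies is a list of (name, value) pairs."""
    jar = cookielib.CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    # Join the pairs into a single Cookie header value
    header = '; '.join('%s=%s' % pair for pair in cookies)
    # append() keeps the default User-agent header that build_opener sets
    opener.addheaders.append(('Cookie', header))
    return opener, jar


opener, jar = make_opener([('sid', 'abc123'), ('lang', 'en')])
print(opener.addheaders[-1])   # ('Cookie', 'sid=abc123; lang=en')
```

Appending to addheaders, rather than assigning a new list, is a design choice in this sketch so that the default headers are not dropped.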
===== Regular Expressions =====
Matching:
import re
m = re.search('tags=(.*?)">', c, re.I | re.S | re.M)
if m:
    print m.group(1)
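A small sketch of the matching step on a made-up string standing in for the fetched page c. re.I ignores case and re.S lets the dot match newlines; re.M only changes how ^ and $ behave, so it is left out here because the pattern has no anchors. findall is shown as one way to collect every capture instead of just the first.

```python
import re

# Made-up sample standing in for the fetched page body c
c = '<A HREF="/search?TAGS=python">python</A>\n<a href="/search?tags=regex">regex</a>'

# .*? is non-greedy, so the capture stops at the first '">' after 'tags='
pattern = re.compile(r'tags=(.*?)">', re.I | re.S)

m = pattern.search(c)
first = m.group(1) if m else None     # None when nothing matches
every = pattern.findall(c)            # all captures, in document order

print(first)
print(every)
```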
Substitution:
# '|' separates alternatives; with nothing on either side it matches the empty string
pattern = re.compile('|', re.S | re.I)
html = re.sub(pattern, ' ', html)
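A small sketch of the substitution step with an alternative on each side of the | operator. The pattern and the sample string are made up for illustration and are not the ones from these notes.

```python
import re

# Hypothetical pattern: a <br> tag (with optional slash) or a closing </p>
pattern = re.compile(r'<br\s*/?>|</p>', re.S | re.I)

html = 'line one<BR>line two</p>'
text = re.sub(pattern, ' ', html)   # each match becomes a single space

print(repr(text))
```

Because of re.I the upper-case tag in the sample is matched as well.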
===== URL Encoding =====
m = {'name' : 'somebody', 'gender' : 'male'}
s = urllib.urlencode(m)
print s
# prints gender=male&name=somebody (dict order is arbitrary in Python 2)
content = "zhongwen zifu"
urllib.quote(content)  # returns 'zhongwen%20zifu'
# inverse: urllib.unquote('zhongwen%20zifu') returns 'zhongwen zifu'
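A short sketch comparing quote, quote_plus, and unquote, including real non-ASCII input. The Chinese sample string is an assumption added for illustration; the import fallback only exists so the snippet runs on Python 3 as well.

```python
try:
    from urllib import quote, quote_plus, unquote          # Python 2
except ImportError:
    from urllib.parse import quote, quote_plus, unquote    # Python 3

content = "zhongwen zifu"
q = quote(content)            # space becomes %20
qp = quote_plus(content)      # space becomes +, as in form data
back = unquote(q)             # round trip back to the original

# Non-ASCII text is encoded to UTF-8 bytes first, then quoted byte by byte
chinese = quote(u'\u4e2d\u6587'.encode('utf-8'))

print(q)
print(qp)
print(back)
print(chinese)
```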
===== References =====
* http://www.woodpecker.org.cn/diveintopython/http_web_services/index.html
* [[http://wiki.ubuntu.org.cn/Python%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F%E6%93%8D%E4%BD%9C%E6%8C%87%E5%8D%97|Python Regular Expression HOWTO (in Chinese)]]