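====== Fetching and Regular Expressions ======

===== Fetching =====

Set an HTTP timeout for both urllib and urllib2:

  import socket
  socket.setdefaulttimeout(5)  # seconds; applies to every socket created afterwards

The shortest way to fetch a page:

  import urllib
  urllib.urlopen(url).read()

The same fetch, closing the connection explicitly:

  u = urllib.urlopen(url)
  c = u.read()
  u.close()

POST data (passing a second argument turns the request into a POST):

  u = urllib.urlopen('http://url', urllib.urlencode({'gtalk': fromid, 'msg': content}))
  c = u.read()
  u.close()

Passing cookies:

  import cookielib, urllib2, urllib
  cj = cookielib.CookieJar()
  # assumed setup: an opener that stores received cookies in cj
  opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
  # send a fixed Cookie header with every request
  opener.addheaders = [("Cookie", "dV9pZA**=Mg**; iCast2_1470_1097hO=0_2_20160; _iCast2_1470_1097hO=1")]
  url_post = 'http://test.soften.cn'
  posts = {'key': 'value'}  # placeholder form fields; replace with the real ones
  content = opener.open(url_post, urllib.urlencode(posts)).read()
  print content

===== Regular Expressions =====

Matching:

  import re
  # c holds the page content fetched above
  m = re.search('tags=(.*?)">', c, re.I | re.S | re.M)
  if m:
      print m.group(1)

Replacing:

  # NOTE: the alternatives on either side of '|' are missing here;
  # fill in the expressions to be replaced before using this
  pattern = re.compile('|', re.S | re.I)
  html = re.sub(pattern, ' ', html)

===== URL Encoding =====

Encode a dict as a query string:

  m = {'name': 'somebody', 'gender': 'male'}
  s = urllib.urlencode(m)
  print s
  ## gender=male&name=somebody

Percent-encode a single string (for example, one containing Chinese characters):

  content = "zhongwen zifu"
  urllib.quote(content)
  # urllib.unquote(content) reverses the encoding

===== Resources =====

  * http://www.woodpecker.org.cn/diveintopython/http_web_services/index.html
  * [[http://wiki.ubuntu.org.cn/Python%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F%E6%93%8D%E4%BD%9C%E6%8C%87%E5%8D%97|Guide to Python Regular Expression Operations]]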
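
===== Worked Example =====

The snippets on this page target Python 2. As a rough sketch of how the POST, regex-match and URL-encoding steps fit together, here is a version that runs offline under Python 3, where ''urlencode'', ''quote'' and ''unquote'' live in ''urllib.parse'' and ''Request'' in ''urllib.request''. The helper ''extract_tag'', the sample HTML, the ''example.com'' URL and the form values are all made up for illustration; nothing is sent over the network.

```python
import re
from urllib.parse import urlencode, quote, unquote
from urllib.request import Request


def extract_tag(html):
    """Return the text between 'tags=' and the next '">', or None if absent."""
    m = re.search(r'tags=(.*?)">', html, re.I | re.S | re.M)
    return m.group(1) if m else None


# Regex match on a hypothetical piece of HTML.
sample_html = '<a href="/search?tags=python">python</a>'
tag = extract_tag(sample_html)
print(tag)  # python

# URL-encode form fields; a list of pairs keeps the order fixed.
query = urlencode([('name', 'somebody'), ('gender', 'male')])
print(query)  # name=somebody&gender=male

# Percent-encode a single string and decode it again.
encoded = quote("zhongwen zifu")
decoded = unquote(encoded)
print(encoded)  # zhongwen%20zifu

# Build (but do not send) a POST request: supplying data makes it a POST.
# The URL and field values are placeholders.
body = urlencode({'gtalk': 'someone@example.com', 'msg': 'hello'}).encode('ascii')
req = Request('http://example.com/post', data=body)
print(req.get_method())  # POST
```

To actually send the request, pass ''req'' to ''urllib.request.urlopen'' with a ''timeout'' argument, which plays the role of ''socket.setdefaulttimeout'' for a single call.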