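Has anyone used Python's re module to scrape web pages? Could you give an example? Thanks

Posted by a user

3 answers

Helpful user

Here is a very simple page-scraping script I wrote. It collects every link found at a given URL and then fetches the title of each linked page.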

===========geturls.py================
#coding:utf-8
import urllib
import urlparse
import re
import socket
import threading
import time

# Regexes for link targets and the page title
urlre = re.compile(r"href=[\"']?([^ >\"']+)")
titlere = re.compile(r"<title>(.*?)</title>", re.I | re.S)

# Set the socket timeout to 10 seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# Maximum number of concurrent threads
max_threads = 10
# Number of threads currently running
current = 0
# Protects `current`, which several threads update
lock = threading.Lock()

def gettitle(url):
    global current
    try:
        content = urllib.urlopen(url).read()
        match = titlere.search(content)
        if match:
            title = match.group(1)
            try:
                # Assume a GBK page; keep the raw bytes if decoding fails
                title = title.decode('gbk').encode('utf-8')
            except UnicodeError:
                pass
        else:
            title = "no title"
        print "%s: %s" % (url, title)
    except Exception:
        # Skip pages that cannot be fetched
        pass
    finally:
        # Always release this thread's slot
        with lock:
            current -= 1

def geturls(url):
    global current
    content = urllib.urlopen(url)
    # Use a set to remove duplicates
    result = set()
    for eachline in content:
        for x in urlre.findall(eachline):
            # Resolve relative (same-site) links against the page URL
            if not x.startswith("http:"):
                x = urlparse.urljoin(url, x)
            # Skip JS and CSS files
            if not x.endswith(".js") and not x.endswith(".css"):
                result.add(x)
    threads = []
    for link in result:
        t = threading.Thread(target=gettitle, args=(link,))
        threads.append(t)
    i = 0
    while i < len(threads):
        if current < max_threads:
            # Claim a slot before starting, so a fast thread cannot release it first
            with lock:
                current += 1
            threads[i].start()
            i += 1
        else:
            # Wait briefly for a free slot instead of spinning
            time.sleep(0.1)

geturls("http://www.baidu.com")
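
The script caps concurrency by hand with a shared counter. As a rough alternative sketch, assuming Python 3 (the script itself is Python 2), `concurrent.futures.ThreadPoolExecutor` can enforce the same limit of 10 workers. The names `titles_for`, `fake_fetch` and `FAKE_PAGES` are made up for illustration, and the in-memory pages stand in for real downloads so the example runs offline.

```python
import re
from concurrent.futures import ThreadPoolExecutor

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.I | re.S)

def titles_for(urls, fetch, max_workers=10):
    """Return {url: title or None}, running at most max_workers fetches at once.

    `fetch` is any callable that takes a URL and returns the page text.
    """
    def work(url):
        try:
            match = TITLE_RE.search(fetch(url))
        except OSError:
            # A failed download is recorded as None
            return url, None
        return url, (match.group(1).strip() if match else None)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order and waits for every task to finish
        return dict(pool.map(work, urls))

# Hypothetical in-memory pages standing in for real downloads
FAKE_PAGES = {
    "http://example.com/a": "<html><title>Page A</title></html>",
    "http://example.com/b": "<html><TITLE>\n Page B \n</TITLE></html>",
    "http://example.com/c": "<html><body>no title tag</body></html>",
}

def fake_fetch(url):
    if url not in FAKE_PAGES:
        raise OSError("cannot fetch %s" % url)
    return FAKE_PAGES[url]

urls = list(FAKE_PAGES) + ["http://example.com/missing"]
for url, title in titles_for(urls, fake_fetch).items():
    print("%s: %s" % (url, title))
```

Passing the download function in as `fetch` keeps the pool logic independent of how pages are actually retrieved, so a real downloader can replace `fake_fetch` without touching `titles_for`.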

Regular expressions (re) are only good for fairly simple or mechanical tasks. If you need more powerful page parsing, try Beautiful Soup or pyquery. Hope this helps.
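
To give a feel for what a real parser adds over a regex, here is a minimal sketch, assuming Python 3. Beautiful Soup and pyquery are third-party packages, so this uses the standard library's `html.parser` as a stand-in; it is not the Beautiful Soup or pyquery API. The class name `LinkTitleParser` and the sample page are made up for illustration.

```python
from html.parser import HTMLParser

class LinkTitleParser(HTMLParser):
    """Collect the href of every <a> tag and the text of <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it
        if self._in_title:
            self.title += data

# Hypothetical sample page; the first link sits inside an HTML comment
html = """<html><head><title>Demo</title></head>
<body><!-- <a href="/commented-out">hidden</a> -->
<a class="nav" href="/home">Home</a>
<a href="/docs">Docs</a></body></html>"""

parser = LinkTitleParser()
parser.feed(html)
parser.close()
print(parser.title)
print(parser.links)
```

A plain `href=` regex would also pick up the link inside the comment, while the parser hands comments to a separate handler and so leaves it out.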

Helpful user

Fetch the page with urllib2.
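
urllib2 is a Python 2 module; in Python 3 its functionality lives in `urllib.request`. A minimal sketch of fetching a page that way follows. The function name `fetch_text`, the User-Agent value and the UTF-8 fallback are assumptions made for illustration.

```python
from urllib.request import Request, urlopen

def fetch_text(url, timeout=10, fallback_encoding="utf-8"):
    """Download url and return its body decoded to str."""
    # Some servers reject requests without a User-Agent; this value is arbitrary
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (example)"})
    with urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset from the Content-Type header when one is given
        charset = response.headers.get_content_charset() or fallback_encoding
    # Replace undecodable bytes rather than failing on a mislabeled page
    return raw.decode(charset, errors="replace")

# Usage (needs network access):
# html = fetch_text("http://www.baidu.com")
# print(html[:200])
```

The returned string can then be handed to `re` or to an HTML parser for analysis.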

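Helpful user

re cannot fetch web pages.
To fetch a page you need urllib or httplib.
re is the regular expression module; it is for analyzing the content you have fetched!
It's really simple! Go read a book!
Or add me on QQ: 175662137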
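
To illustrate the point that re only analyzes text that has already been downloaded, here is a small sketch, assuming Python 3. It applies the two patterns from geturls.py earlier in the thread to an in-memory string, with no fetching at all. The sample HTML, the base URL and the helper names `extract_title` and `extract_links` are made up for illustration.

```python
import re
from urllib.parse import urljoin

# Patterns from geturls.py; re.S lets a title span several lines
urlre = re.compile(r"href=[\"']?([^ >\"']+)")
titlere = re.compile(r"<title>(.*?)</title>", re.I | re.S)

# Hypothetical sample page, used instead of a live download
SAMPLE_HTML = """<html><head><TITLE>Example page</TITLE>
<link rel="stylesheet" href="/static/site.css"></head>
<body>
<a href="/news/index.html">News</a>
<a href='http://example.org/about'>About</a>
<a href=contact.html>Contact</a>
<a href="/news/index.html">News again</a>
</body></html>"""

def extract_title(html):
    """Return the page title, or None when there is no <title> tag."""
    match = titlere.search(html)
    return match.group(1).strip() if match else None

def extract_links(html, base_url):
    """Return the sorted, de-duplicated absolute links found in html."""
    links = set()
    for href in urlre.findall(html):
        # urljoin leaves absolute URLs alone and resolves relative ones
        absolute = urljoin(base_url, href)
        # Skip JS and CSS files, as the script does
        if not absolute.endswith((".js", ".css")):
            links.add(absolute)
    return sorted(links)

title = extract_title(SAMPLE_HTML)
links = extract_links(SAMPLE_HTML, "http://example.com/index.html")
print(title)
for link in links:
    print(link)
```

The pattern handles double-quoted, single-quoted and unquoted href values, and the set removes the repeated News link.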
