Mastering Python Web Crawlers, Chapter 4: The Urllib Library and URLError Exception Handling

2017-07-10


4.1 The Urllib Library

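Urllib is the library used to fetch web pages.
After urllib and urllib2 were merged in Python 3, many modules and functions moved to new locations.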

How the names changed between versions:

Python 2.x	Python 3.x
import urllib2	import urllib.request, urllib.error
import urllib	import urllib.request, urllib.error, urllib.parse
import urlparse	import urllib.parse
urllib2.urlopen	urllib.request.urlopen
urllib.urlencode	urllib.parse.urlencode
urllib.quote	urllib.parse.quote
cookielib.CookieJar	http.cookiejar.CookieJar
urllib2.Request	urllib.request.Request
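
As a quick check of the Python 3 column, the sketch below imports each relocated name and exercises it without touching the network. The query values are placeholders chosen for illustration, not taken from the book.

```python
import http.cookiejar
import urllib.error
import urllib.parse
import urllib.request

# urllib.urlencode (Python 2) -> urllib.parse.urlencode
query = urllib.parse.urlencode({"wd": "hello", "page": 1})

# urllib.quote (Python 2) -> urllib.parse.quote
quoted = urllib.parse.quote("a b")

# urllib2.Request (Python 2) -> urllib.request.Request
# Building a Request does not send anything yet.
req = urllib.request.Request("http://www.baidu.com/s?" + query)

# cookielib.CookieJar (Python 2) -> http.cookiejar.CookieJar
jar = http.cookiejar.CookieJar()

print(query)
print(quoted)
print(req.full_url)
print(len(jar))
```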

4.2 Quick Start

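# Import the module
>>> import urllib.request
# Open the page
>>> file = urllib.request.urlopen("http://baidu.com")
# Read the whole body into one bytes object
>>> data = file.read()
# Read the whole body into a list of lines; the response is a one-shot
# stream, so reopen the URL before reading it again
>>> file = urllib.request.urlopen("http://baidu.com")
>>> lines = file.readlines()
# Read a single line (again from a fresh response)
>>> file = urllib.request.urlopen("http://baidu.com")
>>> dataline = file.readline()
>>> print(dataline)
>>> print(data)
# Response headers, HTTP status code, and the final URL
>>> print(file.info())
>>> print(file.getcode())
>>> print(file.geturl())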
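
In Python 3 the read() call above returns bytes rather than str, so the body has to be decoded before it can be treated as text. The helper below is a minimal sketch: decode_body is a hypothetical name, and the UTF-8 default is an assumption that will not hold for every site.

```python
def decode_body(data, charset=None):
    """Decode raw response bytes into text.

    charset is the encoding announced by the server, if known
    (for a live response it can be read with
    file.headers.get_content_charset()); UTF-8 is assumed otherwise.
    Undecodable bytes are replaced instead of raising an error.
    """
    return data.decode(charset or "utf-8", errors="replace")


# Sample bytes stand in for the result of file.read()
utf8_text = decode_body("你好".encode("utf-8"))
gbk_text = decode_body("你好".encode("gbk"), charset="gbk")
print(utf8_text, gbk_text)
```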

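Fetching a page and saving it to a local file:

Fetch the page and assign it to a variable
Write the variable to a local *.html file
Close the file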


# data is bytes, so the file must be opened in binary write mode ("wb")
>>> fhandle = open("/Users/linking/Dev/Python/py-books-study/deep-in-python-web-crawler/baidu.html", "wb")
>>> fhandle.write(data)
>>> fhandle.close()
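
The three steps above can also be written with a with block, which closes the file even if the write fails. This is only a sketch: save_page is a hypothetical helper, the literal bytes stand in for the result of file.read(), and the temporary-directory path is a placeholder for the real target path.

```python
import os
import tempfile


def save_page(data, path):
    """Write raw page bytes to path; the file is closed automatically."""
    with open(path, "wb") as fhandle:
        fhandle.write(data)


# Placeholder content instead of a real download
sample = b"<html><body>hello</body></html>"
target = os.path.join(tempfile.gettempdir(), "urllib_demo_page.html")
save_page(sample, target)

# Read the file back to confirm what was written
with open(target, "rb") as fhandle:
    saved = fhandle.read()
print(len(saved))
```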

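The URL standard generally allows only a subset of ASCII characters: digits, letters, and some symbols. Special characters such as Chinese text, ":" or "&" must be encoded. Encoding:

>>> urllib.request.quote("http://www.baidu.com") # out: http%3A//www.baidu.com

Decoding:

>>> urllib.request.unquote("http%3A//www.baidu.com") # out: http://www.baidu.com

Encoding deserves special attention in Python because of platform differences and historical baggage.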
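
To make the encoding rules more concrete, here is a small sketch using urllib.parse.quote and urllib.parse.unquote, the documented home of these functions in Python 3. The sample strings are illustrative. By default quote() leaves "/" untouched; passing safe="" encodes it as well.

```python
import urllib.parse

# Non-ASCII text is UTF-8 encoded first, then each byte is percent-escaped
encoded = urllib.parse.quote("呵呵")
decoded = urllib.parse.unquote(encoded)

# With safe="" even "/" is escaped, along with "&" and "="
strict = urllib.parse.quote("a&b=c/d", safe="")

print(encoded)
print(decoded)
print(strict)
```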

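4.3 Simulating a Browser: the Headers Attribute

Problem: the crawl fails with a 403 error because the site has anti-crawler measures in place.

Solution: set the Headers information so the request looks like it comes from a browser.

The User-Agent string can be copied from the browser's developer tools. There are two ways to set it:

# Method 1: modify the headers via build_opener()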
import urllib.request
url="http://blog.csdn.net/weiwei_pig/article/details/51178226"
headers=("User-Agent","Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Mobile Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()
# print(data)
fhandle=open("/Users/linking/Dev/python/py-books-study/deep-in-python-web-crawler/csdn.html","wb")
fhandle.write(data)
fhandle.close()


# Method 2: add the header with add_header()
import urllib.request
url="http://blog.csdn.net/weiwei_pig/article/details/51178226"
req=urllib.request.Request(url)
req.add_header("User-Agent","Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Mobile Safari/537.36")
data = urllib.request.urlopen(req).read()
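
A third variant, shown here only as a sketch, passes the headers to the Request constructor and inspects the result without sending anything. One detail worth knowing: Request stores header names capitalized as "User-agent", so that is the key to use when reading them back.

```python
import urllib.request

url = "http://blog.csdn.net/weiwei_pig/article/details/51178226"
user_agent = (
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/59.0.3071.115 Mobile Safari/537.36"
)

# Headers can be supplied as a dict when the Request is created
req = urllib.request.Request(url, headers={"User-Agent": user_agent})

# Nothing has been sent yet; this only inspects the prepared request
print(req.has_header("User-agent"))
print(req.get_header("User-agent"))
```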

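4.4 Timeout Settings

The timeout is how long to wait for the server to respond. The right value depends on how fast the server is and on your performance requirements; if the limit is exceeded, an exception is raised.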

import urllib.request
for i in range(100):
    try:
        file = urllib.request.urlopen("http://yum.iqianyue.com",timeout=1)
        data = file.read()
        print(len(data))
    except Exception as e:
        print("Exception occurred --> " + str(e))

# out
# ...
# 14165
# 14165
# 14165
# 14165
# Exception occurred --> <urlopen error timed out>
# 14165
# Exception occurred --> <urlopen error timed out>
# 14165
# 14165
# 14165
# 14165
# 14165
# ...

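The loop above sends roughly 100 requests back to back. With a timeout of only 1 second, some of them do not get a response in time and raise an exception.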
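
Since occasional timeouts are expected, one common reaction is to retry a few times before giving up. The sketch below is an assumption on my part rather than something from the book: fetch_with_retry and its opener parameter are hypothetical names, and the opener can be swapped out so the logic can be tried without a network connection.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_retry(url, timeout=1.0, retries=3, opener=urllib.request.urlopen):
    """Fetch url, retrying on URLError or timeout up to `retries` attempts.

    opener defaults to urllib.request.urlopen; it must accept
    (url, timeout=...) and return a context manager with read().
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, socket.timeout) as e:
            # Remember the failure and try again
            last_error = e
    # Every attempt failed: re-raise the most recent error
    raise last_error
```

A call such as fetch_with_retry("http://yum.iqianyue.com", timeout=1) would then tolerate a couple of slow responses instead of failing on the first one.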

4.5 HTTP Requests in Practice

1. GET request

import urllib.request
keywd="hello"
url="http://www.baidu.com/s?wd="+keywd
# Build the Request object
req=urllib.request.Request(url)
data=urllib.request.urlopen(req).read()
fhandle=open("/Users/linking/Dev/python/py-books-study/deep-in-python-web-crawler/searchHello.html","wb")
fhandle.write(data)
fhandle.close()

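When the saved page is opened, it looks the same as searching Baidu for "hello".

That was an English keyword. What about a Chinese one? With keywd="呵呵", the request fails:

# out:
# UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128)

This is an encoding problem: the keyword has to be encoded first with urllib.request.quote(keywd), introduced earlier (it is the same function as urllib.parse.quote).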

import urllib.request
keywd="呵呵"
url="http://www.baidu.com/s?wd="
key_code=urllib.request.quote(keywd)
url_all=url+key_code
req=urllib.request.Request(url_all)
data=urllib.request.urlopen(req).read()
fhandle=open("/Users/linking/Dev/python/py-books-study/deep-in-python-web-crawler/searchCHNword.html","wb")
fhandle.write(data)
fhandle.close()
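
An alternative to quoting the keyword by hand is to let urllib.parse.urlencode build the whole query string, which also scales to several parameters. The sketch below only constructs the URL and does not send the request.

```python
import urllib.parse

base = "http://www.baidu.com/s"
params = {"wd": "呵呵"}

# urlencode percent-escapes both keys and values (UTF-8 by default)
url_all = base + "?" + urllib.parse.urlencode(params)
print(url_all)
```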

2. POST request example

import urllib.request
import urllib.parse
url = "http://www.iqianyue.com/mypost"
postdata = urllib.parse.urlencode({
    "name":"ceo@iqianyue.com",
    "pass":"aA123456"
}).encode('utf-8') # urlencode() builds the form string; encode() then converts it to UTF-8 bytes
req = urllib.request.Request(url,postdata)
req.add_header('User-Agent','Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Mobile Safari/537.36')
data = urllib.request.urlopen(req).read()
fhandle = open("/Users/linking/Dev/python/py-books-study/deep-in-python-web-crawler/post.html","wb")
fhandle.write(data)
fhandle.close()

The data returned by this scripted login is the same as what a normal login through the browser returns.
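
What makes the request above a POST is simply the presence of the data argument. The following sketch rebuilds the same Request and inspects it without sending it, so it can be run offline.

```python
import urllib.parse
import urllib.request

url = "http://www.iqianyue.com/mypost"
postdata = urllib.parse.urlencode({
    "name": "ceo@iqianyue.com",
    "pass": "aA123456",
}).encode("utf-8")

# Passing data switches the default method from GET to POST
req = urllib.request.Request(url, postdata)

print(req.get_method())
print(req.data)
```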

4.6 Proxy Server Settings
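
The notes for this section are empty, so what follows is only a minimal sketch of how a proxy is usually configured with urllib.request.ProxyHandler. The proxy address 127.0.0.1:8888 is a placeholder, not a working proxy, and the final urlopen call is left commented out for that reason.

```python
import urllib.request

# Placeholder address: replace with a proxy you are allowed to use
proxy_addr = "127.0.0.1:8888"

# Map each URL scheme to the proxy that should handle it
proxy = urllib.request.ProxyHandler({"http": "http://" + proxy_addr})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)

# To route every later urlopen() call through the proxy, install the opener:
# urllib.request.install_opener(opener)
# data = urllib.request.urlopen("http://www.baidu.com").read()

print(proxy.proxies)
```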

4.7 Enabling DebugLog

import urllib.request
httphd=urllib.request.HTTPHandler(debuglevel=1)
httpshd=urllib.request.HTTPSHandler(debuglevel=1)
opener=urllib.request.build_opener(httphd,httpshd)
urllib.request.install_opener(opener)
data=urllib.request.urlopen("http://edu.51cto.com")

# send: b'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: edu.51cto.com\r\nUser-Agent: Python-urllib/3.6\r\nConnection: close\r\n\r\n'
# reply: 'HTTP/1.1 200 OK\r\n'
# header: Date header: Content-Type header: Transfer-Encoding header: Connection header: Set-Cookie
# header: Server header: Vary header: Vary header: Vary header: Set-Cookie header: Set-Cookie
# header: Set-Cookie header: Load-Balancing header: Load-Balancing
# Process finished with exit code 0

4.8 The Exception-Handling Workhorse: URLError

import urllib.request
import urllib.error
try:
    file=urllib.request.urlopen("http://blog.csdn.net")
    data=file.read()
    # print(data)
    fhandle=open("/Users/linking/Dev/python/py-books-study/deep-in-python-web-crawler/csdndownload.html","wb")
    fhandle.write(data)
    fhandle.close()
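except urllib.error.URLError as e:
    # Only HTTPError (a URLError subclass) has .code; for a plain
    # URLError this line raises AttributeError
    print(e.code)
    print(e.reason)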

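I did not see a 403 Forbidden error when running this. My first guess was that the latest Python simulates a browser, but the debug log in 4.7 shows urllib still sends User-Agent: Python-urllib/3.6, so it may simply be that the site no longer rejects that client.

There is also HTTPError, a subclass of URLError: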

import urllib.request
import urllib.error
try:
    file=urllib.request.urlopen("http://blog.csdn.net")
    data=file.read()
    # print(data)
    fhandle=open("/Users/linking/Dev/python/py-books-study/deep-in-python-web-crawler/csdndownload.html","wb")
    fhandle.write(data)
    fhandle.close()
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.reason)
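except urllib.error.URLError as e:  # parent class as a fallback
    # A plain URLError has no .code, so this line raises AttributeError
    print(e.code)
    print(e.reason)

To handle both cases safely, print each attribute only if it exists: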

import urllib.request
import urllib.error
try:
    file=urllib.request.urlopen("http://www.baiduddd.com")
except urllib.error.URLError as e:  # parent class as a fallback
    if hasattr(e,"code"):
        print(e.code)
    if hasattr(e,"reason"):
        print(e.reason)

# Output: [Errno 8] nodename nor servname provided, or not known
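
The hasattr checks can also be replaced by testing the exception type directly. The sketch below uses a hypothetical helper named describe_error and builds the exceptions by hand, so it runs without a network connection; the URL and messages are made up for illustration.

```python
import urllib.error


def describe_error(e):
    """Return a one-line description of a urllib error."""
    # Check the subclass first: every HTTPError is also a URLError
    if isinstance(e, urllib.error.HTTPError):
        return "HTTP {}: {}".format(e.code, e.reason)
    if isinstance(e, urllib.error.URLError):
        return "URL error: {}".format(e.reason)
    raise TypeError("not a urllib error: {!r}".format(e))


# Hand-built exceptions stand in for real failures
http_err = urllib.error.HTTPError(
    "http://example.invalid", 403, "Forbidden", None, None
)
url_err = urllib.error.URLError("nodename nor servname provided, or not known")

print(describe_error(http_err))
print(describe_error(url_err))
```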

Author: Linking
Permalink: https://linking.fun/2017/07/10/精通Python网络爬虫-第四章-Urllib库与URLError异常处理/
