Urllib是python常用的标准库,主要用于操作URL。该标准库不需要安装,可以直接使用。

GET

urllib的request模块可以非常方便的抓取URL内容,也就是发送一个GET亲求到指定的页面,然后返回HTTP的响应。

from urllib import request

with request.urlopen('https://weibo.com/3861214023/fans?from=100505&wvr=6&mod=headfans&current=fans#place') as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data)

返回结果如下:

Status: 200 OK
Server: nginx/1.6.1
Date: Sat, 05 May 2018 10:50:17 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Cache-Control: no-cache, must-revalidate
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Pragma: no-cache
Pragma: no-cache
DPOOL_HEADER: dryad65
SINA-LB: aGEuMzIuZzEucXhnLmxiLnNpbmFub2RlLmNvbQ==
SINA-TS: YjBjYTk0Y2UgMCAwIDAgMyA0Cg==
Data: b'<!DOCTYPE html>\n<html>\n<head>\n    <meta http-equiv="Content-type" content="text/html; charset=gb2312"/>\n    <title>Sina Visitor System</title>\n</head>\n</html>'

如果我们要想模拟浏览器发送GET请求,就需要使用Request对象,通过往Request对象添加HTTP头,我们就可以把请求伪装成浏览器。那我们该添加什么头部信息呢?如果我们需要让爬虫模拟成浏览器,模拟成浏览器可以设置User-Agent信息。

任意打开一个网页,比如pillow然后按F12,会出现一个窗口。切换到Network标签页: 然后单击头像。 此时,我们可以观察到下方的窗口出现了一些数据。将界面右上方的标签切换到“Headers”中,即可以看到了对应的头信息,此时往下拖动,就可以找到User-Agent字样的一串信息。这一串信息即是我们下面模拟浏览器所需要用到的信息。

将User-Agent设置到request实例中

from urllib import request

req = request.Request('https://weibo.com/3861214023/fans?from=100505&wvr=6&mod=headfans&current=fans#place')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read())

POST

如果要以POST发送一个请求,只需要把参数data以bytes形式传入。刚刚我们就是没有登录微博,所以并没有访问到信息,现在我们就来模拟一下微博登录:

from urllib import request, parse
print('Login to weibo.cn...')
phone = input('Phone: ')
passwd = input('Password: ')
login_data = parse.urlencode([
    ('username', phone),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])
req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read())

如果登录成功,就返回如下响应:

Status: 200 OK
Server: nginx/1.6.1
Date: Sat, 05 May 2018 11:59:53 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Cache-Control: no-cache, must-revalidate
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Pragma: no-cache
Access-Control-Allow-Origin: https://passport.weibo.cn
Access-Control-Allow-Credentials: true
Set-Cookie: SUB=_2A2536ezpDeRhGeVG7VMT8SrMyT-IHXVVFfShrDV6PUJbkdANLWfskW1NT7Qv44Ixyp7YfqMsWMYrXoFaMJSbifdJ; Path=/; Domain=.weibo.cn; Expires=Sun, 05 May 2019 11:59:53 GMT; HttpOnly
Set-Cookie: SUHB=0laF4WlTYBf22X; expires=Sunday, 05-May-2019 11:59:53 GMT; path=/; domain=.weibo.cn
Set-Cookie: SCF=AjZhop50InncweqUlDBHHusO0_W7UHupReq55BpnTIhTP9MmXNgj1BUo2m7xO0Lp17uBLWueceLgYr6rALe6V0w.; expires=Tuesday, 02-May-2028 11:59:53 GMT; path=/; domain=.weibo.cn; httponly
Set-Cookie: SSOLoginState=1525521593; path=/; domain=weibo.cn
Set-Cookie: ALF=1528113593; expires=Monday, 04-Jun-2018 11:59:53 GMT; path=/; domain=.sina.cn
DPOOL_HEADER: luna135
Set-Cookie: login=219e4a0617d3288012618552eb672f75; Path=/
Data: b'{"retcode":20000000,"msg":"","data":{"crossdomainlist":{"weibo.com":"https:\\/\\/passport.weibo.com\\/sso\\/crossdomain?entry=mweibo&action=login&proj=1&ticket=ST-Mzg2MTIxNDAyMw%3D%3D-1525521593-yf-131AB34F7DB98AD2BC12F56AE31B2619-1","sina.com.cn":"https:\\/\\/login.sina.com.cn\\/sso\\/crossdomain?entry=mweibo&action=login&proj=1&ticket=ST-Mzg2MTIxNDAyMw%3D%3D-1525521593-yf-FEBA584D64D9FC2DDF44FC7CB1E666AD-1","weibo.cn":"https:\\/\\/passport.sina.cn\\/sso\\/crossdomain?entry=mweibo&action=login&ticket=ST-Mzg2MTIxNDAyMw%3D%3D-1525521593-yf-5227F4F974447594AF77E5114BF47236-1"},"loginresulturl":"","uid":"3861214023"}}'

如果登录失败,我们获得如下响应:

...
Data: {"retcode":50011015,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef","data":{"username":"example@python.org","errline":536}}