Crawling WeChat Official Account Articles

2018-08-27


Broken link:
https://mp.weixin.qq.com/s?src=11&amp;timestamp=1503491806&amp;ver=348&amp;signature=3lw-RB4CU15GgtDueSeaDNXAkuSPC7KYea7bl3-GWPjrz9-iK1KHwqwkQdyJ6jBGxiY16x9Absid5UUxXknJl1V5pIxOkbLsZ7rYivkYJgUPhqMEke7la64EZ3ryb8TN&amp;new=1
Working link:
http://mp.weixin.qq.com/s?src=11&timestamp=1503491925&ver=348&signature=1ir9ZqJ7U4QJwH-Ztci48TK4fCx6mX*I4Ol7sitsB21v-0mjrgQMf4ZuVwtWkMDNw*EcaVcQdpZa6nkGYBnkX3l-H-i33xuY5ozJW27bijTQRa97bwf4smd46fPEgVrW&new=1

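Comparing the two shows the problem: in the broken link, every & between parameters is followed by the extra characters amp; (the page source HTML-escapes & as &amp;). Strip them out: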

url = url.replace("amp;", "")
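
As an alternative to the string replacement, here is a minimal sketch that decodes the entity with the standard library's html.unescape. The function name clean_link and the shortened sample URL are my own illustration, not part of the original script.

```python
import html


def clean_link(raw_url: str) -> str:
    """Decode HTML entities such as &amp; back into plain characters."""
    return html.unescape(raw_url)


# Shortened, illustrative link in the same escaped form as the broken one
escaped = "https://mp.weixin.qq.com/s?src=11&amp;timestamp=1503491806&amp;ver=348&amp;new=1"
print(clean_link(escaped))
# https://mp.weixin.qq.com/s?src=11&timestamp=1503491806&ver=348&new=1
```

Apply it once, to the escaped string only. Decoding an already clean URL a second time is not a safe no-op, because a parameter name such as timestamp starts with the legacy entity name times.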

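With that, the article link works. The article's page source can be fetched through it, and the title and content can be pulled out of the source with regular expressions.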
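
A small sketch of that extraction step, run against a made-up HTML snippet instead of a live page. The function name extract_article and the sample markup are assumptions for illustration; the two patterns are the same ones used in the script in section 1.5.

```python
import re


def extract_article(page_source: str):
    """Return (title, content) from an article page, or None for a missing part."""
    title_match = re.search(r"<title>(.*?)</title>", page_source, re.S)
    content_match = re.search(r'id="js_content">(.*?)</div>', page_source, re.S)
    title = title_match.group(1).strip() if title_match else None
    content = content_match.group(1).strip() if content_match else None
    return title, content


# Made-up page source that only mimics the structure the patterns expect
sample = """<html><head><title>Sample title</title></head>
<body><div class="rich_media_content" id="js_content">
<p>Sample body</p>
</div></body></html>"""

print(extract_article(sample))
# ('Sample title', '<p>Sample body</p>')
```

Note that the non-greedy content pattern stops at the first closing div, so an article whose body contains nested div elements would be cut short. An HTML parser would be sturdier, but the regex keeps the example close to the original approach.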

1.3 Proxies

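Crawling WeChat articles too often gets your IP blocked by Sogou, so keep a few proxy servers and ports on hand as backups, for example from http://yum.iqianyue.com/proxy .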

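Linking's note: this takes a lot of trial and error. While testing I was warned about overly frequent access, so I kept switching IPs until it finally worked.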
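
Switching IPs by hand gets tedious, so here is a rough sketch of trying several proxies in turn. The names fetch_via_proxy and fetch_with_rotation are hypothetical, and the proxy addresses you pass in are placeholders that you would replace with ones that are currently alive.

```python
import urllib.request


def fetch_via_proxy(url, proxy_addr, timeout=10):
    """Fetch url through one HTTP proxy given as 'host:port'."""
    handler = urllib.request.ProxyHandler({"http": proxy_addr})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def fetch_with_rotation(url, proxies, fetch=fetch_via_proxy):
    """Try each proxy in order and return the first successful response body.

    Returns None when every proxy fails. urllib.error.URLError is a
    subclass of OSError, so catching OSError covers both.
    """
    for proxy_addr in proxies:
        try:
            return fetch(url, proxy_addr)
        except OSError as exc:
            print("proxy " + proxy_addr + " failed: " + str(exc))
    return None
```

The fetch parameter exists only so the rotation logic can be exercised with a stand-in function and no network access. In real use you would call fetch_with_rotation(url, ["host1:port1", "host2:port2"]) with your own proxy list.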

1.4 Notes

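Split the work into several functions, one job per function (single responsibility principle);
Crawling through a proxy server is covered in Section 4.6;
After crawling the content you care about, write it to the corresponding file;
Handle exceptions: when one occurs, pause before continuing, e.g. time.sleep(7).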
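
The last note can be sketched as a small retry helper. This is only an illustration under my own assumptions: the name retry, the attempt count, and the default 7-second delay are not taken from the script below, which simply sleeps inside each except branch.

```python
import time


def retry(action, attempts=3, delay=7):
    """Call action() until it succeeds, sleeping `delay` seconds after a failure.

    Re-raises the last exception if every attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            print("attempt " + str(attempt) + " failed: " + str(exc))
            if attempt < attempts:
                time.sleep(delay)
    raise last_error
```

Any zero-argument callable works as the action, for example a lambda that wraps a single page fetch.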

1.5 Implementation

# 2017-08-27
import re
import urllib.request
import urllib.error
import os
import time

# 1. Pose as a browser
headers = ("User-Agent", "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Mobile Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
# Install the opener globally
urllib.request.install_opener(opener)

# listurl stores the list of article URLs
listurl = []
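# Helper that fetches a URL through a proxy server
def use_proxy(proxy_addr, url):
    # Catch errors so that one failed request does not stop the crawl
    try:
        proxy = urllib.request.ProxyHandler({'http': proxy_addr})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        # Re-apply the browser User-Agent; a freshly built opener does not carry it
        opener.addheaders = [headers]
        urllib.request.install_opener(opener)
        data = urllib.request.urlopen(url).read().decode('utf-8')
        return data
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        # On URLError, wait 10 seconds before continuing
        time.sleep(10)
    except Exception as e:
        print("exception:" + str(e))
        # Wait 1 second before continuing
        time.sleep(1)
    # Reaching this point means the request failed; the caller gets None
    return None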

# Collect all article links
def getlisturl(key, pagestart, pageend, proxy):
    try:
        # URL-encode the keyword
        keycode = urllib.request.quote(key)
        # Crawl the article links page by page
        for page in range(pagestart, pageend+1):
            # Build the search URL
            url = 'http://weixin.sogou.com/weixin?type=2&query=' + keycode + "&page=" + str(page)
            # Fetch through the proxy server to avoid the IP ban
            data1 = use_proxy(proxy, url)
            # Skip this page if the request failed
            if data1 is None:
                continue
            # Regular expression that captures the article links
            listurlpattern = '<div class="txt-box">.*?(http://.*?)"'
            # Append all article links found on this page to listurl
            listurl.append(re.compile(listurlpattern,re.S).findall(data1))
        print("Fetched " + str(len(listurl)) + " page(s) in total")
        return listurl
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        # On URLError, wait 10 seconds before continuing
        time.sleep(10)
    except Exception as e:
        print("exception:" + str(e))
        # Wait 1 second before continuing
        time.sleep(1)

# Fetch the content behind each article link
def getcontent(listurl, proxy):
    # Opening part of the local HTML file
    html1 = '''<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>WeChat articles</title>
</head>
<body>'''
    fh = open(os.getcwd() + "/1.html", "wb")
    fh.write(html1.encode("utf-8"))
    fh.close()
    # Reopen the file in append mode to write the articles
    fh = open(os.getcwd() + "/1.html", "ab")
    # listurl is a 2-D list: first index is the page, second is the article on that page
    for i in range(0, len(listurl)):
        for j in range(0, len(listurl[i])):
            try:
                url = listurl[i][j]
                # Turn it into the real URL by removing the stray "amp;" from the scraped link
                url = url.replace("amp;", "")
                # Fetch the page through the proxy
                data = use_proxy(proxy, url)
                # Regex for the article title
                titlepattern = "<title>(.*?)</title>"
                # Regex for the article content
                contentpattern = 'id="js_content">(.*?)</div>'
                title = re.compile(titlepattern, re.S).findall(data)
                content = re.compile(contentpattern, re.S).findall(data)
                # Default title and content
                thistitle = "Title not retrieved this time"
                thiscontent = "Content not retrieved this time"
                # If a match list is not empty, take its first element
                if title:
                    thistitle = title[0]
                if content:
                    thiscontent = content[0]
                # Combine title and content into dataall
                dataall = "<p>Title: "+thistitle+"</p><p>Content: "+thiscontent+"</p><br>"
                # Write title and content to the file
                fh.write(dataall.encode("utf-8"))
                print("Page "+str(i+1)+", article "+str(j+1)+" processed")
            except urllib.error.URLError as e:
                if hasattr(e, "code"):
                    print(e.code)
                if hasattr(e, "reason"):
                    print(e.reason)
                # On URLError, wait 10 seconds before continuing
                time.sleep(10)
            except Exception as e:
                print("exception:" + str(e))
                # Wait 1 second before continuing
                time.sleep(1)
    fh.close()
    # Closing part of the HTML file
    html2 = '''</body>
</html>
    '''
    fh = open(os.getcwd()+"/1.html", "ab")
    fh.write(html2.encode("utf-8"))
    fh.close()

def main():
    # Search keyword ("artificial intelligence")
    key = "人工智能"
    # Proxy server; it may have expired and need replacing. Source: http://www.xicidaili.com/
    proxy = "218.106.98.166:53281"
    # getlisturl() and getcontent() can use different proxies;
    # pass proxy2 to one of them to do so
    proxy2 = "49.75.83.234:8998"
    # First page
    pagestart = 1
    # Last page
    pageend = 2
    listurl = getlisturl(key, pagestart, pageend, proxy)
    getcontent(listurl, proxy)

if __name__ == '__main__':
    main()

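Result: the script crawled the article links from 2 search result pages, then fetched each article's title and content and assembled them into a complete web page saved as 1.html, which can be opened directly in a browser.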
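
One optional refinement: the search URL that getlisturl assembles by string concatenation could also be built with urllib.parse.urlencode, which percent-encodes the keyword and makes a misspelled parameter name easier to spot. The helper name build_search_url is hypothetical; the host and the type, query and page parameters are the ones the script targets.

```python
from urllib.parse import urlencode


def build_search_url(key, page):
    """Build the Sogou WeChat search URL for one results page."""
    params = {"type": 2, "query": key, "page": page}
    return "http://weixin.sogou.com/weixin?" + urlencode(params)


print(build_search_url("人工智能", 1))
# http://weixin.sogou.com/weixin?type=2&query=%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD&page=1
```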

2. WeChat Labs

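Speaking of WeChat search, it reminds me of two features in WeChat Labs. For now they are tucked away in Settings and have to be added manually. I recommend both.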

2.1 搜一搜 (Search)

This feature is more powerful than the regular search.

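Much like this crawler, WeChat has long had its own search tool. It just has not been placed in the main menu yet: it is still in experimental public beta and has to be added manually.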

2.2 看一看 (Top Stories)

The other project in Labs, 看一看, recommends articles you are likely to find interesting based on your everyday browsing habits.

To add them: open WeChat, go to Settings, tap the last item, Labs (plugins), and add the features from there.

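Linking's note: this is Tencent's roundabout way of proving the saying "everyone is a product manager", with hundreds of millions of users doing its product testing.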

Author: Linking
Permalink: https://linking.fun/2018/08/27/微信公众号文章爬取/
