Writing a Crawler: Scraping the Hot Comments on NetEase Cloud Music

First of all, thanks to 【小甲鱼】 for the course 极客Python之效率革命. It is well taught, easy to follow, and suitable for beginners.

If you are interested, you can visit https://fishc.com.cn/forum-319-1.html to support 小甲鱼. Thank you all.
If you want to learn the requests library, see: https://fishc.com.cn/forum.php?mod=viewthread&tid=95893&extra=page%3D1%26filter%3Dtypeid%26typeid%3D701

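1. First, let's analyze the page, starting by locating the element

精彩评论.png (screenshot: the hot comments section)

Let's first download the page source and take a look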

# -*- coding:UTF-8 -*-
import requests

def get_url(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029."
                      "110 Safari/537.36 SE 2.X MetaSr 1.0"}
    res = requests.get(url, headers=headers)
    return res

def main():
    url = input("Enter the page URL: ")
    res = get_url(url)

    with open("res.txt", "w", encoding="utf-8") as file:
        file.write(res.text)

if __name__ == "__main__":
    main()

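The saved source does not contain the hot comments we want, which suggests they are loaded by a separate request after the page itself.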
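One way to confirm this without reading the whole file by eye is to search the saved source for a few words copied from a comment that is visible in the browser. The following is only a minimal sketch: `page_contains` is a hypothetical helper name, and the snippet string is a placeholder you would replace yourself.

```python
from pathlib import Path


def page_contains(path, snippet):
    """Return True if the saved page source contains the given text."""
    text = Path(path).read_text(encoding="utf-8")
    return snippet in text


if __name__ == "__main__":
    # "res.txt" is the file written by the script above. The snippet is a
    # placeholder for a few words copied from a comment seen in the browser.
    if Path("res.txt").exists():
        print(page_contains("res.txt", "text copied from a visible comment"))
```

If this prints False for text you can clearly see on the page, the comments are not part of the initial HTML.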

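2. Throttle the browser's network speed so the page loads slowly. As soon as the hot comments appear, stop the loading and look for the resource that carries the comments

放慢浏览器加载速度.png (screenshot: throttling the browser's loading speed)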

Request URL:https://music.163.com/weapi/v1/resource/comments/R_SO_4_1356350562?csrf_token=643432a22c0bfd772c33e2726c942e48
Request Method:POST
So we download this target resource (using requests to imitate the browser's request)

# -*- coding:UTF-8 -*-
import requests

def get_comments(url):
    name_id = url.split('=')[1]
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029."
                      "110 Safari/537.36 SE 2.X MetaSr 1.0",
        "Referer": "https://music.163.com/"  # the HTTP header is spelled "Referer", not "refer"
    }
    params = "jvRGxPQYIeDQiiYsS8qg51ryAhi9TwM0H3NGLu7B9re4EOw9/a7jHRW0P5jhupFbSamLsjHvSpivhbtFiTObUOR2mYA7nFh5KUxaXn3bYh8GXy9sGTbxLeFCuY0KoNAfwWICK0n9ZRPlBHQ1CGBiohOq8+FDDPVBJhbcYgOSPhpTiZ22Ea+/xoYuk7UHnXHty093tfxAXJU032N1uaksCQmMzHxafQ1OA0BroKvyEMA="
    encSecKey = "969f735e7bc94d2b6a6f8371dd89e27d16161ea019a7d2b31391c257452c358678e7ffc11c45712a7f1e47fb1bea81dcf0dbb6f6335045766c06ef1fcc3758987cd30a8674510a062bf626dc2aed8b24c25e7a92ecb1ea38ac514e937f69343923a669d9024ff7a65f8154a35f854de05b67a56dd46d7fa5c136b02c414ce0ea"
    data = {
        "params": params,
        "encSecKey": encSecKey
    }
    target_url = "https://music.163.com/weapi/v1/resource/comments/R_SO_4_{}?csrf_token=".format(name_id)  # build the target URL from the song id so it works for any song
    res = requests.post(target_url, headers=headers, data=data)  # construct the POST request; press F12 to see what the browser sends
    return res


def main():
    url = input("Enter the song URL: ")
    res = get_comments(url)


    with open("data.txt", "w", encoding="utf-8") as file:
        file.write(res.text)


if __name__ == "__main__":
    main()
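The `url.split('=')[1]` step above works only when the `id` value is the sole parameter in the link. As a hedged alternative, the id could be read with `urllib.parse` from the standard library. The sketch below assumes the song link has the form `https://music.163.com/#/song?id=...` (the hash-style form is an assumption, not shown in this post), and `extract_song_id` is a hypothetical helper name.

```python
from urllib.parse import urlparse, parse_qs


def extract_song_id(url):
    """Return the value of the `id` parameter from a song page URL.

    Handles both a plain query string (/song?id=123) and the hash-style
    form (/#/song?id=123), where the query sits inside the fragment.
    """
    parsed = urlparse(url)
    query = parsed.query
    if not query and "?" in parsed.fragment:
        # Everything after "#" is the fragment, so split the query out of it.
        query = parsed.fragment.split("?", 1)[1]
    ids = parse_qs(query).get("id")
    if not ids:
        raise ValueError("no id parameter found in URL: {}".format(url))
    return ids[0]


if __name__ == "__main__":
    # The id below is the one that appears in the Request URL shown earlier.
    print(extract_song_id("https://music.163.com/#/song?id=1356350562"))
```

With this helper, a link that carries extra parameters after the id would still yield just the id, and a link without an id fails with a clear error instead of an IndexError.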

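3. Extract the data we need (save the response as JSON and open it in Firefox to see where the data we want is located)

火狐直接打开看.png (screenshot: the response opened directly in Firefox)

Now we know where the data we need is

Here is the complete code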
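The lookup can be sketched on its own, without any network request. The key names `hotComments`, `user`, `nickname`, and `content` are the ones used in this post; the sample payload is made up purely for illustration, and `extract_hot_comments` is a hypothetical helper name.

```python
import json


def extract_hot_comments(payload):
    """Return (nickname, content) pairs from a decoded comments response."""
    pairs = []
    # .get() with defaults keeps the function from crashing when the server
    # returns an error body that lacks the expected keys.
    for item in payload.get("hotComments", []):
        nickname = item.get("user", {}).get("nickname", "")
        content = item.get("content", "")
        pairs.append((nickname, content))
    return pairs


# Made-up sample that mirrors only the keys this post relies on.
sample_text = '{"hotComments": [{"user": {"nickname": "alice"}, "content": "great song"}]}'

if __name__ == "__main__":
    print(extract_hot_comments(json.loads(sample_text)))
    # prints [('alice', 'great song')]
```

Separating the parsing from the request like this makes it easy to try the extraction on a saved response file before wiring it into the crawler.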

# -*- coding:UTF-8 -*-
import requests
import json

def get_hot_comment(res):
    comment_json = json.loads(res.text)  # decode the JSON string into a Python object
    hot_comments = comment_json['hotComments']
    print(hot_comments)
    with open('hot_comment.txt', 'w', encoding='utf-8') as file:
        for each in hot_comments:
            file.write(each['user']['nickname'] + ':\n\n')
            file.write(each['content'] + '\n')
            file.write('-'*50 + '\n')

def get_comments(url):
    name_id = url.split('=')[1]
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029."
                      "110 Safari/537.36 SE 2.X MetaSr 1.0",
        "Referer": "https://music.163.com/"  # the HTTP header is spelled "Referer", not "refer"
    }
    params = "jvRGxPQYIeDQiiYsS8qg51ryAhi9TwM0H3NGLu7B9re4EOw9/a7jHRW0P5jhupFbSamLsjHvSpivhbtFiTObUOR2mYA7nFh5KUxaXn3bYh8GXy9sGTbxLeFCuY0KoNAfwWICK0n9ZRPlBHQ1CGBiohOq8+FDDPVBJhbcYgOSPhpTiZ22Ea+/xoYuk7UHnXHty093tfxAXJU032N1uaksCQmMzHxafQ1OA0BroKvyEMA="
    encSecKey = "969f735e7bc94d2b6a6f8371dd89e27d16161ea019a7d2b31391c257452c358678e7ffc11c45712a7f1e47fb1bea81dcf0dbb6f6335045766c06ef1fcc3758987cd30a8674510a062bf626dc2aed8b24c25e7a92ecb1ea38ac514e937f69343923a669d9024ff7a65f8154a35f854de05b67a56dd46d7fa5c136b02c414ce0ea"
    data = {
        "params": params,
        "encSecKey": encSecKey
    }
    target_url = "https://music.163.com/weapi/v1/resource/comments/R_SO_4_{}?csrf_token=".format(name_id)

    res = requests.post(target_url, headers=headers, data=data)
    return res

def main():
    url = input("Enter the song URL: ")
    res = get_comments(url)
    get_hot_comment(res)


if __name__ == "__main__":
    main()

The result

效果.png (screenshot: the result)

Pretty interesting, isn't it?