蔡小小的博客: How to build a data visualization of 扫黑风暴?

    Posted: 2021-09-11 16:51

    • Introduction
    • How to find the video id
    • Project structure
      • Part 1. Spider
        • Scraping comment content
        • Scraping comment timestamps
      • Part 2. Data processing
        • Converting comment timestamps to readable time
        • Writing comment content to CSV
        • Counting comments in each hour of the day
        • Counting recent daily comments
      • Part 3. Data analysis
        • Word cloud
        • Bar and line charts of recent daily comments
        • Bar and line charts of comments per hour
        • Pie chart of recent daily comments
        • Pie chart of comments per hour
        • Pie chart of comments by viewing-time bracket
        • Pie chart of mentions of 扫黑风暴's lead actors
        • Sentiment analysis of comment content

    Introduction

    This post is a scrape and data analysis of 扫黑风暴, a hit series on Tencent Video. The whole thing took about two hours and collected roughly 30,000 comments. Most of it is routine; the one part worth noting is the sentiment analysis of the comment text, which was new ground for me.

    Scraping: Tencent wraps its comment data in JSON, so all we need to do is locate the JSON feed, then extract and save the fields we want. (A minimal request sketch follows the list below.)

    • Video page: https://v.qq.com/x/cover/mzc00200lxzhhqz.html
    • Comment JSON endpoint: https://video.coral.qq.com/varticle/7225749902/comment/v2
    • Note: swap in another video's numeric id to scrape that video's comments instead.
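
    Before running the full spiders, it helps to sanity-check the endpoint with a single request. Here is a minimal sketch that parses the response with the json module instead of regexes; the field names last, content, and time come from the regex patterns used below, while the exact nesting (a top-level data object holding an oriCommList array) is an assumption about the response layout:

    import requests

    # Fetch one page (10 comments) of the comment feed.
    url = 'https://video.coral.qq.com/varticle/7225749902/comment/v2'
    r = requests.get(url, params={'orinum': '10', 'cursor': '0'},
                     headers={'User-Agent': 'Mozilla/5.0'})
    r.raise_for_status()
    data = r.json()['data']                  # assumed top-level key

    print(data['last'])                      # cursor for the next page
    for c in data.get('oriCommList', []):    # assumed key for the comment list
        print(c['time'], c['content'])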

    How to find the video id?


    (Screenshots omitted; the numeric id is the number that appears in the comment JSON endpoint URL above.)

    Project structure:

    (Screenshot of the project layout omitted.)
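
    From the file names and the '../Spiders/data.csv' path used later, the layout was presumably something like this (the name of the analysis folder is a guess):

    Spiders/
        spiders.py      # scrape comment content  -> content.txt
        sp.py           # scrape comment times    -> time.txt
        time.py         # time.txt    -> data.csv
        CD.py           # content.txt -> content.csv
    Analysis/           # assumed name
        py.py           # charting / analysis scripts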


    Part 1. Spider

    1. Scraping comment content: spiders.py


    import requests
    import re
    import random


    def get_html(url, params):
        # Rotate through a few desktop user agents to look less like a bot.
        uapools = [
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14'
        ]

        thisua = random.choice(uapools)
        headers = {"User-Agent": thisua}
        r = requests.get(url, headers=headers, params=params)
        r.raise_for_status()
        r.encoding = 'utf-8'  # without this the text comes back garbled
        return r.text


    def parse_page(infolist, data):
        # Pull every comment body out of the raw JSON text, plus the
        # 'last' cursor that points to the next page.
        commentpat = '"content":"(.*?)"'
        lastpat = '"last":"(.*?)"'

        commentall = re.compile(commentpat, re.S).findall(data)
        next_cid = re.compile(lastpat).findall(data)[0]

        infolist.append(commentall)

        return next_cid


    def print_comment_list(infolist):
        j = 0
        for page in infolist:
            print('Page ' + str(j + 1) + '\n')
            commentall = page
            for i in range(0, len(commentall)):
                print(commentall[i] + '\n')
            j += 1


    def save_to_txt(infolist, path):
        fw = open(path, 'w+', encoding='utf-8')
        for page in infolist:
            commentall = page
            for i in range(0, len(commentall)):
                fw.write(commentall[i] + '\n')
        fw.close()


    def main():
        infolist = []
        vid = '7225749902'   # the video's comment article id
        cid = "0"            # start cursor; updated from each page's 'last'
        page_num = 3000      # 3000 pages x 10 comments = ~30,000 comments
        url = 'https://video.coral.qq.com/varticle/' + vid + '/comment/v2'

        for i in range(page_num):
            params = {'orinum': '10', 'cursor': cid}
            html = get_html(url, params)
            cid = parse_page(infolist, html)

        print_comment_list(infolist)
        save_to_txt(infolist, 'content.txt')


    main()
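
    Design note: pagination here is cursor-based rather than page-numbered. Each response carries a last field, and feeding it back as the cursor parameter of the next request walks the feed forward; with orinum=10 and page_num=3000 the loop collects roughly 30,000 comments, which matches the total quoted in the introduction.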
    
    

    2. Scraping comment timestamps: sp.py (identical to spiders.py except for the regex pattern and the output file)

    import requests
    import re
    import random


    def get_html(url, params):
        # Rotate through a few desktop user agents to look less like a bot.
        uapools = [
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14'
        ]

        thisua = random.choice(uapools)
        headers = {"User-Agent": thisua}
        r = requests.get(url, headers=headers, params=params)
        r.raise_for_status()
        r.encoding = 'utf-8'  # without this the text comes back garbled
        return r.text


    def parse_page(infolist, data):
        # Same as spiders.py, but capture each comment's timestamp instead
        # of its content, plus the 'last' cursor for the next page.
        commentpat = '"time":"(.*?)"'
        lastpat = '"last":"(.*?)"'

        commentall = re.compile(commentpat, re.S).findall(data)
        next_cid = re.compile(lastpat).findall(data)[0]

        infolist.append(commentall)

        return next_cid


    def print_comment_list(infolist):
        j = 0
        for page in infolist:
            print('Page ' + str(j + 1) + '\n')
            commentall = page
            for i in range(0, len(commentall)):
                print(commentall[i] + '\n')
            j += 1


    def save_to_txt(infolist, path):
        fw = open(path, 'w+', encoding='utf-8')
        for page in infolist:
            commentall = page
            for i in range(0, len(commentall)):
                fw.write(commentall[i] + '\n')
        fw.close()


    def main():
        infolist = []
        vid = '7225749902'   # the video's comment article id
        cid = "0"            # start cursor; updated from each page's 'last'
        page_num = 3000      # must match spiders.py so the two files align
        url = 'https://video.coral.qq.com/varticle/' + vid + '/comment/v2'

        for i in range(page_num):
            params = {'orinum': '10', 'cursor': cid}
            html = get_html(url, params)
            cid = parse_page(infolist, html)

        print_comment_list(infolist)
        save_to_txt(infolist, 'time.txt')


    main()
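
    Since the two spiders differ only in the regex and the output file, one pass could capture both fields and keep each comment paired with its timestamp (the later analysis relies on content.txt and time.txt lining up row for row). A minimal sketch of a combined parse_page; it assumes every comment object in the response yields exactly one "content" and one "time" match, in the same order:

    import re

    def parse_page_both(infolist, data):
        # Capture content and time in one pass over the raw JSON text.
        contents = re.findall('"content":"(.*?)"', data, re.S)
        times = re.findall('"time":"(.*?)"', data)
        infolist.extend(zip(times, contents))          # assumes 1:1 alignment
        return re.findall('"last":"(.*?)"', data)[0]   # cursor for next page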
    
    

    Part 2. Data processing


    1. Converting comment timestamps to readable time: time.py


    import csv
    import time

    # Convert each Unix timestamp in time.txt into a readable local time and
    # write it to data.csv as two columns: date and time of day.
    csvFile = open("data.csv", 'w', newline='', encoding='utf-8')
    writer = csv.writer(csvFile)

    f = open("time.txt", 'r', encoding='utf-8')
    for line in f:
        stamp = int(line)                    # Unix timestamp in seconds
        timeArray = time.localtime(stamp)
        row = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
        print(row)
        writer.writerow(row.split())         # ['YYYY-MM-DD', 'HH:MM:SS']

    f.close()
    csvFile.close()
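
    A quick check of the conversion (time.localtime uses the machine's local zone; the value below is what a UTC+8 machine prints):

    import time
    print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(1609459200)))
    # on a UTC+8 machine: 2021-01-01 08:00:00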
    



    2. Writing comment content to CSV: CD.py


    import csv

    # Copy the scraped comments from content.txt into content.csv.
    csvFile = open("content.csv", 'w', newline='', encoding='utf-8')
    writer = csv.writer(csvFile)

    f = open("content.txt", 'r', encoding='utf-8')
    for line in f:
        writer.writerow(line.split())   # whitespace-split into columns

    f.close()
    csvFile.close()
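
    One caveat: line.split() breaks a comment that contains spaces into several columns. If each comment should stay intact as a single cell, a variant of the loop like this (my adjustment, not the original code) keeps it whole:

    for line in f:
        text = line.strip()
        if text:                        # skip blank lines
            writer.writerow([text])     # one comment per row, one column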
    

    3. Counting comments in each hour of the day: py.py


    import csv

    from pyecharts import options as opts
    from wordcloud import WordCloud

    # Read the converted times and keep only the hour ("HH") of each comment.
    with open('../Spiders/data.csv') as csvfile:
        reader = csv.reader(csvfile)

        data1 = [str(row[1])[0:2] for row in reader]   # row[1] is 'HH:MM:SS'

        print(data1)
    print(
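
    The listing is cut off at this point. To actually tally the comments per hour from data1, a Counter does the job; a sketch of the counting step (not the original code, which is truncated above):

    from collections import Counter

    hour_counts = Counter(data1)        # maps 'HH' -> number of comments
    for hour in sorted(hour_counts):
        print(hour, hour_counts[hour])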