
蔡小小's blog: How to build a data visualization for 扫黑风暴

Date: 2021-09-11 16:51

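Introduction
How to find the video id
Project structure
- I. Crawler
  - Crawling comment content
  - Crawling comment time
- II. Data processing
  - Converting comment timestamps to readable time
  - Reading comment content into CSV
  - Counting comments in each time period of a day
  - Counting recent comments
- III. Data analysis
  - Word cloud
  - Bar and line charts of recent comment counts
  - Bar and line charts of comments per hour
  - Pie chart of recent comment counts
  - Pie chart of comments per hour
  - Pie chart of comments by viewing-time interval
  - Pie chart of mentions of the lead actors of 扫黑风暴
  - Sentiment analysis chart of comment content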

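Introduction

This post crawls and analyzes the comments on 扫黑风暴, a hit drama on Tencent Video. The work took two hours and collected 30,000 comments in total. Most of it is routine; the one part worth noting is the sentiment analysis of the comment text, which was new to me.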

Crawler: Tencent's comment data is wrapped in JSON, so all that is needed is to find the JSON file, then extract and save the required fields.
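As a rough illustration of extracting fields from a JSON response, the sketch below parses a small hand-written payload with the standard `json` module. The field names `content`, `time` and `last` are the ones the regular expressions in this post look for; the nesting under `data` and `comments`, the sample values, and the `extract` helper are assumptions made only for this example and may not match the real response.

```python
import json

# Hypothetical payload: the field names mirror the ones this post extracts
# ("content", "time", "last"), but the nesting is assumed for illustration.
SAMPLE = '{"data": {"last": "123", "comments": [{"content": "nice \\"show\\"", "time": "1631350260"}]}}'


def extract(payload):
    """Return (comments, times, next_cursor) from a JSON string."""
    data = json.loads(payload)["data"]
    comments = [c["content"] for c in data["comments"]]
    times = [int(c["time"]) for c in data["comments"]]
    return comments, times, data["last"]


comments, times, cursor = extract(SAMPLE)
print(comments, times, cursor)
```

Parsing with `json` decodes escaped characters inside a comment, which a plain text search does not do.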

Video URL: https://v.qq.com/x/cover/mzc00200lxzhhqz.html
Comment JSON URL: https://video.coral.qq.com/varticle/7225749902/comment/v2
Note: to crawl the comments of another video, just replace the numeric video id in the comment URL.
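The id substitution described in the note can be sketched as a small helper. The URL template is taken from the comment JSON URL above; the function name `comment_url` and the numeric check are illustrative additions, not part of the original scripts.

```python
# Template copied from the comment JSON URL, with the numeric id as a slot
BASE = "https://video.coral.qq.com/varticle/{vid}/comment/v2"


def comment_url(vid):
    """Build the comment URL for a numeric video id."""
    vid = str(vid)
    if not vid.isdigit():
        raise ValueError("video id must be numeric: " + vid)
    return BASE.format(vid=vid)


print(comment_url(7225749902))
```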

How to find the video id?

[Image: screenshot showing where to find the video id]

Project structure:

[Image: screenshot of the project structure]

I. Crawler:

1. Code for crawling comment content: spiders.py

import requests
import re
import random


def get_html(url, params):
    uapools = [
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14'
    ]

    thisua = random.choice(uapools)  # pick a random User-Agent for each request
    headers = {"User-Agent": thisua}
    r = requests.get(url, headers=headers, params=params, timeout=10)
    r.raise_for_status()
    r.encoding = 'utf-8'  # force UTF-8; without this line the text comes back garbled
    return r.text


def parse_page(infolist, data):
    # Patterns for the comment text and for the cursor of the next page
    commentpat = r'"content":"(.*?)"'
    lastpat = r'"last":"(.*?)"'

    commentall = re.compile(commentpat, re.S).findall(data)
    next_cid = re.compile(lastpat).findall(data)[0]

    # Each page is stored as one list of comments
    infolist.append(commentall)

    return next_cid



def print_comment_list(infolist):
    # Print every page with its page number, one comment per paragraph
    for j, page in enumerate(infolist, start=1):
        print('Page ' + str(j) + '\n')
        for comment in page:
            print(comment + '\n')


def save_to_txt(infolist, path):
    # Write one comment per line; the file is closed automatically
    with open(path, 'w+', encoding='utf-8') as fw:
        for page in infolist:
            for comment in page:
                fw.write(comment + '\n')


def main():
    infolist = []
    vid = '7225749902'  # numeric id of the video
    cid = "0"  # cursor of the first page
    page_num = 3000  # number of pages to request
    url = 'https://video.coral.qq.com/varticle/' + vid + '/comment/v2'

    for i in range(page_num):
        params = {'orinum': '10', 'cursor': cid}
        html = get_html(url, params)
        cid = parse_page(infolist, html)

    print_comment_list(infolist)
    save_to_txt(infolist, 'content.txt')


if __name__ == '__main__':
    main()
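The loop in `main` always requests a fixed number of pages. A minimal sketch of a cursor loop that also stops when the cursor runs out or repeats is shown below. `fetch_page` is a hypothetical callable standing in for `get_html` plus `parse_page`, and the `FAKE` dictionary is made-up data used only to exercise the loop without network access.

```python
def crawl(fetch_page, start_cursor="0", max_pages=3000):
    """Follow the cursor chain until it ends, repeats, or max_pages is reached.

    fetch_page(cursor) is a hypothetical callable returning
    (comments, next_cursor).
    """
    pages, seen, cursor = [], set(), start_cursor
    for _ in range(max_pages):
        if cursor in seen:
            break  # the same cursor came back, so we would loop forever
        seen.add(cursor)
        comments, next_cursor = fetch_page(cursor)
        if comments:
            pages.append(comments)
        if not next_cursor:
            break  # no further page
        cursor = next_cursor
    return pages


# Fake three-page source, used only to exercise the loop
FAKE = {"0": (["a", "b"], "10"), "10": (["c"], "20"), "20": ([], "")}
print(crawl(lambda cursor: FAKE[cursor]))
```

Passing the fetch step in as a function keeps the stopping logic testable on its own.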

2. Code for crawling comment time: sp.py

import requests
import re
import random


def get_html(url, params):
    uapools = [
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14'
    ]

    thisua = random.choice(uapools)  # pick a random User-Agent for each request
    headers = {"User-Agent": thisua}
    r = requests.get(url, headers=headers, params=params, timeout=10)
    r.raise_for_status()
    r.encoding = 'utf-8'  # force UTF-8; without this line the text comes back garbled
    return r.text


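def parse_page(infolist, data):
    # Patterns for the comment timestamp and for the cursor of the next page
    commentpat = r'"time":"(.*?)"'
    lastpat = r'"last":"(.*?)"'

    commentall = re.compile(commentpat, re.S).findall(data)
    next_cid = re.compile(lastpat).findall(data)[0]

    # Each page is stored as one list of timestamps
    infolist.append(commentall)

    return next_cid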



def print_comment_list(infolist):
    # Print every page with its page number, one timestamp per paragraph
    for j, page in enumerate(infolist, start=1):
        print('Page ' + str(j) + '\n')
        for comment in page:
            print(comment + '\n')


def save_to_txt(infolist, path):
    # Write one timestamp per line; the file is closed automatically
    with open(path, 'w+', encoding='utf-8') as fw:
        for page in infolist:
            for comment in page:
                fw.write(comment + '\n')


def main():
    infolist = []
    vid = '7225749902'  # numeric id of the video
    cid = "0"  # cursor of the first page
    page_num = 3000  # number of pages to request
    url = 'https://video.coral.qq.com/varticle/' + vid + '/comment/v2'

    for i in range(page_num):
        params = {'orinum': '10', 'cursor': cid}
        html = get_html(url, params)
        cid = parse_page(infolist, html)

    print_comment_list(infolist)
    save_to_txt(infolist, 'time.txt')


if __name__ == '__main__':
    main()
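The two scripts differ only in the field they search for, so the extraction could be written once and parameterized by field name. The sketch below does that and also steps over escaped quotes: on the sample text, the pattern `"content":"(.*?)"` used above would stop at the first `\"` and return only `say \`. The helper name `extract_field` and the sample string are illustrative assumptions, not part of the original code.

```python
import re


def extract_field(data, field):
    """Return every string value stored under "field" in the raw text."""
    # (?:[^"\\]|\\.)* runs to the closing quote but steps over \" pairs
    pattern = '"' + re.escape(field) + r'":"((?:[^"\\]|\\.)*)"'
    return re.findall(pattern, data, re.S)


# Hypothetical response fragment, for illustration only
raw = r'{"content":"say \"hi\"","time":"1631350260","last":"42"}'
print(extract_field(raw, "content"))
print(extract_field(raw, "time"))
print(extract_field(raw, "last"))
```

The returned text still contains the backslash escapes; decoding them is a separate step.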

II. Data processing

1. Converting comment timestamps to readable time: time.py

# coding=gbk
import csv
import time

# Read one Unix timestamp per line from time.txt and write the date and the
# time of day as two CSV columns in data.csv
with open("time.txt", 'r', encoding='utf-8') as f, \
        open("data.csv", 'w', newline='', encoding='utf-8') as csvFile:
    writer = csv.writer(csvFile)
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        timeArray = time.localtime(int(line))
        formatted = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
        print(formatted)
        # split() yields ['YYYY-MM-DD', 'HH:MM:SS'], i.e. two columns per row
        writer.writerow(formatted.split())
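`time.localtime` uses the time zone of the machine that runs the script, so the same timestamp can land in a different hour on another computer. A minimal sketch with an explicit zone follows. Beijing time (UTC+8) is an assumption about what the comment times should be shown in, and `format_timestamp` is an illustrative name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: comment times should be shown in Beijing time (UTC+8)
BEIJING = timezone(timedelta(hours=8))


def format_timestamp(ts):
    """Convert a Unix timestamp (int or numeric string) to 'YYYY-MM-DD HH:MM:SS'."""
    moment = datetime.fromtimestamp(int(ts), tz=BEIJING)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(format_timestamp("1631350260"))  # 2021-09-11 16:51:00
```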


2. Reading comment content into CSV: CD.py

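# coding=gbk
import csv

# Copy the comments from content.txt into content.csv, one comment per row
with open("content.txt", 'r', encoding='utf-8') as f, \
        open("content.csv", 'w', newline='', encoding='utf-8') as csvFile:
    writer = csv.writer(csvFile)
    for line in f:
        # split() breaks the line on whitespace, so a comment that contains
        # spaces is spread over several columns
        writer.writerow(line.split())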
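Because `split()` breaks on whitespace, a comment that contains spaces ends up in several columns. If one comment per cell is wanted, a variant along the following lines keeps each line whole. It writes to an in-memory buffer so it can be tried without files; `comments_to_csv` and the sample comments are made up for illustration.

```python
import csv
import io


def comments_to_csv(lines, out):
    """Write each non-empty comment as a single-column CSV row."""
    writer = csv.writer(out)
    for line in lines:
        comment = line.rstrip("\n")
        if comment:
            writer.writerow([comment])


buf = io.StringIO()
comments_to_csv(["great show, really\n", "so good today\n"], buf)
print(buf.getvalue())
```

The `csv` module quotes the first comment automatically because it contains a comma.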

3. Counting comments in each time period of a day: py.py

# coding=gbk
import csv

from pyecharts import options as opts
from sympy.combinatorics import Subset
from wordcloud import WordCloud

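# data.csv has two columns per row: the date and the time of day
with open('../Spiders/data.csv', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)

    # Keep the first two characters of the time column, i.e. the hour (00-23)
    data1 = [str(row[1])[0:2] for row in reader]

    print(data1)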
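One way to turn the list of hour strings into per-hour counts is `collections.Counter`. The sketch below assumes rows shaped like those in data.csv (date in the first column, `HH:MM:SS` in the second); the function name `comments_per_hour` and the sample rows are made up for illustration and are not taken from the original script.

```python
from collections import Counter


def comments_per_hour(rows):
    """Count rows per hour; each row is [date, 'HH:MM:SS'] as in data.csv."""
    counts = Counter(row[1][:2] for row in rows if len(row) > 1)
    # Return all 24 hours so that hours without comments show up as 0
    return [(f"{h:02d}", counts.get(f"{h:02d}", 0)) for h in range(24)]


# Made-up rows for illustration
rows = [
    ["2021-09-11", "16:51:00"],
    ["2021-09-11", "16:05:12"],
    ["2021-09-10", "08:00:00"],
]
hourly = comments_per_hour(rows)
print([item for item in hourly if item[1]])
```

Returning all 24 hours in order gives a chart a fixed x-axis even when some hours have no comments.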




