Xiaowang's Blog: Mastering Scrapy Web Crawling (Part 9): Downloading Files and Images in Practice


Posted: 2021-07-17 09:46

FilesPipeline and ImagesPipeline

Using FilesPipeline

Enable FilesPipeline in settings.py. It is usually placed ahead of the other item pipelines (a lower number runs earlier):

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

In settings.py, set FILES_STORE to the directory where downloaded files are saved:

FILES_STORE='C:/Users/30452/PycharmProjects/untitled10'

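When the spider parses a page that contains download links, it collects the URLs of all files to download into a list and assigns that list to the item's file_urls field (item['file_urls']). For each item it processes, FilesPipeline reads item['file_urls'] and downloads every URL in it. A sample spider: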


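import scrapy


class DownloadBookSpider(scrapy.Spider):
    name = 'download_book'  # hypothetical name; add start_urls for a real site

    def parse(self, response):
        # Treat every link on the page as a file to download.
        item = {}
        item['file_urls'] = []
        for url in response.xpath('//a/@href').extract():
            # Resolve relative links against the page URL.
            download_url = response.urljoin(url)
            item['file_urls'].append(download_url)
        yield item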
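
To see what the urljoin step in the loop above produces, here is a minimal standard-library sketch. It uses urllib.parse.urljoin as a stand-in for response.urljoin, and the page URL and hrefs are made-up values for illustration only.

```python
from urllib.parse import urljoin

# Hypothetical page URL and hrefs, for illustration only.
page_url = "https://example.com/books/index.html"
hrefs = ["files/a.pdf", "/static/b.pdf", "https://cdn.example.com/c.pdf"]

# Resolve each href against the page URL, as the loop above does.
file_urls = [urljoin(page_url, href) for href in hrefs]
print(file_urls)
```

Relative, root-relative and absolute hrefs all end up as absolute URLs, which is the form file_urls needs.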

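After FilesPipeline has downloaded every file in item['file_urls'], it collects the download result for each file into another list and assigns it to the item's files field (item['files']).
Each download result contains:
● path: where the file was saved locally (relative to FILES_STORE).
● checksum: the checksum of the file contents.
● url: the URL the file was downloaded from.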
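
A later pipeline or script may want the full location of each downloaded file. The sketch below joins FILES_STORE with the path of each result. The helper name local_paths, the store directory and the sample values in the item are all hypothetical; only the path, checksum and url keys come from the description above.

```python
from pathlib import PurePosixPath


def local_paths(item, files_store):
    """Return the on-disk location of every download result in item['files']."""
    store = PurePosixPath(files_store)
    return [str(store / result["path"]) for result in item.get("files", [])]


# Made-up item shaped like the result described above.
item = {
    "file_urls": ["https://example.com/a.pdf"],
    "files": [
        {"url": "https://example.com/a.pdf", "path": "full/0a1b2c.pdf", "checksum": "d41d8cd9"},
    ],
}
print(local_paths(item, "/tmp/downloads"))
```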

Using ImagesPipeline

ImagesPipeline is a subclass of FilesPipeline and is used in much the same way. It differs only in the item fields and settings it uses:

	FilesPipeline	ImagesPipeline
Import path	scrapy.pipelines.files.FilesPipeline	scrapy.pipelines.images.ImagesPipeline
Item fields	file_urls, files	image_urls, images
Download directory	FILES_STORE	IMAGES_STORE
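
Putting the right-hand column of the table together, a settings.py fragment for ImagesPipeline might look like the sketch below. The directory is a hypothetical value; items would then carry image_urls and images instead of file_urls and files. ImagesPipeline also needs the Pillow library installed.

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
IMAGES_STORE = "downloads/images"  # hypothetical directory
```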

Features specific to ImagesPipeline:

Generating thumbnails. Set IMAGES_THUMBS in settings.py. It is a dictionary whose keys are thumbnail names and whose values are the thumbnail sizes:

IMAGES_THUMBS={
    'small':(50,50),
    'big':(270,270),
}
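
As far as I understand the Scrapy documentation, the full-size image is stored under full/ and each thumbnail under thumbs/<size name>/, with the SHA-1 hash of the image URL as the file name. The sketch below only reproduces that layout for illustration; the function name and the sample URL are hypothetical, and the real pipeline may differ between versions.

```python
import hashlib

IMAGES_THUMBS = {"small": (50, 50), "big": (270, 270)}


def expected_image_paths(url, thumbs):
    """Approximate the paths (relative to IMAGES_STORE) used for one image."""
    image_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
    paths = {"full": f"full/{image_id}.jpg"}
    for name in thumbs:
        paths[name] = f"thumbs/{name}/{image_id}.jpg"
    return paths


for label, path in expected_image_paths("https://example.com/cat.png", IMAGES_THUMBS).items():
    print(label, path)
```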

Filtering out images that are too small. Set IMAGES_MIN_WIDTH and IMAGES_MIN_HEIGHT in settings.py to the minimum width and height an image must have:

IMAGES_MIN_WIDTH=200
IMAGES_MIN_HEIGHT=200
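
The effect of these two settings can be pictured as a simple size check: an image is kept only if it reaches both minimums. This is a sketch of the rule, not the pipeline's actual code, and the function name is hypothetical.

```python
def passes_size_filter(width, height, min_width=200, min_height=200):
    """Return True if an image of the given size would be kept."""
    return width >= min_width and height >= min_height


print(passes_size_filter(640, 480))  # large enough in both dimensions
print(passes_size_filter(640, 120))  # too short, would be dropped
```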

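Project: scraping the matplotlib example source files

Open http://matplotlib.org/examples/index.html in a browser.

Analyzing the pages

The links to all example pages are inside <div class="toctree-wrapper compound">, one in each <li class="toctree-l2">.

On an example page, the download address of the example's source file can be found in <a class="reference external">.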
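
To make the second selector concrete without running Scrapy, here is a standard-library sketch that pulls the href out of an <a> tag whose class list contains both reference and external. The HTML snippet is made up for illustration and is much simpler than the real page; in the spider the same job is done with a CSS selector.

```python
from html.parser import HTMLParser


class ExampleLinkParser(HTMLParser):
    """Collect hrefs of <a> tags with both 'reference' and 'external' classes."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if {"reference", "external"} <= classes and attrs.get("href"):
            self.hrefs.append(attrs["href"])


# Made-up fragment standing in for an example page.
sample = """
<div class="body">
  <a class="reference internal" href="index.html">up</a>
  <a class="reference external" href="animate_decay.py">source code</a>
</div>
"""
parser = ExampleLinkParser()
parser.feed(sample)
print(parser.hrefs)
```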

Writing the code

Create a Scrapy project named matplotlib_examples, then create the spider with the scrapy genspider command:

scrapy startproject matplotlib_examples
cd matplotlib_examples
scrapy genspider examples matplotlib.org

In settings.py, enable FilesPipeline and set the download directory:


ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE='C:/Users/30452/PycharmProjects/untitled10'

In items.py, implement ExampleItem. It needs two fields, file_urls and files:

import scrapy


class ExampleItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

Implement ExamplesSpider:

import scrapy
from scrapy.linkextractors import LinkExtractor
from ..items import ExampleItem


class ExamplesSpider(scrapy.Spider):
    name = 'examples'
    allowed_domains = ['matplotlib.org']
    start_urls = ['https://matplotlib.org/2.0.2/examples/index.html']

    def parse(self, response):
        # Example-page links sit inside div.toctree-wrapper.compound;
        # skip the per-category index pages.
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound', deny=r'/index\.html$')
        links = le.extract_links(response)
        self.logger.info('Found %d example pages', len(links))
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_example)

    def parse_example(self, response):
        # The source file link is the first a.reference.external on the page.
        href = response.css('a.reference.external::attr(href)').extract_first()
        url = response.urljoin(href)
        example = ExampleItem()
        example['file_urls'] = [url]
        return example

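The parse method handles the page that lists the examples: it extracts the link to each example page, builds a Request from it, and submits it.
The parse_example method handles an individual example page: it finds the source file link and puts it in the item's file_urls.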
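
The deny pattern passed to LinkExtractor is a regular expression matched against each link's URL. The sketch below applies the same pattern with the re module to two illustrative URLs (they are assumptions chosen to show the effect, not output captured from the site).

```python
import re

deny = re.compile(r"/index\.html$")

# Illustrative URLs only.
urls = [
    "https://matplotlib.org/2.0.2/examples/animation/index.html",
    "https://matplotlib.org/2.0.2/examples/animation/animate_decay.html",
]

# Keep only the links that the deny pattern does not match.
kept = [u for u in urls if not deny.search(u)]
print(kept)
```

The category index page is dropped and the example page is kept, so parse_example only ever sees real example pages.
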

Run the crawler with scrapy crawl examples, then check the download directory.

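
By default, FilesPipeline stores every file under a full/ subdirectory and names it after the SHA-1 hash of its URL, so the downloaded examples are hard to tell apart. The sketch below approximates that default naming with the standard library; it is a simplification under that assumption, not Scrapy's actual implementation, and the function name and URL are hypothetical.

```python
import hashlib
import os
from urllib.parse import urlparse


def default_style_path(url):
    """Approximate the default layout: full/<sha1 of url><extension>."""
    media_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    media_ext = os.path.splitext(urlparse(url).path)[1]
    return f"full/{media_guid}{media_ext}"


# Illustrative URL only.
print(default_style_path("https://matplotlib.org/mpl_examples/animation/animate_decay.py"))
```
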
Changing how FilesPipeline names files

In pipelines.py, implement a subclass of FilesPipeline and override its file_path method to get the naming rule you want:

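from os.path import basename, dirname, join
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):
    # Newer Scrapy releases also pass item as a keyword argument.
    def file_path(self, request, response=None, info=None, *, item=None):
        # Name the file <parent directory>/<file name>, both taken from the URL path.
        path = urlparse(request.url).path
        return join(basename(dirname(path)), basename(path))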
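
The naming rule can be tried on its own, outside Scrapy. This sketch applies the same dirname/basename logic to an illustrative URL (an assumption, not taken from the crawl). It uses posixpath so the result has forward slashes on every platform, whereas os.path.join in the pipeline above follows the local platform's separator.

```python
import posixpath
from urllib.parse import urlparse


def example_file_path(url):
    """Return '<parent directory>/<file name>' taken from the URL path."""
    path = urlparse(url).path
    return posixpath.join(posixpath.basename(posixpath.dirname(path)), posixpath.basename(path))


# Illustrative URL only.
print(example_file_path("https://matplotlib.org/mpl_examples/animation/animate_decay.py"))
```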

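Update the settings to use MyFilesPipeline instead of FilesPipeline. The class lives in the project's pipelines.py, so its path is matplotlib_examples.pipelines.MyFilesPipeline: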


ITEM_PIPELINES = {
    # 'scrapy.pipelines.files.FilesPipeline': 1,
    'matplotlib_examples.pipelines.MyFilesPipeline': 1,
}
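
Each key in ITEM_PIPELINES is a dotted path of the form <module>.<class name>, so it has to match the module where the class is actually defined. The sketch below shows the general idea of resolving such a path with importlib. The helper name resolve_dotted_path is hypothetical, and it is demonstrated on os.path.join so that it runs without a Scrapy project.

```python
from importlib import import_module


def resolve_dotted_path(path):
    """Split 'package.module.Name' into module and attribute, then load it."""
    module_path, _, name = path.rpartition(".")
    module = import_module(module_path)
    return getattr(module, name)


# Standard-library target, used here only for demonstration.
join = resolve_dotted_path("os.path.join")
print(join("a", "b"))
```

A path whose module part does not exist fails at import time, which is how a mistyped pipeline path usually shows up when the crawler starts.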

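Run the crawler again. Each source file is now saved under FILES_STORE as <parent directory>/<file name>, both taken from its URL.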
