Scrapy Wayback Middleware

Middleware for submitting all scraped response URLs to the Internet Archive Wayback Machine for archival.

Installation

pip install scrapy-wayback-middleware

Setup

Add scrapy_wayback_middleware.WaybackMiddleware to your project's SPIDER_MIDDLEWARES setting. By default, the middleware makes GET requests to web.archive.org/save/{URL}. If the WAYBACK_MIDDLEWARE_POST setting is True, it makes POST requests to pragma.archivelab.org instead.
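A minimal settings sketch for the setup described above. The priority value 500 is an assumption chosen for illustration, not a value this README specifies; pick whatever order suits your project's other spider middlewares.

```python
# settings.py (sketch)

SPIDER_MIDDLEWARES = {
    # Assumed priority: 500 is illustrative, not prescribed by the package.
    "scrapy_wayback_middleware.WaybackMiddleware": 500,
}

# False (the default behavior described above) sends GET requests to
# web.archive.org/save/{URL}; True switches to POST requests to
# pragma.archivelab.org.
WAYBACK_MIDDLEWARE_POST = False
```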

Configuration

To customize behavior, subclass WaybackMiddleware and override one of its methods: get_item_urls to pull additional links to archive from individual items, or handle_wayback to change how responses from the Wayback Machine are handled. The WAYBACK_MIDDLEWARE_POST setting can be set to True to switch the request behavior from GET to POST.
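The following is a sketch of a subclass that overrides get_item_urls. It assumes the method receives a single dict-like item and returns a list of URL strings; that signature is not stated in this README, so check it against the installed version. The helper extract_item_urls, the field names, and the class name ItemLinksWaybackMiddleware are hypothetical. The fallback stub exists only so the sketch can run where the package is not installed.

```python
try:
    from scrapy_wayback_middleware import WaybackMiddleware
except ImportError:
    # Stand-in so this sketch runs without the package; not the real class.
    class WaybackMiddleware:
        def get_item_urls(self, item):
            return []


def extract_item_urls(item, fields=("url", "links")):
    """Collect URL strings from the given fields of a dict-like item.

    The field names are hypothetical; use the ones your items define.
    """
    urls = []
    for field in fields:
        value = item.get(field)
        if isinstance(value, str):
            urls.append(value)
        elif isinstance(value, (list, tuple)):
            # Keep only string entries from list-valued fields.
            urls.extend(v for v in value if isinstance(v, str))
    return urls


class ItemLinksWaybackMiddleware(WaybackMiddleware):
    # Assumed signature: one item in, a list of URLs to archive out.
    def get_item_urls(self, item):
        return extract_item_urls(item)
```

To use a subclass like this, list its import path in SPIDER_MIDDLEWARES in place of scrapy_wayback_middleware.WaybackMiddleware.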

Duplicate Filtering

To avoid sending duplicate requests when WAYBACK_MIDDLEWARE_POST is False, either include web.archive.org in your spider's allowed_domains property (if specified) or disable scrapy.spidermiddlewares.offsite.OffsiteMiddleware in your settings.
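Both options can be sketched as follows. Setting a middleware's value to None is Scrapy's standard way to disable it. The priority 500 and the example.com domain are assumptions for illustration.

```python
# settings.py (sketch)

# Option 1: disable the offsite spider middleware entirely.
SPIDER_MIDDLEWARES = {
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    # Assumed priority, for illustration only.
    "scrapy_wayback_middleware.WaybackMiddleware": 500,
}

# Option 2 (instead of option 1): keep the offsite middleware and add the
# archive host to the spider's allowed_domains attribute, for example
#     allowed_domains = ["example.com", "web.archive.org"]
```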

Rate Limits

Neither endpoint returns headers indicating specific rate limits, but the GET endpoint at web.archive.org/save has a rate limit of 25 requests/minute, resetting each minute. To handle this, the middleware waits 60 seconds whenever it receives a 429 response.
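The wait-on-429 behavior and the effect of the limit on throughput can be illustrated with a short sketch. This is not the middleware's own code. The function names are hypothetical, and the injectable sleep argument is only there so the logic can be exercised without actually pausing.

```python
import math
import time

RATE_LIMIT_STATUS = 429   # status code that signals rate limiting
WAIT_SECONDS = 60         # pause length described above
REQUESTS_PER_MINUTE = 25  # GET endpoint limit described above


def wait_if_rate_limited(status, sleep=time.sleep):
    """Pause for WAIT_SECONDS on a 429 status; return the seconds waited."""
    if status == RATE_LIMIT_STATUS:
        sleep(WAIT_SECONDS)
        return WAIT_SECONDS
    return 0


def min_minutes_to_archive(url_count):
    """Lower bound on the minutes needed to submit url_count URLs via GET."""
    return math.ceil(url_count / REQUESTS_PER_MINUTE)
```

For example, under the 25 requests/minute limit, submitting 1,000 URLs through the GET endpoint takes at least 40 minutes, which is worth factoring into crawl scheduling.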
