scrapy-beautifulsoup

Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup

MIT License


.. image:: https://badge.fury.io/py/scrapy-beautifulsoup.svg
   :target: http://badge.fury.io/py/scrapy-beautifulsoup
   :alt: PyPI version

.. image:: https://requires.io/github/alecxe/scrapy-beautifulsoup/requirements.svg?branch=master
   :target: https://requires.io/github/alecxe/scrapy-beautifulsoup/requirements/?branch=master
   :alt: Requirements Status


Installation

The package is on PyPI and can be installed with pip:

::

 pip install scrapy-beautifulsoup

Configuration

Add the middleware to the ``DOWNLOADER_MIDDLEWARES`` dictionary setting:

::

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 400
    }

By default, BeautifulSoup uses the built-in ``html.parser``. To change the parser, set the ``BEAUTIFULSOUP_PARSER`` setting:

::

    BEAUTIFULSOUP_PARSER = "html5lib"  # or BEAUTIFULSOUP_PARSER = "lxml"

``html5lib`` is an extremely lenient parser; if the target HTML is seriously broken, consider making it your first choice. Note: `html5lib <https://pypi.python.org/pypi/html5lib>`_ has to be installed in this case:

::

    pip install html5lib
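
To make the configuration above more concrete, here is a minimal sketch of how a downloader middleware of this kind could work. It is not the package's actual source: the class name ``SoupRepairMiddleware`` is hypothetical, and the sketch assumes the middleware reads ``BEAUTIFULSOUP_PARSER`` from the crawler settings and swaps the response body for BeautifulSoup's re-serialized markup.

```python
from bs4 import BeautifulSoup


class SoupRepairMiddleware:
    """Hypothetical middleware for illustration; not the package's own class."""

    def __init__(self, parser="html.parser"):
        self.parser = parser

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds middlewares through this hook. Read the parser name
        # from the settings and fall back to the built-in html.parser.
        return cls(crawler.settings.get("BEAUTIFULSOUP_PARSER", "html.parser"))

    def process_response(self, request, response, spider):
        # Responses without an encoding (e.g. binary downloads) pass through.
        encoding = getattr(response, "encoding", None)
        if encoding is None:
            return response
        # Parse the possibly broken markup and serialize the repaired tree.
        repaired = str(BeautifulSoup(response.text, self.parser))
        return response.replace(body=repaired.encode(encoding))
```

Because the repair happens in ``process_response``, the spider callback would receive the already repaired markup, so its selectors run against a consistent tree.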

Motivation

`BeautifulSoup <https://www.crummy.com/software/BeautifulSoup/bs4/doc/>`_, with the help of an `underlying parser of choice <https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser>`_, does a pretty good job of handling non-well-formed or broken HTML. In some cases, it makes sense to pipe the HTML through BeautifulSoup to "fix" it.
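
As a small illustration of what "fixing" means here, the sketch below pipes a broken snippet through BeautifulSoup and serializes the result. The helper name ``repair_html`` and the sample markup are made up for this example, and the exact output may differ between parsers.

```python
from bs4 import BeautifulSoup


def repair_html(markup, parser="html.parser"):
    """Parse possibly broken markup and return the re-serialized tree."""
    return str(BeautifulSoup(markup, parser))


# The <p> and <a> elements are never closed in the input.
broken = '<div><p>Unclosed paragraph<a href="/next">next</div>'
fixed = repair_html(broken)
print(fixed)
```

With ``html.parser``, the missing ``</a>`` and ``</p>`` closing tags are added before ``</div>``.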

.. |GitHub version| image:: https://badge.fury.io/gh/alecxe%2Fscrapy-beautifulsoup.svg
   :target: http://badge.fury.io/gh/alecxe%2Fscrapy-beautifulsoup
.. |Requirements Status| image:: https://requires.io/github/alecxe/scrapy-beautifulsoup/requirements.svg?branch=master
   :target: https://requires.io/github/alecxe/scrapy-beautifulsoup/requirements/?branch=master
