scrapy best practice
pip install -r requirements.txt
|____bin #bash scripts
|____requirements.txt
|____scrappy
| |____dbs #storage DAOs
| |____extensions #scrapy extensions
| |____items
| |____middlewares
| |____resources #static resources
| |____scripts #py scripts
| |____services #py services
| |____spiders #spiders definition
| |____utils #python utils
|____scrapy.cfg
write your spiders in spiders
extend CrawlSpider
define name
define start_urls or a start_requests function
define a parse function to parse the response
define models in items
define pipelines in pipelines
handleInsert.
parses the item before it is inserted
handleUpdate.
parses the item before it is updated
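A rough sketch of such a pipeline, assuming the handleInsert/handleUpdate hooks take and return the item (their real signatures and the storage layer are not shown in this project; a plain dict stands in for the database):

```python
class StoragePipeline:
    def __init__(self):
        self.store = {}  # stand-in for the real dbs/ storage layer

    def process_item(self, item, spider):
        # route the item to the insert or update hook
        key = item["id"]
        if key in self.store:
            self.store[key] = self.handleUpdate(item)
        else:
            self.store[key] = self.handleInsert(item)
        return item

    def handleInsert(self, item):
        # parse/normalize the item before insert
        item["text"] = item["text"].strip()
        return item

    def handleUpdate(self, item):
        # parse/normalize the item before update
        item["text"] = item["text"].strip()
        return item
```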
extend a base spider class
CrawlSpider.
a normal spider; it runs distributed if ENABLE_REDIS is set to True in settings
scrappy.extensions.scrapy_redis.spiders.RedisSpider.
this spider never shuts down; it keeps popping requests from redis
ResourceHelper.
reads, writes and creates files
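A sketch of what such a helper could look like; the method names, constructor, and default directory are assumptions, not this project's actual API:

```python
from pathlib import Path


class ResourceHelper:
    def __init__(self, base_dir="resources"):
        self.base = Path(base_dir)

    def read(self, name):
        # read a resource file as UTF-8 text
        return (self.base / name).read_text(encoding="utf-8")

    def write(self, name, content):
        # create any missing parent directories, then write the file
        path = self.base / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")
```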
RemoveCookieMiddleware.
removes cookies before each request
RandomProxyMiddleware.
switches to a random proxy before each request
UserAgentMiddleware.
switches to a random User-Agent before each request
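A downloader middleware of the User-Agent-switching kind can be sketched as follows; the class name, agent pool, and header handling are illustrative assumptions, not this project's implementation:

```python
import random


class RandomUserAgentMiddleware:
    # illustrative pool; a real deployment would use a larger, maintained list
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
```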
the configuration file is switched automatically by platform (Linux is treated as the production platform)
ENABLE_REDIS.
enables redis-based distribution and redis stats
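In settings this might look like the fragment below. ENABLE_REDIS is this project's own flag, not a stock Scrapy setting; the REDIS_URL line follows scrapy-redis conventions and is an assumption about how the vendored extension is wired up.

```python
# settings.py (sketch)
ENABLE_REDIS = True  # project-specific flag enabling distributed crawling
REDIS_URL = "redis://localhost:6379/0"  # assumed scrapy-redis style setting
```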
have a nice day :)