scrapy best practice
pip install -r requirements.txt
|____bin #bash scripts
|____requirements.txt
|____scrappy
| |____dbs #storage DAOs
| |____extensions #scrapy extensions
| |____items
| |____middlewares
| |____resources #static resources
| |____scripts #py scripts
| |____services #py services
| |____spiders #spiders definition
| |____utils #python utils
|____scrapy.cfg
write your spiders in spiders
extend CrawlSpider
define name
define start_urls or a start_requests function
define a parse function to parse the response
define models in items
define pipelines in pipelines
handleInsert.
parses the item before it is inserted
handleUpdate.
parses the item before it is updated
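A rough sketch of such a pipeline, assuming the handleInsert/handleUpdate hooks take and return the item (their real signatures and the storage layer are not shown in this project; a plain dict stands in for the database):

```python
class StoragePipeline:
    def __init__(self):
        self.store = {}  # stand-in for the real dbs/ storage layer

    def process_item(self, item, spider):
        # route the item to the insert or update hook
        key = item["id"]
        if key in self.store:
            self.store[key] = self.handleUpdate(item)
        else:
            self.store[key] = self.handleInsert(item)
        return item

    def handleInsert(self, item):
        # parse/normalize the item before insert
        item["text"] = item["text"].strip()
        return item

    def handleUpdate(self, item):
        # parse/normalize the item before update
        item["text"] = item["text"].strip()
        return item
```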
extend a base spider class
CrawlSpider.
a normal spider; it runs distributed if ENABLE_REDIS is set to True in settings
scrappy.extensions.scrapy_redis.spiders.RedisSpider.
this spider never shuts down; it keeps popping requests from redis
ResourceHelper.
reads, writes and creates files
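A sketch of what such a helper could look like; the method names, constructor, and default directory are assumptions, not this project's actual API:

```python
from pathlib import Path


class ResourceHelper:
    def __init__(self, base_dir="resources"):
        self.base = Path(base_dir)

    def read(self, name):
        # read a resource file as UTF-8 text
        return (self.base / name).read_text(encoding="utf-8")

    def write(self, name, content):
        # create any missing parent directories, then write the file
        path = self.base / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")
```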
RemoveCookieMiddleware.
removes cookies before each request
RandomProxyMiddleware.
switches to a random proxy before each request
UserAgentMiddleware.
switches to a random User-Agent before each request
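A downloader middleware of the User-Agent-switching kind can be sketched as follows; the class name, agent pool, and header handling are illustrative assumptions, not this project's implementation:

```python
import random


class RandomUserAgentMiddleware:
    # illustrative pool; a real deployment would use a larger, maintained list
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
```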
the configuration file is switched automatically by platform (Linux is treated as the production platform)
ENABLE_REDIS.
enables redis-based distribution and redis stats
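In settings this might look like the fragment below. ENABLE_REDIS is this project's own flag, not a stock Scrapy setting; the REDIS_URL line follows scrapy-redis conventions and is an assumption about how the vendored extension is wired up.

```python
# settings.py (sketch)
ENABLE_REDIS = True  # project-specific flag enabling distributed crawling
REDIS_URL = "redis://localhost:6379/0"  # assumed scrapy-redis style setting
```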
have a nice day :)