disqus-crawler

Crawl DISQUS comments from a blog into a local MongoDB database

Installation

Clone the GitHub repository, cd into it, then create a virtual environment and install the dependencies:

git clone git@github.com:louisguitton/disqus-crawler.git
cd disqus-crawler
python3 -m venv venv
source venv/bin/activate
pip install --upgrade -r requirements.txt

Usage example

Open main.sh and change the URL to the blog page you want to crawl.
Make sure a mongod instance is running on your computer (installation instructions for MongoDB are here)

mongod --config /usr/local/etc/mongod.conf

Make sure a splash instance is running (more information here)

$ docker run -p 8050:8050 scrapinghub/splash
2019-10-10 12:03:39.116598 [-] Server listening on http://0.0.0.0:8050

Run the main.sh script

$ sh main.sh
CRAWLING ... http://www.purseblog.com/louis-vuitton/louis-vuitton-spring-2016-bag-ad-campaign/
2019-10-10 14:07:28 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: purseblog)
...

Usage

mongo

use disqus
db.comments.count()
db.comments.find().pretty().limit(2)
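
The same checks can also be run from Python. This is a minimal sketch, assuming pymongo is installed, mongod listens on the default localhost:27017, and the database and collection are named disqus and comments as in the shell session above. The helper names preview_comments and main are hypothetical and not part of the project.

```python
def preview_comments(collection, limit=2):
    """Return the total document count and the first `limit` documents."""
    total = collection.count_documents({})
    sample = list(collection.find().limit(limit))
    return total, sample


def main():
    # pymongo is a third-party package; it is imported here so that
    # preview_comments can be reused with any collection-like object.
    from pymongo import MongoClient

    # Assumption: mongod runs locally on the default port.
    client = MongoClient("mongodb://localhost:27017")
    try:
        total, sample = preview_comments(client["disqus"]["comments"])
        print(total)
        for doc in sample:
            print(doc)
    finally:
        client.close()
```

Calling main() after a crawl should print the number of stored comments followed by two of them.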

Meta

I wrote this project for my master's thesis in 2016 on Paid/Owned/Earned Media, and on measuring brands on social channels and blogs.

For the crawling, this project uses Scrapy. It stores the comments in a MongoDB database, using the pymongo client. A good tutorial to follow is this one.
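
To illustrate how Scrapy and pymongo fit together, here is a minimal sketch of an item pipeline that writes each scraped item to MongoDB. It is not the project's actual pipeline: the class name MongoCommentPipeline and the default connection URI are assumptions, while the disqus database and comments collection names are taken from the Usage section.

```python
class MongoCommentPipeline:
    """Hypothetical Scrapy item pipeline storing each item as a MongoDB document."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="disqus", collection_name="comments"):
        self.mongo_uri = mongo_uri  # assumption: local mongod on the default port
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts; open the connection once.
        import pymongo  # third-party package, imported when the crawl begins

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes; release the connection.
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Convert the item to a plain dict and insert it as one document.
        self.collection.insert_one(dict(item))
        return item
```

A pipeline like this is enabled through the ITEM_PIPELINES setting in the Scrapy project's settings.py, keyed by the import path of the class.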

When scraping the web, two kinds of problems arise:

the target page is too slow to render because it uses a lot of JavaScript
the target page renders quickly, but the content you are interested in disappears once the page has rendered

To handle both situations, you can run a lightweight web browser on your local machine and have it render pages on demand. This project uses Splash, running in a local Docker container. A good tutorial to follow is this one.
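
As a rough illustration of what rendering through Splash looks like, the sketch below asks a local Splash instance for the rendered HTML of a page through its render.html HTTP endpoint, using only the Python standard library. The helper names and the 2 second wait are assumptions chosen for the example, and the project itself may talk to Splash differently (for instance through a Scrapy integration).

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Matches the port published by `docker run -p 8050:8050 scrapinghub/splash`.
SPLASH_URL = "http://localhost:8050"


def splash_render_url(page_url, wait=2.0, splash_url=SPLASH_URL):
    """Build the render.html URL that asks Splash to load `page_url`.

    `wait` is the number of seconds Splash waits after the page loads,
    giving JavaScript time to run before the HTML is returned.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(page_url, wait=2.0, timeout=60):
    """Return the HTML of `page_url` after Splash has rendered it."""
    with urlopen(splash_render_url(page_url, wait), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With the Splash container running, fetch_rendered_html can be called with the blog URL set in main.sh to check that the rendered HTML contains the comment section.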
