habrahabr-dataset

A dataset collected from the popular Russian collective blogs Habrahabr, Geektimes and Megamozg, owned by TM.

Data Archives

Data format

habr_posts/<post_id>

{
    "_id": 115710,
    "_last_update": "2015-04-08T00:00:00",
    "title": "Собираем данные с помощью Scrapy",
    "published": "2011-03-18T23:13:00",
    "author": "bekbulatov",
    "author_url": "http://habrahabr.ru/users/bekbulatov/",
    "author_rating": 3.8,
    "hubs": [["Python", "http://habrahabr.ru/hub/python/"]],
    "favs_count": 315,
    "pageviews": 21281,
    "tags": ["scrapy", "парсинг", "python", "crawler"],
    "comments_count": 49,
    "content_html": "..."
}
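
A minimal sketch for reading one post document, assuming each post is stored as a standalone UTF-8 JSON file named after its id under habr_posts/ (the exact on-disk layout is an assumption, not something this README specifies):

import json

def load_post(post_id, root="habr_posts"):
    # Each file is assumed to hold one JSON document like the example above.
    with open(f"{root}/{post_id}", encoding="utf-8") as f:
        return json.load(f)

post = load_post(115710)
print(post["title"], post["pageviews"], post["comments_count"])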

posts.csv

A summary table of all posts in the dataset, in CSV format (UTF-8 encoded). A loading sketch follows the column list below.

Columns:

  • post_id
  • last_update
  • published
  • title
  • author
  • favs_count
  • pageviews
  • comments_count
  • comments_parsed
  • comments_banned
  • first_comment_time
  • last_comment_time
  • author_comments
  • tags
  • content_length
  • hubs_count
  • hubs
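
A minimal loading sketch using only the standard library; a comma delimiter is assumed, since the README only specifies the UTF-8 encoding:

import csv

with open("posts.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        # Numeric columns arrive as strings; cast the ones you need.
        if int(row["pageviews"] or 0) > 100000:
            print(row["post_id"], row["title"])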

How to create the dataset

Use the download_all_habr.py script to fetch and parse all currently available pages. Habrahabr posts are indexed with consecutive integers from 1 to roughly 300000, and you must specify the range of indices to download. To distribute the download across several machines, give each machine a different slice of the whole range (see the sketch below).

$ python download_all_habr.py --start-index 1 --finish-index 300000

The script will create a habr_pages directory and download the post contents there.
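
A minimal sketch of how the whole range could be split into per-machine slices; split_range is a hypothetical helper, not part of the repository, and only the --start-index/--finish-index flags documented above are used:

def split_range(start, finish, machines):
    """Yield (start_index, finish_index) pairs that cover [start, finish]."""
    size = -(-(finish - start + 1) // machines)  # ceiling division
    for i in range(machines):
        lo = start + i * size
        hi = min(lo + size - 1, finish)
        if lo > hi:
            break
        yield lo, hi

# Print one download command per machine.
for lo, hi in split_range(1, 300000, 3):
    print(f"python download_all_habr.py --start-index {lo} --finish-index {hi}")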

Related Projects