PDF Crawler

This is SimFin's open source PDF crawler. It can be used to crawl all PDFs from a website.

You specify a starting page, and the crawler follows every page linked from it. Links that lead to other websites are ignored, but PDFs that are linked on the original page and hosted on a different domain are still fetched.
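The link-following rule described above can be illustrated with a small sketch. This is not pdf-crawler's actual code: the function name `classify_link` is hypothetical, and detecting PDFs by a `.pdf` URL suffix is an assumption made for the example (a real crawler might inspect the response's content type instead).

```python
from urllib.parse import urljoin, urlparse


def classify_link(start_url, href):
    """Return 'pdf', 'page', or 'skip' for a link found on start_url.

    Hypothetical helper for illustration only, not part of pdf-crawler.
    """
    # Resolve relative links against the page they were found on.
    absolute = urljoin(start_url, href)
    parsed = urlparse(absolute)

    # Ignore mailto:, javascript:, and other non-web schemes.
    if parsed.scheme not in ("http", "https"):
        return "skip"

    # Assumption: a PDF is recognised by its URL suffix.
    # PDFs are fetched regardless of which domain hosts them.
    if parsed.path.lower().endswith(".pdf"):
        return "pdf"

    # Ordinary pages are followed only if they stay on the starting domain.
    if parsed.netloc == urlparse(start_url).netloc:
        return "page"

    return "skip"
```

With a starting page of `https://example.com/reports/`, this sketch would fetch a PDF hosted on another domain, follow `/about` on the same domain, and skip a link to an unrelated site.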

It can also crawl files "hidden" with JavaScript: the crawler can render the page and click on all elements to make new links appear.

Built-in proxy support.

We use this crawler to gather PDFs from company websites to find financial reports that are then uploaded to SimFin, but it can be used for other documents too.

Development

How to install pdf-crawler for development:

$ git clone https://github.com/SimFin/pdf-crawler.git
$ cd pdf-crawler

# Make a virtual environment with the tool of your choice. Please use Python 3.6 or later.
# Here is an example based on pyenv:
$ pyenv virtualenv 3.6.6 pdf-crawler

$ pip install -e .

Usage Example

After installing pdf-crawler as described in the "Development" section, you can import the crawler module and call its crawl function like so:

import crawler

crawler.crawl(url="https://simfin.com/crawlingtest/", output_dir="crawling_test", method="rendered-all")

Parameters

License

Available under the MIT license.

Credits

@gwaramadze, @q7v6rhgfzc8tnj3d, @thf24
