
.. role:: py(code)
   :language: python

.. role:: json(code)
   :language: json

TinyCrawler
===========

|travis| |sonar_quality| |sonar_maintainability| |sonar_coverage| |code_climate_maintainability| |pip|

A highly customizable crawler that uses multiprocessing and proxies to download one or more websites, following user-provided filter, parse, and save functions.

REMEMBER THAT DDOS IS ILLEGAL. DO NOT USE THIS SOFTWARE FOR ILLEGAL PURPOSES.

Installing TinyCrawler
----------------------

.. code:: shell

    pip install tinycrawler

TODOs for next version
----------------------

- Test proxies while normally downloading. (DONE)
- Parallelize downloads across different domains. (DONE)
- Drop proxies with a high failure rate, with configurable parameters for that rate. (DONE, yet to be tested)
- Make the failure rate domain-specific, while also keeping a global mean.
- Enable the failure rate also for local (non-proxy) requests.
- Check robots.txt also before downloading URLs.
- Reduce the default robots.txt timeout to 2 hours.
- Make the wait timeout between download attempts exponential.
- Flag a file as binary when more than 3/5 of its first 1000 characters are zeros (see the sketch after this list).
- Add a user agent.
- Stop downloads when all proxies are dead.
- Try to use active_children as a way to test for active processes.
- Add tests for proxies.
- Add a way to save progress automatically at a given interval.
- Add a way to automatically save tested proxies.
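
A minimal sketch of the binary-file heuristic mentioned in the list above, assuming "zeros" means null bytes and using the stated sample size and threshold; the function name and signature are hypothetical and not part of the library:

.. code:: python

    def looks_binary(payload: bytes, sample_size: int = 1000, threshold: float = 3 / 5) -> bool:
        """Heuristically flag content as binary when more than `threshold`
        of its first `sample_size` bytes are zeros (null bytes)."""
        sample = payload[:sample_size]
        if not sample:
            return False
        zeros = sample.count(0)
        return zeros / len(sample) > threshold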

Preview (Test case)
-------------------

This is a preview of the console output when running test_base.py_.

|preview|

Basic usage example
-------------------

.. code:: python

    from tinycrawler import TinyCrawler, Log
    from bs4 import BeautifulSoup


    def url_validator(url: str, logger: Log) -> bool:
        """Return a boolean representing if the crawler should parse the given url."""
        return url.startswith("http://interestingurl.com")


    def file_parser(url: str, soup: BeautifulSoup, logger: Log):
        """Parse and elaborate the given soup."""
        # soup parsing...
        pass


    TinyCrawler(
        file_parser=file_parser,
        url_validator=url_validator
    ).run("https://www.example.com/")

Example loading proxies
-----------------------

.. code:: python

    from tinycrawler import TinyCrawler, Log
    from bs4 import BeautifulSoup


    def url_validator(url: str, logger: Log) -> bool:
        """Return a boolean representing if the crawler should parse the given url."""
        return url.startswith("http://interestingurl.com")


    def file_parser(url: str, soup: BeautifulSoup, logger: Log):
        """Parse and elaborate the given soup."""
        # soup parsing...
        pass


    crawler = TinyCrawler(
        file_parser=file_parser,
        url_validator=url_validator
    )
    crawler.load_proxies("http://myexampletestserver.com", "path/to/proxies.json")
    crawler.run("https://www.example.com/")

Proxies are expected to be in the following format:

.. code:: json

    [
      {
        "ip": "89.236.17.108",
        "port": 3128,
        "type": [
          "https",
          "http"
        ]
      },
      {
        "ip": "128.199.141.151",
        "port": 3128,
        "type": [
          "https",
          "http"
        ]
      }
    ]
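
For example, a file in this format can be produced with the standard :py:`json` module; a minimal sketch, where the proxy entries and the output path are placeholders rather than values required by the library:

.. code:: python

    import json

    # Hypothetical proxy entries in the format shown above.
    proxies = [
        {"ip": "89.236.17.108", "port": 3128, "type": ["https", "http"]},
        {"ip": "128.199.141.151", "port": 3128, "type": ["https", "http"]},
    ]

    # Write them to the path later passed to crawler.load_proxies(...).
    with open("path/to/proxies.json", "w") as f:
        json.dump(proxies, f, indent=2)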

License
-------

The software is released under the MIT license.

.. _test_base.py: https://github.com/LucaCappelletti94/tinycrawler/blob/master/tests/test_base.py

.. |preview| image:: https://github.com/LucaCappelletti94/tinycrawler/blob/master/preview.png?raw=true

.. |travis| image:: https://travis-ci.org/LucaCappelletti94/tinycrawler.png
   :target: https://travis-ci.org/LucaCappelletti94/tinycrawler

.. |sonar_quality| image:: https://sonarcloud.io/api/project_badges/measure?project=tinycrawler.lucacappelletti&metric=alert_status
   :target: https://sonarcloud.io/dashboard/index/tinycrawler.lucacappelletti

.. |sonar_maintainability| image:: https://sonarcloud.io/api/project_badges/measure?project=tinycrawler.lucacappelletti&metric=sqale_rating
   :target: https://sonarcloud.io/dashboard/index/tinycrawler.lucacappelletti

.. |sonar_coverage| image:: https://sonarcloud.io/api/project_badges/measure?project=tinycrawler.lucacappelletti&metric=coverage
   :target: https://sonarcloud.io/dashboard/index/tinycrawler.lucacappelletti

.. |code_climate_maintainability| image:: https://api.codeclimate.com/v1/badges/25fb7c6119e188dbd12c/maintainability
   :target: https://codeclimate.com/github/LucaCappelletti94/tinycrawler/maintainability
   :alt: Maintainability

.. |pip| image:: https://badge.fury.io/py/tinycrawler.svg
   :target: https://badge.fury.io/py/tinycrawler