ekrhizoc (E6c): A web crawler
εκρίζωση (Greek) ekrízosi / uprooting, eradication
Also known as E6c.
Implementation of a simple Python web crawler. Input: a URL (seed). Output: a simple textual sitemap showing the links between pages.
Links are read from `<a>` elements. URLs are limited to `E6C_MAX_URL_LENGTH` length in characters.

This project implements a Basic Universal Crawler based on breadth-first search graph traversal.
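A minimal sketch of breadth-first traversal over links taken from `<a>` elements, using only the standard library. It assumes pages come from a hypothetical `fetch` callable (here an in-memory dictionary rather than real HTTP requests). The names `LinkParser`, `extract_links`, `crawl` and `format_sitemap`, as well as the `->` output layout, are illustrative and not the project's actual API or output format.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href values of <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, de-duplicated links in document order."""
    parser = LinkParser()
    parser.feed(html)
    seen, result = set(), []
    for href in parser.links:
        url = urljoin(base_url, href)  # resolve relative links
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


def crawl(seed, fetch, max_urls=10000):
    """Breadth-first crawl from seed; returns {page: [links]} in visit order."""
    sitemap = {}
    visited = {seed}
    queue = deque([seed])
    while queue and len(sitemap) < max_urls:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched
            continue
        links = extract_links(url, html)
        sitemap[url] = links
        for link in links:
            if link not in visited:
                visited.add(link)  # mark on enqueue so a URL is queued once
                queue.append(link)
    return sitemap


def format_sitemap(sitemap):
    """Render each page followed by its outgoing links."""
    lines = []
    for page, links in sitemap.items():
        lines.append(page)
        lines.extend(f"  -> {link}" for link in links)
    return "\n".join(lines)


# Hypothetical three-page site used instead of network access.
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/">Home</a>',
}

sitemap = crawl("https://example.com/", PAGES.get)
print(format_sitemap(sitemap))
```

A FIFO queue gives the breadth-first order: every page one link away from the seed is visited before any page two links away.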
Behaviour of the application can be configured via Environment Variables.
Environment Variable | Description | Type | Default Value |
---|---|---|---|
E6C_LOG_LEVEL | Level of logging (overrides the verbose/quiet flag) | string | - |
E6C_LOG_DIR | Directory to save logs | string | - |
E6C_BIN_DIR | Directory to save any output (bin) | string | bin |
E6C_IGNORE_FILETYPES | File types of websites to ignore (e.g. ".filetype1,.filetype2") | string | ".png,.pdf,.txt,.doc,.jpg,.gif" |
E6C_URL_REQUEST_TIMER | Time (in seconds) to wait per request (to avoid flooding the server with requests) | float | 0.1 |
E6C_MAX_URLS | The maximum number of URLs to fetch/crawl | integer | 10000 |
E6C_MAX_URL_LENGTH | The maximum length (character count) of a URL to fetch/crawl | integer | 300 |
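As a rough illustration of how the variables in the table above might be consumed, the sketch below reads them with the table's default values and applies the file type and length limits to a URL. The helpers `load_settings` and `is_crawlable` and the settings keys are hypothetical. How the application actually parses these variables is not shown in this document.

```python
import os
from urllib.parse import urlparse

# Defaults copied from the table; the two logging variables have no default.
DEFAULTS = {
    "E6C_BIN_DIR": "bin",
    "E6C_IGNORE_FILETYPES": ".png,.pdf,.txt,.doc,.jpg,.gif",
    "E6C_URL_REQUEST_TIMER": "0.1",
    "E6C_MAX_URLS": "10000",
    "E6C_MAX_URL_LENGTH": "300",
}


def load_settings(environ=None):
    """Merge environment values over the defaults and convert the types."""
    env = {**DEFAULTS, **(os.environ if environ is None else environ)}
    filetypes = tuple(
        ext.strip().lower()
        for ext in env["E6C_IGNORE_FILETYPES"].split(",")
        if ext.strip()
    )
    return {
        "bin_dir": env["E6C_BIN_DIR"],
        "ignore_filetypes": filetypes,
        "url_request_timer": float(env["E6C_URL_REQUEST_TIMER"]),
        "max_urls": int(env["E6C_MAX_URLS"]),
        "max_url_length": int(env["E6C_MAX_URL_LENGTH"]),
    }


def is_crawlable(url, settings):
    """Reject URLs that are too long or end in an ignored file type."""
    if len(url) > settings["max_url_length"]:
        return False
    path = urlparse(url).path.lower()
    # str.endswith accepts a tuple of suffixes
    return not path.endswith(settings["ignore_filetypes"])


settings = load_settings({"E6C_MAX_URLS": "50"})
print(settings["max_urls"], is_crawlable("https://example.com/logo.png", settings))
```

Passing a dictionary to `load_settings` keeps the example independent of the real process environment. Calling it with no argument reads `os.environ`.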
Requires `conda` or `miniconda`.

Set up the development environment (`conda`, `poetry`, `pre-commit`):
$ make env
$ make env-update
On a terminal, run the following (execute on project's root directory):
$ . ./scripts/helpers/environment.sh
Using `poetry`:
$ ekrhizoc
[ Not Available ]
(part of CI/CD)
[ Work in progress... ]
To run the tests, open a terminal and run the following (execute on project's root directory):
$ . ./scripts/helpers/environment.sh
$ make test
$ make test-coverage
Increment the version number:
$ poetry version {bump rule}
where valid bump rules are: patch, minor, major, prepatch, preminor, premajor, prerelease.
Use `CHANGELOG.md` to track the evolution of this package.
The `[UNRELEASED]` tag at the top of the file should always be there to log the work until a release occurs.
Work should be logged under one of the following subtitles:
On a release, a version heading of the following format should be added for all the current unreleased changes in the file.
## [major.minor.patch] - YYYY-MM-DD
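For illustration, a release heading in this format could be generated as sketched below. The helper name `release_heading` and the version `1.2.3` are made up for the example.

```python
from datetime import date


def release_heading(version, released=None):
    """Build a changelog heading of the form '## [major.minor.patch] - YYYY-MM-DD'."""
    released = released or date.today()
    return f"## [{version}] - {released.isoformat()}"


heading = release_heading("1.2.3", date(2021, 1, 31))
print(heading)  # prints "## [1.2.3] - 2021-01-31"
```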
On a terminal, run the following (execute on project's root directory):
$ . ./scripts/helpers/environment.sh
$ make build-package
$ make publish-package
On a terminal, run the following (execute on project's root directory):
$ . ./scripts/helpers/environment.sh
$ make build-docker
For production, a Docker image is used. This image is published publicly on Docker Hub.
$ docker pull nichelia/ekrhizoc:{version}
$ docker run --rm -it -v ~/ekrhizoc_bin:/tmp/bin nichelia/ekrhizoc:{version} {command}
where `{version}` is the published application version.