Dumb web to disk tool; html, markdown / md / text, epub
AGPL-3.0 License
Python 3 and 2.7
python3 -m venv py3venv # optional...
# TODO better way than directly from command line list
python -m pip install --upgrade markdownify readability-lxml git+https://github.com/clach04/pypub.git git+https://github.com/clach04/w2d.git # Python 2 or 3 - without trafilatura
python -m pip install --upgrade markdownify readability-lxml trafilatura git+https://github.com/clach04/pypub.git git+https://github.com/clach04/w2d.git # Python 3 only
python -m pip install -e git+https://github.com/clach04/w2d.git#egg=w2d
w2d
w2d https://en.wikipedia.org/wiki/EPUB
w2d local_file.html
TODO: document Debian packages that can be installed
git clone https://github.com/clach04/w2d.git
cd w2d
python3 -m venv py3venv
. py3venv/bin/activate
python -m pip install -r requirements.txt
python setup.py develop # optional to have w2d binary
python -m w2d
python -m w2d https://en.wikipedia.org/wiki/EPUB
python -m w2d local_file.html
# if setup.py ran in install or develop mode
w2d
w2d https://en.wikipedia.org/wiki/EPUB
w2d local_file.html
set W2D_OUTPUT_FORMAT=epub
export W2D_OUTPUT_FORMAT=epub
python -m w2d https://en.wikipedia.org/wiki/EPUB
Then read with a standards-compliant EPUB reader, e.g. https://addons.mozilla.org/en-US/firefox/addon/epubreader/
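The `set` and `export` lines above differ between Windows and Unix shells. As a minimal sketch (not part of w2d itself), the same run can be launched from Python with the variable passed through the child process environment. `build_invocation` and `run_w2d` are hypothetical helper names; only `W2D_OUTPUT_FORMAT` and `python -m w2d` come from this README.

```python
import os
import subprocess
import sys


def build_invocation(url, output_format="epub", base_env=None):
    """Return (command, environment) for running w2d on one URL or file.

    Hypothetical helper: only W2D_OUTPUT_FORMAT and `python -m w2d`
    are taken from the README.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["W2D_OUTPUT_FORMAT"] = output_format
    # sys.executable keeps the call inside the active virtual environment
    command = [sys.executable, "-m", "w2d", url]
    return command, env


def run_w2d(url, output_format="epub"):
    """Run w2d as a child process and return its exit code."""
    command, env = build_invocation(url, output_format)
    return subprocess.call(command, env=env)


# Build (but do not run) the invocation, to show what would be executed
command, env = build_invocation("https://en.wikipedia.org/wiki/EPUB")
print(command[1:], env["W2D_OUTPUT_FORMAT"])
```

Calling `run_w2d("local_file.html", "md")` would then behave like the shell commands above, assuming w2d is installed in the active environment.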
html
export W2D_EXTRACTOR=postlight
export MP_URL=http://localhost:3000/parser
export MP_URL=http://username:password@localhost:3000/parser
export W2D_OUTPUT_FORMAT=html
env W2D_OUTPUT_FORMAT=html python -m w2d https://en.wikipedia.org/wiki/EPUB
python -m w2d https://en.wikipedia.org/wiki/EPUB
Alternative config
cat .env
W2D_EXTRACTOR=postlight
MP_URL=http://localhost:3000/parser
export W2D_OUTPUT_FORMAT=md
python -m w2d https://en.wikipedia.org/wiki/EPUB
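The `.env` file shown above holds plain `KEY=VALUE` lines. How w2d itself loads this file is not described here; the following is only a sketch, assuming simple unquoted values, of reading such text into a dictionary. `parse_dotenv` is a hypothetical name.

```python
def parse_dotenv(text):
    """Parse simple KEY=VALUE lines; skip blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)  # split once: values may contain '='
        settings[key.strip()] = value.strip()
    return settings


# Same content as the .env example in this README
example = """\
W2D_EXTRACTOR=postlight
MP_URL=http://localhost:3000/parser
"""
print(parse_dotenv(example))
```

Quoting, `export` prefixes, and variable expansion are deliberately left out of this sketch.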
env W2D_OUTPUT_FORMAT=html W2D_EXTRACTOR=raw python -m w2d http://localhost:8000/one.html
env W2D_OUTPUT_FORMAT=md W2D_EXTRACTOR=raw python -m w2d http://localhost:8000/one.html # needs either Pandoc binary in path or markdownify library available
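The comment on the last command says md output needs either a Pandoc binary in the path or the markdownify library. A minimal sketch of checking for both before running, using only the standard library. `markdown_backends` is a hypothetical helper name, and the names `pandoc` (executable) and `markdownify` (importable module) are assumptions based on the text above.

```python
import importlib.util
import shutil


def markdown_backends():
    """Report which of the two md prerequisites can be found."""
    return {
        # Assumed executable name; looked up on PATH
        "pandoc": shutil.which("pandoc") is not None,
        # Assumed import name; find_spec returns None if not installed
        "markdownify": importlib.util.find_spec("markdownify") is not None,
    }


found = markdown_backends()
if not any(found.values()):
    print("Neither pandoc nor markdownify found; md output may fail")
else:
    print("Available:", [name for name, ok in found.items() if ok])
```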
W2D_OUTPUT_FORMAT (may be set to html, md, epub, and all)
W2D_EPUB_TOOL (may be set to pypub or pandoc - NOTE needs pandoc exe in path)
W2D_INTERMEDIATE_FORMAT (may be set to html or md)
W2D_EXTRACTOR (may be set to readability, postlight, postlight_exe, or raw - if postlight is used also see/set MP_URL)
W2D_CACHE_DIR, if not set defaults to scrape_cache in current directory
#id_marker) will cause new cache entry to be pulled down
This project builds on a number of other tools to perform the heavy lifting:
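For the environment variables listed above, a minimal sketch of checking values before launching w2d. The allowed values are copied from this README; `check_settings` and `cache_dir` are hypothetical helpers, and how w2d itself reacts to an unknown value or a missing `MP_URL` is not documented here.

```python
import os

# Allowed values as listed in this README
ALLOWED = {
    "W2D_OUTPUT_FORMAT": {"html", "md", "epub", "all"},
    "W2D_EPUB_TOOL": {"pypub", "pandoc"},
    "W2D_INTERMEDIATE_FORMAT": {"html", "md"},
    "W2D_EXTRACTOR": {"readability", "postlight", "postlight_exe", "raw"},
}


def check_settings(environ=None):
    """Return a list of human-readable problems found in the settings."""
    environ = os.environ if environ is None else environ
    problems = []
    for name, allowed in sorted(ALLOWED.items()):
        value = environ.get(name)
        if value is not None and value not in allowed:
            problems.append("%s=%r is not one of %s" % (name, value, sorted(allowed)))
    # The README says to also see/set MP_URL when postlight is used
    if environ.get("W2D_EXTRACTOR") == "postlight" and not environ.get("MP_URL"):
        problems.append("W2D_EXTRACTOR=postlight but MP_URL is not set")
    return problems


def cache_dir(environ=None):
    """W2D_CACHE_DIR, falling back to the documented default."""
    environ = os.environ if environ is None else environ
    return environ.get("W2D_CACHE_DIR", "scrape_cache")


for problem in check_settings({"W2D_OUTPUT_FORMAT": "pdf", "W2D_EXTRACTOR": "postlight"}):
    print(problem)
print(cache_dir({}))
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to try without touching the real environment.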
Windows 10 - Python 3.10
(py310venv) C:\code\py\w2d>pip list
Package Version
---------------- ---------
beautifulsoup4 4.9.3
certifi 2023.7.22
chardet 3.0.4
courlan 0.5.0
cssselect 1.2.0
dateparser 1.1.8
htmldate 0.8.1
idna 2.8
Jinja2 2.11.3
jusText 3.0.0
langcodes 3.3.0
lxml 4.9.3
markdownify 0.11.6
MarkupSafe 1.1.1
pip 22.0.4
pypub 1.5
python-dateutil 2.8.2
pytz 2023.3
readability 0.3.1
readability-lxml 0.8.1
regex 2023.6.3
requests 2.22.0
setuptools 58.1.0
six 1.16.0
soupsieve 2.4.1
tld 0.13
trafilatura 0.8.2
tzdata 2023.3
tzlocal 5.0.1
urllib3 1.25.11
Windows 10 - Python 2.7.18
(py210venv) C:\code\py\w2d>pip list
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
Package Version
----------------------------- ---------
backports.functools-lru-cache 1.6.6
beautifulsoup4 4.9.3
certifi 2021.10.8
chardet 3.0.4
idna 2.8
Jinja2 2.11.3
lxml 4.9.3
markdownify 0.11.6
MarkupSafe 1.1.1
pip 20.3.4
pypub 1.6
readability 0.3.1
requests 2.22.0
setuptools 44.1.1
six 1.16.0
soupsieve 1.9.6
urllib3 1.25.11
wheel 0.37.1
Linux Ubuntu 18.04.6 LTS (Bionic Beaver) - Python 3.6.9
Without trafilatura:
(py3venv) clach04@fugly:/tmp$ pip list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
beautifulsoup4 (4.12.2)
certifi (2023.7.22)
chardet (5.0.0)
charset-normalizer (2.0.12)
cssselect (1.1.0)
idna (3.4)
Jinja2 (3.0.3)
lxml (4.9.3)
markdownify (0.11.6)
MarkupSafe (2.0.1)
pip (9.0.1)
pkg-resources (0.0.0)
pypub (1.6)
readability-lxml (0.8.1)
requests (2.27.1)
setuptools (39.0.1)
six (1.16.0)
soupsieve (2.3.2.post1)
urllib3 (1.26.16)
w2d (0.0.1)
With trafilatura:
(py3venv) :~/w2d$ pip list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
backports-datetime-fromisoformat (2.0.0)
backports.zoneinfo (0.2.1)
beautifulsoup4 (4.12.2)
certifi (2023.7.22)
chardet (5.0.0)
charset-normalizer (3.0.1)
courlan (0.9.3)
cssselect (1.1.0)
dateparser (1.1.3)
htmldate (1.4.3)
idna (3.4)
importlib-resources (5.4.0)
Jinja2 (3.0.3)
jusText (3.0.0)
langcodes (3.3.0)
lxml (4.9.3)
markdownify (0.11.6)
MarkupSafe (2.0.1)
pip (9.0.1)
pkg-resources (0.0.0)
pypub (1.6)
python-dateutil (2.8.2)
pytz (2023.3)
pytz-deprecation-shim (0.1.0.post0)
readability-lxml (0.8.1)
regex (2022.3.2)
requests (2.27.1)
setuptools (39.0.1)
six (1.16.0)
soupsieve (2.3.2.post1)
tld (0.12.6)
trafilatura (1.6.1)
tzdata (2023.3)
tzlocal (4.2)
urllib3 (1.26.16)
zipp (3.6.0)