dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators
AGPL-3.0 License
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.14.0...0.15.0
Published by roniemartinez over 2 years ago
URL patterns now use shell-style wildcards instead of regular expressions. See: https://docs.python.org/3/library/fnmatch.html
Wildcards are easier to understand and simpler to use than regular expressions.
- @select(css=".title", url=r".*\.com")
+ @select(css=".title", url="*.com/*")
  def result_title(element):
      return {"title": element.text_content()}
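As a rough illustration of how shell-style wildcards behave, the sketch below calls the standard library's fnmatch module directly. It assumes dude's URL matching behaves like fnmatch.fnmatchcase, which these notes do not confirm; the helper name url_matches and the example URLs are hypothetical and not part of dude.

```python
from fnmatch import fnmatchcase


def url_matches(url: str, pattern: str) -> bool:
    """Return True if the whole URL matches the shell-style wildcard pattern."""
    # fnmatchcase is case-sensitive and behaves the same on every platform.
    return fnmatchcase(url, pattern)


print(url_matches("https://example.com/page", "*.com/*"))  # True
print(url_matches("https://example.org/page", "*.com/*"))  # False
print(url_matches("https://example.com", "*.com/*"))       # False
```

The pattern has to match the entire string, so "*.com/*" rejects a .com URL that has no slash after the domain, as the third call shows.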
Full Changelog: https://github.com/roniemartinez/dude/compare/0.13.0...0.14.0
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.12.2...0.13.0
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.12.1...0.12.2
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.12.0...0.12.1
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.11.0...0.12.0
You can now save data after each page is scraped. Decorate the save function with is_per_page=True and run the scraper with --save-per-page to use it.
@save("jsonl", is_per_page=True)
def save_jsonl(data, output) -> bool:
    global jsonl_file
    jsonl_file.writelines((json.dumps(item) + "\n" for item in data))
    return True
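The write pattern in that save function can be tried in isolation. The following is a minimal sketch that does not import dude: write_jsonl is a hypothetical helper, and an in-memory buffer stands in for the jsonl_file handle, which the snippet above assumes is already open.

```python
import io
import json
from typing import Any, Dict, Iterable, TextIO


def write_jsonl(items: Iterable[Dict[str, Any]], handle: TextIO) -> int:
    """Write each item as one JSON object per line; return the number written."""
    count = 0
    for item in items:
        handle.write(json.dumps(item) + "\n")
        count += 1
    return count


# Simulate two pages being saved one after the other into the same open handle.
buffer = io.StringIO()
write_jsonl([{"title": "first"}], buffer)
write_jsonl([{"title": "second"}, {"title": "third"}], buffer)
lines = buffer.getvalue().splitlines()
print(len(lines))  # 3
```

Because every record sits on its own line, each page can be appended as soon as it is scraped without rewriting what earlier pages produced.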
The shutdown event is called before the application terminates. This is useful for freeing resources such as file handles and database connections, or for other clean-up work, before the application exits.
@shutdown()
def zip_all():
    global SAVE_DIR
    shutil.make_archive("images-and-pdfs", "zip", SAVE_DIR)
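To see what the shutil.make_archive call in that handler produces, here is a small stand-alone sketch that runs outside dude. The helper archive_directory, the temporary directory, and the file name page.png are assumptions made for illustration only.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def archive_directory(source_dir: Path, archive_base: Path) -> Path:
    """Zip the contents of source_dir into '<archive_base>.zip' and return its path."""
    return Path(shutil.make_archive(str(archive_base), "zip", root_dir=str(source_dir)))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the SAVE_DIR that the scraper fills with screenshots and PDFs.
    save_dir = Path(tmp) / "temp"
    save_dir.mkdir()
    (save_dir / "page.png").write_bytes(b"not a real image")

    archive_path = archive_directory(save_dir, Path(tmp) / "images-and-pdfs")
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()

print(archive_path.name)  # images-and-pdfs.zip
```

make_archive appends the ".zip" extension itself, which is why the base name is passed without one.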
Published by roniemartinez over 2 years ago
dude scrape ... --follow-urls
or
if __name__ == "__main__":
    import dude
    dude.run(..., follow_urls=True)
More details at https://roniemartinez.github.io/dude/advanced/14_events.html
import uuid
from pathlib import Path

from dude import post_setup, pre_setup, startup

SAVE_DIR: Path


@startup()
def initialize_csv():
    """
    Connection to databases or API and other use-cases can be done here before the web scraping process is started.
    """
    global SAVE_DIR
    SAVE_DIR = Path(__file__).resolve().parent / "temp"
    SAVE_DIR.mkdir(exist_ok=True)


@pre_setup()
def screenshot(page):
    """
    Perform actions here after loading a page (or after a successful HTTP response) and before modifying things in the
    setup stage.
    """
    unique_name = str(uuid.uuid4())
    page.screenshot(path=SAVE_DIR / f"{unique_name}.png")  # noqa


@post_setup()
def print_pdf(page):
    """
    Perform actions here after running the setup stage.
    """
    unique_name = str(uuid.uuid4())
    page.pdf(path=SAVE_DIR / f"{unique_name}.pdf")  # noqa


if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh"])
Full Changelog: https://github.com/roniemartinez/dude/compare/0.10.1...0.11.0
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.10.0...0.10.1
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.9.2...0.10.0
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.9.1...0.9.2
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.9.0...0.9.1
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.8.0...0.9.0
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.7.1...0.8.0
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.7.0...0.7.1
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.6.1...0.7.0
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.6.0...0.6.1
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.5.1...0.6.0
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.5.0...0.5.1
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.4.2...0.5.0
Published by roniemartinez over 2 years ago
Full Changelog: https://github.com/roniemartinez/dude/compare/0.4.1...0.4.2