dude

dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators

AGPL-3.0 License

Downloads: 272
Stars: 413
Committers: 3
dude - 🔨 Run adblock on HTTPX request event hook

Published by roniemartinez over 2 years ago

What's Changed

New Contributors

Full Changelog: https://github.com/roniemartinez/dude/compare/0.14.0...0.15.0

dude - ✨ Use fnmatch

Published by roniemartinez over 2 years ago

What's Changed

Other

fnmatch: URL pattern matcher now uses Unix style wildcards (fnmatch) instead of regex

See: https://docs.python.org/3/library/fnmatch.html

Wildcards are easier to understand and simpler to use than regular expressions.

- @select(css=".title", url=r".*\.com")
+ @select(css=".title", url="*.com/*")
def result_title(element):
    return {"title": element.text_content()}
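To make the difference concrete, here is a small sketch of how Unix-style wildcards behave on URLs, using only the standard library. The URLs are made up for illustration, and `fnmatchcase` is used here so the result is the same on every platform; whether dude calls `fnmatch` or `fnmatchcase` internally is not stated in these notes.

```python
import re
from fnmatch import fnmatchcase, translate

# Hypothetical URLs, for illustration only.
urls = [
    "https://example.com/articles/1",
    "https://example.com",
    "https://example.org/articles/1",
]

pattern = "*.com/*"

# "*" matches any run of characters, including "/" and ".".
matches = [url for url in urls if fnmatchcase(url, pattern)]

# translate() returns the regular expression equivalent to the wildcard pattern,
# so matching with it gives the same result as fnmatchcase().
regex = re.compile(translate(pattern))
regex_matches = [url for url in urls if regex.match(url)]

print(matches)
print(regex_matches)
```

Note that `https://example.com` is not matched, because the pattern requires a `/` after `.com`. A pattern such as `*.com*` would be needed to cover both forms.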

Full Changelog: https://github.com/roniemartinez/dude/compare/0.13.0...0.14.0

dude - ✨ Make return value of decorated functions optional

Published by roniemartinez over 2 years ago

What's Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.12.2...0.13.0

dude - 🐛 Fix PlaywrightScraper overwriting output file

Published by roniemartinez over 2 years ago

What's Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.12.1...0.12.2

dude - 🔨 Refactor for Alpha

Published by roniemartinez over 2 years ago

What's Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.12.0...0.12.1

dude - ✨ Add shutdown event and save per page option

Published by roniemartinez over 2 years ago

What's Changed

Other

Full Changelog: https://github.com/roniemartinez/dude/compare/0.11.0...0.12.0

✨ Save data on each page

You can now save data after each page is scraped. Decorate the save function with is_per_page=True and run the scraper with --save-per-page to enable it.

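import json
from dude import save

@save("jsonl", is_per_page=True)
def save_jsonl(data, output) -> bool:
    # jsonl_file is assumed to be a file handle opened elsewhere, e.g. in a @startup() handler
    global jsonl_file
    jsonl_file.writelines((json.dumps(item) + "\n" for item in data))
    return True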
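The per-page pattern amounts to appending each page's batch of items to one handle that stays open, rather than collecting everything and writing once at the end. A minimal sketch of that idea without dude itself, where `write_page` is a hypothetical helper and an in-memory buffer stands in for the output file:

```python
import io
import json


def write_page(handle, data) -> bool:
    """Append one page of scraped items to an open JSON Lines handle."""
    handle.writelines(json.dumps(item) + "\n" for item in data)
    return True


# Simulate two pages being saved one after the other.
buffer = io.StringIO()  # stand-in for a real file opened once at startup
write_page(buffer, [{"title": "a"}])
write_page(buffer, [{"title": "b"}, {"title": "c"}])

lines = buffer.getvalue().splitlines()
print(len(lines))
```

Because every item ends up on its own line, data from pages that were already scraped is kept even if a later page fails.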

✨ Shutdown event

The shutdown event is called before the application terminates. This is useful for freeing resources such as file handles and database connections, or for any other cleanup before exit.

import shutil
from dude import shutdown

@shutdown()
def zip_all():
    global SAVE_DIR  # assumed to be set earlier, e.g. in a @startup() handler
    shutil.make_archive("images-and-pdfs", "zip", SAVE_DIR)
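The archiving step can be tried on its own. The sketch below uses a temporary directory in place of the scraper's save directory; `zip_directory` and the sample file are hypothetical names introduced only for this example.

```python
import shutil
import tempfile
from pathlib import Path


def zip_directory(source_dir: Path, archive_base: Path) -> Path:
    """Zip everything under source_dir and return the path of the archive."""
    # make_archive() appends the ".zip" suffix to the base name itself.
    return Path(shutil.make_archive(str(archive_base), "zip", root_dir=str(source_dir)))


with tempfile.TemporaryDirectory() as tmp:
    save_dir = Path(tmp) / "temp"
    save_dir.mkdir()
    (save_dir / "example.txt").write_text("scraped output")

    # Keep the archive outside save_dir so it does not try to include itself.
    archive = zip_directory(save_dir, Path(tmp) / "images-and-pdfs")
    archive_name = archive.name
    archive_exists = archive.exists()

print(archive_name, archive_exists)
```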

✨ How dude runs internally

[Diagram: events]

dude - ✨ Events and Basic Spider

Published by roniemartinez over 2 years ago

What's Changed

Features

Documentation

Fixes

Other

✨ Basic Spider

Example

dude scrape ... --follow-urls

or

if __name__ == "__main__":
    import dude

    dude.run(..., follow_urls=True)

✨ Events

More details at https://roniemartinez.github.io/dude/advanced/14_events.html

Example

import uuid
from pathlib import Path

from dude import post_setup, pre_setup, startup

SAVE_DIR: Path


@startup()
def initialize_csv():
    """
    Connection to databases or API and other use-cases can be done here before the web scraping process is started.
    """
    global SAVE_DIR
    SAVE_DIR = Path(__file__).resolve().parent / "temp"
    SAVE_DIR.mkdir(exist_ok=True)


@pre_setup()
def screenshot(page):
    """
    Perform actions here after loading a page (or after a successful HTTP response) and before modifying things in the
    setup stage.
    """
    unique_name = str(uuid.uuid4())
    page.screenshot(path=SAVE_DIR / f"{unique_name}.png")  # noqa


@post_setup()
def print_pdf(page):
    """
    Perform actions here after running the setup stage.
    """
    unique_name = str(uuid.uuid4())
    page.pdf(path=SAVE_DIR / f"{unique_name}.pdf")  # noqa


if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh"])
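The order in which these hooks fire can be sketched with a small decorator registry. This is not dude's implementation, only an illustration of the ordering described in the docstrings above and in the shutdown notes; `on`, `fire`, `run`, and `HANDLERS` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical registry: event name -> handlers, in registration order.
HANDLERS = defaultdict(list)
calls = []


def on(event):
    """Return a decorator that registers a function under the given event name."""
    def decorator(func):
        HANDLERS[event].append(func)
        return func
    return decorator


def fire(event, *args):
    """Call every handler registered for the event."""
    for handler in HANDLERS[event]:
        handler(*args)


@on("startup")
def open_resources():
    calls.append("startup")


@on("pre_setup")
def before_setup(page):
    calls.append(f"pre_setup:{page}")


@on("post_setup")
def after_setup(page):
    calls.append(f"post_setup:{page}")


@on("shutdown")
def close_resources():
    calls.append("shutdown")


def run(pages):
    fire("startup")              # once, before scraping starts
    for page in pages:
        fire("pre_setup", page)  # after the page loads, before the setup stage
        # the setup stage would run here
        fire("post_setup", page) # after the setup stage
    fire("shutdown")             # once, before the application terminates


run(["page-1", "page-2"])
print(calls)
```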

[Diagram showing when events are executed]

Full Changelog: https://github.com/roniemartinez/dude/compare/0.10.1...0.11.0

dude - 🏁 Fix Windows support

Published by roniemartinez over 2 years ago

What's Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.10.0...0.10.1

dude - ✨ Block ads

Published by roniemartinez over 2 years ago

What's Changed

Added

Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.9.2...0.10.0

dude - 🔧 Disable notifications

Published by roniemartinez over 2 years ago

What's Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.9.1...0.9.2

dude - 📚 Add migration examples

Published by roniemartinez over 2 years ago

What's Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.9.0...0.9.1

dude - ✨ Add option to use Selenium

Published by roniemartinez over 2 years ago

What's Changed

Added

Fixed

Docs

Full Changelog: https://github.com/roniemartinez/dude/compare/0.8.0...0.9.0

dude - ✨ Add option to use Pyppeteer

Published by roniemartinez over 2 years ago

What's Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.7.1...0.8.0

dude - 📚 Add parser support table

Published by roniemartinez over 2 years ago

What's Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.7.0...0.7.1

dude - ✨ Add Text and Regex selectors for lxml

Published by roniemartinez over 2 years ago

What's Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.6.1...0.7.0

dude - 🐛 Fix lxml documentation

Published by roniemartinez over 2 years ago

What's Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.6.0...0.6.1

dude - ✨ lxml Implementation

Published by roniemartinez over 2 years ago

What's Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.5.1...0.6.0

dude - 📚 Update README and documentation for BS4 and Parsel support

Published by roniemartinez over 2 years ago

What's Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.5.0...0.5.1

dude - ✨ Option to use Parsel for scraping

Published by roniemartinez over 2 years ago

What's Changed

Added

Removed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.4.2...0.5.0

dude - 📚 Update documentation and examples

Published by roniemartinez over 2 years ago

What's Changed

Full Changelog: https://github.com/roniemartinez/dude/compare/0.4.1...0.4.2