dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators
AGPL-3.0 License
Dude is a very simple framework for writing web scrapers using Python decorators. Its design, inspired by Flask, aims to let you build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.
🚨 Dude is currently in Pre-Alpha. Please expect breaking changes.
To install, run the following commands from a terminal.
pip install pydude
playwright install # Install playwright binaries for Chrome, Firefox and Webkit.
The simplest web scraper will look like this:
from dude import select


@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}
The example above gets all the hyperlink elements in a page and calls the handler function get_link() once for each element.
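To make the "register a handler, then call it for each element" idea concrete, here is a minimal standard-library sketch of the decorator pattern. It is not Dude's actual implementation: the `HANDLERS` registry, the `FakeElement` class, the `run()` function and this local `select()` are all hypothetical stand-ins, and the selector matching is reduced to a bare tag-name comparison.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical registry: maps a CSS selector to the handlers registered for it.
HANDLERS: Dict[str, List[Callable]] = {}


def select(css: str) -> Callable:
    """Register the decorated function as a handler for `css` (illustrative only)."""
    def decorator(func: Callable) -> Callable:
        HANDLERS.setdefault(css, []).append(func)
        return func  # the function itself is returned unchanged
    return decorator


class FakeElement:
    """Stand-in for a page element; it only supports get_attribute()."""

    def __init__(self, tag: str, attrs: Dict[str, str]) -> None:
        self.tag = tag
        self.attrs = attrs

    def get_attribute(self, name: str) -> Optional[str]:
        return self.attrs.get(name)


@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}


def run(elements: List[FakeElement]) -> List[dict]:
    """Call every registered handler once per matching element."""
    results = []
    for css, handlers in HANDLERS.items():
        # Toy matching: treat the selector as a bare tag name.
        matched = [el for el in elements if el.tag == css]
        for element in matched:
            for handler in handlers:
                results.append(handler(element))
    return results


page = [
    FakeElement("a", {"href": "/url-1.html"}),
    FakeElement("p", {}),
    FakeElement("a", {"href": "/url-2.html"}),
]
links = run(page)
print(links)
```

The decorator only records the function; the work of finding elements and calling handlers happens later in `run()`, which is the same separation Flask uses between declaring a route and serving a request.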
You can run your scraper from a terminal by passing the URLs, an output filename of your choice, and the paths to your Python scripts to the dude scrape command.
dude scrape --url "<url>" --output data.json path/to/script.py
The output in data.json should contain the extracted URL along with metadata fields, whose names are prefixed with an underscore.
[
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 0,
    "url": "/url-1.html"
  },
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 1,
    "url": "/url-2.html"
  },
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 2,
    "url": "/url-3.html"
  }
]
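Because the metadata keys all start with an underscore, they are easy to separate from the scraped fields when post-processing the output. The sketch below shows one way to do that with the standard library; `split_record()` is a hypothetical helper, and the records are inlined from the sample above rather than read from data.json so that the snippet stands alone.

```python
import json
from urllib.parse import urljoin

# Two records copied from the sample output above.
raw = """
[
  {"_page_number": 1, "_page_url": "https://dude.ron.sh/", "_group_id": 4502003824,
   "_group_index": 0, "_element_index": 0, "url": "/url-1.html"},
  {"_page_number": 1, "_page_url": "https://dude.ron.sh/", "_group_id": 4502003824,
   "_group_index": 0, "_element_index": 1, "url": "/url-2.html"}
]
"""


def split_record(record: dict) -> tuple:
    """Return (metadata, data): underscore-prefixed keys vs. everything else."""
    metadata = {k: v for k, v in record.items() if k.startswith("_")}
    data = {k: v for k, v in record.items() if not k.startswith("_")}
    return metadata, data


records = json.loads(raw)

absolute_urls = []
for record in records:
    metadata, data = split_record(record)
    # Resolve the relative link against the page it was scraped from.
    absolute_urls.append(urljoin(metadata["_page_url"], data["url"]))

print(absolute_urls)
```

Keeping `_page_url` next to each record is what makes it possible to resolve relative links like `/url-1.html` after the fact.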
Changing the output to --output data.csv should produce the same data as CSV content.
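As a rough illustration of how records like the ones in data.json map onto CSV rows, the sketch below uses Python's csv module. The `to_csv()` helper is hypothetical, and the column order (taken from the keys of the first record) is an assumption that may not match what Dude itself writes.

```python
import csv
import io

# Two records with the same shape as the JSON output shown earlier.
records = [
    {"_page_number": 1, "_page_url": "https://dude.ron.sh/", "_group_id": 4502003824,
     "_group_index": 0, "_element_index": 0, "url": "/url-1.html"},
    {"_page_number": 1, "_page_url": "https://dude.ron.sh/", "_group_id": 4502003824,
     "_group_index": 0, "_element_index": 1, "url": "/url-2.html"},
]


def to_csv(rows: list) -> str:
    """Serialize a list of flat dicts to CSV text, one row per dict."""
    buffer = io.StringIO()
    # Assumption: columns follow the key order of the first record.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(records)
print(csv_text)
```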
By default, Dude uses Playwright, but you can also use a parser backend that you are already familiar with: BeautifulSoup4, Parsel, lxml, or Selenium. Install the matching extra for the backend you want.
pip install pydude[bs4]
pip install pydude[parsel]
pip install pydude[lxml]
pip install pydude[selenium]
Here is the summary of features supported by each parser backend.
Pull the Docker image using the following command.
docker pull roniemartinez/dude
Assuming that script.py exists in the current directory, run Dude using the following command.
docker run -it --rm -v "$PWD":/code roniemartinez/dude dude scrape --url <url> script.py
Read the complete documentation at https://roniemartinez.github.io/dude/. All the advanced and useful features are documented there.
Adding "uncomplicated" (like ufw) into the name says it is a very simple framework.
Thanks goes to these wonderful people (emoji key):
This project follows the all-contributors specification. Contributions of any kind welcome!