An open source webapp for scraping: towards a public service for webscraping
MIT License
part 1/3 of the TADATA! software suite (ApiViz / Solidata_backend / Solidata_frontend / OpenScraper)
Scraping can quickly become a mess, especially if you need to scrape several websites in order to eventually get one structured dataset. Usually you have to set up a scraper for every website, configure the spiders one by one, get the data from each website, and then clean up the mess to turn that raw material into the one structured dataset you know is in there...
So you have mainly three options when it comes to scraping the web...
Let's say you are a researcher, a journalist, a public servant in an administration, or a member of an association who wants to survey some evolution in society... Let's say you need data that isn't easy to get, and you can't afford to spend thousands of euros on a private webscraping service.
You'd have a list of different websites you want to scrape similar information from, each website having some URLs where those data are listed (in our first use case, social innovation projects). You know every piece of information could be described the same way : a title, an abstract, an image, a list of tags, a URL, and the name and URL of the source website, and so on...
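For instance, a single scraped item could look like this (a purely hypothetical example, with field names simply mirroring the description above) :

item = {
    "title": "A social innovation project",
    "abstract": "A short description of the project...",
    "image": "https://example.org/static/project.jpg",
    "tags": ["social innovation", "education"],
    "url": "https://example.org/projects/42",
    "source_name": "example.org",
    "source_url": "https://example.org",
}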
To make that job a bit easier (and far cheaper), OpenScraper aims to provide an online GUI (a webapp on the client side). So to use OpenScraper you just have to : set the field names (the data structure you expect), enter a list of websites to scrape, set up for each website the XPath to scrape for each field (see the sketch below), and finally click a button to run the scraper configured for each website...
... and tadaaaa, you'll have your data : you will be able to import it, share it, and visualize it (at least we're working on it as quickly as we can)...
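Concretely, the configuration you enter in the GUI for one website boils down to a mapping from your field names to XPath expressions. Here is a rough sketch of the idea (hypothetical keys and XPaths, for illustration only, not OpenScraper's actual internal format) :

# sketch of a per-website scraping configuration:
# each field of your datamodel is paired with the XPath that extracts it
spider_config = {
    "start_urls": ["https://example.org/projects"],    # pages listing the items
    "item_xpath": "//div[@class='project']",           # one node per item
    "fields": {
        "title": ".//h2/text()",
        "abstract": ".//p[@class='summary']/text()",
        "image": ".//img/@src",
        "url": ".//a/@href",
    },
}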
OpenScraper is developed as open source, and will provide documentation as well as a legal framework (license and terms of use) aiming to make the core system of OpenScraper comply with the GDPR, in letter and in spirit.
clone or download the repo
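for instance with git (the GitHub URL below is an assumption, adapt it if the repo has moved) :
$ git clone https://github.com/entrepreneur-interet-general/OpenScraper.git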
install MongoDB locally or get the URI of the MongoDB you're using
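a local MongoDB with default settings is reachable at mongodb://localhost:27017 ; you can start it with (assuming mongod is on your PATH and its default data directory exists) :
$ mongod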
install chromedriver
on macOS (with Homebrew) :
$ brew tap caskroom/cask
$ brew cask install chromedriver
on Debian/Ubuntu :
$ sudo apt-get install chromium-chromedriver
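you can check that chromedriver is correctly installed with :
$ chromedriver --version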
go to your openscraper folder
create a virtual environment for Python 2.7
$ virtualenv venv
$ source venv/bin/activate
install the libraries
$ pip install -r requirements.txt
optional : notes for installing Python libs on Linux servers
$ sudo apt-get install build-essential libssl-dev libffi-dev python-dev python-psycopg2 python-mysqldb python-setuptools libgnutls-dev libcurl4-gnutls-dev
$ sudo apt install libcurl4-openssl-dev libssl-dev
$ sudo apt-get install python-pip
$ sudo pip install --upgrade pip
$ sudo pip install --upgrade virtualenv
$ sudo pip install --upgrade setuptools
optional : create a config/settings_secret.py
file based on config/settings_example.py
with your MongoDB URI (if you're not using the default MongoDB connection) :
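a minimal sketch of such a file (the variable name MONGODB_URI is an assumption : mirror whatever name config/settings_example.py actually uses) :

# config/settings_secret.py
# NOTE: hypothetical variable name, copy the real one from config/settings_example.py
MONGODB_URI = "mongodb://user:password@your-mongo-host:27017/openscraper"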
run app
$ cd openscraper
$ python main.py
you can also pass options when running main.py :
-p or --port : the port number (default : 8000)
-m or --mode : the mode (default : default) - choices : default (uses settings_example.py in the openscraper/config folder) | production (uses settings_secret.py in the ~/config folder)
example :
$ python main.py -p 8100 --mode=production
check in your browser at localhost:8000 (or whichever port you entered)
create/update your datamodel at localhost:8000/datamodel/edit
create/update your spiders at localhost:8000/contributors
run the test spider by clicking on it at localhost:8000/contributors