Web Scraping
Getting Started
# get the code
git clone https://github.com/sottom/scraping_tutorial.git
cd scraping_tutorial
# create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
# install dependencies
pip install -r requirements.txt
# run any file you like
python {any_file}.py
Websites for Scraping Tutorial (Tech Talk):
- https://www.ratemyprofessors.com/
- https://www.premierleague.com/players
- https://www.zacks.com/stock/research/AAPL/earnings-announcements
- https://free-proxy-list.net/
- https://www.timeanddate.com/holidays/us/
- https://codepen.io/gaearon/pen/oWWQNa
Important Points
Be Respectful
- Look at and follow the /robots.txt file
- Don't make too many requests
- Don't publish data that isn't yours (be careful about this)
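The first two points above can be sketched with the standard library alone. This is a minimal sketch, not a complete politeness policy: the `USER_AGENT` name, the `Throttle` helper, and the sample robots.txt text are all made up for illustration.

```python
import time
from urllib import robotparser

# Hypothetical bot name, used only for this example.
USER_AGENT = "scraping-tutorial-bot"

# Made-up robots.txt content so the example runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed(rules, url):
    """Return True if USER_AGENT may fetch the given URL."""
    return rules.can_fetch(USER_AGENT, url)


class Throttle:
    """Keep at least min_interval_seconds between consecutive requests."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

In a real scraper you would download the site's own /robots.txt (for example with `RobotFileParser.set_url()` and `read()`), call `allowed()` before each request, and call `Throttle.wait()` between requests.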
Can websites figure out you're scraping them?
How do I make my scraper more humanlike?
- Change User-Agents
- Change IP addresses
- Don't follow the same pattern every time you scrape
  - Scraping intervals
  - Scraping click paths
  - Click timing
  - Click position is hard, because a click has no screenX or screenY
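Rotating User-Agents and varying the scraping interval can be sketched as below. The User-Agent strings are illustrative examples only, and `random_headers` and `jittered_delay` are hypothetical helper names, not part of this repo.

```python
import random

# Illustrative User-Agent strings; swap in whichever ones you want to rotate.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers(rng=random):
    """Pick a different User-Agent header for each request."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


def jittered_delay(base_seconds, jitter_seconds, rng=random):
    """Return a delay between base and base + jitter, so intervals vary."""
    return base_seconds + rng.uniform(0, jitter_seconds)
```

The headers dict can be passed as `requests.get(url, headers=random_headers())`, and `time.sleep(jittered_delay(2, 3))` between requests avoids a fixed rhythm. Changing IP addresses would additionally need proxies, which this sketch leaves out.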
When you run into issues, start Googling
Why Scrape?
- Really come to understand how the web works
- Get data that isn't available from APIs
- Interview prep
What to do before scraping
- check for APIs
  - 1_professor.py
  - 2_sports.py
    - unfortunately, this site has started blocking unauthorized calls.
- check for data in the global scope
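Checking the global scope often means looking for JSON that the page assigns to a `window` variable inside a script tag. Here is a minimal sketch of pulling it out, assuming the value is strict JSON; the variable name `__DATA__`, the sample HTML, and the helper name are invented for illustration.

```python
import json
import re

# Made-up page snippet; real sites use their own variable names.
SAMPLE_HTML = """
<html><body>
<script>window.__DATA__ = {"players": [{"name": "A. Example", "goals": 3}]};</script>
</body></html>
"""


def extract_global_json(html, variable_name):
    """Return the JSON value assigned to window.<variable_name>, or None."""
    match = re.search(r"window\." + re.escape(variable_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON document starting at the given index and
    # ignores whatever follows it (the trailing ';' and '</script>').
    # It raises json.JSONDecodeError if the value is not strict JSON,
    # e.g. a JavaScript object literal with unquoted keys.
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value
```

If this works for a site, there is no need to parse the rendered HTML at all.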
Requests & BeautifulSoup
When to use it
- when the data you want is already in the HTML the server returns on page load
Example usage
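A minimal sketch of the pattern, assuming the data sits in an HTML table. The sample HTML, the User-Agent value, and the function names are made up for illustration and are not taken from the scripts in this repo.

```python
import requests
from bs4 import BeautifulSoup

# Made-up HTML so the parsing step can be tried without a network call.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Name</th></tr>
  <tr><td>Jan 1</td><td>New Year's Day</td></tr>
</table>
"""


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(
        url,
        headers={"User-Agent": "scraping-tutorial-example"},  # hypothetical value
        timeout=timeout,
    )
    response.raise_for_status()
    return response.text


def parse_rows(html):
    """Return every table row as a list of cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows
```

Typical use is `parse_rows(fetch_html(url))`; try `parse_rows(SAMPLE_HTML)` first to see the output shape.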
Notes
- Regex or another parser works too; the choice of tool doesn't matter much
Headless Browser
When to use it
- when the page content is rendered by JavaScript (e.g., a React app); otherwise the HTML you fetch won't contain what you expect
- when you need to log in
- for web automation
Example usage
- 6_holidays2.py
- learningsuite (not included)
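A rough sketch of the headless-browser approach with Selenium, assuming selenium and Chrome are installed. It is not necessarily what 6_holidays2.py does; the function names and the fixed `settle_seconds` wait are assumptions made for illustration.

```python
import time


def make_headless_chrome():
    """Build a headless Chrome driver (requires selenium and Chrome)."""
    # Imported here so the rest of the sketch loads without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def rendered_html(driver, url, settle_seconds=2.0):
    """Load a page in the browser and return the HTML after JavaScript runs."""
    driver.get(url)
    # Crude fixed wait for client-side rendering; an explicit wait on a
    # specific element is more reliable on real pages.
    time.sleep(settle_seconds)
    return driver.page_source


# Example use (opens a real browser, so it is left commented out):
# driver = make_headless_chrome()
# try:
#     html = rendered_html(driver, "https://www.timeanddate.com/holidays/us/")
# finally:
#     driver.quit()
```

The returned HTML can then be parsed the same way as a normal response, for example with BeautifulSoup.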
Notes
Chrome Extension
When to use it
- when Selenium doesn't work
Notes
- Chrome extensions can do a lot more than scrape
- out of scope for this tutorial
Scraping Framework like Scrapy
When to use it
- When you want to run a big operation and scrape many pages concurrently (for complex projects)
- comparison site