scraping_tutorial

Basics of web scraping with Python, requests, beautifulsoup4, selenium, etc.

Web Scraping

Getting Started

# get the code
git clone https://github.com/sottom/scraping_tutorial.git
cd scraping_tutorial

# create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate

# install dependencies
pip install -r requirements.txt

# run any file you like
python {any_file}.py

Websites for Scraping Tutorial (Tech Talk):

  1. https://www.ratemyprofessors.com/
  2. https://www.premierleague.com/players
  3. https://www.zacks.com/stock/research/AAPL/earnings-announcements
  4. https://free-proxy-list.net/
  5. https://www.timeanddate.com/holidays/us/
  6. https://codepen.io/gaearon/pen/oWWQNa

Important Points

Be Respectful

  • Read and follow the site's /robots.txt file
  • Don't make too many requests; space them out so you don't burden the server
  • Don't publish data that isn't yours (be careful about this)
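
As a rough illustration of the first two points, the sketch below checks URLs against robots.txt rules with the standard library's `urllib.robotparser` and reads the crawl delay, if the site declares one. The robots.txt content, the `my-bot` user agent, and the example.com URLs are made up for the example; it is one possible approach, not the method used in the tutorial files.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline so the sketch runs without a network call.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that the rules allow this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def crawl_delay(parser, user_agent, default=1.0):
    """Seconds to wait between requests; falls back to `default` if none is declared."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS)
    urls = ["https://example.com/players", "https://example.com/private/admin"]
    print(allowed_urls(parser, "my-bot", urls))
    print(crawl_delay(parser, "my-bot"))
```

Against a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` would load the real file instead of the sample text, and a `time.sleep(crawl_delay(...))` between requests keeps the request rate down.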

Can websites figure out you're scraping them?

  • Yes. Request rates, headers, and repetitive navigation patterns can all give a scraper away.

How do I make my scraper more humanlike?

  • Change User-Agents
  • Change IP addresses
  • Don't follow the same pattern every time you scrape
    • Scraping Intervals
    • Scraping Click Paths
    • Click Timing
    • Click position is hard to fake, because a scripted click carries no real screenX or screenY
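
The first three ideas can be sketched roughly as follows: pick a User-Agent at random, wait a randomized interval, and optionally route the request through a proxy. The User-Agent strings, the default timings, and the helper names are illustrative assumptions, and `session` is assumed to be a `requests.Session` (or anything with a compatible `get` method).

```python
import random
import time

# Illustrative User-Agent strings; a real list would be kept up to date.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def jittered_delay(base=2.0, jitter=1.5):
    """Return a wait time in seconds between `base` and `base + jitter`."""
    return base + random.uniform(0, jitter)


def polite_get(session, url, proxies=None):
    """Wait a random interval, then fetch `url` with a random User-Agent.

    `proxies` uses the requests format, e.g. {"https": "http://10.0.0.1:8080"}
    (a made-up address), which changes the IP address the site sees.
    """
    time.sleep(jittered_delay())
    return session.get(url, headers=random_headers(), proxies=proxies, timeout=10)
```

With requests installed, `polite_get(requests.Session(), "https://example.com/")` would make one delayed request. Varying click paths and click timing in a browser follows the same idea: draw the next step and the pause from a random choice rather than a fixed script.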

When you run into issues, start Googling

Why Scrape?

  • Understand how the web really works
  • Get data that isn't available from APIs
  • Interview prep

What to do before scraping

  • check for APIs
    • 1_professor.py
    • 2_sports.py
      • Unfortunately, this site has started blocking unauthorized calls.
  • check for data in the global scope (JavaScript variables the page defines, often in an inline script tag)
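
When a page ships its data as a JavaScript variable, it can often be read without parsing the HTML at all. The sketch below assumes the value is valid JSON assigned to a known variable name; `window.pageData` and the sample HTML are invented for illustration.

```python
import json
import re

# Hypothetical page source with data assigned to a global variable.
SAMPLE_HTML = """
<html><body>
<script>
  window.pageData = {"players": [{"name": "A. Example", "goals": 3}]};
</script>
</body></html>
"""


def extract_global(html, variable):
    """Return the JSON value assigned to `variable`, or None if it is not found.

    Only works when the assigned value is valid JSON (double-quoted keys and
    strings); arbitrary JavaScript object literals will raise JSONDecodeError.
    """
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here, the trailing semicolon).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


if __name__ == "__main__":
    data = extract_global(SAMPLE_HTML, "window.pageData")
    print(data["players"][0]["name"])
```

The same check applies to APIs: the browser's network tab shows whether the page fetches JSON from an endpoint that can be called directly.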

Requests & BeautifulSoup

When to use it

  • when the data you want is already in the HTML the server returns on page load (no JavaScript needed to render it)
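
A minimal sketch of this case, assuming the data sits in an HTML table. The table id, the sample rows, and the `parse_table` helper are made up for illustration and are not taken from the tutorial files.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for the text of a downloaded page.
SAMPLE_HTML = """
<table id="holidays">
  <tr><th>Date</th><th>Name</th></tr>
  <tr><td>Jan 1</td><td>New Year's Day</td></tr>
  <tr><td>Jul 4</td><td>Independence Day</td></tr>
</table>
"""


def parse_table(html, table_id):
    """Return the data rows of the table with the given id as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows contain only <th> cells, so they are skipped
            rows.append(cells)
    return rows


if __name__ == "__main__":
    for date, name in parse_table(SAMPLE_HTML, "holidays"):
        print(date, name)
```

For a live page, the HTML would come from something like `requests.get(url, timeout=10).text` before being passed to `parse_table`.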

Example usage

  • 4_proxy.py
  • 5_holidays.py

Notes

  • BeautifulSoup is only one option; regex or another HTML parser works just as well

Headless Browser

When to use it

  • when the content is rendered by JavaScript (e.g. a React app); a plain request won't return what you see in the browser
  • when you need to log in
  • when you want to automate interactions with a site
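
The usual difficulty with JavaScript-rendered pages is that the elements do not exist yet when the page first loads, so the scraper has to wait for them. Here is a rough sketch of that idea with Selenium 4 and Chrome. The helper names and the hand-rolled polling loop are assumptions made for illustration; Selenium's own `WebDriverWait` covers the same need with more features.

```python
import time


def open_headless_chrome():
    """Start Chrome without a visible window (assumes Selenium 4 and a recent Chrome)."""
    # Imported here so the helpers below can be read and tried without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def wait_for_elements(driver, css, timeout=10.0, poll=0.5):
    """Poll until at least one element matches `css`, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the locator string behind Selenium's By.CSS_SELECTOR.
        found = driver.find_elements("css selector", css)
        if found:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {css!r} within {timeout}s")
        time.sleep(poll)


def rendered_texts(driver, url, css):
    """Load `url`, wait for JavaScript to render the elements, and return their text."""
    driver.get(url)
    return [element.text for element in wait_for_elements(driver, css)]
```

A typical run would call `driver = open_headless_chrome()`, then `rendered_texts(driver, url, ".some-class")` with a selector taken from the page being scraped, and finally `driver.quit()` to close the browser.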

Example usage

  • 6_holidays2.py
  • learningsuite (not included)

Notes

Chrome Extension

When to use it

  • when Selenium doesn't work

Notes

  • Chrome extensions can do a lot more than scrape
  • out of scope for this tutorial

Scraping Framework like Scrapy

When to use it

  • when you want to run a large operation and scrape many pages concurrently (for complex projects)
  • e.g. a comparison site
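
To give a feel for what concurrent scraping involves, here is a small sketch using the standard library's `ThreadPoolExecutor`. This is not Scrapy, which ships its own scheduler, retries, and pipelines; the `scrape_all` helper and its `fetch` argument are hypothetical names for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(urls, fetch, max_workers=4):
    """Run `fetch(url)` for every URL on a thread pool.

    Returns a dict mapping each URL to its result, or to the exception it
    raised, so one failing page does not stop the rest.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results


if __name__ == "__main__":
    # Stand-in fetch function so the sketch runs without network access.
    def fake_fetch(url):
        return f"<html>{url}</html>"

    pages = scrape_all(["https://example.com/a", "https://example.com/b"], fake_fetch)
    print(sorted(pages))
```

In a real scraper `fetch` would download and parse one page. Keeping `max_workers` small is part of being respectful to the site.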