visual_web_crawler

Team Caelum, Oregon State University senior project

visual web crawler

Repo for capstone project

visual_web_crawler
|-Crawler       implementation of the crawler algorithm (dev)
|-crawler_app   web application code (prod)
|-deploy        Ansible code for deploying to the server
|-[dev files]   Jupyter notebook sprints, dev requirements, etc.

installation

  • make sure you have Python 3, virtualenv (pip install virtualenv), and Firefox installed
  • clone this repo

create a new virtualenv

# use -p to choose a specific Python interpreter (this example uses the one from a Miniconda install)
cas@ubuntu:~/working_dir/visual_web_crawler$ virtualenv -p /home/cas/miniconda/bin/python crawler

activate crawler virtual env

cas@ubuntu:~/working_dir/visual_web_crawler$ source crawler/bin/activate
(crawler) cas@ubuntu:~/working_dir/visual_web_crawler$ ls
crawler web_crawler_POC.ipynb README.md requirements.txt

install requirements

(crawler) cas@ubuntu:~/working_dir/visual_web_crawler$ pip install -r requirements.txt

get geckodriver for selenium

# download a geckodriver release (v0.19.1 here; newer versions are listed on the geckodriver releases page)
(crawler) cas@ubuntu:~/working_dir/visual_web_crawler$ curl -LO https://github.com/mozilla/geckodriver/releases/download/v0.19.1/geckodriver-v0.19.1-linux64.tar.gz
# untar/unzip
(crawler) cas@ubuntu:~/working_dir/visual_web_crawler$ gunzip geckodriver-v0.19.1-linux64.tar.gz
(crawler) cas@ubuntu:~/working_dir/visual_web_crawler$ tar -xvf geckodriver-v0.19.1-linux64.tar
# remove tarball
(crawler) cas@ubuntu:~/working_dir/visual_web_crawler$ rm geckodriver-v0.19.1-linux64.tar
# move it into the virtualenv's bin directory so it is on PATH while the virtualenv is active
(crawler) cas@ubuntu:~/working_dir/visual_web_crawler$ mv geckodriver crawler/bin/
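
optional: check that everything is on PATH

The steps above assume python3, virtualenv, firefox, and geckodriver can all be found once the crawler virtualenv is active. A minimal sketch of a check follows. `check_tool` is a hypothetical helper and not part of this repo, and the tool list is an assumption based on the steps above, so adjust it to your setup.

```shell
# Hypothetical helper (not part of the repo): report whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1 -> $(command -v "$1")"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Tools the installation steps rely on (assumed list; edit as needed).
# "|| true" keeps the loop going so every missing tool gets reported.
for tool in python3 virtualenv firefox geckodriver; do
    check_tool "$tool" || true
done
```

Run it with the virtualenv activated. If geckodriver is reported missing, the likely cause is that it was not moved into crawler/bin/ or the virtualenv is not active.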

start Jupyter and open the notebook (web_crawler_POC.ipynb)

(crawler) cas@ubuntu:~/working_dir/visual_web_crawler$ jupyter notebook