Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Web scraping is the process of automatically mining data or collecting information from the World Wide Web. It is a field with active developments sharing a common goal with the semantic web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interactions. Current web scraping solutions range from the ad-hoc, requiring human effort, to fully automated systems that are able to convert entire web sites into structured information, with limitations.
The purpose of this project is to develop a web scraping tool. It is built with Ruby and the open-uri, Watir, and Nokogiri gems. The Coursera page uses the production build of React, which made the project more interesting to build.
Watir stands for Web Application Testing In Ruby. It facilitates the writing of automated tests by mimicking the behavior of a user interacting with a website.
Nokogiri is an HTML, XML, SAX, and Reader parser. Among Nokogiri's many features is the ability to search documents via XPath or CSS3 selectors.
The gems above can be installed from RubyGems.
In this project, I created a scraper that extracts free Coursera courses from coursera.org.
To get started, first get a copy of this project on your local machine by downloading it or running:
git clone https://github.com/IjayAbby/Web-Scraper-Ruby-Capstone-Project
Ruby installed on local machine
Text editor (preferably: VSCode, Atom, Sublime)
Git
Chrome Browser
If you have Ruby installed on your machine:
- Clone the project with the ``git clone`` command or download the zip file.
- Go to the project directory with the ``cd <directory name>`` command.
- Install the gems listed in the Gemfile with ``gem install <gem name>``:
  - gem install colorize
  - gem install nokogiri
- If it asks for permission, use ``sudo gem install <gem name>``.
- Run the project with the ``ruby bin/main.rb`` command.
- If you get ``(Selenium::WebDriver::Error::WebDriverError)``, it is because the project uses ChromeDriver from the Chrome browser to render the page. To fix that, install a ChromeDriver version that matches your Chrome browser.

Give the project some time to load, and the results will appear in your terminal. Enjoy and play around with the options to either quit or load the next page.
Run ``rspec <file name>`` to test the various methods in the classes.
When you run the project, it shows the free courses available on the selected page, loaded through your browser, and then asks whether you want to see more or stop. To see more results, press the 'y' key or the 'Enter/Return' key. To stop, for example because you have found a course that suits you, press the 'n' or 'q' key, and the scraping process ends.
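The key handling described above can be sketched as a small helper. The name `next_page?` is hypothetical and not taken from the project's code. Returning `nil` for unrecognized input is also an assumption, since the README does not say what happens in that case.

```ruby
# Hypothetical helper: maps one line of user input to a decision.
# Returns true to load the next page, false to stop, and nil when
# the input is not one of the documented keys.
def next_page?(input)
  case input.to_s.strip.downcase
  when '', 'y' then true   # 'y' or a bare Enter/Return
  when 'n', 'q' then false # 'n' or 'q' ends the scraping
  end
end

p next_page?("y\n")
p next_page?("\n")
p next_page?("q\n")
```

In an interactive script, the argument would typically come from `gets`, and a `nil` result could be used to ask the question again.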
For each free course, the output shows the Partner, Course, Level, and Enrollment.
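One way to hold those four fields is a small `Struct`, sketched below. The `Course` name, the output format, and the sample values are assumptions made for illustration; the project may store and print its results differently.

```ruby
# Hypothetical record type for one scraped course.
Course = Struct.new(:partner, :course, :level, :enrollment, keyword_init: true) do
  # One-line summary suitable for printing in a terminal.
  def to_s
    "#{course} (#{partner}) | Level: #{level} | Enrolled: #{enrollment}"
  end
end

# Made-up sample data, not real Coursera output.
sample = Course.new(
  partner: 'Example University',
  course: 'Intro to Ruby',
  level: 'Beginner',
  enrollment: '12k'
)

puts sample
```

Keeping the fields in a named record rather than a bare array makes it harder to mix up their order when printing or testing.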
🤝 Contributions, issues and feature requests are welcome! Start by:
1. Forking the project
2. Cloning the project to your local machine
3. cd into the project directory
4. Run git checkout -b your-branch-name
5. Make your contributions
6. Push your branch up to your forked repository
7. Open a Pull Request with a detailed description to the development branch of the original project for a review
Please feel free to contribute to any of these!
Feel free to check the issues page.
👤 Ijay Abby
Give a 🌟 if you like this project! 😊
📝 Copyright
- Thanks are owed to Sarah Chamorro, a full-time TSE at Microverse.
- Microverse
- Rubocop
- Nokogiri
- Watir
- webdrivers
This project is MIT licensed.
Happy coding!