CrawleMe! is an easy way to crawl image or link URLs from any web site.
Create your web page wrapper class.
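from crawleme.base import BasePage

class MyPage(BasePage):
    # Page to crawl
    url = 'http://www.mysite.com'
    # XPath selecting the elements of interest
    item_path = '//*[@id="campaign_list"]/div/a'
    # Attribute to read from each selected element
    item_attribute = 'href'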
Create an instance of the wrapper class and call its crawle method.
crawler = MyPage()
urls = crawler.crawle()
for url in urls:
    print(url)
Result:
http://www.mysite.com/id/5
http://www.mysite.com/aboutus/
http://www.mysite.com/foo/
http://www.mysite.com/bar/
http://www.mysite.com/baz/
You can also pass url or item_path when creating an instance to override the values defined on the wrapper class.
crawler = MyPage(url='http://www.mysite.com/id/112312')
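The override behaviour can be pictured with a small sketch. This is not CrawleMe!'s actual implementation: `SketchPage`, its `OPTIONS` tuple, and its `__init__` are hypothetical stand-ins that show one common way keyword arguments can replace class-level defaults for a single instance.

```python
class SketchPage(object):
    """Hypothetical stand-in for a wrapper base class (not crawleme.base.BasePage)."""

    # Names that may be overridden per instance (assumed for this sketch).
    OPTIONS = ('url', 'item_path', 'item_attribute')

    url = None
    item_path = None
    item_attribute = None

    def __init__(self, **overrides):
        for name, value in overrides.items():
            if name not in self.OPTIONS:
                raise TypeError('unknown option: %s' % name)
            # The instance attribute shadows the class-level default.
            setattr(self, name, value)


class MyPage(SketchPage):
    url = 'http://www.mysite.com'
    item_path = '//*[@id="campaign_list"]/div/a'
    item_attribute = 'href'


default_page = MyPage()
other_page = MyPage(url='http://www.mysite.com/id/112312')
```

Here `default_page.url` keeps the class-level value, while `other_page.url` holds the overridden one and still shares `item_path` and `item_attribute` with the class.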
url: URL of the page to crawl.
item_path: XPath of the DOM element(s) to select.
item_attribute: Attribute to read from the selected DOM element(s).
has_only_single_item (default=False): When True, the crawle method returns a single value instead of a list.
fix_urls (default=True): DOM attributes sometimes contain only a path, without protocol and hostname. When True, parsed values are converted to full URLs.
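To make the interplay of these attributes concrete, here is a minimal standard-library sketch. It does not use CrawleMe! and is not how the library is implemented: the `extract` function and the sample markup are assumptions made for illustration, the path carries a leading `.` because `xml.etree.ElementTree` only accepts relative paths, and returning the first match for `has_only_single_item` is a guess at the behaviour.

```python
import xml.etree.ElementTree as ET

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# Well-formed sample markup; real pages usually need a proper HTML parser.
SAMPLE = """
<html><body>
  <div id="campaign_list">
    <div><a href="/id/5">Five</a></div>
    <div><a href="http://www.mysite.com/aboutus/">About</a></div>
  </div>
</body></html>
""".strip()


def extract(markup, url, item_path, item_attribute,
            has_only_single_item=False, fix_urls=True):
    """Hypothetical helper mirroring the attribute names described above."""
    root = ET.fromstring(markup)
    # Select elements by path, then read one attribute from each.
    values = [el.get(item_attribute) for el in root.findall(item_path)]
    values = [v for v in values if v is not None]
    if fix_urls:
        # Resolve path-only values against the page URL.
        values = [urljoin(url, v) for v in values]
    if has_only_single_item:
        # Assumption: "single value" means the first match, or None.
        return values[0] if values else None
    return values


links = extract(SAMPLE, 'http://www.mysite.com',
                ".//*[@id='campaign_list']/div/a", 'href')
raw_links = extract(SAMPLE, 'http://www.mysite.com',
                    ".//*[@id='campaign_list']/div/a", 'href', fix_urls=False)
```

With URL fixing on, the path-only value `/id/5` in `links` becomes `http://www.mysite.com/id/5`; in `raw_links` it stays as it appears in the markup.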
crawle([timeout=crawleme.conf.REQUEST_TIMEOUT],[renew=False]): Parses a list of values, or a single value, from the page according to the specified attributes.
get_filename([timeout=crawleme.conf.REQUEST_TIMEOUT]): Returns the requested filename.
read([timeout=crawleme.conf.REQUEST_TIMEOUT]): Reads data from the stream.