crawleme

CrawleMe! is an easy way to crawl image or link URLs from any web site.

What is CrawleMe!?

CrawleMe! is an easy way to crawl image or link URLs from any web site.

How It Works

Create your web page wrapper class.

from crawleme.base import BasePage

class MyPage(BasePage):
	url = 'http://www.mysite.com'
	item_path = '//*[@id="campaign_list"]/div/a'
	item_attribute = 'href'

Create an instance of the wrapper class and call its crawle method.

crawler = MyPage()
urls = crawler.crawle()

for url in urls:
	print(url)

Result:

http://www.mysite.com/id/5
http://www.mysite.com/aboutus/
http://www.mysite.com/foo/
http://www.mysite.com/bar/
http://www.mysite.com/baz/
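
To make the roles of item_path and item_attribute concrete, here is a minimal sketch that uses only the Python standard library, not crawleme itself. The PAGE markup and the extract_attribute helper are hypothetical and exist only for this illustration; how crawleme parses pages internally is not shown here and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a downloaded page.
PAGE = """
<html>
  <body>
    <div id="campaign_list">
      <div><a href="/id/5">Campaign 5</a></div>
      <div><a href="/aboutus/">About us</a></div>
    </div>
    <a href="/ignored/">Outside the list</a>
  </body>
</html>
""".strip()


def extract_attribute(markup, item_path, item_attribute):
    """Return item_attribute of every element matched by item_path."""
    root = ET.fromstring(markup)
    # ElementTree only accepts relative paths, so '//...' becomes './/...'.
    if item_path.startswith('//'):
        item_path = '.' + item_path
    matches = root.findall(item_path)
    # Skip matched elements that do not carry the attribute at all.
    return [el.get(item_attribute) for el in matches
            if el.get(item_attribute) is not None]


values = extract_attribute(PAGE, '//*[@id="campaign_list"]/div/a', 'href')
for value in values:
    print(value)
```

Only the two links inside the campaign_list element are selected; the link outside it does not match the path. Note that ElementTree supports just a subset of XPath and requires well-formed markup, so real-world HTML usually needs a more tolerant parser.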

You can also set or override the url or item_path of the wrapper class when creating an instance.

crawler = MyPage(url='http://www.mysite.com/id/112312')
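
One common way to get this behaviour is to treat class attributes as defaults and let keyword arguments shadow them per instance. The sketch below illustrates that pattern only; SketchPage and CampaignPage are hypothetical names, and this is not taken from crawleme's BasePage, whose constructor may work differently.

```python
class SketchPage(object):
    """Hypothetical base class: class attributes act as defaults."""
    url = None
    item_path = None
    item_attribute = None

    def __init__(self, **overrides):
        for name, value in overrides.items():
            # Only allow overriding properties the class already declares.
            if not hasattr(type(self), name):
                raise TypeError('unknown property: %s' % name)
            # Instance attributes shadow the class-level defaults.
            setattr(self, name, value)


class CampaignPage(SketchPage):
    url = 'http://www.mysite.com'
    item_path = '//*[@id="campaign_list"]/div/a'
    item_attribute = 'href'


default_page = CampaignPage()
detail_page = CampaignPage(url='http://www.mysite.com/id/112312')

print(default_page.url)
print(detail_page.url)
print(detail_page.item_path)
```

The override affects only the instance it was passed to; the class-level defaults and every other instance stay unchanged.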

Properties:

url: URL of the page that will be crawled.

item_path: XPath of the selected DOM element(s).

item_attribute: Attribute of the selected DOM element(s).

has_only_single_item (default=False): When True, the crawle method returns a single value instead of a list.

fix_urls (default=True): DOM element attributes sometimes contain only a path, without protocol and hostname. When this property is True, such values are converted to full URLs.
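
The effect described for fix_urls can be approximated with urljoin from the Python 3 standard library. This is a sketch of the idea, not crawleme's own code; the fix_url helper and the sample values are assumptions made for illustration.

```python
from urllib.parse import urljoin


def fix_url(page_url, value):
    """Resolve a possibly path-only attribute value against the page URL."""
    return urljoin(page_url, value)


page_url = 'http://www.mysite.com'
# Hypothetical raw attribute values: two path-only, one already complete.
raw_values = ['/id/5', '/aboutus/', 'http://www.mysite.com/foo/']

fixed = [fix_url(page_url, value) for value in raw_values]
for url in fixed:
    print(url)
```

Values that are already full URLs pass through unchanged, so applying the fix to every parsed value is safe.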

Methods:

crawle([timeout=crawleme.conf.REQUEST_TIMEOUT],[renew=False]): Parses the page and returns a list of values, or a single value when has_only_single_item is True, according to the properties above.

get_filename([timeout=crawleme.conf.REQUEST_TIMEOUT]): Returns the requested filename.

read([timeout=crawleme.conf.REQUEST_TIMEOUT]): Reads data from the stream.

Related Projects