Klepto

A mean little DSL'd capybara (poltergeist) based web scraper that structures data into ActiveRecord or wherever(TM).

Features

CSS or XPath Syntax
Full JavaScript processing via PhantomJS / Poltergeist
All the fun of Capybara
Scrape multiple pages with a single bot
Pretty nifty DSL
Test coverage!

Installing

You need at least PhantomJS 1.8.1. There are no other external dependencies (you don't need Qt, or a running X server, etc.)

Mac

Homebrew: brew install phantomjs
MacPorts: sudo port install phantomjs
Manual install: Download this

Linux

Download the 32 bit or 64 bit binary.
Extract the tarball and copy bin/phantomjs into your PATH

Windows

Download the precompiled binary for Windows

Manual compilation

Do this as a last resort if the binaries don't work for you. It will take quite a long time as it has to build WebKit.

Download the source tarball
Extract and cd in
./build.sh

(See also the PhantomJS building guide.)

Then add klepto to your Gemfile.

gem 'klepto', '>= 0.2.5'

Usage (All your content are belong to us)

Say you want a bunch of Bieb tweets! How is there not profit in that?

# Fetch one web site or several. Bot.new takes a *splat!
@bot = Klepto::Bot.new("https://twitter.com/justinbieber"){
  # By default, it uses CSS selectors
  name      'h1.fullname'

  # If you love C# or you are over 40, XPath is an option!
  username "//span[contains(concat(' ',normalize-space(@class),' '),' screen-name ')]", :syntax => :xpath
  
  # By default Klepto uses the #text method, you can pass an :attr to use instead...
  #   or a block that will receive the Capybara Node or Result set.
  tweet_ids 'li.stream-item', :match => :all, :attr => 'data-item-id'
  
  # Want to match all the nodes for the selector? Pass :match => :all
  links 'span.url a', :match => :all do |node|
    node[:href]
  end

  # Nested structures? Let klepto know this is a resource
  last_tweet 'li.stream-item', :as => :resource do
    twitter_id do |node|
      node['data-item-id']
    end
    content '.content p'
    timestamp '._timestamp', :attr => 'data-time'
    permalink '.time a', :attr => :href
  end      

  # Multiple Nested structures? Let klepto know this is a collection of resources
  # Does Bieber tweet too much? Maybe. Let's only get the new stuff kids crave.
  tweets    'li.stream-item', :as => :collection, :limit => 10 do
    twitter_id do |node|
      node['data-item-id']
    end
    tweet '.content p', :css
    timestamp '._timestamp', :attr => 'data-time'
    permalink '.time a', :css, :attr => :href
  end     

  # Set some headers, why not.
  config.headers({
    'Referer'     => 'http://www.twitter.com'
  })  

  # on_http_status can take a splat of statuses or status classes ('4xx', '5xx');
  #   you can also have multiple handlers on a status.
  #   Note: Capybara automatically follows redirects, so 3xx statuses
  #   are never present. To watch for a redirect, pass :redirect as below.
  config.on_http_status(:redirect){
    puts "Something redirected..."
  }
  config.on_http_status(200){
    puts "Expected this, NBD."
  }

  config.on_http_status('5xx','4xx'){
    puts "HOLY CRAP!"
  }

  config.after(:get) do |page|
    # This is fired after each HTTP GET. It receives a Capybara::Node
  end  

  # If you want to do something with each resource, like stick it in AR
  #   go for it here...
  config.after do |resource|
    @user = User.new
    @user.name = resource[:name]
    @user.username = resource[:username]
    @user.save

    resource[:tweets].each do |tweet|
      Tweet.create(tweet)
    end
  end #=> Profit!
}

# You can get an array of hashes(resources), so if you wanted to do something else 
# you could do it here...
@bot.resources.each do |resource|
  pp resource
end
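
Since each resource is a plain hash, ordinary Ruby collection methods are enough for post-processing. Here is a minimal sketch, assuming resources shaped like the structure declared above with symbol keys (as in the config.after example). The sample data and the helpers flatten_tweets and newest_first are made up for illustration and are not part of Klepto.

```ruby
# Hypothetical sample data, shaped like the structure declared above.
# Real output would come from @bot.resources.
SAMPLE_RESOURCES = [
  {
    name: 'Justin Bieber',
    username: '@justinbieber',
    tweets: [
      { twitter_id: '101', tweet: 'first',  timestamp: '1360000000' },
      { twitter_id: '102', tweet: 'second', timestamp: '1360000500' }
    ]
  }
]

# Collect every tweet across all resources, tagging each with its owner's
# username. Hash#merge returns a new hash, so the input is left untouched.
def flatten_tweets(resources)
  resources.flat_map do |resource|
    resource.fetch(:tweets, []).map do |tweet|
      tweet.merge(username: resource[:username])
    end
  end
end

# Sort newest first. The timestamps are strings here (they were read from
# an attribute), so convert them before comparing.
def newest_first(tweets)
  tweets.sort_by { |tweet| -tweet[:timestamp].to_i }
end

newest_first(flatten_tweets(SAMPLE_RESOURCES)).each do |tweet|
  puts "#{tweet[:username]} #{tweet[:twitter_id]}: #{tweet[:tweet]}"
end
```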

Got a string of HTML you don't need to crawl first?

@html = Capybara::Node::Simple.new(@html_string)
@structure = Klepto::Structure.build(@html){
  # inside the build method, everything works the same as Bot.new
  name      'h1.fullname'
  username  'span.screen-name'

  links 'span.url a', :match => :all do |node|
    node[:href]
  end

  tweets    'li.stream-item', :as => :collection do
    twitter_id do |node|
      node['data-item-id']
    end
    tweet '.content p', :css
    timestamp '._timestamp', :attr => 'data-time'
    permalink '.time a', :css, :attr => :href
  end       
}
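
If the bare calls like name 'h1.fullname' look like magic: they are ordinary method calls evaluated against a builder object. The sketch below shows the general Ruby technique (instance_eval plus method_missing) under the assumption that a DSL like this one works along those lines. FieldRecorder is a hypothetical class, not Klepto's implementation, and it only records declarations without touching any HTML.

```ruby
# Hypothetical recorder illustrating the technique; NOT Klepto's code.
class FieldRecorder
  attr_reader :fields

  def initialize
    @fields = {}
  end

  # Run the block with self set to a recorder, so that bare calls such as
  # `name 'h1.fullname'` are dispatched to method_missing below.
  def self.record(&block)
    recorder = new
    recorder.instance_eval(&block)
    recorder.fields
  end

  # Every unknown method name becomes a field declaration.
  def method_missing(field, selector = nil, options = {}, &block)
    @fields[field] = { selector: selector, options: options, block: block }
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

FIELDS = FieldRecorder.record do
  name     'h1.fullname'
  username 'span.screen-name'

  links 'span.url a', :match => :all do |node|
    node[:href]
  end
end

FIELDS.each do |field, spec|
  puts "#{field}: #{spec[:selector]} #{spec[:options]}"
end
```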

Configuration Options

config.headers - Hash; Sets request headers
config.url - String; Set URL to structure
config.abort_on_failure - Boolean(Default: true); Should structuring be aborted on 4xx or 5xx

Callbacks & Processing

before
- :get (browser, url)
after
- :structure (Hash) - receives the structure from the page
- :get (browser, url) - called after each HTTP GET
- :abort (browser, hash(details)) - called after a 4xx or 5xx if config.abort_on_failure is true (default)
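
To make the status matching concrete, here is a minimal sketch of how patterns such as 200, '4xx' and '5xx' could be matched and dispatched to several handlers. StatusHandlers is a hypothetical class written for illustration, not Klepto's implementation, and it does not cover the :redirect symbol.

```ruby
# Hypothetical registry illustrating status-class matching; NOT Klepto's code.
class StatusHandlers
  def initialize
    @handlers = []
  end

  # Register a block for one or more patterns. A pattern is either an
  # Integer (200) or a status class string ('4xx').
  def on(*patterns, &block)
    patterns.each { |pattern| @handlers << [pattern, block] }
    self
  end

  # Call every handler whose pattern matches the code, in registration
  # order, and return the handlers' results.
  def dispatch(code)
    @handlers
      .select { |pattern, _block| self.class.match?(pattern, code) }
      .map { |_pattern, block| block.call(code) }
  end

  # '4xx' matches any code whose hundreds digit is 4; an Integer must be equal.
  def self.match?(pattern, code)
    case pattern
    when Integer then pattern == code
    when /\A[1-5]xx\z/ then pattern.to_s[0].to_i == code / 100
    else false
    end
  end
end

handlers = StatusHandlers.new
handlers.on(200) { |code| "Expected this (#{code})" }
handlers.on('5xx', '4xx') { |code| "Failure (#{code})" }

puts handlers.dispatch(404).inspect
```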

Stuff I'm going to add.

Ensure after(:each) work at resource/collection level as well
Add after(:all)
:if, :unless for as: (:collection|:resource) too. The context should be the captured node that the block is run against
Access to hash from within a block (for bulk assignment of other attributes) ?
config.allow_rescue_in_block #should exceptions in blocks be auto rescued with nil as the return value
:default should be able to take a proc

Async

-> https://github.com/igrigorik/em-synchrony

Cookie Stuffing

cookies({
  'Has Fun' => true
})

Pre-req Steps

prepare [
  [:GET, 'http://example.com'],
  [:POST, 'http://example.com/login', {username: 'cory', password: '123456'}],
]

Page Assertions

assertions do
  #presence and value assertions...
end
on_assertion_failure{ |response, bot| }

Structure :if unless: lambda{|node| node.class.include?("newsflash")}
