
A dual interface Go module for building simple web scrapers

MIT License





  • Go struct-tag interface
  • Command-line interface
    • HTML⇒JSON API server
    • Single binary
    • Simple configuration
    • Zero-downtime config reload with kill -s SIGHUP <scraper-pid>
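One common way to get a reload like the one in the last bullet is to keep the active configuration behind an atomic value and swap it when SIGHUP arrives, so in-flight requests keep the config they started with. The following is a minimal sketch of that pattern, not scraper's actual code: the `config` type and the `parse`, `load` and `current` names are hypothetical, and it assumes the config is a JSON object keyed by endpoint path.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

// config is a hypothetical stand-in: endpoint path -> raw endpoint definition.
type config map[string]json.RawMessage

// current holds the active config; request handlers would call current.Load().
var current atomic.Value

// parse validates raw JSON and returns it as a config.
func parse(b []byte) (config, error) {
	var c config
	if err := json.Unmarshal(b, &c); err != nil {
		return nil, err
	}
	return c, nil
}

// load reads and parses the file, swapping it in only when both steps succeed.
func load(path string) error {
	b, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	c, err := parse(b)
	if err != nil {
		return err
	}
	current.Store(c)
	return nil
}

func main() {
	if len(os.Args) < 2 {
		log.Println("usage: reload-sketch <config.json>")
		return
	}
	path := os.Args[1]
	if err := load(path); err != nil {
		log.Fatal(err)
	}

	// Reload on every SIGHUP; a bad file leaves the previous config active.
	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	for range hup {
		if err := load(path); err != nil {
			log.Printf("reload failed, keeping previous config: %v", err)
			continue
		}
		log.Printf("reloaded %s", path)
	}
}
```

Because the swap happens only after the new file parses cleanly, a typo in the config never takes the server down; it just logs the error and keeps serving the old endpoints.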



See the latest release or download it with this one-liner: curl | bash


$ go get -v

Go Example

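package main

import (
	"fmt"
	"log"

	// plus this module's scraper package (import path elided)
)

func main() {
	// result maps each field to the extractors that fill it.
	type result struct {
		Title string `scraper:"h3 span"`
		URL   string `scraper:"a[href] | @href"`
	}

	// google describes the endpoint: the URL template, the list selector
	// and the query input parameter.
	type google struct {
		URL    string   `scraper:"{{query}}"`
		Result []result `scraper:"#rso div[class=g]"`
		Query  string   `scraper:"query"`
	}

	g := google{Query: "hello world"}

	if err := scraper.Execute(&g); err != nil {
		log.Fatal(err)
	}

	for i, r := range g.Result {
		fmt.Printf("#%d: '%s' => %s\n", i+1, r.Title, r.URL)
	}
}
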
#1: 'Helloworld Travel – Deals on Accommodation, Flights ...' =>
#2: '"Hello, World!" program - Wikipedia' =>,_World!%22_program
#3: 'Helloworld Travel - Wikipedia' =>
#4: 'Helloworld Travel Limited' =>
#5: 'Total immersion, Serious fun! with Hello-World!' =>
#6: 'Helloworld Travel - Home | Facebook' =>

CLI Example

Given google.json:

{
  "/search": {
    "url": "{{query}}",
    "list": "#rso div[class=g]",
    "result": {
      "title": "h3 span",
      "url": ["a[href]", "@href"]
    }
  }
}

$ scraper google.json
2015/05/16 20:10:46 listening on 3000...
$ curl "localhost:3000/search?query=hellokitty"
[
  {
    "title": "Official Home of Hello Kitty \u0026 Friends | Hello Kitty Shop",
    "url": ""
  },
  {
    "title": "Hello Kitty - Wikipedia, the free encyclopedia",
    "url": ""
  }
]


{
  <path>: {
    "method": <method>,
    "url": <url>,
    "list": <selector>,
    "result": {
      <field>: <extractor>,
      <field>: [<extractor>, <extractor>, ...]
    }
  }
}

  • <path> - Required. The path of the scraper endpoint.
    • Accessible at http://<host>:<port>/<path>
    • You may define path variables. For example, with the path my/path/:var, a request to /my/path/foo sets :var = "foo"
  • <url> - Required. The URL of the remote server to scrape.
    • It may contain template variables in the form {{ var }}. For each one, scraper first looks for a path variable named var and, if none is found, falls back to a query parameter named var
  • result - Required. Describes the resulting JSON object: each <field> holds the output of running its <extractor> on the current DOM context. A field may use a sequence of <extractor>s to perform more complex queries.
  • <method> - The HTTP request method (defaults to GET)
  • <extractor> - A string, which must be one of:
    • a regex in the form /abc/ - searches the text of the current DOM context (extracts the first group when one is provided).
    • a regex in the form s/abc/xyz/ - searches the text of the current DOM context and replaces matches with the provided text (sed-like syntax).
    • an attribute in the form @abc - gets the attribute abc from the DOM context.
    • a function in the form html() - gets the DOM context as a string.
    • a function in the form trim() - trims whitespace from the beginning and end of the string.
    • a query param in the form query-param(abc) - parses the current context as a URL and extracts the provided param.
    • a CSS selector abc (if not in any of the forms above) - alters the DOM context.
  • list - Optional. A CSS selector used to split the root DOM context into a set of DOM contexts. Useful for capturing search results.
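The text-based extractor forms above can be approximated with the Go standard library, which may help when predicting what a chain of extractors will produce. The sketch below is an illustration only, not scraper's implementation: the helper names `firstMatch`, `substitute` and `queryParam` are hypothetical, the sample href is made up, and it assumes that s/abc/xyz/ replaces every match.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// firstMatch mimics the /abc/ form: it returns the first capture group when
// the pattern defines one, otherwise the whole match ("" when nothing matches).
func firstMatch(pattern, text string) string {
	m := regexp.MustCompile(pattern).FindStringSubmatch(text)
	if m == nil {
		return ""
	}
	if len(m) > 1 {
		return m[1]
	}
	return m[0]
}

// substitute mimics the s/abc/xyz/ form (assumption: all matches are replaced).
func substitute(pattern, replacement, text string) string {
	return regexp.MustCompile(pattern).ReplaceAllString(text, replacement)
}

// queryParam mimics query-param(abc): parse the text as a URL and return the
// named query parameter ("" when the URL is invalid or the param is absent).
func queryParam(name, rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	return u.Query().Get(name)
}

func main() {
	// A made-up redirect-style href, as an @href extractor might return it.
	href := "  /url?q=https://example.com/page&sa=U  "

	s := strings.TrimSpace(href) // trim()
	s = queryParam("q", s)       // query-param(q)
	fmt.Println(s)

	fmt.Println(firstMatch(`//([^/]+)/`, s))     // /\/\/([^\/]+)\//
	fmt.Println(substitute(`^https`, "http", s)) // s/^https/http/
}
```

Each helper takes the current text and returns new text, which mirrors how a sequence of extractors feeds the output of one step into the next.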


Replace <variable> with your configuration, documented above.

  1. Define your endpoint struct:
type endpoint struct {
  Method string   `scraper:"<method>"`
  URL    string   `scraper:"<url>"`
  Result []result `scraper:"<list>"`
  <param>  string `scraper:"<param>"`
}

Method, URL, Result and Debug are special fields; all remaining string fields are treated as input parameters. By default, an input parameter is named after its field with the first character lowercased.
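To make the naming rule concrete, here is a minimal sketch of how input parameter names could be derived from a struct with reflection. It is not the module's own code: `paramNames`, `lowerFirst` and the `special` set are hypothetical names, and it assumes that a non-empty `scraper` tag overrides the default lowercased name.

```go
package main

import (
	"fmt"
	"reflect"
	"unicode"
	"unicode/utf8"
)

// special lists the field names described above as special fields.
var special = map[string]bool{"Method": true, "URL": true, "Result": true, "Debug": true}

// lowerFirst lowercases only the first character of s.
func lowerFirst(s string) string {
	r, size := utf8.DecodeRuneInString(s)
	if size == 0 {
		return s
	}
	return string(unicode.ToLower(r)) + s[size:]
}

// paramNames maps each input-parameter field of struct v to its parameter
// name, skipping special fields and any field that is not a string.
func paramNames(v interface{}) map[string]string {
	t := reflect.TypeOf(v)
	out := map[string]string{}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if special[f.Name] || f.Type.Kind() != reflect.String {
			continue
		}
		name := f.Tag.Get("scraper") // assumption: the tag overrides the default
		if name == "" {
			name = lowerFirst(f.Name)
		}
		out[f.Name] = name
	}
	return out
}

func main() {
	type endpoint struct {
		URL     string `scraper:"{{myParam}}"`
		MyParam string
		Page    string `scraper:"p"`
	}
	fmt.Println(paramNames(endpoint{}))
}
```

Under these assumptions, MyParam is exposed as myParam and Page as p, while URL is skipped because it is a special field.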

  2. Define your result struct:
type result struct {
  <field> string `scraper:"<extractor>"`
  <field> string `scraper:"<extractor> | <extractor>"`
}

The result struct defines the field-to-extractor mappings. All fields must be strings. Struct tags cannot contain arrays, so multiple extractors are joined with | instead.
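As an illustration of the | convention, a tag value can be turned back into an ordered list of extractors by splitting and trimming. This is a sketch under assumptions rather than the module's parser: `splitExtractors` is a hypothetical helper, and it splits on every |, so a regex extractor that itself contains | would need extra handling.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// splitExtractors splits a tag value on "|" and trims each piece,
// dropping empty pieces, to recover the ordered extractor sequence.
func splitExtractors(tag string) []string {
	parts := strings.Split(tag, "|")
	out := make([]string, 0, len(parts))
	for _, p := range parts {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	type result struct {
		URL string `scraper:"a[href] | @href"`
	}
	// Read the raw tag value via reflection, then split it.
	f, _ := reflect.TypeOf(result{}).FieldByName("URL")
	fmt.Printf("%q\n", splitExtractors(f.Tag.Get("scraper")))
}
```

Running this prints `["a[href]" "@href"]`, the same two-step sequence that the JSON configuration writes as an array.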

  3. Execute it:
e := endpoint{MyParam: "hello world"}
if err := scraper.Execute(&e); err != nil {
  log.Fatal(err)
}
// e.Result is now set

Similar projects