Ramper

A library for fast web crawling in Clojure.

Show me the code

Start a crawl that fetches at most 10000 pages:

(require '[clojure.java.io :as io]
         '[ramper.instance :as instance])

(instance/start (io/file "seed.txt") (io/file "store-dir") {:max-urls 10000})

Testing your setup

It's best to test the machine(s) you are using for a crawl against a local proxy server. That way network bandwidth is not the limiting factor, and you can see how much throughput the crawler itself sustains. To route requests through a proxy, pass http-opts with the :proxy-url option at initialization time.

(instance/start seed-file store-dir
       {:max-urls 100000
        :nb-fetchers 32
        :nb-parsers 10
        :http-opts {:proxy-url "http://localhost:8080"}})

If you are using a different library for fetching (via the http-get option), you need to make sure that this function also goes through the proxy, either via http-opts or by some other means.

To run the crawler against a local graph server, we can use BUbiNG. The following starts a server on port 8080 that simulates 100 million sites with an average page degree of 50, an average depth of 3, and 0.01% of sites being broken.

java -cp bubing-0.9.15.jar:bubing-0.9.15-deps/* -Xmx4G -server it.unimi.di.law.bubing.test.NamedGraphServerHttpProxy -s 100000000 -d 50 -m 3 -t 1000 -D .0001 -A1000 -

The precompiled jars can be found at http://law.di.unimi.it/software/index.php?path=download/.

There is also a babashka script that downloads all of these dependencies and launches the proxy server with the above configuration.

./download_bubing

Customize your crawl

Ramper comes with a couple of options to customize your crawl. These are

fetch-filter

A filter that is applied to every url before it goes through the sieve. Say you only want to fetch urls that contain "clojure" in their name and use the https scheme.

(require '[clojure.string :as str]
         '[ramper.customization :as custom])

;; Truthy (an index) when the url contains "clojure", nil otherwise.
(defn clojure-url? [url]
  (str/index-of url "clojure"))

(instance/start seed-file store-dir {:fetch-filter (every-pred custom/https? clojure-url?)})
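Since a fetch filter is an ordinary predicate, it can be tried at the REPL before starting a crawl. The following is a minimal sketch, not part of Ramper: https-url? is a hypothetical stand-in for custom/https?, and both predicates call str on their argument because it is an assumption here that urls may arrive either as strings or as url objects.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for custom/https?, so the sketch runs without Ramper.
(defn https-url? [url]
  (str/starts-with? (str url) "https://"))

;; Same idea as clojure-url? above, but always returns a boolean.
(defn clojure-url? [url]
  (some? (str/index-of (str url) "clojure")))

;; every-pred combines the predicates: a url passes only if all of them pass.
(def fetch-filter (every-pred https-url? clojure-url?))

(def candidates
  ["https://clojure.org/guides"
   "http://clojure.org/"
   "https://example.com/"])

(filterv fetch-filter candidates)
;; => ["https://clojure.org/guides"]
```

Checking a handful of known urls this way is cheaper than discovering a too-strict filter halfway through a crawl.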

schedule-filter

A filter that is applied to every url before the resource gets fetched (just after the sieve). For example, let's say you want to fetch only a limited number of urls per domain.

(defn max-per-domain-filter [max-per-domain]
  (let [domain-to-count (atom {})]
    (fn [url]
      (let [base (url/base url)]
        (when (< (get @domain-to-count base 0) max-per-domain)
          (swap! domain-to-count update base (fnil inc 0))
          true)))))

(instance/start seed-file store-dir {:schedule-filter (max-per-domain-filter 100)})

The max-per-domain-filter is also provided by the customization namespace.
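The filter above reads the counter and then updates it in two separate steps, so if it were called from several threads at once, a domain could slightly exceed its limit. Whether that can happen depends on how Ramper invokes the filter, which is not stated here. As a sketch under that assumption, the check and the increment can be folded into a single atomic swap-vals! call (Clojure 1.9+). The names max-per-host-filter and host-of are hypothetical, and host-of uses java.net.URI as a stand-in for url/base.

```clojure
(import 'java.net.URI)

;; Hypothetical stand-in for url/base: extracts the host of a url.
(defn host-of [url]
  (.getHost (URI. (str url))))

(defn max-per-host-filter [max-per-host]
  (let [host->count (atom {})]
    (fn [url]
      (let [host (host-of url)
            ;; Increment only while below the limit. swap-vals! returns the
            ;; map as it was before and after this particular update.
            [old _] (swap-vals! host->count update host
                                (fn [n]
                                  (let [n (or n 0)]
                                    (if (< n max-per-host) (inc n) n))))]
        ;; This call took a slot exactly when the old count was below the limit.
        (< (get old host 0) max-per-host)))))

(def allow? (max-per-host-filter 2))

(mapv allow? ["https://a.org/1" "https://a.org/2" "https://a.org/3" "https://b.org/1"])
;; => [true true false true]
```

Unlike the original, this variant returns false rather than nil for rejected urls, which makes no difference when it is used as a predicate.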

store-filter

A filter that is applied before a response is stored. Suppose you only want to store pages that contain the word "clojure".

(require '[clojure.string :as str]
         '[ramper.html-parser :as html])

;; Truthy when the page text contains "clojure" (case-insensitive), nil otherwise.
(defn contains-clojure? [resp]
  (some-> resp :body html/html->text str/lower-case (str/index-of "clojure")))

(instance/start seed-file store-dir {:store-filter contains-clojure?})

follow-filter

In the same vein as above, suppose you only want to continue following the links of a page when it contains the word "clojure".

(instance/start seed-file store-dir {:follow-filter contains-clojure?})
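Store and follow filters both receive a response, so one predicate factory can serve both. The sketch below is an illustration, not part of Ramper: body-contains is a hypothetical helper, and the only thing assumed about the response is the :body key used above. To stay self-contained it searches the raw body rather than the text extracted by html/html->text, so it can also match words that appear only inside tags or attributes.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: returns a predicate on response maps that is true
;; when the body contains `word`, ignoring case.
(defn body-contains [word]
  (let [needle (str/lower-case word)]
    (fn [resp]
      (boolean
       (some-> resp :body str str/lower-case (str/includes? needle))))))

(def mentions-clojure? (body-contains "Clojure"))

(mentions-clojure? {:body "<p>I like CLOJURE</p>"}) ;; => true
(mentions-clojure? {:body "<p>I like Java</p>"})    ;; => false
(mentions-clojure? {:body nil})                     ;; => false
```

Such a predicate could then be passed as :store-filter, :follow-filter, or both, in the same way as contains-clojure? above.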

Robots.txt

By default the robots.txt standard is followed: the robots.txt file is downloaded before any other content is fetched, and its rules are adhered to. "nofollow" attributes are respected. Currently, robots meta tags of the form <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> are ignored.

Compiling

When developing, you need to build the Java files once before jacking in.

clojure -T:build java

Tests

The tests can be run with

clojure -X:test

To run only specific tests, pass additional options after -X:test, such as :nses to select namespaces. See cognitect.test-runner for the options that control which tests are invoked.

clojure -X:test :nses '[ramper.workers.parsing-thread-test]'

Profiling

For Java Mission Control to work correctly, you need to set

echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid

License

Distributed under the MIT License. See LICENSE.