A web crawler that indexes the identity/social graph of profiles on the web.

RelSpider is a web crawler that indexes relations between a user's different online profiles and ties them together into an identity graph. The result is similar to a social graph, but instead of mapping relations between different identities, it maps relations between different representations of the same identity.
Things left to be done in the refactoring:
RelSpider is built to work well with a Heroku-like setup and therefore uses Foreman to start itself. First install Foreman if you haven't got it installed already, then set the required RelSpider configuration as outlined below, and lastly start RelSpider by typing:

foreman start
Running on Heroku is easy - you basically just push the code up there and you're off. You can read more about that in their general quick start guide and then their Node.js quick start guide.
To avoid having to configure anything, it is recommended to use the PostgreSQL and GrapheneDB add-ons. It's also recommended to use the Memcache add-on - at least if you ever want to run more than one process.
This script can be run on Heroku for free in small scale - even with all the recommended add-ons added.
To configure Foreman locally, create a .env file in the top folder of RelSpider and add all required options below, as well as any optional ones you would like to use.
When used with Heroku everything will work automatically if the recommended add-ons are used, but of course all configuration can be specified there as well.
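For example, a local .env file might look like this (a minimal sketch - the values are placeholders based on the options listed below):

```
DATABASE_URL="postgres://foo@localhost/relspider"
NEO4J_URL="http://localhost:7474"
RELSPIDER_API_USER="foo"
RELSPIDER_API_PASS="bar"
RELSPIDER_PARALLEL="30"
```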
- `DATABASE_URL="postgres://foo@localhost/relspider"` - how to connect to your PostgreSQL database. Provided by the PostgreSQL Heroku Add-on.
- `NEO4J_URL` - how to connect to your Neo4j database. Defaults to `http://localhost:7474`. Provided by the GrapheneDB Heroku Add-on.
- `RELSPIDER_API_USER="foo"` - used together with `RELSPIDER_API_PASS` to lock down the API with HTTP Authentication. Default is to require no authentication.
- `RELSPIDER_API_PASS="bar"` - see `RELSPIDER_API_USER`.
- `RELSPIDER_PARALLEL="30"` - the maximum number of parallel fetches per process; no more fetches than this will ever be made at once. Defaults to 30 parallel fetches.
- `RELSPIDER_CACHE="memcached"` - if set to `memcached`, then MemJS will be used for caching; see that module for additional configuration details. Defaults to an in-memory cache unless `MEMCACHE_USERNAME` (provided by the Memcache Heroku Add-on) or `MEMCACHIER_USERNAME` (provided by the Memcachier Heroku Add-on) is set - if either of them is set, MemJS is instead auto-configured to use them.

The /api/lookup endpoint is used to fetch the identity graph of a URL. If a URL isn't yet crawled, it will be scheduled for crawling.
- `url` - required. The URL to do the lookup on.
- `callback` - a URL, a "WebHook", to which the resulting identity graph is POSTed when it has been fully crawled. Only used if the identity graph isn't yet fully crawled. The format of the POSTed body is the same as the JSON in the response of this request.

The response is HTTP 202 if the identity graph isn't yet fully crawled, otherwise HTTP 200 with a JSON body like:
{
"url": "http://github.com/voxpelli",
"related": [
"http://twitter.com/voxpelli",
"http://github.com/voxpelli",
"http://voxpelli.com/",
"http://kodfabrik.se/"
],
"incomplete": true
}
The `url` key in the response shows the URL that the lookup was made on. The `related` key contains the full identity graph, including the URL used in the lookup. The `incomplete` key is sometimes included - it indicates that pages have been found in the graph that RelSpider for some reason hasn't been able to crawl, and that the graph therefore might not show its true extent.
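As a sketch of how a client might interpret such a response (the field names come from the example above; the filtering logic itself is just an illustration, not part of RelSpider):

```javascript
// Sketch: interpreting a /api/lookup response body like the example above.
const body = {
  url: 'http://github.com/voxpelli',
  related: [
    'http://twitter.com/voxpelli',
    'http://github.com/voxpelli',
    'http://voxpelli.com/',
    'http://kodfabrik.se/'
  ],
  incomplete: true
};

// "related" includes the looked-up URL itself, so filter it out to get
// only the *other* representations of the same identity.
const others = body.related.filter((u) => u !== body.url);

if (body.incomplete) {
  // Some pages in the graph could not be crawled, so it may be missing nodes.
  console.log('Warning: identity graph may be incomplete');
}

console.log(others);
```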
Used to schedule a site for crawling. Often you want /api/lookup instead.

- `url` - required. The URL to schedule for crawling.

Responds with HTTP 202 and a message of success!
Used to force the refresh of a node. Useful when debugging and you don't want to wait 24 hours for the next scheduled refresh. It won't ever refresh more often than once a minute though.

- `url` - required. The URL to schedule for refresh.

Responds with HTTP 202 and a message of success!
MIT http://voxpelli.mit-license.org
Sometimes there is an open demo up and running on a free Heroku instance with all the above recommended add-ons: http://relspider.herokuapp.com/
Big refactoring, among other things:
New features: