A web crawler that indexes the identity/social graph of profiles on the web.

RelSpider is a web crawler that indexes relations between a user's different online profiles and ties them together into an identity graph. The result is similar to a social graph, but instead of mapping relations between different identities, it maps relations between different representations of the same identity.
Things left to be done in the refactoring:
RelSpider is built to work well with a Heroku-like setup and therefore uses Foreman to start itself. First install Foreman if you haven't got it installed already, then set the required RelSpider configuration as outlined below, and lastly start RelSpider by typing:

foreman start
Running on Heroku is easy - you basically just push the code up there and you're off. You can read more about that in their general quick start guide and then their Node.js quick start guide.
To avoid having to configure anything, it is recommended to use the PostgreSQL and GrapheneDB add-ons. It's also recommended to use the Memcache add-on - at least if you ever want to run more than one process.
This script can be run on Heroku for free in small scale - even with all the recommended add-ons added.
To configure Foreman locally, create a .env file in the top folder of RelSpider and add all required options below, as well as any optional ones you would like to use.
When used with Heroku everything will work automatically if the recommended add-ons are used, but of course all configuration can be specified there as well.
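For example, a local .env file might look like this (a minimal sketch - the values are placeholders based on the options listed below):

```
DATABASE_URL="postgres://foo@localhost/relspider"
NEO4J_URL="http://localhost:7474"
RELSPIDER_API_USER="foo"
RELSPIDER_API_PASS="bar"
RELSPIDER_PARALLEL="30"
```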
- `DATABASE_URL="postgres://foo@localhost/relspider"` - how to connect to your PostgreSQL database. Provided by the PostgreSQL Heroku Add-on.
- `NEO4J_URL` - how to connect to your Neo4j database. Defaults to `http://localhost:7474`. Provided by the GrapheneDB Heroku Add-on.
- `RELSPIDER_API_USER="foo"` - used together with `RELSPIDER_API_PASS` to lock down the API with HTTP Authentication. Default is to require no authentication.
- `RELSPIDER_API_PASS="bar"` - see `RELSPIDER_API_USER`.
- `RELSPIDER_PARALLEL="30"` - the maximum number of parallel fetches per process; no more fetches than this will ever be made at once. Defaults to 30 parallel fetches.
- `RELSPIDER_CACHE="memcached"` - if set to `memcached`, then MemJS will be used for caching; see that module for additional configuration details. Defaults to an in-memory cache unless `MEMCACHE_USERNAME` (provided by the Memcache Heroku Add-on) or `MEMCACHIER_USERNAME` (provided by the Memcachier Heroku Add-on) is set - if either of them is set, MemJS is instead auto-configured to use them.

The /api/lookup endpoint is used to fetch the identity graph of a URL. If a URL isn't yet crawled, it will be scheduled for crawling.
- `url` - required. The URL to do the lookup on.
- `callback` - a URL, a "WebHook", to which the resulting identity graph is POSTed when it has been fully crawled. Only used if the identity graph isn't yet fully crawled. The format of the POSTed body is the same as the JSON in the response of this request.

The response is HTTP 202 if the identity graph isn't yet fully crawled, otherwise HTTP 200 with a JSON body like:
{
"url": "http://github.com/voxpelli",
"related": [
"http://twitter.com/voxpelli",
"http://github.com/voxpelli",
"http://voxpelli.com/",
"http://kodfabrik.se/"
],
"incomplete": true
}
The `url` key in the response shows the URL that the lookup was made on. The `related` key contains the full identity graph, including the URL used in the lookup. The `incomplete` key is sometimes included - it indicates that pages have been found in the graph that RelSpider for some reason hasn't been able to crawl, and that the graph therefore might not show its true extent.
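As a sketch of how a client might interpret such a response (the field names come from the example above; the filtering logic itself is just an illustration, not part of RelSpider):

```javascript
// Sketch: interpreting a /api/lookup response body like the example above.
const body = {
  url: 'http://github.com/voxpelli',
  related: [
    'http://twitter.com/voxpelli',
    'http://github.com/voxpelli',
    'http://voxpelli.com/',
    'http://kodfabrik.se/'
  ],
  incomplete: true
};

// "related" includes the looked-up URL itself, so filter it out to get
// only the *other* representations of the same identity.
const others = body.related.filter((u) => u !== body.url);

if (body.incomplete) {
  // Some pages in the graph could not be crawled, so it may be missing nodes.
  console.log('Warning: identity graph may be incomplete');
}

console.log(others);
```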
Used to schedule a site for crawling. Often you want /api/lookup instead.

- `url` - required. The URL to schedule for crawling.

Responds with HTTP 202 and a message of success!
Used to force the refresh of a node. Useful when debugging and you don't want to wait 24 hours for the next scheduled refresh. It won't ever refresh more often than once a minute though.

- `url` - required. The URL to schedule for refresh.

Responds with HTTP 202 and a message of success!
MIT http://voxpelli.mit-license.org
Sometimes there is an open demo up and running on a free Heroku instance with all the above recommended add-ons: http://relspider.herokuapp.com/
Big refactoring, among other things:
New features: