grab-rss

RSS to email script

grab_rss is a simple RSS to email gateway: it downloads feeds (RSS, Atom, or anything else Universal Feed Parser understands), formats them into plaintext emails, and sends them to you. The idea is to run it from a cron job, once an hour or once a day or whatever suits you.
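
For example, a crontab entry like this would run it hourly (the path is an assumption; adjust it to wherever you put the script):

"""
# m h dom mon dow  command
0 * * * * $HOME/bin/grab_rss.py
"""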

  • Why write this?

I had been using Google Reader. Mutt is a better information organizer than Reader, and since I run my own mail server there are fewer privacy concerns (I expect at some point Google will begin analyzing what people look at in their Readers, what gets the most clicks, how long you spend looking at particular items, etc., so as to enhance the advertising experience - assuming they don't do this already, of course). I was initially going to replace it with a standalone GUI RSS reader, but the ones I evaluated for Linux were buggy, incomplete, very slow, or a combination of the three. So I ended up writing this brand new buggy, incomplete, and very slow program instead. Your own ugly babies are always the cutest ones.

  • Why not just use rss2email?

Using pickle for the state is obnoxious - hard to read, edit, or save/merge using version control. I strongly prefer plain text for everything that's even remotely important, because it lets me use existing high quality diff/merge tools built into DVCSes to save my data, and quality editors to modify them as necessary.

In short: die, binary data, die in a fire.

(And then I realized it was far better to save the seen-items state in a sqlite db than in a plain text file, so that if an exception is thrown the state can be cleanly rolled back. Hypocrisy rocks!)
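
That rollback is the point: sqlite gives you transactions for free. A minimal sketch of the idea, with a hypothetical schema (the real seen.db layout may differ):

"""
import sqlite3

def record_seen(db_path, feed_url, item_id):
    # hypothetical schema; the real seen.db layout may differ
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS seen "
                     "(feed TEXT, item TEXT, PRIMARY KEY (feed, item))")
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT OR IGNORE INTO seen VALUES (?, ?)",
                         (feed_url, item_id))
    finally:
        conn.close()
"""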

  • What's there?

All the basic functionality works: it reads RSS feeds, sends you emails about them, remembers which ones it has already told you about. I'm using it as my only RSS source, and I'm happy with it.

  • What's missing?

No provision for HTML mail. Everything is converted to text/plain, stripping out everything except links. I don't like HTML mail.

If you want to read anything that's not written in mostly ASCII you're out of luck; all input text is forced to 7-bit ASCII.
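
A sketch of the conversion described above, assuming stripogram's html2text entry point (the ASCII-forcing step here is illustrative, not necessarily grab_rss's exact code):

"""
from stripogram import html2text

def to_plain_ascii(html):
    # strip markup down to plain text, then force everything to
    # 7-bit ASCII, replacing anything that doesn't fit
    text = html2text(html)
    if isinstance(text, unicode):
        text = text.encode('ascii', 'replace')
    return text
"""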

Probably other things that haven't even occurred to me. Suggestions welcome, patches better.

  • What do I need?

Currently you need Python 2.6 (2.4/2.5 might work, haven't tried) plus the following dependencies:

Required:

  feedparser - http://www.feedparser.org
  stripogram - http://www.zope.org/Members/chrisw/StripOGram

Optional:

  dateutil.parser - http://labix.org/python-dateutil
  multiprocessing - http://code.google.com/p/python-multiprocessing/

The multiprocessing module is included in the standard library as of Python 2.6; the backport linked above is only needed on older versions.

Patches to reduce dependencies while improving or maintaining functionality happily accepted.

All the dependencies are included in Gentoo:

emerge dev-python/python-dateutil dev-python/feedparser dev-python/stripogram

and in Debian:

apt-get install zope-stripogram python-dateutil python-feedparser

and probably in most other reasonably sane Linux distros.

  • How do I use it?

grab_rss uses three files, located in either $GRAB_RSS_DIR or ~/.grab_rss (the lookup is sketched after this list):

  • grab_rss.conf: A configuration file (see more about that below)

  • feeds.txt: A list of feed locations, one per line. Like so:

""" http://globalguerrillas.typepad.com/globalguerrillas/atom.xml http://randombit.net/bitbashing/index.atom http://taint.org/feed """

  • seen.db: A sqlite database listing already seen/sent posts
    (you shouldn't need to ever look at this)
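
The directory lookup itself amounts to something like this sketch (not grab_rss's literal code):

"""
import os

def config_dir():
    # $GRAB_RSS_DIR wins if set; otherwise fall back to ~/.grab_rss
    return os.environ.get('GRAB_RSS_DIR') or os.path.expanduser('~/.grab_rss')
"""
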
  • What do I put in grab_rss.conf?

The only required item is what email address you want the output sent to:

""" [GrabRSS] to = [email protected] """

Using the full set of options:

""" [GrabRSS] to = [email protected] from = [email protected] smtp_host = mail.example.net socket_timeout = 10 pool_size = 4 user_agent = Lynx/2.8.7 """

The pool_size option specifies how many processes to use for downloading feeds (using the multiprocessing module, which means you need Python 2.6 or to have installed the backport specially for this to work).

Running several downloads in parallel can substantially speed up grab_rss (for my 60-something feeds, from just under a minute with 1 process to under 10 seconds with 8 procs; further increases in pool size didn't reduce runtimes). The optimal size will depend a lot on your local hardware and network, as well as on how many feeds you are trying to get (and all of their hardware and networks). The default pool size is 0, which means don't use the multiprocessing module at all. Play with the --pool-size option if you want to experiment.
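
The parallel download boils down to a worker pool over feedparser; a minimal sketch of the approach (fetch and fetch_all are illustrative names, not grab_rss's actual internals):

"""
import feedparser
from multiprocessing import Pool

def fetch(url):
    # feedparser downloads and parses in a single call
    return url, feedparser.parse(url)

def fetch_all(urls, pool_size=0):
    if pool_size == 0:
        # the default: no multiprocessing at all
        return [fetch(url) for url in urls]
    pool = Pool(pool_size)
    try:
        return pool.map(fetch, urls)
    finally:
        pool.close()
"""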

The default socket timeout is 30 seconds, which is probably fine unless you have a very wonky network or are running only a single process (in which case a single down server can hang the entire run for the full socket timeout - with multiple processes, useful work still happens while one process waits out the timeout). Be warned that this timeout applies both to pulling down the feeds and to the SMTP connection, though I haven't encountered any problems with that.
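
That shared feed-and-SMTP behaviour suggests the timeout is set process-wide; a sketch of that approach (an assumption about the implementation, not confirmed):

"""
import socket

# applies to every socket opened afterwards in this process,
# feed downloads and SMTP connections alike
socket.setdefaulttimeout(30)
"""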

The SMTP host defaults to localhost which is probably the right thing to do at least a third of the time.

There is currently no provision for SMTP authentication; it might be useful, but I don't need it so I haven't written it.
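
Under the hood this kind of send is plain smtplib; a sketch under those assumptions (send_item is an illustrative name, though the X-GrabRSS-Feed header is real - see the next section):

"""
import smtplib
from email.mime.text import MIMEText

def send_item(to_addr, from_addr, feed_url, subject, body,
              smtp_host='localhost'):
    msg = MIMEText(body, 'plain', 'us-ascii')
    msg['To'] = to_addr
    msg['From'] = from_addr
    msg['Subject'] = subject
    # set to the feed URL so posts can be filtered by source
    msg['X-GrabRSS-Feed'] = feed_url
    server = smtplib.SMTP(smtp_host)
    try:
        server.sendmail(from_addr, [to_addr], msg.as_string())
    finally:
        server.quit()
"""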

  • How do I filter this?

You can filter by the sender (set in grab_rss.conf) and/or the existence of the header X-GrabRSS-Feed. The value of X-GrabRSS-Feed is set to the URL of the feed that this post was from, if you want to filter posts to different mboxes based on source.

In procmail speak:

"""
:0:
* ^X-GrabRSS-Feed:
RSSFeeds
"""

  • Can I reuse this?

Sure. License is stock GPLv2. If you need it under some other license for some reason, contact me.