Open Source Ecosystems

pywik Welcome to pywik the static webserver csv logfile analyzer. ** Features can filter and list:
- Visited Pages,
- Search Queries,
- Server errors,
- unknown files (not in the good set),
- bot visits,
- 404 pages (malware shows up here),
- visits from tor users (and 404s from tor users),
- external referers
  ** Setup
  install the dependencies (you will need mongodb):
  #+BEGIN_SRC
  virtualenv --no-site-packages env
  source env/bin/activate
  pip install -r deps.txt
  #+END_SRC
  setup environment for pywik:
  #+BEGIN_SRC
  mkdir logs
  ./updatelists.sh
  mkdir data/myhost
  touch data/myhost/goodpaths
  touch data/myhost/ignoremissing
  touch data/myhost/ignorepaths
  touch data/myhost/ownhosts
  touch data/myhost/classes
  #+END_SRC
  ** Host specific files
  pywik uses a few host specific files, which improve the output
  considerably. Create a directory under data with your hostname as the
  name and populate the following files accordingly.
  *** ownhosts
  a list of hostnames that are considered part of your
  infrastructure. Any log entries with referers from other than
  these hosts are considered external hits.
  *** goodpath
  Any path considered a page visit, each line is a regexp.
  *** ignoremissing
  Any path that is regularly generating 404 responses, each line is a regexp.
  *** ignorepaths
  Any path that is uninteresting for tracking pageviews, like all
  requisites for pages (e.g. .css, .js, etc files), each line is a
  regexp.
  *** rss
  Each line is a regexp for an rss/atom feed.
  *** classes
  This file allows you to categorise the entries. The format is the
  following: Each class starts with its name, then pairwise
  fieldnames and regexps. Classes are separated with empty lines.
  #+BEGIN_SRC
  Users
  path
  /user/?id=
Indexed Products path /products/?id= http_user_agent .*Googlebot/2.1 #+END_SRC The above example defines two new classes:
- Users are any entries that start with the path "/user/?id="
- indexed products, certain paths starting with "/products..." and
  are hit by googlebot - notice the double rule one for the path,
  the other for the user agent
  ** Web-server logformat
  set your webserver to use the following logformats, or use:
  #+BEGIN_SRC
  ./ncsa2csv.py <access.log | ./load.py mysite
  #+END_SRC
  to convert from NCSA logs to csv format - note however that this is
  missing some data, that the csv based format provides.
  *** Apache
  For Apache the following should work:
  #+BEGIN_SRC
  LogFormat "%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;0;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x" csv-http
  LogFormat "%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;1;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x" csv-https
  #+END_SRC
  *** nginx
  For nginx the following should work:
  #+BEGIN_SRC
  log_format csv-http '"$time_local";$connection;"$remote_addr";0;"$http_host";"$request";'
  '$status;$request_length;$body_bytes_sent;$request_time;"$http_referer";"$remote_user";'
  '"$http_user_agent";"$http_x_forwarded_for";$msec';
  log_format csv-https '"$time_local";$connection;"$remote_addr";1;"$http_host";"$request";'
  '$status;$request_length;$body_bytes_sent;$request_time;"$http_referer";"$remote_user";'
  '"$http_user_agent";"$http_x_forwarded_for";$msec';
  #+END_SRC
  and for your hosts use them for logging:
  #+BEGIN_SRC
  access_log /var/log/nginx/access.csv csv-http;
  #+END_SRC
  or
  #+BEGIN_SRC
  access_log /var/log/nginx/access.csv csv-https;
  #+END_SRC
  respectively for https hosts stanzas.
  ** Running pywik
  #+BEGIN_SRC
  ./fetchlogs.sh myhost.net
  ./pywik.py month myhost | less
  #+END_SRC
  if you find anything interesting, you can extract all logentries
  matching certain fields:
  #+BEGIN_SRC
  ./getentries.py logs/access.csv myhost path 'cart.php?a=asdf&templatefile=../../../configuration.php'
  #+END_SRC
  Alternatively you can also run pywik as a Flask webapp:
  #+BEGIN_SRC
  ./webapp.py
  #+END_SRC
  Point your browser at http://localhost:5002/myhost/today
  and start clicking around.
  ** Plugins
  You can easily extend the functionality of pywik using
  plugins. Plugins can be
- global if you put them into data/plugins
- or site-specific if you put them in data//plugins
  There are two kind of plugins:
- those that generate queries for filtered listings for output,
- and those that enrich the database with while parsing the logfile
  For examples look into data/plugins, addrapp and tor are
  good canditates for starting off.
  *** Plugin Initialization
  Plugins providing an init(ctx) function, will be able to
  initialize themselves. The param ctx is a dictionary, that
  currently only has one key 'host'.
  *** query plugins
  Query plugins implement a queries() function that returns a list of:
  #+BEGIN_SRC
  ('title', {'field1': value1, 'field2': value2},['displayfield1', 'displayfield2'])
  #+END_SRC
- Where 'title' is the title to be displayed,
- the second elem is a dict containing a mongodb filter expression,
- the final elem is a list of fieldnames to be returned by mongo
  for each matching elements
This can be as simple as:

#+BEGIN_SRC python def queries(): return [('tor', {'tags': ['tor', 'page'], },['path', 'hostname', 'http_user_agent']), ('tor404', {'tags': ['tor'], 'status': 404 },['path', 'hostname', 'http_user_agent'])] #+END_SRC *** loader plugins Loader plugins enrich the information in each log entry during database import. A loader plugin implements a process(entry) interface, that returns the changed entry.

#+BEGIN_SRC python def process(entry): if entry['path']=='/foo': entry['foo']='bar' return entry #+END_SRC

Here's a more advanced example (you can find more in data/plugins) #+BEGIN_SRC python from load import basepath with open('%s/data/torexits.csv' % basepath,'r') as fp: torexits=[x.strip() for x in fp] #print '[tor plugin]', len(torexits), 'torexits loaded'

def process(entry): if entry['remote_addr'] in torexits: entry['tags'].append('tor') return entry #+END_SRC ** Bugs Many, reporting them is encouraged, fixing them very welcome.