pywik

webserver log analyzer

Stars
6
Committers
1
  • pywik Welcome to pywik the static webserver csv logfile analyzer. ** Features can filter and list:

    • Visited Pages,
    • Search Queries,
    • Server errors,
    • unknown files (not in the good set),
    • bot visits,
    • 404 pages (malware shows up here),
    • visits from tor users (and 404s from tor users),
    • external referers
      ** Setup
      install the dependencies (you will need mongodb):
      #+BEGIN_SRC
      virtualenv --no-site-packages env
      source env/bin/activate
      pip install -r deps.txt
      #+END_SRC
      setup environment for pywik:
      #+BEGIN_SRC
      mkdir logs
      ./updatelists.sh
      mkdir data/myhost
      touch data/myhost/goodpaths
      touch data/myhost/ignoremissing
      touch data/myhost/ignorepaths
      touch data/myhost/ownhosts
      touch data/myhost/classes
      #+END_SRC
      ** Host specific files
      pywik uses a few host specific files, which improve the output
      considerably. Create a directory under data with your hostname as the
      name and populate the following files accordingly.
      *** ownhosts
      a list of hostnames that are considered part of your
      infrastructure. Any log entries with referers from other than
      these hosts are considered external hits.
      *** goodpath
      Any path considered a page visit, each line is a regexp.
      *** ignoremissing
      Any path that is regularly generating 404 responses, each line is a regexp.
      *** ignorepaths
      Any path that is uninteresting for tracking pageviews, like all
      requisites for pages (e.g. .css, .js, etc files), each line is a
      regexp.
      *** rss
      Each line is a regexp for an rss/atom feed.
      *** classes
      This file allows you to categorise the entries. The format is the
      following: Each class starts with its name, then pairwise
      fieldnames and regexps. Classes are separated with empty lines.
      #+BEGIN_SRC
      Users
      path
      /user/?id=

    Indexed Products path /products/?id= http_user_agent .*Googlebot/2.1 #+END_SRC The above example defines two new classes:

    • Users are any entries that start with the path "/user/?id="
    • indexed products, certain paths starting with "/products..." and
      are hit by googlebot - notice the double rule one for the path,
      the other for the user agent
      ** Web-server logformat
      set your webserver to use the following logformats, or use:
      #+BEGIN_SRC
      ./ncsa2csv.py <access.log | ./load.py mysite
      #+END_SRC
      to convert from NCSA logs to csv format - note however that this is
      missing some data, that the csv based format provides.
      *** Apache
      For Apache the following should work:
      #+BEGIN_SRC
      LogFormat "%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;0;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x" csv-http
      LogFormat "%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;1;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x" csv-https
      #+END_SRC
      *** nginx
      For nginx the following should work:
      #+BEGIN_SRC
      log_format csv-http '"$time_local";$connection;"$remote_addr";0;"$http_host";"$request";'
      '$status;$request_length;$body_bytes_sent;$request_time;"$http_referer";"$remote_user";'
      '"$http_user_agent";"$http_x_forwarded_for";$msec';
      log_format csv-https '"$time_local";$connection;"$remote_addr";1;"$http_host";"$request";'
      '$status;$request_length;$body_bytes_sent;$request_time;"$http_referer";"$remote_user";'
      '"$http_user_agent";"$http_x_forwarded_for";$msec';
      #+END_SRC
      and for your hosts use them for logging:
      #+BEGIN_SRC
      access_log /var/log/nginx/access.csv csv-http;
      #+END_SRC
      or
      #+BEGIN_SRC
      access_log /var/log/nginx/access.csv csv-https;
      #+END_SRC
      respectively for https hosts stanzas.
      ** Running pywik
      #+BEGIN_SRC
      ./fetchlogs.sh myhost.net
      ./pywik.py month myhost | less
      #+END_SRC
      if you find anything interesting, you can extract all logentries
      matching certain fields:
      #+BEGIN_SRC
      ./getentries.py logs/access.csv myhost path 'cart.php?a=asdf&templatefile=../../../configuration.php'
      #+END_SRC
      Alternatively you can also run pywik as a Flask webapp:
      #+BEGIN_SRC
      ./webapp.py
      #+END_SRC
      Point your browser at http://localhost:5002/myhost/today
      and start clicking around.
      ** Plugins
      You can easily extend the functionality of pywik using
      plugins. Plugins can be
    • global if you put them into data/plugins
    • or site-specific if you put them in data//plugins
      There are two kind of plugins:
    • those that generate queries for filtered listings for output,
    • and those that enrich the database with while parsing the logfile
      For examples look into data/plugins, addrapp and tor are
      good canditates for starting off.
      *** Plugin Initialization
      Plugins providing an init(ctx) function, will be able to
      initialize themselves. The param ctx is a dictionary, that
      currently only has one key 'host'.
      *** query plugins
      Query plugins implement a queries() function that returns a list of:
      #+BEGIN_SRC
      ('title', {'field1': value1, 'field2': value2},['displayfield1', 'displayfield2'])
      #+END_SRC
    • Where 'title' is the title to be displayed,
    • the second elem is a dict containing a mongodb filter expression,
    • the final elem is a list of fieldnames to be returned by mongo
      for each matching elements

    This can be as simple as:

    #+BEGIN_SRC python def queries(): return [('tor', {'tags': ['tor', 'page'], },['path', 'hostname', 'http_user_agent']), ('tor404', {'tags': ['tor'], 'status': 404 },['path', 'hostname', 'http_user_agent'])] #+END_SRC *** loader plugins Loader plugins enrich the information in each log entry during database import. A loader plugin implements a process(entry) interface, that returns the changed entry.

    #+BEGIN_SRC python def process(entry): if entry['path']=='/foo': entry['foo']='bar' return entry #+END_SRC

    Here's a more advanced example (you can find more in data/plugins) #+BEGIN_SRC python from load import basepath with open('%s/data/torexits.csv' % basepath,'r') as fp: torexits=[x.strip() for x in fp] #print '[tor plugin]', len(torexits), 'torexits loaded'

def process(entry): if entry['remote_addr'] in torexits: entry['tags'].append('tor') return entry #+END_SRC ** Bugs Many, reporting them is encouraged, fixing them very welcome.

Related Projects