* pywik
Welcome to pywik, the static webserver CSV logfile analyzer.
** Features
pywik can filter and list:
- Visited Pages,
- Search Queries,
- Server errors,
- unknown files (not in the good set),
- bot visits,
- 404 pages (malware shows up here),
- visits from tor users (and 404s from tor users),
- external referers
** Setup
Install the dependencies (you will need MongoDB):
#+BEGIN_SRC
virtualenv --no-site-packages env
source env/bin/activate
pip install -r deps.txt
#+END_SRC
Set up the environment for pywik:
#+BEGIN_SRC
mkdir logs
./updatelists.sh
mkdir data/myhost
touch data/myhost/goodpaths
touch data/myhost/ignoremissing
touch data/myhost/ignorepaths
touch data/myhost/ownhosts
touch data/myhost/classes
#+END_SRC
** Host specific files
pywik uses a few host-specific files, which improve the output
considerably. Create a directory under data with your hostname as the
name and populate the following files accordingly.
*** ownhosts
A list of hostnames that are considered part of your
infrastructure. Any log entry whose referer is not one of
these hosts is considered an external hit.
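The external-hit rule can be sketched roughly as follows. This is only an illustration in Python 3, not pywik's actual code: the function name is_external and the treatment of empty referers are assumptions.

#+BEGIN_SRC python
from urllib.parse import urlparse

def is_external(referer, ownhosts):
    """Return True if the referer points at a host outside ownhosts."""
    # Empty or '-' referers carry no host; treated as not external (assumption).
    if not referer or referer == '-':
        return False
    host = urlparse(referer).hostname
    # Referers without a parseable host are also treated as not external.
    return host is not None and host not in ownhosts

# hypothetical content of data/myhost/ownhosts
ownhosts = {'myhost.net', 'www.myhost.net'}
print(is_external('https://www.myhost.net/about', ownhosts))         # False
print(is_external('https://search.example.org/?q=pywik', ownhosts))  # True
#+END_SRC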
*** goodpaths
Paths that count as page visits. Each line is a regexp.
*** ignoremissing
Paths that regularly generate 404 responses. Each line is a regexp.
*** ignorepaths
Paths that are uninteresting for tracking pageviews, such as page
requisites (e.g. .css and .js files). Each line is a regexp.
*** rss
Each line is a regexp for an RSS/Atom feed.
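Several of the files above hold one regexp per line. A minimal sketch of how such a file could be applied to a path is shown here. The helper names load_patterns and matches_any are hypothetical, and anchoring the match at the start of the path is an assumption; pywik itself may match differently.

#+BEGIN_SRC python
import re

def load_patterns(lines):
    """Compile one regexp per non-empty line."""
    return [re.compile(line.strip()) for line in lines if line.strip()]

def matches_any(path, patterns):
    """True if any pattern matches from the start of the path."""
    return any(p.match(path) for p in patterns)

# hypothetical contents of data/myhost/ignorepaths
ignorepaths = load_patterns([r'.*\.css$', r'.*\.js$', r'/static/'])
print(matches_any('/static/logo.png', ignorepaths))  # True
print(matches_any('/blog/post-1', ignorepaths))      # False
#+END_SRC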
*** classes
This file allows you to categorise the entries. The format is as
follows: each class starts with its name, followed by pairs of
lines giving a fieldname and a regexp. Classes are separated by empty lines.
#+BEGIN_SRC
Users
path
/user/?id=

Indexed Products
path
/products/?id=
http_user_agent
.*Googlebot/2.1
#+END_SRC
The above example defines two new classes:
- Users: any entries whose path starts with "/user/?id="
- Indexed Products: entries whose path starts with "/products/?id="
and that are hit by Googlebot. Note the two rules, one for the path,
the other for the user agent.
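To make the format concrete, here is a rough sketch of a parser and matcher for it. This is not pywik's own implementation: parse_classes and classify are hypothetical names, requiring all rules of a class to match is inferred from the example above, and the "?" is escaped in this sketch so that Python's re module treats it literally (whether pywik requires that is not stated here).

#+BEGIN_SRC python
import re

def parse_classes(text):
    """Parse the classes format into {name: [(field, compiled_regexp), ...]}."""
    classes = {}
    # Classes are separated by empty lines.
    for block in re.split(r'\n\s*\n', text.strip()):
        lines = [l.strip() for l in block.splitlines() if l.strip()]
        if not lines:
            continue
        name, rest = lines[0], lines[1:]
        # The remaining lines come in (fieldname, regexp) pairs.
        classes[name] = [(rest[i], re.compile(rest[i + 1]))
                         for i in range(0, len(rest) - 1, 2)]
    return classes

def classify(entry, classes):
    """Return the names of all classes whose rules all match the entry."""
    return [name for name, rules in classes.items()
            if all(rx.match(entry.get(field, '')) for field, rx in rules)]

text = r"""Users
path
/user/\?id=

Indexed Products
path
/products/\?id=
http_user_agent
.*Googlebot/2.1
"""
classes = parse_classes(text)
entry = {'path': '/products/?id=3',
         'http_user_agent': 'Mozilla/5.0 (compatible; Googlebot/2.1)'}
print(classify(entry, classes))  # ['Indexed Products']
#+END_SRC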
** Web-server logformat
Set your webserver to use the following logformats, or use:
#+BEGIN_SRC
./ncsa2csv.py <access.log | ./load.py mysite
#+END_SRC
to convert from NCSA logs to CSV format. Note, however, that the
converted logs lack some data that the CSV-based format provides.
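For orientation, the NCSA combined format only carries the client address, user, timestamp, request line, status, response size, referer and user agent, so values such as the request duration or the X-Forwarded-For header are simply not there to convert. The sketch below parses one such line in Python 3. It is an illustration only and says nothing about how ncsa2csv.py works; the group names borrow the nginx variable names used further down.

#+BEGIN_SRC python
import re

# NCSA combined: host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) \S+ (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\S+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"')

def parse_combined(line):
    """Return a dict of the fields in a combined-format line, or None."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

# made-up sample line
line = ('203.0.113.7 - - [10/Oct/2013:13:55:36 +0200] "GET /index.html HTTP/1.1" '
        '200 2326 "http://example.org/start" "Mozilla/5.0"')
entry = parse_combined(line)
print(entry['request'], entry['status'])
#+END_SRC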
*** Apache
For Apache the following should work:
#+BEGIN_SRC
LogFormat "%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;0;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x" csv-http
LogFormat "%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;1;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x" csv-https
#+END_SRC
*** nginx
For nginx the following should work:
#+BEGIN_SRC
log_format csv-http '"$time_local";$connection;"$remote_addr";0;"$http_host";"$request";'
'$status;$request_length;$body_bytes_sent;$request_time;"$http_referer";"$remote_user";'
'"$http_user_agent";"$http_x_forwarded_for";$msec';
log_format csv-https '"$time_local";$connection;"$remote_addr";1;"$http_host";"$request";'
'$status;$request_length;$body_bytes_sent;$request_time;"$http_referer";"$remote_user";'
'"$http_user_agent";"$http_x_forwarded_for";$msec';
#+END_SRC
Then use them for logging in your hosts:
#+BEGIN_SRC
access_log /var/log/nginx/access.csv csv-http;
#+END_SRC
or
#+BEGIN_SRC
access_log /var/log/nginx/access.csv csv-https;
#+END_SRC
for http and https host stanzas, respectively.
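A line written in the nginx format above can be read back with Python's csv module, using ";" as the delimiter. The following is a minimal Python 3 sketch, not the code of load.py: the field names follow the nginx variables, "https" is a hypothetical name for the 0/1 column, and the sample line is made up.

#+BEGIN_SRC python
import csv
import io

# column order of the nginx csv-http / csv-https formats
FIELDS = ['time_local', 'connection', 'remote_addr', 'https', 'http_host',
          'request', 'status', 'request_length', 'body_bytes_sent',
          'request_time', 'http_referer', 'remote_user', 'http_user_agent',
          'http_x_forwarded_for', 'msec']

def parse_csv_log(text):
    """Turn semicolon-separated log lines into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter=';', quotechar='"')
    return [dict(zip(FIELDS, row)) for row in reader if row]

sample = ('"10/Oct/2013:13:55:36 +0200";42;"203.0.113.7";0;"myhost.net";'
          '"GET /index.html HTTP/1.1";200;312;2326;0.004;"-";"-";'
          '"Mozilla/5.0";"-";1381406136.123\n')
entries = parse_csv_log(sample)
print(entries[0]['status'], entries[0]['http_host'])  # 200 myhost.net
#+END_SRC

All values come back as strings, so numeric fields such as status would still need converting before being compared with numbers.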
** Running pywik
#+BEGIN_SRC
./fetchlogs.sh myhost.net
./pywik.py month myhost | less
#+END_SRC
If you find anything interesting, you can extract all log entries
matching certain fields:
#+BEGIN_SRC
./getentries.py logs/access.csv myhost path 'cart.php?a=asdf&templatefile=../../../configuration.php'
#+END_SRC
Alternatively you can also run pywik as a Flask webapp:
#+BEGIN_SRC
./webapp.py
#+END_SRC
Point your browser at http://localhost:5002/myhost/today
and start clicking around.
** Plugins
You can easily extend the functionality of pywik using
plugins. Plugins can be
- global if you put them into data/plugins
- or site-specific if you put them in data//plugins
There are two kinds of plugins:
- those that generate queries for filtered listings for output,
- and those that enrich the database while parsing the logfile
For examples look into data/plugins; addrapp and tor are
good candidates for starting off.
*** Plugin Initialization
Plugins that provide an init(ctx) function can initialize
themselves. The parameter ctx is a dictionary that
currently has only one key, 'host'.
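As a rough illustration, a plugin could use init(ctx) to remember where the host-specific files live. The module-level dict state and the key 'datadir' are hypothetical names chosen for this sketch; only init(ctx) and the 'host' key come from the description above.

#+BEGIN_SRC python
import os

# module-level storage for values computed during initialization
state = {}

def init(ctx):
    """Remember the data directory of the host being processed."""
    # host-specific files live in a directory under data named after the host
    state['datadir'] = os.path.join('data', ctx['host'])

init({'host': 'myhost'})
print(state['datadir'])
#+END_SRC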
*** query plugins
Query plugins implement a queries() function that returns a list of:
#+BEGIN_SRC
('title', {'field1': value1, 'field2': value2},['displayfield1', 'displayfield2'])
#+END_SRC
- where 'title' is the title to be displayed,
- the second element is a dict containing a mongodb filter expression,
- the final element is a list of fieldnames to be returned by mongo
for each matching element.
This can be as simple as:
#+BEGIN_SRC python
def queries():
    # each tuple: (title, mongodb filter, fields to display)
    return [('tor', {'tags': ['tor', 'page']}, ['path', 'hostname', 'http_user_agent']),
            ('tor404', {'tags': ['tor'], 'status': 404}, ['path', 'hostname', 'http_user_agent'])]
#+END_SRC
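Since the second element is a MongoDB filter expression, query operators such as $gte or $all should be usable as well, provided pywik hands the dict to MongoDB unchanged (an assumption). A plain list value is compared by MongoDB against the whole array, whereas $all matches entries whose array contains every listed value. The plugin below is a hypothetical sketch; it assumes status is stored as an integer, as the tor404 example suggests, and that tags is a list.

#+BEGIN_SRC python
def queries():
    fields = ['path', 'status', 'remote_addr']
    return [
        # any response with a 5xx status code
        ('server errors', {'status': {'$gte': 500, '$lt': 600}}, fields),
        # entries whose tags include both values, possibly among others
        ('tor pages', {'tags': {'$all': ['tor', 'page']}}, fields),
    ]
#+END_SRC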
*** loader plugins
Loader plugins enrich the information in each log entry during
database import. A loader plugin implements a process(entry)
interface, that returns the changed entry.
#+BEGIN_SRC python
def process(entry):
    # add a custom field to every entry for the path /foo
    if entry['path'] == '/foo':
        entry['foo'] = 'bar'
    return entry
#+END_SRC
Here's a more advanced example (you can find more in data/plugins):
#+BEGIN_SRC python
from load import basepath

# load the list of tor exit node addresses once, at import time
with open('%s/data/torexits.csv' % basepath, 'r') as fp:
    torexits = [x.strip() for x in fp]
#print '[tor plugin]', len(torexits), 'torexits loaded'

def process(entry):
    # tag entries whose client address is a known tor exit node
    if entry['remote_addr'] in torexits:
        entry['tags'].append('tor')
    return entry
#+END_SRC
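The same interface can also be used to tag entries by user agent. The following is a hypothetical loader plugin, not one shipped with pywik: the tag name 'bot' and the short list of crawler markers are assumptions made for this sketch.

#+BEGIN_SRC python
import re

# hypothetical, deliberately short list of crawler markers
BOT_RX = re.compile(r'(Googlebot|bingbot|crawler|spider)', re.IGNORECASE)

def process(entry):
    """Append a 'bot' tag when the user agent looks like a crawler."""
    agent = entry.get('http_user_agent') or ''
    if BOT_RX.search(agent):
        # setdefault guards against entries without a tags list (assumption)
        entry.setdefault('tags', []).append('bot')
    return entry
#+END_SRC

Entries that do not match are returned unchanged, so such a plugin can run alongside the tor plugin above.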
** Bugs
Many. Reporting them is encouraged, and fixing them is very welcome.