Stream processing of website click data using Kafka, monitored and visualised using Prometheus and Grafana
Suppose we run an e-commerce website. A common e-commerce use case is to identify, for every product purchased, the click that led to the purchase. Attribution is the joining of the checkout (purchase) of a product to a click. There are multiple types of attribution; we will focus on First Click Attribution.
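To make the idea concrete, here is a minimal Python sketch of first-click attribution. It assumes a checkout is matched to the earliest click by the same user on the same product at or before the checkout time. The `Click` and `Checkout` classes, the `user_id` and `product_id` fields, and the matching rule are illustrative assumptions, not the project's actual schema; the project itself runs this logic as a Flink job, not in plain Python.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Click:
    click_id: str
    user_id: str      # hypothetical field
    product_id: str   # hypothetical field
    click_time: datetime


@dataclass(frozen=True)
class Checkout:
    checkout_id: str
    user_id: str      # hypothetical field
    product_id: str   # hypothetical field
    checkout_time: datetime


def first_click_attribution(checkout: Checkout, clicks: Iterable[Click]) -> Optional[Click]:
    """Return the earliest matching click, or None if the checkout has no click."""
    candidates = [
        c for c in clicks
        if c.user_id == checkout.user_id
        and c.product_id == checkout.product_id
        and c.click_time <= checkout.checkout_time  # assumed rule: click precedes checkout
    ]
    # First click attribution: the earliest candidate wins.
    return min(candidates, key=lambda c: c.click_time, default=None)


example_clicks = [
    Click("c1", "u1", "p1", datetime(2023, 1, 1, 10, 0)),
    Click("c2", "u1", "p1", datetime(2023, 1, 1, 10, 30)),
    Click("c3", "u2", "p1", datetime(2023, 1, 1, 9, 0)),
]
example_checkout = Checkout("o1", "u1", "p1", datetime(2023, 1, 1, 11, 0))
print(first_click_attribution(example_checkout, example_clicks).click_id)  # c1
```

Swapping `min` for `max` would give last-click attribution instead, which is one of the other attribution types mentioned above.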
Our objectives are:
To run the code, you'll need the following:
If you are using Windows, please set up WSL and a local Ubuntu virtual machine following the instructions here. Install the above prerequisites in your Ubuntu terminal; if you have trouble installing Docker, follow the steps here.
Our streaming pipeline architecture is as follows (from left to right):

1. Application: The website generates click and checkout event data.
2. Queue: The click and checkout data are sent to their corresponding Kafka topics.
3. Stream processing
4. Monitoring & Alerting: Apache Flink metrics are pulled by Prometheus and visualized using Grafana.

We use the Apache Flink Table API to define the job in three steps: source, process, and sink.
We store the SQL DDL and DML in the folders `source`, `process`, and `sink`, corresponding to the above steps.
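As a rough illustration of this layout, the sketch below collects SQL statements from the three folders in step order (`source`, then `process`, then `sink`). The `.sql` extension, the alphabetical ordering within a folder, and the function name `collect_sql_statements` are assumptions made for the example; the project's own loader may work differently, and submitting the statements to Flink is not shown.

```python
from pathlib import Path
from typing import List

# Folder names come from the text above; the order mirrors the pipeline steps.
STEP_FOLDERS = ("source", "process", "sink")


def collect_sql_statements(root: Path) -> List[str]:
    """Read every *.sql file under each step folder, in step order."""
    statements: List[str] = []
    for folder in STEP_FOLDERS:
        # Sorting by file name within a folder is an assumption of this sketch.
        for sql_file in sorted((root / folder).glob("*.sql")):
            statements.append(sql_file.read_text(encoding="utf-8").strip())
    return statements
```

Keeping the sources first matters because the processing and sink statements refer to tables that the source DDL defines.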
Clone and run the streaming job (via terminal) as shown below:
git clone <repository-url> # replace with the URL of this repo
cd streaming_click_data
make run # restart all containers, & start streaming job
Run `make ui` and click on `Jobs -> Running Jobs -> checkout-attribution-job` to see our running job.

To open Grafana, use the `make open` command or go to http://localhost:3000 via your browser (username: `admin`, password: `flink`).

## Check output
Once we start the job, it will run asynchronously. We can open the Flink UI (http://localhost:8081/ or `make ui`) and click on `Jobs -> Running Jobs -> checkout-attribution-job` to see our running job.
We can check the output of our job by looking at the attributed checkouts.
Open a Postgres terminal as shown below.
pgcli -h localhost -p 5432 -U postgres -d postgres
# password: postgres
Use the query below to check that the output updates every few seconds.
SELECT checkout_id, click_id, checkout_time, click_time, user_name
FROM commerce.attributed_checkouts
ORDER BY checkout_time DESC
LIMIT 5;
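If you would rather script this check than rerun the query by hand, a minimal sketch could poll the latest `checkout_time` and report whether it changes. `fetch_latest` is a hypothetical callable that you supply (for example, a function that runs `LATEST_CHECKOUT_SQL` through your Postgres client of choice); the function name, poll count, and interval are assumptions made for the example.

```python
import time
from typing import Any, Callable

# Query built from the table and column used in the check above.
LATEST_CHECKOUT_SQL = "SELECT max(checkout_time) FROM commerce.attributed_checkouts"


def output_is_updating(
    fetch_latest: Callable[[], Any],
    polls: int = 3,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the value returned by fetch_latest changes within `polls` polls."""
    previous = fetch_latest()
    for _ in range(polls):
        sleep(interval_s)  # wait before asking again
        current = fetch_latest()
        if current != previous:
            return True  # a newer checkout arrived, so the sink is being written to
    return False
```

Passing `sleep` as a parameter keeps the function easy to try out with a fake clock instead of real waiting.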
Use `make down` to spin down the containers.
Contributions are welcome. If you would like to contribute, you can help by opening a GitHub issue or putting up a PR.