Prescreening Tasks for Data Engineer
APACHE-2.0 License
This document provides the exercises that are part of the prescreening process for Data Engineering positions. As such, their goal is to provide insights into solutioning process and skills application for solving example tasks related to this position. Thus, please make sure to explain the important details and specify any assumption taken.
Assume having a Sales table with the following fields:
And another table named Discounts with the following structure:
For a customer and a sales order there can exist a discount value in the Discounts table, however there is no information about the precise discount value per each item in the Sales table. Thus, for this exercise it is necessary to:
Implement the solution above, preferably using Python (with Pandas), Java, C#, or another programming language. You can also use a pseudo-code instead. Please make sure to provide comments where needed to the solution provided.
I will assume that this is an e-commerce app, and due to the description of the problem this is a design problem of an OLTP system, the exercises are not asking about agregations, data ingestion, transformations and less how to move data between systems, in order to get insights from the preexisted data, there are no references neither of the velocity of the data nor about the tasks scheduled, all the situations mentioned before are features of the OLAP solutions, Hence, I will avoid based my decisions thinking in OLAP systems.
A sales order is a record that a customer can use to initiate or request a sale. In general, the sales order is filled out by the customer to order certain products from a business. The customer pays for these products when they submit the sales order. This document represents the promise that the requested goods will be ordered by the business and delivered to the customer later. When the goods are fulfilled and picked up by the customer, the sales order is used as a receipt to make sure the correct amount and types of products are given to the customer. In simple terms, a sales order is a request from a customer for specific items.
There are two relevant situations in the description:
The initial data model allows assigning a discount per user to each purchased item that appears in the sales table, but there are two problems with this approach:
let's focus in the relevant tables using PostgreSQL
CREATE TABLE IF NOT EXISTS Sales(
sales_order_id int NOT NULL,
sales_order_item int NOT NULL,
customer_id int null,
date timestamp not null,
transaction_value float not null,
discounted_value float not null,
PRIMARY KEY(sales_order_id, sales_order_item, customer_id)
);
CREATE TABLE IF NOT EXISTS Discounts(
sales_order_id int not null,
customer_id int not null,
discount_value float not null
PRIMARY KEY(sales_order_id, customer_id)
);
insert into sales values
(1, 2, 150, NOW(), 200, 200),
(1, 3, 150, NOW(), 310, 310),
(1, 4, 150, NOW(), 80, 80)
insert into discounts values (1, 150, 0.3)
select * from sales
select * from discounts
WITH sales_updated_CTE(sales_order_id,sales_order_item, customer_id, date,transaction_value, discount)
AS
(
SELECT
s.sales_order_id,
s.sales_order_item,
s.customer_id,
s.date,
s.transaction_value,
(s.transaction_value - (s.transaction_value * d.discount_value)) AS discount
FROM sales s
JOIN discounts d
ON s.sales_order_id = d.sales_order_id
WHERE s.sales_order_id = %s AND s.customer_id = %s
)
UPDATE sales as s
SET discounted_value = c.discount
FROM sales_updated_CTE as c
WHERE c.sales_order_id = s.sales_order_id
AND c.customer_id = s.customer_id
AND c.sales_order_item = s.sales_order_item
select * from sales
The second approach (edge case), considered add more than one unit of the same product to the order (quantity) and create different discounts to each item, and update the discount is less complex
CREATE TABLE IF NOT EXISTS Sales2(
sales_order_id int NOT NULL,
customer_id int null,
date timestamp not null,
PRIMARY KEY(sales_order_id, customer_id)
);
CREATE TABLE IF NOT EXISTS Order_detail(
sales_order_id int not null,
sales_order_item int NOT NULL,
transaction_value float not null,
quantity int not null,
discount_value float not null,
discounted_value float not null,
PRIMARY KEY(sales_order_id, sales_order_item)
);
insert into Sales2 values (1, 150, NOW())
insert into Order_detail values
(1, 2, 150, 2, 0.3, 150),
(1, 3, 210, 1, 0.4, 210),
(1, 4, 80, 3, 0.2, 80)
select * from sales2
select * from Order_detail
UPDATE Order_detail
SET discounted_value = (transaction_value - (transaction_value * discount_value)) * quantity
WHERE sales_order_id = %s
select * from Order_detail
Install Docker Desktop on Windows, it will install docker compose as well, docker compose will alow you to run multiple containers applications, this project has two containers with Jupyter Notebook and PostgreSQL
Install git-bash for windows, once installed , open git bash and download this repository, this will download all the folders and the docker-compose.yml file, and other files needed.
ramse@DESKTOP-K6K6E5A MINGW64 /c
$ git clone https://github.com/Wittline/dataengineering-assignment.git
ramse@DESKTOP-K6K6E5A MINGW64 ~/documents/github/dataengineering-assignment/docker (main)
$ docker-compose up
let's wait until all the images and containers are created
ramse@DESKTOP-K6K6E5A MINGW64 ~/documents/github/dataengineering-assignment/docker (main)
$ docker ps
It will show you all the containers and images contained in the docker-compose-yml file, this mean that all the images were created correctly
let's check the docker-compose.yml file
version: '3'
services:
jupyter-notebook:
image: jupyter/minimal-notebook
volumes:
- ./folder:/home/jovyan/work
ports:
- 8888:8888
container_name: jupyter-notebook-container
environment:
- JUPYTER_TOKEN=jupyter
command: jupyter notebook --NotebookApp.iopub_data_rate_limit=3e10
depends_on:
- "db"
db:
container_name: pg_container
image: postgres
restart: always
environment:
POSTGRES_USER: "db"
POSTGRES_PASSWORD: "db"
POSTGRES_DB: "db_db"
ports:
- "5432:5432"
volumes:
- pg_data:/var/lib/postgresql/data/
volumes:
pg_data:
The jupyter notebook of this project is called dataengineering-assignment.ipynb over the folder work, there is a copy of this file in the folder called folder in this repository, please keep this file on its folder.
Now you can run the project step by step using this notebook, the files sql_queries.py and db_engine.py contains all the logic needed