
New York City Taxi Fare

📖 About

This project implements a complete pipeline for taxi fare prediction in New York City, using an event-based data stream and a data lake for data storage and analysis.


🧪 Technology

The project was developed with:

  • Python
  • Apache Kafka
  • Apache Airflow
  • Apache Spark
  • FastAPI
  • Docker


🔖 Proposed solution to the challenge

๐Ÿ—๏ธ Proposed architecture

๐Ÿ“ Project structure

taxi-fare/
│
├── dags/
│   └── taxi_raides_dag.py
├── data/
│   └── train.csv
├── docker/
│   ├── airflow.dockerfile
│   └── api.dockerfile
├── jars/
│   ├── aws-java-sdk-bundle-1.12.262.jar
│   └── hadoop-aws-3.3.4.jar
├── src/
│   ├── api.py
│   ├── consolidate.py
│   ├── consumer.py
│   ├── producer.py
│   └── utils.py
├── docker-compose.yml
├── requirements.txt
└── README.md
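
The stream side of the pipeline lives in src/producer.py and src/consumer.py. As a rough illustration of the producer idea, here is a minimal sketch that replays rows of train.csv as Kafka events; it assumes kafka-python, a broker at localhost:9092, and a topic named taxi_rides, none of which are confirmed by the repository:

# Sketch only: replay rows of data/train.csv as Kafka events.
# Assumes kafka-python is installed; the broker address and the
# topic name "taxi_rides" are assumptions, not the project's values.
import csv
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with open("data/train.csv", newline="") as f:
    for row in csv.DictReader(f):
        producer.send("taxi_rides", row)

producer.flush()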

🔌 Getting started

Clone the project:

$ git clone https://github.com/GesielLopes/taxi-fare.git

Access the project folder:

$ cd taxi-fare

Download the train.csv file from https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/data and save it in the data folder.

Download aws-java-sdk-bundle-1.12.262.jar from https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.12.262 and save it in the jars folder.

Download hadoop-aws-3.3.4.jar from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.3.4 and save it in the jars folder (or fetch both jars from the command line as shown below).
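
The same artifacts can be pulled directly with wget; these URLs follow Maven Central's standard repository layout:

# Download both jars from Maven Central into the jars folder
$ wget -P jars/ https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
$ wget -P jars/ https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar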

Run Docker Compose to bring the project up:

# Execute docker compose
$ docker compose up -d
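
Once the containers are up, you can verify that every service is running:

# List the running services
$ docker compose ps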

🚀 Using the project

  1. Access the MinIO web client:
    • http://localhost:9000
    • username and password: 'minioadmin'
    • Create the RAW and REFINED buckets to manipulate files as you would on AWS S3 (the Spark sketch after this list shows how they are read and written).
  2. Access the Airflow web client and trigger the taxi_raides_dag DAG.
  3. Access the API (see the next section).
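
The hadoop-aws and aws-java-sdk-bundle jars downloaded during setup are what let Spark read and write the MinIO buckets over the s3a protocol. Here is a minimal sketch of that wiring, assuming the default minioadmin credentials from step 1; bucket and path names are illustrative (lowercased, as S3-style naming requires) and are not taken from the project's actual code:

# Sketch only: a Spark session pointed at MinIO through the s3a
# connector. Endpoint, credentials, and paths mirror the defaults
# above; adjust them to your docker-compose setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("taxi-fare")
    .config("spark.jars", "jars/hadoop-aws-3.3.4.jar,jars/aws-java-sdk-bundle-1.12.262.jar")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read raw events from the raw bucket and write the result to refined.
raw = spark.read.parquet("s3a://raw/rides/")
raw.write.mode("overwrite").parquet("s3a://refined/rides/")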

📕 Using the API

Access the API from the terminal, for example with curl:

$ curl -X 'GET' 'http://localhost:8000/api/' -H 'accept: application/json'

$ curl -X 'GET' 'http://localhost:8000/api/?pickup_date=2011-12-13' -H 'accept: application/json'

$ curl -X 'GET' 'http://localhost:8000/api/?pickup_longitude=-73.9755630493164&pickup_latitude=40.752681732177734' -H 'accept: application/json'

$ curl -X 'GET' 'http://localhost:8000/api/?pickup_date=2011-12-13&pickup_longitude=-73.9755630493164&pickup_latitude=40.752681732177734' -H 'accept: application/json'
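
All three query parameters are optional and can be combined freely. For reference, here is a hypothetical sketch of how such an endpoint could be declared with FastAPI; this is not the project's actual src/api.py, and the response body is a placeholder:

# Sketch only: a FastAPI endpoint with the optional filters used above.
# The real src/api.py presumably queries the refined data instead of
# echoing the parameters back.
from typing import Optional

from fastapi import FastAPI

app = FastAPI()

@app.get("/api/")
def get_fares(
    pickup_date: Optional[str] = None,
    pickup_longitude: Optional[float] = None,
    pickup_latitude: Optional[float] = None,
):
    return {
        "pickup_date": pickup_date,
        "pickup_longitude": pickup_longitude,
        "pickup_latitude": pickup_latitude,
    }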

Or access it directly from a browser:

http://localhost:8000/api

http://localhost:8000/api/?pickup_date=2011-12-13

http://localhost:8000/api/?pickup_longitude=-73.9755630493164&pickup_latitude=40.752681732177734

http://localhost:8000/api/?pickup_date=2011-12-13&pickup_longitude=-73.9755630493164&pickup_latitude=40.752681732177734

The interactive Swagger documentation is also available in the browser:

http://localhost:8000/docs

📋 TODO List

  • Add a .env file for sensitive data (a hypothetical sketch follows this list)
  • Create the project's unit tests
  • Automate bucket creation
  • Automate the data flow in the API when refined data does not exist
  • Add a data science environment

Feel free to open issues or submit pull requests for improvements or fixes.
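
For the first item, a .env file might centralize the credentials that currently appear inline; MINIO_ROOT_USER and MINIO_ROOT_PASSWORD are MinIO's standard variables, while the remaining names are made up for illustration:

# Hypothetical .env sketch; names are not taken from the repository.
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
KAFKA_BOOTSTRAP_SERVERS=kafka:9092
API_PORT=8000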

๐Ÿ“ License

This project is licensed under the MIT License.
