Mid-bootcamp project for the Core Code School Big Data & Machine Learning course.
MIT License
Data has been gathered from the different sources listed below.
The purpose of this project is educational. It is the first of two projects to be done in the Data Bootcamp at 🍊 Core Code School.
The requirement for the project is to build a data app. This app should have a backend built with Flask, a frontend built with Streamlit, and a database (Postgres or MongoDB).
The project has three main services: Database / API (backend) / Streamlit (frontend). Let's describe each of these services.
The data service is a custom MongoDB image in which the data is loaded into the database during the init phase. The original CSV, `source/penguins_lter.csv`, is transformed into `database/docker-entrypoint-initdb.d/seed.json` by running `generate-seed-data.py`.
Once the seed for the database is in place and the Mongo image is built, the `mongo-init.js` mongosh script creates the admin and API users, and creates the database with the following collections:
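The CSV-to-seed step can be sketched in a few lines. This is a minimal guess at what `generate-seed-data.py` does (the actual script may clean or rename columns along the way); the function name and paths here are illustrative:

```python
import csv
import json
from pathlib import Path

def generate_seed(csv_path, json_path):
    """Read the raw penguins CSV and write it back out as a JSON
    array of row-dicts, ready for the Mongo init phase to import."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    Path(json_path).write_text(json.dumps(rows, indent=2), encoding="utf-8")
    return rows
```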
- `kaggle-raw-data` - the `seed.json` itself.
- `ng-species-raw-data` - the `species.json` collection from web-scraping NG.
- `individuals` - collection with each penguin's measurement information; each document has pointers to `islands`, `regions`, `species`, and `studynames`.
- `islands` - collection with the data regarding the islands.
- `regions` - collection with the data regarding the regions.
- `species` - collection with the data regarding the species.
- `studynames` - collection with the data regarding the study names.

These collections are extracted from `kaggle-raw-data` so that extra data can be added to each of them without changing `individuals`, which is the main collection.
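To make the pointer structure concrete, here is a hypothetical `individuals` document and a tiny helper that denormalises it against the lookup collections. All field names and values below are illustrative guesses, not taken from the actual repo:

```python
# Hypothetical shape of one `individuals` document: measurements plus
# *_id pointers into the lookup collections described above.
individual = {
    "_id": "N1A1",
    "culmen_length_mm": 39.1,
    "flipper_length_mm": 181,
    "body_mass_g": 3750,
    "island_id": "torgersen",
    "region_id": "anvers",
    "species_id": "adelie",
    "studyname_id": "PAL0708",
}

def resolve(doc, lookups):
    """Replace each <name>_id pointer with the referenced document,
    when a lookup table for <name> is provided; leave it as-is otherwise."""
    out = {}
    for field, value in doc.items():
        if field.endswith("_id") and field != "_id":
            name = field[:-3]  # e.g. "species_id" -> "species"
            out[name] = lookups.get(name, {}).get(value, value)
        else:
            out[field] = value
    return out
```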
The API is the backend service for the Streamlit frontend and the one that communicates with the database. The sub-repo for the API is structured as follows:

- `main.py` - entry point for the Flask server.
- `config.py` - env variables, all in the same place.
- `routes` - dir with all the routes, the entry points to the API.
- `controllers` - dir with the controllers for each route, responsible for executing the code for that route.
- `libs` - utils used throughout the project for different purposes.
- `decorators` - custom decorator methods.

The API exposes the following endpoints:

- `GET /<collection>` - returns all the documents found for this collection in the database.
- `PATCH /<collection>/<id>` - modifies the document `<id>` of the `<collection>`. The payload should be compliant with the collection's fields.
The custom decorators and libs:

- `handle_error` - for each route, this decorator catches errors and returns a JSON error response.
- `validate_route` - since the root route is parameter-based, this decorator checks that the collection exists in the database; if not, it raises an error before the controller is reached.
- `mongo_client` - setup for the MongoDB connection using `flask_pymongo`.
- `response` - utils for returning the different responses.
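The decorator pattern described in this section can be sketched framework-agnostically. The names below mirror the repo's `handle_error` and `validate_route`, but the bodies are illustrative guesses (plain tuples instead of Flask responses, a dict standing in for the database), not the actual implementation:

```python
import functools

# Collections known to exist in the database (from this README).
VALID_COLLECTIONS = {"individuals", "islands", "regions", "species", "studynames"}

def handle_error(fn):
    """Catch any exception raised by the controller and return a
    JSON-style error payload with a 400 status instead of crashing."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs), 200
        except Exception as exc:
            return {"error": str(exc)}, 400
    return wrapper

def validate_route(fn):
    """Reject unknown collection names before the controller runs."""
    @functools.wraps(fn)
    def wrapper(collection, *args, **kwargs):
        if collection not in VALID_COLLECTIONS:
            raise ValueError(f"unknown collection: {collection}")
        return fn(collection, *args, **kwargs)
    return wrapper

@handle_error
@validate_route
def get_collection(collection, db):
    # In the real API this would be a pymongo find(); here `db` is a
    # plain dict of lists standing in for the database.
    return list(db[collection])
```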
The Streamlit frontend is the service where the data is displayed. Its sub-repo is structured as follows:

- `main.py` - entry point for the Streamlit app.
- `utils` - dir with methods used throughout the project.
- `pages` - dir with the pages available in the Streamlit app.
- `components` - dir with the components used throughout the project.
- `api` - dir with the methods used to call the backend and retrieve data.

You can clone the repo and run `docker-compose up`.
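A helper in the frontend's `api` dir might look like the sketch below. The function name, the injectable `opener`, and the default URL are all assumptions for illustration, not the repo's actual code:

```python
import json
import os
from urllib.request import urlopen

def fetch_collection(collection, base_url=None, opener=urlopen):
    """Fetch every document of `collection` from the backend API.

    `opener` is injectable so the helper can be exercised without a
    running server; by default it is urllib's `urlopen`.
    """
    base_url = base_url or os.environ.get("API_URL", "http://localhost:5000")
    with opener(f"{base_url}/{collection}") as resp:
        return json.loads(resp.read())
```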
Env variables needed to run the project:

- `MONGO_URI` - URI for the MongoDB database (incl. db name).
- `MONGO_DBNAME` - database name where all data will be stored.
- `MONGO_ADMIN_USERNAME` - username for the database admin user.
- `MONGO_ADMIN_PASSWORD` - password for the database admin user.
- `MONGO_API_USERNAME` - username for the database user used by the API.
- `MONGO_API_PASSWORD` - password for the database user used by the API.
- `FLASK_DEBUG` - flag to run Flask in debug mode, `False` or `True`.
- `FLASK_ENV` - environment where Flask is running, e.g. `development`.
- `API_URL` - URL for the API.
- `API_PORT` - the port where the API will be available.
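A `.env` file covering these variables might look like the following. Every value is an example only; in particular the `db` and `api` hostnames assume docker-compose service names, which may differ in the actual compose file:

```shell
# Example values only - adjust for your own deployment.
MONGO_DBNAME=penguins
MONGO_URI=mongodb://db:27017/penguins
MONGO_ADMIN_USERNAME=admin
MONGO_ADMIN_PASSWORD=change-me
MONGO_API_USERNAME=api
MONGO_API_PASSWORD=change-me-too
FLASK_DEBUG=True
FLASK_ENV=development
API_URL=http://api:5000
API_PORT=5000
```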
Some features have not been included in this first version, so there is some WIP and future work to be done on this repo:
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.