
Mid-Bootcamp project for the Core Code School Big Data & Machine Learning course.

Insights for Palmer Archipelago Penguins

About the Data

Data has been gathered from different sources listed below:


The purpose of this project is educational. It is the first of two projects to be done in the Data Bootcamp at 🍊 Core Code School.

The requirement for the project is to build a data app. The app should have a backend built with Flask, a frontend built with Streamlit, and a database (Postgres or MongoDB).


  • API - Python, Flask, PyMongo
  • Data - MongoDB, Python, Jupyter Notebook, Pandas, mongoshell script
  • Streamlit - Python, Pandas, Streamlit API, Streamlit State API
  • Other tools - Commitizen, GitHub Projects, GitHub Actions, Okteto


The project has 3 main services: Database / API (backend) / Streamlit (frontend). Let's describe these services:


The data service is a custom MongoDB image in which the data is loaded into the database during the init phase.

The original csv source/penguins_lter.csv is transformed into database/docker-entrypoint-initdb.d/seed.json by running

Once the seed for the database is available and the mongo image is built, the mongo-init.js mongo shell script creates the admin and api users, then creates the database with the following collections:


  • kaggle-raw-data - the seed.json itself
  • ng-species-raw-data - the species.json collection from web scraping NG
  • individuals - collection with the measurements of each penguin; each document has pointers to islands, regions, species, studynames
  • islands - collection with the data regarding the island
  • regions - collection with the data regarding the region
  • species - collection with the data regarding the species
  • studynames - collection with the data regarding the study names

These collections are extracted from kaggle-raw-data so that extra data can be added to each of them without changing individuals, which is the main collection.
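
The extraction described above can be sketched in plain Python. This is only an illustration, not the project's actual seed script: the function name `extract_collection`, the integer `_id` values, the `name` field and the sample rows are all assumptions made for the example.

```python
def extract_collection(records, field):
    """Split one field of the raw records into its own lookup collection.

    Returns (documents, individuals): the distinct values of `field` as
    small documents, and a copy of the records where `field` holds a
    pointer (the `_id`) instead of the raw value.
    """
    ids = {}
    for record in records:
        value = record[field]
        if value not in ids:
            # Hypothetical integer ids; a real MongoDB setup would
            # more likely use ObjectId values.
            ids[value] = len(ids) + 1

    documents = [{"_id": doc_id, "name": name} for name, doc_id in ids.items()]
    individuals = [{**record, field: ids[record[field]]} for record in records]
    return documents, individuals


# Made-up sample rows standing in for kaggle-raw-data.
raw = [
    {"Island": "Torgersen", "Body Mass (g)": 3750},
    {"Island": "Biscoe", "Body Mass (g)": 4500},
    {"Island": "Torgersen", "Body Mass (g)": 3800},
]
islands, individuals = extract_collection(raw, "Island")
```

With this shape, extra attributes can be added to the documents in `islands` later without touching any document in `individuals`.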

⚙️ API

The API is the backend service for the Streamlit frontend and the one that communicates with the database. The sub-repo for the API is structured as follows:

  • entry point for the flask server.
  • env variables all in the same place.
  • routes - dir with all the routes, the entry point to the API.
  • controllers - dir with the controllers for each route, responsible for executing the code for that route.
  • libs - utils used throughout the project for different purposes.
  • decorators - custom decorator methods.


  • GET - /<collection> - returns all the documents found for this collection on the database
  • PATCH - /<collection>/<id> - modifies the document <id> of the <collection>. The payload must comply with the collection's fields.


  • handle_error - for each route, this decorator catches the errors and returns a json error response.
  • validate_route - as the root route is based on parameters, this decorator checks that the collection exists in the database; if it does not, it raises an error before the controller is accessed.
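
A minimal sketch of how two such decorators could fit together is shown here. It is not the project's code: the hard-coded `KNOWN_COLLECTIONS` set stands in for the real database lookup, the status codes 404 and 500 are assumptions, and `get_collection` is a hypothetical controller. The sketch relies only on the fact that a Flask view may return a `(dict, status)` tuple, so it runs without Flask installed.

```python
from functools import wraps

# Assumption: stands in for asking MongoDB which collections exist.
KNOWN_COLLECTIONS = {"individuals", "islands", "regions", "species", "studynames"}


def handle_error(view):
    """Catch errors raised by the wrapped view and return a JSON-like error."""
    @wraps(view)
    def wrapper(*args, **kwargs):
        try:
            return view(*args, **kwargs)
        except LookupError as exc:
            return {"error": str(exc)}, 404
        except Exception as exc:
            return {"error": str(exc)}, 500
    return wrapper


def validate_route(view):
    """Reject unknown collections before the controller runs."""
    @wraps(view)
    def wrapper(collection, *args, **kwargs):
        if collection not in KNOWN_COLLECTIONS:
            raise LookupError(f"Unknown collection: {collection}")
        return view(collection, *args, **kwargs)
    return wrapper


@handle_error
@validate_route
def get_collection(collection):
    # Hypothetical controller; a real one would query the database here.
    return {"collection": collection, "documents": []}, 200
```

The order matters: `handle_error` is the outermost decorator so that the error raised by `validate_route` is also turned into an error response.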


  • mongo_client - setup for the MongoDB connection using flask_pymongo.
  • response - utils to return different responses.


This is the service where the data is displayed. This sub-repo is structured as follows:

  • - entry point for the streamlit app
  • utils - dir with methods used throughout the project
  • pages - dir with the pages available in the Streamlit app
  • components - dir with the components used throughout the project
  • api - dir with the methods used to call the backend to retrieve data
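
As an illustration of what a helper in the api dir might look like, here is a small sketch built on the standard library. The function names `build_url` and `fetch_collection`, the default values, and the assumption that API_URL holds the scheme and host without the port are all hypothetical.

```python
import json
import os
from urllib.request import urlopen


def build_url(collection, doc_id=None, base=None, port=None):
    """Build the URL for /<collection> or /<collection>/<id>."""
    # Assumption: API_URL is scheme + host, API_PORT is the port.
    base = base or os.environ.get("API_URL", "http://localhost")
    port = port or os.environ.get("API_PORT", "5000")
    url = f"{base.rstrip('/')}:{port}/{collection}"
    return f"{url}/{doc_id}" if doc_id is not None else url


def fetch_collection(collection):
    """GET every document of a collection from the backend as parsed JSON."""
    with urlopen(build_url(collection), timeout=10) as response:
        return json.load(response)
```

A page could then call `fetch_collection("islands")` and pass the result to Pandas for display.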

Features and Screenshots


You can clone the repo and run docker-compose up.


Env variables needed to run the project

  • MONGO_URI - uri for MongoDB DB (incl. db-name).

  • MONGO_DBNAME - database name where all data will be stored.

  • MONGO_ADMIN_USERNAME - username for the database admin user.

  • MONGO_ADMIN_PASSWORD - password for the database admin user.

  • MONGO_API_USERNAME - username for the database user used in the api.

  • MONGO_API_PASSWORD - password for the database user used in the api.

  • FLASK_DEBUG - flag to run Flask in debug mode, False or True.

  • FLASK_ENV - environment where Flask is running, development.

  • API_URL - url for the API.

  • API_PORT - the port where the API will be available.
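
The variables above could be read and checked in one place with a small loader such as the following. This is a sketch under assumptions, not the project's config module: `load_settings` and `REQUIRED_VARS` are hypothetical names, and converting FLASK_DEBUG to a boolean and API_PORT to an integer is a design choice made for the example.

```python
import os

REQUIRED_VARS = (
    "MONGO_URI",
    "MONGO_DBNAME",
    "MONGO_ADMIN_USERNAME",
    "MONGO_ADMIN_PASSWORD",
    "MONGO_API_USERNAME",
    "MONGO_API_PASSWORD",
    "FLASK_DEBUG",
    "FLASK_ENV",
    "API_URL",
    "API_PORT",
)


def load_settings(env=None):
    """Return the settings as a dict, failing early if any variable is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    settings = {name: env[name] for name in REQUIRED_VARS}
    # Convert the two non-string values once, so callers do not repeat it.
    settings["FLASK_DEBUG"] = settings["FLASK_DEBUG"].lower() == "true"
    settings["API_PORT"] = int(settings["API_PORT"])
    return settings
```

Failing at startup with the full list of missing names is usually easier to debug than a connection error later on.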


Some features have not been included in this first version, so here is some WIP and future work to be done on this repo:

  • Production pipeline for API and Streamlit
  • Refactor MongoDB seed
  • Add Auth to Flask API
  • Enable PDF download of visualizations
  • Add more visualizations


Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.
