Master's Final Degree Project on Artificial Intelligence and Big Data
MIT License
Artificial Intelligence and Big Data
The motivation behind the project is to work as a team and bring together everything we've seen; in other words:
Being able to design, research, develop and deploy a Data Science idea, designing a Big Data architecture from which to train a model with a conclusion in mind, while being ethical and complying with EU law.
For a record of the changes, please check out our CHANGELOG.
"Hype" is all you need
This is research into what defines the success of a film, and whether that success can be predicted (proportionally) from the hype (expectation) generated around it. The approach is meant to be extensible to series, anime, video games or any other type of multimedia content.
It is intended, as possible definitions of the success of a film, to be able to predict:
For this, various data sources will be used, such as Twitter, Reddit, YouTube, IMDB, and any others that we discover as the investigation progresses. One of the central components of the application is sentiment analysis, which will be the main focus of the prediction.
For the official documentation visit the /docs folder
Not in a specific order.
Our idea is to have a non-biased model that is not swayed by people's opinions, but can instead tell the difference between the general sentiment and how well that sentiment reflects the movie's success.
Regarding ethics, our goal wouldn't be to force-feed certain movies, nor to dictate what people should do or watch; it would be to offer just another tool to help decide what you may want to see.
All the data will have an origin tag/field to better identify its properties.
Instead of following the classic ETL paradigm (first extract the data, then transform it BEFORE loading it), Data Lakes favor ELT: extract the data, load it FIRST, then transform it when you need to use it.
We'll be using it to store all the (raw) data that we collect over the span of the project. We'll have Diogenes syndrome towards the data: we'd rather delete data later than not have enough.
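A minimal sketch of the ELT idea described above, assuming an in-memory list as a stand-in for the Data Lake (the real storage engine is not fixed here). The names `data_lake`, `load_raw` and `transform_on_read` are hypothetical and only illustrate "load first, transform when needed".

```python
import json
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for the Data Lake (assumption, not the real storage).
data_lake = []


def load_raw(payload: str, origin: str) -> None:
    """Load step: keep the payload untouched and add only minimal metadata."""
    data_lake.append({
        "origin": origin,  # origin tag/field, as described above
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "raw": payload,  # stored exactly as received
    })


def transform_on_read(origin: str) -> list:
    """Transform step: parse raw payloads only when a consumer asks for them."""
    return [json.loads(entry["raw"]) for entry in data_lake if entry["origin"] == origin]


# Extract + Load: nothing is parsed or cleaned at this point.
load_raw('{"text": "Hyped for this movie!"}', origin="twitter")
load_raw('{"title": "Trailer reaction", "score": 120}', origin="reddit")

# Transform: happens later, on demand.
tweets = transform_on_read("twitter")
```

Keeping the raw payload as a string means a parsing mistake can be fixed later without collecting the data again.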
From this point forward we should have quality data, data that is "clean". Following the aforementioned ELT paradigm, a Data Warehouse is where the information will be loaded ONCE Transformed.
It will serve as the main storage for our models. All the data that reaches this point must be clean, standardized, normalized and regularized. It should be as ready as possible for the model.
We're not going to sell anything, but our Product idea is to have a model that retrains with different sources of information, displaying the outcome on the web along with some storytelling around the conclusion.
This is the initial estimation; it should be updated with the real roadmap at the end.
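As an illustration of what "standardized" and "normalized" could mean for a numeric column, here is a small sketch using one common reading of those terms (z-score and min-max scaling). The column `likes` and both function names are hypothetical; the project may end up applying different transformations.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Min-max scaling: map values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric feature, e.g. likes per post.
likes = [10, 20, 30]
scaled = min_max(likes)
z_scores = standardize(likes)
```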
The project has not yet been finished
We've split the product into different phases: the traditional Product phases, plus expanded Data Science development ones:
SCRUM
Pepe
Pepe
Our teachers
Engine Version 20.10
Compose Version 1.29.2
>= 3.6.x
>= v15.14.0
All image versions will be pinned to an exact version in each Dockerfile; avoid the latest tag for security reasons. Upgrades will be manual.
Execute the following command on the folder you want to store the project in
git clone https://github.com/jofaval/tfm-iabd.git
cd tfm-iabd
And now configure the project's branches with Git flow
For Windows
cd tools/windows/git/
git-flow.bat
For Linux
cd tools/linux/git/
./git-flow.sh
Execute the tools/windows/infra/stop.bat
or the tools/linux/infra/stop.sh
file
or execute the following commands on the shell
cd app/infra
docker-compose up -d
Execute the tools/windows/infra/stop.bat
or the tools/linux/infra/stop.sh
file
or execute the following commands on the shell
cd app/infra
docker-compose down
Handled by the GitHub Actions workflow
Name | Role |
---|---|
Diego del Caño | Data Scientist / Data Analyst |
Juan Crespin Valero | Data Analyst / SysAdmin |
Nerea Gluskova | Data Engineer / SysAdmin |
Pepe Fabra Valverde | Data Architect / Data Engineer / Data Scientist |
Table generated with: https://www.tablesgenerator.com/markdown_tables
I (Pepe) will be supervising each task, but we're all out here to help each other.
Defined as Preparation of Docker images, ready and interconnected to support the architecture.
Docker (Docker-compose), Linux, if cloud computing were to be required (AWS, Azure or Google Cloud)
The information regarding the infrastructure is in the Infrastructure section.
Defined as Retrieving all the necessary data for its work. (JUST retrieving data)
Node-RED
Defined as After the data has been retrieved, create a middle ground with the common data that may be needed so that all sources end up with the same Data Model; in other words, standardizing the sources.
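The project plans to do this step in Node-RED; the Python sketch below only illustrates the mapping idea. The `Post` model, the mapper functions and the raw field names (`body`, `user`, `date`, `selftext`, `author`, `created`) are all hypothetical and do not reflect the real API responses of any source.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical common Data Model shared by every source."""
    origin: str
    text: str
    author: str
    created_at: str


def from_tweet(raw: dict) -> Post:
    # Assumed raw shape for a tweet-like record (illustrative field names).
    return Post(origin="twitter", text=raw["body"], author=raw["user"], created_at=raw["date"])


def from_reddit(raw: dict) -> Post:
    # Assumed raw shape for a Reddit-like record (illustrative field names).
    return Post(origin="reddit", text=raw["selftext"], author=raw["author"], created_at=raw["created"])


# One mapper per source; adding a source means adding one entry here.
MAPPERS = {"twitter": from_tweet, "reddit": from_reddit}


def normalize(origin: str, raw: dict) -> Post:
    """Route a raw record to the mapper of its source."""
    return MAPPERS[origin](raw)


post = normalize("reddit", {"selftext": "Great trailer", "author": "u1", "created": "2022-01-01"})
```

Everything downstream can then work with `Post` alone, while the `origin` field keeps track of where each record came from.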
Node-RED
Defined as Storing the normalized data into the NoSQL DB (MongoDB most likely).
Node-RED
Defined as At this point the data has been normalized but not cleaned; after this step, the data should be ready for the Model to train with.
Python (Google Colab?)
Defined as Developing and implementing the required model(s) for the desired performance and outcome.
Artificial Intelligence and/or Machine Learning.
Python (Google Colab?)
Defined as Designing and developing the story (StoryTelling) and all the required/desired visualizations for whatever outcome(s) we want.
PowerBI or Tableau, up to taste.
Defined as Prepare the connections, and proper usage of the model via endpoints and utilities.
Cloud Platform (if used), Git (GitHub)
The license used (MIT License) can be seen here or you can read it locally by downloading the LICENSE file
All the data is used and stored in compliance with current European Union legislation; more precisely, with Spain's laws, which comply with the E.U.'s GDPR (General Data Protection Regulation), and following the standards described in the Charter of European Digital Rights (EDRi, EDR initiative) regarding the use of A.I. for sentiment analysis and, overall, the possible bias it may pass on to the user. The aim is to be ethical and to prepare the model for the coming years.
For more information about the ethics of our model, please refer to the Ethics' section.
We plan to use the extracted data and its accompanying data to better analyze the sentiments of users all around the world about the hype generated by a movie, whether it comes from its announcement, a trailer, or a celebrity talking about it.
By analyzing the general feeling, whether positive, negative, or neutral, we could determine, one user at a time, whether they had a good or bad experience and whether they were hyped or not. We can then steer our model towards the idea people have or had of the movie.
We'll collect the raw text data; if it's a thread, we'll collect as much of it as we can, so we can tokenize, lemmatize, preprocess and prepare the text. Our methodology is to preprocess and clean the data, tokenize it into word embeddings, and use Transformers, maybe Siamese Neural Networks, but surely mT5 HuggingFace BERT to derive a Logic Consequence with NLI so that we can “classify the data”.
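To make the "general feeling" idea concrete, here is a small sketch that aggregates per-user sentiment labels into a single hype score. It assumes the labels have already been produced by some classifier; the scoring scheme (+1, 0, -1) and the names `LABEL_SCORES`, `hype_score` and `label_distribution` are illustrative assumptions, not the project's final method.

```python
from collections import Counter

# Assumed mapping from sentiment label to a numeric score (illustrative only).
LABEL_SCORES = {"positive": 1, "neutral": 0, "negative": -1}


def hype_score(labels):
    """Average sentiment in [-1, 1]; 0.0 when there are no labels."""
    if not labels:
        return 0.0
    return sum(LABEL_SCORES[label] for label in labels) / len(labels)


def label_distribution(labels):
    """Share of each sentiment label among all labels."""
    if not labels:
        return {label: 0.0 for label in LABEL_SCORES}
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in LABEL_SCORES}


# One label per user comment about a hypothetical trailer.
labels = ["positive", "positive", "neutral", "negative"]
score = hype_score(labels)          # 0.25
shares = label_distribution(labels)
```

A plain average treats every user equally; weighting by reach or recency would be a separate design decision.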
Maybe even reviews or the general feeling, in case of adaptations we'd have even more information.
And to display the conclusions obtained thanks to the insight from the extracted data, we’ll use personal websites, GitHub of course, and a Medium article. We’d also like to research and write a paper so that we can more clearly document and explain the results obtained and their conclusions.
As for the tools: Tableau, though maybe we could get PowerBI through a student license; it’s unclear at the moment.
TODO