Building an Amazon Prime content-based Movie Recommender System

TF-IDF, Cosine similarity, BM25, BERT

Check the article here: Building an Amazon Prime content-based Movie Recommender System

The aim of this article is to show how to quickly build a content-based recommendation system. When you select a movie on a platform such as Amazon Prime or Netflix, you will notice that it also shows you similar movies you might like. This document explains and implements three approaches to computing those similarities from the description of each movie:

TF-IDF and Cosine Similarity

TF-IDF (term frequency-inverse document frequency) is a traditional count-based feature engineering strategy for textual data and belongs to the bag-of-words family of models. Although it is very effective for extracting features from text, it loses information such as semantics and the context around each word. Once the raw corpus has been processed with TF-IDF, we calculate the pairwise document similarities using the cosine similarity metric. The result of this last step is the information the recommender needs.
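To make the two steps concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is not the code used by the recommender (which relies on scikit-learn); the toy descriptions and the function names tfidf_vectors and cosine are made up for illustration, and the idf uses the smoothed form ln((1 + n) / (1 + df)) + 1.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (raw tf * smoothed idf)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy descriptions, already lower-cased and tokenized.
docs = [
    "hero fights crime in a dark city".split(),
    "masked hero fights injustice in the city".split(),
    "two friends open a bakery".split(),
]
vectors = tfidf_vectors(docs)
# similarity[i][j] is the cosine similarity between document i and document j.
similarity = [[cosine(a, b) for b in vectors] for a in vectors]
print(similarity[0])
```

The first two descriptions share several words, so their similarity is much higher than the similarity between the first and the third, which only share the word "a".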

BM25

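BM25 is an improved version of TF-IDF and usually gives more relevant similarities than TF-IDF followed by cosine similarity. Its score does not keep growing in proportion to how often a word appears in a document: the contribution of term frequency saturates and document length is taken into account, which tends to return more realistic results.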

Figure: score as a function of word frequency. With TF-IDF (red), the frequency of a word keeps influencing the score; BM25 (blue) limits the influence of word frequency.
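The limiting effect can be seen in the term-frequency part of the Okapi BM25 formula. The sketch below leaves out the idf factor and uses k1 = 1.5 and b = 0.75, which are commonly used values but an assumption here; the function name bm25_tf_component is made up for illustration.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Term-frequency part of the Okapi BM25 score (the idf factor is left out)."""
    # Documents longer than average are penalized, shorter ones are boosted.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# For a document of average length, the value grows with tf but never reaches k1 + 1.
for tf in (1, 2, 5, 10, 50):
    print(tf, round(bm25_tf_component(tf, doc_len=100, avg_doc_len=100), 3))
```

With raw TF-IDF, a word that appears 50 times weighs 50 times more than a word that appears once; here the value stays below k1 + 1 = 2.5 no matter how often the word is repeated.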

BERT

BERT represents each document as a dense vector: most of its components are non-zero, unlike the mostly-zero rows of a TF-IDF matrix, so the vector carries much more information per document. Internally, BERT uses many encoder layers to generate this dense vector, which leads to a more meaningful representation of the text and its semantics. The second step in this approach is to calculate the pairwise document similarities using cosine similarity. The result of this last step is the information the recommender needs.

Amazon Prime Movies Dataset

This dataset with 7261 records contains a list of all the movies streaming on the Amazon Prime platform in India.

https://www.kaggle.com/padhmam/amazon-prime-movies

CODE

The following code is the main class of the recommender. Examples of how to use it are shown further on.

import pandas as pd
import re
import numpy as np
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
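# Assumption: BM25 is the class from an older gensim 3.x release (gensim.summarization was removed in gensim 4.0);
# the get_scores(doc, avg_idf) call used below matches that older API.
from gensim.summarization.bm25 import BM25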

class MovieRecommender:

    def __init__(self, filename, columns, t_column, d_column):
        self.filename = filename
        self.columns = columns
        self.title_column = t_column
        self.description_column = d_column
        self.df = None

    def process(self, show=True):
        self.df = pd.read_csv(self.filename)
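        self.df = self.df[self.columns].copy()
        # Assign the result back: fillna(inplace=True) on a column selection may not update the DataFrame in recent pandas.
        self.df[self.description_column] = self.df[self.description_column].fillna('')
        # Prepend the title so it also contributes to the text features.
        self.df[self.description_column] = self.df[self.title_column] + '. ' + self.df[self.description_column].map(str)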
        self.df.dropna(inplace=True)
        self.df.drop_duplicates(inplace=True)
        return self.df

    def show_df_records(self, n = 5):
        return self.df.head(n)

    def show_info_details(self):
        return self.df.info()

    def __normalize(self, d):
        stopwords = set(nltk.corpus.stopwords.words('english'))
        # flags must be passed by keyword: the fourth positional argument of re.sub is count.
        d = re.sub(r'[^a-zA-Z0-9\s]', '', d, flags=re.I | re.A)
        d = d.lower().strip()
        tks = nltk.word_tokenize(d)
        f_tks = [t for t in tks if t not in stopwords]
        return ' '.join(f_tks)

    def get_normalized_corpus(self, tokens=False):
        n_corpus = np.vectorize(self.__normalize)
        norm_corpus = n_corpus(list(self.df[self.description_column]))
        if tokens:
            # Return a plain list: the documents have different lengths, so they do not fit a regular NumPy array.
            return [nltk.word_tokenize(d) for d in norm_corpus]
        return norm_corpus
            
    def get_features(self, norm_corpus):
        tf_idf = TfidfVectorizer(ngram_range=(1,2), min_df=2)
        tfidf_array = tf_idf.fit_transform(norm_corpus)
        return tfidf_array
    
    def get_vector_cosine(self, tfidf_array):
        return pd.DataFrame(cosine_similarity(tfidf_array))

    def get_bm25_weights(self, corpus):

        bm25 = BM25(corpus)
        avg_idf = sum(float(val) for val in bm25.idf.values()) / len(bm25.idf)
        weights = []
        for doc in corpus:
            scores = bm25.get_scores(doc, avg_idf)
            weights.append(scores)
            
        return pd.DataFrame(weights)
        
    def get_bert_weights(self, corpus):
        model = SentenceTransformer('bert-base-nli-mean-tokens')
        vectors = model.encode(corpus)
        weights = pd.DataFrame(cosine_similarity(vectors))
        
        return weights
    
    def search_movies_by_term(self, term='movie'):
        movies = self.df[self.title_column].values
        possible_options = [(i, movie) for i, movie in enumerate(movies) for word in movie.split(' ') if word == term]
        return possible_options
    
    def recommendation(self, index, vector, n):
        similarities = vector.iloc[index].values
        similar_indices = np.argsort(-similarities)[1:n + 1]
        movies = self.df[self.title_column].values
        similar_movies =  movies[similar_indices]
        return similar_movies

TF-IDF and Cosine Similarity

The MovieRecommender class contains all the methods needed to read the dataset, clean the text, and create the weights for each approach.

mr = MovieRecommender('archive.zip', ['Movie Name', 'Plot'], 'Movie Name', 'Plot')
df = mr.process()
mr.show_df_records(5)

The method get_normalized_corpus cleans the text and removes stopwords. If you pass True as the parameter, it returns the words (tokens) of each document instead of a single string per document.

norm_corpus = mr.get_normalized_corpus()
norm_corpus[:3]
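As a rough illustration of what this cleaning step does to a single description, here is a minimal sketch that does not need NLTK. The tiny stopword set and the function name normalize are assumptions made for the example; the class itself uses NLTK's tokenizer and its full English stopword list.

```python
import re

# Tiny hard-coded stopword list, for illustration only (NLTK's English list is much longer).
STOPWORDS = {"a", "an", "the", "of", "in", "to", "his", "and", "is"}

def normalize(text, stopwords=STOPWORDS):
    """Strip punctuation, lower-case, split on whitespace and drop stopwords."""
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    tokens = text.lower().split()
    return " ".join(token for token in tokens if token not in stopwords)

print(normalize("In the wake of his parents' murder, Bruce Wayne travels the world."))
# prints: wake parents murder bruce wayne travels world
```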

The method get_features vectorizes the documents with TF-IDF, converting the words to numerical values that take the frequency of each word into account.

tfidf_array = mr.get_features(norm_corpus)
tfidf_array.shape
(7507,28013)

The method get_vector_cosine returns the pairwise cosine similarity between documents.

vector_cosine = mr.get_vector_cosine(tfidf_array)
vector_cosine.head()

The following additional method is useful for finding movies to experiment with. In this case I searched for Batman, and it returned the matching titles and their ids.

mr.search_movies_by_term('Batman')
[(1029, 'Batman v Superman: Dawn of Justice'), (5560, 'Batman Begins')]

The recommendation method looks up recommendations in the matrix of weights. Note that it receives the id of the movie, the matrix of weights, and the number of recommendations expected.

movies_recommended  = mr.recommendation(5560, vector_cosine, 3)
print(movies_recommended)
['Pratibad' 'Bhagat Singh Ki Udeek' 'Batman v Superman: Dawn of Justice']
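The recommendation method sorts one row of the matrix and skips the first position, on the assumption that the movie itself is always its own best match. That may not always hold, for example with duplicated descriptions or with scores that are not normalized. A minimal sketch of a ranking that excludes the query movie explicitly could look like this; the function name top_n_similar, the titles and the scores are hypothetical.

```python
def top_n_similar(scores, query_index, n):
    """Return the indices of the n highest scores, skipping the query itself."""
    candidates = [i for i in range(len(scores)) if i != query_index]
    # list.sort is stable, so ties keep their original (index) order.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:n]

titles = ["Batman Begins", "The Dark Knight", "Bakery Tales", "Dune"]  # hypothetical catalogue
row = [1.0, 0.62, 0.05, 0.31]  # hypothetical similarity of every title to "Batman Begins"
print([titles[i] for i in top_n_similar(row, query_index=0, n=2)])
# prints: ['The Dark Knight', 'Dune']
```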

The following calls look up the description of each movie:

df[df['Movie Name'] == 'Pratibad' ].values

df[df['Movie Name'] == 'Bhagat Singh Ki Udeek' ].values

df[df['Movie Name'] == 'Batman Begins' ].values

BM25

BM25 expects to receive each document as a list of tokens.

norm_corpus_tokens = mr.get_normalized_corpus(True)
norm_corpus_tokens[:3]

wts = mr.get_bm25_weights(norm_corpus_tokens)
bm25_wts_df = pd.DataFrame(wts)
bm25_wts_df.head()
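For readers who want to see what such a weights matrix contains, here is a minimal pure-Python sketch of Okapi BM25 that scores every tokenized document against every other one. It is not the library implementation used above: the function name bm25_matrix, the toy documents, the parameters k1 = 1.5 and b = 0.75, and the non-negative idf variant are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_matrix(docs, k1=1.5, b=0.75):
    """Score every tokenized document against every other one with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Non-negative idf variant; an assumption, other variants exist.
    idf = {t: math.log(1 + (n_docs - c + 0.5) / (c + 0.5)) for t, c in df.items()}
    term_counts = [Counter(doc) for doc in docs]

    def score(query, j):
        # Length normalization for document j, then sum over the query terms it contains.
        norm = k1 * (1 - b + b * len(docs[j]) / avg_len)
        tf = term_counts[j]
        return sum(idf[t] * tf[t] * (k1 + 1) / (tf[t] + norm) for t in query if t in tf)

    # Row i holds the scores of every document when document i is used as the query.
    return [[score(query, j) for j in range(n_docs)] for query in docs]

# Hypothetical toy descriptions, already normalized and tokenized.
docs = [
    "hero fights crime in a dark city".split(),
    "masked hero fights injustice in the city".split(),
    "two friends open a bakery".split(),
]
scores = bm25_matrix(docs)
print(scores[0])
```

Unlike cosine similarity, these scores are not bounded by 1 and the matrix is not necessarily symmetric, because each row depends on which document plays the role of the query.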

movies_recommended  = mr.recommendation(5560, bm25_wts_df, 3)
print(movies_recommended)
['The Dark Knight' 'Pratibad' 'Akrandhana']

BERT

wts_df = mr.get_bert_weights(norm_corpus)
wts_df.head()

movies_recommended  = mr.recommendation(5560, wts_df, 3)
print(movies_recommended)
['The Dark Knight' 'Dune' 'Wake Of Death']

df[df['Movie Name'] == 'Dune' ].values

Summary

Batman Begins

In the wake of his parents murder, disillusioned industrial heir Bruce Wayne travels the world seeking the means to fight injustice.

1. So dramatic! We already know that Batman is a strange character, but looking at the words in the description, "fight injustice" alone is enough for techniques based on semantics.
2. Most of the movies recommended by the three approaches seem correct to me: there is someone involved in a fight against injustice.
3. The third recommendation given by the BM25 approach does not seem correct to me.
4. The recommendations given by BERT seem very natural and logical to me. Note the absence of the words mentioned in point 1: the recommendations are based on the semantics of the text and not only on the frequency of the words.
5. To conclude, in this experiment the vectorization techniques that generate dense vectors appear more robust and better at capturing the meaning of natural language.
I am a beginner in the world of NLP, and for me the best way to understand theoretical concepts is learning by doing. Feel free to suggest modifications.