BigDataETLAndSentimentAnalysis

A Java-based project that extracts news articles from large .sgm files, processes them, and loads them into a MongoDB database. It includes an Apache Spark job for word frequency analysis directly from .sgm files, and a sentiment analysis implementation using a Bag-of-Words model in Java.

Overview

This project processes and analyzes Reuters news data. It includes:

  • A Java application for parsing and storing news articles in MongoDB.
  • An Apache Spark job for word frequency analysis directly from .sgm files.
  • A Java-based sentiment analysis implementation using a Bag-of-Words model, which provides the polarity of words.
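
As a rough illustration of the parsing step, the sketch below pulls titles and bodies out of .sgm text with regular expressions. It assumes the files follow the Reuters-21578 SGML layout, where each article sits inside a <REUTERS> element containing <TITLE> and <BODY> tags. The class and method names (SgmArticleParser, Article, parse) are hypothetical and are not taken from the project source. The MongoDB write is left out so the sketch runs without the driver.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the project's actual parser.
class SgmArticleParser {

    // Minimal holder for one parsed article; field names are illustrative.
    static final class Article {
        final String title;
        final String body;

        Article(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    // Assumption: every article is wrapped in <REUTERS ...> ... </REUTERS>.
    private static final Pattern ARTICLE =
            Pattern.compile("<REUTERS[^>]*>(.*?)</REUTERS>", Pattern.DOTALL);
    private static final Pattern TITLE =
            Pattern.compile("<TITLE>(.*?)</TITLE>", Pattern.DOTALL);
    private static final Pattern BODY =
            Pattern.compile("<BODY>(.*?)</BODY>", Pattern.DOTALL);

    // Splits the raw .sgm text into articles and extracts title and body from each.
    static List<Article> parse(String sgm) {
        List<Article> articles = new ArrayList<>();
        Matcher m = ARTICLE.matcher(sgm);
        while (m.find()) {
            String block = m.group(1);
            articles.add(new Article(firstGroup(TITLE, block), firstGroup(BODY, block)));
        }
        return articles;
    }

    // Returns the trimmed first match, or an empty string when the tag is absent.
    private static String firstGroup(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1).trim() : "";
    }
}
```

Each Article could then be mapped to a document and inserted into a MongoDB collection. Regular expressions are used here because .sgm files are generally not well-formed XML, so a strict XML parser may reject them.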

Features

  • Data Parsing and Storage: Extracts news articles from .sgm files and stores them in a MongoDB database.
  • Word Frequency Analysis: Utilizes Apache Spark to count word frequencies in news articles.
  • Sentiment Analysis: Implements a Bag-of-Words model in Java to classify news article titles as positive, negative, or neutral.
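
The title classification can be sketched as follows: build a bag of words (word to count) from the title, then add or subtract each word's count depending on whether it appears in a positive or negative word list. This is a minimal sketch under stated assumptions, not the project's implementation. The class name TitleSentiment, the tokenization rule, and the tiny word lists are all made up for illustration; the README does not say which word lists the project uses.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of Bag-of-Words title classification.
class TitleSentiment {

    // Illustrative lexicons only; the project's real word lists are not shown here.
    private static final Set<String> POSITIVE = Set.of("gain", "rise", "profit", "growth");
    private static final Set<String> NEGATIVE = Set.of("loss", "fall", "drop", "crisis");

    // Lowercases the title, splits on non-letters, and counts each word.
    static Map<String, Integer> bagOfWords(String title) {
        Map<String, Integer> bag = new HashMap<>();
        for (String token : title.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                bag.merge(token, 1, Integer::sum);
            }
        }
        return bag;
    }

    // Sums +count for positive words and -count for negative words,
    // then maps the sign of the total score to a label.
    static String classify(String title) {
        int score = 0;
        for (Map.Entry<String, Integer> entry : bagOfWords(title).entrySet()) {
            if (POSITIVE.contains(entry.getKey())) {
                score += entry.getValue();
            }
            if (NEGATIVE.contains(entry.getKey())) {
                score -= entry.getValue();
            }
        }
        if (score > 0) {
            return "positive";
        }
        return score < 0 ? "negative" : "neutral";
    }
}
```

With these example lists, a title containing more positive than negative words is labeled positive, and a title with no listed words (or an equal number of each) is labeled neutral.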

Technologies Used

  • Java
  • MongoDB
  • Apache Spark
  • Bag-of-Words Model