NLP-Notebooks

Explore NLP tasks with Python using NLTK, SpaCy & scikit-learn: Tokenization, Normalization, NER, POS tagging, Encoding, Word embedding.

MIT License

This repository contains notebooks that demonstrate common Natural Language Processing (NLP) tasks in Python using NLTK, SpaCy, and scikit-learn. The notebooks cover tokenization, normalization (stemming and lemmatization), stopword removal, named entity recognition (NER), part-of-speech (POS) tagging, encoding techniques (one-hot encoding, bag of words, and TF-IDF, short for Term Frequency-Inverse Document Frequency), and word embedding with Word2Vec, GloVe, and FastText.

Notebooks

  • Tokenization : Demonstrates tokenization techniques in Python using NLTK and SpaCy.

  • Stemming : Implements stemming techniques in Python using NLTK and SpaCy.

  • Lemmatization : Explores lemmatization methods in Python using NLTK and SpaCy.

  • Named Entity Recognition : Performs named entity recognition (NER) in Python using NLTK and SpaCy, showing how to identify and extract named entities such as person names, organization names, and locations.

  • Part-of-Speech Tagging : Implements POS tagging in Python using NLTK and SpaCy, showing how to assign grammatical categories such as noun, verb, and adjective to the words in a text.

  • Stopwords : Demonstrates stopword removal in Python using NLTK and SpaCy, showing how to filter out common words that carry little meaning in text analysis tasks.
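
As a rough, library-free illustration of what tokenization and stopword removal do, the sketch below splits a sentence into lowercase word tokens and filters out common words. It is not taken from the notebooks: the `tokenize` and `remove_stopwords` helpers and the tiny `STOPWORDS` set are hypothetical names chosen for this example, and NLTK and SpaCy provide their own tokenizers and much larger stopword lists.

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# the lists shipped with NLP libraries are much larger.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def tokenize(text):
    """Lowercase the text and return its word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


tokens = tokenize("The cat is sitting on the mat.")
filtered = remove_stopwords(tokens)

print(tokens)    # ['the', 'cat', 'is', 'sitting', 'on', 'the', 'mat']
print(filtered)  # ['cat', 'sitting', 'mat']
```

A regular expression is used here only to keep the example dependency-free; it ignores cases such as hyphenated words and non-ASCII text that library tokenizers handle.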

    Encoding Techniques -

    • One Hot Encoding : Converts text documents into binary vectors with one-hot encoding (OHE), demonstrated using NLTK and SpaCy in Python.
    • Bag of Words : Represents text documents as vectors of word frequencies, using NLTK and SpaCy in Python.
    • TF-IDF : Scores each word in a document by how often it appears there (term frequency) and how rare it is across the corpus (inverse document frequency), using NLTK and SpaCy in Python.
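
The three encodings above can be sketched in plain Python on a toy two-document corpus. This is a minimal illustration under stated assumptions, not the notebooks' code: the corpus, the helper names (`one_hot`, `bag_of_words`, `tf_idf`), and the unsmoothed formula `idf = log(N / df)` are choices made for this example. Libraries such as scikit-learn use a smoothed IDF by default, so their scores will differ.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
tokenized = [doc.split() for doc in docs]

# Sorted vocabulary gives every word a fixed position in each vector.
vocab = sorted({word for doc in tokenized for word in doc})


def one_hot(doc):
    """Binary vector: 1 if the vocabulary word occurs in the document, else 0."""
    present = set(doc)
    return [1 if word in present else 0 for word in vocab]


def bag_of_words(doc):
    """Count vector: number of times each vocabulary word occurs in the document."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]


def tf_idf(doc):
    """TF-IDF vector using tf = count / len(doc) and unsmoothed idf = log(N / df)."""
    counts = Counter(doc)
    n_docs = len(tokenized)
    scores = []
    for word in vocab:
        tf = counts[word] / len(doc)
        df = sum(1 for other in tokenized if word in other)  # document frequency
        scores.append(tf * math.log(n_docs / df))
    return scores


print(vocab)                       # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(one_hot(tokenized[0]))       # [1, 0, 0, 1, 1, 1, 1]
print(bag_of_words(tokenized[0]))  # [1, 0, 0, 1, 1, 1, 2]
print(tf_idf(tokenized[0]))
```

Words that occur in every document, such as "the" here, get an IDF of `log(1) = 0` and therefore a TF-IDF score of zero, while words unique to one document, such as "cat", get the highest scores.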

    Word Embedding -

    • Word2Vec : Implements Word2Vec in Python using both pretrained models and models trained from scratch.
    • Avg Word2Vec : Uses averaged Word2Vec embeddings in Python to represent a whole text as a single vector.
    • GloVe : Uses Stanford's pretrained GloVe model for word embedding in natural language processing tasks.
    • FastText : Uses Gensim and the FastText library for text representation and classification, based on subword information and the skip-gram architecture.
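
To make the idea of averaged word embeddings concrete, here is a small dependency-free sketch. It assumes a made-up table of 3-dimensional vectors (`toy_vectors`) standing in for a trained Word2Vec, GloVe, or FastText model, and the helper names `average_vector` and `cosine_similarity` are hypothetical. Real embeddings are learned from data and have far more dimensions.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.0, 1.0],
}


def average_vector(tokens, vectors):
    """Average the vectors of all known tokens; return None if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# "unicorn" is not in the table, so it is skipped when averaging.
doc_vector = average_vector(["cat", "dog", "unicorn"], toy_vectors)
print(doc_vector)

print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))  # close to 1
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))  # 0.0
```

Skipping out-of-vocabulary tokens is one simple design choice for this sketch; FastText can instead build vectors for unseen words from their subword units.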

Requirements

  • Python 3
  • Jupyter Notebook/Google Colab
  • NLTK
  • SpaCy
  • scikit-learn
  • Gensim

License

This project is licensed under the MIT License - see the LICENSE file for details.