segment

A tool to segment text based on word frequencies and the Viterbi algorithm, e.g. "#TheBoyWhoLived" => ['#', 'The', 'Boy', 'Who', 'Lived']

Stars: 81

Ecosystems: Python

This module segments text according to word frequency using the Viterbi algorithm. The approach probably originates with Peter Norvig.
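The idea of frequency-based segmentation with a Viterbi-style search can be illustrated with a minimal sketch. It is not this module's implementation: the function name segment_sketch, the TOY_COUNTS table and the unknown-word penalty are all assumptions made up for illustration, and a real corpus would supply far larger counts.

```python
import math

# Toy unigram counts, invented for illustration only (not from any real corpus).
TOY_COUNTS = {"the": 500, "boy": 50, "who": 120, "lived": 30,
              "live": 40, "he": 200, "d": 1}


def segment_sketch(text, counts, max_word_len=20):
    """Split text into the most probable word sequence under a unigram model."""
    total = sum(counts.values())

    def log_prob(word):
        count = counts.get(word.lower())
        if count is None:
            # Assumed penalty: unknown words get less likely as they get longer.
            return math.log(1.0 / (total * 10 ** len(word)))
        return math.log(count / total)

    n = len(text)
    # best_score[i] is the best log-probability of any segmentation of text[:i].
    best_score = [0.0] + [-math.inf] * n
    # back[i] is the start index of the last word in that best segmentation.
    back = [0] * (n + 1)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best_score[start] + log_prob(text[start:end])
            if score > best_score[end]:
                best_score[end] = score
                back[end] = start

    # Walk the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment_sketch("TheBoyWhoLived", TOY_COUNTS))
```

With these toy counts, the call splits the input into 'The', 'Boy', 'Who' and 'Lived', because every alternative split has to pay for at least one rare or unknown word.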

Three sources of frequency information are provided.

One is from the Google NGram corpus, a general web corpus.

The second is from the Rovereto Twitter N-Gram Corpus, which is better for some Twitter data.

The third is from a web crawl dataset of anchor text provided by Vinay Goel of the Internet Archive.

> from segment.segmenter import Analyzer
> e = Analyzer('en')
> e.segment("AbeLincoln")
['Abe', 'Lincoln']
> e.segment("BieberHeartsBeliebers")
['Bi', 'e', 'ber', 'Hearts', 'Be', 'lieber', 's']
> t = Analyzer('twitter')
> t.segment("BieberHeartsBeliebers")
['Bieber', 'Hearts', 'Beliebers']
> t = Analyzer('anchor')
> t.segment("wordpress&sex")
['wordpress', '&', 'sex']

Related Projects

subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation

01 Sep 2015 2,146

Article-Summarizer

Uses frequency analysis to summarize text.

04 Jan 2017 183

cn_segment

Chinese word segmentation based on statistical methods (for Python)

HarvestText

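Text mining and preprocessing tools (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.), using unsupervised or weakly supervised methods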

19 Nov 2018 2,391

Tracking-Anything-with-DEVA

[ICCV 2023] Tracking Anything with Decoupled Video Segmentation

17 Aug 2023 1,233

speech-to-text-wavenet

Speech-to-Text-WaveNet : End-to-end sentence level English speech recognition based on DeepMind's...

14 Nov 2016 3,945

Whisper-transcription_and_diarization-speaker-identification-

How to use OpenAI's Whisper to transcribe and diarize audio files

12 Oct 2022 285

pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/

24 Feb 2018 678

nlp_xiaojiang

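Natural language processing (NLP), Xiaojiang robot (retrieval-based chit-chat chatbot), BERT sentence embeddings and similarity (Sentence Similarity), XLNET sentence embeddings and similarity (text xlnet embeddin...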

09 Apr 2019 1,519

fast-sentence-segment

Fast and Efficient Sentence Segmentation

vocabsieve

Simple sentence mining tool for language learning

10 Jul 2021 372

text-based-search-engine

Implementation of a search engine using TF-IDF and Word Embedding-based vectorization techniques ...