kss | Python Ecosystem Directory

kss - v3.2.0

Published by hyunwoongko about 3 years ago

Change the default value of use_quotes_brackets_processing to False.
- This is for speed. If you want to use this option, set it to True.
Make the preprocessing step parallelizable.
- Preprocessing is much faster now :-)
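
To make the effect of a quotes/brackets option concrete, here is a minimal toy sketch. It is not kss's algorithm: `toy_split`, its punctuation rule, and the sample text are assumptions for illustration; only the flag name is taken from the note above. With the flag on, the splitter tracks extra state so that it never splits inside double quotes or brackets, which is the kind of additional work a default of False avoids.

```python
# Toy illustration only; kss's real segmentation rules are far more involved.
OPENERS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPENERS.values())


def toy_split(text, use_quotes_brackets_processing=False):
    """Split after '.', '!' or '?'; optionally never split inside quotes/brackets."""
    sentences, current = [], []
    depth = 0          # nesting depth of currently open brackets
    in_quote = False   # True while inside a double-quoted span
    for ch in text:
        current.append(ch)
        if use_quotes_brackets_processing:
            if ch == '"':
                in_quote = not in_quote
            elif not in_quote and ch in OPENERS:
                depth += 1
            elif not in_quote and ch in CLOSERS and depth > 0:
                depth -= 1
        inside = in_quote or depth > 0
        if ch in ".!?" and not inside:
            sentences.append("".join(current).strip())
            current = []
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = 'He said "Stop. Wait." and left. (See note. It matters.) Done.'
plain = toy_split(sample)
aware = toy_split(sample, use_quotes_brackets_processing=True)
```

In this toy, `plain` breaks the quoted and bracketed spans apart, while `aware` keeps each of them inside a single sentence.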

kss - v3.1.0

Published by hyunwoongko about 3 years ago

Fix the default rule that uses morpheme features.
- The previous version segmented "없다 거나" into ["없다", "거나"].
- The current version doesn't segment these cases.
Remove the none backend option.
Segmentation error rate
- 3.5%+ -> 1.3% (mecab) / 2.3% (pynori)

kss - v3.0.3

Published by hyunwoongko about 3 years ago

Fix the bug reported in https://github.com/hyunwoongko/kss/issues/7

kss - v3.0.2

Published by hyunwoongko about 3 years ago

Hotfix for logging bugs with longer text.
Add memoization with an LRU cache for quote calibration.
- The quote calibration algorithm has a time complexity of O(2^N).
- That is very slow, so I applied memoization with caching.
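
The following sketch shows why memoization helps with this kind of recursion. It is not kss's quote calibration code: `count_naive` and `count_memo` are hypothetical toy functions that count the ways N quote marks can be labeled as opening or closing so that every quote is closed. The plain version makes an exponential number of calls, while the cached version only visits each (remaining, open_count) state once.

```python
from functools import lru_cache


def count_naive(remaining, open_count=0):
    """Try both labels for every quote mark; the call count grows exponentially."""
    if remaining == 0:
        return 1 if open_count == 0 else 0
    total = count_naive(remaining - 1, open_count + 1)       # label as opening
    if open_count > 0:
        total += count_naive(remaining - 1, open_count - 1)  # label as closing
    return total


@lru_cache(maxsize=None)
def count_memo(remaining, open_count=0):
    """Same recursion, but each (remaining, open_count) state is computed once."""
    if remaining == 0:
        return 1 if open_count == 0 else 0
    total = count_memo(remaining - 1, open_count + 1)
    if open_count > 0:
        total += count_memo(remaining - 1, open_count - 1)
    return total


small = count_naive(10)   # still cheap enough for the naive version
large = count_memo(200)   # returns immediately thanks to the cache
```

With the cache there are at most (N + 1)^2 distinct states, so the work is polynomial in N. `maxsize=None` keeps every result; a bounded `maxsize` trades some recomputation for a fixed memory footprint.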

kss - v3.0.1

Published by hyunwoongko about 3 years ago

1. Use morpheme features

Unlike 2.xx, text can also be segmented at unspecified eomi (Korean verb endings). The default backend is pynori.

e.g. ~소서 (honorific), ~세용 (neologism), ~했음/임 (transformative ending), ~구나 (unregistered ending), etc.

>>> split_sentences("부디 만수무강 하옵소서 천천히 가세용~ 너 밥을 먹는구나 응 맞아 난 근데 어제 이사했음 그랬구나 이제 마지막임 응응")
['부디 만수무강 하옵소서', '천천히 가세용~', '너 밥을 먹는구나', '응 맞아 난 근데 어제 이사했음', '그랬구나 이제 마지막임', '응응']

You can boost segmentation speed by changing the morpheme analyzer backend to mecab.

>>> split_sentences("부디 만수무강 하옵소서 천천히 가세용~ 너 밥을 먹는구나 응 맞아 난 근데 어제 이사했음 그랬구나 이제 마지막임 응응", backend="mecab")
['부디 만수무강 하옵소서', '천천히 가세용~', '너 밥을 먹는구나', '응 맞아 난 근데 어제 이사했음', '그랬구나 이제 마지막임', '응응']

You can turn this off by changing the morpheme analyzer backend to none.

>>> split_sentences("부디 만수무강 하옵소서 천천히 가세용~ 너 밥을 먹는구나 응 맞아 난 근데 어제 이사했음 그랬구나 이제 마지막임 응응", backend="none") 
['부디 만수무강 하옵소서 천천히 가세용~ 너 밥을 먹는구나 응 맞아 난 근데 어제 이사했음 그랬구나 이제 마지막임 응응']

2. Support multiprocessing and batch processing

You can pass a Tuple[str] or List[str] as the input text for batch processing.

>>> split_sentences(["안녕하세요 반가워요", "반갑습니다. 잘 지내시나요?"])
[['안녕하세요', '반가워요'], ['반갑습니다.', '잘 지내시나요?']]

You can change the number of multiprocessing workers. The default is -1 (max).

>>> split_sentences(["안녕하세요 반가워요", "반갑습니다. 잘 지내시나요?"], num_workers=4)
[['안녕하세요', '반가워요'], ['반갑습니다.', '잘 지내시나요?']]
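
As a rough sketch of how batch input and a worker count could fit together, the code below uses a placeholder splitter rather than kss itself. `resolve_num_workers`, `toy_split`, and `split_batch` are hypothetical names, and it assumes that "max" means one worker per available CPU; kss's internals may differ.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def resolve_num_workers(num_workers):
    """Map -1 to the CPU count (assumed meaning of 'max'); else require >= 1."""
    if num_workers == -1:
        return os.cpu_count() or 1
    if num_workers < 1:
        raise ValueError("num_workers must be -1 or a positive integer")
    return num_workers


def toy_split(text):
    """Placeholder splitter (whitespace tokens) standing in for real segmentation."""
    return text.split()


def split_batch(texts, num_workers=-1):
    """A single str returns a flat list; a list or tuple returns one list per item."""
    if isinstance(texts, str):
        return toy_split(texts)
    texts = list(texts)
    # Never start more workers than there are texts to process.
    workers = min(resolve_num_workers(num_workers), max(len(texts), 1))
    if workers == 1:
        return [toy_split(t) for t in texts]  # no process pool needed
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(toy_split, texts))  # map preserves input order
```

On platforms that start worker processes by spawning, a call such as `split_batch(("a b", "c d"), num_workers=4)` should sit under an `if __name__ == "__main__":` guard.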
