pecab

Pecab: Pure python Korean morpheme analyzer based on Mecab

APACHE-2.0 License

Downloads: 3.9K
Stars: 156
Committers: 3
pecab - v1.0.8 Latest Release

Published by hyunwoongko almost 2 years ago

Change the numpy read mode from r+ to r so that Pecab works on read-only systems. See https://github.com/hyunwoongko/pecab/issues/5

pecab - v1.0.7

Published by hyunwoongko almost 2 years ago

Improve drop_space.

pecab - v1.0.4

Published by hyunwoongko almost 2 years ago

pecab - v1.0.3

Published by hyunwoongko almost 2 years ago

Apply an LRU cache to the _tokenize method to reduce elapsed time for repeated inputs.
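
The caching idea in this release can be sketched with the standard library's functools.lru_cache. The tokenize function below is a hypothetical whitespace-splitting stand-in, not Pecab's actual _tokenize, and the cache size of 128 is an assumption made for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=128)  # assumed size, not necessarily what Pecab uses
def tokenize(text: str) -> tuple:
    """Hypothetical stand-in for an expensive tokenizer.

    A tuple is returned so callers cannot mutate the cached value.
    """
    return tuple(text.split())


first = tokenize("이것은 문장입니다.")   # computed and stored in the cache
second = tokenize("이것은 문장입니다.")  # same input, served from the cache
info = tokenize.cache_info()
print(info.hits, info.misses)
```

Because the cache is keyed on the input string, only repeated identical inputs benefit. New sentences still pay the full tokenization cost.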

pecab - v1.0.2

Published by hyunwoongko almost 2 years ago

pecab - v1.0.0

Published by hyunwoongko almost 2 years ago

Pecab

Pecab is a pure Python Korean morpheme analyzer based on Mecab. Mecab is a CRF-based morpheme analyzer made by Taku Kudo in 2011. It is both fast and accurate, which is why it remains very popular despite its age. However, it is also known as one of the trickiest libraries to install, and many people have had a hard time installing it.

So, for a few years, I have wanted to make a pure Python version of Mecab that is easy to install while inheriting the advantages of Mecab.
Now Pecab is here. It produces results very similar to Mecab's and is easy to install. For more details, please refer to the following.

Installation

pip install pecab

Usage

The user API of Pecab is inspired by KoNLPy,
one of the most famous natural language processing packages in South Korea.

1) PeCab(): creates a PeCab object.

from pecab import PeCab

pecab = PeCab()

2) morphs(text): splits text into morphemes.

pecab.morphs("아버지가방에들어가시다")
['아버지', '가', '방', '에', '들어가', '시', '다']

3) pos(text): returns morphemes and POS tags together.

pecab.pos("이것은 문장입니다.")
[('이것', 'NP'), ('은', 'JX'), ('문장', 'NNG'), ('입니다', 'VCP+EF'), ('.', 'SF')]

4) nouns(text): returns all nouns in the input text.

pecab.nouns("자장면을 먹을까? 짬뽕을 먹을까? 그것이 고민이로다.")
["자장면", "짬뽕", "그것", "고민"]
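
Conceptually, nouns(text) can be read as a filter over the pos(text) output. The sketch below assumes that a noun is any morpheme whose tag starts with "N" (for example NNG, NNP, or NP). That rule and the helper name nouns_from_pos are assumptions made for illustration, and Pecab's real implementation may differ.

```python
from typing import List, Tuple


def nouns_from_pos(tagged: List[Tuple[str, str]]) -> List[str]:
    """Keep morphemes whose POS tag starts with 'N' (assumed noun rule)."""
    return [morph for morph, tag in tagged if tag.startswith("N")]


# Sample modeled on the pos() example for "이것은 문장입니다."
tagged = [
    ("이것", "NP"),
    ("은", "JX"),
    ("문장", "NNG"),
    ("입니다", "VCP+EF"),
    (".", "SF"),
]
result = nouns_from_pos(tagged)
print(result)
```

The same filter pattern works for other tag families, for instance keeping only tags that start with "V" to collect verbs and adjectives.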

5) PeCab(user_dict=List[str]): sets up a user dictionary.

Note that words included in the user dictionary cannot contain spaces.

  • Without user_dict
from pecab import PeCab

pecab = PeCab()
pecab.pos("저는 삼성디지털프라자에서 지펠냉장고를 샀어요.")
[('저', 'NP'), ('는', 'JX'), ('삼성', 'NNP'), ('디지털', 'NNP'), ('프라자', 'NNP'), ('에서', 'JKB'), ('지', 'NNP'), ('펠', 'NNP'), ('냉장고', 'NNG'), ('를', 'JKO'), ('샀', 'VV+EP'), ('어요', 'EF'), ('.', 'SF')]
  • With user_dict
from pecab import PeCab

user_dict = ["삼성디지털프라자", "지펠냉장고"]
pecab = PeCab(user_dict=user_dict)
pecab.pos("저는 삼성디지털프라자에서 지펠냉장고를 샀어요.")
[('저', 'NP'), ('는', 'JX'), ('삼성디지털프라자', 'NNG'), ('에서', 'JKB'), ('지펠냉장고', 'NNG'), ('를', 'JKO'), ('샀', 'VV+EP'), ('어요', 'EF'), ('.', 'SF')]
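
Since user dictionary words cannot contain spaces, it may help to check the list before building the analyzer. The helper below, validate_user_dict, is hypothetical and not part of Pecab. It is a minimal sketch that also rejects empty strings, which is an extra assumption beyond the rule stated above.

```python
from typing import Iterable, List


def validate_user_dict(words: Iterable[str]) -> List[str]:
    """Return the words as a list, rejecting empty entries and
    entries that contain any whitespace character."""
    checked = []
    for word in words:
        if not word or any(ch.isspace() for ch in word):
            raise ValueError(f"invalid user dictionary entry: {word!r}")
        checked.append(word)
    return checked


user_dict = validate_user_dict(["삼성디지털프라자", "지펠냉장고"])
# The validated list can then be passed along as PeCab(user_dict=user_dict).
```

Failing early with a ValueError makes a bad entry easy to locate, instead of leaving it to surface later as an unexpected analysis result.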

6) PeCab(split_compound=bool): divides compound words into smaller pieces.

from pecab import PeCab

pecab = PeCab(split_compound=True)
pecab.morphs("가벼운 냉장고를 샀어요.")
['가볍', 'ᆫ', '냉장', '고', '를', '사', 'ㅏㅆ', '어요', '.']

7) ANY_PECAB_FUNCTION(text, drop_space=bool): determines whether spaces are returned or not.

This option can be used with all of morphs, pos, and nouns. Its default value is True.

from pecab import PeCab

pecab = PeCab()
pecab.pos("토끼정에서 크림 우동을 시켰어요.")
[('토끼', 'NNG'), ('정', 'NNG'), ('에서', 'JKB'), ('크림', 'NNG'), ('우동', 'NNG'), ('을', 'JKO'), ('시켰', 'VV+EP'), ('어요', 'EF'), ('.', 'SF')]

pecab.pos("토끼정에서 크림 우동을 시켰어요.", drop_space=False)
[('토끼', 'NNG'), ('정', 'NNG'), ('에서', 'JKB'), (' ', 'SP'), ('크림', 'NNG'), (' ', 'SP'), ('우동', 'NNG'), ('을', 'JKO'), (' ', 'SP'), ('시켰', 'VV+EP'), ('어요', 'EF'), ('.', 'SF')]
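
The relationship between the two outputs can be sketched without Pecab itself: removing the tokens tagged SP from the drop_space=False result gives the default result, and keeping them lets the original spacing be restored by concatenation. The helper names drop_spaces and detokenize are hypothetical, and this only illustrates the observable behavior for this sample, not how Pecab implements the option.

```python
from typing import List, Tuple

Token = Tuple[str, str]


def drop_spaces(tokens: List[Token]) -> List[Token]:
    """Remove tokens tagged 'SP' (space)."""
    return [(morph, tag) for morph, tag in tokens if tag != "SP"]


def detokenize(tokens: List[Token]) -> str:
    """Concatenate the morphemes in order."""
    return "".join(morph for morph, _ in tokens)


# Output of pos(..., drop_space=False) for the sample sentence.
with_spaces = [
    ("토끼", "NNG"), ("정", "NNG"), ("에서", "JKB"), (" ", "SP"),
    ("크림", "NNG"), (" ", "SP"), ("우동", "NNG"), ("을", "JKO"),
    (" ", "SP"), ("시켰", "VV+EP"), ("어요", "EF"), (".", "SF"),
]

without_spaces = drop_spaces(with_spaces)
restored = detokenize(with_spaces)      # spacing preserved
squashed = detokenize(without_spaces)   # spacing lost
print(restored)
print(squashed)
```

Keeping the SP tokens is therefore useful when downstream code needs to map morphemes back onto the original text. Concatenation restores the text only when every morpheme is a surface form, so it will not round-trip output produced with split_compound=True.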