Underthesea - Vietnamese NLP Toolkit
GPL-3.0 License
Published by rain1024 over 6 years ago
The main focus of this release is fixing the dependency hell error reported by @dthphuong and @YannDubs. The fix speeds up the installation of underthesea and removes all unnecessary default dependencies.
Another important update is an API change. We renamed the word_sent function to word_tokenize, which is a better name for the word segmentation task.
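Code that still calls the old name can bridge the rename with a thin alias on the caller's side. The sketch below is only an illustration under stated assumptions, not underthesea's implementation: word_tokenize here is a stand-in that splits on whitespace, and the deprecation wrapper around word_sent is hypothetical.

```python
import warnings


def word_tokenize(sentence):
    """Stand-in tokenizer (assumption): splits on whitespace only.

    The real underthesea function performs Vietnamese word segmentation.
    """
    return sentence.split()


def word_sent(sentence):
    """Hypothetical backward-compatible alias for the old function name."""
    warnings.warn(
        "word_sent was renamed to word_tokenize",
        DeprecationWarning,
        stacklevel=2,
    )
    return word_tokenize(sentence)


if __name__ == "__main__":
    # Old call sites keep working but emit a DeprecationWarning.
    print(word_sent("Xin chào Việt Nam"))
```

Keeping the old name as a wrapper that warns lets existing call sites migrate gradually instead of breaking on upgrade.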
Thanks to @rain1024, @JackNhat for the contributions!
Published by rain1024 almost 7 years ago
The main feature in this release is aspect sentiment analysis. We conducted a bunch of experiments on social media posts in the banking domain. Traditional classifiers such as SVM, Naive Bayes, and Gradient Boosting Tree with count and tf-idf features still yield better results (59.5% F1 score) than deep learning models like fasttext and CNN. You can view a live demo of Vietnamese aspect sentiment analysis in the underthesea service.
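As a reminder of what the F1 figure measures, here is a minimal pure-Python sketch of per-label F1 (the harmonic mean of precision and recall). The labels and predictions are made up for illustration, and the notes do not say which averaging scheme was used for the reported score, so the example stops at the single-label case.

```python
def f1_score(y_true, y_pred, label):
    """F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data (assumption), not taken from the experiments.
    gold = ["pos", "neg", "pos", "neg"]
    pred = ["pos", "pos", "pos", "neg"]
    print(f1_score(gold, pred, "pos"))
```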
We renamed the underthesea-flow project to languageflow and integrated new models (KimCNNCLassifier, XGBoostClassifier). See more details in the languageflow documentation.
Thanks to @rain1024, @JackNhat for the contributions!
Published by rain1024 about 7 years ago
The main feature in this release is named entity recognition. Our experiments focus on conditional random field models, which yield reasonable results and are fast (~20 minutes per experiment). For more information about the NER experiments, go to its own repository.
A lot of work went into improving our pipeline this month, and a new project, underthesea-flow, was created for this purpose.
We also created a new project, underthesea.amr, in response to the rise of AMR. Our first goal is to create the first 3000 annotated Vietnamese sentences in our AMR bank.
Thanks to @rain1024, @JackNhat, @vunb for the contributions!
Published by rain1024 about 7 years ago
The main feature in this release is text classification. We experimented with some standard classifiers (Naive Bayes, SVM family, xgboost) and a trendy classifier, fasttext, on a very large Vietnamese news data set (30k sentences). The winner is fasttext because it is very fast and yields the best accuracy and F1 score. For more information about the classification experiments, follow the link to its own repository.
We're afraid that we can't support one-line install due to the many dependencies that come with v1.1.4 (fasttext, sklearn). Another reason is that we want to separate models from code. So after installing underthesea, you must take one small extra step: download the models. Check out how to make underthesea work with four lines in the Installation section.
See you next release!
Published by rain1024 about 7 years ago
Published by rain1024 over 7 years ago
Word Segmentation, POS Tagging, Chunking
Supports Python 2 only