Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License
GLM (General Language Model)
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
Code for NAACL 2024 main conference paper "An Empirical Study of Consistency Regularization for E...
Public release of the TransCoder research project https://arxiv.org/pdf/2006.03511.pdf
Multilingual Sentence & Image Embeddings with BERT
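Training scripts and instructions on how to reproduce our systems submitted to the NEWS 2018 Task on...
text2vec, text to vector. A text vector representation tool that converts text into vector matrices and implements Word2Vec, RankBM25, Sentence-BERT, CoSENT, and other models for text representation, text similarity...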
Access a database of word frequencies, in various natural languages.
Sequence to Sequence from Scratch Using PyTorch
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
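Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing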
A guide to tackling text summarization
This is an open source project (formerly named Listen, Attend and Spell - PyTorch Implementation)...