Inspired by Google's C4, this is a series of colossal clean data cleaning scripts focused on CommonCrawl data processing. It includes Chinese data processing and the cleaning methods in MassiveText.
MIT License
Pipeline for Analyzing Text Data: Acquire, Preprocess, Analyze
General Assembly's 2015 Data Science course in Washington, DC
This repository supports contributions of tools for the Project CodeNet dataset hosted in DAX
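Text mining and preprocessing tools (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.), using unsupervised or weakly supervised methods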
A set of data files that can be used to train tesseract-ocr to read Georgian script (ქართული ენა)
Cleaned E2E NLG Challenge data + supporting scripts
An English-to-Cantonese machine translation model
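2019 SOTA spell checker for Simplified and Traditional Chinese: FASPell Chinese Spell Checker (Chinese Spell Check / Chinese spelling error detection / Chinese spelling correction / Chinese spell checking)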
Data repository for pretrained NLP models and NLP corpora.
Code for the paper Neural Pipeline for Zero-Shot Data-to-Text Generation
Searching for structural similarities across billions of molecules in milliseconds
Basically SentEval with German-language downstream tasks
Simple Solution for Multi-Criteria Chinese Word Segmentation
A Reddit bot that summarizes news articles written in Spanish or English. It uses a custom-built ...