c4-dataset-script

Inspired by Google's C4, this is a series of data cleaning scripts for processing CommonCrawl data. It includes Chinese data processing and the cleaning methods used in MassiveText.

MIT License

Stars: 119

Ecosystems: Python, Apache Spark

Related Projects

text-as-data

Pipeline for Analyzing Text Data: Acquire, Preprocess, Analyze

30 Jan 2015 8

DAT8

General Assembly's 2015 Data Science course in Washington, DC

07 Aug 2015 1,606

Project_CodeNet

This repository supports contributions of tools for the Project CodeNet dataset hosted in DAX

03 May 2021 1,536

HarvestText

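Text mining and preprocessing tools (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.), using unsupervised or weakly supervised methods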

19 Nov 2018 2,391

tesseract-georgian

A set of data files that can be used to train tesseract-ocr to read Georgian script (ქართული ენა)

04 Apr 2015 15

e2e-cleaning

Cleaned E2E NLG Challenge data + supporting scripts

11 Jun 2019 21

MT-SFT-ShareGPT

18 Aug 2024 3

TransCan

An English-to-Cantonese machine translation model

06 Nov 2022 49

FASPell

2019 SOTA spell checker for Simplified and Traditional Chinese: FASPell Chinese Spell Checker (Chinese spell check / Chinese spelling error detection / Chinese spelling correction)

26 Sep 2019 1,199

gensim-data

Data repository for pretrained NLP models and NLP corpora.

13 Oct 2017 974

zeroshot-d2t-pipeline

Code for the paper Neural Pipeline for Zero-Shot Data-to-Text Generation

15 Nov 2021 15

usearch-molecules

Searching for structural similarities across billions of molecules in milliseconds

10 Jun 2023 46

sentence-embedding-evaluation-german

Basically SentEval with German language downstream tasks

08 Apr 2022 0

multi-criteria-cws

Simple Solution for Multi-Criteria Chinese Word Segmentation

05 Dec 2017 300

summarizer

A Reddit bot that summarizes news articles written in Spanish or English. It uses a custom built ...

10 Feb 2019 269