c4-dataset-script

Inspired by Google's C4 (Colossal Clean Crawled Corpus), this is a series of data cleaning scripts focused on processing CommonCrawl data. It also includes the Chinese data processing and cleaning methods from MassiveText.

MIT License

Stars: 119

Issue Statistics

                         Past Year   All Time
Total Pull Requests      0           10
Merged Pull Requests     0           10
Total Issues             0           0
Time to Close Issues     N/A         N/A

Related Projects