Read specific lines of a text file without loading the whole file into memory.
GPL-3.0 License
Utilities for preprocessing text for deep learning with Keras
An example of how to use spaCy for extremely large files without running into memory issues
Freeing data processing from scripting madness by providing a set of platform-agnostic customizab...
Read datasets in a standard way
Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpi...
Home of StarCoder2!
This project shows how to derive the total number of training tokens from a large text dataset fr...
We are trying to define a framework for NLP tasks that easily maps any kind of word embedding dat...
Explore large language models in 512MB of RAM
A Simple Bulk Labelling Tool
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
A fast framework for pre-processing (cleaning text, reduction of vocabulary, feature extraction ...
Home of StarCoder: fine-tuning & inference!
GLM (General Language Model)