This project shows how to compute the total number of training tokens in a large 🤗 datasets text dataset using Apache Beam and Dataflow.
This project is the sole repository for solving AI Engineer Party.
Public release of the TransCoder research project https://arxiv.org/pdf/2006.03511.pdf
Recaption large (Web)Datasets with vLLM and save the artifacts.
Ongoing research training transformer language models at scale, including: BERT & GPT-2
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Home of StarCoder2!
Apache Beam is a unified programming model for Batch and Streaming data processing.
Pretrained language model with 100B parameters
Unsupervised Language Modeling at scale for robust sentiment classification
Home of StarCoder: fine-tuning & inference!
Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"