GPT2 fine-tuning pipeline with KerasNLP, TensorFlow, and TensorFlow Extended
APACHE-2.0 License
This project demonstrates how to build a machine learning pipeline for fine-tuning GPT2 on the Alpaca dataset using TensorFlow Extended (TFX), KerasNLP, TensorFlow, and Hugging Face Hub. It was done as part of the 2023 Keras Community Sprint held by the official Keras team at Google.
The demand for building ChatGPT-like Large Language Models (LLMs) has increased dramatically since early 2023 because of their promising capabilities. To build a customized and private LLM-based chatbot application, we need to fine-tune a language model (e.g., GPT2) on a custom dataset of (instruction, response) pairs.
This project uses the GPT2 model from the KerasNLP library as the base language model and fine-tunes it on the Stanford Alpaca dataset from the alpaca-lora repository.
NOTE: The Alpaca dataset used in this project is an enhanced version of the original Stanford Alpaca dataset, in which open source communities fixed some flaws manually and with the GPT4 API.
Further, to automate the fine-tuning process, this project embeds it in an end-to-end machine learning pipeline built with TensorFlow Extended (TFX). Within the pipeline, when data is given, the following TFX components are triggered sequentially, and the data passed between components is shared in TFRecord format.
The Alpaca dataset is injected into the TFX pipeline through the TFX ExampleGen component. It is assumed that the data is prepared in TFRecord format beforehand. TensorFlow Datasets allows us to create TFRecords easily without knowing much about TFRecords. If you are curious, check out the alpaca subdirectory to find out how.
The injected data is transformed into instruction-following format through the TFX Transform component. The original Alpaca dataset stores instruction, input, and response separately for each conversation. However, they should be merged into a single string in the following format:
f"""### Instruction:
{instruction_txt}
### Input:
{input_txt}
### Response:
{response_txt}
"""
The model is exported as a SavedModel with a custom signature (this is the minimum requirement to serve a TensorFlow/Keras model with TensorFlow Serving).
The fine-tuned model is pushed to the Hugging Face Model Hub through the custom TFX HFPusher component. Each time the model is pushed, a new revision name (based on the date) is assigned to it to distinguish the version of the model.
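To illustrate date-based revision naming, here is a small sketch. The `vYYYYMMDD-HHMMSS` pattern and the `make_revision_name` helper are hypothetical; the exact naming scheme used by HFPusher is not stated here and may differ.

```python
from datetime import datetime, timezone
from typing import Optional


def make_revision_name(now: Optional[datetime] = None) -> str:
    """Return a revision name derived from a UTC timestamp.

    The "v%Y%m%d-%H%M%S" pattern is an assumption for illustration only.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now.strftime("v%Y%m%d-%H%M%S")


# A fixed timestamp makes the result reproducible.
fixed = datetime(2023, 7, 28, 9, 30, 0, tzinfo=timezone.utc)
print(make_revision_name(fixed))  # v20230728-093000
```

Because the name sorts lexicographically in time order, the most recent revision is easy to pick out from a list of revisions.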
With an additional capability of the custom TFX HFPusher component, a prepared template application is published to Hugging Face Space Hub. Each time the model is pushed, some strings within the template, such as the revision name, are replaced with real values at runtime.
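To make the merge performed in the Transform step concrete, here is a minimal plain-Python sketch of the instruction-following format shown earlier. The function name `build_prompt` and the sample record are assumptions for illustration; the project's Transform component applies the equivalent operation inside the TFX pipeline and may differ in detail.

```python
def build_prompt(instruction_txt: str, input_txt: str, response_txt: str) -> str:
    """Merge one Alpaca record into a single instruction-following string."""
    return (
        "### Instruction:\n"
        f"{instruction_txt}\n"
        "### Input:\n"
        f"{input_txt}\n"
        "### Response:\n"
        f"{response_txt}\n"
    )


# Made-up sample record, not taken from the dataset.
example = build_prompt(
    instruction_txt="Translate the sentence into French.",
    input_txt="Good morning.",
    response_txt="Bonjour.",
)
print(example)
```

In this sketch, a record with an empty input still keeps its `### Input:` header, so every training example has the same three-section layout.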
Currently, running this pipeline on Vertex AI is not supported due to CUDA and cuDNN version conflicts between TFX and KerasNLP. However, you can run the whole pipeline in a local or Colab environment as below.
CUDA >= 11.6 and cuDNN >= 8.6 are required. Below these versions, some KerasNLP GPT2 models would fail. As of 07/28/2023, the default Colab environment comes with higher versions of both libraries.
Install dependencies
# it is recommended to run the following pip command in venv
$ cd training_pipeline
$ pip install -r requirements.txt
Replace the Hugging Face token inside pipeline/configs.py with the value of the environment variable. This token is used to push the model and publish a Space application on Hugging Face Hub. If you are not familiar with how to get a Hugging Face access token, check out the official documentation.
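# export the variable so envsubst can see it, and write to a temporary file:
# redirecting straight back onto configs.py would truncate it before it is read
$ export HF_ACCESS_TOKEN="YOUR Hugging Face Access Token"
$ envsubst '$HF_ACCESS_TOKEN' < pipeline/configs.py \
> pipeline/configs.py.tmp
$ mv pipeline/configs.py.tmp pipeline/configs.py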
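For environments without `envsubst`, the same substitution can be sketched in Python. This assumes that `pipeline/configs.py` contains the literal placeholder `$HF_ACCESS_TOKEN`; the helper name `substitute_token` is hypothetical. Unlike `envsubst`, this sketch handles only the `$NAME` form, not `${NAME}`.

```python
import os
from pathlib import Path


def substitute_token(config_path: str, var_name: str = "HF_ACCESS_TOKEN") -> None:
    """Replace the literal `$VAR_NAME` placeholder in a file with the env value."""
    token = os.environ[var_name]  # raises KeyError if the variable is not set
    path = Path(config_path)
    text = path.read_text()  # read the whole file before writing it back
    path.write_text(text.replace(f"${var_name}", token))


# Usage (run from the training_pipeline directory):
#   substitute_token("pipeline/configs.py")
```

Reading the full contents before writing avoids the truncation problem that occurs when a shell redirect targets the same file it reads from.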
Create the TFX pipeline with the tfx pipeline create command. This command registers a TFX pipeline system-wide. After creation, if you modify the pipeline definition itself, you need to run tfx pipeline update instead of create; the options and their values remain the same. Modifying files inside the modules directory does not require running tfx pipeline update.
$ tfx pipeline create --pipeline-path local_runner.py \
--engine local
Once the TFX pipeline is created (registered) successfully, you can run it with the tfx run create command. It goes through each component sequentially, and any intermediate artifacts are stored under the current directory.
$ tfx run create --pipeline-name kerasnlp-gpt2-alpaca-pipeline \
--engine local
SavedModel format