Dreambooth (LoRA) with a well-organized code structure. A naive adaptation from 🤗Diffusers.
MIT License
The code is organized into models, datasets, engines, tools, and utils, to make it more readable and maintainable, and it can be easily extended to other tasks.

conda create -n dreambooth python=3.8
conda activate dreambooth
# install pytorch
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
# install diffusers from source
pip install git+https://github.com/huggingface/diffusers
pip install -r requirements.txt
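Optionally, you can run a quick sanity check (not part of the repo, just a minimal sketch) to confirm that the CUDA build of PyTorch is installed and that diffusers imports correctly:

```python
# Sanity check: confirm the CUDA build of PyTorch is installed and diffusers is importable.
import torch
import diffusers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers:", diffusers.__version__)
```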
Step 1: Prepare your custom images and put them in a folder. Normally, 5 to 10 images are enough. We recommend manually cropping the images to the same size, e.g., 512x512, to avoid unwanted artifacts.
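If you prefer to script the cropping, here is a minimal sketch using Pillow; the raw_imgs and imgs/dogs folder names are just placeholders:

```python
# Center-crop every image in raw_imgs/ to a square and resize it to 512x512.
from pathlib import Path
from PIL import Image

src, dst = Path("raw_imgs"), Path("imgs/dogs")
dst.mkdir(parents=True, exist_ok=True)

for path in src.iterdir():
    if path.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
        continue
    img = Image.open(path).convert("RGB")
    side = min(img.size)
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side)).resize((512, 512), Image.LANCZOS)
    img.save(dst / path.name)
```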
Step 2: Initialize an Accelerate environment. Accelerate is a PyTorch library developed by Hugging Face that simplifies launching multi-GPU training and evaluation jobs.
accelerate config
Step 3: Run the training script. Both checkpoints and samples will be saved in the work_dirs folder. Normally, it only takes 1-2 minutes to fine-tune the model, with only 8GB of GPU memory occupied. 150 epochs are enough to train an object; however, when training on a human face, we recommend training for 800 epochs. The hyper-parameters of Dreambooth are quite sensitive, so you can refer to the original blog post for some insights.
accelerate launch main.py \
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
--instance_data_dir="imgs/dogs" \
--instance_prompt="a photo of sks dog" \
--validation_prompt="a photo of sks dog is swimming" \
--with_prior_preservation \
--class_prompt="a photo of dog" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=2e-4 \
--max_train_steps=150 \
--validation_epochs 4
Prior preservation is used to avoid overfitting and language drift (check out the paper to learn more if you're interested). For prior preservation, you use other images of the same class as part of the training process. The nice thing is that you can generate those images with the Stable Diffusion model itself! The training script will save the generated images to a local path you specify.
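As a rough illustration of what happens under the hood (a minimal sketch using the standard diffusers StableDiffusionPipeline; the class_imgs folder and the number of class images are placeholder choices), the class images could be generated like this:

```python
# Generate class images for prior preservation with the base Stable Diffusion model itself.
import torch
from pathlib import Path
from diffusers import StableDiffusionPipeline

Path("class_imgs").mkdir(exist_ok=True)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for i in range(100):  # the number of class images is itself a hyper-parameter
    image = pipe("a photo of dog", num_inference_steps=25).images[0]
    image.save(f"class_imgs/dog_{i:03d}.png")
```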
accelerate launch main.py \
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
--instance_data_dir="imgs/dogs" \
--instance_prompt="a photo of sks dog" \
--validation_prompt="a photo of sks dog is swimming" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=2e-4 \
--max_train_steps=150 \
--validation_epochs 10
You can also fine-tune the text encoder (CLIP) with LoRA. However, we find that this leads to unconverged results, which is the opposite of the results reported in the Original Implementation.
accelerate launch main.py \
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
--instance_data_dir="imgs/dogs" \
--instance_prompt="a photo of sks dog" \
--validation_prompt="a photo of sks dog is swimming" \
--with_prior_preservation \
--train_text_encoder \
--class_prompt="a photo of dog" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=2e-4 \
--max_train_steps=150 \
--validation_epochs 4
After training, you can use the following command to generate images from a prompt. We also provide a pretrained checkpoint for the dog example:
wget https://github.com/Mountchicken/Structured_Dreambooth_LoRA/releases/download/checkpoint_dog/checkpoint-200.zip
unzip -q checkpoint-200.zip
accelerate launch main.py \
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
--checkpoint_dir="checkpoint-200" \
--prompt="A photo of sks dog is swimming" \
--output_dir=$OUTPUT_DIR
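Alternatively, you can load the trained LoRA weights directly into a diffusers pipeline. This is only a minimal sketch and assumes the unzipped checkpoint-200 folder contains LoRA attention weights in the diffusers format:

```python
# Load the base model, attach the trained LoRA attention weights, and sample an image.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.unet.load_attn_procs("checkpoint-200")  # folder with the LoRA attention weights

image = pipe("A photo of sks dog is swimming", num_inference_steps=25).images[0]
image.save("sks_dog_swimming.png")
```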