🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
APACHE-2.0 License
This release emphasizes Stable Diffusion 3, Stability AI’s latest iteration of the Stable Diffusion family of models. It was introduced in Scaling Rectified Flow Transformers for High-Resolution Image Synthesis by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach.
As the model is gated, before using it with diffusers, you first need to go to the Stable Diffusion 3 Medium Hugging Face page, fill in the form, and accept the gate. Once you are in, you need to log in so that your system knows you've accepted the gate.
huggingface-cli login
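If you prefer, you can also log in from Python with the huggingface_hub library; this is equivalent to the CLI command above:
from huggingface_hub import login
login()  # paste your Hugging Face access token when prompted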
The code below shows how to perform text-to-image generation with SD3:
import torch
from diffusers import StableDiffusion3Pipeline
pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
image = pipe(
"A cat holding a sign that says hello world",
negative_prompt="",
num_inference_steps=28,
guidance_scale=7.0,
).images[0]
image
Refer to our documentation to learn about all the optimizations you can apply to SD3, as well as the image-to-image pipeline.
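For reference, a minimal image-to-image sketch might look like the following; the StableDiffusion3Img2ImgPipeline follows the same gating and dtype setup as above, and the strength value and source image URL here are illustrative choices rather than recommendations:
import torch
from diffusers import StableDiffusion3Img2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusion3Img2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")
# any RGB image works as the starting point
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
image = pipe(
    "A cat wizard in a colorful illustration style",
    image=init_image,
    strength=0.6,
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]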
Additionally, we support DreamBooth + LoRA fine-tuning of Stable Diffusion 3 through rectified flow. Check out this directory for more details.
Published by yiyixuxu 5 months ago
Published by sayakpaul 5 months ago
This patch release primarily introduces the Hunyuan DiT pipeline from the Tencent team.
Hunyuan DiT is a transformer-based diffusion pipeline, introduced in the paper Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding by the Tencent Hunyuan team.
import torch
from diffusers import HunyuanDiTPipeline
pipe = HunyuanDiTPipeline.from_pretrained(
"Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
)
pipe.to("cuda")
# You may also use English prompt as HunyuanDiT supports both English and Chinese
# prompt = "An astronaut riding a horse"
prompt = "一个宇航员在骑马"
image = pipe(prompt).images[0]
🧠 This pipeline has support for multi-linguality.
📜 Refer to the official docs here to learn more about it.
Thanks to @gnobitab for contributing Hunyuan DiT in #8240.
Transformer2DModel by @sayakpaul in #7647
norm_type safely while remapping by @sayakpaul in #8370
The following contributors have made significant changes to the library over the last release:
Published by sayakpaul 5 months ago
Diffusion models are known for their abilities in the space of generative modeling. This release of diffusers introduces the first official pipeline (Marigold) for discriminative tasks such as depth estimation and surface normal estimation!
Starting this release, we will also highlight the changes and features from the library that make it easy to integrate community checkpoints, features, and so on. Read on!
Proposed in Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation, Marigold introduces a diffusion model and an associated fine-tuning protocol for monocular depth estimation. It can also be extended to perform surface normal estimation.
(Image taken from the official repository)
The code snippet below shows how to use this pipeline for depth estimation:
import diffusers
import torch
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
"prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")
image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(image)
vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[0].save("einstein_depth.png")
depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction)
depth_16bit[0].save("einstein_depth_16bit.png")
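The surface normals pipeline follows the same pattern. A minimal sketch, continuing from the snippet above and assuming the prs-eth LCM normals checkpoint and the visualize_normals helper mirror the depth example:
normals_pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(
    "prs-eth/marigold-normals-lcm-v0-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")
normals = normals_pipe(image)
vis = normals_pipe.image_processor.visualize_normals(normals.prediction)
vis[0].save("einstein_normals.png")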
Check out the API documentation here. We also have a detailed guide about the pipeline here.
Thanks to @toshas, one of the authors of Marigold, who contributed this in #7847.
from_single_file
🌀 We have further refactored from_single_file to align its logic more closely to the from_pretrained method. The biggest benefit of doing this is that it allows us to expand single file loading support beyond Stable Diffusion-like pipelines and models. It also makes it easier to load models that are saved and shared in their original format.
Some of the changes introduced in this refactor:
When loading a single-file checkpoint, a matching model repository on the Hub, such as the runwayml/stable-diffusion-v1-5 repository, is used to configure the model components and pipeline.
If that inferred configuration is not appropriate for your checkpoint, you can override it with the config argument and pass in either a path to a local model repo or a repo id on the Hugging Face Hub.
pipe = StableDiffusionPipeline.from_single_file("...", config=<model repo id or local repo path>)
We are deprecating passing individual model component configuration arguments, such as num_in_channels, scheduler_type, image_size, and upcast_attention, to the from_single_file method in pipelines. This is an anti-pattern that we supported in previous versions of the library, when we assumed that it would only be relevant to Stable Diffusion based models. However, given that there is a demand to support other model types, we feel it is necessary for single-file loading behavior to adhere to the conventions set in our other loading methods. Configuring individual model components through a pipeline loading method is not something we support in from_pretrained, and therefore we will be deprecating support for this behavior in from_single_file as well.
PixArt Sigma is the successor to PixArt Alpha. PixArt Sigma is capable of directly generating images at 4K resolution. It can also produce images of markedly higher fidelity and improved alignment with text prompts. It comes with a massive sequence length of 300 (for reference, PixArt Alpha has a maximum sequence length of 120)!
import torch
from diffusers import PixArtSigmaPipeline
# You can replace the checkpoint id with "PixArt-alpha/PixArt-Sigma-XL-2-512-MS" too.
pipe = PixArtSigmaPipeline.from_pretrained(
"PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
)
# Enable memory optimizations.
pipe.enable_model_cpu_offload()
prompt = "A small cactus with a happy face in the Sahara desert."
image = pipe(prompt).images[0]
📃 Refer to the documentation here to learn more about PixArt Sigma.
Thanks to @lawrence-cj, one of the authors of PixArt Sigma, who contributed this in #7857.
@a-r-r-o-w contributed the Stable Diffusion XL (SDXL) version of AnimateDiff in #6721. However, note that this is currently an experimental feature, as only a beta release of the motion adapter checkpoint is available.
import torch
from diffusers.models import MotionAdapter
from diffusers import AnimateDiffSDXLPipeline, DDIMScheduler
from diffusers.utils import export_to_gif
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-sdxl-beta", torch_dtype=torch.float16)
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
scheduler = DDIMScheduler.from_pretrained(
model_id,
subfolder="scheduler",
clip_sample=False,
beta_schedule="linear",
steps_offset=1,
)
pipe = AnimateDiffSDXLPipeline.from_pretrained(
model_id,
motion_adapter=adapter,
scheduler=scheduler,
torch_dtype=torch.float16,
variant="fp16",
)
# enable memory savings
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()
output = pipe(
prompt="a panda surfing in the ocean, realistic, high quality",
negative_prompt="low quality, worst quality",
num_inference_steps=20,
guidance_scale=8,
width=1024,
height=1024,
num_frames=16,
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")
📜 Refer to the documentation to learn more.
@UmerHA contributed support for controlling the scales of different LoRA blocks in a granular manner in #7352. Depending on the LoRA checkpoint you are using, this granular control can significantly impact the quality of the generated outputs. The following code block shows how this feature can be used while performing inference:
...
adapter_weight_scales = { "unet": { "down": 0, "mid": 1, "up": 0} }
pipe.set_adapters("pixel", adapter_weight_scales)
image = pipe(
prompt, num_inference_steps=30, generator=torch.manual_seed(0)
).images[0]
✍️ Refer to our documentation for more details and a full-fledged example.
More granular control of scale could be extended to IP-Adapters too. @DannHuang contributed support for InstantStyle, aka granular control of IP-Adapter scales, in #7668. The following code block shows how this feature could be used when performing inference with IP-Adapters:
...
scale = {
"down": {"block_2": [0.0, 1.0]},
"up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)
This way, one can generate images following only the style or layout from the image prompt, with significantly improved diversity. This is achieved by activating IP-Adapters only in specific parts of the model.
Check out the documentation here.
ControlNet-XS was introduced in ControlNet-XS by Denis Zavadski and Carsten Rother. It is based on the observation that the control model in the original ControlNet can be made much smaller and still produce good results. ControlNet-XS generates images comparable to a regular ControlNet, but it is 20-25% faster (see benchmark with StableDiffusion-XL) and uses ~45% less memory.
ControlNet-XS is supported for both Stable Diffusion and Stable Diffusion XL.
Thanks to @UmerHA for contributing ControlNet-XS in #5827 and #6772.
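ControlNet-XS exposes pipelines analogous to the regular ControlNet ones. A minimal SDXL sketch, assuming you already have a canny edge image; the checkpoint id is a placeholder, so swap in the ControlNet-XS checkpoint you want to use:
import torch
from diffusers import StableDiffusionXLControlNetXSPipeline, ControlNetXSAdapter
from diffusers.utils import load_image

# placeholder checkpoint id; use a ControlNet-XS checkpoint trained for your conditioning type
controlnet = ControlNetXSAdapter.from_pretrained("<controlnet-xs-canny-checkpoint>", torch_dtype=torch.float16)
pipe = StableDiffusionXLControlNetXSPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

canny_image = load_image("<path or URL to a canny edge image>")
image = pipe("aerial view of a futuristic city", image=canny_image, controlnet_conditioning_scale=0.5).images[0]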
We introduced custom timesteps support for some of our pipelines and schedulers. You can now set your scheduler with a list of arbitrary timesteps. For example, you can use the AYS timesteps schedule to achieve very nice results with only 10 denoising steps.
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
from diffusers.schedulers import AysSchedules
sampling_schedule = AysSchedules["StableDiffusionXLTimesteps"]
pipe = StableDiffusionXLPipeline.from_pretrained(
"SG161222/RealVisXL_V4.0",
torch_dtype=torch.float16,
variant="fp16",
).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, algorithm_type="sde-dpmsolver++")
prompt = "A cinematic shot of a cute little rabbit wearing a jacket and doing a thumbs up"
image = pipe(prompt=prompt, timesteps=sampling_schedule).images[0]
Check out the documentation here
device_map in Pipelines 🧪
We have introduced experimental support for device_map in our pipelines. This feature becomes relevant when you have multiple accelerators to distribute the components of a pipeline across. Currently, we support only a “balanced” device_map. However, we plan to support other device mapping strategies relevant to diffusion models in the future.
from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
device_map="balanced"
)
image = pipeline("a dog").images[0]
In cases where you might be limited to low-VRAM accelerators, you can use device_map to benefit from them. Below, we simulate a situation where we have access to two GPUs, each having only 1GB of VRAM (through the max_memory argument).
from diffusers import DiffusionPipeline
import torch
max_memory = {0:"1GB", 1:"1GB"}
pipeline = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True,
device_map="balanced",
max_memory=max_memory
)
image = pipeline("a dog").images[0]
📜 Refer to the documentation to learn more about it.
VQGAN, proposed in Taming Transformers for High-Resolution Image Synthesis, is a crucial component in the modern generative image modeling toolbox. Once it is trained, its encoder can be leveraged to compute general-purpose tokens from input images.
Thanks to @isamu-isozaki, who contributed a script and related utilities to train VQGANs in #5483. For details, refer to the official training directory.
VideoProcessor Class
Similar to the VaeImageProcessor class, we have introduced a VideoProcessor class to help make the preprocessing and postprocessing of videos easier and a little more streamlined across the pipelines. Refer to the documentation to learn more.
Starting with this release, we provide guides and tutorials to help users get started with some of the most frequently used tasks in image and video generation. For this release, we have a series of three guides about outpainting with different techniques:
We introduced official callbacks that you can conveniently plug into your pipeline. For example, you can turn off classifier-free guidance after a chosen fraction of the denoising steps with SDXLCFGCutoffCallback.
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.callbacks import SDXLCFGCutoffCallback
callback = SDXLCFGCutoffCallback(cutoff_step_ratio=0.4)
pipeline = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16",
).to("cuda")
prompt = "a sports car at the road, best quality, high quality, high detail, 8k resolution"
out = pipeline(
prompt=prompt,
num_inference_steps=25,
callback_on_step_end=callback,
)
Read more on our documentation 📜
from_pipe API
Starting with this release note, we will highlight the new community pipelines! More and more of our pipelines were added as community pipelines first and graduated to official pipelines once people started to use them a lot! We do not require community pipelines to follow diffusers' coding style, so it is the easiest way to contribute to diffusers 😊
We also introduced a from_pipe API that's very useful for the community pipelines that share checkpoints with our official pipelines and improve generation quality in some way. :) You can use from_pipe(...) to load many community pipelines without additional memory requirements. With this API, you can easily switch between different pipelines to apply different techniques.
Read more about the from_pipe API in our documentation 📃.
Here are four new community pipelines since our last release.
BoxDiff lets you use bounding box coordinates for a more controlled generation. Here is an example of how you can apply this technique on a Stable Diffusion pipeline you had created previously (i.e., pipe_sd in the example below):
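If you do not already have one, a minimal pipe_sd could be created like this; the checkpoint choice is illustrative:
import torch
from diffusers import DiffusionPipeline

pipe_sd = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe_sd.to("cuda")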
pipe_box = DiffusionPipeline.from_pipe(
pipe_sd,
custom_pipeline="pipeline_stable_diffusion_boxdiff",
)
pipe_box.enable_model_cpu_offload()
phrases = ["aurora","reindeer","meadow","lake","mountain"]
boxes = [[1,3,512,202], [75,344,421,495], [1,327,508,507], [2,217,507,341], [1,135,509,242]]
boxes = [[x / 512 for x in box] for box in boxes]
prompt = "A picturesque aurora over a snowy meadow with a reindeer, a lake, and a mountain"  # example prompt matching the phrases above
generator = torch.Generator(device="cpu").manual_seed(42)
images = pipe_box(
prompt,
boxdiff_phrases=phrases,
boxdiff_boxes=boxes,
boxdiff_kwargs={
"attention_res": 16,
"normalize_eot": True
},
num_inference_steps=50,
generator=generator,
).images
Check out this community pipeline here
HD-Painter can enhance inpainting pipelines with improved prompt faithfulness and higher-resolution generation (up to 2K). You can switch from BoxDiff to HD-Painter like this:
from diffusers import DDIMScheduler
from diffusers.utils import load_image

pipe = DiffusionPipeline.from_pipe(
pipe_box,
custom_pipeline="hd_painter"
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
prompt = "wooden boat"
init_image = load_image("https://raw.githubusercontent.com/Picsart-AI-Research/HD-Painter/main/__assets__/samples/images/2.jpg")
mask_image = load_image("https://raw.githubusercontent.com/Picsart-AI-Research/HD-Painter/main/__assets__/samples/masks/2.png")
image = pipe(prompt, init_image, mask_image, use_rasg=True, use_painta=True, generator=torch.manual_seed(12345)).images[0]
Check out this community pipeline here
Differential Diffusion enables customization of the amount of change per pixel or per image region. It’s very effective in inpainting and outpainting.
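The snippet below assumes an SDXL pipeline (pipe_sdxl) created earlier via from_pretrained, plus an input image and a per-region change map. One way to build a simple change map is sketched here; whether the pipeline's map argument expects a PIL image or a tensor may depend on the community pipeline version, so treat this as an illustration:
import numpy as np
from PIL import Image
from diffusers.utils import load_image

# any RGB image works as the img2img input
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
# simple left-to-right gradient: dark regions are kept, bright regions are free to change
gradient = np.tile(np.linspace(0, 255, image.width, dtype=np.uint8), (image.height, 1))
mask = Image.fromarray(gradient, mode="L")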
pipeline = DiffusionPipeline.from_pipe(
pipe_sdxl,
custom_pipeline="pipeline_stable_diffusion_xl_differential_img2img",
).to("cuda")
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, use_karras_sigmas=True)
prompt = "a green pear"
negative_prompt = "blurry"
image = pipeline(
prompt=prompt,
negative_prompt=negative_prompt,
guidance_scale=7.5,
num_inference_steps=25,
original_image=image,
image=image,
strength=1.0,
map=mask,
).images[0]
Check out this community pipeline here.
FRESCO, introduced in FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation, enables zero-shot video-to-video translation. Learn more about it here.
distutils by @sayakpaul in #7455
[IP-Adapter] Fix IP-Adapter Support and Refactor Callback for StableDiffusionPanoramaPipeline by @standardAI in #7262
str_to_bool definition in testing utils by @DN6 in #7461
[Docs] Fix typos by @standardAI in #7451
test_lora_layers_peft.py by @UmerHA in #7394
ConsistencyDecoderVAE by @standardAI in #7290
test_lora_fuse_nan on mps by @UmerHA in #7481
final_sigma_zero to UniPCMultistep by @Beinsezii in #7517
time_context) by @KimbingNg in #7268
from_pipe method to DiffusionPipeline by @yiyixuxu in #7241
rescale_betas_zero_snr by @Beinsezii in #7531
test_freeu_enabled on MPS by @UmerHA in #7570
transformer_2d forward logic into meaningful conditions. by @sayakpaul in #7489
libsndfile1-dev and libgl1 from workflows by @sayakpaul in #7543
device_map support to pipelines by @sayakpaul in #6857
logger.warn with logger.warning by @Sai-Suraj-27 in #7643
is_cosxl_edit arg in SDXL ip2p. by @sayakpaul in #7650
optimization by @WentianZhang-ML in #7639
ruff configuration to avoid deprecated configuration warning by @Sai-Suraj-27 in #7637
optimization. by @WentianZhang-ML in #7698
type annotations for compatability with python 3.8 by @Sai-Suraj-27 in #7648
@classmethod by @Sai-Suraj-27 in #7653
ModelMixin by @sayakpaul in #6396
is_sequential_cpu_offload by @yiyixuxu in #7788
resume_download deprecation by @Wauplin in #7843
from_single_file logic with from_pretrained by @DN6 in #7496
_optional_components in StableCascadeCombinedPipeline by @yiyixuxu in #7894
timesteps and sigmas by @yiyixuxu in #7817
contributing.md file by @Sai-Suraj-27 in #7638
save_pretrained logic for compatibility by @rebel-kblee in #7821
diffusers-cli env by @standardAI in #7403
added_cond_kwargs when using IP-Adapter in StableDiffusionXLControlNetInpaintPipeline by @detkov in #7924
isinstance calls by @Sai-Suraj-27 in #7710
cross_attention_kwargs to StableDiffusionInstructPix2PixPipeline by @AlexeyZhuravlev in #7961
docstrings according to the Google Style Guide by @Sai-Suraj-27 in #7717
freedesktop_os_release() in diffusers cli for Python >=3.10 by @DN6 in #8235
resume_download deprecation V2 by @Wauplin in #8267
from_single_file docs by @DN6 in #8268
raise messages by @standardAI in #8272
The following contributors have made significant changes to the library over the last release:
[IP-Adapter] Fix IP-Adapter Support and Refactor Callback for StableDiffusionPanoramaPipeline (#7262)
[Docs] Fix typos (#7451)
ConsistencyDecoderVAE (#7290)
diffusers-cli env (#7403)
raise messages (#8272)
test_lora_layers_peft.py (#7394)
test_lora_fuse_nan on mps (#7481)
test_freeu_enabled on MPS (#7570)
Published by sayakpaul 7 months ago
Published by sayakpaul 7 months ago
Published by sayakpaul 7 months ago
We are adding support for a new text-to-image model building on Würstchen called Stable Cascade, which comes with a non-commercial license. The Stable Cascade line of pipelines differs from Stable Diffusion in that they are built upon three distinct models and allow for hierarchical compression of images, achieving remarkable outputs.
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline
import torch
prior = StableCascadePriorPipeline.from_pretrained(
"stabilityai/stable-cascade-prior",
torch_dtype=torch.bfloat16,
).to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image_emb = prior(prompt=prompt).image_embeddings[0]
decoder = StableCascadeDecoderPipeline.from_pretrained(
"stabilityai/stable-cascade",
torch_dtype=torch.bfloat16,
).to("cuda")
image = decoder(image_embeddings=image_emb, prompt=prompt).images[0]
image
📜 Check out the docs here to know more about the model.
Note: You will need torch>=2.2.0 to use the torch.bfloat16 data type with the Stable Cascade pipeline.
PlaygroundAI released a new v2.5 model (playgroundai/playground-v2.5-1024px-aesthetic), which particularly excels at aesthetics. The model closely follows the architecture of Stable Diffusion XL, except for a few tweaks. This release comes with support for this model:
from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained(
"playgroundai/playground-v2.5-1024px-aesthetic",
torch_dtype=torch.float16,
variant="fp16",
).to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt=prompt, num_inference_steps=50, guidance_scale=3).images[0]
image
Loading from the original single-file checkpoint is also supported:
from diffusers import StableDiffusionXLPipeline, EDMDPMSolverMultistepScheduler
import torch
url = "https://huggingface.co/playgroundai/playground-v2.5-1024px-aesthetic/blob/main/playground-v2.5-1024px-aesthetic.safetensors"
pipeline = StableDiffusionXLPipeline.from_single_file(url)
pipeline.to(device="cuda", dtype=torch.float16)
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipeline(prompt=prompt, guidance_scale=3.0).images[0]
image.save("playground_test_image.png")
You can also perform LoRA DreamBooth training with the playgroundai/playground-v2.5-1024px-aesthetic checkpoint:
accelerate launch train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path="playgroundai/playground-v2.5-1024px-aesthetic" \
--instance_data_dir="dog" \
--output_dir="dog-playground-lora" \
--mixed_precision="fp16" \
--instance_prompt="a photo of sks dog" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-4 \
--use_8bit_adam \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="A photo of sks dog in a bucket" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub
To know more, follow the instructions here.
EDM refers to the training and sampling techniques introduced in the following paper: Elucidating the Design Space of Diffusion-Based Generative Models. We have introduced support for training using the EDM formulation in our train_dreambooth_lora_sdxl.py script.
To train stabilityai/stable-diffusion-xl-base-1.0 using the EDM formulation, you just have to specify the --do_edm_style_training flag in your training command, and voila 🤗
If you’re interested in extending this formulation to other training scripts, we refer you to this PR.
To better support the Playground v2.5 model and EDM-style training in general, we are bringing support for EDMDPMSolverMultistepScheduler and EDMEulerScheduler. These support the EDM formulations of the DPMSolverMultistepScheduler and EulerDiscreteScheduler, respectively.
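As a quick sketch of what this looks like in practice, you could pair the Playground v2.5 checkpoint from above with the EDM-aware DPM solver; the scheduler swap and step count here are illustrative:
import torch
from diffusers import DiffusionPipeline, EDMDPMSolverMultistepScheduler

pipe = DiffusionPipeline.from_pretrained(
    "playgroundai/playground-v2.5-1024px-aesthetic",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
# swap in the EDM formulation of the multistep DPM solver
pipe.scheduler = EDMDPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt=prompt, num_inference_steps=25, guidance_scale=3).images[0]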
Trajectory Consistency Distillation (TCD) enables a model to generate higher quality and more detailed images with fewer steps. Moreover, owing to the effective error mitigation during the distillation process, TCD demonstrates superior performance even under conditions of large inference steps. It was proposed in Trajectory Consistency Distillation.
This release comes with support for a TCDScheduler that enables this kind of fast sampling. Much like LCM-LoRA, TCD requires an additional adapter for the acceleration. The code snippet below shows a usage example:
import torch
from diffusers import StableDiffusionXLPipeline, TCDScheduler
device = "cuda"
base_model_id = "stabilityai/stable-diffusion-xl-base-1.0"
tcd_lora_id = "h1t/TCD-SDXL-LoRA"
pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, variant="fp16").to(device)
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights(tcd_lora_id)
pipe.fuse_lora()
prompt = "Painting of the orange cat Otto von Garfield, Count of Bismarck-Schönhausen, Duke of Lauenburg, Minister-President of Prussia. Depicted wearing a Prussian Pickelhaube and eating his favorite meal - lasagna."
image = pipe(
prompt=prompt,
num_inference_steps=4,
guidance_scale=0,
eta=0.3,
generator=torch.Generator(device=device).manual_seed(0),
).images[0]
📜 Check out the docs here to know more about TCD.
Many thanks to @mhh0318 for contributing the TCDScheduler in #7174 and the guide in #7259.
All the pipelines supporting IP-Adapter accept an ip_adapter_image_embeds argument. If you need to run the IP-Adapter multiple times with the same image, you can encode the image once and save the embeddings to disk. This saves computation time and is especially useful when building UIs. Additionally, ComfyUI image embeddings for IP-Adapters are fully compatible with Diffusers and should work out of the box.
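As a sketch of the precompute-and-reuse workflow, assuming a pipeline that already has an IP-Adapter loaded (as in the examples elsewhere in these notes) and a reference image:
import torch

# `pipeline` has load_ip_adapter(...) already called; `image` is your reference image
image_embeds = pipeline.prepare_ip_adapter_image_embeds(
    ip_adapter_image=image,
    ip_adapter_image_embeds=None,
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)
torch.save(image_embeds, "image_embeds.ipadpt")

# later, reuse the saved embeddings without re-encoding the image
image_embeds = torch.load("image_embeds.ipadpt")
images = pipeline(
    prompt="best quality, high quality",
    ip_adapter_image_embeds=image_embeds,
    num_inference_steps=50,
).images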
We have also introduced support for providing binary masks to specify which portion of the output image should be assigned to an IP-Adapter. For each input IP-Adapter image, a binary mask and an IP-Adapter must be provided. Thanks to @fabiorigano for contributing this feature through #6847.
📜 To know about the exact usage of both of the above, refer to our official guide.
We thank our community members, @fabiorigano, @asomoza, and @cubiq, for their guidance and input on these features.
Merging LoRAs can be a fun and creative way to create new and unique images. Diffusers provides merging support with the set_adapters method, which concatenates the weights of the LoRAs to merge.
Now, Diffusers also supports the add_weighted_adapter method from the PEFT library, unlocking more efficient merging methods like TIES, DARE, linear, and even combinations of these merging methods like dare_ties.
📜 Take a look at the Merge LoRAs guide to learn more about merging in Diffusers.
We are adding support for the real-image editing technique called LEDITS++: Limitless Image Editing using Text-to-Image Models, a parameter-free method requiring no fine-tuning nor any optimization.
To edit real images, the LEDITS++ pipelines first invert the image; the DPM-solver++ based inversion facilitates editing with as little as 20 total diffusion steps for inversion and inference combined. LEDITS++ guidance is defined such that it both reflects the direction of the edit (whether we want to push away from or towards the edit concept) and the strength of the effect. The guidance also includes a masking term focused on relevant image regions which, especially for multiple edits, ensures that the corresponding guidance terms for each concept remain mostly isolated, limiting interference.
The code snippet below shows a usage example:
import torch
import PIL
import requests
from io import BytesIO
from diffusers import LEditsPPPipelineStableDiffusionXL, AutoencoderKL
device = "cuda"
base_model_id = "stabilityai/stable-diffusion-xl-base-1.0"
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
pipe = LEditsPPPipelineStableDiffusionXL.from_pretrained(
base_model_id,
vae=vae,
torch_dtype=torch.float16
).to(device)
def download_image(url):
response = requests.get(url)
return PIL.Image.open(BytesIO(response.content)).convert("RGB")
img_url = "https://www.aiml.informatik.tu-darmstadt.de/people/mbrack/tennis.jpg"
image = download_image(img_url)
_ = pipe.invert(
image = image,
num_inversion_steps=50,
skip=0.2
)
edited_image = pipe(
editing_prompt=["tennis ball","tomato"],
reverse_editing_direction=[True,False],
edit_guidance_scale=[5.0,10.0],
edit_threshold=[0.9,0.85],
).images[0]
📜 Check out the docs here to learn more about LEDITS++.
Thanks to @manuelbrack for contributing this in #6074.
config_file argument to ControlNetModel when using from_single_file by @DN6 in #6959
[PEFT / docs] Add a note about torch.compile by @younesbelkada in #6864
strength parameter in Controlnet_img2img pipelines by @tlpss in #6951
torch_dtype to set_module_tensor_to_device by @yiyixuxu in #6994
load_model_dict_into_meta for ControlNet from_single_file by @DN6 in #7034
disable_full_determinism from StableVideoDiffusion xformers test. by @DN6 in #7039
[Refactor] save_model_card function in text_to_image examples by @standardAI in #7051
[Refactor] StableDiffusionReferencePipeline inheriting from DiffusionPipeline by @standardAI in #7071
[PEFT / Core] Copy the state dict when passing it to load_lora_weights by @younesbelkada in #7058
uv in the Dockerfiles. by @sayakpaul in #7094
[Docs] Fix typos by @standardAI in #7118
rescale_betas_zero_snr by @Beinsezii in #7097
[Docs] Fix typos by @standardAI in #7131
prepare_ip_adapter_image_embeds and skip load image_encoder by @yiyixuxu in #7016
uv version for now and a minor change in the Slack notification by @sayakpaul in #7155
torch.compile by @sayakpaul in #7161
callback_on_step_end for StableDiffusionLDM3DPipeline by @rootonchair in #7149
from_config by @yiyixuxu in #7192
StableVideoDiffusionPipeline by @JinayJain in #7143
denoising_end parameter to ControlNetPipeline for SDXL by @UmerHA in #6175
depth_colored with color_map=None by @qqii in #7170
return_dict and minor doc updates by @a-r-r-o-w in #7105
export_to_video default by @DN6 in #6990
logger.warning by @sayakpaul in #7289
from_single_file by @DN6 in #7282
UNet2DConditionModel documentation by @alexanderbonnet in #7291
The following contributors have made significant changes to the library over the last release:
callback_on_step_end for StableDiffusionLDM3DPipeline (#7149)
[Refactor] save_model_card function in text_to_image examples (#7051)
[Refactor] StableDiffusionReferencePipeline inheriting from DiffusionPipeline (#7071)
[Docs] Fix typos (#7118)
[Docs] Fix typos (#7131)
return_dict and minor doc updates (#7105)
Published by yiyixuxu 8 months ago
get_order_list for solver_order=2 and lower_order_final=True by @yiyixuxu in #6953
Published by sayakpaul 9 months ago
In v0.26.0, we introduced a bug 🐛 in the BasicTransformerBlock by removing some boolean flags. This caused many popular libraries such as tomesd to break. We have fixed that in this release. Thanks to @vladmandic for bringing this to our attention.
self.use_ada_layer_norm_* params back to BasicTransformerBlock by @yiyixuxu in #6841
Published by sayakpaul 9 months ago
In the v0.26.0 release, we slipped in the torchvision library as a required dependency, which shouldn't have been the case. This is now fixed.
Published by sayakpaul 9 months ago
This new release comes with two new video pipelines, a more unified and consistent experience for single-file checkpoint loading, support for multiple IP-Adapters’ inference with multiple reference images, and more.
I2VGenXL is an image-to-video pipeline, proposed in I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models.
import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import export_to_gif, load_image
repo_id = "ali-vilab/i2vgen-xl"
pipeline = I2VGenXLPipeline.from_pretrained(repo_id, torch_dtype=torch.float16).to("cuda")
pipeline.enable_model_cpu_offload()
image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0001.jpg"
image = load_image(image_url).convert("RGB")
prompt = "A green frog floats on the surface of the water on green lotus leaves, with several pink lotus flowers, in a Chinese painting style."
negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms"
generator = torch.manual_seed(8888)
frames = pipeline(
prompt=prompt,
image=image,
num_inference_steps=50,
negative_prompt=negative_prompt,
generator=generator,
).frames
export_to_gif(frames[0], "i2v.gif")
📜 Check out the docs here.
PIA is a Personalized Image Animator that aligns with condition images, controls motion by text, and is compatible with various T2I models without specific tuning. PIA uses a base T2I model with temporal alignment layers for image animation. A key component of PIA is the condition module, which transfers appearance information for individual frame synthesis in the latent space, thus allowing a stronger focus on motion alignment. PIA was introduced in PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models.
import torch
from diffusers import (
EulerDiscreteScheduler,
MotionAdapter,
PIAPipeline,
)
from diffusers.utils import export_to_gif, load_image
adapter = MotionAdapter.from_pretrained("openmmlab/PIA-condition-adapter")
pipe = PIAPipeline.from_pretrained("SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter, torch_dtype=torch.float16)
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()
image = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
)
image = image.resize((512, 512))
prompt = "cat in a field"
negative_prompt = "wrong white balance, dark, sketches,worst quality,low quality"
generator = torch.Generator("cpu").manual_seed(0)
output = pipe(image=image, prompt=prompt, generator=generator)
frames = output.frames[0]
export_to_gif(frames, "pia-animation.gif")
📜 Check out the docs here.
IP-Adapters are becoming quite popular, so we have added support for performing inference with multiple IP-Adapters and multiple reference images! Thanks to @asomoza for their help. Get started with the code below:
import torch
from diffusers import AutoPipelineForText2Image, DDIMScheduler
from transformers import CLIPVisionModelWithProjection
from diffusers.utils import load_image
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
"h94/IP-Adapter",
subfolder="models/image_encoder",
torch_dtype=torch.float16,
)
pipeline = AutoPipelineForText2Image.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
image_encoder=image_encoder,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"])
pipeline.set_ip_adapter_scale([0.7, 0.3])
pipeline.enable_model_cpu_offload()
face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png")
style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy"
style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)]
generator = torch.Generator(device="cpu").manual_seed(0)
image = pipeline(
prompt="wonderwoman",
ip_adapter_image=[style_images, face_image],
negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
num_inference_steps=50,
generator=generator,
).images[0]
Reference style images:
📜 Check out the docs here.
The from_single_file() utility has been refactored for better readability and to follow similar semantics as from_pretrained(). Support for loading single-file checkpoints and configs from URLs has also been added.
We introduced a fix for DPM schedulers, so now you can use it with SDXL to generate high-quality images in fewer steps than the Euler scheduler.
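For instance, a sketch of pairing SDXL with the multistep DPM-Solver++ scheduler using Karras sigmas at a reduced step count; the checkpoint and step count here are illustrative:
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)
image = pipe("Astronaut in a jungle, cold color palette, detailed, 8k", num_inference_steps=25).images[0]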
Apart from these, we have done a myriad of refactoring to improve the library design and will continue to do so in the coming days.
use_karras_sigmas option by @yiyixuxu in #6477
unets module 🦋 by @sayakpaul in #6630
from_single_file() by @sayakpaul in #6638
transformers modules by @sayakpaul in #6747
tensor_to_vid function in video pipelines by @DN6 in #6715
alpha_cumprod to device to avoid redundant data movement. by @woshiyyya in #6704
is_flaky to test_model_cpu_offload_forward_pass by @sayakpaul in #6762
The following contributors have made significant changes to the library over the last release:
Published by patrickvonplaten 9 months ago
Make sure diffusers can correctly be used in offline mode again: https://github.com/huggingface/diffusers/pull/1767#issuecomment-1896194917
Published by sayakpaul 10 months ago
aMUSEd is a lightweight text-to-image model based on the MUSE architecture. aMUSEd is particularly useful in applications that require a lightweight and fast model, such as generating many images quickly at once. aMUSEd is currently a research release.
aMUSEd is a VQVAE token-based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with MUSE, it uses the smaller text encoder CLIP-L/14 instead of T5-XXL. Due to its small parameter count and few-forward-pass generation process, aMUSEd can generate many images quickly. This benefit is seen particularly at larger batch sizes.
Text-to-image generation
import torch
from diffusers import AmusedPipeline
pipe = AmusedPipeline.from_pretrained(
"amused/amused-512", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
prompt = "cowboy"
image = pipe(prompt, generator=torch.manual_seed(8)).images[0]
image.save("text2image_512.png")
Image-to-image generation
import torch
from diffusers import AmusedImg2ImgPipeline
from diffusers.utils import load_image
pipe = AmusedImg2ImgPipeline.from_pretrained(
"amused/amused-512", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
prompt = "apple watercolor"
input_image = (
load_image(
"https://huggingface.co/amused/amused-512/resolve/main/assets/image2image_256_orig.png"
)
.resize((512, 512))
.convert("RGB")
)
image = pipe(prompt, input_image, strength=0.7, generator=torch.manual_seed(3)).images[0]
image.save("image2image_512.png")
Inpainting
import torch
from diffusers import AmusedInpaintPipeline
from diffusers.utils import load_image
from PIL import Image
pipe = AmusedInpaintPipeline.from_pretrained(
"amused/amused-512", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
prompt = "a man with glasses"
input_image = (
load_image(
"https://huggingface.co/amused/amused-512/resolve/main/assets/inpainting_256_orig.png"
)
.resize((512, 512))
.convert("RGB")
)
mask = (
load_image(
"https://huggingface.co/amused/amused-512/resolve/main/assets/inpainting_256_mask.png"
)
.resize((512, 512))
.convert("L")
)
image = pipe(prompt, input_image, mask, generator=torch.manual_seed(3)).images[0]
image.save("inpainting_512.png")
📜 Docs: https://huggingface.co/docs/diffusers/main/en/api/pipelines/amused
🛠️ Models:
amused-256: https://huggingface.co/amused/amused-256 (603M params)
amused-512: https://huggingface.co/amused/amused-512 (608M params)
We're excited to present an array of optimization techniques that can be used to accelerate the inference latency of text-to-image diffusion models. All of these can be done in native PyTorch without requiring additional C++ code.
These techniques are not specific to Stable Diffusion XL (SDXL) and can be used to improve other text-to-image diffusion models too. Starting from default fp32 precision, we can achieve a 3x speed improvement by applying different PyTorch optimization techniques. We encourage you to check out the detailed docs provided below.
Note: Compared to the default way most people use Diffusers, which is fp16 + SDPA, applying all the optimizations explained in the blog below yields a 30% speed-up.
📜 Docs: https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion
🌠 PyTorch blog post: https://pytorch.org/blog/accelerating-generative-ai-3/
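As a rough sketch of the kind of optimizations discussed there (fp16 plus torch.compile on the UNet); the exact speedup depends on your GPU and PyTorch version:
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
# compile the UNet (the most expensive component); the first call is slow due to compilation
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
image = pipe("Astronaut in a jungle, cold color palette, detailed, 8k").images[0]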
Interrupting the diffusion process is particularly useful when building UIs that work with Diffusers because it allows users to stop the generation process if they're unhappy with the intermediate results. You can incorporate this into your pipeline with a callback.
This callback function should take the following arguments: pipe, i, t, and callback_kwargs (this must be returned). Set the pipeline's _interrupt attribute to True to stop the diffusion process after a certain number of steps. You are also free to implement your own custom stopping logic inside the callback.
In this example, the diffusion process is stopped after 10 steps even though num_inference_steps is set to 50.
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.enable_model_cpu_offload()
num_inference_steps = 50
def interrupt_callback(pipe, i, t, callback_kwargs):
stop_idx = 10
if i == stop_idx:
pipe._interrupt = True
return callback_kwargs
pipe(
"A photo of a cat",
num_inference_steps=num_inference_steps,
callback_on_step_end=interrupt_callback,
)
📜 Docs: https://huggingface.co/docs/diffusers/main/en/using-diffusers/callback
peft in our LoRA training examples
We incorporated peft in all the officially supported training examples concerning LoRA. This greatly simplifies the code and improves readability. LoRA training hasn't been easier, thanks to peft!
We incorporated best practices from peft to make LCM LoRA training for SDXL more memory-friendly. As such, you don't have to initialize two UNets (teacher and student) anymore. This version also integrates with the datasets library for quick experimentation. Check out this section for more details.
[logging] Fix assertion bug by @standardAI in #6012
[Docs] Update a link by @standardAI in #6014
self.sigmas during init by @yiyixuxu in #6006
#Copied from mechanism by @stevhliu in #6007
[PEFT] Adapt example scripts to use PEFT by @younesbelkada in #5388
rescale_betas_zero_snr by @Beinsezii in #6024
add_noise function by @yiyixuxu in #6085
VaeImageProcessor.numpy_to_pil by @edwardwli in #6111
[Docs] Fix typos by @standardAI in #6122
\ in lora.md by @pierd in #6174
image_encoder by @yiyixuxu in #6151
rescale_betas_zero_snr by @Beinsezii in #6187
stable_diffusion by @sayakpaul in #6264
stable_diffusion. by @sayakpaul in #6261
stable_diffusion by @sayakpaul in #6262
ValueError for a nested image list as StableDiffusionControlNetPipeline input. by @celestialphineas in #6286
datasets version of LCM LoRA SDXL by @sayakpaul in #5778
[Peft / Lora] Add adapter_names in fuse_lora by @younesbelkada in #5823
FutureWarning by @Justin900429 in #6317
peft loadable when peft isn't installed by @sayakpaul in #6306
The following contributors have made significant changes to the library over the last release:
Published by patrickvonplaten 11 months ago
Stable Video Diffusion is a powerful image-to-video generation model that can generate high-resolution (576x1024), 2-4 second videos conditioned on an input image.
There are two variants of SVD: SVD and SVD-XT. The SVD checkpoint is trained to generate 14 frames and the SVD-XT checkpoint is further fine-tuned to generate 25 frames.
You need to condition the generation on an initial image, as follows:
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
pipe = StableVideoDiffusionPipeline.from_pretrained(
"stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()
# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true")
image = image.resize((1024, 576))
generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
Since generating videos is more memory intensive, we can use the decode_chunk_size argument to control how many frames are decoded at once. This will reduce the memory usage. It's recommended to tweak this value based on your GPU memory. Setting decode_chunk_size=1 will decode one frame at a time and will use the least amount of memory, but the video might have some flickering.
Additionally, we also use model CPU offloading to reduce the memory usage.
SDXL Turbo is an adversarial time-distilled Stable Diffusion XL (SDXL) model capable of running inference in as little as 1 step. Also, it does not use classifier-free guidance, further increasing its speed. On a good consumer GPU, you can now generate an image in just 100ms.
For text-to-image, pass a text prompt. By default, SDXL Turbo generates a 512x512 image, and that resolution gives the best results. You can try setting the height and width parameters to 768x768 or 1024x1024, but you should expect quality degradation when doing so.
Make sure to set guidance_scale to 0.0 to disable it, as the model was trained without it. A single inference step is enough to generate high-quality images.
Increasing the number of steps to 2, 3, or 4 should improve image quality.
from diffusers import AutoPipelineForText2Image
import torch
pipeline_text2image = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16")
pipeline_text2image = pipeline_text2image.to("cuda")
prompt = "A cinematic shot of a baby racoon wearing an intricate italian priest robe."
image = pipeline_text2image(prompt=prompt, guidance_scale=0.0, num_inference_steps=1).images[0]
image
For image-to-image generation, make sure that num_inference_steps * strength is larger or equal to 1. The image-to-image pipeline will run for int(num_inference_steps * strength) steps, e.g. 0.5 * 2.0 = 1 step in our example below.
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid
# use from_pipe to avoid consuming additional memory when loading a checkpoint
pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda")
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
init_image = init_image.resize((512, 512))
prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
image = pipeline(prompt, image=init_image, strength=0.5, guidance_scale=0.0, num_inference_steps=2).images[0]
make_image_grid([init_image, image], rows=1, cols=2)
IP-Adapters have shown to be remarkably powerful at generating images conditioned on other images.
Thanks to @okotaku, we have added IP-Adapters to the most important pipelines, allowing you to combine them for a variety of different workflows, e.g. they work with Img2Img, ControlNet, and LCM-LoRA out of the box.
from diffusers import DiffusionPipeline, LCMScheduler
import torch
from diffusers.utils import load_image
model_id = "sd-dreambooth-library/herge-style"
lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5"
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.load_lora_weights(lcm_lora_id)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
prompt = "best quality, high quality"
image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png")
images = pipe(
prompt=prompt,
ip_adapter_image=image,
num_inference_steps=4,
guidance_scale=1,
).images[0]
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
from diffusers.utils import load_image
controlnet_model_path = "lllyasviel/control_v11f1p_sd15_depth"
controlnet = ControlNetModel.from_pretrained(controlnet_model_path, torch_dtype=torch.float16)
pipeline = StableDiffusionControlNetPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16)
pipeline.to("cuda")
image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png")
depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
generator = torch.Generator(device="cpu").manual_seed(33)
images = pipeline(
prompt='best quality, high quality',
image=depth_map,
ip_adapter_image=image,
negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
num_inference_steps=50,
generator=generator,
).images
images[0].save("yiyi_test_2_out.png")
(Example results grid: ip_image | condition | output)
For more information:
Kandinsky has released the 3rd version, which has much improved text-to-image alignment thanks to using Flan-T5 as the text encoder.
from diffusers import AutoPipelineForText2Image
import torch
pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats. One of them is reading a newspaper. The window shows the city in the background."
generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image
import torch
pipe = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
prompt = "A painting of the inside of a subway train with tiny raccoons."
image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky3/t2i.png")
generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt, image=image, strength=0.75, num_inference_steps=25, generator=generator).images[0]
Check it out:
[Docs] Update and make improvements by @standardAI in #5819
usedforsecurity=False in hashlib methods (FIPS compliance) by @Wauplin in #5790
[Tests/LoRA/PEFT] Test also on PEFT / transformers / accelerate latest by @younesbelkada in #5820
[test / peft] Fix silent behaviour on PR tests by @younesbelkada in #5852
[Docs] Update and make improvements" by @standardAI in #5858
lpw_stable_diffusion_xl pipeline if pipe.enable_sequential_cpu_offload() enabled by @VicGrygorchyk in #5885
test_examples.py for better readability by @sayakpaul in #5946
The following contributors have made significant changes to the library over the last release:
Published by patrickvonplaten 11 months ago
Small patch release to make sure the correct PEFT version is installed.
Published by sayakpaul 11 months ago
Latent Consistency Models (LCM) made quite the mark in the Stable Diffusion community by enabling ultra-fast inference. LCM author @luosiallen, alongside @patil-suraj and @dg845, managed to extend the LCM support for Stable Diffusion XL (SDXL) and pack everything into a LoRA.
The approach is called LCM LoRA.
Below is an example of using LCM LoRA, taking just 4 inference steps:
from diffusers import DiffusionPipeline, LCMScheduler
import torch
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
lcm_lora_id = "latent-consistency/lcm-lora-sdxl"
pipe = DiffusionPipeline.from_pretrained(model_id, variant="fp16", torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights(lcm_lora_id)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
prompt = "close-up photography of old man standing in the rain at night, in a street lit by lamps, leica 35mm summilux"
image = pipe(
prompt=prompt,
num_inference_steps=4,
guidance_scale=1,
).images[0]
You can combine the LoRA with Img2Img, Inpaint, ControlNet, ... as well as with other LoRAs 🤯
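For example, a sketch of combining the LCM LoRA with a second style LoRA via set_adapters, starting from the pipe and lcm_lora_id defined above before any LoRA is loaded; the pixel-art LoRA and the adapter weights are illustrative choices:
# load both adapters with explicit names so they can be combined
pipe.load_lora_weights(lcm_lora_id, adapter_name="lcm")
pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")
pipe.set_adapters(["lcm", "pixel"], adapter_weights=[1.0, 0.8])

image = pipe(
    "pixel art, a cute corgi, detailed",
    num_inference_steps=4,
    guidance_scale=1,
).images[0]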
👉 Checkpoints
📜 Docs
If you want to learn more about the approach, please have a look at the following:
Continuing the work of Latent Consistency Models (LCM), we've applied the approach to SDXL as well and give you SSD-1B and SDXL fine-tuned checkpoints.
from diffusers import DiffusionPipeline, UNet2DConditionModel, LCMScheduler
import torch
unet = UNet2DConditionModel.from_pretrained(
"latent-consistency/lcm-sdxl",
torch_dtype=torch.float16,
variant="fp16",
)
pipe = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"
generator = torch.manual_seed(0)
image = pipe(
prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=1.0
).images[0]
👉 Checkpoints
📜 Docs
OpenAI open-sourced the consistency decoder used in DALL-E 3. It improves the decoding part in the Stable Diffusion v1 family of models.
import torch
from diffusers import StableDiffusionPipeline, ConsistencyDecoderVAE
vae = ConsistencyDecoderVAE.from_pretrained("openai/consistency-decoder", torch_dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16
).to("cuda")
pipe("horse", generator=torch.manual_seed(0)).images
Find the documentation here to learn more.
mask_feature so that precomputed embeddings work with a batch size > 1 by @sayakpaul in #5677
diffusers can be used without Transformers by @sayakpaul in #5668
[Docs] Fix typos, improve, update at Using Diffusers' Task page by @standardAI in #5611
Published by sayakpaul 11 months ago
🐛 There were some sneaky bugs in the PixArt-Alpha and LCM Image-to-Image pipelines which have been fixed in this release.
Published by patrickvonplaten 12 months ago
mask_feature so that precomputed embeddings work with a batch size > 1 by @sayakpaul in #5677
diffusers can be used without Transformers by @sayakpaul in #5668
Published by patrickvonplaten 12 months ago
Published by patrickvonplaten 12 months ago
LCMs enable a significantly fast inference process for diffusion models. They require far fewer inference steps to produce high-resolution images without compromising the image quality too much. Below is a usage example:
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float32)
# To save GPU memory, torch.float16 can be used, but it may compromise image quality.
pipe.to(torch_device="cuda", torch_dtype=torch.float32)
prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"
# Can be set to 1~50 steps. LCM supports fast inference even with <= 4 steps. Recommended: 1~8 steps.
num_inference_steps = 4
images = pipe(prompt=prompt, num_inference_steps=num_inference_steps, guidance_scale=8.0).images
Refer to the documentation to learn more.
LCM comes with both text-to-image and image-to-image pipelines and they were contributed by @luosiallen, @nagolinc, and @dg845.
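The image-to-image variant can be used along the same lines. A minimal sketch using LatentConsistencyModelImg2ImgPipeline; the strength value and the input image URL (which also appears elsewhere in these notes) are illustrative:
import torch
from diffusers import LatentConsistencyModelImg2ImgPipeline
from diffusers.utils import load_image

pipe = LatentConsistencyModelImg2ImgPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float32)
pipe.to("cuda")

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
image = pipe(
    "a cat wearing a wizard hat, watercolor",
    image=init_image,
    num_inference_steps=4,
    guidance_scale=8.0,
    strength=0.5,
).images[0]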
PixArt-Alpha is a Transformer-based text-to-image diffusion model that rivals the quality of existing state-of-the-art models, such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient.
It was trained with T5 text embeddings and has a maximum sequence length of 120. Thus, it allows for more detailed prompt inputs, unlocking better quality generations.
Despite the large text encoder, with model offloading, it takes a little under 11GB of VRAM to run the PixArtAlphaPipeline:
from diffusers import PixArtAlphaPipeline
import torch
pipeline_id = "PixArt-alpha/PixArt-XL-2-1024-MS"
pipeline = PixArtAlphaPipeline.from_pretrained(pipeline_id, torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()
prompt = "A small cactus with a happy face in the Sahara desert."
image = pipeline(prompt).images[0]
image.save("sahara.png")
Check out the docs to learn more.
AnimateDiff is a modelling framework that allows you to create videos using pre-existing Stable Diffusion text-to-image models. It achieves this by inserting motion module layers into a frozen text-to-image model and training it on video clips to extract a motion prior.
These motion modules are applied after the ResNet and Attention blocks in the Stable Diffusion UNet. Their purpose is to introduce coherent motion across image frames. To support these modules, we introduce the concepts of a MotionAdapter and a UNetMotionModel. These serve as a convenient way to use these motion modules with existing Stable Diffusion models.
The following example demonstrates how you can utilize the motion modules with an existing Stable Diffusion text-to-image model.
import torch
from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_gif
# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
# load SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter)
scheduler = DDIMScheduler.from_pretrained(
model_id, subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler
# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()
output = pipe(
prompt=(
"masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
"orange sky, warm lighting, fishing boats, ocean waves seagulls, "
"rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
"golden hour, coastal landscape, seaside scenery"
),
negative_prompt="bad quality, worse quality",
num_frames=16,
guidance_scale=7.5,
num_inference_steps=25,
generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")
You can convert an existing 2D UNet into a UNetMotionModel:
from diffusers import MotionAdapter, UNetMotionModel, UNet2DConditionModel
# A UNetMotionModel can also be instantiated directly with random weights
unet = UNetMotionModel()
# Load from an existing 2D UNet and MotionAdapter
unet2D = UNet2DConditionModel.from_pretrained("SG161222/Realistic_Vision_V5.1_noVAE", subfolder="unet")
motion_adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
# load motion adapter here
unet_motion = UNetMotionModel.from_unet2d(unet2D, motion_adapter=motion_adapter)
# Or load motion modules after init
unet_motion.load_motion_modules(motion_adapter)
# freeze all 2D UNet layers except for the motion modules for finetuning
unet_motion.freeze_unet2d_params()
# Save only motion modules
unet_motion.save_motion_modules(<path to save model>, push_to_hub=True)
AnimateDiff also comes with motion LoRA modules, letting you control subtle aspects of the motion, such as camera zooming and panning:
import torch
from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_gif
# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
# load SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter)
pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out")
scheduler = DDIMScheduler.from_pretrained(
model_id, subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler
# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()
output = pipe(
prompt=(
"masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
"orange sky, warm lighting, fishing boats, ocean waves seagulls, "
"rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
"golden hour, coastal landscape, seaside scenery"
),
negative_prompt="bad quality, worse quality",
num_frames=16,
guidance_scale=7.5,
num_inference_steps=25,
generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")
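Motion LoRAs can also be combined. Below is a minimal sketch assuming a second publicly available motion LoRA repository ("guoyww/animatediff-motion-lora-pan-left") and the set_adapters API from the PEFT integration described later in these notes:
from diffusers import MotionAdapter, AnimateDiffPipeline
# Reuse the same motion adapter and base model as above.
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
pipe = AnimateDiffPipeline.from_pretrained("SG161222/Realistic_Vision_V5.1_noVAE", motion_adapter=adapter)
# Load two motion LoRAs under distinct adapter names, then blend them.
pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out")
pipe.load_lora_weights("guoyww/animatediff-motion-lora-pan-left", adapter_name="pan-left")
pipe.set_adapters(["zoom-out", "pan-left"], adapter_weights=[1.0, 1.0])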
Check out the documentation to learn more.
There are many adapters (LoRA, for example) trained in different styles to achieve different effects. You can even combine multiple adapters to create new and unique images. With the 🤗 PEFT integration in 🤗 Diffusers, it is really easy to load and manage adapters for inference.
Here is an example of combining multiple LoRAs using this new integration:
from diffusers import DiffusionPipeline
import torch
pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda")
# Load LoRA 1.
pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
# Load LoRA 2.
pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")
# Combine the adapters.
pipe.set_adapters(["pixel", "toy"], adapter_weights=[0.5, 1.0])
# Perform inference.
prompt = "toy_face of a hacker with a hoodie, pixel art"
image = pipe(
prompt, num_inference_steps=30, cross_attention_kwargs={"scale": 1.0}, generator=torch.manual_seed(0)
).images[0]
image
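Beyond combining adapters, the same integration lets you switch back to a single adapter or disable LoRA entirely. Here is a short sketch continuing the example above (set_adapters and disable_lora come with the PEFT-backed LoRA loading API):
# Use only the "toy" adapter.
pipe.set_adapters("toy")
image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0]
# Or disable all LoRA adapters and run the base model.
pipe.disable_lora()
image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0]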
Refer to the documentation to learn more.
We have had support for community pipelines for a while now. This enables fast integration for pipelines we cannot directly integrate within the core codebase of the library. However, community pipelines always rely on the building blocks from Diffusers, which can be restrictive for advanced use cases.
To address this, we’re elevating community pipelines with community components starting this release 🤗 By specifying trust_remote_code=True and writing the pipeline repository in a specific way, users can customize their pipeline and component code as flexibly as possible:
from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
"<change-username>/<change-id>", trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")
prompt = "hello"
# Text embeds
prompt_embeds, negative_embeds = pipeline.encode_prompt(prompt)
# Keyframes generation (8x64x40, 2fps)
video_frames = pipeline(
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_embeds,
num_frames=8,
height=40,
width=64,
num_inference_steps=2,
guidance_scale=9.0,
output_type="pt"
).frames
Refer to the documentation to learn more.
Most 🤗 Diffusers pipelines now accept a callback_on_step_end argument that lets you change the default behavior of the denoising loop with custom-defined functions. Here is an example of a callback that disables classifier-free guidance after 40% of the inference steps, saving compute with a minimal tradeoff in quality: once guidance is off, the unconditional half of prompt_embeds can be dropped so the denoiser only processes the conditional batch.
def callback_dynamic_cfg(pipe, step_index, timestep, callback_kwargs):
# adjust the batch_size of prompt_embeds according to guidance_scale
if step_index == int(pipe.num_timesteps * 0.4):
prompt_embeds = callback_kwargs["prompt_embeds"]
prompt_embeds = prompt_embeds.chunk(2)[-1]
# update guidance_scale and prompt_embeds
pipe._guidance_scale = 0.0
callback_kwargs["prompt_embeds"] = prompt_embeds
return callback_kwargs
Here’s how you can use it:
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
generator = torch.Generator(device="cuda").manual_seed(1)
out = pipe(prompt, generator=generator, callback_on_step_end=callback_dynamic_cfg, callback_on_step_end_tensor_inputs=["prompt_embeds"])
out.images[0].save("out_custom_cfg.png")
Check out the docs to learn more.
[PEFT / LoRA] Fix text encoder scaling by @younesbelkada in #5204
diffusers/models by @a-r-r-o-w in #5299
jnp.array in types with jnp.ndarray. by @hvaara in #4719
diffusers/models by @a-r-r-o-w in #5312
StableDiffusionXLImg2ImgPipeline creation in sdxl tutorial by @soumik12345 in #5367
[core / PEFT / LoRA] Integrate PEFT into Unet by @younesbelkada in #5151
[from_single_file()] fix: local single file loading. by @sayakpaul in #5440
[PEFT] Fix scale unscale with LoRA adapters by @younesbelkada in #5417
diffusers/models by @a-r-r-o-w in #5391
torch_dtype argument in from_single_file of ControlNetModel by @xuyxu in #5528
[PEFT / Tests] Add peft slow tests on push by @younesbelkada in #5419
[core / PEFT] Bump transformers min version for PEFT integration by @younesbelkada in #5579
[PEFT / LoRA] Fix civitai bug when network alpha is an empty dict by @younesbelkada in #5608
AutoPipeline.from_pipe() when creating a controlnet pipeline from an existing controlnet by @yiyixuxu in #5638
[Docs] Fix typos, improve, update at Using Diffusers' Tecniques page by @standardAI in #5627
The following contributors have made significant changes to the library over the last release:
diffusers/models (#5299)
diffusers/models (#5312)
diffusers/models (#5391)
[Docs] Fix typos, improve, update at Using Diffusers' Tecniques page (#5627)