Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
VILA - a multi-image visual language model with training, inference and evaluation recipe, deploy...
Code Release of F-LMM: Grounding Frozen Large Multimodal Models
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
Latte: Latent Diffusion Transformer for Video Generation.
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o
Open-sourced codes for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2....
Mixture-of-Experts for Large Vision-Language Models
[EMNLP 2024] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
A state-of-the-art open visual language model | multimodal pre-trained model
Official repo for VGen: a holistic video generation ecosystem for video generation building on di...
[ECCV 2024, Oral] DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by ...
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation