Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
🔥🔥🔥🔥 (Earlier YOLOv7 not official one) YOLO with Transformers and Instance Segmentation, with Ten...
IDM-VTON: Improving Diffusion Models for Authentic Virtual Try-on in the Wild
Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
COYO-700M: Large-scale Image-Text Pair Dataset
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
[ICML'23] StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis
A state-of-the-art open visual language model | multimodal pretrained model
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Contrastive Language-Audio Pretraining
CKIP Neural Chinese Word Segmentation, POS Tagging, and NER
[CVPR 2024] Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
In defence of metric learning for speaker recognition
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"