
This repository is used to collect papers and code in the field of AI.

MIT License



The contents are organized into the following parts:

Table of Contents

  ├─ NLP/  
    ├─ Word2Vec/  
    ├─ Seq2Seq/           
    └─ Pretraining/  
      ├─ Large Language Model/          
      ├─ LLM Application/ 
        ├─ AI Agent/          
        ├─ Academic/          
        ├─ Code/       
        ├─ Financial Application/
        ├─ Information Retrieval/  
        ├─ Math/     
        ├─ Medicine and Law/   
        ├─ Recommend System/      
        └─ Tool Learning/             
      ├─ LLM Technique/ 
        ├─ Alignment/          
        ├─ Context Length/          
        ├─ Corpus/       
        ├─ Evaluation/
        ├─ Hallucination/  
        ├─ Inference/     
        ├─ MoE/   
        ├─ PEFT/     
        ├─ Prompt Learning/   
        ├─ RAG/       
        └─ Reasoning and Planning/       
      ├─ LLM Theory/       
      └─ Chinese Model/             
  ├─ CV/  
    ├─ CV Application/          
    ├─ Contrastive Learning/         
    ├─ Foundation Model/ 
    ├─ Generative Model (GAN and VAE)/          
    ├─ Image Editing/          
    ├─ Object Detection/          
    ├─ Semantic Segmentation/            
    └─ Video/          
  ├─ Multimodal/       
    ├─ Audio/          
    ├─ BLIP/         
    ├─ CLIP/        
    ├─ Diffusion Model/   
    ├─ Multimodal LLM/          
    ├─ Text2Image/          
    ├─ Text2Video/            
    └─ Survey/           
  ├─ Reinforcement Learning/ 
  ├─ GNN/ 
  └─ Transformer Architecture/          


1. Word2Vec

  • Efficient Estimation of Word Representations in Vector Space, Mikolov et al., arxiv 2013. [paper]
  • Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al., NIPS 2013. [paper]
  • Distributed Representations of Sentences and Documents, Le and Mikolov, ICML 2014. [paper]
  • Word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method, Goldberg and Levy, arxiv 2014. [paper]
  • word2vec Parameter Learning Explained, Rong, arxiv 2014. [paper]
  • GloVe: Global Vectors for Word Representation, Pennington et al., EMNLP 2014. [paper][code]
  • fastText: Bag of Tricks for Efficient Text Classification, Joulin et al., arxiv 2016. [paper][code]
  • ELMo: Deep Contextualized Word Representations, Peters et al., NAACL 2018. [paper]
  • BPE: Neural Machine Translation of Rare Words with Subword Units, Sennrich et al., ACL 2016. [paper][code]
  • Byte-Level BPE: Neural Machine Translation with Byte-Level Subwords, Wang et al., arxiv 2019. [paper][code]

2. Seq2Seq

  • Generating Sequences With Recurrent Neural Networks, Graves, arxiv 2013. [paper]
  • Sequence to Sequence Learning with Neural Networks, Sutskever et al., NeurIPS 2014. [paper]
  • Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., ICLR 2015. [paper][code]
  • On the Properties of Neural Machine Translation: Encoder-Decoder Approaches, Cho et al., arxiv 2014. [paper]
  • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Cho et al., arxiv 2014. [paper]
  • [fairseq][fairseq2][pytorch-seq2seq]

3. Pretraining

3.1 Large Language Model

3.2 LLM Application

3.2.1 AI Agent
  • LLM Powered Autonomous Agents, Lilian Weng, 2023. [blog][LLMAgentPapers][LLM-Agents-Papers][awesome-language-agents][Awesome-Papers-Autonomous-Agent]

  • A Survey on Large Language Model based Autonomous Agents, Wang et al., arxiv 2023. [paper][code][LLM-Agent-Paper-Digest]

  • The Rise and Potential of Large Language Model Based Agents: A Survey, Xi et al., arxiv 2023. [paper][code]

  • Agent AI: Surveying the Horizons of Multimodal Interaction, Durante et al., arxiv 2024. [paper]

  • Position Paper: Agent AI Towards a Holistic Intelligence, Huang et al., arxiv 2024. [paper]

  • AgentBench: Evaluating LLMs as Agents, Liu et al., ICLR 2024. [paper][code][VisualAgentBench][OSWorld][AgentGym]

  • Agents: An Open-source Framework for Autonomous Language Agents, Zhou et al., arxiv 2023. [paper][code]

  • AutoAgents: A Framework for Automatic Agent Generation, Chen et al., arxiv 2023. [paper][code]

  • AgentTuning: Enabling Generalized Agent Abilities for LLMs, Zeng et al., arxiv 2023. [paper][code]

  • AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors, Chen et al., ICLR 2024. [paper][code]

  • AppAgent: Multimodal Agents as Smartphone Users, Zhang et al., arxiv 2023. [paper][code][digirl]

  • Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception, Wang et al., arxiv 2024. [paper][code][Mobile-Agent-v2]

  • Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, Li et al., arxiv 2024. [paper][code]

  • AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, Wu et al., arxiv 2023. [paper][code]

  • CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society, Li et al., NeurIPS 2023. [paper][code][crab]

  • ChatDev: Communicative Agents for Software Development, Qian et al., ACL 2024. [paper][code][gpt-pilot]

  • MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework, Hong et al., ICLR 2024 Oral. [paper][code]

  • ProAgent: From Robotic Process Automation to Agentic Process Automation, Ye et al., arxiv 2023. [paper][code]

  • RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation, Luo et al., arxiv 2024. [paper][code]

  • Generative Agents: Interactive Simulacra of Human Behavior, Park et al., arxiv 2023. [paper][code][GPTeam]

  • CogAgent: A Visual Language Model for GUI Agents, Hong et al., CVPR 2024. [paper][code]

  • OpenAgents: An Open Platform for Language Agents in the Wild, Xie et al., arxiv 2023. [paper][code]

  • TaskWeaver: A Code-First Agent Framework, Qiao et al., arxiv 2023. [paper][code]

  • MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge, Fan et al., NeurIPS 2022 Outstanding Paper. [paper][code]

  • Voyager: An Open-Ended Embodied Agent with Large Language Models, Wang et al., arxiv 2023. [paper][code]

  • Eureka: Human-Level Reward Design via Coding Large Language Models, Ma et al., ICLR 2024. [paper][code][DrEureka]

  • LEGENT: Open Platform for Embodied Agents, Cheng et al., ACL 2024. [paper][code]

  • Mind2Web: Towards a Generalist Agent for the Web, Deng et al., NeurIPS 2023. [paper][code][AutoWebGLM]

  • WebArena: A Realistic Web Environment for Building Autonomous Agents, Zhou et al., ICLR 2024. [paper][code][visualwebarena][agent-workflow-memory][WindowsAgentArena]

  • SeeAct: GPT-4V(ision) is a Generalist Web Agent, if Grounded, Zheng et al., arxiv 2024. [paper][code]

  • Cradle: Empowering Foundation Agents Towards General Computer Control, Tan et al., arxiv 2024. [paper][code]

  • AgentScope: A Flexible yet Robust Multi-Agent Platform, Gao et al., arxiv 2024. [paper][code][modelscope-agent]

  • AgentGym: Evolving Large Language Model-based Agents across Diverse Environments, Xi et al., arxiv 2024. [paper][code]

  • Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence, Chen et al., arxiv 2024. [paper][code]

  • CLASI: Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent, ByteDance Research, 2024. [paper][translation-agent]

  • Automated Design of Agentic Systems, Hu et al., arxiv 2024. [paper][code][agent-zero][AgentK]

  • Foundation Models in Robotics: Applications, Challenges, and the Future, Firoozi et al., arxiv 2023. [paper][code]

  • Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI, Liu et al., arxiv 2024. [paper][code]

  • RT-1: Robotics Transformer for Real-World Control at Scale, Brohan et al., arxiv 2022. [paper][code][IRASim]

  • RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, Brohan et al., arxiv 2023. [paper][Unofficial Implementation][RT-H: Action Hierarchies Using Language]

  • Open X-Embodiment: Robotic Learning Datasets and RT-X Models, Open X-Embodiment Collaboration, arxiv 2023. [paper][code]

  • Shaping the future of advanced robotics, Google DeepMind, 2024. [blog]

  • RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation, Wang et al., ICML 2024. [paper][code]

  • RL-GPT: Integrating Reinforcement Learning and Code-as-policy, Liu et al., arxiv 2024. [paper]

  • Genie: Generative Interactive Environments, Bruce et al., ICML 2024 Best Paper. [paper][GameNGen][GameGen-O]

  • Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation, Fu et al., arxiv 2024. [paper][code][Hardware Code][Learning Code][UMI][humanplus][TeleVision][Surgical Robot Transformer][lifelike-agility-and-play][ReKep]

  • Octo: An Open-Source Generalist Robot Policy, Ghosh et al., arxiv 2024. [paper][code][BodyTransformer][crossformer]

  • GRUtopia: Dream General Robots in a City at Scale, Wang et al., arxiv 2024. [paper][code]

  • HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers, Wang et al., NeurIPS 2024 Spotlight. [paper][code]

  • [LeRobot][DORA][awesome-ai-agents][IsaacLab][Awesome-Robotics-3D][AimRT]

  • [AutoGPT][GPT-Engineer][AgentGPT]

  • [BabyAGI][SuperAGI][OpenAGI]

  • [open-interpreter][Homepage][rawdog][OpenCodeInterpreter]

  • XAgent: An Autonomous Agent for Complex Task Solving, [blog][code]

  • [crewAI][PraisonAI][llama_deploy][phidata][gpt-computer-assistant][agentic_patterns]

  • [translation-agent][agent-zero][AgentK][Twitter Personality][RD-Agent]

3.2.2 Academic
  • Galactica: A Large Language Model for Science, Taylor et al., arxiv 2022. [paper][code]

  • K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization, Deng et al., arxiv 2023. [paper][code][pdf_parser]

  • GeoGalactica: A Scientific Large Language Model in Geoscience, Lin et al., arxiv 2024. [paper][code][sciparser]

  • Scientific Large Language Models: A Survey on Biological & Chemical Domains, Zhang et al., arxiv 2024. [paper][code][sciknoweval]

  • SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning, Zhang et al., arxiv 2024. [paper][code]

  • ChemLLM: A Chemical Large Language Model, Zhang et al., arxiv 2024. [paper][model]

  • LangCell: Language-Cell Pre-training for Cell Identity Understanding, Zhao et al., ICML 2024. [paper][code][scFoundation]

  • SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers, Pramanick et al., arxiv 2024. [paper][code]

  • STORM: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models, Shao et al., NAACL 2024. [paper][code]

  • Automated Peer Reviewing in Paper SEA: Standardization, Evaluation, and Analysis, Yu et al., arxiv 2024. [paper][code]

  • OpenResearcher: Unleashing AI for Accelerated Scientific Research, Zheng et al., arxiv 2024. [paper][code][Paper Copilot][SciAgentsDiscovery][paper-qa][GraphReasoning]

  • The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, Lu et al., arxiv 2024. [paper][code]

  • Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers, Si et al., arxiv 2024. [paper][code]

  • [Awesome-Scientific-Language-Models][gpt_academic][ChatPaper][scispacy][awesome-ai4s][xVal]

3.2.3 Code
  • Neural code generation, CMU 2024 Spring. [link]

  • Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code, Zhang et al., arxiv 2023. [paper][Awesome-Code-LLM][MFTCoder][Awesome-Code-LLM]

  • Source Code Data Augmentation for Deep Learning: A Survey, Zhuo et al., arxiv 2023. [paper][code]

  • Codex: Evaluating Large Language Models Trained on Code, Chen et al., arxiv 2021. [paper][human-eval][CriticGPT][On scalable oversight with weak LLMs judging strong LLMs]

  • Code Llama: Open Foundation Models for Code, Rozière et al., arxiv 2023. [paper][code][model][llamacoder]

  • CodeGemma: Open Code Models Based on Gemma, [blog][report]

  • AlphaCode: Competition-Level Code Generation with AlphaCode, Li et al., arxiv 2022. [paper][dataset][AlphaCode2_Tech_Report]

  • CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X, Zheng et al., KDD 2023. [paper][code][CodeGeeX2][CodeGeeX4]

  • CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis, Nijkamp et al., ICLR 2022. [paper][code]

  • CodeGen2: Lessons for Training LLMs on Programming and Natural Languages, Nijkamp et al., ICLR 2023. [paper][code]

  • CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules, Le et al., arxiv 2023. [paper][code]

  • StarCoder: may the source be with you, Li et al., arxiv 2023. [paper][code][bigcode-project][model]

  • StarCoder 2 and The Stack v2: The Next Generation, Lozhkov et al., 2024. [paper][code][starcoder.cpp]

  • WizardCoder: Empowering Code Large Language Models with Evol-Instruct, Luo et al., ICLR 2024. [paper][code]

  • Magicoder: Source Code Is All You Need, Wei et al., arxiv 2023. [paper][code]

  • Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering, Ridnik et al., arxiv 2024. [paper][code][pr-agent][cover-agent]

  • DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence, Guo et al., arxiv 2024. [paper][code]

  • DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence, Zhu et al., CoRR 2024. [paper][code][DeepSeek-V2.5]

  • Qwen2.5-Coder Technical Report, Hui et al., arxiv 2024. [paper][code]

  • If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents, Yang et al., arxiv 2024. [paper]

  • Design2Code: How Far Are We From Automating Front-End Engineering?, Si et al., arxiv 2024. [paper][code]

  • AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct, Lei et al., arxiv 2024. [paper][code]

  • SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering, Yang et al., arxiv 2024. [paper][code][swe-bench-technical-report][CodeR]

  • Agentless: Demystifying LLM-based Software Engineering Agents, Xia et al., arxiv 2024. [paper][code]

  • BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions, Zhuo et al., arxiv 2024. [paper][code]

  • OpenDevin: An Open Platform for AI Software Developers as Generalist Agents, Wang et al., arxiv 2024. [paper][code]

  • Planning In Natural Language Improves LLM Search For Code Generation, Wang et al., arxiv 2024. [paper]

  • Large Language Model-Based Agents for Software Engineering: A Survey, Liu et al., arxiv 2024. [paper][code]

  • HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale, Phan et al., arxiv 2024. [paper][code]

  • [Yi-Coder][aiXcoder-7B][codealpaca]

  • [OpenDevin][devika][auto-code-rover][developer][aider][claude-engineer][SuperCoder]

  • [screenshot-to-code][vanna][NL2SQL_Handbook][TAG-Bench]

3.2.4 Financial Application
  • DocLLM: A layout-aware generative language model for multimodal document understanding, Wang et al., arxiv 2024. [paper]

  • DocGraphLM: Documental Graph Language Model for Information Extraction, Wang et al., arxiv 2023. [paper]

  • FinBERT: A Pretrained Language Model for Financial Communications, Yang et al., arxiv 2020. [paper][Wiley paper][code][finBERT][valuesimplex/FinBERT]

  • FinGPT: Open-Source Financial Large Language Models, Yang et al., IJCAI 2023. [paper][code]

  • FinRobot: An Open-Source AI Agent Platform for Financial Applications using Large Language Models, Yang et al., arxiv 2024. [paper][code]

  • FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets, Wang et al., arxiv 2023. [paper][code]

  • Instruct-FinGPT: Financial Sentiment Analysis by Instruction Tuning of General-Purpose Large Language Models, Zhang et al., arxiv 2023. [paper][code]

  • FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance, Liu et al., arxiv 2020. [paper][code]

  • FinRL-Meta: Market Environments and Benchmarks for Data-Driven Financial Reinforcement Learning, Liu et al., NeurIPS 2022. [paper][code]

  • DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple Experts Fine-tuning, Chen et al., arxiv 2023. [paper][code]

  • A Multimodal Foundation Agent for Financial Trading: Tool-Augmented, Diversified, and Generalist, Zhang et al., arxiv 2024. [paper]

  • XuanYuan 2.0: A Large Chinese Financial Chat Model with Hundreds of Billions Parameters, Zhang et al., arxiv 2023. [paper][code]

  • Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications, Xie et al., arxiv 2024. [paper][code]

  • StructGPT: A General Framework for Large Language Model to Reason over Structured Data, Jiang et al., arxiv 2023. [paper][code]

  • Large Language Model for Table Processing: A Survey, Lu et al., arxiv 2024. [paper][llm-table-survey][table-transformer][Awesome-Tabular-LLMs][Awesome-LLM-Tabular][Table-LLaVA]

  • rLLM: Relational Table Learning with LLMs, Li et al., arxiv 2024. [paper][code]

  • Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow, Zhang et al., arxiv 2023. [paper][code]

  • Data Interpreter: An LLM Agent For Data Science, Hong et al., arxiv 2024. [paper][code]

  • AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework, Li et al., COLING 2024. [paper][code]

  • LLMFactor: Extracting Profitable Factors through Prompts for Explainable Stock Movement Prediction, Wang et al., arxiv 2024. [paper][MIGA]

  • A Survey of Large Language Models in Finance (FinLLMs), Lee et al., arxiv 2024. [paper][code][Revolutionizing Finance with LLMs: An Overview of Applications and Insights]

  • A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges, Nie et al., arxiv 2024. [paper]

  • PEER: Expertizing Domain-Specific Tasks with a Multi-Agent Framework and Tuning Methods, Wang et al., arxiv 2024. [paper][code][Stockagent]

  • Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset, Zhu et al., ACL 2024. [paper][code]

  • [gpt-investor][FinGLM][agentUniverse][gs-quant][stockbot-on-groq][Real-Time-Stock-Market-Prediction-using-Ensemble-DL-and-Rainbow-DQN]

3.2.5 Information Retrieval
  • ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, Khattab et al., SIGIR 2020. [paper][simbert][roformer-sim]

  • ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction, Santhanam et al., NAACL 2022. [paper][code][RAGatouille][A Reproducibility Study of PLAID][Jina-ColBERT-v2]

  • ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval, Louis et al., arxiv 2024. [paper][code][model]

  • NCI: A Neural Corpus Indexer for Document Retrieval, Wang et al., NeurIPS 2022 Outstanding Paper. [paper][code][DSI-transformers][GDR EACL 2024 Oral]

  • HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels, Gao et al., ACL 2023. [paper][code]

  • Query2doc: Query Expansion with Large Language Models, Wang et al., EMNLP 2023. [paper][Query Expansion by Prompting Large Language Models]

  • RankGPT: Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents, Sun et al., EMNLP 2023 Outstanding Paper. [paper][code]

  • Large Language Models for Information Retrieval: A Survey, Zhu et al., arxiv 2023. [paper][code][YuLan-IR]

  • Large Language Models for Generative Information Extraction: A Survey, Xu et al., arxiv 2023. [paper][code][UIE][NERRE][uie_pytorch]

  • LLaRA: Making Large Language Models A Better Foundation For Dense Retrieval, Li et al., arxiv 2023. [paper][code]

  • UniGen: A Unified Generative Framework for Retrieval and Question Answering with Large Language Models, Li et al., AAAI 2024. [paper]

  • INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning, Zhu et al., ACL 2024. [paper][code][ChatRetriever]

  • GenIR: From Matching to Generation: A Survey on Generative Information Retrieval, Li et al., arxiv 2024. [paper][code]

  • D2LLM: Decomposed and Distilled Large Language Models for Semantic Search, Liao et al., ACL 2024. [paper][code]

  • BM25S: Orders of magnitude faster lexical search via eager sparse scoring, Xing Han Lù, arxiv 2024. [paper][code][rank_bm25][pyserini]

  • MindSearch: Mimicking Human Minds Elicits Deep AI Searcher, Chen et al., arxiv 2024. [paper][code]

  • Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express, Aroraa et al., arxiv 2024. [paper]

  • SIGIR-AP 2023 Tutorial: Recent Advances in Generative Information Retrieval. [link]

  • SIGIR 2024 Tutorial: Large Language Model Powered Agents for Information Retrieval. [link]

  • [search_with_lepton][LLocalSearch][FreeAskInternet][storm][searxng][Perplexica][rag-search][sensei]

  • [similarities][text2vec]

  • [SearchEngine]

3.2.6 Math
  • ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving, Gou et al., ICLR 2024. [paper][code]

  • MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models, Yu et al., ICLR 2024. [paper][code]

  • MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models, Lu et al., ICLR 2024 Oral. [paper][code][MathBench][OlympiadBench]

  • InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning, Ying et al., arxiv 2024. [paper][code]

  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, Shao et al., arxiv 2024. [paper][code][DeepSeek-Prover-V1.5]

  • Common 7B Language Models Already Possess Strong Math Capabilities, Li et al., arxiv 2024. [paper][code]

  • ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline, Xu et al., arxiv 2024. [paper][code]

  • AlphaMath Almost Zero: Process Supervision without Process, Chen et al., arxiv 2024. [paper][code]

  • JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models, Zhou et al., NeurIPS 2024. [paper][code]

  • Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B, Zhang et al., arxiv 2024. [paper][code][LLaMA-Berry]

  • Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models, Shi et al., arxiv 2024. [paper][code]

  • We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?, Qiao et al., arxiv 2024. [paper][code]

  • MAVIS: Mathematical Visual Instruction Tuning, Zhang et al., arxiv 2024. [paper][code]

  • Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement, Yang et al., arxiv 2024. [paper][code][Qwen2.5-Math-Demo]

  • AI Mathematical Olympiad - Progress Prize 1, Kaggle Competition 2024. [Numina 1st Place Solution][project-numina/aimo-progress-prize][How NuminaMath Won the 1st AIMO Progress Prize][NuminaMath-7B-TIR][AI achieves silver-medal standard solving International Mathematical Olympiad problems]

3.2.7 Medicine and Law

3.2.8 Recommend System

3.2.9 Tool Learning
  • Tool Learning with Foundation Models, Qin et al., arxiv 2023. [paper][code]

  • Tool Learning with Large Language Models: A Survey, Qu et al., arxiv 2024. [paper][code]

  • Toolformer: Language Models Can Teach Themselves to Use Tools, Schick et al., arxiv 2023. [paper][toolformer-pytorch][conceptofmind/toolformer][xrsrke/toolformer][Graph_Toolformer]

  • ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, Qin et al., ICLR 2024 Spotlight. [paper][code][StableToolBench]

  • Gorilla: Large Language Model Connected with Massive APIs, Patil et al., arxiv 2023. [paper][code]

  • GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction, Yang et al., arxiv 2023. [paper][code]

  • RestGPT: Connecting Large Language Models with Real-World RESTful APIs, Song et al., arxiv 2023. [paper][code]

  • LLMCompiler: An LLM Compiler for Parallel Function Calling, Kim et al., arxiv 2023. [paper][code]

  • Large Language Models as Tool Makers, Cai et al., arxiv 2023. [paper][code]

  • ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases, Tang et al., arxiv 2023. [paper][code][ToolQA][toolbench]

  • ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search, Zhuang et al., arxiv 2023. [paper][code]

  • Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models, Lu et al., NeurIPS 2023. [paper][code]

  • ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios, Ye et al., arxiv 2024. [paper][code]

  • AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls, Du et al., arxiv 2024. [paper][code]

  • LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error, Wang et al., arxiv 2024. [paper][code]

  • What Are Tools Anyway? A Survey from the Language Model Perspective, Wang et al., arxiv 2024. [paper]

  • ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities, Lu et al., arxiv 2024. [paper][code]

  • Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval, Chen et al., arxiv 2024. [paper]

  • ToolACE: Winning the Points of LLM Function Calling, Liu et al., arxiv 2024. [paper]

  • [functionary][ToolLearningPapers][awesome-tool-llm]

3.3 LLM Technique

3.3.1 Alignment

3.3.2 Context Length
  • ALiBi: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, Press et al., ICLR 2022. [paper][code]
  • Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation, Chen et al., arxiv 2023. [paper]
  • Scaling Transformer to 1M tokens and beyond with RMT, Bulatov et al., AAAI 2024. [paper][code][LM-RMT]
  • RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text, Zhou et al., arxiv 2023. [paper][code]
  • LongNet: Scaling Transformers to 1,000,000,000 Tokens, Ding et al., arxiv 2023. [paper][code][unofficial code]
  • Focused Transformer: Contrastive Training for Context Scaling, Tworkowski et al., NeurIPS 2023. [paper][code]
  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, Chen et al., ICLR 2024 Oral. [paper][code]
  • StreamingLLM: Efficient Streaming Language Models with Attention Sinks, Xiao et al., ICLR 2024. [paper][code][SwiftInfer][SwiftInfer blog]
  • YaRN: Efficient Context Window Extension of Large Language Models, Peng et al., ICLR 2024. [paper][code][LM-Infinite]
  • Ring Attention with Blockwise Transformers for Near-Infinite Context, Liu et al., ICLR 2024. [paper][code][ring-attention-pytorch][local-attention][tree_attention]
  • LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression, Jiang et al., ACL 2024. [paper][code]
  • LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, Ding et al., arxiv 2024. [paper][code]
  • LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning, Jin et al., arxiv 2024. [paper][code]
  • The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey, Pawar et al., arxiv 2024. [paper][Awesome-LLM-Long-Context-Modeling]
  • Data Engineering for Scaling Language Models to 128K Context, Fu et al., arxiv 2024. [paper][code]
  • CEPE: Long-Context Language Modeling with Parallel Context Encoding, Yen et al., ACL 2024. [paper][code]
  • Training-Free Long-Context Scaling of Large Language Models, An et al., ICML 2024. [paper][code]
  • InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory, Xiao et al., NeurIPS 2024. [paper][code]
  • Counting-Stars: A Simple, Efficient, and Reasonable Strategy for Evaluating Long-Context Large Language Models, Song et al., arxiv 2024. [paper][code][LLMTest_NeedleInAHaystack][LooGLE][LongBench][google-deepmind/loft]
  • Infini-Transformer: Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention, Munkhdalai et al., arxiv 2024. [paper][infini-transformer-pytorch][InfiniTransformer][infini-mini-transformer][megalodon]
  • Extending Llama-3's Context Ten-Fold Overnight, Zhang et al., arxiv 2024. [paper][code][activation_beacon]
  • Make Your LLM Fully Utilize the Context, An et al., arxiv 2024. [paper][code]
  • CoPE: Contextual Position Encoding: Learning to Count What's Important, Golovneva et al., arxiv 2024. [paper][rope_cope]
  • Scaling Granite Code Models to 128K Context, Stallone et al., arxiv 2024. [paper][code]
  • Generalizing an LLM from 8k to 1M Context using Qwen-Agent, Qwen Team, 2024. [blog]
  • LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, Bai et al., arxiv 2024. [paper][code][LongCite]
  • A failed experiment: Infini-Attention, and why we should keep trying, HuggingFace Blog, 2024. [blog][Magic Blog]

3.3.3 Corpus
  • [datatrove][datasets][doccano][label-studio][autolabel]

  • Thinking about High-Quality Human Data, Lilian Weng, 2024. [blog]

  • C4: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus, Dodge et al., arxiv 2021. [paper][dataset]

  • The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset, Laurençon et al., NeurIPS 2023. [paper][code][dataset]

  • The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only, Penedo et al., arxiv 2023. [paper][dataset]

  • Data-Juicer: A One-Stop Data Processing System for Large Language Models, Chen et al., arxiv 2023. [paper][code]

  • UltraChat: Enhancing Chat Language Models by Scaling High-quality Instructional Conversations, Ding et al., EMNLP 2023. [paper][code][ultrachat]

  • UltraFeedback: Boosting Language Models with High-quality Feedback, Cui et al., ICML 2024. [paper][code][UltraInteract_sft]

  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning, Liu et al., ICLR 2024. [paper][code]

  • WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset, Qiu et al., arxiv 2024. [paper][dataset][LabelLLM][labelU][MinerU][PDF-Extract-Kit]

  • Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research, Soldaini et al., ACL 2024. [paper][code][OLMo]

  • Datasets for Large Language Models: A Comprehensive Survey, Liu et al., arxiv 2024. [paper][Awesome-LLMs-Datasets]

  • DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows, Patel et al., arxiv 2024. [paper][code]

  • Large Language Models for Data Annotation: A Survey, Tan et al., arxiv 2024. [paper][code]

  • Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance, Ye et al., arxiv 2024. [paper][code]

  • COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning, Bai et al., arxiv 2024. [paper][dataset]

  • Best Practices and Lessons Learned on Synthetic Data for Language Models, Liu et al., arxiv 2024. [paper]

  • The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale, HuggingFace, 2024. [paper][blogpost][fineweb][fineweb-edu]

  • DataComp: In search of the next generation of multimodal datasets, Gadre et al., arxiv 2023. [paper][code]

  • DataComp-LM: In search of the next generation of training sets for language models, Li et al., arxiv 2024. [paper][code][apple/DCLM-7B-8k]

  • Scaling Synthetic Data Creation with 1,000,000,000 Personas, Chan et al., arxiv 2024. [paper][code]

  • Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale, Zhou et al., arxiv 2024. [paper][code]

  • MinerU: An Open-Source Solution for Precise Document Content Extraction, Wang et al., arxiv 2024. [paper][code]

  • Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models, Lai et al., arxiv 2024. [paper][BLIP]

  • [RedPajama-Data][xland-minigrid-datasets][OmniCorpus][dclm][Infinity-Instruct][MNBVC][LMSYS-Chat-1M]

  • [llm-datasets][Awesome-LLM-Synthetic-Data]

3.3.4 Evaluation
3.3.5 Hallucination
  • Extrinsic Hallucinations in LLMs, Lilian Weng, 2024. [blog]
  • Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models, Zhang et al., arxiv 2023. [paper][code]
  • A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions, Huang et al., arxiv 2023. [paper][code][Awesome-MLLM-Hallucination]
  • The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models, Li et al., arxiv 2024. [paper][code]
  • FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios, Chern et al., arxiv 2023. [paper][code][OlympicArena][FActScore]
  • Chain-of-Verification Reduces Hallucination in Large Language Models, Dhuliawala et al., arxiv 2023. [paper][code]
  • HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models, Guan et al., CVPR 2024. [paper][code]
  • Woodpecker: Hallucination Correction for Multimodal Large Language Models, Yin et al., arxiv 2023. [paper][code]
  • OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation, Huang et al., CVPR 2024 Highlight. [paper][code]
  • TrustLLM: Trustworthiness in Large Language Models, Sun et al., arxiv 2024. [paper][code]
  • SAFE: Long-form factuality in large language models, Wei et al., arxiv 2024. [paper][code]
  • RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models, Hu et al., arxiv 2024. [paper][code][HaluAgent][LLMsKnow]
  • Detecting hallucinations in large language models using semantic entropy, Farquhar et al., Nature 2024. [paper][semantic_uncertainty][long_hallucinations][Semantic Uncertainty ICLR 2023][Lynx-hallucination-detection]
  • A Survey on the Honesty of Large Language Models, Li et al., arxiv 2024. [paper][code]
3.3.6 Inference
3.3.7 MoE
  • Mixture of Experts Explained, Sanseviero et al., Hugging Face Blog 2023. [blog]

  • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Shazeer et al., arxiv 2017. [paper][Re-Implementation]

  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Lepikhin et al., arxiv 2020. [paper][mixture-of-experts]

  • MegaBlocks: Efficient Sparse Training with Mixture-of-Experts, Gale et al., arxiv 2022. [paper][code]

  • Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models, Shen et al., arxiv 2023. [paper][[code]]

  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, Fedus et al., arxiv 2021. [paper][code]

  • Fast Inference of Mixture-of-Experts Language Models with Offloading, Eliseev and Mazur, arxiv 2023. [paper][code]

  • Mixtral-8×7B: Mixtral of Experts, Jiang et al., arxiv 2023. [paper][code][megablocks-public][model][blog][Chinese-Mixtral-8x7B][Chinese-Mixtral]

  • DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models, Dai et al., ACL 2024. [paper][code]

  • DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, DeepSeek-AI, arxiv 2024. [paper][code][DeepSeek-V2.5]

  • Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models, Wang et al., ACL 2024. [paper][code][Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts]

  • Evolutionary Optimization of Model Merging Recipes, Akiba et al., arxiv 2024. [paper][code]

  • A Closer Look into Mixture-of-Experts in Large Language Models, Lo et al., arxiv 2024. [paper][code]

  • A Survey on Mixture of Experts, Cai et al., arxiv 2024. [paper][code]

  • HMoE: Heterogeneous Mixture of Experts for Language Modeling, Wang et al., arxiv 2024. [paper]

  • OLMoE: Open Mixture-of-Experts Language Models, Muennighoff et al., arxiv 2024. [paper][code]

  • [llama-moe][Aurora][OpenMoE][makeMoE][PEER-pytorch][GRIN-MoE]

3.3.8 PEFT (Parameter-efficient Fine-tuning)
3.3.9 Prompt Learning
3.3.10 RAG (Retrieval Augmented Generation)
Text Embedding
3.3.11 Reasoning and Planning
  • Few-Shot-CoT: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei et al., NeurIPS 2022. [paper][chain-of-thought-hub]

  • Self-Consistency Improves Chain of Thought Reasoning in Language Models, Wang et al., ICLR 2023. [paper]

  • Zero-Shot-CoT: Large Language Models are Zero-Shot Reasoners, Kojima et al., NeurIPS 2022. [paper][code]

  • Auto-CoT: Automatic Chain of Thought Prompting in Large Language Models, Zhang et al., ICLR 2023. [paper][code]

  • Multimodal Chain-of-Thought Reasoning in Language Models, Zhang et al., arxiv 2023. [paper][code]

  • Chain-of-Thought Reasoning Without Prompting, Wang et al., arxiv 2024. [paper]

  • ReAct: Synergizing Reasoning and Acting in Language Models, Yao et al., ICLR 2023. [paper][code]

  • MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, Yang et al., arxiv 2023. [paper][code]

  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al., NeurIPS 2023. [paper][code][Plug in and Play Implementation][tree-of-thought-prompting]

  • Graph of Thoughts: Solving Elaborate Problems with Large Language Models, Besta et al., arxiv 2023. [paper][code]

  • Cumulative Reasoning with Large Language Models, Zhang et al., arxiv 2023. [paper][code][On the Diagram of Thought]

  • Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models, Sel et al., arxiv 2023. [paper][unofficial code]

  • Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation, Ding et al., arxiv 2023. [paper][code]

  • Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models, Ye et al., arxiv 2024. [paper][code]

  • Large Language Models Are Reasoning Teachers, Ho et al., ACL 2023. [paper][code]

  • Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, Zhou et al., ICLR 2023. [paper]

  • DEPS: Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents, Wang et al., arxiv 2023. [paper][code]

  • RAP: Reasoning with Language Model is Planning with World Model, Hao et al., EMNLP 2023. [paper][code][LLM Reasoners COLM 2024]

  • LEMA: Learning From Mistakes Makes LLM Better Reasoner, An et al., arxiv 2023. [paper][code]

  • Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks, Chen et al., TMLR 2023. [paper][code]

  • Chain of Code: Reasoning with a Language Model-Augmented Code Emulator, Li et al., arxiv 2023. [paper][[code]]

  • The Impact of Reasoning Step Length on Large Language Models, Jin et al., arxiv 2024. [paper][code]

  • Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models, Wang et al., ACL 2023. [paper][code][maestro]

  • Improving Factuality and Reasoning in Language Models through Multiagent Debate, Du et al., arxiv 2023. [paper][code][Multi-Agents-Debate]

  • Self-Refine: Iterative Refinement with Self-Feedback, Madaan et al., arxiv 2023. [paper][code][MCT Self-Refine]

  • Reflexion: Language Agents with Verbal Reinforcement Learning, Shinn et al., NeurIPS 2023. [paper][code]

  • CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, Gou et al., ICLR 2024. [paper][code]

  • LATS: Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models, Zhou et al., ICML 2024. [paper][code]

  • Self-Discover: Large Language Models Self-Compose Reasoning Structures, Zhou et al., NeurIPS 2024. [paper][unofficial implementation][SELF-DISCOVER]

  • RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation, Wang et al., arxiv 2024. [paper][code]

  • KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents, Zhu et al., arxiv 2024. [paper][code][KnowLM]

  • Advancing LLM Reasoning Generalists with Preference Trees, Yuan et al., arxiv 2024. [paper][code]

  • Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models, Yang et al., arxiv 2024. [paper][code][SymbCoT]

  • ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, Singh et al., arxiv 2023. [paper][unofficial code]

  • ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent, Aksitov et al., arxiv 2023. [paper][[code]]

  • Searchformer: Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping, Lehnert et al., COLM 2024. [paper][code]

  • How Far Are We from Intelligent Visual Deductive Reasoning?, Zhang et al., arxiv 2024. [paper][code]

  • PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers, Lee et al., arxiv 2024. [paper][code]

  • Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning, Kim et al., arxiv 2024. [paper][code]

  • Sibyl: Simple yet Effective Agent Framework for Complex Real-world Reasoning, Wang et al., arxiv 2024. [paper][code]

  • QueryAgent: A Reliable and Efficient Reasoning Framework with Environmental Feedback-based Self-Correction, Huang et al., ACL 2024. [paper][code]

  • Internal Consistency and Self-Feedback in Large Language Models: A Survey, Liang et al., arxiv 2024. [paper][code]

  • Prover-Verifier Games improve legibility of language model outputs, Kirchner et al., 2024. [blog][paper]

  • Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning, Wang et al., ACL 2024. [paper][code]

  • ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search, Zhang et al., arxiv 2024. [paper][code]

  • rStar: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers, Qi et al., arxiv 2024. [paper][code][Orca 2][Quiet-STaR]

  • OpenAI o1: Learning to Reason with LLMs, OpenAI, 2024. [blog][Agent Q][Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters][Let's Verify Step by Step][Awesome-LLM-Strawberry][O1-Journey]

  • VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment, Kazemnejad et al., arxiv 2024. [paper][code]

  • [llm-reasoners][g1][Open-O1][show-me]


3.4 LLM Theory

  • Scaling Laws for Neural Language Models, Kaplan et al., arxiv 2020. [paper][unofficial code]

  • Emergent Abilities of Large Language Models, Wei et al., TMLR 2022. [paper]

  • Chinchilla: Training Compute-Optimal Large Language Models, Hoffmann et al., NeurIPS 2022. [paper]

  • Scaling Laws for Autoregressive Generative Modeling, Henighan et al., arxiv 2020. [paper]

  • Are Emergent Abilities of Large Language Models a Mirage?, Schaeffer et al., NeurIPS 2023 Outstanding Paper. [paper]

  • Understanding Emergent Abilities of Language Models from the Loss Perspective, Du et al., arxiv 2024. [paper]

  • S2A: System 2 Attention (is something you might need too), Weston et al., arxiv 2023. [paper][Distilling System 2 into System 1][system-2-research]

  • Memory3: Language Modeling with Explicit Memory, Yang et al., arxiv 2024. [paper]

  • Scaling Laws for Downstream Task Performance of Large Language Models, Isik et al., arxiv 2024. [paper]

  • Scalable Pre-training of Large Autoregressive Image Models, El-Nouby et al., arxiv 2024. [paper][code]

  • When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method, Zhang et al., ICLR 2024. [paper]

  • Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws, Allen-Zhu et al., arxiv 2024. [paper]

  • Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process, Ye et al., arxiv 2024. [paper][project page]

  • Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems, Ye et al., arxiv 2024. [paper]

  • Language Modeling Is Compression, Delétang et al., arxiv 2023. [paper]

  • Language Models Represent Space and Time, Gurnee and Tegmark, ICLR 2024. [paper][code]

  • The Platonic Representation Hypothesis, Huh et al., arxiv 2024. [paper][code]

  • Observational Scaling Laws and the Predictability of Language Model Performance, Ruan et al., arxiv 2024. [paper][code]

  • Language models can explain neurons in language models, OpenAI, 2023. [blog][code][transformer-debugger]

  • Scaling and evaluating sparse autoencoders, Gao et al., arxiv 2024. [OpenAI Blog][paper][code][sae-auto-interp]

  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Anthropic, 2023. [blog]

  • Mapping the Mind of a Large Language Model, Anthropic, 2024. [blog]

  • Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era, Wu et al., arxiv 2024. [paper][code]

  • LM Transparency Tool: Interactive Tool for Analyzing Transformer Language Models, Tufanov et al., arxiv 2024. [paper][code]

  • Transformer Explainer: Interactive Learning of Text-Generative Models, Cho et al., arxiv 2024. [paper][code][demo]

  • What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation, Singh et al., ICML 2024 Spotlight. [paper][code]

  • [Transformer Circuits Thread][colah's blog][Transformer Interpretability][Awesome-Interpretability-in-Large-Language-Models][TransformerLens][inseq]

  • ROME: Locating and Editing Factual Associations in GPT, Meng et al., NeurIPS 2022. [paper][code][FastEdit]

  • Editing Large Language Models: Problems, Methods, and Opportunities, Yao et al., EMNLP 2023. [paper][code][Knowledge Mechanisms in Large Language Models: A Survey and Perspective]

  • A Comprehensive Study of Knowledge Editing for Large Language Models, Zhang et al., arxiv 2024. [paper][code]

3.5 Chinese Model


  • CS231n: Deep Learning for Computer Vision [link]

1. Basic for CV

  • AlexNet: ImageNet Classification with Deep Convolutional Neural Networks, Krizhevsky et al., NIPS 2012. [paper]
  • VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition, Simonyan et al., ICLR 2015. [paper]
  • GoogLeNet: Going Deeper with Convolutions, Szegedy et al., CVPR 2015. [paper]
  • ResNet: Deep Residual Learning for Image Recognition, He et al., CVPR 2016 Best Paper. [paper][code]
  • DenseNet: Densely Connected Convolutional Networks, Huang et al., CVPR 2017 Oral. [paper][code]
  • EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, Tan et al., ICML 2019. [paper][code][EfficientNet-PyTorch][noisystudent]
  • BYOL: Bootstrap your own latent: A new approach to self-supervised Learning, Grill et al., arxiv 2020. [paper][code][byol-pytorch][simsiam]
  • ConvNeXt: A ConvNet for the 2020s, Liu et al., CVPR 2022. [paper][code]

2. Contrastive Learning

  • MoCo: Momentum Contrast for Unsupervised Visual Representation Learning, He et al., CVPR 2020. [paper][code]

  • SimCLR: A Simple Framework for Contrastive Learning of Visual Representations, Chen et al., PMLR 2020. [paper][code]

  • CoCa: Contrastive Captioners are Image-Text Foundation Models, Yu et al., arxiv 2022. [paper][CoCa-pytorch][multimodal]

  • DINOv2: Learning Robust Visual Features without Supervision, Oquab et al., arxiv 2023. [paper][code]

  • FeatUp: A Model-Agnostic Framework for Features at Any Resolution, Fu et al., ICLR 2024. [paper][code]

  • InfoNCE Loss: Representation Learning with Contrastive Predictive Coding, Oord et al., arxiv 2018. [paper][unofficial code]

3. CV Application

4. Foundation Model

  • ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., ICLR 2021. [paper][code][vit-pytorch][efficientvit][EfficientFormer][ViT-Adapter]

  • ViT-Adapter: Vision Transformer Adapter for Dense Predictions, Chen et al., ICLR 2023 Spotlight. [paper][code]

  • Vision Transformers Need Registers, Darcet et al., ICLR 2024 Outstanding Paper. [paper]

  • DeiT: Training data-efficient image transformers & distillation through attention, Touvron et al., ICML 2021. [paper][code]

  • ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, Kim et al., ICML 2021. [paper][code]

  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, Liu et al., ICCV 2021. [paper][code]

  • MAE: Masked Autoencoders Are Scalable Vision Learners, He et al., CVPR 2022. [paper][code][FLIP]

  • Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks, Xiao et al., CVPR 2024 Oral. [paper][model][Inference code]

  • LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models, Bai et al., arxiv 2023. [paper][code]

  • GLEE: General Object Foundation Model for Images and Videos at Scale, Wu et al., CVPR 2024 Highlight. [paper][code]

  • Tokenize Anything via Prompting, Pan et al., arxiv 2023. [paper][code]

  • Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model, Zhu et al., ICML 2024. [paper][code][VMamba][mambaout]

  • MambaVision: A Hybrid Mamba-Transformer Vision Backbone, Hatamizadeh and Kautz, arxiv 2024. [paper][code]

  • Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data, Yang et al., arxiv 2024. [paper][code][Depth-Anything-V2][ml-depth-pro]

  • Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models, Guo et al., arxiv 2024. [paper][code]

  • TiTok: An Image is Worth 32 Tokens for Reconstruction and Generation, Yu et al., arxiv 2024. [paper][titok-pytorch]

  • Theia: Distilling Diverse Vision Foundation Models for Robot Learning, Shang et al., arxiv 2024. [paper][code]

  • [pytorch-image-models][Pointcept]

5. Generative Model (GAN and VAE)

6. Image Editing

  • InstructPix2Pix: Learning to Follow Image Editing Instructions, Brooks et al., CVPR 2023 Highlight. [paper][code]

  • Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold, Pan et al., SIGGRAPH 2023. [paper][code]

  • DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing, Shi et al., arxiv 2023. [paper][code]

  • DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models, Mou et al., ICLR 2024 Spotlight. [paper][code]

  • LEDITS++: Limitless Image Editing using Text-to-Image Models, Brack et al., arxiv 2023. [paper][code][demo]

  • Diffusion Model-Based Image Editing: A Survey, Huang et al., arxiv 2024. [paper][code]

  • MimicBrush: Zero-shot Image Editing with Reference Imitation, Chen et al., arxiv 2024. [paper][code][EchoMimic]

  • A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models, Shuai et al., arxiv 2024. [paper][code]

  • [ComfyUI-UltraEdit-ZHO]

7. Object Detection

  • DETR: End-to-End Object Detection with Transformers, Carion et al., arxiv 2020. [paper][code]

  • Focus-DETR: Less is More: Focus Attention for Efficient DETR, Zheng et al., arxiv 2023. [paper][code]

  • U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection, Qin et al., arxiv 2020. [paper][code]

  • YOLO: You Only Look Once: Unified, Real-Time Object Detection, Redmon et al., arxiv 2015. [paper]

  • YOLOX: Exceeding YOLO Series in 2021, Ge et al., arxiv 2021. [paper][code]

  • Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism, Wang et al., arxiv 2023. [paper][code]

  • Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, Liu et al., ECCV 2024. [paper][code][OV-DINO][OmDet]

  • YOLO-World: Real-Time Open-Vocabulary Object Detection, Cheng et al., CVPR 2024. [paper][code]

  • YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information, Wang et al., arxiv 2024. [paper][code]

  • T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy, Jiang et al., arxiv 2024. [paper][code]

  • YOLOv10: Real-Time End-to-End Object Detection, Wang et al., arxiv 2024. [paper][code]

  • [detectron2][yolov5][mmdetection][detrex][ultralytics][AlphaPose]

8. Semantic Segmentation

9. Video

  • VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, Tong et al., NeurIPS 2022 Spotlight. [paper][code]
  • MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation, Wang et al., arxiv 2024. [paper]
  • [V-JEPA][I-JEPA]
  • VideoMamba: State Space Model for Efficient Video Understanding, Li et al., ECCV 2024. [paper][code]
  • VideoChat: Chat-Centric Video Understanding, Li et al., CVPR 2024 Highlight. [paper][code]
  • MVBench: A Comprehensive Multi-modal Video Understanding Benchmark, Li et al., CVPR 2024 Highlight. [paper][code]
  • OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer, Zhang et al., EMNLP 2024. [paper][code]
  • MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions, Ju et al., arxiv 2024. [paper][code]
  • MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling, Men et al., arxiv 2024. [paper][code][MIMO-pytorch]

10. Survey for CV

  • ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy, Vishniakov et al., arxiv 2023. [paper][code]
  • Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey, Xin et al., arxiv 2024. [paper][code]


1. Audio

2. BLIP

  • ALBEF: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, Li et al., NeurIPS 2021. [paper][code]
  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, Li et al., ICML 2022. [paper][code][laion-coco]
  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, Li et al., ICML 2023. [paper][code]
  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, Dai et al., arxiv 2023. [paper][code]
  • X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning, Panagopoulou et al., arxiv 2023. [paper][code]
  • xGen-MM (BLIP-3): A Family of Open Large Multimodal Models, Xue et al., arxiv 2024. [paper][code]
  • xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations, Qin et al., arxiv 2024. [paper][code]
  • LAVIS: A Library for Language-Vision Intelligence, Li et al., arxiv 2022. [paper][code]
  • VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, Bao et al., NeurIPS 2022. [paper][code]
  • BEiT: BERT Pre-Training of Image Transformers, Bao et al., ICLR 2022 Oral presentation. [paper][code]
  • BEiT-3: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, Wang et al., CVPR 2023. [paper][code]

3. CLIP

  • CLIP: Learning Transferable Visual Models From Natural Language Supervision, Radford et al., ICML 2021. [paper][code][open_clip][clip-as-service][SigLIP][EVA][DIVA]
  • DALL-E2: Hierarchical Text-Conditional Image Generation with CLIP Latents, Ramesh et al., arxiv 2022. [paper][code]
  • HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention, Geng et al., ICLR 2023. [paper][code]
  • Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese, Yang et al., arxiv 2022. [paper][code]
  • MetaCLIP: Demystifying CLIP Data, Xu et al., ICLR 2024 Spotlight. [paper][code]
  • Alpha-CLIP: A CLIP Model Focusing on Wherever You Want, Sun et al., arxiv 2023. [paper][code][Bootstrap3D]
  • MMVP: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs, Tong et al., arxiv 2024. [paper][code]
  • MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training, Vasu et al., CVPR 2024. [paper][code]
  • Long-CLIP: Unlocking the Long-Text Capability of CLIP, Zhang et al., arxiv 2024. [paper][code]
  • CLOC: Contrastive Localized Language-Image Pre-Training, Chen et al., arxiv 2024. [paper]

4. Diffusion Model

  • Tutorial on Diffusion Models for Imaging and Vision, Stanley H. Chan, arxiv 2024. [paper][diffusion-models-class]

  • Denoising Diffusion Probabilistic Models, Ho et al., NeurIPS 2020. [paper][code][Pytorch Implementation][RDDM]

  • Improved Denoising Diffusion Probabilistic Models, Nichol and Dhariwal, ICML 2021. [paper][code]

  • Diffusion Models Beat GANs on Image Synthesis, Dhariwal and Nichol, NeurIPS 2021. [paper][code]

  • Classifier-Free Diffusion Guidance, Ho and Salimans, NeurIPS 2021. [paper][code]

  • GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, Nichol et al., arxiv 2021. [paper][code]

  • DALL-E2: Hierarchical Text-Conditional Image Generation with CLIP Latents, Ramesh et al., arxiv 2022. [paper][code][dalle-mini]

  • Stable-Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., CVPR 2022. [paper][code][CompVis/stable-diffusion][Stability-AI/stablediffusion][ml-stable-diffusion]

  • SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, Podell et al., arxiv 2023. [paper][code][SDXL-Lightning]

  • Introducing Stable Cascade, Stability AI, 2024. [link][code][model]

  • SDXL-Turbo: Adversarial Diffusion Distillation, Sauer et al., arxiv 2023. [paper][code]

  • LCM: Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference, Luo et al., arxiv 2023. [paper][code][Hyper-SD][DMD2][ddim]

  • LCM-LoRA: A Universal Stable-Diffusion Acceleration Module, Luo et al., arxiv 2023. [paper][code][diffusion-forcing]

  • Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, Esser et al., ICML 2024 Best Paper. [paper][model][mmdit]

  • SD3-Turbo: Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation, Sauer et al., arxiv 2024. [paper]

  • StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation, Kodaira et al., arxiv 2023. [paper][code]

  • DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Models, Marjit et al., arxiv 2024. [paper][code]

  • Video Diffusion Models, Ho et al., arxiv 2022. [paper][code]

  • Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets, Blattmann et al., arxiv 2023. [paper][code][Stable Video 4D][VideoCrafter][Video-Infinity]

  • Consistency Models, Song et al., arxiv 2023. [paper][code][Consistency Decoder]

  • A Survey on Video Diffusion Models, Xing et al., arxiv 2023. [paper][code]

  • Diffusion Models: A Comprehensive Survey of Methods and Applications, Yang et al., arxiv 2023. [paper][code]

  • Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation, Yu et al., ICLR 2024. [paper][magvit2-pytorch][LlamaGen]

  • The Chosen One: Consistent Characters in Text-to-Image Diffusion Models, Avrahami et al., arxiv 2023. [paper][code]

  • U-ViT: All are Worth Words: A ViT Backbone for Diffusion Models, Bao et al., CVPR 2023. [paper][code]

  • UniDiffuser: One Transformer Fits All Distributions in Multi-Modal Diffusion, Bao et al., arxiv 2023. [paper][code]

  • Matryoshka Diffusion Models, Gu et al., arxiv 2023. [paper][code]

  • SEDD: Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution, Lou et al., ICML 2024 Best Paper. [paper][code]

  • l-DAE: Deconstructing Denoising Diffusion Models for Self-Supervised Learning, Chen et al., arxiv 2024. [paper]

  • DiT: Scalable Diffusion Models with Transformers, Peebles et al., ICCV 2023 Oral. [paper][code][OpenDiT][VideoSys][MDT][PipeFusion][fast-DiT]

  • SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers, Ma et al., arxiv 2024. [paper][code]

  • Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis, Ren et al., arxiv 2024. [paper][model]

  • Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer, Yang et al., arxiv 2024. [paper][code]

  • Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion, Chen et al., arxiv 2024. [paper][code]

  • Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget, Sehwag et al., arxiv 2024. [paper][code]

  • Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model, Zhou et al., arxiv 2024. [paper][transfusion-pytorch][chameleon][MonoFormer]

  • GitHub Repositories

  • [Awesome-Diffusion-Models][Awesome-Video-Diffusion]

  • [stable-diffusion-webui][stable-diffusion-webui-colab][sd-webui-controlnet][stable-diffusion-webui-forge][automatic]

  • [Fooocus][Omost]

  • [ComfyUI][streamlit][gradio][ComfyUI-Workflows-ZHO][ComfyUI_Bxb]

  • [diffusers][DiffSynth-Studio]

5. Multimodal LLM

  • LLaVA: Visual Instruction Tuning, Liu et al., NeurIPS 2023 Oral. [paper][code][vip-llava][LLaVA-pp][TinyLLaVA_Factory][LLaVA-RLHF]

  • LLaVA-1.5: Improved Baselines with Visual Instruction Tuning, Liu et al., arxiv 2023. [paper][code][LLaVA-UHD][LLaVA-HR]

  • LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models, Li et al., arxiv 2024. [paper][code][Open-LLaVA-NeXT][MG-LLaVA][LongVA][LongLLaVA]

  • LLaVA-OneVision: Easy Visual Task Transfer, Li et al., arxiv 2024. [paper][code]

  • LLaVA-Video: Video Instruction Tuning With Synthetic Data, Zhang et al., arxiv 2024. [paper][code][LLaVA-Critic]

  • LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day, Li et al., arxiv 2023. [paper][code]

  • Video-LLaVA: Learning United Visual Representation by Alignment Before Projection, Lin et al., arxiv 2023. [paper][code][PLLaVA]

  • MoE-LLaVA: Mixture of Experts for Large Vision-Language Models, Lin et al., arxiv 2024. [paper][code]

  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models, Zhu et al., arxiv 2023. [paper][code][MiniGPT-4-ZH]

  • MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning, Chen et al., arxiv 2023. [paper][code]

  • MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens, Ataallah et al., arxiv 2024. [paper][code]

  • MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens, Zheng et al., arxiv 2023. [paper][code]

  • Flamingo: a Visual Language Model for Few-Shot Learning, Alayrac et al., NeurIPS 2022. [paper][open-flamingo][flamingo-pytorch]

  • Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding, Zhang et al., EMNLP 2023. [paper][code][VideoLLaMA2][VideoLLM-online]

  • BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs, Zhao et al., arxiv 2023. [paper][code][AnyGPT]

  • Emu: Generative Pretraining in Multimodality, Sun et al., ICLR 2024. [paper][code]

  • Emu3: Next-Token Prediction is All You Need, Wang et al., arxiv 2024. [paper][code]

  • EVE: Unveiling Encoder-Free Vision-Language Models, Diao et al., arxiv 2024. [paper][code]

  • CogVLM: Visual Expert for Pretrained Language Models, Wang et al., arxiv 2023. [paper][code][VisualGLM-6B][CogCoM]

  • CogVLM2: Visual Language Models for Image and Video Understanding, Hong et al., arxiv 2024. [paper][code]

  • DreamLLM: Synergistic Multimodal Comprehension and Creation, Dong et al., ICLR 2024 Spotlight. [paper][code][dreambench_plus]

  • Meta-Transformer: A Unified Framework for Multimodal Learning, Zhang et al., arxiv 2023. [paper][code]

  • NExT-GPT: Any-to-Any Multimodal LLM, Wu et al., arxiv 2023. [paper][code]

  • Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, Wu et al., arxiv 2023. [paper][code]

  • SoM: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V, Yang et al., arxiv 2023. [paper][code]

  • Ferret: Refer and Ground Anything Anywhere at Any Granularity, You et al., arxiv 2023. [paper][code][Ferret-UI]

  • 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities, Bachmann et al., arxiv 2024. [paper][code][MM1.5]

  • Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond, Bai et al., arxiv 2023. [paper][code]

  • Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution, Wang et al., arxiv 2024. [paper][code][finetune-Qwen2-VL][Oryx]

  • InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition, Zhang et al., arxiv 2023. [paper][code]

  • InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks, Chen et al., CVPR 2024 Oral. [paper][code][InternVideo][InternVid][InternVL1.5 paper]

  • DeepSeek-VL: Towards Real-World Vision-Language Understanding, Lu et al., arxiv 2024. [paper][code]

  • ShareGPT4V: Improving Large Multi-Modal Models with Better Captions, Chen et al., arxiv 2023. [paper][code]

  • ShareGPT4Video: Improving Video Understanding and Generation with Better Captions, Chen et al., arxiv 2024. [paper][code]

  • TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones, Yuan et al., arxiv 2023. [paper][code]

  • Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models, Li et al., CVPR 2024. [paper][code]

  • Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models, Wei et al., arxiv 2023. [paper][code]

  • Vary-toy: Small Language Model Meets with Reinforced Vision Vocabulary, Wei et al., arxiv 2024. [paper][code]

  • VILA: On Pre-training for Visual Language Models, Lin et al., CVPR 2024. [paper][code][LongVILA][Eagle][NVLM]

  • POINTS: Improving Your Vision-language Model with Affordable Strategies, Liu et al., arxiv 2024. [paper]

  • LWM: World Model on Million-Length Video And Language With RingAttention, Liu et al., arxiv 2024. [paper][code]

  • Chameleon: Mixed-Modal Early-Fusion Foundation Models, Chameleon Team, arxiv 2024. [paper][code]

  • Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts, Li et al., arxiv 2024. [paper][code]

  • RL4VLM: Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning, Zhai et al., arxiv 2024. [paper][code][RLHF-V][RLAIF-V]

  • OpenVLA: An Open-Source Vision-Language-Action Model, Kim et al., arxiv 2024. [paper][code]

  • Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis, Fu et al., arxiv 2024. [paper][code][lmms-eval][VLMEvalKit][multimodal-needle-in-a-haystack][MM-NIAH][VideoNIAH][ChartMimic][WildVision]

  • MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities, Yu et al., ICML 2024. [paper][code][UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling]

  • Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, Tong et al., arxiv 2024. [paper][code]

  • video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models, Sun et al., ICML 2024. [paper][code]

  • ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation, Chern et al., arxiv 2024. [paper][code]

  • PaliGemma: A versatile 3B VLM for transfer, Beyer et al., arxiv 2024. [paper][code][pytorch-paligemma][Pixtral-12B-2409]

  • MiniCPM-V: A GPT-4V Level MLLM on Your Phone, Yao et al., arxiv 2024. [paper][code][VisCPM][RLHF-V][RLAIF-V]

  • VITA: Towards Open-Source Interactive Omni Multimodal LLM, Fu et al., arxiv 2024. [paper][code]

  • Show-o: One Single Transformer to Unify Multimodal Understanding and Generation, Xie et al., arxiv 2024. [paper][code][Transfusion][VILA-U][LWM]

  • MIO: A Foundation Model on Multimodal Tokens, Wang et al., arxiv 2024. [paper]

  • [MiniCPM-V][moondream][MobileVLM][OmniFusion][Bunny][MiCo][Vitron][mPLUG-Owl][mPLUG-DocOwl][Ovis]

  • [datacomp][MMDU][MINT-1T][OpenVid-1M][SkyScript-100M]

  • [mllm][lmms-finetune]

6. Text2Image

  • DALL-E: Zero-Shot Text-to-Image Generation, Ramesh et al., arxiv 2021. [paper][code]

  • DALL-E3: Improving Image Generation with Better Captions, Betker et al., OpenAI 2023. [paper][code][blog][Glyph-ByT5]

  • ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models, Zhang et al., ICCV 2023 Marr Prize. [paper][code][ControlNet_Plus_Plus][ControlNeXt][ControlAR]

  • T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models, Mou et al., AAAI 2024. [paper][code]

  • AnyText: Multilingual Visual Text Generation And Editing, Tuo et al., arxiv 2023. [paper][code]

  • RPG: Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs, Yang et al., ICML 2024. [paper][code]

  • LAION-5B: An open large-scale dataset for training next generation image-text models, Schuhmann et al., NeurIPS 2022. [paper][code][blog][laion-coco]

  • DeepFloyd IF: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, Saharia et al., arxiv 2022. [paper][code]

  • Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, Saharia et al., NeurIPS 2022. [paper][unofficial code]

  • Instruct-Imagen: Image Generation with Multi-modal Instruction, Hu et al., arxiv 2024. [paper][Imagen 3]

  • CogView: Mastering Text-to-Image Generation via Transformers, Ding et al., NeurIPS 2021. [paper][code][ImageReward]

  • CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers, Ding et al., arxiv 2022. [paper][code]

  • CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion, Zheng et al., ECCV 2024. [paper][code]

  • TextDiffuser: Diffusion Models as Text Painters, Chen et al., arxiv 2023. [paper][code]

  • TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering, Chen et al., arxiv 2023. [paper][code]

  • PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis, Chen et al., arxiv 2023. [paper][code]

  • PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models, Chen et al., arxiv 2024. [paper][code]

  • PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation, Chen et al., arxiv 2024. [paper][code]

  • IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models, Ye et al., arxiv 2023. [paper][code][ID-Animator][InstantID]

  • Controllable Generation with Text-to-Image Diffusion Models: A Survey, Cao et al., arxiv 2024. [paper][code]

  • StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation, Zhou et al., arxiv 2024. [paper][code][AutoStudio]

  • Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding, Li et al., arxiv 2024. [paper][code][xDiT]

  • [Kolors][Kolors-Virtual-Try-On][EVLM: An Efficient Vision-Language Model for Visual Understanding]

  • [flux][x-flux][x-flux-comfyui][FLUX.1-dev-LoRA]

7. Text2Video

  • Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation, Hu et al., arxiv 2023. [paper][code][Open-AnimateAnyone][Moore-AnimateAnyone][AnimateAnyone][UniAnimate]

  • EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions, Tian et al., arxiv 2024. [paper][code][V-Express]

  • AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation, Wei et al., arxiv 2024. [paper][code]

  • DreaMoving: A Human Video Generation Framework based on Diffusion Models, Feng et al., arxiv 2023. [paper][code]

  • MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model, Xu et al., arxiv 2023. [paper][code][champ][MegActor]

  • DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors, Xing et al., ECCV 2024. [paper][code]

  • LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control, Guo et al., arxiv 2024. [paper][code][FasterLivePortrait][FollowYourEmoji]

  • FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis, Liang et al., arxiv 2023. [paper][code]

  • [Awesome-Video-Diffusion]

  • Video Diffusion Models, Ho et al., arxiv 2022. [paper][video-diffusion-pytorch]

  • Make-A-Video: Text-to-Video Generation without Text-Video Data, Singer et al., arxiv 2022. [paper][make-a-video-pytorch]

  • Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation, Wu et al., ICCV 2023. [paper][code]

  • Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators, Khachatryan et al., ICCV 2023 Oral. [paper][code][StreamingT2V]

  • CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers, Hong et al., ICLR 2023. [paper][code]

  • CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer, Yang et al., arxiv 2024. [paper][code]

  • Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos, Ma et al., AAAI 2024. [paper][code][Follow-Your-Pose v2][Follow-Your-Emoji]

  • Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts, Ma et al., arxiv 2024. [paper][code]

  • AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning, Guo et al., arxiv 2023. [paper][code][AnimateDiff-Lightning]

  • StableVideo: Text-driven Consistency-aware Diffusion Video Editing, Chai et al., ICCV 2023. [paper][code]

  • I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models, Zhang et al., arxiv 2023. [paper][code]

  • TF-T2V: A Recipe for Scaling up Text-to-Video Generation with Text-free Videos, Wang et al., arxiv 2023. [paper][code]

  • Lumiere: A Space-Time Diffusion Model for Video Generation, Bar-Tal et al., arxiv 2024. [paper][lumiere-pytorch]

  • Sora: Creating video from text, OpenAI, 2024. [blog][Generative Models for Image and Long Video Synthesis][Generative Models of Images and Neural Networks][Open-Sora][VideoSys][Open-Sora-Plan][minisora][SoraWebui][MuseV][PhysDreamer][easyanimate]

  • Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models, Liu et al., arxiv 2024. [paper][code]

  • Mora: Enabling Generalist Video Generation via A Multi-Agent Framework, Yuan et al., arxiv 2024. [paper][code]

  • Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution, Dehghani et al., NeurIPS 2023. [paper][unofficial code]

  • VideoPoet: A Large Language Model for Zero-Shot Video Generation, Kondratyuk et al., ICML 2024 Best Paper. [paper]

  • Latte: Latent Diffusion Transformer for Video Generation, Ma et al., arxiv 2024. [paper][code][LaVIT][LaVie][VBench][Vchitect-2.0][LiteGen]

  • Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis, Menapace et al., arxiv 2024. [paper][articulated-animation]

  • FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance, Feng et al., arxiv 2024. [paper][code]

  • DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos, Hu et al., arxiv 2024. [paper][code]

  • Loong: Generating Minute-level Long Videos with Autoregressive Language Models, Wang et al., arxiv 2024. [paper]

  • Movie Gen: A Cast of Media Foundation Models, The Movie Gen team @ Meta, 2024. [blog][paper][unofficial code]

  • [MoneyPrinterTurbo][clapper][videos][manim]

8. Survey for Multimodal

  • A Survey on Multimodal Large Language Models, Yin et al., arxiv 2023. [paper][Awesome-Multimodal-Large-Language-Models]
  • Multimodal Foundation Models: From Specialists to General-Purpose Assistants, Li et al., arxiv 2023. [paper][cvinw_readings]
  • From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities, Lu et al., arxiv 2024. [paper][Leaderboards]
  • Efficient Multimodal Large Language Models: A Survey, Jin et al., arxiv 2024. [paper][code]
  • An Introduction to Vision-Language Modeling, Bordes et al., arxiv 2024. [paper]
  • Building and better understanding vision-language models: insights and future directions, Laurençon et al., arxiv 2024. [paper]

9. Other

  • Fuyu-8B: A Multimodal Architecture for AI Agents, Bavishi et al., Adept blog 2023. [blog][model]
  • Otter: A Multi-Modal Model with In-Context Instruction Tuning, Li et al., arxiv 2023. [paper][code]
  • OtterHD: A High-Resolution Multi-modality Model, Li et al., arxiv 2023. [paper][code][model]
  • CM3leon: Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning, Yu et al., arxiv 2023. [paper][Unofficial Implementation]
  • MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer, Tian et al., arxiv 2024. [paper][code]
  • CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations, Qi et al., arxiv 2024. [paper][code]
  • SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models, Gao et al., arxiv 2024. [paper][code]
  • Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers, Gao et al., arxiv 2024. [paper][code]
  • Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining, Liu et al., arxiv 2024. [paper][code]
  • SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation, Ge et al., arxiv 2024. [paper][code][SEED][SEED-Story]

Reinforcement Learning

1. Basic for RL

2. LLM for decision making

  • Decision Transformer: Reinforcement Learning via Sequence Modeling, Chen et al., NeurIPS 2021. [paper][code]
  • Trajectory Transformer: Offline Reinforcement Learning as One Big Sequence Modeling Problem, Janner et al., NeurIPS 2021. [paper][code]
  • Guiding Pretraining in Reinforcement Learning with Large Language Models, Du et al., ICML 2023. [paper][code]
  • Introspective Tips: Large Language Model for In-Context Decision Making, Chen et al., arxiv 2023. [paper]
  • Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions, Chebotar et al., CoRL 2023. [paper][Unofficial Implementation]
  • Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods, Cao et al., arxiv 2024. [paper]


Survey for GNN

Transformer Architecture