To speed up long-context LLM inference, attention is computed with approximate and dynamic sparse methods, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
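A minimal sketch of the idea behind dynamic sparse attention, assuming a simple block-wise top-k selection scheme; this is a toy NumPy illustration, not the repository's actual kernels, and all function and parameter names here are hypothetical:

```python
import numpy as np

def sparse_attention(q, k, v, top_k=2, block=4):
    # Toy dynamic sparse attention (hypothetical helper, not the repo's API):
    # each query block attends only to the top-k key blocks, ranked by an
    # approximate (mean-pooled) block-level similarity score.
    n, d = q.shape
    nb = n // block
    # Mean-pool queries and keys per block to cheaply estimate which key
    # blocks matter for each query block.
    qb = q.reshape(nb, block, d).mean(axis=1)
    kb = k.reshape(nb, block, d).mean(axis=1)
    scores = qb @ kb.T  # (nb, nb) block-level relevance scores
    out = np.zeros_like(q)
    for i in range(nb):
        keep = np.argsort(scores[i])[-top_k:]  # dynamic per-block selection
        ks = np.concatenate([k[j * block:(j + 1) * block] for j in keep])
        vs = np.concatenate([v[j * block:(j + 1) * block] for j in keep])
        # Ordinary softmax attention, restricted to the selected key blocks.
        s = q[i * block:(i + 1) * block] @ ks.T / np.sqrt(d)
        w = np.exp(s - s.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        out[i * block:(i + 1) * block] = w @ vs
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
out = sparse_attention(q, k, v)
print(out.shape)  # (16, 8)
```

Because each query block touches only `top_k` key blocks instead of all of them, the attention cost scales with `top_k` rather than with the full sequence length, which is where the pre-filling speedup comes from.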
MIT License
Official codebase for MEGAVERSE (published at NAACL 2024)
Generation of protein sequences and evolutionary alignments via discrete diffusion models
Tutel MoE: An Optimized Mixture-of-Experts Implementation
A unified evaluation framework for large language models
A Python package for generating concise, high-quality summaries of a probability distribution
Foundation Architecture for (M)LLMs
Building modular LMs with parameter-efficient fine-tuning.
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"
Repo for the WWW 2022 paper: Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval
AICI: Prompts as (Wasm) Programs