To speed up inference for long-context LLMs, attention is computed with approximate, dynamic sparse methods, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
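A minimal sketch of the general idea, block-level dynamic sparse attention, assuming mean-pooled block scores as the cheap importance estimate (the `block_sparse_attention` below is an illustration of the technique, not this repository's implementation; causal masking is omitted for brevity):

```python
# Illustrative dynamic block-sparse attention: for each query block,
# estimate each key block's importance from pooled dot products, keep
# only the top-k key blocks, and run dense attention inside that pattern.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, keep=4):
    # q, k, v: (seq, dim); seq assumed divisible by `block` for brevity.
    seq, dim = q.shape
    nb = seq // block
    qb = q.view(nb, block, dim)
    kb = k.view(nb, block, dim)
    vb = v.view(nb, block, dim)
    # Cheap importance estimate: mean-pooled query block vs. key block.
    scores = qb.mean(1) @ kb.mean(1).T                   # (nb, nb)
    top = scores.topk(min(keep, nb), dim=-1).indices     # (nb, keep)
    out = torch.empty_like(qb)
    scale = dim ** -0.5
    for i in range(nb):
        ks = kb[top[i]].reshape(-1, dim)                 # selected key blocks
        vs = vb[top[i]].reshape(-1, dim)
        attn = F.softmax(qb[i] @ ks.T * scale, dim=-1)
        out[i] = attn @ vs
    return out.view(seq, dim)

q = k = v = torch.randn(512, 32)
print(block_sparse_attention(q, k, v).shape)  # torch.Size([512, 32])
```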
Building modular LMs with parameter-efficient fine-tuning.
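As a sketch of the parameter-efficient idea, a minimal LoRA-style adapter where only a low-rank update to a frozen base weight is trained (`LoRALinear` is a hypothetical name for illustration, not this repository's API):

```python
# LoRA-style adapter sketch: the frozen base weight W is augmented with a
# low-rank update B @ A, and only A and B receive gradients.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad_(False)             # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))  # zero-init: no-op at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(128, 128)
print(layer(torch.randn(4, 128)).shape)  # torch.Size([4, 128])
```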
A Python package for generating concise, high-quality summaries of a probability distribution
Foundation Architecture for (M)LLMs
This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".
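For intuition, a toy sketch of the window partitioning plus cyclic shift that shifted-window attention builds on (illustrative only, not the official implementation):

```python
# The feature map is split into non-overlapping windows; for the "shifted"
# layers it is first rolled by half a window so windows straddle the
# previous layer's boundaries.
import torch

def window_partition(x, win):
    # x: (H, W, C) -> (num_windows, win*win, C)
    H, W, C = x.shape
    x = x.view(H // win, win, W // win, win, C)
    return x.permute(0, 2, 1, 3, 4).reshape(-1, win * win, C)

x = torch.randn(8, 8, 16)
shifted = torch.roll(x, shifts=(-2, -2), dims=(0, 1))   # half-window shift
print(window_partition(shifted, win=4).shape)           # (4, 16, 16)
```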
A unified evaluation framework for large language models
Repo for the WWW 2022 paper "Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval".
Official codebase for MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks (published at NAACL 2024).
Tutel MoE: An Optimized Mixture-of-Experts Implementation
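A toy top-2 routing sketch of the mixture-of-experts pattern Tutel optimizes (illustrative only; Tutel's actual implementation uses fused, distributed dispatch kernels):

```python
# Top-2 MoE routing: a softmax gate picks two experts per token, and their
# outputs are combined weighted by the gate probabilities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                       # x: (tokens, dim)
        probs = F.softmax(self.gate(x), dim=-1)
        w, idx = probs.topk(self.k, dim=-1)     # per-token expert weights/ids
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens routed to expert e
                if mask.any():
                    out[mask] += w[mask, slot, None] * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```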
Generation of protein sequences and evolutionary alignments via discrete diffusion models
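A minimal sketch of the forward (noising) half of discrete diffusion over token sequences, assuming an absorbing mask state and a linear schedule (an illustration of the general technique, not this codebase's API):

```python
# Forward corruption for discrete diffusion: with probability t / T, each
# token is replaced by an absorbing MASK token; a model is then trained to
# reverse this process and recover the original sequence.
import torch

AA = "ACDEFGHIKLMNPQRSTVWY"             # the 20 standard amino acids
MASK = len(AA)                          # extra absorbing "mask" token id

def corrupt(tokens, t, T):
    # Linear schedule: at step t, each position is masked with prob t / T.
    keep = torch.rand(tokens.shape) >= t / T
    return torch.where(keep, tokens, torch.full_like(tokens, MASK))

seq = torch.randint(0, len(AA), (1, 16))        # a random toy "protein"
print(corrupt(seq, t=8, T=10))                  # mostly MASK ids at high t
```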
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
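A usage sketch based on MII's documented non-persistent pipeline API (the model name is a placeholder; exact signatures may vary across MII versions):

```python
# Non-persistent text-generation pipeline, per MII's README-style usage.
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")  # placeholder HF model name
response = pipe(["DeepSpeed is"], max_new_tokens=128)
print(response)
```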
AICI: Prompts as (Wasm) Programs