To speed up long-context LLM inference, attention is computed with approximate and dynamic sparse methods, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
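A minimal sketch of the idea behind dynamic sparse attention, assuming a simple block-wise top-k selection scheme; this is a toy NumPy illustration, not the repository's actual kernels, and all function and parameter names here are hypothetical:

```python
import numpy as np

def sparse_attention(q, k, v, top_k=2, block=4):
    # Toy dynamic sparse attention (hypothetical helper, not the repo's API):
    # each query block attends only to the top-k key blocks, ranked by an
    # approximate (mean-pooled) block-level similarity score.
    n, d = q.shape
    nb = n // block
    # Mean-pool queries and keys per block to cheaply estimate which key
    # blocks matter for each query block.
    qb = q.reshape(nb, block, d).mean(axis=1)
    kb = k.reshape(nb, block, d).mean(axis=1)
    scores = qb @ kb.T  # (nb, nb) block-level relevance scores
    out = np.zeros_like(q)
    for i in range(nb):
        keep = np.argsort(scores[i])[-top_k:]  # dynamic per-block selection
        ks = np.concatenate([k[j * block:(j + 1) * block] for j in keep])
        vs = np.concatenate([v[j * block:(j + 1) * block] for j in keep])
        # Ordinary softmax attention, restricted to the selected key blocks.
        s = q[i * block:(i + 1) * block] @ ks.T / np.sqrt(d)
        w = np.exp(s - s.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        out[i * block:(i + 1) * block] = w @ vs
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
out = sparse_attention(q, k, v)
print(out.shape)  # (16, 8)
```

Because each query block touches only `top_k` key blocks instead of all of them, the attention cost scales with `top_k` rather than with the full sequence length, which is where the pre-filling speedup comes from.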
MIT License
Official codebase for MEGAVERSE (published at NAACL 2024)
Generation of protein sequences and evolutionary alignments via discrete diffusion models
Tutel MoE: An Optimized Mixture-of-Experts Implementation
A unified evaluation framework for large language models
A Python package for generating concise, high-quality summaries of a probability distribution
Foundation Architecture for (M)LLMs
Building modular LMs with parameter-efficient fine-tuning.
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"
Repo for the WWW 2022 paper: Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval
AICI: Prompts as (Wasm) Programs