FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/
FBGEMM_GPU v0.6.0 has been tested and is known to work on the following setups:
It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment or a Docker container.
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.6.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.6.0
Alternatively, it can be fetched from the PyTorch PIP index:
# FBGEMM_GPU CUDA variant (CUDA 11.8)
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu118/
# FBGEMM_GPU CUDA variant (CUDA 12.1)
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.6.0 --index-url https://download.pytorch.org/whl/cpu
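Once installed, importing fbgemm_gpu registers the torch.ops.fbgemm.* operators with PyTorch. A minimal smoke test, as a sketch (it assumes a matching PyTorch install and uses the asynchronous_complete_cumsum operator that appears in the changelogs below):

import torch
import fbgemm_gpu  # noqa: F401  (the import side effect registers torch.ops.fbgemm.*)

lengths = torch.tensor([3, 1, 2], dtype=torch.int32)
# The "complete" cumsum prepends a zero, so the output has one more element than the input.
offsets = torch.ops.fbgemm.asynchronous_complete_cumsum(lengths)
print(offsets)  # tensor([0, 3, 4, 6], dtype=torch.int32)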
Published by spcyppt about 1 year ago
FBGEMM_GPU v0.5.0 has been tested and is known to work on the following setups:
It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment or a Docker container.
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.5.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.5.0
Alternatively, it can be fetched from the PyTorch PIP index:
# FBGEMM_GPU CUDA variant (CUDA 11.8)
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu118/
# FBGEMM_GPU CUDA variant (CUDA 12.1)
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cpu
group_index_select_dim0 (#1968)
group_index_select (#1764, #1884)
permute_pooled_embs_kernel (#1913)
all_to_one enhancements (#1674, #1962)
asynchronous_complete_cumsum (#1707)
permute_indices_weights_kernel_2 (#1852)
pack_segments (#1708)
reorder_batched_ad_indices (#1901, #1902, #1932, #1933, #1711)
nbit-cpu-with-spec benchmark in FBGEMM-GPU's TBE benchmark suite (#1892)
Published by q10 over 1 year ago
FBGEMM_GPU v0.4.1 has been tested and is known to work on the following setups:
It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment or a Docker container.
FBGEMM_GPU may be fetched directly from PyPI:
# FBGEMM_GPU (CUDA variant)
pip install fbgemm-gpu==0.4.1
# FBGEMM_GPU (CPU variant)
pip install fbgemm-gpu-cpu==0.4.1
This is a minor release whose main purpose is to deliver Python 3.11 support.
Published by q10 over 1 year ago
FBGEMM_GPU v0.4.0 has been tested and is known to work on the following setups:
It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment or a Docker container.
FBGEMM_GPU may be fetched directly from PyPI:
# FBGEMM_GPU (CUDA variant)
pip install fbgemm-gpu==0.4.0
# FBGEMM_GPU (CPU variant)
pip install fbgemm-gpu-cpu==0.4.0
[lfu|lru]_cache_insert_byte_kernel vectorization (#1475)
jagged_dense_dense_elementwise_add_jagged (#1487)
group_index_select (#1421, #1592)
index_select for selecting KeyJaggedTensor dim 1 (previously only dim 0 was supported) (#1429)
jagged_index_select for CPU (#1586)
asynchronous_complete_cumsum (#1573)
nbit_device_with_spec for table batched embedding inference benchmark (#1455, #1465)
bottom_unique_k_per_row for faster Zipf data generation (for FBGEMM benchmarks) (#1447)
Published by mjanderson09 almost 2 years ago
Minor release
Published by mjanderson09 almost 2 years ago
Table Batched Embedding enhancements:
AMD Support (beta) (#1102, #1193)
Quantized Communication Primitives (#1219, #1337)
Sparse kernel enhancements
Improved documentation for Jagged Tensors and SplitTableBatchedEmbeddingBagsCodegen
Optimized 2x2 kernel for AVX2 (#1280)
Full Changelog: https://github.com/pytorch/FBGEMM/commits/v0.3.0
Published by mjanderson09 over 2 years ago
Inference Table Batched Embedding (TBE) Enhancements (#951, #984)
The table batched embedding (TBE) operator is an important building block for embedding lookup in recommendation-system inference on GPU. We added several enhancements for performance and flexibility; a sketch of the batched-lookup interface is shown below.
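To make the interface concrete, here is a small sketch using the training-side TBE module (a sketch only: the module path and constructor defaults follow the 0.4/0.5-era releases and may differ in newer ones; this example runs on CPU):

import torch
from fbgemm_gpu.split_table_batched_embeddings_ops import (
    ComputeDevice,
    EmbeddingLocation,
    SplitTableBatchedEmbeddingBagsCodegen,
)

# Two tables, each described as (num_embeddings, embedding_dim, weight location, compute device).
tbe = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        (100, 8, EmbeddingLocation.HOST, ComputeDevice.CPU),
        (200, 8, EmbeddingLocation.HOST, ComputeDevice.CPU),
    ],
)

# A single fused call looks up both tables for a batch of B=2 samples:
# a flat indices tensor plus CSR-style offsets with B * T + 1 entries.
indices = torch.tensor([1, 5, 2, 7, 9], dtype=torch.long)
offsets = torch.tensor([0, 2, 3, 4, 5], dtype=torch.long)
pooled = tbe(indices=indices, offsets=offsets)  # shape (2, 16): one pooled vector per table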
Inference FP8 Table Batched Embedding (TBE) (#1091)
The table batched embedding (TBE) previously supported FP32, FP16, INT8, INT4, and INT2 embedding weight types. While these weight types work well in many models, we integrated the FP8 weight type (in both GPU and CPU operations) to allow numerical and performance evaluations of FP8 in our models. Compared to INT8, FP8 does not require the additional bias and scale storage and calculations. Additionally, the next-generation H100 GPU supports FP8 on its Tensor Cores (mainly for matmul ops).
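A back-of-envelope illustration of the storage point (a sketch; it assumes FBGEMM's fused 8-bit rowwise format, which stores an fp32 scale and an fp32 bias alongside each quantized row):

# Per-row bytes for a D-dimensional embedding row.
D = 128
int8_row_bytes = D * 1 + 4 + 4  # D int8 values + fp32 scale + fp32 bias
fp8_row_bytes = D * 1           # D fp8 values, no per-row scale/bias
print(int8_row_bytes, fp8_row_bytes)  # 136 128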
Jagged Tensor Kernels (#1006, #1008)
We added optimized kernels to speed up TorchRec's JaggedTensor. The purpose of JaggedTensor is to handle the case where one dimension of the input data is "jagged", meaning that each consecutive row in a given dimension may have a different length, which is often the case with sparse feature inputs in recommendation systems. The underlying representation is sketched below.
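For example, the values-plus-offsets layout and its conversion to a padded dense tensor (a sketch, assuming the jagged_to_padded_dense operator registered by fbgemm_gpu):

import torch
import fbgemm_gpu  # noqa: F401  (registers torch.ops.fbgemm.*)

# A jagged batch of 3 rows with lengths [2, 0, 3]: values are stored flat,
# and offsets[i]:offsets[i+1] delimits row i.
values = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0]])
offsets = torch.tensor([0, 2, 2, 5], dtype=torch.long)

# Pad every row to length 3 so dense downstream kernels can consume the batch.
dense = torch.ops.fbgemm.jagged_to_padded_dense(values, [offsets], [3], 0.0)
print(dense.shape)  # torch.Size([3, 3, 1])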
Optimized permute102-baddbmm-permute102 (#1048)
It is difficult to fuse various matrix multiplications where the batch size is not the batch size of the model; switching the batch dimension is a quick solution. We created the permute102_baddbmm_permute102 operation, which switches the first and second dimensions, performs the batched matrix multiplication, and then switches the dimensions back. Currently only the forward pass with the FP16 data type is supported; FP32 and the backward pass will be supported in the future. The reference semantics are sketched below.
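The fused pattern is equivalent to the following stock-PyTorch reference (a sketch in FP32 for portability; the shapes are illustrative and the helper name is ours, while the fused FBGEMM op itself covers only the FP16 forward pass):

import torch

def permute102_baddbmm_permute102_ref(bias, a, w):
    # a: (M, B, K) with the batch dim in position 1; w: (B, K, N); bias: (B, M, N)
    out = torch.baddbmm(bias, a.permute(1, 0, 2), w)  # (B, M, N)
    return out.permute(1, 0, 2)                       # back to (M, B, N)

a = torch.randn(4, 2, 8)
w = torch.randn(2, 8, 16)
bias = torch.zeros(2, 4, 16)
y = permute102_baddbmm_permute102_ref(bias, a, w)
print(y.shape)  # torch.Size([4, 2, 16])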
Optimized index_select for dim 0 index selection (#1113)
index_select is normally used as part of a sparse operation. While PyTorch supports a generic index_select for arbitrary-dimension index selection, its performance for the special case of dim 0 index selection is suboptimal. For this reason, we implemented a specialized index_select for dim 0. In some cases, we have observed a 1.4x performance gain from FBGEMM's index_select compared to the one from PyTorch (using a uniform index distribution). A usage sketch follows.
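A side-by-side sketch of the two paths (the torch.ops.fbgemm.index_select_dim0 op name is an assumption based on these release notes; the example also assumes a CUDA device is available):

import torch
import fbgemm_gpu  # noqa: F401  (registers torch.ops.fbgemm.*)

x = torch.randn(1000, 64, device="cuda")
idx = torch.randint(0, 1000, (4096,), device="cuda")

baseline = torch.index_select(x, 0, idx)            # generic PyTorch path
fused = torch.ops.fbgemm.index_select_dim0(x, idx)  # FBGEMM's dim-0 specialization
torch.testing.assert_close(baseline, fused)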
Full Changelog: https://github.com/pytorch/FBGEMM/commits/v0.2.0