deepsparse

Sparsity-aware deep learning inference runtime for CPUs

OTHER License

Downloads
8.7K
Stars
3K
Committers
43


deepsparse - DeepSparse v1.7.1 Patch Release (Latest Release)

Published by jeanniefinks 7 months ago

This is a patch release for 1.7.0 that contains the following changes:

  • Detokenization has been fixed for streaming outputs with models that use sentencepiece-based tokenizers. (#1635)
deepsparse - DeepSparse v1.7.0

Published by jeanniefinks 7 months ago

New Features:

  • DeepSparse Pipelines v2 was introduced, enabling more complex pipelines to be represented. Text Generation (compatible with Hugging Face Transformers) and Image Classification pipelines have been refactored to the v2 format. (#1324, #1385, #1460, #1596, #1502, #1626)
  • OpenAI Server compatibility added on top of Pipelines v2. (#1445, #1477)
  • deepsparse.evaluate APIs and CLIs added with plugins for perplexity and lm-eval-harness for LLM evaluations. (#1596)
  • An example was added demonstrating how to use LLMPerf for benchmarking DeepSparse LLM servers. (#1502)
  • Continuous batching support has been added for text generation pipelines and inference server pathways, enabling inference over multiple text streams at once. (#1569, #1571)

Changes:

  • Exposed sequence_length for greater control over text generation pipelines. (#1518)
  • deepsparse.analyze functionality has been updated to work properly with LLMs. (#1324)
  • The logging and timing infrastructure for Pipelines has been expanded to enable more thorough tracking and logging, and to further support integrations with Prometheus and other standard logging platforms. (#1614)
  • UX improved for text generation pipelines to more closely match Hugging Face Transformers pipelines. (#1583, #1584, #1590, #1592, #1598)

Resolved Issues:

  • Compile time for dense LLMs is no longer very slow.
  • Text generation pipeline bug fixes: corrected sampling logic errors and inappropriate in-place logits mutation resulting in incorrect answers for LLMs when using sampling. (#1406, #1414)
  • KV cache was fixed for improper handling of the kv_cache input while using external KV cache management, which resulted in inaccurate model inference for ONNX Runtime comparison pathways. (#1337)
  • Benchmarking runs for LLMs with internal KV cache no longer crash or report inaccurate numbers. (#1512, #1514)
  • SciPy dependencies were removed to address issues for CV pipelines where they would fail on import of scipy and crash. (#1604, #1602)
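
The in-place logits issue listed above can be illustrated with a minimal sketch. This is not DeepSparse's actual code: the helper names and the temperature-scaling step are hypothetical, chosen only to show why mutating a shared logits buffer changes later results.

```python
def scale_in_place(logits, temperature):
    """Hypothetical buggy helper: divides the caller's list in place."""
    for i in range(len(logits)):
        logits[i] /= temperature
    return logits


def scale_copy(logits, temperature):
    """Hypothetical fixed helper: returns a new list, input untouched."""
    return [value / temperature for value in logits]


# In-place version: every call changes the shared buffer again.
shared = [2.0, 4.0, 8.0]
scale_in_place(shared, 2.0)
scale_in_place(shared, 2.0)  # second call compounds the first
mutated = list(shared)

# Copying version: the same input always gives the same result.
original = [2.0, 4.0, 8.0]
first = scale_copy(original, 2.0)
second = scale_copy(original, 2.0)
```

With the copying helper, repeated sampling steps see the same logits. With the in-place helper, each step sees values already altered by the previous one.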

Known Issues:

  • OPT models produce incorrect outputs and are no longer supported.
  • Streaming support is limited within the DeepSparse Pipeline v2 framework for tasks other than text generation.
deepsparse - DeepSparse v1.6.1 Patch Release

Published by jeanniefinks 10 months ago

This is a patch release for 1.6.0 that contains the following changes:

  • The Neural Magic DeepSparse Community License file in the DeepSparse GitHub repository has been renamed from LICENSE-NEURALMAGIC to LICENSE for higher visibility in the repository and in the C++ engine package tarball, deepsparse_api_demo.tar.gz. (#1485)
deepsparse - DeepSparse v1.6.0

Published by jeanniefinks 10 months ago

New Features:

  • Version support added:

    • Python 3.11 (#1323, #1432)
    • ONNX 1.14 and Opset 14 (#1072, #1097)
    • NumPy 1.21.6 (#1094)
  • Decoder-only text generation LLMs are optimized in DeepSparse and offer state-of-the-art performance with sparsity! Run pip install deepsparse[llm] and then use the TextGeneration Pipeline. For performance details, check out our Sparse Fine-Tuning paper.
    (#1022, #1035, #1061, #1081, #1132, #1122, #1137, #1121, #1139, #1126, #1151, #1140, #1173, #1166, #1176, #1172, #1190, #1142, #1205, #1204, #1212, #1214, #1194, #1218, #1196, #1217, #1216, #1225, #1240, #1254, #1246, #1250, #1266, #1270, #1276, #1274, #1235, #1284, #1285, #1304, #1308, #1310, #1313, #1272)

  • OpenAI-compatible DeepSparse Server has been added, enabling standard OpenAI requests for performant LLMs. (#1171, #1221, #1228, #1317)

  • MLServer-compatible pathways have been added to DeepSparse Server to enable standard MLServer requests. (#1237)

  • CLIP model support for deployments and performance functionality is now enabled. (Documentation) (#1098, #1145, #1203)

  • Several encoder-decoder networks have been optimized for performance: Donut, Whisper, and T5.

  • Support for ARM processors is now generally available. ARMv8.2 or above is required for quantized performance. (#1307)

  • Support for macOS is now in Beta. macOS Ventura (version 13) or above and Apple silicon are required. (#1088, #1096, #1290, #1307)

  • DeepSparse Server updated to support generic pipeline Python implementations for easy extensibility. (#1033)

  • YOLOv8 deployment pipelines and model support have been added. (#1044, #1052, #1040, #1138, #1261)

  • AWS and GCP marketplace documentation added: AWS | GCP (#1056, #1057)

  • DigitalOcean marketplace integration added. (Documentation) (#1109)

  • DeepSparse Azure marketplace integration added. (Documentation) (#1066)

  • DeepSparse Pipeline timing added. To access, utilize pipeline.timer_manager or utilize deepsparse.benchmark_pipeline CLI. (#1062, #1150, #1268, #1259, #1294)

  • TorchScriptEngine class added to enable benchmarking and evaluation comparisons to DeepSparse. (#1015)

  • debug_analysis API now supports exporting CSVs, enabling easier analysis. (#1253)

  • SentenceTransformers deployment and performance support have been added. (#1301)

Changes:

  • DeepSparse upgraded for the SparseZoo V2 model file structure changes, which expands the number of supported files and reduces the number of bytes that need to be downloaded for model checkpoints, folders, and files. (#1233, #1234, #1303, #1318)

  • YOLOv5 deployment pipelines migrated to install from nm-yolov5 on PyPI and remove the autoinstall from the nm-yolov5 GitHub repository that would happen on invocation of the relevant pathways, enabling more predictable environments. (#1030, #1101, #1129, #1111, #1167)

  • Docker builds are updated to consistently rebuild for new releases and nightlies. (#1012, #1068, #1069, #1113, #1144)

  • Torchvision deployment pipelines have been upgraded to support 0.14.x. (#1034)

  • README and documentation updated to include: Slack Community name change, Contact Us form introduction, Python version changes; corrections for YOLOv5 torchvision, transformers, and SparseZoo broken links; and installation command. (#1041, #1042, #1043, #1039, #1048, #931, #960, #1279, #1282, #1280, #1313)

  • Python 3.7 is now deprecated. (#1060, #1148)

  • ONNX utilities are updated so that ONNX model arguments can be passed as either a model file path (past behavior) or an ONNX ModelProto Python object. (#1089)

  • Deployment directories containing a model.onnx will now load properly for all pipelines supported by DeepSparse Server. Before, specific paths needed to be supplied to the exact model.onnx file rather than a deployment directory. (#1131)

  • Flake8 updated to 6.1 to enable the latest standards for running make quality. (#1156)

  • Automatic link checking has been added to GitHub actions. (#1226)

  • DeepSparse Pipeline is now printable: __str__ and __repr__ are implemented and show useful information when a pipeline is printed. (#1298)

  • nm-transformers package has been fully removed and replaced with the native transformers package that works with DeepSparse. (#1302)

Performance and Compression Improvements:

  • The memory footprint used during model compilation for models with external weights has been greatly reduced.
  • The memory footprint has been reduced by sharing weights between compiled engines, for example, when using bucketing.
  • Matrix-Vector Multiplication (GEVM) with a sparse weight matrix is now supported for both performance and reduced memory footprint.
  • Matrix-Matrix Multiplication (GEMM) with a sparse weight matrix is further optimized for performance and reduced memory footprint.
  • AVX2-VNNI instructions are now used to improve the performance of DeepSparse.
  • Grouped Query Attention (GQA) in transformers is now optimized.
  • Improved performance of Gathers with constant data and dynamic indices, like the ones used for embeddings in transformers and recommendation models.
  • The InstanceNormalization operator is now supported for performance.
  • The Where operator has improved performance in some cases by fusing it onto other operators.
  • The Clip operator is now supported for performance with operands of any data type.

Resolved Issues:

  • Assertion failures for GEMM operations with broadcast-stacked dimensions have been resolved.
  • Updated unit and integration tests to remove temporary test files and limit test file creation, which were not being properly deleted. (#1058)
  • deepsparse.benchmark was failing with AttributeError when the -shapes argument was supplied, causing no benchmarks to be measured. (#1071)
  • DeepSparse Server with a model.onnx file in the model directory was causing the server to raise an exception for image classification pipelines. (#1070)
  • The generate_random_inputs function no longer creates random data with shapes of 0 when given ONNX files containing dynamic dimensions. (#1086)
  • Pydantic version pinned to <2.0, preventing NameErrors from being raised anytime pipelines are constructed. (#1104)
  • AWS Lambda serverless examples and implementations updated to avoid exceptions being thrown while running inference in AWS Lambda. (#1115)
  • DeepSparse Pipelines: a key error was triggered if num_cores was not supplied as an explicit kwarg for a bucketing pipeline. The pipeline now works correctly without num_cores being explicitly supplied as a kwarg. (#1152)
  • eval_downstream for Transformers pathways no longer fails due to a PyTorch requirement not being installed. The fix removes the PyTorch support dependency, and the pathway now runs through correctly. (#1187)
  • Reliability for unit test test_pipeline_call_is_async has been improved to produce consistent test results. (#1251, #1264, #1267)
  • Tests no longer require Torchvision to be installed. Previously, if it was not installed, all tests, including those for transformers and other unrelated pipelines, would fail with an import error. (#1251)

Known Issues:

  • The compile time for dense LLMs can be very slow. This will be addressed in a forthcoming release.
deepsparse - DeepSparse v1.5.3 Patch Release

Published by jeanniefinks about 1 year ago

This is a patch release for 1.5.0 that contains the following changes:

  • A rare segmentation fault on AVX2 systems has been fixed. It could occur when an input to the network was quantized.
deepsparse - DeepSparse v1.5.2 Patch Release

Published by jeanniefinks over 1 year ago

This is a patch release for 1.5.0 that contains the following changes:

  • Pinned dependency Pydantic, a data validation library for Python, to < v2.0, to prevent current workflows from breaking. Pydantic upgrade planned for future release. (#1107)
deepsparse - DeepSparse v1.5.1 Patch Release

Published by jeanniefinks over 1 year ago

This is a patch release for 1.5.0 that contains the following changes:

  • pandas is now restricted to < 2.0 because the latest transformers datasets versions supported by 1.5 are incompatible with pandas 2.0. Future releases will support later datasets versions. (#1074)
deepsparse - DeepSparse v1.5.0

Published by jeanniefinks over 1 year ago

New Features:

  • ONNX evaluation pipeline for OpenPifPaf (#915)
  • YOLOv8 segmentation pipelines and validation (#924)
  • deepsparse.benchmark_sweep CLI to enable sweeps of benchmarks across different settings such as cores and batch sizes (#860)
  • Engine.generate_random_inputs() API (#966)
  • Example data logging configurations for pipelines/server (#867)
  • Expanded built-in functions for NLP and CV pipeline logging to enable better monitoring (#865) (#862)
  • Product usage analytics tracking in DeepSparse Community edition (documentation)

Performance Improvements:

  • Inference latency for unstructured sparse-quantized CNNs has been improved by up to 2x.
  • Inference throughput and latency for dense CNNs have been improved by up to 20%.
  • Inference throughput and latency for dense transformers have been improved by up to 30%.
  • The following operators are now supported for performance:
    • Neg, Unsqueeze with non-constant inputs
    • MatMulInteger with two non-constant inputs
    • GEMM with constant weights and 4D or 5D inputs

Changes:

  • Transformers and YOLOv5 integrations migrated from auto install to install from PyPI packages. Going forward, pip install deepsparse[transformers] and pip install deepsparse[yolov5] will need to be used.
  • DeepSparse now uses hwloc to determine CPU topology. This fixes a bug where DeepSparse could not be used performantly inside of a Kubernetes cluster with a static CPU manager policy.
  • When users pass in a num_streams parameter that is smaller than the number of cores, multi-stream and elastic scheduler behaviors have been improved. Previously, DeepSparse would divide the system into num_streams chunks and fill each chunk until it ran out of threads. Now, each stream will use a number of threads equal to num_cores divided by num_streams, with the remainder distributed in a round-robin fashion.
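
The new thread split described in the last item can be sketched as simple arithmetic. This is an illustration of the stated rule, not DeepSparse's scheduler code: the function name is hypothetical, and it assumes the round-robin remainder goes to the lowest-numbered streams first.

```python
def threads_per_stream(num_cores, num_streams):
    """Split num_cores across num_streams as evenly as possible.

    Each stream gets num_cores // num_streams threads, and the remainder
    is handed out one thread at a time (assumed to start at stream 0).
    """
    if num_streams <= 0:
        raise ValueError("num_streams must be positive")
    base, remainder = divmod(num_cores, num_streams)
    return [base + (1 if i < remainder else 0) for i in range(num_streams)]


split = threads_per_stream(10, 4)  # 10 cores, 4 streams
```

For 10 cores and 4 streams, each stream gets 2 threads and the 2 leftover threads go to the first two streams, so no core is left idle.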

Resolved Issues:

  • In networks with a Clip operator where min isn't equal to zero, performance bugs no longer occur.

  • Crashing eliminated:

    • Pipeline conll eval using ignore_labels. (#903)
    • YOLOv8 pipelines handling models with dynamic inputs. (#967)
    • QA pipelines with sequence lengths equal to or less than 128. (#889)
    • Image classification pipelines handling PNG images. (#870)
    • ONNX overriding of shapes if a list was not passed in; this now automatically wraps in a list. (#914)
  • Assertion errors/failures removed:

    • Networks with both Convolutions and GEMM operations.
    • YOLOv8 model compilation.
    • Slice and Unsqueeze operators with a negative axis.
    • OPT models involving a constant tensor that is broadcast in two different ways.

Known Issues:

  • None
deepsparse - DeepSparse v1.4.2 Patch Release

Published by jeanniefinks over 1 year ago

This is a patch release for 1.4.0 that contains the following changes:

  • Fallback support for YOLOv5 models with dynamic input shapes provided (not recommended pathway). (#971)
  • Loading of system logging configuration now addressed. (#858)
deepsparse - DeepSparse v1.4.1 Patch Release

Published by jeanniefinks over 1 year ago

This is a patch release for 1.4.0 that contains the following changes:

  • The bounding boxes for YOLOv5 pipelines now scale correctly, producing correct detection boxes. (#881)
deepsparse - DeepSparse v1.4.0

Published by jeanniefinks over 1 year ago

New Features:

  • OpenPifPaf deployment pipelines support (#788)
  • VITPose example deployment pipeline (#794)
  • DeepSparse Server logging with support for metrics, timings, and input/output values through Prometheus (#821, #791)

Changes:

  • Inference speed improved by up to 20% on dense FP32 BERT models.
  • Inference speed improved by up to 50% on quantized EfficientNetV1 and by up to 10% on quantized EfficientNetV2.
  • YOLOv5 integration upgraded to the latest upstream.

Resolved Issues:

  • DeepSparse no longer improperly detects each core as belonging to its own socket on some virtual machines, including those on OVHcloud.
  • Running networks with any Quantized Depthwise Convolution with a nontrivial w_zero_point parameter no longer produces an assertion failure. Trivial in this case means that the zero point is equal to 128 for uint8 data, or 0 for int8 data.
  • An assertion failure in executable_buffer.cpp (see https://github.com/neuralmagic/deepsparse/issues/899) no longer occurs.
  • In quantized transformer models, a rare assertion failure no longer occurs.
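
The definition of a trivial w_zero_point given above can be written as a small check. This is only a sketch of that definition: the function name and the string dtype labels are hypothetical and not part of any DeepSparse API.

```python
# Zero points the release note calls "trivial", keyed by data type.
TRIVIAL_ZERO_POINTS = {"uint8": 128, "int8": 0}


def is_trivial_zero_point(dtype, zero_point):
    """Return True if zero_point is the trivial value for dtype."""
    if dtype not in TRIVIAL_ZERO_POINTS:
        raise ValueError(f"unsupported dtype: {dtype}")
    return zero_point == TRIVIAL_ZERO_POINTS[dtype]


uint8_trivial = is_trivial_zero_point("uint8", 128)
uint8_nontrivial = is_trivial_zero_point("uint8", 0)
```

Any other value, such as a uint8 zero point of 0, is a nontrivial case of the kind that previously triggered the assertion failure.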

Known Issues:

  • None
deepsparse - DeepSparse v1.3.2 Patch Release

Published by jeanniefinks over 1 year ago

This is a patch release for 1.3.0 that contains the following changes:

  • Softmax operators from ONNX Opset 13 and later now behave correctly in DeepSparse. Previously, the semantics of Softmax from ONNX Opset 11 were applied, which would result in incorrect answers in some cases.
  • Quantized YOLOv8 models are now supported in DeepSparse. Previously, the user would have encountered an assertion failure.
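
The Softmax difference mentioned above can be shown with a small pure-Python sketch, based on the ONNX operator definitions rather than on DeepSparse internals. Under Opset 11, the input is coerced to 2D at the given axis and softmax runs over each flattened row. From Opset 13 on, softmax runs along the single given axis. The function names and the flat row-major list layout are assumptions made for illustration.

```python
import math


def _softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_opset11(flat, shape, axis=1):
    """Opset 11 style: flatten dims from axis onward, softmax each row."""
    axis %= len(shape)
    row = math.prod(shape[axis:])
    out = []
    for start in range(0, len(flat), row):
        out.extend(_softmax(flat[start:start + row]))
    return out


def softmax_opset13(flat, shape, axis=-1):
    """Opset 13 style: softmax along one axis only."""
    axis %= len(shape)
    dim = shape[axis]
    inner = math.prod(shape[axis + 1:])  # stride of the chosen axis
    outer = math.prod(shape[:axis])
    out = list(flat)
    for o in range(outer):
        for i in range(inner):
            idx = [o * dim * inner + d * inner + i for d in range(dim)]
            for j, p in zip(idx, _softmax([flat[j] for j in idx])):
                out[j] = p
    return out


data = [1.0, 2.0, 3.0, 4.0]  # row-major tensor of shape (1, 2, 2)
old = softmax_opset11(data, (1, 2, 2), axis=1)
new = softmax_opset13(data, (1, 2, 2), axis=1)
```

Here the Opset 11 version normalizes all four values together, while the Opset 13 version normalizes each pair along axis 1 separately, so the outputs differ. The two agree when the axis is the last dimension.
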
deepsparse - DeepSparse v1.3.1 Patch Release

Published by jeanniefinks almost 2 years ago

This is a patch release for 1.3.0 that contains the following changes:

  • Performance on some unstructured sparse quantized YOLOv5 models has been improved. This fixes a performance regression compared to DeepSparse 1.1.
  • DeepSparse no longer throws an exception when it cannot determine L3 cache information and instead logs a warning message.
  • An assertion failure on some compound sparse quantized transformer models has been fixed.
  • Models with ONNX opset 13 Squeeze operators no longer exhibit poor performance, and DeepSparse now sees speedup from sparsity when running them.
  • NumPy version pinned to <=1.21.6 to avoid deprecation warning/index errors in pipelines.
deepsparse - DeepSparse v1.3.0

Published by jeanniefinks almost 2 years ago

New Features:

  • Bfloat16 is now supported on CPUs with the AVX512_BF16 extension. Users can expect up to 30% performance improvement for sparse FP32 networks and up to 75% performance improvement for dense FP32 networks. This feature is opt-in and is specified with the default_precision parameter in the configuration file.
  • Several options can now be specified using a configuration file.
  • Max and min operators are now supported for performance.
  • SQuAD 2.0 support provided.
  • NLP multi-label and eval support added.
  • Fraction of supported operations property added to engine class.
  • New ML Ops logging capabilities implemented, including metrics logging, custom functions, and Prometheus support.

Changes:

  • Minimum Python version set to 3.7.
  • The default logging level has been changed to warn.
  • Timing functions and a default no-op deallocator have been added to improve usability of the C++ API.
  • DeepSparse now supports the axes parameter to be specified either as an input or an attribute in several ONNX operators.
  • Model compilation times have been improved on machines with many cores.
  • YOLOv5 pipelines upgraded to latest state from Ultralytics.
  • Transformers pipelines upgraded to latest state from Hugging Face.

Resolved Issues:

  • DeepSparse no longer crashes with an assertion failure for softmax operators on dimensions with a single element.
  • DeepSparse no longer crashes with an assertion failure on some unstructured sparse quantized BERT models.
  • Image classification evaluation script no longer crashes for larger batch sizes.

Known Issues:

  • None
deepsparse - DeepSparse v1.2.0

Published by jeanniefinks almost 2 years ago

New Features:

  • DeepSparse Engine Trial and Enterprise Editions now available, including license key activations.
  • DeepSparse Pipelines document classification use case in NLP supported.

Changes:

  • Mock engine tests added to enable faster and more precise unit tests in pipelines and Python code.
  • DeepSparse Engine benchmarking updated to use time.perf_counter for more accurate benchmarks.
  • Dynamic batch implemented to be more generic so it can support any pipeline.
  • Minimum Python version changed to 3.7 as 3.6 reached EOL.
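
As a rough illustration of the time.perf_counter change noted above, a benchmark loop might look like the sketch below. The benchmark helper is hypothetical and is not the DeepSparse benchmarking code; only time.perf_counter itself comes from the Python standard library.

```python
import time


def benchmark(fn, iterations=10):
    """Return the mean wall-clock seconds per call of fn."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    start = time.perf_counter()  # high-resolution, monotonic clock
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iterations


mean_seconds = benchmark(lambda: sum(range(1000)))
```

time.perf_counter is monotonic and uses the highest-resolution clock available, which makes it better suited to short interval measurements than time.time.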

Performance:

  • Performance improvements for unstructured sparse quantized convolutional neural networks implemented for throughput use cases.

Resolved Issues:

  • In the C++ interface, the engine no longer crashes with a segmentation fault when the num_streams provided to the engine_context_t is greater than the number of physical CPU cores.
  • The engine no longer crashes with assertion failures when running YOLOv4.
  • YOLACT pipelines fixed where dynamic batch was not working and exported images had color channels improperly swapped.
  • DeepSparse Server no longer crashes for hyphenated task names such as "question-answering."
  • Computer vision pipelines now additionally accept single NumPy array inputs.
  • Protobuf version for ONNX 1.12 compatibility pinned to prevent installation failures on some systems.

Known Issues:

  • None
deepsparse - DeepSparse v1.1.0

Published by jeanniefinks about 2 years ago

New Features:

Changes:

  • The behavior of the Multi-stream scheduler is now identical to the Elastic scheduler, and the old Multi-stream scheduler has been removed.
  • NLP pipelines for question answering, text classification, and token classification upgraded to improve accuracy and better match the SparseML training pathways.
  • Updates made across the repository for new SparseZoo Python APIs.
  • Max torchvision version increased to 0.12.0 for computer vision deployment pathways.

Performance:

  • Inference performance improvements for
    • unstructured sparse quantized Transformer models.
    • slow activation functions (such as Gelu or Swish) when they follow a QuantizeLinear operator.
    • some sparse 1D convolutions. Speedups of up to 3x are observed.
    • Squeeze, when operating on a single axis.

Resolved Issues:

  • Assertion errors no longer occur when one node has multiple inputs, both coming from the same node.
  • An assertion error no longer occurs when a MatMul operator follows a Transpose or Reshape operator.
  • Pipelines now support hyphenated versions of standard task names such as question-answering.
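
Support for hyphenated task names could be implemented with a normalization step like the sketch below. The function name is hypothetical, and treating the underscore form as the canonical one is an assumption made for illustration.

```python
def normalize_task(task):
    """Map a task name to an assumed canonical underscore form."""
    return task.strip().lower().replace("-", "_")


hyphenated = normalize_task("question-answering")
underscored = normalize_task("question_answering")
```

Both spellings resolve to the same key, so a lookup keyed on the normalized name accepts either form.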

Known Issues:

  • In the C++ interface, the engine will crash with a segmentation fault when the num_streams provided to the engine_context_t is greater than the number of physical CPU cores.
deepsparse - DeepSparse v1.0.2 Patch Release

Published by jeanniefinks over 2 years ago

This is a patch release for 1.0.0 that contains the following changes:

  • Question answering pipeline pre-processing now exactly matches the SparseML training pre-processing. Previously, differences between the logic of the two were leading to minor drops in accuracy.
deepsparse - DeepSparse v1.0.1 Patch Release

Published by jeanniefinks over 2 years ago

This is a patch release for 1.0.0 that contains the following changes:

Crashes with an assertion failure no longer happen in the following cases:

  • during model compilation for a convolution with a 1x1 kernel with 2x2 convolution strides.
  • when setting the num_streams parameter to fewer than the number of NUMA nodes.

The engine no longer enters an infinite loop when an operation has multiple inputs coming from the same source.

Error messaging improved for installation failures of non-supported operating systems.

Supported transformers datasets version capped for compatibility with pipelines.

deepsparse - DeepSparse v1.0.0

Published by jeanniefinks over 2 years ago

New Features:

  • Support added for running multiple models with the same engine when using the Elastic Scheduler.
  • When using the Elastic Scheduler, the caller can now use the num_streams argument to tune the number of requests that are processed in parallel.
  • Pipeline and annotation support added and generalized for transformers, yolov5, and torchvision.
  • Documentation additions made for transformers, yolov5, torchvision, and serving that focus on model deployment for the given integrations.
  • AWS SageMaker example created.

Changes:

  • Click as a root dependency added as the new preferred route for CLI invocation and arg management.

Performance:

  • Inference performance has been improved for unstructured sparse quantized models on AVX2 and AVX-512 systems that do not support VNNI instructions. This includes improvements of up to 20% on BERT and 45% on ResNet-50.

Resolved Issues:

  • When a layer operates on a dataset larger than 2GB, potential crashes no longer happen.
  • Assertion error addressed for Reduce operations where the reduction axis is of length 1.
  • Rare assertion failure addressed related to Tensor Columns.
  • When running the DeepSparse Engine on a system with a non-uniform system topology, model compilation now properly terminates.

Known Issues:

  • In rare cases, the engine may crash with an assertion failure during model compilation for a convolution with a 1x1 kernel with 2x2 convolution strides; hotfix forthcoming.
  • The engine will crash with an assertion failure when setting the num_streams parameter to fewer than the number of NUMA nodes; hotfix forthcoming.
  • In rare cases, the engine may enter an infinite loop when an operation has multiple inputs coming from the same source; hotfix forthcoming.
deepsparse - DeepSparse v0.12.2 Patch Release

Published by jeanniefinks over 2 years ago

This is a patch release for 0.12.0 that contains the following changes:

  • Protobuf is restricted to version < 4.0 as the newer version breaks ONNX.
Package Rankings
Top 3.53% on Pypi.org
Top 6.75% on Proxy.golang.org