MegEngine - MegEngine v1.13.4 Latest Release

Published by Wanwan1996 6 months ago

MegBrain

Bug fixes

Common components

Fixed dump errors when CD4 + FP16 is enabled, caused by abnormal graph optimization at the clip stage and a MIN op bug.
Fixed the index operation failing to locate the correct address when the megengine tensor type is bool.

XLA

Fixed the problem of incorrect device settings during multi-machine training.

CUDA

Fixed the problem of failing to compile due to the lack of a void ** cast.

New Features

Python API

Add FillPoly operator.
Add erf interface.
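
These notes do not give the signature of the new erf interface, so the sketch below does not call MegEngine. It only illustrates the error function itself with `math.erf` from the Python standard library; the helper name `erf_table` is made up for this example.

```python
import math

def erf_table(xs):
    """Evaluate erf(x) = 2/sqrt(pi) * integral from 0 to x of exp(-t^2) dt."""
    return [math.erf(x) for x in xs]

# erf is an odd function that is 0 at the origin and approaches +/-1 for large |x|.
values = erf_table([-1.0, 0.0, 1.0])
print(values)
```

An operator-level erf would be expected to apply the same function elementwise to a tensor, but that is an assumption about the new interface rather than something stated here.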

CUDA

Add support for Hopper series GPUs.

Common components

Fix the problem that reduce operator cannot be executed in io16xc32 mode.

XLA

XLA adds support for FP16 data type.
Added scripts that support xla packaging. From v8.20.3 (included) and later, xla can be installed in the following way: megbrain[xla]==8.20.3+cu111.

Dataloader

Dataloader supports cuda data conversion.

MegEngine - MegEngine v1.13.3

Published by Wanwan1996 10 months ago

MegEngine

HighLight

Added support for training and inference on Cambricon MLU series AI chips.

Known issue

When dump is run with CD4 + FP16 enabled, graph optimization at the clip stage is abnormal, and a MIN op bug causes dump errors. A fix is expected in v1.13.4 (MegBrain v8.20.4).

Bug fixes

Third-party hardware

Fix the problem of rocm compilation failure.
Fixed an issue where the checksum_kernel_union4 kernel could not be found on Cambricon 590.

Common components

Fixed the bug that the reshape operator does not support int64 shape input in trace mode.
Fixed the problem of incorrect calculation of tile operator workspace.
Fixed the issue where the seg transformer model cannot be dumped due to NHWCD4 optimization pass processing errors.
Fixed the megfile version dependency being pinned to a fixed version.
Fixed module_stats raising an error when computing the parameter count and computation cost of a traced_module model.
Improved the error messages for asynchronous execution failures, giving users a way to locate the problem further.
Provide more error information before throwing an exception when an error occurs during graph execution.
Fix the compilation error caused by the missing header file "limits".

Release process

Fix the problem that the megbrain cuda backend fails to compile when built without the MGE_WITH_CUSTOM_OP compilation parameter.

XLA

Fix unstable GPU memory usage in xla.
Fix indexing problems with XLA.
Fix the problem that XLA cannot trace GradManager Callback.
Fix the problem that XLA cannot trace modules with property decorations.

CUDA

Temporarily disabled two algorithms that call cudnn-v8 (AlgoCUDNNConvV8, AlgoCUDNNConvBiasActivationV8) to fix mismatched computation results.
Fixed known issues; cuda11.8 is now officially supported.

Documentation

Fixed the missing _mgb.so in megengine.

New Features

Python API

Add the einsum operator.
Add support for the exponential opr.
Add support for multinomial distribution sampling.
Add Remap module.
Add GaussianBlur module.
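
Multinomial sampling draws category indices in proportion to a given probability vector. The sketch below shows the idea with `random.choices` from the Python standard library; `sample_multinomial` is a hypothetical helper, not the MegEngine API, whose signature these notes do not give.

```python
import random
from collections import Counter

def sample_multinomial(probs, num_samples, seed=0):
    """Draw indices with replacement; index i is chosen with probability probs[i]."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=num_samples)

draws = sample_multinomial([0.1, 0.6, 0.3], num_samples=1000)
counts = Counter(draws)
print(sorted(counts.items()))
```

With enough samples the counts approach the requested proportions, and a category with probability 0 is never drawn.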

Third-party hardware

The Cambricon platform supports neuware version 1.13.0.
Support Cambricon training and inference.

Common components

Add the dilate operator.
Fix memory leak issues in OHOS thread local storage.

XLA

Add fake quant and tqt operators to the xla backend.
XLA supports the linspace, stack, resize, and resize backward operators.
The lsq operator is added to the XLA back-end.

Improvements

Dataloader

Changed the dataset and transform times reported by datamonitor to the total time for one batch, so that they are measured on the same basis as collator time and ipc time.

MegEngine Lite

Bug Fixes

Documentation

Fix the inconsistency between the documentation and implementation of the get_elem_size method in lite.

MegEngine - MegEngine v1.13.2

Published by Wanwan1996 12 months ago

MegEngine

Highlight

cuda118 is now officially supported; see Known issue below for known problems.
MegEngine-XLA is officially released. With XLA optimization, typical basecls/basedet networks run 10%~90% faster on cuda11.8/cudnn8.6.0.

Known issue

With cuda118, a resource destruction exception may occur when using TensorRT for inference.
The training benchmark avg_cpu_usage indicator has increased by 32.4% on average compared with the previous two versions.

Bugfix

Python API

Fix the bug that the arange function cannot set device to cpu.

Third-party hardware

Fixed atlas reporting insufficient event resources in multi-model, multi-thread environments.
Fixed the need to activate atlas_env during atlas synchronization; fixed a memory leak caused by tensordesc not being released; fixed repeated aclInit calls.

Common components

Fix the problem of wrong initialization order of static variables when custom op implements builtin op.
Fixed a memory corruption risk on android caused by megengine depending on setenv.

XLA

Fixed increased GPU memory usage, failure to find ptxas, and incorrect rng seed settings when using xla.

ARM

Upgrade the ndk version to r25c to solve the problem that enabling -D_FORTIFY_SOURCE=2 on armv7 does not take effect under older ndk versions; fix an out-of-bounds memory access in the conv_backdata operator; optimize compilation speed, so that android builds can be 30% faster.

New Features

Python API

Add support for high-dimensional sort on the python side.
Add flip, rotate, resize and rot90 operators.

Peripheral tools

Support forward compatibility of dumped models in MegBrain v8.14.

Common components

Add the kernel implementation of the where operator.

XLA

Functions wrapped by partial_trace fall back to the original python function when the input shape changes; partial_trace supports compiling collective communication operators such as all_reduce into the xla executable to improve the performance of xla-traced models; partial_trace supports tracing optimizer._update, which accelerates the optimizer step method.

CUDA

Add three mixup gpu implementations (cutmix, fmix, mixup).
Add cropandpad operation.
Add support for elemwise uint16 dtype calculations.

Dataloader

Add dataloader monitoring for each stage of data processing, enabled by setting os.environ['MGE_DATA_MONITOR'] = '1'. When num_workers = 0, it reports the data fetching time dataset_time, the data transformation time transform_time, and the batch assembly time collate_time. When num_workers > 0, it additionally reports the inter-process communication time IPC_time.
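
A minimal sketch of switching the monitor on from Python. The environment variable name and the metric names come from these notes; `monitored_metrics` is a hypothetical helper that only restates which metrics are listed for each setting, and setting the variable before the DataLoader is created is an assumption, since the notes do not say when it is read.

```python
import os

# Assumption: set the switch before the DataLoader is constructed.
os.environ["MGE_DATA_MONITOR"] = "1"

def monitored_metrics(num_workers):
    """Return the timing metrics the notes list for a given num_workers."""
    metrics = ["dataset_time", "transform_time", "collate_time"]
    if num_workers > 0:
        # Worker processes add inter-process communication time.
        metrics.append("IPC_time")
    return metrics

print(monitored_metrics(0))
print(monitored_metrics(4))
```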

Improvements

Documentation

Optimize the docstring of existing interfaces.

MegEngine Lite

Bug Fixes

Common components

Fixed intermittent errors when calling get_io_tensor to obtain the device type.

MegEngine - MegEngine v1.13.1

Published by Wanwan1996 about 1 year ago

MegEngine - MegEngine v1.13

Published by Wanwan1996 over 1 year ago

MegEngine

HighLight

MegEngine supports compiling, optimizing, and executing traced graphs with XLA. Typical classification networks on cuda11.8/cudnn8.6.0 can run 10%~80% faster. This feature is experimental. For more information, please refer to the documentation.
Subsequent versions will no longer support cuda10.1.

Bugfix

Dataloader

Improved the dataloader error-reporting mechanism so that Dataloader workers no longer crash silently or hang.
Eliminated the future warning from pyarrow.SerializationContext().
Fix the problem of repeated warnings when the pyarrow version is higher than 1.12.

Third-party hardware

Atlas with aipp enabled now accepts multiple input formats (nhwc, nchw, nc1hwc0).

Common components

Fixed the problem that the index result was wrong when the start of the slice was negative.
Fixed a TracedModule compatibility issue caused by parameter type information in ArgSpec being serialized.

New Features

Python API

Support conversion between megengine tensors and dlpack tensors.
Add trilinear mode for the interpolate operator.

CUDA

Add cuda/naive MHA proxy implementation.

Common components

jit.trace supports without host mode, currently used mainly to integrate other deep learning compilers (for example xla). When without host is True, the function wrapped by trace will not execute its original python code after compilation, nor will it check whether the operator sequence matches the sequence recorded by trace. When using it, you need to ensure that the traced part is completely static.
Support computation between external framework tensors and mge tensors. For example, mge.tensor(torch.tensor)+mge.tensor returns the sum of the two.

XLA

Implement the lowering rules from mge Op to XLA HLO IR, and support compiling and calling XLA in MegEngine.

MegEngine - MegEngine v1.12.4

Published by Wanwan1996 over 1 year ago

MegEngine

HighLight

CUDA_MODULE_LOADING is now enabled by default on the training side, which saves the CUDA GPU memory overhead caused by fatbin loading (effective for packages with cuda version 118 and above), leaving you more GPU memory. The fewer kinds of kernels you use, the larger the saving, up to 900MB of GPU memory.
The two most recent versions including this one (v1.12.4, v1.13) keep support for cuda10.1 and cuda11.4. Later versions will no longer support cuda10.1.

Bugfix

Python API

Fixed the problem that the signatures of F.flatten and Tensor.flatten were not aligned. Both are now unified as flatten(start_axis, end_axis).
Changed the interface format of the python layer multiheadattention functional/module, in preparation for solving problems in the original interface such as being unable to return the intermediate attn matrix and being unable to combine qkvo projection biases.
Changed the interface format of the C++ layer multiheadattention functional/module, in preparation for solving problems in the original interface such as being unable to return the intermediate attn matrix and being unable to combine qkvo projection biases.

Dataloader

Fix the warning caused by not closing related files after reading the system memory size in dataloader.

Common components

Fixed the unintuitive error message shown during trace when the return value of traced_function is a complex nested type.
Fixed garbled error messages printed when logging in to a Windows environment through gitlab.
Fixed intermittent crashes in multi-card training when DTR is enabled.

Peripheral tools

Improved the environment dependencies of the whl package on the windows platform.

ARM

Fix compile error on macos aarch64 with fp16 enabled.

Documentation

Fix the typo in README.md.

New Features

Python API

The profiler adds a scope to functional to record the hierarchy of its calls (currently supports functional/module scope).

CUDA

Support compiling with cuda11.8 on aarch64.
Support and improve the windows cuda118 toolchain.
CUDA_MODULE_LOADING is enabled by default on the training side, which saves the CUDA GPU memory overhead caused by fatbin loading (effective for packages with cuda version 118 and above), leaving you more GPU memory.

Common components

The profiler adds two new indicators to help you read the performance of the current model training more intuitively (see the MR for details): the gpu busy ratio gpu_usage_ratio, which is the share of overall training time spent in gpu training; and the model.step time ratio train_time_ratio, which is the share of overall training time actually spent on training (the sum, over epochs, of the time from the start of the first step to the end of the last step).
Improved the error log for unsupported oprs so that you can see directly which oprs in the input model are not implemented.
Added support for complex numbers, including basic operations such as the four arithmetic operations, differentiation, unpacking, and packing (new ops: F.polar, F.imag, F.real, F.complex; existing ops with added complex support: add, sub, mul, negate, reshape).
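
The two profiler indicators listed above are plain ratios. The sketch below reads gpu_usage_ratio as gpu training time divided by overall training time, and train_time_ratio as the per-epoch span from the first step's start to the last step's end, summed and divided by overall training time. The function names mirror the indicator names, but the functions and timestamps are hypothetical; this is not the profiler's code.

```python
def train_time_ratio(epoch_spans, total_time):
    """Sum (last step end - first step start) over epochs, divided by total time."""
    train_time = sum(end - start for start, end in epoch_spans)
    return train_time / total_time

def gpu_usage_ratio(gpu_busy_time, total_time):
    """Share of the overall training time during which the gpu was training."""
    return gpu_busy_time / total_time

# Hypothetical timestamps in seconds for two epochs of a 100 s run.
spans = [(5.0, 45.0), (55.0, 95.0)]
step_ratio = train_time_ratio(spans, total_time=100.0)
busy_ratio = gpu_usage_ratio(60.0, total_time=100.0)
print(step_ratio, busy_ratio)
```

A low train_time_ratio points at time lost between epochs (for example data loading or evaluation), while a low gpu_usage_ratio suggests the gpu is waiting on the host.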

Improvements

Common components

Improved the error message shown in symbolic trace when a tensor whose value cannot be statically inferred calls a numpy method, making it more complete and reasonable.

Quantization

Quantization adds support for linear_bn and linear_bn_relu.

MegEngine - MegEngine v1.12.3

Published by Wanwan1996 over 1 year ago

MegEngine

HighLight

Add the general_norm operator to support the norm operation on specified axes. For example, with shape=[1,3,256,256], the list [0,3] means that norm is performed on the 0th and 3rd dimensions.
Add a cuda backend implementation of multiattention.

Bugfix

Python API

Fixed an issue where an assertion in general-norm was always true.

CUDA

Fixed MegEngine CUDA failing to build on ubuntu 22.04.
Fixed some models crashing after ION is enabled on the mali 2.0 driver.
Fixed a problem where some user environments did not recognize the CUDA card.

Quantization

Fixed an issue where lsq fakequant could not obtain quantization parameters from an ordinary observer.

Common components

Fixed the problem where logsigmoid would overflow when backpropagated in some cases.
Fix the calculation error in float16 winograd f43 tiling.
Fixed the problem that memory was not saved after enabling sublinear.
Fix the bug of intermittent crashes when loading models in multiple threads.
Fix the bug of model inference crash caused by race condition in ParameterizedDType initialization.
Fix the program Segmentation fault bug caused by the order of destruction when the imperative runtime exits.
Fix the problem that DeformableConv does not support the algorithms interface in the cpu backend.
Improve the trace error message so that it is more friendly when the traced function has no output.
Fix the weight and bias parameters of GeneralNorm being sensitive to initialization timing, ensuring that they are correctly attached before forward is called.
Improve the error message for illegal trace input to make it more friendly; fix the problem that models exported by jit.dump may have duplicate OpNode and VarNode names (for example, when apply_on_var_node is used to construct an operator whose subgraph contains multiple operator nodes).
Fixed a memory leak in distributed training due to multi-stream memory management.

New Features

CUDA

Add the general_norm operator to support the norm operation on the specified axis. For example, shape=[1,3,256,256], given list[0,3], indicates that the norm is performed on the 0th dimension and the 3rd dimension.
Add a cuda implementation of the multiattention operator.
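
To make the axis list in the general_norm example concrete, here is a pure-Python sketch. It assumes that "norm" means zero-mean, unit-variance normalization over the listed axes and it leaves out the weight and bias parameters; both are assumptions, and `general_norm_sketch` is a hypothetical helper rather than the operator's implementation.

```python
import itertools
import math

def general_norm_sketch(data, shape, axes, eps=1e-5):
    """Normalize a flat row-major list over the given axes."""
    strides = [math.prod(shape[i + 1:]) for i in range(len(shape))]
    keep = [i for i in range(len(shape)) if i not in axes]
    out = list(data)
    # One group per index combination of the axes that are NOT normalized.
    for kept_idx in itertools.product(*(range(shape[i]) for i in keep)):
        offsets = []
        for norm_idx in itertools.product(*(range(shape[i]) for i in axes)):
            idx = dict(zip(keep, kept_idx))
            idx.update(zip(axes, norm_idx))
            offsets.append(sum(idx[i] * strides[i] for i in range(len(shape))))
        group = [data[o] for o in offsets]
        mean = sum(group) / len(group)
        var = sum((v - mean) ** 2 for v in group) / len(group)
        for o in offsets:
            out[o] = (data[o] - mean) / math.sqrt(var + eps)
    return out

shape = (1, 3, 2, 4)  # small stand-in for [1, 3, 256, 256]
data = [float(i) for i in range(math.prod(shape))]
normed = general_norm_sketch(data, shape, axes=[0, 3])
print(normed[:4])
```

With axes [0, 3], statistics are computed separately for every (channel, row) pair, across the batch and width dimensions.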

Common components

Add error message when the operand of elemwise is None.
The profiler adds the ability to log python and dispatcher call stacks.
Add the interface to get the corresponding tensor by id in Lite::TensorBatchCollector.
Optimize the development experience of PyMegEngineLite: After the compilation is completed, directly execute PYTHONPATH=lite/pylite:$PYTHONPATH python3 to start using the MegEngineLite python interface.

Improvements

Documentation

Add the description of compilation toolchain selection in readme.
Fix some errors in documentation.

MegEngine Lite

Bugfix

Common components

Fix the problem that the lar fitting mode of MegEngineLite load and run does not support ioc16.
MegEngineLite python3 supports loading models from network interface files (currently via direct OSS reads and file objects).

MegEngine - MegEngine v1.12.2

Published by Wanwan1996 over 1 year ago

MegEngine

HighLight

ARM CPU FP16 inference performance is greatly improved. Taking the vgg16 model as an example, the elapsed time on a mi9 device drops from 481.252ms to 168.300ms. To use it, add --enable-ioc16 when dumping.
Add support for CUDA118_CUDNN860_TRT8531 builds.
Add the ConcatDataset data type for merging multiple existing datasets; see the ConcatDataset documentation.

Bug fixes

Release Process

Fix the problem that dump fails when split does not specify an output shape.

Python API

Fix the bug that the indexing operation (getitem) crashes when the start is empty and the step is negative (e.g. a[::-1]).
Fix the bug that the behavior of indexing operation (setitem) dtype promotion is inconsistent with numpy.
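
For the getitem fix above, the expected result of an empty start with a negative step follows ordinary Python slicing rules, which `slice.indices` makes explicit. This sketch uses only the standard library; `resolve_slice` is a hypothetical helper and nothing here calls MegEngine.

```python
def resolve_slice(length, start=None, stop=None, step=None):
    """Return the concrete indices that a Python-style slice visits."""
    return list(range(*slice(start, stop, step).indices(length)))

# With step=-1 and no start, slicing begins at the last element.
bounds = slice(None, None, -1).indices(5)
reversed_indices = resolve_slice(5, step=-1)
print(bounds)            # (4, -1, -1)
print(reversed_indices)  # [4, 3, 2, 1, 0]
```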

CUDA

Fix the bug that the log does not display the error message when TRT fails to load or run.

ARM

Fix a thread_local memory leak on the android platform.

Common components

Align the dtype property for tensor with np.dtype.
Fix the pass error when channel padding is applied to channel wise conv.
Fix the problem that MGB_USE_MEGDNN_DBG does not take effect when cpp is the opt version. Before the fix, this env is only available in the python whl version and c++ debug version. After the fix, it takes effect in any version.
Fix the bug that cudnn intermittently fails to select an algorithm when the no_profile_when_shape_change option is turned on.
Fix the crash in the channel padding pass for the nchw44 layout when the reduce axis is negative.
Fix the crash issue caused by stack/concat operators when DTR is enabled.
Fix the problem that parameter information of some ops (ConvTranspose, MatrixMul) is lost after graph surgery on the C++ model.
Fix the compatibility problem of some traced module APIs (topk, arange, full, linspace, conv_transpose2d/3d, quantized.convtranspose2d), so that the new version (v8.19.1) can load .tm models from historical versions.
Fix the infinite loop problem that may occur on the BackwardFoldScale pass of TracedModule (affected by the complexity of the model).

New Features

Dataloader

Add ConcatDataset that supports merging multiple datasets.
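
Merging datasets usually comes down to mapping a global index onto one of the underlying datasets. The sketch below shows that mapping with cumulative sizes and `bisect`; `ConcatDatasetSketch` is illustrative only and is not MegEngine's ConcatDataset, whose constructor these notes do not describe.

```python
import bisect
from itertools import accumulate

class ConcatDatasetSketch:
    """Present several indexable datasets as one."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        # cumulative_sizes[i] is the number of items in datasets[0..i].
        self.cumulative_sizes = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.cumulative_sizes[-1] if self.cumulative_sizes else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find the first dataset whose cumulative size exceeds the index.
        which = bisect.bisect_right(self.cumulative_sizes, index)
        offset = index - (self.cumulative_sizes[which - 1] if which else 0)
        return self.datasets[which][offset]

merged = ConcatDatasetSketch([["a", "b"], ["c"], ["d", "e", "f"]])
items = [merged[i] for i in range(len(merged))]
print(len(merged), items)
```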

Python API

Add F.nn.instance_norm interface.

CUDA

Add support for CUDA118_CUDNN860_TRT8531 builds and Nvidia 4X sm_89 cards.

ARM

Add fp16 hybrid direct conv algo.
Adjust the conv1x1 algorithm to support fp16 nchw88.
Add Float16 NCHW88 Winograd algorithm for ARM CPU backend to improve Float16 computation performance. Taking the Vgg16 model as an example, the elapsed time is reduced from 481.252ms to 168.300ms.
Add Float16 MK8 8x8 matmul algorithm.
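
The Vgg16 timings quoted for the Float16 Winograd algorithm can be turned into a speedup factor and a relative reduction. The two numbers below are taken from the notes; the variable names are just for this example.

```python
before_ms = 481.252  # Vgg16 elapsed time before the change
after_ms = 168.300   # Vgg16 elapsed time after the change

speedup = before_ms / after_ms
reduction = 1 - after_ms / before_ms

print(f"speedup: {speedup:.2f}x, time reduced by {reduction:.1%}")
```

This works out to roughly a 2.86x speedup, or about 65% less time per inference.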

Common components

MegEngine operators support tensors whose shape contains 0 as inputs.

Distributed Training

Remove shared memory backend of distributed training.

Improvements

ARM

Optimize ARM FP16 gevm performance (gflops test on aarch64, 95% of shapes have performance improvement ranging from 11% to 156%).

Common components

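For deployments that use dlopen/dlclose, it is recommended to build and link against c++_shared. Internal megvii3 users: set is_linking_system_dynamic_library = True in the BUILD target. CMake users: add -DANDROID_STL=c++_shared to EXTRA_CMAKE_ARGS; for example, to build the android version, run EXTRA_CMAKE_ARGS=" -DANDROID_STL=c++_shared" ./scripts/cmake-build/cross_build_android_arm_inference.sh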

MegEngine Lite

Bug Fixes

Common components

Fix the problem that the output attribute of multiple output models set by lite io interface does not take effect.
Load and run supports automatic recognition of mgv2 model, and fixes the problem that some optimization options cannot be used due to inference using the megenginelite interface. Currently, the megengine interface is used for inference.

New Features

Common components

load_and_run supports online float32 to float16 conversion (enabled by --enable-ioc16).

MegEngine - MegEngine v1.12.1

Published by Wanwan1996 almost 2 years ago

MegEngine

Bugfix

Dataloader

Fix the problem that Dataloader cannot take Infinite as input in v1.12.0 (v8.19.0).

CUDA

Fix compilation errors when cuda/cudnn header files are in CPATH and MGE_WITH_CUDA=OFF.

Common components

Explicitly set the cpp standard for android builds to c++17, fixing libion failing to compile when a third party invokes the MegEngine build through add_custom_command; fix garbled logs with load_and_run --iter 0 by not printing the lar summary log.
Fix the error caused by incorrectly resetting the layout of a low-bit quantized Tensor when no_profiling_on_shape_change is enabled.
Fix the assertion error in channel padding for dynamic-shape subtensor.

Full Changelog: https://github.com/MegEngine/MegEngine/compare/v1.12.0...v1.12.1

MegEngine - MegEngine v1.12.0

Published by Wanwan1996 almost 2 years ago

MegEngine

HighLight

针对 BaseDet 中一些 host bound 严重的算子进行了优化，整体模型较上个版本相比 fp32 下平均提速 12%，fp16 下平均提速 19%，其中包含 group_norm 算子的网络显存降低 20%，在与 cvpack2 中有对应 pytorch 模型的网络相比，速度差距在 2% 以内，基本与 pytorch 对应的模型持平。
修改「descending」默认值为 true 以符合惯常情况下大家对 topK 的定义，topk 默认行为由升序改为降序。
增加了对 python 3.10 的支持。

Bugfix

Dataloader

修复 Infinite sampler 无法获取 batchsize 的问题，并增加了使用示例与参数说明。
修复 ReplacementSampler 设置采样权重后采样结果不符合预期的 bug。
修复 ReplacementSampler 有 weight 时输出的 indices 不符合预期的问题。

Python API

修复 deconv 与 bn 融合错误的问题。
修复 softmax 在 cpu 上计算结果不正确的问题。
修复 ImageNet 解压路径错误的问题。

量化

修复 matmul 对量化 dtype 推理错误的问题。
禁止模型以非对称 qint8 的量化模式推理，去除 fake_quant_bias 里的 assert 以支持更多 QAT 量化模式。

CUDA

修复 Region Restricted Conv 不支持输入的 group 维度等于 1 的情形。
修复使用 --copt “-DMEGDNN_DISABLE_FLOAT16=0” 编译选项时，undefined 的报错。

ARM

修复 fallback im2col 算子所需 workspace 比实际需求大的问题。

X86

修复 x86 INT8 matmul 算子在代码重构时性能变差的问题。

通用组件

无 cuda 环境中开启 subgraph jit 特性可能导致部分 functional API 调用报错，subgraph jit 特性临时改为默认不开启。
修复模型多次初始化时偶发内存用量不一致的问题。
修复 tensor astype 成量化类型时概率性 segmentfault 问题和内存泄露问题。
修复 v1.11.0 及之后的版本 Elemwise multitype 的 loader 和 dumper 函数无法向前兼容的问题。
对 mge(fbs) 格式化，补充 tensor_value_dumper 和 tensor_value_loader 用户接口，方便用户在模型 dump 和 load 阶段自定义一些行为，比如模型的压缩和解压。
修复模型仅能通过 forward 函数进行参数统计导致的参数缺失的问题。
修复 megengine 训练时默认 async_level 情况下的数据竞争导致的运行中随机报错。
修复 load and run jit 设置对非 CUDA 后端无效的问题，增加了 jit 对 CPU 后端的支持。
修复 dump 量化模型时，开启 enable_nchw4/32/64 等选项报 shape 或 channel 不匹配的问题。
调整编译配置，使之对开发者模式更加友好：只需要设置 PYTHONPATH 到 imperative/python 即可，详细参见 scripts/whl/BUILD_PYTHON_WHL_README.md。
移除 python3.8 及之后的 SyntaxWarning。
修复 MegEngine 和 python中 mod 计算结果不一致的问题。
修复 symbolic trace时，3维输入的 matmul 输出 shape 计算错误问题。
修复 ConvolutionBackwardData 算子推断 layout 错误导致的概率性训练崩溃 bug。
加速 reshape、setsubtensor、subtensor、concat、stack 算子。
修复 NormElemwisePass 中 named_args 接口未更新的问题。

文档

修复 warp_affine 的文档错误。

New Features

Python API

deconv 支持 fuse bn 操作。

CUDA

CUDA 上customop 支持新的 RuntimeArgs 参数。
取消 RegionRestrictedConv mask 类型为 uint8 时输入和输出 tensor size 必须为4的倍数的限制。

ARM

ARM 平台支持 fp16 nchw88 im2col 算法，此算法性能较 fp32 nchw44 快2倍左右，主要用于提升 ARM fp16 模型推理速度。
添加 ARM NCHW88 fp16 pooling 算法。

通用组件

Region Restricted Conv 支持 bias。
nchw44/nchw88/nchw44-dot 三种 layout 在 channel 上不满足要求时会 padding channel。
添加 grouonorm 算子。
增加了对 python 3.10 的支持。
为 custom op 新增 cuda 相关的辅助函数，以允许 custom op 异步执行。

Improvements

Python API

修改 descending 默认值为 true ，topk 默认行为由升序改为降序。

CUDA

完善了 dump 和 load 使用的 tensorrt 版本不一致时的错误信息。

MegEngine Lite

Bugfix

通用组件

修复 lite 运行跨 compnode 的模型时 zero copy 不生效的问题。
修复 lite zero copy pass 触发的 UAF 问题。

周边工具

修复 load_and_run fitting 模式下 fast-run 不工作的问题。

MegEngine

Bugfix

Dataloader

Fixed the problem that Infinite sampler cannot getbatchsize, and added usage examples and parameter descriptions.
Fix the bug that ReplacementSampler gets wrong sampling results after setting the sampling weights.
Fix the problem that the indices output by ReplacementSampler does not meet expectations when it has weight.

Python API

Fixed deconv and bn fusion error.
Fixed softmax calculation result incorrectly on cpu.
Fix bad path when untarring imagenet data.

Quantify

Fixed matmul inference error for quantized dtypes.
Forbid the model to reason in the quantization mode of asymmetric qint8, and remove the assert in fake_quant_bias to support more QAT quantization modes.

CUDA

Fixed the issue when Region Region Restrict Conv's group is 1.
Fix the undefined error when using the -- copt "- DMEGDNN_DISABLE_FLOAT16=0" compilation option.

ARM

Fix the problem that the workspace required by the fallback im2col operator is larger than the actual requirement.

X86

Fix x86 INT8 matmul operator's poor performance during code refactoring.

Common components

Enabling the subgraph jit feature in a non-cuda environment may cause some functional API calls to throw errors.
The subgraph jit feature is temporarily changed to be disabled by default.
Fix occasional inconsistent memory usage when the model was initialized multiple times.
Fixed probabilistic segmentfault and memory leak when when set tensor dtype to Quantized.
Fixed the problem that the loader and dumper functions of Elemwise multitype cannot be forward compatible in v1.11.0 and later versions.
Implement the user interfaces 'tensor_value_dumper' and 'tensor_value_loader' for fbs models, which let users register custom behavior at the model dump and load stages, for example model compression and decompression.
Fixed an issue where module_stats does not support information statistics for axion models.
Fixed random errors during operation caused by data races in the case of default async_level during megengine training.
Support the JIT CPU backend and fix the load and run jit options being invalid for backends other than CUDA.
Fix the error when dumping a model to the nchw4/32/64 tensor format.
Fix the build CMakeLists and scripts for a better build experience.
Remove SyntaxWarning after python3.8.
Fix the problem that the mod op gives different results in MegEngine and python.
Fix the probabilistic training crash bug caused by the deduce layout error of ConvolutionBackwardData operator.
Speed up reshape, setsubtensor, subtensor, concat, stack operators.
Fix the problem that the named_args interface in NormElemwisePass is not updated.
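
For the mod op fix listed above, one common source of such a mismatch is the difference between floor-based and truncation-based remainders. The sketch below only contrasts the two conventions in plain Python; it is an assumption that this is the relevant distinction, and it does not reproduce MegEngine's kernel. `floor_mod` and `trunc_mod` are hypothetical helper names.

```python
import math


def floor_mod(a, b):
    # Python convention: the result takes the sign of the divisor.
    return a - b * math.floor(a / b)


def trunc_mod(a, b):
    # C-style convention: the result takes the sign of the dividend.
    return a - b * math.trunc(a / b)


print(floor_mod(-7, 3), -7 % 3)            # both use the floor convention
print(trunc_mod(-7, 3), math.fmod(-7, 3))  # both use the truncation convention
```

The two conventions agree whenever both operands have the same sign, so a mismatch only shows up for mixed-sign inputs.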

Documentation

Fix documentation error of warp_affine.

New Features

Python API

Deconv supports fuse bn operations.

CUDA

Add param RuntimeArgs to customop kernel on CUDA.
Cancel the restriction that the input and output tensor size must be a multiple of 4 when the RegionRestrictedConv mask type is uint8.

ARM

The ARM platform supports the fp16 nchw88 im2col algorithm, which is about twice as fast as the fp32 nchw44 algorithm and is mainly used to improve the inference speed of ARM fp16 models.
Add ARM NCHW88 fp16 pooling algorithm.

Common components

Region Restricted Conv supports bias.
The three layouts nchw44/nchw88/nchw44-dot pad the channel dimension when the channel count does not meet their requirements.
Add groupnorm operator.
Add support for python 3.10.

Improvements

Python API

Change the default value of descending to true, so the default behavior of topk changes from ascending to descending.
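
The effect of the `descending` flag can be illustrated with a small pure-Python stand-in. This is a sketch of the semantics only, not megengine.functional.topk itself; the function below is a hypothetical helper that returns values and indices for a flat list.

```python
def topk(values, k, descending=True):
    """Return the k largest (descending=True) or k smallest values and their indices."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)[:k]
    return [values[i] for i in order], order


data = [3, 1, 4, 1, 5]
print(topk(data, 2))                    # largest first
print(topk(data, 2, descending=False))  # smallest first, the old default behavior
```

Code that relied on the old ascending default needs to pass descending=False explicitly to keep its behavior.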

CUDA

Improve the error message when the tensorrt versions used by dump and load are inconsistent.

MegEngine Lite

Bugfix

Common components

Fix lite zero copy issue at cross compnode env.
Fix var UAF in lite zero copy pass.

Peripheral tools

Fixed fast-run not working in load_and_run fitting mode.

MegEngine - MegEngine v1.11.1

Published by Wanwan1996 almost 2 years ago

MegEngine

Bugfix

通用组件

修改分组卷积计算中通道不匹配时的错误信息，使其更好理解。

CUDA

修复 import megengine 时，cuda 版本检测时报错信息冗余的问题，使报错信息更加合理。
修复了 TRT7 带来的内存泄漏问题。
修复了某些情况下找不到 libnvrtc-builtins.so 的问题。

模型序列化

序列化 fbsv2 模型时，用户可配置是否序列化中间 tensor 信息，当不序列化中间 tensor 信息，可以压缩模型大小。

周边工具

修复 megbrain 访问 redis server 时，对同一个 std::future 反复调用其 get 接口，进而产生的 future_error：no state 问题。

量化

修复 SyncExponentialMovingAverageObserver 在单机场景下不可用的问题。

Python API

修复 deconv 的 flops 统计错误的问题。
修复 Tensor 值小于 1e-4 时，打印显示为0的问题。
增加非 float32 的输入支持，在输入类型不满足要求但仍然是数值时，输出一个元素全为 False 的 bool 类型 Tensor。

New Features

新增 meshgrid 算子的实现。

Python API

通用组件

为 conv_transpose 添加output_padding参数，用来控制输出的图像尺寸。
将 MegEngine 中使用 flatbuffer 定义的模型格式文件上传到 github 中。
新增 warp_affine 算子反向的支持。

ARM

增加 ARM nchw 的 winograd f43 算法的实现，优化 nchw 下 arm 的部分 conv3*3 的速度，有6%～74%的提升。
增加 ARM nchw44 的 winograd f43 实现。

Improvements

ARM

优化 arm fp32 的 sigmoid，推理速度有 10% 的性能提升。

文档

优化文档中关于 batch_norm 的参数及使用介绍的描述，使之更完整明确。
更新 max_pool2d/copy 接口文档，使之更完整明确。
优化部分 python 接口的 docstring。

Dataloader

优化 dataloader，在 basekps 上的性能平均提高5倍。

MegEngine

Bugfix

Common components

Make the error message for input channel mismatch in group convolution more readable.

CUDA

Make the error message of cuda version detection more reasonable when importing megengine.
Fix TRT7 workspace memory leak.
Fix missing libnvrtc-builtins.so in some environments.

Model serialization

When serializing the fbsv2 model, the user can configure whether to serialize the middle_tensor information; leaving it out reduces the model size.

Peripheral tools

Fixed the future_error: no state generated by megbrain calling its get interface repeatedly to the same std::future when accessing the redis server.

Quantization

Fix SyncExponentialMovingAverageObserver not being available in non-distributed mode.

Python API

Fix stats error for deconv flops.
Fix the problem that the tensor value is printed as 0 when it is less than 1e-4.
Add non-float32 input support, output a bool type Tensor whose elements are all False when the input type does not meet the requirements but is still numeric.

New Features

Python API

Add meshgrid opr.

Common components

Add the output_padding parameter for conv_transpose to control the output image size.
Upload the model format file defined by flatbuffer in MegEngine to github.
Support the backward of warp_affine operator.
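
Why an output_padding parameter is useful for conv_transpose can be seen from the size arithmetic. The sketch below assumes the output-size formula commonly used by deep learning frameworks; whether MegEngine uses exactly this convention is an assumption, and both helper functions are hypothetical.

```python
def conv_out_size(in_size, kernel, stride=1, padding=0):
    """Output length of a strided convolution along one axis."""
    return (in_size + 2 * padding - kernel) // stride + 1


def conv_transpose_out_size(in_size, kernel, stride=1, padding=0,
                            dilation=1, output_padding=0):
    """Output length of a transposed convolution along one axis."""
    return ((in_size - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)


# A stride-2 convolution maps both 7 and 8 to the same output length ...
print(conv_out_size(7, kernel=3, stride=2, padding=1),
      conv_out_size(8, kernel=3, stride=2, padding=1))
# ... so the transposed convolution needs output_padding to pick between them.
print(conv_transpose_out_size(4, kernel=3, stride=2, padding=1),
      conv_transpose_out_size(4, kernel=3, stride=2, padding=1, output_padding=1))
```

Because the forward convolution loses the remainder of the division by the stride, output_padding supplies the missing information when the original size has to be restored.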

ARM

Added winograd f43 implementation for ARM nchw.
Add FP32 winograd F43 NCHW44 algo.

Improvements

ARM

Optimize the sigmoid of arm fp32, improve the performance by 10%.

Documentation

Optimize the description of the parameters and usage of batch_norm in the document to make it more complete and clear.
Update the interface document of max_pool2d/copy operator.
Optimize the docstring of some Python API.

Dataloader

Optimize dataloader; performance on basekps improves by 5 times on average.

MegEngine - MegEngine v1.11.0

Published by megvii-mge about 2 years ago

MegEngine

HighLight

Added CUDA INT4 support. Verified with cuda11.4 + cudnn8.2.1 + trt7.2.2.3 on an A2 card: compared with Float32, ResNet-50 top-1 accuracy drops by 0.993% and speed improves 5.8x (557.969ms -> 96.726ms); compared with INT8, ResNet-50 top-1 accuracy drops by 0.131% and speed improves 1.3x (125.76ms -> 96.726ms). See the MegEngine example for details.
Try it out: python3 -m pip install megengine==1.11.0+cu114 -f https://megengine.org.cn/whl/mge.html
Netron can now visualize Traced Module! Try it at https://netron.app/
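
The INT4 speedup factors quoted in this highlight follow directly from the latencies given alongside them. A short check, using only the numbers from the note (the `speedup` helper is hypothetical):

```python
def speedup(before_ms, after_ms):
    """Ratio of the old latency to the new latency."""
    return before_ms / after_ms


print(round(speedup(557.969, 96.726), 1))  # Float32 -> INT4, prints 5.8
print(round(speedup(125.76, 96.726), 1))   # INT8 -> INT4, prints 1.3
```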

Bugfix

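Release process

Fix the bug that occurred when renaming tensors in traced module.

Common components

Fix the condition for skipping algorithms in fastrun.
Fix the OOM error triggered by excessive GPU memory usage in fastrun.
Fix the problem that the process cannot exit with the combination of Windows 7 + 32-bit + multithreading.
Fix the loss of tensor format information during parameter initialization.
Change the algorithm selection for the nchw44 broadcast_vec case to fix the performance defect of nchw44 elemwise.
Fix the source tree pollution problem so that git status again shows only the user's own changes.
Improve the messages printed on convolution channel mismatch and Matmul shape mismatch to make them easier to understand.
Fix occasional data read errors caused by network problems when reading the persist cache.
Fix the hang caused by parameter tensor initialization not taking DTR into account.
Fix the problem that record2 optimization cannot be enabled because softmax dynamically creates oprs such as elemwise at runtime.
Fix the forward compatibility problem caused by elemwise multitype, so that earlier versions of load and run can run models dumped by this version.
Fix the problem that the Repeat operator cannot be used in trace mode.
Fix problems in load_and_run fitting mode such as settings not taking effect when only the input shape or the input batch-size is given.
Fix the large error of ReduceMean between different versions and between CPU and GPU within the same version.
Fix the increased model memory usage in version 1.10.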

CUDA

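Fix the problem that compiling cutlass for SM86 takes too long or fails.
Change the detection logic for multi-GPU environments. The check and warning at initialization on whether every GPU present supports import megengine is removed; an error is reported only when the GPU actually used at runtime does not support import megengine.
Fix the cudnn8 compilation failure.
Fix the TensorRT8 build failure caused by not specifying LIBRARY_PATH.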

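Peripheral tools

Fix the bug that record_comp_seq in load_and_run does not take effect.
Fix inaccurate model computation counts in the parameter and computation statistics tool caused by the limited range of the long type.
Fix the error in load_and_run when a model containing test cases is dumped with global graph optimization.
Fix the parameter and computation statistics tool module_stats counting shared weights repeatedly.
Fix the error in megengine.tools.network_visualize caused by CondTake not being supported.
Fix the bug that load and run shows no speedup after multithread is set.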

ROCM

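Fix the problem that convolution operators cannot run on the ROCM platform because the conv bias implementation is missing.

Distributed training

Fix the training hang caused by setting async_level to 0 during multi-GPU training.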

New Features

Python API

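Expose the following new APIs: is_cambricon_available, is_atlas_available, is_rocm_available, what_is_xpu.

Common components

resize backward supports fp16 and the nhwc data format.
The cache of the CPU and CUDA algo policy is now written in append mode.
Add oprs with bool output type to elemwise multitype to improve the performance of megengine.functional.isnan, megengine.functional.not_equal, megengine.functional.less_equal, megengine.functional.greater_equal, megengine.functional.greater, megengine.functional.less, megengine.functional.isinf, and megengine.functional.equal. After optimization, overall performance is on par with pytorch, and megengine.functional.isinf and megengine.functional.equal outperform pytorch.
Add interfaces for querying the trt, cudnn, and cuda versions in the whl package: megengine.get_cuda_version, megengine.get_cudnn_version, megengine.get_tensorrt_version.
Use VF instructions to optimize the performance of GI direct convolution, winograd convolution, nchw_nchw44 convolution, and matrix multiplication on X86 and RVV. ResNet18 was verified to be 50ms faster on amax04.
Matrix multiplication: 12 Gflops -> 20 Gflops on E5-2620 v4 @ 3.0GHz amax, 0.3 Gflops -> 1.2 Gflops on nezha D1.
GI algo RVV no longer depends on FIXLEN, which avoids the redundant load/store operations generated by FIXLEN and speeds up inference; the resnet18 model is 5% to 10% faster on RVV.
Optimize the softmax implementation. On arm devices, the optimized softmax is about 10 times faster than the previous proxy implementation.
Add a toolchain that supports building with TensorRT8.
load_and_run adds the optimize_for_inference option for mdl models, which applies optimize-for-inference graph optimizations such as bn fusion.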

ARM

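For the pooling operator, support fusing reduce and elemwise operators in the nchw44 format.

Third-party hardware

Optimize performance on X86 + RISC-V; a 1.1x speedup was verified on resnet18.

Peripheral tools

load and run can now be given the loader init interface at runtime, so that a business-side loader whose init API has been renamed can still be loaded by specifying the parameter. This feature uses the option --c-opr-init-interface.
Usage example: ./load_and_run --c-opr-init-interface="your_loader_init_API".
The default value of c-opr-init-interface is mgb_c_opr_init. A value that might be used in a business setting is, for example, anc_c_opr_init.
load_nerwork_and_run supports weight preprocessing and setting the number of warm up iterations.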

Release process

Add a way to build the cu114 whl package.

Improvements

ARM

Optimize the forward performance of the reduce Opr on CPU when reducing over the last dimension of shape (xxx, xxx, 2/3/4); it is about 10 times faster.

CUDA

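Optimize the performance of conv2d with padding mode reflect; the gain is pronounced for large shapes and was verified at about 50%.

Documentation

Improve the documentation of roi_pooling, roi_align, nms, remap, warp_affine, warp_perspective, and interpolate in the functional.vision module.
Improve the description of the mode parameter in the pad documentation to make it more accurate.
Improve the documentation of dataloader, Dataset, and MNIST dataset to make it more complete and clear.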

MegEngine Lite

Bugfix

Fix the memory leak caused by repeatedly calling the get_io_tensor, slice, and concat interfaces in the MegEngineLite python API.
Fix the crash in lite when fast_run and nchw44 are enabled at the same time.

New Features

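LiteConfig in MegEngine Lite adds the auto_optimize_inference option for device detection, which automatically sets the matching layout optimization options based on the CPU information at inference time.
Add an assertion to the set_data_by_share and set_data_by_copy interfaces in Lite that the input must be contiguous when it is a numpy ndarray.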

MegEngine - MegEngine v1.10.0

Published by kagome1007 about 2 years ago

MegEngine

HighLight

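MegEngine models support forward compatibility. That is, a model serialized by a new version of MegEngine can be loaded by an old version of MegEngine.
- Forward compatibility is available from this version onward.
- Some scenarios are not forward compatible. For example, a model that uses an opr newly added in a later version cannot be loaded by an older version.
Add support for python3.9.

Known Issue

In v1.10, sublinear and static-graph dtr do not take effect in trace mode.
ResNet50 inference on 2080ti cuda is slightly slower than in v1.9.
VGG inference on Raspberry Pi is slightly slower than in v1.9.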

Bugfix

Python API

限制把输入自动转换成 tensor 的场景：仅 elemwise 会自动转换输入为 tensor。
修复 megengine.functional.matmul 在动态图模式下反传时挂掉的问题。
修复 megengine.functional.transpose 的 shape 推断错误。
修复 conv 反传和 megengine.random.RNG 算子中空 tensor 的问题。
限制 trace 模式下的 megengine.functional.concat 的 apply 时输入是非 tensor 的类型转换。
修复 megengine.functional 里比较函数结果的 dtype 不为 bool 的问题。

混合精度训练

修复 v1.9 版本在 BaseCls 上部分网络显存占用增大的问题。

通用组件

修复 fp16 参数使 AMP 不能工作的问题。
修复cpuinfo版本，以避免ARM上dlopen时可能造成内存泄露的问题。
修复 adaptive_pooling 在推不出 shape 时 ndim 不正确设置的问题。
修复 riscv64 gcc 使用大于 O0 的编译优化选项报错的问题。
修复异步读写 tensor shape 的错误。
修复 advanced indexing 在一个元素被多次取出时的求导错误。
修复commit改变会导致大量文件重新编译的问题。
修复 fastrun 与 heuristic 混用时缓存混乱的问题。
修复某些情况下在 fork 之后，使用 megengine.get_cuda_compute_capability 接口获取 cuda 环境报错的问题。
修复不能 attach 已经在求导路径上的 Tensor 的问题。
修复类似 softmax 等通过其他 Opr 组合完成计算的 Opr 在 midout 之后运行奔溃问题。
修复 pooling，matmul 中执行 policy 缺失的问题。
修复使用 MegEngineLite 推理，并 reset memory 之后报错的问题，具体为修复 reduce opr 中，当 input 的内存地址发生改变时报错的问题，在实际执行前增加了 update 的功能。
修复 path 里不带 nvcc 时使用 jit 相关的函数会挂的问题。
修复 reduce 算子在 v1.9 其参数 keepdims 的默认值从 True 修改为 False 后，reduce 前后 dim 维度不一样的问题。
修复 layernorm 训练不稳定、normalize 的维度较小时速慢的问题。
修复在极小的概率下 tensor 产生时 shape 信息不全导致获取 shape 时出现卡死的情况。
修复在 adaptivate_pooling 中输入 tensor 作为 tshape 时抛出异常的问题。
修复 reduce 在 backward 构建反向图时，不参与反向计算，没有梯度时抛出异常的问题。
使输入带 axis 选项的 op 都支持负数 axis。
修复使用 GraphInference 跑 mge 计算图时出现的内存泄漏的问题
修复 fastrun 过程中跳过算法的判定条件。
修复 fastrun 过程中显存占用过多触发的 OOM 错误。
修复 maximum(x,x) 求导错误的问题。
在 cmake中添加 MGE_WITH_BENCHMARK 选项，允许开启 DNN 中 BENCHMARK 的编译。
修复 Function 中的 inplace 操作。
修复 broadcast_to 不能被 trace 的问题。
使用 tensor 去构造新 tensor 时检查 dtype, device 等其他参数。

发版流程

修复 traced module 中重命名张量导致的错误。
修复 traced module 中可能错误抛出异常的问题。
修复 traced module 中的兼容性问题

ARM

修复 ARM 上执行 NHWCD4 模型的报错信息。

周边工具

修复 load_and_run fitting 模式下用户开启 const_shape 时 shape 变化的模型抛出异常的问题。
修复 load_and_run 中 record_comp_seq 没有生效的问题。
修复 profile 时 altas 的 event sync 的问题。

New Features

Python API

移除 Imperative python 接口里的 Symbolvar，并将其功能由 Tensor 实现（兼容之前的 mgo 图手术代码）。
新增了支持大 batch size 训练的 lamb 优化器。
megengine.functional.nn.roi_align 算子支持空 tensor 的输入。
添加 swapaxes 接口支持维度交换功能。

通用组件

优化 third_party 的准备工作，增添可选项，改善只训练或者只推理用户的体验。在 cmake 前添加 EXTRA_CMAKE_ARGS="-DMGE_SYNC_THIRD_PARTY=ON" ，会自动调整编译所需的 THIRD_PARTY 库。
增加检查本机 CUDA 版本和当前 MegEngine 依赖的 CUDA 版本是否匹配，如果不匹配打印 warning 信息，如下图所示。
支持对 uint16 tensor 进行 astype 。
在 fastrun 的 profile 模式中添加 warmup，以提高评判的准确。
MegEngine 模型支持前向兼容性。即新版本的 MegEngine 序列化的模型可以在老版本的 MegEngine 加载。
补全 gi 对 risc-v 的支持。
增加 python3.9 的支持。

ARM

在 arm_common 中添加了 chanwise 的 9x9点和 11x11 点积运算；9x9 的情况下有 25% 的无用计算, 11x11 的情况下无用计算只有 8.3%, 在满足对齐的情况下测试 9x9 与 11x11 耗时差距不大，因此推荐使用 11x11 的版本。
在 dnn/src/fallback/matrix_mul 下实现一个 gi 版本的 gemm 非 mk4 的版本。

CUDA

支持 int1 conv 的基本实现。

三方硬件

支持 Atlas710 的硬件。

周边工具

优化了 cmake 编译说明 , 如有问题欢迎提交 PR 修改或在论坛提出反馈。
在 load_and_run 中添加了 fitting 模式接口。
load_and_run --input 选项新增指定输入 shape 的用法。使用格式：--input="data_name:{d0,d1,d2, ...,dn}" 。
load_and_run 新增 layout_transform_batch_size 选项，支持指定全局图优化输入的 batch size。

Improvements

Python API

提高 megengine.functional.nn.pixel_shuffle 在小 shape 下的性能，可达 500%。
提高 megengine.functional.matmul 在小 shape 下的性能约 15%。

通用组件

优化跨 stream 张量复制。
优化 adaptive_pooling 实现。imperative 情况下的 megengine.functional.nn.adaptive_avg_pool2d megengine.functional.nn.adaptive_max_pool2d 速度提升约 6.5 倍。
优化 megengine.functional.nn.conv_transpose3d 实现。imperative 情况下的速度提升约 2 倍。
优化 pooling 实现。imperative 情况下 megengine.functional.nn.avg_pool2d megengine.functional.nn.max_pool2d的速度提升约 5 倍。
优化 megengine.functional.nn.conv_transpose2d 实现。imperative 情况下的速度提升约 3 倍。
在 heuristic cache 中使用简单构造 key 的方式，获得性能提升。
重写 matmul 和 batchmatmul 的自定义求导规则，提升matmul batchmatmul 反向计算速度，与 1.9 版本相比， vit 模型训练单个迭代训练时间从 354ms 降低到 350ms。
缩小单个 sm cuda 编译时间到原来的 2/3。

CUDA

优化大尺寸卷积的 CUDA direct 算法性能，正向的速度达到峰值的 80% 以上。

MegEngine Lite

Bugfix

修复 lite_shared.dll 没有在install 目录的问题。
修复从 numpy 拷贝数据到 device tensor 的错误。
修复 cpu:default 下多线程执行，MegEngine Lite 仍使用同一个线程的问题。
修复 pylite 中的接口名: set_tensorrt_cache → set_redis_cache
修复旧版本load_and_run无法解析历史的打包模型的兼容性问题。

New Features

MegEngine Lite 中添加上传和下载 redis cache 的功能。
MegEngine Lite 中增加 LITE_extra_configure 接口，用户可以设置是否使用模型信息进行网络配置。

MegEngine

Bugfix

Python API

Restrict the use of convert_inputs in py_apply.
Fix megengine.functional.matmul grad error.
Fix megengine.functional.transpose shape infer.
Fix empty tensor bug of conv_bwd and megengine.random.RNG.
Restrict value converts to tensor for megengine.functional.concat.
Fix return dtype of comparison.
Fix the problem that cuda environment cannot be used after fork.
Fix the problem that tensors already in gradient path cannot be attached.
Fix the crash of some Operators running after midout, these Operators will call other Operator to finish compute task, such as softmax.
Fix the problem that policy is missed for pooling and matmul.
Fix the problem of reporting an error when the input memory address changes in reduce opr, and add the update function to fix it before the actual execution.

AMP

Fixed the increased memory usage of some networks on basecls in v1.9.

Common components

Fix an amp error that occurs when some parameters have float16 dtype.
Fix cpuinfo version to avoid memory leakage when dlopen on arm.
Fix incorrect ndim when the shape cannot be inferred for adaptive_pooling.
Fix riscv64 gcc error when using compilation optimization options greater than O0.
Fix the bug when asynchronously reading/writing a tensor's shape.
Print a warning when the CUDA version on the user's machine does not match the CUDA version MegEngine depends on.
Fix advanced indexing grad error.
Fix the problem that many objects need to be recompiled when the commit id changes.
Fix lookup heuristic cache even in fastrun.
Fixed the problem that jit related functions fail when nvcc is not in the path.
Fixed the problem that the default behavior of the reduce operation with respect to keepdims is inconsistent with older versions.
Fixed the problem that layernorm training is unstable and the speed is slow with small normalization dimensions.
Fixed a rare hang when getting a tensor's shape, caused by incomplete shape information at tensor creation.
Fixed the exception thrown when a tensor is passed as tshape in adaptive_pooling.
Fix the problem that reduce does not participate in reverse calculation when constructing backward graphs and throws exceptions when there is no gradient.
Make all ops that take an axis option support negative axis values.
Fixed memory leak when using GraphInference to run mge calculation graphs.
Fix skip condition in fastrun.
Fix OOM error in fastrun.
Fix grad of maximum(x, x).
Add the MGE_WITH_BENCHMARK option to cmake to allow the compilation of BENCHMARK in DNN.
Fix inplace operation on autodiff.Function.
broadcast_to supports mutable target shape.
Check args such as dtype and device when constructing a tensor from an existing tensor.
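
Supporting negative axis values, as mentioned in the list above, usually comes down to one normalization step before the op runs. The sketch below shows that step in plain Python; `normalize_axis` is a hypothetical helper and not a MegEngine function.

```python
def normalize_axis(axis, ndim):
    """Map an axis in [-ndim, ndim) to its non-negative equivalent."""
    if not -ndim <= axis < ndim:
        raise ValueError(f"axis {axis} is out of range for a {ndim}-dimensional tensor")
    return axis % ndim


print(normalize_axis(-1, 4))  # last axis of a 4-d tensor
print(normalize_axis(-4, 4))  # first axis of a 4-d tensor
```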

Release process

Fix the bug occurred when renaming tensor in traced module.
Fix the problem that the trace_module function may raise an error in the finally scope.
Fix traced module compatible issues.

ARM

Fix error message when executing NHWCD4 model on ARM.

Peripheral tools

Fix the problem that the model whose shape changes when the user turns on const_shape in load_and_run fitting mode throws an exception.
Fix the bug that record_comp_seq in load_and_run does not take effect.
Fix the event sync bug of atlas when profiling.

New Features

Python API

Remove Symbolvar and implement its function in Tensor.
Add lamb optimizer that supports large batch size training.
megengine.functional.nn.roi_align operator supports empty tensor input.
Add swapaxes interface to support dimension swapping.

Common components

Optimize the preparation of third_party, add options, and improve the experience of training-only or inference-only users. Adding EXTRA_CMAKE_ARGS="-DMGE_SYNC_THIRD_PARTY=ON" before cmake will automatically adjust the THIRD_PARTY libraries required for compilation.
Add warmup before profile in fastrun.
MegEngine models support forward compatibility. That is, the model serialized by the new version of MegEngine can be loaded in the old version of MegEngine.
Complete gi support for risc-v.
Support python3.9.

Third-party hardware

Support Atlas710.

ARM

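Added chanwise 9x9 and 11x11 dot product operations in arm_common. The 9x9 case has 25% wasted computation while the 11x11 case has only 8.3%; with aligned data the two take about the same time, so the 11x11 version is recommended.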
Implement a gi version of gemm's non-mk4 algorithm under dnn/src/fallback/matrix_mul.
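
The Chinese-language notes for the chanwise 9x9 and 11x11 item quote 25% wasted computation for 9x9 and 8.3% for 11x11. Those figures are reproduced by the arithmetic below under the assumption that each kernel row is padded up to a multiple of 4 elements; that padding width is a guess that matches the quoted numbers, not something the notes state, and `wasted_fraction` is a hypothetical helper.

```python
def wasted_fraction(kernel_width, lane_width=4):
    """Fraction of a padded kernel row that holds padding instead of data."""
    padded = -(-kernel_width // lane_width) * lane_width  # round up to a multiple
    return (padded - kernel_width) / padded


print(f"9x9:   {wasted_fraction(9):.1%}")   # 9 padded to 12
print(f"11x11: {wasted_fraction(11):.1%}")  # 11 padded to 12
```

Both widths pad to the same 12 elements under this assumption, which is consistent with the observation that their runtimes are close.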

CUDA

Support simple implementation of int1 conv.

Peripheral tools

Improve the cmake build notes. If you have any questions, you are welcome to submit a PR or give feedback on the forum.
Added fitting mode interface for load_and_run.
Add the usage of specifying input shape to the --input option of load_and_run. format: --input="data_name:{d0,d1,d2, ...,dn}".
Add layout_transform_batch_size option for load_and_run to specify global layout transform input batch size.
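
The --input shape format described above, "data_name:{d0,d1,d2, ...,dn}", can be read as a name followed by a brace-enclosed list of integers. The parser below is only a sketch of that reading written for illustration; it is not load_and_run's actual parser, and `parse_input_shape` is a hypothetical name.

```python
import re


def parse_input_shape(spec):
    """Split 'name:{d0,d1,...}' into the input name and a tuple of dimensions."""
    match = re.fullmatch(r"\s*([^:{}]+?)\s*:\s*\{([^{}]*)\}\s*", spec)
    if match is None:
        raise ValueError(f"not a shape spec: {spec!r}")
    name, dims = match.groups()
    shape = tuple(int(d) for d in dims.split(","))  # int() tolerates spaces
    return name, shape


print(parse_input_shape("data:{1,3,224,224}"))
```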

Improvements

Python API

Speed up megengine.functional.nn.pixel_shuffle on small shapes by up to 500%.
Speed up megengine.functional.matmul on small shapes by 15%.

CUDA

Speed up the CUDA direct algorithm for large convolutions; forward speed reaches over 80% of peak.

Common components

Improve cross stream memory borrowing.
Speed up megengine.functional.nn.adaptive_avg_pool2d and megengine.functional.nn.adaptive_max_pool2d on imperative by 6.5 times.
Speed up megengine.functional.nn.conv_transpose3d on imperative by 2 times.
Speed up megengine.functional.nn.avg_pool2d and megengine.functional.nn.max_pool2d on imperative by 5 times.
Speed up megengine.functional.nn.conv_transpose2d on imperative by 3 times.
Use a simple hash key in the heuristic cache to improve performance.
Rewrite the custom grad rules of matmul and batchmatmul to improve the backward calculation speed. Compared with version 1.9, the training time of one iteration of vit model is reduced from 354ms to 350ms.
Reduced single sm cuda compile time to 2/3.

MegEngine Lite

Bugfix

Fix the bug that lite_shared.dll is not in the install directory.
Fix set data by copy on device tensor.
Fix cpu:default create new thread.
Correct the set_redis_cache API name in pylite.
Fixed the compatibility issue that the packaged model could not be resolved with the old version of load_and_run.

New Features

Add redis cache support for uploading and downloading in MegEngine Lite.
Add LITE_extra_configure interface for Lite. Users can set whether to use model info for network configuration.

MegEngine - MegEngine v1.9.1

Published by kagome1007 over 2 years ago

MegEngine

Bugfix

Python API

Fix the empty tensor bug of conv backward and megengine.random.RNG.
Restrict the conversion of non-tensor inputs when applying megengine.functional.concat in trace mode.

MegEngine Lite

Bugfix

Fix the problem that MegEngine Lite still uses the same thread under multithreaded execution with cpu:default.

MegEngine - MegEngine v1.9.0

Published by megvii-mge over 2 years ago

MegEngine

Known Issue

Training reports an error when the input of megengine.random.RNG contains a 0-dimensional tensor.

HighLight

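Performance improves considerably in this release: most networks train about 10% faster, and heavily host-bound scenarios such as detection models and QAT training are 20%~40% faster. The speedup is especially noticeable with small batches, amp, and similar settings. Verified on multi-GPU training with BaseCls, the average speedup is 15.4%.
- In some operators the output tensor can now share data with the input tensor (Memory Forwarding). No data copy happens at that point; a copy is triggered only when a tensor whose data is shared gets modified, so the other tensors sharing that data are not affected. The operators involved include megengine.functional.transpose, megengine.functional.broadcast_to, megengine.functional.reshape, megengine.functional.expand_dims, megengine.functional.split, tensor indexing, and others. This removes data copies wherever possible and improves performance. To guard against abnormal GPU memory usage in extreme cases, megengine.config.disable_memory_forwarding is provided to disable the feature.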
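
The Memory Forwarding behavior described in this highlight is a form of copy-on-write. The toy class below illustrates the idea in plain Python under simplifying assumptions (a 1-D list as storage, a shared counter instead of real reference tracking); `CowTensor` is a hypothetical name and has nothing to do with MegEngine's internal implementation.

```python
class CowTensor:
    """Toy 1-D tensor whose views share storage until one of them is written."""

    def __init__(self, data):
        self._buf = list(data)  # backing storage
        self._shared = [1]      # boxed count of tensors using this storage

    def view(self):
        # No copy here: the new tensor points at the same storage.
        other = CowTensor.__new__(CowTensor)
        other._buf = self._buf
        other._shared = self._shared
        self._shared[0] += 1
        return other

    def __setitem__(self, index, value):
        if self._shared[0] > 1:
            # Storage is visible to another tensor: detach and copy first.
            self._shared[0] -= 1
            self._buf = list(self._buf)
            self._shared = [1]
        self._buf[index] = value

    def tolist(self):
        return list(self._buf)

    def shares_storage_with(self, other):
        return self._buf is other._buf


a = CowTensor([1, 2, 3])
b = a.view()
shared_before_write = a.shares_storage_with(b)
b[0] = 9  # triggers the copy; a is left untouched
print(shared_before_write, a.tolist(), b.tolist())
```

The copy is paid only by the tensor that writes, which is why read-only uses of transpose, reshape, and similar operators get cheaper.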

Notice

This release still supports python3.5. Support will stop from the next version, MegEngine v1.10 (MegBrain v8.17), so please prepare in advance.

Bug fixes

Python API

修复使 @ 运算符与 megengine.functional.matmul 的行为一致。
修复使用 megengine.functional.nn.pad ，输出 Tensor 值可能为全 0 的问题。
为 megengine.functional.nn.remap megengine.data.transform.Resize 添加 nearnest mode 模式。

通用组件

修复在混合精度训练时无法使用 megengine.functional.nn.sync_batch_norm 的问题。
修复全局优化 conv 与两个 nolinear 算子融合时出错的问题。
修复不开 fastrun 的情况下大 kernel 卷积速度慢的问题。
修复对输入为非 float32 的类型求导时不报错，并且没有梯度的问题。
修复分布式训练 RPC 通信 IO 中断问题。
修复 BatchNorm 对二阶导的支持问题。

New Features

Python API

megengine.functional.nn.conv1d megengine.functional.nn.conv2d 增加 padding_mode 参数，支持 zeros、reflect、replicate 模式。

CUDA

添加大核的 direct conv 实现。
添加 implicit bmm 大核 depthwise conv 的实现。
CUDA 上 resize 的 nearest mode 支持不止 1 和 3 的多通道输入。

通用组件

基于业务降噪模型进行关于 cd4 优化，主要是添加 NHWC 和 NHWCD4 两种 format 之间的转换。在业务的降噪模型上验证性能提升 15% 左右。
添加 int1 数据类型的支持。
tensor indexing 中支持 np.newaxis(None) 。

Improvements

通用组件

优化性能，大部分网络训练提速约 10% ， host bound严重的 vit、检测模型，在 QAT 场景有 20%~40% 的加速。
提升 op dispatch 系统的性能。修复了 v1.8 使用的新 dispatch 系统存在的性能问题，修复后性能与 v1.7 持平。
提升 dispatch 系统 jit trace 性能。性能与 v1.7 相比略有提升。开启 trace 下部分模型训练性能提升如下， ResNet50 提升 0.7% ， ShuffleNet 提升 9%， ATSS 提升 10%。
subgraph op 支持 shape 推导和 jit fusion 优化，并用 subgraph op 重写了部分由 elemwise 组合成的性能较差的op。优化后 megengine.functional.nn.hsigmoid、megengine.functional.nn.relu6、megengine.functional.nn.prelu、megengine.module.LeakyReLU、megengine.functional.nn.softplus 、megengine.functional.nn.logsigmoid、megengine.functional.where 性能在大输入 shape 时与 pytorch 持平。
提升batch_norm的性能，小尺寸下提升 4.3 倍。
优化 reduce op 性能，速度提升 75%。

CUDA

融合 conv 和 h_swish，部分模型性能提升。

MegEngine Lite

Bug fixes

lite 修复全局图优化接口 symbolvar 替换不完整导致 cuda 设备上无法使用的问题。
修复 load_and_run lite 模型全局图优化接口与 fast-run 接口使用冲突的问题。
修复 load_and_run 使用 “–cuda” 参数时报错的问题

New Features

lite-c 接口中添加错误码和全局获取错误码的接口 LITE_get_last_error_code。
lite 增加通过虚拟地址查询物理地址的接口。
load_and_run 支持 lite 模型全局图优化。

Improvements

优化 Lite 中 get_data_by_share python 接口的性能。在算法仓的模型中略有性能提升。

MegEngine

Bug fixes

Python API

Make the "@" operator behave consistently with megengine.functional.matmul.
Fix the problem that the output tensor of megengine.functional.nn.pad may be all 0.
Add the nearest mode for megengine.functional.nn.remap and megengine.data.transform.Resize.

Common components

Fix megengine.functional.nn.sync_batch_norm not being available when training with mixed precision.
Fix the bug when fusing conv bias with two nonlinear oprs.
Fix the problem of poor performance of the large kernel convolution without fastrun.
Fixed the bug that attaching a non-float tensor to gm neither reports an error nor produces a gradient.
Fix the IO interruption for RPC communication when distributed training.
Fix BatchNorm support for higher-order differentiation.

New Features

Python API

Add the padding_mode parameter to megengine.functional.nn.conv1d and megengine.functional.nn.conv2d, supporting the zeros, reflect, and replicate modes.
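
The three padding modes differ only in what they put in the border. The 1-D sketch below shows one common definition of each; it assumes the reflect convention in which the edge element itself is not repeated, which may or may not match MegEngine in every detail, and `pad1d` is a hypothetical helper.

```python
def pad1d(row, pad, mode="zeros"):
    """Pad a list on both sides; reflect mode requires pad < len(row)."""
    n = len(row)
    out = []
    for i in range(-pad, n + pad):
        if 0 <= i < n:
            out.append(row[i])
        elif mode == "zeros":
            out.append(0)
        elif mode == "replicate":
            out.append(row[min(max(i, 0), n - 1)])  # clamp to the nearest edge
        elif mode == "reflect":
            j = -i if i < 0 else 2 * (n - 1) - i    # mirror around the edge
            out.append(row[j])
        else:
            raise ValueError(f"unknown padding mode: {mode}")
    return out


for mode in ("zeros", "replicate", "reflect"):
    print(mode, pad1d([1, 2, 3], 2, mode))
```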

CUDA

Add implementation of large kernel's direct conv algo.
Add implementation of large kernel's depthwise conv by implicit bmm.
The nearest mode of resize on cuda supports more than 1 and 3 multi-channel inputs.

Common components

Add conversion between NHWC and NHWCD4 formats.
Add support for int1 dtype.
Add np.newaxis(None) for tensor indexing.

Improvements

Common components

Optimized performance: most networks train about 10% faster, and heavily host-bound VIT or detection models in QAT scenarios are 20% to 40% faster.
Improve the performance of the op dispatch system. Fix the performance problems of the new dispatch system in version 1.8. After the repair, the performance is the same as that of version 1.7.
Improve the jit trace performance of the dispatch system. The performance is slightly improved compared to the 1.7 version.
When trace is enabled, the training performance of some models is improved as follows, resnet50 0.7%, shufflenet 9%, and atss 10%.
Subgraph op supports shape infer and jit fusion optimization, and rewrites some ops with it.
Performance of megengine.functional.nn.hsigmoid, megengine.functional.nn.relu6, megengine.functional.nn.prelu, megengine.module.LeakyReLU, megengine.functional.nn.softplus, megengine.functional.nn.logsigmoid, and megengine.functional.where is on par with pytorch for large input shapes.
Improve the performance of the batch_norm op by 4.3 times for small sizes.
Improve the performance of the reduce op by 75%.

CUDA

Fusion of conv and h_swish, the performance of some models is improved.

MegEngine Lite

Bug fixes

Fix lite global layout transform symbolvar replace error.
Fix the conflict between load_and_run lite model global layout transform optimization interface and fast-run interface.
Fix load_and_run error when using "--cuda" parameter.

New Features

Add 'LITE_get_last_error_code' interface in lite-c.
Add an interface in lite to query the physical address from a virtual address.
Load_and_run supports lite model global layout transform optimization.

Improvements

Optimize the get_data_by_share interface of LiteTensor.

MegEngine - MegEngine v1.8.2

Published by megvii-mge over 2 years ago

MegEngine

Known Issue

GPU memory usage (MiB) for training and inference increases to varying degrees across models.

New Features

CUDA

添加大卷积核的 direct conv 实现。
添加 implicit bmm 大卷积核 depthwise conv 的实现。

MegEngine

New Features

CUDA

Add implementation of large kernel's direct conv algo.
Add implementation of large kernel's depthwise conv by implicit bmm.

MegEngine - MegEngine v1.8.1

Published by megvii-mge over 2 years ago

MegEngine

Notice

从下个版本 MegEngine v1.9 开始将停止对 python3.5 支持，请大家提前做好准备。

HighLight

megengine.functional.topk 新增「descending」以定义排序行为，本次版本默认为「False」保持从小到大排列，如果未指定则提示warning 信息。在 v1.12 版本将修改「descending」默认值为 true 以符合惯常情况下大家对 topK 的定义，即从选出二维矩阵中 Top-K 个最大元素。
MegEngine 支持端上训练，使用参考这里。

Bug fixes

Python API

修复 megengine.functional.floor_div 对于异号整数输入的计算错误。
使 megengine.functional.broadcast_to 接受 None，表示这一维无需进行广播以支持 -1 shape 自动推导。

发版流程

修复 MegEngine v1.7 版本序列化的 TM 模型，由 MegEngine v1.8 版本加载做图手术会失败的问题。
TracedModule Bug 修复如下。
- 修复无法序列化第三方后端中 op 的问题。
- 修复 Input 类型 expr 未绑定 top_graph 的问题。
- 修复图手术中将 ModuleNode 作为输入时，expr 的插入位置计算错误的问题。
- 修复 TracedModule 加载 v1.7 及之前含有 ones 或 zeros 的模型无法运行的问题。
- 修复 TracedModule 在部分情况下递归过深的问题。
- 修复 TracedModule 无法重复 trace 的问题。
- 修复 TracedModule 无法正确识别 pad 的问题。
- 改善 TracedModule 对不合法输入的报错信息。
修复同时开全局图优化和 fastrun 时，选中的算法只有 naive 时会报错的问题。

CUDA

前置输入 Tensor 太大的判断，优化错误提示信息，避免直接输出 cuDNN 报错。
修复 tensorrt 改变 shape 时，output推导错误问题

通用组件

修复 MegDNN fallback 的 ConvBias 算子不可用的问题。
修复图优化之后无法正常 fastrun 模型中的 matmul 和 pooling 的问题。
修复在低内存环境（8G）无法编译 MegEngine 的问题。
修复将较大的 numpy array 转换为 tensor，或将较大的 tensor 转换为 numpy array 时，占用额外内存的问题。
增加计算设备上的异步错误的检查与报错。
修复了 tensor 的 ndim 未知时 indexing 操作无法被 trace 的问题。

周边工具

修复 load and run 命令行输入的数据无法解析的问题
修复 io dump 中 qint4 和 bool 数据类型 dump 错误
修复megengine.utils.module_stats没有import相关库而无法使用的问题
修复 load and run 编译 cuda 时错误。
删除 dump_with_testcase 工具。
修复 load and run 无法识别用 flatbuffer 序列化模型的问题。
修复参数和计算量统计工具 module_stats 接口的 inputs 为 dict 时，无法统计的问题。
修复 load and run工具使用 --get-static-mem-info选项，统计得到的权重信息数据有误的问题。
修复 load_and_run 工具中，使用形如 –input "ratio:1.0" 选项时的参数解析错误。

New Features

Python API

添加 megengine.functional.diag 算子。

发版流程

TracedModule 支持在图手术过程中修改 Node 的名字。
为 TracedModule 提供一个 enable_expr_checker 开关，以在 trace 时进行更多检查。

ARM

优化 Arm 中部分数学计算的实现，性能有微弱的提升
ARM 后端支持 rnn_cell/lstm_cell/lstm 算子
添加 elemwise 部分 case 对多线程的支持，以支持 TS 项目部分模型性能优化。

第三方硬件

增加对寒武纪 MLU270 支持。
TensorRT Runtime Opr 支持动态 shape 的模型,且可根据输入 shape 主动选择相近「IOptimizationProfile」。

通用组件

CPU 支持运行 int4 模型。
megengine.functional.nn.remap 支持 dtype 为 float16 下的求导
优化非连续情况下的 typecvt 的性能
新增端上训练支持，更多详情查看这里
在 windows 系统上，load_and_run 增加动态链接 MegEngine 支持。

周边工具

新增了 cmake 格式化工具，执行可将 cmake 相关文件进行格式化。
Custom Op 增加 JIT 构建工具，文档待补充。
支持构建 Android whl 包。

Improvements

Python API

优化 megengine.random.RNG.uniform API中 low=0 & high=1 的情况下的 elemwise 开销，单算子性能提升约75% 。

CUDA

改进 megengine.functional.nn.softmax 在 axis 为标量时，CUDA 平台上的性能提升约200%～450%。
提高 megengine.functional.nn.dropout 在 CUDA 平台上的性能，可提升约 650%。
提高 megengine.functional.nn.layer_norm 在 CUDA 平台上的性能，可提升约 540%。

ARM

当一个 tensor 需要进行 int16/uint16 → float 的转换，并且转换后的数据进行 Mul/ADD 运算时，将多个运算合并为 ElemwiseMultiType，在010项目的 369 号模型验证性能提升约20倍(23512.8us →1208 us)。

通用组件

动态 AMP 性能提升，多个模型验证可提升约1% 。
优化 cpu 环境下 jit.trace 的时间。bs 256 、VGG16 模型验证，jit.trace 从约 4 分钟提升至 2 分钟。
修复在 cpu 上模型执行速度过慢的问题，在 VGG16 bs 10 验证从 10 分钟提升至约 6s。

MegEngine Lite

Bug fixes

修复 lite 中 TensorBatchCollector 的 device id 设置错误
Lite 中空 tensor 的 to_numpy 方法增加输出 Tensor 的数据类型信息
修复用户在自定义模型输出空间时部分模型推理失败的问题
修复 MegEngine Lite 的 device 配置接口为只设置 xpu 的 device type 为用户指定的 device type 。
修复 MegEngine Lite python 接口在 TensorBatchCollector 的 batch id 出错时没有报错日志输出的问题。
修复 MegEngine Lite 开启「record level 2」时报错的问题。

New Features

lite 中增加对寒武纪的支持。
MegEngineLite 新增一个名为 get_data_by_share 的接口。通过调用该接口，用户可以零拷贝地获得一个 lite tensor 的 numpy 对象。
增加 cv 的分类与检测的 example 。
新增全局图优化支持。

MegEngine

Notice

Drop support for python3.5 from MegEngine v1.9.

HighLight

megengine.functional.topk will default to descending order in v1.12. Please specify the "descending" argument during the transition period.
MegEngine supports on-device training; you can refer to here.

Bug fixes

Python API

Correct behavior of megengine.functional.floor_div for integers with opposite sign.
Allow passing None to megengine.functional.broadcast_to , meaning the corresponding axis should not broadcast.
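
The floor_div fix concerns integers with opposite signs, which is exactly where floor division and truncating division disagree. The sketch below contrasts the two in plain Python; it illustrates the expected semantics only and does not claim to show what the incorrect MegEngine result was. Both helpers are hypothetical.

```python
def trunc_div(a, b):
    """Integer division that rounds toward zero (C-style)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def floor_div(a, b):
    """Integer division that rounds toward negative infinity."""
    q = trunc_div(a, b)
    if q * b != a and (a < 0) != (b < 0):
        q -= 1  # inexact result with opposite signs: step down by one
    return q


print(trunc_div(-7, 2), floor_div(-7, 2))  # the two differ for opposite signs
print(trunc_div(7, 2), floor_div(7, 2))    # and agree for equal signs
```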

Release process

Fix a compatibility issue with TracedModule.
Fix TracedModule bugs:
- Fix the problem that ops in third-party backend such as tensorrt can not be serialized.
- Fix the problem that an Input type expr was not bound to top_graph.
- Fix the problem of incorrect calculation of expr insertion position when ModuleNode is used as input of graph operation.
- Fix a bug of v1.7: the model with ones or zeros can't work.
- Fix a recursion too deep issue when copying traced module.
- Fix an error that prevents traced module from tracing a module more than once.
- Fix traced module not recognizing pad.
- Improve error message for illegal inputs feed into traced module.
Fixed the problem that when global graph optimization and fastrun are enabled at the same time, an error will be reported when the selected algorithm is only naive.

CUDA

Check earlier whether the input Tensor is too large and improve the error message, to avoid surfacing the raw cuDNN error.
Fixed output derivation error when tensorrt changed shape.

Common components

Fix the problem that the ConvBias operator of MegDNN fallback is not available.
matmul, pooling operators support fastrun, which will lead to better inference performance for C++ models.
Fix the build issue in low memory environments (8G).
Reduce memory consumption when a large numpy array is converted to tensor or a large tensor is converted to numpy array
Add out-of-bound access check for some operators.
Fix the problem that the indexing operation cannot be traced when the ndim of the tensor is unknown.

Peripheral tools

Fixed the problem that the data entered in the load and run command line could not be parsed.
Fix qint4 and bool data type dump errors in io dump.
Fix the problem that megengine.utils.module_stats cannot be used without import related libraries.
Fix load and run build error when build with CUDA.
Remove dump_with_testcase tool.
Fix the problem that load and run cannot recognize the serialized model with flatbuffer.
Fix a bug in megengine.tools.network_visualize when inputs is an instance of dict.
Fix a bug where users get wrong statistics when using --get-static-mem-info.
Fix a bug where load_and_run reports a parsing error on commands such as --input "ratio:1.0".

New Features

Python API

Add megengine.functional.diag operator.
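
For readers unfamiliar with diag-style operators, here is a minimal pure-Python sketch of the usual NumPy-style semantics. It is illustrative only and is not MegEngine's implementation. The helper name `diag_sketch` and the diagonal offset `k` are assumptions made for this example.

```python
def diag_sketch(inp, k=0):
    """Pure-Python model of NumPy-style diag semantics (illustrative only).

    1-D input -> 2-D square matrix with `inp` on the k-th diagonal.
    2-D input -> 1-D list holding the k-th diagonal.
    """
    if inp and isinstance(inp[0], (list, tuple)):
        rows, cols = len(inp), len(inp[0])
        # Walk the k-th diagonal: entries (i, i + k) that lie inside the matrix.
        return [inp[i][i + k] for i in range(rows) if 0 <= i + k < cols]
    n = len(inp) + abs(k)
    out = [[0] * n for _ in range(n)]
    for i, v in enumerate(inp):
        # Positive k shifts the diagonal right, negative k shifts it down.
        r, c = (i, i + k) if k >= 0 else (i - k, i)
        out[r][c] = v
    return out


matrix = diag_sketch([1, 2, 3])          # build a diagonal matrix
diagonal = diag_sketch([[1, 2], [3, 4]])  # extract the main diagonal
```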

Release process

Support modifying node names during graph operations in TracedModule.
Add an enable_expr_checker switch for traced module, which adds more checks during tracing.

ARM

Optimize the implementation of some mathematical calculations on ARM, slightly improving performance.
Add arm rnn_cell/lstm_cell/lstm operator.
Support multithreading for some ARM ternary elemwise operations.

Third-party hardware

Added support for cambricon MLU270.
Support dynamic shape models in TensorRT Runtime Opr, automatically selecting the closest IOptimizationProfile according to the input shape.
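
The notes do not say how the "closest" profile is chosen. The sketch below assumes one plausible rule: among the profiles whose min/max range covers the input shape, pick the one whose opt shape has the smallest L1 distance to it. `Profile` and `closest_profile` are hypothetical names for illustration and are not part of MegEngine or TensorRT.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

Shape = Tuple[int, ...]


class Profile(NamedTuple):
    """Hypothetical stand-in for an optimization profile (min/opt/max shapes)."""
    min_shape: Shape
    opt_shape: Shape
    max_shape: Shape


def closest_profile(profiles: Sequence[Profile], shape: Shape) -> Optional[int]:
    """Return the index of the closest profile that covers `shape`, else None."""
    best, best_dist = None, None
    for idx, p in enumerate(profiles):
        # A profile is usable only if every dimension lies within [min, max].
        covers = len(shape) == len(p.opt_shape) and all(
            lo <= s <= hi for lo, s, hi in zip(p.min_shape, shape, p.max_shape)
        )
        if not covers:
            continue
        # Assumed distance metric: L1 distance to the profile's opt shape.
        dist = sum(abs(s - o) for s, o in zip(shape, p.opt_shape))
        if best_dist is None or dist < best_dist:
            best, best_dist = idx, dist
    return best


profiles = [
    Profile((1, 3, 224, 224), (1, 3, 224, 224), (4, 3, 224, 224)),
    Profile((1, 3, 224, 224), (8, 3, 224, 224), (16, 3, 224, 224)),
]
chosen = closest_profile(profiles, (6, 3, 224, 224))
```

Returning None when no profile covers the shape leaves the fallback policy to the caller. What the real runtime does in that case is not stated in these notes.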

Common components

CPU supports running int4 model.
Support backward computation for float16 dtype in remap.
Optimize the performance of typecvt in non-continuous situations.
Add training based on the C++ interface.
On Windows, load_and_run now supports dynamically linking megengine.

Peripheral tools

Added a cmake formatting tool: cmakeformat.py.
Add the JIT builder for Custom Op.
Support building Python wheels for Android (termux env).

Improvements

Python API

Add fastpath when low=0 and high=1 for megengine.random.RNG.uniform to improve performance.
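
The idea behind such a fast path can be sketched in plain Python: when the requested range is already [0, 1), the scale-and-shift step can be skipped. This is a conceptual illustration under that assumption, not MegEngine's code. `uniform_sketch` and its parameters are hypothetical.

```python
import random


def uniform_sketch(low=0.0, high=1.0, size=1, rng=None):
    """Illustrative fast path: skip scale-and-shift for the unit interval."""
    if rng is None:
        rng = random.Random()
    raw = [rng.random() for _ in range(size)]  # samples in [0, 1)
    if low == 0 and high == 1:
        # Fast path: the raw samples already cover the requested range.
        return raw
    # General path: one extra multiply and add per element.
    return [low + (high - low) * x for x in raw]


unit = uniform_sketch(size=4, rng=random.Random(0))
scaled = uniform_sketch(2.0, 5.0, size=4, rng=random.Random(0))
```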

CUDA

Improve performance of softmax when axis is scalar on CUDA platforms, by 200% - 450%.
Enhance performance of dropout on CUDA platforms by up to 650%.
Enhance performance of layer_norm on CUDA platforms, by up to 540%.

ARM

Add an operator fusion case for TypeCvt and Elemwise: a pass fuses a TypeCvt (uint16 to float) operator and one Elemwise operator (Mul/ADD) into an ElemwiseMultiType operator, with a corresponding kernel developed on aarch64.
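
To make the shape of such a fusion pass concrete, here is a toy sketch over a hypothetical list-of-dicts graph representation. The IR, the mode strings, and `fuse_typecvt_elemwise` are all assumptions for illustration. The real pass works on MegEngine's computing graph and, unlike this sketch, must also check that the TypeCvt output has no other consumers.

```python
def fuse_typecvt_elemwise(ops):
    """Fuse TypeCvt(uint16 -> float32) + Elemwise(MUL/ADD) on a toy op list.

    Each op is a dict with "type", "mode", "inputs" and "output" keys.
    Only directly adjacent pairs are considered (a simplification).
    """
    fused, skip = [], set()
    for i, op in enumerate(ops):
        if i in skip:
            continue
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if (
            op["type"] == "TypeCvt"
            and op["mode"] == "uint16->float32"
            and nxt is not None
            and nxt["type"] == "Elemwise"
            and nxt["mode"] in ("MUL", "ADD")
            and op["output"] in nxt["inputs"]
        ):
            # Feed the original uint16 tensor straight into the fused op.
            inputs = [op["inputs"][0] if x == op["output"] else x
                      for x in nxt["inputs"]]
            fused.append({"type": "ElemwiseMultiType", "mode": nxt["mode"],
                          "inputs": inputs, "output": nxt["output"]})
            skip.add(i + 1)
        else:
            fused.append(op)
    return fused


graph = [
    {"type": "TypeCvt", "mode": "uint16->float32", "inputs": ["x"], "output": "xf"},
    {"type": "Elemwise", "mode": "MUL", "inputs": ["xf", "scale"], "output": "y"},
]
fused_graph = fuse_typecvt_elemwise(graph)
```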

Common components

Add fastpath when low=0 and high=1 for megengine.random.RNG.uniform to improve performance.
Optimize the placement order of matrixmul algorithms in dnn on the x86 platform to reduce the dump time of jit.trace (bs256 VGG16, 4min -> 2min).
Fix the problem that the model runs too slowly on CPU (bs10 VGG16, 10min -> 6s).

MegEngine Lite

Bug fixes

Fix the device ID setting error of tensorbatchcollector in lite.
Add data type information when calling the to_numpy method on an empty tensor.
Fix the problem that some model inferences fail when users customize the output space of the model.
Fix device type configuration for megengine lite. Now only the devices of which the device type is unspecified will be modified.
Add a warning to the megengine lite Python interface when a batch index error occurs in the TensorBatchCollector.
Fix runtime error when record level of megengine lite is 2.

New Features

Add interface for cambricon models in lite.
Add a new interface in megenginelite tensor module named get_data_by_share. A zero-copy numpy object will be returned containing data of a lite tensor object.
Add classification and detection examples in lite.
Add megenginelite Python and C/C++ global graph optimization interfaces.
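
The zero-copy behaviour mentioned for get_data_by_share in the list above can be illustrated with standard-library Python alone. This sketch does not use the megenginelite API. A bytearray stands in for tensor storage, and a memoryview plays the role of the shared object.

```python
buf = bytearray(b"\x01\x02\x03\x04")  # stands in for a tensor's storage
shared = memoryview(buf)               # zero-copy view over the same bytes
copied = bytes(buf)                    # independent copy of the data

buf[0] = 0xFF                          # mutate the underlying storage

# The shared view observes the change; the copy does not.
print(shared[0], copied[0])  # prints: 255 1
```

The practical consequence is the usual one for shared buffers: a zero-copy object avoids the cost of duplicating data, but it stays valid only as long as the underlying storage is alive and is not overwritten.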